I am having trouble finding the answer I need for my situation and hoping that someone may be able to help.
Full disclosure, I am pretty new to Python.
What I am trying to achieve is:
Perform an HTTP POST to download a PDF (with Requests, Python 3).
Take the stream and hand it to Google Drive, which will convert it back to a PDF and save the file in the Drive folder.
I am OK with the URL file request, but I receive a UnicodeDecodeError when the contents are read. I have tried several files, with the same result.
The limitations I have are that I can't install packages (I am limited to the couple that are provided) and cannot write the output to a file (read-only filesystem), because I am using Zapier.
This is the (very) basic code I am using:
import requests
url = 'http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf'
resp = requests.get(url, stream=True)
resp_bin = resp.content  # raw bytes of the PDF, not text
return {'output': resp_bin}
Note that the file is just a random file I found for testing purposes.
The error received is: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
If anyone has a way of making this work without additional packages, that would be great.
Note: I do also have the option of urllib here and alternatively JavaScript (node.js v4.3.2) with fetch.
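One possible reading of the error, offered as an assumption rather than a confirmed diagnosis: resp.content is raw bytes, and whatever serializes the returned dictionary tries to decode those bytes as UTF-8 text. The sketch below uses only the standard library. The sample bytes are a made-up stand-in for the start of a PDF, and to_text_safe / from_text_safe are hypothetical helper names. It reproduces the same kind of decode failure and then shows base64 as one way to carry the bytes as plain text.

```python
import base64

# Made-up stand-in for the first bytes of a PDF: a header line followed by
# a comment line of high-bit bytes, which is not valid UTF-8.
sample_pdf_bytes = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def to_text_safe(raw: bytes) -> str:
    """Encode arbitrary bytes as ASCII text (base64) so they can travel as a string."""
    return base64.b64encode(raw).decode("ascii")


def from_text_safe(text: str) -> bytes:
    """Reverse of to_text_safe: recover the original bytes."""
    return base64.b64decode(text)


try:
    sample_pdf_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    # Binary PDF content is not text, so decoding it as UTF-8 fails.
    decode_failed = True

encoded = to_text_safe(sample_pdf_bytes)   # plain ASCII string
restored = from_text_safe(encoded)         # identical to the original bytes
```

In the Zapier step this would mean returning {'output': to_text_safe(resp.content)}. Whether the receiving Google Drive step accepts base64 text is something to verify separately.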
I want to get text from a site using Python.
But the site builds its content with JavaScript, so the requests package receives only JavaScript code.
Is there a way to get the text without using Selenium?
import requests as r
a=r.get('https://aparat.com/').text
If the site loads its content using JavaScript, then the JavaScript has to be run in order to get the content. I ran into this issue a while back when I did some web scraping, and ended up using Selenium. Yes, it's slower than BeautifulSoup, but it's the easiest solution.
If you know how the server works, you could send a request to it directly and it should return content of some kind (HTML, JSON, etc.).
Edit: Load the developer tools, go to network tab and refresh the page. Look for an XHR request and the URL it uses. You may be able to use this data for your needs.
For example I found these URLs:
https://www.aparat.com/api/fa/v1/etc/page/config/mode/full
https://www.aparat.com/api/fa/v1/video/video/list/tagid/1?next=1
If you navigate to these in your browser you will see JSON content, which you might be able to use. I think some of the text is encoded as Unicode escapes, e.g. \u062e\u0644\u0627\u0635\u0647 \u0628\u0627\u0632\u06cc -> خلاصه بازی
I don't know the specific Python implementation you might use. Look for libraries that support making HTTP requests and receiving data. That way you can avoid Selenium. But you must know the URLs beforehand, as shown above.
For example this is what I would do:
Make an HTTP request to the URL you find in the developer tools.
With JSON content, use a JSON parser to get a native table/array/dictionary. You can then traverse this in your programming language.
Use a Unicode decoder to get the text in a readable form. There might be a library for this; for example, on this website, using the "Decode/Unescape Unicode Entities" tool, I was able to get the text.
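The parsing and decoding steps above can be sketched with Python's standard json module, which already turns \uXXXX escapes into real characters while parsing, so a separate Unicode decoder may not be needed. The payload below is a made-up example, not an actual response from the site, and the "title" key is hypothetical.

```python
import json

# Made-up JSON text; the escaped value is the example string quoted above.
# The raw string keeps the backslashes so the JSON parser sees the escapes.
raw_json = r'{"title": "\u062e\u0644\u0627\u0635\u0647 \u0628\u0627\u0632\u06cc"}'

data = json.loads(raw_json)   # native dict
title = data["title"]         # escapes are already decoded into characters
```

The same applies to the result of requests' .json() call, which parses the response body as JSON.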
I hope this helps.
Sample code:
import requests
req = requests.get('https://www.aparat.com/api/fa/v1/video/video/show/videohash/IueKs?pr=1&mf=1&referer=direct')
res = req.json()
# do stuff with res
print(res)
I'm doing an API integration with SuiteScript 2.0. The API returns base64-encoded data. I need to decode the base64, save the result as a .zip, and unzip it to reach the XML data I want.
The relevant data can be run in Notepad++ with Plugins > MIME Tools > Decode Base64, saved as zip and opened with unzip.
The script I'm working with is a scheduled script.
I tried the two decoding methods mentioned in SuiteAnswers.
1 - From base64 to UTF_8 with the N/encode module (the returned result is completely wrong for this problem)
2 - The solution in the link:
https://netsuite.custhelp.com/app/answers/detail/a_id/41271/kw/base64%20decode
(In this solution, when you save the returned data as zip, it gives an "Unexpected end of the archive" error when opening the zip.)
ArrayBuffer() and atob() are not available in SuiteScript.
The thing I know will work is to proxy the call through a Lambda on some external system.
However, if your data is already in base64, you might try just creating a File Cabinet file and giving it the base64-encoded value as its content. NetSuite already handles base64 for files, so you might be overworking the issue. It doesn't sound like you are actually processing the XML if your end goal is to save it as a zip.
If this doesn't help see my comments regarding some clarifications you could add to your question.
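To illustrate why converting the base64 to UTF-8 text can give wrong results for a zip: a zip archive is binary, so the decoded value has to stay as bytes. The sketch below is in Python rather than SuiteScript and only shows the decode-then-unzip flow from the question. unzip_base64 is a hypothetical helper, and the archive is built in memory as stand-in data.

```python
import base64
import io
import zipfile


def unzip_base64(b64_text: str) -> dict:
    """Decode base64 to bytes, open them as a zip archive, return {name: content}."""
    raw = base64.b64decode(b64_text)  # keep as bytes, never decode as text
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


# Stand-in for the API response: a zip holding one XML file, base64-encoded.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.xml", "<root><item>1</item></root>")
api_payload = base64.b64encode(buffer.getvalue()).decode("ascii")

files = unzip_base64(api_payload)
```

If the bytes are truncated or re-encoded as text anywhere along the way, opening the archive fails, which would be consistent with the "Unexpected end of the archive" error mentioned in the question.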
require(["N/encode"], function(encode){
    // Convert a base64 string to UTF-8 text
    var txt = encode.convert({
        string: "your Base64 string",
        inputEncoding: encode.Encoding.BASE_64,
        outputEncoding: encode.Encoding.UTF_8
    });
});
SuiteScript example
All types of encode
I’m a bit new to JavaScript/Node.js and its packages. Is it possible to download a file using my local browser or network? Whenever I look up scraping HTML files or downloading them, it is always done through a separate package and their server making a request to a given URL. How do I make my own computer download an HTML file, as if I did right click > Save as on a web page in Google Chrome, without running into server/security issues and errors with JavaScript?
Fetching a document over HTTP(S) in Node is definitely possible, although not as simple as in some other languages. Here's the basic structure:
const https = require(`https`); // use http if it's an http URL
// URLString is a placeholder for the address you want to fetch
https.get(URLString, res => {
  const buffers = [];
  res.on(`data`, data => buffers.push(data));
  res.on(`end`, () => {
    const data = Buffer.concat(buffers);
    /*
      from here you can do what you want with the data. You can write it to a file
      with fs, you can console.log it using data.toString(), etc.
    */
  });
}).on(`error`, err => console.error(err)); // network errors are reported here
Edit: I think I missed the main question you had, give me a sec to add that.
Edit 2: If you're comfortable with the above, the way to access a website the same way your browser does is to open the developer tools (F12 in Chrome), go to the Network tab, and find the request that the browser made. Then, using http(s).get(url, options, callback), set the same headers in the options that you see in your browser. Most of the time you won't need all of them; the authentication/session cookie is usually enough.
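The same idea as a Python sketch, using the standard library's urllib: copy the headers you see in the developer tools onto the request. The URL, User-Agent and cookie values below are placeholders, and build_browser_like_request and download are hypothetical helper names.

```python
import urllib.request


def build_browser_like_request(url: str, cookie: str) -> urllib.request.Request:
    """Build a GET request that carries headers copied from the browser."""
    headers = {
        "User-Agent": "Mozilla/5.0 (placeholder copied from the browser)",
        "Accept": "text/html",
        "Cookie": cookie,  # usually the header that matters most
    }
    return urllib.request.Request(url, headers=headers)


def download(request: urllib.request.Request) -> bytes:
    """Send the request and return the response body (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return response.read()


# Building the request does not touch the network; download() would.
request = build_browser_like_request("https://example.com/page", "sessionid=PLACEHOLDER")
```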
I have tried numerous different code samples that I found here, along with the code below, which I got from learn.microsoft.com. They all give me the same error: "ActiveXObject is not defined". Can someone please tell me how to fix this issue? How do I define the object, or is this yet another problem related to my host, GoDaddy?
This is for a Plesk/Windows server hosted by GoDaddy.
This is a link to just one of the code samples from Stack Overflow that I have tried: Use JavaScript to write to text file?
Microsoft Code
<script>
var fso, tf;
fso = new ActiveXObject("Scripting.FileSystemObject");
tf = fso.CreateTextFile("G:\\mysite\\file.txt", true); // backslashes must be escaped inside a JavaScript string
// Write a line with a newline character.
tf.WriteLine("Testing 1, 2, 3.") ;
// Write three newline characters to the file.
tf.WriteBlankLines(3) ;
// Write a line.
tf.Write ("This is a test.");
tf.Close();
</script>
You can't write to a file on the server with client-side JavaScript (if clients could write arbitrary files on servers then Google's homepage would be vandalised every second).
The code you've found could write to the hard disk of the computer the "page" was loaded on, but only if the "page" was an HTA application and not a web page.
The standard way to send data to an HTTP server from JavaScript is to make an HTTP request. You can do this with an Ajax API like fetch.
You then need a server-side program (written in the language of your choice) that will process the request and write to the file (although due to race conditions, you are normally better off using a database than a flat file).
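As a rough illustration of such a server-side program, here is a minimal Python sketch using only the standard library's wsgiref. It accepts a POST body and appends it to a file. The names make_app and serve, the port, and the file path are all assumptions made for this example, and the lock only guards writes inside a single process, which is part of why a database is usually the safer choice.

```python
import io
import os
import tempfile
import threading
from wsgiref.simple_server import make_server

_write_lock = threading.Lock()  # serialises writes within this one process


def make_app(path):
    """Return a WSGI app that appends each POST body to the file at `path`."""
    def app(environ, start_response):
        if environ["REQUEST_METHOD"] != "POST":
            start_response("405 Method Not Allowed", [("Allow", "POST")])
            return [b"POST only\n"]
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length).decode("utf-8", errors="replace")
        with _write_lock, open(path, "a", encoding="utf-8") as handle:
            handle.write(body + "\n")
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"saved\n"]
    return app


def serve(path, port=8000):
    """Run the app on localhost until interrupted (not called in this sketch)."""
    with make_server("127.0.0.1", port, make_app(path)) as server:
        server.serve_forever()


# Exercise the app without a network socket, using a hand-built WSGI environ.
log_path = os.path.join(tempfile.mkdtemp(), "submissions.txt")
statuses = []
environ = {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": "5",
           "wsgi.input": io.BytesIO(b"hello")}
reply = make_app(log_path)(environ, lambda status, headers: statuses.append(status))
```

The page's JavaScript would then send its data with a POST request (for example via fetch) to wherever this program is hosted.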
On my webpage you can read books in PDF format. The problem is that some books have around 1000 pages and the PDF is really big, so even if the user reads just 10 pages, the full PDF is downloaded from the server. This is awful for my hosting account because I have a transfer limit.
What could I do to display the PDF without loading the full file?
I use pdf.js.
Greetings.
ORIGINAL POST:
PDF files are designed in a way that forces the client side to download the whole file just to get the first page.
The last line of the PDF file tells the PDF reader where the root dictionary for the PDF file is located (the root dictionary tells the reader about the page catalog - order of pages - and other data used by the reader).
So, as you can see, the limitations of the PDF design require that you use a server side solution that will create a new PDF with only the page(s) you want to display.
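To make the trailer lookup described above concrete, here is a rough Python sketch that reads the startxref pointer from the end of a PDF's bytes. In a classic (non-linearized) PDF the file ends with the keyword startxref, a byte offset, and %%EOF; the offset points at the cross-reference data, from which a reader reaches the root dictionary. find_startxref is a hypothetical helper, and the sample bytes are a made-up fragment, not a full valid PDF.

```python
import re


def find_startxref(pdf_bytes: bytes) -> int:
    """Return the byte offset stored after the last 'startxref' keyword."""
    tail = pdf_bytes[-1024:]  # the pointer lives at the very end of the file
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref marker found near the end of the data")
    return int(matches[-1])


# Made-up file ending, for illustration only.
sample_tail = (
    b"...page objects...\n"
    b"trailer\n<< /Size 6 /Root 1 0 R >>\n"
    b"startxref\n1234\n%%EOF\n"
)
offset = find_startxref(sample_tail)
```

Because this pointer sits at the end of the file, a reader that fetches the file front to back cannot locate anything until the download has finished.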
The best solution (in my opinion) is to create a "reader" page (as opposed to a download page) that requests a specific page from the server and allows the user to advance page by page (using AJAX).
The server will need to create a new PDF (file or stream) that contains only the requested page and return it to the reader.
If you are running your server with Ruby (Ruby on Rails), you can use the combine_pdf gem to load the PDF and send just one page...
You can define a controller method that will look something like this:
def get_page
  # read the book
  book = CombinePDF.parse IO.read("book.pdf")
  # create empty PDF
  pdf_with_one_page = CombinePDF.new
  # add the page you want
  # notice that the pages array is indexed from 0,
  # so an adjustment to user input is needed...
  # (request parameters arrive as strings, hence the to_i)
  pdf_with_one_page << book.pages[ params[:page_number].to_i - 1 ]
  # no need to create a file, just stream the data to the client.
  send_data pdf_with_one_page.to_pdf, type: 'application/pdf', disposition: 'inline'
end
If you are running PHP or Node.js, you will need to find a different server-side solution.
Good luck!
EDIT:
I was looking over the PDF.js project (which looks very nice) and noticed the limited support statement for Safari:
"Safari (desktop and mobile) lacks a number of features or has defects, e.g. in typed arrays or HTTP range requests"...
I understand from this statement that on some browsers you can manage a client-side solution based on the HTTP Byte Serving protocol.
This will NOT work with all browsers, but it will keep you from having to use a server-side solution.
I couldn't find the documentation for the PDF.js feature (maybe it defaults to ranges and you just need to set the range...?), but I would go with a server-side solution that I know to work on all browsers.
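For reference, an HTTP range request is an ordinary request with a Range header. The Python sketch below only builds such a request and parses a Content-Range reply header. The URL is a placeholder and both helper names are hypothetical. Whether a given server honours ranges has to be checked; a server that supports them answers with 206 Partial Content.

```python
import re
import urllib.request


def build_range_request(url, first_byte, last_byte):
    """Build a GET request asking only for bytes first_byte..last_byte (inclusive)."""
    return urllib.request.Request(
        url, headers={"Range": f"bytes={first_byte}-{last_byte}"}
    )


def parse_content_range(value):
    """Parse 'bytes first-last/total' into (first, last, total or None)."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value.strip())
    if match is None:
        raise ValueError(f"unrecognised Content-Range value: {value!r}")
    first, last, total = match.groups()
    return int(first), int(last), None if total == "*" else int(total)


# Placeholder URL and a made-up reply header, for illustration only.
range_request = build_range_request("https://example.com/book.pdf", 0, 1023)
parsed = parse_content_range("bytes 0-1023/146515")
```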
EDIT 2:
Ignore Edit 1. As iPDFdev pointed out (thank you, iPDFdev), this requires a special layout of the PDF file and will not resolve the issue of the browser downloading the whole file.
You can take the following approach, driven by a configuration setting:
Add a configuration flag that says whether the entire PDF should be made available or not.
While rendering your response, read that flag. If it is set, generate a minimal PDF of 20 pages with a hyperlink to download the entire PDF; otherwise, generate the minimal 20-page PDF only.
When you prepare the initial response of your web page, add the PDF that contains only, say, 20 pages (the minimal PDF) and process the response.