JavaScript in requests package (Python)

I want to get text from a site using Python.
But the site uses JavaScript, and with the requests package I receive only the JavaScript code rather than the rendered text.
Is there a way to get the text without using Selenium?
import requests as r
a = r.get('https://aparat.com/').text

If the site loads content using JavaScript, then the JavaScript has to be run in order to get the content. I ran into this issue a while back when I did some web scraping, and ended up using Selenium. Yes, it's slower than requests with BeautifulSoup, but it's the easiest solution.
If you know how the server works, you could send a request directly and it should return content of some kind (whether that be HTML, JSON, etc.).
Edit: Open the developer tools, go to the Network tab, and refresh the page. Look for an XHR request and the URL it uses. You may be able to use this data for your needs.
For example I found these URLs:
https://www.aparat.com/api/fa/v1/etc/page/config/mode/full
https://www.aparat.com/api/fa/v1/video/video/list/tagid/1?next=1
If you navigate to these in your browser you will notice JSON content; you might be able to use this. Some of the text is escaped as Unicode sequences, e.g. \u062e\u0644\u0627\u0635\u0647 \u0628\u0627\u0632\u06cc -> خلاصه بازی.
I don't know the specific Python implementation you might use. Look for libraries that support making HTTP requests and receiving data; that way you can avoid Selenium. But you must know the URLs beforehand, like the ones shown above.
For example, this is what I would do:
1. Make an HTTP request to the URL you find in developer tools.
2. With JSON content, use a JSON parser to get a table/array/dictionary natively. You can then traverse this in the native programming language.
3. Use a Unicode decoder to get the text in normal text format. There might be a library for this, but, for example, using a "Decode/Unescape Unicode Entities" tool on a website I was able to get the text. (In Python this step usually comes for free; see the note after the sample code below.)
I hope this helps.
Sample code:
import requests

# request the JSON endpoint found in the Network tab
req = requests.get('https://www.aparat.com/api/fa/v1/video/video/show/videohash/IueKs?pr=1&mf=1&referer=direct')
res = req.json()
# do stuff with res
print(res)
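A note on step 3 above: Python's JSON parser decodes \uXXXX escape sequences on its own, so a separate Unicode step is normally unnecessary. A minimal sketch (the JSON string here is a made-up stand-in, not the real API response):

import json

# \uXXXX escapes are decoded automatically by the JSON parser
raw = '{"title": "\\u062e\\u0644\\u0627\\u0635\\u0647 \\u0628\\u0627\\u0632\\u06cc"}'
parsed = json.loads(raw)
print(parsed['title'])  # prints: خلاصه بازی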

Related

Crawl some JavaScript code in a web page

The page I am trying to crawl includes JavaScript code (possibly using AJAX?). When I crawl the page based on the HTML code, I can't get the JavaScript part. How can I do that?
I think I need a Python library that can crawl the JavaScript-generated content along with the HTML.
Please give me some advice.
Below is the page link:
view-source:http://www.bobaedream.co.kr/mycar/popup/mycarChart_4.php?zone=C&cno=652691&tbl=cyber
I recommend two ways.
First, request the AJAX URL directly and parse the HTML it returns.
import requests
from bs4 import BeautifulSoup  # one option for the parsing step

url = "http://www.bobaedream.co.kr/mycar/proc/mycar_regist_option.php"
data = {'param': 'ALL'}
response = requests.post(url, data=data)

# parse the returned HTML
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.get_text())
Second, use a web driver (geckodriver, PhantomJS, and so on) through the selenium library.
That library drives a real browser, runs the JavaScript, and then renders the DOM the JavaScript builds.
See the public Selenium documentation.
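A minimal sketch of that second route (assuming Firefox with geckodriver on your PATH; the fixed sleep is a crude stand-in for a proper wait):

import time
from selenium import webdriver

driver = webdriver.Firefox()  # requires geckodriver on PATH
driver.get('http://www.bobaedream.co.kr/mycar/popup/mycarChart_4.php?zone=C&cno=652691&tbl=cyber')
time.sleep(3)  # crude wait for the JavaScript to finish running
html = driver.page_source  # the DOM after JavaScript has run
driver.quit()
print(html)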

Downloading dynamically loaded webpage with python

I have this website and I want to download the content of the page.
I tried selenium, and button clicking with it, but with no success.
#!/usr/bin/env python
from contextlib import closing
from selenium.webdriver import Firefox
import time

# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    # setting the url
    browser.get("http://bonusbagging.co.uk/oddsmatching.php#")
    # finding and clicking the button
    button = browser.find_element_by_id('select_button')
    button.click()
    page = browser.page_source
    time.sleep(5)
print(page.encode("utf8"))
This code only downloads the source code, where the data is still hidden.
Can someone show me the right way to do that? Or tell me how the hidden data can be downloaded?
Thanks in advance!
I always try to avoid Selenium like the plague when scraping; it's very slow and is almost never the best way to go about things. You should dig into the source more before scraping. It was clear on this page that the HTML was coming in first and then a separate call was being made to get the table's data. Why not make the same call as the page? It's lightning fast and requires no HTML parsing; it just returns raw data, which seems to be what you're looking for. The Python requests library is perfect for this. Happy scraping!
import requests

# call the same endpoint the page's JavaScript uses to fill the table
table_data = requests.get('http://bonusbagging.co.uk/odds-server/getdata_slow.php').content
PS: The best way to look for these calls is to open the dev console and check out the Network tab. You can see what calls are being made there. Another way is to go to the Sources tab, look for some JavaScript, and search for AJAX calls (that's where I got the URL I'm calling above; the path was: top/odds-server.com/odds-server/js/table_slow.js). The latter option is sometimes easier, sometimes nearly impossible (if the file is minified/uglified). Do whatever works for you!
Check out the Network tab in Chrome Dev tools. Nab the URL out of there.
What you're looking at is a DataTable. You can use their API to fetch what you need.
Adjust the "start" and/or "length" parameters to fetch the data page-by-page.
It's JSON data, so it'll be super easy to parse.
But be nice and don't hammer this poor guy's server.
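A hedged sketch of that paging idea; the start/length names come from the answer above, but the exact query string and the shape of the JSON (a DataTables-style "data" array is assumed here) should be confirmed in the Network tab:

import time
import requests

url = 'http://bonusbagging.co.uk/odds-server/getdata_slow.php'
start, length = 0, 100
while True:
    resp = requests.get(url, params={'start': start, 'length': length})
    rows = resp.json().get('data', [])  # assumed DataTables-style payload
    if not rows:
        break
    for row in rows:
        print(row)
    start += length
    time.sleep(1)  # be nice between pages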

Python get URL contents when page requires JavaScript enabled

I am looking to get the contents of a text file hosted on my website using Python. The server requires JavaScript to be enabled in your browser. Therefore when I run:
import urllib2
target_url = "http://09hannd.me/ai/request.txt"
data = urllib2.urlopen(target_url)
I receive an HTML page saying to enable JavaScript.
I was wondering if there was a way of faking having JS enabled or something.
Thanks
Selenium is the way to go here, but there is another "hacky" option.
Based on this answer: https://stackoverflow.com/a/26393257/2517622
import requests

url = 'http://09hannd.me/ai/request.txt'
# send the __test cookie that the site's JavaScript check would normally set,
# so the server believes JavaScript ran
response = requests.get(url, cookies={'__test': '2501c0bc9fd535a3dc831e57dc8b1eb0'})
print(response.content)  # Output: find me a cafe nearby
I would probably suggest tools like this: https://github.com/niklasb/dryscrape
Additionally you can see more info here: Using python with selenium to scrape dynamic web pages

Python: Issue getting updated HTML created via JavaScript calls in the browser

I am using Python to pull the HTML of a website to get satellite locations. Of course, since I am not actually accessing the site via a browser, I am not retrieving any HTML that would be populated by JavaScript calls.
import urllib.request
page = urllib.request.urlopen('http://n2yo.com/?s=20217')
file = open("textFile", "wb")
satelliteText = page.read()
file.write(satelliteText)
file.close()
I've explored libraries like Windmill that literally run a browser so that you can get the JavaScript-created HTML, but I am using a Raspberry Pi and I'd rather not install an additional browser.
Is there any way that I can make the AJAX GET calls myself that the website is making and retrieve just the data I need?
Looking at this source here: http://www.n2yo.com/js/passes.js it appears that it is calling http://www.n2yo.com/inc/all.php to get the data. By reading through passes.js carefully, you should be able to figure out how to parse the response.
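A hedged first step (the endpoint comes from passes.js as noted above; no response format is assumed, the raw text is just dumped for inspection):

import requests

# fetch the same endpoint passes.js calls, then reverse-engineer
# the format from the raw text before writing a real parser
resp = requests.get('http://www.n2yo.com/inc/all.php')
print(resp.text)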

Generate some XML in JavaScript, prompt user to save it

I'd like to make an XML document in JavaScript and then have a save dialog appear.
1. It's OK if they have to click before the save can occur.
2. It's *not* OK if I *have* to use IE to achieve this (I don't even need to support it at all). However, Windows is a required platform (so Firefox or Chrome are the preferred browsers if I can only do this in one browser).
3. It's *not* OK if I need a web server. But conversely, I don't want to require the JavaScript to be run on a local file only, i.e. with elevated privileges, if possible. That is, I'd like it to run locally or on a *static* host. But just locally is OK.
4. It's OK to have to bend over backwards to do this. The file won't be very big, but internet access might either be there, be spotty, or just not be a possibility at all -- see (3).
So far the only ideas I have seen are to save the XML to an iframe and save that document -- but it seems you can only do this in IE? Also, I could construct a data URI and place that in a link. My fear here is that it will just open the XML file in the window rather than prompt the user to save it.
I know that if I require the JavaScript to be local, I can raise privileges and just directly save the file (or hopefully cause a save dialog box to appear). However, I'd much prefer a solution where I do not require raised privileges (even a Firefox 3.6 only solution).
I apologize if this offends anyone's sensibilities (for example, not supporting every browser). I basically want to write an offline application, and JavaScript/HTML/CSS seem to be the best candidates considering the complexity of the requirements and the time available. However, I have this single requirement of being able to save data that must be overcome before I can choose this line of development.
How about the Downloadify script?
It is based on Flash and jQuery, and can prompt a dialog box for saving a file to your computer.
Downloadify.create('downloadify', {
  filename: function(){
    return document.getElementById('filename').value;
  },
  data: function(){
    return document.getElementById('data').value;
  },
  onComplete: function(){
    alert('Your File Has Been Saved!');
  },
  onCancel: function(){
    alert('You have cancelled the saving of this file.');
  },
  onError: function(){
    alert('You must put something in the File Contents or there will be nothing to save!');
  },
  swf: 'media/downloadify.swf',
  downloadImage: 'images/download.png',
  width: 100,
  height: 30,
  transparent: true,
  append: false
});
Using a base64-encoded data URI, this is possible with only HTML & JS. What you can do is encode the data that you want to save (in your case, a string of XML data) into base64, using a JS library like jquery-base64 by carlo. Then put the encoded string into a link and add your link to the DOM.
Example using the library I mentioned (as well as jquery), with a placeholder XML string:
<html>
<head>
  <title>Example</title>
</head>
<body>
  <script>
    //include jquery and jquery-base64 here (or whatever library you want to use)
    document.write('<a href="data:application/octet-stream;base64,' + $.base64.encode('<hello>world</hello>') + '">click to make save dialog</a>');
  </script>
</body>
</html>
...and remember to make the content-type something like application/octet-stream so the browser doesn't try to open it.
Warning: some older IE versions don't support base64, but you said that didn't matter, so this should work fine for you.
Without any more insight into your specific requirements, I would not recommend a pure Javascript/HTML solution. From a user perspective you would probably get the best results writing a native application. However if it will be faster to use Javascript/HTML, I recommend using a local application hosting a lightweight web server to serve up your content. That way you can cleanly handle the file saving server-side while focusing the bulk of your effort on the front-end application.
You can code up a web server in - for example - Python or Ruby using very few lines of code and without 3rd party libraries. For example, see:
Making a simple web server in python
WEBrick - Writing a custom servlet
python-trick-really-little-http-server - This one is really simple, and will easily let you serve up all of your HTML/CSS/JS files:
"""
Serves files out of its current directory.
Doesn't handle POST requests.
"""
import SocketServer
import SimpleHTTPServer
PORT = 8080
def move():
""" sample function to be called via a URL"""
return 'hi'
class CustomHandler(SimpleHTTPServer.SimpleHTTPRequestHandler):
def do_GET(self):
#Sample values in self for URL: http://localhost:8080/jsxmlrpc-0.3/
#self.path '/jsxmlrpc-0.3/'
#self.raw_requestline 'GET /jsxmlrpc-0.3/ HTTP/1.1rn'
#self.client_address ('127.0.0.1', 3727)
if self.path=='/move':
#This URL will trigger our sample function and send what it returns back to the browser
self.send_response(200)
self.send_header('Content-type','text/html')
self.end_headers()
self.wfile.write(move()) #call sample function here
return
else:
#serve files, and directory listings by following self.path from
#current working directory
SimpleHTTPServer.SimpleHTTPRequestHandler.do_GET(self)
httpd = SocketServer.ThreadingTCPServer(('localhost', PORT),CustomHandler)
print "serving at port", PORT
httpd.serve_forever()
Finally - Depending on who will be using your application, you also have the option of compiling a Python program into a Frozen Binary so the end user does not have to have Python installed on their machine.
Javascript is not allowed to write to a local machine. Your question is similar to this one.
I suggest creating a simple desktop app.
Is a localhost PHP server OK? The web traditionally can't save to the hard drive because of security concerns. PHP can push files, though it requires a server.
Print-to-PDF plugins are available for all browsers. Install once, print to PDF forever. Then you can use JavaScript or Flash to call a Print function.
Also, if you are developing for an environment where internet access is spotty, consider using VB.NET or some other desktop language.
EDIT:
You can use the browser's Print function.
Are you looking for something like this?
If PHP is OK, it would be much easier.
With IE you could use document.execCommand, but I note that IE is not an option.
Here's something that looks like it might help, although it will not prompt with a Save As dialog: https://developer.mozilla.org/en/Code_snippets/File_I%2F%2FOL
One simple but odd way to do this that doesn't require any Flash is to create an <a/> with a data URI for its href. This even has pretty good cross-browser support, although for IE it must be at least version 8 and the URI must be < 32k. It looks like someone else on SO has more to say on the topic.
Why not use a hybrid approach: Flash on the client side plus some solution on the server side? Most people have Flash, so you can default to the client side to conserve resources on the server.
