Downloading dynamically loaded webpage with python - javascript

I have this website and I want to download the content of the page.
I tried selenium, and button clicking with it, but with no success.
#!/usr/bin/env python
from contextlib import closing
from selenium.webdriver import Firefox
import time
# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    # setting the url
    browser.get("http://bonusbagging.co.uk/oddsmatching.php#")
    # finding and clicking the button
    button = browser.find_element_by_id('select_button')
    button.click()
    page = browser.page_source
    time.sleep(5)
    print(page.encode("utf8"))
This code only downloads the source code, where the data are hidden.
Can someone show me the right way to do that, or tell me how the hidden data can be downloaded?
Thanks in advance!

I always try to avoid selenium like the plague when scraping; it's very slow and is almost never the best way to go about things. You should dig into the source more before scraping; it was clear on this page that the HTML was coming in first and then a separate call was being made to get the table's data. Why not make the same call as the page? It's lightning fast and requires no HTML parsing; it just returns the raw data, which seems to be what you're looking for. The Python requests library is perfect for this. Happy scraping!
import requests
table_data = requests.get('http://bonusbagging.co.uk/odds-server/getdata_slow.php').content
PS: The best way to look for these calls is to open the dev console and check out the network tab, where you can see what calls are being made. Another way is to go to the sources tab, look for some JavaScript, and search for ajax calls (that's where I got the URL I'm calling above; the path was: top/odds-server.com/odds-server/js/table_slow.js). The latter option is sometimes easier, but sometimes it's nearly impossible (if the file is minified/uglified). Do whatever works for you!

Check out the Network tab in Chrome Dev tools. Nab the URL out of there.
What you're looking at is a DataTable. You can use their API to fetch what you need.
Adjust the "start" and/or "length" parameters to fetch the data page-by-page.
It's JSON data, so it'll be super easy to parse.
But be nice and don't hammer this poor guy's server.
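A minimal sketch of the page-by-page fetch described above, assuming the endpoint follows the DataTables 1.10-style server-side protocol (request parameters "draw", "start" and "length", rows returned under a "data" key). Older DataTables versions use different parameter names, so check the real request in the Network tab first. The function names make_fetcher and fetch_all_rows are made up for this example, and the delay between requests is there to avoid hammering the server.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen


def make_fetcher(url):
    """Return a function that GETs `url` with query parameters and parses the JSON reply."""
    def fetch_page(params):
        with urlopen(f"{url}?{urlencode(params)}", timeout=30) as response:
            return json.load(response)
    return fetch_page


def fetch_all_rows(fetch_page, page_size=100, delay=1.0, max_pages=1000):
    """Collect rows page by page until a short (or empty) page signals the end."""
    rows = []
    start = 0
    for draw in range(1, max_pages + 1):
        # "start" is the offset of the first row, "length" the number of rows wanted.
        payload = fetch_page({"draw": draw, "start": start, "length": page_size})
        batch = payload.get("data", [])  # assumption: rows live under "data"
        rows.extend(batch)
        start += len(batch)
        if len(batch) < page_size:
            break
        time.sleep(delay)  # be polite between requests
    return rows
```

It could be used as rows = fetch_all_rows(make_fetcher(url), page_size=50), with url being whatever address the Network tab shows. Whether that endpoint honours these parameters is not confirmed here.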

Related

JavaScript in requests package python

I want to get text from a site using Python.
But the site uses JavaScript, and the requests package receives only the JavaScript code.
Is there a way to get text without using Selenium?
import requests as r
a=r.get('https://aparat.com/').text
If the site loads content using JavaScript, then the JavaScript has to be run in order to get the content. I ran into this issue a while back when I did some web scraping, and ended up using Selenium. Yes, it's slower than BeautifulSoup, but it's the easiest solution.
If you know how the server works you could send a request and it should return with content of some kind (whether that be html, json, etc)
Edit: Load the developer tools, go to network tab and refresh the page. Look for an XHR request and the URL it uses. You may be able to use this data for your needs.
For example I found these URLs:
https://www.aparat.com/api/fa/v1/etc/page/config/mode/full
https://www.aparat.com/api/fa/v1/video/video/list/tagid/1?next=1
If you navigate to these in your browser you will notice JSON content, you might be able to use this. I think some of the text is encoded in Unicode e.g \u062e\u0644\u0627\u0635\u0647 \u0628\u0627\u0632\u06cc -> خلاصه بازی
I don't know the specific Python implementation you might use. Look for libs that support making HTTP requests and receiving data. That way you can avoid Selenium. But you must know the URLs beforehand, as shown above.
For example this is what I would do:
Make an HTTP request to the URL you find in developer tools.
With JSON content, use a JSON parser to get a table/array/dictionary natively. You can then traverse this in the native programming language.
Use a Unicode decoder to get the text in normal text format. There might be a lib to do this, but for example on this website using the "Decode/Unescape Unicode Entities" I was able to get the text.
I hope this helps.
Sample code:
import requests
req = requests.get('https://www.aparat.com/api/fa/v1/video/video/show/videohash/IueKs?pr=1&mf=1&referer=direct')
res = req.json()
#do stuff with res
print(res)
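On the Unicode step: a small sketch suggesting that a separate decoder is probably unnecessary in Python, because the standard json module (and therefore req.json() in requests) turns \uXXXX escapes into real characters while parsing. The "title" key is made up for illustration; the escaped text is the example quoted above.

```python
import json

# Raw JSON text as it would appear in the browser; the raw string keeps
# the \uXXXX escapes as literal backslash sequences.
raw = r'{"title": "\u062e\u0644\u0627\u0635\u0647 \u0628\u0627\u0632\u06cc"}'

data = json.loads(raw)

# The parsed value already holds the decoded Persian text, not the escapes.
title = data["title"]
```

After parsing, title holds ten characters (two words separated by a space) rather than the escaped form.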

Winnovative HTML to PDF Converter - Not saving a dynamic javascript page

I have tried looking for a solution to this problem on the site, but can't appear to find one. I have limited knowledge about this particular subject, so please excuse my ignorance!
Our website converts HTML to PDF using the Winnovative HTML to PDF converter.
The pages that need to be converted are using KnockoutJS and therefore the HTML code is not in the page source when the page is originally loaded.
I have tried setting a 30 second page delay, but it seems like the converter won't even save our home page, e.g. www.zapkam.com, let alone the pages that I actually need to save, e.g. http://www.zapkam.com/print.htm#/Orders/ZK1019467/Order/
This had previously been working fine on version 11.6.0.0 on a Windows 2008 Server, but since transferring to version 12.5.0.0 on a Windows 2012 Server, it is no longer working.
The fact that it was working before seems to point towards it potentially being a permissions issue as the server is not configured, but I would be very grateful for any insight!!
This can be done using JavaScript with Canvas.
I have written the code in your Print.html page.
After the HTML has rendered successfully, we need to call the "print PDF" button found in my demo application.
Look at my demo Index page; it creates the PDF and writes it to the client browser.
Please check the attached application:
www.maplayout.com/zampak.zip
Thanks,
Abhishek

Python get URL contents when page requires JavaScript enabled

I am looking to get the contents of a text file hosted on my website using Python. The server requires JavaScript to be enabled on your browser. Therefore when I run:
import urllib2
target_url = "http://09hannd.me/ai/request.txt"
data = urllib2.urlopen(target_url)
I receive an HTML page telling me to enable JavaScript.
I was wondering if there was a way of faking having JS enabled or something.
Thanks
Selenium is the way to go here, but there is another "hacky" option.
Based on this answer: https://stackoverflow.com/a/26393257/2517622
import requests
url = 'http://09hannd.me/ai/request.txt'
response = requests.get(url, cookies={'__test': '2501c0bc9fd535a3dc831e57dc8b1eb0'})
print(response.content) # Output: find me a cafe nearby
I would probably suggest tools like this. https://github.com/niklasb/dryscrape
Additionally you can see more info here: Using python with selenium to scrape dynamic web pages

python: how to save dynamically rendered html web page code

I have a setup where a web page in a local server (localhost:8080) is changed dynamically by sending sockets that load some scripts (d3 code mainly).
In chrome I can inspect the "rendered html status" of the page, i.e., the resulting html code of the d3/javascript loaded codes. Now, I need to save that "full html snapshot" of the rendered web-page to be able to see it later, in a "static" way.
I have tried many solutions in python, which work well to load a web and save its "on-load" d3/javascript processed content, but DO NOT get info about the code generated "after" the load.
I could also use javascript to make this if no python solution is found.
Remember that I need to retrieve the full html rendered code that has been "dynamically" modified in time, in a chosen moment of time.
Here is a list of questions found on Stack Overflow that are related but do not answer this question.
Not answered:
How to save dynamically changed HTML?
Answered but not for dynamically changed html:
Using PyQt4 to return Javascript generated HTML
Not Answered:
How to save dynamically added data to update the page (using jQuery)
Not dynamic:
Python to Save Web Pages
The question could be solved using selenium-python (thanks to @Juca's suggestion to use selenium).
Once installed (pip install selenium), this code does the trick:
from selenium import webdriver
# Start the browser. It opens the URL, and we can then
# access all of its content and perform actions on it.
browser = webdriver.Firefox()
url = 'http://localhost:8080/test.html'
# The page test.html constantly changes its content by receiving sockets, etc.,
# so we need to save its "status" at the moment we choose, for later retrieval.
browser.get(url)
# Wait until we want to save the content (this could be a button UI action, etc.):
input("Press Enter to print web page")
# Save the rendered HTML content at that moment:
html_source = browser.page_source
# Display to check:
print(html_source)
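The code above only prints the snapshot. To keep it for later viewing, as the question asks, the page source could be written to a timestamped file. This is a rough sketch, not part of the original answer: save_snapshot and the "snapshots" directory are hypothetical names, and the only thing assumed about the browser object is that it exposes a page_source attribute, as the Selenium driver does.

```python
from datetime import datetime
from pathlib import Path


def save_snapshot(browser, directory="snapshots"):
    """Write the browser's current rendered HTML to a timestamped file and return its path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Microseconds in the name keep snapshots taken in quick succession apart.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    path = out_dir / f"page-{stamp}.html"
    path.write_text(browser.page_source, encoding="utf-8")
    return path
```

Calling save_snapshot(browser) at each chosen moment would leave one static .html file per snapshot that can be opened in any browser afterwards.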

How to get Dynamic HTML code by PHP or JS

I want to get the contents of a website, but when I use the file_get_contents() function, some of the HTML is missing. I checked the site's code and found that some parts are generated by Ajax, and I don't know how to get them. Does anyone have any suggestions?
Here is an example:
Site: http://www.drbattery.com/category/notebook+battery/acer/aspire+series.aspx?p=3
Request: I want to get the laptop models listed on this page, such as "Aspire 1690". I need all of those models.
Mhm.
In JS you can access the HTML content in a browser by
document.getElementsByTagName('body')[0].innerHTML
Doing this server-side, you would probably need a headless browser for this.
The tricky part would be detecting when the content has finished loading and everything is in place. (You won't be able to track AJAX requests with "window.onload".)
Doing it manually, you could add a bookmarklet to your browser, like
javascript:alert(document.getElementsByTagName('body')[0].innerHTML)
You could then select the alert's content by keyboard shortcut (CTRL + A or Command + A), copy it, and hit return (as the dialog's close-button will probably be out of sight).
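For the server-side route with a headless browser, one rough heuristic for "finished loading" is to poll the rendered HTML until it stops changing. The sketch below is in Python rather than PHP or JS and is only an illustration: wait_until_stable is a made-up helper, and get_html stands for any callable that returns the current rendered HTML (for example, a function reading a headless browser's page source). A page that keeps updating forever would never settle, hence the timeout.

```python
import time


def wait_until_stable(get_html, interval=0.5, stable_polls=3, timeout=30.0):
    """Poll get_html() until it returns the same HTML `stable_polls` times in a row."""
    deadline = time.monotonic() + timeout
    last = get_html()
    unchanged = 0
    while unchanged < stable_polls:
        if time.monotonic() > deadline:
            raise TimeoutError("page content did not settle in time")
        time.sleep(interval)
        current = get_html()
        if current == last:
            unchanged += 1
        else:
            # Content changed: remember it and restart the count.
            last, unchanged = current, 0
    return last
```

The interval and the number of stable polls are guesses that would need tuning per site; slow AJAX calls could still arrive after the content looks settled.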
