How does one parse HTML documents that make heavy use of JavaScript? I know there are a few Python libraries that can parse static XML/HTML files. What I'm looking for is a program or library (or even a Firefox plugin) that reads HTML plus JavaScript, executes the JavaScript, and outputs the resulting HTML without the JavaScript, so the output would look identical to what a browser displays.
As a simple example
link
should be replaced by the appropriate value the javascript function returns, e.g.
link
A more complex example would be a saved facebook html page which is littered with loads of javascript code.
Probably related to
How to "execute" HTML+Javascript page with Node.js
but do I really need Node.js and JSDOM? Also slightly related is
Python library for rendering HTML and javascript
but I'm not interested in rendering, just in the pure HTML output.
You can use Selenium with Python, as detailed here.
Example:
import xmlrpclib
# Make an object to represent the XML-RPC server.
server_url = "http://localhost:8080/selenium-driver/RPC2"
app = xmlrpclib.ServerProxy(server_url)
# Bump timeout a little higher than the default 5 seconds
app.setTimeout(15)
import os
os.system('start run_firefox.bat')
print app.open('http://localhost:8080/AUT/000000A/http/www.amazon.com/')
print app.verifyTitle('Amazon.com: Welcome')
print app.verifySelected('url', 'All Products')
print app.select('url', 'Books')
print app.verifySelected('url', 'Books')
print app.verifyValue('field-keywords', '')
print app.type('field-keywords', 'Python Cookbook')
print app.clickAndWait('Go')
print app.verifyTitle('Amazon.com: Books Search Results: Python Cookbook')
print app.verifyTextPresent('Python Cookbook', '')
print app.verifyTextPresent('Alex Martelli, David Ascher', '')
print app.testComplete()
From Mozilla Gecko FAQ:
Q. Can you invoke the Gecko engine from a Unix shell script? Could you send it HTML and get back a web page that might be sent to the printer?
A. Not really supported; you can probably get something close to what you want by writing your own application using Gecko's embedding APIs, though. Note that it's currently not possible to print without a widget on the screen to render to.
Embedding Gecko in a program that outputs what you want may be way too heavy, but at least your output will be as good as it gets.
PhantomJS can be loaded using Selenium
$ ipython
In [1]: from selenium import webdriver
In [2]: browser=webdriver.PhantomJS()
In [3]: browser.get('http://seleniumhq.org/')
In [4]: browser.title
Out[4]: u'Selenium - Web Browser Automation'
Related
I have had XPath work with other things before; in the Chrome browser I can find my element in the console with $x('//*[@id="profile"]/div[2]/div[2]/div[1]/div[2]/div[2]/div[1]/span[2]') at https://pubgtracker.com/profile/pc/Fuzzyllama/duo?region=na.
When I try to get this element in code it returns an empty array, anybody know why?
import requests
from lxml import html

@client.command(pass_context=True)
async def checkChrisPubg(ctx):
    page = requests.get('https://pubgtracker.com/profile/pc/Fuzzyllama/duo?region=na')
    tree = html.fromstring(page.content)
    # Note: XPath uses @ for attributes, not #
    duoRank = tree.xpath('//*[@id="profile"]/div[2]/div[2]/div[1]/div[2]')
    print(duoRank)
print(duoRank) gives me []
So, I tried to do this with PyQt4 and had no real success in practice. A simpler, if slightly more invasive, solution is to use Selenium, a webdriver for loading web pages.
I am sure there are multiple solutions to this, but I was having a hell of a time even knowing what was wrong until I found my solution.
When using lxml you should ensure the data you are trying to grab is not generated by JavaScript. To check, open Chrome Developer Tools, click the menu (3 vertical dots), go to Settings, scroll to the bottom, disable JavaScript, and reload the page.
If nothing is there, the page is generated content with Javascript.
A simple solution is below; it will wait for the page to render and then let you parse the tree with lxml.
This solution requires these imports (you must install Selenium and lxml):
from selenium import webdriver
from lxml import html
Now, you can load the page and start scraping:
# Load your browser (I use Chrome)
browser = webdriver.Chrome()
# Choose the url you want to scrape
url = 'https://pubgtracker.com/profile/pc/Fuzzyllama/duo?region=na'
# Get the url with Selenium
browser.get(url)
# Get the inner HTML from the rendered page
innerHTML = browser.execute_script("return document.body.innerHTML")
# Now use lxml to parse the page
tree = html.fromstring(innerHTML)
# Get your element with XPath (note @id, not #id)
duoRank = tree.xpath('//*[@id="profile"]/div[2]/div[2]/div[1]/div[2]/div[2]/div[1]/span[2]/text()')
# Close the browser
browser.quit()
My original solution would have been nice, but just didn't work because much of it is deprecated.
What library are you using as a parser?
If xml.etree.ElementTree,
ElementTree provides limited support for XPath expressions. The goal is to support a small subset of the abbreviated syntax; a full XPath engine is outside the scope of the core library.
http://effbot.org/zone/element-xpath.htm
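To see what that subset covers in practice, here is a minimal stdlib sketch; the markup is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Invented stand-in for a fragment of real page markup
doc = ET.fromstring(
    '<div id="profile"><div class="stats">'
    '<span>Rank</span><span>42</span>'
    '</div></div>'
)

# Supported: relative paths and [@attr="value"] predicates
spans = doc.findall('.//div[@class="stats"]/span')
print([s.text for s in spans])  # ['Rank', '42']
```

Full expressions copied from browser dev tools (absolute //* paths, text() selectors, complex axes) fall outside that subset and generally need lxml.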
Open the page source (view-source:https://pubgtracker.com/profile/pc/Fuzzyllama/duo?region=na); the script with the JSON playerData is at line 491. Just parse it.
I apologize if my question sounds too basic or general, but it has puzzled me for quite a while. I am a political scientist with little IT background. My own research on this question does not solve the puzzle.
It is said that Scrapy cannot scrape web content generated by JavaScript or AJAX. But how can we know if certain content falls in this category? I once came across some text that showed up in Chrome's Inspect, but could not be extracted by XPath (I am 99.9% certain my XPath expression was correct). Someone mentioned that the text might be hidden behind some JavaScript. But this is still speculation; I can't be totally sure that it wasn't due to a wrong XPath expression. Are there any signs that can make me certain that this is something beyond Scrapy and can only be dealt with by programs such as Selenium? Any help appreciated.
-=-=-=-=-=
Edit (1/18/15): The webpage I'm working with is http://yhfx.beijing.gov.cn/webdig.js?z=5. The specific piece of information I want to scrape is circled in red ink (see screenshot below. Sorry, it's in Chinese).
I can see the desired text in Chrome's Inspect, which indicates that the Xpath expression to extract it should be response.xpath("//table/tr[13]/td[2]/text()").extract(). However, the expression doesn't work.
I examined response.body in Scrapy shell. The desired text is not in it. I suspect that it is JavaScript or AJAX here, but in the html, I did not see signs of JavaScript or AJAX. Any idea what it is?
It is said that Scrapy cannot scrape web content generated by JavaScript or AJAX. But how can we know if certain content falls in this category?
A browser does a lot of things when you open a web page. I will oversimplify the process here:
Performs an HTTP request to the server hosting the web page.
Parses the response, which in most cases is HTML content (a text-based format). We will assume we get an HTML response.
Starts rendering the HTML, executes the JavaScript code, and retrieves external resources (images, CSS files, JS files, fonts, etc.). Not necessarily in this order.
Listens to events that may trigger more requests to inject more content into the page.
Scrapy provides tools to do steps 1 and 2. Selenium and other tools like Splash do step 3, allow you to do step 4, and give you access to the rendered HTML.
Now, I think there are three basic cases you face when you want to extract text content from a web page:
The text is in plain HTML format, for example, as a text node or HTML attribute: <a>foo</a>, <a href="foo" />. The content could be visually hidden by CSS or JavaScript, but as long as it is part of the HTML tree we can extract it via XPath/CSS rules.
The content is located in JavaScript code. For example: <script>var cfg = {code: "foo"};</script>. We can locate the <script> node with an XPath rule and then use regular expressions to extract the string we want. There are also libraries that allow us to parse pieces of JavaScript so we can load objects easily. A heavier solution is executing the JavaScript code via a JavaScript engine.
The content is located in an external resource and is loaded via Ajax/XHR. Here you can emulate the XHR request with Scrapy and then parse the output, which can be a nice JSON object, arbitrary JavaScript code, or plain HTML content. If reverse engineering how the content is retrieved and parsed gets tricky, you can use Selenium or Splash as a proxy for Scrapy, so you can access the rendered content and still use Scrapy for your crawler.
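For case 2, here is a minimal sketch using only the standard library; the markup and the cfg variable are invented for illustration, and this approach only works when the object literal happens to be valid JSON (quoted keys, no functions):

```python
import json
import re

# Invented example markup; in practice this is response.body
html_text = '<html><script>var cfg = {"code": "foo", "id": 7};</script></html>'

# Grab the object literal assigned to cfg, then parse it as JSON
match = re.search(r'var\s+cfg\s*=\s*(\{.*?\});', html_text, re.DOTALL)
cfg = json.loads(match.group(1))
print(cfg['code'])  # foo
```

If the literal is not valid JSON, a JavaScript-aware parser (or a JS engine) is needed instead.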
How do you know which case you have? You can simply look for the content in the response body:
$ scrapy shell http://example.com/page
...
>>> 'foo' in response.body.lower()
True
If you see foo in the web page via the browser but the test above returns False, then the content is likely loaded via Ajax/XHR. You have to check the network activity in the browser and see what requests are being made and what the responses are. Otherwise you are in case 1 or 2. You can simply view the source in the browser and search for the content to figure out where it is located.
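For the Ajax/XHR case, once the network tab reveals the endpoint, you can often call it directly and skip HTML parsing entirely. A sketch with the standard library; the URL, header, and field names here are all hypothetical:

```python
import json
import urllib.request

def fetch_stats(url):
    # Some endpoints check for the headers the browser's XHR sends
    req = urllib.request.Request(
        url, headers={'X-Requested-With': 'XMLHttpRequest'})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode('utf-8'))

# Parsing is the same whether the bytes come from urllib, requests,
# or a Scrapy response; shown here with a canned example payload
payload = '{"duoRank": 123, "region": "na"}'
stats = json.loads(payload)
print(stats['duoRank'])  # 123
```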
Let's say the content you want is located in HTML tags. How do you know if your XPath expression is correct? (By correct we mean that it gives you the output you expect.)
Well, if you do scrapy shell and response.xpath(expression) returns nothing, then your XPath is not correct. You should reduce the specificity of the expression until you get output that includes the content you want, and then narrow it down.
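That narrowing process can be sketched with the standard library; the markup below is invented, and real pages will differ:

```python
import xml.etree.ElementTree as ET

# Invented stand-in for a page fragment
doc = ET.fromstring(
    '<body><div><table><tr><td>Rank</td><td>123</td></tr></table></div></body>'
)

# A too-specific guess returns nothing -- the same symptom as in scrapy shell
print(doc.findall('./table/tr/td'))  # []

# Loosen the expression until the content shows up...
cells = doc.findall('.//td')
print([c.text for c in cells])  # ['Rank', '123']

# ...then narrow it back down to exactly the node you want
rank = doc.findall('.//table/tr/td')[1].text
print(rank)  # 123
```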
I am learning PyQt and using it to parse web pages.
Now I want to use PyQt to evaluate a JavaScript function, just like this answer does:
spidermonkey evaluate js function which in remote js file
import urllib2
import spidermonkey
js = spidermonkey.Runtime()
js_ctx = js.new_context()
script = urllib2.urlopen('http://etherhack.co.uk/hashing/whirlpool/js/whirlpool.js').read()
js_ctx.eval_script(script)
js_ctx.eval_script('var s = "abc"')
js_ctx.eval_script('print(HexWhirlpool(s))')
I want to know how to achieve the same effect using PyQt instead of spidermonkey.
PyQt is a UI framework, so it might not be the best arrow in your quiver.
You can use WebView to load and display the web page. Since WebView uses WebKit, it will load and parse the HTML and all CSS and JavaScript in it. But that means you just get the final result; you don't have much control over what is loaded and when.
Or you can use a tool like Beautiful Soup to parse the HTML. This gives you full control. You can then try to use spidermonkey to run the JavaScript, but that will fail since many global variables like window or document will be missing. Also, there won't be a DOM.
For this, you'd need something like Envjs but there is no Python integration.
Maybe PhantomJS (a scriptable headless WebKit browser) is an option. See this question: Is there a way to use PhantomJS in Python?
I'm working on a bookmarking app using Django and would like to extract the title from web pages that use javascript to generate the title. I've looked at windmill and installed/ran selenium, which worked, but I believe these tools are more than what I need to obtain the title of a web page. I'm currently trying to use spynner, but haven't been successful in retrieving the contents after the page is fully rendered. Here is the code that I currently have...
from spynner import Browser
from pyquery import PyQuery
browser = Browser()
browser.set_html_parser(PyQuery)
browser.load("https://www.coursera.org/course/techcity")
I receive a SpynnerTimeout: Timeout reached: 10 seconds error when executing the last line in a Python shell. If I execute the last statement again, it returns True, but only the page as it was before the JavaScript ran is returned, which doesn't have the "correct" page title. I also tried the following:
browser.load("https://www.coursera.org/course/techcity", wait_callback=wait_load(10))
browser.soup("title")[0].text
But this also returns the incorrect title - 'Coursera.org' (i.e. title before the javascript is run).
Here are my questions:
Is there a more efficient recommended approach for extracting a web page title that is dynamically generated with JavaScript, using some other Python tool or library? If so, what is that approach? Any example code appreciated.
If using spynner is a good approach, what should I be doing to get the title after the page is loaded, or even better, right after the title has been rendered by the javascript. The code I have now is just what I pieced together from a blog post and looking at the source for spynner on github.
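Whichever tool does the rendering (Selenium, spynner, PhantomJS), the final step of pulling the title out of the rendered HTML is the same. Here is a stdlib sketch of that step, applied to a made-up snippet standing in for the rendered page source:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text of the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and self.title is None:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.title = data.strip()
            self.in_title = False

# Stand-in for browser.page_source / the fully rendered DOM
rendered = '<html><head><title>Dynamic Page Title</title></head><body></body></html>'
parser = TitleParser()
parser.feed(rendered)
print(parser.title)  # Dynamic Page Title
```

With Selenium specifically, browser.title already gives this directly once the page has loaded; the parser above is only needed when all you have is the HTML string.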
I downloaded an HTML page. I need to parse one string from the page, but it's behind JavaScript. When I run the page in a browser everything looks fine, but in the HTML code I see something like this:
<script type="text/javascript">function deobfuscate_html(){s007=null;s7125=6 ... #long long string
How can I unpack this? I want to see the nicely rendered result, like in the browser.
I would run the page in a web browser using Selenium (triggered and controlled from Python); you can then gain access to the fully rendered page.
The chosen answer here will show you how to get the html from a rendered page.