I downloaded an HTML page. I need to parse one string from the page, but it's hidden behind JavaScript. When I run this page in a browser everything looks fine, but in the HTML code I see something like this:
<script type="text/javascript">function deobfuscate_html(){s007=null;s7125=6 ... #long long string
How can I unpack this? I want to see the rendered HTML, like in the browser.
I would run the page in a web browser using Selenium (triggered and controlled in Python) - you can then gain access to the fully rendered page.
The chosen answer here will show you how to get the html from a rendered page.
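A minimal sketch of that approach (the driver choice and URL here are placeholders, not from your page):

from selenium import webdriver

driver = webdriver.Firefox()                      # any locally installed driver works
driver.get("http://example.com/obfuscated.html")  # placeholder URL
html = driver.page_source                         # the HTML after the scripts have run
print(html)
driver.quit()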
I apologize if my question sounds too basic or general, but it has puzzled me for quite a while. I am a political scientist with little IT background. My own research on this question does not solve the puzzle.
It is said that Scrapy cannot scrape web content generated by JavaScript or AJAX. But how can we know if certain content falls into this category? I once came across some text that shows up in Chrome's Inspect tool but could not be extracted by XPath (I am 99.9% certain my XPath expression was correct). Someone mentioned that the text might be hidden behind some JavaScript, but this is still speculation; I can't be totally sure it wasn't due to a wrong XPath expression. Are there any signs that can make me certain that something is beyond Scrapy and can only be dealt with by programs such as Selenium? Any help appreciated.
-=-=-=-=-=
Edit (1/18/15): The webpage I'm working with is http://yhfx.beijing.gov.cn/webdig.js?z=5. The specific piece of information I want to scrape is circled in red ink (see screenshot below. Sorry, it's in Chinese).
I can see the desired text in Chrome's Inspect, which indicates that the Xpath expression to extract it should be response.xpath("//table/tr[13]/td[2]/text()").extract(). However, the expression doesn't work.
I examined response.body in Scrapy shell. The desired text is not in it. I suspect that it is JavaScript or AJAX here, but in the html, I did not see signs of JavaScript or AJAX. Any idea what it is?
It is said that Scrapy cannot scrape web content generated by JavaScript or AJAX. But how can we know if certain content falls in this category?
A browser does a lot of things when you open a web page. I will oversimplify the process here:
1. Performs an HTTP request to the server hosting the web page.
2. Parses the response, which in most cases is HTML content (a text-based format). We will assume we get an HTML response.
3. Starts rendering the HTML, executes the Javascript code, and retrieves external resources (images, CSS files, JS files, fonts, etc.). Not necessarily in this order.
4. Listens for events that may trigger more requests to inject more content into the page.
Scrapy provides tools to do 1. and 2. Selenium and other tools like Splash do 3., allow you to trigger 4., and give you access to the rendered HTML.
Now, I think there are three basic cases you face when you want to extract text content from a web page:
The text is in plain HTML format, for example, as a text node or HTML attribute: <a>foo</a>, <a href="foo" />. The content could be visually hidden by CSS or Javascript, but as long as it is part of the HTML tree we can extract it via XPath/CSS rules.
The content is located in Javascript code. For example: <script>var cfg = {code: "foo"};</script>. We can locate the <script> node with an XPath rule and then use regular expressions to extract the string we want (see the sketch after this list). There are also libraries that allow us to parse pieces of Javascript so we can load objects easily. A more complex solution here is executing the Javascript code via a Javascript engine.
The content is located in an external resource and is loaded via Ajax/XHR. Here you can emulate the XHR request with Scrapy and then parse the output, which can be a nice JSON object, arbitrary Javascript code, or simply HTML content. If it gets tricky to reverse engineer how the content is retrieved/parsed, then you can use Selenium or Splash as a proxy for Scrapy, so you can access the rendered content and still be able to use Scrapy for your crawler.
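For example, a rough sketch of case 2. inside scrapy shell, reusing the <script>var cfg = {code: "foo"};</script> snippet above (the XPath and the regular expression are illustrative):

>>> import re
>>> script = response.xpath('//script[contains(., "var cfg")]/text()').extract_first()
>>> re.search(r'code:\s*"([^"]+)"', script).group(1)
'foo'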
How do you know which case you have? You can simply look up the content in the response body:
$ scrapy shell http://example.com/page
...
>>> 'foo' in response.text.lower()
True
If you see foo in the web page via the browser but the test above returns False, then it's likely the content is loaded via Ajax/XHR. You have to check the network activity in the browser and see what requests are being made and what the responses are. Otherwise you are in case 1. or 2. You can simply view the source in the browser and search for the content to figure out where it is located.
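If it turns out to be the Ajax/XHR case, you can often fetch the XHR URL you found in the Network tab directly and parse its output; a rough sketch, where the URL and JSON keys are placeholders:

$ scrapy shell "http://example.com/api/data?page=1"
...
>>> import json
>>> data = json.loads(response.text)
>>> data["items"][0]["title"]
u'foo'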
Let's say the content you want is located in HTML tags. How do you know if your XPath expression is correct? (By correct here we mean that it gives you the output you expect.)
Well, if you do scrapy shell and response.xpath(expression) returns nothing, then your XPath is not correct. You should reduce the specificity of your expression until you get an output that includes the content you want, and then narrow it down.
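For instance, with the expression from the question above, you could broaden it until something matches and then tighten it again:

>>> response.xpath('//table/tr[13]/td[2]/text()').extract()   # too specific here, nothing matches
[]
>>> response.xpath('//table').extract()                        # broaden until you get output
[u'<table>...', ...]
>>> response.xpath('//table//td/text()').extract()             # then narrow it down again
[u'...', ...]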
I am using HtmlUnit to read content from a web site.
Everything works perfectly to the point where I am reading the content with:
HtmlDivision div = page.getHtmlElementById("my-id");
Even div.asText() returns the expected String object, but I want to get the original HTML inside <div>...</div> as a String object. How can I do that?
I am not willing to change HtmlUnit to something else, as the web site expects the client to run JavaScript, and HtmlUnit seems to be capable of doing what is required.
If by original HTML you mean the HTML code that HTMLUnit has already formatted then you can use div.asXml(). Now, if you really are looking for the original HTML the server sent you then you won't find a way to do so (at least up to v2.14).
Now, as a workaround, you could get the whole text of the page that the server sent you with this answer: How to get the pure raw HTML of a page in HTMLUnit while ignoring JavaScript and CSS?
As a side note, you should probably think twice about why you need the HTML code. HTMLUnit will let you get the data from the code, so there shouldn't be any need to store the source code, but rather the information contained in it. Just my 2 cents.
I have a login page that runs a script from a third-party site as so:
<span id="siteseal">
<script src="https://seal.starfieldtech.com/getSeal?sealID=myspecialid"></script>
</span>
Everything is hunky dory. It performs some javascript and eventually displays an image.
I recently moved it to a separate file, and am including it in the original page using
$('#mydivid').load('/mypath/footer.html');
The entire footer is displayed, and in chrome developer tools I can see that the request is made to starfieldtech and a javascript response is returned, but the image is never displayed.
The getSeal script is pretty simple, looking something like:
<!--
doSomething();
function doSomething() {
    // setup a bunch of vars
    document.write('<img src="blah" onclick="doStuffOnClick();"/>');
}
function doStuffOnClick() {
    // do other stuff on click.
}
// -->
If I create my own script that looks very similar to the above then it replaces the whole page with the output of my script and also shows the intended starfield image.
I have no clue what the problem is and hope the gurus can point out something stupid I'm missing.
As you can't edit the javascript returned by starfieldtech.com, I think the only sensible way round this is to use phantom.js (or similar) to pre-process the third-party HTML on your server before sending it to the browser.
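A rough sketch of that kind of pre-processing, using PhantomJS driven through Selenium from Python (the URL and file names are placeholders; any headless renderer would do):

from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get("http://localhost/mypath/footer.html")   # placeholder URL for the footer page
rendered = browser.page_source                       # document.write() has already run at this point
with open("footer_rendered.html", "wb") as f:        # serve this pre-rendered copy instead
    f.write(rendered.encode("utf-8"))
browser.quit()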
I'm working on a bookmarking app using Django and would like to extract the title from web pages that use javascript to generate the title. I've looked at windmill and installed/ran selenium, which worked, but I believe these tools are more than what I need to obtain the title of a web page. I'm currently trying to use spynner, but haven't been successful in retrieving the contents after the page is fully rendered. Here is the code that I currently have...
from spynner import Browser
from pyquery import PyQuery
browser = Browser()
browser.set_html_parser(PyQuery)
browser.load("https://www.coursera.org/course/techcity")
I receive a SpynnerTimeout: Timeout reached: 10 seconds error when executing the last line in a python shell. If I execute the last statement again, it will return True, but only the page before the javascript is run is returned, which doesn't have the "correct" page title. I also tried the following:
browser.load("https://www.coursera.org/course/techcity", wait_callback=wait_load(10))
browser.soup("title")[0].text
But this also returns the incorrect title - 'Coursera.org' (i.e. title before the javascript is run).
Here are my questions:
Is there a more efficient recommended approach for extracting a web page title that is dynamically generated with javascript, that uses some other python tool/library? If so, what is that recommended approach? - any example code appreciated.
If using spynner is a good approach, what should I be doing to get the title after the page is loaded, or even better, right after the title has been rendered by the javascript? The code I have now is just what I pieced together from a blog post and from looking at the source for spynner on GitHub.
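For comparison, a Selenium-based sketch that does return the rendered title looks roughly like this (the driver choice is an assumption; this is the heavier approach I'd like to avoid if possible):

from selenium import webdriver

browser = webdriver.Firefox()        # assumption: any locally installed driver
browser.get("https://www.coursera.org/course/techcity")
title = browser.title                # the title after the javascript has run
browser.quit()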
How does one parse HTML documents which make heavy use of javascript? I know there are a few libraries in Python which can parse static XML/HTML files, and I'm basically looking for a program or library (or even a Firefox plugin) which reads HTML+javascript, executes the javascript bit, and outputs HTML code without javascript, so it would look identical if displayed in a browser.
As a simple example, a link whose target is produced by a javascript function should be replaced by the appropriate value the javascript function returns.
A more complex example would be a saved facebook html page which is littered with loads of javascript code.
Probably related to
How to "execute" HTML+Javascript page with Node.js
but do I really need Node.js and JSDOM? Also slightly related is
Python library for rendering HTML and javascript
but I'm not interested in rendering, just the pure HTML output.
You can use Selenium with Python, as detailed here.
Example:
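# Note: this is an older, XML-RPC-driven style of Selenium usage (Python 2 syntax);
# the webdriver-based example at the bottom of this page is the more current approach.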
import xmlrpclib
# Make an object to represent the XML-RPC server.
server_url = "http://localhost:8080/selenium-driver/RPC2"
app = xmlrpclib.ServerProxy(server_url)
# Bump timeout a little higher than the default 5 seconds
app.setTimeout(15)
import os
os.system('start run_firefox.bat')
print app.open('http://localhost:8080/AUT/000000A/http/www.amazon.com/')
print app.verifyTitle('Amazon.com: Welcome')
print app.verifySelected('url', 'All Products')
print app.select('url', 'Books')
print app.verifySelected('url', 'Books')
print app.verifyValue('field-keywords', '')
print app.type('field-keywords', 'Python Cookbook')
print app.clickAndWait('Go')
print app.verifyTitle('Amazon.com: Books Search Results: Python Cookbook')
print app.verifyTextPresent('Python Cookbook', '')
print app.verifyTextPresent('Alex Martelli, David Ascher', '')
print app.testComplete()
From Mozilla Gecko FAQ:
Q. Can you invoke the Gecko engine from a Unix shell script? Could you send it HTML and get back a web page that might be sent to the printer?
A. Not really supported; you can probably get something close to what you want by writing your own application using Gecko's embedding APIs, though. Note that it's currently not possible to print without a widget on the screen to render to.
Embedding Gecko in a program that outputs what you want may be way too heavy, but at least your output will be as good as it gets.
PhantomJS can be loaded using Selenium
$ ipython
In [1]: from selenium import webdriver
In [2]: browser=webdriver.PhantomJS()
In [3]: browser.get('http://seleniumhq.org/')
In [4]: browser.title
Out[4]: u'Selenium - Web Browser Automation'