I'm working on a bookmarking app using Django and would like to extract the title from web pages that use javascript to generate the title. I've looked at windmill and installed/ran selenium, which worked, but I believe these tools are more than what I need to obtain the title of a web page. I'm currently trying to use spynner, but haven't been successful in retrieving the contents after the page is fully rendered. Here is the code that I currently have...
from spynner import Browser
from pyquery import PyQuery
browser = Browser()
browser.set_html_parser(PyQuery)
browser.load("https://www.coursera.org/course/techcity")
I receive a SpynnerTimeout: Timeout reached: 10 seconds error when executing the last line in a python shell. If I execute the last statement again, it will return True, but only the page before the javascript is run is returned, which doesn't have the "correct" page title. I also tried the following:
browser.load("https://www.coursera.org/course/techcity", wait_callback=wait_load(10))
browser.soup("title")[0].text
But this also returns the incorrect title - 'Coursera.org' (i.e. title before the javascript is run).
Here are my questions:
Is there a more efficient, recommended approach for extracting a web page title that is dynamically generated with javascript, using some other python tool/library? If so, what is that approach? Any example code appreciated.
If using spynner is a good approach, what should I be doing to get the title after the page is loaded, or even better, right after the title has been rendered by the javascript? The code I have now is just what I pieced together from a blog post and from looking at the source for spynner on github.
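Since Selenium is already installed and working for you, one relatively light approach (a sketch only; the PhantomJS driver and the 15-second wait are my assumptions, not something from the question) is to load the page with a headless driver and explicitly wait until the JavaScript has replaced the placeholder title:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# PhantomJS is just one option; webdriver.Firefox() works the same way.
browser = webdriver.PhantomJS()
browser.get("https://www.coursera.org/course/techcity")

# Wait up to 15 seconds for the JavaScript to swap in the real title.
WebDriverWait(browser, 15).until(lambda b: b.title != "Coursera.org")

print(browser.title)  # the dynamically generated title
browser.quit()

The underlying issue is the same with spynner: load() returns once the initial HTML has arrived, so the code has to wait on some condition (or a fixed delay) before the rewritten title is visible.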
Related
I have a django 4.0 monolith project; however, one of the objects (a graph) that I pass to the template context to display in the html is slow to load.
I'd like to have the rest of the page render first, while this function is still running/loading. That way, users can see the rest of the page while the graph is still loading. Thoughts on the best approach to this?
I tried creating a custom tag to load the graph, but the browser still waited for everything to load before showing anything on the page.
I'm wondering if some type of javascript call would accomplish this. Perhaps an ajax call? Or the asyncio python package?
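The AJAX idea is the usual fix: render the page without the graph, then have the browser request the graph from its own endpoint once the rest of the page is visible. A minimal sketch, with hypothetical view, URL, and element names (none of these come from your project):

# views.py - hypothetical sketch: the main page view returns immediately,
# and the slow graph is served by its own endpoint.
from django.http import HttpResponse
from django.shortcuts import render

def build_graph_html():
    # stand-in for the slow graph-building code
    return "<svg><!-- graph markup --></svg>"

def dashboard(request):
    # No graph in the context, so this page renders immediately.
    return render(request, "dashboard.html")

def graph_fragment(request):
    # Called later by the browser; only this request waits for the slow code.
    return HttpResponse(build_graph_html())

# In dashboard.html, a placeholder div plus a few lines of JavaScript fill
# the graph in after the page is shown, e.g.:
#   fetch("/graph-fragment/").then(r => r.text())
#       .then(html => { document.getElementById("graph").innerHTML = html; });

asyncio alone won't help here, because the browser only receives the page after the view has returned and the template has been fully rendered.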
I'm using Delphi's TWebBrowser component to load up some web pages that I want to parse, and they use javascript (AJAX?) to render the user-visible HTML code. The well-documented methods of extracting the HTML from such pages return a bunch of javascript rather than what the user sees. There are responses to queries here that go back to 2004, and they all return javascript rather than the user-visible HTML. I've seen a couple that suggest alternate ways to access the data, but I have not been able to get any of them to work, nor am I sure how to adapt the code.
My question is, when I load a web page into a TWebBrowser that's perfectly readable after being rendered inside of the TWebBrowser component, how can I extract the HTML that's ultimately rendered inside of that component that makes it visible, rather than the JS code that generates it?
In my case, I'm trying to load a Google Search Result page, but I've heard this is also an issue in lots of news sites like Wall Street Journal, WAPO, and NYTimes.
var
  url: string;
  d: OleVariant;
begin
  // enter something like "dentist in baltimore" in a Google search,
  // then copy the contents of the ADDRESS field that it generates and
  // paste it here:
  url := '... paste URL Google generates here ...';
  WebBrowser1.Navigate2( url, 0 {nav_flags} );
  // I have an OnNavigateComplete2 handler here, but I'm guessing this works as well
  d := WebBrowser1.Document;
  memo1.Lines.Text := d.documentElement.outerHTML;
end;
The problem is, the memo contains ... and it's just a bunch of javascript in the HEAD. There's nothing there that resembles what's visible in the TWebBrowser or browser window that this search actually displays to the user.
Someone in another forum suggested it's a timing issue, and to replace the OnNavigateComplete2 event that I'm using with OnDocumentComplete. I've actually never seen or heard of OnDocumentComplete, nor have I seen it used in any examples; certainly none that have been simplified to show everything inline so that no timing issues can occur.
But it turns out that this was the crux of the problem in this case, not outerHTML: you need to handle an event that's triggered after all of the javascript has finished running, and I had believed that OnNavigateComplete2 did that. My bad.
I'm trying to scrape the following page using hQuery: http://www.oddsportal.com/search/Paris+SG/soccer/
I realised halfway through that the odds for each game are inserted using JS (before that runs, it's just "-"). Is there any way to get the page after the javascript has been executed, or should I find another website?
My guess is that you would have to use an actual browser (not hQuery), look into the page's code, and see whether there are any events emitted that you can hook into.
You cannot do this with plain PHP.
Scraping a site gives you whatever the server responds with to the HTTP request that you make (from which the "initial" state of the DOM tree is derived, if that content is HTML). It cannot take into account the "current" state of the DOM after it has been modified by Javascript.
You can use other, more powerful tools like Selenium.
You would need a PhantomJS PHP wrapper for that; it is easy to use and gives you more control and features. Please see my answer here:
Scraping a dynamically loading website with php curl
Hope it helps
I apologize if my question sounds too basic or general, but it has puzzled me for quite a while. I am a political scientist with little IT background. My own research on this question does not solve the puzzle.
It is said that Scrapy cannot scrape web content generated by JavaScript or AJAX. But how can we know if certain content falls in this category? I once came across some text that shows up in Chrome's Inspect panel but could not be extracted by XPath (I am 99.9% certain my XPath expression was correct). Someone mentioned that the text might be hidden behind some JavaScript, but this is still speculation; I can't be totally sure that it wasn't due to a wrong XPath expression. Are there any signs that can make me certain that this is something beyond Scrapy and can only be dealt with by programs such as Selenium? Any help appreciated.
-=-=-=-=-=
Edit (1/18/15): The webpage I'm working with is http://yhfx.beijing.gov.cn/webdig.js?z=5. The specific piece of information I want to scrape is circled in red ink (see screenshot below. Sorry, it's in Chinese).
I can see the desired text in Chrome's Inspect, which indicates that the Xpath expression to extract it should be response.xpath("//table/tr[13]/td[2]/text()").extract(). However, the expression doesn't work.
I examined response.body in Scrapy shell. The desired text is not in it. I suspect that it is JavaScript or AJAX here, but in the html, I did not see signs of JavaScript or AJAX. Any idea what it is?
It is said that Scrapy cannot scrape web content generated by JavaScript or AJAX. But how can we know if certain content falls in this category?
Browsers do a lot of things when you open a web page. I will oversimplify the process here:
1. Performs an HTTP request to the server hosting the web page.
2. Parses the response, which in most cases is HTML content (a text-based format). We will assume we get an HTML response.
3. Starts rendering the HTML, executes the Javascript code, and retrieves external resources (images, css files, js files, fonts, etc.). Not necessarily in this order.
4. Listens for events that may trigger more requests to inject more content into the page.
Scrapy provides tools to do 1 and 2. Selenium and other tools like Splash do 3, allow you to do 4, and give you access to the rendered HTML.
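For example, a rough sketch of the Splash route using the scrapy-splash plugin (the plugin, the wait value, and the spider below are my additions, not part of this answer; a running Splash instance plus the scrapy-splash settings in settings.py are assumed):

# Requires a running Splash instance and the scrapy-splash package,
# plus SPLASH_URL and the scrapy-splash middlewares in settings.py.
import scrapy
from scrapy_splash import SplashRequest

class RenderedSpider(scrapy.Spider):
    name = "rendered"

    def start_requests(self):
        # Splash loads the page, runs the JavaScript, waits 2 seconds,
        # then hands the rendered HTML back to Scrapy.
        yield SplashRequest("http://example.com/page", self.parse,
                            args={"wait": 2})

    def parse(self, response):
        # response now contains the post-JavaScript DOM.
        yield {"title": response.xpath("//title/text()").extract_first()}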
Now, I think there are three basic cases you face when you want to extract text content from a web page:
The text is in plain HTML format, for example, as a text node or HTML attribute: <a>foo</a>, <a href="foo" />. The content could be visually hidden by CSS or Javascript, but as long as it is part of the HTML tree we can extract it via XPath/CSS rules.
The content is located in Javascript code. For example: <script>var cfg = {code: "foo"};</script>. We can locate the <script> node with an XPath rule and then use regular expressions to extract the string we want (see the sketch after this list). There are also libraries that allow us to parse pieces of Javascript so we can load objects easily. A more complex solution here is executing the javascript code via a javascript engine.
The content is located in an external resource and is loaded via Ajax/XHR. Here you can emulate the XHR request with Scrapy and then parse the output, which can be a nice JSON object, arbitrary javascript code or simply HTML content. If it gets tricky to reverse engineer how the content is retrieved/parsed, then you can use Selenium or Splash as a proxy for Scrapy so you can access the rendered content and still be able to use Scrapy for your crawler.
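A minimal sketch of case 2, assuming a scrapy shell session and the made-up <script>var cfg = {code: "foo"};</script> markup from above:

import re

# Find the <script> node that defines cfg, then pull the value out with a regex.
script = response.xpath('//script[contains(., "var cfg")]/text()').extract_first()
match = re.search(r'code:\s*"([^"]+)"', script or "")
code = match.group(1) if match else None   # -> "foo"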
How do you know which case you have? You can simply look up the content in the response body:
$ scrapy shell http://example.com/page
...
>>> 'foo' in response.body.lower()
True
If you see foo in the web page via the browser but the test above returns False, then it's likely the content is loaded via Ajax/XHR. You have to check the network activity in the browser and see what requests are being made and what the responses are. Otherwise you are in case 1 or 2. You can simply view the source in the browser and search for the content to figure out where it is located.
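As a sketch of the Ajax/XHR case, where the endpoint URL and JSON layout are hypothetical (the real ones come from the browser's network panel):

import json
import scrapy

class XhrSpider(scrapy.Spider):
    name = "xhr"
    # The URL found in the browser's network tab, requested directly.
    start_urls = ["http://example.com/api/items?page=1"]

    def parse(self, response):
        # The endpoint returns JSON, so no HTML parsing is needed at all.
        data = json.loads(response.text)
        for item in data.get("items", []):
            yield {"name": item.get("name")}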
Let's say the content you want is located in HTML tags. How do you know if your XPath expression is correct? (By correct here we mean that it gives you the output you expect.)
Well, if you do scrapy shell and response.xpath(expression) returns nothing, then your XPath is not correct. You should reduce the specificity of your expression until you get an output that includes the content you want, and then narrow it down.
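For instance, a rough narrowing session in scrapy shell; the expressions and output here are placeholders, not taken from the page in the question:

$ scrapy shell http://example.com/page
...
>>> response.xpath("//table/tr[13]/td[2]/text()").extract()   # too specific, nothing matches
[]
>>> len(response.xpath("//table"))                             # broaden: is there a table at all?
1
>>> response.xpath("//table//td/text()").extract()[:3]        # then narrow back down step by step
[u'...', u'...', u'...']

One common mismatch worth knowing: browsers insert a <tbody> element into tables when building the DOM, so an expression copied from Chrome's Inspect (which shows the DOM) may not match the raw HTML that Scrapy receives.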
How does one parse html documents which make heavy use of javascript? I know there are a few libraries in python which can parse static xml/html files and I'm basically looking for a programme or library (or even firefox plugin) which reads html+javascript, executes the javascript bit and outputs html code without javascript so it would look identical if displayed in a browser.
As a simple example
link
should be replaced by the appropriate value the javascript function returns, e.g.
link
A more complex example would be a saved facebook html page which is littered with loads of javascript code.
Probably related to
How to "execute" HTML+Javascript page with Node.js
but do I really need Node.js and JSDOM? Also slightly related is
Python library for rendering HTML and javascript
but I'm not interested in rendering, just the pure html output.
You can use Selenium with python as detailed here
Example:
import xmlrpclib
# Make an object to represent the XML-RPC server.
server_url = "http://localhost:8080/selenium-driver/RPC2"
app = xmlrpclib.ServerProxy(server_url)
# Bump timeout a little higher than the default 5 seconds
app.setTimeout(15)
import os
os.system('start run_firefox.bat')
print app.open('http://localhost:8080/AUT/000000A/http/www.amazon.com/')
print app.verifyTitle('Amazon.com: Welcome')
print app.verifySelected('url', 'All Products')
print app.select('url', 'Books')
print app.verifySelected('url', 'Books')
print app.verifyValue('field-keywords', '')
print app.type('field-keywords', 'Python Cookbook')
print app.clickAndWait('Go')
print app.verifyTitle('Amazon.com: Books Search Results: Python Cookbook')
print app.verifyTextPresent('Python Cookbook', '')
print app.verifyTextPresent('Alex Martelli, David Ascher', '')
print app.testComplete()
From Mozilla Gecko FAQ:
Q. Can you invoke the Gecko engine from a Unix shell script? Could you send it HTML and get back a web page that might be sent to the printer?
A. Not really supported; you can probably get something close to what you want by writing your own application using Gecko's embedding APIs, though. Note that it's currently not possible to print without a widget on the screen to render to.
Embedding Gecko in a program that outputs what you want may be way too heavy, but at least your output will be as good as it gets.
PhantomJS can be driven through Selenium:
$ ipython
In [1]: from selenium import webdriver
In [2]: browser=webdriver.PhantomJS()
In [3]: browser.get('http://seleniumhq.org/')
In [4]: browser.title
Out[4]: u'Selenium - Web Browser Automation'
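The same driver also exposes the fully rendered DOM, which is what most of the questions above are ultimately after. Continuing the session, using only the standard WebDriver attributes:

In [5]: html = browser.page_source   # HTML after the JavaScript has run

In [6]: browser.quit()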