No Element Found with XPath in lxml: JavaScript-Generated Page

I have had XPath work in other contexts before; in the Chrome browser I can find my element in the console with $x('//*[@id="profile"]/div[2]/div[2]/div[1]/div[2]/div[2]/div[1]/span[2]') at https://pubgtracker.com/profile/pc/Fuzzyllama/duo?region=na.
When I try to get this element in code it returns an empty array; does anybody know why?
import requests
from lxml import html

# client is the discord.py bot instance defined elsewhere in my code
@client.command(pass_context=True)
async def checkChrisPubg(ctx):
    page = requests.get('https://pubgtracker.com/profile/pc/Fuzzyllama/duo?region=na')
    tree = html.fromstring(page.content)
    duoRank = tree.xpath('//*[@id="profile"]/div[2]/div[2]/div[1]/div[2]')
    print(duoRank)
print(duoRank) gives me []

I tried to do this with PyQt4 and had no real success in practice. A simpler, if slightly more heavyweight, solution is to use Selenium, a WebDriver-based tool that loads web pages in a real browser.
I am sure there are multiple solutions to this, but I was having a hell of a time even knowing what was wrong until I found this one.
When using lxml, you should first make sure the data you are trying to grab is not generated by JavaScript. To check, open Chrome Developer Tools, click the menu (three vertical dots), go to Settings, scroll to the bottom, disable JavaScript, and reload the page.
If the data is no longer there, the content is generated by JavaScript.
A simple solution is below; it renders the page in a real browser and then lets you parse the resulting tree with lxml.
It requires these imports (you must install selenium):
from selenium import webdriver
from lxml import html
Now, you can load the page and start scraping:
# Load your browser (I use Chrome)
browser = webdriver.Chrome()
# Choose the url you want to scrape
url = 'https://pubgtracker.com/profile/pc/Fuzzyllama/duo?region=na'
# Get the url with Selenium
browser.get(url)
# Get the innerHTML from the rendered page
innerHTML = browser.execute_script("return document.body.innerHTML")
# Now use lxml to parse the page
tree = html.fromstring(innerHTML)
# Get your element with XPath
duoRank = tree.xpath('//*[@id="profile"]/div[2]/div[2]/div[1]/div[2]/div[2]/div[1]/span[2]/text()')
# Close the browser
browser.quit()
My original PyQt4-based solution would have been nice, but it just didn't work because much of it is deprecated.

Which library are you using as the parser?
If it is xml.etree.ElementTree, keep in mind that:
"ElementTree provides limited support for XPath expressions. The goal is to support a small subset of the abbreviated syntax; a full XPath engine is outside the scope of the core library."
http://effbot.org/zone/element-xpath.htm
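To make the difference concrete, here is a minimal, self-contained sketch using a toy document (not the real pubgtracker page): lxml's xpath() speaks full XPath 1.0, including text() node selection, while ElementTree's findall() only understands a small path subset, so you read .text from the matched element instead.
import xml.etree.ElementTree as ET
from lxml import html

doc = '<div id="profile"><span>Rank</span><span>42</span></div>'

# lxml: full XPath 1.0, attribute predicates and text() both work
tree = html.fromstring(doc)
print(tree.xpath('//*[@id="profile"]/span[2]/text()'))  # ['42']

# ElementTree: limited path subset, no text() node test; read .text yourself
root = ET.fromstring(doc)
print(root.findall('.//span')[1].text)  # 42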

Open the page source (view-source:https://pubgtracker.com/profile/pc/Fuzzyllama/duo?region=na); the script containing the JSON playerData is there, at line 491. Just parse it.
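A minimal sketch of that approach, assuming the data is assigned to a JavaScript variable literally named playerData; the exact assignment pattern in the page source may differ, so treat the regular expression below as a starting point to adapt:
import json
import re
import requests

page = requests.get('https://pubgtracker.com/profile/pc/Fuzzyllama/duo?region=na')
# Pull the JSON object out of the inline <script>; adjust the pattern to the real source
match = re.search(r'playerData\s*=\s*(\{.*?\});', page.text, re.DOTALL)
if match:
    # Note: if the object is not strict JSON, json.loads will fail and you
    # may need a JavaScript-aware parser instead
    player_data = json.loads(match.group(1))
    print(player_data.keys())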

Related

Web Scraping Python using Google Chrome extension

Hi, I am a Python newbie and I am web scraping a webpage.
I am using the Google Chrome Developer Tools to identify the class of the elements I want to scrape. However, my code returns an empty list of results, whereas the screenshot clearly shows that those strings are in the HTML code.
[screenshot: Chrome Developer Tools]
import requests
from bs4 import BeautifulSoup
url = 'http://www.momondo.de/flightsearch/?Search=true&TripType=2&SegNo=2&SO0=BOS&SD0=LON&SDP0=07-09-2016&SO1=LON&SD1=BOS&SDP1=12-09-2016&AD=1&TK=ECO&DO=false&NA=false'
html = requests.get(url)
soup = BeautifulSoup(html.text,"lxml")
x = soup.find_all("span", {"class":"value"})
print(x)
#pprint.pprint (soup.div)
I would very much appreciate your help!
Many thanks!
Converted my comment to an answer...
Make sure the data you are expecting is actually there. Use print(soup.prettify()) to see what was actually returned by the request. Depending on how the site works, the data you are looking for may only exist in the browser after the JavaScript has run. You might also want to take a look at Selenium.
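A quick way to run that check in code (same momondo URL as in the question; the 2000-character slice is arbitrary):
import requests
from bs4 import BeautifulSoup

html = requests.get(url)                   # url as defined in the question
soup = BeautifulSoup(html.text, "lxml")
# Rough check: search the raw response for a string you can actually see in the
# browser, e.g. a price; "value" alone is too generic to prove much either way
print("value" in html.text)
print(len(soup.find_all("span", {"class": "value"})))  # how many matches BeautifulSoup finds
print(soup.prettify()[:2000])              # eyeball what the server actually returned
If a string you can clearly see in the browser does not appear in html.text, the content is rendered by JavaScript and requests alone will never see it.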

Html Agility Pack is returning JavaScript code instead of the actual HTML

I want to get the links from a website using a C# console application and Html Agility Pack, but there is JavaScript code in the li and href tags. I don't know why JavaScript changes the code on click; please tell me how to get the actual code.
<li onmouseover="activate_menu('top-menu-61', 61); void(0);" onmouseout="deactivate_menu('top-menu-61', 61);"><a href="javascript:void();
This is all I can see in my li and a tags. How do I resolve this and get the actual HTML, so I can extract the links?
Try using a browser automation tool like Selenium WebDriver to render the webpage fully, using a real browser, before passing it to Html Agility Pack for parsing. Using Selenium should be fairly easy, as exemplified below. You only need to make sure that all the needed tools (the Selenium library and the browser driver of your choice) are installed properly beforehand:
// Requires the Selenium.WebDriver and HtmlAgilityPack NuGet packages
using HtmlAgilityPack;
using OpenQA.Selenium.Chrome;

// Initialize the Chrome driver (or any other supported browser)
using (var driver = new ChromeDriver())
{
    // Open the target page
    driver.Navigate().GoToUrl("the_target_page_url_here");
    // Maybe add Selenium waits here if needed,
    // to wait until certain elements appear on the page
    // Pass the rendered HTML page to HAP's HtmlDocument
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(driver.PageSource);
}
Selenium also provides ways to locate elements within a page, so it is possible to replace HAP completely with Selenium, if you want.

How to know if web content cannot be handled by Scrapy?

I apologize if my question sounds too basic or general, but it has puzzled me for quite a while. I am a political scientist with little IT background. My own research on this question does not solve the puzzle.
It is said that Scrapy cannot scrape web content generated by JavaScript or AJAX. But how can we know if certain content falls in this category? I once came across some text that shows up in Chrome's Inspect view but could not be extracted by XPath (I am 99.9% certain my XPath expression was correct). Someone mentioned that the text might be hidden behind some JavaScript, but this is still speculation; I can't be totally sure that it wasn't due to a wrong XPath expression. Are there any signs that can make me certain that this is something beyond Scrapy and can only be dealt with by programs such as Selenium? Any help appreciated.
-=-=-=-=-=
Edit (1/18/15): The webpage I'm working with is http://yhfx.beijing.gov.cn/webdig.js?z=5. The specific piece of information I want to scrape was circled in red in a screenshot (sorry, the page is in Chinese).
I can see the desired text in Chrome's Inspect, which indicates that the Xpath expression to extract it should be response.xpath("//table/tr[13]/td[2]/text()").extract(). However, the expression doesn't work.
I examined response.body in the Scrapy shell. The desired text is not in it. I suspect JavaScript or AJAX is involved, but in the HTML I did not see obvious signs of either. Any idea what it is?
It is said that Scrapy cannot scrape web content generated by JavaScript or AJAX. But how can we know if certain content falls in this category?
Browsers do a lot of things when you open a web page. I will oversimplify the process here:
1. Perform an HTTP request to the server hosting the web page.
2. Parse the response, which in most cases is HTML content (a text-based format). We will assume we get an HTML response.
3. Start rendering the HTML, execute the JavaScript code, and retrieve external resources (images, CSS files, JS files, fonts, etc.). Not necessarily in this order.
4. Listen for events that may trigger more requests to inject more content into the page.
Scrapy provides tools to do steps 1 and 2. Selenium and other tools like Splash do step 3, let you do step 4, and give you access to the rendered HTML.
Now, I think there are three basic cases you face when you want to extract text content from a web page:
The text is in plain HTML, for example as a text node or an HTML attribute: <a>foo</a>, <a href="foo" />. The content may be visually hidden by CSS or JavaScript, but as long as it is part of the HTML tree we can extract it via XPath/CSS rules.
The content is located in JavaScript code, for example: <script>var cfg = {code: "foo"};</script>. We can locate the <script> node with an XPath rule and then use regular expressions to extract the string we want. There are also libraries that let us parse pieces of JavaScript so we can load the objects easily. A more involved solution is executing the JavaScript code via a JavaScript engine.
The content is located in an external resource and is loaded via Ajax/XHR. Here you can emulate the XHR request with Scrapy and then parse the output, which may be a nice JSON object, arbitrary JavaScript code, or simply HTML content (a sketch of this case follows below). If it gets tricky to reverse engineer how the content is retrieved and parsed, you can use Selenium or Splash as a proxy for Scrapy, so you can access the rendered content and still be able to use Scrapy for your crawler.
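For example, a minimal sketch of case 3; the endpoint URL is hypothetical, find the real one in the browser's Network tab while the page loads:
import json
import scrapy

class StatsSpider(scrapy.Spider):
    name = "stats"
    # Hypothetical XHR endpoint discovered in the Network tab
    start_urls = ["http://example.com/api/stats?id=123"]

    def parse(self, response):
        # The XHR usually returns JSON, which is easier to handle than HTML
        data = json.loads(response.text)
        yield {"value": data.get("value")}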
How do you know which case you have? You can simply look for the content in the response body:
$ scrapy shell http://example.com/page
...
>>> 'foo' in response.body.lower()
True
If you see foo in the web page via the browser but the test above returns False, then the content is likely loaded via Ajax/XHR. You then have to check the network activity in the browser and see which requests are being made and what the responses contain. Otherwise you are in case 1 or 2: you can simply view the source in the browser and search for the content to figure out where it is located.
Let's say the content you want is located in HTML tags. How do you know if your XPath expression is correct? (By correct here we mean that it gives you the output you expect.)
Well, if you do scrapy shell and response.xpath(expression) returns nothing, then your XPath is not correct for that response. You should reduce the specificity of your expression until you get output that includes the content you want, and then narrow it down again, as sketched below.
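For the table in this question, that narrowing process might look like this in scrapy shell (the intermediate expressions are only illustrative):
response.xpath('//table').extract()                       # is any table there at all?
response.xpath('//table//td/text()').extract()            # is the text reachable anywhere?
response.xpath('//table/tr[13]/td[2]/text()').extract()   # the original, very specific path
One common reason a DevTools-copied path fails is that browsers insert a <tbody> element into tables, so the raw HTML may need //table//tr[13]/td[2] or //table/tbody/tr[13]/td[2] instead of //table/tr[13]/td[2].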

How to retrieve the title from dynamically generated web pages

I'm working on a bookmarking app using Django and would like to extract the title from web pages that use javascript to generate the title. I've looked at windmill and installed/ran selenium, which worked, but I believe these tools are more than what I need to obtain the title of a web page. I'm currently trying to use spynner, but haven't been successful in retrieving the contents after the page is fully rendered. Here is the code that I currently have...
from spynner import Browser
from pyquery import PyQuery
browser = Browser()
browser.set_html_parser(PyQuery)
browser.load("https://www.coursera.org/course/techcity")
I receive a SpynnerTimeout: Timeout reached: 10 seconds error when executing the last line in a python shell. If I execute the last statement again, it will return True, but only the page before the javascript is run is returned, which doesn't have the "correct" page title. I also tried the following:
browser.load("https://www.coursera.org/course/techcity", wait_callback=wait_load(10))
browser.soup("title")[0].text
But this also returns the incorrect title - 'Coursera.org' (i.e. title before the javascript is run).
Here are my questions:
Is there a more efficient, recommended approach for extracting a web page title that is dynamically generated with JavaScript, using some other Python tool or library? If so, what is that approach? Any example code is appreciated.
If spynner is a good approach, what should I be doing to get the title after the page is loaded, or even better, right after the title has been rendered by the JavaScript? The code I have now is just what I pieced together from a blog post and from looking at the source for spynner on GitHub.
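For the first question, a minimal sketch using Selenium, which the poster already has installed and working; the wait condition is an assumption, it simply waits until JavaScript has replaced the generic 'Coursera.org' title mentioned above:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("https://www.coursera.org/course/techcity")
# Wait up to 10 seconds for the title to change from the pre-JavaScript placeholder
WebDriverWait(driver, 10).until(lambda d: d.title != "Coursera.org")
print(driver.title)
driver.quit()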

How to parse html that includes javascript code

How does one parse HTML documents that make heavy use of JavaScript? I know there are a few libraries in Python which can parse static XML/HTML files. I'm basically looking for a program or library (or even a Firefox plugin) which reads HTML+JavaScript, executes the JavaScript, and outputs HTML code without JavaScript, so it would look identical if displayed in a browser.
As a simple example, a link whose target is generated by a JavaScript function should be replaced by the appropriate value that the function returns.
A more complex example would be a saved facebook html page which is littered with loads of javascript code.
This is probably related to "How to 'execute' HTML+Javascript page with Node.js", but do I really need Node.js and JSDOM? Also slightly related is "Python library for rendering HTML and javascript", but I'm not interested in rendering, just the pure HTML output.
You can use Selenium with Python; the example below is rather dated (it drives the old Selenium server through its XML-RPC interface), but it shows the idea.
Example:
import xmlrpclib
# Make an object to represent the XML-RPC server.
server_url = "http://localhost:8080/selenium-driver/RPC2"
app = xmlrpclib.ServerProxy(server_url)
# Bump timeout a little higher than the default 5 seconds
app.setTimeout(15)
import os
os.system('start run_firefox.bat')
print app.open('http://localhost:8080/AUT/000000A/http/www.amazon.com/')
print app.verifyTitle('Amazon.com: Welcome')
print app.verifySelected('url', 'All Products')
print app.select('url', 'Books')
print app.verifySelected('url', 'Books')
print app.verifyValue('field-keywords', '')
print app.type('field-keywords', 'Python Cookbook')
print app.clickAndWait('Go')
print app.verifyTitle('Amazon.com: Books Search Results: Python Cookbook')
print app.verifyTextPresent('Python Cookbook', '')
print app.verifyTextPresent('Alex Martelli, David Ascher', '')
print app.testComplete()
From Mozilla Gecko FAQ:
Q. Can you invoke the Gecko engine from a Unix shell script? Could you send it HTML and get back a web page that might be sent to the printer?
A. Not really supported; you can probably get something close to what you want by writing your own application using Gecko's embedding APIs, though. Note that it's currently not possible to print without a widget on the screen to render to.
Embedding Gecko in a program that outputs what you want may be way too heavy, but at least your output will be as good as it gets.
PhantomJS can be loaded using Selenium
$ ipython
In [1]: from selenium import webdriver
In [2]: browser=webdriver.PhantomJS()
In [3]: browser.get('http://seleniumhq.org/')
In [4]: browser.title
Out[4]: u'Selenium - Web Browser Automation'
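PhantomJS support has since been removed from Selenium; headless Chrome or Firefox is the usual replacement, and the rest of the API stays the same. A minimal sketch, assuming chromedriver is installed:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
browser = webdriver.Chrome(options=options)
browser.get('http://seleniumhq.org/')
print(browser.title)
browser.quit()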
