How to get the source code of webbrowser with python - javascript

I am writing a spider with scrapy, however, I come across some website which rendered with js, thus the urllib2.open_url does not work. I have found that I could open the browser with webbrowser.open_new(url), however, I did not find how to get the src code of page with webbrowser. Are there any way that I could use to do this with webbrowser, or are there any other solutions without webbrowser to deal with the js sites?

You can use scraper with Webkit engine available out there.
One of them is dryscrape.
Example:
import dryscrape
search_term = 'dryscrape'
# set up a web scraping session
sess = dryscrape.Session(base_url = 'http://google.com')
# we don't need images
sess.set_attribute('auto_load_images', False)
# visit homepage and search for a term
sess.visit('/')
q = sess.at_xpath('//*[#name="q"]')
q.set(search_term)
q.form().submit()
# extract all links
for link in sess.xpath('//a[#href]'):
print link['href']
# save a screenshot of the web page
sess.render('google.png')
print "Screenshot written to 'google.png'"
See more info at:
https://github.com/niklasb/dryscrape
https://dryscrape.readthedocs.org/en/latest/index.html

If you need a full js engine, there are a number of ways you can drive webkit from Python. Until recently, these sort of things were done with Selenium. Selenium drives an entire browser.
More recently there are newer and simpler ways to run a webkit engine (which includes the v8 javascript engine) from Python. See this SO question:
Headless Browser for Python (Javascript support REQUIRED!)
It references this blog as an example Scraping Javascript Webpages with Webkit . It looks to do more or less just what you need.

I'm trying to find an answer to the same problem for a few days now.
I suggest you try QT framework with WebKit.
There are two python bindings. One is PyQt and the other one is PySide. You can use them directly if you want to create something more complex or you want to have 100% control over your code.
For trivial stuff like executing JavaScript in a browser environment you can use Ghost.py. It has some sort of documentation and some problems when using it from the command line but otherwise it's just great.

If you need to process JavaScript you'll need to implement a JavaScript engine. This makes your spider much more complex. Mainly because JavaScript almost always modifies the DOM based on time or an action taken by the user. This makes it extremely challenging to process JS in a crawler.
If you really need to process JavaScript in your spider you can have a look at the JavaScript engine by Mozilla: https://developer.mozilla.org/en/docs/SpiderMonkey

Related

How to save a website made with javascript to a file

A little info:
When 'inspected' (Google Chrome), the website displays the information I need (namely, a simple link to a .pdf).
When I cURL the website, only a part of it gets saved. This coupled with the fact that there are functions and <script> tags leads me to believe that javascript is the culprit (I'm honestly not 100% sure, as I'm pretty new at this).
I need to pull this link periodically, and it changes each time.
The question:
Is there a way for me, in bash, to run this javascript and save the new HTML code it generates to a file?
Not trivially.
Typically, for that approach, you need to:
Construct a DOM from the HTML
Execute the JavaScript in the context of that DOM while resolving URLs relative to the URL you fetched the HTML from
There are tools which can help with this, such as Puppeteer, PhantomJS, and Selenium, but they generally lend themselves to being driven with beefier programming languages than bash.
As an alternative, you can look at reverse engineering the page. It gets the data from somewhere. You can probably work out the URLs (the Network tab of a browser's developer tools is helpful there) and access them directly.
If you want to download a web page that generates itself with JavaScript, you'll need to execute this JavaScript in order to load the page. To achieve this you can use libraries that do this like puppeteer with NodeJS. There's a lot of other libraries, but that's the most popular.
If you're wondering why does this happens, it's because web developers often use frameworks like React, Vue or Angular to quote the most popular ones which only generates a JavaScript output that's not executed by common HTTP requesting libraries.

How can I convert web page with javascript to plain html?

I want to convert some web pages with javascript to plain html, and I found there several ways(pls tell me if I'm wrong):
Use Jython, an example: http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/
Use Java together with htmlunit
Use a proxy, an example: http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/
Use python together with qt or PyV8
Because I want to make a tiny tool to meet my request, and I thought it somewhat complicated to install V8 and qt, although python is my first choice.
So I tried to make a proxy with gecko, but it seems need a DISPLAY which I can not afford in a remote Linux server.
Now I am trying to use jython, but it seems there is no simple way to just convert a whole page to plain html.
Actually, I want to ask is there a way to convert a web page contains javascript to plain html, just like the brower does. Can node.js do this job?
I've recently built a server on top of PhantomJS that does this. I highly recommend this route.
http://phantomjs.org/
Basically, you write a quick script that has PhantomJS run the page, and configure a trigger method that lets you know the page is finished and sends the data off. My version used the built-in HTTP server, so PhantomJS easily served up the results on its own. This takes about 15 lines of code to do. (Sorry, can't paste it here... wrote it on work time. But, check out the example on their home page. It's almost complete!)

Scraping a dynamically loaded, javascript lidded website using Pythons BeautifulSoup [duplicate]

I need to scrape a site with python. I obtain the source html code with the urlib module, but I need to scrape also some html code that is generated by a javascript function (which is included in the html source). What this functions does "in" the site is that when you press a button it outputs some html code. How can I "press" this button with python code? Can scrapy help me? I captured the POST request with firebug but when I try to pass it on the url I get a 403 error. Any suggestions?
In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.
You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
Since there is no comprehensive answer here, I'll go ahead and write one.
To scrape off JS rendered pages, we will need a browser that has a JavaScript engine (e.i, support JavaScript rendering)
Options like Mechanize, url2lib will not work since they DO NOT support JavaScript.
So here's what you do:
Setup PhantomJS to run with Selenium. After installing the dependencies for both of them (refer this), you can use the following code as an example to fetch the fully rendered website.
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
soupFromJokesCC = BeautifulSoup(driver.page_source) #page_source fetches page after rendering is complete
driver.save_screenshot('screen.png') # save a screenshot to disk
driver.quit()
I have had to do this before (in .NET) and you are basically going to have to host a browser, get it to click the button, and then interrogate the DOM (document object model) of the browser to get at the generated HTML.
This is definitely one of the downsides to web apps moving towards an Ajax/Javascript approach to generating HTML client-side.
I use webkit, which is the browser renderer behind Chrome and Safari. There are Python bindings to webkit through Qt. And here is a full example to execute JavaScript and extract the final HTML.
For Scrapy (great python scraping framework) there is scrapyjs: an additional downloader handler / middleware handler able to scraping javascript generated content.
It's based on webkit engine by pygtk, python-webkit, and python-jswebkit and it's quite simple.

Interacting with webpage from C++ Application

i have a sitation where i want to access HTML DOM object from within my application to update certain parts of web page through javascript commands at run time.
It is a local webpage opened in FireFox which would be accessed by my application, so that the final output is always shown at the webpage which is updated by appliation.
It would be great if you could give me some idea about how this can be accomplished.
I have similar requirement like the webmonkey extension of firefox but need to do it outside of browser from my application.
You can try QtWebKit from the Qt framework, it provides an OO set of classes to interact with webpages from basic actions to very complicated and advanced stuff. I believe you may find your answer there, a link is provided below...
Good Luck
see Here

How to simulate JavaScript in a client C# Applications

I'm writing a web crawler (web spider) that crawl all links in a website.
My application is a Win32 App, written in C# with .Net framework 3.5.
Now I'm using HttpWebRequest an HttpWebResponse to communicate with the web server.
I also built my own Http Parser that can parse anything I want.
I found all link like "href", "src", "action"... in the parse.
But I can not solve one problem: Simulate Client Script in the page (like JS and VBS)
For example, if a link like:
a href = "javascript:buildLink(1)"
... with buildLink(parameter) is a Javascript function that will make a custom link due to the parameter.
Please help me to solve this problem. How to simulate JavaScript in this app? I can parse the HTML source code and take all JavaScript code to another file, but how to simulate a function of it?
Thanks.
Your only real option is to automate a browser. As other answers have said, you cannot reliably simulate browser javascript without having a complete DOM.
There are fortunately ways to automate the browser, check out Selenium.
It has a C# API, so you can control the browser from C#.
Use your .NET web crawler code to crawl the site. Whenever you encounter a href="javascript:... link, handle the page containing the link in Selenium:
Use the Selenium API to tell the browser to load the page.
Use the Selenium API to find all links on the page.
This way, your spider only uses Selenium when necessary (pages without javascript links can be handled by the browser-less spider code you already got). And since this is an embarrassingly parallel workload, you could easily have multiple Selenium processes running at the same time (either on one computer or on other computers).
But remember that href="javascript is hardly the only way a page can have dynamic links. The more common case is probably that a onload or $(document).ready() script manipulates the DOM and adds links that way.
To catch that case (and others), the spider probably will have to use Selenium for all pages that have a <script> tag.
You are basically pretending to be a browser, except that HttpWebRequest only does the networking stuff for you.
I would recommend using the ie web browser control and interop'ing into that from your c# application. That will allow you to run JavaScript, set variables, post, etc etc.
Here's some basic links I found after a search for "ie web browser control":
http://www.c-sharpcorner.com/UploadFile/mahesh/WebBrowserInCSMDB12022005001524AM/WebBrowserInCSMDB.aspx
http://support.microsoft.com/kb/313068
This is a problem which is not easily solved. You could consider taking one of the existing JavaScript implementations and porting or interfacing with it somehow.
If I were tackling this problem, I'd probably build a small side application in Java on top of Rhino, with some sort of RPC framework layered on top of that so that I could communicate with it from my primary application.
Unfortunately, without having a complete DOM implementation on top of that, you would be limited to only very simple javascript.
You could execute the javascript by using the MS JScript engine or something similar.
MSDN Reference
Eric Lippert's blog on using Eval (part 1) (part 2) (part 3)
This isn't guaranteed to work, especially if the javascript tries to access the DOM, or somesuch... But for simple scripts, it might be enough.

Categories