I have had a lot of trouble trying to find information or possible examples of this being done.
I would like to render HTML in a window, take the JS from the HTML, and output that to Python code.
The HTML is local and there will never be an internet connection for it to run off. Every time I search for possible answers, everyone seems to point back to using some small lightweight browser, which in my case isn't an option.
For some more detail, I am running Selenium WebDriver (Python) and Iceweasel (Raspberry Pi B+) to get the value of an element from an HTML page. Using a different browser isn't possible, as the lightweight ones are not compatible with Selenium. Selenium and Iceweasel take in excess of 2 minutes to fully load, which is far too long for what I need.
I had a look at Awesomium, but I think it lacks compatibility with the Raspberry Pi.
My other thought was to use OpenGL to render the HTML, but I found no easily explained examples.
I am currently looking into LibRocket, Berkelium and QWebView, but again I don't think they will have what I need with the compatibility I need.
EDIT:
Basically I want a canvas capable of rendering HTML to a screen using X11. The HTML will contain buttons, and I want those buttons to perform actions inside a Python script.
The way i see it, a browser is basically a toolbar, a canvas and a lot of networking. I want to strip away as much of that as possible and just remain with the canvas.
First go to the directory that has the local webpage. Then run python -m SimpleHTTPServer 8000. This serves the page at http://localhost:8000, so you can open it in a browser window to "render the html". Then view source and paste the JavaScript into a Python file. Alternatively, if you would like to automate piping the JavaScript into an output file, you can use Beautiful Soup to select the JavaScript and write it to any file you want. Then manipulate it in Python however you want.
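The extraction step can be sketched with only the standard library. This is a rough sketch, not a definitive implementation: it swaps Beautiful Soup for Python 3's built-in html.parser, and the names ScriptCollector, extract_inline_scripts and SAMPLE_PAGE are made up for illustration. (On Python 3 the server command becomes python -m http.server 8000.)

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text of every inline <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.scripts = []        # one string per <script> element seen
        self._in_script = False  # True while the parser is inside a <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


def extract_inline_scripts(html):
    """Return the non-empty inline script bodies found in an HTML string."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    # Entries for <script src=...></script> are empty and get dropped here.
    return [s.strip() for s in parser.scripts if s.strip()]


# Made-up sample page standing in for the local HTML file.
SAMPLE_PAGE = """<html><body>
<button onclick="ping()">Ping</button>
<script>function ping() { return 42; }</script>
</body></html>"""

inline_scripts = extract_inline_scripts(SAMPLE_PAGE)
```

To use it on a real file, read the file's contents into a string, pass that to extract_inline_scripts, and write the returned strings wherever you want.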
Related
I've searched high and low, but all I can find is questions (and answers) about scraping content that is dynamically generated by Javascript.
I'm putting together a simple tool to audit client websites by finding text in the HTML source and comparing it to a dictionary.
For example, "ga.js" = Google Analytics.
However, I'm noticing that comparable tools are picking up scripts that mine is not... because they don't actually appear in the HTML source. I can only see them through Chrome's Developer Tools:
Here's a capture from Chrome, since I can't post the image...
Those scripts, such as the "reflektion_b.js", are nowhere to be found in the HTML source.
My script, as it stands now, uses urllib2 (urlopen) to fetch and then BeautifulSoup to parse. Can anyone help me out with getting the list of script sources? Or maybe even being able to read them as well (not 100% necessary, but could come in handy)?
Any help would be much appreciated.
You need a headless browser with a Python API. Ghost will probably do what you want.
http://jeanphix.me/Ghost.py/
"Content that is dynamically generated by Javascript" implies that the JavaScript in question is interpreted, which requires a JavaScript interpreter.
You probably need a web view instance with a mechanism for intercepting requests, so you can figure out which JavaScript files the page loads.
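To make the limitation concrete, here is a rough sketch of the static half of such an audit, using only Python 3's standard library (html.parser standing in for BeautifulSoup). Only the "ga.js" entry comes from the question; ScriptSrcCollector, audit and the sample markup are hypothetical. A parse like this only sees script tags present in the served HTML, so anything injected at runtime never shows up, which is why a JavaScript-capable browser is needed for the rest.

```python
from html.parser import HTMLParser

# Signature dictionary: substring of the script URL -> tool name.
# Only the "ga.js" entry is taken from the question.
SIGNATURES = {"ga.js": "Google Analytics"}


class ScriptSrcCollector(HTMLParser):
    """Records the src attribute of every <script> tag in the static HTML."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def audit(html, signatures=SIGNATURES):
    """Return (script sources, sorted names of the tools they match)."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    collector.close()
    found = set()
    for src in collector.sources:
        for needle, name in signatures.items():
            if needle in src:
                found.add(name)
    return collector.sources, sorted(found)


# Made-up page source for illustration.
SAMPLE = (
    '<html><head>'
    '<script src="http://www.google-analytics.com/ga.js"></script>'
    '</head><body></body></html>'
)
sources, tools = audit(SAMPLE)
```

With a headless browser, the same audit function could be fed the rendered page source instead of the raw response, which would then include script tags added by other scripts.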
I am creating a Ruby on Rails app. A specific page in my app is divided into several sections by <div> tags. Each <div> includes a combination of text (using different fonts), symbols and mathematical formulas. I use MathJax and a few other scripts to display them correctly, and everything works great on my computer. However, JavaScript is not enabled in everyone's browser, and some scripts might not load correctly in other people's browsers.
One solution I was thinking of is this: after all the scripts are done processing and the page is displayed correctly on my computer (server), I use some code to generate a PNG snapshot of each <div> and send them to the server (for example, I click a <button> on the page to activate this code once I'm happy that what is displayed is correct). Then I'll save these images in the database and serve them, so they look the same on everyone's computer regardless of whether JavaScript is enabled, what browser they're using, etc.
Is anyone aware of code or a command that I can use? Please note that currently, after the page is loaded, JavaScript processes the HTML content and produces the correct display. Also, I don't want a snapshot of the whole page, only a snapshot of each <div> separately.
Thanks a lot.
Well, this is a client-side problem. Here is a script that will work for you: http://experiments.hertzen.com/jsfeedback/
You've got a bit of a problem there. Javascript is not executed until the page has finished loading, i.e. all of the information has already been sent to the client. You're not executing javascript at the server level, so you wouldn't be able to do that kind of processing at all. If they have javascript disabled, your code will never get executed.
You could generate the images using ImageMagick or something similar; I know PHP has bindings for that. There are a couple of extremely messy solutions, like rendering the page in a browser on the server side with something like Selenium, but I definitely wouldn't recommend doing that. Overall, it depends on the platform on which you're developing, but most major languages have support for generating images that doesn't rely on JavaScript.
I need to scrape a site with Python. I obtain the source HTML with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). What this function does on the site is output some HTML when you press a button. How can I "press" this button with Python code? Can Scrapy help me? I captured the POST request with Firebug, but when I try to pass it in the URL I get a 403 error. Any suggestions?
In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.
You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
Since there is no comprehensive answer here, I'll go ahead and write one.
To scrape JS-rendered pages, we need a browser that has a JavaScript engine (i.e., one that supports JavaScript rendering).
Options like Mechanize and urllib2 will not work since they DO NOT support JavaScript.
So here's what you do:
Set up PhantomJS to run with Selenium. After installing the dependencies for both of them (refer to this), you can use the following code as an example to fetch the fully rendered website.
from bs4 import BeautifulSoup
from selenium import webdriver

# Note: webdriver.PhantomJS was removed in Selenium 4, so this needs an older Selenium release.
driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
# page_source returns the page after rendering is complete
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser')
driver.save_screenshot('screen.png')  # save a screenshot to disk
driver.quit()
I have had to do this before (in .NET) and you are basically going to have to host a browser, get it to click the button, and then interrogate the DOM (document object model) of the browser to get at the generated HTML.
This is definitely one of the downsides to web apps moving towards an Ajax/Javascript approach to generating HTML client-side.
I use webkit, which is the browser renderer behind Chrome and Safari. There are Python bindings to webkit through Qt. And here is a full example to execute JavaScript and extract the final HTML.
For Scrapy (a great Python scraping framework) there is scrapyjs: an additional downloader handler / middleware able to scrape JavaScript-generated content.
It's based on the WebKit engine via pygtk, python-webkit, and python-jswebkit, and it's quite simple.
So I'm working on a just for fun project to get practice using HTML/CSS/Javascript.
I'm using Aptana to write all my code, and it is currently set up to run in a browser (obviously). It's a text adventure game.
It would be really cool though to be able to compile the code into an executable file that runs in its own window, not in a browser.
Is this something relatively easy to accomplish?
Thanks in advance for any help! :)
FF and Chrome provide a function to run a custom website in an app mode. That means no menubars, no addressbar and a complete window for the website. Maybe this is already what you are looking for.
http://www.rarst.net/software/dedicated-web-app-window/
https://superuser.com/questions/33548/starting-google-chrome-in-application-mode
https://superuser.com/questions/171235/does-internet-explorer-have-something-equivalent-to-chromes-app-mode
But if you are interested in compiled code for speeding up your game, this is not the way to achieve this.
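If you want to script the app-mode launch for Chrome, something like the following might do. It is only a sketch under assumptions: Chrome's --app=URL switch is the one discussed in the superuser link above, but the executable name "google-chrome" and the file name "game.html" are placeholders that will differ per system and project.

```python
import os
import subprocess
from pathlib import Path


def app_mode_command(page, browser="google-chrome"):
    """Build the command line that opens a local page in Chrome's app mode.

    `browser` is an assumed executable name; adjust it for your system.
    """
    # Convert the (possibly relative) path into a file:// URL.
    url = Path(os.path.abspath(page)).as_uri()
    return [browser, "--app=" + url]


def launch(page, browser="google-chrome"):
    """Start the browser without waiting for it to exit."""
    return subprocess.Popen(app_mode_command(page, browser))


# "game.html" is a placeholder file name.
command = app_mode_command("game.html")
```

Calling launch("game.html") would then open the page in a window without menu bars or an address bar, provided the browser is installed under that name.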
For Windows, see http://www.autoitscript.com/autoit3/docs/libfunctions/_IECreateEmbedded.htm
AutoIt is a scripting language for basically everything (with automation). SciTE is the editor to go with.
In the example of the _IECreateEmbedded function, just change:
_IENavigate($oIE, "http://www.autoitscript.com")
to
_IENavigate($oIE, "file://.../thegame.html")
It's very simple: you just have to copy-paste it and build it. You can even build it online: AutoIt Online Compiler
There are many different ways you can achieve this.
If you're only targeting Windows machines, then creating an HTA would be the simplest approach.
The modification to the structure of your existing code would be minimal; it's essentially changing the file type and adding a couple of extra tags. If you wanted a single file, instead of an exe plus any resources (images etc.) that you use, you would have to base64-encode your images and insert external scripts into the main page.
For information about embedding images and icons into an HTA: http://www.john-am.com/2010/07/building-a-self-contained-hta-with-embedded-images-and-icons/
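The base64 step can be sketched in Python with only the standard library. This is just an illustration of turning image bytes into a data URI for an <img> tag; the helper names and the file name "logo.gif" are made up, and I have not verified how well an HTA handles data URIs, so treat the linked article as the authority on that part.

```python
import base64
import mimetypes


def to_data_uri(data, filename):
    """Encode raw bytes as a data: URI, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"  # fallback for unknown extensions
    encoded = base64.b64encode(data).decode("ascii")
    return "data:{};base64,{}".format(mime, encoded)


def img_tag(data, filename):
    """Return an <img> tag with the image embedded inline (hypothetical helper)."""
    return '<img src="{}">'.format(to_data_uri(data, filename))


# Six bytes standing in for a real image file's contents.
example = to_data_uri(b"GIF89a", "logo.gif")
```

For a real image, read the file in binary mode and pass its bytes along with the file name.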
You could also use AppJS, node-webkit or similar projects, but they would add around 30MB of stuff that's not being used.
I am writing a spider with Scrapy, but I have come across some websites that are rendered with JS, so urllib2.urlopen does not work. I have found that I can open the browser with webbrowser.open_new(url), but I did not find a way to get the page's source code with webbrowser. Is there any way to do this with webbrowser, or are there other solutions without webbrowser for dealing with JS sites?
You can use one of the scrapers built on the WebKit engine.
One of them is dryscrape.
Example:
import dryscrape
search_term = 'dryscrape'
# set up a web scraping session
sess = dryscrape.Session(base_url = 'http://google.com')
# we don't need images
sess.set_attribute('auto_load_images', False)
# visit homepage and search for a term
sess.visit('/')
q = sess.at_xpath('//*[@name="q"]')
q.set(search_term)
q.form().submit()
# extract all links
for link in sess.xpath('//a[@href]'):
    print(link['href'])
# save a screenshot of the web page
sess.render('google.png')
print("Screenshot written to 'google.png'")
See more info at:
https://github.com/niklasb/dryscrape
https://dryscrape.readthedocs.org/en/latest/index.html
If you need a full JS engine, there are a number of ways you can drive WebKit from Python. Until recently, this sort of thing was done with Selenium, which drives an entire browser.
More recently there are newer and simpler ways to run a WebKit engine (which includes a JavaScript engine) from Python. See this SO question:
Headless Browser for Python (Javascript support REQUIRED!)
It references this blog post as an example: Scraping Javascript Webpages with Webkit. It looks to do more or less just what you need.
I've been trying to find an answer to the same problem for a few days now.
I suggest you try the Qt framework with WebKit.
There are two python bindings. One is PyQt and the other one is PySide. You can use them directly if you want to create something more complex or you want to have 100% control over your code.
For trivial stuff like executing JavaScript in a browser environment you can use Ghost.py. It has some sort of documentation and some problems when using it from the command line but otherwise it's just great.
If you need to process JavaScript, you'll need to embed a JavaScript engine, which makes your spider much more complex, mainly because JavaScript almost always modifies the DOM based on time or on an action taken by the user. This makes it extremely challenging to process JS in a crawler.
If you really need to process JavaScript in your spider you can have a look at the JavaScript engine by Mozilla: https://developer.mozilla.org/en/docs/SpiderMonkey