For the last two weeks I have been searching for an answer without any success.
My scenario: I am using Eclipse to develop Android apps. I want to display route directions (driving, walking, bicycling) between two dynamically entered addresses on Google Maps. I want to use the Google Maps JavaScript API V3 services because of all their functionality:
http://code.google.com/apis/maps/documentation/javascript/services.html#Directions
Some developers on Stack Overflow suggested this link: http://code.google.com/intl/en/apis/maps/articles/android_v3.html#why However, that page uses JavaScript in its code. If the page is correct, where am I supposed to write JavaScript in my Eclipse Android app? AFAIK, code written in Eclipse uses only the Java framework. If that page is not a good bet, please point me to other links that demonstrate this with examples.
I am not sure I follow you. On the link provided earlier on SO, I see the Java code:
http://code.google.com/intl/en/apis/maps/articles/android_v3.html#why
That actually uses the Android SDK support for Google Maps. If you still want to use JavaScript, you will have to go through a WebView. Otherwise, I would recommend the approach described here:
http://code.google.com/intl/en/android/add-ons/google-apis/maps-overview.html
This page is linked from the article mentioned above, and it uses Java, not JavaScript.
You do not need to worry about JavaScript. The MAP_URL (http://gmaps-samples.googlecode.com/svn/trunk/articles-android-webmap/simple-android-map.html) in WebMapActivity already contains this JavaScript in the page. So you can either write your own HTML page with that JavaScript (hosted on your server), or simply load the MAP_URL into your WebView without touching the JavaScript yourself.
Related
We have a web app whose content is generated by JavaScript. Can Google index those pages?
When we investigate this issue, we always find solutions on old pages about using "#!" in links.
In our app the links are like this:
domain.com/paris
domain.com/london
When we use these kinds of links, JavaScript populates the content.
Is it wise to use HTML snapshots, or do you have any other suggestions?
Short answer
Yes, they can crawl JavaScript-generated content, as long as you are using pushState.
Detailed answer
It depends on your setup. Google and Bing CAN crawl JavaScript and AJAX-based content if you are using pushState. If you are, they will handle content coming from AJAX calls, updates to the page title or meta tags made with JavaScript, and similar changes in general.
Most frontend frameworks, like Angular, Ember or Backbone, already work with pushState, so in those cases you don't need to do anything. Check how whatever system you are using handles this. If you are not using pushState, you will need to implement it on your own or use the whole _escaped_fragment_ HTML snapshot approach.
So if you use pushState, then yes, search engines can crawl your pages just fine. If you don't, then no: you will need to implement pushState or provide HTML snapshots.
Bonus info: unfortunately, Facebook does not handle pushState, so the Facebook crawler needs either non-dynamic og tags or HTML snapshots.
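For reference, the "#!" snapshot approach works by having the crawler request a rewritten URL instead of the one with the fragment. Below is a minimal Python sketch of that rewrite, assuming the (since deprecated) AJAX crawling scheme and its _escaped_fragment_ query parameter. The function name is hypothetical, and the escaping is a simplification that percent-encodes more characters than the scheme strictly requires.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escaped_fragment_url(url):
    """Return the URL a crawler would request for a '#!' (hashbang) URL.

    Hypothetical helper: URLs without a '#!' fragment are returned unchanged.
    """
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # plain or pushState-style URL, nothing to rewrite

    # Drop the leading '!' and percent-encode the rest of the fragment.
    state = quote(parts.fragment[1:], safe="/=")

    # Append to any existing query string, otherwise start a new one.
    prefix = parts.query + "&" if parts.query else ""
    new_query = prefix + "_escaped_fragment_=" + state

    return urlunsplit((parts.scheme, parts.netloc, parts.path, new_query, ""))


if __name__ == "__main__":
    print(escaped_fragment_url("http://example.com/#!key=value"))
```

A server supporting the scheme would detect that parameter and answer with the pre-rendered HTML snapshot. Clean URLs such as domain.com/paris pass through untouched, which is why the scheme only applies to "#!" links.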
"Generated by JavaScript" is ambiguous. That could mean that you are running a JS script on the server or it could mean that you are making an AJAX call with a JS API. The difference appears to matter as far as Googlebot is concerned. But you don't have to take my word for it, as there is empirical proof of what Googlebot will and won't currently cache as far as JavaScript content in the form of live experiments using both the XMLHTTPRequest API and the Fetch API. So, as you can see, server-side rendering is still going to be the best way to go for SEO.
I am writing a spider with Scrapy. However, I have come across some websites that are rendered with JS, so urllib2.urlopen does not work. I found that I can open the browser with webbrowser.open_new(url), but I could not find a way to get the page source with webbrowser. Is there any way to do this with webbrowser, or are there other solutions, without webbrowser, for dealing with JS sites?
You can use one of the scrapers built on a WebKit engine.
One of them is dryscrape.
Example:
import dryscrape
search_term = 'dryscrape'
# set up a web scraping session
sess = dryscrape.Session(base_url = 'http://google.com')
# we don't need images
sess.set_attribute('auto_load_images', False)
# visit homepage and search for a term
sess.visit('/')
q = sess.at_xpath('//*[@name="q"]')
q.set(search_term)
q.form().submit()
# extract all links
for link in sess.xpath('//a[@href]'):
    print(link['href'])
# save a screenshot of the web page
sess.render('google.png')
print("Screenshot written to 'google.png'")
See more info at:
https://github.com/niklasb/dryscrape
https://dryscrape.readthedocs.org/en/latest/index.html
If you need a full JS engine, there are a number of ways you can drive WebKit from Python. Until recently, this sort of thing was done with Selenium, which drives an entire browser.
More recently, simpler ways have appeared to run a WebKit engine (which bundles a JavaScript engine) from Python. See this SO question:
Headless Browser for Python (Javascript support REQUIRED!)
It references the blog post "Scraping Javascript Webpages with Webkit" as an example. It looks to do more or less just what you need.
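As a rough illustration of the Selenium route, here is a minimal sketch that loads a page in a real browser and reads back the DOM after its scripts have run. It assumes the selenium package plus Firefox and geckodriver are installed. The helper names are made up for this example, and the link extraction uses only the standard library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return the hrefs found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def fetch_rendered_html(url):
    """Load url in a browser and return the page source once it has loaded.

    Content added by late AJAX calls may need an explicit wait before
    reading page_source.
    """
    # Imported lazily so extract_links works without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox + geckodriver available
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```

Usage would be along the lines of extract_links(fetch_rendered_html(url)), which a Scrapy spider could call only for the pages that its normal downloader cannot handle.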
I have been trying to find an answer to the same problem for a few days now.
I suggest you try the Qt framework with WebKit.
There are two Python bindings: PyQt and PySide. You can use them directly if you want to create something more complex or want 100% control over your code.
For trivial stuff like executing JavaScript in a browser environment, you can use Ghost.py. Its documentation is limited and it has some problems when used from the command line, but otherwise it's just great.
If you need to process JavaScript, you'll need to integrate a JavaScript engine. This makes your spider much more complex, mainly because JavaScript almost always modifies the DOM based on time or on an action taken by the user. That makes it extremely challenging to process JS in a crawler.
If you really need to process JavaScript in your spider you can have a look at the JavaScript engine by Mozilla: https://developer.mozilla.org/en/docs/SpiderMonkey
I have a situation where I want to access the HTML DOM from within my application, in order to update certain parts of a web page through JavaScript commands at run time.
It is a local web page opened in Firefox that my application would access, so that the final output is always shown on the web page as updated by the application.
It would be great if you could give me some idea of how this can be accomplished.
My requirement is similar to the webmonkey extension for Firefox, but I need to do it from my application, outside the browser.
You can try QtWebKit from the Qt framework. It provides an object-oriented set of classes for interacting with web pages, from basic actions to very complicated and advanced stuff. I believe you may find your answer there; a link is provided below.
Good Luck
see Here
This is kind of tricky. There is a web page which, I am guessing, uses some kind of AJAX to pull in content based on the search query. When I fetch the page using get in Perl, I receive the script code behind the PHP/HTML, but not the results that are displayed when the query is searched manually. I need to be able to fetch the content of the results page. Is there any way to do this in Perl?
Take a look at Selenium RC and the WWW::Selenium module in Perl. With them you can control a real web browser.
Another option is WWW::HtmlUnit which uses the HtmlUnit Java library to execute the JavaScript without a web browser. WWW::HtmlUnit uses Inline::Java to give Perl access to the library. I have found that when installing, it is best to say No to the question "Do you wish to build the JNI extension?".
If you are writing tests that need to check the rendered page, you can have a look at Schwern's javascript-tap-harness, which works with Selenium and handles all the scaffolding.
I also found Using WWW::Selenium To Test Or Automate An Ajax Website pretty useful.
I'm writing a web crawler (web spider) that crawls all links in a website.
My application is a Win32 app, written in C# with .NET Framework 3.5.
Right now I'm using HttpWebRequest and HttpWebResponse to communicate with the web server.
I also built my own HTTP parser that can parse anything I want.
The parser finds all links, such as "href", "src", "action"...
But there is one problem I cannot solve: simulating client script in the page (like JS and VBS).
For example, if a link like:
a href = "javascript:buildLink(1)"
... where buildLink(parameter) is a JavaScript function that builds a custom link based on the parameter.
Please help me solve this problem. How can I simulate JavaScript in this app? I can parse the HTML source and extract all the JavaScript code to another file, but how do I simulate one of its functions?
Thanks.
Your only real option is to automate a browser. As other answers have said, you cannot reliably simulate browser JavaScript without having a complete DOM.
Fortunately, there are ways to automate the browser: check out Selenium.
It has a C# API, so you can control the browser from C#.
Use your .NET web crawler code to crawl the site. Whenever you encounter a href="javascript:... link, handle the page containing the link in Selenium:
Use the Selenium API to tell the browser to load the page.
Use the Selenium API to find all links on the page.
This way, your spider only uses Selenium when necessary (pages without JavaScript links can be handled by the browser-less spider code you already have). And since this is an embarrassingly parallel workload, you could easily have multiple Selenium processes running at the same time (either on one computer or on several).
But remember that href="javascript is hardly the only way a page can have dynamic links. The more common case is probably that an onload or $(document).ready() script manipulates the DOM and adds links that way.
To catch that case (and others), the spider will probably have to use Selenium for all pages that have a <script> tag.
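The "only use Selenium when necessary" decision can be sketched as a small check on the raw HTML. The example below is in Python for brevity and is only an illustration under that assumption; the class and function names are hypothetical, and the same logic would carry over to the existing C# parser.

```python
from html.parser import HTMLParser


class ScriptDetector(HTMLParser):
    """Flag pages that contain a <script> tag or a javascript: link."""

    def __init__(self):
        super().__init__()
        self.has_script = False
        self.has_js_href = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.has_script = True
        # Attribute values can be None (e.g. a bare "href"), hence the "or".
        href = dict(attrs).get("href") or ""
        if href.strip().lower().startswith("javascript:"):
            self.has_js_href = True


def needs_browser(html):
    """Return True if the page should be handed to a real browser."""
    detector = ScriptDetector()
    detector.feed(html)
    detector.close()
    return detector.has_script or detector.has_js_href


if __name__ == "__main__":
    print(needs_browser('<a href="javascript:buildLink(1)">next</a>'))
```

Pages for which this returns False can stay on the fast browser-less path. Everything else goes onto a queue for the Selenium workers.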
You are basically pretending to be a browser, except that HttpWebRequest only does the networking stuff for you.
I would recommend using the IE WebBrowser control and interop'ing into it from your C# application. That will allow you to run JavaScript, set variables, post, etc.
Here are some basic links I found after a search for "ie web browser control":
http://www.c-sharpcorner.com/UploadFile/mahesh/WebBrowserInCSMDB12022005001524AM/WebBrowserInCSMDB.aspx
http://support.microsoft.com/kb/313068
This is a problem which is not easily solved. You could consider taking one of the existing JavaScript implementations and porting or interfacing with it somehow.
If I were tackling this problem, I'd probably build a small side application in Java on top of Rhino, with some sort of RPC framework layered on top of that so that I could communicate with it from my primary application.
Unfortunately, without a complete DOM implementation on top of that, you would be limited to only very simple JavaScript.
You could execute the JavaScript by using the MS JScript engine or something similar.
MSDN Reference
Eric Lippert's blog on using Eval (part 1) (part 2) (part 3)
This isn't guaranteed to work, especially if the javascript tries to access the DOM, or somesuch... But for simple scripts, it might be enough.