Screen scrape a web page that uses JavaScript and frames - javascript

I want to scrape data from www.marktplaats.nl. I want to analyze the scraped description, price, date and views in Excel/Access.
I tried to scrape the data with Ruby (nokogiri, scrapi), but nothing worked, although the same approach worked well on other sites. The main problem is that tools such as SelectorGadget and the Firebug add-on (Firefox) don't find any CSS selectors I can use to scrape the page. On other sites I can extract the CSS selectors with SelectorGadget or Firebug and use them with nokogiri or scrapi.
Because I lack experience, it is hard for me to identify the problem, which makes searching for a solution difficult.
Can you tell me where to start solving this problem, and where I might find more information about a similar scraping process?
Thanks in advance!

I used an Excel web query and it works perfectly. You can find a lot about scraping with Excel on YouTube if you search for mrexcel.
Thanks, Mello

You can try IRobotSoft web scraper. It has good frame support and is free.

Iframes aren't a problem - just access the embedded iframe URL directly. You will find that it redirects in the browser unless you disable JavaScript.
The description and date can be extracted straight from the HTML source. However, the prices are images, which makes scraping them more cumbersome.
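To make the "access the embedded iframe URL directly" step concrete, here is a minimal Python sketch that pulls iframe addresses out of a page's HTML so each one can be requested on its own. It uses only the standard library; the sample markup and the example.com address are made up for illustration and are not taken from marktplaats.nl.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeCollector(HTMLParser):
    """Collects the src attribute of every <iframe> in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(html, base_url):
    """Return the absolute URLs of all iframes found in html."""
    collector = IframeCollector()
    collector.feed(html)
    # Relative src values are resolved against the URL of the outer page.
    return [urljoin(base_url, src) for src in collector.sources]


# Hypothetical markup standing in for the outer page.
sample = '<html><body><iframe src="/ads/detail.html?id=42"></iframe></body></html>'
print(iframe_urls(sample, "http://www.example.com/listing"))
```

Each returned URL can then be fetched and parsed like any other page. Because a plain HTTP client does not run JavaScript, the redirect mentioned above should not get in the way.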

Related

Can Googlebot crawl javascript generated content?

We have a web app whose content is generated by JavaScript. Can Google index those pages?
When we investigated this issue, we only found solutions on old pages about using "#!" in links.
In our app the links are like this:
domain.com/paris
domain.com/london
When we use these kinds of links, JavaScript populates the content.
Is it wise to use HTML snapshots, or do you have any other suggestions?
Short answer
Yes, they can crawl JavaScript-generated content, as long as you are using pushState.
Detailed answer
It depends on your setup. Google and Bing can crawl JavaScript and AJAX-based content if you are using pushState. If you do, they will handle content coming from AJAX calls, updates to the page title or meta tags made with JavaScript, and similar changes.
Most frontend frameworks, such as Angular, Ember or Backbone, already work with pushState, so in those cases you don't need to do anything. Check whatever system you are using to see how it handles this. If you are not using pushState, you will need to implement it on your own or use the whole _escaped_fragment_ HTML snapshot deal.
So if you use pushState, then yes, search engines can crawl your page just fine. If you don't, then no: you will need to implement pushState or provide HTML snapshots.
Bonus info: unfortunately, Facebook does not handle pushState, so the Facebook crawler needs either non-dynamic og tags or HTML snapshots.
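For readers unfamiliar with the "_escaped_fragment_ HTML snapshot deal", this rough Python sketch shows the URL rewriting that the older "#!" scheme relies on: the crawler turns the part after "#!" into an _escaped_fragment_ query parameter, and the server answers that request with a pre-rendered snapshot. The function name is made up for this example, and the escaping is simplified (it percent-encodes more characters than strictly required).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escaped_fragment_url(url):
    """Map a '#!' URL to the '_escaped_fragment_' URL a crawler would request."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # not a hashbang URL, nothing to rewrite
    # Percent-encode the fragment value; '=' is left readable (simplified escaping).
    token = "_escaped_fragment_=" + quote(parts.fragment[1:], safe="=")
    query = parts.query + "&" + token if parts.query else token
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(escaped_fragment_url("http://domain.com/#!paris"))
print(escaped_fragment_url("http://domain.com/app?lang=en#!city=london&tab=2"))
```

Links such as domain.com/paris contain no "#!" and pass through unchanged, which is why that style of URL depends on pushState or on server-side snapshots.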
"Generated by JavaScript" is ambiguous. It could mean that you are running a JS script on the server, or it could mean that you are making an AJAX call with a JS API. The difference appears to matter as far as Googlebot is concerned. But you don't have to take my word for it: there is empirical evidence of what Googlebot will and won't currently cache, in the form of live experiments using both the XMLHttpRequest API and the Fetch API. So server-side rendering is still going to be the best way to go for SEO.

Load and parse URL via JS in the background

Currently I am trying to develop a little Firefox extension.
In detail: I want to show users of the site dota2lounge.com the current price of their Steam items on the Steam Community Market. My idea was to do this via a Firefox extension that reads the item names from the HTML code on dota2lounge.com. Via JS I would like to search the Steam Community Market for the item names and parse the current price. This should happen without any further action from the user and without opening extra tabs or windows.
In Java I would just load the site into a variable and work with it. How could I do this with JS (or jQuery)? Or maybe there is an even better way in the Add-on SDK from Firefox that could solve this.
Any thoughts and hints are welcome.
This should be pretty simple to do using the Add-on SDK. Here is a list of modules you should look at:
The request module will allow you to make requests to other sites: https://developer.mozilla.org/en-US/Add-ons/SDK/High-Level_APIs/request
While the request module is fine, what you may want to do instead to get info from the Steam site is use the page-worker module to load the site and extract info from it with jQuery. This is much nicer than using regex. The code would look something like this gist:
https://gist.github.com/canuckistani/6c299c812bbe582d9efb

Scrape Javascript Files with Python

I've searched high and low, but all I can find are questions (and answers) about scraping content that is dynamically generated by JavaScript.
I'm putting together a simple tool to audit client websites by finding text in the HTML source and comparing it to a dictionary.
For example, "ga.js" = Google Analytics.
However, I'm noticing that comparable tools are picking up scripts that mine is not... because they don't actually appear in the HTML source. I can only see them through Chrome's Developer Tools:
Here's a capture from Chrome, since I can't post the image...
Those scripts, such as the "reflektion_b.js", are nowhere to be found in the HTML source.
My script, as it stands now, uses urllib2 (urlopen) to fetch the page and BeautifulSoup to parse it. Can anyone help me out with getting the list of script sources? Or maybe even with reading the scripts as well (not 100% necessary, but it could come in handy)?
Any help would be much appreciated.
You need a headless browser with a Python API. Ghost.py will probably do what you want.
http://jeanphix.me/Ghost.py/
"Content that is dynamically generated by JavaScript" implies that the JavaScript in question is interpreted, which requires a JavaScript interpreter.
You probably need a web view instance with a mechanism to intercept requests, so you can figure out which JavaScript files are being loaded by the page.
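To illustrate why the static approach misses some scripts, here is a small Python sketch of the audit described in the question: list the script sources found in the HTML and match them against a signature dictionary. It uses the standard library's html.parser instead of BeautifulSoup so that it is self-contained. The sample page and the cdn.example.com address are invented for the example.

```python
from html.parser import HTMLParser

# Hypothetical signature dictionary: substring of a script URL -> product name.
SIGNATURES = {"ga.js": "Google Analytics"}


class ScriptCollector(HTMLParser):
    """Collects the src attribute of every <script> tag in the static HTML."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def script_sources(html):
    """Return the script URLs that appear literally in the HTML source."""
    collector = ScriptCollector()
    collector.feed(html)
    return collector.sources


def detect(html):
    """Return the known products whose signature appears in a script URL."""
    found = set()
    for src in script_sources(html):
        for needle, name in SIGNATURES.items():
            if needle in src:
                found.add(name)
    return sorted(found)


# Made-up page: one static script tag, one script injected at run time.
sample = """
<html><head>
<script src="http://www.google-analytics.com/ga.js"></script>
<script>
  var s = document.createElement('script');
  s.src = '//cdn.example.com/reflektion_b.js';
  document.head.appendChild(s);
</script>
</head><body></body></html>
"""
print(script_sources(sample))
print(detect(sample))
```

The injected script never shows up in the result, because it only exists once a browser has executed the inline code. That is the gap a headless browser or a request-intercepting web view is meant to close.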

Using injected JavaScript to copy text from a web page

As part of a job I'm doing on a web site I have to copy a few thousand lines of text from several pages of the old site and paste them into the HTML for the new site. The long and painstaking way of going to the old page and copying the many lines of text and then going to my editor and pasting it there line by line is getting really old. I thought of using injected JavaScript to do this but I'm not quite sure where to start. Thanks in advance for any help.
Here are links to a page of the old site and a page of the new site. As you can see in the tables on each page it would take a ton of time to copy it all manually.
Old site: http://temp.delridgelegalformscom.officelive.com/macorporation1.aspx
New Site: http://ezwebsites.us/delridge/macorporation1.html
In order to do this type of work, you need two things: a way of injecting or executing your script on that page, and a good working knowledge of the Document Object Model for the target site.
I highly recommend using the Firefox plugin Firebug, or some equivalent tool in your browser of choice. Firebug lets you execute commands from a JavaScript console, which will help. Hopefully the old site does not have a bunch of <FONT>, <OBJECT> or <IFRAME> tags, which would make this even more tedious.
Using a library like Prototype or jQuery will also help with selecting the parts of the website you need. You can submit results using jQuery like this:
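$(function() {
  // Grab the markup inside the element to copy ('#content-id' is a placeholder).
  var snippet = $('#content-id').html();
  // Send it to your own server for storage.
  $.post('http://myserver/page', {content: snippet});
});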
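A problem you will very likely run into is the same-origin policy that browsers enforce for JavaScript. Requests are checked against the origin of the page the script runs in, so the example above is only trouble-free if that page is served from http://myserver; otherwise the browser may block the request or its response unless the server allows cross-origin access (CORS).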
Perhaps another route you can take is to use a scripting language like Ruby, Python, or (if you really have patience) VBA. The script can automate the list of pages to scrape and a target location for the information. It can just as easily package it up as a request to the new server if that's how pages get updated. This way you don't have to worry about injecting the JavaScript and hoping all works without problems.
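As a rough idea of what the scripting route could look like in Python, the sketch below turns the tables in a page's HTML into lists of cell text, which can then be written into the new site's markup. It assumes reasonably well-formed tables without nesting, and the sample markup is invented rather than copied from the old site.

```python
from html.parser import HTMLParser


class TableText(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the pieces and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableText()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup standing in for one table on the old site.
sample = (
    "<table><tr><th>Form</th><th>Price</th></tr>"
    "<tr><td>Articles of\n Organization</td><td>$10</td></tr></table>"
)
for row in table_rows(sample):
    print(row)
```

The page HTML itself could be fetched with urllib.request.urlopen in a loop over the list of old pages.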
I think you need Greasemonkey: http://www.greasespot.net/

How can I programmatically scrape an image from another website?

A few years ago I helped someone put together a webpage (for local personal use only, not served to the world) that aggregates outdoor webcam photos from several of his favorite websites. It's a time-saver for viewing multiple websites at once. We had it easy when the images on those websites had fixed URLs, and we were able to write some JavaScript code when the URLs changed predictably (e.g., when the URL had a date in it). But now he'd like to add an image whose filename changes seemingly at random, and I don't know how to handle that. Basically, I'd like to:
Programmatically visit another website to find the URL of a particular image.
Insert that URL into my webpage with an <img> tag.
I realize this is probably a confusing and unusual question. I'm willing to help clarify as much as possible. I'm just not sure how to ask for what this guy wants to do.
Update: David Dorward mentioned that doing this with JavaScript violates the Same Origin Policy. I'm open to suggestions for other ways to approach this problem.
Fetch the HTML of the remote page using cross-domain AJAX.
Then parse it to get the URLs of the images of interest.
Then, for each URL, emit <img src=url />
It's probably a big fat violation of copyright.
The picture is most likely contained within a page, so just visit that page regularly and parse the img tag. Make sure that the random bit you mentioned is not just a random parameter that forces browsers to fetch a fresh image instead of retrieving a cached version.
If you use PHP in your project, you can use the cURL library to fetch the other website's content and parse it with a regex to get the image URL from the source code.
You have a Python question in your profile, so I'll just say if I were trying to do this, I'd go with Python & Beautiful Soup. Has the added advantage of being able to handle invalid HTML.
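A minimal sketch of that idea, using only Python's standard library in place of Beautiful Soup so that it runs without extra installs: find the first img tag whose src matches a pattern, resolve it to an absolute URL, and strip the query string to check whether only a cache-busting parameter is changing. The page markup, the example.com address and the "cams/" pattern are assumptions made for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_image_url(html, page_url, pattern):
    """Return the absolute URL of the first <img> whose src matches pattern, or None."""
    collector = ImageCollector()
    collector.feed(html)
    for src in collector.sources:
        if re.search(pattern, src):
            return urljoin(page_url, src)
    return None


def without_query(url):
    """Drop the query string, to see whether only a cache-busting parameter changes."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Hypothetical markup standing in for the webcam page.
sample = '<p><img src="/logo.png"><img src="cams/peak.jpg?t=1699999999"></p>'
url = find_image_url(sample, "http://www.example.com/webcam/", r"cams/")
print(url)
print(without_query(url))
```

If the URL without its query string stays the same between visits, the "random" part is probably just a cache-buster and the fixed URL can go straight into the <img> tag. Run on a server or as a scheduled script, this also sidesteps the Same Origin Policy problem from the update.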
