Scraping a website which has javascript

Scraping a website which has javascript - javascript

I'm looking for a method to scrape a website from server side (which uses javascript) and save the output after analyzing data into a mysql database. I need to navigate from page to page by clicking links and submitting data from the database,without session expiring . Is this possible using phpquery web browser plugin? . I've started doing this using casperjs. I would like to know the pros and cons of both methods. I'm a beginner in the coding space. Please help.

I would recommend that you use PhantomJS or CasperJS and parse the DOM with JavaScript selectors to get the parts of the pages you want back. Don't use phpQuery as it's based on PHP and would require a separate step in your processing versus using just JavaScript DOM parsing. Also, you won't be able to perform click events using PHP. Anything client side would need to be run in PhantomJS or CasperJS.
It might even be possible to write a full scraping engine using just PHP if that's your server side language of choice. You would need to reverse engineer the login process and maintain a cookie jar with your cURL requests to keep your login valid with each request. Once you've established a session with the the website, you can then setup your navigation path with an array of links that you would like to crawl. The idea behind web crawling is that you load a page from some link and process the page and then move to the next link. You continue this process until all pages have been processed and then your crawl is complete.

I would check out Google's guide Making AJAX Applications Crawlable the website you're trying to scrap might have adopted the scheme (making their site's content crawlable).
You want to look for #! in the URL's hash fragment, this indicates to the crawler that the site supports the AJAX crawling scheme.
To put it simply, when you come across a URL like this.
www.example.com/ajax.html#!key=value you would modify it to www.example.com/ajax.html?_escaped_fragment_=key=value. The server should respond with a HTML snapshot of that page.
Here is the Full Specification

Related

Single Page Application Web crawlers and SEO

I have created my blog as a single page application using mithril framework on the front end. To make queries I've used a rest API and Django at the backend. Since everything is rendered using javascript code and when the crawlers hit my blog all they see is an empty page. And to add to that whenever I share a post on social media for instance all Facebook sees is just an empty page and not the post content and title.
I was thinking of looking at the user agents and whenever the USER-AGENT is from a crawler I would feed it the rendered version of the pages but I'm having problems implementing the above method described.
What is the best practice to create a single page app that uses rest API and Django in the backend SEO friendly for web crawlers?

I'm doing this on a project right now, and I would really recommend doing it with Node instead of Python, like this:
https://isomorphic-mithril.mvlabs.it/en/

You might want to look into a server-side rendering of the page that crawlers visit.
Here is a good article on Client Side vs Server Side
I haven't heard of Mithril before, but you might find some plugins that does this for you.
https://github.com/MithrilJS/mithril-node-render

This might help you : https://github.com/sharjeel619/SPA-SEO
The above example is made with Node/Express but you can use the same logic with your Django server.
Logic
A browser requests your single page application from the server,
which is going to be loaded from a single index.html file.
You program some intermediary server code which intercepts the client
request and differentiates whether the request came from a browser or
some social crawler bot.
If the request came from some crawler bot, make an API call to
your back-end server, gather the data you need, fill in that data to
html meta tags and return those tags in string format back to the
client.
If the request didn't come from some crawler bot, then simply
return the index.html file from the build or dist folder of your single page
application.

javascript prevents file_get_contents

I'm trying to scrap the data from the website using file_get_contents but instead of the webpage source I'm getting following code:
<body onload="challenge();">
<script>eval(function(p,a,c,k,e,r){e=function(c){return c.toString(a)};if(!''.replace(/^/,String)){while(c--)r[e(c)]=k[c]||e(c);k=[function(e){return r[e]}];e=function(){return'\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p}('1 6(){2.3=\'4=5; 0-7=8; 9=/\';a.b.c()}',13,13,'tax|function|document|cookie|ddosdefend|1d4607e3ac67b865e6c7263260c34e888cae7c56|challenge|age|0|path|window|location|reload'.split('|'),0,{}))
Engine is wordpress. Is there any chance to get real source?

file_get_contents seems to work fine. However, it seems you are not served the desired content but some JavaScript code which needs to be evaluated before redirecting to the content.
This might be because the website you want to scrape uses a DDOS protection (e.g. something like CouldFlare) which detects your simple scraping attempt.
Usually, a DDOS protection service is a proxy between the original webserver and your scraper. It inspects your request behavior, user agent etc. and based on that serves you either the original webserver's content or presents you a challenge (e.g. captcha, or simply requires you to evaluate javascript etc.).
If you can get the IP address of the original webserver, you might be able to directly access it. The DNS resolution for the webserver's name will direct you to the proxy, so you have to look elsewhere. Alternatively, use a web scraping library that emulates real browser behavior in PHP.

Obtain from microstrategy an exported PDF document directly

This is my situation:
I have a third part that uses a software called microstrategy which is able to generate documents and allow to export them as PDF or Excel files. They provide me only web api of this product, and I haven't any web service to work with.
The url is like:
http://<third_part_domain>/microstrategy/asp/Main.aspx?Server=<third_part_domain>&Project=<project_name>&evt=3069&src=Main.aspx.3069&executionMode=3&promptAnswerMode=1&documentID=<doc_id>&uid=<username>&pwd=<password>&<other_parameters_for_request>
I have try to obtain the file (that I must save on server side) by java code, but the response of the link that we use is an HTML page with some code Javascript that does more than one redirect, so I can not interpreted correctly the response and I should use a browser to obtain the PDF.
So I have thought to put the page into a iframe and after a while (usually the server takes 20 second) take the PDF object by javascript code and send to my server. But obviously the third part have another domain and the CORS policies block everything. To make matters worse, I can not use the final url to obtain the file because the microstrategy respond me with an internal page of the administration console.
So, that's my question:
Is there a way (that is not on the microstrategy server side) to obtain directly the PDF from microstrategy?
Or exists a way from client side to bypass the problem of origin control? I have evaluated to implement a proxy for solution but it's too expensive.
Thanks to all!

You need two things in order to download a PDF from MicroStrategy using a URL:
In the document property set that default visualization as PDF. This is pretty trivial and I think any of your MicroStrategy savvy colleague can help you with this.
Disable the waiting page, this is more complicated. When MicroStrategy generates a documents, usually it needs some time, meanwhile the server is working it will show you a waiting page. Useful if the request comes from a human (the human can go on StackOverflow), not that much if the call arrives from API.
The instruction to disable the waiting page are here: TN34124: How to Disable the Wait Page in MicroStrategy Web using the MicroStrategy Web SDK 9.x.
But I read from your question that you have no control on the third party MicroStrategy application. In that case there is little you can do. You can try to ask them to implement the customization to remove the waiting page or allow you to use taskproc API, but that's a story for another day.

Some options:
Ask the third party to schedule the PDF generation on their side and send it via mail to you. Or place it on a shared folder that is shared between you.
Ask for a different URL Tuareg from the file-share menu options. This will give a URL with 'subscriptionid' in it.

Is it smart to use clientside JavaScript to transform a HTML page instead of sending multiple HTML pages from a server?

If I were to have a web server that sent out a html, js and css document to a person and upon them clicking on "Contact Us" from the "Home" page I would have JS to change certain parts of the content and possibly push fake history (I think this is possible but I might not remember right) instead of requesting a new HTML page specially made on the server.
Wouldn't this be much better for my web server due to less GET requests? Would there be any major problems with doing this? Any examples of sites that use client-side page changing?

Wouldn't this be much better for my web server due to less GET requests?
No. HTTP requests aren't that expensive. You'd probably end up sending more data anyway (since you have to send the data for all the pages on the first request, and visitors might not event look at them all).
Would there be any major problems with doing this?
You either:
Do a lot of work or
You have a JS dependent, search engine unfriendly site
Any examples of sites that use client-side page changing?
Github's file viewer does it. (They take the "lot of work" approach).

How to fetch particular HTML contents from remote URL?

I want to fetch particular HTML contents from remote websites url.
The website URL is as follow,
http://www.realtor.com/realestateandhomes-detail/10216-Montwood-Drive_El-Paso_TX_79925_M78337-06548
I want to fetch some specific information from above website url.
Here I attached image it highlight the specific area I want to all highlighted portion from there is a title,image, and descriptions.
How can I fetch the contents using JQuery or Javascript or Json call?
Is any other way to get these?

You might be interested in checking out pjscrape (disclaimer: this is my project). It's a command-line tool using PhantomJS to allow scraping using JavaScript and jQuery in a full browser context.
Scrapers can be written in straight Javascript, executed in the context of the site you're scraping, with a very simple, jQuery-friendly syntax.
It can scrape a single page, an array of pages, or you can define a function to look for more URLs to spider on each page.
It supports JSON and CSV output, either to file or to STDOUT
If the site is static and the structure is uniform, it should be very fast to scrape all the content you need into a structured data format.

This will help you out:
http://papermashup.com/use-jquery-and-php-to-scrape-page-content/

When scraping content, it is vital to consider the following:
Is the content static html or will part of it's content be rendered by ajax-calls?
In the first case, simple http-get-routines like the one used in JNDPNT's comment's Link will be sufficient.
In the second case, you may want to look at automating Selenium via it's Webdriver.
In any case it might be better to ask your colleague if he can provide you with an interface to the raw data, e.g. over a webservice.

If I'm getting you right, you want The user's Browser to scrape The content of another Domain on The Fly, right?
That will Not Be Possible without proxying The Request through some Script on The Same Domain (or via a jsonp Request to a Service that returns The HTML to you) due to The Same Origin Policy.
Sorry to disappoint.

Use the Yahoo Pipes (http://pipes.yahoo.com/pipes/ )service.
This can be used to grab and manipulate the page HTML, extracting the bits you want. Data can then be posted server side using the Web Service module or sent directly to the clients browser using an ordinary javascript callback.

We Keep Coding

JavaScript is the programming language of the Web.

Scraping a website which has javascript - javascript

Related

Single Page Application Web crawlers and SEO

javascript prevents file_get_contents

Obtain from microstrategy an exported PDF document directly

Is it smart to use clientside JavaScript to transform a HTML page instead of sending multiple HTML pages from a server?

How to fetch particular HTML contents from remote URL?

Categories

Resources