Scrape web page data generated by JavaScript

My question is: how can I scrape data from this website, http://vtis.vn/index.aspx, when the data is not shown until you click on, for example, "Danh sách chậm"? I have looked at this carefully: clicking "Danh sách chậm" fires an onclick event that triggers some JavaScript functions, one of which fetches the data from the server and inserts it into a placeholder tag. At that point you can examine the data with something like Firefox, and it is displayed to visitors on the page. So, again, how can we scrape this data programmatically?
I wrote a scraping function, but of course it does not get the data I want, because the data is not available until I click the "Danh sách chậm" button.
<?php
// Fetch the raw HTML; content inserted later by JavaScript is not included.
$page = file_get_contents('http://vtis.vn/index.aspx');
$dom_document = new DOMDocument();
$dom_document->loadHTML($page);
// Build the XPath object from the same document that was just loaded.
$dom_xpath = new DOMXPath($dom_document);
// XPath attribute tests use "@", not "#".
$elements = $dom_xpath->query("//td[@class='IconMenuColumn']");
foreach ($elements as $element) {
    foreach ($element->childNodes as $node) {
        // C14N() returns UTF-8, so convert from UTF-8 explicitly.
        echo mb_convert_encoding($node->C14N(), 'ISO-8859-1', 'UTF-8');
    }
}

You need to look at PhantomJS.
From their site:
PhantomJS is a headless WebKit with JavaScript API. It has fast and
native support for various web standards: DOM handling, CSS selector,
JSON, Canvas, and SVG.
Using the API you can script the "browser" to interact with that page and scrape the data you need. You can then do whatever you need with it; including passing it to a PHP script if necessary.
That being said, if at all possible try not to "scrape" the data. If there is an ajax call the page is making, maybe there is an API you can use instead? If not, maybe you can convince them to make one. That would of course be much easier and more maintainable than screen scraping.

First, you need PhantomJS. Suggested install method on Linux:
wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
tar xvf phantomjs-2.1.1-linux-x86_64.tar.bz2
cp phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin
Second, you need the php-phantomjs package. Assuming you have installed Composer:
composer require jonnyw/php-phantomjs
Or follow installation documentation here.
Third, load the package in your script and, instead of file_get_contents, load the page via PhantomJS:
<?php
require 'vendor/autoload.php';
use JonnyW\PhantomJs\Client;
$client = Client::getInstance();
$client->getEngine()->setPath('/usr/local/bin/phantomjs');
$request = $client->getMessageFactory()->createRequest();
$response = $client->getMessageFactory()->createResponse();
$request->setMethod('GET');
$request->setUrl('https://www.your_page_embeded_ajax_request');
$client->send($request, $response);
if ($response->getStatus() === 200) {
    // $response->getContent() holds the HTML after JavaScript has run
    echo "Do something here";
}

Related

getting OnLoad HTML/DOM for an HTML page in PHP

I am trying to get the HTML (ie what you see initially when the page completes loading) for some web-page URI. Stripping out all error checking and assuming static HTML, it's a single line of code:
function GetDisplayedHTML($uri) {
return file_get_contents($uri);
}
This works fine for static HTML, and is easy to extend by simple parsing, if the page has static file dependencies/references. So tags like <script src="XXX">, <a href="XXX">, <img src="XXX">, and CSS, can also be detected and the dependencies returned in an array, if they matter.
But what about web pages where the HTML is dynamically created using events/AJAX? For example suppose the HTML for the web page is just a brief AJAX-based or OnLoad script that builds the visible web page? Then parsing alone won't work.
I guess what I need is a way from within PHP, to open and render the http response (ie the HTML we get at first) via some javascript engine or browser, and once it 'stabilises', capture the HTML (or static DOM?) that's now present, which will be what the user's actually seeing.
Since such a webpage could continually change itself, I'd have to define "stable" (OnLoad or after X seconds?). I also don't need to capture any timer or async event states (ie "things set in motion that might cause web page updates at some future time"). I only need enough of the DOM to represent the static appearance the user could see, at that time.
What would I need to do, to achieve this programmatically in PHP?
To render a page with JS you need a browser. PhantomJS was created for tasks like this. Here is a simple script to run with PhantomJS:
var webPage = require('webpage');
var page = webPage.create();
var system = require('system');
var args = system.args;
if (args.length === 1) {
    console.log('First argument must be page URL!');
    phantom.exit(1); // exit explicitly, otherwise PhantomJS keeps running
} else {
    page.open(args[1], function (status) {
        window.setTimeout(function () { // Wait for scripts to run
            var content = page.content;
            console.log(content);
            phantom.exit();
        }, 500);
    });
}
It returns resulting HTML to console output.
You can run it from console like this:
./phantomjs.exe render.js http://yandex.ru
Or you can use PHP to run it:
<?php
$path = dirname(__FILE__);
$html = shell_exec($path . DIRECTORY_SEPARATOR . 'phantomjs.exe render.js http://phantomjs.org/');
echo htmlspecialchars($html);
My PHP code assumes that the PhantomJS executable is in the same directory as the PHP script.
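The fixed 500 ms delay in render.js is a guess; a page may settle sooner or later than that. One possible way to define "stable", sketched here under assumptions, is to take repeated snapshots of the content and treat the page as settled once several consecutive snapshots are identical. The helper name createStabilityTracker and the sample snapshots are hypothetical, and this is plain JavaScript run against simulated data, not a tested PhantomJS script.

```javascript
// Tracks successive snapshots of page content and reports when the content
// has stayed identical for `requiredRepeats` consecutive checks.
// Written in ES5 syntax so it could also run in older JavaScript engines.
function createStabilityTracker(requiredRepeats) {
  var last;
  var seen = false;
  var repeats = 0;
  return function isStable(snapshot) {
    if (seen && snapshot === last) {
      repeats += 1;
    } else {
      last = snapshot;
      seen = true;
      repeats = 0;
    }
    return repeats >= requiredRepeats;
  };
}

// Simulated snapshots: the content changes twice, then settles.
var isStable = createStabilityTracker(2);
var snapshots = ['<p>a</p>', '<p>ab</p>', '<p>abc</p>', '<p>abc</p>', '<p>abc</p>'];
var results = snapshots.map(function (snapshot) {
  return isStable(snapshot);
});
console.log(results); // only the last check reports stability
```

In a PhantomJS script, the same function could presumably be called with page.content from a setInterval callback, exiting once it returns true. A cap on the number of checks would still be needed for pages that never stop changing.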

PHP Get Rendered Javascript Page

I'm developing an application using AngularJS. Everything seemed fine until I ran into something that is giving me a headache: SEO.
From many references, I found out that getting AJAX content crawled and indexed by Googlebot or Bingbot 'is not that easy', since the crawlers don't render JavaScript.
Currently I need a solution using PHP. I use the PHP Slim Framework, so my main file is index.php, which contains a function to echo the content of my index.html. My question is:
Is it possible to make a snapshot of rendered Javascript in HTML?
My strategy is:
If the request query string contains _escaped_fragment_, the application will generate a snapshot and give that snapshot as response instead of the exact file.
Any help would be appreciated. Thanks.
After a lot of searching and research, I finally solved my problem by combining PHP with PhantomJS (version 2.0). I use PHP's exec() function to run PhantomJS, and a JavaScript file to get the content of the target URL. Here are the snippets:
index.php
// Assume a bin folder under your root directory that contains phantomjs.exe and content.js
$script = __DIR__ . "/bin/content.js";
$target = "http://www.kincir.com"; // target URL
// escapeshellarg() keeps spaces or shell metacharacters in the arguments from breaking the command
$cmd = __DIR__ . "/bin/phantomjs.exe " . escapeshellarg($script) . " " . escapeshellarg($target);
exec($cmd, $output);
return implode("", $output);
content.js
var webPage = require('webpage');
var system = require('system');
var page = webPage.create();
var url = system.args[1]; // This will get the second argument from $cmd, in this example, it will be the value of $target on index.php which is "http://www.kincir.com"
page.open(url, function (status) {
page.onLoadFinished = function () { // Return the page content once the page has finished loading
var content = page.content;
console.log(content);
phantom.exit();
};
});
I recently published a project that gives PHP access to a browser. Get it here: https://github.com/merlinthemagic/MTS. It also relies on PhantomJS.
After downloading and setup you would simply use the following code:
$myUrl = "http://www.example.com";
$windowObj = \MTS\Factories::getDevices()->getLocalHost()->getBrowser('phantomjs')->getNewWindow($myUrl);
// Now you can retrieve the DOM and parse it, like this:
$domData = $windowObj->getDom();
// This project also lets you manipulate the live page: click, fill in forms, submit, etc.

Sharing TinyMCE plugin across multiple applications

I'm using CakePHP 2.4.7 and the TinyMCE plugin from CakeDC.
I set up my CakePHP core along with the plugin in a shared location on my server so that multiple applications can access it. This keeps me from having to update multiple copies of TinyMCE. Everything was working well until I migrated to a new server and updated software.
The new server is running Apache 2.4 instead of 2.2 and using mod_ruid2 instead of suexec.
I now get this error when trying to load the editor:
Fatal Error (4): syntax error, unexpected T_CONSTANT_ENCAPSED_STRING in [/xyz/Plugin/TinyMCE/webroot/js/tiny_mce/tiny_mce.js, line 1]
How should I start debugging this?
Workaround Attempt
I tried adding a symlink from an application's webroot to TinyMCE's plugin webroot. This works in that it loads the js file and the editor, but then TinyMCE plugins are working on the wrong current directory and file management would not be separated.
The problem is the AssetDispatcher filter: it includes CSS and JS files using PHP's include() statement, so the files are sent through the PHP parser, which stumbles over the occurrences of <? in the TinyMCE script.
See https://github.com/.../2.4.7/lib/Cake/Routing/Filter/AssetDispatcher.php#L159-L160
A very annoying, and, since it's undocumented and non-optional, dangerous behavior if you ask me.
Custom asset dispatcher
In case you want to continue to use a plugin asset dispatcher, extend the built in one, and reimplement the AssetDispatcher::_deliverAsset() method with the include functionality removed. Of course this is kinda annoying, maintenance wise, but it's a pretty quick fix.
Something like:
// app/Routing/Filter/MyAssetDispatcher.php
App::uses('AssetDispatcher', 'Routing/Filter');
class MyAssetDispatcher extends AssetDispatcher {
protected function _deliverAsset(CakeResponse $response, $assetFile, $ext) {
// see the source of your CakePHP core for the
// actual code that you'd need to reimplement
ob_start();
$compressionEnabled = Configure::read('Asset.compress') && $response->compress();
if ($response->type($ext) == $ext) {
$contentType = 'application/octet-stream';
$agent = env('HTTP_USER_AGENT');
if (preg_match('%Opera(/| )([0-9].[0-9]{1,2})%', $agent) || preg_match('/MSIE ([0-9].[0-9]{1,2})/', $agent)) {
$contentType = 'application/octetstream';
}
$response->type($contentType);
}
if (!$compressionEnabled) {
$response->header('Content-Length', filesize($assetFile));
}
$response->cache(filemtime($assetFile));
$response->send();
ob_clean();
// instead of the possible `include()` in the original
// methods source, use `readfile()` only
readfile($assetFile);
if ($compressionEnabled) {
ob_end_flush();
}
}
}
// app/Config/bootstrap.php
Configure::write('Dispatcher.filters', array(
'MyAssetDispatcher', // instead of AssetDispatcher
// ...
));
See also http://book.cakephp.org/2.0/en/development/dispatch-filters.html
Don't just disable short open tags
I'm just guessing here, but the reason it was working on your other server is probably that short open tags (i.e. <?) were disabled. However, even if that is the problem on your new server, this isn't something you should rely on: the assets are still being served using include(), and you most probably don't want to check all your third-party CSS/JS for possible PHP code injections on every update.
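To see whether a given asset would trip the PHP parser, it can help to look for the <? sequence in it. The following is only a rough sketch in plain JavaScript (runnable with Node.js); findPhpOpenTags and the sample string are made up for illustration, and a real check would read the asset file from disk instead.

```javascript
// Return the character offsets of every "<?" sequence in a source string.
// Any hit is a spot where PHP's parser could switch into PHP mode when the
// file is served through include() with short open tags enabled.
function findPhpOpenTags(source) {
  var positions = [];
  var index = source.indexOf('<?');
  while (index !== -1) {
    positions.push(index);
    index = source.indexOf('<?', index + 2);
  }
  return positions;
}

// Hypothetical snippet resembling minified JavaScript that mentions "<?".
var sample = 'var re = /<?xml/; var s = "<?php echo 1; ?>";';
var hits = findPhpOpenTags(sample);
console.log('Found ' + hits.length + ' occurrence(s) of "<?"');
```

An empty result does not make include() safe for assets in general; it only shows that this particular file contains nothing the parser would react to.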

Inject local .js file into a webpage?

I'd like to inject a couple of local .js files into a webpage. I just mean client side, as in within my browser, I don't need anybody else accessing the page to be able to see it. I just need to take a .js file, and then make it so it's as if that file had been included in the page's html via a <script> tag all along.
It's okay if it takes a second after the page has loaded for the stuff in the local files to be available.
It's okay if I have to be at the computer to do this "by hand" with a console or something.
I've been trying to do this for two days, I've tried Greasemonkey, I've tried manually loading files using a JavaScript console. It amazes me that there isn't (apparently) an established way to do this, it seems like such a simple thing to want to do. I guess simple isn't the same thing as common, though.
If it helps, the reason why I want to do this is to run a chatbot on a JS-based chat client. Some of the bot's code is mixed into the pre-existing chat code -- for that, I have Fiddler intercepting requests to .../chat.js and replacing it with a local file. But I have two .js files which are "independent" of anything on the page itself. There aren't any .js files requested by the page that I can substitute them for, so I can't use Fiddler.
Since you're already using a Fiddler script, you can do something like this in the OnBeforeResponse(oSession: Session) function:
if ( oSession.oResponse.headers.ExistsAndContains("Content-Type", "html") &&
oSession.hostname.Contains("MY.TargetSite.com") ) {
oSession.oResponse.headers.Add("DEBUG1_WE_EDITED_THIS", "HERE");
// Remove any compression or chunking
oSession.utilDecodeResponse();
var oBody = System.Text.Encoding.UTF8.GetString(oSession.responseBodyBytes);
// Find the end of the HEAD script, so you can inject script block there.
var oRegEx = /(<\/head>)/gi;
// replace the head-close tag with new-script + head-close
oBody = oBody.replace(oRegEx, "<script type='text/javascript'>console.log('We injected it');</script></head>");
// Set the response body to the changed body string
oSession.utilSetResponseBody(oBody);
}
Working example for www.html5rocks.com :
if ( oSession.oResponse.headers.ExistsAndContains("Content-Type", "html") &&
oSession.hostname.Contains("html5rocks") ) { //goto html5rocks.com
oSession.oResponse.headers.Add("DEBUG1_WE_EDITED_THIS", "HERE");
oSession.utilDecodeResponse();
var oBody = System.Text.Encoding.UTF8.GetString(oSession.responseBodyBytes);
var oRegEx = /(<\/head>)/gi;
oBody = oBody.replace(oRegEx, "<script type='text/javascript'>alert('We injected it')</script></head>");
oSession.utilSetResponseBody(oBody);
}
Note, you have to turn streaming off in fiddler : http://www.fiddler2.com/fiddler/help/streaming.asp and I assume you would need to decode HTTPS : http://www.fiddler2.com/fiddler/help/httpsdecryption.asp
I have been using fiddler script less and less, in favor of fiddler .Net Extensions - http://fiddler2.com/fiddler/dev/IFiddlerExtension.asp
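The string replacement at the heart of the Fiddler script can be tried on its own before wiring it into a proxy. The following is a minimal sketch in plain JavaScript, assuming the file to inject is served from some URL the page can reach; injectScriptTag and the http://localhost/bot.js address are hypothetical, and the URL is assumed to be trusted since it is not escaped.

```javascript
// Insert a <script src="..."> tag just before the first closing </head> tag.
function injectScriptTag(html, scriptUrl) {
  var tag = '<script type="text/javascript" src="' + scriptUrl + '"></script>';
  var headClose = /<\/head>/i;
  if (!headClose.test(html)) {
    return html; // no </head> found: leave the document untouched
  }
  // A replacer function avoids special handling of "$" in the replacement.
  return html.replace(headClose, function (match) {
    return tag + match;
  });
}

var original = '<html><HEAD><title>t</title></HEAD><body></body></html>';
var injected = injectScriptTag(original, 'http://localhost/bot.js');
console.log(injected);
```

Referencing the file by src rather than pasting its contents inline keeps the injected markup short, at the cost of needing something to serve the file.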
If you are using Chrome then check out dotjs.
It will do exactly what you want!
How about just using jquery's jQuery.getScript() method?
http://api.jquery.com/jQuery.getScript/
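If jQuery is not already loaded on the page, roughly the same effect can be had with plain DOM calls typed into the browser console. This is a sketch under assumptions: loadScript is a hypothetical helper, and the local file is assumed to be served from a URL the page can reach (the localhost address in the comment is made up).

```javascript
// Append a <script> element to the page so the browser fetches and runs it.
// `doc` is passed in explicitly; in a browser it would be the global `document`.
function loadScript(doc, url, onLoad) {
  var script = doc.createElement('script');
  script.src = url;
  script.onload = onLoad; // called once the script has been fetched and executed
  doc.head.appendChild(script);
  return script;
}

// In a browser console (hypothetical URL for a locally served file):
// loadScript(document, 'http://localhost:8000/bot.js', function () {
//   console.log('bot.js loaded');
// });
```

Passing the document in as a parameter is only there so the function can be exercised outside a browser; in the console you would always pass document.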
Save the normal HTML pages to the file system, add the JS files by hand, and then use Fiddler to intercept those calls so you get your version of the HTML file.

Applying DOM Manipulations to HTML and saving the result?

I have about 100 static HTML pages that I want to apply some DOM manipulations to. They all follow the same HTML structure. I want to apply some DOM manipulations to each of these files, and then save the resulting HTML.
These are the manipulations I want to apply:
# [start]
$("h1.title, h2.description", this).wrap("<hgroup>");
if ( $("h1.title").height() < 200 ) {
$("div.content").addClass('tall');
}
# [end]
# SAVE NEW HTML
The first line (.wrap()) I could easily do with a find and replace, but it gets tricky when I have to determine the calculated height of an element, which can't easily be determined without JavaScript.
Does anyone know how I can achieve this? Thanks!
While the first part could indeed be solved in "text mode" using regular expressions or a more complete DOM implementation in JavaScript, for the second part (the height calculation), you'll need a real, full browser or a headless engine like PhantomJS.
From the PhantomJS homepage:
PhantomJS is a command-line tool that packs and embeds WebKit.
Literally it acts like any other WebKit-based web browser, except that
nothing gets displayed to the screen (thus, the term headless). In
addition to that, PhantomJS can be controlled or scripted using its
JavaScript API.
A schematic instruction (which I admit is not tested) follows.
In your modification script (say, modify-html-file.js), open an HTML page, modify its DOM tree and console.log the HTML of the root element:
var page = new WebPage();
page.open(encodeURI('file://' + phantom.args[0]), function (status) {
if (status === 'success') {
var html = page.evaluate(function () {
// your DOM manipulation here
return document.documentElement.outerHTML;
});
console.log(html);
}
phantom.exit();
});
Next, save the new HTML by redirecting your script's output to a file:
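#!/bin/bash
mkdir -p modified
for i in *.html; do
# use the loop variable "$i"; the absolute path is needed because the script prepends file://
phantomjs modify-html-file.js "$PWD/$i" > "modified/$i"
done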
I tried PhantomJS as in katspaugh's answer, but ran into several issues trying to manipulate pages. My use case was modifying the static HTML output of Doxygen without modifying Doxygen itself. The goal was to reduce the delivered file size by removing unnecessary elements from the page, and to convert it to HTML5. I also wanted to use jQuery to access and modify elements more easily.
Loading the page in PhantomJS
The APIs appear to have changed drastically since the accepted answer. Additionally, I used a different approach (derived from this answer), which will be important in mitigating one of the major issues I encountered.
var system = require('system');
var fs = require('fs');
var page = require('webpage').create();
// Reading the page's content into your "webpage"
// This automatically refreshes the page
page.content = fs.read(system.args[1]);
// Make all your changes here
fs.write(system.args[2], page.content, 'w');
phantom.exit();
Preventing JavaScript from Running
My page uses Google Analytics in the footer, and now the page is modified beyond my intention, presumably because javascript was run. If we disable javascript, we can't actually use jQuery to modify the page, so that isn't an option. I've tried temporarily changing the tag, but when I do, every special character is replaced with an html-escaped equivalent, destroying all javascript code on the page. Then, I came across this answer, which gave me the following idea.
var rawPageString = fs.read(system.args[1]);
rawPageString = rawPageString.replace(/<script type="text\/javascript"/g, "<script type='foo/bar'");
rawPageString = rawPageString.replace(/<script>/g, "<script type='foo/bar'>");
page.content = rawPageString;
// Make all your changes here
rawPageString = page.content;
rawPageString = rawPageString.replace(/<script type='foo\/bar'/g, "<script");
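The disable-and-restore round trip can be checked on plain strings, without PhantomJS, to confirm that the two steps really undo each other. The sketch below uses the same regular expressions as above; the function names and the sample markup are hypothetical. Note that it only recognises the two tag forms shown, so scripts written with other attributes or quoting would still run.

```javascript
// Give known <script> tags a bogus type so the browser engine skips them.
function disableScripts(html) {
  return html
    .replace(/<script type="text\/javascript"/g, "<script type='foo/bar'")
    .replace(/<script>/g, "<script type='foo/bar'>");
}

// Strip the bogus type again so the scripts run normally in the saved file.
function restoreScripts(html) {
  return html.replace(/<script type='foo\/bar'/g, '<script');
}

var input = '<script>track();</script><script type="text/javascript" src="a.js"></script>';
var disabled = disableScripts(input);
var restored = restoreScripts(disabled);
console.log(disabled);
console.log(restored);
```

After the round trip the explicit type="text/javascript" attribute is gone, which is harmless in HTML5 because that is the default script type.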
Adding jQuery
There's actually an example on how to use jQuery. However, I thought an offline copy would be more appropriate. Initially I tried using page.includeJs as in the example, but found that page.injectJs was more suitable for the use case. Unlike includeJs, there's no <script> tag added to the page context, and the call blocks execution which simplifies the code. jQuery was placed in the same directory I was executing my script from.
page.injectJs("jquery-2.1.4.min.js");
page.evaluate(function () {
// Make all changes here
// Remove the foo/bar type more easily here
$("script[type^=foo]").removeAttr("type");
});
fs.write(system.args[2], page.content, 'w');
phantom.exit();
Putting it All Together
var system = require('system');
var fs = require('fs');
var page = require('webpage').create();
var rawPageString = fs.read(system.args[1]);
// Prevent in-page javascript execution
rawPageString = rawPageString.replace(/<script type="text\/javascript"/g, "<script type='foo/bar'");
rawPageString = rawPageString.replace(/<script>/g, "<script type='foo/bar'>");
page.content = rawPageString;
page.injectJs("jquery-2.1.4.min.js");
page.evaluate(function () {
// Make all changes here
// Remove the foo/bar type
$("script[type^=foo]").removeAttr("type");
});
fs.write(system.args[2], page.content, 'w');
phantom.exit();
Using it from the command line:
phantomjs modify-html-file.js "input_file.html" "output_file.html"
Note: This was tested and working with PhantomJS 2.0.0 on Windows 8.1.
Pro tip: If speed matters, you should consider iterating the files from within your PhantomJS script rather than a shell script. This will avoid the latency that PhantomJS has when starting up.
You can get your modified content with $('html').html() (or a more specific selector if you don't want things like the head tags), then submit it as one big string to your server and write the file server-side.
