PhantomJS - How do you capture continuous requests? - javascript

If you go to CNBC.COM and open the debugging tools in Chrome (or whichever browser you prefer), you can see the network trace. Once the page loads, the quote data for the stocks continuously updates.
You'll see something similar to this repeating every few seconds:
http://quote.cnbc.com/quote-html-webservice/quote.htm?partnerId=2&requestMethod=quick&exthrs=1&noform=1&fund=1&output=jsonp&symbols=.SPX|.IXIC|.RUT|.VIX|.GDAXI|.FTSE|.FCHI|.FTMIB|.STOXX|.N225|.SSEC|.HSI|.AXJO|.KS11|%40CL.1|%40LCO.1|%40NG.1|%40RB.1|%40HO.1|%40SI.1|%40GC.1|%40HG.1|%40PL.1|%40PA.1|US10Y|DE10Y-DE|JP10Y-JP|UK10Y-GB|FR10Y-FR|EUR%3D&callback=quoteHandler1
When I use the following code:
var page = require('webpage').create();

page.onResourceRequested = function(request) {
    console.log('Request ' + JSON.stringify(request, undefined, 4));
};

page.onResourceReceived = function(response) {
    console.log('Receive ' + JSON.stringify(response, undefined, 4));
};

page.open(url);
I see all the initially requested resources for the page, but not the continuous quote data. I tried setTimeout() with delays of around 20 seconds to give it additional time to capture the extra requests, but that didn't help.
I just need about 20 seconds' worth of additional capture of all the requests so that a few more quote.cnbc.com calls show up in the trace.
I just started using PhantomJS and have tried just about everything I've seen on the web and Stack Overflow, and nothing has helped.
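For what it's worth, a minimal sketch of the pattern that normally captures ongoing polling in PhantomJS: open the page, log matching requests in onResourceRequested, and only call phantom.exit() after a delay so the script stays alive long enough for the periodic calls to fire. The 20-second window and the quote.cnbc.com filter below are assumptions, not a confirmed fix:

var page = require('webpage').create();

page.onResourceRequested = function(requestData) {
    // only log the quote polling calls (the filter is illustrative)
    if (requestData.url.indexOf('quote.cnbc.com') !== -1) {
        console.log('Quote request: ' + requestData.url);
    }
};

page.open('http://www.cnbc.com/', function(status) {
    console.log('Page loaded: ' + status);
    // keep capturing for another 20 seconds, then exit
    setTimeout(function() {
        phantom.exit();
    }, 20000);
});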

Related

Issues executing working JS with Scraping API: Not making complete XHR Calls

I'm trying to execute some JavaScript via a scraping API, with the goal of scrolling through dynamically loading pages and then parsing the full response. I tested the script in the console on an auction site (with single page ticked) and it worked as expected.
// Keep scrolling until the previous document height equals the new height
const infinite_scroll = async () => {
    let previous = document.body.scrollHeight;
    while (true) {
        window.scrollTo(0, document.body.scrollHeight);
        // wait until the newly requested items have loaded, then proceed
        await new Promise(r => setTimeout(r, 2000));
        let new_height = document.body.scrollHeight;
        console.log('Previous height: %s', previous);
        console.log(' ');
        console.log('New height: %s', new_height);
        if (new_height === previous) {
            break;
        } else {
            previous = new_height;
        }
    }
};
infinite_scroll();
In the network tab it shows 11 successful XHR calls, from
'372610?pn=6&ipp=10' to '372610?pn=16&ipp=10' ('pn' being the equivalent of a page number and 'ipp' the items per page).
However, when I pass this script to Scrapfly, the API I'm using, it makes only 3 XHR calls and the last one times out, so instead of getting the entire page I only get an extra 20 items.
Using the API demo (you would need to sign up for a free account), this can be reproduced by adding the script, setting the rendering time to 10000 ms, and adding the following headers:
emailcta : pagehits%3D1%26userdismissed%3Dfalse
cookie: UseInfiniteScroll=true
Looking at the details for the last XHR call, it times out and has a null response. The request headers for that call are identical to the two previous successful ones, so I'm not exactly sure what the issue is.
The JS rendering time doesn't seem to affect this, nor does the wait time inside the infinite scroll function.
All the docs say is:
"We provide a way to inject your javascript to be executed on the web page. You must base64 your script. Your Javascript will be executed after the rendering delay and before the awaited selector (if defined)."
They encode the script in base64 and then execute it, but the result is drastically different from the one in the console.
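For reference, a minimal Node sketch of the base64 step the docs describe; the file name and the exact request parameter the encoded string goes into are assumptions, so check the Scrapfly docs for the parameter name:

// Read the infinite-scroll script and base64-encode it before sending it to the API
const fs = require('fs');

const script = fs.readFileSync('infinite_scroll.js', 'utf8');
const encoded = Buffer.from(script, 'utf8').toString('base64');

// pass `encoded` as the script parameter of the API request
console.log(encoded);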

Nightmare.js does not work with Azure webjob

I am trying to run an Azure WebJob which takes a JSON object and renders a webpage, then prints it to PDF via the Electron browser in Nightmare.js.
When I run this locally it works perfectly, but when I run it as an Azure WebJob it never completes.
I get the two console.log statements output to the log, but since I cannot output anything from inside the Nightmare.js calls, nor display the Electron browser window, I have no idea what is going wrong.
There is also a web server in the script, omitted here, since it seems to take the request with the JSON object and pass it to createPage just fine.
I have verified that the index.html file is in the right directory. Does anyone know what might be wrong?
var Nightmare = require('nightmare'),
    http = require('http');

function createPage(o, final) {
    var start = new Date().getTime();
    var page = Nightmare({
        //show: true, //uncomment to show electron browser window
        //openDevTools: { mode: 'detach'}, //uncomment to open developer console ('show: true' needs to be set)
        gotoTimeout: 300000, //set timeout for .goto() to 5 minutes
        waitTimeout: 300000, //set timeout for .wait() to 5 minutes
        executionTimeout: 600000 //set timeout for .evaluate() to 10 minutes
    })
    .goto('file:\\\\' + __dirname + '\\index.html');

    page.wait("#ext-quicktips-tip") //wait till HTML is loaded
        .wait(function () { // wait till JS is loaded
            console.log('Extjs loaded.');
            return !!(Ext.isReady && window.App && App.app);
        });
    console.log("CreatePage()1");

    page.evaluate(function (template, form, lists, printOptions) {
        App.pdf.Builder.create({
            template: template,
            form: form,
            lists: lists,
            format: printOptions.format // only the injected arguments are visible inside evaluate
        });
        console.log('Create done');
    }, template, form, o.lists, printOptions);
    console.log("CreatePage()2");

    page.wait(function () {
        console.log('Content created. ' + App.pdf.Builder.ready);
        return App.pdf.Builder.ready;
    })
    .pdf(o.outputDir + form.filename, { "pageSize": "A4", "marginsType": 1 })
    .end()
    .then(function () {
        console.log('Pdf printed, time: ' + (new Date().getTime() - start) / 1000 + ' seconds');
        final(true);
    })
    .catch(function (err) {
        console.log('Print Error: ' + err.message);
    });
}
Solved
As Rick states in his answer, this will not currently work!
This document lists the current state of webjobs sandbox:
https://github.com/projectkudu/kudu/wiki/Azure-Web-App-sandbox
It has the following paragraph relating to my issue:
PDF generation from HTML
There are multiple libraries used to convert HTML to PDF. Many Windows/.NET specific versions leverage IE APIs and therefore leverage User32/GDI32 extensively. These APIs are largely blocked in the sandbox (regardless of plan) and therefore these frameworks do not work in the sandbox.
There are some frameworks that do not leverage User32/GDI32 extensively (wkhtmltopdf, for example) and we are working on enabling these in Basic+ the same way we enabled SQL Reporting.
I guess for nightmare.js to work you need desktop interaction, which you're not getting on a WebJob.
Taken from this issue on Github:
Nightmare isn't truly headless: it requires an Electron instance to work, which in turn requires a framebuffer to render properly (at least, for now).
This will not fly on an Azure WebJob.

Retrieve html content of a page several seconds after it's loaded

I'm writing a script in Node.js to automatically retrieve data from an online directory.
Since I had never done this before, I chose JavaScript because it is a language I use every day.
Based on the few tips I could find on Google, I used request with cheerio to easily access the DOM components of the page.
I found and retrieved all the necessary information; the only missing step is to retrieve the link to the next page, except that this link is generated 4 seconds after the page loads and contains a hash, so this step is unavoidable.
What I would like to do is get the DOM of the page 4-5 seconds after it has loaded, so I can retrieve that link.
I looked on the internet, and much of the advice was to use PhantomJS for this, but I cannot get it to work with Node despite many attempts.
This is my code :
#!/usr/bin/env node
require('babel-register');
import request from 'request'
import cheerio from 'cheerio'
import phantom from 'node-phantom'

phantom.create(function(err, ph) {
    return ph.createPage(function(err, page) {
        return page.open(url, function(err, status) {
            console.log("opened site? ", status);
            page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function(err) {
                // jQuery loaded.
                // Wait a bit for AJAX content to load on the page. Here, we are waiting 5 seconds.
                setTimeout(function() {
                    return page.evaluate(function() {
                        var tt = cheerio.load($this.html())
                        console.log(tt)
                    }, function(err, result) {
                        console.log(result);
                        ph.exit();
                    });
                }, 5000);
            });
        });
    });
});
But I get this error:
return ph.createPage(function (page) {
^
TypeError: ph.createPage is not a function
Is what I am trying to do the best way to achieve this? If not, what is the simplest way? If so, where does my error come from?
If you don't have to use PhantomJS, you can use Nightmare to do it.
It is a pretty neat library for solving problems like yours: it uses Electron as the web browser, and you can run it with or without showing the window (you can also open developer tools like in Google Chrome).
Its only flaw is that if you want to run it on a server without a graphical interface, you must install at least a framebuffer.
Nightmare has a wait(cssSelector) method that will wait until some element appears on the website.
Your code would be something like:
const Nightmare = require('nightmare');
const nightmare = Nightmare({
    show: true, // will show browser window
    openDevTools: true // will open dev tools in browser window
});

const url = 'http://hakier.pl';
const selector = '#someElementSelectorWhichWillAppearAfterSomeDelay';

nightmare
    .goto(url)
    .wait(selector)
    .evaluate(selector => {
        return {
            nextPage: document.querySelector(selector).getAttribute('href')
        };
    }, selector) // the selector variable is injected into the evaluate callback;
                 // the callback runs in the browser scope, so any Node.js variable
                 // you need inside it has to be injected like this
    .then(extracted => {
        console.log(extracted.nextPage); // your extracted data from evaluate
    });
Happy hacking!

Firefox addon sdk : Get header's loading time

I'm developing an add-on for Firefox that watches and filters headers, displaying the ones I'm interested in.
I need to get the header's loading time, and this is how I currently do it, which is quite crude:
function getHeaderInformations(httpHeader, count) {
    timer.setTimeout(function() {
        try {
            // this fails until the header is fully received
            var httpStatus = httpHeader.responseStatus;
        } catch (errStatus) {
            getHeaderInformations(httpHeader, ++count);
        }
    }, 1 /* millisecond */);
}
Every millisecond I test it; if it fails I retry, otherwise it means we have the header's status and the header is complete.
The problem is: I don't think this is a good way to get the header's loading time. I never get the same value as HTTPFox, Live HTTP Headers or the console, plus it's not reliable.
I've searched for suitable attributes on https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interface/nsIHttpChannel without success.
I've been through HTTPFox's code, but I must say I couldn't find how they obtain this information.
The question is: how can I get the header's loading time properly?
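One alternative to polling, sketched below under the assumption that this is an Add-on SDK based add-on: listen for the 'http-on-modify-request' and 'http-on-examine-response' observer notifications and take the difference between the two timestamps. Keying the map by URL is a simplification (parallel requests to the same URL will overwrite each other), and the measured value is "time until response headers are available", which may still differ from what HTTPFox reports:

var events = require("sdk/system/events");
var { Ci } = require("chrome");

var startTimes = {}; // URL -> timestamp of when the request was sent

events.on("http-on-modify-request", function(event) {
    var channel = event.subject.QueryInterface(Ci.nsIHttpChannel);
    startTimes[channel.URI.spec] = Date.now();
}, true);

events.on("http-on-examine-response", function(event) {
    var channel = event.subject.QueryInterface(Ci.nsIHttpChannel);
    var start = startTimes[channel.URI.spec];
    if (start !== undefined) {
        console.log(channel.URI.spec + " : headers received after " +
                    (Date.now() - start) + " ms");
        delete startTimes[channel.URI.spec];
    }
}, true);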

How to reduce latency and not miss events from server in Server Side Events

es.onmessage = function(e) {
    var newElement = document.createElement("li");
    newElement.innerHTML = e.data;
    eventList.appendChild(newElement);
};

es.onerror = function(e) {
    // what to add here to ensure I don't miss events sent during this time frame, or state?
};
There are situations in the communication between the browser and the server where the server is pushing data, as I can see in the server log, but the client doesn't display it. See a snapshot below:
127.0.0.1 - - [06/Mar/2014 21:31:23] "GET /updates/1 HTTP/1.1" 200 - 60.2671
As you can see, the URL /updates/1 is the EventSource(url) that I use, and 60.2671 is the duration in seconds. Is it bad that this is such a long request, or is it supposed to be that way?
This problem is solved. I modified my code to add the line:
EventMachine::PeriodicTimer.new(20) { out << "data: \n\n" } # required, otherwise the connection is closed in 30-60 sec
Adding this line ensures that the persistent connection that you keep_open stays open for as long as possible. Otherwise it times out and then reconnects, causing events to be missed.
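For the original concern in the onerror handler (not missing events while the connection is down), the standard Server-Sent Events mechanism is to give each event an id: field; on reconnect the browser automatically sends a Last-Event-ID request header, so the server can replay whatever was missed. A minimal client-side sketch (the endpoint and event payload are illustrative):

// Server side, each event should carry an id, e.g.:
//   id: 42
//   data: {"message": "hello"}
//   (blank line)
//
// When the connection drops and the browser reconnects, it sends
//   Last-Event-ID: 42
// so the server can resend every event with an id greater than 42.

var es = new EventSource('/updates/1');

es.onmessage = function(e) {
    // e.lastEventId holds the id of the most recently received event
    console.log('event ' + e.lastEventId + ': ' + e.data);
};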
