I'm looking for an example of requesting a webpage, waiting for the JavaScript to render (JavaScript modifies the DOM), and then grabbing the HTML of the page.
This should be a simple example with an obvious use case for PhantomJS. I can't find a decent example; the documentation seems to be all about command-line use.
From your comments, I'd guess you have 2 options:
Try to find a phantomjs node module - https://github.com/amir20/phantomjs-node
Run phantomjs as a child process inside node - http://nodejs.org/api/child_process.html
Edit:
It seems the child process is suggested by phantomjs as a way of interacting with node, see faq - http://code.google.com/p/phantomjs/wiki/FAQ
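For completeness, here's a minimal sketch of option 2 from the Node side, assuming phantomjs is on your PATH and a PhantomJS script saved as script.js (the edit below shows an example of such a script):
var spawn = require('child_process').spawn;
// Run PhantomJS as a child process and collect whatever it prints to stdout.
var phantom = spawn('phantomjs', ['script.js']);
phantom.stdout.on('data', function (data) {
    console.log(data.toString()); // e.g. the HTML printed by the PhantomJS script
});
phantom.on('exit', function (code) {
    console.log('phantomjs exited with code ' + code);
});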
Edit:
Example PhantomJS script for getting a page's HTML markup:
var page = require('webpage').create();
page.open('http://www.google.com', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var p = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML;
        });
        console.log(p);
    }
    phantom.exit();
});
With v2 of phantomjs-node it's pretty easy to print the HTML after it has been processed.
var phantom = require('phantom');

phantom.create().then(function (ph) {
    ph.createPage().then(function (page) {
        page.open('https://stackoverflow.com/').then(function (status) {
            console.log(status);
            page.property('content').then(function (content) {
                console.log(content);
                page.close();
                ph.exit();
            });
        });
    });
});
This will show the output as it would have been rendered with the browser.
Edit 2019:
You can use async/await:
const phantom = require('phantom');

(async function () {
    const instance = await phantom.create();
    const page = await instance.createPage();
    await page.on('onResourceRequested', function (requestData) {
        console.info('Requesting', requestData.url);
    });
    const status = await page.open('https://stackoverflow.com/');
    const content = await page.property('content');
    console.log(content);
    await instance.exit();
})();
Or if you just want to test, you can use npx:
npx phantom@latest https://stackoverflow.com/
I've used two different ways in the past, including the page.evaluate() method that queries the DOM, which Declan mentioned. The other way I've passed info from the web page is to print it with console.log() there, and in the phantomjs script use:
page.onConsoleMessage = function (msg, line, source) {
    console.log('console [' + source + ':' + line + ']> ' + msg);
};
I might also trap the variable msg in onConsoleMessage and search for some encapsulated data. It depends on how you want to use the output.
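For example, a sketch of that trapping idea (the PAYLOAD: prefix is an invented convention here; the page-side code would log JSON behind the same prefix):
page.onConsoleMessage = function (msg, line, source) {
    // Only messages starting with our sentinel carry structured data.
    if (msg.indexOf('PAYLOAD:') === 0) {
        var data = JSON.parse(msg.slice('PAYLOAD:'.length));
        // ... use the parsed data here ...
    } else {
        console.log('console [' + source + ':' + line + ']> ' + msg);
    }
};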
Then in the Node.js script, you would have to scan the output of the PhantomJS script:
var spawn = require('child_process').spawn;

var yourfunc = function(...params...) {
    var phantom = spawn('phantomjs', [...args]);
    phantom.stdout.setEncoding('utf8');
    phantom.stdout.on('data', function (data) {
        // parse or echo data
        var str_phantom_output = data.toString();
        // The above will get triggered one or more times, so you'll need to
        // add code to parse for whatever info you're expecting from the browser
    });
    phantom.stderr.on('data', function (data) {
        // do something with error data
    });
    phantom.on('exit', function (code) {
        if (code !== 0) {
            // console.log('phantomjs exited with code ' + code);
        } else {
            // clean exit: do something else such as a passed-in callback
        }
    });
};
Hope that helps some.
Why not just use this?
var page = require('webpage').create();
page.open("http://example.com", function (status) {
    if (status !== 'success') {
        console.log('FAIL to load the address');
    } else {
        console.log('Success in fetching the page');
        console.log(page.content);
    }
    phantom.exit();
});
Late update in case anyone stumbles on this question:
A project on GitHub developed by a colleague of mine exactly aims at helping you do that: https://github.com/vmeurisse/phantomCrawl.
It's still a bit young and certainly missing some documentation, but the example provided should help with basic crawling.
Here's an old version that I use, running Node, Express and PhantomJS, which saves the page out as a .png. You could tweak it fairly quickly to get the HTML.
https://github.com/wehrhaus/sitescrape.git
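In case the repo moves, the core of that approach is only a few lines of PhantomJS (the URL and output filename here are placeholders):
var page = require('webpage').create();
page.open('http://example.com', function (status) {
    page.render('screenshot.png'); // or console.log(page.content) to get the HTML instead
    phantom.exit();
});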
I have this UserScript in Tampermonkey that I'd like to convert into an extension:
let original_fetch = unsafeWindow.fetch;
unsafeWindow.fetch = async (url, init) => {
    let response = await original_fetch(url, init);
    let respo = response.clone();
    // console.log(url)
    if (url.includes("SomeString")) {
        respo.json().then((info) => {
            if (info.step === "lobby") {
                setTimeout(doSomething(info.data), 300);
            }
        });
    }
    return response;
};
Within my extension, I injected this code with a script element:
let channel = "customChannel";
const oldFetch = window.fetch;
window.fetch = function () {
    return new Promise((resolve, reject) => {
        oldFetch
            .apply(this, arguments)
            .then(async (response) => {
                const json = await response.clone().json();
                const detail = {
                    json,
                    fetch: {
                        url: response.url,
                        status: response.status,
                    },
                };
                window.dispatchEvent(new CustomEvent(channel, { detail }));
                resolve(response);
            })
            .catch((error) => {
                reject(error);
            });
    });
};
Everything works fine except that the UserScript intercepts more requests than the extension does.
Can someone explain why, and how I can fix it?
EDIT: problem solved.
My problem was caused by bad timing: the script was injected after the call was made.
Changing document.head.prepend(script); to document.documentElement.append(script); made it work as intended.
NOTE: Loading the injected script with script.src = ... or script.textContent = ... hasn't made a difference (I decided to use textContent as suggested by wOxxOm).
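For reference, a minimal sketch of that timing fix (the wrapper source itself is the fetch-wrapping code shown above):
// Injected from the content script at document_start.
// document.documentElement exists before <head> is populated, so the
// wrapper is installed before the page issues its first fetch.
const script = document.createElement('script');
script.textContent = '/* the fetch-wrapping code shown above */';
document.documentElement.append(script);
script.remove(); // the code has already run; the element is no longer needed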
Thanks to everyone who answered and helped me.
Update: I missed that the code was injected using a script element, so the following answer relates to the difference from content script injection.
They execute in different contexts/scopes.
WebExtension content scripts are injected into the content context.
Using unsafeWindow in a userscript runs the fetch in the page context, i.e. the same context as the webpage's own JavaScript (there are some differences between userscript managers, though).
Without unsafeWindow, running fetch is somewhat similar to running it in a WebExtension content script (again, with some differences between userscript managers).
In a WebExtension, if you want to run fetch in the page context, you can use window.wrappedJSObject, e.g. window.wrappedJSObject.fetch().
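A minimal sketch of that, for a Firefox content script (wrappedJSObject is Firefox-only; the URL is a placeholder):
// Calls the page's own fetch, so the request runs in the page context.
window.wrappedJSObject.fetch('https://example.com/api');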
We've got a Node.js script that is run once a minute to check the status of our apps. Usually, it works just fine. If the service is up, it exits with 0. If it's down, it exits with 1. All is well.
But every once in a while, it just kinda stops. The console reports "Calling status API..." and stops there indefinitely. It doesn't even time out at Node's built-in two-minute timeout. No errors, nothing. It just sits there, waiting, forever. This is a problem, because it blocks the following status check jobs from running.
At this point, my whole team has looked at it and none of us can figure out what circumstance could make it hang. We've built in a start-to-finish timeout, so that we can move on to the next job, but that essentially skips a status check and creates blind spots. So, I open the question to you fine folks.
Here's the script (with names/urls removed):
#!/usr/bin/env node

// SETTINGS: -------------------------------------------------------------------------------------------------

/** URL to contact for status information. */
const STATUS_API = process.env.STATUS_API;

/** Number of attempts to make before reporting as a failure. */
const ATTEMPT_LIMIT = 3;

/** Amount of time to wait before starting another attempt, in milliseconds. */
const ATTEMPT_DELAY = 5000;

// RUNTIME: --------------------------------------------------------------------------------------------------

const URL = require('url');
const https = require('https');

// Make the first attempt.
make_attempt(1, STATUS_API);

// FUNCTIONS: ------------------------------------------------------------------------------------------------

function make_attempt(attempt_number, url) {
    console.log('\n\nCONNECTION ATTEMPT:', attempt_number);
    check_status(url, function (success) {
        console.log('\nAttempt', success ? 'PASSED' : 'FAILED');
        // If this attempt succeeded, report success.
        if (success) {
            console.log('\nSTATUS CHECK PASSED after', attempt_number, 'attempt(s).');
            process.exit(0);
        }
        // Otherwise, if we have additional attempts, try again.
        else if (attempt_number < ATTEMPT_LIMIT) {
            setTimeout(make_attempt.bind(null, attempt_number + 1, url), ATTEMPT_DELAY);
        }
        // Otherwise, we're out of attempts. Report failure.
        else {
            console.log("\nSTATUS CHECK FAILED");
            process.exit(1);
        }
    });
}

function check_status(url, callback) {
    var handle_error = function (error) {
        console.log("\tFailed.\n");
        console.log('\t' + error.toString().replace(/\n\r?/g, '\n\t'));
        callback(false);
    };
    console.log("\tCalling status API...");
    try {
        var options = URL.parse(url);
        options.timeout = 20000;
        https.get(options, function (response) {
            var body = '';
            response.setEncoding('utf8');
            response.on('data', function (data) { body += data; });
            response.on('end', function () {
                console.log("\tConnected.\n");
                try {
                    var parsed = JSON.parse(body);
                    if (!parsed.started || !parsed.uptime) {
                        console.log('\tReceived unexpected JSON response:');
                        console.log('\t\t' + JSON.stringify(parsed, null, 1).replace(/\n\r?/g, '\n\t\t'));
                        callback(false);
                    }
                    else {
                        console.log('\tReceived status details from API:');
                        console.log('\t\tServer started:', parsed.started);
                        console.log('\t\tServer uptime:', parsed.uptime);
                        callback(true);
                    }
                }
                catch (error) {
                    console.log('\tReceived unexpected non-JSON response:');
                    console.log('\t\t' + body.trim().replace(/\n\r?/g, '\n\t\t'));
                    callback(false);
                }
            });
        }).on('error', handle_error);
    }
    catch (error) {
        handle_error(error);
    }
}
If any of you can see any places where this could possibly hang without output or timeout, that'd be very helpful!
Thank you,
James Tanner
EDIT: P.S. We use https directly, instead of request, so that we don't need to do any installation when the script runs. This is because the script can run on any build machine assigned to Jenkins without a custom installation.
Aren't you missing the .end()?
http.request(options, callback).end()
Something like explained here.
Inside your response callback you're not checking the status code.
The .on('error', handle_error); is for errors that occur while connecting to the server; status code errors are the ones the server responds with after a successful connection.
Normally a 200 status response is what you would expect from a successful request.
So a small mod to your https.get to handle this should do, e.g.:
https.get(options, function (response) {
    if (response.statusCode != 200) {
        console.log('\tHTTP statusCode not 200:');
        callback(false);
        return; // no point going any further
    }
    ....
I am trying to find a way to run npm test using mocha over an HTML DOM. In this case, I am using the global document to retrieve a table out of the DOM. However, when I run npm test I get an error like this:
ReferenceError: document is not defined
at /home/luiz/Projects/linguist-unknown/src/scripts/ling-loader.js:92:61
at extFunc (/home/luiz/Projects/linguist-unknown/src/scripts/ling-loader.js:49:11)
at Array.every (native)
at Utilities.tryMatchUrlExtension (/home/luiz/Projects/linguist-unknown/src/scripts/ling-loader.js:60:25)
at Utilities.<anonymous> (/home/luiz/Projects/linguist-unknown/src/scripts/ling-loader.js:90:16)
at xhr.onload (/home/luiz/Projects/linguist-unknown/src/scripts/ling-loader.js:24:11)
at dispatchEvent (/home/luiz/Projects/linguist-unknown/node_modules/xmlhttprequest/lib/XMLHttpRequest.js:591:25)
at setState (/home/luiz/Projects/linguist-unknown/node_modules/xmlhttprequest/lib/XMLHttpRequest.js:614:14)
at IncomingMessage.<anonymous> (/home/luiz/Projects/linguist-unknown/node_modules/xmlhttprequest/lib/XMLHttpRequest.js:447:13)
at emitNone (events.js:91:20)
at IncomingMessage.emit (events.js:185:7)
at endReadableNT (_stream_readable.js:974:12)
at _combinedTickCallback (internal/process/next_tick.js:80:11)
at process._tickCallback (internal/process/next_tick.js:104:9)
1) should refresh table
16 passing (3s)
1 failing
1) Loader Utilities should refresh table:
Error: Timeout of 2000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves.
I understand that document is undefined and that I need to, somehow, create one myself. However, I believe that my main problems are:
It's my first time using npm and mocha, and I cannot find anything related to this in their documentation.
Most of the problems people have with this are related to web browsers, whereas I am using the CLI; the tests will run with Travis on GitHub.
In my code below you'll see that I solved a similar problem with XMLHttpRequest. However, I just can't figure out the best approach for including the document variable properly in my tests.
So pardon me if this has already been answered on Stack Overflow.
My code is the following:
test-utilities.js
...
global.XMLHttpRequest = require('xmlhttprequest').XMLHttpRequest;
global.jsyaml = require('../src/scripts-min/js-yaml.min.js');
global.LinguistHighlighter = require('../src/scripts/ling-highlighter.js').LinguistHighlighter;
var LinguistLoader = require('../src/scripts/ling-loader.js').LinguistLoader;
describe('Loader', function () {
    var utilities = new LinguistLoader.Utilities();

    it('should refresh table', function (done) {
        var location = {
            hostname: "github.com",
            href: "https://github.com/github-aux/linguist-unknown/blob/chrome/examples/Brain/human_jump.brain",
            pathname: "/github-aux/linguist-unknown/blob/chrome/examples/Brain/human_jump.brain"
        };
        // check if it is not breaking
        utilities.refresh(location, function (langObj, table) {
            done();
        });
    });
});
...
utilities.js:
...
Utilities.prototype.refresh = function (location, callback) {
    var new_url = location.href;
    if (new_url === current_url || !this.isGithub(location)) {
        return;
    }
    current_url = new_url;
    if (linguistObj === null) {
        linguistObj = {
            path: this.getPossibleFilepath(location)
        };
    }
    setTimeout(function () {
        var downloadHelper = new DownloadHelper();
        downloadHelper.load(linguistObj.path, function (objs) {
            this.tryMatchUrlExtension(current_url, objs, function (langObj) {
                var table = document.getElementsByClassName("blob-wrapper")[0]
                    .getElementsByTagName("table")[0];
                new LinguistHighlighter.Highlighter(langObj).draw(table);
                // callback for test purposes only
                if (callback) {
                    callback(langObj, table);
                }
            });
        }.bind(this));
    }.bind(this), 100);
};
...
Any help is appreciated. Thank you!
I found a very good tool: JSDOM. Its goal is to emulate a subset of a web browser, such as the DOM. With it, I could implement my test-utilities.js file without even touching my utilities.js file, which is pretty much what I wanted.
Here is the resulting test-utilities.js file:
const jsdom = require("jsdom");
const { JSDOM } = jsdom;

global.XMLHttpRequest = require('xmlhttprequest').XMLHttpRequest;
global.jsyaml = require('../src/scripts-min/js-yaml.min.js');
global.LinguistHighlighter = require('../src/scripts/ling-highlighter.js').LinguistHighlighter;
var LinguistLoader = require('../src/scripts/ling-loader.js').LinguistLoader;

describe('Loader', function () {
    var utilities = new LinguistLoader.Utilities();

    it('should refresh the code table', function (done) {
        // Download the HTML string and parse it with JSDOM
        JSDOM.fromURL("https://github.com/github-aux/linguist-unknown/blob/chrome/examples/Brain/human_jump.brain").then(dom => {
            // Expose the emulated document to the code under test.
            global.document = dom.window.document;

            // JSDOM does not support 'innerText', which is why I am creating this property for all objects.
            var o = Object.prototype;
            Object.defineProperty(o, "innerText", {
                get: function jaca() {
                    if (this.innerHTML === undefined)
                        return "";
                    return this.innerHTML;
                }
            });

            var location = {
                hostname: "github.com",
                href: "https://github.com/github-aux/linguist-unknown/blob/chrome/examples/Brain/human_jump.brain",
                pathname: "/github-aux/linguist-unknown/blob/chrome/examples/Brain/human_jump.brain"
            };

            // check if it is not breaking
            utilities.refresh(location, function (langObj, table) {
                done();
            });
        });
    });
});
That is working properly now! I hope it helps anyone! :D
I am running node.js on raspbian and trying to save/update a file every 2/3 seconds using the following code:
var fs = require('fs');
var path = require('path');

var saveFileSaving = false;

function loop() {
    mainLoop = setTimeout(function () {
        // update data
        saveSaveFile(data, function () {
            // console.log("Saved data to file");
            loop();
        });
    }, 1500);
}

function saveSaveFile(data, callback) {
    if (!saveFileSaving) {
        saveFileSaving = true;
        var wstream = fs.createWriteStream(path.join(__dirname, 'save.json'));
        wstream.on('finish', function () {
            saveFileSaving = false;
            callback(data);
        });
        wstream.on('error', function (error) {
            console.log(error);
            saveFileSaving = false;
            wstream.end();
            callback(null);
        });
        wstream.write(JSON.stringify(data));
        wstream.end();
    } else {
        callback(null);
    }
}
When I run this it works fine for an hour then starts spitting out:
[25/May/2016 11:3:4 am] { [Error: EROFS, open '<path to file>']
errno: 56,
code: 'EROFS',
path: '<path to file>' }
I have tried the jsonfile plugin, which also gives a similar write error after an hour.
I have tried both fileSystem.writeFile and fileSystem.writeFileSync, and both give the same error after an hour.
I was thinking it had to do with the file handle not being released before a new save occurs, which is why I started using the saveFileSaving flag.
Resetting the system via hard reset fixes the issue (soft reset does not work as the system seems to be locked up).
Any suggestions, guys? I have searched the web and only found one other question, slightly similar, from 4 years ago, which was left in limbo.
Note: I am using the callback function from the code to continue with the main loop.
I was able to get this working by unlinking the file and re-saving it on every save; while it is not pretty, it works and shouldn't cause too much overhead.
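A minimal sketch of that unlink-then-write approach, assuming the same save.json path and data object as in the question:
var fs = require('fs');
var path = require('path');

function saveSaveFile(data, callback) {
    var file = path.join(__dirname, 'save.json');
    // Remove the old file first; the error on the very first run (no such file) is ignorable.
    fs.unlink(file, function () {
        fs.writeFile(file, JSON.stringify(data), function (err) {
            if (err) console.log(err);
            callback(err ? null : data);
        });
    });
}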
I also added a backup solution which saves a backup every 5 minutes in case the save file has issues.
Thank you for everyone's help.
Here are my ideas:
1) Check free space when this problem happens by typing in a terminal:
df -h
2) Also check whether the file is editable when the problem occurs, with nano or vim etc.
3) Your code is too complicated for simply scheduling data manipulation and writing it to a file. If the file is busy (saveFileSaving), you will lose data until the next iteration. Try this code instead:
var
    async = require('async'),
    fs = require('fs'),
    path = require('path');

async.forever(function (next) {
    // some data manipulation
    try {
        fs.writeFileSync(path.join(__dirname, 'save.json'), JSON.stringify(data));
    }
    catch (ex) {
        console.error('Error writing data to file:', ex);
    }
    setTimeout(next, 2000);
});
4) How about keeping the file descriptor open?
var
    async = require('async'),
    fs = require('fs'),
    path = require('path');

var file = fs.createWriteStream(path.join(__dirname, 'save.json'));

async.forever(function (next) {
    // some data manipulation
    file.write(JSON.stringify(data));
    setTimeout(next, 2000);
});

var handleSignal = function (exc) {
    // close file
    file.end();
    if (exc) {
        console.log('STOPPING PROCESS BECAUSE OF:', exc);
    }
    process.exit(-1);
};

process.on('uncaughtException', handleSignal);
process.on('SIGHUP', handleSignal);
5) There could be hardware or software problems (maybe OS drivers) with the Raspberry Pi's storage controller.
I am trying to get amazon pricing information with nodejs.
Here's the target url:
http://aws.amazon.com/ec2/pricing/
But the pricing tables are not fully rendered in the content I read in Node.js; there is only JavaScript.
So far I have used jsdom, jquerygo and phantom but I was not successful. Even setting timeouts does not help. Can anyone please provide me with a working solution for this specific case?
Thanks and best regards.
There are different ways to scrape a web page using Node.js.
I was inspired by SpookyJS:
var Spooky = require('spooky');

var spooky = new Spooky({
    child: {
        transport: 'http'
    },
    casper: {
        logLevel: 'debug',
        verbose: true
    }
}, function (err) {
    if (err) {
        var e = new Error('Failed to initialize SpookyJS');
        e.details = err;
        throw e;
    }
    spooky.start(
        'http://en.wikipedia.org/wiki/Spooky_the_Tuff_Little_Ghost');
    spooky.then(function () {
        this.emit('hello', 'Hello, from ' + this.evaluate(function () {
            return document.title;
        }));
    });
    spooky.run();
});

spooky.on('error', function (e, stack) {
    console.error(e);
    if (stack) {
        console.log(stack);
    }
});

spooky.on('console', function (line) {
    console.log(line);
});

spooky.on('hello', function (greeting) {
    console.log(greeting);
});

spooky.on('log', function (log) {
    if (log.space === 'remote') {
        console.log(log.message.replace(/ \- .*/, ''));
    }
});
Note: This gives you the flexibility to run CasperJS and PhantomJS using Node.js.
This solved my issue:
I noticed that when installing the phantom module in Node, it was complaining about the version of PhantomJS (version 2) and was downloading version 1.9.8 to some temporary location.
So I installed version 1.9.8 instead and set the PATH variable to it. And it worked!
I must also note that inside the page.open(...) callback you must set a timeout for quite a long time (in my case about 35 seconds) so that the whole page is fully loaded and rendered.
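A minimal sketch of that long-delay approach (35 seconds is what worked in my case; adjust for your page):
var page = require('webpage').create();
page.open('http://aws.amazon.com/ec2/pricing/', function (status) {
    if (status !== 'success') {
        console.log('FAIL to load the address');
        phantom.exit(1);
        return;
    }
    // Give the page's JavaScript time to build the pricing tables.
    setTimeout(function () {
        console.log(page.content); // the DOM after scripts have run
        phantom.exit();
    }, 35000);
});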