How to print html source to console with phantom-crawler

I just downloaded and installed phantom-crawler for Node.js. I copied and pasted the following script into a file called crawler.js:
var Crawler = require('phantom-crawler');
// Can be initialized with optional options object
var crawler = new Crawler();
// queue is an array of URLs to be crawled
crawler.queue.push('https://google.com/');
// Can also do `crawler.fetch(url)` instead of pushing it and crawling it
// Extract plainText out of each phantomjs page
Promise.all(crawler.crawl())
.then(function(pages) {
var texts = [];
for (var i = 0; i < pages.length; i++) {
var page = pages[i];
// suffix Promise to return promises instead of callbacks
var text = page.getPromise('plainText');
texts.push(text);
text.then(function(p) {
return function() {
// Pages are like tabs, they should be closed
p.close()
}
}(page));
}
return Promise.all(texts);
})
.then(function(texts) {
// texts = array of plaintext from the website bodies
// also supports ajax requests
console.log(texts);
})
.then(function () {
// kill that phantomjs bridge
crawler.phantom.then(function (p) {
p.exit();
});
})
I'd like to print the complete html source (in this case from the google page) to the console.
I searched a lot, but I haven't found anything similar, so how do I do this?

Request the content property instead of plainText in the getPromise call.
The phantom-crawler module uses node-phantom-simple, which in turn drives PhantomJS.
The PhantomJS wiki lists the page properties you can request.
var Crawler = require('phantom-crawler');
// Can be initialized with optional options object
var crawler = new Crawler();
// queue is an array of URLs to be crawled
crawler.queue.push('https://google.com/');
// Can also do `crawler.fetch(url)` instead of pushing it and crawling it
// Extract the full HTML content of each phantomjs page
Promise.all(crawler.crawl())
.then(function(pages) {
var allHtml = [];
for (var i = 0; i < pages.length; i++) {
var page = pages[i];
// suffix Promise to return promises instead of callbacks
var html = page.getPromise('content');
allHtml.push(html);
html.then(function(p) {
return function() {
// Pages are like tabs, they should be closed
p.close()
}
}(page));
}
return Promise.all(allHtml);
})
.then(function(allHtml) {
// allHtml = array of HTML source strings, one per crawled page
// also supports ajax requests
console.log(allHtml);
})
.then(function () {
// kill that phantomjs bridge
crawler.phantom.then(function (p) {
p.exit();
});
})
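
The same loop can be written more compactly. The following is a minimal sketch, not phantom-crawler's documented API: it assumes only the two page methods already used above (getPromise(name) and close()), and fakePage is a hypothetical stand-in so that the snippet runs without PhantomJS.

```javascript
// Collect the HTML of every page, closing each page once its content is read.
// Assumes each page exposes getPromise(name) and close(), as in the answer above.
function collectHtml(pages) {
  return Promise.all(pages.map(function (page) {
    return page.getPromise('content').then(function (html) {
      page.close(); // pages are like tabs, so close them when done
      return html;
    });
  }));
}

// Hypothetical stand-in for a PhantomJS page, for illustration only.
function fakePage(html) {
  return {
    closed: false,
    getPromise: function (name) {
      return Promise.resolve(name === 'content' ? html : '');
    },
    close: function () { this.closed = true; }
  };
}

var pages = [
  fakePage('<html><body>a</body></html>'),
  fakePage('<html><body>b</body></html>')
];
var demo = collectHtml(pages).then(function (allHtml) {
  // Print each page's source on its own line instead of logging the array.
  allHtml.forEach(function (html) { console.log(html); });
  return allHtml;
});
```

With the real crawler, something like Promise.all(crawler.crawl()).then(collectHtml) would presumably take the place of the fake pages.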

Related

Array being returned as a string

While experimenting with an Adobe extension for After Effects, I'm using JSX to interface with the software and send information to a JS file, which updates the HTML UI panel.
At this point, I'm just trying to console.log some information coming back from the software (composition and layer names).
My problem is that what is supposed to be an array of strings, which I can work with on the JSX side, arrives as a single string once it is returned to my JS file.
Here is the flow:
main.js
(function() {
'use strict';
var csInterface = new CSInterface();
function init() {
themeManager.init();
$("#btn_test").click(function() {
csInterface.evalScript('main()', function(res) {
console.log(typeof(res));
console.log(res);
});
});
}
init();
}());
hostscript.jsx
function main() {
var project = app.project;
var comp = project.activeItem;
var layers = comp.selectedLayers;
var layersNames = [];
for (var i = 0; i < layers.length; i++) {
layersNames.push(layers[i].name);
}
return layersNames;
}
So in my console I get both pieces of information:
typeof is string
logging res is name1,name2
Any ideas? I can't find a targeted answer on Stack Overflow or on any other Adobe-related forums.
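
One plausible explanation, offered as an assumption rather than confirmed Adobe behaviour: if the bridge between the two scripts can only carry strings, the array is converted to a string on the way back, and an array converted to a string is comma-joined, which matches the name1,name2 output. The sketch below reproduces that conversion in plain JavaScript and shows a JSON round trip that preserves the array. hostReturn is a hypothetical stand-in for the value returned by main() in hostscript.jsx.

```javascript
// Hypothetical stand-in for the array built in hostscript.jsx.
var hostReturn = ['name1', 'name2'];

// Converting an array to a string joins its elements with commas.
var asString = String(hostReturn);
console.log(typeof asString, asString); // string name1,name2

// Serializing explicitly keeps the structure intact across a string-only channel.
var serialized = JSON.stringify(hostReturn);
var restored = JSON.parse(serialized);
console.log(Array.isArray(restored), restored.length); // true 2
```

If this is the cause, returning JSON.stringify(layersNames) from the host script and calling JSON.parse(res) in the callback would be the corresponding change. Whether JSON is available on the JSX side without a polyfill is an assumption to verify.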

Closures, loops, and promises

I've read several posts on the subject, but I'm still struggling with my situation.
I am pulling some URLs from an array of URLs and then using those links to obtain (let's say) the title of the page from those subsites. At the end I want a list of the original URLs, each with an array of titles (from the sublinks). That is: go into a site/domain, find some links, and fetch those links to find the page titles.
The problem is that my titleArray is just a Promise and not the actual data. I'm clearly not getting closures and promises right. The code runs in Node as is. I'm using personal sites in my real code, but substituted common sites to show an example.
const popsicle = require('popsicle');
var sites = ['http://www.google.com','http://www.cnn.com'];
// loop through my sites (here I'm just using google and cnn as a test
for(var i=0;i<sites.length;i++) {
// call function to pull subsites and get titles from them
var titleArray = processSites(sites[i]);
console.log(sites[i] + ": " + titleArray);
}
// get request on site and then get subsites
function processSites(url) {
return popsicle.get(url)
.then(function(res) {
var data = res.body;
// let's assume I get another collection of URLs
// that I pull from the main site
var subUrls = ['http://www.yahoo.com','http://www.espn.com'];
var titleArray = [];
for(var j=0;j<subUrls.length;i=j++) {
var title = processSubSites(subUrls[j])
titleArray.push(title);
}
return titleArray;
});
}
function processSubSites(url) {
return popsicle.get(url)
.then(function(res) {
var data = res.body;
// let's say I pull the title of the site somehow
var title = "The Title for " + url;
console.log(title);
return title;
});
}
The result after running this is:
http://www.google.com: [object Promise]
http://www.cnn.com: [object Promise]
The Title for http://www.espn.com
The Title for http://www.espn.com
The Title for http://www.yahoo.com
The Title for http://www.yahoo.com
whereas it should be:
http://www.google.com: ['The Title for http://www.yahoo.com', 'The Title for http://www.espn.com']
http://www.cnn.com: ['The Title for http://www.yahoo.com', 'The Title for http://www.espn.com']
...

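A function that does asynchronous work cannot hand its result back synchronously; it returns a Promise, and the caller has to chain .then() to read the value. That is why printing the return value of processSites shows [object Promise]. To wait for several Promises created in a loop, push them into an array and call Promise.all(array), which resolves to an array of their results in the same order.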
const popsicle = require('popsicle');
var sites = ['http://www.google.com','http://www.cnn.com'];
var titleArrayPromises = [];
// loop through my sites (here I'm just using google and cnn as a test
for(var i=0;i<sites.length;i++) {
titleArrayPromises.push(processSites(sites[i]));
}
var titleArray = [];
Promise.all(titleArrayPromises).then(function (titleArrays) {
for(var i=0; i<titleArrays.length; i++) {
// concat returns a new array instead of modifying titleArray, so reassign it
titleArray = titleArray.concat(titleArrays[i]);
}
// You now have all the titles from your site list in the titleArray
})
function processSites(url) {
return popsicle.get(url)
.then(function(res) {
var data = res.body;
// let's assume I get another collection of URLs
// that I pull from the main site
var subUrls = ['http://www.yahoo.com','http://www.espn.com'];
var titlePromiseArray = [];
for(var j=0;j<subUrls.length;j++) {
titlePromiseArray.push(processSubSites(subUrls[j]));
}
return Promise.all(titlePromiseArray);
});
}
function processSubSites(url) {
return popsicle.get(url)
.then(function(res) {
var data = res.body;
// let's say I pull the title of the site somehow
var title = "The Title for " + url;
return Promise.resolve(title);
});
}
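
The nesting above can be tried without network access. In this sketch fakeGet is a hypothetical replacement for popsicle.get, and the sub-URL list is hard-coded as in the question. It shows that a plain value returned from a .then callback is passed along the chain, and that Promise.all keeps the results in input order, so each site can be paired with its own title array.

```javascript
// Hypothetical stand-in for popsicle.get: resolves with a fake response object.
function fakeGet(url) {
  return Promise.resolve({ body: '<title>' + url + '</title>' });
}

function getTitle(url) {
  return fakeGet(url).then(function (res) {
    // A plain value returned here becomes the resolved value of the promise.
    return 'The Title for ' + url;
  });
}

function getTitles(site) {
  return fakeGet(site).then(function (res) {
    var subUrls = ['http://www.yahoo.com', 'http://www.espn.com'];
    // Returning a promise from .then makes the outer promise wait for it.
    return Promise.all(subUrls.map(getTitle));
  });
}

var sites = ['http://www.google.com', 'http://www.cnn.com'];
var demo = Promise.all(sites.map(getTitles)).then(function (titleArrays) {
  // titleArrays[i] holds the titles that belong to sites[i].
  sites.forEach(function (site, i) {
    console.log(site + ': ' + JSON.stringify(titleArrays[i]));
  });
  return titleArrays;
});
```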

How to wait for multiple WebWorkers in a loop

I have the following issue with Web Workers in JS. I have a heavy duty application doing some simulation. The code runs in multiple Web Workers.
The main thread is running on a WebPage. But could also be a Web Worker, if it makes sense.
Example:
var myWebWorkers = [];
function openWorker(workerCount){
for(var i = 0; i < workerCount; i++){
myWebWorkers[i] = new Worker('worker.js');
myWebWorkers[i].onmessage = function(e){
this.result = e.data;
this.isReady = true;
}
}
}
function setWorkerData(somedata){
// somedata.length is always a multiple of myWebWorkers.length
var elementCntPerWorker = somedata.length / myWebWorkers.length;
myWebWorkers.forEach(function(worker, index){
worker.isReady = false;
worker.postMessage(
somedata.slice(index * elementCntPerWorker,
(index + 1) * elementCntPerWorker - 1));
});
}
var somedata = [...];
openWorker(8);
for(var i = 0; i < 10000; i++){
setWorkerData(somedata);
waitUntilWorkersAreDoneButAllowBrowserToReact(myWebWorkers);
if(x % 100) updateSVGonWebPage
}
function waitUntilWorkersAreDoneButAllowBrowserToReact(){
/* wait for all myWebWorkers-onchange event, but
allow browser to react and don't block a full Web Worker
Following example is my intension. But will not work, because
events are not executed until code excution stops.
*/
somedata = [];
for(var i = 0; i < myWebWorkers.length; i++){
while(!myWebWorkers[i].isReady);
somedata = somedata.concat(myWebWorkers.result);
}
}
What I really need is the waitUntilWorkersAreDoneButAllowBrowserToReact function, or a concept to get this running. Every search regarding mutexes, sleep, etc. ends in sentences like: "JS is single threaded", "This will only work if you are not in a loop", "There is no reason to have a sleep function".
Even when passing the main task to another Worker, I have the problem that this thread spends 100% of its time checking whether the others are ready, which is a waste of energy and processing power.
I would love to have a blocking function like myWebWorker.waitForReady(), which would still allow events to be handled. This would bring JavaScript to the next level. But maybe I missed a simple concept that does exactly this.
Thank you!

"I would love to have a blocking function like myWebWorker.waitForReady()"
No, that's not possible. All the statements you researched are correct, web workers stay asynchronous and will only communicate by messages. There is no waiting for events, not even on worker threads.
You will want to use promises for this:
function createWorkers(workerCount, src) {
var workers = new Array(workerCount);
for (var i = 0; i < workerCount; i++) {
workers[i] = new Worker(src);
}
return workers;
}
function doWork(worker, data) {
return new Promise(function(resolve, reject) {
worker.onmessage = resolve;
worker.postMessage(data);
});
}
function doDistributedWork(workers, data) {
// data size is always a multiple of the number of workers
var elementsPerWorker = data.length / workers.length;
return Promise.all(workers.map(function(worker, index) {
var start = index * elementsPerWorker;
return doWork(worker, data.slice(start, start+elementsPerWorker));
}));
}
var myWebWorkers = createWorkers(8, 'worker.js');
var somedata = [...];
function step(i) {
if (i <= 0)
return Promise.resolve("done!");
return doDistributedWork(myWebWorkers, somedata)
.then(function(results) {
if (i % 100 === 0) // update only on every 100th step
updateSVGonWebPage();
return step(i-1)
});
}
step(1000).then(console.log);
Promise.all does the magic of waiting for concurrently running results, and the step function does the asynchronous looping using a recursive approach.
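
For comparison, here is a minimal sketch of the same idea using async/await in place of the recursive step function. That is a swapped-in technique, not what the answer above uses. FakeWorker is a hypothetical stand-in for a real Web Worker so that the snippet also runs outside a browser, and doWork resolves with e.data rather than with the whole event.

```javascript
// Hypothetical stand-in for a Web Worker: replies with the sum of its chunk.
function FakeWorker() { this.onmessage = null; }
FakeWorker.prototype.postMessage = function (chunk) {
  var self = this;
  setTimeout(function () {
    var sum = chunk.reduce(function (a, b) { return a + b; }, 0);
    self.onmessage({ data: sum });
  }, 0);
};

// Wrap one request/response round trip in a promise.
function doWork(worker, chunk) {
  return new Promise(function (resolve) {
    worker.onmessage = function (e) { resolve(e.data); };
    worker.postMessage(chunk);
  });
}

// Split the data evenly and wait for every worker to answer.
function doDistributedWork(workers, data) {
  var size = data.length / workers.length;
  return Promise.all(workers.map(function (worker, index) {
    return doWork(worker, data.slice(index * size, (index + 1) * size));
  }));
}

// Run the steps one after another; the event loop stays free between awaits.
async function run(workers, data, steps) {
  var last = null;
  for (var i = 0; i < steps; i++) {
    last = await doDistributedWork(workers, data);
  }
  return last;
}

var workers = [new FakeWorker(), new FakeWorker()];
var demo = run(workers, [1, 2, 3, 4], 3);
demo.then(function (results) { console.log(results); });
```

Each await suspends only the run function, so the browser can still handle events and repaint between steps.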

Spotify API Create Temp Playlist Not Loading

I'm making a little app that displays a list of the top first song of an artist's related artists. When I try and load my app for the first time, it shows nothing. But, when I "Reload Application" everything seems to work. When I constantly start "Reloading" it keeps adding more of the same tracks to the list as well.
How do I stop it from continually appending more tracks to the list as well as tighten up the code so that it works on load?
require([
'$api/models',
'$views/list#List',
'$api/toplists#Toplist'
], function(models, List, Toplist){
'use strict';
// Build playlist
function buildList(trackURIArray){
var arr = trackURIArray;
models.Playlist
.createTemporary("myTempList")
.done(function(playlist){
playlist.load("tracks").done(function(loadedPlaylist){
for(var i = 0; i < arr.length; i++){
loadedPlaylist.tracks.add(models.Track.fromURI(arr[i]));
}
});
// Create list
var list = List.forPlaylist(playlist,{
style:'rounded'
});
$('#playlistContainer').append(list.node);
list.init();
});
}
// Get top track
function getTopTrack(artist, num, callback){
var artistTopList = Toplist.forArtist(artist);
artistTopList.tracks.snapshot(0, num).done(function (snapshot){
snapshot.loadAll('name').done(function(tracks){
var i, num_toptracks;
num_toptracks = num;
for(i = 0; i < num_toptracks; i++){
callback(artist, tracks[i]);
}
});
});
}
// Get Related
function getRelated(artist_uri){
var artist_properties = ['name', 'popularity', 'related', 'uri'];
models.Artist
.fromURI(artist_uri)
.load(artist_properties)
.done(function (artist){
artist.related.snapshot().done(function(snapshot){
snapshot.loadAll('name').done(function(artists){
var temp = [];
for(var i = 0; i < artists.length; i++){
getTopTrack(artists[i], 1, function(artist, toptrack){
var p, n, u;
p = artist.popularity;
n = artist.name;
u = artist.uri;
temp.push(toptrack.uri);
});
}
// Build a list of these tracks
buildList(temp);
});
});
});
}
getRelated('spotify:artist:2VAvhf61GgLYmC6C8anyX1');
});

By using Promises you can delay the rendering of the list until you have successfully composed the temporary list with your tracks. Also, in order to prevent the addition of repeated tracks on reload, assign a unique name to your temporary playlist.
require([
'$api/models',
'$views/list#List',
'$api/toplists#Toplist'
], function (models, List, Toplist) {
'use strict';
// Build playlist
function buildList(trackURIArray) {
var arr = trackURIArray;
models.Playlist
.createTemporary("myTempList_" + new Date().getTime())
.done(function (playlist) {
playlist.load("tracks").done(function () {
playlist.tracks.add.apply(playlist.tracks, arr).done(function () {
// Create list
var list = List.forCollection(playlist, {
style: 'rounded'
});
// jQuery objects have append(), not the DOM method appendChild()
$('#playlistContainer').append(list.node);
list.init();
});
});
});
}
// Get top track
function getTopTrack(artist, num) {
var promise = new models.Promise();
var artistTopList = Toplist.forArtist(artist);
artistTopList.tracks.snapshot(0, num).done(function (snapshot) {
snapshot.loadAll().done(function (tracks) {
promise.setDone(tracks[0]);
}).fail(function (f) {
promise.setFail(f);
});
});
return promise;
}
// Get Related
function getRelated(artist_uri) {
models.Artist
.fromURI(artist_uri)
.load('related')
.done(function (artist) {
artist.related.snapshot().done(function (snapshot) {
snapshot.loadAll().done(function (artists) {
var promises = [];
for (var i = 0; i < artists.length; i++) {
var promise = getTopTrack(artists[i], 1);
promises.push(promise);
}
models.Promise.join(promises)
.done(function (tracks) {
console.log('Loaded all tracks', tracks);
})
.fail(function (tracks) {
console.error('Failed to load at least one track.', tracks);
})
.always(function (tracks) {
// filter out results from failed promises
buildList(tracks.filter(function(t) {
return t !== undefined;
}));
});
});
});
});
}
getRelated('spotify:artist:2VAvhf61GgLYmC6C8anyX1');
});

The way I think about stuff like this is to imagine I'm on a super slow connection. If every callback (done, or the function passed to getTopTrack) took 2 seconds to respond, how do I need to structure my code to handle that?
How does that apply here? Well, when you call buildList, temp is actually empty. I suspect if you created the playlist first in getRelated, then added songs to it in your callback for getTopTrack, then it would work because the List would keep itself up to date.
Alternatively, you could rework getTopTrack to return a Promise, join all the top track promises together (see the Promise docs on each() and join()), then build the list when they're all complete.
As far as why you're getting multiple lists, it's because you append a new List each time you call buildList. Though I'm not seeing this behavior when I threw the code as is into my playground area. It only happens once, and when I reload application it starts from scratch. Perhaps you have a reload button which is calling getRelated.
Update: I've been trying to get this to work and am having lots of trouble. I tried calling list.refresh after each add. I'm trying a Promise-based method now, but still can't get the List to show anything.

How does jQuery do async:false in its $.ajax method?

I have a similar question here, but I thought I'd ask it a different way to cast a wider net. I haven't come across a workable solution yet (that I know of).
I'd like for XCode to issue a JavaScript command and get a return value back from an executeSql callback.
From the research that I've been reading, I can't issue a synchronous executeSql command. The closest I came was trying to Spin Lock until I got the callback. But that hasn't worked yet either. Maybe my spinning isn't giving the callback chance to come back (See code below).
Q: How can jQuery have an async=false argument when it comes to Ajax? Is there something different about XHR than there is about the executeSql command?
Here is my proof-of-concept so far: (Please don't laugh)
// First define any dom elements that are referenced more than once.
var dom = {};
dom.TestID = $('#TestID'); // <input id="TestID">
dom.msg = $('#msg'); // <div id="msg"></div>
window.dbo = openDatabase('POC','1.0','Proof-Of-Concept', 1024*1024); // 1MB
!function($, window, undefined) {
var Variables = {}; // Variables that are to be passed from one function to another.
Variables.Ready = new $.Deferred();
Variables.DropTableDeferred = new $.Deferred();
Variables.CreateTableDeferred = new $.Deferred();
window.dbo.transaction(function(myTrans) {
myTrans.executeSql(
'drop table Test;',
[],
Variables.DropTableDeferred.resolve()
// ,WebSqlError
);
});
$.when(Variables.DropTableDeferred).done(function() {
window.dbo.transaction(function(myTrans) {
myTrans.executeSql(
'CREATE TABLE IF NOT EXISTS Test'
+ '(TestID Integer NOT NULL PRIMARY KEY'
+ ',TestSort Int'
+ ');',
[],
Variables.CreateTableDeferred.resolve(),
WebSqlError
);
});
});
$.when(Variables.CreateTableDeferred).done(function() {
for (var i=0;i < 10;i++) {
myFunction(i);
};
Variables.Ready.resolve();
function myFunction(i) {
window.dbo.transaction(function(myTrans) {
myTrans.executeSql(
'INSERT INTO Test(TestID,TestSort) VALUES(?,?)',
[
i
,i+100000
]
,function() {}
,WebSqlError
)
});
};
});
$.when(Variables.Ready).done(function() {
$('#Save').removeAttr('disabled');
});
}(jQuery, window);
!function($, window, undefined) {
var Variables = {};
$(document).on('click','#Save',function() {
var local = {};
local.result = barcode.Scan(dom.TestID.val());
console.log(local.result);
});
var mySuccess = function(transaction, argument) {
var local = {};
for (local.i=0; local.i < argument.rows.length; local.i++) {
local.qry = argument.rows.item(local.i);
Variables.result = local.qry.TestSort;
}
Variables.Return = true;
};
var myError = function(transaction, argument) {
dom.msg.text(argument.message);
Variables.result = '';
Variables.Return = true;
}
var barcode = {};
barcode.Scan = function(argument) {
var local = {};
Variables.result = '';
Variables.Return = false;
window.dbo.transaction(function(myTrans) {
myTrans.executeSql(
'SELECT * FROM Test WHERE TestID=?'
,[argument]
,mySuccess
,myError
)
});
for (local.I = 0;local.I < 3; local.I++) { // Try a bunch of times.
if (Variables.Return) break; // Gets set in mySuccess and myError
SpinLock(250);
}
return Variables.result;
}
var SpinLock = function(milliseconds) {
var local = {};
local.StartTime = Date.now();
do {
} while (Date.now() < local.StartTime + milliseconds);
}
function WebSqlError(tx,result) {
if (dom.msg.text()) {
dom.msg.append('<br>');
}
dom.msg.append(result.message);
}
}(jQuery, window);

"Is there something different about XHR than there is about the executeSql command?"
Kind of.
"How can jQuery have an async=false argument when it comes to Ajax?"
Ajax, or rather XMLHttpRequest, isn't strictly limited to being asynchronous -- though, as the original acronym suggested, it is preferred.
jQuery.ajax()'s async option is tied to the boolean async argument of xhr.open():
void open(
DOMString method,
DOMString url,
optional boolean async, // <---
optional DOMString user,
optional DOMString password
);
The Web SQL Database spec does also define a Synchronous database API. However, it's only available to implementations of the WorkerUtils interface, defined primarily for Web Workers:
// A worker has no window object, so keep the database handle in a local variable
var dbo = openDatabaseSync('POC','1.0','Proof-Of-Concept', 1024*1024);
var results;
dbo.transaction(function (trans) {
results = trans.executeSql('...');
});
If the environment running the script hasn't implemented this interface, then you're stuck with the asynchronous API and returning the result will not be feasible. You can't force blocking/waiting of asynchronous tasks for the reason you suspected:
"Maybe my spinning isn't giving the callback chance to come back (See code below)."
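
That suspicion can be checked with a small sketch. It uses setTimeout as a stand-in for the executeSql callback (an assumption made for illustration: both are delivered through the event loop) and a busy-wait in the style of the SpinLock function from the question.

```javascript
// Schedule a callback that should run as soon as possible.
var fired = false;
setTimeout(function () { fired = true; }, 0);

// Busy-wait for 50 ms, in the style of the SpinLock function in the question.
var start = Date.now();
while (Date.now() - start < 50) { /* spin */ }

// The callback has not run: it cannot until the current script yields.
var firedDuringSpin = fired;
console.log('fired during spin:', firedDuringSpin); // fired during spin: false

// Once the script returns to the event loop, the pending callback runs first.
setTimeout(function () {
  console.log('fired after yielding:', fired); // fired after yielding: true
}, 0);
```

However long the loop spins, the queued callback only runs after the currently executing code has finished, so the value has to be consumed in a callback or a promise chain instead of being returned.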
