Node JS with Phantom JS to Scrape Dynamic Pages

I'm a Java developer with very little JavaScript experience. I want to build a small Node.js app that parses a dynamic web page, so I need some way to wait until the page has fully loaded. I have a Node.js project running with a hello-world app.
I then added PhantomJS support through the PhantomJS Node bridge (https://github.com/amir20/phantomjs-node) and successfully ran one of the bridge's samples in my project (see below). It writes the contents of the web page to a file, but the content is incomplete: it is missing the dynamic data retrieved via JavaScript/AJAX.
Can someone suggest a modification to the code below that makes it wait until the page is fully loaded before writing the file?
Edit: another user has essentially the same issue, still unanswered: "Dynamic scraping using nodejs and phantomjs".
Node.js version 6.20, PhantomJS version 2.1.1, PhantomJS Node bridge version 2.1.2
var phantom = require('phantom');
var sitepage = null;
var phInstance = null;
phantom.create()
    .then(instance => {
        phInstance = instance;
        return instance.createPage();
    })
    .then(page => {
        sitepage = page;
        return page.open('http://www.somesite.com');
    })
    .then(status => {
        console.log(status);
        return sitepage.property('content');
    })
    .then(content => {
        var fs = require('fs');
        fs.writeFile("output.html", content, function(err) {
            if(err) {
                return console.log(err);
            }
        });
        sitepage.close();
        phInstance.exit();
    })
    .catch(error => {
        console.log(error);
        phInstance.exit();
    });

Related

Is it possible to run Python code outside of an Electron application?

I created a desktop Electron application with JavaScript, HTML, CSS, etc. I have a bot, written in Python, that I want to run when the user clicks a button. The bot does web scraping using Selenium and ChromeDriver. Is there a way to store the bot and its source code outside the client's computer, so the source code is not visible, and still let the client use the bot to scrape?
Sorry if this is a rookie question; I'm coming from C++ and Swift mobile development, and I'm a junior CS student teaching myself new things.

I agree with Chris G in that it would be considered best practice to create a web app with one of Python's many web frameworks (Django, FastAPI, Flask, etc).
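Alternatively, with the python-shell package this can be done quite simply with Electron:
const { app, BrowserWindow } = require('electron');
// Call shape as in the original answer. Newer python-shell releases export
// the class by name ({ PythonShell }) and may return a promise from run()
// instead of taking a callback, so check the version you install.
const pyshell = require('python-shell');

let window;

function createWindow() {
    window = new BrowserWindow({ width: 600, height: 450 });
    window.loadFile('index.html');
    // Run the Python script once the window has been created.
    pyshell.run('your_script.py', function (err, results) {
        if (err) {
            throw err;
        }
    });
}

app.on('ready', createWindow);
app.on('window-all-closed', () => {
    if (process.platform !== 'darwin') {
        app.quit();
    }
});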
Then, with a simple python script your_script.py:
a = 'Foo'
b = 'Bar'
print(a + b)
This example is quite simple. Creating your own web facing API would be your best bet if you don't want to run into any compatibility issues when shipping your app.
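If the bot is moved behind a web API as suggested, the Electron app only has to send a request and read the reply, and the Python source never leaves the server. Below is a minimal sketch of the client side using Node's built-in http/https modules. The function name requestScrape, the endpoint URL, and the payload shape are all assumptions made for illustration; the server itself (Django, FastAPI, Flask, etc.) is not shown.

```javascript
const http = require('http');
const https = require('https');
const { URL } = require('url');

// POST a JSON payload to `endpoint` and resolve with the parsed JSON reply.
// Both the endpoint and the payload shape are hypothetical.
function requestScrape(endpoint, payload) {
  const url = new URL(endpoint);
  const client = url.protocol === 'https:' ? https : http;
  const body = JSON.stringify(payload);

  return new Promise((resolve, reject) => {
    const req = client.request({
      method: 'POST',
      hostname: url.hostname,
      port: url.port || undefined, // fall back to the protocol default
      path: url.pathname + url.search,
      headers: {
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(body),
      },
    }, res => {
      let data = '';
      res.setEncoding('utf8');
      res.on('data', chunk => { data += chunk; });
      res.on('end', () => {
        if (res.statusCode < 200 || res.statusCode >= 300) {
          reject(new Error('Server replied with status ' + res.statusCode));
          return;
        }
        try {
          resolve(JSON.parse(data));
        } catch (err) {
          reject(err);
        }
      });
    });
    req.on('error', reject);
    req.end(body);
  });
}
```

The button handler would then call something like requestScrape('https://example.com/api/scrape', { target: '...' }) and render the result, with authentication added as the server requires.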

I would like to be able to load/edit handlebars.js templates (Express server) by clicking on a hyperlink. Any suggestions?

One of the most time-consuming parts of developing a handlebars.js app is finding the template or partial that produces a specific piece of HTML. Is there a way to create a hyperlink on a localhost server that, when clicked, opens the template, without having to search for text manually?
This would also help other users editing the code, as it would simplify traceability.
There could be scope for an npm module or similar to do this; I'm just not sure of the exact approach.
Because it has to bridge the browser and Sublime, it should perhaps be a Chrome extension. The other option would be an admin-level interface that allows code editing from a modal.
Please let me know if you have a solution for this, or if there is some glaringly obvious simple approach that I am missing.

So I have managed to implement a workaround for this using Grunt.
Basically, it adds a button to the HTML page that, when clicked in the browser, opens that handlebars template in Sublime for editing.
Local Express
exports.editfile = function(req, res, next) {
    var filename = req.query['filename']
    const { exec, spawn } = require('child_process');
    exec('grunt editfile --filename='+filename, (err, stdout, stderr) => {
        if (err) {
            console.error(err);
            return;
        }
        console.log(stdout);
    });
    res.send('complete');
};
Grunt
//In the grunt init
shell: {
    'openfile': openFile(),
}
grunt.loadNpmTasks('grunt-shell');
function openFile() {
    var openString = ''
    var fileName = grunt.option('filename'); //get value of target, my_module
    openString = {
        command: [
            'cd..','cd..','cd..','cd '+process.env.ROOTFOLDER,
            'sublime_text.exe "'+fileName+'"',
        ].join('&&')
    }
    return openString
}
grunt.registerTask('editfile',[
    'shell:openfile'
]);
Handlebars
<a class="btn btn-outline-warning bt-sm adminedit text-warning" data-filename="{{filename}}" id="{{uniqueid}}" onclick="editFile(this.id)" href="#">edit</a>
<script type="text/javascript">
    function editFile(iditem){
        var editFilename = $('#'+iditem).attr("data-filename")
        $.ajax({
            // Encode the filename so spaces and slashes survive the query string
            url : '/semini/admin/editfile/?filename='+encodeURIComponent(editFilename),
            success : function( data ) {
                console.log('executing : ',data)
            },
            error : function( xhr, err ) {
                // alert() takes a single argument
                alert('Error: '+err);
            }
        });
    }
</script>
Note: to get Sublime to open from CMD, you have to add an environment variable in Windows. Not too difficult, just google it.
It works great, but I cannot get a handlebars partial to provide its own name, so this only works for the first rendered page and not for partials.
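One caveat with the editfile route above: the filename comes straight from the query string and ends up inside a shell command. A hedged sketch of a stricter variant follows. It swaps the Grunt shell task for a direct execFile call (no shell involved) and rejects paths outside a template root. The helper names resolveTemplatePath and openInEditor, and the idea of a single template root directory, are assumptions for illustration only.

```javascript
const path = require('path');
const { execFile } = require('child_process');

// Resolve a requested template name against `root` and refuse anything
// that escapes it (for example "../../secrets.txt").
function resolveTemplatePath(root, filename) {
  const base = path.resolve(root);
  const full = path.resolve(base, filename);
  if (!full.startsWith(base + path.sep)) {
    throw new Error('Refusing to open a file outside ' + base);
  }
  return full;
}

// Open the file in an editor without going through a shell, so the
// filename is passed as one argument and never interpreted as a command.
// `editor` defaults to the executable name used in the Grunt task above.
function openInEditor(root, filename, callback, editor) {
  let full;
  try {
    full = resolveTemplatePath(root, filename);
  } catch (err) {
    callback(err);
    return;
  }
  execFile(editor || 'sublime_text.exe', [full], callback);
}
```

In the Express handler this would replace the exec call, for example openInEditor(process.env.ROOTFOLDER, req.query.filename, callback), assuming ROOTFOLDER points at the directory that holds the templates.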

How to debug a Node.js API

I've been working on a Vue project.
This Vue project uses a Node.js API I created. Put simply, they are two entirely separate projects that live in different directories and are launched separately.
The problem arises whenever I debug a route, for example "/eventtype/create", with node --inspect --debug-break event_type.controller.js:
exports.create = (req, res) => {
    const userId = jwt.getUserId(req.headers.authorization);
    if (userId == null) {
        res.status(401).send(Response.response401());
        return;
    }
    // Validate request
    if (!req.body.label || !req.body.calendarId) {
        res.status(400).send(Response.response400());
        return;
    }
    const calendarId = req.body.calendarId; // Calendar id
    // Save to database
    EventType.create({
        label: req.body.label,
    }).then((eventType) => {
        Calendar.findByPk(calendarId).then((calendar) => {
            eventType.addCalendar(calendar); // Add a Calendar
            res.status(201).send(eventType);
        }).catch((err) => {
            res.status(500).send(Response.response500(err.message));
        });
    }).catch((err) => {
        res.status(500).send(Response.response500(err.message));
    });
};
Even if I set a breakpoint on const userId = jwt.getUserId(req.headers.authorization); and trigger the API's createEventType call from my Vue app, the breakpoint is never hit.
Also, when I press F8 in the debugger after the break on the first line, the file closes automatically.
I code in Vim rather than VS Code, but I've heard that VS Code might offer a simpler way to debug Node.js applications.
NOTE: I use the V8 Node debugger.

For newer NodeJS versions (> 7.0.0) you need to use
node --inspect-brk event_type.controller.js
instead of
node --inspect --debug-break event_type.controller.js
to break on the first line of the application code. See https://nodejs.org/api/debugger.html#debugger_advanced_usage for more information.
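One way to check whether execution reaches the handler at all is a debugger statement. The sketch below is a simplified, hypothetical stand-in for the controller in the question (no JWT or database calls). The statement pauses only when an inspector client is attached to the running process, so the process started with --inspect has to be the one that actually serves the route. Assuming the API has an entry file that starts the server (called server.js here purely for illustration), that is the file to pass to node, not the controller on its own.

```javascript
// Hypothetical, stripped-down version of the "/eventtype/create" handler.
function create(req, res) {
  // Pauses here when an inspector client (for example Chrome DevTools)
  // is attached; without one the statement does nothing.
  debugger;

  if (!req.body || !req.body.label) {
    res.status(400).send({ message: 'label is required' });
    return;
  }
  res.status(201).send({ label: req.body.label });
}
```

With the server launched under the inspector and a client attached, sending the request from the Vue app should stop on the debugger line.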

The solution (even if it's not really a solution) has been to add console.log to the line I wanted to debug.

How to use phantomjs in node-js environment for dynamic-page web scraping?

I am working on a few web scraping tasks.
I have used the Node.js request module for page scraping.
It works well, including for cookies and sessions.
But it fails when it has to render dynamic pages built with a JavaScript framework such as Angular or Backbone.
I am trying PhantomJS to get around this, since a search suggested it helps in such cases.
I also found a Node.js bridge for PhantomJS, phantom.
With PhantomJS and this bridge module I get the same result as before, nothing more.
var phantom = require('phantom');
var fs = require('fs');
var sitepage = null;
var phInstance = null;
phantom.create()
    .then(instance => {
        phInstance = instance;
        console.log("Instance created");
        return instance.createPage();
    })
    .then(page => {
        sitepage = page;
        console.log("creating page");
        return page.open('https://paytm.com/shop/p/carrier-estrella-plus-1-5-ton-3-star-window-ac-LARCARRIER-ESTRPLAN5550519593A34?src=grid&tracker=%7C%7C%7C%7C%2Fg%2Felectronics%2Flarge-appliances%2F1-5-ton-3-star-ac-starting-at-rs-22699%7C88040%7C1');
    })
    .then(status => {
        //console.log(status);
        console.log("getting content of page");
        return sitepage.property('content');
    })
    .then(content => {
        console.log("success");
        //console.log(content);
        // fs.writeFile needs a callback; recent Node versions throw without one
        fs.writeFile("ok.text", content, function(err) {
            if (err) {
                console.log(err);
            }
        });
        sitepage.close();
        phInstance.exit();
    })
    .catch(error => {
        console.log("errr");
        //console.log(error);
        phInstance.exit();
    });
Above is the code I am using to load a page from a dynamic website built with the Angular framework.
Can anybody guide me, or point out what I am missing in the code above?

You're getting the content of the page before the dynamic code has run; you need to wait for the load to complete.
The step after page.open needs to wait for the page to finish rendering. If you know of an element that is fetched from the back end, you can wait for that element to appear (see the waitfor example in the PhantomJS documentation).
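A minimal sketch of that idea for the phantom bridge, under stated assumptions: waitFor and contentWhenReady are hypothetical helper names, not part of the bridge, and the selector stands for whatever element only exists once the AJAX data has been rendered. It relies on page.open, page.evaluate and page.property returning promises, as the code in the question already does. The polling interval and timeout are arbitrary defaults.

```javascript
// Poll `check` (a function returning a boolean or a Promise of one)
// until it yields true, or reject once `timeoutMs` has elapsed.
function waitFor(check, timeoutMs, intervalMs) {
  timeoutMs = timeoutMs || 10000;
  intervalMs = intervalMs || 250;
  var start = Date.now();
  return new Promise(function (resolve, reject) {
    function poll() {
      Promise.resolve()
        .then(check)
        .then(function (ready) {
          if (ready) {
            resolve();
            return;
          }
          if (Date.now() - start >= timeoutMs) {
            reject(new Error('Timed out after ' + timeoutMs + ' ms'));
            return;
          }
          setTimeout(poll, intervalMs);
        })
        .catch(reject);
    }
    poll();
  });
}

// `page` is a page object created through the phantom bridge.
// Resolves with the page HTML once `selector` matches an element.
function contentWhenReady(page, url, selector) {
  return page.open(url)
    .then(function (status) {
      if (status !== 'success') {
        throw new Error('Failed to open ' + url + ': ' + status);
      }
      return waitFor(function () {
        // This function runs inside the page; the selector is passed
        // in as an argument because outer variables are not visible there.
        return page.evaluate(function (sel) {
          return document.querySelector(sel) !== null;
        }, selector);
      });
    })
    .then(function () {
      return page.property('content');
    });
}
```

In the promise chain from the question, the page.open and sitepage.property('content') steps would collapse into a single contentWhenReady(page, url, '#some-element') call, with the file written in the following .then.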

Calling an external API with a Node application (KeystoneJS)

I'm new to Node and trying to learn how to modify my KeystoneJS app so it can fetch data from an API (JSON or XML) and display it in the rendered view.
The current code in my app is essentially a clone of this demo app, https://github.com/JedWatson/keystone-demo, except that my app's view engine is Handlebars. So far I have installed the request package and experimented with code from its documentation in my keystone.js file, with no luck.
Then I created model/api.js, routes/api.js, routes/views/api.js and templates/views/api.hbs and again experimented with examples from the request documentation, but I failed to grasp what I was doing or how these new files fit into my app.
I would greatly appreciate help figuring out how to call an API and display the requested info in one of the app's rendered views. Thank you in advance!

You could hit the API from your model logic, as in https://github.com/r3dm/shpe-sfba/blob/master/models/Event.js#L69, using Node's built-in https library (the http module is documented at http://devdocs.io/node/http).
var https = require('https');

// Below we call the Facebook api to fill in data for our model
Event.schema.pre('save', function(next) {
    var myEvent = this;
    var apiCall = 'your API string';
    https.get(apiCall, function(res) {
        var body = '';
        res.on('data', function(d) { body += d; });
        res.on('end', function() {
            body = JSON.parse(body);
            if (body.error) {
                var err = new Error('There was an error saving your changes. Make sure the Facebook Event is set to "Public" and try again');
                next(err);
            } else {
                next();
            }
        });
    })
    .on('error', function(e) {
        console.log(e);
        // Pass the failure on so the save does not hang
        next(e);
    });
});
If you want the data to be fetched in some other scenario try adding the http request to initLocals in routes/middleware.js.
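A hedged sketch of that second option: a middleware shaped like initLocals that fetches JSON and exposes it to the templates through res.locals. The names fetchJson, makeInitLocals and apiData, and the example URL, are assumptions for illustration; they are not part of Keystone.

```javascript
const http = require('http');
const https = require('https');

// GET `url` and resolve with the parsed JSON body.
function fetchJson(url) {
  const client = url.startsWith('https:') ? https : http;
  return new Promise((resolve, reject) => {
    client.get(url, res => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', chunk => { body += chunk; });
      res.on('end', () => {
        try {
          resolve(JSON.parse(body));
        } catch (err) {
          reject(err);
        }
      });
    }).on('error', reject);
  });
}

// Build an Express-style middleware that loads `apiUrl` on every request
// and stores the result on res.locals.apiData for the view to render.
function makeInitLocals(apiUrl) {
  return function initLocals(req, res, next) {
    fetchJson(apiUrl)
      .then(data => { res.locals.apiData = data; })
      .then(() => next(), next); // forward any failure to the error handler
  };
}
```

In routes/middleware.js this could be wired up as exports.initLocals = makeInitLocals('https://example.com/api/items.json'), after which a Handlebars view can iterate over apiData. Fetching on every request is the simplest form; caching the response would be a natural next step.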
