Duplicate Array Data Web Scraping - javascript

I can't seem to get the duplicate articles out of my web scraper results. This is my code:
app.get("/scrape", function (req, res) {
request("https://www.nytimes.com/", function (error, response, html) {
// Load the HTML into cheerio and save it to a variable
// '$' becomes a shorthand for cheerio's selector commands, much like jQuery's '$'
var $ = cheerio.load(html);
var uniqueResults = [];
// With cheerio, find each p-tag with the "title" class
// (i: iterator. element: the current element)
$("div.collection").each(function (i, element) {
// An empty array to save the data that we'll scrape
var results = [];
// store scraped data in appropriate variables
results.link = $(element).find("a").attr("href");
results.title = $(element).find("a").text();
results.summary = $(element).find("p.summary").text().trim();
// Log the results once you've looped through each of the elements found with cheerio
db.Article.create(results)
.then(function (dbArticle) {
res.json(dbArticle);
}).catch(function (err) {
return res.json(err);
});
});
res.send("You scraped the data successfully.");
});
});
// Route for getting all Articles from the db
app.get("/articles", function (req, res) {
// Grab every document in the Articles collection
db.Article.find()
.then(function (dbArticle) {
res.json(dbArticle);
})
.catch(function (err) {
res.json(err);
});
});
Right now I am getting five copies of each article sent to the user. I have tried db.Article.distinct and various versions of this to filter the results down to only unique articles. Any tips?

In Short:
Switching var results = [] from an Array to an Object (var results = {}) did the trick for me. I still haven't figured out the exact reason for the duplicate insertion of documents in the database; I will update as soon as I find out.
Long Story:
There are several mistakes and points of improvement in your code. I will try to point them out.
Let's address the mistakes first to make your code error free.
Mistakes
1. Although mongoose's Model.create / new Model() may seem to work fine when handed an Array, I haven't seen such a use before and it does not look appropriate here.
If you intend to create documents one after another, represent each document as an object instead of an Array. An Array is meant for the case where you create multiple documents at once.
So switch -
var results = [];
to
var results = {};
2. Sending response headers after they have already been sent will raise an error. You may not have noticed it yet, but it shows up quickly: once that error pops up, the remaining documents won't get stored because of the resulting PromiseRejection error if you haven't set up a try/catch block.
The db.Article.create call inside $("div.collection").each(function (i, element) runs asynchronously, so your process control won't wait for each document to be processed; it immediately executes res.send("You scraped the data successfully.");.
This effectively terminates the HTTP connection between the client and the server, and any further response-terminating call like res.json(dbArticle) or res.json(err) will throw an error.
So just comment out the res.json statements inside .create's then and catch methods. This does terminate the response before all the articles are saved in the DB, but you need not worry: your code will still save the articles to the database behind the scenes (asynchronously).
If you want your response to be terminated only after you have successfully saved the data, then change your route handler to -
request('https://www.nytimes.com', (err, response, html) => {
  var $ = cheerio.load(html);
  var results = [];
  $("div.collection").each(function (i, element) {
    var ob = {};
    ob.link = $(element).find("a").attr("href");
    ob.title = $(element).find("a").text();
    ob.summary = $(element).find("p.summary").text().trim();
    results.push(ob);
  });
  db.Article.create(results)
    .then(function (dbArticles) {
      res.json(dbArticles);
    }).catch(function (err) {
      return res.json(err);
    });
});
After making the above changes (even just the first one), my version of your code ran fine. So you can continue with your current version, or read on for some points of improvement.
Points of Improvement
1. The era of callbacks is long gone:
Convert your implementation to use Promises, as they are more maintainable and easier to reason about. Here are the things you can do -
Change your request library from request to axios, or any other one that supports Promises by default.
2. Make effective use of mongoose methods for insertion. You can bulk insert multiple documents in a single query (for example with Model.insertMany, or Model.create with an array). You may find the docs on creating documents in MongoDB quite helpful. A sketch combining points 1 and 2 follows this list.
3. Consider a headless browser automation library such as puppeteer or nightmare.js for scraping-related tasks. Trust me, they make life a lot easier than cheerio or any other library for the same purpose. Their docs are really good and well maintained, so you won't have a hard time picking them up.
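For reference, here is a minimal sketch of what points 1 and 2 could look like together. It assumes the same app, db.Article model, and cheerio selectors from the question; axios and insertMany are the swapped-in tools, not something your current code already uses:
const axios = require("axios");
const cheerio = require("cheerio");

app.get("/scrape", async function (req, res) {
  try {
    // axios returns a Promise; the resolved value exposes the page HTML on .data
    const response = await axios.get("https://www.nytimes.com/");
    const $ = cheerio.load(response.data);
    // Collect every article into one array of plain objects
    const results = [];
    $("div.collection").each(function (i, element) {
      results.push({
        link: $(element).find("a").attr("href"),
        title: $(element).find("a").text(),
        summary: $(element).find("p.summary").text().trim()
      });
    });
    // One bulk insert instead of one query per article
    const dbArticles = await db.Article.insertMany(results);
    res.json(dbArticles);
  } catch (err) {
    res.status(500).json(err);
  }
});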

Related

Assigning Data from firestore get to a variable

I'm trying to assign variables their respective values from the Firestore database using the get doc function, but I've noticed it does not assign or update the values whatsoever.
I've tried to work with async and await but cannot seem to make it work.
getFromDatabase(nameOfCollection, nameOfDocument) {
  const db = firebase.firestore();
  var docRef = db.collection(nameOfCollection).doc(nameOfDocument);
  docRef.get().then(function(doc) {
    if (doc.exists) {
      outvariable = doc.data().anyfield; // THIS IS WHAT I WANT
      console.log(" Document data:", doc.data());
    } else {
      console.log("No such document!");
    }
  }).catch(function(error) {
    console.log("Error getting document:", error);
  });
}
I'm expecting outvariable = doc.data().anyfield
Most likely you're confused by the fact that data is loaded from Firestore asynchronously. It's not so much that the data isn't assigned to the values, because it really is. It just happens at a different time than you expect.
It's easiest to see this by adding some simple logging statements around the code that loads data:
const db = firebase.firestore();
var docRef = db.collection(nameOfCollection).doc(nameOfDocument);
console.log("Before starting to load data");
docRef.get().then(function(doc) {
  console.log("Got data");
});
console.log("After starting to load data");
When you run this code, the output is:
Before starting to load data
After starting to load data
Got data
This is probably not what you expected, but it's actually completely correct. The data is loaded from Firestore asynchronously (since it may take some time), and instead of waiting, the main code continues. Then when the data is available, your callback function is called with that data.
This means that any code that requires the data from the database must be inside the callback, or be called from there. So the console.log(" Document data:", doc.data()) in your original code should work fine. But a similar console.log outside of the callback won't work, because it runs before the data is available.
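If you prefer async/await over .then(), here is a minimal sketch of how the function from the question could return the value instead of assigning it to an outer variable. The collection and document names in the caller are just placeholders:
async function getFromDatabase(nameOfCollection, nameOfDocument) {
  const db = firebase.firestore();
  const doc = await db.collection(nameOfCollection).doc(nameOfDocument).get();
  if (!doc.exists) {
    console.log("No such document!");
    return undefined;
  }
  return doc.data().anyfield;
}

// The caller also has to wait for the returned Promise
getFromDatabase("someCollection", "someDocument").then(function (value) {
  console.log("Value:", value); // the value is only available inside this callback
});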
This is an extremely common source of confusion for developers new to this type of API. But since most modern web/cloud APIs, and many other APIs, are asynchronous, it's best to learn how to work with them quickly. For that, I recommend reading:
get asynchronous value from firebase firestore reference
Doug's blog post on why Firebase APIs are asynchronous
Firestore query in function return
NodeJS, Firestore get field
How do I return the response from an asynchronous call?
The data can be extracted with .data(), or with .get() to read a specific field.
For example: doc.get('anyfield');
More info can be found in the official documentation.

Save csv-parse output to a variable

I'm new to using csv-parse and this example from the project's github does what I need with one exception. Instead of outputting via console.log I want to store data in a variable. I've tried assigning the fs line to a variable and then returning data rather than logging it but that just returned a whole bunch of stuff I didn't understand. The end goal is to import a CSV file into SQLite.
var fs = require('fs');
var parse = require('..');
var parser = parse({delimiter: ';'}, function(err, data){
  console.log(data);
});
fs.createReadStream(__dirname+'/fs_read.csv').pipe(parser);
Here is what I have tried:
const fs = require("fs");
const parse = require("./node_modules/csv-parse");
const sqlite3 = require("sqlite3");
// const db = new sqlite3.Database("testing.sqlite");
let parser = parse({delimiter: ","}, (err, data) => {
  // console.log(data);
  return data;
});
const output = fs.createReadStream(__dirname + "/users.csv").pipe(parser);
console.log(output);
I was also struggling to figure out how to get the data from csv-parse back to the top-level that invokes parsing. Specifically I was trying to get parser.info data at the end of processing to see if it was successful, but the solution for that can work to get the row data as well, if you need.
The key was to wrap all the stream event listeners into a Promise, and within the parser's callback resolve the Promise.
function startFileImport(myFile) {
  // THIS IS THE WRAPPER YOU NEED
  return new Promise((resolve, reject) => {
    let readStream = fs.createReadStream(myFile);
    let fileRows = [];
    const parser = parse({
      delimiter: ','
    });
    // Use the readable stream api
    parser.on('readable', function () {
      let record;
      while (record = parser.read()) {
        if (record) { fileRows.push(record); }
      }
    });
    // Catch any error
    parser.on('error', function (err) {
      console.error(err.message);
    });
    parser.on('end', function () {
      const { lines } = parser.info;
      // RESOLVE OUTPUT THAT YOU WANT AT PARENT-LEVEL
      resolve({ status: 'Successfully processed lines: ', lines });
    });
    // This will wait until we know the readable stream is actually valid before piping
    readStream.on('open', function () {
      // Pipe the read stream into the parser
      readStream.pipe(parser);
    });
    // This catches any errors that happen while creating the readable stream (usually invalid names)
    readStream.on('error', function (err) {
      resolve({ status: null, error: 'readStream error' + err });
    });
  });
}
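For completeness, a small usage sketch of the wrapper above, called from an async function (the file name is just a placeholder):
async function run() {
  // await resolves to the object passed to resolve() inside the wrapper
  const result = await startFileImport(__dirname + '/users.csv');
  console.log(result.status, result.lines);
}

run();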
This is a question that suggests confusion about an asynchronous streaming API and seems to ask at least three things.
How do I get output to contain an array-of-arrays representing the parsed CSV data?
That output will never exist at the top-level, like you (and many other programmers) hope it would, because of how asynchronous APIs operate. All the data assembled neatly in one place can only exist in a callback function. The next best thing syntactically is const output = await somePromiseOfOutput() but that can only occur in an async function and only if we switch from streams to promises. That's all possible, and I mention it so you can check it out later on your own. I'll assume you want to stick with streams.
An array consisting of all the rows can only exist after reading the entire stream. That's why, in the author's "Stream API" example, all the rows are only available in the .on('end', ...) callback. If you want to do anything with all the rows present at the same time, you'll need to do it in the end callback.
From https://csv.js.org/parse/api/ note that the author:
uses the on readable callback to push single records into a previously empty array defined externally named output.
uses the on error callback to report errors
uses the on end callback to compare all the accumulated records in output to the expected result
...
const output = []
...
parser.on('readable', function(){
  let record
  while (record = parser.read()) {
    output.push(record)
  }
})
// Catch any error
parser.on('error', function(err){
  console.error(err.message)
})
// When we are done, test that the parsed output matched what expected
parser.on('end', function(){
  assert.deepEqual(
    output,
    [
      [ 'root','x','0','0','root','/root','/bin/bash' ],
      [ 'someone','x','1022','1022','','/home/someone','/bin/bash' ]
    ]
  )
})
As to the goal of interfacing with sqlite, this is essentially building a customized streaming endpoint.
In this use case, implement a customized writable stream that accepts the output of parser and sends rows to the database.
Then you simply chain pipe calls as
fs.createReadStream(__dirname+'/fs_read.csv')
  .pipe(parser)
  .pipe(your_writable_stream)
Beware: This code returns immediately. It does not wait for the operations to finish. It interacts with a hidden event loop internal to node.js. The event loop often confuses new developers who are arriving from another language, used to a more imperative style, and skipped this part of their node.js training.
Implementing such a customized writable stream can get complicated and is left as an exercise for the reader. It will be easiest if the parser emits a row, and then the writer can be written to handle single rows. Make sure you are able to notice errors somehow and throw appropriate exceptions, or you'll be cursed with incomplete results and no warning or reason why.
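To make that concrete, here is a rough sketch of such a writable stream, reusing the fs and parse imports from your code. It assumes a sqlite3 database with an existing users table whose three columns match the CSV layout; the table and column names are made up for illustration, so treat this as a starting point rather than a finished implementation:
const { Writable } = require('stream');
const sqlite3 = require('sqlite3');
const db = new sqlite3.Database('testing.sqlite');

// objectMode lets the stream receive one parsed record (an array of fields) per write
const sqliteWriter = new Writable({
  objectMode: true,
  write(row, encoding, done) {
    // row is one CSV record; the placeholders map its fields to the assumed columns
    db.run('INSERT INTO users (name, email, age) VALUES (?, ?, ?)', row, function (err) {
      done(err); // passing an error here emits 'error' on the stream
    });
  }
});

fs.createReadStream(__dirname + '/users.csv')
  .pipe(parse({ delimiter: ',' }))
  .pipe(sqliteWriter);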
A hackish way to do it would have been to replace console.log(data) in let parser = ... with a customized function writeRowToSqlite(data) that you'll have to write anyway to implement a custom stream. Because of asynchronous API issues, using return data there does not do anything useful. It certainly, as you saw, fails to put the data into the output variable.
As to why output in your modified posting does not contain the data...
Unfortunately, as you discovered, this is usually wrong-headed:
const output = fs.createReadStream(__dirname + "/users.csv").pipe(parser);
console.log(output);
Here, the variable output will be a ReadableStream, which is not the same as the data contained in the readable stream. Put simply, it's like when you have a file in your filesystem, and you can obtain all kinds of system information about the file, but the content contained in the file is accessed through a different call.

node.js data consistency when iterating asynchronously

I have a tool whose basic idea is as follows:
//get a bunch of couchdb databases. this is an array
const jsonFile = require('jsonfile');
let dbList = getDbList();
const filePath = 'some/path/to/file';
const changesObject = {};
//iterate the db list. do asynchronous stuff on each iteration
dbList.forEach(function(db){
  let merchantDb = nano.use(db);
  //get some changes from the database. validate inside callback
  merchantDb.get("_changes", function(err, changes){
    validateChanges(changes);
    changesObject['db'] = changes.someAttribute;
    //write changes to file
    jsonFile.writeFile(filePath, changesObject, function (err) {
      if (err) {
        logger.error("Unable to write to file: ");
      }
    });
  });
});

const validateChanges = function(changes) {
  if (!validateLogic(changes)) sendAlertMail();
}
For performance improvements the iteration is not done synchronously. Therefore there can be multiple iterations running in 'parallel'. My question is can this cause any data inconsistencies and/or any issues with the file writing process?
Edit:
The same file gets written to on each iteration.
Edit 2:
The changes are stored as a JSON object with key value pairs. The key being the db name.
If you're really writing to a single file, which you appear to be (though it's hard to be sure), then yes: you have a race condition in which multiple callbacks will try to write to the same file, possibly at the same time (remember, I/O isn't done on the JavaScript thread in Node unless you use the *Sync functions), which will at best mean the last one wins and will at worst mean I/O errors because of overlap.
If you're writing to separate files for each db, then provided there's no cross-talk (shared state) amongst validateChanges, validateLogic, sendAlertMail, etc., that should be fine.
Just for detail: It will start tasks (jobs) getting the changes and then writing them out; the callbacks of the calls to get won't be run until later, when all of those jobs are queued.
You are creating closures in loops, but the way you're doing it is okay, both because you're doing it within the forEach callback and because you're not using db in the get callback (which would be fine with the forEach callback but not with some other ways you might loop arrays). Details on that aspect in this question's answers if you're interested.
This line is suspect, though:
let merchantDb = nano.use('db');
I suspect you meant (no quotes):
let merchantDb = nano.use(db);
For what it's worth, it sounds from the updates to the question and your various comments like the better solution would be not to write out the file separately each time. Instead, you want to gather up the changes and then write them out.
You can do that with the classic Node-callback APIs you're using like this:
let completed = 0;
//iterate the db list. do asynchronous stuff on each iteration
dbList.forEach(function(db) {
  let merchantDb = nano.use(db);
  //get some changes from the database. validate inside callback
  merchantDb.get("_changes", function(err, changes) {
    if (err) {
      // Deal with the fact there was an error (don't return)
    } else {
      validateChanges(changes);
      changesObject[db] = changes.someAttribute; // <=== NOTE: This line had 'db' rather than db, I assume that was meant to be just db
    }
    if (++completed === dbList.length) {
      // All done, write changes to file
      jsonFile.writeFile(filePath, changesObject, function(err) {
        if (err) {
          logger.error("Unable to write to file: ");
        }
      });
    }
  });
});
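For comparison, here is a sketch of the same idea with promises, assuming Node's util.promisify is available and that merchantDb.get and jsonFile.writeFile follow the standard error-first callback convention (they appear to, but that is an assumption):
const { promisify } = require('util');

async function collectChanges() {
  const changesObject = {};
  // Run all the lookups, but only continue once every one of them has finished
  await Promise.all(dbList.map(async function (db) {
    const merchantDb = nano.use(db);
    const getDoc = promisify(merchantDb.get.bind(merchantDb));
    const changes = await getDoc("_changes");
    validateChanges(changes);
    changesObject[db] = changes.someAttribute;
  }));
  // Single write once every database has been processed
  await promisify(jsonFile.writeFile)(filePath, changesObject);
}

collectChanges().catch(function (err) {
  logger.error("Unable to collect changes: " + err);
});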

How to update a property of an object using Parse Cloud Code?

Every Parse Installation object instance in my Parse database has a pointer to a specific user. I have a background job that runs for every user, and what I want to do in this part of the background job is to set the respective user Installation's channel property to ["yesPush"], for push notification targeting purposes.
The way I figured I would do it is by querying for the specific Parse.Installation instance, and then setting its channels property. This doesn't seem to be working however. I'm trying to follow the guidelines in the Parse Cloud Code Docs, but it's either not the correct use case, or I'm not following it correctly.
Code:
var installationQuery = new Parse.Query(Parse.Installation);
installationQuery.equalTo('userId', parentUser);
installationQuery.find().then(function(results) {
  return Parse.Object.set('channels', ["yesPush"]);
});
The way I would do it is as follows:
// for every User
var userQuery = new Parse.Query(Parse.User);
userQuery.each(function(user) {
  var installationQuery = new Parse.Query(Parse.Installation);
  installationQuery.equalTo('userId', user);
  return installationQuery.first().then(function(installation) {
    installation.set('channels', ['yesPush']);
    return installation.save();
  });
}).then(function() {
  status.success();
}, function(error) {
  status.error(error.message);
});
The each() function is how you process large sets of data in a job. If you are performing other async tasks inside it you need to return their promise.
The first() function is much faster and easier if we only expect one record.
I am calling set() on the actual installation object returned by first().
I am returning the save() promise to allow promise chaining.

JavaScript leaking memory (Node.js/Restify/MongoDB)

Update 4: By instantiating the restify client (see controllers/messages.js) outside of the function and calling global.gc() after every request it seems the memory growth rate has been reduced a lot (~500KB per 10secs). Yet, the memory usage is still constantly growing.
Update 3: Came across this post: https://journal.paul.querna.org/articles/2011/04/05/openssl-memory-use/
It might be worth noting that I'm using HTTPS with Restify.
Update 2: Updated the code below to the current state. I've tried swapping out Restify with Express. Sadly this didn't make any difference. It seems that the API call at the end of the chain (restify -> mongodb -> external api) causes everything to be retained in memory.
Update 1: I have replaced Mongoose with the standard MongoDB driver. Memory usage seems to grow less quickly, yet the leak remains.
I've been working on trying to locate this leak for a couple of days now.
I'm running an API using Restify and Mongoose and for every API call I do at least one MongoDB lookup. I've got about 1-2k users that hit the API multiple times in a day.
What I have tried
I've isolated my code to just using Restify and used ApacheBench to fire a huge amount of requests (100k+). The memory usage stays around 60MB during the test.
I've isolated my code to just using Restify and Mongoose and tested it the same way as above. Memory usage stays around 80MB.
I've tested the full production code locally using ApacheBench. Memory usage stays around 80MB.
I've automatically dumped the heap on intervals. The biggest heap dump I had was 400MB. All I can see that there are tons of Strings and Arrays but I cannot clearly see a pattern in it.
So, what could be wrong?
I've done the above tests using just one API user. This means that Mongoose only grabs the same document over and over. The difference with production is that a lot of different users hit the API meaning mongoose gets a lot of different documents.
When I start the nodejs server the memory quickly grows to 100MB-200MB. It eventually stabilizes around 500MB. Could this mean that it leaks memory for every user? Once every user has visited the API it will stabilize?
I've included my code below which outlines the general structure of my API. I would love to know if there's a critical mistake in my code or any other approach to finding out what is causing the high memory usage.
Code
app.js
var restify = require('restify');
var MongoClient = require('mongodb').MongoClient;
// ... setup restify server and mongodb
require('./api/message')(server, db);
api/message.js
module.exports = function(server, db) {
  // Controllers used for retrieving accounts via MongoDB and communication with an external api
  var accountController = require('../controllers/accounts')(db);
  var messageController = require('../controllers/messages')();
  // Restify bind to put
  server.put('/api/message', function(req, res, next) {
    // Token from body
    var token = req.body.token;
    // Get account by token
    accountController.getAccount(token, function(error, account) {
      // Send a message using external API
      messageController.sendMessage(token, account.email, function() {
        res.send(201, {});
        return next();
      });
    });
  });
};
controllers/accounts.js
module.exports = function(db) {
  // Gets account by a token
  function getAccount(token, callback) {
    var ObjectID = require('mongodb').ObjectID;
    var collection = db.collection('accounts');
    collection.findOne({
      token: token
    }, function(error, account) {
      if (error) {
        return callback(error);
      }
      if (account) {
        return callback('', account);
      }
      return callback('Account not found');
    });
  }
};
controllers/messages.js
module.exports = function() {
  function sendMessage(token, email, callback) {
    // Get a token used for external API
    getAccessToken(function() {
      // ... Setup client
      // Do POST
      client.post('/external_api', values, function(err, req, res, obj) {
        return callback();
      });
    });
  }
  return {
    sendMessage: sendMessage
  };
};
Heap snapshot of suspected leak
Might be a bug in getters; I ran into it when using virtuals or getters on a mongoose schema: https://github.com/LearnBoost/mongoose/issues/1565
It's actually normal to see mostly strings and arrays, as most programs are largely based on them. Profilers that allow sorting by total object count are therefore not of much use, as they often give the same results for many different programs.
A better way to use Chrome's memory profiler is to take one heap snapshot, for example after one user calls the API, and then a second snapshot after a second user calls the API.
The profiler lets you compare two snapshots and see the difference between them (see this tutorial); this will help you understand why the memory grew in an unexpected way.
Objects are retained in memory because there is still a reference to them that prevents the object from being garbage collected.
So another way to use the profiler to find memory leaks is to look for an object that you believe should not be there, inspect its retaining paths, and see whether any of them are unexpected.
Not sure whether this helps, but could you try to remove unnecessary returns?
api/message.js
// Send a message using external API
messageController.sendMessage(token, account.email, function() {
  res.send(201, {});
  next(); // remove 'return'
});
controllers/accounts.js
module.exports = function(db) {
  // Gets account by a token
  function getAccount(token, callback) {
    var ObjectID = require('mongodb').ObjectID;
    var collection = db.collection('accounts');
    collection.findOne({
      token: token
    }, function(error, account) {
      if (error) {
        callback(error); // remove 'return'
      } else if (account) {
        callback('', account); // remove 'return'
      } else {
        callback('Account not found'); // remove 'return'
      }
    });
  }
  return { // I guess you missed to copy this onto the question.
    getAccount: getAccount
  };
};
controllers/messages.js
// Do POST
client.post('/external_api', values, function(err, req, res, obj) {
  callback(); // remove 'return'
});
Your issue is in getAccount combined with how the GC works.
When you chain lots of functions, the GC only clears one level at a time, and the older something is in memory, the lower its chances of being collected. So in your getAccount chain you need at least six (by my count) global.gc() calls or automatic collections before it can be collected; by that time the GC assumes it's something it probably won't collect, so it stops checking it anyway.
collection {
  findOne {
    function(error, account) {
      callback('', account)
      sendMessage(...)
      getAccessToken() {
        Post
      }
    }
  }
}
As Gene suggested, remove this chaining.
PS: This is just a representation of how the GC works and depends on the implementation, but you get the point.
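For what it's worth, here is a rough sketch of what flattening that chain could look like with async/await. It assumes promise-returning versions of getAccount and sendMessage and a restify version that accepts async handlers; both are assumptions rather than something the posted code already provides:
server.put('/api/message', async function (req, res, next) {
  try {
    // Sequential awaits instead of nested callbacks, so the inner steps
    // no longer capture every enclosing callback scope
    const account = await accountController.getAccount(req.body.token);
    await messageController.sendMessage(req.body.token, account.email);
    res.send(201, {});
    return next();
  } catch (err) {
    return next(err);
  }
});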
