I'm building a web scraper in Node.js that uses request and cheerio to parse the DOM. While I am using Node, I believe this is more of a general JavaScript question.
tl;dr - creating ~60,000-100,000 objects uses up all my computer's RAM and I get an out-of-memory error in Node.
Here's how the scraper works. It's loops within loops; I've never designed anything this complex before, so there might be far better ways to do this.
Loop 1: Creates 10 objects in array called 'sitesArr'. Each object represents one website to scrape.
var sitesArr = [
{
name: 'store name',
baseURL: 'www.basedomain.com',
categoryFunct: '(function(){ // do stuff })();',
gender: 'mens',
currency: 'USD',
title_selector: 'h1',
description_selector: 'p.description'
},
// ... x10
]
Loop 2: Loops through 'sitesArr'. For each site it goes to the homepage via 'request' and gets a list of category links, usually 30-70 URLs. It appends these URLs to the 'sitesArr' object they belong to, in an array property named 'categories'.
var sitesArr = [
{
name: 'store name',
baseURL: 'www.basedomain.com',
categoryFunct: '(function(){ // do stuff })();',
gender: 'mens',
currency: 'USD',
title_selector: 'h1',
description_selector: 'p.description',
categories: [
{
name: 'shoes',
url: 'www.basedomain.com/shoes'
},{
name: 'socks',
url: 'www.basedomain.com/socks'
} // x 50
]
},
// ... x10
]
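For reference, each category list is fetched roughly like this ('site' is the current sitesArr entry, and the 'nav a' selector is just a placeholder; every site supplies its own):
var request = require('request');
var cheerio = require('cheerio');

// rough sketch of how loop 2 collects the category links for one site
request('http://' + site.baseURL, function (err, resp, body) {
    if (err) return console.error(err);
    var $ = cheerio.load(body);
    site.categories = $('nav a').map(function () {
        return { name: $(this).text(), url: $(this).attr('href') };
    }).get();
});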
Loop 3: Loops through each 'category'. For each URL it gets a list of product links and puts them in an array, usually ~300-1000 products per category.
var sitesArr = [
{
name: 'store name',
baseURL: 'www.basedomain.com',
categoryFunct: '(function(){ // do stuff })();',
gender: 'mens',
currency: 'USD',
title_selector: 'h1',
description_selector: 'p.description',
categories: [
{
name: 'shoes',
url: 'www.basedomain.com/shoes',
products: [
'www.basedomain.com/shoes/product1.html',
'www.basedomain.com/shoes/product2.html',
'www.basedomain.com/shoes/product3.html',
// x 300
]
},// x 50
]
},
// ... x10
]
Loop 4: Loops through each 'products' array, goes to each URL and creates an object for each product.
var product = {
infoLink: "www.basedomain.com/shoes/product1.html",
description: "This is a description for the object",
title: "Product 1",
category: "Shoes",
imgs: ['http://foo.com/img.jpg','http://foo.com/img2.jpg','http://foo.com/img3.jpg'],
price: 60,
currency: 'USD'
}
Then each product object is shipped off to a MongoDB function that upserts it into my database.
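The upsert itself is roughly this (the 'Product' model name is a placeholder):
Product.update(
    { infoLink: product.infoLink },  // match on the product URL
    { $set: product },               // overwrite with the freshly scraped fields
    { upsert: true },                // insert if it doesn't exist yet
    function (err) { if (err) console.error(err); }
);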
THE ISSUE
This all worked just fine, until the process got large. I'm creating about 60,000 product objects every time this script runs, and after a little while all of my computer's RAM is being used up. What's more, after getting about halfway through my process I get the following error in Node:
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
I'm very much of the mind that this is a code design issue. Should I be "deleting" the objects once I'm done with them? What's the best way to tackle this?
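For example, would nulling out each site's data once its products have been upserted actually let the GC reclaim that memory?
// after every product for sitesArr[i] has been saved:
sitesArr[i].categories = null; // drop the references so they can be collected
Or is there a better pattern for a job like this?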
I'm trying to understand why JavaScript's array sort doesn't work with the following logic. I have no problem writing my own algorithm to sort this array, but I'm trying to do it with the built-in sort method to understand it better.
In this code, I want to push entities that "belong to" another entity to the bottom, so entities that "have" other entities appear at the top. But apparently the sort method doesn't compare all elements with each other, so the logic doesn't work properly.
Am I doing something wrong, or is this the correct behavior of the JavaScript sort method?
The code I'm trying to execute:
let entities = [
{
name: 'Permission2',
belongsTo: ['Role']
},
{
name: 'Another',
belongsTo: ['User']
},
{
name: 'User',
belongsTo: ['Role', 'Permission2']
},
{
name: 'Teste',
belongsTo: ['User']
},
{
name: 'Role',
belongsTo: ['Other']
},
{
name: 'Other',
belongsTo: []
},
{
name: 'Permission',
belongsTo: ['Role']
},
{
name: 'Test',
belongsTo: []
},
]
// Order needs to be Permission,
let sorted = entities.sort((first, second) => {
let firstBelongsToSecond = first.belongsTo.includes(second.name),
secondBelongsToFirst = second.belongsTo.includes(first.name)
if(firstBelongsToSecond) return 1
if(secondBelongsToFirst) return -1
return 0
})
console.log(sorted.map(item => item.name))
As you can see, "Role" needs to appear before "User", "Other" before "Role", etc, but it doesn't work.
Thanks for your help! Cheers
You're running into exactly how sorting is supposed to work: sort compares two elements at a time, so let's take some (virtual) pen and paper and write out what your code actually does.
If we use the simplest array with just User and Role, things work fine, so let's reduce your entities to a three-element array that doesn't do what you thought it would:
let entities = [
{
name: 'User',
belongsTo: ['Role', 'Permission2']
},
{
name: 'Test',
belongsTo: []
},
{
name: 'Role',
belongsTo: ['Other']
}
]
This will yield {User, Test, Role} when sorted, because it should... so let's see why it should:
pick elements [0] and [1] from [user, test, role] for comparison
compare(user, test)
user does not belong to test
test does not belong to user
per your code: return 0, i.e. don't change the ordering
we slide the compare window over to [1] and [2]
compare(test, role)
test does not belong to role
role does not belong to test
per your code: return 0, i.e. don't change the ordering
we slide the compare window over to [2] and [3]
there is no [3], we're done
The sorted result is {user, test, role}, because nothing got reordered
So the "bug" is thinking that sort compares everything-to-everything: as User and Role are not adjacent elements, they will never get compared to each other. Only adjacent elements get compared.
This question already has answers here:
MongoDB: How to update multiple documents with a single command?
(13 answers)
Closed 3 years ago.
I looked at other questions and I feel mine was different enough to ask.
I am sending a (potentially) large amount of information back to my backend; here is an example data set:
[ { orders: [Array],
_id: '5c919285bde87b1fc32b7553',
name: 'Test',
date: '2019-03-19',
customerName: 'Amego',
customerPhone: '9991112222',
customerStreet: 'Lost Ave',
customerCity: 'WestZone',
driver: 'CoolCat',
driverReq: false, // this is always false when it is ready to print
isPrinted: false, // < this is important
deliveryCost: '3',
total: '38.48',
taxTotal: '5.00',
finalTotal: '43.48',
__v: 0 },
{ orders: [Array],
_id: '5c919233bde87b1fc32b7552',
name: 'Test',
date: '2019-03-19',
customerName: 'Foo',
customerPhone: '9991112222',
customerStreet: 'Found Ave',
customerCity: 'EastZone',
driver: 'ChillDog',
driverReq: false,// this is always false when it is ready to print
isPrinted: false, // < this is important
deliveryCost: '3',
total: '9.99',
taxTotal: '1.30',
finalTotal: '11.29',
__v: 0 },
{ orders: [Array],
_id: '5c91903b6e0b7f1f4afc5c43',
name: 'Test',
date: '2019-03-19',
customerName: 'Boobert',
customerPhone: '9991112222',
customerStreet: 'Narnia',
customerCity: 'SouthSzone',
driver: 'SadSeal',
driverReq: false,// this is always false when it is ready to print
isPrinted: false, // < this is important
deliveryCost: '3',
total: '41.78',
taxTotal: '5.43',
finalTotal: '47.21',
__v: 0 } ]
My front end can find all the orders where isPrinted is false. I then let the end user 'print' all the orders that are prepared, at which point I need to set isPrinted to true so that the next batch I pull up won't contain reprints.
I was looking at db.test.updateMany({foo: "bar"}, {$set: {isPrinted: true}}). I currently allow each order to set a new driver, which I update with:
Order.update({
    _id: mongoose.Types.ObjectId(req.body.id)
},
{
    $set: {
        driver: req.body.driver, driverReq: false
    }
})
which is pretty straightforward, as only one order comes back at a time.
I have considered having my front end do a forEach and posting each order individually, then updating isPrinted one order at a time, but that seems quite inefficient. Is there an elegant solution within Mongo for this?
I'm not sure how I would use updateMany considering each _id is unique, unless I grab all the orders that are both driverReq: false and isPrinted: false (because that is the case where they are ready to print).
I found a solution; it was in fact using updateMany.
Order.updateMany({
    isPrinted: false, driverReq: false
},
{
    $set: {
        isPrinted: true
    }
})
This covers the special case where both flags are false, which is exactly when an order is ready to print. But I do wonder if there is a way to iterate over multiple document ids with ease.
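If you ever do need to target an explicit list of ids instead of a flag query, $in should do it (a sketch; 'ids' stands for whatever array of id strings the front end sends):
Order.updateMany(
    { _id: { $in: ids.map(id => mongoose.Types.ObjectId(id)) } },
    { $set: { isPrinted: true } }
)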
Not quite sure if that title was the best I could do.
I'm pretty new to JS and keep running into problems... I hope some of you have the time to give me a pointer or two on this scenario.
I have several objects that look pretty much like this, except that there are 28 instances of every "room" type. I need to split this object into multiple objects, one for each "room" type. Some of my objects contain only one room type, while others have 3 or 4.
[ { id: 1,
    created: '2018-12-29T13:18:05.788Z',
    room: 'Double Room',
    type: 'Standard',
    price: 500
  },
  { id: 29,
    created: '2018-12-29T13:18:05.788Z',
    room: 'Twin Room',
    type: 'Standard',
    price: 500
  },
  { id: 58,
    created: '2018-12-29T13:18:05.788Z',
    room: 'Family Room',
    type: 'Standard',
    price: 900
  },
]
Oh, and it's important that the instances don't "lose" their order in the array, since it's date-related and needs to be presented in ascending order. And vanilla JS only.
Is array.map() the function I'm looking for to solve this problem? Is it possible to do this without iteration?
My final goal is to create some kind of generic function that can sort this out for all my objects.
And guys: happy holidays!
You could use an object as a hash table for the wanted groups: iterate the objects and assign each one to its group; if the group does not exist yet, create it with a new array.
function groupBy(array, key) {
    // prototype-less object, so group names can't collide with built-in keys
    var groups = Object.create(null);
    // create the group's array on first sight, then push the object onto it
    array.forEach(o => (groups[o[key]] = groups[o[key]] || []).push(o));
    return groups;
}
var data = [{ id: 1, created: '2018-12-29T13:18:05.788Z', room: 'Double Room', type: 'Standard', price: 500 }, { id: 29, created: '2018-12-29T13:18:05.788Z', room: 'Twin Room', type: 'Standard', price: 500 }, { id: 58, created: '2018-12-29T13:18:05.788Z', room: 'Family Room', type: 'Standard', price: 900 }],
groupedByRoom = groupBy(data, 'room');
console.log(groupedByRoom);
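Note that forEach visits the rows in their original order, so each group's array keeps that order and the ascending date ordering is preserved. For example, groupedByRoom['Double Room'] holds all Double Room rows exactly as they appeared in the input.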
Our GraphQL server responds to a query with data that includes an array of objects, each of which shares the same id but has a different value for another key. For instance, we might have an array that looks like:
[
{ id: 123, name: 'foo', type: 'bar', cost: 5 },
{ id: 123, name: 'foo', type: 'bar', cost: 6 },
{ id: 123, name: 'foo', type: 'bar', cost: 7 },
{ id: 123, name: 'foo', type: 'bar', cost: 8 }
]
We can see in the Network tab that the response from the server has the correct data in it. However, by the time it goes through processing by the Apollo Client module the array has been transformed into something that might look like this:
[
{ id: 123, name: 'foo', type: 'bar', cost: 5 },
{ id: 123, name: 'foo', type: 'bar', cost: 5 },
{ id: 123, name: 'foo', type: 'bar', cost: 5 },
{ id: 123, name: 'foo', type: 'bar', cost: 5 }
]
Essentially what we're seeing is that if all of the objects in an array share the same value for id then all objects in the array become copies of the first object in the array.
Is this the intended behavior of Apollo Client? We thought maybe it had something to do with incorrect caching, but we were also wondering if maybe Apollo Client assumed that subsequent array members with the same id were the same object.
It looks like this is the intended behavior: the Apollo Client normalizes on id.
As the other answer suggests, this happens because Apollo normalises by ID. There's a very extensive article on the official blog that explains the rationale, along with the underlying mechanisms.
In short, as seen by Apollo's cache, your array of objects contains 4 instances of the same Object (id 123). Same ID, same object.
This is a fair assumption on Apollo's side, but not so much in your case.
You have to explicitly tell Apollo that these are indeed 4 different items that should be treated differently.
In the past we used dataIdFromObject. Today, you would use typePolicies and keyFields:
const cache = new InMemoryCache({
typePolicies: {
YourItem: {
// Combine the fields that make your item unique
keyFields: ['id', 'cost'],
}
},
});
See the Apollo Client docs on keyFields for details.
It works for me:
const cache: InMemoryCache = new InMemoryCache({ dataIdFromObject: o => false });
The previous answer solves this problem too!
You can also change the key name (for example, id => itemId) on the back-end side, and there won't be any issue!
I have the same issue. My solution is to set fetchPolicy: "no-cache" just for this single API so you don't have to change the InMemoryCache.
Note that setting fetchPolicy to network-only is insufficient because it still uses the cache.
See the fetchPolicy documentation for details.
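For example (a sketch assuming @apollo/client with React hooks; GET_ITEMS is a placeholder for your actual query):
import { gql, useQuery } from '@apollo/client';

const GET_ITEMS = gql`
    query GetItems {
        items { id name type cost }
    }
`;

function useItems() {
    // 'no-cache' bypasses the normalized cache for this query only,
    // so rows sharing an id are not collapsed into one object
    return useQuery(GET_ITEMS, { fetchPolicy: 'no-cache' });
}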
I am working on an application which is nicely modularized using RequireJS. One of the modules, called data service, is in charge of providing other modules with data. Pretty much all get* methods of this module return JavaScript objects in the following format:
res = {
totalRows: 537,
pageSize: 10,
page: 15,
rows: [
{
id: 1,
name: 'Angelina'
...
},
{
id: 2,
name: 'Halle'
...
},
{
id: 3,
name: 'Scarlet'
...
},
{
id: 4,
name: 'Rihanna'
...
},
{
id: 5,
name: 'Shakira'
...
},
// ... (10 rows per page)
{
id: 10,
name: 'Kate'
...
}
]
}
Is it possible to initialize the DataTable by providing it with the rows for the current page, the current page number, the page size, and the total number of records, so that it "knows" which page is currently displayed as well as how many pages are available? That would let the DataTable build the pager correctly, and when the user navigates to another page we would make another call to the data service module to retrieve that page's data from the database.
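Something along these lines is what I have in mind (a rough sketch; dataService.getPage is a stand-in for our data service call, and the columns are made up):
$('#grid').DataTable({
    serverSide: true,
    pageLength: 10,
    displayStart: 140, // record offset, i.e. start on page 15 with pageSize 10
    ajax: function (req, callback) {
        // DataTables asks for an offset/length; translate that into a page number
        var page = req.start / req.length + 1;
        dataService.getPage(page, req.length).then(function (res) {
            callback({
                draw: req.draw,
                recordsTotal: res.totalRows,
                recordsFiltered: res.totalRows,
                data: res.rows
            });
        });
    },
    columns: [{ data: 'id' }, { data: 'name' }]
});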