How to handle massive text-delimited files with NodeJS - javascript

We're working with an API-based data provider that allows us to analyze large sets of GIS data in relation to provided GeoJSON areas and specified timestamps. When the data is aggregated by our provider, it can be marked as complete and alert our service via a callback URL. From there, we have a list of the reports we've run with their relevant download links. One of the reports we need to work with is a TSV file with 4 columns that looks like this:
deviceId | timestamp | lat | lng
Sometimes, if the area we're analyzing is large enough, these files can be 60+ GB. The download link points to a zipped version of the file, so we can't read it directly from the download URL. We're trying to get the data in this TSV grouped by deviceId and sorted by timestamp so we can route along road networks using the lat/lng in our routing service. We've used JavaScript for most of our application so far, but this service poses unique problems that may require additional software and/or languages.
Curious how others have approached the problem of handling and processing data of this size.
We've tried downloading the file, piping it into a ReadStream, and allocating all the available cores on the machine to process batches of the data individually. This works, but it's not nearly as fast as we would like (even with 36 cores).
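Roughly, that setup looks something like this (simplified; the batch size and the worker script are illustrative, not our actual code):

// Simplified sketch: stream the extracted TSV line by line and hand batches
// of rows to worker threads, one per core. Batch size and worker script are illustrative.
const fs = require("fs");
const os = require("os");
const readline = require("readline");
const { Worker } = require("worker_threads");

async function processTsv(tsvPath) {
  const workers = os.cpus().map(() => new Worker("./process-batch.js")); // hypothetical worker script
  const rl = readline.createInterface({ input: fs.createReadStream(tsvPath) });
  const BATCH_SIZE = 100000;
  let batch = [];
  let next = 0;

  for await (const line of rl) {
    batch.push(line.split("\t")); // [deviceId, timestamp, lat, lng]
    if (batch.length >= BATCH_SIZE) {
      workers[next++ % workers.length].postMessage(batch); // round-robin batches to workers
      batch = [];
    }
  }
  if (batch.length) workers[next % workers.length].postMessage(batch);
}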

From Wikipedia:
Tools that correctly read ZIP archives must scan for the end of central directory record signature, and then, as appropriate, the other, indicated, central directory records. They must not scan for entries from the top of the ZIP file, because ... only the central directory specifies where a file chunk starts and that it has not been deleted. Scanning could lead to false positives, as the format does not forbid other data to be between chunks, nor file data streams from containing such signatures.
In other words, if you try to do it without looking at the end of the zip file first, you may end up accidentally including deleted files. So you can't trust streaming unzippers. However, if the zip file hasn't been modified since it was created, perhaps streaming parsers can be trusted. If you don't want to risk it, then don't use a streaming parser. (Which means you were right to download the file to disk first.)
To some extent it depends on the structure of the zip archive: if it consists of many moderately sized files, and if they can all be processed independently, then you don't need to have very much of it in memory at any one time. On the other hand, if you try to process many files in parallel you may run into the limit on the number of filehandles that can be open. You can get around this using something like a queue.
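For example, something like the async package's queue gives you a fixed concurrency limit. A rough sketch (the extracted directory and processFile are placeholders for your own paths and handler):

// Sketch: cap how many files are processed (and therefore open) at once.
// "./extracted" and processFile() are placeholders.
const fs = require("fs");
const path = require("path");
const async = require("async"); // npm install async

const q = async.queue((file, done) => {
  processFile(file) // your own per-file streaming handler, returning a promise
    .then(() => done())
    .catch(done);
}, 8); // at most 8 files in flight at a time

fs.readdirSync("./extracted").forEach((name) => q.push(path.join("./extracted", name)));

q.drain().then(() => console.log("all files processed")); // async v3: drain() returns a promise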
You say you have to sort the data by device ID and timestamp. That's another part of the process that can't be streamed. If you need to sort a large list of data, I'd recommend you save it to a database first; that way you can make it as big as your disk will allow, but also structured. You'd have a table where the columns are the columns of the TSV. You can stream from the TSV file into the database, and also index the database by deviceId and timestamp. And by this I mean a single index that uses both of those columns, in that order.
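As a concrete sketch of that, using better-sqlite3 purely as an example (any database that supports composite indexes works the same way):

// Sketch: stream the TSV into a table, then index on (deviceId, timestamp).
const fs = require("fs");
const readline = require("readline");
const Database = require("better-sqlite3"); // npm install better-sqlite3

async function loadTsv(tsvPath) {
  const db = new Database("pings.db");
  db.exec("CREATE TABLE IF NOT EXISTS pings (deviceId TEXT, timestamp TEXT, lat REAL, lng REAL)");
  const insert = db.prepare("INSERT INTO pings VALUES (?, ?, ?, ?)");
  const insertMany = db.transaction((rows) => rows.forEach((r) => insert.run(...r)));

  const rl = readline.createInterface({ input: fs.createReadStream(tsvPath) });
  let rows = [];
  for await (const line of rl) {
    const [deviceId, timestamp, lat, lng] = line.split("\t");
    rows.push([deviceId, timestamp, Number(lat), Number(lng)]);
    if (rows.length === 10000) { insertMany(rows); rows = []; } // batch inserts inside transactions
  }
  if (rows.length) insertMany(rows);

  // A single composite index: deviceId first, then timestamp.
  db.exec("CREATE INDEX IF NOT EXISTS idx_device_ts ON pings (deviceId, timestamp)");

  // Rows now come back grouped by device and ordered by time:
  // db.prepare("SELECT * FROM pings ORDER BY deviceId, timestamp").iterate()
  return db;
}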
If you want a distributed infrastructure, maybe you could store different device IDs on different disks with different CPUs etc ("sharding" is the word you want to google). But I don't know whether this will be faster. It would speed up the disk access. But it might create a bottleneck in network connections, through either latency or bandwidth, depending on how interconnected the different device IDs are.
Oh, and if you're going to be running multiple instances of this process in parallel, don't forget to create separate databases, or at the very least add another column to the database to distinguish separate instances.

Related

MongoDB Document Size Limitations

I have a collection of novels that looks as follows:
The Words array contains all words along with additional linguistic information related to each word. When I try to add longer texts (100k+ words), I get the error:
RangeError: attempt to write outside buffer bounds
Which, I have gathered, means that the BSON document is larger than 16 mb and therefore above the limit.
I'm assuming this is a relatively common situation. I am now considering how to work around this limitation - for example, I could split the novel into chunks of 10k words. Or does this mean the text should make up a separate collection (i.e. one new collection per uploaded text)? That option makes the least sense to me.
Is there a standard/suggested approach to designing a MongoDB database in this case?
Also, is it possible to check the size of the BSON before inserting a document in JS/Node?
Do you absolutely need to store the contents of the books in MongoDB? If you're simply serving the contents to users or processing them in bulk, I suggest storing them on disk or in an AWS S3 bucket or similar.
If you need the book contents to live in the database, try using MongoDB GridFS:
GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16 MB.
Instead of storing a file in a single document, GridFS divides the file into parts, or chunks, and stores each chunk as a separate document
When you query GridFS for a file, the driver will reassemble the chunks as needed. You can perform range queries on files stored through GridFS. You can also access information from arbitrary sections of files, such as to “skip” to the middle of a video or audio file.
Read more here:
https://docs.mongodb.com/manual/core/gridfs/
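A minimal sketch with the official Node.js driver might look like this (the connection string, database name, and bucket name are illustrative):

// Sketch: store and read back a large text via GridFS.
const fs = require("fs");
const { MongoClient, GridFSBucket } = require("mongodb");

async function storeNovel(filePath, title) {
  const client = new MongoClient("mongodb://localhost:27017"); // illustrative URI
  await client.connect();
  const bucket = new GridFSBucket(client.db("library"), { bucketName: "novels" });

  // Upload: GridFS splits the file into chunk documents, each well under the 16 MB limit.
  await new Promise((resolve, reject) => {
    fs.createReadStream(filePath)
      .pipe(bucket.openUploadStream(title))
      .on("finish", resolve)
      .on("error", reject);
  });

  // Download: the driver reassembles the chunks into a single stream.
  bucket.openDownloadStreamByName(title).pipe(process.stdout);
}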

How to automatically search and filter all engineers from LinkedIn and store results in Excel?

Does anyone know how I can parse LinkedIn accounts? Or any tool (not paid)?
For example:
I will look for "Software Engineer" from Dallas, TX.
The tool would automatically pick all candidates from LinkedIn (or, for example, just the first 100 candidates) and store their First Name, Last Name, LinkedIn link, and Experience in an Excel document. (Or candidates from a specific company.)
Should this be done through an API, or is there a specific type of account that allows it? Does anyone know of tools or scripts that would help?
I need to parse a large number of candidates (100+, maybe 1000+) and store them.
I have multiple thoughts about the implementation, but I feel this has almost certainly been implemented already.
https://developer.linkedin.com/docs/rest-api
Use the LinkedIn APIs to fetch the data and process it however you like. I don't know how many 'private' fields you can get access to, but names seem to be there.
I use Node.js to process Excel data - xlsx is a very good option, but it only allows synchronous execution, so you would have to spawn another process. It also has a filter function, so you can do whatever you want with the data.
The problem I faced with parsing large data into Excel is that an Excel file is a compressed XML format, so it takes a long time to read and write. A faster option is to create and read CSV, which Excel can open natively as well.
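For example, streaming rows straight to a CSV file is much cheaper than building an xlsx workbook. A sketch with made-up field names:

// Sketch: write candidate rows to a CSV file that Excel can open directly.
// Field names are illustrative.
const fs = require("fs");

function writeCsv(rows, outPath) {
  const out = fs.createWriteStream(outPath);
  out.write("firstName,lastName,profileUrl,experience\n");
  for (const r of rows) {
    // Quote every field and escape embedded quotes so commas in values don't break columns.
    const line = [r.firstName, r.lastName, r.profileUrl, r.experience]
      .map((v) => '"' + String(v == null ? "" : v).replace(/"/g, '""') + '"')
      .join(",");
    out.write(line + "\n");
  }
  out.end();
}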

Save configuration information using javascript/php

I am working with php and heatmap js to generate a heat-map.
I was thinking of going down the path of allowing the user to upload a floor-map JPG file initially and then letting them add the sensor names to different locations on the floor-map.
Once the sensor locations are specified, I need to save that configuration to an XML file. Once I have this set of information (img_id, [sensorid1,x1,y1], [sensorid2,x2,y2],..,[sensoridn,xn,yn]), I can query my database for the latest values of sensors and then display as heat-map on the image (on the specific sensors' x and y coordinates) real-time.
I would like to know if saving the configuration as XML is the right way of doing it. Is there a better way of temporarily storing the information using JavaScript/PHP?
There are likely a bunch of ways to solve this. My preference would be for JSON, as it is natively supported by JavaScript and PHP. It is also MUCH easier to read and write.
When you say "saving", what do you mean? If you need it to be stored server-side, then creating DB entities that the data structure can be mapped to and stored in will be far better than trying to create files server-side. Depending on how the app gets hosted, you may not have permission to do that, and if your server ever goes away you could lose that data. (However, there are safe ways to create files using a service like AWS S3.) Storing it in a database not only gives you a single place to worry about backups, but also lets you query the data in interesting and powerful ways (SQL etc.) easily, without having to figure out how to do that for files with every new query.
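As a rough illustration, the sensor layout could be sent from the browser as JSON and stored server-side (the endpoint and payload shape below are made up, not a fixed API):

// Sketch: post the sensor layout as JSON; the server (PHP or otherwise) inserts it into a table.
var config = {
  imgId: "floor-2", // illustrative
  sensors: [
    { sensorId: "temp-01", x: 120, y: 340 },
    { sensorId: "temp-02", x: 410, y: 95 }
  ]
};

fetch("/api/heatmap-config", { // hypothetical endpoint
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(config)
}).then(function (res) { return res.json(); })
  .then(function (saved) { console.log("config saved", saved); });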

Breeze.js cache limitations? Or Browser?

We are investigating using Breeze for field deployment of some tools. The scenario is this -- an auditor will visit sites in the field, where most of the time there will be no -- or very degraded -- internet access. Rather than replicate our SQL database on all the laptops and tablets (if that's even possible), we are hoping to use Breeze to cache the data and then store it locally so it is accessible when there is not a usable connection.
Unfortunately, Breeze seems to choke when caching any significant amount of data. Generally on Chrome it's somewhere between 8 and 13MB worth of entities (as measured by the HTTP response headers). This can change a bit depending on how many tabs I have open and such, but I have not been able to move that by more than 10%. The error I get is that the Chrome tab crashes and tells me to reload. The error is reproducible (I download the data in 100K chunks and it fails on the same read every time, and works fine if I stop it after the previous read). When I change the page size, it always fails within the same range.
Is this a limitation of Breeze, or Chrome? Or Windows? I tried it on Firefox, and it handles even less data before the whole browser crashes. IE fares a little better, but none of them do great.
Looking at performance in task manager, I get the following:
IE goes from 250MB memory usage to 1.7GB during the caching process and caches a total of about 14MB before throwing an out-of-memory error.
Chrome goes from 206MB memory usage to about 850MB while caching a total of around 9MB.
Firefox goes from around 400MB to about 750MB and manages to cache about 5MB before the whole program crashes.
I can calculate how much will be downloaded with any selection criteria, but I cannot find a way to calculate how much data can be handled by any specific browser instance. This makes using Breeze for offline auditing close to useless.
Has anyone else tackled this problem yet? What are the best approaches to handling something like this? I've thought of several things, but none of them are ideal. Any ideas would be appreciated.
ADDED at Steve Schmitt's request:
Here are some helpful links:
Metadata
Entity Diagram (pdf) (and html and edmx)
The first query, just to populate the tags on the page, runs quickly and downloads minimal data:
var query = breeze.EntityQuery
    .from("Countries")
    .orderBy("Name")
    .expand("Regions.Districts.Seasons, Regions.Districts.Sites");
Once the user has selected the Sites they wish to cache, the following two queries are kicked off (this used to be one query, but I broke it into two hoping it would be less of a burden on resources -- it didn't help). The first query (usually 2-3K entities and about 2MB) runs as expected. Some combination of the predicates listed is used to filter the data.
var qry = breeze.EntityQuery
    .from("SeasonClients")
    .expand("Client,Group.Site,Season,VSeasonClientCredit")
    .orderBy("DistrictId,SeasonId,GroupId,ClientId");
var p = breeze.Predicate("District.Region.CountryId", "==", CountryId);
var p1 = breeze.Predicate("SeasonId", "==", SeasonId);
var p2 = breeze.Predicate("DistrictId", "==", DistrictId);
var p3 = breeze.Predicate("Group.Site.SiteId", "in", SiteIds);
After the first query runs, the second query (below) runs, also using some combination of the listed predicates to filter the data. At about 9MB, it will have about 50K rows to download. When the total download burden between the two queries is between 10MB and 13MB, browsers will crash.
var qry = breeze.EntityQuery
    .from("Repayments")
    .orderBy('SeasonId,ClientId,RepaymentDate');
var p1 = breeze.Predicate("District.Region.CountryId", "==", CountryId);
var p2 = breeze.Predicate("SeasonId", "==", SeasonId);
var p3 = breeze.Predicate("DistrictId", "==", DistrictId);
var p4 = breeze.Predicate("SiteId", "in", SiteIds);
Thanks for the interest, Steve. You should know that the Entity Relationships are inherited and currently in production supporting the majority of the organization's operations, so as few changes as possible to that would be best. Also, the hope is to grow this from a reporting application to one with which data entry can be done in the field (so, as I understand it, using projections to limit the data wouldn't work).
Thanks for the interest, and let me know if there is anything else you need.
Here are some suggestions based on my experience building an offline-capable web application using Breeze. Some or all of these might not make sense for your use cases...
Identify which entity types need to be editable vs which are used to fill drop-downs etc. Load non-editable data using the noTracking query option and cache it in localStorage yourself using JSON.stringify (see the sketch after this list). This avoids the overhead of coercing the data into entities, change tracking, etc. Good candidates for this approach in your model might be entity types like Country, Region, District, Site, etc.
If possible, provide a facility in your application for users to identify which records they want to "take offline". This way you don't need to load and cache everything, which can get quite expensive depending on the number of relationships, entities, properties, etc.
In conjunction with suggestion #2, avoid loading all the editable data at once and avoid using the same EntityManager instance to load each set of data. For example, if the Client entity is something that needs to be editable out in the field without a connection, create a new EntityManager, load a single client (expanding any children that also need to be editable) and cache this data separately from other clients.
Cache the breeze metadata once. When calling exportEntities the includeMetadata argument should be false. More info on this here.
To create new EntityManager instances make use of the createEmptyCopy method.
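Here is a rough sketch of the noTracking/localStorage idea combined with createEmptyCopy (entity names and storage keys are illustrative):

// Sketch: load lookup data with noTracking and stash the raw JSON in localStorage.
var lookupManager = masterManager.createEmptyCopy(); // a fresh manager just for this load

var query = breeze.EntityQuery
    .from("Countries")
    .orderBy("Name")
    .noTracking(true); // skip entity coercion and change tracking

lookupManager.executeQuery(query).then(function (data) {
    // noTracking results are plain objects and are not cached in the EntityManager,
    // so store them ourselves for offline use.
    localStorage.setItem("lookup.countries", JSON.stringify(data.results));
});

// Later, while offline:
var countries = JSON.parse(localStorage.getItem("lookup.countries") || "[]");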
EDIT:
I want to respond to this comment:
Say I have a client who has bills and payments. That client is in a group, in a site, in a region, in a country. Are you saying that the client, payment, and bill information might each have their own EM, while the location hierarchy might be in a 4th EM with no-tracking? Then when I refer to them, I wire up the relationships as needed using LINQs on the different EMs (give me all the bills for customer A, give me all the payments for customer A)?
It's a bit of a judgement call in terms of deciding how to separate things out. Some of what I'm suggesting might be overkill, it really depends on the amount of data and the way your application is used.
Assuming you don't need to edit groups, sites, regions and countries while offline, the first thing I'd do would be to load the list of groups using the noTracking option and cache them in localStorage for offline use. Then do the same for sites, regions and countries. Keep in mind, entities loaded with the noTracking option aren't cached in the entity manager so you'll need to grab the query result, JSON.stringify it and then call localStorage.setItem. The intent here is to make sure your application always has access to the list of groups, sites, regions, etc so that when you display a form to edit a client entity you'll have the data you need to populate the group, site, region and country select/combobox/dropdown.
Assuming the user has identified the subset of clients they want to work with while offline, I'd then load each of these clients one at a time (including their payment and bill information but not expanding their group, site, region, or country) and cache each client+payments+bills set using entityManager.exportEntities. The reasoning here is that it doesn't make sense to load several clients plus their payments and bills into the same EntityManager each time you want to edit a particular client. That could be a lot of unnecessary overhead, but again, this is a bit of a judgement call.
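A rough sketch of that per-client flow (the entity names, expands, and storage key are illustrative):

// Sketch: load one client (with its bills and payments) into its own manager and export it.
function cacheClientOffline(masterManager, clientId) {
    var manager = masterManager.createEmptyCopy(); // isolated EntityManager for this client

    var query = breeze.EntityQuery
        .from("Clients")
        .where("ClientId", "==", clientId)
        .expand("Bills,Payments"); // only the children that must be editable offline

    return manager.executeQuery(query).then(function () {
        // Export just this client's graph; skip the metadata since it's cached once elsewhere.
        var bundle = manager.exportEntities(null, { includeMetadata: false });
        localStorage.setItem("client." + clientId, bundle);
    });
}
// Back in the field, importEntities the bundle into a manager that already has the metadata.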
@Jeremy's answer was excellent and very helpful, but didn't actually answer the question, which I was starting to think was unanswerable, or at least the wrong question. However, @Steve in the comments gave me the most appropriate information for this question.
It is neither Breeze nor the browser, but rather Knockout. Apparently the Knockout wrapper around the Breeze entities uses all that memory (at least while loading the entities, and in my environment). As described above, Knockout/Breeze would crap out after reading around 5MB of data, causing Chrome to crash with over 1.7GB of memory usage (from a pre-download memory usage of around 300MB). Rewriting the app in AngularJS eliminated the problem. So far I have been able to download over 50MB from the exact same EF6 model into Breeze/Angular, and total Chrome memory usage never went above 625MB.
I will be testing larger payloads, but 50 MB more than satisfies my needs for the moment. Thanks everyone for your help.

Use .js files for caching large dropdown lists

I would like to keep the contents of large UI lists cached on the client, and updated according to criteria or on a regular schedule. Client-side code can then just fill the dropdowns locally, avoiding long page download times.
These lists can be close to 4k items, and dynamically filtering them without caching would result in several rather large round trips.
How can I go about this? I mean, what patterns and strategies would be suitable for this?
Aggressive caching of JSON would work for this: you hash the JS file and append the hash to its URL, updating it whenever the file changes. One revision might look like this:
/media/js/ac.js?1234ABCD
And when the file changes, the hash changes.
/media/js/ac.js?4321DCBA
This way, when a client loads the page, your server-side code links to the hashed URL, and the client will get a 304 Not Modified response on their next page load (assuming you have this enabled on your server). If you use this method you should set the files to never expire, as the "expiring" portion is dealt with by the hash: when the JS file does change, the hash will change and the client won't get a 304, but rather a 200.
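Server-side, generating that fingerprinted URL can be as simple as hashing the file's contents. A Node sketch (paths and the template usage are illustrative):

// Sketch: fingerprint the JS file by hashing its contents, so the URL only changes
// when the file does. Paths are illustrative.
const crypto = require("crypto");
const fs = require("fs");

function fingerprintUrl(filePath, publicPath) {
  const hash = crypto.createHash("md5")
    .update(fs.readFileSync(filePath))
    .digest("hex")
    .slice(0, 8);
  return publicPath + "?" + hash; // e.g. /media/js/ac.js?1234abcd
}

// In whatever templating you use, emit the result of
// fingerprintUrl("media/js/ac.js", "/media/js/ac.js") as the script src.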
ac.js might contain a list or other iterable that your autocomplete code can parse as its completion pool, and you'd access it just like any other JS variable.
Practically speaking, though, this shouldn't be necessary for most projects. Using something like memcached server-side and gzip compression will make the file both small and amazingly fast to load. If the list is HUGE (say, thousands upon thousands of items) you might want to consider this.
Combres is a good solution for this - it will track changes and have the browser cache the js forever until a change is made, in which case it changes the URL of the item.
http://combres.codeplex.com/
Rather than storing the data locally, you might consider using jQuery and AJAX to dynamically update the dropdown lists. Calls can be made whenever needed, and the downloads would be pretty quick.
Just a thought.
This might be helpful:
http://think2loud.com/using-jquery-and-xml-to-populate-a-drop-down-box/
If it's just textual data, compression is enabled on the web server, and there are fewer than 100 items, then there may be no need to maintain lists in the client script.
It's usually best to put all your data (list items are data) in one place so you don't have to worry about synchronization.
