Stream JSON-parsable array to file - javascript

So you're reading data from a file, cleaning it up, and writing it back out to another file, but the new file isn't valid JSON.
You need to build an object in the new file: you get a chunk from the file, alter it, and save it to the new file.
For this you stream the data out, edit the chunks, and stream them back into the other file. Great.
You make sure to add a , after each item to keep the array readable later on,
but now the last item has a trailing comma ,...
You don't know how many items are in the original file, and you also don't know when the reader is at the last item.
You use something like JSONStream on that array, but JSONStream does not provide the index either.
The only end events are for your writers and readers.
How do you remove the trailing comma before/after writing?
const fs = require('fs');
const read_file = 'animals.json'; //very large file
const write_file = 'brown_dogs.json'; //moderately large file

let read_stream = fs.createReadStream(read_file);
let write_stream = fs.createWriteStream(write_file);
let dog_stream = require('JSONStream').parse('array_of_animals.dogs.*');

write_stream
  .on('finish', () => {
    //the writer is done writing my list of dogs, but my array has a
    //trailing comma, now my brown_dogs.json isn't parsable
  })
  .write('{"brown_dogs": ['); //let's start

read_stream
  .pipe(dog_stream)
  .on('data', dog => {
    //basic logic before we save the item
    if (dog._fur_colour === 'brown') {
      let _dog = {
        type: dog._type,
        colour: dog._fur_colour,
        size: dog._height
      };
      //we write our accepted dog
      write_stream.write(JSON.stringify(_dog) + ',');
    }
  })
  .on('end', () => {
    //done reading animals.json
    write_stream.write(']}');
  });
--
If your resulting JSON file is small, you may simply add all the dogs to an array and only save the contents to the file in one go (a sketch of this follows below). This means the file is not only valid JSON, but also small enough to simply open with JSON.parse().
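For the small-file case, a minimal sketch of that approach (reusing the streams declared above) might look like this:

const dogs = [];
read_stream
  .pipe(dog_stream)
  .on('data', dog => {
    if (dog._fur_colour === 'brown') {
      //collect instead of writing immediately
      dogs.push({ type: dog._type, colour: dog._fur_colour, size: dog._height });
    }
  })
  .on('end', () => {
    //a single write means no separator bookkeeping at all
    fs.writeFileSync(write_file, JSON.stringify({ brown_dogs: dogs }));
  });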
If your resulting JSON file is large, you may need to stream the items out in any case. Luckily JSONStream allows us not only to extract each dog individually but also to ignore the trailing comma.
This is what I understand to be the solution...but I don't think it's perfect. Why can't the file be valid JSON, regardless of its size?

This is actually very simple.
Start with an empty separator string and write it before each item, then set it to a comma after the first insert. The comma then lands between items, never after the last one.
//update this string after the first insert
let separator = '';

read_stream
  .pipe(dog_stream)
  .on('data', dog => {
    //basic logic before we save the item
    if (dog._fur_colour === 'brown') {
      let _dog = {
        type: dog._type,
        colour: dog._fur_colour,
        size: dog._height
      };
      //we write our accepted dog, preceded by the separator
      write_stream.write(separator + JSON.stringify(_dog));
      //update this after the first insert
      separator = ',';
    }
  })

I think I added the toJSONArray method to scramjet exactly for this - see the docs there. It puts a comma only between the chunks.
The code would look like this:
const fs = require('fs');
const { DataStream } = require('scramjet');

fs.createReadStream(read_file)
  .pipe(require('JSONStream').parse('array_of_animals.dogs.*'))
  .pipe(new DataStream())
  .filter(dog => dog._fur_colour === 'brown') // this will filter out the non-brown dogs
  .map(dog => { // remap the data
    return {
      type: dog._type,
      colour: dog._fur_colour,
      size: dog._height
    };
  })
  .toJSONArray(['{"brown_dogs": [', ']}']) // add your enclosure
  .pipe(fs.createWriteStream(write_file));
This code should produce valid JSON.

Related

URL Parse Exercise (JavaScript)

So here is a description of the problem that I've been asked to solve:
We need some logic that extracts the variable parts of a url into a hash. The keys
of the extracted hash will be the "names" of the variable parts of the url, and the
values of the hash will be their values. We will be supplied with:
A url format string, which describes the format of a url. A url format string
can contain constant parts and variable parts, in any order, where "parts"
of a url are separated with "/". All variable parts begin with a colon. Here is
an example of such a url format string:
'/:version/api/:collection/:id'
A particular url instance that is guaranteed to have the format given by
the url format string. It may also contain url parameters. For example,
given the example url format string above, the url instance might be:
'/6/api/listings/3?sort=desc&limit=10'
Given this example url format string and url instance, the hash we want that
maps all the variable parts of the url instance to their values would look like this:
{
version: 6,
collection: 'listings',
id: 3,
sort: 'desc',
limit: 10
}
So I technically have a semi-working solution to this, but my questions are:
Am I understanding the task correctly? I'm not sure if I'm supposed to be dealing with two inputs (URL format string and URL instance) or if I'm just supposed to be working with one URL as a whole. (my solution takes two separate inputs)
In my solution, I keep reusing the split() method to chunk the array/s down and it feels a little repetitive. Is there a better way to do this?
If anyone can help me understand this challenge better and/or help me clean up my solution, it would be greatly appreciated!
Here is my JS:
function parseUrl(str1, str2) {
  const obj = {}; // declared inside the function so results don't leak between calls
  const keyArr = [];
  const valArr = [];
  const splitStr1 = str1.split("/");
  const splitStr2 = str2.split("?");
  let val1 = splitStr2[0].split("/");
  let val2 = splitStr2[1].split("&");
  splitStr1.forEach((i) => {
    keyArr.push(i);
  });
  val1.forEach((i) => {
    valArr.push(i);
  });
  val2.forEach((i) => {
    keyArr.push(i.split("=")[0]);
    valArr.push(i.split("=")[1]);
  });
  for (let i = 0; i < keyArr.length; i++) {
    if (keyArr[i] !== "" && valArr[i] !== "") {
      obj[keyArr[i]] = valArr[i];
    }
  }
  return obj;
}
console.log(parseUrl('/:version/api/:collection/:id', '/6/api/listings/3?sort=desc&limit=10'));
And here is a link to my codepen so you can see my output in the console:
https://codepen.io/TOOTCODER/pen/yLabpBo?editors=0012
Am I understanding the task correctly? I'm not sure if I'm supposed to
be dealing with two inputs (URL format string and URL instance) or if
I'm just supposed to be working with one URL as a whole. (my solution
takes two separate inputs)
Yes, your understanding of the problem seems correct to me. What this task is asking you to do is implement a route-parameter and query-string parser. These often come up when you want to extract data from parts of a URL on the server side (although you don't usually need to implement this logic yourself). Do keep in mind, though, that you only want the path parameters that have a : in front of them in the format string (currently you're retrieving values for all parts), so e.g. api in your answer should be excluded from the object (i.e. the hash).
In my solution, I keep reusing the split() method to chunk the array/s
down and it feels a little repetitive. Is there a better way to do
this?
The number of .split() calls you have may seem like a lot, but each of them serves its own purpose in extracting the required data. You can, however, make use of other array methods such as .map() and .filter() to cut your code down a little. The code below also considers the case when no query string (i.e. ?key=value) is provided:
function parseQuery(queryString) {
  return queryString.split("&").map(qParam => qParam.split("="));
}

function parseUrl(str1, str2) {
  const keys = str1.split("/")
    .map((key, idx) => [key.replace(":", ""), idx, key.charAt(0) === ":"])
    .filter(([, , keep]) => keep);
  const [path, query = ""] = str2.split("?");
  const pathParts = path.split("/");
  const entries = keys.map(([key, idx]) => [key, pathParts[idx]]);
  return Object.fromEntries(query ? [...entries, ...parseQuery(query)] : entries);
}

console.log(parseUrl('/:version/api/:collection/:id', '/6/api/listings/3?sort=desc&limit=10'));
It would be even better not to re-invent the wheel, and instead make use of the URL constructor, which will allow you to extract the required information from your URLs more easily, such as the search parameters. This, however, requires that both strings are valid URLs:
function parseUrl(str1, str2) {
  const { pathname, searchParams } = new URL(str2);
  const keys = new URL(str1).pathname.split("/")
    .map((key, idx) => [key.replace(":", ""), idx, key.startsWith(":")])
    .filter(([, , keep]) => keep);
  const pathParts = pathname.split("/");
  const entries = keys.map(([key, idx]) => [key, pathParts[idx]]);
  return Object.fromEntries([...entries, ...searchParams]);
}

console.log(parseUrl('https://www.example.com/:version/api/:collection/:id', 'https://www.example.com/6/api/listings/3?sort=desc&limit=10'));
Above, we still need to write our own custom logic to obtain the URL parameters; however, we don't need to write any logic to extract the query-string data, as this is done for us by URLSearchParams. We're also able to lower the number of .split()s used, as the URL constructor gives us an object with the URL already parsed. If you end up using a library (such as express), you will get the above functionality out of the box.
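For illustration, a minimal sketch of the same extraction with express (my example, assuming an HTTP server context; the port is made up):

const express = require('express');
const app = express();

// express parses both the route parameters and the query string for us
app.get('/:version/api/:collection/:id', (req, res) => {
  // req.params -> { version: '6', collection: 'listings', id: '3' }
  // req.query  -> { sort: 'desc', limit: '10' }
  res.json({ ...req.params, ...req.query });
});

app.listen(3000);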

Alpha sort according to a specific part of string

I have a list of filenames, each file looking like this:
https://www.dropbox.com/s/xxx/NF%208700%test%test%file.pdf?raw=1
These files are dynamically put into a list, so they don't remain in their correct order. I'd like to sort the whole list at once, but only by the end of the url (the file name).
Like this:
/NF%208700%test%test%file.pdf?raw=1
I can get the end of the file by doing this
let index = curr.lastIndexOf("/");
console.log(curr.slice(index));
But my question is, how can I sort a whole list by only a specific part of a string?
Possible solution
const data = ['https://www.dropbox.com/s/xxx/NF%2', 'https://www.dropbox.com/s/xxx/NF%1'];
const result = data.sort((nameA, nameB) => nameA.split('/').slice(-1)[0].localeCompare(nameB.split('/').slice(-1)[0]));
console.log(result);
You're looking for Array.prototype.sort(). Note that the comparator must return a number (negative, zero, or positive), not a boolean, so compare the slices with localeCompare.
Something along the lines of:
const sortedFiles = files.sort((a, b) => {
  return a.slice(a.lastIndexOf('/')).localeCompare(b.slice(b.lastIndexOf('/')));
});

How can I efficiently write numeric data to a file?

Say I have an array containing a million random numbers:
[ 0.17309080497872764, 0.7861753816498267, ...]
I need to save them to disk, to be read back later. I could store them in a text format like JSON or csv, but that will waste space. I'd prefer a binary format where each number takes up only 8 bytes on disk.
How can I do this using node?
UPDATE
I did not find an answer to this specific question, with a full example, in the supposedly duplicate question. I was able to solve it myself, but in a verbose way that could surely be improved:
const fs = require('fs');

// const a = map(Math.random, Array(10));
const a = [
  0.9651891365487693,
  0.7385397746441058,
  0.5330173086062189,
  0.08100066198727673,
  0.11758119861500771,
  0.26647845473863674,
  0.0637438360410223,
  0.7070151519015955,
  0.8671093412761386,
  0.20282735866103718
];

// write the array to file as raw bytes (80B total)
var wstream = fs.createWriteStream('test.txt');
a.forEach(num => {
  const b = Buffer.alloc(8); // new Buffer(8) is deprecated
  b.writeDoubleLE(num);
  wstream.write(b);
});
wstream.end(() => {
  // read it back
  const buff = fs.readFileSync('test.txt');
  const aa = a.map((_, i) => buff.readDoubleLE(8 * i));
  console.log(aa);
});
I think this was answered in Read/Write bytes of float in JS
The ArrayBuffer solution is probably what you are looking for.
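For reference, a more compact sketch of that approach (my own illustration, not code from the linked answer): pack all the doubles into one Float64Array and write its underlying bytes in a single call.

const fs = require('fs');

// write: a Float64Array stores 8 bytes per number in one ArrayBuffer
const floats = Float64Array.from(a);
fs.writeFileSync('test.bin', Buffer.from(floats.buffer));

// read back: view the file's bytes as doubles
// (Node aligns pooled buffers to 8 bytes, so this offset view is valid)
const buf = fs.readFileSync('test.bin');
const restored = new Float64Array(buf.buffer, buf.byteOffset, buf.length / 8);
console.log(Array.from(restored));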

Line-oriented streams in Node.js

I'm developing a multi-process application using Node.js. In this application, a parent process will spawn a child process and communicate with it using a JSON-based messaging protocol over a pipe. I've found that large JSON messages may get "cut off", such that a single "chunk" emitted to the data listener on the pipe does not contain the full JSON message. Furthermore, small JSON messages may be grouped in the same chunk. Each JSON message will be delimited by a newline character, and so I'm wondering if there is already a utility that will buffer the pipe read stream such that it emits one line at a time (and hence, for my application, one JSON document at a time). This seems like it would be a pretty common use case, so I'm wondering if it has already been done.
I'd appreciate any guidance anyone can offer. Thanks.
Maybe Pedro's carrier can help you?
Carrier helps you implement new-line
terminated protocols over node.js.
The client can send you chunks of
lines and carrier will only notify you
on each completed line.
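For example (my sketch, based on carrier's documented carry() helper; child is the spawned process):

const carrier = require('carrier');

// carrier buffers the stream and fires once per completed line
carrier.carry(child.stdout, line => {
  const message = JSON.parse(line); // one JSON document per line
  // handle message...
});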
My solution to this problem is to send JSON messages each terminated with some special unicode character, one that you would never normally get in a JSON string. Call it TERM.
So the sender just does JSON.stringify(message) + TERM and writes it.
The receiver then splits incoming data on the TERM and parses the parts with JSON.parse(), which is pretty quick.
The trick is that the last part may not parse, so we simply save that fragment and add it to the beginning of the next chunk when it comes. Receiving code goes like this:
s.on("data", function (data) {
var info = data.toString().split(TERM);
info[0] = fragment + info[0];
fragment = '';
for ( var index = 0; index < info.length; index++) {
if (info[index]) {
try {
var message = JSON.parse(info[index]);
self.emit('message', message);
} catch (error) {
fragment = info[index];
continue;
}
}
}
});
Where "fragment" is defined somwhere where it will persist between data chunks.
But what is TERM? I have used the unicode replacement character '\uFFFD'. One could also use the technique used by twitter where messages are separated by '\r\n' and tweets use '\n' for new lines and never contain '\r\n'
I find this to be a lot simpler than messing with including lengths and such like.
The simplest solution is to send the length of the JSON data before each message as a fixed-length prefix (4 bytes?) and have a simple un-framing parser which buffers small chunks or splits bigger ones, as sketched below.
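As an illustration only (my own sketch of that idea, using s for the pipe stream as above):

// sender: write a 4-byte big-endian length prefix before each JSON payload
function sendMessage(stream, message) {
  const payload = Buffer.from(JSON.stringify(message));
  const header = Buffer.alloc(4);
  header.writeUInt32BE(payload.length, 0);
  stream.write(Buffer.concat([header, payload]));
}

// receiver: buffer chunks and emit only complete frames
let pending = Buffer.alloc(0);
s.on('data', chunk => {
  pending = Buffer.concat([pending, chunk]);
  while (pending.length >= 4) {
    const len = pending.readUInt32BE(0);
    if (pending.length < 4 + len) break; // wait for the rest of the frame
    const message = JSON.parse(pending.slice(4, 4 + len).toString());
    pending = pending.slice(4 + len);
    // handle message...
  }
});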
You can try node-binary to avoid writing the parser manually. Look at the scan(key, buffer) documentation example - it does exactly line-by-line reading.
As long as newlines (or whatever delimiter you use) will only delimit the JSON messages and not be embedded in them, you can use the following pattern:
let buf = ''
s.on('data', data => {
buf += data.toString()
const idx = buf.indexOf('\n')
if (idx < 0) { return } // No '\n', no full message
let lines = buf.split('\n')
buf = lines.pop() // if ends in '\n' then buf will be empty
for (let line of lines) {
// Handle the line
}
})

Count number of lines in CSV with Javascript

I'm trying to think of a way to count the number of lines in a .csv file using Javascript, any useful tips or resources someone can direct me to?
It depends what you mean by a line. For a simple count of newlines, Robusto's answer is fine.
If you want to know how many rows of CSV data that represents, things may be a little more difficult, as a CSV field may itself contain a newline:
field1,"field
two",field3
...is one row, at least in CSV as defined by RFC4180. (It's one of the aggravating features of CSV that there are so many non-standard variants; the RFC itself was very late to the game.)
So if you need to cope with that case you'll have to essentially parse each field.
A field can be raw, or (necessarily if it contains \n or ,) quoted, with " represented as double quotes. So a regex for one field would be:
"([^"]|"")*"|[^,\n]*
and so for a whole row (assuming it is not empty):
("([^"]|"")*"|[^,\n]*)(,("([^"]|"")*"|[^,\n]*))*\n
and to get the number of those:
var rowsn= csv.match(/(?:"(?:[^"]|"")*"|[^,\n]*)(?:,(?:"(?:[^"]|"")*"|[^,\n]*))*\n/g).length;
If you are lucky enough to be dealing with a variant of CSV that complies with RFC4180's recommendation that there are no " characters in unquoted fields, you can make this a bit more readable. Split on newlines as before and count the number of " characters in each line. If it's an even number, you have a complete line; if it's an odd number you've got a split.
var lines = csv.split('\n');
for (var i = lines.length; i-- > 0;)
  if (((lines[i].match(/"/g) || []).length % 2) === 1) // guard against lines with no quotes at all
    lines.splice(i - 1, 2, lines[i - 1] + lines[i]);
var rowsn = lines.length;
To count the number of lines in a document (once you have it as a string in Javascript), simply do:
var lines = csvString.split("\n").length;
You can use '.' to match everything on a line except the newline at the end;
note that this won't handle quoted newlines. Use the 'm' (multiline) flag, as well as 'g' (global).
function getLines(s){
return s.match(/^(.*)$/mg);
}
alert(getLines(string).length)
If you don't mind skipping empty lines it is simpler,
but sometimes you need to keep them for spacing.
function getLines(s){
return s.match(/(.+)/g);
}
If you are asking how to count the number of rows in a csv then you can use this example:
http://purbayubudi.wordpress.com/2008/11/09/csv-parser-using-javascript/
It takes the csv file and displays the number of rows in a popup window.
Here is a sample code in Typescript.
FileReader is needed to obtain the contents.
Returning a promise makes it easy to await the result of the asynchronous readAsText and onload callbacks.
const countRowsInCSV = async (csvFile: File): Promise<number> => {
  return new Promise((resolve, reject) => {
    try {
      const reader = new FileReader();
      reader.onload = (event: any) => {
        const csvData = event.target.result;
        const rowData = csvData.split('\n');
        resolve(rowData.length);
      };
      reader.readAsText(csvFile);
    } catch (error: any) {
      reject(error);
    }
  });
};

const onChangeFile = async (selectedFile: File) => { // must be async to await the count
  const totalRows = await countRowsInCSV(selectedFile);
};
