How to parse a dirty CSV with Node.js? - javascript

I'm scratching my head on a CSV file I cannot parse correctly, due to many errors. I extracted a sample you can download here: Test CSV File
Main errors (or what generated an error) are:
Quotes & commas (many errors when trying to parse the file with R)
Empty rows
Unexpected line break inside a field
I first tried using regular expressions line by line to clean the data before loading it into R, but I couldn't solve the problem and it was too slow (200 MB file).
So I decided to use a CSV parser under Node.js with the following code:
'use strict';
const Fs = require('fs');
const Csv = require('csv');

let input = 'data_stack.csv';
let readStream = Fs.createReadStream(input);
let option = {delimiter: ',', quote: '"', escape: '"', relax: true};
let parser = Csv.parse(option).on('data', (data) => {
    console.log(data);
});
readStream.pipe(parser);
But:
Some rows are parsed correctly (array of strings)
Some are not parsed (all fields are one string)
Some rows are still empty (this can be solved by adding skip_empty_lines: true to the options)
I don't know how to handle the unexpected line break.
I don't know how to make this CSV clean, neither with R nor with Node.js.
Any help?
EDIT:
Following @Danny_ds's solution, I can parse it correctly. But now I cannot stringify it back correctly.
With console.log() I get a proper object, but when I try to stringify it, I don't get clean CSV (there are still line breaks and empty rows).
Here is the code I'm using:
'use strict';
const Fs = require('fs');
const Csv = require('csv');
let input = 'data_stack.csv';
let output = 'data_output.csv';
let readStream = Fs.createReadStream(input);
let writeStream = Fs.createWriteStream(output);
let opt = {delimiter: ',', quote: '"', escape: '"', relax: true, skip_empty_lines: true};
let transformer = Csv.transform(data => {
    let dirty = data.toString();
    let replace = dirty.replace(/\r\n"/g, '\r\n').replace(/"\r\n/g, '\r\n').replace(/""/g, '"');
    return replace;
});
let parser = Csv.parse(opt);
let stringifier = Csv.stringify();
readStream.pipe(transformer).pipe(parser).pipe(stringifier).pipe(writeStream);
EDIT 2:
Here is the final code that works:
'use strict';
const Fs = require('fs');
const Csv = require('csv');
let input = 'data_stack.csv';
let output = 'data_output.csv';
let readStream = Fs.createReadStream(input);
let writeStream = Fs.createWriteStream(output);
let opt = {delimiter: ',', quote: '"', escape: '"', relax: true, skip_empty_lines: true};
let transformer = Csv.transform(data => {
    let dirty = data.toString();
    let replace = dirty
        .replace(/\r\n"/g, '\r\n')
        .replace(/"\r\n/g, '\r\n')
        .replace(/""/g, '"');
    return replace;
});
let parser = Csv.parse(opt);
let cleaner = Csv.transform(data => {
    let clean = data.map(l => {
        if (l.length > 100 || l[0] === '+') {
            return "Encoding issue";
        }
        return l;
    });
    return clean;
});
let stringifier = Csv.stringify();
readStream.pipe(transformer).pipe(parser).pipe(cleaner).pipe(stringifier).pipe(writeStream);
Thanks to everyone!

I don't know how to make this CSV clean, neither with R nor with Node.js.
Actually, it is not as bad as it looks.
This file can easily be converted to a valid csv using the following steps:
replace all "" with ".
replace all \n" with \n.
replace all "\n with \n.
With \n meaning a newline, not the characters "\n" which also appear in your file.
Note that in your example file \n is actually \r\n (0x0d, 0x0a), so depending on the software you use you may need to replace \n with \r\n in the above examples. Also, in your example there is a newline after the last row, so a quote as the last character will be replaced too; you might want to check this against the original file.
This should produce a valid csv file:
There will still be multiline fields, but that was probably intended. But now those are properly quoted and any decent csv parser should be able to handle multiline fields.
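The three replacements above can be sketched as a small Node.js helper. One ordering note: collapsing "" first could merge a doubled quote into a row-wrapping quote next to a newline, so the newline-adjacent quotes are stripped first. This is only a sketch, assuming the whole file fits in memory and uses \r\n line endings as the sample file does:

```javascript
// Sketch of the cleanup described above: strip the bogus quotes that
// wrap whole rows, then collapse the doubled (escaped) quotes.
function cleanCsv(text) {
  return text
    .replace(/\r\n"/g, '\r\n') // drop a quote right after a line break
    .replace(/"\r\n/g, '\r\n') // drop a quote right before a line break
    .replace(/""/g, '"');      // collapse doubled quotes
}
```

For example, a row that was wrongly quoted as a whole, `x\r\n"a,""b"",c"\r\ny`, comes out as `x\r\na,"b",c\r\ny`.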
It looks like the original data has had an extra pass for escaping quote characters:
If the original fields contained a , they were quoted, and if those fields already contained quotes, the quotes were escaped with another quote - which is the correct way to do it.
But then all rows containing a quote seem to have been quoted again (actually converting those rows to one quoted field), and all the quotes inside that row were escaped with another quote.
Obviously, something went wrong with the multiline fields: quotes were added between the multiple lines too, which is not correct.

The data is not too messed up to work with. There is a clear pattern.
General steps:
Temporarily remove the mixed-format inner fields (those beginning with two or more quotes and containing all kinds of characters).
Remove quotes from start and end of quoted lines giving clean CSV
Split data into columns
Replace removed fields
Step 1 above is the most important. If you apply it, the problems with newlines, empty rows, quotes and commas disappear. If you look at the data you can see that columns 7, 8 and 9 contain mixed data, but it is always delimited by two or more quotes, e.g.
good,clean,data,here,"""<-BEGINNING OF FIELD DATA> Oh no
++\n\n<br/>whats happening,, in here, pages of chinese
characters etc END OF FIELD ->""",more,clean,data
Here is a working example based on the file provided:
const fs = require('fs');

fs.readFile('./data_stack.csv', (e, data) => {
    // Take out fields that are delimited with double+ quotes
    var dirty = data.toString();
    var matches = dirty.match(/""[\s\S]*?""/g) || [];
    matches.forEach((m, i) => {
        dirty = dirty.replace(m, "<REPL-" + i + ">");
    });
    var cleanData = dirty
        .split('\n') // get lines
        // ignore first line with column names
        .filter((l, i) => i > 0)
        // remove first and last quotation mark if present
        // (the trailing \r of the \r\n line ending goes along with the quote)
        .map(l => l[0] === '"' ? l.substring(1, l.length - 2) : l)
        // split into columns
        .map(l => l.split(','))
        // return replaced fields back to data (columns 7, 8 and 9)
        .map(col => {
            if (col.length > 9) {
                col[7] = returnField(col[7]);
                col[8] = returnField(col[8]);
                col[9] = returnField(col[9]);
            }
            return col;

            function returnField(f) {
                if (f) {
                    var repls = f.match(/<.*?>/g);
                    if (repls)
                        repls.forEach(m => {
                            var num = +m.split('-')[1].split('>')[0];
                            f = f.replace(m, matches[num]);
                        });
                }
                return f;
            }
        });
    return cleanData;
});
Result:
Data looks pretty clean. All rows produce the expected number of columns matching the header (last 2 rows shown):
...,
[ '19403',
'560e348d2adaffa66f72bfc9',
'done',
'276',
'2015-10-02T07:38:53.172Z',
'20151002',
'560e31f69cd6d5059668ee16',
'""560e336ef3214201030bf7b5""',
'a+�a��a+�a+�a��a+�a��a+�a��',
'',
'560e2e362adaffa66f72bd99',
'55f8f041b971644d7d861502',
'foo',
'foo',
'foo#bar.com',
'bar.com' ],
[ '20388',
'560ce1a467cf15ab2cf03482',
'update',
'231',
'2015-10-01T07:32:52.077Z',
'20151001',
'560ce1387494620118c1617a',
'""""""Final test, with a comma""""""',
'',
'',
'55e6dff9b45b14570417a908',
'55e6e00fb45b14570417a92f',
'foo',
'foo',
'foo#bar.com',
'bar.com' ],

Following on from my comment:
The data is too messed up to fix in one step, don't try.
Firstly, decide whether double quotes and/or commas might be part of the data. If they are not, remove the double quotes with a simple regex.
Next, there should be 14 commas on each line. Read the file as text and count the number of commas on each line in turn. Where there are fewer than 14, check the following line, and if the sum of the commas is 14, merge the two lines. If the sum is less than 14, check the next line and continue until you have 14 commas. If the next line takes you over 14 there is a serious error, so make a note of the line numbers - you will probably have to fix those by hand. Save the resulting file.
With luck, you will now have a file that can be processed as a CSV. If not, come back with the partially tidied file and we can try to help further.
It should go without saying that you should process a copy of the original, you are unlikely to get it right first time :)
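The counting-and-merging procedure described above might look something like this. It is only a sketch, assuming the stray quotes have already been removed; the expected comma count is a parameter (14 for this file), and rows that overshoot it are flagged for manual repair rather than guessed at:

```javascript
// Merge physical lines until each logical row has the expected number
// of commas; rows that go past the limit are reported, not repaired.
function mergeBrokenLines(text, expected = 14) {
  const lines = [];
  const suspect = []; // 1-based line numbers needing hand-fixing
  let buffer = '';
  let count = 0;

  text.split('\n').forEach((line, i) => {
    buffer = buffer ? buffer + ' ' + line : line;
    count += (line.match(/,/g) || []).length;
    if (count === expected) {
      lines.push(buffer); // complete logical row
      buffer = '';
      count = 0;
    } else if (count > expected) {
      suspect.push(i + 1); // serious error: note the line number
      buffer = '';
      count = 0;
    }
  });
  return { lines, suspect };
}
```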

Related

What's the most efficient way of getting the directory from file path without file name in JavaScript?

I want to get the directory from a file path without the file name in JavaScript, with the following input/output behaviour:
Input: '/some/path/to/file.txt'
Output: '/some/path/to'
Input: '/some/path/to/file'
Output: '/some/path/to/file'
Input: '/some/folder.with/dot/path/to/file.txt'
Output: '/some/folder.with/dot/path/to'
Input: '/some/file.txt/path/to/file.txt'
Output: '/some/file.txt/path/to'
I was thinking of doing this with a RegExp, but I'm not sure how the exact RegExp should be written.
Can someone help me with an EFFICIENT solution, whether it uses a RegExp or not?
Looking at your examples, it seems you want to treat everything except the last filename as the directory name, where the filename always contains a dot.
To get that part, you can use this code in Javascript:
str = str.replace(/\/\w+\.\w+$/, "");
Regex \/\w+\.\w+$ matches a / and 1+ word characters followed by a dot followed by another 1+ word characters before end of string. Replacement is just an empty string.
However, do keep in mind that some filenames may not contain any dot character and this replacement won't work in those cases.
You could use lastIndexOf to get the index and then use slice to get the desired result.
const strArr = [
    "/some/path/to/file.txt",
    "/some/path/to/file",
    "/some/folder.with/dot/path/to/file.txt",
    "/some/file.txt/path/to/file.txt",
];
const result = strArr.map((s) => {
    if (s.match(/\.txt$/)) {
        const index = s.lastIndexOf("/");
        return s.slice(0, index !== -1 ? index : s.length);
    } else return s;
});
console.log(result);
Using regex
const strArr = [
    "/some/path/to/file.txt",
    "/some/path/to/file",
    "/some/folder.with/dot/path/to/file.txt",
    "/some/file.txt/path/to/file.txt",
];
const result = strArr.map((s) => s.replace(/\/\w+\.\w+$/, ""));
console.log(result);

Unexpected Behavior When Escaping Backslashes JS

So I'm making a simple function that separates the file name from the directory path. I believe there is an easier way with Node's Path module, but I thought I'd do it myself for this project.
The problem is that when I write a backslash character in a string, I escape it, as in "directory\AnothaDirectory". The code runs, but the doubled "\" and the "\\" used for escaping still remain in the strings after they are parsed, e.g. "C:\\Documents\Newsletters".
I have tried single backslashes, which throws errors as one could expect, and I have also tried forward slashes. What could be the reason the backslashes are not being escaped?
function splitFileNameFromPath(path, slashType) {
    let pathArray = path.split(slashType),
        fileName = pathArray[pathArray.length - 1],
        elsIndexes = pathArray.length - 1,
        pathSegs = pathArray.slice(0, elsIndexes);
    let dirPath = pathSegs.join(slashType);
    // adds an extra slash after drive name and colon e.g. "C:\\"
    dirPath = dirPath.replace(new RegExp("/\\/", "ug"), "\\");
    // removes illegal last slash
    let pathSeg = pathSegs.slice(0, -1);
    return [dirPath, fileName];
}

let res = splitFileNameFromPath("C:\\\\Documents\\Newsletters\\Summer2018.pdf", "\\");
console.log(res);
There are a few things in this code I do not understand:
"C:\\\\Documents\\Newsletters\\Summer2018.pdf" (i.e. "C:\\Documents\Newsletters\Summer2018.pdf") does not look like a valid Windows path, as double slashes are not normally used after the drive letter (it is not like the URL 'https://...').
new RegExp("/\\/","ug") is equal to /\/\//gu and does not match anything.
The result of let pathSeg = pathSegs.slice(0,-1) is not used at all.
It seems to me this code is enough to achieve the task:
'use strict';

function splitFileNameFromPath(path, slashType) {
    const pathArray = path.split(slashType),
        fileName = pathArray.pop(),
        dirPath = pathArray.join(slashType);
    return [dirPath, fileName];
}

const path = "C:\\Documents\\Newsletters\\Summer2018.pdf";
const slash = "\\";
const res = splitFileNameFromPath(path, slash);
console.log(res);
console.log(path === res.join(slash));

How do you convert a text file to CSV file while keeping text file format using JS?

I am reading in a text file with some data that looks like this:
This is my file
showing some data
data1 = 12
data2 = 156
I want to convert this data into a CSV file while keeping the same format of the text file, like this:
This,is,my,file
showing,some,data
data1,=,12
data2,=,156
My first attempt was to read the text file into a string, then split that string into an array at every space character. However, that didn't seem to work.
I also attempted to split the string into an array at every newline character, but that didn't work either.
Can anyone guide me in the right direction? Or how should I go about doing this?
Thanks
You should be able to:
split on line breaks
split on white space
join with commas
join with new lines:
let s = `This is my file
showing some data
data1 = 12
data2 = 156`
let text = s.split('\n') // split lines
.map(line => line.split(/\s+/).join(',')) // split spaces then join with ,
.join('\n') // rejoin lines
console.log(text)
You could also just replace all non-linebreak whitespace with commas:
let s = `This is my file
showing some data
data1 = 12
data2 = 156`
console.log(s.replace(/[^\S\n]+/g, ','))
Try str.replace(/ /g, ',');
Code
str = "This is my file\n\
showing some data\n\
data = 12\n\
data2 = 156";
document.write((str.replace(/ /g, ',')).replace(/\n/g,"<br />"));

Regex delete certain lines

I have a CSV file in which some lines contain :. I need to completely remove those lines. This is what I've done so far:
var array = fs.readFileSync('../list/fusion.csv').toString();
var pattern = /^\:/gm;
var best = array.replace(pattern, '');
fs.writeFile('../list/full.csv', best, function (err) {
    if (err) return console.log(err);
});
I'm trying to strip the :. My pattern works in regex101, but when I run the code nothing happens.
You can also do it this way to remove the lines that have :. I've added a demo showing how to remove every whole line that contains the unwanted : character in your .csv file:
const regex = /^.*(:).*$/gm;
const str = `id,name,age
1,aaboss,11
2,eeboss,18
3,:ddboss,15
4,ccboss,14
:5,aboss,13
6,boss,12
7,boss,100:
8,boss,12
`;
const subst = ``;
// The substituted value will be contained in the result variable
// using replace again to remove the empty lines
const result = str.replace(regex, subst).replace(/(^[ \t]*\n)/gm, "");
console.log(result);
REGEX: https://regex101.com/r/JHeRyl/1
If you mean to remove lines containing : you should specify the m flag to allow ^ to match the beginning of every line rather than just the beginning of the string. Your pattern should also be made to match the entire line rather than just :.
Change:
var pattern = /^\:/g;
to:
var pattern = /^.*?:.*?$/gm;
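As a self-contained illustration of that corrected pattern (file handling omitted, \n line endings assumed): blanking out a matching line still leaves its newline behind, so a second replace drops the resulting empty lines:

```javascript
// Remove every line containing ":", then strip the blank lines left over.
function removeColonLines(text) {
  return text
    .replace(/^.*?:.*?$/gm, '')  // blank out any line containing ":"
    .replace(/^[ \t]*\n/gm, ''); // drop the now-empty lines
}
```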

Needing an alternative to eval() and better way for replacing string values

The API I'm working with responds with a base64-encoded Ruby hash (similar to a JSON object, but specific to Ruby) that was converted to a string before being base64 encoded.
In JavaScript, upon retrieving and decoding the string, I get a string in the same shape as the Ruby hash it originated from on the server:
// Decoded example String
"{:example=>'string',:another_example=>'string'}"
I am able to parse the Ruby string into a JSON object using string replacement and eval(), but I know eval() is evil. Also, there is no way to handle any other key/value pairs that may pop up in the future.
How should this be rewritten without eval and without hard-coded string replacements?
var storedToken = base64url.decode(window.localStorage.authtoken).replace(':example=>', 'example:').replace(':another_example=>', 'another_example:')
var parsedTokenString = JSON.stringify(eval('(' + storedToken + ')'))
var newJsonObject = JSON.parse(parsedTokenString)
Replace and then JSON.parse:
const storedToken = "{:example=>'string',:another_example=>'string'}";
const json = storedToken
.replace(/:(\w+)/g, '"$1"')
.replace(/=>/g, ':')
.replace(/'/g, '"');
const obj = JSON.parse(json)
console.log(obj);
You will probably want to tighten this up to avoid things breaking when the string values contain things like :foo or escaped single quotes.
However, as mentioned in other answers and comments, you should really change the server to return JSON, which is easy enough with Ruby's to_json.
So you have a string like:
"{:example=>'string',:another_example=>'string'}"
that you'd like to convert to an object like (using JSON):
'{":example":"string", ":another_example":"string"}'
It's unclear to me whether the colon before :example is part of the property name or a token indicating a property name; I've assumed it's part of the name (but that's easy to modify).
A regular expression might be used, however if the tokens are:
{ start of notation
=> property name, value separator
, property/value pair separator
} end of notation
Then a simple parser/formatter might be something like:
function stringToJSON(s) {
    var resultText = '';
    var tokens = {
        '{' : '{', // token: replacement
        '=>': ':',
        ',' : ',',
        '}' : '}'
    };
    var multiTokens = {
        '=': '=>' // token start: full token
    };
    var buff = '';
    // Process each character
    for (var i = 0, iLen = s.length; i < iLen; i++) {
        // Collect characters and see if they match a token
        buff = s[i];
        // Deal with possible multi-character tokens
        if (buff in multiTokens) {
            // Check with the next character and add to buff if that creates a token
            // Also increment i as the next character is consumed
            if ((buff + s[i + 1]) in tokens) {
                buff += s[++i];
            }
        }
        // Now check in tokens
        if (buff in tokens) {
            // Tokens are always surrounded by ", except for the first { and last },
            // but deal with those at the end
            resultText += '"' + tokens[buff] + '"';
        // Otherwise, deal with single characters
        } else {
            // Single quotes not allowed
            if (buff == "'") buff = '';
            // Add buff to result
            resultText += buff;
        }
    }
    // Remove leading and trailing "
    return resultText.replace(/^\"|\"$/g, '');
}

var s = "{:example=>'string',:another_example=>'string'}";
console.log(stringToJSON(s));
// Convert to object
console.log(JSON.parse(stringToJSON(s)));
Your string notation might be more complex, but I think you get the gist of it. You may need to trim whitespace around tokens, but since the property names aren't surrounded by quotes it's hard to know what to keep and what to throw away. You could include tokens in the data by quoting, then throwing away the quote character, e.g.:
\=>
could be treated as the literal "=>", not a token.
More tokens and processing steps can be added fairly easily. The multi-character tokens can get interesting, especially if you go to 3 or 4 characters.
Array methods and regular expressions can be used for token matching and processing, however I think loops are a good place to start to get the logic sorted, sugar can be added later.
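For instance, here is a hypothetical regex-driven variant of the same token replacement, with the same caveat that string values containing these tokens will break it:

```javascript
// One-pass token substitution: strip single quotes, then wrap each
// structural token ({, }, comma, =>) in double quotes via an alternation.
function stringToJSONRegex(s) {
  const tokens = { '=>': '":"', ',': '","', '{': '{"', '}': '"}' };
  return s
    .replace(/'/g, '')                      // single quotes not allowed
    .replace(/=>|[{},]/g, m => tokens[m]);  // quote-wrap each token
}
```

On the example string this produces `{":example":"string",":another_example":"string"}`, which JSON.parse accepts.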
