Parsing multiple large JSON files with node to mongoDB - javascript

I am parsing multiple large JSON files into my mongoDB database. At the moment I am using the stream-json npm package. After I load one file I change the filename that I am loading and relaunch the script to load the next file. This is unnecessarily time-consuming. So how can I iterate through all the files automatically? At the moment my code looks like this:
const StreamArray = require('stream-json/utils/StreamArray');
const path = require('path');
const fs = require('fs');
const filename = path.join(__dirname, './data/xa0.json'); //The next file is named xa1.json and so on.
const stream = StreamArray.make();
stream.output.on('data', function (object) {
// my function block
});
stream.output.on('end', function () {
console.log('File Complete');
});
fs.createReadStream(filename).pipe(stream.input);
I tried iterating through the file names by adding a loop that increments the filename (i.e. xa0 to xa1) at the point where the script logs 'File Complete', but this didn't work. Any ideas how I might achieve this or something similar?

Just scan your JSON files directory using fs.readdir. It will return a list of file names that you can then iterate, something like this:
fs.readdir("./jsonfiles", async (err, files) => {
    for (const file of files) {
        await saveToMongo("./jsonfiles/" + file);
    }
});
So you just launch your script once and wait until full completion.
Of course, in order for it to be awaited, you need to promisify the saveToMongo function, something like:
const saveToMongo = fileName => {
return new Promise( (resolve, reject) => {
// ... logic here
stream.output.on('end', function () {
console.log('File Complete');
resolve() // Will trigger the next await
});
})
}
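Putting the two pieces together, here is a minimal sketch of what saveToMongo could look like with the stream-json setup from the question; the per-object MongoDB insert and the error handling are assumptions, not taken from the original code:
const StreamArray = require('stream-json/utils/StreamArray');
const fs = require('fs');

// Hypothetical helper: parses one JSON file and resolves only once that file
// has been fully processed, so the readdir loop can await it.
const saveToMongo = fileName => {
    return new Promise((resolve, reject) => {
        const stream = StreamArray.make();
        stream.output.on('data', function (object) {
            // your existing per-object handling / MongoDB insert goes here
        });
        stream.output.on('end', function () {
            console.log('File Complete:', fileName);
            resolve();
        });
        fs.createReadStream(fileName)
            .on('error', reject)
            .pipe(stream.input);
    });
};
With that in place, the fs.readdir loop above only needs to be launched once and will work through xa0.json, xa1.json, and so on in sequence.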

Related

folders and files are not visible after uploading files through multer

I am working on a small project. Discussing it step by step:
At first I am uploading zip files through multer,
then extracting those files (how can I call the extract function after the upload completes with multer?),
after extracting I am trying to filter those files,
and after filtering I want to move some of them to another directory.
In my main index.js I have:
A simple route to upload files, which is working:
// MAIN API ENDPOINT
app.post("/api/zip-upload", upload, async (req, res, next) => {
console.log("FIles - ", req.files);
});
Continuously checking whether there is any zip file that needs to be unzipped, but the problem is that after uploading it doesn't show any files or directories:
// UNZIP FILES
const dir = `${__dirname}/uploads`;
const files = fs.readdirSync("./uploads");
const filesUnzip = async () => {
try {
if (fs.existsSync(dir)) {
console.log("files - ", files);
for (const file of files) {
console.log("file - ", file);
try {
const extracted = await extract("./uploads/" + file, { dir: __dirname + "/uploads/" });
console.log("Extracted - ",extracted);
// const directories = await fs.statSync(dir + '/' + file).isDirectory();
} catch (bufErr) {
// console.log("------------");
console.log(bufErr.syscall);
}
};
// const directories = await files.filter(function (file) { return fs.statSync(dir + '/' + file).isDirectory(); });
// console.log(directories);
}
} catch (err) {
console.log(err);
}
return;
}
setInterval(() => {
filesUnzip();
}, 2000);
Moving files to the static directory, but here the same problem occurs: no directory is found.
const getAllDirs = async () => {
// console.log(fs.existsSync(dir));
// FIND ALL DIRECTORIES
if (fs.existsSync(dir)) {
const directories = await files.filter(function (file) { return fs.statSync(dir + '/' + file).isDirectory(); });
console.log("Directories - ",directories);
if (directories.length > 0) {
for (let d of directories) {
const subdirFiles = fs.readdirSync("./uploads/" + d);
for (let s of subdirFiles) {
if (s.toString().match(/\.xml$/gm) || s.toString().match(/\.xml$/gm) !== null) {
console.log("-- ", d + "/" + s);
const move = await fs.rename("uploads/" + d + "/" + s, __dirname + "/static/" + s, (err) => { console.log(err) });
console.log("Move - ", move);
}
}
}
}
}
}
setInterval(getAllDirs, 3000);
There are so many issues with your code, I don't know where to begin:
Why are you using fs.xxxSync() methods if all your functions are async? Using xxxSync() methods is highly discouraged because it blocks the server (i.e. parallel requests can't/won't be accepted while a sync read is in progress). The fs module supports a promise API ...
Your "Continuous checking" for new files is always checking the same (probably empty) files array because it seems you are executing files = fs.readdirSync("./uploads"); only once (probably at server start, but I can't tell for sure because there isn't any context for that snippet)
You shouldn't be polling that "uploads" directory. Because as writing a file (if done properly) is an asynchronous process, you may end up reading incomplete files. Instead you should trigger the unzipping from your endpoint handler. Once it is hit, body.files contains the files that have been uploaded. So you can simply use this array to start any further processing instead of frequently polling a directory.
At some points you are using the callback version of the fs API (for instance fs.rename()). You cannot await a function that expects a callback. Again, use the promise API of fs.
EDIT
So I'm trying to address your issues. Maybe I can't solve all of them because of missing information, but you should get the general idea.
First of all, you should use the promise API of the fs module. And for path manipulation, you should use the available path module, which will take care of some OS-specific issues.
const fs = require('fs').promises;
const path = require('path');
Your API endpoint isn't currently returning anything. I suppose you stripped away some code, but still. Furthermore, you should trigger your file handling from here, so you don't have to do directory polling, which is
error prone,
a waste of resources, and
blocking the server if done synchronously (as you currently do).
app.post("/api/zip-upload", upload, async (req, res, next) => {
console.log("FIles - ", req.files);
//if you want to return the result only after the files have been
//processed use await
await handleFiles(req.files);
//if you want to return to the client immediately and process files
//skip the await
//handleFiles(req.files);
res.sendStatus(200);
});
Handling the files seems to consist of two different steps:
unzipping the uploaded zip files
copying some of the extracted files into another directory
const source = path.join(".", "uploads");
const target = path.join(__dirname, "uploads");
const statics = path.join(__dirname, "statics");
const handleFiles = async (files) => {
//a random folder, which will be unique for this upload request
const tmpfolder = path.join(target, `tmp_${new Date().getTime()}`);
//create this folder
await fs.mkdir(tmpfolder, {recursive: true});
//extract all uploaded files to the folder
//this will wait for a list of promises and resolve once all of them resolved,
await Promise.all(files.map(f => extract(path.join(source, f), { dir: tmpfolder })));
await copyFiles(tmpfolder);
//you probably should delete the uploaded zipfiles and the tempfolder
//after they have been handled
await Promise.all(files.map(f => fs.unlink(path.join(source, f))));
await fs.rmdir(tmpfolder, { recursive: true});
}
const copyFiles = async (tmpfolder) => {
//get all files/directory names in the tmpfolder
const allfiles = await fs.readdir(tmpfolder);
//get their stats
const stats = await Promise.all(allfiles.map(f => fs.stat(path.join(tmpfolder, f))));
//filter directories only
const dirs = allfiles.filter((_, i) => stats[i].isDirectory());
for (let d of dirs) {
//read all filenames in the subdirectory
const files = await fs.readdir(path.join(tmpfolder, d));
//filter by extension .xml
const xml = files.filter(x => path.extname(x) === ".xml");
//move all xml files
await Promise.all(xml.map(f => fs.rename(path.join(tmpfolder, d, f), path.join(statics, f))));
}
}
That should do the trick. Of course you may notice there is no error handling with this code. You should add that.
And I'm not 100% sure about your paths. You should consider the following
./uploads refers to a directory uploads in the current working directory (whereever that may be)
${__dirname}/uploads refers to a directory uploads which is in the same directory as the script file currently executing. Not sure if that is the directory you want ...
./uploads and ${__dirname}/uploads may point to the same folder or to completely different folders. No way knowing that without additional context.
Furthermore in your code you extract the ZIP files from ./uploads to ${__dirname}/uploads and then later try to copy XML files from ./uploads/xxx to ${__dirname}/statics, but there won't be any directory xxx in ./uploads because you extracted the ZIP file to a (probably) completely different folder.
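A quick way to see what the two notations actually resolve to on a given machine is a throwaway snippet like the following (the paths are just for illustration):
const path = require('path');

// "./uploads" is resolved against the current working directory,
// while __dirname is the directory of the currently executing script file
console.log('cwd-relative   :', path.resolve('./uploads'));
console.log('script-relative:', path.join(__dirname, 'uploads'));
// Start the script from a different directory (e.g. "node src/index.js" from
// the project root) and the two lines will print different folders.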

node.js stream pipeline separately over all files in folder

I wrote a pipeline that reads an input stream from a file, transforms it and streams it out to an output file. The pipeline works well for one file. Now I want to iterate through all files in a folder, producing a separate output file for each input file.
I am completely new to JS and synchronous streams. Did my best to figure it out. However, given n input files, what I currently get is n copies processed from the last input file.
const fs = require('fs');
const pipeline = require("stream");
async function processFolder(inPath, outPath) {
const dir = await fs.promises.opendir(inPath)
for await (const dirent of dir) {
let inFile = inPath.concat(dirent.name);
let outFile = outPath.concat(dirent.name.slice(0,-4), ".json");
console.log(inFile);
await pipeline(
fs.createReadStream(inFile).setEncoding('utf8'),
transform,
fs.createWriteStream(outFile),
(err) => {
if (err) {
console.error('Pipeline failed', err);
} else {
console.log('Pipeline succeeded');
}
}
);
}
}
processFolder(inPath, outPath);
How can I change the code above so that it processes all files in the directory?
Thank you!
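For reference, a minimal sketch of one way the loop could be restructured, assuming Node's promise-based pipeline from stream/promises (Node 15+). makeTransform() is a hypothetical factory standing in for however the transform is constructed; reusing a single transform instance across files would explain getting n copies of the last file:
const fs = require('fs');
const path = require('path');
const { pipeline } = require('stream/promises');

async function processFolder(inPath, outPath) {
    const dir = await fs.promises.opendir(inPath);
    for await (const dirent of dir) {
        const inFile = path.join(inPath, dirent.name);
        const outFile = path.join(outPath, path.parse(dirent.name).name + '.json');
        console.log(inFile);
        await pipeline(
            fs.createReadStream(inFile, { encoding: 'utf8' }),
            makeTransform(), // hypothetical: build a fresh transform stream per file
            fs.createWriteStream(outFile)
        );
        console.log('Pipeline succeeded:', outFile);
    }
}

processFolder(inPath, outPath).catch(err => console.error('Pipeline failed', err));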

Node.js async/await functions don't wait for each other

I have a project with functions that read files and extract their hash codes. After these hash codes are extracted, subfiles are built one by one. Finally, what I want to do is collect all these hash codes into an array and create a JSON file. I need to do this after the iterateFolders() function has run and finished inside the readDirectory function. But the console.log on the line below runs without waiting for that function. Please help.
My functions are as follows:
//Calculate build time
function getBuildTime(start,end) {
let time = (end - start);
let buildTime = `${new Date().toLocaleDateString()} ${new Date().toLocaleTimeString()} Build time: ${time} ms \n`
fs.writeFile('build-time.log', buildTime,function (err) { //output log file
if (err) return console.log(err);
});
}
//async metaHash calculation from folder path
async function computeMetaHash(folder, inputHash = null) {
const hash = inputHash ? inputHash : createHash('sha256');
const info = await fsp.readdir(folder, { withFileTypes: true });
//construct a string from the modification date, the filename and the filesize
for (let item of info) {
const fullPath = path.join(folder, item.name)
if (item.isFile()) {
const statInfo = await fsp.stat(fullPath); //stat return all informations about file
// compute hash string name:size:mtime
const fileInfo = `${fullPath}:${statInfo.size}:${statInfo.mtimeMs}`;
hash.update(fileInfo);
} else if (item.isDirectory()) {
// recursively walk sub-folders
await computeMetaHash(fullPath, hash);
}
}
// if not being called recursively, get the digest and return it as the hash result
if (!inputHash) {
return hash.digest('base64');
}
}
async function iterateFolders(folderPath) {
folderPath.forEach(function (files) {
//function takes folder path as inputh
computeMetaHash(files).then(result => { //call create hash function
console.log({"path":files,"hashCode":result});
}).then(()=>{ //build fragments
//The files is array, so each. files is the folder name. can handle the folder.
console.log("%s build...", files);
execSync(`cd ${files} && npm run build`, { encoding: 'utf-8' });
}).then(()=>{// Finish timing
end = new Date().getTime();
getBuildTime(start,end);
}).catch(err => {
console.log(err);
});
});
}
async function readDirectory() {
let files = await readdir(p)
const folderPath = files.map(function (file) {
//return file or folder path
return path.join(p, file);
}).filter(function (file) {
//use sync judge method. The file will add next files array if the file is directory, or not.
return fs.statSync(file).isDirectory();
})
//check hash.json exist or not
if (fs.existsSync(hashFile)) {
// path exists
console.log("File exists: ", hashFile);
}
else
{
//This is the first pipeline, all fragments will build then hash.json will created.
console.log(hashFile," does NOT exist, build will start and hash.json will created:");
// Start timing
start = new Date().getTime();
iterateFolders(folderPath,files);
console.log("IT WILL BE LAST ONE ")
}
}
readDirectory();
Well, if you want to wait for its execution, then you have to use await :) Currently it's just iterateFolders(folderPath, files); you run it, but you don't wait for it.
await iterateFolders(folderPath,files);
That's your first issue. Then this method runs a loop and calls some other methods. But first, async/await needs a promise to be returned (which you do not do). And second, it doesn't work in forEach, as stated in the comments above. Read 'Using async/await with a forEach loop' for more details.
Fix those three issues and you'll make it.
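To illustrate the forEach point, here is a tiny made-up example: the callbacks return promises, but forEach ignores them, so execution falls straight through:
const wait = ms => new Promise(resolve => setTimeout(resolve, ms));

[1, 2, 3].forEach(async n => {
    await wait(100);
    console.log('item', n); // runs later
});
console.log('done'); // prints first: nothing waited for the callbacks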
In the iterateFolders function, you need to await computeMetaHash calls. To do so you can either use a for loop instead of calling forEach on folderPath or change forEach to map and use Promise.all.
Using the for loop method (sequential):
async function iterateFolders(folderPath) {
for (let files of folderPath) {
//function takes folder path as inputh
await computeMetaHash(files).then(result => { //call create hash function
console.log({"path":files,"hashCode":result});
}).then(()=>{ //build fragments
//The files is array, so each. files is the folder name. can handle the folder.
console.log("%s build...", files);
execSync(`cd ${files} && npm run build`, { encoding: 'utf-8' });
}).then(()=>{// Finish timing
end = new Date().getTime();
getBuildTime(start,end);
}).catch(err => {
console.log(err);
});
}
}
Using the Promise.all method (concurrent):
async function iterateFolders(folderPath) {
return Promise.all(folderPath.map(function (files) {
//function takes folder path as inputh
return computeMetaHash(files).then(result => { //call create hash function
console.log({"path":files,"hashCode":result});
}).then(()=>{ //build fragments
//The files is array, so each. files is the folder name. can handle the folder.
console.log("%s build...", files);
execSync(`cd ${files} && npm run build`, { encoding: 'utf-8' });
}).then(()=>{// Finish timing
end = new Date().getTime();
getBuildTime(start,end);
}).catch(err => {
console.log(err);
});
}));
}
If you prefer, using async/await also allows you to get rid of the then and catch in both methods which I believe makes it a little easier to read and understand.
Here's an example using the Promise.all method:
async function iterateFolders(folderPath) {
return Promise.all(folderPath.map(async (files) => {
try {
const result = await computeMetaHash(files);
console.log({ path: files, hashCode: result });
// build fragments
//The files is array, so each. files is the folder name. can handle the folder.
console.log('%s build...', files);
execSync(`cd ${files} && npm run build`, { encoding: 'utf-8' });
// Finish timing
const end = Date.now();
getBuildTime(start, end);
} catch(err) {
console.log(err);
}
}));
}
You might also want to check out for await... of
Note: you also need to await iterateFolders when it's called in readDirectory.
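As a tiny illustration of for await...of (made-up values): it awaits each item of the iterable before moving on, which reads nicely when the items are promises:
async function demo() {
    const tasks = [Promise.resolve('a'), Promise.resolve('b'), Promise.resolve('c')];
    for await (const value of tasks) {
        console.log(value); // logs 'a', 'b', 'c' in order
    }
}
demo();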

Why does my async not work well with the third call but perfectly with the first two

So I have a mind boggler: my app grabs a git repo (simple-git), then runs npm i on the files inside (shelljs), and then zips them using archiver. Of course it needs to be async; the first and second parts work, but it fails at the archiving step (the next step is for axios to await the zip being done). Also, before this, when I ran the zip code together with the repo-grabbing code it would create the zip file in the correct root directory, but now it does so in the folder directory instead (repo/folder), and the zip is now empty instead of containing the other contents. Please assist if you can.
The code:
// // //Make call
function makeCall() {
return new Promise((resolve) => {
resolve(
git()
.silent(true)
.clone(remote, ["-b", branch])
.then(() => console.log("finished"))
.catch((err) => console.error("failed: ", err))
);
});
}
//Make NPM Modules
async function makeModules() {
await makeCall();
const pathOfFolder = path.join(__dirname, "folder/sub");
shell.cd(pathOfFolder)
return new Promise((resolve) => {
resolve(
shell.exec("npm i"));
});
}
async function makeZip() {
await makeModules();
const output = fs.createWriteStream(`${branch}.zip`);
const archive = archiver("zip");
output.on("close", function () {
console.log(archive.pointer() + " total bytes");
console.log(
"archiver has been finalized and the output file descriptor has closed."
);
});
archive.pipe(output);
// append files from a sub-directory, putting its contents at the root of archive
archive.directory("folder", false);
// append files from a sub-directory and naming it `new-subdir` within the archive
archive.directory("subdir/", "new-subdir");
archive.finalize();
}
makeZip();
Resolved it, moved to other files and set path correctly
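Independently of the path fix, if a later step (the axios upload) has to wait for the zip to be fully written, one option is to resolve makeZip only once the output stream has closed. A rough sketch using the same archiver API as above (the error handler is an assumption, not taken from the original code):
async function makeZip() {
    await makeModules();
    return new Promise((resolve, reject) => {
        const output = fs.createWriteStream(`${branch}.zip`);
        const archive = archiver("zip");
        output.on("close", () => {
            console.log(archive.pointer() + " total bytes");
            resolve(); // the zip is fully written at this point
        });
        archive.on("error", reject);
        archive.pipe(output);
        archive.directory("folder", false);
        archive.directory("subdir/", "new-subdir");
        archive.finalize();
    });
}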

Read a text file using Node.js?

I need to pass in a text file in the terminal and then read the data from it, how can I do this?
node server.js file.txt
How do I pass in the path from the terminal, how do I read that on the other side?
You'll want to use the process.argv array to access the command-line arguments to get the filename and the FileSystem module (fs) to read the file. For example:
// Make sure we got a filename on the command line.
if (process.argv.length < 3) {
console.log('Usage: node ' + process.argv[1] + ' FILENAME');
process.exit(1);
}
// Read the file and print its contents.
var fs = require('fs')
, filename = process.argv[2];
fs.readFile(filename, 'utf8', function(err, data) {
if (err) throw err;
console.log('OK: ' + filename);
console.log(data)
});
To break that down a little for you: process.argv will usually have length two, the zeroth item being the "node" interpreter and the first being the script that node is currently running; items after that were passed on the command line. Once you've pulled a filename from argv you can use the filesystem functions to read the file and do whatever you want with its contents. Sample usage would look like this:
$ node ./cat.js file.txt
OK: file.txt
This is file.txt!
[Edit] As #wtfcoder mentions, using the "fs.readFile()" method might not be the best idea because it will buffer the entire contents of the file before yielding it to the callback function. This buffering could potentially use lots of memory but, more importantly, it does not take advantage of one of the core features of node.js - asynchronous, evented I/O.
The "node" way to process a large file (or any file, really) would be to use fs.read() and process each available chunk as it is available from the operating system. However, reading the file as such requires you to do your own (possibly) incremental parsing/processing of the file and some amount of buffering might be inevitable.
Using fs with node.
var fs = require('fs');
try {
var data = fs.readFileSync('file.txt', 'utf8');
console.log(data.toString());
} catch(e) {
console.log('Error:', e.stack);
}
IMHO, fs.readFile() should be avoided because it loads ALL the file in memory and it won't call the callback until all the file has been read.
The easiest way to read a text file is to read it line by line. I recommend a BufferedReader (presumably the third-party buffered-reader npm package):
new BufferedReader ("file", { encoding: "utf8" })
.on ("error", function (error){
console.log ("error: " + error);
})
.on ("line", function (line){
console.log ("line: " + line);
})
.on ("end", function (){
console.log ("EOF");
})
.read ();
For complex data structures like .properties or json files you need to use a parser (internally it should also use a buffered reader).
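For JSON specifically, the parser step is JSON.parse; for files small enough that the caveat above doesn't apply, a minimal sketch looks like this (the filename is made up):
var fs = require('fs');

fs.readFile('config.json', 'utf8', function (err, text) {
    if (err) throw err;
    var data = JSON.parse(text); // the parser turns the raw text into an object
    console.log(data);
});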
You can use a read stream and pipe to read the file line by line, without reading the whole file into memory at once.
var fs = require('fs'),
es = require('event-stream'),
os = require('os');
var s = fs.createReadStream(path)
.pipe(es.split())
.pipe(es.mapSync(function(line) {
//pause the readstream
s.pause();
console.log("line:", line);
s.resume();
})
.on('error', function(err) {
console.log('Error:', err);
})
.on('end', function() {
console.log('Finish reading.');
})
);
I am posting a complete example which I finally got working. Here I am reading in a file rooms/rooms.txt from a script rooms/rooms.js
var fs = require('fs');
var path = require('path');
var readStream = fs.createReadStream(path.join(__dirname, '../rooms') + '/rooms.txt', 'utf8');
let data = ''
readStream.on('data', function(chunk) {
data += chunk;
}).on('end', function() {
console.log(data);
});
The async way of life:
#! /usr/bin/node
const fs = require('fs');
function readall (stream)
{
return new Promise ((resolve, reject) => {
const chunks = [];
stream.on ('error', (error) => reject (error));
stream.on ('data', (chunk) => chunk && chunks.push (chunk));
stream.on ('end', () => resolve (Buffer.concat (chunks)));
});
}
function readfile (filename)
{
return readall (fs.createReadStream (filename));
}
(async () => {
let content = await readfile ('/etc/ssh/moduli').catch ((e) => {})
if (content)
console.log ("size:", content.length,
"head:", content.slice (0, 46).toString ());
})();
