I am new to JavaScript and have written some Node.js code that calculates the checksum of files in S3 by streaming them through the crypto module. It works fine when the objects are small (1-5 GB), but larger files time out because not all of the stream data has been consumed before the Lambda timeout hits, so the end event is never reached. Is there a way to tune this code so that it can handle big files in the 30 GB range? I also noticed that my Lambda barely uses its memory, only about 10% (148 MB of the 1530 MB allocated) per run; can I do anything there? Any help is appreciated, thanks!
var AWS = require('aws-sdk');
const crypto = require('crypto');
const fs = require('fs');
const s3 = new AWS.S3();

let s3params = {
  Bucket: 'nlm-qa-int-draps-bucket',
  //Key: filename.toString(),
  Key: '7801339A.mkv',
};

let hash = crypto.createHash('md5');
let stream = s3.getObject(s3params).createReadStream();

stream.on('data', (data) => {
  hash.update(data);
});

stream.on('end', () => {
  var digest = hash.digest('hex');
  console.log("this is md5 value from digest: " + digest);
  callback(null, digest);
  digest = digest.toString().replace(/[^A-Za-z 0-9 \.,\?""!##\$%\^&\*\(\)-_=\+;:<>\/\\\|\}\{\[\]`~]*/g, '');
});
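Worth noting: the stream above has no 'error' handler, so any read failure never reaches the Lambda callback and the invocation simply waits out the timeout. A minimal sketch of such a handler, assuming the same stream and callback as above:
stream.on('error', (err) => {
  // A failed or stalled read otherwise produces no callback at all,
  // leaving the invocation to sit until the Lambda timeout.
  console.error('S3 read stream failed:', err);
  callback(err);
});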
I have a client attempting to send images to a server over BLE.
Client Code
//BoilerPlate to setup connection and whatnot
sendFile.onclick = async () => {
  var fileList = document.getElementById("myFile").files;
  var fileReader = new FileReader();
  if (fileReader && fileList && fileList.length) {
    fileReader.readAsArrayBuffer(fileList[0]);
    fileReader.onload = function () {
      var imageData = fileReader.result;
      //Server doesn't get data if I don't do this chunking
      imageData = imageData.slice(0, 512);
      const base64String = _arrayBufferToBase64(imageData);
      document.getElementById("ItemPreview").src = "data:image/jpeg;base64," + base64String;
      sendCharacteristic.writeValue(imageData);
    };
  }
};
Server Code
MyCharacteristic.prototype.onWriteRequest = function(data, offset, withoutResponse, callback) {
  //It seems this will not print out if the client sends over 512B.
  console.log(this._value);
};
My goal is to send small images (just ~6 KB)... These are still small enough that I'd prefer to use BLE over a BT serial connection. Is the only way to do this to chunk the data and then stream the chunks over?
Current 'Chunking' Code
const MAX_LENGTH = 512;
for (let i = 0; i < bytes.byteLength; i += MAX_LENGTH) {
  const end = (i + MAX_LENGTH > bytes.byteLength) ? bytes.byteLength : i + MAX_LENGTH;
  const chunk = bytes.slice(i, end);
  sendCharacteristic.writeValue(chunk);
  await sleep(1000);
}
The above code works; however, it sleeps between sends. I'd rather not do this, because there's no guarantee a previous packet has finished sending and I could end up sleeping longer than needed.
I'm also perplexed on how the server code would then know the client has finished sending all bytes and can then assemble them. Is there some kind of pattern to achieving this?
BLE characteristic values can be at most 512 bytes, so yes, the common way to send larger data is to split it into multiple chunks. Use "Write Without Response" for best performance (each chunk must fit within MTU - 3 bytes).
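As a rough sketch, assuming the Web Bluetooth API on the client side and a characteristic that supports write without response, the fixed sleep can be replaced by awaiting each write; the 4-byte length header is a hypothetical framing choice so the receiving side knows when the transfer is complete:
const MAX_CHUNK = 512; // must not exceed the negotiated ATT_MTU - 3 for write-without-response

async function sendImage(characteristic, bytes) {
  // Hypothetical framing: send the total length first so the peripheral
  // can tell when it has reassembled the whole image.
  const header = new Uint8Array(4);
  new DataView(header.buffer).setUint32(0, bytes.byteLength);
  await characteristic.writeValueWithResponse(header);

  for (let i = 0; i < bytes.byteLength; i += MAX_CHUNK) {
    const chunk = bytes.slice(i, i + MAX_CHUNK);
    // Awaiting each write replaces the fixed sleep(); the promise resolves
    // once the browser's Bluetooth stack has accepted the chunk.
    await characteristic.writeValueWithoutResponse(chunk);
  }
}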
I'm using Azure File Storage, and I'm using Express.js to write a backend that renders the contents stored in the file share.
I am writing the code based on https://learn.microsoft.com/en-us/javascript/api/@azure/storage-file-share/shareserviceclient?view=azure-node-latest
const { ShareServiceClient, StorageSharedKeyCredential } = require("@azure/storage-file-share");

const account = "<account>";
const accountKey = "<accountkey>";

const credential = new StorageSharedKeyCredential(account, accountKey);
const serviceClient = new ShareServiceClient(
  `https://${account}.file.core.windows.net`,
  credential
);
const shareName = "<share name>";
const fileName = "<file name>";
// [Node.js only] A helper method used to read a Node.js readable stream into a Buffer
async function streamToBuffer(readableStream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    readableStream.on("data", (data) => {
      chunks.push(data instanceof Buffer ? data : Buffer.from(data));
    });
    readableStream.on("end", () => {
      resolve(Buffer.concat(chunks));
    });
    readableStream.on("error", reject);
  });
}
And you can view the contents through
const downloadFileResponse = await fileClient.download();
const output = (await streamToBuffer(downloadFileResponse.readableStreamBody)).toString();
Thing is, I only want to find if the file exists and not spend time downloading the entire file, how could I do this?
I looked at https://learn.microsoft.com/en-us/javascript/api/@azure/storage-file-share/shareserviceclient?view=azure-node-latest
to see if the file client class has what I want, but it doesn't seem to have methods useful for this.
If you are using the @azure/storage-file-share (version 12.x) Node package, there's an exists method on ShareFileClient. You can use that to find out whether a file exists. Something like:
const fileExists = await fileClient.exists(); // returns true or false
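For context, a minimal sketch of how that fileClient could be obtained from the serviceClient in the question (shareName and fileName as defined earlier; the client construction here is an assumption, not part of the original answer), so the check never downloads the file:
const fileClient = serviceClient
  .getShareClient(shareName)
  .rootDirectoryClient
  .getFileClient(fileName);

if (await fileClient.exists()) {
  console.log(`${fileName} exists`);
} else {
  console.log(`${fileName} does not exist`);
}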
I am struggling to run an async function taken from a Google example alongside the Environment Variables on Windows 10. I have created a bucket at GCS and uploaded my .raw file.
I then created a .env file which contains the following
HOST=localhost
PORT=3000
GOOGLE_APPLICATION_CREDENTIALS=GDeveloperKey.json
Doing this in AWS Lambda is just a case of wrapping the code within exports.handler = async (event, context, callback) => {
How can I emulate the same locally in Windows 10?
// Imports the Google Cloud client library
const speech = require('@google-cloud/speech');
// Creates a client
const client = new speech.SpeechClient();
/**
* TODO(developer): Uncomment the following lines before running the sample.
*/
// const gcsUri = 'gs://my-bucket/audio.raw';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';
const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
};

const audio = {
  uri: gcsUri,
};

const request = {
  config: config,
  audio: audio,
};
// Detects speech in the audio file. This creates a recognition job that you
// can wait for now, or get its result later.
const [operation] = await client.longRunningRecognize(request);
// Get a Promise representation of the final result of the job
const [response] = await operation.promise();
const transcription = response.results
.map(result => result.alternatives[0].transcript)
Wrap your await statements into an immediately-invoked async function.
Ex:
(async () => {
  // Detects speech in the audio file. This creates a recognition job that you
  // can wait for now, or get its result later.
  const [operation] = await client.longRunningRecognize(request);
  // Get a Promise representation of the final result of the job
  const [response] = await operation.promise();
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
})();
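A slightly fuller variant of the same wrapper (same client and request as above), with a try/catch so rejections are reported; the .join('\n') is an assumed completion of the truncated transcription line from the sample:
(async () => {
  try {
    const [operation] = await client.longRunningRecognize(request);
    const [response] = await operation.promise();
    const transcription = response.results
      .map(result => result.alternatives[0].transcript)
      .join('\n'); // assumed completion of the truncated line above
    console.log(`Transcription: ${transcription}`);
  } catch (err) {
    // Without this, a rejected promise inside the IIFE would go unhandled.
    console.error(err);
  }
})();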
I need to generate some files of varying sizes to test an application, and I thought the easiest way would be to write a Node script.
I wrote the following code but the process crashes when the file size exceeds my memory.
const fs = require("fs");
const crypto = require('crypto');
const gb = 1024 * 1024 * 1024;
const data = crypto.randomBytes(gb * 5);
fs.writeFile('bytes.bin', data, (err) => {
  if (err) throw err;
  console.log('The file has been saved!');
});
On top of your memory issue, you will also have a problem with the crypto module: the number of bytes it can generate in a single call is limited.
You will need to use fs.createWriteStream to generate and write the data in chunks rather than producing it all in one go.
Here is a modified version of some code from the Node documentation on streams to stream chunks of random bytes to a file:
const fs = require("fs");
const crypto = require('crypto');

const fileName = "random-bytes.bin";
const fileSizeInBytes = Number.parseInt(process.argv[2]) || 1000;

console.log(`Writing ${fileSizeInBytes} bytes`);

const writer = fs.createWriteStream(fileName);

writetoStream(fileSizeInBytes, () => console.log(`File created: ${fileName}`));

function writetoStream(bytesToWrite, callback) {
  const step = 1000;
  let i = bytesToWrite;
  write();

  function write() {
    let ok = true;
    do {
      const chunkSize = i > step ? step : i;
      const buffer = crypto.randomBytes(chunkSize);
      i -= chunkSize;
      if (i === 0) {
        // Last time!
        writer.write(buffer, callback);
      } else {
        // See if we should continue, or wait.
        // Don't pass the callback, because we're not done yet.
        ok = writer.write(buffer);
      }
    } while (i > 0 && ok);

    if (i > 0) {
      // Had to stop early!
      // Write some more once it drains.
      writer.once('drain', write);
    }
  }
}
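For example, saving this as generate-file.js (a name used here only for illustration) and running node generate-file.js 31000000000 should stream roughly 31 GB of random bytes into random-bytes.bin without ever holding it all in memory.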
There are also online tools which let you generate files of your required size with less setup. The files are also generated on your system so they don't have to be downloaded over the wire.
In Node.js, I am trying to read a parquet file (compression='snappy') but have not been successful.
I used the https://github.com/ironSource/parquetjs npm module to open a local file and read it, but reading through the cursor throws the cryptic error 'not yet implemented'. It does not matter which compression (plain, rle, or snappy) was used to create the input file; it throws the same error.
Here is my code:
const readParquet = async (fileKey) => {
  const filePath = 'parquet-test-file.plain'; // 'snappy';
  console.log('----- reading file : ', filePath);

  let reader = await parquet.ParquetReader.openFile(filePath);
  console.log('---- ParquetReader initialized....');

  // create a new cursor
  let cursor = reader.getCursor();

  // read all records from the file and print them
  if (cursor) {
    console.log('---- cursor initialized....');
    let record = await cursor.next(); // this line throws exception
    while (record) {
      console.log(record);
      record = await cursor.next();
    }
  }

  await reader.close();
  console.log('----- done with reading parquet file....');
  return;
};
Call to read:
let dt = readParquet(fileKeys.dataFileKey);
dt
  .then((value) => console.log('--------SUCCESS', value))
  .catch((error) => {
    console.log('-------FAILURE ', error); // Random error
    console.log(error.stack);
  });
More info:
1. I have generated my parquet files in python using pyarrow.parquet
2. I used 'SNAPPY' compression while writing file
3. I can read these files in python without any issue
4. My schema is not fixed (unknown) each time I write parquet file. I do not create schema while writing.
5. error.stack prints undefined in console
6. console.log('-------FAILURE ', error); prints "not yet implemented"
I would like to know if someone has encountered a similar problem and has ideas or a solution to share. BTW, my parquet files are stored in an AWS S3 location (unlike in this test code); I still have to find a solution for reading a parquet file from an S3 bucket.
Any help, suggestions, or code examples will be highly appreciated.
Use var AWS = require('aws-sdk'); to get the data from S3.
Then use node-parquet to read the parquet file into a variable.
const np = require('node-parquet');

// Read from a file:
var reader = new np.ParquetReader(`file.parquet`);
var parquet_info = reader.info();
var parquet_rows = reader.rows();
reader.close();
parquet_rows = parquet_rows + "\n";
There is a fork of https://github.com/ironSource/parquetjs here: https://github.com/ZJONSSON/parquetjs, which is a "lite" version of the ironSource project. You can install it using npm install parquetjs-lite.
The ZJONSSON project comes with a function ParquetReader.openS3, which accepts an s3 client (from version 2 of the AWS SDK) and params ({Bucket: 'x', Key: 'y'}). You might want to try and see if that works for you.
If you are using version 3 of the AWS SDK / S3 client, I have a compatible fork here: https://github.com/entitycs/parquetjs (see tag feature/openS3v3).
Example usage from the project's README.md:
const parquet = require("parquetjs-lite");
const params = {
  Bucket: 'xxxxxxxxxxx',
  Key: 'xxxxxxxxxxx'
};

// v2 example
const AWS = require('aws-sdk');
const client = new AWS.S3({
  accessKeyId: 'xxxxxxxxxxx',
  secretAccessKey: 'xxxxxxxxxxx'
});
let reader = await parquet.ParquetReader.openS3(client,params);
//v3 example
const {S3Client, HeadObjectCommand, GetObjectCommand} = require('@aws-sdk/client-s3');
const client = new S3Client({region:"us-east-1"});
let reader = await parquet.ParquetReader.openS3(
  {S3Client: client, HeadObjectCommand, GetObjectCommand},
  params
);
// create a new cursor
let cursor = reader.getCursor();
// read all records from the file and print them
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}