JavaScript - Read parquet data (with snappy compression) from AWS S3 bucket

In Node.js, I am trying to read a parquet file (compression='snappy') but have not been successful.
I used the https://github.com/ironSource/parquetjs npm module to open a local file and read it, but cursor.next() throws the cryptic error 'not yet implemented'. It does not matter which compression (plain, rle, or snappy) was used to create the input file; it throws the same error.
Here is my code:
const readParquet = async (fileKey) => {
  const filePath = 'parquet-test-file.plain'; // 'snappy';
  console.log('----- reading file : ', filePath);
  let reader = await parquet.ParquetReader.openFile(filePath);
  console.log('---- ParquetReader initialized....');
  // create a new cursor
  let cursor = reader.getCursor();
  // read all records from the file and print them
  if (cursor) {
    console.log('---- cursor initialized....');
    let record = await cursor.next(); // this line throws the exception
    while (record) {
      console.log(record);
      record = await cursor.next();
    }
  }
  await reader.close();
  console.log('----- done with reading parquet file....');
  return;
};
Call to read:
let dt = readParquet(fileKeys.dataFileKey);
dt
  .then((value) => console.log('--------SUCCESS', value))
  .catch((error) => {
    console.log('-------FAILURE ', error); // Random error
    console.log(error.stack);
  });
More info:
1. I have generated my parquet files in Python using pyarrow.parquet
2. I used 'SNAPPY' compression while writing the file
3. I can read these files in Python without any issue
4. My schema is not fixed (it is unknown) each time I write a parquet file; I do not create a schema while writing
5. error.stack prints undefined to the console
6. console.log('-------FAILURE ', error); prints "not yet implemented"
I would like to know if someone has encountered a similar problem and has ideas/solutions to share. BTW, my parquet files are stored in an AWS S3 location (unlike in this test code), so I still have to find a solution to read a parquet file from an S3 bucket.
Any help, suggestions, or code examples will be highly appreciated.

Use var AWS = require('aws-sdk'); to get the data from S3.
Then use node-parquet to read the parquet file into a variable:
const np = require('node-parquet');
// Read from a file:
var reader = new np.ParquetReader(`file.parquet`);
var parquet_info = reader.info();
var parquet_rows = reader.rows();
reader.close();
parquet_rows = parquet_rows + "\n";
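For the S3 piece, here is a minimal sketch (the bucket name, key, and temp path are placeholder assumptions, and it presumes aws-sdk v2 credentials are already configured): it buffers the object to a local file first, since node-parquet reads from a file path.
const AWS = require('aws-sdk');
const fs = require('fs');
const np = require('node-parquet');

const s3 = new AWS.S3();

// Download the S3 object into memory, write it to a temp file,
// then read it with node-parquet.
const readParquetFromS3 = async (bucket, key) => {
  const { Body } = await s3.getObject({ Bucket: bucket, Key: key }).promise();
  const tmpPath = '/tmp/downloaded.parquet'; // hypothetical temp location
  fs.writeFileSync(tmpPath, Body);

  const reader = new np.ParquetReader(tmpPath);
  const info = reader.info();   // schema and row-group metadata
  const rows = reader.rows();   // all rows
  reader.close();
  return { info, rows };
};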

There is a fork of https://github.com/ironSource/parquetjs here: https://github.com/ZJONSSON/parquetjs which is a "lite" version of the ironSource project. You can install it using npm install parquetjs-lite.
The ZJONSSON project comes with a function ParquetReader.openS3, which accepts an S3 client (from version 2 of the AWS SDK) and params ({Bucket: 'x', Key: 'y'}). You might want to try it and see if it works for you.
If you are using version 3 of the AWS SDK / S3 client, I have a compatible fork here: https://github.com/entitycs/parquetjs (see tag feature/openS3v3).
Example usage from the project's README.md:
const parquet = require("parquetjs-lite");

const params = {
  Bucket: 'xxxxxxxxxxx',
  Key: 'xxxxxxxxxxx'
};

// v2 example
const AWS = require('aws-sdk');
const client = new AWS.S3({
  accessKeyId: 'xxxxxxxxxxx',
  secretAccessKey: 'xxxxxxxxxxx'
});
let reader = await parquet.ParquetReader.openS3(client, params);

// v3 example
const {S3Client, HeadObjectCommand, GetObjectCommand} = require('@aws-sdk/client-s3');
const client = new S3Client({region: "us-east-1"});
let reader = await parquet.ParquetReader.openS3(
  {S3Client: client, HeadObjectCommand, GetObjectCommand},
  params
);
// create a new cursor
let cursor = reader.getCursor();

// read all records from the file and print them
let record = null;
while ((record = await cursor.next())) {
  console.log(record);
}
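Note that openS3 and cursor.next() both return promises, so outside of a module with top-level await this needs an async wrapper. A minimal sketch, reusing the v2 client and params from above:
(async () => {
  let reader = await parquet.ParquetReader.openS3(client, params);
  let cursor = reader.getCursor();
  let record = null;
  while ((record = await cursor.next())) {
    console.log(record);
  }
  await reader.close();
})();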

Related

Flushing file's with webkit's filesystem API with Safari

I am trying to use the filesystem API to create permanent files, and to write and read data from them.
Although I succeeded in creating the file and writing data to it, after calling flush() the file becomes empty (and its size is 0).
The files that I created exist and I can still see them in a different run of Safari, but the data is lost and the files are all zero-sized.
Even if I try to read the file just after writing to it and flushing, the data is lost and its size returns to 0.
Does anybody know what I am doing wrong?
I tried running this:
console.log("Starting");
async function files() {
  // Set a Message
  const message = "Thank you for reading this.";
  const fileName = "Draft.txt";

  // Get handle to draft file
  const root = await navigator.storage.getDirectory();
  const draftHandle = await root.getFileHandle(fileName, { create: true });
  console.log("File Name: ", fileName);

  // Get sync access handle
  const accessHandle = await draftHandle.createSyncAccessHandle();

  // Get size of the file.
  const fileSize = await accessHandle.getSize();
  console.log("File Size: ", fileSize); // getting 0 here

  // Read file content to a buffer.
  const buffer = new DataView(new ArrayBuffer(fileSize));
  const readBuffer = accessHandle.read(buffer, { at: 0 });
  console.log("Read: ", readBuffer); // getting 0 here because the file was just created

  // Write the message to the file.
  const encoder = new TextEncoder();
  const encodedMessage = encoder.encode(message);
  const writeBuffer = accessHandle.write(encodedMessage, { at: readBuffer });
  console.log("Write: ", writeBuffer); // writing 27 bytes here, succeeding

  // Persist changes to disk.
  accessHandle.flush();

  // Always close FileSystemSyncAccessHandle if done.
  accessHandle.close();
  console.log("Closed file ");

  // Find files under root/ and print their details.
  for await (const handle of root.values()) {
    console.log('Item in root: ', handle.name);
    if (handle.kind == "file") {
      let f = await handle.getFile();
      console.log("File Details: ", f); // I can see here the file I created now and earlier created files, but all of them are 0 sized
    }
  }
}

files();

How to find if Azure File exists on NodeJS

I'm using Azure File Storage and Express.js to write a backend that renders the contents stored in the Azure file share.
I am writing the code based on https://learn.microsoft.com/en-us/javascript/api/@azure/storage-file-share/shareserviceclient?view=azure-node-latest
const { ShareServiceClient, StorageSharedKeyCredential } = require("@azure/storage-file-share");

const account = "<account>";
const accountKey = "<accountkey>";

const credential = new StorageSharedKeyCredential(account, accountKey);
const serviceClient = new ShareServiceClient(
  `https://${account}.file.core.windows.net`,
  credential
);

const shareName = "<share name>";
const fileName = "<file name>";

// [Node.js only] A helper method used to read a Node.js readable stream into a Buffer
async function streamToBuffer(readableStream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    readableStream.on("data", (data) => {
      chunks.push(data instanceof Buffer ? data : Buffer.from(data));
    });
    readableStream.on("end", () => {
      resolve(Buffer.concat(chunks));
    });
    readableStream.on("error", reject);
  });
}
And you can view the contents through:
const downloadFileResponse = await fileClient.download();
const output = (await streamToBuffer(downloadFileResponse.readableStreamBody)).toString();
Thing is, I only want to find out whether the file exists, without spending time downloading the entire file. How could I do this?
I looked at https://learn.microsoft.com/en-us/javascript/api/@azure/storage-file-share/shareserviceclient?view=azure-node-latest to see if the file client class has what I want, but it doesn't seem to have a method for this.
If you are using the @azure/storage-file-share (version 12.x) Node package, there's an exists method on ShareFileClient. You can use it to find out whether a file exists. Something like:
const fileExists = await fileClient.exists(); // returns true or false
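For completeness, a sketch of how that fileClient can be obtained from the serviceClient defined in the question (using the v12 API; shareName and fileName are the same placeholders as above):
const shareClient = serviceClient.getShareClient(shareName);
const fileClient = shareClient.rootDirectoryClient.getFileClient(fileName);

// Checks existence via a metadata call; nothing is downloaded.
const fileExists = await fileClient.exists();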

Downloading an Azure Storage Blob using pure JavaScript and Azure-Storage-Js

I'm trying to do this with just pure Javascript and the SDK. I am not using Node.js. I'm converting my application from v2 to v10 of the SDK azure-storage-js-v10
The azure-storage.blob.js bundled file is compatible with UMD
standard, if no module system is found, following global variable
will be exported: azblob
My code is here:
const serviceURL = new azblob.ServiceURL(`https://${account}.blob.core.windows.net${accountSas}`, pipeline);
const containerName = "container";
const containerURL = azblob.ContainerURL.fromServiceURL(serviceURL, containerName);
const blobURL = azblob.BlobURL.fromContainerURL(containerURL, blobName);
const downloadBlobResponse = await blobURL.download(azblob.Aborter.none, 0);
The downloadBlobResponse looks like this:
[screenshot of the downloadBlobResponse object omitted]
Using v10, how can I convert the downloadBlobResponse into a new blob so it can be used in the FileSaver saveAs() function?
In azure-storage-js-v2 this code worked on smaller files:
let readStream = blobService.createReadStream(containerName, blobName, (err, res) => {
  if (err) {
    // Handle read blob error
  }
});

// Use an event listener to receive data
readStream.on('data', data => {
  // Uint8Array retrieved
  // Convert the array back into a blob
  var newBlob = new Blob([new Uint8Array(data)]);
  // Saves file to the user's downloads directory
  saveAs(newBlob, blobName); // FileSaver.js
});
I've tried everything to get v10 working; any help would be greatly appreciated.
Thanks,
You need to get the body by awaiting blobBody:
const downloadBlobResponse = await blobURL.download(azblob.Aborter.none, 0);
// data is a browser Blob type
const data = await downloadBlobResponse.blobBody;
Thanks Mike Coop and Xiaoning Liu!
I was busy making a Vue.js plugin to download blobs from a storage account. Thanks to you, I was able to make this work.
var FileSaver = require('file-saver');
const { BlobServiceClient } = require("@azure/storage-blob");

const downloadButton = document.getElementById("download-button");

// fileList, reportStatus and listFiles are defined elsewhere in the page
const downloadFiles = async () => {
  try {
    if (fileList.selectedOptions.length > 0) {
      reportStatus("Downloading files...");
      for await (const option of fileList.selectedOptions) {
        var blobName = option.text;
        const account = '<account name>';
        const sas = '<blob sas token>';
        const containerName = '<container name>';
        const blobServiceClient = new BlobServiceClient(`https://${account}.blob.core.windows.net${sas}`);
        const containerClient = blobServiceClient.getContainerClient(containerName);
        const blobClient = containerClient.getBlobClient(blobName);
        // download() takes (offset, count, options), not the blob name
        const downloadBlockBlobResponse = await blobClient.download(0);
        const data = await downloadBlockBlobResponse.blobBody;
        // Saves file to the user's downloads directory
        FileSaver.saveAs(data, blobName); // FileSaver.js
      }
      reportStatus("Done.");
      listFiles();
    } else {
      reportStatus("No files selected.");
    }
  } catch (error) {
    reportStatus(error.message);
  }
};

downloadButton.addEventListener("click", downloadFiles);
Thanks Xiaoning Liu!
I'm still learning about async JavaScript functions and promises. I guess I was just missing another "await". I saw that downloadBlobResponse.blobBody was a promise and also a blob type, but I couldn't figure out why it wouldn't convert to a new blob. I kept getting the "Iterator getter is not callable" error.
Here's my final working solution:
// Create a BlobURL
const blobURL = azblob.BlobURL.fromContainerURL(containerURL, blobName);

// Download blob
const downloadBlobResponse = await blobURL.download(azblob.Aborter.none, 0);

// In browsers, get the downloaded data by accessing downloadBlobResponse.blobBody
const data = await downloadBlobResponse.blobBody;

// Saves file to the user's downloads directory
saveAs(data, blobName); // FileSaver.js

NodeJS (JavaScript/TypeScript) - Error while reading Parquet file

I am trying to read a parquet file with Node.js:
var parquet = require('parquetjs');

(
  async () => {
    try {
      // create a new ParquetReader that reads from 'f1.snappy.parquet'
      let reader = await parquet.ParquetReader.openFile('f1.snappy.parquet');

      // create a new cursor
      let cursor = reader.getCursor();

      // read all records from the file and print them
      let record = null;
      while ((record = await cursor.next())) {
        console.log(record);
      }
    } catch (e) {
      console.log('error while reading a parquet file:\n', e);
    }
  }
)();
Getting error:
error while reading a parquet file:
invalid page type: DICTIONARY_PAGE
I can read the same parquet file with the Python pyarrow library without issue.
What could be the reason?
I was having this same issue (and some others when reading nested objects) with parquetjs.
I switched to https://www.npmjs.com/package/parquetjs-lite and now everything works smoothly.
parquetjs-lite is a fork of parquetjs, so I didn't need to change any of my code.
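Since parquetjs-lite is a drop-in replacement, the switch amounts to installing the fork and changing the require; the reader code from the question stays the same:
// npm install parquetjs-lite
var parquet = require('parquetjs-lite');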

Node.js - Getting empty files when unzipping and uploading to GCS

I am trying to create a service that gets a zip file, unpacks it, and uploads its contents to a Google Cloud Storage bucket.
The unzipping part seems to work well, but in my GCS bucket all the files seem to be empty.
I'm using the following code:
app.post('/fileupload', function(req, res) {
  var form = new formidable.IncomingForm();
  form.parse(req, function (err, fields, files) {
    const uuid = uuidv4();
    console.log(files.filetoupload.path); // temporary path to zip

    fs.createReadStream(files.filetoupload.path)
      .pipe(unzip.Parse())
      .on('entry', function (entry) {
        var fileName = entry.path;
        var type = entry.type; // 'Directory' or 'File'
        var size = entry.size;

        const gcsname = uuid + '/' + fileName;
        const blob = bucket.file(gcsname);
        const blobStream = blob.createWriteStream(entry.path);

        blobStream.on('error', (err) => {
          console.log(err);
        });
        blobStream.on('finish', () => {
          const publicUrl = format(`https://storage.googleapis.com/${bucket.name}/${blob.name}`);
          console.log(publicUrl); // file on GCS
        });
        blobStream.end(entry.buffer);
      });
  });
});
I'm quite new to Node.js, so I'm probably overlooking something. I've spent some time on the documentation, but I don't quite know what to do.
Could someone advise on what the problem might be?
fs.createWriteStream() takes a file path as its argument, but the GCS createWriteStream() takes an options object.
As per the example in this documentation, the recommended way would be:
const stream = file.createWriteStream({
  metadata: {
    contentType: req.file.mimetype
  },
  resumable: false
});
instead of:
const blobStream = blob.createWriteStream(entry.path);
Check whether your buffer is undefined or not. It may be due to unspecified disk/memory storage that the buffer remains undefined.
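Putting both answers together, here is a hedged sketch of the 'entry' handler (assuming the unzipper-style streaming API used in the question): each entry is itself a readable stream, so pipe it into the GCS write stream rather than calling blobStream.end(entry.buffer), which writes an empty file when entry.buffer is undefined.
.on('entry', function (entry) {
  if (entry.type === 'File') {
    const gcsname = uuid + '/' + entry.path;
    const blobStream = bucket.file(gcsname).createWriteStream({ resumable: false });
    blobStream.on('error', (err) => console.log(err));
    blobStream.on('finish', () => console.log('uploaded ' + gcsname));
    entry.pipe(blobStream); // stream the entry contents instead of entry.buffer
  } else {
    entry.autodrain(); // discard directories and skipped entries
  }
})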
