nodejs express stream from array - javascript

I'm building an app in which I need to stream data to the client; my data is simply an array of objects.
This is the for loop that builds the array:
for (let i = 0; i < files.length; i++) {
    try {
        let file = files[i]
        let musicPath = `${baseDir}/${file}`
        let meta = await getMusicMeta(musicPath)
        musics.push(meta)
    } catch (err) {
        // handle or skip files whose metadata can't be read
    }
}
Right now I wait for the loop to finish its work and then send the whole musics array to the client. I want to use a stream to send the items of the musics array one by one instead of waiting for the loop to finish.

Use scramjet and send the stream straight to the response:
const { DataStream } = require("scramjet");
// ...
response.writeHead(200);
DataStream.fromArray(files)
    // all the magic happens below - flow control
    .map(file => getMusicMeta(`${baseDir}/${file}`))
    .toJSONArray()
    .pipe(response);
Scramjet will make use of your flow control and most importantly - it'll get the result out faster than any other streaming framework.
Edit: I wrote a couple lines of code to make this use case easier in scramjet. :)
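For completeness, here is a minimal sketch (not from the original answer) of how the snippet could sit inside an Express route handler; the /musics path and the files, baseDir and getMusicMeta names are assumptions carried over from the question:
const express = require("express");
const { DataStream } = require("scramjet");

const app = express();

app.get("/musics", (request, response) => {
    // files, baseDir and getMusicMeta come from the question's setup
    response.writeHead(200, { "Content-Type": "application/json" });
    DataStream.fromArray(files)
        .map(file => getMusicMeta(`${baseDir}/${file}`))
        .toJSONArray()
        .pipe(response);
});

app.listen(3000);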

Related

Get data from Stream.Writable into a string variable

I am using the #kubernetes/client-node library.
My end goal is to execute commands (say "ls") and get the output for further processing.
The .exec() method requires providing two Writable streams (for the WebSocket to write the output to), and one Readable stream (for pushing our commands to).
The code I have looks something like this:
const outputStream = new Stream.Writable();
const commandStream = new Stream.Readable();
const podExec = await exec.exec(
"myNamespace",
"myPod",
"myContainer",
["/bin/sh", "-c"],
outputStream,
outputStream,
commandStream,
true
);
commandStream.push("ls -l\n");
// get the data from Writable stream here
outputStream.destroy();
commandStream.destroy();
podExec.close();
I am pretty new to JS and am having trouble getting the output from the Writable stream since it doesn't allow direct reading.
Creating a Writable stream to a file and then reading from it seems unnecessarily overcomplicated.
I would like to write the output as a string to a variable.
Has anyone encountered the same task before, and if so, what can you suggest to get the command output?
I would appreciate any help on this matter!
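One hedged sketch of a way to capture the output, assuming plain Node stream constructors (this is not from the original thread): back the Writable with an in-memory string and give the Readable a no-op read() so push() works.
const Stream = require("stream");

let output = "";
const outputStream = new Stream.Writable({
    write(chunk, encoding, callback) {
        output += chunk.toString(); // accumulate every chunk written by the exec WebSocket
        callback();
    }
});

const commandStream = new Stream.Readable({
    read() {} // no-op so commandStream.push("ls -l\n") can be called freely
});

// pass outputStream/commandStream to exec.exec() as in the snippet above;
// once the command has finished, `output` holds the command's output as a string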

how to maintain the order of http requests using node js?

I have a bunch of data that I want to send to a server over HTTP. However, on the server side I need to process the data in the same order as it was sent (e.g. if the order of sending is elem1, elem2 and elem3, I want to process elem1 first, then elem2 and then elem3). Since HTTP gives no guarantee that the order will be maintained, I need some way to maintain it.
Currently I keep the data in a queue, send one element and await the response. Once the response arrives I send the next element.
while (!queue.isEmpty()) {
    let data = queue.dequeue();
    await sendDataToServer(data);
}
I am not sure whether this will actually work in a production environment, or what the impact on performance will be.
Any sort of help is much appreciated. Thank you
Sorry, I don't have enough reputation to comment, thus I am posting this as an answer.
Firstly, your code will work as intended.
However, because each request must finish before the next one is sent, the performance won't be good. If you can change the server, I suggest implementing it like this:
1) Add an ID to each data item.
2) Send all the data items; there is no need to ensure order.
3) Create a buffer on the server that can hold all the data items.
4) The server receives the items and puts them into the buffer at the right position.
Example code:
Client (see Promise.all)
let i = 0;
let promises = [];
await sendDataLengthToServer(queue.length());
while (!queue.isEmpty()) {
    let data = queue.dequeue();
    data.id = i++;
    // no need to wait for a request to finish
    promises.push(sendDataToServer(data));
}
await Promise.all(promises);
Server (pseudo-code)
length = receiveDataLengthFromClient()
buffer = new Array(length)
received = 0
onDataReceivedFromClient(data, {
    received = received + 1
    buffer[data.id] = data
    if (received == length) {
        // the buffer contains the data in the right order
    }
})
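As a rough, hedged illustration of that server-side pseudo-code (Express, the /length and /data routes, and JSON bodies are all assumptions, not part of the original answer):
const express = require("express");
const app = express();
app.use(express.json());

let buffer = [];
let expected = 0;
let received = 0;

// the client announces how many items it is going to send
app.post("/length", (req, res) => {
    expected = req.body.length;
    buffer = new Array(expected);
    received = 0;
    res.sendStatus(200);
});

// items may arrive in any order; the id puts each one back in place
app.post("/data", (req, res) => {
    buffer[req.body.id] = req.body;
    received += 1;
    if (received === expected) {
        // buffer now holds all items in the original order - process them here
    }
    res.sendStatus(200);
});

app.listen(3000);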

Best way to push one more scrape after all are done

I have the following scenario:
My scrapes are behind a login, so there is one login page that I always need to hit first.
Then I have a list of 30 URLs that can be scraped asynchronously for all I care.
Then at the very end, when all those 30 URLs have been scraped, I need to hit one last separate URL to put the results of the 30-URL scrape into a Firebase DB and to do some other mutations (like geo lookups for addresses etc.).
Currently I have all 30 URLs in a request queue (through the Apify web interface) and I'm trying to see when they are all finished.
But obviously they all run asynchronously, so that information is never reliable:
const queue = await Apify.openRequestQueue();
let pendingRequestCount = await queue.getInfo();
The reasons why I need that last URL to be separate are twofold:
1) Most obviously, I need to be sure I have the results of all 30 scrapes before I send everything to the DB.
2) None of the 30 URLs allow me to do Ajax/Fetch calls, which I need for sending to Firebase and doing the geo lookups of addresses.
Edit: Tried this based on the answer from @Lukáš Křivka. handledRequestCount in the while loop reaches a max of 2, never 4 ... and Puppeteer just ends normally. I've put the "return" inside the while loop because otherwise requests never finish (of course).
In my current test setup I have 4 URLs to be scraped (in the Start URLs input field of Puppeteer Scraper on Apify.com) and this code:
let title = "";
const queue = await Apify.openRequestQueue();
let {handledRequestCount} = await queue.getInfo();
while (handledRequestCount < 4){
await new Promise((resolve) => setTimeout(resolve, 2000)) // wait for 2 secs
handledRequestCount = await queue.getInfo().then((info) => info.handledRequestCount);
console.log(`Curently handled here: ${handledRequestCount} --- waiting`) // this goes max to '2'
title = await page.evaluate(()=>{ return $('h1').text()});
return {title};
}
log.info("Here I want to add another URL to the queue where I can do ajax stuff to save results from above runs to firebase db");
title = await page.evaluate(()=>{ return $('h1').text()});
return {title};
I would need to see your code to answer completely correctly but this has solutions.
Simply use Apify.PuppeteerCrawler for the 30 URLs. Then you run the crawler with await crawler.run().
After that, you can simply load the data from the default dataset via
const dataset = await Apify.openDataset();
const data = await dataset.getData().then((response) => response.items);
And do whatever you want with the data; you can even create a new Apify.PuppeteerCrawler to crawl the last URL and use the data.
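A rough sketch of what that crawler setup might look like; the listOf30Urls variable and the page-handling logic are placeholders, not part of the original answer:
const Apify = require("apify");

Apify.main(async () => {
    // the 30 URLs that need to be scraped first
    const requestList = await Apify.openRequestList("my-urls", listOf30Urls);

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction: async ({ request, page }) => {
            // placeholder scraping logic - store whatever you extract
            const title = await page.title();
            await Apify.pushData({ url: request.url, title });
        },
    });

    await crawler.run(); // resolves only after all requests are handled

    const dataset = await Apify.openDataset();
    const data = await dataset.getData().then((response) => response.items);
    // now crawl the final URL and/or push `data` to Firebase
});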
If you are using Web Scraper though, it is a bit more complicated. You can either:
1) Create a separate actor for the Firebase upload and pass it a webhook from your Web Scraper to load the data from it. If you look at the Apify store, we already have a Firestore uploader.
2) Add logic that polls the requestQueue as you did, and proceed only when all the requests are handled. You can create some kind of loop that waits, e.g.:
const queue = await Apify.openRequestQueue();
let { handledRequestCount } = await queue.getInfo();
while (handledRequestCount < 30) {
    console.log(`Currently handled: ${handledRequestCount} --- waiting`);
    await new Promise((resolve) => setTimeout(resolve, 2000)); // wait for 2 secs
    handledRequestCount = await queue.getInfo().then((info) => info.handledRequestCount);
}
// Do your Firebase stuff
In the scenario where you have one async function that's called for each of the 30 URLs you scrape, first make sure the function returns its result after all necessary awaits; you could then await Promise.all(arrayOfAll30Promises) and run your last piece of code afterwards.
Because I was not able to get consistent results with the {handledRequestCount} from getInfo() (see my edit in my original question), I went another route.
I'm basically keeping a record of which URLs have already been scraped via the key/value store.
urls = [
    { done: false, label: "vietnam", url: "https://en.wikipedia.org/wiki/Vietnam" },
    { done: false, label: "cambodia", url: "https://en.wikipedia.org/wiki/Cambodia" }
]
// Loop over the array and add them to the Queue
for (let i = 0; i < urls.length; i++) {
    await queue.addRequest(new Apify.Request({ url: urls[i].url }));
}
// Push the array to the key/value store with key 'URLS'
await Apify.setValue('URLS', urls);
Now every time I've processed a URL I set its "done" value to true.
When they are all true I push another (final) URL into the queue:
await queue.addRequest(new Apify.Request({ url: "http://www.placekitten.com" }));

Can I [de]serialize a dictionary of dataframes in the arrow/js implementation?

I want to use Apache Arrow to send data from a Django backend to an Angular frontend. I want to use a dictionary of dataframes/tables as the payload in messages. It's possible with pyarrow to share data this way between Python microservices, but I can't find a way with the JavaScript implementation of Arrow.
Is there a way to serialize/deserialize a dictionary with strings as keys and dataframes/tables as values on the JavaScript side with Arrow?
Yes, a variant of this is possible using the RecordBatchReader and RecordBatchWriter IPC primitives in both pyarrow and ArrowJS.
On the python side, you can serialize a Table to a buffer like this:
import pyarrow as pa

def serialize_table(table):
    sink = pa.BufferOutputStream()
    writer = pa.RecordBatchStreamWriter(sink, table.schema)
    writer.write_table(table)
    writer.close()
    return sink.getvalue().to_pybytes()

# ...later, in your route handler:
bytes = serialize_table(create_your_arrow_table())
Then you can send the bytes in the response body. If you have multiple tables, you can concatenate the buffers from each as one large payload.
I'm not sure what functionality exists to write multipart/form-body responses in python, but that's probably the best way to craft the response if you want the tables to be sent with their names (or any other metadata you wish to include).
On the JavaScript side, you can read the response either with Table.from() (if you have just one table), or with the RecordBatchReader if you have more than one, or if you want to read each RecordBatch in a streaming fashion:
import { Table, RecordBatchReader } from 'apache-arrow';

// easy if you want to read the first (or only) table in the response
const table = await Table.from(fetch('/table'));

// or for multiple tables on the same stream, or to read in a streaming fashion:
for await (const reader of RecordBatchReader.readAll(fetch('/table'))) {
    // Buffer all batches into a table
    const table = await Table.from(reader);
    // Or process each batch as it's downloaded
    for await (const batch of reader) {
    }
}
You can see more examples of this in our tests for ArrowJS here:
https://github.com/apache/arrow/blob/3eb07b7ed173e2ecf41d689b0780dd103df63a00/js/test/unit/ipc/writer/stream-writer-tests.ts#L40
You can also see some examples in a little fastify plugin I wrote for consuming and producing Arrow payloads in node: https://github.com/trxcllnt/fastify-arrow

Performing async function on each element in array?

So I have an array of URLs
I want to pull the HTML from each (for which I am using the restler Node.js library).
Then select some of that data to act on via jQuery (for which I am using the cheerio Node.js library).
The code I have works, but duplicates the pulled data by however many URLs there are.
I am doing this in Node but suspect it's a generalized JavaScript matter that I don't understand too well.
url.forEach(function(ugh){
    rest.get(ugh).on('complete', function(data) {
        $ = cheerio.load(data);
        prices.push($(".priceclass").text());
        // i only want this code to happen once per item in url array
        // but it happens url.length times per item
        // probably because i don't get events or async very well
    });
});
So if there are 3 items in the 'url' array, the 'prices' array with the data I want will have 9 items, which I don't want.
--EDIT:
Added a counter to verify that the 'complete' callback was executing array-length times per array item.
x = 0;
url.forEach(function(ugh){
    rest.get(ugh).on('complete', function(data) {
        var $ = cheerio.load(data);
        prices.push($(".priceclass").text());
        console.log(x = x + 1);
    });
});
Console outputs 1 2 3 4 5 6 7 8 9
I was thinking that I might be going about this wrong. I've been trying to push some numbers onto an array, and then outside the callbacks do something with that array.
Anyways it seems clear that >1 restler eventlisteners aren't gonna work together at all.
Maybe rephrasing the question would help:
How would I scrape a number of URLs, then act on that data?
Currently looking into request & async libraries, via code from the extinguished node.io library
To answer the rephrased question, scramjet is great for this if you use ES6+ and Node, which I assume you do:
How would I scrape a number of URLs, then act on that data?
Install the packages:
npm install scramjet node-fetch --save
Scramjet works on streams - it will read your list of URLs and give you a stream you can work with as simply as an array. node-fetch is a simple Node module that follows the standard Fetch Web API.
A simple example that also reads the URLs from a file, assuming you store them one per line:
const { StringStream } = require("scramjet");
const fs = require("fs");
const fetch = require("node-fetch");

fs.createReadStream(process.argv[2])   // open the file for reading
    .pipe(new StringStream())          // redirect it to a scramjet stream
    .split("\n")                       // split line by line
    .map((url) => fetch(url))          // fetch each URL
    .map((resp) => resp.json())        // parse the response as JSON
    .toArray()                         // accumulate the data into an array
    .then(
        (data) => doYourStuff(data),   // do the calculations
        (err) => showErrorMessage(err)
    );
Thanks to the way scramjet works, you needn't worry about error handling (all errors are caught automatically) or managing simultaneous requests. If you can process the file URL by URL, you can also make this very memory- and resource-efficient, as it won't read and try to fetch all the items at once, but will do some of the work in parallel.
There are more examples and the full API description in the scramjet docs.
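For reference, the original price-scraping question could also be handled without scramjet; the following is a hedged sketch using node-fetch, cheerio and Promise.all, with the url array and the ".priceclass" selector taken from the question:
const fetch = require("node-fetch");
const cheerio = require("cheerio");

async function scrapePrices(urls) {
    // fetch all pages in parallel and wait for every body to arrive
    const pages = await Promise.all(
        urls.map((u) => fetch(u).then((resp) => resp.text()))
    );
    return pages.map((html) => {
        const $ = cheerio.load(html); // keep $ local so parallel requests don't interfere
        return $(".priceclass").text();
    });
}

// `url` is the array name used in the question
scrapePrices(url).then((prices) => console.log(prices));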
