How to get the HTML from a website using NodeJS? - javascript

I know this is a pretty basic question, but I can't get anything working.
I have a list of URLs and I need to get the HTML from them using NodeJS.
I have tried using Axios, but the response returned is always undefined.
I am hitting the endpoint /process-logs with a POST request, and the body consists of logFiles (which is an array).
router.post("/process-logs", function (req, res, next) {
  fileStrings = req.body.logFiles;
  for (var i = 0; i < fileStrings.length; i++) {
    axios(fileStrings[i]).then(function (response) {
      console.log(response.body);
    });
  }
  res.send("done");
});
A sample fileString is of the form https://amazon-artifacts.s3.ap-south-1.amazonaws.com/q-120/log1.txt.
How can I parallelize this process to do the same task for multiple files at a time?

I can think of two approaches here:
The first one is to use ES6 Promises (Promise.all) and the async/await feature, chunking the fileStrings array into n chunks. This is a basic approach and you have to handle a lot of cases yourself.
This is the general idea of the flow I am thinking of:
async function handleChunk(chunk) {
  const toBeFulfilled = [];
  for (const file of chunk) {
    toBeFulfilled.push(axios.get(file)); // replace axios.get with your per-file logic
  }
  return Promise.all(toBeFulfilled);
}

async function main() {
  try {
    const fileStrings = req.body.logFiles;
    const limit = 10; // chunk size, pick whatever suits you
    for (let i = 0; i < fileStrings.length; i += limit) {
      const chunk = fileStrings.slice(i, i + limit);
      const results = await handleChunk(chunk);
      console.log(results);
    }
  } catch (e) {
    console.log(e);
  }
}

main().then(() => { console.log('done'); }).catch((e) => { console.log(e); });
One of the drawbacks is that we are processing the chunks sequentially (chunk by chunk, which is still better than file by file). One enhancement could be to chunk fileStrings ahead of time and process the chunks concurrently (it really depends on what you're trying to achieve and what limitations you have).
The second approach is to use the Async library, which has many control flows and collections that let you configure the concurrency, etc. (I really recommend this approach.)
You should have a look at Async's Queue control flow to run the same task for multiple files concurrently; a rough sketch follows below.
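This is only a minimal sketch of the queue idea, assuming the async library (v3) and axios; the concurrency of 5, the processLogs wrapper and the handler body are illustrative placeholders, not the only way to do it:
// A queue whose worker downloads one log file; at most 5 files are in flight at a time.
const async = require("async");
const axios = require("axios");

const queue = async.queue(async (file) => {
  const response = await axios.get(file);
  return response.data; // note: axios puts the body on response.data, not response.body
}, 5);

queue.error((err, file) => console.error("failed to fetch", file, err.message));

function processLogs(fileStrings) {
  fileStrings.forEach((file) => queue.push(file));
  return queue.drain(); // in async v3 this returns a promise that resolves when the queue empties
}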

Related

Optimizing a file content parser class written in TypeScript

I have a TypeScript module (used by a VS Code extension) which accepts a directory and parses the content contained within its files. For directories containing a large number of files this parsing takes a bit of time, so I would like some advice on how to optimize it.
I don't want to copy/paste the entire class files, so I will be using mock pseudocode containing the parts that I think are relevant.
class Parser {
  constructor(_dir: string) {
    this.dir = _dir;
  }

  async parse() {
    let tree: any = getFileTree(this.dir);
    try {
      let parsedObjects: MyDTO[] = await this.iterate(tree.children);
    } catch (err) {
      console.error(err);
    }
  }

  async iterate(children: any[]): Promise<MyDTO[]> {
    let objs: MyDTO[] = [];
    for (let i = 0; i < children.length; i++) {
      let child: any = children[i];
      if (child.type === Constants.FILE) {
        let dto: FileDTO = await this.heavyFileProcessingMethod(file); // this takes time
        objs.push(dto);
      } else {
        // child is a folder
        let dtos: MyDTO[] = await this.iterate(child.children);
        let dto: FolderDTO = new FolderDTO();
        dto.files = dtos.filter(item => item instanceof FileDTO);
        dto.folders = dtos.filter(item => item instanceof FolderDTO);
        objs.push(dto);
      }
    }
    return objs;
  }

  async heavyFileProcessingMethod(file: string): Promise<FileDTO> {
    let content: string = readFile(file); // util method to synchronously read file content using fs
    return new FileDTO(await this.parseFileContent(content));
  }

  async parseFileContent(content): Promise<any[]> {
    // parsing happens here and the file content is parsed into separate blocks
    let ast: any = await convertToAST(content); // uses an asynchronous method of an external dependency to convert content to AST
    let blocks = parseToBlocks(ast); // synchronous method called to convert AST to blocks
    return await this.processBlocks(blocks);
  }

  async processBlocks(blocks: any[]): Promise<any[]> {
    for (let i = 0; i < blocks.length; i++) {
      let block: Block = blocks[i];
      if (block.condition === true) {
        // this can take some time because if this condition is true, some external assets will be downloaded (via internet)
        // on to the caller's machine + some additional processing takes place
        await processBlock(block);
      }
    }
    return blocks;
  }
}
Still sort of a beginner to TypeScript/NodeJS. I am looking for a multithreading/Java-esque solution here if possible. In the context of Java, this.heavyFileProcessingMethod would be an instance of a Callable object, and this object would be pushed into a List<Callable> which would then be executed in parallel by an ExecutorService returning List<Future<Object>>.
Basically I want all files to be processed in parallel, but the function must wait for all the files to be processed before returning from the method (so the entire iterate method will only take as long as the time taken to parse the largest file).
I've been reading about running tasks in worker threads in NodeJS; can something like this be used in TypeScript as well? If so, can it be used in this situation? If my Parser class needs to be refactored to accommodate this change (or any other suggested change), that's no issue.
EDIT: Using Promise.all
iterate(children: any[]): Promise<MyDTO>[] {
  let promises: Promise<MyDTO>[] = [];
  for (let i = 0; i < children.length; i++) {
    let child: any = children[i];
    if (child.type === Constants.FILE) {
      let promise: Promise<FileDTO> = this.heavyFileProcessingMethod(file); // this takes time
      promises.push(promise);
    } else {
      // child is a folder
      let dtos: Promise<MyDTO>[] = this.iterate(child.children);
      let promise: Promise<FolderDTO> = this.getFolderPromise(dtos);
      promises.push(promise);
    }
  }
  return promises;
}

async getFolderPromise(promises: Promise<MyDTO>[]): Promise<FolderDTO> {
  return Promise.all(promises).then(dtos => {
    let dto: FolderDTO = new FolderDTO();
    dto.files = dtos.filter(item => item instanceof FileDTO);
    dto.folders = dtos.filter(item => item instanceof FolderDTO);
    return dto;
  });
}
first: Typescript is really Javascript
Typescript is just Javascript with static type checking, and those static types are erased when the TS is transpiled to JS. Since your question is about algorithms and runtime language features, Typescript has no bearing; your question is a Javascript one. So right off the bat that tells us the answer to
Been reading on running tasks in worker threads in NodeJS, can something like this be used in TypeScript as well?
is YES.
As to the second part of your question,
can it be used in this situation?
the answer is YES, but...
second: Use Worker Threads only if the task is CPU bound.
Can does not necessarily mean you should. It depends on whether your processes are IO bound or CPU bound. If they are IO bound, you're most likely far better off relying on Javascript's longstanding asynchronous programming model (callbacks, Promises). But if they are CPU bound, then using Node's relatively new support for Thread-based parallelism is more likely to result in throughput gains. See Node.js Multithreading!, though I think this one is better: Understanding Worker Threads in Node.js.
While worker threads are lighter weight than previous Node options for parallelism (spawning child processes), they are still relatively heavyweight compared to threads in Java. Each worker runs in its own Node VM; regular variables are not shared (you have to use special data types and/or message passing to share data). It had to be done this way because JavaScript is designed around a single-threaded programming model. It's extremely efficient within that model, but that design makes supporting multithreading harder. Here's a good SO answer with useful info for you: https://stackoverflow.com/a/63225073/8910547
My guess is your parsing is more IO bound, and the overhead of spawning worker threads will outweigh any gains. But give it a go and it will be a learning experience. :)
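If you do give it a go, the basic shape is something like this minimal sketch (assuming Node >= 12; the worker.js file name, doHeavyParsing and the message shape are illustrative, not part of your existing code):
// main thread: wrap each heavy parse in a worker and await its result
const { Worker } = require("worker_threads");

function parseInWorker(filePath) {
  return new Promise((resolve, reject) => {
    const worker = new Worker("./worker.js", { workerData: filePath });
    worker.on("message", resolve); // the worker posts its result back here
    worker.on("error", reject);
    worker.on("exit", (code) => {
      if (code !== 0) reject(new Error(`Worker stopped with exit code ${code}`));
    });
  });
}

// worker.js would contain something like:
// const { parentPort, workerData } = require("worker_threads");
// parentPort.postMessage(doHeavyParsing(workerData)); // doHeavyParsing = your CPU-bound parsing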
It looks like your biggest problem is navigating the nested directory structure and keeping individual per-file and per-dir promises organized. My suggestion would be to do that in a simpler way.
Have a function that takes a directory path and returns a flat list of all files it can find, no matter how deep, in a manner similar to the find program. This function can be like this:
import * as fs from 'fs/promises'
import * as path from 'path'

async function fileList(dir: string): Promise<string[]> {
  let entries = await fs.readdir(dir, {withFileTypes: true})
  let files = entries
    .filter(e => e.isFile())
    .map(e => path.join(dir, e.name))
  let dirs = entries
    .filter(e => e.isDirectory())
    .map(e => path.join(dir, e.name))
  let subLists = await Promise.all(dirs.map(d => fileList(d)))
  return files.concat(subLists.flat())
}
Basically, obtain directory entries, find (sub)directories and iterate them recursively in parallel. Once the iteration is complete, flatten, merge and return the list.
Using this function, you can apply your heavy task to all files at once by simply using map + Promise.all:
let allFiles = await fileList(dir)
let results = await Promise.all(allFiles.map(f => heavyTask(f)))

MongoDB Aggregation+Cursor coupled with JS Generators

It is the first time I am using Generators in JavaScript (Node.js). I just wanted someone to validate if the implementation is correct.
If not, please suggest a better way.
So, there are 2 questions here:
Is the generator-based implementation optimized? And before worrying about optimization, is it correct? (I am getting the results; I just wanted to understand it from the blocking/non-blocking perspective.)
I am making 10 API calls using node-fetch and collecting the promises in an array. If I need to deploy this on AWS Lambda, is it okay if I do not use Promise.all(array) and simply return, irrespective of whether the API calls are successful or not? Anyway, I do not care about the response; I just trigger the API.
Requirements:
Host a Node.js function that talks to MongoDB using Mongoose driver on AWS Lambda.
Fetch 10000 documents from MongoDB.
The documents have the following schema. The _id key holds a String value of length 163:
{
_id: "cSKhwtczH4QV7zM-43wKH:APA91bF678GW3-EEe8YGt3l1kbSpGJ286IIY2VjImfCL036rPugMkudEUPbtcQsC"
}
I am interested in the value of _ids and I am able to process just 1000 _ids at a time.
Those are put into an array and an API call is made. Hence, for 10000 _ids, I need to make 10 API calls with 10 of those arrays.
Implementation I did using Node.js Generators:
async function* generatorFunction() {
  await connectMongo();
  const cursor = Model.aggregate([
    //... some pipeline here
  ]).cursor();

  const GROUP_LIMIT = 1000;
  let newGroup = [];

  for await (const doc of cursor) {
    if (newGroup.length == GROUP_LIMIT) {
      yield newGroup;
      newGroup = [];
    }
    newGroup.push(doc._id);
  }

  yield newGroup;
  await disconnectMongo();
}

const gen = generatorFunction();

(async () => {
  const promises = [];
  for await (const newGroup of gen) {
    // POST request using node-fetch
    // Ignore the syntax
    promises.push(fetch('API', { body: { newGroup } }));
  }
  // Do I need to do this?
  // Or can this be skipped?
  // I do not care about the response anyway...
  // I just need to trigger the API and forget...
  await Promise.all(promises);
  return {
    success: true,
  };
})();

How to handle streaming data using fetch?

I have used async for with great success in handling output streams from processes with node.js, but I'm struggling to get something that I was hoping could "just work" with the browser fetch API.
This works great to async'ly handle chunks of output streaming from a process:
for await (const out of proc.child.stdout) {
...
}
(in an async function context here of course)
I tried to do something similar in a browser where I want to gain access to the data while it is being sent to me from the server.
for await (const chunk of (await fetch('/data.jsonl')).body) {
console.log('got', chunk);
}
This does not work in Chrome (Uncaught TypeError: (intermediate value).body is not async iterable).
For my use case this is not necessary, so I am simply using let data = await (await fetch(datapath)).text(); in my client code for now. This is analogous to the typical use of .json() instead of .text() on the awaited fetch, so no processing can begin until the entire response is received by the browser. This is not ideal for obvious reasons.
I was looking at Oboe.js (I think the relevant impl is somewhere near here), which pretty much deals with this, but its internals are fairly ugly, so it looks like that might be the only way to do this for now?
If async iteration isn't implemented (meaning async for cannot be used yet), isn't there another way to use the ReadableStream in a practical way?
Unfortunately async iterable support is not yet implemented, despite being in the spec. Instead you can manually iterate, as shown in this example from the spec. (I'll convert examples to async/await for you in this answer.)
const reader = response.body.getReader();
const { value, done } = await reader.read();

if (done) {
  console.log("The stream was already closed!");
} else {
  console.log(value);
}
You can use recursion or a loop to do this repeatedly, as in this other example:
async function readAllChunks(readableStream) {
  const reader = readableStream.getReader();
  const chunks = [];

  let done, value;
  while (!done) {
    ({ value, done } = await reader.read());
    if (done) {
      return chunks;
    }
    chunks.push(value);
  }
}
console.log(await readAllChunks(response.body));
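If you'd rather keep the for await syntax, you can wrap the reader in an async generator yourself. This is a small sketch under the assumption of a standard fetch Response; streamChunks is just an illustrative name:
// Wrap a Response body's reader in an async generator so it can be consumed
// with for await...of even though the stream itself is not async-iterable.
async function* streamChunks(response) {
  const reader = response.body.getReader();
  try {
    while (true) {
      const { value, done } = await reader.read();
      if (done) return;
      yield value; // a Uint8Array chunk
    }
  } finally {
    reader.releaseLock();
  }
}

// Usage:
// for await (const chunk of streamChunks(await fetch('/data.jsonl'))) {
//   console.log('got', chunk);
// }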
According to the spec, a ReadableStream such as the fetch API's Response.body does have a getIterator method. For some reason it's not async-iterable itself; you explicitly have to call that method:
const response = await fetch('/data.json');
if (!response.ok)
  throw new Error(await response.text());

for await (const chunk of response.body.getIterator()) {
  console.log('got', chunk);
}
I believe the current state of affairs in mid 2020 is that async for does not work on the fetch body yet.
https://github.com/whatwg/streams/issues/778 This issue appears to have tracking bugs for browsers and none of them have the functionality implemented yet.
I don't currently know of another way to make use of the .body ReadableStream provided by fetch.
The standard way to do the task implicit in the question is to use a websocket.

Setting delay/timeout for axios requests in map() function

I am using node and axios (with TS, but that's not too important) to query an API. I have a suite of scripts that make calls to different endpoints and log the data (sometimes filtering it.) These scripts are used for debugging purposes. I am trying to make these scripts "better" by adding a delay between requests so that I don't "blow up" the API, especially when I have a large array I'm trying to pass. So basically I want it to make a GET request and pause for a certain amount of time before making the next request.
I have played with trying setTimeout() functions, but I'm only putting them in places where they add the delay after the requests have executed; everywhere I have inserted the function has had this result. I understand why I am getting this result, I just had to try everything I could to at least increase my understanding of how things are working.
I have thought about trying to set up a queue or trying to use interceptors, but I think I might be "straying far" from a simpler solution with those ideas.
Additionally, I have another "base script" that I wrote on the fly (sorta the birth point for this batch of scripts) that I constructed with a for loop instead of the map() function and Promise.all. I have played with trying to set the delay in that script as well, but I didn't get anywhere helpful.
var axios = require('axios');
var fs = require('fs');

const Ids = [arrayOfIds];

try {
  // Promise.all takes an array of promises
  Promise.all(Ids.map(id => {
    // Return each request as its individual promise
    return axios
      .get(URL + 'endPoint/' + id, config)
  }))
    .then((vals) => {
      // vals is the array of data from the resolved Promise.all
      fs.appendFileSync(`${__dirname}/*responseOutput.txt`,
        vals.map((v) => {
          return `${JSON.stringify(v.data)} \n \r`
        }).toString())
    }).catch((e) => console.log(e))
} catch (err) {
  console.log(err);
}
No errors with the above code; just can't figure out how to put the delay in correctly.
You could try Promise.map from bluebird
It has the option of setting concurrency
var axios = require('axios');
var fs = require('fs');
var Promise = require('bluebird');

const Ids = [arrayOfIds];
let concurrency = 3; // at most 3 HTTP requests will run concurrently

try {
  Promise.map(Ids, id => {
    console.log(`starting request`, id);
    return axios.get(URL + 'endPoint/' + id, config);
  }, { concurrency })
    .then(vals => {
      console.log({ vals });
    });
} catch (err) {
  console.log(err);
}
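If what you really want is a pause between requests rather than a concurrency cap, a plain async loop with an awaited delay also works. A minimal sketch, assuming the same Ids, URL and config as above; the 500 ms value is just an example:
// Sequential requests with an explicit pause between them; no extra libraries needed.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithDelay(ids) {
  const results = [];
  for (const id of ids) {
    results.push(await axios.get(URL + 'endPoint/' + id, config));
    await delay(500); // wait 500 ms before firing the next request
  }
  return results;
}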

Best way to call an API inside a for loop using Promises

I have 500 million objects, each of which has n contacts, as shown below
var groupsArray = [
{'G1': ['C1','C2','C3'....]},
{'G2': ['D1','D2','D3'....]}
...
{'G2000': ['D2001','D2002','D2003'....]}
...
]
I have two implementations in Node.js: one based on regular promises and another using Bluebird, as shown below.
Regular promises
...
var groupsArray = [
  {'G1': ['C1','C2','C3']},
  {'G2': ['D1','D2','D3']}
];

function ajax(url) {
  return new Promise(function(resolve, reject) {
    request.get(url, {json: true}, function(error, data) {
      if (error) {
        reject(error);
      } else {
        resolve(data);
      }
    });
  });
}

_.each(groupsArray, function(groupData) {
  _.each(groupData, function(contactlists, groupIndex) {
    // console.log(groupIndex)
    _.each(contactlists, function(contactData) {
      ajax('http://localhost:3001/api/getcontactdata/' + groupIndex + '/' + contactData).then(function(result) {
        console.log(result.body);
        // Code depending on result
      }).catch(function() {
        // An error occurred
      });
    });
  });
});
...
Using the Bluebird way, I have used concurrency to see how to control the queue of promises:
...
_.each(groupsArray, function(groupData) {
  _.each(groupData, function(contactlists, groupIndex) {
    var contacts = [];
    // console.log(groupIndex)
    _.each(contactlists, function(contactData) {
      contacts.push({
        contact_name: 'Contact ' + contactData
      });
    });
    groups.push({
      task_name: 'Group ' + groupIndex,
      contacts: contacts
    });
  });
});

Promise.each(groups, group =>
  Promise.map(group.contacts,
    contact => new Promise((resolve, reject) => {
      /*setTimeout(() =>
        resolve(group.task_name + ' ' + contact.contact_name), 1000);*/
      request.get('http://localhost:3001/api/getcontactdata/' + group.task_name + '/' + contact.contact_name, {json: true}, function(error, data) {
        if (error) {
          reject(error);
        } else {
          resolve(data);
        }
      });
    }).then(log => console.log(log.body)),
    {
      concurrency: 50
    }
  ).then(() => console.log())
).then(() => {
  console.log('All Done!!');
});
...
I want to know how to deal with 100 million API calls inside a loop using promises. Please advise on the best way to call the API asynchronously and handle the responses later.
My answer using regular Node.js promises (this can probably easily be adapted to Bluebird or another library).
You could fire off all Promises at once using Promise.all:
var groupsArray = [
  {'G1': ['C1','C2','C3']},
  {'G2': ['D1','D2','D3']}
];

function ajax(url) {
  return new Promise(function(resolve, reject) {
    request.get(url, {json: true}, function(error, data) {
      if (error) {
        reject(error);
      } else {
        resolve(data);
      }
    });
  });
}

Promise.all(groupsArray.map(group => ajax("your-url-here")))
  .then(results => {
    // Code that depends on all results.
  })
  .catch(err => {
    // Handle the error.
  });
Using Promise.all attempts to run all your requests in parallel. This probably won't work well when you have 500 million requests to make all being attempted at the same time!
A more effective way to do it is to use the JavaScript reduce function to sequence your requests one after the other:
// ... Setup as before ...

const results = [];

groupsArray.reduce((prevPromise, group) => {
    return prevPromise.then(() => {
      return ajax("your-url-here")
        .then(result => {
          // Process a single result if necessary.
          results.push(result); // Collect your results.
        });
    });
  },
  Promise.resolve() // Seed promise.
)
  .then(() => {
    // Code that depends on all results.
  })
  .catch(err => {
    // Handle the error.
  });
This example chains together the promises so that the next one only starts once the previous one completes.
Unfortunately the sequencing approach will be very slow because it has to wait until each request has completed before starting a new one. Whilst each request is in progress (it takes time to make an API request) your CPU is sitting idle whereas it could be working on another request!
A more efficient, but complicated approach to this problem is to use a combination of the above approaches. You should batch your requests so that the requests in each batch (of say 10) are executed in parallel and then the batches are sequenced one after the other.
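As a rough idea of that batching pattern, here is a minimal sketch only, assuming the same ajax helper as above; runInBatches and the batch size of 10 are illustrative:
// Run requests in parallel batches: each batch runs concurrently,
// and the next batch only starts after the previous one has settled.
async function runInBatches(urls, batchSize = 10) {
  const results = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);
    const batchResults = await Promise.all(batch.map(url => ajax(url)));
    results.push(...batchResults);
  }
  return results;
}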
It's tricky to implement this yourself - although it's a great learning exercise, using a combination of Promise.all and the reduce function - so I'd suggest using the library async-await-parallel. There's a bunch of such libraries, but I use this one and it works well and easily does the job you want.
You can install the library like this:
npm install --save async-await-parallel
Here's how you would use it:
const parallel = require("async-await-parallel");

// ... Setup as before ...

const batchSize = 10;

parallel(groupsArray.map(group => {
  return () => { // We need to return a 'thunk' function, so that the jobs can be started when they are needed, rather than all at once.
    return ajax("your-url-here");
  };
}), batchSize)
  .then(() => {
    // Code that depends on all results.
  })
  .catch(err => {
    // Handle the error.
  });
This is better, but it's still a clunky way to make such a large amount of requests! Maybe you need to up the ante and consider investing time in proper asynchronous job management.
I've been using Kue lately for managing a cluster of worker processes. Using Kue with the Node.js cluster library allows you to get proper parallelism happening on a multi-core PC, and you can then easily extend it to multiple cloud-based VMs if you need even more grunt.
See my answer here for some Kue example code.
In my opinion you have two problems coupled in one question - I'd decouple them.
#1 Loading of a large dataset
Operating on such a large dataset (500M records) will surely cause memory limit issues sooner or later - Node.js runs in a single thread and it is limited to using approx. 1.5GB of memory - after that your process will crash.
In order to avoid that, you could read your data as a stream from a CSV - I'll use scramjet as it'll help us with the second problem, but JSONStream or papaparse would do pretty well too:
$ npm install --save scramjet
Then let's read the data - I'd assume from a CSV:
const {StringStream} = require("scramjet");

const stream = require("fs")
  .createReadStream(pathToFile)
  .pipe(new StringStream('utf-8'))
  .csvParse()
Now we have a stream of objects that will return the data line by line, but only if we read it. Solved problem #1, now to "augment" the stream:
#2 Stream data asynchronous augmentation
No worries - that's just what you do - for every line of data you want to fetch some additional info (so augment) from some API, which by default is asynchronous.
That's where scramjet kicks in, with just a couple of additional lines:
stream
  .flatMap(groupData => Object.entries(groupData))
  .flatMap(([groupIndex, contactList]) => contactList.map(contactData => ([contactData, groupIndex])))
  // now you have a simple stream of entries for your call
  .map(([contactData, groupIndex]) => ajax('http://localhost:3001/api/getcontactdata/' + groupIndex + '/' + contactData))
  // and here you can print or do anything you like with your data stream
  .each(console.log)
After this you'd need to accumulate the data or output it to a stream - there are a number of options - for example: .toJSONArray().pipe(fileStream).
Using scramjet you are able to separate the process into multiple lines without much impact on performance. Using setOptions({maxParallel: 32}) you can control concurrency and, best of all, this will run with a minimal memory footprint - much, much faster than if you were to load the whole dataset into memory.
Let me know if this is helpful - your question is quite complex, so let me know if you run into any problems - I'll be happy to help. :)
