Universal Sentence Encoder tensorflowjs optimize performance using webworker - javascript

I am using the following code to initiate a Web Worker that creates embeddings using the Universal Sentence Encoder:
const initEmbeddingWorker = (filePath) => {
  let worker = new Worker(filePath);
  worker.postMessage({init: 'init'})
  worker.onmessage = (e) => {
    worker.terminate();
  }
}
Web Worker code:
onmessage = function (e) {
  if (e.data.init && e.data.init === 'init') {
    fetchData();
  }
}

const fetchData = () => {
  // fetches data from indexeddb
  createEmbedding(data, storeEmbedding);
}

const createEmbedding = (data, callback) => {
  use.load().then(model => {
    model.embed(data).then(embeddings => {
      callback(embeddings);
    })
  });
}

const storeEmbedding = (matrix) => {
  let data = matrix.arraySync();
  // store data in indexeddb
}
It takes 3 minutes to create 100 embeddings using 10 Web Workers running simultaneously, each worker creating embeddings for 10 sentences. That is far too slow, as I need to create embeddings for more than 1000 sentences, which takes around 25 to 30 minutes.
Whenever this code runs it hogs all the resources, which makes the machine very slow and almost unusable.
Are there any performance optimizations that I am missing?

Using 10 Web Workers assumes that the machine running them has at least 11 cores. Why this assumption? Each worker should get its own core, plus one for the main thread.
To get the most out of Web Workers, each one should run on a different core. What happens when there are more workers than cores? The program won't be as fast as expected, because a lot of time is spent passing messages between cores. Sizing the pool to the hardware avoids this, as in the sketch below.
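As a minimal sketch (assuming a browser context; 'embedding-worker.js' is an illustrative path), the pool can be sized from navigator.hardwareConcurrency instead of a hard-coded 10:
// Keep one logical core free for the main thread; fall back to 4 cores
// if the browser does not report hardwareConcurrency.
const poolSize = Math.max(1, (navigator.hardwareConcurrency || 4) - 1);
const workers = Array.from({ length: poolSize }, () => new Worker('embedding-worker.js'));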
Now let's look at what happens on each core.
arraySync is a blocking call, preventing that thread from being used for anything else.
Instead of arraySync, array can be used:
const storeEmbedding = async (matrix) => {
  let data = await matrix.array();
  // store data in indexeddb
}
array and its counterpart arraySync are slower compared to data and dataSync. It is better to store the flattened data returned by data:
const storeEmbedding = async (matrix) => {
  let data = await matrix.data();
  // store data in indexeddb
}
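If the flattened output of data is stored, it can be reshaped when read back. A minimal sketch, assuming the Universal Sentence Encoder's 512-dimensional embeddings (EMBEDDING_DIM and toRows are illustrative names):
// Each consecutive run of EMBEDDING_DIM values in the flat Float32Array
// is one sentence embedding.
const EMBEDDING_DIM = 512;
const toRows = (flat) => {
  const rows = [];
  for (let i = 0; i < flat.length; i += EMBEDDING_DIM) {
    rows.push(flat.slice(i, i + EMBEDDING_DIM));
  }
  return rows;
};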

Related

JS Worker Performance - Parsing JSON

I'm experimenting with Workers, as my user interface is very slow due to big tasks running in the background.
I'm starting with the simplest tasks, such as parsing JSON. See below for my very simple code to create an async function running on a Worker.
Performance-wise, there is a big difference between:
JSON.parse(jsonStr);
and
await parseJsonAsync(jsonStr);
JSON.parse() takes 1ms whereas parseJsonAsync takes 102ms!
So my question is: are the overheads really that big for running worker threads, or am I missing something?
const worker = new Worker(new URL('../workers/parseJson.js', import.meta.url));

export async function parseJsonAsync(jsonStr) {
  return new Promise((resolve, reject) => {
    worker.onmessage = ({ data: { jsonObject } }) => {
      resolve(jsonObject);
    };
    worker.postMessage({ jsonStr: jsonStr });
  });
}
parseJson.js
self.onmessage = ({ data: { jsonStr } }) => {
  let jsonObject = null;
  try {
    jsonObject = JSON.parse(jsonStr);
  } catch (ex) {
  } finally {
    self.postMessage({ jsonObject: jsonObject });
  }
};
I can now confirm that the overhead of transferring messages between threads is pretty big. But the raw performance of the worker (at least in executing JSON.parse) is close to the main thread.
TL;DR: Just compare the numbers in the 2 tables. Without sending a big object via postMessage, worker perf is just fine.
For the test payload jsonStr, I create a string of a long list of [{"foo":"bar"}, ...] repeated n times. The number of items in jsonStr can be tuned by changing Array.from({ length: number }).
I then do postMessage(jsonStr) to run JSON.parse in the worker; when done parsing, it sends back the parsed jsonObject. In the main thread I just call JSON.parse(jsonStr) directly.
runTest(delay) uses setTimeout to wait until the worker has started up before running the actual test. runTest() without a delay runs immediately, so we can measure worker startup time.
Code for the test:
const blobURL = URL.createObjectURL(
  new Blob(
    [
      "(",
      function () {
        self.onmessage = ({ data: jsonStr }) => {
          let jsonObject = null;
          try {
            jsonObject = JSON.parse(jsonStr);
            self.postMessage(["done", jsonObject]);
          } catch (e) {
            self.postMessage(["error", e]);
          }
        };
      }.toString(),
      ")()",
    ],
    { type: "application/javascript" }
  )
);

const worker = new Worker(blobURL);
const jsonStr = "[" + Array.from({ length: 1_000 }, () => `{"foo":"bar"}`).join(",") + "]";

function test(payload) {
  worker.onmessage = ({ data }) => {
    const delta = performance.now() - t0;
    console.log("worker", delta);
    console.log("worker response", data[0]);
  };
  const t0 = performance.now();
  worker.postMessage(payload);
  testParseJsonInMain(payload);
}

function testParseJsonInMain(payload) {
  let obj;
  try {
    const t0 = performance.now();
    obj = JSON.parse(payload);
    const delta = performance.now() - t0;
    console.log("main", delta);
  } catch {}
}

function runTest(delay) {
  if (delay) {
    setTimeout(() => test(jsonStr), delay);
  } else {
    test(jsonStr);
  }
}

runTest(1000);
I observe that it takes around 30ms to start the worker on my machine. If the test runs after worker startup, I get these numbers (unit: milliseconds):
#items in payload | main | worker
1,000             | 0.2  | 2.1
10,000            | 1.3  | 9.8
100,000           | 15.4 | 73.5
1,000,000         | 165  | 854
10,000,000        | 2633 | 15312
When the payload reaches 10 million items, the worker really struggles (takes 15 seconds). At 10 million items, jsonStr is around 140MB.
But if the worker does not send back the parsed jsonObject, the numbers are much better. Just make a little change to the test code above:
// worker code changed from:
self.postMessage(["done", jsonObject]);
// to:
self.postMessage(["done", typeof jsonObject]);
#items in payload | main | worker
1,000             | 0.2  | 1.2
10,000            | 2.1  | 3.5
100,000           | 15.7 | 26.2
1,000,000         | 196  | 232
10,000,000        | 2249 | 2801
P.S. I've actually done another test. Instead of postMessage(jsonStr), I used TextEncoder to turn the string into an ArrayBuffer, then postMessage(arrayBuffer, [arrayBuffer]), which transfers the underlying memory from the main thread directly to the worker.
I did not see a real difference in time consumed; in fact it got a little bit slower. I guess sending a large string isn't the issue.
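For reference, a minimal sketch of that transfer variant (assuming the same worker protocol as the test above; the worker-side decode is shown as a comment):
// Transfer the encoded bytes instead of copying the string: the transfer
// list (second argument) hands the buffer's memory over to the worker.
const encoded = new TextEncoder().encode(jsonStr);
worker.postMessage(encoded.buffer, [encoded.buffer]);
// Worker side would decode before parsing:
// self.onmessage = (e) => JSON.parse(new TextDecoder().decode(e.data));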

Delayed read performance when using navigator.serial for serial communication

I've been trying out the Web Serial API in Chrome (https://web.dev/serial/) to do some basic communication with an Arduino board. I've noticed quite a substantial delay when reading data from the serial port, however. This same issue is present in some demos, but not all.
For instance, the WebSerial demo linked towards the bottom has a near-instantaneous read, while the Serial Terminal example results in a read delay (note that the write is triggered at the moment a character is entered on the keyboard).
WebSerial being open source allows me to check for differences against my own implementation; however, I am seeing performance much like the second example.
As for the relevant code:
this.port = await navigator.serial.requestPort({ filters });
await this.port.open({ baudRate: 115200, bufferSize: 255, dataBits: 8, flowControl: 'none', parity: 'none', stopBits: 1 });
this.open = true;
this.monitor();
private monitor = async () => {
  const dataEndFlag = new Uint8Array([4, 3]);
  while (this.open && this.port?.readable) {
    this.open = true;
    const reader = this.port.readable.getReader();
    try {
      let data: Uint8Array = new Uint8Array([]);
      while (this.open) {
        const { value, done } = await reader.read();
        if (done) {
          this.open = false;
          break;
        }
        if (value) {
          data = Uint8Array.of(...data, ...value);
        }
        if (data.slice(-2).every((val, idx) => val === dataEndFlag[idx])) {
          const decoded = this.decoder.decode(data);
          this.messages.push(decoded);
          data = new Uint8Array([]);
        }
      }
    } catch {
    }
  }
}

public write = async (data: string) => {
  if (this.port?.writable) {
    const writer = this.port.writable.getWriter();
    await writer.write(this.encoder.encode(data));
    writer.releaseLock();
  }
}
The equivalent WebSerial code can be found here; this is pretty much an exact replica. From what I can observe, it seems to hang at await reader.read(); for a brief period of time.
This is occurring both on a Windows 10 device and a macOS Monterey device. The specific hardware device is an Arduino Pro Micro connected to a USB port.
Has anyone experienced this same scenario?
Update: I did some additional testing with more verbose logging. It seems that the time between the write and the read is exactly 1 second every time.
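A minimal sketch of that kind of timing test (assuming the reader and writer from the code above are in scope; 'ping' is an illustrative payload):
// Stamp the write, then measure how long the first read takes to resolve.
const t0 = performance.now();
await writer.write(new TextEncoder().encode('ping'));
const { value } = await reader.read();
console.log('write-to-read delay (ms):', performance.now() - t0);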
The delay may result from SerialEvent() in your Arduino sketch: set Serial.setTimeout(1);
This means 1 millisecond instead of the default 1000 milliseconds, which matches the exactly-1-second gap observed above.

How to make a certain number of functions run parallel in loop in NodeJs?

I'm looking for a way to run 3 of the same function at once in a loop, wait until they finish, and then continue with another 3. I think it involves a loop and the Promise API, but my solution fails. It would be great if you could tell me what I did wrong and how to fix it.
Here is what I have done so far:
I have a download function (called downloadFile), an on-hold function (called runAfter) and a multi-download function (called downloadList). They look like this:
const https = require('https')
const fs = require('fs')
const { join } = require('path')
const chalk = require('chalk') // NPM
const mime = require('./MIME') // A small module that reads JSON and turns it into an object. It returns a file extension string.

exports.downloadFile = url => new Promise((resolve, reject) => {
  const req = https.request(url, res => {
    console.log('Accessing:', chalk.greenBright(url))
    console.log(res.statusCode, res.statusMessage)
    // console.log(res.headers)
    const ext = mime(res)
    const name = url
      .replace(/\?.+/i, '')
      .match(/[\ \w\.-]+$/i)[0]
      .substring(0, 250)
      .replace(`.${ext}`, '')
    const file = `${name}.${ext}`
    const stream = fs.createWriteStream(join('_DLs', file))
    res.pipe(stream)
    res.on('error', reject)
    stream
      .on('open', () => console.log(
        chalk.bold.cyan('Download:'),
        file
      ))
      .on('error', reject)
      .on('close', () => {
        console.log(chalk.bold.cyan('Completed:'), file)
        resolve(true)
      })
  })
  req.on('error', reject)
  req.end()
})

exports.runAfter = (ms, url) => new Promise((resolve, reject) => {
  setTimeout(() => {
    this.downloadFile(url)
      .then(resolve)
      .catch(reject)
  }, ms);
})

/* The list param is Array<String> only */
exports.downloadList = async (list, options) => {
  const opt = Object.assign({
    thread: 3,
    delayRange: {
      min: 100,
      max: 1000
    }
  }, options)
  // PROBLEM
  const multiThread = async (pos, run) => {
    const threads = []
    for (let t = pos; t < opt.thread + t; t++) threads.push(run(t))
    return await Promise.all(threads)
  }
  const inQueue = async run => {
    for (let i = 0; i < list.length; i += opt.thread)
      if (opt.thread > 1) await multiThread(i, run)
      else await run(i)
  }
  const delay = range => Math.floor(
    Math.random() * (new Date()).getHours() *
    (range.max - range.min) + range.min
  )
  inQueue(i => this.runAfter(delay(opt.delayRange), list[i]))
}
downloadFile will download anything from the link given. runAfter delays a random number of milliseconds before executing downloadFile. downloadList receives a list of URLs and passes each of them to runAfter for download. And that (downloadList) is where the trouble begins.
If I just pass the whole list through a simple loop and download a single file at a time, it's easy. But a large batch of requests, like a list of 50 URLs, takes a long time. So I decided to run 3 - 5 downloadFile calls in parallel instead of one. I was thinking about using async/await and Promise.all to solve the problem. However, it crashes. Below is the Node.js report:
<--- Last few GCs --->
[4124:01EF5068] 75085 ms: Scavenge 491.0 (493.7) -> 490.9 (492.5) MB, 39.9 / 0.0 ms (average mu = 0.083, current mu = 0.028) allocation failure
[4124:01EF5068] 75183 ms: Scavenge 491.4 (492.5) -> 491.2 (493.2) MB, 29.8 / 0.0 ms (average mu = 0.083, current mu = 0.028) allocation failure
<--- JS stacktrace --->
==== JS stack trace =========================================
0: ExitFrame [pc: 00B879E7]
Security context: 0x03b40451 <JSObject>
1: multiThread [04151355] [<project folder>\inc\Downloader.js:~62] [pc=03C87FBF](this=0x03cfffe1 <JSGlobal Object>,0,0x041512d9 <JSFunction (sfi = 03E2E865)>)
2: inQueue [041513AD] [<project folder>\inc\Downloader.js:70] [bytecode=03E2EA95 offset=62](this=0x03cfffe1 <JSGlobal Object>,0x041512d9 ...
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
Writing Node.js report to file: report.20200428.000236.4124.0.001.json
Node.js report completed
Apparently, a sub-function of downloadList (multiThread) is the cause, but I couldn't read those numbers (they look like physical RAM addresses or something), so I have no idea how to fix it. I'm not a professional engineer, so I would appreciate a good explanation.
Additional information:
Node.js version: 12.13.1
Localhost: Aspire SW3-013 > 1.9GB (2GB in spec) / Intel Atom CPU Z3735F
Connecting to the Internet via WiFi (Realtek driver)
OS: Windows 10 (no other choice)
In case you might ask:
Why wrap a Promise around downloadFile? For further applications, like putting it in another app which only requires one download at a time.
Is runAfter important? Maybe not, just a little challenge for myself. But it could be useful if servers require delayed download times.
Homework or business? Neither, hobby only. I plan to build an app to fetch and download images from the Unsplash API. So I prefer a good explanation of what I did wrong and how to fix it rather than code that simply works.
Your for-loop in multiThread never ends because your continuation condition is t < opt.thread + t. This will always be true if opt.thread is not zero. You will have an infinite loop here, and that's the cause of your crash.
I suspect you wanted to do something like this:
const multiThread = async (pos, run) => {
  const threads = [];
  for (let t = 0; t < opt.thread && pos + t < list.length; t++) {
    threads.push(run(pos + t));
  }
  return await Promise.all(threads);
};
The difference here is that the continuation condition limits the loop to a maximum of opt.thread iterations, and also stops it from running past the end of the list array.
If the list variable isn't global (i.e., list.length is not available in the multiThread function), then you can leave out the second part of the condition and handle it in the run function like this, so that any values of i past the end of the list are ignored:
inQueue(i => {
  if (i < list.length) this.runAfter(delay(opt.delayRange), list[i])
})
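To see the corrected batching behaviour in isolation, here is a self-contained sketch (illustrative values; timers stand in for the actual downloads):
// With 7 items and opt.thread = 3, each Promise.all batch finishes
// before the next one starts: indexes 0,1,2 then 3,4,5 then 6.
const list = ['a', 'b', 'c', 'd', 'e', 'f', 'g'];
const opt = { thread: 3 };
const run = i => new Promise(resolve => setTimeout(() => {
  console.log('done', i, list[i]);
  resolve(i);
}, 100));
const multiThread = async (pos, run) => {
  const threads = [];
  for (let t = 0; t < opt.thread && pos + t < list.length; t++) threads.push(run(pos + t));
  return Promise.all(threads);
};
(async () => {
  for (let i = 0; i < list.length; i += opt.thread) await multiThread(i, run);
})();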

429 Too Many Requests - Angular 7 - on multiple file upload

I have this problem when I try to upload more than a few hundred files at the same time.
The API interface is for one file only, so I have to call the service once per file. Right now I have this:
onFilePaymentSelect(event): void {
  if (event.target.files.length > 0) {
    this.paymentFiles = event.target.files[0];
  }
  let i = 0;
  let save = 0;
  const numFiles = event.target.files.length;
  let procesed = 0;
  if (event.target.files.length > 0) {
    while (event.target.files[i]) {
      const formData = new FormData();
      formData.append('file', event.target.files[i]);
      this.payrollsService.sendFilesPaymentName(formData).subscribe(
        (response) => {
          let added = null;
          procesed++;
          if (response.status_message === 'File saved') {
            added = true;
            save++;
          } else {
            added = false;
          }
          this.payList.push({ filename, message, added });
        });
      i++;
    }
  }
}
So I have a while loop sending each file to the API, but I get "429 Too Many Requests" with a high number of files. Is there any way I can improve this?
Working with observables will make that task easier to reason about (rather than using imperative programming).
A browser usually allows you to make 6 requests in parallel and will queue the others. But we don't want the browser to manage that queue for us (if we were running in a Node environment, for example, we wouldn't have it).
What do we want: to upload a lot of files. They should be queued and uploaded as efficiently as possible by running 5 requests in parallel at all times (so we keep 1 free for other requests in our app).
In order to demo that, let's build some mocks first:
import { from, of } from "rxjs";
import { delay, map, mergeAll, scan, tap } from "rxjs/operators";

function randomInteger(min, max) {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}

const mockPayrollsService = {
  sendFilesPaymentName: (file: File) => {
    return of(file).pipe(
      // simulate a 500ms to 1.5s network latency from the server
      delay(randomInteger(500, 1500))
    );
  }
};

// array containing 50 files which are mocked
const files: File[] = Array.from({ length: 50 })
  .fill(null)
  .map(() => new File([], ""));
I think the code above is self-explanatory. We are generating mocks so we can see how the core of the code will actually run without needing access to your real application.
Now, the main part:
const NUMBER_OF_PARALLEL_CALLS = 5;

const onFilePaymentSelect = (files: File[]) => {
  const uploadQueue$ = from(files).pipe(
    map(file => mockPayrollsService.sendFilesPaymentName(file)),
    mergeAll(NUMBER_OF_PARALLEL_CALLS)
  );
  uploadQueue$
    .pipe(
      scan(nbUploadedFiles => nbUploadedFiles + 1, 0),
      tap(nbUploadedFiles =>
        console.log(`${nbUploadedFiles}/${files.length} file(s) uploaded`)
      ),
      tap({ complete: () => console.log("All files have been uploaded") })
    )
    .subscribe();
};

onFilePaymentSelect(files);
We use from to send the files one by one into an observable.
Using map, we prepare our request for 1 file (but as we don't subscribe to it and the observable is cold, the request is just prepared, not triggered!).
We then use mergeAll to run a pool of calls; map + mergeAll(concurrency) is equivalent to mergeMap with a concurrency argument, so we can say "please run a maximum of 5 calls at the same time".
Finally, we use scan for display purposes only (to count the number of files that have been uploaded successfully).
Here's a live demo: https://stackblitz.com/edit/rxjs-zuwy33?file=index.ts
Open up the console to see that we're not uploading them all at once.
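As a side note, the same pool can be written with mergeMap's concurrency argument directly; a sketch of the equivalent pipeline (mergeMap imported from rxjs/operators alongside the others):
// map(fn) + mergeAll(n) is equivalent to mergeMap(fn, n)
const uploadQueue$ = from(files).pipe(
  mergeMap(file => mockPayrollsService.sendFilesPaymentName(file), NUMBER_OF_PARALLEL_CALLS)
);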

Dumping indexedDB data

Working on a Chrome Extension which needs to integrate with IndexedDB. I'm trying to figure out how to use Dexie.js. I found a bunch of samples, and they don't look too complicated. One example is particularly interesting for exploring IndexedDB with Dexie: https://github.com/dfahlander/Dexie.js/blob/master/samples/open-existing-db/dump-databases.html
However, when I run the one above - the "dump utility" - it does not see my IndexedDB databases, telling me: There are no databases at the current origin.
From the developer tools Application tab, under Storage, I see my IndexedDB database.
Is this some sort of permissions issue? Can any IndexedDB database be accessed by any tab/user?
What should I be looking at?
Thank you
In Chrome/Opera, there is a non-standard API, webkitGetDatabaseNames(), that Dexie.js uses to retrieve the list of database names at the current origin. For other browsers, Dexie emulates this API by keeping an up-to-date database of database names per origin. So:
For Chromium browsers, Dexie.getDatabaseNames() will list all databases at the current origin, but for non-Chromium browsers, only databases created with Dexie will be shown.
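A minimal sketch of listing the names this way (assuming Dexie is already loaded on the page):
// Resolves with an array of database names visible to Dexie at this origin.
Dexie.getDatabaseNames().then(names => console.log(names));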
If you need to dump the contents of each database, have a look at this issue, which basically gives:
interface TableDump {
  table: string
  rows: any[]
}

// Note: `export` and `import` are reserved words in JavaScript/TypeScript,
// so the functions are named exportDB/importDB here.
function exportDB(db: Dexie): Promise<TableDump[]> {
  return db.transaction('r', db.tables, () => {
    return Promise.all(
      db.tables.map(table => table.toArray()
        .then(rows => ({ table: table.name, rows: rows }))));
  });
}

function importDB(data: TableDump[], db: Dexie) {
  return db.transaction('rw', db.tables, () => {
    return Promise.all(data.map(t =>
      db.table(t.table).clear()
        .then(() => db.table(t.table).bulkAdd(t.rows))));
  });
}
Combine the functions with JSON.stringify() and JSON.parse() to fully serialize the data.
const db = new Dexie('mydb');
db.version(1).stores({ friends: '++id,name,age' });

(async () => {
  // Export
  const allData = await exportDB(db);
  const serialized = JSON.stringify(allData);
  // Import (note the rows must be valid JSON, i.e. quoted keys)
  const jsonToImport = '[{"table":"friends","rows":[{"id":1,"name":"foo","age":33}]}]';
  const dataToImport = JSON.parse(jsonToImport);
  await importDB(dataToImport, db);
})();
A working example for dumping data to a JSON file using the current IndexedDB API, as described at:
https://developers.google.com/web/ilt/pwa/working-with-indexeddb
https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API/Using_IndexedDB
The snippet below will dump recent messages from a Gmail account with Offline Mode enabled in the Gmail settings.
var dbPromise = indexedDB.open("your_account#gmail.com_xdb", 109);

dbPromise.onerror = (event) => {
  console.log("oh no!");
};

dbPromise.onsuccess = (event) => {
  // the opened database is delivered on the request's result
  var db = event.target.result;
  var transaction = db.transaction(["item_messages"]);
  var objectStore = transaction.objectStore("item_messages");
  var allItemsRequest = objectStore.getAll();
  allItemsRequest.onsuccess = function () {
    var all_items = allItemsRequest.result;
    console.log(all_items);
    // save items as JSON file
    var bb = new Blob([JSON.stringify(all_items)], { type: "text/plain" });
    var a = document.createElement("a");
    a.download = "gmail_messages.json";
    a.href = window.URL.createObjectURL(bb);
    a.click();
  };
};
Running the code above from DevTools > Sources > Snippets will also let you set breakpoints, debug, and inspect the objects.
Make sure you set the right version of the database as the second parameter to indexedDB.open(...). To peek at the value used by your browser, the following code can be used:
indexedDB.databases().then(function (r) {
  console.log(r);
});
