Can anyone provide some guidance on pinpointing the bottleneck in a transform?
This is a Node.js implementation using Saxon-JS. I'm trying to increase the speed of transforming some XML documents so that I can provide a synchronous API that ideally responds in under 60 seconds (230 seconds is the hard limit of the Application Gateway). I also need to be able to handle XML files up to 50 MB in size.
I've run Node's built-in profiler (https://nodejs.org/en/docs/guides/simple-profiling/), but it's tough to make sense of the results, given that the source code of the free version of Saxon-JS is not really human-readable.
My Code
const path = require('path');
const SaxonJS = require('saxon-js');
const { loadCodelistsInMem } = require('../standards_cache/codelists');
const { writeFile } = require('../config/fileSystem');
const config = require('../config/config');
const { getStartTime, getElapsedTime } = require('../config/appInsights');

// Used for easy debugging of the XSLT stylesheet.
// Runs the iati.xslt transform on the supplied XML.
const runTransform = async (sourceFile) => {
  try {
    const fileName = path.basename(sourceFile);
    const codelists = await loadCodelistsInMem();

    // Pulls the right array of SaxonJS resources from the resources object.
    const collectionFinder = (url) => {
      if (url.includes('codelist')) {
        // Get the right file path (remove file:// and anything after the ?).
        const versionPath = url.split('schemata/')[1].split('?')[0];
        if (codelists[versionPath]) return codelists[versionPath];
      }
      return [];
    };

    const start = getStartTime();
    const result = await SaxonJS.transform(
      {
        sourceFileName: sourceFile,
        stylesheetFileName: `${config.TMP_BASE_DIR}/data-quality/rules/iati.sef.json`,
        destination: 'serialized',
        collectionFinder,
        logLevel: 10,
      },
      'async'
    );
    console.log(`${getElapsedTime(start)} (s)`);

    await writeFile(`performance_tests/output/${fileName}`, result.principalResult);
  } catch (e) {
    console.log(e);
  }
};

runTransform('performance_tests/test_files/test8meg.xml');
Example console output:
❯ node --prof utils/runTransform.js
SEF generated by Saxon-JS 2.0 at 2021-01-27T17:10:38.029Z with -target:JS -relocate:true
79.938 (s)
❯ node --prof-process isolate-0x102d7b000-19859-v8.log > v8_log.txt
Files:
- stylesheet
- example XML: test8meg.xml
- Node profiling log: v8_log.txt
Snippet of the V8 log of the largest performance offender:
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 1.0% are not shown.
ticks parent name
33729 52.5% T __ZN2v88internal20Builtin_ConsoleClearEiPmPNS0_7IsolateE
6901 20.5% T __ZN2v88internal20Builtin_ConsoleClearEiPmPNS0_7IsolateE
3500 50.7% T __ZN2v88internal20Builtin_ConsoleClearEiPmPNS0_7IsolateE
3197 91.3% LazyCompile: *k /Users/nosvalds/Projects/validator-api/node_modules/saxon-js/SaxonJS2N.js:287:264
3182 99.5% LazyCompile: *<anonymous> /Users/nosvalds/Projects/validator-api/node_modules/saxon-js/SaxonJS2N.js:682:218
2880 90.5% LazyCompile: *d /Users/nosvalds/Projects/validator-api/node_modules/saxon-js/SaxonJS2N.js:734:184
Thanks a lot. There aren't a ton of resources on this to walk myself through. I've also already tried:
- Using the stylesheetInternal parameter with pre-parsed JSON (it didn't make a large difference; a sketch follows below).
- Splitting the document into separate documents that each contain only one <iati-activity> child element inside the <iati-activities> root element, transforming each separately, and putting the results back together. This ended up taking twice as long.
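For reference, the stylesheetInternal attempt looked roughly like this (a sketch under the assumption that the SEF file is read and parsed once at startup; the path here is a shortened placeholder):

const fs = require('fs');
const SaxonJS = require('saxon-js');

// Parse the compiled stylesheet (SEF) once, up front, instead of having
// SaxonJS read and parse the file on every call.
const sef = JSON.parse(fs.readFileSync('rules/iati.sef.json', 'utf8'));

const runTransformPreParsed = async (sourceFile) => {
  const result = await SaxonJS.transform(
    {
      sourceFileName: sourceFile,
      stylesheetInternal: sef,
      destination: 'serialized',
    },
    'async'
  );
  return result.principalResult;
};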
Best,
Nik
You asked the same question at https://saxonica.plan.io/boards/5/topics/8105?r=8106, and I have responded there. I know StackOverflow doesn't like link-only answers, but I prefer to support users via our own support channels rather than via StackOverflow where possible.
Related
Context: I have a JavaScript file that activates PowerShell's native SpeechSynthesizer module. The script receives a message and passes it through to PowerShell, where it is rendered as speech.
Problem: there is horrible latency (~5 sec) between execution and response. This is because the script creates an entirely new PowerShell session and SpeechSynthesizer object on every execution.
Objective: I want to change the script so that a single PowerShell session and SpeechSynthesizer object are persisted and reused across multiple executions. I believe this will eliminate the latency completely.
Limiting Factor: this modification requires making the PowerShell execution stateful. Currently, I don't know how to incorporate stateful PowerShell commands in a JavaScript file.
Code:
const path = require('path');
const Max = require('max-api');
const { exec } = require('child_process');

// This will be printed directly to the Max console.
Max.post(`Loaded the ${path.basename(__filename)} script`);

const execCommand = (command) => {
  // Max.post(`Running command: ${command}`);
  exec(command, { shell: 'powershell.exe' }, (err, stdout, stderr) => {
    if (err) {
      // Node couldn't execute the command.
      Max.error(stderr);
      Max.error(err);
      return;
    }
    // The *entire* stdout and stderr (buffered).
    Max.outletBang();
  });
};

// Use the 'outlet' function to send messages out of node.script's outlet.
Max.addHandler('speak', (msg) => {
  const add = 'Add-Type -AssemblyName System.speech';
  const create = '$speak = New-Object System.Speech.Synthesis.SpeechSynthesizer';
  const speak = `$speak.Speak('${msg}')`;
  const command = [add, create, speak].join('; ');
  execCommand(command);
});
Objective, Re-stated: I want to move the add and create commands to a 'create' handler which will only be run once. The speak command will then be run an arbitrary number of times afterward.
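A sketch of what that could look like (untested; it assumes PowerShell will keep reading commands line by line from a piped stdin):

const { spawn } = require('child_process');

// Start a single long-lived PowerShell session and keep its stdin open.
const ps = spawn('powershell.exe', ['-NoLogo'], {
  stdio: ['pipe', 'inherit', 'inherit'],
});

// The 'create' part: run the setup commands exactly once, at startup.
ps.stdin.write('Add-Type -AssemblyName System.speech\n');
ps.stdin.write('$speak = New-Object System.Speech.Synthesis.SpeechSynthesizer\n');

// The 'speak' part: reuse the same session for every message.
const speak = (msg) => {
  // Double any single quotes so msg cannot break out of the PowerShell string.
  ps.stdin.write(`$speak.Speak('${msg.replace(/'/g, "''")}')\n`);
};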
Attempted Solution: I've found one package (https://github.com/bitsofinfo/powershell-command-executor) that supposedly supports stateful PowerShell commands, but it's very complicated. Also, the author mentions a risk of command injection and other insecurities, of which I have no knowledge.
Any and all suggestions are welcome. Thanks!
I'm currently working on a small web app which should implement a cache-first scenario (users download the web app somewhere with Wi-Fi provided and should then be able to use it offline outside).
I'm not using any framework and am therefore implementing the caching (service worker) myself.
As I also integrate some PlayCanvas content (which has its own loading screen) via an iframe, I was wondering what overall loading strategy would make sense.
In a similar project I simply let the service worker download the assets in parallel with the (initial) load of the application.
But it occurred to me that it would be better to implement a workflow that is closer to native app behavior - meaning showing an overall loading screen during the service worker download process and building/showing my main application only after this process has finished (or failed -> forced network scenario, or already happened earlier -> offline scenario). Another solution would be to show a non-blocking "Assets are still being downloaded" banner.
The main thoughts leading me to the second workflow were:
- The SW loading screen / banner could provide better feedback to the user: "All assets downloaded - I'm safe to go offline". The old scenario could cause issues here by successfully showing the user the first state while some critical files are still being downloaded in the background.
- With the SW loading screen, the download process is a bit more controllable/understandable for me, as the parallel processes of the SW download and the PlayCanvas loading, for example, become sequential.
It would be great if someone could provide me feedback/info on:
- whether I'm on the right track with the second scenario being better, or whether it is just overhead
- how / whether it might be possible to implement a cheap loading screen, e.g. "100 of 230 files downloaded"
- better strategies for this scenario in general
As always, thanks for any heads up in advance.
A lot of this comes down to what you want your users to experience. The underlying technology is there to accomplish any of the scenarios you outline.
For instance, if you want to show information about the precaching progress during initial service worker installation, you could do that by adding code along the lines of the following.
In your service worker:
const PRECACHE_NAME = "...";
const URLS_TO_PRECACHE = [
  // ...
];

async function notifyClients(urlsCached, totalURLs) {
  const clients = await self.clients.matchAll({ includeUncontrolled: true });
  for (const client of clients) {
    client.postMessage({ urlsCached, totalURLs });
  }
}

self.addEventListener("install", (event) => {
  event.waitUntil(
    (async () => {
      const cache = await caches.open(PRECACHE_NAME);
      const totalURLs = URLS_TO_PRECACHE.length;
      let urlsCached = 0;
      for (const urlToPrecache of URLS_TO_PRECACHE) {
        await cache.add(urlToPrecache);
        urlsCached++;
        await notifyClients(urlsCached, totalURLs);
      }
    })()
  );
});
In your client pages:
// Optional: if controller is not set, then there isn't already a
// previous service worker, so this is a "first-time" install.
// If you would prefer, you could add this event listener
// unconditionally, and you'll get update messages even when there's an
// updated service worker.
if (!navigator.serviceWorker.controller) {
  navigator.serviceWorker.addEventListener("message", (event) => {
    const { urlsCached, totalURLs } = event.data;
    // Display a message about how many URLs have been cached.
  });
}
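To make the display step concrete, here is a minimal sketch, assuming a hypothetical #sw-progress element in the page:

navigator.serviceWorker.addEventListener("message", (event) => {
  const { urlsCached, totalURLs } = event.data;
  // #sw-progress is an assumed element id; use whatever UI you have.
  document.querySelector("#sw-progress").textContent =
    `${urlsCached} of ${totalURLs} files downloaded`;
});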
My goal is to use a C++ application in a web server written in JavaScript (such as Node.js).
Do you have a solution for combining the two?
I'm not going to go too much in depth, but in this case spawning a process would be the "go to" option, I guess.
Something like
const fs = require('fs');
const { spawn } = require('child_process');

const logStream = fs.createWriteStream('./logFile.log');
const spawnedProcess = spawn('./some/path/to/executable.exe', ['-flag1', '-flag2']);

// Handle errors: pipe stderr to a log file.
spawnedProcess.stderr.pipe(logStream);

// Read data (note: data is a Buffer unless an encoding is set).
spawnedProcess.stdout.on('data', (data) => {
  console.log(data.toString());
});

// Handle exit.
spawnedProcess.on('exit', (c) => {
  console.log(`Process closed with code: ${c}`);
});

// Send something to the process (the process has to handle it).
spawnedProcess.stdin.write('some command or whatever\n');
It will differ widely depending on whether the C++ app is your own, so that you can implement this kind of communication handling in it, or not. There's still the possibility of writing some C++ "proxy" to make this kind of thing work, though.
If that doesn't work for you, then let's hope someone with a better idea will share a solution here in time.
I'm using FingerprintJS in a NuxtJS + Firebase project's Vuex store.
When I call that function on the client side I can get the visitor ID, but I can't get it if I use it on the server side, e.g. in nuxtServerInit.
const fpPromise = FingerprintJS.load();

const abc = (async () => {
  const fp = await fpPromise;
  const result = await fp.get();
  const visitorId = result.visitorId;
  return visitorId;
})();

abc.then(
  function (value) {
    state.visitorId = value;
  },
  function (error) {
    return error;
  }
);
Is there a solution to this?
From the NuxtJS documentation (about server rendering):
Because you are in a Node.js environment you have access to Node.js objects such as req and res. You do not have access to the window or document objects as they belong to the browser environment. You can however use window or document by using the beforeMount or mounted hooks.
FingerprintJS depends heavily (example here) on the browser (hence "browser fingerprinting"). That means it needs e.g. the window object, which is not available in the server-side rendering context.
I'm not very experienced with NuxtJS; however, according to the documentation, you should add your fingerprinting code to the .vue file like:
if (process.client) {
  require('external_library')
}
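Putting that together, a minimal sketch that runs the fingerprinting in the mounted hook, which only executes in the browser (the setVisitorId mutation name is an assumption):

export default {
  async mounted() {
    // Required here (client side only) because FingerprintJS needs `window`.
    const FingerprintJS = require('@fingerprintjs/fingerprintjs');
    const fp = await FingerprintJS.load();
    const { visitorId } = await fp.get();
    this.$store.commit('setVisitorId', visitorId); // assumed mutation name
  },
};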
Good luck!
I am having trouble consuming the response from my WebFlux server via JavaScript's new Streams API.
I can see via curl (with the help of --limit-rate) that the server is slowing down as expected, but when I try to consume the body in Google Chrome (64.0.3282.140), it is not slowing down like it should. In fact, Chrome downloads and buffers about 32 megabytes from the server even though only about 187 kB are passed to write().
Is there something wrong with my JavaScript?
async function fetchStream(url, consumer) {
  const response = await fetch(url, {
    headers: {
      "Accept": "application/stream+json"
    }
  });
  const decoder = new TextDecoder("utf-8");
  let buffer = "";
  await response.body.pipeTo(new WritableStream({
    async write(chunk) {
      // { stream: true } keeps multi-byte characters that span chunk
      // boundaries intact.
      buffer += decoder.decode(chunk, { stream: true });
      const blocks = buffer.split("\n");
      if (blocks.length === 1) {
        return;
      }
      const indexOfLastBlock = blocks.length - 1;
      for (let index = 0; index < indexOfLastBlock; index++) {
        const block = blocks[index];
        const item = JSON.parse(block);
        await consumer(item);
      }
      buffer = blocks[indexOfLastBlock];
    }
  }));
}
According to the specification for Streams:
If no strategy is supplied, the default behavior will be the same as a CountQueuingStrategy with a high water mark of 1.
So it should slow down if the promise returned by consumer(item) resolves very slowly, right?
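For reference, supplying that strategy explicitly inside fetchStream would look like the sketch below; it behaves the same as the default:

await response.body.pipeTo(new WritableStream(
  {
    async write(chunk) {
      // ... same consumer logic as above ...
    }
  },
  // Equivalent to the default: at most one chunk is queued before
  // backpressure is signalled to the underlying source.
  new CountQueuingStrategy({ highWaterMark: 1 })
));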
Looking at the backpressure support in the Streams API, it seems that backpressure information is communicated within the stream chain and not over the network. In this case, we can assume an unbounded queue somewhere, and this would explain the behavior you're seeing.
This other GitHub issue suggests that the backpressure information indeed stops at the TCP level - the browser just stops reading from the TCP socket, which, depending on the current TCP window size/TCP configuration, means the buffers will fill up and then TCP flow control kicks in. As that issue states, they can't set the window size manually and have to let the TCP stack handle things from there.
HTTP/2 supports flow control at the protocol level, but I don't know if the browser implementations leverage that with the Streams API.
I can't explain the behavior difference you're seeing, but I think you might be reading too much into the backpressure support here and that this works as expected according to the spec.