I have a complex data-processing pipeline using streams: a readable stream input, a writable stream output, and a series of transform streams (call them step1, step2, step3, and step4). step1, step3, and output are stateless, relying only on the incoming data chunks to produce their output, chunk for chunk. step2 and step4 are aggregation steps that collect data from multiple chunks to produce their output, and their outputs often overlap time-wise (e.g. chunk1, chunk3 and chunk5 might produce output1, while chunk2 and chunk4 produce output2, and so on).
Currently, the pipeline is structured as follows:
input.pipe(step1).pipe(step2).pipe(step3).pipe(step4).pipe(output);
This pipeline is very computationally expensive, so I'd like to split it across multiple instances, preferably running on multiple cores. Node.js streams preserve order: data chunks that come out of one step first get passed into the next step first, and that is a property I need to keep in whatever method I come up with for making this computation concurrent.
I'm definitely not asking for hand-holding, more if anyone has solved this problem before, and the general approach used for this kind of thing. I'm not really sure where to start.
Although I haven't been able to complete the order preservation, the streaming framework I support, scramjet, will get you really close to achieving your goal.
I'll nudge you here towards the best solution:
let seq = 0;
source.pipe(new DataStream())
    .map(data => ({data, itr: seq++}))  // mark your order
    .separate(x => x.itr % 8)           // separate into 8 streams
    .cluster((stream) => {              // spawn subprocesses
        // do your multi-threaded transforms here
    }, {threads: 8})
    .mux((a, b) => a.itr - b.itr)       // merge in the order above
At some point I will introduce reordering, but in order to keep the abstraction I can't take too many shortcuts; you, however, can take yours, such as accepting the 2^53 limit on the counter in the example above (seq will eventually run out of safe integer range to increment).
This should lead you towards some solution.
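As a side note, the reordering itself doesn't strictly need framework support. The following is only a minimal sketch, not part of scramjet, assuming each chunk carries the itr counter added above; it buffers early arrivals and releases chunks strictly in sequence:
const { Transform } = require('stream');

class Reorder extends Transform {
    constructor() {
        super({ objectMode: true });
        this.next = 0;            // next sequence number allowed out
        this.pending = new Map(); // itr -> chunk, for chunks that arrived early
    }
    _transform(chunk, _enc, done) {
        this.pending.set(chunk.itr, chunk);
        while (this.pending.has(this.next)) {
            this.push(this.pending.get(this.next));
            this.pending.delete(this.next);
            this.next++;
        }
        done();
    }
}
One caveat: if the parallel steps drop or merge chunks (as the aggregation steps in the question do), the sequence numbers have to be reassigned after those steps, otherwise a reorderer like this will wait forever on the gaps.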
Given an array with .length 100 containing elements with values 0 to 99 at the respective indexes, where the requirement is to find the element of the array equal to n: 51.
Why is using a loop to iterate from the start of the array to the end faster than iterating both from start to end and from end to start?
const arr = Array.from({length: 100}, (_, i) => i);
const n = 51;
const len = arr.length;
console.time("iterate from start");
for (let i = 0; i < len; i++) {
if (arr[i] === n) break;
}
console.timeEnd("iterate from start");
// second snippet (run separately from the first)
const arr = Array.from({length: 100}, (_, i) => i);
const n = 51;
const len = arr.length;
console.time("iterate from start and end");
for (let i = 0, k = len - 1; i < len && k >= 0; i++, k--) {
if (arr[i] === n || arr[k] === n) break;
}
console.timeEnd("iterate from start and end");
jsperf https://jsperf.com/iterate-from-start-iterate-from-start-and-end/1
The answer is pretty obvious:
More operations take more time.
When judging the speed of code, you look at how many operations it will perform. Just step through and count them. Every instruction takes one or more CPU cycles, and the more there are, the longer it will take to run. That different instructions take different numbers of cycles mostly does not matter - while an array lookup might be more costly than integer arithmetic, both of them basically take constant time, and if there are too many of them, that count dominates the cost of our algorithm.
In your example, there are a few different types of operations that you might want to count individually:
comparisons
increments/decrements
array lookup
conditional jumps
(We could be more granular, such as counting variable fetch and store operations, but those hardly matter - everything is in registers anyway - and their number is basically proportional to the others.)
Now both of your snippets iterate about 50 times - the element on which they break out of the loop is in the middle of the array. Ignoring off-by-a-few errors, those are the counts:
               | forwards | forwards and backwards
---------------+----------+------------------------
>=/===/<       | 100      | 200
++/--          | 50       | 100
a[b]           | 50       | 100
&&/||/if/for   | 100      | 200
Given that, it's not unexpected that doing twice the work takes considerably longer.
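If you want to verify those counts rather than take them on faith, here is a quick sketch of my own (same setup as the snippets in the question) that tallies the work instead of timing it:
const arr = Array.from({length: 100}, (_, i) => i);
const n = 51;
const len = arr.length;

function countForwards() {
    let comparisons = 0, updates = 0, lookups = 0;
    for (let i = 0; i < len; i++) {
        comparisons += 2; // i < len, arr[i] === n
        updates += 1;     // i++
        lookups += 1;     // arr[i]
        if (arr[i] === n) break;
    }
    return {comparisons, updates, lookups};
}

function countBothEnds() {
    let comparisons = 0, updates = 0, lookups = 0;
    for (let i = 0, k = len - 1; i < len && k >= 0; i++, k--) {
        comparisons += 4; // i < len, k >= 0, arr[i] === n, arr[k] === n
        updates += 2;     // i++, k--
        lookups += 2;     // arr[i], arr[k]
        if (arr[i] === n || arr[k] === n) break;
    }
    return {comparisons, updates, lookups};
}

console.log(countForwards());  // roughly 100 comparisons, 50 updates, 50 lookups
console.log(countBothEnds());  // roughly double of each, matching the table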
I'll also answer a few questions from your comments:
Is additional time needed for the second object lookup?
Yes, every individual lookup counts. It's not like they could be performed at once, or optimised into a single lookup (imaginable if they had looked up the same index).
Should there be two separate loops for each start to end and end to start?
Doesn't matter for the number of operations, just for their order.
Or, put differently still, what is the fastest approach to find an element in an array?
There is no "fastest" regarding the order, if you don't know where the element is (and they are evenly distributed) you have to try every index. Any order - even random ones - would work the same. Notice however that your code is strictly worse, as it looks at each index twice when the element is not found - it does not stop in the middle.
But still, there are a few different approaches at micro-optimising such a loop - check these benchmarks.
let is (still?) slower than var, see Why is using `let` inside a `for` loop so slow on Chrome? and Why is let slower than var in a for loop in nodejs?. This tear-up and tear-down (about 50 times) of the loop body scope in fact does dominate your runtime - that's why your inefficient code isn't completely twice as slow.
comparing against 0 is marginally faster than comparing against the length, which puts looping backwards at an advantage. See Why is iterating through an array backwards faster than forwards, JavaScript loop performance - Why is to decrement the iterator toward 0 faster than incrementing and Are loops really faster in reverse?
in general, see What's the fastest way to loop through an array in JavaScript?: it changes from engine update to engine update. Don't do anything weird, write idiomatic code, that's what will get optimised better.
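For completeness, the idiomatic way to write this particular search is the built-in lookup, which performs the same O(n) linear scan; a trivial sketch:
const arr = Array.from({length: 100}, (_, i) => i);
const index = arr.indexOf(51); // same linear scan, just written idiomatically
console.log(index);            // 51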
@Bergi is correct. More operations means more time. Why? More CPU clock cycles.
Time is really a reference to how many clock cycles it takes to execute the code.
In order to get to the nitty-gritty of that you need to look at the machine level code (like assembly level code) to find the true evidence. Each CPU (core?) clock cycle can execute one instruction, so how many instructions are you executing?
I haven't counted clock cycles in a long time, not since programming Motorola CPUs for embedded applications. If your code is taking longer, then it is in fact generating a larger set of machine-code instructions, even if the loop is shorter or runs an equal number of times.
Never forget that your code is actually getting compiled into a set of commands that the CPU is going to execute (memory pointers, instruction-code level pointers, interrupts, etc.). That is how computers work and its easier to understand at the micro controller level like an ARM or Motorola processor but the same is true for the sophisticated machines that we are running on today.
Your code simply does not run the way you write it (sounds crazy, right?). It is run as it is compiled into machine-level instructions (writing a compiler is no fun). Mathematical expressions and logic can be compiled into quite a heap of assembly and machine-level code, and that is up to how the compiler chooses to interpret it (bit shifting, etc. - remember binary mathematics, anyone?).
Reference:
https://software.intel.com/en-us/articles/introduction-to-x64-assembly
Your question is hard to answer, but as @Bergi stated, the more operations, the longer it takes. Why? The more clock cycles it takes to execute your code. Dual core, quad core, threading, assembly (machine language) - it is complex. But no code gets executed as you have written it. C++, C, Pascal, JavaScript, Java - unless you are writing in assembly (and even that compiles down to machine code) - none of it is the actual execution code, though assembly is closer to it.
Do a master's in CS and you will get to counting clock cycles and sorting times. You will likely build your own language framed on machine instruction sets.
Most people say who cares? Memory is cheap today and CPUs are screaming fast and getting faster.
But there are some critical applications where 10 ms matters, where an immediate interrupt is needed, etc.
Commerce, NASA, a Nuclear power plant, Defense Contractors, some robotics, you get the idea . . .
I vote let it ride and keep moving.
Cheers,
Wookie
Since the element you're looking for is always roughly in the middle of the array, you should expect the version that walks inward from both the start and end of the array to take about twice as long as one that just starts from the beginning.
Each variable update takes time and each comparison takes time, and you're doing twice as many of them per iteration. Since this version only terminates one or two iterations sooner than the one-ended loop, you should reason that it will cost about twice as much CPU time.
This strategy is still O(n) time complexity since it only looks at each item once, it's just specifically worse when the item is near the center of the list. If it's near the end, this approach will have a better expected runtime. Try looking for item 90 in both, for example.
Selected answer is excellent. I'd like to add another aspect: try findIndex(); in this benchmark it's 2-3 times faster than using loops:
const arr = Array.from({length: 900}, (_, i) => i);
const n = 51;
const len = arr.length;
console.time("iterate from start");
for (let i = 0; i < len; i++) {
if (arr[i] === n) break;
}
console.timeEnd("iterate from start");
console.time("iterate using findIndex");
var i = arr.findIndex(function(v) {
return v === n;
});
console.timeEnd("iterate using findIndex");
The other answers here cover the main reasons, but I think an interesting addition could be mentioning cache.
In general, sequentially accessing an array will be more efficient, particularly with large arrays. When your CPU reads an array from memory, it also fetches nearby memory locations into cache. This means that when you fetch element n, element n+1 is probably also loaded into cache. Now, caches are relatively big these days, so your 100-int array can probably fit comfortably in cache. However, on a much larger array, reading sequentially will be faster than switching between the beginning and the end of the array.
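You won't see this on a 100-element array, but here is a rough sketch of my own for measuring it on a much larger typed array. Caveat: walking inward from both ends is still two sequential streams, which hardware prefetchers handle well, so the gap on your machine may be modest:
const size = 10_000_000;
const big = new Float64Array(size);
for (let i = 0; i < size; i++) big[i] = i;

console.time('sequential');
let sumSeq = 0;
for (let i = 0; i < size; i++) sumSeq += big[i];
console.timeEnd('sequential');

console.time('from both ends');
let sumEnds = 0;
for (let i = 0, k = size - 1; i < k; i++, k--) sumEnds += big[i] + big[k];
console.timeEnd('from both ends');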
I am processing a stream of data pulled, usually, from RxNode.fromReadableStream(), where there is an intermediate step that is asynchronous and may only run with a certain concurrency, while at the same time the output step can only run serially, in discrete batches.
Basically a classic transformation pipeline:
cat file | parallel -P5 transform | xargs -n1000 process
With promises it would be something like (extremely simplified):
lines.on('readable', function read() {
    let line;
    while (line = lines.read()) {
        queue.push(line);
        if (queue.length >= 10) {
            // Promise.map here is Bluebird-style (map with a concurrency option);
            // in real code the batch would also be drained from `queue` before mapping.
            return Promise.map(queue, item => someAsyncTransformations(item), {concurrency: 5})
                .then(items => someFinalAsyncProcessing(items))
                .then(read);
        }
    }
});
By inserting the controller after the transforms, I can ensure that the transforms execute independently, giving me pipelining; and the combination of defers and merging seems to let me control asynchronicity and concurrency.
However, I'm using controlled() to make sure the map step effectively pauses processing until done. It looks really ugly — isn't there a better, non-manual way to implement lossless backpressure?
Also, I believe I have to use the same control trick with the transformers, otherwise the entire stream will be transformed in one go and blow memory usage. Is there a more idiomatic way?
Finally: the above actually doesn't work if the size of the input isn't divisible by the batch size. For some reason, if I call request() with more than the number of remaining items, the entire pipeline just quits; onComplete isn't even called. Actually, if I initially call request() with a large number, the entire pipeline repeats seemingly indefinitely with the same contents. Bug or intended behaviour?
Edit: I found I could use concatMap. Never mind.
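For anyone landing here later, a sketch of the concatMap idea in modern RxJS (version 7+; the question itself used RxJS 4 and rx-node, so treat the imports and the lines$ source as assumptions, not the asker's actual setup):
const { from, bufferCount, concatMap, mergeMap, toArray } = require('rxjs');

// lines$ is assumed to be an observable of lines, e.g. built from a readable stream.
lines$
    .pipe(
        bufferCount(10),                  // batches of 10; the last batch may be smaller
        concatMap(batch =>                // process one batch at a time, in order
            from(batch).pipe(
                mergeMap(line => someAsyncTransformations(line), 5), // at most 5 in flight
                toArray(),                // note: mergeMap emits in completion order
                concatMap(items => someFinalAsyncProcessing(items))
            )
        )
    )
    .subscribe();
Note that this serialises processing but does not pause the underlying readable stream the way controlled() does, so unprocessed lines still accumulate inside bufferCount.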
I am trying to figure out the best logic for grouping parameters within a certain tolerance. It's easier to explain with an example...
Task1: parameter1=140
Task2: parameter1=137
Task3: parameter1=142
Task4: parameter1=139
Task5: parameter1=143
If I want to group tasks if they are within 2 of each other, I think I need to do several passes. For the example, the desired result would be this:
Task4 covers Task1, Task2, and Task4
Task3 covers Task3 and Task5
There are multiple possibilities because Task1 could also cover Task3 and Task4, but then Task2 and Task5 would be two additional tasks by themselves. Basically, I want the fewest number of covering tasks, where each task is within 2 of the task that covers it.
I am currently trying to do this in Excel VBA but I may port the code to PHP later. I really just don't know where to start because it seems pretty complex.
You'll need a clustering algorithm, I'd assume. Consider the following parameters:
Task1: parameter1=140
Task2: parameter1=142
Task3: parameter1=144
Task4: parameter1=146
Task5: parameter1=148
Depending on your logic the clustering will get weird here. If you simply check each number for numbers near it, all of these will be clustered. But do 140 and 148 deserve to be in the same cluster? Try k-means clustering. There will be some gray area but the result will be relatively accurate.
http://en.wikipedia.org/wiki/K-means_clustering
You can group tasks in a single pass if you decide the group boundaries before looking at the tasks. Here's a simple example using buckets of width 4, based on your goal to group tasks within +/-2 of each other:
Dim bucket As Integer
For Each parameter In parameters
    bucket = Round(parameter / 4, 0)
    ' ... do something now that you know which bucket the task is in
Next parameter
If the groups provided by fixed buckets don't fit the data closely enough for your needs, you will need to use an algorithm that makes multiple passes. Since the data in your example is one-dimensional, you can (and should!) use simpler techniques than k-means clustering.
A good next place to look might be Literate Jenks Natural Breaks and How The Idea Of Code is Lost, with a very well commented Jenks Natural Breaks Optimization in JavaScript.
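If fixed buckets are too coarse but the data is one-dimensional, as here, a single sorted greedy pass gives the minimum number of group centres with every value within the tolerance of its centre. A sketch in JavaScript (my own, since the question mentions porting anyway; the logic translates directly to VBA or PHP):
// Greedy 1-D covering: sort, then place a centre `tolerance` above the first
// uncovered value, which covers everything up to that value + 2 * tolerance.
function groupWithinTolerance(values, tolerance = 2) {
    const sorted = [...values].sort((a, b) => a - b);
    const groups = [];
    let current = null;
    for (const v of sorted) {
        if (current === null || v > current.center + tolerance) {
            current = { center: v + tolerance, members: [] }; // start a new group at v
            groups.push(current);
        }
        current.members.push(v);
    }
    return groups;
}

// Example from the question: 140, 137, 142, 139, 143
console.log(groupWithinTolerance([140, 137, 142, 139, 143]));
// -> one group covering 137-141 (137, 139, 140) and one covering 142-146 (142, 143)
If the centre has to be one of the actual tasks, pick the largest task value that is still within the tolerance of the first uncovered value instead of value + tolerance.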
I have created a puzzle which is a derivative of the travelling salesman problem, which I call Trace Perfect.
It is essentially an undirected graph with weighted edges. The goal is to traverse every edge at least once in any direction using minimal weight (unlike classical TSP where the goal is to visit every vertex using minimal weight).
As a final twist, an edge is assigned two weights, one for each direction of traversal.
I create a new puzzle instance everyday and publish it through a JSON interface.
Now I know TSP is NP-hard. But my puzzles typically have only a good handful of edges and vertices. After all they need to be humanly solvable. So a brute force with basic optimization might be good enough.
I would like to develop some (Javascript?) code that retrieves the puzzle from the server, and solves with an algorithm in a reasonable amount of time. Additionally, it may even post the solution to the server to be registered in the leader board.
I have written a basic brute force solver for it in Java using my back-end Java model on the server, but the code is too fat and runs out of heap-space quick, as expected.
Is a Javascript solver possible and feasible?
The JSON API is simple. You can find it at: http://service.traceperfect.com/api/stov?pdate=20110218 where pdate is the date for the puzzle in yyyyMMdd format.
Basically a puzzle has many lines. Each line has two vertices (A and B). Each line has two weights (timeA for when traversing A -> B, and timeB for when traversing B -> A). And this should be all you need to construct a graph data structure. All other properties in the JSON objects are for visual purposes.
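As a starting point for would-be solvers, here is a rough sketch of pulling a puzzle and turning it into a graph structure. The field names beyond timeA/timeB (lines, vertexA, vertexB) are my guesses from the description above, so inspect the real JSON before relying on them:
const url = 'http://service.traceperfect.com/api/stov?pdate=20110218';

// fetch() is available natively in browsers and in Node 18+.
fetch(url)
    .then(res => res.json())
    .then(puzzle => {
        const graph = new Map(); // vertex -> outgoing weighted edges
        for (const line of puzzle.lines ?? []) {
            const { vertexA: a, vertexB: b, timeA, timeB } = line;
            if (!graph.has(a)) graph.set(a, []);
            if (!graph.has(b)) graph.set(b, []);
            graph.get(a).push({ to: b, time: timeA, line }); // A -> B costs timeA
            graph.get(b).push({ to: a, time: timeB, line }); // B -> A costs timeB
        }
        console.log(`${graph.size} vertices, ${(puzzle.lines ?? []).length} edges`);
    });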
If you want to become familiar with the puzzle, you can play it through a flash client at http://www.TracePerfect.com/
If anyone is interested in implementing a solver for themselves, then I will post detail about the API for submitting the solution to the server, which is also very simple.
Thank you for reading this longish post. I look forward to hearing your thoughts about this one.
If you are running out of heap space in Java, then you are solving it wrong.
The standard way to solve something like this is to do a breadth-first search, and filter out duplicates. For that you need three data structures. The first is your graph. The next is a queue named todo of "states" for work you have left to do. And the last is a hash that maps the possible "state" you are in to the pair (cost, last state).
In this case a "state" is the pair (current node, set of edges already traversed).
Assuming that you have those data structures, here is pseudocode for a full algorithm that should solve this problem fairly efficiently.
foreach possible starting_point:
    new_state = state(starting_point, {no edges visited})
    todo.add(new_state)
    seen[new_state] = (0, null)

while todo.workleft():
    this_state = todo.get()
    (cost, prev_state) = seen[this_state]
    edges = this_state.edges_traversed()
    foreach directed_edge in graph.directededges(this_state.current_node()):
        new_cost = cost + directed_edge.cost()
        new_visited = directed_edge.to()
        new_edges = edges + directed_edge.edge()
        new_state = state(new_visited, new_edges)
        if not exists seen[new_state] or new_cost < seen[new_state][0]:
            seen[new_state] = (new_cost, this_state)
            todo.add(new_state)

best_cost = infinity
full_edges = {all possible edges}
best_state = null
foreach possible location:
    end_state = state(location, full_edges)
    (cost, last_move) = seen[end_state]
    if cost < best_cost:
        best_state = end_state
        best_cost = cost

# Now trace back the final answer.
path_in_reverse = []
current_state = best_state
while current_state.edges_traversed() is not empty:
    previous_state = seen[current_state][1]
    path_in_reverse.push(edge from previous_state.current_node() to current_state.current_node())
    current_state = previous_state
And now reverse(path_in_reverse) gives you your optimal path.
Note that the hash seen is critical. It is what prevents you from getting into endless loops.
Looking at today's puzzle, this algorithm will have a maximum of a million or so states that you need to figure out. (There are 2**16 possible sets of edges, and 14 possible nodes you could be at.) That is likely to fit into RAM. But most of your nodes only have 2 edges connected. I would strongly advise collapsing those. This will reduce you to 4 nodes and 6 edges, for an upper limit of 256 states. (Not all are possible, and note that multiple edges now connect two nodes.) This should be able to run very quickly with little use of memory.
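To make the state-space idea concrete, here is a rough JavaScript sketch of the seen/todo search described above (not the answerer's code). It assumes edges are given as objects {a, b, timeA, timeB}, matching the weights described in the question, with at most 31 edges so the visited set fits in a bitmask; the property names a and b are illustrative:
function solve(edges, nodes) {
    const key = (node, mask) => node + ':' + mask;
    const fullMask = (1 << edges.length) - 1;
    const seen = new Map(); // state key -> { cost, prev, node, mask }
    const todo = [];

    for (const start of nodes) {
        seen.set(key(start, 0), { cost: 0, prev: null, node: start, mask: 0 });
        todo.push({ node: start, mask: 0 });
    }

    while (todo.length) {
        const { node, mask } = todo.shift();
        const { cost } = seen.get(key(node, mask));
        edges.forEach((e, i) => {
            // Each undirected edge can be traversed either way, with its own weight.
            const moves = [];
            if (e.a === node) moves.push({ to: e.b, w: e.timeA });
            if (e.b === node) moves.push({ to: e.a, w: e.timeB });
            for (const m of moves) {
                const newMask = mask | (1 << i);
                const newCost = cost + m.w;
                const k = key(m.to, newMask);
                if (!seen.has(k) || newCost < seen.get(k).cost) {
                    seen.set(k, { cost: newCost, prev: key(node, mask), node: m.to, mask: newMask });
                    todo.push({ node: m.to, mask: newMask });
                }
            }
        });
    }

    // Best end state: every edge traversed, ending at any node.
    let best = null;
    for (const n of nodes) {
        const s = seen.get(key(n, fullMask));
        if (s && (!best || s.cost < best.cost)) best = s;
    }
    return best; // follow .prev keys back through `seen` to reconstruct the path
}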
For most parts of the graph you can apply the idea of the Seven Bridges of Königsberg: http://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg
This way you can determine the number of edges that you will have to traverse twice in order to solve it.
To begin with, you should not start at nodes that have short edges over which you will have to travel twice.
If I summarize:
start at a node with an odd number of edges.
do not travel more than once over edges that sit on an even-degree node.
use the shortest path to travel from one odd node to another.
A simple recursive brute-force solver with this heuristic might be a good way to start.
Or another way:
Try to find the shortest edges such that, if you remove them from the graph, the remaining graph has only two odd-degree nodes and can therefore be solved like the Königsberg bridges. The solution is then to traverse this reduced graph without lifting the pencil, and whenever you hit a node with a "removed" edge you just go back and forth over it.
On your Java back-end you might be able to use this TSP code (work in progress), which uses Drools Planner (open source, Java).
That is, would I be better suited to use some kind of tree or skip list data structure if I need to be calling this function a lot for individual array insertions?
You might consider whether you want to use an object instead; all JavaScript objects (including Array instances) are (highly-optimized) sets of key/value pairs with an optional prototype. An implementation should (note I don't say "does") have a hashing algorithm with reasonable performance. (Update: That was in 2010. Here in 2018, objects are highly optimized on all significant JavaScript engines.)
Aside from that, the performance of splice is going to vary a lot between implementations (e.g., vendors). This is one reason why "don't optimize prematurely" is even more appropriate advice for JavaScript applications that will run in multiple vendor implementations (web apps, for instance) than it is even for normal programming. Keep your code well modularized and address performance issues if and when they occur.
Here's a good rule of thumb, based on tests done in Chrome, Safari and Firefox: Splicing a single value into the middle of an array is roughly half as fast as pushing/shifting a value to one end of the array. (Note: Only tested on an array of size 10,000.)
http://jsperf.com/splicing-a-single-value
That's pretty fast. So, it's unlikely that you need to go so far as to implement another data structure in order to squeeze more performance out.
Update: As eBusiness points out in the comments below, the test performs an expensive copy operation along with each splice, push, and shift, which means that it understates the difference in performance. Here's a revised test that avoids the array copying, so it should be much more accurate: http://jsperf.com/splicing-a-single-value/19
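If you want a rough local number for your own environment instead of relying on the hosted benchmark, here is a quick sketch along the same lines as the revised test (avoiding a fresh array copy per operation):
const N = 10000;
const base = Array.from({ length: N }, (_, i) => i);

console.time('push onto end');
const a = base.slice();
for (let i = 0; i < 1000; i++) a.push(-1);
console.timeEnd('push onto end');

console.time('splice into middle');
const b = base.slice();
for (let i = 0; i < 1000; i++) b.splice(b.length >> 1, 0, -1);
console.timeEnd('splice into middle');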
Move single value
// tmp = arr[1][i];
// arr[1].splice(i, 1);           // splice is slow in FF
// arr[1].splice(end0_1, 0, tmp);

// Shift-based alternative: move the element at index i up to index end0_1
// by sliding everything in between one slot to the left.
tmp = arr[1][i];
ii = i;
while (ii < end0_1)
{
    arr[1][ii] = arr[1][ii + 1];
    ii++;
    cycles++; // bookkeeping counter from the original benchmark
}
arr[1][end0_1] = tmp;
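For reference, here is the same idea as a self-contained helper (my naming, not the original poster's), which moves one element rightwards without splice:
// Moves the element at index `from` to index `to` (from <= to) by shifting the
// elements in between one slot to the left.
function moveRight(a, from, to) {
    const tmp = a[from];
    for (let i = from; i < to; i++) {
        a[i] = a[i + 1];
    }
    a[to] = tmp;
    return a;
}

console.log(moveRight([0, 1, 2, 3, 4], 1, 3)); // [0, 2, 3, 1, 4]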