I am writing a javascript HTML5 phonegap application.
I am using SQLite to store data.
For the sake of performance, I am performing database inserts asynchronously, i.e. I do not wait for the callback of the previous database operation before starting the next one.
Internally, I believe JavaScript is creating a worker for each operation, hence multi-threading, so to speak.
The problem now is: how do I know that all workers have completed their tasks, e.g. in order to tell the user that all data has been saved?
If I understand your request correctly, you are queueing up DB inserts to run asynchronously and you want to be able to check back at a later time to see if all the requests are finished. I would do something like this:
function asyncTask() {
    //
    // Do real work here
    //
    runningTasks--;
}
// In your init section, set up a global variable to track the number of tasks
var runningTasks = 0;
// When you need to create a new task, increment the counter
runningTasks++;
setTimeout(asyncTask, 1);
// Later, check whether anything is still running
if (runningTasks > 0) {
    // something still running
} else {
    // all tasks are done
}
In another language you would need to worry about race conditions when testing and setting the runningTasks variable, but AFAIK JavaScript runs your code on a single thread, so you don't need to worry about that.
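Instead of checking the counter by hand, the counter itself can report when it drops back to zero. This is a minimal sketch, assuming each insert accepts a completion callback; createTaskTracker, fakeInsert and the log message are hypothetical names for illustration and are not part of PhoneGap or SQLite:

```javascript
// Hypothetical helper: counts pending tasks and calls onAllDone
// when the count returns to zero.
function createTaskTracker(onAllDone) {
  var pending = 0;
  return {
    // Call before starting an async operation. Returns the function
    // that must be called when that operation finishes.
    start: function () {
      pending++;
      var finished = false;
      return function done() {
        if (finished) return; // ignore a callback that fires twice
        finished = true;
        pending--;
        if (pending === 0) onAllDone();
      };
    },
    pendingCount: function () { return pending; }
  };
}

// Usage with a stand-in for an asynchronous database insert.
var tracker = createTaskTracker(function () {
  console.log("All data has been saved");
});

function fakeInsert(row, callback) {
  setTimeout(callback, 1); // stands in for the real database call
}

[1, 2, 3].forEach(function (row) {
  fakeInsert(row, tracker.start());
});
```

This only works if all tasks are started in the same synchronous block, as in the loop above. If tasks are started in several batches, the count can reach zero between batches and onAllDone would fire early.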
I would like to know if it is possible to detect that a Cloud Function is already running, and if possible to also detect whether it is running on a particular ID's data. I think I could store the ID that the function is working on in a variable in Firebase memory (taken from the Firebase Database) and remove the variable when the function is done running. My concern is two writes to the database happening in rapid succession, so that the initial thread cannot write to memory fast enough before the second thread checks whether the variable is there. This seems especially likely on a cold start, which in my understanding takes a variable amount of time, so either thread could potentially spin up first.
My use case is this:
Let's say a write to the Realtime Database happens from the client side and triggers a Cloud Functions handler. This handler's job is to loop through the snapshot of records that was just written by the client, do work with each record, and delete the records when it is done. The handler works great until another record is written to the same group of records before the handler's job is done. That causes a second handler thread to spin up and start moving through the same group of records, so records would be iterated over twice and the data possibly handled twice.
I have other solutions for my particular case which I can use instead, but it involves just allowing each record to trigger a separate thread like normal.
Thanks in advance!
There is no way to track running instances "in-memory" for Cloud Functions, as each function invocation may be running in entirely different virtual infra. Instead, what you'd most likely want to do here is have some kind of lock persisted in e.g. the Firebase Realtime Database using a transaction. So you'd do something like:
When the function invocation starts, generate a random "worker ID".
Run a transaction to check a DB path derived from the file you're processing. If it's empty, or populated with a timestamp that is older than a function timeout, write your worker ID and the current timestamp to the location. If it's not empty or the timestamp is fresh, exit your function immediately because there's already an active worker.
Do your file processing.
Run another transaction that deletes the lock from the DB if the worker ID in the DB still matches your worker ID.
This will prevent two functions from processing the same file at the same time. It will mean, however, that any functions that execute while a path is locked will be discarded (which may or may not be what you want).
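The decision logic in the steps above can be sketched as two pure update functions. This assumes a lock record of the shape { workerId, timestamp } and a 60 second timeout; both are illustrative choices, as are the function names. With the Realtime Database SDK, functions of this shape could be passed to a reference's transaction() call, where returning undefined aborts the transaction and returning null deletes the value:

```javascript
// Assumption: set this to the configured timeout of your function.
var FUNCTION_TIMEOUT_MS = 60 * 1000;

// Generates a random worker ID for this invocation.
function newWorkerId() {
  return Math.random().toString(36).slice(2);
}

// Update function for the "acquire" transaction.
// current is the stored lock record, or null if there is none.
// Returns the new lock record, or undefined to leave the value untouched.
function acquireLock(current, workerId, now) {
  var isStale = current !== null &&
    (now - current.timestamp) > FUNCTION_TIMEOUT_MS;
  if (current === null || isStale) {
    return { workerId: workerId, timestamp: now };
  }
  return undefined; // another worker holds a fresh lock
}

// Update function for the "release" transaction.
// Returns null (delete the lock) only if this worker still owns it.
function releaseLock(current, workerId) {
  if (current !== null && current.workerId === workerId) {
    return null;
  }
  return undefined;
}
```

Keeping the decision separate from the database call makes the stale-lock rule easy to test without a database. The calling code would still need to check whether the acquire transaction actually committed before starting any work.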
Let's assume I run this piece of code.
var score = 0;
for (var i = 0; i < arbitrary_length; i++) {
async_task(i, function() { score++; }); // increment callback function
}
In theory I understand that this presents a data race: two threads trying to increment at the same time may result in a single increment. However, Node.js (and JavaScript) are known to be single-threaded. Am I guaranteed that the final value of score will be equal to arbitrary_length?
Am I guaranteed that the final value of score will be equal to arbitrary_length?
Yes, as long as all async_task() calls call the callback once and only once, you are guaranteed that the final value of score will be equal to arbitrary_length.
It is the single-threaded nature of Javascript that guarantees that there are never two pieces of Javascript running at the exact same time. Instead, because of the event driven nature of Javascript in both browsers and node.js, one piece of JS runs to completion, then the next event is pulled from the event queue and that triggers a callback which will also run to completion.
There is no such thing as interrupt driven Javascript (where some callback might interrupt some other piece of Javascript that is currently running). Everything is serialized through the event queue. This is an enormous simplification and prevents a lot of sticky situations that would otherwise be a lot of work to program safely when you have either multiple threads running concurrently or interrupt driven code.
There still are some concurrency issues to be concerned about, but they have more to do with shared state that multiple asynchronous callbacks can all access. While only one will ever be accessing it at any given time, it is still possible that a piece of code that contains several asynchronous operations could leave some state in an "in between" state while it was in the middle of several async operations at a point where some other async operation could run and could attempt to access that data.
You can read more about the event driven nature of Javascript here: How does JavaScript handle AJAX responses in the background? and that answer also contains a number of other references.
And another similar answer that discusses the kind of shared data race conditions that are possible: Can this code cause a race condition in socket io?
Some other references:
how do I prevent event handlers to handle multiple events at once in javascript?
Do I need to be concerned with race conditions with asynchronous Javascript?
JavaScript - When exactly does the call stack become "empty"?
Node.js server with multiple concurrent requests, how does it work?
To give you an idea of the concurrency issues that can happen in Javascript (even without threads and without interrupts), here's an example from my own code.
I have a Raspberry Pi node.js server that controls the attic fans in my house. Every 10 seconds it checks two temperature probes, one inside the attic and one outside the house and decides how it should control the fans (via relays). It also records temperature data that can be presented in charts. Once an hour, it saves the latest temperature data that was collected in memory to some files for persistence in case of power outage or server crash. That saving operation involves a series of async file writes. Each one of those async writes yields control back to the system and then continues when the async callback is called signaling completion. Because this is a low memory system and the data can potentially occupy a significant portion of the available RAM, the data is not copied in memory before writing (that's simply not practical). So, I'm writing the live in-memory data to disk.
At any time during any of these async file I/O operations, while waiting for a callback to signify completion of the many file writes involved, one of my timers in the server could fire, I'd collect a new set of temperature data and that would attempt to modify the in-memory data set that I'm in the middle of writing. That's a concurrency issue waiting to happen. If it changes the data while I've written part of it and am waiting for that write to finish before writing the rest, then the data that gets written can easily end up corrupted because I will have written out one part of the data, the data will have gotten modified from underneath me and then I will attempt to write out more data without realizing it's been changed. That's a concurrency issue.
I actually have a console.log() statement that explicitly logs when this concurrency issue occurs on my server (and is handled safely by my code). It happens once every few days on my server. I know it's there and it's real.
There are many ways to work around those types of concurrency issues. The simplest would have been to just make a copy in memory of all the data and then write out the copy. Because there are not threads or interrupts, making a copy in memory would be safe from concurrency (there would be no yielding to async operations in the middle of the copy to create a concurrency issue). But, that wasn't practical in this case. So, I implemented a queue. Whenever I start writing, I set a flag on the object that manages the data. Then, anytime the system wants to add or modify data in the stored data while that flag is set, those changes just go into a queue. The actual data is not touched while that flag is set. When the data has been safely written to disk, the flag is reset and the queued items are processed. Any concurrency issue was safely avoided.
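The flag-and-queue idea described above can be sketched as follows. This is not the actual server code; createStore, writeChunk and the sample readings are hypothetical, and the asynchronous file write is replaced by whatever writeChunk function is passed in:

```javascript
// Hypothetical data manager: changes that arrive while a save is in
// progress are queued instead of being applied to the live data.
function createStore(writeChunk) {
  var data = [];
  var saving = false; // the flag
  var queued = [];    // changes deferred while the flag is set

  function add(item) {
    if (saving) {
      queued.push(item); // do not touch live data during a save
    } else {
      data.push(item);
    }
  }

  // Writes every item with a separate async call, yielding control
  // back to the event loop between writes.
  function save(done) {
    if (saving) throw new Error("save already in progress");
    saving = true;
    var index = 0;
    function next() {
      if (index < data.length) {
        writeChunk(data[index++], next);
        return;
      }
      saving = false;
      // Apply everything that arrived while the save was running.
      queued.splice(0).forEach(function (item) { add(item); });
      done();
    }
    next();
  }

  return {
    add: add,
    save: save,
    snapshot: function () { return data.slice(); }
  };
}

// Usage: the second reading arrives mid-save, so it is queued and
// only applied once the save has finished.
var demoStore = createStore(function (item, callback) {
  setTimeout(callback, 1); // stands in for an async file write
});
demoStore.add("attic 31.5C");
demoStore.save(function () {
  console.log(demoStore.snapshot());
});
demoStore.add("attic 31.7C");
```

A plain boolean is enough here because nothing can run between checking the flag and acting on it; control is only given up at the writeChunk calls.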
So, this is an example of concurrency issues that you do have to be concerned about. One great simplifying assumption with Javascript is that a piece of Javascript will run to completion without any threat of getting interrupted as long as it doesn't purposely return control back to the system. That makes handling concurrency issues like the one described above much easier because your code will never be interrupted except when you consciously yield control back to the system. This is why we don't need mutexes and semaphores and other things like that in our own Javascript. We can use simple flags (just a regular Javascript variable) like I described above if needed.
In any entirely synchronous piece of Javascript, you will never be interrupted by other Javascript. A synchronous piece of Javascript will run to completion before the next event in the event queue is processed. This is what is meant by Javascript being an "event-driven" language. As an example of this, if you had this code:
console.log("A");
// schedule timer for 500 ms from now
setTimeout(function() {
console.log("B");
}, 500);
console.log("C");
// spin for 1000ms
var start = Date.now();
while (Date.now() - start < 1000) {}
console.log("D");
You would get the following in the console:
A
C
D
B
The timer event cannot be processed until the current piece of Javascript runs to completion, even though it was likely added to the event queue sooner than that. The way the JS interpreter works is that it runs the current JS until it returns control back to the system and then (and only then), it fetches the next event from the event queue and calls the callback associated with that event.
Here's the sequence of events under the covers.
This JS starts running.
console.log("A") is output.
A timer event is scheduled for 500ms from now. The timer subsystem uses native code.
console.log("C") is output.
The code enters the spin loop.
At some point in time part-way through the spin loop the previously set timer is ready to fire. It is up to the interpreter implementation to decide exactly how this works, but the end result is that a timer event is inserted into the Javascript event queue.
The spin loop finishes.
console.log("D") is output.
This piece of Javascript finishes and returns control back to the system.
The Javascript interpreter sees that the current piece of Javascript is done so it checks the event queue to see if there are any pending events waiting to run. It finds the timer event and a callback associated with that event and calls that callback (starting a new block of JS execution). That code starts running and console.log("B") is output.
That setTimeout() callback finishes execution and the interpreter again checks the event queue to see if there are any other events that are ready to run.
Node uses an event loop. You can think of this as a queue. So we can assume that your for loop puts the function() { score++; } callback on this queue arbitrary_length times. After that, the JS engine runs these one by one and increases score each time. So yes. The only exceptions are if a callback is not called or if the score variable is accessed from somewhere else.
Actually, you can use this pattern to run tasks in parallel, collect the results, and call a single callback when every task is done.
var results = [];
for (var i = 0; i < arbitrary_length; i++) {
async_task(i, function(result) {
results.push(result);
if (results.length == arbitrary_length)
tasksDone(results);
});
}
No two invocations of the function can happen at the same time (because node is single threaded), so that will not be a problem. The only problem would be if in some cases async_task(..) drops the callback. But if, e.g., 'async_task(..)' was just calling setTimeout(..) with the given function, then yes, each call will execute, they will never collide with each other, and 'score' will have the value expected, 'arbitrary_length', at the end.
Of course, the 'arbitrary_length' can't be so great as to exhaust memory, or overflow whatever collection is holding these callbacks. There is no threading issue however.
I do think it's worth noting for others who view this: there is a common mistake lurking in code like this. If the callback itself reads the variable i, you either need to declare i with let or copy it into another variable first; with var, every callback will see the last value of i. Passing i as an argument to async_task(), as the code above does, is safe because the argument is evaluated at call time.
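To make the capture behavior concrete: the problem arises when a callback reads the loop variable itself, while a value passed as an argument is evaluated at call time. The sketch below compares the three cases; asyncTask is a hypothetical stand-in built on setTimeout:

```javascript
// With var, all callbacks share one i, which is 3 by the time they run.
var seenWithVar = [];
for (var i = 0; i < 3; i++) {
  setTimeout(function () { seenWithVar.push(i); }, 0);
}

// With let, each iteration gets its own binding of j.
var seenWithLet = [];
for (let j = 0; j < 3; j++) {
  setTimeout(function () { seenWithLet.push(j); }, 0);
}

// An argument is evaluated when the call is made, so passing the loop
// variable as a parameter is safe even with var.
var seenAsArgument = [];
function asyncTask(index, callback) {
  setTimeout(function () { callback(index); }, 0);
}
for (var k = 0; k < 3; k++) {
  asyncTask(k, function (index) { seenAsArgument.push(index); });
}

setTimeout(function () {
  console.log(seenWithVar);    // values 3, 3, 3
  console.log(seenWithLet);    // values 0, 1, 2
  console.log(seenAsArgument); // values 0, 1, 2
}, 10);
```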
This is my first question here, but I really don't know where to go. I cannot find anything that helps me on Google.
I'm doing heavy processing on the server side and I would like to keep track of its state and show it on the client side.
For that purpose I have a variable that I update as the process runs. To keep track of it, I'm using this on the client side:
Template.importJson.onCreated(function () {
Session.set('import_datas', null);
this.autorun(function(){
Meteor.call('readImportState', function(err, response) {
console.log(response);
if (response !== undefined) {
Session.set('importingMessage',response);
}
});
})
});
I'm reading it from the template this way (in template.mytemplate.helpers):
readImportState: function() {
return Session.get('importingMessage');
},
And here is the server side code to be called by meteor.call:
readImportState: function() {
console.log(IMPORT_STATE);
return IMPORT_STATE;
}
The client grabs the value at the start, but it is never updated later.
What am I missing here?
If somebody could point me in the right direction, that would be awesome.
Thank you :)
TL;DR
As of this writing, the only easy way to share reactive state between the server and the client is to use the publish/subscribe mechanism. Other solutions will be like fighting an uphill battle.
In-memory State
Here's the (incorrect) solution you are looking for:
1. When the job starts, write to some in-memory state on the server. This probably looks like a global or file scoped variable like jobStates, where jobStates is an object with user ids as its keys, and state strings as its values.
2. The client should periodically poll the server for the current state. Note an autorun doesn't work for Meteor.call (there is no reactive state forcing the autorun to execute again) - you'd need to actually poll every N seconds via setInterval.
3. When the job completes, modify jobStates.
4. When the client sees a completed state, inform the user and cancel the setInterval.
5. Because the server could restart for any number of reasons while the job is running (and consequently forget its in-memory state), we'll need to build in some fault tolerance for both the state and the job itself. Write the job state to the database whenever it changes. When the server starts, we'll read this state back into jobStates.
6. The model above assumes only a single server is running. If there exist multiple server instances, each one will need to observe the collection in order to write to its own jobStates. Alternatively, the method from (2) should just read the database instead of actually keeping jobStates in memory.
This approach is complicated and error prone. Furthermore, it requires writing the state to the database anyway in order to handle restarts and multiple server instances.
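If you do go the polling route despite its drawbacks, the polling step could look roughly like this. It is a generic sketch: pollState and its parameters are hypothetical names, and the server is faked with an array. In the question's code, fetchState would wrap Meteor.call('readImportState', callback) and onUpdate would call Session.set('importingMessage', state):

```javascript
// Hypothetical poller: calls fetchState every intervalMs until
// isDone(state) returns true. Returns a function that stops polling.
function pollState(fetchState, isDone, onUpdate, intervalMs) {
  var stopped = false;
  var timer = setInterval(function () {
    fetchState(function (err, state) {
      // Ignore replies that arrive after stopping; on errors, just
      // wait for the next tick.
      if (stopped || err) return;
      onUpdate(state);
      if (isDone(state)) stop();
    });
  }, intervalMs);

  function stop() {
    stopped = true;
    clearInterval(timer);
  }
  return stop;
}

// Usage with a fake server that reports three successive states.
var fakeStates = ["reading file", "importing", "done"];
pollState(
  function (callback) { callback(null, fakeStates.shift()); },
  function (state) { return state === "done"; },
  function (state) { console.log("import state:", state); },
  5
);
```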
Publish/Subscribe
As the job state changes, write the current state to the database. This could be to a separate collection just for job states, or it could be a collection with all the metadata used to execute the job (helpful for fault tolerance), or it could be to the document the job is producing (if any).
Publish the necessary document(s) to the client.
Subscribe for the document(s) on the client and use a simple find or findOne in a template to display the state to the user.
Optional: clean up the state document(s) periodically using something like synced cron.
As you can see, the publish/subscribe mechanism is considerably easier to implement because most of the work is done for you by Meteor.
Our API needs to send data to Zapier if some specific data was modified in our DB.
For example, we have a company table and if the name or the address field was modified, we trigger the Zapier hook.
Sometimes our API receives multiple change requests within a few minutes, but we don't want to trigger the Zapier hook multiple times (since it is quite expensive), so on each modify request we call setTimeout() with a 5000 ms delay, overwriting any existing timeout.
It works fine, and there are no multiple Zapier hook calls even if we get a lot of modify requests from clients within this 5000 ms period.
Now, since our traffic is growing, we'd like to set up multiple Node.js instances behind a load balancer.
But in this case the different Node.js instances cannot share, and overwrite, the same setTimeout instance, which would cause a lot of useless Zapier calls.
Could you help us solve this problem while remaining scalable?
If you want to keep a state between separate instances you should consider, from an infrastructure point of view, some locking mechanism such as Redis.
Whenever you want to run the Zapier call and no lock is active, you set one on Redis. All other calls won't be triggered while it is locked. When the setTimeout callback runs, you release the lock.
Beware that Redis might become a SPOF, I don't know where you are hosting your services, but that might be an important point to consider.
Edit:
The lock on Redis might hold a reference to the last piece of info you want to update. So on the first request you store the data to be saved on Redis, wait 5 seconds, and update. If any modifications are made in that time frame, they are stored on Redis too. That way you'll only update at 5 second intervals. You'll need to add some extra logic here though. Example:
function zapierUpdate(data) {
if (isLocked()) {
// Locked! We will update the data that needs to be saved on the
// next setTimeout callback
updateLockData(data);
} else {
// First lock and save data.
lock(data);
// and update in 5 seconds
setTimeout(function(){
// getLockData fetches the data on Redis and releases the lock
var newData = getLockData();
// Update the latest data that might have been updated.
callZapierNow(newData);
},5000);
}
}
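The helper functions in the example above are left undefined. The sketch below fills them in against an in-memory object that stands in for Redis, so the flow can be run locally; createSharedDebounce, createMemoryStore and the method names setIfAbsent, set and take are hypothetical. With a real Redis server the check-and-set would need to be a single atomic command (for example SET with the NX option), since separate instances could otherwise both believe they acquired the lock:

```javascript
// Returns an update function. The first caller acquires the lock and
// schedules the send; later callers only overwrite the stored data.
function createSharedDebounce(store, key, send, delayMs) {
  return function update(data) {
    var acquired = store.setIfAbsent(key, data);
    if (!acquired) {
      store.set(key, data); // lock is held: record the latest data
      return;
    }
    setTimeout(function () {
      // Read the latest data and release the lock in one step.
      var latest = store.take(key);
      send(latest);
    }, delayMs);
  };
}

// In-memory stand-in for the shared store (assumption: a real
// deployment would back these three operations with Redis).
function createMemoryStore() {
  var map = new Map();
  return {
    setIfAbsent: function (k, v) {
      if (map.has(k)) return false;
      map.set(k, v);
      return true;
    },
    set: function (k, v) { map.set(k, v); },
    take: function (k) {
      var v = map.get(k);
      map.delete(k);
      return v;
    }
  };
}

// Usage: two "instances" share one store, so only one send happens.
var sharedStore = createMemoryStore();
function callZapierNow(data) { console.log("sending to Zapier:", data); }
var instanceA = createSharedDebounce(sharedStore, "company:1", callZapierNow, 50);
var instanceB = createSharedDebounce(sharedStore, "company:1", callZapierNow, 50);
instanceA({ name: "Acme" });
instanceB({ name: "Acme Ltd" });
```

In a distributed setup, an expiry on the lock key would also be worth considering, so that a crashed instance cannot leave the lock held forever.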
My app's framework is built around collapsing backbone models sending the data via websockets and updating models on other clients with the data. My question is how should I batch these updates for times when an action triggers 5 changes in a row.
The syncing method is set up to update on any change but if I set 5 items at the same time I don't want it to fire 5 times in a row.
I was thinking I could do a setTimeout on any sync that gets cleared if something else tries to sync within a second of it. Does this seem like the best route or is there a better way to do this?
Thanks!
i haven't done this with backbone specifically, but i've done this kind of batching of commands in other distributed (client / server) apps in the past.
the gist of it is that you should start with a timeout and add a batch size for further optimization, if you see the need.
say you have a batch size of 10. what happens when you get 9 items stuffed into the batch and then the user just sits there and doesn't do anything else? the server would never get notified of the things the user wanted to do.
timeout generally works well to get small batches. but if you have an action that generates a large number of related commands you may want to batch all of the commands and send them all across as soon as they are ready instead of waiting for a timer. otherwise the timer may fire in the middle of creating the commands and split things apart in a manner that causes problems, etc.
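A rough sketch of combining a timeout with a batch size, as described above. createBatcher and its parameters are hypothetical names, and send stands in for whatever pushes the batch over the websocket:

```javascript
// Collects items and sends them as one batch when either maxSize items
// have accumulated or maxWaitMs has passed since the first item.
function createBatcher(send, maxSize, maxWaitMs) {
  var items = [];
  var timer = null;

  function flush() {
    if (timer !== null) {
      clearTimeout(timer);
      timer = null;
    }
    if (items.length === 0) return;
    var batch = items;
    items = [];
    send(batch);
  }

  function add(item) {
    items.push(item);
    if (items.length >= maxSize) {
      flush(); // batch is full, do not wait for the timer
    } else if (timer === null) {
      timer = setTimeout(flush, maxWaitMs); // covers partially filled batches
    }
  }

  return { add: add, flush: flush };
}

// Usage: the first three items go out immediately as a full batch,
// the fourth is sent when the timer fires.
var batcher = createBatcher(function (batch) {
  console.log("sending batch:", batch);
}, 3, 20);
batcher.add("a");
batcher.add("b");
batcher.add("c");
batcher.add("d");
```

Exposing flush() lets an action that generates many related commands send them together as soon as it has finished creating them, rather than risk the timer splitting them apart.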
hope that helps.
Underscore.js, the utility library that Backbone.js uses, has several functions for throttling callbacks:
throttle makes a version of a function that will execute at most once every X milliseconds.
debounce makes a version of a function that will only execute once X milliseconds have elapsed since the last time it was called.
after makes a version of a function that will execute only after it has been called X times.
So if you know there are 5 items that will be changed, you could register a callback like this:
// only call callback after 5 change events
collection.on("change", _.after(5, callback));
But more likely you don't, and you'll want to go with a timeout approach:
// only call callback 30 milliseconds after the last change event
collection.on("change", _.debounce(callback, 30));
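To show what these two helpers do under the hood, here is a simplified plain JavaScript sketch. It is not Underscore's actual source, and simpleDebounce and simpleAfter are illustrative names; the real library functions support more options:

```javascript
// Returns a function that delays calling fn until waitMs milliseconds
// have passed without another call. Only the last call's arguments are used.
function simpleDebounce(fn, waitMs) {
  var timer = null;
  return function () {
    var context = this;
    var args = arguments;
    clearTimeout(timer); // cancel the previously scheduled call, if any
    timer = setTimeout(function () {
      fn.apply(context, args);
    }, waitMs);
  };
}

// Returns a function that only starts calling fn on its count-th call.
function simpleAfter(count, fn) {
  return function () {
    if (--count < 1) {
      return fn.apply(this, arguments);
    }
  };
}

// Usage: five rapid changes result in a single sync.
var syncOnce = simpleDebounce(function (label) {
  console.log("syncing after", label);
}, 30);
for (var n = 1; n <= 5; n++) {
  syncOnce("change " + n);
}
```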