Optimizing an asynchronous find algorithm - javascript

I have a series of consecutively-named pages (URLs, like: http://example.com/book/1, http://example.com/book/2, etc.) but I have no way of knowing how many pages there are in advance. I need to retrieve (a particular part of) each page, keep the obtained info in order, miss no page, and make as few requests for non-existent pages as possible.
Currently, I have a recursive asynchronous function which is a bit like this:
pages = []
getPage = (page = 1) ->
  xhr.get "http://example.com/book/#{page}", (response) ->
    if isValid response
      pages.push response
      getPage page + 1
    else
      event.trigger "haveallpages"
getPage()
xhr.get and event.trigger are pseudo-code and are currently jQuery methods (but that may change). isValid is also pseudo-code; in reality the test is defined within the function, but it's complex and not relevant to the question.
This works well but is slow as only one request is processed at a time. What I'm looking for is a way to make better use of the asynchronous nature of XHRs and retrieve the complete list in less time. Is there a pattern which could help me here? Or a better algorithm?

Just fire simultaneous requests while keeping count of them. There is no need to guess the upper bound; simply stop when requests start to fail, as in your original code.
This will generate at most concurrency-1 wasted requests:
pages = []
concurrency = 5
currentPage = 0
haveAllPages = false

getPage = (p) ->
  xhr.get "http://example.com/book/#{p}", (response) ->
    if isValid response
      # store by index rather than push: concurrent responses
      # can arrive out of order
      pages[p - 1] = response
      getPage ++currentPage if not haveAllPages
    else
      haveAllPages = true

while concurrency--
  getPage ++currentPage
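For comparison, here's a rough modern-JavaScript sketch of the same pattern using fetch and promises. isValid, the URL scheme, and the text-based response handling are assumptions carried over from the question, not part of the original answer:

const pages = [];
const concurrency = 5;
let currentPage = 0;
let haveAllPages = false;

function getPage(p) {
    return fetch(`http://example.com/book/${p}`)
        .then(response => response.text())
        .then(body => {
            if (isValid(body)) {
                pages[p - 1] = body; // store by index so order is preserved
                if (!haveAllPages) return getPage(++currentPage);
            } else {
                haveAllPages = true;
            }
        });
}

// launch the initial batch of concurrent "workers"; each keeps
// claiming the next page number until one comes back invalid
const workers = [];
for (let i = 0; i < concurrency; i++) {
    workers.push(getPage(++currentPage));
}
Promise.all(workers).then(() => {
    console.log(`fetched ${pages.filter(Boolean).length} pages`);
});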

Related

Global memoizing fetch() to prevent multiple of the same request

I have an SPA and for technical reasons I have different elements potentially firing the same fetch() call pretty much at the same time.[1]
Rather than going insane trying to prevent multiple unrelated elements from orchestrating the loading of data, I am thinking about creating a globalFetch() call where:
the init argument is serialised (along with the resource parameter) and used as hash
when a request is made, it's queued and its hash is stored
when another request comes, and the hash matches (which means it's in-flight), another request will NOT be made, and it will piggy back from the previous one
async function globalFetch(resource, init) {
    const sigObject = { ...init, resource }
    const sig = JSON.stringify(sigObject)

    // If it's already happening, return that one
    if (globalFetch.inFlight[sig]) {
        // NOTE: I know I don't yet have sig.timeStamp, this is just to show
        // the logic
        if (Date.now - sig.timeStamp < 1000 * 5) {
            return globalFetch.inFlight[sig]
        } else {
            delete globalFetch.inFlight[sig]
        }
    }

    const ret = globalFetch.inFlight[sig] = fetch(resource, init)
    return ret
}
globalFetch.inFlight = {}
It's obviously missing a way to have the requests' timestamps. Plus, it's missing a way to delete old requests in batch. Other than that... is this a good way to go about it?
Or, is there something already out there, and I am reinventing the wheel...?
[1] If you are curious, I have several location-aware elements which will reload data independently based on the URL. It's all nice and decoupled, except that it's a little... too decoupled. Nested elements (with partially matching URLs) needing the same data potentially end up making the same request at the same time.
Your concept will generally work just fine.
Some things missing from your implementation:
Failed responses should either not be cached in the first place or be removed from the cache when you see the failure. And failure is not just rejected promises, but also any request that doesn't return an appropriate success status (probably a 2xx status).
JSON.stringify(sigObject) is not a canonical representation of the exact same data because properties might not be stringified in the same order depending upon how the sigObject was built. If you grabbed the properties, sorted them, inserted them in sorted order into a temporary object, and then stringified that, it would be more canonical.
I'd recommend using a Map object instead of a regular object for globalFetch.inFlight because it's more efficient when you're adding/removing items regularly and will never have any name collision with property names or methods (though your hash would probably not conflict anyway, but it's still a better practice to use a Map object for this kind of thing).
Items should be aged from the cache (as you apparently know already). You can just use a setInterval() that runs every so often (it doesn't have to run very often - perhaps every 30 minutes) that just iterates through all the items in the cache and removes any that are older than some amount of time. Since you're already checking the time when you find one, you don't have to clean the cache very often - you're just trying to prevent non-stop build-up of stale data that isn't going to be re-requested - so it isn't getting automatically replaced with newer data and isn't being used from the cache.
If you have any case insensitive properties or values in the request parameters or the URL, the current design would see different case as different requests. Not sure if that matters in your situation or not or if it's worth doing anything about it.
When you write the real code, you need Date.now(), not Date.now.
Here's a sample implementation that implements all of the above (except for case sensitivity because that's data-specific):
function makeHash(url, obj) {
    // put properties in sorted order to make the hash canonical
    // the canonical sort is top level only,
    // does not sort properties in nested objects
    let items = Object.entries(obj).sort((a, b) => b[0].localeCompare(a[0]));
    // add URL on the front
    items.unshift(url);
    return JSON.stringify(items);
}
async function globalFetch(resource, init = {}) {
    const key = makeHash(resource, init);

    const now = Date.now();
    const expirationDuration = 5 * 1000;
    const newExpiration = now + expirationDuration;

    const cachedItem = globalFetch.cache.get(key);
    // if we found an item and it expires in the future (not expired yet)
    if (cachedItem && cachedItem.expires >= now) {
        // update expiration time
        cachedItem.expires = newExpiration;
        return cachedItem.promise;
    }

    // couldn't use a value from the cache
    // make the request
    let p = fetch(resource, init);
    p.then(response => {
        if (!response.ok) {
            // if response not OK, remove it from the cache
            globalFetch.cache.delete(key);
        }
    }, err => {
        // if promise rejected, remove it from the cache
        globalFetch.cache.delete(key);
    });
    // save this promise (will replace any expired value already in the cache)
    globalFetch.cache.set(key, { promise: p, expires: newExpiration });
    return p;
}
// initialize cache
globalFetch.cache = new Map();

// clean up interval timer to remove expired entries
// does not need to run that often because .expires is already checked above
// this just cleans out old expired entries to avoid memory increasing
// indefinitely
globalFetch.interval = setInterval(() => {
    const now = Date.now();
    for (const [key, value] of globalFetch.cache) {
        if (value.expires < now) {
            globalFetch.cache.delete(key);
        }
    }
}, 10 * 60 * 1000); // run every 10 minutes
Implementation Notes:
Depending upon your situation, you may want to customize the cleanup interval time. This is set to run a cleanup pass every 10 minutes just to keep it from growing unbounded. If you were making millions of requests, you'd probably run that interval more often or cap the number of items in the cache. If you aren't making that many requests, this can be less frequent. It is just to clean up old expired entries sometime so they don't accumulate forever if never re-requested. The check for the expiration time in the main function already keeps it from using expired entries - that's why this doesn't have to run very often.
This looks at response.ok from the fetch() result and promise rejection to determine a failed request. There could be some situations where you want to customize what is and isn't a failed request with some different criteria than that. For example, it might be useful to cache a 404 to prevent repeating it within the expiration time if you don't think the 404 is likely to be transitory. This really depends upon your specific use of the responses and behavior of the specific host you are targeting. The reason to not cache failed results is for cases where the failure is transitory (either a temporary hiccup or a timing issue and you want a new, clean request to go if the previous one failed).
There is a design question for whether you should or should not update the .expires property in the cache when you get a cache hit. If you do update it (like this code does), then an item could stay in the cache a long time if it keeps getting requested over and over before it expires. But, if you really want it to only be cached for a maximum amount of time and then force a new request, you can just remove the update of the expiration time and let the original result expire. I can see arguments for either design depending upon the specifics of your situation. If this is largely invariant data, then you can just let it stay in the cache as long as it keeps getting requested. If it is data that can change regularly, then you may want it to be cached no more than the expiration time, even if it's being requested regularly.
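For illustration, here's a hypothetical usage with two decoupled elements requesting the same (made-up) URL at nearly the same time; only one network request goes out:

// both calls are made before either resolves, so the second one
// gets the same in-flight promise back from the cache
const a = globalFetch('/api/data?user=42'); // hypothetical endpoint
const b = globalFetch('/api/data?user=42');
console.log(a === b); // true - one shared request

// each consumer clones the shared Response before reading it,
// because a Response body can only be consumed once
a.then(r => r.clone().json()).then(data => { /* render element A */ });
b.then(r => r.clone().json()).then(data => { /* render element B */ });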
Consider using a ServiceWorker or Workbox to separate caching logic from your application. The Stale-While-Revalidate strategy could apply here.

Stopping synchronous function after 2 seconds

I'm using the npm library jsdiff, which has a function that determines the difference between two strings. This is a synchronous function, but given two large, very different strings, it will take extremely long periods of time to compute.
diff = jsdiff.diffWords(article[revision_comparison.field], content[revision_comparison.comparison]);
This function is called in a stack that handles a request through Express. How can I, for the sake of the user, make the experience more bearable? I think my two options are:
Cancelling the synchronous function somehow.
Cancelling the user request somehow. (But would this keep the function still running?)
Edit: I should note that given two very large and different strings, I want a different logic to take place in the code. Therefore, simply waiting for the process to finish is unnecessary and cumbersome on the load - I definitely don't want it to run for any long period of time.
Fork a child process for that specific task; you can even create a queue to limit the number of child processes that can be running at a given moment.
Here you have a basic example where the master forwards the relevant request data to a child that performs the heavy synchronous operation without blocking the main (master) thread, and once it has finished sends the outcome back to the master. Note that Express req and res objects cannot be sent over the IPC channel (only plain serialisable data can), so the master keeps res and correlates replies with an id.
Worker (Fork Example) :
// worker.js
process.on('message', function(msg) {
    // msg is plain data: { id: ..., input: ... }
    /* > Your jsdiff logic goes here */
    // change this for your heavy synchronous work:
    var outcome = false;
    if (msg.input == 'testlongerstring') { outcome = true; }
    // Pass results back to parent process:
    process.send({ id: msg.id, outcome: outcome });
});
And from your Master :
var cp = require('child_process');
var child = cp.fork(__dirname + '/worker.js');

var pending = {};   // id -> res (res objects stay in the master)
var nextId = 0;

child.on('message', function(msg) {
    // Receive results from child process
    console.log('received: ' + msg.outcome);
    var res = pending[msg.id];
    delete pending[msg.id];
    res.send(msg.outcome); // end response with data
});
You can then send work to the child from a route handler like this (imagine app = express); only the plain input is forwarded, while res is parked on the master until the child replies:
app.get('/stringCheck/:input', function(req, res) {
    var id = nextId++;
    pending[id] = res;
    child.send({ id: id, input: req.params.input });
});
I found this on jsdiff's repository:
All methods above which accept the optional callback method will run in sync mode when that parameter is omitted and in async mode when supplied. This allows for larger diffs without blocking the event loop. This may be passed either directly as the final parameter or as the callback field in the options object.
This means that you should be able to add a callback as the last parameter, making the function asynchronous. It will look something like this:
jsdiff.diffWords(article[x], content[y], function(err, diff) {
    //add whatever you need
});
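Per the quoted docs, the callback can also be passed as the callback field in the options object, which would presumably look something like this:

jsdiff.diffWords(article[x], content[y], {
    callback: function(err, diff) {
        //add whatever you need
    }
});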
Now, you have several choices:
Return directly to the user and keep the function running in the background.
Set a 2 second timeout (or whatever limit fits your application) using setTimeout as outlined in this answer.
If you go with option 2, your code should look something like this (with a guard so callback fires only once):
var timer = setTimeout(function() {
    // reached only if the diff took more than 2000ms (2 seconds)
    timer = null;
    return callback();
}, 2000);
jsdiff.diffWords(article[x], content[y], function(err, diff) {
    if (timer === null) return; // the timeout already fired
    clearTimeout(timer);
    //add whatever you need
    return callback(err, diff);
});

async control flow in javascript

I'm writing a browser extension for chrome that uses Web SQL for local storage. The code for both of these components seems to heavily rely on async operations. I have a good understanding of asynchronous operations but not a whole lot of experience writing code that relies this heavily on them.
For Example:
var CSUID = "";
//this is an async callback for handling browser tab updates
function checkForValidUrl(tabId, changeInfo, tab) {
    getCookies("http://www.cleansnipe.com/", "CSUID", handleCookie);
    if (CSUID != "") { //this could be in handleCookie if i could access the tab
        //do stuff with the tab
    }
}
function handleCookie(cookie) {
    if (cookie != "" && cookie != null) {
        CSUID = cookie;
    }
}
In order to overcome the lack of ability to pass/return variables into/from these handlers I find myself creating global variables and setting them in the handlers. Of course this doesn't work as expected because the variable is often accessed before the callback has executed.
What is the best practice for handling this situation? I thought of using global flags/counters with a while loop to pause execution, but this seems messy and prone to hanging the application.
If jQuery is an option, it has a beautiful system of what it calls deferred objects. It allows for graceful and effective management of asynchronous situations - or indeed synchronous, as the cases may be or vary.
(Deferreds aren't limited to just jQuery, but jQuery has a nice API for them).
Here's a simple example purely to demonstrate the concept.
//get data func
function get_data(immediate) {
    //if immediate, return something synchronously
    if (immediate)
        return 'some static data';
    //else get the data from something over AJAX
    else
        return $.get('some_url');
}
//two requests to get data - one synchronous, one asynchronous
var data1 = get_data(true), data2 = get_data();

//do something when both resolved
$.when(data1, data2).done(function(data1, data2) {
    //callback code here...
});
Deferreds don't have to involve AJAX; you can create your own deferred objects (jQuery's AJAX requests automatically make and return them) and resolve/reject them manually. I did a blog post on this a few months back - it might help.
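Applied to the cookie example in the question, a minimal sketch might look like this (assuming getCookies invokes its callback with the cookie value or null):

function checkForValidUrl(tabId, changeInfo, tab) {
    var gotCookie = $.Deferred();
    getCookies("http://www.cleansnipe.com/", "CSUID", function(cookie) {
        if (cookie != null && cookie != "") {
            gotCookie.resolve(cookie);
        } else {
            gotCookie.reject();
        }
    });
    gotCookie.done(function(cookie) {
        // `tab` is still in scope via the closure - no global needed
        //do stuff with the tab
    });
}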

Is this pair of AJAX requests a race condition?

I'm creating a website that will feature news articles. These articles will appear in two columns at the bottom of the page. There will be a button at the bottom to load additional news stories. That means that I need to be able to specify what news story to load. Server-side, I'm simply implementing this with a LIMIT clause in my SQL statement, supplying the :first parameter like so:
SELECT *
FROM news
ORDER BY `date` DESC
LIMIT :first, 1
This means that, client-side, I need to keep track of how many news items I've loaded. I've implemented this by having the function to load new information be kept in an object with a property holding the number of items loaded. I'm worried that this is somehow a race condition that I am not seeing, though, where my loadNewInformation() will be called twice before the number is incremented. My code is as follows:
var News = {
    newInfoItems: 0,
    loadNewInformation: function(side) {
        this.newInfoItems += 1;
        jQuery.get(
            '/api/news/' + (this.newInfoItems - 1),
            function(html) {
                jQuery('div.col' + side).append(html);
            }
        );
    }
}
On page load, this is being called in the following fashion:
News.loadNewInformation('left');
News.loadNewInformation('right');
I could have implemented this in such a way that the success handler of a first call made another AJAX request for the second, which clearly would not be a race condition...but this seems like sloppy code. Thoughts?
(Yes, there is a race condition.)
Addressing Just the JavaScript
All JavaScript code on a page (excluding Web-Workers), which includes callbacks, is run "mutually exclusive".
In this case, because newInfoItems is eagerly evaluated, it is not even that complex: both "/api/news/0" and "/api/news/1" are guaranteed to be fetched (or fail in an attempt). Compare it to this:
// eager evaluation: value of url computed BEFORE request
// this is same as in example (with "fixed" increment order ;-)
// and is just to show point
var url = "/api/news/" + this.newInfoItems
this.newInfoItems += 1;
jQuery.get(url,
function(html) {
// only evaluated on AJAX callback - order of callbacks
// not defined, but will still be mutually exclusive.
jQuery('div.col'+side).append(html);
}
);
However, the order in which the AJAX requests complete is not defined and is influenced by both the server and browser. Furthermore, as discussed below, there is no atomic context established between the server and individual AJAX requests.
Addressing the JavaScript in Context
Now, even though it's established that "/api/news/0" and "/api/news/1" will be invoked, imagine this unlikely, but theoretically possible situation:
articles B,A exist in database
browser sends both AJAX requests -- asynchronously or synchronously, it doesn't matter!
an article is added to the database sometime between when
the server processes the news/0 request, and
the server processes the news/1 request
Then, this happens:
news/0 returns article B (articles B,A in database)
article C added
news/1 returns article B (articles C,B,A in database)
Note that article B was returned twice! Oops :)
So, while the race-condition "seems fairly unlikely", it does exist. A similar race condition (with different results) can occur if news/1 is processed before news/0 and (once again) an article is added between the requests: there is no atomic guarantee in either case!
(The above race condition would be more likely if the AJAX requests executed in series, as the window for a new article to be added is larger.)
Possible Solution
Consider fetching, say, n (2 is okay!) articles in a single request (e.g. "/api/latest/n"), and then laying out the articles as appropriate in the callback. For instance, the first half of the articles on the left and the second half on right, or whatever is appropriate.
As well as eliminating the particular race-condition above by making the single request an atomic action -- with respect to article additions -- it will also result in less network traffic and less work for the server.
The fetch for the API might then look like:
SELECT *
FROM news
ORDER BY `date` DESC
LIMIT :n
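Client-side, the combined fetch might then look roughly like this (assuming a hypothetical /api/latest/2 endpoint that returns a JSON array of two rendered items):

jQuery.getJSON('/api/latest/2', function(items) {
    // one atomic request, then split the results across the columns
    jQuery('div.colleft').append(items[0]);
    jQuery('div.colright').append(items[1]);
});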
Happy coding.
Yes, technically, this could create a race condition of sorts. The calls are asynchronous, and if the first got held up for some reason, the second could return first.
However, as not much goes on in your callback functions that depends on the presence of the other 'side' being populated, I don't see where it should cause you much grief.
Shouldn't be any race conditions, must be something else wrong in your code. The counter is incremented before your .get() call, so prior to each get the counter should be incremented correctly. Short example to demonstrate that it works when called sequentially: http://jsfiddle.net/KkpwF/
I reckon you're hinting at the newInfoItems counter?
Observations:
You're calling News.loadNewInformation twice, with a callback function which will be executed on AJAX completion.
The callback method you provide does not alter the News object.
In this case, you don't have to worry. The second call to News.loadNewInformation will only be executed once the first call has returned; the AJAX callback runs later, but it doesn't touch the counter. Hence your newInfoItems counter will contain the correct value.
Try this:
loadNewInformation: function(side) {
    jQuery.get(
        '/api/news/' + (this.newInfoItems++),
        function(html) {
            jQuery('div.col' + side).append(html);
        }
    );
}

Ignoring old multiple asynchronous ajax requests

I've got a custom javascript autocomplete script that hits the server with multiple asynchronous ajax requests. (Every time a key gets pressed.)
I've noticed that sometimes an earlier ajax request will be returned after a later request, which messes things up.
The way I handle this now is I have a counter that increments for each ajax request. Requests that come back with a lower count get ignored.
I'm wondering: Is this proper? Or is there a better way of dealing with this issue?
Thanks in advance,
Travis
You can store a "global" currentAjaxRequest, which holds the structure of the last XHR request. Then you can abort the current request when you make a new one.
For example:
var currentAjaxRequest = null;
function autoCompleteStuff() {
    if (currentAjaxRequest !== null) {
        currentAjaxRequest.abort();
    }
    currentAjaxRequest = $.get(..., function(...) {
        currentAjaxRequest = null;
        ...
    });
}
To avoid naming conflicts, wrap that in an anonymous, instantly-executed function, if needed.
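For what it's worth, the counter approach described in the question is also a workable pattern when you'd rather let stale requests complete and simply ignore their responses. A rough sketch, where the URL and showSuggestions are placeholders:

var latestRequestId = 0;

function autoComplete(query) {
    var requestId = ++latestRequestId; // tag this request
    $.get('/search', { q: query }, function(results) {
        // ignore this response if a newer request was issued since
        if (requestId !== latestRequestId) return;
        showSuggestions(results);
    });
}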
