Autocomplete performance and private "magic search" - javascript

Poor performance of autocomplete fields reduces their usefulness. If the client-side implementation has to call an endpoint that does a heavy DB lookup, the response time can easily become frustrating.
One neat approach comes from the AWS Case Study: IMDb. It used to come with a diagram (no longer available), but in a nutshell a prediction tree is generated and stored for every combination that resolves to something meaningful. E.g. resolutions for "sta" would include Star Wars, Star Trek and Sylvester Stallone, which would be stored, but "stb" would not resolve to anything meaningful and would not be stored.
To get the lowest possible latency, all possible results are pre-calculated with a document for every combination of letters in search. Each document is pushed to Amazon Simple Storage Service (Amazon S3) and thereby to Amazon CloudFront, putting the documents physically close to the users. The theoretical number of possible searches to calculate is mind-boggling—a 20-character search has 23 x 10^30 combinations—but in practice, using IMDb's authority on movie and celebrity data can reduce the search space to about 150,000 documents, which Amazon S3 and Amazon CloudFront can distribute in just a few hours. IMDb creates indexes in several languages with daily updates for datasets of over 100,000 movie and TV titles and celebrity names.
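To illustrate the idea, a toy sketch of that precomputation might look like this (the dataset and structure are purely illustrative, not IMDb's actual pipeline): every meaningful prefix of every indexed word maps to a small result document that can then be published as a static file.
var titles = ["Star Wars", "Star Trek", "Sylvester Stallone"];
var prefixDocs = {}; // prefix -> list of matching titles
titles.forEach(function (title) {
  title.toLowerCase().split(/\s+/).forEach(function (word) {
    for (var len = 1; len <= word.length; len++) {
      var prefix = word.slice(0, len);
      var docs = prefixDocs[prefix] = prefixDocs[prefix] || [];
      if (docs.indexOf(title) === -1) docs.push(title);
    }
  });
});
// prefixDocs["sta"] -> ["Star Wars", "Star Trek", "Sylvester Stallone"]
// "stb" never occurs as a prefix, so no document is generated for it.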
How could a similarly performant experience be achieved with private data, e.g. autocompleting client names, job IDs, invoice numbers? Storing different documents/decision trees for separate users sounds expensive, especially if some of the data (client names?) could be available to multiple users.

You're right that such a workload requires some special optimization.
You can use a ready-made search engine like Apache Lucene or Solr (which is a REST API wrapper around Lucene).
These engines are optimized for full-text search and can work with private data.
Work steps:
Install Solr (or Lucene)
Design a schema for storing the information (what fields and what types of searches you need)
Load data into it (via batch operations or on an update basis)
Query it using Solr's query language (similar to Google search).
At this point you can add restrictions based on user_id or any other parameter in addition to the original user query, so private data doesn't get mixed up between users.
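For the last two steps, a minimal client-side sketch might look like the following, assuming a Solr core named "clients" with name and user_id fields (those names are illustrative, not part of Solr itself); the fq filter query is what keeps each user's private data separate:
var params = new URLSearchParams({
  q: "name:" + userInput + "*",    // the user's partial input as a prefix query
  fq: "user_id:" + currentUserId,  // restrict results to the current user
  wt: "json",
  rows: "10"
});
fetch("/solr/clients/select?" + params.toString())
  .then(function (res) { return res.json(); })
  .then(function (data) {
    // data.response.docs contains the matching suggestions
  });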

I actually agree with CGI. The best solution is a 3rd-party search engine; anything else is trying to build your own search engine. I can't really tell from your post what hardware you have at your disposal, so I'll give a possible low-budget solution for when all you've got is LAMP hosting.
So in your PHP code you would build a query string like (be sure to escape $search, e.g. with mysqli_real_escape_string, or use a prepared statement to avoid SQL injection):
$qstr = "SELECT * FROM Clients WHERE `name` LIKE '%".$search."%' ORDER BY popularity DESC LIMIT 0,100";
Then increment the popularity column for every record that is found via the "search engine."
On the front end (let's say you're using Dojo) you could do something like:
<script>
require(["dojo/on", "dojo/dom", "dojo/request/xhr", "dojo/domReady!"], function (on, dom, xhr) {
    var searchCheck; // debounce timer handle
    on(dom.byId('txtSearch'), "change", function (evt) {
        if (typeof searchCheck !== "undefined") clearTimeout(searchCheck);
        searchCheck = setTimeout(function () { // keep from flooding XHR
            xhr("fetch-json-results.php", {
                handleAs: "json"
            }).then(function (data) {
                // update the txtSearch combo store
            });
        }, 500);
    });
});
</script>
<input id="txtSearch" type="text" data-dojo-type="dijit/form/ComboBox" data-dojo-props="intermediateChanges:true">
This would be a low-tech, low-budget (LAMP) equivalent answer.

Related

What is the best way to do complicated string search on 5M records? Application layer or DB layer?

I have a use case where I need to do complicated string matching on records, of which there are about 5.1 million. By complicated string matching, I mean using a library to do fuzzy string matching (http://blog.bripkens.de/fuzzy.js/demo/).
The database we use at work is SAP HANA, which is excellent for retrieving and querying because it's in-memory, so I would like to avoid pulling data out of there and re-populating it in memory on the application layer; but at the same time I cannot take advantage of the libraries (there is an API for fuzzy matching in the DB but it's not comprehensive enough for us).
What is the middle ground here? If I do pre-processing and associate words in the DB with certain keywords the user might search for, I can cut down the overhead, but are there any best practices employed when it comes to this?
If it matters: the list is a list of Billing Descriptors (that show up on CC statements), so the user will search these descriptors to find out which companies a descriptor belongs to.
Assuming your "billing descriptor" is a single column, probably of type (N)VARCHAR, I would start with a very simple SAP HANA fuzzy search, e.g.:
SELECT top 100 SCORE() AS score, <more fields>
FROM <billing_documents>
WHERE CONTAINS(<bill_descr_col>, <user_input>, FUZZY(0.7))
ORDER BY score DESC;
Maybe this is already good enough when you want to apply your JS library on the result set. If not, I would start to experiment with the similarCalculationMode option, like 'similarcalculationmode=substringsearch' etc. And I would always keep an eye on the response times; they can be higher when using some of the options.
Only if response times are too high, or many active concurrent users are running your query, would I try to create a fuzzy search index on your search column. If you need more search options, you can also create a fulltext index.
But that all really depends on your use case, the values you want to compare, etc.
There is a very comprehensive set of features and options for different use cases; check help.sap.com/hana/SAP_HANA_Search_Developer_Guide_en.pdf.
In a project we did a free-style search on several address columns (name, surname, company name, post code, street) and we got response times of 100-200 ms on ca. 6 million records WITHOUT using any special indexes.

Meteor, mongodb - canteen optimization

TL;DR:
I'm making an app for a canteen. I have a collection with the persons and a collection where I "log" every meal taken. I need to know who DIDN'T take the meal.
Long version:
I'm making an application for my local Red Cross.
I'm trying to optimize this situation:
there is a canteen at which the people we help can take food at breakfast, lunch and supper. We need to know how many took the meal (and this is easy).
if they are present they HAVE TO take the meal and eat, so we need to know how many (and who) HAVEN'T eaten (this is the part that I need to optimize).
When they take the meal the "cashier" scans their barcode and the program logs the "transaction" in the Log collection.
Currently, on creation of the "canteen" template I create a local collection "Meals" and populate it with the data of all the people in the DB (so ID, name, fasting/satiated), then I use this collection for my counters and to display who took the meal and who didn't.
(The variable "mealKind" is "breakfast", "lunch" or "dinner" depending on the current serving.)
Template.canteen.created = function(){
  Meals = new Mongo.Collection(null); // client-only local collection
  var today = new Date();
  today.setHours(0,0,1);
  var pers = Persons.find({"status":"present"},{fields:{"Name":1,"Surname":1,"barcode":1}}).fetch();
  pers.forEach(function(uno){
    // one Log lookup per person: this is the expensive part
    var vediamo = Log.findOne({"dest":uno.barcode,"what":mealKind,"when":{"$gte": today}});
    if(typeof vediamo == "object"){
      uno['eat'] = "satiated";
    }else{
      uno['eat'] = "fasting";
    }
    Meals.insert(uno);
  });
};
Template.canteen.destroyed = function(){
  Meals.remove({});
};
From the Meals collection I extract the two columns of people, satiated and fasting (with name, surname and barcode), and I also use two helpers:
fasting: function(){
  return Meals.find({"eat":"fasting"});
},
countFasting: function(){
  return Meals.find({"eat":"fasting"}).count();
}
//same for satiated
This was OK, but now the number of people is really increasing (we are around 1,000 and counting) and the creation of the page is very, very slow; it usually stops with errors, so I can read "100 fasting, 400 satiated" even though I have around 1,000 persons in the DB.
I can't figure out how to optimize the workflow; every other method that I tried involved (in one way or another) more queries to the DB. I think that I missed the point and now I cannot see it.
I'm not sure about aggregation at this level and inside Meteor, because of minimongo.
Although making this server side and not client side is clever, the problem here is HOW to discriminate "fasting" vs "satiated" without cycling through the whole Persons collection.
+1 if the solution is compatible with aldeed:tabular
EDIT
I am still not sure about what is causing your performance issue (too many things in client memory / minimongo, too many calls to it?), but you could at least try different approaches, more traditionally based on your server.
By the way, you did not mention how you display your data, nor how you get the incorrect reading for your number of already served / missing Persons.
If you are building a classic HTML table, please note that browsers struggle to render more than a few hundred rows. If you are in that case, you could implement client-side table pagination / infinite scrolling. Look for example at the jQuery DataTables plugin (on which aldeed:tabular is based). Skip the step of building an actual HTML table, and fill it directly using $table.rows.add(myArrayOfData).draw() to avoid the browser limitation.
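A minimal sketch of that DataTables approach, assuming a <table id="mealsTable"> element on the page and the field names from the question (all illustrative):
var table = $('#mealsTable').DataTable({
  columns: [
    { title: 'Name', data: 'Name' },
    { title: 'Surname', data: 'Surname' },
    { title: 'Meal', data: 'eat' }
  ],
  pageLength: 50 // let the plugin paginate instead of rendering ~1000 rows at once
});
table.rows.add(myArrayOfData).draw(); // myArrayOfData: an array of plain objects, not DOM rows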
Original answer
I do not exactly understand why you need to duplicate your Persons collection into a client-side Meals local collection.
This requires that all documents of Persons are first sent from server to client (this may not be problematic if your server is well connected / local; you may also still have the autopublish package on, in which case you would already have paid that penalty), and then that all documents are cloned (checking your Log collection to retrieve any previous passages), effectively doubling your memory needs.
Is your server and/or remote DB really so slow that it justifies doing everything locally (client side)?
It could be even more problematic: should you have more than one "cashier" / client browser open, their Meals local collections will not be synchronized.
If your server-client connection is good, there is no reason to do everything client side. Meteor will automatically cache just what is needed, and provide optimistic DB modification to keep your user experience fast (should you structure your code correctly).
With the aldeed:tabular package, you can easily display your big Persons table by "pages".
You can also link it with your Log collection using dburles:collection-helpers (IIRC there is an example on the aldeed:tabular home page).
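One possible server-based sketch, using the Persons, Log and Meals collections and field names from the question (the servedBarcodes method name is made up for illustration): the server returns only the barcodes already served for the current meal, so the client never runs one Log query per person.
// Server
Meteor.methods({
  servedBarcodes: function (mealKind) {
    var today = new Date();
    today.setHours(0, 0, 1);
    // single query returning only the "dest" (barcode) field of today's transactions
    return Log.find(
      { what: mealKind, when: { $gte: today } },
      { fields: { dest: 1 } }
    ).map(function (entry) { return entry.dest; });
  }
});

// Client: build one lookup table, then mark every present person in a single pass
Meteor.call('servedBarcodes', mealKind, function (err, served) {
  if (err) return;
  var servedSet = {};
  served.forEach(function (barcode) { servedSet[barcode] = true; });
  Persons.find({ status: 'present' }).forEach(function (p) {
    p.eat = servedSet[p.barcode] ? 'satiated' : 'fasting';
    Meals.insert(p);
  });
});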

What is Xdata protocol?

I need to understand what the XDATA protocol is. I searched on the internet but couldn't find anything that helps. We are an REU team trying to add a sensor to an interface that only takes sensors using the XDATA protocol.
Looking for the same information, this is the best document I could find:
ftp://aftp.cmdl.noaa.gov/user/jordan/iMet-1-RSB%20Radiosonde%20XDATA%20Daisy%20Chaining.pdf
Basically, an XDATA packet is ASCII-formatted, and looks like this:
xdata=01010123456789abcdef
Where:
"xdata=" - header
"01" - instrument ID
"01" - instrument position in daisy chain (more than one instrument can be connected)
the rest is data
Another document mentions that packet should be terminated with CR/LF: ftp://aftp.cmdl.noaa.gov/user/jordan/XDATA%20Packet%20Example.pdf
Please note that XDATA doesn't seem to specify the data structure. It appears that the data processing is up to the ground station.
Since you didn't specify the radiosonde and instrument you plan to use, I won't write too much here, but you can go to http://www.esrl.noaa.gov/gmd/ozwv/wvap/sw.html where there is a little bit more information about the protocol (but not too much).
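A small parsing sketch of the packet layout described above (the function and field names are made up for illustration; the payload is left uninterpreted, since XDATA leaves its structure to the instrument and ground station):
function parseXdata(line) {
  // header "xdata=", 2-char instrument ID, 2-char daisy-chain position, rest is data
  var m = /^xdata=(..)(..)(.*)$/.exec(line.trim()); // trim handles the CR/LF terminator
  if (!m) return null; // not an XDATA packet
  return {
    instrumentId: m[1],   // e.g. "01"
    chainPosition: m[2],  // e.g. "01"
    data: m[3]            // instrument-specific payload
  };
}
// parseXdata("xdata=01010123456789abcdef")
// -> { instrumentId: "01", chainPosition: "01", data: "0123456789abcdef" }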
I'm the creator of the docs and website linked in the previous answer. Blagus summarized it well. Any instrument you want to send data through an xdata-compatible radiosonde (typically the iMet-1-RSB) to the ground should output 3.3V (3V works too) UART serial packets at 9600 baud (8-N-1, typically no flow control) according to the protocol. We keep track of a list of instrument ID numbers here to avoid conflicts, feel free to contact us to add a new instrument in the future (while prototyping, you can just make one up).
You can also "daisy chain" several xdata instruments together. Any incoming xdata packets have their DC index incremented and then are forwarded down the chain. When xdata packets reach the radiosonde they are stripped of their header info, a CRC is calculated and appended, then they are transmitted as binary FSK radio data to the antenna on the ground. The antenna/preamp is connected to a receiver and then SkySonde Server/Client can decode the data (if using an iMet radiosonde) from the receiver's audio output.

Breeze.js cache limitations? Or Browser?

We are investigating using Breeze for field deployment of some tools. The scenario is this -- an auditor will visit sites in the field, where most of the time there will be no -- or very degraded -- internet access. Rather than replicate our SQL database on all the laptops and tablets (if that's even possible), we are hoping to use Breeze to cache the data and then store it locally so it is accessible when there is not a usable connection.
Unfortunately, Breeze seems to choke when caching any significant amount of data. Generally on Chrome it's somewhere between 8 and 13MB worth of entities (as measured by the HTTP response headers). This can change a bit depending on how many tabs I have open and such, but I have not been able to move that by more than 10%. The error I get is that the Chrome tab crashes and tells me to reload. The error is replicable (I download the data in 100K chunks and it fails on the same read every time, and works fine if I stop it after the previous read). When I change the page size, it always fails within the same range.
Is this a limitation of Breeze, or Chrome? Or Windows? I tried it on Firefox, and it handles even less data before the whole browser crashes. IE fares a little better, but none of them do great.
Looking at performance in Task Manager, I get the following:
IE goes from 250MB memory usage to 1.7GB during the caching process and caches a total of about 14MB before throwing an out-of-memory error.
Chrome goes from 206MB memory usage to about 850MB while caching a total of around 9MB.
Firefox goes from around 400MB to about 750MB and manages to cache about 5MB before the whole program crashes.
I can calculate how much will be downloaded with any selection criteria, but I cannot find a way to calculate how much data can be handled by any specific browser instance. This makes using Breeze for offline auditing close to useless.
Has anyone else tackled this problem yet? What are the best approaches to handling something like this. I've thought of several things, but none of them are ideal. Any ideas would be appreciated.
ADDED At Steve Schmitt's request:
Here are some helpful links:
Metadata
Entity Diagram (pdf) (and html and edmx)
The first query, just to populate the tags on the page runs quickly and downloads minimal data:
var query = breeze.EntityQuery
.from("Countries")
.orderBy("Name")
.expand("Regions.Districts.Seasons, Regions.Districts.Sites");
Once the user has selected the Sites s/he wishes to cache, the following two queries are kicked off (it used to be one query, but I broke it into two hoping it would be less of a burden on resources -- it didn't help). The first query (usually 2-3K entities and about 2MB) runs as expected. Some combination of the predicates listed is used to filter the data.
var qry = breeze.EntityQuery
.from("SeasonClients")
.expand("Client,Group.Site,Season,VSeasonClientCredit")
.orderBy("DistrictId,SeasonId,GroupId,ClientId")
var p = breeze.Predicate("District.Region.CountryId", "==", CountryId);
var p1 = breeze.Predicate("SeasonId", "==", SeasonId);
var p2 = breeze.Predicate("DistrictId", "==", DistrictId);
var p3 = breeze.Predicate("Group.Site.SiteId", "in", SiteIds);
After the first query runs, the second query (below) runs, also using some combination of the predicates listed to filter the data (at about 9MB, it will have about 50K rows to download). When the total download burden between the two queries is between 10MB and 13MB, browsers will crash.
var qry = breeze.EntityQuery
.from("Repayments")
.orderBy('SeasonId,ClientId,RepaymentDate');
var p1 = breeze.Predicate("District.Region.CountryId", "==", CountryId);
var p2 = breeze.Predicate("SeasonId", "==", SeasonId);
var p3 = breeze.Predicate("DistrictId", "==", DistrictId);
var p4 = breeze.Predicate("SiteId", "in", SiteIds);
Thanks for the interest, Steve. You should know that the Entity Relationships are inherited and currently in production supporting the majority of the organization's operations, so as few changes as possible to that would be best. Also, the hope is to grow this from a reporting application to one with which data entry can be done in the field (so, as I understand it, using projections to limit the data wouldn't work).
Thanks for the interest, and let me know if there is anything else you need.
Here are some suggestions based on my experience building an offline-capable web application using Breeze. Some or all of these might not make sense for your use cases...
Identify which entity types need to be editable vs which are used to fill drop-downs etc. Load non-editable data using the noTracking query option and cache them in localStorage yourself using JSON.stringify. This avoids the overhead of coercing the data into entities, change tracking, etc. Good candidates for this approach in your model might be entity types like Country, Region, District, Site, etc.
If possible, provide a facility in your application for users to identify which records they want to "take offline". This way you don't need to load and cache everything, which can get quite expensive depending on the number of relationships, entities, properties, etc.
In conjunction with suggestion #2, avoid loading all the editable data at once and avoid using the same EntityManager instance to load each set of data. For example, if the Client entity is something that needs to be editable out in the field without a connection, create a new EntityManager, load a single client (expanding any children that also need to be editable) and cache this data separately from other clients.
Cache the breeze metadata once. When calling exportEntities the includeMetadata argument should be false. More info on this here.
To create new EntityManager instances make use of the createEmptyCopy method.
EDIT:
I want to respond to this comment:
Say I have a client who has bills and payments. That client is in a group, in a site, in a region, in a country. Are you saying that the client, payment, and bill information might each have their own EM, while the location hierarchy might be in a 4th EM with no-tracking? Then when I refer to them, I wire up the relationships as needed using LINQs on the different EMs (give me all the bills for customer A, give me all the payments for customer A)?
It's a bit of a judgement call in terms of deciding how to separate things out. Some of what I'm suggesting might be overkill, it really depends on the amount of data and the way your application is used.
Assuming you don't need to edit groups, sites, regions and countries while offline, the first thing I'd do would be to load the list of groups using the noTracking option and cache them in localStorage for offline use. Then do the same for sites, regions and countries. Keep in mind, entities loaded with the noTracking option aren't cached in the entity manager so you'll need to grab the query result, JSON.stringify it and then call localStorage.setItem. The intent here is to make sure your application always has access to the list of groups, sites, regions, etc so that when you display a form to edit a client entity you'll have the data you need to populate the group, site, region and country select/combobox/dropdown.
Assuming the user has identified the subset of clients they want to work with while offline, I'd then load each of these clients one at a time (including their payment and bill information but not expanding their group, site, region, country) and cache each client+payments+bills set using entityManager.exportEntities. The reasoning here is that it doesn't make sense to load several clients plus their payments and bills into the same EntityManager each time you want to edit a particular client. That could be a lot of unnecessary overhead, but again, this is a bit of a judgement call.
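A condensed sketch of those two steps; the entity set names ("Groups", "Clients"), navigation properties ("Payments", "Bills") and storage keys are illustrative assumptions, while the Breeze calls themselves (noTracking, createEmptyCopy, exportEntities with includeMetadata false) are the ones mentioned above:
// 1) Reference data: load with noTracking and cache as plain JSON
var manager = new breeze.EntityManager(serviceName);
var groupsQuery = breeze.EntityQuery.from("Groups").noTracking();
manager.executeQuery(groupsQuery).then(function (data) {
  localStorage.setItem("groups", JSON.stringify(data.results));
});

// 2) Editable data: one client per EntityManager, exported without metadata
var clientManager = manager.createEmptyCopy();
var clientQuery = breeze.EntityQuery.from("Clients")
  .where("ClientId", "==", clientId)
  .expand("Payments, Bills");
clientManager.executeQuery(clientQuery).then(function () {
  var exported = clientManager.exportEntities(null, false); // false = includeMetadata
  localStorage.setItem("client-" + clientId, exported);
});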
@Jeremy's answer was excellent and very helpful, but didn't actually answer the question, which I was starting to think was unanswerable, or at least the wrong question. However, @Steve in the comments gave me the most appropriate information for this question.
It is neither Breeze nor the browser, but rather Knockout. Apparently the Knockout wrapper around the Breeze entities uses all that memory (at least while loading the entities, and in my environment). As described above, Knockout/Breeze would crap out after reading around 5MB of data, causing Chrome to crash with over 1.7GB of memory usage (from a pre-download memory usage of around 300MB). Rewriting the app in AngularJS eliminated the problem. So far I have been able to download over 50MB from the exact same EF6 model into Breeze/Angular, and total Chrome memory usage never went above 625MB.
I will be testing larger payloads, but 50 MB more than satisfies my needs for the moment. Thanks everyone for your help.

How to do SQL-like queries in client side browser?

I've been looking for a way to do complex queries like SQL can perform, but entirely client side. I know that I can get the exact results I want by doing SQL queries on the server, and I could even AJAX it so that it looks smooth. However, for scalability, performance, and bandwidth reasons I'd prefer to do this all client side.
Some requirements:
Wide browser compatibility. Anything that can run jQuery is fine. I'd actually prefer that it be a jQuery plugin.
Can sort on more than one column. For instance, order by state alphabetically and list all cities alphabetically within each state.
Can filter results. For instance, the equivalent of "where state = 'CA' or 'NY' or 'TX'".
Must work completely client side so the user only needs to download a large set of data once and can cut the data however they want without constantly fetching data from the server and would in fact be able to do all queries offline after the initial pull.
I've looked around on stackoverflow and found jslinq but it was last updated in 2009 and has no documentation. I also can't tell if it can do more complex queries like ordering on two different columns or doing "and" or "or" filtering.
I would think that something like this would have been done already. I know HTML5 got started down this path but then hit a roadblock. I just need basic queries, no joins or anything. Does anyone know of something that can do this? Thanks.
Edit: I think I should include a use case to help clarify what I'm looking for.
For example, I have a list of the 5,000 largest cities in the US. Each record includes Cityname, State, and Population. I would like to be able to download the entire dataset once and populate a JS array with it, then, on the client side only, be able to run queries like the following and create a table from the resulting records.
Ten largest cities in California
All cities that start with "S" with populations of 1,000,000 or more.
Largest three cities in California, New York, Florida, Texas, and Illinois and order them alphabetically by state then by population. i.e. California, Los Angeles, 3,792,621; California, San Diego, 1,307,402; California, San Jose, 945,942...etc.
All of these queries would be trivial to do via SQL but I don't want to keep going back and forth to the server and I also want to allow offline use.
Take a look at http://linqjs.codeplex.com/
It easily meets all your requirements.
Try Alasql.js. This is a javascript client-side SQL database.
You can do complex queries with joins and grouping, even optimization of joins and where parts. It does not use WebSQL.
Support for your requirements:
Wide browser compatibility - all modern versions of browsers, including mobile.
Can sort on more than one column - Alasql does it with the ORDER BY clause.
Can filter results - with the WHERE clause.
Must work completely client side so the user only needs to download a large set of data once and can cut the data however they want without constantly fetching data from the server, and can do all queries offline after the initial pull - you can use pure JavaScript operations (Array.push(), etc.) to modify data (do not forget to set the table.dirty flag).
Here is a simple example (play with it in jsFiddle):
// Fill table with data
var person = [
{ name: 'bill' , sex:'M', income:50000 },
{ name: 'sara' , sex:'F', income:100000 },
{ name: 'larry' , sex:'M', income:90000 },
{ name: 'olga' , sex:'F', income:85000 },
];
// Do the query
var res = alasql("SELECT * FROM ? person WHERE sex='F' AND income > 60000", [person]);
As long as the data can fit in memory as an array of objects, you can just use sort and filter. For example, say you want to filter products. You want to find all products either below $5 or above $100 and you want to sort by price (ascending), and if there are two products with the same price, sort by manufacturer (descending). You could do that like this:
var results = products.filter(function(product) {
// price is in cents
return product.price < 500 || product.price > 10000;
});
results.sort(function(a, b) {
var order = a.price - b.price;
if(order == 0) {
order = b.manufacturer.localeCompare(a.manufacturer);
}
return order;
});
For cross-browser compatibility, just shim filter.
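Applied to the city dataset from the question (fields Cityname, State, Population, with the state abbreviations used in the question's filter example), the first example query could look like this:
// Ten largest cities in California
var tenLargestInCA = cities
  .filter(function (city) { return city.State === 'CA'; })
  .sort(function (a, b) { return b.Population - a.Population; })
  .slice(0, 10);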
How about Yahoo's YQL? I've only briefly looked at it, but it looks interesting.
Backbone is a pretty good js library which (their words) "gives structure to web applications by providing models with key-value binding and custom events, collections with a rich API of enumerable functions, views with declarative event handling, and connects it all to your existing API over a RESTful JSON interface."
I am not sure if this is what you are looking for but you can use it to mock up your model and bind event listeners to it. This seems to be a good tutorial to go through some of the base uses for it.
You can use CanJS. It's a relatively new library that performs better than Backbone and others, and it's based on the well-known JavaScriptMVC library. In reality, it's the MVC part of JavaScriptMVC with a bit of spice added.
You can take a look at this tutorial by net.tutsplus.com: http://net.tutsplus.com/tutorials/javascript-ajax/diving-into-canjs-part-3/
It's pretty powerful and fast, with features like live binding that make your life easier.
Coils is a ClojureScript framework which compiles to JavaScript and has client-side SQL queries like this:
(go
(log (sql "SELECT * FROM test_table where name = ?" ["shopping"] )))
It is full SQL that is passed to a server-side relational database:
https://github.com/zubairq/coils
