Appropriate DB solution for browser-based interactive query tool - javascript

I am working on an interactive query application for data records that look like this in a CSV file:
w1 w2 ... , w3 w4 w5 ... , f1, f2, f3, f4
where the first and second fields contain phrases of 1-15 words each, and the rest (f1, f2, ...) are simply floating-point numbers ("features"). The data can contain as many as 2-5 million such records.
I want to build a browser-based, standalone interactive querying app where I can run queries such as:
find N records (10 < N < 100) where the first (second) field is of length 5 words or less
find N records where the first (second) field contains a specific string
find N records with the highest (lowest) values for feature f1 (or f2/f3/f4)
find N records where the first field is longer (shorter) than the second field
...
I would like to use jQuery to build the interactive part of this tool, so I figure I need something like MongoDB to store the (JSON-formatted) data, which I can then query using JavaScript. However, I am not sure whether it is even possible to use an entirely local, client-side database with JavaScript inside a browser. I am also not sure whether MongoDB can handle queries like this at the data sizes I will be dealing with. I am a complete novice at this, so it is entirely possible that I have missed something much more appropriate for this situation.
Thanks in advance!

I would think many local client browsers would run out of memory at that scale. I'd use Ajax to pass the queries back to some more traditional server-side database technology.

If you are developing this for an in-house type of project, you could definitely look into HTML5 client-side storage.
See:
http://www.webkit.org/blog/126/webkit-does-html5-client-side-database-storage/
https://developer.mozilla.org/En/Storage
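Whichever storage layer is chosen, the queries from the question are just simple predicates over the records. A minimal in-memory sketch in plain JavaScript (the field names p1/p2/f1... are hypothetical, and the sample data is made up for illustration):

```javascript
// Hypothetical record shape: { p1, p2, f1, f2, ... }
// p1/p2 are the two phrase fields; f1, f2, ... are the numeric features.
const records = [
  { p1: "red apple",          p2: "green pear tree",          f1: 0.9, f2: 1.2 },
  { p1: "one two three four", p2: "five",                     f1: 0.1, f2: 3.4 },
  { p1: "short",              p2: "a much longer phrase here", f1: 2.5, f2: 0.7 },
];

// count the words in a phrase field
const words = s => s.trim().split(/\s+/).length;

// find up to `limit` records where a phrase field is `n` words or less
const maxWords = (recs, field, n, limit) =>
  recs.filter(r => words(r[field]) <= n).slice(0, limit);

// find up to `limit` records where a field contains a specific string
const containing = (recs, field, needle, limit) =>
  recs.filter(r => r[field].includes(needle)).slice(0, limit);

// find the `limit` records with the highest values for a feature
const topByFeature = (recs, feat, limit) =>
  [...recs].sort((a, b) => b[feat] - a[feat]).slice(0, limit);

// find up to `limit` records where the first field is longer than the second
const firstLonger = (recs, limit) =>
  recs.filter(r => words(r.p1) > words(r.p2)).slice(0, limit);
```

At 2-5 million records, linear scans like these are exactly what a server-side database with proper indexes would avoid, which is the trade-off the answers above are pointing at.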

Related

What is the best way to do complicated string search on 5M records? Application layer or DB layer?

I have a use case where I need to do complicated string matching on records, of which there are about 5.1 million. When I say complicated string matching, I mean using a library to do fuzzy string matching. (http://blog.bripkens.de/fuzzy.js/demo/)
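For reference, the core idea behind such a matcher is subsequence matching; a few-line illustration in the spirit of fuzzy.js (not the library's actual API):

```javascript
// Subsequence-style fuzzy match (illustrative only, not fuzzy.js's API):
// every character of the pattern must appear, in order, in the text.
function fuzzyMatch(pattern, text) {
  const p = pattern.toLowerCase();
  const t = text.toLowerCase();
  let i = 0; // position of the next pattern character to find
  for (const ch of t) {
    if (i < p.length && ch === p[i]) i++;
  }
  return i === p.length;
}
```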
The database we use at work is SAP HANA, which is excellent for retrieving and querying because it is in-memory, so I would like to avoid pulling the data out of it and re-populating memory on the application layer; but at the same time I cannot take advantage of the libraries (there is an API for fuzzy matching in the DB, but it is not comprehensive enough for us).
What is the middle ground here? If I do pre-processing and associate words in the DB with certain keywords the user might search for, I can cut down the overhead, but are there any best practices that are employed when it comes to this?
If it matters: the list is a list of billing descriptors (the ones that show up on credit-card statements), so the user will search these descriptors to find out which companies a descriptor belongs to.
Assuming your "billing descriptor" is a single column, probably of type (N)VARCHAR, I would start with a very simple SAP HANA fuzzy search, e.g.:
SELECT top 100 SCORE() AS score, <more fields>
FROM <billing_documents>
WHERE CONTAINS(<bill_descr_col>, <user_input>, FUZZY(0.7))
ORDER BY score DESC;
Maybe this is already good enough if you then apply your JS library to the result set. If not, I would start to experiment with the similarCalculationMode option, e.g. 'similarcalculationmode=substringsearch'. And I would always keep an eye on the response times; they can be higher when using some of the options.
Only if response times are too high, or many concurrent users are running your query, would I try to create a fuzzy search index on your search column. If you need more search options, you can also create a full-text index.
But that all really depends on your use case, the values you want to compare, etc.
There is a very comprehensive set of features and options for different use cases; check help.sap.com/hana/SAP_HANA_Search_Developer_Guide_en.pdf.
In one project we did a free-style search on several address columns (name, surname, company name, post code, street) and got response times of 100-200 ms on about 6 million records WITHOUT using any special indexes.

MySQL suggestions on DB design of N° values in 1 column or 1 column for value

I need to move my local project to a webserver, and it is time to start saving things locally (user progress and history).
The main idea is that every 50 ms or so the webapp will calculate 8 values related to the user who is using it.
My questions are:
Should I use MySQL to store the data? At the moment I'm using a plain text file with a predefined format like:
Option1,Option2,Option3
Iteration 1
value1,value2,value3,value4,value5
Iteration 2
value1,value2,value3,value4,value5
Iteration 3
value1,value2,value3,value4,value5
...
If so, should I use 5 (or more in the future) columns (one for each value), with the iteration number as their ID? Keep in mind I will have 5000+ iterations per session (roughly 4 minutes).
Each user can have 10-20 sessions a day.
Will the DB become too big to be efficient?
Given the sample rate, a call to the DB every 50 ms seems a problem to me (especially since I have to animate the webpage heavily). I was wondering if it would be better to implement a Save button that populates the DB with all 5000+ values in one go. If so, what would be the best way?
Would it be better to save the *.txt directly in a folder on the webserver? Something like DB/usernameXYZ/dateZXY/filename_XZA.txt . To me yes, way less effort. If so, which function allows me to do that (possibly JS/HTML)?
The rules are simple, and are discussed in many Q&A here.
With rare exceptions...
Do not have multiple tables with the same schema. (Eg, one table per User)
Do not splay an array across columns. Use another table.
Do not put an array into a single column as a commalist. Exception: If you never use SQL to look at the individual items in the list, then it is ok for it to be an opaque text field.
Be wary of EAV schema.
Do batch INSERTs or use LOAD DATA. (10x speedup over one-row-per-INSERT)
Properly indexed, a billion-row table performs just fine. (Problem: It may not be possible to provide an adequate index.)
Images (a la your .txt files) could be stored in the filesystem or in a TEXT column in the database -- there is no universal answer of which to do. (That is, need more details to answer your question.)
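The batch-INSERT rule can be sketched by building one multi-row statement for a whole session instead of 5000+ single-row ones. A rough JavaScript sketch (table and column names are hypothetical; `?` placeholders in the style used by typical MySQL drivers):

```javascript
// Turn an array of sampled rows into a single multi-row INSERT.
// Returns the SQL string and a flat parameter array, to be passed
// to whatever driver executes the query.
function batchInsertSQL(table, columns, rows) {
  const tuple = '(' + columns.map(() => '?').join(', ') + ')';
  const sql = `INSERT INTO ${table} (${columns.join(', ')}) VALUES ` +
              rows.map(() => tuple).join(', ');
  const params = rows.flat(); // values in column order, row by row
  return { sql, params };
}
```

Buffering iterations client-side and flushing them in one such statement (or via LOAD DATA from a staged file) is what gives the ~10x speedup mentioned above.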
"calculate 8 values that are related to the user" -- to vague. Some possibilities:
Dynamically computing a 'rank' is costly and time-consuming.
Summary data is best pre-computed
Huge numbers (eg, search hits) are best approximated
Calculating age from birth date - trivial
Data sitting in the table for that user is, of course, trivial to get
Counting number of 'friends' - it depends
etc.

Meteor, mongodb - canteen optimization

TL;DR:
I'm making an app for a canteen. I have a collection with the persons and a collection where I "log" every meal taken. I need to know who DIDN'T take the meal.
Long version:
I'm making an application for my local Red Cross.
I'm trying to optimize this situation:
there is a canteen at which the people we help can take food at breakfast, lunch and supper. We need to know how many took the meal (and this is easy).
if they are present they HAVE TO take the meal and eat, so we need to know how many (and who) HAVEN'T eaten (this is the part that I need to optimize).
When they take the meal, the "cashier" inserts their barcode and the program logs the "transaction" in the log collection.
Currently, on creation of the "canteen" template I create a local collection "Meals" and populate it with the data of all the people in the DB (ID, name, fasting/satiated); then I use this collection for my counters and to display who took the meal and who didn't.
(The variable "mealKind" is "breakfast", "lunch" or "dinner", depending on the meal currently being served.)
Template.canteen.created = function(){
    Meals = new Mongo.Collection(null);
    var today = new Date(); today.setHours(0,0,1);
    var pers = Persons.find({"status":"present"}, {fields:{"Name":1, "Surname":1, "barcode":1}}).fetch();
    pers.forEach(function(uno){
        var vediamo = Log.findOne({"dest":uno.barcode, "what":mealKind, "when":{"$gte": today}});
        if (typeof vediamo == "object"){
            uno['eat'] = "satiated";
        } else {
            uno['eat'] = "fasting";
        }
        Meals.insert(uno);
    });
};
Template.canteen.destroyed = function(){
    Meals.remove({});
};
From the Meals collection I extract the two columns of people satiated (with name, surname and barcode) and fasting, and I also use two helpers:
fasting: function(){
    return Meals.find({"eat":"fasting"});
},
countFasting: function(){
    return Meals.find({"eat":"fasting"}).count();
}
// same for satiated
This was OK, but now the number of people is really increasing (we are around 1000 and counting) and the creation of the page is very, very slow; it usually stops with errors, so I read something like "100 fasting, 400 satiated" although I have around 1000 persons in the DB.
I can't figure out how to optimize the workflow; every other method that I tried involved (in one way or another) more queries to the DB. I think I missed the point and now cannot see it.
I'm not sure about aggregation at this level and inside Meteor, because of minimongo.
Although making this server-side rather than client-side is clever, the problem here is HOW to discriminate "fasting" vs "satiated" without cycling through the whole person collection.
+1 if the solution is compatible with aldeed:tabular
EDIT
I am still not sure about what is causing your performance issue (too many things in client memory / minimongo, too many calls to it?), but you could at least try different approaches, more traditionally based on your server.
By the way, you did not mention how you display your data, nor how you get the incorrect reading for your number of already served / missing Persons.
If you are building a classic HTML table, please note that browsers struggle rendering more than a few hundred rows. If you are in that case, you could implement a client-side table pagination / infinite scrolling. Look for example at jQuery DataTables plugin (on which is based aldeed:tabular). Skip the step of building an actual HTML table, and fill it directly using $table.rows.add(myArrayOfData).draw() to avoid the browser limitation.
Original answer
I do not exactly understand why you need to duplicate your Persons collection into a client-side Meals local collection.
This requires that all documents of Persons are first sent from server to client (this may not be problematic if your server is well connected / local; you may also still have the autopublish package on, so you would already have seen that penalty), and then cloning all documents (checking your Logs collection to retrieve any previous passages), effectively doubling your memory need.
Is your server and/or remote DB that slow to justify your need to do everything locally (client side)?
It could be much more problematic should you have more than one "cashier" / client browser open: their Meals local collections will not be synchronized.
If your server-client connection is good, there is no reason to do everything client side. Meteor will automatically cache just what is needed, and provide optimistic DB modification to keep your user experience fast (should you structure your code correctly).
With aldeed:tabular package, you can easily display your Persons big table by "pages".
You can also link it with your Logs collection using dburles:collection-helpers (IIRC there is an example on the aldeed:tabular home page).
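As for discriminating "fasting" vs "satiated" without one Log query per person: fetch today's Log rows once, build a Set of served barcodes, and mark everyone in a single pass. A plain-array sketch (field names taken from the question's code; the Meteor publication/method wiring is omitted):

```javascript
// persons: [{ barcode, Name, Surname, ... }]
// servedLogs: today's Log documents for the current mealKind
// One pass over persons, one Set lookup each -- no per-person DB query.
function markMeals(persons, servedLogs) {
  const served = new Set(servedLogs.map(l => l.dest)); // barcodes already served
  return persons.map(p => ({
    ...p,
    eat: served.has(p.barcode) ? 'satiated' : 'fasting',
  }));
}
```

Run server-side, this needs exactly two queries (Persons and Log) regardless of how many people are in the canteen.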

Saving Formula Patterns

Short: I need a way to save formulas so that I can execute them when needed.
Details: I am writing something for an e-commerce system, so that the price of a product can be calculated from the volume of the product. I want the backend user (admin, seller of the product) to be able to set custom formulas for different ways of calculating the volume [e.g. (A x B - C x D) x E; A x B x (C - D)]. They differ in the operations used (*, -, /, +) and in the number of variables used in the formula.
I need a way to save these formulas in PHP (a plain string is obviously a bad idea), so that I can use them when I need them (set A, B, C, D, E to values and get the result) and also pass them to JavaScript and use them there too.
I appreciate any input on how this could be done.
Wow, nice problem! Here are some ideas.
a) Metaprogramming (is that the right word?)
In JS you can define a formula with a simple statement:
var f = new Function('a', 'alert(a)');
You can write a startup script which reads all the formulas and loads them dynamically (querying a web service, for example). The function can then be called as expected:
f('Here I am');
It has been many years since I learned PHP, but I think you could try dynamic includes. When the user defines a formula, you could generate a file in a special folder containing the PHP code representing it. If PHP allows function references, the formulas could be accessed like:
$formulaBag['MyFormula']('Here I am')
Since you're allowing the user to insert code into your system, you should avoid "direct programming". Offer your user a small language for formula definition; once that is done, trigger a process to parse the formula script and generate the corresponding JS and PHP code.
b) Dynamic parsing
It seems you have some performance concerns... if you define a simple language, parsing it in PHP would not be that difficult. Do you expect to run many formulas in a long-running process? If not, maybe this is the better option, since you will not need to generate dynamic includes (a security risk?). Calculating a formula would then look something like:
$res = ExecuteFormula($formula_name, $array_of_parameters)
[Added]
Parsing is not that difficult. You will need a finite state machine and a stack to hold temporary values while handling parentheses.
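A minimal sketch of that idea in JavaScript, assuming well-formed formulas with single-letter variables and the four operators only (an illustration, not production code): the shunting-yard algorithm converts the formula to postfix using an operator stack, then a second stack evaluates it.

```javascript
// Operator precedences for the four supported operators.
const PREC = { '+': 1, '-': 1, '*': 2, '/': 2 };

// Convert an infix formula like "(A*B-C*D)*E" to postfix (RPN).
function toRPN(formula) {
  const out = [], ops = [];
  for (const tok of formula.match(/[A-Z]|[+\-*/()]/g)) {
    if (/[A-Z]/.test(tok)) out.push(tok);       // variable -> output
    else if (tok === '(') ops.push(tok);
    else if (tok === ')') {
      while (ops[ops.length - 1] !== '(') out.push(ops.pop());
      ops.pop();                                // discard '('
    } else {                                    // operator
      while (ops.length && PREC[ops[ops.length - 1]] >= PREC[tok])
        out.push(ops.pop());
      ops.push(tok);
    }
  }
  return out.concat(ops.reverse());
}

// Evaluate a postfix expression with a map of variable values.
function evalRPN(rpn, vars) {
  const st = [];
  for (const tok of rpn) {
    if (/[A-Z]/.test(tok)) st.push(vars[tok]);
    else {
      const b = st.pop(), a = st.pop();
      st.push(tok === '+' ? a + b : tok === '-' ? a - b :
              tok === '*' ? a * b : a / b);
    }
  }
  return st[0];
}
```

The RPN array is also a natural storage format: save it (e.g. as JSON) and evaluate it with the same small routine in PHP or JS, with no dynamic code generation at all.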
Well, I hope it helps.

StarDict support for JavaScript and a Firefox OS App

I wrote a dictionary app in the spirit of GoldenDict (www.goldendict.org, also see Google Play Store for more information) for Firefox OS: http://tuxor1337.github.io/firedict and https://marketplace.firefox.com/app/firedict
Since apps for ffos are based on HTML, CSS and JavaScript (WebAPI etc.), I had to write everything from scratch. At first, I wrote a basic library for synchronous and asynchronous access to StarDict dictionaries in JavaScript: https://github.com/tuxor1337/stardict.js
Although the app can be called stable by now, overall performance is still a bit sluggish. For some dictionaries, I have a list of almost 1,000,000 words! That's huge. Indexing takes a really long time (up to several minutes per dictionary), and so does lookup. At the moment, the words are stored in an IndexedDB object store. Is there an alternative? With the current solution (words accessed and inserted using binary search) the overall experience is pretty slow. Maybe it would become faster if there were locale-aware sort support in IndexedDB... Actually, I'm not even storing the terms themselves in the DB, only their offsets in the *.syn/*.idx file; I hope to save some memory that way. But of course I'm then not able to use any IDB sorting functionality with this configuration...
Maybe it's not the best idea to do the sorting in memory, because now the app is killed by the kernel due to an OOM on some devices (e.g. ZTE Open). A dictionary with more than 500,000 entries will definitely exceed 100 MB in memory. (That's only 200 bytes per entry, and if you assume the keyword strings are UTF-8, you'll exceed 100 MB immediately...)
Feel free to contribute directly to the project on GitHub. Otherwise, I would be glad to hear your advice concerning the above issues.
I am working on a pure JavaScript implementation of an MDict parser (https://github.com/fengdh/mdict-js), similar to your StarDict project. MDict is another popular dictionary format with rich content (embedded images/audio/CSS etc.), which is widely supported on Windows/Linux/iOS/Android/Windows Phone. I have some ideas to share, and hope you can apply them to improve stardict.js in the future.
An MDict dictionary file (mdx/mdd) divides keywords and records into (optionally compressed) blocks, each containing around 2000 entries, and also provides a keyword block index table and a record block index table to enable quick look-up. Because of this compact data structure, I can implement my MDict parser to scan directly on the dictionary file with a small pre-loaded index table and no need for IndexedDB.
Each keyword block index looks like:
{
  num_entries: ..,
  first_word: ..,
  last_word: ..,
  comp_size: ..,   // size in compression
  decomp_size: .., // size after decompression
  offset: ..,      // offset in mdx file
  index: ..
}
In a keyword block, each entry is a pair of [keyword, offset].
Each record block index looks like:
{
  comp_size: ..,   // size in compression
  decomp_size: .., // size after decompression
}
Given a word, use binary search to locate the keyword block that may contain it.
Slice the keyword block, load all keys in it, filter out the matching one and get its record offset.
Use binary search to locate the record block containing the word's record.
Slice the record block and retrieve its record (a definition in text or a resource in an ArrayBuffer) directly.
Since each block contains only around 2000 entries, it is fast enough to look up a word among 100K-1M dictionary entries within 100 ms, quite a decent value for human interaction. mdict-js parses the file header only, so it is super fast and has low memory usage.
In the same way, it is possible to retrieve a list of neighboring words for a given phrase, even with wildcards.
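The first step above can be sketched as a binary search over the keyword block index (field names as in the index structure shown earlier; sample data is hypothetical):

```javascript
// Binary search over the keyword block index to find the block that
// may contain `word`. Blocks are sorted and cover disjoint key ranges;
// returns the matching block entry, or null if the word falls in a gap.
function findKeywordBlock(blockIndex, word) {
  let lo = 0, hi = blockIndex.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1, blk = blockIndex[mid];
    if (word < blk.first_word) hi = mid - 1;
    else if (word > blk.last_word) lo = mid + 1;
    else return blk; // first_word <= word <= last_word
  }
  return null;
}
```

Only the block found this way needs to be sliced and decompressed, which is why the pre-loaded index table can stay so small.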
Please take a look on my online demo here: http://fengdh.github.io/mdict-js/
(You have to choose a local MDict dictionary: an mdx + optional mdd file.)
