I am new to this area, so the question may seem strange. However, before asking I've read a bunch of introductory articles about the key ideas of machine learning and the moving parts of neural networks, including a very useful one: What is machine learning. Basically, as I understand it, a trained NN is (correct me if I'm wrong):
a set of connections between neurons (maybe self-connected, maybe gated, etc.)
activation probabilities formed on each connection.
Both are adjusted during training to fit the expected output as closely as possible. Then what we do with a trained NN is load the test subset of data into it and check how well it performs. But what if we're happy with the test results and want to store the training results rather than run training again later when the dataset gets new values?
So my question is: is that learned knowledge stored anywhere other than RAM? Can it be dumped (think of object serialisation, in a way) so that you don't need to retrain your NN when you get new data tomorrow or later?
Right now I am trying to make a simple demo with my dataset using synaptic.js, but I could not find any such concept of saving the trained state in the project's wiki.
That library is just an example; a reference to some Python lib would be good too!
With regards to storing it via synaptic.js:
This is quite easy! Synaptic actually has built-in support for it. There are two ways to do it.
If you want to use the network without training it again
This will create a standalone function from your network, which you can use anywhere in JavaScript without requiring synaptic.js! Wiki
var standalone = myNetwork.standalone();
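If I recall the API correctly, the resulting function simply takes an input array and returns the network's output; the 2-input example below is only illustrative:
// Hypothetical usage of the standalone function (no synaptic.js needed):
var output = standalone([0, 1]);  // input array sized to your network
console.log(output);              // e.g. [0.98] for a trained XOR-style network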
If you want to modify the network later on
Just convert your network to JSON. This can be loaded up again anytime with synaptic.js! Wiki
// Export the network to a JSON which you can save as plain text
var exported = myNetwork.toJSON();
// Convert the JSON back to a usable network
var imported = Network.fromJSON(exported);
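The exported object is plain JSON, so you can persist it wherever you like. A minimal sketch, assuming a Node.js environment (the file name and the synaptic import are just illustrative):
// Save the trained network to disk (Node.js)
var fs = require('fs');
var Network = require('synaptic').Network;
fs.writeFileSync('network.json', JSON.stringify(myNetwork.toJSON()));

// Later, restore it without retraining
var restored = Network.fromJSON(JSON.parse(fs.readFileSync('network.json', 'utf8')));
In the browser you could do the same with localStorage instead of the file system.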
I will assume in my answer that you are working with a simple multi-layer perceptron (MLP), although my answer is applicable to other networks too.
The purpose of 'training' an MLP is to find the correct synaptic weights that minimise the error on the network output.
When a neuron is connected to another neuron, its input is given a weight. The neuron performs a function, such as the weighted sum of all inputs, and then outputs the result.
Once you have trained your network, and found these weights, you can verify the results using a validation set.
If you are happy that your network is performing well, you simply record the weights that you applied to each connection. You can store these weights wherever you like (along with a description of the network structure) and then retrieve them later. There is no need to re-train the network every time you would like to use it.
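To make that concrete, here is a purely illustrative sketch, independent of any particular library; the numbers and the sigmoid activation are made up for the example:
// A trained neuron is nothing more than its weights (and bias).
var neuron = { weights: [0.42, -1.3, 0.07], bias: 0.5 };

// "Storing the training" = serialising those numbers somewhere durable...
var saved = JSON.stringify(neuron);   // file, database, localStorage, ...

// ...and reusing them later = reading them back and activating as usual.
var loaded = JSON.parse(saved);
function activate(inputs) {
  var sum = loaded.bias;
  for (var i = 0; i < inputs.length; i++) sum += inputs[i] * loaded.weights[i];
  return 1 / (1 + Math.exp(-sum));    // sigmoid of the weighted sum
}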
Hope this helps.
I have multiple datasets here that I took from Kaggle. There are multiple CSV files, and each CSV file is made specifically for sitting, standing, walking, running, etc. The data comes from sensors like accelerometers and gyroscopes. The values in the datasets are per axis: x, y and z.
Sample Data
Here is a sample dataset of jogging. Now I need to build classifiers in my program so that it can detect on its own whether the data is jogging, sitting, standing, etc. I want to merge all the datasets into a single CSV file, upload it to my webpage, and then have the JavaScript code detect whether a particular row is sitting, standing, jogging and so on. I don't want any code help; I just need a little explanation or a way to start coding it. How can I get started making such a classifier? I know it is kind of a broad question, but I think I have tried to explain myself as well as possible. Once my program has labelled every row with a specific activity, it will count each activity separately and then show the counts in a table on the webpage.
In order to answer your question properly, it would be very helpful to know your level of understanding and experience with machine learning.
If you are a beginner, I would suggest running and trying to understand a couple of tutorials, which can easily be found on the web.
If you need an idea of the "standard" approach to machine learning development, I will try to give you a general idea of the process.
You can summarize the process in these main steps:
Data pre-processing -> Data splitting -> Feature selection -> Model training -> Validation -> Deployment
Data pre-processing is meant to clean and format the data: removing NA values, deciding how to handle categorical variables, outlier analysis, and so on. This is a complex step that depends on the application. In your case I would start by checking that the data in the different datasets are homogeneous, i.e. the features have the same meaning across the CSV files and corresponding features follow the same distribution. While the meaning of each feature should be explained in the description of your CSV, checking the distributions can easily be done by plotting box plots for each feature and each CSV file. If the distributions of the same feature across different CSV files don't overlap, you should investigate the issue further.
An important step in the design of a good model is the splitting of the data. You should split your data into a training/validation set (training/validation/test for a more comprehensive approach). This step allows you to train your model on the training set and test it on the validation set, giving an unbiased estimate of its performance. I suggest becoming familiar with concepts such as cross-validation, stratified cross-validation, nested cross-validation for hyper-parameter tuning, overfitting, bias, and so on. Validating the model will give you an idea of the performance you can expect on unseen data. If you are considering more than one model, you can use the validation results to choose the "best" one. I suggest a comparison using confidence intervals or, if possible, a significance test (e.g. t-test, ANOVA, ...). Before deployment, the model is trained on all the available data.
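To make the splitting step concrete, here is a minimal JavaScript sketch of a random train/validation split; the 80/20 ratio and the allRows variable are just assumptions, and for activity data you would ideally also stratify by activity label:
// Fisher-Yates shuffle, then take e.g. 80% for training and 20% for validation.
function splitData(rows, trainFraction) {
  var shuffled = rows.slice();
  for (var i = shuffled.length - 1; i > 0; i--) {
    var j = Math.floor(Math.random() * (i + 1));
    var tmp = shuffled[i]; shuffled[i] = shuffled[j]; shuffled[j] = tmp;
  }
  var cut = Math.floor(shuffled.length * trainFraction);
  return { train: shuffled.slice(0, cut), validation: shuffled.slice(cut) };
}

var split = splitData(allRows, 0.8);  // allRows: your merged, labelled samples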
The choice of model depends on the data you are using: number of samples, number of features, types of variables (numerical, categorical), and so on.
I'm not an expert in JavaScript, but I believe (just a feeling) that Python and R are more common choices for developing machine learning applications. Both have libraries specifically developed for the task, and you can find a lot of material and tutorials around.
With a bit more context I think I could be more specific.
I hope this helps.
I'm trying to show a list of lunch venues around the office with today's menus. But the problem is that the websites offering the lunch menus don't always offer the same kind of content.
For instance, some of the websites offer a nice JSON output. Look at this one: it offers the English/Finnish course names separately, and everything I need is available. There are a couple of others like this.
But others don't always have a nice output, like this one. The content is laid out in plain HTML, the English and Finnish food names are not in any particular order, and food properties (L, VL, VS, G, etc.) are just normal text, the same as the food names.
What, in your opinion, is the best way to scrape all this data in its different formats and turn it into usable data? I tried to make a scraper with Node.js (& PhantomJS, etc.), but it only works with one website, and it's not that accurate when it comes to the food names.
Thanks in advance.
You may use something like kimonolabs.com; it is much easier to use and gives you APIs to keep your side updated.
Remember that it works best for tabular data.
There may be simple algorithmic solutions to the problem. If there is a list of all available food names, that can be really helpful: you simply look for occurrences of the food names inside the document (for today).
If there is no food list, you may use TF-IDF. TF-IDF lets you calculate a score for a word in a document relative to both the current document and the other documents. But this solution needs enough data to work.
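To illustrate the idea (not a production implementation), a tiny TF-IDF scorer might look like the following sketch, where docs is assumed to be an array of tokenised documents:
// Score one word of a document against a small collection of documents.
// tf = frequency of the word in the document; idf = how rare it is overall.
function tfidf(word, doc, docs) {
  var tf = doc.filter(function (w) { return w === word; }).length / doc.length;
  var docsWithWord = docs.filter(function (d) { return d.indexOf(word) !== -1; }).length;
  var idf = Math.log(docs.length / (1 + docsWithWord));
  return tf * idf;
}

// Example: words common in this menu but rare elsewhere score highest.
var docs = [['pea', 'soup', 'pea'], ['chicken', 'soup'], ['beef', 'stew']];
console.log(tfidf('pea', docs[0], docs));   // relatively high
console.log(tfidf('soup', docs[0], docs));  // lower: 'soup' appears in other menus too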
I think the best solution is something like this (see the sketch after this list):
Creating a list of all available websites that should be scraped.
Writing a driver class for each website's data.
Each driver has the duty of creating the general domain entity from its site-specific document.
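A rough sketch of that driver idea; the site keys, field names and the common entity shape are all invented for illustration:
// Each driver knows how to turn one site's raw page into the common model.
var drivers = {
  'nice-json-site': function (raw) {
    var data = JSON.parse(raw);
    return data.courses.map(function (c) {           // field names are assumptions
      return { nameEn: c.title_en, nameFi: c.title_fi, properties: c.properties || [] };
    });
  },
  'plain-html-site': function (raw) {
    // parse the HTML (e.g. with cheerio) and map it to the same shape
    return [];                                        // omitted for brevity
  }
};

function scrape(siteKey, raw) {
  return drivers[siteKey](raw);                       // always returns the common model
}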
If you can use PHP, Simple HTML DOM Parser along with Guzzle would be a great choice. These two provide a jQuery-like selector syntax and a nice wrapper around HTTP.
You are touching on a really difficult problem. Unfortunately, there are no easy solutions.
Actually, there are two different parts to solve:
data scraping from different sources
data integration
Let's start with the first problem: data scraping from different sources. In my projects I usually process data in several steps. I have dedicated scrapers for all the specific sites I want, and process them in the following order:
fetch raw page (unstructured data)
extract data from page (unstructured data)
extract, convert and map data into page-specific model (fully structured data)
map data from fully structured model to common/normalized model
Steps 1-2 are scraping oriented and steps 3-4 are strictly data-extraction / data-integration oriented.
While you can implement steps 1-2 relatively easily using your own web scrapers or by utilizing existing web services, data integration is the most difficult part in your case. You will probably require some machine-learning techniques (shallow, domain-specific Natural Language Processing) along with custom heuristics.
In the case of messy input like this one, I would process the lines separately and use a dictionary to strip out the Finnish/English words and analyse what is left. But in this case it will never be 100% accurate due to the possibility of human-input errors.
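A rough sketch of that line-by-line dictionary filtering; the word lists and the sample line are invented for illustration, and real word lists would be far larger:
// Strip ordinary dictionary words; whatever is left are candidate
// tokens (diet codes, unusual food names) to analyse further.
var finnishWords = ['hernekeitto', 'ja'];             // placeholder word list
var englishWords = ['pea', 'soup', 'and'];            // placeholder word list

function leftovers(line) {
  return line.toLowerCase().split(/\s+/).filter(function (w) {
    return finnishWords.indexOf(w) === -1 && englishWords.indexOf(w) === -1;
  });
}

console.log(leftovers('Hernekeitto Pea soup L G'));   // -> ['l', 'g'] (diet codes)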
I am also worried that your stack is not very well suited to such tasks. For this kind of processing I use Java/Groovy along with integration frameworks (Mule ESB / Spring Integration) to coordinate the data processing.
In summary: it is a really difficult and complex problem. I would rather accept less input-data coverage than aim to be 100% accurate (unless it is really worth it).
We are investigating using Breeze for field deployment of some tools. The scenario is this -- an auditor will visit sites in the field, where most of the time there will be no -- or very degraded -- internet access. Rather than replicate our SQL database on all the laptops and tablets (if that's even possible), we are hoping to use Breeze to cache the data and then store it locally so it is accessible when there is not a usable connection.
Unfortunately, Breeze seems to choke when caching any significant amount of data. Generally on Chrome it's somewhere between 8 and 13MB worth of entities (as measured by the HTTP response headers). This can change a bit depending on how many tabs I have open and such, but I have not been able to move that figure by more than 10%. The error I get is that the Chrome tab crashes and tells me to reload. The error is reproducible (I download the data in 100K chunks and it fails on the same read every time, and works fine if I stop it after the previous read). When I change the page size, it always fails within the same range.
Is this a limitation of Breeze, or Chrome? Or Windows? I tried it on Firefox, and it handles even less data before the whole browser crashes. IE fares a little better, but none of them do great.
Looking at performance in task manager, I get the following:
IE goes from 250MB of memory usage to 1.7GB during the caching process and caches a total of about 14MB before throwing an out-of-memory error.
Chrome goes from 206MB of memory usage to about 850MB while caching a total of around 9MB.
Firefox goes from around 400MB to about 750MB and manages to cache about 5MB before the whole program crashes.
I can calculate how much will be downloaded with any selection criteria, but I cannot find a way to calculate how much data can be handled by any specific browser instance. This makes using Breeze for offline auditing close to useless.
Has anyone else tackled this problem yet? What are the best approaches to handling something like this? I've thought of several things, but none of them are ideal. Any ideas would be appreciated.
ADDED at Steve Schmitt's request:
Here are some helpful links:
Metadata
Entity Diagram (pdf) (and html and edmx)
The first query, just to populate the tags on the page, runs quickly and downloads minimal data:
var query = breeze.EntityQuery
.from("Countries")
.orderBy("Name")
.expand("Regions.Districts.Seasons, Regions.Districts.Sites");
Once the user has selected the Sites s/he wishes to cache, the following two queries are kicked off (it used to be one query, but I broke it into two hoping it would be less of a burden on resources -- it didn't help). The first query (usually 2-3K entities and about 2MB) runs as expected. Some combination of the predicates listed is used to filter the data.
var qry = breeze.EntityQuery
.from("SeasonClients")
.expand("Client,Group.Site,Season,VSeasonClientCredit")
.orderBy("DistrictId,SeasonId,GroupId,ClientId")
var p = breeze.Predicate("District.Region.CountryId", "==", CountryId);
var p1 = breeze.Predicate("SeasonId", "==", SeasonId);
var p2 = breeze.Predicate("DistrictId", "==", DistrictId);
var p3 = breeze.Predicate("Group.Site.SiteId", "in", SiteIds);
After the first query runs, the second query (below) runs, also using some combination of the listed predicates to filter the data; at about 9MB, it will have around 50K rows to download. When the total download burden across the two queries is between 10MB and 13MB, the browsers crash.
var qry = breeze.EntityQuery
.from("Repayments")
.orderBy('SeasonId,ClientId,RepaymentDate');
var p1 = breeze.Predicate("District.Region.CountryId", "==", CountryId);
var p2 = breeze.Predicate("SeasonId", "==", SeasonId);
var p3 = breeze.Predicate("DistrictId", "==", DistrictId);
var p4 = breeze.Predicate("SiteId", "in", SiteIds);
Thanks for the interest, Steve. You should know that the Entity Relationships are inherited and currently in production supporting the majority of the organization's operations, so as few changes as possible to that would be best. Also, the hope is to grow this from a reporting application to one with which data entry can be done in the field (so, as I understand it, using projections to limit the data wouldn't work).
Thanks for the interest, and let me know if there is anything else you need.
Here are some suggestions based on my experience building an offline-capable web application using Breeze. Some or all of these might not make sense for your use cases...
Identify which entity types need to be editable versus which are only used to fill drop-downs etc. Load non-editable data using the noTracking query option and cache it in localStorage yourself using JSON.stringify (see the sketch after these suggestions). This avoids the overhead of coercing the data into entities, change tracking, etc. Good candidates for this approach in your model might be entity types like Country, Region, District, Site, etc.
If possible, provide a facility in your application for users to identify which records they want to "take offline". This way you don't need to load and cache everything, which can get quite expensive depending on the number of relationships, entities, properties, etc.
In conjunction with suggestion #2, avoid loading all the editable data at once and avoid using the same EntityManager instance to load each set of data. For example, if the Client entity is something that needs to be editable out in the field without a connection, create a new EntityManager, load a single client (expanding any children that also need to be editable) and cache this data separately from other clients.
Cache the breeze metadata once. When calling exportEntities the includeMetadata argument should be false. More info on this here.
To create new EntityManager instances, make use of the createEmptyCopy method.
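Here is a rough sketch of suggestion #1; it assumes a Breeze EntityManager called manager and invents the "countries" localStorage key:
// Load reference data without entity tracking and stash it in localStorage.
var query = breeze.EntityQuery.from("Countries").noTracking();

manager.executeQuery(query).then(function (data) {
  // data.results is plain data here, not tracked entities
  localStorage.setItem("countries", JSON.stringify(data.results));
});

// Later (possibly offline): read it straight back from localStorage.
var countries = JSON.parse(localStorage.getItem("countries") || "[]");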
EDIT:
I want to respond to this comment:
Say I have a client who has bills and payments. That client is in a group, in a site, in a region, in a country. Are you saying that the client, payment, and bill information might each have their own EM, while the location hierarchy might be in a 4th EM with no-tracking? Then when I refer to them, I wire up the relationships as needed using LINQs on the different EMs (give me all the bills for customer A, give me all the payments for customer A)?
It's a bit of a judgement call in terms of deciding how to separate things out. Some of what I'm suggesting might be overkill; it really depends on the amount of data and the way your application is used.
Assuming you don't need to edit groups, sites, regions and countries while offline, the first thing I'd do would be to load the list of groups using the noTracking option and cache them in localStorage for offline use. Then do the same for sites, regions and countries. Keep in mind, entities loaded with the noTracking option aren't cached in the entity manager so you'll need to grab the query result, JSON.stringify it and then call localStorage.setItem. The intent here is to make sure your application always has access to the list of groups, sites, regions, etc so that when you display a form to edit a client entity you'll have the data you need to populate the group, site, region and country select/combobox/dropdown.
Assuming the user has identified the subset of clients they want to work with while offline, I'd then load each of these clients one at a time (including their payment and bill information, but not expanding their group, site, region, country) and cache each client+payments+bills set using entityManager.exportEntities. The reasoning here is that it doesn't make sense to load several clients plus their payments and bills into the same EntityManager each time you want to edit a particular client. That could be a lot of unnecessary overhead, but again, this is a bit of a judgement call.
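Roughly, for each selected client, that could look like the following sketch; the resource name, the Bills/Payments navigation properties and the localStorage key are guesses based on the question, not your actual model:
// One small EntityManager per client, exported and cached separately.
function cacheClientOffline(masterManager, clientId) {
  var em = masterManager.createEmptyCopy();           // same config, empty cache

  var query = breeze.EntityQuery.from("Clients")
    .where("ClientId", "==", clientId)
    .expand("Bills, Payments");                       // navigation names are assumptions

  return em.executeQuery(query).then(function () {
    // export everything in this small cache (client + bills + payments),
    // excluding metadata, which is cached once separately as suggested above
    var exported = em.exportEntities(null, false);
    localStorage.setItem("client-" + clientId, exported);
  });
}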
@Jeremy's answer was excellent and very helpful, but didn't actually answer the question, which I was starting to think was unanswerable, or at least the wrong question. However, @Steve in the comments gave me the most appropriate information for this question.
It is neither Breeze nor the browser, but rather Knockout. Apparently the Knockout wrapper around the Breeze entities uses all that memory (at least while loading the entities, and in my environment). As described above, Knockout/Breeze would crap out after reading around 5MB of data, causing Chrome to crash with over 1.7GB of memory usage (from a pre-download memory usage of around 300MB). Rewriting the app in AngularJS eliminated the problem. So far I have been able to download over 50MB from the exact same EF6 model into Breeze/Angular, and total Chrome memory usage has never gone above 625MB.
I will be testing larger payloads, but 50 MB more than satisfies my needs for the moment. Thanks everyone for your help.
I'm working on a PhoneGap app and find it most efficient to cache as much user data as possible on the device when the user logs in. (Profile information, etc.) Due to project constraints, I can't use local storage.
I'm making an API call and pulling back a chunk of JSON data that I use to power the app. My specific question: is it somewhat safe to assume that the byte size of the JSON results will be roughly equal to the memory consumed? I.e., if the API call response is 200k of JSON data, will about 200k of memory be used to store it in a JavaScript object?
The best answer is, "Test it."
As several commenters have said, under many common conditions there will be a rough 1:1 correspondence, but there are so many caveats and gotchas, some of which are engine-specific, that it's impossible to say with any certainty.
In other words, if you test it and it looks like the assumption holds under your particular use-case, then 'somewhat safe' is a reasonable assumption. Just don't bet the farm on it.
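If you want a crude way to test it in Chrome specifically, the non-standard performance.memory object can give a very rough signal (garbage collection makes the numbers noisy, so treat this as an order-of-magnitude check only):
// Very rough, Chrome-only: compare heap usage before and after parsing.
function roughMemoryCost(jsonText) {
  if (!performance.memory) return null;               // non-standard, Chrome only
  var before = performance.memory.usedJSHeapSize;
  var parsed = JSON.parse(jsonText);
  var after = performance.memory.usedJSHeapSize;
  console.log('payload bytes:', jsonText.length, 'heap delta:', after - before);
  return parsed;                                       // keep a reference so it isn't collected immediately
}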
I am playing around with CouchDB to test if it is "possible" [1] to store scientific data (simulated and experimental raw data + metadata). A big pro is the schema-less approach of CouchDB: we have to be very flexible with the metadata, as the set of parameters changes very often.
Up to now I have some code to feed raw data, plots (both as attachments), and hierarchical metadata (as JSON) into CouchDB documents, and I have written some prototype JavaScript for filtering and display. But the filtering is done on the client side (a.k.a. the browser): the map function simply returns everything.
How could I change the map function of (or push a second one into) a specific _design document with simple browser JS?
I do not think that a temporary view would yield any performance gain...
Thanks for your time and answers.
[1]: of course it is possible, but is it also useful? feasible? reasonable?
[added]
Ah, jquery.couch.js (version 0.9.0) provides a saveDoc() function, which could update the _design document with the new map function.
But I also tried out the query function, which uses a temporary view. Okay, "do not use this in the real product, only during development"... But scientific research is steady development, right?
Temporary views get cached, as I noticed, and this works well for ~1000 documents per DB. A second plus: all users (think 1 to 3, so heavyweight user management would be quite an overkill) can work with their own temporary views.
Never ever use temporary views. They are really only there for dev and debugging purposes. For more information, see http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views (specifically the bold "NOTE").
And yes, because design documents are really just documents with special powers, you can run your GET/POST/PUT/DELETE methods on them. However, you will usually need admin privileges to do this. So, if you are allowing a client-side piece of software to do that, you are making your entire database public for read/write access - this may be fine for your application, but it is important to remember.
For example, if you restrict access to your database but put the username and password in client-side JavaScript, then anyone can see that username and password.
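With that caveat in mind, here is a minimal sketch of updating a design document's view from the browser; the database name, design doc name and view are invented, jQuery is used for brevity, and jquery.couch.js's saveDoc() wraps the same mechanics:
// Fetch the design doc, change/add a map function, and PUT it back.
// Requires credentials that are allowed to write design docs.
$.getJSON('/mydb/_design/analysis', function (ddoc) {
  ddoc.views = ddoc.views || {};
  ddoc.views.by_parameter = {
    map: "function (doc) { if (doc.metadata) emit(doc.metadata.parameter, doc); }"
  };
  $.ajax({
    url: '/mydb/_design/analysis',
    type: 'PUT',
    contentType: 'application/json',
    data: JSON.stringify(ddoc)        // the fetched doc still contains the current _rev
  });
});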
Cheers.
I've written some helper functions for jquery.couch and design docs; take a look at:
https://github.com/grischaandreew/jquery.couch.js