How to compare every element in a row in Pentaho - javascript

I have an Excel file; each row contains, among other fields, a person's name (Nombre) and address (Dirección).
I am using Pentaho, with the purpose of creating a new field (related_to) in which I will show whether a person has a relation with another one. I consider two people related if they have the same Dirección (address). For instance, María Isabel Hevilla Castro and Miguel Manceras Fernández live at the same place, so the related_to of María Isabel Hevilla Castro will be Miguel Manceras Fernández and, conversely, that of Miguel Manceras Fernández will be María Isabel Hevilla Castro.
I have tried to solve this using a Modified Java Script Value step, but I'm just beginning to learn JavaScript and I don't know how to solve this problem.
Could somebody help me, or at least give me a clue?

If your addresses are clean you can do this with a self-join on Dirección.
The idea is that you sort by Dirección, then duplicate the stream, rename the name field in the copy to something else (Nombre2 or Related_to), and inner-join the two streams on Dirección. This results in a record for every combination of people that share a Dirección, including each person paired with themselves. That is fixed by filtering the rows, keeping only the ones where Nombre is not equal to Nombre2.
The basic flow can be extended with cleanup of address fields (Calculator step can do similarity scores) beforehand or extra processing afterwards for the related_to field.
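Outside of Pentaho, the same pairing logic is easy to express in plain JavaScript. This is only a rough sketch, assuming each row has been read into an object with nombre and direccion properties; the property names are placeholders for whatever your sheet actually uses.

// rows: array of { nombre, direccion } objects read from the spreadsheet
function addRelatedTo(rows) {
  // group the names living at each address
  const byAddress = new Map();
  for (const row of rows) {
    if (!byAddress.has(row.direccion)) byAddress.set(row.direccion, []);
    byAddress.get(row.direccion).push(row.nombre);
  }
  // for every person, "related to" is everyone else at the same address
  return rows.map(row => ({
    ...row,
    relatedTo: byAddress.get(row.direccion)
      .filter(n => n !== row.nombre)
      .join(', ')
  }));
}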

This is likely better accomplished using a loop in something like Python, R, or Javascript as you already mentioned.
Pentaho is fundamentally designed to process data on a row-by-row basis. There aren't that many functions in Pentaho that allow you to do analysis across a column of data.
If you have to use Pentaho for this rather than something like Python or Javascript, then I'd suggest sorting on the Direccion column and then using the Analytic Query step to compare across rows (a rough sketch of that idea is below). This will probably only work if you have a maximum of two people per address, but it might get you where you need to go.
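To illustrate what that approach does, and its limitation, here is a sketch in plain JavaScript that compares each row only with its immediate neighbours after sorting by address, much like a LEAD/LAG over the sorted stream. It only finds a partner when at most two people share an address; the property names are again placeholders.

// rows must already be sorted by direccion, as a Sort rows step would do
function relateNeighbours(rows) {
  return rows.map((row, i) => {
    const prev = rows[i - 1];
    const next = rows[i + 1];
    let related = '';
    if (prev && prev.direccion === row.direccion) related = prev.nombre;      // LAG
    else if (next && next.direccion === row.direccion) related = next.nombre; // LEAD
    return { ...row, relatedTo: related };
  });
}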

Related

Designing SQLite table for Elements that have custom numbers of custom fields

I have an interesting situation where I'm working with posts. I don't know how the user will want to structure the posts. A post would either be one block of text, or structured as a -> b -> c, where a, b, and c are all text blocks; if represented as a table, there would be an unknown number of columns and an unknown number of rows.
Outside of the post data, there is the possibility of adding custom attributes to the post. Most of these would be shorter text strings, but an unknown number of them.
Understanding that a JSON object would probably be the simplest solution, I have to fit this into a self-hosted DB. SQLite seems to be the currently accepted solution for RedwoodJS, the framework I'm building on. How would I go about storing this kind of data within RedwoodJS using the Prisma client that it comes with?
Edit: The text blocks need to be separate when displaying the post and able to be referenced separately. There is another part of the project that will link to each text block specifically. The user would be choosing how many columns there are before entering any posts (configured in settings), but the rows would have to be updated dynamically. Closest example I can think of is like a test management software where you have precondition, execution steps, and expected results across the top for columns, and each additional step is a row.
Well, there are two routes that you could take. If possible, use a NoSQL database such as MongoDB, which Prisma has support for. There you would be able to create a JSON-like structure with as many or as few paragraphs as you would like.
If that is not possible, a workaround (since SQLite does not have a dedicated JSON column type) is to store the stringified JSON data in a text field and then parse it when reading. This is not the optimal solution, so if possible use the first one.
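A minimal sketch of that second option, assuming a hypothetical Post model whose content column is declared as String in the Prisma schema (the model and field names here are made up for illustration):

// schema.prisma (excerpt):
// model Post {
//   id      Int    @id @default(autoincrement())
//   content String   // stringified JSON lives here
// }

const { PrismaClient } = require('@prisma/client');
const prisma = new PrismaClient();

async function savePost(blocks) {
  // blocks can be any JSON-serialisable structure, e.g. [['a', 'b', 'c'], ...]
  return prisma.post.create({ data: { content: JSON.stringify(blocks) } });
}

async function loadPost(id) {
  const post = await prisma.post.findUnique({ where: { id } });
  // parse the text column back into the original structure
  return { ...post, content: JSON.parse(post.content) };
}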

Classifier to Predict activities taking place

I have multiple datasets here that I took from Kaggle. There are multiple CSV files, and each CSV file is made specifically for sitting, standing, walking, running, etc. The data is taken from sensors like accelerometers and gyroscopes. The values in the datasets are axis readings (x, y and z).
As an example, one of the files is a jogging dataset. Now I need to build classifiers in my program so that it can detect by itself whether the data is jogging, sitting, standing, etc. I want to merge all the datasets into a single CSV file, upload it to my webpage, and then have the JavaScript code detect whether a particular row is sitting, standing, jogging, etc. I don't want any code help; I just need a little explanation or a way to start coding it. How can I get started making such a classifier? I know it is kind of a broad question, but I think I have explained myself in the best way possible. Once my program has labelled every row with a specific activity, it will count all the activities separately and then show them in a table on the webpage.
To answer your question properly, it would be very helpful to know your level of understanding and experience with machine learning.
If you are a beginner, I would suggest running and trying to understand a couple of tutorials, which can easily be found on the web.
If you need an idea of the "standard" approach to machine learning development, I will try to give you a general outline of the process.
You can summarize the process in these main steps:
Data pre-processing -> Data splitting -> Feature selection -> Model training -> Validation -> Deployment
Data pre-processing is meant to clean and format the data: removing NA values, making decisions about categorical variables, outlier analysis, and so on. This is a complex step that depends on the application. In your case I would start by checking that the data in the different datasets are homogeneous, i.e. that the features have the same meaning across CSV files and that corresponding features follow the same distribution. While the meaning of each feature should be explained in the description of your CSVs, the distributions can easily be checked by plotting box plots for each feature and CSV (a rough sketch of such a check is given below). If the distributions of the same feature across different CSV files don't overlap, you should investigate the issue further.
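For example, a quick homogeneity check can be done without any ML library at all: compute a few summary statistics per feature for each CSV and compare them. A rough sketch in plain JavaScript (x, y, z are just the accelerometer axes mentioned in the question):

// rows: array of { x, y, z } objects parsed from one CSV file
function summarise(rows, feature) {
  const values = rows.map(r => Number(r[feature]));
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance = values.reduce((a, b) => a + (b - mean) ** 2, 0) / values.length;
  return {
    mean,
    std: Math.sqrt(variance),
    min: values.reduce((a, b) => Math.min(a, b)),
    max: values.reduce((a, b) => Math.max(a, b)),
  };
}
// compare summarise(joggingRows, 'x') with summarise(walkingRows, 'x'), and so on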
An important step in the design of a good model is the splitting of the data. You should split your data into a training and a validation set (training/validation/test for a more comprehensive approach); a minimal illustration is sketched below. This allows you to train your model on the training set and evaluate it on the validation set, giving an unbiased estimate of its performance. I suggest becoming familiar with concepts such as cross-validation, stratified cross-validation, nested cross-validation for hyper-parameter tuning, overfitting, and bias. The validation of the model gives you an idea of the performance to expect on unseen data. If you are considering more than one model, you can use the validation results to choose the "best" one; I suggest comparing them using confidence intervals or, if possible, a significance test (e.g. t-test, ANOVA). Before deployment, the model is trained on all the available data.
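As a minimal illustration of the splitting step, here is a shuffled train/validation split in plain JavaScript (the 80/20 ratio is just a common default, not a rule):

// examples: array of { features: [...], label: 'jogging' | 'sitting' | ... }
function trainValidationSplit(examples, trainRatio = 0.8) {
  const shuffled = [...examples];
  // Fisher-Yates shuffle so the split is random, not ordered by activity
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const cut = Math.floor(shuffled.length * trainRatio);
  return { train: shuffled.slice(0, cut), validation: shuffled.slice(cut) };
}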
The choice of the model depends on the data that you are using: number of samples, number of features, type of variables (numerical, categorical), and so on.
I'm not an expert in JavaScript, but I believe (just a feeling) that Python and R are more common choices for developing machine learning applications. Both have libraries specifically developed for the task, and you can find a lot of material and tutorials around.
With a bit more context I could be more specific.
I hope this helps.

What is the best way to do complicated string search on 5M records? Application layer or DB layer?

I have a use case where I need to do complicated string matching on about 5.1 million records. By complicated string matching, I mean using a library to do fuzzy string matching. (http://blog.bripkens.de/fuzzy.js/demo/)
The database we use at work is SAP Hana, which is excellent for retrieving and querying because it's in-memory, so I would like to avoid pulling data out of there and re-populating it in memory on the application layer; but at the same time I cannot take advantage of such libraries inside the DB (there is an API for fuzzy matching in the DB, but it's not comprehensive enough for us).
What is the middle ground here? If I do pre-processing and associate words in the DB with certain keywords the user might search for, I can cut down the overhead, but are there any best practices employed when it comes to this?
If it matters: the data is a list of billing descriptors (the kind that show up on CC statements), and the user will search these descriptors to find out which company a descriptor belongs to.
Assuming your "billing descriptor" is a single column, probably of type (N)VARCHAR I would start with a very simple SAP HANA fuzzy search, e.g.:
SELECT top 100 SCORE() AS score, <more fields>
FROM <billing_documents>
WHERE CONTAINS(<bill_descr_col>, <user_input>, FUZZY(0.7))
ORDER BY score DESC;
Maybe this is already good enough, and you can then apply your JS library on the result set. If not, I would start to experiment with the similarCalculationMode option, like 'similarcalculationmode=substringsearch' etc. And I would always keep an eye on the response times; they can be higher when using some of the options.
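If you do re-rank on the application side, the idea is simply to score each returned descriptor against the user input and sort. A sketch, assuming a fuzzyScore(a, b) function provided by whatever JS fuzzy-matching library you use (the name and the billDescr field are placeholders, not the actual fuzzy.js API):

// hanaRows: result set from the SQL above, e.g. [{ score, billDescr, ... }, ...]
function rerank(hanaRows, userInput, fuzzyScore) {
  return hanaRows
    .map(row => ({ ...row, jsScore: fuzzyScore(row.billDescr, userInput) }))
    .sort((a, b) => b.jsScore - a.jsScore)
    .slice(0, 20); // keep only the best matches for display
}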
Only if response times are too high, or many concurrent users are running your query, would I try to create a fuzzy search index on your search column. If you need more search options, you can also create a full-text index.
But that all really depends on your use case, the values you want to compare, etc.
There is a very comprehensive set of features and options for different use cases; check help.sap.com/hana/SAP_HANA_Search_Developer_Guide_en.pdf.
In one project we did a free-style search on several address columns (name, surname, company name, post code, street) and got response times of 100-200 ms on about 6 million records WITHOUT using any special indexes.

SQL joining a large number of tables and comparing queries to return a specific result

I am working with a database that was handed down to me. It has approximately 25 tables and a very buggy query system that hasn't worked correctly for a while. I figured that, instead of trying to bug-test the existing code, I'd just start over from scratch. I want to say before I get into it: I'm not asking anyone to build the code for me. I'm not that lazy; all I want to know is what would be the best way to lay out the code. The existing query uses JOIN to combine the results of all the tables in one variable and feeds it into the query. I have been told in other questions showing this code that it's just too much, with far too many bugs, to single out what is causing the break.
What would be the most efficient way to query these tables that reference each other?
Example: a person chooses car year, make, and model. PHP then gathers that information and queries the SQL database to find which parts have a matching year, vehicle IDs, and compatibility. It then uses those results to pull parts that have matching car model IDs OR vehicle IDs (because the database was built very sloppily) and compares all the different tables to produce: parts, descriptions, prices, part numbers, SKU numbers, any retailer notes, wheelbase, drive-train compatibility, etc.
I've been working on this for two weeks, and I'm approaching my deadline with little to no progress. I'm about to scrap their database, and just do data entry for a week, and rebuild their mess if it would be easier, but if I can use the existing pile of crap they've given me, and save some time, I would prefer it.
Would it be easier to do a couple queries and compare the results, then use those results to query for more results, and do it step by step like that, or is one huge query comparing everything at once more efficient?
Should I use JOIN and pull all the tables at once and compare, or pass the input into individual variables and push the work from PHP into JavaScript on the client side to save server load? Would it be simpler to break the code up so I can identify the breaking points, or would using one big query decrease query time and server load? This is a very complex question, but I just want to make sure there aren't too many responses asking for clarification on trivial areas. I'm mainly seeking the best advice possible on how to handle this complicated situation.
Rebuild the database, then write a PHP import to bring over the data.

Maps (hashtables) in the real world

I'm trying to explain Map (aka hash table, dict) to someone who's new to programming. While the concepts of Array (= list of things) and Set (= bag of things) are familiar to everyone, I'm having a hard time finding a real-world metaphor for Maps (I'm specifically interested in Python dicts and JavaScript Objects). The often-used dictionary/phone book analogy is incorrect, because dictionaries are sorted while Maps are not, and this point is important to me.
So the question is: what would be a real world phenomena or device that behaves like Map in computing?
I agree with delnan in that the human example is probably too close to that of an object. This works well if you are trying to transition into explaining how objects are implemented in loosely typed languages, however a map is a concept that exists in Java and C# as well. This could potentially be very confusing if they begin to use those languages.
Essentially you need to understand that maps are instant look-ups that rely on a unique set of values as keys. These two things really need to be stressed, so here's a decent yet highly contrived example:
Let's say you're having a party and everyone is supposed to bring one thing. To help the organizer, everyone says what their first name is and what they're bringing. Now let's pretend there are two ways to store this information. The first is by putting it down on a list, and the second is by telling someone with an eidetic memory. The contrived part is that he can only identify you by your first name (so he's blind and has a cochlear implant, so everyone sounds like a robot; best I can come up with).
List: To add, you just append to the bottom of the list. To back out, you just remove yourself from the list. If you want to see who is bringing something and what they're bringing, then you have to scan the entire list until you find them. If you don't find them after scanning, then clearly they're not on the list and not bringing anything. The list would clearly allow duplicates of people with the same first name.
Dictionary (contrived person): You don't append to the end of a list, you just tell him someone's first name and what they're bringing. If you want to know what someone is bringing, you just ask by name and he immediately tells you. Likewise, if two people with the same name tell him they're bringing something, he'll think it's the same person just changing what they're bringing. If someone hasn't signed up and you ask by name, he'd be confused and ask what you're talking about. Also, when you tell the guy that someone is no longer bringing something, he loses all memory of them; so yeah, highly contrived.
You might also want to show why the list is sufficient if you don't care who brings what, but just need to know what all is being brought. Maybe even leave the names off the list, to stress key/value pairs with the dictionary.
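The contrast also shows up directly in code: the list is a linear scan, while the map is a direct look-up by name. A tiny sketch of the party example (names and items are made up):

// list version: linear scan, duplicates allowed
const list = [{ name: 'Ana', brings: 'cake' }, { name: 'Bob', brings: 'juice' }];
const whatAnaBrings = list.find(entry => entry.name === 'Ana')?.brings; // 'cake'

// map version: instant look-up, one entry per name
const party = new Map();
party.set('Ana', 'cake');
party.set('Bob', 'juice');
party.set('Ana', 'empanadas');   // same key: overwrites, like the contrived person would
party.get('Ana');                // 'empanadas'
party.get('Carla');              // undefined: "he'd be confused"
party.delete('Bob');             // he loses all memory of Bob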
Perhaps it would be the analogy of a human being that you're meeting for the first time:
Each person has an unordered collection of attributes, and each of these attributes can only have one value (like hair=long, eye_color=blue). You would discover these attributes in no particular order.
So a person can have shoe_size=38, hair_color=brown and eye_color=blue, and when reciting this to someone else (human_dict.get('shoe_size')) you would mention the attributes in no particular order, looking them up only by attribute name.
I have seen cases where a large list of people was binned according to the last N digits of their identifying numbers, in order to speed up key search. This binning is somewhat similar to hashing, and may help explain it.
Were you successful in explaining the array in a logical way, i.e. that an array is storage where elements are kept at the first position, second position, third position, and so on? First, second, third are basically the keys.
Now extend that to say maps are storage where the keys are not necessarily numbers; let's say they are strings, or even numbers which are not consecutive and have no particular relationship to one another.
Conversely, let's say an array A (of int) is a map where index 1 is mapped to A's address, index 2 to the address A + 4, and so on.
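Put in code, the only thing that changes when you move from an array to a map is what the keys are allowed to be. A small sketch:

// array: keys are the consecutive positions 0, 1, 2, ...
const arr = ['red', 'green', 'blue'];
arr[1];                 // 'green'

// map: keys can be strings, or numbers with no relationship to one another
const m = new Map([['red', '#f00'], [1024, 'a power of two'], ['pi', 3.14159]]);
m.get('red');           // '#f00'
m.get(1024);            // 'a power of two'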
In some restaurants, when you place your order at the counter, they give you a number to identify your order. The numbers:
Don't need to be sorted.
Don't need to be consecutive.
The only purpose of the numbers is that the staff can find your order easily. In the map/hash table/associative array world, the number would be the key and your order the value.
After your order is done, they can reuse the same number for another order. So the number is basically the identifier for an order at a certain point in time; this fits the JavaScript Object example, where the properties of an object can change their values.
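As code, the restaurant analogy maps directly onto a JavaScript Map (or a plain object), with the order number as the key and the order as the value; the numbers and items here are just made-up examples:

const orders = new Map();
orders.set(27, { items: ['burger', 'fries'] });  // the counter hands out number 27
orders.get(27);                                   // the kitchen finds the order instantly
orders.delete(27);                                // order delivered, number freed up
orders.set(27, { items: ['salad'] });             // the same number now identifies a new order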
