I'm trying to build an application which does the following (simplified):
1) Allow the user to select a CSV file
2) Upload that CSV to a NodeJS server
3) Parse the file and create an array of rows (with headers)
4) Generate a dynamic "Create Table" SQL statement based on the column headers in the CSV, and also detect each column's datatype (the column names, datatypes etc. will be different every time)
5) Insert the CSV data into the newly created table
It's step 4 I'm having trouble with. Is there a way to scan an array of data elements and determine what the datatype should be?
I've looked at Papa Parse and csv-parse but neither does what I need. Papa Parse comes close, but it converts each array element separately and doesn't pick up dates.
Even if you run a full file scan, it will be difficult to guess the exact types.
Another problem is handling errors in input files, e.g. a number appearing in a column that should hold a date.
A further one: an insurance number (or account number) looks numeric, but in the database it should be stored as a string.
I suggest a method straight from big-data analysis.
Run the entire process in 3 stages: first create an intermediate table where every column is of type TEXT, and import the data into it with MySQL's LOAD DATA INFILE ...
Then conduct a preliminary analysis based on the user's previous choices, the column names and the content, and display a table "wizard" to the user (or skip the wizard).
The analysis should include the shortest, longest, average and most common lengths (e.g. the first 100 rows contain a long string that is actually an error message, "Some date for some process isn't provided", while the rest are valid dates); the variety of values (gender, country, other "dictionary"-like columns); and spot checks of the content (detecting dates and numbers).
At the end you can use INSERT INTO ... SELECT and change the column types (don't forget to allow NULL so conversion errors don't abort the load), or convert and filter line by line.
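To make that concrete, here is a rough Node.js sketch of the three stages. It assumes the mysql2/promise package and made-up table and column names (staging, orders, quantity, created_at); it is only an outline, not a drop-in implementation.

// Sketch of the three-stage import, assuming the mysql2/promise package.
// Table names, column names and connection details are illustrative only.
const mysql = require('mysql2/promise');

async function stagedImport(csvPath) {
  const conn = await mysql.createConnection({
    host: 'localhost',
    user: 'import_user',
    password: 'secret',
    database: 'imports',
  });

  // Stage 1: staging table where every column is TEXT.
  await conn.query('CREATE TABLE staging (quantity TEXT, created_at TEXT)');
  // LOAD DATA INFILE needs the file to be readable by the MySQL server
  // (see secure_file_priv); use LOAD DATA LOCAL INFILE for a client-side file.
  await conn.query(
    `LOAD DATA INFILE ? INTO TABLE staging
     FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
     IGNORE 1 LINES`,
    [csvPath]
  );

  // Stage 2: preliminary analysis, e.g. length and variety statistics per column.
  const [stats] = await conn.query(
    `SELECT MIN(CHAR_LENGTH(quantity)) AS shortest,
            MAX(CHAR_LENGTH(quantity)) AS longest,
            COUNT(DISTINCT quantity)   AS distinctValues
     FROM staging`
  );
  console.log(stats);

  // Stage 3: copy into the typed table; the columns allow NULL so that
  // values that fail to convert do not abort the whole import.
  await conn.query('CREATE TABLE orders (quantity INT NULL, created_at DATE NULL)');
  await conn.query(
    `INSERT INTO orders (quantity, created_at)
     SELECT NULLIF(quantity, ''), STR_TO_DATE(created_at, '%Y-%m-%d')
     FROM staging`
  );

  await conn.end();
}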
//edit
Eh, I thought your files were a few GB in size; loading files that large into memory makes no sense.
Of course, you can use a library to read the CSV and analyze it in memory instead of in a temporary MySQL table, but you will not get around the content analysis either way. There is no hiding it: automatic analysis without advanced AI works only moderately well.
If you've found something that already detects data types reasonably well, you can build on it. The tablesorter parsers may also be helpful.
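For the in-memory route, here is a minimal sketch of what column-type detection could look like in plain JavaScript. The date check is deliberately naive and the returned SQL types are only examples; real data will need stricter rules.

// Naive column-type inference over an array of string values.
// Returns 'INT', 'DECIMAL(18,6)', 'DATE' or 'TEXT'.
function inferColumnType(values) {
  let isInt = true;
  let isDecimal = true;
  let isDate = true;

  for (const raw of values) {
    const v = String(raw ?? '').trim();
    if (v === '') continue; // blanks become NULL, so they don't vote

    if (!/^-?\d+$/.test(v)) isInt = false;
    if (!/^-?\d+(\.\d+)?$/.test(v)) isDecimal = false;
    // Very loose date check: ISO-looking "YYYY-MM-DD" that Date can parse.
    if (!/^\d{4}-\d{2}-\d{2}$/.test(v) || Number.isNaN(Date.parse(v))) isDate = false;

    if (!isInt && !isDecimal && !isDate) return 'TEXT'; // no need to scan further
  }

  if (isInt) return 'INT';
  if (isDecimal) return 'DECIMAL(18,6)';
  if (isDate) return 'DATE';
  return 'TEXT';
}

// inferColumnType(['2021-01-05', '2021-02-10'])  -> 'DATE'
// Note: values like '00123' still pass the INT test, so account numbers that
// should stay strings need a manual override, as discussed above.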
If you are still looking for an answer, I would recommend an npm CSV parser package such as csv-parse (const parse = require('csv-parse')). It is simple: first get the CSV file data and run it through the parser function, then loop through the rows and put each one into an object you can use in your SQL query.
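As a minimal sketch, assuming a recent csv-parse release that exposes the csv-parse/sync entry point (the file name and the downstream query are illustrative):

// Parse a CSV into an array of objects keyed by the header row.
const fs = require('fs');
const { parse } = require('csv-parse/sync');

const input = fs.readFileSync('upload.csv', 'utf8');
const rows = parse(input, { columns: true, skip_empty_lines: true });

for (const row of rows) {
  // row is e.g. { Quantity: '3', OrderDate: '2021-01-05' }
  // build the parameters of your INSERT statement from it here
  console.log(row);
}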
Related
I have an interesting situation where I'm working with posts. I don't know how the user will want to structure the posts. It would either be one block of text, or structured in an a -> b -> c layout where a, b, and c are all text blocks; if represented as a table, there would be an unknown number of columns and an unknown number of rows.
Outside of the post data, there is the possibility of adding custom attributes to the post. Most of these would be shorter text strings, but there would be an unknown number of them.
Understanding that a json object would probably be the simplest solution, I have to fit this into a self-serving db. SQLite seems to be the current accepted solution for Redwoodjs, the framework I'm building out of. How would I go about storing this kind of data within Redwoodjs using the prisma.js that it comes with?
Edit: The text blocks need to be separate when displaying the post and able to be referenced separately. There is another part of the project that will link to each text block specifically. The user would be choosing how many columns there are before entering any posts (configured in settings), but the rows would have to be updated dynamically. Closest example I can think of is like a test management software where you have precondition, execution steps, and expected results across the top for columns, and each additional step is a row.
Well, there are two routes you could take. If possible, use a NoSQL database such as MongoDB, which Prisma supports. There you would be able to create a JSON-like structure with as many or as few paragraphs as you would like.
If that is not possible, a workaround, since SQLite does not support a JSON column type, is to store the stringified JSON in a text field and parse it when you read it back. This is not the optimal solution, so use the first route if possible.
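A minimal sketch of that second route, assuming a hypothetical Post model whose body field is a plain String in the Prisma schema:

// Store structured post data as stringified JSON in a TEXT column.
// Assumes a Prisma model roughly like:
//   model Post { id Int @id @default(autoincrement())  body String }
const { PrismaClient } = require('@prisma/client');
const db = new PrismaClient(); // in RedwoodJS you would typically use the db exported from src/lib/db instead

async function createPost(blocks, attributes) {
  return db.post.create({
    data: { body: JSON.stringify({ blocks, attributes }) },
  });
}

async function getPost(id) {
  const post = await db.post.findUnique({ where: { id } });
  // Parse the text field back into a JS object before using it.
  return post && { ...post, body: JSON.parse(post.body) };
}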
I have a bunch of zip code records (50k) with the states they belong to in a CSV file (0.5 MB in size). I want to read them into an array and later write my own function to check whether a user-provided zip code matches the state.
Currently I have those records in MongoDB and read them asynchronously at the start of the application, which takes time. I never need to update the data; a one-time read and the array's filter function would do the required job for me.
Would you have any suggestions for an alternative way of storing the data that is quicker to load?
Thank you,
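For what it's worth, here is a small sketch of one alternative: read the file once at startup into a Map, so later lookups are constant-time. It assumes a simple two-column zip,state CSV with a header row; the file name is made up.

// Load zip -> state pairs once at startup; lookups afterwards are O(1).
// Assumes a file like "zip,state\n10001,NY\n..." saved as zipcodes.csv.
const fs = require('fs');

const zipToState = new Map(
  fs.readFileSync('zipcodes.csv', 'utf8')
    .split('\n')
    .slice(1)                       // skip the header row
    .filter((line) => line.trim())  // skip blank lines
    .map((line) => {
      const [zip, state] = line.split(',');
      return [zip.trim(), state.trim()];
    })
);

function zipMatchesState(zip, state) {
  return zipToState.get(zip) === state;
}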
My apologies, but I am not well versed in JSON. We are currently using a single-level JSON file for date information, one file per tour package. These are fetched via JS, then processed and inserted into the appropriate spots on the webpage. What we would like to do is combine all the tours' date information, plus some additional details, into a single JSON file that, once fetched, is cached by the browser for a few hours. Basically we'd end up with a local flat-file "database" of all tours for JS to access.
Doing the single-level JSON was fairly straightforward, but combining it into multiple levels is more daunting. I am wondering:
1) if there is a specific format for the data as outlined below?
2) how to use js to extract the data from that format?
Each tour is designated by a numerical id and has a number of values. Should this first level be one or two levels deep (this is only the data concept, not JSON code):
tours -> tour_id, price1, price2, price3, duration, level, dates
tours -> tour_id -> price1, price2, price3, duration, level, dates
The dates value will have multiple dates each with several values:
dates -> date1, date2, date3, date4, etc
each date has -> trip_code, start_date, end_date, price, spaces
The basic functionality will be, when the page is loaded js will read the tour value from the page, then find the appropriate tour within the json file. The general values will be extracted by one function and simply inserted into the page as is using innerHTML. The date values will be used by a different function to build strings and then those strings likewise inserted into the page.
As I read through the available info, I find some folks use only braces, some use braces and brackets, various suggestions for extraction, etc. I'd appreciate any help towards which format / extraction method is preferable. And by preferable, I mean whichever format / method puts the least workload on the browser. Having a slightly larger file size due to extra braces or brackets is fine if it reduces the JS overhead and speeds up the finished page.
While it is probably of no consequence to the answer, the json file will be built by PHP and saved as a static file on the server.
You can make the multi-level object in the form
multiLevel = { [tour_id]: yourOriginal1LevelObject}
For retrieving the tour you can use
const tour = multiLevel[tour_id]
or, using destructuring with a computed property name (the brackets are needed because tour_id is a variable, not a literal key):
const { [tour_id]: tour } = multiLevel
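Putting it together, the combined file could look something like this (field names taken from the question, values invented), and the lookup is then a plain property access:

// Illustrative shape for the combined tours file, keyed by tour_id.
const tours = {
  "101": {
    price1: 999, price2: 1099, price3: 1199,
    duration: 7, level: "moderate",
    dates: [
      { trip_code: "T101A", start_date: "2024-05-01", end_date: "2024-05-08", price: 999, spaces: 4 },
      { trip_code: "T101B", start_date: "2024-06-01", end_date: "2024-06-08", price: 1099, spaces: 10 }
    ]
  }
};

// Look up the tour the page asked for...
const tourId = "101";
const { [tourId]: tour } = tours; // same as: const tour = tours[tourId];

// ...then use the general values and the dates separately
// (the element id "duration" is just an example).
document.getElementById('duration').innerHTML = tour.duration + ' days';
const dateStrings = tour.dates.map(
  (d) => d.start_date + ' to ' + d.end_date + ': ' + d.spaces + ' spaces at $' + d.price
);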
I have to check csv files against some specific rules to make sure every single row is validated. For example, a Quantity column must contain numbers only. If a single record contains a string, then the csv is rejected.
I found http://papaparse.com/ which I think is a very good library for parsing the csv data. Off the back of that, I can use the parsed data to check row by row.
I wanted to get ideas on what's the best way I can validate a csv file of 50,000 rows against 10 rules without crashing the browser?
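One sketch of how that could look with Papa Parse's streaming mode: worker: true keeps parsing off the main thread, and step runs once per row, so the whole 50,000-row result never sits in memory at once. The Quantity rule below is illustrative.

// Row-by-row validation with Papa Parse.
let rowNumber = 0;
const errors = [];

Papa.parse(file, {
  header: true,
  skipEmptyLines: true,
  worker: true, // parse in a web worker so the UI stays responsive
  step: (results) => {
    rowNumber += 1;
    // Papa Parse 4 wraps the row in an array, Papa Parse 5 does not.
    const row = Array.isArray(results.data) ? results.data[0] : results.data;
    // Illustrative rule: Quantity must contain digits only.
    if (!/^\d+$/.test(String(row.Quantity ?? ''))) {
      errors.push('Row ' + rowNumber + ': Quantity "' + row.Quantity + '" is not a number');
    }
  },
  complete: () => {
    console.log(errors.length ? 'CSV rejected:\n' + errors.join('\n') : 'CSV accepted');
  },
});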
I have a database and I have a website front end. I have a field in my front end that is plain text now, but I want it to support markdown. I am trying to figure out the right way to store it in my database, because I have various views that need to be supported (PDF reports, web pages, Excel files, etc.).
My concern is that since some of those views don't support HTML, I don't just want to have an HTML version of this field.
Should I store 2 copies (one text only and one HTML?), or should I store HTML and try to strip the HTML tags on the fly when rendering out to Excel, for example?
I need to figure out correct format (or formats) to store in the database to be able to render both:
HTML, and
Regular text (with no markdown or HTML syntax)
Any suggestions would be appreciated as I don't want to go down the wrong path. My point is that I don't want to show any HTML tags or markdown syntax in my Excel output.
Decide like this:
Store the original data (text with markdown).
Generate the derived data (HTML and plaintext) on the fly.
Measure the performance:
If it's acceptable, you're done, woohoo!
If not, cache the derived data.
Caching can be done in many ways... you can generate the derived data immediately, and store it in the database, or you can initially store NULLs and do the generation lazily (when and if it's needed). You can even cache it outside the database.
But whatever you do, make sure the cache is never "stale" - i.e. when the original data changes, the derived data in the cache must be re-generated or at least marked as "dirty" somehow. One way to do that is via triggers.
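As a rough illustration of the generate-on-the-fly step, assuming Node.js with the marked package (the plain-text conversion here is a crude tag strip, not a full-fidelity downgrade):

// Store only the markdown; derive HTML and plain text when rendering.
const { marked } = require('marked');

function toHtml(markdown) {
  return marked.parse(markdown);
}

function toPlainText(markdown) {
  // Render to HTML first, then strip tags and decode a few common entities.
  return marked
    .parse(markdown)
    .replace(/<[^>]+>/g, '')
    .replace(/&amp;/g, '&')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .trim();
}

// toHtml('**Paid** in *full*')      -> '<p><strong>Paid</strong> in <em>full</em></p>'
// toPlainText('**Paid** in *full*') -> 'Paid in full'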
You need to store your data in a canonical format. That is, in one true format within your database. It sounds like this format should be a text column that contains markdown. That answers the database-design part of your question.
Then, depending on what format you need to export, you should take the canonical format and convert it to the required output format. This might be just outputting the markdown text, or running it through some sort of parser to remove the markdown or convert it to HTML.
Most everyone seems to be saying to just store the data as HTML in the database and then process it to turn it into plain text. In my opinion there are some downsides to that:
You will likely need application code to strip the HTML and extract the plain text. Imagine if you did this in SQL Server: what if you want to write a stored procedure/query that needs the plain text version? How do you extract plain text in SQL? It's possible with a function, but it's a lot of work.
Processing the HTML blob can be slow. I would imagine for small HTML blobs it will be very fast, but there is certainly more overhead than just reading a plain text field.
HTML parsers don't always work well and they can be complex. The problem is that your users can be very creative and insert blobs that won't work well with your parser. I know from experience that it's not always trivial to extract plain text from HTML well.
I would propose what most email providers do:
Store a rich text/HTML version and a plain text version. Two fields in the database.
As is the use case with email providers, the users might want those two fields to have different content.
You can write a UI function that lets the user enter in HTML and then transforms it via the application into a plain text version. This gives the user a nice starting point and they can massage/edit the plain text version before saving to the database.
Always store the source, in your case it is markdown.
Also store the formats that are frequently used.
Use on-demand conversion/rendering for the less frequently used formats.
Explanation:
Always have the source. You may need it for various purposes, e.g. re-editing the same input, an audit trail, debugging, etc.
There is no processor/RAM overhead when a frequently requested format is already stored; you are trading it for disk storage, which is cheap compared to re-rendering the formats.
Occasional overhead for the rarely used formats; see #2.
I would suggest storing it in HTML format, since it is the richest one in this case, and removing the tags when producing the data for the other formats (PDF, LaTeX or whatever). In the following question you'll find a way to remove tags easily:
Regular expression to remove HTML tags
From my point of view, storing the data (original and downgraded) in two separate fields is a waste of space, and also an integrity problem, since one of the fields could, in theory, be modified without changing the other.
Good luck!
I think what I'd do - if storage is not an issue - would be to store the canonical version, but automatically generate from it, in persisted computed fields, whatever other versions you might need. You want the fields to be persisted because it's pointless to do the conversion every time you need the data, and you want them to be computed because you don't want them to get out of sync with the canonical version.
In essence this is using the database as a cache for the other versions, but a cache that guarantees you data integrity.