I have a client who has 130k books (~4 TB) and he wants a site where he can upload them to build an online library. So, how can I make it possible for him to upload them automatically, or at least upload multiple books at a time? I'll be using Node.js + MySQL.
I might suggest using object storage alongside MySQL to speed up book indexing and retrieval, but that is entirely up to you.
HTTP is a streaming protocol, and Node, interestingly, has streams.

This means that when you send an HTTP request to a Node server, Node internally handles it as a stream. Theoretically, you can upload massive books to your web server while only holding a fraction of each one in memory at a time.
The first thing to note is that books can be very large. To process them efficiently, we must handle the metadata (name, author, etc.) and the content separately.
One example, using an Express-like framework, could be (pseudo-code; StorageUploader stands in for your object-storage client's writable upload stream):

app.post('/begin/:bookId', express.json(), (req, res) => {
  // Store only the metadata (name, author, ...) in MySQL
  MySQL.query(
    'INSERT INTO books (id, name, author) VALUES (?, ?, ?)',
    [req.params.bookId, req.body.name, req.body.author]
  );
  res.sendStatus(201);
});

app.put('/upload/:bookId', (req, res) => {
  // If we use MySQL to store the book content, we would buffer the
  // body and UPDATE the row, but the whole file passes through memory.
  // With object storage we can stream instead, keeping only a small
  // chunk in memory at a time:
  const uploader = new StorageUploader(req.params.bookId);
  req.pipe(uploader);
  uploader.on('finish', () => res.sendStatus(200));
});
If you need inspiration, look at how WeTransfer designed their API. They deal with lots of data daily; their solution might be helpful to you.
Remember: your client likely won't want to use Postman to upload their books. Build a simple website for them in Svelte or React.
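Since there are 130k books, the actual bulk upload is best done by a script rather than by hand. A minimal sketch, assuming the hypothetical /upload/:bookId endpoint above and a server on localhost:3000:

const fs = require('fs');
const path = require('path');
const http = require('http');

// Stream one file from disk to the upload endpoint; only a small
// buffer is held in memory at any time.
function uploadBook(filePath, bookId, cb) {
  const req = http.request({
    method: 'PUT',
    host: 'localhost',
    port: 3000,
    path: '/upload/' + bookId,
  }, (res) => cb(null, res.statusCode));
  req.on('error', cb);
  fs.createReadStream(filePath).pipe(req);
}

// Kick off an upload for each file in a directory
// (in production you would throttle concurrency).
fs.readdirSync('./books').forEach((file, i) => {
  uploadBook(path.join('./books', file), String(i), (err, status) => {
    if (err) return console.error(file, err);
    console.log(file, '->', status);
  });
});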
I'm relatively new to full-stack development, and currently trying to figure out an effective way to send and fetch large data between my front-end (React) and back-end (Express) while minimizing memory usage. Specifically, I'm building a mapping app which requires me to play around with large JSON files (10-100mb).
My current setup works for smaller JSON files:
Backend:
const data = require('../data/data.json');
router.get('/', function(req, res, next) {
res.json(data);
});
Frontend:
componentDidMount() {
fetch('/')
.then(res => res.json())
.then(data => this.setState({data: data}));
}
However, if the data is bigger than ~40 MB, the backend crashes when I test locally because it runs out of memory. Also, holding onto the data with require() takes quite a bit of memory as well.
I've done some research and have a general understanding of JSON parsing, stringifying, streaming, and I think the answer lies somewhere with using chunked json stream to send the data bit by bit, but am pretty much at a loss on its implementation, especially using a single fetch() to do so (is this even possible?).
Definitely appreciate any suggestions on how to approach this.
First off, 40 MB is huge and can be inconsiderate to your users, especially if there's a high probability of mobile use.
If possible, it would be best to collect this data on the backend, probably put it onto disk, and then provide only the necessary data to the frontend as it's needed. As the map needs more data, you would make further calls to the backend.
If this isn't possible, you could load this data with the client-side bundle. If the data doesn't update too frequently, you can even cache it on the frontend. This would at least prevent the user from needing to fetch it repeatedly.
Alternatively, you can read the JSON via a stream on the server, stream it to the client, and use something like JSONStream to parse it on the client.
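For the server side, a minimal sketch (assuming the data lives in ../data/data.json as in your example): stream the file from disk instead of require()-ing it, so the full payload never sits in server memory at once.

const fs = require('fs');
const path = require('path');

router.get('/', function (req, res) {
  res.type('application/json');
  // Pipe the file to the response in small chunks
  fs.createReadStream(path.join(__dirname, '../data/data.json'))
    .pipe(res);
});

On the client, a single fetch() can consume this incrementally via res.body, which is a ReadableStream in modern browsers.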
Here's an example of how to stream JSON from your server via sockets.
I need a solution that makes a Firebase DB API call for multiple items based on keys and returns the data (children) of those keys (in one response).
Since I don't need the data in real time, a standard REST call made once (rather than a Firebase DB listener) would be ideal; the app wouldn't have yet another listener and WebSocket connection open. However, I've looked through Firebase's API docs and it doesn't look like there is a way to do this.
Most of the answers I've seen suggest making a composite key/index of some sort and filtering accordingly using the composite key, but that only works for searching through a range. Or they suggest just nesting the data and not worrying about redundancy and disk space (and that it's quicker), instead of retrieving associated data through foreign keys.
However, the problem is I am using Geofire, and its query method only returns the keys of the items, not the items' data. All the docs and previous answers suggest retrieving the data either via the real-time SDK, which I've tried using the once method, or by making a REST call for all items with the orderBy, startAt, endAt params and filtering locally by the keys I need.

This could work, but the potential overhead of retrieving a bunch of items I don't need only to filter them out locally seems wasteful. The approach using the once listener seems wasteful too, because it's a server roundtrip for each item key. This approach is kind of explained in this pretty good post, but according to that explanation it's still making a roundtrip for each item (even if it's done asynchronously and through the same connection).
This poor soul asked a similar question, but didn't get many helpful replies (that really address the cost of making n server requests).

Could someone, once and for all, explain the approaches on how this could be done and their pros/cons? Thanks.
Looks like you are looking for Cloud Functions. You can create a function triggered by an HTTP request and do every database read inside of it.

These functions are executed in the cloud and their results are sent back to the caller. An HTTP call is one way to trigger a Cloud Function, but you can set up other methods (a schedule, the Firebase SDK from the app, a database trigger...). The data is not charged until it leaves the server (so only in your request's response, or if you read a database in another region). Cloud Functions billing is based on CPU used, number of invocations and running instances; more details are in the quota section.

You will get something like:
const admin = require('firebase-admin');
const functions = require('firebase-functions');

admin.initializeApp();
const database = admin.database();

exports.getAllNodes = functions.https.onRequest((req, res) => {
  // Hypothetical request shape: { "keys": ["key1", "key2", ...] }
  const children = req.body.keys || [];

  // All reads happen server-side; the client pays for one roundtrip
  const promises = children.map((key) => database.ref(key).once('value'));

  Promise.all(promises)
    .then((snapshots) => {
      res.status(200).send(snapshots.map((snap) => snap.val()));
    })
    .catch((error) => {
      res.status(503).send(error);
    });
});
That you will have to deploy with the Firebase CLI.
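On the client, that turns n database reads into a single HTTP roundtrip. A hypothetical call (replace <region> and <project> with your own values, and the key list with the keys Geofire gave you):

fetch('https://<region>-<project>.cloudfunctions.net/getAllNodes', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ keys: ['item1', 'item2', 'item3'] }),
})
  .then((res) => res.json())
  .then((items) => console.log(items));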
I need a solution that makes a Firebase DB API call for multiple items based on keys and returns the data (children) of those keys (in one response).
One solution might be to set up a separate server to make ALL the calls you need to your Firebase servers, aggregate them, and send it back as one response.
There exist tools that do this.
One of the more popular ones recently spec'd by the Facebook team is GraphQL.
https://graphql.org/
Behind the scenes, you set up your GraphQL server to map your queries, each of which would make separate API calls to fetch the data needed to fulfill the query. Once all the API calls have completed, GraphQL sends the result back as a single JSON response.
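A minimal sketch of what that could look like with the graphql and express-graphql packages (a recent express-graphql is assumed; the Item shape, the items query, and the fetchItem(key) helper wrapping a Firebase read are all illustrative assumptions):

const express = require('express');
const { graphqlHTTP } = require('express-graphql');
const { buildSchema } = require('graphql');

const schema = buildSchema(`
  type Item { key: String, data: String }
  type Query { items(keys: [String!]!): [Item] }
`);

const root = {
  // One client query; the server fans out to Firebase and aggregates
  items: ({ keys }) => Promise.all(keys.map(fetchItem)),
};

express()
  .use('/graphql', graphqlHTTP({ schema, rootValue: root }))
  .listen(3000);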
This is how you can do a one-time call to a document in JavaScript; hope it helps:

// Get a reference to the database service
let database = firebase.database();

// One-time read of users/demo; use the public snapshot.val()
// rather than the internal node_.value_ property
database.ref("users").child("demo").get().then((snapshot) => {
  console.log("value of users->demo is", snapshot.val());
});
I am trying the Google Glass Mirror APIs now. My test app is a simple node.js/express server with googleapis (https://github.com/google/google-api-nodejs-client).
So far I could do almost all the basic operations of timelines successfully, such as list/get/update/delete, without attachments. Here is how I insert a timeline card:
var googleapis = require('googleapis');
app.all('/timeline_insert', function(req, res) {
var timeline = {'text': req.query.text};
googleapis.discover('mirror', 'v1')
.execute(function(err, client) {
client.mirror.timeline.insert({resource: timeline})
.withAuthClient(oauth2client)
.execute(function(err, result) {
// ...
});
});
});
Now I want to go one more step further to test the attachment features. However, I have no idea how to use the APIs via googleapis and node.js. Is there any sample code for the attachment operations, such as insert/get? I know I can always use raw HTTP format to do it. But since googleapis already provides the APIs, I just want to directly use them. Thanks.
The Node.js client library, which is based on the JavaScript client library, has no built-in support for media upload: you will need to build the request "manually".
This answer should help you get started on building this request.
More information on Google's media upload protocol can be found in our documentation.
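As a starting point, here is a sketch of building such a request by hand with Node's https module, following Google's generic multipart media-upload convention (the /upload/mirror/v1/timeline path, the uploadType=multipart parameter, and the image/jpeg part are assumptions; check the Mirror API docs for the exact endpoint):

const https = require('https');

function insertTimelineItemWithImage(accessToken, text, imageBuffer, cb) {
  const boundary = 'mirror_attachment_boundary';
  // Part 1: JSON metadata; Part 2: the raw media bytes
  const body = Buffer.concat([
    Buffer.from(
      '--' + boundary + '\r\n' +
      'Content-Type: application/json\r\n\r\n' +
      JSON.stringify({ text: text }) + '\r\n' +
      '--' + boundary + '\r\n' +
      'Content-Type: image/jpeg\r\n\r\n'),
    imageBuffer,
    Buffer.from('\r\n--' + boundary + '--'),
  ]);
  const req = https.request({
    method: 'POST',
    hostname: 'www.googleapis.com',
    path: '/upload/mirror/v1/timeline?uploadType=multipart',
    headers: {
      'Authorization': 'Bearer ' + accessToken,
      'Content-Type': 'multipart/related; boundary=' + boundary,
      'Content-Length': body.length,
    },
  }, (res) => cb(null, res));
  req.on('error', cb);
  req.end(body);
}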
I'm currently researching how to add persistence to a realtime twitter json feed in node.
I've got my stream set up and it's broadcasting to the client, but how do I go about storing this data in a JSON database such as CouchDB, so I can access the stored JSON when the client first visits the page?
I can't seem to get my head around couchdb.
var array = {
"tweet_id": tweet.id,
"screen_name": tweet.user.screen_name,
"text" : tweet.text,
"profile_image_url" : tweet.user.profile_image_url
};
db.saveDoc('tweet', strencode(array), function(er, ok) {
if (er) throw new Error(JSON.stringify(er));
util.puts('Saved my first doc to the couch!');
});
db.allDocs(function(er, doc) {
if (er) throw new Error(JSON.stringify(er));
//client.send(JSON.stringify(doc));
console.log(JSON.stringify(doc));
util.puts('Fetched my new doc from couch:');
});
These are the two snippets I'm using to try and save/retrieve tweet data. The array (actually a plain object) is one individual tweet, and needs to be saved to Couch each time a new tweet is received.

I don't understand the id part of saveDoc: when I make it unique, db.allDocs only lists IDs and not the content of each doc in the database, and when it's not unique, it fails after the first db entry.

Can someone kindly explain the correct way to save and retrieve this type of JSON data to CouchDB?

I basically want to load the entire database when the client first views the page. (The database will have fewer than 100 entries.)

Cheers.
You need to insert the documents into the database. You can do this by inserting the JSON that comes from the Twitter API, or you can insert one status at a time (in a loop).

You should create a view that exposes that information. If you saved the JSON directly from Twitter, you are going to need to emit several times in your map function.

These operations (ingestion and querying) are not the same thing, so you should really do them at different times in your program.

You should consider running a background process (maybe something as simple as a setInterval) that updates your database. Or you can use something like clarinet (http://github.com/dscape/clarinet) to parse the Twitter streaming API directly.
I'm the author of nano, and here is one of the tests that does most of what you need:
https://github.com/dscape/nano/blob/master/tests/view/query.js
For the actual query semantics and for you learn a bit more of how CouchDB works I would suggest you read:
http://guide.couchdb.org/editions/1/en/index.html
If you find it useful, I would suggest you buy the book :)
If you want to use a module to interact with CouchDB I would suggest cradle or nano.
You can also use the default http module in Node.js to make requests to CouchDB. The downside is that the default http module tends to be a little verbose; there are alternatives that give you a better API for HTTP requests, and the request module is really popular.

To get data you need to make a GET request to a view (you can find more information here). If you want to create a document, you make a PUT request to your database.
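Putting those pieces together, a minimal sketch with nano (assuming a local CouchDB and a database named tweets that already exists):

const nano = require('nano')('http://localhost:5984');
const db = nano.db.use('tweets');

// Insert one status per document; let CouchDB assign the _id so
// repeated inserts never collide.
function saveTweet(tweet, callback) {
  db.insert({
    tweet_id: tweet.id,
    screen_name: tweet.user.screen_name,
    text: tweet.text,
    profile_image_url: tweet.user.profile_image_url,
  }, callback);
}

// Fetch every document, contents included, for the first page load;
// include_docs is what was missing from the allDocs call above.
function allTweets(callback) {
  db.list({ include_docs: true }, function (err, body) {
    if (err) return callback(err);
    callback(null, body.rows.map(function (row) { return row.doc; }));
  });
}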
As a follow-up to my question yesterday about Node.js and communicating with clients, I'm trying to understand how the following would work.
Case:
So, I have this website where content is updated very frequently. Let's assume this time, this content is a list of locations with temperatures. (yes, a weather service)
Now, every time a client checks for a certain location, he or she goes to a url like this: example.com/location/id where id corresponds to the id of the location in my database.
Implementation:
At the server, checktemps.js loops (every other second or so) through all the locations in my (MySQL) database and checks the corresponding temperature. It then stores this data in an array within checktemps.js. Because temperatures can change all the time, it's important to keep checking for updates in the database.
When a request to example.com/location/id is made, checktemps.js looks into the array with a record with id = id. Then, it responds with the corresponding temperature.
Question:
Plain text, HTML or an AJAX call is not relevant at the moment. I'm just curious whether I have this right. Node.js is a rather unusual thing to get a grasp on, so I'm trying to figure out whether this is logical.
At the server, checktemps.js loops (every other second or so) through all the locations in my (MySQL) database and checks the corresponding temperature. It then stores this data in an array within checktemps.js.
This is extremely inefficient. You should not be polling the database every second or two.
Modules
Below is a list of the modules (both Node.js modules and other tools) I would use to do this efficiently:
npm is a package manager for node. You can use it to install and publish your node programs. It manages dependencies and does other cool stuff.
I sincerely hope that you already know about npm; if not, I recommend you learn about it as soon as possible. In the beginning you just need to learn how to install packages, which is very easy: you just type npm install <package-name>. Later I would really advise you to learn to write your own packages to manage the dependencies for you.
Express is a High performance, high class web development for Node.js.
This Sinatra-style framework from TJ is really sweet, and you should read the documentation/screencasts available to learn its power.
Socket.IO aims to make realtime apps possible in every browser and mobile device, blurring the differences between the different transport mechanisms.
Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
Like Raynos said, this extremely fast/sexy database has pubsub semantics, which are needed to answer your question efficiently. I think you should really play with this database (tutorial) to appreciate its raw power. Installing is easy as pie: make install.
Node_redis is a complete Redis client for node.js. It supports all Redis commands, including MULTI, WATCH, and PUBLISH/SUBSCRIBE.
Prototype
I just remembered I helped another user out in the past with a question about pubsub. I think when you look at that answer you will have a better understanding of how to do it correctly. The code was posted a while back and should be updated (minor changes in Express) to:
var PORT = 3000,
    HOST = 'localhost',
    express = require('express'),
    io = require('socket.io'),
    redis = require('redis'),
    app = module.exports = express.createServer(),
    socket = null;

app.use(express.static(__dirname + '/public'));

if (!module.parent) {
  app.listen(PORT, HOST);
  console.log("Express server listening on port %d", app.address().port);

  socket = io.listen(app);

  socket.on('connection', function(client) {
    // One dedicated Redis connection per client, because a
    // subscribed connection cannot issue other commands
    var subscribe = redis.createClient();
    subscribe.subscribe('pubsub'); // listen to messages from channel pubsub

    subscribe.on("message", function(channel, message) {
      // Push every published update straight to this client
      client.send(message);
    });

    client.on('message', function(msg) {
    });

    client.on('disconnect', function() {
      // Clean up the Redis connection when the client leaves
      subscribe.quit();
    });
  });
}
I have compressed the updated code with all the dependencies inside, but you will still need to start Redis first.
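What the snippet above doesn't show is the publishing side. Instead of polling MySQL every other second, publish an update whenever a temperature is written; a hypothetical sketch:

var redis = require('redis'),
    publisher = redis.createClient();

function updateTemperature(locationId, temp) {
  // ... write the new value to MySQL here ...
  // Then notify every subscribed connection in one shot
  publisher.publish('pubsub', JSON.stringify({ id: locationId, temp: temp }));
}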
Questions
I hope this gives you an idea how to do this.
With node.js you could do it even better. The request/response model has been drilled into our heads since the beginning of the web, but you can do just one AJAX request when the website/app opens and never end that call. Node.js can then send data to the client whenever there are updates. Search on YouTube for "Introduction to Node.js with Ryan Dahl" (the creator of node.js), where he explains it. Then you have realtime updates without the client having to poll all the time.