Fetching large data from Elasticsearch in Node.js - javascript

I am fetching about 60,000 documents from an index ABC in Elasticsearch (v7) through their Node.js client. I tried using their recommended Scroll API, but it took almost 20s. Next, I increased the max_result_window of the index ABC to 100,000 (from the default 10,000) and the query took 5s.
I am running Elasticsearch v7 on an 8-core, 32GB-RAM machine. I am not performing any aggregations, and my query is just a simple GET request to fetch all documents from the index ABC. Is there any way to speed this up to less than one second through their Node.js client?
My code in Express:

const { body } = await client.search({
  index: 'ABC',
  size: 60000,
  body: {
    query: {
      match_all: {}
    }
  },
});
res.send(body.hits.hits);
If there's no other way to cut the time down, should I cache the entire response in memory and let the UI read from it rather than hitting ES directly? The data only updates once per day, so I might not need to perform cache invalidation. Should I look into Redis for this?

Though I'm not exactly sure what your use case is, the scroll API is probably not appropriate for a web server servicing a web page. To quote this page in the docs:
Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.
Instead, you should paginate the results with the from/size arguments. When the UI needs to display more items, you should submit another API request to pull another page by increasing the from argument appropriately.
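A minimal sketch of that pagination, assuming the official @elastic/elasticsearch v7 client and the same ABC index (getPage, page and pageSize are illustrative names, not part of the client API):

const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' }); // hypothetical node URL

async function getPage(page, pageSize = 100) {
  const { body } = await client.search({
    index: 'ABC',
    from: page * pageSize, // offset of the first hit in this page
    size: pageSize,        // number of hits per page
    body: { query: { match_all: {} } },
  });
  return body.hits.hits;
}

Each UI page then costs one small request instead of a single 60,000-document transfer, which is usually what keeps response times well under a second.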

Related

Apollo GraphQL keeps receiving requests with no queries or mutations being made

I was learning GraphQL and was about to finish the tutorial, and this has never happened before.
The problem is that the GraphQL server keeps receiving requests after opening GraphQL Playground in the browser even though no query or mutation is made.
I see this sort of response being returned by the server:
{
  "name": "deprecated",
  "description": "Marks an element of a GraphQL schema as no longer supported.",
  "locations": [
    "FIELD_DEFINITION",
    "ENUM_VALUE"
  ],
  "args": [
    {
      "name": "reason",
      "description": "Explains why this element was deprecated, usually also including a suggestion for how to access supported similar data. Formatted using the Markdown syntax (as specified by [CommonMark](https://commonmark.org/).",
      "type": {
        "kind": "SCALAR",
        "name": "String",
        "ofType": null
      },
      "defaultValue": "\"No longer supported\""
    }
  ]
}
This is expected behavior.
GraphQL Playground issues an introspection query to your server. It uses the result of that query to provide validation and autocompletion for your queries. Playground will send that query to your server repeatedly (every 2 seconds by default) so that if your schema changes, these changes can be immediately reflected in the UI (although there's an issue with this feature at the moment).
You can adjust the relevant settings (click on the settings icon in the top right corner of the Playground UI) to either change the polling frequency or turn it off entirely:
'schema.polling.enable': true, // enables automatic schema polling
'schema.polling.endpointFilter': '*localhost*', // endpoint filter for schema polling
'schema.polling.interval': 2000, // schema polling interval in ms
However, the behavior you're seeing is only related to Playground so it's harmless and won't impact any other clients connecting to your server.
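If you would rather configure this on the server instead of in the Playground settings pane, here is a minimal sketch, assuming apollo-server 2.x (the typeDefs and resolvers below are placeholders, not your schema):

const { ApolloServer, gql } = require('apollo-server');

// Placeholder schema just to make the sketch runnable.
const typeDefs = gql`type Query { hello: String }`;
const resolvers = { Query: { hello: () => 'world' } };

const server = new ApolloServer({
  typeDefs,
  resolvers,
  playground: {
    settings: {
      'schema.polling.enable': false, // stop the repeated introspection requests
    },
  },
});

server.listen().then(({ url }) => console.log(`Server ready at ${url}`));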

How do I sync data with remote database in case of offline-first applications?

I am building a "TODO" application which uses Service Workers to cache the request's responses and in case a user is offline, the cached data is displayed to the user.
The Server exposes an REST-ful endpoint which has POST, PUT, DELETE and GET endpoints exposed for the resources.
Considering that when the user is offline and submitting a TODO item, I save that to local IndexedDB, but I can't send this POST request for the server since there is no network connection. The same is true for the PUT, DELETE requests where a user updates or deletes an existing TODO item
Questions
What patterns are in use to sync the pending requests with the REST-ful Server when the connection is back online?
The Background Sync API is suitable for this scenario. It enables web applications to synchronize data in the background: it can defer actions until the user has a reliable connection, ensuring that whatever the user wants to send is actually sent. Even if the user navigates away or closes the browser, the action is still performed, and you can notify the user if desired.
Since you're saving to IndexedDB, you can register a sync event whenever the user adds, deletes or updates a TODO item:
function addTodo(todo) {
  return addToIndexedDB(todo).then(() => {
    // Wait for the scoped service worker registration to get a
    // service worker with an active state
    return navigator.serviceWorker.ready;
  }).then(reg => {
    return reg.sync.register('add-todo');
  }).then(() => {
    console.log('Sync registered!');
  }).catch(() => {
    console.log('Sync registration failed :(');
  });
}
This registers a sync event with the tag add-todo, which you listen for in the service worker; when that event fires, you retrieve the data from IndexedDB and POST it to your RESTful API.
self.addEventListener('sync', event => {
  if (event.tag == 'add-todo') {
    event.waitUntil(
      getTodo().then(todos => {
        // Post the messages to the server
        return fetch('/add', {
          method: 'POST',
          body: JSON.stringify(todos),
          headers: { 'Content-Type': 'application/json' }
        }).then(() => {
          // Success!
        });
      })
    );
  }
});
This is just an example of how you could achieve it using Background Sync. Note that you'll have to handle conflict resolution on the server.
You could use PouchDB on the client and CouchDB or Couchbase on the server. With PouchDB on the client, you can save data locally and have it automatically sync/replicate whenever the user is online (a minimal sync sketch follows the links below). When the databases synchronize and there are conflicting changes, CouchDB detects this and flags the affected document with the special attribute "_conflicts": true. It picks one revision as the latest and saves the others as previous revisions of that record; it does not attempt to merge the conflicting revisions. It is up to you to decide how merging should be done in your application. Couchbase is not so different in this respect. See the links below for more on conflict resolution:
Conflict Management with CouchDB
Understanding CouchDB Conflict
Resolving Couchbase Conflict
Demystifying Conflict Resolution in Couchbase Mobile
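A minimal replication sketch, assuming PouchDB on the client and a CouchDB database reachable at the hypothetical URL below:

const PouchDB = require('pouchdb');

const localDB = new PouchDB('todos');
const remoteDB = new PouchDB('https://example.com:5984/todos'); // hypothetical remote database

// Live two-way replication: keeps syncing as changes happen and
// retries automatically when the connection drops and comes back.
localDB.sync(remoteDB, { live: true, retry: true })
  .on('change', info => console.log('replicated a change', info))
  .on('error', err => console.error('sync error', err));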
I've used PouchDB with Couchbase/CouchDB/IBM Cloudant, but I've done that through Hoodie. It has user authentication out of the box, handles conflict management, and a few more things. Think of it as your backend. Hoodie would be a great fit for your TODO application. I've written about how to use Hoodie; see the links below:
How to build offline-smart application with Hoodie
Introduction to offline data storage and sync with PouchBD and Couchbase
At the moment I can think of two approaches, and the choice depends on what storage you are using at your backend.
If you are using an RDBMS to back up all data:
The problem with offline-first systems in this approach is the possibility of conflicts when you post new data or update existing data.
As a first measure to avoid conflicts, generate unique IDs for all objects on your clients, in such a way that they remain unique when posted to the server and saved in the database. For this you can safely rely on UUIDs. A UUID guarantees uniqueness across systems in a distributed setting, and whatever your implementation language, you will have a method to generate UUIDs without any hassle.
Design your local database so that the UUID is the primary key locally. On the server you can have both an auto-incremented, indexed integer primary key and a VARCHAR column to hold the UUID. The integer key uniquely identifies objects in that table, while the UUID uniquely identifies records across tables and databases.
When posting an object to the server at sync time, you just check whether an object with that UUID is already present and take the appropriate action. When you fetch objects from the server, send back both the primary key and the UUID of each object. This way, when you deserialize the response into model objects or save them in the local database, you can tell which objects have been synced: the ones that still need syncing have no server primary key locally, just the UUID.
There may be cases where your server malfunctions and refuses to save data while you are syncing. For this you can keep an integer field on each object counting how many times you have tried to sync it. If that count exceeds a certain value, say 3, you move on to the next object (see the sketch below). What you do with unsynced objects is up to the policy you have for them: you could discard them or keep them locally only.
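A hedged sketch of that idea in Node.js; newTodo, syncQueue and postTodo are illustrative names, not an existing library API:

const { randomUUID } = require('crypto'); // Node 14.17+; browsers expose crypto.randomUUID()

const MAX_SYNC_ATTEMPTS = 3;

function newTodo(title) {
  return {
    uuid: randomUUID(), // generated on the client, stays stable across syncs
    title,
    serverId: null,     // filled in once the server has accepted the object
    syncAttempts: 0,
  };
}

// postTodo is whatever function POSTs one object to your REST API and
// resolves with the server's representation (including its primary key).
async function syncQueue(queue, postTodo) {
  for (const todo of queue) {
    if (todo.serverId !== null || todo.syncAttempts >= MAX_SYNC_ATTEMPTS) continue;
    try {
      const saved = await postTodo(todo); // server upserts by uuid
      todo.serverId = saved.id;           // now marked as synced
    } catch (err) {
      todo.syncAttempts += 1;             // give up on this object after 3 failed attempts
    }
  }
}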
If you are not using an RDBMS:
As an alternate approach, instead of keeping all objects, you could keep the transactions each client performs locally and sync only those to the server. Each client syncs just its transactions, and when fetching you compute the current state by replaying all the transactions from the bottom up (a minimal sketch follows). This is very similar to what Git does: it saves the changes in your repository as transactions recording what was added (or removed) and by whom, and the current state of the repository for each user is computed from those transactions. This approach will not result in conflicts, but as you can see it is a little trickier to develop.
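A minimal sketch of that replay, with an illustrative transaction shape ({ type, id, payload }) rather than any particular library's format:

// Apply a single transaction to the current state (a plain object keyed by item id).
function applyTransaction(state, tx) {
  switch (tx.type) {
    case 'add':
      return { ...state, [tx.id]: tx.payload };
    case 'update':
      return { ...state, [tx.id]: { ...state[tx.id], ...tx.payload } };
    case 'remove': {
      const { [tx.id]: removed, ...rest } = state;
      return rest;
    }
    default:
      return state;
  }
}

// Replay the full log (e.g. as fetched from the server) to rebuild the current state.
function currentState(transactions) {
  return transactions.reduce(applyTransaction, {});
}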

Limiting entries in Javascript Object

Alrighty! I'm working on a small chat add-on for my website, and when a user logs on they'll see the chat history. I'm using a JavaScript object to store all messages in my Node.js server. Now I'd like it so that whenever there are more than fifty entries in the object, it adds the latest message and removes the oldest; I want this to keep my server from sending a huge backlog every time a user logs on. How would I do this?
Here's how I store my messages:
var messages = {
  "session": []
};

messages.session.push({
  "name": user.name,
  "message": safe_tags_replace(m.msg),
  "image": user.avatar,
  "user": user.steamid,
  "rank": user.rank
});
I could also just load the last fifty messages into the JSON object, but whenever I run my server for a long time without restarting it, this object will become extremely big. Would that be a problem?
Since you are pushing elements to the end of your array, you could just use shift() to remove the first element of the array if needed, e.g.

var MAX_MESSAGES = 50;

if (messages.session.length > MAX_MESSAGES) {
  messages.session.shift();
}
To answer the second part of your question:
The more data you hold, the more physical memory you consume on the machine holding it, obviously, which can by itself be a problem, especially on old or low-memory hardware. Also, having huge arrays will impact the performance of lookups, iteration, some insert operations, and sorting.
Storing JSON objects that contain the chat history in the server process is not a good idea. For one, you are taking up memory that will be held for an indefinite period. If you have multiple clients all talking to each other, these objects will continue to grow and eventually impact performance. Secondly, once the server is restarted, or after you clean up these objects, the chat history is lost.
The better solution is to store messages in a database; a simple option is MongoDB. Whenever a user logs in to the app, query the database for that user's chat history (here you can define how far back you want to go) and send it in the initial response. Then, whenever a message is sent, insert it into the table/collection for future reference. This way the server is only responsible for sending the chat history during the initial sign-on; after that, the client is responsible for maintaining any newly added messages. A sketch of this is below.
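A hedged sketch, assuming the official mongodb Node.js driver and a hypothetical chat.messages collection; the helper names are illustrative:

const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017'); // hypothetical connection string
  const messages = client.db('chat').collection('messages');

  // Call this whenever a message arrives over the socket.
  const saveMessage = (msg) =>
    messages.insertOne({ ...msg, createdAt: new Date() });

  // Call this when a user logs on: only the most recent 50 messages are sent.
  const recentHistory = async (limit = 50) => {
    const latest = await messages.find().sort({ createdAt: -1 }).limit(limit).toArray();
    return latest.reverse(); // oldest first, ready to render
  };

  return { saveMessage, recentHistory };
}

main() would be called once at server start-up, and the two helpers wired into your socket handlers.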

What is the most efficient way to make a batch request to a Firebase DB based on an array of known keys?

I need a solution that makes a Firebase DB API call for multiple items based on keys and returns the data (children) of those keys (in one response).
Since I don't need the data in real time, some sort of standard REST call made once (rather than a Firebase DB listener) seems ideal.
The app wouldn't have yet another listener and WebSocket connection open. However, I've looked through Firebase's API docs and it doesn't look like there is a way to do this.
Most of the answers I've seen suggest making a composite key/index of some sort and filtering on it, but that only works for searching through a range. Or they suggest just nesting the data and not worrying about redundancy and disk space (it's quicker), instead of retrieving associated data through foreign keys.
However, the problem is that I am using Geofire, and its query method only returns the keys of the items, not the items' data. All the docs and previous answers suggest retrieving the data either through the real-time SDK, which I've tried using the once method, or by making a REST call for all items with the orderBy, startAt, endAt params and filtering locally by the keys I need.
This could work, but the overhead of retrieving a bunch of items I don't need only to filter them out locally seems wasteful. The approach using the once listener seems wasteful too, because it's a server round trip for each item key. That approach is explained in this pretty good post, but according to that explanation it still makes a round trip per item (even if asynchronously and over the same connection).
This poor soul asked a similar question but didn't get many helpful replies (that really address the cost of making n server requests).
Could someone, once and for all explain the approaches on how this could be done and the pros/cons? Thanks.
It looks like you are looking for Cloud Functions. You can create a function triggered by an HTTP request and do every database read inside of it.
These functions are executed in the cloud and their results are sent back to the caller. An HTTP call is one way to trigger a Cloud Function, but you can set up other triggers (a schedule, a call from the app with the Firebase SDK, a database trigger...). Data is not charged until it leaves the server (so only in your request's response, or if you read a database in another region). Cloud Functions billing is based on CPU used, number of invocations and running instances; more details are in the quota section.
You will end up with something like:
const admin = require('firebase-admin');
const functions = require('firebase-functions');

admin.initializeApp();
const database = admin.database();

exports.getAllNodes = functions.https.onRequest((req, res) => {
  let children = [ ... ]; // get your node list from req
  const promises = [];
  for (const key of children) {
    promises.push(database.ref(key).once('value'));
  }
  Promise.all(promises)
    .then(snapshots => {
      // Send the values, not the raw snapshot objects
      res.status(200).send(snapshots.map(snap => snap.val()));
    })
    .catch(error => {
      res.status(503).send(error);
    });
});
You will then have to deploy this with the Firebase CLI.
I need a solution that makes a Firebase DB API call for multiple items based on keys and returns the data (children) of those keys (in one response).
One solution might be to set up a separate server to make ALL the calls you need to your Firebase servers, aggregate them, and send it back as one response.
There exist tools that do this.
One of the more popular ones, recently spec'd by the Facebook team, is GraphQL.
https://graphql.org/
Behind the scenes, you set up your GraphQL server to map your queries to resolvers, which make the separate API calls needed to fetch the data that fits the query. Once all the API calls have completed, GraphQL sends the result back as a single JSON response.
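A hedged sketch of that setup, assuming apollo-server and firebase-admin; the items(keys: ...) field and the items/ path in the database are illustrative, not an existing schema:

const { ApolloServer, gql } = require('apollo-server');
const admin = require('firebase-admin');

admin.initializeApp(); // assumes default credentials and database URL are configured

const typeDefs = gql`
  type Query {
    items(keys: [ID!]!): [String]
  }
`;

const resolvers = {
  Query: {
    // One read per key, fanned out in parallel over the same connection,
    // then returned to the client as a single response.
    items: async (_, { keys }) => {
      const snapshots = await Promise.all(
        keys.map(key => admin.database().ref(`items/${key}`).once('value'))
      );
      return snapshots.map(snap => JSON.stringify(snap.val()));
    },
  },
};

new ApolloServer({ typeDefs, resolvers })
  .listen()
  .then(({ url }) => console.log(`GraphQL ready at ${url}`));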
This is how you can do a one-time call to a document in JavaScript; hope it helps:

// Get a reference to the database service
let database = firebase.database();

// One-time call to a document
database.ref("users").child("demo").get().then((snapshot) => {
  console.log("value of users->demo is", snapshot.val());
});

Adding couchdb persistence to a socketio json feed in node

I'm currently researching how to add persistence to a real-time Twitter JSON feed in Node.
I've got my stream set up and it's broadcasting to the client, but how do I go about storing this data in a JSON database such as CouchDB, so I can access the stored JSON when the client first visits the page?
I can't seem to get my head around couchdb.
var array = {
  "tweet_id": tweet.id,
  "screen_name": tweet.user.screen_name,
  "text": tweet.text,
  "profile_image_url": tweet.user.profile_image_url
};

db.saveDoc('tweet', strencode(array), function(er, ok) {
  if (er) throw new Error(JSON.stringify(er));
  util.puts('Saved my first doc to the couch!');
});

db.allDocs(function(er, doc) {
  if (er) throw new Error(JSON.stringify(er));
  //client.send(JSON.stringify(doc));
  console.log(JSON.stringify(doc));
  util.puts('Fetched my new doc from couch:');
});
These are the two snippets I'm using to try to save/retrieve tweet data. The array is one individual tweet and needs to be saved to Couch each time a new tweet is received.
I don't understand the id part of saveDoc: when I make it unique, db.allDocs only lists IDs and not the content of each doc in the database, and when it's not unique, it fails after the first db entry.
Can someone kindly explain the correct way to save and retrieve this type of json data to couchdb?
I basically want to load the entire database when the client first views the page. (The database will have fewer than 100 entries.)
Cheers.
You need to insert the documents into the database. You can do this by inserting the JSON that comes from the Twitter API, or you can insert one status at a time (in a for loop).
You should create a view that exposes that information. If you saved the JSON directly from Twitter, you are going to need to emit several times in your map function.
These operations (ingestion and querying) are not the same thing, so you should really do them at different times in your program.
You should consider running a background process (maybe something as simple as a setInterval) that updates your database. Or you can use something like clarinet (http://github.com/dscape/clarinet) to parse the Twitter streaming API directly.
I'm the author of nano, and here is one of the tests that does most of what you need:
https://github.com/dscape/nano/blob/master/tests/view/query.js
For the actual query semantics, and for you to learn a bit more about how CouchDB works, I would suggest you read:
http://guide.couchdb.org/editions/1/en/index.html
If you find it useful, I would suggest you buy the book :)
If you want to use a module to interact with CouchDB I would suggest cradle or nano.
You can also use the default http module that comes with Node.js to make requests to CouchDB. The downside is that the default http module tends to be a little verbose. There are alternatives that give you a better API for HTTP requests; request is a really popular one.
To get data you make a GET request to a view (you can find more information here). If you want to create a document, you make a PUT request to your database. A hedged sketch using nano follows.
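A minimal sketch, assuming the nano module and a CouchDB database named tweets at the hypothetical URL below. Using the tweet id as the document _id keeps each document unique (which addresses the saveDoc id issue), and include_docs: true returns the document bodies rather than just the ids:

const nano = require('nano')('http://localhost:5984'); // hypothetical CouchDB URL
const db = nano.db.use('tweets');

// Save one tweet per document; the tweet id keeps _id unique.
async function saveTweet(tweet) {
  await db.insert({
    _id: String(tweet.id),
    screen_name: tweet.user.screen_name,
    text: tweet.text,
    profile_image_url: tweet.user.profile_image_url
  });
}

// Load the whole database (fine for fewer than 100 entries) when a client connects.
async function allTweets() {
  const result = await db.list({ include_docs: true });
  return result.rows.map(row => row.doc);
}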
