I'm trying to improve my understanding of Redis for a project that needs to crunch a lot of numbers quickly. However, I'm running into an issue: either my understanding is wrong, or my code isn't working as expected.
I have data in a MariaDB table, and I'm using ioredis to HMSET the data for each row into Redis, then performing an SADD to create an index for each field I need to pivot on.
However, my result sets are not matching. For example, in MariaDB I get a result set of roughly 55k records off of two fields:
SELECT COUNT(`Email`) FROM myTable
WHERE `Qual Field A`='Yes' AND `Qual Field B`='Something else'
Using those same fields in Redis, I'm getting results around 2k:
SINTER qualFieldA:'Yes' qualFieldB:'Something else'
I was under the impression, based on what I'd read on SO and elsewhere, that doing a SINTER key1:value key2:value would be roughly the equivalent of SELECT {fields} FROM {table} WHERE field1=value AND field2=value.
Is that the case and perhaps my importing or sadd calls are off, or do I not properly understand how SINTER works?
In principle you are right. However, besides errors in the import process, the main suspect IMO is this: MariaDB does index collation and normalizes values in certain ways for selection, while in Redis what you see is what you get.
So, for example, the values "Yes", "yes", "Yés" and "YES" in MariaDB will all be selected if you query for "Yes"; in Redis, only the value "Yes" will be.
And it's not just case folding: if you deal with Unicode, you're entering a world of pain trying to implement normalization and collation yourself.
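If normalization turns out to be the culprit, one fix is to canonicalize values yourself on both the import and query sides, so the set members SINTER compares match exactly. Here's a minimal sketch with ioredis; the normalize() helper, the key names, and the row shape are assumptions, and real MariaDB collations do more than this:

const Redis = require('ioredis');
const redis = new Redis();

// Stand-in for your column collation: fold case and apply Unicode normalization.
function normalize(value) {
  return value.toString().normalize('NFKD').toLowerCase().trim();
}

// Import side: store the row, then index the normalized pivot values.
async function importRow(row) {
  await redis.hmset(`record:${row.id}`, row);
  await redis.sadd(`qualFieldA:${normalize(row['Qual Field A'])}`, row.id);
  await redis.sadd(`qualFieldB:${normalize(row['Qual Field B'])}`, row.id);
}

// Query side: normalize the same way before SINTER.
async function countMatches() {
  const ids = await redis.sinter(
    `qualFieldA:${normalize('Yes')}`,
    `qualFieldB:${normalize('Something else')}`
  );
  return ids.length;
}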
I need to find all documents in a MongoDB database that have a property containing a string similar to the search term, allowing for a certain percentage of divergence.
In plain javascript I could for example use https://www.npmjs.com/package/string-similarity and then basically match all documents that have > 90% similarity score.
I'd like to do this as a MongoDB query and be as performant as possible, as the database contains millions of documents.
What possible options do I have in this situation?
I found something about $text search, but it doesn't seem to help much.
I was thinking about creating a signature for each document, like some sort of hash that allows for a degree of divergence.
I'd be really happy for any idea on how to get this solved in the best possible way.
The common solution to this problem is to use a search engine database, like Elasticsearch or Atlas Search (by the MongoDB team). I will not go into too much detail on how these databases work, but generally speaking they are inverted-index databases: you tokenize your data on insert, and then your queries run on the tokenized data and not on the raw data set.
This approach is very powerful and can help with many "search engine" problems like autocomplete or in your case what is called a "fuzzy" search.
Let's see how Elasticsearch deals with this by reading about their fuzzy feature:
To find similar terms, the fuzzy query creates a set of all possible variations, or expansions, of the search term within a specified edit distance. The query then returns exact matches for each expansion.
Basically, what they do is create all "possible" permutations of the query within the given parameters. I would personally recommend you just use one of these databases that give this ability OOTB. However, if you want to build a "pseudo" search engine in Mongo, you can use this same approach, with the downside that Mongo's indexes are trees, so you force a tree scan for these queries instead of using a database designed for this.
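For illustration, here is one way that "pseudo search engine" approach could look in plain MongoDB with the Node.js driver. It's a sketch only: the trigram tokenizer, the nameTrigrams field, the collection name, and the overlap ranking are all assumptions (the overlap ratio is a cruder measure than the Dice coefficient string-similarity uses):

// Tokenize a string into its unique character trigrams.
function trigrams(s) {
  const t = s.toLowerCase();
  const grams = new Set();
  for (let i = 0; i <= t.length - 3; i++) grams.add(t.slice(i, i + 3));
  return [...grams];
}

// On insert: store the tokenized form next to the raw value.
// One-time setup: db.collection('docs').createIndex({ nameTrigrams: 1 })
async function insertDoc(db, name) {
  await db.collection('docs').insertOne({ name, nameTrigrams: trigrams(name) });
}

// On query: fetch candidates sharing any token via the index,
// then rank by token overlap in application code.
async function fuzzyFind(db, term, minOverlap = 0.9) {
  const grams = trigrams(term);
  const candidates = await db.collection('docs')
    .find({ nameTrigrams: { $in: grams } })
    .toArray();
  return candidates.filter(d => {
    const shared = d.nameTrigrams.filter(g => grams.includes(g)).length;
    return shared / grams.length >= minOverlap;
  });
}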
I previously used MongoDB and Mongoose, and now I'm trying to implement Dynamoose in my app.
How do I write the Dynamoose equivalent of the Mongoose query shown below?
return this.scan({ abcd: { $ne: null }, activeFlag: true }).skip(9 * (page - 1)).sort({ date: -1 }).limit(9)
I need the same for:
$ne (not equal to)
skip
sort
limit
Dynamoose currently doesn't have support for a skip function. This is mainly because, to my understanding, DynamoDB doesn't have a skip-type operation on their platform.
For doing something like not equal you can use the .not() syntax in Dynamoose.
For sorting you have to use the Query.descending() or Query.ascending() syntax instead of Scan.
You can limit the number of items DynamoDB will scan or query by using Query.limit() or Scan.limit(). Please keep in mind that this limits the number of items scanned or queried on DynamoDB before any filters are applied. So if you are filtering out items in your scan or query, the limit caps the number of items that even get considered by the filter.
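As a rough sketch of those pieces together (assuming the Dynamoose v1 chaining API and a hypothetical Item model; sorting requires a Query against a table with a range key, so the "status" hash key and "date" range key here are made up):

// Not-equal-to-null via .not() on a Scan:
Item.scan('abcd').not().null().exec((err, items) => {
  if (err) { return console.error(err); }
  console.log(items);
});

// Sorting and limiting via a Query; DynamoDB only sorts by the range key ("date"):
Item.query('status').eq('active')
  .descending()
  .limit(9)
  .exec((err, items) => {
    if (err) { return console.error(err); }
    console.log(items);
  });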
Finally, it's important to understand that although Dynamoose was heavily inspired by Mongoose, it's not meant to be a plug-and-play replacement. There are some major fundamental differences between MongoDB and DynamoDB, which make a plug-and-play system basically impossible.
Therefore, I would highly encourage you to read through both the Dynamoose and DynamoDB documentation to understand the benefits and limitations of DynamoDB, and to ensure it really meets the needs of your given project.
Of course, you can do everything you want after the results get returned from DynamoDB, inside the Dynamoose callback function, by writing your own code. But depending on how many results you have, how many items are in the table, how fast you need this to be, and so on, that might not be a very good option. So although everything you are asking for is possible by writing your own code, there is a chance it won't be as good as MongoDB; since I don't know MongoDB quite as well as DynamoDB, and I don't know how those specific Mongoose functions work, I can't really speak to that.
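For example, a sketch of that do-it-in-the-callback approach (fine for small result sets, since the scan still reads everything before you sort and slice):

Item.scan('abcd').not().null().exec((err, items) => {
  if (err) { return console.error(err); }
  const page = 1;                                               // hypothetical page number
  const sorted = items.slice().sort((a, b) => b.date - a.date); // date descending
  const pageItems = sorted.slice(9 * (page - 1), 9 * page);     // skip + limit
  console.log(pageItems);
});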
I would like to ask for help with my backend flow.
I'm starting to use SQL now. I have some background in NoSQL databases, but I don't know SQL well, so I'm having some trouble figuring out how to register my schemas.
I'm using node-mysql, and the way I can create schemas is by calling the query method, like:
myInstance.query(`CREATE TABLE users (
  id INT PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(100) NOT NULL,
  email VARCHAR(100) NOT NULL,
  password VARCHAR(100) NOT NULL
)`);
The problem with this solution is that this code will run on every server initialization.
So, I would like to know how to check whether the schema already exists. And is this even a good solution? I was thinking of a bash script that creates all the schemas, so I wouldn't need these if statements.
Thanks.
What you are calling a "schema" is really a "table". Hence, create table statement, rather than create schema. This is very important. Perhaps this part of the documentation will help you understand the difference.
There are four very different constructs:
Database Server -- how you connect to one or more databases
Database Instance -- a grouping of objects, typically a unit of backup and storage
Schemas -- a grouping of objects (which may be within a database), typically a unit of permissions
Tables -- where data is stored
Note that different database systems have slightly different variations on these.
Of course, "tables" have schemas, which is why it is easy to get confused.
Generally, the management of the database is handled separately from user applications. That is, the DBA (which might also be the developer) would create the database, manage access, handle backup/recovery, and other things. The application would simply connect to the database and assume that the correct tables and data are there.
That is, under most circumstances, you wouldn't be creating tables in application code. Just use the tables that should already have been created for your database.
You can modify your SQL statement to
CREATE TABLE IF NOT EXISTS users (…
That way the code will run on server init, but it will neither do anything nor fail when the tables are already there. See the corresponding MySQL documentation.
Contrary to the other answer, having SQL statements in application code is not that uncommon for backends.
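For completeness, a minimal sketch of how that looks with node-mysql (connection setup assumed; the schema just mirrors the question):

// Safe to run on every server start: creates the table only if it's missing.
myInstance.query(`CREATE TABLE IF NOT EXISTS users (
  id INT PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(100) NOT NULL,
  email VARCHAR(100) NOT NULL,
  password VARCHAR(100) NOT NULL
)`, (err) => {
  if (err) throw err;
});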
I have a MongoDB JavaScript function saved in db.system.js, and I want to use it to both query and produce output data.
I'm able to query the results of the function using the $where clause like so:
db.records.find(
{$where: "formatEmail(this.email.toString(), 'format type') == 'xxx'"},
{email:1})
but I'm unable to use this to return a formatted value for the projected "email" field, like so:
db.records.find({}, {"formatEmail(this.email.toString(), 'format type')": 1})
Is there any way to do this while preserving the ability to simply use a pre-built function?
UPDATE:
Thank you all for your prompt participation.
Let me explain why I need to do this in MongoDB, and why it's not a matter of client logic at the wrong layer. What I am really trying to do is use the function for a shard bucketing value. Email was just one example; in reality, what I have is a hash function that returns a mod value.
I'm aware of Mongo having the ability to shard based on a hashed value, but from what I gather, it produces a highly random value that can burden the re-balancing of shards with unnecessary load. So I want to control it, like so: func(_id, mod), which would return a value from 0 to, say, 1000 (depending on the mod value).
I guess I would also like to use the output of the function in some sort of grouping scenario, and map-reduce does come to mind. I was just hoping to avoid writing an overly complex M/R job for something so simple; also, I don't really know how to do map-reduce.
So, I gather from your answers that there is no way to return a formatted value back from Mongo (without map-reduce), is that right?
I think you are mixing your "layers" of functionality here -- the database stores and retrieves data, that's all. What you need to do is:
* get that data and store the cursor in a variable
* loop through your cursor, and for every record you go through
* format and output your record as you see fit.
This is somewhat similar to what you have described in your question, but it's not part of MongoDB; you have to provide the "formatEmail" function in your "application layer".
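For instance, with the Node.js driver it could look something like this (a sketch only; formatEmail here is your own function loaded in the application, not the one in db.system.js):

async function printFormattedEmails(db) {
  // Project only the email field, then format each record in app code.
  const cursor = db.collection('records').find({}, { projection: { email: 1 } });
  while (await cursor.hasNext()) {
    const record = await cursor.next();
    console.log(formatEmail(record.email.toString(), 'format type'));
  }
}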
Hope it helps
As @alernerdev has already mentioned, this is generally not done at the database layer. However, sometimes storing a pre-formatted version in your database is the way to go. Here are some instances where you may wish to store extra data:
If you need to look up data in a particular format. For example, I have "username" and "usernameLowercase" fields in my primary user collection. The lowercased one is indexed and is the one I use for username lookups. The mixed-case one is used for displaying the username.
If you need to report a large amount of data in a particular format. 100,000 email addresses all formatted in a particular way? Probably best to just store them in that format in the db.
If your translation from one format to another is computationally expensive. Doubly so if you're processing lots of records.
In this case, if all you're doing is looking up or retrieving an email in a specific format, I'd recommend adding a field for it and then indexing it. That way you won't need to do actual document retrieval for the lookup or the display. Super fast. Disk storage space for something the size of an email address is super cheap!
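Concretely, that could look something like this (a sketch; the emailFormatted field name is made up, and formatEmail is your existing function running in the application):

// One-time: index the pre-formatted field so lookups can use the index.
// await db.collection('records').createIndex({ emailFormatted: 1 });

async function saveRecord(db, email) {
  await db.collection('records').insertOne({
    email,                                             // original, for display
    emailFormatted: formatEmail(email, 'format type')  // canonical, for lookup
  });
}

// Lookup uses the indexed field directly:
// db.collection('records').find({ emailFormatted: 'xxx' })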
I am working with a database that was handed down to me. It has approximately 25 tables and a very buggy query system that hasn't worked correctly for a while. I figured that instead of trying to debug the existing code, I'd just start over from scratch. Before I get into it, I want to say: I'm not asking anyone to build the code for me. I'm not that lazy; all I want to know is what the best way to lay out the code would be. The existing query uses JOINs to combine the results of all the tables into one variable and spits it into the query. In other questions displaying this code, I've been told that it's just too much, with far too many bugs, to single out what is causing the break.
What would be the most efficient way to query these tables that reference each other?
Example: a person chooses a car's year, make, and model. PHP then gathers that information and queries the SQL database to find parts with a matching year, matching vehicle IDs, and compatibility. It then uses those results to pull parts that have matching car model IDs OR vehicle IDs (because the database was built very sloppily), and compares all the different tables to produce: parts, descriptions, prices, part numbers, SKU numbers, any retailer notes, wheelbase, drivetrain compatibility, etc.
I've been working on this for two weeks, and I'm approaching my deadline with little to no progress. I'm about to scrap their database, do data entry for a week, and rebuild their mess if that would be easier; but if I can use the existing pile of crap they've given me and save some time, I would prefer it.
Would it be easier to do a couple of queries and compare the results, then use those results to query for more results, step by step like that, or is one huge query comparing everything at once more efficient?
Should I use JOINs and pull all the tables at once and compare, or pass the input into individual variables and hand the results to JavaScript on the client side to save server load? Would it be simpler to break the code up so I can identify the breaking points, or would one long query string decrease query time and server load? This is a very complex question, but I just want to make sure there aren't too many responses asking for clarification on trivial areas. I'm mainly seeking the best advice possible on how to handle this complicated situation.
Rebuild the database, then write a PHP import to bring over the data.