Measuring the identicality of strings (in Javascript)


In principle this question can be answered in a language-independent way, but I am specifically looking for a Javascript implementation.
Are there any libraries that allow me to measure the "identicality" of two strings? More generally, are there any algorithms that do this, that I could implement (in Javascript)?
Take, as an example, the following string:
Abnormal Elasticity of Single-Crystal Magnesiosiderite across the Spin Transition in Earth’s Lower Mantle
And also consider the following, slightly adjusted string, which differs in a few letters, hyphens and punctuation marks:
bnormal Elasticity of Single Crystal Magnesio-Siderite across the Spin-Transition in Earths Lower Mantle.
Javascript's native equality operators won't tell you a lot about the relation between these strings. In this particular case, you could match the strings using regex, but in general that only works when you know which differences to expect. If the input strings are random, the generality of this approach breaks down quickly.
Approach... I can imagine writing an algorithm that splits the input string into an arbitrary number N of substrings, then matches the target string against all those substrings, and uses the number of matches as a measure of identicality. But this feels like an unattractive approach, and I wouldn't even want to think about how the big-O complexity will depend on N.
It would seem to me that there are a lot of free parameters in such an algorithm. For example, whether case-sensitivity of characters should contribute equally/more/less to the measurement than order-preservation of characters, seems like an arbitrary choice to make by the designer, i.e.:
identicality("Abxy", "bAxy") versus identicality("Abxy", "aBxy")
Defining the requirements more specifically...
The first example is the scenario in which I could use it. I'm loading a bunch of strings (titles of academic papers), and I check whether I have them in my database. However, the source might contain typos, differences in conventions, errors, whatever, which makes matching hard. There is probably an easier way to match titles in this specific scenario: since you can more or less anticipate what might go wrong, you could write some regex beast.

You can implement Hirschberg's algorithm and distinguish delete/insert operations (or alter Levenshtein).
For Hirschberg("Abxy", "bAxy") the results are:
It was 2 edit operations:
keep: 3
insert: 1
delete: 1
and for Hirschberg("Abxy", "aBxy") the results are:
It was 2 edit operations:
keep: 2
replace: 2
You can check the javascript implementation on this page.
'Optimal' String-Alignment Distance
function optimalStringAlignmentDistance(s, t) {
// Determine the "optimal" string-alignment distance between s and t
if (!s || !t) {
// One string is empty or missing: the distance is the length of the other one
return (s || t || "").length;
}
var m = s.length;
var n = t.length;
/* For all i and j, d[i][j] holds the string-alignment distance
* between the first i characters of s and the first j characters of t.
* Note that the array has (m+1)x(n+1) values.
*/
var d = new Array();
for (var i = 0; i <= m; i++) {
d[i] = new Array();
d[i][0] = i;
}
for (var j = 0; j <= n; j++) {
d[0][j] = j;
}
// Determine substring distances
var cost = 0;
for (var j = 1; j <= n; j++) {
for (var i = 1; i <= m; i++) {
cost = (s.charAt(i-1) == t.charAt(j-1)) ? 0 : 1; // Subtract one to start at strings' index zero instead of index one
d[i][j] = Math.min(d[i][j-1] + 1, // insertion
Math.min(d[i-1][j] + 1, // deletion
d[i-1][j-1] + cost)); // substitution
if(i > 1 && j > 1 && s.charAt(i-1) == t.charAt(j-2) && s.charAt(i-2) == t.charAt(j-1)) {
d[i][j] = Math.min(d[i][j], d[i-2][j-2] + cost); // transposition
}
}
}
// Return the strings' distance
return d[m][n];
}
alert(optimalStringAlignmentDistance("Abxy", "bAxy"))
alert(optimalStringAlignmentDistance("Abxy", "aBxy"))
Damerau-Levenshtein Distance
function damerauLevenshteinDistance(s, t) {
// Determine the Damerau-Levenshtein distance between s and t
if (!s || !t) {
// One string is empty or missing: the distance is the length of the other one
return (s || t || "").length;
}
var m = s.length;
var n = t.length;
var charDictionary = new Object();
/* For all i and j, d[i][j] holds the Damerau-Levenshtein distance
* between the first i characters of s and the first j characters of t.
* Note that the array has (m+1)x(n+1) values.
*/
var d = new Array();
for (var i = 0; i <= m; i++) {
d[i] = new Array();
d[i][0] = i;
}
for (var j = 0; j <= n; j++) {
d[0][j] = j;
}
// Populate a dictionary with the alphabet of the two strings
for (var i = 0; i < m; i++) {
charDictionary[s.charAt(i)] = 0;
}
for (var j = 0; j < n; j++) {
charDictionary[t.charAt(j)] = 0;
}
// Determine substring distances
for (var i = 1; i <= m; i++) {
var db = 0;
for (var j = 1; j <= n; j++) {
var i1 = charDictionary[t.charAt(j-1)];
var j1 = db;
var cost = 0;
if (s.charAt(i-1) == t.charAt(j-1)) { // Subtract one to start at strings' index zero instead of index one
db = j;
} else {
cost = 1;
}
d[i][j] = Math.min(d[i][j-1] + 1, // insertion
Math.min(d[i-1][j] + 1, // deletion
d[i-1][j-1] + cost)); // substitution
if(i1 > 0 && j1 > 0) {
d[i][j] = Math.min(d[i][j], d[i1-1][j1-1] + (i-i1-1) + (j-j1-1) + 1); //transposition
}
}
charDictionary[s.charAt(i-1)] = i;
}
// Return the strings' distance
return d[m][n];
}
alert(damerauLevenshteinDistance("Abxy", "aBxy"))
alert(damerauLevenshteinDistance("Abxy", "bAxy"))
Optimal String Alignment has better performance:
Optimal String Alignment Distance: 0.20-0.30 ms
Damerau-Levenshtein Distance: 0.40-0.50 ms
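Either distance can be turned into a bounded "identicality" score by dividing by the length of the longer string. The sketch below is one possible way to do that for the paper-title scenario; it is not taken from the answer above. The helper names normalizeTitle, levenshtein, similarity and titleSimilarity are hypothetical, the normalization rules (lowercasing, dropping apostrophes, treating other punctuation as spaces) are assumptions, and plain Levenshtein stands in for the two distances above to keep the block short:

```javascript
// Assumed normalization: lowercase, drop apostrophes, turn other punctuation into spaces.
function normalizeTitle(title) {
  return title
    .toLowerCase()
    .replace(/[\u2018\u2019']/g, "")
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

// Plain Levenshtein distance (no transpositions), keeping only two rows in memory.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// 1 means identical, 0 means every character had to be edited.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Same score, but computed on the normalized strings.
function titleSimilarity(a, b) {
  return similarity(normalizeTitle(a), normalizeTitle(b));
}

console.log(similarity("Abxy", "bAxy"));      // 0.5
console.log(similarity("Abxy", "aBxy"));      // 0.5
console.log(titleSimilarity("Abxy", "aBxy")); // 1
```

Whether case differences should count at all is exactly the kind of design choice raised in the question: here it is decided by what normalizeTitle removes before the distance is measured.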

Related

Adding method to prototype string but what does this method mean

I am trying to understand what this code means.
My understanding is that phrase holds one element of the array, so in this case "eve" is the value of this inside the phrase.palindrome method. The method first sets var len to this.length - 1, which assigns 2 to len because the length of "eve" is 3. Then the for loop runs with var i = 0; i <= len/2; i++.
On the next iteration this becomes i = 1; 1 <= 1; i++. Is this correct?
I don't understand what is going on here:
for (var i = 0; i <= len/2; i++) {
if (this.charAt(i) !== this.charAt(len-i)) {
return false;
Here is all of the code:
String.prototype.palindrome = function() {
var len = this.length-1;
for (var i = 0; i <= len/2; i++) {
if (this.charAt(i) !== this.charAt(len-i)) {
return false;
}
}
return true;
};
var phrases = ["eve", "kayak", "mom", "wow", "Not a palindrome"];
for (var i = 0; i < phrases.length; i++) {
var phrase = phrases[i];
if (phrase.palindrome()) {
console.log("'" + phrase + "' is a palindrome");
} else {
console.log("'" + phrase + "' is NOT a palindrome");
}
}

The code is essentially iterating through the string from both directions, comparing the first and last characters (indexes 0 and len) then the second from first and second from last and so forth until you reach the middle. A word is a palindrome if and only if the first and last characters are the same and the second and second to last characters are the same and so forth.
Note, there is something very wrong with this code. While it is technically possible to mutate the prototypes of built-in types in Javascript, you should never, ever, ever do it. You gain no functionality that you wouldn't get from a normal function, while badly violating the principle of least surprise. Treat all types from other libraries, including built-ins, as if they are closed for modification and open for extension.
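As a sketch of the plain-function alternative suggested above (the name isPalindrome is an assumption, not part of the original code), the same check can be written without touching String.prototype:

```javascript
// Compare characters from both ends, moving toward the middle.
function isPalindrome(text) {
  const last = text.length - 1;
  for (let i = 0; i < last - i; i++) {
    if (text.charAt(i) !== text.charAt(last - i)) {
      return false; // a mismatched pair means it cannot be a palindrome
    }
  }
  return true;
}

const phrases = ["eve", "kayak", "mom", "wow", "Not a palindrome"];
for (const phrase of phrases) {
  console.log("'" + phrase + "' is " + (isPalindrome(phrase) ? "" : "NOT ") + "a palindrome");
}
```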

I think this line has an error:
for (var i = 0; i <= len/2; i++) {
Because sometimes the length can be 3, 5, 7, ... and that can be a problem.
I added some usage examples:
for (var i = 0; i <= Math.floor(len / 2); i++) {
// ~~ is the same as floor for positive numbers, but for negative numbers it works like ceil.
for (var i = 0; i <= ~~(len / 2); i++) {
// This is the same as floor; we can think of it as a polyfill of floor.
for (var i = 0; i <= ~~(len / 2) + (Math.abs(len)>len?-1:0); i++) {
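For what it's worth, an integer i satisfies i <= x exactly when i <= Math.floor(x), so for non-negative lengths the three bounds should allow the same iterations as the original loop. The small check below is only a sketch (the helper name iterations is made up for illustration); it also shows where Math.floor and ~~ really do differ, namely for negative values:

```javascript
// Count how many loop iterations a given upper bound allows.
function iterations(bound) {
  let count = 0;
  for (let i = 0; i <= bound; i++) {
    count++;
  }
  return count;
}

const len = 3; // e.g. a 4-character string, so len / 2 is 1.5
console.log(iterations(len / 2));             // 2
console.log(iterations(Math.floor(len / 2))); // 2
console.log(iterations(~~(len / 2)));         // 2
console.log(Math.floor(-1.5), ~~-1.5);        // -2 -1
```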

Having issues trying to solve N Rook problem. Always get n*n solutions and not N factorial

I'm trying to count the ways of solving the N rooks problem. The issue I am having is that I currently seem to get n*n solutions while it needs to be N!. Below is my code; I have written it with simple loops and functions, so it's quite long. Any help would be greatly appreciated.
Note: Please ignore case for n = 2. I get some duplicates which I thought I would handle via JSON.stringify
var createMatrix = function (n) {
var newMatrix = new Array(n);
// build matrix
for (var i = 0; i < n; i++) {
newMatrix[i] = new Array(n);
}
for (var i = 0; i < n; i++) {
for (var j = 0; j < n; j++) {
newMatrix[i][j] = 0;
}
}
return newMatrix;
};
var newMatrix = createMatrix(n);
// based on rook position, greying out function
var collision = function (i, j) {
var col = i;
var row = j;
while (col < n) {
// set the row (i) to all 'x'
col++;
if (col < n) {
if (newMatrix[col][j] !== 1) {
newMatrix[col][j] = 'x';
}
}
}
while (row < n) {
// set columns (j) to all 'x'
row++;
if (row < n) {
if (newMatrix[i][row] !== 1) {
newMatrix[i][row] = 'x';
}
}
}
if (i > 0) {
col = i;
while (col !== 0) {
col--;
if (newMatrix[col][j] !== 1) {
newMatrix[col][j] = 'x';
}
}
}
if (j > 0) {
row = j;
while (row !== 0) {
row--;
if (newMatrix[i][row] !== 1) {
newMatrix[i][row] = 'x';
}
}
}
};
// checks position with 0 and sets it with Rook
var emptyPositionChecker = function (matrix) {
for (var i = 0; i < matrix.length; i++) {
for (var j = 0; j < matrix.length; j++) {
if (matrix[i][j] === 0) {
matrix[i][j] = 1;
collision(i, j);
return true;
}
}
}
return false;
};
// loop for every position on the board
loop1:
for (var i = 0; i < newMatrix.length; i++) {
var row = newMatrix[i];
for (var j = 0; j < newMatrix.length; j++) {
// pick a position for rook
newMatrix[i][j] = 1;
// grey out collision zones due to the above position
collision(i, j);
var hasEmpty = true;
while (hasEmpty) {
//call empty position checker
if (emptyPositionChecker(newMatrix)) {
continue;
} else {
//else we found a complete matrix, break
hasEmpty = false;
solutionCount++;
// reinitialize new array to start all over
newMatrix = createMatrix(n);
break;
}
}
}
}

There seem to be two underlying problems.
The first is that several copies of the same position are being found.
If we consider the case of N=3 and we visualise the positions by making the first rook placed red, the second placed green and the third to be placed blue, we get these three boards:
They are identical positions but will count as 3 separate ones in the given Javascript.
For a 3x3 board there are also 2 other positions which have duplicates. This brings the count of unique positions to 9 - 2 - 1 - 1 = 5. But we are expecting N! = 6 positions.
This brings us to the second problem, which is that some positions are missed. In the case of N=3 this occurs once, when i and j are both 1, i.e. at the midpoint of the board.
This position is reached:
This position is not reached:
So now the number of positions that should be found is 9 - 2 - 1 - 1 + 1 = 6.
There appears to be nothing wrong with the actual Javascript in as much as it is implementing the given algorithm. What is wrong is the algorithm which is both finding and counting duplicates and is missing some positions.
A common way of solving the N Rooks problem is to use a recursive method rather than an iterative one, and indeed iteration might very soon get totally out of hand if it's trying to evaluate every single position on a board of any size.
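A minimal sketch of the recursive approach mentioned above, assuming one rook is placed per row and the columns already taken are tracked in an array. The function name countRookPlacements is hypothetical, and the sketch only counts placements rather than building the boards:

```javascript
// Count the ways to place n non-attacking rooks on an n x n board.
function countRookPlacements(n) {
  const usedColumns = new Array(n).fill(false);

  // Place a rook in the given row, then recurse into the next row.
  function placeRow(row) {
    if (row === n) {
      return 1; // every row has a rook: one complete placement
    }
    let total = 0;
    for (let col = 0; col < n; col++) {
      if (usedColumns[col]) {
        continue; // this column is already attacked by an earlier rook
      }
      usedColumns[col] = true;
      total += placeRow(row + 1);
      usedColumns[col] = false; // backtrack so the column can be reused
    }
    return total;
  }

  return placeRow(0);
}

console.log(countRookPlacements(3)); // 6
console.log(countRookPlacements(4)); // 24
```

Because each row gets exactly one rook and each column is used at most once, every placement is generated once, which avoids both the duplicates and the missed positions described above.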
This question is probably best taken up on one of the other stackexchange sites where algorithms are discussed.

Not sure what space complexity my map object uses in the example?

I'm learning to analyze space complexity, but I'm confused about analyzing an array vs. an object in JS. So I'd like to get some help here.
ex1. array []
int[] table = new int[26];
for (int i = 0; i < s.length(); i++) {
table[s.charAt(i) - 'a']++;
}
ex1. is from an example online, and it says the space complexity is O(1) because the table's size stays constant.
ex2. object {}
let nums = [0,1,2,3,4,5], map = {};
for (let i = 0; i < nums.length; i++) {
map[ nums[i] ] = i;
}
I think ex2. uses O(n) because the map object is accessed 6 times. However, if I use the concept learned from ex1., the space complexity should be O(1)? Any ideas where I went wrong?

From the point of view of complexity analysis, the space complexity of ex1 is O(1) because the array size doesn't increase: you are initializing the table to a fixed size of 26 (it looks like you are counting the characters in a string?).
See the example below, which keeps track of the counts of the letters in a string (lowercase letters only, for clarity). In this case the length of the array that tracks the letter counts never changes, even if the string changes its length.
function getCharacterCount(s){
const array = new Uint32Array(26); // one counter per lowercase letter, wide enough for long strings
for (let i = 0; i < s.length; i++) {
array[s.charCodeAt(i) - 97]++;
}
return array;
}
Now let's change the implementation to use a map instead. Here the size of the map grows whenever a new character is encountered in the string. So, theoretically speaking, the space complexity is O(n).
But in reality, we start with a map with 0 keys and it never goes beyond 26 keys. If the string doesn't contain all the characters, the space taken would be much less than with the array in the previous implementation.
function getCharacterCountMap(s){
const map = {};
for (let i = 0; i < s.length; i++) {
const charCode = s.charCodeAt(i) - 97;
if(map[charCode]){
map[charCode] = map[charCode] + 1
}else{
map[charCode] = 1;
}
}
return map;
}
console.log(getCharacterCount("abcdefabcedef"));
console.log(getCharacterCountMap("abcdefabcdef"));
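To tie this back to ex2 in the question, the following sketch (the function name buildIndexMap is made up for illustration) suggests why that map is O(n) in space: the number of keys follows the number of distinct values in the input, whereas the 26-slot table stays the same size for any input:

```javascript
// Same loop as ex2: one key per distinct value in nums.
function buildIndexMap(nums) {
  const map = {};
  for (let i = 0; i < nums.length; i++) {
    map[nums[i]] = i;
  }
  return map;
}

const small = buildIndexMap([0, 1, 2, 3, 4, 5]);
const large = buildIndexMap(Array.from({ length: 1000 }, (_, i) => i));

console.log(Object.keys(small).length); // 6
console.log(Object.keys(large).length); // 1000
```

The number of times the map is accessed affects the running time, not the space; what matters for space is how many keys end up stored.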

Find all permutations of smaller string s in string b (JavaScript)

I've been trying to find an O(n) solution to the following problem: find the number of anagrams (permutations) of string s in string b, where s.length will always be smaller than b.length.
I read that the optimal solution involves keeping track of the frequencies of the characters in the smaller string and doing the same for the sliding window as it moves across the larger string, but I'm not sure how that implementation actually works. Right now my solution doesn't work (see comments) but even if it did, it would take O(s + sn) time.
EDIT: Sample input: ('aba', 'abaab'). Output: 3, because 'aba' exists in b starting at index 0, and 'baa' at 1, and 'aab' at 2.
function anagramsInStr(s,b) {
//O(s)
let freq = s.split("").reduce((map, el) => {
map[el] = (map[el] + 1) || 1;
return map;
}, {});
let i = 0, j = s.length;
// O(n)
for (let char in b.split("")) {
// O(s)
if (b.length - char + 1 > s.length) {
let window = b.slice(i,j);
let windowFreq = window.split("").reduce((map, el) => {
map[el] = (map[el] + 1) || 1;
return map;
}, {});
// Somewhere about here compare the frequencies of chars found in the window to the frequencies hash defined in the outer scope.
i++;
j++;
}
}
}

Read through the comments and let me know if you have any questions:
function countAnagramOccurrences(s, b) {
var matchCount = 0;
var sCounts = {}; // counts for the letters in s
var bCounts = {}; // counts for the letters in b
// construct sCounts
for (var i = 0; i < s.length; i++) {
sCounts[s[i]] = (sCounts[s[i]] || 0) + 1;
}
// all letters that occur in sCounts
var letters = Object.keys(sCounts);
// for each letter in b
for (var i = 0; i < b.length; i++) {
// maintain a sliding window
// if we already have s.length items in the counts, remove the oldest one
if (i >= s.length) {
bCounts[b[i-s.length]] -= 1;
}
// increment the count for the letter we're currently looking at
bCounts[b[i]] = (bCounts[b[i]] || 0) + 1;
// test for a match (b counts == s counts)
var match = true;
for (var j = 0; j < letters.length; j++) {
if (sCounts[letters[j]] !== bCounts[letters[j]]) {
match = false;
break;
}
}
if (match) {
matchCount += 1;
}
}
return matchCount;
}
console.log(countAnagramOccurrences('aba', 'abaab')); // 3
EDIT
A note about the runtime: this is sort of O(nk + m), where n is the length of b, m is the length of s, and k is the number of unique characters in s. Since m is always less than n, we can reduce to O(nk), and since k is bounded by a fixed constant (the size of the alphabet), we can further reduce to O(n).
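If the per-step comparison over all k letters matters, it can be avoided by tracking how many characters of s the current window still lacks. This is a sketch of that variation, not the code above; the function name countAnagramsWithDeficit and its variable names are assumptions:

```javascript
function countAnagramsWithDeficit(s, b) {
  // need[ch] = how many more of ch the window requires (negative means surplus)
  const need = new Map();
  for (let i = 0; i < s.length; i++) {
    need.set(s[i], (need.get(s[i]) || 0) + 1);
  }

  let missing = s.length; // total number of required characters not yet in the window
  let count = 0;

  for (let i = 0; i < b.length; i++) {
    // Character entering the window on the right.
    const entering = b[i];
    const before = need.get(entering) || 0;
    if (before > 0) {
      missing--; // it filled a real gap
    }
    need.set(entering, before - 1);

    // Character leaving the window on the left, once the window is full.
    if (i >= s.length) {
      const leaving = b[i - s.length];
      const after = (need.get(leaving) || 0) + 1;
      need.set(leaving, after);
      if (after > 0) {
        missing++; // removing it opened a gap again
      }
    }

    if (missing === 0) {
      count++; // window is a permutation of s
    }
  }
  return count;
}

console.log(countAnagramsWithDeficit('aba', 'abaab')); // 3
```

Each step does a constant amount of work regardless of the alphabet size, so the total is O(n + m) under the same definitions of n and m as in the note above.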

Fastest way to loop through this array in Javascript on Chrome 36

I have a very big array which looks similar to this
var counts = ["gfdg 34243","jhfj 543554",....] //55268 elements long
this is my current loop
var replace = "";
var scored = 0;
var qgram = "";
var score1 = 0;
var len = counts.length;
function score(pplaintext1) {
qgram = pplaintext1;
for (var x = 0; x < qgram.length; x++) {
for (var a = 0, len = counts.length; a < len; a++) {
if (qgram.substring(x, x + 4) === counts[a].substring(0, 4)) {
replace = parseInt(counts[a].replace(/[^0-9]/g, ""), 10);
scored += Math.log(replace / len) * Math.LOG10E;
} else {
scored += Math.log(1 / len) * Math.LOG10E;
}
}
}
score1 = scored;
scored = 0;
} //need to call the function 1000 times roughly
I have to loop through this array several times and my code is running slowly. My question is what the fastest way to loop through this array would be so I can save as much time as possible.

Your counts array appears to be a list of unique strings and values associated with them. Use an object instead, keyed on the unique strings, e.g.:
var counts = { gfdg: 34243, jhfj: 543554, ... };
This will massively improve the performance by removing the need for the O(n) inner loop by replacing it with an O(1) object key lookup.
Also, avoid divisions - log(1 / n) = -log(n) - and move loop invariants outside the loops. Your log(1/len) * Math.LOG10E is actually a constant added in every pass, except that in the first if branch you also need to factor in Math.log(replace), which in log math means adding it.
p.s. avoid using the outer scoped state variables for the score, too! I think the below replicates your scoring algorithm correctly:
var len = Object.keys(counts).length;
function score(text) {
var result = 0;
var factor = -Math.log(len) * Math.LOG10E;
for (var x = 0, n = text.length - 4; x < n; ++x) {
var qgram = text.substring(x, x + 4);
var replace = counts[qgram];
if (replace) {
result += Math.log(replace) * Math.LOG10E + len * factor; // one hit plus the constant for every entry
} else {
result += len * factor; // the constant once for each entry in counts
}
}
return result;
}
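If the data only exists in the original array form, the lookup has to be built once up front. A minimal sketch, assuming every entry is a gram, a single space, and an integer count. The names buildLookup and scoreText are hypothetical, a Map is used instead of a plain object, and scoreText is meant to add the same terms as the nested loops in the question (one constant per entry for every start position, plus the log of the count on a hit) rather than to be a confirmed reference implementation:

```javascript
// Turn ["gfdg 34243", "jhfj 543554", ...] into a Map from gram to count.
function buildLookup(entries) {
  const lookup = new Map();
  for (const entry of entries) {
    const [gram, count] = entry.split(" ");
    lookup.set(gram, Number(count));
  }
  return lookup;
}

// Score a text with one Map lookup per start position instead of an inner loop.
function scoreText(text, lookup) {
  const len = lookup.size;
  const factor = -Math.log10(len); // log10(1 / len)
  let result = 0;
  for (let x = 0; x < text.length; x++) {
    const count = lookup.get(text.substring(x, x + 4));
    result += len * factor; // constant contribution of all entries
    if (count) {
      result += Math.log10(count); // extra term for the matching entry
    }
  }
  return result;
}

const lookup = buildLookup(["gfdg 34243", "jhfj 543554"]);
console.log(lookup.get("gfdg")); // 34243
console.log(scoreText("gfdgjhfj", lookup));
```

Building the Map costs one pass over the array, after which each call to scoreText is linear in the length of the text.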
