Check if two Strings share a common substring in JavaScript

Is there a fast way in JavaScript to find out whether two strings contain the same substring? For example, I have these two strings: "audi is a car" and "audiA8".
As you can see, the word "audi" is in both strings, but we cannot find it with a simple indexOf or RegExp because of the other characters in both strings.

The standard tool for doing this sort of thing in Bioinformatics is the BLAST program. It is used to compare two fragments of molecules (like DNA or proteins) to find where they align with each other - basically where the two strings (sometimes multi GB in size) share common substrings.
The basic algorithm is simple: systematically break up one of the strings into pieces and compare the pieces with the other string. A simple implementation would be something like:
// Note: not fully tested, there may be bugs:
function subCompare (needle, haystack, min_substring_length) {
    // Min substring length is optional; if not given or 0, default to 1:
    min_substring_length = min_substring_length || 1;
    // Search possible substrings from largest to smallest:
    for (var i = needle.length; i >= min_substring_length; i--) {
        // Declare j with var so it does not leak as an implicit global:
        for (var j = 0; j <= (needle.length - i); j++) {
            var substring = needle.substr(j, i);
            var k = haystack.indexOf(substring);
            if (k != -1) {
                return {
                    found : 1,
                    substring : substring,
                    needleIndex : j,
                    haystackIndex : k
                };
            }
        }
    }
    return {
        found : 0
    };
}
You can modify this algorithm to do fancier searches, such as ignoring case, fuzzy matching the substring, or looking for multiple substrings. This is just the basic idea.
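For longer inputs, the brute-force scan above can be swapped for the classic dynamic-programming approach to the longest common substring problem, which runs in time proportional to the product of the two lengths. The following is only a sketch of that alternative technique, not the algorithm described above; the function name longestCommonSubstring is made up for illustration.

```javascript
// Hypothetical helper: longest common substring via dynamic programming.
function longestCommonSubstring(a, b) {
  var bestLength = 0; // length of the best match found so far
  var bestEndInA = 0; // index in `a` just past the end of that match
  // prev[j] = length of the common suffix of a.slice(0, i - 1) and b.slice(0, j)
  var prev = new Array(b.length + 1).fill(0);
  for (var i = 1; i <= a.length; i++) {
    var curr = new Array(b.length + 1).fill(0);
    for (var j = 1; j <= b.length; j++) {
      if (a[i - 1] === b[j - 1]) {
        // Extend the common suffix that ended one character earlier
        curr[j] = prev[j - 1] + 1;
        if (curr[j] > bestLength) {
          bestLength = curr[j];
          bestEndInA = i;
        }
      }
    }
    prev = curr;
  }
  return a.slice(bestEndInA - bestLength, bestEndInA);
}

console.log(longestCommonSubstring("audi is a car", "audiA8")); // "audi"
```

An empty string is returned when the inputs share no characters at all.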

Take a look at a JavaScript implementation of PHP's similar_text function. It returns the number of matching characters in both strings.
For your example it would be:
similar_text("audi is a car", "audiA8") // -> 4
which here corresponds to the 4-character common substring "audi".

Don't know about any simpler method, but this should work:
if(a.indexOf(substring) != -1 && b.indexOf(substring) != -1) { ... }
where a and b are your strings.

var a = "audi is a car";
var b = "audiA8";
var chunks = a.split(" ");
var commonsFound = 0;
for (var i = 0; i < chunks.length; i++) {
if(b.indexOf(chunks[i]) != -1) commonsFound++;
}
alert(commonsFound + " common substrings found.");

Related

Checking if combination of any amount of strings exists

I'm solving a puzzle and I have an idea of how to solve this problem, but I would like some guidance and hints.
Suppose I am given n words as input, followed by m strings made of words joined without spaces. The expected behaviour is as follows.
4
this
is
my
dog
5
thisis // outputs 1
thisisacat // 0, since a and cat are not among the four words
thisisaduck // 0, no a or duck
thisismy // 1, since this, is and my are among the four words
thisismydog // 1
My thoughts
First, I was thinking of storing the n words in an array. After that, I check whether any of those words appears at the start of each of the m combined strings.
Example: check if this is at the start of thisis. It is! Great, now remove that this from thisis to get just is, delete the word that matched from the list, and keep iterating over the leftovers (now is, my, dog are available). If we can keep doing this until we get an empty string, we return 1; otherwise we return 0.
Are my thoughts on the right track? I think this would be a good approach (by the way, I would like to implement this in JavaScript).
Sorting words from long to short may in some cases help to find a solution quicker, but it is not a guarantee. Sentences that contain the longest word might only have a solution if that longest word is not used.
Take for instance this test case:
Words: toolbox, stool, boxer
Sentence: stoolboxer
If "toolbox" is taken as a word in that sentence, then the remaining characters cannot be matched with other valid words. Yet, there is a solution, but only if the word "toolbox" is not used.
Solution with a Regular Expression
When regular expressions are allowed as part of the solution, then it is quite simple. For the above example, the regular expression would be:
^(toolbox|stool|boxer)*$
If a sentence matches that expression, it is a solution. If not, then not. This is quite straightforward, and doesn't really require an algorithm. All is done by the regular expression interpreter. Here is a snippet:
var words = ['this','is','a','string'];
var sentences = ['thisis','thisisastring','thisisaduck','thisisastringg','stringg'];
var regex = new RegExp('^(' + words.join('|') + ')*$');
sentences.forEach(sentence => {
// search returns a position. It should be 0:
console.log(sentence + ': ' + (sentence.search(regex) ? 'No' : 'Yes'));
});
But using regular expressions in an algorithm-challenge feels like cheating: you don't really write the algorithm, but rely on the regular expression implementation to do the job for you.
Without Regular Expressions
You could use this algorithm: first check whether a word matches at the start of the input sentence, and if so, remove that first occurrence from it. Then repeat this for the remaining part of the sentence. If this can be repeated until no characters are left over, you have a solution.
If characters are left over which cannot be matched with any word... well, then you cannot really conclude there is no solution for that sentence. It might be that some earlier made word choice was the wrong one, and there was an alternative. So to cope with that, your algorithm could backtrack and try other words.
This principle can be implemented through recursion. To gain memory efficiency, you could leave the original sentence intact and work with an index into that sentence instead.
The algorithm is implemented in arrow-function testString:
var words = ['this','is','a','string'];
var sentences = ['thisis','thisisastring','thisisaduck','thisisastringg','stringg'];
var testString = (words, str, i = 0) =>
i >= str.length || words.some( word =>
str.substr(i, word.length) == word && testString(words, str, i + word.length)
);
sentences.forEach(sentence => {
console.log(sentence + ': ' + (testString(words, sentence) ? 'Yes' : 'No'));
});
Or, the same in non-arrow-function syntax:
var words = ['this','is','a','string'];
var sentences = ['thisis','thisisastring','thisisaduck','thisisastringg','stringg'];
var testString = function (words, str, i = 0) {
return i >= str.length || words.some(function (word) {
return str.substr(i, word.length) == word
&& testString(words, str, i + word.length);
});
}
sentences.forEach(function (sentence) {
console.log(sentence + ': ' + (testString(words, sentence) ? 'Yes' : 'No'));
});
... and without some(), forEach() or ternary operator:
var words = ['this','is','a','string'];
var sentences = ['thisis','thisisastring','thisisaduck','thisisastringg','stringg'];
function testString (words, str, i = 0) {
if (i >= str.length) return true;
for (var k = 0; k < words.length; k++) {
var word = words[k];
if (str.substr(i, word.length) == word
&& testString(words, str, i + word.length)) {
return true;
}
}
// No word led to a full match from position i:
return false;
}
for (var n = 0; n < sentences.length; n++) {
var sentence = sentences[n];
if (testString(words, sentence)) {
console.log(sentence + ': Yes');
} else {
console.log(sentence + ': No');
}
}
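The backtracking versions above can end up re-testing the same position in the sentence many times. One possible refinement, sketched here as an assumption rather than as part of the answer above, is to record which positions are reachable and visit each one only once. The function name canSegment is hypothetical.

```javascript
// Hypothetical helper: iterative variant of testString that tracks
// which prefixes of `str` can be built from the given words.
function canSegment(words, str) {
  // reachable[i] is true when str.slice(0, i) can be built from the words
  var reachable = new Array(str.length + 1).fill(false);
  reachable[0] = true; // the empty prefix is always buildable
  for (var i = 0; i < str.length; i++) {
    if (!reachable[i]) continue; // nothing ends here, so nothing can start here
    for (var k = 0; k < words.length; k++) {
      var word = words[k];
      // Skip empty words; they would not advance the position
      if (word.length > 0 && str.startsWith(word, i)) {
        reachable[i + word.length] = true;
      }
    }
  }
  return reachable[str.length];
}

console.log(canSegment(['toolbox', 'stool', 'boxer'], 'stoolboxer')); // true
console.log(canSegment(['this', 'is', 'a', 'string'], 'thisisaduck')); // false
```

Like testString, this treats an empty sentence as a match.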
Take the 4 words, put them into a regex.
Use that regex to split each string.
Take the length of the resulting array (subtract one for the initial length of one).
var size = 'thisis'.split(/this|is|my|dog/).length - 1
Or if your list of words is an array
var search = new RegExp(words.join('|'))
var size = 'thisis'.split(search).length - 1
Either way you are splitting up the string by the list of words you have defined.
You can sort the words by length to ensure that larger words are matched first by
words.sort(function (a, b) { return b.length - a.length })
Here is the solution for anyone interested
var input = ['this','is','a','string']; // This will work for any input, but this is a test case
var orderedInput = input.sort(function(a,b){
return b.length - a.length;
});
var inputRegex = new RegExp(orderedInput.join('|'));
// our combination of words can be an array of any size; using an array here because prompt() in JS is spammy
var testStrings = ['thisis','thisisastring','thisisaduck','thisisastringg','stringg'];
var foundCombos = (regex,str) => !str.split(regex).filter(str => str.length).length;
var finalResult = testStrings.reduce((all,str)=>{
all[str] = foundCombos(inputRegex,str);
if (all[str] === true){
all[str] = 1;
}
else{
all[str] = 0;
}
return all;
},{});
console.log(finalResult);

Longest Substring Without Repeating Characters corner cases

I was asked this question in a recent interview. I need to find the longest substring without repeating characters.
Given "abcabcbb", the answer is "abc", which has a length of 3.
Given "bbbbb", the answer is "b", with the length of 1.
Given "pwwkew", the answer is "wke", with the length of 3
This is what I came up with. I think it works correctly, but the interviewer was not impressed and said that my solution may not work for all cases.
var str = "pwwkew";
var longSubstring = function(str) {
var obj = {}; //map object
var count = 0;
var c = []; //count array to keep the count so far
for (var i = 0; i < str.length; ++i) {
//check if the letter is already in the map
if (str[i] in obj && obj[str[i]] !== i) {
c.push(count); //we encountered repeat character, so save the count
obj = {};
obj[str[i]] = i;
count = 1;
continue;
} else {
obj[str[i]] = i;
++count;
}
}
return Math.max.apply(null, c);
}
console.log(longSubstring(str)); //prints 3
Can anyone tell me what's the problem with my solution? I think it is one of the best :) and also solves in O(n) time.
I guess the problem with your code is that it goes haywire when there are no repeating letters in the whole string to start with. As mentioned in one of the comments, "abc" won't produce a correct result. My approach would be slightly different from yours, as follows:
var str = "pwwkew",
data = Array.prototype.reduce.call(str, (p,c) => (p.test.includes(c) ? p.test = [c]
: p.test.length >= p.last.length ? p.test = p.last = p.test.concat(c)
: p.test.push(c)
, p), {last:[], test:[]}),
result = data.last.length;
console.log(data);
console.log(result);
In this reduce code we start with an object like {last:[], test:[]} and go over the characters one by one. If the current character is already in the object's test array, we immediately reset the test array to contain only that character (the p.test = [c] line). If the current character is not in the test array, we do one of two things. If the test array's length is equal to or greater than the last array's length, we add the current character to the test array and make the last array equal to the test array (the p.test.length >= p.last.length ? p.test = p.last = p.test.concat(c) line). If the test array is shorter than the last array, we just add the current character to the test array. We continue like this, one character at a time, to the end of the string.
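Resetting the current run to a single character on a repeat can miss substrings that begin just after the earlier occurrence: for "dvdf" the longest run is "vdf" (length 3), but a full reset only finds 2. A sliding-window variant that remembers the last index of each character avoids this. This is a sketch of that swapped-in technique, with a made-up function name.

```javascript
// Hypothetical helper: sliding window over the string.
function lengthOfLongestUniqueSubstring(str) {
  var lastSeen = new Map(); // character -> index of its most recent occurrence
  var start = 0;            // left edge of the current repeat-free window
  var best = 0;
  for (var i = 0; i < str.length; i++) {
    var ch = str[i];
    // Only move the window if the earlier occurrence is still inside it
    if (lastSeen.has(ch) && lastSeen.get(ch) >= start) {
      start = lastSeen.get(ch) + 1;
    }
    lastSeen.set(ch, i);
    best = Math.max(best, i - start + 1);
  }
  return best;
}

console.log(lengthOfLongestUniqueSubstring("pwwkew")); // 3
console.log(lengthOfLongestUniqueSubstring("abc"));    // 3
console.log(lengthOfLongestUniqueSubstring("dvdf"));   // 3
```

Because the best length is updated on every step, strings without any repeated character need no special handling.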

how to get list of unique chars from a string in javascript?

I have some text files, each with a mix of western and chinese characters. I want a list of the chinese characters that appear in each file.
I have tried
ch = text.match(/[\u4E00-\u9FFF]/g); // unicode usual chinese characters - that'll do for me
if (ch != null) {
alert(ch);
}
This gives me the list of chinese characters, but with some repetitions. For example:
肉,捕,兵,死,兵,半,水
for a file
卵,水,半,水,土,木,水,清,慢,底,海,海,海,清,清,清,木,清,慢,底,清,土,半,水,水,土,半,水,土
for another...
1) I don't need those commas. Where did they come from? (I can take them off with a single replace, but since I'm using regex, I think it may be faster if I solve it inside the regex itself.)
2) How to get only unique values? For example:
肉捕兵死半水
for the first file
卵水半土木清慢底海
for the second...
The commas come from the default array-to-string conversion. Use ch.join('') to convert the array to a string instead.
To remove duplicate values, use this line:
ch = text.match(/([\u4E00-\u9FFF])/g);
ch = ch.filter(function (c, i) { return ch.indexOf(c) === i; }).join('');
Array.prototype.getUnique = function(){
var u = {}, a = [];
for(var i = 0, l = this.length; i < l; ++i){
if(u.hasOwnProperty(this[i])) {
continue;
}
a.push(this[i]);
u[this[i]] = 1;
}
return a;
}
ch = text.match(/([\u4E00-\u9FFF])/g);
var result_string = ch.getUnique().join("");
Try this:
var text = "卵水半水土木水清慢底海海海清清清木清慢底清土半水水土半水土",
re = /([\u4E00-\u9FFF])/g,
unique = {},
chars = "", c;
while(c = re.exec(text)){
if(!unique[c[0]]){
chars += c[0];
unique[c[0]] = true;
}
}
chars.split("");
Which returned:
["卵", "水", "半", "土", "木", "清", "慢", "底", "海"]
And yes, the commas you're seeing appear when a browser typecasts an array to a string: it joins the string representations of each value together with commas. I'm guessing that came from the call to "alert" in your original example, which was being supplied an array (returned from the string's "match" method).
Array's "filter" method isn't supported in legacy browsers, but it's quite easy to polyfill (and a polyfill is not necessary if you only need to support browsers as recent as IE9).
There is a one-liner solution with regex:
input.match(/([\u4E00-\u9FFF])(?![\s\S]*\1)/g)
However, I wouldn't recommend using it, since it will have O(n * k) complexity in worst case (when the string contains mostly Chinese characters), where n is the length of the string and k is the number of unique Chinese characters. Why O(n * k)? Since the look-ahead (?![\s\S]*\1) basically says "assert that you can't find another instance of whatever matched in first capturing group in the rest of the string".
This answer by #Ruben Kazumov is a reasonable alternative. Its complexity depends on the implementation of setting and getting property in an Object, which should be sub-linear per operation in a reasonable implementation.
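In environments with ES2015 support, a Set gives the same first-occurrence de-duplication without a hand-written lookup object. This is a minimal sketch under that assumption; the function name uniqueChineseChars is made up for illustration.

```javascript
// Hypothetical helper: unique characters in the \u4E00-\u9FFF range,
// in order of first appearance.
function uniqueChineseChars(text) {
  // match() returns null when nothing matches, so fall back to an empty array
  var matches = text.match(/[\u4E00-\u9FFF]/g) || [];
  // A Set keeps only the first occurrence of each value, in insertion order
  return Array.from(new Set(matches)).join('');
}

console.log(uniqueChineseChars("卵水半水土木水清慢底海海海清清清木清慢底清土半水水土半水土"));
// "卵水半土木清慢底海"
```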

The best way to match at least three out of four regex requirements

In our password policy, there are 4 requirements.
A password should contain any three of the following:
lower case.
upper case.
numeric.
special character.
The following regex matches only when all four requirements are met:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^a-zA-Z0-9]).{4,8}$
I know I can use '|' to list all combinations; however, that will produce a super long regex. What is the best way to replace '|' so that it can check whether the input satisfies any three of the four conditions?
If you're using a PCRE flavor, the following one could suit your needs (formatted for readability):
^(?:
((?=.*\d))((?=.*[a-z]))((?=.*[A-Z]))((?=.*[^a-zA-Z0-9]))|
(?1) (?2) (?3) |
(?1) (?2) (?4) |
(?1) (?3) (?4) |
(?2) (?3) (?4)
).{4,8}$
One-lined:
^(?:((?=.*\d))((?=.*[a-z]))((?=.*[A-Z]))((?=.*[^a-zA-Z0-9]))|(?1)(?2)(?3)|(?1)(?2)(?4)|(?1)(?3)(?4)|(?2)(?3)(?4)).{4,8}$
JavaScript regex flavor does not support recursion (it does not support many things actually). Better use 4 different regexes instead, for example:
var validate = function(input) {
var regexes = [
"[A-Z]",
"[a-z]",
"[0-9]",
"[^a-zA-Z0-9]"
];
var count = 0;
for (var i = 0, n = regexes.length; i < n; i++) {
if (input.match(regexes[i])) {
count++;
}
}
return count >=3 && input.match("^.{4,8}$");
};
Sure, here's a method which uses a slight modification of the same regex, but with a short bit of code authoring required.
^(?=(\D*\d)|)(?=([^a-z]*[a-z])|)(?=([^A-Z]*[A-Z])|)(?=([a-zA-Z0-9]*[^a-zA-Z0-9])|).{4,8}$
Here you have four capturing groups (five entries in the match array, counting the whole match). Check whether at least 3 of them are non-empty. An empty group indicates that the empty alternative within the lookahead matched, because the capturing group on the left-hand side could not be matched.
For example, in PHP:
preg_match("/^(?=(\\D*\\d)|)(?=([^a-z]*[a-z])|)(?=([^A-Z]*[A-Z])|)(?=([a-zA-Z0-9]*[^a-zA-Z0-9])|).{4,8}$/", $str, $matches);
$count = -1; // Because the zero-eth element is never null.
foreach ($matches as $element) {
if ($element !== '') { // not empty(): empty("0") is true in PHP
$count += 1;
}
}
if ($count >= 3) {
// ...
}
Or Java:
Matcher matcher = Pattern.compile(...).matcher(string);
int count = 0;
if (matcher.matches())
for (int i = 1; i < 5; i++)
if (null != matcher.group(i))
count++;
if (count >= 3)
// ...
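The same capture-group idea could look roughly like this in JavaScript, where groups that did not participate in the match come back as undefined rather than as empty strings. This is a sketch; the names PASSWORD_PATTERN and meetsThreeOfFour are made up for illustration.

```javascript
// Hypothetical names; the pattern is the one shown above.
var PASSWORD_PATTERN = /^(?=(\D*\d)|)(?=([^a-z]*[a-z])|)(?=([^A-Z]*[A-Z])|)(?=([a-zA-Z0-9]*[^a-zA-Z0-9])|).{4,8}$/;

function meetsThreeOfFour(input) {
  var match = PASSWORD_PATTERN.exec(input);
  if (match === null) {
    return false; // the length is outside 4..8 (or the input has a line break)
  }
  var count = 0;
  // Groups 1..4 correspond to digit, lower case, upper case, special character
  for (var i = 1; i <= 4; i++) {
    if (match[i] !== undefined) {
      count++;
    }
  }
  return count >= 3;
}

console.log(meetsThreeOfFour("Abc1")); // true  (upper, lower, digit)
console.log(meetsThreeOfFour("abc1")); // false (only lower and digit)
```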

Count number of words in string using JavaScript

I am trying to count the number of words in a given string using the following code:
var t = document.getElementById('MSO_ContentTable').textContent;
if (t == undefined) {
var total = document.getElementById('MSO_ContentTable').innerText;
} else {
var total = document.getElementById('MSO_ContentTable').textContent;
}
countTotal = cword(total);
function cword(w) {
var count = 0;
var words = w.split(" ");
for (i = 0; i < words.length; i++) {
// inner loop -- do the count
if (words[i] != "") {
count += 1;
}
}
return (count);
}
In that code I am getting data from a div tag and sending it to the cword() function for counting. However, the return value is different in IE and Firefox. Is any change required in the regular expression? One thing I have seen is that both browsers send the same string, so the problem must be inside the cword() function.
[edit 2022, based on comment] Nowadays, one would not extend the native prototype this way. A way to extend the native prototype without the danger of naming conflicts is to use a Symbol (available since ES2015) as the property key.
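A minimal sketch of what a Symbol-keyed word counter might look like follows. The symbol name countWords and the choice of /\S+/g as the definition of a word are assumptions made for illustration, not the original author's code.

```javascript
// Hypothetical symbol; because it is unique, it cannot clash with
// any present or future string-keyed method on String.prototype.
const countWords = Symbol('countWords');

Object.defineProperty(String.prototype, countWords, {
  // defineProperty makes the method non-enumerable by default
  value: function () {
    // A "word" is assumed to be any run of non-whitespace characters
    const matches = this.match(/\S+/g);
    return matches ? matches.length : 0;
  }
});

console.log('this string has five words'[countWords]()); // 5
console.log(''[countWords]());                           // 0
```

Only code that holds a reference to the symbol can call the method.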
Old answer: you can use split and add a wordcounter to the String prototype:
if (!String.prototype.countWords) {
String.prototype.countWords = function() {
return this.length && this.split(/\s+\b/).length || 0;
};
}
console.log(`'this string has five words'.countWords() => ${
'this string has five words'.countWords()}`);
console.log(`'this string has five words ... and counting'.countWords() => ${
'this string has five words ... and counting'.countWords()}`);
console.log(`''.countWords() => ${''.countWords()}`);
I would prefer a RegEx only solution:
var str = "your long string with many words.";
var wordCount = str.match(/(\w+)/g).length;
alert(wordCount); //6
The regex is
\w+ one or more word characters
/g global - don't stop after the first match
The parentheses are optional here: with the g flag, match() returns an array of all matches, so its length is the word count. Note that match() returns null when nothing matches, so guard against that for input without any words.
This is the best solution I've found:
function wordCount(str) {
var m = str.match(/[^\s]+/g)
return m ? m.length : 0;
}
This matches runs of non-whitespace characters, which is better than \w+ because \w only matches Latin letters, digits and _ (see http://www.ecma-international.org/ecma-262/5.1/#sec-15.10.2.6)
If you're not careful with whitespace matching you'll count empty strings, strings with leading and trailing whitespace, and all whitespace strings as matches while this solution handles strings like ' ', ' a\t\t!\r\n#$%() d ' correctly (if you define 'correct' as 0 and 4).
You can make a clever use of the replace() method although you are not replacing anything.
var str = "the very long text you have...";
var counter = 0;
// lets loop through the string and count the words
str.replace(/\S+/g, function (a) { // \b+ is not a valid pattern; match runs of non-whitespace instead
// for each word found increase the counter value by 1
counter++;
})
alert(counter);
The regex can be improved, for example to exclude HTML tags.
// Count words in a string, or what appears as words :-)
function countWordsString(string) {
    // Trim first so leading/trailing whitespace is not counted as a separator
    string = string.trim();
    // An empty (or all-whitespace) string contains no words
    if (string === '') {
        return 0;
    }
    var counter = 1;
    // Each run of whitespace separates two words
    string.replace(/\s+/g, function (a) {
        counter++;
    });
    return counter;
}
var numberWords = countWordsString("the very long text you have...");
