JavaScript - Dynamic RegExp for Duplicate Characters - javascript

I'm very new to programming, and I'm trying to solve an exercise where you encode a string (in this case, a single word) based on whether or not the constituent characters occur twice or more. Characters occurring only once encode to, say "■", characters encoding twice or more encode to, say "X".
Example: input = "hippodrome" :: output = "■■XXX■■X■■"
I managed to solve it in a very convoluted way using nested loops and a key:value object storing character:occurrences, but i am trying to refactor the solution to be more efficient using a dynamically created RegExp, but i think i'm not understanding regex notation.
function encodeDupes(word) {
let encoded = "";
for (let char of word) {
let regex = new RegExp(char + "{2,}","ig"); // create a regex to see if "char" occurs 2 or more times
regex.test(word) ? encoded += "X" : encoded += "■"; // check this char against rest of word, push appropriately
}
return encoded;
}
it works with a more simple gate like char < "m" ? do X : do Y, and i thought i understood this answer here ({n,} = at least n occurrences), but i'm new enough that i'm still not sure if it's my regex or my logic.
thank you!

I'm very new to programming, ..., I am trying to refactor the solution to be more efficient using a dynamically created RegExp...
That's a bit of a catch 22 because regular expressions trade efficiency for convenience. In order for the regular expression "engine" to run, a grammar must be established, and a lexer, parser, and evaluator transform the string-based input expressions into program output. It's (sometimes) convenient to implement a particular program using regular expressions, but it's almost impossible to beat out a fundamental algorithm that isn't slowed down by the regular expression engine.
I managed to solve it in a very convoluted way using nested loops and a key:value object storing character:occurrences ...
Convoluted indeed, but sadly not uncommon to see even "expert" programmers do such things. An efficient algorithm emerges when we realise we don't need to count each letter. Instead, we only need to know whether a letter occurs more than once. Using two Set objects, once and more, we can determine the answer without needing to allocate counter memory per letter! And sets are lightning fast, thanks to O(1) constant-time lookup -
function encodeDupes(word)
{ const once = new Set
const more = new Set
for (const c of word)
if (more.has(c))
continue
else if (once.has(c))
(once.delete(c), more.add(c))
else
once.add(c)
return Array
.from(word, c => more.has(c) ? "X" : "■")
.join("")
}
console.log(encodeDupes("hippodrome"))
Output
■■XXX■■X■■

Usually RegExp are used to compare entire words or phrases.
Whenever {n,} is used, it's searching for two or more characters consecutively. Here's an example:
n{2,}
nn # match
anna # match
nan # does not match
The following RegExp isn't perfect, but it should suffice, replacing n with the character of your choice
(.*n{2,}.*)|(.*n.*n.*)+
(.*n{2,}.*) —— for consecutive ‘n’s
| —— or
(.*n.*n.*)+ —— ‘n’s with anything in between
Let me know how it goes.

Related

str.match(reg) equivalent for str.split(".") in javascript

Is there a regular expression reg, so that for any string str the results of str.split(".") and str.match(reg) are equivalent? If multiline should somehow matter, a solution for a single line would be sufficient.
As an example: Considering the RegExp /[^\.]+/g: for the string "nice.sentance", "nice.sentance".split(".") gives the same result as "nice.sentance".match(/[^\.]+/g) - ["nice", "sentance"]. However, this is not the case for any string. E.g. for the empty string "" they would give different results, "".split(".") returning [""] and "".match(/[^\.]+/g) returning null, meaning /[^\.]+/g is not a solution, as it would need to work for any possible string.
The question comes from a misinterpretation of another question here and left me wondering. I do not have a practical application for it at the moment and am interested because i could not find an answer - it looks like an interesting RegExp problem. It may however be impossible.
Things i have considered:
Imho it is fairly clear that reg needs the global flag, removing capture groups as a possibility
/[^\.]+/g does not match empty parts, e.g. for "", ".a" or "a..a"
/[^\.]*/g produces additional empty strings after non-empty matches, because when iteration starts for the next match, it can fit in an empty match. E.g. for "a"
With features not available on javascript currently (but on other languages), one could repair the previous flaw: /(?<=^|\.)[^\.]*/g
My conclusion here would be that real empty matches need to be considered but cannot be differentiated from empty matches between a non-empty match and the following dot or EOL, without "looking behind". This seems a bit vague to count as a proper argument for it being impossible, but maybe is already enough. There might however be a RegExp feature i don't know about, e.g. to advance the index after a match without including the symbol, or something similar to be used as a trick.
Allowing some correction step on the array resulting from match makes the problem trivial.
I found some related questions, which as expected utilize look-behind or capture groups though:
Regular Expression to find a string included between two characters while EXCLUDING the delimiters
characters between two delimiters
lots more similar to the above
I do not see the point but assume you have to apply this in an environment where .split is not available.
Crafting a matching regex that does the same as .split(".") or /\./ requires to account for several cases:
no input => empty split
single . => two empty splits
. at the beginning => empty split at position 0
. at the end => empty split at the end
. in the middle
multiple consecutive .s => one empty split per ..
Following this, I came up with the following solution:
^(?=\.)[^.]*|[^.]+(?=\.)|(?<=\.)[^.]*$|^$|[^.]+|(?=\.)(?<=\.)
Code Sample*:
const regex = /^(?=\.)[^.]*|[^.]+(?=\.)|(?<=\.)[^.]*$|^$|[^.]+|(?=\.)(?<=\.)/gm;
const test = `
.
.a
a.
a.a
a..a
.a.
..a..
.a.z
..`;
var a = test.split("\n");
a.forEach(str => {
console.log(`"${str}"`);
console.log(str.split("."));
let m; let matches = [];
while ((m = regex.exec(str)) !== null) {
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
matches.push(m[0]);
}
console.log(matches);
});
The output should be read in triple blocks: input/split/regex-match.
The output on each 2nd and 3rd line should be the same.
Have fun!
*Caveat: This requires RegExp Lookbehind Assertions: JavaScript Lookbehind assertions are approved by TC39 and are now part of the ES2018 standard.
RegExp Lookbehind Assertions have been implemented in V8 and shipped without flags with Google Chrome v62 and in Node.js v6 behind a flag and v9 without a flag. The Firefox team is working on it, and for Microsoft Edge, it's an implementation suggestion.

What is the time complexity or Big O notation for "str.replace()" built In function in Javascript?

I am confused if the time complexity for str.replace() function is O(n) or O(1), for example:
var str = "Hello World";
str = str.replace("Hello", "Hi");
console.log(str);
//===> str = "Hi World"
Is it always the same answer or does it depend on what we replace?
Any thoughts or helpful links?!
Firstly it should be
str = str.replace("Hello", "Hi");
Secondly,
searching a substring inside a string can be done in linear time using KMP algorithm which is the most efficient.
Replacing in the worst case will take linear time as well.
So overall time complexity: O(n)
Here n is dependent on the string str.
In the worst case it will end up traversing the whole string and still not find the searchValue given to the replace function.
It's definitely not O(1) (comparison of string searching algorithms) , but ECMAScript 6 doesn't dictate the search algorithm:
Search string for the first occurrence of searchString and let pos be the index within string of the first code unit of the matched substring and let matched be searchString. If no occurrences of searchString were found, return string.
So it depends on the implementation.
Is it always the same answer or does it depend on what we replace?
Generally, it will be slower for longer search strings. How much slower is implementation-dependent.
You'll really have to look into implementation details for a complete answer. But to start there's V8's runtime-strings.cc and builtins-string-gen.cc. It's a deep dive--and I don't know c++, so I'm not entirely sure if I'm even looking at the right files, but they seem to use different approaches depending on the size of the needle and the depth of recursion needed to build a search tree.
For example, in builtins-string-gen.cc there's a block under ES6 #sec-string.prototype.replace that checks if the search_string has a length of 1, and if the subject_string length is greater than 255 (0xFF). When those conditions are true it looks like Runtime_StringReplaceOneCharWithString in runtime-strings.cc is called which in turn will try calling StringReplaceOneCharWithString first with a tree-traversable subject_string.
If that search hits the recursion limit, the runtime makes another call to StringReplaceOneCharWithString but this time with a flattened subject_string.
So, my partially educated guess here is you're always looking at some kind of linear time. Possibly O(mn) when hitting the recursion limit and doing a follow-on naive search. I don't know for sure that it's a naive search, but a flattened string to me implies traversing the subject_string step-by-step instead of through a search tree.
And possibly something less than O(mn) when the tree-traversal doesn't hit the recursion limit, though I'm not entirely sure what they're gaining by walking the subject_string recursively.
For an actual what is the time complexity for Javascript implementations, you'll probably want to ask the runtime devs directly or see if what they're doing is like other string searching algorithms to figure out which cases run in what time complexity.

How to get the nth (Unicode) character from a string in JavaScript

Suppose we have a string with some (astral) Unicode characters:
const s = 'Hi 👋 Unicode!'
The [] operator and .charAt() method don't work for getting the 4th character, which should be "👋":
> s[3]
'�'
> s.charAt(3)
'�'
The .codePointAt() does get the correct value for the 4th character, but unfortunately it's a number and has to be converted back to a string using String.fromCodePoint():
> String.fromCodePoint(s.codePointAt(3))
'👋'
Similarly, converting the string into an array using splats yields valid Unicode characters, so that's another way of getting the 4th one:
> [...s][3]
'👋'
But i can't believe that going from string to number back to string, or having to split the string into an array are the only ways of doing this seemingly trivial thing. Isn't there a simple method for doing this?
> s.simpleMethod(3)
'👋'
Note: i know that the definition of "character" is somewhat fuzzy, but for the purpose of this question a character is simply the symbol that corresponds to a Unicode codepoint (no combining characters, no grapheme clusters, etc).
Update: the String.fromCodePoint(str.codePointAt(n)) method is not really viable, since the nth position there doesn't take previous astral symbols into account: String.fromCodePoint('👋🙈'.codePointAt(1)) // => '�'
(I feel kinda dumb asking this; like i'm probably missing something obvious. But previous answers to this questions don't work on strings with Unicode simbols on astral planes.)
The string iterator is the only thing that iterates through code points rather than UCS-2/UTF-16 code units. So:
const string = 'Hi 👋 Unicode!';
for (const symbol of string) {
console.log(symbol);
}
So to get a specific code point based on its index from a string:
const string = 'Hi 👋 Unicode!';
// Note: The spread operator uses the string iterator under the hood.
const symbols = [...string];
symbols[3]; // '👋'
Still, this would break with grapheme clusters, or emoji sequences such as 👨‍👩‍👧‍👦 (👨 + U+200D ZERO WIDTH JOINER + 👩 + U+200D ZERO WIDTH JOINER + 👧 + U+200D ZERO WIDTH JOINER + 👦). Text segmentation helps with that.
Do you actually need to get the 4th code point in the string, though? What’s your use case?
You can use the new u flag to regexp if it's available to you.
const chars = 'Hi 👋 Unicode!'.match(/./ug);
console.log(chars);
The accepted answer to this question is out of date.
There is now a member of the String object called .at()/1 which does exactly what you're hoping for. If you have shims, shams, a transcompiler like TypeScript or Babel, etc, just set whatever your local configuration is, and you should be good to go.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/at
Amusingly, the spec for this feature, as well as the most common implementation shim (the one that I use,) is written by the person who authored the now out-of date accepted answer here. So even when he's out of date, he's still up to date.
If shimming or transcompiling isn't appropriate for you, there's a library called jsesc that can handle it for you through simple escaping. I'll give you three guesses who wrote the library. First two don't count.
https://www.npmjs.com/package/jsesc

"+" Quantifier not doing its job?

I'm a novice programmer making a simple calculator in JavaScript for a school project, and instead of using eval() to evaluate a string, I made my own function calculate(exp).
Essentially, my program uses order of operations (PEMDAS, or Parenthesis, Exponents, Multiplication/Division, Addition/Subtraction) to evaluate a string expression. One of my regex patterns is like so ("mdi" for multiplication/division):
mdi = /(-?\d+(\.\d+)?)([\*\/])(-?\d+(\.\d+)?)/g; // line 36 on JSFiddle
What this does is:
-?\d+ finds an integer number
(\.\d+)? matches the decimal if there is one
[\*\/] matches the operator used (* or / for multiplication or division)
/g matches every occurence in the string expression.
I loop through this regular expression's matches with the following code:
while((res = mdi.exec(exp)) !== null) { // line 69 on JSFiddle
exp = exp.replace(mdi,
function(match,$1,$3,$4,$5) {
if($4 == "*")
return parseFloat($1) * parseFloat($5);
else
return parseFloat($1) / parseFloat($5);
});
exp = exp.replace(doN,""); // this gets rid of double negatives
}
However, this does not work all the time. It only works with numbers with an absolute value less than 10. I cannot do any operations on numbers like 24 and -5232000321, even though the regex should match it with the + quantifier. It works with small numbers, but crashes and uses up most of my CPU when the numbers are larger than 10.
For example, when the expression 5*.5 is inputted, 2.5 is outputted, but when you input 75*.5 and press enter, the program stops.
I'm not really sure what's happening here, because I can't locate the source of the error for some reason - nothing is showing up even though I have console.log() all over my code for debugging, but I think it is something wrong with this regex. What is happening?
The full code (so far) is here at JSFiddle.net, but please be aware that it may crash. If you have any other suggestions, please tell me as well.
Thanks for any help.
The problem is
bzp = /^.\d/;
while((res = bzp.exec(result)) !== null) {
result = result.replace(bzp,
function($match) {
console.log($match + " -> 0 + " + $match);
return "0" + $match;
});
}
It keeps prepending zeros with no limit.
Removing that code it works well.
I have also cleaned your code, declared variables, and made it more maintainable: Demo
If you have any other suggestions, please tell me as well.
As pointed out in the comments, parsing your input by iteratively applying regular expressions is very ad-hoc. A better approach would be to actually construct a grammar for your input language and parse based on that. Here's an example grammar that basically matches your input language:
expr ::= term ( additiveOperator term )*
term ::= factor ( multiplicativeOperator factor )*
expr ::= number | '(' expr ')'
additiveOperator ::= '+' | '-'
multiplicativeOperator ::= '*' | '/'
The syntax here is pretty similar to regular expressions, where parenthesese denote groups, * denotes zero-or-more repetitions, and | denotes alternatives. The symbols enclosed in single quotes are literals, whereas everything else is symbolic. Note that this grammar doesn't handle unary operators (based on your post it sounds like you assume a single negative sign for negative numbers, which can be parsed by the number parser).
There are several parser-generator libraries for JavaScript, but I prefer combinator-style parsers where the parser is built functionally at runtime rather than having to run a separate tool to generate the code for your parer. Parsimmon is a nice combinator parser for JavaScript, and the API is pretty easy to wrap your head around.
A parser usually returns some sort of a tree data structure corresponding to the parsed syntax (i.e. an abstract syntax tree). You then traverse this data structure in order to calculate the value of the arithmetic expression.
I created a fiddle demonstrating parsing and evaluating of arithmetic expressions. I didn't integrate any of this into your existing calculator interface, but if you can understand how to use the parser
Mathematical expression are not parsed and calculated with regular expressions because of the number of permutations and combinations available. The faster way so far, is POST FIX notation because other notations are not as fast as this one. As they mention on Wikipedia:
In comparison testing of reverse Polish notation with algebraic
notation, reverse Polish has been found to lead to faster
calculations, for two reasons. Because reverse Polish calculators do
not need expressions to be parenthesized, fewer operations need to be
entered to perform typical calculations. Additionally, users of
reverse Polish calculators made fewer mistakes than for other types of
calculator. Later research clarified that the increased speed
from reverse Polish notation may be attributed to the smaller number
of keystrokes needed to enter this notation, rather than to a smaller
cognitive load on its users. However, anecdotal evidence suggests
that reverse Polish notation is more difficult for users to learn than
algebraic notation.
Full article: Reverse Polish Notation
And also here you can see other notations that are still far more better than regex.
Calculator Input Methods
I would therefore suggest you change your algorithm to a more efficient one, personally I would prefer POST FIX.

How to split a string with the help of JavaScript's regex?

I need to split a string to one or more substrings each of which contains no more or less than two dots. For example, if the string is foo.boo.coo.too" then what would be the regex to get the following array?: ["foo.boo.coo", "boo.coo.too"]. I hope there will be someone to answer this question - I will really admire you, as I've been programming for several years and have not still be used to regular expressions well enough to solve this particular problem by myself. Thank you very much in advance. Let me know your identity so that I can credit you as a contributor of the program I am creating.
RegEx is for this Problem not the best solution a similar problem was discussed here: split-a-sting-every-3-characters-from-back-javascript
A good javascript solution would be a javascript function like this
function splitter(text){
var parts = text.split(".");
var times = parts.length - 2;
var values = [];
for(var index = 0; index<times;index++)
{
values.push(parts.slice(index,index+3).join("."));
}
return values;
}
splitter("too.boo.coo.too")
//=> Result tested on Chrome 25+ ["too.boo.coo", "boo.coo.too"]
I hope this helps
If you want to Use Regex try the Lookhead Stuff, this could help http://www.regular-expressions.info/lookaround.html
Regex by its nature will return non-intersecting results, so if you want "all matches" from a single regex - it's not possible.
So basically you will need to find first match, and then start from next position to find next match and so on; something like this technique described here regex matches with intersection in C# (it's not JavaScript but idea is the same)
You can use the following regex for example:
(?<=^|\.)((?:[^.]*\.){2}[^.]*?)(?=$|\.)
It ensures that it starts and ends with dot, or at begin/end of line, and contains exactly two dots inside, and captures result in first capture. You can replace * with + to make sure at least one symbol exists between dots, if it is required.
But you need to understand that such approach has really bad performance for the task you are solving, so may be using other way (like split + for) will be better solution.

Categories