Compare strings with different spaces and potentially null characters

I'm currently making a Wikipedia scraper for a project I'm doing. The problem is that my code sometimes produces bugs when comparing strings: strings that look identical are sometimes still registered as different. For example:
var elementText = $("selector").text();
console.log(elementText); // "abc def"
console.log(elementText === "abc def"); // false
It seems that Wikipedia uses some weird characters that my code detects and doesn't like. I have tried:
function replaceBadSpaces(string) {
return decodeURIComponent(encodeURIComponent(string).replace("/%C2%A0/g", "%20"));
}
and using elementText.replace(/\s+/g, ''), and neither seems to work. How can I completely get rid of these characters so that strings that are intuitively equal actually do match as equal?
Note: I have also tested my code with ==, and it does seem to fix the issue; however, in the interest of avoiding future bugs I'd like to avoid using this fix.

Remove the quotation marks around the first argument of replace. A regular expression literal (/.../g) must not be wrapped in quotation marks: in quotes it becomes an ordinary string, and replace then searches for the literal text "/%C2%A0/g", which never occurs in the encoded string.
function replaceBadSpaces(string) {
return decodeURIComponent(encodeURIComponent(string).replace(/%C2%A0/g, "%20"));
}
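As a possible alternative, here is a minimal sketch that skips the URI round trip. It relies on the fact that \s in JavaScript regular expressions also matches the no-break space (U+00A0), which is why elementText.replace(/\s+/g, '') from the question removed the ordinary space as well. The helper name normalizeSpaces and the sample string are assumptions made for illustration, not part of the original answer.

```javascript
// Hypothetical helper: collapse every run of whitespace, including
// U+00A0 no-break spaces, into a single ordinary space and trim the ends.
function normalizeSpaces(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Sample value standing in for the scraped text; it prints as "abc def".
const scraped = "abc\u00A0def";

console.log(scraped === "abc def");                  // false
console.log(normalizeSpaces(scraped) === "abc def"); // true
```

Normalizing both sides of the comparison the same way keeps === usable without resorting to ==.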

Related

str.match(reg) equivalent for str.split(".") in javascript

Is there a regular expression reg, so that for any string str the results of str.split(".") and str.match(reg) are equivalent? If multiline should somehow matter, a solution for a single line would be sufficient.
As an example: Considering the RegExp /[^\.]+/g: for the string "nice.sentance", "nice.sentance".split(".") gives the same result as "nice.sentance".match(/[^\.]+/g) - ["nice", "sentance"]. However, this is not the case for every string. E.g. for the empty string "" they give different results, "".split(".") returning [""] and "".match(/[^\.]+/g) returning null, meaning /[^\.]+/g is not a solution, as it would need to work for any possible string.
The question comes from a misinterpretation of another question here and left me wondering. I do not have a practical application for it at the moment and am interested because I could not find an answer - it looks like an interesting RegExp problem. It may however be impossible.
Things I have considered:
In my opinion it is fairly clear that reg needs the global flag, which rules out capture groups as a possibility
/[^\.]+/g does not match empty parts, e.g. for "", ".a" or "a..a"
/[^\.]*/g produces additional empty strings after non-empty matches, because when iteration starts for the next match, it can fit in an empty match. E.g. for "a"
With features not currently available in JavaScript (but available in other languages), one could repair the previous flaw: /(?<=^|\.)[^\.]*/g
My conclusion here would be that real empty matches need to be considered but cannot be differentiated from empty matches between a non-empty match and the following dot or EOL, without "looking behind". This seems a bit vague to count as a proper argument for it being impossible, but maybe is already enough. There might however be a RegExp feature I don't know about, e.g. to advance the index after a match without including the symbol, or something similar to be used as a trick.
Allowing some correction step on the array resulting from match makes the problem trivial.
I found some related questions, which as expected utilize look-behind or capture groups though:
Regular Expression to find a string included between two characters while EXCLUDING the delimiters
characters between two delimiters
lots more similar to the above

I do not see the point but assume you have to apply this in an environment where .split is not available.
Crafting a matching regex that does the same as .split(".") or .split(/\./) requires accounting for several cases:
no input => empty split
single . => two empty splits
. at the beginning => empty split at position 0
. at the end => empty split at the end
. in the middle
multiple consecutive .s => one empty split per ..
Following this, I came up with the following solution:
^(?=\.)[^.]*|[^.]+(?=\.)|(?<=\.)[^.]*$|^$|[^.]+|(?=\.)(?<=\.)
Code Sample*:
const regex = /^(?=\.)[^.]*|[^.]+(?=\.)|(?<=\.)[^.]*$|^$|[^.]+|(?=\.)(?<=\.)/gm;
const test = `
.
.a
a.
a.a
a..a
.a.
..a..
.a.z
..`;
var a = test.split("\n");
a.forEach(str => {
console.log(`"${str}"`);
console.log(str.split("."));
let m; let matches = [];
while ((m = regex.exec(str)) !== null) {
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
matches.push(m[0]);
}
console.log(matches);
});
The output should be read in triple blocks: input/split/regex-match.
The output on each 2nd and 3rd line should be the same.
Have fun!
*Caveat: This requires RegExp Lookbehind Assertions: JavaScript Lookbehind assertions are approved by TC39 and are now part of the ES2018 standard.
RegExp Lookbehind Assertions have been implemented in V8 and shipped without flags with Google Chrome v62 and in Node.js v6 behind a flag and v9 without a flag. The Firefox team is working on it, and for Microsoft Edge, it's an implementation suggestion.
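For comparison, the shorter look-behind pattern mentioned in the question, /(?<=^|\.)[^.]*/g, can be checked against split in the same spirit. The sketch below assumes an engine with ES2018 look-behind support (otherwise the regex literal is a syntax error). The list of test strings is an illustrative selection, so an empty result is evidence for these cases only, not a proof for every string.

```javascript
// Pattern from the question; requires ES2018 look-behind support.
const reg = /(?<=^|\.)[^.]*/g;

// Illustrative inputs covering empty parts at the start, middle and end.
const cases = ["", ".", "a", ".a", "a.", "a..a", ".a.", "..a..", ".."];

// Collect every input for which match and split disagree.
const mismatches = cases.filter(
  (str) => JSON.stringify(str.split(".")) !== JSON.stringify(str.match(reg))
);

console.log(mismatches); // [] for the inputs listed above
```

With the g flag, match advances past empty matches on its own, so the manual lastIndex bump from the exec loop is not needed here.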

Why isn't .replace() working on a large generated string from escodegen.generate()?

I am attempting to generate some code using escodegen's .generate() function which gives me a string.
Unfortunately, it does not remove the semicolons completely (only on blocks of code), so I need to get rid of them myself. I am using the .replace() function for this; however, the semicolons are not removed for some reason.
Here is what I currently have:
const generatedCode = escodegen.generate(esprima.parseModule(code), escodegenOptions)
const cleanGeneratedCode = generatedCode.replace(';', '')
console.log('cleanGeneratedCode ', cleanGeneratedCode) // string stays the exact same.
Am I doing something wrong or missing something perhaps?

As per MDN, if you provide a substring instead of a regex:
It is treated as a verbatim string and is not interpreted as a regular expression. Only the first occurrence will be replaced.
So, the output probably isn't exactly the same as the code generated, but rather the first semicolon has been removed. To remedy this, simply use a regex with the "global" flag (g). An example:
const cleanGeneratedCode = escodegen.generate(esprima.parseModule(code), escodegenOptions).replace(/;/g, '');
console.log('Clean generated code: ', cleanGeneratedCode);
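The difference is easy to see on a small string. This is a minimal sketch that does not use escodegen at all; the sample code string is made up for illustration.

```javascript
// Made-up sample standing in for the generated code.
const code = "let a = 1; let b = 2; let c = a + b;";

// String pattern: only the first ";" is removed.
const firstOnly = code.replace(";", "");

// Regex with the g flag: every ";" is removed.
const allRemoved = code.replace(/;/g, "");

console.log(firstOnly);  // let a = 1 let b = 2; let c = a + b;
console.log(allRemoved); // let a = 1 let b = 2 let c = a + b
```

On a large generated file the first variant changes a single character, which is easy to mistake for no change at all. Note that blindly stripping every semicolon also affects semicolons inside string literals and for-loop headers.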

Regex expression for exactly known pattern without "cutting into" the string not working

I am currently developing a web application where I work with Java, JavaScript, HTML, jQuery, etc., and at some point I need to check whether an input matches a known pattern and only proceed if it does.
The pattern should be [at least one but max 3 numbers between 0-9]/[exactly 4 numbers between 0-9], so the only acceptable variations should be like
1/2014 or 23/2015 or 123/2016.
and nothing else. I CANNOT accept something like 1234/3012 or anything else, and this is exactly my problem: it accepts every input in which it can find the above pattern, so from 12345/6789 it accepts and saves 345/6789.
I am a total newbie with regex, so I checked out http://regexr.com and this is the code I have in my javascript:
$.validator.addMethod("hatarozat", function(value, element) {
return (this.optional(element) || /[0-9]{1,3}(?:\/)[0-9]{4}/i.test(value));
}, "Hibás határozat szám!");
So this is my regex: /[0-9]{1,3}(?:\/)[0-9]{4}/i
which I built up using the above website. What could be the problem, or how can I achieve what I described? I tried /^[0-9]{1,3}(?:\/)[0-9]{4}$/i but this doesn't seem to work. Please, anyone, help me: I have everything else done and am getting pretty stressed over something that looks so simple yet I cannot solve. Thank you!

Your last regex, the one with the anchors (^ and $), is correct. What prevents your code from working is this.optional(element) ||. That call is probably returning true, so no error is shown: || is a logical OR, so if its first operand is true the whole expression is true and the regex is never checked.
So, use
return /^[0-9]{1,3}\/[0-9]{4}$/.test(value);
Note you do not need the (?:...) with \/ as the grouping does not do anything important here and is just redundant. The anchors are important, since you want the whole string to match the pattern (and ^ anchors the regex at the start of the string and $ does that at the end of the string.)
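The effect of the anchors can be checked directly. A minimal sketch, independent of the validator plugin; the variable names and sample values are chosen for illustration.

```javascript
const unanchored = /[0-9]{1,3}\/[0-9]{4}/;
const anchored = /^[0-9]{1,3}\/[0-9]{4}$/;

// Print: value, unanchored result, anchored result.
for (const value of ["1/2014", "123/2016", "1234/3012", "12345/6789"]) {
  console.log(value, unanchored.test(value), anchored.test(value));
}
// 1/2014 true true
// 123/2016 true true
// 1234/3012 true false
// 12345/6789 true false
```

The unanchored pattern succeeds as soon as any substring fits, which is how 345/6789 is found inside 12345/6789.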

You need to use one of the following special characters in your regular expression:
^ and $
or \b
so either of these two regexes will work:
/\b[0-9]{1,3}(?:\/)[0-9]{4}\b/i;
or
/^[0-9]{1,3}(?:\/)[0-9]{4}$/i
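The two variants are not interchangeable, though. A minimal sketch with made-up inputs: \b only requires a word boundary around the number, so the pattern may still sit inside a longer string, while ^ and $ require the whole input to match.

```javascript
const wordBounded = /\b[0-9]{1,3}\/[0-9]{4}\b/;
const anchored = /^[0-9]{1,3}\/[0-9]{4}$/;

console.log(wordBounded.test("12345/6789"), anchored.test("12345/6789"));   // false false
console.log(wordBounded.test("23/2015"), anchored.test("23/2015"));         // true true
console.log(wordBounded.test("no. 12/2015"), anchored.test("no. 12/2015")); // true false
```

For validating a form field that must contain nothing but the number, the anchored version is the stricter choice.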

How to ignore / allow multiple line breaks in a Javascript replace regex

What I am trying to do is clean up an input field on blur. My blur code is completely functional, but I can't get the regex to work correctly. All of the characters I'm trying to allow are working correctly except for line breaks, of which I want to allow any number. I've tried different variations and combinations of \s, \r, and \n.
I am doing this because I want to prevent as many characters that don't really belong in a descriptive input field as possible. I am using entity to linq for database input, which should protect me from sql injection attacks, but I still want to restrict the characters for added security. I am allowing apostrophes, but that should be the only potential threat from the allowed characters listed in the regex below.
Once I get the regex, I'll also replace on paste using the same code block.
This is my javascript method that I reverted back to.
function CleanSentenceInput(AlphaNumString) {
var input = AlphaNumString;
var CleanInput = input.replace(/[^a-z0-9\s.,;:'()-]/gi, '');
CleanInput = myTrim(CleanInput);
return CleanInput;
}
Is there a way to allow any number of line breaks by modifying this replace regex?
Test Input:
aaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbb
cccccccccccccccc
cccccccccccccccc
Test Result:
aaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbcccccccccccccccccccccccccccccccc
Expected Result:
aaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbb
cccccccccccccccc
cccccccccccccccc
Update ** It turns out that the trim function I was using was removing the line-breaks. Here is that function:
Bad trim function:
function myTrim(x) {
return x.replace(/^\s+|\s+$/gm, '');
}
Is there a way to fix this regex so that it still replaces whitespace before and after, but not inside of the content?
Updated **
Good Trim Function:
function myTrim(x) {
return x.replace(/^\s+|\s+$/g, '');
}

As I noted from the beginning, the problem is not with the regex, since
/[^a-z0-9\s.,;:'()-]/gi
matches only characters other than whitespace (and other than the rest of the characters listed in the class), so it never removes line breaks.
In myTrim, you need to remove the m flag. With it, ^ and $ match at the start and end of every line, whereas you only want to trim the beginning and end of the whole string:
function myTrim(x) {
return x.replace(/^\s+|\s+$/g, '');
}
It is also possible to use trim() (it is supported by all modern browsers, IE9 already should support it).
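The effect of the m flag shows up on a short multi-line value. A minimal sketch; the sample input, with a blank line in the middle, is made up for illustration.

```javascript
// Leading spaces, a blank line in the middle, trailing spaces.
const input = "  aaa\n\nbbb  ";

// With m, ^ and $ also match at every line break, so the
// whitespace around the blank line is removed too.
const perLine = input.replace(/^\s+|\s+$/gm, "");

// Without m, only the two ends of the whole string are trimmed.
const endsOnly = input.replace(/^\s+|\s+$/g, "");

console.log(JSON.stringify(perLine));      // "aaabbb"
console.log(JSON.stringify(endsOnly));     // "aaa\n\nbbb"
console.log(endsOnly === input.trim());    // true
```

The last line shows that, for this input, the corrected regex and the built-in trim() agree.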

.trim() and regular expressions producing unexpected results

I wrote a fairly simple regular expression to detect when a string looks like it could be an email:
var looksLikeEmail = /^\S+@\S+\.\S+$/gi;
I'm using Knockout and the string being tested is the value of a textarea.
Essentially, say we have the value of the textarea in a variable text. This value was, for example, the typed in value abc@example.com.
What's odd, is it seems like, even though text === text.trim(), looksLikeEmail.test(text) returns true, but looksLikeEmail.test(text.trim()) returns false.
On the other hand, if I manually create the string var test2 = 'abc@example.com', it does not have this issue.
This seems to indicate to me that the textarea is inserting some odd characters or something... that .trim() is doing something weird with. But text.length === test2.length and text.length === text.trim().length
Does anyone know how to make this behave correctly?
I've written up a jsfiddle to quickly demonstrate the behavior...
If you go to the fiddle and try typing in an email, you will see the problem. Another weird behavior: add a space after the email, then remove it. /confused
Any help is much appreciated. Thanks.

.test(), just like .exec(), makes a global regex remember the index where its last match ended (its lastIndex property) and start the next search from there, so the second call fails. Just remove the g flag from your regex: it makes no sense on a non-multiline regex that is anchored to the beginning and end of the string.
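The statefulness can be reproduced without a textarea or Knockout. A minimal sketch with a deliberately simplified pattern and a made-up input; the variable names are chosen for illustration.

```javascript
const globalRe = /^\S+$/g; // simplified stand-in for the email pattern
const plainRe = /^\S+$/;   // same pattern without the g flag
const text = "abc";

console.log(globalRe.test(text)); // true
console.log(globalRe.lastIndex);  // 3, the next search starts here
console.log(globalRe.test(text)); // false, and lastIndex is reset to 0
console.log(globalRe.test(text)); // true again

console.log(plainRe.test(text), plainRe.test(text)); // true true
```

This also matches the symptom in the question: the two test calls return different results for equal strings simply because they are the first and second call on the same regex object.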
