Javascript Regular Expressions Functionality - javascript

I've spent a few hours on this and I can't seem to figure this one out.
In the code below, I'm trying to understand exactly what and how the regular expressions in the url.match are working.
As the code is below, it doesn't work. However if I remove (?:&toggle=|&ie=utf-8|&FORM=|&aq=|&x=|&gwp) it seems to give me the output that I want.
However, I don't want to remove this without understanding what it is doing.
I found a pretty useful resource, but after a few hours I still can't precisely determine what these expressions are doing:
https://developer.mozilla.org/en-US/docs/JavaScript/Guide/Regular_Expressions#Using_Parenthesized_Substring_Matches
Could someone break this down for me and explain how exactly it is parsing the strings. The expressions themselves and the placement of the parentheses is not really clear to me and frankly very confusing.
Any help is appreciated.
(function($) {
$(document).ready(function() {
function parse_keywords(url){
var matches = url.match(/.*(?:\?p=|\?q=|&q=|\?s=)([a-zA-Z0-9 +]*)(?:&toggle=|&ie=utf-8|&FORM=|&aq=|&x=|&gwp)/);
return matches ? matches[1].split('+') : [];
}
myRefUrl = "http://www.google.com/url?sa=f&rct=j&url=https://www.mydomain.com/&q=my+keyword+from+google&ei=fUpnUaage8niAKeiICgCA&usg=AFQjCNFAlKg_w5pZzrhwopwgD12c_8z_23Q";
myk1 = (parse_keywords(myRefUrl));
kw="";
for (i=0;i<myk1.length;i++) {
if (i == (myk1.length - 1)) {
kw = kw + myk1[i];
}
else {
kw = kw + myk1[i] + '%20';
}
}
console.log (kw);
if (kw != null && kw != "" && kw != " " && kw != "%20") {
orighref = $('a#applynlink').attr('href');
$('a#applynlink').attr('href', orighref + '&scbi=' + kw);
}
});
})(jQuery);

Let's break this regex down.
/
Begin regex.
.*
Match zero or more of any character (except a newline). Basically, we're willing to let the rest of this regex match at any point in the string.
(?:\?p=
|\?q=
|&q=
|\?s=)
In this, the ?: means 'do not capture anything inside of this group'. See http://www.regular-expressions.info/refadv.html
The \? means take ? literally, which is normally a character meaning 'match 0 or 1 copies of the previous token' but we want to match an actual ?.
Other than that, it's just looking for one of several alternatives (| means 'the regex is valid if I match either what's before me or what's after me').
([a-zA-Z0-9 +]*)
Now we match zero or more of any of the following characters in any arrangement: a-z, A-Z, 0-9, space and +. And since it is inside a () with no ?: we DO capture it.
(?:&toggle=
|&ie=utf-8
|&FORM=
|&aq=
|&x=
|&gwp)
We see another ?: so this is another non-capturing group.
Other than that, it is just full of literal characters separated by |s, so it is not doing any fancy logic.
/
End regex.
In summary, this regex looks through the string for any of the alternatives in the first non-capturing group, captures the run of allowed characters that follows it, then requires one of the alternatives in the second non-capturing group to 'cap' it off. What it returns is everything that was between those two non-capturing groups. (Think of it as a 'sandwich': we look for the header and footer and capture everything in between that we're interested in.)
After the regex runs, we do this:
return matches ? matches[1].split('+') : [];
Which grabs the captured group and splits it on + into an array of strings.
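To make the difference between capturing and non-capturing groups concrete, here is a minimal sketch. The URL is a made-up example (it is not from the question), chosen so that it does contain one of the trailing markers, &aq=.

```javascript
// Hypothetical URL that contains both a leading marker (?q=) and a trailing one (&aq=).
const url = "http://example.com/search?q=my+keyword&aq=f";
const pattern = /.*(?:\?p=|\?q=|&q=|\?s=)([a-zA-Z0-9 +]*)(?:&toggle=|&ie=utf-8|&FORM=|&aq=|&x=|&gwp)/;

const matches = url.match(pattern);
console.log(matches.length);         // 2: the whole match plus the single capturing group
console.log(matches[0]);             // everything the pattern consumed, markers included
console.log(matches[1]);             // only the (...) group: "my+keyword"
console.log(matches[1].split("+"));  // [ 'my', 'keyword' ]
```

The two (?:...) groups take part in the match but get no slot of their own in the result array, which is why matches[1] is the keywords.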

For situations like this, it's really helpful to visualize it with www.debuggex.com (which I built). It immediately shows you the structure of your regex and allows you to walk through step-by-step.
In this case, the reason it works when you remove the last part of your regex is because none of the strings &toggle=, &ie=utf-8, etc are in your sample url. To see this, drag the grey slider above the test string on debuggex and you'll see that it never makes it past the & in that last group.
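The same point can be checked in code. This is only a sketch: strict is the pattern from the question, relaxed is a hypothetical variant with the trailing group removed, and parseKeywords is a slightly generalised copy of the question's function that takes the pattern as a parameter.

```javascript
// Pattern from the question: a trailing marker is mandatory.
const strict = /.*(?:\?p=|\?q=|&q=|\?s=)([a-zA-Z0-9 +]*)(?:&toggle=|&ie=utf-8|&FORM=|&aq=|&x=|&gwp)/;
// Hypothetical variant without the trailing group.
const relaxed = /.*(?:\?p=|\?q=|&q=|\?s=)([a-zA-Z0-9 +]*)/;

function parseKeywords(url, pattern) {
  const matches = url.match(pattern);
  return matches ? matches[1].split("+") : [];
}

const refUrl = "http://www.google.com/url?sa=f&rct=j&url=https://www.mydomain.com/&q=my+keyword+from+google&ei=fUpnUaage8niAKeiICgCA&usg=AFQjCNFAlKg_w5pZzrhwopwgD12c_8z_23Q";

console.log(parseKeywords(refUrl, strict));   // [] because the keywords are followed by &ei=, which is not in the list
console.log(parseKeywords(refUrl, relaxed));  // [ 'my', 'keyword', 'from', 'google' ]
```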

Related

Regex taking long time to evaluate

At the time of login, I need to allow either username (alphanumeric and some special characters) or email address or username\domain format only. For this purpose, I used this regex with or (|) condition. Along with this, I need to allow some other language characters like Japanese, Chinese etc., so included those as well in the same regex. Now, the issue is when I enter characters (>=30) and # or some special character, the evaluation of this regex is taking some seconds and browser goes in hang mode.
export const usernameRegex = /(^[a-zA-Z0-9._~^#!%+\-]+#[a-z0-9.-]+\.[a-z]{2,4})+|^[a-zA-Z0-9._~^#!\-]+\\([._-~^#!]|[\p{Ll}\p{Lm}\p{Lt}a-zA-Z0-9-\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf\u3130-\u318F\uAC00-\uD7AF])+|^([._-~^#!]|[\p{Ll}\p{Lm}\p{Lt}a-zA-Z0-9-\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf\u3130-\u318F\uAC00-\uD7AF])+$/gu;
When I tried removing the other language character set such as [\p{Ll}\p{Lm}\p{Lt}a-zA-Z0-9-\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf\u3130-\u318F\uAC00-\uD7AF])+|^([._-~^#!]|[\p{Ll}\p{Lm}\p{Lt}a-zA-Z0-9-\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf\u3130-\u318F\uAC00-\uD7AF] it works fine.
I understood that generally regex looks simple but it does a lot under the hood. Is there any modification that needs to be done in this regex, so that it doesn't take time to evaluate. Any help is much appreciated!
Valid texts:
stackoverflow,
stackoverflow1~,
stackoverflow!#~^-,
stackoverflow#g.co,
stackoverflow!#~^-#g.co,
こんにちは,
你好,
tree\guava
EDIT:
e.g. Input causing the issue
stackoverflowstackoverflowstackoverflow#
On giving the above text it is taking long time.
https://imgur.com/T2Vg4lg
Your regex seems to consist of three regular expressions concatenated with |
(^[a-zA-Z0-9._~^#!%+\-]+#[a-z0-9.-]+\.[a-z]{2,4})+
^[a-zA-Z0-9._~^#!\-]+\\([._-~^#!]|[\p{Ll}\p{Lm}\p{Lt}a-zA-Z0-9-\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf\u3130-\u318F\uAC00-\uD7AF])+
^([._-~^#!]|[\p{Ll}\p{Lm}\p{Lt}a-zA-Z0-9-\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf\u3130-\u318F\uAC00-\uD7AF])+$
First regex, (^...)+: how many times do you think this entire pattern, which starts at the beginning of the string, can occur? Either it's a second occurrence OR it starts at the beginning of the string; it can't be both.
So ^[a-zA-Z0-9._~^#!%+\-]+#[a-z0-9.-]+\.[a-z]{2,4}
parts 2 and 3 are mostly identical, only that nr. 2 contains this block [a-zA-Z0-9._~^#!\-]+\\ followed by what's the rest of the 3rd part.
So let's combine them: ^(?:[a-zA-Z0-9._~^#!\-]+\\)? ... and make sure to use non-capturing groups when possible.
([abc]|[def])+ can be simplified to [abcdef]+. This btw. is the part that's killing your performance.
your regex ends with a $. This was only part of the last part, but I assume you always want to match the entire string? So let's make all 3 (now 2) parts ^ ... $
Summary:
/^[a-zA-Z0-9._~^#!%+-]+#[a-z0-9.-]+\.[a-z]{2,4}$|^(?:[a-zA-Z0-9._~^#!-]+\\)?[._-~^#!\p{Ll}\p{Lm}\p{Lt}a-zA-Z0-9-\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf\u3130-\u318F\uAC00-\uD7AF]+$/u
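As a quick sanity check, the sketch below runs the summarised pattern against the valid texts listed in the question and against the input that was reported as slow. The variable names are only illustrative, and the timing line is there to be looked at, not as a benchmark.

```javascript
const usernameRegex = /^[a-zA-Z0-9._~^#!%+-]+#[a-z0-9.-]+\.[a-z]{2,4}$|^(?:[a-zA-Z0-9._~^#!-]+\\)?[._-~^#!\p{Ll}\p{Lm}\p{Lt}a-zA-Z0-9-\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf\u3130-\u318F\uAC00-\uD7AF]+$/u;

// The valid texts from the question ("tree\guava" needs an escaped backslash in a JS string).
const validTexts = [
  "stackoverflow",
  "stackoverflow1~",
  "stackoverflow!#~^-",
  "stackoverflow#g.co",
  "stackoverflow!#~^-#g.co",
  "こんにちは",
  "你好",
  "tree\\guava",
];
for (const text of validTexts) {
  console.log(text, usernameRegex.test(text));
}

// The input that was reported to hang the browser with the original pattern.
const slowInput = "stackoverflowstackoverflowstackoverflow#";
const started = Date.now();
const matched = usernameRegex.test(slowInput);
console.log(matched, (Date.now() - started) + " ms");
```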
Here is a JS example of how a simple regex would try to match a string: how it fails, backtracks, retries with the other side of the |, and so on, and so on.
// let's implement what `/([a-z]|[\p{Ll}])+/u` would do,
// how it would try to match something.
const a = /[a-z]/; // left part
const b = /[\p{Ll}]/u; // right part
const string = "abc,";
const testNextCharacter = (index) => {
if (index === string.length) {
return true;
}
const pattern = index + " ".repeat(index + 1) + "%o.test(%o)";
const character = string.charAt(index);
console.log(pattern, a, character);
// checking the left part && if successful checking the next character
if (a.test(character) && testNextCharacter(index + 1)) {
return true;
}
// checking the right part && if successful checking the next character
console.log(pattern, b, character);
if (b.test(character) && testNextCharacter(index + 1)) {
return true;
}
return false;
}
console.log("result", testNextCharacter(0));
And this is only 4 characters. Why don't you try this with 5 or 6 characters to get an impression of how much work this will be at 20 characters?
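To put numbers on that, here is a small sketch that counts the attempts instead of logging them. It follows the same naive emulation as the snippet above (it does not measure what a real regex engine does internally), and countAttempts is a name made up for this illustration.

```javascript
const left = /[a-z]/;        // left side of the |
const right = /[\p{Ll}]/u;   // right side of the |

// Counts how many single-character tests the naive emulation performs on `input`.
function countAttempts(input) {
  let attempts = 0;
  const tryFrom = (index) => {
    if (index === input.length) return true;
    const ch = input.charAt(index);
    attempts++;
    if (left.test(ch) && tryFrom(index + 1)) return true;
    attempts++;
    if (right.test(ch) && tryFrom(index + 1)) return true;
    return false;
  };
  tryFrom(0);
  return attempts;
}

console.log(countAttempts("abc,")); // 30
for (const n of [5, 10, 20]) {
  // n lowercase letters followed by one character that neither side accepts
  console.log(n, countAttempts("a".repeat(n) + ","));
}
```

Every extra letter doubles the work (plus two), so n letters cost 2^(n+2) - 2 attempts in this emulation: 126 for 5 letters, 4094 for 10, and over four million for 20.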

Applying currency format using replace and a regular expression

I am trying to understand some code where a number is converted to a currency format. Thus, if you have 16.9 it converts to $16.90. The problem with the code is if you have an amount over $1,000, it just returns $1, an amount over $2,000 returns $2, etc. Amounts in the hundreds show up fine.
Here is the function:
var _formatCurrency = function(amount) {
return "$" + parseFloat(amount).toFixed(2).replace(/(\d)(?=(\d{3})+\.)/g, '$1,')
};
(The reason the semicolon is after the bracket is because this function is in itself a statement in another function. That function is not relevant to this discussion.)
I found out that the person who originally put the code in there found it somewhere but didn't fully understand it and didn't test this particular scenario. I myself have not dealt much with regular expressions. I am not only trying to fix it, but to understand how it is working as it is now.
Here's what I've found out. The code between the slash after the open parenthesis and the slash before the g is the pattern. The g means global search. The \d means digit, and the (?=\d{3})+\. appears to mean find 3 digits plus a decimal point. I'm not sure I have that right, though, because if that was correct shouldn't it ignore numbers like 5.4? That works fine. Also, I'm not sure what the '$1,' is for. It looks to me like it is supposed to be placed where the digits are, but wouldn't that change all the numbers to $1? Also, why is there a comma after the 1?
Regarding your comment
I was hoping to just edit the regex so it would work properly.
The regex you are currently using is obviously not working for you so I think you should consider alternatives even if they are not too similar, and
Trying to keep the code change as small as possible
Understandable but sometimes it is better to use a code that is a little bit bigger and MORE READABLE than to go with compact and hieroglyphical.
Back to business:
I'm assuming you are getting a string as an argument and this string is composed only of digits and may or may not have a dot before the last 1 or 2 digits. Something like
//input      //intended output
1            $1.00
20           $20.00
34.2         $34.20
23.1         $23.10
62516.16     $62,516.16
15.26        $15.26
4654656      $4,654,656.00
0.3          $0.30
I will let you do a pre-check of (assumed) non-valids like 1. | 2.2. | .6 | 4.8.1 | 4.856 | etc.
Proposed solution:
var _formatCurrency = function(amount) {
amount = "$" + amount.replace(/(\d)(?=(\d{3})+(\.(\d){0,2})*$)/g, '$1,');
if(amount.indexOf('.') === -1)
return amount + '.00';
var decimals = amount.split('.')[1];
return decimals.length < 2 ? amount + '0' : amount;
};
Regex break down:
(\d): Matches one digit. Parentheses group things for referencing when needed.
(?=(\d{3})+(\.(\d){0,2})*$). Now this guy. From end to beginning:
$: Matches the end of the string. This is what allows you to match from the end instead of the beginning which is very handy for adding the commas.
(\.(\d){0,2})*: This part processes the dot and decimals. The \. matches the dot. (\d){0,2} matches 0, 1 or 2 digits (the decimals). The * implies that this whole group can be empty.
?=(\d{3})+: \d{3} matches 3 digits exactly. + means at least one occurrence. Finally ?= matches a group after the main expression without including it in the result. In this case it takes three digits at a time (from the end remember?) and leaves them out of the result for when replacing.
g: Match and replace globally, the whole string.
Replacing with $1,: This is how captured groups are referenced for replacing, in this case the wanted group is number 1. Since the pattern will match every digit in the position 3n+1 (starting from the end or the dot) and catch it in the group number 1 ((\d)), then replacing that catch with $1, will effectively add a comma after each capture.
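The input/output table above can be turned into a quick check. This is just a sketch: formatCurrency is the proposed function copied under a different name, and the cases array restates the table.

```javascript
// The proposed solution from above, copied as-is (expects a string argument).
const formatCurrency = function (amount) {
  amount = "$" + amount.replace(/(\d)(?=(\d{3})+(\.(\d){0,2})*$)/g, "$1,");
  if (amount.indexOf(".") === -1) return amount + ".00";
  const decimals = amount.split(".")[1];
  return decimals.length < 2 ? amount + "0" : amount;
};

// [input, intended output] pairs taken from the table.
const cases = [
  ["1", "$1.00"],
  ["20", "$20.00"],
  ["34.2", "$34.20"],
  ["23.1", "$23.10"],
  ["62516.16", "$62,516.16"],
  ["15.26", "$15.26"],
  ["4654656", "$4,654,656.00"],
  ["0.3", "$0.30"],
];

for (const [input, expected] of cases) {
  const actual = formatCurrency(input);
  console.log(input.padEnd(10), actual, actual === expected ? "ok" : "MISMATCH");
}
```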
Try it and please give feedback.
Also if you haven't already you should (and SO has not provided me with a format to stress this enough) really really look into this site as suggested by Taplar
The pattern is invalid, and your understanding of the function is incorrect. This function formats a number in a standard US currency, and here is how it works:
The parseFloat() function converts a string value to a decimal number.
The toFixed(2) function rounds the decimal number to 2 digits after the decimal point.
The replace() function is used here to add the thousands separators (i.e. a comma after every 3 digits). The pattern is incorrect, so here is a suggested fix /(\d)(?=(\d{3})+\.)/g and this is how it works:
The (\d) captures a digit.
The (?=(\d{3})+\.) is called a look-ahead, and it ensures that the captured digit above is followed by one or more (+) sets of 3 digits (\d{3}) and then by the decimal point (\.).
The g flag/modifier is to apply the pattern globally, that is on the entire amount.
The replacement $1, replaces the pattern with the first captured group $1, which is in our case the digit (\d) (so technically replacing the digit with itself to make sure we don't lose the digit in the replacement) followed by a comma ,. So like I said, this is just to add the thousands separator.
Here are some tests with the suggested fix. Note that it works fine with numbers and strings:
var _formatCurrency = function(amount) {
return "$" + parseFloat(amount).toFixed(2).replace(/(\d)(?=(\d{3})+\.)/g, '$1,');
};
console.log(_formatCurrency('1'));
console.log(_formatCurrency('100'));
console.log(_formatCurrency('1000'));
console.log(_formatCurrency('1000000.559'));
console.log(_formatCurrency('10000000000.559'));
console.log(_formatCurrency(1));
console.log(_formatCurrency(100));
console.log(_formatCurrency(1000));
console.log(_formatCurrency(1000000.559));
console.log(_formatCurrency(10000000000.559));
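To see which digits the look-ahead explained above actually selects, the following sketch lists the match positions with matchAll. The sample value is arbitrary.

```javascript
const sample = "1234567.89";
const pattern = /(\d)(?=(\d{3})+\.)/g;

// Index of every digit that is followed by one or more groups of three digits and then the dot.
const indices = [...sample.matchAll(pattern)].map((m) => m.index);
console.log(indices);                        // [ 0, 3 ]  -> the "1" and the "4"
console.log(sample.replace(pattern, "$1,")); // 1,234,567.89
```

Only the matched digit is replaced (by itself plus a comma). The look-ahead consumes nothing, so the digits it inspected stay available for the next match.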
Okay, I want to apologize to everyone who answered. I did some further tracing and found out the JSON call which was bringing in the amount did in fact have a comma in it, so it is just parsing that first digit. I was looking in the wrong place in the code when I thought there was no comma in there already. I do appreciate everyone's input and hope you won't think too bad of me for not catching that before this whole exercise. If nothing else, at least I now know how that regex operates so I can make use of it in the future. Now I just have to go about removing that comma.
Have a great day!
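A minimal sketch of that root cause: parseFloat stops at the first character it cannot parse, so a comma that is already in the input cuts the number short. The sample value "1,234.5" is made up for illustration, and stripping the commas first is one possible fix, not necessarily what the original poster ended up doing.

```javascript
const formatCurrency = (amount) =>
  "$" + parseFloat(amount).toFixed(2).replace(/(\d)(?=(\d{3})+\.)/g, "$1,");

const raw = "1,234.5";                     // hypothetical value that already contains a comma
console.log(parseFloat(raw));              // 1
console.log(formatCurrency(raw));          // $1.00

const cleaned = raw.replace(/,/g, "");     // remove existing thousands separators first
console.log(formatCurrency(cleaned));      // $1,234.50
```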
Assuming that you are working with USD only, then this should work for you as an alternative to Regular Expressions. I have also included a few tests to verify that it is working properly.
var test1 = '16.9';
var test2 = '2000.5';
var test3 = '300000.23';
var test4 = '3000000.23';
function stringToUSD(inputString) {
const splitValues = inputString.split('.');
const wholeNumber = splitValues[0].split('')
.map(val => parseInt(val))
.reverse()
.map((val, idx, arr) => idx !== 0 && (idx + 1) % 3 === 0 && arr[idx + 1] !== undefined ? `,${val}` : val)
.reverse()
.join('');
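// Handle the decimals separately: parseFloat stops at the first comma, so "2,000.5" would parse as 2.
const decimals = (splitValues[1] || '').padEnd(2, '0').slice(0, 2); // pads or truncates to 2 digits (no rounding)
return `$${wholeNumber}.${decimals}`;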
}
console.log(stringToUSD(test1));
console.log(stringToUSD(test2));
console.log(stringToUSD(test3));
console.log(stringToUSD(test4));
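Another regex-free option, swapped in here as an alternative technique and not part of the answer above, is the built-in Intl.NumberFormat. A minimal sketch, assuming the en-US locale and USD; toUSD is a helper name invented for this example:

```javascript
// Built-in formatter for US dollars: adds "$", thousands separators and two decimals.
const usd = new Intl.NumberFormat("en-US", { style: "currency", currency: "USD" });

// Hypothetical helper: accepts the same kind of string input as stringToUSD.
const toUSD = (inputString) => usd.format(Number(inputString));

console.log(toUSD("16.9"));        // $16.90
console.log(toUSD("2000.5"));      // $2,000.50
console.log(toUSD("3000000.23"));  // $3,000,000.23
```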

javascript regex for region code

I have a regex problem with validation for a region code.
My region code could be only one digit, but it could also be digits separated by '-'.
For example, my region code could be one of the following:
6
6-66
77-7
As you can see, I must have at least one digit, or digits separated by '-', and if they are separated there should be digits after the '-' sign (it does not matter how many). So 6- must not be validated as a legal region code. I have tried for 2 hours to solve this, but I couldn't, so please help me! Thank you!
/\d+(-\d+)?$/
This will match 6, 6-66 and 77-7, but not 6-
If what you are looking for is the whole string:
/^\d+(?:-\d+)?$/
or something like that:
if (parseInt(yourstring.split(/-/)[0])>=eval(yourstring)) alert('true');
else alert('false');
But it is more complicated :) and less efficient! And if the input is not valid (for example 6-), eval throws and your code will crash!
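The difference between the two patterns above only shows up on inputs beyond the ones in the question. A small sketch; the extra inputs x6 and 6-6-6 are made up to illustrate the point:

```javascript
const unanchored = /\d+(-\d+)?$/;    // only requires the END of the string to look right
const anchored = /^\d+(?:-\d+)?$/;   // requires the WHOLE string to look right

for (const code of ["6", "6-66", "77-7", "6-", "x6", "6-6-6"]) {
  console.log(code.padEnd(6), unanchored.test(code), anchored.test(code));
}
// "x6" and "6-6-6" pass the unanchored pattern (their tails match) but fail the anchored one.
```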
var data = ['6', '6-66', '77-7', '6-'];
var len = data.length;
for(var i=0; i<len; ++i) {
var current = data[i];
var result = data[i].match(/^(\d+|\d+[-]\d+)$/);
if(result != null) {
console.log(current);
}
}
--output:--
6
6-66
77-7
For a quick answer you can try the following:
/^(?:([0-9])|([0-9]\-[0-9][0-9])|([0-9][0-9]\-[0-9]))$/
or, in case your engine supports Perl-style character classes:
/^(?:(\d)|(\d\-\d\d)|(\d\d\-\d))$/
Here is what it does:
between / and / resides a string defining a regular expression
\d stands for one digit; it could also be written as [0-9]
() defines a sub-expression, so (\d) matches your first one-digit style, (\d-\d\d) the second three-digit style, and (\d\d-\d) the third variant of a three-digit region code
| works as "OR", like (A)|(B)|(C), so by combining the previous three we get:
/(\d)|(\d-\d\d)|(\d\d-\d)/
Finally, ^ means start of string and $ means end of string. Because | has the lowest precedence, the three alternatives are wrapped in a non-capturing group (?:...); otherwise ^ would apply only to the first alternative and $ only to the last.
There is also the so-called BRE mode (in which you have to add a "\" symbol before each parenthesis), but I think that is not the case here. However, if you have some free time, please consider a quick tutorial like this one.
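Anchors and | interact in a way that is easy to get wrong: without a surrounding group, ^ belongs only to the first alternative and $ only to the last. A short sketch of the difference; the inputs 6abc and abc77-7 are made up for illustration:

```javascript
const ungrouped = /^(\d)|(\d-\d\d)|(\d\d-\d)$/;       // ^ binds to the first branch, $ to the last
const grouped = /^(?:(\d)|(\d-\d\d)|(\d\d-\d))$/;     // both anchors apply to every branch

for (const code of ["6", "6-66", "77-7", "6abc", "abc77-7"]) {
  console.log(code.padEnd(8), ungrouped.test(code), grouped.test(code));
}
// "6abc" passes the ungrouped pattern via the branch ^(\d); "abc77-7" passes via (\d\d-\d)$.
```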

How to make this simple regexp?

I need to match a string that starts and ends with alphanumeric characters, is between 5 and 20 characters long, and may have one space or none between the characters. I tried /^[a-z\s?A-Z0-9]{5,20}$/ but this is not working.
EDIT
test test -should pass
testtest -should pass
test test test -should not pass
You can't do this with traditional regex without writing a ridiculously long expression, so you need to use a look-ahead:
/^(?=(\w| ){5,20}$)\w+ ?\w+$/
This says: make sure there are between 5 and 20 characters in the match, then match /\w+ ?\w+/
Note I used \w for simplification. It is the same as your character class above except it also accepts underscores. If you don't want to match them you have to do:
/^(?=[a-zA-Z0-9 ]{5,20}$)[a-zA-Z0-9]+ ?[a-zA-Z0-9]+$/
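A quick check of the look-ahead approach against the examples from the question. This sketch assumes the 5 to 20 character limit stated in the question; the last input (ab c) is an extra, made-up case that is too short.

```javascript
// Look-ahead checks overall length and allowed characters; the rest allows at most one inner space.
const pattern = /^(?=[a-zA-Z0-9 ]{5,20}$)[a-zA-Z0-9]+ ?[a-zA-Z0-9]+$/;

for (const input of ["test test", "testtest", "test test test", "ab c"]) {
  console.log(JSON.stringify(input), pattern.test(input));
}
// "test test" and "testtest" pass; "test test test" (two spaces) and "ab c" (too short) fail.
```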
You can't put a ? inside of [...]. [...] is used to specify a set of characters precisely, you can't maybe (?) have a character inside a set of characters. The occurrence of any specific characters is already optional, the ? is meaningless.
If you allow any number of spaces inside your match, just remove the question mark. If you want to allow a single space but no more, then regular expressions alone can't do that for you, you'd need something like
if (/^[a-z\sA-Z0-9]{5,20}$/.test(myString) && (myString.match(/\s/g) || []).length <= 1)
You couldn't do this with a single traditional regex without it being dozens of lines long; regexes are meant for matching simpler patterns than this.
If you only want to use regexes, you could use two instead of one. The first matches the general pattern, the second ensures that at most one whitespace character is found.
if (myString.match(/^[a-z\sA-Z0-9]{5,20}$/) && myString.match(/^[^\s]*\s?[^\s]*$/)) {
Example Usage
var inputs = ["test test", "testtest", "test test test"];
for (var index = 0; index < inputs.length; index++) {
  var myString = inputs[index];
  // Both checks must pass: allowed characters and length, then at most one whitespace character.
  if (myString.match(/^[a-z\sA-Z0-9]{5,20}$/) && myString.match(/^[^\s]*\s?[^\s]*$/)) {
    console.log(myString + " matches.");
  } else {
    console.log(myString + " does not match.");
  }
}
This produces the output specified in your question.
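Meh, so here's the ridiculously long traditional regex for the same
(?i)[a-z0-9]+( [a-z0-9]+)?{5,12}
JS version (w/o the nested quantifier). Note that it allows a single space between any two characters, so it also accepts "test test test":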
/^([a-z0-9]( [a-z0-9])?){5,12}$/i

Sort lines on webpage using javascript/ regex

I'd like to write a Greasemonkey script that requires finding lines ending with a string ("copies.") & sorting those lines based on the number preceding that string.
The page I'm looking to modify does not use tables unfortunately, just the br/ tag, so I assume that this will involve Regex:
http://www.publishersweekly.com/article/CA6591208.html
(Lines without the matching string will just be ignored.)
Would be grateful for any tips to get me started.
Most times, HTML and RegEx do not go together, and when parsing HTML your first thought should not be RegEx.
However, in this situation, the markup looks simple enough that it should be okay - at least until Publishers Weekly changes how they do that page.
Here's a function that will extract the data, grab the appropriate lines, sort them, and put them back again:
($j is jQuery)
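function reorderPwList()
{
  var Container = $j('#article span.table');
  // Matches a whole line ending in "<number> copies.<br />"; $1 is the number.
  var TargetLines = /^.+?(\d+(?:,\d{3})*) copies\.<br ?\/?>$/gmi;
  var Lines = Container.html().match( TargetLines );
  // match() returns null when no line matches, so leave the page untouched in that case.
  if ( !Lines ) return;
  Lines.sort( sortPwCopies );
  Container.html( Lines.join('\n') );

  // Comparator: ascending by number of copies.
  function sortPwCopies( a , b )
  {
    // Reduce a line to its number ($1), then drop the thousands separators.
    function getCopyNum( line )
    { return Number( line.replace(TargetLines,'$1').replace(/\D/g,'') ); }
    return getCopyNum(a) - getCopyNum(b);
  }
}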
And an explanation of the regex used there:
^                 # start of line
.+?               # lazy match one or more non-newline characters
(                 # start capture group $1
  \d+             # match one or more digits (0-9)
  (?:             # non-capture group
    ,\d{3}        # comma, then three digits
  )*              # end group, repeat zero or more times
)                 # end group $1
 copies\.         # literal text, with . escaped
<br ?\/?>         # match a br tag, with optional space or slash just in case
$                 # end of line
(For readability, I've indented the groups - only the spaces before 'copies' and after 'br' are valid ones.)
The regex flags gmi are used, for global, multi-line mode, case-insensitive matching.
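The regex and the sort can be tried without a browser. In the sketch below the HTML is a made-up stand-in for the page's markup (titles and numbers are invented), and copyCount is an illustrative helper name.

```javascript
// Hypothetical sample of the page's markup: one entry per line, separated by <br />.
const html = [
  "Alpha Title (Publisher A). 20,000 copies.<br />",
  "A line without a copy count<br />",
  "Beta Title (Publisher B). 100,000 copies.<br />",
  "Gamma Title (Publisher C). 5,500 copies.<br />",
].join("\n");

const targetLines = /^.+?(\d+(?:,\d{3})*) copies\.<br ?\/?>$/gmi;

// Reduce a matched line to its captured number and drop the thousands separators.
const copyCount = (line) => Number(line.replace(targetLines, "$1").replace(/\D/g, ""));

// Lines that do not end in "... copies.<br />" are simply not returned by match().
const sorted = (html.match(targetLines) || []).sort((a, b) => copyCount(a) - copyCount(b));
console.log(sorted.join("\n"));
```

The lines come out in ascending order (5,500, then 20,000, then 100,000); swap a and b in the comparator for descending order.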
<OLD ANSWER>
Once you've extracted just the text you want to look at (using DOM/jQuery), you can then pass it to the following function, which will put the relevant information into a format that can then be sorted:
function makeSortable(Text)
{
// Mark sortable lines and put number before main content.
Text = Text.replace
( /^(.*)([\d,]+) copies\.<br \/>/gm
, "SORT ME$2 $1"
);
// Remove anything not marked for sorting.
Text = Text.replace( /^(?!SORT ME).*$/gm , '' );
// Remove blank lines.
Text = Text.replace( /\n{2,}/g , '\n' );
// Remove sort token.
Text = Text.replace( /SORT ME/g , '' );
return Text;
}
You'll then need a sort function to ensure that the numbers are sorted correctly (the standard JS array.sort method will sort on text, and put 100,000 before 20,000).
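A minimal sketch of such a sort function, using made-up sample values, next to what the default text sort does with the same input:

```javascript
const counts = ["20,000", "5,500", "100,000"];

// Default sort compares as text, character by character.
const asText = [...counts].sort();

// Numeric sort: strip the commas, convert to numbers, compare.
const toNumber = (text) => Number(text.replace(/,/g, ""));
const asNumbers = [...counts].sort((a, b) => toNumber(a) - toNumber(b));

console.log(asText);     // [ '100,000', '20,000', '5,500' ]
console.log(asNumbers);  // [ '5,500', '20,000', '100,000' ]
```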
Oh, and here's a quick explanation of the regexes used here:
/^(.*)([\d,]+) copies\.<br \/>/gm
  /.../gm        a regex with global-match and multi-line modes
  ^              matches start of line
  (.*)           capture to $1, any char (except newline), zero or more times
  ([\d,]+)       capture to $2, any digit or comma, one or more times
   copies        literal text
  \.<br \/>      literal text, with . and / escaped (they would be special otherwise)
/^(?!SORT ME).*$/gm
  /.../gm        again, enable global and multi-line
  ^              match start of line
  (?!SORT ME)    a negative lookahead, fails the match if text 'SORT ME' is after it
  .*             any char (except newline), zero or more times
  $              end of line
/\n{2,}/g
  \n{2,}         a newline character, two or more times
</OLD ANSWER>
you can start with something like this (just copypaste into the firebug console)
// where are the things
var elem = document.getElementById("article").
getElementsByTagName("span")[1].
getElementsByTagName("span")[0];
// extract lines into array
var lines = []
elem.innerHTML.replace(/.+?\d+\s+copies\.\s*<br>/g,
function($0) { lines.push($0) });
// sort an array
// lines.sort(function(a, b) {
// var ma = a.match(/(\d+),(\d+)\s+copies/);
// var mb = b.match(/(\d+),(\d+)\s+copies/);
//
// return parseInt(ma[1] + ma[2]) -
// parseInt(mb[1] + mb[2]);
lines.sort(function(a, b) {
function getNum(p) {
return parseInt(
p.match(/([\d,]+)\s+copies/)[1].replace(/,/g, ""));
}
return getNum(a) - getNum(b);
})
// put it back
elem.innerHTML = lines.join("");
It's not clear to me what it is you're trying to do. When posting questions here, I encourage you to post (a part of) your actual data and clearly indicate what exactly you're trying to match.
But, I am guessing you know very little regex, in which case, why use regex at all? If you study the topic a bit, you will soon know that regex is not some magical tool that produces whatever it is you're thinking of. Regex cannot sort in any way. It simply matches text, that's all.
Have a look at this excellent on-line resource: http://www.regular-expressions.info/
And if after reading you think a regex solution to your problem is appropriate, feel free to elaborate on your question and I'm sure I, or someone else is able to give you a hand.
Best of luck.
