document.cookie.match inconsistent, sometimes extracting wrong cookie value - javascript

I have the following script (used in Google Tag Manager)
function() {
try {
var cml = document.cookie.match("comagic_visitor.+=.+%7C%7C.+(\\d{6})\;")[1];
if (cml !== undefined) {
return cml;
}
} catch(e) {}
return 'false';
}
It has to get the cookie value. Name of the cookie may change, but the first part of it always remain unchanged "_comagic_visitor".
For some reason when I use the code to get cookie value in console I get correct value:
PHPSESSID=3reongfce35dl150rbdkkllto0; region=2; region2=2; _gat_UA-XXXXXX-2=1; _ym_visorc_263098=w; _ga=GA1.3.26804606X.X431002649; _comagic_visitorTH17k=cASNWQ3N9mRZT8tSmUtTGs5IG9LaD7BPHtCCiEpq_fpSnSKGMcCsEG0kPVur16gH%7C%7C124972212; _comagic_sessionTH17k=203937260
VALUE: 972212
But using it with Tag Manager I get 937260 (which as you can see is from "_comagic_session" (last 6 digits).
Unfortunately I'm not good at debugging and my js skill is very bad to figure out how to fix this. Any ideas on what I have to fix?

Any ideas on what I have to fix?
You need to fix the regular expression used for matching. The argument in parenthesis in document.cookie.match() is a regular expression.
From MDN, document.cookie is a string of all the cookies, separated by ;. Since document.cookie is simply a String document.cookie.match() is simply calling String.match(). String.match(regexp) finds an Array of matches using the regular expression parameter regexp.
The regexp you are using is:
comagic_visitor.+=.+%7C%7C.+(\\d{6})\;
This regexp means a match must satisfy all of the following conditions:
comagic_visitor begin with comagic_visitor
.+ followed by one or more other characters (any chars except newline and a couple others). This can be dangerous
= followed by =
.+ followed by one or more other characters Again, can be dangerous
%7C is a little dangerous depending on where you use it, and might be a literal %7C or could be translated into | which means "or"
.+ one or more other characters
(\\d{6}) the parenthesis extract this as the result of the match, and \d{6} is exactly 6 digits. It seems to be escaped by an extra \ which would be unnecessary if you used /regexp/ instead of "regexp"
\; is an escaped ;, which requires the final ;
Primary Issue: This regexp is much too loose and matches much more than desirable. .+ is greedy, in practice it matches as much as it can, and allows the regexp to match the beginning of the desired cookie, all the other cookies in the string, and the digits in some other cookie. Since the individual cookies in document.cookie are probably not guaranteed to be in any particular order, a greedy match can behave inconsistently. When the desired cookie is at the end of the string, you will get the correct result. At other times, you won't, when the .+ matches too much and there are 6 digits at some other cookie at the end that can be matched.
Alternative #1: Write a short function to split your cookie string on ; which will split into an array of strings and then feed each string into match separately and return the first match. This prevents the regexp from making a bad match in the full cookie string.
Alternative #2: Fix the regexp to match only what you want. You can use http://refiddle.com/ or a console window to test regular expressions. Possibly you can change the .+ to [^;]+ which changes any char except newline to any char except ; and that might fix it, because to match across multiple cookies in the full cookie string it has to be allowed to match a ; and if we deny that those false matches should be impossible.
Like this:
var cml = document.cookie.match(/comagic_visitor[^;]+(\d{6})\;/)[1];
This works for me in nodejs.
d = "PHPSESSID=3reongfce35dl150rbdkkllto0; region=2; region2=2; _gat_UA-XXXXXX-2=1; _ym_visorc_263098=w; _ga=GA1.3.26804606X.X431002649; _comagic_visitorTH17k=cASNWQ3N9mRZT8tSmUtTGs5IG9LaD7BPHtCCiEpq_fpSnSKGMcCsEG0kPVur16gH%7C%7C124972212; _comagic_sessionTH17k=203937260";
r = /comagic_visitor[^;]+(\d{6})\;/
d.match(r)[1]
---> '972212'

Related

Get the Opposite of a Regular Expression [duplicate]

Is it possible to write a regex that returns the converse of a desired result? Regexes are usually inclusive - finding matches. I want to be able to transform a regex into its opposite - asserting that there are no matches. Is this possible? If so, how?
http://zijab.blogspot.com/2008/09/finding-opposite-of-regular-expression.html states that you should bracket your regex with
/^((?!^ MYREGEX ).)*$/
, but this doesn't seem to work. If I have regex
/[a|b]./
, the string "abc" returns false with both my regex and the converse suggested by zijab,
/^((?!^[a|b].).)*$/
. Is it possible to write a regex's converse, or am I thinking incorrectly?
Couldn't you just check to see if there are no matches? I don't know what language you are using, but how about this pseudocode?
if (!'Some String'.match(someRegularExpression))
// do something...
If you can only change the regex, then the one you got from your link should work:
/^((?!REGULAR_EXPRESSION_HERE).)*$/
The reason your inverted regex isn't working is because of the '^' inside the negative lookahead:
/^((?!^[ab].).)*$/
^ # WRONG
Maybe it's different in vim, but in every regex flavor I'm familiar with, the caret matches the beginning of the string (or the beginning of a line in multiline mode). But I think that was just a typo in the blog entry.
You also need to take into account the semantics of the regex tool you're using. For example, in Perl, this is true:
"abc" =~ /[ab]./
But in Java, this isn't:
"abc".matches("[ab].")
That's because the regex passed to the matches() method is implicitly anchored at both ends (i.e., /^[ab].$/).
Taking the more common, Perl semantics, /[ab]./ means the target string contains a sequence consisting of an 'a' or 'b' followed by at least one (non-line separator) character. In other words, at ANY point, the condition is TRUE. The inverse of that statement is, at EVERY point the condition is FALSE. That means, before you consume each character, you perform a negative lookahead to confirm that the character isn't the beginning of a matching sequence:
(?![ab].).
And you have to examine every character, so the regex has to be anchored at both ends:
/^(?:(?![ab].).)*$/
That's the general idea, but I don't think it's possible to invert every regex--not when the original regexes can include positive and negative lookarounds, reluctant and possessive quantifiers, and who-knows-what.
You can invert the character set by writing a ^ at the start ([^…]). So the opposite expression of [ab] (match either a or b) is [^ab] (match neither a nor b).
But the more complex your expression gets, the more complex is the complementary expression too. An example:
You want to match the literal foo. An expression, that does match anything else but a string that contains foo would have to match either
any string that’s shorter than foo (^.{0,2}$), or
any three characters long string that’s not foo (^([^f]..|f[^o].|fo[^o])$), or
any longer string that does not contain foo.
All together this may work:
^[^fo]*(f+($|[^o]|o($|[^fo]*)))*$
But note: This does only apply to foo.
You can also do this (in python) by using re.split, and splitting based on your regular expression, thus returning all the parts that don't match the regex, how to find the converse of a regex
In perl you can anti-match with $string !~ /regex/;.
With grep, you can use --invert-match or -v.
Java Regexps have an interesting way of doing this (can test here) where you can create a greedy optional match for the string you want, and then match data after it. If the greedy match fails, it's optional so it doesn't matter, if it succeeds, it needs some extra data to match the second expression and so fails.
It looks counter-intuitive, but works.
Eg (foo)?+.+ matches bar, foox and xfoo but won't match foo (or an empty string).
It might be possible in other dialects, but couldn't get it to work myself (they seem more willing to backtrack if the second match fails?)

Find out the position where a regular expression failed

I'm trying to write a lexer in JavaScript for finding tokens of a simple domain-specific language. I started with a simple implementation which just tries to match subsequent regexps from the current position in a line to find out whether it matches some token format and accept it then.
The problem is that when something doesn't match inside such regexp, the whole regexp fails, so I don't know which character exactly caused it to fail.
Is there any way to find out the position in the string which caused the regular expression to fail?
INB4: I'm not asking about debugging my regexp and verifying its correctness. It is correct already, matches correct strings and drops incorrect ones. I just want to know programmatically where exactly the regexp stopped matching, to find out the position of a character which was incorrect in the user input, and how much of them were OK.
Is there some way to do it with just simple regexps instead of going on with implementing a full-blown finite state automaton?
Short answer
There is no such thing as a "position in the string that causes the
regular expression to fail".
However, I will show you an approach to answer the reverse question:
At which token in the regex did the engine become unable to match the
string?
Discussion
In my view, the question of the position in the string which caused the regular expression to fail is upside-down. As the engine moves down the string with the left hand and the pattern with the right hand, a regex token that matches six characters one moment can later, because of quantifiers and backtracking, be reduced to matching zero characters the next—or expanded to match ten.
In my view, a more proper question would be:
At which token in the regex did the engine become unable to match the
string?
For instance, consider the regex ^\w+\d+$ and the string abc132z.
The \w+ can actually match the entire string. Yet, the entire regex fails. Does it make sense to say that the regex fails at the end of the string? I don't think so. Consider this.
Initially, \w+ will match abc132z. Then the engine advances to the next token: \d+. At this stage, the engine backtracks in the string, gradually letting the \w+ give up the 2z (so that the \w+ now only corresponds to abc13), allowing the \d+ to match 2.
At this stage, the $ assertion fails as the z is left. The engine backtracks, letting the \w+, give up the 3 character, then the 1 (so that the \w+ now only corresponds to abc), eventually allowing the \d+ to match 132. At each step, the engine tries the $ assertion and fails. Depending on engine internals, more backtracking may occur: the \d+ will give up the 2 and the 3 once again, then the \w+ will give up the c and the b. When the engine finally gives up, the \w+ only matches the initial a. Can you say that the regex "fails on the "3"? On the "b"?
No. If you're looking at the regex pattern from left to right, you can argue that it fails on the $, because it's the first token we were not able to add to the match. Bear in mind that there are other ways to argue this.
Lower, I'll give you a screenshot to visualize this. But first, let's see if we can answer the other question.
The Other Question
Are there techniques that allow us to answer the other question:
At which token in the regex did the engine become unable to match the
string?
It depends on your regex. If you are able to slice your regex into clean components, then you can devise an expression with a series of optional lookaheads inside capture groups, allowing the match to always succeed. The first unset capture group is the one that caused the failure.
Javascript is a bit stingy on optional lookaheads, but you can write something like this:
^(?:(?=(\w+)))?(?:(?=(\w+\d+)))?(?:(?=(\w+\d+$)))?.
In PCRE, .NET, Python... you could write this more compactly:
^(?=(\w+))?(?=(\w+\d+))?(?=(\w+\d+$))?.
What happens here? Each lookahead builds incrementally on the last one, adding one token at a time. Therefore we can test each token separately. The dot at the end is an optional flourish for visual feedback: we can see in a debugger that at least one character is matched, but we don't care about that character, we only care about the capture groups.
Group 1 tests the \w+ token
Group 2 seems to test \w+\d+, therefore, incrementally, it tests the \d+ token
Group 3 seems to test \w+\d+$, therefore, incrementally, it tests the $ token
There are three capture groups. If all three are set, the match is a full success. If only Group 3 is not set (as with abc123a), you can say that the $ caused the failure. If Group 1 is set but not Group 2 (as with abc), you can say that the \d+ caused the failure.
For reference: Inside View of a Failure Path
For what it's worth, here is a view of the failure path from the RegexBuddy debugger.
You can use a negated character set RegExp,
[^xyz]
[^a-c]
A negated or complemented character set. That is, it matches anything
that is not enclosed in the brackets. You can specify a range of
characters by using a hyphen, but if the hyphen appears as the first
or last character enclosed in the square brackets it is taken as a
literal hyphen to be included in the character set as a normal
character.
index property of String.prototype.match()
The returned Array has an extra input property, which contains the
original string that was parsed. In addition, it has an index
property, which represents the zero-based index of the match in the
string.
For example to log index where digit is matched for RegExp /[^a-zA-z]/ in string aBcD7zYx
var re = /[^a-zA-Z]/;
var str = "aBcD7zYx";
var i = str.match(re).index;
console.log(i); // 4
Is there any way to find out the position in the string which caused the regular expression to fail?
No, there isn't. A Regex either matches or doesn't. Nothing in between.
Partial Expressions can match, but the whole pattern doesnt. So the engine always needs to evaluates the whole expression:
Take the String Hello my World and the Pattern /Hello World/. While each word will match individually, the whole Expression fails. You cannot tell whether Hello or World matched - independent, both do. Also the whitespace between them is available.

What does this JavaScript Regular Expression /[^\d.-] mean?

We had a developer here who had added following line of code to a web application:
var amount = newValue.replace(/[^\d.-]/g, '');
The particular line deals with amount values that a user may enter into a field.
I know the following about the regular expression:
that it replaces the matches with empty strings (i.e. removes them)
that /g is a flag that means to match all occurrences inside "newValue"
that the brackets [] denote a special group
that ^ means beginning of the line
that d means digits
Unfortunately I do not know enough to determine what kind of strings this should match. I checked with some web-based regex testers if it matches e.g. strings like 98.- and other alternatives with numbers but so far no luck.
My problem is that it seems to make IE very slow so I need to replace it with something else.
Any help on this would be appreciated.
Edit:
Thanks to all who replied. I tried not just Google but sites like myregextester.com, regular-expressions.info, phpliveregex.com, and others. My problem was misunderstanding the meaning of ^ and expecting that this required a numeric string like 44.99.
Inside the group, when the ^ is the first character, it works as a negation of the character matches. In other words, it's saying match any character that are not the ones in the group.
So this will mean "match anything that is not a digit, a period, or a hyphen".
The ^ character is a negation character.
var newValue = " x44x.-x ";
var amount = newValue.replace(/[^\d.-]/g, '');
console.log(amount);
will print
44.-
I suspect the developer maybe just wanted to remove trailing whitespaces? I would rather try to parse the string for numbers and remove anything else.

What does /;/ and /^ +/ denote?

I recently came across the statement :
var cookies = document.cookie.split(/;/);
and
var pair = allCookies[i].split("=", 2);
if (pair[0].replace(/^ +/, "") == "lastvisit")
In the first statement what does /;/ in the argument of split denote ?
In the second statement what does /^ +/ in the argument of replace denote ?
These are Regular Expressions.
Javascript supports them natively.
In this particular example:
.split(/;/) uses ; as the split character;
.replace(/^ +/, "") removes ("") any (+) leading (^) whitespace ().
In both examples, / surround or delimit the regular expression (or "regex"), informing Javascript that you're providing a regex.
Follow the links provided above for more information; regexes are broad in scope and worth learning.
Slashes delimit a regular expression, just like quotes delimit a string.
/;/ matches a semi-colon. Specifically:
var cookies = document.cookie.split(/;/);
Means we split the document.cookie string into an array, splitting it where there are semicolons. So it would take something like "a;b;c" and turn it into ["a", "b", "c"].
pair[0].replace(/^ +/, "")
Just strips all leading whitespace. It turns
" lastvisit"
into
"lastvisit"
The caret ^ means "beginning of line", it's followed by space, and the + means to repeat the space one or more times, as many as possible.
The // syntax denotes a regular expression (also known as a 'regex').
Regex is a syntax for searching and replacing strings.
The first example you gave is /;/. This is a very simply regex which just searches the string for semi-colons, and then splits it into an array based on the result. Since this is not using any special regex functionality, it could just as easily have been expressed as a simple string, ie split(";") (as has been done with the equal sign in your other example), without making any difference to the result.
The second example is /^ +/. This is more complex and requires a bit of knowledge of how regex works. In short, what it is doing is searching for leading spaces on a string, and removing them.
To learn more about regex, I recommend this site as a good starting point: http://www.regular-expressions.info/
Hope that helps.
I think that /^ +/ means: one or more no-" " characters

Javascript String pattern Validation

I have a string and I want to validate that string so that it must not contain certain characters like '/' '\' '&' ';' etc... How can I validate all that at once?
You can solve this with regular expressions!
mystring = "hello"
yourstring = "bad & string"
validRegEx = /^[^\\\/&]*$/
alert(mystring.match(validRegEx))
alert(yourstring.match(validRegEx))
matching against the regex returns the string if it is ok, or null if its invalid!
Explanation:
JavaScript RegEx Literals are delimited like strings, but with slashes (/'s) instead of quotes ("'s).
The first and last characters of the validRegEx cause it to match against the whole string, instead of just part, the carat anchors it to the beginning, and the dollar sign to the end.
The part between the brackets ([ and ]) are a character class, which matches any character so long as it's in the class. The first character inside that, a carat, means that the class is negated, to match the characters not mentioned in the character class. If it had been omited, the class would match the characters it specifies.
The next two sequences, \\ and \/ are backslash escaped because the backslash by itself would be an escape sequence for something else, and the forward slash would confuse the parser into thinking that it had reached the end of the regex, (exactly similar to escaping quotes in strings).
The ampersand (&) has no special meaning and is unescaped.
The remaining character, the kleene star, (*) means that whatever preceeded it should be matched zero or more times, so that the character class will eat as many characters that are not forward or backward slashes or ampersands, including none if it cant find any. If you wanted to make sure the matched string was non-empty, you can replace it with a plus (+).
I would use regular expressions.
See this guide from Mozillla.org. This article does also give a good introduction to regular expressions in JavaScript.
Here is a good article on Javascript validation. Remember you will need to validate on the server side too. Javascript validation can easily be circumvented, so it should never be used for security reasons such as preventing SQL Injection or XSS attacks.
You could learn regular expressions, or (probably simpler if you only check for one character at a time) you could have a list of characters and then some kind of sanitize function to remove each one from the string.
var myString = "An /invalid &string;";
var charList = ['/', '\\', '&', ';']; // etc...
function sanitize(input, list) {
for (char in list) {
input = input.replace(char, '');
}
return input
}
So then:
sanitize(myString, charList) // returns "An invalid string"
You can use the test method, with regular expressions:
function validString(input){
return !(/[\\/&;]/.test(input));
}
validString('test;') //false
You can use regex. For example if your string matches:
[\\/&;]+
then it is not valid. Look at:
http://www.regular-expressions.info/javascriptexample.html
You could probably use a regular expression.
As the others have answered you can solve this with regexp but remember to also check the value server-side. There is no guarantee that the user has JavaScript activated. Never trust user input!

Categories