My regex that should only accept latin-based characters is acting strangely


I've written a regex, to the best of my ability, that allows only the Latin character set, with an optional '-' that, if included, MUST be followed by at least one other Latin character.
My RegEx:
[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+(?:[-]?[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+)
I came to this after reading a few posts and rereading the manual to figure out the best way to approach this. This check is attached to a text field where a user types only their first name and then submits.
It works okay but there is certainly room for improvement.
Examples:
Tom // passes
Éve // passes
John-Paul // passes
2pac // passes and removes numbers (not really what I want)
John316 // passes and removes numbers (not really what I want)
What I would REALLY want to happen is a fail on those last two checks.
How would I revise it to get the outcome I'd like?

You need to anchor the regex by adding ^ at the start and $ at the end. That way no other characters are allowed anywhere in the input string.
I also suggest making the whole group optional instead of just the hyphen, i.e. moving the ? from after the hyphen to the end of the group. The hyphen is then required inside the group, which limits backtracking:
^[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+(?:-[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+)?$
See regex demo.
JS snippet:
console.log(/^[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+(?:-[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+)?$/.test('Éve')); //=> true
console.log(/^[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+(?:-[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+)?$/.test('John-Paul')); // => true
console.log(/^[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+(?:-[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+)?$/.test('John316')); // => false
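One caveat with the ranges above: \u00BF-\u1FFF also covers non-Latin scripts such as Greek and Cyrillic. If only Latin letters should pass, a Unicode property escape is one possible alternative. The sketch below is a swapped-in technique, not the pattern from the answer; it assumes an engine with ES2018 support, and isLatinName is a hypothetical helper name.

```javascript
// Sketch, not the answer's pattern: \p{Script=Latin} requires the `u` flag (ES2018+).
const latinName = /^\p{Script=Latin}+(?:-\p{Script=Latin}+)?$/u;

// The range-based pattern from the answer, kept here for comparison.
const rangeBased = /^[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+(?:-[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+)?$/;

// isLatinName is a hypothetical helper used only for illustration.
function isLatinName(value) {
  return latinName.test(value);
}

console.log(isLatinName('Tom'));           // true
console.log(isLatinName('\u00C9ve'));      // true (precomposed É)
console.log(isLatinName('John-Paul'));     // true
console.log(isLatinName('John316'));       // false
console.log(isLatinName('\u0416ak'));      // false (starts with a Cyrillic letter)
console.log(rangeBased.test('\u0416ak'));  // true (the ranges include Cyrillic)
```

Input typed with combining accents (E followed by U+0301) would probably need value.normalize('NFC') first, since combining marks are not themselves Latin-script characters.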

Related

Regex expression for exactly known pattern without "cutting into" the string not working

I am currently developing a web-application where I work with java, javascript, html, jquery, etc. and at some point I need to check that whether an input matches a known pattern and only proceed if it is true.
The pattern should be [at least one but max 3 numbers between 0-9]/[exactly 4 numbers between 0-9], so the only acceptable variations should be like
1/2014 or 23/2015 or 123/2016.
and nothing else, and I CANNOT accept something like 1234/3012 or anything else, and this is my problem right here, it accepts everything in which it can find the above pattern, so like from 12345/6789 it accepts and saves 345/6789.
I am a total newbie with regex, so I checked out http://regexr.com and this is the code I have in my javascript:
$.validator.addMethod("hatarozat", function(value, element) {
return (this.optional(element) || /[0-9]{1,3}(?:\/)[0-9]{4}/i.test(value));
}, "Hibás határozat szám!");
So this is my regex: /[0-9]{1,3}(?:\/)[0-9]{4}/i
which I built up using the above website. What could be the problem, or how can I achieve what I described? I tried /^[0-9]{1,3}(?:\/)[0-9]{4}$/i but this doesn't seem to work. Please, can anyone help me? I have everything else done and am getting pretty stressed over something that looks so simple yet I cannot solve. Thank you!

Your last regex, with the anchors (^ and $), is correct. What prevents your code from working is the this.optional(element) || part. That call is probably returning true, and since || short-circuits as soon as its first operand is true, the whole expression returns true, the regex is never checked, and no error is shown.
So, use
return /^[0-9]{1,3}\/[0-9]{4}$/.test(value);
Note you do not need the (?:...) with \/ as the grouping does not do anything important here and is just redundant. The anchors are important, since you want the whole string to match the pattern (and ^ anchors the regex at the start of the string and $ does that at the end of the string.)

You need to use the following special characters in your regex:
^ and $
or \b
so either of these two regexes will work:
/\b[0-9]{1,3}(?:\/)[0-9]{4}\b/i;
or
/^[0-9]{1,3}(?:\/)[0-9]{4}$/i
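To see how the variants discussed in this thread differ, here is a small comparison outside of jQuery Validate. The check helper and the sample inputs are made up for illustration.

```javascript
// The three variants from this thread, without the redundant (?:...) group.
const unanchored = /[0-9]{1,3}\/[0-9]{4}/;
const wordBounded = /\b[0-9]{1,3}\/[0-9]{4}\b/;
const anchored = /^[0-9]{1,3}\/[0-9]{4}$/;

// check is a hypothetical helper that runs one input through all three.
function check(value) {
  return {
    unanchored: unanchored.test(value),
    wordBounded: wordBounded.test(value),
    anchored: anchored.test(value),
  };
}

console.log(check('123/2016'));      // all three true
console.log(check('12345/6789'));    // only unanchored is true (it finds "345/6789")
console.log(check('ref 12/2015 x')); // unanchored and wordBounded true, anchored false
```

So \b stops the pattern from cutting into a longer number, but it still accepts the pattern embedded in other text. For validating a whole field, ^ and $ are the stricter choice.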

Capturing optional part of URL with RegExp

While writing an API service for my site, I realized that String.split() won't do it much longer, and decided to try my luck with regular expressions. I have almost done it but I can't find the last bit. Here is what I want to do:
The URL represents a function call:
/api/SECTION/FUNCTION/[PARAMS]
This last part, including the slash, is optional. Some functions display a JSON reply without having to receive any arguments. Example: /api/sounds/getAllSoundpacks prints a list of available sound packs. Though, /api/sounds/getPack/8Bit prints the detailed information.
Here is the expression I have tried:
req.url.match(/\/(.*)\/(.*)\/?(.*)/);
What am I missing to make the last part optional - or capture it in whole?

This will capture everything after FUNCTION/ in your URL, independent of the appearance of any further / after FUNCTION/:
FUNCTION\/(.+)$
The RegExp will not match if there is no part after FUNCTION.

This regex should work by making last slash and part after optional:
/^\/[^/]*\/[^/]*(?:\/.*)?$/
This matches all of these strings:
/api/SECTION/FUNCTION/abc
/api/SECTION
/api/SECTION/
/api/SECTION/FUNCTION

Your pattern /(.*)/(.*)/?(.*) was almost correct, it's just a bit too short - it allows 2 or 3 slashes, but you want to accept anything with 3 or 4 slashes. And if you want to capture the last (optional) slash AND any text behind it as a whole, you simply need to create a group around that section and make it optional:
/.*/.*/.*(?:/.+)?
should do the trick.
Demo. (The pattern looks different because multiline mode is enabled, but it still works. It's also a little "better" because it won't match garbage like "///".)
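Putting the suggestions in this thread together, one possible sketch is below. It assumes the /api/ prefix is fixed and that SECTION and FUNCTION never contain a slash; parseApiUrl and the shape of its return value are hypothetical names chosen for illustration.

```javascript
// Groups: 1 = SECTION, 2 = FUNCTION, 3 = optional PARAMS (everything after the third slash).
const apiPattern = /^\/api\/([^/]+)\/([^/]+)(?:\/(.*))?$/;

// parseApiUrl is a hypothetical helper; it returns null when the URL does not fit the scheme.
function parseApiUrl(url) {
  const m = apiPattern.exec(url);
  if (m === null) return null;
  const [, section, fn, params] = m;
  // An unmatched optional group comes back as undefined; normalize it to null.
  return { section, fn, params: params === undefined ? null : params };
}

console.log(parseApiUrl('/api/sounds/getAllSoundpacks'));
// { section: 'sounds', fn: 'getAllSoundpacks', params: null }
console.log(parseApiUrl('/api/sounds/getPack/8Bit'));
// { section: 'sounds', fn: 'getPack', params: '8Bit' }
console.log(parseApiUrl('/api/sounds'));
// null
```

Using [^/]+ instead of .* for the first two groups keeps each group inside its own path segment, so the optional third group is the only one that can swallow further slashes.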

JavaScript Repetitive RegEx

Consider my few input strings.
http://local.app.com/local/frontend/v12/#/abcde/
http://local.app.com/local/frontend/v12/#/abcde/!/fghij/
http://local.app.com/local/frontend/v12/#/abcde/!/ghijk/!/klmno/
I have written this regex which works fine for input string 1.
(?:([a-zA-Z0-9.://_]*)(/#/(?=([a-zA-Z0-9]{5})/)))
Output:
http://local.app.com/local/frontend/v12/#/,http://local.app.com/local/frontend/v12,/#/,abcde
But when I extend it to support repetitive !/.../ place holder for input string 1,2 and 3, it doesn't work and gives empty string rather than token.
(?:([a-zA-Z0-9.://_]*)(/#/(?=([a-zA-Z0-9]{5})/))(!/(?=([a-zA-Z0-9]{5})/))*)
Output:
http://local.app.com/local/frontend/v12/#/,http://local.app.com/local/frontend/v12,/#/,abcde,,

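(?=...) is a lookahead: it only asserts a position, defined by what you specify after the ?=.
It does not consume whatever matches the specification inside it. In your second regex the match position is therefore still right after /#/ when (!/...)* is tried, the next character is not "!", and the group matches zero times, which is why you get empty strings.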
Try
(.+?#(/[a-zA-Z0-9]{5}/)(!/([a-zA-Z0-9]{5})/)*)
(hope I didn't make a typo, can't test it right now.)
This should capture the complete input, but the various captures inside give you access to the captured "tokens".
You can, in addition, give names to the various captures inside, making it easier to identify them in the match:
(.+?#(/(?<tokenFirst>[a-zA-Z0-9]{5})/)(!/(?<tokenMore>[a-zA-Z0-9]{5})/)*)
Success

Hope this will clarify my comment and earlier remarks.
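One thing to keep in mind with the patterns above: in JavaScript a repeated capture group such as (...)* only keeps the text from its last iteration, so it cannot return every token on its own. A possible workaround, sketched below, is to isolate the part after the # and collect the tokens with a global regex. extractTokens is a hypothetical helper, and String.prototype.matchAll needs an ES2020-capable engine.

```javascript
// extractTokens is a hypothetical helper: it returns every 5-character token
// that follows "#/" or "!/" in the fragment part of the URL.
function extractTokens(url) {
  const hashIndex = url.indexOf('#/');
  if (hashIndex === -1) return [];
  // Everything after the '#', e.g. "/abcde/!/fghij/".
  const fragment = url.slice(hashIndex + 1);
  // The first token sits at the very start; later tokens are preceded by "!".
  const tokenPattern = /(?:^|!)\/([a-zA-Z0-9]{5})\//g;
  return Array.from(fragment.matchAll(tokenPattern), (m) => m[1]);
}

const base = 'http://local.app.com/local/frontend/v12/#/abcde/';
console.log(extractTokens(base));                      // [ 'abcde' ]
console.log(extractTokens(base + '!/fghij/'));         // [ 'abcde', 'fghij' ]
console.log(extractTokens(base + '!/ghijk/!/klmno/')); // [ 'abcde', 'ghijk', 'klmno' ]
```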

Negative lookahead Regular Expression

I want to match all strings ending in ".htm" unless it ends in "foo.htm". I'm generally decent with regular expressions, but negative lookaheads have me stumped. Why doesn't this work?
/(?!foo)\.htm$/i.test("/foo.htm"); // returns true. I want false.
What should I be using instead? I think I need a "negative lookbehind" expression (if JavaScript supported such a thing, which I know it doesn't).

The problem is pretty simple really. This will do it:
/^(?!.*foo\.htm$).*\.htm$/i.test("/foo.htm"); // returns false

What you are describing (your intention) is a negative look-behind, and Javascript has no support for look-behinds.
Look-aheads look forward from the character at which they are placed — and you've placed it before the .. So, what you've got is actually saying "anything ending in .htm as long as the first three characters starting at that position (.ht) are not foo" which is always true.
Usually, the substitute for negative look-behinds is to match more than you need, and extract only the part you actually do need. This is hacky, and depending on your precise situation you can probably come up with something else, but something like this:
// Checks that the last 3 characters before the dot are not foo:
/(?!foo).{3}\.htm$/i.test("/foo.htm"); // returns false
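To make the difference concrete, here is a small comparison of the question's regex, the anchored lookahead from the first answer, and the .{3} workaround above. The sample paths and the compare helper are made up for illustration.

```javascript
const original = /(?!foo)\.htm$/i;              // from the question
const anchored = /^(?!.*foo\.htm$).*\.htm$/i;   // lookahead applied from the start of the string
const threeChars = /(?!foo).{3}\.htm$/i;        // checks the 3 characters before the dot

// compare is a hypothetical helper that runs one path through all three patterns.
function compare(path) {
  return {
    original: original.test(path),
    anchored: anchored.test(path),
    threeChars: threeChars.test(path),
  };
}

console.log(compare('/foo.htm'));  // original: true,  anchored: false, threeChars: false
console.log(compare('/bar.htm'));  // original: true,  anchored: true,  threeChars: true
console.log(compare('a.htm'));     // original: true,  anchored: true,  threeChars: false
console.log(compare('/foo.html')); // all false
```

Note the third row: the .{3} version needs at least three characters before ".htm", so it also rejects very short names.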

As mentioned, JavaScript does not support negative look-behind assertions.
But you could use a workaround:
/(foo)?\.htm$/i.test("/foo.htm") && RegExp.$1 != "foo";
This will match everything that ends with .htm but it will store "foo" into RegExp.$1 if it matches foo.htm, so you can handle it separately.

Like Renesis mentioned, "lookbehind" is not supported in JavaScript, so maybe just use two regexps in combination:
!/foo\.htm$/i.test(teststring) && /\.htm$/i.test(teststring)

This answer probably arrives a little later than necessary, but I'll leave it here in case someone runs into the same issue now (7 years, 6 months after this question was asked).
Lookbehinds are now part of the ECMAScript 2018 standard and are supported at least in the latest version of Chrome. However, you can solve the puzzle with or without them.
A solution with negative lookahead:
let testString = `html.htm app.htm foo.tm foo.htm bar.js 1to3.htm _.js _.htm`;
testString.match(/\b(?!foo)[\w-.]+\.htm\b/gi);
> (4) ["html.htm", "app.htm", "1to3.htm", "_.htm"]
A solution with negative lookbehind:
testString.match(/\b[\w-.]+(?<!foo)\.htm\b/gi);
> (4) ["html.htm", "app.htm", "1to3.htm", "_.htm"]
A solution with (technically) positive lookahead:
testString.match(/\b(?=[^f])[\w-.]+\.htm\b/gi);
> (4) ["html.htm", "app.htm", "1to3.htm", "_.htm"]
etc.
On this test string all these RegExps give the same result, although they are not strictly equivalent: the negative lookahead rejects names that start with "foo", the lookbehind rejects names that end with "foo" right before ".htm", and the last one rejects any name starting with "f". The message they pass to the JS engine is roughly the following.
Please, find in this string all sequences of characters that are:
Separated from other text (like words);
Consist of one or more letter(s) of the English alphabet, underscore(s), hyphen(s), dot(s) or digit(s);
End with ".htm";
Apart from that, the part of the sequence before ".htm" could be anything but "foo".

String.prototype.endsWith (ES6)
console.log( /* !(not)endsWith */
!"foo.html".endsWith("foo.htm"), // true
!"barfoo.htm".endsWith("foo.htm"), // false (here you go)
!"foo.htm".endsWith("foo.htm"), // false (here you go)
!"test.html".endsWith("foo.htm"), // true
!"test.htm".endsWith("foo.htm") // true
);

You could emulate the negative lookbehind with something like
/^(.|..|.*[^f]..|.*f[^o].|.*fo[^o])\.htm$/ (the ^ anchor matters: without it the regex can match just "oo.htm" inside "foo.htm"), but a programmatic approach would be better.

Can someone tell me the purpose of the second capture group in the jQuery rts regular expression?

In Jeff Roberson's jQuery Regular Expressions Review he proposes changing the rts regular expression in jQuery's ajax.js from /(\?|&)_=.*?(&|$)/ to /([?&])_=[^&\r\n]*(&?)/. In both versions, what is the purpose of the second capture group? The code does a replacement of the current random timestamp with a new random timestamp:
var ts = jQuery.now();
// try replacing _= if it is there
var ret = s.url.replace(rts, "$1_=" + ts + "$2");
Doesn't it only replace what it matches? I am thinking this does the same:
var ret = s.url.replace(/([?&])_=[^&\r\n]*/, "$1_=" + ts);
Can someone explain the purpose of the second capture group?

It's to pick up the next delimiter in the query string on the URL, so that it still works properly as a query string. Thus if the url is
http://foo.bar/what/ever?blah=blah&_=12345&zebra=banana
then the second group picks up the "&" before "zebra".
That's an awesome blog post by the way and everybody should read it.
edit — now that I think about it, I'm not sure why it's necessary to bother with replacing that second delimiter. In the "fixed" expression, that greedy * will pick up the whole parameter value and stop at the delimiter (or the end of the string) anyway.

I think you're right. It was needed in the original because matching the ampersand or end-of-string was how the .*? knew when to stop. In Jeff's version that's no longer necessary.
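A quick way to check this reasoning is to run the variants against the example URL. The timestamp value below is arbitrary, and the variable names are made up for illustration.

```javascript
const url = 'http://foo.bar/what/ever?blah=blah&_=12345&zebra=banana';
const ts = 67890; // arbitrary stand-in for jQuery.now()

// Jeff's version, with and without the second capture group.
const fixedWithGroup = url.replace(/([?&])_=[^&\r\n]*(&?)/, '$1_=' + ts + '$2');
const fixedWithoutGroup = url.replace(/([?&])_=[^&\r\n]*/, '$1_=' + ts);

// The original lazy version, with and without the second capture group.
const lazyWithGroup = url.replace(/(\?|&)_=.*?(&|$)/, '$1_=' + ts + '$2');
const lazyWithoutGroup = url.replace(/(\?|&)_=.*?/, '$1_=' + ts);

console.log(fixedWithGroup);    // http://foo.bar/what/ever?blah=blah&_=67890&zebra=banana
console.log(fixedWithoutGroup); // http://foo.bar/what/ever?blah=blah&_=67890&zebra=banana
console.log(lazyWithGroup);     // http://foo.bar/what/ever?blah=blah&_=67890&zebra=banana
console.log(lazyWithoutGroup);  // http://foo.bar/what/ever?blah=blah&_=6789012345&zebra=banana
```

In the lazy version, dropping the trailing group lets .*? match nothing, so the old value is left behind right after the new one. With [^&\r\n]* the value is consumed either way, and the second group makes no difference to the result.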

As the author of the article I can't tell you the reason for the second capture group. My intent with the article was to take existing regexes and simply make them more efficient - i.e. they should all match the same text - just do it faster. Unfortunately I did not have time to delve deeply into the code to see exactly how each and every one of them was being used. I assumed that the capture group for this one was there for a reason so I did not mess with it.
