JS Regexp to exclude forward slash after .com in url - javascript

I have this URL, for example https://www.example.com/filters/test.jpg, and in JS I want to retrieve this part: filters/test.jpg.
I am using match(), but the first element of the match array is /filters/test.jpg.
This is my regexp: /(?!com)\/((\w+)\/(.*))/
What am I missing to remove the forward slash / from the match array?

If your interest is in regex itself rather than just the result, how about this expression?
(?<=.+\.com\/).+
This uses a positive lookbehind and matches everything that follows any amount of text ending in ".com/". Note the backslashes escaping the period and the forward slash. Lookbehind needs an engine that implements ES2018 or later. If you want more specificity, you can do the same thing with the word group and second slash from your original regex:
(?<=\.com\/)((\w+)\/(.*))
UPDATE: As requested, a note on positive vs. negative lookahead/lookbehind: a positive lookahead tells the engine to "look for X, but match only if followed by Y." A negative lookahead says "look for X, but match only if not followed by Y." In your case, you want a positive lookbehind, because that will "look for X, but match only if preceded by Y." The (?!com) you were trying is a negative lookahead: it only checks that the text at that position does not start with "com", which is always true right before a "/", so it never removes the slash from the match. A negative lookbehind would be a mistake here too, since it matches a pattern only if something is not before it. For more information, see https://javascript.info/regexp-lookahead-lookbehind
If your goal is just to get the result, I think using the URL object in JavaScript (as in the comment) is actually better than regex because it's more tuned to the specific problem. See https://dev.to/attacomsian/introduction-to-javascript-url-object-27hn.
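As a rough sketch of the URL-object approach (assuming an environment where the standard URL constructor is available, such as a modern browser or Node.js; the variable names are only illustrative):

```javascript
// Parse the URL instead of pattern-matching it.
const parsed = new URL('https://www.example.com/filters/test.jpg');

// pathname always starts with "/", so drop the first character.
const path = parsed.pathname.slice(1);

console.log(path); // "filters/test.jpg"
```

This works regardless of the top-level domain, which the ".com"-based patterns above do not.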

For newer JS engines (with lookbehind support): /(?<=\/)(\w+)\/.*/
For older JS engines: /\b(?!(?:com|net|org|uk)\/)(\w+)\/.*/
The best way, though, is to use a capture group and read the result from the match array: /\/((\w+)\/.*)/
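A minimal sketch of the capture-group variant, which needs no lookaround at all (the variable names are illustrative, not from the answer):

```javascript
const href = 'https://www.example.com/filters/test.jpg';

// m[0] is the whole match including the leading slash;
// group 1 is everything after that slash, group 2 the first path segment.
const m = href.match(/\/((\w+)\/.*)/);

const wanted = m ? m[1] : null;       // "filters/test.jpg"
const firstSegment = m ? m[2] : null; // "filters"
console.log(wanted, firstSegment);
```

The slash is still part of m[0], but it sits outside the parentheses, so m[1] does not contain it.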

Related

How to match all words starting with dollar sign but not slash dollar

I want to match all words that start with a dollar sign, but not those where the dollar sign is preceded by a backslash.
I have already tried a few regexes:
(?:(?!\\)\$\w+)
\\(\\?\$\w+)\b
String
$10<i class="">$i01d</i>\$id
Expected result
*$10*
*$i01d*
but not this
*$id*
After finding all the expected matches, I want to replace them using my object.
One option is to eliminate escape sequences first, and then match the cleaned-up string:
s = String.raw`$10<i class="">$i01d</i>\$id`
found = s.replace(/\\./g, '').match(/\$\w+/g)
console.log(found)
The big problem here is that you need a negative lookbehind, which JavaScript did not support before ES2018 (and older engines still lack). It's possible to emulate it crudely, but I will offer an alternative which, while not great, will work:
var input = '$10<i class="">$i01d</i>\\$id';
var regex = /\b\w+\b\$(?!\\)/g;
// Sample implementation of a string reversal function. There are better implementations out there.
function reverseString(string) {
  return string.split("").reverse().join("");
}
var reverseInput = reverseString(input);
var matches = reverseInput
  .match(regex)
  .map(reverseString);
console.log(matches);
It is not elegant but it will do the job. Here is how it works:
JavaScript does support a positive lookahead ((?=)) and a negative lookahead ((?!)). Since a negative lookahead is the mirror image of a negative lookbehind, you can reverse the string and reverse the regex, which will match exactly what you want. Since all the matches are going to be in reverse, you also need to reverse them back to the original.
It is not elegant, as I said, since it does a lot of string manipulations but it does produce exactly what you want.
See this in action on Regex101
Regex explanation: Normally, "match x as long as it's not preceded by y" is expressed as (?<!y)x, so in your case the regex will be
/(?<!\\)\$\b\w+\b/g
demonstration (not JavaScript)
where
(?<!\\) //do not match a preceding "\"
\$ //match literal "$"
\b //word boundary
\w+ //one or more word characters
\b //second word boundary, hence making the match a word
When the input is reversed, all the tokens must be reversed as well in order to match. Furthermore, the negative lookbehind gets inverted into a negative lookahead of the form x(?!y), so the new regular expression is
/\b\w+\b\$(?!\\)/g;
This is more difficult than it appears at first blush. How like Regular Expressions!
If you have look-behind available, you can try:
/(?<!\\)\$\w+/g
This is NOT available in JS engines that predate ES2018. Alternatively, you could specify a boundary that you know exists and use a capture group like:
/\s(\$\w+)/g
Unfortunately, you cannot rely on word boundaries via \b because there's no such boundary before '\'.
Also, this is a cool site for testing your regex expressions. And this explains the word boundary anchor.
If you're using a language that supports negative lookbehind assertions, you can use something like this.
(?<!\\)\$\w+
I think this is the cleanest approach, but unfortunately it's not supported by all languages.
This is a hackier implementation that may work as well.
(?:(^\$\w+)|[^\\](\$\w+))
This matches either:
A literal $ at the beginning of a line followed by one or more word characters, or
A literal $ that is preceded by any character except a backslash, followed by one or more word characters.
Here is a working example.
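A sketch of how the alternation above might be used from JavaScript, assuming an engine with String.prototype.matchAll (ES2020). The wanted text lands in group 1 or group 2 depending on which branch matched, so the code picks whichever is set:

```javascript
// String.raw keeps the backslash literal, as in the question's input.
const input = String.raw`$10<i class="">$i01d</i>\$id`;
const re = /(?:(^\$\w+)|[^\\](\$\w+))/g;

// Group 1 is set for a match at the very start, group 2 everywhere else.
const words = Array.from(input.matchAll(re), (m) => m[1] || m[2]);

console.log(words); // [ '$10', '$i01d' ]
```

Because the second branch consumes the character before the $, two dollar-words placed directly next to each other would not both be found by this pattern.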

Regular Expression to find a pattern and replace just part of it

I want to know how I can use RegEx to find a pattern and replace just a part of it in JavaScript.
Let's say, for example, that I want to replace a pattern like -foo, but only if it has a - after it, as in -foo-, and replace just the -foo part.
Can someone please explain in detail the RegEx construction to achieve it?
I did not find a detailed explanation of it here, just codes with a minimum explanation.
You need to use a positive look-ahead (?=-) that will check the existence of - after -foo but will not consume it:
var s = "-foo- -foo";
alert(s.replace(/-foo(?=-)/g, 'REPLACED'));
You can read more about look-aheads (and look-behinds, which JS regex engines did not support before ES2018) at regular-expressions.info.
The main idea is that the text is checked for presence or absence of some patterns defined in the look-around, and based on that either allow or fail the match. They can actually be used efficiently together with anchors, but this is not the case here.
Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions... lookaround actually matches characters, but then gives up the match, returning only the result: match or no match... They do not consume characters in the string, but only assert whether a match is possible or not.
As the first poster said, you need a lookahead (?=...) to check for one or more additional characters. In this situation, the character you need to look for is -, so your pattern would end with the lookahead (?=-).
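To make the "does not consume" point concrete, here is a small sketch comparing the lookahead with a version that consumes the trailing - in a capture group and writes it back (the variable names are illustrative):

```javascript
const s = '-foo- -foo';

// Lookahead: the trailing "-" is checked but is not part of the match,
// so it survives the replacement untouched.
const viaLookahead = s.replace(/-foo(?=-)/g, 'REPLACED');

// Without lookahead: consume the "-" in group 1 and put it back with $1.
const viaCapture = s.replace(/-foo(-)/g, 'REPLACED$1');

console.log(viaLookahead); // "REPLACED- -foo"
console.log(viaCapture);   // "REPLACED- -foo"
```

The two give the same result for this input, but not in general: with overlapping input such as -foo-foo-, the capture version uses up the middle - and misses the second -foo, while the lookahead version replaces both.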

Exact string negation in javascript regexpressions

This is more a question to satisfy my curiosity than a real need for help, but I will appreciate your help equally as it is driving me nuts.
I am trying to negate an exact string using Javascript regular expressions, the idea is to exclude URL that include the string "www". For instance this list:
http://www.example.org/
http://status.example.org/index.php?datacenter=1
https://status.example.org/index.php?datacenter=2
https://www.example.org/Insights
http://www.example.org/Careers/Job_Opportunities
http://www.example.org/Insights/Press-Releases
For that I can successfully use the following regex:
/^http(|s):..[^w]/g
This works correctly, but while I can do a positive match I cannot do something like:
/[^www]/g or /[^http]/g
to exclude lines that include the exact string www or http. I have tried the infamous "negative lookahead" like this:
/*(?: (?!www).*)/g
But this doesn't work either, OR I cannot test it online; it doesn't work in Notepad++ either.
If I were using Perl, Grep, Awk or Textwrangler I would have simply done:
!www OR !http
And this would have done the job.
So, my question is obviously: What would be the correct way to do such a thing in JavaScript? Does this depend on the regex parser (as I seem to understand)?
Thanks for any answer ;)
You need to add a negative lookahead at the start.
^(?!.*\bwww\.)https?:\/\/.*
DEMO
(?!.*\bwww\.) is a negative lookahead asserting that the string we are going to match doesn't contain www. anywhere. \b is a word boundary, which matches between a word character and a non-word character. Without \b, www\. in the regex would also match the www. in foowww.
To negate 'www' at every position in the input string:
var a = [
'http://www.example.org/',
'http://status.example.org/index.php?datacenter=1',
'https://status.example.org/index.php?datacenter=2',
'https://www.example.org/Insights',
'http://www.example.org/Careers/Job_Opportunities',
'http://www.example.org/Insights/Press-Releases'
];
a.filter(function(x){ return /^((?!www).)*$/.test(x); });
So at every position, check that 'www' doesn't match, and then match any character (.).
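As a quick check, the two patterns from the answers above can be run side by side on the question's list. This is only a sketch; the variable names are made up for illustration:

```javascript
const urls = [
  'http://www.example.org/',
  'http://status.example.org/index.php?datacenter=1',
  'https://status.example.org/index.php?datacenter=2',
  'https://www.example.org/Insights',
  'http://www.example.org/Careers/Job_Opportunities',
  'http://www.example.org/Insights/Press-Releases',
];

// One negative lookahead at the start that scans the whole string.
const lookaheadAtStart = /^(?!.*\bwww\.)https?:\/\/.*/;

// A negative lookahead repeated before every single character.
const lookaheadEverywhere = /^((?!www).)*$/;

const keptA = urls.filter((u) => lookaheadAtStart.test(u));
const keptB = urls.filter((u) => lookaheadEverywhere.test(u));

console.log(keptA); // the two status.example.org URLs
console.log(keptB); // the same two URLs
```

Neither regex uses the g flag, so test() keeps no state between calls and is safe to reuse inside filter().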

Find out the position where a regular expression failed

I'm trying to write a lexer in JavaScript for finding tokens of a simple domain-specific language. I started with a simple implementation which just tries to match subsequent regexps from the current position in a line to find out whether it matches some token format and accept it then.
The problem is that when something doesn't match inside such regexp, the whole regexp fails, so I don't know which character exactly caused it to fail.
Is there any way to find out the position in the string which caused the regular expression to fail?
INB4: I'm not asking about debugging my regexp and verifying its correctness. It is correct already, matches correct strings and drops incorrect ones. I just want to know programmatically where exactly the regexp stopped matching, to find out the position of the character that was incorrect in the user input, and how many of the characters before it were OK.
Is there some way to do it with just simple regexps instead of going on with implementing a full-blown finite state automaton?
Short answer
There is no such thing as a "position in the string that causes the
regular expression to fail".
However, I will show you an approach to answer the reverse question:
At which token in the regex did the engine become unable to match the
string?
Discussion
In my view, the question of the position in the string which caused the regular expression to fail is upside-down. As the engine moves down the string with the left hand and the pattern with the right hand, a regex token that matches six characters one moment can later, because of quantifiers and backtracking, be reduced to matching zero characters the next—or expanded to match ten.
In my view, a more proper question would be:
At which token in the regex did the engine become unable to match the
string?
For instance, consider the regex ^\w+\d+$ and the string abc132z.
The \w+ can actually match the entire string. Yet, the entire regex fails. Does it make sense to say that the regex fails at the end of the string? I don't think so. Consider this.
Initially, \w+ will match abc132z. Then the engine advances to the next token: \d+. At this stage, the engine backtracks in the string, gradually letting the \w+ give up the 2z (so that the \w+ now only corresponds to abc13), allowing the \d+ to match 2.
At this stage, the $ assertion fails as the z is left. The engine backtracks, letting the \w+ give up the 3 character, then the 1 (so that the \w+ now only corresponds to abc), eventually allowing the \d+ to match 132. At each step, the engine tries the $ assertion and fails. Depending on engine internals, more backtracking may occur: the \d+ will give up the 2 and the 3 once again, then the \w+ will give up the c and the b. When the engine finally gives up, the \w+ only matches the initial a. Can you say that the regex "fails on the 3"? On the "b"?
No. If you're looking at the regex pattern from left to right, you can argue that it fails on the $, because it's the first token we were not able to add to the match. Bear in mind that there are other ways to argue this.
Further down, I'll give you a screenshot to visualize this. But first, let's see if we can answer the other question.
The Other Question
Are there techniques that allow us to answer the other question:
At which token in the regex did the engine become unable to match the
string?
It depends on your regex. If you are able to slice your regex into clean components, then you can devise an expression with a series of optional lookaheads inside capture groups, allowing the match to always succeed. The first unset capture group is the one that caused the failure.
JavaScript is a bit stingy on optional lookaheads, but you can write something like this:
^(?:(?=(\w+)))?(?:(?=(\w+\d+)))?(?:(?=(\w+\d+$)))?.
In PCRE, .NET, Python... you could write this more compactly:
^(?=(\w+))?(?=(\w+\d+))?(?=(\w+\d+$))?.
What happens here? Each lookahead builds incrementally on the last one, adding one token at a time. Therefore we can test each token separately. The dot at the end is an optional flourish for visual feedback: we can see in a debugger that at least one character is matched, but we don't care about that character, we only care about the capture groups.
Group 1 tests the \w+ token
Group 2 seems to test \w+\d+, therefore, incrementally, it tests the \d+ token
Group 3 seems to test \w+\d+$, therefore, incrementally, it tests the $ token
There are three capture groups. If all three are set, the match is a full success. If only Group 3 is not set (as with abc123a), you can say that the $ caused the failure. If Group 1 is set but not Group 2 (as with abc), you can say that the \d+ caused the failure.
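The same incremental idea can be sketched in JavaScript without optional lookaheads, by testing each cumulative prefix of the pattern as its own regex. This is a variation on the technique above, not the single-regex form, and the names steps and firstFailingToken are made up for illustration:

```javascript
// Cumulative prefixes of ^\w+\d+$, paired with the token each one adds.
const steps = [
  { token: '\\w+', re: /^\w+/ },
  { token: '\\d+', re: /^\w+\d+/ },
  { token: '$', re: /^\w+\d+$/ },
];

// Returns the first token that could not be added to the match,
// or null when the whole pattern matches.
function firstFailingToken(input) {
  const failed = steps.find(({ re }) => !re.test(input));
  return failed ? failed.token : null;
}

console.log(firstFailingToken('abc123a')); // "$"
console.log(firstFailingToken('abc'));     // "\d+"
console.log(firstFailingToken('abc123'));  // null
```

Like the capture-group version, this only works when the pattern can be sliced into clean components, and it reports a token in the regex, not a position in the string.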
For reference: Inside View of a Failure Path
For what it's worth, here is a view of the failure path from the RegexBuddy debugger.
You can use a negated character set RegExp,
[^xyz]
[^a-c]
A negated or complemented character set. That is, it matches anything
that is not enclosed in the brackets. You can specify a range of
characters by using a hyphen, but if the hyphen appears as the first
or last character enclosed in the square brackets it is taken as a
literal hyphen to be included in the character set as a normal
character.
index property of String.prototype.match()
The returned Array has an extra input property, which contains the
original string that was parsed. In addition, it has an index
property, which represents the zero-based index of the match in the
string.
For example, to log the index where the digit is matched for the RegExp /[^a-zA-Z]/ in the string aBcD7zYx:
var re = /[^a-zA-Z]/;
var str = "aBcD7zYx";
var i = str.match(re).index;
console.log(i); // 4
Is there any way to find out the position in the string which caused the regular expression to fail?
No, there isn't. A Regex either matches or doesn't. Nothing in between.
Parts of an expression can match even though the whole pattern doesn't, so the engine always needs to evaluate the whole expression:
Take the string Hello my World and the pattern /Hello World/. Each word would match on its own, yet the whole expression fails. You cannot single out Hello or World as the part that failed, because independently both match, and the whitespace between them is present in the string as well.
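For the lexer use case in the question, one way to get an error position is to stop relying on a single regex and match one token at a time with the sticky (y) flag, tracking the current offset yourself. The sketch below is not taken from any of the answers; the token table and the tokenize function are hypothetical and only illustrate the idea:

```javascript
// Hypothetical token table for illustration; a real DSL would have its own.
const tokenTypes = [
  { type: 'number', re: /\d+/y },
  { type: 'ident', re: /[A-Za-z_]\w*/y },
  { type: 'space', re: /\s+/y },
];

function tokenize(line) {
  const tokens = [];
  let pos = 0;
  while (pos < line.length) {
    let matched = false;
    for (const { type, re } of tokenTypes) {
      re.lastIndex = pos; // a sticky regex matches only at exactly this offset
      const m = re.exec(line);
      if (m) {
        tokens.push({ type, text: m[0], index: pos });
        pos = re.lastIndex;
        matched = true;
        break;
      }
    }
    // No token type matches here: pos is the offending character.
    if (!matched) return { tokens, errorIndex: pos };
  }
  return { tokens, errorIndex: -1 };
}

const result = tokenize('foo 42 #bar');
console.log(result.errorIndex); // 7
```

Everything before errorIndex was tokenized successfully, which answers "how many characters were OK" at token granularity; it still cannot say where inside a single token's regex the match broke down.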

Regex Positive Lookbehind on url segment

I am parsing a number from a URL string. The URL looks like:
https://www.myapi.com/player/?url=https%3A//myapi.com/users/11468859&color=788b78&auto_play=false&show_artwork=false
I would like to match the number between 'users/' and '&'. In this case '11468859'. So I am using a positive lookahead and lookbehind to accomplish this.
This is what I have so far:
(?<=users/)([0-9]*?)(?=\&)
This doesn't match anything. My lookbehind is wrong. So if I omit the lookbehind I can match on users/11468859
([0-9]*?)(?=\&) matches >> 'users/11468859'
How do I correctly create a positive lookbehind to match on users/?
Thanks!
Putting aside your lookbehind question for a moment, this regex works:
users/([0-9]+)
Debuggex Demo
The id is in capture group one.
In debuggex your lookbehind works fine but not in JavaScript:
(?<=users/)([0-9]*?)(?=\&)
Debuggex Demo
(You could also get away with just
(?<=users/)([0-9]*)
Debuggex Demo
since [0-9]* is greedy.)
However, as you're using JavaScript, I recommend the regex at the top of my answer.
If you're certain that the desired segment will be a series of digits immediately after users/, you don't need the lookahead. Also, I would recommend escaping any sort of slash: \/
(?<=users\/)([0-9]*?)
Also, you don't need to tell the regex not to be greedy unless you know it will run into other numbers. In fact, with nothing after it, the lazy [0-9]*? is satisfied by an empty match. I would also consider telling the regex that there must be numbers, so it won't match if they are missing.
Thus
([0-9]*?)
becomes
(\d+)
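A short sketch of the difference, assuming an engine with lookbehind support (ES2018 or later, for example current Node.js or an evergreen browser):

```javascript
const url = 'https://www.myapi.com/player/?url=https%3A//myapi.com/users/11468859&color=788b78&auto_play=false&show_artwork=false';

// Lazy quantifier with nothing after it: succeeds right away with an empty match.
const lazy = url.match(/(?<=users\/)([0-9]*?)/);

// Requiring at least one digit, greedily, captures the whole id.
const greedy = url.match(/(?<=users\/)(\d+)/);

console.log(JSON.stringify(lazy[1])); // ""
console.log(greedy[1]);               // "11468859"
```

In engines without lookbehind, the pattern fails to compile, and the plain capture-group form users\/(\d+) is the portable choice.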
There are a couple of approaches available in most languages. To anchor the match after a prefix, use the positive lookbehind format (?<=STUFF). To match the digits themselves, try \d+ or [0-9]+. Each of the following lines works. The second adds a positive lookahead so that an id may also include letters, but it will fail if the ampersand is moved.
(?<=users.)\d+
(?<=users.).*?(?=&)
(?<=users.)[0-9]+
For more information: http://myregextester.com/index.php#highlighttab
How do I correctly create a positive lookbehind to match on users/?
You don't, at least in engines that predate ES2018, because JavaScript did not support lookbehinds before then:
From javascript regex - look behind alternative?:
Javascript doesn't have regex lookbehind.
http://regexadvice.com/forums/thread/58678.aspx:
The JavaScript regex engine does not support look-behinds
As an alternative, you can capture the number like this:
users\/(.*?)\&
And just access the first capturing group. Explanation and demonstration: http://regex101.com/r/aZ3bL0
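Try:
const string = "https://www.myapi.com/player/?url=https%3A//myapi.com/users/11468859&color=788b78&auto_play=false&show_artwork=false";
const regex = /users.([\d]*)/;
// Run the regex against the URL string.
const arr = regex.exec(string);
// Capture group 1 holds the id.
const result = arr[1];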
