JavaScript regex: reuse a group's pattern to match another group - javascript

Hi guys, I want to reuse an expression to capture another group.
For example:
I need to search for US $4.5-8.8
The structure is: Part1-Part2
Part1 and Part2 use the same pattern, so I would like to define Part1 as a group and then reuse it for Part2.
So far I have written the expression up to 4.5-XXXX:
US \$([0-9]{1}(?=\.{1})\.{1}[0-9]+)(?=\-)\-
check in: https://regex101.com/r/E2MjWh/1
What should I do to reuse the first group? It is easy in other languages, but I can't do it in JavaScript.
PS: I need it in pure regex, without JavaScript code like var... etc.

First, have a look at your regex: the positive lookaheads are not really necessary there, as they just require the same thing as the consuming subpatterns that follow them. (?=\.{1})\.{1} means "require 1 dot immediately to the right of the current location and then match the dot", and (?=\-)\- has a similar meaning, requiring and matching a - symbol.
Now, you ask if you can repeat the same part of a pattern using just the regex syntax. No, it is not possible in JS regex. A backreference such as \1 only matches the same text that the group captured, not the same pattern, and JS has no subroutine syntax such as the (?1) that some other flavors (e.g. PCRE) offer.
You may use the following regex to match the whole string like yours:
/US\s+\$(\d+\.\d+)-(\d+\.\d+)/
See the regex demo. Sure, you may add word boundaries (to match US as a whole word) or anchors (to match the whole input string), or replace the \d+\.\d+ part with \d*\.?\d+ (to match both integers and floats) to further enhance the pattern.
There is a way to shorten the pattern by placing the repetitive part into a variable and building the regex dynamically using the constructor notation:
var price = "\\d*\\.?\\d+";
var reg = new RegExp("US\\s+\\$(" + price + ")-(" + price + ")");
Add the required modifiers if necessary.
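A minimal usage sketch of the dynamic approach, assuming the sample input from the question. The names price, priceRange and parts are only illustrative, and String.raw is used here just to avoid doubling every backslash:

```javascript
// Reusable sub-pattern: an integer or a float, e.g. 8 or 4.5.
const price = String.raw`\d*\.?\d+`;

// Build the full pattern from the shared piece (same idea as the constructor notation above).
const priceRange = new RegExp(String.raw`US\s+\$(${price})-(${price})`);

const m = priceRange.exec("I need to search US $4.5-8.8 here");
const parts = m ? [m[1], m[2]] : [];
console.log(parts); // [ '4.5', '8.8' ]
```

The regex itself still contains the sub-pattern twice; only the source code stops repeating it.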

Related

RegExp works in JS and PHP but not in Java

I have a regexp to extract an id and a label out of an HTML source code. It can be found HERE.
As you can see, it works fine and it's fast, but when I try this regexp in Java with the same source code it (1) takes forever and (2) only matches one string (from the first a to the last a is one match).
I tried it with the Multiline flag on and off, but there was no difference. I don't understand how a regexp can work everywhere but in Java. Any ideas?
private static final String COURSE_REGEX = "<a class=\"list-group-item list-group-item-action \" href=\"https:\\/\\/moodle-hs-ulm\\.de\\/course\\/view\\.php\\?id=([0-9]*)\"(?:.*\\s){7}<span class=\"media-body \">([^<]*)<\\/span>";
Pattern pattern = Pattern.compile(COURSE_REGEX, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(sourceCode);
List<String> courses = new ArrayList<>();
while(matcher.find() && matcher.groupCount() == 2){
courses.add(matcher.group(1) + "(" + matcher.group(2) + ")");
}
Your regex is running into catastrophic backtracking because of the gargantuan number of ways the subexpression (?:.*\s){7} can be matched (both . and \s can match spaces, so the engine has to try many different ways of splitting the same text between them). As far as I know, java.util.regex has no built-in step limit, so a match attempt like this can run for a very long time; other engines and testers may optimize or give up differently.
If you simplify that part of your regex to .*?, you do get the matches:
"(?s)<a class=\"list-group-item list-group-item-action \" href=\"https://moodle-hs-ulm\\.de/course/view\\.php\\?id=([0-9]*)\".*?<span class=\"media-body \">([^<]*)</span>"
Note that you need the DOTALL flag ((?s), so . may match a newline) instead of the MULTILINE flag which changes the behavior of ^ and $ anchors (none of which your regex is using).
Also note that you don't need to escape slashes in a Java regex.
This solution is not very robust because .*? is rather unspecific. I suppose your previous attempt of (?:.*\\s){7} may have been designed to match no more than 7 lines of text? In that case, you could use (?:(?!</a>).)* instead to ensure that you don't cross over into the next <a> tag. That's one of the dangers of parsing HTML with regex :)
Finally, greetings from a staff member of the faculty of Informatics at your university :)
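To keep the examples on this page in one language, here is the same idea sketched in JavaScript rather than Java: the s flag plays the role of DOTALL, and the tempered dot (?:(?!<\/a>).)* keeps a match from running past the closing </a>. The HTML sample, ids and labels are invented for illustration.

```javascript
// Made-up HTML; only the "media-body " class mirrors the question.
const html = [
  '<a class="item" href="view.php?id=12">',
  '  <span class="media-body ">Algebra</span>',
  '</a>',
  '<a class="item" href="view.php?id=34">',
  '  <span class="media-body ">Databases</span>',
  '</a>',
].join("\n");

// s (dotAll): . also matches newlines.
// (?:(?!<\/a>).)*? consumes one character at a time, but never steps over "</a>".
const courseRe = /id=([0-9]+)"(?:(?!<\/a>).)*?<span class="media-body ">([^<]*)<\/span>/gs;

const courses = [...html.matchAll(courseRe)].map((m) => `${m[1]}(${m[2]})`);
console.log(courses); // [ '12(Algebra)', '34(Databases)' ]
```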

Splitting a string at question mark, exclamation mark, or period in javascript and retain those marks?

I was a bit surprised that no one seems to have had this exact issue in JavaScript...
I tried several different solutions, but none of them parse the content correctly.
The closest one I tried (I took its regex from a PHP solution):
const test = `abc?aaa.abcd?.aabbccc!`;
const sentencesList = test.split("/(\?|\.|!)/");
But the result is just
["abc?aaa.abcd?.aabbccc!"]
What I want to get is
['abc?', 'aaa.', 'abcd?','.', 'aabbccc!']
I am so confused.. what exactly is wrong?
/[a-z]*[?!.]/g will do what you want:
const test = `abc?aaa.abcd?.aabbccc!`;
console.log(test.match(/[a-z]*[?!.]/g))
To help you out: what you wrote is not a regex. test.split("/(\?|\.|!)/"); passes a plain string (9 characters once the string literal drops the unnecessary backslashes), so split looks for that literal text and never finds it. A regex would be, for example, test.split(/(\?|\.|!)/);. This still would not be the regex you're looking for.
The problem with this regex is that it's looking for a ?, ., or ! character only, and capturing that lone character. What you want to do is find any number of characters, followed by one of those three characters.
Next, String.split does accept a regex, but it splits on the matches: with this capturing group the marks come back as separate array elements instead of staying attached to the text before them. A function that returns the matches themselves (such as String.match) is a better fit.
Putting this all together, you'll want to start out your regex with something like this: /.*?/. The dot means any character matches, the asterisk means 0 or more, and the question mark means "non-greedy", i.e. try to match as few characters as possible while keeping a valid match.
To search for your three characters, you would follow this up with /[?!.]/ to indicate you want one of these three characters (so far we have /.*?[?!.]/). Lastly, you want to add the g flag so it searches for every instance, rather than only the first: /.*?[?!.]/g. Now we can use it in match:
const rawText = `abc?aaa.abcd?.aabbccc!`;
const matchedArray = rawText.match(/.*?[?!.]/g);
console.log(matchedArray);
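For comparison, a small sketch of how split and match behave on the same input. The lookbehind variant is my own addition, not something from the answers above, and assumes an engine with lookbehind support (current Node and browsers have it):

```javascript
const text = "abc?aaa.abcd?.aabbccc!";

// split with a capturing group: the marks come back as separate elements,
// and empty strings appear where two marks are adjacent or the text ends on a mark.
const viaSplit = text.split(/([?!.])/);
console.log(viaSplit.length); // 11

// match with the g flag: every piece keeps its trailing mark.
const viaMatch = text.match(/[^?!.]*[?!.]/g);
console.log(viaMatch); // [ 'abc?', 'aaa.', 'abcd?', '.', 'aabbccc!' ]

// split on a zero-width lookbehind: cut right after each mark without consuming it.
const viaLookbehind = text.split(/(?<=[?!.])/);
console.log(viaLookbehind); // [ 'abc?', 'aaa.', 'abcd?', '.', 'aabbccc!' ]
```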
The following code works; I do not think we need pattern matching. (I take that back: I have been answering in Java, and this split drops the marks instead of retaining them.)
final String S = "A sentence may end with period. Does it end any other way? Of course!";
final String[] simpleSentences = S.split("[?!.]");
// now the simpleSentences array has three elements in it (without the punctuation marks)

What Regex would capture both the beginning and end from of a string?

I am trying to edit a DateTime string in a TypeScript file.
The string in question is 02T13:18:43.000Z.
I want to trim the first three characters, including the letter T, from the beginning of the string AND also the last 5 characters from the end of the string, that is .000Z, including the dot character. Essentially I want the result to look like this: 13:18:43.
From what I found, the pattern (^(.*?)T) accomplishes only the first part of the trim I require, which leaves the intermediate result like this: 13:18:43.000Z.
What kind of regex pattern must I use to include the second part of the trim I have mentioned? I have tried to include the following block in the same pattern (Z000.)$ but of course it failed.
Thanks.
Any help would be appreciated.
There is no need to use regular expression in order to achieve that. You can simply use:
let value = '02T13:18:43.000Z';
let newValue = value.slice(3, -5);
console.log(newValue);
It will return 13:18:43, assuming that your string always has the same layout. According to the documentation, slice extracts the substring from beginIndex up to (but not including) endIndex; endIndex is optional, and a negative index counts back from the end of the string.
As I see it, you want a regex-only solution, so does this pattern work?
(\d{2}:)+\d{2} or simply \d{2}:\d{2}:\d{2}
It looks for one or more digit-digit-colon groups followed by a final pair of digits.
The only disadvantage is that it doesn't check ranges, e.g. that the minutes do not exceed 59.
I left that check out because I assumed your dates come from a source where the stored data is already valid, e.g. a database.
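A quick sketch of that pattern applied to the string from the question. The stricter variant with range checks is only an assumption about what such a check could look like, not something the question asks for:

```javascript
const value = "02T13:18:43.000Z";

// Pull the hh:mm:ss part out directly instead of trimming both ends.
const timeMatch = value.match(/\d{2}:\d{2}:\d{2}/);
const time = timeMatch ? timeMatch[0] : null;
console.log(time); // 13:18:43

// Hypothetical stricter version: hours 00-23, minutes and seconds 00-59.
const strictTime = /(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d/;
console.log(strictTime.test(value));      // true
console.log(strictTime.test("25:61:00")); // false
```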
Solution
This should suffice to remove both the prefix from beginning to T and postfix from . to end:
/^.*T|\..*$/g
console.log(new Date().toISOString().replace(/^.*T|\..*$/g, ''))
See the visualization on debuggex
Explanation
The section ^.*T removes all characters up to and including the last encountered T in the string.
The section \..*$ removes all characters from the first encountered . to the end of the string.
The | in between coupled with the global g flag allows the regular expression to match both sections in the string, allowing .replace(..., '') to trim both simultaneously.
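The same replace on the fixed string from the question, so the result is reproducible. This is only a sketch to show what the g flag contributes; the variable names are illustrative:

```javascript
const stamp = "02T13:18:43.000Z";

// With g: both alternatives get their turn, so prefix and suffix are removed.
const trimmed = stamp.replace(/^.*T|\..*$/g, "");
console.log(trimmed); // 13:18:43

// Without g: replace stops after the first match, which is the "02T" prefix.
const onlyPrefix = stamp.replace(/^.*T|\..*$/, "");
console.log(onlyPrefix); // 13:18:43.000Z
```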

What's wrong with this regular expression to find URLs?

I'm working on a JavaScript to extract a URL from a Google search URL, like so:
http://www.google.com/search?client=safari&rls=en&q=thisisthepartiwanttofind.org&ie=UTF-8&oe=UTF-8
Right now, my code looks like this:
var checkForURL = /[\w\d](.org)/i;
var findTheURL = checkForURL.exec(theURL);
I've run this through a couple of regex testers and it seems to work, but in practice the string I get returned looks like this:
thisisthepartiwanttofind.org,.org
So where's that trailing ,.org coming from?
I know my pattern isn't super robust but please don't suggest better patterns to use. I'd really just like advice on what in particular I did wrong with this one. Thanks!
Remove the parentheses in the regex if you do not need to process the .org separately (unlikely, since it is a literal). As per @Mark's comment, add a + to match one or more characters of the class [\w\d]. Also, I would escape the dot:
var checkForURL = /[\w\d]+\.org/i;
What you're actually getting is an array of 2 results, the first being the whole match, the second - the group you defined by using parens (.org).
Compare with:
/([\w\d]+)\.org/.exec('thisistheurl.org')
→ ["thisistheurl.org", "thisistheurl"]
/[\w\d]+\.org/.exec('thisistheurl.org')
→ ["thisistheurl.org"]
/([\w\d]+)(\.org)/.exec('thisistheurl.org')
→ ["thisistheurl.org", "thisistheurl", ".org"]
The result of an .exec of a JS regex is an Array of strings, the first being the whole match and the subsequent representing groups that you defined by using parens. If there are no parens in the regex, there will only be one element in this array - the whole match.
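A short sketch of where the ",.org" in the question comes from, using the URL from the question: converting the result array to a string joins its elements with commas. The variable names are illustrative.

```javascript
const theURL = "http://www.google.com/search?client=safari&rls=en&q=thisisthepartiwanttofind.org&ie=UTF-8&oe=UTF-8";

// With a group: element 0 is the whole match, element 1 is the group.
const withGroup = /[\w\d]+(\.org)/i.exec(theURL);
const asString = String(withGroup); // array elements joined with ","
console.log(asString); // thisisthepartiwanttofind.org,.org

// Taking element 0 gives just the whole match.
const wholeMatch = withGroup[0];
console.log(wholeMatch); // thisisthepartiwanttofind.org
```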
You should escape the dot in the (.org) group; otherwise it matches any character. So your regex would become:
/[\w\d]+(\.org)/
To match the url in your example you can use something like this:
https?://([0-9a-zA-Z_.?=&\-]+/?)+
or something more accurate like this (you should choose the right regex according to your needs):
^https?://([0-9a-zA-Z_\-]+\.)+(com|org|net|WhatEverYouWant)(/[0-9a-zA-Z_\-?=&.]+)$

Match a specific sequence or everything else with regex

Been trying to come up with a regex in JS that could split user input like :
"Hi{user,10,default} {foo,10,bar} Hello"
into:
["Hi","{user,10,default} ","{foo,10,bar} ","Hello"]
So far I have managed to split these strings with ({.+?,(?:.+?){2}})|([\w\d\s]+) but the second capturing group is too restrictive, as I want every character to be matched in this group. I tried (.+?) but of course it fails...
Ideas, fellow regex gurus?
Here's the regex I came up with:
(?:[^\{])+|(?:\{.+?\})
Like the one above, it includes that space as a match.
Use this:
"Hi{user,10,default} {foo,10,bar} Hello".split(/(\{.*?\})/)
And you will get this
["Hi", "{user,10,default}", " ", "{foo,10,bar}", " Hello"]
Note: in \{.*?\}, the question mark ('?') makes the match lazy, so it stops at the first '}'.
Being no JavaScript expert, I would suggest the following:
get all positive matches using ({[^},]*,[^},]*,[^},]*?})
remove all positive matches from the original string
split up the remaining string
Although, this might get tricky if you need the resulting values in order.
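If the order matters, one possible sketch is a single match with an alternation, which returns the pieces left to right in one pass. The pattern below is my own guess at the intent (a braced block plus any whitespace after it, or a run of non-brace text) and assumes braces are never nested:

```javascript
const input = "Hi{user,10,default} {foo,10,bar} Hello";

// Either a {...} block with its trailing whitespace, or a run of text without "{".
const tokenRe = /\{[^}]*\}\s*|[^{]+/g;

const tokens = input.match(tokenRe);
console.log(tokens); // [ 'Hi', '{user,10,default} ', '{foo,10,bar} ', 'Hello' ]
```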
