RegExp works in JS and PHP but not in Java - javascript

I have a regexp to extract an id and a label out of an HTML source code. It can be found HERE.
As you can see, it works fine and it's fast, but when I try this regexp in Java with the same source code it 1. takes forever and 2. only matches one string (everything from the first a to the last a is one match).
I tried it with the MULTILINE flag on and off, but it made no difference. I don't understand how a regexp can work everywhere but in Java. Any ideas?
private static final String COURSE_REGEX = "<a class=\"list-group-item list-group-item-action \" href=\"https:\\/\\/moodle-hs-ulm\\.de\\/course\\/view\\.php\\?id=([0-9]*)\"(?:.*\\s){7}<span class=\"media-body \">([^<]*)<\\/span>";
Pattern pattern = Pattern.compile(COURSE_REGEX, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(sourceCode);
List<String> courses = new ArrayList<>();
while (matcher.find() && matcher.groupCount() == 2) {
    courses.add(matcher.group(1) + "(" + matcher.group(2) + ")");
}

Your regex is running into catastrophic backtracking because of the gargantuan number of possible permutations the subexpression (?:.*\s){7} needs to check (because the . can also match spaces). Java aborts the match attempt after a certain number of steps (not sure how many, certainly > 1.000.000). PHP or JS may not be so cautious.
If you simplify that part of your regex to .*?, you do get the matches:
"(?s)<a class=\"list-group-item list-group-item-action \" href=\"https://moodle-hs-ulm\\.de/course/view\\.php\\?id=([0-9]*)\".*?<span class=\"media-body \">([^<]*)</span>"
Note that you need the DOTALL flag ((?s), so . may match a newline) instead of the MULTILINE flag which changes the behavior of ^ and $ anchors (none of which your regex is using).
Also note that you don't need to escape slashes in a Java regex.
This solution is not very robust because .*? is rather unspecific. I suppose your previous attempt of (?:.*\\s){7} may have been designed to match no more than 7 lines of text? In that case, you could use (?:(?!</a>).)* instead to ensure that you don't cross over into the next <a> tag. That's one of the dangers of parsing HTML with regex :)
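The (?:(?!</a>).)* construct can be tried out in isolation. The following is a minimal sketch in JavaScript (the language of most examples on this page) rather than Java; the sample markup, the example.invalid URLs, and the shortened class attribute are assumptions made up for illustration, and [\s\S] stands in for a dot with the DOTALL flag.

```javascript
// Hypothetical sample markup; the real Moodle page is not shown in the question.
const html = [
  '<a class="list-group-item" href="https://example.invalid/course/view.php?id=12">',
  '  <span class="media-body ">Algebra</span>',
  '</a>',
  '<a class="list-group-item" href="https://example.invalid/course/view.php?id=34">',
  '  <span class="media-body ">Databases</span>',
  '</a>',
].join('\n');

// (?:(?!<\/a>)[\s\S])*? consumes one character at a time, but only while
// the closing </a> does not start at the current position, so a match can
// never run over into the next link.
const courseRegex =
  /<a class="list-group-item" href="[^"]*\?id=([0-9]+)"(?:(?!<\/a>)[\s\S])*?<span class="media-body ">([^<]*)<\/span>/g;

const courses = [];
for (const m of html.matchAll(courseRegex)) {
  courses.push(m[1] + '(' + m[2] + ')');
}
console.log(courses); // [ '12(Algebra)', '34(Databases)' ]
```

The same pattern should carry over to Java with (?s) and an ordinary dot, since lookaheads behave alike in both engines.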
Finally, greetings from a staff member of the faculty of Informatics at your university :)

Related

How to make replace() global in JavaScript

I know this question has been asked many times, but I could not find a solution. I have some smileys, each of which has a code to be rendered as a smiley using replace(), but I get a syntax error. I don't know why, or how to render my code :/ as a smiley.
txt = " Hi :/ ";
txt.replace("/\:/\/g","<img src='img/smiley.gif'>");
Your regular expression doesn't need to be in quotes. You should escape the correct / forward slash (you were escaping the wrong slash) and assign the replacement, since .replace doesn't modify the original string.
txt = " Hi :/ ";
txt = txt.replace(/:\//g,"<img src='img/smiley.gif'>");
Based on jonatjano's brilliant deduction, I think you should add a little more to the regular expression to avoid such calamities as interfering with URLs.
txt = txt.replace(/:\/(?!\/)/g,"<img src='img/smiley.gif'>");
The above ensures that :// is not matched by doing a negative-lookahead.
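As a quick check of the lookahead, here is a small sketch. The function name renderSmileys and the sample sentence are made up for illustration; note that the slash inside the lookahead has to be escaped as well, otherwise it would end the regex literal.

```javascript
// Replace every ":/" smiley, but leave the ":/" that is part of "://" alone.
// renderSmileys is a hypothetical helper name, not from the answers above.
function renderSmileys(text) {
  return text.replace(/:\/(?!\/)/g, "<img src='img/smiley.gif'>");
}

const rendered = renderSmileys("Hi :/ see http://example.com :/");
console.log(rendered);
// Hi <img src='img/smiley.gif'> see http://example.com <img src='img/smiley.gif'>
```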
There are two problems in the first argument of replace(): it escapes the wrong characters, and it uses a string that seems to contain a regex instead of a real RegExp.
The second line should read:
txt.replace(/:\//g,"<img src='img/smiley.gif'>");
/:\//g is the regex. The first and the last / are the RegExp delimiters, g is the "global" RegExp option (String.replace() needs a RegExp instead of a string to do a global replace).
The content of the regex is :/ (the string you want to find) but because / has a special meaning in a RegExp (see above), it needs to be escaped and it becomes :\/.

What Regex would capture both the beginning and end of a string?

I am trying to edit a DateTime string in a TypeScript file.
The string in question is 02T13:18:43.000Z.
I want to trim the first three characters, including the letter T, from the beginning of the string AND also all 5 characters from the end of the string, that is .000Z, including the dot character. Essentially I want the result to look like this: 13:18:43.
From what I found the following pattern (^(.*?)T) can accomplish only the first part of the trim I require, that leaves the initial result like this: 13:18:43.000Z.
What kind of Regex pattern must I use to include the second part of the trim I have mentioned? I have tried to include the following block in the same pattern (Z000.)$ but of course it failed.
Thanks.
Any help would be appreciated.
There is no need to use regular expression in order to achieve that. You can simply use:
let value = '02T13:18:43.000Z';
let newValue = value.slice(3, -5);
console.log(newValue);
It will return 13:18:43, assuming that your string always has the same pattern. According to the documentation, the slice method extracts the substring from beginIndex up to, but not including, endIndex; endIndex is optional, and a negative index counts back from the end of the string.
If you specifically need a regex solution, does this pattern work?
(\d{2}:)+\d{2} or simply \d{2}:\d{2}:\d{2}
It matches one or more digit-digit-colon groups followed by a final pair of digits.
The only disadvantage is that it doesn't validate the values, so it would also accept, say, minutes greater than 59.
I left that check out because I assumed your dates come from a source where the stored data is already valid, e.g. a database.
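A minimal sketch of using that pattern with match() instead of trimming; the helper name extractTime is made up for illustration, and returning null when nothing matches is an assumption about what the caller wants.

```javascript
// Pull the first hh:mm:ss group out of a string.
// extractTime is a hypothetical helper; it returns null when nothing matches.
function extractTime(value) {
  const m = value.match(/\d{2}:\d{2}:\d{2}/);
  return m ? m[0] : null;
}

console.log(extractTime('02T13:18:43.000Z')); // 13:18:43
console.log(extractTime('no time here'));     // null
```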
Solution
This should suffice to remove both the prefix from beginning to T and postfix from . to end:
/^.*T|\..*$/g
console.log(new Date().toISOString().replace(/^.*T|\..*$/g, ''))
See the visualization on debuggex
Explanation
The section ^.*T removes all characters up to and including the last encountered T in the string.
The section \..*$ removes all characters from the first encountered . to the end of the string.
The | in between coupled with the global g flag allows the regular expression to match both sections in the string, allowing .replace(..., '') to trim both simultaneously.

Javascript Regex, reuse a group to match another group

Hi guys, I want to reuse an expression to capture another group.
For example:
I need to search for US $4.5-8.8
The structure is: Part1-Part2
Part1 and Part2 have the same pattern, so I would like to define Part1 as a group and then reuse it for Part2.
I've written the expression up to 4.5-XXXX:
US \$([0-9]{1}(?=\.{1})\.{1}[0-9]+)(?=\-)\-
Check it at: https://regex101.com/r/E2MjWh/1
What should I do to reuse the first group? It is easy in other languages, but I can't do it in JavaScript.
PS: I need it in pure regex, without JavaScript code like var, etc.
First, have a look at your regex: the positive lookaheads are not really necessary there as they just require the same as the following consuming subpatterns. (?=\.{1})\.{1} means "require 1 dot immediately to the right of the current location and then match the dot", and (?=\-)\- has a similar meaning, requiring and matching a - symbol.
Now, you ask if you can repeat the same part of a pattern using just the regex syntax. No, it is not possible in JS regex.
You may use the following regex to match the whole string like yours:
/US\s+\$(\d+\.\d+)-(\d+\.\d+)/
See the regex demo. Sure, you may add word boundaries (to match US as a whole word) or anchors (to match the whole input string) or replace the \d+\.\d+ part with \d*\.?\d+ (to match both integers or floats) to further enhance the pattern.
There is a way to shorten the pattern by placing the repetitive part into a variable and building the regex dynamically using the constructor notation:
var price = "\\d*\\.?\\d+";
var reg = new RegExp("US\\s+\\$(" + price + ")-(" + price + ")");
Add the required modifiers if necessary.
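A short usage sketch of the dynamically built regex, applied to the sample string from the question. The variable names priceMatch, low, and high are made up for illustration.

```javascript
// The repeated sub-pattern lives in one string and is spliced in twice.
const price = "\\d*\\.?\\d+";
const reg = new RegExp("US\\s+\\$(" + price + ")-(" + price + ")");

const priceMatch = "US $4.5-8.8".match(reg);
const low = priceMatch[1];  // first price
const high = priceMatch[2]; // second price
console.log(low, high); // 4.5 8.8
```

Because the pattern is a string here, every backslash has to be doubled, which is the main thing to watch when moving a regex literal into the constructor form.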

Javascript: Highlighting part of a string with <b> tags

I'm trying to highlight a match in a string by inserting <b> tags around the matching substring. For example, if the query is "cat" then:
"I have a cat."
should become:
"I have a <b>cat</b>."
Likewise, if the query is "stack overflow", then:
"Stack Overflow is great."
should become:
"<b>Stack Overflow</b> is great."
In other words, I have to preserve the case of the original string, but not be case-sensitive when matching.
One thing I was trying so far is:
var regex = new RegExp('(' + query + ')', 'i');
return strResult.replace(regex, '<b>$1</b>');
However, this causes a runtime exception if query has any parentheses in it, and I think it'd be too much hassle to attempt to escape all the possible regular expression characters.
See "Escape Regular Expression Characters in String - JavaScript" for information about how to escape special regex characters, such as ()
EDIT: Also check out this older SO question that asks a very similar - almost identical - question.
Don't use regex to manipulate HTML.
For example if the query is ‘cat’, then:
I have a <em class="category">dog</em>
will become a mess of broken markup. In cases where the query and text may be user-generated, the resulting HTML-injection attacks are likely to leave you with cross-site-scripting security holes.
See this question for an example of how to find and mark up text using a regex in the DOM.
(For completeness, here is a function to escape regex-special characters, since the version linked at snipplr is insufficient. It fails to escape ^ and $, plus - which is special in character classes.)
RegExp.escape = function(s) {
    return s.replace(/[-/\\^$*+?.()|[\]{}]/g, '\\$&');
};
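Putting the escaping function together with the approach from the question gives roughly the following sketch. It assumes the input is plain text rather than HTML, so the warning above about markup still applies; the names escapeRegExp and highlight are made up for illustration, and using the g flag to mark every occurrence is a choice made here, not something the question specifies.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[-/\\^$*+?.()|[\]{}]/g, '\\$&');
}

// Wrap every case-insensitive occurrence of query in <b> tags, keeping the
// original casing of the text. Assumes text is plain text, not HTML.
function highlight(text, query) {
  if (query === '') return text; // an empty pattern would match everywhere
  const regex = new RegExp('(' + escapeRegExp(query) + ')', 'gi');
  return text.replace(regex, '<b>$1</b>');
}

console.log(highlight('Stack Overflow is great.', 'stack overflow'));
// <b>Stack Overflow</b> is great.
console.log(highlight('f(x) = (a)', '(a)'));
// f(x) = <b>(a)</b>
```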
How about using a highlight plugin?
http://johannburkard.de/blog/programming/javascript/highlight-javascript-text-higlighting-jquery-plugin.html

Regular expression to remove a file's extension

I am in need of a regular expression that can remove the extension of a filename, returning only the name of the file.
Here are some examples of inputs and outputs:
myfile.png -> myfile
myfile.png.jpg -> myfile.png
I can obviously do this manually (ie removing everything from the last dot) but I'm sure that there is a regular expression that can do this by itself.
Just for the record, I am doing this in JavaScript
Just for completeness: How could this be achieved without Regular Expressions?
var input = 'myfile.png';
var output = input.substr(0, input.lastIndexOf('.')) || input;
The || input takes care of the case, where lastIndexOf() provides a -1. You see, it's still a one-liner.
/(.*)\.[^.]+$/
Result will be in that first capture group. However, it's probably more efficient to just find the position of the rightmost period and then take everything before it, without using regex.
The regular expression to match the pattern is:
/\.[^.]*$/
It finds a period character (\.), followed by 0 or more characters that are not periods ([^.]*), followed by the end of the string ($).
console.log(
"aaa.bbb.ccc".replace(/\.[^.]*$/,'')
)
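Wrapped in a function and run against the examples from the question, this behaves as in the sketch below. The name stripExtension is made up for illustration; the last call shows an edge case to be aware of, since a leading-period name is treated as being all extension.

```javascript
// Remove the last period and everything after it; names without a period
// are returned unchanged. stripExtension is a hypothetical helper name.
function stripExtension(name) {
  return name.replace(/\.[^.]*$/, '');
}

console.log(stripExtension('myfile.png'));     // myfile
console.log(stripExtension('myfile.png.jpg')); // myfile.png
console.log(stripExtension('file'));           // file
console.log(stripExtension('.htaccess'));      // (empty string)
```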
/^(.+)(\.[^ .]+)?$/
Test cases where this works and others fail:
".htaccess" (leading period)
"file" (no file extension)
"send to mrs." (no extension, but ends in abbr.)
"version 1.2 of project" (no extension, yet still contains a period)
The common thread above is, of course, "malformed" file extensions. But you always have to think about those corner cases. :P
Test cases where this fails:
"version 1.2" (no file extension, but "appears" to have one)
"name.tar.gz" (if you view this as a "compound extension" and wanted it split into "name" and ".tar.gz")
How to handle these is problematic and best decided on a project-specific basis.
/^(.+)(\.[^ .]+)?$/
Above pattern is wrong - it will always include the extension too. It's because of how the javascript regex engine works. The (\.[^ .]+) token is optional so the engine will successfully match the entire string with (.+)
http://cl.ly/image/3G1I3h3M2Q0M
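The greedy behaviour described above can be checked directly. The sketch below also tries one possible fix that is not taken from any of the answers here: making the first group lazy, so that the optional extension group gets a chance to match. The variable names are made up for illustration.

```javascript
// Greedy version: (.+) swallows the whole name, so group 2 never participates.
const greedy = 'myfile.png'.match(/^(.+)(\.[^ .]+)?$/);
console.log(greedy[1], greedy[2]); // myfile.png undefined

// Lazy variant (an assumption, not from the answers): (.+?) grows only as far
// as needed, which lets the optional group capture the last extension.
const lazyPattern = /^(.+?)(\.[^ .]+)?$/;
const lazy = 'myfile.png'.match(lazyPattern);
console.log(lazy[1], lazy[2]); // myfile .png

console.log('myfile.png.jpg'.match(lazyPattern)[1]); // myfile.png
console.log('.htaccess'.match(lazyPattern)[1]);      // .htaccess
```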
Here's my tested regexp solution.
The pattern will match filenameNoExt with/without extension in the path, respecting both slash and backslash separators
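var path = "c:\\some.path/subfolder/file.ext"; // the backslash must be escaped in a string literal
var m = path.match(/([^:\\/]*?)(?:\.([^ :\\/.]*))?$/);
var fileName = (m === null) ? "" : m[1];        // group 1: file name without extension
var fileExt = (m === null) ? "" : (m[2] || ""); // group 2: extension, undefined if absent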
dissection of the above pattern:
([^:\\/]*?) // match any character, except slashes and colon, 0-or-more times,
// make the token non-greedy so that the regex engine
// will try to match the next token (the file extension)
// capture the file name token to subpattern \1
(?:\. // match the '.' but don't capture it
([^ :\\/.]*) // match file extension
// ensure that the last element of the path is matched by prohibiting slashes
// capture the file extension token to subpattern \2
)?$ // the whole file extension is optional
http://cl.ly/image/3t3N413g3K09
http://www.gethifi.com/tools/regex
This will cover all the cases mentioned by @RogerPate, including full paths too.
Another no-regex way of doing it (the "opposite" of @Rahul's version, not using pop() to remove):
It doesn't need to refer to the variable twice, so it's easier to inline. Note that it returns an empty string when the name contains no period.
filename.split('.').slice(0,-1).join('.')
This will do it as well :)
'myfile.png.jpg'.split('.').reverse().slice(1).reverse().join('.');
I'd stick to the regexp though... =P
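return filename.split('.').pop();
Note that this returns the part after the last period (the extension) rather than the name without it, so it would have to be combined with something like slice(0, -1) to answer the question. It is not a regular-expression approach either.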
In JavaScript you can call the replace() method, which replaces based on a regular expression.
This regular expression matches the whole line from beginning to end and captures everything before the last period in group 1, so replacing the match with $1 removes the last period and everything after it.
/^(.*)\..*$/
The how of implementing the replace can be found in this Stackoverflow question.
Javascript regex question
