Using JavaScript, I'm looking to pinpoint text that's inside two other strings WITHOUT including those strings. For example:
input: ONE example TWO
regular expression: (?=ONE).+(?=TWO)
matches: ONE example
I want: example
I'm really surprised that the question-mark syntax (which is supposed to just include that string in the query but not in the result) works at the end of the pattern, but not at the start.
Ah-ha! I figured it out.
For example, here's how to get the text inside parentheses without the parentheses:
(?<=\().+(?=\))
Here's a nice reference: http://www.regular-expressions.info/lookaround.html
Part of my confusion was JavaScript's fault. It evidently doesn't support "lookbehinds" natively. I found this workaround, though:
http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript
(I use Python's re module to show the examples -- exactly how to do this depends on your regexp implementation [some don't have groups, for example -- or backreferences])
Use a backwards assertion, not a forward assertion, for the first assertion.
>>> re.search(r"(?<=ONE).+(?=TWO)", "ONE x a b TWO").group()
' x a b '
The problem is that the zero width assertion (?=ONE) matches the text "ONE", but doesn't "consume" it -- i.e. it just checks that it's there, but leaves the string as-is. Then the .+ starts reading text, and does consume it.
Backwards assertions don't look ahead, they look behind, so .+ doesn't get run until whatever is behind it is "ONE".
It is probably better not to bother with these at all, but use groups. Consider:
>>> re.search(r"ONE(.+)TWO", "ONE x a b TWO").group(1)
' x a b '
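The same group-based approach carries over to JavaScript. Here is a minimal sketch, assuming both delimiters are plain literal strings; the helper name `textBetween` is made up for illustration and is not part of any library.

```javascript
// Hypothetical helper: returns the text between two literal delimiters,
// or null when the pattern does not match.
function textBetween(input, start, end) {
  // Escape regex metacharacters so the delimiters are matched literally.
  const esc = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const match = new RegExp(esc(start) + "(.+)" + esc(end)).exec(input);
  return match ? match[1] : null;
}

console.log(JSON.stringify(textBetween("ONE example TWO", "ONE", "TWO"))); // " example "
console.log(textBetween("(inside)", "(", ")")); // inside
```

Because the delimiters are consumed by the match and only group 1 is returned, no lookaround support is needed.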
I have a string that will have a lot of formatting things like bullet points or arrows or whatever. I want to clean this string so that it only contains letters, numbers and punctuation. Multiple spaces should be replaced by a single space too.
Allowed punctuation: , . : ; [ ] ( ) / \ ! # # $ % ^ & * + - _ { } < > = ? ~ | "
Basically anything allowed in this ASCII table.
This is what I have so far:
let asciiOnly = y.replace(/[^a-zA-Z0-9\s]+/gm, '')
let withoutSpacing = asciiOnly.replace(/\s{2,}/gm, ' ')
Regex101: https://regex101.com/r/0DC1tz/2
I also tried the [:punct:] tag but apparently it's not supported by javascript. Is there a better way I can clean this string other than regex? A library or something maybe (I didn't find any). If not, how would I do this with regex? Would I have to edit the first regex to add every single character of punctuation?
EDIT: I'm trying to paste an example string in the question but SO just removes characters it doesn't recognize so it looks like a normal string. Here's a paste.
EDIT2: I think this is what I needed:
let asciiOnly = x.replace(/[^\x20-\x7E]+/gm, '')
let withoutSpacing = asciiOnly.replace(/\s{2,}/gm, ' ')
I'm testing it with different cases to make sure.
You can do this with the regex below, which matches any character that is not printable ASCII (that covers non-ASCII characters, non-printable ASCII control characters, and extended ASCII) so that it can be replaced with an empty string.
[^ -~]+
This assumes you want to retain only the printable ASCII characters, which range from space (ASCII value 32) to tilde ~ (ASCII value 126), hence the character set [^ -~].
Then replace every run of one or more whitespace characters with a single space.
var str = `Determine the values of P∞ and E∞ for each of the following signals: b.
d.
f.
Periodic and aperiodic signals Determine whether or not each of the following signals is periodic:
b.
Determine whether or not each of the following signals is periodic. If a signal is periodic, specify its fundamental period.
b.
d.
Transformation of Independent variables A continuous-time signal x(t) is shown in Figure 1. Sketch and label carefully each of the following signals:
b. c.
d. e. f. Figure 1: Problem Set 1.4
Even and Odd Signals
For each signal given below, determine all the values of the independent variable at which the even part of the signal is guaranteed to be zero.
b.
d. -------------------------`;
console.log(str.replace(/[^ -~]+/g,'').replace(/\s+/g, ' '));
Alternatively, if you only want to allow alphanumeric characters and the special characters you listed, you can use this regex, which matches everything outside that set:
[^ a-zA-Z0-9,.:;[\]()/\!##$%^&*+_{}<>=?~|"-]+
Replace this with empty string and then replace one or more white spaces with just a single space.
var str = `Determine the values of P∞ and E∞ for each of the following signals: b.
d.
f.
Periodic and aperiodic signals Determine whether or not each of the following signals is periodic:
b.
Determine whether or not each of the following signals is periodic. If a signal is periodic, specify its fundamental period.
b.
d.
Transformation of Independent variables A continuous-time signal x(t) is shown in Figure 1. Sketch and label carefully each of the following signals:
b. c.
d. e. f. Figure 1: Problem Set 1.4
Even and Odd Signals
For each signal given below, determine all the values of the independent variable at which the even part of the signal is guaranteed to be zero.
b.
d. -------------------------`;
console.log(str.replace(/[^ a-zA-Z0-9,.:;[\]()/\!##$%^&*+_{}<>=?~|"-]+/g,'').replace(/\s+/g, ' '));
This is how I would do it: remove all the disallowed characters first, and then replace multiple spaces with a single space.
let str = `Determine the values of P∞ and E∞ for each of the following signals: b.
d.
f.
Periodic and aperiodic signals Determine whether or not each of the following signals is periodic:!!!23
b.
Determine whether or not each of the following signals is periodic. If a signal is periodic, specify its fundamental period.
b.
d.
Transformation of Independent variables A continuous-time signal x(t) is shown in Figure 1. Sketch and label carefully each of the following signals:
b. c.
d. e. f. Figure 1: Problem Set 1.4
Even and Odd Signals
For each signal given below, determine all the values of the independent variable at which the even part of the signal is guaranteed to be zero.
b.
d. ------------------------- `
const op = str.replace(/[^\w,.:;\[\]()/\!##$%^&*+{}<>=?~|" -]/g, '').replace(/\s+/g, " ")
console.log(op)
EDIT: If you want to keep \n or \t as they are, use (\s)\1+ with the replacement "$1" in the second regex.
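A sketch of that newline-preserving variant. It assumes the first pass is also relaxed so that \n and \t survive it, since a character class that does not list them removes them before the second regex ever runs. The function name `cleanKeepingBreaks` is illustrative only.

```javascript
// Hypothetical helper: keep printable ASCII plus \n and \t, then collapse
// runs of the SAME whitespace character into a single occurrence.
function cleanKeepingBreaks(text) {
  return text
    .replace(/[^\x20-\x7E\n\t]+/g, "") // drop everything outside the allowed set
    .replace(/(\s)\1+/g, "$1");        // "  " -> " ", "\n\n" -> "\n"
}

console.log(JSON.stringify(cleanKeepingBreaks("a  b\n\n\nc → d\t\te")));
// "a b\nc d\te"
```

Note that (\s)\1+ only collapses repeats of one character; a mixed run such as a space followed by a newline is left as it is.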
There probably isn't a better solution than a regex. The under-the-hood implementation of regex actions is usually well optimized by virtue of age and ubiquity.
You may be able to explicitly tell the regex handler to "compile" the regex. This is usually a good idea if you know the regex is going to be used a lot within a program, and may help with performance here. But I don't know if javascript exposes such an option.
The idea of "normal punctuation" doesn't have an excellent foundation. There are some common marks like "90°" that aren't ASCII, and some ASCII marks like "" () that you almost certainly don't want. I would expect you to find similar edge cases with any pre-made list. In any case, just explicitly listing all the punctuation you want to allow is better in general, because then no one will ever have to look up what's in the list you chose.
You may be able to perform both substitutions in a single pass, but it's unclear if that will perform better and it almost certainly won't be clearer to any co-workers (including yourself-from-the-future). There will be a lot of finicky details to work out such as whether " ° " should be replaced with "", " ", or " ".
I'm not particularly strong with Regular Expressions. Basically, I have the following string:
Showing 1-20 of 748 results.
I want to extract the "748", convert it to a number, and use it for comparisons. As expected, "Showing", "of", and "results" are not expected to change, but the numbers could. I have a couple of solutions in mind. The first is using lookbehinds, but I do not believe JS supports them. The second is a more blunt approach, maybe finding all the numbers in the string using match() and taking the third element of the returned array (index 2, which should be "748").
Any thoughts on the best way to do this?
I would use the regex:
Showing \d+-\d+ of (\d+) results\.
where \d+ in each case means to match 1 or more digits. The parentheses around the number you want to find are called a capture group.
So if the search string was in str, the resulting JavaScript might look like:
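var resultsRe = /Showing \d+-\d+ of (\d+) results\./;
var match = resultsRe.exec(str);
// match[1] holds the text captured by the parentheses; convert it to a number
var numResults = match ? parseInt(match[1], 10) : NaN;
console.log("There are " + numResults + " results.");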
For a simple approach you could do the following:
(\d+)\sresults
All it does is capture the integer directly before the word results.
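For comparison, here is a small sketch that puts the capture-group approach next to the "blunt" approach from the question. The variable names are arbitrary, and it assumes the text always has the "Showing X-Y of Z results." shape.

```javascript
const text = "Showing 1-20 of 748 results.";

// Approach 1: anchor on the surrounding words and capture the total.
const m = /Showing \d+-\d+ of (\d+) results\./.exec(text);
const total = m ? Number(m[1]) : NaN;

// Approach 2 (the "blunt" one): collect every number and take the third.
const all = text.match(/\d+/g) || [];
const totalBlunt = all.length >= 3 ? Number(all[2]) : NaN;

console.log(total, totalBlunt, total > 500); // 748 748 true
```

The first approach fails loudly (NaN) if the wording changes, while the second silently picks whatever number happens to come third.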
I observe these results:
// Test 1:
var re = /a|ab/;
"ab".match(re); // returns ["a"] <--- Unexpected
// Test 2:
re = /ab|a/;
"ab".match(re); // returns ["ab"]
I would expect tests 1 and 2 to both return ["ab"], due to the principle of "leftmost longest". I don't understand why the order of the two alternatives in the regex should change the results.
Find the reason below:
Note that alternatives are considered left to right until a match is
found. If the left alternative matches, the right alternative is
ignored, even if it would have produced a “better” match. Thus, when
the pattern /a|ab/ is applied to the string “ab,” it matches only the
first letter.
(source: O'Reilly - JavaScript Pocket Reference - Chapter 9 Regular Expressions)
Thanks.
This is because JavaScript doesn't implement the POSIX engine.
POSIX NFA Engines work similarly to Traditional NFAs with one
exception: a POSIX engine always picks the longest of the leftmost
matches. For example, the alternation cat|category would match the full word "category" whenever possible, even if the first alternative ("cat") matched and appeared earlier in the alternation. (SEE MRE 153-154)
Source: Oreilly - Javascript Pocket Reference, p.4
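If you need the longest alternative in JavaScript, one common workaround is to order the alternatives yourself. A minimal sketch, assuming the alternatives are plain literal strings; `longestFirst` is a made-up helper name.

```javascript
console.log("ab".match(/a|ab/)[0]); // a   (the first alternative that matches wins)
console.log("ab".match(/ab|a/)[0]); // ab

// Hypothetical helper: build an alternation that tries longer literals first.
function longestFirst(words) {
  const esc = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const sorted = [...words].sort((x, y) => y.length - x.length);
  return new RegExp(sorted.map(esc).join("|"));
}

console.log("category".match(longestFirst(["cat", "category"]))[0]); // category
```

This only imitates "leftmost longest" for literal alternatives; with arbitrary subpatterns the length of each alternative's match is not known in advance.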
I want to match all strings ending in ".htm" unless it ends in "foo.htm". I'm generally decent with regular expressions, but negative lookaheads have me stumped. Why doesn't this work?
/(?!foo)\.htm$/i.test("/foo.htm"); // returns true. I want false.
What should I be using instead? I think I need a "negative lookbehind" expression (if JavaScript supported such a thing, which I know it doesn't).
The problem is pretty simple really. This will do it:
/^(?!.*foo\.htm$).*\.htm$/i.test("/foo.htm"); // returns false
What you are describing (your intention) is a negative look-behind, and Javascript has no support for look-behinds.
Look-aheads look forward from the character at which they are placed, and you've placed it before the dot (.). So what you've got actually says "anything ending in .htm as long as the first three characters starting at that position (.ht) are not foo", which is always true.
Usually, the substitute for negative look-behinds is to match more than you need, and extract only the part you actually do need. This is hacky, and depending on your precise situation you can probably come up with something else, but something like this:
// Checks that the last 3 characters before the dot are not foo:
/(?!foo).{3}\.htm$/i.test("/foo.htm"); // returns false
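To see how the position of the look-ahead changes the outcome, here is a quick comparison of the three patterns discussed so far. The sample names are made up for illustration.

```javascript
const names = ["/foo.htm", "/bar.htm", "/a.htm", "/foo.html"];

const original = /(?!foo)\.htm$/i;             // look-ahead sits right before the dot
const shifted = /(?!foo).{3}\.htm$/i;          // look-ahead sits three characters earlier
const anchored = /^(?!.*foo\.htm$).*\.htm$/i;  // look-ahead checks the whole string

for (const n of names) {
  // Columns: name, original, shifted, anchored
  console.log(n, original.test(n), shifted.test(n), anchored.test(n));
}
```

One thing to watch with the {3} version: it needs three characters before the dot, so a short name such as "/a.htm" is rejected even though it does not end in foo.htm. The anchored version does not have that limitation.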
As mentioned JavaScript does not support negative look-behind assertions.
But you could use a workaround:
/(foo)?\.htm$/i.test("/foo.htm") && RegExp.$1 != "foo";
This will match everything that ends with .htm but it will store "foo" into RegExp.$1 if it matches foo.htm, so you can handle it separately.
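RegExp.$1 is a legacy static property, so a variant that reads the group from the match result may be preferable. A sketch of the same idea; the function name `isHtmButNotFoo` is hypothetical.

```javascript
// Hypothetical helper: true for names ending in ".htm" but not in "foo.htm".
function isHtmButNotFoo(name) {
  const m = /(foo)?\.htm$/i.exec(name);
  // m is null when the name does not end in .htm; m[1] is undefined
  // when the optional group did not take part in the match.
  return m !== null && m[1] === undefined;
}

console.log(
  isHtmButNotFoo("/foo.htm"),  // false
  isHtmButNotFoo("/bar.htm"),  // true
  isHtmButNotFoo("/bar.html")  // false
);
```

Checking for undefined rather than comparing against "foo" also covers upper-case names such as FOO.htm, which the /i flag lets the group capture.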
Like Renesis mentioned, "lookbehind" is not supported in JavaScript, so maybe just use two regexps in combination:
!/foo\.htm$/i.test(teststring) && /\.htm$/i.test(teststring)
This answer has probably arrived a little later than necessary, but I'll leave it here in case someone runs into the same issue now (7 years, 6 months after this question was asked).
Lookbehinds are now included in the ECMAScript 2018 standard and are supported at least in the latest version of Chrome. However, you can solve the puzzle with or without them.
A solution with negative lookahead:
let testString = `html.htm app.htm foo.tm foo.htm bar.js 1to3.htm _.js _.htm`;
testString.match(/\b(?!foo)[\w-.]+\.htm\b/gi);
> (4) ["html.htm", "app.htm", "1to3.htm", "_.htm"]
A solution with negative lookbehind:
testString.match(/\b[\w-.]+(?<!foo)\.htm\b/gi);
> (4) ["html.htm", "app.htm", "1to3.htm", "_.htm"]
A solution with (technically) positive lookahead:
testString.match(/\b(?=[^f])[\w-.]+\.htm\b/gi);
> (4) ["html.htm", "app.htm", "1to3.htm", "_.htm"]
etc.
All these RegExps tell the JS engine the same thing in different ways. The message they pass is something like the following.
Please find in this string all sequences of characters that:
Are separated from other text (like words);
Consist of one or more letters of the English alphabet, underscores, hyphens, dots, or digits;
End with ".htm";
Apart from that, the part of the sequence before ".htm" can be anything but "foo".
String.prototype.endsWith (ES6)
console.log( /* !(not)endsWith */
!"foo.html".endsWith("foo.htm"), // true
!"barfoo.htm".endsWith("foo.htm"), // false (here you go)
!"foo.htm".endsWith("foo.htm"), // false (here you go)
!"test.html".endsWith("foo.htm"), // true
!"test.htm".endsWith("foo.htm") // true
);
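You could emulate the negative lookbehind with something like
/^(.|..|.*[^f]..|.*f[^o].|.*fo[^o])\.htm$/ (the leading ^ matters: without it the single-character alternative can match just the last "o" of "foo.htm"), but a programmatic approach would be better.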
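One possible shape for such a programmatic approach, sketched with String.prototype.endsWith. The function name is made up, and lower-casing the input is an assumption meant to mirror the /i flag used in the regex answers.

```javascript
// Hypothetical helper: ends in ".htm" but not in "foo.htm", case-insensitively.
function endsWithHtmExceptFoo(name) {
  const lower = name.toLowerCase(); // mirror the /i flag used elsewhere
  return lower.endsWith(".htm") && !lower.endsWith("foo.htm");
}

console.log(endsWithHtmExceptFoo("barfoo.htm")); // false
console.log(endsWithHtmExceptFoo("test.htm"));   // true
console.log(endsWithHtmExceptFoo("test.html"));  // false
```

Unlike a bare !endsWith("foo.htm") check, this also rejects names that do not end in .htm at all.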
In Jeff Roberson's jQuery Regular Expressions Review he proposes changing the rts regular expression in jQuery's ajax.js from /(\?|&)_=.*?(&|$)/ to /([?&])_=[^&\r\n]*(&?)/. In both versions, what is the purpose of the second capture group? The code does a replacement of the current random timestamp with a new random timestamp:
var ts = jQuery.now();
// try replacing _= if it is there
var ret = s.url.replace(rts, "$1_=" + ts + "$2");
Doesn't it only replace what it matches? I am thinking this does the same:
var ret = s.url.replace(/([?&])_=[^&\r\n]*/, "$1_=" + ts);
Can someone explain the purpose of the second capture group?
It's to pick up the next delimiter in the query string on the URL, so that it still works properly as a query string. Thus if the url is
http://foo.bar/what/ever?blah=blah&_=12345&zebra=banana
then the second group picks up the "&" before "zebra".
That's an awesome blog post by the way and everybody should read it.
edit — now that I think about it, I'm not sure why it's necessary to bother with replacing that second delimiter. In the "fixed" expression, that greedy * will pick up the whole parameter value and stop at the delimiter (or the end of the string) anyway.
I think you're right. It was needed in the original because matching the ampersand or end-of-string was how the .*? knew when to stop. In Jeff's version that's no longer necessary.
As the author of the article I can't tell you the reason for the second capture group. My intent with the article was to take existing regexes and simply make them more efficient - i.e. they should all match the same text - just do it faster. Unfortunately I did not have time to delve deeply into the code to see exactly how each and every one of them was being used. I assumed that the capture group for this one was there for a reason so I did not mess with it.
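A quick way to check the point made above, using the example URL from this thread. The fixed `ts` value is a stand-in for jQuery.now() so the output is repeatable; this is a sketch, not the jQuery source.

```javascript
const url = "http://foo.bar/what/ever?blah=blah&_=12345&zebra=banana";
const ts = 67890; // stand-in for jQuery.now()

// Original (lazy) expression: the second group is what stops the .*?
const lazy = url.replace(/(\?|&)_=.*?(&|$)/, "$1_=" + ts + "$2");
// Revised expression, with and without the second group.
const withGroup = url.replace(/([?&])_=[^&\r\n]*(&?)/, "$1_=" + ts + "$2");
const withoutGroup = url.replace(/([?&])_=[^&\r\n]*/, "$1_=" + ts);

console.log(withGroup);
console.log(lazy === withGroup, withGroup === withoutGroup); // true true
```

In the revised expression the negated class [^&\r\n]* already stops at the next delimiter, so whether the "&" is captured and put back or simply left unmatched makes no difference to the result.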