How to match sth. "preceded / followed by" something with regular expression? - javascript

I am working with a Google Sheets document in which I need to manipulate strings and extract certain parts of them. These strings have exactly the following form, to the character:
Ad name: FOO_FOOBAR_DE_CH_Zagreb+N1_970x250.zip; 970x250
I need to extract two "fields":
Zagreb
970x250
Obviously, the first one is always surrounded by "_" and "+", which makes things a bit easier, and the other one is either surrounded by "_" and "." or preceded by "; " if I were to capture it from the end of the string.
I am trying to use the Google Sheets proprietary REGEXEXTRACT formula (read more about it here) but I must be doing something wrong. If it matters, Google products use the RE2 RegEx "flavor".
Here is what I have so far:
=REGEXEXTRACT(text, "(?:_)[A-Za-z]+(?:\+).*")
This one returns:
_Zagreb+
so I need to lose the "_" and "+". I understand that for this type of operation (extracting text between certain characters) look-arounds should be used but I am still quite unfamiliar with these. Also, I understand that some of them (negative look-behind most notably) do not work with JavaScript.
This is attempt 2:
=REGEXEXTRACT(text, ".*[A-Za-z]+(?=\+.*)")
This one just throws a #REF error. I find these two resources invaluable for learning RegEx:
Rexegg
Regular-expressions.info
but since I am short of time, I can't afford to study this in detail right now.

In Google Spreadsheets, you may use a capturing group around the piece of text you need to extract from a specific context. Thus, just place ( and ) around those pattern parts.
To get Zagreb, use =REGEXEXTRACT(F15,"_([a-zA-Z]+)\+") and to get the resolution, use =REGEXEXTRACT(F15,";\s*([0-9x]+)$").
Pattern 1:
_ - an underscore that is just matched
([a-zA-Z]+) - Capture group 1 matching one or more ASCII letters
\+ - a literal +.
Pattern 2
;\s* - a ; and 0+ whitespaces
([0-9x]+) - Capture group 1 matching one or more digits or x
$ - at the end of the cell contents.
In both cases, you only get the substrings captured into Group 1.
More information about capturing groups can be found here.
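For readers who want to try the same two patterns outside Sheets, here is a minimal JavaScript sketch. It assumes the sample string from the question, and the variable names are only illustrative. JavaScript's engine is not RE2, but these two patterns use only syntax that both accept.

```javascript
// Sample cell contents from the question.
const text = "Ad name: FOO_FOOBAR_DE_CH_Zagreb+N1_970x250.zip; 970x250";

// Pattern 1: letters between an underscore and a literal plus sign.
const cityMatch = text.match(/_([a-zA-Z]+)\+/);
// Pattern 2: digits and "x" after ";" and optional whitespace, at the very end.
const sizeMatch = text.match(/;\s*([0-9x]+)$/);

// Index 1 holds capture group 1; index 0 would be the whole match.
const city = cityMatch ? cityMatch[1] : null;
const size = sizeMatch ? sizeMatch[1] : null;

console.log(city, size);
```

As in REGEXEXTRACT, only the text inside the parentheses is kept; the surrounding "_", "+" and ";" are matched but not returned.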

Related

Javascript -- Regex -- Blacklist of multiple words to END with a partial match

I've read many Questions on StackOverflow, including this one, this one, and even read Rexegg's Best Trick, which is also in a question here. I found this one, which works on entire lines, but not "everything up to the bad word". None of these have helped me, so here I go:
In Javascript, I have a long regex pattern. I'm trying to match a sequence in similar sentence structures, like follows:
1 UniquePrefixA [some-token] and [some-token] want to take [some-token] to see some monkeys.
2 UniqueC [some-token] wants to take [some-token] to the store. UniqueB, [some-token] is in the pattern once more.
3 UniquePrefixA [some-token] is using [some-token] to [some-token].
Notice that each pattern starts with a unique prefix. Encountering that prefix signals the start of a pattern. If I encounter that pattern again during capture, I should not capture a second occurrence, and STOP THERE. I'll have captured everything up to that prefix.
If I don't encounter the prefix later in the pattern, I need to continue matching that pattern.
I'm also using capture groups (not repeating, since Capture Groups only return the last matched of that group). The capture group contents need to be returned, so I'm using match, non-greedy.
Here's my pattern and a working example
/(?:UniquePrefixA|UniqueB|UniqueC)\s*(\[some-token\])(?:and|\s)*(\[some-token\])?(\s|[^\[\]])*(\[some-token\])? --->(\s|[^\[\]])*<--- (\[some-token\])?(\s|[^\[\]])*/i
It's basically 2 repeating patterns in a specific order:
(\s|[^\[\]])* // Basically .*, but excluding brackets
(\[some-token\]) // A token [some-token]
How I can prevent the match from continuing past a black list of words?
I want this to happen where I drew three arrows, for context. The equivalent of Any character, but not the contents of this list: (UniquePrefixA|UniqueB|UniqueC) (as seen in capture group 1).
It's possible I need a better understanding of negative lookahead, or of whether it can work with a group of things. Most importantly, I'm looking to know if a negative look-ahead approach can support a list of options. Or is there a better way altogether? If the answer is "you can't do that," that's cool too.
I think an easier-to-maintain solution is to divide your task into 2 parts:
Find each chunk of text starting from any of your unique prefixes,
up to the next or to the end of string.
Process each such chunk, looking for your some tokens and maybe
also the content between them.
The regex performing the first task should include 3 parts:
(?:UniquePrefixA|UniqueB|UniqueC) - A non-capturing group looking
for any unique prefix.
((?:.|\n)+?) - A capturing group - the fragment to catch for further
processing (see the note below).
(?=UniquePrefixA|UniqueB|UniqueC|$) - A positive lookahead, looking
for either any unique prefix or the end of the string (a stop criterion
you are looking for).
To sum up, the whole regex looks like below:
/(?:UniquePrefixA|UniqueB|UniqueC)((?:.|\n)+?)(?=UniquePrefixA|UniqueB|UniqueC|$)/gi
Note: Before ES2018, the JavaScript flavour of regex did not implement
the single-line (s) option. Where that option is unavailable, instead of
just . in the capturing group above, you must use (?:.|\n), meaning:
either any char other than \n (.),
or just \n.
Both these variants are "enveloped" into a non-capturing group,
to put limits of variants (both sides of |), because the repetition
marker (+?) pertains to both variants.
Note ? after +, meaning the reluctant version.
So this part of regex (the capturing group) will match any sequence of chars
including \n, ending before the next unique prefix (if any),
just as you expect.
The second task is to apply another regex to the captured chunk (group 1),
looking for [some-token]s and possibly the content between them.
You didn't specify what exactly you want to do with each chunk,
so I'm not sure what this second regex should include.
Maybe it will be enough just to match [some-token]?
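The two-step approach above could be sketched as follows. This is only an illustration: the input is sentence 2 from the question, the second step simply collects every [some-token], and the names chunkRe, tokenRe and chunks are made up for the example.

```javascript
// Sentence 2 from the question, used as sample input.
const input =
  "UniqueC [some-token] wants to take [some-token] to the store. " +
  "UniqueB, [some-token] is in the pattern once more.";

// Step 1: a prefix, a lazily matched chunk, then a lookahead for the
// next prefix or the end of the string.
const chunkRe =
  /(?:UniquePrefixA|UniqueB|UniqueC)((?:.|\n)+?)(?=UniquePrefixA|UniqueB|UniqueC|$)/gi;

// Step 2: collect the tokens inside each captured chunk (group 1).
const tokenRe = /\[some-token\]/g;

const chunks = [...input.matchAll(chunkRe)].map((m) => ({
  chunk: m[1],
  tokens: m[1].match(tokenRe) || [],
}));

console.log(chunks);
```

Because the lookahead does not consume the next prefix, the following match can start right at it, so every prefix opens its own chunk.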
To ensure that a pattern does not occur within a repeated character sequence such as (\s|[^\[\]])* (note that \s is already included in [^\[\]], so this may be just [^\[\]]*), prepend a negative lookahead (a zero-length assertion, like ^) inside the repeated group, to the left of the character pattern, so that it is checked before every character:
((?!UniquePrefixA)(\s|[^\[\]]))*
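A negative lookahead can hold an alternation, so the same idea extends to the whole list of prefixes. The sketch below is a simplified illustration rather than the full pattern from the question: the sample sentence, the prefixes array and the way the pattern is assembled are assumptions made for the example.

```javascript
// Stop words, taken from the question's prefix list.
const prefixes = ["UniquePrefixA", "UniqueB", "UniqueC"];
const stop = prefixes.join("|");

// After a prefix, consume non-bracket characters one at a time,
// but only while no prefix starts at the current position.
const re = new RegExp(`(?:${stop})((?:(?!${stop})[^\\[\\]])*)`, "i");

// Made-up sample: the capture should stop right before "UniqueB".
const sample = "UniqueC wants to go. UniqueB is next";
const m = sample.match(re);

console.log(JSON.stringify(m[1]));
```

The lookahead runs before every consumed character, which is what stops the repetition at the next prefix instead of running past it.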

Use regex in Find & Replace to extract everything but a pattern/string

I want to extract the ASIN from any Amazon URL. I found this, giving me the following regex:
/([a-zA-Z0-9]{10})(?:[/?]|$)
This expression works for me in Excel. However, I also have to use another tool where I can only edit my text with Find & Replace. I can use regex but the tool will always replace the result from my regex.
When I use the expression above the tool will find exactly the string I am looking for but will then replace it with either blank or whatever I put in the replace field.
How does the regex have to look when I must use Find & Replace? I assume it should match/find anything BUT the ASIN/string and then replace it with blank. At the end of the day everything should be deleted/replaced except the ASIN.
Example input:
https://www.amazon.de/gp/product/**B00ZFWRGXC**/ref=br_asw_pdt-1?pf_rd_m=A3JWKAKI7XB7XF&pf_rd_s=desktop-6&pf_rd_r=BKAKXRSA7JM715TZ38YN&pf_rd_t=36701&pf_rd_p=f54c1f0d-d685-4847-826e-7fdd8c321011&pf_rd_i=desktop
I only want to keep the bold part (via Find & Replace).
You may use a regex based on an alternation with one branch matching and capturing what you need, and the other will just match all the text that does not start your sequence.
Use
/([a-zA-Z0-9]{10})|(?:(?!/[a-zA-Z0-9]{10}).)*
and replace with $1\n. To make it work better, make sure the ". matches newline" option (if present) is on. If it is not present, replace the . with [\s\S].
Details:
/([a-zA-Z0-9]{10}) - match a / and capture 10 alphanumerical symbols
| - or
(?:(?!/[a-zA-Z0-9]{10}).)* - zero or more characters, none of which starts a sequence of a / followed by 10 alphanumerical symbols.
The $1 is a backreference restoring the contents of the capturing group (10 alphanumerical symbols) in the result.
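To see what the replacement does, here is a rough JavaScript sketch. It assumes a shortened version of the example URL (without the ** markers) and uses String.prototype.replace as a stand-in for the Find & Replace tool, which may behave differently with empty matches.

```javascript
// Example URL from the question, shortened and without the ** markers.
const url =
  "https://www.amazon.de/gp/product/B00ZFWRGXC/ref=br_asw_pdt-1?pf_rd_m=A3JWKAKI7XB7XF&pf_rd_s=desktop-6";

// Branch 1 captures the ASIN; branch 2 swallows everything else.
const re = /\/([a-zA-Z0-9]{10})|(?:(?!\/[a-zA-Z0-9]{10}).)*/g;

// Every match becomes "$1\n": the ASIN where branch 1 matched,
// an empty string (plus the newline) where branch 2 matched.
const replaced = url.replace(re, "$1\n");

// The leftover newlines are trimmed here; a real tool may need a second pass.
const asin = replaced.trim();

console.log(asin);
```

When branch 2 matches, group 1 did not participate, so $1 expands to nothing and only the newline remains.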
/([A-Z0-9]{10})|(?:(?!/[A-Z0-9]{10}).)*
or
/([a-zA-Z0-9]{10})/|(?:(?!/[a-zA-Z0-9]{10}/).)*
will fix it.

Parsing units with javascript regex

Say I have a string which contains some units (which may or may not have prefixes) that I want to break into the individual units. For example the string may contain "Btu(th)" or "Btu(th).ft" or even "mBtu(th).ft" where mBtu(th) is the bastardised unit milli thermochemical BTU's (this is purely an example).
I currently have the following (simplified) regex however it fails for the case "mBtu(th).ft":
/(m|k)??(Btu\(th\)|ft|m)(?:\b|\s|$)/g
Currently this does not correctly detect the boundary between the end of 'Btu(th)' and the start of 'ft'. I understand javascript regex does not support look back so how do I accurately parse the string?
Additional notes
The regex presented above is greatly simplified around the prefixes and units groups. The prefixes could span multiple characters like 'Ki' and therefore character sets are not suitable.
The desire is for each group to catch the prefix match as group 1 and the unit as match two i.e for 'mBtu(th).ft' match one would be ['m','Btu(th)'] and match two would be ['','ft'].
The prefix match needs to be lazy so that the string 'm' would be matched as the unit metres rather than the prefix milli. Likewise the match for 'mm' would need to be the prefix milli and the unit metres.
I would try with:
/((m)|(k)|(Btu(\(th\))?)|(ft)|(m)|(?:\.))+/g
at least with example above, it matches all units merged into one string.
DEMO
EDIT
Another try (DEMO):
/(?:(m)|(k)|(Btu)|(th)|(ft)|[\.\(\)])/g
This one again matches only one part at a time, but if you use $1, $2, $3, $4, etc. (DEMO), you can extract the other fragments. It ignores the ., ( and ) characters. The problem is counting the properly matched groups, but it works to some degree.
Or if you accept multiple separate matches I think simple alternative is:
/(m|k|Btu|th|ft)/g
A word boundary will not separate two non-word characters. So, you don't actually want a word boundary since the parentheses and period are not valid word characters. Instead, you want the string to not be followed by a word character, so you can use this instead:
[mk]??(Btu\(th\)|ft|m)(?!\w)
Demo
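A small sketch of how this could be used from JavaScript follows. It combines the capturing prefix group from the question with the (?!\w) lookahead suggested above; parseUnits is a hypothetical helper name, and only the simplified prefix and unit lists from the question are covered.

```javascript
// Lazy optional prefix (group 1), then a unit (group 2) that must not be
// followed by a word character.
const unitRe = /(m|k)??(Btu\(th\)|ft|m)(?!\w)/g;

// Hypothetical helper returning [prefix, unit] pairs; a missing prefix becomes "".
function parseUnits(s) {
  return [...s.matchAll(unitRe)].map((m) => [m[1] || "", m[2]]);
}

console.log(parseUnits("mBtu(th).ft"));
console.log(parseUnits("m"));
console.log(parseUnits("mm"));
```

The lazy prefix is tried empty first, so a lone "m" is read as the unit; in "mm" the lookahead rejects that reading and the engine backtracks to prefix plus unit.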
I believe you're after something like this, if I understood correctly that you want to match any kind of element, possibly preceded by the m or k character and separated by parentheses or dots.
/[\s\.\(]*(m|k?)(\w+)[\s\.\)]*/g
https://regex101.com/r/eQ5nR4/2
If you don't care about being able to match the parentheses but just return the elements you can just do
/(m|k?)(\w+)/g
https://regex101.com/r/oC1eP5/1

Matching variable-term equations

I am trying to develop a regular expression to match the following equations:
(Price+10%+100+200)
(Price+20%+200)
(Price+30%)
(Price+100)
(Price-10%-100-200)
(Price-20%-200)
(Price-30%)
(Price-100)
My regex so far is...
/([(])+([P])+([r])+([i])+([c])+([e])+([+]|[-]){1}([\d])+([+]|[-])?([\d])+([%])?([)])/g
..., but it only matches the following equations:
(Price+100+10%)
(Price+100+100)
(Price+200)
(Price-100-10%)
(Price-100-100)
(Price-200)
Can someone help me understand how to make my pattern match the full set of equations provided?
Note: Parentheses and 'Price' are musts in the equations that the pattern must match.
Try this, which matches all the input strings provided in the question:
/\(Price([+-]\d+%?){1,3}\)/g
You can test it in a regex fiddle.
Things to note:
Only use parentheses where you want to group. Parentheses around single-possibility, fixed-quantity matches (e.g. ([P])) provide no value.
Use character classes (opened with [ and closed with ]) for multiple characters that can match at a position in the pattern (e.g. [+-]). Single-possibility character classes (e.g. [P]) similarly provide no value.
Yes, character classes (generally) implicitly escape regex special characters within them (e.g. ( in [(] vs. equivalent \( outside a character class), but to just escape regex special characters (i.e. to match them literally), you are better off not using a character class and just escaping them (e.g. \() – unless multiple characters should match at a position in the pattern (per the previous point to note).
The quantifier {1} is (almost) always useless: drop it.
The quantifier + means "one or more" as you probably know. However, in a series of cases where you used it (i.e. ([(])+([P])+([r])+([i])+([c])+([e])+), it would match many values that I doubt you expect (e.g. ((((((PPPrriiiicccceeeeee): basically, don't overuse it. Stop to consider whether you really want to match one or more of the character (class) or group to which + applies in the pattern.
To match a literal string without any regex special characters like Price, just use the literal string at the appropriate position in the pattern – e.g. Price in \(Price.
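The points above can be checked with a short script. This is a sketch only: the g flag is dropped so that test() keeps no state between calls, and the string "((PPrriice+100)" is a made-up input chosen to show the effect of the over-used + quantifiers.

```javascript
// The eight equations from the question.
const equations = [
  "(Price+10%+100+200)", "(Price+20%+200)", "(Price+30%)", "(Price+100)",
  "(Price-10%-100-200)", "(Price-20%-200)", "(Price-30%)", "(Price-100)",
];

// Suggested pattern, without the g flag.
const suggested = /\(Price([+-]\d+%?){1,3}\)/;

// The original pattern from the question, for comparison.
const original =
  /([(])+([P])+([r])+([i])+([c])+([e])+([+]|[-]){1}([\d])+([+]|[-])?([\d])+([%])?([)])/;

const allMatch = equations.every((eq) => suggested.test(eq));

// Made-up malformed input with repeated characters.
const originalAcceptsTypos = original.test("((PPrriice+100)");
const suggestedAcceptsTypos = suggested.test("((PPrriice+100)");

console.log(allMatch, originalAcceptsTypos, suggestedAcceptsTypos);
```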
/\(Price[+-](\d)+(%)?([+-]\d+%?)?([+-]\d+%?)?\)/g
works on http://www.regexr.com/
/^[(Price]+\d+\d+([%]|[)])&/i
try at your own risk!

PHP or Javascript - Find a specific string

Finding a specific string is relatively easy, but I am not sure where to begin on this one. I would need to extract a string that would be different every time, but with similar characteristics.
Here are some example strings I need to find in a paragraph, either at the beginning, end or somewhere in the middle.
7b.9t.7iv.4x
4ir.4i.5i.6t
7ix.7t.4t.0z
As you can see the string will always begin with a number, and would have up to 2 characters after it and will always contain 4 octets separated by dots.
Let me know if you may need more details.
EDIT:
Thanks to the answer below I came up with this, while not pretty, does what I need.
$body="test 1f.9t.7iv.4x test 1a.9a.7ab.4xa test ";
$var=preg_match_all("([0-9][a-z]{1,2}\.[0-9][a-z]{1,2}\.[0-9][a-z]{1,2}\.[0-9][a-z]{1,2})",$body,$matches);
$count=count($matches[0]);
$stack = array();
while($count > 0){
$count--;
array_push($stack, "<span id='ip_".$matches[0][$count]."'>".$matches[0][$count]."</span>");
}
$stack=array_reverse($stack);
$body=str_replace($matches[0],$stack,$body);
You can use a regular expression.
Something like this to get you started. There may be a better way to match since it's repeated, but....
([0-9][a-z]{1,2}\.[0-9][a-z]{1,2}\.[0-9][a-z]{1,2}\.[0-9][a-z]{1,2})
( Start a capture group
[0-9] match any character 0 through 9
[a-z] match any character [a-z]
{1,2} but only match the previous 1 or 2 times
\. match a literal . the \ is needed as an escape because . is a special character
) End capture group
Both php and javascript allow for regular expression use.
For an even better visual representation you can check out this tool: http://www.debuggex.com/
If you need each octet by itself (as a match) you can add more parenthesis () around each [0-9][a-z]{1,2} which will then store those octets individually.
Also note that \d is the same as [0-9] but I prefer the latter as I find it a little more readable.
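Since the question also mentions JavaScript, here is a minimal sketch of the same match there. It folds the three repeated "octet plus dot" parts into a single {3} group, which is one possible answer to the "better way to match since it's repeated" remark; the plain <span> wrapper is a simplification of the PHP loop in the question, not a port of it.

```javascript
// Sample text from the question's edit.
const body = "test 1f.9t.7iv.4x test 1a.9a.7ab.4xa test ";

// Three "digit + 1-2 letters + dot" groups, then a final octet without a dot.
const octetsRe = /(?:[0-9][a-z]{1,2}\.){3}[0-9][a-z]{1,2}/g;

// All matching strings, or an empty array when nothing matches.
const found = body.match(octetsRe) || [];

// Wrap each hit in a span using a replacement callback.
const wrapped = body.replace(octetsRe, (hit) => `<span>${hit}</span>`);

console.log(found, wrapped);
```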
