I want to extract the ASIN from any Amazon URL. I found this, giving me the following regex:
/([a-zA-Z0-9]{10})(?:[/?]|$)
This expression works for me in Excel. However, I also have to use another tool where I can only edit my text with Find & Replace. I can use regex there, but the tool will always replace whatever my regex matches.
When I use the expression above the tool will find exactly the string I am looking for but will then replace it with either blank or whatever I put in the replace field.
How does the regex have to look when I must use Find & Replace? I assume it should match/find anything BUT the ASIN/string and then replace it with blank. At the end of the day everything should be deleted/replaced except the ASIN.
Example input:
https://www.amazon.de/gp/product/**B00ZFWRGXC**/ref=br_asw_pdt-1?pf_rd_m=A3JWKAKI7XB7XF&pf_rd_s=desktop-6&pf_rd_r=BKAKXRSA7JM715TZ38YN&pf_rd_t=36701&pf_rd_p=f54c1f0d-d685-4847-826e-7fdd8c321011&pf_rd_i=desktop
I only want to keep the bold part (via Find & Replace).
You may use a regex based on an alternation: one branch matches and captures what you need, and the other matches all the text that does not start your sequence.
Use
/([a-zA-Z0-9]{10})|(?:(?!/[a-zA-Z0-9]{10}).)*
and replace with $1\n. To make it work across line breaks, make sure the ". matches newline" option (if present) is on. If it is not present, replace the . with [\s\S].
Details:
/([a-zA-Z0-9]{10}) - match a / and capture 10 alphanumerical symbols
| - or
(?:(?!/[a-zA-Z0-9]{10}).)* - zero or more characters, none of which starts a sequence of a / followed by 10 alphanumerical symbols.
The $1 is a backreference restoring the contents of the capturing group (10 alphanumerical symbols) in the result.
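As a rough illustration, here is how the alternation behaves in JavaScript. This is an assumption, since the Find & Replace tool in question may use a different regex flavor. The helper name keepAsin is hypothetical, [\s\S] stands in for . with the "matches newline" option, and the replacement is plain $1 (without the \n) so that a single URL yields just the ASIN.

```javascript
// Sketch: delete everything except the captured ASIN.
// keepAsin is a hypothetical helper name, not part of any tool.
function keepAsin(url) {
  // Branch 1 captures 10 alphanumerics that follow a "/".
  // Branch 2 consumes any run of characters that does not start such a sequence.
  const pattern = /\/([a-zA-Z0-9]{10})|(?:(?!\/[a-zA-Z0-9]{10})[\s\S])*/g;
  // "$1" is empty whenever branch 2 matched, so that text is removed.
  return url.replace(pattern, "$1");
}

const sampleUrl =
  "https://www.amazon.de/gp/product/B00ZFWRGXC/ref=br_asw_pdt-1?pf_rd_m=A3JWKAKI7XB7XF&pf_rd_s=desktop-6";
console.log(keepAsin(sampleUrl)); // B00ZFWRGXC
```

The order of the branches matters: the capturing branch is tried first at every position, so the "everything else" branch only consumes text that cannot begin an ASIN match.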
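If the pattern also picks up other 10-character path segments, a stricter variant such as
/([A-Z0-9]{10})|(?:(?!/[A-Z0-9]{10}).)*
(uppercase letters and digits only) or
/([a-zA-Z0-9]{10})/|(?:(?!/[a-zA-Z0-9]{10}/).)*
(a trailing / required) will fix it.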
I need some help with RegEx. It may be basic stuff, but I cannot find the correct way to do it. Please help!
So, here's my question:
I have a list of URLs, that are invalid because of double slash, like this:
http://website.com//wp-content/folder/file.jpg. To fix it I need to remove all double slashes except the first one, which follows the colon (http://), so the fixed URL is this: http://website.com/wp-content/folder/file.jpg.
I need to do it with RegExp.
Variant 1
url.replace(/\/\//g,'/'); // => http:/website.com/wp-content/folder/file.jpg
will replace all double slashes (//), including the first one, which is not correct.
example here:
https://regex101.com/r/NhCVMz/2
You may use
url = url.replace(/(https?:\/\/)|(\/){2,}/g, "$1$2")
See the regex demo
Note: a ^ anchor at the beginning of the pattern might be used if the strings are entire URLs.
This pattern matches and captures http:// or https:// and restores it in the resulting string with the $1 backreference. Every other run of 2 or more / is matched by (\/){2,}, and only 1 occurrence is put back via $2, since the capturing group does not include the quantifier.
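A minimal sketch of this replacement, wrapped in a hypothetical helper named fixSlashes so it can be tried on a few inputs:

```javascript
// Sketch: collapse repeated slashes but keep the "//" of the scheme.
// fixSlashes is a hypothetical helper name.
function fixSlashes(url) {
  // Group 1 keeps "http://" or "https://" intact.
  // Group 2 holds a single "/" taken from any run of two or more slashes.
  // Whichever group did not take part in the match expands to an empty string.
  return url.replace(/(https?:\/\/)|(\/){2,}/g, "$1$2");
}

console.log(fixSlashes("http://website.com//wp-content/folder/file.jpg"));
// http://website.com/wp-content/folder/file.jpg
```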
Find (^|[^:])/{2,}
Replace $1/
delimited: /(^|[^:])\/{2,}/
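The same find and replace pair can be sketched in JavaScript as follows. The helper name collapseSlashes is hypothetical, and the global flag is an assumption so that every run in the URL is handled.

```javascript
// Sketch: replace runs of slashes unless the run directly follows a colon.
// collapseSlashes is a hypothetical helper name.
function collapseSlashes(url) {
  // Group 1 is either the start of the string or one character other than ":".
  // That character is consumed, so it is restored with "$1" before the single "/".
  return url.replace(/(^|[^:])\/{2,}/g, "$1/");
}

console.log(collapseSlashes("http://website.com//wp-content/folder/file.jpg"));
// http://website.com/wp-content/folder/file.jpg
```

This variant does not need to know the scheme name, since it only checks the character before the slashes.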
I am working with a Google Sheets document in which I need to manipulate strings and extract certain parts of them. These strings have exactly the following form, to the character:
Ad name: FOO_FOOBAR_DE_CH_Zagreb+N1_970x250.zip; 970x250
I need to extract two "fields":
Zagreb
970x250
Obviously, the first one is always surrounded by "_" and "+" which makes things a bit easier and the other one is either surrounded by "_" and "." OR preceded by "; " if I were to capture it from the end of the string.
I am trying to use Google Sheets proprietary REGEXMATCH formula (read more about it here) but I must be doing something wrong. If it matters, Google products use RE2 RegEx "flavor".
Here is what I have so far:
=REGEXEXTRACT(text, "(?:_)[A-Za-z]+(?:\+).*")
This one returns:
_Zagreb+
so I need to lose the "_" and "+". I understand that for this type of operation (extracting text between certain characters) look-arounds should be used but I am still quite unfamiliar with these. Also, I understand that some of them (negative look-behind most notably) do not work with JavaScript.
This is attempt 2:
=REGEXEXTRACT(text, ".*[A-Za-z]+(?=\+.*)")
This one just throws a #REF error. I find these two resources invaluable for learning RegEx:
Rexegg
Regular-expressions.info
but since I am short of time, I can't afford to study this in detail right now.
In Google Spreadsheets, you may use a capturing group around the piece of text you need to extract from a specific context. Thus, just place ( and ) around those pattern parts. RE2 does not support lookarounds, so the lookahead in the second attempt cannot work there.
To get Zagreb, use =REGEXEXTRACT(F15,"_([a-zA-Z]+)\+") and to get the resolution, use =REGEXEXTRACT(F15,";\s*([0-9x]+)$").
Pattern 1:
_ - an underscore that is just matched
([a-zA-Z]+) - Capture group 1 matching one or more ASCII letters
\+ - a literal +.
Pattern 2
;\s* - a ; and 0+ whitespaces
([0-9x]+) - Capture group 1 matching one or more digits or x
$ - at the end of the cell contents.
In both cases, you only get the substrings captured into Group 1.
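For comparison, the same two patterns can be tried outside Sheets. The sketch below uses JavaScript, where index [1] of the match result plays the role of Group 1; the variable names are made up for illustration.

```javascript
// Sketch: extract only the captured part, not the surrounding context.
const adName = "Ad name: FOO_FOOBAR_DE_CH_Zagreb+N1_970x250.zip; 970x250";

// Group 1 holds only the letters between "_" and "+".
const city = adName.match(/_([a-zA-Z]+)\+/)[1];

// Group 1 holds the digits and "x" after ";" at the end of the string.
const size = adName.match(/;\s*([0-9x]+)$/)[1];

console.log(city, size); // Zagreb 970x250
```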
More information about capturing groups can be found here.
I'm struggling with some regex, in javascript which doesn't have a typical lookbehind option, to only match a group if it's not preceded with a string:
(^|)(www\.[\S]+?(?= |[,;:!?]|\.( )|$))
so in the following
hello http:/www.mytestwebsite.com is awesome
I'm trying to detect if the www.mytestwebsite.com is preceded by
/
and if it is I don't want to match, otherwise match away. I tried using a look ahead but it looked to be conflicting with the look ahead I already had.
I've been playing around with placing (?!/) in different areas with no success.
(^|)((?!/)www\.[\S]+?(?= |[,;:!?]|\.( )|$))
Due to the lack of lookbehinds in JS, the only way to accomplish your goal is to match those web sites that contain the errant / as well. This is because a lookahead won't advance the current position; only a match on consumable text will.
A good workaround has always been to include the errant text as an option within the regex. You put a capture group around it, then test that group for a match. If it matched, skip and go on to the next match. This requires sitting in a while loop checking each successful match.
In the regex below, if group 1 matched, don't store the group 2 URL; if it didn't, store the group 2 URL.
(/)?(www\.\S+?(?= |[,;:!?]|\.( )|$))
Formatted:
( / )?                 # (1)
(                      # (2 start)
  www\. \S+?
  (?=
      \x20
    | [,;:!?]
    | \.
      ( \x20 )         # (3)
    | $
  )
)                      # (2 end)
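The loop described above might look roughly like this. The function name findUnprefixedUrls and the sample sentence are made up for illustration.

```javascript
// Sketch: keep a URL only when the optional "/" group did not take part in the match.
// findUnprefixedUrls is a hypothetical helper name.
function findUnprefixedUrls(text) {
  const pattern = /(\/)?(www\.\S+?(?= |[,;:!?]|\.( )|$))/g;
  const urls = [];
  let m;
  // exec() with the g flag resumes after the previous match on each call.
  while ((m = pattern.exec(text)) !== null) {
    // m[1] is undefined when no "/" preceded the URL.
    if (m[1] === undefined) {
      urls.push(m[2]);
    }
  }
  return urls;
}

console.log(findUnprefixedUrls("hello http:/www.mytestwebsite.com is awesome, see www.example.org too"));
// [ 'www.example.org' ]
```

Because the slash-prefixed URL is consumed as a whole, the scan cannot restart in the middle of it and report it as a clean match.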
Another option (and I've done zero performance testing) would be to use string.replace() with a regex and a callback as the second parameter.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace
Then, inside the replace callback, use the offset parameter passed to it (see the docs above) to determine each match's position in the original string. From there you can check whether the characters you don't want sit right next to the match and decide whether to replace the text or not.
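One way this could look, as a sketch only: the callback inspects the character just before the match and leaves the text untouched when it is a /. The helper name markUrls and the square-bracket marking are placeholders for whatever replacement is actually wanted, and the inner ( ) group is dropped here so the callback receives (match, offset, wholeString).

```javascript
// Sketch: decide per match, based on the preceding character, whether to replace.
// markUrls is a hypothetical helper name.
function markUrls(text) {
  const pattern = /www\.\S+?(?= |[,;:!?]|\. |$)/g;
  return text.replace(pattern, (match, offset, whole) =>
    // Returning the match unchanged effectively skips it.
    whole[offset - 1] === "/" ? match : "[" + match + "]"
  );
}

console.log(markUrls("hello http:/www.mytestwebsite.com is awesome and www.example.org too"));
// hello http:/www.mytestwebsite.com is awesome and [www.example.org] too
```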
Finding a specific string is relatively easy, but I am not sure where to begin on this one. I would need to extract a string that would be different every time, but with similar characteristics.
Here are some example strings I need to find in a paragraph, either at the beginning, end or somewhere in the middle.
7b.9t.7iv.4x
4ir.4i.5i.6t
7ix.7t.4t.0z
As you can see the string will always begin with a number, and would have up to 2 characters after it and will always contain 4 octets separated by dots.
Let me know if you may need more details.
EDIT:
Thanks to the answer below I came up with this, while not pretty, does what I need.
$body = "test 1f.9t.7iv.4x test 1a.9a.7ab.4xa test ";
// Collect every dotted four-octet token; "(" and ")" act as the pattern delimiters here.
preg_match_all("([0-9][a-z]{1,2}\.[0-9][a-z]{1,2}\.[0-9][a-z]{1,2}\.[0-9][a-z]{1,2})", $body, $matches);
// Build one <span> replacement per match, in the same order as the matches.
$stack = array();
foreach ($matches[0] as $match) {
    $stack[] = "<span id='ip_" . $match . "'>" . $match . "</span>";
}
// Swap each matched token for its wrapped version.
$body = str_replace($matches[0], $stack, $body);
You can use a regular expression.
Something like this to get you started. There may be a better way to match since it's repeated, but....
([0-9][a-z]{1,2}\.[0-9][a-z]{1,2}\.[0-9][a-z]{1,2}\.[0-9][a-z]{1,2})
( Start a capture group
[0-9] match any character 0 through 9
[a-z] match any character [a-z]
{1,2} but only match the previous 1 or 2 times
\. match a literal . the \ is needed as an escape because . is a special character
) End capture group
Both php and javascript allow for regular expression use.
For an even better visual representation you can check out this tool: http://www.debuggex.com/
If you need each octet by itself (as a match) you can add more parentheses () around each [0-9][a-z]{1,2}, which will then store those octets individually.
Also note that \d is the same as [0-9], but I prefer the latter as I find it a little more readable.
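Since the octet is repeated, the pattern can also be assembled from one piece with a quantifier. The following JavaScript sketch shows that idea; the variable names are made up, and the span markup simply mirrors the PHP from the question.

```javascript
// Sketch: build the four-octet pattern from a single octet definition.
const octet = "[0-9][a-z]{1,2}";
// One octet, then exactly three more, each preceded by a literal dot.
const octetPattern = new RegExp(octet + "(?:\\." + octet + "){3}", "g");

const body = "test 1f.9t.7iv.4x test 1a.9a.7ab.4xa test ";
const found = body.match(octetPattern);
console.log(found); // [ '1f.9t.7iv.4x', '1a.9a.7ab.4xa' ]

// Wrap each match in a span; a callback avoids "$" escaping issues in the replacement.
const marked = body.replace(octetPattern, (m) => "<span id='ip_" + m + "'>" + m + "</span>");
console.log(marked);
```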
I am trying to test a string for a state code, the regex I have is
^A[LKSZRAEP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY]$
The issue is, if I have something like "CTA12" as a test string, it will get a match of CT. How can I modify my regex to make it only match state codes that are not part of a larger string?
Your use of anchors with alternation is incorrect: ^AB|DC$ means "strings that start with AB or end with DC". To get the ^ and $ to both apply to each element of the alternation, you need to put the alternation in a group, for example ^(AB|DC)$.
Try changing your regex to the following:
^(A[LKSZRAEP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])$
The alternative to using a group is to put the ^ and $ as a part of each element in the alternation, for example ^AB$|^DC$, but that would make your regex significantly longer so a group is the way to go.
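The difference is easy to check with a quick sketch. JavaScript is used here as an assumption, since the question does not say which language runs the regex; the non-capturing group (?:...) behaves like the plain group above for this purpose.

```javascript
// Sketch: compare the ungrouped and grouped versions of the state-code pattern.
const codes =
  "A[LKSZRAEP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEHINOPST]|" +
  "N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY]";

// Without a group, ^ binds only to the first branch and $ only to the last one.
const ungrouped = new RegExp("^" + codes + "$");
// With a group, both anchors apply to every branch.
const grouped = new RegExp("^(?:" + codes + ")$");

console.log(ungrouped.test("CTA12")); // true
console.log(grouped.test("CTA12"));   // false
console.log(grouped.test("CT"));      // true
```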