Regexp to explode url - javascript

I have a URL string like "home/products/product_name_1/details/some_options"
and I want to parse it with a regexp into the array ["home", "products", "product", "details", "some"].
So the rule is: split into words at each slash, but if a word contains underscores, take only the part that comes before the first underscore.
The JavaScript equivalent of what I want is
str.split("/").map(item => item.indexOf("_") > -1 ? item.split("_")[0] : item)
Please help!

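You can use this pattern:
(?<!\w)[^/_]+
Results:
['home', 'products', 'product', 'details', 'some']
Python code:
import re
# Raw string so that \w reaches the regex engine unchanged; url avoids shadowing the built-in str
url = "home/products/product_name_1/details/some_options"
re.findall(r'(?<!\w)[^/_]+', url)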
['home', 'products', 'product', 'details', 'some']
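Since the question asks about JavaScript, the same pattern can be sketched there as well. This is a minimal sketch, assuming a runtime that supports lookbehind assertions (ES2018); the function name firstWords is hypothetical and not part of the original answer.

```javascript
// Hypothetical helper: first word of each path segment, using the lookbehind pattern.
function firstWords(url) {
  // (?<!\w) : the match must not be preceded by a word character (letter, digit or "_")
  // [^\/_]+ : one or more characters that are neither "/" nor "_"
  return url.match(/(?<!\w)[^\/_]+/g) ?? []; // match() returns null when nothing matches
}

console.log(firstWords("home/products/product_name_1/details/some_options"));
// [ 'home', 'products', 'product', 'details', 'some' ]
```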

Try this:
const input = ["home/products/product_name_1/details/some_options",
"company/products/cars_all/details/black_color",
"public/places/1_cities/disctricts/1234_something"]
let pattern = /([a-zA-Z\d]*)(?:\/|_.*?(?:\/|$))/gmi
input.forEach(el => {
let matches = el.matchAll(pattern)
for (const match of matches) {
console.log(match[1]);
}
})
Remove \d from the regex pattern if you don't want digits in the URL.
I have used matchAll here. matchAll returns an iterator; iterate over it to get each match object, in which the first element (index 0) is the full match and the second element (index 1) is the required group.
/([a-zA-Z\d]*)(?:\/|_.*?(?:\/|$))/gmi
/
([a-zA-Z\d]*) capture group to match letters and digits
(?:\/|_.*?(?:\/|$)) non capture group to match '/' or '_' and everything till another '/' or end of the line is found
/gmi
You can test this regex here: https://regex101.com/r/B5Bo74/1
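One thing to watch with the pattern above: each match has to end in "/" or in an underscore part, so a final segment without an underscore (for example "products" in "home/products") is not returned. The following is a sketch of one possible variant, not the answer's own pattern, that makes the underscore part optional and accepts the end of the string as a terminator. The name firstWordsOfSegments is hypothetical, and the sketch assumes segments consist only of letters, digits and underscores and do not start with an underscore.

```javascript
// Hypothetical variant: the underscore part is optional and "$" may end the last segment.
function firstWordsOfSegments(url) {
  // ([a-zA-Z\d]+)  : capture the leading letters/digits of a segment
  // (?:_[^\/]*)?   : optionally skip "_" and everything up to the next "/"
  // (?:\/|$)       : the segment ends at "/" or at the end of the string
  const pattern = /([a-zA-Z\d]+)(?:_[^\/]*)?(?:\/|$)/g;
  return Array.from(url.matchAll(pattern), m => m[1]);
}

console.log(firstWordsOfSegments("home/products"));
// [ 'home', 'products' ]
console.log(firstWordsOfSegments("home/products/product_name_1/details/some_options"));
// [ 'home', 'products', 'product', 'details', 'some' ]
```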

You can use:
\b[^\W_]+
\b A word boundary to prevent a partial match
[^\W_]+ Match 1+ word characters except for _
See a regex demo.
const s = "home/products/product_name_1/details/some_options";
const regex = /\b[^\W_]+/g;
console.log(s.match(regex));
If there has to be a leading / or the start of the string before the match, you can use an alternation (?:^|\/) and use a capture group for the values that you want to keep:
const s = "home/products/product_name_1/details/some_options";
const regex = /(?:^|\/)([^\W_]+)/g;
console.log(Array.from(s.matchAll(regex), m => m[1]));

Given input:
string "home/products/product_name_1/details/some_options"
Expected output:
array ["home", "products", "product", "details", "some"]
Note: ignore/exclude name, 1, options (because word occurs after 1st underscore).
Task:
split URI by slash into a set of path-segments (words)
(if the path-segment or word contains underscores) remove the part after first underscore
Regex to match
With a regex \/|_\w+ you could match the URL-path separator (slash) and excluded word-part (every word after an underscore).
Then use this regex
either as a separator to split the string into its parts (excluding the regex matches): e.g. in JS split(/\/|_\w+/)
or as a search pattern in replace to prepare a string that can be easily split: e.g. in JS replaceAll(/\/|_\w+/g, ',') to obtain a CSV row, which can then be split by comma with split(',')
Beware: The regular-expression itself (flavor) and functions to apply it depend on your environment/regex-engine and script-/programming-language.
Regex applied in Javascript
split by regex
For example in Javascript use url.split(/\/|_\w*/) where:
/pattern/: everything inside the slashes is the regex-pattern
\/: an escaped slash (the URL path separator)
|: the alternation, interpreted as boolean OR
_\w*: an underscore followed by zero or more (*) word characters (\w, i.e. a letter, a digit or an underscore)
See also:
Use of capture groups in String.split()
However, this also returns empty strings (the empty remainders left where a removed underscore part is directly followed by a slash or by the end of the string). We can remove them with a filter whose predicate s => s is truthy only for non-empty strings.
Demo to solve your task:
const url = "home/products/product_name_1/details/some_options";
let firstWordsInSegments = url.split(/\/|_\w*/).filter(s => s);
console.log(firstWordsInSegments);
const urlDuplicate = "home/products/product_name_1/details/some_options/_/home";
console.log(urlDuplicate.split(/\/|_\w*/).filter(s => s)); // contains duplicates in output array
replace into CSV, then split and exclude (map,replace,filter)
The CSV containing path-segments can be split by comma and resulting parts (path-segments) can be filtered or replaced to exclude unwanted sub-parts.
using:
replaceAll to transform to CSV or to strip the unwanted parts. Note: the global flag is required when calling replaceAll with a regex
map to remove unwanted parts after underscore
filter(s => s) to filter out empty strings
const url = "home/products/product_name_1/details/some_options";
// step by step
let pathSegments = url.split('/');
console.log('pathSegments:', pathSegments);
let firstWordsInSegments = pathSegments.map(s => s.replaceAll(/_\w*/g,''));
console.log(firstWordsInSegments);
// replace to obtain CSV and then split
let csv = "home/products/product_name_1/details/some_options/_/home".replaceAll(/\/|_\w+/g, ',');
console.log('csv:', csv);
let parts = csv.split(',');
console.log('parts:', parts); // contains empty parts
let nonEmptyParts = parts.filter(s => s);
console.log('nonEmptyParts:', nonEmptyParts); // filtered out empty parts
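The answer uses both _\w+ and _\w* for the underscore part. They behave the same on the sample URL, but they differ on a segment that consists of a lone underscore, as in the "/_/" of the second demo string: _\w+ needs at least one word character after the underscore, so a lone "_" is left in place. A small sketch to compare the two (the function name splitWith and the shortened test URL are made up for illustration):

```javascript
// Hypothetical helper: split on "/" or on the given underscore pattern, dropping empty parts.
function splitWith(url, underscorePart) {
  const separator = new RegExp("\\/|" + underscorePart);
  return url.split(separator).filter(s => s);
}

const sampleUrl = "home/some_options/_/home";
console.log(splitWith(sampleUrl, "_\\w+")); // [ 'home', 'some', '_', 'home' ]
console.log(splitWith(sampleUrl, "_\\w*")); // [ 'home', 'some', 'home' ]
```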
Bonus Tip
Try your regex online (e.g. regex101 or regexplanet). See the demo on regex101.

You could split the url with this regex
(_\w*)+|(\/)
This matches the /, _name_1 and _options.
But depending on what you are trying to do and which language you use, there are much better ways to do this.
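In JavaScript specifically, String.prototype.split adds the contents of capture groups to the result array, so splitting with the pattern exactly as written would also return the separators (and undefined for the group that did not take part in a match). A minimal sketch with non-capturing groups instead, assuming the goal is the array from the question; the function name splitUrl is hypothetical:

```javascript
// Hypothetical helper: same separators as above, but with non-capturing groups
// so that split() does not insert the matched separators into the result.
function splitUrl(url) {
  return url.split(/(?:_\w*)+|\//).filter(Boolean); // filter(Boolean) drops empty strings
}

console.log(splitUrl("home/products/product_name_1/details/some_options"));
// [ 'home', 'products', 'product', 'details', 'some' ]
```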

You can try a pattern like \/([^\/_]+){1,} (assuming that the path starts with '/' and the components are separated by '/'); depending on language you might get an array or iterator that will give the components.
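In JavaScript this could look like the sketch below. Because the URL in the question has no leading "/", the sketch prepends one before matching; that step and the function name pathWords are assumptions added for illustration. The {1,} quantifier around the group is dropped, since [^\/_]+ already matches one or more characters.

```javascript
// Hypothetical helper: capture the text between each "/" and the next "/" or "_".
function pathWords(url) {
  const path = url.startsWith("/") ? url : "/" + url; // the pattern expects a leading "/"
  return Array.from(path.matchAll(/\/([^\/_]+)/g), m => m[1]);
}

console.log(pathWords("home/products/product_name_1/details/some_options"));
// [ 'home', 'products', 'product', 'details', 'some' ]
```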

Try ^[[:alpha:]]+|(?<=\/)[[:alpha:]]+, or ^[a-zA-Z]+|(?<=\/)[a-zA-Z]+ if [[:alpha:]] is not supported. It matches one or more letters at the beginning of the string or after a slash, up to the first non-letter character.
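JavaScript regular expressions do not support POSIX classes such as [[:alpha:]], so there the second form applies. With the u flag, the Unicode property escape \p{L} (any letter) is a close counterpart to [[:alpha:]]. A sketch, assuming a runtime with lookbehind and property-escape support (ES2018); leadingLetters is a hypothetical name:

```javascript
// Hypothetical helper: letters at the start of the string or directly after a "/".
function leadingLetters(url) {
  // \p{L} needs the u flag; (?<=\/) requires a "/" immediately before the match
  return url.match(/^\p{L}+|(?<=\/)\p{L}+/gu) ?? [];
}

console.log(leadingLetters("home/products/product_name_1/details/some_options"));
// [ 'home', 'products', 'product', 'details', 'some' ]
```

Note that, like the pattern it is based on, this returns nothing for a segment that starts with a digit.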

Related

Regex specific number inside quote

I am new to regex and have this cdn url that returns text and I want to use javascript to match and extract the version number. I can match the latestVersion but I am not sure how to get the value inside of it.
Example text:
...oldVersion:"1.2.0",stagingVersion:"1.2.1",latestVersion:"1.3.0",authVersion:"2.2.2"...
I tried the following to match latestVersion:"1.3.0", but without success:
const regex = /\blatestVersion:"*"\b/
stringIneed = text.match(regex)
And I only need 1.3.0 not including the string latestVersion:
There are many ways of doing it. This is one:
const text='...oldVersion:"1.2.0",stagingVersion:"1.2.1",latestVersion:"1.3.0",authVersion:"2.2.2"...';
console.log(text.match(/latestVersion:"(.*?)"/)?.[1])
The .*? is a "non-greedy" wildcard that will match as few as possible characters in order to make the whole regexp match. For this reason it will stop matching before the ".
Try adding a capture group () to match certain strings in the Regex.
/\blatestVersion:"([0-9.]+)"/
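A short sketch of that capture-group pattern in use. In the attempt from the question, "* means "zero or more quote characters" rather than "anything between quotes", so the version itself never becomes part of the match; here the group ([0-9.]+) captures the digits and dots between the quotes. The function name latestVersion is hypothetical.

```javascript
// Hypothetical helper: return the value of latestVersion, or null if it is absent.
function latestVersion(text) {
  const match = text.match(/\blatestVersion:"([0-9.]+)"/);
  return match ? match[1] : null; // match[0] is the full match, match[1] the group
}

const text = '...oldVersion:"1.2.0",stagingVersion:"1.2.1",latestVersion:"1.3.0",authVersion:"2.2.2"...';
console.log(latestVersion(text)); // 1.3.0
```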
You could use a lookbehind, or a capturing group like this:
const str = '...oldVersion:"1.2.0",stagingVersion:"1.2.1",latestVersion:"1.3.0",authVersion:"2.2.2"...'
console.log(
str.match(/(?<=latestVersion:")[^"]+/)?.[0]
)
console.log(
str.match(/latestVersion:"([^"]+)"/)?.[1]
)

JS split string on positive lookahead, avoid overlapping cases

I have a set of data that includes dated notes, all concatenated, as in the example below. Assume the date always comes at the beginning of its note. I'd like to split these into individual notes. I've used a positive lookahead so I can keep the delimiter (the date).
Here's what I'm doing:
const notes = "[3/28- A note. 3/25- Another note. 3/24- More text. 10/19- further notes. [10/18- Some more text.]"
const pattern = /(?=\d{1,2}\/\d{1,2}[- ]+)/g
console.log(notes.split(pattern))
and the result is
[ '[',
'3/28- A note. ',
'3/25- Another note. ',
'3/24- More text. ',
'1',
'0/19- further notes. [',
'1',
'0/18- Some more text.]' ]
The pattern \d{1,2} matches both 10/19 and 0/19 so it splits before both of those.
Instead I'd like to have
[ '[',
'3/28- A note. ',
'3/25- Another note. ',
'3/24- More text. ',
'10/19- further notes. [',
'10/18- Some more text.]' ]
(I can handle the extraneous brackets later.)
How can I accomplish this split with regex or any other technique?
To get your wanted output, you can prepend a word boundary in the lookahead, and you can omit the plus sign at the end of the pattern.
(?=\b\d{1,2}\/\d{1,2}[- ])
Regex demo
const notes = "[3/28- A note. 3/25- Another note. 3/24- More text. 10/19- further notes. [10/18- Some more text.]"
const pattern = /(?=\b\d{1,2}\/\d{1,2}[- ])/g
console.log(notes.split(pattern))
I would avoid split() here and instead use match():
var notes = "[3/28- A note. 3/25- Another note. 3/24- More text. 10/19- further notes. [10/18- Some more text.]";
var matches = notes.match(/\[?\d+\/\d+\s*-\s*.*?\.\]?/g);
console.log(matches);
You may do a further cleanup of leading/trailing brackets using regex, e.g.
var input = "[10/18- Some more text.]";
var output = input.replace(/^\[|\]$/g, ""); // g flag so that both the leading [ and the trailing ] are removed
Try .replaceAll() and this regex:
/(\[?\d{1,2}\/\d{1,2}\-.+?)/
// Replacement
"\n$1"
Figure I - Regex
Segment      Description
(\[?         Begin capture group - match literal "[" zero or one time
\d{1,2}\/    match a digit one or two times and a literal "/"
\d{1,2}\-    match a digit one or two times and a literal "-"
.+?)         match anything one to any number of times "lazily" - end capture group
Figure II - Replacement
Segment      Description
\n           New line
$1           Everything matched in the capture group (...)
const notes = "[3/28- A note. 3/25- Another note 3/24- More text. 10/19- further notes [10/18- Some more text.]";
const rgx = new RegExp(/(\[?\d{1,2}\/\d{1,2}\-.+?)/, 'g');
let result = notes.replaceAll(rgx, "\n$1");
console.log(result);
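The snippet above produces one string with a line break before each note. To get the array the question asks for, that string can be split on the inserted line breaks. A sketch, assuming the notes themselves contain no line breaks; splitNotes is a hypothetical name, and replace with a global regex is used here, which behaves the same as replaceAll:

```javascript
// Hypothetical helper: mark the start of each note with "\n", then split on it.
function splitNotes(notes) {
  const rgx = /(\[?\d{1,2}\/\d{1,2}-.+?)/g;
  return notes
    .replace(rgx, "\n$1")   // insert a line break before every date
    .split("\n")
    .filter(Boolean);       // drop the empty string before the first note
}

const notes = "[3/28- A note. 3/25- Another note. 3/24- More text. 10/19- further notes. [10/18- Some more text.]";
console.log(splitNotes(notes));
```

Because each match consumes the whole date, "10/19" is found once and is not split again at "0/19".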

How can I include the delimiter with regex String.split()?

I need to parse the tokens from a GS1 UDI format string:
"(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"
I would like to split that string with a regex on the "(nnn)" and have the delimiter included with the split values, like this:
[ "(20)987111", "(240)A", "(10)ABC123", "(17)2022-04-01", "(21)888888888888888" ]
Below is a JSFiddle with examples, but in case you want to see it right here:
// This includes the delimiter match in the results, but I want the delimiter included WITH the value
// after it, e.g.: ["(20)987111", ...]
str = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888";
console.log(str.split(/(\(\d{2,}\))/).filter(Boolean))
// Result: ["(20)", "987111", "(240)", "A", "(10)", "ABC123", "(17)", "2022-04-01", "(21)", "888888888888888"]
// If I include a pattern that should (I think) match the content following the delimiter I will
// only get a single result that is the full string:
str = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888";
console.log(str.split(/(\(\d{2,}\)\W+)/).filter(Boolean))
// Result: ["(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"]
// I think this is because I'm effectively matching the entire string, hence a single result.
// So now I'll try to match only up to the start of the next "(":
str = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888";
console.log(str.split(/(\(\d{2,}\)(^\())/).filter(Boolean))
// Result: ["(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"]
I've found and read this question, however the examples there are matching literals and I'm using character classes and getting different results.
I'm failing to create a regex pattern that will provide what I'm after. Here's a JSFiddle of some of the things I've tried: https://jsfiddle.net/6bogpqLy/
I can't guarantee the order of the "application identifiers" in the input string and as such, match with named captures isn't an attractive option.
You can split on positions where parenthesised element follows, by using a zero-length lookahead assertion:
const text = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"
const parts = text.split(/(?=\(\d+\))/)
console.log(parts)
Instead of split, use match to create the array. The pattern finds 1) digits in parentheses, followed by one or more characters that may be digits, uppercase letters or hyphens, and then 2) groups that whole expression.
(PS. I often find a site like Regex101 really helps when it comes to testing out expressions outside of a development environment.)
const re = /(\(\d+\)[\d\-A-Z]+)/g;
const str = '(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888';
console.log(str.match(re));
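If the application identifier and its value are needed separately, the same idea extends to capture groups with matchAll, which does not depend on the order of the identifiers. This is only a sketch: parseTokens and the property names ai and value are hypothetical, and it assumes that values never contain "(".

```javascript
// Hypothetical helper: one { ai, value } object per "(nnn)value" token, in input order.
function parseTokens(input) {
  // \((\d{2,})\) : two or more digits in parentheses (group 1)
  // ([^(]+)      : everything up to the next "(" (group 2)
  const pattern = /\((\d{2,})\)([^(]+)/g;
  return Array.from(input.matchAll(pattern), m => ({ ai: m[1], value: m[2] }));
}

console.log(parseTokens("(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"));
```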

Regex: Character set with special characters

Is it possible for a regex character set ([abc]) to have characters with special meanings (like $ for end of the line)? For example, is it possible to make a characters set that matches either a / or the start of the line ^?
I tried /[^\/]/g but it just checks for a literal ^ or a /.
Note: I'm using JavaScript.
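A character set cannot hold an anchor: within [...], ^ is either the negation marker (when it comes first) or a literal caret. The usual way to match "either a / or the start of the line" is an alternation outside the set, such as (?:^|\/). A minimal sketch; segmentsAfterSlashOrStart is a hypothetical name:

```javascript
// Hypothetical helper: capture what follows either the start of the string or a "/".
function segmentsAfterSlashOrStart(path) {
  return Array.from(path.matchAll(/(?:^|\/)([^\/]+)/g), m => m[1]);
}

console.log(segmentsAfterSlashOrStart("a/b/c")); // [ 'a', 'b', 'c' ]
console.log(/[^\/]/.test("/"));  // false: [^\/] means "any character except /"
console.log(/[\^\/]/.test("^")); // true: an escaped ^ inside a set is a literal caret
```

Add the m flag if ^ should also match at the start of every line of a multi-line string.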
No, you can't give characters inside a set a special meaning, but you can be clever about the way you use the regex.
When doing complicated regex I test if any of these special characters exist in the string I'm testing -
const str = '¡¢my string <>some content <>between</> fragment</> and other data...'
const markers = ['\u00A1','\u00A2','\u00A4']; // ['¡','¢','¤'] you can add others
const safeMarker = markers.find(marker => str.indexOf(marker) === -1) // ¤ - this is not in the string so I can use it as a marker
if (safeMarker) {
const replaced = str.replace(/<\/>/g, safeMarker); // replaced = '¡¢my string <>some content <>between¤ fragment¤ and other data...'
// do something with this regex and so on...
// then replace the string back
}
The beauty of this is that you can convert any combination of characters into a single marker, which means you can use it in your Negation expression like this:
/<>[^\u00A4]*\u00A4/g
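This expresses what the following pattern is meant to say but cannot, because [^</>] excludes the single characters <, / and > rather than the sequence </> (i.e. get the content between the tags):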
<>[^</>]*</>

How to match bold markdown if it isn't preceded with a backslash?

I'm looking to match bolded markdown. Here are some examples:
qwer *asdf* zxcv matches *asdf*
qwer*asdf*zxcv matches *asdf*
qwer \*asdf* zxcv does not match
*qwer* asdf zxcv matches *qwer*
A negative look behind like this (?<!\\)\*(.*)\* works.
Except there is no browser support in Firefox, so I cannot use it.
Similarly, I can get very close with (^|[^\\])\*(.*)\*
The issue is that there are two capture groups, and I need the index of the second capture group, and Javascript only returns the index of the first capture group. I can bandaid it in this case by just adding 1, but in other cases this hack will not work.
My reasoning for doing this is that I'm trying to replace a small subset of Markdown with React components. As an example, I'm trying to convert this string:
qwer *asdf* zxcv *123*
Into this array:
[ "qwer ", <strong>asdf</strong>, " zxcv ", <strong>123</strong> ]
Where the second and fourth elements are created via JSX and included as array elements.
You will also need to take into account that when a backslash occurs before an asterisk, it may be one that is itself escaped by a backslash, and in that case the asterisk should be considered the start of bold markup. Except if that one is also preceded by a backslash,...etc.
So I would suggest this regular expression:
((?:^|[^\\])(?:\\.)*)\*((\\.|[^*])*)\*
If the purpose is to replace these with tags, like <strong> ... </strong>, then just use JavaScript's replace as follows:
let s = String.raw`now *this is bold*, and \\*this too\\*, but \\\*this\* not`;
console.log(s);
let regex = /((?:^|[^\\])(?:\\.)*)\*((\\.|[^*])*)\*/g;
let res = s.replace(regex, "$1<strong>$2</strong>");
console.log(res);
If the bolded words should be converted to a React component and stored in an array with the other pieces of plain text, then you could use split and map:
let s = String.raw`now *this is bold*, and \\*this too\\*, but \\\*this\* not`;
console.log(s);
let regex = /((?:^|[^\\])(?:\\.)*)\*((?:\\.|[^*])*)\*/g;
let res = s.split(regex).map((s, i) =>
i%3 === 2 ? React.createElement("strong", {}, s) : s
);
Since there are two capture groups in the "delimiter" for the split call, one having the preceding character(s) and the second the word itself, every third item in the split result is a word to be bolded, hence the i%3 expression.
This should do the trick:
/(?:^|[^\\])(\*[^*]+[^\\]\*)/
The only capturing group there is the string surrounded by *'s.
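On the index problem raised in the question: when the first group holds only the single character before the asterisk, the start of the bold markup can be derived as the match index plus the length of that group, which generalizes the "add 1" workaround (the group is empty when the match is at the start of the string). The sketch below uses the question's (^|[^\\]) prefix, but with a lazy (.*?) instead of (.*) so that several bold spans on one line are found separately; findBold is a hypothetical name.

```javascript
// Hypothetical helper: locate unescaped *bold* spans and the index of their opening "*".
function findBold(text) {
  const regex = /(^|[^\\])\*(.*?)\*/g;
  const found = [];
  for (const m of text.matchAll(regex)) {
    // m.index is where group 1 starts; skipping its length lands on the opening "*"
    const start = m.index + m[1].length;
    found.push({ start, text: m[2] });
  }
  return found;
}

console.log(findBold("qwer *asdf* zxcv *123*"));
// [ { start: 5, text: 'asdf' }, { start: 17, text: '123' } ]
```

Runtimes that support the d flag (ES2022) also expose the start and end of every group directly through the indices property of the match object.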
