JS split string on positive lookahead, avoid overlapping cases - javascript

I have a set of data that includes dated notes, all concatenated, as in the example below. Assume the date always comes at the beginning of its note. I'd like to split these into individual notes. I've used a positive lookahead so I can keep the delimiter (the date).
Here's what I'm doing:
const notes = "[3/28- A note. 3/25- Another note. 3/24- More text. 10/19- further notes. [10/18- Some more text.]"
const pattern = /(?=\d{1,2}\/\d{1,2}[- ]+)/g
console.log(notes.split(pattern))
and the result is
[ '[',
'3/28- A note. ',
'3/25- Another note. ',
'3/24- More text. ',
'1',
'0/19- further notes. [',
'1',
'0/18- Some more text.]' ]
The pattern \d{1,2} matches both 10/19 and 0/19 so it splits before both of those.
Instead I'd like to have
[ '[',
'3/28- A note. ',
'3/25- Another note. ',
'3/24- More text. ',
'10/19- further notes. [',
'10/18- Some more text.]' ]
(I can handle the extraneous brackets later.)
How can I accomplish this split with regex or any other technique?

To get your wanted output, you can prepend a word boundary in the lookahead, and you can omit the plus sign at the end of the pattern.
(?=\b\d{1,2}\/\d{1,2}[- ])
Regex demo
const notes = "[3/28- A note. 3/25- Another note. 3/24- More text. 10/19- further notes. [10/18- Some more text.]"
const pattern = /(?=\b\d{1,2}\/\d{1,2}[- ])/g
console.log(notes.split(pattern))
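As a possible follow-up to the split above, the stray brackets mentioned in the question could be removed in the same pass. This is only a sketch: the names notesText, datePattern and cleanedNotes are made up for illustration, and it assumes brackets only ever appear at the edges of a piece.

```javascript
// Same sample data as in the question.
const notesText = "[3/28- A note. 3/25- Another note. 3/24- More text. 10/19- further notes. [10/18- Some more text.]";

// \b keeps the lookahead from matching in the middle of "10/19".
const datePattern = /(?=\b\d{1,2}\/\d{1,2}[- ])/g;

const cleanedNotes = notesText
  .split(datePattern)
  // Strip brackets and whitespace at either end of each piece (assumption: brackets are never part of a note).
  .map(piece => piece.replace(/^[\s\[]+|[\s\[\]]+$/g, ""))
  // Drop pieces that consisted only of brackets, such as the leading "[".
  .filter(piece => piece.length > 0);

console.log(cleanedNotes);
```

The filter step is there because the very first piece is just "[", which becomes an empty string once the brackets are stripped.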

I would avoid split() here and instead use match():
var notes = "[3/28- A note. 3/25- Another note. 3/24- More text. 10/19- further notes. [10/18- Some more text.]";
var matches = notes.match(/\[?\d+\/\d+\s*-\s*.*?\.\]?/g);
console.log(matches);
You may do a further cleanup of leading/trailing brackets using regex, e.g.
var input = "[10/18- Some more text.]";
var output = input.replace(/^\[|\]$/g, ""); // the g flag is needed so that both the leading and the trailing bracket are removed

Try .replaceAll() and this regex:
/(\[?\d{1,2}\/\d{1,2}\-.+?)/
// Replacement
"\n$1"
Figure I - Regex
Segment        Description
(\[?           Begin capture group - match literal "[" zero or one time
\d{1,2}\/      match a digit one or two times and a literal "/"
\d{1,2}\-      match a digit one or two times and a literal "-"
.+?)           match anything one to any number of times "lazily" - end capture group
Figure II - Replacement
Segment        Description
\n             New line
$1             Everything matched in the capture group (...)
const notes = "[3/28- A note. 3/25- Another note 3/24- More text. 10/19- further notes [10/18- Some more text.]";
const rgx = new RegExp(/(\[?\d{1,2}\/\d{1,2}\-.+?)/, 'g');
let result = notes.replaceAll(rgx, "\n$1");
console.log(result);

Related

How to split a string by one delimiter but having a particular format as described below

I have a string as:
const str = 'My [Link format](https://google.com) demo'
I want the word array to be like:
['My', '[Link format](https://google.com)', 'demo']
What to do in javascript?
I was trying using split() and str.match(). Nothing worked yet.
This is a simple split on a space as a delimiter, but we use negative lookaheads to skip spaces that sit inside a pair of square brackets [] or round brackets ()
const str = 'My [Link format](https://google.com) demo'
console.log(str.split(/\s+(?![^\[]*\])(?![^\(]*\))/));
We also allow for spaces in the URL portion; they are unlikely there, but they could still happen
Try it here: https://jsfiddle.net/m4q6e9x7/
["My", "[Link format](https://google.com)", "demo"]
In the fiddle I've tried to show the two separate negative lookaheads, one for each type of bracket (I've put a space inside the round brackets to prove the concept):
const str = 'My [Link format](http s://google.com) demo'
ignore space between []
console.log(str.split(/\s+(?![^\[]*\])/));
["My", "[Link format](http", "s://google.com)", "demo"]
ignore space between ()
console.log(str.split(/\s+(?![^\(]*\))/));
["My", "[Link", "format](http s://google.com)", "demo"]
So we can easily combine the two criteria because we need both of them to not match.
Because [] and () need to be escaped, it might be easier to see the regex if we modify and test for spaces between braces {}
const str = 'My {Link format}(https://google.com) demo'
console.log(str.split(/\s+(?![^{]*})/));
["My", "{Link format}(https://google.com)", "demo"]
Both solutions assume that the string is well formed (basically: no space between ']' and '(', no ']' characters inside [...], and similar expectations). You didn't really provide information about what the input string can be other than your concrete example, so the solutions work well in this and very similar cases. The second is very easily modified as needed; the first is easily extended to check whether the string is in fact malformed.
Solution using Regular Expressions
Below code finds everything before first '[', everything in '[...](...)' pattern (note: first ... must not contain ']', and second – ')', but I assume this would make for an incorrect input in the first place), and everything after that.
So
let regex = /(.*)(\[.*\]\(.*\))(.*)/
let res = str.match(regex).splice(1,3)
gives res as
['My ', '[Link format](https://google.com)', ' demo']
From there, you can trim every entry in this array ('My ' => 'My') for example using a trim function like so:
res.map((val) => val.trim());
Look here for explanation of what the array obtained from .match() method represents, but generally except index 0 it contains capture groups, meaning the parts of string corresponding to parts of regex surrounded by parentheses.
If you are not familiar with regular expressions (regexes) in JS, or at all, you will easily find many online resources on the topic. After grasping the basics, regex101 is a nice tool to experiment with regexes and explore their capabilities. When using it, you should probably choose the ECMAScript (JavaScript) flavor from the menu on the left.
Equivalent solution without regex
An equivalent solution is to find the first '[' manually, as well as where the '[...](...)' pattern ends. Then slice the parts (before '[', the pattern, and after the pattern) out of the string, and probably trim them. So just loop over the characters of the string in search of '[' and then ']', '(', ')'. Note that in this case you can easily decide, at a fine-grained level, what to do if the string has an unexpected or incorrect form.
TODO: I will probably sketch some code when I have time for it
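One way the manual approach described above might look is sketched here. It is not the answerer's code: the helper name splitAroundLink is hypothetical, it uses indexOf instead of an explicit character loop, and it assumes a single well-formed '[...](...)' link with no ']' inside the label and no ')' inside the URL.

```javascript
// Hypothetical helper: splits "text [label](url) text" into three trimmed parts.
// Returns null when the expected "[...](...)" shape is not found.
function splitAroundLink(str) {
  const open = str.indexOf("[");        // start of the link label
  if (open === -1) return null;
  const mid = str.indexOf("](", open);  // end of the label, start of the URL
  if (mid === -1) return null;
  const close = str.indexOf(")", mid);  // end of the URL
  if (close === -1) return null;
  return [
    str.slice(0, open),          // everything before the link
    str.slice(open, close + 1),  // the link itself
    str.slice(close + 1),        // everything after the link
  ].map(part => part.trim());
}

const linkParts = splitAroundLink("My [Link format](https://google.com) demo");
console.log(linkParts);
```

Each early return is a place where malformed input could instead be reported or handled differently, which is the flexibility the regex version lacks.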
Regex is your friend!
const regexMdLinks = /!?\[([^\]]*)\]\(([^\)]+)\)/gm
// Example md file contents
const str = `My [Link format](https://google.com) demo My [Link format2](https://google.com/2) demo2`
let regex_splitted = str.split(regexMdLinks);
let arr = [];
//1. Item will be the text (or empty text)
//2. Item is the link text
//3. Item is the url
for(let i = 0; i < regex_splitted.length; i++){
if(i % 3 == 0){ //Split normal text
arr.push(...regex_splitted[i].split(" ").filter(i => i));
} else if(i % 3 == 1){//Add brackets around link text
arr.push("["+regex_splitted[i]+"]");
} else {
arr.push("("+regex_splitted[i]+")");
}
}
console.log(arr)

How can I include the delimiter with regex String.split()?

I need to parse the tokens from a GS1 UDI format string:
"(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"
I would like to split that string with a regex on the "(nnn)" and have the delimiter included with the split values, like this:
[ "(20)987111", "(240)A", "(10)ABC123", "(17)2022-04-01", "(21)888888888888888" ]
Below is a JSFiddle with examples, but in case you want to see it right here:
// This includes the delimiter match in the results, but I want the delimiter included WITH the value
// after it, e.g.: ["(20)987111", ...]
str = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888";
console.log(str.split(/(\(\d{2,}\))/).filter(Boolean))
// Result: ["(20)", "987111", "(240)", "A", "(10)", "ABC123", "(17)", "2022-04-01", "(21)", "888888888888888"]
// If I include a pattern that should (I think) match the content following the delimiter I will
// only get a single result that is the full string:
str = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888";
console.log(str.split(/(\(\d{2,}\)\W+)/).filter(Boolean))
// Result: ["(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"]
// I think this is because I'm effectively matching the entire string, hence a single result.
// So now I'll try to match only up to the start of the next "(":
str = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888";
console.log(str.split(/(\(\d{2,}\)(^\())/).filter(Boolean))
// Result: ["(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"]
I've found and read this question, however the examples there are matching literals and I'm using character classes and getting different results.
I'm failing to create a regex pattern that will provide what I'm after. Here's a JSFiddle of some of the things I've tried: https://jsfiddle.net/6bogpqLy/
I can't guarantee the order of the "application identifiers" in the input string and as such, match with named captures isn't an attractive option.
You can split at the positions where a parenthesised element follows, by using a zero-length lookahead assertion:
const text = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"
const parts = text.split(/(?=\(\d+\))/)
console.log(parts)
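If the identifier and its value are needed separately afterwards, each part from the lookahead split could be broken up further. This is a rough sketch: the variable names and the { ai, value } object shape are assumptions made for illustration, not part of any GS1 library.

```javascript
const udi = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888";

// Split before every "(digits)" group, then separate the identifier from its value.
const udiTokens = udi.split(/(?=\(\d+\))/).map(token => {
  const end = token.indexOf(")");      // closing parenthesis of the identifier
  return {
    ai: token.slice(1, end),           // digits between "(" and ")"
    value: token.slice(end + 1),       // everything after ")"
  };
});

console.log(udiTokens);
```

Because the split does not depend on the order of the identifiers, this works for any ordering of the "(nnn)" groups in the input.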
Instead of split, use match to create the array. The pattern matches 1) digits in parentheses, followed by 2) a run of digits, uppercase letters, or hyphens, and the whole token is wrapped in one group.
(PS. I often find a site like Regex101 really helps when it comes to testing out expressions outside of a development environment.)
const re = /(\(\d+\)[\d\-A-Z]+)/g;
const str = '(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888';
console.log(str.match(re));

Regexp to explode url

I have a string url like "home/products/product_name_1/details/some_options"
And I want to parse it with a regexp into the array ["home", "products","product","details","some"]
So the rule is "split into words at each slash, but if a word has underscores, take only the part that comes before the first underscore"
JavaScript equivalent for this regex is
str.split("/").map(item => item.indexOf("_") > -1 ? item.split("_")[0] : item)
Please help!
you can use this pattern
(?<!\w)[^/_]+
results
['home', 'products', 'product', 'details', 'some']
python code
import re
str="home/products/product_name_1/details/some_options"
re.findall(r'(?<!\w)[^/_]+', str)  # raw string, so \w is passed to the regex engine unchanged
['home', 'products', 'product', 'details', 'some']
Try this:
const input = ["home/products/product_name_1/details/some_options",
"company/products/cars_all/details/black_color",
"public/places/1_cities/disctricts/1234_something"]
let pattern = /([a-zA-Z\d]*)(?:\/|_.*?(?:\/|$))/gmi
input.forEach(el => {
let matches = el.matchAll(pattern)
for (const match of matches) {
console.log(match[1]);
}
})
Remove \d from the regex pattern if you don't want digits in the URL.
I have used matchAll here. matchAll returns an iterator; use it to get each match object, in which the first element is the full match and the second element (index 1) is the required group.
/([a-zA-Z\d]*)(?:\/|_.*?(?:\/|$))/gmi
/
([a-zA-Z\d]*) capture group to match letters and digits
(?:\/|_.*?(?:\/|$)) non capture group to match '/' or '_' and everything till another '/' or end of the line is found
/gmi
You can test this regex here: https://regex101.com/r/B5Bo74/1
You can use:
\b[^\W_]+
\b A word boundary to prevent a partial match
[^\W_]+ Match 1+ word characters except for _
See a regex demo.
const s = "home/products/product_name_1/details/some_options";
const regex = /\b[^\W_]+/g;
console.log(s.match(regex));
If there has to be a leading / or the start of the string before the match, you can use an alternation (?:^|\/) and use a capture group for the values that you want to keep:
const s = "home/products/product_name_1/details/some_options";
const regex = /(?:^|\/)([^\W_]+)/g;
console.log(Array.from(s.matchAll(regex), m => m[1]));
Given input:
string "home/products/product_name_1/details/some_options"
Expected output:
array ["home", "products", "product", "details", "some"]
Note: ignore/exclude name, 1, options (because word occurs after 1st underscore).
Task:
split URI by slash into a set of path-segments (words)
(if the path-segment or word contains underscores) remove the part after first underscore
Regex to match
With a regex \/|_\w+ you could match the URL-path separator (slash) and excluded word-part (every word after an underscore).
Then use this regex
either as separator to split the string into its parts(excluding the regex matches): e.g. in JS split(/\/|_\w+/)
or as search-pattern in replace to prepare a string that can be easily split: e.g. in JS replaceAll(/\/|_\w+/g, ',') to obtain a CSV row which can be easily split by comma with split(',')
Beware: The regular-expression itself (flavor) and functions to apply it depend on your environment/regex-engine and script-/programming-language.
Regex applied in Javascript
split by regex
For example in Javascript use url.split(/\/|_\w*/) where:
/pattern/: everything inside the slashes is the regex-pattern
\/: an escaped slash (URL-path-separator)
|: the alternate junction, interpreted as boolean OR
_\w*: zero or more (*) word-characters (w, i.e. letter from alphabet, numeric digit or underscore) following an underscore
See also:
Use of capture groups in String.split()
However, this returns also empty strings (as empty split-off second parts inside underscore-containing path-segments). We can remove the empty strings with a filter where predicate s => s returns true if the string is non-empty.
Demo to solve your task:
const url = "home/products/product_name_1/details/some_options";
let firstWordsInSegments = url.split(/\/|_\w*/).filter(s => s);
console.log(firstWordsInSegments);
const urlDuplicate = "home/products/product_name_1/details/some_options/_/home";
console.log(urlDuplicate.split(/\/|_\w*/).filter(s => s)); // contains duplicates in output array
replace into CSV, then split and exclude (map,replace,filter)
The CSV containing path-segments can be split by comma and resulting parts (path-segments) can be filtered or replaced to exclude unwanted sub-parts.
using:
replaceAll to transform to CSV or remove empty strings. Note: global flag required when calling replaceAll with regex
map to remove unwanted parts after underscore
filter(s => s) to filter out empty strings
const url = "home/products/product_name_1/details/some_options";
// step by step
let pathSegments = url.split('/');
console.log('pathSegments:', pathSegments);
let firstWordsInSegments = pathSegments.map(s => s.replaceAll(/_\w*/g,''));
console.log(firstWordsInSegments);
// replace to obtain CSV and then split
let csv = "home/products/product_name_1/details/some_options/_/home".replaceAll(/\/|_\w+/g, ',');
console.log('csv:', csv);
let parts = csv.split(',');
console.log('parts:', parts); // contains empty parts
let nonEmptyParts = parts.filter(s => s);
console.log('nonEmptyParts:', nonEmptyParts); // filtered out empty parts
Bonus Tip
Try your regex online (e.g. regex101 or regexplanet). See the demo on regex101.
You could split the url with this regex
(_\w*)+|(\/)
This matches the /, _name_1 and _options.
BUT depending on what you are trying to do, or which language you use, there are far better options to do this.
You can try a pattern like \/([^\/_]+){1,} (assuming that the path starts with '/' and the components are separated by '/'); depending on language you might get an array or iterator that will give the components.
Try ^[[:alpha:]]+|(?<=\/)[[:alpha:]]+ or, if [[:alpha:]] is not supported, ^[a-zA-Z]+|(?<=\/)[a-zA-Z]+. It matches one or more letters at the beginning of the string or after a slash, up to the first non-letter.

How to match bold markdown if it isn't preceded with a backslash?

I'm looking to match bolded markdown. Here are some examples:
qwer *asdf* zxcv matches *asdf*
qwer*asdf*zxcv matches *asdf*
qwer \*asdf* zxcv does not match
*qwer* asdf zxcv matches *qwer*
A negative look behind like this (?<!\\)\*(.*)\* works.
Except there is no browser support in Firefox, so I cannot use it.
Similarly, I can get very close with (^|[^\\])\*(.*)\*
The issue is that there are two capture groups, and I need the index of the second capture group, and Javascript only returns the index of the first capture group. I can bandaid it in this case by just adding 1, but in other cases this hack will not work.
My reasoning for doing this is that I'm trying to replace a small subset of Markdown with React components. As an example, I'm trying to convert this string:
qwer *asdf* zxcv *123*
Into this array:
[ "qwer ", <strong>asdf</strong>, " zxcv ", <strong>123</strong> ]
Where the second and fourth elements are created via JSX and included as array elements.
You will also need to take into account that when a backslash occurs before an asterisk, it may be one that is itself escaped by a backslash, and in that case the asterisk should be considered the start of bold markup. Except if that one is also preceded by a backslash,...etc.
So I would suggest this regular expression:
((?:^|[^\\])(?:\\.)*)\*((\\.|[^*])*)\*
If the purpose is to replace these with tags, like <strong> ... </strong>, then just use JavaScript's replace as follows:
let s = String.raw`now *this is bold*, and \\*this too\\*, but \\\*this\* not`;
console.log(s);
let regex = /((?:^|[^\\])(?:\\.)*)\*((\\.|[^*])*)\*/g;
let res = s.replace(regex, "$1<strong>$2</strong>");
console.log(res);
If the bolded words should be converted to a React component and stored in an array with the other pieces of plain text, then you could use split and map:
let s = String.raw`now *this is bold*, and \\*this too\\*, but \\\*this\* not`;
console.log(s);
let regex = /((?:^|[^\\])(?:\\.)*)\*((?:\\.|[^*])*)\*/g;
let res = s.split(regex).map((s, i) =>
i%3 === 2 ? React.createElement("strong", {}, s) : s
);
Since there are two capture groups in the "delimiter" for the split call, one having the preceding character(s) and the second the word itself, every third item in the split result is a word to be bolded, hence the i%3 expression.
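The i%3 bookkeeping can be tried out without React by substituting plain objects for the elements. This is only an illustrative sketch: the { bold: ... } shape and the variable names are invented here, and the empty strings that split leaves behind are filtered out as an extra step.

```javascript
const boldSource = String.raw`qwer *asdf* zxcv *123*`;
const boldRegex = /((?:^|[^\\])(?:\\.)*)\*((?:\\.|[^*])*)\*/g;

// split() interleaves plain text with the two capture groups of every match:
// [text, group1, group2, text, group1, group2, ...]
const boldPieces = boldSource
  .split(boldRegex)
  // Every third item (index 2, 5, ...) is the bold text; a plain object stands in for a React element.
  .map((piece, i) => (i % 3 === 2 ? { bold: piece } : piece))
  // Remove empty strings, e.g. the one after a match at the very end of the input.
  .filter(piece => piece !== "");

console.log(boldPieces);
```

Note that the character in front of each asterisk (group 1) comes back as its own array item, so a consumer that wants whole text runs would have to join neighbouring strings again.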
This should do the trick:
/(?:^|[^\\])(\*[^*]+[^\\]\*)/
The only capturing group there is the string surrounded by *'s.

Best way to manipulate and cut a string using character matching?

So in my example, we have strings that look like this:
CP1_ctl05_RCBPAThursdayStartTimePicker_0_dateInput
CP1_ctl05_RCBPAFridayStartTimePicker_3_dateInput
CP1_ctl05_RCBPAMondayStartTimePicker_1_dateInput
The task is to extract the days of the week from the string.
I already figured you can trim the first set of characters CP1_ctl05_RCBPA as they will always have the same length and will always occur in the same position. Using string.substr(15), I was able to reduce the string to FridayStartTimePicker_3_dateInput but I am not sure how to approach deleting the rest of the suffixal garbage text.
I was thinking about trimming the end by finding the first occurring y (as it will only occur in days of the week in this case) and slicing off the end up until that point, but I am not sure about how to approach slicing off a part of a string like this.
You can use regex to extract them. As every day ends with a y, and no day has a y in between, you can simply use that as delimiter
const regex = /\w{15}(\w+y).*/g;
const str = `CP1_ctl05_RCBPAThursdayStartTimePicker_0_dateInput`;
const subst = `\$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
Instead of deleting unwanted parts, you could just match what you want.
The following regex ^.{15}(\w+?y) matches any 15 characters from the beginning of the string, then matches and captures in group 1 one or more word characters (non-greedy) followed by the letter y. The non-greedy ? is mandatory; otherwise the match would extend to the last y that exists in the string.
We then just have to get the content of the first group and assign to variable day
var test = [
'CP1_ctl05_RCBPAThursdayStartTimePicker_0_dateInput', 'CP1_ctl05_RCBPAFridayStartTimePicker_3_dateInput', 'CP1_ctl05_RCBPAMondayStartTimePicker_1_dateInput'
];
console.log(test.map(function (a) {
return a.match(/^.{15}(\w+?y)/)[1]
}));
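Another option, if the fixed 15-character prefix cannot be relied on, might be to look for the day names themselves. This sketch assumes the ids contain exactly one English day name with a capital first letter; dayPattern, controlIds and days are names made up for this example.

```javascript
// Matches any English day name, wherever it occurs in the string.
const dayPattern = /(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day/;

const controlIds = [
  "CP1_ctl05_RCBPAThursdayStartTimePicker_0_dateInput",
  "CP1_ctl05_RCBPAFridayStartTimePicker_3_dateInput",
  "CP1_ctl05_RCBPAMondayStartTimePicker_1_dateInput",
];

// match() returns null when no day name is present, so guard before indexing.
const days = controlIds.map(id => {
  const found = id.match(dayPattern);
  return found ? found[0] : null;
});

console.log(days);
```

This trades the positional assumption for a vocabulary assumption, which may be more robust if the prefix length ever changes.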
