split on a multi character delimiter using regex expression


I am using this code to split text into segments:
let value = "(germany or croatia) and (denmark or hungary)"
let tokens = value.split(/(?=[()\s])|(?<=[()\s])/g).filter(segment => segment.trim() != '');
This produces the following array:
['(', 'germany', 'or', 'croatia', ')', 'and', '(', 'denmark', 'or', 'hungary', ')']
How should I rewrite regex so that it would be able to split this string:
(germany*or*croatia)*and*(denmark*or*hungary)
into
['(', 'germany', '*or*', 'croatia', ')', '*and*', '(', 'denmark', '*or*', 'hungary', ')']
The problem is that the split parameters *or* and *and* consist of multiple characters, so just using
let tokens = value.split(/(?=[()\s\*or\*\*and\*])|(?<=[()\s\*or\*\*and\*])/g).filter(segment => segment.trim() != '');
will not work.

Instead of matching the space between tokens, you will have an easier time matching the tokens themselves.
const tokenizer = /\(|\)|\*[^()*\s]+\*|[^()*\s]+/g;
const value = "(germany*or*croatia)*and*(denmark*or*hungary)";
const tokens = value.matchAll(tokenizer);
console.log(Array.from(tokens, match => match[0]));
Note that this isn't very robust, as any unexpected token is just ignored silently. It's also very generic; you might have an easier time specifically looking for the list of allowed operators, like *or* and *and*, instead of producing a token for any *something* found.
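The stricter variant suggested above could be sketched as follows. This is a minimal sketch, assuming *or* and *and* are the only allowed operators; the names OPERATORS, strictTokenizer and tokenize are made up for illustration.

```javascript
// Assumption: these are the only operators the grammar allows.
const OPERATORS = ["or", "and"];

// Build the tokenizer from the operator list: parentheses, *operator*, or a plain term.
const strictTokenizer = new RegExp(
  `\\(|\\)|\\*(?:${OPERATORS.join("|")})\\*|[^()*\\s]+`,
  "g"
);

// Return all tokens found in the input, or an empty array if there are none.
function tokenize(input) {
  return input.match(strictTokenizer) ?? [];
}

const strictTokens = tokenize("(germany*or*croatia)*and*(denmark*or*hungary)");
console.log(strictTokens);
```

Like the generic version, this still skips anything it does not recognize, so it is not a validator on its own.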
If you want to validate each token further, you can wrap each one in a capture group, and add a capture group for whitespace and for any unexpected leftovers. Keep in mind the order of | alternatives in a regex matters!
const tokenizer = /(\()|(\))|(\*[^()*\s]+\*)|([^()*\s]+)|(\s+)|(.+)/g;
const symbols = [
Symbol("open-paren"),
Symbol("close-paren"),
Symbol("*operator*"),
Symbol("term"),
Symbol("whitespace"),
Symbol("unexpected"),
];
const value = "(germany*or*croatia)*and*(denmark or hungary)***";
const matches = value.matchAll(tokenizer);
for (const match of matches) {
const str = match[0];
const groups = match.slice(1);
const tokenType = symbols[groups.findIndex(capture => capture !== undefined)];
console.log([tokenType.toString(), str]);
}
This kind of tokenizing regex was inspired by one of Douglas Crockford's books, where he uses something similar to transpile his own programming language into JS -- I forget which one.

Could you try this expression?
/(\()|(\))|(\*or\*)|(\*and\*)|[\w]*/g
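One caveat with this expression: [\w]* can match the empty string, so the result may contain empty entries. A minimal sketch, assuming \w+ is an acceptable way to match terms (the variable names are illustrative only):

```javascript
const query = "(germany*or*croatia)*and*(denmark*or*hungary)";

// Same alternation as above, but terms must contain at least one word character.
const explicitOperators = /(\()|(\))|(\*or\*)|(\*and\*)|\w+/g;

// With the g flag, match() returns the full matches and ignores the capture groups.
const operatorTokens = query.match(explicitOperators);
console.log(operatorTokens);
```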

Related

Normalize Special Letter Char from Various Foreigner Languages

Forgive my ignorance for not knowing the technical term for foreign-language characters such as
ø, e.g. Helsingør
Ł, e.g. Łeczna
ı, e.g. Altınordu
ł, e.g. Głogow
How could I normalize those with JavaScript (write a regex), while also making the result case-insensitive?
const strArr = ['Helsingør', 'Łeczna', 'Altınordu', 'Głogow', "Népoão's"]
With the code below, I was able to replace latin based characters (é, ñ, ô, etc) but not the above ones.
strArr.map(string => string.normalize('NFD')
.replaceAll(/[\u0300-\u036f,"\'"]/g, ''))
My final output should read as ['Helsingor', 'Leczna', 'Altinordu', 'Glogow', 'Nepoaos']

To replace non-ASCII characters with the closest ASCII characters, you can use the lodash library and its deburr function.
Your final code for replacing the special characters would be:
const _ = require('lodash');
const strArr = ['Helsingør', 'Łeczna', 'Altınordu', 'Głogow'];
const normalized = strArr.map(string => _.deburr(string));
console.log(normalized);
Output Result :
[ 'Helsingor', 'Leczna', 'Altinordu', 'Glogow' ]

As you discovered, .normalize('NFD') only decomposes characters that consist of a base letter plus combining marks; letters such as ø, ł and ı are left unchanged. You'd need to use a library that normalizes those as well, possibly https://github.com/walling/unorm.
You can also roll your own. Here is a solution that uses .normalize('NFD') in conjunction with an iMap object that maps from international to English characters. This iMap is short; you can expand it as needed, for example by taking the mapping from https://github.com/cvan/lunr-unicode-normalizer/blob/master/lunr.unicodeNormalizer.js. You can also override the mapping of .normalize('NFD'); for example, the umlaut ü is better mapped to ue instead of u.
function normalizeString(str) {
const iMap = {
'ð': 'd',
'ı': 'i',
'Ł': 'L',
'ł': 'l',
'ø': 'o',
'ß': 'ss',
'ü': 'ue'
};
const iRegex = new RegExp(Object.keys(iMap).join('|'), 'g')
return str
.replace(iRegex, (m) => iMap[m])
.normalize("NFD")
.replace(/[\u0300-\u036f]/g, '');
}
[
'Helsingør', 'Łeczna', 'Altınordu', 'Głogow',
'Áfram með smjörið', 'Crème Brulée', 'Bär Müller Straße'
].forEach(str => {
let result = normalizeString(str);
console.log(str, '=>', result);
});
Output:
Helsingør => Helsingor
Łeczna => Leczna
Altınordu => Altinordu
Głogow => Glogow
Áfram með smjörið => Afram med smjorid
Crème Brulée => Creme Brulee
Bär Müller Straße => Bar Mueller Strasse
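The question also asks for case-insensitive handling, which the approach above does not cover. A minimal sketch, assuming it is enough to lowercase after folding; FOLD_MAP and foldForCompare are hypothetical names, and the map only covers the letters from the question:

```javascript
// Assumption: only these non-decomposable letters need an explicit mapping.
const FOLD_MAP = { 'ø': 'o', 'Ø': 'O', 'ł': 'l', 'Ł': 'L', 'ı': 'i' };

function foldForCompare(str) {
  return str
    .replace(/[øØłŁı]/g, ch => FOLD_MAP[ch]) // map letters NFD cannot decompose
    .normalize('NFD')                        // split base letters from combining marks
    .replace(/[\u0300-\u036f]/g, '')         // drop the combining marks
    .toLowerCase();                          // make comparisons case-insensitive
}

console.log(foldForCompare('Helsingør') === foldForCompare('HELSINGOR'));
console.log(foldForCompare('Łeczna'));
```

Comparing the folded forms of two strings then ignores both diacritics and case.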

Get multiple regex match result all in one match by Javascript

I am trying to make my code look more professional by removing duplicated code. I want to extract some data from a string; specifically, I need the NUMBER, X, Y, Z, A, B, etc. values. Because the regular expression is different for each variable, I end up repeating a lot of nearly identical code.
let TextString = `DRILL(NUMBER:=20,NAME:='4',PN:=1,X:=10.1,Y:=73.344,Z:=0,A:=-1.435,B:=1.045,M1:=1,M2:=2,M3:=3,M4:=4,M5:=1,S1:=10.5,S2:=2.1,S3:=1.2,S4:=2,S5:=2.4,RS1:=1,RS2:=2);`;
const regNumber = /(?<=NUMBER:=)[0-9]+/gm;
let lineNumber = Number(TextString.match(regNumber));
const regX = /(?<=X:=)(-?[0-9]+)(.[0-9]+)?/gm;
let X = Number(TextString.match(regX)).toFixed(1);
const regY = /(?<=Y:=)(-?[0-9]+)(.[0-9]+)?/gm;
let Y = Number(TextString.match(regY)).toFixed(1);
const regZ = /(?<=Z:=)(-?[0-9]+)(.[0-9]+)?/gm;
let Z = Number(TextString.match(regZ)).toFixed(1);
const regA = /(?<=A:=)(-?[0-9]+)(.[0-9]+)?/gm;
let A = Number(TextString.match(regA)).toFixed(1);
const regB = /(?<=B:=)(-?[0-9]+)(.[0-9]+)?/gm;
let B = Number(TextString.match(regB)).toFixed(1);
// and many more duplicate code.
console.log(lineNumber, X, Y, Z, A, B);
I could only think of a way like the above, to match each variable individually and run .match() multiple times, but as you can see there are 17 variables total and in real situations, there are hundreds of these TextString. I was worried that this matching process will have a huge impact on performance.
Are there any other ways to fetch all variables in one match and store them in an array or object? or any other elegant way of doing this?

Every coordinate has a single-letter identifier, so you can use a more general positive lookbehind, (?<=,[A-Z]:=). This lookbehind matches a comma followed by a single uppercase letter and then the := symbol.
You can then use .match() to get all matches and use .map() to run the conversion you were doing.
let TextString = `DRILL(NUMBER:=20,NAME:='4',PN:=1,X:=10.1,Y:=73.344,Z:=0,A:=-1.435,B:=1.045,M1:=1,M2:=2,M3:=3,M4:=4,M5:=1,S1:=10.5,S2:=2.1,S3:=1.2,S4:=2,S5:=2.4,RS1:=1,RS2:=2);`;
const regNumber = /(?<=NUMBER:=)[0-9]+/gm;
let lineNumber = Number(TextString.match(regNumber));
const regex = /(?<=,[A-Z]:=)(-?[0-9]+)(\.[0-9]+)?/gm;
let coord = TextString.match(regex).map(n => Number(n).toFixed(1));
console.log(lineNumber, coord);

You could write a single pattern:
(?<=\b(?:NUMBER|[XYZAB]):=)-?\d+(?:\.\d+)?\b
Explanation
(?<= Positive lookbehind, assert that to the left of the current position is
\b(?:NUMBER|[XYZAB]):= Match either NUMBER or one of X Y Z A B preceded by a word boundary and followed by :=
) Close the lookbehind
-? Match an optional -
\d+(?:\.\d+)? Match 1+ digits and an optional decimal part
\b A word boundary to prevent a partial word match
const TextString = `DRILL(NUMBER:=20,NAME:='4',PN:=1,X:=10.1,Y:=73.344,Z:=0,A:=-1.435,B:=1.045,M1:=1,M2:=2,M3:=3,M4:=4,M5:=1,S1:=10.5,S2:=2.1,S3:=1.2,S4:=2,S5:=2.4,RS1:=1,RS2:=2);`;
const regNumber = /(?<=\b(?:NUMBER|[XYZAB]):=)-?\d+(?:\.\d+)?\b/g;
const result = TextString
.match(regNumber)
.map(s =>
Number(s).toFixed(1)
);
console.log(result);

One possible approach is based on a regex pattern that uses capturing groups. The matching regex for the OP's sample text looks like this:
/\b(NUMBER|[XYZAB])\:=([^,]+),/g
The pattern is both simple and generic. It is generic because it always captures both the matching key, like NUMBER, and its related value, like 20, so it doesn't matter where a key-value pair occurs within a drill-data string.
To later assign all of the OP's variables at once with an object-based destructuring assignment, the post-processing step needs to reduce the result array of matchAll into an object that holds all the captured keys and values. Within this step one can also control how the values are computed and whether or how the keys get sanitized.
const regXDrillData = /\b(NUMBER|[XYZAB])\:=([^,]+),/g;
const textString =
`DRILL(NUMBER:=20,NAME:='4',PN:=1,X:=10.1,Y:=73.344,Z:=0,A:=-1.435,B:=1.045,M1:=1,M2:=2,M3:=3,M4:=4,M5:=1,S1:=10.5,S2:=2.1,S3:=1.2,S4:=2,S5:=2.4,RS1:=1,RS2:=2);`;
// - processed values via reducing the captured
// groups of a `matchAll` result array of a
// generic drill-data match-pattern.
const {
number: lineNumber,
x, y, z,
a, b,
} = [...textString.matchAll(regXDrillData)]
.reduce((result, [match, key, value]) => {
value = Number(value);
value = (key !== 'NUMBER') ? value.toFixed(1) : value;
return Object.assign(result, { [ key.toLowerCase() ]: value });
}, {})
console.log(
`processed values via reducing the captured
groups of a 'matchAll' result array of a
generic drill-data match-pattern ...`,
{ lineNumber, x, y, z, a, b },
);

Every value matches the pattern :=[value], or :=[value]) for the last one, so here is my regex:
(?<=:=)-?[\d\w.']+(?=[,)])
Positive Lookbehind (?<=:=) look for match behind :=
-? match - optional (for negative number)
[\d\w.']+: match digit, word character, ., '
Positive Lookahead (?=[,)]) look for match ahead character , or )
Now change your code to
let TextString = `DRILL(NUMBER:=20,NAME:='4',PN:=1,X:=10.1,Y:=73.344,Z:=0,A:=-1.435,B:=1.045,M1:=1,M2:=2,M3:=3,M4:=4,M5:=1,S1:=10.5,S2:=2.1,S3:=1.2,S4:=2,S5:=2.4,RS1:=1,RS2:=2);`;
const regexPattern= /(?<=:=)-?[\d\w.']+(?=[,)])/g;
console.log(TextString.match(regexPattern))
// ['20', "'4'", '1', '10.1', '73.344', '0', '-1.435', '1.045', '1', '2', '3', '4', '1', '10.5', '2.1', '1.2', '2', '2.4', '1', '2']
Edit
I just realized that the positive lookahead is unnecessary, as @Peter Seliger mentioned:
(?<=:=)-?[\d\w.']+
Change your regex pattern to
const regexPattern= /(?<=:=)-?[\d\w.']+/g;
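The pattern above returns the values without their keys. If the keys matter, one option is to capture both and build an object. This is a minimal sketch, assuming every field has the form KEY:=value and that values never contain a comma or closing parenthesis; drillLine, pairPattern and fields are illustrative names.

```javascript
const drillLine = `DRILL(NUMBER:=20,NAME:='4',PN:=1,X:=10.1,Y:=73.344,Z:=0,A:=-1.435,B:=1.045,M1:=1,M2:=2,M3:=3,M4:=4,M5:=1,S1:=10.5,S2:=2.1,S3:=1.2,S4:=2,S5:=2.4,RS1:=1,RS2:=2);`;

// Group 1 is the key, group 2 is everything up to the next comma or closing parenthesis.
const pairPattern = /(\w+):=([^,)]+)/g;

const fields = Object.fromEntries(
  Array.from(drillLine.matchAll(pairPattern), ([, key, raw]) => {
    const num = Number(raw);
    // Keep non-numeric values (such as the quoted NAME) as raw strings.
    return [key, Number.isNaN(num) ? raw : num];
  })
);

const { NUMBER: lineNumber, X, Y, Z, A, B } = fields;
console.log(lineNumber, X, Y, Z, A, B);
```

This scans the string once, so the same compiled pattern can be reused for every line when there are hundreds of them.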

Here is a solution using a .reduce() on keys of interest and returns an object:
const TextString = `DRILL(NUMBER:=20,NAME:='4',PN:=1,X:=10.1,Y:=73.344,Z:=0,A:=-1.435,B:=1.045,M1:=1,M2:=2,M3:=3,M4:=4,M5:=1,S1:=10.5,S2:=2.1,S3:=1.2,S4:=2,S5:=2.4,RS1:=1,RS2:=2);`;
const keys = [ 'NUMBER', 'X', 'Y', 'Z', 'A', 'B' ];
let result = keys.reduce((obj, key) => {
const regex = new RegExp('(?<=\\b' + key + ':=)-?[0-9.]+');
obj[key] = Number(TextString.match(regex)).toFixed(1);
return obj;
}, {});
console.log(result);
Output:
{
"NUMBER": "20.0",
"X": "10.1",
"Y": "73.3",
"Z": "0.0",
"A": "-1.4",
"B": "1.0"
}
Notes:
The regex is built dynamically from the key
A \b word boundary is added to the regex to reduce the chance of unintended matches
If you need the line number as an integer you could take that out of the keys, and handle it separately.
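Handling the line number separately, as the last note suggests, could look like this. It is a minimal sketch under the same assumptions as the code above; parseDrill, coordKeys and sampleLine are hypothetical names.

```javascript
const sampleLine = `DRILL(NUMBER:=20,NAME:='4',PN:=1,X:=10.1,Y:=73.344,Z:=0,A:=-1.435,B:=1.045,M1:=1,M2:=2,M3:=3,M4:=4,M5:=1,S1:=10.5,S2:=2.1,S3:=1.2,S4:=2,S5:=2.4,RS1:=1,RS2:=2);`;
const coordKeys = ['X', 'Y', 'Z', 'A', 'B'];

function parseDrill(line) {
  // The line number stays an integer.
  const numberMatch = line.match(/(?<=\bNUMBER:=)\d+/);
  // Coordinates are formatted to one decimal place, as in the answer above.
  const coords = Object.fromEntries(coordKeys.map(key => {
    const m = line.match(new RegExp(`(?<=\\b${key}:=)-?\\d+(?:\\.\\d+)?`));
    return [key, m ? Number(m[0]).toFixed(1) : null];
  }));
  return {
    lineNumber: numberMatch ? parseInt(numberMatch[0], 10) : null,
    ...coords,
  };
}

const parsed = parseDrill(sampleLine);
console.log(parsed);
```

Missing keys come back as null here instead of being silently converted to "0.0".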

Javascript - Replacing multiple parts of string in one go

I want to replace multiple parts of a string with different things. I have a series of URLs that contain these strings that need to change, they all follow the same pattern.
e.g.
'spanish-beginners-course'
'italian-beginners-course'
'spanish-italian-beginners-course'
I just want the result to be the languages e.g. spanish, italian, spanish italian
I have tried this as a test but it returns 'spanish undefined undefined'
const pageName = 'spanish-beginners-course'
const chars = { '-beginners': '', '-course': '', '-': ' ' }
const language = pageName.replace(/-|beginners|course/g, m => chars[m])

This is happening because your regex matches beginners, but your chars object has no key called beginners; the key is -beginners. The same goes for course and -course.
const pageName = 'spanish-beginners-course'
const chars = { '-beginners': '', '-course': '', '-': ' ' }
const language = pageName.replace(/-|beginners|course/g, m => chars[m])
In any case your object is unnecessary, and so is the regex (as @Alastair points out), since you're replacing a static, unchanging substring.
const language = pageName.replace('-beginners-course', '');
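If you do want to keep the lookup-object approach, the alternatives in the regex have to match the object's keys exactly, and the longer alternatives must come before the bare hyphen. A minimal sketch (the names replacements, slugPattern and toLanguages are illustrative):

```javascript
const replacements = { '-beginners': '', '-course': '', '-': ' ' };

// Longer alternatives first, otherwise the bare "-" would always win.
const slugPattern = /-beginners|-course|-/g;

const toLanguages = slug => slug.replace(slugPattern, m => replacements[m]);

console.log(toLanguages('spanish-beginners-course'));
console.log(toLanguages('spanish-italian-beginners-course'));
```

This also turns the remaining hyphen between two languages into a space, which the static replace does not.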

Your case is very simple, you can split and take first.
let pageName = "spanish-beginners-course";
let language = pageName.split(/-/)[0];
console.log(language); // spanish
pageName = "italian-beginners-course";
language = pageName.split(/-/)[0];
console.log(language); //italian

You get undefined because the pattern -|beginners|course is an alternation that matches either -, beginners or course.
You use the match to look up the value in the object, but the object contains the keys -beginners and -course instead.
If you want to do the replacement, and there has to be at least 1 word before it, you could use a capturing group $1 in the replacement and match after it what you want to remove.
(\w+(?:-\w+)*)-beginners-course\b
const pageName = 'spanish-beginners-course';
const language = pageName.replace(/(\w+(?:-\w+)*)-beginners-course\b/g, "$1");
console.log(language)
[
'spanish-beginners-course',
'italian-beginners-course',
'spanish-italian-beginners-course',
'donotremove!-beginners-course'
]
.forEach(s => console.log(s.replace(/(\w+(?:-\w+)*)-beginners-course\b/g, "$1")));

Regex giving incorrect results

Working in Javascript attempting to use a regular expression to capture data in a string.
My string appears as this starting with the left bracket
['ABC']['ABC.5']['ABC.5.1']
My goal is to get each bracketed piece of the string as a chunk or in an array.
I have reviewed and see that the match function might be a good choice.
var myString = "['ABC']['ABC.5']['ABC.5.1']";
myString.match(/\[/g);
The output I see is only the [ for each element.
I would like the array to be like this for example
myString[0] = ['ABC']
myString[1] = ['ABC.5']
myString[2] = ['ABC.5.1']
What is the correct regular expression and or function to get the above-desired output?

If you just want to separate them, you can use a simple expression, or better yet, you can split them:
\[\'(.+?)'\]
const regex = /\[\'(.+?)'\]/gm;
const str = `['ABC']['ABC.5']['ABC.5.1']`;
const subst = `['$1']\n`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);

You can use this regex with split:
\[[^\]]+\]
Details
\[ - Matches [
[^\]]+ - Matches anything except ] one or more times
\] - Matches ]
let str = `['ABC']['ABC.5']['ABC.5.1']`
let op = str.split(/(\[[^\]]+\])/).filter(Boolean)
console.log(op)
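The same pattern also works with match and the g flag, which avoids the capture group and the filter step. A minimal sketch (bracketed and chunks are illustrative names):

```javascript
const bracketed = `['ABC']['ABC.5']['ABC.5.1']`;

// With the g flag, match() returns every full match as an array (or null if none).
const chunks = bracketed.match(/\[[^\]]+\]/g) ?? [];

console.log(chunks);
```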

Double bracket pattern search in string

I have a regular expression to search in a string.
new RegExp("\\b"+searchText+"\\b", "i")
My strings:
"You are likely to find [[children]] in [[a school]]"
"[[school]] is for [[learning]]"
How can I search only the words in double brackets?
A regular expression should contain searchText as a function argument.

This RegEx will give you the basics of what you want:
\[\[[^\]]*\]\]
\[\[ matches the two starting brackets. A bracket is a special character in RegEx, hence it must be escaped with \
[^\]]* is a negated set that matches zero or more of any character except a closing bracket. This matches the content in-between the brackets.
\]\] matches the two closing brackets.
Here's a very basic example of what you could do with this:
let string = "You are likely to find [[children]] in [[a school]]<br>[[school]] is for [[learning]]";
string = string.replace(/\[\[[^\]]*\]\]/g, x => `<mark>${x}</mark>`);
document.body.innerHTML = string;
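To take searchText as a function argument, as the question asks, the pattern can be built dynamically. This is a minimal sketch, assuming searchText starts and ends with a word character (so \b behaves as expected) and that bracketed text never contains a ] character; escapeRegExp and bracketSearch are hypothetical helper names.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a case-insensitive regex that finds searchText as a whole word inside [[...]].
function bracketSearch(searchText) {
  return new RegExp(
    `\\[\\[[^\\]]*\\b${escapeRegExp(searchText)}\\b[^\\]]*\\]\\]`,
    'i'
  );
}

const sentence = "You are likely to find [[children]] in [[a school]]";
console.log(bracketSearch('school').test(sentence)); // inside brackets
console.log(bracketSearch('find').test(sentence));   // outside brackets
```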

You can use this regex:
var str = `-- You are likely to find [[children]] in [[a school]]
-- [[school]] is for [[learning]]`;
var regex = /(?<=(\[\[))([\w\s]*)(?=(\]\]))/gm;
var match = str.match(regex);
console.log(match);

const re = /(?<=\[\[)[^\]]+(?=]])/gm
const string = `-- You are likely to find [[children]] in [[a school]]
-- [[school]] is for [[learning]]`
console.log(string.match(re))
const replacement = {
children: 'adults',
'a school': 'a home',
school: 'home',
learning: 'rest',
}
console.log(string.split(/(?<=\[\[)[^\]]+(?=]])/).map((part, index) => part + (replacement[string.match(re)[index]] || '')).join(''))
