Working on something similar to Solr's WordDelimiterFilter, but not in Java.
Want to split words into tokens like this:
P90X = P, 90, X (split on word/number boundary)
TotallyCromulentWord = Totally, Cromulent, Word (split on lowercase/uppercase boundary)
TransAM = Trans, AM
Looking for a general solution, not specific to the above examples. Preferably in a regex flavour that doesn't support lookbehind, but I can use PL/perl if necessary, which can do lookbehind.
Found a few answers on SO, but they all seemed to use lookbehind.
Things to split on:
1. Transition from lowercase letter to uppercase letter
2. Transition from letter to number or number to letter
3. (Optional) split on a few other characters (- _)
My main concern is 1 and 2.
That's not something I'd like to do without lookbehind, but for the challenge, here is a JavaScript solution that you should be able to convert easily into whatever language you need:
function split(s) {
var match;
var result = [];
while (Boolean(match = s.match(/([A-Z]+|[A-Z]?[a-z]+|[0-9]+|([^a-zA-Z0-9])+)$/))) {
if (!match[2]) {
//don't return non alphanumeric tokens
result.unshift(match[1]);
}
s = s.substring(0, s.length - match[1].length);
}
return result;
}
Demo:
P90X [ 'P', '90', 'X' ]
TotallyCromulentWord [ 'Totally', 'Cromulent', 'Word' ]
TransAM [ 'Trans', 'AM' ]
URLConverter [ 'URL', 'Converter' ]
Abc.DEF$012 [ 'Abc', 'DEF', '012' ]
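If the target engine has lookahead but no lookbehind, a single global match may be enough instead of repeatedly matching at the end of the string. The following is only a sketch under that assumption: the name tokenize is made up for illustration, and it handles ASCII letters and digits only.

```javascript
// Hypothetical helper: tokenizes with one global match, using only a negative lookahead.
function tokenize(s) {
  // [A-Z]+(?![a-z])  run of capitals not followed by a lowercase letter ("URL", "AM", "P")
  // [A-Z]?[a-z]+     optional capital followed by lowercase letters ("Totally", "abc")
  // [0-9]+           run of digits
  // Anything else (. $ - _ and so on) is never matched, so it acts as a delimiter.
  return s.match(/[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+/g) || [];
}

console.log(tokenize('P90X'));
console.log(tokenize('URLConverter'));
console.log(tokenize('Abc.DEF$012'));
```

The `|| []` is there because match returns null when nothing matches, for example on an empty string.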
This regex should split all the words in a paragraph, or any string, into tokens.
It even works for the simple cases in your example.
Match globally. Other specific delimiters can be added as well if you want them.
# /(?:[A-Z]?[a-z]+(?=[A-Z\d]|[^a-zA-Z\d]|$)|[A-Z]+(?=[a-z\d]|[^a-zA-Z\d]|$)|\d+(?=[a-zA-Z]|[^a-zA-Z\d]|$))[^a-zA-Z\d]*|[^a-zA-Z\d]+/
(?:
[A-Z]? [a-z]+
(?= [A-Z\d] | [^a-zA-Z\d] | $ )
|
[A-Z]+
(?= [a-z\d] | [^a-zA-Z\d] | $ )
|
\d+
(?= [a-zA-Z] | [^a-zA-Z\d] | $ )
)
[^a-zA-Z\d]*
|
[^a-zA-Z\d]+
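As a rough usage sketch in JavaScript (the helper name splitWords is made up here): each match can carry trailing delimiter characters, so they are stripped afterwards, and matches that were delimiters only are dropped.

```javascript
// The pattern from above, with the g flag so that match() returns every token.
const tokenRe = /(?:[A-Z]?[a-z]+(?=[A-Z\d]|[^a-zA-Z\d]|$)|[A-Z]+(?=[a-z\d]|[^a-zA-Z\d]|$)|\d+(?=[a-zA-Z]|[^a-zA-Z\d]|$))[^a-zA-Z\d]*|[^a-zA-Z\d]+/g;

// Hypothetical helper: strip trailing delimiters from each match and
// drop matches that consisted of delimiters only.
function splitWords(s) {
  return (s.match(tokenRe) || [])
    .map(t => t.replace(/[^a-zA-Z\d]+$/, ''))
    .filter(t => t.length > 0);
}

console.log(splitWords('P90X'));
console.log(splitWords('Trans-AM_9'));
```

Note that a run of capitals directly followed by a lowercase letter is kept whole by this pattern, so 'URLConverter' comes out as URLC and onverter rather than URL and Converter.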
Example strings :
2222
333333
12345
111
123456789
12345678
Expected result:
2#222
333#333
12#345
111
123#456#789
12#345#678
i.e. '#' should be inserted at the 4th, 8th, 12th, etc. position counted from the end of the string.
I believe this can be done using replace and some other methods in JavaScript.
For validation of the output string I have made this regex:
^(\d{1,3})(\.\d{3})*?$
You can use this regular expression:
/(\d)(\d{3})$/
This matches one digit (\d, captured as group 1) followed by the last three digits of the string (\d{3}, captured as group 2). You can then reference the two groups in your replacement string as $1 and $2.
See example below:
const transform = str => str.replace(/(\d)(\d{3})$/, '$1#$2');
console.log(transform("2222")); // 2#222
console.log(transform("333333")); // 333#333
console.log(transform("12345")); // 12#345
console.log(transform("111")); // 111
For larger strings of size N, you could use other methods such as .match() and reverse the string like so:
const reverse = str => Array.from(str).reverse().join('');
const transform = str => {
return reverse(reverse(str).match(/(\d{1,3})/g).join('#'));
}
console.log(transform("2222")); // 2#222
console.log(transform("333333")); // 333#333
console.log(transform("12345")); // 12#345
console.log(transform("111")); // 111
console.log(transform("123456789")); // 123#456#789
console.log(transform("12345678")); // 12#345#678
var test = [
'111',
'2222',
'333333',
'12345',
'123456789',
'1234567890123456'
];
console.log(test.map(function (a) {
return a.replace(/(?=(?:\B\d{3})+$)/g, '#');
}));
You could match the empty position in front of every group of three digits counted from the right, using a positive lookahead, and insert a # at each of those positions.
(?=(?:\B\d{3})+$)
(?= Positive lookahead, what is on the right is
(?:\B\d{3})+ Repeat 1+ times: a position that is not a word boundary, followed by 3 digits
$ Assert end of string
) Close lookahead
const regex = /^\d+$/;
["2222",
"333333",
"12345",
"111",
"123456789",
"12345678"
].forEach(s => console.log(
s.replace(/(?=(?:\B\d{3})+$)/g, "#")
));
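To tie this back to the validation regex in the question: assuming the separator really is # (the pattern in the question checks for a literal dot instead), the formatted string could be checked as sketched below. The names group3 and isGrouped are made up for this example.

```javascript
// Hypothetical helpers, assuming digit-only input and '#' as the separator.
const group3 = s => s.replace(/(?=(?:\B\d{3})+$)/g, '#');

// 1 to 3 leading digits, then any number of "#ddd" groups.
const isGrouped = s => /^\d{1,3}(?:#\d{3})*$/.test(s);

console.log(group3('12345678'));
console.log(isGrouped(group3('12345678')));
console.log(isGrouped('1234#567'));
```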
I want to implement a function that outputs the respective strings as an array from an input string like "str1|str2#str3":
function myFunc(string) { ... }
For the input string, however, only str1 has to be present. str2 and str3 (with their delimiters) are both optional. For that I have already written a regular expression that performs a kind of split. I cannot do a normal split because the delimiters are different characters and the order of str1, str2, and str3 is important. This more or less works with my regex pattern. Now I'm struggling with how to extend this pattern so that the two delimiters can be escaped by using \| or \#.
How exactly can I solve this best?
var strings = [
'meaning',
'meaning|description',
'meaning#id',
'meaning|description#id',
'|description',
'|description#id',
'#id',
'meaning#id|description',
'sub1\\|sub2',
'mea\\|ning|descri\\#ption',
'mea\\#ning#id',
'meaning|description#identific\\|\\#ation'
];
var pattern = /^(\w+)(?:\|(\w*))?(?:\#(\w*))?$/ // works without escaping
console.log(pattern.exec(strings[3]));
According to the problem definition, strings 0-3 and 8-11 should be valid and the rest should not. myFunc(strings[3]) should return ['meaning','description','id'] and myFunc(strings[8]) should return ['sub1\|sub2',null,null].
You need to allow \\[|#] alongside \w in the pattern, i.e. replace each \w with (?:\\[#|]|\w):
var strings = [
'meaning',
'meaning|description',
'meaning#id',
'meaning|description#id',
'|description',
'|description#id',
'#id',
'meaning#id|description',
'sub1\\|sub2',
'mea\\|ning|descri\\#ption',
'mea\\#ning#id',
'meaning|description#identific\\|\\#ation'
];
var pattern = /^((?:\\[#|]|\w)+)(?:\|((?:\\[#|]|\w)*))?(?:#((?:\\[#|]|\w)*))?$/;
for (var s of strings) {
if (pattern.test(s)) {
console.log(s, "=> MATCHES");
} else {
console.log(s, "=> FAIL");
}
}
Pattern details
^ - string start
((?:\\[#|]|\w)+) - Group 1: 1 or more repetitions of \ followed with # or | or a word char
(?:\|((?:\\[#|]|\w)*))? - an optional group matching 1 or 0 occurrences of
\| - a | char
((?:\\[#|]|\w)*) - Group 2: 0 or more repetitions of \ followed with # or | or a word char
(?:#((?:\\[#|]|\w)*))? - an optional group matching 1 or 0 occurrences of
# - a # char
((?:\\[#|]|\w)*) Group 3: 0 or more repetitions of \ followed with # or | or a word char
$ - end of string.
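The pattern above only tests whether a string is valid. A possible myFunc built on it could look like the sketch below; returning null for an invalid input is my assumption, since the question does not say what should happen in that case.

```javascript
// Pattern from above: \| and \# count as ordinary characters of a part.
const pattern = /^((?:\\[#|]|\w)+)(?:\|((?:\\[#|]|\w)*))?(?:#((?:\\[#|]|\w)*))?$/;

// Sketch of myFunc: returns [str1, str2, str3] with null for a missing part,
// or null when the whole input is invalid (that return value is an assumption).
function myFunc(string) {
  const m = pattern.exec(string);
  if (m === null) return null;
  // exec() leaves unmatched optional groups as undefined; map them to null.
  return m.slice(1).map(part => (part === undefined ? null : part));
}

console.log(myFunc('meaning|description#id'));
console.log(myFunc('sub1\\|sub2'));
console.log(myFunc('|description'));
```

The escape sequences are kept as they were typed, so the second call returns the part with its backslash still in place.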
My guess is that you wish to split all your strings, in which case we could put those delimiters in a character class, similar to:
([|#\\]+)?([\w]+)
If validation is also needed, it may be simpler to do it as a separate step, because a single validating pattern gets very complicated as the number of combinations increases.
const regex = /([|#\\]+)?([\w]+)/gm;
const str = `meaning
meaning|description
meaning#id
meaning|description#id
|description
|description#id
#id
meaning#id|description
sub1\\|sub2
mea\\|ning|descri\\#ption
mea\\#ning#id
meaning|description#identific\\|\\#ation`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Seems like what you're looking for may be this?
((?:\\#|\\\||[^\|#])*)*
Explanation:
Matches all sets that include "\#", "\|", or any character except "#" and "|".
https://regexr.com/4fr68
I have a crazy string, something like:
sun #plants #!wood% ##arebaba#tey travel#blessed #weed das#$#F!#D!AAAA
I want to extract all "words" (also containing special characters) that begin with # or that have a space right before, taking the following as a result:
[
'sun',
'plants',
'!wood%',
'arebaba',
'tey',
'travel',
'blessed',
'weed',
'das',
'$',
'F!#D!AAAA'
]
How do I get this using regex?
You can use match with the regex [^\s#]+:
var str = 'sun #plants #!wood% ##arebaba#tey travel#blessed #weed das#$#F!#D!AAAA';
var arr = str.match(/[^\s#]+/g);
console.log(arr);
Just using match you could get all the group 1 matches into an array.
(?:^|[ #]+)([^ #]+)(?=[ #]|$)
Easy!
(?: ^ | [ #]+ )
( [^ #]+ ) # (1)
(?= [ #] | $ )
Or, if you feel it's this simple, then just use ([^ #]+) or [^ #]+
which gets the same thing (like split in reverse).
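To illustrate the "split in reverse" remark, here is a small sketch comparing match with an actual split on the delimiters (the variable names are made up). Both forms break on every #, so the trailing F!#D!AAAA from the question comes out as F! and D!AAAA.

```javascript
const str = 'sun #plants #!wood% ##arebaba#tey travel#blessed #weed das#$#F!#D!AAAA';

// Split on runs of whitespace and '#', then drop the empty strings that
// appear when the input starts or ends with a delimiter.
const viaSplit = str.split(/[\s#]+/).filter(Boolean);

// Match runs of everything that is not whitespace or '#'.
const viaMatch = str.match(/[^\s#]+/g) || [];

console.log(viaSplit);
console.log(viaSplit.join('|') === viaMatch.join('|'));
```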
I have the following string:
dynamic[elements][0][slider][image1]
Which also could be:
dynamic[elements][0][slider][image1]
dynamic[elements][0][abc][image1]
dynamic[elements][0][static][image1]
dynamic[elements][0][fronter][image1]
dynamic[elements][0][xyz][image1]
I would like to first change "dynamic" to "main" and then change slider, abc, static, fronter or xyz (or something completely different) to "value".
So the solution I am looking for should return:
main[elements][0][value][image1]
main[elements][0][value][image1]
main[elements][0][value][image1]
main[elements][0][value][image1]
main[elements][0][value][image1]
How can this be accomplished? I am thinking that a regex that targets the third [] could be a solution, so I tried the following:
var line = 'dynamic[elements][0][value][image1]';
line.replace('dynamic', 'main');
line.replace(/\[.*?\]/g, 'value');
But I don't know how to replace only the content of the third pair of brackets, and the above attempt doesn't really work.
You may use
.replace(/^[^\][]*(\[[^\][]*]\[\d+]\[)[^\][]*(].*)/g, 'main$1value$2')
Details:
^ - start of string
[^\][]* - 0+ chars other than [ and ]
(\[[^\][]*]\[\d+]\[) - Group 1 matching:
\[ - a literal [
[^\][]* - 0 or more chars other than [ and ]
]\[ - a literal ][ substring
\d+ - 1 or more digits
]\[ - a literal ][ substring
[^\][]* - 0+ chars other than [ and ]
(].*) - Group 2 matching ] and then any 0+ chars other than line break chars.
Note: the ] inside a character class must be escaped, while when outside of the character class, it does not have to be escaped.
The main$1value$2 is the replacement pattern inserting main at the start, then pasting Group 1 contents (with the $1 backreference), then inserting value and then the contents of Group 2.
var ss = ['dynamic[elements][0][slider][image1]', 'dynamic[elements][0][abc][image1]', 'dynamic[elements][0][static][image1]', 'dynamic[elements][0][fronter][image1]', 'dynamic[elements][0][xyz][image1]'];
var rx = /^[^\][]*(\[[^\][]*]\[\d+]\[)[^\][]*(].*)/g;
var subst = "main$1value$2";
for (var s of ss) {
console.log(s, "=>", s.replace(rx, subst));
}
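If you would rather follow the idea from the question of targeting the third [], a replacement callback with a counter is another option. This is only a sketch and replaceNthBracket is a made-up name. Keep in mind that replace returns a new string instead of modifying line, so the result has to be assigned.

```javascript
// Hypothetical helper: replaces the content of the n-th [...] group (1-based)
// and leaves every other group untouched.
function replaceNthBracket(line, n, value) {
  let count = 0;
  return line.replace(/\[[^\][]*\]/g, m => (++count === n ? '[' + value + ']' : m));
}

const line = 'dynamic[elements][0][slider][image1]';
// First rename the prefix, then replace the third bracket group.
const result = replaceNthBracket(line.replace(/^dynamic/, 'main'), 3, 'value');
console.log(result);
```

Unlike the single-regex version, this does not require the second bracket group to be numeric.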
I want all the proper natural numbers from a given string,
var a = "#1234abc 12 34 5 67 sta5ck over # numbrs ."
numbers = a.match(/\d+/gi)
In the above string I should only match the numbers 12, 34, 5 and 67, not the 1234 in the first word, the 5 in "sta5ck", etc.
So numbers should be equal to [12,34,5,67]
Use word boundaries,
> var a = "#1234abc 12 34 5 67 sta5ck over # numbrs ."
undefined
> numbers = a.match(/\b\d+\b/g)
[ '12', '34', '5', '67' ]
Explanation:
\b Word boundary which matches between a word character (\w) and a non-word character (\W).
\d+ One or more digits.
\b Word boundary which matches between a word character and a non-word character.
OR
> var myString = '#1234abc 12 34 5 67 sta5ck over # numbrs .';
undefined
> var myRegEx = /(?:^| )(\d+)(?= |$)/g;
undefined
> function getMatches(string, regex, index) {
... index || (index = 1); // default to the first capturing group
... var matches = [];
... var match;
... while (match = regex.exec(string)) {
..... matches.push(match[index]);
..... }
... return matches;
... }
undefined
> var matches = getMatches(myString, myRegEx, 1);
undefined
> matches
[ '12', '34', '5', '67' ]
Code stolen from here.
If anyone is interested in a proper regex solution to match digits surrounded by space characters, it is simple for languages that support lookbehind (like Perl and Python, but not JavaScript at the time of writing):
(?<=^|\s)\d+(?=\s|$)
Note that the two alternatives inside the lookbehind have different widths (^ is zero-width, \s is one character). PCRE accepts that, but Python's re module rejects it because it requires a fixed-width lookbehind; (?<!\S)\d+(?!\S) is an equivalent form that uses only fixed-width lookarounds.
As illustrated in the accepted answer, in languages that don't support lookbehind it is necessary to use a workaround, e.g. including the leading space in the match while keeping the important part in a capturing group:
(?:^|\s)(\d+)(?=\s|$)
Then you just need to extract that capturing group from the matches, see e.g. this answer to How do you access the matched groups in a JavaScript regular expression?
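In newer JavaScript engines the extraction can be done with String.prototype.matchAll, which is standard since ES2020. The snippet below is a sketch that assumes such an engine; the conversion to numbers is added only because the question asks for [12,34,5,67].

```javascript
const a = '#1234abc 12 34 5 67 sta5ck over # numbrs .';

// The leading (?:^|\s) is consumed but not captured; group 1 holds the digits.
// The trailing lookahead does not consume the space, so that space can start the next match.
const re = /(?:^|\s)(\d+)(?=\s|$)/g;

// matchAll needs the g flag; the mapping function picks group 1 of each match.
const numbers = Array.from(a.matchAll(re), m => Number(m[1]));

console.log(numbers);
```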