Split string to CamelCase words and also uppercase acronyms - javascript

Given a string containing CamelCase words and also uppercase acronyms, e.g. 'ManualABCTask':
How can it be split into a string with a space between all words and acronyms in a less wordy way?
I had the following process:
let initial = 'ManualABCTask'
//Split on single upper case followed by any number of lower case:
.split(/(['A-Z'][a-z]*)/g)
//the returned array includes empty string entries e.g. ["", "", "Manual", "A", "", "B", "", "C","", "Task", ""] so remove these:
.filter(x => x != '');
//When joining the array, each single uppercase letter of an acronym gets its own space e.g. 'Manual A B C Task', so instead reduce and add a space only if the array entry has more than one character
let word = initial.reduce((prevVal,currVal) => {
return (currVal.length == 1) ? prevVal + currVal : prevVal + ' ' + currVal + ' ';
}, '');
This does the job on the combinations it needs to e.g:
'ManualABCTask' => 'Manual ABC Task'
'ABCManualTask' => 'ABC Manual Task'
'ABCManualDEFTask' => 'ABC Manual DEF Task'
But it was a lot of code for the job done and surely could be handled in the initial regex.
I was experimenting while writing the question and with a tweak to the regex, got it down to one line, big improvement! So posting anyway with solution.
My regex know-how isn't great, so this could maybe still be improved on.

I know next to nothing about JavaScript, but I had a bash at it:
let initial = 'ManualABCTask'
initial = initial.replace(/([A-Z][a-z]+)/g, ' $1 ').trim();

There are 2 groups: one starting from a head letter with the following lowercase letters, and one starting from a head letter and running until the next letter isn't lowercase:
find = new RegExp(
"(" +
"[A-Z][a-z]+" + // Group starting from head letter with following lowercases
"|" +
"[A-Z]+(?![a-z])" + // Group with head letters until next letter isn't lowercase:
")",
"g"
)
initial = 'ManualABCTask'.split(find)

As mentioned in the post, changed to handle it in the regex:
initial = 'ManualABCTask'.split(/(['A-Z']{2,99})(['A-Z'][a-z]*)/g).join(' ');
Group any consecutive uppercase characters with a length of 2 to 99 to get the acronyms, and any single uppercase character followed by any number of lowercase characters to get the other words. Join with a space.
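For comparison, here is a minimal sketch of a replace-based variant that avoids the empty entries split() can leave behind. The helper name splitCamelCase is hypothetical and the two patterns are just one possible choice, not the approach used above; it assumes ASCII letters only.

```javascript
// Hypothetical helper: put a space between CamelCase words and acronyms.
function splitCamelCase(text) {
  return text
    // a lowercase letter followed by an uppercase letter: "lA" -> "l A"
    .replace(/([a-z])([A-Z])/g, '$1 $2')
    // a run of capitals followed by a capitalised word: "ABCTask" -> "ABC Task"
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2');
}

console.log(splitCamelCase('ManualABCTask'));    // Manual ABC Task
console.log(splitCamelCase('ABCManualTask'));    // ABC Manual Task
console.log(splitCamelCase('ABCManualDEFTask')); // ABC Manual DEF Task
```

Because the spaces are inserted in place, there is nothing to filter or trim afterwards.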


Regex to match character unless it is preceded by an odd number of another specific character

Another way to state my problem is to match a character always when it is preceded by an even number (0, 2, 4, ...) of another specific character.
In my case I want to match all ' characters in a string unless they are preceded by an odd number (1, 3, 5, ...) of ? characters.
example:
- ?' => shouldn't match (preceded by one ?)
- ??' => Should match (preceded by 2 ?)
- ?????' => Shouldn't match (preceded by 5 ?)
Let's consider this scenario:
We have this string: ' ??' ????' ?' ??????' The regex should then match all ' characters except for the 4th one, so for example if I want to use String.split(regex) the result would be ['', ' ??', ' ????', " ?' ??????", '']
Currently I am using this regex: (?<!\?)', but the problem is that it matches only if there is no ? at all before '
You can use
/(?<=(?<!\?)(?:\?\?)*)'/g
Details:
(?<=(?<!\?)(?:\?\?)*) - a positive lookbehind that matches a location that is preceded with any zero or more occurrences of double ? not immediately preceded with another ?
' - a ' char.
Sample code:
const texts = ["The ?' should not match","The ??' should match","?????' => The ?????' should not match"];
const rx = /(?<=(?<!\?)(?:\?\?)*)'/; // no g flag here: test() on a global regex keeps lastIndex between calls
for (var text of texts) {
console.log(text, '=>', rx.test(text));
}
If you need replacing, it is possible with
const texts = ["The ?' should not match","The ??' should match","?????' => The ?????' should not match"];
const rx = /(?<=(?<!\?)(?:\?\?)*)'/g
for (var text of texts) {
console.log(text, '=>', text.replace(rx, '<MATCH>$&</MATCH>'));
}
You may use this regex with a lookbehind condition:
(?<=([^?]|^)(?:\?\?)*)'
RegEx Explanation
(?<=: Start lookbehind condition
([^?]|^): Match a non-? character or start
(?:\?\?)*: Match 0 or more pairs of ?
): End lookbehind condition
': Match a '
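Where lookbehind is not available, the same "even number of ? before the quote" rule can be checked with a plain scan. The following is only a sketch under that assumption; the function name findUnescapedQuotes is hypothetical.

```javascript
// Hypothetical helper: return the indexes of every ' that is preceded by an
// even number (0, 2, 4, ...) of consecutive ? characters.
function findUnescapedQuotes(text) {
  const indexes = [];
  let run = 0; // length of the current run of consecutive '?'
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === '?') {
      run++;
      continue;
    }
    if (ch === "'" && run % 2 === 0) {
      indexes.push(i);
    }
    run = 0; // any other character ends the run
  }
  return indexes;
}

console.log(findUnescapedQuotes("' ??' ????' ?' ??????'")); // [ 0, 4, 10, 21 ]
```

The quote at index 13 is skipped because exactly one ? precedes it.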

Regex to get substring to the left of a special character in a string

I have certain expressions that look like this
"sum='29'"
'The total score =" 29"'
"Your name = 'John'"
"Your grade is A"
Now what I want to do is to check if the left side of quote (' or ") contains an =.
So this is what I do
leftTermOfQuote = string.match(/\S+(?' *')/)[0]
But I get null. What am I doing wrong?
To get the left side of a ' or ", you could capture the first part in a group while matching the first ' or ".
To avoid crossing the quotes or the = boundary, you could use a negated character class [^'"\r\n=] matching any char except the listed ones.
^([^'"\r\n=]*=[^'"\r\n=]*)['"]
Explanation
^ Start of string
( Capture group 1
[^'"\r\n=]* Match 0+ times any char except the quotes, the equals sign or a newline
= Match the equals sign
[^'"\r\n=]* Same as previous character class
) Close group
['"] Match a ' or "
[
`sum='29'`,
`The total score =" 29"`,
`Your name = 'John'`,
`Your grade is A`
].forEach(s => {
let res = s.match(/^([^'"\r\n=]*=[^'"\r\n=]*)['"]/);
if (res) {
console.log(res[1]);
}
})
Now what I want to do is to check if the left side of quote (' or ") contains an =.
You can do that by searching for =.*['"]. But this would also find The result is 'A=B' because =B' matches the requirements.
So you can anchor the regex to the first character and request that before the first quote, there is an equal sign:
^[^'"]*=[^'"]*['"]
This reads as: "from the beginning of the string ^, there may be non-quotes [^'"], in any number *, before an equal sign =, followed by any number of non-quotes, finally followed by a quote"
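As a quick check of the anchored pattern, here is a small sketch that runs it against the sample strings from the question plus the 'A=B' counter-example. The variable names are illustrative only.

```javascript
// True only when an equals sign appears before the first quote.
const equalsBeforeFirstQuote = /^[^'"]*=[^'"]*['"]/;

const samples = [
  "sum='29'",
  'The total score =" 29"',
  "Your name = 'John'",
  'Your grade is A',
  "The result is 'A=B'", // the = sits inside the quotes, so this must not match
];

for (const sample of samples) {
  console.log(JSON.stringify(sample), '=>', equalsBeforeFirstQuote.test(sample));
}
```

The first three strings report true and the last two report false.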
You can also parse the whole assignment:
^\s*([^=]*?)\s*=\s*"\s*(.*?)\s*"
This will also extract the parenthesized parts of the assignment, giving you whatever is on the left of the equal sign and whatever is on the right inside the quotes. The lazy quantifiers and the surrounding \s* keep leading and trailing whitespace out of both groups, so
' The total score = " 29" '
is parsed into
[ "The total score", "29" ]
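A minimal sketch of that parsing step follows. It uses a variant of the pattern with lazy quantifiers and a backreference so that both quote styles are accepted; the function name parseAssignment and the choice to return null for non-matching input are assumptions, not part of the answer above.

```javascript
// Hypothetical helper: split `left = "right"` (or 'right') into its two parts.
function parseAssignment(text) {
  // group 1: left side, group 2: opening quote, group 3: quoted value
  const match = text.match(/^\s*([^=]*?)\s*=\s*(["'])\s*(.*?)\s*\2/);
  return match ? [match[1], match[3]] : null;
}

console.log(parseAssignment(' The total score = " 29" ')); // [ 'The total score', '29' ]
console.log(parseAssignment("sum='29'"));                   // [ 'sum', '29' ]
console.log(parseAssignment('Your grade is A'));            // null
```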
You can use the split and includes functions as well:
const word = 'The total score =" 29"';
const wordSplitByQuotes = word.split(/"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'/);
const containsEqualSign = wordSplitByQuotes[0].includes('=');

How to ban words with diacritics using a blacklist array and regex?

I have an input of type text where I return true or false depending on a list of banned words. Everything works fine. My problem is that I don't know how to check against words with diacritics from the array:
var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new RegExp('\\b' + bannedWords.join("\\b|\\b") + '\\b', 'i');
$(function () {
$("input").on("change", function () {
var valid = !regex.test(this.value);
alert(valid);
});
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check'>
Now on the word băţ it returns true instead of false for example.
Chiu's comment is right: 'aaáaa'.match(/\b.+?\b/g) yields the quite counter-intuitive [ "aa", "á", "aa" ], because "word character" (\w) in JavaScript regular expressions is just a shorthand for [A-Za-z0-9_] ('case-insensitive-alpha-numeric-and-underscore'), so the word boundary (\b) matches any place between a chunk of alpha-numerics and any other character. This makes extracting "Unicode words" quite hard.
For non-unicase writing systems it is possible to identify "word character" by its dual nature: ch.toUpperCase() != ch.toLowerCase(), so your altered snippet could look like this:
var bannedWords = ["bad", "mad", "testing", "băţ", "bať"];
var bannedWordsRegex = new RegExp('-' + bannedWords.join("-|-") + '-', 'i');
$(function() {
$("input").on("input", function() {
var invalid = bannedWordsRegex.test(dashPaddedWords(this.value));
$('#log').html(invalid ? 'bad' : 'good');
});
$("input").trigger("input").focus();
function dashPaddedWords(str) {
return '-' + str.replace(/./g, wordCharOrDash) + '-';
};
function wordCharOrDash(ch) {
return isWordChar(ch) ? ch : '-'
};
function isWordChar(ch) {
return ch.toUpperCase() != ch.toLowerCase();
};
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check' value="ba">
<p id="log"></p>
Let's see what's going on:
alert("băţ".match(/\w\b/));
This is [ "b" ] because word boundary \b doesn't recognize word characters beyond ASCII. JavaScript's "word characters" are strictly [0-9A-Z_a-z], so aä, pπ, and zƶ match \w\b\W since they contain a word character, a word boundary, and a non-word character.
I think the best you can do is something like this:
var bound = '[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]';
var regex = new RegExp('(?:^|' + bound + ')(?:'
+ bannedWords.join('|')
+ ')(?=' + bound + '|$)', 'i');
where bound is a reversed list of all ASCII word characters plus most Latin-esque letters, used with start/end of line markers to approximate an internationalized \b. (The second of which is a zero-width lookahead that better mimics \b and therefore works well with the g regex flag.)
Given ["bad", "mad", "testing", "băţ"], this becomes:
/(?:^|[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe])(?:bad|mad|testing|băţ)(?=[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]|$)/i
This doesn't need anything like ….join('\\b|\\b')… because there are parentheses around the list (and that would create things like \b(?:hey\b|\byou)\b, which is akin to \bhey\b\b|\b\byou\b, including the nonsensical \b\b – which JavaScript interprets as merely \b).
You can also use var bound = '[\\s!-/:-@[-`{-~]' for a simpler ASCII-only list of acceptable non-word characters. Be careful about that order! The dashes indicate ranges between characters.
You need a Unicode aware word boundary. The easiest way is to use XRegExp package.
Although its \b is still ASCII based, there is a \p{L} (or a shorter \pL version) construct that matches any Unicode letter from the BMP plane. Building a custom word boundary with this construct is easy:
\b word \b
---------------------------------------
| | |
([^\pL0-9_]|^) word (?=[^\pL0-9_]|$)
The leading word boundary can be represented with a (non)capturing group ([^\pL0-9_]|^) that matches (and consumes) either a character other than a Unicode letter from the BMP plane, a digit and _ or a start of the string before the word.
The trailing word boundary can be represented with a positive lookahead (?=[^\pL0-9_]|$) that requires a character other than a Unicode letter from the BMP plane, a digit and _ or the end of string after the word.
See the snippet below that will detect băţ as a banned word, and băţy as an allowed word.
var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new XRegExp('(?:^|[^\\pL0-9_])(?:' + bannedWords.join("|") + ')(?=$|[^\\pL0-9_])', 'i');
$(function () {
$("input").on("change", function () {
var valid = !regex.test(this.value);
//alert(valid);
console.log("The word is", valid ? "allowed" : "banned");
});
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
<input type='text' name='word_to_check'>
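On engines that support ES2018 Unicode property escapes and lookbehind, a similar custom boundary can be sketched without a library. This is an illustration under that assumption, not the XRegExp answer's code; the helper names escapeRegExp, buildBannedRegex and isBanned are hypothetical, and normalizing to NFC is an added assumption so that composed and decomposed spellings compare equal.

```javascript
// Escape regex metacharacters so banned words are matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build one regex with a Unicode-aware "word boundary" on both sides.
function buildBannedRegex(words) {
  const alternatives = words
    .map(word => escapeRegExp(word.normalize('NFC')))
    .join('|');
  // No letter, digit or underscore directly before or after the word.
  return new RegExp(
    '(?<![\\p{L}\\p{N}_])(?:' + alternatives + ')(?![\\p{L}\\p{N}_])',
    'iu'
  );
}

const bannedRegex = buildBannedRegex(['bad', 'mad', 'testing', 'băţ']);

function isBanned(text) {
  return bannedRegex.test(text.normalize('NFC'));
}

console.log(isBanned('băţ'));  // true
console.log(isBanned('băţy')); // false
```

Since both boundaries are zero-width here, neither of them consumes the neighbouring character.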
Instead of using a word boundary, you could do it with
(?:[^\w\u0080-\u02af]+|^)
to check for start of word, and
(?=[^\w\u0080-\u02af]|$)
to check for the end of it.
The [^\w\u0080-\u02af] matches any character that is not (^) a basic Latin word character (\w) or a character in the Unicode Latin-1 Supplement, Latin Extended-A, Latin Extended-B and IPA Extensions blocks. This includes some punctuation, but the pattern would get very long if it matched just letters. It may also have to be extended if other character sets have to be included. See for example Wikipedia.
Since JavaScript didn't support lookbehinds when this was written (they were added in ES2018), the start-of-word test consumes any of the aforementioned non-word characters, but I don't think that should be a problem. The important thing is that the end-of-word test doesn't.
Also, putting these tests outside a non-capturing group that alternates the words makes it significantly more efficient.
var bannedWords = ["bad", "mad", "testing", "băţ", "båt", "süß"],
regex = new RegExp('(?:[^\\w\\u00c0-\\u02af]+|^)(?:' + bannedWords.join("|") + ')(?=[^\\w\\u00c0-\\u02af]|$)', 'i');
function myFunction() {
document.getElementById('result').innerHTML = 'Banned = ' + regex.test(document.getElementById('word_to_check').value);
}
<!DOCTYPE html>
<html>
<body>
Enter word: <input type='text' id='word_to_check'>
<button onclick='myFunction()'>Test</button>
<p id='result'></p>
</body>
</html>
When dealing with characters outside my base set (which can show up at any time), I convert them to an appropriate base equivalent (8-bit, 16-bit, 32-bit) before running any character matching over them.
var bannedWords = ["bad", "mad", "testing", "băţ"];
var bannedWordsBits = {};
bannedWords.forEach(function(word){
bannedWordsBits[word] = "";
for (var i = 0; i < word.length; i++){
bannedWordsBits[word] += word.charCodeAt(i).toString(16) + "-";
}
});
var bannedWordsJoin = []
var keys = Object.keys(bannedWordsBits);
keys.forEach(function(key){
bannedWordsJoin.push(bannedWordsBits[key]);
});
var regex = new RegExp(bannedWordsJoin.join("|"), 'i');
function checkword(word) {
var wordBits = "";
for (var i = 0; i < word.length; i++){
wordBits += word.charCodeAt(i).toString(16) + "-";
}
return !regex.test(wordBits);
};
The separator "-" is there to make sure that unique characters don't bleed together creating undesired matches.
This is very useful, as it brings all the characters down to a common base that everything can interact with. And this can be re-encoded back to its original without having to ship it in a key/value pair.
For me the best thing about it is that I don't have to know all of the rules for all of the character sets that I might intersect with, because I can pull them all into a common playing field.
As a side note:
To speed things up, rather than running the one large regex that you probably have, which gets slower as the list of banned words grows, I would pass each separate word in the sentence through the filter, and break the filter up into length-based segments, like:
checkword3Chars();
checkword4Chars();
checkword5Chars();
whose functions you can generate systematically and even create on the fly as and when they become required.

regex to remove number (year only) from string

I know the regex that separates two words as following:
input:
'WonderWorld'
output:
'Wonder World'
"WonderWorld".replace(/([A-Z])/g, ' $1');
Now I am looking to remove number in year format from string, what changes should be done in the above code to get:
input
'WonderWorld 2016'
output
'Wonder World'
You can match the location before an uppercase letter (but excluding the beginning of a line) with \B(?=[A-Z]) and match the trailing spaces if any with 4 digits right before the end (\s*\b\d{4}\b). In a callback, check if the match is not empty, and replace accordingly. If a match is empty, we matched the location before an uppercase letter (=> replace with a space) and if not, we matched the year at the end (=> replace with empty string). The four digit chunks are only matched as whole words due to the \b word boundaries around the \d{4}.
var re = /\B(?=[A-Z])|\s*\b\d{4}\b/g;
var str = 'WonderWorld 2016';
var result = str.replace(re, function(match) {
return match ? "" : " ";
});
document.body.innerHTML = "<pre>'" + result + "'</pre>";
A similar approach, just a different pattern for matching glued words (might turn out more reliable):
var re = /([a-z])(?=[A-Z])|\s*\b\d{4}\b/g;
var str = 'WonderWorld 2016';
var result = str.replace(re, function(match, group1) {
return group1 ? group1 + " " : "";
});
document.body.innerHTML = "<pre>'" + result + "'</pre>";
Here, ([a-z])(?=[A-Z]) matches and captures into Group 1 a lowercase letter that is followed with an uppercase one, and inside the callback, we check if Group 1 matched (with group1 ?). If it matched, we return the group1 + a space. If not, we matched the year at the end, and remove it.
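To make the callback logic easier to try out, here is a small sketch that wraps the second pattern in a function and runs it on a few inputs. The function name cleanTitle and the extra sample strings are illustrative assumptions.

```javascript
// Hypothetical helper: space out glued words and drop standalone 4-digit numbers.
function cleanTitle(text) {
  const re = /([a-z])(?=[A-Z])|\s*\b\d{4}\b/g;
  return text.replace(re, function (match, group1) {
    // group1 is set only for the "lowercase before uppercase" branch;
    // otherwise the match is a year (plus leading spaces) and is removed.
    return group1 ? group1 + ' ' : '';
  });
}

console.log(cleanTitle('WonderWorld 2016'));      // Wonder World
console.log(cleanTitle('OneTwo 1000 ThreeFour')); // One Two Three Four
console.log(cleanTitle('Version12345'));          // Version12345
```

The last input is left alone because the \b boundaries only accept four digits that stand as a whole word.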
Try this:
"WonderWorld 2016".replace(/([A-Z])|\b[0-9]{4}\b/g, ' $1')
How about this, a single regex to do what you want:
"WonderWorld 2016".replace(/([A-Z][a-z]+)([A-Z].*)\s.*/g, '$1 $2');
"Wonder World"
get everything apart from digits and spaces.
re-code of @Wiktor Stribiżew's solution:
str can be any "WonderWorld 2016" | "OneTwo 1000 ThreeFour" | "Ruby 1999 IamOnline"
str.replace(/([a-z])(?=[A-Z])|\s*\d{4}\b/g, function(m, g) {
return g ? g + " " : "";
});
import re
remove_year_regex = re.compile(r"[0-9]{4}")

JS string replace only replacing every other occurrence

I have the following JS:
"a a a a".replace(/(^|\s)a(\s|$)/g, '$1')
I expect the result to be '', but am instead getting 'a a'. Can anyone explain to me what I am doing wrong?
Clarification: What I am trying to do is remove all occurrences of 'a' that are surrounded by whitespace (i.e. a whole token)
That's because the regex /(^|\s)a(\s|$)/g consumes the character before and the character after each a.
In the string "a a a a" the regex matches:
"a " (the first a), after which the remaining input is "a a a" (but its start is no longer the beginning of the string, and there is no space before it),
" a " (the third a), after which the remaining input is "a" (which does not match because there is no space before it).
Edit:
Little bit tricky but working (without regex):
var a = "a a a a";
// Handle beginning case 'a '
var startI = a.indexOf("a ");
if (startI === 0){
var off = a.charAt(startI + 2) !== "a" ? 2 : 1; // test if "a" come next to keep the space before
a = a.slice(startI + off);
}
// Handle middle case ' a '
var iOf = -1;
while ((iOf = a.indexOf(" a ")) > -1){
var off = a.charAt(iOf + 3) !== "a" ? 3 : 2; // same here
a = a.slice(0, iOf) + a.slice(iOf+off, a.length);
}
// Handle end case ' a'
var endI = a.indexOf(" a");
if (endI === a.length - 2){
a = a.slice(0, endI);
}
a; // ""
First "a " matches.
Then it will try to match against "a a a", which will skip first a, and then match "a ".
Then it will try to match against "a", which will not match.
First match will be replaced to beginning of line. => "^"
Then we have "a" that didn't match => "a"
Second match will be replaced to " " => " "
Then we have "a" that didn't match => "a"
The result will be "a a".
To get your desired result you can do this:
"a a a a".replace(/(?:\s+a(?=\s))+\s+|^a\s+(?=[^a]|$|a\S)|^a|\s*a$/g, '')
As others have tried to point out, the issue is that the regex consumes the surrounding spaces as part of the match. Here's a [hopefully] more straightforward explanation of why that regex doesn't work as you expect:
First let's break down the regex: it says match a space or the start of the string, followed by an 'a', followed by a space or the end of the string.
Now let's apply it to the string. I've added character indexes beneath the string to make things easier to talk about:
a a a a
0123456
The regex looks at the index 0 char and finds an 'a' at that location, followed by a space at index 1. This is a match because it is the start of the string, followed by an 'a', followed by a space. The length of our match is 2 (the 'a' and the space), so we consume two characters and start our next search at index 2.
Character 2 ('a') is neither a space nor the start of the string, and therefore it doesn't match the start of our regular expression, so we consume that character (without replacing it) and move on to the next.
Character 3 is a space, followed by an 'a', followed by another space, which is a match for our regex. We replace it with $1 (the captured leading space), consume the length of the match (3 characters - " a ") and move on to index 6.
Character 6 ('a') is neither a space nor the start of the string, and therefore it doesn't match the start of our regular expression, so we consume that character (without replacing it) and move on to the next.
Now we're at the end of the string, so we're done.
The reason why the regex @caeth suggested (/(^|\s+)a(?=\s|$)/g) works is the ?= lookahead. From the MDN RegExp documentation:
Matches x only if x is followed by y. For example, /Jack(?=Sprat)/ matches "Jack" only if it is followed by "Sprat". /Jack(?=Sprat|Frost)/ matches "Jack" only if it is followed by "Sprat" or "Frost". However, neither "Sprat" nor "Frost" is part of the match results.
So, in this case, the ?= lookahead checks whether the following character is a space, without actually consuming that character.
(^|\s)a(?=\s|$)
Try this. Replace by $1. See demo.
https://regex101.com/r/gQ3kS4/3
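A minimal sketch of the lookahead variant in use, compared with the original pattern. The helper name removeToken is hypothetical; because the lookahead version keeps the captured leading spaces, the sketch also collapses and trims whitespace afterwards, which is an added assumption about the desired output.

```javascript
// Original pattern: the trailing (\s|$) consumes the space the next match needs.
console.log(JSON.stringify('a a a a'.replace(/(^|\s)a(\s|$)/g, '$1'))); // "a a"

// Hypothetical helper: remove every whole-token occurrence of `token`.
function removeToken(text, token) {
  const escaped = token.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // The lookahead checks for a following space or the end without consuming it.
  const pattern = new RegExp('(^|\\s)' + escaped + '(?=\\s|$)', 'g');
  return text
    .replace(pattern, '$1')
    .replace(/\s+/g, ' ') // collapse the spaces left behind
    .trim();
}

console.log(JSON.stringify(removeToken('a a a a', 'a')));          // ""
console.log(removeToken('a ba sl lf a df a a df r a', 'a'));       // ba sl lf df df r
```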
Use this instead:
"a a a a".replace(/(^|\s*)a(\s|$)/g, '$1')
With the "\s*" in the first group, this replaces all the "a" occurrences
Greetings
Or you can just split the string up, filter it and glue it back:
"a ba sl lf a df a a df r a".split(/\s+/).filter(function (x) { return x != "a" }).join(" ")
>>> "ba sl lf df df r"
"a a a a".split(/\s+/).filter(function (x) { return x != "a" }).join(" ")
>>> ""
Or in ECMAScript 6:
"a ba sl lf a df a a df r a".split(/\s+/).filter(x => x != "a").join(" ")
>>> "ba sl lf df df r"
"a a a a".split(/\s+/).filter(x => x != "a").join(" ")
>>> ""
I assume that there are no leading or trailing spaces. You can change the filter to x && x != 'a' if you want to remove the assumption.
