Regex for Strings without Consecutive Letters - javascript

Given an array of words, I want to find all letters that don't appear consecutively (e.g., ee, aa, ZZ, TT) in any of the words.
I have tried a variety of approaches and am hitting a roadblock in my understanding. This should be with vanilla JavaScript ES6 (no libraries or imports).
Here is a short sample word list I'm using to test:
const sampleArr = [
"BORROW", "BRANCH", "CYST", "DEIFIED", "DIPLOMATIC",
"GEESE", "HAIRCUT", "HYMN", "LEVEL", "MOSQUITO",
"MURDRUM", "NON", "POP", "POWER", "GOD", "THY"
]
Here is the code I came up with. It only returns the matches, but I need the inverse: the letters that do not match.
So, if there are no occurrences of "AA" for instance, the code should add "A" to the return array.
I've tried negative lookahead regex, but couldn't get it to work right.
Here is the code I currently have that gives no errors, but doesn't work right:
wordlist = sampleArr
let joinedWordList = wordlist.join('')
console.log(joinedWordList)
let pattern = /([A-Z])\1+/g
doubleLettersFound = joinedWordList.match(pattern)
let singleDoubleLettersFound = doubleLettersFound.filter(el => el.split('')[0]
// console.log(el)
)
console.log(singleDoubleLettersFound)
This is the result I'm receiving:
[ 'RR', 'MM', 'DD', 'EE', 'PP' ]
Also, if it helps, here is an earlier regex (in context) I was trying:
// Join word as string then process; For each letter, if consecutives found anywhere,
// go to next letter; If no consecutives found, add letter to out.lettersNoConsec
haystack = arr.join('')
out.lettersNonConsec = abc.split('').filter(ltr => haystack.match(RegExp(`(${ltr})\\1`)))
// console.log(letters

The pattern with a back reference is definitely a good idea to identify letters that repeat consecutively, but:
As some letters might not occur at all in any of the strings, you cannot rely only on the strings themselves; you need to iterate over all the letters of the alphabet, which is what you seemed to try in the second attempt.
If you join all words together, you should leave a space or some other punctuation between words, as otherwise you may get false matches: the last letter of a word might be the same as the first letter of the next word.
If you change the regex to have a look-ahead for the second character, it will match only the first character of a repeated sequence of a letter. That single letter will make it easier to work with.
Here is a possible solution:
const sampleArr = [
"BORROW", "BRANCH", "CYST", "DEIFIED", "DIPLOMATIC",
"GEESE", "HAIRCUT", "HYMN", "LEVEL", "MOSQUITO",
"MURDRUM", "NON", "POP", "POWER", "GOD", "THY"
];
const allwords = sampleArr.join(" ").toUpperCase();
const paired = new Set(allwords.match(/([A-Z])(?=\1)/g));
const result = [..."ABCDEFGHIJKLMNOPQRSTUVWXYZ"]
.filter(ch => !paired.has(ch));
console.log(...result);
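The same idea can be wrapped in a reusable function. This is a minimal sketch rather than part of the solution above; the name lettersWithoutDoubles and the default alphabet parameter are assumptions made for illustration.

```javascript
// Hypothetical helper: returns the letters of `alphabet` that never appear
// doubled (e.g. "EE") inside any single word of `words`.
function lettersWithoutDoubles(words, alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ") {
  // Join with a space so the end of one word cannot pair with the start of the next.
  const joined = words.join(" ").toUpperCase();
  // match() returns null when nothing matches, so fall back to an empty array.
  const paired = new Set(joined.match(/([A-Z])(?=\1)/g) || []);
  return [...alphabet].filter(ch => !paired.has(ch));
}

// "BORROW" doubles R and "GEESE" doubles E; the P ending "POP" and the P
// starting "POWER" are separated by the space, so P is still reported.
const sample = ["BORROW", "GEESE", "POP", "POWER"];
console.log(lettersWithoutDoubles(sample).join(""));
```

Passing the alphabet as a parameter keeps the function usable for other character sets, although the regex itself would then need to be adapted as well.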

How can I match atomic elements and their amounts in regex with Javascript?

I want to make a tool that parses atomic elements from a formula.
Say I start with Ba(Co3Ti)2 + 3BrH20. I would first want to parse each compound in the formula, which is easy enough with let regions = str.replace(/\s/g, '').split(/\+/g);
Now, for each compound, I want to identify each element and its numerical "amount".
For the example above, for the first compound, I'd want an array like this:
[
"Ba",
[
"Co3",
"Ti"
],
"2"
]
If finding sub-compounds within parentheses isn't possible, then I could work with this:
[
"Ba",
"(Co3",
"Ti)",
"2"
]
Is this possible with regex?
This is what I've come up with in a few minutes:
let compounds = str.replace(/\s/g, '').split(/\+/g);
for (var r = 0; r < compounds.length; ++r) {
let elements = compounds[r]
}
You can use
str.match(/\(?(?:[A-Z][a-z]*\d*|\d+)\)?/g)
See the regex demo. Details:
\(? - an optional (
(?:[A-Z][a-z]*\d*|\d+) - either of the two options:
[A-Z][a-z]*\d* - an uppercase letter, then zero or more lowercase letters and then zero or more digits
| - or
\d+ - one or more digits
\)? - an optional ).
See a JavaScript demo:
const str = 'Ba(Co3Ti)2';
const re = /\(?(?:[A-Z][a-z]*\d*|\d+)\)?/g;
let compounds = str.match(re);
console.log(compounds);
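If the nested form from the question is preferred, a token list can be folded into nested arrays with a small stack. This is a sketch under assumptions: parseCompound is a hypothetical name, parentheses get their own tokens (a small variation on the regex above), and unbalanced parentheses are treated as an error.

```javascript
// Hypothetical helper: turns "Ba(Co3Ti)2" into ["Ba", ["Co3", "Ti"], "2"].
function parseCompound(str) {
  // Tokens: "(" or ")" on their own, an element with optional count, or a bare number.
  const tokens = str.match(/\(|\)|[A-Z][a-z]*\d*|\d+/g) || [];
  const root = [];
  const stack = [root]; // the last entry is the group currently being filled
  for (const token of tokens) {
    if (token === "(") {
      const group = [];
      stack[stack.length - 1].push(group);
      stack.push(group);
    } else if (token === ")") {
      if (stack.length === 1) throw new Error("Unbalanced ')' in " + str);
      stack.pop();
    } else {
      stack[stack.length - 1].push(token);
    }
  }
  if (stack.length !== 1) throw new Error("Unbalanced '(' in " + str);
  return root;
}

console.log(JSON.stringify(parseCompound("Ba(Co3Ti)2")));
// ["Ba",["Co3","Ti"],"2"]
```

The stack approach is used here because a single regex match cannot produce nested output on its own; the regex only tokenizes and the loop builds the structure.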

Regular expression for matching alphabet

I have a CLI where the user can declare an alphabet and pass it to my code. My code generates a string from that alphabet.
For example, if the user declares the alphabet groups abc, abcAB1234 and ##1$2%, I need to generate a string where every single character is in at least one group and the generated string has all the characters defined by the alphabet. No repetitions are allowed (case sensitive).
So, if the alphabet is abc abcAB1234 ##1$2%, admitted outputs are B#1a or #ca41%, but not aA## (same character 'a' repeated), aBcZ# ('Z' is not part of the alphabet), or aBA43 (some characters of the alphabet are not present).
I tried this: ^(?!.*([abcabcAB1234##1$2%])\1{1})(?!.*([abc])\1{1})(?!.*([abcAB1234])\1{1})(?!.*([##1$2%])\1{1})[abcabcAB1234##1$2%]{8,}$ but, obviously, it doesn't work.
Can someone please help me understand where I went wrong with my regexp?
I don't think this is possible with a RegExp, but it is easy to achieve using a Set.
const alphabet = 'abcBEL'
const wordToMatch = 'BLa'
const wordToMatch2 = 'BLaa'
const wordToMatch3 = 'zBLa'
function checkWord(alphabet, word) {
const set = new Set(alphabet.split(''))
for (const c of word){
if (!set.has(c)) return false
set.delete(c)
}
return true
}
console.log(checkWord(alphabet, wordToMatch)) // true
console.log(checkWord(alphabet, wordToMatch2)) // false: 'a' is repeated
console.log(checkWord(alphabet, wordToMatch3)) // false: 'z' is not in the alphabet
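The check above treats the alphabet as one flat set. If each declared group must also contribute at least one character, which is one possible reading of the examples in the question, the same Set idea extends as follows. This is only a sketch: isValidWord, the array-of-groups input shape, and the "every group is represented" rule are assumptions.

```javascript
// Hypothetical helper. `groups` is an array of strings, one per declared group.
function isValidWord(groups, word) {
  const allowed = new Set(groups.join(""));
  const seen = new Set();
  for (const c of word) {
    // Reject characters outside the alphabet and repeated characters.
    if (!allowed.has(c) || seen.has(c)) return false;
    seen.add(c);
  }
  // Assumed rule: every group must be represented by at least one character.
  return groups.every(group => [...group].some(c => seen.has(c)));
}

const groups = ["abc", "abcAB1234", "##1$2%"];
for (const word of ["B#1a", "#ca41%", "aA##", "aBcZ#", "aBA43"]) {
  console.log(word, isValidWord(groups, word));
}
```

With these inputs the first two words pass and the last three fail, matching the admitted and rejected examples listed in the question.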

How do I make my code concise and short using Regex Expressions

I'm trying to make my code cleaner and more concise. The main goal is to change the string to meet my requirements.
Requirements
I want to remove any empty lines (like the one in the middle of the two sentences down below)
I want to remove the * in front of each sentence, if there is one.
I want to make the first letter of each word capital and the rest lowercase (except words that have $ in front of them)
This is what I've done so far:
const string =
`*SQUARE HAS ‘NO PLANS’ TO BUY MORE BITCOIN: FINANCIAL NEWS
$SQ
*$SQ UPGRADED TO OUTPERFORM FROM PERFORM AT OPPENHEIMER, PT $185`
const nostar = string.replace(/\*/g, ''); // gets rid of the * of each line
const noemptylines = nostar.replace(/^\s*[\r\n]/gm, ''); //gets rid of empty blank lines
const lowercasestring = noemptylines.toLowerCase(); //turns it to lower case
const tweets = lowercasestring.replace(/(^\w{1})|(\s{1}\w{1})/g, match => match.toUpperCase()); //makes first letter of each word capital
console.log(tweets)
I've done most of the code; however, I want words that have $ in front of them to stay capitalized, and I don't know how to do that.
Furthermore, I was wondering if it's possible to combine the regex expressions so the code is even shorter and more concise.
You could make use of capture groups and the callback function of replace.
^(\*|[\r\n]+)|\$\S*|(\S+)
^ Start of a line (the m flag is set in the code below)
(\*|[\r\n]+) Capture group 1, match either * or 1 or more newlines
| Or
\$\S* Match $ followed by optional non whitespace chars (which will be returned unmodified in the code)
| Or
(\S+) Capture group 2, match 1+ non whitespace chars
Regex demo
const regex = /^(\*|[\r\n]+)|\$\S*|(\S+)/gm;
const string =
`*SQUARE HAS ‘NO PLANS’ TO BUY MORE BITCOIN: FINANCIAL NEWS
$SQ
*$SQ UPGRADED TO OUTPERFORM FROM PERFORM AT OPPENHEIMER, PT $185`;
const res = string.replace(regex, (m, g1, g2) => {
if (g1) return ""
if (g2) {
g2 = g2.toLowerCase();
return g2.charAt(0).toUpperCase() + g2.slice(1);
}
return m;
});
console.log(res);
Making it readable is more important than making it short.
const tweets = string
.replace(/\*/g, '') // gets rid of the * of each line
.replace(/^\s*[\r\n]/gm, '') //gets rid of empty blank lines
.toLowerCase() //turns it to lower case
.replace(/(^\w{1})|(\s{1}\w{1})/g, match => match.toUpperCase()) //makes first letter of each word capital
.replace(/\B\$(\w+)\b/g, match => match.toUpperCase()); //keep words that have $ in front of it, capital
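To try the chained version on other input, it can be wrapped in a function. This is a small sketch, assuming the name formatTweets (not from the original) and a made-up ASCII sample string; the quantifiers {1} are dropped because they do not change what the pattern matches.

```javascript
// Hypothetical wrapper around the chained replacements above.
function formatTweets(text) {
  return text
    .replace(/\*/g, '')            // drop every *
    .replace(/^\s*[\r\n]/gm, '')   // drop blank lines
    .toLowerCase()
    .replace(/(^\w)|(\s\w)/g, match => match.toUpperCase())  // capitalize each word
    .replace(/\B\$(\w+)\b/g, match => match.toUpperCase());  // restore $-prefixed words
}

console.log(formatTweets("*HELLO WORLD\n\n*$ABC RISES TO $12"));
// Hello World
// $ABC Rises To $12
```

Each step stays on its own line, so a single rule can be changed or removed without touching the others.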

Is there a way for the content.replace to sort of split them into more words than these?

const filter = ["bad1", "bad2"];
client.on("message", message => {
var content = message.content;
var stringToCheck = content.replace(/\s+/g, '').toLowerCase();
for (var i = 0; i < filter.length; i++) {
if (content.includes(filter[i])){
message.delete();
break
}
}
});
So my code above is a Discord bot that deletes messages when someone writes "bad1" or "bad2"
(plus some more filtered bad words that I'm going to add), and luckily there are no errors whatsoever.
But right now the bot only deletes these words when they are written in lowercase letters, without spaces in between or special characters.
I think I have found a solution, but I can't seem to put it into my code. I tried different ways, but it either deleted only lowercase words or didn't react at all, and I got errors like "cannot read property of undefined".
var badWords = [
'bannedWord1',
'bannedWord2',
'bannedWord3',
'bannedWord4'
];
bot.on('message', message => {
var words = message.content.toLowerCase().trim().match(/\w+|\s+|[^\s\w]+/g);
var containsBadWord = words.some(word => {
return badWords.includes(word);
});
This is what I am looking at: the var words line, specifically (/\w+|\s+|[^\s\w]+/g).
Is there any way to implement that in my const filter code (above), or is there a different approach?
Thanks in advance.
Well, I'm not sure what you're trying to do with .match(/\w+|\s+|[^\s\w]+/g). That's some unnecessary regex just to get an array of words and spaces. And it won't even work if someone were to split their bad word into something like "t h i s".
If you want your filter to be case insensitive and account for spaces/special characters, a better solution would probably require more than one regex, and separate checks for the split letters and the normal bad word check. And you need to make sure your split letters check is accurate, otherwise something like "wash it" might be considered a bad word despite the space between the words.
A Solution
So here's a possible solution. Note that it is just a solution, and is far from the only solution. I'm just going to use hard-coded string examples instead of message.content, to allow this to be in a working snippet:
//Our array of bad words
var badWords = [
'bannedWord1',
'bannedWord2',
'bannedWord3',
'bannedWord4'
];
//A function that tests if a given string contains a bad word
function testProfanity(string) {
//Removes all non-letter, non-digit, and non-space chars
var normalString = string.replace(/[^a-zA-Z0-9 ]/g, "");
//Replaces all non-letter, non-digit chars with spaces
var spacerString = string.replace(/[^a-zA-Z0-9]/g, " ");
//Checks if a condition is true for at least one element in badWords
return badWords.some(swear => {
//Removes any non-letter, non-digit chars from the bad word (for normal)
var filtered = swear.replace(/\W/g, "");
//Splits the bad word into a 's p a c e d' word (for spaced)
var spaced = filtered.split("").join(" ");
//Two different regexes for normal and spaced bad word checks
var checks = {
spaced: new RegExp(`\\b${spaced}\\b`, "gi"),
normal: new RegExp(`\\b${filtered}\\b`, "gi")
};
//If the normal or spaced checks are true in the string, return true
//so that '.some()' will return true for satisfying the condition
return spacerString.match(checks.spaced) || normalString.match(checks.normal);
});
}
var result;
//Includes one banned word; expected result: true
var test1 = "I am a bannedWord1";
result = testProfanity(test1);
console.log(result);
//Includes one banned word; expected result: true
var test2 = "I am a b a N_N e d w o r d 2";
result = testProfanity(test2);
console.log(result);
//Includes one banned word; expected result: true
var test3 = "A bann_eD%word4, I am";
result = testProfanity(test3);
console.log(result);
//Includes no banned words; expected result: false
var test4 = "No banned words here";
result = testProfanity(test4);
console.log(result);
//This is a tricky one. 'bannedWord2' is technically present in this string,
//but is 'bannedWord22' really the same? This prevents something like
//"wash it" from being labeled a bad word; expected result: false
var test5 = "Banned word 22 isn't technically on the list of bad words...";
result = testProfanity(test5);
console.log(result);
I've commented each line thoroughly, such that you understand what I am doing in each line. And here it is again, without the comments or testing parts:
var badWords = [
'bannedWord1',
'bannedWord2',
'bannedWord3',
'bannedWord4'
];
function testProfanity(string) {
var normalString = string.replace(/[^a-zA-Z0-9 ]/g, "");
var spacerString = string.replace(/[^a-zA-Z0-9]/g, " ");
return badWords.some(swear => {
var filtered = swear.replace(/\W/g, "");
var spaced = filtered.split("").join(" ");
var checks = {
spaced: new RegExp(`\\b${spaced}\\b`, "gi"),
normal: new RegExp(`\\b${filtered}\\b`, "gi")
};
return spacerString.match(checks.spaced) || normalString.match(checks.normal);
});
}
Explanation
As you can see, this filter is able to deal with all sorts of punctuation, capitalization, and even single spaces/symbols in between the letters of a bad word. However, note that in order to avoid the "wash it" scenario I described (potentially resulting in the unintentional deletion of a clean message), I made it so that something like "bannedWord22" would not be treated the same as "bannedWord2". If you want it to do the opposite (therefore treating "bannedWord22" the same as "bannedWord2"), you must remove both of the \\b phrases in the normal check's regex.
I will also explain the regex, such that you fully understand what is going on here:
[^a-zA-Z0-9 ] means "select any character not in the ranges of a-z, A-Z, 0-9, or space" (meaning all characters not in those specified ranges will be replaced with an empty string, essentially removing them from the string).
\W means "select any character that is not a word character", where "word character" refers to the characters in ranges a-z, A-Z, 0-9, and underscore.
\b means "word boundary": a zero-width position between a word character and a non-word character, which also covers the start or end of the string when a word character sits there. \b is escaped with an additional \ (to become \\b) because the pattern is built from a template string, where a single \b would be read as a backspace character instead of the regex token.
The flags g and i used in both of the regex checks indicate "global" and "case-insensitive", respectively.
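The tokens described above can be checked in isolation. The short sketch below uses illustrative inputs (assumptions, apart from the strings reused from the tests earlier) to show how \b changes the result and why the backslash is doubled inside a template string.

```javascript
const word = "bannedWord2"; // illustrative entry from the list above

// Built from a template string, so \b is written as \\b.
const withBoundary = new RegExp(`\\b${word}\\b`, "i");
const withoutBoundary = new RegExp(word, "i");

console.log(withBoundary.test("a BANNEDWORD2 here")); // true: whole word, case ignored
console.log(withBoundary.test("bannedWord22"));       // false: no boundary between "2" and "2"
console.log(withoutBoundary.test("bannedWord22"));    // true: plain substring match

// A single backslash in a template string yields a backspace character.
console.log(`\b` === "\u0008");                       // true

// [^a-zA-Z0-9 ] strips everything except letters, digits, and spaces.
console.log("bann_eD%word4, I am".replace(/[^a-zA-Z0-9 ]/g, "")); // "banneDword4 I am"
```

The g flag is left out here on purpose: test() on a global regex advances lastIndex between calls, which can make repeated checks on the same object look inconsistent.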
Of course, to get this working with your discord bot, all you have to do in your message handler is something like this (and be sure to replace badWords with your filter variable in testProfanity()):
if (testProfanity(message.content)) return message.delete();
If you want to learn more about regex, or if you want to mess around with it and/or test it out, this is a great resource for doing so.

Regex match everything after match set except for match set

This may be a simple expression to write but I am having the hardest time with this one. I need to match group sets where each group has 2 parts, what we can call the operation and the value. I need the value to match to anything after the operation EXCEPT another operation.
Valid operations to match (standard math operators): [>,<,=,!,....]
For example: '>=25!30<50' Would result in three matching groups:
1. (>=, 25)
2. (!, 30)
3. (<, 50)
I can currently solve the above using: /(>=|<=|>|<|!|=)(\d*)/g however this only works if the characters in the second match set are numbers.
The wall I am running into is how to match EVERYTHING after EXCEPT for the specified operators.
For example I don't know how to solve: '<=2017-01-01' without writing a regex to specify each and every character I would allow (which is anything except the operators) and that just doesn't seem like the correct solution.
There has got to be a way to do this! Thanks guys.
You could match the operation (>=|<=|>|<|!|=) as the first of the 2 parts. For the second part, use a capturing group with a negative lookahead that keeps matching characters as long as no operation starts directly to the right.
(?:>=|<=|>|<|!|=)((?:(?!(?:>=|<=|>|<|!|=)).)+)
(?:>=|<=|>|<|!|=) Match one of the operations using an alternation
( Start capturing group (This will contain your value)
(?: Start non capturing group
(?!(?:>=|<=|>|<|!|=)). Negative lookahead which asserts what is on the right side is not an operation and matches any character .
)+ Close non capturing group and repeat one or more times
) Close capturing group
const regex = /(?:>=|<=|>|<|!|=)((?:(?!(?:>=|<=|>|<|!|=)).)+)/gm;
const strings = [
">=25!30<50",
">=test!30<$##%",
"34"
];
let m;
strings.forEach((s) => {
while ((m = regex.exec(s)) !== null) {
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
console.log(m[1]);
}
});
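To collect both the operation and its value, as in the expected output from the question, the operation can be captured too. This is a sketch, assuming a hypothetical parseOperations helper and String.prototype.matchAll (ES2020); the exec loop above would work the same way in older engines.

```javascript
// Hypothetical helper: returns [operation, value] pairs.
function parseOperations(input) {
  // Group 1: the operation. Group 2: every following character, consumed one
  // at a time, as long as no operation starts at that position.
  const re = /(>=|<=|>|<|!|=)((?:(?!>=|<=|>|<|!|=).)+)/g;
  return Array.from(input.matchAll(re), m => [m[1], m[2]]);
}

console.log(JSON.stringify(parseOperations(">=25!30<50")));
// [[">=","25"],["!","30"],["<","50"]]
console.log(JSON.stringify(parseOperations("<=2017-01-01")));
// [["<=","2017-01-01"]]
```

Listing >= and <= before > and < in the alternation matters: the regex engine tries alternatives left to right, so the two-character operators must come first to be matched whole.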
You can use this code
var str = ">=25!30<50";
var pattern = RegExp(/(?:([\<\>\=\!]{1,2})(\d+))/, "g");
var output = [];
let matchs = null;
while((matchs = pattern.exec(str)) != null) {
output.push([matchs[1], matchs[2]]);
}
console.log(output);
Output array:
0: Array [ ">=", "25" ]
1: Array [ "!", "30" ]
2: Array [ "<", "50" ]
I think this is what you need:
/((?:>=|<=|>|<|!|=)[^>=<!]+)/g
The ^ at the start of the character class negates it, so [^>=<!] matches any character that is not one of the operator characters; + means one or more of them.
