Regex won't match words as expected - javascript

I am trying to use XRegExp to test if a string is a valid word according to these criteria:
The string begins with one or more Unicode letters, followed by
an apostrophe (') followed by one or more Unicode letters, repeated 0 or more times.
The string ends immediately after the matched pattern.
That is, it will match these terms
Hello can't Alah'u'u'v'oo O'reilly
but not these
eatin' 'sup 'til
I am trying this pattern,
^(\\p{L})+('(\\p{L})+)*$
but it won't match any words that contain apostrophes. What am I doing wrong?
EDIT: The code using the regex
var separateWords = function(text) {
var word = XRegExp("(\\p{L})+('(\\p{L})+)*$");
var splits = [];
for (var i = 0; i < text.length; i++) {
var item = text[i];
while (i + 1 < text.length && word.test(item + text[i + 1])) {
item += text[i + 1];
i++;
}
splits.push(item);
}
return splits;
};

I think you will need to omit the string start/end anchors to match single words:
"(\\p{L})+('(\\p{L})+)*"
Also I'm not sure what those capturing groups are needed for (that may depend on your application), but you could shorten them to
"\\p{L}+('\\p{L}+)*"

Try this regex:
^[^'](?:[\w']*[^'])?$
First it checks to ensure the first character is not an apostrophe. Then it either gets any number of word characters or apostrophes followed by anything other than an apostrophe, or it gets nothing (one-letter word).
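One likely reason the loop in the question stops at apostrophes: it grows the candidate one character at a time, and an intermediate string such as "can'" does not match the pattern, so the word is cut off before the apostrophe. A minimal sketch that applies the question's pattern to the whole text instead is shown below. It assumes a JavaScript engine with native Unicode property escapes (the `u` flag) rather than XRegExp. The helper names are illustrative, and unlike the original `separateWords`, this version returns only the words and drops everything in between.

```javascript
// Native Unicode property escapes require the "u" flag (ES2018 and later).
// Anchored form: the whole string must be a single word.
const wholeWord = /^\p{L}+(?:'\p{L}+)*$/u;
// Unanchored, global form: finds every word inside a longer text.
const anyWord = /\p{L}+(?:'\p{L}+)*/gu;

// Hypothetical helper: true if the entire string is one valid word.
function isWord(s) {
  return wholeWord.test(s);
}

// Hypothetical helper: returns all words found in the text.
function separateWords(text) {
  return text.match(anyWord) || [];
}

console.log(["Hello", "can't", "Alah'u'u'v'oo", "O'reilly"].map(isWord)); // all true
console.log(["eatin'", "'sup", "'til"].map(isWord)); // all false
console.log(separateWords("I can't go")); // [ 'I', "can't", 'go' ]
```

The non-capturing group `(?:...)` replaces the capturing groups from the question, since the captured values are never used.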

Using regex, extract values from a string in javascript

I need to extract values from a string using a regex (for performance reasons).
Cases might be as follows:
RED,100
RED,"100"
RED,"100,"
RED,"100\"ABC\"200"
The resulting separated [label, value] array should be:
['RED','100']
['RED','100']
['RED','100,']
['RED','100"ABC"200']
I looked into existing solutions, and even a popular library just splits the entire string to get the values;
e.g. 'RED,100'.split(/,/) might just do the thing.
But I am trying to write a regex that splits on a comma only if that comma is not enclosed in a quoted value.
This might not be standard CSV behaviour, but it is very easy for end users to enter values:
enter label,value. Do whatever you like inside the value if it is surrounded by quotes. To include a quote, escape it with a backslash.
Any help is appreciated.
You can use this regex that takes care of escaped quotes in string:
/"[^"\\]*(?:\\.[^"\\]*)*"|[^,"]+/g
RegEx Explanation:
": Match a literal opening quote
[^"\\]*: Match 0 or more of any character that is not \ and not a quote
(?:\\.[^"\\]*)*: Followed by escaped character and another non-quote, non-\. Match 0 or more of this combination to get through all escaped characters
": Match closing quote
|: OR (alternation)
[^,"]+: Match 1+ of non-quote, non-comma string
const regex = /"[^"\\]*(?:\\.[^"\\]*)*"|[^,"]+/g;
const arr = [`RED,100`, `RED,"100"`, `RED,"100,"`,
`RED,"100\\"ABC\\"200"`];
let m;
for (var i = 0; i < arr.length; i++) {
var str = arr[i];
var result = [];
while ((m = regex.exec(str)) !== null) {
result.push(m[0]);
}
console.log("Input:", str, ":: Result =>", result);
}
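The matches above still include the surrounding quotes and the escaping backslashes, while the question asks for the bare values. A minimal sketch of a post-processing step follows; `unquote` and `parseLine` are hypothetical helper names, and the unescaping rule (a backslash followed by any character yields that character) is an assumption about the intended input format.

```javascript
// Same token pattern as in the answer above.
const tokenPattern = /"[^"\\]*(?:\\.[^"\\]*)*"|[^,"]+/g;

// Hypothetical helper: strips surrounding quotes and unescapes \" and \\.
function unquote(token) {
  if (token.length >= 2 && token.startsWith('"') && token.endsWith('"')) {
    return token.slice(1, -1).replace(/\\(.)/g, "$1");
  }
  return token;
}

// Hypothetical helper: splits one input line into its unquoted fields.
function parseLine(line) {
  return (line.match(tokenPattern) || []).map(unquote);
}

console.log(parseLine('RED,100'));               // [ 'RED', '100' ]
console.log(parseLine('RED,"100,"'));            // [ 'RED', '100,' ]
console.log(parseLine('RED,"100\\"ABC\\"200"')); // [ 'RED', '100"ABC"200' ]
```

Note that the last input is written with `\\"` in the JavaScript source so that the string itself contains a backslash before each inner quote.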
You could use String#match and take only the groups.
var array = ['RED,100', 'RED,"100"', 'RED,"100,"', 'RED,"100\\"ABC\\"200"'];
console.log(array.map(s => s.match(/^([^,]+),(.*)$/).slice(1)))

How to get the last alphabetic character in string?

Here, I want to give each character a space except for the last alphabetic one:
var str = "test"
var result = "";
for (var i = 0; i < str.length; i++) {
result += (/*condition*/) ? str[i]
: str[i] + " ";
}
console.log(result);
So it prints "t e s t".
I tried the condition (i === str.length - 1), but it doesn't work when the string has a period (.) as its last character ("test."), since I only want to target alphabetic characters.
You can use a regular expression to remove all non-alphabetical characters first, and then do a split/join combination to insert the spaces (or use another regex):
var str = "don't test.";
var result = str.replace(/[^a-z]/gi, '').split('').join(' ');
console.log(result);
"testasdf. asdf asdf asd.d?".replace(/./g,"$& ").replace(/([A-Za-z]) ([^A-Za-z]*)$/, '$1$2')
The first replace adds a space after every character.
The second replace removes the space after the last letter.
console.log("testasdf?".replace(/./g,"$& ").replace(/([A-Za-z]) ([^A-Za-z]*)$/, '$1$2'));
console.log("Super test ! Yes".replace(/./g,"$& ").replace(/([A-Za-z]) ([^A-Za-z]*)$/, '$1$2'));
Lookahead assertions can handle this: add a space after every letter that is still followed by another letter somewhere later in the string.
So it could be
str.replace(/([a-z])(?=.*[a-z])/gi, "$1 ")
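A small runnable sketch of the lookahead idea, wrapped in a hypothetical helper named `spaceOutLetters`. It assumes "alphabetic" means ASCII letters only, and it uses the `g` flag so that every letter is processed and the `s` flag so that the lookahead can also see past line breaks.

```javascript
// Adds a space after each ASCII letter that has another letter later in the string.
// The last letter has no letter after it, so it gets no trailing space.
function spaceOutLetters(str) {
  return str.replace(/[a-z](?=.*[a-z])/gis, "$& ");
}

console.log(spaceOutLetters("test"));  // "t e s t"
console.log(spaceOutLetters("test.")); // "t e s t."
```

Non-letter characters are left untouched here; only letters receive a trailing space.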
Using a regex replace function, you can space out all of your characters like this:
"test string, with several words.".replace(/\w{2,}/g, match => match.split('').join(' '))
Explanation
\w{2,} matches 2 or more word characters (letters, digits, or underscores)
match.split('').join(' ') split each match into characters, and rejoin with spaces between

Remove leading and trailing characters from a

I have a text file that contains strings separated by whitespace. The file contains some special characters (Latin, currency, punctuation, etc.) which need to be discarded from the final output. Please note that the legal characters are all Unicode characters except these special characters.
We need to split the text on whitespace and then remove only the leading and trailing special characters. If special characters sit between two legal characters, we don't remove them.
I can easily do it in two phases: split the text on whitespace, then remove the leading and trailing special characters from each string. However, I need to process the string only once. Is there any way to achieve this in one pass? Note: we can't use RegEx.
For this question assume that these characters are special:
[: , ! . < ; ' " > [ ] { } ` ~ = + - ? / ]
Example:
:!/,.<;:.?;,BBM!/,.<;:.?;,` IS TALKING TO `B!?AM!/,.<;:.?;,
Here output would be an array of valid strings: ["BBM", "IS", "TALKING", "TO", "B!?AM"]
Make a simple state machine (finite automaton):
Walk through all characters in a loop.
At every step, check whether the current character is a letter, a space, or a special character.
Execute some operation (perhaps empty) depending on the state and the kind of character.
Change the state if needed.
For example, you may stay in a "special" state until a letter is met. Remember the starting index of the word and switch to the "inside word" state. Continue until a special character or a space is met (which of the two ends a word is still not clear from your question).
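The steps above can be sketched as a small single-pass loop without any regular expressions. This is only one possible reading of the requirements: the special-character set is copied from the question, whitespace is assumed to be space, tab, and line breaks, and the names `SPECIAL`, `isWhitespace`, and `parseWords` are illustrative.

```javascript
// Special characters, taken from the list in the question.
const SPECIAL = new Set(":,!.<;'\">[]{}`~=+-?/");

// Assumption: only these four characters count as whitespace.
const isWhitespace = (ch) => ch === " " || ch === "\t" || ch === "\n" || ch === "\r";

function parseWords(text) {
  const words = [];
  let word = "";    // current word: legal chars plus confirmed inner specials
  let pending = ""; // specials seen after a legal char; kept only if a legal char follows

  // Ends the current token: trailing specials in `pending` are discarded.
  const flush = () => {
    if (word !== "") words.push(word);
    word = "";
    pending = "";
  };

  for (const ch of text) {
    if (isWhitespace(ch)) {
      flush();
    } else if (SPECIAL.has(ch)) {
      // Leading specials (no legal char seen yet) are skipped outright.
      if (word !== "") pending += ch;
    } else {
      // A legal char confirms any buffered specials as part of the word.
      word += pending + ch;
      pending = "";
    }
  }
  flush(); // the text may end without trailing whitespace
  return words;
}

console.log(parseWords(":!/,.<;:.?;,BBM!/,.<;:.?;,` IS TALKING TO `B!?AM!/,.<;:.?;,"));
// [ 'BBM', 'IS', 'TALKING', 'TO', 'B!?AM' ]
```

The two implicit states are "no legal character seen yet" (`word` is empty) and "inside a word" (`word` is non-empty), with `pending` deciding whether a run of special characters was inner or trailing.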
I have used TypeScript and done it in a single pass.
Please note that the isSpecialCharacterCode(charCode) function simply checks whether the character code of the current character matches the code of one of the given special characters. The same is true of the isWhitespaceCode(charCode) function for whitespace characters.
function parseText(text: string): string[] {
    const words: string[] = [];
    let word = "";
    let haveSeenLegalChar = false; // set once we have encountered a legal character in the current word
    let seenSpecialCharsToInclude = false; // set while special characters are buffered after a legal character
    let inBetweenSpecialChars = ""; // buffered special characters that may end up in between legal characters
    for (let index = 0; index < text.length; index++) {
        const charCode = text.charCodeAt(index);
        const isSpecialChar = isSpecialCharacterCode(charCode);
        const isWhitespace = isWhitespaceCode(charCode);
        if (isSpecialChar && !isWhitespace) {
            // If this is a special character, there are two cases.
            // First: it can be part of the word (only possible if we have already seen at least one legal character).
            // We are not yet sure whether it will be part of the word, so store it for now.
            // Second: it is a leading or trailing special character, which must not be included in the word.
            if (haveSeenLegalChar) {
                inBetweenSpecialChars += text[index];
                seenSpecialCharsToInclude = true;
            } else {
                // We have not seen any legal character so far, so this must be a leading special character.
                seenSpecialCharsToInclude = false;
                inBetweenSpecialChars = "";
            }
        } else if (isWhitespace) {
            // We have encountered whitespace. This is either the beginning or the end of a word.
            // If we have encountered any legal character, push the word into the array.
            if (haveSeenLegalChar) {
                words.push(word);
                word = "";
                inBetweenSpecialChars = "";
            }
            haveSeenLegalChar = false;
            seenSpecialCharsToInclude = false;
        } else if (!isSpecialChar) {
            // Legal character case: buffered special characters are now known to be inside the word.
            haveSeenLegalChar = true;
            if (seenSpecialCharsToInclude) {
                word += inBetweenSpecialChars;
                seenSpecialCharsToInclude = false;
                inBetweenSpecialChars = "";
            }
            word += text[index];
        }
    }
    // The text may end without trailing whitespace, so push the last word as well.
    if (haveSeenLegalChar) {
        words.push(word);
    }
    return words;
}

Replace matching elements in array using regular expressions: invalid character

var input = [paul, Paula, george];
var newReg = \paula?\i
for(var text in input) {
if (newReg.test(text) == true) {
input[input.indexOf(text)] = george
}
}
console.log(input)
I don't know what's wrong with my code. It should change paul and Paula to george, but when I run it, it says there's an illegal character.
The backslash (\) is an escape character in Javascript (along with a lot of other C-like languages). This means that when Javascript encounters a backslash, it tries to escape the following character. For instance, \n is a newline character (rather than a backslash followed by the letter n).
So that's what is causing your error: you need to replace \paula?\i with /paula?/i
You need to replace \ with / in your regexp pattern.
You should wrap the strings in quotes (").
You need to index into your array correctly: in a for...in loop, val is just the index of the word, not the word itself.
var input = ["paul", "Paula", "george"];
var newReg = /paula?/i;
for (var val in input) {
if (newReg.test(input[val]) == true) {
input[input.indexOf(input[val])] = "george";
}
}
console.log(input);
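As an alternative to the index-based loop, the same replacement can be sketched with Array.prototype.map, which passes each element directly and avoids the for...in and indexOf lookups. The variable names `namePattern` and `output` are illustrative, and the pattern is kept unanchored as in the answer, so any string containing "paul" would also be replaced.

```javascript
const input = ["paul", "Paula", "george"];
const namePattern = /paula?/i; // same pattern as above, without the "g" flag

// Build a new array: matching names become "george", others are kept as they are.
const output = input.map((name) => (namePattern.test(name) ? "george" : name));

console.log(output); // [ 'george', 'george', 'george' ]
```

Unlike the loop, this leaves the original `input` array unchanged and returns a new one.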

Javascript regex parsing dots and whitespaces

In JavaScript I have several words separated by either a dot or one or more whitespace characters (or the end of the string).
I'd like to replace certain parts of it to insert custom information at the appropriate places.
Example:
var x = "test1.test test2 test3.xyz test4";
If there's a dot it should be replaced with ".X_"
If there's one or more space(s) and the word before does not contain a dot, replace with ".X "
So the desired output for the above example would be:
"test1.X_test test2.X test3.X_xyz test4.X"
Can I do this in one regex replace? If so, how?
If I need two or more what would they be?
Thanks a bunch.
Try this:
var str = 'test1.test test2 test3.xyz test4';
str = str.replace(/(\w+)\.(\w+)/g, '$1.X_$2');
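str = str.replace(/( |^)(\w+)(?= |$)/g, '$1$2.X'); // lookahead leaves the trailing space for the next match, so adjacent undotted words are all handled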
console.log(str);
In the first replace it replaces the dot in the dotted words with a .X_, where a dotted word is two words with a dot between them.
In the second replace it adds .X to words that have no dot, where words that have no dot are words that are preceded by a space OR the start of the string and are followed by a space OR the end of the string.
To answer this:
If there's a dot it should be replaced with ".X_"
If there's one or more spaces it should be replaced with ".X"
Do this:
x.replace(/\./g, '.X_').replace(/\s+/g, '.X');
Edit: To get your desired output (rather than your rules), you can do this:
var words = x.replace(/\s+/g, ' ').split(' ');
for (var i = 0, l = words.length; i < l; i++) {
if (words[i].indexOf('.') === -1) {
words[i] += ".X";
}
else {
words[i] = words[i].replace(/\./g, '.X_');
}
}
x = words.join(' ');
Basically...
Strip all multiple spaces and create an array of "words"
Loop through each word.
If it doesn't have a period in it, then add ".X" to the end of the word
Else, replace the periods with ".X_"
Join the "words" back into a string and separate it by spaces.
Edit 2:
Here's a solution using only javascript's replace function:
x.replace(/\s+/g, ' ') // replace multiple spaces with one space
.replace(/\./g, '.X_') // replace dots with .X_
// find words without dots and add a ".X" to the end
.replace(/(^|\s)([^\s\.]+)(?=$|\s)/g, "$1$2.X"); // lookahead so that consecutive undotted words all match
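The per-word rule from the question can also be sketched as a single replace call with a callback, which handles each whitespace-separated word on its own. The function name `tagWords` is illustrative. This version keeps the original whitespace between words instead of collapsing it.

```javascript
// For every run of non-whitespace characters:
//  - if it contains a dot, each dot becomes ".X_"
//  - otherwise ".X" is appended to the word
function tagWords(x) {
  return x.replace(/\S+/g, (word) =>
    word.includes(".") ? word.replace(/\./g, ".X_") : word + ".X"
  );
}

console.log(tagWords("test1.test test2 test3.xyz test4"));
// "test1.X_test test2.X test3.X_xyz test4.X"
```

Because each word is matched as a whole, neighbouring words never compete for the space between them.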
