I have a text file containing strings separated by whitespace. The file contains some special characters (Latin, currency, punctuation, etc.) which need to be discarded from the final output. Please note that the legal characters are all Unicode characters except these special characters.
We need to split the text on whitespace and then remove only the leading and trailing special characters. If special characters sit between two legal characters, we do not remove them.
I can easily do it in two phases: split the text on whitespace, then remove the leading and trailing special characters from each string. However, I need to process the string only once. Is there a way to do this in one pass? Note: we can't use RegEx.
For this question assume that these characters are special:
[: , ! . < ; ' " > [ ] { } ` ~ = + - ? / ]
Example:
:!/,.<;:.?;,BBM!/,.<;:.?;,` IS TALKING TO `B!?AM!/,.<;:.?;,
Here the output would be an array of valid strings: ["BBM", "IS", "TALKING", "TO", "B!?AM"]
Make a simple state machine (finite automaton):
Walk in a loop through all chars.
At every step, check whether the current char is a letter, a space, or a special character.
Execute some operation (perhaps empty) depending on the state and the kind of char.
Change the state if needed.
For example, you may stay in the "special" state until a letter is met. Then remember the starting index of the word and switch the state to "inside word". Continue until a special char or a space is met (which of the two ends the word is still not clear from your question).
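The steps above can be sketched as follows. This is only an illustration under stated assumptions, not the answerer's code: the names SPECIAL, WHITESPACE, and splitAndTrim are made up, the special set is the one listed in the question, only space, tab, CR, and LF count as whitespace, and special characters between two legal characters are kept.

```javascript
// Hypothetical helper sets; the special characters come from the question's list.
const SPECIAL = new Set(":,!.<;'\">[]{}`~=+-?/");
const WHITESPACE = new Set(" \t\r\n");

function splitAndTrim(text) {
  const words = [];
  // "before": no legal character seen yet in the current token.
  // "inside": at least one legal character has been seen.
  let state = "before";
  let start = 0; // index of the first legal character of the token
  let end = 0;   // index just past the last legal character seen so far
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (WHITESPACE.has(ch)) {
      // A token ends here; emit it only if it contained a legal character.
      if (state === "inside") words.push(text.slice(start, end));
      state = "before";
    } else if (!SPECIAL.has(ch)) {
      // Legal character: open the word if needed and extend its end.
      if (state === "before") {
        start = i;
        state = "inside";
      }
      end = i + 1;
    }
    // Special character: no action. Leading ones fall before `start`,
    // trailing ones fall after `end`, inner ones lie inside the slice.
  }
  // The last token may not be followed by whitespace.
  if (state === "inside") words.push(text.slice(start, end));
  return words;
}

console.log(splitAndTrim(":!/,.<;:.?;,BBM!/,.<;:.?;,` IS TALKING TO `B!?AM!/,.<;:.?;,"));
// → ["BBM", "IS", "TALKING", "TO", "B!?AM"]
```

Tracking only two indices means no intermediate strings are built: each word is sliced out once, when its token ends.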
I used TypeScript and did it in a single pass.
Note that isSpecialCharacterCode(charCode) simply checks whether the character's code matches the code of one of the special characters listed above. isWhitespaceCode(charCode) does the same for whitespace characters.
function parseText(text: string): string[] {
    let words: string[] = [];
    let word = "";
    let charCode = 1;
    let haveSeenLegalChar = false; // set once a legal character has been seen in the current word
    let seenSpecialCharsToInclude = false; // set while special characters are pending after a legal character
    let inBetweenSpecialChars = ""; // pending special characters, kept only if another legal character follows
    for (let index = 0; index < text.length; index++) {
        charCode = text.charCodeAt(index);
        let isSpecialChar = isSpecialCharacterCode(charCode);
        let isWhitespace = isWhitespaceCode(charCode);
        if (isSpecialChar && !isWhitespace) {
            // A special character has two cases.
            // First: it can be part of the word (only possible if we have already seen at least one legal character).
            // We are not yet sure it will end up inside the word, so store it for now.
            // Second: it is a leading or trailing special character, which must not be included in the word.
            if (haveSeenLegalChar) {
                inBetweenSpecialChars += text[index];
                seenSpecialCharsToInclude = true;
            } else {
                // No legal character seen yet, so this is a leading special character.
                seenSpecialCharsToInclude = false;
                inBetweenSpecialChars = "";
            }
        } else if (isWhitespace) {
            // Whitespace marks either the beginning or the end of a word.
            // If we have seen any legal character, push the word into the array.
            if (haveSeenLegalChar) {
                words.push(word);
                word = "";
                inBetweenSpecialChars = "";
            }
            seenSpecialCharsToInclude = false;
            haveSeenLegalChar = false;
        } else if (!isSpecialChar) {
            // Legal character case: any pending special characters are now known to be inside the word.
            haveSeenLegalChar = true;
            if (seenSpecialCharsToInclude) {
                word += inBetweenSpecialChars;
                seenSpecialCharsToInclude = false;
                inBetweenSpecialChars = "";
            }
            word += text[index];
        }
    }
    // The last word is not followed by whitespace, so push it here.
    if (haveSeenLegalChar) {
        words.push(word);
    }
    return words;
}
Related
I have a variable, var text = 'Saif'.
How can I check whether the first character of this value (S) is a letter, a number, or a special character?
I've already tried the code below:
var text = 'Saif'
var char = /[A-Z]/g
var num = /[0-9]/g
if (text.match(char)) {
    console.log("The string starts with Letter")
} else if (text.match(num)){
    console.log("The string starts with Number")
} else {
    console.log("The string starts with Special character")
}
It works fine for the letter and number cases, but I'm not able to detect a special character in place of a letter or number.
How can I do that?
First of all, char was a reserved word in older editions of JavaScript (ECMAScript 3), so it's best not to use it as a variable name.
Secondly, if you want to test a pattern but not actually retrieve the match, use test() rather than match().
Thirdly, your current patterns are not anchored to the start of the string; they match a letter or digit anywhere in it.
if (/^[a-z]/i.test(text))
    console.log("The string starts with Letter")
else if (/^\d/.test(text))
    console.log("The string starts with Number")
else
    console.log("The string starts with Special character")
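To make the effect of the ^ anchor concrete, here is a small sketch. The function name firstCharKind is made up for illustration, and treating an empty string as "special" is an assumption that simply follows from neither pattern matching.

```javascript
// Classify the first character of a string (ASCII letters and digits only).
function firstCharKind(text) {
  if (/^[a-z]/i.test(text)) return "letter";
  if (/^\d/.test(text)) return "number";
  return "special"; // also returned for an empty string
}

// Without the anchor, the pattern matches a letter anywhere in the string.
console.log(/[A-Z]/.test("1Saif"));  // true: the "S" matches
// With the anchor, only the first character is considered.
console.log(/^[A-Z]/.test("1Saif")); // false

console.log(firstCharKind("Saif"));  // "letter"
console.log(firstCharKind("1Saif")); // "number"
console.log(firstCharKind("_Saif")); // "special"
```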
Try this:
var text = 's2Saif'
var char = /^[A-Z]/g
var num = /^\d/g
if (text.match(char)) {
    console.log("The string starts with Letter")
} else if (text.match(num)){
    console.log("The string starts with Number")
} else {
    console.log("The string starts with Special character")
}
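Should Letter include lowercase characters? If so, use var char = /^[a-z]/i. Avoid /^\w/ here: \w also matches digits and the underscore, so a string starting with a digit would be reported as starting with a letter.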
Give this a try:
var format = /^[ `!@#$%^&*()_+\-=\[\]{};':"\\|,.<>\/?~]/;
// test() returns true or false; the leading ^ restricts the check to the first character.
if (format.test(text)) {
    console.log("The string starts with Special character");
}
I am getting very confused in writing the regex pattern for my requirement.
I want that a text field should not accept any special character except underscore and hyphen. Also, it shouldn't accept underscore, hyphen, and space if entered alone in the text field.
I tried the following pattern:
/[ !@#$%^&*()+\=\[\]{};':"\\|,.<>\/?]/;
but this also allows underscore and hyphen, as well as a space, when entered alone.
Rather than matching what you do not want, you should match what you actually want. Since you never specified whether your string could contain letters, numbers, and spaces, I assumed it was a single word, so I matched uppercase and lowercase letters only, along with underscore and hyphen.
^(([A-Za-z])+([\-|_ ])?)+$
I have created a regex101 if you wish to try more cases.
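One detail worth checking in the pattern above: inside a character class, | is a literal pipe rather than an alternation, so [\-|_ ] also accepts | and a space. The sketch below shows a stricter variant of the same "match what you want" idea. The names NAME_PATTERN and isValidName are made up, and it assumes that separators (hyphen, underscore, space) are allowed only between runs of letters, which is stricter than the pattern above.

```javascript
// One or more letters, then zero or more groups of "separator + letters".
// A separator alone, or a leading or trailing separator, cannot match.
const NAME_PATTERN = /^[A-Za-z]+(?:[-_ ][A-Za-z]+)*$/;

function isValidName(str) {
  return NAME_PATTERN.test(str);
}

console.log(isValidName("Mathus-Mark")); // true
console.log(isValidName("Mathus_Mark")); // true
console.log(isValidName("Mathus Mark")); // true
console.log(isValidName("-"));           // false
console.log(isValidName("-_ "));         // false
console.log(isValidName("Mathus|Mark")); // false
```

Because every group must start with a separator, there is only one way to split a given input into groups, which avoids the ambiguity of nesting one + quantifier directly inside another.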
If the string must not contain special characters other than underscore and hyphen, and a string made up only of spaces, underscores, and hyphens must also be rejected, you can handle that exception as a separate check. This keeps the code easier to understand and easier to adapt to further exceptions.
function validateString(str){
    // Characters that are never allowed.
    let forbidden = /[!@#$%^&*()+=\[\]{};':"\\|,.<>\/?]/;
    // The separate exception: only spaces, underscores and hyphens (or nothing at all).
    let onlySeparators = /^[ _-]*$/;
    if(forbidden.test(str) || onlySeparators.test(str)){
        // the string contains a forbidden character or consists of separators alone
        console.log(str + ": Invalid input");
    }
    else{
        console.log(str + ": Valid string");
    }
}
let str = "-_ ";
let str1 = "Mathus-Mark";
let str2 = "Mathus Mark";
let str3 = "Mathus_Mark";
let str4 = " ";
let str5 = "-";
let str6 = "_";
validateString(str);
validateString(str1);
validateString(str2);
validateString(str3);
validateString(str4);
validateString(str5);
validateString(str6);
var input = [paul, Paula, george];
var newReg = \paula?\i
for(var text in input) {
    if (newReg.test(text) == true) {
        input[input.indexOf(text)] = george
    }
}
console.log(input)
I don't know what's wrong with my code. It should change paul and Paula to george, but when I run it, it says there's an illegal character.
The backslash (\) is an escape character in JavaScript (as in many other C-like languages). When JavaScript encounters a backslash in a string or regex literal, it treats it together with the following character as an escape sequence. For instance, \n is a newline character (rather than a backslash followed by the letter n).
So that's what is causing your error: regex literals are delimited by forward slashes, so you need to replace \paula?\i with /paula?/i.
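A short sketch of both points, using made-up variable names: the first part shows the escape behaviour in string literals, and the second shows the corrected regex literal from the question.

```javascript
// "\n" is a single line-feed character; "\\n" is a backslash followed by "n".
const newline = "\n";
const backslashN = "\\n";
console.log(newline.length, backslashN.length); // 1 2

// A regex literal is written between forward slashes, with flags after the closing one.
const newReg = /paula?/i;
console.log(newReg.test("paul"));   // true
console.log(newReg.test("Paula"));  // true
console.log(newReg.test("george")); // false
```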
You need to replace \ with / in your regexp pattern.
You should wrap the strings in quotes (").
You need to index into the array correctly: in a for...in loop, val is just the index of the word, not the word itself.
var input = ["paul", "Paula", "george"];
var newReg = /paula?/i;
for (var val in input) {
    // val is the index (as a string), so input[val] is the word itself
    if (newReg.test(input[val])) {
        input[val] = "george";
    }
}
console.log(input);
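As an alternative to indexing with for...in, the same replacement can be written with Array.prototype.map, which builds a new array instead of modifying the original. The function name replaceMatching is made up for illustration, and the sketch assumes the pattern has no g flag (with g, test() keeps state between calls).

```javascript
// Return a copy of `items` in which every element matching `pattern`
// is replaced by `replacement`.
function replaceMatching(items, pattern, replacement) {
  return items.map((item) => (pattern.test(item) ? replacement : item));
}

const names = ["paul", "Paula", "ringo"];
console.log(replaceMatching(names, /paula?/i, "george"));
// → ["george", "george", "ringo"]
console.log(names);
// → ["paul", "Paula", "ringo"] (unchanged)
```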
I am trying to use XRegExp to test if a string is a valid word according to these criteria:
The string begins with one or more Unicode letters, followed by
an apostrophe (') followed by one or more Unicode letters, repeated 0 or more times.
The string ends immediately after the matched pattern.
That is, it will match these terms
Hello can't Alah'u'u'v'oo O'reilly
but not these
eatin' 'sup 'til
I am trying this pattern,
^(\\p{L})+('(\\p{L})+)*$
but it won't match any words that contain apostrophes. What am I doing wrong?
EDIT: The code using the regex
var separateWords = function(text) {
    var word = XRegExp("(\\p{L})+('(\\p{L})+)*$");
    var splits = [];
    for (var i = 0; i < text.length; i++) {
        var item = text[i];
        while (i + 1 < text.length && word.test(item + text[i + 1])) {
            item += text[i + 1];
            i++;
        }
        splits.push(item);
    }
    return splits;
};
I think you will need to omit the string start/end anchors to match single words:
"(\\p{L})+('(\\p{L})+)*"
Also I'm not sure what those capturing groups are needed for (that may depend on your application), but you could shorten them to
"\\p{L}+('\\p{L}+)*"
Try this regex:
^[^'](?:[\w']*[^'])?$
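First it checks that the first character is not an apostrophe. Then it either matches any number of word characters or apostrophes followed by one character that is not an apostrophe, or it matches nothing more (a one-letter word). Note that [^'] accepts any character other than an apostrophe, including digits and punctuation, and that \w in JavaScript covers only ASCII letters, digits, and the underscore, so this pattern does not follow the Unicode-letter rule in the question exactly.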
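For comparison, the criteria from the question can also be checked with a native regular expression, assuming an engine that supports Unicode property escapes (the u flag, ES2018 and later) instead of XRegExp. The names WORD and isWord are illustrative only.

```javascript
// One or more Unicode letters, then zero or more groups of
// "apostrophe + one or more Unicode letters", anchored at both ends.
const WORD = /^\p{L}+(?:'\p{L}+)*$/u;

function isWord(s) {
  return WORD.test(s);
}

// Terms from the question that should match.
console.log(["Hello", "can't", "Alah'u'u'v'oo", "O'reilly"].map(isWord));
// → [true, true, true, true]

// Terms from the question that should not match.
console.log(["eatin'", "'sup", "'til"].map(isWord));
// → [false, false, false]
```

The anchors are kept here because the function validates a whole candidate string; to find words inside longer text, the same pattern without ^ and $ would be used, as the first answer suggests.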
I would like to limit the substr by words and not chars. I am thinking regular expression and spaces but don't know how to pull it off.
Scenario: Limit a paragraph of words to 200 words using javascript/jQuery.
var $postBody = $postBody.substr(' ',200);
This is great but splits words in half :) Thanks ahead of time!
function trim_words(theString, numWords) {
    // trim() avoids an empty first element when the text starts with whitespace
    var expString = theString.trim().split(/\s+/, numWords);
    var theNewString = expString.join(" ");
    return theNewString;
}
If you're satisfied with a not-quite-accurate solution, you could simply keep a running count of the space characters in the text and assume that it equals the number of words.
Otherwise, I would use split() on the string with " " as the delimiter and then take the length of the array that split() returns.
Very quick and dirty:
$("#textArea").val().split(/\s/).length
I suppose you need to consider punctuation and other non-word, non-whitespace characters as well. You want 200 words, not counting whitespace and non-letter characters.
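function limit_words(text, max_words) {
    var word_count = 0;
    var in_word = false;
    for (var x = 0; x < text.length; x++) {
        // Assumption: only ASCII letters count as word characters; adjust the test as needed.
        if (/[A-Za-z]/.test(text[x])) {
            if (!in_word) word_count++;
            in_word = true;
        } else {
            in_word = false;
        }
        // The word that reached the limit has just ended: cut the string at position x.
        if (!in_word && word_count >= max_words) return text.slice(0, x);
    }
    return text; // fewer than max_words words, or the text ends inside the last one
}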
You should also decide whether you treat digits as a word, and whether you treat single letters as a word.