Trouble with word-boundary (\b) - javascript

I have an array of keywords, and I want to know whether at least one of the keywords is found within some string that has been submitted. I further want to be absolutely sure that it is the keyword that has been matched, and not something that is very similar to the word.
Say, for example, that our keywords are [English, Eng, En] because we are looking for some variation of English.
Now, say that the input from a user is i h8 eng class, or something equally provocative and illiterate; then eng should be matched. It should also fail to match a word like england, or some odd word like chen, even though it contains the en bit.
So, in my infinite lack of wisdom I believed I could do something along the lines of this in order to match one of my array items with the input:
.match(RegExp('\b('+array.join('|')+')\b','i'))
With the thinking that the regular expression would look for matches from the array, now presented like (English|Eng|En) and then look to see whether there were zero-width word bounds on either side.

You need to double the backslashes.
When you create a regex with the RegExp() constructor, you're passing in a string. JavaScript string constant syntax also treats the backslash as a meta-character, for quoting quotes etc. Thus, the backslashes will be effectively stripped out before the RegExp() code even runs!
By doubling them, the step of parsing the string will leave one backslash behind. Then the RegExp() parser will see the single backslash before the "b" and do the right thing.
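A minimal sketch of the difference, using the keyword list from the question. The names keywords, broken and fixed are illustrative and not from the original post:

```javascript
// Keywords from the question; the variable names are illustrative.
const keywords = ['English', 'Eng', 'En'];

// In a string literal, '\b' is the backspace character (U+0008).
console.log('\b'.length, '\b'.charCodeAt(0)); // 1 8
// '\\b' is two characters: a backslash followed by "b".
console.log('\\b'.length); // 2

// Single backslashes: the pattern looks for literal backspace characters.
const broken = new RegExp('\b(' + keywords.join('|') + ')\b', 'i');
// Doubled backslashes: the pattern contains real word boundaries.
const fixed = new RegExp('\\b(' + keywords.join('|') + ')\\b', 'i');

console.log(broken.test('i h8 eng class')); // false
console.log(fixed.test('i h8 eng class'));  // true
console.log(fixed.test('england'));         // false
console.log(fixed.test('chen'));            // false
```

One caveat with this approach: if a keyword can contain regex metacharacters (a dot, a plus sign, and so on), it would need escaping before being joined into the pattern.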

You need to double the backslashes in a JavaScript string or you'll encode a Backspace character:
.match(RegExp('\\b('+array.join('|')+')\\b','i'))

You need to double-escape \b, because it has a special meaning in strings:
.match(RegExp('\\b('+array.join('|')+')\\b','i'))

\b is an escape sequence inside string literals (see table 2.1 on this page). You should escape it by adding one extra backslash:
.match(RegExp('\\b('+array.join('|')+')\\b','i'))
You do not need to escape \b when used inside a regular expression literal:
/\b(english|eng|en)\b/i

Related

What is this "/\,$/"?

I tried to search for /\,$/ online, but couldn't find anything.
I have:
coords = coords.replace(/\,$/, "");
I'm guessing it returns an index into the coords string. What should I search for online so I can learn more?
/\,$/ finds the comma character (,) at the end of a string (denoted by the $), and the replace call swaps it for an empty string (""). You sometimes see this in regex code aiming to clean up excerpts of text.
It's a regular expression to remove a trailing comma.
That thing is a Regular Expression, also known as regex or regexp. It is a way to "match" strings using some rules. If you want to learn how to use it in JavaScript, read the Mozilla Developer Network page about RegExp.
By the way, regular expressions are also available in most languages and in some tools. They are a very useful thing to learn.
That's a regular expression that finds a comma at the end of a string. That code removes the comma.
// defines a JavaScript regular expression, used to match a pattern within a string.
\,$ is the pattern
In this case \, translates to ,. A backslash is used to escape special characters, but in this case, it's not necessary. An example where it would be necessary would be to remove trailing periods. If you tried to do that with /.$/ the period here has a different meaning; it is used as a wildcard to match [almost] any character (aside from newline characters). So in this case to match on "." (period character) you would have to escape the wildcard (/\.$/).
When $ is placed at the end of the pattern, it means only look at the end of the string. This means that you can't mistakenly find a comma anywhere in the middle of the string (e.g., not after help in help, me,), only at the end (trailing). It can also speed up the regular expression search. If you wanted to match on characters only at the beginning of the string, you would start off the pattern with a caret (^); for instance, /^,/ would find a comma at the start of a string if one existed.
It's also important to note that you're only removing one comma, whereas if you use the plus (+) after the comma, you'd be replacing one or more: /,+$/.
Without the +; trailing commas,, becomes trailing commas,
With the +; no trailing comma,, becomes no trailing comma
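To see the behaviour described above in one place, here is a small sketch. The input strings and variable names are only illustrative:

```javascript
// Illustrative inputs; only the patterns come from the discussion above.
const one = 'trailing commas,,'.replace(/,$/, '');      // removes one comma
const many = 'no trailing comma,,'.replace(/,+$/, '');  // removes the whole run
const middle = 'help, me'.replace(/,$/, '');            // no comma at the end
const dot = 'done.'.replace(/\.$/, '');                 // escaped period

console.log(one);    // "trailing commas,"
console.log(many);   // "no trailing comma"
console.log(middle); // "help, me" (unchanged)
console.log(dot);    // "done"
```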

How to collate sequence \" using ECMAScript regular expressions?

I'm trying to construct a regular expression to treat delimited speech marks (\") as a single character.
The following code compiles fine, but terminates on trying to initialise rgx, throwing the error Abort trap: 6 using libc++.
std::regex rgx("[[.\\\\\".]]");
std::smatch results;
std::string test_str("\\\"");
std::regex_search(test_str, results, rgx);
If I remove the [[. .]], it runs fine, results[0] returning \" as intended, but as said, I'd like for this sequence to be usable as a character class.
Edit: Ok, I realise now that my previous understanding of collated sequences was incorrect, and the reason it wouldn't work is that \\\\\" is not defined as a sequence. So my new question: is it possible to define collated sequences?
So I figured out where I was going wrong and thought I'd leave this here in case anyone stumbles across it.
You can specify a passive group of characters with (?:sequence), allowing quantifiers to be applied as with a character class. Perhaps not exactly what I'd originally asked, but fulfils the same purpose, in my case at least.
To match a string beginning and ending with double quotation marks (including these characters in the results), but allowing delimited quotation marks within the string, I used the expression
\"(?:[^\"\\\\]+|(?:\\\\\\\\)+|\\\\\")*\"
which says to grab as many characters as possible, provided they are not quotation marks or backslashes; if this does not match, to attempt to match firstly an even number of backslashes (to allow delimiting of this character), or secondly a delimited quotation mark. (Note that a second ^ inside the brackets would be a literal caret, and so would wrongly exclude carets from the string.) This non-capturing group is matched as many times as possible, stopping only when it reaches a \".
I couldn't comment on the efficiency of this, but it definitely works.
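Since std::regex uses the ECMAScript grammar by default, the same idea can be tried out quickly in JavaScript. This is only a sketch under that assumption: it runs on JavaScript's engine rather than libc++, it drops the extra layer of C++ string escaping, the character class is written as [^"\\], and the sample input is made up:

```javascript
// Quoted string, with backslash-escaped quotes allowed inside.
const quoted = /"(?:[^"\\]+|(?:\\\\)+|\\")*"/;

// Made-up sample text: say "he said \"hi\" ok" end
const sample = 'say "he said \\"hi\\" ok" end';
const found = sample.match(quoted);

// The match includes both outer quotes and the escaped inner ones.
console.log(found[0]); // "he said \"hi\" ok"
```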

Regular expression in JS for alphanumeric, dot and hyphen

I need a JS regular expression which should allow only the word having alphanumeric, dot and hyphen.
Let me know if this is correct.
var regex = /^[a-zA-Z_0-9/.-]+$/;
Almost. That will also allow underscores and slashes. Remove those from your range:
var regex = /^[a-zA-Z0-9.-]+$/;
This will also not match the empty string. That may be what you want, but it also may not be what you want. If it's not what you want, change + to *.
The first simplification I'd consider is the "word character" shorthand '\w'. In JavaScript it is equivalent to 'A-Za-z0-9_', so it is shorter than 'a-zA-Z0-9', but note that it also admits the underscore and does not match accented alphabetic characters (other regex flavors differ on this). There is also the "digit character" shorthand '\d', which is redundant once '\w' is in the class.
Also, although dot is special in most places in regular expressions, it's not special inside square brackets, and doesn't need to be quoted there. (Besides, the single-character quote character is backslash, not forward slash. That forward slash of yours inside the brackets is the same character that begins and ends the RE. Modern JavaScript engines accept it inside brackets and treat it as a literal slash, which is why your pattern lets slashes through, but some parsers may end the RE there and report an error.) Since we're completely throwing it away, it no longer matters whether it should be forward slash or backslash, quoted or bare.
And as you've noticed, hyphen has a special meaning of "range" inside brackets (ex: a-z), so if you want a literal hyphen you have to do something a little different. By convention that something is to put the literal hyphen first inside the brackets.
So my result would be var regex = /^[-.\w\d]+$/;
(As you've probably noticed, there's almost always more than one way to express a regular expression so it works, and RE weenies spend as much time on a) economy of expression and b) run-time performance as they do on getting it "correct". In other words, you can ignore much of what I've just said, as it doesn't really matter to you. I think all that really matters is a) getting rid of that extraneous forward-slash and b) moving the literal hyphen to be the very first character inside the square brackets.)
(Another thought: very frequently when accepting alphabetic characters and hyphens, underscore is acceptable too ...so did you really mean to have that underscore after all?)
(Yet another thought: sometimes the very first character of an identifier must be an alpha, in which case what you probably want is var regex = /^[a-zA-Z][-.\w\d]*$/; You may want a different rule for the very first character in any case, as the naive recipe above would allow "-" and "." as legitimate words of length one.)
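A quick way to compare the variants discussed in this thread is to run them against a few inputs. The sample strings and the names strict and shorthand below are made up for illustration:

```javascript
// Strict: letters, digits, dot and hyphen only.
const strict = /^[a-zA-Z0-9.-]+$/;
// Shorthand: \w also admits the underscore.
const shorthand = /^[-.\w]+$/;

console.log(strict.test('file-1.txt'));    // true
console.log(strict.test('file_1.txt'));    // false
console.log(strict.test('a/b'));           // false
console.log(strict.test(''));              // false
console.log(shorthand.test('file_1.txt')); // true
```

If underscores must be rejected, the explicit ranges are the safer choice.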

Alternation operator inside square brackets does not work

I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
Replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes character sets, () denotes logical groupings.
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines... to match a literal dot, you have to escape it with a backslash (like \.)
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?
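To make the difference between brackets and parentheses concrete, here is a rough sketch. The URLs are invented, the dot in baidu.com is escaped, and the leading .* is dropped because test() already searches anywhere in the string:

```javascript
// Character class: matches ONE of the characters w, d, |, o, r, q.
const withClass = /baidu\.com.*[\/?].*[wd|word|qw]=/;
// Non-capturing group: matches one of the whole words wd, word or qw.
const withGroup = /baidu\.com.*[\/?].*(?:wd|word|qw)=/;

// Made-up URLs for illustration.
console.log(withGroup.test('http://www.baidu.com/s?wd=regex'));   // true
console.log(withGroup.test('http://www.baidu.com/s?word=regex')); // true
console.log(withGroup.test('http://www.baidu.com/s?xd=regex'));   // false
console.log(withClass.test('http://www.baidu.com/s?xd=regex'));   // true
```

The last line shows why the bracket version appears to work sometimes: any single d, w, o, r, q or | before the equals sign is enough for it to match.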

Javascript String pattern Validation

I have a string and I want to validate that string so that it must not contain certain characters like '/' '\' '&' ';' etc... How can I validate all that at once?
You can solve this with regular expressions!
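var mystring = "hello";
var yourstring = "bad & string";
// Matches only if the whole string contains no \, / or & characters.
var validRegEx = /^[^\\\/&]*$/;
alert(mystring.match(validRegEx));   // shows "hello"
alert(yourstring.match(validRegEx)); // shows null
Matching against the regex returns a match array (whose first element is the matched string) if the string is OK, or null if it's invalid!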
Explanation:
JavaScript RegEx Literals are delimited like strings, but with slashes (/'s) instead of quotes ("'s).
The first and last characters of validRegEx make it match against the whole string instead of just part of it: the caret anchors it to the beginning, and the dollar sign to the end.
The part between the brackets ([ and ]) is a character class, which matches any single character that is in the class. The first character inside it, a caret, means that the class is negated, so it matches the characters not listed in the class. If it had been omitted, the class would match the characters it specifies.
The next two sequences, \\ and \/, are backslash-escaped because the backslash by itself would start an escape sequence for something else, and the forward slash would confuse the parser into thinking that it had reached the end of the regex (much like escaping quotes in strings).
The ampersand (&) has no special meaning and is unescaped.
The remaining character, the Kleene star (*), means that whatever preceded it should be matched zero or more times, so the character class will eat as many characters as it can that are not forward slashes, backslashes, or ampersands, including none if it can't find any. If you want to make sure the matched string is non-empty, you can replace it with a plus (+).
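The explanation above can be checked with a short sketch. The pattern is the one from this answer; the variable names and the extra inputs are illustrative:

```javascript
// Same pattern as in the answer: the whole string must avoid \, / and &.
const validPattern = /^[^\\\/&]*$/;

const ok = 'hello'.match(validPattern);
const bad = 'bad & string'.match(validPattern);

console.log(ok[0]); // "hello"
console.log(bad);   // null

// test() gives a plain boolean, which is often all a validator needs.
console.log(validPattern.test('hello')); // true
console.log(validPattern.test('a/b'));   // false
console.log(validPattern.test(''));      // true (the * allows an empty string)
```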
I would use regular expressions.
See this guide from Mozilla.org. This article also gives a good introduction to regular expressions in JavaScript.
Here is a good article on Javascript validation. Remember you will need to validate on the server side too. Javascript validation can easily be circumvented, so it should never be used for security reasons such as preventing SQL Injection or XSS attacks.
You could learn regular expressions, or (probably simpler if you only check for one character at a time) you could have a list of characters and then some kind of sanitize function to remove each one from the string.
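var myString = "An /invalid &string;";
var charList = ['/', '\\', '&', ';']; // etc...
function sanitize(input, list) {
  // Loop over the values; for...in would yield the indices "0", "1", ...
  for (var i = 0; i < list.length; i++) {
    // split/join removes every occurrence, not just the first one
    input = input.split(list[i]).join('');
  }
  return input;
}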
So then:
sanitize(myString, charList) // returns "An invalid string"
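As an alternative sketch, the same clean-up can be done in one pass with a character class and the g flag. The function name stripForbidden is hypothetical, and the set of characters is just the one listed above:

```javascript
// Hypothetical helper: strip every /, \, & and ; from the input.
function stripForbidden(input) {
  // The g flag makes replace() act on all matches, not just the first.
  return input.replace(/[\\\/&;]/g, '');
}

console.log(stripForbidden('An /invalid &string;')); // "An invalid string"
console.log(stripForbidden('a//b'));                 // "ab"
```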
You can use the test method, with regular expressions:
function validString(input){
return !(/[\\/&;]/.test(input));
}
validString('test;') //false
You can use regex. For example if your string matches:
[\\/&;]+
then it is not valid. Look at:
http://www.regular-expressions.info/javascriptexample.html
You could probably use a regular expression.
As the others have answered you can solve this with regexp but remember to also check the value server-side. There is no guarantee that the user has JavaScript activated. Never trust user input!
