Unicode Javascript - Need to display invalid characters back to user - javascript

I'm looking for a solution that will solve the following problem but only have limited experience with Unicode.
Basically the user is able to type into a text field, but when they submit I want to display a list of the characters that WEREN'T GSM compliant, i.e. everything that doesn't have a char code of 0-127.
However, it breaks severely when you bring emojis into the mix, because if I split the string into a char array some emoji characters get broken up and it displays the wrong reason why the validation failed.
E.g. "😀".length is 2, so it gets split into 2 characters, and when I tell the user why it failed they get the wrong reason.
Any ideas on how I can solve this would be greatly appreciated.
EDIT: Can't use ES6 and need an array of the invalid characters

Supposing you’re using a regex like this to find characters that aren’t in the valid range:
/[^\0-\x7f]/
you can modify it to prefer to match UTF-16 surrogate pairs:
/[\ud800-\udbff][\udc00-\udfff]|[^\0-\x7f]/
On modern browsers, you can also just use the u flag to operate on Unicode codepoints directly:
/[^\0-\x7f]/u
This will still only get codepoints, though, and not grapheme clusters (important for combining characters, modern combined emoji, skin tone, and general correctness in all languages). Those are harder to deal with. When (if?) browser support appears, they will be less hard; until then, a dedicated package is your best bet.
var NON_GSM_CODEPOINT = /[\ud800-\udbff][\udc00-\udfff]|[^\0-\x7f]/;
var input = document.getElementById('input');
input.addEventListener('input', function () {
var match = this.value.match(NON_GSM_CODEPOINT);
this.setCustomValidity(match ? 'Invalid character: “' + match[0] + '”' : '');
this.form.reportValidity();
});
<form>
<textarea id="input"></textarea>
</form>
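Since the question asks for an array of the invalid characters without ES6, the surrogate-pair regex above can be given the g flag and passed to String.prototype.match. This is only a minimal sketch: findNonAsciiCharacters is a hypothetical helper name, and like the regex above it works on code points, not grapheme clusters.

```javascript
// ES5-only sketch. The helper name is made up for illustration.
// Surrogate pairs are tried first so an astral symbol is returned whole.
var NON_ASCII_GLOBAL = /[\ud800-\udbff][\udc00-\udfff]|[^\0-\x7f]/g;

function findNonAsciiCharacters(text) {
  // With the g flag, match() returns every match, or null when there are none.
  return text.match(NON_ASCII_GLOBAL) || [];
}

// '\ud83d\ude00' is the grinning-face emoji written as a surrogate pair.
var invalid = findNonAsciiCharacters('a\ud83d\ude00b\u00e9c');
console.log(invalid.length); // 2
console.log(invalid[0] === '\ud83d\ude00'); // true
```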

You can use the spread operator (...) to break the characters into an array and then charCodeAt to get the value:
let str = `😀abc😀def😀ghi`;
let chars = [...str];
console.log(`All Chars: ${chars}`);
console.log('Bad Chars:',
chars.filter(v=>v.charCodeAt(0)>127)
);

Interesting! This is merely trial and error, but it looks like converting the string to an array of single-character strings using Array.from will allow you to index the characters correctly:
Array.from('😀').length
1
Array.from('😀abc').length
4
Array.from('😀abc')[0]
"😀"

Related

How to get the nth (Unicode) character from a string in JavaScript

Suppose we have a string with some (astral) Unicode characters:
const s = 'Hi πŸ‘‹ Unicode!'
The [] operator and .charAt() method don't work for getting the 4th character, which should be "πŸ‘‹":
> s[3]
'οΏ½'
> s.charAt(3)
'οΏ½'
The .codePointAt() does get the correct value for the 4th character, but unfortunately it's a number and has to be converted back to a string using String.fromCodePoint():
> String.fromCodePoint(s.codePointAt(3))
'πŸ‘‹'
Similarly, converting the string into an array using splats yields valid Unicode characters, so that's another way of getting the 4th one:
> [...s][3]
'πŸ‘‹'
But i can't believe that going from string to number back to string, or having to split the string into an array are the only ways of doing this seemingly trivial thing. Isn't there a simple method for doing this?
> s.simpleMethod(3)
'πŸ‘‹'
Note: i know that the definition of "character" is somewhat fuzzy, but for the purpose of this question a character is simply the symbol that corresponds to a Unicode codepoint (no combining characters, no grapheme clusters, etc).
Update: the String.fromCodePoint(str.codePointAt(n)) method is not really viable, since the nth position there doesn't take previous astral symbols into account: String.fromCodePoint('πŸ‘‹πŸ™ˆ'.codePointAt(1)) // => 'οΏ½'
(I feel kinda dumb asking this; like i'm probably missing something obvious. But previous answers to this questions don't work on strings with Unicode simbols on astral planes.)
The string iterator is the only thing that iterates through code points rather than UCS-2/UTF-16 code units. So:
const string = 'Hi 👋 Unicode!';
for (const symbol of string) {
console.log(symbol);
}
So to get a specific code point based on its index from a string:
const string = 'Hi 👋 Unicode!';
// Note: The spread operator uses the string iterator under the hood.
const symbols = [...string];
symbols[3]; // '👋'
Still, this would break with grapheme clusters, or emoji sequences such as 👨‍👩‍👧‍👦 (👨 + U+200D ZERO WIDTH JOINER + 👩 + U+200D ZERO WIDTH JOINER + 👧 + U+200D ZERO WIDTH JOINER + 👦). Text segmentation helps with that.
Do you actually need to get the 4th code point in the string, though? What’s your use case?
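To illustrate the text segmentation point, here is a rough sketch using Intl.Segmenter with grapheme granularity. It assumes a runtime that implements Intl.Segmenter (recent browsers and Node.js versions do; older ones need a library), and splitGraphemes is just an illustrative name.

```javascript
// Sketch only: assumes Intl.Segmenter is available in the runtime.
function splitGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // Each item yielded by segment() has a .segment string for one grapheme cluster.
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

// The family emoji: four emoji joined by three ZERO WIDTH JOINERs.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
console.log([...family].length);            // 7 (code points)
console.log(splitGraphemes(family).length); // 1 (grapheme cluster)
```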
You can use the new u flag to regexp if it's available to you.
const chars = 'Hi 👋 Unicode!'.match(/./ug);
console.log(chars);
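Both the spread and the /./ug approaches build the whole array before indexing into it (and . does not match line breaks without the s flag). If only one symbol is needed, a loop over the string iterator can stop early. A small sketch, with symbolAt being a hypothetical name rather than a built-in:

```javascript
// Returns the code point (as a string) at the given code-point index,
// or undefined when the index is out of range. The name is illustrative.
function symbolAt(text, index) {
  let i = 0;
  // for...of uses the string iterator, which yields whole code points.
  for (const symbol of text) {
    if (i === index) return symbol;
    i += 1;
  }
  return undefined;
}

console.log(symbolAt('Hi \u{1F44B} Unicode!', 3) === '\u{1F44B}'); // true
console.log(symbolAt('\u{1F44B}\u{1F648}', 1) === '\u{1F648}');    // true
```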
The accepted answer to this question is out of date.
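A String method called .at() was proposed that does exactly what you're hoping for, and a shim for that proposal exists. If you have shims, shams, a transpiler like TypeScript or Babel, etc., just set whatever your local configuration is, and you should be good to go. Be aware, though, that the .at() that was eventually standardized (the one the MDN page below documents) indexes UTF-16 code units, like charAt(), so natively it returns a lone surrogate for astral symbols.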
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/at
Amusingly, the spec for this feature, as well as the most common implementation shim (the one that I use), was written by the person who authored the now out-of-date accepted answer here. So even when he's out of date, he's still up to date.
If shimming or transcompiling isn't appropriate for you, there's a library called jsesc that can handle it for you through simple escaping. I'll give you three guesses who wrote the library. First two don't count.
https://www.npmjs.com/package/jsesc

Number is different than itself (trimming strange characters)

I've copied the first number from the windows calculator, and typed the second one. In Chrome console I get:
"‭65033‬" == "65033"
//false
65033‬ == 65033
//Uncaught SyntaxError: Invalid or unexpected token
It seems there is an unknown character at the beginning and end of it.
1) Is there a way to trim all "strange" characters without knowing them a priori?
2) Why does the Windows calculator put such chars in the number?
Edit: I was not explicit in the question, but any chars with valid information, such as ã, ü, ç, ¢, £, would also be valid. What I don't want is characters that do not carry any information for the human reader.
Edit: after the edit of the original question, this answer no longer offers a bulletproof solution.
var myNumber = 'foo123bar';
var realNumber = window.parseInt(myNumber.replace(/\D*/g, ''), 10);
What does this do?
It replaces all the non-digit characters with an empty string and then parses the integer out of the digits left in the string.
A quick solution for this case:
eval("65033‬ == 65033".replace(/[^a-zA-Z0-9 =_.-]/g, ''))
You can place your copied text in a string, then remove all unnecessary characters (by explicitly listing the ones that should stay there).
These may include non-alphanumerical characters such as hyphen, underscore, equals sign, space et cetera; the actual characters that need to stay there will depend on your choice and needs. Note that the hyphen goes last in the character class so it is not read as a range, and the g flag removes every occurrence rather than just the first.
Alternatively, you may try to remove all non-printable characters, as suggested here.
Finally, evaluate resulting code. Remember this is not necessarily the best idea for production code.
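As a sketch of the "remove all non-printable characters" idea that keeps letters such as ã or symbols such as £, Unicode property escapes can target the control (Cc) and format (Cf) categories. The characters wrapped around the calculator's number appear to be bidirectional formatting marks (U+202D and U+202C are assumed here), which fall under Cf. stripInvisible is a hypothetical name, and the approach assumes an engine that supports \p{...} with the u flag. Be aware that Cc also covers tabs and line breaks, and Cf covers the zero width joiner used in emoji sequences.

```javascript
// Sketch: remove control (Cc) and format (Cf) characters, keep everything else.
// Assumes regex Unicode property escapes are supported.
function stripInvisible(text) {
  return text.replace(/[\p{Cc}\p{Cf}]/gu, '');
}

// Assumed reconstruction of the pasted value: LRO + digits + PDF.
const pasted = '\u202D65033\u202C';
console.log(pasted === '65033');                 // false
console.log(stripInvisible(pasted) === '65033'); // true
console.log(stripInvisible('\u00e3 \u00a3'));    // unchanged: letters and symbols stay
```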

Regex for integer, integer + dot, and decimals

I have searched StackOverflow and I can't find an answer as to how to check for regex of numeric inputs for a calculator app that will check for the following format with every keyup (jquery key up):
Any integer like: 34534
When a dot follows the integer when the user is about to enter a decimal number like this: 34534. Note that a dot can only be entered once.
Any float: 34534.093485
I don't plan to use commas to separate the thousands...but I would welcome if anyone can also provide a regex for that.
Is it possible to check the above conditions with just one regex? Thanks in advance.
Is a lone . a successful match or not? If it is then use:
\d+(\.\d*)?|\.\d*
If not then use:
\d+(\.\d*)?|\.\d+
Rather than incorporating commas into the regexes, I recommend stripping them out first: str = str.replace(/,/g, ''). Then check against the regex.
That wouldn't verify that digits are properly grouped into groups of three, but I don't see much value in such a check. If a user types 1,024 and then decides to add a digit (1,0246), you probably shouldn't force them to move the comma.
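When checking the whole input on every keyup, the pattern needs anchors, and the alternation has to be grouped so the anchors apply to both branches. A minimal sketch using the variant that accepts a lone dot; the names PARTIAL_NUMBER and isPartialNumber are made up for illustration:

```javascript
// Anchored version of \d+(\.\d*)?|\.\d* ; the non-capturing group keeps
// ^ and $ attached to both alternatives.
var PARTIAL_NUMBER = /^(?:\d+(?:\.\d*)?|\.\d*)$/;

function isPartialNumber(input) {
  // Strip thousands separators first, as suggested above.
  return PARTIAL_NUMBER.test(input.replace(/,/g, ''));
}

console.log(isPartialNumber('34534'));        // true
console.log(isPartialNumber('34534.'));       // true
console.log(isPartialNumber('34534.093485')); // true
console.log(isPartialNumber('1.2.3'));        // false
```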
Let's write out your specifications, and develop from that.
Any integer: \d+
A dot, optionally followed by an integer: \.\d*
Combine the two and make the latter optional, and you get:
\d+\.?\d*
As for handling commas, I'd rather not go into it, as it gets very ugly very fast. You should simply strip all commas from input if you still care about them.
you can use in this way:
[/\d+./]
I think this can be used for any of your queries.
Whether it's 12445 or 1244. or 12445.43
I'm going to throw in a potentially downvoted answer here - this is a better solution:
function valid_float (num) {
var num = (num + '').replace(/,/g, ''), // don't care about commas, this turns `num` into a String
float_num = parseFloat(num);
return float_num == num || float_num + '.' == num; // allow for the decimal point, deliberately using == to ignore type as `num` is a String now
}
Any regex that does your job correctly will come with a big asterisk after it saying "probably", and if it's not spot on, it'll be an absolute pig to debug.
Sure, this answer isn't giving you the most awesomely cool one-liner that's going to make you go "Cool!", but in 6 months time when you realise it's going wrong somewhere, or you want to change it to do something slightly different, it's going to be a hell of a lot easier to see where, and to fix.
I'm using ^(\d)+(\.(\d)+)+$ to capture each integer and to have an unlimited length, so long as the string begins and ends with integers and has dots between each integer group (the dot is escaped so that it only matches a literal dot). I'm capturing the integer groups so that I can compare them.

Validate numbers, parenthesis and spaces only in jQuery validation

I am trying and failing hard in validating a phone number within jQuery validation. All I want is to allow a number like (01660) 888999. Looking around the net I find a million examples but nothing seems to work. Here is my current effort
$.validator.addMethod("phonenumber", function(value) {
var re = new RegExp("/[\d\s()+-]/g");
return re.test(value);
//return value.match("/[\d\s]*$");
}, "Please enter a valid phone number");
Bergi is correct that the way you are constructing the regular expression is wrong.
Another problem is that you are missing anchors and a +:
var re = /^[\d\s()+-]+$/;
Note though that a regular expression based solution will still allow some inputs that aren't valid phone numbers. You can improve your regular expression in many ways; for example, you might want to require that there are at least x digits.
There are many rules for what phone numbers are valid and invalid. It is unlikely you could encode all those rules into a regular expression in a maintainable way, so you could try one of these approaches:
Find a library that is able to validate phone numbers (but possibly not regular expression based).
If you need a regular expression, aim for something that is a close approximation to the rules, but doesn't attempt to handle all the special cases. I would suggest trying to write an expression that accepts all valid phone numbers, but doesn't necessarily reject all invalid phone numbers.
You may also want to consider writing test cases for your solution. The tests will also double as a form of documentation of which inputs you wish to accept and reject.
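As a rough illustration of the "close approximation plus test cases" approach, the sketch below combines the character whitelist with a minimum digit count. The function name and the minimum of 7 digits are assumptions chosen for the example, not rules from any phone numbering standard.

```javascript
// Allowed characters: digits, whitespace, parentheses, plus and hyphen.
var PHONE_CHARS = /^[\d\s()+-]+$/;

// Hypothetical helper: accepts anything made of allowed characters that
// contains at least minDigits digits.
function looksLikePhoneNumber(value, minDigits) {
  if (!PHONE_CHARS.test(value)) {
    return false;
  }
  var digits = value.replace(/\D/g, '');
  return digits.length >= minDigits;
}

// A few test cases doubling as documentation.
console.log(looksLikePhoneNumber('(01660) 888999', 7)); // true
console.log(looksLikePhoneNumber('()', 7));             // false
console.log(looksLikePhoneNumber('call me', 7));        // false
```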
You need to use either a regex literal or a string literal in the RegExp constructor:
var re = /[\d\s()+-]/g;
// or
var re = new RegExp("[\\d\\s()+-]", "g");
See also Creating a Regular Expression.
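To make the difference concrete, here is a small sketch of what the string-based constructor call from the question actually builds. In an ordinary string literal "\d" is just "d", and the slashes and the trailing g become part of the pattern itself:

```javascript
// The question's version: slashes and "g" end up inside the pattern, and
// "\d" / "\s" lose their backslashes because they are string escapes.
var broken = new RegExp("/[\d\s()+-]/g");
// The corrected version: doubled backslashes, flags as a second argument.
var fixed = new RegExp("[\\d\\s()+-]", "g");

console.log("\d" === "d");                  // true
console.log(broken.test("(01660) 888999")); // false: the input contains no "/"
console.log(fixed.test("(01660) 888999"));  // true
```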
Apart from that, you would need to use start- and end-of-string anchors to make sure that the regex matches the whole string, not only a part of it, and some repetition modifier to allow more than one character:
var re = /^[\d\s()+-]+$/; // no g flag: with g, test() remembers lastIndex between calls
Another approach may be:
function(value) {
return /^\d+$/.test(value.replace(/[()\s+-]/g,''));
}
and if you want to check for the length of the number too, say it has to be a string with 10 digits:
function(value) {
return /^\d{10}$/.test(value.replace(/[()\s+-]/g,''));
}

Why is my RegExp construction not accepted by JavaScript?

I'm using a RegExp to validate some user input on an ASP.NET web page. It's meant to enforce the construction of a password (i.e. between 8 and 20 long, at least one upper case character, at least one lower case character, at least one number, at least one of the characters ##!$% and no use of letters L or O (upper or lower) or numbers 0 and 1). This RegExp works fine in my tester (Expresso) and in my C# code.
This is how it looks:
(?-i)^(?=.{8,20})(?=.*[2-9])(?=.*[a-hj-km-np-z])(?=.*[A-HJ-KM-NP-Z])
(?=.*[##!$%])[2-9a-hj-km-np-zA-HJ-KM-NP-Z##!$%]*$
(Line break added for formatting)
However, when I run the code it lives in in IE6 or IE7 (haven't tried other browsers as this is an internal app and we're a Microsoft shop), I get a runtime error saying 'Syntax error in regular expression'. That's it - no further information in the error message aside from the line number.
What is it about this that JavaScript doesn't like?
Well, there are two ways of defining a Regex in Javascript:
a. Through a Regexp object constructor:
var re = new RegExp("pattern","flags");
re.test(myTestString);
b. Using a regex literal:
var re = /pattern/flags;
You should also note that JS does not support some of the tenets of Regular Expressions. For a non-comprehensive list of features unsupported in JS, check out the regular-expressions.info site.
Specifically speaking, you appear to be setting some flags on the expression (for example, the case insensitive flag). I would suggest that you use the /i flag (as indicated by the syntax above) instead of using (?-i)
That would make your Regex as follows (Positive Lookahead appears to be supported):
/^(?=.{8,20})(?=.*[2-9])(?=.*[a-hj-km-np-z])(?=.*[A-HJ-KM-NP-Z])(?=.*[##!$%])[2-9a-hj-km-np-zA-HJ-KM-NP-Z##!$%]*$/i;
For a very good article on the subject, check out Regular Expressions in JavaScript.
Edit (after Howard's comment)
If you are simply assigning this Regex pattern to a RegularExpressionValidator control, then you will not have the ability to set Regex options (such as ignore case). Also, you will not be able to use the Regex literal syntax supported by Javascript. Therefore, the only option that remains is to make your pattern intrinsically case insensitive. For example, [a-h] would have to be written as [A-Ha-h]. This would make your Regex quite long-winded, I'm sorry to say.
Here is a solution to this problem, though I cannot vouch for its legitimacy. Some other options that come to mind may be to turn off client-side validation altogether and validate exclusively on the server. This will give you access to the full Regex flavour implemented by the System.Text.RegularExpressions.Regex object. Alternatively, use a CustomValidator and create your own JS function which applies the Regex match using the patterns that I (and others) have suggested.
I'm not familiar with C#'s regular expression syntax, but is this (at the start)
(?-i)
meant to turn the case insensitivity pattern modifier on? If so, that's your problem. Javascript doesn't support specifying the pattern modifiers in the expression. There's two ways to do this in javascript
var re = /pattern/i
var re = new RegExp('pattern','i');
Give one of those a try, and your expression should be happy.
As Cerberus mentions, (?-i) is not supported in JavaScript regexps. So, you need to get rid of that and use /i. Something to keep in mind is that there is no standard for regular expression syntax; it is different in each language, so testing in something that uses the .NET regular expression engine is not a valid test of how it will work in JavaScript. Instead, try and look for a reference on JavaScript regular expressions, such as this one.
Your match that looks for 8-20 characters is also invalid. This will ensure that there are at least 8 characters, but it does not limit the string to 20, since the character class with the kleene-closure (* operator) at the end can match as many characters as provided. What you want instead is to replace the * at the end with the {8,20}, and eliminate it from the beginning.
var re = /^(?=.*[2-9])(?=.*[a-hj-km-np-z])(?=.*[A-HJ-KM-NP-Z])(?=.*[##!$%])[2-9a-hj-km-np-zA-HJ-KM-NP-Z##!$%]{8,20}$/i;
On the other hand, I'm not really sure why you would want to restrict the length of passwords, unless there's a hard database limit (which there shouldn't be, since you shouldn't be storing passwords in plain text in the database, but instead hashing them down to something fixed size using a secure hash algorithm with a salt). And as mentioned, I don't see a reason to be so restrictive on the set of characters you allow. I'd recommend something more like this:
var re = /^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[##!$%])[a-zA-Z0-9##!$%]{8,}$/i;
Also, why would you forbid 1, 0, L and O from your passwords (and it looks like you're trying to forbid I as well, which you forgot to mention)? This will make it very hard for people to construct good passwords, and since you never see a password as you type it, there's no reason to worry about letters which look confusingly similar. If you want to have a more permissive regexp:
var re = /^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[##!$%]).{8,}$/i;
Are you enclosing the regexp in / / characters?
var regexp = /[]/;
return regexp.test();
(?-i)
Doesn't exist in JS Regexp. Flags can be specified as “new RegExp('pattern', 'i')”, or literal syntax “/pattern/i”.
(?=
Exists in modern implementations of JS Regexp, but is dangerously buggy in IE. Lookahead assertions should be avoided in JS for this reason.
between 8 and 20 long, at least one upper case character, at least one lower case character, at least one number, at least one of the characters ##!$% and no use of letters L or O (upper or lower) or numbers 0 and 1.
Do you have to do this in RegExp, and do you have to put all the conditions in one RegExp? Because those are easy conditions to match using multiple RegExps, or even simple string matching:
if (
s.length<8 || s.length>20 ||
s==s.toLowerCase() || s==s.toUpperCase() ||
s.indexOf('0')!=-1 || s.indexOf('1')!=-1 ||
s.toLowerCase().indexOf('l')!=-1 || s.toLowerCase().indexOf('o')!=-1 ||
(s.indexOf('#')==-1 && s.indexOf('#')==-1 && s.indexOf('!')==-1 && s.indexOf('$')==-1 && s.indexOf('%')==-1)
)
alert('Bad password!');
(These are really cruel and unhelpful password rules if meant for end-users BTW!)
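Along the same lines, splitting the rules into separate checks also makes it possible to tell the user which rule failed. The sketch below follows the rules as stated in the question; the function name and the messages are made up, and the special-character set is taken literally from the pattern shown there.

```javascript
// Hypothetical helper: returns a list of human-readable problems,
// or an empty array when the password satisfies every stated rule.
function passwordProblems(s) {
  var problems = [];
  if (s.length < 8 || s.length > 20) problems.push('must be 8 to 20 characters long');
  if (!/[A-Z]/.test(s)) problems.push('needs an upper case letter');
  if (!/[a-z]/.test(s)) problems.push('needs a lower case letter');
  if (!/[2-9]/.test(s)) problems.push('needs a number (2-9)');
  if (!/[#!$%]/.test(s)) problems.push('needs one of # ! $ %');
  if (/[01lLoO]/.test(s)) problems.push('must not contain 0, 1, L or O');
  return problems;
}

console.log(passwordProblems('Abcdefg2#').length); // 0
console.log(passwordProblems('Hello2#xx'));        // one problem: forbidden letters
```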
I would use this regular expression:
/(?=[^2-9]*[2-9])(?=[^a-hj-km-np-z]*[a-hj-km-np-z])(?=[^A-HJ-KM-NP-Z]*[A-HJ-KM-NP-Z])(?=[^##!$%]*[##!$%])^[2-9a-hj-km-np-zA-HJ-KM-NP-Z##!$%]{8,}$/
The [^a-z]*[a-z] will make sure that the match is made as early as possible instead of expanding the .* and doing backtracking.
(?-i) is supposed to turn case-insensitivity off. Everybody seems to be assuming you're trying to turn it on, but that would be (?i). Anyway, you don't want it to be case-insensitive, since you need to ensure that there are both uppercase and lowercase letters. Since case-sensitive matching is the default, prefacing a regex with (?-i) is pointless even in those flavors (like .NET) that support inline modifiers.
