Remove Unicode characters within various ranges in javascript

I'm trying to remove every Unicode character in a string if it falls in any of the ranges below.
\uD800-\uDFFF
\u1D800-\u1DFFF
\u2D800-\u2DFFF
\u3D800-\u3DFFF
\u4D800-\u4DFFF
\u5D800-\u5DFFF
\u6D800-\u6DFFF
\u7D800-\u7DFFF
\u8D800-\u8DFFF
\u9D800-\u9DFFF
\uAD800-\uADFFF
\uBD800-\uBDFFF
\uCD800-\uCDFFF
\uDD800-\uDDFFF
\uED800-\uEDFFF
\uFD800-\uFDFFF
\u10D800-\u10DFFF
As an initial prototype, I tried to just remove characters within the first range by using a regex in the replace function.
var buffer = "he\udfffllo world";
var output = buffer.replace(/[\ud800-\udfff]/g, "");
d.innerText = buffer + " is replaced with " + output;
In this case, the character seems to have been replaced fine.
However, when I replace that with
var buffer = "he\udfffllo worl\u1dfffd";
var output = buffer.replace(/[\ud800-\udfff\u1d800-\u1dfff]/g, "");
d.innerText = buffer + " is replaced with " + output;
I see something unexpected. My output shows up as:
he�llo worl᷿fd is replaced with
There are two things to note here:
\u1dfff does not show up as one character: \u1dff is converted to a character, and the f at the end is treated as its own character
the result is an empty string.
Any suggestions on how I can accomplish this would be much appreciated.
EDIT
My overall goal is to filter out all characters that the encodeURIComponent function considers invalid. I ran some tests and found the list above to be the set of characters that are invalid. For instance, the code below, which first converts 1dfff to a Unicode character before passing it to encodeURIComponent, causes the latter function to throw an exception.
var v = String.fromCharCode(122879);
var uriComponent = encodeURIComponent(v);
I edited parts of the question after @Blender pointed out that I was using x instead of u in my code to represent Unicode characters.
EDIT 2
I investigated my technique for fetching the "invalid" Unicode ranges further, and as it turns out, if you give String.fromCharCode a number that needs more than 16 bits, it only looks at the lowest 16 bits of the number. That explains the pattern I was seeing. So as it turns out, I only need to worry about the first range.
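The truncation described in EDIT 2 can be checked directly. This is only a sketch, assuming an engine that provides String.fromCodePoint (ES2015); the variable names are illustrative and not part of the original code.

```javascript
// String.fromCharCode keeps only the lowest 16 bits of each argument,
// so 0x1DFFF (122879) silently becomes the lone surrogate 0xDFFF.
const truncated = String.fromCharCode(0x1dfff);
console.log(truncated === "\udfff"); // true

// String.fromCodePoint builds the real character U+1DFFF,
// which is stored as a surrogate pair (two UTF-16 code units).
const astral = String.fromCodePoint(0x1dfff);
console.log(astral.length); // 2

// Only the lone surrogate makes encodeURIComponent throw a URIError.
let threw = false;
try {
  encodeURIComponent(truncated);
} catch (err) {
  threw = err instanceof URIError;
}
console.log(threw); // true

// The properly paired character encodes without an exception.
const encodedAstral = encodeURIComponent(astral);
console.log(encodedAstral);
```

So the higher "ranges" in the list are an artifact of how the test characters were generated, not characters that encodeURIComponent rejects.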

It seems you're trying to remove Unicode surrogate code units from the string. However, only U+D800 through U+DFFF are surrogate code points; the remaining values you name are not, and could be allocated to valid Unicode characters. In that case, the following will suffice (use \u rather than \x to refer to Unicode characters):
buffer.replace(/[\ud800-\udfff]/g, "");
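One caveat with the pattern above is that it also removes correctly paired surrogates, which would strip characters outside the Basic Multilingual Plane (emoji, for example) even though encodeURIComponent accepts them. A possible variant, assuming an ES2015 engine with the u flag, removes only unpaired surrogates. The function name stripLoneSurrogates is made up for this sketch.

```javascript
// With the u flag the regex walks the string by code point, so a valid
// surrogate pair counts as one character and is not matched by the class.
// Only unpaired (lone) surrogates are removed.
function stripLoneSurrogates(text) {
  return text.replace(/[\ud800-\udfff]/gu, "");
}

console.log(stripLoneSurrogates("he\udfffllo world")); // "hello world"

// A valid pair (U+1F600) survives and can be encoded afterwards.
const safe = stripLoneSurrogates("a\ud83d\ude00b\ud800");
console.log(encodeURIComponent(safe));
```

Whether to drop lone surrogates or replace them with U+FFFD is a design choice; dropping matches what the question asks for.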

Related

Javascript timestamp formatting with regular expression?

How do I format a string such as 2014-09-10 10:07:02 into something like this:
2014,09,10,10,07,02
Thanks!

Nice and simple.
var str = "2014-09-10 10:07:02";
var newstr = str.replace(/[ :-]/g, ',');
console.log(newstr);

Based on the assumption that you want to get rid of everything but the digits, an alternative is to invert the regex so that it matches everything except digits. This is, in effect, a white-listing approach as compared to the previously posted black-listing approach.
var dateTimeString = "2016-11-23 02:00:00";
var regex = /[^0-9]+/g; // Alternatively (credit zerkms): /\D+/g
var reformattedDateTimeString = dateTimeString.replace(regex, ',');
Note the + which has the effect of replacing groups of characters (e.g. two spaces would be replaced by only a single comma).
Also note that if you intend to use the strings as numbers (e.g. via parseInt), older pre-ES5 engines may interpret a string with a leading zero as base-8 when no radix is given, so pass the radix explicitly, as in parseInt(s, 10).
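If the goal is to work with the numeric parts rather than a comma-separated string, the same \D+ pattern can be used with split. This is a minimal sketch; the function name timestampParts is hypothetical.

```javascript
// Split on every run of non-digits and parse each piece as base 10,
// so "09" becomes 9 regardless of how the engine treats leading zeros.
function timestampParts(timestamp) {
  return timestamp
    .split(/\D+/)
    .filter(Boolean) // drop empty pieces from leading/trailing separators
    .map((part) => parseInt(part, 10));
}

console.log(timestampParts("2014-09-10 10:07:02")); // [2014, 9, 10, 10, 7, 2]

// The string form from the answers above, for comparison.
console.log("2014-09-10 10:07:02".replace(/\D+/g, ",")); // "2014,09,10,10,07,02"
```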

Need a RegExp to filter out all but one decimal point

I'm using the following code to negate the characters in the regexp. By checking the inverse, I can determine if the value entered is correctly formatted. Essentially, any digit is allowed, but only one decimal point (placed anywhere in the string). The way I have it now, it catches all numerals, but allows for multiple decimal points (creating invalid floats). How can I adjust this to catch more than one decimal point (since I only want to allow one)?
var regex = new RegExp(/[^0-9\.]/g);
var containsNonNumeric = this.value.match(regex);
if(containsNonNumeric){
this.value = this.value.replace(regex,'');
return false;
}
Here is what I'm expecting to happen:
First, valid input would be any number of numerals with at most one decimal point. The current behavior: the user enters characters one by one, and if they are valid characters they will show up. If the character is invalid (e.g. the letter A), the field will replace that character with '' (essentially behaving like a backspace immediately after the character is typed). What I need is the same behavior when one decimal point too many is added.

As I understand your question the code below might be what you are looking for:
var validatedStr=str.replace(/[^0-9.]|\.(?=.*\.)/g, "");
It removes all characters other than digits and the dot (.), and it also removes every dot that is followed later in the string by another dot, so only the last dot survives.
EDIT based on first comment - the solution above erases all dots but the last, whereas the author wants to erase all but the first one:
Since JS did not support "look behind" at the time of writing (lookbehind assertions only arrived with ES2018), the solution might be to reverse the string before applying the regex and then reverse it again, or to use a replacer function with this regex:
var counter=0;
var validatedStr=str.replace(/[^0-9.]|\./g, function($0){
if( $0 == "." && !(counter++) ) // dot found and counter is not incremented
return "."; // that means we met first dot and we want to keep it
return ""; // if we find anything else, let's erase it
});
JFTR: counter++ only executes if the first part of condition is true, so it works even for strings beginning with letters
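The same "keep only the first dot" behavior can also be written without an external counter variable. The sketch below is one possible alternative, not the answer's method; keepFirstDot is a hypothetical name.

```javascript
// Strip everything except digits and dots, then remove every dot
// that comes after the first one.
function keepFirstDot(input) {
  const cleaned = input.replace(/[^0-9.]/g, "");
  const firstDot = cleaned.indexOf(".");
  if (firstDot === -1) {
    return cleaned; // no dot at all, nothing more to do
  }
  const head = cleaned.slice(0, firstDot + 1);
  const tail = cleaned.slice(firstDot + 1).replace(/\./g, "");
  return head + tail;
}

console.log(keepFirstDot("123.45.67")); // "123.4567"
console.log(keepFirstDot("a1b2c3d.e4f5g")); // "123.45"
```

Because no state lives outside the function, it can be called repeatedly (for example on every keyup) without resetting anything.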

Building upon the original regex from @Jan Legner, with a pair of string reversals to work around the lack of lookbehind. It succeeds at keeping the first decimal point.
Modified in an attempt to cover negatives as well. It can't handle negative signs that are out of place, or special cases that should logically return zero.
let keep_first_decimal = function(s) {
// [^-0-9.] keeps only minus signs, digits and dots (a ? inside the class would be kept as a literal question mark)
return s.toString().split('').reverse().join('').replace(/[^-0-9.]|\.(?=.*\.)/g, '').split('').reverse().join('') * 1;
};
//filters as expected
console.log(keep_first_decimal("123.45.67"));
console.log(keep_first_decimal(123));
console.log(keep_first_decimal(123.45));
console.log(keep_first_decimal("123"));
console.log(keep_first_decimal("123.45"));
console.log(keep_first_decimal("a1b2c3d.e4f5g"));
console.log(keep_first_decimal("0.123"));
console.log(keep_first_decimal(".123"));
console.log(keep_first_decimal("0.123.45"));
console.log(keep_first_decimal("123."));
console.log(keep_first_decimal("123.0"));
console.log(keep_first_decimal("-123"));
console.log(keep_first_decimal("-123.45.67"));
console.log(keep_first_decimal("a-b123.45.67"));
console.log(keep_first_decimal("-ab123"));
console.log(keep_first_decimal(""));
//NaN, should return zero?
console.log(keep_first_decimal("."));
console.log(keep_first_decimal("-"));
//NaN, can't handle minus sign after first character
console.log(keep_first_decimal("-123.-45.67"));
console.log(keep_first_decimal("123.-45.67"));
console.log(keep_first_decimal("--123"));
console.log(keep_first_decimal("-a-b123"));

Chrome counts characters wrong in textarea with maxlength attribute

Here is an example:
$(function() {
$('#test').change(function() {
$('#length').html($('#test').val().length)
})
})
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<textarea id=test maxlength=10></textarea>
length = <span id=length>0</span>
Fill the textarea with lines (one character per line) until the browser stops accepting input.
When you finish, leave the textarea, and the JS code will count the characters too.
So in my case I could enter only 7 characters (including whitespace) before Chrome stopped me, although the value of the maxlength attribute is 10.

Here's how to get your JavaScript code to match the number of characters the browser believes are in the textarea:
http://jsfiddle.net/FjXgA/53/
$(function () {
$('#test').keyup(function () {
var x = $('#test').val();
var newLines = x.match(/(\r\n|\n|\r)/g);
var addition = 0;
if (newLines != null) {
addition = newLines.length;
}
$('#length').html(x.length + addition);
})
})
Basically you just count the total line breaks in the textbox and add 1 to the character count for each one.

Your carriage returns are considered 2 characters each when it comes to maxlength.
1\r\n
1\r\n
1\r\n
1
But it seems that the JavaScript only counts one of the \r\n characters (I am not sure which one), which only adds up to 7.

It seems like the right method, based on Pointy's answer, is to count all new lines as two characters. That will standardize it across browsers and match what will get sent when it's posted.
So we could follow the spec and replace every Carriage Return not followed by a New Line, and every New Line not preceded by a Carriage Return, with a Carriage Return - Line Feed pair. Matching \r\n first and replacing it with itself has the same effect and avoids turning an existing \r\n into \r\r\n.
var len = $('#test').val().replace(/\r\n|\r|\n/g, "\r\n").length;
Then use that variable to display the length of the textarea value, or limit it, and so on.
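The counting rule can be isolated from jQuery to make it easy to check. This is a sketch under the assumption that every line break is submitted as \r\n; submittedLength is a made-up helper name.

```javascript
// Normalize every kind of line break to \r\n and measure the result,
// which approximates the length of the value as it would be posted.
function submittedLength(value) {
  return value.replace(/\r\n|\r|\n/g, "\r\n").length;
}

// Four characters on four lines: 4 characters + 3 line breaks * 2 = 10,
// which is why a maxlength of 10 was reached after 7 visible characters.
console.log(submittedLength("1\n1\n1\n1")); // 10

// An existing \r\n is left as a single pair, not expanded further.
console.log(submittedLength("a\r\nb")); // 4
```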

For reasons unknown, jQuery always converts all newlines in the value of a <textarea> to a single character. That is, if the browser gives it \r\n for a newline, jQuery makes sure it's just \n in the return value of .val().
Chrome and Firefox both count the length of <textarea> tags the same way for the purposes of "maxlength".
However, the HTTP spec insists that newlines be represented as \r\n. Thus, jQuery, webkit, and Firefox all get this wrong.
The upshot is that "maxlength" on <textarea> tags is pretty much useless if your server-side code really has a fixed maximum size for a field value.
edit — at this point (late 2014) it looks like Chrome (38) behaves correctly. Firefox (33) however still doesn't count each hard return as 2 characters.

It looks like JavaScript is also counting the new line characters in the length.
Try using:
var x = $('#test').val();
x = x.replace(/(\r\n|\n|\r)/g,"");
$('#length').html(x.length);
I used it in your fiddle and it was working. Hope this helps.

That is because a new line is actually 2 bytes, and therefore has a length of 2. JavaScript doesn't see it that way and counts it as only 1, making a total of 7 (4 characters plus 3 new lines).

Here's a more universal solution, which overrides the jQuery 'val' function. Will be making this issue into a blog post shortly and linking here.
var originalVal = $.fn.val;
$.fn.val = function (value) {
if (typeof value == 'undefined') {
// Getter
if ($(this).is("textarea")) {
return originalVal.call(this)
.replace(/\r\n/g, '\n') // reduce all \r\n to \n
.replace(/\r/g, '\n') // reduce all \r to \n (we shouldn't really need this line. this is for paranoia!)
.replace(/\n/g, '\r\n'); // expand all \n to \r\n
// this two-step approach allows us to not accidentally catch a perfect \r\n
// and turn it into a \r\r\n, which wouldn't help anything.
}
return originalVal.call(this);
}
else {
// Setter
return originalVal.call(this, value);
}
};

If you want to get the remaining content length of the text area, you can use match on the string containing the line breaks.
HTML:
<textarea id="content" rows="5" cols="15" maxlength="250"></textarea>
JS:
var getContentWidthWithNextLine = function(){
return 250 - content.length + (content.match(/\n/g)||[]).length;
}

var value = $('#textarea').val();
var numberOfLineBreaks = (value.match(/\n/g)||[]).length;
$('#textarea').attr("maxlength",500+numberOfLineBreaks);
This works perfectly in Google Chrome, but avoid the script in IE: there a line break is counted only once.

Textareas are still not fully in sync among browsers. I noticed 2 major problems: carriage returns and character encodings.
Carriage return
By default, line breaks are handled as 2 characters, \r\n (Windows style).
The problem is that Chrome and Firefox will count a line break as one character. You can also select it to observe that an invisible character is selected, like a space.
A workaround is found here:
var length = $.trim($(this).val()).split(" ").join("").split('\n').join('').length;
Jquery word counts when user type line break
Internet Explorer, on the other hand, will count it as 2 characters.
Their representation is:
Binary: 00001101 00001010
Hex: 0D0A
They are represented in UTF-8 as 2 characters and counted for maxlength as 2 characters.
The HTML entities can be:
1) Created from javascript code:
<textarea id='txa'></textarea>
document.getElementById("txa").value = String.fromCharCode(13, 10);
2) Parsed from the content of the textarea:
Ansi code:
<textarea>Line one.
Line two.</textarea>
3) Inserted from keyboard Enter key
4) Defined as the multiline content of the textbox
<textarea>Line one.
Line two.</textarea>
Character Encoding
Character encoding of an input field like textarea is independent of the character encoding of the page. This is important if you plan to count the bytes. So, if you have a meta header to define ANSI encoding of your page (with 1 byte per character), the content of your textbox may still be UTF-8, where a character can take more than 1 byte.
A workaround for the character encoding is provided here:
function htmlEncode(value){
// Create a in-memory div, set its inner text (which jQuery automatically encodes)
// Then grab the encoded contents back out. The div never exists on the page.
return $('<div/>').text(value).html();
}
function htmlDecode(value){
return $('<div/>').html(value).text();
}
HTML-encoding lost when attribute read from input field
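For the byte-counting concern raised under Character Encoding, a short sketch follows. It assumes the value will be sent as UTF-8 and that the standard TextEncoder API is available (current browsers and Node.js); utf8ByteLength is a hypothetical helper name.

```javascript
// String.prototype.length counts UTF-16 code units, not bytes.
// TextEncoder always encodes to UTF-8, so the array length is the byte count.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

console.log("a".length, utf8ByteLength("a")); // 1 1
console.log("é".length, utf8ByteLength("é")); // 1 2
console.log("€".length, utf8ByteLength("€")); // 1 3
console.log("\r\n".length, utf8ByteLength("\r\n")); // 2 2
```

A server-side limit expressed in bytes should be compared against this value rather than against the character count.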

Regular expression to strip thousand separator from numeral string?

I have strings which contain thousand separators, but no string-to-number function wants to consume them correctly (using JavaScript). I'm thinking about "preparing" the string by stripping all thousand separators, leaving everything else untouched, and letting the Number/parseInt/parseFloat functions (I'm satisfied with their behaviour otherwise) decide the rest. But it seems that I have no idea which RegExp can do that!
Better ideas are welcome too!
UPDATE:
Sorry, the answers showed me how badly formulated the question is. What I'm trying to achieve is: 1) to strip thousand separators only, if there are any, but 2) to not disturb the original string much, so that I still get NaN in the cases of invalid numerals.
MORE UPDATE:
JavaScript is limited to the English locale for parsing, so let's assume the thousand separator is ',' for simplicity (naturally, it never matches the decimal separator in any locale, so changing to any other locale should not pose a problem).
Now, on parsing functions:
parseFloat('1023.95BARGAIN BYTES!') // parseXXX functions just "give up" on invalid chars and return 1023.95
Number('1023.95BARGAIN BYTES!') // while the Number constructor behaves "strictly" and will return NaN
Sometimes I use the loose one, sometimes the strict one. I want to figure out the best approach for preparing the string for both functions.
On validity of numerals:
'1,023.99' is perfectly well-formed English number, and stripping all commas will lead to correct result.
'1,0,2,3.99' is broken, however generic comma stripping will give '1023.99' which is unlikely to be a correct result.

welp, I'll venture to throw my suggestion into the pot:
Note: Revised
stringWithNumbers = stringWithNumbers.replace(/(\d+),(?=\d{3}(\D|$))/g, "$1");
should turn
1,234,567.12
1,023.99
1,0,2,3.99
the dang thing costs $1,205!!
95,5,0,432
12345,0000
1,2345
into:
1234567.12
1023.99
1,0,2,3.99
the dang thing costs $1205!!
95,5,0432
12345,0000
1,2345
I hope that's useful!
EDIT:
There is an additional alteration that may be necessary, but is not without side effects:
(\b\d{1,3}),(?=\d{3}(\D|$))
This changes the "one or more" quantifier (+) for the first set of digits into a "one to three" quantifier ({1,3}) and adds a "word-boundary" assertion before it. It will prevent replacements like 1234,123 ==> 1234123. However, it will also prevent a replacement that might be desired (if it is preceded by a letter or underscore), such as A123,789 or _1,555 (which will remain unchanged).
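The stricter pattern from the EDIT can be wrapped in a small helper to see how it interacts with the strict and loose parsers mentioned in the question. This is only a sketch; stripThousands is a made-up name.

```javascript
// Remove a comma only when it follows a group of 1-3 digits that starts
// at a word boundary and is followed by exactly three digits.
function stripThousands(text) {
  return text.replace(/(\b\d{1,3}),(?=\d{3}(\D|$))/g, "$1");
}

console.log(stripThousands("1,234,567.12")); // "1234567.12"
console.log(stripThousands("1234,123")); // "1234,123" (left alone)

// Well-formed input parses; malformed input is left for Number to reject.
console.log(Number(stripThousands("1,023.99"))); // 1023.99
console.log(Number(stripThousands("1,0,2,3.99"))); // NaN
```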

A simple num.replace(/,/g, '') should be sufficient I think.

Depends on what your thousand separator is
myString = myString.replace(/[ ,]/g, "");
would remove spaces and commas.

This should work for you
var decimalCharacter = ".",
regex = new RegExp("[\\d" + decimalCharacter + "]+", "g"),
num = "10,0000,000,000.999";
+num.match(regex).join("");

To confirm that a numeral-string is well-formed, use:
/^(\d*|\d{1,3}(,\d{3})+)($|[^\d,])/.test(numeral_string)
which will return true if the numeral-string is either (1) just a sequence of zero or more digits, or (2) a sequence of digits with a comma before each set of three digits, or (3) either of the above followed by a character that is neither a digit nor a comma, and who knows what else. (Case #3 is for floats, as well as your "BARGAIN BYTES!" examples.) The comma has to be excluded in case #3; otherwise a broken numeral such as '1,0,2,3.99' would pass, because '1' followed by ',' already satisfies the pattern.
Once you've confirmed that, use:
numeral_string.replace(/,/g, '')
which will return a copy of the numeral-string with all commas excised.
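The validate-then-strip steps can be combined into one function. The sketch below assumes the strict Number behavior is wanted and uses a pattern that does not accept a comma directly after the leading digits; toNumber and WELL_FORMED are hypothetical names.

```javascript
// Digits only, or 1-3 digits followed by ",ddd" groups, then either the
// end of the string or a character that is neither a digit nor a comma.
const WELL_FORMED = /^(\d*|\d{1,3}(,\d{3})+)($|[^\d,])/;

function toNumber(numeral) {
  if (!WELL_FORMED.test(numeral)) {
    return NaN; // misplaced separators: refuse instead of guessing
  }
  return Number(numeral.replace(/,/g, ""));
}

console.log(toNumber("1,023.99")); // 1023.99
console.log(toNumber("1,0,2,3.99")); // NaN
console.log(toNumber("1023.95BARGAIN BYTES!")); // NaN (Number is strict)
```

Swapping Number for parseFloat in the last line of the function would give the loose behavior instead.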

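You can use s.replaceAll("(\\W)(?=\\d{3})",""); (this is Java's String.replaceAll; the JavaScript equivalent is s.replace(/\W(?=\d{3})/g, "")).
This regex matches every non-alphanumeric character that is followed by three digits.
Strings like 4.444.444.444,00 € will become 4444444444,00 €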

I have used the following in a commercial setting, and it has worked often:
numberStr = numberStr.replace(/[. ,](\d\d\d\D|\d\d\d$)/g,'$1');
In the above example, thousands can be marked with a period, a comma, or a space.
In some cases (like a price of 1000,5 Euros) the above doesn't work. If you need something more robust, the following handles those cases as well:
//convert a comma or space used as the cent placeholder to a decimal
$priceStr = $priceStr.replace(/[, ](\d\d$)/,'.$1');
$priceStr = $priceStr.replace(/[, ](\d$)/,'.$1');
//capture cents
var $hasCentsRegex = /[.]\d\d?$/;
if($hasCentsRegex.test($priceStr)) {
var $matchArray = $priceStr.match(/(.*)([.]\d\d?$)/);
var $priceBeforeCents = $matchArray[1];
var $cents = $matchArray[2];
} else{
var $priceBeforeCents = $priceStr;
var $cents = "";
}
//remove decimals, commas and whitespace from the pre-cent portion
$priceBeforeCents = $priceBeforeCents.replace(/[.\s,]/g,'');
//re-create the price by adding back the cents
$priceStr = $priceBeforeCents + $cents;

What does this JS do?

var passwordArray = pwd.replace(/\s+/g, '').split(/\s*/);
I found the above line of code in a rather poorly documented JavaScript file, and I don't know exactly what it does. I think it splits a string into an array of characters, similar to PHP's str_split. Am I correct, and if so, is there a better way of doing this?

It removes all whitespace from the password and then splits the password into an array of characters.
It is a bit redundant to convert a string into an array of characters, because you can already access the characters of a string through brackets (though not in older IE :( ) or through the string method "charAt":
var a = "abcdefg";
alert(a[3]);//"d"
alert(a.charAt(1));//"b"

For strings without leading or trailing whitespace, it does the same as: pwd.split(/\s*/).
pwd.replace(/\s+/g, '').split(/\s*/) removes all whitespace (tab, space, CR, LF etc.) and splits the remainder (the string that is returned from the replace operation) into an array of characters. The \s* in split(/\s*/) is strange and redundant, because there shouldn't be any whitespace (\s) left after the replace.
Hence pwd.split(/\s*/) is usually sufficient, although leading or trailing whitespace in pwd would leave an empty string at the start or end of the array. So:
'hello cruel\nworld\t how are you?'.split(/\s*/)
// prints in alert: h,e,l,l,o,c,r,u,e,l,w,o,r,l,d,h,o,w,a,r,e,y,o,u,?
as will
'hello cruel\nworld\t how are you?'.replace(/\s+/g, '').split(/\s*/)

The replace portion removes all whitespace from the password. The \s+ atom matches a run of one or more whitespace characters. The g flag makes the regex match every such run, and each one is replaced with an empty string.
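As for a "better way", one option in an ES2015+ engine is to strip the whitespace and let Array.from do the splitting. This is a sketch, not the original file's approach; passwordChars is a hypothetical name.

```javascript
// Remove all whitespace, then split into characters. Array.from iterates
// by code point, so characters stored as surrogate pairs stay in one piece.
function passwordChars(pwd) {
  return Array.from(pwd.replace(/\s+/g, ""));
}

console.log(passwordChars(" a b\tc ")); // ["a", "b", "c"]
```

Unlike split(/\s*/) on its own, this does not produce empty strings when the input starts or ends with whitespace.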
