I'd like to remove all invalid UTF-8 characters from a string in JavaScript. I've tried with this JavaScript:
strTest = strTest.replace(/([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./g, "$1");
It seems that the UTF-8 validation regex described here (link removed) is more complete and I adapted it in the same way like:
strTest = strTest.replace(/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./g, "$1");
Both of these pieces of code seem to allow valid UTF-8 through, but they filter out hardly any of the bad UTF-8 characters from my test data: UTF-8 decoder capability and stress test. Either the bad characters come through unchanged, or they seem to have some of their bytes removed, creating a new, invalid character.
I'm not very familiar with the UTF-8 standard or with multibyte in JavaScript so I'm not sure if I'm failing to represent proper UTF-8 in the regex or if I'm applying that regex improperly in JavaScript.
Edit: added global flag to my regex per Tomalak's comment - however this still isn't working for me. I'm abandoning doing this on the client side per bobince's comment.
I use this simple and sturdy approach:
function cleanString(input) {
var output = "";
for (var i=0; i<input.length; i++) {
if (input.charCodeAt(i) <= 127) {
output += input.charAt(i);
}
}
return output;
}
Basically all you really want are the ASCII chars 0-127, so just rebuild the string char by char. If it's a good char, keep it; if not, ditch it. Pretty robust, and if sanitization is your goal, it's fast enough (in fact it's really fast).
JavaScript strings are natively Unicode. They hold character sequences* not byte sequences, so it is impossible for one to contain an invalid byte sequence.
(Technically, they actually contain UTF-16 code unit sequences, which is not quite the same thing, but this probably isn't anything you need to worry about right now.)
You can, if you need to for some reason, create a string holding characters used as placeholders for bytes. ie. using the character U+0080 ('\x80') to stand for the byte 0x80. This is what you would get if you encoded characters to bytes using UTF-8, then decoded them back to characters using ISO-8859-1 by mistake. There is a special JavaScript idiom for this:
var bytelike= unescape(encodeURIComponent(characters));
and to get back from UTF-8 pseudobytes to characters again:
var characters= decodeURIComponent(escape(bytelike));
(This is, notably, pretty much the only time the escape/unescape functions should ever be used. Their existence in any other program is almost always a bug.)
decodeURIComponent(escape(bytes)), since it behaves like a UTF-8 decoder, will raise an error if the sequence of code units fed into it would not be acceptable as UTF-8 bytes.
It is very rare for you to need to work on byte strings like this in JavaScript. Better to keep working natively in Unicode on the client side. The browser will take care of UTF-8-encoding the string on the wire (in a form submission or XMLHttpRequest).
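As a rough illustration of decodeURIComponent(escape(...)) behaving like a UTF-8 decoder, the sketch below wraps it in a validity check. The helper name isUtf8ByteString is made up for this example; escape and unescape are legacy functions, but they are still available in browsers and Node.js.

```javascript
// Hypothetical helper (not from the answer above): returns true when the
// string consists of "pseudobytes" that form well-formed UTF-8.
function isUtf8ByteString(bytelike) {
  try {
    // escape() turns each code unit <= 0xFF into %XX (or leaves it as-is);
    // decodeURIComponent() then decodes the %XX run as UTF-8 and throws a
    // URIError if the sequence is malformed.
    decodeURIComponent(escape(bytelike));
    return true;
  } catch (e) {
    return false;
  }
}

// "é" (U+00E9) is the two bytes 0xC3 0xA9 in UTF-8.
var bytelike = unescape(encodeURIComponent("\u00E9"));
console.log(bytelike.length);            // 2
console.log(isUtf8ByteString(bytelike)); // true
console.log(isUtf8ByteString("\xC3"));   // false: truncated sequence
console.log(isUtf8ByteString("\x80"));   // false: stray continuation byte
```

Characters above U+00FF also make the check fail, because escape() emits them as %uXXXX, which decodeURIComponent rejects.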
Languages like Spanish and French have accented characters like "é", whose codes are in the range 160-255; see https://www.ascii.cl/htmlcodes.htm
function cleanString(input) {
var output = "";
for (var i=0; i<input.length; i++) {
if (input.charCodeAt(i) <= 127 || input.charCodeAt(i) >= 160 && input.charCodeAt(i) <= 255) {
output += input.charAt(i);
}
}
return output;
}
Simple mistake, big effect:
strTest = strTest.replace(/your regex here/g, "$1");
// ----------------------------------------^
without the "global" flag, the replace occurs for the first match only.
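A minimal demonstration of the difference the flag makes, using a made-up test string and a simple "non-ASCII" pattern rather than the full regex from the question:

```javascript
var strTest = "a\u00E9b\u00E9c"; // "aébéc"

// Without /g only the first non-ASCII character is removed.
var once = strTest.replace(/[^\x00-\x7F]/, "");

// With /g every match is replaced.
var all = strTest.replace(/[^\x00-\x7F]/g, "");

console.log(once); // "abéc"
console.log(all);  // "abc"
```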
Side note: To remove any character that does not fulfill some kind of complex condition, like falling into a set of certain Unicode character ranges, you can use negative lookahead:
var re = /(?![\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})./g;
strTest = strTest.replace(re, "")
where re reads as
(?! # negative look-ahead: a position *not followed by*:
[…] # any allowed character range from above
) # end lookahead
. # match this character (only if previous condition is met!)
If you're trying to remove the "invalid character" - � - from javascript strings then you can get rid of them like this:
myString = myString.replace(/\uFFFD/g, '')
I ran into this problem with a really weird result from the Date Taken data of a digital image. My scenario is admittedly unique: I was using Windows Script Host (WSH) and the Shell.Application ActiveX object, which allows for getting the namespace object of a folder and calling the GetDetailsOf function to essentially return EXIF data after it has been parsed by the OS.
var app = new ActiveXObject("Shell.Application");
var info = app.Namespace("c:\\");
var date = info.GetDetailsOf(info.ParseName("testimg.jpg"), 12);
In Windows Vista and 7, the result looked like this:
?8/?27/?2011 ??11:45 PM
So my approach was as follows:
var chars = date.split(''); //split into characters
var clean = "";
for (var i = 0; i < chars.length; i++) {
if (chars[i].charCodeAt(0) < 255) clean += chars[i];
}
The result of course is a string that excludes those question mark characters.
I know you went with a different solution altogether, but I thought I'd post my solution in case anyone else is having troubles with this and cannot use a server side language approach.
I used @Ali's solution to not only clean my string, but also replace the invalid chars with HTML numeric character references:
function cleanString(input) {
var output = "";
for (var i = 0; i < input.length; i++) {
if (input.charCodeAt(i) <= 127) {
output += input.charAt(i);
} else {
output += "&#" + input.charCodeAt(i) + ";";
}
}
return output;
}
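One caveat with charCodeAt is that characters outside the Basic Multilingual Plane are stored as two surrogate code units, so the loop above emits two references for them. A possible variant, sketched here under the assumption that an ES2015+ engine is available, walks the string by code point instead. The name toHtmlEntities is invented for this example.

```javascript
// Sketch: like cleanString above, but iterates code points rather than
// UTF-16 code units, so an astral character becomes a single reference.
function toHtmlEntities(input) {
  var output = "";
  // for...of iterates a string by code point (ES2015+).
  for (var ch of input) {
    var cp = ch.codePointAt(0);
    output += cp <= 127 ? ch : "&#" + cp + ";";
  }
  return output;
}

console.log(toHtmlEntities("caf\u00E9 \uD83D\uDE00")); // "caf&#233; &#128512;"
```

Unpaired surrogates would still be emitted as references to surrogate code points, so they may need to be filtered out separately.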
I have put together some solutions proposed above to be error-safe
var removeNonUtf8 = (characters) => {
try {
// ignore invalid char ranges
var bytelike = unescape(encodeURIComponent(characters));
characters = decodeURIComponent(escape(bytelike));
} catch (error) { }
// remove �
characters = characters.replace(/\uFFFD/g, '');
return characters;
};
I have UTF-32 data, an array buffer. I need to convert it into an ECMAScript string.
I've been told that I can just use TextDecoder with UTF-8 and that it is supposed to "just work". I highly doubted the person who told me this, but it worked anyway.
Except... the output text is riddled with null characters (3 per character), due to reading the null byte padding as a null character, instead of reading the whole four bytes as one character.
ex:
\x70\x00\x00\x00
becomes
p UTF-32; null padding is read as part of one character
p\0\0\0 UTF-8; each null byte is read as a separate character
According to the WHATWG Encoding spec, UTF-32 is not defined as an encoding label, whereas UTF-8 and UTF-16 are. Does anyone have any suggestions on how I can achieve proper UTF-32 decoding within a browser?
To be clear, I care about modern browsers, so I am excluding IE, Amaya, Android Webview, and Netscape Navigator, etc.
Decoding it as UTF-8 is definitely wrong! As you found out. In addition to the NUL thing, it will fail to decode characters outside of ASCII entirely.
You can read the codepoints one by one with a DataView to decode:
const utf32Decode = bytes => {
const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
let result = '';
for (let i = 0; i < bytes.length; i += 4) {
result += String.fromCodePoint(view.getInt32(i, true));
}
return result;
};
const result = utf32Decode(new Uint8Array([0x70, 0x00, 0x00, 0x00]));
console.log(JSON.stringify(result));
Invalid UTF-32 will throw an error, thanks to getInt32 (invalid lengths) and String.fromCodePoint (invalid codepoints).
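A slightly more defensive variant of the same idea is sketched below. The littleEndian parameter and the explicit length check are additions for illustration and are not part of the answer above; getUint32 is used so that large values stay positive before String.fromCodePoint rejects them.

```javascript
// Sketch of a UTF-32 decoder with an explicit length check.
// `littleEndian` is an assumed parameter; the answer above hard-codes true.
function utf32Decode(bytes, littleEndian = true) {
  if (bytes.byteLength % 4 !== 0) {
    throw new RangeError("UTF-32 data must be a multiple of 4 bytes");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const chunks = [];
  for (let i = 0; i < view.byteLength; i += 4) {
    // fromCodePoint throws a RangeError for anything above U+10FFFF.
    chunks.push(String.fromCodePoint(view.getUint32(i, littleEndian)));
  }
  return chunks.join("");
}

// "p" (U+0070) followed by U+1F600, both little-endian.
const sample = new Uint8Array([0x70, 0x00, 0x00, 0x00, 0x00, 0xF6, 0x01, 0x00]);
console.log(utf32Decode(sample) === "p\u{1F600}"); // true
```

Note that String.fromCodePoint does not reject surrogate code points (U+D800 to U+DFFF), which are not valid in UTF-32, so an extra range check would be needed if strict validation matters.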
Use this library: https://github.com/ashtuchkin/iconv-lite. It works in-browser using browserify or webpack (though it's pretty big).
Example:
const iconv = require("iconv-lite")
const yourBuffer = // however you're getting your buffer
const str = iconv.decode(yourBuffer, "utf32");
I'm looking to make a program to encrypt a string using a vigenere cipher. So far, I have been successful in doing this, apart from special characters (e.g. spaces, full stops, commas, etc).
I have come to this solution, which includes the correct special characters. However, everything after the first special character in the string becomes gibberish. The characters are not special characters; they are still in the alphabet, although they don't match the cipher. I cannot work out why this is happening. I've tried several totally different methods, and all of them lead to this same error. This is the neatest method I've come up with so far, but it still doesn't work (for this example you can assume that the text and the key are the same length).
for (i=0, l=[], k=[], output=""; i < text.length; i++) {
l[i] = (text.charCodeAt(i)) - 97;
k[i] = (key.charCodeAt(i)) - 97;
if ((l[i] > -1) && (l[i] < 26)) { // if the ASCII code is between 0 and 25
ans = parseInt(encryptLetter(l[i], k[i]));
output += String.fromCharCode(97 + ans);
};
if ((l[i] < 0) || (l[i] > 25)) { // if the ASCII code is not between 0 and 25
output += String.fromCharCode(97 + l[i])
};
};
function encryptLetter(l, k) {
en = l + k;
if (en > 25) { // if encrypted letter is greater than 25, wrap around.
en -= 26;
}
return en;
}
If you need, you can test the encryption here. Any help would be greatly appreciated.
Edit:
I have noticed that every four special characters, there is a block of regular characters that is correct to the cipher. I have no idea why. It completely baffles me.
For anyone wondering, I fixed it. I noticed that after every special character, the key would shift back one letter. For example, if the key was apples, then after the first special character, the key would become pplesa. After the second special character, the key would become plesap. To counter this, I just added p -= 1; at the end of the if statement for special characters. This fixed the problem. Thank you to everyone who helped.
Kind regards,
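The key drift described above can also be avoided by giving the key its own index that advances only when a letter is actually encrypted. The following is a sketch of that idea, not the poster's actual fix (the full original code is not shown). It assumes lowercase a-z text and a non-empty lowercase key; the names vigenereEncrypt and keyIndex are invented here.

```javascript
// Sketch: Vigenère encryption for lowercase a-z that passes every other
// character through unchanged. keyIndex advances only when a letter is
// encrypted, so special characters never consume a key letter.
function vigenereEncrypt(text, key) {
  var output = "";
  var keyIndex = 0;
  for (var i = 0; i < text.length; i++) {
    var l = text.charCodeAt(i) - 97;
    if (l >= 0 && l < 26) {
      var k = key.charCodeAt(keyIndex % key.length) - 97;
      output += String.fromCharCode(97 + (l + k) % 26);
      keyIndex++;
    } else {
      output += text.charAt(i); // spaces, punctuation, etc.
    }
  }
  return output;
}

console.log(vigenereEncrypt("attack at dawn", "lemon")); // "lxfopv ef rnhr"
```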
You can check the character type using ASCII value. Since you specified any non-lowercase letter, you can mark a character special if its ASCII value is not in the range 97-122.
You can store the special characters from your original string in some sort of hashmap. You can make the character the key, and the value being a linked list. The linked list can store the indices of the characters, so you know where they were in the original string.
I'm using RPG Maker MV which is a game creator that uses JavaScript to create plugins. I have a plugin in JavaScript already, however I'm trying to edit a part of the plugin so that it basically checks if a certain string exists in a character in the game and if it does, then sets specific variables to numbers within that string.
for (var i = 0; i < page.list.length; i++) {
if (page.list[i].code == 108 && page.list[i].parameters[0].contains("<post:" + (n) + "," + (n) + ">")) {
var post = page.list[i].parameters[0];
var array = post.split(',');
this._origMovement.x = Number(array[1]);
this._origMovement.y = Number(array[1]);
break;
};
};
So I know the first 2 lines work and contains works when I only put a specific string. However I can't figure out how to check for 2 numbers that are separated by a comma and wrapped in '<>' tags, without knowing what the numbers would be.
Then it needs to extract those numbers and assign one to this._origMovement.x and the other to this._origMovement.y.
Any help would be greatly appreciated.
This is one of those rare cases where I'd use a regular expression. If you haven't come across regular expressions before I suggest reading an introduction to them, such as this one: https://regexone.com/
In your case, you probably want something like this:
var myRegex = /<post:(\d+),(\d+)>/;
var matches = myParameter.match(myRegex);
this._origMovement.x = matches[1]; //the first number
this._origMovement.y = matches[2]; //the second number
The myRegex variable is a regular expression that looks for the pattern you describe, and has 2 capture groups which look for a string of one or more digits (\d+ means "one or more digits"). The result of the .match() call gives you an array containing the entire match and the results of the capture groups.
If you want to allow for decimal numbers, you'll need to use a different capture group that allows for a decimal point, such as ([\d\.]+), which means "a sequence of one or more digits and decimal points", or, more sophisticated, (\d+\.?\d*), which is "a sequence of one or more digits, followed by an optional decimal point, followed by zero or more digits".
There are lots of good tutorials around to help you write good regular expressions, and sites that will help you live-test your expressions to make sure they work correctly. They're a powerful tool, but be careful not to over-use them!
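Two details are easy to miss with .match(): it returns null when nothing matches, and the capture groups come back as strings. A small sketch that handles both is shown below; the helper name parsePostTag and the sample strings are invented for illustration.

```javascript
// Hypothetical helper: returns {x, y} as numbers, or null when the text
// contains no <post:x,y> tag.
function parsePostTag(text) {
  var matches = /<post:(\d+),(\d+)>/.exec(text);
  if (matches === null) return null;
  return { x: Number(matches[1]), y: Number(matches[2]) };
}

console.log(parsePostTag("note <post:12,7> here")); // { x: 12, y: 7 }
console.log(parsePostTag("no tag here"));           // null
```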
Got it to work. For anyone who may ever be interested, the code is below.
for (var i = 0; i < page.list.length; i++) {
if (page.list[i].code == 108 && page.list[i].parameters[0].contains("<post:")) {
var myRegex = /<post:(\d+),(\d+)>/;
var matches = page.list[i].parameters[0].match(myRegex);
this._origMovement.x = matches[1]; //the first number
this._origMovement.y = matches[2]; //the second number
break;
}
};
I'm trying to remove every Unicode character in a string if it falls in any of the ranges below.
\uD800-\uDFFF
\u1D800-\u1DFFF
\u2D800-\u2DFFF
\u3D800-\u3DFFF
\u4D800-\u4DFFF
\u5D800-\u5DFFF
\u6D800-\u6DFFF
\u7D800-\u7DFFF
\u8D800-\u8DFFF
\u9D800-\u9DFFF
\uAD800-\uADFFF
\uBD800-\uBDFFF
\uCD800-\uCDFFF
\uDD800-\uDDFFF
\uED800-\uEDFFF
\uFD800-\uFDFFF
\u10D800-\u10DFFF
As an initial prototype, I tried to just remove characters within the first range by using a regex in the replace function.
var buffer = "he\udfffllo world";
var output = buffer.replace(/[\ud800-\udfff]/g, "");
d.innerText = buffer + " is replaced with " + output;
In this case, the character seems to have been replaced fine.
However, when I replace that with
var buffer = "he\udfffllo worl\u1dfffd";
var output = buffer.replace(/[\ud800-\udfff\u1d800-\u1dfff]/g, "");
d.innerText = buffer + " is replaced with " + output;
I see something unexpected. My output shows up as:
he�llo worl᷿fd is replaced with
There are two things to note here:
\u1dfff does not show up as one character - \u1dff gets converted to a character and the f at the end is treated as its own character
the result is an empty string.
Any suggestions on how I can accomplish this would be much appreciated.
EDIT
My overall goal is to filter out all characters that the encodeURIComponent function considers invalid. I ran some tests and found the list above to be the set of characters that are invalid. For instance, the code below, which first converts 1dfff to a Unicode character before passing that to encodeURIComponent, causes an exception to be raised by the latter function.
var v = String.fromCharCode(122879);
var uriComponent = encodeURIComponent(v);
I edited parts of the question after @Blender pointed out that I was using x instead of u in my code to represent Unicode characters.
EDIT 2
I investigated my technique for fetching the "invalid" Unicode ranges further, and as it turns out, if you give String.fromCharCode a number that's larger than 16 bits, it'll just look at the lowest 16 bits of the number. That explains the pattern I was seeing. So as it turns out, I only need to worry about the first range.
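The truncation can be checked directly. The snippet below reuses the value 122879 (0x1DFFF) from the question and compares String.fromCharCode with String.fromCodePoint, which does accept values above 16 bits.

```javascript
var codePoint = 0x1DFFF; // 122879, the value used in the question

// fromCharCode keeps only the low 16 bits, yielding the lone surrogate U+DFFF.
var truncated = String.fromCharCode(codePoint);
console.log(truncated === "\uDFFF"); // true

// fromCodePoint produces the real character as a surrogate pair.
var full = String.fromCodePoint(codePoint);
console.log(full.length); // 2

// A lone surrogate cannot be encoded as UTF-8, so encodeURIComponent throws.
var threw = false;
try {
  encodeURIComponent(truncated);
} catch (e) {
  threw = e instanceof URIError;
}
console.log(threw); // true

// The full code point encodes without error.
console.log(encodeURIComponent(full)); // "%F0%9D%BF%BF"
```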
It seems you're trying to remove Unicode surrogate code units from the string. However, only U+D800 through U+DFFF are surrogate code points; the remaining values you name are not, and could be allocated to valid Unicode characters. In that case, the following will suffice (use \u rather than \x to refer to Unicode characters):
buffer.replace(/[\ud800-\udfff]/g, "");
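One thing to keep in mind is that the pattern above also strips both halves of valid surrogate pairs, which removes legitimate characters outside the Basic Multilingual Plane (emoji, for example). If only the unpaired surrogates that make encodeURIComponent throw should go, a sketch along these lines may be closer to the goal. It relies on a lookbehind assertion, so it assumes an ES2018+ engine; the name stripLoneSurrogates is invented here.

```javascript
// Sketch: remove only unpaired surrogates, keeping valid pairs intact.
// First alternative: a high surrogate not followed by a low surrogate.
// Second alternative: a low surrogate not preceded by a high surrogate.
var loneSurrogate =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;

function stripLoneSurrogates(s) {
  return s.replace(loneSurrogate, "");
}

var cleaned = stripLoneSurrogates("he\uDFFFllo \uD83D\uDE00\uD800");
console.log(cleaned === "hello \uD83D\uDE00"); // true
console.log(encodeURIComponent(cleaned));      // no URIError
```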
To be more precise, I need to know whether (and, if possible, how) I can find out whether a given string contains double-byte characters. Basically, I need to open a pop-up to display a given text which can contain double-byte characters, like Chinese or Japanese. In this case, we need to use a different window size than we would for English or ASCII text.
Does anyone have a clue?
I used mikesamuel answer on this one. However I noticed perhaps because of this form that there should only be one escape slash before the u, e.g. \u and not \\u to make this work correctly.
function containsNonLatinCodepoints(s) {
return /[^\u0000-\u00ff]/.test(s);
}
Works for me :)
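JavaScript holds text internally as a sequence of 16-bit code units (UTF-16, historically described as UCS-2), which can represent all of Unicode by using surrogate pairs for characters outside the Basic Multilingual Plane.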
But that's not really germane to your question. One solution might be to loop through the string and examine the character codes at each position:
function isDoubleByte(str) {
for (var i = 0, n = str.length; i < n; i++) {
if (str.charCodeAt( i ) > 255) { return true; }
}
return false;
}
This might not be as fast as you would like.
I have benchmarked the two functions in the top answers and thought I would share the results. Here is the test code I used:
const text1 = `The Chinese Wikipedia was established along with 12 other Wikipedias in May 2001. 中文維基百科的副標題是「海納百川,有容乃大」,這是中国的清朝政治家林则徐(1785年-1850年)於1839年為`;
const regex = /[^\u0000-\u00ff]/; // Small performance gain from pre-compiling the regex
function containsNonLatinCodepoints(s) {
return regex.test(s);
}
function isDoubleByte(str) {
for (var i = 0, n = str.length; i < n; i++) {
if (str.charCodeAt( i ) > 255) { return true; }
}
return false;
}
function benchmark(fn, str) {
let startTime = new Date();
for (let i = 0; i < 10000000; i++) {
fn(str);
}
let endTime = new Date();
return endTime.getTime() - startTime.getTime();
}
console.info('isDoubleByte => ' + benchmark(isDoubleByte, text1));
console.info('containsNonLatinCodepoints => ' + benchmark(containsNonLatinCodepoints, text1));
When running this I got:
isDoubleByte => 2421
containsNonLatinCodepoints => 868
So for this particular string the regex solution is about 3 times faster.
However, note that for a string whose first character is outside the Latin-1 range, isDoubleByte() returns right away and so is much faster than the regex (which still has the overhead of the regular expression).
For instance for the string 中国, I got these results:
isDoubleByte => 51
containsNonLatinCodepoints => 288
To get the best of both worlds, it's probably better to combine both:
var regex = /[^\u0000-\u00ff]/; // Small performance gain from pre-compiling the regex
function containsDoubleByte(str) {
if (!str.length) return false;
if (str.charCodeAt(0) > 255) return true;
return regex.test(str);
}
In that case, if the first character is Chinese (which is likely if the whole text is Chinese), the function will be fast and return right away. If not, it will run the regex, which is still faster than checking each character individually.
Actually, all of the characters are Unicode, at least from the Javascript engine's perspective.
Unfortunately, the mere presence of characters in a particular Unicode range won't be enough to determine you need more space. There are a number of characters which take up roughly the same amount of space as other characters which have Unicode codepoints well above the ASCII range. Typographic quotes, characters with diacritics, certain punctuation symbols, and various currency symbols are outside of the low ASCII range and are allocated in quite disparate places on the Unicode basic multilingual plane.
Generally, projects that I've worked on elect to provide extra space for all languages, or sometimes use javascript to determine whether a window with auto-scrollbar css attributes actually has content with a height which would trigger a scrollbar or not.
If detecting the presence of, or count of, CJK characters will be adequate to determine you need a bit of extra space, you could construct a regex using the following ranges:
[\u3300-\u9fff\uf900-\ufaff], and use that to extract a count of the number of characters that match. (This is a little excessively coarse, and misses all the non-BMP cases, probably excludes some other relevant ranges, and most likely includes some irrelevant characters, but it's a starting point).
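As a minimal sketch of that counting heuristic, using exactly the ranges given above (the function name countCjk is made up for this example):

```javascript
// Rough heuristic only: counts code units in the two BMP ranges named above.
var cjk = /[\u3300-\u9fff\uf900-\ufaff]/g;

function countCjk(s) {
  var matches = s.match(cjk); // null when there is no match at all
  return matches === null ? 0 : matches.length;
}

console.log(countCjk("Wikipedia 中文維基百科")); // 6
console.log(countCjk("plain ASCII"));            // 0
```

The count could then feed whatever sizing rule the pop-up uses, for example widening the window once the count passes some threshold chosen by experiment.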
Again, you're only going to be able to manage a rough heuristic without something along the lines of a full text rendering engine, because what you really want is something like GDI's MeasureString (or any other text rendering engine's equivalent). It's been a while since I've done so, but I think the closest HTML/DOM equivalent is setting a width on a div and requesting the height (cut and paste reuse, so apologies if this contains errors):
o = document.getElementById("test");
document.defaultView.getComputedStyle(o,"").getPropertyValue("height")
Here is a benchmark test: http://jsben.ch/NKjKd
This is much faster:
function containsNonLatinCodepoints(s) {
return /[^\u0000-\u00ff]/.test(s);
}
than this:
function isDoubleByte(str) {
for (var i = 0, n = str.length; i < n; i++) {
if (str.charCodeAt( i ) > 255) { return true; }
}
return false;
}
Why not let the window resize itself based on the runtime height/width?
Run something like this in your pop-up:
window.resizeTo(document.body.clientWidth, document.body.clientHeight);