I have UTF-32 data, an array buffer. I need to convert it into an ECMAScript string.
I've been told that I can just use TextDecoder with UTF-8 and that it's supposed to "just work." I highly doubted the person who told me this, but it worked anyway.
Except... the output text is riddled with null characters (three per character), because the null-byte padding is read as separate null characters instead of all four bytes being read as one character.
ex:
\x70\x00\x00\x00
becomes
p (UTF-32: the null padding is part of the single character)
p\0\0\0 (UTF-8: the padding bytes are decoded separately)
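The behaviour is easy to reproduce; a minimal sketch using the standard TextDecoder (available in browsers and Node):

```javascript
// UTF-32LE bytes for "p" (U+0070)
const bytes = new Uint8Array([0x70, 0x00, 0x00, 0x00]);

// Decoding as UTF-8 treats each padding byte as its own character
const decoded = new TextDecoder('utf-8').decode(bytes);
console.log(JSON.stringify(decoded)); // "p\u0000\u0000\u0000"
console.log(decoded.length);          // 4, not 1
```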
According to the WHATWG Encoding spec, UTF-32 is not a defined encoding label; only UTF-8 and UTF-16 are. Does anyone have any suggestions on how I can achieve proper UTF-32 decoding within a browser?
To be clear, I only care about modern browsers, so I am excluding IE, Amaya, Android WebView, Netscape Navigator, etc.
Decoding it as UTF-8 is definitely wrong, as you found out. In addition to the NUL problem, it will fail entirely to decode characters outside of ASCII.
You can read the codepoints one by one with a DataView to decode:
const utf32Decode = bytes => {
    const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
    let result = '';

    for (let i = 0; i < bytes.length; i += 4) {
        result += String.fromCodePoint(view.getInt32(i, true));
    }

    return result;
};
const result = utf32Decode(new Uint8Array([0x70, 0x00, 0x00, 0x00]));
console.log(JSON.stringify(result));
Invalid UTF-32 will throw an error, thanks to getInt32 (invalid lengths) and String.fromCodePoint (invalid codepoints).
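If the byte order isn't known up front, a BOM check can pick the endianness. A sketch extending the same approach (the BOM handling is my own addition, not part of the answer above, and it falls back to little-endian when no BOM is present):

```javascript
// Decode UTF-32 with optional BOM detection; defaults to little-endian
const utf32DecodeAuto = (bytes) => {
    const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
    let little = true;
    let start = 0;

    if (bytes.byteLength >= 4) {
        const first = view.getUint32(0, false); // read the first word big-endian
        if (first === 0x0000FEFF) { little = false; start = 4; } // BE BOM
        else if (first === 0xFFFE0000) { little = true; start = 4; } // LE BOM
    }

    let result = '';
    for (let i = start; i < bytes.byteLength; i += 4) {
        result += String.fromCodePoint(view.getUint32(i, little));
    }
    return result;
};

console.log(utf32DecodeAuto(new Uint8Array([0x70, 0x00, 0x00, 0x00]))); // "p"
```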
Use this library: https://github.com/ashtuchkin/iconv-lite. It works in-browser using browserify or webpack (though it's pretty big).
Example:
const iconv = require("iconv-lite")
const yourBuffer = // however you're getting your buffer
const str = iconv.decode(yourBuffer, "utf32");
Related
I want to take a string of emoji and do something with the individual characters.
In JavaScript "😴😄😃⛔🎠🚓🚇".length == 13 because "⛔" length is 1, the rest are 2. So we can't do
var string = "😴😄😃⛔🎠🚓🚇";
s = string.split("");
console.log(s);
JavaScript ES6 has a solution for a real split!
[..."😴😄😃⛔🎠🚓🚇"] // ["😴", "😄", "😃", "⛔", "🎠", "🚓", "🚇"]
Yay? Except for the fact that when you run this through your transpiler, it might not work (see @brainkim's comment). It only works when run natively in an ES6-compliant browser. Luckily this encompasses most browsers (Safari, Chrome, FF), but if you're looking for high browser compatibility this is not the solution for you.
Edit: see Orlin Georgiev's answer for a proper solution in a library: https://github.com/orling/grapheme-splitter
Thanks to this answer I made a function that takes a string and returns an array of emoji:
var emojiStringToArray = function (str) {
    var split = str.split(/([\uD800-\uDBFF][\uDC00-\uDFFF])/);
    var arr = [];
    for (var i = 0; i < split.length; i++) {
        var char = split[i];
        if (char !== "") {
            arr.push(char);
        }
    }
    return arr;
};
So
emojiStringToArray("😴😄😃⛔🎠🚓🚇")
// => Array [ "😴", "😄", "😃", "⛔", "🎠", "🚓", "🚇" ]
The grapheme-splitter library does just that; it is fully compatible even with old browsers and works not just with emoji but with all sorts of exotic characters:
https://github.com/orling/grapheme-splitter
You are likely to miss edge cases in any home-brew solution. This one is actually based on the UAX #29 Unicode standard.
The modern / proper way to split a string into its code points is using Array.from(str) instead of str.split('').
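For example, both the spread form and Array.from iterate by code point, while split('') cuts between UTF-16 code units:

```javascript
const s = 'a\u{1F600}b'; // 'a😀b' — the emoji is one code point but two code units

console.log(s.split('').length); // 4: the surrogate pair is torn apart
console.log(Array.from(s));      // ['a', '😀', 'b']
console.log([...s].length);      // 3: same code-point iteration as Array.from
```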
With the upcoming Intl.Segmenter, you can do this:
const splitEmoji = (string) => [...new Intl.Segmenter().segment(string)].map(x => x.segment)
splitEmoji("😴😄😃⛔🎠🚓🚇") // ['😴', '😄', '😃', '⛔', '🎠', '🚓', '🚇']
This also solves the problem with "👨‍👨‍👧‍👧" and "👦🏾".
splitEmoji("👨‍👨‍👧‍👧👦🏾") // ['👨‍👨‍👧‍👧', '👦🏾']
According to Can I Use, Intl.Segmenter is supported everywhere apart from IE and Firefox, which currently amounts to 84.17% of users globally.
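Intl.Segmenter defaults to grapheme granularity, but it also supports 'word' and 'sentence'. A small sketch (the isWordLike flag is how the API marks word segments, as opposed to spaces and punctuation):

```javascript
// Default granularity is 'grapheme': splits by user-perceived character
const graphemes = [...new Intl.Segmenter().segment('a\u{1F600}b')].map(s => s.segment);
console.log(graphemes); // ['a', '😀', 'b']

// Word granularity: keep only the word-like segments
const words = [...new Intl.Segmenter('en', { granularity: 'word' }).segment('Hello, world!')]
    .filter(s => s.isWordLike)
    .map(s => s.segment);
console.log(words); // ['Hello', 'world']
```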
It can be done using the u flag of a regular expression. The regular expression is:
/.*?/u
This splits the string at every empty match of the lazy pattern, which with the u flag falls between every full character (code point):
As few as possible: ? (split at zero-character matches)
Zero or more: *
Any character except a line break: .
Surrogate pairs such as emoji count as one character: the u flag
By using the question mark ? I am forcing it to cut at every zero-character match; otherwise /.*/u cuts out everything up to a line break.
var string = "😴😄😃⛔🎠🚓🚇"
var c = string.split(/.*?/u)
console.log(c)
The Grapheme Splitter library by Orlin Georgiev is pretty amazing.
Although it hasn't been updated in a while, and presently (Sep 2020) it only supports Unicode 10 and below.
For an updated version of Grapheme Splitter built in TypeScript with Unicode 13 support, have a look at: https://github.com/flmnt/graphemer
Here is a quick example:
import Graphemer from 'graphemer';
const splitter = new Graphemer();
const string = "😴😄😃⛔🎠🚓🚇";
splitter.countGraphemes(string); // returns 7
splitter.splitGraphemes(string); // returns array of characters
The library also works with the latest emojis.
For example "👩🏻‍🦰".length === 7 but splitter.countGraphemes("👩🏻‍🦰") === 1.
Full disclosure: I created the library and did the work to update it to Unicode 13. The API is identical to Grapheme Splitter and is entirely based on that work, just updated to the latest version of Unicode, as the original library hasn't been updated for a couple of years and seems to no longer be maintained.
I have a string containing special characters, like:
Hello 🍀.
As far as I understand, "🍀" is a UTF-16 character.
How can I remove this "🍀" character, and any other non-UTF-8 characters, from the string?
The problem is that .NET and JavaScript see it as two valid characters:
int cs_len = "🍀".Length; // == 2 - C#
var js_len = "🍀".length  // == 2 - JavaScript
where
strIn[0] is 55356, shown as �
and
strIn[1] is 57152, shown as �
Also, the next code snippets return the same result:
string strIn = "Hello 🍀";
string res;
byte[] bytes = Encoding.UTF8.GetBytes(strIn);
res = Encoding.UTF8.GetString(bytes);
return res; //Hello 🍀
and
string res = null;
using (var stream = new MemoryStream())
{
    var sw = new StreamWriter(stream, Encoding.UTF8);
    sw.Write(strIn);
    sw.Flush();
    stream.Position = 0;
    using (var sr = new StreamReader(stream, Encoding.UTF8))
    {
        res = sr.ReadToEnd();
    }
}
return res; //Hello 🍀
I also need to support not only English but also Chinese, Japanese, and any other language, along with any other UTF-8 characters. How can I remove or replace any such characters in C# or JavaScript code, including the 🍀 sign?
Thanks.
UTF-16 and UTF-8 "contain" the same number of "characters" (to be precise, of code points that may represent a character; thanks to David Haim); the only difference is how they are encoded to bytes.
In your example "🍀" is 3C D8 40 DF in UTF-16 and F0 9F 8D 80 in UTF-8.
From your problem description and your pasted string, I suspect that your source code is encoded in UTF-8 but your compiler/interpreter is reading it as UTF-16. So it will interpret the one-character UTF-8 sequence F0 9F 8D 80 as two separate UTF-16 characters, F0 9F and 8D 80; the first is an invalid Unicode character and the second is the "Han Character".
As for how to solve the issue:
In your example, check what encoding the editor you use to create your sources saves the files in, and whether you can specify that encoding as a compiler option.
You should also be aware that things will look quite different once you don't use hardcoded string literals but read your input from a file or over the network; you will have to handle encoding issues already when reading your input.
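The UTF-16 sequence 3C D8 40 DF from the example is the little-endian form of the surrogate pair D83C DF40; the pair arithmetic can be verified directly in JavaScript:

```javascript
const hi = 0xD83C; // high (lead) surrogate — 55356, as seen in strIn[0]
const lo = 0xDF40; // low (trail) surrogate — 57152, as seen in strIn[1]

// Combine the surrogate pair into a single code point
const cp = (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
console.log(cp.toString(16)); // '1f340' — U+1F340, the clover character
console.log(String.fromCodePoint(cp).length); // 2 code units, but 1 code point
```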
I found a solution to my question; it does not cover all such characters, but it removes many of them:
var title =
title.replace(/([\uE000-\uF8FF]|\uD83C[\uDF00-\uDFFF]|\uD83D[\uDC00-\uDDFF])/g, '*');
Here, I replace all special characters with a "star" *. You can also put an empty string '' to remove them.
The /g flag at the end of the regex applies the replacement to all occurrences of these special characters; without it, string.replace(...) will replace only the first one.
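On modern engines, a Unicode property escape covers far more pictographic characters than hand-picked surrogate ranges. A sketch (this is an alternative to the ranges above, not the same answer; it requires the u flag):

```javascript
const title = 'Hello \u{1F340} world \u{26BD}'; // 'Hello 🍀 world ⚽'

// \p{Extended_Pictographic} matches emoji and similar pictographic symbols
const replaced = title.replace(/\p{Extended_Pictographic}/gu, '*');
console.log(replaced); // 'Hello * world *'
```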
string teste = @"F:\Thiago\Programação\Projetos\OnlineAppfdsdf^~²$\XML\nfexml";
string strConteudo = Regex.Replace(teste, "[^0-9a-zA-Z\\.\\,\\/\\x20\\/\\x1F\\-\\r\\n]+", string.Empty);
WriteLine($"Teste: {teste}" +
    $"\nTeste2: {strConteudo}");
I'd like to remove all invalid UTF-8 characters from a string in JavaScript. I've tried with this JavaScript:
strTest = strTest.replace(/([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./g, "$1");
It seems that the UTF-8 validation regex described here (link removed) is more complete and I adapted it in the same way like:
strTest = strTest.replace(/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./g, "$1");
Both of these pieces of code seem to allow valid UTF-8 through, but they filter out hardly any of the bad UTF-8 characters from my test data: UTF-8 decoder capability and stress test. Either the bad characters come through unchanged, or they seem to have some of their bytes removed, creating a new, invalid character.
I'm not very familiar with the UTF-8 standard or with multibyte in JavaScript so I'm not sure if I'm failing to represent proper UTF-8 in the regex or if I'm applying that regex improperly in JavaScript.
Edit: added global flag to my regex per Tomalak's comment - however this still isn't working for me. I'm abandoning doing this on the client side per bobince's comment.
I use this simple and sturdy approach:
function cleanString(input) {
    var output = "";
    for (var i = 0; i < input.length; i++) {
        if (input.charCodeAt(i) <= 127) {
            output += input.charAt(i);
        }
    }
    return output;
}
Basically all you really want are the ASCII chars 0-127, so just rebuild the string char by char. If it's a good char, keep it; if not, ditch it. Pretty robust, and if sanitization is your goal, it's fast enough (in fact it's really fast).
JavaScript strings are natively Unicode. They hold character sequences* not byte sequences, so it is impossible for one to contain an invalid byte sequence.
(Technically, they actually contain UTF-16 code unit sequences, which is not quite the same thing, but this probably isn't anything you need to worry about right now.)
You can, if you need to for some reason, create a string holding characters used as placeholders for bytes, i.e. using the character U+0080 ('\x80') to stand for the byte 0x80. This is what you would get if you encoded characters to bytes using UTF-8, then decoded them back to characters using ISO-8859-1 by mistake. There is a special JavaScript idiom for this:
var bytelike= unescape(encodeURIComponent(characters));
and to get back from UTF-8 pseudobytes to characters again:
var characters= decodeURIComponent(escape(bytelike));
(This is, notably, pretty much the only time the escape/unescape functions should ever be used. Their existence in any other program is almost always a bug.)
decodeURIComponent(escape(bytes)), since it behaves like a UTF-8 decoder, will raise an error if the sequence of code units fed into it would not be acceptable as UTF-8 bytes.
It is very rare for you to need to work on byte strings like this in JavaScript. Better to keep working natively in Unicode on the client side. The browser will take care of UTF-8-encoding the string on the wire (in a form submission or XMLHttpRequest).
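A quick round trip showing the idiom at work (é encodes to two UTF-8 bytes, so the pseudobyte string is longer than the character string):

```javascript
const characters = '\u00E9a'; // 'éa'

// characters → UTF-8 "pseudobytes" (one code unit per byte)
const bytelike = unescape(encodeURIComponent(characters));
console.log(bytelike.length); // 3: 0xC3 0xA9 for 'é', plus 'a'

// pseudobytes → characters again
const roundTripped = decodeURIComponent(escape(bytelike));
console.log(roundTripped === characters); // true
```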
Languages like Spanish and French have accented characters like "é" whose codes are in the range 160-255; see https://www.ascii.cl/htmlcodes.htm
function cleanString(input) {
    var output = "";
    for (var i = 0; i < input.length; i++) {
        if (input.charCodeAt(i) <= 127 || (input.charCodeAt(i) >= 160 && input.charCodeAt(i) <= 255)) {
            output += input.charAt(i);
        }
    }
    return output;
}
Simple mistake, big effect:
strTest = strTest.replace(/your regex here/g, "$1");
// ----------------------------------------^
without the "global" flag, the replace occurs for the first match only.
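With the global flag in place, the capture-group pattern from the question keeps whole valid sequences and drops stray bytes. A sketch on a "pseudobyte" string (one code unit per byte; the sample input here is my own):

```javascript
// Valid UTF-8 sequences are captured; any other single byte falls through to `.`
const re = /([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./g;

// 'ab' + the two-byte sequence for 'é' (C3 A9) + a stray continuation byte (80) + 'c'
const pseudoBytes = 'ab\xC3\xA9\x80c';

// An unmatched capture group substitutes as the empty string
const cleaned = pseudoBytes.replace(re, '$1');
console.log(cleaned === 'ab\xC3\xA9c'); // true: only the stray 0x80 is removed
```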
Side note: To remove any character that does not fulfill some kind of complex condition, like falling into a set of certain Unicode character ranges, you can use negative lookahead:
var re = /(?![\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})./g;
strTest = strTest.replace(re, "")
where re reads as
(?! # negative look-ahead: a position *not followed by*:
[…] # any allowed character range from above
)   # end lookahead
.   # match this character (only if previous condition is met!)
If you're trying to remove the "invalid character" - � - from JavaScript strings then you can get rid of them like this:
myString = myString.replace(/\uFFFD/g, '')
I ran into this problem with a really weird result from the Date Taken data of a digital image. My scenario is admittedly unique: using Windows Script Host (WSH) and the Shell.Application ActiveX object, which allows for getting the namespace object of a folder and calling the GetDetailsOf function to essentially return EXIF data after it has been parsed by the OS.
var app = new ActiveXObject("Shell.Application");
var info = app.Namespace("c:\\");
var date = info.GetDetailsOf(info.ParseName("testimg.jpg"), 12);
In Windows Vista and 7, the result looked like this:
?8/?27/?2011 ??11:45 PM
So my approach was as follows:
var chars = date.split(''); // split into characters
var clean = "";
for (var i = 0; i < chars.length; i++) {
    if (chars[i].charCodeAt(0) < 255) clean += chars[i];
}
The result of course is a string that excludes those question mark characters.
I know you went with a different solution altogether, but I thought I'd post my solution in case anyone else is having troubles with this and cannot use a server side language approach.
I used @Ali's solution to not only clean my string, but also replace the invalid chars with their HTML entity equivalents:
function cleanString(input) {
    var output = "";
    for (var i = 0; i < input.length; i++) {
        if (input.charCodeAt(i) <= 127) {
            output += input.charAt(i);
        } else {
            output += "&#" + input.charCodeAt(i) + ";";
        }
    }
    return output;
}
I have put together some of the solutions proposed above into an error-safe version:
var removeNonUtf8 = (characters) => {
    try {
        // ignore invalid char ranges
        var bytelike = unescape(encodeURIComponent(characters));
        characters = decodeURIComponent(escape(bytelike));
    } catch (error) { }
    // remove �
    characters = characters.replace(/\uFFFD/g, '');
    return characters;
};
Are there any equivalent JavaScript functions for Python's urllib.parse.quote() and urllib.parse.unquote()?
The closest I've come across are encodeURI()/encodeURIComponent() and escape() (and their corresponding un-encoding functions), but they don't encode/decode the same set of special characters as far as I can tell.
JavaScript              | Python
------------------------|---------------------------------------------------
encodeURI(str)          | urllib.parse.quote(str, safe="~@#$&()*!+=:;,?/'")
encodeURIComponent(str) | urllib.parse.quote(str, safe="~()*!'")
On Python 3.7+ you can remove ~ from safe=.
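The difference is visible from the JavaScript side: encodeURIComponent leaves ~ ( ) * ! ' untouched, which Python's quote (with its default safe='/') would escape:

```javascript
const input = "a b/c~()*!'";

// Space and slash are percent-encoded; the mark characters pass through
const encoded = encodeURIComponent(input);
console.log(encoded); // "a%20b%2Fc~()*!'"
```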
OK, I think I'm going to go with a hybrid custom set of functions:
Encode: Use encodeURIComponent(), then put slashes back in.
Decode: Decode any %hex values found.
Here's a more complete variant of what I ended up using (it handles Unicode properly, too):
function quoteUrl(url, safe) {
    if (typeof(safe) !== 'string') {
        safe = '/'; // Don't escape slashes by default
    }
    url = encodeURIComponent(url);
    // Unescape characters that were in the safe list
    var toUnencode = [];
    for (var i = safe.length - 1; i >= 0; --i) {
        var encoded = encodeURIComponent(safe[i]);
        if (encoded !== safe.charAt(i)) { // Ignore safe char if it wasn't escaped
            toUnencode.push(encoded);
        }
    }
    url = url.replace(new RegExp(toUnencode.join('|'), 'ig'), decodeURIComponent);
    return url;
}
var unquoteUrl = decodeURIComponent; // Make alias to have symmetric function names
Note that if you don't need "safe" characters when encoding ('/' by default in Python), then you can just use the built-in encodeURIComponent() and decodeURIComponent() functions directly.
Also, if there are Unicode characters (i.e. characters with codepoint >= 128) in the string, then to maintain compatibility with JavaScript's encodeURIComponent(), the Python quote_url() would have to be:
import urllib

def quote_url(url, safe):
    """URL-encodes a string (either str (i.e. ASCII) or unicode);
    uses de-facto UTF-8 encoding to handle Unicode codepoints in given string.
    """
    return urllib.quote(unicode(url).encode('utf-8'), safe)
And unquote_url() would be:
def unquote_url(url):
    """Decodes a URL that was encoded using quote_url.
    Returns a unicode instance.
    """
    return urllib.unquote(url).decode('utf-8')
The requests library is a bit more popular if you don't mind the extra dependency
from requests.utils import quote
quote(str)
Python: urllib.quote
JavaScript: unescape
I haven't done extensive testing, but for my purposes it works most of the time. I guess you have some specific characters that don't work. Maybe if I use some Asian text or something it will break :)
This came up when I googled so I put this in for all the others, if not specifically for the original question.
Here are implementations based on an implementation in the GitHub repo purescript-python:
import urllib.parse as urllp

def encodeURI(s): return urllp.quote(s, safe="~@#$&()*!+=:;,.?/'")
def decodeURI(s): return urllp.unquote(s, errors="strict")
def encodeURIComponent(s): return urllp.quote(s, safe="~()*!.'")
def decodeURIComponent(s): return urllp.unquote(s, errors="strict")
Try a regex. Something like this:
mystring.replace(/[\u00FF-\uFFFF]/g, function (c) {
    return "%" + c.charCodeAt(0).toString(16).toUpperCase();
});
That will replace any character from ordinal 255 upward with its corresponding %HEX representation.
decodeURIComponent() is similar to unquote
const unquote = decodeURIComponent
const unquote_plus = (s) => decodeURIComponent(s.replace(/\+/g, ' '))
except that Python is much more forgiving. If one of the two characters after a % is not a hex digit (or there aren't two characters after a %), JavaScript will throw a URIError: URI malformed error, whereas Python will just leave the % as is.
encodeURIComponent() is not quite the same as quote, you need to percent encode a few more characters and un-escape /:
const quoteChar = (c) => '%' + c.charCodeAt(0).toString(16).padStart(2, '0').toUpperCase()
const quote = (s) => encodeURIComponent(s).replace(/[()*!']/g, quoteChar).replace(/%2F/g, '/')
const quote_plus = (s) => quote(s).replace(/%20/g, '+')
The characters that Python's quote doesn't escape are documented here and listed (on Python 3.7+) as: "Letters, digits, and the characters '_.-~' are never quoted. By default, this function is intended for quoting the path section of a URL. The optional safe parameter specifies additional ASCII characters that should not be quoted – its default value is '/'"
The characters that JavaScript's encodeURIComponent doesn't encode are documented here and listed as uriAlpha (upper- and lowercase ASCII letters), DecimalDigit, and uriMark, which are - _ . ! ~ * ' ( ).
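Putting the definitions above together, a quick usage check (the functions are repeated here so the snippet is self-contained):

```javascript
// Percent-encode a single character as %XX (uppercase hex)
const quoteChar = (c) => '%' + c.charCodeAt(0).toString(16).padStart(2, '0').toUpperCase();

// Mimic Python's quote: also escape ( ) * ! ' and keep / unescaped
const quote = (s) => encodeURIComponent(s).replace(/[()*!']/g, quoteChar).replace(/%2F/g, '/');

// Mimic Python's quote_plus: spaces become +
const quote_plus = (s) => quote(s).replace(/%20/g, '+');

console.log(quote('a b/c!'));      // 'a%20b/c%21' — like Python's quote('a b/c!')
console.log(quote_plus('a b/c!')); // 'a+b/c%21'   — like Python's quote_plus('a b/c!')
```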