In Node.js I'm able to print to the console a Japanese character like this:
console.log('\u3041');
If I have 3041 as a number, for instance because it is randomly generated, how do I print the corresponding UTF-8 sigil?
const charNumber = 3041;
// of course this doesn't work, but I need something like that:
console.log(`\u${charNumber}`);
You can use the HEX representation with .fromCharCode by replacing \u with 0x:
const charNumber = 3041;
console.log(String.fromCharCode(`0x${charNumber}`));
Related
I want to upload a string to a Firebase Storage file. I tried it with
let myString = '...';
storage.ref('path')
.child('file.txt')
.putString(myString, 'raw');
and with
let myString = '...';
storage.ref('path')
.child('file.txt')
.putString(btoa(myString), 'base64');
and with
let myString = '...';
let array = new TextEncoder("utf-8").encode(myString);
storage.ref('path')
.child('file.txt')
.put(myString);
but when I download the files via the Firebase Console, they are always encoded as a string representation of an char array. I.e. the text {"type":"a"} ends up as
123,34,116,121,112,101,34,58,34,97,34,124
which is the ASCII representation of
{"type":"a"}
This is the actual content of the file. Instead of characters, I get a comma-separated list of character codes.
Is that a bug, or am I doing something wrong?
EDIT: In case it matters, this happens in Javascript, on React Native / Expo.
I have a string containing special characters, like:
Hello 🍀.
As far as I understand "🍀" is an UTF16 character.
How can I remove this "🍀" character and any other not UTF8 characters from string?
The problem is that .Net and JavaScript see it as two valid UTF8 characters:
int cs_len = "🍀".Length; // == 2 - C#
var js_len = "🍀".length // == 2 - javascript
where
strIn[0] is 55356 UTF8 character == ☐
and
strIn[1] is 57152 UTF8 character == ☐
And also next code snippets returns the same result:
string strIn = "Hello 🍀";
string res;
byte[] bytes = Encoding.UTF8.GetBytes(strIn);
res = Encoding.UTF8.GetString(bytes);
return res;//Hello 🍀
and
string res = null;
using (var stream = new MemoryStream())
{
var sw = new StreamWriter(stream, Encoding.UTF8);
sw.Write(strIn);
sw.Flush();
stream.Position = 0;
using (var sr = new StreamReader(stream, Encoding.UTF8))
{
res = sr.ReadToEnd();
}
}
return res;//Hello 🍀
I also need to support not only English but also Chinese and Japanese and any other languages, also any other UTF8 characters. How can I remove or replace any UTF16 characters in C# or JavaScript code, including 🍀 sign.
Thanks.
UTF-16 and UTF-8 "contain" the same number of "characters" (to be precise: of code points that may represent a character, thanks to David Haim), the only difference is how they are encoded to bytes.
In your example "🍀" is 3C D8 40 DF in UTF-16 and F0 9F 8D 80 in UTF-8.
From your problem-description and your pasted string I suspect that your sourcecode is encoded in UTF-8 but your compiler/interpreter is reading it as UTF-16. So it will interpret the one-character UTF-sequence F0 9F 8D 80 as two separate UTF-16-characters F0 9f and 8D 80 - the first is an invalid unicode-character and the second is the "Han Character".
As for how to solve the issue:
In your example you should look at the editor you use for creating your sources what encoding it uses to save the files plus you should check whether you can specify that encoding as a compiler-option.
You should also be aware that things will look quite different once you don't use hardcoded string-literals but read your input from a file or over the network - you will have to handle encoding-issues already when reading your input.
I found a solution to my question, it does not covers all the utf-16 characters, but removes many of them:
var title =
title.replace(/([\uE000-\uF8FF]|\uD83C[\uDF00-\uDFFF]|\uD83D[\uDC00-\uDDFF])/g, '*');
Here, I replace all special characters with a "star" *. You can also put an empty string '' to remove them.
The meaning of /g at the end of the string, is to remove all the occurrences of these special characters, because without it string.replace(...) probably will remove only the first one.
string teste = #"F:\Thiago\Programação\Projetos\OnlineAppfdsdf^~²$\XML\nfexml";
string strConteudo = Regex.Replace(teste, "[^0-9a-zA-Z\\.\\,\\/\\x20\\/\\x1F\\-\\r\\n]+", string.Empty);
WriteLine($"Teste: {teste}" +
$"\nTeste2: {strConteudo}");
How do I convert the below string:
var string = "Bouchard+P%E8re+et+Fils"
using javascript into UTF-8, so that %E8 would become %C3%A8?
Reason is this character seems to be tripping up decodeURIComponent
You can test it out by dropping the string into http://meyerweb.com/eric/tools/dencoder/ and seeing the console error that says Uncaught URIError: URI malformed
I'm looking specifically for something that can decode an entire html document, that claims to be windows-1252 encoded which is where I assume this %E8 character is coming from, into UTF-8.
Thanks!
First create a map of Windows-1252. You can find references to the encoding using your search engine of choice.
For the sake of this example, I'm going to include on the character in your sample data.
Then find all the percentage signs followed by two hexadecimal characters, convert them to numbers, and convert them using the map (to get raw data), then convert them again using encodeURIComponent (to get the encoded data).
var string = "Bouchard+P%E8re+et+Fils"
var w2512chars = [];
w2512chars[232] = "è"
var percent_encoded = /(%[a-fA-F0-9]{2})/g;
function filter(match, group) {
var number = parseInt(group.substr(1), 16);
var character = w2512chars[number];
return encodeURIComponent(character);
}
string = string.replace(percent_encoded, filter);
alert(string);
I'm trying to remove every Unicode character in a string if it falls in any the ranges below.
\uD800-\uDFFF
\u1D800-\u1DFFF
\u2D800-\u2DFFF
\u3D800-\u3DFFF
\u4D800-\u4DFFF
\u5D800-\u5DFFF
\u6D800-\u6DFFF
\u7D800-\u7DFFF
\u8D800-\u8DFFF
\u9D800-\u9DFFF
\uAD800-\uADFFF
\uBD800-\uBDFFF
\uCD800-\uCDFFF
\uDD800-\uDDFFF
\uED800-\uEDFFF
\uFD800-\uFDFFF
\u10D800-\u10DFFF
As an initial prototype, I tried to just remove characters within the first range by using a regex in the replace function.
var buffer = "he\udfffllo world";
var output = buffer.replace(/[\ud800-\udfff]/g, "");
d.innerText = buffer + " is replaced with " + output;
In this case, the character seems to have been replaced fine.
However, when I replace that with
var buffer = "he\udfffllo worl\u1dfffd";
var output = buffer.replace(/[\ud800-\udfff\u1d800-\u1dfff]/g, "");
d.innerText = buffer + " is replaced with " + output;
I see something unexpected. My output shows up as:
he�llo worl᷿fd is replaced with
There are two things to note here:
\u1dfff does not show up as one character - \u1dff gets converted to a character and the f at the end it treated as its own character
the result is an empty string.
Any suggestions on how I can accomplish this would be much appreciated.
EDIT
My overall goal is to filter out all characters that the encodeURIComponent function considers invalid. I ran some tests and found the list above to be the set of characters that a invalid. For instance, the code below, which first converts 1dfff to a unicode character before passing that to encodeURIComponent causes an exception to be raised by the latter function.
var v = String.fromCharCode(122879);
var uriComponent = encodeURIComponent(v);
I edited parts of the question after #Blender pointed out that i was using x instead of u in my code to represent Unicode characters.
EDIT 2
I investigated my technique for fetching the "invalid" unicode ranges further, and as it turns out, if you give String.fromCharacterCode a number that's larger than 16 bits, it'll just look at the lowest 16 bits of the number. That explains the pattern I was seeing. So as it turns out, I only need to worry about the first range.
It seems you're trying to remove Unicode surrogate code units from the string. However, only U+D800 through U+DFFF are surrogate code points; the remaining values you name are not, and could be allocated to valid Unicode characters. In that case, the following will suffice (use \u rather than \x to refer to Unicode characters):
buffer.replace(/[\ud800-\udfff]/g, "");
I'm looking for av way to convert a string into whitespace; spaces, newlines and tabs, and the other way around.
I found a Python script, but I have no idea how to do it using Javascript.
I need it for a white-hacking contest.
I can has banana? ;)
var ws={x:'0123',y:' \t\r\n',a:/[\w\W]/g,b:/[\w\W]{8}/g,c:function(z){return(
ws.y+ws.x)[(ws.x+ws.y).indexOf(z)]},e:function(s){return(65536+s.charCodeAt(0)
).toString(4).substr(1).replace(ws.a,ws.c)},d:function(s){return String.
fromCharCode(parseInt(s.replace(ws.a,ws.c),4))},encode:function(s){return s.
replace(ws.a,ws.e)},decode:function(s){return s.replace(ws.b,ws.d)}};
// test string
var s1 = 'test0123456789AZaz€åäöÅÄÖ';
// show test string
alert(s1);
// encode test string
var code = ws.encode(s1);
// show encoded string
alert('"'+code+'"');
// decode string
var s2 = ws.decode(code);
// show decoded string
alert(s2);
// verify that the strings are completely identical
alert(s1 === s2);