Javascript convert windows-1252 encoding to UTF-8


How do I convert the below string:
var string = "Bouchard+P%E8re+et+Fils"
using javascript into UTF-8, so that %E8 would become %C3%A8?
The reason is that this character seems to be tripping up decodeURIComponent.
You can test it by dropping the string into http://meyerweb.com/eric/tools/dencoder/ and watching the console, which reports Uncaught URIError: URI malformed.
I'm looking specifically for something that can convert an entire HTML document that claims to be windows-1252 encoded (which is where I assume this %E8 character comes from) into UTF-8.
Thanks!

First, create a map of Windows-1252. You can find references to the encoding using your search engine of choice.
For the sake of this example, I'm going to include only the character in your sample data.
Then find every percent sign followed by two hexadecimal digits, convert the digits to a number, look that number up in the map (to get the raw character), and pass the result to encodeURIComponent (to get the UTF-8 percent-encoded data).
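var string = "Bouchard+P%E8re+et+Fils";
// Partial Windows-1252 map: byte value -> character (only the sample's byte is filled in)
var w1252chars = [];
w1252chars[232] = "è";
var percent_encoded = /(%[a-fA-F0-9]{2})/g;
function filter(match, group) {
  // Drop the leading "%" and parse the two hex digits
  var number = parseInt(group.slice(1), 16);
  var character = w1252chars[number];
  // Leave escapes that are not in the map untouched
  if (character === undefined) return match;
  return encodeURIComponent(character);
}
string = string.replace(percent_encoded, filter);
alert(string); // Bouchard+P%C3%A8re+et+Fils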
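The one-entry map above can be extended to the whole code page. The following is a minimal sketch, assuming the usual Windows-1252 layout: bytes 0x80 to 0x9F map to the code points listed below, and every other byte maps to the code point of the same value. The names CP1252_HIGH, cp1252ByteToChar and reencodeCp1252Escapes are made up for this example.

```javascript
// Code points for bytes 0x80..0x9F, the only range where Windows-1252
// differs from Latin-1. The five unassigned bytes (0x81, 0x8D, 0x8F, 0x90,
// 0x9D) are mapped to the control character of the same value.
const CP1252_HIGH = [
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
];

// Convert one Windows-1252 byte (0..255) to the character it represents.
function cp1252ByteToChar(byte) {
  const inHighRange = byte >= 0x80 && byte <= 0x9F;
  const codePoint = inHighRange ? CP1252_HIGH[byte - 0x80] : byte;
  return String.fromCharCode(codePoint);
}

// Rewrite every %XX escape from Windows-1252 to UTF-8 percent-encoding.
// Characters outside %XX escapes (including "+") are left as they are.
function reencodeCp1252Escapes(input) {
  return input.replace(/%([0-9a-fA-F]{2})/g, (match, hex) =>
    encodeURIComponent(cp1252ByteToChar(parseInt(hex, 16))));
}

console.log(reencodeCp1252Escapes("Bouchard+P%E8re+et+Fils"));
// Bouchard+P%C3%A8re+et+Fils
```

After this step the string can be passed to decodeURIComponent without a URIError. Where the runtime's TextDecoder supports the "windows-1252" label, decoding the raw bytes with it is probably simpler than maintaining a table.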

Related

decode hex encoded cbor string as a string in javascript

I have this hex encoded cbor string that I need decoded as a string: 821a0485c1e3ae581c368bedeac13b8f1fbc30cdaadb987635ff95e88960b1ea216e3f96faa14f74656e7461696c656761637931323201581c3dc002874772549f7758066ae025b92a4dab66c57d187dd78821b673a34a6c746c6d6b7973363535014a6c746c6d6b7973383333014b6c746c6d6b79733238393901581c482fb00dc32186a4c587dca2df3c7cf2bc455332ab581d51967306e1a1444d4f4149190743581c49da502b625d310ad3a742c1e747c0027e83426f8717180b21c871b1a14959524941433136373401581c4b92de6b0398970dcafe4aee5329e591819ca11aa3dc163a981c7f99a54f4d654d654f66546865446179363534014f4d654d654f6654686544617937343001504d654d654f665468654461793132343001504d654d654f665468654461793133373801504d654d654f665468654461793234343901581c561696ab9e70db98f8ff5c12f0fdbd837bd1b95d84c748b04ede8fbaa14441504f43190226581c5993061274861159508aef767fc0ccd8b8a9b836171a989ad543fa1fa74f44446f53546573744d696e74373432015044446f53546573744d696e7431323132015044446f53546573744d696e7431333238015044446f53546573744d696e7432333130015044446f53546573744d696e7432393239015044446f53546573744d696e7433303037015044446f53546573744d696e743330313701581c5cf33cfea1b37c289060f55fa09c1fb3b9cb6972e40d9ed2f94a5adaa14f484d5072696f4d6f6e73746572313901581c9b542bc33521163a7ff3a05d1df1bcc0c0ec6a0638337e4b2870f6eaa153496e74726f76657274736d636172643034303201581caaf1f848b36940b0703f43f8d406b815132efe64fccb34bc30f993a0a14f466c7566667957686974656c69737401581cc76975c66380ad2bcdf6b465b3b0df34bdd76112046979b9a834364fa1534f4e434841494e20534b554c4c20233038353301581cd79181749db228d10c98501a7e1728585780bcf133b7b3df953a9017a24e496e74726f766572747332313135014e496e74726f76657274733436383901581cdd589bbcfa48c9a133a22e205da33a5d07ef79dac1f8d5d8067b1004a158187768657265734472696c6c646f54686557616c646f39353301581cf8ed2f8e4fdc992644710e45be7fdc4a556b7517ee2956414b5c2b25a14d434e4654536c6162733032373601
I found this website, https://cbor.me/, which is able to decode it as a string, as seen in the picture.
I have been trying to figure out how it works so I could recreate it in my own code, but have been unsuccessful.
I have tried this, which gives me a string like I wanted, but it doesn't decode the words as seen in the picture:
buf = new Uint8Array(itxt.match(/.{1,2}/g).map(byte => parseInt(byte, 16)))
data = await cbor.diagnose(buf)
console.log(data)
So I have come here to ask whether any of you know the code to do it. Thanks.

console.log(data.replace(/h'(.*?)'/g, function(m, p) {
var s = Buffer.from(p, "hex").toString();
if (/\P{L}/u.test(s)) return m;
else return `'${s}'`;
}));
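This replaces each h'...' hex string with its UTF-8 decoding, unless the decoded string would contain a non-letter (regular expression \P{L}), in which case the hex form is kept.
An alternative to using the Buffer class is the code you already gave in your question (it converts byte by byte, so it matches the Buffer result only for ASCII text):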
var s = p.match(/.{1,2}/g)
.map(byte => String.fromCharCode(parseInt(byte, 16))).join("");
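For environments without Buffer, the same idea can be sketched with TextDecoder, which decodes multi-byte UTF-8 correctly. This is only an illustration: hexToBytes, hexToUtf8 and prettifyDiagnostic are hypothetical helper names, and the rule used here (keep the hex form when the bytes are not valid UTF-8 or contain control characters) is an assumption that is looser than the letters-only test above, so names containing digits are also shown as text.

```javascript
// Turn a hex string such as "74656e" into the bytes it encodes.
function hexToBytes(hex) {
  const pairs = hex.match(/[0-9a-fA-F]{2}/g) || [];
  return Uint8Array.from(pairs, pair => parseInt(pair, 16));
}

// Decode hex-encoded bytes as UTF-8; return null if they are not valid UTF-8.
function hexToUtf8(hex) {
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(hexToBytes(hex));
  } catch (e) {
    return null;
  }
}

// Replace h'...' byte strings in CBOR diagnostic notation with readable
// text where possible, and leave binary-looking values in hex form.
function prettifyDiagnostic(diag) {
  return diag.replace(/h'([0-9a-fA-F]*)'/g, (match, hex) => {
    const text = hexToUtf8(hex);
    if (text === null || text === "" || /\p{Cc}/u.test(text)) return match;
    return `'${text}'`;
  });
}

// The hex value below is one of the asset names from the question's data.
console.log(prettifyDiagnostic("{h'74656e7461696c6567616379313232': 1}"));
// {'tentailegacy122': 1}
```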

Serialize for JavaScript apex

I need to serialize a simple object from .NET to JavaScript, but I have a problem with quotation marks.
C# example
var obj = new { id = 0, label = @"some ""important"" text" };
string json1 = Newtonsoft.Json.JsonConvert.SerializeObject(obj);
string json2 = Newtonsoft.Json.JsonConvert.SerializeObject(obj,
new Newtonsoft.Json.JsonSerializerSettings()
{
StringEscapeHandling = Newtonsoft.Json.StringEscapeHandling.EscapeHtml
});
JavaScript example
var resJson1= JSON.parse('{"id":0,"label":"some \"important\" text"}');
var resJson2= JSON.parse('{"id":0,"label":"some \u0022important\u0022 text"}');
Both parse give me the same error
VM517:1 Uncaught SyntaxError: Unexpected token I in JSON at position 23
at JSON.parse (<anonymous>)
Where am I wrong?

You're pasting the generated string of JSON into a JavaScript string constant without escaping it further. Try
console.log('{"id":0,"label":"some \"important\" text"}');
You'll see {"id":0,"label":"some "important" text"} i.e. the "important" quotes are no longer escaped by backslashes. (And you'll get the same for your \u0022 example too.) If you want to paste in the backslashes you'll have to escape them again:
var resJson1= JSON.parse('{"id":0,"label":"some \\"important\\" text"}');
The JSON you've generated with a single backslash would be fine if read from a file or URL, just not pasted into JavaScript as a string constant.
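The difference can be seen side by side in a short sketch. String.raw is used here only as a way to keep the backslashes exactly as the serializer wrote them; tryParse is a hypothetical helper added for the demonstration.

```javascript
// A plain string literal: JavaScript consumes the backslashes before
// JSON.parse ever sees them, so the inner quotes end the JSON string early.
const pasted = '{"id":0,"label":"some \"important\" text"}';

// String.raw keeps the backslashes, so JSON.parse receives valid JSON.
const raw = String.raw`{"id":0,"label":"some \"important\" text"}`;
const rawUnicode = String.raw`{"id":0,"label":"some \u0022important\u0022 text"}`;

// Return the parsed value, or the error name if parsing fails.
function tryParse(text) {
  try {
    return JSON.parse(text);
  } catch (e) {
    return e.name;
  }
}

console.log(tryParse(pasted));           // SyntaxError
console.log(tryParse(raw).label);        // some "important" text
console.log(tryParse(rawUnicode).label); // some "important" text
```

Since JSON text is also valid JavaScript expression syntax in a case like this one, another option is to emit the serialized object directly into the script without wrapping it in quotes and calling JSON.parse.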

Decoding not working with Base64

Encoding my URL works perfectly with base-64 encoding. So does decoding but not with the string literal variable.
This works:
document.write(atob("hi"));
This does not:
var tempvar = "hello";
document.write(atob(tempvar));
What am I doing wrong? Nothing is displayed. But if I quote "tempvar", then it of course works but is not the same thing since "tempvar" is a string, not a variable.

Your Question
What am I doing wrong?
The string being passed to atob() is a string literal of length 5 (and not technically a base-64 encoded string). The browser console should reveal an exception in the error log (see explanation in The cause below).
The cause
Per the MDN documentation of atob():
Throws
Throws a DOMException if the length of passed-in string is not a multiple of 4. 1
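The length of the string literal "hello" (i.e. 5) is not a multiple of 4; it leaves a remainder of 1, which no base-64 string can have. Thus the exception is thrown instead of returning the decoded version of the string literal. (A two-character input such as "hi" leaves a remainder of 2, which browsers accept as base-64 with the padding omitted.)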
A Solution
One solution is to either use a string that has actually been encoded (e.g. with btoa()) or at least has a length of four (e.g. using String.prototype.substring()). See the snippet below for an example.
var tempvar = "hello";
window.addEventListener("DOMContentLoaded", function(readyEvent) {
  var container = document.getElementById("container");
  //encode the string
  var encoded = btoa(tempvar);
  container.innerHTML = encoded;
  var container2 = document.getElementById("container2");
  //decode the encoded string
  container2.innerHTML = atob(encoded);
  var container3 = document.getElementById("container3");
  //decode the first 4 characters of the string
  container3.innerHTML = atob(tempvar.substring(0, 4));
});
<div> btoa(tempvar): <span id="container"></span></div>
<div> atob(encoded): <span id="container2"></span></div>
<div> atob(tempvar.substring(0, 4)): <span id="container3"></span></div>
1 https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/atob

It's because atob() can't decode the string "hello". Try an actual base-64 string instead; here is an example:
var tempvar = "aHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL3F1ZXN0aW9ucy80MzEyOTEzNi9kZWNvZGluZy1ub3Qtd29ya2luZy13aXRoLWJhc2U2NA==";
document.write(atob(tempvar));
If you want to encode, use the btoa function instead:
var tempvar = "hello";
document.write(btoa(tempvar));
You can use this website to test decoding and encoding base64, https://www.base64encode.org/

It's because you are trying to decode a string that is not base-64 encoded.
That it works on "hi" seems to be just a coincidence.
atob = decode
btoa = encode

You're using the wrong function. You should use btoa() to encode.
When you do atob('hi'), you're actually decoding 'hi', which happens to be valid base-64.
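To avoid the exception, the input can be checked before it is passed to atob(). The sketch below is an approximation of the rule browsers apply, written from the description in this thread rather than taken from any library; looksLikeBase64 and safeAtob are hypothetical names.

```javascript
// Rough check: after removing ASCII whitespace and up to two trailing "="
// (only when the length is a multiple of 4), the length must not leave a
// remainder of 1 and only base-64 alphabet characters may remain.
function looksLikeBase64(input) {
  let s = input.replace(/[\t\n\f\r ]/g, "");
  if (s.length % 4 === 0) s = s.replace(/={1,2}$/, "");
  if (s.length % 4 === 1) return false;
  return /^[A-Za-z0-9+\/]*$/.test(s);
}

// Decode when the input looks decodable, otherwise return null.
function safeAtob(input) {
  return looksLikeBase64(input) ? atob(input) : null;
}

const encoded = btoa("hello");            // encode first ...
console.log(encoded);                     // aGVsbG8=
console.log(safeAtob(encoded));           // hello (... then decode)
console.log(safeAtob("hello"));           // null: 5 characters, remainder 1
console.log(looksLikeBase64("hi"));       // true: 2 characters, remainder 2
```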

How to remove UTF16 characters from string?

I have a string containing special characters, like:
Hello 🍀.
As far as I understand "🍀" is an UTF16 character.
How can I remove this "🍀" character and any other not UTF8 characters from string?
The problem is that .Net and JavaScript see it as two valid UTF8 characters:
int cs_len = "🍀".Length; // == 2 - C#
var js_len = "🍀".length // == 2 - javascript
where
strIn[0] is 55356 UTF8 character == ☐
and
strIn[1] is 57152 UTF8 character == ☐
And also next code snippets returns the same result:
string strIn = "Hello 🍀";
string res;
byte[] bytes = Encoding.UTF8.GetBytes(strIn);
res = Encoding.UTF8.GetString(bytes);
return res;//Hello 🍀
and
string res = null;
using (var stream = new MemoryStream())
{
var sw = new StreamWriter(stream, Encoding.UTF8);
sw.Write(strIn);
sw.Flush();
stream.Position = 0;
using (var sr = new StreamReader(stream, Encoding.UTF8))
{
res = sr.ReadToEnd();
}
}
return res;//Hello 🍀
I also need to support not only English but also Chinese and Japanese and any other languages, also any other UTF8 characters. How can I remove or replace any UTF16 characters in C# or JavaScript code, including 🍀 sign.
Thanks.

UTF-16 and UTF-8 "contain" the same number of "characters" (to be precise: of code points that may represent a character, thanks to David Haim), the only difference is how they are encoded to bytes.
In your example "🍀" is 3C D8 40 DF in UTF-16 (little-endian byte order) and F0 9F 8D 80 in UTF-8.
From your problem description and your pasted string I suspect that your source code is encoded in UTF-8 but your compiler/interpreter is reading it as UTF-16. So it will interpret the one-character UTF-8 sequence F0 9F 8D 80 as two separate UTF-16 characters, F0 9F and 8D 80: the first is an invalid Unicode character and the second is a Han character.
As for how to solve the issue:
In your example you should look at the editor you use for creating your sources what encoding it uses to save the files plus you should check whether you can specify that encoding as a compiler-option.
You should also be aware that things will look quite different once you don't use hardcoded string-literals but read your input from a file or over the network - you will have to handle encoding-issues already when reading your input.

I found a solution to my question. It does not cover all the UTF-16 characters, but it removes many of them:
title = title.replace(/([\uE000-\uF8FF]|\uD83C[\uDF00-\uDFFF]|\uD83D[\uDC00-\uDDFF])/g, '*');
Here, I replace all special characters with a star (*). You can also pass an empty string '' to remove them.
The g flag at the end of the regular expression makes the replacement apply to every occurrence; without it, string.replace(...) replaces only the first match.
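A broader variant of the same idea is sketched below. It assumes that what should be dropped is every character outside the Basic Multilingual Plane, i.e. every character that JavaScript stores as a surrogate pair; that assumption would also remove some rare CJK ideographs that live in the supplementary planes. stripAstral is a hypothetical name.

```javascript
// Remove (or replace) every code point from U+10000 upward. The u flag makes
// the regular expression work on whole code points instead of on the two
// surrogate halves.
function stripAstral(text, replacement = "") {
  return text.replace(/[\u{10000}-\u{10FFFF}]/gu, replacement);
}

const clover = "\u{1F340}";                 // the clover sign from the question
console.log(clover.length);                 // 2 (two UTF-16 code units)
console.log([...clover].length);            // 1 (one code point)
console.log(stripAstral("Hello " + clover));             // "Hello "
console.log(stripAstral("\u4F60\u597D " + clover, "*")); // Chinese text is kept
```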

string teste = @"F:\Thiago\Programação\Projetos\OnlineAppfdsdf^~²$\XML\nfexml";
string strConteudo = Regex.Replace(teste, "[^0-9a-zA-Z\\.\\,\\/\\x20\\/\\x1F\\-\\r\\n]+", string.Empty);
WriteLine($"Teste: {teste}" +
$"\nTeste2: {strConteudo}");

Convert UTF-8 data into the proper string format

If I receive a UTF-8 string via a socket (or for that matter via any external source) I would like to get it as a properly parsed string object. The following code shows what I mean
var str='21\r\nJust a demo string \xC3\xA4\xC3\xA8-should not be anymore parsed';
// Find CRLF
var i=str.indexOf('\r\n');
// Parse size up until CRLF
var x=parseInt(str.slice(0, i));
// Read size bytes
var s=str.substr(i+2, x)
console.log(s);
This code should print
Just a demo string äè
but as the UTF-8 data is not properly parsed it only parses it up to the first Unicode character
Just a demo string Ã¤
Would anyone have an idea how to convert this properly?

It seems you could use this decodeURIComponent(escape(str)):
var badstr='21\r\nJust a demo string \xC3\xA4\xC3\xA8-should not be anymore parsed';
var str=decodeURIComponent(escape(badstr));
// Find CRLF
var i=str.indexOf('\r\n');
// Parse size up until CRLF
var x=parseInt(str.slice(0, i));
// Read size bytes
var s=str.substr(i+2, x)
console.log(s);
BTW, this kind of issue occurs when you mix UTF-8 with other encodings. You should check that as well.
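Because escape() is deprecated, the same conversion can also be sketched with TextDecoder. This assumes the input is a "binary string" in which every character holds one byte (0 to 255), as in the question; binaryStringToUtf8 and readSizedPayload are hypothetical names.

```javascript
// Reinterpret a string whose characters are raw bytes as UTF-8 text.
function binaryStringToUtf8(binary) {
  const bytes = Uint8Array.from(binary, ch => {
    const code = ch.charCodeAt(0);
    if (code > 0xFF) throw new RangeError("not a byte string");
    return code;
  });
  return new TextDecoder("utf-8").decode(bytes);
}

// Parse "<size>\r\n<payload>", where size counts characters after decoding.
function readSizedPayload(binary) {
  const str = binaryStringToUtf8(binary);
  const i = str.indexOf("\r\n");
  const size = parseInt(str.slice(0, i), 10);
  return str.slice(i + 2, i + 2 + size);
}

const badstr = "21\r\nJust a demo string \xC3\xA4\xC3\xA8-should not be anymore parsed";
console.log(readSizedPayload(badstr)); // Just a demo string äè
```

If the size prefix counts bytes rather than characters, the slice would have to be taken on the byte array before decoding; the question does not say which is meant.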

You should use utf8.js which is available on npm.
var utf8 = require('utf8');
var encoded = '21\r\nJust a demo string \xC3\xA4\xC3\xA8-foo bar baz';
var decoded = utf8.decode(encoded);
console.log(decoded);
