Convert UTF-8 data into the proper string format - javascript

If I receive a UTF-8 string via a socket (or, for that matter, via any external source), I would like to get it as a properly parsed string object. The following code shows what I mean:
var str='21\r\nJust a demo string \xC3\xA4\xC3\xA8-should not be anymore parsed';
// Find CRLF
var i=str.indexOf('\r\n');
// Parse size up until CRLF
var x=parseInt(str.slice(0, i));
// Read size bytes
var s=str.substr(i+2, x)
console.log(s);
This code should print
Just a demo string äè
but as the UTF-8 data is not properly parsed it only parses it up to the first Unicode character
Just a demo string ä
Would anyone have an idea how to convert this properly?

It seems you could use decodeURIComponent(escape(str)) for this (note that escape() is deprecated, although browsers still support it):
var badstr='21\r\nJust a demo string \xC3\xA4\xC3\xA8-should not be anymore parsed';
var str=decodeURIComponent(escape(badstr));
// Find CRLF
var i=str.indexOf('\r\n');
// Parse size up until CRLF
var x=parseInt(str.slice(0, i));
// Read size bytes
var s=str.substr(i+2, x)
console.log(s);
BTW, this kind of issue occurs when you mix UTF-8 with other encodings. You should check for that as well.
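An alternative that avoids escape() is to turn the string back into bytes and decode those as UTF-8. The following is a minimal sketch, assuming every character of the incoming string holds exactly one byte (as in the demo string) and that the length prefix counts characters rather than bytes, which is what the expected output in the question implies. The helper names decodeBinaryString and readFrame are made up for this example.

```javascript
// Assumption: each character of `raw` carries one byte (a "binary string").
function decodeBinaryString(raw) {
  const bytes = Uint8Array.from(raw, (ch) => ch.charCodeAt(0) & 0xff);
  return new TextDecoder("utf-8").decode(bytes);
}

// Hypothetical helper: reads "<size>\r\n<payload>" and returns `size` characters.
function readFrame(raw) {
  const text = decodeBinaryString(raw);
  const crlf = text.indexOf("\r\n");
  const size = parseInt(text.slice(0, crlf), 10);
  const start = crlf + 2;
  return text.slice(start, start + size);
}

const demo = readFrame(
  "21\r\nJust a demo string \xC3\xA4\xC3\xA8-should not be anymore parsed"
);
console.log(demo); // Just a demo string äè
```

If the prefix actually counts bytes, the slicing would have to happen on the byte array before decoding instead.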

You should use utf8.js which is available on npm.
var utf8 = require('utf8');
var encoded = '21\r\nJust a demo string \xC3\xA4\xC3\xA8-foo bar baz';
var decoded = utf8.decode(encoded);
console.log(decoded);

Related

InputStream Encoding with Special Characters

Apologies, I'm not a JS developer and this is the first time I've worked with InputStream.
In the InputStream, I am processing one line of delimited text at a time that will always contain a character that is not UTF-8. My goal is to parse the InputStream to a string, split it by the delimiter, and read a certain value that is UTF-8 at an index.
The line will always be tab delimited, and will always contain the same number of delimiters. I might see something like this (two separate lines):
stuff morestuff 0.00 A ç F00012049333302129FF
stuff2 morestuff2 B è F00012205229521042CB
In my code, the value at the index position always seems to leave my variable undefined, and I'm assuming it's from the UTF-8 encoding in the toString method. My assumption is that the encoding is turning the non UTF-8 character into something that messes up the split function, but I'm not sure what or how. Here's some test code:
var InputStreamCallback = Java.type("org.apache.nifi.processor.io.InputStreamCallback");
var IOUtils = Java.type("org.apache.commons.io.IOUtils");
var StandardCharsets = Java.type("java.nio.charset.StandardCharsets");
var flowFile = session.get();
var index = 5;
session.read(flowFile,
new InputStreamCallback(function(inputStream) {
// Convert the single line of the flowfile into a UTF_8 encoded string
var line = IOUtils.toString(inputStream, StandardCharsets.UTF_8);
// Split the delimited string into an array
var dataArray = line.split('\t');
// Capture the required value at the defined index position
var capturedValue = dataArray[index];
}));
if (typeof capturedValue === 'undefined') {
// log an error
}
else {
// do what it's supposed to do
}
I'm hoping someone could explain what exactly is happening, and help me find a solution that will allow me to look up the correct value at my predetermined index position.
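One thing worth checking independently of the encoding: in the test code, capturedValue is declared with var inside the callback function, so it is local to that function and the typeof check outside the callback always sees 'undefined'. The sketch below shows the scoping fix in plain JavaScript. readLine is a hypothetical stand-in for session.read() that simply invokes the callback, and the sample line is the first one from the question with its tabs restored (an assumption about the real data).

```javascript
// Hypothetical stand-in for session.read(): it just hands the line to the callback.
function readLine(line, callback) {
  callback(line);
}

var index = 5;
var capturedValue; // declared outside the callback so the later check can see it

readLine("stuff\tmorestuff\t0.00\tA\t\u00e7\tF00012049333302129FF", function (line) {
  var dataArray = line.split("\t");
  capturedValue = dataArray[index]; // assign, do not redeclare with var
});

if (typeof capturedValue === "undefined") {
  console.log("no value captured");
} else {
  console.log(capturedValue); // F00012049333302129FF
}
```

This does not settle the encoding question; it only rules out the scoping problem as the reason the variable is undefined.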

Javascript JSON parse problem with decoded POST request data

I get base64 data as an answer to a POST request.
It's decoded the following way (based on the documentation of the REST API):
let buf = Buffer.from(base64, "base64");
buf = buf.slice(4);
let data = gunzipSync(buf).toString()
console.log(data) // -> {"Code":200,"Value":"8e286fdb-aad2-43c6-87b1-1c6c0d21808a","Route":""}
console.log(data.length) // -> 140 -> Seems weird? Shouldn't it be 70?
Problem:
console.log(JSON.parse(data)) // -> SyntaxError: Unexpected token in JSON at position 1
I tried to delete all white characters via replace(/\s/g,''), tried decoding with toString("utf8"), etc.
Nothing helps. The only thing that could help is the weird wrong length described above.
Your buffer is UTF-16 encoded and contains \0 bytes, like {·"·C·o·d·e·"·:·… (with · representing \0), which is why it is double the expected length. The \0 bytes don't print when you output the data with console.log(), which is why the output seems to be correct.
Decode the buffer as UTF-16 before JSON-parsing it.
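// Same steps as in the question (skip 4 bytes, gunzip), but decode the result as UTF-16LE
var buffer = Buffer.from(base64, "base64").slice(4);
var str = gunzipSync(buffer).toString('utf16le');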
console.log(str) // -> {"Code":200,"Value":"8e286fdb-aad2-43c6-87b1-1c6c0d21808a","Route":""}
console.log(str.length) // -> 70
console.log(JSON.parse(str)) // -> { Code: 200, Value: '8e286fdb-aad2-43c6-87b1-1c6c0d21808a', Route: '' }
In general, never work with buffers as if they were strings. Buffers are always encoded in some way, that is their fundamental, defining difference from strings. You must decode them before outputting their contents as text.
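To make the effect reproducible without the REST API, here is a small round-trip sketch. It builds a payload in the layout the question describes (4 leading bytes, then gzip-compressed UTF-16LE JSON, all base64-encoded) and decodes it both ways. The payload contents and the decodePayload helper are assumptions made up for the example.

```javascript
const { gzipSync, gunzipSync } = require("zlib");

// Sample payload in the assumed layout: 4 header bytes + gzip(UTF-16LE JSON), base64-encoded.
const json = '{"Code":200,"Value":"demo","Route":""}';
const payload = Buffer.concat([
  Buffer.alloc(4),
  gzipSync(Buffer.from(json, "utf16le")),
]).toString("base64");

// Hypothetical helper: undo the layout above and decode the bytes as UTF-16LE.
function decodePayload(base64) {
  const compressed = Buffer.from(base64, "base64").subarray(4);
  return gunzipSync(compressed).toString("utf16le");
}

// Decoding as UTF-8 (the default) keeps every \0 byte as a character.
const naive = gunzipSync(Buffer.from(payload, "base64").subarray(4)).toString();
const text = decodePayload(payload);

console.log(naive.length, text.length); // the naive string is twice as long
console.log(JSON.parse(text).Code);
```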

Decoding not working with Base64

Encoding my URL works perfectly with base-64 encoding. So does decoding but not with the string literal variable.
This works:
document.write(atob("hi"));
This does not:
var tempvar = "hello";
document.write(atob(tempvar));
What am I doing wrong? Nothing is displayed. But if I quote "tempvar", then it of course works but is not the same thing since "tempvar" is a string, not a variable.
Your Question
What am I doing wrong?
The string being passed to atob() is a string literal of length 5 (and not technically a base-64 encoded string). The browser console should reveal an exception in the error log (see explanation in The cause below).
The cause
Per the MDN documentation of atob():
Throws
Throws a DOMException if the length of passed-in string is not a multiple of 4. [1]
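The length of the string literal "hello" (i.e. 5) is not a multiple of 4. Thus the exception is thrown instead of returning the decoded version of the string literal. More precisely, the forgiving base64 decoding that browsers implement rejects input whose length leaves a remainder of 1 when divided by 4, which is why the two-character "hi" still decodes while "hello" does not.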
A Solution
One solution is to either use a string that has actually been encoded (e.g. with btoa()) or at least has a length of four (e.g. using String.prototype.substring()). See the snippet below for an example.
var tempvar = "hello";
window.addEventListener("DOMContentLoaded", function(readyEvent) {
var container = document.getElementById("container");
//encode the string
var encoded = btoa(tempvar);
container.innerHTML = encoded;
var container2 = document.getElementById("container2");
//decode the encoded string
container2.innerHTML = atob(encoded);
var container3 = document.getElementById("container3");
//decode the first 4 characters of the string
container3.innerHTML = atob(tempvar.substring(0, 4));
});
<div> btoa(tempvar): <span id="container"></span></div>
<div> atob(encoded): <span id="container2"></span></div>
<div> atob(tempvar.substring(0, 4)): <span id="container3"></span></div>
[1] https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/atob
It's because atob() can't decode the string "hello". Try an actual string that can be decoded from base64; here is an example:
var tempvar = "aHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL3F1ZXN0aW9ucy80MzEyOTEzNi9kZWNvZGluZy1ub3Qtd29ya2luZy13aXRoLWJhc2U2NA==";
document.write(atob(tempvar));
If you want to encode, use the btoa function instead:
var tempvar = "hello";
document.write(btoa(tempvar));
You can use this website to test decoding and encoding base64, https://www.base64encode.org/
It's because you are trying to decode a string that is not base64-encoded.
That it works on "hi" seems to be just a coincidence.
atob = decode
btoa = encode
You're using the wrong function. You should use btoa() to encode.
When you do atob('hi'), you're actually decoding 'hi', which happens to be valid base-64.
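Since atob() throws on input it cannot decode, it can help to wrap the call when the input is not known to be valid. A minimal sketch follows; tryAtob is a hypothetical helper name, and the null result for "hello" assumes an environment whose atob() follows the HTML specification.

```javascript
// Hypothetical helper: returns the decoded text, or null when atob() rejects the input.
function tryAtob(input) {
  try {
    return atob(input);
  } catch (err) {
    return null;
  }
}

var tempvar = "hello";
var encoded = btoa(tempvar);

console.log(encoded);          // aGVsbG8=
console.log(tryAtob(encoded)); // hello
console.log(tryAtob(tempvar)); // null, because "hello" is not decodable base64
```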

Javascript convert windows-1252 encoding to UTF-8

How do I convert the below string:
var string = "Bouchard+P%E8re+et+Fils"
using javascript into UTF-8, so that %E8 would become %C3%A8?
Reason is this character seems to be tripping up decodeURIComponent
You can test it out by dropping the string into http://meyerweb.com/eric/tools/dencoder/ and seeing the console error that says Uncaught URIError: URI malformed
I'm looking specifically for something that can decode an entire html document, that claims to be windows-1252 encoded which is where I assume this %E8 character is coming from, into UTF-8.
Thanks!
First create a map of Windows-1252. You can find references to the encoding using your search engine of choice.
For the sake of this example, I'm going to include only the character in your sample data.
Then find all the percentage signs followed by two hexadecimal characters, convert them to numbers, and convert them using the map (to get raw data), then convert them again using encodeURIComponent (to get the encoded data).
var string = "Bouchard+P%E8re+et+Fils"
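// Sparse map from Windows-1252 byte values to characters (only è is filled in here)
var w1252chars = [];
w1252chars[232] = "è";
var percent_encoded = /(%[a-fA-F0-9]{2})/g;
function filter(match, group) {
var number = parseInt(group.substr(1), 16);
var character = w1252chars[number];
// Leave escapes that are missing from the map untouched instead of emitting "undefined"
if (character === undefined) return match;
return encodeURIComponent(character);
}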
string = string.replace(percent_encoded, filter);
alert(string);
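The map-based approach above can be extended to the whole code page without typing out 256 entries, because Windows-1252 matches Unicode code points for bytes 0xA0-0xFF and only needs a table for 0x80-0x9F. The sketch below assumes that every percent escape in the input is a Windows-1252 byte (escapes that are already UTF-8 would be converted a second time). The names C1_RANGE and recodeWindows1252Escapes are made up for this example.

```javascript
// Characters for bytes 0x80-0x9F, where Windows-1252 differs from Latin-1.
// Unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) become U+FFFD; that is a choice made for this sketch.
var C1_RANGE =
  "\u20AC\uFFFD\u201A\u0192\u201E\u2026\u2020\u2021" +
  "\u02C6\u2030\u0160\u2039\u0152\uFFFD\u017D\uFFFD" +
  "\uFFFD\u2018\u2019\u201C\u201D\u2022\u2013\u2014" +
  "\u02DC\u2122\u0161\u203A\u0153\uFFFD\u017E\u0178";

function recodeWindows1252Escapes(input) {
  return input.replace(/%([0-9a-fA-F]{2})/g, function (match, hex) {
    var byte = parseInt(hex, 16);
    if (byte < 0x80) return match; // ASCII escapes are already valid UTF-8
    var ch = byte < 0xA0 ? C1_RANGE.charAt(byte - 0x80) : String.fromCharCode(byte);
    return encodeURIComponent(ch);
  });
}

var recoded = recodeWindows1252Escapes("Bouchard+P%E8re+et+Fils");
console.log(recoded); // Bouchard+P%C3%A8re+et+Fils
console.log(decodeURIComponent(recoded.replace(/\+/g, " "))); // Bouchard Père et Fils
```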

Convert string to whitespace

I'm looking for a way to convert a string into whitespace (spaces, newlines and tabs) and the other way around.
I found a Python script, but I have no idea how to do it using Javascript.
I need it for a white-hacking contest.
I can has banana? ;)
var ws={x:'0123',y:' \t\r\n',a:/[\w\W]/g,b:/[\w\W]{8}/g,c:function(z){return(
ws.y+ws.x)[(ws.x+ws.y).indexOf(z)]},e:function(s){return(65536+s.charCodeAt(0)
).toString(4).substr(1).replace(ws.a,ws.c)},d:function(s){return String.
fromCharCode(parseInt(s.replace(ws.a,ws.c),4))},encode:function(s){return s.
replace(ws.a,ws.e)},decode:function(s){return s.replace(ws.b,ws.d)}};
// test string
var s1 = 'test0123456789AZaz€åäöÅÄÖ';
// show test string
alert(s1);
// encode test string
var code = ws.encode(s1);
// show encoded string
alert('"'+code+'"');
// decode string
var s2 = ws.decode(code);
// show decoded string
alert(s2);
// verify that the strings are completely identical
alert(s1 === s2);
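The compact object above is hard to follow, so here is a more readable sketch of what appears to be the same scheme: each UTF-16 code unit is written as 8 base-4 digits, and the digits 0, 1, 2, 3 are replaced by space, tab, carriage return and newline. The function names encodeWhitespace and decodeWhitespace are made up for this example.

```javascript
// Digit-to-whitespace table: 0 -> space, 1 -> tab, 2 -> CR, 3 -> LF.
const WHITESPACE = [" ", "\t", "\r", "\n"];

function encodeWhitespace(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    // A code unit is at most 65535, which always fits in 8 base-4 digits.
    const digits = text.charCodeAt(i).toString(4).padStart(8, "0");
    for (const d of digits) out += WHITESPACE[Number(d)];
  }
  return out;
}

function decodeWhitespace(encoded) {
  let out = "";
  for (let i = 0; i < encoded.length; i += 8) {
    let code = 0;
    for (const ch of encoded.slice(i, i + 8)) {
      code = code * 4 + WHITESPACE.indexOf(ch);
    }
    out += String.fromCharCode(code);
  }
  return out;
}

const sample = "test0123456789AZaz€åäöÅÄÖ";
const hidden = encodeWhitespace(sample);
console.log(JSON.stringify(hidden.slice(0, 8)));
console.log(decodeWhitespace(hidden) === sample); // true
```

Fixed-width groups of 8 keep decoding simple, at the cost of an output that is eight times as long as the input.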
