How to recover from decodeURI(encodeURIComponent(originalString))?

I have some legacy code (pre-React) that encodes part of a URL with encodeURIComponent before calling history.push() from the https://www.npmjs.com/package/history module to navigate to that URL.
history.push() inside history.js then uses decodeURI to partially decode the entire URL (decodeURI only decodes the same characters that encodeURI encodes).
This partially decoded location.pathname ends up in ReactRouter, where useParams() gives me the partially decoded URL component back.
Now I'm stuck with a partially decoded URL component that I cannot use. I need it fully decoded.
I can't use decodeURIComponent on the partially decoded string, because the original string might contain a %. In that case the % has already been decoded in the partially decoded string, and decodeURIComponent would crash with an Uncaught URIError: URI malformed.
My options seem to be:
use unescape to fully decode the partially decoded string (it doesn't complain about the single %), even though its use is discouraged (why?)
manually re-encode any % (that isn't followed by a digit and a subsequent hex character) back to %25 and then run the result through decodeURIComponent
Are there any less ugly solutions that I haven't thought of yet?
EDIT: I was asked for examples of what I meant by a partially decoded string:
const original = 'A-Za-z0-9;,/?:@&=+$-_.!~*()#%';
const encodedURIComponent = encodeURIComponent(original); // "A-Za-z0-9%3B%2C%2F%3F%3A%40%26%3D%2B%24-_.!~*()%23%25"
console.log(decodeURIComponent(encodedURIComponent)); // "A-Za-z0-9;,/?:@&=+$-_.!~*()#%"
const partiallyUnescaped = decodeURI(encodedURIComponent); // "A-Za-z0-9%3B%2C%2F%3F%3A%40%26%3D%2B%24-_.!~*()%23%" - notice the '%25' at the end was decoded back to '%'
console.log(unescape(partiallyUnescaped)); // "A-Za-z0-9;,/?:@&=+$-_.!~*()#%"
//console.log(decodeURIComponent(partiallyUnescaped)); // Uncaught URIError: URI malformed
EDIT 2: In case it can be of any help, here's a more realistic example of some of the characters our URL might contain, but because it's user generated, it could be anything really:
console.log(encodeURIComponent('abcd+%;- efgh')); // "abcd%2B%25%3B-%20efgh"
console.log(decodeURI(encodeURIComponent('abcd+%;- efgh'))); // "abcd%2B%%3B- efgh"
//console.log(decodeURIComponent(decodeURI(encodeURIComponent('abcd+%;- efgh')))); // Error: URI malformed
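
One further option, sketched here rather than offered as a definitive fix, is to finish only the work that decodeURI skipped. decodeURI leaves exactly the escapes for the reserved characters ; / ? : @ & = + $ , # in place and decodes everything else, so a targeted replace of those eleven escapes never trips over a stray %. The helper name finishDecoding is made up for this example. One caveat applies to any approach: if the original string literally contained something like %3B, it is indistinguishable from an encoded ; after the round trip, so the recovery cannot be fully lossless.

```javascript
// Escapes that decodeURI leaves untouched: ; / ? : @ & = + $ , #
// encodeURIComponent always emits uppercase hex, so only uppercase is matched.
const LEFTOVER_ESCAPES = /%(3B|2F|3F|3A|40|26|3D|2B|24|2C|23)/g;

// Hypothetical helper: finishes decoding a string that has been through
// decodeURI(encodeURIComponent(original)).
function finishDecoding(partiallyDecoded) {
  return partiallyDecoded.replace(LEFTOVER_ESCAPES, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

const original = 'abcd+%;- efgh';
const partiallyDecoded = decodeURI(encodeURIComponent(original));
console.log(finishDecoding(partiallyDecoded) === original); // true
```

Unlike re-encoding stray % signs and calling decodeURIComponent, this does not throw when the original contained a % followed by two hex digits that do not form valid UTF-8 (for example 100%AB).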

Related

Why are j: or s: added when setting a cookie

I set a cookie on the server and want to do some work with it in client-side JS.
When I decode the cookie value with decodeURIComponent(), some unwanted characters (j:) appear at the start of the decoded value. As a quick fix I strip the first few characters of the cookie string before decoding it and parsing the result as JSON.
I would like to know why j: was added to the value and how to deal with it.
Also: the cookie looks right, i.e. nothing is prepended, when I view the decoded value in DevTools.
This is how I set the cookie, named note, from a JS object:
res.cookie('note', note, { maxAge: 1000*60*60*24 });
This is how I decode the cookie and convert the JSON returned by decodeURIComponent into the JS object I want to use:
let decoded = decodeURIComponent(document.cookie.substring(9));
decoded = JSON.parse(decoded);
I tried encoding my object with encodeURIComponent myself, but it seems it gets encoded automatically.
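
Assuming the cookie is set with Express's res.cookie, as the snippet suggests, the prefix comes from Express itself: when the value is an object, res.cookie stores it as 'j:' followed by JSON.stringify(value), and signed cookies get an s: prefix. A rough sketch of reading such a cookie by name instead of by a fixed offset follows; readJsonCookie is a hypothetical helper, and the cookie string is passed in so the function can be tried outside a browser.

```javascript
// Hypothetical helper: finds a cookie by name in a "k=v; k2=v2" string,
// strips Express's "j:" JSON marker if present, and parses the rest.
function readJsonCookie(cookieString, name) {
  for (const pair of cookieString.split('; ')) {
    const eq = pair.indexOf('=');
    if (eq === -1 || pair.slice(0, eq) !== name) continue;
    let value = decodeURIComponent(pair.slice(eq + 1));
    if (value.startsWith('j:')) value = value.slice(2);
    return JSON.parse(value);
  }
  return undefined; // no cookie with that name
}

// Simulates what the browser would expose in document.cookie.
const sampleCookie = 'note=' + encodeURIComponent('j:' + JSON.stringify({ title: 'todo' }));
console.log(readJsonCookie(sampleCookie, 'note').title); // "todo"
```

In the browser the call would be readJsonCookie(document.cookie, 'note'). Storing JSON.stringify(note) as a plain string on the server side would avoid the prefix altogether.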

base64 encoding in javascript decoding in php

I am trying to encode a string in JavaScript and decode it in PHP.
I use this code to put the string in an input box and then send it via form PUT.
document.getElementById('signature').value= b64EncodeUnicode(ab2str(signature));
And this code to decode
$signature=base64_decode($signature);
Here is a jsfiddle for the encoding page:
https://jsfiddle.net/okaea662/
The problem is that the decoded string is always about 98% correct, but a few characters differ.
For example (the first string is the one shown in the input box):
¦S÷ä½m0×C|u>£áWÅàUù»¥ïs7Dþ1Ji%ýÊ{\ö°(úýýÁñxçO9Ù¡ö}XÇIWçÎ²Æü8ú²ðÑOA¤nì6S+Ì½ i¼?¼ºNËÒo·a©8»eO|PPþBE=HèÑqaX©$Ìç£°©b2(Ðç.$nÈR,ä_OX¾xè¥3éÂòkå¾ N,sáW§ÝáV:ö~Å×à<4)íÇKo¡L¤<Í»äA(!xón#WÙÕGù¾g!)ùC)]Q(*}?Ìp
¦S÷ ä½m0×C|u>£áWÅàUù»¥ïs7Dþ1Ji%ýÊ{\ö°(úýýÁñxçO9Ù¡ö}XÇIWçÎ²Æü8ú²ðÑOA¤nì6S+Ì½ i¼?¼ºNËÒo·a©8»eO|PPþBE=HèÑ qaX©$Ìç£°©b2(Ðç.$nÈR,ä_OX¾xè¥3éÂòkå¾ N ,sá W§ÝáV:ö~Å×à<4)íÇKo¡L¤<Í»äA(!xón#WÙÕGù¾g!)ùC)]Q(*}?Ìp
Note that the 4th character differs, and there are one or two more differences further on.
The string is a digital signature, so these differing characters make the signature invalid.
I have no idea what is happening here. Any ideas? I use Chrome, with UTF-8 encoding in the header and meta tags (Firefox seems to use a different encoding in the input box, but I will look at that problem later).
EDIT:
The base64 encoding is apparently not the problem. The base64-encoded string is the same in the browser as on the server. If I base64-decode it in JavaScript I get the original string, but if I decode it in PHP I get a slightly different string.
EDIT2:
I still don't know what the problem is, but I have avoided it by sending the data as a blob via ajax.

Try using this to encode your string in JS:
var signature = document.getElementById('signature').value;
var base64 = window.btoa(signature);
Then in PHP you simply use: base64_decode($signature)
If that doesn't work (I haven't tested it), there may be something wrong with the btoa function. So check out this link:
https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding
It has a function that should work if the above does not:
function b64EncodeUnicode(str) {
    // Percent-encode to UTF-8, then turn each %XX escape into the raw byte it stands for
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function(match, p1) {
        return String.fromCharCode('0x' + p1);
    }));
}
b64EncodeUnicode('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
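
If the signature starts out as an ArrayBuffer (the ab2str call in the question hints at this, but it is an assumption), one thing worth trying is to base64-encode the raw bytes directly instead of first turning them into a string. That way no character encoding is involved and base64_decode in PHP should hand back the same bytes. A minimal sketch, with bytesToBase64 as a made-up name:

```javascript
// Hypothetical helper: base64-encodes the bytes of an ArrayBuffer as-is.
function bytesToBase64(buffer) {
  const bytes = new Uint8Array(buffer);
  let binary = '';
  for (const byte of bytes) {
    binary += String.fromCharCode(byte); // one char per byte, code 0-255
  }
  return btoa(binary); // safe: every char is within the Latin-1 range
}

// Example with a few arbitrary high bytes standing in for a signature.
const sampleSignature = new Uint8Array([0xa6, 0x53, 0xf7, 0xe4]).buffer;
console.log(bytesToBase64(sampleSignature));
```

By contrast, b64EncodeUnicode encodes the UTF-8 form of the string, so any byte above 0x7F becomes two bytes after decoding on the server.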

Decode parameters encoded using the JS encodeURIComponent function

I have an application that takes text entered by a user and passes it to the server as part of a URL so that an image containing the text can be rendered. The URL parameter is encoded using encodeURIComponent function.
The problem I have is that if the user enters text containing + or foreign characters I cannot get the string decoded correctly server side.
For example, if the string is "François + Anna",
the encoded URL is previewImage.ashx?id=1&text=Fran%25E7ois%2520%2B%2520Anna
On the server,
Uri.UnescapeDataString( Context.Request.QueryString["text"] )
throws an "Invalid URI: There is an invalid sequence in the string." exception. If I replace the extended character in the string, it is decoded as "Francois + Anna".
However, if I use
HttpUtility.UrlDecode(
Context.Request.QueryString["text"], System.Text.UTF8Encoding.UTF7 )
the foreign characters are decoded correctly but the encoded + is changed to a space; "François Anna".

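The URL wasn't encoded correctly to begin with. previewImage.ashx?id=1&text=Fran%25E7ois%2520%2B%2520Anna is not the correct URL encoding of François + Anna: every % has itself been encoded as %25, which means the text was encoded twice.
I believe the correct encoding should have been previewImage.ashx?id=1&text=Fran%C3%A7ois+%2B+Anna or previewImage.ashx?id=1&text=Fran%C3%A7ois%20%2B%20Anna (encodeURIComponent emits UTF-8, so ç becomes %C3%A7 rather than %E7).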
Once the encoding has been fixed, then you should be able to retrieve the result via a simple Context.Request.QueryString["text"] call. No need to do anything special.
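
The question does not show the client code, so the following is only a guess at how the URL came about: the observed value is exactly what results from running the text through the legacy escape() function and then through encodeURIComponent. The sketch contrasts that with a single encodeURIComponent call.

```javascript
const text = 'François + Anna';

// encodeURIComponent on its own: UTF-8 bytes, percent-encoded once.
const encodedOnce = encodeURIComponent(text);
console.log(encodedOnce); // "Fran%C3%A7ois%20%2B%20Anna"

// Assumed client behaviour: escape() first, then encodeURIComponent.
// escape() emits Latin-1 style %E7 and leaves + alone; the second pass
// then turns every % into %25 and the + into %2B.
const encodedTwice = encodeURIComponent(escape(text));
console.log(encodedTwice); // "Fran%25E7ois%2520%2B%2520Anna"

// A single decode recovers the text only from the single encoding.
console.log(decodeURIComponent(encodedOnce) === text); // true
```

If that guess is right, dropping the escape() call on the client should be enough for Context.Request.QueryString["text"] to return the text as typed.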

Detect whether JavaScript string has been encoded using encodeURIComponent

I'm working to integrate some code with a third party, and a string argument they pass to a JavaScript function I'm writing will sometimes be encoded using encodeURIComponent and sometimes not.
Is there a definitive way to check whether it's been encoded using encodeURIComponent?
If it hasn't, I'll do the encoding myself.

You could decode it and see if the string is still the same; if it is, it was not encoded:
decodeURIComponent(string) === string

Not reliably, especially in the case where a string may be encoded twice:
encodeURIComponent('http://stackoverflow.com/')
// yields 'http%3A%2F%2Fstackoverflow.com%2F'
encodeURIComponent(encodeURIComponent('http://stackoverflow.com/'))
// yields 'http%253A%252F%252Fstackoverflow.com%252F'
In essence, if you were to try and detect the string encoding when the passed argument is not actually encoded but has qualities of an encoded string, you'd be decoding something you shouldn't.
I'd recommend adding a second parameter in the definition "isURIComponent".
However, if you wanted to attempt, perhaps the following would do the trick:
if ( str.match(/[_\.!~*'()-]/) && str.match(/%[0-9a-f]{2}/i) ) {
    // probably encoded with encodeURIComponent
}
This tests that the non alphanumeric characters that don't get encoded are intact, and that hexadecimals exist (e.g. %20 for a space)
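
A slightly stricter heuristic than either check above can be sketched as follows: decode inside a try/catch (decodeURIComponent throws on malformed input such as a lone %), and then confirm that re-encoding the decoded text reproduces the input exactly. The name looksEncoded is made up, and this remains a guess: a string that legitimately contains text such as a%20b will still be reported as encoded.

```javascript
// Hypothetical heuristic: true if str is exactly what encodeURIComponent
// would produce for some other string, i.e. it contains at least one escape
// and survives a decode/encode round trip unchanged.
function looksEncoded(str) {
  let decoded;
  try {
    decoded = decodeURIComponent(str);
  } catch (e) {
    return false; // malformed escapes cannot come from encodeURIComponent
  }
  return decoded !== str && encodeURIComponent(decoded) === str;
}

console.log(looksEncoded('http%3A%2F%2Fstackoverflow.com%2F')); // true
console.log(looksEncoded('http://stackoverflow.com/'));         // false
console.log(looksEncoded('100%'));                              // false
```

An explicit flag from the caller, as recommended above, is still the only definitive answer.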

Handling unicode in the http response xml

I'm writing a Google Chrome extension that builds upon the myanimelist.net REST API. Sometimes the XMLHttpRequest response text contains Unicode.
For example:
<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>
If I create a HTML node from the text it looks like this:
Onegai My Melody Sukkiriâ�ª
The actual title, however, is this:
Onegai My Melody Sukkiri♪
Why is my text not correctly rendered and how can I fix it?
Update
Code: background.html
I think these are the crucial parts:
function htmlDecode(input){
    var e = document.createElement('div');
    e.innerHTML = input;
    return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}
function xmlDecode(input){
    var result = input;
    result = result.replace(/</g, "&lt;");
    result = result.replace(/>/g, "&gt;");
    result = result.replace(/\n/g, "&#10;");
    return htmlDecode(result);
}
Further:
var parser = new DOMParser();
var xmlText = response.value;
var doc = parser.parseFromString(xmlDecode(xmlText), "text/xml");

<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>
Oh dear! Not only is that the wrong text, it's not even well-formed XML. acirc and ordf are HTML entities which are not predefined in XML, and then there's an invalid UTF-8 sequence (one high byte, presumably originally 0x99) between them.
The problem is that myanimelist are generating their output ‘XML’ (but “if it ain't well-formed, it ain't XML”) using the PHP function htmlentities(). This tries to HTML-escape not only the potentially-sensitive-in-HTML characters <&"', but also all non-ASCII characters.
This generates the wrong characters because PHP defaults to treating the input to htmlentities() as ISO-8859-1 instead of UTF-8 which is the encoding they're actually using. But it was the wrong thing to begin with because the HTML entity set doesn't exist in XML. What they really wanted to use was htmlspecialchars(), which leaves the non-ASCII characters alone, only escaping the really sensitive ones. Because those are the same ones that are sensitive in XML, htmlspecialchars() works just as well for XML as HTML.
htmlentities() is almost always the Wrong Thing; htmlspecialchars() should typically be used instead. The one place you might want to encode non-ASCII bytes to entity references would be when you're targeting pure ASCII output. But even then htmlentities() fails because it doesn't make character references (&#...;) for the characters that don't have a predefined entity name. Pretty useless.
Anyway, you can't really recover the mangled data from this. The � represents a byte sequence that was UTF-8-undecodable to the XMLHttpRequest, so that information is irretrievably lost. You will have to persuade myanimelist to fix their broken XML output as per the above couple of paragraphs before you can go any further.
Also they should be returning it as Content-Type: text/xml not text/html as at the moment. Then you could pick up the responseXML directly from the XMLHttpRequest object instead of messing about with DOMParsers.
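
The mangling described above can be reproduced in a few lines. This is only an illustration in JavaScript of the byte-level effect, not the site's actual PHP code: the three UTF-8 bytes of ♪ are each read as a separate ISO-8859-1 character, which gives â (the acirc entity), an unprintable 0x99 control character, and ª (the ordf entity).

```javascript
// UTF-8 bytes of the intended character.
const utf8Bytes = Array.from(new TextEncoder().encode('♪'));
console.log(utf8Bytes.map((b) => b.toString(16))); // [ 'e2', '99', 'aa' ]

// Misreading each byte as one ISO-8859-1 character (code point = byte value).
const misread = utf8Bytes.map((b) => String.fromCharCode(b)).join('');
console.log(misread[0], misread[2]);             // â ª
console.log(misread.charCodeAt(1).toString(16)); // 99
```

The middle byte has no named HTML entity, so it was left as a raw 0x99 byte in the output, which is why it reaches the browser as an invalid UTF-8 sequence.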

So, I've come across something similar to what's going on here at work, and I did a bit more research to confirm my hypothesis.
If you take a look at the returned value you posted above, you'll notice the telltale entity "&acirc;". 99% of the time when you see this entity, it means you have a character encoding issue (typically UTF-8 characters are being interpreted as ISO-8859-1).
The first thing I would test for is to force a character encoding in the API return. (It's a long shot, but you could look)
Second, I'd try to force a character encoding onto the data returned (I know there's a .htaccess override, but I don't know what's allowed in Chrome extensions so you'll have to research that).
What I believe is going on is that when you create the node with the data, you don't have a character encoding set on the document, and browsers (typically, in my experience) default to ISO-8859-1. So, check to make sure it's not your document that's the problem.
Finally, if you can't find (or can't prevent) the source of the character encoding problem, you'll have to write a conversion table to replace the malformed values you're getting with the ones you want { JS' "replace" should be fine (http://www.w3schools.com/jsref/jsref_replace.asp) }.

You can't just use a simple search and replace to fix encoding issues, since the characters are Unicode, not characters typed on a keyboard.
Your data must be stored on the server in UTF-8 format if you are planning on retrieving it via AJAX. This problem is probably due to someone pasting in characters from MS-Word which use a completely different encoding scheme (ISO-8859).
If you can't fix the data, you're kinda screwed.
For more details, see: UTF-8 vs. Unicode
