Invisible Character Encoding - javascript

Could you use unicode whitespace characters to encode information in browser pages?
I'm thinking of characters that display as whitespace but are represented by different Unicode code points. It seems like this could be used to identify the page displayed to a particular user without making it obvious (e.g., for copyright or tracking purposes).
Are there guides to doing this out there already?
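The technique being described could be sketched like this (a minimal illustration; the zero-width code points chosen and the function names are assumptions, not an established scheme):

    // Illustrative bit-per-character scheme using two zero-width characters;
    // the specific code points and function names here are assumptions.
    const ZERO = '\u200B'; // ZERO WIDTH SPACE      -> bit 0
    const ONE  = '\u200C'; // ZERO WIDTH NON-JOINER -> bit 1

    function embedId(text, id) {
      // Append the id's bits as an invisible run of characters
      const bits = id.toString(2);
      const hidden = [...bits].map(b => (b === '1' ? ONE : ZERO)).join('');
      return text + hidden;
    }

    function extractId(text) {
      // Scan for the zero-width characters and reassemble the bits
      const bits = [...text]
        .filter(c => c === ZERO || c === ONE)
        .map(c => (c === ONE ? '1' : '0'))
        .join('');
      return bits ? parseInt(bits, 2) : null;
    }

    const marked = embedId('This text looks perfectly ordinary.', 42);
    console.log(marked.length);      // longer than the visible text
    console.log(extractId(marked));  // 42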

Related

URL encoding inside browsers

I was reading about URL encoding inside the browser. I know it is done to send URLs over the internet. But when I see this encoding in the address bar, it seems that only some characters are encoded. However, when I copy and paste the URL into Notepad, all special characters seem to be encoded. Is what we see in the address bar fake? Are all characters actually encoded? Do browsers use the JavaScript function encodeURI() for this purpose?
Yes, what the address bar shows is often not the actual URL. Most browsers decode encoded characters to display a friendlier-looking URL and to let you read the URL as actual characters, which may or may not be useful.
However, some characters look extremely similar to other characters, like the Cyrillic letter А (U+0410), which is virtually indistinguishable from the Latin letter A (U+0041). To avoid confusion and/or phishing attacks, letters that are easily confused may not be decoded.
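A quick console check shows the difference (the example.com URL is just a placeholder):

    // The two As look identical on screen but encode very differently:
    console.log(encodeURI('https://example.com/A')); // Latin A: unchanged
    console.log(encodeURI('https://example.com/А')); // Cyrillic А: ".../%D0%90"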

When is it necessary to escape a JavaScript bookmarklet?

Bookmarklets I've made seem to always work well even though I don't escape them, so why do bookmarklet builders like John Gruber's escape special characters like space to %20?
If you copy/paste code into a bookmark using the browser's bookmark properties editor, you don't need to encode anything except newlines and percent signs. In fact, Firefox (and maybe other browsers) will encode spaces for you: if you check the properties again after saving, you'll see them encoded. You need to encode newlines because a bookmark can be only one line, and you need to encode percent signs because the browser expects a percent sign to always start a percent-encoded sequence. If you put the bookmarklet on a website in the href attribute of an <a> tag, the same rules apply, but you must also either HTML-encode or URL-encode any double quotes.
A lot of bookmarklet makers will just use encodeURIComponent, which encodes more than what is strictly necessary, but it might be considered more "correct".
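A sketch of both approaches (the bookmarklet source here is hypothetical):

    // A hypothetical two-line bookmarklet source
    const src = 'javascript:(function(){\n  alert("progress: 100%");\n})()';

    // Minimal escaping per the answer above: percent signs first, then newlines
    const minimal = src.replace(/%/g, '%25').replace(/\n/g, '%0A');

    // What many builder tools do instead: encode the whole script body
    const full = 'javascript:' + encodeURIComponent(src.slice('javascript:'.length));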

Deciding charset for HTTP headers. Should I simply put utf-8 and fuggedaboutit?

I was analyzing a page using Google Page Speed http://pagespeed.googlelabs.com/#url=http_3A_2F_2Fqweop.com&mobile=false&rule=SpecifyCharsetEarly
and it says that we should specify an explicit character set in HTTP Headers.
So basically my question is: what determines which character set I should be using?
Which character set will give the smallest size / fastest page?
OR
What kind of savings can I have by using ASCII instead of, say, UTF-16?
Should I simply put utf-8 and fuggedaboutit?
You should declare the charset that the page is actually encoded in, and you'll want to be sure that you're telling the truth. For instance, there are a lot of pages floating around without a charset designation (and therefore treated as UTF-8 or ISO-8859-1) which are actually encoded as Windows-1252. That's fine as long as you stick to character codes the encodings have in common (certainly 32-127 and all the important control characters like newline, tab, etc.). But once you use any accented letters or special symbols, suddenly your page doesn't look right across browsers.
This article on charsets and Unicode by Joel Spolsky is well worth a read, if you haven't already.
Setting encoding in HTTP headers does not encode the page. It only tells browsers how the page is encoded and how they should treat it. So set the encoding in which the page is encoded.
If you want to decide which encoding to use, I would recommend UTF-8.
You can display all alphabetic characters of all languages (and much more) in the UTF-8 encoding. There isn't any reason to use a different encoding unless your pages need to be displayed by a device that does not support UTF-8 (such a device probably does not exist) or you have some very special requirements.
The performance impact of choosing a different encoding is negligible, as is any difference in page size.
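To illustrate the header itself, here is a minimal Node.js sketch (Node and the port are assumptions; the same header can be sent by any server or framework):

    const http = require('http');

    http.createServer((req, res) => {
      // The declared charset must match the bytes actually written below
      res.writeHead(200, { 'Content-Type': 'text/html; charset=UTF-8' });
      res.end('<!DOCTYPE html><html><body>Héllo, wörld</body></html>', 'utf8');
    }).listen(8080);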

How do I properly encode a URL in JavaScript?

I am working on a browser plugin that takes the URL of the current page or of any selected link as a parameter and sends it to our server.
When characters of the basic Latin alphabet are present in the URL (like http://en.wikipedia.org/wiki/Vehicle), the plugin works fine. However, when the URL contains characters from another alphabet, such as http://ru.wikipedia.org/wiki/Коляска, the plugin does not work. I do use the encodeURIComponent method, but that does not seem to solve the issue. Any ideas?
Thanks,
Olivier.
You probably want to use encodeURI/decodeURI if you are trying to take a full URI with non-ASCII characters and translate it into its encoded form. These preserve special URI characters, such as : and /, instead of escaping them; they only escape non-ASCII characters and characters that are invalid in URIs. Thus, they do essentially the same thing as typing a URI into the address bar or putting it in an <a href="..."> (though the behavior might vary somewhat between browsers, and isn't exactly the same).
encodeURIComponent is intended for encoding only a single component of a URI, replacing special characters that are meaningful in URIs, so that you can use the component as a query parameter or path component in a longer URI.
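To illustrate the difference with the URL from the question (the tracking endpoint in the second example is hypothetical):

    const url = 'http://ru.wikipedia.org/wiki/Коляска';

    // encodeURI keeps the URL structure intact, escaping only what is invalid
    console.log(encodeURI(url));
    // "http://ru.wikipedia.org/wiki/%D0%9A%D0%BE%D0%BB%D1%8F%D1%81%D0%BA%D0%B0"

    // encodeURIComponent escapes structural characters too, so it is only
    // suitable for a single component, e.g. a query-string parameter
    console.log('https://example.com/track?url=' + encodeURIComponent(url));
    // "https://example.com/track?url=http%3A%2F%2Fru.wikipedia.org%2Fwiki%2F%D0%9A..."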
This one says it covers UTF-8: http://www.webtoolkit.info/javascript-url-decode-encode.html. It might solve your problem.

How do I ensure that the text encoded in a form is UTF-8?

I have an HTML text box into which users may enter text. I would like to ensure that all text entered in the box is either encoded in UTF-8 or converted to UTF-8 when a user finishes typing. Furthermore, I don't quite understand how the various UTF encodings are chosen when text is entered into a text box.
Generally I'm curious about the following:
How does a browser determine which encodings to use when a user is typing into a text box?
How can javascript determine the encoding of a string value in an html text box?
Can I force the browser to only use UTF-8 encoding?
How can I convert text from arbitrary encodings to UTF-8? I assume there is a JavaScript library for this.
Edit: Removed some questions unnecessary to my goals.
This tutorial helped me understand JavaScript character codes better, but it is buggy and does not actually translate character codes to UTF-8 in all cases.
http://www.webtoolkit.info/javascript-base64.html
How does a browser determine which encodings to use when a user is typing into a text box?
It uses the encoding the page was decoded as by default. According to the spec, you should be able to override this with the accept-charset attribute of the <form> element, but IE is buggy, so you shouldn't rely on this (I've seen several different sources describe several different bugs, and I don't have all the relevant versions of IE in front of me to test, so I'll leave it at that).
How can javascript determine the encoding of a string value in an html text box?
All strings in JavaScript are encoded in UTF-16. The browser will map everything into UTF-16 for JavaScript, and from UTF-16 into whatever the page is encoded in.
UTF-16 is an encoding that grew out of UCS-2. Originally, it was thought that 65,536 code points would be enough for all of Unicode, and so a 16-bit character encoding would be sufficient. It turned out that this is not the case, and so the character set was expanded to 1,114,112 code points. In order to maintain backwards compatibility, a few unused ranges of the 16-bit character set were set aside for surrogate pairs, in which two 16-bit code units are used to encode a single character. Read up on UTF-16 and UCS-2 on Wikipedia for the details.
The upshot is that when you have a string str in JavaScript, str.length does not give you the number of characters; it gives you the number of code units, where two code units may be used to encode a single character if that character is not within the Basic Multilingual Plane. For instance, "abc".length gives you 3, but "𐤀𐤁𐤂".length gives you 6; and "𐤀𐤁𐤂".substring(0,1) gives what looks like an empty string, since half of a surrogate pair cannot be displayed, but the string still contains that invalid character (I will not guarantee this works cross-browser; I believe it is acceptable to drop broken characters). To get a valid character, you must use "𐤀𐤁𐤂".substring(0,2).
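The same examples are easy to verify in a console:

    const bmp = 'abc';         // all within the Basic Multilingual Plane
    const phoenician = '𐤀𐤁𐤂';  // three characters, each a surrogate pair

    console.log(bmp.length);                 // 3: one code unit per character
    console.log(phoenician.length);          // 6: two code units per character
    console.log(phoenician.substring(0, 1)); // half a pair: a broken character
    console.log(phoenician.substring(0, 2)); // "𐤀": one complete character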
Can I force the browser to only use UTF-8 encoding?
The best way to do this is to deliver your page in UTF-8. Ensure that your web server is sending the appropriate Content-type: text/html; charset=UTF-8 headers. You may also want to embed a <meta charset="UTF-8"> element in your <head> element, for cases in which the Content-Type does not get set properly (such as if your page is loaded off of the local disk).
How can I convert text from arbitrary encodings to UTF-8? I assume there is a JavaScript library for this.
There isn't much need in JavaScript to encode text in particular encodings. If you are simply writing to the DOM, or reading or filling in form controls, you should just use JavaScript strings, which are treated as sequences of UTF-16 code units. XMLHttpRequest, when used to send(data) via POST, will use UTF-8 (if you pass it a document with a different encoding declared in the <?xml ...> declaration, it may or may not convert that to UTF-8, so for compatibility you generally shouldn't use anything other than UTF-8).
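For example, a minimal sketch (the /submit endpoint is hypothetical):

    const xhr = new XMLHttpRequest();
    xhr.open('POST', '/submit');
    xhr.setRequestHeader('Content-Type', 'text/plain; charset=UTF-8');
    xhr.send('Héllo, wörld'); // the string goes over the wire as UTF-8 bytes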
I would like to ensure all text entered in the box is either encoded in UTF-8
Text in an HTML DOM including input fields has no intrinsic byte encoding; it is stored as Unicode characters (specifically, at a DOM and ECMAScript standard level, UTF-16 code units; on the rare case you use characters outside the Basic Multilingual Plane it is possible to see the difference, eg. '𝅘𝅥𝅯'.length is 2).
It is only when the form is submitted that the text is serialised into bytes using a particular encoding, by default the same encoding that was used to parse the page. So you should serve the page containing the form as UTF-8 (via the Content-Type header's charset parameter and/or the equivalent <meta> tag).
Whilst in principle there is an override for this in the accept-charset attribute of the <form> element, it doesn't work correctly (and is actively harmful in many cases) in IE. So avoid that one.
There are no explicit encoding-handling functions available in JavaScript itself. You can hack together a Unicode-to-UTF-8-bytes encoder by chaining unescape(encodeURIComponent(str)) (and similarly the other way round with the inverse function), but that's about it.
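A sketch of that hack (the function names are mine, not a standard API):

    // Each character of the "bytes" string holds one UTF-8 byte value
    // (0-255) of the original string.
    function toUtf8Bytes(str) {
      return unescape(encodeURIComponent(str));
    }

    function fromUtf8Bytes(bytes) {
      return decodeURIComponent(escape(bytes));
    }

    const encoded = toUtf8Bytes('é');     // "\xC3\xA9", two byte-sized characters
    console.log(encoded.length);          // 2
    console.log(fromUtf8Bytes(encoded));  // "é"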
The text in a text box is not encoded in any way; it is "text", an abstract series of characters. In almost every contemporary application, that text is expressed as a sequence of Unicode code points, which are integers mapped to particular abstract characters. Text doesn't get "encoded" until it is turned into a sequence of bytes, as when submitting the form. At that time, the encoding is determined by the encoding of the HTML page in which the form appears, or by the accept-charset attribute of the form element.
