I was reading about URL encoding in the browser. I know it is done so URLs can be sent over the internet. But when I look at the encoding in the address bar, only some characters appear to be encoded. However, when I copy the URL and paste it into Notepad, all special characters seem to be encoded. Is what we see in the address bar fake, and are all characters actually encoded? Do browsers use the JavaScript function encodeURI() for this purpose?
Yes, what the address bar shows is often not the actual URL. Most browsers decode percent-escaped characters for display, giving you a friendlier-looking URL that you can read as actual characters, which may or may not be useful.
However, some characters look extremely similar to other characters, like the Cyrillic letter А (U+0410), which is virtually indistinguishable from the Latin letter A (U+0041). To avoid confusion and/or phishing attacks, letters that are easily confused may not be decoded.
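A quick console sketch makes the difference concrete (the code points are the ones mentioned above):

```javascript
// The two "A"s look identical but are different code points, so they
// percent-encode differently (the Cyrillic one as its UTF-8 bytes):
encodeURIComponent('\u0410'); // Cyrillic А → '%D0%90'
encodeURIComponent('\u0041'); // Latin A   → 'A' (unreserved, left as-is)
```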
Related
Could you use Unicode whitespace characters to encode information in browser pages?
I'm thinking of characters that display as whitespace but are represented by different Unicode code points. It seems like this could be used to identify the page displayed to a particular user without making it obvious (i.e., for copyright or tracking purposes).
Are there guides to doing this out there already?
The bookmarklets I've made seem to always work even though I don't escape them, so why do bookmarklet builders like John Gruber's escape special characters such as the space to %20?
If you copy/paste code into a bookmark using the browser's bookmark properties editor, you don't need to encode anything except newlines and percent signs. In fact, Firefox (and maybe other browsers) will encode spaces for you; if you check the properties again after saving, you'll see them encoded. You need to encode newlines because a bookmark can only be one line, and percent signs because the browser expects a percent sign to always begin a percent-escape sequence. If you put the bookmarklet on a website in the href attribute of an <a> tag, the same rules apply, but you must also either HTML-encode or URL-encode any double quotes.
A lot of bookmarklet makers will just use encodeURIComponent, which encodes more than what is strictly necessary, but it might be considered more "correct".
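The minimal escaping described above can be sketched as a tiny helper (toBookmarklet is a hypothetical name, not part of any actual builder):

```javascript
// Turn multi-line JavaScript into a one-line javascript: URL.
// Per the answer above, only percent signs and newlines strictly need escaping.
function toBookmarklet(code) {
  return 'javascript:' + code
    .replace(/%/g, '%25')      // escape percent signs first, so the next
    .replace(/\r?\n/g, '%0A'); // step's %0A escapes aren't double-escaped
}

toBookmarklet('var p = 100;\nalert(p + "%");');
// → 'javascript:var p = 100;%0Aalert(p + "%25");'
```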
Is there any valid use for JavaScript's encodeURI function?
As far as I can tell, when you are trying to make an HTTP request you should have one of:
a complete URI
some fragment you want to put into a URI, which is either a Unicode string or a UTF-8 byte sequence
In the first case, obviously nothing needs to be done before requesting it. Note: if you actually want to pass it as a parameter (e.g. ?url=http...), then you actually have an instance of the second case that happens to look like a URI.
In the second case, you should always convert the Unicode string to UTF-8 and then call encodeURIComponent to escape its reserved characters before adding it to the URI. (If you have a UTF-8 byte sequence instead of a Unicode string, you can skip the conversion step.)
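In JavaScript the two steps collapse into one, because encodeURIComponent performs the UTF-8 conversion itself (the example.com URL below is just a placeholder):

```javascript
// encodeURIComponent converts the string to UTF-8 bytes and then
// percent-escapes everything outside the unreserved character set.
const name = 'Helen Ødegård';
const uri = 'http://example.com/search?q=' + encodeURIComponent(name);
// encodeURIComponent(name) === 'Helen%20%C3%98deg%C3%A5rd'
```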
Assuming I haven't missed anything, I can't see a valid use for encodeURI. If you use it, it's likely you've constructed an invalid URI and are attempting to "sanitize" it after the fact, which is simply not possible, since you don't know which characters were intended literally and which were intended to be escaped.
I have seen a lot of advice against using escape(), but don't see anybody discouraging encodeURI. Am I missing a valid use?
I have a blog post which answers this question in a lot of detail.
You should never use encodeURI to construct a URI programmatically, for the reasons you say -- you should always use encodeURIComponent on the individual components, and then compose them into a complete URI.
Where encodeURI is almost useful is in "cleaning" a URI, in accordance with Postel's Law ("Be liberal in what you accept, and conservative in what you send.") If someone gives you a complete URI, it may contain illegal characters, such as spaces, certain ASCII characters (such as double-quotes) and Unicode characters. encodeURI can be used to convert those illegal characters into legal percent-escaped sequences, without encoding delimiters. Similarly, decodeURI can be used to "pretty-print" a URI, showing percent-escaped sequences as technically-illegal bare characters.
For example, the URL:
http://example.com/admin/login?name=Helen Ødegård&gender=f
is illegal, but it is still completely unambiguous. encodeURI converts it into the valid URI:
http://example.com/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f
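In code, that cleaning step is a single call:

```javascript
// encodeURI escapes the space and the non-ASCII letters (as their
// UTF-8 bytes) but leaves the :, /, ?, = and & delimiters alone.
encodeURI('http://example.com/admin/login?name=Helen Ødegård&gender=f');
// → 'http://example.com/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f'
```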
An example of an application that might want to do this sort of "URI cleaning" is a web browser. When you type a URL into the address bar, it should attempt to convert any illegal characters into percent-escapes, rather than just having an error. Software that processes URIs (e.g., an HTML scraper that wants to get all the URLs in hyperlinks on a page) may also want to apply this kind of cleaning in case any of the URLs are technically illegal.
Unfortunately, encodeURI has a critical flaw, which is that it escapes '%' characters, making it completely useless for URI cleaning (it will double-escape any URI that already had percent-escapes). I have therefore borrowed Mozilla's fixedEncodeURI function and improved it so that it correctly cleans URIs:
function fixedEncodeURI(str) {
    return encodeURI(str).replace(/%25/g, '%').replace(/%5B/g, '[').replace(/%5D/g, ']');
}
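To see the fix in action, here is the function again (repeated so the snippet runs standalone), applied to a URL that mixes a raw space with an existing percent-escape:

```javascript
function fixedEncodeURI(str) {
  return encodeURI(str).replace(/%25/g, '%').replace(/%5B/g, '[').replace(/%5D/g, ']');
}

// Plain encodeURI would double-escape the existing %20 into %2520;
// fixedEncodeURI undoes that and leaves it intact.
fixedEncodeURI('http://example.com/a b%20c');
// → 'http://example.com/a%20b%20c'
```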
So you should always use encodeURIComponent to construct URIs internally. You should never use encodeURI, but you can use my fixedEncodeURI to attempt to "clean up" URIs that have been supplied from an external source (usually as part of a user interface).
encodeURI does not encode reserved URI characters such as , / ? : @ & = + $ #, whereas encodeURIComponent does.
There are myriad reasons why you might want to use encodeURI over encodeURIComponent, such as assigning a URL as a variable value: you want to keep the URL intact while encoding illegal characters in the path, query string, and hash. Using encodeURIComponent would make the URL invalid.
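A quick illustration of that difference, using a throwaway URL:

```javascript
const url = 'http://example.com/path?a=1#top';
encodeURI(url);          // → 'http://example.com/path?a=1#top' (unchanged)
encodeURIComponent(url); // → 'http%3A%2F%2Fexample.com%2Fpath%3Fa%3D1%23top'
```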
I was analyzing a page using Google Page Speed (http://pagespeed.googlelabs.com/#url=http_3A_2F_2Fqweop.com&mobile=false&rule=SpecifyCharsetEarly), and it says that I should specify an explicit character set in the HTTP headers.
So basically my question is: what determines which character set I should be using?
Which character sets give the smallest size / fastest pages?
OR
What kind of savings can I get by using ASCII instead of, say, UTF-16?
Should I simply put UTF-8 and fuggedaboutit?
You should declare the charset that the page is actually encoded in, and you'll want to be sure that you're telling the truth. For instance, there are a lot of pages out there without a charset declaration (and therefore treated as UTF-8 or ISO-8859-1) which are actually encoded as Windows-1252. That's fine as long as you stick to the character codes they have in common (certainly 32-127, plus the important control characters like newline and tab). But once you use any accented letters or special symbols, your page suddenly doesn't look right across browsers.
This article on charsets and Unicode by Joel Spolsky is well worth a read, if you haven't already.
Setting the encoding in the HTTP headers does not encode the page. It only tells browsers how the page is encoded and how they should treat it. So declare the encoding the page is actually encoded in.
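Concretely (assuming a UTF-8-encoded HTML page), the declaration lives in the Content-Type response header:

```
Content-Type: text/html; charset=utf-8
```

The header must match the bytes you actually serve; declaring utf-8 does not convert anything.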
If you get to decide which encoding to use, I would recommend UTF-8.
UTF-8 can represent the alphabets of all languages (and much more). There isn't any reason to use a different encoding unless your pages must be displayed by a device that does not support UTF-8 (such a device probably does not exist) or you have some very special requirements.
The performance impact of choosing one encoding over another is negligible, as is the effect on page size.
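On the "ASCII vs. UTF-16" part of the question: for ASCII-only content, UTF-8 produces byte-for-byte the same output as ASCII, so there are no savings to be had, while UTF-16 would roughly double the size. A quick sketch using TextEncoder (which always emits UTF-8):

```javascript
// ASCII characters are 1 byte each in UTF-8...
const utf8Bytes = new TextEncoder().encode('hello').length; // 5
// ...and 2 bytes per code unit in UTF-16:
const utf16Bytes = 'hello'.length * 2; // 10
```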
I am working on a browser plugin that takes the URL of the current page, or of any selected link, as a parameter and sends it to our server.
When only characters of the basic Latin alphabet are present in the URL (like http://en.wikipedia.org/wiki/Vehicle), the plugin works fine. However, when the URL contains characters from another alphabet, such as http://ru.wikipedia.org/wiki/Коляска, the plugin does not work. I do use the encodeURIComponent method, but that does not seem to solve the issue. Any ideas?
Thanks,
Olivier.
You probably want to use encodeURI/decodeURI if you are trying to take a full URI with non-ASCII characters and translate it into its encoded form. These preserve special URI characters, such as : and /, instead of escaping them; they only escape non-ASCII characters and characters that are invalid in URIs. Thus, they do essentially the same thing as typing the URL in the address bar or putting it in an <a href="..."> (though the behavior may vary somewhat between browsers and isn't exactly the same).
encodeURIComponent is intended for encoding only a single component of a URI, replacing special characters that are meaningful in URIs, so that you can use the component as a query parameter or path component in a longer URI.
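Applied to the URL from the question (the escapes below are the UTF-8 bytes of the Cyrillic letters; the example.com tracking URL is a hypothetical stand-in for your server):

```javascript
// encodeURI keeps the scheme and slashes and only escapes the
// Cyrillic path segment; encodeURIComponent would mangle the whole URL.
const raw = 'http://ru.wikipedia.org/wiki/Коляска';
encodeURI(raw);
// → 'http://ru.wikipedia.org/wiki/%D0%9A%D0%BE%D0%BB%D1%8F%D1%81%D0%BA%D0%B0'

// If instead you are sending the URL as a query parameter to your server,
// encodeURIComponent is the right tool for that single component:
const req = 'http://example.com/track?url=' + encodeURIComponent(raw);
```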
This one says it covers UTF-8: http://www.webtoolkit.info/javascript-url-decode-encode.html. It might solve your problem.