Why is character encoding important for URLs? - javascript

I'm currently learning JavaScript, and I don't understand why it is important to encode URLs.
>>> var url = 'http://www.packtpub.com/scr ipt.php?q=this and that';
>>> encodeURI(url);
"http://www.packtpub.com/scr%20ipt.php?q=this%20and%20that"
For instance, in this example, what purpose would it serve to change the first URL to the latter one?

It depends on what you're going to be doing with that URL.
When you just use a document.location = url, you don't want it encoded.
If you plan on passing that URL as a variable, then yes you want it encoded or it will confuse the browser. For instance:
http://www.someurl.com?myFavwebsite=http://www.stackoverflow.com?someParam=test.
See how that could be confusing to the browser?
By the way, never use a space in a URL or a PHP file name. I've always found that to cause unnecessary stress. :)

Only a limited number of characters are allowed in URLs, according to the RFC 3986 standard. If you have a space in a URL, for example, this will make the URL invalid unless you encode it.
Often, browsers can deal with URLs that are not properly encoded by doing the encoding themselves, but that's not something you should rely on as a web developer.
URL encoding is also critical when using URLs as parameters of another URL. In this case the reserved characters of the URL need to be encoded, not just the non-permitted characters. For this, however, you don't use encodeURI, but encodeURIComponent.
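To make the difference concrete, here is a minimal sketch. The URLs are made up for illustration, and the extra `&other=1` parameter is my own addition so the leak is visible:

```javascript
// Hypothetical inner URL that we want to pass as a single parameter value.
const inner = 'http://www.stackoverflow.com?someParam=test&other=1';

// encodeURI leaves reserved characters (: / ? & =) alone, so "&other=1"
// ends up looking like a parameter of the OUTER URL.
const wrong = 'http://www.someurl.com?myFavwebsite=' + encodeURI(inner);

// encodeURIComponent escapes the reserved characters too, so the whole
// inner URL stays inside the myFavwebsite value.
const right = 'http://www.someurl.com?myFavwebsite=' + encodeURIComponent(inner);

console.log(new URL(wrong).searchParams.get('myFavwebsite'));
console.log(new URL(right).searchParams.get('myFavwebsite'));
```

Only the second version gives back the inner URL intact when the outer query string is parsed.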

Related

Should we enclose filename with encodeURIComponent in Javascript?

I'm accepting files to be uploaded to my site. So, is it a safe practice to encodeURIComponent the filename? Or should I use escape()? Or is it necessary at all?
You should never use escape for anything (unless forced to because you're sending information to something that will use unescape [which it shouldn't]).
Whether you need to use encodeURIComponent depends entirely on whether you're going to use the filename directly as a URI component¹. If you are, yes, you should use it. If you aren't, no, you probably shouldn't.
¹ for instance, as a query string parameter when you're creating the query string manually rather than via URLSearchParams (which is generally better practice)
encodeURIComponent takes a string and escapes it to make it safe to insert into a URI, typically used for query string data.
If you are inserting a string into a URI then you can use it, but should probably use URLSearchParams to construct the whole query string instead.
If you aren't inserting a string into a URI then you probably should not use it.
escape is deprecated and should not be used. It doesn't work properly with Unicode.
Considerations for accepting files are typically more along the lines of "Will this accidentally overwrite an existing file?" and "Are the characters in this filename allowed by my filesystem?".
Some people prefer to generate a completely new file name (e.g. with a guid library) to ensure it is safe. You could store the original name in a database (at which point your escaping should be handled by parametrised queries).
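As a rough sketch of the two ways of putting a filename into a query string mentioned above (the filename and the /download endpoint are hypothetical):

```javascript
// Hypothetical filename containing characters that matter in a query string.
const filename = 'annual report & summary (final).pdf';

// Manual approach: escape the value yourself before concatenating.
const manual = '/download?file=' + encodeURIComponent(filename);

// URLSearchParams approach: the escaping is done when serializing.
const params = new URLSearchParams({ file: filename });
const viaParams = '/download?' + params.toString();

console.log(manual);
console.log(viaParams);
```

The two strings are not byte-for-byte identical (URLSearchParams uses form encoding, so spaces become +), but both decode back to the same filename on the receiving side.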

javascript encodeURIComponent and escape?

I use JS to send an encodeURIComponent string to a PHP file, and this has been working fine for years. Recently, though, I met with a strange effect: the text needs to be further encoded with escape in order to get it to work! The symptom started to show only when I use an open source wysiwyg editor!
What could be the offending characters in a URI that need escape to fix it? I used to think a URI only reserves ? & = for its syntax to work.
The situation you describe could possibly be explained--although there's no way of knowing without you telling us what the string is, and how it's being used--by a URL which involves two levels of nested URL-like values.
Consider a URL taking a query parameter which is another URL:
http://me.com?url=http://you.com?qp=1
That URL is subject to misinterpretation, so we would normally URL-encode the you.com URL, giving us:
http://me.com?url=http%3A%2F%2Fyou.com%3Fqp%3D1
Whoever is working with this URL can now extract the query parameter named url with the value http%3A%2F%2Fyou.com%3Fqp%3D1, decode it (often a framework or library will decode it for you), and then use it to jump to or call that URL.
Consider, however, the case where the you.com URL itself has a query parameter, not ?qp=1 as given in the first example, but rather something that itself needs to be URL-encoded. To keep things simple, we'll just use "cat?pictures". We'd need to encode that, making the query parameter
?qp=cat%3Fpictures
In other words, the nested URL is going to be
http://you.com?qp=cat%3Fpictures
If we just use that as is, then our entire URL becomes
http://me.com?url=http%3A%2F%2Fyou.com%3Fqp%3Dcat%3Fpictures
Unfortunately, if we now decode that in a naive way, we get
http://me.com?url=http://you.com?qp=cat?pictures
In other words, the nested URL has been decoded as well, so whoever parses the result will think the URL has two query parameters, namely url and qp. To successfully deal with this problem, we need to encode the second query parameter a second time, yielding
http://me.com?url=http%3A%2F%2Fyou.com%3Fqp%3Dcat%253Fpictures
Please note, however, that if you use your language or environment's built-in tools and libraries for handling query parameters, most of this will happen automatically and prevent you from having to worry about it.
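Here is a small sketch of the two encoding levels, using the same me.com / you.com example. The variable names are mine, and the decoding side uses the standard URL class as one possible "built-in tool":

```javascript
// Level 1: encode the value that goes into the nested URL.
const innerUrl = 'http://you.com?qp=' + encodeURIComponent('cat?pictures');

// Level 2: encode the whole nested URL before using it as a parameter.
// The "%" from level 1 becomes "%25" here, which is the double encoding.
const outerUrl = 'http://me.com?url=' + encodeURIComponent(innerUrl);
console.log(outerUrl);

// Decoding peels off exactly one level at a time.
const recoveredInner = new URL(outerUrl).searchParams.get('url');
const recoveredValue = new URL(recoveredInner).searchParams.get('qp');
console.log(recoveredInner);
console.log(recoveredValue);
```

Each layer is decoded once, by whoever owns that layer, so the "?" inside the innermost value never gets mistaken for a delimiter.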
The symptom started to show only when I use an open source wysiwyg editor
An editor merely places characters in a file. It's very hard to imagine that an editor is causing the problem you refer to, unless perhaps one editor is configured to use smart quotes, for example, which would pretty much break everything that involved quotes.

Encoding uri component, do not automatically decode

I have users with slashes in their usernames. I want to give them easy URLs such as /user/username even if their username is problematic, e.g. /user/xXx/superboy.
I'm using client side routing and I don't think there's any wildcard support. One obvious way to fix this would be to encode their username. href="/user/xXx%2Fsuperboy". But the browser automatically decodes the url when going to the link and then my router ends up not matching anyway. Is there some way to keep the browser from automatically decoding the url or any other way to solve my problem (perhaps a different decoding scheme?). Thanks.
I'm using angularjs with angular ui-router for routing.
Part 1.
Automatic decoding of URIs can be encountered in many situations, such as when a URI is interpreted once and the result is then passed on to be interpreted again.
Part 2.
In a path in a URI, / has a special meaning, so you can't use it as the name of a file or directory. This means if you're mapping something that isn't a real path to a path, you may end up with unexpected characters causing problems. To solve this the characters need to be encoded.
As you want to map usernames to a URI, you have to consider this might happen, so you have to encode in a way that allows for this. From your question, it looks like this happens once, so you'll need to double encode any part of the URI that isn't a "real URI path".
Also maybe you can explain how reliable this is and whether it's advisable
If you always have it used in the same way, it should be reliable. As for advisable, it would be much better to use the query part, rather than the path for this. href="/user?xXx/superboy" is a valid URI and you can get the query string easily (everything after first ?, or an inbuilt method). The only character you'd have to watch for is #, which has special meaning again.
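A minimal sketch of the double-encoding idea. It assumes exactly one automatic decode happens before the router sees the path, and it simulates that decode with decodeURIComponent; how ui-router actually behaves is not shown here:

```javascript
const username = 'xXx/superboy';

// Encode twice so that one automatic decode still leaves "%2F" in place.
const href = '/user/' + encodeURIComponent(encodeURIComponent(username));
console.log(href);

// Simulated automatic decode (assumption: it happens exactly once).
const afterOneDecode = decodeURIComponent(href);

// The router can now split on "/" safely, then decode the segment itself.
const segment = afterOneDecode.split('/')[2];
const recovered = decodeURIComponent(segment);
console.log(recovered);
```

If the number of automatic decodes ever changes, this breaks, which is part of why the query string approach is the more robust option.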

should encodeURI ever be used?

Is there any valid use for javascript's encodeURI function?
As far as I can tell, when you are trying to make a HTTP request you should either have:
a complete URI
some fragment you want to put in a URI, which is either a unicode string or UTF-8 byte sequence
In the first case, obviously nothing needs to be done to request it. Note: if you actually want to pass it as a parameter (e.g ?url=http...) then you actually have an instance of the second case that happens to look like a URI.
In the second case, you should always convert a unicode string into UTF-8, and then call encodeURIComponent to escape all characters before adding it to a URI. (If you have a UTF-8 byte sequence instead of a unicode string you can skip the convert-to-utf8 step).
Assuming I haven't missed anything, I can't see a valid use for encodeURI. If you use it, it's likely you've constructed an invalid URI and then attempted to "sanitize" it after the fact, which is simply not possible since you don't know which characters were intended literally, and which were intended to be escaped.
I have seen a lot of advice against using escape(), but don't see anybody discouraging encodeURI. Am I missing a valid use?
I have a blog post which answers this question in a lot of detail.
You should never use encodeURI to construct a URI programmatically, for the reasons you say -- you should always use encodeURIComponent on the individual components, and then compose them into a complete URI.
Where encodeURI is almost useful is in "cleaning" a URI, in accordance with Postel's Law ("Be liberal in what you accept, and conservative in what you send.") If someone gives you a complete URI, it may contain illegal characters, such as spaces, certain ASCII characters (such as double-quotes) and Unicode characters. encodeURI can be used to convert those illegal characters into legal percent-escaped sequences, without encoding delimiters. Similarly, decodeURI can be used to "pretty-print" a URI, showing percent-escaped sequences as technically-illegal bare characters.
For example, the URL:
http://example.com/admin/login?name=Helen Ødegård&gender=f
is illegal, but it is still completely unambiguous. encodeURI converts it into the valid URI:
http://example.com/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f
An example of an application that might want to do this sort of "URI cleaning" is a web browser. When you type a URL into the address bar, it should attempt to convert any illegal characters into percent-escapes, rather than just having an error. Software that processes URIs (e.g., an HTML scraper that wants to get all the URLs in hyperlinks on a page) may also want to apply this kind of cleaning in case any of the URLs are technically illegal.
Unfortunately, encodeURI has a critical flaw, which is that it escapes '%' characters, making it completely useless for URI cleaning (it will double-escape any URI that already had percent-escapes). I have therefore borrowed Mozilla's fixedEncodeURI function and improved it so that it correctly cleans URIs:
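function fixedEncodeURI(str) {
    // encodeURI escapes '%', '[' and ']'; undo that so existing
    // percent-escapes and square brackets are left as they were.
    return encodeURI(str).replace(/%25/g, '%').replace(/%5B/g, '[').replace(/%5D/g, ']');
}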
So you should always use encodeURIComponent to construct URIs internally. You should never use encodeURI itself, but you can use my fixedEncodeURI to attempt to "clean up" URIs that have been supplied from an external source (usually as part of a user interface).
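A quick sketch of the double-escaping problem, using the example URL from above. The function is repeated so the snippet runs on its own:

```javascript
function fixedEncodeURI(str) {
    return encodeURI(str).replace(/%25/g, '%').replace(/%5B/g, '[').replace(/%5D/g, ']');
}

// Technically illegal, but unambiguous, URL from the example above.
const raw = 'http://example.com/admin/login?name=Helen Ødegård&gender=f';
const cleaned = fixedEncodeURI(raw);

// Cleaning an already-clean URI: fixedEncodeURI leaves it alone,
// while plain encodeURI escapes the "%" signs a second time.
const cleanedAgain = fixedEncodeURI(cleaned);
const doubleEscaped = encodeURI(cleaned);

console.log(cleaned);
console.log(cleanedAgain === cleaned);
console.log(doubleEscaped);
```

One consequence of this design is that a bare "%" in the input is also left alone, so it only suits input that is meant to be a URI already.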
encodeURI does not encode the following: ; , / ? : @ & = + $ # whereas encodeURIComponent does.
There are a myriad of reasons why you might want to use encodeURI over encodeURIComponent, such as assigning a URL as a variable value. You want to maintain the URL but encode paths, query string and hash values. Using encodeURIComponent would make the URL invalid.
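For what it's worth, the difference is easy to check directly. The example URL is made up:

```javascript
// The reserved characters that encodeURI leaves alone.
const reserved = ';,/?:@&=+$#';
console.log(encodeURI(reserved));
console.log(encodeURIComponent(reserved));

// Hypothetical full URL containing spaces.
const url = 'http://example.com/my path/page.html?q=a b#top';

// encodeURI keeps the URL structure and only escapes the spaces.
const asWhole = encodeURI(url);

// encodeURIComponent escapes the delimiters too, so the result is no
// longer usable as a URL on its own, only as a value inside another one.
const asComponent = encodeURIComponent(url);

console.log(asWhole);
console.log(asComponent);
```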

how do I properly encode a URL in JavaScript?

I am working on a browser plugin that takes the URL of the current page or any selected link as a parameter and sends it to our server.
When characters of the basic Latin alphabet are present in the URL (like http://en.wikipedia.org/wiki/Vehicle), the plugin works fine. However, when the URL contains characters from another alphabet, such as http://ru.wikipedia.org/wiki/Коляска, the plugin does not work. I do use the encodeURIComponent method, but that does not seem to solve the issue. Any idea?
Thanks,
Olivier.
You probably want to use encodeURI/decodeURI, if you are trying to take a full URI with non-ASCII characters and translate it into its encoded form. These preserve special URI characters, such as : and /, instead of escaping them; they only escape non-ASCII characters and characters that are invalid in URIs. Thus, they do essentially the same thing as typing in the address bar or putting a URI in an <a href="..."> (though the behavior might vary somewhat between browsers, and isn't exactly the same).
encodeURIComponent is intended for encoding only a single component of a URI, replacing special characters that are meaningful in URIs, so that you can use the component as a query parameter or path component in a longer URI.
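A short sketch of both uses with the URL from the question. The server.example endpoint and the page parameter name are made up, since the question doesn't say how the plugin talks to the server:

```javascript
const pageUrl = 'http://ru.wikipedia.org/wiki/Коляска';

// As a standalone URI: keep ":" and "/" but escape the Cyrillic letters.
const asUri = encodeURI(pageUrl);
console.log(asUri);

// As a parameter value inside another URL (hypothetical endpoint):
// here the delimiters have to be escaped as well.
const request = 'https://server.example/collect?page=' + encodeURIComponent(pageUrl);
console.log(request);

// The receiving side gets the original URL back after one decode.
console.log(new URL(request).searchParams.get('page'));
```

If the server still sees garbage after this, the problem is more likely in how the server decodes the bytes (for example, not treating them as UTF-8) than in the JavaScript side.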
This one says it covers UTF-8: http://www.webtoolkit.info/javascript-url-decode-encode.html. It might solve your problem.
