What is the difference between the JavaScript functions decodeURIComponent and decodeURI?
To explain the difference between these two let me explain the difference between encodeURI and encodeURIComponent.
The main difference is that:
The encodeURI function is intended for use on the full URI.
The encodeURIComponent function is intended to be used on .. well .. URI components, that is, any part that lies between separators (; / ? : @ & = + $ , #).
So, in encodeURIComponent these separators are encoded also because they are regarded as text and not special characters.
Now back to the difference between the decode functions, each function decodes strings generated by its corresponding encode counterpart taking care of the semantics of the special characters and their handling.
encodeURIComponent/decodeURIComponent() is almost always the pair you want to use, for concatenating together and splitting apart text strings in URI parts.
encodeURI is less common, and misleadingly named: it should really be called fixBrokenURI. It takes something that's nearly a URI, but has invalid characters such as spaces in it, and turns it into a real URI. It has a valid use in fixing up invalid URIs from user input, and it can also be used to turn an IRI (URI with bare Unicode characters in) into a plain URI (using %-escaped UTF-8 to encode the non-ASCII).
Where encodeURI should really be named fixBrokenURI(), decodeURI() could equally be called potentiallyBreakMyPreviouslyWorkingURI(). I can think of no valid use for it anywhere; avoid.
js> s = "http://www.example.com/string with + and ? and & and spaces";
http://www.example.com/string with + and ? and & and spaces
js> encodeURI(s)
http://www.example.com/string%20with%20+%20and%20?%20and%20&%20and%20spaces
js> encodeURIComponent(s)
http%3A%2F%2Fwww.example.com%2Fstring%20with%20%2B%20and%20%3F%20and%20%26%20and%20spaces
Looks like encodeURI produces a "safe" URI by encoding spaces and some other (e.g. nonprintable) characters, whereas encodeURIComponent additionally encodes the colon and slash and plus characters, and is meant to be used in query strings. The encoding of + and ? and & is of particular importance here, as these are special chars in query strings.
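As a minimal sketch of the query-string use case described above: buildQuery is a hypothetical helper name, and the /search path and parameter names are made up for illustration. Each key and value is passed through encodeURIComponent so that +, ? and & inside the data cannot be mistaken for separators.

```javascript
// Hypothetical helper: turns a plain object into a query string.
// Every key and value is encoded as a URI *component*.
function buildQuery(params) {
  return Object.entries(params)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
}

// The path and parameter names are illustrative only.
const url = "http://www.example.com/search?" + buildQuery({ q: "a+b & c?", page: 2 });
console.log(url);
// http://www.example.com/search?q=a%2Bb%20%26%20c%3F&page=2
```

Using encodeURI on the value instead would leave +, & and ? untouched, and the server would see extra parameters.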
As I had the same question, but didn't find the answer here, I made some tests in order to figure out what the difference actually is.
I did this, since I need the encoding for something, which is not URL/URI related.
encodeURIComponent("A") returns "A", it does not encode "A" to "%41"
decodeURIComponent("%41") returns "A".
encodeURI("A") returns "A", it does not encode "A" to "%41"
decodeURI("%41") returns "A".
That means both can decode alphanumeric characters, even though they did not encode them. However...
encodeURIComponent("&") returns "%26".
decodeURIComponent("%26") returns "&".
encodeURI("&") returns "&".
decodeURI("%26") returns "%26".
Even though encodeURIComponent does not encode all characters, decodeURIComponent can decode any value between %00 and %7F.
Note: It appears that if you try to decode a value above %7F (unless it is part of a valid UTF-8 sequence), your script will fail with a URIError.
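The observations above can be reproduced with a short sketch. safeDecode is a hypothetical helper name (not a built-in); it wraps decodeURIComponent so that a malformed escape returns null instead of aborting the script.

```javascript
// Hypothetical helper: returns the decoded text, or null when the
// input is not valid percent-encoded UTF-8.
function safeDecode(text) {
  try {
    return decodeURIComponent(text);
  } catch (err) {
    if (err instanceof URIError) return null;
    throw err; // anything else is unexpected, so re-throw it
  }
}

console.log(safeDecode("%41"));    // A
console.log(safeDecode("%26"));    // &
console.log(decodeURI("%26"));     // %26 (decodeURI leaves reserved characters encoded)
console.log(safeDecode("%C3%A5")); // å (a valid two-byte UTF-8 sequence)
console.log(safeDecode("%80"));    // null (a lone byte above %7F is malformed)
```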
encodeURIComponent()
Converts the input into a URL-encoded string
encodeURI()
URL-encodes the input, but assumes a full URL is given, so returns a valid URL by not encoding the protocol (e.g. http://) and host name (e.g. www.stackoverflow.com).
decodeURIComponent() and decodeURI() are the opposite of the above
decodeURIComponent will decode URI special markers such as &, ?, #, etc, decodeURI will not.
encodeURIComponent
Not Escaped:
A-Z a-z 0-9 - _ . ! ~ * ' ( )
encodeURI()
Not Escaped:
A-Z a-z 0-9 ; , / ? : @ & = + $ - _ . ! ~ * ' ( ) #
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURI
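The difference between the two lists can also be checked directly rather than memorised. The following sketch (the variable name is illustrative) walks the printable ASCII range and collects the characters that encodeURI leaves alone but encodeURIComponent escapes.

```javascript
// Collect printable ASCII characters that only encodeURIComponent escapes.
const onlyComponentEscapes = [];
for (let code = 0x21; code <= 0x7e; code++) {
  const ch = String.fromCharCode(code);
  if (encodeURI(ch) === ch && encodeURIComponent(ch) !== ch) {
    onlyComponentEscapes.push(ch);
  }
}

console.log(onlyComponentEscapes.join(" "));
// # $ & + , / : ; = ? @
```

These are exactly the URI separators, which is why encodeURIComponent is the one to use on data placed between them.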
Encode URI:
The encodeURI() method does not encode:
, / ? : @ & = + $ * #
Example
URI: https://my test.asp?name=ståle&car=saab
Encoded URI: https://my%20test.asp?name=st%C3%A5le&car=saab
Encode URI Component:
The encodeURIComponent() method also encodes:
, / ? : @ & = + $ #
Example
URI: https://my test.asp?name=ståle&car=saab
Encoded URI: https%3A%2F%2Fmy%20test.asp%3Fname%3Dst%C3%A5le%26car%3Dsaab
For more: W3Schools.com
What are the allowed characters in both cookie name and value? Are they same as URL or some common subset?
Reason I'm asking is that I've recently hit some strange behavior with cookies that have - in their name and I'm just wondering if it's something browser specific or if my code is faulty.
According to the ancient Netscape cookie_spec the entire NAME=VALUE string is:
a sequence of characters excluding semi-colon, comma and white space.
So - should work, and it does seem to be OK in browsers I've got here; where are you having trouble with it?
By implication of the above:
= is legal to include, but potentially ambiguous. Browsers always split the name and value on the first = symbol in the string, so in practice you can put an = symbol in the VALUE but not the NAME.
What isn't mentioned, because Netscape were terrible at writing specs, but seems to be consistently supported by browsers:
either the NAME or the VALUE may be empty strings
if there is no = symbol in the string at all, browsers treat it as the cookie with the empty-string name, ie Set-Cookie: foo is the same as Set-Cookie: =foo.
when browsers output a cookie with an empty name, they omit the equals sign. So Set-Cookie: =bar begets Cookie: bar.
commas and spaces in names and values do actually seem to work, though spaces around the equals sign are trimmed
control characters (\x00 to \x1F plus \x7F) aren't allowed
What isn't mentioned and browsers are totally inconsistent about, is non-ASCII (Unicode) characters:
in Opera and Google Chrome, they are encoded to Cookie headers with UTF-8;
in IE, the machine's default code page is used (locale-specific and never UTF-8);
Firefox (and other Mozilla-based browsers) use the low byte of each UTF-16 code point on its own (so ISO-8859-1 is OK but anything else is mangled);
Safari simply refuses to send any cookie containing non-ASCII characters.
So in practice you cannot use non-ASCII characters in cookies at all. If you want to use Unicode, control codes or other arbitrary byte sequences, the cookie_spec demands you use an ad-hoc encoding scheme of your own choosing and suggests URL-encoding (as produced by JavaScript's encodeURIComponent) as a reasonable choice.
In terms of actual standards, there have been a few attempts to codify cookie behaviour but none thus far actually reflect the real world.
RFC 2109 was an attempt to codify and fix the original Netscape cookie_spec. In this standard many more special characters are disallowed, as it uses RFC 2616 tokens (a - is still allowed there), and only the value may be specified in a quoted-string with other characters. No browser ever implemented the limitations, the special handling of quoted strings and escaping, or the new features in this spec.
RFC 2965 was another go at it, tidying up 2109 and adding more features under a ‘version 2 cookies’ scheme. Nobody ever implemented any of that either. This spec has the same token-and-quoted-string limitations as the earlier version and it's just as much a load of nonsense.
RFC 6265 is an HTML5-era attempt to clear up the historical mess. It still doesn't match reality exactly but it's much better than the earlier attempts: it is at least a proper subset of what browsers support, not introducing any syntax that is supposed to work but doesn't (like the previous quoted-string).
In 6265 the cookie name is still specified as an RFC 2616 token, which means you can pick from the alphanums plus:
!#$%&'*+-.^_`|~
In the cookie value it formally bans the (filtered by browsers) control characters and (inconsistently-implemented) non-ASCII characters. It retains cookie_spec's prohibition on space, comma and semicolon, plus for compatibility with any poor idiots who actually implemented the earlier RFCs it also banned backslash and quotes, other than quotes wrapping the whole value (but in that case the quotes are still considered part of the value, not an encoding scheme). So that leaves you with the alphanums plus:
!#$%&'()*+-./:<=>?@[]^_`{|}~
In the real world we are still using the original-and-worst Netscape cookie_spec, so code that consumes cookies should be prepared to encounter pretty much anything, but for code that produces cookies it is advisable to stick with the subset in RFC 6265.
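For code that produces cookies, the RFC 6265 subset described above can be checked with two regular expressions. This is a sketch under the assumption that the name is an RFC 2616 token and the value is a run of cookie-octets, optionally wrapped in one pair of double quotes; the function names are hypothetical.

```javascript
// cookie-name: RFC 2616 token (alphanumerics plus !#$%&'*+-.^_`|~), at least one character.
const COOKIE_NAME = /^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$/;
// cookie-octet: printable US-ASCII except space, double quote, comma, semicolon, backslash.
const COOKIE_OCTETS = /^[\x21\x23-\x2B\x2D-\x3A\x3C-\x5B\x5D-\x7E]*$/;

function isValidCookieName(name) {
  return COOKIE_NAME.test(name);
}

function isValidCookieValue(value) {
  // Quotes are only allowed when they wrap the whole value.
  const quoted = value.length >= 2 && value.startsWith('"') && value.endsWith('"');
  return COOKIE_OCTETS.test(quoted ? value.slice(1, -1) : value);
}

console.log(isValidCookieName("my-cookie")); // true
console.log(isValidCookieName("a=b"));       // false
console.log(isValidCookieValue("a=b"));      // true
console.log(isValidCookieValue("a;b"));      // false
```

A check like this only guards what you emit; code that reads cookies still has to tolerate whatever browsers send.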
In ASP.Net you can use System.Web.HttpUtility to safely encode the cookie value before writing to the cookie and convert it back to its original form on reading it out.
// Encode
HttpUtility.UrlEncode(cookieData);
// Decode
HttpUtility.UrlDecode(encodedCookieData);
This will stop ampersands and equals signs splitting a value into a bunch of name/value pairs as it is written to a cookie.
I think it's generally browser specific. To be on the safe side, base64 encode a JSON object, and store everything in that. That way you just have to decode it and parse the JSON. All the characters used in base64 should play fine with most, if not all browsers.
Here it is, in as few words as possible. Focus on characters that need no escaping:
For cookies:
abdefghijklmnqrstuvxyzABDEFGHIJKLMNQRSTUVXYZ0123456789!#$%&'()*+-./:<>?@[]^_`{|}~
For urls
abdefghijklmnqrstuvxyzABDEFGHIJKLMNQRSTUVXYZ0123456789.-_~!$&'()*+,;=:@
For cookies and urls ( intersection )
abdefghijklmnqrstuvxyzABDEFGHIJKLMNQRSTUVXYZ0123456789!$&'()*+-.:@_~
That's how you answer.
Note that for cookies, the = has been removed because it is usually used to set the cookie value.
For urls the = was kept. The intersection is obviously without it.
var chars = "abdefghijklmnqrstuvxyz"; chars += chars.toUpperCase() + "0123456789" + "!$&'()*+-.:@_~";
It turns out that escaping still occurs and unexpected things still happen, especially in a Java cookie environment, where the cookie is wrapped in double quotes if it encounters the last characters.
So to be safe, just use A-Za-z0-9. That's what I am going to do.
Newer rfc6265 published in April 2011:
cookie-header = "Cookie:" OWS cookie-string OWS
cookie-string = cookie-pair *( ";" SP cookie-pair )
cookie-pair = cookie-name "=" cookie-value
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
If you look at @bobince's answer you will see that the newer restrictions are stricter.
You cannot put ";" in the value field of a cookie; in most browsers the value that gets set is the string up to the ";".
that's simple:
A <cookie-name> can be any US-ASCII characters except control
characters (CTLs), spaces, or tabs. It also must not contain a
separator character like the following: ( ) < > @ , ; : \ " / [ ] ? =
{ }.
A <cookie-value> can optionally be set in double quotes and any
US-ASCII characters excluding CTLs, whitespace, double quotes, comma,
semicolon, and backslash are allowed. Encoding: Many implementations
perform URL encoding on cookie values, however it is not required per
the RFC specification. It does help satisfying the requirements about
which characters are allowed for though.
Link: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#Directives
There are 2 versions of cookies specifications
1. Version 0 cookies aka Netscape cookies,
2. Version 1 aka RFC 2965 cookies
In version 0, the name and value parts of a cookie are sequences of characters excluding the semicolon, comma, equals sign, and whitespace, unless enclosed in double quotes.
Version 1 is a lot more complicated; you can check it here
In this version, the rules for the name and value parts are almost the same, except that the name cannot start with a $ sign.
There is another interesting issue with IE and Edge. Cookies that have names with more than 1 period seem to be silently dropped.
So
This works:
cookie_name_a=valuea
while this will get dropped
cookie.name.a=valuea
One more consideration. I recently implemented a scheme in which some sensitive data posted to a PHP script needed to be converted and returned as an encrypted cookie that used only base64 values, which I thought were guaranteed "safe". So I dutifully encrypted the data items using RC4, ran the output through base64_encode, and happily returned the cookie to the site. Testing seemed to go well until a base64-encoded string contained a "+" symbol. The string was written to the page cookie with no trouble. Using the browser diagnostics I could also verify the cookie was written unchanged. Then when a subsequent page called my PHP and obtained the cookie via the $_COOKIE array, I was stunned to find the string was now missing the "+" sign. Every occurrence of that character was replaced with an ASCII space.
Considering how many similar unresolved complaints I've read describing this scenario since then, often citing numerous references to using base64 to "safely" store arbitrary data in cookies, I thought I'd point out the problem and offer my admittedly kludgy solution.
After you've done whatever encryption you want to do on a piece of data, and then used base64_encode to make it "cookie-safe", run the output string through this...
// From browser to PHP: substitute troublesome chars with
// other cookie-safe chars, or vice versa.
function fix64($inp) {
    $out = $inp;
    for ($i = 0; $i < strlen($inp); $i++) {
        $c = $inp[$i];
        switch ($c) {
            case '+': $c = '*'; break; // definitely won't transfer!
            case '*': $c = '+'; break;
            case '=': $c = ':'; break; // = symbol seems like a bad idea
            case ':': $c = '='; break;
            default: continue 2; // nothing to swap; move on to the next character
        }
        $out[$i] = $c;
    }
    return $out;
}
Here I'm simply substituting "+" (and I decided "=" as well) with other "cookie safe" characters, before returning the encoded value to the page, for use as a cookie. Note that the length of the string being processed doesn't change. When the same (or another page on the site) runs my PHP script again, I'll be able to recover this cookie without missing characters. I just have to remember to pass the cookie back through the same fix64() call I created, and from there I can decode it with the usual base64_decode(), followed by whatever other decryption in your scheme.
There may be some setting I could make in PHP that allows base64 strings used in cookies to be transferred back to PHP without corruption. In the meantime this works. The "+" may be a "legal" cookie value, but if you have any desire to be able to transmit such a string back to PHP (in my case via the $_COOKIE array), I'm suggesting re-processing to remove offending characters, and restoring them after recovery. There are plenty of other "cookie safe" characters to choose from.
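If the cookie is written or read on the JavaScript side, the same character swap can be sketched in a few lines. This is only an illustrative port of the fix64() idea above, not part of the original PHP; the SWAP table name is made up. Because the mapping is its own inverse, the same function both protects and restores the string.

```javascript
// Swap the base64 characters that get mangled for other cookie-safe ones.
// Applying the function twice returns the original string.
const SWAP = { "+": "*", "*": "+", "=": ":", ":": "=" };

function fix64(text) {
  return text.replace(/[+*=:]/g, (ch) => SWAP[ch]);
}

console.log(fix64("ab+c=="));        // ab*c::
console.log(fix64(fix64("ab+c=="))); // ab+c==
```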
If you are using the variables later, you'll find that attributes like path will let accented characters through, but the result won't actually match the browser path. For that you need to URI-encode them. For example:
const encodedPath = encodeURI(myPath);
document.cookie = `use_pwa=true; domain=${location.host}; path=${encodedPath};`
So the "allowed" chars, might be more than what's in the spec. But you should stay within the spec, and use URI-encoded strings to be safe.
Years ago MSIE 5 or 5.5 (and probably both) had some serious issue with a "-" in the HTML block, if you can believe it. Although it's not directly related, ever since then we've stored an MD5 hash (containing letters and numbers only) in the cookie and used it to look up everything else in a server-side database.
I ended up using
cookie_value = encodeURIComponent(my_string);
and
my_string = decodeURIComponent(cookie_value);
That seems to work for all kinds of characters. I had weird issues otherwise, even with characters that weren't semicolons or commas.
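A slightly fuller sketch of that round trip, written as pure functions so it runs outside a browser. The function names are hypothetical. The parser splits each pair on the first = only and falls back to the raw text when a value is not valid percent-encoding, since a cookie may have been set by code that did not encode it.

```javascript
// Decode if possible; otherwise keep the raw text.
function tryDecode(text) {
  try {
    return decodeURIComponent(text);
  } catch (err) {
    return text;
  }
}

// Produce a single "name=value" pair with both parts encoded.
function serializeCookie(name, value) {
  return `${encodeURIComponent(name)}=${encodeURIComponent(value)}`;
}

// Parse a string shaped like document.cookie ("a=1; b=2") into an object.
function parseCookieString(cookieString) {
  const result = {};
  for (const pair of cookieString.split(";")) {
    const trimmed = pair.trim();
    if (trimmed === "") continue;
    const eq = trimmed.indexOf("="); // split on the FIRST "=" only
    const rawName = eq === -1 ? "" : trimmed.slice(0, eq);
    const rawValue = eq === -1 ? trimmed : trimmed.slice(eq + 1);
    result[tryDecode(rawName)] = tryDecode(rawValue);
  }
  return result;
}

const pair = serializeCookie("user name", "a;b, c=d");
console.log(pair); // user%20name=a%3Bb%2C%20c%3Dd
console.log(parseCookieString(pair + "; theme=dark"));
```

In a browser, the serialized pair would be assigned to document.cookie and document.cookie would be passed to the parser.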
I am developing for javascript disabled phones. My code looks like this
<a href="someurl?var=a&lang=english">Link 1</a>
<a href="someurl?lang=english&var=a">Link 2</a>
But the browser interprets the URLs as:
someurl?var=a%e2%8c%a9=english (Link 1, incorrect)
someurl?lang=english&var=a (Link 2 works just fine !)
It seems like &lang=english is being converted to a%e2%8c%a9=english
Could someone explain why this is happening?
In HTML, the & character represents the start of a character reference.
If you try to specify an invalid character reference, then browsers will perform error recovery and treat it as an ampersand instead.
From the HTML DTD:
<!ENTITY lang CDATA "&#9001;" -- left-pointing angle bracket = bra,
U+2329 ISOtech -->
… so &lang is not an invalid character reference.
To include an ampersand character as data, use the character reference for an ampersand: &amp;
By HTML 4.01 rules, the &lang entity reference denotes the character U+2329 LEFT-POINTING ANGLE BRACKET “〈”. In UTF-8 encoding, that character is represented as 0xE2 0x8C 0xA9, and therefore in a URL it gets %-encoded as %e2%8c%a9.
Nowadays, most browsers don’t work that way. Specifically, in a URL, the reference &lang is not recognized when followed by an equals sign = (even though it is valid HTML 4.01 in that context).
To deal with browsers that may follow the old rules, as well as in order to comply with syntax rules independently of HTML version, escape each occurrence of the ampersand “&” as &amp;. It is safest to do this for all occurrences of “&” as a data character, in attribute values and elsewhere.
Depending on the server-side software that processes the URL once the link has been followed, you might be able to use an unproblematic character like “;” instead of “&” as a separator.
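When the markup is generated by a script, the escaping can be applied mechanically. The sketch below uses a hypothetical escapeHtmlAttribute helper; it assumes the attribute is delimited by double quotes and escapes the ampersand first so that the other replacements are not double-escaped.

```javascript
// Hypothetical helper: makes text safe inside a double-quoted HTML attribute.
function escapeHtmlAttribute(text) {
  return text
    .replace(/&/g, "&amp;") // must run first
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

const href = "someurl?var=a&lang=english";
const link = `<a href="${escapeHtmlAttribute(href)}">Link 1</a>`;
console.log(link);
// <a href="someurl?var=a&amp;lang=english">Link 1</a>
```

The browser turns &amp; back into & when it parses the attribute, so the server still receives someurl?var=a&lang=english.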
http://www.htmlhelp.com/tools/validator/problems.html#amp (linked by w3 from http://validator.w3.org/docs/help.html) explains it.
& marks the start of a so-called entity. Entities are, for example, &euro; (€) and &lt; (<).
If you now put &lang in the URL, this throws an error in any validator, because it's not a valid entity. The browser then escapes this sequence.
Solution:
You have to escape the & by its own entity, &amp;, so the URL will look like:
<a href="someurl?var=a&amp;lang=english">Link 1</a>
My problem at hand is that I need a variable that will keep track of all of my cookies so that I can split up the string in that variable into an array and then parse the string from there. I am wondering why the simple following code is not doing that for me?
var count = 0; //keeps track of how many times this page has been visited
var lastVisit = new Date(); //records the last visit date in UTC format (or extra-challenge: in a user-friendly format like "Tuesday 10/12/2013 at 9:34:50")
var exDate = new Date(lastVisit.getTime() + 30000);
var savedData = decodeURI(document.cookie); //contains cookie contents
document.cookie = encodeURI("count=" + count.toString() + "; expires=" + exDate.toUTCString());
What I need to happen is whenever I set a cookie for it to be added to the savedData variable, I cannot figure out why this is not happening. Thank you
You're not supposed to encode 'expires='. It will turn into 'expires%3D', which is not what you want.
In addition to that, it might be a bad idea to use 'encodeURI', because it does not encode ';' and ',' as required.
You can use encodeURIComponent for encoding the cookie value, but it would be technically correct to use escape() to encode the cookie value.
So...
document.cookie = "count=" + encodeURIComponent(count.toString()) + "; expires=" + exDate.toUTCString();
...should do what you want.
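The same idea as a small reusable sketch: only the name and value go through encodeURIComponent, while the attribute keyword and the date are appended as-is. buildCookie is a hypothetical helper name, and the fixed date is used only so the output is reproducible.

```javascript
// Hypothetical helper: encodes name and value only, never the attributes.
function buildCookie(name, value, expires) {
  const pair = `${encodeURIComponent(name)}=${encodeURIComponent(String(value))}`;
  return `${pair}; expires=${expires.toUTCString()}`;
}

// A fixed date (9 June 2021, 10:18:14 UTC) chosen for a stable example.
const exampleDate = new Date(Date.UTC(2021, 5, 9, 10, 18, 14));
const cookie = buildCookie("count", 0, exampleDate);
console.log(cookie);
// count=0; expires=Wed, 09 Jun 2021 10:18:14 GMT
```

In the page itself the returned string would be assigned to document.cookie.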
The cookie consists of several parts; we're mostly interested in the name, the value and the expiry date.
(End of official answer)
Let's clear up the confusion on encoding cookies
If ever in doubt, consult the RFC; do not just pick anything you find on the Web that seems to work.
The cookie-name is of type token, which means that only these values are allowed within it:
0x21, 0x23-0x27, 0x2A-0x2B, 0x2D-0x2E, 0x30-0x39, 0x41-0x5A, 0x5E-0x7A, 0x7C and 0x7E.
In other words: The following values should be percent-encoded:
0x00-0x20, '"', '(', ')', ',', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\', ']', '{', '}' and 0x7F-0xFF.
The cookie-value is of type cookie-octet, which means that only these values are allowed within it:
0x21, 0x23-0x2B, 0x2D-0x3A, 0x3C-0x5B, 0x5D-0x7E.
In other words: The following values should be percent-encoded:
0x00-0x20, 0x22, ',', ';', '\' and 0x7F-0xFF.
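The cookie-value rule can be turned into a minimal encoder that percent-encodes only the bytes outside the cookie-octet ranges. This is a sketch with two assumptions beyond the list above: the text is first converted to UTF-8, and the % sign itself is also encoded so that decoding stays unambiguous. encodeCookieValue is a hypothetical name; TextEncoder is assumed to be available (modern browsers and Node.js).

```javascript
// Percent-encode only what is not a legal cookie-octet (plus "%" itself).
function encodeCookieValue(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes
  let out = "";
  for (const byte of bytes) {
    const isCookieOctet =
      byte === 0x21 ||
      (byte >= 0x23 && byte <= 0x2b) ||
      (byte >= 0x2d && byte <= 0x3a) ||
      (byte >= 0x3c && byte <= 0x5b) ||
      (byte >= 0x5d && byte <= 0x7e);
    if (isCookieOctet && byte !== 0x25) {
      out += String.fromCharCode(byte);
    } else {
      out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

const encoded = encodeCookieValue('{"a":"x; y"}');
console.log(encoded);                     // {%22a%22:%22x%3B%20y%22}
console.log(decodeURIComponent(encoded)); // {"a":"x; y"}
```

Because the output is ordinary percent-encoded UTF-8, decodeURIComponent can read it back, and it stays shorter than a full encodeURIComponent pass for JSON-like values.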
Now, the expiry date is encoded using toUTCString(), as you're correctly doing.
The result looks something like this: Wed, 09 Jun 2021 10:18:14 GMT
So it will contain a comma. BUT! You're not supposed to encode anything, except the cookie-name and cookie-value strings.
Note: W3Schools says that escape() was deprecated in JavaScript 1.5, but it is technically incorrect to use encodeURI() or encodeURIComponent() for cookies. It is technically correct to use escape() for cookies.
RFC 6265 section 5.4 clearly states:
NOTE: Despite its name, the cookie-string is actually a sequence of
octets, not a sequence of characters. To convert the cookie-string
(or components thereof) into a sequence of characters (e.g., for
presentation to the user), the user agent might wish to try using the
UTF-8 character encoding [RFC3629] to decode the octet sequence.
This decoding might fail, however, because not every sequence of
octets is valid UTF-8.
As decodeURI() and decodeURIComponent() are for Unicode strings and choke on byte values between 0x80 and 0xFF that do not form valid UTF-8, they cannot safely be used.
On the other hand, unescape() is not for strings but for 8-bit byte sequences (octets), which works only if your byte sequences do not contain Unicode characters.
If your cookie value contains Unicode characters, you should, however, use encodeURIComponent()/decodeURIComponent(), but you should also catch any exceptions, because the server may not send you exactly what you expect to receive.
Most browsers also support btoa for encoding in Base64 and atob for decoding Base64. All characters in Base64 are legal cookie-octets (when interpreted in ASCII or UTF-8). So you can (at the cost of additional storage space) store a Base64 encoding of your persistent value as the cookie value, encoding (and decoding) as you persist (and retrieve) the value.
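A sketch of that approach, with two assumptions that go beyond the paragraph above: the value is JSON-serialized and converted to UTF-8 first (btoa only accepts characters up to 0xFF), and the URL-safe Base64 alphabet is used so that no +, / or = ends up in the cookie. The function names are hypothetical; btoa, atob, TextEncoder and TextDecoder are assumed to be available as globals (browsers and recent Node.js).

```javascript
// Encode any JSON-serializable value as URL-safe Base64 without padding.
function toCookieBase64(value) {
  const bytes = new TextEncoder().encode(JSON.stringify(value));
  let binary = "";
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return btoa(binary).replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

// Reverse of toCookieBase64: restore the alphabet and padding, then decode.
function fromCookieBase64(text) {
  const base64 = text.replace(/-/g, "+").replace(/_/g, "/");
  const padded = base64 + "=".repeat((4 - (base64.length % 4)) % 4);
  const bytes = Uint8Array.from(atob(padded), (ch) => ch.charCodeAt(0));
  return JSON.parse(new TextDecoder().decode(bytes));
}

const stored = toCookieBase64({ name: "ståle", visits: 3 });
console.log(stored);
console.log(fromCookieBase64(stored));
```

The encoded form is roughly a third longer than the input, which matters given the per-cookie size limit.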
In particular, when saving a JSON to the cookie is it safe to just save the raw value?
The reason I don't want to encode is that the JSON has small values and keys but a complex structure, so encoding (replacing all the ", : and {}) greatly increases the string length.
if your values contain "JSON characters" (e.g. comma, quotes, [] etc) then you should probably use encodeURIComponent so these get escaped and don't break your code when reading the values back.
You can convert your JSON object to a string using the JSON.stringify() method then save it in a cookie.
Note that browsers limit each cookie to roughly 4096 bytes.
If your JSON string is valid there should be no need to encode it.
e.g.
JSON.stringify({a:'foo"bar"',bar:69});
// => '{"a":"foo\"bar\"","bar":69}' (strings in valid JSON are escaped)
This is documented very well on MDN
To avoid unexpected requests to the server, you should call encodeURIComponent on any user-entered parameters that will be passed as part of a URI. For example, a user could type "Thyme &time=again" for a variable comment. Not using encodeURIComponent on this variable will give comment=Thyme%20&time=again. Note that the ampersand and the equal sign mark a new key and value pair. So instead of having a POST comment key equal to "Thyme &time=again", you have two POST keys, one equal to "Thyme " and another (time) equal to again.
If you can't be certain that your JSON will not include reserved characters such as ; then you will want to perform escaping on any strings being stored as a cookie. RFC 6265 covers special characters that are not allowed in the cookie-name or cookie-value.
If you are encoding static content you control, then this escaping may be unnecessary. If you are encoding dynamic content such as encoding user generated content, you probably need escaping.
MDN recommends using encodeURIComponent to escape any disallowed characters.
You can pull in a library such as cookie to handle this for you, but if your server is written in another language you will need to ensure it uses a library or language utilities to encodeURIComponent when setting cookies and to decodeURIComponent when reading cookies.
JSON.stringify is not sufficient as illustrated by this trivial example:
const stringified = JSON.stringify({ "description": "foo; bar; baz" });
document.cookie = `bio=${stringified}`;
// Notice that the content after the first `;` is dropped.
// Attempting to JSON.parse this later will fail.
console.log(document.cookie) // bio={"description":"foo
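For contrast, here is a minimal sketch of the escaped variant, kept free of document.cookie so it can run anywhere. The variable names are illustrative. After encodeURIComponent the value contains no raw semicolons, so nothing is cut off, and decoding restores the original object.

```javascript
const json = JSON.stringify({ description: "foo; bar; baz" });

// Escape the JSON before placing it in the cookie pair.
const cookiePair = `bio=${encodeURIComponent(json)}`;
console.log(cookiePair);
// bio=%7B%22description%22%3A%22foo%3B%20bar%3B%20baz%22%7D

// Reading it back: take everything after the first "=", decode, then parse.
const rawValue = cookiePair.slice(cookiePair.indexOf("=") + 1);
const restored = JSON.parse(decodeURIComponent(rawValue));
console.log(restored.description); // foo; bar; baz
```

The price is length: every quote, colon and brace becomes three characters, which is the trade-off the question is worried about.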
Cookie: name=value; name2=value2
Spaces are part of the cookie separation in the HTTP Cookie header. Raw spaces in cookie values could thus confuse the server.