The reason for this "escapes" me.
JSON escapes the forward slash, so a hash {a: "a/b/c"} is serialized as {"a":"a\/b\/c"} instead of {"a":"a/b/c"}.
Why?
JSON doesn't require you to do that; it allows you to do that. It also allows you to use "\u0061" for "a", but it's not required, as Harold L points out:
The JSON spec says you CAN escape forward slash, but you don't have to.
Harold L answered Oct 16 '09 at 21:59
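A quick check in Python (chosen here only for illustration, using nothing beyond the standard json module) shows both halves of that point: a parser accepts \/ and \u0061, while a serializer is free to leave the slash alone.

```python
import json

# An escaped slash and a \u escape both decode to the plain character.
decoded_slash = json.loads(r'"a\/b\/c"')
decoded_letter = json.loads(r'"\u0061"')

# Python's encoder happens to leave the forward slash unescaped.
encoded = json.dumps("a/b/c")

print(decoded_slash, decoded_letter, encoded)
```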
Allowing \/ helps when embedding JSON in a <script> tag, which doesn't allow </ inside strings, like Seb points out:
This is because HTML does not allow a string inside a <script> tag to contain </, so in case that substring's there, you should escape every forward slash.
Seb answered Oct 16 '09 at 22:00 (#1580667)
Some of Microsoft's ASP.NET Ajax/JSON APIs use this loophole to add extra information, e.g., a datetime will be sent as "\/Date(milliseconds)\/". (Yuck)
I asked the same question some time ago and had to answer it myself. Here's what I came up with:
It seems, my first thought [that it comes from its JavaScript
roots] was correct.
'\/' === '/' in JavaScript, and JSON is valid JavaScript. However,
why are the other ignored escapes (like \z) not allowed in JSON?
The key for this was reading
http://www.cs.tut.fi/~jkorpela/www/revsol.html, followed by
http://www.w3.org/TR/html4/appendix/notes.html#h-B.3.2. The feature of
the slash escape allows JSON to be embedded in HTML (as SGML) and XML.
PHP escapes forward slashes by default which is probably why this appears so commonly. I suspect it's because embedding the string "</script>" inside a <script> tag is considered unsafe.
Example:
<script>
var searchData = <?= json_encode(['searchTerm' => $_GET['search'], ...]) ?>;
// Do something else with the data...
</script>
Based on this code, an attacker could append this to the page's URL:
?search=</script> <some attack code here>
Which, if PHP's protection was not in place, would produce the following HTML:
<script>
var searchData = {"searchTerm":"</script> <some attack code here>"};
...
</script>
Even though the closing script tag is inside a string, it will cause many (most?) browsers to exit the script tag and interpret the items following as valid HTML.
With PHP's protection in place, it will appear instead like this, which will NOT break out of the script tag:
<script>
var searchData = {"searchTerm":"<\/script> <some attack code here>"};
...
</script>
This functionality can be disabled by passing in the JSON_UNESCAPED_SLASHES flag but most developers will not use this since the original result is already valid JSON.
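For comparison, here is a minimal Python sketch of the same idea. Python's json.dumps does not escape slashes, so the helper below (dumps_escaped_slashes is a made-up name, not part of any library) imitates PHP's default by post-processing the output. A forward slash can only occur inside a JSON string, so the blanket replace does not touch the structure.

```python
import json

def dumps_escaped_slashes(value):
    """Serialize to JSON, then escape every forward slash as \\/ (hypothetical helper)."""
    return json.dumps(value).replace("/", "\\/")

payload = {"searchTerm": "</script> <some attack code here>"}
embedded = dumps_escaped_slashes(payload)

# The text that would be written into the page no longer contains "</script".
print(embedded)

# The escaped form still parses back to the original data.
print(json.loads(embedded) == payload)
```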
Yes, some JSON utility libraries do it for various good but mostly legacy reasons. But then they should also offer something like a setEscapeForwardSlashAlways method to turn this behaviour OFF.
In Java, org.codehaus.jettison.json.JSONObject does offer a method called
setEscapeForwardSlashAlways(boolean escapeForwardSlashAlways)
to switch this default behaviour off.
Related
Is there a known XSS or other attack that makes it past a
$content = "some HTML code";
$content = strip_tags($content);
echo $content;
?
The manual has a warning:
This function does not modify any attributes on the tags that you allow using allowable_tags, including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users.
but that is related to using the allowable_tags parameter only.
With no allowed tags set, is strip_tags() vulnerable to any attack?
Chris Shiflett seems to say it's safe:
Use Mature Solutions
When possible, use mature, existing solutions instead of trying to create your own. Functions like strip_tags() and htmlentities() are good choices.
is this correct? Please if possible, quote sources.
I know about HTML purifier, htmlspecialchars() etc.- I am not looking for the best method to sanitize HTML. I just want to know about this specific issue. This is a theoretical question that came up here.
Reference: strip_tags() implementation in the PHP source code
As its name may suggest, strip_tags should remove all HTML tags. The only way we can prove it is by analyzing the source code. The following analysis applies to a strip_tags('...') call, without a second argument for whitelisted tags.
First of all, some theory about HTML tags: a tag starts with a < followed by non-whitespace characters. If this string starts with a ?, it should not be parsed. If this string starts with !--, it is considered a comment and the following text should not be parsed either. A comment is terminated with -->; inside such a comment, characters like < and > are allowed. Attributes can occur in tags, and their values may optionally be surrounded by a quote character (' or "). If such a quote exists, it must be closed; otherwise, if a > is encountered, the tag is not closed.
The code text is interpreted in Firefox as:
text
The PHP function strip_tags is referenced in line 4036 of ext/standard/string.c. That function calls the internal function php_strip_tags_ex.
Two buffers exist, one for the output, the other for "inside HTML tags". A counter named depth holds the number of open angle brackets (<).
The variable in_q contains the quote character (' or ") if any, and 0 otherwise. The last character is stored in the variable lc.
The function has five states; three are mentioned in the description above the function. Based on this information and the function body, the following states can be derived:
State 0 is the output state (not in any tag)
State 1 means we are inside a normal html tag (the tag buffer contains <)
State 2 means we are inside a php tag
State 3: we came from the output state and encountered the < and ! characters (the tag buffer contains <!)
State 4: inside HTML comment
We just need to make sure that no tag can be inserted, that is, < followed by a non-whitespace character. Line 4326 checks for the < character, with the cases described below:
If inside quotes (e.g. <a href="inside quotes">), the < character is ignored (removed from the output).
If the next character is a whitespace character, < is added to the output buffer.
If outside an HTML tag, the state becomes 1 ("inside HTML tag") and the last character lc is set to <.
Otherwise, if inside an HTML tag, the counter named depth is incremented and the character is ignored.
If > is met while the tag is open (state == 1), in_q becomes 0 ("not in a quote") and state becomes 0 ("not in a tag"). The tag buffer is discarded.
Attribute checks (for characters like ' and ") are done on the tag buffer which is discarded. So the conclusion is:
strip_tags without a tag whitelist is safe for inclusion outside tags, no tag will be allowed.
By "outside tags", I mean in the text between tags rather than inside a tag, such as in an attribute value. The text may still contain < and >, though, as in >< a>>. The result is not valid HTML, however: <, > and & still need to be escaped, especially the &. That can be done with htmlspecialchars().
The description for strip_tags without a whitelist argument would be:
Makes sure that no HTML tag exists in the returned string.
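The state handling described above can be sketched roughly in Python. This is a simplified illustration that models only state 0 (output) and state 1 (inside a tag), plus the in_q and depth variables; it is not PHP's actual implementation, it ignores the PHP-tag and comment states, and strip_tags_sketch is a made-up name.

```python
def strip_tags_sketch(text):
    """Simplified model of the described state machine (states 0 and 1 only)."""
    out = []
    state = 0   # 0 = output state, 1 = inside an HTML tag
    depth = 0   # number of extra "<" seen while already inside a tag
    in_q = ""   # quote character currently open inside a tag, or ""
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == "<":
            if in_q:
                continue                # inside a quoted attribute: ignored
            if nxt.isspace():
                if state == 0:
                    out.append(ch)      # "< " cannot start a tag, keep it
                continue
            if state == 0:
                state = 1               # a tag starts here
            else:
                depth += 1              # nested "<" inside a tag
        elif ch == ">":
            if in_q:
                continue                # ">" inside quotes does not close the tag
            if depth:
                depth -= 1
            elif state == 1:
                state = 0               # tag closed, its contents are discarded
            else:
                out.append(ch)          # stray ">" in normal text
        elif ch in "'\"" and state == 1:
            if in_q == ch:
                in_q = ""               # closing quote
            elif not in_q:
                in_q = ch               # opening quote
        elif state == 0:
            out.append(ch)              # ordinary text outside any tag
    return "".join(out)

print(strip_tags_sketch('<a href="x>y">text</a>'))
print(strip_tags_sketch("a < b"))
```

Note how a > inside a quoted attribute value does not end the tag, and how a < followed by whitespace passes through to the output.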
I cannot predict future exploits, especially since I haven't looked at the PHP source code for this. However, there have been exploits in the past due to browsers accepting seemingly invalid tags (like <s\0cript>). So it's possible that in the future someone might be able to exploit odd browser behavior.
That aside, sending the output directly to the browser as a full block of HTML should never be insecure:
echo '<div>'.strip_tags($foo).'</div>'
However, this is not safe:
echo '<input value="'.strip_tags($foo).'" />';
because one could easily end the quote via " and insert a script handler.
I think it's much safer to always convert stray < into &lt; (and the same with quotes).
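As an illustration of escaping for the attribute context, here is a small Python sketch using the standard library's html.escape, which plays roughly the role that htmlspecialchars() plays in PHP. The input string is a made-up example of the quote-breaking payload described above.

```python
import html

# A value that would break out of a double-quoted attribute if left as-is.
user_input = '" onmouseover="alert(1)'

# quote=True also escapes double and single quotes, not just <, > and &.
safe_value = html.escape(user_input, quote=True)

tag = '<input value="' + safe_value + '" />'
print(tag)
```

With the quotes turned into entities, the value can no longer end the attribute and start a new one.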
According to this online tool, this string will be "perfectly" escaped, but
the result is another malicious one!
<<a>script>alert('ciao');<</a>/script>
In the string the "real" tags are <a> and </a>, since < and script> alone aren't tags.
I hope I'm wrong or that it's just because of an old version of PHP, but it's better to check in your environment.
YES, strip_tags() is vulnerable to scripting attacks, right through to (at least) PHP 8. Do not use it to prevent XSS. Instead, you should use filter_input().
The reason that strip_tags() is vulnerable is because it does not run recursively. That is to say, it does not check whether or not valid tags will remain after valid tags have been stripped. For example, the string
<<a>script>alert(XSS);<</a>/script> will strip the <a> tag successfully, yet fail to see this leaves
<script>alert(XSS);</script>.
This can be seen (in a safe environment) here.
Strip tags is perfectly safe - if all that you are doing is outputting the text to the html body.
It is not necessarily safe to put it into mysql or url attributes.
I would like to store a JSON's contents in a HTML document's source, inside a script tag.
The content of that JSON does depend on user submitted input, thus great care is needed to sanitise that string for XSS.
I've read about two concepts here on SO.
1. Replace all occurrences of </script with <\/script, or replace all </ with <\/, server side.
Code wise it looks like the following (using Python and jinja2 for the example):
// view
data = {
'test': 'asdas</script><b>as\'da</b><b>as"da</b>',
}
context_dict = {
'data_json': json.dumps(data, ensure_ascii=False).replace('</script', r'<\/script'),
}
// template
<script>
var data_json = {{ data_json | safe }};
</script>
// js
access it simply as window.data_json object
2. Encode the data as a HTML entity encoded JSON string, and unescape + parse it in client side. Unescape is from this answer: https://stackoverflow.com/a/34064434/518169
// view
context_dict = {
'data_json': json.dumps(data, ensure_ascii=False),
}
// template
<script>
var data_json = '{{ data_json }}'; // encoded into HTML entities, like &lt; &gt; &amp;
</script>
// js
function htmlDecode(input) {
var doc = new DOMParser().parseFromString(input, "text/html");
return doc.documentElement.textContent;
}
var decoded = htmlDecode(window.data_json);
var data_json = JSON.parse(decoded);
This method doesn't work because \" in a script source becomes " in a JS variable. Also, it creates a much bigger HTML document and is not really human readable, so I'd go with the first one if it doesn't mean a huge security risk.
Is there any security risk in using the first version? Is it enough to sanitise a JSON encoded string with .replace('</script', r'<\/script')?
Reference on SO:
Best way to store JSON in an HTML attribute?
Why split the <script> tag when writing it with document.write()?
Script tag in JavaScript string
Sanitize <script> element contents
Escape </ in script tag contents
Some great external resources about this issue:
Flask's tojson filter's implementation source
Rails' json_escape method's help and source
A 5 year long discussion in Django ticket and proposed code
Here's how I dealt with the relatively minor part of this issue, the encoding problem with storing JSON in a script element. The short answer is you have to escape either < or / as together they terminate the script element -- even inside a JSON string literal. You can't HTML-encode entities for a script element. You could JavaScript-backslash-escape the slash. I preferred to JavaScript-hex-escape the less-than angle-bracket as \u003C.
.replace('<', r'\u003C')
I ran into this problem trying to pass the json from oembed results. Some of them contain script close tags (without mentioning Twitter by name).
json_for_script = json.dumps(data).replace('<', r'\u003C');
This turns data = {'test': 'foo </script> bar'}; into
'{"test": "foo \\u003C/script> bar"}'
which is valid JSON that won't terminate a script element.
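A self-contained version of that one-liner, as a sketch (json_for_script is just an illustrative helper name). HTML tag names are case-insensitive, so a literal .replace('</script', ...) would miss an input such as </SCRIPT>; escaping every < sidesteps that, and the result still round-trips through a JSON parser.

```python
import json

def json_for_script(value):
    """Serialize to JSON and hex-escape every "<" so no tag can start in the output."""
    return json.dumps(value).replace("<", "\\u003C")

# Upper-case closing tag and a comment opener, both of which contain "<".
data = {"test": "foo </SCRIPT> bar <!-- baz"}
embedded = json_for_script(data)

print(embedded)
print(json.loads(embedded) == data)
```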
I got the idea from this little gem inside the Jinja template engine. It's what's run when you use the {{data|tojson}} filter.
def htmlsafe_json_dumps(obj, dumper=None, **kwargs):
    """Works exactly like :func:`dumps` but is safe for use in ``<script>``
    tags. It accepts the same arguments and returns a JSON string. Note that
    this is available in templates through the ``|tojson`` filter which will
    also mark the result as safe. Due to how this function escapes certain
    characters this is safe even if used outside of ``<script>`` tags.

    The following characters are escaped in strings:

    - ``<``
    - ``>``
    - ``&``
    - ``'``

    This makes it safe to embed such strings in any place in HTML with the
    notable exception of double quoted attributes. In that case single
    quote your attributes or HTML escape it in addition.
    """
    if dumper is None:
        dumper = json.dumps
    rv = dumper(obj, **kwargs) \
        .replace(u'<', u'\\u003c') \
        .replace(u'>', u'\\u003e') \
        .replace(u'&', u'\\u0026') \
        .replace(u"'", u'\\u0027')
    return Markup(rv)
(You could use \x3C instead of \u003C and that would work in a script element because it's valid JavaScript. But might as well stick to valid JSON.)
First of all, your paranoia is well founded.
an HTML-parser could be tricked by a closing script tag (better assume by any closing tag)
a JS-parser could be tricked by backslashes and quotes (with a really bad encoder)
Yes, it would be much "safer" to encode all characters that could confuse the different parsers involved. Keeping it human-readable might be contradicting your security paradigm.
Note: The result of JSON string encoding should be canonical and, of course, not broken, that is, parsable. JSON is a subset of JS and should thus be JS-parsable without any risk. So all you have to do is make sure the HTML-Parser instance that extracts the JS-code is not tricked by your user data.
So the real pitfall is the nesting of both parsers. Actually, I would urge you to put something like that into a separate request. That way you would avoid that scenario completely.
Assuming all possible styles and error-corrections that could happen in such a parser it might be that other tags (open or close) might achieve a similar feat.
As in: suggesting to the parser that the script tag has ended implicitly.
So it is advisable to encode slash and all tag braces (/,<,>), not just the closing of a script-tag, in whatever reversible method you choose, as long as it would not confuse the HTML-Parser:
Best choice would be base64 (but you want more readable)
HTMLentities will do, although confusing humans :)
Doing your own escaping will work as well, just escape the individual characters rather than the </script fragment
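The base64 option could look roughly like this on the server side; a sketch in Python with made-up helper names. The base64 alphabet has no <, so nothing in the encoded text can look like a tag to the HTML parser. On the client, the string would then need to be decoded (for instance with atob) before JSON.parse.

```python
import base64
import json

def encode_for_script(value):
    """JSON-encode, then base64-encode, so the embedded text contains no "<"."""
    raw = json.dumps(value).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")

def decode_from_script(text):
    """Reverse of encode_for_script, shown here only to check the round trip."""
    return json.loads(base64.b64decode(text).decode("utf-8"))

data = {"test": "asdas</script><b>as'da</b>"}
encoded = encode_for_script(data)

print(encoded)
print(decode_from_script(encoded) == data)
```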
In conclusion, yes, it's probably best with a few changes, but please note that you will be one step away from "safe" already, by trying something like this in the first place, instead of loading the JSON via XHR or at least using a rigorous string encoding like base64.
P.S.: If you can learn from other people's code encoding the strings that's nice, but you should not resort to "libraries" or other people's functions if they don't do exactly what you need.
So rather write and thoroughly test your own (de/en)coder and know that this pitfall has been sealed.
I want to initialize a Javascript variable from some JSON (generated via Jackson) in my JSPX, something like this:
<script>
var x = <c:out value="${myJson}" />;
</script>
But the output I get looks like:
<script>
var x = {&quot;foo&quot;:&quot;bar&quot;};
</script>
I see what you did there, HTML-escaping the string. Obviously, I can't leave it completely unescaped because angle brackets in the data could break the page. But I don't really need all the quotes to be escaped, since I'm not putting the JSON within an attribute value, do I?
Now, this looks like it would be a perfectly valid way to write a script in HTML, just needlessly complicated (like, say, replacing spaces with ). As it turns out, it works just fine in XHTML, but with an HTML content type, I get an error, both in Firefox and IE. I'm not sure of the rationale, but that's how it is.
So, what's the best approach here? Do I really want to simply escape angle brackets but not escape double quotes, or are there any other gotchas? Is there a tag out there that would replace c:out (I know there are Spring tags for escaping Javascript, but that's still not the right kind of escaping)? How do people get this to work?
BTW, yes, I could make a separate AJAX call, but an extra round trip just to work around this problem seems silly.
UPDATE
I had a lot to learn about CDATA vs. PCDATA and how HTML is different from XHTML. Here I thought JSPX would make polyglot markup easy, but it turns out to be, as someone put it, a big ball of nasty.
For HTML, the <script> element has a CDATA content model (not to be confused with CDATA sections), which means nothing can be escaped, but </ must be absolutely avoided.
In the special case of JSON, where end tags can only occur within a quoted string, this therefore means the safe way to escape is to use Javascript (rather than HTML) escaping and replace </ with <\/.
For XHTML (if you care about such things) on the other hand, you just XML-escape everything as usual (& becomes &amp;, etc.) and it all works beautifully. A compatible solution would have to use CDATA with guarding comments (<!--/*--><![CDATA[/*><!--*/ etc.) around the entire <script> body and then escape any occurrences of ]]> within the JSON; furthermore, I'd still escape </ too just to be safe. Big ball of nasty, indeed, but at least it can be automated.
set escapeXml=false
<c:out value="${myJson}" escapeXml="false"/>
OK, answering my own question here, after much research and no real help.
Based on my "update" above, the most straightforward way targeting HTML is just:
<script>
var x = ${fn:replace(myJson, "\</", "\<\\/")};
</script>
Ugly but simple.
This will not yield valid XML or XHTML, unfortunately. If you really need that, the original c:out will work fine, though it will not yield valid HTML. And if you really need a single solution to work on both, you probably need a custom taglib (or TAGX) that will either switch on the content type or do all of the following:
wrap the script body in a comment-guarded CDATA section
replace each </ with <\/
replace each ]]> with ]]\>
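The two replacements can be sketched outside JSP as well; the version below is in Python and the function name is made up. One deliberate difference: instead of ]]\>, which is fine in a JavaScript literal but not a valid JSON escape, it writes the > as \u003e so that the output stays parsable as strict JSON too. Wrapping the result in the comment-guarded CDATA section is left to the template.

```python
import json

def json_for_polyglot_script(value):
    """Escape the sequences that end a script element or a CDATA section."""
    text = json.dumps(value)
    text = text.replace("</", "<\\/")          # would close the <script> element in HTML
    text = text.replace("]]>", "]]\\u003e")    # would close the CDATA section in XHTML
    return text

data = {"a": "</script>", "b": "x]]>y"}
out = json_for_polyglot_script(data)

print(out)
print(json.loads(out) == data)
```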
What exactly needs to be escaped in javascript strings. Or, more specifically, why does
var snaphtml = '<script src="http://seadragon.com/embed/lxe.js?width=auto&height=400px"></script>';
give a syntax error? Escaping the final <\/script> seems to fix the syntax error, but this doesn't make sense to me as a javascript beginner.
The problem may be that the web browser sees the "</script>" sequence and decides that's the end of the script block.
Another way to fix the problem aside from using an escape sequence like you did is to break it apart into 2 strings that are concatenated:
"<" + "/script>"
The behavior you're seeing isn't a bug on the part of the browser.
Browsers don't "look inside" a script block, they just pass the content to the script engine. The "</script>" sequence is how they know they've come to the end of the block, and since the browser doesn't interpret the contents of the block, it has no way to know that it's in the context of a literal string in the script code.
Remember that browsers can support more script languages than just Javascript, even if it's not commonly seen. Internet Explorer supports VBscript (and I think any scripting language that can be run by a windows script host, but I'm not sure about that). And when the ability to have script blocks was put into browsers way back when, no one could be sure that Javascript would end up being so universal.
You're actually running into an html-escaping issue: the browser interprets </script> in your string as the close-tag for the script element in which your javascript is embedded -- so to the browser, your line looks like it's missing the close single-quote:
var snaphtml = '<script src="http://seadragon.com/embed/lxe.js?width=auto&height=400px">
To fix it, as you've found, you just need to change </script> to anything else, like <\/script> or \074/script>, etc.
The only characters you normally need to worry about in a javascript string are double-quote (if you're quoting the string with a double-quote), single-quote (if you're quoting the string with a single-quote), backslash, carriage-return (\r), or linefeed (\n).
The HTML parser will interpret the end of tag token (ETAGO delimiter </) in your string, as the end of the current script tag, giving you the unterminated string SyntaxError.
There are several workarounds, including the use of CDATA blocks, but the simplest way is to escape that character, or make a string concatenation:
var snaphtml = '<script src="...">\x3C/script>';
var snaphtml = '<script src="..."><' + '/script>';
And of course, you can also create your script element programmatically and append it to the head:
var newScript = document.createElement("script");
newScript.src = "...";
//...
See Everything you always wanted to know about </, aka. the end-tag open (ETAGO) delimiter for a detailed explanation. TL;DR there’s no need for crazy hacks like string concatenation or char literal escapes — just escape it as such:
var snaphtml = '<\/script>';
Also, note that this is only necessary for inline scripts.
The host that the majority of my script's users are on forces a text ad at the end of every page. This code is sneaking into my script's AJAX responses. It's an HTML comment, followed by a link to their signup page. How can I strip this comment and link from the end of my AJAX responses?
Typically those scripts basically look for text/html content and just shove the code into the stream. Have you tried setting the content type to something else such as text/json, text/javascript, text/plain and see if it gets by without the injection?
Regular Expressions
My first suggestion would be to find a regular expression that can match and eliminate that trailing information. I'm not the greatest at writing regular expressions but here's an attempt:
var response = "I am the data you want. <strong>And nothing more</strong> <!-- haha -> <a href='google.com'>Sucker!</a>";
var myStuff = response.replace(/\s+?<!--.*>$/gi, ""); // a regex literal; a quoted string would be matched as literal text
Custom Explosion String
What would be an easy and quick solution would be to place a string at the end of your message ("spl0de!"), and then split the ajax response on that, and only handle that which comes before it.
var myStuff = response.split("spl0de!")[0];
This would remove anything anybody else sneaks onto the end of your data.
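For a client that is not a browser, the same two ideas might look like this in Python. It is only a sketch: the function names are made up, the sentinel is the one from the example above, and the regular expression assumes the injected ad always starts with an HTML comment that the real payload never contains.

```python
import re

SENTINEL = "spl0de!"

def strip_trailing_comment(response):
    """Drop everything from the first "<!--" (and the whitespace before it) to the end."""
    return re.sub(r"\s*<!--.*\Z", "", response, flags=re.DOTALL)

def cut_at_sentinel(response):
    """Keep only what comes before the sentinel the server appended."""
    return response.split(SENTINEL, 1)[0]

injected = "data <!-- ad --> <a href='x'>Sign up</a>"
print(strip_trailing_comment(injected))
print(cut_at_sentinel("data" + SENTINEL + "<!-- ad -->"))
```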
You see a lot of this with hand-generated XML: it isn't valid, so consumers try to fix up the broken XML with hand-rolled regexes. It's completely the wrong approach.
You need to fix this at the source, at the broken host.