Escape with Unicode Code Points similar to PHP in Javascript

Escape with Unicode Code Points similar to PHP in Javascript - javascript

When using PHP to generate JSON, it encodes higher characters using the \u0123 code-point notation.
(I know this not necessary, but for unnamed reasons I want that.)
I am trying to achieve the same in JavaScript. I searched high and low and found nothing. The encodeUri function does nothing for me (even though many suggested that it would).
Any helpful hints? I hope that I do not have to use some big external library but rather something build-in or a few lines of nice code - this can not be this hard, can it...?!
I have an input string in the form of:
var stringVar = 'Hällö Würld.';
My desired conversion would give me something like:
var resultStringVar = 'H\u00e4ll\u00f6 W\u00fcrld.';

I’ve made a library for just that called jsesc. From its README:
This is a JavaScript library for escaping JavaScript strings while generating the shortest possible valid ASCII-only output. Here’s an online demo.
This can be used to avoid mojibake and other encoding issues, or even to avoid errors when passing JSON-formatted data (which may contain U+2028 LINE SEPARATOR, U+2029 PARAGRAPH SEPARATOR, or lone surrogates) to a JavaScript parser or an UTF-8 encoder, respectively.
For your specific example, use it as follows:
var stringVar = 'Hällö Würld.';
var resultStringVar = jsesc(stringVar, { 'json': true, 'wrap': false });

Related

How can I replace some calls to JavaScript's eval() with Ext.decode()?

We are trying to get rid of all of our eval() calls in our JavaScript. Unfortunately, I am not much of a JavaScript programmer, and I need some help.
Many of our eval() calls operate on strings, outputs from a web service, that are very JSON-like, for example, we might eval the following string:
ClassMetaData['Asset$Flex'] = {
fields: {
}
,label: 'Flex Fields'
};
I've seen various suggestions on the Internet suggesting Ext.decode(). The documentation for it says - "Decodes (parses) a JSON string to an object. If the JSON is invalid, this function throws a SyntaxError unless the safe option is set." The string that I am supplying as an argument isn't legitimate JSON as I understand it (the field names aren't quoted), but Ext.decode() nearly works for me anyway. If I decode the above string, I get an error (why?) - "Uncaught SyntaxError: Unexpected token ;". However, if I remove the trailing semi-colon, and decode, everything seems to be fine.
I am using the following code to determine whether the decode call and the eval call do the same thing:
var evaled = eval(inputString);
var decoded = Ext.decode(inputString.replace(";", "")); // remove trailing ";", if any
console.log("Equal? - " + (JSON.stringify(decoded) == JSON.stringify(evaled)));
Unfortunately, this is not a very good solution. For example, some of the input strings to eval are fairly complex. They may have all sorts of embedded characters - semicolons, HTML character encodings, etc. Decode may complain about some other syntax problem, besides semicolons at the end, and I haven't found a good way to determine where the problem is that decode objects to. (It doesn't say "illegal character in position 67", for example.)
My questions:
Could we, with a small amount of work, create a generic solution
using decode?
Is there an easy way to convert our JSON-like input
into true JSON?
Is there a better way of comparing the results of
eval and decode than JSON.stringify(decoded) == JSON.stringify(evaled)?

How to insert arbitrary JSON in HTML's script tag

I would like to store a JSON's contents in a HTML document's source, inside a script tag.
The content of that JSON does depend on user submitted input, thus great care is needed to sanitise that string for XSS.
I've read two concept here on SO.
1. Replace all occurrences of the </script tag into <\/script, or replace all </ into <\/ server side.
Code wise it looks like the following (using Python and jinja2 for the example):
// view
data = {
'test': 'asdas</script><b>as\'da</b><b>as"da</b>',
}
context_dict = {
'data_json': json.dumps(data, ensure_ascii=False).replace('</script', r'<\/script'),
}
// template
<script>
var data_json = {{ data_json | safe }};
</script>
// js
access it simply as window.data_json object
2. Encode the data as a HTML entity encoded JSON string, and unescape + parse it in client side. Unescape is from this answer: https://stackoverflow.com/a/34064434/518169
// view
context_dict = {
'data_json': json.dumps(data, ensure_ascii=False),
}
// template
<script>
var data_json = '{{ data_json }}'; // encoded into HTML entities, like < > &
</script>
// js
function htmlDecode(input) {
var doc = new DOMParser().parseFromString(input, "text/html");
return doc.documentElement.textContent;
}
var decoded = htmlDecode(window.data_json);
var data_json = JSON.parse(decoded);
This method doesn't work because \" in a script source becames " in a JS variable. Also, it creates a much bigger HTML document and also is not really human readable, so I'd go with the first one if it doesn't mean a huge security risk.
Is there any security risk in using the first version? Is it enough to sanitise a JSON encoded string with .replace('</script', r'<\/script')?
Reference on SO:
Best way to store JSON in an HTML attribute?
Why split the <script> tag when writing it with document.write()?
Script tag in JavaScript string
Sanitize <script> element contents
Escape </ in script tag contents
Some great external resources about this issue:
Flask's tojson filter's implementation source
Rail's json_escape method's help and source
A 5 year long discussion in Django ticket and proposed code

Here's how I dealt with the relatively minor part of this issue, the encoding problem with storing JSON in a script element. The short answer is you have to escape either < or / as together they terminate the script element -- even inside a JSON string literal. You can't HTML-encode entities for a script element. You could JavaScript-backslash-escape the slash. I preferred to JavaScript-hex-escape the less-than angle-bracket as \u003C.
.replace('<', r'\u003C')
I ran into this problem trying to pass the json from oembed results. Some of them contain script close tags (without mentioning Twitter by name).
json_for_script = json.dumps(data).replace('<', r'\u003C');
This turns data = {'test': 'foo </script> bar'}; into
'{"test": "foo \\u003C/script> bar"}'
which is valid JSON that won't terminate a script element.
I got the idea from this little gem inside the Jinja template engine. It's what's run when you use the {{data|tojson}} filter.
def htmlsafe_json_dumps(obj, dumper=None, **kwargs):
"""Works exactly like :func:`dumps` but is safe for use in ``<script>``
tags. It accepts the same arguments and returns a JSON string. Note that
this is available in templates through the ``|tojson`` filter which will
also mark the result as safe. Due to how this function escapes certain
characters this is safe even if used outside of ``<script>`` tags.
The following characters are escaped in strings:
- ``<``
- ``>``
- ``&``
- ``'``
This makes it safe to embed such strings in any place in HTML with the
notable exception of double quoted attributes. In that case single
quote your attributes or HTML escape it in addition.
"""
if dumper is None:
dumper = json.dumps
rv = dumper(obj, **kwargs) \
.replace(u'<', u'\\u003c') \
.replace(u'>', u'\\u003e') \
.replace(u'&', u'\\u0026') \
.replace(u"'", u'\\u0027')
return Markup(rv)
(You could use \x3C instead of \u003C and that would work in a script element because it's valid JavaScript. But might as well stick to valid JSON.)

First of all, your paranoia is well founded.
an HTML-parser could be tricked by a closing script tag (better assume by any closing tag)
a JS-parser could be tricked by backslashes and quotes (with a really bad encoder)
Yes, it would be much "safer" to encode all characters that could confuse the different parsers involved. Keeping it human-readable might be contradicting your security paradigm.
Note: The result of JSON String encoding should be canoncical and OFC, not broken, as in parsable. JSON is a subset of JS and thus be JS parsable without any risk. So all you have to do is make sure the HTML-Parser instance that extracts the JS-code is not tricked by your user data.
So the real pitfall is the nesting of both parsers. Actually, I would urge you to put something like that into a separate request. That way you would avoid that scenario completely.
Assuming all possible styles and error-corrections that could happen in such a parser it might be that other tags (open or close) might achieve a similar feat.
As in: suggesting to the parser that the script tag has ended implicitly.
So it is advisable to encode slash and all tag braces (/,<,>), not just the closing of a script-tag, in whatever reversible method you choose, as long as long as it would not confuse the HTML-Parser:
Best choice would be base64 (but you want more readable)
HTMLentities will do, although confusing humans :)
Doing your own escaping will work as well, just escape the individual characters rather than the </script fragment
In conclusion, yes, it's probably best with a few changes, but please note that you will be one step away from "safe" already, by trying something like this in the first place, instead of loading the JSON via XHR or at least using a rigorous string encoding like base64.
P.S.: If you can learn from other people's code encoding the strings that's nice, but you should not resort to "libraries" or other people's functions if they don't do exactly what you need.
So rather write and thoroughly test your own (de/en)coder and know that this pitfall has been sealed.

Finding text strings in JavaScript

I have a large valid JavaScript file (utf-8), from which I need to extract all text strings automatically.
For simplicity, the file doesn't contain any comment blocks in it, only valid ES6 JavaScript code.
Once I find an occurrence of ' or " or `, I'm supposed to scan for the end of the text block, is where I got stuck, given all the possible variations, like "'", '"', "\'", '\"', '", `\``, etc.
Is there a known and/or reusable algorithm for detecting the end of a valid ES6 JavaScript text block?
UPDATE-1: My JavaScript file isn't just large, I also have to process it as a stream, in chunks, so Regex is absolutely not usable. I didn't want to complicate my question, mentioning joint chunks of code, I will figure that out myself, If I have an algorithm that can work for a single piece of code that's in memory.
UPDATE-2: I got this working initially, thanks to the many advises given here, but then I got stuck again, because of the Regular Expressions.
Examples of Regular Expressions that break any of the text detection techniques suggested so far:
/'/
/"/
/\`/
Having studied the matter closer, by reading this: How does JavaScript detect regular expressions?, I'm afraid that detecting regular expressions in JavaScript is a whole new ball game, worth a separate question, or else it gets too complicated. But I appreciate very much if somebody can point me in the right direction with this issue...
UPDATE-3: After much research I found with regret that I cannot come up with an algorithm that would work in my case, because presence of Regular Expressions makes the task incredibly more complicated than was initially thought. According to the following: When parsing Javascript, what determines the meaning of a slash?, determining the beginning and end of regular expressions in JavaScript is one of the most complex and convoluted tasks. And without it we cannot figure out when symbols ', '"' and ` are opening a text block or whether they are inside a regular expression.

The only way to parse JavaScript is with a JavaScript parser. Even if you were able to use regular expressions, at the end of the day they are not powerful enough to do what you are trying to do here.
You could either use one of several existing parsers, that are very easy to use, or you could write your own, simplified to focus on the string extraction problem. I hardly imagine you want to write your own parser, even a simplified one. You will spend much more time writing it and maintaining it than you might think.
For instance, an existing parser will handle something like the following without breaking a sweat.
`foo${"bar"+`baz`}`
The obvious candidates for parsers to use are esprima and babel.
By the way, what are you planning to do with these strings once you extract them?

If you only need an approximate answer, or if you want to get the string literals exactly as they appear in the source code, then a regular expression can do the job.
Given the string literal "\n", do you expect a single-character string containing a newline or the two characters backslash and n?
In the former case you need to interpret escape sequences exactly like a JavaScript interpreter does. What you need is a lexer for JavaScript, and many people have already programmed this piece of code.
In the latter case the regular expression has to recognize escape sequences like \x40 and \u2026, so even in that case you should copy the code from an existing JavaScript lexer.
See https://github.com/douglascrockford/JSLint/blob/master/jslint.js, function tokenize.

Try code below:
txt = "var z,b \n;z=10;\n b='321`1123`321321';\n c='321`321`312`3123`';"
function fetchStrings(txt, breaker){
var result = [];
for (var i=0; i < txt.length; i++){
// Define possible string starts characters
if ((txt[i] == "'")||(txt[i] == "`")){
// Get our text string;
textString = txt.slice(i+1, i + 1 + txt.slice(i+1).indexOf(txt[i]));
result.push(textString)
// Jump to end of fetched string;
i = i + textString.length + 1;
}
}
return result;
};
console.log(fetchStrings(txt));

Javascript eval fails to evaluate json when escape characters are involved

I am using rhino javascript engine to evaluate the json. The Json structure is as following :
{"DataName":"111","Id":"222","Date":"2015-12-31T00:00:00","TextValue":"{\"Id\":\"1\",\"Name\":\"Daugherty\",\"ContactName\":\"May C\",\"ContactEmail\":\"may.c#gamil.com\",\"Total\":25,\"Phone\":\"111-111-1111\",\"Type\":\"Daily\",\"Notes\":[{\"Comments\":\"One\",\"Date\":\"2014-11-27T00:00:00.000\"},{\"Comments\":\"Two\",\"Date\":\"2014-11-28T00:00:00.000\"}],\"ImportComplete\":true,\"RunComplete\":true,\"CompleteDate\":\"2014-07-31T00:00:00.000\",\"Amount\":2400.00,\"ProcessingComplete\":true}","NumberValue":4444.5555,"DateValue":"2014-12-01T00:00:00"}
Since I am using Rhino js engine I can't use JSON.parse and JSON.stringify.
As you can see the json has embedded json, this json I am getting from a .net web api which is putting the escape character '\'. I am trying to replace that escape character in javascript but no help.
Is there any way in javascript where we can replace that escape character and use 'eval()' to evaluate the json.
Here's the code that I am trying
var json = '{"DataName":"111","Id":"222","Date":"2015-12-31T00:00:00","TextValue":"{\"Id\":\"1\",\"Name\":\"Daugherty\",\"ContactName\":\"May C\",\"ContactEmail\":\"may.c#gamil.com\",\"Total\":25,\"Phone\":\"111-111-1111\",\"Type\":\"Daily\",\"Notes\":[{\"Comments\":\"One\",\"Date\":\"2014-11-27T00:00:00.000\"},{\"Comments\":\"Two\",\"Date\":\"2014-11-28T00:00:00.000\"}],\"ImportComplete\":true,\"RunComplete\":true,\"CompleteDate\":\"2014-07-31T00:00:00.000\",\"Amount\":2400.00,\"ProcessingComplete\":true}","NumberValue":4444.5555,"DateValue":"2014-12-01T00:00:00"}';
var find = '\"';
var regex = new RegExp(find,'g');
var inj = json.replace(regex,'"');
var pl = eval('(' + inj +')');

confusing backslashes
The problem you are getting is due to the fact of not fully understanding escape characters, when you are more than one level of "string" deep. Whilst a single slash is fine for one level i.e:
"It is no coincidence that in no known language does the " +
"phrase \"As pretty as an Airport\" appear.";
If you take that and then wrap it in outer quotes:
'"It is no coincidence that in no known language does the "' +
'"phrase \"As pretty as an Airport\" appear."';
The backslashes (if supported as escape characters by the system parsing the string) work for the outer-most wrapping quote, not any of the inner quotes/strings as they were before. This means once the js engine has parsed the string, internally the string will be.
'"It is no coincidence that in no known language does the phrase "As pretty as an Airport" appear."';
Which makes it impossible to tell the difference between the " and the \" from the original string. In order to get around this, you need to escape the backslashes in the original string, before you wrap it. This has the result of one level of escaping being used by the JavaScript engine, but still leaving another level remaining within the string. e.g.
'"It is no coincidence that in no known language does the "' +
'"phrase \\"As pretty as an Airport\\" appear."';
Now when the string is parsed, internally it will be:
'"It is no coincidence that in no known language does the phrase \"As pretty as an Airport\" appear."';
ignore the my random Douglas Adams quotes being separated onto more than one line (using +), I've only done that for ease of reading within a fix width area. I've kept it parsable by JavaScript, just in case people copy and paste and expect things to work.
So in order to fix your issue, your JSON source (before placing in the JavaScript code) will have to look like this:
var json = '{"DataName":"111","Id":"222","Date":"2015-12-31T00:00:00","TextValue":"{\\"Id\\":\\"1\\",\\"Name\\":\\"Daugherty\\",\\"ContactName\\":\\"May C\\",\\"ContactEmail\\":\\"may.c#gamil.com\\",\\"Total\\":25,\\"Phone\\":\\"111-111-1111\\",\\"Type\\":\\"Daily\\",\\"Notes\\":[{\\"Comments\\":\\"One\\",\\"Date\\":\\"2014-11-27T00:00:00.000\\"},{\\"Comments\\":\\"Two\\",\\"Date\\":\\"2014-11-28T00:00:00.000\\"}],\\"ImportComplete\\":true,\\"RunComplete\\":true,\\"CompleteDate\\":\\"2014-07-31T00:00:00.000\\",\\"Amount\\":2400.00,\\"ProcessingComplete\\":true}","NumberValue":4444.5555,"DateValue":"2014-12-01T00:00:00"}';
You should find the above will eval directly, without any replacements.
In order to achieve the above programatically, you will have to see what the .NET system you are using offers in the way of escaping backslashes. I mainly work with PHP or Python on the server side. Using those languages you could use:
the $s and s strings below have been cropped for brevity.
<?php
$s = '{"DataName":"111","Id":"222"...';
$s = str_replace("\\", "\\\\", $s);
echo "var json = '$s';";
or ...
#!/usr/bin/env python
s = r'{"DataName":"111","Id":"222"...'
s = s.replace("\\", "\\\\")
print "var json = '" + s + "';"
another solution
It all depends on how you are requesting the content you are wrapping in the string in JavaScript. If you have the ability to write out your js from the server side (most likely with .NET). Like I have shown above with PHP or Python, you don't need to wrap the content in a string at all. You can instead just output the content without being wrapped in single quotes. JavaScript will then just parse and treat it as a literal object structure:
var jso = {"DataName":"111","Id":"222","Date":"2015-12-31T00:00:00","TextValue":"{\"Id\":\"1\",\"Name\":\"Daugherty\",\"ContactName\":\"May C\",\"ContactEmail\":\"may.c#gamil.com\",\"Total\":25,\"Phone\":\"111-111-1111\",\"Type\":\"Daily\",\"Notes\":[{\"Comments\":\"One\",\"Date\":\"2014-11-27T00:00:00.000\"},{\"Comments\":\"Two\",\"Date\":\"2014-11-28T00:00:00.000\"}],\"ImportComplete\":true,\"RunComplete\":true,\"CompleteDate\":\"2014-07-31T00:00:00.000\",\"Amount\":2400.00,\"ProcessingComplete\":true}","NumberValue":4444.5555,"DateValue":"2014-12-01T00:00:00"};
This works because JSON is just a more strict version of a JavaScript Object, and the quote/escape level you already have will work fine.
The only downside to the above solution is that you have to implicitly trust the source of where you are getting this data from, and it will always have to be well formed. If not, you could introduce parse errors or unwanted js into your code; which could be avoided with an eval/JSON.parse system.

Javascript string comparison fails when comparing unicode characters

I want to compare two strings in JavaScript that are the same, and yet the equality operator == returns false. One string contains a special character (eg. the danish å).
JavaScript code:
var filenameFromJS = "Designhåndbog.pdf";
var filenameFromServer = "Designhåndbog.pdf";
print(filenameFromJS == filenameFromServer); // This prints false why?
The solution
What worked for me is unicode normalization as slevithan pointed out.
I forked my original jsfiddle to make a version using the normalization lib suggested by slevithan. Link: http://jsfiddle.net/GWZ8j/1/.

Unlike what some other people here have said, this has nothing to do with encodings. Rather, your two strings use different code points to render the same visual characters.
To solve this correctly, you need to perform Unicode normalization on the two strings before comparing them. Unforunately, JavaScript doesn't have this functionality built in. Here is a JavaScript library that can perform the normalization for you: https://github.com/walling/unorm

The JavaScript equality operator == will appear to be failing under the following circumstances. In all cases it is programmer error. Not a bug in JavaScript.
The two strings do not contain the same number and sequence of characters.
There is whitespace or newlines before, within or after one string. Use a trim() operator on both and look closely at both strings.
Surprise typecasting. The programmer is comparing datatypes that are incompatible.
There are unicode characters which look identical to other unicode characters but in fact are different unicode characters.

UTF-8 is a complex thing. The charset has two different codes for characters such as á, é etc. As you already see in the URL encoded version, the HEX bytes of which the character is made differ for both versions.
See this answer for more information.

I had this same problem.
Adding
<meta charset="UTF-8">
to the HTML file fixed the issue.
In my case the templating engine was baking a json string into the HTML file. This string was in unicode.
While the template was also a unicode file, the JS engine was treating the string I wrote into the template as a latin-1 encoded string, until I added the meta tag.
I was comparing the typed in string to one of the JSON objects items (location.title == "Mühle")

Let the browser normalize unicode for you. This approach worked for me:
function normalizeUnicode(s) {
let div = $('<div style="display: none"></div>').html(s).appendTo('body');
let res = div.html();
div.remove();
return res;
}
normalizeUnicode(unicodeVal1) == normalizeUnicode(unicodeVal2)

We Keep Coding

JavaScript is the programming language of the Web.