Stripping filename from HTML Encoded UNIX Path - javascript

I am writing a NodeJS application using Express and Google Datastore. I am trying to get the filename from a UNIX path. The path is stored in an HTML encoded format in the database.
Here's the path un-encoded:
/toplevel/example/test123.txt
Here's how the path is stored in the database (HTML-encoded format):
&#x2F;toplevel&#x2F;example&#x2F;test123.txt
Since the path is HTML encoded, this line is not working.
let filename_only = requested_filepath_unescaped.split('/').pop().toString();
I also tried splitting by the encoded characters but that does not work either (perhaps because split doesn't work with multiple characters?)
let filename_only = requested_filepath_unescaped.split('&#x2F').pop().toString();
What is the best way to either split the string as-is, or de-code the HTML back into an unencoded string?

Well, split works with multiple characters, so I don't know what goes wrong when you tried it.
However if you can use jQuery, you can also decode the html like this:
var htmlDecoded = $('<div />').html(htmlEncoded).text()
After that you can split on '/'.
(The code I gave creates a div tag in memory; it is not added to the DOM, i.e. the web page. Setting its html automatically decodes the HTML entities.)
EDIT:
As I am unsure what the OP's problem is, and I can't comment due to low reputation, I'll give some more suggestions here.
Maybe the variable you call split on is not really a string object. Try converting to string first:
var filename = filepath.toString().split('/');
Another option is to use a regex. Note that the entity is &#x2F; (with an x and a trailing semicolon), so the pattern has to include both:
var filename = filepath.toString().split(/&#x2F;/);
EDIT2: Tested and working in Chrome v62 and Node v6.11.4.
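Since the question is about Node/Express, where there is no DOM and no jQuery, a plain string replacement may be the simplest route: turn the &#x2F; entities back into slashes, then split. A minimal sketch under that assumption, with requested_filepath standing in for the value read from Datastore:
// Decode the HTML-encoded slashes back to "/" and take the last segment.
// The entity may appear with or without the trailing semicolon, hence ";?".
const requested_filepath = '&#x2F;toplevel&#x2F;example&#x2F;test123.txt';
const decoded = requested_filepath.replace(/&#x2F;?/gi, '/');
const filename_only = decoded.split('/').pop();
console.log(filename_only); // "test123.txt"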

Related

How to change the encoding of a CSV file to ISO-8859-1 (ISO-Latin-1)

I have searched for way too long to solve my problem.
There is a piece of software that can only read CSV files with ISO-8859-1 (ISO-Latin-1) encoding. I've tried almost everything I can find on the web, but nothing worked. I don't want to change the text, I want to change the encoding.
I've tried working with this library: https://github.com/inexorabletash/text-encoding, as well as the PapaParse library and much more. But they just convert the text, so weird symbols replace ä, ö, ü and other characters.
The CSV file itself contains characters which are not part of the character set you're encoding into. You might find that the line endings use a character which does not exist in the target character set. If you open the CSV file in a basic text editor, or at the DOS prompt use Type myFile.csv, you will be able to see which characters are there in their most basic form. Then strip them out and you should have a file that can be converted. Always work on a copy of the original. The easiest way is search-and-replace in a text editor, replacing the unwanted characters with nothing, not even a space. Keep in mind that if the character is part of the CSV structure, e.g. a delimiter, then you will want to replace it with the Latin-1 equivalent character.
I found it myself, but thank you anyway.
In JS:
// Requires the text-encoding polyfill set up below; its nonstandard option
// lets TextEncoder emit a legacy encoding such as windows-1252.
const uint8array = new TextEncoder(
    'windows-1252',
    {NONSTANDARD_allowLegacyEncoding: true}
).encode(csv_data);
const blob = new Blob([uint8array]);
(For those who found this question by searching: csv_data is a string containing the values from the CSV file that was read. Make sure to add a ; behind every value to move to the next column, and a \n to move to the next line.)
In HTML:
<script>
window.TextEncoder = window.TextDecoder = null;
</script>
<script src="encoding-indexes.js"></script>
<script src="encoding.js"></script>
And you have to download encoding.js and encoding-indexes.js from https://github.com/inexorabletash/text-encoding
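To actually hand that Blob to the user as a file, a common follow-up step (an assumption here, not part of the original answer) is an object URL on a temporary link:
// Hypothetical download step: wrap the re-encoded bytes in a link and click it.
const url = URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
a.download = 'export.csv'; // the filename is a placeholder
document.body.appendChild(a);
a.click();
a.remove();
URL.revokeObjectURL(url);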

How to encode and decode all special characters in javascript or jquery?

I want to be able to encode and decode all the following characters using javascript or jquery...
~!@#$%^&*()_+|}{:"?><,./';[]\=-`
I tried to encode them using this...
var cT = encodeURI(oM); // oM holds the special characters
cT = cT.replace(/[!"#$%&'()*+,.\/:;<=>?@[\\\]^`{|}~]/g, "\\\\$&");
Which does encode them, or escape them rather, but then I am trying to do the reverse with this...
decodeURIComponent(data.convo.replace(/\+/g, ' '));
But, it's not coming out in any way desired.
I've built a chat plugin for jQuery, but the script crashes if someone enters a special character. I want the special characters to be encoded, and when they are pulled out of the database, they should be decoded. I tried using urldecode in PHP before the data is returned to the ajax request, but it came out horribly wrong.
I would think that there exists some function to encode and decode all special characters.
Oh, one caveat for this is that I'm wrapping each message with html elements, so I think the decoding needs to be done server side, before the message is wrapped, or be able to know when to ignore valid html tags and decode the other characters that are just what the user wanted to type.
Am I encoding/escaping them wrong to begin with?
Is that why the results are horrible?
This is pretty simple in javascript
// Note that I have escaped the " in the string - this means it still gets processed
var exampleInput = "Hello there h4x0r ~!@#$%^&*()_+|}{:\"?><,./';[]\\=-`";
var encodedInput = encodeURI(exampleInput);
var decodedInput = decodeURI(encodedInput);
console.log(exampleInput);
console.log(encodedInput);
console.log(decodedInput);
Just encode and decode the input. If something else is breaking in your script, it means you are not stripping away things before processing them. It's hard to provide a more precise answer; as you can see, encoding and decoding to the URI standard does not crash anything. Only improper processing of this content would cause issues.
When you output the content in HTML you should be encoding the HTML entities.
Reference this thread Encode html entities in javascript if you need to actually encode for display inside HTML safely.
An additional reference on how html entities work can be found here: W3 Schools - HTML Entities and W3 Schools - HTML Symbols
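If the goal is to safely display user text inside HTML (the thread referenced above), a minimal entity-escaping sketch looks like this; the function name escapeHtml is illustrative, and the browser's serializer does the actual work:
function escapeHtml(str) {
    // Text assigned via textContent comes back entity-encoded via innerHTML.
    var div = document.createElement('div');
    div.textContent = str;
    return div.innerHTML;
}
console.log(escapeHtml("<b>fish & chips</b>"));
// -> &lt;b&gt;fish &amp; chips&lt;/b&gt;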

How to insert arbitrary JSON in HTML's script tag

I would like to store a JSON document's contents in an HTML document's source, inside a script tag.
The content of that JSON does depend on user submitted input, thus great care is needed to sanitise that string for XSS.
I've read about two concepts here on SO.
1. Replace all occurrences of the </script tag into <\/script, or replace all </ into <\/ server side.
Code-wise it looks like the following (using Python and Jinja2 for the example):
// view
data = {
    'test': 'asdas</script><b>as\'da</b><b>as"da</b>',
}
context_dict = {
    'data_json': json.dumps(data, ensure_ascii=False).replace('</script', r'<\/script'),
}
// template
<script>
var data_json = {{ data_json | safe }};
</script>
// js
access it simply as the window.data_json object
2. Encode the data as a HTML entity encoded JSON string, and unescape + parse it in client side. Unescape is from this answer: https://stackoverflow.com/a/34064434/518169
// view
context_dict = {
'data_json': json.dumps(data, ensure_ascii=False),
}
// template
<script>
var data_json = '{{ data_json }}'; // encoded into HTML entities, like &lt; &gt; &amp;
</script>
// js
function htmlDecode(input) {
    var doc = new DOMParser().parseFromString(input, "text/html");
    return doc.documentElement.textContent;
}
var decoded = htmlDecode(window.data_json);
var data_json = JSON.parse(decoded);
This method doesn't work, because \" in a script source becomes " in a JS variable. It also creates a much bigger HTML document and is not really human-readable, so I'd go with the first one if it doesn't pose a huge security risk.
Is there any security risk in using the first version? Is it enough to sanitise a JSON encoded string with .replace('</script', r'<\/script')?
Reference on SO:
Best way to store JSON in an HTML attribute?
Why split the <script> tag when writing it with document.write()?
Script tag in JavaScript string
Sanitize <script> element contents
Escape </ in script tag contents
Some great external resources about this issue:
Flask's tojson filter's implementation source
Rail's json_escape method's help and source
A 5 year long discussion in Django ticket and proposed code
Here's how I dealt with the relatively minor part of this issue, the encoding problem with storing JSON in a script element. The short answer is you have to escape either < or / as together they terminate the script element -- even inside a JSON string literal. You can't HTML-encode entities for a script element. You could JavaScript-backslash-escape the slash. I preferred to JavaScript-hex-escape the less-than angle-bracket as \u003C.
.replace('<', r'\u003C')
I ran into this problem trying to pass the json from oembed results. Some of them contain script close tags (without mentioning Twitter by name).
json_for_script = json.dumps(data).replace('<', r'\u003C');
This turns data = {'test': 'foo </script> bar'}; into
'{"test": "foo \\u003C/script> bar"}'
which is valid JSON that won't terminate a script element.
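The same trick works in a JavaScript back end; a sketch mirroring the Python above, not taken from any particular library:
// Escape "<" so the serialized JSON can never close the surrounding <script>.
const data = { test: 'foo </script> bar' };
const jsonForScript = JSON.stringify(data).replace(/</g, '\\u003C');
console.log(jsonForScript); // {"test":"foo \u003C/script> bar"}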
I got the idea from this little gem inside the Jinja template engine. It's what's run when you use the {{data|tojson}} filter.
def htmlsafe_json_dumps(obj, dumper=None, **kwargs):
    """Works exactly like :func:`dumps` but is safe for use in ``<script>``
    tags. It accepts the same arguments and returns a JSON string. Note that
    this is available in templates through the ``|tojson`` filter which will
    also mark the result as safe. Due to how this function escapes certain
    characters this is safe even if used outside of ``<script>`` tags.

    The following characters are escaped in strings:

    - ``<``
    - ``>``
    - ``&``
    - ``'``

    This makes it safe to embed such strings in any place in HTML with the
    notable exception of double quoted attributes. In that case single
    quote your attributes or HTML escape it in addition.
    """
    if dumper is None:
        dumper = json.dumps
    rv = dumper(obj, **kwargs) \
        .replace(u'<', u'\\u003c') \
        .replace(u'>', u'\\u003e') \
        .replace(u'&', u'\\u0026') \
        .replace(u"'", u'\\u0027')
    return Markup(rv)
(You could use \x3C instead of \u003C and that would work in a script element because it's valid JavaScript. But might as well stick to valid JSON.)
First of all, your paranoia is well founded.
an HTML-parser could be tricked by a closing script tag (better assume by any closing tag)
a JS-parser could be tricked by backslashes and quotes (with a really bad encoder)
Yes, it would be much "safer" to encode all characters that could confuse the different parsers involved. Keeping it human-readable might be contradicting your security paradigm.
Note: The result of JSON string encoding should be canonical and, of course, not broken, i.e. parsable. JSON is a subset of JS and is thus parsable as JS without any risk. So all you have to do is make sure the HTML parser instance that extracts the JS code is not tricked by your user data.
So the real pitfall is the nesting of both parsers. Actually, I would urge you to put something like that into a separate request. That way you would avoid that scenario completely.
Assuming all possible styles and error-corrections that could happen in such a parser it might be that other tags (open or close) might achieve a similar feat.
As in: suggesting to the parser that the script tag has ended implicitly.
So it is advisable to encode the slash and all tag brackets (/, <, >), not just the closing of a script tag, with whatever reversible method you choose, as long as it does not confuse the HTML parser:
The best choice would be base64 (but you wanted something more readable)
HTMLentities will do, although confusing humans :)
Doing your own escaping will work as well, just escape the individual characters rather than the </script fragment
In conclusion, yes, it's probably best with a few changes, but please note that you will be one step away from "safe" already, by trying something like this in the first place, instead of loading the JSON via XHR or at least using a rigorous string encoding like base64.
P.S.: If you can learn from other people's code that encodes such strings, that's nice, but you should not resort to "libraries" or other people's functions if they don't do exactly what you need.
So rather write and thoroughly test your own (de/en)coder, and know that this pitfall has been sealed.
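As a sketch of the base64 route suggested above (the variable name data_b64 and its value are illustrative):
// Server side (conceptually): base64-encode the JSON text and embed it as a
// plain string literal; base64 uses only [A-Za-z0-9+/=], so nothing in it
// can confuse the HTML parser.
var data_b64 = 'eyJ0ZXN0IjogImZvbyA8L3NjcmlwdD4gYmFyIn0=';
// Client side: decode and parse after the HTML has been parsed.
var data = JSON.parse(atob(data_b64));
console.log(data.test); // "foo </script> bar"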

Handling unicode in the http response xml

I'm writing a Google Chrome extension that builds upon myanimelist.net REST api. Sometimes the XMLHttpRequest response text contains unicode.
For example:
<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>
If I create a HTML node from the text it looks like this:
Onegai My Melody Sukkiriâ�ª
The actual title, however, is this:
Onegai My Melody Sukkiri♪
Why is my text not correctly rendered and how can I fix it?
Update
Code: background.html
I think these are the crucial parts:
function htmlDecode(input){
    var e = document.createElement('div');
    e.innerHTML = input;
    return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}
function xmlDecode(input){
    var result = input;
    result = result.replace(/&lt;/g, "<");
    result = result.replace(/&gt;/g, ">");
    result = result.replace(/\n/g, "&#10;");
    return htmlDecode(result);
}
Further:
var parser = new DOMParser();
var xmlText = response.value;
var doc = parser.parseFromString(xmlDecode(xmlText), "text/xml");
<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>
Oh dear! Not only is that the wrong text, it's not even well-formed XML. acirc and ordf are HTML entities which are not predefined in XML, and then there's an invalid UTF-8 sequence (one high byte, presumably originally 0x99) between them.
The problem is that myanimelist are generating their output ‘XML’ (but “if it ain't well-formed, it ain't XML”) using the PHP function htmlentities(). This tries to HTML-escape not only the potentially-sensitive-in-HTML characters <&"', but also all non-ASCII characters.
This generates the wrong characters because PHP defaults to treating the input to htmlentities() as ISO-8859-1 instead of UTF-8 which is the encoding they're actually using. But it was the wrong thing to begin with because the HTML entity set doesn't exist in XML. What they really wanted to use was htmlspecialchars(), which leaves the non-ASCII characters alone, only escaping the really sensitive ones. Because those are the same ones that are sensitive in XML, htmlspecialchars() works just as well for XML as HTML.
htmlentities() is almost always the Wrong Thing; htmlspecialchars() should typically be used instead. The one place you might want to encode non-ASCII bytes to entity references would be when you're targeting pure ASCII output. But even then htmlentities() fails, because it doesn't produce character references (&#...;) for the characters that don't have a predefined entity name. Pretty useless.
Anyway, you can't really recover the mangled data from this. The � represents a byte sequence that was UTF-8-undecodable to the XMLHttpRequest, so that information is irretrievably lost. You will have to persuade myanimelist to fix their broken XML output as per the above couple of paragraphs before you can go any further.
Also they should be returning it as Content-Type: text/xml not text/html as at the moment. Then you could pick up the responseXML directly from the XMLHttpRequest object instead of messing about with DOMParsers.
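For illustration, if the feed were served as text/xml, the client side would reduce to something like this sketch (the URL is a placeholder):
var xhr = new XMLHttpRequest();
xhr.open('GET', 'https://example.com/anime.xml'); // placeholder URL
xhr.onload = function () {
    // With Content-Type: text/xml the browser parses the body itself;
    // no manual entity replacement or DOMParser is needed.
    var title = xhr.responseXML.getElementsByTagName('title')[0].textContent;
    console.log(title);
};
xhr.send();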
So, I've come across something similar to what's going on here at work, and I did a bit more research to confirm my hypothesis.
If you take a look at the returned value you posted above, you'll notice the tell-tale sequence "â". 99% of the time when you see it, it means you have a character encoding issue (typically UTF-8 characters being encoded as ISO-8859-1).
The first thing I would test for is to force a character encoding in the API return. (It's a long shot, but you could look)
Second, I'd try to force a character encoding onto the data returned (I know there's a .htaccess override, but I don't know what's allowed in Chrome extensions so you'll have to research that).
What I believe is going on is that when you create the node with the data, you don't have a character encoding set on the document, and browsers (typically, in my experience) default to ISO-8859-1. So, check to make sure it's not your document that's the problem.
Finally, if you can't find (or can't prevent) the source of the encoding problem, you'll have to write a conversion table to replace the malformed values you're getting with the ones you want; JS's "replace" should be fine (http://www.w3schools.com/jsref/jsref_replace.asp).
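A sketch of such a conversion table; the mappings are common UTF-8-read-as-ISO-8859-1 mojibake pairs, given as examples rather than an exhaustive list:
// Each key is a mangled sequence seen in the feed; each value is the
// character it should have been. Extend as new cases show up.
var conversionTable = {
    'â€™': '\u2019', // right single quotation mark
    'â€“': '\u2013', // en dash
    'Ã©': '\u00e9'   // é
};
function fixEncoding(text) {
    Object.keys(conversionTable).forEach(function (bad) {
        text = text.split(bad).join(conversionTable[bad]);
    });
    return text;
}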
You can't just use a simple search-and-replace to fix encoding issues, since these are Unicode characters, not characters typed on a keyboard.
Your data must be stored on the server in UTF-8 format if you are planning on retrieving it via AJAX. This problem is probably due to someone pasting in characters from MS-Word which use a completely different encoding scheme (ISO-8859).
If you can't fix the data, you're kinda screwed.
For more details, see: UTF-8 vs. Unicode

Escaping HTML in Ruby & adding to DOM with JS

I'm trying to dynamically add the contents of a div using JS. The back end is Ruby on Rails. I am having a problem. Here's what is included in the view file:
var product_sidebar_inner = "<%= CGI.escapeHTML(render(...some partial...)).gsub(/\r/," ").gsub(/\n/," ") %>";
document.getElementById("left_sidebar_wrapper").innerHTML = unescape(product_sidebar_inner);
The above inserts the HTML as text into div#left_sidebar_wrapper. I've spent some time on this but still can't make it work. Any idea what I am doing wrong?
Based on your comment to macarthy, I think you want CGI.escape (or CGI.unescape), that's what you use for URL encoding. You can also use URI.escape (or URI.unescape) but you'll get tired of having to pass the unsafe regex all the time to get it to do what you want.
Also, on the JavaScript side, you should be using encodeURI or encodeURIComponent as escape is deprecated because it has problems with non-ASCII characters.
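Putting that together, a sketch of the suggested fix (keeping the question's element IDs and the elided partial as-is):
var product_sidebar_inner = "<%= CGI.escape(render(...some partial...)) %>";
// CGI.escape percent-encodes the partial and turns spaces into "+",
// so undo the "+" first, then percent-decode on the client.
document.getElementById("left_sidebar_wrapper").innerHTML =
    decodeURIComponent(product_sidebar_inner.replace(/\+/g, " "));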
Think you need to use raw:
unescape(raw(product_sidebar_inner));
