window.open UTF-8 Issue - javascript

I have this site:
http://a.b/x – y
where the dash is non-ASCII \u2013 or %E2%80%93 in UTF-8 speak.
The following link with UTF-8 works fine:
True Link
but scripting it with window.open() with the exact same URL gives a 404:
Raw JS Link
Viewing properties on the error page to see the resulting URL I note the extended dash is replaced with:
â??
If I replace the extended dash, and only the extended dash, with "\u2013", the link works fine:
Modified JS Link
and the resulting URL seems to have re-encoded the extended dash back to UTF-8.
With this in mind I tried to decode the UTF-8 encoding and re-encode just the space but this failed with the same error as before:
Raw JS Link
I suspect that window.open() is mangling the URL for some reason.
I then went on to try a bunch of different ideas and combinations of decode / encode and even dragged escape()/unescape() back into use, but to no avail.
The reason for window.open is that I am limited to controlling just the content of the HREF attribute. In this case it's an SSRS expression in a "Go to URL" Action, and SSRS UTF-8 encodes certain characters, so that even with the split(' ') above I actually have to use split(String.fromCharCode(32)).
However, I've stripped everything out into a simple HTML page, which is where I am doing my analysis.
PS: IE8, though user base is IE8+
PPS: Added missing quote.
PPPS: It looks like this might be an IE8-specific issue.

<a href="javascript:void(window.open('http://a.b/...component...
So here you've got multiple nested escaping contexts. You're injecting text into:
a component of a URL (needs URL-escaping), inside
a JavaScript string literal (needs JS-escaping), inside
a javascript: pseudo-URL (needs URL-escaping), inside
an HTML attribute value (needs HTML-escaping)
So the value x – y has to be escaped four times:
URL-escape to x%20%E2%80%93%20y
JS-escape to x%20%E2%80%93%20y (no changes this time as there are no JS-special characters in this value)
URL-escape to x%2520%25E2%2580%2593%2520y
HTML-escape to x%2520%25E2%2580%2593%2520y (no changes this time as there are no HTML-special characters in this value).
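The four layers above can be sketched in code. This is only an illustration: it escapes the value on its own rather than the whole pseudo-URL, and the helper names jsStringEscape, htmlAttrEscape and escapeFourTimes are hypothetical, not part of any library.

```javascript
// Hypothetical helper: escape characters that are special inside a JS string literal.
function jsStringEscape(s) {
  return s
    .replace(/[\\'"]/g, '\\$&') // prefix backslash and quotes with a backslash
    .replace(/\n/g, '\\n')
    .replace(/\r/g, '\\r');
}

// Hypothetical helper: escape characters that are special inside an HTML attribute value.
function htmlAttrEscape(s) {
  return s
    .replace(/&/g, '&amp;') // must run first so later entities are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Apply the four escaping layers in order, innermost context first.
function escapeFourTimes(value) {
  const urlComponent = encodeURIComponent(value);   // 1. component of a URL
  const jsLiteral = jsStringEscape(urlComponent);   // 2. JavaScript string literal
  const pseudoUrl = encodeURIComponent(jsLiteral);  // 3. javascript: pseudo-URL
  const htmlAttr = htmlAttrEscape(pseudoUrl);       // 4. HTML attribute value
  return { urlComponent, jsLiteral, pseudoUrl, htmlAttr };
}

const steps = escapeFourTimes('x \u2013 y');
console.log(steps.urlComponent); // x%20%E2%80%93%20y
console.log(steps.htmlAttr);     // x%2520%25E2%2580%2593%2520y
```

Steps 2 and 4 leave this particular value untouched, but they still have to run, because another value might contain a quote, a backslash or an ampersand.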
Nested syntaxes needing escaping are very, very difficult to get right. And generally you should never use javascript: URLs: as well as being a nightmare of multiple-escaping, they're also pretty bad for usability and accessibility.
Avoid injecting into nested code. A better pattern for links that open in a new window (if you absolutely must) is to put the real URL in the href, so it responds correctly to middle-click and other link affordances, and then read that href from JS, eg.:
<a href="http://a.b/x%20%E2%80%93%20y" onclick="window.open(this.href, ...options...); return false;"
(The return-false prevents the link being followed after the window is opened.) Also consider breaking the JS code out into a separate script that binds to all appropriate links automatically (eg by class attribute) so you don't have to have inline JavaScript in your HTML.
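A minimal sketch of that separate-script approach follows. The class name popup and the function name bindPopupLinks are assumptions made for illustration. The code uses addEventListener, which IE8 does not support, so an IE8 user base would need attachEvent or a library instead.

```javascript
// Bind a click handler to every link with class "popup" (an assumed class name).
// `root` is a document or element; `openWindow` is whatever opens the URL.
function bindPopupLinks(root, openWindow) {
  const links = root.querySelectorAll('a.popup');
  for (let i = 0; i < links.length; i++) {
    const link = links[i];
    link.addEventListener('click', function (event) {
      openWindow(link.href);  // the href is read as-is, so no extra escaping layer
      event.preventDefault(); // same role as "return false" in the inline handler
    });
  }
  return links.length;
}

// Only runs in a browser; elsewhere the function is merely defined.
if (typeof document !== 'undefined') {
  bindPopupLinks(document, function (url) {
    window.open(url, '_blank');
  });
}
```

The markup then reduces to a plain link with an ordinary, once-escaped href, which still works with middle-click and with scripting disabled.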

The single quotes were misplaced in your last example. Also, there's no need for .split(' ').join('%20'), as it will create errors.
Raw JS Link
demo
http://jsfiddle.net/bf2703ah/1/


What is this HTML notation and how can I use it myself?

AddThis uses a notation which seems to extend the parameters available in an HTML div tag.
The tag that contains the button array can include additional parameters such as:
<div addthis:url="someUrl"> </div>
This, along with defining some CSS classes for the element, seems to give their JavaScript code access to manipulate the element AND read the value of the additional addthis: parameter.
I'd like to implement something similar myself but am confused as to how to correctly allow additional parameters in the standard HTML tags.
I've also seen the AddThis code throw W3C validation errors sometimes so wonder if this is entirely legitimate.
Searching around I've found some discussions about extending the HTML tags via extending the prototypes in JavaScript but everything I've read seems to be about adding new events etc.
This addthis:url notation looks more 'schema'-like to me, or am I on completely the wrong track?
I've made some progress on this, at least functionally, but what I have now breaks the HTML validation quite seriously.
To explain a little further what I am trying to achieve...
In the same way that AddThis allows you to include their sharing elements by adding a simple <DIV> tag to your page and including some JavaScript, I want to provide similar functionality with <IMG> tags.
Someone wanting to use this functionality will include an <IMG> tag that has some additional name=value pairs that are outside the standard image tag's attributes and are defined by my spec.
The JavaScript that is included will then read these additional attributes and perform some actions on the image tags.
To this end I have the following:
<IMG id="first image" class="imageThatCanBeWorkedOn" src="holding.png"
my-API-name:attribute1="some data"
my-API-name:attribute2="some other data">
I then use getAttribute('my-API-name:attribute1') to access the additional tag data from JavaScript.
(I'm selecting all of the tags with a particular class name into an array and then processing each tag in turn, in case anyone is interested.)
This all works great - I can manipulate the <IMG> tags as needed based on the additional data, BUT the markup is not valid HTML according to the W3C validator.
With the above I get:
Warning Attribute my-API-name:attribute1 is not serializable as XML 1.0.
Warning Attribute my-API-name:attribute2 is not serializable as XML 1.0.
Error: Attribute my-API-name:attribute1 not allowed on element img at this point.
Error: Attribute my-API-name:attribute2 not allowed on element img at this point.
If I remove the : from the attribute name (eg my-API-name-attribute2) the 'not serializable' warnings disappear but I still get the 'not allowed' errors.
So how would I add these additional bits of data to an <IMG> tag and not invalidate the markup but while maintaining a level of clarity/branding by including the 'my-API-name' part in the way that AddThis does?
(I note from the comments that I could use data- attributes. I haven't tried these yet, but I'd prefer to be able to do this in the 'branded' way that AddThis seems to have managed without breaking their users' markup.)
If we were talking about XML (which includes XHTML) it'd be a namespace prefix. In HTML5 it's just a regular attribute:
Attribute names must consist of one or more characters other than the
space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027
APOSTROPHE ('), U+003E GREATER-THAN SIGN (>), U+002F SOLIDUS (/), and
U+003D EQUALS SIGN (=) characters, the control characters, and any
characters that are not defined by Unicode.
... though slightly harder to manipulate (not too much, though) and totally non-standard.
I'd like to implement something similar myself but am confused as to
how to correctly allow additional parameters in the standard HTML
tags.
Before HTML5, some web developers deployed a technique of adding custom data to an element's class attribute (or to any other attribute which will happily attach itself to any element).
This worked, but it was self-evidently a hack.
For this reason HTML5 introduced custom data-* attributes as the standard approach to extending an element's attributes - and data-* is precisely what you should be deploying.
So how would I add these additional bits of data to an <IMG> tag and
not invalidate the markup but while maintaining a level of
clarity/branding by including the 'my-API-name' part in the way that
AddThis does?
<img id="first image" class="imageThatCanBeWorkedOn" src="holding.png"
data-myAPIName_attribute1="some data"
data-myAPIName_attribute2="some other data" />
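Reading such attributes back from script could look roughly like the sketch below. One thing to keep in mind: the HTML parser lowercases attribute names, so data-myAPIName_attribute1 is seen by script as data-myapiname_attribute1. The function name readPrefixedData and the prefix string are assumptions made for illustration.

```javascript
// Collect every attribute whose (lowercased) name starts with `prefix`
// and return them as a plain object keyed by the remainder of the name.
function readPrefixedData(element, prefix) {
  const result = {};
  const attrs = element.attributes;
  for (let i = 0; i < attrs.length; i++) {
    const name = attrs[i].name.toLowerCase();
    if (name.indexOf(prefix) === 0) {
      result[name.slice(prefix.length)] = attrs[i].value;
    }
  }
  return result;
}

// In a browser, process every image carrying the class used in the question.
if (typeof document !== 'undefined') {
  const images = document.querySelectorAll('img.imageThatCanBeWorkedOn');
  for (let i = 0; i < images.length; i++) {
    // 'data-myapiname_' is the lowercased form of the prefix in the markup above.
    console.log(readPrefixedData(images[i], 'data-myapiname_'));
  }
}
```

Using an all-lowercase, hyphenated name such as data-my-api-name-attribute1 would avoid the case surprise and keep the branding in the attribute name.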
Further Reading:
Time Travel back to 2010: http://html5doctor.com/html5-custom-data-attributes/
Time Travel back to 2008: http://ejohn.org/blog/html-5-data-attributes/

Regex replace with multiple wildcards works in PHP, not in JavaScript

I'm attempting to implement center alignment for two Markdown parsers:
In PHP for Parsedown (successfully)
In JavaScript for Bootstrap Markdown (without success)
The idea I'm following and finding the easiest is to work with the final HTML output, and just snap inline styling onto the tags.
The following regex does what I need: it adds style="text-align:center;" to any element so far*, as needed:
$text = preg_replace('/\<(.*?)\>\->(.*?)<\-\<\/(.*?)\>/', '<$1 style="text-align:center;">$2</$3>', $text);
That is, <p>text</p> becomes <p style="text-align:center;">text</p>.
However, when I attempted to port this into JavaScript to also make it available for previewing on client-side, the pattern does not match as it should:
content = content.replace('/\<(.*?)\>\->(.*?)<\-\<\/(.*?)\>/', '<$1 style="text-align:center;">$2</$3>');
The replacement in content does not occur.
I'm aware there are slight differences between Regex of PHP and JavaScript, but I have found examples for all the expected behavior here on both sides, working.
*If someone is wondering by any chance, I'm also successfully adding the center alignment to tags that already have a style attribute - on server side only, so far.
You'll need to use the literal syntax for regular expression in JavaScript, like so:
content = content.replace(/\<(.*?)\>\->(.+)<\-\<\/(.+)\>/gi, '<$1 style="text-align:center;">$2</$3>');
Note that the gi at the end of the regular expression simply enables global searching (that is, replace all occurrences matching the pattern) and case-insensitive matching. They are both technically optional, but you will most likely want the g flag enabled for certain. However, keeping the i flag is up to you (depends on whether or not your content contains &GT;, for example).
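The difference between a quoted pattern and a regex literal can be seen in a small sketch. The function name centerAlign and the sample input are made up for illustration, and the pattern keeps the lazy (.*?) groups from the question rather than the (.+) groups shown above.

```javascript
const replacement = '<$1 style="text-align:center;">$2</$3>';

// A quoted pattern is treated as literal text to search for, not as a regex.
function centerAlignWithString(html) {
  return html.replace('/<(.*?)>->(.*?)<-<\\/(.*?)>/', replacement);
}

// A regex literal (no quotes) is compiled as a pattern; g replaces every match.
function centerAlign(html) {
  return html.replace(/<(.*?)>->(.*?)<-<\/(.*?)>/g, replacement);
}

const input = '<p>->centered text<-</p>';
console.log(centerAlignWithString(input)); // <p>->centered text<-</p> (unchanged)
console.log(centerAlign(input));           // <p style="text-align:center;">centered text</p>
```

With the lazy groups, two centered elements on the same line are matched separately instead of being swallowed by one greedy match.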

Linking to File With Space in JQuery

How would you link to a file that contains a space? Is it possible? I have a javascript document and already have dozens of images that contain spaces but I was hoping to be able to still link to them.
%20 is the escaped value for a blank space. Use that in a hyperlink, and you'll get the file you want :)
In case you test it in a browser: modern browsers (Chrome for sure) do not visually change the space to %20 in the address bar anymore, but they do still escape all characters before making a web request.
Edit
Generally speaking, you'd like to html encode your strings via an accessible method, rather than manually escaping the needed characters.
The following SO question has a very elegant solution. If you use it with an element that is not visible to the user (or not even part of the DOM, as is the case with the linked answer), they won't even know.
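For the space-to-%20 escaping itself, JavaScript's built-in encodeURI and encodeURIComponent can do the work. The sketch below assumes the images sit in a folder called images. The file names and the helper name fileUrl are made up for illustration.

```javascript
// Build a link to a file whose name may contain spaces or other reserved characters.
// `dir` is escaped as a path (slashes kept); `name` is escaped as a single segment.
function fileUrl(dir, name) {
  return encodeURI(dir) + '/' + encodeURIComponent(name);
}

console.log(encodeURI('images/my holiday photo.jpg')); // images/my%20holiday%20photo.jpg
console.log(fileUrl('images', 'Q&A notes.pdf'));       // images/Q%26A%20notes.pdf
```

encodeURI leaves characters such as / and & alone, which suits a whole path. encodeURIComponent escapes them too, which is safer for a single file name.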

How to prevent & conversion to &amp; when using JavaScript? (browser specific)

I have a problem concerning string output on HTML page when using Javascript and ASP. Logic of page generation goes like this:
We use an ASP page to generate HTML code using Response.Write(). If a string contains a numeric character reference (NCR), for example the one for С, it shows on the user's side just fine as a character.
After that we add an OnLoad event, which calls a JavaScript function. All this happens inside the <body></body> tags. The JavaScript source is added inside <script></script> tags. The function only adds document.href, which contains a reference to the same ASP page.
The ASP logic loads again and adds some text to the page using Response.BinaryWrite() (Response.Write can be used all the same). All character references are now shown as codes rather than characters. Evidently every '&' becomes &amp; (ASP's automatic conversion), the browser decodes that back to a literal &, and we see only the code for С and not the symbol 'С' itself.
As far as I know, such behaviour can be caused by <script> tags, as a precaution against XSS attacks. In the end I want to stop '&' being encoded as &amp;.
However, here is the most important part:
If I add a "Content-Type" header of "text\html", IE (any version) starts rendering NCR symbols correctly. But Firefox, Chrome and Safari do not change behaviour and keep encoding & as &amp;. I can see several questions on Stack Overflow that look like mine, yet the situation is not exactly the same (my strings are not inserted directly by JavaScript, so I cannot manipulate the output string and change &amp; to &; also, my strings have the correct symbols in the first place and get changed by ASP or by the browser). Is there any elegant way to force Firefox or Chrome to decode the page as IE does? Maybe some settings or attributes in HTML tags? This problem looks browser-dependent to me, am I right?

Javascript replace() function adding strange characters

Consider the following Javascript:
var previewImg = 'http://example.com/preview_img/hey.jpg';
var fullImg = previewImg.replace('preview','full');
I would expect the value of fullImg to be:
http://example.com/full_img/hey.jpg
In fact, it is... sort of. Running alert(fullImg); shows the expected url string. But when I deliver that variable to jQuery Fancybox, like this:
jQuery.fancybox.open(fullImg);
Something adds characters into the string, like this:
http://example.com/%EF%BF%BCfull_img/hey.jpg
Where is this %EF%BF%BC coming from? What is it? And most importantly, how do I get rid of it?
Some other clues: This is a Drupal 7 site, running jQuery 1.5.1. I'm using that same Fancybox script elsewhere on the site with no issues.
%EF%BF%BC is a sequence of three URL-encoded bytes.
You clearly can't see any unexpected characters in the string. That's because the character those bytes encode is invisible in most editors.
It's the UTF-8 encoding of U+FFFC, the OBJECT REPLACEMENT CHARACTER (not a byte-order mark, which would show up as %EF%BB%BF). Rich-text applications use it as a placeholder for embedded objects. It probably got into your code when you did a copy+paste from another file.
The quickest way to get rid of them is to find the bit of code that was copied+pasted, delete the characters on either side of the problem, and retype them. Depending on your editor, you may find the delete behaves strangely as it deletes the hidden characters.
Some text editors and IDEs will have an option to show hidden characters. If your editor has this, it may help you see where the mystery characters are so you can delete them.
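If the editor cannot show hidden characters, a few lines of JavaScript can locate them. This is only a sketch: the function names findHiddenCharacters and stripHiddenCharacters are made up, and the strip list covers just U+FFFC and U+FEFF, which is an assumption about what might be lurking.

```javascript
// Report every UTF-16 code unit outside printable ASCII, with its position.
function findHiddenCharacters(s) {
  const hits = [];
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    if (code < 0x20 || code > 0x7e) {
      const hex = code.toString(16).toUpperCase().padStart(4, '0');
      hits.push({ index: i, codeUnit: 'U+' + hex });
    }
  }
  return hits;
}

// Remove U+FFFC (object replacement character) and U+FEFF (byte-order mark).
function stripHiddenCharacters(s) {
  return s.replace(/[\uFFFC\uFEFF]/g, '');
}

console.log(encodeURIComponent('\uFFFC'));          // %EF%BF%BC
console.log(findHiddenCharacters('\uFFFCfull'));    // [ { index: 0, codeUnit: 'U+FFFC' } ]
console.log(stripHiddenCharacters('\uFFFCfull'));   // full
```

Running findHiddenCharacters over the suspect string literal (for example the 'full' argument passed to replace()) shows whether and where the stray character sits.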
Hope that helps.
