I have a web app, which allows searching. So when I go to somedomain.com/search/<QUERY> it searches for entities according to <QUERY>. The problem is, when I try to search for . or .. it doesn't work as expected (which is pretty obvious). What surprised me though, is that if I manually enter the url of somedomain.com/search/%2E, the browser (tested Chrome and IE11) converts it somedomain.com/search/ and issues a request without necessary payload.
So far I haven't found anything that would say it's not possible to make this work, so I came here. Right now I have only one option: replacing . and .. to something like __dotPlaceholder__, but this feels like a dirty hack to me.
Any solution (js or non-js) will be welcomed. Any information on why do browsers strip url-encoded dots is also a nice-to-have.
Unfortunately part of RFC3986 defines the URI dot segments to be normalised and stripped out in that case, ie http://example.com/a/./ to become http://example.com/a
see https://www.rfc-editor.org/rfc/rfc3986#page-33 for more information
Related
I use browserless.js (headless Chrome) to fetch the html code of a website, and then use a regular expression to find certain image URLs.
One example is the following:
https://vignette.wikia.nocookie.net/moviepedia/images/8/88/Adrien_Brody.jpg/revision/latest/top-crop/width/360/height/450?cb\u003d20141113231800\u0026path-prefix\u003dde
There are unicode characters such as \u003d, which should be decoded (in this case to =). The reason is that I want to include these images in a site, and without decoding some of them cannot be displayed (like that one above, just paste the URL; it gives broken-image.webp).
I have tried lots of things, but nothing works.
JSON.parse(JSON.stringify(...))
String.prototype.normalize()
decodeURIComponent
Curiously, the regular expression for "\u003d" (i.e. "\\u003d" in js) does not match that string above, but "u003d" does.
This is all very weird, and my current guess is that browserless is responsible for some weird formatting behind the scenes. Namely, when I console log the URL and copy paste it somewhere else, every method mentioned above works for decoding.
I hope that someone can help me on this.
Just to mark this one as answered. Thomas replied:
JSON.parse(`"${url}"`)
I want to open word-documents by clicking on a link in my solution. The link below shows how it is structured in ofe for office. This solution is really nice because it works in every browser but i have problems with special characters.
ms-word:ofe|u|file://our.local/Testing ÅÄÖ.DOCX
I Have tried different approaches to solve this problem but its not working when åäö is present in the path. EncodeURI on the path does not help for instance.
https://learn.microsoft.com/en-us/office/client-developer/office-uri-schemes does not describe anything out of the ordinary and only follow URI spec.
Documents without special characters works great but i can not figure out how special characters should be encoded to make it work.
If i take the file:\... part and past it into any browser it is working but not with the ofe prefix. So it should be some problem with encoding due to it is working fine without any special characters.
Running in cmd is also working:
So in this case i guess the browser is encoding the characters before sending to protocolhandler ??
I encode a filename and send it as a part of URL like /rest/get?name=Filename.txt. In JS link construction is as simple as
url = '/rest/get?name=' + window.encodeURIComponent(file.name);
It works good for simple cases but for hardcore testing I use a file named
你好abcABCæøåÆØÅäöüïëêîâéíáóúýñ½§!#¤%&()=`#£$€{[]}+´¨^~'-_,;.txt
After URI encoding I expect to get a link
/rest/get?name=%E4%BD%A0%E5%A5%BDabcABC%C3%A6%C3%B8%C3%A5%C3%86%C3%98%C3%85%C3%A4%C3%B6%C3%BC%C3%AF%C3%AB%C3%AA%C3%AE%C3%A2%C3%A9%C3%AD%C3%A1%C3%B3%C3%BA%C3%BD%C3%B1%C2%BD%C2%A7%3F%3FabcABC%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD!%23%C2%A4%25%26()%3D%60%40%C2%A3%24%E2%82%AC%7B%5B%5D%7D%2B%C2%B4%C2%A8%5E~%27-_%2C%3B.txt
And I get it. The constructed link works ok in the latest versions of IE and Chrome but fails in Firefox. After some investigation I've found that in Firefox encodeURIcomponent works differently. Here's actual result in Firefox:
/rest/get?name=%3F%3FabcABC%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD!%23%EF%BF%BD%25%26%28%29%3D%60%40%EF%BF%BD%24%3F{[]}%2B%EF%BF%BD%EF%BF%BD^~%27-_%2C%3B.txt
Visual comparison (Chrome link is on the left and Firefox link is on the right):
I've also tried to copy and paste the valid link (constructed in Chrome) to Firefox and it works ok.
Why do I get different results?
I̶s̶ ̶i̶t̶ ̶a̶ ̶b̶u̶g̶ ̶w̶i̶t̶h̶ ̶̶e̶n̶c̶o̶d̶e̶U̶R̶I̶c̶o̶m̶p̶o̶n̶e̶n̶t̶(̶)̶̶ ̶i̶n̶ ̶F̶i̶r̶e̶f̶o̶x̶?̶
Does Firefox use a different encoding in encodeURIComponent()?
UPD. I've found similar questions (encodeURIComponent behaves differently in browsers for China as location [搜索] and
encodeURIComponent difference with browsers and ä-ö-å characters [äöå]), both without an answer.
UPD.2 Further investigation has shown that the following characters are encoded differently and causing 'File not found' exception on server:
你好
æøåÆØÅäöüïëêîâéíáóúýñ
½§¤
£€
I suppose your problem is not the encodeURIComponent() method. It's the encoding of whatever constructs file.name. Extend your question. How is file.name initialized? Where do the chars come from?
encodeURIComponent() is a native function so Firefox is obviously using some different implementation under the covers.
If you are stuck, then just deliver your own javascript implementation of encodeURIComponent(), then you'll get the same results across browsers. Here's a link how to get an open source copy of that:
encodeURIComponent algorithm source code
Trying to create a simple 'mailto' function using javascript. I just need to be able to send some links (like: See this article bla bla).
Some of the links I need to send include spaces, danish chars. So I've been using the
encodeURI() function.
The problem arises when I try to mail the link (sample code below)
var _encodedPath = encodeURI(path);
var _tempString = "mailto:someemail#somewhere.dk?subject=Shared%20from%20some%20page&body=" + _encodedPath;
If I output the _tempString to the console I get the correct encoded string. However when using the same string in 'mailto' the string loses it's encoding and returns to the way it was before.
Any clue as to why this is?
Thanks in advance :)
The link is decoded when you click it - that's normal. Since you have an http link within a mailto link, it should be encoded twice.
Email clients do their best to make things that look like links clickable. They typically decide where the link ends in a somewhat arbitrary and unpredictable manner.
In email, the best way to keep a link contiguous is to enclose it in angle-brackets like this:
<http://www.example.com/url with spaces>
But this isn't foolproof. Email is fragile and you can't control the content well enough with a mailto link. It might be better to try to reduce the complexity of the url - perhaps by providing or utilizing a url-shortener service. Any url longer than 74 or so characters is likely to be mangled by some email clients.
You should use encodeURIComponent instead of encodeURI.
More information here.
this site helped me solving any troubles with mailto links:
http://www.1ngo.de/web/formular.html
may be it's not the nicest way, but it always works with every browser i know. And it also has very cool algorithm implemented to format the content so that everything should be alright. Just try it and play around a little with code by quoting out parts of the code and you will understand very fast what exactly happens there and how to modify it for your wishes. Althoug it's a little late I hope this one helps anybody checking this question.
althoug it's in german, you just need to copy the code shown there and run it and experiment with it.
In an OSS web app, we have JS code that performs some Ajax update (uses jQuery, not relevant). After the page update, a call is made to the html5 history interface History.pushState, in the following code:
var updateHistory = function(url) {
var context = { state:1, rand:Math.random() };
/* -----> bedfore the problem call <------- */
History.pushState( context, "Questions", url );
/* -----> after the problem call <------- */
setTimeout(function (){
/* HACK: For some weird reson, sometimes something overrides the above pushState so we re-aplly it
This might be caused by some other JS plugin.
The delay of 10msec allows the other plugin to override the URL.
*/
History.replaceState( context, "Questions", url );
}, 10);
};
[Please note: the full code segment is provided for context, the HACK part is not the issue of this question]
The app is i18n'ed and is using URL encoded Unicode segments in the URL's, so just before the marked problem call in the above code, the URL argument contains (as inspected in Firebug):
"/%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9/scope:all/sort:activity-desc/page:1/"
The encoded segment is utf-8 in percent encoding. The URL in the browser window is: (just for completeness, doesn't really matter)
http://<base-url>/%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9/
Just after the call, the URL displayed in the browser window changes to:
http://<base-url>/%C3%98%C2%A7%C3%99%C2%84%C3%98%C2%A3%C3%98%C2%B3%C3%98%C2%A6%C3%99%C2%84%C3%98%C2%A9/scope:all/sort:activity-desc/page:1/
The URL encoded segment is just mojibake, the result of using the wrong encoding at some level. The correct URL would've been:
http://<base-url>/%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9/scope:all/sort:activity-desc/page:1/
This behavior has been tested on both FF and Chrome.
The history interface specs don't mention anything about encoded URL's, but I assume the default standard for URL formation (utf-8 and percent encoding etc) would apply when using URL's in function calls for the interface.
Any idea on what's going on here.
Edit:
I wasn't paying attention to the uppercase H in History - this code is actually using the History.js wrapper for the history interface. I replaced with a direct call to history.pushState (notice the lowercase h) without going through the wrapper, and the code is working as expected as far as I can tell. The issue with the original code still stands - so an issue with the History.js library it seems.
Update
As Doug S explains in the comments below, the latest version of History.js includes a fix for this behaviour. He also found that my solution caused double-encoding when used in browsers (such as IE 9 and below) which require the hash fallback, so I recommend that instead of using the fix detailed below, just download the latest version.
I've kept my original answer below, since it does explain what's going on in much more detail.
Basel found a resolution of sorts, but there's still some confusion about what's happening under the hood. This answer goes into detail about the problem and suggests a better fix. (You can skip straight to the fix if you want.)
The problem
First, open your browser's JS console and run this:
window.encodeURI(window.unescape('%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9'))
Does that look familiar? It should—that's what your URL is being mangled to. The problem lies in the implementation of History.unescapeString, specifically this line:
tmp = window.unescape(result);
window.unescape is a DOM Level 0 function—which is to say, an unstandardised relic from the hoary days of Netscape 2. It uses the escaping rules defined in RFC 2396, according to which characters outside of the unreserved range (alphanumerics and a small set of punctuation symbols) are encoded as octets.
This works fine for the US-ASCII range, but not all (indeed, the vast majority) of the characters in UTF-8 can be represented in a single byte. Since URIs do not have a built-in way of representing the character set being used, window.unescape just assumes each character maps to a single octet and blithely mangles any that don't.
In this example, the first letter in your URL is the Arabic letter alef (ا), represented by two bytes: 0xD8 0xA7. window.unescape interprets these as two separate characters: 0x00 0xD8 (Ø—capital O with stroke) and 0x00 0xA7 (§—section sign).
This is a known issue with History.js.
The fix
As noted above by the asker, the issue can be sidestepped by using the native implementation of the History API instead of the History.js wrapper, i.e. history.pushState instead of History.pushState.
This works for browsers that support the History API, but loses the benefit of having a polyfill for those that don't. Fortunately, there's a better fix. Open up the History.js source you're referencing and find this line (~1059 in my copy):
tmp = window.unescape(result);
Replace it with:
tmp = window.unescape(encodeURIComponent(result));
Or, if you're using the compressed source, replace a.unescape(c) with a.unescape(encodeURIComponent(c)).
To test this change, I ran the History.js HTML5 jQuery test suite on a local web server inside an Arabic-named directory. Before making the change, test 14 fails; after the change, all tests passed.
Credit
Though I found the problem and solution independently, Damien Antipa deserves credit for finding it first and making a pull request with the fix.
I'm still able to reproduce this in the following case:
History.pushState(null, null, "?" + some_Unicode_String_Or_A_String_With_Whitespace);
document.location.hash += "&someStuff";
In this case the _suid parameter gets removed and &someStuff as well. If the string is not unicode or doesn't have whitespaces (so no % chars) - this does not happen.
This workaround worked for me:
History.pushState(null, null, "?" + some_Unicode_String_Or_A_String_With_Whitespace + "&someStuff");