I have read a bunch of different StackOverflow answers and similar questions, but none of them have been any help.
I am using JavaScript to make an AJAX request to fetch some data in JSON form.
I am receiving JSON data such as the following:
\u0093title\u0094
Now, I believe JSON is delivered in UTF-8 by default; however, these characters \u0093 and \u0094 are, I believe, Latin-1 control characters meant to represent opening and closing speech marks.
The issue is that when I make the GET with JavaScript, the response ends up being something like:
“title”
I have tried doing encodeURIComponent(data.body) and it produces the same result.
This is extremely annoying; has anyone else encountered this issue before?
EDIT:
Imagine the following raw JSON data; this is what I am going to retrieve:
\u0093title\u0094
So, for example, I run the following piece of jQuery/JavaScript to get the above JSON data:
$.ajax({
    type: "GET",
    url: "myurl",
    success: function(data){
        console.log(data.body);
    }
});
The following is printed to the console (which looks fine, except it is omitting the control characters):
title
And then I encode and decode it, which should cancel out and change nothing:
console.log(decodeURIComponent(encodeURIComponent( data.body )))
Except this ends up printing the following:
“title”
It has picked up those extra Â characters as well as the “ and ”, despite none of these showing up in the console before the encode/decode step.
First of all, code points U+0093 and U+0094 are not curved quotes; they are C1 control characters (SET TRANSMIT STATE and CANCEL CHARACTER). The curved-quote code points are U+201C for “ and U+201D for ”. You still have another problem:
This looks like a classic case of decoding with the wrong character set. The program doing the decoding saw the bytes C2 93, the UTF-8 encoding of code point U+0093. It is not treating the input as UTF-8, or it would have translated that sequence back to code point U+0093. Instead, it is decoding it as Windows code page 1252, which turns C2 into Â, 93 into “ and 94 into ”.
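You can reproduce that mis-decoding with the standard TextDecoder API. A quick sketch (the byte values follow from the explanation above):

// The UTF-8 bytes for \u0093 title \u0094
var bytes = Uint8Array.from([0xC2, 0x93, 0x74, 0x69, 0x74, 0x6C, 0x65, 0xC2, 0x94]);
new TextDecoder('utf-8').decode(bytes);        // "\u0093title\u0094" (correct)
new TextDecoder('windows-1252').decode(bytes); // "Â“titleÂ”" (the mojibake: C2 becomes Â)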
I can only think of two reasons why it's doing that, and both involve your browser. It's not really a problem with JavaScript not using UTF-8, because this works:
document.getElementById('result').innerHTML = '\u201CHello\u201D';
<pre id="result"></pre>
The problem could be the HTTP response: your browser may be reading the HTTP response as Windows code page 1252. The other possibility is that your browser is presenting the data incorrectly (which, now that I think of it, doesn't make much sense).
Try setting the Content-Type of your HTTP response by sending this HTTP header:
Content-Type: application/json; charset=utf-8
And I insist that you put
<meta charset="utf-8">
in your document.
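If your backend happens to be Node/Express (an assumption on my part; the question doesn't say what serves the JSON), sending that header would look roughly like this:

// Hypothetical Express handler; adjust for whatever actually serves the JSON
app.get('/myurl', function(req, res){
    res.set('Content-Type', 'application/json; charset=utf-8');
    res.send(JSON.stringify({ body: '\u201Ctitle\u201D' }));
});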
I want to be able to encode and decode all the following characters using JavaScript or jQuery...
~!##$%^&*()_+|}{:"?><,./';[]\=-`
I tried to encode them using this...
var cT = encodeURI(oM); // oM holds the special characters
cT = cT.replace(/[!"#$%&'()*+,.\/:;<=>?#[\\\]^`{|}~]/g, "\\\\$&");
Which does encode them, or escape them rather, but then I am trying to do the reverse with this...
decodeURIComponent(data.convo.replace(/\+/g, ' '));
But, it's not coming out in any way desired.
I've built a chat plugin for jQuery, but the script crashes if someone enters a special character. I want the special characters to get encoded, and then, when they get pulled out of the database, they should be decoded. I tried using urldecode in PHP before the data is returned to the AJAX request, but it's coming out horribly wrong.
I would think that there exists some function to encode and decode all special characters.
Oh, one caveat: I'm wrapping each message in HTML elements, so I think the decoding needs to be done server-side, before the message is wrapped, or else it needs to know when to ignore valid HTML tags and decode only the characters the user actually typed.
Am I encoding/escaping them wrong to begin with?
Is that why the results are horrible?
This is pretty simple in JavaScript:
// Note that I have escaped the " in the string, so it still gets processed
var exampleInput = "Hello there h4x0r ~!##$%^&*()_+|}{:\"?><,./';[]\=-`";
var encodedInput = encodeURI(exampleInput);
var decodedInput = decodeURI(encodedInput);
console.log(exampleInput);
console.log(encodedInput);
console.log(decodedInput);
Just encode and decode the input. If something else is breaking in your script, it means you are processing content you have not stripped away. It's hard to provide a more exact answer, but as you can see, encoding and decoding to the URI standard does not crash anything; only improper processing of this content would cause issues.
When you output the content in HTML you should be encoding the HTML entities.
Reference this thread, Encode html entities in javascript, if you need to encode for safe display inside HTML.
Additional references on how HTML entities work can be found here: W3 Schools - HTML Entities and W3 Schools - HTML Symbols.
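A minimal sketch of the usual DOM-based approach from that thread, assuming you are in a browser context:

// Encode HTML entities by letting the browser do the escaping
function escapeHtml(str) {
    var div = document.createElement('div');
    div.textContent = str;  // set as plain text...
    return div.innerHTML;   // ...read back as escaped HTML
}
escapeHtml('<b>"hi" & bye</b>'); // '&lt;b&gt;"hi" &amp; bye&lt;/b&gt;'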
I know this sounds bad, but it's necessary.
I have an HTML form on a site with the UTF-8 charset, which is sent to a server that works with the ISO-8859-1 charset. The problem is that the server doesn't correctly understand characters we use in Spain, like à, á, è, é, ì, í, ò, ó, ù, ú, ñ, ç and so on. So if I search for something like artículo, it answers "nothing found" for artÃculo.
I send the form with ajaxform (http://malsup.com/jquery/form/), and the code looks like this:
$(".form-wrap, #pagina").on("submit", "form", function(event){
event.preventDefault();
$(this).ajaxSubmit({
success: function(data){
$("#temp").html(data);
//Handle data in #temp div
$("#temp").html('');
}
});
return false;
});
My problem is: I have no access to the search server, and I cannot change the whole website to ISO-8859-1 encoding, since this would break other stuff.
I have already tried these scripts, with no success:
http://phpjs.org/functions/utf8_decode/
http://phpjs.org/functions/utf8_encode/
http://ecmanaut.blogspot.com.es/2006/07/encoding-decoding-utf8-in-javascript.html
Could I be doing everything wrong?
Edit: the escape function isn't useful for me, as it turns these special chars into %-prefixed codes that are useless to the server; it then searches for art%EDculo.
Edit: using the encodeURIComponent function, the server understands art%C3%ADculo. P.S. I just use the word artículo for testing, but the solution should cover all special chars.
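For reference, the difference between the two functions mentioned in the edits:

escape('artículo');             // "art%EDculo" (%ED is í in ISO-8859-1; escape() is deprecated)
encodeURIComponent('artículo'); // "art%C3%ADculo" (C3 AD is í in UTF-8)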
I have a HTML form on a site with utf-8 charset which is sent to a server which works with the iso-8859-1 charset.
You can try setting the form's accept-charset:
<form accept-charset="iso-8859-1">
....
</form>
Finally, the person who runs that server gave us access to it, and we made a UTF-8-compatible PHP script. So I don't care any more.
I am trying to use the League of Legends API and request data on a certain user. I use the line
var user = getUrlVars()["username"].replace("+", " ");
to store the username. However, when I do the XMLHttpRequest with that username, it'll put %20 instead of a space.
y.open("GET", "https://na.api.pvp.net/api/lol/na/v1.4/summoner/by-name/"+user, false);
Edit: when I run this code with a user that has no space in their name, it works; however, when they have a space in their name, it says the user is undefined.
For example, if I was looking for the user "the man", it would do a get at
https://na.api.pvp.net/api/lol/na/v1.4/summoner/by-name/the%20man
But the correct request URL is
https://na.api.pvp.net/api/lol/na/v1.4/summoner/by-name/the man
When you're creating a URL, you should use encodeURIComponent to encode all the special characters properly:
y.open("GET", "https://na.api.pvp.net/api/lol/na/v1.4/summoner/by-name/"+encodeURIComponent(user), false);
Actually there are no "spaces" in the summoner names on Riot's side. So:
https://na.api.pvp.net/api/lol/na/v1.4/summoner/by-name/the man
Becomes:
https://na.api.pvp.net/api/lol/na/v1.4/summoner/by-name/theman
Have a look at this: https://developer.riotgames.com/discussion/community-discussion/show/jomoRum7
I am unsure how + is handled (in fact, I don't think you're able to have a + in your name). All you have to do is remove the spaces.
For "funny" characters, just request them with the funny character in place; Riot returns them fine.
https://euw.api.pvp.net/api/lol/euw/v1.4/summoner/by-name/Trøyer?api_key=<insert your own>
will auto-correct to
https://euw.api.pvp.net/api/lol/euw/v1.4/summoner/by-name/Tr%C3%B8yer?api_key=<insert your own>
and you generally don't even have to decode it. (I used JS as my language to fetch it; if you use something else, your results may require the decoded value.)
What you're experiencing is correct behaviour and is called URL encoding. HTTP requests have to conform to certain standards. The first line is always made up of three parts delimited by a space:
Method (GET, POST, etc.)
Path (i.e. /api/lol/na/v1.4/summoner/by-name/the%20man)
HTTP version (HTTP/1.1, HTTP/1.0, etc.)
This is usually followed by HTTP headers, which I'll leave out for now since they are beyond the scope of your question (if interested, read https://www.rfc-editor.org/rfc/rfc7230). So a normal request looks like this:
GET /api/lol/na/v1.4/summoner/by-name/the%20man HTTP/1.1
Host: na.api.pvp.net
User-Agent: Mozilla
...
With regard to your original question, the reason the library is URL-encoding the space to %20 is that you cannot have a space character in the request line. Otherwise you would throw off most HTTP message parsers, because man would be read where the HTTP version belongs, like so:
GET /api/lol/na/v1.4/summoner/by-name/the man HTTP/1.1
Host: na.api.pvp.net
User-Agent: Mozilla
...
In most cases, servers will return a 400 Bad Request response, because they wouldn't understand what HTTP version man refers to. However, there is nothing to fear here: most server-side applications/frameworks automatically decode %20 or + to a space before processing the data in the HTTP request. So even though your URL looks unusual, the server side will process it as the man.
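For illustration, a sketch assuming an Express server (hypothetical; you don't control Riot's side, but mainstream frameworks behave the same way):

// Route parameters arrive already percent-decoded
app.get('/summoner/by-name/:name', function(req, res){
    console.log(req.params.name); // a request for /summoner/by-name/the%20man logs "the man"
    res.send('ok');
});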
Finally, one last thing to note: you shouldn't be using String.replace() to URL-decode your values. Instead, you should be using decodeURIComponent() and encodeURIComponent() for decoding and encoding strings, respectively. For example:
var user = getUrlVars()["username"].replace("+", " ");
becomes
var user = decodeURIComponent(getUrlVars()["username"]);
This ensures that usernames containing special characters (like /, which would be URL-encoded as %2f) are also properly decoded. (Note that decodeURI() would leave reserved characters such as %2f encoded, which is why decodeURIComponent() is the right choice for a single query value.) Hope this helps!
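A quick illustration of that difference (the value is made up):

decodeURIComponent('the%20man%2Fsmith'); // "the man/smith"
decodeURI('the%20man%2Fsmith');          // "the man%2Fsmith" (reserved characters stay encoded)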
I'm writing a Google Chrome extension that builds upon the myanimelist.net REST API. Sometimes the XMLHttpRequest response text contains Unicode characters.
For example:
<title>Onegai My Melody Sukkiri�</title>
If I create a HTML node from the text it looks like this:
Onegai My Melody Sukkiri�
The actual title, however, is this:
Onegai My Melody Sukkiri♪
Why is my text not correctly rendered and how can I fix it?
Update
Code: background.html
I think these are the crucial parts:
function htmlDecode(input){
    var e = document.createElement('div');
    e.innerHTML = input;
    return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}

function xmlDecode(input){
    var result = input;
    result = result.replace(/&lt;/g, "<");
    result = result.replace(/&gt;/g, ">");
    result = result.replace(/\n/g, "<br>");
    return htmlDecode(result);
}
Further:
var parser = new DOMParser();
var xmlText = response.value;
var doc = parser.parseFromString(xmlDecode(xmlText), "text/xml");
<title>Onegai My Melody Sukkiri�</title>
Oh dear! Not only is that the wrong text, it's not even well-formed XML. &acirc; and &ordf; are HTML entities which are not predefined in XML, and then there's an invalid UTF-8 sequence (one high byte, presumably originally 0x99) between them.
The problem is that myanimelist are generating their output ‘XML’ (but “if it ain't well-formed, it ain't XML”) using the PHP function htmlentities(). This tries to HTML-escape not only the potentially-sensitive-in-HTML characters <&"', but also all non-ASCII characters.
This generates the wrong characters because PHP defaults to treating the input to htmlentities() as ISO-8859-1 instead of UTF-8, which is the encoding they're actually using. But it was the wrong thing to begin with, because the HTML entity set doesn't exist in XML. What they really wanted to use was htmlspecialchars(), which leaves the non-ASCII characters alone, only escaping the really sensitive ones. Because those are the same ones that are sensitive in XML, htmlspecialchars() works just as well for XML as for HTML.
htmlentities() is almost always the Wrong Thing; htmlspecialchars() should typically be used instead. The one place you might want to encode non-ASCII bytes to entity references would be when you're targeting pure ASCII output. But even then htmlentities() fails, because it doesn't produce character references (&#...;) for the characters that don't have a predefined entity name. Pretty useless.
Anyway, you can't really recover the mangled data from this. The � represents a byte sequence that was UTF-8-undecodable to the XMLHttpRequest, so that information is irretrievably lost. You will have to persuade myanimelist to fix their broken XML output as per the above couple of paragraphs before you can go any further.
Also, they should be returning it as Content-Type: text/xml, not text/html as at the moment. Then you could pick up responseXML directly from the XMLHttpRequest object instead of messing about with DOMParser.
So, I've come across something similar to what's going on here at work, and I did a bit more research to confirm my hypothesis.
If you take a look at the returned value you posted above, you'll notice the tell-tale entity "&acirc;". 99% of the time when you see this entity, it means you have a character-encoding issue (typically UTF-8 bytes being interpreted as ISO-8859-1).
The first thing I would test is whether you can force a character encoding in the API return. (It's a long shot, but you could look.)
Second, I'd try to force a character encoding onto the data returned (I know there's a .htaccess override, but I don't know what's allowed in Chrome extensions, so you'll have to research that).
What I believe is going on is that when you create the node with the data, you don't have a character encoding set on the document, and browsers (typically, in my experience) default to ISO-8859-1. So check to make sure it's not your document that's the problem.
Finally, if you can't find the source of the character-encoding problem (or can't prevent it), you'll have to write a conversion table to replace the malformed values you're getting with the ones you want; JS's "replace" should be fine (http://www.w3schools.com/jsref/jsref_replace.asp).
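A minimal sketch of that conversion-table idea; the map entries are illustrative, and you'd fill in whatever mangled sequences you actually observe:

// Map known mojibake sequences back to the characters they were meant to be.
// '\u00E2\u0099\u00AA' is what U+266A (♪) becomes when its UTF-8 bytes E2 99 AA
// are read as ISO-8859-1.
var mojibakeMap = {
    '\u00E2\u0099\u00AA': '\u266A', // ♪
    '\u00C3\u00A9': '\u00E9'        // é (UTF-8 C3 A9 read as ISO-8859-1)
};
function fixMojibake(str) {
    for (var bad in mojibakeMap) {
        str = str.split(bad).join(mojibakeMap[bad]);
    }
    return str;
}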
You can't just use a simple search-and-replace to fix an encoding issue, since the culprits are Unicode characters, not characters typed on a keyboard.
Your data must be stored on the server in UTF-8 format if you are planning on retrieving it via AJAX. This problem is probably due to someone pasting in characters from MS Word, which uses a completely different encoding scheme (ISO-8859).
If you can't fix the data, you're kinda screwed.
For more details, see: UTF-8 vs. Unicode
The host that the majority of my script's users are on forces a text ad at the end of every page. This code is sneaking into my script's AJAX responses. It's an HTML comment, followed by a link to their signup page. How can I strip this comment and link from the end of my AJAX responses?
Typically those scripts look for text/html content and just shove the code into the stream. Have you tried setting the content type to something else, such as text/json, text/javascript, or text/plain, to see if it gets by without the injection?
Regular Expressions
My first suggestion would be to find a regular expression that can match and eliminate that trailing information. I'm not the greatest at writing regular expressions, but here's an attempt:
var response = "I am the data you want. <strong>And nothing more</strong> <!-- haha -> <a href='google.com'>Sucker!</a>";
// Use a regex literal, not a quoted string, or replace() will search for that literal text
var myStuff = response.replace(/\s+?<!--.*>$/gi, "");
Custom Explosion String
A quick and easy solution would be to place a sentinel string ("spl0de!") at the end of your message, then split the AJAX response on it and only handle what comes before it.
var myStuff = response.split("spl0de!")[0];
This would remove anything anybody else sneaks onto the end of your data.
You see a lot of this with hand-generated XML; it isn't valid, so consumers try to fix up the broken XML with hand-rolled regexes. It's completely the wrong approach.
You need to fix this at the source: the broken host.