I'm dealing with a JSON file that I cannot modify; I have to keep it AS IS.
It contains text with all the apostrophes converted to â€™, and other special chars here and there...
What is that? Unicode? How can I convert it to the regular apostrophe?
I already placed the UTF-8 META tag in the header, but it doesn't seem to change anything...
What MIME type is your JSON response being sent with? (Look at the headers in Firebug or the Developer Console.) It seems that one of these steps is using a different encoding:
The JSON string generated by the web server
The charset in the MIME type sent along with the response
The MIME type of your HTML page
The MIME type of your JavaScript code
If you supply actual code, or better yet a working, reproducible test case, the community can better help you.
What is that?
It is an attempt to interpret data stored in one character encoding as data stored in a different character encoding.
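For instance, here is a minimal sketch of that failure mode using the browser's TextEncoder/TextDecoder APIs (the codec pair is an assumption, but it reproduces the â€™ from the question):
// Encode a right single quote (U+2019) as UTF-8, then decode those
// same bytes as if they were Windows-1252.
const bytes = new TextEncoder().encode('\u2019'); // Uint8Array [0xE2, 0x80, 0x99]
const garbled = new TextDecoder('windows-1252').decode(bytes);
console.log(garbled); // "â€™", the mangled apostrophe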
To ensure everything displays correctly you need to:
Pick an encoding (UTF-8 is a good bet)
Store everything in that encoding
Configure your editor to use it!
Configure your database (if applicable) to use it!
Ensure any server side code you use expects UTF-8 input and gives UTF-8 output
Configure your web server to include charset=utf-8 on the Content-Type HTTP header (a sketch of this step follows below)
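For that last step, here is a minimal sketch assuming a plain Node.js server (your actual stack may differ):
// Announce UTF-8 in the Content-Type header of a JSON response.
const http = require('http');
http.createServer((req, res) => {
  res.setHeader('Content-Type', 'application/json; charset=utf-8');
  res.end(JSON.stringify({ text: 'it\u2019s a proper apostrophe' }));
}).listen(8080);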
The W3C has a good introductory article on the subject, which has links to lots of useful further reading at the end.
I'm working on something that will read a user's text messages and export them to a CSV file, which they can then download. The messages are being retrieved from a third-party web interface; I am essentially using JS to grab the HTML of each message and compiling it as needed. The content of each message is added to a variable which, once all messages are gathered, is given to a new Blob, which is then downloaded.
The problem I am having is that, in this web interface, emoji are represented as images rather than characters. Thus, when writing a message containing an emoji to a file, the result looks like this:
"Blah blah blah <img height="18px" width="18px" class="emoji adjustedSpriteForMessageDisplay spriteEMOJI sprite-1f612" data-textvalue="%F0%9F%98%92" src="assets/blank.gif">"
Now, from this image, we can get 2 workable values:
The UTF-8 hex value
F09F9892
and the Unicode codepoint (I may be referring to this wrong, I don't know much about encoding).
U+1f612
Now, what I want to do is take either of these values (whichever works better) and write it to the CSV file as the character itself, so that, when viewing the CSV file in a text editor or what have you, it would appear as 😒.
I have no idea where to even start with this, though. Maybe it's as simple as throwing some syntax around the character values, but I haven't been able to get anything from Google, because I'm not familiar enough with encoding to know what to search for.
I suggest preprocessing the data as you grab it from the webpage instead of extracting it from the string afterwards.
You can then use decodeURIComponent() to decode the percent-encoded string:
decodeURIComponent('%F0%9F%98%92')
Combine that with jQuery to access the data-textvalue attribute:
decodeURIComponent($(element).data('textvalue'))
I created a simple example on JSFiddle.
For some reason the emoji doesn't render correctly in the result screen in my browser, but that is a font issue. When looking at the result using a DOM inspector (or copying the text into a different application), the result is shown with a smiley.
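For reference, both of the values extracted from the markup lead to the same character; a short sketch using the hex string and code point quoted in the question:
// Both values recovered from the <img> tag yield the same emoji:
decodeURIComponent('%F0%9F%98%92'); // "😒", from the percent-encoded UTF-8 bytes in data-textvalue
String.fromCodePoint(0x1F612);      // "😒", from the code point in the sprite-1f612 class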
The CSV file format does not carry character encoding information, so Excel usually falls back to a legacy default (ASCII or the local ANSI code page) rather than UTF-8.
https://en.wikipedia.org/wiki/Comma-separated_values#General_functionality
Microsoft Excel mangles Diacritics in .csv files?
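A common workaround from that linked question, sketched here since your actual Blob code isn't shown: prepend a UTF-8 byte order mark so Excel recognizes the encoding of the downloaded file.
// A leading BOM (U+FEFF) nudges Excel into reading the CSV as UTF-8.
const csv = 'message\n"Blah blah blah \u{1F612}"\n';
const blob = new Blob(['\uFEFF', csv], { type: 'text/csv;charset=utf-8' });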
In an external JavaScript file I have a function that is used to append text to table cells (within the HTML doc that the JavaScript file is added to). That text can sometimes contain Finnish characters (such as ä), and it is passed as an argument to my function:
content += addTableField(XML, 'Käyttötarkoitus', 'purpose', 255);
The problem is that diacritics such as "ä" get converted to some other bogus characters, such as "�". I see this when viewing the HTML doc in a browser. This is obviously not desirable, and is quite strange as well since the character encoding for the HTML doc is UTF-8.
How can I solve this problem?
Thanks in advance for helping out!
The file that contains content += addTableField(XML, 'Käyttötarkoitus', 'purpose', 255); is not saved in UTF-8 encoding.
I don't know what editor you are using but you can find it in settings or in the save dialog.
If you can't get this to work you could always write out the literal code points in javascript:
content += addTableField(XML, 'K\u00E4ytt\u00f6tarkoitus', 'purpose', 255);
credit: triplee
To check out the character encoding announced by a server, you can use Firebug (in the Info menu, there’s a command for viewing HTTP headers). Alternatively, you can use online services like Web-Sniffer.
If the headers for the external .js file specify a charset parameter, you need to use that encoding, unless you can change the relevant server settings (perhaps a .htaccess file).
If they lack a charset parameter, you can specify the encoding in the script element, e.g. <script src="foo.js" charset="utf-8">.
The declared encoding should of course match the actual encoding, which you can normally select when you save a file (using “Save As” command if needed).
The character encoding of the HTML document does not apply to any external resource.
You will need to deliver the script file with UTF-8 character encoding. If it was already saved as UTF-8, then your server config is at fault.
I am reading a local text file using an <input type="file"> element and FileReader.readAsText(). The problem arises when the local text file contains characters like Ü. In that case it is converted to ï¿½. Of course I can set the encoding manually, e.g. to iso8859-1, as a parameter of FileReader.readAsText(File, encoding), but the thing is that I have no clue what kind of encoding the user has set on his side.
My question is whether there is a way to determine the encoding on the client machine.
Best regards
kkris1983
You'd need to analyze the raw bytes of the text file to make a best guess at what the encoding is. There aren't any libraries for this in JavaScript AFAIK, but you could port one from another language.
Since that isn't very robust, you should also provide a manual override, like a "Characters not showing correctly? Change encoding" option.
You can also have smart defaults, for example ISO-8859-1 if you detect it's a Western Windows machine.
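As a rough sketch of both ideas combined (strict UTF-8 detection plus a Western Windows default; the fallback label is an assumption about your audience):
// Read raw bytes, try strict UTF-8 first, fall back to a smart default.
function readTextWithGuess(file, callback) {
  const reader = new FileReader();
  reader.onload = () => {
    const bytes = new Uint8Array(reader.result);
    try {
      // fatal: true makes decode() throw on bytes that are not valid UTF-8
      callback(new TextDecoder('utf-8', { fatal: true }).decode(bytes), 'utf-8');
    } catch (e) {
      // assumed smart default for a Western Windows machine
      callback(new TextDecoder('windows-1252').decode(bytes), 'windows-1252');
    }
  };
  reader.readAsArrayBuffer(file);
}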
First of all, it is a userscript, and I can't change the server-side encoding.
My problem is that when using encodeURIComponent() for encoding POST params (later sent via xhr.setRequestHeader), the characters are encoded in UTF-8, but the server needs to receive ISO-8859-1 data. Is there an alternative to encodeURIComponent() that would encode in ISO-8859-1?
To make sure you understand, here is an example:
A classic form on the website sends é like this: yournewmessage:%E9
Ajax via xhr.send('yournewmessage='+encodeURIComponent('é')) sends this: yournewmessage:%E9%80%80
The server needs the former. Thanks to anyone who can help me.
So, I've since figured out this problem. What I did was search for an equivalence between UTF-8 and ISO-8859-1; what I found was one between UTF-8 and CP1252 (Windows-1252), so there are two conversions: UTF-8 to CP1252, and CP1252 to ISO-8859-1 (the two having a lot of similarities).
http://pastebin.com/jTDqR2PQ
Ugly code, comments left in French, and an inelegant solution, but I feel bad seeing this question unanswered when I have actually found a solution that works.
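For reference, a compact sketch of the same idea (not the pastebin code): percent-encode each code point as a single Latin-1 byte, which only works for characters at or below U+00FF.
// An encodeURIComponent() alternative for ISO-8859-1 payloads (sketch).
function encodeLatin1Component(str) {
  let out = '';
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (/[A-Za-z0-9\-_.!~*'()]/.test(ch)) {
      out += ch; // unreserved characters pass through unescaped
    } else if (cp <= 0xFF) {
      out += '%' + cp.toString(16).toUpperCase().padStart(2, '0');
    } else {
      throw new RangeError('Not representable in ISO-8859-1: ' + ch);
    }
  }
  return out;
}
encodeLatin1Component('é'); // "%E9", matching what the classic form sends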
I have a JavaScript script which is calling a PHP page to supply an AJAX form with suggestions. The suggestions are returned fine by the PHP page, but for some reason, when I set the responseText of the request object as an element in my HTML page, all the special characters (i.e. á or ã) show up as a question mark. Is there a function I must run on the response text of the request to make sure these are read properly?
Thanks.
If you are not serving your HTML pages as UTF-8, the browser will guess an encoding, typically a single-byte Windows codepage depending on the user's locale.
But this doesn't happen for AJAX. With XMLHttpRequest, unless you specifically state an encoding in the Content-Type: ...; charset= parameter, the browser will treat it as UTF-8. That means if you are actually serving Windows code page 1252 (Western European) content, you will get an invalid UTF-8 sequence and consequent question mark.
You don't want to be using a non-UTF-8 encoding! Make sure you are using UTF-8 throughout your application. Serve all your pages with Content-Type: text/html; charset=utf-8, store your data in UTF-8 tables, use mysql_set_charset() to choose UTF-8, etc.
In any case consider passing AJAX responses using JSON. The function json_encode() will create a JSON string that uses JavaScript escape sequences for non-ASCII characters, which avoids any problem of encoding mismatch. Also this is easier to extend to add functionality than returning raw HTML.
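To illustrate why the escaping helps, a small sketch (the payload mimics json_encode()'s default output):
// The \u-escaped payload is pure ASCII, so it survives any single-byte
// charset mix-up between server and browser.
const payload = '{"name":"Jo\\u00e3o"}'; // what json_encode() emits by default for "João"
console.log(JSON.parse(payload).name);   // "João"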
I would try, in your PHP script, to encode everything as HTML entities.
This can be easily tested by doing something like this before returning the results to JavaScript:
$results = htmlentities($htmlstring);
There's also the htmlspecialchars function you might try.
More about this here:
http://php.net/manual/en/function.htmlentities.php