Parsing unicode in unescaped XML

I'm trying to parse some poorly formatted XML.
I say poorly formatted because, as everyone knows, you're not supposed to have unescaped ampersands in an XML file.
The problem is that I need to collect some Unicode phrases from the XML file, and I need them to stay as close to the original text as possible. You can reproduce the issue in your browser console:
console.log($("<test>â</test>").text())
// Outputs 'â' instead of desired 'â'
I've tried every combination of escape(), unescape(), encodeURI(), and decodeURI() I can think of.
I've also tried both settings of jQuery's ajax({processData: bool}) flag. Every answer I've found points to these solutions, and none of them seem to work.
How can I modify the above code to output the original XML content?

Use new Option(yourUnescapedXml).innerHTML. So to answer your question directly,
console.log($(`<test>${new Option('â').innerHTML}</test>`).text())
This creates an HTMLOptionElement, then immediately reads its (escaped) innerHTML.

Related

Unicode characters cannot be decoded

I use browserless.js (headless Chrome) to fetch the html code of a website, and then use a regular expression to find certain image URLs.
One example is the following:
https://vignette.wikia.nocookie.net/moviepedia/images/8/88/Adrien_Brody.jpg/revision/latest/top-crop/width/360/height/450?cb\u003d20141113231800\u0026path-prefix\u003dde
The URL contains Unicode escape sequences such as \u003d, which should be decoded (in this case to =). I want to include these images in a site, and without decoding some of them cannot be displayed (the one above, for example, shows broken-image.webp when you paste the URL).
I have tried the following, but none of them work:
JSON.parse(JSON.stringify(...))
String.prototype.normalize()
decodeURIComponent
Curiously, a regular expression for "\u003d" (i.e. "\\u003d" in JS) does not match the string above, but "u003d" does.
This is all very odd, and my current guess is that browserless does some formatting behind the scenes: when I console.log the URL and copy-paste it somewhere else, every method mentioned above decodes it correctly.
I hope someone can help me with this.

Just to mark this one as answered. Thomas replied:
JSON.parse(`"${url}"`)
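A minimal sketch of why this works, assuming the fetched text really contains a literal backslash followed by u003d. The sample URL and the decodeUnicodeEscapes helper are illustrative and not from the original post. Wrapping the text in double quotes turns it into a JSON string literal, and JSON.parse resolves the \uXXXX escapes. It may also explain the regex observation: new RegExp("\\u003d") compiles to /\u003d/, which matches "=" rather than a backslash.

```javascript
// String.raw keeps the backslashes literal, mimicking the scraped HTML.
const raw = String.raw`https://example.com/a.jpg?cb\u003d2014\u0026path-prefix\u003dde`;

// Approach from the answer: treat the text as a JSON string literal.
// This throws if the text contains an unescaped double quote or other invalid JSON.
const viaJson = JSON.parse(`"${raw}"`);

// Alternative that only rewrites \uXXXX sequences and never throws.
function decodeUnicodeEscapes(text) {
  return text.replace(/\\u([0-9a-fA-F]{4})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16)));
}

console.log(viaJson);
console.log(decodeUnicodeEscapes(raw));

// To match the literal escape with a regex, the backslash itself must be escaped.
console.log(/\\u003d/.test(raw));             // true
console.log(new RegExp('\\u003d').test(raw)); // false: this pattern means "="
```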

How to Split a cell in JS API for MS365 excel labscript

I am trying to do a simple split of a cell/string in MS365 using Labscript.
Neither Split() nor LEFT seems to exist in Labscript, which is another reason I am not sure why MS claims that JavaScript is the language used for Labscript.
Thanks!

I just realised that substr() works (and maybe even split()). These methods do not show up as suggestions while you write the code, though, unless the code has been executed first and the value has been verified to be a string (I was reading a cell that I wanted to split).
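A minimal sketch in plain JavaScript, assuming the cell value has already been read into a variable (how it is read depends on the scripting API and is not shown here). Converting the value to a string first guarantees that split() and substring() are available; splitCellValue is a hypothetical helper name.

```javascript
// Hypothetical helper: a cell value may be a string, number, or boolean,
// so convert it to a string before using string methods.
function splitCellValue(value, delimiter) {
  return String(value).split(delimiter);
}

const parts = splitCellValue('Smith, John', ', ');
console.log(parts); // [ 'Smith', 'John' ]

// Rough equivalent of Excel's LEFT(text, 3):
console.log(String('Smith, John').substring(0, 3)); // Smi
```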

Convert strange unicode characters into emoji code

I have a DLL that I suspect does not support UTF-8 for emojis (it's an addon for mIRC).
This DLL turns mIRC (a text-based chat program) into full HTML/JavaScript.
My problem is that when I receive a message containing emojis, they are output like this:
ðŸ˜€
Four "strange" characters, because they are not converted properly, I suppose.
I thought about writing a JavaScript function that matches them and changes them back to the correct emoji code (maybe using a <span>, maybe not, since the following kind of code is translated correctly into a smiley: 😈).
So, is there any way in JavaScript to catch erroneous characters like ðŸ˜€ and convert them into, for example, 😈? (Those are not the same emoji.)
As a concrete example:
:grinning face: U+1F600
output this ðŸ˜€
Sending this 😀 ends up outputting a square instead of the correct smiley, so even that does not work for all of them.
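One possible approach, assuming the four characters are the UTF-8 bytes of the emoji misread as Windows-1252. This matches the example: U+1F600 is F0 9F 98 80 in UTF-8, and those bytes display as ð, Ÿ, ˜ and € in Windows-1252. The sketch maps each character back to its byte and decodes the bytes as UTF-8. fixMojibake and CP1252_TO_BYTE are hypothetical names, and TextDecoder may be missing in older embedded browser engines.

```javascript
// Windows-1252 assigns printable characters to bytes 0x80-0x9F.
// Map those characters (by code point) back to their byte values.
const CP1252_TO_BYTE = {
  0x20AC: 0x80, 0x201A: 0x82, 0x0192: 0x83, 0x201E: 0x84, 0x2026: 0x85,
  0x2020: 0x86, 0x2021: 0x87, 0x02C6: 0x88, 0x2030: 0x89, 0x0160: 0x8A,
  0x2039: 0x8B, 0x0152: 0x8C, 0x017D: 0x8E, 0x2018: 0x91, 0x2019: 0x92,
  0x201C: 0x93, 0x201D: 0x94, 0x2022: 0x95, 0x2013: 0x96, 0x2014: 0x97,
  0x02DC: 0x98, 0x2122: 0x99, 0x0161: 0x9A, 0x203A: 0x9B, 0x0153: 0x9C,
  0x017E: 0x9E, 0x0178: 0x9F,
};

// Hypothetical helper: re-encode the text as Windows-1252 bytes,
// then decode those bytes as UTF-8.
function fixMojibake(text) {
  const bytes = [];
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp in CP1252_TO_BYTE) bytes.push(CP1252_TO_BYTE[cp]);
    else if (cp <= 0xFF) bytes.push(cp);
    else return text; // not representable in Windows-1252: leave untouched
  }
  return new TextDecoder('utf-8').decode(Uint8Array.from(bytes));
}

// The garbled sequence written with escapes: U+00F0, U+0178, U+02DC, U+20AC
const garbled = '\u00F0\u0178\u02DC\u20AC';
console.log(fixMojibake(garbled) === '\u{1F600}'); // true
```

This should only be applied to text known to be garbled in this way: correctly decoded text with accented Latin-1 characters would be turned into replacement characters.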

Javascript String.fromCharCode() latin1 encoding issue

I don't know if I can write a fiddle for this so I'll just try to explain this as well as I possibly can.
We have an application where we've written an editor. We need to check some grammar rules on strings/tokens that are being entered into the editor.
However, when using String.fromCharCode(190), instead of getting a "." as in utf-8 we get a "¾" from latin1.
I've checked whether or not we set latin1 as the default encoding somewhere but I've been unable to find anything.
Can anyone point me in the right direction or possibly suggest a solution for this issue?
The HTML charset is UTF-8 as well as the javascript file (this only adds to my confusion haha).

As per the docs, String.fromCharCode() returns the character for the given UTF-16 code unit. It has nothing to do with the page or file encoding. "¾" is simply the Unicode character with code 190 (U+00BE), while "." has code 46. http://unicode-table.com/
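A short illustration of the point above. The second half rests on an assumption that the question does not confirm: that the 190 came from a keydown event's keyCode, where 190 identifies the physical period key on common layouts rather than a character. typedCharacter is a hypothetical helper.

```javascript
// fromCharCode maps a UTF-16 code unit to a character; 190 is U+00BE.
console.log(String.fromCharCode(190)); // ¾
console.log(String.fromCharCode(46));  // .
console.log('.'.charCodeAt(0));        // 46

// Assumption: the value 190 is a keyCode, not a character code.
// Hypothetical helper: prefer event.key, which holds the typed character.
// Named keys such as "Enter" are longer than one character and are ignored.
function typedCharacter(event) {
  return event.key.length === 1 ? event.key : '';
}

console.log(typedCharacter({ key: '.', keyCode: 190 })); // .
```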

Remove ^M from CSV file

Trying to read CSV formatted data into javascript using the jquery-csv library, but am getting a CSVDataError: Illegal Data error from the ^M character at the end of each line.
It seems no matter how a CSV is saved, I get this ^M. I can only ever see the ^M if I open the CSV file in vim, even in a text editor or my IDE the data looks fine. I don't get this problem when working in other languages either such as Python or R.
I am working on a Mac environment.
How can I fix this and avoid this problem in the future?

Use dos2unix to convert.
It's false that "no matter how it is saved" the CR (^M is a carriage return) is appended. For instance, echo 'a,b,c' > letters.csv does not append a CR. Check your text editor settings.

Take a look at the splitlines algorithm on the jquery-csv page; it seems to provide a function that cleans up these problematic carriage returns for you.

Assuming ^M indicates a mac-style carriage return, support for carriage return was included in a previous release so your code should just work.
Source: I'm the author of jquery-csv
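If converting the file with dos2unix is not an option, the carriage returns (^M is how vim displays "\r") can also be normalized in JavaScript before the text reaches the parser. A minimal sketch; normalizeLineEndings is a hypothetical helper, not part of jquery-csv.

```javascript
// Hypothetical helper: convert Windows (\r\n) and old Mac (\r) line endings to \n.
function normalizeLineEndings(text) {
  return text.replace(/\r\n?/g, '\n');
}

const csv = 'a,b,c\r\n1,2,3\r\n';
console.log(JSON.stringify(normalizeLineEndings(csv))); // "a,b,c\n1,2,3\n"
```

The normalized string can then be handed to the CSV parser in place of the raw file contents.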
