JavaScript DOMParser and XMLSerializer remove XML entities - javascript

I am trying to preserve some XML entities when parsing XML files in JavaScript. The following code snippet illustrates the problem. Is there a way for me to round-trip the parse and retain the XML entities (&#160; is nbsp in HTML)? This happens in Chrome, FF, and IE10.
var aaa = '<root><div>&#160;one two</div></root>'
var doc = new DOMParser().parseFromString(aaa, 'application/xml')
new XMLSerializer().serializeToString(doc)
"<root><div> one two</div></root>"
The issue is that I am taking some chunks out of HTML and storing them in XML, and I want to get the spaces back out of the XML when I'm done.
Edit:
As Dan and others have pointed out, the parser replaces the entity with the character at code 160, which to my eyes looks like an ordinary space, but:
var str1=new XMLSerializer().serializeToString(doc)
str1.charCodeAt(15)
160
So wherever my application is losing the spaces, it is not here.

You can use a ranged RegExp to turn the special characters back into XML character references.
As a nice re-usable function:
function escapeExtended(s) {
    return s.replace(/([\x80-\xff])/g, function (a, b) {
        return "&#" + b.charCodeAt(0) + ";";
    });
}
var aaa = '<root><div>&#160;one two</div></root>'
var doc = new DOMParser().parseFromString(aaa, 'application/xml')
var str = new XMLSerializer().serializeToString(doc);
alert(escapeExtended(str)); // shows: "<root><div>&#160;one two</div></root>"
Note that HTML entities (e.g. &quot;) will lose their symbolic name and be converted to numeric XML character references (the &#number; kind). You can't get the names back without a huge conversion table.
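If only a handful of names matter in practice, a small partial table is enough. A minimal sketch (the map below covers just a few example entities and is not complete; escapeExtendedNamed is a hypothetical helper, and names like nbsp are HTML entities, not predefined in plain XML):
// Partial map from character codes back to entity names (illustration only, not exhaustive).
var entityNames = { 160: 'nbsp', 169: 'copy', 174: 'reg' };

function escapeExtendedNamed(s) {
    return s.replace(/[\x80-\xff]/g, function (ch) {
        var code = ch.charCodeAt(0);
        // fall back to a numeric character reference when no name is known
        return entityNames[code] ? '&' + entityNames[code] + ';' : '&#' + code + ';';
    });
}

escapeExtendedNamed('<root><div>\u00A0one two</div></root>');
// "<root><div>&nbsp;one two</div></root>"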

Related

How to decode HEX in XMLHttpRequest?

I have a site that uses AJAX, and I have run into a problem.
The server returns a JSON string something like this: {a:"\x48\x65\x6C\x6C\x6F"}.
So in xx.responseText we have the string '{a:"\x48\x65\x6C\x6C\x6F"}'.
But if I create the JavaScript string "\x48\x65\x6C\x6C\x6F" directly, I get "Hello" and not the HEX!
Is it possible to get the "real" HEX text out of xx.responseText (automatically, without .replace())?
If the output is at all regular (predictable), .replace() is probably the simplest.
var escapeSequences = xx.responseText.replace(/^\{a:/, '').replace(/\}$/, '');
console.log(escapeSequences === "\"\\x48\\x65\\x6C\\x6C\\x6F\""); // true
Or, if a string literal that is equivalent in value (but may not otherwise be identical) is sufficient, you could parse the response (see below) and then stringify() an individual property.
console.log(JSON.stringify(data.a) === "\"Hello\""); // true
Otherwise, you'll likely need to run responseText through a lexer to tokenize it and retrieve the literal from that. JavaScript doesn't include an option for this separate from parsing/evaluating, so you'll need to find a library for this.
"Lexer written in JavaScript?" may be a good place to start for that.
To parse it:
Since it appears to be a string of code, you'll likely have to use eval().
var data = eval('(' + xx.responseText + ')');
console.log(data.a); // Hello
Note: The parentheses make sure {...} is evaluated as an object literal rather than as a block.
Also, I'd suggest looking into alternatives to sending executable code for communicating data like this.
A common option is JSON, which takes its syntax from JavaScript, but uses a rather strict subset. It doesn't allow functions or other potentially problematic code to be included.
var data = JSON.parse(xx.responseText);
console.log(data.a); // Hello
Visiting JSON.org, you should be able to find a reference or library for your server-side language of choice to output JSON, e.g.:
{ "a": "Hello" }
Why not just let the parser do its job and handle the \x escape sequences, and then convert the string back to hex afterwards? E.g.
function charToHex(c) {
    var hex = c.charCodeAt(0).toString(16);
    return (hex.length === 2) ? hex : '0' + hex;
}
"Hello".replace(/./g, charToHex); // gives "48656c6c6f"

JSON.stringify and "\u2028\u2029" check?

Sometimes I see this code in a page's HTML view-source:
if (JSON.stringify(["\u2028\u2029"]) === '["\u2028\u2029"]') JSON.stringify = function (a) {
    var b = /\u2028/g,
        c = /\u2029/g;
    return function (d, e, f) {
        var g = a.call(this, d, e, f);
        if (g) {
            if (-1 < g.indexOf('\u2028')) g = g.replace(b, '\\u2028');
            if (-1 < g.indexOf('\u2029')) g = g.replace(c, '\\u2029');
        }
        return g;
    };
}(JSON.stringify);
What is the problem with JSON.stringify(["\u2028\u2029"]) that it needs to be checked?
Additional info:
JSON.stringify(["\u2028\u2029"]) evaluates to a string that prints as "[" and "]" on separate lines, because the two invisible separator characters between the quotes act as line breaks.
'["\u2028\u2029"]' evaluates to the same broken-looking value.
I thought it might be a security feature. The FileFormat.info pages for 2028 and 2029 have a banner stating:
Do not use this character in domain names. Browsers are blacklisting it because of the potential for phishing.
But it turns out that the line and paragraph separators \u2028 and \u2029 respectively are treated as a new line in ES5 JavaScript.
From http://www.thespanner.co.uk/2011/07/25/the-json-specification-is-now-wrong/
\u2028 and \u2029 characters that can break entire JSON feeds since the string will contain a new line and the JavaScript parser will bail out
So you are seeing a patch for JSON.stringify. Also see Node.js JavaScript-stringify
Edit: Yes, modern browsers' built-in JSON object should take care of this correctly. I can't find any links to the actual source to back this up though. The Chromium code search doesn't mention any bugs that would warrant adding this workaround manually. It looks like Firefox 3.5 was the first version to have native JSON support, not entirely bug-free though. IE8 supports it too. So it is likely a now unnecessary patch, assuming browsers have implemented the specification correctly.
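A minimal sketch of the breakage being patched, assuming a pre-ES2019 engine (ES2019 both allows U+2028/U+2029 inside string literals and makes JSON.stringify escape them, so modern engines will not reproduce this):
var payload = JSON.stringify(["\u2028"]); // in old engines this contains a raw U+2028 character
try {
    // embedding the output back into code, as a JSONP feed or inline <script> would
    eval('var x = ' + payload + ';');
} catch (e) {
    console.log(e); // SyntaxError in pre-ES2019 engines: the raw separator acts as a line terminator
}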
After reading both answers, here is a simple visual explanation:
Doing this
alert(JSON.stringify({"a":"sddd\u2028sssss"})) // can cause problems
will alert: (alert screenshot omitted)
While changing the troublemaker to something else (for example from \u to \1u)
will alert: (alert screenshot omitted)
Now, let's invoke the function from my original Q and try alert(JSON.stringify({"a":"sddd\u2028sssss"})) again.
Result: (alert screenshot omitted)
And now everybody's happy.
\u2028 and \u2029 are invisible Unicode line and paragraph separator characters. Natively, JSON.stringify converts these escape codes into the actual characters (as JavaScript automatically does in strings), resulting in a value that prints as "[" and "]" on separate lines. The code you have provided does not let JSON.stringify convert the codes into characters, and preserves their \uXXXX representation in the output string, i.e. it returns '["\u2028\u2029"]'.

Parse ill formed JSON string

I am being sent an ill-formed JSON string from a third party. I tried using JSON.parse(str) to parse it into a JavaScript object, but it of course failed.
The reason is that the keys are not quoted strings:
{min: 100}
As opposed to a valid JSON string (which parses just fine):
{"min": 100}
I need to accept the ill-formed string for now. I imagine forgetting to properly quote keys is a common mistake. Is there a good way to change this into a valid JSON string so that I can parse it? Otherwise I may have to parse character by character and try to form an object, which sounds awful.
Ideas?
You could just eval it, but that would be bad security practice if you don't trust the source. A better solution would be to either modify the string to quote the keys or use a tool someone else has written that does this for you (check out https://github.com/daepark/JSOL written by daepark).
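For completeness, the eval route looks like this (a sketch; only if you trust the source, and the parentheses force evaluation as an object literal):
var badJSON = '{min: 100}';
var obj = eval('(' + badJSON + ')'); // works despite the unquoted key
console.log(obj.min); // 100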
I did this just recently, using UglifyJS to evaluate it:
var jsp = require("uglify-js").parser;
var pro = require("uglify-js").uglify;
var orig_code = "var myobject = " + badJSONobject;
var ast = jsp.parse(orig_code); // parse code and get the initial AST
var final_code = pro.gen_code(ast); // regenerate code
$('head').append('<script>' + final_code + '; console.log(JSON.stringify(myobject));</script>');
This is really sloppy in a way, and has all the same problems as an eval() based solution, but if you just need to parse/reformat the data one time, then the above should get you a clean JSON copy of the JS object.
Depending on what else is in the JSON, you could simply do a string replace, replacing '{' with '{"' and ':' with '":'.
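A rough sketch of that idea, assuming simple flat objects like {min: 100} whose values never contain '{' or ':' themselves (quoteKeys is a hypothetical helper, and a regex like this is easy to break on real-world input):
function quoteKeys(s) {
    // wrap bare identifiers that appear in key position in double quotes
    return s.replace(/([{,]\s*)([A-Za-z_$][\w$]*)\s*:/g, '$1"$2":');
}

JSON.parse(quoteKeys('{min: 100}')); // {min: 100}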

DOM Exception 5 INVALID CHARACTER error on valid base64 image string in javascript

I'm trying to decode a base64 string for an image back into binary so it can be downloaded and displayed locally by an OS.
The string I have renders successfully when used as the src of an HTML IMG element with the data URI prefix (data:img/png;base64,), but it fails when passed to the atob function or a Google Closure function.
However, decoding succeeds when the string is pasted in here: http://www.base64decode.org/
Any ideas?
EDIT:
I got it to decode successfully with a library other than the built-in JS function. But it still won't open locally; on a Mac it says the file is damaged or in an unknown format and can't be opened.
The code is just something like:
imgEl.src = 'data:img/png;base64,' + contentStr; // this displays successfully
decodedStr = window.atob(contentStr); // this throws the invalid char exception; I got it to decode with a
// different script, but the result still won't display locally
The base64 string itself is too long to display here (the limit is 30,000 characters).
I was just banging my head against the wall on this one for a while.
There are a couple of possible causes:
1) UTF-8 problems. There's a good write-up plus a solution for that here.
2) In my case, I also had to make sure all the whitespace was out of the string before passing it to atob, e.g.
function decodeFromBase64(input) {
    input = input.replace(/\s/g, '');
    return atob(input);
}
What was really frustrating was that the base64 parsed correctly using the base64 library in Python, but not in JS.
I had to remove the data:audio/wav;base64, prefix from the front of the b64 string, as it was given as part of the b64 data:
var data = b64Data.substring(b64Data.indexOf(',')+1);
var processed = atob(data);
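Combining both fixes (a sketch; cleanAndDecode is just an illustrative name): strip any data: URI prefix and all whitespace before handing the string to atob.
function cleanAndDecode(b64) {
    // drop a leading "data:<mime>;base64," prefix if one is present
    var data = b64.indexOf(',') >= 0 ? b64.substring(b64.indexOf(',') + 1) : b64;
    // some atob implementations choke on whitespace (as noted above), so strip it too
    return atob(data.replace(/\s/g, ''));
}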

Returning a byte string to ExternalInterface.call throws an error

I am working on my open source project Downloadify, and up until now it has simply handled returning Strings in response to ExternalInterface.call commands.
I am trying to put together a test case that uses JSZip and Downloadify together; the end result is that a Zip file is created dynamically in the browser and then saved to disk using FileReference.save. However, this is my problem:
The JSZip library can return either a base64 encoded string of the Zip, or the raw byte string. The problem is, if I return that byte string in response to the ExternalInterface.call command, I get this error:
Error #1085: The element type "string" must be terminated by the matching end-tag "</string>"
ActionScript 3:
var theData:* = ExternalInterface.call('Downloadify.getTextForSave',queue_name);
Where queue_name is just a string used to identify the correct instance in JS.
JavaScript:
var zip = new JSZip();
zip.add("test.txt", "Hello world!\n");
var content = zip.generate(true);
return content;
If I return a normal string instead of the byte string, the call works correctly. I would like to avoid using base64, as I would have to include a base64 decoder in my SWF, which would increase its size.
Finally: I am not looking for an AS3 Zip generator. It is imperative to my project that that part runs in JavaScript.
I am admittedly not an AS3 programmer by trade, so if you need any more detail please let me know.
When data is returned from JavaScript calls it is serialized into an XML string. So if the "raw string" returned by JSZip includes characters that make the XML invalid, which is what I think is happening here, you'll get errors like that.
What you get as a return is actually:
<string>[your JSZip generated string]</string>
Imagine your return string includes a "<" character: this will make the XML invalid, and it's hard to tell what characters a raw byte stream will translate to.
You can read more about the external API's XML format on LiveDocs.
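As a quick diagnostic on the JavaScript side (a sketch, not part of the ExternalInterface API; findInvalidXmlChars is a hypothetical helper), you can look for the characters that would corrupt the <string>…</string> wrapper before returning:
function findInvalidXmlChars(s) {
    var positions = [];
    for (var i = 0; i < s.length; i++) {
        var c = s.charCodeAt(i);
        // XML 1.0 forbids most control characters; '<' and '&' would also need escaping per the answer above
        if ((c < 0x20 && c !== 0x9 && c !== 0xA && c !== 0xD) || c === 0x3C || c === 0x26) {
            positions.push(i);
        }
    }
    return positions;
}

findInvalidXmlChars(zip.generate(true)); // indexes of characters likely to break the serialized return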
I think the problem is caused by the fact that Flash expects a UTF-8 string and you throw some binary stuff at it. For example, a byte sequence like 0x00FF will not turn out to be valid UTF-8...
You can try fiddling around with flash.system::System.setCodePage, but I wouldn't be too optimistic...
I guess a base64 decoder is probably really the easiest route... I'd rather worry about speed than about file size though... this rudimentary decoder method uses less than half a K:
public function decodeBase64(source:String):ByteArray {
    var ret:ByteArray = new ByteArray();
    var map:Object = new Object();
    var i:int = 0;
    // build a lookup table from base64 characters to their 6-bit values
    for each (var char:String in "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".split("")) map[char] = i++;
    map["="] = 0;
    source = source.split("\n").join("").split("\r").join(""); // remove linebreaks
    for (i = 0; i < source.length / 4; i++) {
        // pack each group of four characters into 24 bits, then write them out
        var buf:int = 0;
        for each (char in source.substr(i * 4, 4).split("")) buf = (buf << 6) + map[char];
        ret.writeByte(buf >>> 16);
        ret.writeShort(buf);
    }
    return ret;
}
To offset the extra size, you could simply shorten function names, use a smaller image, use ColorTransform or ConvolutionFilter on one image instead of four, or compile the image into the SWF for a smaller overall size...
So unless you're planning on working with MBs of data, this is the way to go...
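On the JavaScript side, the matching change is small; a sketch assuming the old JSZip API used in the question, where generate(true) returns the raw byte string and generate() without arguments returns base64:
var zip = new JSZip();
zip.add("test.txt", "Hello world!\n");
// return base64 instead of raw bytes so the ExternalInterface XML stays well-formed;
// the AS3 side then runs the result through decodeBase64() above
return zip.generate();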
