How to optimize string-to-DOM conversion? - javascript

I'm faced with a slightly inconvenient 'lag' when I attempt to populate a div created in JavaScript:
var el = document.createElement("div");
el.innerHTML = '<insert string-HTML code here>';
However, this is natural given the extent of the HTML code: it's sometimes more than 300,000 characters long, and it is derived from GM_xmlHttpRequest, which sometimes takes 1000 ms (give or take) to complete, plus the additional 500 ms caused by the DOM-ification.
I have attempted to get rid of the massive amount of text using substr (granted, not the best idea that could've occurred to me), and it surprisingly worked for the most part, but at certain times the element would fail to accept the HTML code (probably unmatched <*.?>).
I only need to access an extremely small amount of text that's stored inside; regexp is, per bobince, out of the question, and I figured this would be the best approach.
EDIT: I should mention that I understated what I meant by parsing the DOM: this 'text' is the textContent of quite a few elements, which I then modify. Therefore, regexp isn't an option.

While other answers focus on guessing whether your desire (parsing the DOM without string manipulation) makes sense, I will dedicate this answer to a comparison of reasonable DOM parsing methods.
For a fair comparison, I assume that we need the <body> element (as root container) for the parsed DOM. I have created a benchmark at http://jsperf.com/domparser-vs-innerhtml-vs-createhtmldocument.
var testString = '<body>' + Array(100001).join('<div>x</div>') + '</body>';

// Method 1: innerHTML on a detached <body> element
function test_innerHTML() {
    var b = document.createElement('body');
    b.innerHTML = testString;
    return b;
}
// Method 2: innerHTML on the body of a detached document
function test_createHTMLDocument() {
    var d = document.implementation.createHTMLDocument('');
    d.body.innerHTML = testString;
    return d.body;
}
// Method 3: DOMParser with the text/html type
function test_DOMParser() {
    return (new DOMParser).parseFromString(testString, 'text/html').body;
}
The first method is your current one. It is well supported across all browsers.
Even though the second method has the overhead of creating a full document, it has a big benefit over the first one: resources (images) are not loaded. The overhead of the document is marginal compared to the potential network traffic of the first one.
The last method is, as of writing, only supported in Firefox 12+ (no problem, since you're writing a Greasemonkey script), and is the specific tool for this job (with the same advantages as the previous method). As its name implies, it is a DOM parser.
The benchmark shows that the original method is the fastest (4.64 ops/s), followed by the DOMParser method (4.22 ops/s). The slowest is the createHTMLDocument method (3.72 ops/s). The differences are minimal though, so I definitely recommend DOMParser for the reasons stated earlier.
I know that you're using GM_xmlhttprequest to fetch data. However, if you're able to use XMLHttpRequest instead, I suggest giving the following method a try: instead of getting plain text as the response, you can get a document as the response:
var xhr = new XMLHttpRequest();
xhr.open('GET', 'http://www.example.com/');
xhr.responseType = 'document';
xhr.onload = function() {
    var bodyElement = xhr.response.body; // xhr.response is a document object
};
xhr.send();
If your Greasemonkey script is active on a single page for a long time, you can still use this feature for other domains which do not support CORS: insert an iframe into the document whose domain is equal to the other domain (e.g. http://example.com/favicon.ico), and use it as a proxy (activate the GM script for this page as well). The overhead of inserting an iframe is significant, so this option is not viable for one-time requests.
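To make the proxy idea a bit more concrete, here is a minimal, untested sketch; the favicon URL, the target element and the postMessage-based hand-off are illustrative assumptions, not part of the GM API:

// --- In the GM script, on the host page ---
var frame = document.createElement('iframe');
frame.style.display = 'none';
frame.src = 'http://example.com/favicon.ico'; // any URL on the target domain
document.body.appendChild(frame);

window.addEventListener('message', function(event) {
    if (event.origin !== 'http://example.com') return; // only trust the proxy frame
    console.log('Extracted text:', event.data);
}, false);

// --- In the GM script, when running inside the proxied page ---
// var xhr = new XMLHttpRequest();
// xhr.open('GET', '/page-to-parse.html'); // same-origin from the frame's point of view
// xhr.responseType = 'document';
// xhr.onload = function() {
//     var el = xhr.response.querySelector('#target');
//     parent.postMessage(el ? el.textContent : '', '*');
// };
// xhr.send();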
For same-origin requests, this option may be the best one (although not benchmarked, one can argue that returning a document directly, instead of doing intermediate string manipulation, offers performance benefits). Unlike the DOMParser + text/html method, responseType = "document" is supported by more browsers: Chrome 18+, Firefox 11+ and IE 10+.

We'd need to know a bit more about your application, but when you're working with that much HTML content, you might just want to use an iframe. It's asynchronous, it won't stall JS code, and it won't introduce a plethora of potential debugging problems.
It can be dangerous to populate an element with raw HTML from an xmlhttprequest, mainly due to potential XSS vulnerabilities and next-to-impossible-to-fix HTML glitches. If at all possible, consider using a template (I believe jQuery offers some sort of templating solution) and loading a small amount of XML/JSON/etc. Only do that if using an iframe is out of the question, though.
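As a rough illustration of the template idea (the endpoint, field names and element ids below are made up for the example), the JSON values are written with textContent so they are never parsed as HTML:

var xhr = new XMLHttpRequest();
xhr.open('GET', '/api/item/42'); // hypothetical endpoint returning a small JSON payload
xhr.onload = function() {
    var data = JSON.parse(xhr.responseText);
    var node = document.getElementById('item-template').cloneNode(true);
    node.querySelector('.title').textContent = data.title;     // textContent never interprets HTML
    node.querySelector('.summary').textContent = data.summary;
    node.removeAttribute('id');
    node.style.display = '';
    document.getElementById('item-list').appendChild(node);
};
xhr.send();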

If you have a giant amount of HTML that's taking a long time to put into the DOM, and you only want a small piece of it, the ways to make that faster are:
Get your server to serve up only the parts of the HTML you actually want. This would save on both the networking transfer time and the DOM parsing time.
If you can't modify the server, then you need to manually parse some of the HTML to eliminate the parts you don't want, so that not as much has to be put into the DOM. A regex is one of the slower ways to search a giant string, so it's better to use something like .indexOf() if possible to identify the general area you are targeting. If there is a unique id or class and you know the general form of the HTML, you can use a faster approach like that to identify the target area (see the sketch below). But, without you disclosing the actual HTML to be parsed, we can't offer more specifics than that.
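A sketch along those lines might look like this (the marker string and the slice size are assumptions; they depend on the real HTML):

function extractTargetText(html) {
    // Find a cheap, unique landmark with indexOf instead of a regex.
    var start = html.indexOf('id="target-container"');
    if (start === -1) return null;
    // Back up to the opening '<' of that tag and take a bounded chunk after it.
    start = html.lastIndexOf('<', start);
    var chunk = html.substr(start, 20000);

    // DOM-ify only the small chunk, not the whole 300,000-character string.
    var container = document.createElement('div');
    container.innerHTML = chunk;
    var el = container.querySelector('#target-container');
    return el ? el.textContent : null;
}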

Related

javascript - locally generating and downloading a huge file

I have the following JavaScript code which downloads a file, which I can't help but think I got from here: Create a file in memory for user to download, not through server
However, this function crashes in Chrome because I'm trying to download too much data (it might be a couple of MB; it seems to work OK for downloads under 1 MB, but I haven't done many metrics on it).
I really like this function because it lets me create a string from an existing huge JavaScript variable and immediately download it.
So my two questions then are:
A) What's causing the crash? Is it the size of the string, or is there something in this function? I read that 60 MB strings were possible in JavaScript, and I don't think I'm quite reaching that.
B) If it is this function, is there another simple way to download some huge-ish file that allows me to generate the content locally via JavaScript?
function download(filename, text) {
    var element = document.createElement('a');
    element.setAttribute('href', 'data:text/plain;charset=utf-8,' + encodeURIComponent(text));
    element.setAttribute('download', filename);
    element.style.display = 'none';
    document.body.appendChild(element);
    element.click();
    document.body.removeChild(element);
}
Does it work in other browsers? Try using the debugger: set a breakpoint just inside the function and step through.
Try breaking up the element.setAttribute and the data content by creating a var that holds the string you are going to set as href; that way you can see more failure points.
See if the encodeURIComponent function is failing with large strings.
Strings are immutable in JavaScript; for those unfamiliar, it means that their creation is final: you can't modify a string or append to one, you have to create a new one for every change. encodeURIComponent, which URL-encodes a string, possibly makes thousands of such allocations escaping a > 1 MB string, depending on its contents. And even if you use zero characters that need escaping, when you call that function and then append the result to the 'data:text/plain;charset=utf-8,' string, it creates a new string from those two, effectively doubling the memory needed for that action.
Depending on how the particular browser handles this function, it may not be optimized for long strings at all; since most browsers have a URL character limit of ~2000 characters (2048 typically), it's likely that the browser's implementation is not doing a low-level escape. If this function is indeed the culprit, you will have to find another way to URI-escape your string, possibly a library or a custom low-level escape.
If the debugger shows that this function is not the issue, the obvious other bottleneck would be when you append this enormous link to the DOM; the browser could be freezing there while attempting to process that command, and that may require a completely different solution to your downloading issue.
Though this is just speculation, hopefully it leads you in the right direction.
While I marked Rickey's answer as the correct one because it got me to the right answer, a workaround I found for this was here:
JavaScript blob filename without link
The accepted answer at this link was capable of handling more than 8 MB, while the data URI could only handle about 2 MB, because of Chrome's limit on URI length.
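For reference, a minimal sketch of that Blob-based workaround (see the linked answer for the original; this version is an approximation):

function downloadBlob(filename, text) {
    var blob = new Blob([text], { type: 'text/plain;charset=utf-8' });
    var url = URL.createObjectURL(blob); // short object URL instead of a multi-MB data: URI

    var a = document.createElement('a');
    a.href = url;
    a.download = filename;
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);
    URL.revokeObjectURL(url); // free the memory backing the object URL
}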

Share localization strings from backend to JavaScript

Consider a JSP application with a couple of JavaScript files. The backend is fully localized, using several .properties files, one for each language. Rendered HTML contains strings in the correct language - all of this is the usual stuff and works perfectly.
Now however, from time to time I need to use some localized string in a JavaScript resource. Suppose e.g.:
function foo() {
    alert('This string should be localized!');
}
Note that this is somewhat similar to the need to refer some AJAX endpoints from JavaScript, a problem well solved by a reverse JS router. However the key difference here is that the backend does not use the router, but it does use the strings.
I have come up with several approaches, but none of them seems to be good enough.
Parameters
JSP that renders the code to invoke foo() will fetch the string:
foo('<%= localize("alert.text") %>');

function foo(alertText) {
    alert(alertText);
}
Pros: It works.
Cons: Method signatures are bloated.
Prototypes
JSP renders a hidden span with the string, JS fetches it:
<span id="prototype" class="hidden">This string should be localized!</span>

function foo() {
    alert($('#prototype').text());
}
Pros: Method signatures are no longer bloated.
Cons: Must make sure that the hidden <span>s are always present.
AJAX
There is an endpoint that localizes strings by their key, JS calls it. (The code is approximate.)
function foo() {
    $.ajax({ url : '/ajax/localize', data : { key : 'alert.text' } })
        .done(function(result) {
            alert(result);
        });
}
Pros: Server has full control over the localized result.
Cons: One HTTP call per localized string! If any of the AJAX calls fails, the logic breaks.
This can be improved by getting multiple strings at once (sketched below), but the roundtrip problem is an essential one.
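A sketch of that batching idea (the endpoint and the shape of the response are assumptions; the real API would need to accept multiple keys):

function localizeAll(keys, callback) {
    $.ajax({ url : '/ajax/localize', data : { keys : keys.join(',') } })
        .done(function(result) {
            // result is assumed to be a map of key -> localized string
            callback(result);
        });
}

localizeAll(['alert.text', 'confirm.text'], function(strings) {
    alert(strings['alert.text']);
});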
Shared properties files
The property file containing the current language is simply exposed as an additional JS resource on the page.
<script src="/locales/en.js"></script> <!-- brings in the i18n object -->

function foo() {
    alert(i18n.alert.text);
}
Pros: Fast and reliable.
Cons: All the strings are pulled in, including the ones we don't need or don't want to expose to the user.
This can be improved by keeping a separate set of strings for JS, but that violates the DRY principle.
Now what?
So that's it, those are the ideas I've had. None of them is ideal; all have their own share of problems. I am currently using the first two approaches, with mixed success. Are there any other options?
Your idea with a shared properties file is the neatest solution of the 4 ideas you suggested. A popular CMS I use called Silverstripe actually does the same thing: it loads a localised JS file that adds the strings to a dictionary, allowing a common API for retrieving the strings.
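As a sketch of what that dictionary approach can look like (the names below are illustrative, not Silverstripe's actual API): the server emits one small script per language, and a tiny helper reads from it.

// Served as /locales/en.js, generated from the same .properties files as the backend
var i18n = {
    'alert.text'   : 'This string should be localized!',
    'confirm.text' : 'Are you sure?'
};

// Common retrieval API; falls back to the key so missing strings are easy to spot
function t(key) {
    return (key in i18n) ? i18n[key] : key;
}

function foo() {
    alert(t('alert.text'));
}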
One point made in the comments is about including localised strings for a particular view. While this can have some uses under particular situations where you have thousands of strings per localisation (totaling more than a few hundred KBs), it can also be a little unnecessary.
Client-side
Depending on how many other JS resources you are loading at the same time, you may not want another request per view just to add a few more strings for that locale. Each view's localisation would need to be requested separately, which can be a little inefficient. Give the browser all the localisations in one request and let it read from its cache for each view.
The browser caching the complete collection of locale strings can lead to a better user experience, with faster page load times and one less request per view. For mobile users this can be quite helpful, as even with faster mobile internet, not every single request is lightning fast.
Server-side
If you go with the method suggested in the comments, having a locale.asp file generate the JS locale strings on the fly, you are giving the server a little more work per user. This won't be that bad if each user requests it once; however, if it is a request per view, it might start adding up.
If the user views 5 different pages, that is 5 times the server executes the JSP, building the data for the particular view. While your code might be basic if-statements and loading a few files from the filesystem, there is still overhead in executing that code. While it might not be a problem for, say, 10 requests per minute, it could lead to issues at 1,000 requests per minute.
Again, that extra overhead may be small, but it simply isn't necessary unless you really want many small HTTP requests and little browser caching instead of a few larger HTTP requests.
Additional Thoughts
While this might sound like premature optimisation, I think it is a simple and important thing to consider. We don't know whether you have 5 users or 5,000, whether your users go see 5 different views or 500, whether you will have many/any mobile users, how many locales you want to support, how many different strings per locale you have.
Because of this I think it is best to see the larger picture of what the choice of having locale strings downloaded per view would do.

Preventing DOM XSS

We recently on-boarded someone else's code, which has since been tested for DOM XSS attacks and failed.
Basically, URL fragments are being passed directly into jQuery selectors, enabling JavaScript to be injected, like so:
"http://website.com/#%3Cimg%20src=x%20onerror=alert%28/XSSed/%29%3E)"
$(".selector [thing="+window.location.hash.substr(1)+"]");
The problem is that this occurs throughout their scripts and would need a lot of regression testing to fix; e.g. if we escape the data, if-statements won't return true any more because the data won't match.
The JavaScript file in question is concatenated at build time from many smaller files so this becomes even more difficult to fix.
Is there a way to prevent these DOM XSS attacks with some global code, without having to go through and debug each instance?
I proposed that we add a little regular expression at the top of the script to detect common chars used in XSS attacks and to simply kill the script if it returns true.
var xss = window.location.href.match(/(javascript|src|onerror|%|<|>)/g);
if(xss != null) return;
This appears to work but I'm not 100% happy with the solution. Does anyone have a better solution or any useful insight they can offer?
If you stick to the regular expression solution, which is far from ideal but may be the best choice given your constraints:
Rather than defining a regular expression matching malicious hashes (/(javascript|src|onerror|%|<|>)/g), I would define a regular expression matching sound hashes (e.g. /^[\w_-]*$/).
It will avoid false-positive errors (e.g. src_records), make it clear what is authorized and what isn't, and block more complex injection mechanisms.
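A minimal sketch of that whitelist check, applied before the hash ever reaches a selector (the selector reuses the question's example):

var hash = window.location.hash.substr(1);
if (/^[\w_-]+$/.test(hash)) {
    // Only reaches the selector when the hash is a plain identifier
    $(".selector [thing=" + hash + "]");
}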
Your issue is caused by the fact that jQuery's input string may be treated as HTML, not only as a selector.
Use native document.querySelector() instead of jQuery.
If support for IE7 and below is important for you, you can try the Sizzle selector engine, which, unlike jQuery and similar to the native querySelector(), likely does not interpret the input string as anything other than a selector.
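A small sketch combining both suggestions: the value is quoted and escaped so it can only ever be read as an attribute value, and the native querySelector() never treats its argument as HTML. CSS.escape is used where available; the fallback escaping here is a simplification.

var hash = window.location.hash.substr(1);
var value = (window.CSS && CSS.escape) ? CSS.escape(hash)
                                       : hash.replace(/["\\]/g, '\\$&');
var el = document.querySelector('.selector [thing="' + value + '"]');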

create html elements on the serverside VS get data as JSON and create tags with javascript

I want to create an AJAX search to find and list topics in a forum (just topic link and subject).
The question is: Which one of the methods is better and faster?
GET the threads list as a JSON string and convert it to an object, then loop over the items, create a <li> or <tr>, write the data (link, subject) and append it to the threads list. (jQuery powered)
GET the threads list already wrapped in HTML tags and print it (or use innerHTML and $(e).html()).
Thanks...
I prefer the second method.
I figure that server-side you have to convert your data to either JSON or HTML format, so why not go directly to the one the browser understands and avoid having to reprocess it client-side. Also, you can easily adapt the second method to degrade gracefully for users who have disabled JavaScript (so that they still see the results via standard non-JS links).
I'm not sure which way is better (I assume the second method, as it would seem to touch the data less), but a definitive way to find out is to try both ways and measure which one does better.
'Faster' is probably the second method.
'Better' is probably subjective.
For example, I've been in situations (as a front-end dev) where I couldn't alter the HTML the server was returning, and I wished they had just delivered a JSON object so I could design the page how I wanted.
Also (perhaps not specific to your use case), serving up all the HTML on the initial page load could increase the page size and load time.
Server-generated HTML is certainly faster if the JavaScript takes a long time to process the JSON and populate the HTML.
However, for maintainability, JS is better. You can change the HTML generation just by changing the JS, without having to update server-side code, make a delta release, etc.
Best is to measure how slow it really is. Sometimes we think something is slow, but then you try it out in the real world and don't really see a big difference. The major delay might be in transmitting the JSON object. That delay will still be there, and in fact increase, if you send an HTML representation from the server.
So, if your bottleneck really is parsing the JSON and generating the HTML, not the transmission from the server, then sending HTML from the server makes sense.
However, you can do a lot of optimization in producing the HTML and parsing the JSON; there are many tricks to make that faster (one is sketched below). It's best if you show me the code, and I can help you make a fast JS-based implementation or tell you to do it on the server.
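One common trick alluded to above, sketched under the assumption that the JSON is a list of objects with link and subject fields (as in the question) and that a #thread-list element exists: build the nodes in a detached DocumentFragment and touch the live DOM only once.

function renderThreads(threads) {
    var fragment = document.createDocumentFragment();
    threads.forEach(function(thread) {
        var li = document.createElement('li');
        var a = document.createElement('a');
        a.href = thread.link;
        a.textContent = thread.subject; // textContent avoids any HTML re-parsing
        li.appendChild(a);
        fragment.appendChild(li);
    });
    // One insertion, one reflow, instead of one per thread
    document.getElementById('thread-list').appendChild(fragment);
}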

Is it possible to use JavaScript to break the HTML of a page?

I've been asked at work whether it is possible to write, on purpose or by accident, JavaScript that will remove specific characters from an HTML document and thus break the HTML. An example would be adding some JavaScript that removes the < symbol in the page. I've tried searching online, and I know JavaScript can replace strings, but my knowledge of the language is negligible.
I've been asked to look into it as a way of hopefully addressing why a site I work on needs to have controls over who can add bespoke functionality to the page. I'm hoping it's not possible but would be grateful for the peace of mind!
Yes, and in fact you can do far more insidious things with JavaScript as well.
http://en.wikipedia.org/wiki/Cross-site_scripting
Yes, that's possible. The easiest example is:
var body = document.getElementsByTagName('body')[0];
body.innerHTML = 'destroyed';
which will remove the whole page and just write "destroyed" instead. To get back to your example: in the same way it's possible to replace <:
var body = document.getElementsByTagName('body')[0];
body.innerHTML = body.innerHTML.replace(/</g, 'some other character'); // global regex so every < is replaced
Such "extreme" cases are very unlikely to happen by accident, but it's absolutely possible (particularly for inexperienced JavaScript developers) to break things on a site that usually shouldn't be affected by JavaScript.
Note that this will only mess up the displayed page in the client's browser and doesn't change your HTML file on the server in any way. Just find and remove/fix the "bad" lines of code and everything is fine again.
Any client/browser can manipulate how the page is viewed at any time; for instance, in Chrome hit F12 and you can write whatever you want into the HTML and see the changes immediately. But that's not the thing to worry about...
The scary part is when JavaScript on the site communicates with the back-end server and supplies it with input parameters that are not sanitized on the server side before they are processed. SQL injection can also happen this way if the back end uses a database, which it almost always does, and so on...
A webpage can be manipulated in two ways: non-persistent or persistent.
[non-persistent]: you can manipulate your own view of a webpage; this won't affect other users by itself, but you can do harm once you're in.
[persistent]: the server side is permanently affected by the injected code, which will most likely affect other users.
The key thing here is to always sanitize the input the back-end server receives before it processes anything.
You could definitely write some javascript function to modify the contents of a file. If that file is your HTML page, then sure.
If you want to prevent this from happening, you can just set the permissions of that HTML file to be read-only, though.
You could:
Overwrite the page,
Mess with the innerHTML of the body tag (almost the same),
Insert illegal elements.
Yes. At the very least, you could use it to write CSS that sets any element, class, ID... even the body to display:none;
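A tiny illustration of that CSS point (just an example of the mechanism, not something from the site in question):

var style = document.createElement('style');
style.textContent = 'body { display: none; }'; // hides the entire page
document.head.appendChild(style);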
