Convert XML document to HTML and set <head> - javascript

I have an xml document loaded into the browser. I need to use it as a template to generate and display as an html page in its place, with all the work happening in JavaScript on the client.
I've got it mostly done:
The xml document loads a JavaScript file.
The JavaScript reads the document and generates the html document that I want to display.
The JavaScript replaces the document's innerHTML with the new html document.
The one thing that I'm missing is that I'd like to also supply the head of the new document.
I can create the head's content, of course. But, I cannot find any way to set it back into the browser's document. All my obvious attempts fail, hitting read-only elements of the document, or operations that are not supported on non-HTML documents.
Is there any way to do this, or am I barking up the wrong tree?
Alternative question: Even if this is really impossible, maybe it doesn't matter. Possibly I can use my JavaScript to accomplish everything that might be controlled by the head (e.g., viewport settings). After all, the JavaScript knows the full display environment and can make all needed decisions. Is this a sane approach, or will it lead to thousands of lines of code to handle browser-specific or device-specific special cases?
Edited - added the following:
I think the real question is broader: The browser (at least Chrome and Chromium) seems to make a sharp distinction between pages loaded as .xml and pages loaded as .html. I'm trying to bend these rules: I load a page as .xml, but then I use JavaScript to change it into .html.
But, the browser still wants to view the page as .xml. This manifests in many ways: I can't add a <head>; I can't load CSS into the page; formatting tags are not interpreted as html; etc.
How can I convince the browser that my page is now bona fide html?


how do I make javascript function run in same window; it's reloading to a new page [duplicate]

I know document.write is considered bad practice; and I'm hoping to compile a list of reasons to submit to a 3rd party vendor as to why they shouldn't use document.write in implementations of their analytics code.
Please include your reason for claiming document.write as a bad practice below.
A few of the more serious problems:
document.write (henceforth DW) does not work in XHTML
DW does not directly modify the DOM, preventing further manipulation (trying to find evidence of this, but it's at best situational)
DW executed after the page has finished loading will overwrite the page, or write a new page, or not work
DW executes where encountered: it cannot inject at a given node point
DW is effectively writing serialised text which is not the way the DOM works conceptually, and is an easy way to create bugs (.innerHTML has the same problem)
Far better to use the safe and DOM friendly DOM manipulation methods
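As a rough illustration of the "DOM friendly" route mentioned above, here is a minimal sketch. The function name appendParagraph is made up for this example, and the document is passed in as a parameter only so the function can be exercised outside a browser.

```javascript
// Append a paragraph to `parent` using DOM methods instead of a markup string.
// `doc` is the document used to create nodes (pass `document` in a browser).
// appendParagraph is a hypothetical helper name, not a built-in.
function appendParagraph(doc, parent, text) {
  const p = doc.createElement('p');
  // textContent stores the text literally, so "<" and "&" need no escaping.
  p.textContent = text;
  parent.appendChild(p);
  return p;
}

// Browser usage (can run at any time, before or after the page has loaded):
// appendParagraph(document, document.body, 'Added without document.write');
```

Unlike document.write, this can target any node and returns a real element that can be manipulated further.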
There's actually nothing wrong with document.write, per se. The problem is that it's really easy to misuse it. Grossly, even.
In terms of vendors supplying analytics code (like Google Analytics) it's actually the easiest way for them to distribute such snippets
It keeps the scripts small
They don't have to worry about overriding already established onload events or including the necessary abstraction to add onload events safely
It's extremely compatible
As long as you don't try to use it after the document has loaded, document.write is not inherently evil, in my humble opinion.
Another legitimate use of document.write comes from the HTML5 Boilerplate index.html example.
<!-- Grab Google CDN's jQuery, with a protocol relative URL; fall back to local if offline -->
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.6.3/jquery.min.js"></script>
<script>window.jQuery || document.write('<script src="js/libs/jquery-1.6.3.min.js"><\/script>')</script>
I've also seen the same technique for using the json2.js JSON parse/stringify polyfill (needed by IE7 and below).
<script>window.JSON || document.write('<script src="json2.js"><\/script>')</script>
It can block your page
document.write only works while the page is loading; if you call it after the page is done loading, it will overwrite the whole page.
This effectively means you have to call it from an inline script block - And that will prevent the browser from processing parts of the page that follow. Scripts and Images will not be downloaded until the writing block is finished.
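One way to guard against the overwrite case is to check document.readyState before writing. This is only a sketch: writeOrAppend is a made-up helper name, and it does not cover scripts loaded with async, where browsers ignore the call instead of overwriting the page.

```javascript
// True only while the parser is still consuming the page. Once readyState
// has moved past 'loading', document.write would reopen (and wipe) the document.
function isSafeToDocumentWrite(doc) {
  return doc.readyState === 'loading';
}

// Hypothetical helper: write during parsing, otherwise fall back to
// appending the markup to the end of <body> through the DOM.
function writeOrAppend(doc, html) {
  if (isSafeToDocumentWrite(doc)) {
    doc.write(html);
    return 'written';
  }
  doc.body.insertAdjacentHTML('beforeend', html);
  return 'appended';
}

// Browser usage: writeOrAppend(document, '<p>hello</p>');
```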
Pro:
It's the easiest way to embed inline content from an external (to your host/domain) script.
You can overwrite the entire content in a frame/iframe. I used to use this technique a lot for menu/navigation pieces before more modern Ajax techniques were widely available (1998-2002).
Con:
It forces the rendering engine to pause until said external script is loaded, which could take much longer than an internal script.
It is usually used in such a way that the script is placed within the content, which is considered bad form.
Here's my twopence worth: in general you shouldn't use document.write for heavy lifting, but there is one instance where it is definitely useful:
http://www.quirksmode.org/blog/archives/2005/06/three_javascrip_1.html
I discovered this recently trying to create an AJAX slider gallery. I created two nested divs, and applied width/height and overflow: hidden to the outer <div> with JS. This was so that in the event that the browser had JS disabled, the div would float to accommodate the images in the gallery - some nice graceful degradation.
Thing is, as with the article above, this JS hijacking of the CSS didn't kick in until the page had loaded, causing a momentary flash as the div was loaded. So I needed to write a CSS rule, or include a sheet, as the page loaded.
Obviously, this won't work in XHTML, but since XHTML appears to be something of a dead duck (and renders as tag soup in IE) it might be worth re-evaluating your choice of DOCTYPE...
It overwrites content on the page which is the most obvious reason but I wouldn't call it "bad".
It just doesn't have much use unless you're creating an entire document using JavaScript in which case you may start with document.write.
Even so, you aren't really leveraging the DOM when you use document.write--you are just dumping a blob of text into the document so I'd say it's bad form.
It breaks pages using XML rendering (like XHTML pages).
Best: some browsers switch back to HTML rendering and everything works fine.
Probable: some browsers disable the document.write() function in XML rendering mode.
Worst: some browsers will fire an XML error whenever the document.write() function is used.
Off the top of my head:
document.write needs to be used during the page load or body load. So if you want to use the script at any other time to update your page content, document.write is pretty much useless.
Technically document.write will only update HTML pages not XHTML/XML. IE seems to be pretty forgiving of this fact but other browsers will not be.
http://www.w3.org/MarkUp/2004/xhtml-faq#docwrite
Chrome may block document.write that inserts a script in certain cases. When this happens, it will display this warning in the console:
A Parser-blocking, cross-origin script, ..., is invoked via
document.write. This may be blocked by the browser if the device has
poor network connectivity.
References:
This article on developers.google.com goes into more detail.
https://www.chromestatus.com/feature/5718547946799104
Browser Violation
.write is considered a browser violation as it halts the parser from rendering the page. The parser receives the message that the document is being modified; hence, it gets blocked until JS has completed its process. Only at this time will the parser resume.
Performance
The biggest consequence of employing such a method is lowered performance. The browser will take longer to load page content. The adverse reaction on load time depends on what is being written to the document. You won't see much of a difference if you are adding a <p> tag to the DOM as opposed to passing an array of 50-some references to JavaScript libraries (something which I have seen in working code and resulted in an 11 second delay - of course, this also depends on your hardware).
All in all, it's best to steer clear of this method if you can help it.
For more info see Intervening against document.write()
I don't think using document.write is a bad practice at all. In simple words it is like a high voltage for inexperienced people. If you use it the wrong way, you get cooked. There are many developers who have used this and other dangerous methods at least once, and they never really dig into their failures. Instead, when something goes wrong, they just bail out, and use something safer. Those are the ones who make such statements about what is considered a "Bad Practice".
It's like formatting a hard drive, when you need to delete only a few files and then saying "formatting drive is a bad practice".
Based on analysis done by Google-Chrome Dev Tools' Lighthouse Audit,
For users on slow connections, external scripts dynamically injected via document.write() can delay page load by tens of seconds.
One can think of document.write() (and .innerHTML) as evaluating a source code string. This can be very handy for many applications. For example if you get HTML code as a string from some source, it is handy to just "evaluate" it.
In the context of Lisp, DOM manipulation would be like manipulating a list structure, e.g. create the list (orange) by doing:
(cons 'orange '())
And document.write() would be like evaluating a string, e.g. create a list by evaluating a source code string like this:
(eval-string "(cons 'orange '())")
Lisp also has the very useful ability to create code using list manipulation (like using the "DOM style" to create a JS parse tree). This means you can build up a list structure using the "DOM style", rather than the "string style", and then run that code, e.g. like this:
(eval '(cons 'orange '()))
If you implement coding tools, like simple live editors, it is very handy to have the ability to quickly evaluate a string, for example using document.write() or .innerHTML. Lisp is ideal in this sense, but you can do very cool stuff also in JS, and many people are doing that, like http://jsbin.com/
A simple reason why document.write is a bad practice is that you cannot come up with a scenario where you cannot find a better alternative.
Another reason is that you are dealing with strings instead of objects (it is very primitive).
It only appends to documents.
It has nothing of the beauty of for instance the MVC (Model-View-Controller) pattern.
It is a lot more powerful to present dynamic content with ajax+jQuery or angularJS.
The disadvantages of document.write mainly depend on these 3 factors:
a) Implementation
The document.write() is mostly used to write content to the screen as soon as that content is needed. This means it happens anywhere, either in a JavaScript file or inside a script tag within an HTML file. With the script tag being placed anywhere within such an HTML file, it is a bad idea to have document.write() statements inside script blocks that are intertwined with HTML inside a web page.
b) Rendering
Well designed code in general will take any dynamically generated content, store it in memory, keep manipulating it as it passes through the code before it finally gets spit out to the screen. So to reiterate the last point in the preceding section, rendering content in-place may render faster than other content that may be relied upon, but it may not be available to the other code that in turn requires the content to be rendered for processing. To solve this dilemma we need to get rid of the document.write() and implement it the right way.
c) Impossible Manipulation
Once it's written it's done and over with. We cannot go back to manipulate it without tapping into the DOM.
I think the biggest problem is that any elements written via document.write are added to the end of the page's elements. That's rarely the desired effect with modern page layouts and AJAX. (you have to keep in mind that the elements in the DOM are temporal, and when the script runs may affect its behavior).
It's much better to set a placeholder element on the page, and then manipulate its innerHTML.

How to know if web content cannot be handled by Scrapy?

I apologize if my question sounds too basic or general, but it has puzzled me for quite a while. I am a political scientist with little IT background. My own research on this question does not solve the puzzle.
It is said that Scrapy cannot scrape web content generated by JavaScript or AJAX. But how can we know if certain content falls in this category? I once came across some texts that show in Chrome Inspect, but could not be extracted by Xpath (I am 99.9% certain my Xpath expression was correct). Someone mentioned that the texts might be hidden behind some JavaScript. But this is still speculation, I can't be totally sure that it wasn't due to wrong Xpath expressions. Are there any signs that can make me certain that this is something beyond Scrapy and can only be dealt with programs such as Selenium? Any help appreciated.
-=-=-=-=-=
Edit (1/18/15): The webpage I'm working with is http://yhfx.beijing.gov.cn/webdig.js?z=5. The specific piece of information I want to scrape is circled in red ink (see screenshot below. Sorry, it's in Chinese).
I can see the desired text in Chrome's Inspect, which indicates that the Xpath expression to extract it should be response.xpath("//table/tr[13]/td[2]/text()").extract(). However, the expression doesn't work.
I examined response.body in Scrapy shell. The desired text is not in it. I suspect that it is JavaScript or AJAX here, but in the html, I did not see signs of JavaScript or AJAX. Any idea what it is?
It is said that Scrapy cannot scrape web content generated by JavaScript or AJAX. But how can we know if certain content falls in this category?
The browsers do a lot of things when you open a web page. I will oversimplify the process here:
1. Performs an HTTP request to the server hosting the web page.
2. Parses the response, which in most cases is HTML content (text-based format). We will assume we get an HTML response.
3. Starts rendering the HTML, executes the JavaScript code, retrieves external resources (images, CSS files, JS files, fonts, etc.). Not necessarily in this order.
4. Listens to events that may trigger more requests to inject more content into the page.
Scrapy provides tools to do 1. and 2. Selenium and other tools like Splash do 3., allow you to do 4. and access the rendered HTML.
Now, I think there are three basic cases you face when you want to extract text content from a web page:
1. The text is in plain HTML format, for example, as a text node or HTML attribute: <a>foo</a>, <a href="foo" />. The content could be visually hidden by CSS or JavaScript, but as long as it is part of the HTML tree we can extract it via XPath/CSS rules.
2. The content is located in JavaScript code. For example: <script>var cfg = {code: "foo"};</script>. We can locate the <script> node with an XPath rule and then use regular expressions to extract the string we want. Also there are libraries that allow us to parse pieces of JavaScript so we can load objects easily. A more complex solution here is executing the JavaScript code via a JavaScript engine.
3. The content is located in an external resource and is loaded via Ajax/XHR. Here you can emulate the XHR request with Scrapy and then parse the output, which can be a nice JSON object, arbitrary JavaScript code or simply HTML content. If it gets tricky to reverse engineer how the content is retrieved/parsed then you can use Selenium or Splash as a proxy for Scrapy so you can access the rendered content and still be able to use Scrapy for your crawler.
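For the second case, the regular-expression step can be sketched as follows. The idea is language-independent; it is written in JavaScript here only to match the rest of this page. extractStringProperty is a made-up name, and the pattern is a heuristic that assumes a double-quoted value without escaped quotes, not a real JavaScript parser.

```javascript
// Pull the value of a string property out of inline script source text.
// Heuristic only: assumes `name: "value"` with no escaped quotes inside,
// and that propertyName contains no regex metacharacters.
function extractStringProperty(scriptSource, propertyName) {
  const pattern = new RegExp('\\b' + propertyName + '\\s*:\\s*"([^"]*)"');
  const match = pattern.exec(scriptSource);
  return match ? match[1] : null;
}

// The text content of the <script> node from the example above.
const inlineScript = 'var cfg = {code: "foo"};';
const codeValue = extractStringProperty(inlineScript, 'code');
```

If the script holds anything more complicated than a flat string, a proper parser is likely to be less fragile than a pattern like this.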
How do you know which case you have? You can simply look up the content in the response body:
$ scrapy shell http://example.com/page
...
>>> 'foo' in response.text.lower()  # response.body is bytes on Python 3, so test the decoded text
True
If you see foo in the web page via the browser but the test above returns False, then it's likely the content is loaded via Ajax/XHR. You have to check the network activity in the browser and see what requests are being made and what the responses are. Otherwise you are in case 1. or 2. You can simply view the source in the browser and search for the content to figure out where it is located.
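The same decision can be written down as a small function over the raw (unrendered) response body. This is a sketch in JavaScript with made-up names and labels; stripping <script> blocks with a regular expression is an approximation, not a real HTML parse.

```javascript
// Classify where `needle` lives, given the raw response body as a string.
// Returns 'absent' (likely loaded later via Ajax/XHR), 'script' (only inside
// inline JavaScript), or 'html' (present in ordinary markup).
function locateContent(rawHtml, needle) {
  const haystack = rawHtml.toLowerCase();
  const target = needle.toLowerCase();
  if (!haystack.includes(target)) {
    return 'absent';
  }
  // Drop inline scripts; if the text survives, it is part of the markup.
  const withoutScripts = haystack.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/g, '');
  return withoutScripts.includes(target) ? 'html' : 'script';
}
```

An 'absent' result is the signal to open the browser's network panel and look for the request that actually carries the data.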
Let's say the content you want is located in HTML tags. How do you know if your XPath expression is correct? (By correct here we mean that it gives you the output you expect.)
Well, if you do scrapy shell and response.xpath(expression) returns nothing, then your XPath is not correct. You should reduce the specificity of your expression until you get an output that includes the content you want, and then narrow it down.

using document.write in remotely loaded javascript to write out content - why a bad idea?

I'm not a full-time Javascript developer. We have a web app and one piece is to write out a small informational widget onto another domain. This literally is just a html table with some values written out into it. I have had to do this a couple of times over the past 8 years and I always end up doing it via a script that just document.write's out the table.
For example:
document.write('<table border="1"><tr><td>here is some content</td></tr></table>');
on theirdomain.com
<body>
....
<script src='http://ourdomain.com/arc/v1/api/inventory/1' type='text/javascript'></script>
.....
</body>
I always think this is a bit ugly but it works fine and we always have control over the content (or a trusted representative has control, for things such as your current inventory). So another project like this came up and I coded it up in like 5 minutes using document.write. Somebody else thinks this is just too ugly but I don't see what the problem is. Re the widget aspect, I have also done iframe and jsonp implementations, but iframe tends not to play well with other sites' css and jsonp tends to just be too much. Is there some security element I'm missing? Or is what I'm doing ok? What would be the strongest argument against using this technique? Is there a best practice I don't get?
To be honest, I don't really see a problem. Yes, document.write is very old-school, but it is simple and universally supported; you can depend on it working the same in every browser.
For your application (writing out a HTML table with some data), I don't think a more complex solution is necessary if you're willing to assume a few small risks. Dealing with DOM mutation that works correctly across browsers is not an easy thing to get right if you're not using jQuery (et al).
The risks of document.write:
Your script must be loaded synchronously. This means a normal inline script tag (like you're already using). However, if someone gets clever and adds the async or defer attributes to your script tag (or does something fancy like appending a dynamically created script element to the head), your script will be loaded asynchronously.
This means that when your script eventually loads and calls write, the main document may have already finished loading and the document is "closed". Calling write on a closed document implicitly calls open, which completely clears the DOM: it's essentially the same as wiping the page clean and starting from scratch. You don't want that.
Because your script is loaded synchronously, you put third-party pages at the mercy of your server. If your server goes down or gets overloaded and responds slowly, every page that contains your script tag cannot finish loading until your server does respond or the browser times out the request.
The people who put your widget on their website will not be happy.
If you're confident in your uptime, then there's really no reason to change what you're doing.
The alternative is to load your script asynchronously and insert your table into the correct spot in the DOM. This means third parties would have to insert a script snippet (either <script async src="..."> or the dynamic script tag insertion trick). They would also need to carve out a special <div id="tablegoeshere"> for you to put your table into.
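A minimal sketch of what the asynchronously loaded script could do once it runs. Only the tablegoeshere id comes from the answer above; the function names and the row data are made up for illustration.

```javascript
// Escape text so data values cannot inject markup into the host page.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build the table markup from an array of rows, each an array of cell values.
function buildTableHtml(rows) {
  const body = rows
    .map((cells) => '<tr>' + cells.map((c) => '<td>' + escapeHtml(c) + '</td>').join('') + '</tr>')
    .join('');
  return '<table border="1">' + body + '</table>';
}

// Put the table into the placeholder. Returns false if the host page
// has no element with that id (yet), so the caller can retry or give up.
function mountWidget(doc, placeholderId, rows) {
  const target = doc.getElementById(placeholderId);
  if (!target) return false;
  target.innerHTML = buildTableHtml(rows);
  return true;
}

// In the browser, the remotely loaded script would end with something like:
// mountWidget(document, 'tablegoeshere', [['here is some content']]);
```

If the placeholder may appear later in the page than the script tag, waiting for the DOMContentLoaded event before mounting is one option.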
Using document.write() after the entire DOM has loaded does not allow you to access the DOM any further.
See Why do I need to use document.write instead of DOM manipulation methods?.
In that case you are throwing away a very powerful capability of the web page...
Is there some security element I'm missing?
The security risk is for them, in that theirdomain.com is trusting your domain's script code not to do anything malicious. Your client script will run in the context of their domain and can do what it likes, such as stealing cookies or embedding a key logger (not that you would do that of course). As long as they trust you, that is fine.

Parsing HTML using JavaScript

I'm working on a page that needs to fetch info from some other pages and then display parts of that information/data on the current page.
I have the HTML source code that I need to parse in a string. I'm looking for a library that can help me do this easily. (I just need to extract specific tags and the text they contain)
The HTML is well formed (All closing/ending tags present).
I've looked at some options but they have all been extremely difficult to work with for various reasons.
I've tried the following solutions:
jkl-parsexml library (The library js file itself throws up HTTPError 101)
jQuery.parseXML Utility (Didn't find much documentation/many examples to figure out what to do)
XPATH (The Execute statement is not working but the JS Error Console shows no errors)
And so I'm looking for a more user friendly library or anything(tutorials/books/references/documentation) that can let me use the aforementioned tools better, more easily and efficiently.
An ideal solution would be something like BeautifulSoup available in Python.
Using jQuery, it would be as simple as $(HTMLstring); to create a jQuery object with the HTML data from the string inside it (this DOM would be disconnected from your document). From there it's very easy to do whatever you want with it--and traversing the loaded data is, of course, a cinch with jQuery.
You can do something like this:
$("string with html here").find("jquery selector")
$("string with html here") creates a document fragment and parses your HTML into it. find then searches for elements inside that fragment (and only inside it). None of this is added to the page DOM.
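Without jQuery, browsers offer the built-in DOMParser, which parses a string into a detached document that can be queried with the usual selector methods. A minimal sketch follows; extractText is a made-up name, and the third parameter exists only so a different parser implementation can be supplied where the global is missing.

```javascript
// Parse an HTML string into a detached document and collect the trimmed
// text of every element matching `selector`. Nothing touches the page DOM.
function extractText(htmlString, selector, Parser = globalThis.DOMParser) {
  const doc = new Parser().parseFromString(htmlString, 'text/html');
  return Array.from(doc.querySelectorAll(selector), (el) => el.textContent.trim());
}

// Browser usage:
// extractText('<ul><li>one</li><li>two</li></ul>', 'li');
```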

Is there a way to validate the HTML of a page after AJAX operations are performed on it?

I'm writing a web app that inserts and modifies HTML elements via AJAX using JQuery. It works very nicely, but I want to be sure everything is ok under the bonnet. When I inspect the source of the page in IE or Chrome it shows me the original document markup, not what has changed since my AJAX calls.
I love using the W3C validator to check my markup as it occasionally reminds me that I've forgotten to close a tag etc. How can I use this to check the markup of my page after the original source served from the server has been changed via JavaScript?
Thank you.
Use the developer tools in Chrome to explore the DOM: they will show you all the HTML you've added via JavaScript.
You can now copy it and paste it in any validator you want.
Or, instead of inserting the markup with jQuery, log it to the console; since the browser never parses it there, it cannot close tags for you.
console.log(myHTML)
Both previous answers make good points about the fact the browser will 'fix' some of the html you insert into the DOM.
Back to your question, you could add the following to a bookmark in your browser. It will write out the contents of the DOM to a new window, from which you can copy and paste it into a validator.
javascript:window.open("").document.open("text/plain", "").write(document.documentElement.outerHTML);
If you're just concerned about well-formedness (missing closing tags and such), you probably just want to check the structure of the chunks AJAX is inserting. (Once it's part of the DOM, it's going to be well-formed... just not necessarily the structure you intended.) The simplest way to do that would probably be to attempt to parse it using an XML library. (one with an HTML mode that can be made strict, if you're not using XHTML)
Actual validation (Testing the "You can't put tag X inside tag Y" rules which browsers generally don't care too much about) is a lot trickier and, depending on how much effort you're willing to put into it, may not be worth the trouble. (Because, if you validate them in isolation, you'll get a lot of "This is just a fragment" false positives)
Whichever you decide to use, you need to grab the AJAX responses before the browser parses them if you want a reliable test result. (While they're still just a string of text rather than a DOM tree)
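As a rough stand-in for the strict XML parse suggested above, here is a small stack-based tag-balance check that can be run on the response string before it is handed to the browser. It is a sketch with made-up names, not a validator: it ignores comments, and attribute values containing ">" will confuse it. It is also deliberately stricter than HTML, which allows some closing tags (such as </p> and </li>) to be omitted.

```javascript
// Elements that never take a closing tag in HTML.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr'
]);

// Report tags that are left open or closed without a matching opener.
function findTagProblems(fragment) {
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9-]*)\b[^>]*?(\/?)>/g;
  const open = [];      // stack of currently open element names
  const problems = [];
  let match;
  while ((match = tagPattern.exec(fragment)) !== null) {
    const isClosing = match[1] === '/';
    const name = match[2].toLowerCase();
    const selfClosed = match[3] === '/';
    if (VOID_ELEMENTS.has(name) || selfClosed) continue;
    if (!isClosing) {
      open.push(name);
      continue;
    }
    const index = open.lastIndexOf(name);
    if (index === -1) {
      problems.push('unexpected </' + name + '>');
      continue;
    }
    // Everything opened after the matching tag was never closed.
    for (const unclosed of open.splice(index + 1)) {
      problems.push('unclosed <' + unclosed + '>');
    }
    open.pop();
  }
  for (const name of open) problems.push('unclosed <' + name + '>');
  return problems;
}
```

Calling it in the AJAX success handler, before the string is inserted, keeps the check on the raw text rather than on markup the browser has already repaired.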
