I'm working on a page that needs to fetch information from some other pages and then display parts of that data on the current page.
I have the HTML source that I need to parse in a string, and I'm looking for a library that can help me do this easily. (I just need to extract specific tags and the text they contain.)
The HTML is well formed (All closing/ending tags present).
I've looked at some options, but they have all been extremely difficult to work with for various reasons.
I've tried the following solutions:
jkl-parsexml library (the library's JS file itself throws up an HTTPError 101)
jQuery.parseXML utility (I didn't find enough documentation or examples to figure out what to do)
XPath (the execute statement is not working, but the JS error console shows no errors)
So I'm looking for a more user-friendly library, or for anything (tutorials/books/references/documentation) that can help me use the aforementioned tools better, more easily, and more efficiently.
An ideal solution would be something like BeautifulSoup in Python.
Using jQuery, it would be as simple as $(HTMLstring); to create a jQuery object with the HTML data from the string inside it (this DOM would be disconnected from your document). From there it's very easy to do whatever you want with it, and traversing the loaded data is, of course, a cinch with jQuery.
You can do something like this:
$("string with html here").find("jquery selector")
$("string with html here") this will create a document fragment and put an html into it (basically, it will parse your HTML). And find will search for elements in that document fragment (and only inside it). At the same time it will not put it in page DOM
I have an xml document loaded into the browser. I need to use it as a template to generate and display as an html page in its place, with all the work happening in JavaScript on the client.
I've got it mostly done:
The xml document loads a JavaScript file.
The JavaScript reads the document and generates the html document that I want to display.
The JavaScript replaces the document's innerHTML with the new html document.
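In code, those steps look roughly like this (a minimal sketch; the tag names and generated markup are illustrative stand-ins for my actual page):

// Runs from the script file the XML document loads.
var items = document.getElementsByTagName("item");  // read the XML data
var body = "";
for (var i = 0; i < items.length; i++) {
  body += "<p>" + items[i].textContent + "</p>";    // generate the HTML
}
document.documentElement.innerHTML = "<body>" + body + "</body>"; // display it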
The one thing that I'm missing is that I'd like to also supply the head of the new document.
I can create the head's content, of course. But, I cannot find any way to set it back into the browser's document. All my obvious attempts fail, hitting read-only elements of the document, or operations that are not supported on non-HTML documents.
Is there any way to do this, or am I barking up the wrong tree?
Alternative question: Even if this is really impossible, maybe it doesn't matter. Possibly I can use my JavaScript to accomplish everything that might be controlled by the head (e.g., viewport settings). After all, the JavaScript knows the full display environment and can make all needed decisions. Is this a sane approach, or will it lead to thousands of lines of code to handle browser-specific or device-specific special cases?
Edited - added the following:
I think the real question is broader: The browser (at least Chrome and Chromium) seems to make a sharp distinction between pages loaded as .xml and pages loaded as .html. I'm trying to bend these rules: I load a page as .xml, but then I use JavaScript to change it into .html.
But the browser still wants to view the page as .xml. This manifests in many ways: I can't add a <head>; I can't load CSS into the page; formatting tags are not interpreted as HTML; etc.
How can I convince the browser that my page is now bona fide html?
In its content attribute, the Blogger API returns an ugly blob of HTML. I would like to convert this HTML string into a DOM that I can parse. What is the best way to parse this text so that I can re-render it within a JS widget I'm building for another website?
I'd rather not write my own parser that reverse-engineers the HTML encoding that Google put in place. Ideally, I'm looking for a library which undoes the HTML escaping and then turns the result into a DOM that I can inspect with jQuery.
Apparently this question was based on some slightly false premises. I have since managed to successfully embed blogs in my website. I have been using AngularJS, which escapes HTML by default before embedding it into the DOM. This caused some heavy confusion on my part. The response from Google is not escaped.
This means parsing it as a DOM is simply a matter of calling jQuery.parseHTML(). See: http://api.jquery.com/jquery.parsehtml/
Once this is done, whatever jQuery transformations need to be made can be applied using AngularJS's jqLite by calling angular.element().
Finally, the object can be bound to the document.
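Put together, the flow looks roughly like this (the selector, class, and target id are illustrative assumptions; with full jQuery loaded, angular.element is simply an alias for jQuery):

var nodes = jQuery.parseHTML(post.content);       // string -> array of DOM nodes
var el = angular.element(nodes);                  // wrap for traversal/manipulation
el.find("img").addClass("post-image");            // whatever transformations you need
angular.element(document.getElementById("posts")).append(el); // bind to the document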
Alternatively, the raw content of the list of blog posts can be injected as an HTML string in the regular Angular way, using something like this:
$scope.frontPagePosts = posts.map(function (post) {
  // Mark each post's HTML as trusted so Angular won't escape it.
  post.content = $sce.trustAsHtml(post.content);
  return post;
});
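For the trusted HTML to actually render un-escaped, the template presumably binds it with ng-bind-html, along the lines of:

<div ng-repeat="post in frontPagePosts" ng-bind-html="post.content"></div>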
I am using HtmlUnit to read content from a web site.
Everything works perfectly to the point where I am reading the content with:
HtmlDivision div = page.getHtmlElementById("my-id");
Even div.asText() returns the expected String object, but I want to get the original HTML inside <div>...</div> as a String object. How can I do that?
I am not willing to change HtmlUnit to something else, as the web site expects the client to run JavaScript, and HtmlUnit seems to be capable of doing what is required.
If by original HTML you mean the HTML code as HtmlUnit has already parsed and re-formatted it, then you can use div.asXml(). If, however, you really are looking for the original HTML the server sent you, then you won't find a way to get it (at least up to v2.14).
Now, as a workaround, you could get the whole text of the page that the server sent you with this answer: How to get the pure raw HTML of a page in HTMLUnit while ignoring JavaScript and CSS?
As a side note, you should probably think twice about why you need the HTML code. HtmlUnit will let you get the data out of the page, so there shouldn't be any need to store the source code, but rather the information contained in it. Just my 2 cents.
Is it possible to obtain the content of an external script as a string? Something equivalent to myInlineScript.textContent?
The scenario is that I'm just starting to get into WebGL and all the tutorials I'm finding store shaders as inline <script type="x-shader/x-foo"> tags. The element itself isn't important, however — the shaders are created from plain strings. These are easily extracted from inline scripts via .textContent, but I'd of course prefer to keep my code separate from my markup.
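For reference, the inline extraction those tutorials use is just something like this (the element id here is an arbitrary example):

var shaderSource = document.getElementById("shader-fs").textContent;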
Is there some equivalent for getting the source / content of an external script? A quick scan of the DOM docs and a Google search didn't yield anything enlightening.
Update
The WebGL shaders aren't a huge problem; there are a number of ways of getting strings into JavaScript! But this isn't the only time I've encountered script tags being used to inline non-scripts, and it got me curious about a better way to do it.
If it's on the same domain, you can just do a normal AJAX request for it and get the file back as a string. This works for any kind of text file.
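A minimal sketch with jQuery (the shader path is an illustrative assumption):

$.get("shaders/fragment.glsl", function (source) {
  // source is the raw file contents; hand it to gl.shaderSource(...) etc.
}, "text");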
Also, I am not familiar with WebGL, but if your shaders are in external files then I assume the "x-shader" script type is simply a way to put them inline. Otherwise, it's just a string you pass to a method somewhere. So don't over-think this too much.
I'm writing a web app that inserts and modifies HTML elements via AJAX using jQuery. It works very nicely, but I want to be sure everything is OK under the bonnet. When I inspect the source of the page in IE or Chrome, it shows me the original document markup, not what has changed since my AJAX calls.
I love using the W3C validator to check my markup, as it occasionally reminds me that I've forgotten to close a tag, etc. How can I use it to check the markup of my page after the original source served from the server has been changed via JavaScript?
Thank you.
Use the developer tools in Chrome to explore the DOM: they will show you all the HTML you've added in JavaScript.
You can then copy it and paste it into any validator you want.
Or, instead of inserting the markup with jQuery, log it to the console; that way the browser will not have had a chance to close tags for you:
console.log(myHTML)
Both previous answers make good points about the fact that the browser will 'fix' some of the HTML you insert into the DOM.
Back to your question: you could add the following bookmarklet to your browser. It will write out the contents of the DOM to a new window, which you can copy and paste into a validator.
javascript:window.open("").document.open("text/plain", "").write(document.documentElement.outerHTML);
If you're just concerned about well-formedness (missing closing tags and such), you probably just want to check the structure of the chunks AJAX is inserting. (Once a chunk is part of the DOM, it's going to be well-formed, just not necessarily the structure you intended.) The simplest way to do that would probably be to attempt to parse each chunk with an XML parser (one with a strict HTML mode, if you're not using XHTML).
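In the browser, DOMParser's XML mode can serve as that strict parser (a rough sketch; browsers report failures by embedding a <parsererror> element in the result):

function isWellFormed(fragment) {
  // Wrap the fragment so multiple top-level elements still parse.
  var doc = new DOMParser().parseFromString(
    "<root>" + fragment + "</root>", "application/xml");
  return doc.getElementsByTagName("parsererror").length === 0;
}

isWellFormed("<p>ok</p>");  // true
isWellFormed("<p>oops");    // false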
Actual validation (Testing the "You can't put tag X inside tag Y" rules which browsers generally don't care too much about) is a lot trickier and, depending on how much effort you're willing to put into it, may not be worth the trouble. (Because, if you validate them in isolation, you'll get a lot of "This is just a fragment" false positives)
Whichever you decide to use, you need to grab the AJAX responses before the browser parses them if you want a reliable test result (while they're still just strings of text rather than a DOM tree).
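With jQuery, the global ajaxComplete hook is one place to grab them (a sketch, reusing the isWellFormed helper from above):

$(document).ajaxComplete(function (event, xhr) {
  var raw = xhr.responseText;  // the unparsed response string
  if (!isWellFormed(raw)) {
    console.warn("Malformed fragment:", raw);
  }
});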