Getting access to the original HTML in HtmlUnit HtmlElement? - javascript

I am using HtmlUnit to read content from a web site.
Everything works perfectly to the point where I am reading the content with:
HtmlDivision div = page.getHtmlElementById("my-id");
Even div.asText() returns the expected String object, but I want to get the original HTML inside <div>...</div> as a String object. How can I do that?
I am not willing to change HtmlUnit to something else, as the web site expects the client to run JavaScript, and HtmlUnit seems to be capable of doing what is required.

If by original HTML you mean the HTML as HtmlUnit has already parsed and formatted it, then you can use div.asXml(). If you are really looking for the original HTML the server sent you, then you won't find a way to get it from the element itself (at least up to v2.14).
Now, as a workaround, you could get the whole text of the page as the server sent it, as described in this answer: How to get the pure raw HTML of a page in HTMLUnit while ignoring JavaScript and CSS?
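For reference, a minimal Java sketch of both options; getWebResponse().getContentAsString() is, as far as I know, HtmlUnit's way to get at the raw page source, but treat the exact call as an assumption and check it against your version:
HtmlDivision div = page.getHtmlElementById("my-id");
String reserialized = div.asXml(); // the <div> and its contents, re-serialized by HtmlUnit
String raw = page.getWebResponse().getContentAsString(); // the page source as the server sent it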
As a side note, you should probably think twice about why you need the HTML code. HtmlUnit will let you get the data out of the page, so there shouldn't be any need to store the source code, but rather the information contained in it. Just my 2 cents.

Related

How to know if web content cannot be handled by Scrapy?

I apologize if my question sounds too basic or general, but it has puzzled me for quite a while. I am a political scientist with little IT background. My own research on this question does not solve the puzzle.
It is said that Scrapy cannot scrape web content generated by JavaScript or AJAX. But how can we know if certain content falls into this category? I once came across some text that showed up in Chrome's Inspect panel but could not be extracted by XPath (I am 99.9% certain my XPath expression was correct). Someone mentioned that the text might be hidden behind some JavaScript. But this is still speculation; I can't be totally sure that it wasn't due to a wrong XPath expression. Are there any signs that can make me certain that something is beyond Scrapy and can only be dealt with by programs such as Selenium? Any help appreciated.
-=-=-=-=-=
Edit (1/18/15): The webpage I'm working with is http://yhfx.beijing.gov.cn/webdig.js?z=5. The specific piece of information I want to scrape is circled in red ink (see the screenshot below; sorry, it's in Chinese).
I can see the desired text in Chrome's Inspect panel, which suggests that the XPath expression to extract it should be response.xpath("//table/tr[13]/td[2]/text()").extract(). However, the expression doesn't work.
I examined response.body in the Scrapy shell. The desired text is not in it. I suspect that JavaScript or AJAX is involved here, but in the HTML I did not see signs of either. Any idea what it is?
It is said that Scrapy cannot scrape web content generated by JavaScript or AJAX. But how can we know if certain content falls in this category?
Browsers do a lot of things when you open a web page. I will oversimplify the process here:
1. Performs an HTTP request to the server hosting the web page.
2. Parses the response, which in most cases is HTML content (a text-based format). We will assume we get an HTML response.
3. Starts rendering the HTML, executes the JavaScript code, and retrieves external resources (images, CSS files, JS files, fonts, etc.). Not necessarily in this order.
4. Listens for events that may trigger more requests to inject more content into the page.
Scrapy provides tools to do 1 and 2. Selenium and other tools like Splash do 3, allow you to do 4, and give you access to the rendered HTML.
Now, I think there are three basic cases you face when you want to extract text content from a web page:
1. The text is in plain HTML format, for example as a text node or HTML attribute: <a>foo</a>, <a href="foo" />. The content could be visually hidden by CSS or JavaScript, but as long as it is part of the HTML tree we can extract it via XPath/CSS rules.
2. The content is located in JavaScript code, for example: <script>var cfg = {code: "foo"};</script>. We can locate the <script> node with an XPath rule and then use regular expressions to extract the string we want (see the shell sketch after this list). There are also libraries that allow us to parse pieces of JavaScript so we can load objects easily. A complex solution here is executing the JavaScript code via a JavaScript engine.
3. The content is located in an external resource and is loaded via Ajax/XHR. Here you can emulate the XHR request with Scrapy and then parse the output, which can be a nice JSON object, arbitrary JavaScript code, or simply HTML content. If it gets tricky to reverse-engineer how the content is retrieved and parsed, then you can use Selenium or Splash as a proxy for Scrapy, so you can access the rendered content and still be able to use Scrapy for your crawler.
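As a rough sketch of case 2, here is what that could look like in a scrapy shell session (the URL, the cfg script, and the regex are the made-up example from above; .re_first() assumes a reasonably recent Scrapy, older versions have .re()):
$ scrapy shell http://example.com/page
...
>>> # locate the <script> node that defines cfg, then regex out the value
>>> response.xpath('//script[contains(., "var cfg")]/text()').re_first(r'code: "(\w+)"')
'foo'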
How do you know which case you have? You can simply look for the content in the response body:
$ scrapy shell http://example.com/page
...
>>> 'foo' in response.body.lower()
True
If you see foo in the web page via the browser but the test above returns False, then the content is likely loaded via Ajax/XHR. You have to check the network activity in the browser and see what requests are being made and what the responses are. Otherwise you are in case 1 or 2: you can simply view the source in the browser and search for the content to figure out where it is located.
Let's say the content you want is located in HTML tags. How do you know if your XPath expression is correct? (By correct here we mean that it gives you the output you expect.)
Well, if you do scrapy shell and response.xpath(expression) returns nothing, then your XPath is not correct. You should reduce the specificity of your expression until you get output that includes the content you want, and then narrow it down again, as in the session sketched below.
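For example, continuing the shell-session style from above (the URL and the row/column indices are placeholders). One common reason an expression matches in Chrome's Inspector but not in Scrapy is that browsers insert <tbody> elements that are not in the raw HTML, so //table/tr[...] can fail while //table//tr[...] works:
$ scrapy shell http://example.com/page
...
>>> response.xpath("//table").extract()                       # is the table in the raw response at all?
>>> response.xpath("//table//td/text()").extract()            # broad: dump every cell's text and look for yours
>>> response.xpath("//table//tr[13]/td[2]/text()").extract()  # then narrow back down step by step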

Parsing HTML using JavaScript

I'm working on a page that needs to fetch info from some other pages and then display parts of that information/data on the current page.
I have the HTML source code that I need to parse in a string. I'm looking for a library that can help me do this easily. (I just need to extract specific tags and the text they contain)
The HTML is well formed (All closing/ending tags present).
I've looked at some options, but they have all been extremely difficult to work with for various reasons.
I've tried the following solutions:
jkl-parsexml library (The library js file itself throws up HTTPError 101)
jQuery.parseXML Utility (Didn't find much documentation/many examples to figure out what to do)
XPath (the Execute statement is not working, but the JS error console shows no errors)
And so I'm looking for a more user friendly library or anything(tutorials/books/references/documentation) that can let me use the aforementioned tools better, more easily and efficiently.
An ideal solution would be something like BeautifulSoup, which is available in Python.
Using jQuery, it would be as simple as $(HTMLstring); to create a jQuery object with the HTML data from the string inside it (this DOM would be disconnected from your document). From there it's very easy to do whatever you want with it, and traversing the loaded data is, of course, a cinch with jQuery.
You can do something like this:
$("string with html here").find("jquery selector")
$("string with html here") this will create a document fragment and put an html into it (basically, it will parse your HTML). And find will search for elements in that document fragment (and only inside it). At the same time it will not put it in page DOM

How Can I Read A Web Browser Hidden Document Value Using IWebBrowser2 In LabVIEW?

I've searched around the internet for a couple of hours and could not find anything related to what I'm trying to do. I wrote an HTML document that collects data from a user and stores it in a JavaScript array. This array is then joined together and stored as a string in a hidden document element. Originally, I was going to transfer this string to a program I wrote in C#, but now I am using LabVIEW.
In C#, I used two simple lines of code to do what I wanted:
System.Windows.Forms.HtmlElement hidden = webBrowser1.Document.GetElementById("hiddenfield1");
List<latlng> data = formharvest.extract(hidden.GetAttribute("value"));
But now I cannot find a way to access the data that is in this hidden document. I'm using the IWebBrowser2 block to embed my HTML code in my VI. Any help would be greatly appreciated. Thank you for your time!
A solution would be to start a web server in your LabVIEW program and serve your HTML form from it. I suppose it wouldn't be too hard to retrieve the form data then, but I haven't done such a thing myself.
Here's an interesting discussion on this with sample code.
I'm not sure I really understand what you are doing above: in the C#, it looks like you are embedding an HTML-rendering engine in a Windows form (i.e. window).
You can embed .NET in your LabVIEW code, and therefore you should be able to embed the same HTML rendering engine in a LabVIEW VI, but you might consider changing your approach, as CharlesB suggests, to something more traditional, where a server serves some HTML to a web browser, which then sends some data back to the server via HTTP GET or POST.

Javascript redirect to dynamically created HTML

I have a JavaScript routine that dynamically creates an HTML page, complete with its own head and script tags.
If I take the contents of the string and save it to a file, and view the file in a browser, all is well; but if I try document.write(newHTML), it doesn't behave the same. The JavaScript in the header of the dynamic newHTML is quite complicated, and I cannot include it here... But please believe me that it works great if I save it to a file, but not if I try to replace the current page with it using document.write. What possible pitfalls could be contributing to this that I'm not considering? Do I possibly need to delete the existing script tags in the existing header first? Do I need to manually re-call onload?
Again, it works great when the string is saved to, for example, 'sample.html' and browsed to, but if I set var Samp="[REAL HTML HERE]"; and then say document.write(Samp); document.close(); the JavaScript routines do not execute correctly.
Any hints as to what I could be missing?
Is there another/better way to dynamically replace the content of the page, other than document.write?
Could I somehow redirect to the new page despite the fact that it doesn't exist on disk or on a server, but only in a string in memory? I would hate to have to upload the entire file to my server simply to re-download it again to view it.
How can I, using javascript, replace the current content of the current page with entirely new content including complex client-side javascripting, dynamically, and always get exactly the same result as if I saved the string to the server as an html file and redirected to it?
How can I 'redirect' to an HTML file that only exists as a client-side string?
You can do this:
var win = window.open(""); // open a new window and write to it
var html = generate_html();
win.document.write(html);
win.document.close();
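As an aside on question 2 above (redirecting to a page that exists only as a string): in browsers that support Blob and URL.createObjectURL, you can also turn the string into an in-memory URL and navigate to it. This is a sketch of a newer technique, not part of the original answer:
var html = generate_html();                // the same string as above
var blob = new Blob([html], { type: "text/html" });
location.href = URL.createObjectURL(blob); // navigates to the in-memory document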
Maybe the eval() function would help here? It's hard to give an answer without seeing the code.
Never tried this, but I think it should be possible. Some thoughts on what might make it work:
Make sure the document containing your JS is sent with the correct headers / MIME type / doctype.
Serve the JavaScript in a valid way, for example by sending a W3C-valid page containing the script tag.
Maybe then it works. If not, try erasing the current HTML before writing the new one.
Also, it might be helpful to look at how others have accomplished this task. If I remember correctly, the Google page is also essentially a short HTML page with a bunch of JS.

Is there a way to validate the HTML of a page after AJAX operations are performed on it?

I'm writing a web app that inserts and modifies HTML elements via AJAX using jQuery. It works very nicely, but I want to be sure everything is OK under the bonnet. When I inspect the source of the page in IE or Chrome, it shows me the original document markup, not what has changed since my AJAX calls.
I love using the W3C validator to check my markup, as it occasionally reminds me that I've forgotten to close a tag, etc. How can I use this to check the markup of my page after the original source served from the server has been changed via JavaScript?
Thank you.
Use the developer tools in Chrome to explore the DOM: they will show you all the HTML you've added with JavaScript.
You can now copy it and paste it in any validator you want.
Or, instead of inserting the markup with jQuery, log it to the console first; that way the browser won't have had a chance to close tags for you:
console.log(myHTML)
Both previous answers make good points about the fact that the browser will 'fix' some of the HTML you insert into the DOM.
Back to your question: you could add the following to a bookmark in your browser. It will write out the contents of the DOM to a new window; copy and paste that into a validator.
javascript:window.open("").document.open("text/plain", "").write(document.documentElement.outerHTML);
If you're just concerned about well-formedness (missing closing tags and such), you probably just want to check the structure of the chunks AJAX is inserting. (Once it's part of the DOM, it's going to be well-formed... just not necessarily the structure you intended.) The simplest way to do that would probably be to attempt to parse each chunk using an XML library (one with an HTML mode that can be made strict, if you're not using XHTML).
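A minimal sketch of that idea using the browser's built-in DOMParser in XML mode, which, unlike HTML parsing, refuses to auto-repair broken markup (the fragments are just examples):
function isWellFormed(fragment) {
  var doc = new DOMParser().parseFromString(fragment, "application/xml");
  // browsers report XML parse failures as a <parsererror> element in the result
  return doc.getElementsByTagName("parsererror").length === 0;
}
isWellFormed("<div><p>ok</p></div>");        // true
isWellFormed("<div><p>missing close</div>"); // false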
Actual validation (Testing the "You can't put tag X inside tag Y" rules which browsers generally don't care too much about) is a lot trickier and, depending on how much effort you're willing to put into it, may not be worth the trouble. (Because, if you validate them in isolation, you'll get a lot of "This is just a fragment" false positives)
Whichever you decide to use, you need to grab the AJAX responses before the browser parses them if you want a reliable test result. (While they're still just a string of text rather than a DOM tree)
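If you are making the calls with jQuery, one way to get the untouched string is to request the fragment as plain text and only hand it to the DOM after checking it (the URL and target selector are placeholders, and isWellFormed is the checker sketched above):
$.ajax({ url: "/fragment", dataType: "text" }).done(function (raw) {
  if (!isWellFormed(raw)) {
    console.warn("fragment is not well-formed:", raw);
  }
  $("#target").html(raw); // insert only after inspecting the raw string
});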
