Parse HTML retrieved via AJAX in JavaScript

I'm attempting to write some JavaScript code (in particular, a Chrome extension) which does the following:
1) Retrieve some web page's contents via AJAX.
2) Get some content from that page by locating certain elements inside the HTML string and reading their contents.
3) Do a thing with that data.
I have 1) and 3) working, but I'm having some trouble achieving step 2) in a reasonable way.
I currently have 2) implemented via jQuery(htmlString), followed by normal jQuery selectors and so on to extract the data I want. The problem is that jQuery actually adds the retrieved HTML to the current page, loading and executing all external resources and scripts in the process. This is obviously bad.
So I'm looking for a way to get the text and HTML in certain tags inside my HTML string without:
Loading or executing ANY scripts or resources (images, CSS, etc.) referenced in the HTML string.
Trying to remove external resources with regular expressions, since we all know what happens when you parse [X]HTML with regex.
I believe that I can achieve what I want using jsdom and jQuery, since jsdom has a FetchExternalResources option which can be set to false. However, jsdom seems to only work in NodeJS, not in the browser.
Is there any reasonable way to do this?

You could use document.implementation.createHTMLDocument.
This is an experimental technology. Because this technology's specification has not stabilized, check the compatibility table for the proper prefixes to use in various browsers. Also note that the syntax and behavior of an experimental technology is subject to change in future versions of browsers as the spec changes.
Feature | Chrome | Firefox (Gecko) | Internet Explorer | Opera | Safari
Basic support | (Yes) | 4.0 (2.0) [1] | 9.0 | (Yes) | (Yes)
[1] The title parameter became optional only in Firefox 23.
JavaScript
$.ajax("http://www.html5rocks.com/en/tutorials/").done(function (htmlString) {
    // Write the markup into a separate document so it is not added to the current page.
    var doc = document.implementation.createHTMLDocument("");
    doc.write(htmlString);
    console.log(doc.getElementById('siteheader').textContent);
});
On jsFiddle
You can also take a look at DOMParser and XMLHttpRequest
Example using XMLHttpRequest
XMLHttpRequest originally supported only XML parsing. HTML parsing support is a recent addition.
Feature | Chrome | Firefox (Gecko) | Internet Explorer | Opera | Safari (WebKit)
Support | 18 | 11 | 10 | --- | Not supported
JavaScript
var xhr = new XMLHttpRequest();
xhr.onload = function () {
    // With responseType "document", responseXML holds the parsed HTML document.
    console.log(this.responseXML.getElementById('siteheader').textContent);
};
xhr.open("GET", "http://www.html5rocks.com/en/tutorials/");
xhr.responseType = "document";
xhr.send();
On jsFiddle
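DOMParser is mentioned above without an example. As a minimal sketch, assuming a browser whose DOMParser accepts the "text/html" type, the string can be parsed into a separate document and queried there. The helper names parseHtml and textOf are made up for illustration, and the optional ParserCtor argument exists only so the function can be exercised outside a browser.

```javascript
// Parse an HTML string into a separate document object.
// ParserCtor is a hypothetical injection point; in a browser it defaults to DOMParser.
function parseHtml(htmlString, ParserCtor) {
    var Parser = ParserCtor || DOMParser;
    return new Parser().parseFromString(htmlString, "text/html");
}

// Return the text content of the first element matching the selector, or null.
function textOf(doc, selector) {
    var el = doc.querySelector(selector);
    return el ? el.textContent : null;
}
```

Inside the AJAX callback this would be called as textOf(parseHtml(htmlString), '#siteheader').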

Related

Standard way of parsing HTML

I'm working on a project for my university in which I need to parse an HTML string into a document and add the child elements to the existing page. I have to use JavaScript 1.6 with DOM Level 3, and the project must work with Firefox 32.0 and Chrome 37.0.2062.120, which are pretty old browsers. The problem is that I may use only standard methods and properties, so I can't use innerHTML. These are my attempts so far:
I managed to parse HTML using the DOMParser object, but I'm not sure if I'm allowed to use it (I found this document, but it's not clear to me whether DOMParser is standard or not). If it is, it looks like the best option to me.
I also tried to parse html using:
var doc = document.implementation.createHTMLDocument(title);
doc.open();
doc.writeln(html);
doc.close();
The problem with this method is that it doesn't work with the version of Mozilla Firefox I need to use. I also tried using the document inside a dummy iframe pointing to about:blank, but Chrome then prevents me (for security reasons, I believe) from adding event handlers to any of the elements coming from that document, and I need to do that.

How to Disable Javascript in Chrome (-headless) using PHP Webdriver

I am using Chrome headlessly.
I tried setting the --disable-javascript command line argument.
I tried using the experimental options:
$options->setExperimentalOption('prefs', [
    'profile.managed_default_content_settings.javascript' => 2, // this does not work
    // 'profile.default_content_setting_values.javascript' => 2, // this does not work either
]);
$capabilities = DesiredCapabilities::chrome();
$capabilities->setCapability(ChromeOptions::CAPABILITY, $options);
As of this point these two do not work.
How can I disable javascript in Chrome using the Facebook PHP Webdriver ?
Here is a test to check if JavaScript is enabled:
$this->driver->get('https://www.whatismybrowser.com/detect/is-javascript-enabled');
return [
$this->driver->getTitle(),
$this->driver->findElement(WebDriverBy::cssSelector('.detected_result'))->getText()
];
It is simply impossible. Read here http://yizeng.me/2014/01/08/disable-javascript-using-selenium-webdriver/
WARNING: Running without JavaScript is unsupported and will likely break a large portion of the ChromeDriver's functionality. I suspect you will be able to do little more than navigate to a page. This is NOT a supported use case, and we will not be supporting it.
Closing this as WontFix - the ChromeDriver (and every other WebDriver implementation I'm aware of) require JavaScript to function.
It's possible to disable the execution of JavaScript by setting one of these preferences:
"webkit.webprefs.javascript_enabled": false
"profile.content_settings.exceptions.javascript.*.setting": 2
"profile.default_content_setting_values.javascript": 2
"profile.managed_default_content_settings.javascript": 2
But it's currently not supported headlessly since this mode doesn't load the preferences and there's no command switch related to this feature.
Note that disabling JavaScript used to break Selenium, since most commands are atom scripts injected into the page. That is no longer the case: all the commands are able to run. However, I noticed that the returned text doesn't include the text from a <noscript> element (text displayed only when JavaScript is disabled). One workaround is to read the innerText property with either execute_script or get_attribute.
As others have pointed out, it is still NOT possible to disable JavaScript in Chrome when headless.
However, for future reference, this is how you do it (when you are NOT using the --headless argument):
$options = new ChromeOptions();
$options->setExperimentalOption('prefs', [
'profile.managed_default_content_settings.javascript' => 2,
]);
$result = DesiredCapabilities::chrome();
$result->setCapability(
ChromeOptions::CAPABILITY_W3C,
// There is a bug in php-webdriver, so ->toArray() is needed!
$options->toArray()
);
Although regular headless mode in Chrome prevents disabling JavaScript, there's a newer headless mode without restrictions.
The Chromium developers recently added a 2nd headless mode that functions the same way as normal Chrome.
The NEW way: --headless=chrome
The OLD way: --headless
There's more info on that here: https://bugs.chromium.org/p/chromium/issues/detail?id=706008#c36
This means that you can now disable JavaScript when using the newer headless mode.
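For reference, here is a rough sketch of the capabilities payload that options like the ones above roughly correspond to, written as plain JavaScript rather than php-webdriver code. The function name chromeCapabilities is hypothetical; the --headless=chrome flag and the preference key are taken from the answers above, and whether a given Chrome build honors them is not verified here.

```javascript
// Hypothetical helper: builds a W3C-style capabilities object for ChromeDriver.
function chromeCapabilities(disableJavaScript) {
    var prefs = {};
    if (disableJavaScript) {
        // The value 2 is the one used for this preference in the snippets above.
        prefs["profile.managed_default_content_settings.javascript"] = 2;
    }
    return {
        browserName: "chrome",
        "goog:chromeOptions": {
            // The newer headless mode described above.
            args: ["--headless=chrome"],
            prefs: prefs
        }
    };
}
```

Serializing the result with JSON.stringify shows the structure a WebDriver client would send when creating the session.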

How do you run an xPath query in IE11?

At one point in our system we use JavaScript to read in a chunk of XML and then query that XML document using XPath.
Prior to IE 11, IE supported xmldoc.selectSingleNode("//xpath/string") and the non-IE browsers supported xmldoc.evaluate("//xpath/string"). Both returned a similar object that we could then go on to interpret to extract the data required.
In IE11 neither of these methods seems to be available.
It seems that IE11 has some support for XML documents: when I read in the XML using the parseFromString method of the DOMParser object, it returns an object that the IE11 debugger calls an XMLDocument.
Thanks to @Martin Honnen for pointing out that some ActiveXObjects are still supported in IE11!
var doc;
try {
    doc = new ActiveXObject('Microsoft.XMLDOM');
    doc.loadXML(stringVarWithXml);
    var node = doc.selectSingleNode('//foo');
} catch (e) {
    // deal with the case that ActiveXObject is not supported
}
I've used "Microsoft.XMLDOM" because it is suggested here that it is a more generic call to whatever XML parser is present on the system, whereas it sounds like "Msxml2.DOMDocument.6.0" will fail if that exact version is not present. (We have to support all IE versions back to 6.0 at my place!)
This just works as it always has. The only problem I had was that the old check I used to tell IE from other browsers, if (typeof ActiveXObject !== "undefined"), now fails, I guess because they are trying to discourage its use!
Thanks all for your help.
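A sketch of how the two code paths could be combined. It assumes that the in operator still finds ActiveXObject on window in IE11 even though the typeof check fails, which is commonly reported but worth confirming on your own setup. The function name selectFirstNode and the env parameter (a stand-in for window) are hypothetical.

```javascript
// Hypothetical helper: returns the first node matching xpath in xmlString.
// env stands in for window so the function can be exercised outside a browser.
function selectFirstNode(xmlString, xpath, env) {
    env = env || window;
    // Assumption: this check succeeds in IE11 where typeof ActiveXObject reports "undefined".
    if ("ActiveXObject" in env) {
        var msDoc = new env.ActiveXObject("Microsoft.XMLDOM");
        msDoc.loadXML(xmlString);
        return msDoc.selectSingleNode(xpath);
    }
    var doc = new env.DOMParser().parseFromString(xmlString, "application/xml");
    // 9 is the numeric value of XPathResult.FIRST_ORDERED_NODE_TYPE.
    return doc.evaluate(xpath, doc, null, 9, null).singleNodeValue;
}
```

In the page itself the third argument can be omitted, for example selectFirstNode(stringVarWithXml, '//foo').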
To expand on pixelmatt's answer, some results of my tests (Win 7 64bit with IE11) I did in order to get DOMParser to work as it did in IE9 and IE10 (in IE11 it now returns an XMLDocument object which appears to not support xpath queries?).
Turns out I could make it behave like in IE10 with the following meta tag:
<meta http-equiv="X-UA-Compatible" content="IE=10" />
Results without and with above meta:
And here are the XMLDocument's members (for reference):

Is it possible to access local file via javascript?

if (window.ActiveXObject) {
    try {
        var fso = new ActiveXObject("Scripting.FileSystemObject");
        fso.CopyFile("C:\\Program Files\\GM4IE\\scripts\\source.txt","C:\\Program Files\\GM4IE\\scripts\\target.txt", 1);
        fso = null;
    }
    catch (e) {
        alert (e.message);
    }
}
I am getting error :
"Automation server can not create object" on the line where I am creating ActiveXObject instance.
I understand that it's considered very bad to access hard-drive data using javascript but I just need it.
I am using IE8 , Greasemonkey4IE to run my javascript.
Thank you,
Mohit
******************************
function WriteFile()
{
var fso = new ActiveXObject("Scripting.FileSystemObject");
fso.CopyFile("C:\\source.txt","C:\\target.txt", 1);
}
I've put the above code inside a simple HTML page and it worked perfectly.
http://www.c-point.com/JavaScript/articles/file_access_with_JavaScript.htm
You can find similar code at the location mentioned above.
I modified it a bit, though.
But when I try to run it through GreaseMonkey4IE, it simply spits out the same error I specified earlier.
I did it guys, but thanks a lot for your quick and helpful replies.
All I did is :
Go to Tools > Internet options > Security > Custom Level
Under the ActiveX controls and plug-ins, select Enable for Initializing and Script ActiveX controls not marked as safe.
Using native JavaScript, no, it is not generally possible to access a local file. Using plugins and extensions like ActiveX, Flash, or Java you can get around this rule, generally with some difficulty.
For some browser and OS specific exceptions to this general rule, you might want to have a look here:
Local file access with javascript
Note that as of late 2012, the FileReader API has been supported in all major browsers and provides a native JavaScript mechanism for accessing local files that the user nominates (via an input element or by dropping them into the browser).
This still cannot be used to access an arbitrary file by name/path as in the examples in the original question.
The HTML5 File API has multiple ways to access local files.
window.requestFileSystem allows you to request access to the filesystem. Browser support is very poor on this (Chrome only).
FileReader is the HTML5 FileReader API that allows you to programmatically read files that users select through an <input type='file' />. Browser support is better for this.
You should use fallbacks like Flash and POST to a server for full file access.
Generally, reading arbitrary files is considered "cheating the browser", so you'll have to use either secure HTML5, ActiveX, or Flash. All three of those require user permission.
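To illustrate the FileReader route, here is a minimal sketch that reads a file the user has picked and hands its text to a callback. The function name readSelectedFile is hypothetical, and the ReaderCtor argument is only an injection point so the logic can be exercised outside a browser; in a page it defaults to FileReader.

```javascript
// Read a user-selected file as text and pass the result to a callback.
// ReaderCtor is a hypothetical injection point; in a browser it defaults to FileReader.
function readSelectedFile(file, onDone, ReaderCtor) {
    var Reader = ReaderCtor || FileReader;
    var reader = new Reader();
    reader.onload = function () {
        onDone(null, reader.result);
    };
    reader.onerror = function () {
        onDone(reader.error, null);
    };
    reader.readAsText(file);
}
```

It would typically be called from the change handler of a file input, passing input.files[0] as the file. It still cannot open a file by path, which matches the restriction described above.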
After some research I have found:
var fso = new ActiveXObject("Scripting.FileSystemObject");
// This line creates an XML file on the local disk, on the C: drive (true = overwrite an existing file)
var fh = fso.CreateTextFile("C:\\fileName.xml", true);
fh.WriteLine("this is going to be written in fileName.xml");
fh.Close();
This is how we can do it. There are other methods as well.
Accessing the local file system is a very bad thing to do, but yes, we can do it.
Automation server can not create object
To get rid of this error go to Tools → Internet Options → Security → select Internet icon → click Custom level → select Enable for Initialize and script ActiveX controls not marked as safe for scripting.
I have not tested this in any browser other than IE8, but I am sure it will work.

Alternative methods for creating dynamic JavaScript?

Background
I am working on a project that runs in an embedded web browser in a small device with limited resources. The browser itself is a bit dated and has limits to its capabilities (HTML 4.01†, W3C DOM Level 2†, JavaScript 1.4). I have no documentation on the browser, so what I do know comes from trial and error.
The point is to retrieve dynamic content from a server so that only a minimal amount of inflexible code needs to be embedded into the device running the web browser. The browser does not support the XMLHttpRequest object, so AJAX is out. Working with what I do have, I wrote a bit of test code to dynamically insert JavaScript.
† Minor portions of these standards not supported
EDIT
While I cannot actually confirm it, I believe that this site may list the DOM support for the embedded browser because I see "Mozilla/4.0 (compatible; EBSWebC 2.6; Windows NT 5.1)" as the user agent in the server log.
<html>
<head>
</head>
<body onload="init()">
<div id="root"></div>
<script type="text/javascript">
<!--
function init() {
// Add a div element to the page.
var div = document.createElement("div");
div.id = "testDiv";
document.getElementById("root").appendChild(div);
// Set a timeout to insert the JavaScript after 2 seconds.
setTimeout("dynamicJS()", 2000);
}
function dynamicJS() {
...
}
//-->
</script>
</body>
</html>
Method 1
I initially implemented the dynamicJS function using Method 1 and found that while the code executes as expected in Chrome, IE8, and Firefox 3.5, the JavaScript is not actually retrieved by the embedded browser when the element is appended.
function dynamicJS() {
var js = document.createElement("script");
js.type = "text/javascript";
js.src = "js/test.js";
document.getElementById("root").appendChild(js);
}
Method 2
Looking for a workaround, I implemented Method 2. This method actually works in the embedded browser, as the JavaScript is retrieved and executed, but it does not work in the other modern web browsers I tested against (Chrome, IE8, Firefox 3.5).
function dynamicJS() {
var js= '<script type="text/javascript" src="js/test.js"> </s' + 'cript>';
document.getElementById("testDiv").innerHTML = js;
}
Question
I'm new to JavaScript and web programming in general, so I'm hoping one (or more) of the experts here can shed some light on this for me.
Is there anything technically wrong with Method 2 and if not, why doesn't it work in modern web browsers?
There is nothing technically wrong with method 2 but most modern browsers have very loose HTML parsers that tend to get caught up in the code that you're sending. Specifically they parse the </script> in your JavaScript string literal as an end tag. This manifests itself in two ways:
You'll see an "Unterminated String Literal" error.
All code after the </script> text will be rendered as text on the page.
A common workaround for this problem is to split the </script>. You can do this with the following code. Yes, I know it's a hack, but it works around the problem.
function dynamicJS() {
var js= '<script type="text/javascript" src="js/test.js"></s' + 'cript>';
document.getElementById("testDiv").innerHTML = js;
}
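Another common way to keep the literal closing tag out of inline script source is to escape the slash. A small sketch, with scriptTag as a hypothetical helper name: the backslash is dropped when the string literal is evaluated, so the resulting markup is unchanged, but the HTML parser never sees the closing-tag sequence in the source.

```javascript
// Build the markup for an external script tag.
// The escaped slash keeps the closing tag from ending an inline script block early;
// the resulting string is the same as with an unescaped slash.
function scriptTag(src) {
    return '<script type="text/javascript" src="' + src + '"><\/script>';
}
```

With this helper, Method 2 becomes document.getElementById("testDiv").innerHTML = scriptTag("js/test.js").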
Realistically though, you should be able to use your first approach strictly using the DOM APIs. I've found that some browsers can be really picky about loading scripts added by script in that they will only load them if they are placed as a child of the <head> element. This is how the YUILoader works, so I'd be surprised if it didn't work in all browsers.
Here's an example. You'll want to check that it works in all browsers, and add some error checking around the assumption that there will be a <head> element, but it gives you the general idea.
if (!document.getElementsByTagName) {
document.getElementsByTagName = function(name) {
var nodes = [];
var queue = [document.documentElement];
while (queue.length > 0) {
var node = queue.shift();
if (node.tagName && node.tagName.toLowerCase() === name) {
nodes.push(node);
}
if (node.childNodes && node.childNodes.length > 0) {
for (var i=0; i<node.childNodes.length; i++) {
if (node.childNodes[i].nodeType === 1 /* element */) {
queue.push(node.childNodes[i]);
}
}
}
}
return nodes;
};
}
function dynamicJS() {
var js = document.createElement("script");
js.setAttribute('type', 'text/javascript');
js.setAttribute('src', 'js/test.js');
var head = document.getElementsByTagName('head')[0];
head.appendChild(js);
}
The innerHTML property has not yet actually been standardized, though all modern browsers support it, and the draft standard of HTML5 includes a definition of how it should work. According to the HTML5 specification:
When inserted using the document.write() method, script elements execute (typically synchronously), but when inserted using innerHTML and outerHTML attributes, they do not execute at all.
innerHTML was first introduced in Microsoft Internet Explorer 4, and due to its popularity among authors, has been adopted by all of the other browsers, which is what led to its inclusion in HTML5. So, let's check Microsoft's documentation:
When using innerHTML to insert script, you must include the DEFER attribute in the script element.
So apparently, in IE you can get scripts inserted via innerHTML to execute, but only if you add a defer attribute (I do not have IE in front of me to test this). defer is another feature that was first added to IE; it was included in HTML 4.01, but not picked up by any of the other browsers for quite a while. HTML5 includes a much more detailed description of how <script defer> should work, though it appears to be slightly incompatible with how it works in IE, as it does not allow execution of scripts added via innerHTML. The HTML5 definition of <script defer> appears to be implemented in Firefox 3.5 and Safari 4.
In summary, innerHTML hasn't really been standardized yet, but instead simply implemented by all of the browser vendors in slightly different ways. In IE, the original implementation, it didn't support execution of scripts except with a defer attribute, and defer hasn't been supported in other browsers until just recently, and so the other browsers simply don't support execution of scripts added using innerHTML. This behavior is what HTML5 is standardizing on, so unless Microsoft objects, is probably going to be what goes into the standard.
It sounds like the browser you are working with didn't do as good a job of implementing a compatible innerHTML, as it executes scripts added using innerHTML no matter what. This is unsurprising, as the behavior isn't standardized and so needs to be either reverse engineered or gleaned from reading the documentation of other browsers (which may not have included this fact in the past). One of the main goals of HTML5 is to actually write down all of these unwritten assumptions and undocumented behaviors, so that in the future, someone implementing a browser can do so without being misled by a spec that doesn't match reality, or without having to go to the effort of reverse engineering the existing browsers.
It looks to me that you may have to use Method 2 on your embedded browser, and Method 1 if you want to run on the common desktop browsers. It would probably be a good idea to try Method 1 first, and fall back to Method 2 if that does not work, and then error out (or silently fail, depending on your needs) if neither one works.
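The try-then-fall-back idea could be sketched roughly as follows. This is only an illustration under stated assumptions: isLoaded is a hypothetical check (for example, for a global flag that js/test.js would have to set when it runs), schedule stands in for setTimeout, and onFail is a hypothetical error callback. Whether the embedded browser accepts a function argument to setTimeout is untested; the original page passes a string.

```javascript
// Try Method 1 (DOM-created script element), then fall back to Method 2 (innerHTML).
// isLoaded, schedule and onFail are hypothetical parameters, described above.
function loadWithFallback(doc, isLoaded, schedule, onFail) {
    var js = doc.createElement("script");
    js.type = "text/javascript";
    js.src = "js/test.js";
    doc.getElementById("root").appendChild(js);

    schedule(function () {
        if (isLoaded()) {
            return; // Method 1 worked
        }
        doc.getElementById("testDiv").innerHTML =
            '<script type="text/javascript" src="js/test.js"><\/script>';
        schedule(function () {
            if (!isLoaded()) {
                onFail(); // neither method loaded the script
            }
        }, 2000);
    }, 2000);
}
```

In the page, schedule could be a thin wrapper such as function (fn, ms) { setTimeout(fn, ms); }, and the two-second delays mirror the timeout already used in init().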
A long shot but does the embedded browser support iframes?
And if so would you be able to use that to load in whatever additional JS you needed and access it via the iframe?
