We have data as XML, with multiple XSL styles for formatting. It was working fine in IE. Then we needed to display the same content as HTML in Chrome, so we wrote a server-side (Java) API to transform XML+XSL to HTML.
public static String convertXMLXSL(String xml, String xsl) throws SQLException {
    System.setProperty("javax.xml.transform.TransformerFactory", "org.apache.xalan.processor.TransformerFactoryImpl");
    TransformerFactory tFactory = TransformerFactory.newInstance();
    String html = "";
    try {
        try {
            StreamResult result = new StreamResult(new StringWriter());
            Transformer trans = tFactory.newTransformer(new StreamSource(new ByteArrayInputStream(xsl.getBytes("utf-8"))));
            trans.transform(new StreamSource(new ByteArrayInputStream(xml.getBytes("utf-8"))), result);
            html = result.getWriter().toString();
        } catch (TransformerException te) {
            te.printStackTrace();
        }
    } catch (Exception e) {
        AppendExceptionToLog(e);
    }
    return html;
}
But now, after some time, we see thread dumps that are blocked in the transform method of javax.xml.transform.Transformer:
Sep 12, 2017 12:07:49 PM org.apache.catalina.valves.StuckThreadDetectionValve notifyStuckThreadDetected
WARNING: Thread "http-8080-12" (id=15800) has been active for 6,516 milliseconds (since 9/12/17 12:07 PM) to serve the same request for
and may be stuck (configured threshold for this StuckThreadDetectionValve is 5 seconds).
There is/are 3 thread(s) in total that are monitored by this Valve and may be stuck.
java.lang.Throwable
at org.apache.xpath.axes.AxesWalker.getNextNode(AxesWalker.java:333)
at org.apache.xpath.axes.AxesWalker.nextNode(AxesWalker.java:361)
at org.apache.xpath.axes.WalkingIterator.nextNode(WalkingIterator.java:192)
at org.apache.xpath.axes.NodeSequence.nextNode(NodeSequence.java:281)
at org.apache.xpath.axes.NodeSequence.runTo(NodeSequence.java:435)
at org.apache.xpath.axes.NodeSequence.setRoot(NodeSequence.java:218)
at org.apache.xpath.axes.LocPathIterator.execute(LocPathIterator.java:210)
at org.apache.xpath.XPath.execute(XPath.java:335)
at org.apache.xalan.templates.ElemVariable.getValue(ElemVariable.java:278)
at org.apache.xalan.templates.ElemVariable.execute(ElemVariable.java:246)
at org.apache.xalan.transformer.TransformerImpl.executeChildTemplates(TransformerImpl.java:2411)
at org.apache.xalan.templates.ElemLiteralResult.execute(ElemLiteralResult.java:1374)
at org.apache.xalan.transformer.TransformerImpl.executeChildTemplates(TransformerImpl.java:2411)
at org.apache.xalan.transformer.TransformerImpl.applyTemplateToNode(TransformerImpl.java:2281)
at org.apache.xalan.transformer.TransformerImpl.transformNode(TransformerImpl.java:1367)
at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:709)
at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:1284)
at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:1262)
at Util.processXMLXSL(Util.java:3364)
Here, I wanted to know:
1) Is there any other known server-side implementation that does the same thing?
2) Should I consider a client-side approach using Mozilla's XSLTProcessor?
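For question 1: whichever processor you pick (Xalan, or an alternative such as Saxon), the usual server-side pattern is to compile each stylesheet once into a javax.xml.transform.Templates object, which is documented as thread-safe, and create a fresh Transformer per request; a Transformer instance itself is not thread-safe, and sharing one across request threads can plausibly produce hangs like the one in the dump above. A minimal sketch (the class name is mine):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XslCache {
    // Templates is thread-safe and can be shared; Transformer is NOT.
    private final Templates templates;

    public XslCache(String xsl) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        templates = factory.newTemplates(new StreamSource(new StringReader(xsl)));
    }

    // Each call gets its own lightweight Transformer from the compiled stylesheet.
    public String transform(String xml) throws Exception {
        Transformer transformer = templates.newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xsl = "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
                + "<xsl:output method='html' omit-xml-declaration='yes'/>"
                + "<xsl:template match='/root'><p><xsl:value-of select='msg'/></p></xsl:template>"
                + "</xsl:stylesheet>";
        XslCache cache = new XslCache(xsl);
        System.out.println(cache.transform("<root><msg>hello</msg></root>"));
    }
}
```

The compile-once step also avoids re-parsing the stylesheet on every request, which is a large cost in itself.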
I'm trying to scrape the following site.
I'm able to receive a response, but I don't know how I can access the inner data of the items below in order to scrape them.
I noticed that accessing the items, and also the pagination, is actually handled by JavaScript.
What should I do in such a case?
Below is my code:
import scrapy
from scrapy_splash import SplashRequest

class NmpaSpider(scrapy.Spider):
    name = 'nmpa'
    http_user = 'hidden'  # as I am using Cloud Splash
    allowed_domains = ['nmpa.gov.cn']

    def start_requests(self):
        yield SplashRequest(
            'http://app1.nmpa.gov.cn/data_nmpa/face3/base.jsp?tableId=27&tableName=TABLE27&title=%E8%BF%9B%E5%8F%A3%E5%8C%BB%E7%96%97%E5%99%A8%E6%A2%B0%E4%BA%A7%E5%93%81%EF%BC%88%E6%B3%A8%E5%86%8C&bcId=152904442584853439006654836900',
            args={'wait': 5}
        )

    def parse(self, response):
        goal = response.xpath("//*[@id='content']//a/@href").getall()
        print(goal)
If you set some breakpoints you'll see it's a frustrating job; let me explain what I understood from my research.
When you are working with this kind of situation, you have two ways:
1 - [Easy way] Use Selenium: open a browser, click on each link and get the returned contents easily. You can run multiple browsers and fetch link contents simultaneously.
2 - [Hard way] Simulate what the website does (by writing similar functions in Python) and do exactly what the site does in JS, but at the end, instead of showing the results, just save them in a variable and use them the way you want.
Now, if you choose the hard way, this is what I found:
The link's JS looks like this:
commitForECMA(callbackC,'content.jsp?tableId=26&tableName=TABLE26&tableView=国产医疗器械产品(注册&Id=138150',null)
It calls a function named commitForECMA, takes what this function returns and passes it to the callbackC function.
That much was obvious, but it's important to know what these functions do and how to replicate them.
commitForECMA:
this is the function:
function commitForECMA($_8, $_10, $_12) {
    request = createXMLHttp();
    request.onreadystatechange = $_8;
    if ($_12 == null) {
        _$du(request, _$Fe('uM6r2MG'), _$Fe("jp0YV"), $_10);
        request.setRequestHeader(_$Fe("XACeXwDYXwcTV8Ur2"), _$Fe("YwDYgwceLwDT7iCYX3Ce9FKyvHKwPFa"));
    } else {
        var $_9 = "";
        var $_19 = $_12.elements;
        var $_0 = $_19.length;
        for (var $_18 = 0; $_18 < $_0; $_18++) {
            var $_14 = _$3P($_19, $_18);
            if ($_14.type != _$Fe("yQ6YPMK20") && _$3P($_14, _$Fe('uwbm7wKV')) != "") {
                if ($_9.length > 0) {
                    $_9 += "&" + $_14.name + "=" + _$3P($_14, _$Fe('kwbm7wKV'));
                } else {
                    $_9 += $_14.name + "=" + _$3P($_14, _$Fe('Ewbm7wKV'));
                }
                $_9 += _$Fe("Jx2J03Up2Hsl");
            }
        }
        _$du(request, _$Fe('uM6r2MG'), _$Fe("HVlesYq"), $_10);
        $_9 = encodeURI($_9);
        $_9 = encodeURI($_9);
        request.setRequestHeader(_$Fe("d3CmOFDVz3CeXwoxBMq"), _$Fe("FMbZz3CmOFDV"));
        request.setRequestHeader(_$Fe("yACeXwDYXwcTV8Ur2"), _$Fe("g3UraMD2O3UpNMCgB8cT6w6QzRbenM1TTQbS2MbJBRDY9"));
    }
    request.send($_9);
    if ($_12 != null) {
        $_12.reset();
    }
}
Yes, as you can see, it just creates an XMLHttpRequest which (for the links in question) POSTs the $_10 content to the server and gets the results in the callbackC function, which is now in $_8. But the trick here is that the $_10 content goes through ~13,000 lines of code to create links like this:
http://app1.nmpa.gov.cn/data_nmpa/face3/content.jsp?6SQk6G2z=GBK-56.it.xmhx8IaDT25ZyaSxljrwULe8AkNw8QjmeNqdT0YqZYbMZ2P6Jgn3ZUIgh3ibPI81bjA6xUCKJmzy1LD.4AZnk4g4G_iMO4tdiebiVDoPPtdVDIkDWw0OnDHek.d_2r.PfBtuIoxDvrbGDL.Lv2AuD6lxiObz_lldDHq6HnEw_irAP1hCH.Dr3KdW33DN2w0X1R75N3f8GXdHinmxXLtYbZNYZEE9K7lk9AGmBWgcTds.XgGVW3gDS5OEwoRat44Ecke8k7ZXoY_2revEbUrD8UpOrGprlPEwVYuAvLoTSZX8WJEWQ_QT2CDjNw0FOwAECzsFJa4hGgUtjCPzG&c1SoYK0a=GBK-4aeKAo74EouxLY.stFwdwvXQQG_hXMGG8gB0Hhe6V2Il9k9c8yiTLqduIXpv2RNt.H.weYXeF5XhV0CR2lATieRmk.cs8.fPhNpfGx7JkG1uacp75kDcmXsNtuKgbzRUHZh8vkj4UEYbPcwIYIOw5gFG_cMi9n1GYq0AXXK9UQn9IsmjCBuI7AOFw.pk91OgjvkJCcg2y0y3yDkGwZPcg5EktfAXi.PjmfaecWg8hodU87q6B3ZuPxhel9K9I3EDBxzCHtZqt_0YFlkJCcK4hLq
The problem is the obfuscation, plus the nested variables and functions that can keep you off track for hours if you try to debug it line by line (which I did). The code builds the characters after the content.jsp? part one by one, and that explains why it's about 13,000 lines!
Also, request.send($_9); should carry a body, because it's a POST request, yet $_9 was always null! It seems there are more protection layers to it.
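One detail from commitForECMA that matters if you replicate it in Python: the body is passed through encodeURI twice ($_9 = encodeURI($_9); appears back to back), so the second pass escapes the percent signs produced by the first, and a reimplementation presumably has to double-encode (and the server double-decode) the same way. A quick sketch of the effect:

```javascript
// Double-encoding as in commitForECMA: the second encodeURI escapes
// the '%' characters produced by the first pass.
const once = encodeURI("国");   // "%E5%9B%BD"
const twice = encodeURI(once);  // "%25E5%259B%25BD"

// Decoding twice recovers the original value.
console.log(decodeURIComponent(decodeURIComponent(twice)) === "国"); // true
```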
callBackC:
Well, callbackC is apparently a simple function that gets responseText and shows it to the user:
function callbackC() {
    if (request.readyState == 1) {
        _$c2(document.getElementById(_$Fe("Y3CeXwDYXwq")), '=', _$Fe('vFKyXRUxEYlTW'), _$Fe("kHDxnHOaB3vE5HDxnHOSNMKQGQ6xOHK2z3Kw2Qne7MCm9FKyvhbwNROg"));
    }
    if (request.readyState == 4) {
        if (request.status == 200) {
            oldContent[oldContent.length] = request.responseText;
            _$c2(document.getElementById(_$Fe("b3CeXwDYXwq")), '=', _$Fe('EFKyXRUxEYlTW'), request.responseText);
            request = null;
        } else {
            _$c2(document.getElementById(_$Fe("u3CeXwDYXwq")), '=', _$Fe('BFKyXRUxEYlTW'), "<br><br><br><span style=font-size:x-large;color:#215add>服务器未返回数据</span>");
        }
    }
}
I didn't quite get what those _$Xx functions do (they go so deep that it's beyond my patience!), but it seems they simply replaced document.getElementById("someThing").innerText="Contents"; with multi-layered functions so we can't understand the code easily. The request.responseText is what you need: it is the HTML code for the table of results.
There is also a third way, which I don't know if you can implement in your code: since these functions are in public scope, you can simply override them by redefining the two functions (or replace the functions in the link with your own and run them). I tried to get the URL for the request, which gave me the link I used in the middle of this post, but it didn't work (I just overrode the callbackC function and read request.responseURL), and the link gave me a 404 error.
I don't think I've covered everything I got from my observations, but I think it's enough for you to know what you are up against, if you aren't already aware. I hope I was helpful.
Reference:
XMLHttpRequest: Living Standard — Last Updated 16 August 2021
I am migrating my current Windows Server 2008 R2 server to Azure, moving a web application to an Azure Windows Server 2008 R2 instance.
Currently, I am facing an issue where it shows:
"Message":"String was not recognized as a valid DateTime.","StackTrace":" at System.DateTimeParse.Parse(String s, DateTimeFormatInfo dtfi, DateTimeStyles styles)\r\n at System.Convert.ToDateTime(String value)\r\n at...
Purpose of the code: it's the jqGrid library. If the code runs successfully, I proceed to update a table. The code runs when the user clicks the update button, before the table is updated, as validation of the date.
The weird part is: my on-premise server runs this code smoothly, and all data on the Azure and on-premise servers is the same.
NEWLY ADDED: When I edit some rows (so far only 1 row out of 100), it works.
Working row details:
Not working row details:
JQuery code snippet:
closeAfterEdit: true,
closeOnEscape: true,
reloadAfterSubmit: true,
url: '/SFI/WebService/StaffMaster.asmx/CheckEditStaff_AssignedRoster',
ajaxEditOptions: { contentType: 'application/json; charset=utf-8' },
mtype: 'post',
datatype: 'json',
serializeEditData: function (postData) {
    var PrivilegeID = $('#hdnMAPrivilegeID').val();
    eStaffID = $("#StaffID").val();
    eStaffNo = $("#StaffNo").val(),
    eNewEndDate = $("#EffectiveEnd").val();
    eStaffName = $("#StaffName").val(),
    eIdentificationNo = $("#IdentificationNo").val(),
    eDOB = $("#DOB").val(),
    eEffectiveStart = $("#EffectiveStart").val(),
    eEffectiveEnd = $("#EffectiveEnd").val(),
    eGradeCode = $("#GradeDetails").val(),
    eStaffType = $("#StaffType").val(),
    eOrgUnit = $("#OrgUnit").val(),
    eEmail = $("#Email").val().toLowerCase()
    return JSON.stringify(
        {
            StaffID: $("#StaffID").val(),
            NewEndDate: $("#EffectiveEnd").val(),
            OldEndDate: StaffOldEndDte
        });
.
.
.
StaffOldEndDte = $("#EffectiveEnd").val();
Web Service Call in C#:
public string CheckEditStaff_AssignedRoster(string StaffID, string NewEndDate, string OldEndDate)
{
    string status = "0";
    bool Changed = false;
    DateTime dtnew;
    DateTime dtOld;
    dtnew = Convert.ToDateTime(NewEndDate);
    dtOld = Convert.ToDateTime(OldEndDate);
    if ((dtOld != dtnew) && (dtnew < dtOld))
    {
        Changed = true;
    }
    else
    {
        status = "1";
    }
    if (Changed)
    {
        if (some condition...)
        {
            .
            .
            //do something...
        }
        else
        {
            status = "1";
        }
    }
    return status;
}
As mentioned in the comment before, different cultures might be the problem. Using InvariantCulture in your code might help. More info here: https://learn.microsoft.com/en-us/dotnet/api/system.globalization.cultureinfo.invariantculture?view=netframework-4.8
Here's your problem: Convert.ToDateTime.
Here's the source code of this method:
public static DateTime ToDateTime(String value) {
    if (value == null)
        return new DateTime(0);
    return DateTime.Parse(value, CultureInfo.CurrentCulture);
}
As you can see, it uses the current culture to parse the string.
What you should be using is DateTime.TryParseExact or at least DateTime.ParseExact and give it the exact format and correct culture info you're attempting to parse to DateTime.
Change your lines:
dtnew = Convert.ToDateTime(NewEndDate);
dtOld = Convert.ToDateTime(OldEndDate);
To
dtnew = DateTime.ParseExact(NewEndDate, "yyyy-MM-ddTHH:mm:ssZ", CultureInfo.InvariantCulture); //or however your JS is formatting the date as string
dtOld = DateTime.ParseExact(OldEndDate, "yyyy-MM-ddTHH:mm:ssZ", CultureInfo.InvariantCulture); //or however your JS is formatting the date as string
If you have a Date in JS and you use toJSON it should return a string in "yyyy-MM-ddTHH:mm:ssZ" - see The "right" JSON date format for more discussion
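One caveat on that format string: Date.prototype.toJSON (via toISOString) always emits UTC with milliseconds, so depending on how the page actually serializes the date, the exact-format string may need to be "yyyy-MM-ddTHH:mm:ss.fffZ". A quick check of what toJSON really produces:

```javascript
// toJSON emits an ISO 8601 UTC string with milliseconds,
// independent of the machine's region and language settings.
const d = new Date(Date.UTC(2017, 8, 12, 12, 7, 49)); // months are 0-based: 8 = September
console.log(d.toJSON()); // "2017-09-12T12:07:49.000Z"
```

If the grid sends a different shape (e.g. a locale-formatted date), DateTime.TryParseExact with that exact shape is the safer call.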
After learning about the different DateTime culture formats from the question's comments, I searched again for how to make the new Azure server behave the same as my on-premise server (because I had already developed working code on the on-premise server). I found that I can just change the 'Region and Language' setting on Azure to match the on-premise server. Below is the link I used.
Here is the link: From where CultureInfo.CurrentCulture reads culture
I am trying to write an application in C# using CefSharp. My intention is to fetch all the links on a given page, e.g.:
https://wixlabs---dropbox-folder.appspot.com/index?instance=lp5CbqBbK6JUFzCW2hXENEgT4Jn0Q-U1-lIAgEbjeio.eyJpbnN0YW5jZUlkIjoiYjNiNzk5YjktNjE5MS00ZDM0LTg3ZGQtYjY2MzI1NWEwMDNhIiwiYXBwRGVmSWQiOiIxNDkyNDg2NC01NmQ1LWI5NGItMDYwZi1jZDU3YmQxNmNjMjYiLCJzaWduRGF0ZSI6IjIwMTgtMDEtMjJUMTg6Mzk6MjkuNjAwWiIsInVpZCI6bnVsbCwidmVuZG9yUHJvZHVjdElkIjpudWxsLCJkZW1vTW9kZSI6ZmFsc2V9&target=_top&width=728&compId=comp-j6bjhny1&viewMode=viewer-seo
When I load the page and open the dev tools and execute
document.getElementsByTagName('a');
in the dev tools I get 374 results. Next I execute the following code from BrowserLoadingStateChanged:-
private async Task ProcessLinksAsync()
{
    var frame = browser.GetMainFrame();
    var response = await frame.EvaluateScriptAsync("(function() { return document.getElementsByTagName('a'); })();", null);
    ExpandoObject result = response.Result as ExpandoObject;
    Console.WriteLine("Result:" + result); // What do I do here?
}
I get an ExpandoObject which seems to contain nothing. I am saying this because I set a breakpoint and inspected the object. I have gone through https://keyholesoftware.com/2019/02/11/create-your-own-web-bots-in-net-with-cefsharp/ , https://github.com/cefsharp/CefSharp/wiki/General-Usage#javascript-integration and the questions on SO, but was unable to solve my problem.
Am I doing something wrong here?
My actual intention is to fetch the links and then navigate to them.
Thanks in advance.
EDIT:
I used the following script; the browser and dev tools both return 187 results, which is correct.
(function() {
    var links = document.getElementsByClassName('file-link');
    var linksArray = new Array();
    for (var i = 0; i < links.length; i++) {
        linksArray[i] = String(links[i].href);
    }
    return linksArray;
})();
But in my application I get a 0 length array.
EDIT-2:
I used the following code to get the DOM:
public void OnContextCreated(IWebBrowser browserControl, IBrowser browser, IFrame frame)
{
    ContextCreated?.Invoke(this, frame);
    const string script = "document.addEventListener('DOMContentLoaded', function(){ alert(document.links.length); });";
    frame.ExecuteJavaScriptAsync(script);
}
The code was successful for every other site I tried, except the URL mentioned above. Could anyone tell me what could possibly be wrong? The DOM is loaded in the dev tools and fully accessible, so I guess something might be missing in my code.
Thanks again.
You need to wait for the page to load. Also, if the page loads data using AJAX, you need to wait a bit for that data to load as well. Then you need to shape the result into a custom JavaScript object.
ChromiumWebBrowser browser;

protected override void OnLoad(EventArgs e)
{
    base.OnLoad(e);
    browser = new ChromiumWebBrowser(
        "https://google.com/"); // Tried with your URL.
    browser.LoadingStateChanged += Browser_LoadingStateChanged;
    browser.Dock = DockStyle.Fill;
    Controls.Add(browser);
}

private async void Browser_LoadingStateChanged(object sender,
    LoadingStateChangedEventArgs e)
{
    if (!e.IsLoading)
    {
        await Task.Delay(5000); // Just for pages which load data using ajax
        var script = @"
            (function () {
                var data = document.getElementsByTagName('a');
                return Array.from(data, a => ({href:a.href, innerText:a.innerText}));
            })();";
        var result = await browser.EvaluateScriptAsync(script);
        var data = (IEnumerable<dynamic>)result.Result;
        MessageBox.Show(string.Join("\n", data.Select(x => $"{x.href}").Distinct()));
    }
}
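The Array.from shaping step in the script above is the crucial part: EvaluateScriptAsync can only marshal JSON-like values (primitives, arrays, plain objects) back to .NET, which is likely why returning the raw HTMLCollection in the original question produced an ExpandoObject with nothing in it. The mapping itself is plain JavaScript and can be sketched in isolation (fakeLinks is a stand-in for document.getElementsByTagName('a')):

```javascript
// Stand-in for the array-like HTMLCollection of anchor elements.
const fakeLinks = [
    { href: "http://example.com/a", innerText: "A" },
    { href: "http://example.com/b", innerText: "B" }
];

// Shape each element into a plain, JSON-serializable object.
const shaped = Array.from(fakeLinks, a => ({ href: a.href, innerText: a.innerText }));
console.log(JSON.stringify(shaped));
```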
Problem:
I am parsing pages generated by JS using HtmlUnit.
I have to wait until all JS are loaded and then parse page.
All these pages share same JS scripts.
There is one problematic script that won't parse.
The problematic script does not affect html rendering.
What I want to do:
I want to detect name of the problematic script.
Put this name on blacklist.
And skip it for further parsing.
This is the code I use for JS loading...
private void waitForJs(WebClient client, HtmlPage page) throws Exception {
    int maxDelay = 1000;
    int attempts = 10;
    int i = client.waitForBackgroundJavaScript(maxDelay);
    while (i > 0 && attempts > 0) {
        i = client.waitForBackgroundJavaScript(maxDelay);
        if (i == 0) {
            break;
        }
        synchronized (page) {
            page.wait(500);
        }
        log("Waiting for JS (" + i + "), attempts: " + attempts, false);
        attempts--;
    }
}
I had to introduce the "attempts" variable in order not to get stuck on loading a damaged script. Instead of this, I want to put all problematic scripts (those remaining in waitForJs) on a blacklist and skip loading them in the future. Is that possible?
The code above has an encoding issue - we have to use the correct charset when getting the bytes from the content string.
WebResponseData data = new WebResponseData(content.getBytes(response.getContentCharset()),
response.getStatusCode(), response.getStatusMessage(), response.getResponseHeaders());
You can modify the content of the JavaScript to be empty string, as hinted here:
new WebConnectionWrapper(webClient) {
    public WebResponse getResponse(WebRequest request) throws IOException {
        WebResponse response = super.getResponse(request);
        if (request.getUrl().toExternalForm().contains("my_url")) {
            String content = response.getContentAsString();
            // change content
            content = "";
            WebResponseData data = new WebResponseData(content.getBytes(),
                    response.getStatusCode(), response.getStatusMessage(), response.getResponseHeaders());
            response = new WebResponse(data, request, response.getLoadTime());
        }
        return response;
    }
};
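To turn that wrapper into the blacklist the question asks for, the hard-coded "my_url" check could consult a set of blocked script names instead; inside getResponse you would call something like isBlocked(request.getUrl().toExternalForm()) and blank the content on a match. A hypothetical sketch (the class and its names are mine, not part of HtmlUnit):

```java
import java.util.Set;

// Hypothetical blacklist helper: holds substrings of script URLs whose
// responses should be replaced with an empty body.
public class ScriptBlacklist {
    private final Set<String> blockedSubstrings;

    public ScriptBlacklist(Set<String> blockedSubstrings) {
        this.blockedSubstrings = blockedSubstrings;
    }

    // True if the requested URL matches any blacklisted script.
    public boolean isBlocked(String url) {
        return blockedSubstrings.stream().anyMatch(url::contains);
    }

    public static void main(String[] args) {
        ScriptBlacklist blacklist = new ScriptBlacklist(Set.of("broken-script.js"));
        System.out.println(blacklist.isBlocked("http://example.com/js/broken-script.js?v=3")); // true
        System.out.println(blacklist.isBlocked("http://example.com/js/fine.js"));              // false
    }
}
```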
Basically, that's the question: how is one supposed to construct a Document object from a string of HTML, dynamically, in JavaScript?
There are two methods defined in specifications, createDocument from DOM Core Level 2 and createHTMLDocument from HTML5. The former creates an XML document (including XHTML), the latter creates a HTML document. Both reside, as functions, on the DOMImplementation interface.
var impl = document.implementation,
xmlDoc = impl.createDocument(namespaceURI, qualifiedNameStr, documentType),
htmlDoc = impl.createHTMLDocument(title);
In reality, these methods are rather young and only implemented in recent browser releases. According to http://quirksmode.org and MDN, the following browsers support createHTMLDocument:
Chrome 4
Opera 10
Firefox 4
Internet Explorer 9
Safari 4
Interestingly enough, you can (kind of) create a HTML document in older versions of Internet Explorer, using ActiveXObject:
var htmlDoc = new ActiveXObject("htmlfile");
The resulting object will be a new document, which can be manipulated just like any other document.
Assuming you are trying to create a fully parsed Document object from a string of markup and a content type you also happen to know (maybe because you got the HTML from an XMLHttpRequest, and thus got the content type in its Content-Type HTTP header; probably usually text/html), it should be this easy:
var doc = (new DOMParser).parseFromString(markup, mime_type);
...in an ideal future world where browser DOMParser implementations are as strong and competent as their document rendering, anyway; maybe that's a good pipe-dream requirement for future HTML6 standards efforts. It turns out no current browsers handle text/html that way, though.
You probably have the easier (but still messy) problem of having a string of HTML you want a fully parsed Document object for. Here is another take on how to do that, which also ought to work in all browsers. First you make an HTML Document object:
var doc = document.implementation.createHTMLDocument('');
and then populate it with your html fragment:
doc.open();
doc.write(html);
doc.close();
Now you should have a fully parsed DOM in doc, which you can run alert(doc.title) on, slice with css selectors like doc.querySelectorAll('p') or ditto XPath using doc.evaluate.
This actually works in modern WebKit browsers like Chrome and Safari (I just tested in Chrome 22 and Safari 6 respectively) – here is an example that takes the current page's source code, recreates it in a new document variable src, reads out its title, overwrites it with a html quoted version of the same source code and shows the result in an iframe: http://codepen.io/johan/full/KLIeE
Sadly, I don't think any other contemporary browsers have quite as solid implementations yet.
Per the spec (doc), one may use the createHTMLDocument method of DOMImplementation, accessible via document.implementation as follows:
var doc = document.implementation.createHTMLDocument('My title');
var body = document.createElementNS('http://www.w3.org/1999/xhtml', 'body');
doc.documentElement.appendChild(body);
// and so on
jsFiddle: http://jsfiddle.net/9Fh7R/
MDN document for DOMImplementation: https://developer.mozilla.org/en/DOM/document.implementation
MDN document for DOMImplementation.createHTMLDocument: https://developer.mozilla.org/En/DOM/DOMImplementation.createHTMLDocument
The following works in most common browsers, but not some. This is how simple it should be (but isn't):
// Fails if UA doesn't support parseFromString for text/html (e.g. IE)
function htmlToDoc(markup) {
    var parser = new DOMParser();
    return parser.parseFromString(markup, "text/html");
}
var htmlString = "<title>foo bar</title><div>a div</div>";
alert(htmlToDoc(htmlString).title);
To account for user agent vagaries, the following may be better (please note attribution):
/*
* DOMParser HTML extension
* 2012-02-02
*
* By Eli Grey, http://eligrey.com
* Public domain.
* NO WARRANTY EXPRESSED OR IMPLIED. USE AT YOUR OWN RISK.
*
* Modified to work with IE 9 by RobG
* 2012-08-29
*
* Notes:
*
* 1. Supplied markup should be a valid HTML document with or without HTML tags and
* no DOCTYPE (DOCTYPE support can be added, I just didn't do it)
*
* 2. Host method used where host supports text/html
*/
/*! @source https://gist.github.com/1129031 */
/*! @source https://developer.mozilla.org/en-US/docs/DOM/DOMParser */
/*global document, DOMParser*/
(function(DOMParser) {
    "use strict";

    var DOMParser_proto;
    var real_parseFromString;
    var textHTML;        // Flag for text/html support
    var textXML;         // Flag for text/xml support
    var htmlElInnerHTML; // Flag for support for setting html element's innerHTML

    // Stop here if DOMParser not defined
    if (!DOMParser) return;

    // Firefox, Opera and IE throw errors on unsupported types
    try {
        // WebKit returns null on unsupported types
        textHTML = !!(new DOMParser).parseFromString('', 'text/html');
    } catch (er) {
        textHTML = false;
    }

    // If text/html supported, don't need to do anything.
    if (textHTML) return;

    // Next try setting innerHTML of a created document
    // IE 9 and lower will throw an error (can't set innerHTML of its HTML element)
    try {
        var doc = document.implementation.createHTMLDocument('');
        doc.documentElement.innerHTML = '<title></title><div></div>';
        htmlElInnerHTML = true;
    } catch (er) {
        htmlElInnerHTML = false;
    }

    // If that failed, try text/xml
    if (!htmlElInnerHTML) {
        try {
            textXML = !!(new DOMParser).parseFromString('', 'text/xml');
        } catch (er) {
            textXML = false;
        }
    }

    // Mess with DOMParser.prototype (less than optimal...) if one of the above worked
    // Assume we can write to the prototype; if not, make this a stand-alone function
    if (DOMParser.prototype && (htmlElInnerHTML || textXML)) {
        DOMParser_proto = DOMParser.prototype;
        real_parseFromString = DOMParser_proto.parseFromString;
        DOMParser_proto.parseFromString = function (markup, type) {
            // Only do this if type is text/html
            if (/^\s*text\/html\s*(?:;|$)/i.test(type)) {
                var doc, doc_el, first_el;
                // Use innerHTML if supported
                if (htmlElInnerHTML) {
                    doc = document.implementation.createHTMLDocument("");
                    doc_el = doc.documentElement;
                    doc_el.innerHTML = markup;
                    first_el = doc_el.firstElementChild;
                // Otherwise use XML method
                } else if (textXML) {
                    // Make sure markup is wrapped in HTML tags
                    // Should probably allow for a DOCTYPE
                    if (!(/^<html.*html>$/i.test(markup))) {
                        markup = '<html>' + markup + '<\/html>';
                    }
                    doc = (new DOMParser).parseFromString(markup, 'text/xml');
                    doc_el = doc.documentElement;
                    first_el = doc_el.firstElementChild;
                }
                // RG: I don't understand the point of this, I'll leave it here though
                // In IE, doc_el is the HTML element and first_el is the HEAD.
                //
                // Is this an entire document or a fragment?
                if (doc_el.childElementCount == 1 && first_el.localName.toLowerCase() == 'html') {
                    doc.replaceChild(first_el, doc_el);
                }
                return doc;
            // If not text/html, send as-is to host method
            } else {
                return real_parseFromString.apply(this, arguments);
            }
        };
    }
}(DOMParser));
// Now some test code
var htmlString = '<html><head><title>foo bar</title></head><body><div>a div</div></body></html>';
var dp = new DOMParser();
var doc = dp.parseFromString(htmlString, 'text/html');
// Treat as an XML document and only use DOM Core methods
alert(doc.documentElement.getElementsByTagName('title')[0].childNodes[0].data);
Don't be put off by the amount of code, there are a lot of comments, it can be shortened quite a bit but becomes less readable.
Oh, and if the markup is valid XML, life is much simpler:
var stringToXMLDoc = (function(global) {
    // W3C DOMParser support
    if (global.DOMParser) {
        return function (text) {
            var parser = new global.DOMParser();
            return parser.parseFromString(text, "application/xml");
        };
    // MS ActiveXObject support
    } else {
        return function (text) {
            var xmlDoc;
            // Can't assume support and can't test, so try..catch
            try {
                xmlDoc = new ActiveXObject("Microsoft.XMLDOM");
                xmlDoc.async = "false";
                xmlDoc.loadXML(text);
            } catch (e) {}
            return xmlDoc;
        };
    }
}(this));

var doc = stringToXMLDoc('<books><book title="foo"/><book title="bar"/><book title="baz"/></books>');
alert(
    doc.getElementsByTagName('book')[2].getAttribute('title')
);
An updated answer for 2014, as DOMParser has evolved. This works in all current browsers I can find, and should also work in earlier versions of IE using ecManaut's document.implementation.createHTMLDocument('') approach above.
Essentially, IE, Opera and Firefox can all parse as "text/html"; Safari parses as "text/xml".
Beware of intolerant XML parsing, though. The Safari parse will break down at non-breaking spaces and other HTML characters (French/German accents) designated with ampersands. Rather than handle each character separately, the code below replaces all ampersands with the meaningless character string "j!J!". This string can subsequently be re-rendered as an ampersand when displaying the results in a browser (simpler, I have found, than trying to handle ampersands in "false" XML parsing).
function parseHTML(sText) {
    try {
        console.log("Domparser: " + typeof window.DOMParser);
        if (typeof window.DOMParser != "undefined") {
            // modern IE, Firefox, Opera parse text/html
            var parser = new DOMParser();
            var doc = parser.parseFromString(sText, "text/html");
            if (doc != null) {
                console.log("parsed as HTML");
                return doc;
            } else {
                // replace ampersands with a harmless character string to avoid XML parsing issues
                sText = sText.replace(/&/gi, "j!J!");
                // Safari parses as text/xml
                doc = parser.parseFromString(sText, "text/xml");
                console.log("parsed as XML");
                return doc;
            }
        } else {
            // older IE
            var doc = document.implementation.createHTMLDocument('');
            doc.write(sText);
            doc.close();
            return doc;
        }
    } catch (err) {
        alert("Error parsing html:\n" + err.message);
    }
}
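The replace-and-restore round trip described above is pure string work, so it can be sanity-checked outside the browser (the "j!J!" token is just the answer's arbitrary choice of a string unlikely to occur in real markup):

```javascript
// Escape ampersands before handing the string to a strict XML parser,
// and restore them when rendering the results.
const escapeAmps = s => s.replace(/&/g, "j!J!");
const restoreAmps = s => s.replace(/j!J!/g, "&");

const html = "<p>Caf&eacute; &amp; bar&nbsp;</p>";
const safe = escapeAmps(html); // no raw '&' left for the XML parser to choke on
console.log(restoreAmps(safe) === html); // true
```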