Regex to match contents of HTML body

EDIT: Sorry, I wasn't clear. I have a string, returned by an AJAX request, that contains an XHTML document, and I need the contents of its body tag, unless I can generate a DOM tree from the string?
I need to get everything inside the body tag of that string, including markup, with a JavaScript regex.
I know that this is a duplicate, but the regexes I found in other questions were for different regex flavours and gave me errors.
Thanks in advance.

document.getElementsByTagName('body')[0].innerHTML will return a string of everything in the body tag. It's not a regex, but I'm not sure why you need one...?
POST QUESTION EDIT:
The XHR object that you performed your AJAX request with has responseText and responseXML properties. As long as the response is valid XML, which it probably should be, you can get any tag you want by calling getElementsByTagName on the responseXML document. But if you just want the inner parts of the body, I would do it this way:
var inner = myXHR.responseText.split(/(<body>|<\/body>)/ig)[2];
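The split above only recognizes a bare <body> tag. As a rough sketch, assuming the body tag may carry attributes (none of them containing a literal >), a single match can do the same job. extractBodyInner is a hypothetical helper name, not part of any library.

```javascript
// Hypothetical helper: returns the markup between <body ...> and </body>,
// or null when no body element is found.
// Assumes no attribute value on the body tag contains a literal ">".
function extractBodyInner(html) {
    var match = /<body\b[^>]*>([\s\S]*?)<\/body\s*>/i.exec(html);
    return match ? match[1] : null;
}

var sample = '<html><head><title>t</title></head>' +
             '<body class="page">\n<p>Hello</p>\n</body></html>';
var bodyMarkup = extractBodyInner(sample);
```

In the answer's setting the call would be extractBodyInner(myXHR.responseText).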

Regex isn't the ideal tool for parsing the DOM, as you will see mentioned throughout this site and others. The better way, as suggested by George IV, is to use the JavaScript tools that are suited to this, namely getElementsByTagName, and grab the innerHTML:
var bodyText = document.getElementsByTagName("body")[0].innerHTML;
Edit1: I've not checked it out yet, but Rudisimo suggested a tool that shows a lot of promise: the XRegExp library, an open source, extensible library released under the MIT License. This could potentially be a viable option. I still think the DOM is the better way, but this looks far superior to the standard JavaScript implementation of regex.
Edit2: I recant my previous statements about the Regex engine [for reasons of accuracy] due to the example provided by Gumbo - however absurd the expression might be. I do, however, stand by my opinion that using regex in this instance is an inherently bad way to go and you should reference the DOM using the aforementioned example.
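For the other half of the question (building a DOM tree from the string), a minimal sketch might look like the following. It assumes a browser that provides DOMParser with "text/html" support, and bodyFromString is a hypothetical name.

```javascript
// Hypothetical helper: parse an HTML string into a detached document and
// return the inner markup of its body element, or null if there is none.
// Assumes a browser DOMParser that supports the "text/html" type.
function bodyFromString(html) {
    var doc = new DOMParser().parseFromString(html, "text/html");
    var body = doc.getElementsByTagName("body")[0];
    return body ? body.innerHTML : null;
}
```

Because the parsed document is detached from the page, scripts in the response are not executed and the current page is not modified.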

In general regular expressions are not suitable for parsing. But if you really want to use a regular expression, try this:
/^\s*(?:<(?:!(?:(?:--(?:[^-]+|-[^-])*--)+|\[CDATA\[(?:[^\]]+|](?:[^\]]|][^>]))*\]\]|[^<>]+)|(?!body[\s>])[a-z]+(?:\s*(?:[^<>"']+|"[^"]*"|'[^']*'))*|\/[a-z]+)\s*>|[^<]+)*\s*<body(?:\s*(?:[^<>"']+|"[^"]*"|'[^']*'))*\s*>([\s\S]+)<\/body\s*>/i
As you see, there is no easy way to do that. And I wouldn’t even claim that this is a correct regular expression. But it should take comment tags (<!-- … -->), CDATA tags (<![CDATA[ … ]]>) and normal HTML tags into account.
Good luck while trying to read it.

Everybody seems dead set on using regular expressions so I figured I'd go the other way and answer the second query you had.
It is theoretically possible to parse the result of your AJAX as an xmlDocument.
There are a few steps you'll likely want to take if you want this to work.
Use a library. I recommend jQuery
If you're using a library you must make sure that the mimetype of the response is an xml mimetype!
Make sure you test thoroughly in all your target browsers. You will get tripped up.
That being said, I created a quick example on jsbin.
It works in both IE and Firefox. Unfortunately, in order to get it to work I had to roll my own XMLHttpRequest object.
View the example source code here
(Seriously though, this code is ugly. It's worth using a library and setting the mime type properly...)
function getXHR() {
    var xmlhttp;
    //Build the request
    if (window.XMLHttpRequest) {
        // code for IE7+, Firefox, Chrome, Opera, Safari
        xmlhttp = new XMLHttpRequest();
    } else if (window.ActiveXObject) {
        // code for IE6, IE5
        xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
    } else {
        alert("Your browser does not support XMLHTTP!");
    }
    //Override the mime type for firefox so that it returns the
    //result as an XMLDocument.
    //(The xmlhttp check avoids a TypeError when no request object was created.)
    if (xmlhttp && xmlhttp.overrideMimeType) {
        xmlhttp.overrideMimeType('application/xhtml+xml; charset=x-user-defined');
    }
    return xmlhttp;
}

function runVanillaAjax(url, functor) {
    var xmlhttp = getXHR();
    xmlhttp.onreadystatechange = function() { functor(xmlhttp); };
    xmlhttp.open("GET", url, true);
    xmlhttp.send(null);
}

function vanillaAjaxDone(response) {
    if (response.readyState == 4) {
        //Get the xml document element for IE or firefox
        //($.browser is only available in jQuery versions before 1.9)
        var xml;
        if ($.browser.msie) {
            xml = new ActiveXObject("Microsoft.XMLDOM");
            xml.async = false;
            xml.loadXML(response.responseText);
        } else {
            xml = response.responseXML.documentElement;
        }
        var textarea = document.getElementById('textarea');
        var bodyTag = xml.getElementsByTagName('body')[0];
        if ($.browser.msie) {
            textarea.value = bodyTag.text;
        } else {
            textarea.value = bodyTag.textContent;
        }
    }
}

function vanillaAjax() {
    runVanillaAjax('http://jsbin.com/ulevu', vanillaAjaxDone);
}

There is an alternative workaround for the fact that the dot does not match newlines in JavaScript's built-in RegExp. XRegExp is a powerful open source library released under the permissive MIT License (so it can be used in commercial projects), and it is very compact (2.7KB gzipped).
If you go to the New Flags section, you can see that there's a flag (s) with which the dot matches all characters, including newlines.
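To make the limitation concrete, here is a small sketch that uses only built-in RegExp features. The [\s\S] class is the classic workaround; the native s (dotAll) flag shown last was standardized in ES2018, so treating it as available is an assumption about your target engines.

```javascript
var html = "<body>\n<p>hi</p>\n</body>";

// A plain dot stops at line breaks, so this pattern does not match.
var dotOnly = /<body>(.*)<\/body>/.exec(html);

// [\s\S] matches any character, including newlines.
var anyChar = /<body>([\s\S]*)<\/body>/.exec(html);

// Engines with ES2018 support accept the "s" flag natively.
var dotAll = new RegExp("<body>(.*)</body>", "s").exec(html);
```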

Related

CSS - returning different values from different browsers

When I am using jQuery to grab CSS values for objects, each of the browsers (IE, Mozilla, Chrome, etc) returns different values.
For example, in Chrome, a background image (.css("background-image")) returns:
url(http://i41.tinypic.com/f01zsy.jpg)
Where in Mozilla, it returns:
url("http://i41.tinypic.com/f01zsy.jpg")
I am having the same problem on other aspects, such as background-size.
In chrome it returns:
50% 50%
But Mozilla returns:
50%+50%
My problem is that I have functions that split the CSS value (background-size, for example) on a space with .split(" "), but this does not work in Mozilla because it uses a + instead.
Is there any way that I can fix this problem and make the browsers use one standard?
Is there any function that I could write which grabs and splits values based on the browser the user is using?

My problem with this is, I have functions that split the CSS
(background-size), for example based on a space .split(" "), but this
could not work on Mozilla because it uses a + instead.
Try adding \+ to RegExp passed to .split
.split(/\s|\+/)
var res = ["50%+50%", "50% 50%"];
var re = /\s+|\+/;
console.log(res[0].split(re), res[1].split(re));

Different browsers serialize CSS values differently, and you may have to write a full-blown parser to normalize them to one standard.
A workaround is to split or read the CSS values in a way that takes the differences between browsers into account. For example, the background-size problem can be solved like this:
space.split(/\s|\+/); // split the string wherever it has either a space or a plus sign
For background-image, the solution may be to remove the quotation marks before using it:
space.replace(/"/g, "");
Try to make the splits general enough for all browsers. Hope that helps.

This probably isn't the cleanest method, but you could run a string parser over the background image source and delete any quotation marks. This would be the most efficient method for parsing the background image URL. It should work without harming the data because URLs typically can't contain quotation marks, as they are encoded as %22.
As for the background-size, you could parse the results for + signs and change those to spaces, as + signs typically aren't present as the values for any CSS properties, so you should be relatively safe in taking those out.
In addition, you could check the browser type to see if you'd even have to run these parsings in the first place. As a precaution, you should also see how Opera and Safari return results, and if those are any different, you could create branch statements for the parsers that handle the different types of CSS values returned by the different browsers.
Note: The parsing methods I have described attempt the goal of converting the Firefox results to the Chrome-style results.
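The normalization described above could be sketched as follows. The function names are made up for illustration, and the sample strings are the ones from the question. Note that splitting on + would also break values that legitimately contain a plus sign, such as calc() expressions.

```javascript
// Hypothetical helpers that convert Firefox-style values to Chrome-style ones.

// Remove single and double quotes from a url(...) value.
function stripUrlQuotes(value) {
    return value.replace(/["']/g, "");
}

// Split a multi-part value on runs of whitespace or "+" signs,
// after trimming leading and trailing whitespace.
function splitCssValue(value) {
    return value.replace(/^\s+|\s+$/g, "").split(/[\s+]+/);
}

var image = stripUrlQuotes('url("http://i41.tinypic.com/f01zsy.jpg")');
var sizeFromFirefox = splitCssValue("50%+50%");
var sizeFromChrome = splitCssValue("50% 50%");
```

Normalizing every value this way avoids checking the browser type at all, since the Chrome-style input passes through unchanged.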

Thanks for all the help.
I'll share the code I have ended up using!
cssCommas: function(text)
{
    // Remove every double quote
    return text.replace(new RegExp("\"", "g"), "");
},
cssPlus: function(text)
{
    // Remove every plus sign
    return text.replace(new RegExp("\\+", "g"), "");
},
cssSplit: function(text, removePercent)
{
    // Default the optional flag to false
    removePercent = removePercent || false;
    if (removePercent == true)
    {
        text = text.replace(new RegExp("%", "g"), "");
    }
    // Split on whitespace or a plus sign
    return text.split(new RegExp("\\s|\\+", "g"));
},
css: function(text)
{
    return this.cssCommas(this.cssPlus(text));
}
Works perfectly on all browsers now. Thanks a lot.

Microsoft.XMLHTTP documentElement is NULL

Can someone offer some troubleshooting tips here? My PostXML() return value is null for a very small subset of data results; it works for 99.9% of usage. I think the failing data may contain special characters, but when I compare it to a similar dataset that passes, it looks identical. While debugging, oXMLDoc.xml in File.asp contains a valid XML string, but documentElement is null when the response gets back to my JS call.
Are there any known issues with what looks like a valid XML element getting trashed in the Microsoft.XMLHTTP object?
function PostXML(sXML)
{
    var oHTTPPost = new ActiveXObject("Microsoft.XMLHTTP");
    oHTTPPost.Open("POST", "File.asp", false);
    oHTTPPost.send(sXML);
    // documentElement is null???
    return oHTTPPost.responseXML.documentElement;
}
File.asp
<%
' oXMLDoc.xml contains valid XML here, but is NULL in the calling JS
Response.ContentType = "text/xml"
Response.Write oXMLDoc.xml
%>

Check the response headers. The content type needs to be an XML type such as application/xml or text/xml.
XMLHttpRequest is available in current IEs. I suggest using the ActiveX object only as a fallback.
You can override the content mimetype on it:
var xhr = new XMLHttpRequest();
xhr.overrideMimeType("application/xml");
...
This might be possible on the ActiveX object, too, but I am not sure.
Another possibility is using DOMParser to convert a received string into a Document instance.
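As a sketch of the DOMParser route, assuming a browser that provides DOMParser: parse responseText yourself and look for a parsererror element, which browsers commonly place in the returned document when the XML is not well-formed. parseXmlString is a hypothetical name.

```javascript
// Hypothetical helper: parse an XML string and throw if the parser
// reported an error instead of silently returning an unusable document.
// Assumes a browser DOMParser; the <parsererror> convention is browser behavior.
function parseXmlString(xmlText) {
    var doc = new DOMParser().parseFromString(xmlText, "application/xml");
    var errors = doc.getElementsByTagName("parsererror");
    if (errors.length > 0) {
        throw new Error("XML parse error: " + errors[0].textContent);
    }
    return doc;
}
```

Surfacing the parser's message this way can make a case like the null documentElement easier to diagnose than a silent null.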

Found the issue.
Using IE Dev/Debugger, I found xEFxBFxBF in one of the string attributes. This product uses MS SQL Server and the query output did not reflect these characters even if copy/pasted into Notepad++. I'm assuming Ent Manager filters out unsupported characters... /=
Thanks for the help people!
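If the xEFxBFxBF above stands for the byte sequence EF BF BF, that is the UTF-8 encoding of U+FFFF, a code point that the XML 1.0 specification does not allow in documents, which would explain the parser rejecting the response. That reading is an assumption. A rough sketch for detecting or removing such characters before the XML is parsed (the names are hypothetical, and lone surrogates are not handled):

```javascript
// Code units outside the XML 1.0 "Char" production: the C0 controls other
// than tab, line feed and carriage return, plus U+FFFE and U+FFFF.
// Lone surrogates are also illegal in XML but are not covered here.
var ILLEGAL_XML_CHARS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\uFFFE\uFFFF]/g;

function hasIllegalXmlChars(text) {
    // search() ignores lastIndex, so the global flag is harmless here.
    return text.search(ILLEGAL_XML_CHARS) !== -1;
}

function stripIllegalXmlChars(text) {
    return text.replace(ILLEGAL_XML_CHARS, "");
}
```

Cleaning the data on the server before it is written to the response would be the more robust place for this; the client-side version is mainly useful for diagnosis.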

Ouch man.
Is it possible to use jQuery instead?
The other thing I know is that if the returned XML is a little off (poorly formed XML, wrong case, illegal characters), the JavaScript will trash the return value.
With jQuery you have better debugging options to see the error.

Can't create XML node with cyrillic name in IE11

I need to create an XML document (with JavaScript) containing nodes whose names are in Russian.
I get an InvalidCharacterError in IE11 when I try to run doc.createElement("Выборка").
doc is created with var doc = document.implementation.createDocument("", "", null).
In other browsers this code works without any issues.
How can this be solved? What is the root cause of the issue?
jsFiddle example: http://jsfiddle.net/e4tUH/1/
My post on connect.microsoft.com: https://connect.microsoft.com/IE/feedback/details/812130/cant-create-xml-node-with-cyrillic-name-in-ie11
Current workaround: Switch IE11 to IE10 with X-UA-Compatible meta-tag and use window.ActiveXObject(...) to create XML documents.

Maybe IE11 has an issue similar to what Firefox had in the past:
https://bugzilla.mozilla.org/show_bug.cgi?id=431701
That means that although your page is loading the correct encoding, IE11 is creating the new document with a default encoding which is not the expected one. There's no way to check that besides looking into IE11 source code, which we don't have.
Have you tried adding non-ASCII characters in other places besides element names, like an attribute value or a text node?
I searched for a way to change the encoding of the created document and haven't found any solution for that.
To solve your problem, I would suggest using a DOMParser to generate a document from an XML string, like the following:
var parser=new DOMParser();
var xmlDoc=parser.parseFromString('<?xml version="1.0" encoding="UTF-8"?><Выборка>Выборка текста</Выборка>',"text/xml");
All browsers seem to support it for XML parsing. More about DOMParser at the following links, including how to provide backward compatibility with older IE versions:
http://www.w3schools.com/dom/dom_parser.asp
https://developer.mozilla.org/en-US/docs/Web/API/DOMParser
If you don't want to generate your XML just by concatenating strings, you can use some kind of XML builder like in this example: http://jsfiddle.net/UGYWx/6/
Then you can easily create your XML more safely:
var builder = new XMLBuilder("rootElement");
builder.text('Some text');
var element = builder.element("someElement", {'attr':'value'});
element.text("This is a text.");
builder.text('Some more Text');
builder.element("emptyElement");
builder.text('Even some more text');
builder.element("emptyWithAttributes", {'a1': 'val1', 'a2' : 'val2'});
$('div').text(builder.toString());
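If you do assemble the XML as a string for DOMParser, the text content should at least be escaped. A minimal sketch, with hypothetical helper names and no validation of the element name itself:

```javascript
// Hypothetical helper: escape the five characters that have predefined
// entities in XML so that text cannot break the surrounding markup.
function escapeXmlText(text) {
    return String(text)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&apos;");
}

// Hypothetical helper: build a single element with escaped text content.
// The element name is used as given and is not checked for validity.
function buildElement(name, text) {
    return "<" + name + ">" + escapeXmlText(text) + "</" + name + ">";
}

var xmlString = buildElement("Выборка", "a < b & c");
```

The resulting string can then be handed to parser.parseFromString(xmlString, "text/xml") as in the DOMParser example.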

I have always been very reluctant to use non-ASCII characters inside source code. Try escaping the string; maybe it helps.
doc.createElement("\u0412\u044B\u0431\u043E\u0440\u043A\u0430")

Error in Firefox while replace string using regexp in JavaScript

try {
    var hdnPassenger = $("#ctl00_ContentPlaceHolder1_hdnPassenger").val();
    var newTr = $("#hdnCtl").html();
    newTr = newTr.replace(/_ID/g, hdnPassenger);
}
catch (ex) {
    alert(ex);
}
The above code works fine in Internet Explorer, but displays the following error in Mozilla Firefox:
InternalError: regular expression too complex

Having done some research into this problem, I found two possible reasons for this error:
The actual regex is too complex (not in your case, as you only have /_ID/).
The string you're trying to do the substitution on is too long (I don't know how long yours is, but probably quite long). It seems that there's a hard-coded limit in some versions of Firefox, but I can't vouch for that.
I suggest you do two things: add the values of your hdnPassenger and newTr variables to your question, and at the same time google "firefox regular expression too complex"; there are plenty of hits.
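Since the pattern here is a fixed string, one thing to try is doing the replacement without the regex engine at all. This is only a sketch: whether it avoids the Firefox error is an untested assumption, replaceAllLiteral is a hypothetical helper, and the sample markup is invented.

```javascript
// Hypothetical helper: replace every occurrence of a literal substring
// by splitting on it and joining the pieces with the replacement.
function replaceAllLiteral(text, search, replacement) {
    return text.split(search).join(replacement);
}

var template = '<tr id="row_ID"><td id="cell_ID"></td></tr>';
var filled = replaceAllLiteral(template, "_ID", "7");
```

A side benefit is that the replacement string is inserted literally, so sequences such as $& in it are not interpreted as they would be by String.prototype.replace.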

Chrome extension read innerHTML of the current page?

Hi, this may be a silly question, but I can't find the answer anywhere.
I'm writing a Chrome extension, and all I need is to read in the HTML of the current page so I can extract some data from it.
Here's what I have so far:
<script>
window.addEventListener("load", windowLoaded, false);
function windowLoaded() {
alert(document.innerHTML)
});
}
</script>
Can anybody tell me what I'm doing wrong?
thanks,

function windowLoaded() {
    alert('<html>' + document.documentElement.innerHTML + '</html>');
}
addEventListener("load", windowLoaded, false);
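Notice how windowLoaded is defined before it is registered as the listener. Function declarations are hoisted, so the original order was not itself the problem, but this order reads more clearly.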
Also notice how I am getting the innerHTML of document.documentElement, which is the html tag, then adding the html source tags around it.

I'm writing a chrome extension, all I need is to read in the html of
the current page so I can extract some data from it.
I think an important answer here is not the correct code to use to alert the innerHTML but how to get the data you need from what's already been rendered.
As pimvdb pointed out, your code isn't working because of a typo and needing document.documentElement.innerHTML, something you can diagnose in the Chrome console (Ctrl+Shift+I). But that's secondary to why you'd want the inner HTML in the first place. Whether you're looking for a certain node, specific text, how many <div> elements exist, the value of an ID, etc., I'd heavily recommend the use of a library like jQuery (vanilla JS works, but it can be verbose and unwieldy). Instead of reading in all the HTML and parsing it with string functions or regex, you probably want to take advantage of all the DOM parsing functionality already available to you.
In other words, something like this:
$("#some_id").val(); // jQuery
document.getElementById("some_id").value; // vanilla JS
is probably way safer, easier and more readable than something eminently breakable like this (probably a bit off here, but just to make a point):
innerHTML.match(/<[^>]+id="some_id"[^>]+value="(.*?)"[^>]*?>/i)[1];

Use document.documentElement.outerHTML. (Note that this is not supported in Firefox; irrelevant in your case.) However, this is still not perfect as it doesn't return nodes outside the root element (!doctype and possibly some comments or processing instructions). The document.innerHTML property is, AFAIK, specified in HTML5 specification, but currently not supported in any browser.
Just FYI, navigating to view-source:www.example.com also displays the entire markup (Chrome & Firefox). But I don't know whether you can work with it somehow.
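One way to put the doctype back is sketched below. pageSource is a hypothetical helper; it rebuilds only the doctype name (not any public or system identifiers) and still skips comments that sit outside the root element.

```javascript
// Hypothetical helper: serialize a document as "<!DOCTYPE name>" followed by
// the outer HTML of its root element. Takes the document as a parameter so
// it can be used from a content script with the page's own document.
function pageSource(doc) {
    var doctype = doc.doctype ? "<!DOCTYPE " + doc.doctype.name + ">\n" : "";
    return doctype + doc.documentElement.outerHTML;
}
```

In a content script the call would simply be pageSource(document).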

window.addEventListener("load", windowLoaded, false);
function windowLoaded() {
    alert(document.documentElement.innerHTML);
}
You had a } with no purpose, and the }); should just be }. These are syntax errors.
Also, it's document.documentElement.innerHTML, since it's not a property of document.
