How to get the page's full content as string in javascript?

I'm writing a bookmarklet, i.e. a bookmark that contains JavaScript instead of a URL, and I'm having some trouble. I cannot remember how to get the content of the page as a string, so that I can apply a regular expression to find what I want. Can you please help me with this?
Before anyone suggests it, I cannot use getElementBy(Id/Name/Tag), because the data I'm looking for is HTML-commented and inside markups, so I don't think that would work.
Thanks.

You can access it through:
document.body.innerHTML
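Note that document.body.innerHTML covers only the contents of <body>, so markup or comments in <head> are not included. A minimal sketch, assuming the bookmarklet wants the whole document; the helper name getPageMarkup and its doc parameter are made up for illustration:

```javascript
// Hypothetical helper: returns the serialized markup of the whole page.
// `doc` is normally the global `document`; it is a parameter so the
// function can also be tried outside a browser.
function getPageMarkup(doc) {
  var root = doc.documentElement; // the <html> element
  if (root && typeof root.outerHTML === 'string') {
    return root.outerHTML; // includes <head> and <body>
  }
  // Fallback: body contents only, as in the answer above.
  return doc.body ? doc.body.innerHTML : '';
}
```

In a bookmarklet this would be called as getPageMarkup(document). The string reflects the current DOM as the browser parsed it, not the bytes originally sent by the server.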

so I can apply a regular expression to find what I want
Do. Not. Use. Regex. To. Parse. HTML.
Especially when the browser has already parsed it for you! Come ON!
the data I'm looking for is HTML-commented
You can perfectly well grab comment content out of the DOM. eg.
<div id="mything"><!-- la la la I'm a big comment --></div>
alert(document.getElementById('mything').firstChild.data);
And if you need to search the DOM for comment elements:
// Get comment descendants
//
function dom_getComments(parent, recurse) {
var results= [];
for (var childi= 0; childi<parent.childNodes.length; childi++) {
var child= parent.childNodes[childi];
if (child.nodeType==8) // Node.COMMENT_NODE
results.push(child);
else if (recurse && child.nodeType==1) // Node.ELEMENT_NODE
results= results.concat(dom_getComments(child, recurse)); // pass recurse on, otherwise the search stops one level down
}
return results;
}
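If the goal is still to run a regular expression, it can be applied to the comment text alone rather than to the whole page markup. A rough sketch of that idea; the helper name findInComments is hypothetical, and it relies only on the childNodes, nodeType and data properties already used above:

```javascript
// Hypothetical helper: walk the tree, look only at comment nodes
// (nodeType 8), and collect the first regex match found in each of them.
function findInComments(root, pattern) {
  var found = [];
  (function visit(node) {
    var children = node.childNodes || [];
    for (var i = 0; i < children.length; i++) {
      var child = children[i];
      if (child.nodeType === 8) {          // Node.COMMENT_NODE
        var m = child.data.match(pattern);
        if (m) found.push(m[0]);
      } else if (child.nodeType === 1) {   // Node.ELEMENT_NODE
        visit(child);                      // descend into elements
      }
    }
  })(root);
  return found;
}
```

In a page this might be called as findInComments(document.documentElement, /some pattern/), where the pattern is whatever the bookmarklet is looking for.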

Related

Is there a more efficient way to replace multiple strings in a html?

First of all, please excuse the bad title; I don't know what to call it.
This is a student project. My task is to explore/examine many fairly large texts in a web app.
But this is not important; that is simply how it needs to be done.
The strings that need to be replaced are stored in a JSON file. I'm iterating through the file with
$.each(stringList, function(index, value) {
var isInContent = $('#myContent').text().indexOf(index) > -1;
if (isInContent) {... replaceString(index); ...}
});
And to replace the string I use
$('#myContent :not(script)').contents().filter(function() {
return this.nodeType === 3;
}).replaceWith(function() {
return this.nodeValue.replace(index, myNewString);
});
It works as it should, but it's a bit slow. I think that is because every replacement loads $('#myContent') again and writes the whole of it back to the HTML afterwards.
Is there a more efficient way to do it?
Thank you
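One possible direction, sketched under the assumption that the replacements can be expressed as a plain object mapping each search string to its replacement: build a single regular expression from all the keys and make one pass over each text node's value, instead of one full DOM pass per key. The names replaceMany and escapeRegExp are made up for illustration.

```javascript
// Hypothetical helper: escape characters that are special in a regex.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Replace every key of `replacements` found in `text` in a single pass.
function replaceMany(text, replacements) {
  var keys = Object.keys(replacements);
  if (keys.length === 0) return text;
  // Longer keys first, so that "foobar" wins over "foo".
  keys.sort(function (a, b) { return b.length - a.length; });
  var pattern = new RegExp(keys.map(escapeRegExp).join('|'), 'g');
  return text.replace(pattern, function (match) {
    return replacements[match];
  });
}
```

Inside the replaceWith callback above, the body would then be something like return replaceMany(this.nodeValue, replacements); called once for all strings rather than once per string.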

Regex for visible text, not HTML

If I had a string:
hey user, what are you doing?
How could I say with a regex: look for user, but not inside < or > characters? So the match would grab the user between the <a></a> but not the one inside the href.
I'd like this to work for any tag, so it won't matter which tags are used.
== Update ==
The reason I can't use .text() or innerText is that this is being used to highlight results, much like the native cmd/ctrl+f functionality in browsers, and I don't want to lose formatting. For example, if I search for strong here:
Some <strong>strong</strong> text.
If I use .text() it'll return "Some strong text", and then I'll wrap strong with a <span> that has a class for styling, but when I go back and try to insert this into the DOM it'll be missing the <strong> tags.

If you plan to replace the HTML using html() again, then you will lose all event handlers that might be bound to inner elements, along with their data (as I said in my comment).
Whenever you set the content of an element as HTML string, you are creating new elements.
It might be better to recursively apply this function to every text node only. Something like:
$.fn.highlight = function(word) {
var pattern = new RegExp(word, 'g'),
repl = '<span class="high">' + word + '</span>';
this.each(function() {
$(this).contents().each(function() {
if(this.nodeType === 3 && pattern.test(this.nodeValue)) {
$(this).replaceWith(this.nodeValue.replace(pattern, repl));
}
else if(!$(this).hasClass('high')) {
$(this).highlight(word);
}
});
});
return this;
};
It could very well be that this is not very efficient though.
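One thing to watch with the approach above is that the search word goes into new RegExp unescaped and the text is re-inserted as an HTML string. A hedged alternative for the matching step, assuming a plain substring search is enough: split the text with indexOf and let the caller build text nodes and span elements from the pieces. The helper name splitForHighlight and the shape of its result are made up for illustration.

```javascript
// Hypothetical helper: split `text` into plain and matching segments
// using indexOf, so no regular expression (and no escaping) is needed.
function splitForHighlight(text, word) {
  var segments = [];
  if (!word) return text ? [{ text: text, hit: false }] : segments;
  var pos = 0, idx;
  while ((idx = text.indexOf(word, pos)) !== -1) {
    if (idx > pos) segments.push({ text: text.slice(pos, idx), hit: false });
    segments.push({ text: word, hit: true });
    pos = idx + word.length;
  }
  if (pos < text.length) segments.push({ text: text.slice(pos), hit: false });
  return segments;
}
```

Each segment with hit set to true would become a span with the highlight class whose content is a text node; the other segments stay plain text nodes, so nothing in the original text is ever parsed as markup.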

To emulate Ctrl-F (which I assume is what you're doing), you can use window.find for Firefox, Chrome, and Safari and TextRange.findText for IE.
You should use a feature detect to choose which method you use:
function highlightText(str) {
if (window.find)
window.find(str);
else if (window.TextRange && window.TextRange.prototype.findText) {
var bodyRange = document.body.createTextRange();
bodyRange.findText(str);
bodyRange.select();
}
}
Then, after the text is selected, you can style the selection with CSS using the ::selection selector.
Edit: To search within a certain DOM object, you could use a roundabout method: use window.find and see whether the selection is in a certain element. (Perhaps say s = window.getSelection().anchorNode and compare s.parentNode == obj, s.parentNode.parentNode == obj, etc.). If it's not in the correct element, repeat the process. IE is a lot easier: instead of document.body.createTextRange(), you can use obj.createTextRange().

$("body > *").each(function (index, element) {
var parts = $(element).text().split("needle");
if (parts.length > 1)
$(element).html(parts.join('<span class="highlight">needle</span>'));
});
at this point it's evolving to be more and more like Felix's, so I think he's got the winner
original:
If you're doing this in javascript, you already have a handy parsed version of the web page in the DOM.
// gives "user"
alert(document.getElementById('user').innerHTML);
or with jQuery you can do lots of nice shortcuts:
alert($('#user').html()); // same as above
$("a").each(function (index, element) {
alert(element.innerHTML); // shows label text of every link in page
});

I like regexes, but because tags can be nested, you will have to use a parser. I recommend http://simplehtmldom.sourceforge.net/ it is really powerful and easy to use. If you have wellformed xhtml you can also use SimpleXML from php.
edit: Didn't see the javascript tag.

Try this:
/[(<.+>)(^<)]*user[(^>)(<.*>)]/
It means:
Before the keyword, you can have as many <...> or non-<.
Samewise after it.
EDIT:
The correct one would be:
/((<.+>)|(^<))*user((^>)|(<.*>))*/

Here is what works, I tried it on your JS Bin:
var s = 'hey user, what are you doing?';
s = s.replace(/(<[^>]*)user([^<>]*>)/g,'$1NEVER_WRITE_THAT_ANYWHERE_ELSE$2'); // [^<>]* allows any number of characters between "user" and the closing >
s = s.replace(/user/g,'Mr Smith');
s = s.replace(/NEVER_WRITE_THAT_ANYWHERE_ELSE/g,'user');
document.body.innerHTML = s;
It may be a tiny little bit complicated, but it works!
Explanation:
You replace "user" that is in the tag (which is easy to find) with a random string of your choice that you must never use again... ever. A good use would be to replace it with its hashcode (md5, sha-1, ...)
Replace every remaining occurrence of "user" with the text you want.
Replace back your unique string with "user".
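The same effect can be had without a placeholder string by splitting the markup on tags and touching only the text pieces. This is a sketch, not a robust HTML parser: it assumes no > appears inside attribute values, and the helper name replaceOutsideTags is made up for illustration.

```javascript
// Hypothetical helper: replace `word` only in the text between tags.
// Splitting on a capturing group keeps the tags in the result array:
// even indexes hold text, odd indexes hold tags.
function replaceOutsideTags(html, word, replacement) {
  if (!word) return html;
  var parts = html.split(/(<[^>]*>)/);
  for (var i = 0; i < parts.length; i += 2) {
    parts[i] = parts[i].split(word).join(replacement);
  }
  return parts.join('');
}
```

For example, with an assumed input such as 'hey <a href="/user">user</a>' (the href value is a guess, not taken from the question), only the link text would be replaced and the href would be left alone.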

This code will strip all tags from the string:
var s = 'hey user, what are you doing?';
s = s.replace(/<[^<>]+>/g,'');

Regex to search html return, but not actual html jQuery

I'm making a highlighting plugin for a client to find things in a page, and I decided to test it with a help viewer I'm still building, but I'm having an issue that will (probably) require some regex.
I do not want to parse HTML, and I'm totally open to doing this differently; this just seems like the best/right way.
http://oscargodson.com/labs/help-viewer
http://oscargodson.com/labs/help-viewer/js/jquery.jhighlight.js
Type something in the search... ok, refresh the page, now type, like, class or class=" or type <a you'll notice it'll search the actual HTML (as expected). How can I only search the text?
If I do .text() it'll vaporize all the HTML and what I get back will just be a big blob of text, but I still want the HTML so I don't lose formatting, links, images, etc. I want this to work like CMD/CTRL+F.
You'd use this plugin like:
$('article').jhighlight({find:'class'});
To remove them:
.jhighlight('remove')
==UPDATE==
While Mike Samuel's idea below does in fact work, it's a tad heavy for this plugin. It's mainly for a client looking to erase bad words and/or MS Word characters during a "publishing" process of a form. I'm looking for a more lightweight fix, any ideas?

You really don't want to use eval, mess with innerHTML or parse the markup "manually". The best way, in my opinion, is to deal with text nodes directly and keep a cache of the original html to erase the highlights. Quick rewrite, with comments:
(function($){
$.fn.jhighlight = function(opt) {
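var isRemove = opt === 'remove'
, options = $.extend({}, $.fn.jhighlight.defaults, isRemove ? {} : opt) // copy, so the shared defaults are not mutated
, txtProp = this[0].textContent ? 'textContent' : 'innerText';
if (!isRemove && $.trim(options.find).length < 1) return this; // nothing to search for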
return this.each(function(){
var self = $(this);
// use a cache to clear the highlights
if (!self.data('htmlCache'))
self.data('htmlCache', self.html());
if(opt === 'remove'){
return self.html( self.data('htmlCache') );
}
// create Tree Walker
// https://developer.mozilla.org/en/DOM/treeWalker
var walker = document.createTreeWalker(
this, // walk only on target element
NodeFilter.SHOW_TEXT,
null,
false
);
var node
, matches
, flags = 'g' + (!options.caseSensitive ? 'i' : '')
, exp = new RegExp('('+options.find+')', flags) // capturing
, expSplit = new RegExp(options.find, flags) // no capturing
, highlights = [];
// walk this wayy
// and save matched nodes for later
while(node = walker.nextNode()){
if (matches = node.nodeValue.match(exp)){
highlights.push([node, matches]);
}
}
// must replace stuff after the walker is finished
// otherwise replacing a node will halt the walker
for(var nn=0,hln=highlights.length; nn<hln; nn++){
var node = highlights[nn][0]
, matches = highlights[nn][1]
, parts = node.nodeValue.split(expSplit) // split on matches
, frag = document.createDocumentFragment(); // temporary holder
// add text + highlighted parts in between
// like a .join() but with elements :)
for(var i=0,ln=parts.length; i<ln; i++){
// non-highlighted text
if (parts[i].length)
frag.appendChild(document.createTextNode(parts[i]));
// highlighted text
// skip last iteration
if (i < ln-1){
var h = document.createElement('span');
h.className = options.className;
h[txtProp] = matches[i];
frag.appendChild(h);
}
}
// replace the original text node
node.parentNode.replaceChild(frag, node);
};
});
};
$.fn.jhighlight.defaults = {
find:'',
className:'jhighlight',
color:'#FFF77B',
caseSensitive:false,
wrappingTag:'span'
};
})(jQuery);
If you're doing any manipulation on the page, you might want to replace the caching with another clean-up mechanism, not trivial though.
You can see the code working here: http://jsbin.com/anace5/2/
You also need to add display:block to your new html elements, the layout is broken on a few browsers.

In the javascript code prettifier, I had this problem. I wanted to search the text but preserve tags.
What I did was start with HTML, and decompose that into two bits.
The text content
Pairs of (index into text content where a tag occurs, the tag content)
So given
Lorem <b>ipsum</b>
I end up with
text = 'Lorem ipsum'
tags = [6, '<b>', 10, '</b>']
which allows me to search on the text, and then based on the result start and end indices, produce HTML including only the tags (and only balanced tags) in that range.
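A rough sketch of that decomposition, assuming simple markup where no > appears inside attribute values and ignoring entity decoding; the function names decompose and recompose are made up for illustration. In this sketch each offset is the position in the text where the tag would be re-inserted, so the closing tag in the example lands at 11 (the end of "ipsum") rather than the 10 listed above.

```javascript
// Hypothetical sketch: split markup into its text content and a flat
// list of (offset into text, tag) pairs.
function decompose(html) {
  var text = '';
  var tags = [];
  var parts = html.split(/(<[^>]*>)/); // odd indexes are tags
  for (var i = 0; i < parts.length; i++) {
    if (i % 2 === 1) {
      tags.push(text.length, parts[i]); // where the tag sits in the text
    } else {
      text += parts[i];
    }
  }
  return { text: text, tags: tags };
}

// Inverse operation: weave the tags back into the text.
function recompose(text, tags) {
  var out = '';
  var pos = 0;
  for (var i = 0; i < tags.length; i += 2) {
    out += text.slice(pos, tags[i]) + tags[i + 1];
    pos = tags[i];
  }
  return out + text.slice(pos);
}
```

A search can then run on the text alone, and the tag list tells you which tags fall inside the matched range when the HTML is rebuilt.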

Have a look here: getElementsByTagName() equivalent for textNodes.
You can probably adapt one of the proposed solutions to your needs (i.e. iterate over all text nodes, replacing the words as you go - this won't work in cases such as <tag>wo</tag>rd but it's better than nothing, I guess).

I believe you could just do:
$('#article :not(:has(*))').jhighlight({find : 'class'});
Since it grabs all leaf nodes in the article it would require valid xhtml, that is, it would only match link in the following example:
<p>This is some paragraph content with a link</p>
DOM traversal / selector application could slow things down a bit so it might be good to do:
article_nodes = article_nodes || $('#article :not(:has(*))');
article_nodes.jhighlight({find : 'class'});

Maybe something like this could be helpful:
>+[^<]*?(s(<[\s\S]*?>)?e(<[\s\S]*?>)?e)[^>]*?<+
The first part >+[^<]*? finds > of the last preceding tag
The third part [^>]*?<+ finds < of the first subsequent tag
In the middle we have (<[\s\S]*?>)? between characters of our search phrase (in this case - "see").
After regular expression searching you could use the result of the middle part to highlight search phrase for user.
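The middle part of such a pattern can be generated from the search phrase rather than written by hand. A minimal sketch, assuming tags never contain a > inside attribute values; the helper name tagTolerantPattern is hypothetical, and it builds only the middle part, so the surrounding anchors described above would still be needed to avoid matching inside a tag.

```javascript
// Hypothetical helper: build a regex that matches `phrase` even when tags
// are interleaved between its characters, e.g. "s<b>e</b>e" for "see".
function tagTolerantPattern(phrase) {
  var betweenChars = '(?:<[^>]*>)*'; // zero or more tags
  var escaped = phrase.split('').map(function (ch) {
    // escape characters that are special in a regex
    return ch.replace(/[.*+?^${}()|[\]\\]/, '\\$&');
  });
  return new RegExp(escaped.join(betweenChars), 'g');
}
```

The pattern uses a non-capturing group for the tags, so the whole match (text plus any interleaved tags) is what comes back from match or exec.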

HTML web page string into array or JSON in JS

Greetings!
Is it possible to convert an HTML string to an array or JSON using Javascript?
Something like this:
var stringweb = '<html><head>hi</head><body>my body</body></html>';
And as result, I can have this:
var myarray = {[html,
[head,
[hi]
]
[etc...]
]}
Thanks in advance! :)

As you can tell from the comments above, this doesn't seem like the most robust idea... Anyhow, here is a solution that I think gets you what you asked for. It was fun to write, anyhow.
function htmlStringToArray(str) {
var temp = document.createElement('iframe');
temp.style.display = "none";
document.body.appendChild(temp);
var doc = temp.contentWindow.document;
doc.open();
doc.write(str);
doc.close();
var array = htmlNodeToArray(doc.documentElement);
temp.parentNode.removeChild(temp);
return array;
}
function htmlNodeToArray(node) {
if (node.nodeType == 1) {
var array = [node.tagName];
if (node.childNodes.length) {
for (var i=0, child; child = node.childNodes[i]; i++) {
if (child.nodeType == 1 || child.nodeType == 3) {
array.push(htmlNodeToArray(child));
}
}
} else if (node.innerText) {
array.push([node.innerText]);
}
return array;
} else if (node.nodeType == 3) {
return [node.nodeValue];
}
}
I tried it out in the latest chrome, firefox and IE. Here it is running on jsbin: http://jsbin.com/uqize3/7/edit
BTW your HTML string is invalid. Browsers will move "hi" from inside the <head> into the <body>. I assumed you intended to have a <title> in there.

You can do that in JavaScript, because JavaScript is a sufficiently expressive language as to allow just about anything. However, it's not going to be particularly easy: you're going to have to implement (or find) as complete an HTML parser as is necessary to recognize the particular HTML documents that you want to convert. HTML itself is pretty complicated, and that complexity is greatly magnified by the fact that most of the world's stock of existing HTML documents are badly erroneous. Thus, if you've got well-constrained HTML that you know to be valid, or at least consistently invalid, that might make the task a little easier.
edit: @Hemlock points out, quite wisely, that if you're doing this in a browser (that is, if this code is going to run from inside a web page served to browsers), then you've got it a lot easier. You can hand your HTML over to the browser, perhaps as the content document for an <iframe> element you add to the page. If it's not too awful for the browser to parse (and browsers can cope with surprisingly weird HTML), then once the DOM is ready in the <iframe> you can just walk the DOM and generate whatever sort of different representation you want.

How to get HTML data with javascript

I have an HTML web page full of divs and span tags identified with class that have lots of data I need in other format. I was wondering what would be the best way to do this with javascript.
Thank you for the help.

The fastest way? jQuery:
$(".myClass").each(function() {
// work with your data here
});

More lowlevel, but should be a lot faster (a lot less overhead):
var myelements = document.evaluate('//div[@class="myClass"]', document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
for (var i = 0; i < myelements.snapshotLength; i++) {
var dataElement = myelements.snapshotItem(i);
// work with your data here
}
(ok, you'd have to do it twice (once for div and once for span), it's more code and doesn't look as nice, but it should still be faster)

If you want to get at all of the elements with a specific class, you will need to test for the presence of that class on each element. You will want to use
document.getElementsByTagName("*") // This should select everything
and loop through the result, testing each element's className.
if (/(^|\s)myClass(\s|$)/.test(element.className)) { // "element" and "myClass" are placeholders
// you found an element that matches
// do what you will with it.
}
If you find the elements you need do what you need with them. Now you have processed all elements on the page and found elements that match your criteria. Good luck.
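The class test itself is easy to get subtly wrong, because a plain substring check would also accept "myClassy" when looking for "myClass". A small sketch of a whole-word test; the helper name hasClassName is made up for illustration, and current browsers also provide element.classList.contains and document.getElementsByClassName for the same purpose.

```javascript
// Hypothetical helper: true when the space-separated `className` string
// contains `name` as a whole class, not merely as a substring.
function hasClassName(className, name) {
  var padded = ' ' + className.replace(/\s+/g, ' ') + ' ';
  return padded.indexOf(' ' + name + ' ') !== -1;
}
```

Inside the loop over getElementsByTagName("*"), the condition would then be hasClassName(element.className, "myClass").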
