Scraping data from HTML using JavaScript RegExp [duplicate]

I'm trying to figure out how to, in raw javascript (no jQuery, etc.), find an element with specific text and modify that text.
My first incarnation of the solution... is less than adequate. What I did was basically:
var x = document.body.innerHTML;
x.replace(/regular-expression/,"text");
document.body.innerHTML = x;
Naively I thought I succeeded with flying colors, especially since it was so simple. So then I added an image to my example and thought I could check every 5 seconds (because this string may enter the DOM dynamically)... and the image flickered every 5 seconds.
Oops.
So, there has to be a correct way to do this. A way that specifically singles out a specific DOM element and updates the text portion of that DOM element.
Now, there's always "recursively search through the children till you find the deepest child with the string" approach, which I want to avoid. And even then, I'm skeptical about "changing the innerHTML to something different" being the correct way to update a DOM element.
So, what's the correct way to search through the DOM for a string? And what's the correct way to update a DOM element's text?

Now, there's always "recursively search through the children till you find the deepest child with the string" approach, which I want to avoid.
I want to search for an element in an unordered random list. Now, there's a "go through all the elements till you find what you're looking for approach", which I want to avoid.
Old-timer magno tape, record, listen, meditate.
Btw, see: Find and replace text with JavaScript on James Padolsey's github
(also his blog articles explaining it)

Edit: Changed querySelectorAll to getElementsByTagName from RobG's suggestion.
You can use the getElementsByTagName function to grab all of the tags on the page. From there, you can check their children and see if they have any Text Nodes as children. If they do, you'd then look at their text and see if it matches what you need. Here is an example that will print out the text of every Text Node in your document with the console object:
var elms = document.getElementsByTagName("*"),
    len = elms.length;
for (var ii = 0; ii < len; ii++) {
    var children = elms[ii].childNodes,
        len2 = children.length; // declared with var so it does not leak as a global
    for (var jj = 0; jj < len2; jj++) {
        if (children[jj].nodeType === 3) { // 3 = text node
            console.log(children[jj].nodeValue);
            // example of updating a text node's value
            children[jj].nodeValue = children[jj].nodeValue.replace(/test/, "123");
        }
    }
}
To update a DOM element's text, simply update the nodeValue property of the Text Node.

Don't use innerHTML with a regular expression; it will almost certainly fail for non-trivial content. Also, there are still differences in how browsers generate it from the live DOM. Replacing the innerHTML will also remove any event listeners added as element properties (e.g. element.onclick = fn).
It is best if you can have the string enclosed in an element with an attribute or property you can search on (id, class, etc.) but failing that, a search of text nodes is the best approach.
Edit
Attempting a general purpose text selection function for an HTML document may result in a very complex algorithm since the string could be part of a complex structure, e.g.:
<h1>Some <span class="foo"><em>s</em>pecial</span> heading</h1>
Searching for the string "special heading" is tricky as it is split over 2 elements. Wrapping it in another element (say for highlighting) is also not trivial since the resulting DOM structure must be valid. For example, the text matching "some special" in the above could be wrapped in a span but not a div.
Any such function must be accompanied by documentation stating its limitations and most appropriate use.

Forget regular expressions.
Iterate over each text node (and doing it recursively will be the most elegant) and modify the text nodes if the text is found. If just looking for a string, you can use indexOf().
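As a rough illustration of that recursive walk, here is a minimal sketch. The function name replaceInTextNodes and the choice to skip SCRIPT and STYLE elements are assumptions made for this example, not part of the answer above. It only relies on nodeType, nodeName, nodeValue and childNodes, so it works on real DOM nodes in a browser.

```javascript
// Recursively visit every text node under `node` and replace each
// occurrence of the plain string `search` with `replacement`.
// Returns the number of text nodes that were changed.
// (Hypothetical helper, written for illustration only.)
function replaceInTextNodes(node, search, replacement) {
  var TEXT_NODE = 3; // same value as Node.TEXT_NODE in browsers
  if (!search) return 0; // nothing to look for

  if (node.nodeType === TEXT_NODE) {
    if (node.nodeValue.indexOf(search) === -1) return 0;
    // split/join replaces every occurrence without needing a regex
    node.nodeValue = node.nodeValue.split(search).join(replacement);
    return 1;
  }

  // Assumption: text inside <script> and <style> should be left alone.
  if (node.nodeName === 'SCRIPT' || node.nodeName === 'STYLE') return 0;

  var changed = 0;
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    changed += replaceInTextNodes(children[i], search, replacement);
  }
  return changed;
}
```

In a page you would call it as replaceInTextNodes(document.body, 'test', '123'). Because only nodeValue is touched, elements such as images are not re-created, and handlers attached to elements stay in place.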

x.replace(/regular-expression/,"text");
returns a new string rather than modifying x, so store the result:
var y = x.replace(/regular-expression/,"text");
Now you can assign the new value:
document.body.innerHTML = y;
But think about this: you don't want to rewrite the whole body just to change one small piece of it. Why not work on the content of a div or some other specific element instead?
Example:
<p id='paragraph'>
... some text here ...
</p>
Now you can use JavaScript:
var para = document.getElementById('paragraph'); // keep the element, not its HTML string
var newPara = para.innerHTML.replace(/regex/,'new content');
para.innerHTML = newPara;
This should be the simplest way.

Related

Insert innerHTML with Prototypejs

Say I have a list like this:
<ul id='dom_a'>
<li>foo</li>
</ul>
I know how to insert elements in the ul tag with:
Element.insert('dom_a', {bottom:"<li>bar</li>"});
Since the string I receive contains the dom id, I need to insert the inner HTML instead of the whole element. I need a function to do this:
insert_content('dom_a', {bottom:"<ul id='dom_a'><li>bar</li></ul>"});
And obtain:
<ul id='dom_a'>
<li>foo</li>
<li>bar</li>
</ul>
How should I do this with Prototype?
Here is the solution I have come up with. Can anyone make it better?
Zena.insert_inner = function(dom, position, content) {
dom = $(dom);
position = position.toLowerCase();
content = Object.toHTML(content);
var elem = new Element('div');
elem.innerHTML = content; // strip scripts ?
elem = elem.down();
var insertions = {};
$A(elem.childElements()).each(function(e) {
insertions[position] = e;
dom.insert(insertions);
});
}

I think you could parse the code block in your variable, then ask it for its innerHTML, and then use insert to stick that at the bottom of the actual node in the DOM.
That might look like this:
var rep_struct = "<ul id='dom_a'><li>bar</li></ul>";
var dummy_node = new Element('div'); // So we can easily access the structure
dummy_node.update(rep_struct);
$('dom_a').insert({bottom: dummy_node.childNodes[0].innerHTML});

I think you can slim down the code a bit by simply appending the innerHTML of the first child of temporary element:
Zena.insert_inner = function(dom, position, content) {
var d = document.createElement('div');
d.innerHTML = content;
var insertions = {};
insertions[position] = d.firstChild.innerHTML;
Element.insert(dom, insertions);
}
Not too much of an improvement though, example here.

I've been looking into the Prototype Documentation and I found this: update function.
By the way you described it, you could use the update function in order to find the current bottom content and then update it (just like innerHTML) by adding the desired code plus the previous stored code.

You could use regular expression to strip the outer element.
Element.Methods.insert_content = function(element, insertions) {
var regex = /^<(\w+)[^>]*>(.*)<\/\1>/;
for (var key in insertions) { // var keeps key from becoming an implicit global
insertions[key] = regex.exec(insertions[key])[2];
}
Element.insert(element, insertions);
};
Element.addMethods();
$('dom_a').insert_content({bottom:"<ul id='dom_a'><li>bar</li></ul>"});

If you are using PrototypeJS, you might also want to add script.aculo.us to your project. Builder in script.aculo.us provides a nice way to build complex DOM structures like so:
var myList = Builder.node("ul", {
id: "dom_a"
},[
Builder.node("li", "foo"),
Builder.node("li", "bar")
]);
After this, you can insert this object which should be rendered as HTML anywhere in the DOM with any insert/update functions (of PrototypeJS) or even standard JavaScript appendChild.
$("my_div").insert({After: myList});
Note that in PrototypeJS insert comes in 4 different modes: After, Before, Top and Bottom. If you use insert without specifying a "mode" as above, the default will be Bottom. That is, the new DOM code will be appended below existing contents of the container element as innerHTML. Top will do the same thing but add it on top of the existing contents. Before and After are also cool ways to append to the DOM. If you use these, the content will be added in the DOM structure before and after the container element, not inside as innerHTML.
With Builder however, there is one thing to keep in mind, .. okay two things really:
i. You cannot enter raw HTML in the object as content... This will fail:
Builder.node("ul", "<li>foo</li>");
ii. When you specify node attributes, keep in mind that you must use className for the HTML class attribute (and possibly also htmlFor for the for attribute, which is still the standard way to tie a label to its control in HTML5)
Builder.node("ul", {
id: "dom_a",
className: "classy_list"
});
I know you are scratching your head because of point i: "What, no raw HTML? Dang!"
Not to worry. If you still need to add content which might contain HTML inside a Builder-created DOM, just do it in a second stage using insert({Before/After/Top/Bottom: string}). But why would you want to do that in the first place? It would be really good practice to write a once-and-for-all function that generates all kinds of DOM elements rather than stitching in all sorts of strings. The former approach is neat and elegant. This is something like the inline style versus class question. Good design should, after all, separate content from meta content, or formatting markup / markdown.
One last thing to keep handy in your toolbox is Prototype's DOM traversal, in case you want to dynamically insert and delete content like an HTML Houdini. Check out the Element next, up, down and previous methods. Besides, $$ is also kind of fun to use, particularly if you know CSS3 selectors.

Regex to search html return, but not actual html jQuery

I'm making a highlighting plugin for a client to find things in a page, and I decided to test it with a help viewer I'm still building, but I'm having an issue that will (probably) require some regex.
I do not want to parse HTML, and I'm totally open to doing this differently; this just seems like the best/right way.
http://oscargodson.com/labs/help-viewer
http://oscargodson.com/labs/help-viewer/js/jquery.jhighlight.js
Type something in the search... OK, refresh the page, now type something like class or class=" or <a, and you'll notice it searches the actual HTML (as expected). How can I search only the text?
If I do .text() it'll vaporize all the HTML, and what I get back will just be a big blob of text, but I still want the HTML so I don't lose formatting, links, images, etc. I want this to work like CMD/CTRL+F.
You'd use this plugin like:
$('article').jhighlight({find:'class'});
To remove them:
.jhighlight('remove')
==UPDATE==
While Mike Samuel's idea below does in fact work, it's a tad heavy for this plugin. It's mainly for a client looking to erase bad words and/or MS Word characters during a "publishing" process of a form. I'm looking for a more lightweight fix, any ideas?

You really don't want to use eval, mess with innerHTML or parse the markup "manually". The best way, in my opinion, is to deal with text nodes directly and keep a cache of the original html to erase the highlights. Quick rewrite, with comments:
(function($){
$.fn.jhighlight = function(opt) {
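var options = $.extend({}, $.fn.jhighlight.defaults, opt) // copy, so the shared defaults are not mutated
, txtProp = this[0].textContent ? 'textContent' : 'innerText';
// nothing to search for, unless this is a 'remove' call
if (opt !== 'remove' && $.trim(options.find).length < 1) return this;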
return this.each(function(){
var self = $(this);
// use a cache to clear the highlights
if (!self.data('htmlCache'))
self.data('htmlCache', self.html());
if(opt === 'remove'){
return self.html( self.data('htmlCache') );
}
// create Tree Walker
// https://developer.mozilla.org/en/DOM/treeWalker
var walker = document.createTreeWalker(
this, // walk only on target element
NodeFilter.SHOW_TEXT,
null,
false
);
var node
, matches
, flags = 'g' + (!options.caseSensitive ? 'i' : '')
, exp = new RegExp('('+options.find+')', flags) // capturing
, expSplit = new RegExp(options.find, flags) // no capturing
, highlights = [];
// walk this wayy
// and save matched nodes for later
while(node = walker.nextNode()){
if (matches = node.nodeValue.match(exp)){
highlights.push([node, matches]);
}
}
// must replace stuff after the walker is finished
// otherwise replacing a node will halt the walker
for(var nn=0,hln=highlights.length; nn<hln; nn++){
var node = highlights[nn][0]
, matches = highlights[nn][1]
, parts = node.nodeValue.split(expSplit) // split on matches
, frag = document.createDocumentFragment(); // temporary holder
// add text + highlighted parts in between
// like a .join() but with elements :)
for(var i=0,ln=parts.length; i<ln; i++){
// non-highlighted text
if (parts[i].length)
frag.appendChild(document.createTextNode(parts[i]));
// highlighted text
// skip last iteration
if (i < ln-1){
var h = document.createElement('span');
h.className = options.className;
h[txtProp] = matches[i];
frag.appendChild(h);
}
}
// replace the original text node
node.parentNode.replaceChild(frag, node);
};
});
};
$.fn.jhighlight.defaults = {
find:'',
className:'jhighlight',
color:'#FFF77B',
caseSensitive:false,
wrappingTag:'span'
};
})(jQuery);
If you're doing any manipulation on the page, you might want to replace the caching with another clean-up mechanism, not trivial though.
You can see the code working here: http://jsbin.com/anace5/2/
You also need to add display:block to your new html elements, the layout is broken on a few browsers.

In the javascript code prettifier, I had this problem. I wanted to search the text but preserve tags.
What I did was start with HTML, and decompose that into two bits.
The text content
Pairs of (index into text content where a tag occurs, the tag content)
So given
Lorem <b>ipsum</b>
I end up with
text = 'Lorem ipsum'
tags = [6, '<b>', 11, '</b>']
which allows me to search on the text, and then based on the result start and end indices, produce HTML including only the tags (and only balanced tags) in that range.
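The decomposition step can be sketched roughly as follows. This is not the prettifier's actual code: the function name decompose and the simple tag pattern are assumptions for illustration, each offset is the number of text characters that precede the tag, and entities, comments and attribute values containing ">" are not handled.

```javascript
// Split an HTML string into its text content and a flat list of
// (offset into text, tag source) pairs.
// (Hypothetical helper; a simplified sketch of the idea described above.)
function decompose(html) {
  var tagPattern = /<[^>]*>/g; // naive: assumes no ">" inside attribute values
  var text = '';
  var tags = [];
  var last = 0; // index in `html` just after the previous tag
  var m;
  while ((m = tagPattern.exec(html)) !== null) {
    text += html.slice(last, m.index); // text between the previous tag and this one
    tags.push(text.length, m[0]);      // where the tag sits relative to the text
    last = m.index + m[0].length;
  }
  text += html.slice(last); // trailing text after the final tag
  return { text: text, tags: tags };
}
```

A search can then run against the text property alone, and the start and end indices of a hit can be compared with the recorded offsets to decide which tags fall inside the matched range.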

Have a look here: getElementsByTagName() equivalent for textNodes.
You can probably adapt one of the proposed solutions to your needs (i.e. iterate over all text nodes, replacing the words as you go - this won't work in cases such as <tag>wo</tag>rd but it's better than nothing, I guess).

I believe you could just do:
$('#article :not(:has(*))').jhighlight({find : 'class'});
Since it grabs all leaf nodes in the article, it would require valid XHTML; that is, it would only match the link in the following example:
<p>This is some paragraph content with a <a href="...">link</a></p>
DOM traversal and selector application could slow things down a bit, so it might be good to cache the result:
var article_nodes = article_nodes || $('#article :not(:has(*))');
article_nodes.jhighlight({find : 'class'});

Maybe something like this could be helpful:
>+[^<]*?(s(<[\s\S]*?>)?e(<[\s\S]*?>)?e)[^>]*?<+
The first part >+[^<]*? finds > of the last preceding tag
The third part [^>]*?<+ finds < of the first subsequent tag
In the middle we have (<[\s\S]*?>)? between characters of our search phrase (in this case - "see").
After regular expression searching you could use the result of the middle part to highlight search phrase for user.
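To avoid writing the middle part by hand for every phrase, it could be generated. The sketch below is an assumption-laden illustration, not the answerer's code: buildTagTolerantPattern and escapeRegExp are hypothetical names, only the middle part of the expression is built, and it allows any number of tags between characters rather than at most one.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a pattern that matches `phrase` even when tags are interleaved
// between its characters, e.g. "see" inside "s<b>e</b>e".
// (Hypothetical helper for illustration.)
function buildTagTolerantPattern(phrase) {
  var chars = phrase.split('').map(escapeRegExp);
  // Between any two characters, allow zero or more complete tags.
  return new RegExp(chars.join('(?:<[^>]*>)*'), 'g');
}
```

For example, 'We s<b>e</b>e it'.match(buildTagTolerantPattern('see')) finds the run "s<b>e</b>e". Note that a pattern like this, applied to raw HTML, can also match text inside attribute values, so it carries the same risks as the other innerHTML-based approaches discussed here.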

Javascript: Changing color of every "r" in html document

EDIT [how can I] change the color of every R and r in my HTML document with javascript?

I'd use the highlight plugin for jQuery. Then do something like:
$('*').highlight('r'); // Not sure if it's case-insensitive or not
and in CSS:
.highlight { background-color: yellow; }

Doable, but not super easy. There's no CSS way to do it.
Basically, you'll need to use JavaScript and iterate through all the nodes. If it's a text node, you can search it for "R" and then replace the R with a <span style="color:red">R</span>
I am obviously simplifying this a bit, it's probably better to just dynamically add a "highlight" class, rather than hard code a style, and have that defined in CSS. Similarly, I'm sure you'll wanna parameterize the search string. Also, this doesn't take into account what the text node is, for instance, I have special handling to skip comments, but you'll probably find there's other things (script nodes?) you also need to skip.
function updateNodes(node) {
if (node.nextSibling)
updateNodes(node.nextSibling);
if (node.nodeType ==8) return; //Don't update comments
if (node.firstChild)
updateNodes(node.firstChild);
if (node.nodeValue) { // update me
if (node.nodeValue.search(/[Rr]/) > -1){ // does the text node have an R
var span=document.createElement("span");
var remainingText = node.nodeValue;
var newValue='';
while (remainingText.search(/[Rr]/) > -1){ //Crawl through the node finding each R
var rPos = remainingText.search(/[Rr]/);
var bit = remainingText.substr(0,rPos);
var r = remainingText.substr(rPos,1);
remainingText=remainingText.substr(rPos+1);
newValue+=bit;
newValue+='<span style="color:red">';
newValue+=r;
newValue+='</span>';
}
newValue+=remainingText;
span.innerHTML=newValue;
node.parentNode.insertBefore(span,node);
node.parentNode.removeChild(node);
}
}
}
function replace() {
updateNodes(document.body);
}

Yes this is possible with a little Javascript, a smattering of CSS and some regex.
First, you need to define a style which provides the colour you require (in my example below I refer to a CSS class called "new-colour"), and then run some regex over your HTML content which does a search and replace. You are looking to change all 'r' and 'R' characters into something like this (as an example):
<span class="new-colour">r</span>
If you don't know regex, there are oodles of resources out there to get you started. You will be pleased to know that your requirement is very simple, so no worries there. Here are a couple of links:
regexlib.com
8 regular expressions you should know

You would need to use the DOM (or jQuery) to iterate through every text node in the document. Whenever you find the letter R, apply a transformation that wraps the letter in an appropriate element.
e.g. Transform the text node "art" into "a<span class="colored">r</span>t". This adds two new text nodes, "r" and "t", and the new span element.
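That transformation can be sketched without going through innerHTML at all. In the code below, segmentText and wrapMatches are hypothetical names chosen for this example and the span class name is passed in by the caller. segmentText is a pure function; wrapMatches uses only standard DOM calls (createDocumentFragment, createTextNode, createElement, replaceChild) and is meant to run in a browser.

```javascript
// Split `text` into an ordered list of pieces, flagging the pieces that
// match `pattern`. (Hypothetical helper for illustration.)
function segmentText(text, pattern) {
  // Copy the pattern with the global flag so exec() advances through the text.
  var re = new RegExp(pattern.source, 'g' + (pattern.ignoreCase ? 'i' : ''));
  var segments = [];
  var last = 0;
  var m;
  while ((m = re.exec(text)) !== null) {
    if (m[0].length === 0) { re.lastIndex++; continue; } // skip empty matches
    if (m.index > last) segments.push({ text: text.slice(last, m.index), match: false });
    segments.push({ text: m[0], match: true });
    last = m.index + m[0].length;
  }
  if (last < text.length) segments.push({ text: text.slice(last), match: false });
  return segments;
}

// Replace one text node with text nodes and <span class="..."> wrappers.
function wrapMatches(textNode, pattern, className) {
  var doc = textNode.ownerDocument;
  var frag = doc.createDocumentFragment();
  var segments = segmentText(textNode.nodeValue, pattern);
  for (var i = 0; i < segments.length; i++) {
    var piece = doc.createTextNode(segments[i].text);
    if (segments[i].match) {
      var span = doc.createElement('span');
      span.className = className;
      span.appendChild(piece);
      piece = span;
    }
    frag.appendChild(piece);
  }
  textNode.parentNode.replaceChild(frag, textNode);
}
```

Calling wrapMatches(node, /r/i, 'colored') on the text node "art" yields the text node "a", a span containing "r", and the text node "t". Because the matched text is inserted as a text node, characters such as "<" or "&" in the original text are never re-parsed as markup. When walking a whole document, collect the text nodes first and wrap them afterwards, since replacing nodes during the walk changes the tree being traversed.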

The highlight plugin for jQuery is one option. Another option, especially since tomorrow you might want to extend your highlighting to keywords or other terms, is to use Google's Closure goog.dom.annotate class. The beauty of this class is that it actually parses the DOM tree properly and annotates ONLY the relevant terms. It will also allow you to EXCLUDE elements or elements with certain classes.
A common problem with annotations is that you can mess up your HTML if you are not careful.
For example, take the 'simple solution' posted in another answer:
var body = document.getElementsByTagName("body")[0];
var html = body.innerHTML
.replace(/(^|>[^<rR]*)([rR])/g, "$1<em>$2</em>");
body.innerHTML = html;
will surely also capture terms in any style attributes. If you had this:
<p class="red">text......</p>
It will become
<p class="<span class="red">r</span>ed .....
that will break your html.
In general DOM parsing is 'slow', so try to avoid annotating the whole body of a web page. Ask yourself why you need to highlight only the R's. Actually, I am curious: why do you want to annotate the r's? :)

Plain JS solution without need of any 20kB JS library:
var body = document.getElementsByTagName("body")[0];
var html = body.innerHTML
.replace(/(^|>[^<rR]*)([rR])/g, "$1<em>$2</em>");
body.innerHTML = html; // note that you will lose all
// event handlers in this step...

Use Javascript E4X to selectively rename XML tags

I am using javascript to manipulate XML in a non-browser context (no DOM), and am looking for an E4X expression to rename a list of tags. For any given tag, I don't necessarily know ahead of time what it is called, and I only want to rename it if it contains a given substring.
As a very contrived example, I may have:
someXML = <Favourites>
<JillFaveColour>Blue</JillFaveColour>
<JillFaveCandy>Smarties</JillFaveCandy>
<JillFaveFlower>Rose</JillFaveFlower>
</Favourites>
and I want to turn the XML into:
<Favourites>
<GaryFaveColour>Blue</GaryFaveColour>
<GaryFaveCandy>Smarties</GaryFaveCandy>
<GaryFaveFlower>Rose</GaryFaveFlower>
</Favourites>
However, there may be more tags or fewer tags, and I won't know ahead of time what their full name is. I only rename them if they contain a given substring (in my example, the substring is "Jill").

For renaming elements, use setLocalName(newName). For your "I don't know all the tag names in advance" problem, just iterate over the elements and call their localName() methods (if node.length() === 1 && node.nodeKind() === "element") to get their tag names.

Something like:
var children = someXML.children();
// length is a method on an E4X XMLList, not a property
for (var i = children.length(); i-- > 0;)
    if (children[i].nodeKind() === 'element')
        children[i].setLocalName(children[i].localName().split('Jill').join('Gary'));

Consider adding then deleting those nodes manually?
//msg is your xml object
msg['Favourites']['GaryFaveColour'] = msg['Favourites']['JillFaveColour'];
msg['Favourites']['GaryFaveCandy'] = msg['Favourites']['JillFaveCandy'];
msg['Favourites']['GaryFaveFlower'] = msg['Favourites']['JillFaveFlower'];
//now del Jill
delete msg['Favourites']['JillFaveColour'];
delete msg['Favourites']['JillFaveCandy'];
delete msg['Favourites']['JillFaveFlower'];
