How to get HTML data with javascript

I have an HTML web page full of div and span tags, identified by class, that hold lots of data I need in another format. I was wondering what would be the best way to do this with javascript.
Thank you for the help.

The fastest way? jQuery:
$(".myClass").each(function() {
    // work with your data here
});

More low-level, but it should be a lot faster (much less overhead):
var myelements = document.evaluate('//div[@class="myClass"]', document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
for (var i = 0; i < myelements.snapshotLength; i++) {
    var dataElement = myelements.snapshotItem(i);
    // work with your data here
}
(OK, you'd have to do it twice, once for div and once for span. It's more code and doesn't look as nice, but it should still be faster.)
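One caveat with an exact comparison on the class attribute is that it only matches when the attribute is exactly that one class. A minimal sketch of building an expression that matches the class as one token among several, and covers div and span in a single query, might look like this. The helper name classXPath is made up for illustration.

```javascript
// Hypothetical helper: builds an XPath expression that matches elements whose
// class attribute contains the given class as a whole, space-separated token.
function classXPath(tagName, className) {
    return '//' + tagName +
        '[contains(concat(" ", normalize-space(@class), " "), " ' + className + ' ")]';
}

// Two paths joined with "|" form a union, so one evaluate() call covers both tags.
var expression = classXPath('div', 'myClass') + ' | ' + classXPath('span', 'myClass');
```

The resulting string can be passed as the first argument of document.evaluate in place of the hard-coded expression.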

If you want to get at all of the elements with a specific class, then you will need to test for the presence of that class on each element. Select everything with document.getElementsByTagName("*") and loop through the result, testing each element's class name:
var all = document.getElementsByTagName("*"); // This should select everything
var classPattern = /(^|\s)myClass(\s|$)/; // matches myClass as a whole class name
for (var i = 0; i < all.length; i++) {
    if (classPattern.test(all[i].className)) {
        // you found an element that matches
        // do what you will with it.
    }
}
Once the loop finishes, you have processed every element on the page and handled the ones that match your criteria. Good luck.
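The regex test mentioned above could be sketched as a small helper like the one below. The name hasClass is hypothetical, and the sketch assumes the class name contains no characters that are special in a regular expression.

```javascript
// Hypothetical helper: true when `wanted` appears as a whole class name
// inside a space-separated class attribute value.
// Assumes `wanted` has no regex metacharacters.
function hasClass(classAttribute, wanted) {
    var pattern = new RegExp('(^|\\s)' + wanted + '(\\s|$)');
    return pattern.test(classAttribute);
}
```

Inside the loop over all elements, the check would then read hasClass(element.className, "myClass").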

Related

Scraping data from HTML using JavaScript RegExp [duplicate]

I'm trying to figure out how to, in raw javascript (no jQuery, etc.), find an element with specific text and modify that text.
My first incarnation of the solution... is less than adequate. What I did was basically:
var x = document.body.innerHTML;
x.replace(/regular-expression/,"text");
document.body.innerHTML = x;
Naively I thought I succeeded with flying colors, especially since it was so simple. So then I added an image to my example and thought I could check every 5 seconds (because this string may enter the DOM dynamically)... and the image flickered every 5 seconds.
Oops.
So, there has to be a correct way to do this. A way that specifically singles out a specific DOM element and updates the text portion of that DOM element.
Now, there's always "recursively search through the children till you find the deepest child with the string" approach, which I want to avoid. And even then, I'm skeptical about "changing the innerHTML to something different" being the correct way to update a DOM element.
So, what's the correct way to search through the DOM for a string? And what's the correct way to update a DOM element's text?

Now, there's always "recursively search through the children till you find the deepest child with the string" approach, which I want to avoid.
I want to search for an element in an unordered random list. Now, there's a "go through all the elements till you find what you're looking for approach", which I want to avoid.
Old-timer magno tape, record, listen, meditate.
Btw, see: Find and replace text with JavaScript on James Padolsey's github
(also his blog articles explaining it)

Edit: Changed querySelectorAll to getElementsByTagName from RobG's suggestion.
You can use the getElementsByTagName function to grab all of the tags on the page. From there, you can check their children and see if they have any Text Nodes as children. If they do, you'd then look at their text and see if it matches what you need. Here is an example that will print out the text of every Text Node in your document with the console object:
var elms = document.getElementsByTagName("*"),
    len = elms.length;
for (var ii = 0; ii < len; ii++) {
    var myChildren = elms[ii].childNodes,
        len2 = myChildren.length;
    for (var jj = 0; jj < len2; jj++) {
        if (myChildren[jj].nodeType === 3) { // 3 is Node.TEXT_NODE
            console.log(myChildren[jj].nodeValue);
            // example of updating a text node's value
            myChildren[jj].nodeValue = myChildren[jj].nodeValue.replace(/test/, "123");
        }
    }
}
To update a DOM element's text, simply update the nodeValue property of the Text Node.

Don't use innerHTML with a regular expression, it will almost certainly fail for non-trivial content. Also, there are still differences in how browsers generate it from the live DOM. Replacing the innerHTML will also remove any event listeners added as element properties (i.e. like element.onclick = fn).
It is best if you can have the string enclosed in an element with an attribute or property you can search on (id, class, etc.) but failing that, a search of text nodes is the best approach.
Edit
Attempting a general purpose text selection function for an HTML document may result in a very complex algorithm since the string could be part of a complex structure, e.g.:
<h1>Some <span class="foo"><em>s</em>pecial</span> heading</h1>
Searching for the string "special heading" is tricky as it is split over 2 elements. Wrapping it in another element (say for highlighting) is also not trivial since the resulting DOM structure must be valid. For example, the text matching "some special" in the above could be wrapped in a span but not a div.
Any such function must be accompanied by documentation stating its limitations and most appropriate use.
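To illustrate why the split matters, here is a rough sketch that concatenates the text of all descendant text nodes, much like an element's textContent property does. The function name collectText is made up, and the sketch only relies on the nodeType, nodeValue and childNodes properties.

```javascript
// Hypothetical helper: joins the text of every descendant text node in document order.
function collectText(node) {
    if (node.nodeType === 3) { // 3 is Node.TEXT_NODE
        return node.nodeValue;
    }
    var text = '';
    var children = node.childNodes || [];
    for (var i = 0; i < children.length; i++) {
        text += collectText(children[i]);
    }
    return text;
}
```

For the heading above, the joined text contains "special heading", even though no single text node does. Finding the match this way is the easy part; mapping it back to the individual nodes in order to change or wrap it is where the complexity comes in.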

Forget regular expressions.
Iterate over each text node (and doing it recursively will be the most elegant) and modify the text nodes if the text is found. If just looking for a string, you can use indexOf().
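A minimal sketch of that recursive walk, using indexOf for the search, might look like the following. The function name replaceInTextNodes is hypothetical, and the sketch assumes nodes expose the standard nodeType, nodeValue and childNodes properties.

```javascript
var TEXT_NODE = 3; // same value as Node.TEXT_NODE in browsers

// Hypothetical helper: walks the tree below `node` and replaces every occurrence
// of `search` inside text nodes. Returns the number of text nodes changed.
function replaceInTextNodes(node, search, replacement) {
    var changed = 0;
    if (node.nodeType === TEXT_NODE) {
        if (node.nodeValue.indexOf(search) !== -1) {
            // split/join replaces all occurrences without needing a regex
            node.nodeValue = node.nodeValue.split(search).join(replacement);
            changed++;
        }
        return changed;
    }
    var children = node.childNodes || [];
    for (var i = 0; i < children.length; i++) {
        changed += replaceInTextNodes(children[i], search, replacement);
    }
    return changed;
}
```

In a page this would be called as replaceInTextNodes(document.body, "test", "123"). Because only text nodes are touched, elements such as images are not re-created, so nothing flickers and event listeners stay attached.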

x.replace(/regular-expression/,"text");
returns a new string rather than changing x, so store the result:
var y = x.replace(/regular-expression/,"text");
Now you can assign the new value:
document.body.innerHTML = y;
But think about this: you don't want to fetch the whole body just to change one small piece of it. Why not work with the content of a div or another specific element instead?
Example:
<p id='paragraph'>
... some text here ...
</p>
Now you can use JavaScript:
var para = document.getElementById('paragraph');
var newPara = para.innerHTML.replace(/regex/,'new content');
para.innerHTML = newPara;
This should be the simplest way.

Appending Clones of HTML Form to Document body Respecting IDs

I have an html form which when signed I'd like to create a number of copies of to display directly after one another. My poor cloning function looks like this:
function formCloner(numCopies) {
    let cloneContainer = document.createElement("div");
    cloneContainer.id = "formCopies";
    for (let i = 0; i < numCopies; i++) {
        // grab the whole html form
        let html = document.getElementsByTagName("html")[0];
        let htmlClone = html.cloneNode(true);
        /*
         * do some other stuff to the clone
         */
        cloneContainer.appendChild(htmlClone);
    }
    document.body.appendChild(cloneContainer);
}
One of the big problems with this approach is that in the end all of the form elements copied over share the same ID. Which is bad. I thought about running through each child node and changing the IDs manually but that seemed like overkill. There must be a simpler answer, can anyone recommend a painless way to achieve the end result of having copies of the form appended to the document body?

This is pretty easy... Simply target all elements with IDs, and add _# to them. Also do the same on the for attribute, since they are the only real reason why you'd want to play with IDs in the first place.
htmlClone.querySelectorAll("[id]").forEach(elem=>elem.setAttribute('id',elem.getAttribute('id')+"_"+i));
htmlClone.querySelectorAll("[for]").forEach(elem=>elem.setAttribute('for',elem.getAttribute('for')+"_"+i));
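The two lines above could be wrapped in a reusable function and called inside the cloning loop, roughly as sketched here. The name suffixIds is made up for illustration; the sketch assumes the clone's own root element carries no id, since querySelectorAll only looks at descendants.

```javascript
// Hypothetical helper: appends "_<suffix>" to every id and every matching
// "for" attribute below `root`, so labels keep pointing at their own copy.
function suffixIds(root, suffix) {
    root.querySelectorAll('[id]').forEach(function (elem) {
        elem.setAttribute('id', elem.getAttribute('id') + '_' + suffix);
    });
    root.querySelectorAll('[for]').forEach(function (elem) {
        elem.setAttribute('for', elem.getAttribute('for') + '_' + suffix);
    });
    return root;
}
```

Inside the loop it would be called as suffixIds(htmlClone, i) before the clone is appended to the container.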

Identify all css ID's and classes used in a particular HTML page from a specific stylesheet?

I have been assigned the task of cleaning up over 5,000 lines of CSS. During the development of this project, we basically just appended classes to this massive CSS file.
I am tasked with organizing classes according to their pages, and then prepending the page ID to them to make them only hit that page.
However, what makes this tedious is manually finding out what classes and ID's are in a page, and then organizing them.
Is there a way to "dump" all the classes from a stylesheet, ONLY on that page, into a file or something?

To get an array of classes/id's in use, loop over every element in the document and populate an array. Obviously this isn't very efficient, but considering the task at hand, I doubt that is a concern.
var idArr = [];
var classArr = [];
[].forEach.call(document.querySelectorAll("*"), function(element){
    if (element.id && idArr.indexOf(element.id) == -1) {
        idArr.push(element.id);
    }
    if (element.className) {
        var tempClassArr = element.className.split(" ");
        for (var i = 0; i < tempClassArr.length; i++) {
            if (classArr.indexOf(tempClassArr[i]) == -1) {
                classArr.push(tempClassArr[i]);
            }
        }
    }
});
console.log(idArr);
console.log(classArr);
http://jsfiddle.net/QLmNf/1/
This will give you an array of ID's and an array of classes on the page. You would then need to compare this to the style declarations to find the ones that match. This of course won't cut out declarations that use these id's and classes but don't actually match any elements on the page; you might be able to solve that problem with the audits tab suggested by @jordanforeman.
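The comparison step could be sketched roughly as follows. The function name selectorsInUse is hypothetical, and the token regex is a deliberate simplification: it treats anything that looks like .name or #name as a class or id, so attribute values such as ".png" would confuse it.

```javascript
// Hypothetical helper: keeps the selectors whose #id and .class tokens
// all appear in the lists collected from the page.
function selectorsInUse(selectors, idArr, classArr) {
    return selectors.filter(function (selector) {
        // Pull out every #id and .class token mentioned in the selector.
        var tokens = selector.match(/[#.][A-Za-z_][\w-]*/g) || [];
        // Selectors without any class or id are not of interest here.
        return tokens.length > 0 && tokens.every(function (token) {
            var name = token.slice(1);
            var list = token.charAt(0) === '#' ? idArr : classArr;
            return list.indexOf(name) !== -1;
        });
    });
}
```

In a browser, the selector strings might be read from the selectorText property of the rules in document.styleSheets[n].cssRules, for stylesheets the page is allowed to read.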

Select tags that starts with "x-" in jQuery

How can I select nodes that begin with a "x-" tag name, here is an hierarchy DOM tree example:
<div>
<x-tab>
<div></div>
<div>
<x-map></x-map>
</div>
</x-tab>
</div>
<x-footer></x-footer>
jQuery does not allow me to query $('x-*'), is there any way that I could achieve this?

The code below works fine, though I am not sure about performance since I am using a regex.
$('body *').filter(function(){
    return /^x-/i.test(this.nodeName);
}).each(function(){
    console.log(this.nodeName);
});
Working fiddle
PS: In the above sample, I am treating the body tag as the parent element.
UPDATE:
After checking Mohamed Meligy's post, it seems a regex is faster than string manipulation in this case, and it could become faster still (or stay the same) if we use find. Something like this:
$('body').find('*').filter(function(){
    return /^x-/i.test(this.nodeName);
}).each(function(){
    console.log(this.nodeName);
});
jsperf test
UPDATE 2:
If you want to search the whole document, you can do the following, which is the fastest:
$(Array.prototype.slice.call(document.all)).filter(function () {
    return /^x-/i.test(this.nodeName);
}).each(function(){
    console.log(this.nodeName);
});
jsperf test
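For comparison, the same filtering idea can be sketched without jQuery. The function name filterByTagPrefix is made up; it accepts any array-like collection of elements and is not claimed to be faster than the variants above.

```javascript
// Hypothetical helper: returns the elements whose tag name starts with `prefix`,
// ignoring case (HTML documents report nodeName in upper case).
function filterByTagPrefix(elements, prefix) {
    var wanted = prefix.toLowerCase();
    return Array.prototype.filter.call(elements, function (element) {
        return element.nodeName.toLowerCase().indexOf(wanted) === 0;
    });
}
```

In a browser it would be called as filterByTagPrefix(document.getElementsByTagName("*"), "x-").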

There is no native selector for this, and any workaround has poor performance, so just filter the elements yourself.
Example:
var results = $("div").find("*").filter(function(){
return /^x\-/i.test(this.nodeName);
});
Full example:
http://jsfiddle.net/6b8YY/3/
Notes: (Updated, see comments)
If you are wondering why I use this way for checking tag name, see:
JavaScript: case-insensitive search
and see comments as well.
Also, if you are wondering about the find method instead of adding to selector, since selectors are matched from right not from left, it may be better to separate the selector. I could also do this:
$("*", $("div")). Preferably though instead of just div add an ID or something to it so that parent match is quick.
In the comments you'll find a proof that it's not faster. This applies to very simple documents though I believe, where the cost of creating a jQuery object is higher than the cost of searching all DOM elements. In realistic page sizes though this will not be the case.
Update:
I also really like Teifi's answer. You can do it in one place and then reuse it everywhere. For example, let me mix my way with his:
// In some shared libraries location:
$.extend($.expr[':'], {
    x : function(e) {
        // e is the element being tested
        return /^x\-/i.test(e.nodeName);
    }
});
// Then you can use it like:
$(function(){
    // One way
    var results = $("div").find(":x");
    // But even nicer, you can mix with other selectors
    // Say you want to get <a> tags directly inside x-* tags inside <section>
    var anchors = $("section :x > a");
    // Another example to show the power, say using a class name with it:
    var highlightedResults = $(":x.highlight");
    // Note I made the CSS class right most to be matched first for speed
});
It's the same performance hit, but more convenient API.

It might not be efficient, but consider it as a last option if you do not get any answer.
Try adding a custom attribute to these tags. What I mean is that when you add a tag, e.g. <x-tag>, you also give it a custom attribute with the same value as the tag name, so the HTML looks like <x-tag CustAttr="x-tag">.
Now to get tags starting with x-, you can use the following jQuery code:
$("[CustAttr^='x-']")
and you will get all the tags that start with x-.

Custom jQuery selector:
jQuery(function($) {
    $.extend($.expr[':'], {
        X : function(e) {
            return /^x-/i.test(e.tagName);
        }
    });
});
Then use $(":X") or $("*:X") to select your nodes.

Although this does not answer the question directly, it could provide a solution: by listing the tags explicitly in the selector, you can get all elements of those types.
$('x-tab, x-map, x-footer')

Workaround: if you need this more than once, it might be a lot more efficient to add a class based on the tag name, which you only do once at the beginning, and then filter for the tag the trivial way.
What I mean is:
function addTagMarks() {
    // call when the document is ready, or when you have new tags
    var prefix = "tag--"; // choose a prefix that avoids collision
    // skip what's done already; *= is needed because the mark may not be the first class
    var newbies = $("*").not("[class*='" + prefix + "']");
    newbies.each(function() {
        var tagName = $(this).prop("tagName").toLowerCase();
        $(this).addClass(prefix + tagName);
    });
}
After this, you can do a $("[class*='tag--x-']") or the same thing with querySelectorAll and it will be reasonably fast.

See if this works!
function getXNodes() {
    // anchored and case-insensitive: nodeName is upper case in HTML documents
    var regex = /^x-/i, i = 0, totalnodes = [];
    while (i !== document.all.length) {
        if (regex.test(document.all[i].nodeName)) {
            totalnodes.push(document.all[i]);
        }
        i++;
    }
    return totalnodes;
}

Demo Fiddle
var i = 0;
for (i = 0; i < document.all.length; i++) {
    // === 0 so that only names starting with "x-" match
    if (document.all[i].nodeName.toLowerCase().indexOf('x-') === 0) {
        $(document.all[i]).addClass('test');
    }
}

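Try this:
var test = $('[x-]'); // note: [x-] is an attribute selector, so it matches elements that have an attribute named "x-"
if (test.length) // a jQuery object is always truthy, so check its length
    alert('eureka!');
Basically, a jQuery selector works like a CSS selector.
Read the jQuery selector API documentation.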

How to get the page's full content as string in javascript?

I'm writing a bookmarklet, i.e. a bookmark that contains javascript instead of a URL, and I have some trouble. In fact, I cannot remember how I can get the content of the page as a string, so I can apply a regular expression to find what I want. Can you please help me on this?
Before anyone suggests it, I cannot use getElementBy(Id/Name/Tag), because the data I'm looking for is HTML-commented and inside markups, so I don't think that would work.
Thanks.

You can access it through:
document.body.innerHTML

so I can apply a regular expression to find what I want
Do. Not. Use. Regex. To. Parse. HTML.
Especially when the browser has already parsed it for you! Come ON!
the data I'm looking for is HTML-commented
You can perfectly well grab comment content out of the DOM. eg.
<div id="mything"><!-- la la la I'm a big comment --></div>
alert(document.getElementById('mything').firstChild.data);
And if you need to search the DOM for comment elements:
// Get comment descendants
//
function dom_getComments(parent, recurse) {
    var results= [];
    for (var childi= 0; childi<parent.childNodes.length; childi++) {
        var child= parent.childNodes[childi];
        if (child.nodeType==8) // Node.COMMENT_NODE
            results.push(child);
        else if (recurse && child.nodeType==1) // Node.ELEMENT_NODE
            // pass recurse along so the search continues below the first level
            results= results.concat(dom_getComments(child, recurse));
    }
    return results;
}
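Once the comment nodes are collected, a regular expression can be applied to each comment's text instead of to the page's HTML. A small sketch, with the hypothetical name matchComments and assuming a pattern without the g flag so that capture groups are returned:

```javascript
// Hypothetical helper: runs `pattern` against the text of each comment node
// and returns the successful match results (with capture groups).
function matchComments(comments, pattern) {
    var matches = [];
    for (var i = 0; i < comments.length; i++) {
        var found = comments[i].data.match(pattern);
        if (found) {
            matches.push(found);
        }
    }
    return matches;
}
```

In a bookmarklet this might be used as matchComments(dom_getComments(document.body, true), /some-pattern/), where the pattern is whatever describes the commented data.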
