I currently use Jsoup to parse the DOM-tree of webpages. I have a separate application rendering the page and using JavaScript to extract the rendering position of every DOM-node. I use the JavaFX stage, webEngine, webView and executeScript functionality to execute the following JavaScript:
var all = document.getElementsByTagName("*");
var serialization = "";
var width = window.innerWidth;
var height = window.innerHeight;
for (var i = 0, max = all.length; i < max; i++) {
  var node = all.item(i);
  serialization += node.tagName + ": " + node.offsetLeft + " " + node.offsetTop + " " + node.offsetWidth / width + " " + node.offsetHeight / height + "\n";
}
serialization
The problem I face now is how to associate the output of the JavaScript with the information I collect through Jsoup, i.e. I want to add the rendering position of every node to the Jsoup data structure. Is there some unique ID for each DOM node that I don't know about, or should I try a completely different approach?
I think you can use an XPath expression as the unique identifier in both the browser DOM and Jsoup.
Refer to the question "How to calculate the XPath position of an element using Javascript?" to get the XPath in JavaScript.
Follow the same idea to write a Java function that builds the XPath from a Jsoup element.
Then you can match nodes by comparing the two XPath expressions. Hope this clarifies.
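A minimal sketch of the JavaScript half, assuming a purely positional path (lower-cased tag name plus 1-based index among same-tag siblings) is good enough as a key. The function name positionalXPath is made up for illustration, and it only relies on tagName, parentNode and children, so it can be tried on plain objects as well as real DOM elements:

```javascript
// Build a positional path such as /html[1]/body[1]/div[2] for an element.
// Assumption: each node exposes tagName, parentNode and children.
function positionalXPath(element) {
  var steps = [];
  // Walk upwards until we leave the element tree (the document has no tagName).
  for (var node = element; node && node.tagName; node = node.parentNode) {
    var index = 1;
    var parent = node.parentNode;
    var siblings = parent && parent.children ? parent.children : [];
    // Count preceding siblings that share this node's tag name.
    for (var i = 0; i < siblings.length; i++) {
      if (siblings[i] === node) break;
      if (siblings[i].tagName === node.tagName) index++;
    }
    steps.unshift(node.tagName.toLowerCase() + "[" + index + "]");
  }
  return "/" + steps.join("/");
}

// Possible use inside the serialization loop from the question:
// serialization += positionalXPath(all.item(i)) + ": " + all.item(i).offsetLeft + "\n";
```

The Jsoup side would apply the same counting rule using Element.parent(), Element.children() and Element.tagName(). The two paths only line up where both parsers build the same tree; nodes that scripts add at render time have no Jsoup counterpart.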
I would like to read a query string in JavaScript and then modify a link that will be rendered in HTML; however, I am rendering the HTML as part of a Liquid loop. So I am not sure how to read the query string in JavaScript, store its value in a variable, and show it in the HTML that is rendered as part of the Liquid loop.
I am still new to Liquid so any help would be appreciated. I am using this as part of Dynamics 365 portals.
If I understand correctly, you could just use JavaScript to get or create an HTML element and then edit that element as you wish:
var x = document.getElementById("myVar");
//Use var x to edit this element here
//OR
var x = document.createElement("div"); // createElement takes a tag name, not an id
//Use var x to edit this element here
document.getElementById is used when your element already exists in the HTML, whereas document.createElement is used when you'd like to make a new element rather than use one that's already there.
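For the query string part of the question, here is a rough sketch, assuming the browser supports URLSearchParams (older IE versions do not). The helper names and the "id" parameter are hypothetical. Since Liquid renders on the server, the idea is to let the loop output the links as usual and patch their href attributes afterwards on the client:

```javascript
// Read one value from a query string such as "?id=42&lang=en".
// Returns null when the parameter is absent.
function getQueryValue(search, name) {
  return new URLSearchParams(search).get(name);
}

// Append name=value to an href, respecting an existing query string.
function withQueryValue(href, name, value) {
  if (value === null) return href; // nothing to add
  var separator = href.indexOf("?") === -1 ? "?" : "&";
  return href + separator + encodeURIComponent(name) + "=" + encodeURIComponent(value);
}

// In the page (hypothetical parameter name and link lookup):
// var id = getQueryValue(window.location.search, "id");
// var link = document.getElementById("myVar");
// link.href = withQueryValue(link.getAttribute("href"), "id", id);
```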
I want to take a string that is legal HTML and extract some data from it, based on tags and their attributes. I know that this is possible with jQuery and that it has several built-in methods for this, but I'm trying out Angular and I want to avoid using jQuery unless I really, really need to. Does Angular provide its own set of functions for this?
You can do this even with plain JavaScript. We could answer much more specifically if you showed us exactly what you're trying to extract from the HTML string, but here's a working snippet that shows the basic concept:
var htmlStr = '<div><div class="item">Bear</div><div class="item">Wolf</div></div>';
var div = document.createElement("div");
div.innerHTML = htmlStr;
var items = div.querySelectorAll(".item");
for (var i = 0; i < items.length; i++) {
document.write(items[i].innerHTML + "<br>");
}
Angular contains a subset of jQuery called jqLite which is documented here. The .find() in jqLite is limited to only search for tag names so .querySelectorAll() which is built into all modern browsers these days would be much more capable.
I'm fairly new to Jsoup, and I've had no issues parsing with Element.select on tags or id values. The issue I'm having is how to screen-scrape JavaScript code in the page. Here I load the document:
Document doc = Jsoup.connect(pageUrl)
.userAgent(Agent)
.timeout(5000)
.get();
The JavaScript field values I'm trying to extract are the following:
arrayGPSLocation["0"] = "-19473982376,6848295867";
arrayGPSLocation["1"] = "-19473982376,6848296245";
Since these array values are not inside a standard tag (<>), is Jsoup the appropriate way to do this? I like Jsoup's API. The only other method is hacking together a String routine...
i.e.:
int start = pageBuffer.indexOf("arrayGPSLocation[\"" + counter + "\"]");
int end = pageBuffer.indexOf(";", start); // search from start, not from the top of the page
String result = pageBuffer.substring(start, end);
This pseudo-code example would have a serious performance problem when parsing a large page. Does anyone know how to accomplish this with Jsoup, or should I write my own scraper?
All you can do with Jsoup is select the element that contains the JavaScript code, get its content as a String, and work with that string, just as you do in your example.
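Once you have the script text as a string (in Jsoup that would come from something like doc.select("script") and Element.data()), a single regular expression pass avoids the repeated indexOf scans. The sketch below is written in JavaScript to match the other snippets here; the function name is made up, and the same pattern should carry over to java.util.regex with the usual escaping changes:

```javascript
// Collect every arrayGPSLocation["n"] = "..."; assignment in one pass.
function extractGpsLocations(scriptText) {
  var pattern = /arrayGPSLocation\["(\d+)"\]\s*=\s*"([^"]*)"\s*;/g;
  var locations = [];
  var match;
  while ((match = pattern.exec(scriptText)) !== null) {
    // match[1] is the array index, match[2] the quoted value.
    locations[Number(match[1])] = match[2];
  }
  return locations;
}
```

This assumes the assignments always use double quotes and literal numeric keys, as in the two sample lines from the question.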
I am actually making a Sidebar Gadget, (which is AJAX-based) and I am looking for a way to extract a single element from an AJAX Request.
The only way I have found so far is to do something like this:
var temp = document.createElement("div");
temp.innerHTML = HttpRequest.innerText;
document.body.appendChild(temp);
temp.innerHTML = document.getElementById("WantedElement").innerText;
But it is pretty ugly, I would like to extract WantedElement directly from the request without adding it to the actual document...
Thank you!
If you're in control of the data, the way you're doing it is probably the best method. Other answers here have their benefits but also they're all rather flawed. For instance, the querySelector() method is only available to Windows Desktop Gadgets running in IE8 mode on the host machine. Regular expressions are particularly unreliable for parsing HTML and should not be used.
If you're not in control of the data or if the data is not transferred over a secure protocol, you should be more concerned about security than code aesthetics -- you may be introducing potential security risks to the gadget and the host machine by inserting unsanitized HTML into the document. Since gadgets run with user or admin level privileges, the obvious security risk is untrusted source/MITM script injection, leaving a hole for malicious scripts to wreak havoc on the machine it's running on.
One potential solution is to use the htmlfile ActiveXObject:
function getElementFromResponse(html, divId)
{
  var h = new ActiveXObject("htmlfile");
  h.open();
  // disable ActiveX controls
  h.parentWindow.ActiveXObject = function () {};
  // write the html to the document
  h.write(html);
  h.close();
  // look up the requested id (the parameter, not the literal string "divID")
  return h.getElementById(divId).innerText;
}
You could also make use of IE8's toStaticHTML() method, but your gadget would need to be running in IE8 mode.
One option would be to use regular expressions:
var str = response.match(/<div id="WantedElement">(.+)<\/div>/);
str[1]; // contents of the div (str[0] is the whole match, tags included)
However, if your server response is more complex, I'd suggest using a data format like JSON for the response. It is then much cleaner to parse on the client side.
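As a rough illustration of the JSON route, assuming native JSON support is available (it is missing in IE engines older than IE8 unless you add a library) and assuming the server sends an object with a field per value you need. The function and field names are hypothetical:

```javascript
// Read a single field from a JSON response body.
// Returns null if the body is not valid JSON or the field is missing.
function readField(responseText, fieldName) {
  try {
    var data = JSON.parse(responseText);
    var hasField = data && Object.prototype.hasOwnProperty.call(data, fieldName);
    return hasField ? data[fieldName] : null;
  } catch (e) {
    return null; // not JSON at all, e.g. an HTML error page
  }
}

// e.g. readField(responseText, "wantedElement") for a body like
// {"wantedElement": "Hello"}
```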
You could append the response from XMLHttpRequest inside a hidden div, and then call getElementById to get the desired element. Later remove the div when done with it. Or maybe create a function that handles this for you.
function addNinjaNodeToDOM(html) {
  var ninjaDiv = document.createElement("div");
  ninjaDiv.innerHTML = html;
  ninjaDiv.style.display = 'none';
  document.body.appendChild(ninjaDiv); // must be attached for getElementById to find it
  return ninjaDiv;
}
var wrapper = addNinjaNodeToDOM(HttpRequest.innerText);
var requiredNode = document.getElementById("WantedElement"); // elements have no getElementById
// do something with requiredNode
document.body.removeChild(wrapper); // remove when done
The only reason for appending it to the DOM is that getElementById will not work unless the node is part of the DOM tree. See MDC.
However, you can still run selector and XPath queries on detached DOM nodes. That saves you from having to append elements to the DOM.
var superNinjaDiv = document.createElement('div');
superNinjaDiv.innerHTML = html;
var requiredNode = superNinjaDiv.querySelector("[id=someId]");
I think using getElementById to look up the element is not a good approach in this case, because of the extra steps it requires: you wrap the element in a DIV, inject it into the DOM, look up your element with getElementById, and then remove the injected DIV from the DOM.
DOM manipulation is expensive, and injection might cause an unnecessary reflow as well. The problem is that there is a document.getElementById but no element.getElementById, which would allow you to query without injecting into the document.
To solve this, querySelector is the obvious and far easier solution. Otherwise, I would suggest getElementsByClassName, provided your element has a class defined.
getElementsByClassName is defined on Element and hence can be used without injecting the element into the DOM.
Hope this helps.
It's somewhat unusual to pass HTML through an AJAX request; normally you pass a JSON string that the client can evaluate directly and work with.
That being said, I don't think there's a cross-browser way to parse HTML in JavaScript the way you want, but here's a way to do it in Mozilla derivatives:
var r = document.createRange();
r.selectNode(document.body);
var domNode = r.createContextualFragment(HttpRequest.innerText);
I'm getting an XML file and want to get data from it.
The source of the XML doesn't really matter, but what I've got to get a certain field is:
tracks = xmlDoc.getElementsByTagName("track");
variable = tracks.item(i).childNodes.item(4).childNodes.item(0).nodeValue;
Now this works like a charm, EXCEPT when there is no value in the node. So if the structure is like this:
<xml>
<one>
<two>nodeValue</two>
</one>
<one>
<two></two>
</one>
</xml>
the widget will crash on the second 'one' node, because there is no value in the 'two' node. The console says:
TypeError: tracks.item(i).childNodes.item(4).childNodes.item(0) has no properties
Any ideas on how to get the widget to just see empty as an empty string (null, empty, or ""), instead of crashing? I'm guessing something along the lines of data, getValue(), text, or something else.
Using
var track= xmlDoc.getElementsByTagName('track')[0];
var info= track.getElementsByTagName('artist')[0];
var value= info.firstChild? info.firstChild.data : '';
doesn't work and returns "TypeError: track has no properties". That's from the second line where artist is called.
Test that the ‘two’ node has a child node before accessing its data.
childNodes.item(i) (or the simpler JavaScript form childNodes[i]) should generally be avoided: it's fragile, because it relies on whitespace text nodes being in exactly the expected place.
I'd do something like:
var tracks= xmlDoc.getElementsByTagName('track')[0];
var track= tracks.getElementsByTagName('one')[0];
var info= track.getElementsByTagName('two')[0];
var value= info.firstChild? info.firstChild.data : '';
(If you don't know the tagnames of ‘one’ and ‘two’ in advance, you could always use ‘getElementsByTagName('*')’ to get all elements, as long as you don't need to support IE5, where this doesn't work.)
An alternative to the last line is to use a method that reads all the text inside the node, including that of its child nodes. This makes no difference if the node only ever contains at most one Text node, but it can be useful if the tree can get denormalised or contain EntityReferences or nested elements. Historically one had to write a recursive method to get this information, but these days most browsers support the DOM Level 3 textContent property and/or IE's innerText extension:
var value= info.textContent!==undefined? info.textContent : info.innerText;
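The checks above can be folded into two small helpers so that a missing element or an empty element both yield an empty string instead of a TypeError. This is only a sketch: the names textOf and childText are made up, and it assumes nothing beyond the textContent, innerText, firstChild.data and getElementsByTagName members already used in this thread:

```javascript
// Return the text of an element, or "" if the element is missing or empty.
function textOf(element) {
  if (!element) return "";
  if (typeof element.textContent === "string") return element.textContent;
  if (typeof element.innerText === "string") return element.innerText;
  // Fall back to the first child's data when neither property exists.
  return element.firstChild ? element.firstChild.data || "" : "";
}

// Return the text of the first descendant with the given tag name, or "".
function childText(parent, tagName) {
  if (!parent) return "";
  var matches = parent.getElementsByTagName(tagName);
  return textOf(matches.length ? matches[0] : null);
}

// e.g. var value = childText(xmlDoc.getElementsByTagName('track')[0], 'artist');
```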
Without a DTD that allows a 'one' element to contain an empty 'two' element, you will have to parse and fiddle with the text of your XML to get a document out of it.
Empty elements are like null values in databases: either put something in (a "Nothing" or "0" value, a non-breaking space, anything at all) or don't include the 'two' element.
Maybe it could be an attribute of 'one' instead of an element in its own right.
Attributes can have empty strings for values, which is better than phantom elements.
Yahoo! Widgets does not implement all of the basic JavaScript functions needed to run browser code in a widget.
Instead of using:
tracks = xmlDoc.getElementsByTagName("track");
variable = tracks.item(i).childNodes.item(4).childNodes.item(0).nodeValue;
to get values, it's better to use XPath with a direct conversion to string. When a string is empty, Yahoo! Widgets doesn't raise an error but simply returns the empty string. innerText and textContent (the basic JavaScript way in browsers, used alongside things like getElementsByTagName) are not fully (or not at all) implemented in the Yahoo! Widgets Engine; they make it run slower and react quite badly to xmlNodes and childNodes. An easy way to traverse an XML document structure, however, is to use 'evaluate' to get everything you need (including lists of nodes) from the XML.
After finding this out, my solution was to make everything a lot easier and less sensitive to faults and errors. Also I chose to put the objects in an array to make working with them easier.
var posts = []; // collected track objects
var entries = xmlDoc.evaluate("lfm/recenttracks/track");
var length = entries.length;
for (var i = 0; i < length; i++) {
  var entry = entries.item(i);
  // string(...) yields "" for empty or missing nodes instead of throwing
  var obj = {
    artist: entry.evaluate("string(artist)"),
    name: entry.evaluate("string(name)"),
    url: entry.evaluate("string(url)"),
    image: entry.evaluate("string(image[@size='medium'])") // @ selects an attribute in XPath
  };
  posts[i] = obj;
}