Get the raw HTML of selected content using JavaScript

How would I get the raw HTML of the selected content on a page using JavaScript? For the sake of simplicity, I'm sticking with browsers supporting window.getSelection.
Here is an example; the content between the two | characters represents my selection.
<p>
The <em>quick brown f|ox</em> jumps over the lazy <strong>d|og</strong>.
</p>
I can capture and alert the normalized HTML with the following JavaScript.
var selectionRange = window.getSelection().getRangeAt(0),
    selectionContents = selectionRange.cloneContents(),
    fragmentContainer = document.createElement('div');
fragmentContainer.appendChild(selectionContents);
alert(fragmentContainer.innerHTML);
In the above example, the cloned fragment balances the partially selected elements, so the alert shows the string <em>ox</em> jumps over the lazy <strong>d</strong>.
How might I return the string ox</em> jumps over the lazy <strong>d?

You would have to effectively write your own HTML serialiser.
Start at the selectionRange.startContainer/startOffset and walk the tree forwards from there until you get to endContainer/endOffset, outputting HTML markup from the nodes as you go, including open tags and attributes when you walk into an Element and close tags when you go up a parentNode.
Not much fun, especially if you are going to have to support the very different IE<9 Range model at some point...
(Note also that you won't be able to get the completely raw original HTML, because that information is gone. Only the current DOM tree is stored by the browser, and that means details like tag case, attribute order, whitespace, and omitted implicit tags will differ between the source and what you get out.)
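A minimal sketch of such a walk, assuming both ends of the range sit inside text nodes and that comments and other node types can be skipped. The names serializeRange, openTag and closeTag are made up for this illustration, and the output is rebuilt from the DOM, so it will not match the original source byte for byte:

```javascript
// Elements that never take a closing tag in HTML.
const VOID_ELEMENTS = new Set(["area", "base", "br", "col", "embed", "hr",
  "img", "input", "link", "meta", "source", "track", "wbr"]);

function escapeText(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function openTag(el) {
  let out = "<" + el.nodeName.toLowerCase();
  for (const attr of Array.from(el.attributes || [])) {
    out += " " + attr.name + '="' + escapeText(attr.value).replace(/"/g, "&quot;") + '"';
  }
  return out + ">";
}

function closeTag(el) {
  return "</" + el.nodeName.toLowerCase() + ">";
}

// Hypothetical helper: serialise everything between the two range boundaries
// without balancing tags. Assumes both containers are text nodes.
function serializeRange(range) {
  const { startContainer, startOffset, endContainer, endOffset } = range;
  if (startContainer === endContainer) {
    return escapeText(startContainer.nodeValue.slice(startOffset, endOffset));
  }
  let out = escapeText(startContainer.nodeValue.slice(startOffset));
  let node = startContainer;
  for (;;) {
    if (node.firstChild) {
      node = node.firstChild; // walk down into an element
    } else {
      // Childless element: close it right away unless it is a void element.
      if (node.nodeType === 1 && !VOID_ELEMENTS.has(node.nodeName.toLowerCase())) {
        out += closeTag(node);
      }
      // Walk up until there is a next sibling, closing each element we leave.
      while (!node.nextSibling) {
        node = node.parentNode;
        if (!node || node.nodeType !== 1) return out; // ran off the tree
        out += closeTag(node);
      }
      node = node.nextSibling;
    }
    if (node === endContainer) {
      return out + escapeText(node.nodeValue.slice(0, endOffset));
    }
    if (node.nodeType === 1) out += openTag(node);
    else if (node.nodeType === 3) out += escapeText(node.nodeValue);
  }
}
```

In a browser you would call it as serializeRange(window.getSelection().getRangeAt(0)). For the selection in the question (offset 13 in the "quick brown fox" text node, offset 1 in "dog") the walk emits ox, then </em> on the way up, the middle text, <strong> on the way down, and finally d.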

Looking at the APIs, I don't think you can extract the HTML without it being converted to a DocumentFragment, which by default will close any open tags to make it valid HTML.
See Converting Range or DocumentFragment to string for a similar question.

Related

How to find a unique string within html and wrap it with a tag, but exclude links and urls

I'm looking for a way to find a specific string in the visible text of a page and wrap that string in <em> tags. I have tried using HTML Agility Pack and had some success with Regex.Replace, but if the string appears inside a URL it also gets replaced, which I do not want. If it's within an image name, it gets replaced too, and this obviously breaks the link or image URL.
An example attempt:
var markup = Encoding.UTF8.GetString(buffer);
var replaced = Regex.Replace(markup, "product-xs", " <em>product</em>-xs", RegexOptions.IgnoreCase);
var output = Encoding.UTF8.GetBytes(replaced);
_stream.Write(output, 0, output.Length);
This does not work, as it would replace <a href="product/product-xs"> with <a href="product/<em>product</em>-xs">, which I don't want.
The string is coming from a text string value within a CMS so the user can't wrap the words there and ideally, I want to catch all instances of the word that are already published.
Ideally I would want to exclude <title> tags, <img> tags and <a> tags, everything else should get the wrapped tag.
Before I used the HTML Agility Pack, a fellow front end dev tried it with JavaScript but that had an unexpected impact on dropdown menus.
If you need any more info, just ask.
You can use HTML Agility Pack to select only the text nodes (i.e. the text that exists between any two tags) with a bit of XPath and modify them like this.
Looking only in body will exclude <title>, <meta> etc. The not() excludes script tags; you can exclude others in the same way (or check the parent node in the loop). Note that SelectNodes returns null rather than an empty collection when nothing matches, so guard against that.
var textNodes = htmlDoc.DocumentNode.SelectNodes("//body//*[not(self::script)]/text()");
if (textNodes != null)
{
    foreach (HtmlNode node in textNodes)
    {
        // Swap each text node for a copy with the markup inserted.
        var newNode = htmlDoc.CreateTextNode(node.InnerText.Replace("product-xs", "<em>product</em>-xs"));
        node.ParentNode.ReplaceChild(newNode, node);
    }
}
I've used a simple replace; a regex will work fine too. It's probably best to check the performance of each approach and choose whichever works best for your use case.

Chrome extension breaks DOM

I'm making a Chrome extension that replaces certain text on a page with new text and a link. To do this I'm using document.body.innerHTML, which I've read breaks the DOM. When the extension is enabled it seems to break the loading of YouTube videos and pages at codepen.io. I've tried to fix this by excluding YouTube and codepen in the manifest, and by filtering them out in the code below, but it doesn't seem to be working.
Can anyone suggest an alternative to using document.body.innerHTML or see other problems in my code that may be breaking page loads? Thanks.
var texts=["some text","more text"];
if(!window.location.href.includes("www.google.com")||!window.location.href.includes("youtube.com")||!window.location.href.includes("codepen.io")){
    for(var i=0;i<texts.length;i++){
        if(document.documentElement.textContent || document.documentElement.innerText.includes(texts[i])){
            var regex = new RegExp(texts[i],'g');
            document.body.innerHTML = document.body.innerHTML.replace(regex,
                "<a href='https://www.somesite.org'>replacement text</a>");
        }
    }
}
Using innerHTML to do this is like using a shotgun to do brain surgery. Not to mention that this can even result in invalid HTML. You will end up having to whitelist every single website that uses any JavaScript at this rate, which is obviously not feasible.
The correct way to do it is to not touch innerHTML at all. Recursively iterate through all the DOM nodes (using firstChild, nextSibling) on the page and look for matches in text nodes. When you find one, replace that single node (replaceChild) with your link (createElement), and new text nodes (createTextNode, appendChild, insertBefore) for any leftover bits.
Essentially you will want to look for a node like:
Text: this is some text that should be linked
And programmatically replace it with nodes like:
Text: this is
Element: a href="..."
Text: replacement text
Text: that should be linked
Additionally if you want to support websites that generate content with JavaScript you'll have to run this replacement process on dynamically inserted content as well. A MutationObserver would be one way to do that, but bear in mind this will probably slow down websites.
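A rough sketch of that replacement, assuming plain string matching and a single fixed link. The function names (splitAroundMatches, linkifyTextNode, linkifyAll) and the list of skipped parent elements are made up for illustration; only the DOM calls themselves (createTreeWalker, createDocumentFragment, createElement, createTextNode, replaceChild) are standard:

```javascript
// Split `text` around every occurrence of `needle`, keeping the leftovers.
function splitAroundMatches(text, needle) {
  const parts = [];
  let from = 0;
  let at = needle ? text.indexOf(needle, from) : -1;
  while (at !== -1) {
    if (at > from) parts.push({ match: false, value: text.slice(from, at) });
    parts.push({ match: true, value: needle });
    from = at + needle.length;
    at = text.indexOf(needle, from);
  }
  if (from < text.length) parts.push({ match: false, value: text.slice(from) });
  return parts;
}

// Replace one text node with text / <a> / text nodes. Returns true if changed.
function linkifyTextNode(textNode, needle, href, label) {
  const parts = splitAroundMatches(textNode.nodeValue, needle);
  if (!parts.some((part) => part.match)) return false;
  const doc = textNode.ownerDocument;
  const fragment = doc.createDocumentFragment();
  for (const part of parts) {
    if (part.match) {
      const link = doc.createElement("a");
      link.setAttribute("href", href);
      link.appendChild(doc.createTextNode(label));
      fragment.appendChild(link);
    } else {
      fragment.appendChild(doc.createTextNode(part.value));
    }
  }
  textNode.parentNode.replaceChild(fragment, textNode);
  return true;
}

// Browser only. `root` must be an element such as document.body.
function linkifyAll(root, needle, href, label) {
  // Assumption: text directly inside these elements should be left alone.
  const skip = new Set(["SCRIPT", "STYLE", "TEXTAREA", "A"]);
  const walker = root.ownerDocument.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const textNodes = [];
  // Collect first, then mutate, so the walker is not disturbed mid-walk.
  while (walker.nextNode()) textNodes.push(walker.currentNode);
  for (const node of textNodes) {
    if (!skip.has(node.parentNode.nodeName)) linkifyTextNode(node, needle, href, label);
  }
}
```

Usage would be something like linkifyAll(document.body, "some text", "https://www.somesite.org", "replacement text"). Because only text nodes are touched, event listeners and script state on the surrounding elements stay intact, which is the point of avoiding innerHTML.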

Dynamically inject HTML entity escaped for CSS via JavaScript

I'm attempting to dynamically create a list of HTMLElements with data-* attributes that correspond to different HTML Entities, to be then picked up by CSS and used as pseudo element content like so:
li:after {
    content: attr(data-code);
}
The problem is that for attr() to render the actual entity rather than the literal code, the code has to be prefixed with &#x - your typical \ doesn't work.
So the desired output HTML is something like so: <li data-code="&#xENTITY"></li>. When added directly to HTML, this works exactly as expected in relation to my CSS rule. The escaped entity is placed on the page in an :after pseudo element and rendered as the entity icon.
Here's where things get curious...
As stated earlier, I'm trying to create and inject these lis dynamically through JavaScript (iterating through a list), and that's where the snag happens.
var entities = [{code: '&#x1F602'}, ...];
for (var i = 0; i < entities.length; i++) {
    var entity = entities[i],
        listItem = document.createElement('li');
    listItem.setAttribute('data-code', entity.code);
    list.appendChild(listItem);
}
The li is correctly added to the DOM with the properly formatted entity set so it gets picked up by my CSS rule. However, rather than rendering the entity icon, the code is shown!
Note in the image above, the first item rendered is the HTML explicitly on the page. The second item is injected via JS (using the exact same code), then given an :after element by CSS. Chrome's web inspector even renders it differently!
Even curiouser still is that I can edit the HTML via WebInspector and inject the escaped data-* attribute manually - Chrome STILL renders the correct icon!
I'm at a loss here, so any guidance would be greatly appreciated!
The HTML entity notation will only be parsed by an HTML parser. If your data-code attributes are "born" in JavaScript, then you need to use the JavaScript notation for getting the Unicode characters you want. Instead of &#x263A; for a smiley face, in JavaScript you use \u263A (a backslash, a lower-case "u", and four hex digits).
Whether your data-code attributes are coded directly into your HTML source (with HTML entity notation) or else created in JavaScript (with JavaScript notation), by the time the attribute value is part of the DOM, it's Unicode.
Now, things get more complicated when you have characters outside the 16-bit range, because JavaScript is kind-of terrible at dealing with that. You can look up your code point(s) at http://www.fileformat.info/info/unicode/ and that'll give you the "C/C++/Java" UTF-16 code pair you need. For example, your "tears of joy" face is the pair "\uD83D\uDE02".
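If the codes arrive as strings like '&#x1F602', one option is to convert them in JavaScript before calling setAttribute. A small sketch, where decodeHexEntity is a hypothetical helper name and only hexadecimal references are handled; String.fromCodePoint (ES2015) builds the surrogate pair for you:

```javascript
// Turn a hexadecimal character reference ("&#x1F602" or "&#x263A;") into the
// character it stands for. Anything that does not look like one is returned
// unchanged. Note: String.fromCodePoint throws a RangeError above 0x10FFFF.
function decodeHexEntity(entity) {
  const match = /^&#x([0-9a-f]+);?$/i.exec(entity);
  if (!match) return entity;
  return String.fromCodePoint(parseInt(match[1], 16));
}

// The same character written three ways: all compare equal.
const fromEntity = decodeHexEntity("&#x1F602");
const fromSurrogates = "\uD83D\uDE02";
const fromBraces = "\u{1F602}"; // ES2015 code point escape
```

In the loop from the question this would become listItem.setAttribute('data-code', decodeHexEntity(entity.code)).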

JavaScript Library/Function to find Unclosed HTML Tags

I am currently looking for a solution to find and list out any unclosed HTML tags from an arbitrary slice of raw HTML. I don't feel like this should be an awful problem, but I cannot seem to find something that does it in JS. Unfortunately, this needs to be client-side since it is being used for rendering annotations to HTML pages. Obviously, annotations are somewhat nasty business, since they select or apply formatting that may apply to only part of an HTML element (i.e., a markup overlaid onto an existing HTML markup).
One simple use-case is where you might want to only render part of an HTML page, but then inject the rest later. For example, imagine a hypothetical segment:
<p>This is my text <StartDelayedInject/> with a comment I added. </p>
<p> But it doesn't exist until now. </p> <StopDelayedInject/>
I'll be doing some pre-processing to rebuild the HTML so that I wrap partial elements into span-type elements that apply the appropriate formatting. Initially this would be parsed in the form:
<p><span>This is my text</span></p>
After some user action, it would then be modified to a form such as:
<p><span>This is my text</span><span>with a comment I added.</span></p>
<p>But it doesn't exist until now.</p>
This is a very simplified example case (obviously things like ul elements and tables get hairier), but gives the general principle. However, to do this effectively, I need to be able to check a segment of HTML and figure out there are tags that have opened (but not closed). If I know that information, I can wrap the last unterminated text data into a span, close the unclosed tag, and know to return to that point to inject the remainder of the content when needed. However, I need to know the tags that were still open, so that when I inject or modify another segment of content, I can make sure to put it in the right place (e.g., get "with a comment I added." in the first paragraph).
From my understanding of context-free grammars, this should be a relatively trivial task. Each time you open/enter or close/exit a tag, you could just keep a stack of the tags opened but not yet closed. With that said, I'd much rather use a library that's a bit more of a mature solution than write a naive parser for that purpose. I'd assume there's some JS HTML parser around that would do this, right? Plenty of them know how to close tags, so clearly at some point they calculated this.
The problem is that JavaScript only has access to the html in two ways:
In a sense that each element is an object with properties and methods created by the browser on page load.
In a sense that it is a string of text.
Using the first method of interfacing with html, there is no way to detect unclosed tags as you only have access to the objects that the browser creates for you after it parses the html.
Using the second method, you would have to run the entire string of html through an html parser. Some people might assume you could do it simply with a regexp; however, this is not feasible. I refer you to this fantastic Stack Overflow question.
Even if you found a really robust html parser to use, you would still run into the problem created by the fact that, before your JavaScript even touches it, the browser will have attempted to parse the potentially broken html and there could be errors everywhere.
Edit:
If you like the parser idea, John Resig created this example one you might want to reference.
Not perfect but here's my quick method for checking for mismatch between open/close tags:
function find_unclosed_tags(str) {
    str = str.toLowerCase();
    var tags = ["a", "span", "div", "ul", "li", "h1", "h2", "h3", "h4", "h5", "h6", "p", "table", "tr", "td", "b", "i", "u"];
    var mismatches = [];
    tags.forEach(function(tag) {
        var pattern_open = '<'+tag+'( |>)';
        var pattern_close = '</'+tag+'>';
        var diff_count = (str.match(new RegExp(pattern_open,'g')) || []).length - (str.match(new RegExp(pattern_close,'g')) || []).length;
        if(diff_count != 0) {
            mismatches.push("Open/close mismatch for tag " + tag + ".");
        }
    });
    return mismatches;
}
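The stack idea from the question can be sketched in a similar spirit. This is not a real HTML parser: it assumes no > inside attribute values and no tag-like text inside script or style content, and findUnclosedTags is a name made up here. Unlike counting, it reports which tags are still open and in what order:

```javascript
// HTML void elements: they never have a closing tag, so never push them.
const VOID_TAGS = new Set(["area", "base", "br", "col", "embed", "hr",
  "img", "input", "link", "meta", "source", "track", "wbr"]);

// Returns the tags still open at the end of `html`, outermost first.
function findUnclosedTags(html) {
  // Group 1: "/" for a closing tag. Group 2: tag name. Group 3: "/" if self-closed.
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9-]*)\b[^>]*?(\/?)>/g;
  const open = [];
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const closing = match[1] === "/";
    const name = match[2].toLowerCase();
    const selfClosed = match[3] === "/";
    if (VOID_TAGS.has(name) || (selfClosed && !closing)) continue;
    if (!closing) {
      open.push(name);
      continue;
    }
    // Closing tag: drop the nearest matching opener and everything opened
    // inside it (those are treated as implicitly closed). A stray closing
    // tag with no opener is ignored.
    const index = open.lastIndexOf(name);
    if (index !== -1) open.length = index;
  }
  return open;
}
```

For the first line of the delayed-inject example, cut off before </p>, this returns ["p"], which is the information needed to know where the later content has to be injected.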

XPath to find nodes with text + all their descendants & siblings that match certain criteria

Background:
I'm trying to improve on a Greasemonkey script I found.
The script marks prices in foreign currencies and can translate them into the currency of your choice.
The main problem:
How to make the script handle when prices are listed with tags, such as:
<b><i>9.</i></b><sup>95</sup>EUR
(Newegg.com does this, for example - they write their prices like so: <span>$</span>174<sup>.99</sup>).
Currently, the script only finds prices that are listed in the same text node since the XPath expression being used is:
document.evaluate("//text()", document, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null)
Since the script needs to be fast, I'm trying to avoid stepping through the DOM too much...
Are there any XPath gurus who could help out with some smart solutions for this purpose?
More detailed description of the problem:
The code I now have for finding the text nodes:
var re_skip = /^(SCRIPT|IFRAME|TEXTAREA|STYLE|OPTION|TITLE|HEAD|NOSCRIPT)$/; // List of elements whose text node-children can be skipped
text = document.evaluate("//text()", document, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);
var i = text.snapshotLength;
while (i--) {
    el = text.snapshotItem(i);
    if (!el.parentNode || re_skip.test(el.parentNode.nodeName.toUpperCase()) || el.parentNode.className == 'autocurrency') {
        continue;
    }
    // ...
    // (RegEx logic to check if prices can be found in the text)
}
The check to discard text nodes whose parent elements are listed in "re_skip" could be done in the XPath expression as well (using the "not" notation), right? And this would give a speed-increase?
If an ordered XPath type is used instead, I guess I no longer will have to include a check to see if the parent of the text node being parsed is <span class="autocurrency"> (that is, the <span> that the script adds around matched prices).
If I've understood things correctly, normalize-space() (as suggested here), cannot be used in this case, since the script adds a <span class="autocurrency"> around the matched amount and we need to retain the correct index for where this <span> should be entered.
Is there a way for the XPath to allow only certain (inline) elements to be used in-between the currency values? Or perhaps it could do this: "when a node containing text is found, also include all of its children (and their children and so on) in the match - unless the child node is a block type element." (or perhaps it should read: "...unless the child node is a DIV, P, TABLE, or any of the elements in re_skip")
I can re-write the regex to handle text such as "<span>$</span>174<sup>.99</sup>" as long as I find these text strings - preferably using XPath, as I have understood this to be much faster than stepping through the DOM.
Thank you very much in advance for any help you can give me with this!
--------------------------------------------------------------
EDIT:
OK, I realize now that the question could do with some clarification and some examples, so here they come. A web page might look something like this:
<body>
    <div>
        <span>9.95 <span>EUR</span></span><br />
        <span>8.<sup>95</sup></span>AU$<br />
        <table>
            <thead>
                <tr>
                    <th>Bla</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td><b>7</b>.95kr</td>
                </tr>
            </tbody>
        </table>
        <div>Bla bla</div>
        6.95 <span>GBP</span>
    </div>
    <div><img src="" /><img src=""><span>Bla bla bla</span></div>
</body>
Now, in that example, the overhead isn't that great - I could just feed the whole source code, as a string, directly to the regex that finds prices. But normally, pages will have lots of non-text elements that would make the script very slow if I didn't use a fast XPath to parse out the texts. So, I'm looking for an XPath expression that would find the different texts in the example above, but not just the text content - since we also need tags that might surround parts of a price (a new <span> will later be created around the matched price, including any inline elements that might surround parts of the price).
I don't know exactly what the XPath could be made to return, but somehow I need to grab a hold of the following strings from the example page above:
"9.95 <span>EUR</span>" (or possibly: "<span>9.95 <span>EUR</span></span>")
"<span>8.<sup>95</sup></span>AU$"
"Bla" (or possibly: "<th>Bla</th>")
"<b>7</b>.95kr" (or possibly: "<td><b>7</b>.95kr</td>")
"Bla bla" (or possibly: "<div>Bla bla</div>")
"6.95 <span>GBP</span>"
"Bla bla bla" (or possibly: "<span>Bla bla bla</span>")
and then these strings can be parsed by the regex that finds prices.
Well, you can certainly use a path like //*[not(self::script | self::textarea | self::style)]/text() to find only those text node children of element nodes that are not one of "script", "textarea", "style". (Note the single / before text(): with // a text node inside a script would still match through an ancestor such as body.) So the regular expression test you have is not necessary; you could express that requirement with XPath. Whether that performs better I can't tell; you will have to check with the XPath implementations of the browser(s) you want to use the Greasemonkey script with.
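A sketch of building such an expression from a list of tag names, so the skip list can stay in one place. It puts the predicate on the text node itself (parent::script and so on), which filters on the direct parent only. The function names are invented for this example; document.evaluate and XPathResult.ORDERED_NODE_SNAPSHOT_TYPE are the standard browser API, and whether this is faster than the regex check is something to measure rather than assume:

```javascript
// Build an XPath that selects every text node whose parent element is not
// one of `skippedTags`. Tag names are lower-cased, which is how HTML
// documents expose element names to XPath name tests.
function buildTextNodeXPath(skippedTags) {
  if (skippedTags.length === 0) return "//text()";
  const tests = skippedTags
    .map((tag) => "parent::" + tag.toLowerCase())
    .join(" or ");
  return "//text()[not(" + tests + ")]";
}

// Browser only: run the expression and return the text nodes in document order.
function findTextNodes(doc, skippedTags) {
  const snapshot = doc.evaluate(buildTextNodeXPath(skippedTags), doc, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  const nodes = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    nodes.push(snapshot.snapshotItem(i));
  }
  return nodes;
}
```

In the script this might be called as findTextNodes(document, ["script", "iframe", "textarea", "style", "option", "title", "head", "noscript"]). The check for the autocurrency class would still have to be done separately or added as another predicate.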
