Extract all links from a string - javascript

I have a JavaScript variable containing the HTML source code of a page (not the source of the current page), and I need to extract all links from this variable.
Any clues as to what's the best way of doing this?
Is it possible to create a DOM for the HTML in the variable and then walk that?

I don't know if this is the recommended way, but it works: (JavaScript only)
var rawHTML = '<html><body><a href="foo">bar</a><a href="narf">zort</a></body></html>';
var doc = document.createElement("html");
doc.innerHTML = rawHTML;
var links = doc.getElementsByTagName("a");
var urls = [];
for (var i = 0; i < links.length; i++) {
  urls.push(links[i].getAttribute("href"));
}
alert(urls);

If you're using jQuery, I believe you can do this quite easily:
var doc = $(rawHTML);
var links = $('a', doc);
http://docs.jquery.com/Core/jQuery#htmlownerDocument

This is especially useful if you need to replace links. The quantifier is non-greedy so that two links on the same line are matched separately:
var linkReg = /(<[Aa]\s(.*?)<\/[Aa]>)/g;
var linksInText = text.match(linkReg);

If you're running Firefox, yes you can! It's called DOMParser; check it out:
DOMParser is mainly useful for applications and extensions based on Mozilla platform. While it's available to web pages, it's not part of any standard and level of support in other browsers is unknown.
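For what it's worth, a minimal sketch of the DOMParser route might look like the following. It assumes an environment where DOMParser accepts the "text/html" type (current browsers do, as far as I know). The function name extractLinksWithDOMParser and the optional ParserCtor argument are made up for illustration; the argument is only there so the function can be tried out where no built-in DOMParser exists.

```javascript
// Sketch: parse an HTML string into a detached document and collect hrefs.
// ParserCtor is a hypothetical injection point; in a browser it can be omitted
// and the built-in DOMParser is used instead.
function extractLinksWithDOMParser(rawHTML, ParserCtor) {
  var Parser = ParserCtor ||
    (typeof DOMParser !== "undefined" ? DOMParser : undefined);
  if (typeof Parser !== "function") {
    throw new Error("No DOMParser available in this environment");
  }
  // The parsed document is never attached to the current page.
  var doc = new Parser().parseFromString(rawHTML, "text/html");
  var anchors = doc.querySelectorAll("a[href]");
  // querySelectorAll returns an array-like NodeList, so borrow Array's map.
  return Array.prototype.map.call(anchors, function (a) {
    return a.getAttribute("href");
  });
}
```

In a browser the call would simply be extractLinksWithDOMParser(rawHTML). Using getAttribute returns the href exactly as written in the markup rather than a resolved absolute URL.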

If you are running outside a browser context and don't want to pull in an HTML parser dependency, here's a naive approach:
var html = `
<html><body>
Example
<p>text</p>
<a download href='./doc.pdf'>Download</a>
</body></html>`
var anchors = /<a\s[^>]*?href=(["']?)([^\s]+?)\1[^>]*?>/ig;
var links = [];
html.replace(anchors, function (_anchor, _quote, url) {
links.push(url);
});
console.log(links);
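One caveat with the pattern above, as far as I can tell: for an unquoted value such as href=foo, the quote group is empty and the lazy URL group stops after the first character. A hedged variant that treats double-quoted, single-quoted, and unquoted values separately could look like this. The function name extractHrefs is made up for this sketch, and it is still a naive regex approach rather than a real HTML parser.

```javascript
// Naive href extraction using String.prototype.matchAll (ES2020).
// Each kind of attribute value gets its own capture group:
//   group 1: double-quoted, group 2: single-quoted, group 3: unquoted.
function extractHrefs(html) {
  var anchors =
    /<a\s+(?:[^>]*?\s)?href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/gi;
  return Array.from(html.matchAll(anchors), function (m) {
    if (m[1] !== undefined) { return m[1]; }
    if (m[2] !== undefined) { return m[2]; }
    return m[3];
  });
}
```

Like any regex-based extraction, this will be fooled by anchors inside comments or scripts, so it only suits input you roughly control.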

Related

How to open links in new tabs only if they contain a specific keyword?

I found a greasemonkey script on the net which opens all the links on a webpage in new tabs. What I want to do is edit it so that it only opens particular links that contain the word forum.
This is the script I am currently using:
javascript: (function () {
{
var links = document.getElementsByTagName('a');
for (i in links) {
window.open(links[i].href);
}
}
})()
How can I edit it to do what I want?
That script works on Greasemonkey perfectly, although there are a lot of brackets that can be removed. Here is the version I used:
(function(){
  var links = document.getElementsByTagName('a');
  // Use an index loop: for...in would also visit keys such as "length" and "item"
  for (var i = 0; i < links.length; i++){
    var href = links[i].href;
    if(href.toLowerCase().indexOf('forum') > -1){
      window.open(href);
    }
  }
})();
Are you sure it isn't working? It may be that your browser is just getting in the way and blocking popups.
Another approach would be to query for every anchor tag, place them in an array (as opposed to nodelist), and then conditionally open them in a new window using filter.
[].forEach.call(document.querySelectorAll('a'),function(el){ if(el.href.toLowerCase().indexOf('forum') > -1) window.open(el.href) })
Or in a more readable form
[].forEach.call(//access array's prototype to call forEach on
document.querySelectorAll('a')//the nodelist result of all anchor elements
,function(el){ //then use that result to iterate through
if( el.href.toLowerCase().indexOf('forum') > -1 )//check whether or not 'forum' exists
window.open(el.href) //and if it does open a new tab with the anchors href
})
Some references
slice :
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/slice
call :
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Function/call
querySelectorAll :
https://developer.mozilla.org/en-US/docs/Web/API/Document.querySelectorAll
filter :
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/filter
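Since slice and filter are referenced but the snippets use forEach, here is a rough sketch of the slice-then-filter variant. The helper name linksContaining is made up for illustration. It takes the node list as a parameter so the filtering can be checked without a page.

```javascript
// Convert an array-like NodeList to a real array with slice, then filter it.
// Returns the elements whose href contains the keyword (case-insensitive).
function linksContaining(nodeList, keyword) {
  var needle = keyword.toLowerCase();
  return Array.prototype.slice.call(nodeList).filter(function (el) {
    return typeof el.href === "string" &&
      el.href.toLowerCase().indexOf(needle) > -1;
  });
}

// Browser usage (left as a comment so pasting this does not open tabs):
// linksContaining(document.querySelectorAll('a'), 'forum')
//   .forEach(function (el) { window.open(el.href); });
```

Separating the filtering from window.open also makes it easy to log the matches first and see what would be opened.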
To test whether a link contains the word "forum", you could use a regular expression. The code would look something like this:
(function () {
  var pattern = new RegExp('forum');
  var anchors = document.getElementsByTagName('a');
  for (var i = 0; i < anchors.length; i++) {
    var a = anchors[i];
    if (pattern.test(a.href)) {
      window.open(a.href);
    }
  }
})();
I wrote a userscript that opens external links on any page in a new tab.
It has settings to exclude links to parent, neighbor, and child sites.
Maybe this script or its code will be useful.
It can be installed from one of these repos, or by direct link: External link newtaber

How to alter DOM with xmldom by XPath in node.js?

I am trying to alter a DOM structure in node.js. I can load the XML string and alter it with the native methods in xmldom (https://github.com/jindw/xmldom), but when I load XPath (https://github.com/goto100/xpath) and try to alter the DOM via that selector, it does not work.
Is there another way to do this out there? The requirements are:
Must work both in the browser and server side (pure js?)
Cannot use eval or other code execution stuff (for security)
Example code to show how I am trying today below, maybe I simply miss something basic?
var xpath = require('xpath'),
dom = require('xmldom').DOMParser;
var xml = '<!DOCTYPE html><html><head><title>blah</title></head><body id="test">blubb</body></html>';
var doc = new dom().parseFromString(xml);
var bodyByXpath = xpath.select('//*[@id = "test"]', doc);
var bodyById = doc.getElementById('test');
var h1 = doc.createElement('h1').appendChild(doc.createTextNode('title'));
// Works fine :)
bodyById.appendChild(h1);
// Does not work :(
bodyByXpath.appendChild(h1);
console.log(doc.toString());
bodyByXpath is not a single node. The third parameter to select, if true, tells it to return only the first node; otherwise, the result is a list.
As aredridel states, .select() will return an array by default when you are selecting nodes. So you would need to obtain your node from that array.
You can also use .select1() if you only want to select a single node:
var bodyByXpath = xpath.select1('//*[@id = "test"]', doc);

Disable a "link" tag without access to HTML

So I have a website I'm working on and in one part we have an embedded newsletter signup. The problem is that the embedded code uses all its own stylesheets which interferes with the design of the site. The embed is done through javascript so I cannot disable them until that section of the page loads.
Basically I need a script to disable an entire <link>. On top of that, the links don't have any classes or ids so they are hard to target.
<link rel="stylesheet" type="text/css" href="http://www.formstack.com/forms/css/3/default.css?20130404">
This is one of the links I need to disable. I tried looking for something like a getElementByType or similar but I couldn't find anything.
Any help would be appreciated. As long as the code disables the link that's good enough for me. Maybe there is a way to search the document for the <link> string and surround it with comments?
Thanks guys
PS, I'm a javascript novice and have no idea what I'm doing with js
var test = "http://www.formstack.com/forms/css/3/default.css";
for (var i = 0; i < document.styleSheets.length; i++) {
  var sheet = document.styleSheets.item(i);
  // href is null for inline <style> sheets, so guard before calling indexOf
  if (sheet.href && sheet.href.indexOf(test) !== -1) sheet.disabled = true;
}
This will work, but it is inefficient because it keeps checking the remaining CSSStyleSheets in the list after it has found its match.
If you don't need to support older browsers, you can use Array.prototype.some to reduce the number of operations:
[].some.call(document.styleSheets, function(sheet) {
  var match = !!sheet.href && sheet.href.indexOf(test) !== -1;
  if (match) sheet.disabled = true;
  return match; // returning true stops the iteration
});
see: Array some method on MDN
edit:
For a mix of performance AND legacy support the following solution would work:
var test = "http://www.formstack.com/forms/css/3/default.css";
for (var i = 0; i < document.styleSheets.length; i++) {
  var sheet = document.styleSheets.item(i);
  // href is null for inline <style> sheets, so guard before calling indexOf
  if (sheet.href && sheet.href.indexOf(test) !== -1) {
    sheet.disabled = true;
    break;
  }
}
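Since the question mentions that the embed brings several stylesheets of its own, a variation that disables every sheet served from a given URL prefix may be handy. This is only a sketch: the function name disableStylesheetsFrom is made up, and the prefix in the usage comment is an assumption based on the one URL shown in the question.

```javascript
// Disable every stylesheet whose href starts with the given prefix.
// Returns how many sheets were disabled, so the caller can retry later
// (for example once the embed has finished loading) when the count is 0.
function disableStylesheetsFrom(sheets, prefix) {
  var disabled = 0;
  for (var i = 0; i < sheets.length; i++) {
    var sheet = sheets[i];
    // Inline <style> sheets have a null href, so check it before indexOf.
    if (sheet.href && sheet.href.indexOf(prefix) === 0) {
      sheet.disabled = true;
      disabled++;
    }
  }
  return disabled;
}

// Browser usage (prefix assumed from the question's example URL):
// disableStylesheetsFrom(document.styleSheets, "http://www.formstack.com/");
```

Unlike the versions with break, this one deliberately scans the whole list, because more than one sheet may match.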

TideSDK: Is there a way to pass data to a child window?

When I create a child window from my main window, I'd like to pass a JavaScript object to it, but I'm not sure whether there is actually a way to do it.
Two windows created with TideSDK each have their own JavaScript environment, just like two browser windows (and that's just what they are, if I understand it right), so you can't access a variable in one window from another one. On the other hand, you can access other windows from the one you are in (for example with Ti.UI.getOpenWindows). So... is there a way to do it?
There are some workarounds I believe are possible, but none of them is very straightforward, and each uses something other than just plain JavaScript:
using Ti.Database or Ti.Filesystem to store the data I want to pass, and then retrieve it from the child window
pass the data to the new window as GET variables, for example: Ti.UI.createWindow("app://page.html?data1=test&data2=foobar");
you can assign the object to the child window object
var objToBePassed = {foo:'bar'};
var currentWindow = Ti.UI.currentWindow;
var newWindow = currentWindow.createWindow("app://page.html");
newWindow.obj = objToBePassed;
newWindow.open();
and in the javascript environment of the child window you can access the object by
var currentWindow = Ti.UI.currentWindow;
var obj = currentWindow.obj;
another way is to use Ti.API.set:
Ti.API.set('objKey', objToBePassed);
and you can access the object anywhere by
var obj = Ti.API.get('objKey');
I know this has been answered, but I came across a way to mimic the PHP $_GET variable, which I reckon will do what the OP asked:
<script type="text/javascript">
(function(){
  document.$_GET = {}; // plain object used as a key/value map
  var urlHalves = String(document.location).split('?');
  if(urlHalves[1]){
    var urlVars = urlHalves[1].split('&');
    for(var i=0; i<urlVars.length; i++){
      if(urlVars[i]){
        var urlVarPair = urlVars[i].split('=');
        document.$_GET[urlVarPair[0]] = urlVarPair[1];
      }
    }
  }
})();
</script>
And then use it:
<script type="text/javascript">
var conts = '<li><a title="back to list" href="/menu.html?module='+document.$_GET['module']+'">Unit Listing</a></li>';
document.write(conts);
</script>
Works for me.
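If the data to pass is a whole object rather than a few strings, the GET-variable workaround can be combined with JSON. A rough sketch follows, assuming the object is JSON-serializable and small enough for a URL. The function names buildWindowUrl and readWindowData and the parameter name "data" are made up for illustration.

```javascript
// Parent window: serialize the object into a single query parameter.
function buildWindowUrl(page, data) {
  return page + "?data=" + encodeURIComponent(JSON.stringify(data));
}

// Child window: find the "data" parameter in the URL and decode it.
// Returns null when the parameter is absent.
function readWindowData(url) {
  var withoutHash = String(url).split("#")[0];
  var query = withoutHash.split("?")[1] || "";
  var pairs = query.split("&");
  for (var i = 0; i < pairs.length; i++) {
    var eq = pairs[i].indexOf("=");
    if (eq !== -1 && pairs[i].slice(0, eq) === "data") {
      return JSON.parse(decodeURIComponent(pairs[i].slice(eq + 1)));
    }
  }
  return null;
}

// Parent:  Ti.UI.createWindow(buildWindowUrl("app://page.html", { foo: "bar" }));
// Child:   var obj = readWindowData(document.location);
```

The child receives a copy, not a shared reference, so later changes in one window are not visible in the other.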

Cannot Execute Javascript XPath queries on created document

Problem
I'm creating a document with javascript and I'd like to execute XPath queries on this document.
I've tried this in safari/chrome
I've read up on createDocument / xpath searches and it really seems like this code should work
At this point it seems like it may be a webkit bug
My requirements:
I can use innerHTML to set up the document
I can execute XPath searches with tag names
The code:
If you copy/paste the following into the webkit inspector, you should be able to repro.
function search(query, root) {
var result = null;
result = document.evaluate(query, root, null, 7,null);
var nodes = [];
var node_count = result.snapshotLength;
for(var i = 0; i < node_count; i++) {
nodes.push(result.snapshotItem(i));
}
return nodes;
}
x = document.implementation.createDocument('http://www.w3.org/1999/xhtml', 'html', 'HTML');
body = x.createElement('body');
body.innerHTML = "<span class='mything'><a></a></span>";
xdoc = x.documentElement; //html tag
xdoc.appendChild(body);
console.log(search(".", xdoc)); // --> [<html>…</html>]
console.log(search("/*", xdoc)); // --> [<html>…</html>]
console.log(search("/html", xdoc)); // --> []
Best Guess
So I can definitely search using XPath, but I cannot search using tagnames. Is there something silly I'm missing about the namespace?
Have you tried:
console.log(search("//html", xdoc));
I'm not familiar with Safari specifically, but the problem might be that Safari is adding another node above HTML or something. If this was the case, the first two queries might be showing you that node plus its children, which would make it look like they're working properly, while the third query would fail because there wouldn't be a root=>HTML node.
Just a thought.
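On the namespace guess in the question: in XPath 1.0 an unprefixed name test such as html only matches elements in no namespace, so elements created in the XHTML namespace would not be found by /html when no namespace resolver is supplied. One hedged workaround is to match on local-name() instead. The helper below is a sketch only: the name toNamespaceAgnosticXPath is made up, and it rewrites plain element-name steps while leaving predicates, attributes, and wildcards untouched.

```javascript
// Rewrite simple element-name steps so they match in any namespace.
// "/html/body" becomes "/*[local-name()='html']/*[local-name()='body']".
function toNamespaceAgnosticXPath(path) {
  return path.split("/").map(function (step) {
    // Only rewrite steps that are a bare element name; everything else
    // (empty steps from "/" or "//", ".", "..", "*", "@attr", predicates)
    // passes through unchanged.
    if (/^[A-Za-z_][\w.-]*$/.test(step)) {
      return "*[local-name()='" + step + "']";
    }
    return step;
  }).join("/");
}

// Usage with the search function from the question:
// search(toNamespaceAgnosticXPath("/html"), xdoc);
```

The alternative is to pass a namespace resolver as the third argument of document.evaluate and use a prefix in the query, which is stricter but needs the search helper to accept a resolver.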
