JavaScript: read a plain HTML string and change link paths using DOMParser

In my Angular app, using one of the WYSIWYG editors, I can insert links without a protocol, and this is bad:
I need to parse the string and change all the links that don't have a protocol to http://...
So I tried this:
var content = '<p>7</p><p>77</p><p><br></p><p>http://example.com</p><p><br></p><p>example.com</p><p><br></p><p>ftp://localhost</p><p><br></p><p>localhost<br></p>';

var addProtocolToLinks = function(URL) {
    var protocols = ['http', 'https', 'ftp', 'sftp', 'ssh', 'smtp'];
    var withProtocol = false;
    if (URL.length > 0) {
        protocols.forEach(function(el) {
            if (URL.slice(0, 4).indexOf(el) > -1) {
                withProtocol = true;
            }
        });
        var newURL = URL;
        if (!withProtocol) {
            newURL = 'http://' + URL;
        }
        console.log(newURL + ' ' + URL);
        return newURL;
    }
};

var parser = new DOMParser();
var doc = parser.parseFromString(content, "text/html");
var links = doc.getElementsByTagName("a");

for (var i = 0; i < links.length; i++) {
    links[i].setAttribute('href', addProtocolToLinks(links[i].href));
    console.log('result: ' + links[i].getAttribute('href'));
}

console.log('result html: ');
console.log(doc); // I also need to fetch only my 'content' part, without the html, body, etc. wrapper
http://jsfiddle.net/r3dgeo23/
But for some reason it's not working properly. What am I doing wrong?

You had almost everything right, except that:
links[i].href
returns undefined if no protocol is set. Therefore you passed addProtocolToLinks(undefined) to your function and it did not work.
You can use:
getAttribute('href')
to make it work. See this fiddle:
http://jsfiddle.net/r3dgeo23/3/
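To illustrate the difference between the two accessors, a minimal sketch (what .href returns for a protocol-less value can vary by environment):

    // Illustrative: raw attribute vs. the .href property on a DOMParser document
    var d = new DOMParser().parseFromString('<a href="example.com">x</a>', 'text/html');
    var a = d.getElementsByTagName('a')[0];
    console.log(a.getAttribute('href')); // "example.com" -- the raw attribute text
    console.log(a.href); // resolved against the document's base URL, or empty/undefined in some environments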
/////EDIT
Here is a fiddle for fetching only the content part and not the whole HTML:
http://jsfiddle.net/r3dgeo23/5/
/////EDIT2
Create the container with a unique id within your function:
var container = document.createElement('div');
container.setAttribute("id", "content");
container.innerHTML = content;
http://jsfiddle.net/r3dgeo23/6/
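Putting the answer and the edits together, a minimal sketch of the whole flow might look like this (fixLinkProtocols is a made-up name; the protocol list mirrors the question's):

    // Minimal sketch: parse the fragment, fix up each <a> href,
    // and return only the body's inner HTML.
    function fixLinkProtocols(content) {
        var parser = new DOMParser();
        var doc = parser.parseFromString(content, 'text/html');
        var links = doc.getElementsByTagName('a');
        for (var i = 0; i < links.length; i++) {
            // getAttribute('href') returns the raw attribute value
            var href = links[i].getAttribute('href');
            if (href && !/^(https?|s?ftp|ssh|smtp):\/\//.test(href)) {
                links[i].setAttribute('href', 'http://' + href);
            }
        }
        // doc.body.innerHTML yields just the fragment, without the
        // <html>/<head>/<body> wrapper added by the parser
        return doc.body.innerHTML;
    }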

If I completely understood your question, this should work...
function jsF_addHTTP(url) {
    if (url !== "") {
        // Insert http:// if it doesn't exist.
        if (!url.match("^(http|https|ftp|sftp|ssh|smtp)://")) {
            url = "http://" + url;
        }
    }
    return url;
}
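Illustrative calls:

    console.log(jsF_addHTTP("example.com"));       // "http://example.com"
    console.log(jsF_addHTTP("ftp://example.com")); // unchanged: "ftp://example.com"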

Try this, it works:
var addProtocolToLinks = function(url) {
    var protocols = ['http', 'https', 'ftp', 'sftp', 'ssh', 'smtp'];
    var newUrl = 'http://' + url; // default: no protocol present, so prepend one
    protocols.forEach(function(item) {
        if (url.indexOf(item + '://') === 0) {
            // a protocol is already present: replace it with http://
            newUrl = 'http://' + url.substr(url.indexOf('//') + 2);
        }
    });
    return newUrl;
};
A sample demo is here: http://jsfiddle.net/d9p9534h/
Let me know if it works.

How about this?
function ensureProtocol(href) {
    var match = href.match(/^((\w+)\:)?(.*)/);
    var protocol = match[1] || 'https:';
    return protocol + match[3];
}
NOTE: Not every URI has an authority part. That's why the regular expression does not include //. See this article
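Illustrative calls, showing the consequence of the note above (no // is added):

    console.log(ensureProtocol('example.com/page'));  // "https:example.com/page"
    console.log(ensureProtocol('ftp://example.com')); // unchanged: "ftp://example.com"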


Related

How to match a URL and open it, otherwise Google-search it?

I want to make a bookmarklet which can match my entered words: if the entered text is a URL like http://..., then open it; otherwise use Google to search for it. Here is my code below, but it doesn't seem to be working:
javascript:var x=prompt('Enter url or text!','');
let url = x;
try {
    var reg = /^http[s]?:\/\/(www\.)?(.*)?\/?(.)*/;
    if (!reg.test(url)) {
        url = 'https://www.google.com/search?q=' + encodeURIComponent(url);
    } else {
        if (url.substring(4, 0).toLowerCase() == "http") {
            url = encodeURIComponent(url);
        } else {
            url = 'http://' + encodeURIComponent(url);
        }
    }
};
I use it as a bookmarklet.
The code is fine, except that it fails to match URLs like www.stackoverflow.com (without the http[s]), and it encodes the forward slashes when http[s] is provided. The fixes:
- Change the regex to also match www.stackoverflow.com.
- Change encodeURIComponent to encodeURI for URLs that already have http[s].
- Wrap the code in an IIFE to avoid redeclaring variables, so it can be called more than once.
Note: This solution won't work for URLs such as stackoverflow.com which are missing both the www. and http[s] prefixes.
javascript:(function() {
    var x = prompt('Enter url or text!', '');
    let url = x;
    try {
        var reg = /(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})/;
        if (!reg.test(url)) {
            url = 'https://www.google.com/search?q=' + encodeURIComponent(url);
        } else {
            if (url.substring(4, 0).toLowerCase() == "http") {
                url = encodeURI(url);
            } else {
                url = 'http://' + encodeURIComponent(url);
            }
        }
        window.location.href = url;
    } catch (err) {
        console.error(err);
    }
})();
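For reference, the encoder difference the second bullet relies on:

    // encodeURIComponent escapes URL structure; encodeURI preserves it
    console.log(encodeURIComponent('https://example.com/a b')); // "https%3A%2F%2Fexample.com%2Fa%20b"
    console.log(encodeURI('https://example.com/a b'));          // "https://example.com/a%20b"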

How can I pass a value in a URL and insert the value in a new URL to redirect with JavaScript?

I am passing a value in a URL in the form http://example.com/page?id=012345. The passed value then needs to be inserted into a new URL, and the page redirected to that new URL. Here is what I have been working with:
function Send() {
    var efin = document.getElementById("id").value;
    var url = "https://sub" + encodeURIComponent(efin) + ".example.com";
    window.location.href = url;
}
Sounds like you're looking for the features of URLSearchParams, specifically using .get() to fetch specific parameters from the URL:
// Replacing the use of 'window.location.href' for this demo
let windowLocationHref = 'http://example.com/page?id=012345';

function Send() {
    let url = new URL(windowLocationHref);
    let param = url.searchParams.get('id');
    let newUrl = "https://sub" + encodeURIComponent(param) + ".example.com";
    console.log('Navigate to: ' + newUrl);
    //window.location.href = newUrl;
}

Send();
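If the code runs on the page that received the parameter, an equivalent sketch can build URLSearchParams from window.location.search directly:

    // Equivalent sketch using the current page's query string
    let params = new URLSearchParams(window.location.search);
    let id = params.get('id'); // "012345" for ...page?id=012345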

How to write data in a node.js stream without duplicates?

This question is about a URL crawler in node.js.
Starting at the start_url URL, it looks for links and "pushes" them to a .json file (output.json).
How can I make sure that it does not "push" or "write" a domain twice to output.json (so that I don't get duplicates)? I've been using a hash function, but this has caused problems.
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');

var start_url = ["http://blog.codinghorror.com/"];
var wstream = fs.createWriteStream("output.json");

// Extract root domain name from string
function extractDomain(url) {
    var domain;
    if (url.indexOf("://") > -1) { // find & remove protocol (http(s), ftp, etc.) and get domain
        domain = url.split('/')[2];
    } else {
        domain = url.split('/')[0];
    }
    domain = domain.split(':')[0]; // find & remove port number
    return domain;
}

var req = function(url) {
    request(url, function(error, response, html) {
        if (!error) {
            var $ = cheerio.load(html);
            $("a").each(function() {
                var link = $(this).attr("href");
                var makelinkplain = extractDomain(link);
                start_url.push("http://" + makelinkplain);
                wstream.write('"http://' + makelinkplain + '",');
            });
        }
        start_url.shift();
        if (start_url.length > 0) {
            return req(start_url[0]);
        }
        wstream.end();
    });
};

req(start_url[0]);
You can just keep track of the previously seen domains in a Set object like this:
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');

var domainList = new Set();
var start_url = ["http://blog.codinghorror.com/"];
var wstream = fs.createWriteStream("output.json");

// Extract root domain name from string
function extractDomain(url) {
    var domain;
    if (url.indexOf("://") > -1) { // find & remove protocol (http(s), ftp, etc.) and get domain
        domain = url.split('/')[2];
    } else {
        domain = url.split('/')[0];
    }
    domain = domain.split(':')[0]; // find & remove port number
    // since domains are not case sensitive, canonicalize by going to lowercase
    return domain.toLowerCase();
}

var req = function(url) {
    request(url, function(error, response, html) {
        if (!error) {
            var $ = cheerio.load(html);
            $("a").each(function() {
                var link = $(this).attr("href");
                if (link) {
                    var makelinkplain = extractDomain(link);
                    // see if we've already done this domain
                    if (!domainList.has(makelinkplain)) {
                        domainList.add(makelinkplain);
                        start_url.push("http://" + makelinkplain);
                        wstream.write('"http://' + makelinkplain + '",');
                    }
                }
            });
        }
        start_url.shift();
        if (start_url.length > 0) {
            return req(start_url[0]);
        }
        wstream.end();
    });
};

req(start_url[0]);
Note: I also added a .toLowerCase() to the extractDomain() function since domains are not case sensitive, but a Set object is. This will make sure that even domains that differ only in case are recognized as the same domain.
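A quick illustration of why the lowercasing matters:

    var seen = new Set();
    seen.add('Example.com');
    console.log(seen.has('example.com')); // false -- Set membership is case sensitive
    console.log(seen.has('Example.com')); // true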

How can I exit PhantomJS when a specific request is not found?

The problem is, some sites contain the request to test.com/test.aspx and some don't.
If the request exists, it should print the JSON and exit.
If the request does not exist, it should exit too - at the moment, it stays open in this case.
Also, how could I make the code better? Maybe even faster if that's possible?
My JS code:
var Url = "http://www.test.de";
var params = new Array();
var webPage = require('webpage');
var page = webPage.create();
var targetJSON = {};

page.open(Url);

page.onResourceRequested = function(requestData, networkRequest) {
    var match = requestData.url.match(/test.com\/test.aspx/g);
    if (match != null) {
        var targetString = decodeURI(JSON.stringify(requestData.url));
        var klammerauf = targetString.indexOf("{");
        var jsonobjekt = targetString.substr(klammerauf, (targetString.indexOf("}") - klammerauf) + 1);
        targetJSON = (decodeURIComponent(jsonobjekt));
        console.log(targetJSON);
        phantom.exit();
    }
};
I tried to add
} else {
phantom.exit();
}
and
} if (match == null) {
phantom.exit();
}
but nothing solves my problem.
If you want to check whether something doesn't exist, you need to check all things to see that none of them is it, or in first-order logic: ¬∃x P(x) ≡ ∀x ¬P(x).
You first need to see all requests to know whether your intended request was among them. For example, like this:
var found = false;

page.onResourceRequested = function(requestData, networkRequest) {
    var match = requestData.url.match(/test.com\/test.aspx/g);
    if (match != null) {
        var targetString = decodeURI(JSON.stringify(requestData.url));
        var klammerauf = targetString.indexOf("{");
        var jsonobjekt = targetString.substr(klammerauf, (targetString.indexOf("}") - klammerauf) + 1);
        targetJSON = (decodeURIComponent(jsonobjekt));
        console.log(targetJSON);
        found = true;
        phantom.exit();
    }
};

page.open(Url, function() {
    setTimeout(function() {
        console.log("found: " + found); // will always print "false"
        phantom.exit();
    }, 1000);
});
I solved this with a global variable which denotes whether the request was found; if it wasn't, you can exit PhantomJS. I wait until the page is loaded, plus an additional waiting time in case there are Ajax requests.

AJAX response text undefined

I am following the two posts below:
Ajax responseText comes back as undefined
Can't return xmlhttp.responseText?
I have implemented the code in the same fashion, but I am getting
undefined is not a function
wherever I use the callback() function in my code.
CODE:
function articleLinkClickAction(guid, callback) {
    var host = window.location.hostname;
    var action = 'http://localhost:7070/assets/find';
    var url = action + '?listOfGUID=' + guid.nodeValue;
    console.log("URL " + url);
    xmlhttp = getAjaxInstance();
    xmlhttp.onreadystatechange = function() {
        if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
            var response = JSON.parse(xmlhttp.responseText);
            console.log(response);
            console.log(xmlhttp.responseText);
            callback(null, xmlhttp.responseText); // this is the line causing the error
        } else {
            callback(xmlhttp.statusText); // this is the line causing the error
        }
    };
    xmlhttp.open("GET", url, true);
    xmlhttp.send(null);
}
And I am calling it from this code:
var anchors = document.getElementsByTagName("a");
var result = '';
for (var i = 0; i < anchors.length; i++) {
    var anchor = anchors[i];
    var guid = anchor.attributes.getNamedItem('GUID');
    if (guid) {
        articleLinkClickAction(guid, function(err, response) { // pass an anonymous function
            if (err) {
                return "";
            } else {
                var res = response;
                html = new EJS({url: 'http://' + host + ':1010/OtherDomain/article-popup.ejs'}).render({price: res.content[i].price});
                document.body.innerHTML += html;
            }
        });
    }
}
You are using a single global variable for your xmlhttp and trying to run multiple ajax calls at the same time. As such each successive ajax call will overwrite the previous ajax object.
I'd suggest adding var in front of the xmlhttp declaration to make it a local variable in your function so each ajax request can have its own separate state.
function articleLinkClickAction(guid, callback) {
    var host = window.location.hostname;
    var action = 'http://localhost:7070/assets/find';
    var url = action + '?listOfGUID=' + guid.nodeValue;
    console.log("URL " + url);
    // add var in front of xmlhttp here to make it a local variable
    var xmlhttp = getAjaxInstance();
    xmlhttp.onreadystatechange = function() {
        if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
            var response = JSON.parse(xmlhttp.responseText);
            console.log(response);
            console.log(xmlhttp.responseText);
            callback(null, xmlhttp.responseText);
        } else {
            callback(xmlhttp.statusText);
        }
    };
    xmlhttp.open("GET", url, true);
    xmlhttp.send(null);
}
In the future, you should consider using Javascript's strict mode because these "accidental" global variables are not allowed in strict mode and will report an error to make you explicitly declare all variables as local or global (whichever you intend).
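For example, strict mode turns the accidental global into an immediate error (illustrative):

    'use strict';
    function demo() {
        xmlhttp = 1; // throws ReferenceError: xmlhttp is not defined
    }
    demo();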
I can't say if this is the only error stopping your code from working, but it is certainly a significant error that is in the way of proper operation.
Here's another significant issue. In your real code (seen in a private chat), you are using:
document.body.innerHTML += html
in the middle of the iteration of an HTMLCollection obtained like this:
var anchors = document.getElementsByTagName("a");
In this code, anchors will be a live HTMLCollection. That means it will change dynamically anytime an anchor element is added to or removed from the document. But each time you do document.body.innerHTML += html, that recreates the entire set of body elements from scratch and thus completely changes the anchors HTMLCollection. Doing document.body.innerHTML += html in the first place is just a bad practice. Instead, you should append new elements to the existing DOM. I don't know exactly what's in that html, but you should probably just create a div, put the HTML in it, and append the div like this:
var div = document.createElement("div");
div.innerHTML = html;
document.body.appendChild(div);
But this isn't quite all, because if the new HTML contains more <a> tags, then your live HTMLCollection in anchors will still change.
I'd suggest changing this code block:
var anchors = document.getElementsByTagName("a");
var result = '';
for (var i = 0; i < anchors.length; i++) {
    var anchor = anchors[i];
    var guid = anchor.attributes.getNamedItem('GUID');
    if (guid) {
        articleLinkClickAction(guid, function(err, response) { // pass an anonymous function
            if (err) {
                return "";
            } else {
                var res = response;
                html = new EJS({url: 'http://' + host + ':1010/OtherDomain/article-popup.ejs'}).render({price: res.content[i].price});
                document.body.innerHTML += html;
            }
        });
    }
}
to this:
(function() {
    // get a static copy of the anchors that won't change as the document is modified
    var anchors = Array.prototype.slice.call(document.getElementsByTagName("a"));
    var host = window.location.hostname; // host must also be defined in this scope
    // use forEach so each async callback sees its own index i
    anchors.forEach(function(anchor, i) {
        var guid = anchor.attributes.getNamedItem('GUID');
        if (guid) {
            articleLinkClickAction(guid, function(err, response) { // pass an anonymous function
                if (err) {
                    console.log('error : ' + err);
                } else {
                    var res = JSON.parse(response); // responseText is a string; parse before using it
                    var html = new EJS({
                        url: 'http://' + host + ':1010/OtherDomain/article-popup.ejs'
                    }).render({
                        price: res.content[i].price
                    });
                    var div = document.createElement("div");
                    div.innerHTML = html;
                    document.body.appendChild(div);
                }
            });
        }
    });
})();
This makes the following changes:
Encloses the code in an IIFE (immediately invoked function expression) so the variables declared in the code block are not global.
Changes from document.body.innerHTML += html to document.body.appendChild() to avoid recreating all the DOM elements every time.
Declares var html so it's a local variable, not another accidental global.
Makes a static copy of the result of document.getElementsByTagName("a") using Array.prototype.slice.call() so the array will not change as the document is modified, allowing us to iterate it accurately.
Iterates with .forEach() so each asynchronous callback closes over its own index i, and parses the JSON response before using it.
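A quick illustration of the live vs. static distinction:

    var live = document.getElementsByTagName("a");  // live HTMLCollection
    var frozen = Array.prototype.slice.call(live);  // static Array snapshot
    document.body.appendChild(document.createElement("a"));
    console.log(live.length);   // grew by one
    console.log(frozen.length); // unchanged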
