detect external broken links

My website has many external download links, and some of them expire, so I want to detect expired links automatically.
For me, a valid link is a direct file download link pointing to one of my file servers.
A broken link leads to a simple HTML page with an error message.
My first idea was to fetch the HTML source of the download link and check whether it contains an error, but that did not work.
I also tried JavaScript, but the problem is that JS cannot make requests to external links.
Any ideas?
Thanks a lot.

If you don't mind letting the client do the work, you could try doing it with JavaScript.
I have a Greasemonkey script that automatically checks all links in the open page and marks them according to the server response (not found, forbidden, etc.).
See if you can get some ideas from it: http://userscripts.org/scripts/show/77701
Cross-domain policies do not apply to GM_xmlhttpRequest, but if you want a plain JavaScript solution you might have to try a workaround, like:
Is it possible to use XMLHttpRequest across Domains
Making JavaScript call across domains
cross domain XMLHttprequest
If you want a server-side solution, I believe the above answer can help you.

This isn't a task for your front end but for the back end. As supernova said, check it from your server once a day. AJAX requests will not be your answer, since the browser's same-origin policy blocks requests to other domains unless those servers explicitly allow them.
Solution:
OK, based on your comment, check this solution:
<html>
<head>
  <script src='http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js'></script>
  <script>
    $(document).ready(function(){
      var linksDiv = $('#links');
      $('#generateLinks').click(function(){
        //I don't know your logic for this function, so I'll try to reproduce the same behavior
        var someURLs = ['http://www.google.com','http://www.djfhdkjshjkfjhk.com', 'http://www.yahoo.com'];
        linksDiv.html('');
        for(var i = 0; i < someURLs.length; i++){
          var link = $('<a/>').attr('href', someURLs[i]).append('link ' + i).css('display','block');
          linksDiv.append(link);
        }
      });
      $('#getLinksAndSend').click(function(){
        var links = linksDiv.find('a');
        var gatheredLinks = [];
        $(links).each(function(){
          gatheredLinks.push(this.href);
        });
        sendLinks(gatheredLinks);
      });
      var sendLinks = function(links){
        $.ajax({
          url: "your_url",
          type: "POST",
          data: {
            links: links
          }
        }).done(function(resp) {
          alert('Ok!');
        });
      };
    });
  </script>
</head>
<body>
  <div id="links">
  </div>
  <button id="generateLinks">Generate all links</button>
  <button id="getLinksAndSend">Get links and send to validator</button>
</body>
</html>
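The page above only posts the links to "your_url"; the validator behind that URL is left open. A minimal sketch of what it might do, assuming Node 18+ (which ships a global fetch) and assuming that a healthy download answers with a 2xx status and a non-HTML Content-Type while an expired link answers with an HTML error page, as the question describes. The names classifyLink, checkLink and checkLinks are made up for this example.

```javascript
// Pure decision step, kept separate so it can be tested without a network.
// Assumption: 2xx + non-HTML Content-Type means a real file download.
function classifyLink(status, contentType) {
  if (status < 200 || status >= 300) return 'broken';
  const type = (contentType || '').toLowerCase();
  return type.startsWith('text/html') ? 'broken' : 'ok';
}

// Ask for headers only; some servers reject HEAD, so fall back to GET.
async function checkLink(url) {
  try {
    let res = await fetch(url, { method: 'HEAD', redirect: 'follow' });
    if (res.status === 405 || res.status === 501) {
      res = await fetch(url, { method: 'GET', redirect: 'follow' });
    }
    const result = classifyLink(res.status, res.headers.get('content-type'));
    if (res.body) await res.body.cancel(); // only the headers were needed
    return { url, result };
  } catch (err) {
    // DNS failure, refused connection, and so on.
    return { url, result: 'broken', error: String(err) };
  }
}

// Check every link posted by the page and report one result per link.
async function checkLinks(urls) {
  return Promise.all(urls.map(checkLink));
}
```

Running checkLinks from a daily scheduled job, as suggested above, keeps the checking off the visitor's browser entirely.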

It may be overkill but there's a program in linux kde called klinkstatus that can find broken links in a website:
https://www.kde.org/applications/development/klinkstatus/

Related

Lazy-load a javascript script?

Is there a way I can wrap an external JS script embed with lazy-load behavior to only execute when the embed is in the viewport?
Context: I have an external javascript embed that when run, generates an iframe with a scheduling widget. Works pretty well, except that when the script executes, it steals focus and scrolls you down to the widget when it’s done executing. The vendor has been looking at a fix for a couple weeks, but it’s messing up my pages. I otherwise like the vendor.
Javascript embed call:
<a href=https://10to8.com/book/zgdmlguizqqyrsxvzo/ id="TTE-871dab0c-4011-4293-bee3-7aabab857cfd" target="_blank">See
Online Booking Page</a>
<script src=https://d3saea0ftg7bjt.cloudfront.net/embed/js/embed.min.js> </script> <script>
window.TTE.init({
targetDivId: "TTE-871dab0c-4011-4293-bee3-7aabab857cfd",
uuid: "871dab0c-4011-4293-bee3-7aabab857cfd",
service: 1158717
});
</script>
While I'm waiting for the vendor to fix their js, I wondered if lazy-loading the JS embed may practically eliminate the poor user experience. Warning: I'm a JS/webdev noob, so probably can't do anything complicated. A timer-based workaround is not ideal because users may still be looking at other parts of the page when the timer runs out. Here are the things I’ve tried and what happens:
I tried: adding async to one or both of the script declarations above.
What happened: either only shows the link or keeps stealing focus.
I tried: adding type="module" to one or both script declarations above.
What happened: only rendered the link.
I tried: wrapping the above code in an iframe with the appropriate lazy-loading tags.
What happened: it rendered a blank space.
Also, I realize it's basically the same question as this, but it didn't get any workable answers.

I also speak French, but I'll reply in English for everybody.
Your question was interesting because I also wanted to try out lazy loading, so I had a play on CodePen with your example (using your booking id).
I used the appear.js library because I didn't want to spend time trying other APIs (there may be lighter ones worth considering).
The main JS part I wrote looks like this:
// The code to init the appear.js lib and add our logic for the booking links.
(function(){
  // Perhaps these constants could be put in the generated HTML. I don't really know
  // where they come from but they seem to be related to an account.
  const VENDOR_LIB_SRC = "https://d3saea0ftg7bjt.cloudfront.net/embed/js/embed.min.js";
  const UUID = "871dab0c-4011-4293-bee3-7aabab857cfd";
  const SERVICE = 1158717;
  let vendorLibLoaded = false; // Just to avoid loading several times the vendor's lib.
  appear({
    elements: function() {
      return document.querySelectorAll('a.booking-link');
    },
    appear: function(bookingLink) {
      console.log('booking link is visible', bookingLink);
      /**
       * A function which we'll be able to execute once the vendor's
       * script has been loaded or later when we see other booking links
       * in the page.
       */
      function initBookingLink(bookingLink) {
        window.TTE.init({
          targetDivId: bookingLink.getAttribute('id'),
          uuid: UUID,
          service: SERVICE
        });
      }
      if (!vendorLibLoaded) {
        // Load the vendor's JS and once it's loaded then init the link.
        let script = document.createElement('script');
        script.onload = function() {
          vendorLibLoaded = true;
          initBookingLink(bookingLink);
        };
        script.src = VENDOR_LIB_SRC;
        document.head.appendChild(script);
      } else {
        initBookingLink(bookingLink);
      }
    },
    reappear: false
  });
})();
You can try my CodePen here: https://codepen.io/patacra/pen/gOmaKev?editors=1111
Tell me when to delete it if it contains sensitive data!
Kind regards,
Patrick
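The answer above mentions that lighter APIs than appear.js may exist. One candidate is the browser's built-in IntersectionObserver. The following is only a sketch under that assumption, not what the CodePen uses: initWhenVisible is a made-up helper name, and the ObserverCtor parameter exists only so the function can be exercised outside a browser.

```javascript
// Run `init` once, the first time `element` enters the viewport.
// `ObserverCtor` defaults to the browser's IntersectionObserver.
function initWhenVisible(element, init, ObserverCtor = globalThis.IntersectionObserver) {
  const observer = new ObserverCtor(function (entries) {
    for (const entry of entries) {
      if (entry.isIntersecting) {
        observer.disconnect(); // stop watching: one-shot behaviour
        init(entry.target);
        return;
      }
    }
  });
  observer.observe(element);
  return observer;
}

// Hypothetical browser usage, reusing the loading logic from the answer above:
// initWhenVisible(document.querySelector('a.booking-link'), function (link) {
//   /* append the vendor <script>, then call window.TTE.init(...) in its onload */
// });
```

Because the vendor script is only appended inside the callback, nothing can steal focus until the user has actually scrolled to the booking link.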

This method lazy-loads HTML elements only when they become visible to the user. If the element has not been scrolled into the viewport, it is not loaded; it works like lazy-loading an image.
Add the LazyHTML script to the head.
<script async src="https://cdn.jsdelivr.net/npm/lazyhtml@1.0.0/dist/lazyhtml.min.js" crossorigin="anonymous" debug></script>
Wrap the element in the LazyHTML wrapper.
<div class="lazyhtml" data-lazyhtml onvisible>
<script type="text/lazyhtml">
<!--
<a href=https://10to8.com/book/zgdmlguizqqyrsxvzo/ id="TTE-871dab0c-4011-4293-bee3-7aabab857cfd" target="_blank">See
Online Booking Page</a>
<script src=https://d3saea0ftg7bjt.cloudfront.net/embed/js/embed.min.js>
</script>
<script>
window.TTE.init({
targetDivId: "TTE-871dab0c-4011-4293-bee3-7aabab857cfd",
uuid: "871dab0c-4011-4293-bee3-7aabab857cfd",
service: 1158717
});
</script>
-->
</script>
</div>

Fetching a jquery DYNAMIC value from website

Title is pretty much self explanatory...
I'm getting the source code of the website but the value that I want to fetch is made dynamically via jQuery.
Let's assume that this is the source code of example.com
<div id="currentTime"></div>
<script>
var myVar = setInterval(function(){ myTimer() }, 1000);
function myTimer() {
var d = new Date();
var t = d.toLocaleTimeString();
document.getElementById("currentTime").innerHTML = t;
}
function myStopFunction() {
clearInterval(myVar);
}
</script>
Getting the source code is easy, but how do I fetch the value that ends up in between:
<div id="currentTime">{This is what I'm looking for}</div>
In other words, I want to fetch the value that JavaScript (or jQuery) has written into the #currentTime div.
Thanks!

You can do this:
$('#result').load('otherPageUrl.php div#currentTime');
See: Example
The only problem is that this will not be possible if cross-domain restrictions block the request. The other way to do it is with an iframe.

I am not sure how you are loading the contents of the other website, but you generally cannot do it with AJAX because of cross-domain restrictions.
One possible approach is to load the website in an iframe in your page and then refer to the element you are interested in. That way you don't have to worry about content that is generated dynamically in the other page. Note that scripts in your page can only read inside the iframe when both pages share the same origin.
Note: Make sure you are not using data from an external website that you are not supposed to grab this way. If it is a website you are not associated with, double-check that what you are doing is allowed!

How to use Javascript/jQuery to load the content from another domain?

I need to create a javascript application that can display the content from another domain (admittedly another big website). Further interpretation of the DOM tree is not needed at the moment. It will be used by only ten more people.
I can make it work via php's get_content function. But that is very slow since it runs on the server side. I looked into any origin but cannot get it to work. It is best to not touch any origin since we use it extensively and we don't have much cash to spend around. Can anyone help? By the way, iframe is not an option since the big website blocked it. The code is below. Admittedly I kind of took it from another stackoverflow answer. Thank you in advance!
Btw. another engineer told me if I use the extension .hta instead of html, the same-origin policy issue would be resolved. I tried it and it did not work. But I was wondering if I did it right.
<html>
<head>
<script src="http://code.jquery.com/jquery-1.9.1.min.js"></script>
<script>
function myCallbackFunction(myData) {
$(function() {
$("#test").contents().find('html').html(myData.contents);
});
}
</script>
<script src="http://anyorigin.com/get?url=http://http://www.amazon.com/dp/B001F7SGHQ/&callback=myCallbackFunction"></script>
</head>
<body>
</body>
<iframe id='test' style='width: 100%; height: 100%'>
</html>

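Try something like the following.
var invocation = new XMLHttpRequest();
var url = 'http://www.amazon.com/dp/B001F7SGHQ/';
// Called on every state change; readyState 4 means the request has finished.
function handler() {
  if (invocation.readyState === 4 && invocation.status === 200) {
    console.log(invocation.responseText);
  }
}
function callOtherDomain() {
  if (invocation) {
    invocation.open('GET', url, true);
    invocation.withCredentials = true; // send cookies with the cross-origin request
    invocation.onreadystatechange = handler;
    invocation.send();
  }
}
Setting withCredentials = true makes the browser include cookies and other credentials in the cross-origin request. It does not add the "Access-Control-Allow-Origin" header: that header has to be sent by the target server, and without it the browser blocks the response.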
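Access-Control-Allow-Origin is a response header, so whether the request above succeeds is decided by what the target server sends back. As a rough illustration, here is a simplified model of the check a browser applies to such a response. The function name corsAllows and the plain-object header format (lower-cased names) are assumptions made to keep the sketch self-contained; real browsers perform more checks, such as preflight requests.

```javascript
// Simplified model of the browser's decision for a cross-origin response.
// `headers`: plain object keyed by lower-cased response header names.
function corsAllows(headers, pageOrigin, withCredentials) {
  const allowOrigin = headers['access-control-allow-origin'];
  if (!allowOrigin) return false; // the server did not opt in at all
  if (withCredentials) {
    // With credentials the wildcard is not accepted, and the server must
    // also send Access-Control-Allow-Credentials: true.
    return allowOrigin === pageOrigin &&
      headers['access-control-allow-credentials'] === 'true';
  }
  return allowOrigin === '*' || allowOrigin === pageOrigin;
}
```

A large third-party site that sends no such header falls into the first branch, which is why a server-side fetch or proxy ends up being needed in that case.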

Another good solution that might be what you need uses PHP: a class called PHP Simple HTML DOM Parser.
This class can copy the full source of a website, and you can save it on your server with whatever extension you want. You can also modify what you need before saving. The class has full documentation (you need to be comfortable with object-oriented PHP 5).
Here is a link to the class:
http://simplehtmldom.sourceforge.net/
A more advanced thing you can do to make your website faster is to use a cache system, so that you download the source from the website only once a day, once an hour, or every 12 hours, and save it on your host.
I hope this gives you what you need.
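The caching idea is independent of PHP. Below is a minimal sketch of it in JavaScript, since that is what the question uses; createCachedLoader and its parameters are made-up names, and load stands for whatever actually downloads the page source. The clock is passed in so that expiry can be checked without waiting.

```javascript
// Time-based cache around any loader function.
// `load`  : hypothetical function that downloads the page source
// `ttlMs` : how long a downloaded copy stays fresh, in milliseconds
// `now`   : clock function, injectable for testing
function createCachedLoader(load, ttlMs, now = Date.now) {
  let value;
  let loadedAt = -Infinity; // forces a load on the first call
  return function get() {
    if (now() - loadedAt >= ttlMs) {
      value = load(); // may be a Promise; it is cached as-is
      loadedAt = now();
    }
    return value;
  };
}

// Hypothetical usage: refresh at most once per hour.
// const getPage = createCachedLoader(
//   () => fetch('http://www.amazon.com/dp/B001F7SGHQ/').then((r) => r.text()),
//   60 * 60 * 1000
// );
```

One caveat of this simple version: if load returns a rejected Promise, the failure is cached until the entry expires.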

Tracking outgoing links with Javascript and PHP

I have tried it using jQuery but it is not working.
<script>
$("a").click(function () {
  $.post("http://www.example.com/trackol.php", {result: "click"}, "html");
});
</script>
out

To get the best results you should change two things in your approach:
Use onmousedown instead of click. This way you get a few extra milliseconds to complete the tracking request; otherwise the browser might not start the connection to your tracker at all, as it is already navigating away from the original page. The downside is that you might get some false-positive counts, since the user might not finish the click (e.g. keeps the mouse button down and moves the cursor away from the link), but overall it's a sacrifice worth making for the better quality of tracking.
Instead of an Ajax call ($.post('...')), use an image pre-fetcher (new Image().src='...'). The fact that the tracker is not an image is not relevant here because you don't want to use the resulting "image" anyway; you just want to make a request to the server. An Ajax call is a two-way connection, so it takes a bit more time and might fail if the browser is already navigating away, whereas the image pre-fetcher just sends the request to the server, and it doesn't matter whether you get something back.
So the solution would be something like this:
<script>
$(document).ready(function() {
  $("a").mousedown(function (){
    new Image().src = "http://www.example.com/trackol.php?result=click";
  });
});
</script>
out

Instead of using JavaScript to call a php tracking script, you could just link to your tracking script directly and have it in turn redirect the response to the ultimate destination, something like this:
out
and in the PHP script, after you do your tracking stuff:
...
header("Location: $dest");
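For the redirect approach, every outgoing link has to point at the tracker and carry its real destination along. A small sketch of building such links, assuming the tracker reads the destination from a query parameter named dest (chosen here only to match the $dest variable in the PHP fragment; the original link markup is not shown). trackedHref is a made-up helper name.

```javascript
// Build the href for a link that goes through the tracker first.
// Assumption: the tracker expects the target in a `dest` query parameter.
function trackedHref(trackerUrl, destination) {
  const url = new URL(trackerUrl);
  url.searchParams.set('dest', destination); // percent-encodes the value
  return url.toString();
}

// Rewrite every external link on the page (browser only; skipped elsewhere).
if (typeof document !== 'undefined') {
  for (const a of document.querySelectorAll('a[href^="http"]')) {
    if (a.hostname !== location.hostname) {
      a.href = trackedHref('http://www.example.com/trackol.php', a.href);
    }
  }
}
```

The tracking script should only redirect to destinations it recognises; redirecting to whatever dest contains would let anyone use it as an open redirect.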

As mentioned, the problem is that you're not running the script after the DOM has loaded. You can fix this by wrapping your jQuery script inside $(function() { }), like so:
This works:
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Tracking outgoing links with JavaScript and PHP</title>
</head>
<body>
<p>Test link to Google</p>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.4/jquery.min.js"></script>
<script>
$(function() {
$('a').click(function() {
$.post('http://www.example.com/trackol.php', { result: 'click' }, 'html');
});
});
</script>
</body>
</html>
See it in action here: http://jsbin.com/imomo3

Is there a light-weight client-side HTML include method?

I'm looking for a light weight method for client-side includes of HTML files. In particular, I want to enable client-side includes of publication pages of researchr.org, on third party web pages. For example, I'd like to export a page like
http://researchr.org/profile/eelcovisser/publications
(probably just the publications box of that page.)
Using an iframe it is possible to include HTML pages:
<iframe class="foo" style="height: 50em;" width="100%" frameborder="0"
src="http://researchr.org/profile/eelcovisser/publications">
</iframe>
However, iframes require specification of a fixed height, while the pages I'm exporting don't have a fixed height. The result has an ugly scrollbar:
http://swerl.tudelft.nl/bin/view/EelcoVisser/PublicationsResearchr
I found one reference to a method that appears to be appealing
http://www.webdeveloper.com/forum/archive/index.php/t-26436.html
It uses an iframe to import the html, and then a javascript call from the included document to a function defined in the including document, which places the contents of the body of the included file in a div of the including file. This does not work in my scenario, probably due to the same origin policy for javascript, i.e. the including and included page are not from the same domain (which is the whole point).
Any ideas for solving this? Which could be either:
a CSS trick to make the height of the iframe flexible
a javascript technique to lift the contents of the iframe to a div in the including page
some other approach I've overlooked
Requirement: the code to include on the third-party page should be minimal.

No. The same-origin policy prevents you from doing any of that stuff (and rightly). You will have to go server-side: have a script on your server access that page and copy its contents into your own page (preferably at build time or in the background; you could do it at access time or via AJAX, but that would involve a lot of scraping traffic between your server and theirs, which may not be appreciated).
Or just put up with the scrollbar or make the iframe very tall.

As far as I know there is no CSS trick. The only way is to query the iframe's document.documentElement.offsetHeight or scrollHeight, whichever is higher, and apply that value to the iframe's CSS height (adding 'px').
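The height query described above might look like the sketch below. It only works when the framed page comes from the same origin as the including page; for a cross-origin frame, such as the researchr.org case in the question, the browser does not expose the document and contentDocument is null. The function names are made up for this example.

```javascript
// Take the larger of offsetHeight and scrollHeight and append 'px'.
function contentHeightPx(doc) {
  const root = doc.documentElement;
  return Math.max(root.offsetHeight, root.scrollHeight) + 'px';
}

// Resize the iframe to fit its content; returns false when the framed
// document is not accessible (cross-origin or not loaded yet).
function fitIframe(iframe) {
  const doc = iframe.contentDocument; // null when cross-origin
  if (!doc) return false;
  iframe.style.height = contentHeightPx(doc);
  return true;
}

// Hypothetical usage in a browser:
// const frame = document.querySelector('iframe.foo');
// frame.addEventListener('load', function () { fitIframe(frame); });
```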

try this ajax with cross domain capability

Why don't you use AJAX?
Try this:
<div id="content"></div>
<script type="text/javascript">
// Create an XMLHttpRequest, falling back to ActiveX on old Internet Explorer.
function AJAXObj() {
  var obj = null;
  if (window.XMLHttpRequest) {
    obj = new XMLHttpRequest();
  } else if (window.ActiveXObject) {
    obj = new ActiveXObject("Microsoft.XMLHTTP");
  }
  return obj;
}
var retriever = AJAXObj();
function getContent(url) {
  if (retriever != null) {
    retriever.open('GET', url, true);
    retriever.onreadystatechange = function() {
      // readyState 4: the response has been fully received
      if (retriever.readyState == 4) {
        document.getElementById('content').innerHTML = retriever.responseText;
      }
    };
    retriever.send(null);
  }
}
getContent('http://researchr.org/profile/eelcovisser/publications');
</script>
And then, you can parse the received page content with JS with regular expressions, extracting whatever content you want from that page.
Edit:
Sorry, I guess I missed the fact that it's a different domain. But as ceejayoz said, you could use a proxy for that.

If you're using jQuery, you can use the load method to retrieve a page via AJAX, optionally scrape content from it, and inject it into an existing element. The only problem is that it requires JavaScript.
