I am trying to write a browser extension that will detect advertisements. I don't want an AdBlock, I just wish to detect how many ads are encountered. I don't know where to begin searching for ads in the HTML, though. Any help for a good first start?
Most adblockers catch the ads via some form of a regex match.
I would recommend starting with the adblockpluscore repository, since it's open source and you can quickly run through the source code.
Start with the test directory, in particular the patterns.ini file, and look at the common patterns used to identify different sources of ads.
Search for these sections in patterns.ini:
General tracking systems
Third-party tracking domain
Expect your initial solutions not to be very effective, since ads come in many different forms, but you'll find common patterns among many of them.
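The regex approach described above can be sketched roughly as follows. This is only a starting point under stated assumptions: the two host patterns and the function name countAdUrls are made up for illustration and are not taken from patterns.ini, and a real filter list is far larger.

```javascript
// Hypothetical patterns for illustration only; real filter lists are much larger.
const AD_PATTERNS = [
  /(^|\.)doubleclick\.net$/i,        // host is, or ends with, doubleclick.net
  /(^|\.)googlesyndication\.com$/i,  // host is, or ends with, googlesyndication.com
];

// Count how many of the given resource URLs have a host matching a pattern.
function countAdUrls(urls, patterns = AD_PATTERNS) {
  let count = 0;
  for (const raw of urls) {
    let host;
    try {
      host = new URL(raw).hostname;
    } catch (e) {
      continue; // skip relative or malformed URLs
    }
    if (patterns.some((re) => re.test(host))) count += 1;
  }
  return count;
}

// In a content script, the URLs might be collected from the page like this:
// const urls = [...document.querySelectorAll('script[src], iframe[src], img[src]')]
//   .map((el) => el.src);
// console.log(countAdUrls(urls));
```

Matching on the host name rather than the whole URL keeps the patterns short, at the cost of missing ads served from first-party paths.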
Ads vary, but I think that Google Ads uses an ins element (the <ins> tag). You can see the code on Google's page: here.
So you can search the page for an ins element and add it to a counter, sort of like the following extremely simple/barebones code:
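// jQuery plugin: count the elements in the current set that match the selector.
$.fn.count = function(selector) {
  return this.filter(selector).length;
};
// Example: $('*').count('ins') counts every ins element on the page.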
If this doesn't work, look at this SO question.
Remember that this is just to get started and will not work 100% of the time. As wOxxOm pointed out, ads are complicated.
Is it possible to get a list of all scripts injected by the browser, or at least to detect them somehow? Sometimes on Windows there are viruses that inject scripts on the fly, modifying e.g. click actions to display ads. I'm writing a fairly advanced website, so I'd like to warn the user about other scripts which most likely:
will crash, as my webapp modifies basic native browser APIs like document.getElement* or even listeners
may make the webapp unstable and, in the worst case, make it crash
could be a performance overkill
I'm also talking about scripts that modify site content, such as Ponify or XKCD numbers.
I know about navigator.plugins, but it doesn't seem to be what I am looking for.
Not really. I mean you could do a:
document.getElementsByTagName('script');
And fetch all script tags, and check their src attribute, but it's not that simple. It's possible to make an Ajax request for a JavaScript file, and then eval(ajaxResult) to execute that code. Your browser has no way of knowing where that code came from, as it's just a string.
There are many ways to execute JavaScript and clean up any trace afterwards; there is no way to cover them all.
EDIT: I missed the key phrase "scripts injected by browser" :)
At least in Chrome, some extensions do seem to inject script tags, though they don't seem to be marked up in any special way. Filtering them may be tricky. Perhaps if you add a class to the script tags you know should be there, and you find a script tag that does not have that class, then you know it could be an extension script.
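The class-marking idea could look roughly like this. It is a sketch, not a reliable detector: the function name findUnmarkedScripts and the class name my-app are made up, and it only sees script tags that are still in the document when it runs.

```javascript
// Return the script elements that lack the marker class.
// `scripts` can be any iterable of elements exposing a `className` string,
// e.g. document.getElementsByTagName('script').
function findUnmarkedScripts(scripts, markerClass) {
  const unmarked = [];
  for (const script of scripts) {
    const classes = String(script.className || '').split(/\s+/);
    if (!classes.includes(markerClass)) unmarked.push(script);
  }
  return unmarked;
}

// In the page, assuming your own tags carry class="my-app" (a made-up name):
// const suspects = findUnmarkedScripts(document.getElementsByTagName('script'), 'my-app');
// suspects.forEach((s) => console.warn('Unexpected script:', s.src || '(inline)'));
```

A script that removes its own tag after running, or code run through eval, will not show up here.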
I'm also not sure that an extension must insert a script tag to do stuff on the page. I think it has ways of interacting with the page directly from the extension code as well. Not sure though. More research required. And this is probably different for different browsers.
Personally, I think defensively protecting your site from people's own browser extensions is a fool's errand. If someone wants to hamstring their own browsing experience in bizarre ways, it's not your responsibility to hold their hand. And you will have a very hard time detecting all the ways an extension can blow up your everything.
AFAIK, the rel="nofollow" attribute on links instructs search engines not to follow the link when they crawl your site, therefore severing any assumed relationship between your site and the linked site, and therefore not sharing any of your SEO goodness. For the most part, that's a Good Thing™ on a comment system.
Now, after integrating an IntenseDebate system on my site, I noticed that the commenter names link through their respective websites without nofollow. This kind of raised an alarm in my head --- that is, until I realized that these were generated dynamically via AJAX. Which means that these links aren't there when a search spider crawls through my site.
Problem averted. Good. A good sigh of relief.
But then, there are these sites that suggest implementing a script-based solution to add nofollow.
Now that just doesn't jibe well with my current understanding of nofollow, for two reasons:
As mentioned, the links aren't there when a spider crawls your page. So it doesn't make sense to nofollow it, because as far as the spider is concerned, there isn't anything to follow after all.
Regarding static links, a spider wouldn't be able to run the script to add nofollow on your markup, so links that a spider can follow will be unmodified, and therefore, are follow links.
Am I missing something here? Is it actually useful to dynamically add nofollow to links using Javascript?
From an interview with Matt Cutts from Google (emphasis mine):
For a while, we were scanning within JavaScript, and we were looking
for links. Google has gotten smarter about JavaScript and can execute
some JavaScript. I wouldn't say that we execute all JavaScript, so
there are some conditions in which we don't execute JavaScript.
Certainly there are some common, well-known JavaScript things like
Google Analytics, which you wouldn't even want to execute because you
wouldn't want to try to generate phantom visits from Googlebot into
your Google Analytics.
We do have the ability to execute a large fraction of JavaScript when
we need or want to. One thing to bear in mind if you are advertising
via JavaScript is that you can use NoFollow on JavaScript links
Additional debate on the topic: https://webmasters.stackexchange.com/questions/5653/does-the-google-spider-render-javascript.
I have a page which displays a different website (say www.cnn.com) in an iframe.
All I want, is to make links inside the iframe open in the parent window, and not inside the frame.
I know that this is normally impossible for security reasons, which makes good sense to me. However, the page I'm working on is not going to be public, but only on my private computer, and if I have to switch off certain security features to make it work, it's OK.
Is there any way at all to do this?
I have been combing through the web all day for a solution. If I missed a post here or elsewhere, please point me to it.
I read that in Firefox (which I'm using), it's possible to get extended permissions in javascript if the script is "signed" (or a particular config entry is changed). However, I don't know how to exploit these extended permissions for my purpose...any hints?
I'd also consider different approaches, e.g. not using iframes at all. Whatever the method, I want to be able to embed several websites, which I have no control over, within one page. Links clicked in any of the embedded websites should open in the parent window. It's just supposed to be a handy tool for myself. I should say that I have basically no knowledge of JavaScript and am just learning by doing. If you can confidently say that what I want is not possible with any client-side methods, that would help as well. I guess it would be rather straightforward to do it e.g. with PHP, but I don't want to set up a webserver if it's not necessary. Thanks for any tips!
This is a somewhat different solution from what you asked for, but it might be a better way to attack the problem, as it might give you the ability you seek without compromising any normal web security.
I wonder if Greasemonkey (an add-on for Firefox and other browsers) might be a useful solution for you, as it allows you to run local JavaScript against other pages to modify them locally, somewhat regardless of normal security restrictions. So, you could run through all the links in a CNN page and modify them if that's what you needed to do.
To use it, you would install the Greasemonkey add-on into Firefox, write a script that modifies CNN.com the way you want to, install that script into Greasemonkey, then target the script at just the web page CNN.com. I think it should work on that site whether it's in an iframe or not, but your script could likely detect whether it was in an iframe if you needed to.
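A userscript along these lines might be a starting point. Treat it as an untested sketch: the script name and the retargetLinks helper are made up, the @match pattern assumes the embedded site is on cnn.com, and links that the page adds later or handles with its own click handlers would not be covered.

```javascript
// ==UserScript==
// @name     Open links in top window (illustrative)
// @match    *://*.cnn.com/*
// @grant    none
// ==/UserScript==

// Point every link at the top-level window instead of the current frame.
function retargetLinks(links) {
  let count = 0;
  for (const link of links) {
    link.target = '_top';
    count += 1;
  }
  return count;
}

// Only touch the page when it is actually running inside a frame.
if (typeof window !== 'undefined' && window.top !== window.self) {
  retargetLinks(document.querySelectorAll('a[href]'));
}
```

Comparing window.top with window.self is the usual way to detect framing, so the script leaves CNN.com alone when you open it directly.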
It would appear the HTML5 seamless attribute would be what you are looking for. But it doesn't appear that anything supports it yet...
http://www.w3schools.com/html5/att_iframe_seamless.asp
I want to keep bots from following my external links through rel=nofollow.
I have 2 questions about it:
1) Does this really help my page ranking? (I heard an SEO guy say this: the page ranking should go up because the probability that the user leaves the page is lower.)
2) Does it work when the rel=nofollow is set through javascript in the $(document).ready() function?
EDIT: thanks for the suggestions so far - to go more into detail to 1:
how can the robot know(...)?
The robot knows this because it knows the page ranking of the page that you link to; if that ranking is high, the probability is high that the user follows the link and so leaves my page. That's why it is supposed to be good to have more incoming than outgoing links, where of course incoming links from high-ranked pages count more than incoming links from low-ranked websites. On the other hand, outgoing links to high-ranked pages are supposed to increase the probability that the user leaves. But I am no expert in this; that's just what this SEO guy was telling me.
EDIT 2
The question is whether putting rel="nofollow" on external links improves my Google page ranking and, if it does, whether it still works when it is set through JavaScript.
Thanks in advance
1.
It's possible. Your pages will flow pagerank internally, so having more outbound links will decrease the pagerank you flow to your own pages.
2.
Google is capable of reading javascript, and will honor a nofollow on dynamically created links, however, I am not sure if it works when dynamically adding nofollow on 'static' links.
Of course, there's much speculation when it comes to SEO.
I doubt
No, it doesn't work. Bots generally don't execute JavaScript code.
What?
the page ranking should go up as the probability is lower that the user leaves the page
How should a robot know this?
Robots don't process JavaScript, rel="nofollow" has to be present in the source markup as it is sent to the client.
And to add: rel="nofollow" does not guarantee that a link is not followed or added as link to the other page to build up page rank (the real process is much more complex); that depends on the robot/search engine.
Adding a rel="nofollow" will not stop the bot following the link, but it will stop the bot giving any of your page rank to that link.
Oh, and as said before, bots mostly do not execute JavaScript. I believe Google has been playing around with one that does, but this is the exception, not the norm.
1) The more pages you link out to, the more it affects your authority ratio; you essentially want more links in than you link out. CTR is tracked by Google Analytics, and this is factored into their essentially black-box search ranking magic.
2) Whilst it's commonly thought that robots don't process JavaScript, this is wrong; Google's current generation of robots is Ajax-aware.
I came here looking for an answer to this question myself. (Thanks Andre!)
I can attest to Google following links with href="javascript:..." URLs, and going to the correct pages, so that is no defense against unwanted link-crawling. I have also seen search result snippets include text inserted by javascript, so there is ample evidence of Google processing javascript.
If the links are internal, proper use of robots.txt would be the preferred, easier, and more bandwidth-efficient answer, of course, if you have access to that. (We don't on the server in question, thus my own search for answers.)
I shall be adding nofollow via javascript.
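For reference, adding nofollow from script could be done roughly like this. It is a sketch under assumptions: the helper name addNofollowToExternalLinks is made up, "external" is taken to mean any link whose host differs from the page's own, and whether a given crawler honours the result is exactly what is debated above.

```javascript
// Add "nofollow" to the rel attribute of every link pointing to another host.
// `links` is any iterable of anchor-like objects with `hostname` and `rel`
// properties, e.g. document.querySelectorAll('a[href]').
function addNofollowToExternalLinks(links, ownHostname) {
  let changed = 0;
  for (const link of links) {
    // Skip internal links and links without a host (e.g. javascript: URLs).
    if (!link.hostname || link.hostname === ownHostname) continue;
    const tokens = String(link.rel || '').split(/\s+/).filter(Boolean);
    if (!tokens.includes('nofollow')) {
      tokens.push('nofollow'); // keep any existing rel values
      link.rel = tokens.join(' ');
      changed += 1;
    }
  }
  return changed;
}

// Browser usage, with jQuery's ready handler as in the question:
// $(document).ready(function () {
//   addNofollowToExternalLinks(document.querySelectorAll('a[href]'), location.hostname);
// });
```

Existing rel values such as noopener are preserved rather than overwritten.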
According to this page it would seem like they don't, in the sense that they don't actually run it, but that page is 2 years old (judging from the copyright info).
The reason I'm asking this question is because we use Javascript to replace text on our site with other more typographically sound content. We're worried that this may affect the crawlability/seo of our sites, since generally what we're replacing is headers; ie. <h1>, <h2>, etc.
Will search engine bots see our original code, or will they run the Javascript and see the replaced text?
Google now officially processes JavaScript.
In order to solve this problem, we decided to try to understand pages by executing JavaScript. It’s hard to do that at the scale of the current web, but we decided that it’s worth it. We have been gradually improving how we do this for some time. In the past few months, our indexing system has been rendering a substantial number of web pages more like an average user’s browser with JavaScript turned on.
Sometimes things don't go perfectly during rendering, which may negatively impact search results for your site. Here are a few
potential issues, and – where possible, – how you can help prevent
them from occurring:
If resources like JavaScript or CSS in separate files are blocked (say, with robots.txt) so that Googlebot can’t retrieve them, our
indexing systems won’t be able to see your site like an average user.
We recommend allowing Googlebot to retrieve JavaScript and CSS so that
your content can be indexed better. This is especially important for
mobile websites, where external resources like CSS and JavaScript help
our algorithms understand that the pages are optimized for mobile. If
your web server is unable to handle the volume of crawl requests for
resources, it may have a negative impact on our capability to render
your pages. If you’d like to ensure that your pages can be rendered by
Google, make sure your servers are able to handle crawl requests for
resources.
It's always a good idea to have your site degrade gracefully. This will help users enjoy your content even if their browser doesn't have
compatible JavaScript implementations. It will also help visitors with
JavaScript disabled or off, as well as search engines that can't
execute JavaScript yet.
Sometimes the JavaScript may be too complex or arcane for us to execute, in which case we can’t render the page fully and accurately.
Some JavaScript removes content from the page rather than adding, which prevents us from indexing the content.
Search engines don't process JavaScript as such.
There is some evidence that Google may have started processing inline script content in some cases, in order to catch content that is entered into the page parse queue using document.write. However, DOM methods such as those you might use for font replacement are certainly not affected, and no onload code is invoked.
Generally no. Google has mentioned that they are working on a system of indexing ajax content, but I don't think any of the major search engines index dynamic content as a rule. See this page for Google's take on it: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=81766
The bots will certainly not run the Javascript code, but they might recognise some commonly used scripts.
You shouldn't count on it though. Clear markup, proper content and real links is still what counts.
Also, if the bots happen to recognise your script, it might not be in your favor. If the code is recognised as something that is commonly used to try to fool bots, it could even hurt your page ranking.
I'd use metadata to ensure bots pick up the content on your pages.
I know the general consensus is that Google does not process JavaScript or index anything within a <script> tag; however, the general consensus appears to be incorrect.
Try searching for the following, with the surrounding quotes (or click here):
"Samsung Public Interest Statement by Thomas Fusco, Fish & Richardson P.C., for Samsung."
You should only get one result. Now click on that result (or just click here) and view the source.
Do a CTRL-F for the text you searched for in Google. Notice that the text is in a javascript variable, and not html. Google must be processing some javascript to pull those words into its index.