how do web crawlers handle javascript

Today a lot of content on the Internet is generated using JavaScript (specifically by background AJAX calls). I was wondering how web crawlers like Google handle it. Are they aware of JavaScript? Do they have a built-in JavaScript engine? Or do they simply ignore all JavaScript-generated content in the page (I guess quite unlikely)? Do people use specific techniques for getting their content indexed which would otherwise be available through background AJAX requests to a normal Internet user?

JavaScript is handled by both Bing and Google crawlers. Yahoo uses the Bing crawler data, so it should be handled as well. I didn't look into other search engines, so if you care about them, you should look them up.
Bing published guidance in March 2014 as to how to create JavaScript-based websites that work with their crawler (mostly related to pushState) that are good practices in general:
Avoid creating broken links with pushState
Avoid creating two different links that link to the same content with pushState
Avoid cloaking. (Here's an article Bing published about their cloaking detection in 2007)
Support browsers (and crawlers) that can't handle pushState.
Google later published guidance in May 2014 on how to create JavaScript-based websites that work with their crawler, and their recommendations are also worth following:
Don't block the JavaScript (and CSS) in the robots.txt file.
Make sure you can handle the load of the crawlers.
It's a good idea to support browsers and crawlers that can't handle (or users and organizations that won't allow) JavaScript
Tricky JavaScript that relies on arcane or specific features of the language might not work with the crawlers.
If your JavaScript removes content from the page, it might not get indexed.
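As a rough illustration of the "support browsers (and crawlers) that can't handle pushState" advice, here is a minimal sketch. The function name navigate, the env object (bundling history, location and a render callback) and renderPage are all assumptions made up for this example; they are not part of Bing's or Google's guidance.

```javascript
// Hypothetical helper: navigate with pushState when available,
// otherwise fall back to a normal full-page load of the same URL.
function navigate(url, env) {
  const canPush = Boolean(env.history) &&
    typeof env.history.pushState === "function";
  if (canPush) {
    env.history.pushState({ url: url }, "", url); // update the address bar
    env.render(url);                              // hypothetical in-page renderer
    return "pushState";
  }
  // Fallback: the server must answer this URL with real content.
  env.location.href = url;
  return "full-load";
}

// In a browser this might be called as (renderPage is hypothetical):
// navigate("/articles/42", { history: window.history, location: window.location, render: renderPage });
```

The point of the fallback branch is that every URL pushed into the history also has to work as an ordinary link, which covers both "avoid broken links" and "support clients without pushState".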

Most of them don't handle Javascript in any way. (At least, all the major search engines' crawlers don't.)
This is why it's still important to have your site gracefully handle navigation without Javascript.

I have tested this by putting pages on my site only reachable by Javascript and then observing their presence in search indexes.
Pages on my site which were reachable only by Javascript were subsequently indexed by Google.
The content was reached through JavaScript with the 'classic' technique of constructing a URL and setting window.location accordingly.

Precisely what Ben S said. And anyone accessing your site with Lynx won't execute JavaScript either. If your site is intended for general public use, it should generally be usable without JavaScript.
Also, related: if there are pages that you would want a search engine to find, and which would normally arise only from JavaScript, you might consider generating static versions of them, reachable by a crawlable site map, where these static pages use JavaScript to load the current version when hit by a JavaScript-enabled browser (in case a human with a browser follows your site map). The search engine will see the static form of the page, and can index it.

Crawlers don't parse JavaScript to find out what it does.
They may be built to recognise some classic snippets like onchange="window.location.href=this.options[this.selectedIndex].value;" or onclick="window.location.href='blah.html';", but they don't bother with things like content fetched using AJAX. At least not yet, and content fetched like that will always be secondary anyway.
So, JavaScript should be used only for additional functionality. The main content that you want the crawlers to find should still be plain text in the page, with regular links that the crawlers can easily follow.
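One way to read this advice is as progressive enhancement: keep real links in the markup and let JavaScript take over only when it runs. The sketch below is an assumption-laden illustration; enhanceLinks, loadInPage, loadFragment and the a.ajax selector are hypothetical names, not from the answer.

```javascript
// Hypothetical enhancer: links keep their real href (crawlable),
// and JavaScript-enabled browsers get an in-page load instead.
function enhanceLinks(links, loadInPage) {
  links.forEach(function (link) {
    link.onclick = function (event) {
      loadInPage(link.href);        // e.g. fetch the content with AJAX
      if (event && event.preventDefault) {
        event.preventDefault();     // skip the normal navigation
      }
      return false;
    };
  });
  return links.length;              // number of links enhanced
}

// Browser usage might look like (loadFragment is hypothetical):
// enhanceLinks(Array.from(document.querySelectorAll("a.ajax")), loadFragment);
```

A crawler that ignores the script still sees ordinary anchors and can follow them.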

Crawlers can handle JavaScript or AJAX calls if they are built on a framework such as HtmlUnit or Selenium.

Related

Calling tracking JavaScript from AMP pages

We are using an in-house tracking mechanism for our website. We include our tracking.js file on all our pages.
Every page passes some information in a JS object to this script, which then sends it to our tracking application through a Spring controller.
To make pages load faster, we serve some pages as AMP templates.
But AMP does not allow us to use tracking.js.
We tried an iframe tag, but it does not allow HTTP calls (it only allows HTTPS calls).
Could you please suggest a way to do this? It is very critical, and we cannot move to HTTPS right now because of other limitations.
Thanks
Virendra Agarwal

You can't use tracking.js with AMP as it is considered as an external library. It's written on their How It Works page that it won't allow author-written/3rd party JS:
"One thing we realized early on is that many performance issues are
caused by the integration of multiple JavaScript libraries, tools,
embeds, etc. into a page. This isn’t saying that JavaScript
immediately leads to bad performance, but once arbitrary JavaScript is
in play, most bets are off because anything could happen at any time
and it is hard to make any type of performance guarantee. With this in
mind we made the tough decision that AMP HTML documents would not
include any author-written JavaScript, nor any third-party scripts."
Only the components on this AMP example can be used.

We worked with Google and got it sorted.
You can add your API to AMP pages after validation by Google.
The API must be behind HTTPS and all calls must be validated by Google.
Google will then whitelist it for AMP pages, and you can use that code in production.

Javascript redirect impact on SEO

Not sure if this is the best place for this question but it's something I've been really curious about. I'd like to use data only available on the client side for loading resources/assets for a website, such as device-pixel-ratio, touch support, etc.
The content on the page will not be changing, just resources like JS files, CSS files, and image files.
There are a few scripts already out there that work like this that run client-side tests and then store the data in a cookie, and then reload the page, loading resources based on data stored in the cookie.
The process works as follows:
User comes to the site
JS sets cookie with device features
JS reloads current page
Server can now access the cookie with all the feature data
Can conditionally load resources and assets based on this data
Is it a bad practice to immediately reload the page as the user comes to it? Are there any SEO drawbacks to this method? It seems like a great technique for conditionally loading resources based on device capabilities. I'm just not sure if there are any reasons not to do this.
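The five-step process described above can be sketched roughly as follows. The cookie name features, the JSON encoding and both function names are assumptions made for this illustration; the scripts the question refers to may work differently.

```javascript
// Step 2: serialize detected features into a cookie string.
function buildFeatureCookie(features) {
  const value = encodeURIComponent(JSON.stringify(features));
  return "features=" + value + "; path=/";
}

// Step 4 (server side): read the features back from a Cookie header.
function parseFeatureCookie(cookieHeader) {
  const pairs = (cookieHeader || "").split(";");
  for (const pair of pairs) {
    const trimmed = pair.trim();
    if (trimmed.startsWith("features=")) {
      try {
        return JSON.parse(decodeURIComponent(trimmed.slice("features=".length)));
      } catch (err) {
        return null; // malformed cookie: treat as missing
      }
    }
  }
  return null; // no cookie yet (first visit, or a client without cookies)
}

// Steps 2-3 in the browser, only when the cookie is not set yet:
// if (!parseFeatureCookie(document.cookie)) {
//   document.cookie = buildFeatureCookie({
//     dpr: window.devicePixelRatio || 1,
//     touch: "ontouchstart" in window
//   });
//   // Reload only if the cookie was actually stored, to avoid a reload loop.
//   if (parseFeatureCookie(document.cookie)) window.location.reload();
// }
```

Note that any client that does not run the script or keep the cookie gets a null result on the server, so the default set of resources has to be complete on its own.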

Many web crawlers do not use full JavaScript or cookie functionality. For instance, Googlebot does not interpret all JavaScript by default. Thus, content that you load dynamically based on your cookie may not be detected by the crawler and, as a result, may not be indexed. This kills the SEO.
As a quote from Matt Cutts (Google's webspam guy):
"For a while, we were scanning within JavaScript, and we were looking
for links. Google has gotten smarter about JavaScript and can execute
some JavaScript. I wouldn't say that we execute all JavaScript, so
there are some conditions in which we don't execute JavaScript.
Certainly there are some common, well-known JavaScript things like
Google Analytics, which you wouldn't even want to execute because you
wouldn't want to try to generate phantom visits from Googlebot into
your Google Analytics".
Reference: http://www.searchnewz.com/topstory/news/sn-2-20100315SEOInterviewwithMattCutts.html

Well, search engines usually neither support cookies nor JavaScript, so they will get the default version only.
And some search engines may test for that and might see this as a "doorway page" (and thus punish the site). I wonder if one of the reasons they started their own web browser was as a side product of developing a robot that checks for such things. Obviously, the robot needs to be fast at JavaScript...

I certainly would not like the page to reload when I've just started using it.
You should probably be using media queries (for the CSS) and feature detection (for JS).
@media all and (min-width:420px) {
/*styles...*/
}
And:
if( typeof window.localStorage !== "undefined") {
// you can now do stuff with localStorage.
}
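Applied to the features the question mentions (device-pixel-ratio, touch support), feature detection without a reload might look like the sketch below. The function names and the "@2x" file naming convention are assumptions for this example only.

```javascript
// Hypothetical helper: detect features from a window-like object
// without setting cookies or reloading the page.
function detectFeatures(win) {
  return {
    pixelRatio: typeof win.devicePixelRatio === "number" ? win.devicePixelRatio : 1,
    touch: "ontouchstart" in win,
    localStorage: typeof win.localStorage !== "undefined"
  };
}

// Pick a high-resolution image name when the display warrants it.
// The "@2x" naming convention is an assumption for this example.
function pickImage(baseName, extension, features) {
  const suffix = features.pixelRatio >= 2 ? "@2x" : "";
  return baseName + suffix + "." + extension;
}

// Browser usage might look like:
// const features = detectFeatures(window);
// document.getElementById("logo").src = pickImage("/img/logo", "png", features);
```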

Is it recommended to use javascript to build layouts?

I'm creating a blog, but I need box-shadows for my boxes, so I'm asking the following.
Is it good to add shadows via a)images/css or b)javascript?
I've heard that lot of people don't have javascript enabled while browsing, so is there this a problem? It would be easier and simpler to create these shadows with javascript than adding a million divs and positioning them.
EDIT: I found this page: http://www.w3schools.com/browsers/browsers_stats.asp and it says that almost every user has JS enabled.

You could use JavaScript for your layout, but the general principle that you should keep in mind is that your HTML should be semantic: the elements on the page should have a meaning; it should project a structure that goes beyond the design of the page (although that structure can certainly be used as an indicator for the design aspects as well).
When this principle is applied, using JavaScript can help with providing the style you wish to project given the semantic meaning of the page.
Also, you should check your server logs (your hosting provider should have some sort of analytics tool/report available) which should tell you what browsers and versions are being used to visit your site. With that information, you can get a good feel for the people that you are currently reaching.
If you are using some sort of analytics package (e.g. Google Analytics) then you can possibly see the delta between two periods of time for the new visitors to your site as well, and try to gauge the capability of the browsers that new users will be using when they visit your site.
A few things to consider when using JavaScript to manipulate the DOM on the front end:
If you are using JavaScript to manipulate a good deal of the content, it's going to be a client-side process, and that can slow down the rendering of your page. You might want to consider a theme/template for your blog/cms which gives you the styling that you want and is rendered through CSS on the server-side.
Search engines do not execute your JavaScript. Because of this, you want to avoid manipulating the indexable content at all costs. You want your content to be embedded in the HTML as it is sent from the server. Using AJAX or other JavaScript to manipulate certain things is fine, but when it comes to your content, unless you are stylizing it, do not use JavaScript to manipulate it.

Use CSS box-shadow for nice, up-to-date browsers: http://css-tricks.com/snippets/css/css-box-shadow/ (requires no extra markup)
And for most everyone else, serve up your js solution.
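To decide at runtime who gets the CSS version and who gets the JS fallback, a small feature test along these lines could be used. This is only a sketch: supportsBoxShadow and applyShadowFallback are hypothetical names, and the vendor-prefixed property names are included on the assumption that older WebKit and Gecko builds need them.

```javascript
// Returns true when the given style object knows the box-shadow property,
// either unprefixed or under an older vendor-prefixed name.
function supportsBoxShadow(style) {
  return ["boxShadow", "WebkitBoxShadow", "MozBoxShadow"].some(function (name) {
    return name in style;
  });
}

// Browser usage might look like:
// if (!supportsBoxShadow(document.documentElement.style)) {
//   applyShadowFallback(); // hypothetical function holding the JS solution
// }
```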

You should do it the way that is easiest for you and let the page degrade gracefully for those without JS (if you think you need to consider them; today, I don't see any point in building non-JS sites or building sites for no-JS users).

Do search engines process Javascript?

According to this page it would seem like they don't, in the sense that they don't actually run it, but that page is 2 years old (judging from the copyright info).
The reason I'm asking this question is because we use Javascript to replace text on our site with other more typographically sound content. We're worried that this may affect the crawlability/seo of our sites, since generally what we're replacing is headers; ie. <h1>, <h2>, etc.
Will search engine bots see our original code, or will they run the Javascript and see the replaced text?

Google now officially processes JavaScript.
In order to solve this problem, we decided to try to understand pages by executing JavaScript. It’s hard to do that at the scale of the current web, but we decided that it’s worth it. We have been gradually improving how we do this for some time. In the past few months, our indexing system has been rendering a substantial number of web pages more like an average user’s browser with JavaScript turned on.
Sometimes things don't go perfectly during rendering, which may negatively impact search results for your site. Here are a few potential issues, and – where possible, – how you can help prevent them from occurring:
If resources like JavaScript or CSS in separate files are blocked (say, with robots.txt) so that Googlebot can’t retrieve them, our indexing systems won’t be able to see your site like an average user. We recommend allowing Googlebot to retrieve JavaScript and CSS so that your content can be indexed better. This is especially important for mobile websites, where external resources like CSS and JavaScript help our algorithms understand that the pages are optimized for mobile.
If your web server is unable to handle the volume of crawl requests for resources, it may have a negative impact on our capability to render your pages. If you’d like to ensure that your pages can be rendered by Google, make sure your servers are able to handle crawl requests for resources.
It's always a good idea to have your site degrade gracefully. This will help users enjoy your content even if their browser doesn't have compatible JavaScript implementations. It will also help visitors with JavaScript disabled or off, as well as search engines that can't execute JavaScript yet.
Sometimes the JavaScript may be too complex or arcane for us to execute, in which case we can’t render the page fully and accurately.
Some JavaScript removes content from the page rather than adding, which prevents us from indexing the content.
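To check the first point (resources blocked with robots.txt), a rough script like the one below could flag script and stylesheet paths that a Disallow rule covers. It is a deliberately simplified sketch, not how Googlebot evaluates robots.txt: it only does plain prefix matching, ignores Allow rules and wildcards, and treats the * group and a googlebot group the same. The function name and sample paths are made up for the example.

```javascript
// Simplified check: which resource paths fall under a Disallow prefix
// in a group that applies to "*" or "googlebot"?
function findBlockedResources(robotsTxt, resourcePaths) {
  const disallowed = [];
  let applies = false;     // does the current group apply to Googlebot?
  let inAgentRun = false;  // are we inside consecutive User-agent lines?
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const match = /^([A-Za-z-]+)\s*:\s*(.*)$/.exec(line);
    if (!match) continue;
    const field = match[1].toLowerCase();
    const value = match[2].trim();
    if (field === "user-agent") {
      const hit = value === "*" || value.toLowerCase() === "googlebot";
      applies = inAgentRun ? (applies || hit) : hit;
      inAgentRun = true;
    } else {
      inAgentRun = false;
      if (field === "disallow" && applies && value !== "") {
        disallowed.push(value);
      }
    }
  }
  return resourcePaths.filter(function (path) {
    return disallowed.some(function (prefix) { return path.startsWith(prefix); });
  });
}
```

Any path this returns is worth reviewing by hand, since the real matching rules are more involved than a prefix test.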

Search engines don't process JavaScript as such.
There is some evidence that Google may have started processing inline script content in some cases, in order to catch content that is entered into the page parse queue using document.write. However certainly DOM methods such as you might use for font-replacement are not affected and no onload code is invoked.

Generally no. Google has mentioned that they are working on a system of indexing ajax content, but I don't think any of the major search engines index dynamic content as a rule. See this page for Google's take on it: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=81766

The bots will certainly not run the Javascript code, but they might recognise some commonly used scripts.
You shouldn't count on it though. Clear markup, proper content and real links is still what counts.
Also, if the bots happen to recognise your script, it might not be in your favor. If the code is recognised as something that is commonly used to try to fool bots, it could even hurt your page ranking.

I'd use metadata to ensure bots pick up the content on your pages.

I know the general consensus is that Google does not process JavaScript or index anything within a <script> tag; however, the general consensus appears to be incorrect.
Try searching for the following, with the surrounding quotes (or click here):
"Samsung Public Interest Statement by Thomas Fusco, Fish & Richardson P.C., for Samsung."
You should only get one result. Now click on that result (or just click here) and view the source.
Do a CTRL-F for the text you searched for in Google. Notice that the text is in a javascript variable, and not html. Google must be processing some javascript to pull those words into its index.

What are advantages of using google.load('jQuery', ...) vs direct inclusion of hosted script URL?

Google hosts some popular JavaScript libraries at:
http://code.google.com/apis/ajaxlibs/
According to google:
The most powerful way to load the libraries is by using google.load() ...
What are the real advantages of using
google.load("jquery", "1.2.6")
vs.
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js"></script>
?

Aside from the benefit of Google being able to bundle multiple files together on the request, there is no perk to using google.load. In fact, if you know all libraries that you want to use (say just jQuery 1.2.6), you're possibly making the user's browser perform one unneeded HTTP connection. Since the whole point of using Google's hosting is to reduce bandwidth consumption and response time, the best decision - if you're just using 1 library - is to call that library directly.
Also, if your site will be using any SSL certificates, you want to plan for this by calling the script via Google's HTTPS connection. There's no downside to calling an https script from an http page, but calling an http script from an https page will cause more obscure debugging problems than you would want to think about.
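The http-script-on-https-page case can be caught with a small check like this one. It is a minimal sketch; the two function names are hypothetical, and the URL is the one from the question.

```javascript
// True when an https page would pull in a plain-http script.
function isMixedContent(pageProtocol, scriptUrl) {
  return pageProtocol === "https:" && /^http:\/\//i.test(scriptUrl);
}

// Rewrites a plain-http URL to https; other URLs are returned unchanged.
function toHttps(scriptUrl) {
  return scriptUrl.replace(/^http:\/\//i, "https://");
}

// Browser usage might look like:
// const src = "http://ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js";
// const safeSrc = isMixedContent(window.location.protocol, src) ? toHttps(src) : src;
```

Given that the https URL works from both kinds of page, always using toHttps is the simpler choice.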

It allows you to dynamically load the libraries in your code, wherever you want.
It also lets you switch directly to a new version of the library in the JavaScript, without forcing you to rebuild/change templates all across your site.

It lets Google change the URL (but they can't since the URL method is already established)
In theory, if you do several google.load()s, Google can bundle them into one file, but I don't think that is implemented.

I find it's very useful for testing different libraries and different methods, particularly if you're not used to them and want to see their differences side by side, without having to download them. It appears that one of the primary reasons to do it would be that it is asynchronous, versus the synchronous script call. You also get some neat stuff that is directly included in the Google loader, like client location. You can get their latitude and longitude from it. Not necessarily useful, but it may be helpful if you're planning to have targeted advertising or something of the like.
Not to mention that dynamic loading is always useful. Particularly to smooth out the initial site load. Keeping the initial "site load time" down to as little as possible is something every web designer is fighting an uphill battle on.

You might want to load a library only under special conditions.
Additionally the google.load method would speed up the initial page display. Otherwise the page rendering will freeze until the file has been loaded if you include script tags in your html code.

Personally, I'm interested in whether there's a caching benefit for browsers that will already have loaded that library as well. Seems like if someone browses to google and loads the right jQuery lib and then browses to my site and loads the right jQuery lib... ...both might well use the same cached jQuery. That's just a speculative possibility, though.
Edit: Yep, at very least when using the direct script tags to the location, the javascript library will be cached if someone has already called for the library from google (e.g. if it were included by another site somewhere).

If you were to write a boatload of JavaScript that only used the library when a particular event happens, you could wait until the event happens to download the library, which avoids unnecessary HTTP requests for those who don't actually end up triggering the event. However, in the case of libraries like Prototype + Scriptaculous, which download over 300 KB of JavaScript code, this isn't practical.
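The "wait until the event happens" idea can be sketched without google.load at all. The wrapper below is an illustration under assumptions: createLazyLoader, injectScript and runFeature are hypothetical names, and the loader is assumed to return a Promise.

```javascript
// Wraps a loader so the library is requested at most once,
// and only when the first caller actually needs it.
function createLazyLoader(loadLibrary) {
  let pending = null;
  return function ensureLoaded() {
    if (pending === null) {
      pending = loadLibrary(); // expected to return a Promise
    }
    return pending;            // later callers reuse the same request
  };
}

// Browser usage might look like (injectScript and runFeature are hypothetical):
// const ensureJQuery = createLazyLoader(function () {
//   return injectScript("https://ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js");
// });
// button.addEventListener("click", function () { ensureJQuery().then(runFeature); });
```

Visitors who never trigger the event never pay for the download, and repeated triggers do not request the file again.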
