I'm experimenting with building sites dynamically on the client side, through JavaScript + a JSON content server, the js retrieves the content, and builds the page client-side.
Now, the content won't be indexed by Google this way. Is there a workaround for this? Like having a crawler version and a user version? Or having some sort of static archives? Has anyone done this already?
You should always make sure that your site works without javascript. Make links that link to static versions of the content. Then add javascript click handlers to those links that block the default action from hapening and make the AJAX request. I.e. using jQuery:
HTML:
<a href='static_content.html' id='static_content'>Go to page!</a>
Javascript:
$('#static_content').click(function(e) {
e.preventDefault(); //stop browser from following link
//make AJAX request
});
That way the site is usable for crawlers and users without javascript. And has fancy AJAX for people with javascript.
If the site is meant to be indexed by google then the "information" you want searchable and public should be available without javascript. You can always add the dynamic stuff later after the page loads with javascript. This will not only make the page indexable but will also make the page faster loading.
On the other hand if the site is more of an application 'ala gmail' then you probably don't want google indexing it anyway.
You could utilize a server rendered version, and then replace it onload with the ajax version.
But if you are going to do that, why not build the entire site that way and just use ajax for interaction where the client supports it ala non-intrusive javascript.
You can use phantomjs to build a crawler version, see my solution here:
https://github.com/liuwenchao/ajax-seo
Related
I need to create a HTML code snippet that I will distribute to third party websites. This code snippet talks to a php file on my server and contains a logic to update the content(image) after specified time intervals. The reason I cannot use JavaScript is that it is not search engine friendly.
The way I have it now is using an HTML+ Javascript code which includes an XMLhttp request and uses Ajax to call a PHP file which in turn reads a csv file and updates the banner image on the third party site. But it is not crawlable by search engines.
Any other way of getting this to work using HTML? Probably using forms?
HTML is not active. If you want to do something, you need some sort of scripting language. You can do this without using Ajax (XMLhttp). Before Ajax, it was a common practice to relay information to the server using dynamic image loading. Of course, the dynamic image loading required a script. It can be rather simple:
<img id='myimg' src='temp.jpg'
onload="document.getElementById('myimg').src='myscript.php?width='+window.innerWidth;"
>
Your script replaces the image with whatever you like, but you have information delivered from the web page to your server through the get string. Originally, I saw this used extensively to deliver rotating ads. With this, you can record which ads are shown along with information that would otherwise only be known by the web browser.
I've got this setup:
Single page app that generates HTML content using Javascript. There is no visible HTML for non-JS users.
History.js (pushState) for handling URLS without hashbangs. So, the app on "domain.com" can load dynamic content of "page-id" and updates the URL to "domain.com/page-id". Also, direct URLS work nicely via Javascript this way.
The problem is that Google cannot execute Javascript this way. So essentially, as far as Google knows, there is no content whatsoever.
I was thinking of serving cached content to search bots only. So, when a search bot hits "domain.com/page-id", it loads cached content, but if a user loads the same page, it sees normal (Javascript injected) content.
A proposed solution for this is using hashbangs, so Google can automatically convert those URLs to alternative URLs with an "escaped_fragment" string. On the server side, I could then redirect those alternative URLs to cached content. As I won't use hashbangs, this doesn't work.
Theoretically I have everything in place. I can generate a sitemap.xml and I can generate cached HTML content, but one piece of the puzzle is missing.
My question, I guess, is this: how can I filter out search bot access, so I can serve those bots the cached pages, while serving my users the normal JS enabled app?
One idea was parsing the "HTTP_USER_AGENT" string in .htaccess for any bots, but is this even possible and not considered cloaking? Are there other, smarter ways?
updates the URL to "domain.com/page-id". Also, direct URLS work nicely via Javascript this way.
That's your problem. The direct URLs aren't supposed to work via JavaScript. The server is supposed to generate the content.
Once whatever page the client has requested is loaded, JavaScript can take over. If JavaScript isn't available (e.g. because it is a search engine bot) then you should have regular links / forms that will continue to work (if JS is available, then you would bind to click/submit events and override the default behaviour).
A proposed solution for this is using hashbangs
Hashbangs are an awful solution. pushState is fix for hashbangs, and you are using that already - you just need to use it properly.
how can I filter out search bot access
You don't need to. Use progressive enhancement / unobtrusive JavaScript instead.
I'm sorry if this is a newbie question but I don't really know what to search for either. How do you keep content from a previous page when navigating through a web site? For example, the right side Activity/Chat bar on facebook. It doesn't appear to refresh when going to different profiles; it's not an iframe and doesn't appear to be ajax (I could be wrong).
Thanks,
I believe what you're seeing in Facebook is not actual "page loads", but clever use of AJAX or AHAH.
So ... imagine you've got a web page. It contains links. Each of those links has a "hook" -- a chunk of JavaScript that gets executed when the link gets clicked.
If your browser doesn't support JavaScript, the link works as it normally would on an old-fashioned page, and loads another page.
But if JavaScript is turned on, then instead of navigating to an HREF, the code run by the hook causes a request to be placed to a different URL that spits out just the HTML that should be used to replace a DIV that's already showing somewhere on the page.
There's still a real link in the HTML just in case JS doesn't work, so the HTML you're seeing looks as it should. Try disabling JavaScript in your browser and see how Facebook works.
Live updates like this are all over the place in Web 2.0 applications, from Facebook to Google Docs to Workflowy to Basecamp, etc. The "better" tools provide the underlying HTML links where possible so that users without JavaScript can still get full use of the applications. (This is called Progressive Enhancement or Graceful degradation, depending on your perspective.) Of course, nobody would expect Google Docs to work without JavaScript.
In the case of a chat like Facebook, you must save the entire conversation on the server side (for example in a database). Then, when the user changes the page, you can restore the state of the conversation on the server side (with PHP) or by querying your server like you do for the chat (Javascript + AJAX).
This isn't done in Javascript. It needs to be done using your back-end scripting language.
In PHP, for example, you use Sessions. The variables set by server-side scripts can be maintained on the server and tied together (between multiple requests/hits) using a cookie.
One really helpful trick is to run HTTPFox in Firefox so you can actually monitor what's happening as you browse from one page to the next. You can check out the POST/Cookies/Response tabs and watch for which web methods are being called by the AJAX-like behaviors on the page. In doing this you can generally deduce how data is flowing to and from the pages, even though you don't have access to the server side code per se.
As for the answer to your specific question, there are too many approaches to list (cookies, server side persistence such as session or database writes, a simple form POST, VIEWSTATE in .net, etc..)
You can open your last closed web-page by pressing ctrl+shift+T . Now you can save content as you like. Example: if i closed a web-page related by document sharing and now i am on travel web page. Then i press ctrl+shift+T. Now automatic my last web-page will open. This function works on Mozilla, e explorer, opera and more. Hope this answer is helpful to you.
I am quite new to web application development and I need to know how would I make other sites use it.
My webapp basically gets a username and returns some data from my DB. This should be visible from other websites.
My options are:
iframe. The websites owners embed an iframe and they pass the userid in the querystring. I render a webpage with the data and is shown inside the iframe.
pros: easy to do, working already.
cons: the websites wont know the data returned, and they may like to know it.
javascript & div. They paste a div and some javascript code in their websites and the div content is updated with the data retrieved by the small javascript.
pros: the webside would be able to get the data.
cons: I could mess up with their website and I don't know wow would I run the javascript code appart from being triggered by a document ready, but I wouldn't like to add jquery libraries to their sites.
There must be better ways to integrate web applications than what I'm thinking. Could someone give me some advice?
Thanks
Iframes cannot communicate with pages that are on a different domain. If you want to inject content into someone else's page and still be able to interact with that page you need to include (or append) a JavaScript tag (that points to your code) to the hosting page, then use JavaScript to write your content into the hosting page.
Context Framework contains embedded mode support, where page components can be injected to other pages via Javascript. It does depend on jQuery but it can always be used in noConflict-mode. At current release the embedded pages must be on same domain so that same-origin-policy is not violated.
In the next release, embedded mode can be extended to use JSONP which enables embedding pages everywhere.
If what you really want is to expose the data, but not the visual content, then I'd consider exposing your data via JSONP. There are caveats to this approach, but it could work for you. There was an answer here a couple of days ago about using a Web Service, but this won't work directly from the client because of the browser's Same Origin policy. It's a shame that the poster of that answer deleted it rather than leave it here as he inadvertently highlighted some of the misconceptions about how browsers access remote content.
I want to create a web crawler/spider to iteratively fetch all the links in the webpage including javascript-based links (ajax), catalog all of the Objects on the page, build and maintain a site hierarchy. My question is:
Which language/technology should be better (to fetch javascript-based links)?
Is there any open source tools there?
Thanks
Brajesh
You can automate the browser. For example, have a look at http://watir.com/
Fetching ajax links is something that even the search-giants haven't accomplished yet. It is because, the ajax links are dynamic and the command and response both vary greatly as per the user's actions. That's probably why, SEF-AJAX (Search Engine Friendly AJAX) is now being developed. It is a technique that makes a website completely indexable to search engines that when visited by a web browser, acts as a web application. For reference, you may check this link: http://nixova.com
No offence but I dont see any way of tracking ajax links. That's where my knowledge ends. :)
you can do it with php, simple_html_dom and java. let the php crawler copy the pages on your local machine or webserver, open it with an java application (jpane or something) mark all text as focused and grab it. send it to your database or where you want to store it. track all a tags or tags with an onclick or mouseover attribute. check what happens when you call it again. if the source html (the document returned from server) size or md5 hash is different you know its an effective link and can grab it. i hope you can understand my bad english :D