JavaScript App and SEO

I've got this setup:
Single page app that generates HTML content using Javascript. There is no visible HTML for non-JS users.
History.js (pushState) for handling URLs without hashbangs. So, the app on "domain.com" loads the dynamic content for "page-id" and updates the URL to "domain.com/page-id". Also, direct URLs work nicely via JavaScript this way.
The problem is that Google does not execute this JavaScript, so essentially, as far as Google knows, there is no content whatsoever.
I was thinking of serving cached content to search bots only. So, when a search bot hits "domain.com/page-id", it loads cached content, but if a user loads the same page, they see the normal (JavaScript-injected) content.
A proposed solution for this is using hashbangs, so Google can automatically convert those URLs to alternative URLs with an "_escaped_fragment_" string. On the server side, I could then redirect those alternative URLs to cached content. As I won't use hashbangs, this doesn't work.
Theoretically I have everything in place. I can generate a sitemap.xml and I can generate cached HTML content, but one piece of the puzzle is missing.
My question, I guess, is this: how can I filter out search bot access, so I can serve those bots the cached pages, while serving my users the normal JS enabled app?
One idea was parsing the "HTTP_USER_AGENT" string in .htaccess for any bots, but is this even possible and not considered cloaking? Are there other, smarter ways?

updates the URL to "domain.com/page-id". Also, direct URLs work nicely via JavaScript this way.
That's your problem. The direct URLs aren't supposed to work via JavaScript. The server is supposed to generate the content.
Once whatever page the client has requested is loaded, JavaScript can take over. If JavaScript isn't available (e.g. because it is a search engine bot) then you should have regular links / forms that will continue to work (if JS is available, then you would bind to click/submit events and override the default behaviour).
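For illustration, a minimal sketch of that pattern might look something like this (the js-nav class, the /fragments endpoint and the #content container are made-up names, not anything your framework provides):

```javascript
// Plain links that work without JavaScript; when JS is available it takes over.
// "a.js-nav", "/fragments" and "#content" are hypothetical names for this sketch.
document.querySelectorAll('a.js-nav').forEach(function (link) {
  link.addEventListener('click', function (event) {
    event.preventDefault();                        // stop the full page load
    var url = link.getAttribute('href');           // e.g. "/page-id"
    fetch('/fragments' + url)                      // fetch just the content fragment
      .then(function (res) { return res.text(); })
      .then(function (html) {
        document.querySelector('#content').innerHTML = html;
        history.pushState({ url: url }, '', url);  // keep the real URL in sync
      });
  });
});
```

Without JavaScript (or for a bot), the same href simply loads the server-rendered page.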
A proposed solution for this is using hashbangs
Hashbangs are an awful solution. pushState is the fix for hashbangs, and you are using it already - you just need to use it properly.
how can I filter out search bot access
You don't need to. Use progressive enhancement / unobtrusive JavaScript instead.

Related

How can I make all relative URLs on an HTML page be relative to a different domain than the one it's on (in HTML elements as well as JS HTTP requests)?

I have a page that has "domain relative URLs" such as ../file/foo or /file/foo, in a href attributes, img src attributes, video and audio sources, web components, JS AJAX calls, etc.
Now here's the thing: when the page has such URLs, I want them to be relative to a single, specific domain other than the one it's actually on. So if the page is on http://localhost:8081/page/, the browser will translate them to:
http://localhost:8081/page/../file/foo
http://localhost:8081/file/foo
But what I really want is for all relative URLs to have no relationship to the domain they're on, but instead to be relative to another domain, e.g. http://localhost:5462/, and therefore translate to:
http://localhost:5462/page/../file/foo
http://localhost:5462/file/foo
In other words: I want a page where the URLs it contains, in a href attributes, AJAX calls, etc., never change no matter where you are viewing the page from (so that the page always fetches the same content from the same source). This works out of the box for complete URLs but not for relative URLs. I want a way to define and enforce a domain that URLs are relative to, so that no matter where the page goes, its content and behavior stay the same.
So, what options do I have to solve this and which is the best one? Some ideas:
Frontend Solutions
The page is generated from Markdown, so I could write a plugin for the Markdown-to-HTML converter that detects domain-relative links and converts them to complete URLs. I've yet to experiment to see if it's really that simple; it might miss links in complex Markdown components, a href and img src attributes written in raw HTML rather than Markdown, various types of links inside web component elements, etc. (Also, this isn't a very universal solution and doesn't work with JS.)
Load the page, then load a JS that goes over all the links on the page and converts domain-relative links to complete URLs with the correct domain (see the sketch after this list). This might be better, but it doesn't change the relationship at a "fundamental level", so it will miss domain-relative links in JavaScript, such as AJAX calls, or URLs added later in the lifetime of the page.
Instead of manipulating the DOM, I could maybe intercept all HTTP calls, clicks, and requests using JS, check whether they are domain relative, and alter the call if necessary (this would break user-facing behavior though, like the preview of where the user will go when hovering a link).
Maybe there exists some kind of tag, like meta with options, that I can put in the head of the HTML to change which domain relative links/calls resolve against? I recently discovered the base tag, but it seems to only work for HTML elements and not JS scripts.
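A rough sketch of the link-rewriting idea mentioned above, assuming the target origin is known up front (TARGET_ORIGIN is a made-up configuration value, not something the page provides):

```javascript
// Rewrite domain-relative href/src attributes so they resolve against another origin.
// TARGET_ORIGIN is a hypothetical configuration value for this sketch.
var TARGET_ORIGIN = 'http://localhost:5462';

document.querySelectorAll('[href], [src]').forEach(function (el) {
  ['href', 'src'].forEach(function (attr) {
    var value = el.getAttribute(attr);
    if (!value || value.charAt(0) === '#') return;   // skip fragment-only links
    if (/^[a-z]+:|^\/\//i.test(value)) return;       // skip absolute / protocol-relative URLs
    el.setAttribute(attr, new URL(value, TARGET_ORIGIN + location.pathname).href);
  });
});
```

As noted above, this only fixes markup that exists when the script runs; it does nothing for URLs used from JavaScript or added later.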
Backend Solutions
Route localhost:5462 behind localhost:8081 so that all the data from localhost:5462 is accessible at localhost:8081. This is very clean and plainly just works, even for JS HTTP calls, and I already use it in a lot of cases with zero problems, but sometimes I don't want to use routing, and then I'm back to square one. So this is the result I want, but I need it without routing as well (the HTTP requests from the frontend shouldn't go through the backend but straight to the same URLs as on the original or "correct" domain).
Maybe it's possible to inject some kind of domain info into the response object or headers when requesting the page from the backend, which tricks all HTTP requests on the page into using the "correct" domain for domain-relative URLs?
Other Solutions
I could ban the use of domain-relative links, but I need to support file:/// as well, which needs relative links; besides, how would I go about banning them in the first place, and do I really want to ban a whole form of links?
What more solutions are there?
Some things to consider
This isn't limited to localhost, it could be domain1.com and domain2.com or any combination of domains, or inside an Electron app, etc.
Ideally, when viewing the source code of the page, "live inspecting" it, etc., the links would be unmodified / still look relative, but when you hover a link the indicator of where it goes would show the desired translation of the relative link rather than the default behavior of domain-relative links. This "ideal" is definitely not attained by some of my suggested solutions.
Another ideal would be that all user-facing behavior "just works", while viewing the source code or other developer-facing behavior could be funky (the only user-facing behavior I can think of is the preview of where you'll go when hovering a link).
CORS is not an issue.
It would be great if it worked with any backend setup (a completely frontend solution), but it would also be acceptable if it only works with my particular backend, either because a frontend solution is technically impossible or because it would be too much of an error-prone hack; it could also require a universally doable modification to any backend.
My backend is written in Node/Koa.
What seems like it should be the best solution is something like the base tag, but one that also works for arbitrary unmodified JavaScript instead of just HTML elements.
But if that doesn't exist, maybe it's possible to use the base tag just for HTML elements, and in conjunction with that use some very neat, non-hacky JavaScript that intercepts all possible JavaScript and WebAssembly HTTP (or any other protocol) requests and directs them to the correct domain if they use domain-relative URLs? And if that really is the best solution and technically feasible, how would I do it?
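There is no base-tag equivalent for scripts, but fetch and XMLHttpRequest can be wrapped, which covers a lot of (not all) JavaScript-initiated requests. A sketch of that interception, with the same hypothetical TARGET_ORIGIN as above; it will miss things like workers, WebSocket, and code that grabbed references to the originals earlier:

```javascript
var TARGET_ORIGIN = 'http://localhost:5462'; // hypothetical "correct" domain

function rebase(url) {
  // Leave absolute and protocol-relative URLs alone; resolve relative ones
  // against the other origin instead of the current one.
  return /^[a-z]+:|^\/\//i.test(url)
    ? url
    : new URL(url, TARGET_ORIGIN + location.pathname).href;
}

// Wrap fetch(); Request objects are passed through untouched in this sketch.
var originalFetch = window.fetch;
window.fetch = function (input, init) {
  if (typeof input === 'string') input = rebase(input);
  return originalFetch.call(this, input, init);
};

// Wrap XMLHttpRequest.open()
var originalOpen = XMLHttpRequest.prototype.open;
XMLHttpRequest.prototype.open = function (method, url, async, user, password) {
  return originalOpen.call(this, method, rebase(url),
                           async === undefined ? true : async, user, password);
};
```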
After experimenting a bit, I've come to the conclusion that HTTP redirects are the best solution, essentially making the backend nothing more than a way to bounce over to the right URLs. There are some problems with POST requests, though, which might be fixable with frontend JavaScript modifications; if it's really a problem, it's always possible to replace redirects with "forwards" instead.
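For reference, a minimal Koa sketch of that bounce approach (the port and the path filter are assumptions for this sketch); a 307 status is used so that POST requests keep their method and body across the redirect:

```javascript
const Koa = require('koa');
const app = new Koa();

const TARGET_ORIGIN = 'http://localhost:5462'; // hypothetical "correct" domain

app.use(async (ctx, next) => {
  // Serve the page itself normally; bounce every other request to the other origin.
  if (ctx.path.startsWith('/page/')) {
    return next();
  }
  ctx.status = 307;                      // 307 preserves the method and body
  ctx.redirect(TARGET_ORIGIN + ctx.url); // ctx.url keeps the path and query string
});

app.listen(8081);
```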

Pentest pure JavaScript (qooxdoo) Website

I'm wondering how I could pentest a website built completely in JavaScript, for example using the qooxdoo framework.
Such websites make no requests to the server that respond with HTML content. When the page loads, only an almost empty HTML page with a link to a single JavaScript file is transmitted, and the page is then set up by the loaded JS file, without a single line of HTML written by the developer.
Typically, most web app scanners (like Nexpose) do some spidering/crawling: they check a website for links and forms, crawl any link they find that points to the same domain, and test any parameters found on those links. I assume those scanners would not have any effect on a pure-JS page.
Then there's the other possibility: a proxy (like Burp Suite) that captures any traffic being sent to the server and can check any parameters found in those requests. This would probably work for testing the API server behind the website (for example, to find SQL injections).
But: Is there any way to test the client, for example for XSS (self or stored)?
Or more in general: What types of attacks would you typically need to check in such a pure JS web application? What tools could help with that?
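For the client side, one concrete thing to look for (manually or via a proxy) is where the application writes attacker-influenced data into the DOM. A deliberately simplified example of the kind of DOM-based XSS sink to grep a pure-JS codebase for (the element id is made up):

```javascript
// Untrusted data from the URL flows straight into innerHTML without encoding.
// A crafted fragment could inject markup that runs script in the victim's browser.
var message = decodeURIComponent(location.hash.slice(1)); // attacker-influenced input
document.getElementById('status').innerHTML = message;    // dangerous sink
```

A simple grep for innerHTML, document.write or eval calls can help narrow down where to look.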

Deep linking javascript powered websites

I have a website which has two versions: an all-singing, all-dancing JavaScript-powered application which is served when you request the root URL
/
As you navigate around the lovely website the content updates, as does the URL, thanks to HTML5 pushState or good old correctly formatted #! URLs. However, if you don't have JavaScript enabled you can still use all functionality of the site, as each piece of content also exists under its own URL. This is great for three reasons:
non javascript users can still use the site
SEO - web crawlers can index the site easily
everything is shareable on social networks
The third reason is very important to me, as every piece of content must be individually shareable on the site. And because each piece of content has its own URL, it is easy to deep link to that URL, and each piece of content can have its own specific Open Graph data.
However, the issue I hit is the following. You are a normal person with JavaScript enabled, you are browsing an image gallery on the site, and you decide to share the picture of a lovely cat you have found. Using JavaScript, the URL has been updated to
/gallery/lovely-cat
You share this URL and your friend clicks on it. When they click the link, the server sends them the non-JavaScript / web crawler version of the site, and the experience is nowhere near as nice as the JavaScript version they would have been served if they had gone directly to the root of the site and navigated there.
Does anyone have a nice solution / alternative setup to solve this problem? I have several hacks which work, but I am not that happy with them. They include:
JavaScript redirect to the root of the site on every page, storing a cookie / adding a #! to the URL so that on page render the JavaScript router will show the correct content. (Does Google punish automatic JavaScript redirects?)
render the no-JavaScript page, and add some JavaScript which redirects the user to the root, similar to the above, whenever the user clicks on a link
I don't particularly like either of these solutions and can't think of a better one. Rendering the entire JavaScript app for each page doesn't appear to be a solution to me, as you would end up with bad-looking URLs such as /gallery/lovely-cat/gallery/another-lovely-cat as you start navigating through the site.
My solution must support old browsers which do not implement pushState.
Make the "non javascript / web crawler version of the site" the same as the JavaScript version. Just build HTML on the server instead of DOM on the client.
Rendering the entire javascript app for each page doesn't appear to be a solution to me,
That is the robust approach
as you would end up with bad looking urls such as /gallery/lovely-cat/gallery/another-lovely-cat
Only if you linked (and pushState'd) to gallery/another-lovely-cat instead of /gallery/another-lovely-cat. (Note the / at the front.)
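For illustration, a minimal sketch of that approach: the server renders real pages at /gallery/lovely-cat, and JavaScript enhances their links by always pushing the absolute path (the #content container is a made-up name):

```javascript
// Intercept clicks on same-origin links and push the absolute path, so URLs
// can never stack up into /gallery/lovely-cat/gallery/another-lovely-cat.
document.addEventListener('click', function (event) {
  var link = event.target.closest('a');
  if (!link || link.origin !== location.origin) return;
  event.preventDefault();
  history.pushState(null, '', link.pathname);     // always "/gallery/...", never relative
  fetch(link.pathname)                            // hypothetical: fetch and swap the content
    .then(function (res) { return res.text(); })
    .then(function (html) { document.querySelector('#content').innerHTML = html; });
});
```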
Try out this plugin; it might address your third reason, along with the other two:
http://www.asual.com/jquery/address/

Doing links like Twitter, Hash-Bang #! URLs [duplicate]

Possible Duplicate:
What’s the shebang/hashbang (#!) in Facebook and new Twitter URLs for?
I was wondering how Twitter works its links.
If you look in the source code, you'll see the links are done like /#!/i/connect or /#!/i/discover, but they don't have a JavaScript function attached to them like load('connect') or something, and clicking them doesn't require a page reload. It just swaps out the page content.
I saw this page, but then all of those files would have to exist, and you couldn't just go straight to one of them. I imagine that on Twitter those files don't exist and that it is handled by some other method. Please correct me if I'm wrong, though.
Is there a way I could replicate this effect? If so, is there a tutorial on how to go about doing this?
"Hash-Bang" navigation, as it's sometimes called, ...
http://example.com/path/to/#!/some-ajax-state
...is a temporary solution for a temporary problem that is quickly becoming a non-issue thanks to modern browser standards. In all likelihood, Twitter will phase it out, as Facebook is already doing.
It is the combination of several concepts...
In the past, a link served two purposes: It loaded a new document and/or scrolled down to an embedded anchor as indicated with the hash (#).
http://example.com/script.php#fourth-paragraph
Anything in a URL after the hash was not requested from the server, but was searched for in the page by the browser. This all still works just fine.
With the adoption of AJAX, new content could be loaded into the current (already loaded) page. With this dynamic loading, two problems arose: 1) there was no unique URL for bookmarking or linking to this new content, and 2) search engines would never see it.
Some smart people solved the first problem by using the hash as a sort of "state" reference to be included in links & bookmarks. After the document loads, the browser reads the hash and runs the AJAX requests, displaying the page plus its dynamic AJAX changes.
http://example.com/script.php#some-ajax-state
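A minimal sketch of that hash-as-state pattern (the /ajax/ endpoint and #content container are made-up names):

```javascript
// On load, read the fragment and restore the corresponding AJAX state.
window.addEventListener('DOMContentLoaded', function () {
  var state = location.hash.replace(/^#!?/, '');  // e.g. "some-ajax-state"
  if (!state) return;
  fetch('/ajax/' + encodeURIComponent(state))     // hypothetical endpoint
    .then(function (res) { return res.text(); })
    .then(function (html) {
      document.querySelector('#content').innerHTML = html;
    });
});
```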
This solved the AJAX problem, but the search engine problem still existed. Search engines don't load pages and execute Javascript like a browser.
Google to the rescue. Google proposed a scheme where any URL with a hash-bang (#!) in lieu of just a hash (#) would suggest to the search bot that there was an alternate URL for indexing, which involved an "_escaped_fragment_" variable, among other things. Read about it here: Ajax Crawling: Getting Started.
Today, with the adoption of JavaScript's pushState in most major browsers, all of this is becoming obsolete. With pushState, as content is dynamically loaded or changed, the current page URL can be altered without causing a page load. When desired, this provides a real working URL for bookmarks & history. Links can then be made as they always were, without hashes & hash-bangs.
As of today, if you load Facebook in an older browser, you'll see the hash-bangs, but a current browser will demonstrate the use of pushState.
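In outline, the pushState version of the same idea looks roughly like this (showSection and #content are illustrative names):

```javascript
// Hypothetical renderer: fetch a fragment and swap it into the page.
function showSection(path) {
  fetch(path)
    .then(function (res) { return res.text(); })
    .then(function (html) { document.querySelector('#content').innerHTML = html; });
}

// In a click handler, give the new content a real URL without a page load:
//   history.pushState({ path: '/photos/123' }, '', '/photos/123');
//   showSection('/photos/123');

// Back/forward fire popstate; restore whatever state they point at.
window.addEventListener('popstate', function (event) {
  if (event.state && event.state.path) showSection(event.state.path);
});
```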
You might want to read more on Unique URLs.
It's loading the page via AJAX and parsing the "hash" (the value that comes after the "#") to determine which page to load. This method is also used because AJAX requests don't get added to the browser's history, so the "back button breaks". The browser does, however, store hash changes in its history.
By using hashes to determine which page to load, you can effectively keep AJAX-requested pages "in history". On top of that, hashed URLs are just URLs, and they are bookmarkable including the hash, so you can also bookmark AJAX-requested pages.

web crawler/spider to fetch ajax based link

I want to create a web crawler/spider to iteratively fetch all the links in a webpage, including JavaScript-based (AJAX) links, catalog all of the objects on the page, and build and maintain a site hierarchy. My questions are:
Which language/technology is better suited to fetching JavaScript-based links?
Are there any open-source tools for this?
Thanks
Brajesh
You can automate the browser. For example, have a look at http://watir.com/
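Watir drives a real browser from Ruby; the same idea in Node, using the Puppeteer headless-browser library as one possible alternative (not something mentioned above), might look roughly like this:

```javascript
// Rough sketch: render the page in a headless browser so that links created
// by JavaScript actually appear in the DOM before we collect them.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/', { waitUntil: 'networkidle0' });

  // Collect every link the executed JavaScript has put into the DOM.
  const links = await page.$$eval('a[href]', (anchors) => anchors.map((a) => a.href));
  console.log(links);

  await browser.close();
})();
```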
Fetching AJAX links is something that even the search giants haven't fully accomplished yet. That's because AJAX links are dynamic, and the request and response both vary greatly depending on the user's actions. That's probably why SEF-AJAX (Search Engine Friendly AJAX) is now being developed. It is a technique that makes a website completely indexable to search engines and, when visited by a web browser, acts as a web application. For reference, you may check this link: http://nixova.com
No offence, but I don't see any way of tracking AJAX links. That's where my knowledge ends. :)
You can do it with PHP, simple_html_dom and Java. Let the PHP crawler copy the pages to your local machine or web server, open them with a Java application (jpane or something), mark all the text as focused and grab it. Send it to your database or wherever you want to store it. Track all a tags, or tags with an onclick or mouseover attribute, and check what happens when you call them again: if the size or MD5 hash of the source HTML (the document returned from the server) is different, you know it's an effective link and can grab it. I hope you can understand my bad English :D
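In rough outline, the size/MD5 comparison described above could look like this Node sketch (swapping Node in for PHP and Java purely as an illustration; assumes Node 18+ for the global fetch):

```javascript
// Compare the page's HTML with what a candidate link returns; if the size or
// MD5 hash differs, treat the link as "effective" and worth following.
const crypto = require('crypto');

const md5 = (text) => crypto.createHash('md5').update(text).digest('hex');

async function isEffectiveLink(pageUrl, linkUrl) {
  const pageHtml = await (await fetch(pageUrl)).text();
  const linkHtml = await (await fetch(linkUrl)).text();
  return pageHtml.length !== linkHtml.length || md5(pageHtml) !== md5(linkHtml);
}

isEffectiveLink('https://example.com/', 'https://example.com/some-page')
  .then((effective) => console.log(effective ? 'different content' : 'same content'));
```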
