Is window.location.href = 'some_page.html' followed by search engines? - javascript

Currently our website uses links to allow the user to change their locale. The problem with this is that you get a lot of random outlinks from each page on the site to... the same page, in other languages. When a search engine traverses this, it gets an excessively complex view of the site.
We were going to change it to a form post to avoid this. However, it seems to me that we should just be able to change it to an onclick="window.location.href='change_my_language.php'" rather than an href="change_my_language.php". Am I right? Or do the major search engines scan for and follow this sort of thing nowadays?

To solve the larger problem of duplicate content, you can use the canonical link tag to specify on the pages in other languages the URL of the preferred document.
<!-- on http://www.example.com/article.php?id=123&language=something-else -->
<link rel="canonical" href="http://www.example.com/article.php?id=123" />
To save search engines the trouble of landing on the other pages, it wouldn't hurt to add rel="nofollow" to the links, to ensure that robots don't waste their time checking them out. However, the canonical link tag is still vital, in case someone links to your other-language content, to ensure that your preferred page gets the ranking credit.
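For example, the language links from the question could carry the attribute like so (the link text is illustrative):
<a href="change_my_language.php" rel="nofollow">Deutsch</a>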

I'm fairly certain search engines don't parse JavaScript, so they won't follow a URL assigned to any property of the location object.
EDIT
An interesting article on the topic: http://www.seroundtable.com/archives/019026.html
and bobince's answer on this potential duplicate question suggests the same: window.location and SEO

The POST method must be used when the request changes the server's state, i.e. when the request has side-effects:
http://www.cs.tut.fi/~jkorpela/forms/methods.html
No, search engines won't be able to follow JavaScript links, but POST is the more elegant solution.
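A sketch of the form-based approach, reusing the change_my_language.php URL from the question (the lang field name is an assumption):
<form method="post" action="change_my_language.php">
  <input type="hidden" name="lang" value="fr">
  <button type="submit">Français</button>
</form>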

When a search engine traverses this, it gets an excessively complex view of the site.
This should not be a problem. As long as you are correctly marking up each page with lang="...", the search engine should know what to do with it. What is the actual problem you are facing that leads you to believe search engines are confused by a ‘complex’ link map?
You can give them a sitemap if you really want to be explicit.
However, it seems to me that we should just be able to change it to an onclick="window.location.href='change_my_language.php'" rather than an href="change_my_language.php"
That would degrade the site's usability and accessibility a little as well as (deliberately) sabotaging the search engine.
In any case, whatever you do you should definitely leave each language version on its own URL (eg. /en/category/title) rather than totally relying on a language-setting cookie, or you really do run the risk of confusing search engines. Normally you do want search engines to index every language version you have, to catch searches from users of other languages.
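For instance, each language version gets its own URL and declares its language in the markup (paths are illustrative):
<!-- served at /en/category/title -->
<html lang="en"> ... </html>
<!-- served at /fr/categorie/titre -->
<html lang="fr"> ... </html>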

Google is getting increasingly better at parsing JavaScript. I don't think the search engines will follow this link now... but to be more certain they don't, you can change your anchor tags to spans and use the onclick="document.location='url'" method instead (sketched below).
Though what you may want to do is add rel="nofollow" to these links instead. You can also add a canonical link to the main page.
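A minimal sketch of the span-based approach (the class name is illustrative; the URL is the one from the question):
<span class="language-switch" onclick="document.location='change_my_language.php'">Deutsch</span>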

Related

How to Call the Same Item or Post with an URL when the title slug changes [duplicate]

Why is it a bad idea to have an ID in the URL in terms of SEO?

How does this URL http://example.com/user/1234 hurt SEO?
Can someone give me a practical example where search engine rankings are worse?
The reason people are saying that {ID} in the URL is bad is due to the way search engine algorithms work. When a search term is located in the actual URL, it is weighted much more heavily than when it appears only in the content of the page.
For example:
<!-- http://example.com/blog/57 -->
<html><head><title>An article on search engine optimization</title>...
vs
<!-- http://example.com/blog/an-article-on-search-engine-optimization -->
<html><head><title>An article on search engine optimization</title>...
If you do a search in Google for "Search Engine Optimization", the second page, the one with the slug in the URL, will be weighted as a better result than the one with only the ID.
You can deal with this the same way Stack Overflow deals with this issue:
http://stackoverflow.com/questions/{id}/{slug}
http://stackoverflow.com/questions/910683/why-is-id-in-the-url-a-bad-idea
The combined id and slug format really helps you achieve the best of both worlds. You get the ease of programming by retrieving records by {id}, but you also retain the optimized search URL because of the {slug}.
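As a rough illustration, the {slug} part can be generated from the title; this is a hypothetical helper, not Stack Overflow's actual code:
// Lowercase the title, collapse anything that isn't a letter or digit
// into hyphens, and trim stray hyphens from the ends.
function slugify(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}
slugify('Why is ID in the URL a bad idea?'); // "why-is-id-in-the-url-a-bad-idea"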
It does have an effect on the click-through rate.
The URL is presented in green below the search result, so if it contains relevant words the user might click your site and not another site.
Which would you rather click:
www.test.com/page.php?u=85583
OR
www.test.com/Solution-to-your-problem.php
As commented, this effect may be achieved even with URLs that include an ID.
In the olden days search engines treated words in the URL with much respect and gave those pages extra credit and higher ranking. This effect has almost vanished. We are left with two other effects of readable URLs:
Clickthrough
Linkbuilding: it is easier for a human to copy such a URL, and after the link is copied it is often referred to with some of the slug words. The URL with "Solution-to-your-problem" may end up with "Solution to your problem" inside the a tag when people link to your page. This will influence your ranking.
A solution with ID + slug might be the best option, and it fixes the problem of keeping track of slug changes.
test.com/85583/solution-to-your-problem
But there are some rules to follow: you should do a 301 redirect if the slug is incorrect, to prevent a lot of duplicate-content pages (see the sketch after these examples). Spam/duplicate-content detection kicks in if you have a lot of similar pages:
test.com/85583/solution-to-your-problem
test.com/85583/solution-to-yar-problem
test.com/85583/evil-competitor-spamming-you-haha
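A sketch of that redirect rule (the route pattern and lookup function are assumptions):
// The id is authoritative; any other slug gets a 301 to the canonical
// URL, so misspelled or malicious slugs never become separate pages.
function resolveUrl(path, lookupSlugById) {
  var match = path.match(/^\/(\d+)\/(.*)$/);
  if (!match) return { status: 404 };
  var canonicalSlug = lookupSlugById(match[1]); // e.g. a database lookup
  if (!canonicalSlug) return { status: 404 };
  if (match[2] !== canonicalSlug) {
    return { status: 301, location: '/' + match[1] + '/' + canonicalSlug };
  }
  return { status: 200 };
}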
Including the ID also requires your IDs to be as short as possible; a URL with a full GUID might be confusing to the eye and hurt click-through:
test.com/0CD03822-4A35-11DE-BF38-3F9356D89593/solution-to-yar-problem
Remember that Google News even required you to have an ID in your URL to be included.
Well, my name is Sudhir Jonathan, so if I want people to find me on your site, example.com/user/sudhir-jonathan is much much better than example.com/user/1234. Simply because the object of your page - "Sudhir Jonathan" - is now present in the url itself. This is a big win.
Similarly, example.com/articles/how-to-bake-a-cake is ranked much higher than example.com/articles/2379797 for the search term "bake a cake".
See the Do SEO-friendly URLs really affect a page's ranking? question. Based on the answers, no one can find any proof that IDs in URLs have any effect on SEO.
It's simple: search engines care about words rather than numbers. That is to say, it is better to have keywords in the URL than just an ID, since an ID/number is useless for a search engine when determining whether your site is relevant or not!
It's always a bad idea to provide unusual information. Try the user name instead!
For SEO there is no real advantage/disadvantage between static ID urls and username urls.
1) You miss out on keywords in the URL.
2) It's harder for a human to read and understand what the link will be about.
3) SQL injection is a lot easier with IDs.

Prevent user-entered scripts from running in webpage

In my application, there is a comment box. If someone enters a comment like
<script>alert("hello")</script>
then an alert appears when I load that page.
Is there any way to prevent this?
There are several ways to address this, but since you haven't mentioned which back-end technology you are using, it is hard to give anything but rough answers.
Also, you haven't mentioned if you want to allow, or deny, the ability to enter regular HTML in the box.
Method 1:
Sanitize inputs on the way in. When you accept something at the server, look for the script tags and remove them.
This is actually far more difficult to get right than might be expected.
Method 2:
Escape the data on the way back out to the browser. In PHP there is a function called
htmlentities which will turn all HTML into entities that render as literally what was typed.
The words <script>alert("hello")</script> would appear on your page.
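A minimal JavaScript sketch of the same escaping idea (PHP's htmlentities covers far more characters than this):
// Escape the characters that let plain text break out into markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}
escapeHtml('<script>alert("hello")</script>');
// -> '&lt;script&gt;alert(&quot;hello&quot;)&lt;/script&gt;'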
Method 3:
White-list
This is far beyond the answer of a single post and really requires knowing your back-end system, but it is possible to allow some HTML tags and attributes while disallowing others.
This is insanely difficult to get right and you really are best using a library package that has been very well tested.
You should treat user input as plain text rather than HTML. By correctly escaping HTML entities, you can render what looks like valid HTML text without having the browser try to execute it. This is good practice in general, for your client-side code as well as any user provided values passed to your back-end. Issues arising from this are broadly referred to as script injection or cross-site scripting.
Practically on the client-side this is pretty easy since you're using jQuery. When updating the DOM based on user input, rely on the text method in place of the html method. You can see a simple example of the difference in this jsFiddle.
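For instance (the #comment element is illustrative):
var comment = '<script>alert("hello")</script>';
$('#comment').text(comment); // safe: shown literally as text
$('#comment').html(comment); // unsafe: parsed, and the script runs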
The best way is to replace <script> with another string. For example, in C# use:
str.Replace("<script>", "O_o");
The other options have a lot of disadvantages.
1. Blocking JavaScript: this also disables the validation that is done in the front end. And it does not stop stored attacks: an attacker can submit a script as form input, it gets saved in the database, and when you return those records on another page it renders as script!
2. Rendering as text: in some technologies this needs third-party packages, which is a risk in itself. Maybe those packages have backdoors!
Converting the value into a string solved it in my case. For example:
var anything

Equivalent of SPContext.Current.ListItem in Client Object Model (ECMAScript)

I'm integrating an external application to SharePoint 2010 by developing custom ribbon tabs, groups, controls and commands that are made available to editors of a SharePoint 2010 site. The ribbon commands use the dialog framework to open dialogs with custom application pages.
In order to pass a number of query string parameters to the custom application pages, I'm therefore looking for the equivalent of SPContext.Current.ListItem in the Client Object Model (ECMAScript).
Regarding the available tokens (i.e. {ListItemId} or {SelectedItemId}) that can be used in the declarative XML: I am already emitting all tokens, but unfortunately the desired tokens are either not parsed or simply null when in the context of a Publishing Page (i.e. http://domain/pages/page.aspx). Thus, none of the tokens that do render are of use in establishing the context of the calling SPListItem in the application page.
Looking at SP.ClientContext.get_current() provides a lot of information about the current SPSite, SPWeb etc., but nothing about the current SPListItem I'm positioned at (again, with the page rendered in the context of a Publishing Page).
What I've come up with so far is the idea of passing in the url of the current page (i.e. document.location.href) and parse that in the application page - however, it feels like I'm going in the wrong direction, and SharePoint surely should be able to provide this information.
I'm not sure this is a great answer, or even fully on-topic, but is basically something I originally intended to blog about - anyway:
It is indeed a pain that the Client OM does not seem to provide a method/property with details of the current SPListItem. However, I'd venture to say that this is a simple concept, but actually has quite wide-ranging implications in SharePoint which aren't apparent until you stop to think about it.
Consider:
Although a redirect exists, a discussion post can be surfaced on 2 or 3 different URLs (e.g. Threaded.aspx/Flat.aspx)
Similarly, a blog post can exist on a couple (Post.aspx/EditPost.aspx, maybe one other)
A list item obviously has DispForm.aspx/EditForm.aspx and (sort of) NewForm.aspx
Also, even for items with an associated SPFile (e.g. document, publishing page), consider that these URLs represent the same item:
http://mydomain/sites/someSite/someLib/Forms/DispForm.aspx?ID=x, http://mydomain/sites/someSite/someLib/Filename.aspx
Also, there could be other content types outside of this set which have a similar deal
In our case, we wanted to 'hang' data off internal and external items (e.g. likes, comments). We thought "well everything in SharePoint has a URL, so that could be a sensible way to identify an item". Big mistake, and I'm still kicking myself for falling into it. It's almost like we need some kind of 'normalizeUrl' method in the API if we wanted to use URLs in this way.
Did you ever notice the PageUrlNormalization class in Microsoft.SharePoint.Utilities? Sounds promising doesn't it? Unfortunately that appears to do something which isn't what I describe above - it doesn't work across the variations of content types etc (but does deal with extended web apps, HTTP/HTTPS etc).
To cut a long story short, we decided the best approach was to make the server emit details which allowed us to identify the current SPListItem when passed back to the server (e.g. in an AJAX request). We hide the 'canonical' list item ID in a JavaScript variable or hidden input field (whatever really), and these are evaluated when back at the server to re-obtain the list item. Not as efficient as obtaining everything from context, but for us it's OK because we only need to resolve when the user clicks something, not on every page load. By canonical, I mean:
SiteID|WebID|ListID|ListItemID
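As a sketch, the server could emit that string into a hidden field for client script to read back on user actions (the field id is an assumption):
// Read the canonical id the server wrote into the page, and include it
// in the AJAX request so the server can re-obtain the SPListItem.
var canonicalId = document.getElementById('canonicalItemId').value;
var parts = canonicalId.split('|'); // [siteId, webId, listId, listItemId]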
IIRC, one of the key objects has a CanonicalId property (or maybe it's internal), which may help you build such a string.
So in terms of using the window.location.href, I'd avoid that if you're in vaguely the same situation as us. Suggest considering an approach similar to the one we used, but do remember that there are some locations (e.g. certain forms) where even on the server SPContext.Current.ListItem is null, despite the fact that SPContext.Current.Web (and possibly SPContext.Current.List) are populated.
In summary - IDs are your friend, URLs are not.

Run external Javascript on given URL?

I want to be able to write JavaScript that executes on a page that I visit every day. It is for my employer's blog-like "What's New" page, where there are announcements such as new hires, terminations, etc. But among these are also articles announcing grants we received and other important things.
I want to filter out the things that I don't care about but keep those that I do. I could do this with CSS if there were selectors that I could match on. But there are none that separate those posts from the ones I don't care about.
So, is there a way in Firefox to specify a js file that would act on a certain URL? Like there is a user defined userChrome.css that matches on all CSS in a page?
Thanks
Eric
You want Greasemonkey. Despite the name that makes you go ewwwww, it is a very powerful and useful tool.
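A minimal user script might look like this; the URL pattern, markup, and keywords are all assumptions about your page:
// ==UserScript==
// @name     Filter What's New
// @include  http://intranet.example.com/whats-new*
// ==/UserScript==
// Hide posts whose headings mention topics you don't care about.
var posts = document.querySelectorAll('.post');
for (var i = 0; i < posts.length; i++) {
  var heading = posts[i].querySelector('h2');
  if (heading && /new hire|termination/i.test(heading.textContent)) {
    posts[i].style.display = 'none';
  }
}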

What precautions should I take to prevent XSS on user submitted HTML?

I'm planning on making a web app that will allow users to post entire web pages on my website. I'm thinking of using HTML Purifier but I'm not sure, because HTML Purifier edits the HTML and it's important that the HTML is maintained just how it was posted. So I was thinking of making some regex to get rid of all script tags and all the JavaScript attributes like onload, onclick, etc.
I saw a Google video a while ago that had a solution for this. Their solution was to use another website to post javascript in so the original website cannot be accessed by it. But I don't wanna purchase a new domain just for this.
Be careful with homebrew regexes for this kind of thing.
A regex like
s/(<.*?)(onClick=['"].*?['"])(.*?>)/$1 $3/
looks like it might get rid of onclick events, but you can circumvent it with
<a onClick<a onClick="malicious()">="malicious()">
running the regex on that will get you something like
<a onClick ="malicious()">
You can fix it by repeatedly running the regex on that string until it doesn't match, but that's just one example of how easy it is to get around simple regex sanitizers.
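The repeat-until-stable idea looks like this in JavaScript, and it is still not a real sanitizer:
// Keep stripping onClick attributes until the string stops changing.
function stripOnClick(html) {
  var re = /(<[^>]*?)onClick=['"][^'"]*['"]([^>]*>)/gi;
  var previous;
  do {
    previous = html;
    html = html.replace(re, '$1$2');
  } while (html !== previous);
  return html;
}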
The most critical error people make when doing this is validating things on input.
Instead, you should validate on display.
The context matters when determining what is XSS and what isn't. Therefore, you can happily accept any input, as long as you pass it through appropriate cleaning functions when displaying it.
Consider that what constitutes 'XSS' will be different when the input is placed in <a href="HERE"> as opposed to <a>here!</a>.
Thus, all you need to do, is make sure that any time you write user data, you consider, very carefully, where you are displaying it, and make sure that it can't escape the context you are writing it to.
If you can find any other way of letting users post content, that does not involve HTML, do that. There are plenty of user-side light markup systems you can use to generate HTML.
So I was thinking making some regex to get rid of all script tags and all the javascript attributes like onload, onclick, etc.
Forget it. You cannot process HTML with regex in any useful way. Let alone when security is involved and attackers might be deliberately throwing malformed markup at you.
If you can convince your users to input XHTML, that's much easier to parse. You still can't do it with regex, but you can throw it into a simple XML parser, and walk over the resulting node tree to check that every element and attribute is known-safe, and delete any that aren't, then re-serialise.
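A rough sketch of that walk, shown here against the browser DOM for brevity (the whitelists are illustrative and far from complete):
// Recursively delete any element or attribute that isn't known-safe.
var SAFE_TAGS = { P: true, A: true, EM: true, STRONG: true, UL: true, LI: true };
var SAFE_ATTRS = { href: true, title: true };
function clean(node) {
  var children = Array.prototype.slice.call(node.children);
  for (var i = 0; i < children.length; i++) {
    var child = children[i];
    if (!SAFE_TAGS[child.tagName]) {
      node.removeChild(child); // unknown element: drop it entirely
      continue;
    }
    var attrs = Array.prototype.slice.call(child.attributes);
    for (var j = 0; j < attrs.length; j++) {
      if (!SAFE_ATTRS[attrs[j].name]) {
        child.removeAttribute(attrs[j].name);
      }
    }
    clean(child);
  }
}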
HTML Purifier edits the HTML and it's important that the HTML is maintained just how it was posted.
Why?
If it's so they can edit it in their original form, then the answer is simply to purify it on the way out to be displayed in the browser, not on the way in at submit-time.
If you must let users input their own free-form HTML — and in general I'd advise against it — then HTML Purifier, with a whitelist approach (ban all elements/attributes that aren't known-safe) is about as good as it gets. It's very very complicated and you may have to keep it up to date when hacks are found, but it's streets ahead of anything you're going to hack up yourself with regexes.
But I don't wanna purchase a new domain just for this.
You can use a subdomain, as long as any authentication tokens (in particular, cookies) can't cross between subdomains. (Which for cookies they can't by default as the domain parameter is set to only the current hostname.)
Do you trust your users with scripting capability? If not don't let them have it, or you'll get attack scripts and iframes to Russian exploit/malware sites all over the place...
Make sure that user content doesn't contain anything that could cause JavaScript to be run on your page.
You can do this by using an HTML-stripping function that gets rid of all HTML tags (like strip_tags from PHP), or by using another similar tool. There are actually many reasons besides XSS to do this. If you have user-submitted content, you want to make sure that it doesn't break the site layout.
I believe you can simply use a sub-domain of your current domain to host the JavaScript, and you will get the same security benefits for AJAX, though not for cookies.
In your specific case, filtering out the <script> tag and Javascript actions is probably going to be your best bet.
1) Use clean simple directory based URIs to serve user feed data.
Make sure that when you dynamically create URIs to address the user's uploaded data, service account, or anything else off your domain, you don't post information as parameters in the URI. That is an extremely easy point of manipulation that could be used to expose flaws in your server security and possibly even inject code onto your server.
2) Patch your server.
Ensure you keep your server up to date on all the latest security patches for all the services running on that server.
3) Take all possible server-side protections against SQL injection.
If somebody can inject code into your SQL database that can execute from services on your box, that person will own your box. At that point they can install malware onto your web server to be fed back to your users, or simply record data from the server and send it out to a malicious party.
4) Force all new uploads into a protected sandboxed area to test for script execution.
No matter how you try to remove script tags from submitted code, there will be a way to circumvent your safeguards and execute script. Browsers are sloppy and do all kinds of stupid crap they are not supposed to do. Test your submissions in a safe area before you publish them for public consumption.
5) Check for beacons in submitted code.
This step requires the previous step and can be very complicated, because a beacon can occur in script code that requires a browser plugin to execute, such as ActionScript, but it is just as much a vulnerability as allowing JavaScript to execute from user-submitted code. If a user can submit code that can beacon out to a third party, then your users, and possibly your server, are completely exposed to data loss to a malicious third party.
You should filter ALL HTML and whitelist only the tags and attributes that are safe and semantically useful. WordPress is great at this and I assume that you will find the regular expressions used by WordPress if you search their source code.
