Skip javascript snippet of code for crawlers


I have a website in PHP that passes certain PHP variables to JavaScript variables. Google crawled it, and this generates errors and duplicate content. Is there any way to make the Google crawler ignore the declaration of these variables in JavaScript?
echo '<script language="javascript">var '.$item['Nombre'].'="'.$descripcion.'";</script>';
Sorry for my English.

Google crawling javascript code and considering it duplicate? I have never heard of this problem before. Some of my pages have inlined javascript (if the content is small), that means the same <script>...</script> on every page.
There are also cases where I output javascript variables more-or-less the same way you do. Google never marked it as "duplicate content".
Google's description of duplicate content:
Duplicate content generally refers to substantive blocks of content
within or across domains that either completely match other content or
are appreciably similar. Mostly, this is not deceptive in origin.
Examples of non-malicious duplicate content could include:
Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
Store items shown or linked via multiple distinct URLs
Printer-only versions of web pages
You can get this kind of error if you have the same content on more than one of your pages, but Google does not parse JavaScript as content (although you can never know for sure what Google does or does not do). In the same way, Google will not mark your <head> tag as duplicate, and there is no penalty for having the same layout (menu, footer, etc.) on every page.
You can put that <script> tag in an <aside> tag just to be sure.
The HTML <aside> Element represents a section of a page that consists
of content that is tangentially related to the content around it,
which could be considered separate from that content. Such sections
are often represented as sidebars or as inserts. They often contain
side explanations, like a glossary definition; more loosely related
stuff, like advertisements; the biography of the author; or in
web-applications, profile information or related blog links.
This means that the content will be more or less ignored by Google when indexing the page. It will not mark it as a duplicate, since it could be a commercial.
Also drop the language="javascript" attribute from your script tags. I doubt that it would confuse google in any way, since that attribute is deprecated (use type instead) and nothing takes it into account nowadays. But if google bot does, the correct value would be text/javascript instead of simply javascript. It is possible that google does not recognise the value javascript and parses it as unknown type of text content.
The default type of the script is text/javascript, so it is safe to omit.
Above all I suspect that the problem is not the existence of JS variables, but some other thing like GET parameters in your URL. GET parameters can be dealt with by configuring URL Parameters correctly in Webmaster Tools.

Important: This is bad practice in most cases. If Google notices that you serve different content to its bot and considers it relevant, then your site can get penalties beyond measure.
I do recommend this php solution:
in PHP use this code:
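// strpos() returns 0 for a match at the start of the string, so compare against false explicitly
if (strpos($_SERVER['HTTP_USER_AGENT'] ?? '', "Googlebot") === false) {
// echo the script
} else {
// don't echo; do nothing
}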
But if this doesn't work, you can try adding this JavaScript code to your script tag:
if (!navigator.userAgent.includes('Googlebot')) {
//do the script
} else {
//does nothing
}
Ps: Here is a list of User-Agents http://www.useragentstring.com/pages/Crawlerlist/

Another (untested, speculative) approach that requires that you can write your own robots.txt file:
Move all your javascript code generation to another URL and include this as a javascript script in your page: <script type="text/javascript" src="/path/to/my/php/that/generates/js/variables.php"></script>
Add that URL to your robots.txt file (see Google answer)
User-Agent: Googlebot
Disallow: /path/to/my/php/that/generates/js/variables.php

You can use the following PHP code:
$crawlers = array(
'Google'=>'Google',
'MSN' => 'msnbot',
'Rambler'=>'Rambler',
'Yahoo'=> 'Yahoo',
'AbachoBOT'=> 'AbachoBOT',
'accoona'=> 'Accoona',
'AcoiRobot'=> 'AcoiRobot',
'ASPSeek'=> 'ASPSeek',
'CrocCrawler'=> 'CrocCrawler',
'Dumbot'=> 'Dumbot',
'FAST-WebCrawler'=> 'FAST-WebCrawler',
'GeonaBot'=> 'GeonaBot',
'Gigabot'=> 'Gigabot',
'Lycos spider'=> 'Lycos',
'MSRBOT'=> 'MSRBOT',
'Altavista robot'=> 'Scooter',
'AltaVista robot'=> 'Altavista',
'ID-Search Bot'=> 'IDBot',
'eStyle Bot'=> 'eStyle',
'Scrubby robot'=> 'Scrubby',
);
function crawlerDetect($USER_AGENT)
{
// to build the crawlers string from the array above, uncomment these lines
// (it is better to keep it in a string than to call implode every time)
// global $crawlers;
// $crawlers_agents = implode('|', $crawlers);
$crawlers_agents = 'Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
// look for any crawler token inside the user agent, case-insensitively
if (!preg_match('/' . $crawlers_agents . '/i', $USER_AGENT, $matches))
return false;
// crawler detected
// $matches[0] holds the token that matched, if you want to report its name
return true;
}
Using the above function, you can check whether a request is coming from a crawler or not.
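For comparison, the same idea can be sketched in JavaScript, for example on a Node.js back end. This is only an illustration and not part of the answer above: the name `detectCrawler` and the shortened token list are assumptions, and user-agent strings are easy to spoof, so the result should be treated as a hint at best.

```javascript
// Hypothetical, shortened token list; extend it with the names from the PHP array above.
const CRAWLER_TOKENS = ['Googlebot', 'msnbot', 'Rambler', 'Yahoo', 'Gigabot', 'Lycos'];

// Escape regex metacharacters so every token is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// One case-insensitive pattern built once, instead of joining the list on every call.
const CRAWLER_PATTERN = new RegExp(CRAWLER_TOKENS.map(escapeRegExp).join('|'), 'i');

// Returns the matched token as it appears in the user agent, or null if none matched.
function detectCrawler(userAgent) {
  const match = CRAWLER_PATTERN.exec(userAgent || '');
  return match ? match[0] : null;
}
```

The key point is the direction of the search: the crawler tokens are looked up inside the user-agent string, not the other way round.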

Related

Changing Javascript source URL based on interface translation

I've run into a particularly troubling issue when trying to attach external libraries which URL changes based on the translation of the site.
To sketch the situation:
There's an element on a website I'm working on which loads in an external Javascript file to display certain contents.
This element is only shown on a specific page rendered by a module.
The languages are noted by subdomain, for example: uk.example.com, de.example.com
The script should be loaded based on this subdomain, so: uk.example.com/script.js, de.example.com/script.js , The path will always be the same.
The problem I'm running into:
When the JavaScript is attached using HOOK_library_info_alter(), the JavaScript source URL gets cached. This means that the uk version of the script gets loaded on the de version of the site. It's not possible to change this system; these scripts need to be loaded using different URLs for reasons I won't go into.
I've tried adding the script using a HOOK_page_attachments to put the script in the header with the correct subdomain, except it is impossible to determine if the script only gets loaded on that specific page, with that specific element (Using library_info_alter I'm able to check if the $extension is correct)
Is there any possible solution to this problem?
I'm sorry if it's worded problematic, my english isn't exactly amazing.

Here is the outline of a possible solution:
function yourmodule_page_attachments(array &$page) {
$current_path = \Drupal::service('path.current')->getPath();
$language = \Drupal::languageManager()->getCurrentLanguage()->getId();
if($current_path == '/node/123') {
if($language == 'en') {
$page['#attached']['library'][] = 'yourmodule/extlib-en';
}
}
}
You will need to change "yourmodule", "/node/123" and "en" to fit, obviously, and define "extlib-en" and the other language-specific libraries in yourmodule.libraries.yml, as described in the Drupal 8 documentation.

How can I prevent theft of JavaScript code [duplicate]

I know it's impossible to hide source code but, for example, if I have to link a JavaScript file from my CDN to a web page and I don't want the people to know the location and/or content of this script, is this possible?
For example, to link a script from a website, we use:
<script type="text/javascript" src="http://somedomain.example/scriptxyz.js">
</script>
Now, is possible to hide from the user where the script comes from, or hide the script content and still use it on a web page?
For example, by saving it in my private CDN that needs password to access files, would that work? If not, what would work to get what I want?

Good question with a simple answer: you can't!
JavaScript is a client-side programming language, therefore it works on the client's machine, so you can't actually hide anything from the client.
Obfuscating your code is a good solution, but it's not enough, because, although it is hard, someone could decipher your code and "steal" your script.
There are a few ways of making your code hard to be stolen, but as I said nothing is bullet-proof.
Off the top of my head, one idea is to restrict access to your external js files from outside the page you embed your code in. In that case, if you have
<script type="text/javascript" src="myJs.js"></script>
and someone tries to access the myJs.js file in browser, he shouldn't be granted any access to the script source.
For example, if your page is written in PHP, you can include the script via the include function and let the script decide if it's "safe" to return its source.
In this example, you'll need the external "js" (written in PHP) file myJs.php:
<?php
$URL = $_SERVER['SERVER_NAME'].$_SERVER['REQUEST_URI'];
if ($URL != "my-domain.example/my-page.php")
die("/* sorry, no access rights */");
?>
// your obfuscated script goes here
that would be included in your main page my-page.php:
<script type="text/javascript">
<?php include "myJs.php"; ?>;
</script>
This way, only the browser could see the js file contents.
Another interesting idea is that at the end of your script, you delete the contents of your dom script element, so that after the browser evaluates your code, the code disappears:
<script id="erasable" type="text/javascript">
//your code goes here
document.getElementById('erasable').innerHTML = "";
</script>
These are all just simple hacks that cannot, and I can't stress this enough: cannot, fully protect your js code, but they can sure piss off someone who is trying to "steal" your code.
Update:
I recently came across a very interesting article written by Patrick Weid on how to hide your js code, and he reveals a different approach: you can encode your source code into an image! Sure, that's not bullet proof either, but it's another fence that you could build around your code.
The idea behind this approach is that most browsers can use the canvas element to do pixel manipulation on images. A canvas pixel is represented by 4 values (rgba), each in the range 0-255. That means that you can store a character (actually its ASCII code) in every pixel. The rest of the encoding/decoding is trivial.
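The pixel idea can be sketched without a browser by working on a raw RGBA byte array. This is not the code from the article mentioned above, just a minimal illustration: the function names are made up, one character is stored in the red channel of each pixel, and every character code is assumed to fit in 0-255.

```javascript
// Pack each character code into the red channel of one RGBA pixel.
// Assumes every character code is in 0-255 (ASCII/Latin-1).
function encodeToPixels(source) {
  const pixels = new Uint8ClampedArray(source.length * 4);
  for (let i = 0; i < source.length; i++) {
    const code = source.charCodeAt(i);
    if (code > 255) throw new RangeError('character outside 0-255 at index ' + i);
    pixels[i * 4] = code;    // R: the character code
    pixels[i * 4 + 3] = 255; // A: fully opaque; partial alpha can distort colour values in a canvas
  }
  return pixels;
}

// Read the red channel of every pixel back into a string.
function decodeFromPixels(pixels) {
  let source = '';
  for (let i = 0; i < pixels.length; i += 4) {
    source += String.fromCharCode(pixels[i]);
  }
  return source;
}
```

In a browser, the byte array would come from drawing the image onto a canvas and reading `getImageData(...).data`. Anyone who can load the page can do the same, so this is still only a fence.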

The only thing you can do is obfuscate your code to make it more difficult to read. No matter what you do, if you want the javascript to execute in their browser they'll have to have the code.

Just off the top of my head, you could do something like this (if you can create server-side scripts, which it sounds like you can):
Instead of loading the script like normal, send an AJAX request to a PHP page (it could be anything; I just use it myself). Have the PHP locate the file (maybe on a non-public part of the server), open it with file_get_contents, and return (read: echo) the contents as a string.
When this string returns to the JavaScript, have it create a new script tag, populate its innerHTML with the code you just received, and attach the tag to the page. (You might have trouble with this; innerHTML may not be what you need, but you can experiment.)
If you do this a lot, you might even want to set up a PHP page that accepts a GET variable with the script's name, so that you can dynamically grab different scripts using the same PHP. (Maybe you could use POST instead, to make it just a little harder for other people to see what you're doing. I don't know.)
EDIT: I thought you were only trying to hide the location of the script. This obviously wouldn't help much if you're trying to hide the script itself.

Google Closure Compiler, YUI Compressor, Minify, /Packer/, etc. are options for compressing/obfuscating your JS code. But none of them can help you hide your code from the users.
Anyone with decent knowledge can easily decode/de-obfuscate your code using tools like JS Beautifier. You name it.
So the answer is, you can always make your code harder to read/decode, but for sure there is no way to hide.

Forget it, this is not doable.
No matter what you try, it will not work. All a user needs to do to discover your code and its location is to look in the Net tab in Firebug or use Fiddler to see what requests are being made.

From my knowledge, this is not possible.
Your browser has to have access to JS files to be able to execute them. If the browser has access, then browser's user also has access.
If you password protect your JS files, then the browser won't be able to access them, defeating the purpose of having JS in the first place.

I think the only way is to put the required data on the server and allow only logged-in users to access the data as required (you can also do some calculations server side). This won't protect your JavaScript code, but it will make it inoperable without the server-side code.

I agree with everyone else here: With JS on the client, the cat is out of the bag and there is nothing completely foolproof that can be done.
Having said that; in some cases I do this to put some hurdles in the way of those who want to take a look at the code. This is how the algorithm works (roughly)
The server creates 3 hashed and salted values: one for the current timestamp, and one for each of the next 2 seconds. My PHP module sends these values to the client via Ajax as a comma-delimited string. In some cases, I think you can bake these values into a script section of the HTML when the page is generated, and delete that script tag once the hashes have been used. The server is CORS protected and does all the usual SERVER_NAME etc. checks (which is not much protection, but at least provides some resistance to script kiddies).
It would also be nice if the server checked that the request really came from an authenticated user's client.
The client then sends the same 3 hashed values back to the server through an Ajax call to fetch the actual JS that I need. The server checks the hashes against its current timestamp. The three values ensure that the data is accepted within the 3-second window, to account for latency between the browser and the server.
The server needs to be convinced that one of the hashes is
matched correctly; and if so it would send over the crucial JS back
to the client. This is a simple, crude "One time use Password"
without the need for any database at the back end.
This means, that any hacker has only the 3 second window period since the generation of the first set of hashes to get to the actual JS code.
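The scheme described above can be sketched roughly as follows. This is a guess at the shape of it, written for Node.js rather than the PHP module the answer mentions: the secret, the function names and the choice of HMAC-SHA256 as the "hashed and salted value" are all assumptions.

```javascript
const crypto = require('node:crypto');

// Hypothetical server-side secret, standing in for the "salt" in the description.
const SECRET = 'replace-with-a-real-secret';

// Keyed hash of one whole-second timestamp.
function hashSecond(second) {
  return crypto.createHmac('sha256', SECRET).update(String(second)).digest('hex');
}

// First call: issue hashes for the current second and the next two,
// as a comma-delimited string.
function issueTokens(nowMs) {
  const second = Math.floor(nowMs / 1000);
  return [second, second + 1, second + 2].map(hashSecond).join(',');
}

// Second call: accept the request if any returned hash matches the server's current second.
function verifyTokens(tokenString, nowMs) {
  const expected = hashSecond(Math.floor(nowMs / 1000));
  return tokenString.split(',').includes(expected);
}
```

No state is stored between the two calls, which matches the "no database" point above. The tokens are not truly one-time, though: they can be replayed by anyone for as long as the window is open.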
The entire client code can be inside an IIFE so that some of the variables inside the client are even harder to read from the Inspector console.
This is not a deep solution: a determined hacker can register, get an account, ask the server to generate the first three hashes by working around Ajax and CORS, and then make the client perform the second call to get to the actual code. But it is a reasonable amount of work.
Moreover, if the salt used by the server is based on the login credentials, the server may be able to detect which user tried to retrieve the sensitive JS. (The server needs to do some additional work regarding the behaviour of the user AFTER the sensitive JS was retrieved, and block the person if, for example, they did not do some other activity that was expected.)
An old, crude version of this was done for a hackathon here: http://planwithin.com/demo/tadr.html That will not work if the server detects too much latency and the request falls outside the 3-second window.

As I said in the comment I left on gion_13 answer before (please read), you really can't. Not with javascript.
If you don't want the code to be available client-side (= stealable without great efforts),
my suggestion would be to use PHP (or ASP, Python, Perl, Ruby, JSP + Java Servlets), which is processed server-side so that only the results of the computation/code execution are served to the user. Or, if you prefer, even Flash or a Java applet, which allow client-side computation/code execution but are compiled and thus harder (though not impossible) to reverse-engineer.
Just my 2 cents.

You can also set up a mime type for application/JavaScript to run as PHP, .NET, Java, or whatever language you're using. I've done this for dynamic CSS files in the past.

I know this is a late answer to this question, but I just thought of something.
I know it might be stressful, but at least it might still work.
The trick is to create a lot of server-side encoding scripts. They have to be decodable (for example, a script that replaces all vowels with numbers and adds the letter 'a' to every consonant, so that the word 'bat' becomes ba1ta). Then create a script that picks one of the encoding scripts at random and creates a cookie naming the encoding script being used. (Quick tip: try not to use the actual name of the encoding script as the cookie value. For example, if our cookie is named 'encoding_script_being_used' and the randomizing script chooses an encoding script named MD10, don't use MD10 as the value of the cookie but something like 'encoding_script4567656', just to prevent guessing.) After the cookie has been created, another script checks for the cookie named 'encoding_script_being_used', gets its value, and determines which encoding script is being used.
The reason for randomizing between the encoding scripts is that the server-side language picks at random which script is used to encode your javascript.js, and then creates a session or cookie recording which encoding script was used.
The server-side language then also encodes your javascript.js and puts it in a cookie.
So let me summarize with an example:
PHP picks at random from a list of encoding scripts and encodes javascript.js, then creates a cookie telling the client-side language which encoding script was used; the client-side language then decodes the javascript.js cookie (which is obviously encoded),
so people can't steal your code.
But I would not advise this, because:
it is a long process
it is too stressful
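As a rough illustration of the "decodable encoding" in the example above, here is the vowel/consonant scheme as a pair of functions. The function names and the vowel-to-digit mapping (a=1 ... u=5) are assumptions chosen so that 'bat' becomes 'ba1ta'. The toy scheme only round-trips text that contains no digits, which real JavaScript source usually does, so it shows the idea rather than something usable.

```javascript
const VOWELS = 'aeiou';

// Vowels become digits 1-5; every lowercase consonant gets an 'a' appended.
// Other characters are left unchanged.
function encodeVowelScheme(text) {
  return [...text].map((ch) => {
    const vowelIndex = VOWELS.indexOf(ch);
    if (vowelIndex !== -1) return String(vowelIndex + 1);
    if (/[b-df-hj-np-tv-z]/.test(ch)) return ch + 'a';
    return ch;
  }).join('');
}

// Reverse the two steps: drop the marker 'a' after consonants, then map digits back to vowels.
function decodeVowelScheme(encoded) {
  return encoded
    .replace(/([b-df-hj-np-tv-z])a/g, '$1')
    .replace(/[1-5]/g, (digit) => VOWELS[Number(digit) - 1]);
}
```

Because the decoder has to be shipped to the browser as well, a scheme like this only adds a step for the reader to undo.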

Use NW.js; I think it is helpful. It can compile your code to a binary, which you can then use to build Windows, Mac and Linux applications.

This method partially works if you do not want to expose the most sensitive part of your algorithm.
Create WebAssembly modules (.wasm), import them, and expose only your JS workflow. This way the algorithm is protected, since it is extremely difficult to turn assembly code back into a more human-readable format.
After you have produced the wasm module and imported it correctly, you can use your code as you normally do:
<body id="wasm-example">
<script type="module">
import init from "./pkg/glue_code.js";
init().then(() => {
console.log("WASM Loaded");
});
</script>
</body>

Secure database entry against XSS

I'm creating an app that retrieves the text within a tweet, store it in the database and then display it on the browser.
The problem is that I'm thinking if the text has PHP tags or HTML tags it might be a security breach there.
I looked into strip_tags() but saw some bad reviews. I also saw suggestions to HTML Purifier but it was last updated years ago.
So my question is how can I be 100% secure that if the tweet text is "<script> something_bad() </script>" it won't matter?
To state the obvious the tweets are sent to the database from users so I don't want to check all individually before displaying them.

You are NEVER 100% secure, however you should take a look at this. If you use ENT_QUOTES parameter too, currently there are no ways to inject ANY XSS on your website if you're using valid charset (and your users don't use outdated browsers). However, if you want to allow people to only post SOME html tags into their "Tweet" (for example <b> for bold text), you will need to take a deep look at EACH whitelisted tag.

You've passed the first stage, which is to recognise that there is a potential issue, and skipped straight to trying to find a solution without stopping to think about how you want to deal with this kind of content. That is a critical precursor to solving the problem.
The general rule is that you validate input and escape output.
validate input
- decide whether to accept or reject it in its entirety
if (htmlentities($input) != $input) {
die("yuck! that tastes bad");
}
escape output
- transform the data appropriately according to where its going.
If you simply....
print "<script> something_bad() </script>";
That would be bad, but....
print json_encode(htmlentities("<script> something_bad() </script>"));
...then you would have to have done something very strange at the front end to make the client susceptible to a stored XSS attack.

If you're outputting to HTML (and I recommend you always do), simply HTML encode on output to the page.
As client script code is only dangerous when interpreted by the browser, it only needs to be encoded on output. After all, to the database <script> is just text. To the browser <script> tells the browser to interpret the following text as executable code, which is why you should encode it to &lt;script&gt;.
The OWASP XSS Prevention Cheat Sheet shows how you should do this properly depending on output context. Things get complicated when outputting to JavaScript (you may need to hex encode and HTML encode in the right order), so it is often much easier to always output to a HTML tag and then read that tag using JavaScript in the DOM rather than inserting dynamic data in scripts directly.
At the very minimum you should be encoding the < & characters and specifying the charset in metatag/HTTP header to avoid UTF7 XSS.
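A minimal sketch of "encode on output" for the HTML body context, written in JavaScript: the function name `escapeHtml` is an assumption, and it covers only the five characters that matter between tags and inside quoted attribute values. Other contexts (script blocks, URLs, CSS) need their own rules, as the cheat sheet explains.

```javascript
// Map of characters that are significant in HTML text and quoted attributes.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

// Replace each significant character with its entity, leaving everything else as-is.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}
```

The raw tweet text stays untouched in the database; the escaping is applied only at the moment the text is written into the page.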

You need to convert the HTML characters <, > (mainly) into their HTML equivalents &lt;, &gt;.
This will make < and > display in the browser but not execute, i.e. if you look at the source, an example may be &lt;script&gt;alert('xss')&lt;/script&gt;.
Before you input your data into your database - or on output - use htmlentities().
Further reading: https://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet

What is the best practice for the multiple use of the same link?

I'm trying to rationalize a website that has many links to the same document, so I want to create a JavaScript function that returns the URL of this document. This way, I could update the document and only have to change the URL in the function, not in every occurrence of the link. (It's a professional, internal website with many links to official documents that get updated often, out of my control. Each time I update links, I realize a while later that I forgot some, even after searching all the HTML files. The site is messy and was poorly written by many people, which is why I'm trying to simplify it.)
My first idea was to use link, but everyone says it's bad practice. I still don't see why, and I don't like to use onclick as it doesn't work with middle click, and I want to let users decide how they open the doc.
Also, I want to use link to redirect to a specific page.
On top of this, what I have tried so far is not working as I intended, so I need some help, either to come up with a better solution or to make this work!
Here is my JS, with different versions:
function F_link_PDF() {
// i was pretty sure this would work
return "http://www.example.com/presentation.pdf" ;
}
function F_link_PDF_2() {
document.write("http://www.example.com/presentation.pdf");
}
function F_link_PDF_3() {
// i don't like this solution, as it doesn't open as user intended to
location.href = "http://www.example.com/presentation.pdf" ;
}
this example is for a pdf document, but i could also need this for html, doc, ppt...
and finally, i started with js because i'm used to, but I could also use other languages, php, asp, if someone says it's a better option
thanks in advance!

The hack way: Go about using JavaScript, however you run into potential issues with browsers not running it.
The better way: Use mod_rewrite / .htaccess to redirect previous (expired) requests to the new location of the resource. You could also use FallbackResource and provide a .php file that could provide the new resource based on criteria (you now have the power of PHP to decide where the Location header should go).
The best way1: Place those document references in a database table somewhere and reference them in the page using the table's current value. This creates a single place of "truth" and allows you to update the site from a global perspective. You could also, at a later date, provide search, tag, display a list, etc.
1 Not implying it's the absolute best, but it is certainly a better way than updating hard-coded references.
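If the site has to stay static, the "single place of truth" idea can still be approximated in JavaScript. The sketch below is only one possible shape: the registry object, the function name and the `data-doc` attribute are all assumed conventions, not something from the question. Because it fills in real `href` values, middle click and "open in new tab" keep working.

```javascript
// Hypothetical registry: one entry per shared document, edited in a single place.
const DOCUMENT_LINKS = {
  presentation: 'http://www.example.com/presentation.pdf',
};

// Look up a document URL by key, failing loudly on typos.
function resolveDocumentLink(key) {
  if (!Object.prototype.hasOwnProperty.call(DOCUMENT_LINKS, key)) {
    throw new Error('Unknown document key: ' + key);
  }
  return DOCUMENT_LINKS[key];
}

// In a browser, fill in every <a data-doc="..."> once the page has loaded.
if (typeof document !== 'undefined') {
  document.addEventListener('DOMContentLoaded', () => {
    document.querySelectorAll('a[data-doc]').forEach((anchor) => {
      anchor.href = resolveDocumentLink(anchor.dataset.doc);
    });
  });
}
```

A link would then be written as `<a data-doc="presentation">Presentation</a>`. Visitors with JavaScript disabled get no working link, which is the drawback the first option above warns about.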

A server side programming language like php is a better option.
Here's example code that helps:
<?php
// single place where the document URL is defined
$link="http://www.example.com/files/document.pdf";
if ($_GET['PAGE'] == "downloads")
{
?>
This is a download page where you can download our flyer.
<?php
echo "<a href=\"$link\">Download PDF</a>";
}
if ($_GET['PAGE'] == "specials")
{
?>
This is our store specials page. Check them out. A link to the flyer is below.
<?php
echo "<a href=\"$link\">Download PDF</a>";
}
?>
The code isn't 100% perfect since some text needs adjusting but what it does is it takes a parameter PAGE and sees that it is "downloads" or "specials" and if it is, it loads the appropriate page and adds the link to the download file. If you try both pages, the link to the download is exactly the same.
If the above php script is saved as index.php, then you can call each page with:
index.php?PAGE=specials for the specials page
index.php?PAGE=downloads for the download page
Once that works, then you can add another "if" section for another page to create but the most important line in each section is the last line of...
echo "<a href=\"$link\">Download PDF</a>";
...because it uses a variable that's usable in every case in the script.
An advantage with using server side method is that people can view the site even with javascript disabled.

How to get a Script Tag's innerHTML

Alright... I've been searching for an hour now... How does one get the innerHTML of a script tag? Here is what I've been working on...
<script type="text/javascript" src="http://www.google.com" id="externalScript"></script>
<script type="text/javascript">
function getSource()
{document.getElementById('externalScript').innerHTML;
}
</script>
I've been trying to work on a way to call another domain's page source with the script tag. I've seen a working example, but cannot find it for the life of me...

You can't do that. There is no innerHTML. All you can do is pull down the file via XMLHttpRequest to get its contents, but of course that is limited by the same-origin policy, whereas script tags are not. Sorry.

actually, there is a way to get the content, but it depends on the remote server letting you get the file without valid headers and still fails a lot of the time just because of those settings. using jQuery since it's the end of my day and I'm out the door....
$.get($('#externalScript').attr('src'), function(data) {
alert(data);
});

I'm guessing you want one of two things:
To make a JavaScript file global (so that other pages can call it)
To get the script that is currently in the file
Both of those can be solved by moving your script to a .js file, and then using the tag
<script src="[path-to-file]"></script>

You can't do this. It would be a massive security problem if you could.
Script content can include any number of things. Consider this: a script loaded from a URL on your bank's website might contain all sorts of things, like your account number, your balance, and other personal information. That script would be loaded by your bank's normal pages to do what they want to do.
Now, I'm an evil hacker, and I suspect you may be a customer of Biggo Bank. So on one of my own pages, I include a <script> tag for that Biggo Bank script. The script may only load if there's a valid Biggo Bank session cookie, but what if there is? What if you visit my hacker site while you're logged in to Biggo Bank in another browser tab? Now my own JavaScript code can read the contents of that script, and your money is now mine :)

You can use HTML parsers, for example jsoup (a Java HTML parser):
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
refer this:
http://jsoup.org/
