I use HttpClient 4.3.4.
I make a POST request, and the site in turn issues a few redirects (302), which HttpClient handles automatically (I use LaxRedirectStrategy). At the end I get an HTML page (200) with the title Redirection .... Its content contains JavaScript code that redirects to some URL. This redirect is (of course) not handled by HttpClient.
I tried to parse this page to get the URL and make the corresponding GET request (as a browser would), but the site returns an HTML error page (although I do not understand why this happens).
Is there some way to handle redirects done in JavaScript?
HttpClient is a library that handles the HTTP protocol for you. It is not supposed to handle the content transferred by the protocol. The content, HTML and JavaScript, needs to be processed by a real browser or some simplified version of a browser.
You can either try to parse and execute the JavaScript yourself using an embedded JavaScript engine, or start a real browser. For the latter option I recommend Selenium, a web browser automation tool.
Your approach is brittle, since it depends on the specific redirection logic used by the content at the time you wrote the parsing code. As for why it fails, there could be complications in the JavaScript you have not discovered yet.
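If you go the "simplified browser" route, HtmlUnit is one Java option that executes the page's JavaScript for you after the POST. A minimal sketch, assuming HtmlUnit 2.x; the URL and form fields are placeholders for your actual request:

import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.NameValuePair;
import java.net.URL;
import java.util.Arrays;

public class JsRedirectExample {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        webClient.getOptions().setJavaScriptEnabled(true);        // execute scripts in responses
        webClient.getOptions().setRedirectEnabled(true);          // follow the ordinary 302 redirects
        webClient.getOptions().setThrowExceptionOnScriptError(false);

        // Hypothetical target URL and form fields -- replace with the real ones.
        WebRequest post = new WebRequest(new URL("https://example.com/login"), HttpMethod.POST);
        post.setRequestParameters(Arrays.asList(
                new NameValuePair("user", "me"),
                new NameValuePair("pass", "secret")));

        HtmlPage page = webClient.getPage(post);
        // Give the "Redirection ..." page time to run its JavaScript redirect.
        webClient.waitForBackgroundJavaScript(10_000);

        // After the script redirect, the current window points at the final page.
        HtmlPage finalPage = (HtmlPage) webClient.getCurrentWindow().getEnclosedPage();
        System.out.println(finalPage.getUrl());
    }
}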
I am facing an issue simulating the below scenario in a JMeter script. I would appreciate it if anyone could help with a solution.
I am trying to create a JMeter script for a form-submission flow in a .NET application. One of the HTTP Request samplers gets redirected to a different HTTP request. On replay, the JMeter script follows the redirect to the correct HTTP request; however, it does not return the required HTTP response.
It fails with the message – “Please enable JavaScript to view the page content. Your support ID is: 7865380748200702010”
While recording the script, it gives the proper response, with .NET variables such as View State, View State Generator, Event Validation, etc.
Please help if you have come across this before.
Most probably you're not sending the right requests because your script is missing, or has not properly implemented, correlation of the dynamic parameters.
In the vast majority of cases you won't be able to simply replay the recorded test scenario; in your case this is due to the hard-coded View State, View State Generator, Event Validation, etc. values captured at recording time.
While a browser sends these variables automatically, in JMeter you need to extract them from the previous response using a suitable PostProcessor (I would recommend the CSS Selector Extractor), convert them into JMeter variables, and replace the hard-coded values with those variables. See the ASP.NET Login Testing with JMeter article for an example of correlating these dynamic .NET web application parameters.
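To make the correlation step concrete, this is roughly what the extractor does, sketched here with Jsoup outside of JMeter; the selectors and variable names are just examples:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ViewStateCorrelation {
    public static void main(String[] args) {
        // 'html' stands in for the body of the previous response.
        String html = "<form><input type=\"hidden\" name=\"__VIEWSTATE\" value=\"abc123\"/>"
                + "<input type=\"hidden\" name=\"__EVENTVALIDATION\" value=\"def456\"/></form>";

        Document doc = Jsoup.parse(html);
        // These selectors mirror what a CSS Selector Extractor would be configured with.
        String viewState = doc.select("input[name=__VIEWSTATE]").attr("value");
        String eventValidation = doc.select("input[name=__EVENTVALIDATION]").attr("value");

        // In JMeter these become ${viewState} / ${eventValidation} in the next request,
        // replacing the hard-coded values captured at recording time.
        System.out.println(viewState + " / " + eventValidation);
    }
}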
With regards to JavaScript in general, as per the Apache JMeter project's main page:
JMeter is not a browser, it works at protocol level. As far as web-services and remote services are concerned, JMeter looks like a browser (or rather, multiple browsers); however JMeter does not perform all the actions supported by browsers. In particular, JMeter does not execute the Javascript found in HTML pages. Nor does it render the HTML pages as a browser does (it's possible to view the response as HTML etc., but the timings are not included in any samples, and only one sample in one thread is ever displayed at a time).
so if a part of your page is loaded by JavaScript (e.g. using AJAX), JMeter again won't execute that request automatically; you will need to simulate it properly yourself.
I'm trying to scrape data from a website using file_get_contents, but instead of the webpage source I'm getting the following code:
<body onload="challenge();">
<script>eval(function(p,a,c,k,e,r){e=function(c){return c.toString(a)};if(!''.replace(/^/,String)){while(c--)r[e(c)]=k[c]||e(c);k=[function(e){return r[e]}];e=function(){return'\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p}('1 6(){2.3=\'4=5; 0-7=8; 9=/\';a.b.c()}',13,13,'tax|function|document|cookie|ddosdefend|1d4607e3ac67b865e6c7263260c34e888cae7c56|challenge|age|0|path|window|location|reload'.split('|'),0,{}))
The engine is WordPress. Is there any chance to get the real source?
file_get_contents itself seems to work fine. However, you are not being served the desired content but some JavaScript code that needs to be evaluated before you are redirected to the content; unpacked, that script just defines the challenge() function, which sets a ddosdefend cookie and reloads the page.
This might be because the website you want to scrape uses DDoS protection (e.g. something like CloudFlare) which detects your simple scraping attempt.
Usually, a DDoS protection service is a proxy between the original webserver and your scraper. It inspects your request behavior, user agent, etc., and based on that either serves you the original webserver's content or presents you with a challenge (e.g. a captcha, or simply requiring you to evaluate JavaScript).
If you can get the IP address of the original webserver, you might be able to access it directly. The DNS resolution for the webserver's name will direct you to the proxy, so you have to look elsewhere. Alternatively, use a web scraping library that emulates real browser behavior in PHP.
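For this particular challenge, which only sets a cookie and reloads, one workaround is to replay that cookie yourself. A rough sketch in Java with the built-in HTTP client (the same two headers can be sent from PHP via a stream context or cURL); the URL is a placeholder, and the cookie name/value are the ones embedded in the packed script above, which may differ per site or per visit, so in practice you would extract them from the challenge page first:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DdosChallengeBypass {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/"))
                .header("User-Agent", "Mozilla/5.0")   // look a bit more like a real browser
                .header("Cookie", "ddosdefend=1d4607e3ac67b865e6c7263260c34e888cae7c56")
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());           // should now be the real page source
    }
}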
I want some content of my website to be dynamically loaded after login. A $.post(...) interacts with a servlet which validates the user's credentials, and then a $.load(url) loads the content from a separate page into a <div>. I noticed that, as long as I know where to fetch the content from, I can force this behavior from the Chrome JavaScript console, bypassing validation.
How can I prevent a user from doing this?
You can't.
Once a document has been delivered to the user's browser it is completely under the control of the user. They can run any JS they like.
The URLs you present on your webserver are the public interface to it. Anyone can request them. You can use authentication/authorization to limit who gets a response, but you can't make that response conditional on the user running specific JavaScript that you supply.
The server needs to authorize the user each time it delivers restricted data. You can't do it once and then trust the browser to enforce it.
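A minimal sketch of that idea, assuming a javax.servlet backend and a session attribute named "user" set at login (both names are hypothetical): the servlet that serves the protected fragment checks the session on every request instead of trusting the client-side $.load() call.

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;
import java.io.IOException;

public class ProtectedContentServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        HttpSession session = req.getSession(false);              // don't create a new session
        boolean loggedIn = session != null && session.getAttribute("user") != null;

        if (!loggedIn) {
            resp.sendError(HttpServletResponse.SC_UNAUTHORIZED);  // 401: nothing for $.load() to show
            return;
        }
        resp.setContentType("text/html");
        resp.getWriter().write("<div>secret content for " + session.getAttribute("user") + "</div>");
    }
}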
You can add a secret parameter to the URL you load: define a random token in the user's session (server side) or in the database, and return it once validation succeeds, so your JavaScript code can pass it along in the next load call. On the server side, the handler behind the load URL then checks whether the secret parameter has the correct value.
Hope it's clear.
The simple answer is: You Can't.
JavaScript runs within the browser, and therefore a user or application can run their own code whenever they feel like it. This could be as simple as adding new CSS or running their own JS code.
The main thing you can do to mitigate this is to ensure all requests are validated on the server side before being acted on, and to accept only the kinds of input you expect (for example, allowing only integers where a number is expected, so arbitrary strings can't get through).
A closely related problem is XSS, or Cross-Site Scripting, where a third party tries to inject malicious code into a trusted website, usually via some form of POST, to affect other users. Here is some more information on the matter:
Cross-Site Scripting - Wikipedia
XSS (Cross-Site Scripting) - OWASP
I'm looking for a method to scrape a website (which uses JavaScript) from the server side, analyze the data, and save the output into a MySQL database. I need to navigate from page to page by clicking links and submitting data from the database, without the session expiring. Is this possible using the phpQuery web browser plugin? I've started doing this using CasperJS. I would like to know the pros and cons of both methods. I'm a beginner in coding. Please help.
I would recommend that you use PhantomJS or CasperJS and parse the DOM with JavaScript selectors to get the parts of the pages you want back. Don't use phpQuery as it's based on PHP and would require a separate step in your processing versus using just JavaScript DOM parsing. Also, you won't be able to perform click events using PHP. Anything client side would need to be run in PhantomJS or CasperJS.
It might even be possible to write a full scraping engine using just PHP, if that's your server-side language of choice. You would need to reverse engineer the login process and maintain a cookie jar with your cURL requests to keep your login valid with each request. Once you've established a session with the website, you can then set up your navigation path with an array of links that you would like to crawl. The idea behind web crawling is that you load a page from some link, process the page, and then move to the next link. You continue this process until all pages have been processed, and then your crawl is complete.
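To make that crawl loop concrete, here is a rough sketch using Java's built-in HTTP client, with a CookieManager standing in for cURL's cookie jar; all URLs and form fields are placeholders, and note that, like the plain PHP approach, it does not execute any JavaScript:

import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class SimpleCrawler {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())   // keeps the session cookie between requests
                .build();

        // 1. Log in once; the session cookie is stored by the cookie manager.
        HttpRequest login = HttpRequest.newBuilder(URI.create("https://example.com/login"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString("user=me&pass=secret"))
                .build();
        client.send(login, HttpResponse.BodyHandlers.discarding());

        // 2. Walk a predefined list of links, processing each page in turn.
        List<String> links = List.of("https://example.com/page1", "https://example.com/page2");
        for (String link : links) {
            HttpRequest get = HttpRequest.newBuilder(URI.create(link)).GET().build();
            HttpResponse<String> response = client.send(get, HttpResponse.BodyHandlers.ofString());
            process(response.body());                 // parse and store into MySQL, etc.
        }
    }

    private static void process(String html) {
        System.out.println("fetched " + html.length() + " characters");
    }
}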
I would check out Google's guide Making AJAX Applications Crawlable; the website you're trying to scrape might have adopted the scheme (making its content crawlable).
Look for #! in the URL's hash fragment; this indicates to the crawler that the site supports the AJAX crawling scheme.
To put it simply, when you come across a URL like www.example.com/ajax.html#!key=value, you would modify it to www.example.com/ajax.html?_escaped_fragment_=key=value. The server should respond with an HTML snapshot of that page.
Here is the Full Specification
There is this 3rd-party web service. One of the public web methods available is a GetDocument() method. This method returns a Document object. The Document object has properties for File (byte[]), ContentType (string), etc.
My question: can I consume this service using JavaScript (MooTools) + AJAX + JSON, get the Document object back (in this case an Excel document), and force the file download?
It is true that typically you cannot initiate a download from JavaScript, but there is a Flash component, Downloadify, that does enable client-side file generation.
So you can serve files for download from HTML/JavaScript.
With that problem solved, you still have the problem of how to get the data that you wish to serve from the source web service.
Being a 3rd-party service implies a cross-origin request, which is a no-no with XmlHttpRequest (Ajax) under the browser's same-origin policy.
A possible solution to this problem could be to use the common hidden IFrame technique to get the data.
Simply have an appropriate (hidden?) form that correctly posts to the web service and point its action at a hidden IFrame element, on which you trap the Load event and parse the data returned.
But current browsers have various security measures that limit your ability to access IFrames loaded from an external source, so you are actually stuck here. Sorry to get your hopes up.
The only practical, robust way to accomplish what you would like to do is to have a local server-side script that can act as a proxy between your HTML/JavaScript and the external web service.
Using such a proxy, you can simply go back to using Ajax to get your data to serve up with Downloadify.
But then, since you are using a server script to get the data, why not just serve the data from the script for download?
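A sketch of that last idea, assuming a javax.servlet backend; DocumentDto and DocumentServiceClient are hypothetical stand-ins for your generated web service client and its Document type. The servlet calls GetDocument() on the server side and streams the bytes to the browser as a download:

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

public class DocumentDownloadServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        DocumentDto doc = new DocumentServiceClient().getDocument(req.getParameter("id"));

        resp.setContentType(doc.contentType);                  // e.g. application/vnd.ms-excel
        resp.setHeader("Content-Disposition", "attachment; filename=\"report.xls\"");
        resp.setContentLength(doc.file.length);
        resp.getOutputStream().write(doc.file);                // the byte[] returned by the service
    }

    // Hypothetical placeholder types; in reality these come from your SOAP/REST client stubs.
    static class DocumentDto {
        byte[] file;
        String contentType;
    }

    static class DocumentServiceClient {
        DocumentDto getDocument(String id) {
            DocumentDto d = new DocumentDto();                 // a real client would call GetDocument() here
            d.file = new byte[0];
            d.contentType = "application/vnd.ms-excel";
            return d;
        }
    }
}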
These are just my observations on the problem domain you present.