How to read the source of another webpage - javascript [duplicate]

I know very, very little JavaScript, but I'm interested in writing a script which needs information from another webpage. Is there a JavaScript equivalent of something like urllib2? It doesn't need to be very robust, just enough to process a simple GET request and store the results; there is no need to store cookies or anything.

There is the XMLHttpRequest, but that would be limited to the same domain of your web site, because of the Same Origin Policy.
However, you may be interested in checking out the following Stack Overflow post for a few solutions around the Same Origin Policy:
Ways to circumvent the same-origin policy
UPDATE:
Here's a very basic (non cross-browser) example:
var xhr = new XMLHttpRequest();
xhr.open('GET', '/questions/3315235', true);
xhr.onreadystatechange = function() {
  if (xhr.readyState === 4) {
    console.log(xhr.responseText);
  }
};
xhr.send(null);
If you run the above in Firebug, with Stack Overflow open, you'd get the HTML of this question printed in your JavaScript console.
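For comparison, the same GET can be written with the newer fetch API, which is bound by the same Same Origin Policy. This is only a sketch: `fetchPageSource` and its `fetchImpl` parameter are names made up for illustration, and the parameter exists only so the function can be exercised without a network.

```javascript
// Hypothetical helper: fetch a URL and resolve with its body as text.
// fetchImpl defaults to the global fetch (browsers, Node 18+); it is a
// parameter only so that a stub can be passed in for testing.
async function fetchPageSource(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error('Request failed with status ' + response.status);
  }
  return response.text();
}

// Usage on a page served from the same origin as the target:
// fetchPageSource('/questions/3315235').then(html => console.log(html));
```

As with XMLHttpRequest, a cross-origin URL only works here if the target server opts in through CORS.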

You could issue an AJAX request and process it.

Write your own server, which runs the script to load the data from websites. Then from your web page, ask your server to fetch the data from websites and send them back to you.
see http://www.storminthecastle.com/2013/08/25/use-node-js-to-extract-data-from-the-web-for-fun-and-profit/
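A rough sketch of that idea in Node.js follows. Everything named here is an assumption for illustration: the `createProxyHandler` function, the `url` query parameter, the `ALLOWED_HOSTS` list, and port 8080 are not from the linked article, and the global `fetch` it relies on needs Node 18 or later.

```javascript
const http = require('http');

// Hypothetical allow-list: only these hosts may be fetched through the proxy.
const ALLOWED_HOSTS = new Set(['example.com']);

// Builds a request handler; fetchImpl is injectable so the handler can be
// tested without touching the network.
function createProxyHandler(fetchImpl = globalThis.fetch) {
  return async function handle(req, res) {
    // Read the target from the ?url=... query parameter.
    const target = new URL(req.url, 'http://localhost').searchParams.get('url');
    let host = null;
    try { host = new URL(target).hostname; } catch (e) { /* missing or invalid URL */ }
    if (!host || !ALLOWED_HOSTS.has(host)) {
      res.writeHead(400, { 'Content-Type': 'text/plain' });
      res.end('url parameter missing or not allowed');
      return;
    }
    try {
      const upstream = await fetchImpl(target);
      const body = await upstream.text();
      res.writeHead(upstream.status, { 'Content-Type': 'text/plain; charset=utf-8' });
      res.end(body);
    } catch (err) {
      res.writeHead(502, { 'Content-Type': 'text/plain' });
      res.end('upstream request failed');
    }
  };
}

// To run it: http.createServer(createProxyHandler()).listen(8080);
```

If the web page is served by this same server, a same-origin request to `/fetch?url=...` then returns the remote page's source. The allow-list is there because an open proxy that fetches arbitrary URLs is easy to abuse.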

Related

Making REST calls to AWS S3 using Pure Javascript (No SDK)

I am trying to make a GET request to AWS S3 using pure Javascript. This is because I am unfortunately no longer able to use the SDK for all of my requests. I have been attempting to follow the documentation provided by Amazon, however I have made very little progress. So far, I have only been able to generate my signature key. I would be enthused if someone could post an example of pure Javascript that makes a simple call to retrieve an object or even lists all of the objects with a specific prefix. I am, to be perfectly honest, completely lost reading their documentation. It seems like it is only useful for people who are intimately familiar with making these calls. #1 and #2 on this image here are what I'm struggling with. I think I sort of understand what they are wanting but I don't know how to fully translate it into an actual request. Unfortunately the code examples on their docs are very few and far between - and a lot of them are just pseudocode/small fractions of the whole thing
edit: Hello is anyone even reading this
edit2: Here's some stuff that isn't working that I'm trying to figure out how to do
var signingKey = getSigningKey(dateStamp, secretKey, regionName, serviceName);
var time = new Date();
//fullURL is something like https://s3.amazon.aws.com/{bucketName}/{imageName}
time = time.toISOString();
time = time.replace(/:/g, '').replace(/-/g,'');
time = time.substring(0,time.indexOf('.'))+"Z";
var request = new XMLHttpRequest();
var canonString = "GET\n"+
encodeURI(fullURL)+"\n"+
encodeURI("Key=asd.jpeg")+"\n"+
"host:s3.amazonaws.com\n"+
"x-amz-content-sha256:"+CryptoJS.SHA256("").toString()+"\n"+
"host;x-amz-content-sha256\n"+
CryptoJS.SHA256("").toString();
var stringToSign = "AWS4-HMAC-SHA256\n"+
time+"\n"+
"20181002/us-east-1/s3/aws4_request\n"+
CryptoJS.SHA256(canonString).toString();
var authString = CryptoJS.HmacSHA256(signingKey, stringToSign).toString();
var queryString = "GET https://s3.amazonaws.com/?Action=GetObject&Version=2010-05-08 HTTP/1.1\n"+
"Authorization: AWS4-HMAC-SHA256 Credential="+accessKey+"/20181002/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-date, Signature="+authString+"\n"+
"host: s3.amazonaws.com\n"+
"x-amz-date: "+time+"\n";
request.open("GET", "https://s3.amazonaws.com/?Action=GetObject&Version=2010-05-08", false);
request.setRequestHeader("Authorization", "AWS4-HMAC-SHA256 Credential="+accessKey+"/20181002/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-date, Signature="+authString);
request.setRequestHeader("host", "s3.amazonaws.com");
request.setRequestHeader("x-amz-date", time);
request.send();
edit3: Here are a bunch of errors I get, presumably because I have no idea what I'm doing.
index.js:61 Refused to set unsafe header "host"
index.js:63 OPTIONS https://s3.amazonaws.com/?Action=GetObject&Version=2010-05-08 403 (Forbidden)
index.js:63 Failed to load https://s3.amazonaws.com/?Action=GetObject&Version=2010-05-08: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'null' is therefore not allowed access.
index.js:63 Uncaught DOMException: Failed to execute 'send' on 'XMLHttpRequest': Failed to load 'https://s3.amazonaws.com/?Action=GetObject&Version=2010-05-08'.
You might want to use the SDK, combined with the browser debugger to figure out how the SDK formats the request. In the Chrome debugger Network tab, you can copy the request as a javascript fetch. This will show all the headers you need to set. You can then use this as a basis for your non-SDK code.
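For reference, the Signature Version 4 key derivation and final signature can be sketched as below. This uses Node's built-in `crypto` module rather than CryptoJS, and the function names, credentials, and string to sign are placeholders, so treat it as an illustration of the HMAC chain rather than a working S3 client. One thing worth checking in the question's code: CryptoJS takes the message first and the key second, `CryptoJS.HmacSHA256(message, key)`, so the call that passes `signingKey` first appears to have its arguments swapped.

```javascript
const crypto = require('crypto');

// HMAC-SHA256 helper: key first, then the message (the reverse of CryptoJS).
function hmac(key, message) {
  return crypto.createHmac('sha256', key).update(message, 'utf8').digest();
}

// Derive the SigV4 signing key by chaining HMACs over date, region, service.
function deriveSigningKey(secretKey, dateStamp, region, service) {
  const kDate = hmac('AWS4' + secretKey, dateStamp); // dateStamp is YYYYMMDD
  const kRegion = hmac(kDate, region);
  const kService = hmac(kRegion, service);
  return hmac(kService, 'aws4_request');
}

// The signature is the hex HMAC of the string to sign under the signing key.
function signString(signingKey, stringToSign) {
  return hmac(signingKey, stringToSign).toString('hex');
}

// Placeholder credentials and string to sign, for illustration only.
const key = deriveSigningKey('EXAMPLE-SECRET', '20181002', 'us-east-1', 's3');
const signature = signString(key, 'AWS4-HMAC-SHA256\n20181002T000000Z\n...');
console.log(signature.length); // 64
```

The list of signed headers in the Authorization header also has to match the one used in the canonical request, and a browser will still send an OPTIONS preflight for a request carrying `x-amz-date`, so the bucket needs a CORS configuration that allows it.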

Remote calls via javascript

I run a service where there is a javascript file that is called and self executed on a user's site.
This then calls an external server every 10 or so seconds with a bunch of variables.
I used to do this by using createElement('script'), setting its path to a file on the external server, and passing the required variables across as GET parameters (which works well for small URIs).
This worked really well and seemed to work cross browser as well with no undesired effects.
The problem I then ran into was when I needed to extend the number or size of the variables being sent across. So I decided to change from GET to POST, but by doing that I could no longer use the createElement('script') trick and had to opt for XMLHttpRequest() (à la Ajax, without jQuery). That worked really well, except for the minor problem of also having to cater for Internet Explorer and Opera, which didn't really play ball too well (big shock). So I used the following:
function createCORSRequest(method, url){
  var xhr = new XMLHttpRequest();
  if ("withCredentials" in xhr){
    xhr.open(method, url, true);
  } else if (typeof XDomainRequest != "undefined"){
    xhr = new XDomainRequest();
    xhr.open(method, url);
  } else {
    xhr = null;
  }
  return xhr;
}
var request = createCORSRequest("post", "http://xx.xxxx.com/");
if (request){
  request.onload = function(){
    //do something with request.responseText
  };
  request.send(myPostObjectDataVariableGoeshere);
}
..which I found over at this page
This is basically just a fallback to the XDomainRequest() method, which Internet Explorer wants you to use instead.
Fantastic, BUT -> Looking in the Console of Developer Tools in IE it says:
SEC7118: XMLHttpRequest for http://xx.xxxx.com/ required Cross Origin Resource Sharing (CORS).
SEC7120: Origin null not found in Access-Control-Allow-Origin header.
SCRIPT7002: XMLHttpRequest: Network Error 0x80070005, Access is denied.
But what's really odd about this is that I've already got the following as the first line in my backend PHP file that is being called (which works for other browsers...)
header('Access-Control-Allow-Origin: *');
Someone please tell me what's wrong here.. Also if there is a better way to be doing this instead of fighting the browser wars..
Note: I cannot use jQuery for this task!
You should try jQuery for this task. It's much easier and doesn't have that problem with IE.
http://api.jquery.com/jQuery.ajax/
IE unfortunately blocks cross-origin requests. I believe there is no simple way to get around it by script alone, but you can try tuning the options or using my proxy script.
Tuning the options
Internet Explorer ignores Access-Control-Allow headers and by default prohibits cross-origin access for Internet Zone. To enable CORS go to Tools->Internet Options->Security tab, click on “Custom Level” button. Find the Miscellaneous -> Access data sources across domains setting and select “Enable” option.
Proxy Script on local server as a Bridge
Previous post:
Remote POST request with jQuery and Ajax
The idea is to place a PHP script on your own server, send the AJAX request to it locally, and have it proxy the request to the remote server.

Can JavaScript do a POST HTTP request to any domain?

I know it's possible to load any kind of document from any domain from JavaScript (without necessarily being able to peek at its content), but it usually concerns regular GET requests. What about POST?
Is it possible to make an HTTP POST request from JavaScript to any domain name? (I'm specifically interested in form submissions.)
If so, how?
As per some answers on a nearby question, «HTTP GET request in JavaScript?», you might use XMLHttpRequest, since, according to the docs, the POST method is supported, too.
http://www.w3.org/TR/XMLHttpRequest/
https://developer.mozilla.org/en-US/docs/DOM/XMLHttpRequest
A sample code from the above w3.org document:
function log(message) {
  var client = new XMLHttpRequest();
  client.open("POST", "/log");
  client.setRequestHeader("Content-Type", "text/plain;charset=UTF-8");
  client.send(message);
}
However, it would seem like in order for it to work with POST requests to domains unrelated to yours (where instead of "/log", a complete http or https URL is specified), the Cross-Origin Resource Sharing may have to be supported and enabled on the target server, as per https://developer.mozilla.org/en-US/docs/HTTP/Access_control_CORS#Simple_requests.
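So, at least through XMLHttpRequest, a cross-origin POST (or even GET) only works fully if the target server supports CORS: without it the browser will not let your script read the response, and requests that need a preflight are blocked before they are sent. A plain HTML form submission is not subject to this restriction, although your script cannot read the resulting page either.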
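To make the "simple request" distinction from the MDN page concrete, here is a simplified check for whether a browser would send an OPTIONS preflight before the real request. The function name and the lists are my own condensed reading of the rules; the actual specification has further conditions (for example on header values and upload listeners), so this is a sketch rather than a faithful implementation.

```javascript
// Simplified lists, loosely following the CORS "simple request" rules.
const SIMPLE_METHODS = ['GET', 'HEAD', 'POST'];
const SIMPLE_HEADERS = ['accept', 'accept-language', 'content-language', 'content-type'];
const SIMPLE_CONTENT_TYPES = [
  'application/x-www-form-urlencoded',
  'multipart/form-data',
  'text/plain',
];

// Returns true when the request would be preceded by an OPTIONS preflight.
function needsPreflight(method, headers = {}) {
  if (!SIMPLE_METHODS.includes(method.toUpperCase())) return true;
  for (const [name, value] of Object.entries(headers)) {
    const lower = name.toLowerCase();
    if (!SIMPLE_HEADERS.includes(lower)) return true;
    if (lower === 'content-type') {
      // Ignore parameters such as "; charset=UTF-8".
      const mime = String(value).split(';')[0].trim().toLowerCase();
      if (!SIMPLE_CONTENT_TYPES.includes(mime)) return true;
    }
  }
  return false;
}

console.log(needsPreflight('POST', { 'Content-Type': 'text/plain;charset=UTF-8' })); // false
console.log(needsPreflight('POST', { 'Content-Type': 'application/json' }));         // true
console.log(needsPreflight('GET', { 'x-amz-date': '20181002T000000Z' }));            // true
```

The w3.org `log` example above falls into the first case: the POST is sent without a preflight, but the response stays unreadable unless the server returns a matching Access-Control-Allow-Origin header.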

AJAX and Cross-Site Scripting to Read the Header

Help me understand AJAX and cross-site scripting a little better. Writing AJAX is fairly straightforward. If I want to asynchronously read the HTTP headers of a website, I'd do something like this:
var req = new XMLHttpRequest();
req.open('HEAD', 'http://www.stackoverflow.com/', true);
req.onreadystatechange = function (aEvt) {
  if (req.readyState == 4) {
    if(req.status == 200)
      alert(req.responseText);
    else
      alert("Error loading page");
  }
};
req.send(null);
However, when I copy and paste this into a simple HTML page using notepad and try to run it locally, the request status doesn't seem to return 200. I am assuming this is due to cross-site scripting. How would I get around this?
You are right in that making requests across domains is not allowed unless you are using Cross-Origin Resource Sharing (CORS, http://www.w3.org/TR/cors/). CORS has a client-side and a server-side component. On the client side, the request looks mostly like a regular XMLHttpRequest, except you have a few other properties and handlers you can configure. On the server, the response will need to emit some special HTTP headers. This article gives a good breakdown of how CORS works on the client and server: http://www.nczonline.net/blog/2010/05/25/cross-domain-ajax-with-cross-origin-resource-sharing/
My first guess would be to try and make a local PHP file which acts like a gateway:
<?php
echo implode("\n", get_headers($_GET['url'])); // get_headers() returns an array, so join it before echoing
?>
Then, perform a GET request with the url of your target site as the parameter, and parse the .responseText of that request to determine the response header of your original.
I don't think it's possible with pure JS, so you'll have to use some serverside code.
There are two types of "locally":
Using a local server (http://localhost/)
Accessing HTML file directly (file:///C:\a\b\c.html)
AJAX won't work, ever, in the second case.
You can't make an ajax request to http://stackoverflow.com if your page is being served on http://localhost/...
http://en.wikipedia.org/wiki/XMLHttpRequest#Cross-domain_requests
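Once the request is actually permitted (same origin, or CORS enabled on the target), note that a HEAD response has no body, so `responseText` will be empty; the headers come from `getAllResponseHeaders()` or `getResponseHeader(name)`. A small sketch of turning the raw header string into an object follows; `parseResponseHeaders` is a hypothetical helper name, not part of the XMLHttpRequest API.

```javascript
// Turn the raw string from xhr.getAllResponseHeaders() into a plain object.
// Header lines are separated by CRLF; names are lower-cased for lookup.
function parseResponseHeaders(raw) {
  const headers = {};
  for (const line of raw.trim().split(/\r?\n/)) {
    const idx = line.indexOf(':');
    if (idx === -1) continue; // skip malformed or empty lines
    const name = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    // Repeated headers are joined with ", ".
    const seen = Object.prototype.hasOwnProperty.call(headers, name);
    headers[name] = seen ? headers[name] + ', ' + value : value;
  }
  return headers;
}

const sample = 'Content-Type: text/html; charset=utf-8\r\nContent-Length: 1234\r\n';
console.log(parseResponseHeaders(sample)['content-type']); // text/html; charset=utf-8
```

In the question's handler this would be called as `parseResponseHeaders(req.getAllResponseHeaders())` in place of reading `req.responseText`.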

Open webpage and parse it using JavaScript

I know JavaScript can open a link in a new window but is it possible to open a webpage without opening it in a window or displaying it to the user? What I want to do is parse that webpage for some text and use it as variables.
Is this possible without any help from server side languages? If so, please send me in a direction I can achieve this.
Thanks all
You can use an XMLHttpRequest object to do this. Here's a simple example
var req = new XMLHttpRequest();
req.open('GET', 'http://www.mydomain.com/', false);
req.send(null);
if (req.status == 200)
  console.log(req.responseText);
Once loaded, you can perform your parsing/scraping by using javascript regular expressions on the req.responseText member.
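As an illustration of that regular-expression approach, the sketch below pulls the page title and link targets out of an HTML string. The helper names and the sample markup are invented for the example, and regexes like these only cope with simple, predictable markup.

```javascript
// Pull the <title> text out of an HTML string with a regular expression.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Collect the href value of every link that uses a double-quoted href.
function extractLinks(html) {
  const links = [];
  const pattern = /<a\s[^>]*href="([^"]*)"/gi;
  let m;
  while ((m = pattern.exec(html)) !== null) {
    links.push(m[1]);
  }
  return links;
}

const html = '<html><head><title> Example </title></head>' +
  '<body><a href="/a">A</a> <a class="x" href="/b">B</a></body></html>';
console.log(extractTitle(html)); // Example
console.log(extractLinks(html)); // [ '/a', '/b' ]
```

In a browser, `new DOMParser().parseFromString(html, 'text/html')` gives a real document to query and is usually more robust than regexes for anything beyond trivial extraction.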
More detail...
In practice you need to do a little more to get the XMLHttpRequest object in a cross platform manner, e.g.:
var req;
var ua = navigator.userAgent.toLowerCase();
if (!window.ActiveXObject)
  req = new XMLHttpRequest();
else if (ua.indexOf('msie 5') == -1)
  req = new ActiveXObject("Msxml2.XMLHTTP");
else
  req = new ActiveXObject("Microsoft.XMLHTTP");
Or use a library...
Alternatively, you can save yourself all the bother and just use a library like jQuery or Prototype to take care of this for you.
Same-origin policy may bite you though...
Note that due to the same-origin policy, the page you request must be from the same domain as the page making the request. If you want to request a remote page, you will have to proxy that via a server side script.
Another possible workaround is to use Flash to make the request, which does allow cross-domain requests if the target site grants permission with a suitably configured crossdomain.xml file.
Here's a nice article on the subject of the same-origin policy:
Same-Origin Policy Part 1: Why we’re stuck with things like XSS and XSRF/CSRF
Whatever Origin is an open source library that allows you to use purely Javascript to do scraping. It also solves the "same-domain-origin" problem.
http://www.whateverorigin.org/
$.getJSON('http://whateverorigin.org/get?url=' + encodeURIComponent('http://google.com') + '&callback=?', function(data){
alert(data.contents);
});
You can try using fetch and its promise callbacks:
fetch('https://api.codetabs.com/v1/proxy?quest=google.com').then((response) => response.text()).then((text) => console.log(text));
You could open the new window in an iframe:
http://www.w3schools.com/TAGS/tag_iframe.asp
Note, though, that JavaScript access to the iframe's content is limited if the site you open is from a different origin (scheme, host, or port). This is to prevent cross-site scripting attacks:
http://en.wikipedia.org/wiki/Cross-site_scripting
You would use AJAX. This would make a GET request to the URL in question and return the response HTML. jQuery makes this very easy, e.g.
$.get("test.php");
http://docs.jquery.com/Ajax
Andrew
