Open webpage and parse it using JavaScript - javascript

I know JavaScript can open a link in a new window but is it possible to open a webpage without opening it in a window or displaying it to the user? What I want to do is parse that webpage for some text and use it as variables.
Is this possible without any help from server side languages? If so, please send me in a direction I can achieve this.
Thanks all

You can use an XMLHttpRequest object to do this. Here's a simple example
var req = new XMLHttpRequest();
req.open('GET', 'http://www.mydomain.com/', false);
req.send(null);
if(req.status == 200)
dump(req.responseText);
Once loaded, you can perform your parsing/scraping by using javascript regular expressions on the req.responseText member.
More detail...
In practice you need to do a little more to get the XMLHttpRequest object in a cross platform manner, e.g.:
var ua = navigator.userAgent.toLowerCase();
if (!window.ActiveXObject)
req = new XMLHttpRequest();
else if (ua.indexOf('msie 5') == -1)
req = new ActiveXObject("Msxml2.XMLHTTP");
else
req = new ActiveXObject("Microsoft.XMLHTTP");
Or use a library...
Alternatively, you can save yourself all the bother and just use a library like jQuery or Prototype to take care of this for you.
Same-origin policy may bite you though...
Note that due to the same-origin policy, the page you request must be from the same domain as the page making the request. If you want to request a remote page, you will have to proxy that via a server side script.
Another possible workaround is to use Flash to make the request, which does allow cross-domain requests if the target site grants permission with a suitably configured crossdomain.xml file.
Here's a nice article on the subject of the same-origin policy:
Same-Origin Policy Part 1: Why we’re stuck with things like XSS and XSRF/CSRF

Whatever Origin is an open source library that allows you to use purely Javascript to do scraping. It also solves the "same-domain-origin" problem.
http://www.whateverorigin.org/
$.getJSON('http://whateverorigin.org/get?url=' + encodeURIComponent('http://google.com') + '&callback=?', function(data){
alert(data.contents);
});

You can try using fetch and it's callback
fetch('https://api.codetabs.com/v1/proxy?quest=google.com').then((response) => response.text()).then((text) => console.log(text));

You could open the new window in an iframe:
http://www.w3schools.com/TAGS/tag_iframe.asp
Although note that Javascript access is limited if the site you open is from a different URL. This is to prevent cross-site scripting attacks:
http://en.wikipedia.org/wiki/Cross-site_scripting

You would use AJAX. This would make a Get request to the URL in question and return the response HTML. Jquery makes this very easy e.g.
$.get("test.php");
http://docs.jquery.com/Ajax
Andrew

Related

How to read the source of another webpage - javascript [duplicate]

I know very, very little of javascript, but I'm interested in writing a script which needs information from another webpage. It there a javascript equivalent of something like urllib2? It doesn't need to be very robust, just enough to process a simple GET request, no need to store cookies or anything and store the results.
There is the XMLHttpRequest, but that would be limited to the same domain of your web site, because of the Same Origin Policy.
However, you may be interested in checking out the following Stack Overflow post for a few solutions around the Same Origin Policy:
Ways to circumvent the same-origin policy
UPDATE:
Here's a very basic (non cross-browser) example:
var xhr = new XMLHttpRequest();
xhr.open('GET', '/questions/3315235', true);
xhr.onreadystatechange = function() {
if (xhr.readyState === 4) {
console.log(xhr.responseText);
}
};
xhr.send(null);
If you run the above in Firebug, with Stack Overflow open, you'd get the HTML of this question printed in your JavaScript console:
JavaScript access another webpage http://img217.imageshack.us/img217/5545/fbugxml.png
You could issue an AJAX request and process it.
Write your own server, which runs the script to load the data from websites. Then from your web page, ask your server to fetch the data from websites and send them back to you.
see http://www.storminthecastle.com/2013/08/25/use-node-js-to-extract-data-from-the-web-for-fun-and-profit/

Remote calls via javascript

I run a service where there is a javascript file that is called and self executed on a user's site.
This then calls an external server every 10 or so seconds with a bunch of variables.
I used to do this by using a createElement('script') and then setting the path to a file on the external server and passing the required variables across by means of GET variables. (works well for small URI's)
This worked really well and seemed to work cross browser as well with no undesired effects.
The problem I then ran into was when I needed to extend the amount or size of the variables that were being sent across. So obviously I decided to change from GET method to POST, but by doing that I could no longer use the createElement('script') trick and had to opt for the XMLHttpRequest() (ala Ajax - without jQuery) method which worked really well, except for the minor problem of having to also cater for Internet Explorer and Opera which didn't really play ball too well (big shock). So I used the following:
function createCORSRequest(method, url){
var xhr = new XMLHttpRequest();
if ("withCredentials" in xhr){
xhr.open(method, url, true);
} else if (typeof XDomainRequest != "undefined"){
xhr = new XDomainRequest();
xhr.open(method, url);
} else {
xhr = null;
}
return xhr;
}
var request = createCORSRequest("post", "http://xx.xxxx.com/");
if (request){
request.onload = function(){
//do something with request.responseText
};
request.send(myPostObjectDataVariableGoeshere);
}
..which I found over at this page
This is basically just a fallback to using the XDomainRequest() method which InternetExplorer wants you to use instead..
Fantastic, BUT -> Looking in the Console of Developer Tools in IE it says:
SEC7118: XMLHttpRequest for http://xx.xxxx.com/ required Cross Origin Resource Sharing (CORS).
SEC7120: Origin null not found in Access-Control-Allow-Origin header.
SCRIPT7002: XMLHttpRequest: Network Error 0x80070005, Access is denied.
But what's really odd about this is that I've already got the following as the first line in my backend PHP file that is being called (which works for other browsers...)
header('Access-Control-Allow-Origin: *');
Someone please tell me what's wrong here.. Also if there is a better way to be doing this instead of fighting the browser wars..
Note: I cannot use jQuery for this task!
You should try jQuery for this task. Its much easier and don't have that problem with IE.
http://api.jquery.com/jQuery.ajax/
IE unfortunately block Cross Origin requests, i believe there is no simple way to get around it by script only, but you can try tuning the options or via my proxy script.
Tuning the options
Internet Explorer ignores Access-Control-Allow headers and by default prohibits cross-origin access for Internet Zone. To enable CORS go to Tools->Internet Options->Security tab, click on “Custom Level” button. Find the Miscellaneous -> Access data sources across domains setting and select “Enable” option.
Proxy Script on local server as a Bridge
Previous post:
Remote POST request with jQuery and Ajax
This is for you to place a PHP script on a local server and do a local AJAX request and proxy to the remote server for good.

Error: "Access to restricted URI denied"

Access to restricted URI denied" code: "1012 [Break On This Error]
xhttp.send(null);
function getXML(xml_file) {
if (window.XMLHttpRequest) {
var xhttp = new XMLHttpRequest(); // Cretes a instantce of XMLHttpRequest object
}
else {
var xhttp = new ActiveXObject("Microsoft.XMLHTTP"); // for IE 5/6
}
xhttp.open("GET",xml_file,false);
xhttp.send(null);
var xmlDoc = xhttp.responseXML;
return (xmlDoc);
}
I'm trying to get data from a XML file using JavaScript. Im using Firebug to test and debug on Firefox.
The above error is what I'm getting. It works in other places i used the same before, why is acting weird here?
Can someone help me why it's occuring?
Update:
http://jquery-howto.blogspot.com/2008/12/access-to-restricted-uri-denied-code.html
I found this link explaining the cause of the problem. But I didn't get what the solution given means can someone elaborate?
Another possible cause of this is when you are working with a .html file directly on the file system. For example, if you're accessing it using this url in your browser: C:/Users/Someguy/Desktop/MyProject/index.html
If that then has to make an ajax request, the ajax request will fail because ajax requests to the filesystem are restricted. To fix this, setup a webserver that points localhost to C:/Users/Someguy/Desktop/MyProject and access it from http://localhost/index.html
Sounds like you are breaking the same origin policy.
Sub domains, different ports, different protocols are considered different domains.
Try adding Access-Control-Allow-Origin:* header to the server side script that feeds you the XML. If you don't do it in PHP (where you can use header()) and try to read a raw XML file, you probably have to set the header in a .htaccess file by adding Header set Access-Control-Allow-Origin "*". In addition you might need to add Access-Control-Allow-Headers:*.
Also I'd recommend to replace the * in production mode to disallow everybody from reading your data and instead add your own url there.
Without code impossible to say, but you could be running foul of the cross-site ajax limitation: you cannot make ajax requests to other domains.

AJAX and Cross-Site Scripting to Read the Header

Help me understand AJAX and cross-site scripting a little better. Writing AJAX is fairly straight forward. If I want to asynchronously read HTTP header of a website, I'd do something like this:
var req = new XMLHttpRequest();
req.open('HEAD', 'http://www.stackoverflow.com/', true);
req.onreadystatechange = function (aEvt) {
if (req.readyState == 4) {
if(req.status == 200)
alert(req.responseText);
else
alert("Error loading page");
}
};
req.send(null);
However, when I copy and paste this into a simple HTML page using notepad and try to run it locally, the request status doesn't seem to return 200. I am assuming this is due to cross-site scripting. How would I get around this?
You are right in that making requests across domains is not allowed unless you are using Cross-Origin Resource Sharing (CORS, http://www.w3.org/TR/cors/). CORS has a client-side and server side component. On the client side, the request looks mostly like a regular XmlHttpRequest, except you have a few other properties and handlers you can configure. On the server, the response will need to emit some special http headers. This article gives a good breakdown of how CORS works on the client and server: http://www.nczonline.net/blog/2010/05/25/cross-domain-ajax-with-cross-origin-resource-sharing/
My first guess would be to try and make a local PHP file which acts like a gateway:
<?php
echo get_headers($_GET['url']);
?>
Then, perform a GET request with the url of your target site as the parameter, and parse the .responseText of that request to determine the response header of your original.
I don't think it's possible with pure JS, so you'll have to use some serverside code.
There are two types of "locally":
Using a local server (http://localhost/)
Accessing HTML file directly (file:///C:\a\b\c.html)
AJAX won't work, ever, in the second case.
You can't make an ajax request to http://stackoverflow.com if your page is being served on http://localhost/...
http://en.wikipedia.org/wiki/XMLHttpRequest#Cross-domain_requests

Cross-site AJAX requests

I need to make an AJAX request from a website to a REST web service hosted in another domain.
Although this is works just fine in Internet Explorer, other browsers such as Mozilla and Google Chrome impose far stricter security restrictions, which prohibit cross-site AJAX requests.
The problem is that I have no control over the domain nor the web server where the site is hosted. This means that my REST web service must run somewhere else, and I can't put in place any redirection mechanism.
Here is the JavaScript code that makes the asynchronous call:
var serviceUrl = "http://myservicedomain";
var payload = "<myRequest><content>Some content</content></myRequest>";
var request = new XMLHttpRequest();
request.open("POST", serviceUrl, true); // <-- This fails in Mozilla Firefox amongst other browsers
request.setRequestHeader("Content-type", "text/xml");
request.send(payload);
How can I have this work in other browsers beside Internet Explorer?
maybe JSONP can help.
NB youll have to change your messages to use json instead of xml
Edit
Major sites such as flickr and twitter support jsonp with callbacks etc
The post marked as the answer is erroneous: the iframes document is NOT able to access the parent. The same origin policy works both ways.
The fact is that it is not possible in any way to consume a rest based webservice using xmlhttprequest. The only way to load data from a different domain (without any framework) is to use JSONP. Any other solutions demand a serverside proxy located on your own domain, or a client side proxy located on the remote domain and som sort of cross-site communication (like easyXDM) to communicate between the documents.
The fact that this works in IE is a security issue with IE, not a feature.
Unfortunately cross-site scripting is prohibited, and the accepted work around is to proxy the requests through your own domain: do you really have no ability to add or modify server side code?
Furthermore, the secondary workaround - involving the aquisition of data through script tags - is only going to support GET requests, which you might be able to hack with a SOAP service, but not so much with the POST request to a RESTful service you describe.
I'm really not sure an AJAX solution exists, you might be back to a <form> solution.
The not very clear workaround (but works) is using iframe as container for requests to another sites. The problem is, the parent can not access iframe's content, can only navigate iframe's "src" attribut. But the iframe content can access parent's content.
So, if the iframe's content know, they can call some javascript content in parent page or directly access parent's DOM.
EDIT:
Sample:
function ajaxWorkaroung() {
var frm = gewtElementById("myIFrame")
frm.src = "http://some_other_domain"
}
function ajaxCallback(parameter){
// this function will be called from myIFrame's content
}
Make your service domain accept cross origin resource sharing (CORS).
Typical scenario: Most CORS compliant browsers will first send an OPTIONS header, to which, the server should return information about which headers are accepted. If the headers satisfy the service's requirements for the request provided (Allowed Methods being GET and POST, Allowed-Origin *, etc), the browser will then resend the request with the appropriate method (GET, POST, etc.).
Everything this point forward is the same as when you are using IE, or more simply, if you were posting to the same domain.
Caviots: Some service development SDK's (WCF in particular) will attempt to process the request, in which case you need to preprocess the OPTIONS Method to respond to the request and avoid the method being called twice on the server.
In short, the problem lies server-side.
Edit There is one issue with IE 9 and below with CORS, in that it is not fully implemented. Luckily, you can solve this problem by making your calls from server-side code to the service and have it come back through your server (e.g. mypage.aspx?service=blah&method=blahblah&p0=firstParam=something). From here, your server side code should implement a request/response stream model.
Just use a server side proxy on your origin domain. Here is an example: http://jquery-howto.blogspot.com/2009/04/cross-domain-ajax-querying-with-jquery.html
This can also be done using a webserver setup localy that calls curl with the correct arguments and returns the curl output.
app.rb
require 'sinatra'
require 'curb'
set :views,lambda {"views/"+self.name.to_s.downcase.sub("controller","")}
set :haml, :layout => :'../layout', :format => :html5, :escape_html=>true
disable :raise_errors
get '/data/:brand' do
data_link = "https://externalsite.com/#{params[:brand]}"
c = Curl::Easy.perform(data_link)
c.body_str
end
Sending an ajax request to localhost:4567/data/something will return the result from externalsite.com/something.
Another option would be to setup a CNAME record on your own domain to "Mask" the remote domain hostname.

Categories