How to screen scrape across origins in an IFRAME? - javascript

I have a business web app that needs to pull in information from various other web sites. For most sites, the user just instructs the server to pull the data (either using .NET's HttpRequest, or Selenium).
But for some unfriendly, Javascript-heavy sites, our users have to visit the site manually, navigate to the right spot, and copy and paste into our application.
Other than bookmarklets, is there any way for our page to show an IFRAME with the source web site loaded, allow the user to navigate within the frame, and then capture the IFRAME's body?
Since the site in the IFRAME isn't in the same domain (not even close), I can't seem to work around browser cross-site scripting limitations. I've tried using HTML5's "sandbox" feature, but it appears to only allow communication (via "allow-same-origin") the other way, from the IFRAME to the host site, which isn't useful to me. Also, it doesn't work if the site in question attempts to load its frames to the top context.
What I'm ideally looking for is a solution that would allow the browser to be configured to trust my web site implicitly (it's an intranet app) and allow it to access any frame's contents. That would at least get me in the ballpark. Bonus points if I can get the iframe to redefine the "top" context as its own frame, so the hosted site functions properly within the frame.

The best approach I've found through many, many screen-scraping projects (scraping JS-heavy pages) is to create a user script (for example a Greasemonkey script), set up a few virtual machines in their own IP space (for protection), and feed them a list of sites to visit from a remote program:
Check the queue at a set interval
Request page with Greasemonkey, etc.
Capture contents and send to remote program for processing
You can't use the iframe method, because the browser will not let the parent page read a frame from another origin, and you are going to bang your head against a wall trying to go that route. The method I've described has worked for numerous large-scale scraping projects.
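The capture-and-send steps above could look roughly like the userscript below. This is only a sketch: it assumes a userscript manager that grants GM_xmlhttpRequest (Tampermonkey-style), and the collector URL, the script name, and the two-second delay are all made up for illustration. The queue itself lives in the remote program and is not shown.

```javascript
// ==UserScript==
// @name     capture-page (hypothetical)
// @match    *://*/*
// @grant    GM_xmlhttpRequest
// @connect  collector.example
// ==/UserScript==

// Hypothetical endpoint of the remote program that processes captures.
var COLLECTOR_URL = 'https://collector.example/capture';

// Build the payload from a document-like and a location-like object.
function buildCapture(doc, loc) {
  return {
    url: loc.href,
    title: doc.title,
    html: doc.documentElement.outerHTML,
    capturedAt: new Date().toISOString()
  };
}

// Only runs inside a userscript manager, where GM_xmlhttpRequest is granted.
if (typeof GM_xmlhttpRequest === 'function' && typeof document !== 'undefined') {
  // Give script-heavy pages a moment to finish rendering (arbitrary delay).
  setTimeout(function () {
    GM_xmlhttpRequest({
      method: 'POST',
      url: COLLECTOR_URL,
      headers: { 'Content-Type': 'application/json' },
      data: JSON.stringify(buildCapture(document, location))
    });
  }, 2000);
}
```

Because the script runs inside the target page itself, it reads the rendered DOM directly and never has to cross an origin boundary.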

Related

Communicate w/ Javascript running in an iFrame

I'm currently working on an application that uses the Phonegap/Cordova framework to display an online and an offline version of a website. If you're not familiar w/ this framework, it offers a simple way of creating multi-platform applications by displaying local files in a full-screen webview.
When launching the application, the Javascript integrated in the local files of the application detects if Internet access is available, and redirects the user to either another local webpage containing a full-screen iFrame of the live website, or a reduced offline version of the website (contained in the local files of the app) if no Internet connection is detected.
I would like to detect when the user logs in using the various forms on the website (being displayed inside the iFrame), but I have no way of knowing which page the user is on, or interact w/ the website content at all because of the same-origin policy.
Would it be possible though to make the Javascript from the local page (which contains the iFrame) interact w/ the Javascript from the remote page (which is being displayed in the iFrame)? This way, I would be able to obtain the login information, and save it for later use (obviously not w/o using a token system), but also it would help for another planned feature (trigger the guidance system).
Thank you.
Look into HTML5 cross-document messaging (window.postMessage). It's pretty simple and sounds like it fits your needs, provided you can add a small script to the remote page:
http://stevehanov.ca/blog/index.php?id=109
https://developer.mozilla.org/en-US/docs/Web/API/Window/postMessage
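A minimal sketch of how the local page could listen for such messages, assuming the remote site is under your control and can be changed to call postMessage after a login. The origin, the message shape ({ type: 'login', token: ... }) and the function names are assumptions made for this example, not anything from the linked articles.

```javascript
// Origin of the live site shown in the iframe (hypothetical).
var REMOTE_ORIGIN = 'https://www.example.com';

// Returns a 'message' event handler that ignores senders from other origins.
function makeLoginListener(allowedOrigin, onLogin) {
  return function (event) {
    if (event.origin !== allowedOrigin) return;   // never trust unknown senders
    var msg = event.data;
    if (msg && msg.type === 'login') onLogin(msg.token);
  };
}

// Local page (the parent of the iframe).
if (typeof window !== 'undefined') {
  window.addEventListener('message', makeLoginListener(REMOTE_ORIGIN, function (token) {
    console.log('user logged in, token:', token);
  }));
}

// The remote page would have to run something like this after a successful login:
//   window.parent.postMessage({ type: 'login', token: '...' }, '*');
// '*' is used here because a page loaded from local files has no usable origin
// to name as the target; prefer a specific target origin whenever one exists.
```

The origin check on the receiving side is the important part: without it, any page that manages to get a reference to your window could send a fake login message.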

Use another website from an HTML file running locally

I am trying to load another website from a webpage I am running locally. While it does load, I can not seem to reference anything inside. I keep getting
Blocked a frame with origin "null" from accessing a frame with origin "http://theWebsiteImAccessingWithTheIFrame.com". The frame requesting access has a protocol of "file", the frame being accessed has a protocol of "http". Protocols must match.
I get that this is a security feature, but there must be a way to reference the stuff inside if it is loading it anyway, no?
Any help is greatly appreciated!
Edit:
I have created a map of the office I work in, using SVGs, with everyone's information (office location, their photo, extension, etc.). We also just got a bunch of IP phones that are managed and hosted by LightPath. Their webpage lets us manage our phones and even make them call others (using JavaScript, but I have no idea how since their code is insanely complex).
My plan was that if a user clicked on someone's office, they would get a button asking them to enter their number and PIN to log in (which is how it works on the LightPath website), and it would connect the two phones. I intended to use their number and PIN to log in for them and have the call connect that way, just by controlling the forms on LightPath's site while it was in an iframe. This way, they wouldn't see the clutter of LightPath's site (because I could hide the iframe), and it would just get done. Essentially, it would happen as if they had gone to the website themselves and done it that way, except in a much more approachable format and with fewer distractions.
LightPath does offer a "call me" feature which creates a dedicated button for calling a specific person, but it creates a token for them, and only that person has the ability to create it. On top of that, I would have to enter each person's unique token into the site and run the risk of it failing should their extension get changed or they leave the company. So I was hoping for something a little more dynamic.
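This is a browser security feature (the same-origin policy).
You can't access the contents of iFrames that are not from the same origin as the page containing them.
Same origin means the same protocol, host and port, so a local file and a remote site never match.
In practice both pages have to be served from the same server. A page opened from file:// gets the origin "null", and in Chrome it isn't allowed to access any other file or frame.
So to work with the remote site you may want to look into other ways of accessing it.
But either way, at least in Google Chrome, your own page needs to be served over the http:// protocol rather than opened from disk.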
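To make the "protocols must match" part concrete, here is a small sketch of the comparison using the standard URL class. The function name sameOrigin and the example URLs are made up; real browsers do this check internally and apply further rules, so treat this as an illustration only.

```javascript
// Compare two URLs the way the error message describes:
// protocol, host and port all have to match.
function sameOrigin(a, b) {
  var originA = new URL(a).origin;
  var originB = new URL(b).origin;
  // file: pages get an opaque origin, serialized as the string "null",
  // which never matches anything (not even another file: page).
  if (originA === 'null' || originB === 'null') return false;
  return originA === originB;
}

console.log(sameOrigin('http://example.com/a', 'http://example.com/b'));          // true
console.log(sameOrigin('file:///C:/office/map.html', 'http://example.com/'));     // false
console.log(sameOrigin('http://example.com/', 'https://example.com/'));           // false
```

This is why the error mentions origin "null": the page opened from disk has no real origin to compare against the site in the frame.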

URL tracking functionality

I want my webpage to have two parts. The top part has a textbox. When the user types a URL into the textbox, the bottom part browses to the content of that URL. When the user clicks a link within the bottom part, the bottom part navigates to the new URL, and the textbox in the top part changes to the new URL. How can I do it?
NOTE: This behavior is the same as in Google Translate (e.g. here), but without any translation.
first problem..
Same origin issue
The only way to achieve what you are asking is exactly the way google translate does what it does - which is to use a server-side powered script as a proxy request:
http://translate.google.com/translate_un?depth=1&hl=en&ie=UTF8&prev=_t&rurl=translate.google.com&sl=auto&tl=en&twu=1&u=http://de.wikipedia.org/wiki/USA&lang=de&usg=ALkJrhgoLkbUGvOPUCHoNZIkVcMQpXhxZg
The above is the URL taken from the iframe that Google Translate uses to display the translated page. The main thing to note is that the domain part of the URL is the same as the parent page's URL, http://translate.google.com. If your frame and your parent window do not share the same domain, then your parent window's JavaScript won't be able to access anything within the iframe. It will be blocked by your browser's built-in security.
Obviously the above won't be a problem if in your project you are only ever going to be navigating your own pages (on the same domain), but considering you are proffering Google Translate as an example I'm assuming not.
What would Google do?
What the above URL does is to ask the server-side to fetch the wikipedia page and return it so that the iframe can display it - but to the iframe this page appears to be hosted on translate.google.com rather than wikipedia. This means that the iframe stays within the same origin as the parent window, and means that JavaScript can be used to edit or modify the page within the iframe.
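A rough sketch of what such a proxy endpoint could look like in Node. This is not how Google does it; it only illustrates the idea. It assumes Node 18+ (for the global fetch), a query parameter named u as in the URL above, and HTML-only responses. The function names are invented for this example.

```javascript
// Extract and validate the page to fetch from a request like "/proxy?u=http://...".
function parseProxyTarget(requestUrl) {
  var target = new URL(requestUrl, 'http://localhost').searchParams.get('u');
  if (!target) return null;
  try {
    var parsed = new URL(target);
    // Only plain web URLs; a real service would also block internal hosts.
    return (parsed.protocol === 'http:' || parsed.protocol === 'https:') ? parsed : null;
  } catch (e) {
    return null;   // not an absolute URL
  }
}

// Request handler in the shape expected by Node's http.createServer.
async function handleProxy(req, res) {
  var target = parseProxyTarget(req.url);
  if (!target) {
    res.writeHead(400, { 'Content-Type': 'text/plain' });
    res.end('Missing or invalid "u" parameter');
    return;
  }
  try {
    var upstream = await fetch(target.href);
    var body = await upstream.text();   // text only; binary content would need buffers
    res.writeHead(upstream.status, {
      'Content-Type': upstream.headers.get('content-type') || 'text/html'
    });
    res.end(body);   // now served from the proxy's origin, not the original site's
  } catch (e) {
    res.writeHead(502, { 'Content-Type': 'text/plain' });
    res.end('Upstream fetch failed');
  }
}
```

It could be wired up with require('http').createServer(handleProxy).listen(8080), with the iframe then pointing at /proxy?u=... on your own domain.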
next problem....
Rewrite the proxied content
Basically what I'm saying is that this can't be achieved with just HTML and client-side JavaScript. You need something to help from the server side, e.g. PHP, Python, Ruby, Lisp, Node and so on. This script will be responsible for making sure the proxied page renders correctly, e.g. you will have to make sure relative links to content/images/css on the original server are not broken (you can use the base tag or physically rewrite relative links). There are also many sites that would see this as a violation of their terms of use, and so they should be blacklisted from your service.
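As a sketch of the base tag option, the proxy script could insert a base element into the fetched HTML before returning it. The helper below is hypothetical and uses a simple regular expression rather than a real HTML parser, so it will miss unusual markup.

```javascript
// Insert <base href="..."> right after the opening <head> tag so that relative
// URLs for images, CSS and scripts resolve against the original server.
function injectBaseTag(html, originalUrl) {
  // Escape characters that would break out of the attribute value.
  var safeHref = String(originalUrl).replace(/&/g, '&amp;').replace(/"/g, '&quot;');
  var baseTag = '<base href="' + safeHref + '">';
  // Matches <head> or <head ...attributes>, but not <header>.
  var headOpen = /<head(\s[^>]*)?>/i;
  if (headOpen.test(html)) {
    return html.replace(headOpen, function (match) { return match + baseTag; });
  }
  // No head element found: put the tag in front and let the browser sort it out.
  return baseTag + html;
}
```

Note that a base tag also makes relative links point back at the original site, so clicking them would leave the proxy unless the links are rewritten as well.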
final problem..?
Prevent the user from breaking away from your proxy
Once you have your proxy script, you can then use an iframe (please avoid using old framesets), and a bit of JavaScript magic that, on load or DOM ready of the iframe, rewrites all of the links, forms and buttons in the page. This is so that when clicked or submitted, they post to your proxy script rather than the original destination. This rewrite code would also have to send the original destination to your proxy script somehow, like u in the Google Translate URL. Once you've sorted this, it will mean your iframe will reload with the new destination content, but, all importantly, your iframe will stay on the same domain.
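The link part of that rewriting might be sketched like this. The /proxy path and the function names are assumptions for the example; forms, buttons and script-driven navigation are left out and would each need their own handling.

```javascript
// Hypothetical proxy endpoint on the same origin as the parent page.
var PROXY_PATH = '/proxy';

// Resolve a link against the original page URL and wrap it for the proxy,
// passing the real destination in the "u" parameter.
function toProxyUrl(href, originalPageUrl) {
  var absolute = new URL(href, originalPageUrl).href;
  return PROXY_PATH + '?u=' + encodeURIComponent(absolute);
}

// Rewrite every link in the (proxied, therefore same-origin) iframe document.
function rewriteLinks(frameDocument, originalPageUrl) {
  var links = frameDocument.querySelectorAll('a[href]');
  for (var i = 0; i < links.length; i++) {
    var raw = links[i].getAttribute('href');
    // Leave in-page anchors and non-navigational schemes alone.
    if (/^(javascript:|mailto:|#)/i.test(raw)) continue;
    links[i].setAttribute('href', toProxyUrl(raw, originalPageUrl));
  }
}
```

In the parent page this could be called from the iframe's load handler, for example rewriteLinks(iframe.contentDocument, currentOriginalUrl), where currentOriginalUrl is whatever was last sent in u.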
too many problems!
If it were me, personally, I'd rethink your strategy
Overall this is not a simple task, and it isn't 100% foolproof either because there are many things that will cause problems:
Certain sites are designed to break out of frames.
There are ways a user can navigate from a page that cannot be easily rewritten, i.e. any navigation powered by JavaScript.
Certain pages are designed to break when served up from the wrong host.
Sites that do this kind of 'proxying' of other websites can get into hot water with regards to copyright and usage.
The reason why Google can do it is because they have a lot of time, money and resources... oh and a great deal of what Google translate does is actually handled on the server-side - not in JavaScript.
suggestions
If you are looking for tracking users navigating through your own site:
Use Google Analytics.
Or implement a simple server-side tracking system using cookies.
If you are looking to track users coming to your site and then travelling on to the rest of the world wide web:
Give up, web technologies are designed to prevent things like this.
Or join an online marketing company, they do their best to get around the prevention of things like this.
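add a JavaScript function to the page that holds your second frame -
<frame id="dataframe" src="frame_a.htm" onload="load()">
let the text box have an id - say "test"
function load()
{
// location.href follows navigation, but is readable only while the frame shows a same-origin page
try { document.getElementById('test').value = document.getElementById('dataframe').contentWindow.location.href; }
// for a cross-origin page the read throws, so fall back to the src attribute (the initially loaded URL only)
catch (e) { document.getElementById('test').value = document.getElementById('dataframe').src; }
}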

Why is top window forbidden from accessing frames inside it? (unless content from same server)

I understand the reason for forbidding iframes from accessing the top window, but the other way around it seems a bit unnecessary and restricting of innovative applications.
It's actually more dangerous to be able to access content in a child window, because the top window is "in control" (i.e., the top window chooses which page to display in the iframe). Technically the threat is the same either way, but it makes it a lot easier for a malicious web site if it can host its own iframes, rather than hope it gets embedded in a target site.
By preventing access to the contents when they're cross-domain, it prevents a whole host of XSRF and XSS attacks. For example, if I was running a malicious web site, I could simply place hidden iframes on my page to dozens of popular sites, whether they be social networking, e-mail, financial, etc. If you were already authenticated against any of them, your browser would send your session cookies along, even within the iframe, and the iframe would serve an authenticated page with secure content.
This is obviously really bad if the parent window can scrape the child window or inject new JavaScript into the child window to be executed.
Because this would allow you to relatively invisibly put a site like paypal.com in an iframe and then change that site, thus deceiving the user (and perhaps capturing the credentials or bank account information entered).
One web site is not allowed to modify the behavior of another site, purely from the web. Modifying the behavior of a site can be done with browser plug-ins or with add-on frameworks like greasemonkey, but the user has to choose to install those capabilities and there's an assumption that they only install capabilities they trust (not always true, but that's what it relies on).
It's potentially even more dangerous for the top level frame to be able to access the embedded frames because the top level frame gets to decide which sites to put in the embedded frames and thus attack/mess with.
It's the same issue as child to parent. You don't want the chance of malicious sites messing with the content of valid sites they just happen to be in the same browser window with.
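For illustration, this is roughly what the restriction looks like from the top window's side. The helper name is made up; the behaviour it relies on (reading the document of a cross-origin frame throws a SecurityError) is what browsers do today.

```javascript
// Try to read the markup of a frame; returns null when the browser blocks it.
function readFrameHtml(frame) {
  try {
    // For a cross-origin frame, touching .document throws a SecurityError.
    return frame.contentWindow.document.documentElement.outerHTML;
  } catch (e) {
    return null;
  }
}

// In a page:
//   readFrameHtml(document.getElementById('someFrame'))
// gives the HTML for a frame on your own site and null for, say, a bank's site.
```

So the top window can decide what to load and where to show it, but it cannot look inside or script the result.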

How to offer a webapp to other sites. (div with javascript, iframe or..?)

I am quite new to web application development and I need to know how would I make other sites use it.
My webapp basically gets a username and returns some data from my DB. This should be visible from other websites.
My options are:
iframe. The website owners embed an iframe and pass the userid in the querystring. I render a webpage with the data, which is shown inside the iframe.
pros: easy to do, working already.
cons: the websites won't know the data returned, and they may like to know it.
javascript & div. They paste a div and some javascript code in their websites and the div content is updated with the data retrieved by the small javascript.
pros: the website would be able to get the data.
cons: I could mess up their website, and I don't know how I would run the javascript code apart from being triggered by a document ready, but I wouldn't like to add jquery libraries to their sites.
There must be better ways to integrate web applications than what I'm thinking. Could someone give me some advice?
Thanks
A page inside an iframe cannot read or modify a hosting page that is on a different domain. If you want to inject content into someone else's page and still be able to interact with that page, you need to include (or append) a JavaScript tag (that points to your code) in the hosting page, then use JavaScript to write your content into the hosting page.
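A sketch of what such an included script could look like without jQuery and with a single global name, so it is less likely to mess up the host site. Everything specific here is an assumption for the example: the container id, the data-userid attribute, the API URL, the field names, and that the API allows cross-origin requests (otherwise JSONP or similar would be needed to get the data).

```javascript
// Everything lives inside one namespace object to avoid clashing with the host page.
var UserDataWidget = (function () {
  // Hypothetical API on the widget provider's server; assumed to send CORS headers.
  var API_URL = 'https://widgets.example/api/user/';

  function render(container, data) {
    // textContent avoids injecting markup into the host page.
    container.textContent = data.name + ' (' + data.points + ' points)';
  }

  function init(doc) {
    var container = doc.getElementById('userdata-widget');
    if (!container) return;
    var userId = container.getAttribute('data-userid');
    fetch(API_URL + encodeURIComponent(userId))
      .then(function (response) { return response.json(); })
      .then(function (data) { render(container, data); });
  }

  return { render: render, init: init };
})();

// Plain DOM replacement for jQuery's document ready.
if (typeof document !== 'undefined') {
  if (document.readyState === 'loading') {
    document.addEventListener('DOMContentLoaded', function () { UserDataWidget.init(document); });
  } else {
    UserDataWidget.init(document);
  }
}
```

The host site would then paste something like a div with id "userdata-widget" and a data-userid attribute, followed by a script tag pointing at this file. Because the script runs in the host page, the host's own code can read the rendered data too.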
Context Framework contains embedded mode support, where page components can be injected into other pages via Javascript. It does depend on jQuery, but jQuery can always be used in noConflict mode. In the current release the embedded pages must be on the same domain so that the same-origin policy is not violated.
In the next release, embedded mode can be extended to use JSONP which enables embedding pages everywhere.
If what you really want is to expose the data, but not the visual content, then I'd consider exposing your data via JSONP. There are caveats to this approach, but it could work for you. There was an answer here a couple of days ago about using a Web Service, but this won't work directly from the client because of the browser's Same Origin policy. It's a shame that the poster of that answer deleted it rather than leave it here as he inadvertently highlighted some of the misconceptions about how browsers access remote content.
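For reference, a minimal JSONP sketch covering both ends. The function names, the callback query parameter and the data shape are assumptions for the example. One of the caveats is visible in the code: the callback name comes from the request and has to be validated, and the calling site has to trust whatever script your server returns.

```javascript
// Server side: wrap the data in a call to the callback named in the query string.
function jsonpBody(callbackName, data) {
  // Only allow plain identifiers so the parameter cannot inject script.
  if (!/^[A-Za-z_$][A-Za-z0-9_$]*$/.test(callbackName)) {
    throw new Error('invalid callback name');
  }
  return callbackName + '(' + JSON.stringify(data) + ');';
}

// Client side (on the other website): load the endpoint with a script tag.
// Script tags are not subject to the same-origin policy, which is the whole trick.
function loadJsonp(url, callbackName, onData) {
  window[callbackName] = function (data) {
    delete window[callbackName];   // clean up the temporary global
    onData(data);
  };
  var script = document.createElement('script');
  script.src = url + (url.indexOf('?') === -1 ? '?' : '&') + 'callback=' + callbackName;
  document.head.appendChild(script);
}

// Usage on the embedding site (hypothetical URL):
//   loadJsonp('https://widgets.example/user?id=42', 'showUser', function (data) { ... });
// The server would answer with the output of jsonpBody('showUser', { ... }),
// served with a JavaScript content type.
```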
