I have an html page that is being accessed via a link that places an external page in the url - e.g.
http://www.mydomain.com/mypage?external-page=encodedURL
It is the responsibility of my page to scrape some data from the URL it is handed.
How can I access the passed-in page using JavaScript/jQuery? I need to be able to pull out the content for certain classes and ids.
Is this a violation of same origin policy? If so, is there some other way to process an external page like this? Seems strange to me that I can hit the web page in a browser or a terminal command and receive the content, but not in a js file.
You can use a browser extension to scrape the external page, then send the data to your site, OR display it within the page, so that it can then be accessed by your page's javascript via the DOM.
You can use a proxy on your domain which fetches the external page and hands it to your javascript whose origin is on your domain, too.
If the external site offers a publicly accessible API, you can use that instead of scraping the page.
You can ask the owner of the external page (or change its code yourself, if you have access to it) to serve pages with the response header Access-Control-Allow-Origin: *
I think this is all you can do.
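As a rough illustration of the proxy option, the page-side code could look something like the sketch below. The `/proxy` endpoint and its `url` parameter are hypothetical (they stand for whatever same-origin proxy you set up on your own domain), and `scrapeViaProxy` only works in a browser, since it relies on `fetch` and `DOMParser`.

```javascript
// Read the `external-page` query parameter from the page's own URL and
// accept it only if it is an absolute http(s) URL. Returns null otherwise.
function getExternalPageUrl(pageHref) {
  // searchParams.get() already percent-decodes the value.
  const raw = new URL(pageHref).searchParams.get('external-page');
  if (!raw) return null;
  try {
    const target = new URL(raw);
    const isHttp = target.protocol === 'http:' || target.protocol === 'https:';
    return isHttp ? target.href : null;
  } catch (err) {
    return null; // not a valid absolute URL
  }
}

// Build the same-origin proxy URL.
// ASSUMPTION: `/proxy?url=...` is a hypothetical endpoint on your own domain.
function buildProxyUrl(targetUrl) {
  return '/proxy?url=' + encodeURIComponent(targetUrl);
}

// Browser-only: fetch the external page through the proxy, parse it into a
// detached document, and return the text of every element matching `selector`.
async function scrapeViaProxy(pageHref, selector) {
  const target = getExternalPageUrl(pageHref);
  if (!target) throw new Error('Missing or invalid external-page parameter');
  const response = await fetch(buildProxyUrl(target));
  if (!response.ok) throw new Error('Proxy returned HTTP ' + response.status);
  const html = await response.text();
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return Array.from(doc.querySelectorAll(selector), (el) => el.textContent.trim());
}
```

In the page you would call something like `scrapeViaProxy(window.location.href, '.some-class, #some-id')`. Because the request goes to your own origin, the same-origin policy does not block it.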
EDIT: It only "seems strange" until you realize the intended difference between a user and a process. The user is not assumed to be malicious, but a process could be. A process could, for example, grab data from a user's logged-in Gmail session if it had access to the external page, and transmit that data to a server. Since the user at the terminal is probably (but not always!) the one who logged in to that session, the user is not treated as malicious. But a script whose origin is some website the user navigates to should not be able to act with the same permissions as that user: the script is an agent as well and can take actions, but it is not created or directed by the user. That's the strongest reason for the isolation of origins and the same-origin policy.
Example
Execution Context of Bookmarklets, and IFrames
If you are injecting JS into every page via a bookmarklet, then that injected code will behave as if it has the same origin as the rest of the page, or at least the "top frame" of that page. It will execute in the same context as the top frame. If there are nested iframes in the page, then you will get an "unsafe attempt to access page x from " error if your bookmarklet tries to inject into them. This is because the bookmarklet has its origin in the top page, and the top page can never access nested iframes on different domains anyway.
So if some part of the site you wish to scrape is in an iframe below the top frame, your bookmarklet will fail to get it.
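To see which frames a bookmarklet can actually reach, something like the following sketch could help. It assumes the browser throws a security error when script touches the `document` of a cross-origin child frame, and it only looks at direct children of the given window. `partitionFrames` is an illustrative name, not part of any API.

```javascript
// Split the direct child frames of `win` into documents the current script
// can read and frames it is locked out of.
function partitionFrames(win) {
  const readable = [];
  let blockedCount = 0;
  for (let i = 0; i < win.frames.length; i++) {
    try {
      // Accessing .document on a cross-origin frame throws a security error.
      readable.push(win.frames[i].document);
    } catch (err) {
      blockedCount++;
    }
  }
  return { readable, blockedCount };
}

// In a bookmarklet you might run:
//   const { readable, blockedCount } = partitionFrames(window);
//   console.log(readable.length + ' readable frame(s), ' + blockedCount + ' blocked');
```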
Transmitting Data using a bookmarklet
If you want to take a url on one page, on your domain, then grab data from that url, on another domain, then display that data back on the same page, you need a way to get the data across. You could use a bookmarklet but the flow would still involve some "user help". It would go something like this:
Load your domain's page, D. User puts a url into an input box. Clicks submit.
Javascript on D opens a new tab/window pointing to the user provided url.
User clicks your scraping bookmarklet on that external page, which collects the desired data, X.
Desired data, X, is sent via Ajax to a "server", S, with session identifier I.
Page D, polls the server S, until it gets notified that some data with session identifier I has been grabbed, then it gets that data and displays it on D.
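The part of this flow that lives on server S is essentially a mailbox keyed by the session identifier I. Below is a minimal in-memory sketch of that logic only. The names `createMailbox`, `deposit` and `collect` are made up for illustration, and a real server would wrap them in HTTP endpoints, expire old entries, and authenticate the session.

```javascript
// In-memory mailbox: the bookmarklet deposits data X under session id I,
// and page D collects it on a later poll.
function createMailbox() {
  const bySession = new Map();
  return {
    // Called when the bookmarklet's Ajax request arrives.
    deposit(sessionId, data) {
      bySession.set(sessionId, data);
    },
    // Called on each poll from page D. Returns null until data has arrived,
    // then hands the data over exactly once.
    collect(sessionId) {
      if (!bySession.has(sessionId)) return null;
      const data = bySession.get(sessionId);
      bySession.delete(sessionId);
      return data;
    },
  };
}
```

Page D would call the endpoint backed by `collect` on a timer and stop polling as soon as it gets a non-null answer.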
This flow needs a server. You can't use local storage to transmit the information, since local storage is specific to a domain. There is an alternative that does not require a server, but it requires making a browser extension.
Transmitting data using a browser extension
The "background page" of the extension is basically the same as a local server for all the browser tabs, it permits transmitting of information across tabs targeted to different domains. The "clients" in this set up are the "content scripts", which are loaded to every page (just like a bookmarklet, except without the requirement for a user to actually click the bookmarklet to load it. It happens automatically). The flow would go like this:
Page D again. User inputs url in input box. Clicks submit -> which triggers some code in the extension.
The extension background page instructs a tab to open and targets it to the url.
A content script loads automatically into that tab, checks with the background what data it should get. It gets that data, and sends it, via a message (a json string) to the background page.
The background page pushes that notification and the data on to the original content script on page D, which displays the information.
Optionally, the background page also transmits the information to your server for saving into that user's datastore.
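The routing done by the background page in this flow could be sketched roughly as follows for Chrome. The message shapes (`scrape-request`, `scrape-result`, `scrape-data`) are invented for this example. The `chrome` object is passed in as a parameter so the routing logic can be exercised outside the browser; in a real extension you would pass the global `chrome`.

```javascript
// Background-page router: opens a tab for each scrape request and forwards
// the result back to the tab that asked for it.
function registerBackgroundRouter(chromeApi) {
  const pending = new Map(); // scraping tab id -> requesting tab id

  chromeApi.runtime.onMessage.addListener((msg, sender) => {
    if (msg.type === 'scrape-request') {
      // Page D's content script asked for a URL to be scraped.
      chromeApi.tabs.create({ url: msg.url, active: false }, (tab) => {
        pending.set(tab.id, sender.tab.id);
      });
    } else if (msg.type === 'scrape-result') {
      // The content script in the scraping tab reports what it found.
      const requesterTabId = pending.get(sender.tab.id);
      if (requesterTabId !== undefined) {
        pending.delete(sender.tab.id);
        chromeApi.tabs.sendMessage(requesterTabId, { type: 'scrape-data', data: msg.data });
      }
    }
  });
}

// In the extension's background script:
//   registerBackgroundRouter(chrome);
```

The content scripts would talk to this router with `chrome.runtime.sendMessage(...)` and receive the forwarded data through their own `chrome.runtime.onMessage` listener.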
The language I use for the browser extension "background page" and "content script" is pretty much focussed on Google Chrome. The same concepts are available in Safari and Firefox as well. If you want to support IE you're going to have to work out something else. IE10 does not plan to even support extensions.
If the external page and your page are on the same domain, then you should be able to access that external page using JavaScript. Otherwise, the JavaScript won't be allowed to access the external site: browsers block this kind of cross-site access.
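Strictly speaking, browsers compare origins (scheme, host and port), not just domains. A quick way to check two URLs, as a small sketch using the standard `URL` class (`sameOrigin` is just an illustrative helper name):

```javascript
// Two URLs share an origin only if scheme, host and port all match.
function sameOrigin(urlA, urlB) {
  return new URL(urlA).origin === new URL(urlB).origin;
}

// sameOrigin('http://www.mydomain.com/mypage', 'http://www.mydomain.com/other')  -> true
// sameOrigin('http://www.mydomain.com/', 'https://www.mydomain.com/')            -> false (scheme differs)
// sameOrigin('http://www.mydomain.com/', 'http://www.mydomain.com:8080/')        -> false (port differs)
// sameOrigin('http://mydomain.com/', 'http://www.mydomain.com/')                 -> false (host differs)
```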
Related
I have a Chrome extension that used to run a background script that would call an API for a website using the users session or cookie.
It'd simply perform a get request and then pull various image URLs from the page using Cheerio.
The owners of the site though have now changed how the pages work. On load, they call a JSON API, and the source of the page uses JavaScript to render the page.
The issue I have now is that when I make a GET request, it simply gets the page source rather than the rendered HTML.
Does anyone know how I can get the rendered HTML? I'd rather not open a tab with the page in chrome, grab the data with a content script and then close it (automatically of course) as there are hundreds of pages to go through, and that's quite intensive on CPU resources
I fear that the answer will be "Impossible due to browser security policy" but I really need to accomplish the following:
The problem I have is that the content to be embedded in my web page includes some menu items that I need to remove/hide because they trigger operations that I need to prevent. I cannot find a way to address these DOM nodes to hide them.
I have a web page and need to embed a URL from another domain into my web page. I have tried this with and also by using Ajax to fetch the URL contents and insert them into the DOM of my web page. These two methods have different results.
If I use to embed the page from the "foreign" domain I can see the "foreign" domain's content and I can address the node but all attempts to access the nodes underneath return null. There is no error message (in Firefox) but I suspect that I am getting null because the browser is enforcing the same-domain policy.
On the other hand, if I use Ajax to insert the page content into my web page I don't even see the content and in this case there is a CORS error in the Firefox debugger console.
Since I don't control the "foreign" domain I can't modify it to use the window.postMessage(); technique.
Can anyone suggest a way for me to be able to hide menu items that are in content fetched from a "foreign" domain? (Gotta be a way, gotta be a way, ...)
Thank you.
I have a site where I save URLs and I want to process and save the entire DOM (in case the site goes down -- I'll still have access to the content).
The current version of my JavaScript bookmarklet (which only saves the URL and page title) submits a series of GET variables to a PHP page. However, this will not work for the entire DOM because of URL length limits (usually ~15,000 characters, it seems).
I think that using POST would allow me to send more information but I believe that the browser will stop it because of XSS (cross site scripting) concerns.
Is there a way to send a large amount of data (15,000char+) from a javascript bookmarklet?
I'm happy to clarify!
Create a form (in an iframe) -> set its values -> submit -> remove the iframe.
The reason for the iframe is so that the page doesn't navigate away when you submit the form.
There won't be any permission issues.
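One way this could look in code is sketched below. It is a variant of the idea above: instead of building the form inside the iframe, the form sits in the page and uses the hidden iframe as its `target`, which also keeps the page from navigating away. The function name and the `document` parameter are choices made for this example.

```javascript
// POST `fields` (an object of name -> string) to `actionUrl` without leaving
// the current page, by submitting a hidden form into a hidden iframe.
// Returns a cleanup function that removes the temporary elements.
function postViaHiddenForm(doc, actionUrl, fields) {
  const frameName = 'bookmarklet-post-' + Date.now();

  const iframe = doc.createElement('iframe');
  iframe.name = frameName;
  iframe.style.display = 'none';
  doc.body.appendChild(iframe);

  const form = doc.createElement('form');
  form.method = 'POST';
  form.action = actionUrl;
  form.target = frameName; // the response loads in the hidden iframe
  form.style.display = 'none';
  for (const [name, value] of Object.entries(fields)) {
    const field = doc.createElement('textarea'); // textarea copes with long, multi-line values
    field.name = name;
    field.value = value;
    form.appendChild(field);
  }
  doc.body.appendChild(form);
  form.submit();

  return function cleanup() {
    form.remove();
    iframe.remove();
  };
}

// Possible use inside a bookmarklet (the URL is a placeholder):
//   const cleanup = postViaHiddenForm(document, 'https://www.mydomain.com/save.php',
//     { url: location.href, html: document.documentElement.outerHTML });
//   setTimeout(cleanup, 10000); // crude: give the request time to finish first
```

Since a POST body is not subject to the URL length limit, this gets around the ~15,000 character ceiling mentioned in the question. The script cannot read the response, because the iframe ends up on another origin.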
I have a requirement where I have to submit same/similar data to 10-15 forms at a time. What I want to do is create a single page where all those forms are loaded, and fill in all known values automatically... The end user simply has to fill in the captchas shown for those 15 forms... Now I want each form's submission response to be loaded into an iframe within the same web page.
After this, I want a simple js to be loaded into each iframe, which reads some data from the parent document, as well as entire content of the response web page, and sends this using XMLHttpRequest to my web application. (The web application will parse through the content of form submission response, and see if the submission is successful or not).
The script that should be loaded into each iframe (within the main window) should read the iframe ID, some divs from the main window, and entire content of that iframe, and send it as a POST request to my web app.
Can such a scenario be implemented using Greasemonkey? Note that initially when the page with iframes is loaded, at that stage the iframes are blank- at this stage no data from iframes should be sent to my web app. Only after user submits all 10 forms, and the iframes are all loaded with respective form submission responses, now the js should send the data within each iframe to my web app.
One more question- currently I plan to use Google Chrome with appropriate runtime parameters to disable the same origin policy...But if the above scenario can be implemented using Greasemonkey script, then will I need to disable Same Origin Policy in Firefox also? Also, there is an extension in Firefox to add CORS enabler to a web page, can I combine that script with the code for above scenario, so that even if an iframe has different domain compared to main window, even then the data of each iframe is submitted?
1- A Greasemonkey script loads on every page and iframe that matches your site filter.
You can stop it from running in the main window with this check:
if(window == window.top) return;
// else do the rest
2- You can access the parent window and its content with window.parent, and access an iframe from the parent with the iframe's .contentWindow property (if they are on the same domain).
I have a task that I do not know where to start on, and I hope Stack Overflowers can give me some ideas.
I want to read the HTML source code of a previously opened, still-open tab from my web page.
My approach was to grab the URL of the targeted page, send that URL to the server, do some processing there, and then use the result in my web page. But I am running into the "same domain policy" on the server side. I know that JSONP can be used, but I must use POST in this case (for other reasons). So I think that if the tab (page) has been opened and is still open, there must be some way to read its HTML when my web page is opened.
The flow would be: Page1 is open, the user opens mywebpage.html in the same window, mywebpage.html detects that Page1 is open, then grabs its HTML source and uses it.
Thanks!
Edit:
This is the full story.
What I am planning to build is a Firefox plugin, with a button (myPluginButton) on the toolbar.
If the user clicks myPluginButton, the HTML code of the current page is sent to the server, the server parses the HTML and generates a report, and a new tab is opened to display this report.
My current approach is to read the HTML of the current page using newTabBrowser.contentDocument and send it to the server, then do the parsing on the server side. But this approach creates extra traffic. It would be more efficient to send only the URL of the current page to the server, and read and parse the HTML on the server side. However, the same domain policy does not allow me to do this easily.
So my question is whether the following is possible: when the user clicks myPluginButton to open a new tab, the new tab loops over all the open tabs in the browser, reads their HTML contents, and generates the report, since these tabs are still open and their HTML must be stored somewhere (or am I wrong?).
Thanks.
Browsers have a built-in protection called the same-origin policy, which prevents a page from reading the content of another origin (a different domain, subdomain, port, ...).
If you want to gain access to the current page you can use a bookmarklet.
You ask your users to add it to their bookmarks bar, and each time they want to use it, they don't open a tab but click on the bookmark.
This will load your script in the page, with all access to read the page content.
And oddly enough you can POST from this page to your domain, by posting a FORM to an IFRAME hosted on your domain. But you won't be able to read the response of the POST. You can use a setInterval with a JSONP call to your domain to know if the POST was successful.
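The JSONP check mentioned above could be sketched like this. It assumes your server answers a request such as `.../status?id=42&callback=NAME` with a script body of the form `NAME({...})`. That endpoint, its parameters and the helper name `jsonpRequest` are all hypothetical. The document and global object are passed in explicitly so the helper does not depend on browser globals.

```javascript
let jsonpCounter = 0;

// Fire one JSONP request: register a one-shot global callback, then load the
// URL as a <script>. Returns the generated callback name.
function jsonpRequest(doc, globalObj, url, onData) {
  const callbackName = '__jsonp_cb_' + (jsonpCounter++);
  const script = doc.createElement('script');

  globalObj[callbackName] = (data) => {
    delete globalObj[callbackName]; // one-shot: tidy up the global
    script.remove();
    onData(data);
  };

  const separator = url.includes('?') ? '&' : '?';
  script.src = url + separator + 'callback=' + encodeURIComponent(callbackName);
  doc.head.appendChild(script);
  return callbackName;
}

// Polling from the bookmarklet might then look like (URL and response shape
// are placeholders):
//   const timer = setInterval(() => {
//     jsonpRequest(document, window, 'https://www.mydomain.com/status?id=42', (res) => {
//       if (res.saved) clearInterval(timer);
//     });
//   }, 2000);
```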