Getting rendered HTML from a page loaded with client-side JavaScript? - javascript

I have a Chrome extension that used to run a background script that would call a website's API using the user's session cookie.
It would simply perform a GET request and then pull various image URLs from the page using Cheerio.
The owners of the site, though, have now changed how the pages work. On load, they call a JSON API, and the page uses JavaScript to render its content.
The issue I have now is that when I make a GET request, it simply returns the page source rather than the rendered HTML.
Does anyone know how I can get the rendered HTML? I'd rather not open a tab with the page in Chrome, grab the data with a content script, and then close it (automatically, of course), as there are hundreds of pages to go through and that's quite intensive on CPU resources.
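For context, the old approach looked roughly like this (the URL, the selector, and the bundling of Cheerio into the extension are illustrative assumptions, not the actual extension code):

// Sketch of the old flow: fetch the raw page with the user's
// cookies and pull image URLs out of the server-rendered markup.
const cheerio = require('cheerio');

fetch('https://example.com/gallery/1', { credentials: 'include' })
  .then(function (res) { return res.text(); })
  .then(function (html) {
    const $ = cheerio.load(html);
    const imageUrls = $('img')
      .map(function (i, el) { return $(el).attr('src'); })
      .get();
    console.log(imageUrls);
  });

This breaks as soon as the markup is rendered client-side, because the fetched HTML no longer contains the img elements.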

Related

Access elements on an external page

I have an HTML page that is accessed via a link that places an external page in the URL, e.g.
http://www.mydomain.com/mypage?external-page=encodedURL
It is the responsibility of my page to scrape some data from the URL it is handed.
How can I access the passed-in page using JavaScript/jQuery? I need to be able to pull out the content for certain classes and IDs.
Is this a violation of the same-origin policy? If so, is there some other way to process an external page like this? It seems strange to me that I can hit the web page in a browser or with a terminal command and receive the content, but not in a JS file.
You can use a browser extension to scrape the external page, then send the data to your site, or display it within the page so that it can then be accessed by your page's JavaScript via the DOM.
You can use a proxy on your domain which fetches the external page and hands it to your JavaScript, whose origin is your domain too.
You can use an API for the external page, if an accessible one exists.
You can ask for, or make yourself (if you have access to the code), a change to the external page so that it serves responses with Access-Control-Allow-Origin: * (a sketch follows this list).
I think this is all you can do.
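For that last option, the change on the external server can be as small as one response header. A minimal sketch, assuming a Node/Express server (the route and payload are made up for illustration):

const express = require('express');
const app = express();

app.get('/data', function (req, res) {
  // Let scripts from any origin read this response
  res.set('Access-Control-Allow-Origin', '*');
  res.send('<div id="content">...</div>');
});

app.listen(3000);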
EDIT: The "seems strange" is until you realize the intended difference between a user, and a process. The user is not thought to be malicious, but a process could be. A process could for example, grab data from a user's logged in gmail session if it had access to the external page, and transmit that data to a server. Since the user on the terminal is probably (but not always !) the one who logged in to that session, the user is not thought to be malicious. But a script whose origin is some website that user navigates to, should not be able to act with the same permissions as that user. Since that script is an agent as well, and can make actions, but it is not created or directed by the user. That's the strongest reason for the isolation of origin's and the same origin policy.
Example
Execution Context of Bookmarklets, and IFrames
If you are injecting JS into every page via a bookmarklet, then that injected code will behave as if it has the same origin as the rest of the page, or at least the "top frame" of that page. It will execute in the same context as the top frame. If there are nested iframes in the page, then you will get an "unsafe attempt to access page x from" error if your bookmarklet tries to inject into them. This is because the bookmarklet has its origin in the top page, and the top page can never access nested iframes on different domains anyway.
So if some part of the site you wish to scrape is in an iframe below the top frame, your bookmarklet will fail to get it.
Transmitting Data using a bookmarklet
If you want to take a URL on one page on your domain, then grab data from that URL on another domain, and then display that data back on the same page, you need a way to get the data across. You could use a bookmarklet, but the flow would still involve some "user help". It would go something like this:
Load your domain's page, D. User puts a url into an input box. Clicks submit.
Javascript on D opens a new tab/window pointing to the user provided url.
User clicks your scraping bookmarklet on that external page, which collects the desired data, X.
Desired data, X, is sent via Ajax to a "server", S, with session identifier I.
Page D polls the server S until it is notified that some data with session identifier I has been grabbed; it then fetches that data and displays it on D.
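A rough sketch of steps 3-5 (the server URL, session identifier, selectors, and element IDs are all illustrative assumptions; the polling side assumes the server answers with an empty JSON object until the data arrives):

// (a) The bookmarklet the user clicks on the external page: grab
// the desired data X and beacon it to server S.
javascript:(function () {
  var payload = {
    session: 'I',                                  // session identifier
    data: document.querySelector('h1').textContent // the desired data X
  };
  // An image request sidesteps cross-origin restrictions on the send
  new Image().src = 'https://server-s.example.com/collect?p=' +
    encodeURIComponent(JSON.stringify(payload));
})();

// (b) On page D (same origin as server S): poll until data with
// session identifier I arrives, then display it.
function poll() {
  $.getJSON('/collected?session=I', function (result) {
    if (result && result.data) {
      $('#output').text(result.data);
    } else {
      setTimeout(poll, 1000);   // nothing yet; try again shortly
    }
  });
}
poll();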
There is the need for a server. You can't use local storage to transmit the information, since it is specific to a domain. There is an alternative that does not require a server: making a browser extension.
Transmitting data using a browser extension
The "background page" of the extension is basically a local server for all the browser tabs; it permits transmitting information across tabs targeted at different domains. The "clients" in this setup are the "content scripts", which are loaded into every page (just like a bookmarklet, except without the requirement that a user actually click the bookmarklet to load it; that happens automatically). The flow would go like this (sketched in code after the steps):
Page D again. The user inputs a URL in the input box and clicks submit, which triggers some code in the extension.
The extension's background page opens a new tab and targets it at the URL.
A content script loads automatically into that tab and checks with the background page what data it should get. It gets that data and sends it, via a message (a JSON string), to the background page.
The background page pushes that notification and the data on to the original content script on page D, which displays the information.
Optionally, the background page also transmits the information to your server for saving into that user's datastore.
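A minimal sketch of that flow using Chrome's extension APIs (the message names, domain check, selector, and element IDs are illustrative; a real extension also needs a manifest declaring the content script and tab permissions):

// background.js -- plays the "local server" role between tabs
var requesterTabId = null;

chrome.runtime.onMessage.addListener(function (msg, sender) {
  if (msg.type === 'scrape-url') {
    // Step 2: page D asked for a URL to be scraped; open it
    requesterTabId = sender.tab.id;
    chrome.tabs.create({ url: msg.url, active: false });
  } else if (msg.type === 'scraped-data') {
    // Step 4: relay the data to the content script on page D
    chrome.tabs.sendMessage(requesterTabId, {
      type: 'display-data',
      data: msg.data
    });
    chrome.tabs.remove(sender.tab.id);   // close the scraping tab
  }
});

// content.js -- injected automatically into every page
if (location.hostname === 'www.mydomain.com') {
  // We are on page D: wire the input box to the extension, and
  // listen for the relayed data.
  document.getElementById('go').addEventListener('click', function () {
    chrome.runtime.sendMessage({
      type: 'scrape-url',
      url: document.getElementById('url').value
    });
  });
  chrome.runtime.onMessage.addListener(function (msg) {
    if (msg.type === 'display-data') {
      document.getElementById('output').textContent = msg.data;
    }
  });
} else {
  // Step 3: we are on the external page; grab the data and send it
  chrome.runtime.sendMessage({
    type: 'scraped-data',
    data: document.querySelector('h1').textContent
  });
}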
The terminology I use for the browser extension ("background page" and "content script") is focused on Google Chrome; the same concepts are available in Safari and Firefox as well. If you want to support IE, you're going to have to work out something else. IE10 is not even planned to support extensions.
If the external page and your page are on the same domain, then you should be able to access that external page using JavaScript. Otherwise, the JavaScript won't be allowed to access the external site; browsers block this under the same-origin policy to prevent cross-site scripting.

AJAX App JavaScript Loading Issue

I am creating a complete AJAX application where there is one base page, and any pages the user navigates to within the application are loaded via AJAX into a content div on the page. On the base page I include the various scripts that are needed for every page within the application (jQuery, jQuery UI, other custom JavaScript files). Then on the various pages within the application I include a script or two for each page that contains the logic needed for just that page. Each of those script files has something that executes on the page-ready event. The problem is that every time the user navigates to page1, the page1.js file is loaded. So, if they visit that page 10 times, that script is loaded ten times into their browser. Looking at the Chrome script developer tools after running around the site, I see tons of duplicated scripts.
I read somewhere about checking to see if the script has already been loaded using a boolean value or storing the loaded scripts in an array. But, the problem with that is that if I see the script is already loaded and I don't load it, the page ready function doesn't get fired for the page's javascript file and everything fails.
Is there an issue with having the JavaScript file loaded over and over when the user visits the same page multiple times?
I did notice, looking at the network traffic, that every time we visit the page, the script is requested with a random number parameter (/Scripts/Page1.js?_=298384892398), which forces a request for the script file every time. I set cache: true in the jQuery ajaxSetup method, and that removed the parameter from the request, so the cached version of the JavaScript file was loaded instead of a separate HTTP request being made for it. But the problem is that I don't want all AJAX requests to be cached, as content changes all the time. Is there a way to force just JavaScript files to be cached but leave all other AJAX requests uncached?
Even when I forced caching on all requests, the JavaScript file still showed up multiple times in the developer tools. Maybe that isn't a big deal, but it doesn't seem quite right.
Any advice on how to handle this situation?
About your first question:
Every time you load a JavaScript file, the entire content gets evaluated by the browser. Whether you can load and execute it multiple times in a row depends solely on the content. I'd not consider it a best practice to do so. ;)
Still, I'd recommend that you find a way to check whether it was already loaded and fire the "page loaded" event manually from the already-present code.
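One hedged way to arrange that, assuming the base page controls the script loading (all names here are illustrative): each page script registers its init function instead of running it on document-ready, and the loader either fetches the script or just re-runs the stored init.

// On the base page: registry of page init functions
var pageInits = {};   // script url -> init function

// Inside page1.js (and every other page script): register instead
// of executing on $(document).ready
pageInits['/Scripts/Page1.js'] = function () {
  // ...everything that used to run in the page-ready handler...
};

// In the base page's navigation code:
function runPageScript(url) {
  if (pageInits[url]) {
    pageInits[url]();          // already loaded: just re-init
  } else {
    $.getScript(url, function () {
      pageInits[url]();        // first load: fetch, then init
    });
  }
}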
For the second question: I'd assume that the script is intended to show up multiple times when you include it multiple times. To give advice on how to handle caching of the loaded JS, I'd need to know how you load the code, how you do AJAX, and your general jQuery setup.
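That said, if jQuery is doing the loading, one way to cache only the script requests while leaving other AJAX requests uncached is a prefilter (this assumes jQuery 1.5+):

// Cache 'script'-type requests only: jQuery then skips appending
// the cache-busting ?_=... parameter for them, while ordinary
// AJAX requests keep their default (uncached) behaviour.
$.ajaxPrefilter('script', function (options) {
  options.cache = true;
});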
After doing some more research, it looks like this is actually just a Chrome issue. When you load a script via AJAX, you can include the following in your code to get it to show up in the Chrome developer tools:
//# sourceURL=some-script-name
The problem is that when you navigate away from the page, the developer tools keep the script around, but it is actually no longer referenced by the page.

Append javascript/html to page when navigating from a different page?

Alright, first off, this is not a malicious question. I have no intention of using any info for ill gains.
I have an application that contains an embedded browser. This browser runs within the application's process, so I can't access it via Selenium WebDriver or anything like that. I know that it's possible to dynamically append scripts and html to loaded web pages via WebDriver, because I've done it.
In the embedded browser, I don't have access to the pages that get loaded. Instead, I can create my own html/javascript pages and execute them, to manipulate the application that houses the browser. I'm having trouble manipulating the existing pages within the browser.
Is there a way to dynamically add javascript to a page when you navigate to it and have it execute right after the page loads?
Something like
page1.navigateToUrl(executeThisScriptOnLoad)
page2 then executes the passed script.
I guess it is not possible to do without knowledge of the destination site. However, you can send data to the site and then use the eval() function to evaluate the sent data on the destination page.
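A hedged sketch of that idea using postMessage (the URL and the sent script are illustrative; it only works if you can get a listener into the destination page, which is the "knowledge of the destination site" caveat above):

// On the destination page: listen for script text sent to us
window.addEventListener('message', function (event) {
  // A real implementation must check event.origin first!
  eval(event.data);                 // run the sent script text
});

// On your page: open the destination, then send it code to run
var win = window.open('https://destination.example.com/page1');
setTimeout(function () {
  win.postMessage('document.body.style.background = "yellow";', '*');
}, 2000);                           // crude wait for the page to load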

Displaying a web page's content on my webpage which has my custom JavaScript

I want to display the content of a webpage (say, Wikipedia) on my web page, which has my custom JavaScript. How should I do that?
I tried to use an iframe for this, but the JavaScript that I have on my page doesn't work inside the iframe, though it does work on the rest of the body.
How should I use the content of a different webpage on my webpage so that I can use my JavaScript on that page?
I want a page like Google Translate, which has my header on top and the content of a webpage on the bottom.
Is it done through an iframe or a content placeholder or... what?
You'll have to fetch the content from your server, build up a page around it (possibly using an <iframe>; that'd certainly be the simplest thing), and then serve it up. There might be all sorts of problems as the page tries to fetch its auxiliary files (CSS, scripts, images), because it may use relative URLs. Depending on what you know about the remote page, you may have to do some surgery on the fetched content before sending it out to the client.
You cannot mess with content fetched from a different domain. That's why it doesn't work when you just include a frame that directly fetches the other content from the client. When you fetch the content from your server, however, the browser will be happy.
Oh, and also note that forms or AJAX code in the fetched content may have problems when running inside your site because, again, they may use relative URLs. Even if they don't, you may have security problems, because there's no way for a user to really log in (unless you proxy that too from your server).
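A hedged sketch of the proxy idea, assuming Node 18+ with Express (the route is illustrative, and the URL "surgery" here is deliberately crude):

const express = require('express');
const app = express();

// e.g. GET /proxy?url=https://en.wikipedia.org/wiki/Example
app.get('/proxy', async function (req, res) {
  const target = req.query.url;
  const response = await fetch(target);
  let html = await response.text();
  // A <base> tag makes the page's relative CSS, script, and image
  // URLs resolve against the original site instead of yours.
  html = html.replace('<head>', '<head><base href="' + target + '">');
  res.send(html);
});

app.listen(3000);

Because the content now comes from your own origin, your page's JavaScript may read and manipulate it freely.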

Does preloading content from a page skew my google analytics stats?

I'd like to write myself a simple script that uses AJAX to load the content from each page on my main navbar into a hidden div on the current page.
This is just so that I can preload as much of my important content as possible and get it cached on the user's computer (hopefully) before they've finished with the current page and want to move on.
I'm concerned that doing a request for every page on the site, every time someone visits, will really ruin the validity of my Google Analytics stats.
How does AJAX interact with Google Analytics? Does it count as a "page visit"?
If you retrieve each page without running the embedded script, then the Google Analytics code would not be run and it should not count as a page view. I suggest not doing anything with the code after retrieving each page (i.e. not inserting the content into a hidden div).
If you want to ajaxify your site by removing pages and replacing them with AJAX requests, then all you need to do on the GA side of things is call _trackPageview whenever a page view should be tracked.
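A minimal sketch, assuming the classic async GA snippet (_gaq) and an illustrative #content container: jQuery's .load() with a fragment selector strips embedded scripts from the fetched page, so the remote GA code never runs and you report the virtual pageview yourself.

function loadPage(url) {
  $('#content').load(url + ' #content', function () {
    _gaq.push(['_trackPageview', url]);   // count it as a pageview
  });
}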
