How to access page sources from chrome devtools API


What is the easiest way to access all the content of the Sources tab through the Chrome DevTools API?
I am writing a small program using nightmarejs to scrape some webpages, and I need to analyze both the rendered HTML and the original source.
Nightmarejs doesn't provide an API call to get the source of the page. I am thinking about using the DevTools API, but it is not clear to me how to do so. Since I can see many files in the Sources tab of Chrome DevTools, I thought I could get this content easily.
For now, I have a few leads:
The chrome.devtools.network API.
There is a snippet in the documentation:
chrome.devtools.network.onRequestFinished.addListener(
  function(request) {
    if (request.response.bodySize > 40*1024) {
      chrome.devtools.inspectedWindow.eval(
          'console.log("Large image: " + unescape("' +
          escape(request.request.url) + '"))');
    }
  });
I think I could use a listener like this and get the body if it is available in the response. But I cannot find the documentation for the response content. Also, I don't want to store the result of all the requests.
But my main problem is that I don't see the content of the request here. I tried calling request.getContent(), which returned null.
The chrome.debugger API.
I didn't have time to play with it yet.
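A note on the first lead: in the chrome.devtools.network API, getContent() passes the body to a callback instead of returning it, which may explain the null. The following is a minimal sketch, assuming it runs in a DevTools extension page. captureContent, isSourceLike, and sources are hypothetical names, and the MIME-type filter is only one possible way to avoid storing every response.

```javascript
// Hypothetical helper: keep the body of a finished request only when
// shouldKeep(request) is true. getContent() delivers the body through a
// callback; its return value is not the content.
function captureContent(request, shouldKeep, store) {
  if (!shouldKeep(request)) {
    return;
  }
  request.getContent((content, encoding) => {
    // encoding is "base64" for encoded (binary) bodies and empty otherwise.
    store.set(request.request.url, { content, encoding });
  });
}

// Example filter: HTML and JavaScript responses only (an assumption about
// which files from the Sources tab are of interest).
function isSourceLike(request) {
  const mimeType = request.response.content.mimeType || "";
  return /html|javascript/.test(mimeType);
}

const sources = new Map();

// The chrome.devtools namespace exists only in a DevTools extension page.
if (typeof chrome !== "undefined" && chrome.devtools) {
  chrome.devtools.network.onRequestFinished.addListener((request) => {
    captureContent(request, isSourceLike, sources);
  });
}
```

Only requests that finish while DevTools is open are reported to the listener, so the page usually has to be reloaded after the panel is opened.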

Related

Manifest V3 web extension overwrite response body

How would I overwrite the response body for an image with a dynamic value in a Manifest V3 Chrome extension?
This overwrite would happen in the background, as per the Firefox example, (see below) meaning no attaching debuggers or requiring users to press a button every time the page loads to modify the response.
I'm creating a web extension that stores an image in the extension's IndexedDB storage and then overrides the response body with that image on requests to a certain image URL. I have it working in a Manifest V2 extension in Firefox via the browser.webRequest.onBeforeRequest API with the following code, but browser.webRequest and MV2 are deprecated in Chrome. In MV3, browser.webRequest was replaced with browser.declarativeNetRequest, which doesn't have the same level of access: you can only redirect and modify headers, not the body.
Firefox-compatible example:
browser.webRequest.onBeforeRequest.addListener(
  (details) => {
    const request = browser.webRequest.filterResponseData(details.requestId);
    request.onstart = async () => {
      request.write(racetrack);
      request.disconnect();
    };
  },
  {
    urls: ['https://www.example.com/image.png'],
  },
  ['requestBody', 'blocking']
);
The Firefox solution is the only one that worked for me, although it is exclusive to Firefox. I attempted to write a proof-of-concept userscript with xhook to modify the content of a DOM image element, but it didn't return the modified image as expected. Previously, I tried using a redirect to a data URI and to an external image, but while the redirect worked fine, the website threw an error that it couldn't load the required resources.
I'm guessing I will have to write a content script that injects a Service Worker (unexplored territory for me) into the page, and create a page rule that redirects, say, /extension-injected-sw.js to a web-available script. I'm not sure how to pull that off, whether the service worker could still communicate with the extension, or whether it would work at all. Is there a better way to do this that I'm overlooking?
Thank you for your time!

Getting initiator of XXX-xsrfstatemanager.js file using Chrome Developer Tools

In order to triage a problem with a web browser I am trying to determine the initiator of the XXX-xsrfstatemanager.js file (the XXX part seems to be something dynamic like a nonce) that occurs as part of a Google Authentication flow (using OAuth).
When I use Chrome developer tools, it says the below URL is the initiator:
https://accounts.google.com/o/oauth2/v2/auth?approval_state=%21Ch[REDACTED]Q%E2%88%99AJ[REDACTED]xq&as=-aBk[REDACTED]
Looking at the result of the above page, I see a lot of JavaScript, but the string "xsrfstatemanager" is nowhere to be found, nor do I see any other JavaScript files being included. Unless there is some really cryptic code that is somehow building this URL, the call is actually coming from some other page.
Does anyone know how I can get the 'real' initiator? Or if the above URL might be correct, if I can get more information like what exact line number of the file initiated the call?
By the way, while I edited the above URL for security reasons, if you go to (for example) www.quora.com and click "Continue with Google", it is easy to see the flow in question.

The flow includes a redirection, which is why you cannot see the source code that initiates/references that script.
If you view the source of the original URL that is opened when you click on "Continue with Google", you will see the <script src> that references it. This works in Chrome and probably Safari:
view-source:https://accounts.google.com/o/oauth2/auth?redirect_uri=storagerelay%3A%2F%2Fhttps%2Fwww.quora.com%3Fid%3Dauth488109&response_type=code%20permission%20id_token&scope=email%20profile%20openid&openid.realm=&client_id=917071888555.apps.googleusercontent.com&ss_domain=https%3A%2F%2Fwww.quora.com&access_type=offline&include_granted_scopes=true&prompt=select_account&origin=https%3A%2F%2Fwww.quora.com&gsiwebsdk=2
From the source code:
<script src='https://ssl.gstatic.com/accounts/o/532969778-xsrfstatemanager.js' nonce="IgiKmQiLZIHDwGvce7/q6Q"></script>
You can also use a tool like Fiddler to see the source code of the redirect, check "Preserve log" in the Network panel of Chrome's Developer Tools, or open the original URL with JavaScript disabled.
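If the source of the original URL has been saved as text (for example, copied from the view-source page), the referencing tag can also be located with a few lines of JavaScript. This is only a sketch: findScriptSources is a hypothetical helper, and a regular expression is a rough substitute for a real HTML parser.

```javascript
// Return the src of every <script> tag whose URL contains `needle`.
function findScriptSources(html, needle) {
  // Captures the quoted value of the src attribute of each <script> tag.
  const pattern = /<script\b[^>]*?\bsrc\s*=\s*(['"])(.*?)\1/gi;
  const matches = [];
  for (const m of html.matchAll(pattern)) {
    if (m[2].includes(needle)) {
      matches.push(m[2]);
    }
  }
  return matches;
}

// The tag quoted above, used as sample input.
const sample = `<script src='https://ssl.gstatic.com/accounts/o/532969778-xsrfstatemanager.js' nonce="IgiKmQiLZIHDwGvce7/q6Q"></script>`;
console.log(findScriptSources(sample, "xsrfstatemanager"));
```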

Is there an alternative to preprocessorScript for Chrome DevTools extensions?

I want to create a custom profiler for Javascript as a Chrome DevTools Extension. To do so, I'd have to instrument all Javascript code of a website (parse to AST, inject hooks, generate new source). This should've been easily possible using chrome.devtools.inspectedWindow.reload() and its parameter preprocessorScript described here: https://developer.chrome.com/extensions/devtools_inspectedWindow.
Unfortunately, this feature has been removed (https://bugs.chromium.org/p/chromium/issues/detail?id=438626) because nobody was using it.
Do you know of any other way I could achieve the same thing with a Chrome Extension? Is there any other way I can replace an incoming Javascript source with a changed version? This question is very specific to Chrome Extensions (and maybe extensions to other browsers), I'm asking this as a last resort before going a different route (e.g. dedicated app).

Use the Chrome Debugging Protocol.
First, use DOMDebugger.setInstrumentationBreakpoint with eventName: "scriptFirstStatement" as a parameter to add a break-point to the first statement of each script.
Second, in the Debugger Domain, there is an event called scriptParsed. Listen to it and if called, use Debugger.setScriptSource to change the source.
Finally, call Debugger.resume each time after you edited a source file with setScriptSource.
Example in semi-pseudo-code:
// Prevent code being executed
cdp.sendCommand("DOMDebugger.setInstrumentationBreakpoint", {
  eventName: "scriptFirstStatement"
});
// Enable Debugger domain to receive its events
cdp.sendCommand("Debugger.enable");
cdp.addListener("message", (event, method, params) => {
  // Script is ready to be edited
  if (method === "Debugger.scriptParsed") {
    cdp.sendCommand("Debugger.setScriptSource", {
      scriptId: params.scriptId,
      scriptSource: `console.log("edited script ${params.url}");`
    }, (err, msg) => {
      // After editing, resume code execution.
      cdp.sendCommand("Debugger.resume");
    });
  }
});
The implementation above is not ideal. It should probably listen to the breakpoint event, find the script using the associated event data, edit the script, and then resume. Listening to scriptParsed and resuming the debugger from that handler are two things that shouldn't be coupled, because the resume is not tied to the pause it is meant to end, which could create problems. It makes for a simpler example, though.
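The variant described in the previous paragraph, which reacts to the pause itself rather than to scriptParsed, could look roughly like this. It is a sketch under assumptions: send stands for any function that forwards a command to the Chrome Debugging Protocol and calls back with the result, transform is a hypothetical function that rewrites the source, and the script is identified from the top call frame of the Debugger.paused event. Whether Chrome accepts setScriptSource for a script paused at its first statement may depend on the Chrome version.

```javascript
// Build an event handler that edits a script when the debugger pauses on it.
// send(method, params, callback) is an assumed transport function.
function makePausedHandler(send, transform) {
  return function onEvent(method, params) {
    if (method !== "Debugger.paused") {
      return;
    }
    // The paused script is the one the top call frame belongs to.
    const scriptId = params.callFrames[0].location.scriptId;
    send("Debugger.getScriptSource", { scriptId }, (result) => {
      const scriptSource = transform(result.scriptSource);
      send("Debugger.setScriptSource", { scriptId, scriptSource }, () => {
        // Resume only after the edit has been acknowledged.
        send("Debugger.resume", {}, () => {});
      });
    });
  };
}

// Possible wiring in an extension with the "debugger" permission.
// Not called here; tabId has to come from the extension.
function instrumentTab(tabId, transform) {
  const target = { tabId };
  const send = (method, params, callback) =>
    chrome.debugger.sendCommand(target, method, params, callback);
  const handler = makePausedHandler(send, transform);
  chrome.debugger.onEvent.addListener((source, method, params) => {
    if (source.tabId === tabId) {
      handler(method, params);
    }
  });
  chrome.debugger.attach(target, "1.3", () => {
    send("Debugger.enable", {}, () => {});
    send("DOMDebugger.setInstrumentationBreakpoint",
      { eventName: "scriptFirstStatement" }, () => {});
  });
}
```

Passing the transport in as a parameter keeps the handler independent of chrome.debugger, so the same logic can be exercised with a fake send function outside the browser.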

On HTTP you can use the chrome.webRequest API to redirect requests for JS code to data URLs containing the processed JavaScript code.
However, this won't work for inline script tags. It also won't work on HTTPS, since the data URLs are considered unsafe. And data URLs can't be longer than 2MB in Chrome, so you won't be able to redirect to large JS files.
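As a rough illustration of the redirect target, the processed code can be packed into a data URL and checked against the size limit before it is used. toDataUrl and MAX_DATA_URL_LENGTH are hypothetical names, and the limit is simply the 2MB figure from this answer, not a value read from Chrome.

```javascript
// Size limit taken from the answer text (2MB); treat it as an assumption.
const MAX_DATA_URL_LENGTH = 2 * 1024 * 1024;

// Pack JavaScript source into a data URL, or return null if the result is
// too long to be used as a redirect target.
function toDataUrl(source) {
  const url = "data:text/javascript;charset=utf-8," + encodeURIComponent(source);
  return url.length <= MAX_DATA_URL_LENGTH ? url : null;
}

console.log(toDataUrl('console.log("instrumented");'));
```

In a blocking chrome.webRequest.onBeforeRequest listener, a non-null result would be returned as the redirectUrl field of the listener's return value.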
If the exact order of execution of each script isn't important you could cancel the script requests and then later send a message with the script content to the page. This would make it work on HTTPS.
To address both issues you could redirect the HTML page itself to a data URL, in order to gain more control. That has a few negative consequences though:
1. Can't reload the page because the URL is fixed to the data URL
2. Need to add or update the <base> tag to make sure stylesheet/image URLs go to the correct URL
3. Breaks ajax requests that require cookies/authentication (not sure if this can be fixed)
4. No support for localStorage on data URLs
Not sure if this works: in order to fix #1 and #4 you could consider setting up an HTML page within your Chrome extension and then using that as the base page instead of a data URL.
Another idea that may or may not work: Use chrome.debugger to modify the source code.

Can I get the address bar url from the javascript console when the page has failed to load?

Just say I typed in a bad hostname in the address bar.
For example, say I wasn't running a local web server, and I load:
http://localhost/callback_url
In Chrome, this will give me a "This webpage is not available" message.
Is there any way I can find out what the URL in the address bar is from the JavaScript console, even though the page failed to load?
I know I can normally use window.location.href to get this, but that returns "data:text/html,chromewebdata" in this instance.
So in this example, I'd like to know if there's some JavaScript that returns http://localhost/callback_url
EDIT: The main reason I'd like to do this is so I know if server side redirect failed when using ChromeDriver with Selenium. So I'd prefer to avoid using extensions if possible, and am open to Chrome and ChromeDriver specific solutions if applicable! The callback_url may have extra info in it, added by the server, and I'd like to see what this info is. I'd like to avoid running another server to get this data if possible.

The loadTimeData object included in the ERR_CONNECTION_REFUSED page has the failed URL:
> loadTimeData.data_.summary.failedUrl
"http://localhost/foo?request_url=bar"

You can get it from the title of the page.
By typing document.title and doing some regex you can get the URL.
Another way I found is by using the following
var data = loadTimeData.createJsEvalContext();
console.log(data.a.$top.summary.failedUrl);
If you open the developer tools and search for a part of the URL in source code, you will see that Chrome creates the loadTimeData in the "not available page".

Is there a way to mitigate downloading of resources (images/css and js files) with Javascript?

I have a html page on my localhost - get_description.html.
The snippet below is part of the code:
<input type="text" id="url"/>
<button id="get_description_button">Get description</button>
<iframe id="description_container" src="#"/>
When the button is clicked, the src of the iframe is set to the URL entered in the textbox. The pages fetched this way are very big, with lots of linked files. What I am interested in on the page is a block of text contained in a <div id="description"> element.
Is there a way to mitigate downloading of resources linked in the page that loads into the iframe?
I don't want to use curl because the data is only available to logged-in users and the steps needed to get the content with curl are too complicated. The iframe is simple because I use it on a box which sends the right cookies to identify the request as coming from a logged-in user, but it is very wasteful to fetch nearly 1 MB of data, keep 1 KB of it, and throw out the rest.
Edit
If the proposed method just works in Firefox it is fine, so I added Firefox tag. Also, it is possible that the answer actually is from the realm of Firefox add-on techniques, so I added that tag as well.
The problem is not that I cannot get at what I'm looking for, rather, the problem is the easy iframe method is wasteful.
I know that Firefox does allow loading only the text of a page. If you open a page and press Ctrl+U, you are taken to the 'view page source' window. There, links behave as normal and are clickable: if you click on a link in source view, the source of the new page is loaded into the view source window without the linked resources being downloaded, which is exactly what I'm trying to get. But I don't know how to access this behaviour.
Another example is the Adblock add-on. It somehow kills elements before they get loaded. With plain JavaScript this is not possible, because it is triggered too late to intervene in time.

The Same Origin Policy forbids a web page from accessing the contents of any web page in a different domain, so basically you cannot do that.
However, it seems that some browsers allow access to web page content if you are trying to access it from a local web page, which seems to be your case.
Safari and IE 6/7/8 are browsers that allow a local web page to do so via XMLHttpRequest (source: Google Browser Security Handbook), so you may want to use one of those browsers to do what you need (note that future versions of those browsers may no longer allow this).
Apart from this solution I only see two possibilities:
If the web pages you need to fetch content from are somehow controlled by you, you can create a simpler interface to let other web pages get the content you need (for example by allowing JSONP requests).
If the web pages you need to fetch content from are not controlled by you, the only solution I see is to fetch the content server side, logging in from the server directly (I know that you don't want to do so, but I don't see any other possibility if the options I mentioned above are not practicable).
Hope it helps.

Actually I've seen Cross Domain jQuery .load request before, here: http://james.padolsey.com/javascript/cross-domain-requests-with-jquery/
The author claims that code like the following, found on that page,
$('#container').load('http://google.com'); // SERIOUSLY!
$.ajax({
  url: 'http://news.bbc.co.uk',
  type: 'GET',
  success: function(res) {
    var headline = $(res.responseText).find('a.tsh').text();
    alert(headline);
  }
});
// Works with $.get too!
would work. (The BBC code might not work because of the recent redesign, but you get the idea.)
Apparently it is using YQL wrapped into a jQuery plugin to do the trick. Now I cannot say I fully understand what he is doing there but it appears to work, and fits the bill. Once you load the data I suppose it is a simple matter of filtering out the data that you need.
If you prefer something that works at the browser level, may I suggest Mozilla's Jetpack framework for lightweight extensions. I've not yet read the documentation in its entirety, but it should contain the APIs needed for this to work.

There are various ways to go about this in AJAX, I'm going to show the jQuery way for brevity as one option, though you could do this in vanilla JavaScript as well.
Instead of an <iframe> you can just use a container, let's say a <div> like this:
<div id="description_container"></div>
Then to load it:
$(function() {
  $("#get_description_button").click(function() {
    $("#description_container").load($("input").val() + " #description");
  });
});
This uses the .load() method, which takes a string in this format: .load("url selector"). It then takes that element from the fetched page and places its content inside the container you're loading into, in this case #description_container.
This is just the jQuery route, mainly to illustrate that yes, you can do what you want. You don't have to do it exactly like this; the concept is to get what you want from an AJAX request rather than from an <iframe>.

Your description sounds like you are fetching pages from the same domain (you said that you need to be logged in and have session credentials), so have you tried an async request via XMLHttpRequest? It might complain if the HTML on a page is particularly messed up, but you should still be able to get the raw text via .responseText and extract what you need with a regex.
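A minimal sketch of that idea, assuming the page is on the same origin so the session cookies are sent automatically. extractDescription and loadDescription are hypothetical helpers, and the regular expression stops at the first closing </div>, so it only works if the description block contains no nested <div> elements.

```javascript
// Pull the inner HTML of <div id="description"> out of raw page text.
function extractDescription(html) {
  const pattern = /<div\b[^>]*\bid\s*=\s*["']description["'][^>]*>([\s\S]*?)<\/div>/i;
  const match = pattern.exec(html);
  return match ? match[1].trim() : null;
}

// Fetch the page as text only; linked images, CSS and scripts are not loaded
// because the response is never rendered.
function loadDescription(url, onDone) {
  const xhr = new XMLHttpRequest();
  xhr.open("GET", url, true);
  xhr.onload = function () {
    onDone(extractDescription(xhr.responseText));
  };
  xhr.send();
}
```

In the page from the question, loadDescription would be called from the button's click handler with the value of the #url input, and the callback would write the result into a plain container element instead of the iframe.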
