Is it possible to use JavaScript to scrape all the changes to a webpage that is being updated live with AJAX? The site I wish to scrape updates data using AJAX every second and I want to grab all the changes. This is an auction website and several objects can change whenever a user places a bid. When a bid is placed, the following change:
The current Bid Price
The current high bidder
The auction timer has time added back to it
I wish to grab this data using a Chrome extension built on JavaScript. Is there an AJAX listener for JavaScript that can accomplish this? A toolkit? I need some direction. Can JavaScript accomplish this?
I'm going to show two ways of solving the problem. Whichever method you pick, don't forget to read the bottom of my answer!
First, I present a simple method which only works if the page uses jQuery. The second method looks slightly more complex, but will also work on pages without jQuery.
The following examples show how you can implement filters based on method (e.g. POST/GET) and URL, and how to read (POST) data and response bodies.
Use a global ajax event in jQuery
More information about the jQuery method can be found in the documentation of .ajaxSuccess.
Usage:
// .ajaxSuccess is an instance method; global AJAX events are bound on document
jQuery(document).ajaxSuccess(function(event, xhr, ajaxOptions) {
/* Method */ ajaxOptions.type
/* URL */ ajaxOptions.url
/* Response body */ xhr.responseText
/* Request body */ ajaxOptions.data
});
Pure JavaScript way
When the website does not use jQuery for its AJAX requests, you have to modify the built-in XMLHttpRequest methods (open and send). This requires more code:
(function() {
var XHR = XMLHttpRequest.prototype;
// Remember references to original methods
var open = XHR.open;
var send = XHR.send;
// Overwrite native methods
// Collect data:
XHR.open = function(method, url) {
this._method = method;
this._url = url;
return open.apply(this, arguments);
};
// Implement "ajaxSuccess" functionality
XHR.send = function(postData) {
this.addEventListener('load', function() {
/* Method */ this._method
/* URL */ this._url
/* Response body */ this.responseText
/* Request body */ postData
});
return send.apply(this, arguments);
};
})();
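As a rough sketch of what the body of the load handler could do with those four values, the helper below filters by URL and parses a JSON response. The name toBidRecord, the "/bid" URL pattern and the price/bidder fields are all assumptions for illustration; the real auction site will have its own URLs and payload format.

```javascript
// Hypothetical helper, not part of any library.
// Returns a plain record for "interesting" requests, or null otherwise.
function toBidRecord(method, url, responseText) {
  // Assumption: bid updates are served from URLs containing "/bid".
  if (String(url).indexOf('/bid') === -1) return null;
  var payload;
  try {
    // Assumption: the response body is JSON.
    payload = JSON.parse(responseText);
  } catch (e) {
    return null; // Not JSON, ignore it
  }
  if (payload === null || typeof payload !== 'object') return null;
  return {
    method: String(method).toUpperCase(),
    url: String(url),
    price: payload.price,   // assumed field name
    bidder: payload.bidder  // assumed field name
  };
}
```

Inside the load listener this would be called as toBidRecord(this._method, this._url, this.responseText). The String() calls are there because the url argument passed to open is not guaranteed to be a string.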
Getting it to work in a Chrome extension
The previously shown code has to be run in the context of the page (in your case, an auction page). For this reason, a content script has to be used which injects (!) the script. Using this is not difficult; I refer to this answer for a detailed explanation plus examples of usage: Building a Chrome Extension - Inject code in a page using a Content script.
A general method
You can read the request body, request headers and response headers with the chrome.webRequest API. The headers can also be modified. It's however not (yet) possible to read, let alone modify the response body of a request. If you want this feature, star https://code.google.com/p/chromium/issues/detail?id=104058.
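A minimal sketch of reading the request body through chrome.webRequest follows. The helper name readRequestBody is made up for this example, and it assumes that raw upload bytes are UTF-8 text, which will not hold for every site. The listener registration is left as a comment because it only works inside an extension's background page with the "webRequest" permission and matching host permissions.

```javascript
// Hypothetical helper: turn the requestBody of a webRequest "details"
// object into something printable.
function readRequestBody(details) {
  var body = details.requestBody;
  if (!body) return null;                 // No body (e.g. a plain GET)
  if (body.formData) return body.formData; // Parsed form fields
  if (body.raw) {
    // raw is a list of upload parts; parts with "bytes" hold an ArrayBuffer.
    var decoder = new TextDecoder('utf-8'); // Assumption: body is UTF-8 text
    return body.raw.map(function (part) {
      return part.bytes ? decoder.decode(part.bytes) : '';
    }).join('');
  }
  return null;
}

// Usage in a background page:
// chrome.webRequest.onBeforeRequest.addListener(function (details) {
//   console.log(details.method, details.url, readRequestBody(details));
// }, {urls: ['*://*/*']}, ['requestBody']);
```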
Related
I'm trying to write a web extension that stops the requests from a url list provided locally, fetches the URL's response, analyzes it in a certain way and based on the analysis results, blocks or doesn't block the request.
Is that even possible?
The browser doesn't matter.
If it's possible, could you provide some examples?
I tried doing it with Chrome extensions, but it seems like it's not possible.
I heard it's possible on mozilla though
I think that this is only possible using the old webRequestBlocking API which Chrome is removing as part of Manifest V3. Fortunately, Firefox is planning to continue supporting blocking web requests even as it transitions to Manifest V3 (read more here).
In terms of implementation, I would highly recommend referring to the MDN documentation for webRequest, in particular their section on modifying responses and their documentation for the filterResponseData method.
Mozilla have also provided a great example project that demonstrates how to achieve something very close to what I think you want to do.
Below I've modified their background.js code slightly so it is a little closer to what you want to do:
function listener(details) {
if (mySpecialUrls.indexOf(details.url) === -1) {
// Ignore this url, it's not on our list.
return {};
}
let filter = browser.webRequest.filterResponseData(details.requestId);
let decoder = new TextDecoder("utf-8");
let encoder = new TextEncoder();
filter.ondata = event => {
let str = decoder.decode(event.data, {stream: true});
// Just change any instance of Example in the HTTP response
// to WebExtension Example.
str = str.replace(/Example/g, 'WebExtension Example');
filter.write(encoder.encode(str));
filter.disconnect();
}
// This is a BlockingResponse object, you can set parameters here to e.g. cancel the request if you want to.
// See: https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest/BlockingResponse#type
return {};
}
browser.webRequest.onBeforeRequest.addListener(
listener,
// 'main_frame' means this will only affect requests for the main frame of the browser (e.g. the HTML for a page rather than the images, CSS, etc. that are loaded afterwards). You might want to look into whether you want to expand this.
{urls: ["*://*/*"], types: ["main_frame"]},
["blocking"]
);
Correction:
The above example only works properly if the response data fits in one chunk. If it is larger (and you still want to inspect the entirety of the response data), you would need to put all of the data into a buffer, and then work on it once all data has been received. See the document here for more information: https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest/StreamFilter/ondata#webextension_examples (the code section titled "This example combines all buffers into a single buffer" would be of most interest to you I think).
In terms of using this API to block responses, data is only returned from this URL if you call filter.write(), so if you don't like the response, you can simply not call it (just call filter.close()) and an empty response will be returned. You can also only return part of the full response body by filter.write()ing only the bits that you want to return.
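To make the buffering idea concrete, here is a rough sketch that collects every chunk, decodes the whole body once the stream ends, and only then decides what to write. The function name bufferWholeResponse and the "return null to block" convention are my own assumptions, not part of the webRequest API. It only relies on the ondata, onstop, write and close members of the filter object.

```javascript
// Sketch: buffer the complete response before inspecting it.
// "filter" is the object returned by browser.webRequest.filterResponseData().
// "transform" receives the full body text and returns the text to send on,
// or null to send an empty response (assumed convention).
function bufferWholeResponse(filter, transform) {
  var chunks = [];
  filter.ondata = function (event) {
    // Copy each chunk; nothing is written yet.
    chunks.push(new Uint8Array(event.data));
  };
  filter.onstop = function () {
    // Combine all chunks into a single buffer.
    var total = chunks.reduce(function (n, c) { return n + c.length; }, 0);
    var all = new Uint8Array(total);
    var offset = 0;
    chunks.forEach(function (c) {
      all.set(c, offset);
      offset += c.length;
    });
    var text = new TextDecoder('utf-8').decode(all);
    var result = transform(text);
    if (result !== null) {
      filter.write(new TextEncoder().encode(result));
    }
    filter.close(); // Without a write(), the response body is empty
  };
}
```

In the listener above, this would replace the ondata handler, e.g. bufferWholeResponse(filter, function (text) { return text.replace(/Example/g, 'WebExtension Example'); });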
Disclaimer
Firstly, a disclaimer: I am working within specific boundaries, so whilst it may seem I'm going about something the long way round, I am limited as to what I can do. I know I should be doing this entirely differently, but I cannot. If it's not possible to do what I'm trying to do here, then that's fine, I just need to know.
Background
Basically, this boils down to a cross-domain javascript call. However, I need to wait for the response before returning the method.
Say I have a page - example1.com/host.html. This contains a javascript method of 'ProvideValue()' which returns an int. Edit: This method must be executed where it is found, since it may need to access other resources within that domain, and access global variables set for the current session.
https://example1.com/host.html
function ProvideValue(){
return 8; // In reality, this will be a process that returns a value
}
This host.html page contains an iframe pointing to example2.com/content.html (note the different domain). This content.html page contains a method that needs to display the value from host.html in an alert.
https://example2.com/content.html
function DisplayValue(){
var hostValue = //[get value from ProvideValue() in host.html]
alert(hostValue);
}
That's it.
Limitations
I can run any javascript I like on the host.html, but nothing server-side. On content.html I can run javascript and anything server-side. I have no control over the example1.com domain, but full control over example2.com.
Question
How can I retrieve the value from ProvideValue() on example1.com/host.html within the DisplayValue() method on example2.com/content.html?
Previous Attempts
Now, I've tried many of the cross-domain techniques, but all of them (that I've found) use an asynchronous callback. That won't work in this case, because I need to make the request to the host.html, and receive the value back, all within the scope of a single method on the content.html.
The only solution I got working involved relying on asynchronous cross-domain scripting (using easyXDM), and a server-side list of requests/responses in example2.com. The DisplayValue() method made the request to host.html, then immediately made a synchronous post to the server. The server would then wait until it got notified of the response from the cross-domain callback. Whilst waiting, the callback would make another call to the server to store the response. It worked fine in Firefox and IE, but Chrome wouldn't execute the callback until DisplayValue() completed. If there is no way to address my initial question, and this option has promise, then I will pose this as a new question, but I don't want to clutter this question with multiple topics.
Use XMLHttpRequest with CORS to make synchronous cross-domain requests.
If the server doesn't support CORS, use a proxy which adds the appropriate CORS headers, e.g. https://cors-anywhere.herokuapp.com/ (source code at https://github.com/Rob--W/cors-anywhere).
Example 1: Using synchronous XHR with CORS
function getProvidedValue() {
var url = 'http://example.com/';
var xhr = new XMLHttpRequest();
// third param = false = synchronous request
xhr.open('GET', 'https://cors-anywhere.herokuapp.com/' + url, false);
xhr.send();
var result = xhr.responseText;
// do something with response (text manipulation, *whatever*)
return result;
}
Example 2: Use postMessage
If it's important to calculate the values on the fly with session data, use postMessage to continuously update the state:
Top-level document (host.html):
<script src="host.js"></script>
<iframe name="content" src="https://other.example.com/content.html"></iframe>
host.js
(function() {
var cache = {
providedValue: null,
otherValue: ''
};
function sendUpdate() {
if (frames.content) { // "content" is the name of the iframe
frames.content.postMessage(cache, 'https://other.example.com');
}
}
function recalc() {
// Update values
cache.providedValue = provideValue();
cache.otherValue = getOtherValue();
// Send (updated) values to frame
sendUpdate();
}
// Listen for changes using events, pollers, WHATEVER
yourAPI.on('change', recalc);
window.addEventListener('message', function(event) {
if (event.origin !== 'https://other.example.com') return;
if (event.data === 'requestUpdate') sendUpdate();
});
})();
A script in content.html: content.js
var data = {}; // Global
var parentOrigin = 'https://host.example.com';
window.addEventListener('message', function(event) {
if (event.origin !== parentOrigin) return;
data = event.data;
});
parent.postMessage('requestUpdate', parentOrigin);
// To get the value:
function displayValue() {
var hostValue = data.providedValue;
}
This snippet is merely a demonstration of the concept. If you want to apply the method, you probably want to split the logic in the recalc function, such that the value is only recalculated on the update of that particular value (instead of recalculating everything on every update).
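One possible way to do that split, as a sketch only: keep one provider function per key and recalculate just the key that changed. The names createCache, update and resend are hypothetical, and the send callback stands in for the postMessage call from host.js.

```javascript
// Sketch: per-key recalculation instead of one big recalc().
// "providers" maps each key to a function that computes its value.
// "send" is called with the cache whenever something was updated.
function createCache(providers, send) {
  var cache = {};
  Object.keys(providers).forEach(function (key) {
    cache[key] = null; // Nothing computed yet
  });
  return {
    // Recalculate a single value, then push the updated cache.
    update: function (key) {
      cache[key] = providers[key]();
      send(cache);
    },
    // Push the current cache without recalculating anything.
    resend: function () {
      send(cache);
    }
  };
}

// Usage in host.js (provideValue / getOtherValue as in the snippet above):
// var state = createCache(
//   { providedValue: provideValue, otherValue: getOtherValue },
//   function (cache) {
//     frames.content.postMessage(cache, 'https://other.example.com');
//   }
// );
// state.update('providedValue'); // only provideValue() runs
```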
I am puzzling my way through my first 'putting it all together' Chrome extension. I'll describe what I am trying to do and then how I have been going about it, with some script excerpts:
I have an options.html page and an options.js script that lets the user set a url in a textfield -- this gets stored using localStorage.
function load_options() {
var repl_adurl = localStorage["repl_adurl"];
default_img.src = repl_adurl;
tf_default_ad.value = repl_adurl;
}
function save_options() {
var tf_ad = document.getElementById("tf_default_ad");
localStorage["repl_adurl"] = tf_ad.value;
}
document.addEventListener('DOMContentLoaded', function () {
document.querySelector('button').addEventListener('click', save_options);
});
document.addEventListener('DOMContentLoaded', load_options );
My contentscript injects a script 'myscript' into the page ( so it can have access to the img elements from the page's html )
var s = document.createElement('script');
s.src = chrome.extension.getURL("myscript.js");
console.log( s.src );
(document.head||document.documentElement).appendChild(s);
s.parentNode.removeChild(s);
myscript.js is supposed to somehow grab the local storage data and that determines how the image elements are manipulated.
I don't have any trouble grabbing the images from the html source, but I cannot seem to access the localStorage data. I realize it must have to do with the two scripts having different environments but I am unsure of how to overcome this issue -- as far as I know I need to have myscript.js injected from contentscript.js because contentscript.js doesn't have access to the html source.
Hopefully somebody here can suggest something I am missing.
Thank you, I appreciate any help you can offer!
-Andy
First of all: You do not need an injected script to access the page's DOM (<img> elements). The DOM is already available to the content script.
Content scripts cannot directly access the localStorage of the extension's process; you need to implement a communication channel between the background page and the content script in order to achieve this. Fortunately, Chrome offers a simple message passing API for this purpose.
I suggest using the chrome.storage API instead of localStorage. The advantage of chrome.storage is that it's available to content scripts, which allows you to read/set values without a background page. Currently, your code looks quite manageable, so switching from the synchronous localStorage to the asynchronous chrome.storage API is doable.
Regardless of your choice, the content script's code has to read/write the preferences asynchronously:
// Example of preference name, used in the following two content script examples
var key = 'adurl';
// Example using message passing:
chrome.extension.sendMessage({type:'getPref',key:key}, function(result) {
// Do something with result
});
// Example using chrome.storage:
chrome.storage.local.get(key, function(items) {
var result = items[key];
// Do something with result
});
As you can see, there's hardly any difference between the two. However, to get the first to work, you also have to add more logic to the background page:
// Background page
chrome.extension.onMessage.addListener(function(message, sender, sendResponse) {
if (message.type === 'getPref') {
var result = localStorage.getItem(message.key);
sendResponse(result);
}
});
On the other hand, if you want to switch to chrome.storage, the logic in your options page has to be slightly rewritten, because the current code (using localStorage) is synchronous, while chrome.storage is asynchronous:
// Options page
function load_options() {
chrome.storage.local.get('repl_adurl', function(items) {
var repl_adurl = items.repl_adurl;
default_img.src = repl_adurl;
tf_default_ad.value = repl_adurl;
});
}
function save_options() {
var tf_ad = document.getElementById('tf_default_ad');
chrome.storage.local.set({
repl_adurl: tf_ad.value
});
}
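One detail worth a sketch: before the user has saved anything, items.repl_adurl is undefined. The get method of chrome.storage also accepts an object whose values act as defaults, which avoids that. The wrapper name readPref is my own, and the storage area is passed in as a parameter (an assumption made so the function can be tried outside an extension).

```javascript
// Sketch: read one preference with a fallback value.
// "storageArea" is e.g. chrome.storage.local.
function readPref(storageArea, key, fallback, callback) {
  var defaults = {};
  defaults[key] = fallback; // Used when the key has never been set
  storageArea.get(defaults, function (items) {
    callback(items[key]);
  });
}

// Usage in the options page:
// readPref(chrome.storage.local, 'repl_adurl', '', function (repl_adurl) {
//   tf_default_ad.value = repl_adurl;
// });
```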
Documentation
chrome.storage (method get, method set)
Message passing (note: this page uses chrome.runtime instead of chrome.extension. For backwards-compatibility with Chrome 25-, use chrome.extension (example using both))
A simple and practical explanation of synchronous vs asynchronous ft. Chrome extensions
I am writing an application for users, in which they input valid HTML into a text field.
I have a button in jQuery which tries to load the text field area into the W3C validator:
$('#inspecthtml').on('click', function() {
var storyhtml = $('#story').text();
validatorurl= "http://validator.w3.org/#validate_by_input";
var newWin = open(validatorurl,'Validator','height=600,width=600');
newWin.onload = function() {
newWin.document.getElementById("fragment").value=storyhtml;
}
});
I get an error message in the console (using Chrome):
Unsafe JavaScript attempt to access frame with URL
http://api.flattr.com/button/view/?url=http%3A%2F%2Fvalidator.w3.org%2F&title=View%20W3C-Validator%20on%20flattr.com&
from frame with URL http://validator.w3.org/#validate_by_input. The
frame being accessed set 'document.domain' to 'flattr.com', but the
frame requesting access did not. Both must set 'document.domain' to
the same value to allow access.
I attribute this to the cross domain security (see Unsafe JavaScript attempt to access frame with URL)
My question: Is there a way to send the data to the validator, so my users can check their own mark-up?
I think the code snippet below will get you the same effect and user experience you’re after.
It’s written using jQuery’s $.ajax(…) with some DOMParser and document.write(…) to put the styled results and UI of the W3C HTML Checker into a new window the way it seems you want.
var validator_baseurl= "https://validator.w3.org/nu/";
var validator_requesturl = validator_baseurl
+ "?showsource=yes&showoutline=yes";
$.ajax({
url: validator_requesturl,
type: "POST",
crossDomain: true,
data: storyhtml,
contentType: "text/html;charset=utf-8",
dataType: "html",
success: function (response) {
var results = (new DOMParser()).parseFromString(response, "text/html");
results.querySelector("link[rel=stylesheet]").href
= validator_baseurl + "style.css";
results.querySelector("script").src
= validator_baseurl + "script.js";
results.querySelector("form").action
= validator_requesturl;
var newWin = window.open("about:blank",
"Checker results", "height=825,width=700");
newWin.document.open();
newWin.document.write(results.documentElement.outerHTML);
newWin.document.close();
newWin.location.hash = "#textarea";
setTimeout(function() {
newWin.document.querySelector("textarea").rows = "5";
}, 1000)
}
});
Explanation
causes a POST request to be sent to the W3C HTML Checker
makes the storyhtml text the POST body
makes text/html;charset=utf-8 the POST body’s media type (what the checker expects)
causes the checker to actually check the storyhtml contents automatically
shows the checker results in a new window right when it’s first opened, in one step (so your users don’t need to do a second step to manually submit it for checking themselves)
replaces relative URLs for the checker’s frontend CSS+JS with absolute URLs (otherwise in this “standalone window” context, the CSS wouldn’t get applied, and the script wouldn’t run)
newWin.location.hash = "#textarea" is needed to make the checker show the textarea
Notes
intentionally uses the current W3C HTML Checker (not the legacy W3C markup validator)
intentionally sends the content to be checked as a POST body (not multipart/form-data); the checker supports multipart/form-data but making it a POST body is easier and better
the setTimeout textarea bit isn’t required; I just put it to make the results visible without scrolling (bottom part of new window below textarea); you can of course remove it if you want
sets the new window’s height and width a bit larger than the 600x600 in the question’s original code; again, I just did that to make things easier to see; change them however you want
uses standard DOM ops that may have better jQuery methods/idioms (I don’t normally use jQuery, so I can imagine there are ways to streamline the code further with jQuery)
could of course also be done without using jQuery at all—using standard Fetch or XHR instead (and I’d be happy to also add examples here that use Fetch and XHR if desired)
tested & works as expected in Edge, Firefox, Chrome & Safari; but as with any code that uses document.open, Safari users need to unset Preferences > Security > Block pop-up windows
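For reference, a Fetch-based version of just the request part might look roughly like this. It is a sketch only: checkHtml is a made-up name, and fetch is passed in as a parameter (an assumption made so the function can be exercised outside a browser). The URL, method, body and content type are the same ones used in the jQuery snippet above.

```javascript
// Sketch: send the markup to the checker with Fetch instead of $.ajax.
// "fetchImpl" is the fetch function to use, e.g. window.fetch.bind(window).
function checkHtml(html, fetchImpl) {
  var url = 'https://validator.w3.org/nu/?showsource=yes&showoutline=yes';
  return fetchImpl(url, {
    method: 'POST',
    headers: { 'Content-Type': 'text/html;charset=utf-8' },
    body: html // The markup itself is the POST body
  }).then(function (response) {
    return response.text(); // Resolves with the results page as a string
  });
}

// Usage in the page:
// checkHtml(storyhtml, window.fetch.bind(window)).then(function (response) {
//   // Same DOMParser / window.open steps as in the success callback above
// });
```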
Let's say I have a web page (/index.html) that contains the following
<li>
<div>item1</div>
details
</li>
and I would like to have some javascript on /index.html to load that
/details/item1.html page and extract some information from that page.
The page /details/item1.html might contain things like
<div id="some_id">
picture
map
</div>
My task is to write a greasemonkey script, so changing anything serverside is not an option.
To summarize, javascript is running on /index.html and I would
like to have the javascript code to add some information on /index.html
extracted from both /index.html and /details/item1.html.
My question is how to fetch information from /details/item1.html.
I currently have written code to extract the link (e.g. /details/item1.html)
and pass this on to a method that should extract the wanted information (at first
just .innerHTML from the some_id div is ok, I can process further later).
The following is my current attempt, but it does not work. Any suggestions?
function get_information(link)
{
var obj = document.createElement('object');
obj.data = link;
document.getElementsByTagName('body')[0].appendChild(obj)
var some_id = document.getElementById('some_id');
if (! some_id) {
alert("some_id == NULL");
return "";
}
return some_id.innerHTML;
}
First:
function get_information(link, callback) {
var xhr = new XMLHttpRequest();
xhr.open("GET", link, true);
xhr.onreadystatechange = function() {
if (xhr.readyState === 4) {
callback(xhr.responseText);
}
};
xhr.send(null);
}
then
get_information("/details/item1.html", function(text) {
var div = document.createElement("div");
div.innerHTML = text;
// Do something with the div here, like inserting it into the page
});
I have not tested any of this - off the top of my head. YMMV
As only one page exists in the client (browser) at a time and all other (virtual/possible) pages are on the server, how will you get information from another page using JavaScript as you will have to interact with the server at some point to retrieve the second page?
If you can, integrate some AJAX-request to load the second page (and parse it), but if that's not an option, I'd say you'll have to load all pages that you want to extract information from at the same time, hide the bits you don't want to show (in hidden DIVs?) and then get your index (or whoever controls the view) to retrieve the needed information from there ... even though that sounds pretty creepy ;)
You can load the page in a hidden iframe and use normal DOM manipulation to extract the results, or get the text of the page via AJAX, grab the part between <body...>...</body> and temporarily inject it into a div. (The second might fail for some exotic elements like ins.) I would expect Greasemonkey to have more powerful functions than normal JavaScript for stuff like that, though - it might be worth thumbing through the documentation.
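The "grab the part between the body tags" step could be sketched like this. The function name extractBody is made up, and a regular expression is a crude tool for HTML, so treat it as a starting point rather than a robust parser.

```javascript
// Sketch: return the markup between <body ...> and </body>.
// Falls back to the whole text if no body element is found.
function extractBody(html) {
  var match = /<body[^>]*>([\s\S]*?)<\/body>/i.exec(html);
  return match ? match[1] : html;
}

// Usage in the page, after fetching the text via AJAX:
// var div = document.createElement('div');
// div.innerHTML = extractBody(text);
// var some_id = div.querySelector('#some_id');
```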