I am new to both JavaScript and Chrome development, and am trying to create an extension that injects content/CSS into certain web pages. Simple enough, but the catch is that doing so requires looking through a significant amount of data in local storage. From what I've read so far, the correct way to do this would be either:
Reading the required data (JSON serialized) from storage directly from the content script every time the page is visited, or
Maintaining the state in the extension background page and transferring the required data (also JSON serialized) to the content script environment using message passing.
Either of these approaches, however, would be extremely inefficient due to large amounts of data being needlessly serialized and deserialized on every page load.
So, I want to know:
Is it in any way possible to maintain a shared memory cache in Chrome that content scripts injected in all tabs can access?
If not, is an alternate approach possible where the background page listens for the chrome.tabs.onUpdated event and somehow modifies the target DOM itself?
1) I don't think this is possible. You seem to have exhausted the possibilities.
2) Content scripts are the only way to access/modify a normal tab's DOM.
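That said, the background page can trigger the injection itself via programmatic injection; the injected code still runs as a content script, it just isn't declared in the manifest. Below is a minimal Manifest V2 sketch, assuming the extension has the tabs permission plus host permissions for the pages in question (the URL filter is hypothetical):

// background.js (Manifest V2)
chrome.tabs.onUpdated.addListener((tabId, changeInfo, tab) => {
  // Wait until the tab has finished loading and matches the pages we care about.
  if (changeInfo.status !== 'complete' || !tab.url) return;
  if (!tab.url.startsWith('https://example.com/')) return; // hypothetical filter

  // Inject CSS and a small script programmatically; this is still content-script
  // code running in the page, just triggered from the background page.
  chrome.tabs.insertCSS(tabId, { code: 'body.my-ext-page { border-top: 4px solid gold; }' });
  chrome.tabs.executeScript(tabId, { code: 'document.body.classList.add("my-ext-page");' });
});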
1- Reading the required data (JSON serialized) from storage directly from the content script every time the page is visited.
But you would have to do that every time your page is loaded, which is what you want to avoid (I guess).
2- Maintaining the state in the extension background page and transferring the required data (also JSON serialized) to the content script environment using message passing.
The only way to make content scripts and background scripts interact is via message passing. You are not actually looking for an alternative solution; you want to improve the process and avoid message passing every time a page is loaded.
For this, you can define a spec: a set of rules stating for which URLs, which domains, or under which conditions the content script should request data from the background page. Only if the current URL/tab matches the spec do you pass a message to the background.
Optionally, the background page can apply the same rules and send a message only when the spec is matched. Moreover, when your extension loads, you can cache the contents of storage in a local variable.
Use the Chrome Storage API's onChanged event to listen for changes in storage and update your local data copy accordingly.
You can also look at this code written by me using the same approach.
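For illustration, here is a rough sketch of that caching pattern, assuming the data lives under a hypothetical appData key in chrome.storage.local:

// background.js - keep a deserialized copy in memory so content scripts never
// have to read and parse the whole blob on every page load.
let cache = {};

// Warm the cache once when the extension starts.
chrome.storage.local.get('appData', (result) => {
  cache = result.appData || {};
});

// Keep the in-memory copy in sync with later writes to storage.
chrome.storage.onChanged.addListener((changes, areaName) => {
  if (areaName === 'local' && changes.appData) {
    cache = changes.appData.newValue || {};
  }
});

// Content scripts (only on pages that match the "spec") request just the slice
// they need instead of the whole serialized blob.
chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.type === 'getData') {
    sendResponse(cache[request.key]);
  }
});

The content script side then only sends a message like { type: 'getData', key: ... } on pages whose URL matches the spec.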
I want to create a Tampermonkey script that is registered on one page (call it A). From this page (it is an overview page), it extracts a series of links (say [B, C, D]). This is working so far.
Now, I want to do the following:
Navigate to location B.
Wait for the DOM to become ready, so I can extract further information
Parse some information from the page and store it in some object/array (call it out).
Repeat steps 1 through 3 with the URLs C and D
Go back to address A
Copy the content of out to the clipboard
Task 1 I can achieve with window.open or window.location, but I am currently failing at steps 2 and 3.
Is this even possible? I am unsure whether navigating to another page will terminate and unload the current script.
Can you point me in the right direction to get this issue solved?
If you have any better ideas, I am willing to hear them. The reason I am using the browser with Tampermonkey is that the pages use some sort of CSRF protection means that will not allow me to use e.g. curl to extract the relevant data.
I have seen this answer. As far as I understand it, this will start a new script on each invocation, and I would have to pass all information using URL parameters manually. It might be doable (unless the server is messing with the params), but it seems like some effort. Is there a simpler solution?
To transfer information, there are a few options.
URL parameters, as you mentioned - but that could get messy
Save the values and a flag in Tampermonkey's shared storage using GM_setValue
If you open the windows to scrape using window.open, you can have the child windows call .postMessage while the parent window listens for messages (including for those from other domains). (BroadcastChannel is a nice flexible option, but it's probably overkill here)
It sounds like your userscript needs to be able to run on arbitrary pages, so you'll probably need // @match *://*/*, as well as a way to indicate to the script that the page that was automatically navigated to is one to scrape.
When you want to start scraping, open the target page with window.open. (An iframe would be more user-friendly, but that will sometimes fail due to the target site's security restrictions.) When the page opens, your userscript can have the target page check if window.opener exists, or if there's a URL parameter (like scrape=true), to indicate that it's a page to be scraped. Scrape the information, then send it back to the parent using .postMessage. Then the parent can repeat the process for the other links. (You could even process all links in parallel, if they're on different domains and it won't overload your browser.)
Waiting for the DOM to be ready should be trivial. If the page is fully populated at the end of HTML parsing, then all your script needs is to not have @run-at document-start, and it'll run once the HTML is loaded. If the page isn't fully populated at the end of HTML parsing and you need to wait for something else, just have a timeout loop until the element you need exists.
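Putting those pieces together, here is a rough sketch. The scrape=true URL parameter, the #result selector, the overview URL, and the child links are all hypothetical placeholders, and GM_setClipboard stands in for your step 6:

// ==UserScript==
// @name   Scrape linked pages (sketch)
// @match  *://*/*
// @grant  GM_setClipboard
// ==/UserScript==

const params = new URLSearchParams(location.search);

if (params.get('scrape') === 'true' && window.opener) {
  // Child page (B, C, D): wait until the element we need exists, then report back.
  const timer = setInterval(() => {
    const el = document.querySelector('#result'); // hypothetical selector
    if (!el) return;
    clearInterval(timer);
    window.opener.postMessage(
      { type: 'scraped', url: location.href, text: el.textContent },
      '*'
    );
  }, 250);
} else if (location.href.startsWith('https://example.com/overview')) { // hypothetical page A
  const links = ['https://example.com/b', 'https://example.com/c', 'https://example.com/d'];
  const out = [];

  window.addEventListener('message', (event) => {
    if (!event.data || event.data.type !== 'scraped') return;
    out.push(event.data);
    if (out.length === links.length) {
      GM_setClipboard(JSON.stringify(out, null, 2)); // step 6: copy collected data
    }
  });

  // Open each child page with the marker so the branch above runs there.
  links.forEach((url) => window.open(url + '?scrape=true'));
}

A real script would also check event.origin and close the child windows once they have reported back.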
protection means that will not allow me to use e.g. curl to extract the relevant data.
Rather than a userscript, running this on your own server would be more reliable and somewhat easier to manage, if that's possible. Consider checking whether something more sophisticated than curl could work - for example, Puppeteer, which can emulate a full browser.
Between the following setups, which one would deliver the fastest page load for a front-end user? I am only interested in the speed for front-end users, not the maintenance requirements for back-end developers.
1. A website that only uses static .html files: no JavaScript, no PHP, no server-side programming language to render the HTML. Basically the origins of the internet, where each click on an internal link loads a static .html file. Each page is a pre-created physical .html file on the server.
2. A website with a physical pre-created .html file, but where the main content (article) on each page is fetched via JavaScript from a NoSQL service (Google Cloud Firestore or Fauna DB). Each click on an internal link only replaces the main content of the page via a database call. The rest of the website (menu, logo, sidebar, footer) is all static and never needs to reload.
3. A website with a physical pre-created .html file, but where the main content (article) on each page is fetched via JavaScript from a local JSON file: no database, just a regular .json file in the same directory as the .html file on the same server. Each click on an internal link only replaces the main content of the page using JavaScript (probably vanilla JavaScript using fetch, unless React is somehow faster, which I doubt). The rest of the website (menu, logo, sidebar, footer) is all static and never needs to reload.
Of course server performance and user location always play a role in speed tests, but for argument's sake let's assume it's the same user visiting the same web server. Additionally, in regards to NoSQL, let's say it's a fast, reliably performing third-party service such as Google Cloud Firestore.
Which one of these setups would be the fastest? Has anyone tested this? I have heard some people argue that basic static .html files are always fastest, while others argue that a static .html file where the content is loaded via JavaScript is faster when navigating internal links, once the initial page load is done. Both arguments make sense.
Any major pros or cons for one of the mentioned setups, or past benchmarks?
The speed of the webpage has two big components:
A. How fast the server responds/the size of the response
B. How fast the browser can render whatever it fetched
So, static files without JS will be the fastest: there is no delay on the server side, and the browser is very efficient at rendering static assets.
The third option is still fast, but slightly slower than the first one, as there is some extra work for the browser (transforming the JSON to HTML via JS).
The second option will be the slowest, as it is the only one where the server does not respond instantly with a file, but needs to connect to a DB, fetch the results, transform them, and only then send them back.
All of this is relevant only if we are talking about exactly the same content, just delivered in different forms.
The question is slightly flawed, but to answer it:
Static content is fastest: the browser will render the content and cache it.
Getting content from a database adds overhead to the call and retrieval. The main page will be downloaded once and cached on the client side, but the calls for content cannot be cached, since the browser has to make the call to find out what the content is. The upside is that each call only returns the content that needs to be displayed, and DB lookups are pretty quick with the big cloud service providers.
The third option is probably slower than option 2, because the whole JSON file will need to be downloaded for the JavaScript to pick out the content for one article from all the content.
I would suggest option 2 is best from a maintainability vs speed point of view, as it only sends the required data across the network and the rest is cached.
If you like option 3, have a look at using the browser cache https://web.dev/cache-api-quick-guide/ to cache your JSON file; this way the user will only need to download an updated version when you change the content.
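For instance, here is a minimal sketch using the Cache API (which requires a secure context), assuming a hypothetical articles.json next to the page and a #main element for the article content:

// Serve the JSON from Cache Storage when possible; otherwise fetch it once and
// cache it. Bumping the cache name (content-v1 -> content-v2) when the content
// changes forces clients to download the fresh file.
async function loadArticles() {
  const cache = await caches.open('content-v1');
  const cached = await cache.match('articles.json');
  if (cached) return cached.json();

  const response = await fetch('articles.json');
  await cache.put('articles.json', response.clone());
  return response.json();
}

loadArticles().then((articles) => {
  // Hypothetical shape: an array of { id, html } entries.
  document.querySelector('#main').innerHTML = articles[0].html;
});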
I'm about to release a web application with a few pages. Each page is a Vue.js bundle, so on each page there is a single JavaScript bundle and a single CSS file included, plus a single div with a unique ID where the app elements get mounted.
I need to be able to make updates to the static CSS/JS files without major service disruption. I'm using a Google Firebase backend for the application data, so if the client code doesn't update when a new version is deployed, it could try to write to the database in the wrong format. So, caching of the script files has been a problem.
I was initially under the impression that caches are invalidated when the hash of the file contents changes, but apparently that is not true. So, the core question is: How can I invalidate the browser cache of these files every time the content is updated?
What makes things complicated is that the web application may be embedded on clients' websites by adding a small snippet to the page. And I don't want to modify these snippets for every update, so I can't change the filename with each version. E.g. in someoneswebsite.com.au/app/index.html:
<div id="my-app-mount"></div>
<script src="https://mywebapp.com.au/app/homepage.js"></script>
What won't work for me
Adding a query string or changing the filename with every update (or other server-side tricks in PHP): I can't use any preprocessor, as the snippet needs to be embeddable on other sites in HTML only.
Just setting a short TTL in the cache for these items. I need updates to work overnight, so I'd have to go down to just an hour or two. And this leads into the next point:
Disabling caching completely for these items: I don't want to flog my server with the extra traffic.
Just telling my users to do a hard-reload if they have any issues - this is a client-facing product.
My ideas for a solution so far
Change the filename with each upgrade (app-1.0.js, app-1.1.js, and so on) and add a 'bootstrap' script that loads the latest version based on a version string read from Firebase (a rough sketch follows this list). However, this adds extra latency to every single page load, as we need to hear from the database before loading the main JS payload.
In each JavaScript bundle, add a check to compare the app version with a version number retrieved from Firebase. If the script is out of date, we can programmatically invalidate the cache and refresh the page (but how to do this?).
Some combination of HTTP cache headers, to always invalidate the cached copy if the hashed contents don't match the server.
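For what it's worth, a rough sketch of the bootstrap-script idea; the version.json lookup here is a hypothetical stand-in for the Firebase read, and the file names are illustrative:

// bootstrap.js - the embed snippet on client sites points here instead of at
// homepage.js, and never changes. This file stays tiny and is served with
// Cache-Control: no-cache, so the extra request is cheap; the versioned bundle
// it loads can be cached aggressively because its URL changes every release.
fetch('https://mywebapp.com.au/app/version.json', { cache: 'no-store' })
  .then((res) => res.json())
  .then(({ version }) => {
    const script = document.createElement('script');
    script.src = `https://mywebapp.com.au/app/homepage-${version}.js`; // e.g. homepage-1.1.js
    document.head.appendChild(script);
  });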
I know how to get the localStorage from any open web page by using content scripts. So I'm basically able to open a new tab with my own web page, read the storage data with a content script, and message it to the background page.
But now I'd like to do this without loading an external page every time. Is there a way to access the localStorage of a page directly from within the extension? Maybe some query to Chrome directly.
I don't see any API for that.
Your options are:
Make a native messaging host application that would read the database files directly from the Local Storage directory in the browser user profile. An example: What's the best way to read Sqlite3 directly in Browser using Javascript?
Put the other page into an iframe: Is it possible to use HTML5 local storage to share data between pages from different sites?
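A rough sketch of the iframe option in an extension context, assuming the extension has host permissions for the target origin and a content script declared with all_frames: true (pages that send X-Frame-Options: DENY will refuse to load in a frame; the URL below is hypothetical):

// In an extension page (e.g. the background page): load the target page in a
// hidden iframe so the content script can run in it without opening a tab.
const frame = document.createElement('iframe');
frame.style.display = 'none';
frame.src = 'https://example.com/'; // hypothetical page whose localStorage we want
document.body.appendChild(frame);

// Content script (runs inside the iframe thanks to all_frames: true):
// copy localStorage and hand it to the extension.
chrome.runtime.sendMessage({
  type: 'localStorageDump',
  origin: location.origin,
  data: { ...localStorage }
});

// Back in the extension: receive the dump.
chrome.runtime.onMessage.addListener((msg) => {
  if (msg.type === 'localStorageDump') {
    console.log('localStorage of', msg.origin, msg.data);
  }
});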
P.S. "Ironic side note" quoted from Cross-domain localStorage article by Nicholas C. Zakas.
Who knew cross-domain client-side data storage would be useful? Actually, the WHAT-WG did. In the first draft of the Web Storage specification (at that time, part of HTML5), there was an object called globalStorage that allowed you to specify which domains could access certain data. [...]
The globalStorage interface was implemented in Firefox 2 prematurely as the specification was still evolving. Due to security concerns, globalStorage was removed from the spec and replaced with the origin-specific localStorage.
Alright, first off this is not a malicious question I'm asking. I have no intentions of using any info for ill gains.
I have an application that contains an embedded browser. This browser runs within the application's process, so I can't access it via Selenium WebDriver or anything like that. I know that it's possible to dynamically append scripts and html to loaded web pages via WebDriver, because I've done it.
In the embedded browser, I don't have access to the pages that get loaded. Instead, I can create my own html/javascript pages and execute them, to manipulate the application that houses the browser. I'm having trouble manipulating the existing pages within the browser.
Is there a way to dynamically add javascript to a page when you navigate to it and have it execute right after the page loads?
Something like
page1.navigateToUrl(executeThisScriptOnLoad)
page2 then executes the passed script.
I guess it is not possible to do this without knowledge of the destination site. However, you can send data to the site and then use the eval() function to evaluate the sent data on the destination page.
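A rough sketch of that idea, passing the script text in the URL hash; this assumes you control (or can script) the destination page, and evaluating strings like this is something to use carefully:

// On page1: navigate, carrying the code to run in the URL hash.
const snippet = 'document.body.style.background = "lightyellow";';
location.href = 'page2.html#' + encodeURIComponent(snippet);

// On page2: once the page has loaded, evaluate whatever was passed in the hash.
window.addEventListener('load', () => {
  if (location.hash.length > 1) {
    eval(decodeURIComponent(location.hash.slice(1)));
  }
});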