Extract the whole text contained in webpage using Chrome extension


I'm developing a Chrome extension for text parsing of Google search results. I want the user to enter some text in the omnibox and then be directed to a Google search page.
function navigate(url) {
chrome.tabs.query({active: true, currentWindow: true}, function(tabs) {
chrome.tabs.update(tabs[0].id, {url: url});
});
}
chrome.omnibox.onInputEntered.addListener(function(text) {
navigate("https://www.google.com.br/search?hl=pt-BR&lr=lang_pt&q=" + text + "%20%2B%20cnpj");
});
alert('Here is where the text will be extracted');
After directing the current tab to the search page, I want to get the plain text form of the page, to parse it afterwards. What is the most straightforward way to accomplish this?

Well, parsing the webpage is probably going to be easier to do as a DOM instead of plain text. However, that is not what your question asked.
Your code has issues with how it navigates to the page and with how it handles the asynchronous nature of web navigation. That is also outside what you asked, but it affects how getting text from a webpage has to be implemented.
As such, to answer your question of how to extract the plain text from a webpage, I implemented doing so upon the user clicking a browser_action button. This separates answering how this can be done from the other issues in your code.
As wOxxOm mentioned in a comment, to have access to the DOM of a webpage, you have to use a content script. As he did, I suggest you read the Overview of Chrome extensions. You can inject a content script using chrome.tabs.executeScript. Normally, you would inject a script contained in a separate file using the file property of the details parameter. For code that is just the simple task of sending back the text of the webpage (without parsing, etc), it is reasonable to just insert the single line of code that is required for the most basic way of doing so. To insert a short segment of code, you can do so using the code property of the details parameter. In this case, given that you have said nothing about your requirements for the text, document.body.innerText is the text returned.
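The text is sent back to the background script as the result of the injected code: the value of the last evaluated statement, document.body.innerText, becomes that frame's entry in the results array.
To receive the text in the background script, a callback, receiveText, is passed to chrome.tabs.executeScript().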
background.js:
chrome.browserAction.onClicked.addListener(function(tab) {
console.log('Injecting content script(s)');
//On Firefox document.body.textContent is probably more appropriate
chrome.tabs.executeScript(tab.id,{
code: 'document.body.innerText;'
//If you had something somewhat more complex you can use an IIFE:
//code: '(function (){return document.body.innerText})();'
//If your code was complex, you should store it in a
// separate .js file, which you inject with the file: property.
},receiveText);
});
//tabs.executeScript() returns the results of the executed script
// in an array of results, one entry per frame in which the script
// was injected.
function receiveText(resultsArray){
console.log(resultsArray[0]);
}
manifest.json:
{
"description": "Gets the text of a webpage and logs it to the console",
"manifest_version": 2,
"name": "Get Webpage Text",
"version": "0.1",
"permissions": [
"activeTab"
],
"background": {
"scripts": [
"background.js"
]
},
"browser_action": {
"default_icon": {
"32": "myIcon.png"
},
"default_title": "Get Webpage Text",
"browser_style": true
}
}
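The navigation issue mentioned above (the omnibox handler starts a navigation, but the new page has not loaded when the next statement runs) could be handled roughly as follows. This is only a sketch, not part of the code above: buildSearchUrl and navigateThenExtract are hypothetical helper names, the chrome API object is passed in as a parameter so the logic can be exercised outside the browser, and it assumes the manifest grants host permission for the search page, because an activeTab grant does not survive a navigation.

```javascript
// Hypothetical helper: build the search URL, encoding the user's text so that
// characters such as "&" or "#" cannot break the query string.
function buildSearchUrl(text) {
  return "https://www.google.com.br/search?hl=pt-BR&lr=lang_pt&q=" +
    encodeURIComponent(text + " + cnpj");
}

// Hypothetical helper: navigate the tab, wait until it reports "loading" and
// then "complete", inject the one-line script and hand the text to onText.
function navigateThenExtract(api, tabId, url, onText) {
  var sawLoading = false;
  function onUpdated(updatedTabId, changeInfo) {
    if (updatedTabId !== tabId) return;
    if (changeInfo.status === "loading") sawLoading = true;
    if (changeInfo.status !== "complete" || !sawLoading) return;
    api.tabs.onUpdated.removeListener(onUpdated);
    api.tabs.executeScript(tabId, { code: "document.body.innerText;" },
      function (results) {
        // One entry per frame; the first entry is the top frame.
        onText(results && results[0]);
      });
  }
  api.tabs.onUpdated.addListener(onUpdated);
  api.tabs.update(tabId, { url: url });
}

// Only wire things up when actually running inside an extension.
if (typeof chrome !== "undefined" && chrome.omnibox) {
  chrome.omnibox.onInputEntered.addListener(function (text) {
    chrome.tabs.query({ active: true, currentWindow: true }, function (tabs) {
      navigateThenExtract(chrome, tabs[0].id, buildSearchUrl(text),
        function (pageText) { console.log(pageText); });
    });
  });
}
```

The listener removes itself after the first completed load, so later navigations in the same tab are not touched.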

Related

Injecting a JavaScript into a webpage automatically?

I am very much a novice when it comes to writing code; this is my first attempt. I have done a fair amount of research and learning, but there is so much to learn, so I'm looking for some help or advice. At work we have to click a button to get a work order. I am trying to automate the process, and I have written some code that clicks the button for me. Unfortunately, when the page returns "No Matches" it automatically reloads, and my code is gone from the DOM. Is there a way to automatically inject my code every time the webpage reloads?
My Current Code:
var button = document.getElementById('GetMeAWorkOrder');
setInterval(function () {
button.click();
}, 1500);

You can look into creating your own Chrome extension that injects the code through a content script. To make it easier, I've included the starter files for your specific project. Simply entering your specific URL into the marked lines should be enough for the code you posted.
Manifest.json
{
"manifest_version": 2,
"name": "My Extension",
"description": "Injects custom code to [website]",
"version": "1",
"permissions": [
"*://URL HERE*"
],
"content_scripts": [
{
"matches": [
"*://URL HERE*"
],
"js": ["content.js"]
}
]
}
content.js
//code you want to inject into the website
var button = document.getElementById('GetMeAWorkOrder');
setInterval( function() { button.click() },1500 );
Remember to replace the URLs with the website you want to inject the code into; here's a reference on match patterns.
Good luck and have fun coding! :)
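One thing to watch for: content.js looks the button up once, when the script is injected. If the page creates the button later, that lookup returns null and every click() call throws. A slightly more defensive variant, offered as a sketch (clickIfPresent is a hypothetical helper name, and the document is passed in so the function can be tried outside the browser), looks the button up on every tick:

```javascript
// Hypothetical helper: click the element with the given id if it exists.
// Returns true when a click was issued, false when the element is missing.
function clickIfPresent(doc, id) {
  var button = doc.getElementById(id);
  if (!button) {
    return false;
  }
  button.click();
  return true;
}

// Only start the timer when running inside a page.
if (typeof document !== "undefined") {
  setInterval(function () {
    clickIfPresent(document, "GetMeAWorkOrder");
  }, 1500);
}
```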

Injecting into navigation error page gets: Error: No window matching {"matchesHost":["<all_urls>"]}

I am trying to execute a script that shows a green border on the specified tab (by ID). The script should execute when the response for the requested URL is an error. The problem is that, when I load the extension from about:debugging, I get the following error (in the browser console in FF 53):
Error: No window matching {"matchesHost":["<all_urls>"]}
I searched for hours and looked at several posts about similar problems, but none of them helped. For example, this post suggests adding the "<all_urls>" permission, which did not help in my case. Another post says that it is not possible to execute scripts in certain types of hosts, such as about:[anything] pages and Mozilla pages. I do not see how my URL belongs to any of them.
Here is my example:
The manifest.json
{
"manifest_version": 2,
"name": "test",
"version": "1.0",
"background": {
"scripts": ["test.js"]
},
"permissions": [
"<all_urls>",
"activeTab",
"tabs",
"storage",
"webRequest"
]
}
The background script is test.js:
console.log("-- inside js file --");
var target = "<all_urls>";
function logError(responseDetails) {
errorTab=responseDetails.tabId;
console.log("response tab: "+errorTab);
var makeItGreen = 'document.body.style.border = "5px solid green"';
var executing = browser.tabs.executeScript(errorTab,{
code: makeItGreen
});
}//end function
browser.webRequest.onErrorOccurred.addListener(
logError,
{urls: [target],
types: ["main_frame"]}
);

The error you are seeing:
Error: No window matching {"matchesHost":["<all_urls>"]}
is generated when you attempt to inject a script using tabs.executeScript() (or CSS with tabs.insertCSS()) in a tab that is currently displaying a URL which you do not have permission to inject into. In this case, you have specified in your manifest.json the host permission "<all_urls>". The fact that "matchesHost":["<all_urls>"] is displayed indicates that Firefox is aware of your "<all_urls>" permission. That you have still gotten the error means that you have attempted to inject into a URL which does not match <all_urls>.
As you have mentioned, Firefox does not permit injecting into about:* pages. In addition, injecting into pages at the domain addons.mozilla.org is not permitted. None of those pages will match <all_urls>. All such URLs will generate the above error if you attempt to inject into tabs showing them.
But, I'm injecting into some normal URL that had an error
Despite all easily obtainable information to the contrary, including the URL provided in the tabs.Tab data obtained from tabs.get(), the page you are attempting to inject into is, in fact, an about:* page, not the page (which doesn't exist) at the URL where you got the error. While the URL reported in the tabs.Tab structure for the tab in which you received the error shows the URL on which the error occurred, the actual URL of the page being displayed is something like:
about:neterror?e=dnsNotFound&u=[URL you were attempting to get to, but encoded as a query string]
I know this because the last webNavigation.onDOMContentLoaded event when I tested attempting to load the URL: http://www.exampleahdsafhd.com/ was:
webNavigation.onDOMContentLoaded - > arg[0] = Object {
url: "about:neterror?e=dnsNotFound&u=http%3A//www.exampleahdsafhd.com/&c=UTF-8&f=regular&d=Firefox%20can%E2%80%99t%20find%20the%20server%20at%20www.exampleahdsafhd.com.",
timeStamp: 1497389662844,
frameId: 0,
parentFrameId: -1,
tabId: 2,
windowId: 3
}
The fact that the error page is an about:* page, means that you will not be able to inject scripts, or CSS, into it. This means that you will need to find some other way to accomplish what you desire and/or adapt what you desire to do to what is possible. One possibility would be to navigate to a page within your extension which describes the error.
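The last suggestion, navigating to a page inside the extension, might look something like the sketch below. It is an illustration under assumptions, not tested against Firefox 53: error.html is a hypothetical page you would have to add to the extension, and buildErrorPageUrl is a hypothetical helper name. Note also that onErrorOccurred fires for aborted requests as well, so a real extension would probably want to filter on details.error.

```javascript
// Hypothetical helper: build the URL of the extension's own error page,
// passing the failed URL and the error code along as query parameters.
function buildErrorPageUrl(pageUrl, details) {
  return pageUrl +
    "?url=" + encodeURIComponent(details.url) +
    "&error=" + encodeURIComponent(details.error);
}

// Only register the listener when running inside the extension.
if (typeof browser !== "undefined" && browser.webRequest) {
  browser.webRequest.onErrorOccurred.addListener(
    function (details) {
      if (details.tabId < 0) {
        return; // the request is not associated with a tab
      }
      browser.tabs.update(details.tabId, {
        // "error.html" is an assumed file name inside the extension.
        url: buildErrorPageUrl(browser.runtime.getURL("error.html"), details)
      });
    },
    { urls: ["<all_urls>"], types: ["main_frame"] }
  );
}
```

Because error.html belongs to the extension, it can apply the green border (or any other styling) itself, with no script injection needed.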

Chrome Extension - Getting "tab was closed" error on injecting a script

I am writing a chrome extension which detects the type of file being opened and based on that injects a script on the page which does many other things. Here is the part of my code for the background.js which is injecting the script:
chrome.webRequest.onHeadersReceived.addListener(function(details){
console.log("Here: " + details.url + " Tab ID: " + details.tabId);
if(toInject(details))
{
console.log("PDF Detected: " + details.url);
if(some-condition)
{
//some code
}
else
{
chrome.tabs.executeScript(details.tabId, { file: "contentscript.js", runAt: "document_start"}, function(result){
if(chrome.runtime.lastError)
{
console.log(chrome.runtime.lastError.message + " Tab ID: " + details.tabId);
}
});
}
return {
responseHeaders: [{
name: 'X-Content-Type-Options',
value: 'nosniff'
},
{
name: 'X-Frame-Options',
/*
Deny rendering of the obtained data.
Cant use {cancel:true} as we still need the frame to be accessible.
*/
value: 'deny'
}]
};
}
}, {
urls: ['*://*/*'],
types: ['main_frame', 'sub_frame']
}, ['blocking', 'responseHeaders']);
Here is the manifest file:
{
"manifest_version": 2,
"name": "ABCD",
"description": "ABCD",
"version": "1.2",
"icons": {
"16" : "images/16.png",
"32" : "images/32.png",
"48" : "images/48.png",
"128" : "images/128.png"
},
"background": {
"scripts": ["chrome.tabs.executeScriptInFrame.js", "background.js"],
"persistent": true
},
"permissions": [
"webRequest",
"<all_urls>",
"webRequestBlocking",
"tabs",
"nativeMessaging"
],
"web_accessible_resources": [ "getFrameId", "aux.html", "chrome-extension:/*", "images/*.png", "images/*.gif", "style.css"]
}
The problem is that when the script is injected, the lastError branch runs and reports that the tab was closed, and the script is not injected. If I press Enter in the omnibox several times, the script is injected and things work fine. Here is a sample run of events:
Sorry for my naive photo editing :P
There are a few more things we can deduce from this image:
The first thing being loaded in the tab with tab id 86 is something related to my google account. I have logged out and also turned off the prerender feature of chrome.
On pressing Enter several times, the "tab was closed" error goes away, but the script that maintains a chrome.runtime connection with background.js gets disconnected.
And then finally things work fine.
I have been banging my head around this for days. No other question on SO addresses this problem. Nor anywhere else on the internet as well.
EDIT:
One more thing to note: the sample run shown in the image above is just one example. There are many different behaviors. Sometimes pressing Enter three times does not make it work; sometimes once is enough. Is something going wrong because of the custom headers I am sending?
UPDATE #1
Note the headers I am returning in onHeadersReceived. They are there to stop Chrome from rendering the document. But when I do that, all the data of the file is dumped on the screen, and I don't want that to appear. So I think I need document_start, so that I can hide the dumped data before my content script does other things, like putting a custom UI on the page.
UPDATE #2
Noticed one more thing. If I open a new tab, and then paste a url there and then press enter the following is the output of the background page on the console.
So I guess, the location of the window is updated at a later time by chrome. Am I right? Any workarounds?

"The tab was closed" error message is a bit misleading, because the tab obviously is not closed. In chrome sources the variable with the string is called kRendererDestroyed. So the error is because the corresponding renderer is being destroyed for some reason.
I was getting the error when the page opened in the tab redirected (one renderer is destroyed and another is created for the same tab, this time with a different URL). In this case the extension gets tab updates with statuses like:
loading url: 'example.com' (the tab has already been passed to callbacks at this point, but trying to inject a script now produces the error)
loading url: 'example.com/other_url'
title: 'some title'
complete
I got around it by injecting the script only after receiving status: 'complete' (injecting on the title update would probably also work).
I did not try this with PDFs, but Chrome probably replaces the renderer for those too, as it does with a redirect. So look more closely at page statuses and at redirects/renderer replacements. Hope this helps anyone stumbling upon this question.
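Here is roughly what "inject only after complete" could look like. Treat it as a sketch: createCompleteInjector is a hypothetical helper name, and the idea is that the webRequest handler only marks the tab, while a tabs.onUpdated listener performs the actual injection. The trade-off is that the script no longer runs at document_start.

```javascript
// Hypothetical helper: remembers which tabs are waiting for injection and
// calls inject(tabId) once such a tab reports status "complete".
function createCompleteInjector(inject) {
  var pending = {};
  return {
    // Call this where executeScript used to be called directly.
    mark: function (tabId) {
      pending[tabId] = true;
    },
    // Register this as the chrome.tabs.onUpdated listener.
    onUpdated: function (tabId, changeInfo) {
      if (changeInfo.status !== "complete" || !pending[tabId]) {
        return false;
      }
      delete pending[tabId];
      inject(tabId);
      return true;
    }
  };
}

// Only wire things up when running inside an extension.
if (typeof chrome !== "undefined" && chrome.tabs) {
  var injector = createCompleteInjector(function (tabId) {
    chrome.tabs.executeScript(tabId, { file: "contentscript.js" }, function () {
      if (chrome.runtime.lastError) {
        console.log(chrome.runtime.lastError.message + " Tab ID: " + tabId);
      }
    });
  });
  chrome.tabs.onUpdated.addListener(injector.onUpdated);
  // In the onHeadersReceived handler: injector.mark(details.tabId);
}
```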

A simple setTimeout call to wait for the page to load worked for me.

Tasks automation with chrome extensions

I'm trying to automate the task of taking customer data from an eBay page and inserting it into a form on another page. I used iMacros, but I don't quite like it.
Are Chrome extensions good for this kind of work?
If so, where should the DOM logic go: in the background page or in the content script?
Can anyone provide a simple code example?

NOTE: since January 2021, use Manifest V3 with chrome.scripting.executeScript() and the scripting permission and move <all_urls> to host_permissions instead of using the deprecated chrome.tabs.executeScript() with the tabs permission.
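For reference, the Manifest V3 form of the injection calls used in this answer would look roughly like the sketch below. It assumes a manifest with "manifest_version": 3, the "scripting" permission and suitable host_permissions; injectFile and getPageText are hypothetical wrapper names, and the chrome object is passed in as a parameter only so the wrappers can be tried outside the browser.

```javascript
// Hypothetical wrapper: inject a file into a tab (MV3 replacement for
// chrome.tabs.executeScript(tabId, {file: ...})).
function injectFile(api, tabId, file) {
  return api.scripting.executeScript({
    target: { tabId: tabId },
    files: [file]
  });
}

// Hypothetical wrapper: run a function in the page and resolve with the
// value it returned in the top frame.
function getPageText(api, tabId) {
  return api.scripting.executeScript({
    target: { tabId: tabId },
    func: function () { return document.body.innerText; }
  }).then(function (results) {
    // One InjectionResult per frame; the value is in its "result" property.
    return results[0].result;
  });
}

// Example use inside an MV3 service worker:
// injectFile(chrome, tabId, "get_data.js");
// getPageText(chrome, tabId).then(function (text) { console.log(text); });
```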
Task
What you need here is a Chrome extension with the ability to read DOM content of the customer page inside a tab with a content script, and then store the information and send it to another tab.
Basically, you'll need to:
Inject a content script in the customer page
Retrieve the data and send it to the background
Process the data and send it to another content script, which will:
Insert the data in the form on another page
Implementation:
So, first of all, your manifest.json will need the permission to access the tabs and the URLs you need, plus the declaration for your background script, something like this:
{
"manifest_version": 2,
"name": "Extension name",
"description": "Your description...",
"version": "1",
"permissions": [
"<all_urls>",
"tabs"
],
"background": { "scripts": ["background.js"] }
}
Now, following the steps:
Add a listener to chrome.tabs.onUpdated to find the customer page and inject the first content script, plus find the tab with the form and inject the second script, all from background.js:
chrome.tabs.onUpdated.addListener(function(tabID, info, tab) {
if (~tab.url.indexOf("someWord")) {
chrome.tabs.executeScript(tabID, {file: "get_data.js"});
// first script to get data
}
if (~tab.url.indexOf("someOtherWord")) {
chrome.tabs.executeScript(tabID, {file: "use_data.js"});
// second script to use the data in the form
}
});
Ok, now the above code will inject your get_data.js script in the page if its URL contains "someWord" and use_data.js if its URL contains "someOtherWord" (you must obviously replace "someWord" and "someOtherWord" with the right words to identify the correct page URLs).
Now, in your get_data.js you'll need to retrieve data and send it to the background.js script, using chrome.runtime.sendMessage, something like this:
var myData = document.getElementById("some-id").textContent;
// just an example
chrome.runtime.sendMessage({message: "here is the data", data: myData});
Now you've sent the data, so inside background.js you'll need a listener to catch and process it:
chrome.runtime.onMessage.addListener(function(request, sender, sendResponse) {
if (request.message === "here is the data") {
// process it
chrome.tabs.query({url: "*://some/page/to/match/*"}, function(tabs) {
chrome.tabs.sendMessage(tabs[0].id, {message: "here is the data", data: request.data});
});
}
});
Ok, you are almost finished, now you'll need to listen to that message in the second content script, which is use_data.js, and use it in the form:
chrome.runtime.onMessage.addListener(function(request, sender, sendResponse) {
if (request.message == "here is the data") {
// use the data to do something in the page:
var myData = request.data;
// for example:
document.getElementById("input-id").value = myData;
}
});
And you are done!
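One caveat: the forwarding step in background.js assumes that a tab with the form page is already open when the data arrives. A slightly more defensive version of that listener could look like the sketch below. createForwarder is a hypothetical helper name, the URL pattern is the same placeholder used above, and the chrome object is passed in as a parameter so the logic can be tried outside the browser.

```javascript
// Hypothetical helper: returns an onMessage listener that forwards the data
// to the first tab matching targetPattern, if there is one.
function createForwarder(api, targetPattern) {
  return function (request, sender, sendResponse) {
    if (!request || request.message !== "here is the data") {
      return; // not a message this listener cares about
    }
    api.tabs.query({ url: targetPattern }, function (tabs) {
      if (!tabs || tabs.length === 0) {
        console.log("No open tab matches " + targetPattern);
        return;
      }
      api.tabs.sendMessage(tabs[0].id, {
        message: "here is the data",
        data: request.data
      });
    });
  };
}

// Only register the listener when running inside an extension.
if (typeof chrome !== "undefined" && chrome.runtime && chrome.runtime.onMessage) {
  chrome.runtime.onMessage.addListener(
    createForwarder(chrome, "*://some/page/to/match/*")
  );
}
```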
Documentation
This was just an example, and I strongly recommend checking the documentation to fully understand the functions and methods involved. Here are some useful links:
chrome.tabs
.query
.onUpdated
.sendMessage
.executeScript
chrome.runtime
.onMessage
.sendMessage
Chrome extension message passing

How Do I Create a Google Chrome Extension that autofills username/password when I Click the icon?

I am having trouble figuring out what is wrong with my code so far. When I click the icon, the console shows that both background.js and autofill.js run, but the Gmail site is not autofilled. This is my first Chrome extension and my first time working with JavaScript. My ultimate goal is for it to autofill all sites (not just Gmail) and to store/read all the passwords in a .txt file. Also, when I run this code, it reports a problem in my autofill.js file with the error "Uncaught TypeError: Cannot set property 'value' of null", on the line in autofill.js right under the comment //fills in your username and password.
Thanks for taking the time to help me out. Any input would help, because I am stuck and have hit a wall.
manifest.json:
{
"name": "Test",
"manifest_version": 2,
"version": "1.0",
"description": "This is a Chrome extension that will autofill passwords",
"browser_action": {
"default_icon": "icon.png",
"default_popup":"popup.html",
"default_title": "PasswordFill"
},
//*********************************************************************
//declaring the permissions that will be used in this extension
"permissions": ["*://*.google.com/", "tabs", "activeTab", "*://*.yahoo.com/"],
"background": {
"scripts": ["background.js", "autofill.js"]
},
//*********************************************************************
/* Content scripts are JavaScript files that run in the context of web pages. By using the standard Document Object Model (DOM), they can read details of the web pages the browser visits, or make changes to them */
"content_scripts": [
{
//Specifies which pages this content script will be injected into
"matches": ["https://accounts.google.com/ServiceLoginAuth"],
//The list of JavaScript files to be injected into matching pages
"js": ["autofill.js"], //was background.js
//Controls when the files at "js" are being injected
"run_at": "document_end",
"all_frames": true
}
]
}
background.js:
console.log("Background.js Started .. "); //for debug purpose so i can see it in the console log
chrome.browserAction.onClicked.addListener(function (tab) { //Starts when User Clicks on ICON
chrome.tabs.executeScript(tab.id, {file: 'autofill.js'});
console.log("Script Executed .. "); // Notification on Completion
});
autofill.js:
console.log("Autofill.js Started .. "); //for debug purpose so i can see it in the console log
//define username and password
var myUsername = 'McAfee.sdp';
var myPassword = 'baskin310';
//finds the fields in your login form
var loginField = document.getElementById('Email');
var passwordField = document.getElementById('Passwd');
//fills in your username and password
loginField.value = myUsername;
passwordField.value = myPassword;
//automatically submits the login form
var loginForm = document.getElementById ('signIn');
loginForm.submit();

You should take some time and read the development guides first. Make sure you understand how debugging extensions works.
Also, as a general rule, if your script crashes at some line of code, execution will stop and the extension will most likely fail at whatever you wanted it to do (depending on where the crash happens), just like in any other application.
"Uncaught TypeError: Cannot set property 'value' of null"
That error tells you that you're trying to "access" (set) the property of a "missing" (null) object. loginField.value = myUsername; tries to access value of loginField so you can easily deduce that loginField is null which, in turn, means that var loginField = document.getElementById('Email'); didn't really work. But don't take my word for it, learn to debug it yourself.
Why it fails is a different story: extensions are "sandboxed" and can't run around changing page content whenever they feel like - you have "content scripts" for that. Go back to the docs and read the overview and content scripts sections.
In your specific case:
the only background script file should be background.js; remove autofill.js
make use of event pages instead of background pages whenever possible
autofill.js is a content script and you already have it declared in the manifest, so there is no need for programmatic injection using chrome.tabs.executeScript
learn how to communicate between background/toolbar and content scripts; you'll need it
your extension needs permission to access chrome.tabs.*, so add "tabs" to the list of permissions in your extension manifest
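As for the null error itself, one way to make it visible instead of fatal is to check the lookup result before using it. The following is only a sketch of that check (fillField is a hypothetical helper name, and the document is passed in so the function can be tried outside the browser); it does not address where the script runs, which is the underlying problem.

```javascript
// Hypothetical helper: set the value of the field with the given id.
// Logs and returns false instead of throwing when the field is missing.
function fillField(doc, id, value) {
  var field = doc.getElementById(id);
  if (!field) {
    console.log("No element with id '" + id + "' in this document");
    return false;
  }
  field.value = value;
  return true;
}

// Example use inside a content script running on the login page:
// fillField(document, "Email", myUsername);
// fillField(document, "Passwd", myPassword);
```

If this logs that the element is missing, the script is running in a document that does not contain the login form, for example the extension's background page.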
