Creating a Chrome extension to grab all page HTML - javascript

I'm attempting to create a Chrome extension to grab all website data. Tutorials often talk about 'modifying' a page, and they seem to subtly imply that you cannot get a whole page.
I found one Chrome API, pageCapture, which allows ALL resources from a page to be saved. I assume that means I could save the page and crawl the result afterwards, but this isn't desirable since it takes a lot more space and overhead.
I'd prefer some way to crawl the active tab. The tabs API lets you get the current Tab, but the Tab object doesn't seem to have a content attribute.
There must be a better way to do this. Does anyone know how to get the current page's HTML?
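For what it's worth, the usual direct route is to inject a script into the active tab and have it return the markup. A minimal sketch, assuming an MV2-era manifest with the activeTab (or tabs) permission:
// popup/background script; assumes "activeTab" in manifest.json
chrome.tabs.executeScript(
    { code: "document.documentElement.outerHTML" },
    function (results) {
        // results[0] is the injected code's return value
        var pageHtml = results && results[0];
        console.log(pageHtml);
    }
);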

I think this answer will help you:
Loading html into page element (chrome extension)
I have another solution that may help you: you can save the websites in your Chrome bookmarks and then fetch all of the URLs using:
var uploadUrls_bm_urls = [];
var maxUrls = 1000;
/* Fetch all user bookmarks from the browser */
/* @param array parentNode - an array of bookmark tree nodes */
function fetch_bookmarks(parentNode) {
    parentNode.forEach(function (bookmark) {
        // folders have no url property; only collect real bookmarks
        if (bookmark.url !== undefined && bookmark.url !== null) {
            if (uploadUrls_bm_urls.length < maxUrls) {
                uploadUrls_bm_urls.push(bookmark.url);
            }
        }
        if (bookmark.children) {
            fetch_bookmarks(bookmark.children);
        }
    });
}
After that you can iterate over all the URLs and use the "load" function as in the link above (Loading html into page element (chrome extension)).
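To actually kick off that traversal, you'd start from the bookmarks tree root. A sketch, assuming the "bookmarks" permission is declared in the manifest:
// assumes "bookmarks" permission in manifest.json
chrome.bookmarks.getTree(function (rootNodes) {
    fetch_bookmarks(rootNodes);
    console.log(uploadUrls_bm_urls.length + " bookmark URLs collected");
});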
Let me know if this helped you or not.
Thanks

Related

How to disable direct access to Iframe

Let's say normally my users access our web page via https://www.mycompany.com/go/mybusinessname
Inside this web page, we have an iframe whose content actually comes from https://www.mycompany.com/myapp.
Everything works fine, except that if the users somehow come to know about the URL https://www.mycompany.com/myapp, they can start accessing it directly by typing it into the address bar.
This is what I want to prevent them from doing. Is there any best practice to achieve this?
==== Update to provide more background ====
The parent page, https://www.mycompany.com, is the company's page and is maintained by another team. It provides all the generic headers, footers, etc., and each application is rendered as an iframe inside it. (This also means we cannot change the parent page's code.)
If users access https://www.mycompany.com/myapp directly, they won't see the header and footer. Yes, it's not a big deal, but I just want to maintain consistency.
Another concern of mine is that in our dev environment (i.e. when running the page locally) we don't have the parent-iframe setup; we access our page directly from http://localhost:port. Hence I want a solution that still lets us access the page normally when running locally.
If such a solution simply does not exist, please let me know as well :)
In your iframe's source, you can check the parent window using window.top.location and see whether it's set to 'https://www.mycompany.com/go/mybusinessname'. If not, redirect the page.
var myUrl = 'https://www.mycompany.com/go/mybusinessname';
if (window.top.location.href !== myUrl) {
    window.top.location.href = myUrl;
}
I realized we already had a function to determine whether the page is running under https://www.mycompany.com. So now I only need the code below to perform the redirect when our page is not in an iframe:
var expectedPathname = "/go/mybusinessname";
var getLocation = function (href) {
    var l = document.createElement("a");
    l.href = href;
    return l;
};
if (window == window.top) { // if not iframe
    var link = getLocation(window.top.location.href);
    if (link.pathname !== expectedPathname) {
        link.pathname = expectedPathname;
        window.top.location.replace(link.href);
    }
}
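To cover the asker's local-development concern, the same check can be skipped outside the production host. A sketch reusing getLocation from above (the localhost test is an assumption about the dev setup):
// skip the redirect during local development (assumption: dev runs on localhost)
var isLocalDev = (window.location.hostname === "localhost" ||
                  window.location.hostname === "127.0.0.1");
if (!isLocalDev && window == window.top) { // not in an iframe, not local
    var link = getLocation(window.top.location.href);
    if (link.pathname !== expectedPathname) {
        link.pathname = expectedPathname;
        window.top.location.replace(link.href);
    }
}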
You can use the HTTP Referer header on the server side. If the page is opened in an iframe, the referer contains the parent page's address; otherwise it is empty or contains a different page.
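A sketch of that server-side check in Express-style middleware (Node/Express is an assumption; note the Referer header can be absent or spoofed, so this enforces consistency rather than security):
// assumption: a Node/Express backend serving /myapp
app.use('/myapp', function (req, res, next) {
    var referer = req.get('Referer') || '';
    if (referer.indexOf('https://www.mycompany.com/go/mybusinessname') !== 0) {
        // not framed by the expected parent page; send users there instead
        return res.redirect('https://www.mycompany.com/go/mybusinessname');
    }
    next();
});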

Get Element available under Dev Tools -> Resources -> Frames

I'm trying to do this by using a Tampermonkey Script. However I'm open to new approaches...
What I want to do is extract some data (data-video) from a specific <div>. However, this data is not available in the page's HTML source, but it is available under Dev Tools -> Resources, and then on Frames.
Anyone knows if it's possible to get that information available under DevTools? And how can I do that?
A comparison between the two pages can be found here: "Original HTML PAGE" and "HTML PAGE under DevTools".
In the first one the id=video-canvas cannot be seen; however, it's in the <object type="application/x-shockwave-flash(...)
As you state in your question the data you're looking for is available in DevTools under the "Resources" tab in the "Frames" folder. What you are looking at there is the Source HTML, similar to View Source.
The code you want is what is getting replaced. It appears the site is using the JW Player plugin, which replaces <div id="video-canvas"> with the appropriate HTML for the detected device/browser to play the video. All of my browsers on my Mac are being forced to use Flash, even when it's disabled. My iPhone, which can't play Flash, uses JW's own custom video element when I inspect the page. The player must be storing the file location in memory, since it is not in the generated markup.
Through the console in the dev tools I am able to access their JS class. It appears I can call jwplayer._tracker, which has an object b. Object b has an object AlWv3iHmEeOzwBIxOUCPzg; this object seems to be consistent each time I check between different browsers (you can use the for loop in my first example, trimmed down to .b, to get the correct value). Inside that object is e, and in e is an object whose key is http://i.n.jwpltx.com/v1... - a really long string that appears to contain a URL, so it will need to be parsed.
So to get that string, I ran:
for (var loc in jwplayer._tracker.b.AlWv3iHmEeOzwBIxOUCPzg.e) {
    console.log(loc);
}
So if we put that in a function to parse the string and return a value:
function getSubURL() {
    var initURL;
    // grab the last key (the long tracking URL) from the e object
    for (var loc in jwplayer._tracker.b.AlWv3iHmEeOzwBIxOUCPzg.e) {
        initURL = loc;
    }
    // look for "mp4%3A" (URL-encoded "mp4:"); it sits in front of the file path
    var start = initURL.indexOf("mp4%3A");
    // look for ".mp4", the end of the file name
    var stop = initURL.indexOf(".mp4");
    // grab the string in between:
    // start+6 to skip the characters used to find it,
    // stop+4 to include the extension,
    // then decode the %2F separators back into slashes
    var subPath = initURL.substring(start + 6, stop + 4).split("%2F").join("/");
    return subPath;
}
//and run it
getSubURL();
It will return ciencia/astronomia/fimsol.mp4.
You can run this from your console. I am unaware of how you can use this in Tampermonkey, but I think it gets you a lot closer to what you wanted.
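Since that AlWv3iHmEeOzwBIxOUCPzg key may well differ per session or player instance (an assumption; _tracker is an undocumented internal), a more defensive sketch just walks whatever keys exist:
function getTrackedUrl() {
    // walk jwplayer._tracker.b without hardcoding the session key
    for (var key in jwplayer._tracker.b) {
        for (var url in jwplayer._tracker.b[key].e) {
            return url; // first tracked URL string, ready for getSubURL-style parsing
        }
    }
    return null;
}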
This is the approach I've used to solve my problem... I couldn't grab the code I wanted under Dev Tools, but I found a way to get the data from jwplayer with the function getPlaylistItem. This is how I get the URL filename of each video:
function getFilename(fallback) {
    // if the player API is available, read the file from the current playlist item
    if (jwplayer().getPlaylistItem) {
        var filename = jwplayer().getPlaylistItem()['file'];
        // strip everything up to and including "/mp4:" to leave the bare path
        return filename.substring(filename.indexOf("/mp4:") + 5);
    }
    // otherwise fall back to the value passed in
    return fallback;
}
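A usage sketch (onReady is the JW Player 6-era event hook; the argument is just a fallback value returned when the playlist API is missing):
jwplayer().onReady(function () {
    // e.g. logs "ciencia/astronomia/fimsol.mp4"
    console.log(getFilename("default.mp4"));
});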

fix chrome zoom issues

How can I set different zoom levels on different sites? Can I use window.location to get the URL from the Chrome address bar and set a zoom level for that specific site? How can I modify this code to use window.location or window.location.href?
function zoom(zp) {
    var page = document.getElementsByTagName('html')[0];
    if (page != null) {
        page.style.zoom = zp + "%";
    }
}
chrome.extension.sendRequest(
    {"type": "setZoom"},
    function (zp) {
        zoom(zp);
    }
);
Firstly, note that Chrome already handles per-domain zoom levels set with Ctrl + + and Ctrl + -, but this is different from html.style.zoom.
You can certainly do what you're trying to, but you'll need to inject a content script into the page whose CSS you want to manipulate. Then, you can send messages to that injected script from another part of your extension and get the desired result. You can keep track of zoom levels per URL by (for example) storing a {url: zoomLevel} hash table in your extension's localStorage.
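A rough sketch of that wiring, using the same MV2-era chrome.extension messaging the question already uses (the zoomLevels key and message shape are illustrative, not a real API):
// content script: ask the background page for this site's zoom level
chrome.extension.sendRequest(
    { type: "getZoom", host: window.location.hostname },
    function (zp) {
        document.getElementsByTagName('html')[0].style.zoom = zp + "%";
    }
);
// background page: look up the stored {url: zoomLevel} table
chrome.extension.onRequest.addListener(function (request, sender, sendResponse) {
    if (request.type === "getZoom") {
        var levels = JSON.parse(localStorage.zoomLevels || "{}");
        sendResponse(levels[request.host] || 100); // default to 100%
    }
});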
Note that there are problems using the html.style.zoom property: for example, it doesn't work on iframes. There's an extensive discussion about this here: http://crbug.com/30583

getting last page URL from history object - cross browser?

Is it possible to get the last page URL from the history object? I've come across history.previous, but that's either undefined or protected, from what I've seen.
Not from the history object, but from document.referrer. If you want to get the last actual page visited, there is no cross-browser way without making a separate case based on support for each property.
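For example:
// URL of the page that linked here; an empty string for direct navigations
var lastPageUrl = document.referrer;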
You can't get to the history in any browser. That would be a serious security violation, since it would mean that anyone could snoop around the history of their users.
You might be able to write a Browser Helper Object for IE (and the equivalent for other browsers) that gives you access to that, similar to the Google toolbar et al. But that will require users to allow that application to run on their machine.
There are some nasty ways to get at some history using "not-so-nice" tricks, but I would not recommend them. Look up this link.
Of course, as people have said, it's not possible. However, what I've done to get around this limitation is simply to store every page loaded into localStorage, so you can create your own history:
function writeMyBrowserHistory(historyLength = 3) {
    // Store the last historyLength page paths for use on other pages
    var stored = localStorage.myPageHistory; // undefined when never set
    var pagesArr = stored ? JSON.parse(stored) : [];
    pagesArr.push(window.location.pathname); // can use whichever part, but a full url needs encoding
    if (pagesArr.length > historyLength) {
        // truncate the array to the most recent entries
        pagesArr = pagesArr.slice(pagesArr.length - historyLength, pagesArr.length);
    }
    // store it back
    localStorage.myPageHistory = JSON.stringify(pagesArr);
    // optional debug
    console.log(`my page history = ${pagesArr}`);
}
function getLastMyBrowserHistoryUrl() {
    var stored = localStorage.myPageHistory;
    var url = "";
    if (stored) {
        var pagesArr = JSON.parse(stored);
        // pop off the most recent url
        url = pagesArr.pop();
    }
    return url;
}
So then, in a js file included on every page, call
writeMyBrowserHistory()
When you want to figure out the last page, call
var lastPageUrl = getLastMyBrowserHistoryUrl()
Note: localStorage stores strings only, hence the JSON.
Let me know if I have any bugs in the code, as it's been beautified from the original.

How to use javascript to get information from the content of another page (same domain)?

Let's say I have a web page (/index.html) that contains the following
<li>
<div>item1</div>
details
</li>
and I would like to have some javascript on /index.html to load that
/details/item1.html page and extract some information from that page.
The page /details/item1.html might contain things like
<div id="some_id">
picture
map
</div>
My task is to write a Greasemonkey script, so changing anything server-side is not an option.
To summarize: JavaScript is running on /index.html, and I would like that JavaScript to add some information to /index.html, extracted from both /index.html and /details/item1.html.
My question is how to fetch information from /details/item1.html.
I have currently written code to extract the link (e.g. /details/item1.html) and pass it on to a method that should extract the wanted information (at first, just the .innerHTML of the some_id div is fine; I can process it further later).
The following is my current attempt, but it does not work. Any suggestions?
function get_information(link)
{
    var obj = document.createElement('object');
    obj.data = link;
    document.getElementsByTagName('body')[0].appendChild(obj);
    var some_id = document.getElementById('some_id');
    if (!some_id) {
        alert("some_id == NULL");
        return "";
    }
    return some_id.innerHTML;
}
First:
function get_information(link, callback) {
    var xhr = new XMLHttpRequest();
    xhr.open("GET", link, true);
    xhr.onreadystatechange = function () {
        if (xhr.readyState === 4) {
            callback(xhr.responseText);
        }
    };
    xhr.send(null);
}
then
get_information("/details/item1.html", function (text) {
    var div = document.createElement("div");
    div.innerHTML = text;
    // Do something with the div here, like inserting it into the page
});
I have not tested any of this - off the top of my head. YMMV
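Since this is a Greasemonkey script, GM_xmlhttpRequest is another option; a sketch (same-domain here, so the plain XHR above is enough, and older GM versions may want an absolute URL):
GM_xmlhttpRequest({
    method: "GET",
    url: "/details/item1.html",
    onload: function (response) {
        // parse the fetched markup off-DOM and pull out the target div
        var div = document.createElement("div");
        div.innerHTML = response.responseText;
        var someId = div.querySelector("#some_id");
        console.log(someId ? someId.innerHTML : "not found");
    }
});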
As only one page exists in the client (browser) at a time, and all other (virtual/possible) pages are on the server, you will have to interact with the server at some point to retrieve the second page.
If you can, integrate an AJAX request to load the second page (and parse it). If that's not an option, you'll have to load all the pages you want to extract information from at the same time, hide the bits you don't want to show (in hidden DIVs?), and then have your index (or whatever controls the view) retrieve the needed information from there... even though that sounds pretty creepy ;)
You can load the page in a hidden iframe and use normal DOM manipulation to extract the results, or get the text of the page via AJAX, grab the part between <body...> and </body>, and temporarily inject it into a div. (The second might fail for some exotic elements like ins.) I would expect Greasemonkey to have more powerful functions than normal JavaScript for stuff like that, though; it might be worth thumbing through the documentation.
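A sketch of the hidden-iframe variant (same domain, so contentDocument is accessible):
function get_information_via_iframe(link, callback) {
    var iframe = document.createElement("iframe");
    iframe.style.display = "none";
    iframe.src = link;
    iframe.onload = function () {
        var doc = iframe.contentDocument || iframe.contentWindow.document;
        var someId = doc.getElementById("some_id");
        callback(someId ? someId.innerHTML : "");
        document.body.removeChild(iframe); // clean up
    };
    document.body.appendChild(iframe);
}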
