I'm trying to do this with a Tampermonkey script, but I'm open to other approaches.
What I want to do is extract some data (data-video) from a specific <div>. However, this data is not available in the HTML source of the page; it is only visible in DevTools -> Resources, under Frames.
Does anyone know whether it's possible to get the information shown in DevTools, and how I can do that?
A comparison of the two pages can be found here: "Original HTML PAGE" and "HTML PAGE under DevTools".
On the first hyperlink the id=video-canvas cannot be seen, however it's on the <object type="application/x-shockwave-flash(...)
As you state in your question the data you're looking for is available in DevTools under the "Resources" tab in the "Frames" folder. What you are looking at there is the Source HTML, similar to View Source.
The code you want is what is being replaced. It appears the site is using the JW Player plugin, which replaces the <div id="video-canvas"> with the appropriate HTML for the detected device / browser to play the video. All of the browsers on my Mac are forced to use Flash, even when it's disabled. When I inspect the page on my iPhone, which can't play Flash, it uses JW's own custom video element. It appears that the file location must be stored in memory, since it is not in the generated markup.
I am able to go through the console in the dev tools and access their JS class. It appears I can call jwplayer._tracker, which has an object b. Object b has an object AlWv3iHmEeOzwBIxOUCPzg. This key seems to be consistent each time I check across different browsers; you can also use the for loop from my first example to get the correct value by trimming it down to .b. Inside that object is e, and in e is a key http://i.n.jwpltx.com/v1.... , a really long string that appears to contain a URL, so it will need to be parsed.
So to get that string I ran:
for ( var loc in jwplayer._tracker.b.AlWv3iHmEeOzwBIxOUCPzg.e){
loc
}
So if we put that in a function that parses the string and returns a value:
function getSubURL(){
var initURL;
for ( var loc in jwplayer._tracker.b.AlWv3iHmEeOzwBIxOUCPzg.e){
initURL = loc;
}
//look for 'mp4:' this is in front of the file path
var start = initURL.indexOf("mp4%3A");
//look for the .mp4 for the end of the file name
var stop = initURL.indexOf(".mp4");
//grab the string between
//start+6 to remove characters used to find it
//and stop+4 to include characters used to find it
var subPath = (initURL.substring((start+6),(stop+4))).split("%2F").join("/");
return subPath;
}
//and run it
getSubURL();
It will return ciencia/astronomia/fimsol.mp4
You can run this from your console. I am unaware of how you can use this in Tampermonkey, but I think it gets you a lot closer to what you wanted.
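The manual %3A / %2F handling above can also be left to decodeURIComponent. The following is only a sketch: the helper name extractMp4Path and the sample URL are made up for illustration, and it assumes the tracker string is validly percent-encoded (decodeURIComponent throws on malformed sequences).

```javascript
// Hypothetical helper: pull the path that follows "mp4:" out of a
// percent-encoded tracker URL.
function extractMp4Path(trackerUrl) {
  // Decode %3A, %2F and friends in one step instead of replacing them by hand
  var decoded = decodeURIComponent(trackerUrl);
  // Lazily capture everything after "mp4:" up to the first ".mp4"
  var match = /mp4:(.+?\.mp4)/.exec(decoded);
  return match ? match[1] : null;
}

// Made-up sample in roughly the shape described above (not the real URL)
var sample = "http://example.invalid/v1/ping?mu=rtmp%3A%2F%2Fhost%2Fapp%2Fmp4%3Aciencia%2Fastronomia%2Ffimsol.mp4&n=1";
console.log(extractMp4Path(sample));
```

Returning null when nothing matches makes it easier to notice that the tracker format has changed than returning a wrongly cut string.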
This is the approach I used to solve my problem. I couldn't grab the code I wanted from Dev Tools, but I found a way to get the data from jwplayer with the function getPlaylistItem. This is how I get the URL filename of each video:
function getFilename(filename) {
  // Fall back to the value passed in if the player API is unavailable
  if (!jwplayer().getPlaylistItem) {
    return filename;
  }
  filename = jwplayer().getPlaylistItem()['file'];
  // Keep only the part after the "/mp4:" prefix
  filename = filename.substring(filename.indexOf("/mp4:") + 5);
  return filename;
}
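As a sketch of the same idea with a guard for files that do not contain "/mp4:", the function below takes the player object as a parameter. That parameter, the name getVideoPath and the stub are assumptions made so the code can be tried outside the page; on the real page you would pass jwplayer().

```javascript
// `player` stands in for the object returned by jwplayer() (assumption:
// it is passed in rather than looked up, so a stub can be used).
function getVideoPath(player, fallback) {
  if (!player || typeof player.getPlaylistItem !== "function") {
    return fallback;
  }
  var item = player.getPlaylistItem();
  if (!item || typeof item.file !== "string") {
    return fallback;
  }
  var marker = "/mp4:";
  var at = item.file.indexOf(marker);
  // If the marker is missing, return the file unchanged instead of cutting it
  return at === -1 ? item.file : item.file.substring(at + marker.length);
}

// Stub player with a made-up file value
var stubPlayer = {
  getPlaylistItem: function () {
    return { file: "rtmp://host.invalid/app/mp4:ciencia/astronomia/fimsol.mp4" };
  }
};
console.log(getVideoPath(stubPlayer, null));
```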
I'm attempting to create a Chrome extension to grab all website data. Tutorials often talk about 'modifying' a page, but they seem to subtly imply that you cannot get a whole page.
I found one Chrome API, pageCapture, which allows ALL resources from a page to be saved. I assume this means I could find the HTML and crawl it afterwards, but this isn't desirable since it takes a lot more space and overhead.
I'd prefer some way to crawl the active tab. The tabs API allows you to get the current tab, but the current tab doesn't seem to have a content attribute.
There must be a better way to do that. Does anyone know how to get the current page's HTML?
I think this answer will help you:
Loading html into page element (chrome extension)
I have another solution that may help you: if you want, you can save the websites in your Chrome bookmarks and then fetch all of the data using:
var uploadUrls_bm_urls ='';
var uploadUrls_temp = '';
var maxUrls = "1000";
/* Fetch all user bookmark from browser */
/* @param object parentNode - the parent node of bookmark tree */
function fetch_bookmarks(parentNode) {
parentNode.forEach(function(bookmark) {
if(! (bookmark.url === undefined || bookmark.url === null)) {
uploadUrls_bm_urls = uploadUrls_bm_urls + '"' + bookmark.url + '",';
if(uploadUrls_bm_urls.length <= maxUrls )
uploadUrls_temp = uploadUrls_bm_urls;
}
if (bookmark.children) {
fetch_bookmarks(bookmark.children);
}
});
}
After that, you can iterate over all the URLs and use the "load" function as in the link above (Loading html into page element (chrome extension)).
Let me know if this helped you or not.
Thanks
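The bookmark walk above builds one long string; a variant that returns an array is sketched below. The function name collectBookmarkUrls, the limit parameter (here a count of URLs, not a string length) and the sample tree are illustrative assumptions; the tree shape follows what chrome.bookmarks.getTree passes to its callback.

```javascript
// Recursively collect bookmark URLs into an array, stopping at `limit` entries.
function collectBookmarkUrls(nodes, limit, out) {
  out = out || [];
  for (var i = 0; i < nodes.length && out.length < limit; i++) {
    var node = nodes[i];
    if (node.url) {
      out.push(node.url); // leaf bookmark
    }
    if (node.children) {
      collectBookmarkUrls(node.children, limit, out); // folder
    }
  }
  return out;
}

// Made-up tree for illustration
var sampleTree = [{
  title: "root",
  children: [
    { title: "A", url: "http://a.example/" },
    { title: "Folder", children: [{ title: "B", url: "http://b.example/" }] },
    { title: "C", url: "http://c.example/" }
  ]
}];
console.log(collectBookmarkUrls(sampleTree, 1000));

// In an extension with the "bookmarks" permission this could be wired up as:
// chrome.bookmarks.getTree(function (tree) {
//   var urls = collectBookmarkUrls(tree, 1000);
// });
```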
I was asked to add this code to my pitch pages by the vendor I sell through:
<script>
(function() {
var p = '/?vendor=2knowmysel&time=' + new Date().getTime();
var cb = document.createElement('script'); cb.type = 'text/javascript';
cb.src = '//header.clickbank.net' + p;
document.getElementsByTagName('head')[0].appendChild(cb);
})();
</script>
The code should let the page load within a header that has a red logo by clickbank. When I added the code in the head section nothing happened.
Next I tried to isolate the problem by posting on a blank html page (away from drupal) which is http://www.2knowmyself.com/testpage.htm.
But the frame doesn't show up.
What's wrong here, given that ClickBank claims the code is perfect?
Here's what your code does:
<script>
// the following line creates an anonymous immediately-invoked function
(function() {
// this creates a string named 'p', which contains the vendor's ID and the current time
var p = '/?vendor=2knowmysel&time=' + new Date().getTime();
// this creates a new 'script' element, names it 'cb', and marks it as JavaScript
var cb = document.createElement('script'); cb.type = 'text/javascript';
// this will take a url with the address and the query string, which you named 'p' earlier, and set it as the source for 'cb'
cb.src = '//header.clickbank.net' + p;
// now you'll insert 'cb' to the HTML, so it'll load the JavaScript file into it
document.getElementsByTagName('head')[0].appendChild(cb);
// a function expression doesn't run just by being defined, so the trailing parentheses invoke it immediately
})();
</script>
Summing it up, it basically sends the vendor's ID and the current time to the given server and expects a JavaScript file in return; it then loads this file into your HTML document.
Currently, it seems not to be working because this server is getting the information from your page but not sending the JavaScript file back to it. When they adjust it to answer with the right file, you'll see it run accordingly.
EDIT: (to answer your Final Question)
Up to this point, I can see that their server isn't sending the expected JS file back to your page, so it doesn't work. If you want to check this yourself, use a JS debugger or a network monitor in your browser (most modern web browsers come with these built in; try pressing F12 and then reloading the page).
If you want to check whether iframes work on your server, you may contact its administrator or try to embed an iframe in the page yourself. Paste the following code into the document. If you see SO homepage, it works. Otherwise, it'll show nothing. If you see Your browser does not support iframes., then you might have to update your web browser and check it again.
<iframe src="http://stackoverflow.com" width="300" height="300">
<p>Your browser does not support iframes.</p>
</iframe>
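Besides the network monitor, the page itself can report whether the injected script was fetched. This is a sketch, not ClickBank's code: injectScript and buildHeaderUrl are hypothetical helpers, "yourvendor" is a placeholder ID, and the document is passed in as a parameter so the function can be tried with a stub.

```javascript
// Inject a script element and report whether the browser could load it.
function injectScript(doc, src, onResult) {
  var el = doc.createElement("script");
  el.src = src;
  el.onload = function () { onResult(true); };   // fetched and executed
  el.onerror = function () { onResult(false); }; // network error, 404, blocked...
  doc.getElementsByTagName("head")[0].appendChild(el);
  return el;
}

// Build the same URL shape as the vendor snippet (vendor ID is a placeholder)
function buildHeaderUrl(vendorId, now) {
  return "//header.clickbank.net/?vendor=" + encodeURIComponent(vendorId) + "&time=" + now;
}

// Only runs in a browser page, where `document` exists
if (typeof document !== "undefined") {
  injectScript(document, buildHeaderUrl("yourvendor", Date.now()), function (ok) {
    console.log(ok ? "header script loaded" : "header script failed to load");
  });
}
```

Note that a load event only says the request succeeded; a response with an empty body still counts as loaded, so the network monitor remains the place to see what actually came back.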
I'm trying to migrate from Feedly, as it is unacceptable (at least to me) that search is fully enabled only in the Pro version.
Anyhow, to export my lengthy "saved for later" list I found some lovely scripts:
Simple script that exports a users "Saved For Later" list out of Feedly as a JSON string and feedly-to-pocket, where I am instructed to:
You must switch off SSL (http rather than https) or jQuery won't load!
So I thought I did, by adding (Ubuntu 14.04 / Chrome 40 x64)
--ssl-version-min=tls1
to my /usr/share/applications/google-chrome.desktop file (all lines starting with Exec=). However, when I try to run it in the browser console I get
This request has been blocked; the content must be served over HTTPS.
So, any suggestions? (Also, excuse my noobness.)
Go to your Feedly "saved" list and scroll down until all articles have loaded.
Open console and paste the following Javascript into it:
function loadJQuery() {
script = document.createElement('script');
script.setAttribute('src', '//code.jquery.com/jquery-2.1.3.js');
script.setAttribute('type', 'text/javascript');
script.onload = loadSaveAs;
document.getElementsByTagName('head')[0].appendChild(script);
}
function loadSaveAs() {
saveAsScript = document.createElement('script');
saveAsScript.setAttribute('src', 'https://cdn.rawgit.com/eligrey/FileSaver.js/5733e40e5af936eb3f48554cf6a8a7075d71d18a/FileSaver.js');
saveAsScript.setAttribute('type', 'text/javascript');
saveAsScript.onload = saveToFile;
document.getElementsByTagName('head')[0].appendChild(saveAsScript);
}
function saveToFile() {
// Loop through the DOM, grabbing the information from each bookmark
map = jQuery(".entry.quicklisted").map(function(i, el) {
var $el = jQuery(el);
var regex = /Published:(.*)(.*)/i;
return {
title: $el.attr("data-title"),
url: $el.attr("data-alternate-link"),
summary: $el.find(".summary")[0].innerHTML,
time: regex.exec($el.find("span.ago").attr("title"))[1]
};
}).get(); // Convert jQuery object into an array
// Convert to a nicely indented JSON string
json = JSON.stringify(map, undefined, 2);
var blob = new Blob([json], {type: "text/plain;charset=utf-8"});
saveAs(blob, "FeedlySavedForLater" + Date.now().toString() + ".txt");
}
loadJQuery()
Source: Feedly-Export-Save4Later
Not javascript but here is how I saved a html page with all the links and excerpts...
Open the saved pages in feedly in chrome
scroll down so they are all there
inspect any element (the top article is a good choice) so it opens the generated html
find the div id="section0_column0" node
right-click & copy it
paste into Notepad++
this html is untidy so carry on...
Do a Regex find & replace
find: (?s)<div id=.+?_main.+?>.+?(<a href=")(.+?)(").+?sans-serif">(.+?)</span>.+?</div>.+?</div>.+?</div>
replace: <div>$1$2$3>$2</a></div> <div> $4<br /> <br /></div>
save the html page.
open it in Chrome
Posted the question in the jquery forum and the solution was rather simple (remove http from attribute string)
line 34 should be
script.setAttribute('src', '//code.jquery.com/jquery-latest.min.js');
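A protocol-relative URL like the one above inherits the scheme of the page it is loaded from, so on an https page the script is requested over https and is not blocked as mixed content. As a small illustration (the helper name toProtocolRelative is made up), the scheme can be stripped like this:

```javascript
// Turn an absolute http/https URL into a protocol-relative one.
function toProtocolRelative(url) {
  // Remove only a leading "http:" or "https:"; the "//host/path" part stays
  return url.replace(/^https?:/i, "");
}

console.log(toProtocolRelative("http://code.jquery.com/jquery-latest.min.js"));
// prints "//code.jquery.com/jquery-latest.min.js"
```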
So to close the loop - for a full searchable/archived list of links not only by title/url but context also(!) you can:
Follow the instructions in https://github.com/ShockwaveNN/feedly-to-pocket (with the correction suggested by kind stranger jakecigar and you also have to register a pocket app (obtain consumer key) for the ruby script to work)
Export html list from your pocket account
Import pocket list to a Kifi library
and at last feedly-free with my personal search engine
I know I'm a bit late to the party, but I've been hunting around for a few days to find a reasonably simple solution, and none has been listed clearly or concisely on Stack Overflow or elsewhere on the web. I have in fact found a much easier way to do this.
Use the JavaScript from this Gist just as it instructs: https://gist.github.com/ShockwaveNN/a0baf2ca26d1711f10e2 (note this is referenced above and found through the link @gep shared in step one)
Once the JS has finished running it will download a text file. (It still runs successfully, even on large numbers; I just exported almost 2500 articles.)
Create a blank test.json in SublimeText.
Copy all entries from your exported text file into this json file
Weirdly, it does seem you need to copy and paste: I tried just renaming the text file, and when I did that I received errors on the next step
Make sure you are signed into Pocket
Go here: https://getpocket.com/import/springpad
Select your newly created test.json
Upload
Note: On large uploads the import page fails to refresh (this did not seem to be an issue as all my articles did make it into my account)
This allows you to upload JSON directly into your Pocket account, so no more messing around with other supposed fixes. I hope this makes it a lot easier for everyone in the future.
I'm trying to get to grips with the Firefox addon SDK (previously known as Jetpack from what I understand), but I'm having problems working with the DOM.
I need to iterate over all of the text nodes in the DOM when a web page loads and make changes to some of the strings that they contain. I've posted a simplified version of what I'm doing below (new to Javascript, so forgive me any oddities).
// test.js
function parseElement(Element)
{
if (Element == null)
return;
var i = 0;
var Result = false;
if (Element.hasChildNodes())
{
var children = Element.childNodes;
while (i <= children.length - 1)
{
var child = children.item(i);
parseElement(child);
i++;
}
}
if (Element.nodeType == 3)
{
// For testing - see what the text node contains
alert(Element.nodeValue);
Result = true;
}
return Result;
}
window.addEventListener("load", function load(event)
{
window.removeEventListener("load", load, false);
parseElement(document.body);
}, false);
When I create a basic HTML document:
<!-- test.html -->
<html>
<head>
<script type="text/javascript" src="test.js"></script>
</head>
<body>
<b>hello world</b>
<p>foo</p>
<i>test</i>
</body>
</html>
...include this Javascript file in the HEAD section then open it in Firefox, the "alert" displays 6 dialog boxes containing:
1) "hello world"
2) blank -> no visible characters, just a newline
3) "foo"
4) blank -> no visible characters, just a newline
5) "test"
6) blank -> no visible characters, just a newline
Exactly what I would expect to see.
The problem arises when I create an addon and use test.js as a page-mod Content Script from my main.js file (modified to remove the "addEventListener" part). When I use "cfx run" to start Firefox with my addon installed, then open the same HTML document (with the "script" part for the test.js file commented), the alerts do not display at all.
So that's the first puzzle. But having also navigated to other web pages - for example, a YouTube video page - the alert DOES display several dialogs, but they include very strange strings, mostly the content of script tags:
EDIT I don't have enough reputation to embed an image, so here's a link instead showing the sort of thing I mean instead: http://img46.imageshack.us/img46/5994/mtpd.jpg
And again, the text I would expect to see is absent.
Apologies for some of the redundancy below, but just to be clear: this is my main.js:
main.js
var data = require("sdk/self").data;
var pageMod = require("sdk/page-mod");
exports.main = function()
{
pageMod.PageMod({
include: "*",
contentScriptFile: [data.url("test.js")]
});
}
And the modified version of the Javascript file is identical to the "test.js" listing above, but for the end part:
test.js
<snip>
...
return Result;
}
parseElement(document.body);
I've included my project files (if I can call them that) in a zip if it makes things easier to visualise: http://www.mediafire.com/?774iprbngtlgkcp
I've tried changing
parseElement(document.body);
to
parseElement(unsafeWindow.document.body);
in case it makes any difference, but the outcome is identical.
So I'm very puzzled about what's happening. I can't understand why the test.js file isn't picking out the text nodes (and only the text nodes) from the DOM when I use it as part of an addon, but does exactly what I would anticipate when included as a script in a HTML document. Can anyone shed any light on this?
Thank you in advance.
Errors in your lib code and contentScripts are usually logged to the Error Console. Check what is printed there. Also see the SDK console module.
Your page-mod won't run because by default page-mods will run only after the load event.
See the contentScriptWhen documentation.
script tags actually often have a text-node child containing the inline script source. So it is absolutely normal that those are enumerated as well.
For some discussion about walking tree nodes, see: getElementsByTagName() equivalent for textNodes
However, if you're after the text of specific ids/classes, consider using document.querySelector/.querySelectorAll, or if you're after nodes that have a specific XPath, use document.evaluate. This very likely will be a lot faster.
Other than that, I cannot really tell what exactly your remaining issues are or what you're trying to achieve in the first place, so I cannot advise on that.
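Since the text inside script tags is enumerated like any other text node, a walker that wants only visible text has to skip those parents explicitly. The sketch below is one way to do that; collectText and the stub tree are illustrative, and the function relies only on nodeType, nodeName, nodeValue and childNodes, so it also works on plain objects.

```javascript
// Elements whose text children should not be treated as page text
var SKIPPED_PARENTS = { SCRIPT: true, STYLE: true, NOSCRIPT: true };

// Collect the non-blank text node values below `node`
function collectText(node, out) {
  out = out || [];
  if (!node) {
    return out;
  }
  if (node.nodeType === 3) { // 3 = text node
    if (/\S/.test(node.nodeValue)) {
      out.push(node.nodeValue); // ignore whitespace-only nodes
    }
    return out;
  }
  if (node.nodeType === 1 && SKIPPED_PARENTS[String(node.nodeName).toUpperCase()]) {
    return out; // do not descend into script/style content
  }
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    collectText(children[i], out);
  }
  return out;
}

// Stub tree mimicking <body><b>hello world</b>\n<script>var x;</script><p>foo</p></body>
var stubBody = {
  nodeType: 1, nodeName: "BODY", childNodes: [
    { nodeType: 1, nodeName: "B", childNodes: [{ nodeType: 3, nodeValue: "hello world" }] },
    { nodeType: 3, nodeValue: "\n" },
    { nodeType: 1, nodeName: "SCRIPT", childNodes: [{ nodeType: 3, nodeValue: "var x;" }] },
    { nodeType: 1, nodeName: "P", childNodes: [{ nodeType: 3, nodeValue: "foo" }] }
  ]
};
console.log(collectText(stubBody));

// In a content script the same call would be: collectText(document.body)
```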
You noted that:
I've discovered that my add-on is NOT executed when a document is
accessed via File->Open File.
That is by design. The match-pattern documentation says:
A single asterisk matches any URL with an http, https, or ftp scheme.
For other schemes like file, resource, or data, use a scheme followed
by an asterisk, as below.
You can use the regular expression /.*/ to match all sites and all schemes.
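Put together, the options for the page-mod might look like the sketch below. The object is built on its own so it can be inspected outside Firefox; the "test.js" entry is a placeholder (the main.js in the question uses data.url("test.js")), and "end" is written out explicitly although it is assumed to be the default.

```javascript
// Options sketch for an Add-on SDK page-mod that should also run on file: URLs
var pageModOptions = {
  include: /.*/,                  // regular expression instead of "*", so every scheme matches
  contentScriptWhen: "end",       // other documented values are "start" and "ready"
  contentScriptFile: ["test.js"]  // placeholder path
};

// In the add-on's main.js (Add-on SDK only) this would be passed as:
//   require("sdk/page-mod").PageMod(pageModOptions);

console.log(pageModOptions.include.test("file:///home/user/test.html")); // prints true
```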
I would like to change (or add, if it doesn't exist) the setting in a multi-page PDF file that forces the PDF to open in two-page mode (PageLayout: TwoPageLeft, for example).
I tried this kind of JavaScript (provided with Enfocus FullSwitch as an example):
if(($error == null) && ($doc != null))
{
try
{
$outfile = $outfolder + '/' + $filename + ".pdf";
$doc.layout = "TwoPageLeft";
$doc.saveAs( {cPath : $outfile, bCopy : true});
$outfiles.push($outfile);
}
catch(theError)
{
$error = theError;
$doc.closeDoc( {bNoSave : true} );
}
}
But it doesn't work as I would like: the file is opened with Acrobat Pro and saved as a new file without the layout setting.
Can anyone help me correct this code so that JS opens the PDF file, sets the layout inside the PDF data and saves it out?
The readable information inside the PDF file should look like this:
PageLayout/TwoPageLeft/Type/Catalog/ViewerPreferences
For information, I'm using FullSwitch (Enfocus) to handle files in a workflow, with Acrobat Pro, and at this time, it's only saving the file without adding the setting.
I couldn't find the answer myself anywhere on the Web in my recent searches, so I ask…
Thanks in advance!
I think you copied the "this.layout = ..." line out of the Acrobat JavaScript reference documentation, correct?
When you write a JavaScript for Switch to execute (or rather for Switch to instruct Acrobat to execute for you), you should use the "$doc" variable to refer to the document Switch is processing.
So try changing the line:
$this.layout = "TwoColumnLeft";
to
$doc.layout = "TwoColumnLeft";
Since you say the rest of the code works and the document is saved without errors, I assume the rest of your code is correct. The change proposed here should make the adjustment in the document you're looking for.