I'm trying to get to grips with the Firefox addon SDK (previously known as Jetpack from what I understand), but I'm having problems working with the DOM.
I need to iterate over all of the text nodes in the DOM when a web page loads and make changes to some of the strings they contain. I've posted a simplified version of what I'm doing below (I'm new to JavaScript, so forgive any oddities).
// test.js
function parseElement(Element)
{
if (Element == null)
return;
var i = 0;
var Result = false;
if (Element.hasChildNodes())
{
var children = Element.childNodes;
while (i <= children.length - 1)
{
var child = children.item(i);
parseElement(child);
i++;
}
}
if (Element.nodeType == 3)
{
// For testing - see what the text node contains
alert(Element.nodeValue);
Result = true;
}
return Result;
}
window.addEventListener("load", function load(event)
{
window.removeEventListener("load", load, false);
parseElement(document.body);
}, false);
When I create a basic HTML document:
<!-- test.html -->
<html>
<head>
<script type="text/javascript" src="test.js"></script>
</head>
<body>
<b>hello world</b>
<p>foo</p>
<i>test</i>
</body>
</html>
...include this Javascript file in the HEAD section then open it in Firefox, the "alert" displays 6 dialog boxes containing:
1) "hello world"
2) blank -> no visible characters, just a newline
3) "foo"
4) blank -> no visible characters, just a newline
5) "test"
6) blank -> no visible characters, just a newline
Exactly what I would expect to see.
The problem arises when I create an addon and use test.js as a page-mod Content Script from my main.js file (modified to remove the "addEventListener" part). When I use "cfx run" to start Firefox with my addon installed, then open the same HTML document (with the "script" part for the test.js file commented), the alerts do not display at all.
So that's the first puzzle. But having also navigated to other web pages - for example, a YouTube video page - the alert DOES display several dialogs, but they include very strange strings, mostly the content of script tags:
EDIT: I don't have enough reputation to embed an image, so here's a link showing the sort of thing I mean: http://img46.imageshack.us/img46/5994/mtpd.jpg
And again, the text I would expect to see is absent.
Apologies for some of the redundancy below, but just to be clear: this is my main.js:
main.js
var pageMod = require("sdk/page-mod");
var data = require("sdk/self").data;
exports.main = function()
{
pageMod.PageMod({
include: "*",
contentScriptFile: [data.url("test.js")]
});
}
And the modified version of the Javascript file is identical to the "test.js" listing above, but for the end part:
test.js
<snip>
...
return Result;
}
parseElement(document.body);
I've included my project files (if I can call them that) in a zip if it makes things easier to visualise: http://www.mediafire.com/?774iprbngtlgkcp
I've tried changing
parseElement(document.body);
to
parseElement(unsafeWindow.document.body);
in case it makes any difference, but the outcome is identical.
So I'm very puzzled about what's happening. I can't understand why the test.js file isn't picking out the text nodes (and only the text nodes) from the DOM when I use it as part of an addon, but does exactly what I would anticipate when included as a script in a HTML document. Can anyone shed any light on this?
Thank you in advance.
Errors in your lib code and contentScripts are usually logged to the Error Console. Check what is printed there. Also see the SDK console module.
Your page-mod won't run because by default page-mods will run only after the load event.
See the contentScriptWhen documentation.
script tags often have a text-node child containing the inline script source, so it is perfectly normal for those to be enumerated as well.
For some discussion about walking tree nodes, see: getElementsByTagName() equivalent for textNodes
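To illustrate the point about script elements, here is a minimal sketch of a text-node walk that skips SCRIPT and STYLE subtrees and whitespace-only nodes. The name collectTextNodes and the skip list are my own assumptions, not anything from the SDK. It only relies on the standard nodeType, nodeName, nodeValue and childNodes properties, so in a content script you would pass it document.body.

```javascript
// Hypothetical helper (not an SDK API): gather non-empty text nodes under root,
// ignoring the inline source that lives inside <script> and <style> elements.
var TEXT_NODE = 3;
var SKIPPED = { SCRIPT: true, STYLE: true };

function collectTextNodes(root, out) {
  out = out || [];
  if (!root) {
    return out;
  }
  if (root.nodeType === TEXT_NODE) {
    // Keep only nodes with visible characters (drops newline-only nodes).
    if (/\S/.test(root.nodeValue)) {
      out.push(root);
    }
    return out;
  }
  if (SKIPPED[String(root.nodeName).toUpperCase()]) {
    return out;
  }
  var children = root.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    collectTextNodes(children[i], out);
  }
  return out;
}
```

Because the function returns the nodes themselves rather than their strings, the caller can assign to each node's nodeValue to change the text in place.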
However, if you're after the text of specific ids/classes, consider using document.querySelector/.querySelectorAll, or if you're after nodes that match a specific XPath, use document.evaluate. This will very likely be a lot faster.
Other than that, I cannot really tell what your remaining issues are or what you're trying to achieve in the first place, so I cannot advise on that.
You noted that
I've discovered that my add-on is NOT executed when a document is
accessed via File->Open File.
That is by design. The match-pattern documentation says that
A single asterisk matches any URL with an http, https, or ftp scheme.
For other schemes like file, resource, or data, use a scheme followed
by an asterisk, as below.
You can use the regular expression /.*/ to match all sites and all schemes.
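To make the difference concrete, here is a small sketch of the rule quoted above. It is only an illustration of the documented behaviour, not the SDK's actual matcher, and singleAsteriskMatches is a made-up name.

```javascript
// Illustration only: mimics the documented rule that "*" covers
// URLs with an http, https or ftp scheme and nothing else.
function singleAsteriskMatches(url) {
  return /^(https?|ftp):/i.test(url);
}

// The regular expression suggested above accepts any URL at all.
var matchEverything = /.*/;

var samples = [
  'http://example.com/',
  'file:///home/user/test.html'
];

samples.forEach(function (url) {
  console.log(url, singleAsteriskMatches(url), matchEverything.test(url));
});
```

A page opened via File->Open File has a file: URL, which is why a page-mod with include: "*" skips it while one with include: /.*/ does not.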
I'm trying to migrate away from Feedly, as it is unacceptable (at least to me) that full search is only available in the Pro version.
Anyhow, to export my lengthy "saved for later" list I found some lovely scripts:
"Simple script that exports a users "Saved For Later" list out of Feedly as a JSON string" and feedly-to-pocket, where I am instructed:
You must switch off SSL (http rather than https) or jQuery won't load!
So I thought I had done that by adding (Ubuntu 14.04 / Chrome 40 x64)
--ssl-version-min=tls1
to my /usr/share/applications/google-chrome.desktop file (on all lines starting with Exec=). However, when I try to run the script in the browser console I get
This request has been blocked; the content must be served over HTTPS.
So, any suggestions? (Also, excuse my inexperience.)
Go to your Feedly "saved" list and scroll down until all articles have loaded.
Open console and paste the following Javascript into it:
function loadJQuery() {
var script = document.createElement('script');
script.setAttribute('src', '//code.jquery.com/jquery-2.1.3.js');
script.setAttribute('type', 'text/javascript');
script.onload = loadSaveAs;
document.getElementsByTagName('head')[0].appendChild(script);
}
function loadSaveAs() {
var saveAsScript = document.createElement('script');
saveAsScript.setAttribute('src', 'https://cdn.rawgit.com/eligrey/FileSaver.js/5733e40e5af936eb3f48554cf6a8a7075d71d18a/FileSaver.js');
saveAsScript.setAttribute('type', 'text/javascript');
saveAsScript.onload = saveToFile;
document.getElementsByTagName('head')[0].appendChild(saveAsScript);
}
function saveToFile() {
// Loop through the DOM, grabbing the information from each bookmark
var map = jQuery(".entry.quicklisted").map(function(i, el) {
var $el = jQuery(el);
var regex = /Published:(.*)(.*)/i;
return {
title: $el.attr("data-title"),
url: $el.attr("data-alternate-link"),
summary: $el.find(".summary")[0].innerHTML,
time: regex.exec($el.find("span.ago").attr("title"))[1]
};
}).get(); // Convert jQuery object into an array
// Convert to a nicely indented JSON string
var json = JSON.stringify(map, undefined, 2);
var blob = new Blob([json], {type: "text/plain;charset=utf-8"});
saveAs(blob, "FeedlySavedForLater" + Date.now().toString() + ".txt");
}
loadJQuery()
Source: Feedly-Export-Save4Later
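One fragile spot in the script above is regex.exec(...)[1]: if an entry's title attribute is missing or does not contain "Published:", exec returns null and the whole export stops with a TypeError. A minimal sketch of a more forgiving version follows. The helper name publishedTime is hypothetical, and the exact format of Feedly's title attribute is an assumption on my part.

```javascript
// Hypothetical helper: pull the text after "Published:" out of a title attribute,
// returning null instead of throwing when the attribute is missing or different.
function publishedTime(titleAttr) {
  var match = /Published:\s*(.*)/i.exec(titleAttr || '');
  return match ? match[1].trim() : null;
}
```

Inside the map callback it would be used as time: publishedTime($el.find("span.ago").attr("title")), so entries without a timestamp end up with null rather than aborting the run.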
Not javascript but here is how I saved a html page with all the links and excerpts...
Open the saved pages in feedly in chrome
scroll down so they are all there
inspect any element (the top article is a good choice) so it opens the generated html
find the div id="section0_column0" node
right-click & copy it
paste into Notepad++
this html is untidy so carry on...
Do a Regex find & replace
find: (?s)<div id=.+?_main.+?>.+?(<a href=")(.+?)(").+?sans-serif">(.+?)</span>.+?</div>.+?</div>.+?</div>
replace: <div>$1$2$3>$2</a></div> <div> $4<br /> <br /></div>
save the html page.
open it in Chrome
Posted the question in the jquery forum and the solution was rather simple (remove http from attribute string)
line 34 should be
script.setAttribute('src', '//code.jquery.com/jquery-latest.min.js');
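The reason this works is that a URL starting with // is protocol-relative: the browser resolves it against the scheme of the page that references it, so on an https page the script is requested over https and is no longer blocked as mixed content. A quick sketch with the standard URL constructor shows the resolution. The page addresses are just examples I picked.

```javascript
// A protocol-relative URL inherits the scheme of the page that references it.
var src = '//code.jquery.com/jquery-latest.min.js';

function resolveAgainst(pageUrl) {
  return new URL(src, pageUrl).href;
}

console.log(resolveAgainst('https://feedly.com/')); // https://code.jquery.com/jquery-latest.min.js
console.log(resolveAgainst('http://feedly.com/'));  // http://code.jquery.com/jquery-latest.min.js
```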
So to close the loop - for a full searchable/archived list of links not only by title/url but context also(!) you can:
Follow the instructions in https://github.com/ShockwaveNN/feedly-to-pocket (with the correction suggested by kind stranger jakecigar and you also have to register a pocket app (obtain consumer key) for the ruby script to work)
Export html list from your pocket account
Import pocket list to a Kifi library
and at last feedly-free with my personal search engine
I know I'm a bit late to the party, but I've been hunting around for a few days for a reasonably simple solution, and none of the ones I found were laid out clearly or concisely on Stack Overflow or elsewhere on the web. I have in fact found a much easier way to do this.
Use the JavaScript from this Gist just as it instructs: https://gist.github.com/ShockwaveNN/a0baf2ca26d1711f10e2 (note that this is referenced above and can be found through the link @gep shared in step one)
Once the JS has finished running it will download a text file. (It still runs successfully, even on large lists; I just exported almost 2500 articles.)
Create a blank test.json in SublimeText.
Copy all entries from your exported text file into this json file
Weirdly, it does seem you need to copy and paste: when I tried just renaming the text file, I received errors on the next step
Make sure you are signed into pocket
Go here: https://getpocket.com/import/springpad
Select your newly created test.json
Upload
Note: On large uploads the import page fails to refresh (this did not seem to be an issue as all my articles did make it into my account)
This allows you to upload the JSON directly into your Pocket account, so there is no more messing around with other supposed fixes. I hope this makes it a lot easier for everyone in the future.
I'm trying to do this by using a Tampermonkey Script. However I'm open to new approaches...
What I want to do is extract some data (data-video), from a specific <div>. However this data is not available under the HTML code of the page, but it's available under Dev Tools -> Resources and then on Frames.
Does anyone know if it's possible to get at the information shown under DevTools? And how can I do that?
A comparison of the two pages can be found here: "Original HTML PAGE" and "HTML PAGE under DevTools"
On the first hyperlink the id=video-canvas cannot be seen; instead there is an <object type="application/x-shockwave-flash(...)
As you state in your question the data you're looking for is available in DevTools under the "Resources" tab in the "Frames" folder. What you are looking at there is the Source HTML, similar to View Source.
The code you want is what is getting replaced. It appears the site is using the JW Player plugin, which replaces the <div id="video-canvas"> with the appropriate HTML for the detected device/browser to play the video. All of the browsers on my Mac are forced to use Flash, even when it's disabled. On my iPhone, which can't play Flash, inspecting the page shows that it uses JW's own custom video element. It appears that the file location must be stored in memory, since it is not in the generated markup.
I was able to access their JS class from the DevTools console. jwplayer._tracker has an object b, and b has an object AlWv3iHmEeOzwBIxOUCPzg. That key seems to be the same each time I check across different browsers; if it differs for you, a for...in loop like the one below, trimmed down to .b, will find the right key. Inside that object is e, and the key in e is http://i.n.jwpltx.com/v1.... , a really long string that appears to contain a URL, so it will need to be parsed.
So to get that string I ran
for ( var loc in jwplayer._tracker.b.AlWv3iHmEeOzwBIxOUCPzg.e){
loc
}
so if we put that in a function to parse the string and return a value
function getSubURL(){
var initURL;
for ( var loc in jwplayer._tracker.b.AlWv3iHmEeOzwBIxOUCPzg.e){
initURL = loc;
}
//look for 'mp4:' this is in front of the file path
var start = initURL.indexOf("mp4%3A");
//look for the .mp4 for the end of the file name
var stop = initURL.indexOf(".mp4");
//grab the string between
//start+6 to remove characters used to find it
//and stop+4 to include characters used to find it
var subPath = (initURL.substring((start+6),(stop+4))).split("%2F").join("/");
return subPath;
}
//and run it
getSubURL();
It will return ciencia/astronomia/fimsol.mp4
You can run this from your console. I don't know how you would use it from Tampermonkey, but I think it gets you a lot closer to what you wanted.
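As a variation on the function above, the percent-encoded pieces (%3A for ":" and %2F for "/") can be handled by decoding the whole string first and then matching on the decoded text. This is only a sketch: extractMp4Path is a name I made up, it takes the long key as a plain string argument, and the sample URL in the usage line is invented since I can't reproduce the real one here.

```javascript
// Hypothetical helper: decode a percent-encoded tracker string and return
// the path that follows "mp4:" up to and including the first ".mp4".
function extractMp4Path(trackerUrl) {
  var decoded;
  try {
    decoded = decodeURIComponent(trackerUrl);
  } catch (e) {
    // decodeURIComponent throws URIError on malformed escape sequences.
    return null;
  }
  var match = /mp4:(.+?\.mp4)/.exec(decoded);
  return match ? match[1] : null;
}

// Invented sample input, shaped like the encoded string described above.
var sample = 'http://tracker.example/ping?mu=rtmp%3A%2F%2Fhost.example%2Fvod%2Fmp4%3Aciencia%2Fastronomia%2Ffimsol.mp4&x=1';
console.log(extractMp4Path(sample)); // ciencia/astronomia/fimsol.mp4
```

Returning null when nothing matches avoids the case where indexOf gives -1 and substring silently returns the wrong slice.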
This is the approach I used to solve my problem. I couldn't grab the code I wanted from Dev Tools, but I found a way to get the data from jwplayer with the function getPlaylistItem. This is how I get the filename of each video:
function getFilename(filename) {
var filename;
if(jwplayer().getPlaylistItem){
filename = jwplayer().getPlaylistItem()['file'];
}
else{
return filename;
}
filename = filename.substring(filename.indexOf("/mp4:") + 5);
return filename;
}
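One thing to watch in the function above: if the file value does not contain "/mp4:", indexOf returns -1 and substring(-1 + 5) chops off the first four characters. A small guarded sketch follows. stripStreamPrefix is a hypothetical name and the sample values in the comments are made up.

```javascript
// Hypothetical helper: drop everything up to and including "/mp4:",
// or return the input unchanged when that marker is absent.
function stripStreamPrefix(file) {
  var marker = '/mp4:';
  var at = file.indexOf(marker);
  return at === -1 ? file : file.substring(at + marker.length);
}

// stripStreamPrefix('rtmp://host.example/vod/mp4:ciencia/astronomia/fimsol.mp4')
//   gives 'ciencia/astronomia/fimsol.mp4'
// stripStreamPrefix('video.mp4') gives 'video.mp4' (unguarded, it would be 'o.mp4')
```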
I'm working with the Firefox Addon SDK writing a Spanish dictionary addon. What I want to do is have a dictionary pop up with the translation when the mouse hovers over text. No highlighting the text or right-clicking should be required (although some dictionaries use this). There are several programs that do this with the old addon XUL format (Rikaichan among others), but I was wondering if there was a way to do this with the new SDK.
My current workaround is to inject tags around every word in the text nodes, each with onmouseover="lookThisUp()". This works, but it runs into complications when I want to check words that change meaning in pairs ("get up" rather than "get"), so a method that doesn't cut up all the text with injected tags would be preferable.
This is an example of how to do it with the most recent navigator:browser window:
var {Cu} = require('chrome');
Cu.import('resource://gre/modules/Services.jsm');
var aDOMWindow = Services.wm.getMostRecentWindow('navigator:browser');
aDOMWindow.gBrowser.addEventListener('mouseover', isTextNode, true);
function isTextNode(event) {
var node = event.explicitOriginalTarget;
if (node.nodeName == '#text') {
Services.appShell.hiddenDOMWindow.console.log('moused over a text node = ',node,'the event:',event);
}
}
As you mouse over things in the most recent browser window, anything that is a text node will be logged to the Browser Console.
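Once you have the text node, the word pairs from the question ("get up" rather than "get") can be handled on the node's string without wrapping anything in tags. Here is a minimal sketch that, given the node's text and a character offset, returns the word at that offset plus the word combined with its right-hand neighbour. The name wordAndPairAt and the punctuation list are my own assumptions, and how you obtain the offset under the mouse is left open.

```javascript
// Hypothetical helper: find the word covering `offset` in `text`, and also
// return it joined with the following word so phrase lookups are possible.
function wordAndPairAt(text, offset) {
  var re = /[^\s.,;:!?¿¡"()]+/g;
  var words = [];
  var m;
  while ((m = re.exec(text)) !== null) {
    words.push({ word: m[0], start: m.index, end: m.index + m[0].length });
  }
  for (var i = 0; i < words.length; i++) {
    if (offset >= words[i].start && offset < words[i].end) {
      var next = words[i + 1];
      return {
        word: words[i].word,
        pair: next ? words[i].word + ' ' + next.word : null
      };
    }
  }
  return null; // offset falls on whitespace or punctuation
}

console.log(wordAndPairAt('I get up early', 3)); // { word: 'get', pair: 'get up' }
```

The dictionary lookup can then try the pair first and fall back to the single word.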
I have about 100 static HTML pages that all follow the same HTML structure. I want to apply some DOM manipulations to each of these files and then save the resulting HTML.
These are the manipulations I want to apply:
# [start]
$("h1.title, h2.description", this).wrap("<hgroup>");
if ( $("h1.title").height() < 200 ) {
$("div.content").addClass('tall');
}
# [end]
# SAVE NEW HTML
The first line (.wrap()) I could easily do with a find and replace, but it gets tricky when I have to determine the calculated height of an element, which can't easily be determined without JavaScript.
Does anyone know how I can achieve this? Thanks!
While the first part could indeed be solved in "text mode" using regular expressions or a more complete DOM implementation in JavaScript, for the second part (the height calculation), you'll need a real, full browser or a headless engine like PhantomJS.
From the PhantomJS homepage:
PhantomJS is a command-line tool that packs and embeds WebKit.
Literally it acts like any other WebKit-based web browser, except that
nothing gets displayed to the screen (thus, the term headless). In
addition to that, PhantomJS can be controlled or scripted using its
JavaScript API.
A schematic instruction (which I admit is not tested) follows.
In your modification script (say, modify-html-file.js), open an HTML page, modify its DOM tree and console.log the HTML of the root element:
var page = new WebPage();
page.open(encodeURI('file://' + phantom.args[0]), function (status) {
if (status === 'success') {
var html = page.evaluate(function () {
// your DOM manipulation here
return document.documentElement.outerHTML;
});
console.log(html);
}
phantom.exit();
});
Next, save the new HTML by redirecting your script's output to a file:
#!/bin/bash
mkdir -p modified
for i in *.html; do
# pass an absolute path, since the script prepends file:// to its argument
phantomjs modify-html-file.js "$PWD/$i" > "modified/$i"
done
I tried PhantomJS as in katspaugh's answer, but ran into several issues trying to manipulate pages. My use case was modifying the static HTML output of Doxygen without modifying Doxygen itself. The goal was to reduce the delivered file size by removing unnecessary elements from the page, and to convert it to HTML5. I also wanted to use jQuery to access and modify elements more easily.
Loading the page in PhantomJS
The APIs appear to have changed drastically since the accepted answer. Additionally, I used a different approach (derived from this answer), which will be important in mitigating one of the major issues I encountered.
var system = require('system');
var fs = require('fs');
var page = require('webpage').create();
// Reading the page's content into your "webpage"
// This automatically refreshes the page
page.content = fs.read(system.args[1]);
// Make all your changes here
fs.write(system.args[2], page.content, 'w');
phantom.exit();
Preventing JavaScript from Running
My page uses Google Analytics in the footer, and now the page is modified beyond my intention, presumably because javascript was run. If we disable javascript, we can't actually use jQuery to modify the page, so that isn't an option. I've tried temporarily changing the tag, but when I do, every special character is replaced with an html-escaped equivalent, destroying all javascript code on the page. Then, I came across this answer, which gave me the following idea.
var rawPageString = fs.read(system.args[1]);
rawPageString = rawPageString.replace(/<script type="text\/javascript"/g, "<script type='foo/bar'");
rawPageString = rawPageString.replace(/<script>/g, "<script type='foo/bar'>");
page.content = rawPageString;
// Make all your changes here
rawPageString = page.content;
rawPageString = rawPageString.replace(/<script type='foo\/bar'/g, "<script");
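The two halves of this trick can be tried outside PhantomJS as plain string functions, which makes it easy to check what they do to a given page. This is a sketch under my own naming (disableScripts and restoreScripts are not PhantomJS APIs). One assumption to flag: I'd expect the serialised page.content to come back with double-quoted attributes, though I haven't verified that for every PhantomJS version, so the restore step here accepts either quote style.

```javascript
// Hypothetical helpers mirroring the replace calls above.
// Give every script tag a bogus type so the engine will not execute it.
function disableScripts(html) {
  return html
    .replace(/<script type="text\/javascript"/g, "<script type='foo/bar'")
    .replace(/<script>/g, "<script type='foo/bar'>");
}

// Remove the bogus type again; tolerate single or double quotes because the
// page may have been re-serialised between the two steps.
function restoreScripts(html) {
  return html.replace(/<script type=(['"])foo\/bar\1/g, "<script");
}
```

Note that the round trip does not put type="text/javascript" back; the script tags simply end up without a type attribute, which is fine for HTML5 output.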
Adding jQuery
There's actually an example on how to use jQuery. However, I thought an offline copy would be more appropriate. Initially I tried using page.includeJs as in the example, but found that page.injectJs was more suitable for the use case. Unlike includeJs, there's no <script> tag added to the page context, and the call blocks execution which simplifies the code. jQuery was placed in the same directory I was executing my script from.
page.injectJs("jquery-2.1.4.min.js");
page.evaluate(function () {
// Make all changes here
// Remove the foo/bar type more easily here
$("script[type^=foo]").removeAttr("type");
});
fs.write(system.args[2], page.content, 'w');
phantom.exit();
Putting it All Together
var system = require('system');
var fs = require('fs');
var page = require('webpage').create();
var rawPageString = fs.read(system.args[1]);
// Prevent in-page javascript execution
rawPageString = rawPageString.replace(/<script type="text\/javascript"/g, "<script type='foo/bar'");
rawPageString = rawPageString.replace(/<script>/g, "<script type='foo/bar'>");
page.content = rawPageString;
page.injectJs("jquery-2.1.4.min.js");
page.evaluate(function () {
// Make all changes here
// Remove the foo/bar type
$("script[type^=foo]").removeAttr("type");
});
fs.write(system.args[2], page.content, 'w');
phantom.exit();
Using it from the command line:
phantomjs modify-html-file.js "input_file.html" "output_file.html"
Note: This was tested and working with PhantomJS 2.0.0 on Windows 8.1.
Pro tip: If speed matters, you should consider iterating the files from within your PhantomJS script rather than a shell script. This will avoid the latency that PhantomJS has when starting up.
you can get your modified content by $('html').html() (or a more specific selector if you don't want stuff like head tags), then submit it as a big string to your server and write the file server side.
I have used HTTRACK to download Federal regulations from a government website, and the resulting HTML files are not intuitively named. Each file has a <TITLE></TITLE> tag set, that would serve nicely to name each file in a fashion that will lend itself to ebook creation. I want to turn these regulations into an ebook for my Kindle, so that I can have the regulations readily available for reference, rather than having to carry volumes of books with me everywhere.
My preferred text/hex editor, UltraEdit Professional 15.20.0.1026, has scripting commands enabled through an embedded JavaScript engine. In researching possible solutions to my problem, I found xmlTitleSave on the IDM UltraEdit website.
// ----------------------------------------------------------------------------
// Script Name: xmlTitleSave.js
// Creation Date: 2008-06-09
// Last Modified:
// Copyright: none
// Purpose: find the <title> value in an XML document, then saves the file as the
// title.xml in a user-specified directory
// ----------------------------------------------------------------------------
//Some variables we need
var regex = "<title>(.*)</title>" //Perl regular expression to find title string
var file_path = UltraEdit.getString("Path to save file at? !! MUST PRE EXIST !!",1);
// Start at the beginning of the file
UltraEdit.activeDocument.top();
UltraEdit.activeDocument.unicodeToASCII();
// Turn on regular expressions
UltraEdit.activeDocument.findReplace.regExp = true;
// Find it
UltraEdit.activeDocument.findReplace.find(regex);
// Load it into a selection
var titl = UltraEdit.activeDocument.selection;
// Javascript function 'match' will match the regex within the javascript engine
// so we can extract the actual title via array
t = titl.match(regex);
// 't' is an array of the match from 'titl' based on the var 'regex'
// the 2nd value of the array gives us what we need... then append '.xml'
saveTitle = t[1]+".xml";
UltraEdit.saveAs(file_path + saveTitle);
// Uncomment for debugging
// UltraEdit.outputWindow.write("titl = " + titl);
// UltraEdit.outputWindow.write("t = " + t);
My question is two-fold:
Can this JavaScript be modified to extract the <TITLE></TITLE> contents from an HTML file and rename the files?
If the JavaScript cannot be modified easily, is there a script/program/black magic/animal sacrifice that can accomplish the same thing?
EDIT:
I have been able to get the script to work as desired by removing the UltraEdit.activeDocument.unicodeToASCII(); line and changing the file extension to .html. My only issue now is that while this script works on single open files, it does not batch process the directory.
You can use just about any "scriptable" language to do something like this pretty quickly. Ruby is my favorite:
require 'fileutils'
dir = "/your/directory"
files = Dir["#{dir}/*.html"]
files.each do |file|
html = IO.read file
next unless html =~ /<title>([^<]+)<\/title>/i # skip files without a title
title = $1.strip
FileUtils.mv file, "#{dir}/#{title}.html"
puts "Renamed #{file} to #{title}.html."
end
Obviously if your UltraEdit script worked for you this might be obtuse, but for anybody running a different env, hopefully this is useful.
Does this not work out of the box?
I don't know anything about UltraEdit, but as far as a regex engine is concerned, if it can parse <title>(.*)</title> out of an XML document, it can do the exact same for HTML.
Just modify the final file title to .html instead of .xml
saveTitle = t[1]+".html";
Assuming you can get that script to work as it's intended (point being I don't know UltraEdit), I'm pretty confident that same process will work for HTML.
XML and HTML are both plain text, and that script is simply running a regular expression on the text to extract the title tags, which are the same in both; the only thing you need to do is change this line:
saveTitle = t[1]+".xml";
to this:
saveTitle = t[1]+".html";
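Two details may still bite with real HTML: the pattern <title>(.*)</title> is case-sensitive, so it would miss an upper-case <TITLE> tag unless the search is made case-insensitive, and a title can contain characters that are not allowed in Windows file names. Here is a minimal sketch of the extraction in plain JavaScript, independent of UltraEdit. The name titleToFilename, the underscore replacement and the sample title are my own choices.

```javascript
// Hypothetical helper: build a file name from an HTML document's <title>.
function titleToFilename(html) {
  // Case-insensitive and non-greedy, so <TITLE> works and only the first title is used.
  var m = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  if (!m) {
    return null;
  }
  // Collapse whitespace, then replace characters Windows rejects in file names.
  var title = m[1].replace(/\s+/g, ' ').trim();
  title = title.replace(/[\\\/:*?"<>|]/g, '_');
  return title ? title + '.html' : null;
}

console.log(titleToFilename('<HTML><HEAD><TITLE>14 CFR 91.3: Responsibility</TITLE></HEAD></HTML>'));
// 14 CFR 91.3_ Responsibility.html
```

In the UltraEdit script, the same clean-up could be applied to t[1] before it is passed to saveAs.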
After much searching and trial and error on the scripting side, I ran across a fantastic program for Windows that will do the renaming via TITLE tags: Flexible Renamer 8.3. The author's website is http://hp.vector.co.jp/authors/VA014830/english/FlexRena/, and it handles every bit of what I needed. Many thanks to @coreyward and @Yuji for their fantastic advice on the scripting end of things.