I'm trying to migrate from Feedly, as it is unacceptable (at least to me) that search is fully enabled only in the pro version.
Anyhow, to export my lengthy list of "saved for later" I found some lovely scripts:
Simple script that exports a user's "Saved For Later" list out of Feedly as a JSON string, and feedly-to-pocket, where I am instructed to:
You must switch off SSL (http rather than https) or jQuery won't load!
so I thought I did by adding (Ubuntu 14.04 / Chrome 40 x64)
--ssl-version-min=tls1
to my /usr/share/applications/google-chrome.desktop file (all lines starting with Exec=). However, when I try to run it in the browser console I get
This request has been blocked; the content must be served over HTTPS.
So, any suggestions? (Also, excuse my noobness.)
Go to your Feedly "saved" list and scroll down until all articles have loaded.
Open the console and paste the following JavaScript into it:
function loadJQuery() {
    var script = document.createElement('script');
    script.setAttribute('src', '//code.jquery.com/jquery-2.1.3.js');
    script.setAttribute('type', 'text/javascript');
    script.onload = loadSaveAs;
    document.getElementsByTagName('head')[0].appendChild(script);
}

function loadSaveAs() {
    var saveAsScript = document.createElement('script');
    saveAsScript.setAttribute('src', 'https://cdn.rawgit.com/eligrey/FileSaver.js/5733e40e5af936eb3f48554cf6a8a7075d71d18a/FileSaver.js');
    saveAsScript.setAttribute('type', 'text/javascript');
    saveAsScript.onload = saveToFile;
    document.getElementsByTagName('head')[0].appendChild(saveAsScript);
}

function saveToFile() {
    // Loop through the DOM, grabbing the information from each bookmark
    var map = jQuery(".entry.quicklisted").map(function(i, el) {
        var $el = jQuery(el);
        var regex = /Published:(.*)(.*)/i;
        return {
            title: $el.attr("data-title"),
            url: $el.attr("data-alternate-link"),
            summary: $el.find(".summary")[0].innerHTML,
            time: regex.exec($el.find("span.ago").attr("title"))[1]
        };
    }).get(); // Convert jQuery object into an array

    // Convert to a nicely indented JSON string
    var json = JSON.stringify(map, undefined, 2);
    var blob = new Blob([json], {type: "text/plain;charset=utf-8"});
    saveAs(blob, "FeedlySavedForLater" + Date.now().toString() + ".txt");
}

loadJQuery();
Source: Feedly-Export-Save4Later
Not JavaScript, but here is how I saved an HTML page with all the links and excerpts...
Open the saved pages in Feedly in Chrome
Scroll down so they are all loaded
Inspect any element (the top article is a good choice) so it opens the generated HTML
Find the div id="section0_column0" node
Right-click & copy it
Paste into Notepad++
This HTML is untidy, so carry on...
Do a regex find & replace
find: (?s)<div id=.+?_main.+?>.+?(<a href=")(.+?)(").+?sans-serif">(.+?)</span>.+?</div>.+?</div>.+?</div>
replace: <div>$1$2$3>$2</a></div> <div> $4<br /> <br /></div>
Save the HTML page.
Open it in Chrome (or run the same replace straight from the console, as sketched below)
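For completeness, here is a console-only sketch of the same find & replace, assuming a browser new enough to support the regex s (dotAll) flag, which stands in for Notepad++'s inline (?s) modifier; the node id comes from the steps above:

// Grab the same node you would otherwise copy into Notepad++
var section = document.getElementById("section0_column0").outerHTML;

// Same pattern as above, with (?s) expressed as the trailing "s" flag
var re = /<div id=.+?_main.+?>.+?(<a href=")(.+?)(").+?sans-serif">(.+?)<\/span>.+?<\/div>.+?<\/div>.+?<\/div>/gs;
var cleaned = section.replace(re, '<div>$1$2$3>$2</a></div> <div> $4<br /> <br /></div>');

// Open the tidied markup in a new tab
var w = window.open();
w.document.write(cleaned);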
Posted the question in the jQuery forum and the solution was rather simple (remove http from the attribute string). A protocol-relative URL makes the browser reuse the page's own scheme, so on an https page the script loads over HTTPS and is no longer blocked as mixed content. Line 34 should be
script.setAttribute('src', '//code.jquery.com/jquery-latest.min.js');
So to close the loop - for a fully searchable/archived list of links, not only by title/url but by context too(!), you can:
Follow the instructions in https://github.com/ShockwaveNN/feedly-to-pocket (with the correction suggested by kind stranger jakecigar; you also have to register a Pocket app to obtain a consumer key for the Ruby script to work - see the sketch after this list)
Export the HTML list from your Pocket account
Import the Pocket list into a Kifi library
and at last I'm Feedly-free, with my personal search engine
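For reference, registering a Pocket app at https://getpocket.com/developer/ yields the consumer key; the Ruby script then exchanges it for an access token via Pocket's OAuth flow. A rough JavaScript sketch of the first leg of that flow, assuming Pocket's documented v3 endpoints and placeholder key and redirect URI values:

// Placeholders - substitute your own registered app's values
var CONSUMER_KEY = "1234-abcd";
var REDIRECT_URI = "http://localhost";

// Step 1: trade the consumer key for a request token
fetch("https://getpocket.com/v3/oauth/request", {
  method: "POST",
  headers: {"Content-Type": "application/json", "X-Accept": "application/json"},
  body: JSON.stringify({consumer_key: CONSUMER_KEY, redirect_uri: REDIRECT_URI})
})
  .then(function (res) { return res.json(); })
  .then(function (data) {
    // Step 2: the user authorizes the request token in a browser,
    // after which /v3/oauth/authorize converts it into an access token
    console.log("Visit: https://getpocket.com/auth/authorize?request_token=" +
                data.code + "&redirect_uri=" + encodeURIComponent(REDIRECT_URI));
  });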
I know I'm a bit late to the party, but I've been hunting around for a few days to find a reasonably simple solution, none of which have been listed clearly or concisely on Stack Overflow or elsewhere on the web. I have in fact found a much easier way to do this.
Use the JavaScript from this Gist just as it instructs: https://gist.github.com/ShockwaveNN/a0baf2ca26d1711f10e2 (note this is referenced above and was found through the link #gep shared in step one)
Once the JS has completed running it will download a text file. (It does still run successfully, even on large numbers - I just exported almost 2500 articles.)
Create a blank test.json in Sublime Text.
Copy all entries from your exported text file into this JSON file.
Weirdly, it does seem you need to copy and paste - I tried just renaming the text file, and when I did that I received errors on the next step.
Make sure you are signed into Pocket.
Go here: https://getpocket.com/import/springpad
Select your newly created test.json.
Upload.
Note: on large uploads the import page fails to refresh (this did not seem to be an issue, as all my articles did make it into my account).
This allows you to directly upload JSON into your Pocket account, with no more messing around with random supposed fixes. I hope this makes it a lot easier for everyone in the future.
Related
tl;dr - After exporting a Google Doc as an HTML file and pasting the HTML into a Gmail draft, it does not contain the formatting from the original Google Doc (other than hyperlinks).
Code snippet:
// Exports the doc to HTML format
var htmlExport = "https://docs.google.com/feeds/download/documents/export/Export?id=" + docID + "&exportFormat=html";
var param = {
    method: "get",
    headers: {"Authorization": "Bearer " + ScriptApp.getOAuthToken()},
    muteHttpExceptions: true,
};
var htmlExportText = UrlFetchApp.fetch(htmlExport, param).getContentText();

// The variables below (contactEmail & emailSubject) are both taken from a spreadsheet.
// Copies the recent draft body into a new email, then updates the body of the
// new email to include the HTML export.
var draftEmailBody = GmailApp.getMessageById(draftEmailID).getBody();
var draftToSend = GmailApp.createDraft(contactEmail, emailSubject, '', {htmlBody: htmlExportText + draftEmailBody}).getMessageId();
Long version:
I am building a mail merge that pulls contact info from a Google Sheet and uses a Google Doc as the template for the body. The Doc has several bits of formatting in it (bold, italics, superscript) that, when exported as HTML using the script above, appear in the Gmail draft devoid of formatting (for some reason it keeps the hyperlinks). For some odd reason it even keeps the images from the doc!
The Gmail draft pulled into the body (draftEmailBody) does, however, keep all its formatting. I can only assume this means I'm doing something wrong by using getContentText, but I don't know how else to go about it.
(This is completely separate and I should probably just make another question for this, but I'm here so...)
Separately, I wanted to have the script edit specific fields within the GDoc template, but I have run into 2 issues.
Problem 1 - I have found no way to replace specific text within a Gmail draft.
Workaround 1 - I have the script edit the text in a GDoc instead, using replaceText. This, however, leads to:
Problem 2 - Using replaceText in a GDoc requires you to saveAndClose before the script can recognize the change. For some reason I can never get my script to open the GDoc again, despite including openById in various places of the script!
Workaround 2 - I create a copy of the doc for each contact, replace the text within that copy, then trash all of the copies on completion so there's no clutter. Quite clunky and slow, but it gets the job done (a sketch of this follows below).
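For what it's worth, a minimal Apps Script sketch of what Workaround 2 amounts to - copy the template, replace a placeholder, export, trash the copy. The %NAME% token and the templateId/contactName parameters are illustrative, not from the original script:

// Hypothetical helper: personalize a copy of the template doc for one contact
function buildPersonalizedHtml(templateId, contactName) {
    // Copy the template so the original is never modified
    var copyFile = DriveApp.getFileById(templateId).makeCopy("merge-temp");
    var copyDoc = DocumentApp.openById(copyFile.getId());

    // Replace a placeholder token in the copy
    copyDoc.getBody().replaceText("%NAME%", contactName);

    // saveAndClose is required before the export sees the change
    copyDoc.saveAndClose();

    // Export the copy as HTML (same endpoint as the snippet above)
    var url = "https://docs.google.com/feeds/download/documents/export/Export?id=" +
              copyFile.getId() + "&exportFormat=html";
    var html = UrlFetchApp.fetch(url, {
        headers: {"Authorization": "Bearer " + ScriptApp.getOAuthToken()}
    }).getContentText();

    // Trash the copy so no clutter is left behind
    copyFile.setTrashed(true);
    return html;
}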
While it's not the prettiest solution, I found something that helps:
Google Scripts: Generating Email from Docs Loses Formatting
I am writing a web crawler. I extracted the heading and main discussion of this link, but I am unable to find any of the comments (Ctrl+U -> Ctrl+F -> comment text). I think the comments are rendered with JavaScript. Can I extract them?
RT are using a service from spot.im for the comments.
You need to make two POST requests: first to https://api.spot.im/me/network-token/spotim to get a token, then to https://api.spot.im/conversation-read/spot/sp_6phY2k0C/post/353493/get to get the comments as JSON.
I wrote a quick script to do this:
import requests
import re
import json

def get_rt_comments(article_url):
    spotim_spotId = 'sp_6phY2k0C'  # spot.im id for RT
    post_id = re.search('([0-9]+)', article_url).group(0)
    r1 = requests.post('https://api.spot.im/me/network-token/spotim').json()
    spotim_token = r1['token']
    payload = {
        "count": 25,  # number of comments to fetch
        "sort_by": "best",
        "cursor": {"offset": 0, "comments_read": 0},
        "host_url": article_url,
        "canonical_url": article_url
    }
    r2_url = 'https://api.spot.im/conversation-read/spot/' + spotim_spotId + '/post/' + post_id + '/get'
    r2 = requests.post(r2_url, data=json.dumps(payload),
                       headers={'X-Spotim-Token': spotim_token, "Content-Type": "application/json"})
    return r2.json()

if __name__ == '__main__':
    url = 'https://www.rt.com/usa/353493-clinton-speech-affairs-silence/'
    comments = get_rt_comments(url)
    print(comments)
Yes, if it can be viewed with a web browser, you can extract it.
If you look at the source, it is really an iframe that loads a piece of JavaScript, which then creates a new script tag in the document whose source loads bundle.js - the actual commenting software. This in turn fetches the actual comments.
Instead of going through this manually, you could consider using, for example, WebKit to create a headless browser that executes the JavaScript like an ordinary browser. Then you can scrape from that, instead of having to make your crawler fetch the external resources manually.
Examples of such headless browsers are Spynner, dryscrape, or the PhantomJS-derived PhantomPy (the latter seems to be an abandoned project now).
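To illustrate the headless-browser route, here is a rough PhantomJS sketch; the comment-container selector is a guess and would need to be checked against the rendered page:

// save as rt_comments.js and run with: phantomjs rt_comments.js
var page = require('webpage').create();

page.open('https://www.rt.com/usa/353493-clinton-speech-affairs-silence/', function (status) {
    if (status !== 'success') {
        console.log('failed to load page');
        phantom.exit(1);
    }
    // Give the embedded spot.im script time to fetch and render the comments
    setTimeout(function () {
        var html = page.evaluate(function () {
            // Hypothetical selector for the comments widget - inspect the page for the real one
            var el = document.querySelector('.spot-im-frame-inpage');
            return el ? el.innerHTML : document.body.innerHTML;
        });
        console.log(html);
        phantom.exit();
    }, 5000);
});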
I want to use HtmlUnit (v2.21) to get some search result pages from Google. This requires me to click on the "people also looked for" link when searching for a person (right side, see example link), which triggers some JavaScript and changes the content of the current page. But this gives me a JavaScript wrapper exception (see below).
Clickable example link: https://www.google.de/search?ie=UTF-8&safe=off&q=nicki+minaj
Simple TestCase with errors:
String url = "https://www.google.de/search?ie=UTF-8&safe=off&q=nicki+minaj";
WebClient client = new WebClient(BrowserVersion.BEST_SUPPORTED);
HtmlPage page = client.getPage(url);
HtmlElement link = page.getFirstByXPath("//a[@class='_Zjg']");
HtmlPage newPage = link.click(); // throws exception
this.storeResultFile(newPage.asXml(), "test");
client.close();
Result:
net.sourceforge.htmlunit.corejs.javascript.WrappedException: Wrapped java.lang.NullPointerException
at net.sourceforge.htmlunit.corejs.javascript.Context.throwAsScriptRuntimeEx(Context.java:2053)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.doProcessPostponedActions(JavaScriptEngine.java:947)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.processPostponedActions(JavaScriptEngine.java:1012)
at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:799)
at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:742)
at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:689)
I stored the xml of the "page" object and made sure that the XPath expression is valid and has results.
Anybody got any ideas?
Looks like the JavaScript engine (based on Rhino) is very easy to upset and quits on some script issues where other browsers are still able to run the script.
I don't know if there is a mistake in Google's scripts, but these two lines solved it for me:
JavaScriptEngine engine = client.getJavaScriptEngine();
engine.holdPosponedActions(); // sic - the method name is spelled this way in HtmlUnit
Nevertheless, when running multiple HtmlUnit objects in multiple threads it is still possible to run across this error. This is more a workaround than a solution.
I'm trying to do this using a Tampermonkey script. However, I'm open to new approaches...
What I want to do is extract some data (data-video) from a specific <div>. However, this data is not in the HTML source of the page; it is only visible under Dev Tools -> Resources, under Frames.
Does anyone know if it's possible to get that information available under DevTools? And how can I do that?
A comparison between the two pages can be found here: "Original HTML PAGE" and "HTML PAGE under DevTools"
In the first hyperlink the id=video-canvas cannot be seen; however, it is there in the <object type="application/x-shockwave-flash"(...)
As you state in your question, the data you're looking for is available in DevTools under the "Resources" tab in the "Frames" folder. What you are looking at there is the source HTML, similar to View Source.
The code you want is what is getting replaced. It appears the site is using the JW Player plugin, which replaces the <div id="video-canvas"> with the appropriate HTML for the detected device/browser to play the video. With all of my browsers on my Mac, they are being forced to use Flash, even when it's disabled. When using my iPhone, which can't play Flash, and inspecting the page, the player uses JW's own custom video element. It appears the player must be storing the file location in memory, since it is not in the generated markup.
I am able to go through the console in the dev tools and access their JS class. I can call jwplayer._tracker, which has an object b; object b has an object AlWv3iHmEeOzwBIxOUCPzg. This object seems to be consistent each time I check between different browsers (you can use the for loop in my first example to get the correct key rather than hard-coding it). Inside that object is e, and the key in e is a really long string starting with http://i.n.jwpltx.com/v1... that appears to contain a URL, so it needs to be parsed.
So to get the HTML string I ran:
for (var loc in jwplayer._tracker.b.AlWv3iHmEeOzwBIxOUCPzg.e) {
    loc;
}
So if we put that in a function to parse the string and return a value:
function getSubURL() {
    var initURL;
    for (var loc in jwplayer._tracker.b.AlWv3iHmEeOzwBIxOUCPzg.e) {
        initURL = loc;
    }
    // Look for 'mp4:' (URL-encoded as mp4%3A) - this sits in front of the file path
    var start = initURL.indexOf("mp4%3A");
    // Look for the .mp4 at the end of the file name
    var stop = initURL.indexOf(".mp4");
    // Grab the string between:
    // start+6 to remove the characters used to find it,
    // and stop+4 to include the characters used to find it
    var subPath = initURL.substring(start + 6, stop + 4).split("%2F").join("/");
    return subPath;
}

// And run it
getSubURL();
It will return ciencia/astronomia/fimsol.mp4
You can run this from your console. I am unaware of how you can use this in Tampermonkey, but I think it gets you a lot closer to what you wanted.
This is the approach I've used to solve my problem... I couldn't grab the code I wanted under Dev Tools, but I found a way to get the data from jwplayer with the function getPlaylistItem. This is how I get the URL filename of each video:
function getFilename() {
    var filename;
    if (jwplayer().getPlaylistItem) {
        filename = jwplayer().getPlaylistItem()['file'];
    } else {
        return filename; // undefined when the playlist API is unavailable
    }
    filename = filename.substring(filename.indexOf("/mp4:") + 5);
    return filename;
}
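To tie this back to Tampermonkey: with @grant none a userscript shares the page's window, so it can reach the page's jwplayer object directly. A rough sketch; the @match URL is a placeholder and the polling interval is arbitrary:

// ==UserScript==
// @name         grab-jwplayer-file
// @match        http://example.com/*
// @grant        none
// ==/UserScript==
// NOTE: the @match URL above is a placeholder - point it at the real video page.

(function () {
    // Poll until the player is initialized, then read the playlist item
    var timer = setInterval(function () {
        if (typeof window.jwplayer === "function" && window.jwplayer().getPlaylistItem) {
            clearInterval(timer);
            var file = window.jwplayer().getPlaylistItem()["file"];
            console.log("video file:", file);
        }
    }, 500);
})();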
I'm trying to get to grips with the Firefox Add-on SDK (previously known as Jetpack, from what I understand), but I'm having problems working with the DOM.
I need to iterate over all of the text nodes in the DOM when a web page loads and make changes to some of the strings they contain. I've posted a simplified version of what I'm doing below (I'm new to JavaScript, so forgive me any oddities).
// test.js
function parseElement(Element)
{
    if (Element == null)
        return;

    var i = 0;
    var Result = false;

    if (Element.hasChildNodes)
    {
        var children = Element.childNodes;
        while (i <= children.length - 1)
        {
            var child = children.item(i);
            parseElement(child);
            i++;
        }
    }

    if (Element.nodeType == 3)
    {
        // For testing - see what the text node contains
        alert(Element.nodeValue);
        Result = true;
    }

    return Result;
}

window.addEventListener("load", function load(event)
{
    window.removeEventListener("load", load, false);
    parseElement(document.body);
}, false);
When I create a basic HTML document:
<!-- test.html -->
<html>
<head>
<script type="text/javascript" src="test.js"></script>
</head>
<body>
<b>hello world</b>
<p>foo</p>
<i>test</i>
</body>
</html>
...include this JavaScript file in the HEAD section, then open it in Firefox, the "alert" displays 6 dialog boxes containing:
1) "hello world"
2) blank -> no visible characters, just a newline
3) "foo"
4) blank -> no visible characters, just a newline
5) "test"
6) blank -> no visible characters, just a newline
Exactly what I would expect to see.
The problem arises when I create an add-on and use test.js as a page-mod content script from my main.js file (modified to remove the "addEventListener" part). When I use "cfx run" to start Firefox with my add-on installed, then open the same HTML document (with the "script" tag for test.js commented out), the alerts do not display at all.
So that's the first puzzle. But having also navigated to other web pages - for example, a YouTube video page - the alert DOES display several dialogs, but they include very strange strings, mostly the content of script tags:
EDIT: I don't have enough reputation to embed an image, so here's a link showing the sort of thing I mean: http://img46.imageshack.us/img46/5994/mtpd.jpg
And again, the text I would expect to see is absent.
Apologies for some of the redundancy below, but just to be clear: this is my main.js:
main.js
var data = require("sdk/self").data;
var pageMod = require("sdk/page-mod");

exports.main = function()
{
    pageMod.PageMod({
        include: "*",
        contentScriptFile: [data.url("test.js")]
    });
}
And the modified version of the JavaScript file is identical to the "test.js" listing above, except for the end part:
test.js
<snip>
...
return Result;
}
parseElement(document.body);
I've included my project files (if I can call them that) in a zip if it makes things easier to visualise: http://www.mediafire.com/?774iprbngtlgkcp
I've tried changing
parseElement(document.body);
to
parseElement(unsafeWindow.document.body);
in case it makes any difference, but the outcome is identical.
So I'm very puzzled about what's happening. I can't understand why the test.js file isn't picking out the text nodes (and only the text nodes) from the DOM when I use it as part of an add-on, but it does exactly what I would anticipate when included as a script in an HTML document. Can anyone shed any light on this?
Thank you in advance.
Errors in your lib code and content scripts are usually logged to the Error Console; check what is printed there. Also see the SDK console module.
Your page-mod won't run the way you expect because, by default, page-mod content scripts run only after the load event.
See the contentScriptWhen documentation.
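For instance, a minimal sketch using the documented values ("start", "ready", and "end", the default):

var pageMod = require("sdk/page-mod");
var data = require("sdk/self").data;

pageMod.PageMod({
    include: "*",
    contentScriptFile: [data.url("test.js")],
    // "start": before page content loads; "ready": after DOMContentLoaded;
    // "end" (the default): after the load event
    contentScriptWhen: "ready"
});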
Script tags often have a text-node child containing the inline script source, so it is absolutely normal that those are enumerated as well.
For some discussion about walking tree nodes, see: getElementsByTagName() equivalent for textNodes
However, if you're after the text of specific ids/classes, consider using document.querySelector/.querySelectorAll, or if you're after nodes matching a specific XPath, use document.evaluate. This will very likely be a lot faster. A TreeWalker, sketched below, is another built-in option.
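Here is a short sketch of the recursive walker from the question rewritten with the standard createTreeWalker API, which works the same way in a content script:

// Visit every text node under document.body without manual recursion
var walker = document.createTreeWalker(
    document.body,
    NodeFilter.SHOW_TEXT, // only text nodes (nodeType == 3)
    null,
    false
);

var node;
while ((node = walker.nextNode()))
{
    // Same place the original parseElement() inspected Element.nodeValue
    console.log(node.nodeValue);
}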
Other than that, I cannot really tell what your remaining issues are or what exactly you're trying to achieve in the first place, so I cannot advise on that.
You noted that:
I've discovered that my add-on is NOT executed when a document is
accessed via File->Open File.
That is by design. The match-pattern documentation says:
A single asterisk matches any URL with an http, https, or ftp scheme.
For other schemes like file, resource, or data, use a scheme followed
by an asterisk, as below.
You can use the regular expression /.*/ to match all sites and all schemes.
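In other words, a small tweak to the main.js from the question (a RegExp include matches every scheme, including file:):

pageMod.PageMod({
    include: /.*/, // "*" alone skips file:, resource:, and data: URLs
    contentScriptFile: [data.url("test.js")]
});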