Use "readabilitySAX" on distant pages, in Node.js - javascript

I want to get the length of articles published on newspaper and magazine websites and on blogs.
In a Node.js server, I want to use the "readabilitySAX" module (https://github.com/fb55/readabilitySAX), but I must be making a mistake in how I use it, because this code does not work:
var Readability = require("readabilitySAX/readabilitySAX.js"),
    Parser = require("htmlparser2/lib/Parser.js");

var readable = new Readability({
    pageURL: "http://www.nytimes.com/2014/04/18/business/treatment-cost-could-influence-doctors-advice.html?src=me&ref=general"
});
parser = new Parser(readable, {});
console.log(readable.getArticle().textLength);

The pageURL attribute is used when Readability resolves relative links, not to download a page.
To download a page, you can use the get method:
require("readabilitySAX").get("http://url", {type:"html"}, function(article) {
console.log(article.textLength);
})
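In other words, the code in the question never feeds any HTML to the parser, so getArticle() has nothing to work with. If you want to handle the download yourself instead of using get, a minimal sketch (based on the module's README, ignoring redirects and error handling) could look like this:

var http = require("http"),
    Readability = require("readabilitySAX/readabilitySAX.js"),
    Parser = require("htmlparser2/lib/Parser.js");

var url = "http://www.nytimes.com/2014/04/18/business/treatment-cost-could-influence-doctors-advice.html";

http.get(url, function (res) {
    var readable = new Readability({ pageURL: url }), // only used to resolve relative links
        parser = new Parser(readable, {});

    res.on("data", function (chunk) {
        // Stream the downloaded HTML into the parser.
        parser.write(chunk.toString());
    });
    res.on("end", function () {
        parser.end();
        // Only after the whole document has been parsed is getArticle() meaningful.
        console.log(readable.getArticle().textLength);
    });
});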

Related

Can I extract the comments of any page from https://www.rt.com/ using Python 3?

I am writing a web crawler. I extracted the heading and the main discussion of this link, but I am unable to find any of the comments (Ctrl+U -> Ctrl+F, searching for the comment text). I think the comments are rendered with JavaScript. Can I extract them?
RT uses a service from spot.im for comments.
You need to make two POST requests: first to https://api.spot.im/me/network-token/spotim to get a token, then to https://api.spot.im/conversation-read/spot/sp_6phY2k0C/post/353493/get to get the comments as JSON.
I wrote a quick script to do this:
import requests
import re
import json

def get_rt_comments(article_url):
    spotim_spotId = 'sp_6phY2k0C'  # spotim id for RT
    post_id = re.search('([0-9]+)', article_url).group(0)
    r1 = requests.post('https://api.spot.im/me/network-token/spotim').json()
    spotim_token = r1['token']
    payload = {
        "count": 25,  # number of comments to fetch
        "sort_by": "best",
        "cursor": {"offset": 0, "comments_read": 0},
        "host_url": article_url,
        "canonical_url": article_url
    }
    r2_url = 'https://api.spot.im/conversation-read/spot/' + spotim_spotId + '/post/' + post_id + '/get'
    r2 = requests.post(r2_url, data=json.dumps(payload), headers={'X-Spotim-Token': spotim_token, "Content-Type": "application/json"})
    return r2.json()

if __name__ == '__main__':
    url = 'https://www.rt.com/usa/353493-clinton-speech-affairs-silence/'
    comments = get_rt_comments(url)
    print(comments)
Yes, if it can be viewed with a web browser, you can extract it.
If you look at the source, it is really an iframe that loads a piece of JavaScript, which then creates a new script tag in the document whose source loads bundle.js, the actual commenting software. This in turn fetches the actual comments.
Instead of going through this manually, you could consider using, for example, WebKit to create a headless browser that executes the JavaScript like an ordinary browser. Then you can scrape from that, instead of having to make your crawler fetch the external resources manually.
Examples of such headless browsers are Spynner, dryscrape, or the PhantomJS-derived PhantomPy (the latter seems to be an abandoned project now).
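If you try the headless route, a minimal PhantomJS sketch could look like the one below. The .spcv_message-text selector is a guess: you would need to inspect the rendered DOM to find the real one, and since the widget renders inside an iframe you may first have to switch into it with page.switchToFrame().

var page = require("webpage").create();
var url = "https://www.rt.com/usa/353493-clinton-speech-affairs-silence/";

page.open(url, function (status) {
    if (status !== "success") {
        console.log("failed to load " + url);
        phantom.exit(1);
    }
    // Give the spot.im widget a few seconds to fetch and render the comments.
    setTimeout(function () {
        var comments = page.evaluate(function () {
            // Hypothetical selector: replace it with the real one.
            var nodes = document.querySelectorAll(".spcv_message-text");
            return Array.prototype.map.call(nodes, function (el) {
                return el.textContent;
            });
        });
        console.log(JSON.stringify(comments, null, 2));
        phantom.exit();
    }, 5000);
});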

Share webpages on social media with counter

I'm creating a website that's going to have hundreds of pages. I want each page to be shareable on Facebook and Twitter. I've already created these buttons, but I also want to have their respective share counters next to my share buttons. I don't want to use the standard method Facebook provides because the code looks bloated.
Right, so after doing some research, I found this example on CodePen.
This looks exactly like what I want - very simple!
However, I need some clarification and basic help with how this JavaScript code works:
var permalink = 'http://codepen.io';

var getTwitterCount = function () {
    $.getJSON('http://urls.api.twitter.com/1/urls/count.json?url=' + permalink + '&callback=?', function (data) {
        var twitterShares = data.count;
        $('.twitter .share-count').text(twitterShares);
    });
};
getTwitterCount();

var getFacebookCount = function () {
    $.getJSON('http://graph.facebook.com/?ids=' + permalink + '&callback=?', function (data) {
        var facebookShares = data[permalink].shares;
        $('.facebook .share-count').text(facebookShares);
    });
};
getFacebookCount();
This bit of code:
var permalink = 'http://codepen.io';
Does this have to be:
1) the URL of the actual page I want shared, e.g. http://www.example.com/page-1/
OR
2) the root of the domain name, e.g. http://www.example.com/?
Or am I missing something else?
If the answer is #1 above, that means I would have to edit this line for each page, which isn't ideal: I keep all my JavaScript code and plugins in ONE .js file to reduce HTTP requests, so I'd prefer not to have to add this JavaScript on-page for every page.
It would be the page that you want to share, but you could get around hard-coding a separate value for each page by setting it to something like document.location.href, for example.
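A one-line sketch of that idea, living in the single shared .js file:

// Each page reports its own URL at runtime, so no per-page edits are needed.
var permalink = document.location.href;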

Generic link to open latest version of document wiki pages

Hi, I am completely new to SharePoint and wiki pages. I managed to make a few changes to wiki pages to get a feel for it. I noticed that every time I create a link to a document, I need to update the link manually by editing it whenever the version changes. Is there any way to automate this process?
E.g.: Docv1.0.doc is updated to Docv2.0.doc
Thanks
When you change a document, you don't need to change the file name. SharePoint has versioning built in, so you can keep the file name the same.
That's the only solution, actually: don't change the filename. Enable versioning on the library to be able to see previous versions.
The SharePoint site has several links to templates and documents that point to a shared server, and these documents can be updated as new versions, so the links to these files need to be updated automatically; in fact, these links need to call some script that dynamically links them to the latest files (I'm not sure if there is a better way of doing this without a script). Here is what I managed to do; better and other options would be appreciated.
I managed to get something working using the Content Editor web part and linking it to a file. I'm not sure if this is the only/best approach for SharePoint 2007.
<script type="text/javascript">
function getLatestFile(){
var myObject;
var recent = "";
myObject = new ActiveXObject("Scripting.FileSystemObject");
var folderObj = myObject.GetFolder("C:\Test");
var fc = new Enumerator(folderObj.files);
for(var objEnum = new Enumerator(FileCollection); !objEnum.atEnd(); objEnum.moveNext()) {
If (recentFile = ""){
recentFile = file;
else if (file.DateLastModified > recentFile.DateLastModified){
recentFile = file;
}
}
}//for loop
alert("recentFile : " + recentFile);
var mylink = document.getElementById("myLink");
mylink.setAttribute("href", urlToFile);
mylink.click();
}
</script>
<P> </P><A id="myLink" onclick="getUrl();"> TestFile1 </A>
Content Editor web part
Check the above link for more on using content links to run JavaScript and HTML.

Using RHINO js engine to make http requests

I'm trying to use the Mozilla/Rhino JS engine to test some SOAP requests from the command line. However, none of the usual objects for making requests (XMLHttpRequest, HttpRequest) seem to be available. Why is this? Can I import libraries?
I was able to get it to work using just Rhino with the following code.
var post = new org.apache.commons.httpclient.methods.PostMethod("https://someurl/and/path/");
var client = new org.apache.commons.httpclient.HttpClient();

// ---- Authentication ---- //
var creds = new org.apache.commons.httpclient.UsernamePasswordCredentials("username", "password");
client.getParams().setAuthenticationPreemptive(true);
client.getState().setCredentials(org.apache.commons.httpclient.auth.AuthScope.ANY, creds);
// ------------------------ //

post.setRequestHeader("Content-type", "application/xml");
post.setRequestEntity(new org.apache.commons.httpclient.methods.StringRequestEntity(buildXML(), "text/plain", "ASCII"));

var status = client.executeMethod(post);

var br = new java.io.BufferedReader(new java.io.InputStreamReader(post.getResponseBodyAsStream()));
var response = "";
var line = br.readLine();
while (line != null) {
    response = response + line;
    line = br.readLine();
}
post.releaseConnection();
You might possibly find a library to import; you could also write your own in Java and make it available to your Rhino instance, depending on how you are using it. Keep in mind that Rhino is just a JavaScript language engine. It doesn't have a DOM and is not inherently 'web-aware', so to speak.
However, since it sounds like you are doing this for testing/experimentation purposes, and you will probably be more productive not having to reinvent the wheel, I will strongly, strongly suggest that you just download Node.js and look into the request module (for making HTTP requests) or any of the various SOAP modules.
You can do a ton more with Node.js, but you can also use it as a very simple runner for JavaScript files. Regardless, you should move away from Rhino; it is really old and not really supported anymore, especially now that, with JDK 8, even the javax.script support will switch to the Nashorn engine.
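For comparison, here is a sketch of the same POST as the Rhino example above, using Node.js and the request module; the URL, credentials, and buildXML() helper are placeholders carried over from that example:

var request = require("request");

request.post({
    url: "https://someurl/and/path/",
    auth: { user: "username", pass: "password", sendImmediately: true }, // preemptive auth
    headers: { "Content-Type": "application/xml" },
    body: buildXML() // same XML payload builder as in the Rhino version
}, function (err, res, body) {
    if (err) throw err;
    console.log(res.statusCode);
    console.log(body);
});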
UPDATE: If you really want to give it a go (and if you are prepared to monkey around with Java), you might look at this SO question and its answers. But unless you are something of a masochist, I think you'll be happier taking a different path.
I was actually able to do this using Orchestrator 5.1 with the 'Scriptable task' object to interface with the Zabbix API:
var urlObject = new URL(url);
var jsonString = JSON.stringify({ jsonrpc: '2.0', method: 'user.login', params: { user: 'username', password: 'password' }, id: 1 });
urlObject.contentType = "application/json";
result = urlObject.postContent(jsonString);
System.log(result);
var authenticationToken = JSON.parse(result).result;

Retrieving the favicon url of the websites from firefox extension

I want to retrieve the favicon URL of a website once it is loaded. How can I implement this in my Firefox extension?
You can use nsIFaviconService; it caches favicons for known pages. Along these lines:
var faviconService = Components.classes["@mozilla.org/browser/favicon-service;1"]
                               .getService(Components.interfaces.nsIFaviconService);
var favicon = faviconService.getFaviconImageForPage(gBrowser.currentURI);
alert(favicon.spec);
Please note that it works with nsIURI objects, not with strings. You can use nsIIOService.newURI() to get an nsIURI object from a string.
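For example, to look up the favicon for a plain URL string (reusing the faviconService instance from the snippet above):

// Build an nsIURI from a string, then ask the favicon service for the icon.
var ioService = Components.classes["@mozilla.org/network/io-service;1"]
                          .getService(Components.interfaces.nsIIOService);
var pageURI = ioService.newURI("http://www.example.com/", null, null);
var faviconURI = faviconService.getFaviconImageForPage(pageURI);
alert(faviconURI.spec);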
Yes, I realize that I am duplicating karthik's answer - but it has no explanation and only a bogus code example.
https://developer.mozilla.org/en/nsIFaviconService
https://developer.mozilla.org/en/Using_the_Places_favicon_service
Please read the page carefully. You can use the service defined below:
nsIServiceManager serviceManager = Mozilla.getInstance().getServiceManager();
nsIFaviconService service = (nsIFaviconService) serviceManager.getServiceByContractID(
    "@mozilla.org/browser/favicon-service;1", nsIFaviconService.NS_IFAVICONSERVICE_IID);
