What algorithm does Readability use for extracting text from URLs?

For a while, I've been trying to find a way of intelligently extracting the "relevant" text from a URL by eliminating the text related to ads and all the other clutter. After several months of research, I gave up on it as a problem that cannot be solved accurately. (I've tried different ways but none were reliable.)
A week back, I stumbled across Readability - a plugin that converts any URL into readable text. It looks pretty accurate to me. My guess is that they somehow have an algorithm that's smart enough to extract the relevant text.
Does anyone know how they do it? Or how I could do it reliably?

Readability mainly consists of heuristics that "just somehow work well" in many cases.
I have written some research papers about this topic and I would like to explain the background of why it is easy to come up with a solution that works well and when it gets hard to get close to 100% accuracy.
There seems to be a linguistic law underlying human language that is also (but not exclusively) manifest in Web page content, and which already separates two types of text quite clearly (full-text vs. non-full-text or, roughly, "main content" vs. "boilerplate").
To get the main content from HTML, it is in many cases sufficient to keep only the HTML text elements (i.e. blocks of text that are not interrupted by markup) which have more than about 10 words. It appears that humans choose from two types of text ("short" and "long", measured by the number of words they emit) for two different motivations of writing text. I would call them "navigational" and "informational" motivations.
If an author wants you to quickly get what is written, he/she uses "navigational" text, i.e. few words (like "STOP", "Read this", "Click here"). This is the most prominent type of text in navigational elements (menus etc.).
If an author wants you to deeply understand what he/she means, he/she uses many words. This way, ambiguity is removed at the cost of an increase in redundancy. Article-like content usually falls into this class as it has more than only a few words.
While this separation seems to work in a plethora of cases, it is getting tricky with headlines, short sentences, disclaimers, copyright footers etc.
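The word-count rule described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the input is assumed to be an array of already-extracted text blocks, and the threshold of 10 is the rough figure mentioned above rather than a tuned value.

```javascript
// Hypothetical helper: count whitespace-separated words in a text block.
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Keep only blocks with more than `minWords` words ("informational" text).
// Short blocks ("navigational" text) are dropped.
function keepLongBlocks(blocks, minWords = 10) {
  return blocks.filter((block) => countWords(block) > minWords);
}

// Example input: assumed to be text blocks not interrupted by markup.
const blocks = [
  "Home",
  "Click here",
  "Readability mainly consists of heuristics that work well in many cases, which is why it is so popular.",
  "Copyright 2010",
];
const mainContent = keepLongBlocks(blocks);
```

As noted above, a filter like this will also drop headlines and short sentences, so it is a starting point rather than a complete extractor.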
There are more sophisticated strategies and features that help separate main content from boilerplate. For example, the link density (number of words in a block that are linked versus the overall number of words in the block), the features of the previous/next blocks, the frequency of a particular block text in the "whole" Web, the DOM structure of the HTML document, the visual image of the page, etc.
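As a rough illustration of the link density feature, the ratio can be computed per block as follows. The block shape ({ words, linkedWords }) and the cutoff of 0.33 are assumptions made for this sketch; they are not taken from Readability or boilerpipe.

```javascript
// block.words: total number of words in the block (assumed precomputed)
// block.linkedWords: number of those words that sit inside links
function linkDensity(block) {
  if (block.words === 0) return 0; // avoid dividing by zero for empty blocks
  return block.linkedWords / block.words;
}

// Hypothetical rule: blocks made up mostly of links look like navigation.
// The 0.33 cutoff is an arbitrary example value.
function isLikelyBoilerplate(block, maxDensity = 0.33) {
  return linkDensity(block) > maxDensity;
}

const menu = { words: 10, linkedWords: 10 };
const article = { words: 100, linkedWords: 2 };
const menuIsBoilerplate = isLikelyBoilerplate(menu);
const articleIsBoilerplate = isLikelyBoilerplate(article);
```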
You can read my latest article "Boilerplate Detection using Shallow Text Features" to get some insight from a theoretical perspective. You may also watch the video of my paper presentation on VideoLectures.net.
"Readability" uses some of these features. If you carefully watch the SVN changelog, you will see that the number of strategies varied over time, and so did the extraction quality of Readability. For example, the introduction of link density in December 2009 helped a great deal.
In my opinion, it therefore makes no sense to say "Readability does it like that" without mentioning the exact version number.
I have published an Open Source HTML content extraction library called boilerpipe, which provides several different extraction strategies. Depending on the use case, one or the other extractor works better. You can try these extractors on pages of your choice using the companion boilerpipe-web app on Google AppEngine.
To let numbers speak, see the "Benchmarks" page on the boilerpipe wiki which compares some extraction strategies, including boilerpipe, Readability and Apple Safari.
I should mention that these algorithms assume that the main content is actually full text. There are cases where the "main content" is something else, e.g. an image, a table, a video etc. The algorithms won't work well for such cases.

Readability is a JavaScript bookmarklet, meaning it's client-side code that manipulates the DOM. Look at the JavaScript and you should be able to see what's going on.
Readability's workflow and code:
/*
* 1. Prep the document by removing script tags, css, etc.
* 2. Build readability's DOM tree.
* 3. Grab the article content from the current dom tree.
* 4. Replace the current DOM tree with the new one.
* 5. Read peacefully.
*/
javascript: (function () {
readConvertLinksToFootnotes = false;
readStyle = 'style-newspaper';
readSize = 'size-medium';
readMargin = 'margin-wide';
_readability_script = document.createElement('script');
_readability_script.type = 'text/javascript';
_readability_script.src = 'http://lab.arc90.com/experiments/readability/js/readability.js?x=' + (Math.random());
document.documentElement.appendChild(_readability_script);
_readability_css = document.createElement('link');
_readability_css.rel = 'stylesheet';
_readability_css.href = 'http://lab.arc90.com/experiments/readability/css/readability.css';
_readability_css.type = 'text/css';
_readability_css.media = 'all';
document.documentElement.appendChild(_readability_css);
_readability_print_css = document.createElement('link');
_readability_print_css.rel = 'stylesheet';
_readability_print_css.href = 'http://lab.arc90.com/experiments/readability/css/readability-print.css';
_readability_print_css.media = 'print';
_readability_print_css.type = 'text/css';
document.getElementsByTagName('head')[0].appendChild(_readability_print_css);
})();
And if you follow the JS and CSS files that the above code pulls in you'll get the whole picture:
http://lab.arc90.com/experiments/readability/js/readability.js (this is pretty well commented, interesting reading)
http://lab.arc90.com/experiments/readability/css/readability.css

There's no 100% reliable way to do this, of course. You can have a look at the Readability source code here
Basically, what they're doing is trying to identify positive and negative blocks of text. Positive identifiers (i.e. div IDs) would be something like:
article
body
content
blog
story
Negative identifiers would be:
comment
discuss
And then they have unlikely and maybe candidates.
What they would do is determine what is most likely to be the main content of the site, see line 678 in the readability source. This is done by analyzing mostly the length of paragraphs, their identifiers (see above), the DOM tree (i.e. if the paragraph is a last child node), strip out everything unnecessary, remove formatting, etc.
The code has 1792 lines. It does seem like a non trivial problem, so maybe you can get your inspirations from there.
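To make the positive/negative identifier idea concrete, here is a minimal sketch. It is not Readability's actual code: the word lists are just the examples given above, and the function name and the weight of 25 are assumptions for illustration.

```javascript
// Example word lists taken from the lists above (not the full real ones).
const POSITIVE = /article|body|content|blog|story/i;
const NEGATIVE = /comment|discuss/i;

// Score an element's id or class name: positive hints add to the score,
// negative hints subtract from it. The weight is a hypothetical value.
function scoreIdentifier(identifier, weight = 25) {
  let score = 0;
  if (POSITIVE.test(identifier)) score += weight;
  if (NEGATIVE.test(identifier)) score -= weight;
  return score;
}

const articleScore = scoreIdentifier("article-body");
const commentScore = scoreIdentifier("comments");
const neutralScore = scoreIdentifier("sidebar");
```

A real extractor would combine a score like this with the other signals mentioned above (paragraph length, position in the DOM tree) before picking a winner.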

Interesting. I have developed a similar PHP script. It basically scans articles and attaches parts of speech to all text (Brill Tagger). Then, grammatically invalid sentences are instantly eliminated. Then, sudden shifts in pronouns or past tense indicate the article is over, or hasn't started yet. Repeated phrases are searched for and eliminated, such as "Yahoo news sports finance" appearing ten times in the page. You can also get statistics on the tone with a plethora of word banks relating to various emotions. Sudden changes in tone, from active/negative/financial to passive/positive/political, indicate a boundary. It's endless really, however deep you want to dig.
The major issues are links, embedded anomalies, scripting styles and updates.
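The repeated-phrase step is the easiest of these to illustrate. The sketch below is in JavaScript rather than PHP and is not the script described above; it assumes the page text has already been split into lines, and the function name and repeat limit are hypothetical.

```javascript
// Drop every line whose (case-insensitive, trimmed) text occurs more than
// `maxRepeats` times in the input; such lines are likely navigation clutter.
function dropRepeatedLines(lines, maxRepeats = 2) {
  const counts = new Map();
  for (const line of lines) {
    const key = line.trim().toLowerCase();
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return lines.filter(
    (line) => counts.get(line.trim().toLowerCase()) <= maxRepeats
  );
}

const pageLines = [
  "Yahoo news sports finance",
  "First paragraph of the article.",
  "Yahoo news sports finance",
  "Second paragraph of the article.",
  "yahoo news sports finance",
];
const cleaned = dropRepeatedLines(pageLines);
```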

Related

Resource impact to give many HTML elements an id?

I'm writing an HTML + JavaScript application that has quite strict resource limitations: it will run in the browser for ages (can be many days or more; think of kiosk mode) and should also run without any change on mobile devices. It is also only one HTML page, i.e. DOM, that uses scrolling etc. to show different content.
=> I really have to make sure not to waste any resources (CPU, RAM)
Now I'm creating hooks that an "external" editor for such an application / page could use, to have a WYSIWYG preview when modifying the content. Here I need to address elements on the page - an element is a div that will contain further DOM elements, but it is the smallest addressable unit for the editor. (We can probably assume 100 to 1000 of those elements in this long running page)
Now I could find the relevant element given by a "path" by an algorithm at runtime (not elegant, but lookup time is ok in an interactive environment).
Or I could add an HTML id attribute to the elements which contains each individual path. (This would make my program more clear and a lookup very fast)
But I don't know the resource impact of giving so many elements an id attribute...
How much RAM would it need? Only the strings and a couple of pointers each?
Or would it create lots of new and heavy internal structures in the browser?

Having additional ID attributes on your elements would have very minimal impact on any resource use.
The main effect would be that it could increase file size ever so slightly depending on how many and how long IDs you were to use.
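The two lookup strategies from the question can be compared with a small sketch. It uses plain objects as a stand-in for DOM nodes so that it runs anywhere; the tree shape, the id format and the function names are all assumptions. In a browser, document.getElementById plays the role of the id index, so no map would need to be built by hand.

```javascript
// A tiny stand-in for a DOM tree: each node has an id and a children array.
const root = {
  id: "0",
  children: [
    { id: "0/0", children: [] },
    { id: "0/1", children: [{ id: "0/1/0", children: [] }] },
  ],
};

// Option 1: resolve a path (array of child indices) by walking the tree.
function findByPath(node, path) {
  let current = node;
  for (const index of path) {
    current = current.children[index];
    if (!current) return null; // path does not exist
  }
  return current;
}

// Option 2: build an id -> node index once, then look nodes up directly.
// The extra memory is one map entry (a string key and a reference) per node.
function buildIdIndex(node, index = new Map()) {
  index.set(node.id, node);
  for (const child of node.children) buildIdIndex(child, index);
  return index;
}

const byPath = findByPath(root, [1, 0]);
const byId = buildIdIndex(root).get("0/1/0");
```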

Certain HTML Character Entities are HUGE in Firefox Only

For some reason, in my Firefox 12.0 for Mac OS X, my &rang; (〉) characters are much larger than they should be. On Chrome and Safari, they look exactly how I want them to be.
I have AddDefaultCharset utf-8 in my .htaccess as well as <meta charset="utf-8"> in my <head> (as the group I'm delivering these files to may not use my .htaccess).
Also, according to Adobe's Browser Lab, IE 7 and 8 just show a square box... is there any way I can get these browsers to support that character? It would make things a lot easier (as the colors are going to be changing, so images are very inconvenient, and no color fade with images).
Demo: http://cameronspear.com/demos/rang/
This is what I see in Chrome and expect to see:
This is what my Firefox is showing:
This is a screenshot from Browser Labs of IE8:
TL;DR: I want all of these screenshots to look like the first one using &rang; aka 〉 characters. Use of JavaScript would even be acceptable.
Thanks.
[edit] I should specify that having the 〉 character itself is not as crucial as being able to change its color with CSS and have it look the same across multiple browsers.
Solution
I just wanted to share exactly what I did for posterity's sake.
Thanks to Pointy's tips and resources, I created my own SVG with Inkscape using the template and methods as described at "How to make your own icon webfont". I mapped a big angle bracket to X and a small one to x.
The one thing I ran into was that my angle needs to touch the baseline and only go about 72% the way to the top of the box to fit "inline," so capital X was my original too-tall one, and lowercase x was the more inline one.
I then converted my SVG to TTF with http://www.freefontconverter.com/ and converted to a webfont with http://www.fontsquirrel.com/fontface/generator
... and that was it.
The demo (http://cameronspear.com/demos/rang/) is still up. You can see it looks consistent in all the browsers and the onclick rotation animation is dang close to the point, etc.
[Update] I found a great resource called IcoMoon that helps with making and organizing fonts for the web. It accepts regular SVG vectors, so you can make them in Illustrator and not mess with Inkscape, since IcoMoon handles the keyboard mapping and so on. You can export only the icons you use, so you load just 3 or 4 icons if that's all you need and not the entire font.
It's become an invaluable resource, and I recommend everyone else wanting to get into icon fonts check it out. You can learn more about the entire process from CSS-Tricks' 113th screencast.

Are you able to use images? They would provide a consistent look across all browsers. In many cases, images are preferable to character symbols.

This is a font issue. To maximize the odds of having a rare character (one that is not present in most fonts) rendered properly, specify a maximal list of fonts that all contain it.
The page now has just font-family: Arial,sans-serif set on the span elements that contain the bracket. Since Arial does not contain it, each browser will use its own definition for sans-serif. If the font that it is mapped to does not contain the bracket, clever browsers try something clever, like scanning through other fonts in the system, but this may still result in lack of any glyph for it.
There’s an additional problem. Normally it does not matter that you use entity references like &rang; instead of the character itself, but here it does. By HTML 4.01, &rang; means U+232A; by HTML5 drafts, it means U+27E9. IE obeys HTML 4.01 here, whereas Firefox uses the HTML5 definition. So it is better to use the character you really want, either as such in UTF-8 encoding, or as a numeric character reference such as &#x232A;.
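To see that the two definitions really are different characters, and to generate an unambiguous numeric character reference for whichever one you want, a small sketch like the following can help. The helper name toNumericReference is hypothetical; the code points are the two named above.

```javascript
// Build a hexadecimal numeric character reference for the first
// code point of `char`, e.g. "&#x232A;".
function toNumericReference(char) {
  const codePoint = char.codePointAt(0);
  return "&#x" + codePoint.toString(16).toUpperCase() + ";";
}

// The two characters that the same entity name can stand for.
const html401Rang = String.fromCodePoint(0x232a); // RIGHT-POINTING ANGLE BRACKET
const html5Rang = String.fromCodePoint(0x27e9); // MATHEMATICAL RIGHT ANGLE BRACKET

const html401Reference = toNumericReference(html401Rang);
const html5Reference = toNumericReference(html5Rang);
```

Writing the numeric reference (or the character itself) removes the dependence on which entity table the browser follows; it does not solve the font coverage problem.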
You can check e.g. the font coverage for U+232A and list the fonts in order of preference. But you should check that all of the fonts give an acceptable presentation. For example, if Cambria Math is used, the default line height will be very large, so you probably want to set line-height explicitly to some reasonable value like 1.3.
Finally—and this should perhaps have been asked very first—do you really want to use RIGHT-POINTING ANGLE BRACKET or MATHEMATICAL RIGHT ANGLE BRACKET? They are brackets, to be used as paired with left angle brackets, not arrow symbols.
Some more info: Guide to using special characters in HTML.

Making an icon font is easy enough that I can do it, though (for me) the process is somewhat mysterious. I suspect there are many actual graphic artists who are better at it, and surely many who understand the technical details more than I do.
Here is a pretty thorough blog post on the topic. (Not my blog.) The main thing it doesn't describe very well is the relationship between the Inkscape "art board" area and the vertical positioning of each glyph in the font. It goes into some detail, but I've just never been able to figure it out.
What I did, therefore, is just make a square artboard 1024 pixels on a side. I then set up a grid in Inkscape so that the art board is divided up into a 16x16 grid. That makes it (somewhat) easy to design characters that will render nicely at a 16px font size. (Of course you could target a different font size, if you want; 16x16 is good for stuff that needs to be pretty small however.) Then, I just make sure that when I put the glyphs on the page, they're in a 1em by 1em box (or 16px by 16px; however you want to do it in your CSS) with no padding. I use <i> tags, and give them display: inline-block. That gives me a lot of flexibility, and it generally works great.
The Inkscape SVG font tool is, to put it mildly, pretty raw. It's literally the result of somebody's summer project. It works, but not much more than barely. Save often.
Now the process for generating the font files is somewhat crazy. I use FontSquirrel. I upload the .svg saved from Inkscape, and then ask for EOT, WOFF, and TTF. Amazingly, it works.
If you just need a few glyphs, this is a pretty sweet way to go, because you'll have a little bitty font file to download and it'll be cached by the browser. There are some accessibility issues however and the practice is sufficiently controversial that some more fanatical members of the community may consider you a barbarian for doing this :-)

What is the difference between using tab and space when we do source code formatting?

When we want to format code to make it more readable, we can use both Tab and Space (by pressing the space bar on the keyboard). Many times I add spaces manually.
I have a few points of confusion:
Is there any difference between them?
Is using tabs better than using spaces?
Does using spaces increase the size of the document, while tabs do not?
Is there any difference between a single tab/space and multiple ones in terms of page size?
Is there any difference between using tabs and spaces in HTML, CSS and JavaScript?
Dreamweaver has this option in preferences

Tabs need fewer characters and are displayed according to each user's settings (some display them as 2 spaces, some as 4), which can make code look bad in another environment.
Spaces consume more KBs but look equal everywhere.
Good editors have retab-functions to convert those.
In JS and CSS it does not matter, HTML should not matter, but can in some cases.

Using tabs is annoying because some editors interpret a tab as 4 spaces, some others as 8 spaces, and some others as 2 spaces, which makes the indentation completely wrong if tabs and spaces are used in the same file.
I always use spaces only to avoid this problem.
It could slightly increase the download size of your pages, but you could also minify the JavaScript and the CSS files, and/or use gzip on-the-fly compression to mitigate this small problem.

Using tabs:
A tab takes less space.
Tab width is defined by the user's system, so if I prefer a tab to be 2 spaces wide I can view it that way.
Moving one indentation level back is a lot easier if you use tabs.
Using spaces:
The tab-to-space ratio usually defaults to 1:8, so on all 'newbie' systems your tab-indented code will be difficult to read. If you view your code on GitHub or Pastebin it will again look somewhat awkward.
My take: go with tabs for development, find and replace '\t' with '    ' [4 spaces], and then minify for release [this strips tabs and spaces].

Personally, I see it as an issue of preference involving ease of formatting. I'd rather use tab, because it's only one click and it simulates 4-5 spaces. A lot simpler in my mind. Otherwise, I don't see much purpose.

In pretty much all cases, you can use tabs or spaces and it won't make a difference. Using tabs will make the source files slightly smaller, while using spaces will ensure that spacing is consistent for everyone (since tabs can have a variable width).
I believe the only language where it actually matters is Python. For any other language, tabs vs. spaces is basically even - just be sure to pick one and be consistent with it.

Using a tab technically takes up less memory than using, say, 5 spaces. So when trying to optimize file size it could be helpful. However, minifying the text would have a better effect. Look up minifying code for more on this topic.
Some people prefer spaces and some prefer tabs. It's a matter of preference and many folks have different reasons for it. Here is a great article on this point: http://www.jwz.org/doc/tabs-vs-spaces.html
Chances are for your application it won't matter much.

If your code is being downloaded, such as HTML or Javascript or CSS, it makes a difference because the file is larger if spaces are used. How much larger depends on the number of lines and indent levels. It's the same as including comments in the code: they do increase file size.
If your code will be compiled, such as Actionscript or Java or C, or tokenized such as Perl, it makes no difference whether you use spaces or tabs, and you can include as many paragraphs of comments/documentation as you like, because it's only for your own benefit. All those tabs and spaces and comments will be ignored when the final, lower-level code is built.
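To get a feel for how much larger a file becomes with spaces, the difference can simply be measured. The sketch below assumes tabs are only used for leading indentation and that each tab is replaced by 4 spaces; the function name and the sample source are made up for illustration.

```javascript
// Replace each leading tab on every line with `width` spaces.
function tabsToSpaces(source, width = 4) {
  return source.replace(/^\t+/gm, (tabs) => " ".repeat(tabs.length * width));
}

// Sample source indented with tabs (4 tab characters in total).
const tabbed = "function f() {\n\tif (x) {\n\t\treturn 1;\n\t}\n}\n";
const spaced = tabsToSpaces(tabbed);

// Each tab became 4 characters, so the file grows by 3 characters per tab.
// For ASCII-only source, characters and bytes are the same count.
const extraChars = spaced.length - tabbed.length;
```

The growth is proportional to the number of lines times their indent depth, and on-the-fly compression or minification removes most of it before download.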

Find links to images in HTML (incl. outside of common tags/attributes)

I'd like to find (using javascript) all of the references to image links on an HTML page. Since I'm also looking for image references that may not be displayed, or are in unknown attribute types, simply looking for image tags or src's etc. isn't enough. As such, I haven't yet found a simple method using an html parser to do this.
Having looked through the stackoverflow threads, I don't want to lose my soul by employing the dark method of matching that dare not speak its name - I hesitate to mention it here, lest I draw down the fury of those who hate using regu1#r_expre$$i0n$ for such a purpose. But I haven't found the right method yet either.
I know that not all links that look like image links are images, and vice versa, but that's OK. I don't need complete coverage, just the widest possible without sacrificing speed. So I'm guessing that following all the links is too intensive, and restricting myself to links that 'look' like images will be just fine.

Anthony was right, regex worked just fine for my purpose here.
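A regex along these lines might look like the sketch below. It is not the expression that was actually used here; the pattern, the list of extensions and the function name are assumptions, and it deliberately matches anything that merely looks like an image URL, wherever it appears in the markup.

```javascript
// Match runs of URL-ish characters ending in a common image extension,
// optionally followed by a query string. Case-insensitive, global.
const IMAGE_URL =
  /[^\s"'()<>=]+\.(?:png|jpe?g|gif|bmp|webp|svg)\b(?:\?[^\s"'()<>]*)?/gi;

// Return the unique image-like references found in an HTML string.
function findImageLinks(html) {
  return Array.from(new Set(html.match(IMAGE_URL) || []));
}

const sample =
  '<img src="a.png"><div data-bg="http://example.com/img/b.JPG?v=2"></div>' +
  '<a href="page.html">x</a><p style="background:url(c.gif)"></p>';
const found = findImageLinks(sample);
```

In a page, the input could be document.documentElement.outerHTML. As the question already accepts, this yields false positives and misses images served from URLs without a file extension.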

Jquery append using multiline

I have been working on a project that dynamically creates a javascript file using ASP.NET which is called from another site.
This jquery javascript file appends a div and fills it with a rather large HTML segment and in order to do that I need to turn the segment into a string like so:
$(document).ready(function(){
var html = "Giving this magazine such a lofty epithet may seem a bit presumptuous, but for a non scientifically trained outsider this magazine offers a fresh and challenging look at the fast paced world of science that doesn't shy away from humor and the use of terms and ideas that may require its readers to go online and define a term. And in some cases it may inspire the reader to pick up a book on science by such greats as Hawking and Greene in order to better grasp some of the concepts dealing with time, space and atoms. This magazine isn't dumbed down. It includes well placed and efficient illustrations to help explain some of the more abstract points. It is not designed in the way popular magazinea are, in so much as they only touch upon a topic in the simplest manner and then move on before the audience is lost. Yet this magazine keeps the attention of the reader by combining explanatory notes that help people with no background knowledge have some grasp of the topic and by using humor and well written articles to clearly make their points. <br />For a magazine with a serious and well researched list of topics having small cartoons the likes of the New Yorker shows how comfortable this magazine is with itself. From the moment I picked up this magazine for the first time I felt like every word I read mattered and was worth my time to read. (Not true of many other magazines) American Scientist may not have the audience of Discover or National Geographic, nor is it as accessible as said titles, but for those with a true interest in science willing to challenge themselves and commit to real learning this magazine may be a perfect fit. At $4.95 it is certainly worth it to pick a copy on the news stand and try it out."
$("#divname").append(html);
});
As you can see the segment will be pretty large and I have no way of knowing how big as it is generated dynamically from my database depending on the reviewID which is defined by the user in their request.
The html to be inserted into the div is a list of reviews and is generated using asp.net MVC by a repeater which loops through a list. (if that helps give you an idea of what I am doing).
Is there any way to turn this large segment into one string which can be inserted into the append script?
Thank You

Cross domain jquery json
http://docs.jquery.com/Release:jQuery_1.2/Ajax#Cross-Domain_getJSON_.28using_JSONP.29

Some ideas:
You can replace new lines with spaces and create a huge line. There shouldn't be a problem with it.
Use string concatenation. Split the string into lines and do:
var html = line1 +
line2 +
...
linen;
Make an Ajax call to fill the div:
$("#divname").load(service_url);
You need to create a service that will return the string.
In my opinion the 3rd option is better than the other ones.

Correct me if I'm wrong, but I think everything between the starting and ending quotation marks would be considered part of that string no matter how many lines it has. Unless your string has got any quotation marks in itself, in which case it'll be better to do the equivalent of PHP's addslashes() function in ASP on your string, which should add a \ before all the " marks in the string.
Another idea can be to use JSON to encode/decode the string.
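The JSON idea can be illustrated as follows. Note that a raw line break inside a quoted JavaScript string literal is a syntax error, so newlines need escaping as well as quotes; JSON encoding handles both. The sketch uses JavaScript's own JSON.stringify to show the effect; in the setup described in the question the same encoding step would be done by whatever JSON serializer is available on the server, and the variable names here are made up.

```javascript
// Example content with quotes, markup, a newline and a backslash.
const reviewHtml =
  'She said "worth it".<br />\nSecond line with a backslash \\ in it.';

// JSON.stringify returns a double-quoted, fully escaped string literal.
const literal = JSON.stringify(reviewHtml);

// The generated script can embed the literal directly on one line.
const generatedScript = "var html = " + literal + ";";

// Decoding gives back exactly the original text.
const roundTripped = JSON.parse(literal);
```

The generated line can then be followed by the usual $("#divname").append(html); call.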

i don't see what's wrong with just generating one big-ass long single-line string and appending it just like you are doing. period. done. Fancier isn't going to gain you anything.

Hide it elsewhere on the page and populate the div with it when you need it?
