Getting another html document and extracting its text - javascript

I have the task of writing JavaScript code to extract the text from an
external web page and count the number of occurrences of each word in the text. I am also given these two assumptions:
You may assume that the web page will be on the same file system as the web page
written for the exercise.
You may also assume that the web page comprises correctly-formed XHTML
I've worked out from similar posts on this site how to get the text from the HTML using .textContent and .innerText.
I want the user to be able to specify the web page in a text input.
What I don't understand is how to get hold of the other HTML document so that I can extract its text and parse it.

Use jQuery's .load():
var targetDiv = $('#my-div');
var input = $('input');
// .val() reads the text typed into the input; a jQuery object has no .value property
targetDiv.load(input.val());

Executing JavaScript in someone's browser means you are asking that user's browser to do something for you. Browsers limit loading a completely foreign page this way for security reasons, to protect both the user and the external site. If the foreign site allows you to download and parse its content, then jQuery.get is enough for this.
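Once the text of the other page is available (for example from .textContent of the loaded element), the counting part of the task could look like the sketch below. The function name countWords and the definition of a "word" are assumptions made for illustration; they do not come from the exercise.

```javascript
// Count how often each word occurs in a block of text.
// Assumption: a "word" is a run of letters, digits, or apostrophes,
// and words are compared case-insensitively.
function countWords(text) {
  const counts = new Map();
  const words = text.toLowerCase().match(/[\p{L}\p{N}']+/gu) || [];
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

const sampleCounts = countWords("The cat saw the other cat.");
console.log(sampleCounts.get("the")); // 2
console.log(sampleCounts.get("cat")); // 2
```

In the browser, the argument would be the text of the loaded element, for example countWords(document.getElementById('my-div').textContent).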

Related

Can I create a new web page from the HTML in a div using javascript?

I have a web page that allows a user to choose some options for a widget and then dynamically generates example HTML based on those options. The HTML is put in a div on the page so that the user can see how it looks and copy/paste it to their own site, if they so desire.
I would like to add a "view this example page" link, which opens in a new window and has the example HTML from the div, so that the example can instantly be seen in action.
Is there a way to do this with javascript/jquery?
You can actually use the window.open method, saving a reference to the opened window, and then writing to it.
https://developer.mozilla.org/en-US/docs/Web/API/Window/open
var exampleWin = window.open("", "example");
var docMarkup = "<!doctype html><html><head><title>test</title></head>" +
"<body><p>Hello, world.</p></body></html>";
exampleWin.document.write(docMarkup);
// later you can also do exampleWin.close() if you wish
Try pasting the above code in your browser's developer tools console.
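To tie this back to the question, the markup string can be built from the div's contents rather than hard-coded. This is a minimal sketch; the helper names escapeHtml and buildExampleDocument and the element id "example" are hypothetical.

```javascript
// Escape text so it is safe inside an HTML element such as <title>.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap an HTML fragment (for example the example div's innerHTML)
// in a complete document.
function buildExampleDocument(title, bodyHtml) {
  return "<!doctype html><html><head><title>" + escapeHtml(title) +
    "</title></head><body>" + bodyHtml + "</body></html>";
}

// In the browser (assumes a div with the hypothetical id "example"):
//   var exampleWin = window.open("", "example");
//   exampleWin.document.write(
//     buildExampleDocument("Example",
//       document.getElementById("example").innerHTML));
//   exampleWin.document.close();
```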
The usual way to accomplish the end goal works a bit differently. You have a web server listening for GET requests at /code (or similar) and it constructs and responds with the appropriate HTML based on the query string. So you can request /code?color=blue, for example.
Constructing documents is what web servers are designed to do. This approach allows you to leverage caching policies, integrate with a wider variety of user authentication and authorization systems, etc.
To display the source code to the user, simply fetch() the appropriate URL and put the contents in a <code> tag. To display the rendered widget, use an <iframe> whose src is the same URL.
If you really want it to be a new window, open() the URL instead of using an iframe. But beware of popup blockers.
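A rough sketch of the client side of this approach follows. It assumes the /code endpoint from the answer exists on your server; the helper name buildCodeUrl and the "size" option are made up for illustration.

```javascript
// Build the URL for the server-generated example from the chosen options.
function buildCodeUrl(options) {
  const params = new URLSearchParams(options);
  return "/code?" + params.toString();
}

const exampleUrl = buildCodeUrl({ color: "blue", size: "large" });
console.log(exampleUrl); // "/code?color=blue&size=large"

// In the browser (codeElement and iframe are assumed to be existing elements):
//   fetch(exampleUrl)
//     .then(function (response) { return response.text(); })
//     .then(function (source) { codeElement.textContent = source; });
//   iframe.src = exampleUrl;
```

Assigning to textContent rather than innerHTML keeps the fetched markup visible as source instead of rendering it.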

How to extract title, image from others' blog posts and publish on own site

I'm planning to build a site where I can share my handpicked curated contents and I couldn't wrap my head around the basic idea of getting those data fed into my site without going through API.
I first thought maybe I should inspect the source HTML of the page I want to embed on my site and access it with something like $('div.post').find('img').attr('src').
But I can't imagine myself doing that every time so I guess there must be a better way.
This is what Google+ does with its posts: once you add a URL, after a second it pulls the featured image and a text snippet from the linked page.
Many sites use the Open graph protocol to get the meta-title, meta-description, image etc. for any url.
For example open: view-source:https://blog.kissmetrics.com/open-graph-meta-tags/ and search for "Open Graph Protocol Meta".
They are contained in the page source. You will have to send a request to the URL you want to read from, and extract the appropriate meta tags with regular expressions or an HTML parser.
You can't do this with client-side JavaScript. You need a server-side script that downloads the page you need and then parses it with a DOM parser.
With PHP you can get the content of one URL with cURL.
See more:
http://php.net/manual/es/book.curl.php
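The meta-tag reading step can also be sketched in JavaScript running on a server (for example Node.js), once the page source has been downloaded into a string. This is only an illustration using regular expressions; the function name extractOpenGraph is hypothetical, and a real HTML parser would be more robust (this sketch fails if an attribute value contains ">").

```javascript
// Collect Open Graph properties (og:title, og:image, ...) from page source.
function extractOpenGraph(html) {
  const result = {};
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of metaTags) {
    // Read every name="value" or name='value' pair inside the tag,
    // so the order of property and content does not matter.
    const attrs = {};
    const attrPattern = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
    let match;
    while ((match = attrPattern.exec(tag)) !== null) {
      attrs[match[1].toLowerCase()] =
        match[2] !== undefined ? match[2] : match[3];
    }
    if (attrs.property && attrs.property.startsWith("og:") &&
        attrs.content !== undefined) {
      result[attrs.property.slice(3)] = attrs.content;
    }
  }
  return result;
}

const sampleHtml =
  '<head><meta property="og:title" content="Hello">' +
  "<meta content='http://example.com/a.png' property='og:image' /></head>";
console.log(extractOpenGraph(sampleHtml));
// { title: 'Hello', image: 'http://example.com/a.png' }
```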

How to get Dynamic HTML code by PHP or JS

I want to get content from a website, but when I use the file_get_contents() function, some of the HTML is missing. I checked the site's code and found that some parts are generated by Ajax. I don't know how to get them. Does anyone have any suggestions?
Here is an example:
Site: http://www.drbattery.com/category/notebook+battery/acer/aspire+series.aspx?p=3
Request: I want to get the laptop models listed on this page, such as "Aspire 1690". I need all of those models.
Mhm.
In JS you can access the HTML content in a browser by
document.getElementsByTagName('body')[0].innerHTML
Doing this server-side, you would probably need a headless browser for this.
The tricky part would be detecting when the content has finished loading and everything is in place. (You won't be able to track AJAX requests with "window.onload".)
Doing it manually, you could add a bookmarklet to your browser, like
javascript:alert(document.getElementsByTagName('body')[0].innerHTML)
You could then select the alert's content by keyboard shortcut (CTRL + A or Command + A), copy it, and hit return (as the dialog's close-button will probably be out of sight).

How can i convert html content(having images also) to a document file using javascript?

I am trying to capture some content of a div in HTML (both text and images) and I want to convert that to a doc file so that I can mail it. I am using HTML5, JavaScript and jQuery.
I have to convert it on the client side.
Here's your solution http://www.phpdocx.com/. You'll need a server side language to do that, the example uses PHP.
Since you have such strict requirements:
Email the user a link with a version of the report you want the user to see when they open the doc.
Tell the user to open MS Word, Click File, Open, Then paste the link in.
The user can then save it as a .doc file where ever they want.
Note: By the way, this is the wrong answer. Although you've already turned it down, slash197's answer is the correct way to do this and the way I would normally suggest.
That or just email the report as html.

How to download the current page as a file / attachment using Javascript?

I am aware of the hidden iFrame trick as mentioned here (http://stackoverflow.com/questions/365777/starting-file-download-with-javascript) and in other answers.
I am interested in a similar problem:
How can I use JavaScript to download the current page (i.e. the current DOM, or some subset of it) as a file?
I have a web page which fetches results from a non-deterministic query (eg. a random sample) to display to the user. I can already, via a querystring parameter, make the page return a file instead of rendering the page. I can add a "Get file version" button (our standard approach) but the results will be different to those displayed because it is a different run of the query.
Is there any way via Javascript to download the current page as a file, or is copying to the clipboard my only option?
EDIT
An option suggested by Stefan Kendall and dj_segfault is to write the result server side for later retrieval. Good idea, but unfortunately writing files server side is out of the question in this instance.
How about (shudder) passing the innerHTML as a POST parameter to another page?
You can try with the protocol data:text/attachment
Like in:
<html>
  <head>
    <style>
    </style>
  </head>
  <body>
    <div id="hello">
      <span>world</span>
    </div>
    <script>
      (function(){
        document.location =
          'data:text/attachment;,' + //here is the trick
          document.getElementById('hello').innerHTML;
          //document.documentElement.innerHTML; //To Download Entire Html Source
      })();
    </script>
  </body>
</html>
Edit after shesek comment
To add to Mic's terrific answer above, some additional points:
If you have Unicode content (Or want to preserve indentation in the source), you need to convert the string to Base64 and tell the Data URI to treat the data as such:
(function(){
  document.location =
    'data:text/attachment;base64,' + // Notice the new "base64" bit!
    utf8_to_b64(document.getElementById('hello').innerHTML);
    //utf8_to_b64(document.documentElement.innerHTML); //To Download Entire Html Source
})();
function utf8_to_b64( str ) {
  return window.btoa(unescape(encodeURIComponent( str )));
}
utf8_to_b64() via MDN -- works in Chrome/FF.
You can drop this all into an anchor tag, allowing you to set the download attribute:
<a onclick="$(this).attr('href', 'data:text/plain;base64,' + utf8_to_b64($('html').clone().find('#generate').remove().end()[0].outerHTML));" download="index.html" id="generate">Generate static</a>
This will download the current page's HTML as index.html and removes the link used to generate the output. This assumes the utf8_to_b64() function from above is defined somewhere else.
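Since unescape() is deprecated, the same UTF-8 to Base64 conversion can be sketched with TextEncoder instead. This is one possible alternative, not the approach from MDN referenced above; the function names utf8ToBase64 and toDataUri are hypothetical.

```javascript
// Encode a string as UTF-8 bytes, then Base64-encode those bytes.
function utf8ToBase64(str) {
  const bytes = new TextEncoder().encode(str);
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte); // one character per byte, as btoa expects
  }
  return btoa(binary);
}

// Build a Base64 data: URI for the given text and MIME type.
function toDataUri(text, mimeType) {
  return "data:" + mimeType + ";charset=utf-8;base64," + utf8ToBase64(text);
}

console.log(utf8ToBase64("é")); // "w6k="
console.log(toDataUri("hi", "text/html"));
// "data:text/html;charset=utf-8;base64,aGk="
```

For very large pages, building the intermediate string one character at a time may be slow; the sketch favors clarity over speed.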
Some useful links on Data URIs:
MDN article
MSDN article
Depending on the size, and on whether you need to support ancient browsers, you can consider creating a dynamic file using a data: URI and linking to it. I've seen several places that do that. To get the browser to download rather than display it, play around with the content type you put in the URI and use the new HTML5 download attribute. (Sorry for any typos, I'm writing from my phone)
I don't think you're going to be able to do it exactly the way you want to. JavaScript can't create a file and download it for security reasons. Nor can it create it on the server for download.
What I would do if I were you is, on the server side, create an output file with the session ID in the name in a temp directory as you create the output for the web page, and have a button on the web page with a link to that file.
You'll probably want a separate process to remove files over a day old or something like that.
Can you not cache the query results, and store it by some key? That way you can reference the same report output forever, or until your file garbage collector comes along. This also implies that you can create static URLs to report outputs, which tends to be nice.
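The caching idea could be sketched in server-side JavaScript as below. Everything here is an assumption made for illustration (an in-memory store, the name createResultCache, sequential keys); the original question does not say what the server runs on.

```javascript
// In-memory cache of query results, addressed by a generated key.
function createResultCache(maxAgeMs) {
  const entries = new Map();
  let nextId = 1;
  return {
    // Store a result and return the key to embed in the page's download link.
    store(result, now = Date.now()) {
      const key = String(nextId++);
      entries.set(key, { result: result, storedAt: now });
      return key;
    },
    // Return the stored result, or undefined if it was never stored or was purged.
    fetch(key) {
      const entry = entries.get(key);
      return entry ? entry.result : undefined;
    },
    // Play the role of the "garbage collector": drop entries older than maxAgeMs.
    purge(now = Date.now()) {
      for (const [key, entry] of entries) {
        if (now - entry.storedAt > maxAgeMs) entries.delete(key);
      }
    }
  };
}

const cache = createResultCache(24 * 60 * 60 * 1000); // keep results for a day
const sample = [Math.random(), Math.random()];        // stands in for the random query
const key = cache.store(sample);
console.log(cache.fetch(key) === sample); // true: page and file see the same run
```

Both the rendered page and the "file version" request would then read the result by key instead of re-running the query.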
