I'm trying to fetch an entire webpage using JavaScript by plugging in the URL. However, the website is built as a Single Page Application (SPA) that uses JavaScript / backbone.js to dynamically load most of its contents after rendering the initial response.
So for example, when I route to the following address:
https://connect.garmin.com/modern/activity/1915361012
And then enter this into the console (after the page has loaded):
var $page = $("html")
console.log("%c✔: ", "color:green;", $page.find(".inline-edit-target.page-title-overflow").text().trim());
console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
Then I'll get the dynamically loaded activity title as well as the statically loaded page footer:
However, when I try to load the webpage via an AJAX call with either $.get() or .load(), I only get the initial response (the same content that view-source shows):
view-source:https://connect.garmin.com/modern/activity/1915361012
So if I use either of the following AJAX calls:
// jQuery.get()
var url = "https://connect.garmin.com/modern/activity/1915361012";
jQuery.get(url, function(data) {
  var $page = $("<div>").html(data);
  console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
  console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
// jQuery.load()
var url = "https://connect.garmin.com/modern/activity/1915361012";
var $page = $("<div>");
$page.load(url, function(data) {
  console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
  console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
I'll still get the initial footer, but won't get any of the other page contents:
I've tried the solution here to eval() the contents of every script tag, but that doesn't appear robust enough to actually load the page:
jQuery.get(url, function(data) {
  var $page = $("<div>").html(data);
  $page.find("script").each(function() {
    var scriptContent = $(this).html(); // Grab the content of this tag
    eval(scriptContent); // Execute the content
  });
  console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
  console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
Q: Are there any options to fully load a webpage so that it can be scraped with JavaScript?
You will never be able to fully replicate by yourself what an arbitrary (SPA) page does.
The only way I see is using a headless browser such as PhantomJS or Headless Chrome, or Headless Firefox.
I wanted to try Headless Chrome so let's see what it can do with your page:
Quick check using internal REPL
Load that page with Chrome Headless (you'll need Chrome 59 on Mac/Linux, Chrome 60 on Windows), and find page title with JavaScript from the REPL:
% chrome --headless --disable-gpu --repl https://connect.garmin.com/modern/activity/1915361012
[0830/171405.025582:INFO:headless_shell.cc(303)] Type a Javascript expression to evaluate or "quit" to exit.
>>> $('body').find('.page-title').text().trim()
{"result":{"type":"string","value":"Daily Mile - Round 2 - Day 27"}}
NB: to get chrome command line working on a Mac I did this beforehand:
alias chrome="'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'"
Using programmatically with Node & Puppeteer
Puppeteer is a Node library (by Google Chrome developers) which provides a high-level API to control headless Chrome over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome.
(Step 0 : Install Node & Yarn if you don't have them)
In a new directory:
yarn init
yarn add puppeteer
Create index.js with this:
const puppeteer = require('puppeteer');
(async() => {
  const url = 'https://connect.garmin.com/modern/activity/1915361012';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Go to URL and wait for page to load
  // ('networkidle2' replaces the 'networkidle' option that later Puppeteer releases removed)
  await page.goto(url, {waitUntil: 'networkidle2'});
  // Wait for the results to show up
  await page.waitForSelector('.page-title');
  // Extract the results from the page
  const text = await page.evaluate(() => {
    const title = document.querySelector('.page-title');
    return title.innerText.trim();
  });
  console.log(`Found: ${text}`);
  // Wait for the browser to shut down before the script exits
  await browser.close();
})();
Result:
$ node index.js
Found: Daily Mile - Round 2 - Day 27
First off: avoid eval - your content security policy should block it and it leaves you open to easy XSS attacks. Scraping bots definitely won't run it.
The problem you're describing is common to all SPAs - when a person visits they get your app shell script, which then loads in the rest of the content - all good. When a bot visits they ignore the scripts and return the empty shell.
The solution is server-side rendering. If you're using a JS renderer (say React) with Node.js on the server, one way to do this is to build the pages on the server and serve the result statically, which is fairly easy.
However, if you aren't then you'll need to run a headless browser on your server that executes all the JS a user would and then serves up the result to the bot.
Fortunately someone else has already done all the work here. They've put a demo online that you can try out with your site:
I think you should understand the concept of an SPA.
An SPA is a Single Page Application: the server delivers only a static HTML file. When the route changes, the page uses JavaScript to create or modify DOM nodes dynamically, which gives the effect of switching pages.
Therefore, if you use $.get(), the server responds with that static HTML file, which always has the same content, so you won't load what you want.
If you want to use $.get(), there are two ways. The first is a headless browser, for example Headless Chrome or PhantomJS. It loads the page for you, and you can then read the DOM nodes of the loaded page. The second is SSR (Server-Side Rendering). With SSR, $.get() returns the page's HTML directly, because the server responds with the HTML of the corresponding page for each route requested.
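As a rough illustration of the SSR idea, here is a minimal sketch using only Node's built-in http module. The route pattern, the `activities` data and the `renderRoute` name are made up for this example; a real application would use its own framework and data source.

```javascript
const http = require('http');

// Hypothetical data store; a real app would query a database or an API here.
const activities = new Map([
  ['1', 'Morning run'],
  ['2', 'Evening ride'],
]);

// Build the complete HTML for a route on the server.
// Returns null when the route is unknown.
function renderRoute(path) {
  const match = path.match(/^\/activity\/(\w+)$/);
  const title = match ? activities.get(match[1]) : undefined;
  if (!title) return null;
  return `<!doctype html><html><body><h1 class="page-title">${title}</h1></body></html>`;
}

// Every route responds with fully rendered markup, so a plain GET
// (for example $.get) already contains the title without running any script.
const server = http.createServer((req, res) => {
  const html = renderRoute(req.url);
  res.writeHead(html ? 200 : 404, { 'Content-Type': 'text/html' });
  res.end(html || '<h1>Not found</h1>');
});

// server.listen(3000); // uncomment to serve http://localhost:3000/activity/1

console.log(renderRoute('/activity/1'));
```

Because the markup is assembled before the response is sent, a scraper that never executes JavaScript still sees the `.page-title` element.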
Reference:
SSR
The SSR framework for Vue: Nuxt.js
PhantomJS
Node API of Headless Chrome
Related
The HTML I am getting using Node.js is much different from the HTML I can see in the browser (using the Google Chrome inspect feature). I assume this happens because the browser waits for some elements to load, while I don't wait for them when making a request. How can I request a fully loaded HTML? Is it possible without pretending to be a real user (puppeteer)?
For example, this is my attempt to get a video element from this link https://clips.twitch.tv/IronicPoisedTermite4Head
but video element is not present at all in the HTML I have fetched.
const fetch = require("node-fetch");
const jsdom = require("jsdom");
(async () => {
  let htmlDoc = await fetch("https://clips.twitch.tv/IronicPoisedTermite4Head")
    .then((res) => res.text())
    .then((body) => body); //body is totally different than HTML in the browser
  try {
    // Parse the fetched HTML; by default jsdom does not run the page's scripts
    const document = new jsdom.JSDOM(htmlDoc).window.document;
    console.log(htmlDoc);
    console.log(document.getElementsByTagName('video')[0]);
  } catch (e) {
    console.log(e);
  }
})();
When a browser loads a web page, it does an HTTP GET and gets back a static piece of HTML. Let's call that the "original content". It then parses that HTML and runs any <script> tags it finds in that HTML. Those script tags may then modify the content you see. In particular, some sites make additional HTTP requests to retrieve additional content and then insert that content into the page. This produces what I will call the "full content". Those scripts may even continue running over time to continue to update the content.
When you do a fetch() of some URL, that retrieves what was labeled above as the "original content". That's all it does. fetch() just does the initial HTTP GET for that URL. It doesn't parse the resulting HTML and it doesn't run any of the <script> tags it could find in that HTML. Thus, fetch() does not produce the "full content" as described above. Sometimes, the "original content" is sufficient for your work and sometimes the "full content" is what you need - it really depends upon the specific web site.
To get the "full content", you have to feed the "original content" to a browser-like environment that can "run" it to let its scripts do their things, to provide a DOM environment for those scripts to run in so you can then query the resulting DOM to get the "full content". puppeteer is one such tool for obtaining the "full content". It actually uses the Chromium engine (same engine the Chrome browser uses) to literally "run" the web page and let its <script> tags do their thing and you can then obtain the "full content" from it after those scripts run.
fetch(), by itself, cannot get the "full content" because it doesn't parse or run the page's scripts and doesn't offer a DOM environment for them to run in either. That's what a tool like puppeteer can do.
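To make the difference concrete, here is a toy sketch using only Node's built-in vm module. It is not a real browser: `originalContent`, the fake `document` object and both function names are invented for this illustration, and the regular expressions only work for this tiny snippet.

```javascript
const vm = require('vm');

// Hypothetical "original content": an empty shell plus a script that fills it in.
const originalContent =
  '<h1 id="title"></h1>' +
  '<script>document.getElementById("title").textContent = "Loaded by script";</script>';

// What a plain HTTP client sees: the text between the <h1> tags in the raw markup.
function staticTitle(html) {
  const match = html.match(/<h1 id="title">(.*?)<\/h1>/);
  return match ? match[1] : null;
}

// Toy stand-in for a browser: a fake document holding one element,
// plus execution of every inline <script> against that document.
function renderedTitle(html) {
  const title = { textContent: staticTitle(html) };
  const document = { getElementById: (id) => (id === 'title' ? title : null) };
  const scriptPattern = /<script>([\s\S]*?)<\/script>/g;
  let match;
  while ((match = scriptPattern.exec(html)) !== null) {
    vm.runInNewContext(match[1], { document });
  }
  return title.textContent;
}

console.log(JSON.stringify(staticTitle(originalContent)));   // ""
console.log(JSON.stringify(renderedTitle(originalContent))); // "Loaded by script"
```

The first call corresponds to what fetch() returns; the second only produces the title because something actually executed the script against a DOM-like object, which is the job a tool like puppeteer does with a real browser engine.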
How can I request a fully loaded HTML? Is it possible without pretending to be a real user (puppeteer)?
If the site builds its "full content" using JavaScript in <script> tags, then you have to use a tool like puppeteer to get the "full content". It's not just a matter of waiting. You need a tool that actually runs the scripts in the page.
Is there a way in which you can get a screenshot of another website's pages?
For example: you enter a URL in an input, hit enter, and a script gives you a screenshot of that site. I managed to do it with headless browsers, but I fear that could take too much time and too many resources to launch. With PhantomJS, say, the headless browser would need to fetch the new data each time the input is used. I investigated HotJar, which does something similar to what I'm looking for: it gives you a script that you must put into the page header (which is fine by me), and afterwards you get a preview. How does it work, and how can one replicate it?
Do you want a print screen of your page or someone else's?
Own page
Use puppeteer or phantomJS with every build of your site. This way you only run it when the site changes, and have a screenshot ready at any time.
Foreign page
You have access to it (the owner runs your script)
Either try to get into the owner's build pipeline and use the solution from above.
Or use this solution: Using HTML5/Canvas/JavaScript to take in-browser screenshots.
You don't have any access
Use some long-running process that will give you screenshot when asked.
Imagine a server with one URL endpoint: screenshot.example.com?facebook.com.
The long-running server keeps a puppeteer/phantomJS instance ready to go. When given a URL, it loads that page, takes the screenshot and sends it back. The browser will simply treat it as a slow image request.
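A minimal sketch of such an endpoint, using only Node's built-in http module. The names `parseTarget` and `createScreenshotServer` are made up for this example, and `capture` is a hypothetical async function that takes a URL and resolves to a PNG buffer, for instance a wrapper around one long-lived headless browser instance.

```javascript
const http = require('http');

// Turn a request URL such as "/?facebook.com" into a full http(s) URL.
// Returns null when there is no query or it is not a usable web address.
function parseTarget(requestUrl) {
  const query = requestUrl.split('?')[1];
  if (!query) return null;
  try {
    const raw = decodeURIComponent(query);
    const target = new URL(/^https?:\/\//.test(raw) ? raw : `https://${raw}`);
    const isWeb = target.protocol === 'http:' || target.protocol === 'https:';
    return isWeb ? target.href : null;
  } catch (err) {
    return null;
  }
}

// `capture` is an assumption: async (url) => Buffer containing a PNG.
function createScreenshotServer(capture) {
  return http.createServer(async (req, res) => {
    const target = parseTarget(req.url);
    if (!target) {
      res.writeHead(400, { 'Content-Type': 'text/plain' });
      res.end('Expected ?<url>');
      return;
    }
    try {
      const png = await capture(target);
      res.writeHead(200, { 'Content-Type': 'image/png' });
      res.end(png);
    } catch (err) {
      res.writeHead(502, { 'Content-Type': 'text/plain' });
      res.end('Screenshot failed');
    }
  });
}

console.log(parseTarget('/?facebook.com')); // https://facebook.com/
```

Calling `createScreenshotServer(capture).listen(3000)` would start the service. Keeping the browser instance alive inside `capture` is what avoids paying the launch cost on every request.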
You can make this with puppeteer
install with: npm i puppeteer
save the following code to example.js
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
and run it with:
node example.js
Is it possible to capture the entire window as screenshot using JavaScript?
The application might contain many iframes and div's where content are loaded asynchronously.
I have explored canvas2image but it works on an html element, using the same discards any iframe present on the page.
I am looking for a solution where the capture will take care of all the iframes present.
The only way to capture the contents of an iframe using ONLY JavaScript in the webpage (No extensions, or application running outside the browser on a users system) is to use the HTMLIFrameElement.getScreenshot() API in Firefox. This API is non-standard, and ONLY works in Firefox.
For any other browser, no. Script in the embedding page cannot read the rendered contents of a cross-origin iframe; the browser blocks this by design.
The best way to get a screenshot of a webpage that I have found and use, is an instance of Headless Chrome or Headless Firefox. These will take a screenshot of everything on the page, just as a user would see it.
Yes, with Puppeteer it is possible.
1 - Just install the dependency:
npm i puppeteer
2 - Create a JavaScript file, screenshot.js:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://yourweb.com');
  await page.screenshot({path: 'screenshot.png'});
  await browser.close();
})();
3 - Run:
node screenshot.js
Source
Web pages are not the best things to screenshot because of their nature: they can include async elements, frames and the like, and they are usually responsive.
For your purpose, the best way is to use an external API or service; I don't think it is a good idea to try doing that with JS.
You should try https://www.url2png.com/
I am trying to take screenshots of a page that loads a series of content (slideshow) via JavaScript. I can take screenshots of individual items with Firefox Devtools just fine. However, it's tedious to do so by hand.
I can think of a few options-
Run the 'screenshot' command in a loop and call a JS function in each loop to load the next content. However I can't find any documentation to script the developer tools or call JS functions from within it.
Run a JS script on the page to load the contents at an interval and call the devtools to take a screenshot each time. But I can't find any documentation on calling devtools from JS in webpage.
Have Devtools take screenshots in response to a page event. But I can't find any documentation on this either.
How do I do this?
Your first question is how to take screenshots with JavaScript programmatically:
Use Selenium WebDriver to steer the browser instead of trying to script the developer tools of a specific browser.
Using WebdriverJS as the framework, you can script anything you need around the WebDriver itself.
Your second question is, how to script the FF dev tools:
- no answer from my side -
I will second Ralf R's recommendation to use webdriver instead of trying to wrangle the firefox devtools.
Here's a webdriverjs script that goes to a webpage with a slow loading carousel, and takes a screenshot as soon as the image I request is fully loaded (with this carousel, I tell it to wait until the css opacity is 1). You can just loop this through however many slide images you have.
var webdriver = require('selenium-webdriver');
var By = webdriver.By;
var until = webdriver.until;
var fs = require("fs");
var driver = new webdriver.Builder().forBrowser("chrome").build();
// Go to website
driver.get("http://output.jsbin.com/cerutusihe");
// Tell webdriver to wait until the opacity is 1
driver.wait(function(){
  // First store the element you want to find in a variable.
  var theEl = driver.findElement(By.css(".mySlides:nth-child(1)"));
  // Return the css value (it can be any value you like), then return a boolean
  // (that the 'result' of the getCssValue request equals 1)
  return theEl.getCssValue('opacity').then(function(result){
    return result == 1;
  });
}, 60000); // Specify a wait of 60 seconds.
// Call webdriver's takeScreenshot method.
driver.takeScreenshot().then(function(data) {
  // Use the node file system module 'fs' to write the file to your disk.
  // In this case, it writes it to the root directory of my webdriver project.
  fs.writeFileSync("pic2.png", data, 'base64');
});
I'm not using Selenium to automate testing, but to automate saving AJAX pages that inject content, even if they require prior authentication to access.
I tried
tl;dr: I tried multiple tools for downloading sites with AJAX and gave up because they were hard to work with or simply didn't work. I'm resorting to using Selenium after trying out WebHTTrack (whose GUI wasn't able to start up on my Ubuntu machine, and which was a headache to provide authentication with in interactive-terminal mode) and wget (which didn't download any of the scripts or stylesheets included on my page, see the bottom for what I tried with wget). I finally gave up after a promising post on using a Mozilla XULRunner AJAX scraper called Crowbar simply seg-faulted on me. So...
ended up making my own broken thing in NodeJS and Selenium-WebdriverJS
My NodeJS script uses selenium-webdriver npm module which is "officially supported by the main project" to:
provide login information + do necessary button-clicking & typing for authentication
download all JS and CSS referenced on target page
download target page with original JS/CSS file links changed to local file paths
Now when I view my test page locally I see double of many page elements because the target site loads HTML snippets into the page each time it's loaded. I use this to download my target page right now:
var $;
var getTarget = function () {
  // Return the promise so the next step waits for the page source
  return driver.getPageSource().then(function (source) {
    $ = cheerio.load(source.toString());
  });
};
var targetHtmlDest = 'test.html';
var writeTarget = function () {
  fs.writeFileSync(targetHtmlDest, $.html());
};
driver.get(targetSite)
  .then(authenticate)
  .then(getTarget)
  .then(downloadResources)
  .then(writeTarget)
  .then(function () {
    // Quit only after every step above has finished
    return driver.quit();
  });
The problem is that the page source I get is the already modified page source, instead of the original one. Trying to run alert("x");window.stop(); within driver.executeAsyncScript() and driver.executeScript() does nothing.
Perhaps using Curl to get the page (you can pass authentication in the command) will get you the bare source?
Otherwise you may be able to turn off JavaScript on your test browsers to prevent JS actions from firing.
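The same idea as the curl suggestion can be sketched in Node with the built-in https module. This assumes the site accepts HTTP Basic authentication, which may well not be the case for a form-based login; the function names, URL and credentials below are placeholders.

```javascript
const https = require('https');

// Build the header that "curl -u user:password" would send.
function basicAuthHeader(user, password) {
  return 'Basic ' + Buffer.from(`${user}:${password}`).toString('base64');
}

// Fetch the response body exactly as the server sent it; no scripts are executed,
// so this is the unmodified page source.
function fetchRawSource(url, user, password) {
  return new Promise((resolve, reject) => {
    const headers = { Authorization: basicAuthHeader(user, password) };
    https.get(url, { headers }, (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

// Usage (hypothetical URL and credentials):
// fetchRawSource('https://example.com/page', 'user', 'pass').then(console.log);

console.log(basicAuthHeader('user', 'pass')); // Basic dXNlcjpwYXNz
```

If the site relies on a login form and session cookies instead, the request would need to send those cookies rather than an Authorization header.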