I am trying to take screenshots of a page that loads a series of content (a slideshow) via JavaScript. I can take screenshots of individual items with the Firefox DevTools just fine. However, it's tedious to do so by hand.
I can think of a few options:
Run the 'screenshot' command in a loop and call a JS function in each iteration to load the next content. However, I can't find any documentation on scripting the developer tools or calling JS functions from within them.
Run a JS script on the page to load the contents at an interval and have the devtools take a screenshot each time. But I can't find any documentation on calling the devtools from JS in a webpage.
Have the devtools take screenshots in response to a page event. But I can't find any documentation on this either.
How do I do this?
Your first question is how to take screenshots with JavaScript in a programmatic way:
Use Selenium WebDriver to drive the browser instead of trying to script the developer tools of a specific browser.
Using WebDriverJS as the framework, you can script anything you need around WebDriver itself.
Your second question is how to script the FF dev tools:
- no answer from my side -
I will second Ralf R's recommendation to use webdriver instead of trying to wrangle the firefox devtools.
Here's a webdriverjs script that goes to a webpage with a slow loading carousel, and takes a screenshot as soon as the image I request is fully loaded (with this carousel, I tell it to wait until the css opacity is 1). You can just loop this through however many slide images you have.
var webdriver = require('selenium-webdriver');
var By = webdriver.By;
var until = webdriver.until;
var fs = require("fs");

var driver = new webdriver.Builder().forBrowser("chrome").build();

// Go to website
driver.get("http://output.jsbin.com/cerutusihe");

// Tell webdriver to wait until the opacity is 1
driver.wait(function() {
  // first store the element you want to find in a variable.
  var theEl = driver.findElement(By.css(".mySlides:nth-child(1)"));
  // return the css value (it can be any value you like), then return a boolean (that the 'result' of the getCssValue request equals 1)
  return theEl.getCssValue('opacity').then(function(result) {
    return result == 1;
  });
}, 60000); // specify a wait of 60 seconds.

// call webdriver's takeScreenshot method.
driver.takeScreenshot().then(function(data) {
  // use the node file system module 'fs' to write the file to your disk. In this case, it writes it to the root directory of my webdriver project.
  fs.writeFileSync("pic2.png", data, 'base64');
});
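You can extend this into a loop over all of your slides. Here is a minimal sketch of that idea, relying on the same implicit command queueing as the snippet above; the selector pattern, the slide count, and the hypothetical nextSlide() page function are assumptions you would adapt to your page:

var webdriver = require('selenium-webdriver');
var By = webdriver.By;
var fs = require('fs');

var driver = new webdriver.Builder().forBrowser('chrome').build();
driver.get('http://output.jsbin.com/cerutusihe');

var slideCount = 5; // hypothetical number of slides

for (var i = 1; i <= slideCount; i++) {
  (function(n) {
    // if the page needs a nudge to advance, call its own JS here,
    // e.g. driver.executeScript('nextSlide()'); (hypothetical function name)

    // wait until the nth slide is fully visible (opacity == 1)
    driver.wait(function() {
      return driver.findElement(By.css('.mySlides:nth-child(' + n + ')'))
        .getCssValue('opacity')
        .then(function(opacity) {
          return opacity == 1;
        });
    }, 60000);

    // take a screenshot and write it to disk with a per-slide filename
    driver.takeScreenshot().then(function(data) {
      fs.writeFileSync('slide-' + n + '.png', data, 'base64');
    });
  })(i);
}

driver.quit();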
I'm trying to fetch an entire webpage using JavaScript by plugging in the URL. However, the website is built as a Single Page Application (SPA) that uses JavaScript / backbone.js to dynamically load most of its contents after rendering the initial response.
So for example, when I route to the following address:
https://connect.garmin.com/modern/activity/1915361012
And then enter this into the console (after the page has loaded):
var $page = $("html")
console.log("%c✔: ", "color:green;", $page.find(".inline-edit-target.page-title-overflow").text().trim());
console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
Then I'll get the dynamically loaded activity title as well as the statically loaded page footer.
However, when I try to load the webpage via an AJAX call with either $.get() or .load(), I only get the initial response (the same content as seen via view-source):
view-source:https://connect.garmin.com/modern/activity/1915361012
So if I use either of the following AJAX calls:
// jQuery.get()
var url = "https://connect.garmin.com/modern/activity/1915361012";
jQuery.get(url, function(data) {
  var $page = $("<div>").html(data);
  console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
  console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
// jQuery.load()
var url = "https://connect.garmin.com/modern/activity/1915361012";
var $page = $("<div>");
$page.load(url, function(data) {
  console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
  console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
I'll still get the initial footer, but won't get any of the other page contents.
I've tried the solution here to eval() the contents of every script tag, but that doesn't appear robust enough to actually load the page:
jQuery.get(url, function(data) {
  var $page = $("<div>").html(data);
  $page.find("script").each(function() {
    var scriptContent = $(this).html(); // Grab the content of this tag
    eval(scriptContent); // Execute the content
  });
  console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
  console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
Q: Are there any options to fully load a webpage so that it is scrapable via JavaScript?
You will never be able to fully replicate by yourself what an arbitrary (SPA) page does.
The only way I see is using a headless browser such as PhantomJS, Headless Chrome, or Headless Firefox.
I wanted to try Headless Chrome so let's see what it can do with your page:
Quick check using internal REPL
Load that page with Chrome Headless (you'll need Chrome 59 on Mac/Linux, Chrome 60 on Windows), and find page title with JavaScript from the REPL:
% chrome --headless --disable-gpu --repl https://connect.garmin.com/modern/activity/1915361012
[0830/171405.025582:INFO:headless_shell.cc(303)] Type a Javascript expression to evaluate or "quit" to exit.
>>> $('body').find('.page-title').text().trim()
{"result":{"type":"string","value":"Daily Mile - Round 2 - Day 27"}}
NB: to get chrome command line working on a Mac I did this beforehand:
alias chrome="'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'"
Using it programmatically with Node & Puppeteer
Puppeteer is a Node library (by Google Chrome developers) which provides a high-level API to control headless Chrome over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome.
(Step 0 : Install Node & Yarn if you don't have them)
In a new directory:
yarn init
yarn add puppeteer
Create index.js with this:
const puppeteer = require('puppeteer');

(async () => {
  const url = 'https://connect.garmin.com/modern/activity/1915361012';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Go to URL and wait for page to load
  // (newer Puppeteer versions use 'networkidle0' or 'networkidle2' instead of 'networkidle')
  await page.goto(url, {waitUntil: 'networkidle'});

  // Wait for the results to show up
  await page.waitForSelector('.page-title');

  // Extract the results from the page
  const text = await page.evaluate(() => {
    const title = document.querySelector('.page-title');
    return title.innerText.trim();
  });

  console.log(`Found: ${text}`);
  browser.close();
})();
Result:
$ node index.js
Found: Daily Mile - Round 2 - Day 27
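Since the broader topic in this collection of threads is capturing rendered pages, note that Puppeteer can also take the screenshot itself. A minimal sketch along the same lines (the waitUntil value and the output filename are my own choices, not from the original answer):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://connect.garmin.com/modern/activity/1915361012', {waitUntil: 'networkidle2'});
  // Wait for the dynamically loaded content before capturing
  await page.waitForSelector('.page-title');
  // fullPage captures the entire scrollable area, not just the viewport
  await page.screenshot({path: 'activity.png', fullPage: true});
  await browser.close();
})();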
First off: avoid eval - your content security policy should block it and it leaves you open to easy XSS attacks. Scraping bots definitely won't run it.
The problem you're describing is common to all SPAs - when a person visits, they get your app shell script, which then loads in the rest of the content - all good. When a bot visits, it ignores the scripts and sees only the empty shell.
The solution is server-side rendering. If you're using a JS renderer (say React) and Node.js on the server, you can fairly easily render the markup on the server and serve that, as sketched below.
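A minimal sketch of that approach, assuming Express and ReactDOMServer; the component and route are hypothetical and only illustrate the idea:

const express = require('express');
const React = require('react');
const ReactDOMServer = require('react-dom/server');

// Hypothetical root component; in a real project this would be your actual app.
const App = (props) => React.createElement('div', null, 'Activity: ' + props.activityId);

const server = express();

server.get('/activity/:id', (req, res) => {
  // Render the component to an HTML string on the server,
  // so bots (and $.get) receive real markup instead of an empty shell.
  const markup = ReactDOMServer.renderToString(React.createElement(App, {activityId: req.params.id}));
  res.send('<!doctype html><html><body><div id="root">' + markup + '</div></body></html>');
});

server.listen(3000);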
However, if you aren't, then you'll need to run a headless browser on your server that executes all the JS a user's browser would, and then serve the result to the bot.
Fortunately, someone else has already done all the work here; they've put a demo online that you can try out with your site.
I think you should know the concept of an SPA.
An SPA (Single Page Application) is served as a single static HTML file; when the route changes, the page creates or modifies DOM nodes dynamically with JavaScript to achieve the effect of switching pages.
Therefore, if you use $.get(), the server will respond with that static HTML shell, so you won't get the content you want.
If you want this to work, there are two ways. The first is using a headless browser, for example headless Chrome, PhantomJS, etc.; it will load the page for you, and you can then read the DOM nodes of the fully loaded page. The second is SSR (Server-Side Rendering): with SSR you get the HTML of the page directly via $.get(), because the server responds with the rendered HTML of the corresponding page for each route.
Reference:
SSR
the SSR framework for Vue: Nuxt.js
PhantomJS
Node API of Headless Chrome
Testing a site using Cucumber and Selenium. In my hooks.js file I have the following:
driver.get("https://localhost:8000/");
sleep(2000);
TakeScreenshot('./test_artifacts/img/', 'Load Success', driver);
var btn = this.driver.wait(selenium.until.elementLocated(By.css('#app > div > div > div.col-xs-6.textColumn > button'), seleniumTimeOut));
TakeScreenshot('./test_artifacts/img/', 'Load Success', driver);
this.driver.sleep(3000);
The objective here is to successfully load the page and to take a screenshot of it. The website is running off of localhost. The problem occurs when a screenshot is taken. No matter how long I have the driver sleep, I get a black screenshot, indicating to me that the website is not 'building' in time (to use what may be an incorrect term, given the circumstances). I then get this error:
Waiting for element to be located By(css selector, #app > div > div > div.col-xs-6.textColumn > button)
Wait timed out after 20112ms
If I change the URL to https://google.com/ I get a screenshot of the site, no problem. Any ideas what is happening here? Is my above hypothesis correct?
Thanks in advance!
Please try waiting for some other element to become available. Use an XPath locator instead of CSS. Is your machine behind a proxy? If so, configure the PhantomJS browser to go through the proxy:
var phantom = require('phantom');

phantom.create(function(browser) {
  browser.createPage(function(page) {
    browser.setProxy('98.239.198.83', '21320', 'http', null, null, function() {
      page.open('http://example.com/req.php', function() {
      });
    });
  });
});
First, change the driver to the Chrome driver and see what the script does after changing the URL; different environments sometimes have different IDs and XPaths, which can affect the script. So before moving to PhantomJS directly, first check the behaviour of your script with common drivers like Chrome or Firefox, as in the sketch below.
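A minimal sketch of swapping the browser in the Builder (assuming chromedriver is installed and on your PATH):

// Same selenium-webdriver Builder, just targeting Chrome instead of PhantomJS
var webdriver = require('selenium-webdriver');
var driver = new webdriver.Builder().forBrowser('chrome').build();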
I have gone through the same scenario and experienced the same situation. It was a silly mistake that wasted half of my day :p
Hope it will help you :)
The problem seems to have been a certificate problem. The tests would run over http, but when localhost was using https the tests would fail.
I worked around it by adding the following to this.BeforeFeature in my hooks.js file:
this.driver = new selenium.Builder()
.withCapabilities({'phantomjs.cli.args': ['--ignore-ssl-errors=true']})
.forBrowser('phantomjs')
.build();
I need to take screenshots of multiple separate shopping websites for the final checkout page.
All selection of items in the cart and other navigation between pages must happen through code.
The output screenshots should be image files (jpg, png) or inserted into a docx file (if possible).
What tool and technology can I use for this task?
I have a little experience with screen capture through PHP and PhantomJS, but only for a static webpage. I am a newbie and would be happy if someone could guide me here.
For example:
Open google.com, search for "stackoverflow", then open stackoverflow.com and take a screenshot of the homepage. These steps must be done via code, i.e. automated. Thanks in advance, guys!
The Selenium website has an example of how to do something similar to this from Java (using Firefox as the browser) at http://www.seleniumhq.org/docs/03_webdriver.jsp#introducing-the-selenium-webdriver-api-by-example
Here's a quick TL;DR version. It doesn't click through to Stack Overflow but instead should take a screenshot of the Google results for that term. Going via Google when you already know the URL of the site may be a redundant step anyway; I am sure you can modify this example to make it do what you need it to.
WebDriver driver = new FirefoxDriver();
driver.get("http://www.google.com");

// Find the text input element by its name
WebElement element = driver.findElement(By.name("q"));
element.sendKeys("Stack Overflow");
element.submit();

// Google's search is rendered dynamically with JavaScript.
// Wait for the page to load, timeout after 10 seconds
(new WebDriverWait(driver, 10)).until(new ExpectedCondition<Boolean>() {
    public Boolean apply(WebDriver d) {
        // the title is lower-cased here, so compare against a lower-case prefix
        return d.getTitle().toLowerCase().startsWith("stack overflow");
    }
});

// Screenshot of the search results (the screen, not the whole page)
File scrFile = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
FileUtils.copyFile(scrFile, new File("c:\\screenshot.png"));
Screenshot code is from Sergii Pozharov's answer at Take a screenshot with Selenium WebDriver - see that for other considerations such as choice of driver.
With Selenium or JavaScript how could you get the (over the network) transferred size (bytes) of the loaded page including all the content, images, css, js, etc?
The preferred size is what actually goes over the network, i.e. compressed, and only for the requests that are actually made, etc.
This is what you usually can see in dev tools, to the right in the network status bar:
If that's not possible, could one just get a total size of all the loaded resources (without compression, etc)? That would be an acceptable alternative.
The browser is Firefox, but if it could be done with some other Selenium compatible browser that would be acceptable also.
I guess this could be done using a proxy, but is there any JS or Selenium way to get such information?
If proxy is the only way, which one would one use (or implement) to keep things simple for such a task? Just implementing something in Java before setting up the driver?
(The solution should work at least on Linux, but preferably on Windows also. I'm using Selenium WebDriver via Java.)
For future reference, it is possible to request this information from the browser via JavaScript. However, at the time of writing, no browser supports this feature for this specific data yet. More information can be found here.
In the mean time, for Chrome you can parse this information from the performance log.
// Enable performance logging
LoggingPreferences logPrefs = new LoggingPreferences();
logPrefs.enable(LogType.PERFORMANCE, Level.ALL);

// Attach the logging preferences to the Chrome capabilities
DesiredCapabilities capa = DesiredCapabilities.chrome();
capa.setCapability(CapabilityType.LOGGING_PREFS, logPrefs);

// Start driver
WebDriver driver = new ChromeDriver(capa);
You can then get this data like this:
for (LogEntry entry : driver.manage().logs().get(LogType.PERFORMANCE)) {
    if (entry.getMessage().contains("Network.dataReceived")) {
        Matcher dataLengthMatcher = Pattern.compile("encodedDataLength\":(.*?),").matcher(entry.getMessage());
        dataLengthMatcher.find();
        // Do whatever you want with the data here.
    }
}
If, like in your case, you want to know the specifics of a single page load, you could use a pre- and post-load timestamp and only keep entries within that timeframe.
The performance API mentioned in Hakello's answer is now well supported (on everything except IE & Safari), and is simple to use:
return performance
.getEntriesByType("resource")
.map((x) => x.transferSize)
.reduce((a, b) => (a + b), 0);
You can run that script using executeScript to get the number of bytes downloaded since the last navigation event. No setup or configuration is required.
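For example, a minimal sketch using the selenium-webdriver Node bindings (the Java and Python bindings have an equivalent executeScript method); the URL is just a placeholder:

const {Builder} = require('selenium-webdriver');

(async () => {
  const driver = await new Builder().forBrowser('chrome').build();
  await driver.get('https://example.com');
  // Sum transferSize over all resource entries recorded since the last navigation
  const bytes = await driver.executeScript(`
    return performance
      .getEntriesByType("resource")
      .map((x) => x.transferSize)
      .reduce((a, b) => (a + b), 0);
  `);
  console.log('Transferred (resources only):', bytes, 'bytes');
  await driver.quit();
})();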
Yes, you can do it using BrowserMob Proxy. This is a Java jar which uses a Selenium proxy to track network traffic from the client side,
such as page load duration, query strings to different services, etc.
You can get it at bmp.lightbody.net. This API creates .har files which contain all this information in JSON format, which you can read using
an online tool: http://www.softwareishard.com/har/viewer/
I have achieved this in Python, which might save people some time. To set up the logging:
import re
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

logging_prefs = {'performance': 'INFO'}
caps = DesiredCapabilities.CHROME.copy()
caps['loggingPrefs'] = logging_prefs
driver = webdriver.Chrome(desired_capabilities=caps)
To calculate the total:
total_bytes = []
for entry in driver.get_log('performance'):
if "Network.dataReceived" in str(entry):
r = re.search(r'encodedDataLength\":(.*?),', str(entry))
total_bytes.append(int(r.group(1)))
mb = round((float(sum(total_bytes) / 1000) / 1000), 2)
This question already has answers here: getting the raw source from Firefox with javascript (3 answers). Closed 8 years ago.
I'm not using Selenium to automate testing, but to automate saving AJAX pages that inject content, even if they require prior authentication to access.
I tried
tl;dr: I tried multiple tools for downloading sites with AJAX and gave up because they were hard to work with or simply didn't work. I'm resorting to using Selenium after trying out WebHTTrack (whose GUI wasn't able to start up on my Ubuntu machine + was a headache to provide authentication with in interactive-terminal mode), wget (which didn't download any of the scripts or stylesheets included on my page, see the bottom for what I tried with wget)... and then I finally gave up after a promising post on using a Mozilla XULRunner AJAX scraper called Crowbar simply seg-faulted on me. So...
ended up making my own broken thing in NodeJS and Selenium-WebdriverJS
My NodeJS script uses the selenium-webdriver npm module, which is "officially supported by the main project", to:
provide login information + do necessary button-clicking & typing for authentication
download all JS and CSS referenced on the target page
download the target page with the original JS/CSS file links changed to local file paths
Now when I view my test page locally I see double of many page elements because the target site loads HTML snippets into the page each time it's loaded. I use this to download my target page right now:
var $;

var getTarget = function () {
  driver.getPageSource().then(function (source) {
    $ = cheerio.load(source.toString());
  });
};

var targetHtmlDest = 'test.html';
var writeTarget = function () {
  fs.writeFile(targetHtmlDest, $.html());
};

driver.get(targetSite)
  .then(authenticate)
  .then(getTarget)
  .then(downloadResources)
  .then(writeTarget);

driver.quit();
The problem is that the page source I get is the already modified page source, instead of the original one. Trying to run alert("x");window.stop(); within driver.executeAsyncScript() and driver.executeScript() does nothing.
Perhaps using curl to get the page (you can pass authentication in the command) will get you the bare source?
Otherwise, you may be able to turn off JavaScript in your test browser to prevent JS actions from firing; see the sketch below.
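A minimal sketch of that second idea with the selenium-webdriver Node bindings and Firefox, assuming the javascript.enabled preference is still honoured by your Firefox version (this is my own illustration, not part of the original answer):

const {Builder} = require('selenium-webdriver');
const firefox = require('selenium-webdriver/firefox');

(async () => {
  // Turn JavaScript off so the served page source is not modified by scripts
  const options = new firefox.Options().setPreference('javascript.enabled', false);
  const driver = await new Builder().forBrowser('firefox').setFirefoxOptions(options).build();
  await driver.get('https://example.com');
  const source = await driver.getPageSource();
  console.log(source.length, 'characters of unmodified source');
  await driver.quit();
})();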