This seems to be possible in Scrapy, which can respect robots.txt out of the box, but is there a simple way to do it in Puppeteer? I haven't found an easy way to build "respect robots.txt" behavior into Puppeteer commands.
I don't believe puppeteer has anything built in but you can use puppeteer to access robots.txt and then use any of a number of npm modules to parse robots.txt to see if you're allowed to get any particular URL. For example, here's how you might use robots-txt-parser:
const robotsParser = require('robots-txt-parser');
const robots = robotsParser();

// Now inside an async function
// (or not, if using a version of Node.js that supports top-level await)
await robots.useRobotsFor('https://example.com/');
if (await robots.canCrawl(urlToVisit)) {
  // Do stuff with puppeteer here to visit the URL
} else {
  // Inform the user that sadly crawling that URL is forbidden
}
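If you'd rather see what such a check involves, the core Disallow logic can be sketched by hand. This is an illustration only, not a replacement for a real parser: it handles only `User-agent: *` groups and plain prefix Disallow rules, and ignores Allow rules, wildcards, and crawl-delay, all of which libraries like robots-txt-parser deal with properly.

```javascript
// Minimal robots.txt check: collect Disallow rules from the
// "User-agent: *" group and test a path against them by prefix match.
// Illustration only -- real parsers handle Allow rules and wildcards too.
function parseDisallows(robotsTxt) {
  const disallows = [];
  let applies = false;
  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // strip comments
    const [field, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(field)) {
      applies = value === '*';
    } else if (applies && /^disallow$/i.test(field) && value !== '') {
      disallows.push(value);
    }
  }
  return disallows;
}

function canCrawl(disallows, path) {
  return !disallows.some(prefix => path.startsWith(prefix));
}
```

You would fetch `https://example.com/robots.txt` (with Puppeteer or a plain HTTP request), feed the body to `parseDisallows`, and consult `canCrawl` before each navigation.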
I'm trying to run a simple JavaScript file from the terminal (Ubuntu) that clicks a button on a website. However, I haven't been able to find out how, since I've learned that you can't interact with the browser from Node (for doing things like running window.location.href); see the "ReferenceError: window is not defined" error in Node.js.
For example, I'd like to be able to create a script (let's call it test.js) where when I run ./test.js or node test.js in the terminal, it will:
Go to www.google.com
Click on the "Images" button in the top right.
I wrote out how I understand to do that below:
window.location.href = "https://www.google.com"
document.getElementById('the id of the image button').click()
It seems extremely straightforward, but I am a beginner to JavaScript, am not aware of its limitations, and could most definitely be wrong about Node. Could someone explain how I should go about doing something as simple as this? Thanks
EDIT: For clarification on the context, this is just a part of me trying to automate form submissions. I also want to be able to enter specified text into input fields and so on.
Node.js does not include a browser or any browser-like control required to execute the code you posted. Fortunately, this is fairly straightforward with the addition of some extra Node.js software.
What you're looking for is Puppeteer. It's a Node.js library that downloads its own Chromium browser and lets you remote-control that browser through some very easy Node.js functions/methods.
In a directory of your choosing, install puppeteer with npm like so:
npm install -S puppeteer
This will install the library locally into a node_modules/ directory.
Then, you'll need a single JavaScript file (like test.js in your example) in which you write code like the example in the Puppeteer README:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.google.com');
  // Replace with a real selector for the "Images" link
  await page.click('the id of the "images" link or some selector');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
Is there a way to get a screenshot of another website's pages?
E.g.: you enter a URL in an input, hit Enter, and a script gives you a screenshot of the site you put in. I managed to do it with headless browsers, but I fear that launching one (say, PhantomJS) each time the input is used could take too many resources and too much time, since the headless browser would need to fetch the page fresh every time. I investigated Hotjar; it does something similar to what I'm looking for, but it gives you a script that you must put into the page header (which is fine by me), and afterwards you get a preview. How does it work, and how can one replicate it?
Do you want a screenshot of your own page or someone else's?
Own page
Use puppeteer or PhantomJS with every build of your site; this way you only rerun it when the site changes and always have a screenshot ready.
Foreign page
You have access to it (the owner runs your script)
Either try to get into their build pipeline and use the solution from above,
or use this solution: Using HTML5/Canvas/JavaScript to take in-browser screenshots.
You don't have any access
Use some long-running process that will give you screenshot when asked.
Imagine a server with one URL endpoint: screenshot.example.com?facebook.com.
The long-running server keeps a puppeteer/PhantomJS instance ready to go; when given a URL, it will load that page, take the screenshot, and send it back. To the browser it will just look like a slow image request.
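The request handling for such an endpoint can be sketched without any server machinery. `screenshot.example.com` is an imaginary service and `targetUrlFromQuery` is a hypothetical helper name; it just turns the raw query string of a request like `screenshot.example.com?facebook.com` into a URL the headless browser can navigate to, and would sit in front of whatever puppeteer/PhantomJS code takes the actual screenshot:

```javascript
// Extract the target URL from a request path like "/shot?facebook.com".
// Hypothetical helper for the imaginary screenshot service described above.
function targetUrlFromQuery(requestUrl) {
  const queryIndex = requestUrl.indexOf('?');
  if (queryIndex === -1) return null; // no target supplied
  const raw = decodeURIComponent(requestUrl.slice(queryIndex + 1));
  if (raw === '') return null;
  // Default to https:// when the caller passed a bare hostname.
  return /^https?:\/\//i.test(raw) ? raw : `https://${raw}`;
}
```

A real service would also want a whitelist or rate limit here, since an open endpoint that fetches arbitrary URLs is an easy abuse target.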
You can do this with puppeteer:
install with: npm i puppeteer
save the following code to example.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
and run it with:
node example.js
Is it possible to capture the entire window as screenshot using JavaScript?
The application might contain many iframes and divs whose content is loaded asynchronously.
I have explored canvas2image, but it works on a single HTML element and discards any iframes present on the page.
I am looking for a solution where the capture will take care of all the iframes present.
The only way to capture the contents of an iframe using ONLY JavaScript in the webpage (No extensions, or application running outside the browser on a users system) is to use the HTMLIFrameElement.getScreenshot() API in Firefox. This API is non-standard, and ONLY works in Firefox.
For any other browser, no. An iframe is typically sandboxed, so its contents are not accessible from the embedding page by design.
The best way to get a screenshot of a webpage that I have found and use, is an instance of Headless Chrome or Headless Firefox. These will take a screenshot of everything on the page, just as a user would see it.
Yes, with Puppeteer it is possible.
1 - Just install the dependency:
npm i puppeteer
2 - Create a JavaScript file, screenshot.js:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://yourweb.com');
  await page.screenshot({path: 'screenshot.png'});
  await browser.close();
})();
3 - Run:
node screenshot.js
Web pages are not the easiest things to screenshot because of their nature: they can include async elements, frames and the like, and they are usually responsive.
For your purpose the best way is to use an external API or service; I think it is not a good idea to try doing this with in-page JS alone.
You could try https://www.url2png.com/
I am trying to figure out how to scrape this website https://cnx.org/search?q=subject:%22Arts%22, which is rendered via JavaScript. When I view the page source, there is very little code. I know that BeautifulSoup can't do this. I have tried Selenium, but I am new to it. Any suggestions on how scraping this site could be accomplished?
You can use Selenium to do this. You won't look at the HTML source code, though. Press F12 in Chrome (or install Firebug on Firefox) to open the developer tools. Once there, you can select elements (pointer icon at the top left of the dev tools window). Once you click what you want, right-click the highlighted portion in the "Elements" column and choose Copy -> Copy XPath. Be careful to use proper quotes in your code, because the copied XPaths usually contain double quotes, which are also common in the string you pass to the find_element_by_xpath method.
Essentially you instantiate your browser, go to the page, and find the element by its XPath (a query language for addressing a specific spot in the page, which works even on pages built with JavaScript). It's roughly like this:
from selenium import webdriver

driver = webdriver.Chrome()

# Load the page
driver.get("https://www.instagram.com/accounts/login/")

# Find your element via its XPath (see above for how to copy it)
# The "Madlavning" entry on the page would be:
element = driver.find_element_by_xpath('//*[@id="results"]/div/table/tbody/tr[1]/td[2]/h4/a')

# Pull the text
element.text

# Ensure you don't get zombie/defunct Chrome/Firefox instances that suck up resources
driver.quit()
selenium can be used for plenty of scraping, you just need to know what you want to do once you find the info.
You can use the API that the web page itself gets its data from (via JavaScript) directly: https://archive.cnx.org/search?q=subject:%22Arts%22. It returns JSON, so you just need to parse it.
import requests
import json

url = "https://archive.cnx.org/search?q=subject:%22Arts%22"
r = requests.get(url)
j = r.json()

# Print the whole JSON object
print(json.dumps(j, indent=4, sort_keys=True))

# Or print specific values
for i in j['results']['items']:
    print(i['title'])
    print(i['summarySnippet'])
Try Google's official headless browser wrapper around Chrome, puppeteer.
Install:
npm i puppeteer
Usage:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
It's easy to use and has good documentation.
I am trying to use Puppeteer for end-to-end tests. These tests require accessing the network emulation capabilities of DevTools (e.g. to simulate offline browsing).
So far I am using chrome-remote-interface, but it is too low-level for my taste.
As far as I know, Puppeteer does not expose the network DevTools features (emulateNetworkConditions in the DevTools protocol).
Is there an escape hatch in Puppeteer to access those features, e.g. a way to execute a JavaScript snippet in a context in which the DevTools API is accessible?
Thanks
Edit:
OK, so it seems that I can work around the lack of an API using something like this:
const client = page._client;
const res = await client.send('Network.emulateNetworkConditions', {
  offline: true,
  latency: 40,
  downloadThroughput: 40 * 1024 * 1024,
  uploadThroughput: 40 * 1024 * 1024,
});
But I suppose it is bad form, since _client is a private API, and it may slip out from under my feet at any time?
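One detail worth noting about Network.emulateNetworkConditions: latency is given in milliseconds and the throughput fields in bytes per second, so 40 * 1024 * 1024 is 40 MB/s, not 40 Mbit/s. A small helper (hypothetical name, using 1024-based megabits to match the numbers above) makes the intent explicit:

```javascript
// Network.emulateNetworkConditions expects throughput in bytes/second.
// Convert a megabits-per-second figure (1 Mbit = 1024 * 1024 bits here)
// to that unit.
function mbpsToBytesPerSecond(mbps) {
  return (mbps * 1024 * 1024) / 8;
}

// Example conditions object to pass to
// client.send('Network.emulateNetworkConditions', conditions)
const conditions = {
  offline: false,
  latency: 40, // milliseconds of added round-trip latency
  downloadThroughput: mbpsToBytesPerSecond(40),
  uploadThroughput: mbpsToBytesPerSecond(40),
};
```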
Update: headless Chrome now supports network throttling!
In Puppeteer, you can emulate devices (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pageemulateoptions) but not network conditions. It's something we're considering, but headless Chrome needs to support network throttling first.
To emulate a device, I'd use the predefined devices found in DeviceDescriptors:
const puppeteer = require('puppeteer');
const devices = require('puppeteer/DeviceDescriptors');

const iPhone = devices['iPhone 6'];

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto('https://www.google.com');
  // other actions...
  browser.close();
});