How to get all links from a website with puppeteer

How to get all links from a website with puppeteer - javascript

Well, I would like a way to use the puppeteer and the for loop to get all the links on the site and add them to an array, in this case the links I want are not links that are in the html tags, they are links that are directly in the source code, javascript file links etc... I want something like this:
array = [ ]
for(L in links){
array.push(L)
//The code should take all the links and add these links to the array
}
But how can I get all references to javascript style files and all URLs that are in the source code of a website?
I just find a post and a question that teaches or shows how it gets the links from the tag and not all the links from the source code.
Supposing you want to get all the tags on this page for example:
view-source:https://www.nike.com/
How can I get all script tags and return to console? I put view-source:https://nike.com because you can get the script tags, I don't know if you can do it without displaying the source code, but I thought about displaying and getting the script tag because that was the idea I had, however I do not know how to do it

It is possible to get all links from a URL using only node.js, without puppeteer:
There are two main steps:
Get the source code for the URL.
Parse the source code for links.
Simple implementation in node.js:
// get-links.js
///
/// Step 1: Request the URL's html source.
///
axios = require('axios');
promise = axios.get('https://www.nike.com');
// Extract html source from response, then process it:
promise.then(function(response) {
htmlSource = response.data
getLinksFromHtml(htmlSource);
});
///
/// Step 2: Find links in HTML source.
///
// This function inputs HTML (as a string) and output all the links within.
function getLinksFromHtml(htmlString) {
// Regular expression that matches syntax for a link (https://stackoverflow.com/a/3809435/117030):
LINK_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&//=]*)/gi;
// Use the regular expression from above to find all the links:
matches = htmlString.match(LINK_REGEX);
// Output to console:
console.log(matches);
// Alternatively, return the array of links for further processing:
return matches;
}
Sample usage:
$ node get-links.js
[
'http://www.w3.org/2000/svg',
...
'https://s3.nikecdn.com/unite/scripts/unite.min.js',
'https://www.nike.com/android-icon-192x192.png',
...
'https://connect.facebook.net/',
... 658 more items
]
Notes:
I used the axios library for simplicity and to avoid "access denied" errors from nike.com. It is possible to use any other method to get the HTML source, like:
Native node.js http/https libraries
Puppeteer (Get complete web page source html with puppeteer - but some part always missing)

Although the other answers are applicable in many situations, they will not work for client-side rendered sites. For instance, if you just do an Axios request to Reddit, all you'll get is a couple of divs with some metadata. As Puppeteer actually gets the page and parses all JavaScript in a real browser, the websites' choice of document rendering becomes irrelevant for extracting page data.
Puppeteer has an evaluate method on the page object which allows you to run JavaScript directly on the page. Using that, you easily extract all links as follows:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
const pageUrls = await page.evaluate(() => {
const urlArray = Array.from(document.links).map((link) => link.href);
const uniqueUrlArray = [...new Set(urlArray)];
return uniqueUrlArray;
});
console.log(pageUrls);
await browser.close();
})();

yes you can get all the script tags and their links without opening view source.
You need to add dependency for jsdom library in your project and then pass the HTML response to its instance like below
here is the code:
const axios = require('axios');
const jsdom = require("jsdom");
// hit simple HTTP request using axios or node-fetch as you wish
const nikePageResponse = await axios.get('https://www.nike.com');
// now parse this response into a HTML document using jsdom library
const dom = new jsdom.JSDOM(nikePageResponse.data);
const nikePage = dom.window.document
// now get all the script tags by querying this page
let scriptLinks = []
nikePage.querySelectorAll('script[src]').forEach( script => scriptLinks.push(script.src.trim()));
console.debug('%o', scriptLinks)
Here I have made CSS selector for <script> tags that have src attribute inside them.
You can write same code in using puppeteer, but it will take time opening the browser and everything and then getting its pageSource.
you can use this to find the links and then do whatever you want to use with them using puppeteer or anything.

Related

How to find specific value of html element (such as h1) with node.js

I'm trying to find the numbers of users on this website, using node.js, but I'm not sure of what process I can do to get it. The value is subclassed under multiple divs with different class names. Right now I will just be console.logging it but I'm going to do more with the data later on.
I was suggested to use the package "puppeteer", but I'm not sure how this would fetch it or what to do with it to fetch it.
Thank you in advance

You can use Puppeteer and obtain the value via a query string. The typical code looks like this:
const pt= require('puppeteer')
async function getText(){
//launch browser in headless mode
const browser = await pt.launch()
//browser new page
const page = await browser.newPage()
//launch URL
await page.goto('https://feds.lol/')
//identify element
const f = await page.$("QUERY_STRIG")
//obtain text
const text = await (await f.getProperty('textContent')).jsonValue()
console.log("Text is: " + text)
}
getText()
However, I see this website is protected by Captcha

How to get the content of a div tag when scraping with puppeteer and NodeJs

I heard of this library called puppeteer and is usefulness in scraping web pages. so i decided to scrape a gaming site content so I can store it data and go through it later.
But after i copied the XPATH of the div tag I want puppeteer to scrape it content, its returning Empty string Please what am I doing wrong.
This is the url am trying to scrape here
i want to scrape the div tag where the result of the 6 different color ball are being displayed.
so i can get the number of those colors every 45 seconds.
const puppeteer = require("puppeteer");
async function scrapeData(url){
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
const [dataReceived] = await page.$x('/html/body/div[1]/div/div/div/footer/div[2]/div[1]/div/div[1]/div[2]/div/div');
const elContent = await dataReceived.getProperty('textContent');
const elValue = await elContent.jsonValue();
console.log({elValue});
//console.log(elContent);
//console.log(dataReceived)
browser.close();
}
scrapeData("https://logigames.bet9ja.com/Games/Launcher?gameId=11000&provider=0&sid=&pff=1&skin=201");
console.log("just testing");

Rather than using page.$x here, you could use a simpler selector, which would be less brittle. Try page.$('.ball-value'), or possibly page.waitForSelector('.ball-value') to deal with transition times. Testing on that page using a simpler selector seems to work. If you want to get all the ball values rather than just the first one, there's page.$$ (which is the same as document.querySelectorAll, so it would return an array of elements).

Tampermonkey To open multiple javascript in href in new tab [duplicate]

Over the years on snapchat I have saved lots of photos that I would like to retrieve now, The problem is they do not make it easy to export, but luckily if you go online you can request all the data (thats great)
I can see all my photos download link and using the local HTML file if I click download it starts downloading.
Here's where the tricky part is, I have around 15,000 downloads I need to do and manually clicking each individual one will take ages, I've tried extracting all of the links through the download button and this creates lots of Urls (Great) but the problem is, if you past the url into the browser then ("Error: HTTP method GET is not supported by this URL") appears.
I've tried a multitude of different chrome extensions and none of them show the actually download, just the HTML which is on the left-hand side.
The download button is a clickable link that just starts the download in the tab. It belongs under Href A
I'm trying to figure out what the best way of bulk downloading each of these individual files is.

So, I just watched their code by downloading my own memories. They use a custom JavaScript function to download your data (a POST request with ID's in the body).
You can replicate this request, but you can also just use their method.
Open your console and use downloadMemories(<url>)
Or if you don't have the urls you can retrieve them yourself:
var links = document.getElementsByTagName("table")[0].getElementsByTagName("a");
eval(links[0].href);
UPDATE
I made a script for this:
https://github.com/ToTheMax/Snapchat-All-Memories-Downloader

Using the .json file you can download them one by one with python:
req = requests.post(url, allow_redirects=True)
response = req.text
file = requests.get(response)
Then get the correct extension and the date:
day = date.split(" ")[0]
time = date.split(" ")[1].replace(':', '-')
filename = f'memories/{day}_{time}.mp4' if type == 'VIDEO' else f'memories/{day}_{time}.jpg'
And then write it to file:
with open(filename, 'wb') as f:
f.write(file.content)
I've made a bot to download all memories.
You can download it here
It doesn't require any additional installation, just place the memories_history.json file in the same directory and run it. It skips the files that have already been downloaded.

Short answer
Download a desktop application that automates this process.
Visit downloadmysnapchatmemories.com to download the app. You can watch this tutorial guiding you through the entire process.
In short, the app reads the memories_history.json file provided by Snapchat and downloads each of the memories to your computer.
App source code
Long answer (How the app described above works)
We can iterate over each of the memories within the memories_history.json file found in your data download from Snapchat.
For each memory, we make a POST request to the URL stored as the memories Download Link. The response will be a URL to the file itself.
Then, we can make a GET request to the returned URL to retrieve the file.
Example
Here is a simplified example of fetching and downloading a single memory using NodeJS:
Let's say we have the following memory stored in fakeMemory.json:
{
"Date": "2022-01-26 12:00:00 UTC",
"Media Type": "Image",
"Download Link": "https://app.snapchat.com/..."
}
We can do the following:
// import required libraries
const fetch = require('node-fetch'); // Needed for making fetch requests
const fs = require('fs'); // Needed for writing to filesystem
const memory = JSON.parse(fs.readFileSync('fakeMemory.json'));
const response = await fetch(memory['Download Link'], { method: 'POST' });
const url = await response.text(); // returns URL to file
// We can now use the `url` to download the file.
const download = await fetch(url, { method: 'GET' });
const fileName = 'memory.jpg'; // file name we want this saved as
const fileData = download.body; // contents of the file
// Write the contents of the file to this computer using Node's file system
const fileStream = fs.createWriteStream(fileName);
fileData.pipe(fileStream);
fileStream.on('finish', () => {
console.log('memory successfully downloaded as memory.jpg');
});

chrome extract coverage report as part of build process

in chrome you can generate a coverage report of the css and js used, how can i get this report during my build process in order to automatically split and load my files accordingly? (coverage report without any interactions with the website).

Don't mean on this practice as usefull at all, i'll explain:
if you are using dynamic content this will break the styling of your dynamic elements or your dynamic elements itself.
The best way i found to deal with this is using parcel js as package manager and properly set your js and scss into each .html for each component on your view. This will create a "vitual tree" of styles for each view and will create an insert a js and css on each view. Of course you may get some extra css and js, depending on your needs and your best practices.
For dynamic content: if you are in a view that can create dynamically an element, you'll need to preload its js and css alongside with the rest of css and js even if the user does not interact with the element that will create this other element dynamically.
For static content: If you're working with static content, then parcel.js and write your css and js specifically for each view will be the way to go.
There's no much point on scrape the coverage and take what you need now, as you can work with best practices in development time.
If you want to do that anyway, you can do this with puppeteer (npm i puppeteer --save).
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage()
//Start sending raw DevTools Protocol commands are sent using `client.send()`
//First off enable the necessary "Domains" for the DevTools commands we care about
const client = await page.target().createCDPSession()
await client.send('Page.enable')
await client.send('DOM.enable')
await client.send('CSS.enable')
const inlineStylesheetIndex = new Set();
client.on('CSS.styleSheetAdded', stylesheet => {
const { header } = stylesheet
if (header.isInline || header.sourceURL === '' || header.sourceURL.startsWith('blob:')) {
inlineStylesheetIndex.add(header.styleSheetId);
}
});
//Start tracking CSS coverage
await client.send('CSS.startRuleUsageTracking')
await page.goto(`http://localhost`)
// const content = await page.content();
// console.log(content);
const rules = await client.send('CSS.takeCoverageDelta')
const usedRules = rules.coverage.filter(rule => {
return rule.used
})
const slices = [];
for (const usedRule of usedRules) {
// console.log(usedRule.styleSheetId)
if (inlineStylesheetIndex.has(usedRule.styleSheetId)) {
continue;
}
const stylesheet = await client.send('CSS.getStyleSheetText', {
styleSheetId: usedRule.styleSheetId
});
slices.push(stylesheet.text.slice(usedRule.startOffset, usedRule.endOffset));
}
console.log(slices.join(''));
await page.close();
await browser.close();
})();
Source

Way to scrape a JS-Rendered page?

I'm currently scraping a list of URLs on my site using the request-promise npm module.
This works well for what I need, however, I'm noticing that not all of my divs are appearing because some are rendered after the fact with JS. I know I can't run that JS code remotely to force the render, but is there any ways to be able to scrape the pages only after those elements are added in?
I'm doing this currently with Node, and would prefer to keep using Node if possible.
Here is what I have:
const urls ['fake.com/link-1', 'fake.com/link-2', 'fake.com/link-3']
urls.forEach(url => {
request(url)
.then(function(html){
//get dummy dom
const d_dom = new JSDOM(html);
....
}
});
Any thoughts on how to accomplish this? Or if there is currently an alternative to Selenium as an npm module?

You will want to use puppeteer which is a Chrome headless browser (owned and maintained by Chrome/Google) for loading and parsing dynamic web pages.
Use page.goto() to goto a specific page, then use page.content() to load the html content from the rendered page.
Here is an example of how to use it:
const { JSDOM } = require("jsdom");
const puppeteer = require('puppeteer')
const urls = ['fake.com/link-1', 'fake.com/link-2', 'fake.com/link-3']
urls.forEach(async url => {
let dom = new JSDOM(await makeRequest(url))
console.log(dom.window.document.title)
});
async function makeRequest(url) {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
let html = await page.content()
await browser.close();
return html
}

We Keep Coding

JavaScript is the programming language of the Web.

How to get all links from a website with puppeteer - javascript

Related

How to find specific value of html element (such as h1) with node.js

How to get the content of a div tag when scraping with puppeteer and NodeJs

Tampermonkey To open multiple javascript in href in new tab [duplicate]

chrome extract coverage report as part of build process

Way to scrape a JS-Rendered page?

Categories

Resources