I'm writing a scraper for Skyscanner just for fun. What I'm trying to do is to iterate through the list of all listings, and for each listing, extract the URL.
What I've done so far is getting the listing $("div[class^='FlightsResults_dayViewItems']") which returns
but I'm not sure how to iterate through the returned object and get the URL (/transport/flight/bos...). The pseudo code that I have is
for(listings in $("div[class^='FlightsResults_dayViewItems']")) {
go to class^='EcoTicketWrapper_itineraryContainer'
go to class^='FlightsTicket_container'
go to class^='FlightsTicket_link' and get the href and save in an array
}
How would I go about doing this?
Side-note, I'm using cheerio and jquery.
Update:
I figured out the CSS selector is
$("div[class^='FlightsResults_dayViewItems'] > div:nth-child(at_index_i) > div[class^='EcoTicketWrapper_itineraryContainer'] > div[class^='FlightsTicket_container'] > a[class^='FlightsTicket_link']").href
Now, I'm trying to figure out how to loop through the listing and apply the selector for each listing in the loop.
Also, it seems like not including the div:nth-child(at_index_i) won't work. Is there a way around this?
$("div[class^='FlightsResults_dayViewItems'] > div:nth-child(3) > div[class^='EcoTicketWrapper_itineraryContainer'] > div[class^='FlightsTicket_container'] > [class^='FlightsTicket_link']").attr("href")
"/transport/flights/bos/cun/210301/210331/config/10081-2103010815--32733-0-10803-2103011250|10803-2103311225--31722-1-10081-2103312125?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&destinationentityid=27540602&inboundaltsenabled=false&infants=0&originentityid=27539525&outboundaltsenabled=false&preferdirects=false&preferflexible=false&ref=home&rtn=1"
$("div[class^='FlightsResults_dayViewItems'] > div[class^='EcoTicketWrapper_itineraryContainer'] > div[class^='FlightsTicket_container'] > [class^='FlightsTicket_link']").attr("href")
undefined
Here's the function to iterate the listings and grab the URLs for each listing.
async function scrapeListingUrl(listingURL) {
try {
const page = await browser.newPage();
await page.goto(listingURL, { waitUntil: "networkidle2" });
// await page.waitForNavigation({ waitUntil: "networkidle2" }); // Wait until page is finished loading before navigating
console.log("Finished loading page.");
const html = await page.evaluate(() => document.body.innerHTML);
fs.writeFileSync("./listing.html", html);
const $ = cheerio.load(html); // Parse the HTML so it can be queried with jQuery-style selectors (load is synchronous)
// Iterate through flight listings
// Note: Using an attribute substring selector (class*=) because the actual class name has a generated string appended to the end.
const bookingURLs = $('a[class*="FlightsTicket_link"]')
.map((i, elem) => $(elem).attr("href")) // cheerio nodes have no .href property, so read the attribute
.get();
console.log(bookingURLs);
return bookingURLs;
} catch (error) {
console.log("Scrape flight url failed.");
console.log(error);
}
}
Using map()
const hrefs = $(selector).map((i, elem) => elem.href).get()
Looking at your code, you are not using jQuery in the page, so the above does not work there. You just need a basic selector that matches part of the class with querySelectorAll, and map to grab the hrefs.
const links = [...document.querySelectorAll('a[class*="FlightsTicket_link"]')]
.map(l=>l.href)
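When the HTML is parsed with cheerio rather than queried in a live browser, the elements are plain parsed nodes, so elem.href is undefined and the attribute has to be read from elem.attribs (or via $(elem).attr("href")). A minimal sketch, assuming $ is the object returned by cheerio.load(html); the helper name extractHrefs and the baseUrl parameter are illustrative assumptions, not part of the original post:

```javascript
// Collect the href of every matching link from a cheerio-loaded document.
// `$` is assumed to be the result of cheerio.load(html).
// `baseUrl` (hypothetical parameter) turns relative hrefs into absolute URLs.
function extractHrefs($, selector, baseUrl) {
  return $(selector)
    .map((i, elem) => elem.attribs.href) // cheerio nodes expose attributes via `attribs`
    .get()                               // unwrap the cheerio collection into a plain array
    .filter(Boolean)                     // guard against links without an href
    .map((href) => new URL(href, baseUrl).href);
}
```

It could be called as extractHrefs($, 'a[class*="FlightsTicket_link"]', listingURL), which would resolve relative paths such as /transport/flights/... against the page that was scraped.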
Thank you for reading. I would appreciate any suggestions or information.
What I'm doing
I'm making a web scraping app with JSDOM and axios.
I'm trying to query all <a href="url"> elements and get their href values.
Problem
Why is the length of lists 0?
How can I get the expected result? I want to get a NodeList with 3 nodes.
Are there any pitfalls with JSDOM I should be aware of? I suspect this is a JSDOM problem.
// target HTML
getLink
// It seems that this a tag is clickable and that gives #download-options "display: none !important; visibility: hidden !important". Does this affect what I'm doing?
<div id="download-options">
<div class="panel-body">
::before
<ul>
<li></li>
<li></li>
<li></li>
</ul>
::after
</div>
</div>
// My web-scraping code
let res = await axios.get('url')
const dom = new JSDOM(res.data)
const ulist = dom.window.document.querySelector('#download-options > .panel-body > ul')
// => returns HTMLUListElement {}
// ulist.childElementCount => returns 1
const lists = ulist.querySelectorAll('li')
// => returns NodeList {}
// lists.length => returns 0 expected 3, so cannot forEach.
What I've tried
I ran the same query in the Google Chrome developer console, and it returned what I expected. (I got a NodeList with 3 nodes, could run forEach on it, and got all of the href values.)
I added a user-agent header to the axios request.
It turned out that the target site doesn't include the <li> elements in the <ul> on the first request to the site. I don't know why or how the site does this, but I think it's related to cookies or caching.
So I used puppeteer to visit the site's homepage first and then the target page. This solved the problem.
The code is below:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({headless: true});
const page = await browser.newPage();
await page.goto('http://example.com') //go to homepage to solve cache? problem
await page.goto('https://example.com/targetpage'); // then go to actual target
await page.waitForSelector('#download-options li'); // wait for it just in case
const ul = await page.$("#download-options ul")
const lis = await ul.$$("li")
for (const li of lis) {
const a = await li.$('a')
const hrefValue = await a.evaluate((el) => el.getAttribute('href'))
console.log(hrefValue)
}
await browser.close();
})();
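The loop over element handles can also be collapsed into a single page.$$eval call, which runs the callback in the browser and sends back only the serializable result. A sketch, assuming the same #download-options markup as above; the function name getDownloadLinks is hypothetical:

```javascript
// Return the href attribute of every link inside the download list.
// `page` is assumed to be a Puppeteer Page that has already navigated to the target.
async function getDownloadLinks(page) {
  const selector = '#download-options li a';
  await page.waitForSelector(selector); // wait until the list items exist in the DOM
  // The callback runs in the browser; only the array of strings is returned to Node.
  return page.$$eval(selector, (anchors) =>
    anchors.map((a) => a.getAttribute('href'))
  );
}
```

This avoids one round trip per list item, at the cost of not keeping handles around for later interaction.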
I have read a few different QAs related to this but none seem to be working.
I am trying to target an element (Angular) called mat-radio-button with a class called mat-radio-checked. And then select the inner text.
In Chrome this is easy:
https://i.stack.imgur.com/Ev0iQ.png
https://i.stack.imgur.com/lVoG3.png
To find the first element that matches in Playwright I can do something like:
let test: any = await page.textContent(
"mat-radio-button.mat-radio-checked"
);
console.log(test);
But if I try this:
let test: any = await page.$$(
"mat-radio-button.mat-radio-checked"
);
console.log(test);
console.log(test[0]);
console.log(test[1]);
});
It does not return an array of elements I can select the inner text of.
I need to be able to find all elements with that class so I can use expect to ensure the returned inner text is correct, eg:
expect(test).toBe("Australian Citizen");
I found out the issue was that the query ran before the page had finished rendering, so the elements were not yet available. So I added a waitForSelector:
await page.waitForSelector("mat-radio-button");
const elements = await page.$$("mat-radio-button.mat-radio-checked");
console.log(elements.length);
console.log(await elements[0].innerText());
console.log(await elements[1].innerText());
console.log(await elements[2].innerText());
public async validListOfItems(name) {
await this.page.waitForSelector(name, {timeout: 5000});
const itemlists = await this.page.$$(name);
let allitems: string[]=new Array();
for await (const itemlist of itemlists) {
allitems.push(await itemlist.innerText());
}
return allitems;
}
Make this a common method and call it with the selector as a parameter.
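For the assertion itself the handles are not strictly needed: the texts can be read in one call and compared as an array. A sketch, assuming Playwright's page.$$eval; the helper name checkedRadioTexts is only an illustration, and the expected values would come from your own test data:

```javascript
// Read the text of every checked radio button in a single round trip.
// `page` is assumed to be a Playwright Page.
async function checkedRadioTexts(page) {
  const selector = 'mat-radio-button.mat-radio-checked';
  await page.waitForSelector(selector); // wait until a match has appeared
  const texts = await page.$$eval(selector, (els) => els.map((el) => el.innerText));
  return texts.map((t) => t.trim()); // normalise surrounding whitespace
}
```

In a test, the result could then be checked with something like expect(await checkedRadioTexts(page)).toContain("Australian Citizen"), or with toEqual against the full expected list.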
I have an HTML page that has many div elements, each one with the following structure (the input id and name change):
<div class="item">
<div class="box">
<div class="img-block">
<label for="check-11">
<input id="check-11" name="result11" type="checkbox">
<span class="fake-input"></span>
</label>
</div>
</div>
</div>
I want to use puppeteer to get all the span with the 'fake-input' class and click on them.
The problem is that it never works, no matter what I try.
In every attempt the start is the same:
(async () => {
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto(baseUrl, { waitUntil: 'networkidle2' });
// FETCHING AND CLICKING
})();
I tried many things:
1:
await page.waitForSelector('span.fake-input');
await page.click('span.fake-input');
2:
await page.waitForSelector('span.fake-input')
.then(()=>{
console.log(`clicked!`);
page.click('span.fake-input')
})
3:
const spans = await page.evaluate(() => {
return Array.from(document.querySelectorAll('span'), el => el.textContent)
})
console.log('spans', spans)
for (let index = 0; index < 7; index++) {
const element = spans[index];
await page.click('span')
}
4:
await page.evaluate(selector=>{
return document.querySelector(selector).click();
},'span.fake-input')
console.log('clicked');
In every attempt the page either fails to get anything at all (it returns null or undefined, so the error is that "click" is not a function of null) or it fails with the error "Node is either not visible or not an HTMLElement".
No matter the error, in any case I fail to fetch all the spans, and click on them.
Can anyone tell me what I'm doing wrong?
Use page.$$ to return multiple elements (equivalent of document.querySelectorAll). Use page.$ to return a single element (equivalent of document.querySelector).
If you want to extract a certain value from a group of elements, use page.$$eval and page.$eval for a single element.
e.g. get the element handles in your script and click each one
const spans = await page.$$('div.item label .fake-input')
for (const span of spans) await span.click()
If you are extracting a value from an element, pass a callback that returns what you need to extract
e.g.
const spanTexts = await page.$$eval('div.item label .fake-input', spans => {
return spans.map(span=>span.innerText)
})
console.log(spanTexts)
I should add that page.$$eval and page.$eval execute your callback in the browser context.
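The two approaches can be wrapped as small helpers. This is a sketch under stated assumptions: page is a Puppeteer Page, and the names clickAll and clickAllInPage are made up for illustration. As far as I know, a handle's click() simulates a real mouse click and therefore needs a visible element, which is one way to run into "Node is either not visible or not an HTMLElement"; the in-page variant calls the DOM click() method instead, which does not depend on visibility.

```javascript
// Click every element matching `selector`, one at a time, via element handles.
// `page` is assumed to be a Puppeteer Page.
async function clickAll(page, selector) {
  const handles = await page.$$(selector);
  for (const handle of handles) {
    await handle.click(); // awaited so the clicks do not overlap
  }
  return handles.length;
}

// Alternative: call the DOM click() method inside the page.
// element.click() does not require the node to be visible.
async function clickAllInPage(page, selector) {
  return page.$$eval(selector, (els) => {
    els.forEach((el) => el.click());
    return els.length; // number of elements clicked
  });
}
```

For the markup in the question, either could be called with 'span.fake-input' after waiting for that selector.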
var obj = document.querySelectorAll("span.fake-input");
for(var i=0;i<obj.length;i++){
obj[i].click();
}
Vanilla JavaScript is simpler here. Note that this code has to run in the browser context, for example inside page.evaluate.
I want to get the whole html not just text.
Apify.main(async () => {
const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest({
url: //adress,
uniqueKey: makeid(100)
});
const handlePageFunction = async ({ request, $ }) => {
var content_to = $('.class')
};
// Set up the crawler, passing a single options object as an argument.
const crawler = new Apify.CheerioCrawler({
requestQueue,
handlePageFunction,
});
await crawler.run();
});
When I try this, the crawler returns a complex object. I know I can extract the text from the content_to variable using .text(), but I need the whole HTML, tags included. What should I do?
If I understand you correctly, you could just use .html() instead of .text(). This way you will get the inner HTML instead of the inner text of the element.
Another thing to mention: you could also add body to the handlePageFunction argument object:
const handlePageFunction = async ({ request, body, $ }) => {
body would have the whole raw html of the page.
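To make the inner/outer distinction concrete: in cheerio, .html() on a selection gives the markup inside the first matched element, while the static $.html(selection) renders the element together with its own tag. A small sketch, assuming $ is the cheerio instance passed to handlePageFunction; the helper name htmlOf is hypothetical:

```javascript
// Return both the inner and the outer HTML of the first element matching `selector`.
// `$` is assumed to be the cheerio instance provided by the crawler.
function htmlOf($, selector) {
  const el = $(selector).first();
  return {
    inner: el.html(),  // markup inside the element, without its own tag
    outer: $.html(el), // markup including the element's own tag
  };
}
```

Inside handlePageFunction this could be used as const { outer } = htmlOf($, '.class').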
Hello, I have a site whose URLs are rendered by JavaScript.
I want to find all script tags on my site, then match on the script src and keep only the valid ones.
Next I want to find the parent of each script and finally click on the link.
This is what I have:
const scripts = await page.$$('script').then(scripts => {
return scripts.map(script => {
if(script.src.indexOf('aaa')>0){
return script
}
});
});
scripts.forEach(script => {
let link = script.parentElement.querySelector('a');
link.click();
});
My problem is that script.src is undefined.
When I remove that condition I get to the forEach loop, but then querySelector is undefined. I can write this code in JS in the browser console in debug mode, but I can't move it to the Puppeteer API.
From the console I get the results as expected:
let scripts = document.querySelectorAll('script');
scripts.forEach(script=>{
let el = script.parentElement.querySelector('a');
console.log(el)
})
When you use $$ or $, it returns ElementHandles (a kind of JSHandle), which are not the same as the HTML Node or NodeList you get when you run querySelector inside evaluate. So script.src will always be undefined.
You can use the following instead. $$eval runs the selector in the page and passes the resulting array of elements to your callback, so you can loop over it there.
page.$$eval('script', scripts => scripts.forEach(script => {
const src = script.getAttribute('src') || '' // inline scripts have no src attribute
const valid = src.indexOf('aaa') > 0 // do some checks
const link = valid && script.parentElement.querySelector('a') // the nearby anchor element if the check passed
if (link) link.click(); // click if it exists
}))
There are other ways to achieve this, but I merged everything into one call. That is, if it works in the browser, you can also use .evaluate, run the exact same code, and get the same result.
page.evaluate(() => {
let scripts = document.querySelectorAll('script');
scripts.forEach(script => {
let el = script.parentElement.querySelector('a');
console.log(el) // it won't show on your node console, but on your actual browser when it is running;
if (el) el.click(); // skip scripts whose parent contains no link
})
})
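Since console.log inside the page does not reach the Node console, it can help to return what was clicked from the callback. A sketch, assuming page is a Puppeteer Page; the function name clickLinksNextToScripts and the needle parameter are illustrative, and the extra argument is forwarded to the browser through $$eval's trailing arguments:

```javascript
// Click the link next to every <script> whose src contains `needle`,
// and return the hrefs that were clicked so they can be logged in Node.
// `page` is assumed to be a Puppeteer Page.
async function clickLinksNextToScripts(page, needle) {
  return page.$$eval(
    'script',
    (scripts, needle) => {
      const clicked = [];
      for (const script of scripts) {
        const src = script.getAttribute('src') || ''; // inline scripts have no src
        if (!src.includes(needle)) continue;
        const link = script.parentElement.querySelector('a');
        if (link) {
          link.click();
          clicked.push(link.getAttribute('href'));
        }
      }
      return clicked; // serializable, so it is sent back to Node
    },
    needle
  );
}
```

If one of the clicks navigates the page away, the remaining clicks may never run, so this only fits links that do not trigger a full navigation.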