Related
I'm practicing web scraping using Selenium and trying to scrape all the product links from Lululemon->Woman's main page. But I found that when I tried to use XPath to locate product URLs and then loop through the lists, the different part of each XPath for each product is in the middle, which suggests I cannot do as I expected.
For example, the Xpath of each product is :
/html/body/div[1]/div/main/div/section/div/div[3]/div[2]/div[2]/div/div[133]/div/div/div[2]/h3/a
/html/body/div[1]/div/main/div/section/div/div[3]/div[2]/div[2]/div/div[134]/div/div/div[2]/h3/a
/html/body/div[1]/div/main/div/section/div/div[3]/div[2]/div[2]/div/div[1]/div/div/div[2]/h3/a
See, the difference of each XPath lies in 133, 134, and 1, which represent the #id of products on this page
So how can I create a full list of information of all products (if XPath works) which allows me to loop through it to get every single product's list? Can anyone help me? I pasted my current code and attached the screenshot for reference. Thank you so much!
#this is how I got the web page
driver_path = 'D:/Python/Selenium/chromedriver'
url = "https://shop.lululemon.com/c/womens-leggings/_/N-8s6"
max_pass = 5
#get each product's url
option1 = webdriver.ChromeOptions()
option1.add_experimental_option('detach',True)
driver = webdriver.Chrome(chrome_options=option1,executable_path=driver_path)
driver.get(url)
sleep(2)
for i in range(max_pass):
sleep(3)
try:
driver.find_element_by_xpath('/html/body/div[1]/div/main/div/section/div/div[4]/div/button/span').click()
except:
pass
try:
driver.find_element_by_xpath('/html/body/div[1]/div/main/div/section/div/div[2]/div/button/span').click()
except:
pass
sleep(3)
driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
#the next step should be to find the pattern of where each URL is located (this should be a list), then I need to loop through the list to get "href" for every single product
#By the way, I have also tried to use class name "link lll-font-weight-medium" to locate, but I don't know why python says "Message: chrome not reachable (Session info: chrome=95.0.4638.69)"
[p.get_attribute('href') for p in driver.find_elements_by_class_name('link lll-font-weight-medium')] #this doesn't work
To print the href attributes you need to induce WebDriverWait for the visibility_of_all_elements_located() and you can use either of the following Locator Strategies:
Using CSS_SELECTOR:
driver.get("https://shop.lululemon.com/c/womens-leggings/_/N-8s6")
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h3.product-tile__product-name > a")))])
Using XPATH:
driver.get("https://shop.lululemon.com/c/womens-leggings/_/N-8s6")
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h3[contains(#class, 'product-tile__product-name')]/a")))])
Console Output:
['https://shop.lululemon.com/p/womens-leggings/Invigorate-HR-Tight-25/_/prod9750552?color=52445', 'https://shop.lululemon.com/p/womens-leggings/Wunder-Train-HR-Tight-25/_/prod9750562?color=47184', 'https://shop.lululemon.com/p/womens-leggings/Instill-High-Rise-Tight-25/_/prod10641675?color=30210', 'https://shop.lululemon.com/p/womens-leggings/Base-Pace-High-Rise-Tight-25/_/prod10641591?color=51039', 'https://shop.lululemon.com/p/womens-leggings/Align-Crop-21-Shine/_/prod10850236?color=51756', 'https://shop.lululemon.com/p/women-pants/Fast-And-Free-Tight-II-NR/_/prod8960003?color=28948', 'https://shop.lululemon.com/p/women-pants/Align-Pant-Full-Length-28/_/prod8780551?color=46741', 'https://shop.lululemon.com/p/women-pants/Align-Pant-2/_/prod2020012?color=26950', 'https://shop.lululemon.com/p/women-pants/Align-Pant-Super-Hi-Rise-28/_/prod9200552?color=26083']
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Getting all links of the displayed products you can go with xpath but in my opinion css selectors are quiet more comfortable:
for a in driver.find_elements(By.CSS_SELECTOR, '[data-testid="product-list"] h3 a'):
print(a.get_attribute('href'))
Instead of printing in the iteration you can also append them to a list or process the single product page directly.
Example (selenium 4)
...
driver.get(url)
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(0.5)
new_height = driver.execute_script("return document.body.scrollHeight")
if new_height == last_height:
break
last_height = new_height
for a in driver.find_elements(By.CSS_SELECTOR, '[data-testid="product-list"] h3 a'):
print(a.get_attribute('href'))
I have the following code in which I enter a page and search for a product, I want to execute a JavaScript code
from selenium import webdriver
from getpass import getpass
#-------------------------------------------PRODUCT SEARCH-----------------------------------------------------------------------------------
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome("C:\\Users\\stbaz\\Documents\\Python\\ChromeTools\\chromedriver.exe", options=options)
driver.get("https://www.innvictus.com/")
product_textbox = driver.find_element_by_id("is-navigation__search-bar__input")
product_textbox.send_keys("FW7093")
product_textbox.submit()
#------------------------------------------PRODUCT SEARCH END--------------------------------------------------------------------------------------
driver.implicitly_wait(5)
js='javascript:document.getElementsByClassName("buy-button buy-button--sticky buy-button--buy-now visible-xs visible-sm")[1].click();window.open("/checkout")'
driver.execute_script(js)
But I get the following error
selenium.common.exceptions.JavascriptException: Message: javascript error: Cannot read property 'click' of undefined
I can run that code manually in chrome, I use a bookmark, but I want to run it in Python, what am I doing wrong?
Mainly Getting an error "Cannot read property 'click' of undefined" is if selector or class cannot be found in the DOM. I believe the error happens if JavaScript cannot find the selector in the markup/DOM or if it does not exist.
Maybe you should try
js='javascript:document.getElementsByClassName("buy-button buy-button--sticky buy-button--buy-now visible-xs visible-sm").click();window.open("/checkout")'
with out the index.
check Cannot read property 'click' of undefined while using Java Script Executor in Selenium
It may help
This error message...
selenium.common.exceptions.JavascriptException: Message: javascript error: Cannot read property 'click' of undefined
...implies that the click() method can't be executed as the WebElement haven't completely rendered within the DOM Tree and the element is still an undefined element.
Solution
To execute the JavaScript, you need to induce WebDriverWait for visibility_of_all_elements_located() and you can use either of the following Locator Strategies:
Using CSS_SELECTOR:
WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".buy-button.buy-button--sticky.buy-button--buy-now.visible-xs.visible-sm")))
driver.execute_script("var x= document.getElementsByClassName('buy-button buy-button--sticky buy-button--buy-now visible-xs visible-sm')[0];"+"x.click();")
Using XPATH:
WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[#class='buy-button buy-button--sticky buy-button--buy-now visible-xs visible-sm']")))
driver.execute_script("var x= document.getElementsByClassName('buy-button buy-button--sticky buy-button--buy-now visible-xs visible-sm')[0];"+"x.click();")
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
So, I included the main program without using JavaScript; but, under the Main Program Reference, I included the method for using JavaScript as well.
To achieve this solution without JS, I used xpath to validate that the page loaded successfully. Then, after that, I found the xpath for the search button
//nav[#class='is-navigation']//span[contains(#class, 'search-btn')]
Once I discovered the xpath for this, I clicked the search button and then I created a method to search for a specific product. In my example, I used the "Converse" shoes as an example.
def search_for_text(driver : ChromeDriver, text : str):
driver.find_element(By.XPATH, "//form[#id='formSearch']//input[contains(#data-options, 'SearchBox')]").send_keys(text)
time.sleep(2)
if driver.find_element(By.XPATH, "//form[#id='formSearch']//input[contains(#data-options, 'SearchBox')]").get_attribute('value').__len__() != text.__len__():
raise Exception("Failed to populate our search textbox")
else:
driver.find_element(By.XPATH, "//form[#id='formSearch']//input[contains(#data-options, 'SearchBox')]").send_keys(Keys.RETURN)
time.sleep(2)
wait_displayed(driver, "//div[#class='is-pw__products']//div[contains(#class, 'products-list')]", 30)
print(f'Your search for {text} was successful')
In that method, you see that I used wait_displayed to validate that my product list displays properly.
MAIN PROGRAM - FOR REFERENCE
from selenium import webdriver
from selenium.webdriver.chrome.webdriver import WebDriver as ChromeDriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as DriverWait
from selenium.webdriver.support import expected_conditions as DriverConditions
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.keys import Keys
import time
def get_chrome_driver():
"""This sets up our Chrome Driver and returns it as an object"""
path_to_chrome = "F:\Selenium_Drivers\Windows_Chrome87_Driver\chromedriver.exe"
chrome_options = webdriver.ChromeOptions()
# Browser is displayed in a custom window size
chrome_options.add_argument("window-size=1500,1000")
return webdriver.Chrome(executable_path = path_to_chrome,
options = chrome_options)
def wait_displayed(driver : ChromeDriver, xpath: str, int = 5):
try:
DriverWait(driver, int).until(
DriverConditions.presence_of_element_located(locator = (By.XPATH, xpath))
)
except:
raise WebDriverException(f'Timeout: Failed to find {xpath}')
def is_displayed(driver : ChromeDriver, xpath: str, int = 5):
try:
webElement = DriverWait(driver, int).until(
DriverConditions.presence_of_element_located(locator = (By.XPATH, xpath))
)
return True if webElement != None else False
except:
return False
def click_search(driver : ChromeDriver):
driver.find_element(By.XPATH, "//nav[#class='is-navigation']//span[contains(#class, 'search-btn')]").click()
time.sleep(2)
if is_displayed(driver, "//form[#id='formSearch']//input[contains(#data-options, 'SearchBox')]") == False:
raise Exception("Failed to click our search button")
else:
print('You clicked the Search Button Successfully')
def search_for_text(driver : ChromeDriver, text : str):
driver.find_element(By.XPATH, "//form[#id='formSearch']//input[contains(#data-options, 'SearchBox')]").send_keys(text)
time.sleep(2)
if driver.find_element(By.XPATH, "//form[#id='formSearch']//input[contains(#data-options, 'SearchBox')]").get_attribute('value').__len__() != text.__len__():
raise Exception("Failed to populate our search textbox")
else:
driver.find_element(By.XPATH, "//form[#id='formSearch']//input[contains(#data-options, 'SearchBox')]").send_keys(Keys.RETURN)
time.sleep(2)
wait_displayed(driver, "//div[#class='is-pw__products']//div[contains(#class, 'products-list')]", 30)
print(f'Your search for {text} was successful')
# Gets our chrome driver and opens our site
chrome_driver = get_chrome_driver()
chrome_driver.get("https://www.innvictus.com/")
wait_displayed(chrome_driver, "(//nav[#class='is-navigation']//div[contains(#class, 'desktop-menu')]//a[contains(#href, 'lanzamientos')])[1]")
wait_displayed(chrome_driver, "//div[#class='content-page']//div[#class='scrolling-wrapper-flexbox']")
wait_displayed(chrome_driver, "//nav[#class='is-navigation']//span[contains(#class, 'search-btn')]")
click_search(chrome_driver)
search_for_text(chrome_driver, "Converse")
chrome_driver.quit()
chrome_driver.service.stop()
COMMAND FOR CLICKING THE SEARCH BUTTON USING JAVASCRIPT
jsText = "document.querySelector('nav').querySelector('div .is-navigation__main').querySelector('div .is-navigation__main__right').querySelector('span').click()"
driver.execute_script(jsText)
METHOD
def click_search_using_javaScript(driver : ChromeDriver):
jsText = "document.querySelector('nav').querySelector('div .is-navigation__main').querySelector('div .is-navigation__main__right').querySelector('span').click()"
driver.execute_script(jsText)
time.sleep(2)
if is_displayed(driver, "//form[#id='formSearch']//input[contains(#data-options, 'SearchBox')]") == False:
raise Exception("Failed to click our search button")
else:
print('You clicked the Search Button Successfully')
I want to scrape all the data of a page implemented by a infinite scroll. The following python code works.
for i in range(100):
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)
This means every time I scroll down to the bottom, I need to wait 5 seconds, which is generally enough for the page to finish loading the newly generated contents. But, this may not be time efficient. The page may finish loading the new contents within 5 seconds. How can I detect whether the page finished loading the new contents every time I scroll down? If I can detect this, I can scroll down again to see more contents once I know the page finished loading. This is more time efficient.
The webdriver will wait for a page to load by default via .get() method.
As you may be looking for some specific element as #user227215 said, you should use WebDriverWait to wait for an element located in your page:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
browser = webdriver.Firefox()
browser.get("url")
delay = 3 # seconds
try:
myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
print "Page is ready!"
except TimeoutException:
print "Loading took too much time!"
I have used it for checking alerts. You can use any other type methods to find the locator.
EDIT 1:
I should mention that the webdriver will wait for a page to load by default. It does not wait for loading inside frames or for ajax requests. It means when you use .get('url'), your browser will wait until the page is completely loaded and then go to the next command in the code. But when you are posting an ajax request, webdriver does not wait and it's your responsibility to wait an appropriate amount of time for the page or a part of page to load; so there is a module named expected_conditions.
Trying to pass find_element_by_id to the constructor for presence_of_element_located (as shown in the accepted answer) caused NoSuchElementException to be raised. I had to use the syntax in fragles' comment:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get('url')
timeout = 5
try:
element_present = EC.presence_of_element_located((By.ID, 'element_id'))
WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
print "Timed out waiting for page to load"
This matches the example in the documentation. Here is a link to the documentation for By.
Find below 3 methods:
readyState
Checking page readyState (not reliable):
def page_has_loaded(self):
self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
page_state = self.driver.execute_script('return document.readyState;')
return page_state == 'complete'
The wait_for helper function is good, but unfortunately click_through_to_new_page is open to the race condition where we manage to execute the script in the old page, before the browser has started processing the click, and page_has_loaded just returns true straight away.
id
Comparing new page ids with the old one:
def page_has_loaded_id(self):
self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
try:
new_page = browser.find_element_by_tag_name('html')
return new_page.id != old_page.id
except NoSuchElementException:
return False
It's possible that comparing ids is not as effective as waiting for stale reference exceptions.
staleness_of
Using staleness_of method:
#contextlib.contextmanager
def wait_for_page_load(self, timeout=10):
self.log.debug("Waiting for page to load at {}.".format(self.driver.current_url))
old_page = self.find_element_by_tag_name('html')
yield
WebDriverWait(self, timeout).until(staleness_of(old_page))
For more details, check Harry's blog.
As mentioned in the answer from David Cullen, I've always seen recommendations to use a line like the following one:
element_present = EC.presence_of_element_located((By.ID, 'element_id'))
WebDriverWait(driver, timeout).until(element_present)
It was difficult for me to find somewhere all the possible locators that can be used with the By, so I thought it would be useful to provide the list here.
According to Web Scraping with Python by Ryan Mitchell:
ID
Used in the example; finds elements by their HTML id attribute
CLASS_NAME
Used to find elements by their HTML class attribute. Why is this
function CLASS_NAME not simply CLASS? Using the form object.CLASS
would create problems for Selenium's Java library, where .class is a
reserved method. In order to keep the Selenium syntax consistent
between different languages, CLASS_NAME was used instead.
CSS_SELECTOR
Finds elements by their class, id, or tag name, using the #idName,
.className, tagName convention.
LINK_TEXT
Finds HTML tags by the text they contain. For example, a link that
says "Next" can be selected using (By.LINK_TEXT, "Next").
PARTIAL_LINK_TEXT
Similar to LINK_TEXT, but matches on a partial string.
NAME
Finds HTML tags by their name attribute. This is handy for HTML forms.
TAG_NAME
Finds HTML tags by their tag name.
XPATH
Uses an XPath expression ... to select matching elements.
From selenium/webdriver/support/wait.py
driver = ...
from selenium.webdriver.support.wait import WebDriverWait
element = WebDriverWait(driver, 10).until(
lambda x: x.find_element_by_id("someId"))
On a side note, instead of scrolling down 100 times, you can check if there are no more modifications to the DOM (we are in the case of the bottom of the page being AJAX lazy-loaded)
def scrollDown(driver, value):
driver.execute_script("window.scrollBy(0,"+str(value)+")")
# Scroll down the page
def scrollDownAllTheWay(driver):
old_page = driver.page_source
while True:
logging.debug("Scrolling loop")
for i in range(2):
scrollDown(driver, 500)
time.sleep(2)
new_page = driver.page_source
if new_page != old_page:
old_page = new_page
else:
break
return True
Have you tried driver.implicitly_wait. It is like a setting for the driver, so you only call it once in the session and it basically tells the driver to wait the given amount of time until each command can be executed.
driver = webdriver.Chrome()
driver.implicitly_wait(10)
So if you set a wait time of 10 seconds it will execute the command as soon as possible, waiting 10 seconds before it gives up. I've used this in similar scroll-down scenarios so I don't see why it wouldn't work in your case. Hope this is helpful.
To be able to fix this answer, I have to add new text. Be sure to use a lower case 'w' in implicitly_wait.
Here I did it using a rather simple form:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("url")
searchTxt=''
while not searchTxt:
try:
searchTxt=browser.find_element_by_name('NAME OF ELEMENT')
searchTxt.send_keys("USERNAME")
except:continue
Solution for ajax pages that continuously load data. The previews methods stated do not work. What we can do instead is grab the page dom and hash it and compare old and new hash values together over a delta time.
import time
from selenium import webdriver
def page_has_loaded(driver, sleep_time = 2):
'''
Waits for page to completely load by comparing current page hash values.
'''
def get_page_hash(driver):
'''
Returns html dom hash
'''
# can find element by either 'html' tag or by the html 'root' id
dom = driver.find_element_by_tag_name('html').get_attribute('innerHTML')
# dom = driver.find_element_by_id('root').get_attribute('innerHTML')
dom_hash = hash(dom.encode('utf-8'))
return dom_hash
page_hash = 'empty'
page_hash_new = ''
# comparing old and new page DOM hash together to verify the page is fully loaded
while page_hash != page_hash_new:
page_hash = get_page_hash(driver)
time.sleep(sleep_time)
page_hash_new = get_page_hash(driver)
print('<page_has_loaded> - page not loaded')
print('<page_has_loaded> - page loaded: {}'.format(driver.current_url))
How about putting WebDriverWait in While loop and catching the exceptions.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
browser = webdriver.Firefox()
browser.get("url")
delay = 3 # seconds
while True:
try:
WebDriverWait(browser, delay).until(EC.presence_of_element_located(browser.find_element_by_id('IdOfMyElement')))
print "Page is ready!"
break # it will break from the loop once the specific element will be present.
except TimeoutException:
print "Loading took too much time!-Try again"
You can do that very simple by this function:
def page_is_loading(driver):
while True:
x = driver.execute_script("return document.readyState")
if x == "complete":
return True
else:
yield False
and when you want do something after page loading complete,you can use:
Driver = webdriver.Firefox(options=Options, executable_path='geckodriver.exe')
Driver.get("https://www.google.com/")
while not page_is_loading(Driver):
continue
Driver.execute_script("alert('page is loaded')")
use this in code :
from selenium import webdriver
driver = webdriver.Firefox() # or Chrome()
driver.implicitly_wait(10) # seconds
driver.get("http://www.......")
or you can use this code if you are looking for a specific tag :
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox() #or Chrome()
driver.get("http://www.......")
try:
element = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.ID, "tag_id"))
)
finally:
driver.quit()
Very good answers here. Quick example of wait for XPATH.
# wait for sizes to load - 2s timeout
try:
WebDriverWait(driver, 2).until(expected_conditions.presence_of_element_located(
(By.XPATH, "//div[#id='stockSizes']//a")))
except TimeoutException:
pass
I struggled a bit to get this working as that didn't worked for me as expected. anyone who is still struggling to get this working, may check this.
I want to wait for an element to be present on the webpage before proceeding with my manipulations.
we can use WebDriverWait(driver, 10, 1).until(), but the catch is until() expects a function which it can execute for a period of timeout provided(in our case its 10) for every 1 sec. so keeping it like below worked for me.
element_found = wait_for_element.until(lambda x: x.find_element_by_class_name("MY_ELEMENT_CLASS_NAME").is_displayed())
here is what until() do behind the scene
def until(self, method, message=''):
"""Calls the method provided with the driver as an argument until the \
return value is not False."""
screen = None
stacktrace = None
end_time = time.time() + self._timeout
while True:
try:
value = method(self._driver)
if value:
return value
except self._ignored_exceptions as exc:
screen = getattr(exc, 'screen', None)
stacktrace = getattr(exc, 'stacktrace', None)
time.sleep(self._poll)
if time.time() > end_time:
break
raise TimeoutException(message, screen, stacktrace)
If you are trying to scroll and find all items on a page. You can consider using the following. This is a combination of a few methods mentioned by others here. And it did the job for me:
while True:
try:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
driver.implicitly_wait(30)
time.sleep(4)
elem1 = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "element-name")))
len_elem_1 = len(elem1)
print(f"A list Length {len_elem_1}")
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
driver.implicitly_wait(30)
time.sleep(4)
elem2 = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "element-name")))
len_elem_2 = len(elem2)
print(f"B list Length {len_elem_2}")
if len_elem_1 == len_elem_2:
print(f"final length = {len_elem_1}")
break
except TimeoutException:
print("Loading took too much time!")
selenium can't detect when the page is fully loaded or not, but javascript can. I suggest you try this.
from selenium.webdriver.support.ui import WebDriverWait
WebDriverWait(driver, 100).until(lambda driver: driver.execute_script('return document.readyState') == 'complete')
this will execute javascript code instead of using python, because javascript can detect when page is fully loaded, it will show 'complete'. This code means in 100 seconds, keep tryingn document.readyState until complete shows.
nono = driver.current_url
driver.find_element(By.XPATH,"//button[#value='Send']").click()
while driver.current_url == nono:
pass
print("page loaded.")
I am trying to webscrape a website that has multiple javascript rendered pages (https://openlibrary.ecampusontario.ca/catalogue/). I am able to get the content from the first page, but I am not sure how to get my script to click on the buttons on the subsequent pages to get that content. Here is my script.
import time
from bs4 import BeautifulSoup as soup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import json
# The path to where you have your chrome webdriver stored:
webdriver_path = '/Users/rawlins/Downloads/chromedriver'
# Add arguments telling Selenium to not actually open a window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--window-size=1920x1080')
# Fire up the headless browser
browser = webdriver.Chrome(executable_path = webdriver_path,
chrome_options = chrome_options)
# Load webpage
url = "https://openlibrary.ecampusontario.ca/catalogue/"
browser.get(url)
# to ensure that the page has loaded completely.
time.sleep(3)
data = []
# Parse HTML, close browser
page_soup = soup(browser.page_source, 'lxml')
containers = page_soup.findAll("div", {"class":"result-item tooltip"})
for container in containers:
item = {}
item['type'] = "Textbook"
item['title'] = container.find('h4', {'class' : 'textbook-title'}).text.strip()
item['author'] = container.find('p', {'class' : 'textbook-authors'}).text.strip()
item['link'] = "https://openlibrary.ecampusontario.ca/catalogue/" + container.find('h4', {'class' : 'textbook-title'}).a["href"]
item['source'] = "eCampus Ontario"
item['base_url'] = "https://openlibrary.ecampusontario.ca/catalogue/"
data.append(item) # add the item to the list
with open("js-webscrape-2.json", "w") as writeJSON:
json.dump(data, writeJSON, ensure_ascii=False)
browser.quit()
You do not have to actually click on any button. For example, to search for items with the keyword 'electricity', you navigate to the url
https://openlibrary-repo.ecampusontario.ca/rest/filtered-items?query_field%5B%5D=*&query_op%5B%5D=matches&query_val%5B%5D=(%3Fi)electricity&filters=is_not_withdrawn&offset=0&limit=10000
This will return a json string of items with the first item being:
{"items":[{"uuid":"6af61402-b0ec-40b1-ace2-1aa674c2de9f","name":"Introduction to Electricity, Magnetism, and Circuits","handle":"123456789/579","type":"item","expand":["metadata","parentCollection","parentCollectionList","parentCommunityList","bitstreams","all"],"lastModified":"2019-05-09 15:51:06.91","parentCollection":null,"parentCollectionList":null,"parentCommunityList":null,"bitstreams":null,"withdrawn":"false","archived":"true","link":"/rest/items/6af61402-b0ec-40b1-ace2-1aa674c2de9f","metadata":null}, ...
Now, to get that item, you use its uuid, and navigate to:
https://openlibrary.ecampusontario.ca/catalogue/item/?id=6af61402-b0ec-40b1-ace2-1aa674c2de9f
You can proceed like this for any interaction with that website (this is not always working for all websites, but it is working for your website).
To find out what are the urls that are navigated to when you click such and such button or enter text (what I did for the above urls), you can use fiddler.
I made a little script that can help you (selenium).
what this script does is "while the last page of the catalogue is not selected (in this case, contain 'selected' in it's class), i'll scrape , then click next"
while "selected" not in driver.find_elements_by_css_selector("[id='results-pagecounter-pages'] a")[-1].get_attribute("class"):
#your scraping here
driver.find_element_by_css_selector("[id='next-btn']").click()
There's probably a problem that you'll run into using this method, it doesn't wait for the results to load, but you can figure out what to do from here onwards.
Hope it helps
I've written a script to test a process involving data input & several pages, but after writing it I've found the forms & main content to be generated from javascript.
The following is a snippet of the script I wrote, and after that initial link the content is generated by JS (its my first python script so excuse any mistakes);
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time
browser = webdriver.Firefox()
browser.get('http://127.0.0.1:46727/?ajax=1')
assert "Home" in browser.title
# Find and click the Employer Database link
empDatabaseLink = browser.find_element_by_link_text('Employer Database')
click = ActionChains(browser).click(on_element = empDatabaseLink)
click.perform()
# Content loaded by the above link is generated by the JS
# Find and click the Add Employer button
addEmployerButton = browser.find_element_by_id('Add Employer')
addEmployer = ActionChains(browser).click(on_element = addEmployerButton)
addEmployer.perform()
browser.save_screenshot(r'images\Add_Employer_Form.png')
# Input Employer name
employerName = browser.find_element_by_id('name')
employerName.send_keys("Selenium")
browser.save_screenshot(r'images\Entered_Employer_Name.png')
# Move to next
nextButton = broswer.find_element_by_name('button_next')
moveForward = ActionChains(browser).click(on_element = nextButton)
# Move through various steps
# Then
# Move to Finish
moveForward = ActionChains(browser).click(on_element = nextButton)
How do you access page elements that aren't in the source? I've been looking around & found GetEval but not found anything that I can use :/
Well, to the people of the future, our above conversation appears to have lead to the conclusion that xpath is what mark was looking for. So remember to try xpath, and to use the Selenium IDE and Firebug to locate particularly obstinate page elements.