I am very new to Python, JavaScript, and Web-Scraping. I am trying to write code that writes all of the data in tables like this into a csv file. The webpage is "https://www.mcmaster.com/cam-lock-fittings/material~aluminum/"
I started by trying to find the data in the html but then realized that the website uses JavaScript. I then tried using selenium but I cannot find anywhere in the JavaScript code that has the actual data that is displayed in these tables. I wrote this code to see if I could find the display data anywhere but I was unable to find it.
from urllib.request import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://www.mcmaster.com/cam-lock-fittings/material~aluminum/'
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path='C:/Users/Brian Knoll/Desktop/chromedriver.exe', options=options)
driver.get(url)
html = driver.execute_script("return document.documentElement.outerHTML")
driver.close()
filename = "McMaster Text.txt"
fo = open(filename, "w")
fo.write(html)
fo.close()
I'm sure there's an obvious answer that is just going over my head. Any help would be greatly appreciated! Thank you!
I guess you need to wait till the table your looking for is loaded.
To do so, add the following line to wait for 10 seconds before start scraping the data
fullLoad = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[contains(#class, 'ItmTblCntnr')]")))
Here is the full code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
url = 'https://www.mcmaster.com/cam-lock-fittings/material~aluminum/'
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path=os.path.abspath("chromedriver"), options=options)
driver.get(url)
fullLoad = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[contains(#class, 'ItmTblCntnr')]")))
html = driver.execute_script("return document.documentElement.outerHTML")
driver.close()
filename = "McMaster Text.txt"
fo = open(filename, "w")
fo.write(html)
fo.close()
I need some help on a data Scraping Task on this : https://soilhealth.dac.gov.in/NewHomePage/NutriPage
I Managed to fill the dropdown Menu and to click on view using this code :
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
url = "https://soilhealth.dac.gov.in/NewHomePage/NutriPage"
driver = webdriver.Chrome(executable_path='./chromedriver.exe')
driver.get(url)
select = Select(driver.find_element_by_id('NutriCatId'))
select.select_by_visible_text('Sample Wise')
select = Select(driver.find_element_by_id('CycleId'))
select.select_by_visible_text('All Cycle')
select = Select(driver.find_element_by_id('State_Code'))
select.select_by_visible_text('Andaman And Nicobar Islands')
driver.implicitly_wait(5)
select = Select(driver.find_element_by_id('District_Code'))
select.select_by_visible_text('Nicobars')
driver.find_element_by_id('s').click()
driver.implicitly_wait(30)
soup_level1=BeautifulSoup(driver.page_source, 'lxml')
I need to scrape the table data from the source code, instead of having it on soup_level1 xml, I only have the javascript code.
Any help to know if scraping the data is possible using Selenium is possible and how can I do it would be awsome.
Thank you for your help.
Hey the below code does the job. But it is slow since the table is huge and it takes a bit of time to parse. I observed that the report has an export option available so do try to download it directly using Selenium. Oh and for the explanation the report is generated as an iframe which is different from the default source of the page so you need to switch to that frame to get the info. Do let me know for any clarification. The required data is in df variable.
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 24 11:08:32 2020
#author: prakh
"""
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import pandas as pd
import time
url = "https://soilhealth.dac.gov.in/NewHomePage/NutriPage"
driver = webdriver.Chrome(executable_path='C:/Users/prakh/Documents/PythonScripts/chromedriver.exe')
driver.get(url)
select = Select(driver.find_element_by_id('NutriCatId'))
select.select_by_visible_text('Sample Wise')
select = Select(driver.find_element_by_id('CycleId'))
select.select_by_visible_text('All Cycle')
select = Select(driver.find_element_by_id('State_Code'))
select.select_by_visible_text('Andaman And Nicobar Islands')
driver.implicitly_wait(5)
select = Select(driver.find_element_by_id('District_Code'))
select.select_by_visible_text('Nicobars')
driver.find_element_by_id('s').click()
driver.implicitly_wait(30)
#soup_level1=BeautifulSoup(driver.page_source, 'lxml')
#src = driver.find_element_by_xpath('//*[#id="report"]/iframe').get_attribute("src")
driver.switch_to.frame(driver.find_element_by_xpath('//*[#id="report"]/iframe'))
time.sleep(10)
html = driver.page_source
df_list = pd.read_html(html)
df = df_list[-3]
driver.quit()
I am trying to scrape data from this dynamic JavaScript website. Since the page is dynamic I am using Selenium to extract the data from the table. Please suggest me how to scrape the data from the dynamic table. Here is my code.
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
import lxml.html as LH
import requests
# specify the url
urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
##driver = webdriver.Firefox(executable_path = 'C:/Users/Shresth Suman/Downloads/geckodriver-v0.26.0-win64/geckodriver.exe')
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 5s
time.sleep(5)
# driver.quit()
# find elements by xpath
##results = driver.find_elements_by_xpath("//div[#id='div_taboa']//table[#id='taboa']/tbody")
##results = driver.find_elements_by_xpath("//*[#id='page-title']")
##results = driver.find_elements_by_xpath("//*[#id='div_main']/h2[1]")
results = driver.find_elements_by_xpath("//*[#id = 'frame_historicos']")
print(results)
print(len(results))
# create empty array to store data
data = []
# loop over results
for result in results:
heading = result.text
print(heading)
headingfind = result.find_element_by_tag_name('h1')
# append dict to array
data.append({"head" : headingfind, "name" : heading})
# close driver
driver.quit()
###################################################################
# save to pandas dataframe
df = pd.DataFrame(data)
print(df)
# write to csv
df.to_csv('testsot.csv')
I want to extract data from 2005 till present with Averages/Totals of 10 min which gives me data for only one month.
Induce WebDriverWait And element_to_be_clickable()
Install Beautiful soup library
Using pandas read_html()
I haven't create list. you should create startdate and enddate list and itearte for all those month since 1/1/2005
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
from bs4 import BeautifulSoup
import time
urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
driver.get(urlpage)
WebDriverWait(driver,20).until(EC.frame_to_be_available_and_switch_to_it((By.ID,"frame_historicos")))
inputstartdate=WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"(//input[#class='dijitReset dijitInputInner'])[1]")))
inputstartdate.clear()
inputstartdate.send_keys("1/1/2005")
inputenddate=WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"(//input[#class='dijitReset dijitInputInner'])[last()]")))
inputenddate.clear()
inputenddate.send_keys("1/31/2005")
WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"//input[#class='form-submit'][#value='REFRESH']"))).click()
WebDriverWait(driver,20).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"table#taboa")))
time.sleep(3)
soup=BeautifulSoup(driver.page_source,"html.parser")
table=soup.find("table", id="taboa")
df=pd.read_html(str(table))
df.to_csv('testsot.csv')
print(df)
I'm new to Python and got interested in writing scripts. I'm currently building a crawler that goes on a page and extract copy from tags. Write now I can only list tags; I'm having trouble getting the text out of tags and I'm not sure why exactly. I'm also using BeautifulSoup and PyQt4 to get dynamic data(this might need a new question).
So based on this code below, I should be getting the "Images" copy from the Google homepage, or at least the span tag itself. I'm getting returned NONE
I tried reading the docs for BeautifulSoup and it was a little overwhelming. I'm still reading it, but I think I keep going down a rabbit hole. I can print all anchor tags or all divs, but targeting a specific one is where I'm struggling.
import urllib
import re
from bs4 import BeautifulSoup, Comment
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
class Render(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
url = 'http://google.com'
source = urllib.urlopen(url).read()
soup = BeautifulSoup(source, 'html.parser')
js_test = soup.find("a", class_="gb_P")
print js_test
I'm using a Python script to render web pages and retrieve their HTML's. It works fine with most of the pages, but with some of them the HTML retrieved is incomplete. And I don't quite understand why. This is the script I'm using to scrap this page, for some reason, the link to every product is not in the HTML:
Link: http://www.pullandbear.com/es/es/mujer/vestidos-c29016.html
Python script:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from PyQt4 import QtNetwork
from PyQt4 import QtCore
url = sys.argv[1]
path = sys.argv[2]
class Render(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.request = QtNetwork.QNetworkRequest()
self.request.setUrl(QtCore.QUrl(url))
self.request.setRawHeader("Accept-Language", QtCore.QByteArray ("es ,*"))
self.mainFrame().load(self.request)
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
r = Render(url)
result = r.frame.toHtml()
html_file = open(path, "w")
html_file.write("%s" % result.encode("utf-8"))
html_file.close()
sys.exit(app.exec_())
This code was taken from here: https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/
Am I missing something? What are the limitations of this framework?
Thanks in advance,
If you want headless browsing you can combine phantomjs with selenium, the following gets all the source:
url = "http://www.pullandbear.com/es/es/mujer/vestidos-c29016.html"
from selenium import webdriver
dr = webdriver.PhantomJS()
dr.get(url)
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element = WebDriverWait(dr, 5).until(
EC.presence_of_element_located((By.CLASS_NAME, "grid_itemContainer"))
)
Just using selenium without the WebDriverWait did not always return the full source, adding the wait until the a tags with the grid_itemContainer class were visible makes sure the html has been generated, the xpath below returns all your links:
print([a.get_attribute('href') for a in dr.find_elements_by_xpath("//a[#class='grid_itemContainer']")])
[u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-detalle-crochet-pechera-c29016p100064004.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-bordado-escote-pico-c29016p100123006.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-manga-larga-espalda-abierta-c29016p100147503.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-hombros-descubiertos-beads-c29016p100182001.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-jacquard-capa-c29016p100255505.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-vaquero-eyelets-c29016p100336010.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-liso-oversized-c29016p100289013.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-liso-oversized-c29016p100289013.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-camisero-oversized-c29016p100036616.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-pico-c29016p100166506.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-estampado-rayas-c29016p100234507.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-manga-corta-liso-c29016p100262008.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-cuello-halter-liso-c29016p100036162.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-capa-jacquard-%C3%A9tnico-c29016p100259002.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-cuello-halter-rayas-c29016p100036161.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-capa-jacquard-tri%C3%A1ngulo-c29016p100255506.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-marinero-escote-bardot-c29016p100259003.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-rayas-escote-espalda-c29016p100262007.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cruzado-c29016p100216013.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-flores-canes%C3%BA-bordado-c29016p100203011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-bordados-c29016p100037160.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-flores-volante-c29016p100216014.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-lencero-c29016p100104515.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuadros-detalle-encaje-c29016p100216016.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-drapeado-abertura-bajo-c29016p100129011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-drapeado-abertura-bajo-c29016p100129011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-vaquero-bolsillo-plastr%C3%B3n-c29016p100036822.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-rayas-bajo-desigual-c29016p100123010.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-camisero-vaquero-c29016p100036575.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-midi-estampado-rayas-c29016p100189011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-midi-rayas-manga-3-4-c29016p100149507.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-midi-canal%C3%A9-ajustado-c29016p100149508.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-estampado-bolsillos-c29016p100212503.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-corte-evas%C3%A9-bolsillos-c29016p100189012.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-vaquero-camisero-cuadros-c29016p100036624.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/pichi-vaquero-c29016p100073526.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-estampado-geom%C3%A9trico-cuello-halter-c29016p100037021.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-jacquard-evas%C3%A9-c29016p100037207.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-estampado-flores-manga-3-4-c29016p100036932.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-estampado-flores-manga-3-4-c29016p100037280.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-parche-c29016p100037464.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-liso-manga-3-4-c29016p100036930.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-liso-manga-3-4-c29016p100036930.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-alto-liso-c29016p100037156.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-alto-estampado-flores-c29016p100036921.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-alto-estampado-corbatero-c29016p100037155.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-manga-sisa-c29016p100170011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-manga-sisa-rayas-c29016p100170012.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-manga-acampanada-c29016p100149506.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-punto-espalda-abierta-c29016p100195504.html']
If you want to write the source:
with open("out.html", "w") as f:
f.write(dr.page_source)
I think you can use http://ghost-py.readthedocs.org/en/latest/ for this case. It's loads web page like real browser and run JavaScript.
Also you can try PhantomJS for example, but it written on nodeJS.