Scraping a site using Selenium and BeautifulSoup - javascript

So I'm trying to scrape a site that loads something dynamically with JS. My goal is to build a quick Python script to load a site, check whether a certain word is there, and email me if it is.
I'm relatively new to coding, so if there's a better way, I'd be happy to hear it.
I'm currently loading the page with Selenium and then trying to scrape the generated page with BeautifulSoup, and that's where I'm having the issue: how do I get BeautifulSoup to scrape the site I just opened in Selenium?
from __future__ import print_function
from bs4 import BeautifulSoup
from selenium import webdriver
import requests
import urllib, urllib2
import time
url = 'http://www.somesite.com/'
path_to_chromedriver = '/Users/admin/Downloads/chromedriver'
browser = webdriver.Chrome(executable_path = path_to_chromedriver)
site = browser.get(url)
html = urllib.urlopen(site).read()
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())
I get an error that says:
Traceback (most recent call last):
  File "probation color.py", line 16, in <module>
    html = urllib.urlopen(site).read()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 185, in open
    fullurl = unwrap(toBytes(fullurl))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1075, in unwrap
    url = url.strip()
AttributeError: 'NoneType' object has no attribute 'strip'
which I don't really understand, nor why it's happening. Is it something internal to urllib? How do I fix it? I think solving that will fix my problem.

The HTML can be found using the "page_source" attribute on the browser. (browser.get(url) just navigates to the page and returns None, which is why passing its result to urllib.urlopen() raised that AttributeError.) This should work:
browser = webdriver.Chrome(executable_path = path_to_chromedriver)
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())
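Since the end goal is to check the page for a certain word and send an email, here is a minimal sketch of that last step using the standard library's smtplib; the SMTP host, addresses, and credentials below are placeholders you would substitute with your own:
import smtplib
from email.mime.text import MIMEText
def notify_if_found(page_text, word):
    # Email me if `word` appears in the scraped text.
    if word not in page_text:
        return
    msg = MIMEText("Found '%s' on the page." % word)
    msg["Subject"] = "Word found"
    msg["From"] = "me@example.com"                  # placeholder address
    msg["To"] = "me@example.com"                    # placeholder address
    server = smtplib.SMTP("smtp.example.com", 587)  # placeholder SMTP host
    server.starttls()
    server.login("me@example.com", "app-password")  # placeholder credentials
    server.sendmail(msg["From"], [msg["To"]], msg.as_string())
    server.quit()
notify_if_found(soup.get_text(), "certain word")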

from __future__ import print_function
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'http://www.somesite.com/'
path_to_chromedriver = '/Users/admin/Downloads/chromedriver'
browser = webdriver.Chrome(executable_path=path_to_chromedriver)
browser.get(url)  # get() only navigates and returns None, so don't assign its result
html = browser.page_source  # this is what you should have used...
#html = urllib.urlopen(site).read()  # ...and this was the mistake
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())
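A side note: in Selenium 4 the executable_path argument shown above is deprecated; the driver path goes into a Service object instead (and recent versions can often locate the driver for you, so the path may be optional):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
service = Service("/Users/admin/Downloads/chromedriver")  # path from the question
browser = webdriver.Chrome(service=service)
browser.get(url)
html = browser.page_source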

Related

How do I extract the data from these JavaScript tables using Selenium and Python?

I am very new to Python, JavaScript, and web scraping. I am trying to write code that writes all of the data in tables like this into a CSV file. The webpage is "https://www.mcmaster.com/cam-lock-fittings/material~aluminum/"
I started by trying to find the data in the HTML but then realized that the website uses JavaScript. I then tried using Selenium, but I cannot find anywhere in the JavaScript code the actual data that is displayed in these tables. I wrote this code to see if I could find the displayed data anywhere, but I was unable to find it.
from urllib.request import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://www.mcmaster.com/cam-lock-fittings/material~aluminum/'
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path='C:/Users/Brian Knoll/Desktop/chromedriver.exe', options=options)
driver.get(url)
html = driver.execute_script("return document.documentElement.outerHTML")
driver.close()
filename = "McMaster Text.txt"
fo = open(filename, "w")
fo.write(html)
fo.close()
I'm sure there's an obvious answer that is just going over my head. Any help would be greatly appreciated! Thank you!
I guess you need to wait till the table you're looking for is loaded.
To do so, add the following line to wait up to 10 seconds before you start scraping the data:
fullLoad = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'ItmTblCntnr')]")))
Here is the full code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import os
url = 'https://www.mcmaster.com/cam-lock-fittings/material~aluminum/'
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path=os.path.abspath("chromedriver"), options=options)
driver.get(url)
fullLoad = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'ItmTblCntnr')]")))
html = driver.execute_script("return document.documentElement.outerHTML")
driver.close()
filename = "McMaster Text.txt"
with open(filename, "w") as fo:  # context manager closes the file for us
    fo.write(html)
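Since the question asks for a CSV rather than a raw HTML dump, one option is to hand the rendered source to pandas read_html, which parses every <table> it finds. This is just a sketch (pandas assumed installed, and the right table index may need picking out for this page):
import pandas as pd
tables = pd.read_html(html)  # one DataFrame per <table> found in the page
for i, table in enumerate(tables):
    table.to_csv("mcmaster_table_%d.csv" % i, index=False)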

Selenium Python - Get Table Data Instead of JavaScript Code

I need some help with a data scraping task on this page: https://soilhealth.dac.gov.in/NewHomePage/NutriPage
I managed to fill the dropdown menus and to click on View using this code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
url = "https://soilhealth.dac.gov.in/NewHomePage/NutriPage"
driver = webdriver.Chrome(executable_path='./chromedriver.exe')
driver.get(url)
select = Select(driver.find_element_by_id('NutriCatId'))
select.select_by_visible_text('Sample Wise')
select = Select(driver.find_element_by_id('CycleId'))
select.select_by_visible_text('All Cycle')
select = Select(driver.find_element_by_id('State_Code'))
select.select_by_visible_text('Andaman And Nicobar Islands')
driver.implicitly_wait(5)
select = Select(driver.find_element_by_id('District_Code'))
select.select_by_visible_text('Nicobars')
driver.find_element_by_id('s').click()
driver.implicitly_wait(30)
soup_level1=BeautifulSoup(driver.page_source, 'lxml')
I need to scrape the table data from the source code, but instead of having it in soup_level1, I only get the JavaScript code. Any help on whether scraping the data with Selenium is possible, and how I can do it, would be awesome.
Thank you for your help.
Hey, the code below does the job, but it is slow since the table is huge and takes a while to parse. I noticed the report has an export option available, so do try to download it directly using Selenium. As for the explanation: the report is generated as an iframe, which is separate from the default source of the page, so you need to switch to that frame to get the info. Do let me know if you need any clarification. The required data ends up in the df variable.
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 24 11:08:32 2020
@author: prakh
"""
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import pandas as pd
import time
url = "https://soilhealth.dac.gov.in/NewHomePage/NutriPage"
driver = webdriver.Chrome(executable_path='C:/Users/prakh/Documents/PythonScripts/chromedriver.exe')
driver.get(url)
select = Select(driver.find_element_by_id('NutriCatId'))
select.select_by_visible_text('Sample Wise')
select = Select(driver.find_element_by_id('CycleId'))
select.select_by_visible_text('All Cycle')
select = Select(driver.find_element_by_id('State_Code'))
select.select_by_visible_text('Andaman And Nicobar Islands')
driver.implicitly_wait(5)
select = Select(driver.find_element_by_id('District_Code'))
select.select_by_visible_text('Nicobars')
driver.find_element_by_id('s').click()
driver.implicitly_wait(30)
#soup_level1=BeautifulSoup(driver.page_source, 'lxml')
#src = driver.find_element_by_xpath('//*[@id="report"]/iframe').get_attribute("src")
driver.switch_to.frame(driver.find_element_by_xpath('//*[@id="report"]/iframe'))
time.sleep(10)
html = driver.page_source
df_list = pd.read_html(html)
df = df_list[-3]
driver.quit()
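To persist the scraped table, df can then be written straight to CSV; the filename here is just an example:
df.to_csv("nicobars_sample_wise.csv", index=False)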

How do I scrape data from JavaScript website?

I am trying to scrape data from this dynamic JavaScript website. Since the page is dynamic, I am using Selenium to extract the data from the table. Please suggest how to scrape the data from the dynamic table. Here is my code.
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
import lxml.html as LH
import requests
# specify the url
urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
print(urlpage)
# run firefox webdriver from executable path of your choice
driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
##driver = webdriver.Firefox(executable_path = 'C:/Users/Shresth Suman/Downloads/geckodriver-v0.26.0-win64/geckodriver.exe')
# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 5s
time.sleep(5)
# driver.quit()
# find elements by xpath
##results = driver.find_elements_by_xpath("//div[@id='div_taboa']//table[@id='taboa']/tbody")
##results = driver.find_elements_by_xpath("//*[@id='page-title']")
##results = driver.find_elements_by_xpath("//*[@id='div_main']/h2[1]")
results = driver.find_elements_by_xpath("//*[@id='frame_historicos']")
print(results)
print(len(results))
# create empty array to store data
data = []
# loop over results
for result in results:
    heading = result.text
    print(heading)
    headingfind = result.find_element_by_tag_name('h1')
    # append dict to array
    data.append({"head": headingfind, "name": heading})
# close driver
driver.quit()
###################################################################
# save to pandas dataframe
df = pd.DataFrame(data)
print(df)
# write to csv
df.to_csv('testsot.csv')
I want to extract data from 2005 till the present with Averages/Totals of 10 min, but the form only gives me data for one month at a time.
Induce WebDriverWait and element_to_be_clickable().
Install the Beautiful Soup library.
Use pandas read_html().
I haven't created the lists; you should create start-date and end-date lists and iterate over all the months since 1/1/2005 (see the sketch after the full code below).
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
from bs4 import BeautifulSoup
import time
urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
driver.get(urlpage)
WebDriverWait(driver,20).until(EC.frame_to_be_available_and_switch_to_it((By.ID,"frame_historicos")))
inputstartdate = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[1]")))
inputstartdate.clear()
inputstartdate.send_keys("1/1/2005")
inputenddate = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[last()]")))
inputenddate.clear()
inputenddate.send_keys("1/31/2005")
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//input[@class='form-submit'][@value='REFRESH']"))).click()
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#taboa")))
time.sleep(3)
soup = BeautifulSoup(driver.page_source, "html.parser")
table = soup.find("table", id="taboa")
df = pd.read_html(str(table))[0]  # read_html returns a list of DataFrames
df.to_csv('testsot.csv')
print(df)
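As suggested above, to cover 2005 to the present you could generate one (start, end) pair per month and repeat the fill-and-refresh steps for each. Here is a sketch of just the date generation, using pandas (the M/D/YYYY strings match the format the form inputs above expect):
import pandas as pd
starts = pd.date_range("2005-01-01", pd.Timestamp.today(), freq="MS")  # month starts
for start in starts:
    end = start + pd.offsets.MonthEnd(0)  # last day of the same month
    start_str = "%d/%d/%d" % (start.month, start.day, start.year)  # e.g. 1/1/2005
    end_str = "%d/%d/%d" % (end.month, end.day, end.year)          # e.g. 1/31/2005
    print(start_str, end_str)  # feed these into inputstartdate / inputenddate above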

getting dynamic data using python

I'm new to Python and got interested in writing scripts. I'm currently building a crawler that goes to a page and extracts copy from tags. Right now I can only list tags; I'm having trouble getting the text out of them, and I'm not sure why exactly. I'm also using BeautifulSoup and PyQt4 to get dynamic data (this might need a new question).
So based on the code below, I should be getting the "Images" copy from the Google homepage, or at least the span tag itself, but I'm getting None returned.
I tried reading the docs for BeautifulSoup and they were a little overwhelming. I'm still reading them, but I think I keep going down a rabbit hole. I can print all anchor tags or all divs, but targeting a specific one is where I'm struggling.
import urllib
import re
from bs4 import BeautifulSoup, Comment
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()
    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()
url = 'http://google.com'
source = urllib.urlopen(url).read()
soup = BeautifulSoup(source, 'html.parser')
js_test = soup.find("a", class_="gb_P")
print js_test
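For what it's worth, the Render class above is defined but never used: the urllib fetch that follows it downloads the raw, pre-JavaScript HTML, which is likely why find() returns None. Presumably the intent was something like this (a sketch, assuming Python 2 with PyQt4 as in the question; the gb_P class name may differ between Google page builds):
r = Render(url)
rendered = unicode(r.frame.toHtml())       # the DOM after JavaScript has run
soup = BeautifulSoup(rendered, 'html.parser')
js_test = soup.find("a", class_="gb_P")
print js_test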

How to retrieve the exact HTML as in a browser

I'm using a Python script to render web pages and retrieve their HTML. It works fine with most pages, but with some of them the HTML retrieved is incomplete, and I don't quite understand why. This is the script I'm using to scrape this page; for some reason, the link to every product is not in the HTML:
Link: http://www.pullandbear.com/es/es/mujer/vestidos-c29016.html
Python script:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from PyQt4 import QtNetwork
from PyQt4 import QtCore
url = sys.argv[1]
path = sys.argv[2]
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.request = QtNetwork.QNetworkRequest()
        self.request.setUrl(QtCore.QUrl(url))
        self.request.setRawHeader("Accept-Language", QtCore.QByteArray("es ,*"))
        self.mainFrame().load(self.request)
        self.app.exec_()
    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()
r = Render(url)
result = r.frame.toHtml()
html_file = open(path, "w")
html_file.write("%s" % result.encode("utf-8"))
html_file.close()
sys.exit(app.exec_())
This code was taken from here: https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/
Am I missing something? What are the limitations of this framework?
Thanks in advance,
If you want headless browsing you can combine PhantomJS with Selenium; the following gets all the source:
url = "http://www.pullandbear.com/es/es/mujer/vestidos-c29016.html"
from selenium import webdriver
dr = webdriver.PhantomJS()
dr.get(url)
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element = WebDriverWait(dr, 5).until(
    EC.presence_of_element_located((By.CLASS_NAME, "grid_itemContainer"))
)
Just using Selenium without the WebDriverWait did not always return the full source. Adding a wait until the anchor tags with the grid_itemContainer class are present makes sure the HTML has been generated. The XPath below returns all your links:
print([a.get_attribute('href') for a in dr.find_elements_by_xpath("//a[@class='grid_itemContainer']")])
[u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-detalle-crochet-pechera-c29016p100064004.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-bordado-escote-pico-c29016p100123006.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-manga-larga-espalda-abierta-c29016p100147503.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-hombros-descubiertos-beads-c29016p100182001.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-jacquard-capa-c29016p100255505.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-vaquero-eyelets-c29016p100336010.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-liso-oversized-c29016p100289013.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-liso-oversized-c29016p100289013.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-camisero-oversized-c29016p100036616.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-pico-c29016p100166506.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-estampado-rayas-c29016p100234507.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-manga-corta-liso-c29016p100262008.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-cuello-halter-liso-c29016p100036162.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-capa-jacquard-%C3%A9tnico-c29016p100259002.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-cuello-halter-rayas-c29016p100036161.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-capa-jacquard-tri%C3%A1ngulo-c29016p100255506.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-marinero-escote-bardot-c29016p100259003.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-rayas-escote-espalda-c29016p100262007.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cruzado-c29016p100216013.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-flores-canes%C3%BA-bordado-c29016p100203011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-bordados-c29016p100037160.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-flores-volante-c29016p100216014.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-lencero-c29016p100104515.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuadros-detalle-encaje-c29016p100216016.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-drapeado-abertura-bajo-c29016p100129011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-drapeado-abertura-bajo-c29016p100129011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-vaquero-bolsillo-plastr%C3%B3n-c29016p100036822.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-rayas-bajo-desigual-c29016p100123010.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-camisero-vaquero-c29016p100036575.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-midi-estampado-rayas-c29016p100189011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-midi-rayas-manga-3-4-c29016p100149507.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-midi-canal%C3%A9-ajustado-c29016p100149508.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-estampado-bolsillos-c29016p100212503.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-corte-evas%C3%A9-bolsillos-c29016p100189012.html', 
u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-vaquero-camisero-cuadros-c29016p100036624.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/pichi-vaquero-c29016p100073526.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-estampado-geom%C3%A9trico-cuello-halter-c29016p100037021.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-jacquard-evas%C3%A9-c29016p100037207.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-estampado-flores-manga-3-4-c29016p100036932.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-estampado-flores-manga-3-4-c29016p100037280.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-parche-c29016p100037464.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-liso-manga-3-4-c29016p100036930.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-liso-manga-3-4-c29016p100036930.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-alto-liso-c29016p100037156.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-alto-estampado-flores-c29016p100036921.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-alto-estampado-corbatero-c29016p100037155.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-manga-sisa-c29016p100170011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-manga-sisa-rayas-c29016p100170012.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-manga-acampanada-c29016p100149506.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-punto-espalda-abierta-c29016p100195504.html']
If you want to write the source:
with open("out.html", "w") as f:
f.write(dr.page_source)
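As an aside, PhantomJS has since been deprecated and removed from newer Selenium releases; headless Chrome is the usual replacement, and the WebDriverWait logic above carries over unchanged:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("--headless")
dr = webdriver.Chrome(options=options)  # drop-in replacement for webdriver.PhantomJS()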
I think you can use http://ghost-py.readthedocs.org/en/latest/ for this case. It loads web pages like a real browser and runs JavaScript. You could also try PhantomJS, though you drive it with JavaScript rather than Python.
