I would like to scrape a website for its "raw" JavaScript code. For example, if I were to scrape this website, I would get a string containing:
This is just a small portion of the existing JS in the given link, but I would like to obtain the entire JS in a string or array of strings.
I have tried different approaches to obtain this data: using requests and selenium.
Simply loading the HTML of the website doesn't seem to work, as the script tags don't seem to load.
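To show what I mean by "raw" scripts: for purely static HTML, something like the following standard-library sketch would already collect the inline script bodies (the HTML string here is a made-up stand-in, and this of course misses anything the browser injects later):

```python
from html.parser import HTMLParser

class ScriptCollector(HTMLParser):
    """Collect the text content of every inline <script> tag."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")  # start a new (possibly empty) script body

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts[-1] += data

# Stand-in HTML; in practice this would come from an HTTP response body.
html = '<html><head><script>var a = 1;</script></head>' \
       '<body><script src="app.js"></script></body></html>'
collector = ScriptCollector()
collector.feed(html)
print(collector.scripts)  # ['var a = 1;', ''] (external scripts have empty bodies)
```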
Using selenium, I hoped this would work:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = "https://www.udemy.com"
driver = webdriver.Chrome()
driver.get(url)
wait = WebDriverWait(driver, 10)
results = wait.until(EC.visibility_of_all_elements_located((By.TAG_NAME, "script")))
print(results)
Then using results I could get a string, but it doesn't work.
Another example of the JS script chunks I'd like to get:
The red rectangle indicates the JS scripts; as you can see there is a lot of it, and I would like to get it in its "raw" form (not execute it).
My question is: how would I get the "raw" JS in string format, and what is the most time-efficient way to do this?
You are looking for .get_attribute('innerHTML'). You also do not want to use visibility_of_all_elements_located, since you are looking for something that will never be visible.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = "https://www.udemy.com"
driver = webdriver.Chrome()
driver.get(url)
#wait = ui.WebDriverWait(driver, 10)
#results = wait.until(EC.visibility_of_all_elements_located((By.TAG_NAME, "script")))
wait = WebDriverWait(driver, 10)
script_tag = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//script")))
innerHTML_of_script_tag = []
for script in script_tag:
    innerHTML_of_script_tag.append(script.get_attribute('innerHTML'))
    print(script.get_attribute('innerHTML'))
    print("################################################################")
print("---------------------------------------------------------------------")
print("---------------------------------------------------------------------")
print(innerHTML_of_script_tag)
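One caveat about the loop above: innerHTML is empty for `<script src=...>` elements, so external files have to be downloaded separately. A hedged sketch of the URL-resolution step (the helper name is made up; note that Selenium's get_attribute('src') usually returns an already-absolute URL, so urljoin mainly matters when you parse raw HTML yourself):

```python
from urllib.parse import urljoin

def resolve_script_urls(page_url, src_values):
    """Turn the src attributes of <script> tags into absolute URLs,
    skipping inline scripts (whose src is None or empty)."""
    return [urljoin(page_url, src) for src in src_values if src]

# One src value per <script> element; None means the script was inline.
srcs = [None, "/static/app.js", "https://cdn.example.com/lib.js"]
print(resolve_script_urls("https://www.udemy.com/", srcs))
# ['https://www.udemy.com/static/app.js', 'https://cdn.example.com/lib.js']
```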
I am working with www.freightquote.com, and at some point I need to sign in, since otherwise it does not let me get freight rates for more than 45 pairs.
I would like to enter the sign-in information for this website, but for some reason it is not working, and I could not figure out the problem.
You can use the sign-in page directly: https://account.chrobinson.com/
I have trouble entering the information I am asked for. Here is what I did:
from selenium import webdriver
from time import sleep
import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.service import Service
PATH = r'C:\Users\b\Desktop\Webscraping\chromedriver.exe'
s = Service(PATH)
driver = webdriver.Chrome(service=s)
link = "https://www.freightquote.com/book/#/free-quote/pickup"
driver.get(link)
sleep(2)
driver.maximize_window()
sleep(2)
driver.find_elements(by=By.XPATH, value='//button[@type="button"]')[0].click()
sleep(3)
#Username:
driver.find_element(by=By.XPATH, value='//input[@type="email"]').send_keys('USERNAME')
driver.find_elements(by=By.XPATH, value='//input[@class="button button-primary" and @type="submit"]')[0].click()
#password
driver.find_element(by=By.XPATH, value='//input[@type="password"]').send_keys('PASSWORD')
driver.find_elements(by=By.XPATH, value='//input[@class="button button-primary" and @type="submit"]')[0].click()
sleep(2)
Your code and your technique have too many problems; you should learn how to code in Selenium properly before writing more of it.
I modified your code up to the point of entering the email; please complete the rest accordingly.
driver = webdriver.Chrome()
link = "https://www.freightquote.com/book/#/free-quote/pickup"
driver.get(link)
driver.maximize_window()
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.XPATH,
        '(//button[@type="button"])[1]'))).click()
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.XPATH,
        '//input[@type="email"]'))).send_keys('USERNAME')
Also, you don't need to add the chromedriver path in your code. If you use Windows or Linux, you should add it to your virtualenv, in the /bin folder,
and if you use a Mac, you should add it to /usr/local/bin.
To enter sign in information for the website you need to induce WebDriverWait for the element_to_be_clickable() and you can use the following locator strategies:
Using CSS_SELECTOR:
driver.get("https://account.chrobinson.com/")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='username']"))).send_keys("Ribella")
driver.find_element(By.CSS_SELECTOR, "input[name='password']").send_keys("Ribella")
driver.find_element(By.CSS_SELECTOR, "input[value='Sign In']").click()
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
I'm trying to automate the bet365 casino; I know they have tools to block bots.
Link: https://casino.bet365.com/Play/LiveRoulette
I can't handle anything that's inside the div class="app-container", at least through Selenium, but I can find these elements using JavaScript in the browser console.
import undetected_chromedriver as UChrome
from webdriver_manager.chrome import ChromeDriverManager
UChrome.install(ChromeDriverManager().install())
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
driver = UChrome.Chrome()
driver.get('https://www.bet365.com/#/HO/')
after login
driver.get('https://casino.bet365.com/Play/LiveRoulette')
locator = (By.XPATH, '//*[contains(@class, "second-dozen")]')
I tried (the selectors probably need to be a little different):
driver.execute_script('return document.getElementsByClassName("roulette-table-cell roulette-table-cell_side-first-dozen roulette-table-cell_group-dozen")[0].getBoundingClientRect()')
I also tried:
driver.find_element(locator[0], locator[1])
but I receive this:
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"(//*[contains(text(), "PAR")])[1]"}
(Session info: chrome=96.0.4664.110)
Stacktrace:
0 0x55f8fa1bcee3
1 0x55f8f9c8a608
2 0x55f8f9cc0aa1
You are probably missing a delay / wait.
Redirecting to the inner page with
driver.get('https://casino.bet365.com/Play/LiveRoulette')
It takes some time to make all the elements loaded there, you can not access elements immediately.
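For intuition: an explicit wait simply polls a condition until it returns something truthy or a timeout expires. A simplified stand-alone sketch of the idea (this is not Selenium's actual implementation):

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout:.1f}s")
        time.sleep(poll)

# Toy condition that only becomes true on the third poll.
calls = {"n": 0}
def ready():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until(ready, timeout=5, poll=0.01))  # True
```

WebDriverWait does essentially this, except the condition is called with the driver, and NoSuchElementException raised between polls is ignored by default.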
The recommended way to do that is to use Expected Conditions explicit waits, something like this:
import undetected_chromedriver as UChrome
from webdriver_manager.chrome import ChromeDriverManager
UChrome.install(ChromeDriverManager().install())
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
driver = UChrome.Chrome()
wait = WebDriverWait(driver, 20)
driver.get('https://www.bet365.com/#/HO/')
#perform the login here
driver.get('https://casino.bet365.com/Play/LiveRoulette')
element = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[contains(@class, "second-dozen")]')))
I see you are also basically missing the driver.find_element method.
This:
(By.XPATH, '//*[contains(@class, "second-dozen")]')
will not return a web element.
Also make sure that element is not inside the iframe.
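A quick way to check the iframe question without clicking around is to scan driver.page_source for iframe tags; a minimal standard-library sketch (the HTML string here stands in for the real page source):

```python
from html.parser import HTMLParser

class IframeFinder(HTMLParser):
    """Record the attributes of every <iframe> in a page."""
    def __init__(self):
        super().__init__()
        self.iframes = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            self.iframes.append(dict(attrs))

# Stand-in for driver.page_source
page_source = '<div class="app-container"><iframe id="game" src="/Play/Frame"></iframe></div>'
finder = IframeFinder()
finder.feed(page_source)
print(finder.iframes)  # [{'id': 'game', 'src': '/Play/Frame'}]
```

If anything shows up here, you have to switch into the right frame (driver.switch_to.frame(...)) before your locator can match.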
I am very new to Python, JavaScript, and Web-Scraping. I am trying to write code that writes all of the data in tables like this into a csv file. The webpage is "https://www.mcmaster.com/cam-lock-fittings/material~aluminum/"
I started by trying to find the data in the html but then realized that the website uses JavaScript. I then tried using selenium but I cannot find anywhere in the JavaScript code that has the actual data that is displayed in these tables. I wrote this code to see if I could find the display data anywhere but I was unable to find it.
from urllib.request import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://www.mcmaster.com/cam-lock-fittings/material~aluminum/'
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path='C:/Users/Brian Knoll/Desktop/chromedriver.exe', options=options)
driver.get(url)
html = driver.execute_script("return document.documentElement.outerHTML")
driver.close()
filename = "McMaster Text.txt"
fo = open(filename, "w")
fo.write(html)
fo.close()
I'm sure there's an obvious answer that is just going over my head. Any help would be greatly appreciated! Thank you!
I guess you need to wait until the table you're looking for is loaded.
To do so, add the following line, which waits up to 10 seconds for the table to be present before scraping the data:
fullLoad = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'ItmTblCntnr')]")))
Here is the full code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import os
url = 'https://www.mcmaster.com/cam-lock-fittings/material~aluminum/'
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path=os.path.abspath("chromedriver"), options=options)
driver.get(url)
fullLoad = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'ItmTblCntnr')]")))
html = driver.execute_script("return document.documentElement.outerHTML")
driver.close()
filename = "McMaster Text.txt"
fo = open(filename, "w")
fo.write(html)
fo.close()
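One caveat with the file-writing part of both snippets: open(filename, "w") uses the platform default encoding, and on Windows that can raise UnicodeEncodeError for pages containing non-ASCII characters (product tables can include symbols like ½ or Ø). A safer sketch with an explicit encoding and a context manager:

```python
# Stand-in for the page source returned by driver.execute_script(...)
html = "<html><body>Ø 1½ in. cam-lock fittings</body></html>"

# Explicit utf-8 avoids UnicodeEncodeError on Windows (default is often cp1252).
with open("McMaster Text.txt", "w", encoding="utf-8") as fo:
    fo.write(html)

# Read it back to verify the round trip.
with open("McMaster Text.txt", encoding="utf-8") as fo:
    print(fo.read() == html)  # True
```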
There is a link
…もっと見る
in an HTML page, and I want to click it. I tried the code below, but it seems the page can't be parsed. How can I invoke the function ga_and_go('//weathernews.jp/s/topics/?fm=onebox', 'topics_more')?
more_button = driver.find_element_by_partial_link_text("…もっと見る")
more_button.click()
As per the HTML the element seems to be JavaScript based so you need to induce WebDriverWait and can use either of the following solutions:
Using PARTIAL_LINK_TEXT:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, "もっと見る"))).click()
Using XPATH:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(.,'もっと見る')]"))).click()
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Can you try using the below xpath to click the element?
//a[contains(@href,'javascript:ga_and_go(')]
more_button = driver.find_element_by_xpath("//a[contains(@href,'javascript:ga_and_go(')]")
more_button.click()
The webpage (see driver.get() below) seems to have one table with class name as table. I can't seem to locate it using the code below.
I was under the impression that I could locate these kinds of JavaScript-rendered elements using Selenium.
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.PhantomJS()
driver.get('http://investsnips.com/list-of-publicly-traded-micro-cap-diversified-biotechnology-and-pharmaceutical-companies/')
content = driver.find_element_by_css_selector('table.table')
x = driver.find_element_by_class_name("table")
I'm getting this error (neither content nor x works):
NoSuchElementException: Message: {"errorMessage":"Unable to find element with class name 'table'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"94","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:49464","User-Agent":"Python http auth"},"httpVersion":"1.1","method":"POST","post":"{\"using\": \"class name\", \"value\": \"table\", \"sessionId\": \"a988f310-65da-11e7-a655-01f6986e9e41\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/a988f310-65da-11e7-a655-01f6986e9e41/element"}}
Screenshot: available via screen
The table is in an iframe. You have to switch to the iframe before finding the table. See the code below.
driver = webdriver.PhantomJS()
driver.get('http://investsnips.com/list-of-publicly-traded-micro-cap-diversified-biotechnology-and-pharmaceutical-companies/')
#Find the iframe tradingview_xxxxx and then switch into the iframe
iframeElement = driver.find_element_by_css_selector('iframe[id*="tradingview_"]')
driver.switch_to.frame(iframeElement)
#Wait for the table
waitForPresence = WebDriverWait(driver, 10)
waitForPresence.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table.table')))
theTable = driver.find_element_by_css_selector('table.table')