Web scraping in Python with Selenium - can't find button - javascript

So, I'm trying to access some data from this webpage: http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm . I'm trying to click on the button named "Setor de atuação" with Selenium. The problem is that the requests lib returns me a different HTML from the one I see when I inspect the page. I already tried sending a header with my request, but it wasn't the solution. Even when I print the content of
browser.page_source
I still get an incomplete part of the page that I want. While trying to solve the problem, I've seen that two requests are POSTed when the site initializes (screenshot not included here).
Well, I'm not sure what to do now. If anyone can help me, send me a tutorial, or explain what is happening, I would be really glad. Thanks in advance. I've only done simple web scraping before, so I'm not sure how to proceed; I've also checked other questions on the forums and none seem to be similar to my problem.
import bs4 as bs   # imported for parsing later (not used yet)
import requests    # plain requests returns different HTML (see above)
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')  # private mode
#options.add_argument('--headless')  # doesn't open a visible window
browser = webdriver.Chrome('/home/itamar/Desktop/chromedriver', chrome_options=options)
site = 'http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm'
browser.get(site)
That's my code so far. I'm having trouble finding and clicking the button element "Setor de Atuação". I've tried through XPath, class, and id, but nothing seems to work.

The button you're after is inside an iframe; in this case you'll have to use the switch_to function of your Selenium driver to move the driver into the iframe's DOM, and only then can you look for the button. I've played with the page you provided and it worked, using Selenium alone, no need for Beautiful Soup. This is my code:
from selenium import webdriver
import time

class B3:
    def __init__(self):
        self.bot = webdriver.Firefox()

    def start(self):
        bot = self.bot
        bot.get('http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm')
        time.sleep(2)
        iframe = bot.find_element_by_xpath('//iframe[@id="bvmf_iframe"]')
        bot.switch_to.frame(iframe)
        bot.implicitly_wait(30)
        tab = bot.find_element_by_xpath('//a[@id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]')
        time.sleep(3)
        tab.click()
        time.sleep(2)

if __name__ == "__main__":
    worker = B3()
    worker.start()
Hope it suits you well!
refs:
https://www.techbeamers.com/switch-between-iframes-selenium-python/
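As a side note, the fixed time.sleep() calls can be swapped for explicit waits. A minimal sketch using the same IDs as above (bvmf_iframe and the tabSetor anchor), untested against the live page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

bot = webdriver.Firefox()
bot.get('http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm')
wait = WebDriverWait(bot, 30)
# Wait until the iframe exists, then switch into its DOM.
iframe = wait.until(EC.presence_of_element_located((By.ID, 'bvmf_iframe')))
bot.switch_to.frame(iframe)
# Wait until the tab is actually clickable before clicking it.
tab = wait.until(EC.element_to_be_clickable(
    (By.ID, 'ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor')))
tab.click()
This way the script proceeds as soon as each element is ready instead of sleeping fixed amounts of time.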

In this case I suggest you work only with Selenium, because the page depends on JavaScript processing.
You can inspect the elements and use XPath to select them.
XPath:
//*[#id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]/span/span
So your code will look like this (note that, as in the answer above, you first need to switch into the iframe):
elementSelect = driver.find_elements_by_xpath('//*[@id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]/span/span')
elementSelect[0].click()
time.sleep(5) # Wait for the page to load.
PS: I recommend you search for an API service for B3. I found this link, but I didn't read it. Maybe they already make this kind of data available.
About XPath: https://www.guru99.com/xpath-selenium.html

I can't understand the problem; if you can show a code snippet it would be better.
And I suggest using BeautifulSoup for web scraping.

Related

BeautifulSoup & requests_html unable to find element

I need to scrape information from this page: https://professionals.tarkett.co.uk/en_GB/collection-C001030-arcade/arcade-b023-2128.
There are 6000+ of those pages I need to scrape. I really don't want to use Selenium as it is too slow for this type of job.
The information I am trying to scrape is in the 'Document' section down at the bottom of the page, just above the 'Case studies' and 'About' sections. There is a PDF datasheet and several other key bits of information in that area I need to scrape, but I am finding it absolutely impossible.
I have tried everything at this point: requests, dryscrape, requests_html etc., and nothing works.
It seems the information I need is being rendered by JavaScript. I have tried using libraries that supposedly work for these types of issues, but in my case it's not working.
Here's a snippet of code to show:
from requests_html import HTMLSession
from bs4 import BeautifulSoup as bs

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; tried with and without headers
session = HTMLSession()
resp = session.get("https://professionals.tarkett.co.uk/en_GB/collection-C001030-arcade/arcade-b023-2128", headers=headers)
resp.html.render()  # execute the page's JavaScript via Chromium
soup = bs(resp.html.html, "html.parser")
soup.find("section", {"id" : "collection-documentation"})
Output:
<section data-v-6f916884="" data-v-aed8933c="" id="collection-documentation"><!-- --></section>
No matter what I try, the information just isn't there. This is one element in particular that I am trying to get:
<a data-v-5a3c0164="" href="https://media.tarkett-image.com/docs/DS_INT_Arcade.pdf" target="_blank" rel="noopener noreferrer" class="basic-clickable tksb-secondary-link-with-large-icon" is-white-icon="true"><svg xmlns="http://www.w3.org/2000/svg" width="47" height="47" viewBox="0 0 47 47" class="tksb-secondary-link-with-large-icon__icon"><g transform="translate(-1180 -560)"><path d="M1203.5,560a23.5,23.5,0,1,1-23.5,23.5A23.473,23.473,0,0,1,1203.5,560Z" class="download-icon__background"></path></g> <g><path d="M29.5,22.2l-5.1,5.5V10.3H22.6V27.7l-5.1-5.5-1.4,1.2,7.4,7.9,7.4-7.9Z" class="download-icon__arrow-fill"></path> <g><path d="M31.6,37.6H15.4V31.3h1.8v4.5H29.8V31.3h1.8Z" class="download-icon__arrow-fill"></path></g></g></svg> <span class="tksb-secondary-link-with-large-icon__text-container"><span class="tksb-secondary-link-with-large-icon__label">Datasheet</span> <span class="tksb-secondary-link-with-large-icon__description">PDF</span></span></a>
The best I've come up with so far is this URL, found using the Chrome DevTools Network tab to see what happens when I scroll the data I want into view: https://professionals.tarkett.co.uk/en_GB/collection-product-formats-json/fb02/C001030/b023-2128?fields[]=sku_design&fields[]=sku_design_key&fields[]=sku_thumbnail&fields[]=sku_hex_color_code&fields[]=sku_color_family&fields[]=sku_delivery_sla&fields[]=sku_is_new&fields[]=sku_part_number&fields[]=sku_sap_number&fields[]=sku_id&fields[]=sku_format_type&fields[]=sku_format_shape&fields[]=sku_format&fields[]=sku_backing&fields[]=sku_items_per_box&fields[]=sku_surface_per_box&fields[]=sku_box_per_pallet&fields[]=sku_packing_unit_code&fields[]=sku_collection_names&fields[]=sku_category_b2b_names&fields[]=sku_sap_sales_org&fields[]=sku_minimum_order_qty&fields[]=sku_base_unit_sap&fields[]=sku_selling_units&fields[]=sku_pim_prices&fields[]=sku_retailers_prices&fields[]=sku_retailers_prices_unit&fields[]=sku_installation_method.
Now I could easily scrape the information I want from here (which at this rate I will probably have to), as it does have key information I need that isn't being loaded from the HTML. All I'd have to do is extract each product's ID code and modify the URL accordingly. But even then, the ONLY bit of information this still doesn't have is the datasheet URL. I thought I had figured it all out when I discovered this, but no, this still leaves me stuck in the mud, and I am sinking fast trying to find any solution other than Selenium for extracting this one bit of info in a terminal using requests and similar libraries. I'm implementing threading as well, which is why it's really important for me to be able to do this without loading up a browser the way Selenium does.
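Something like the sketch below is what I have in mind for that fallback (the product code b023-2128 and the field names are copied from the URL above; that the endpoint returns plain JSON, and its exact shape, are things I'd still have to verify):
import requests

base = "https://professionals.tarkett.co.uk/en_GB/collection-product-formats-json/fb02/C001030/b023-2128"
fields = ["sku_design", "sku_part_number", "sku_format"]  # subset of the fields[] list above
resp = requests.get(base, params={"fields[]": fields}, timeout=30)
resp.raise_for_status()
data = resp.json()  # assumed JSON; inspect the real response shape first
print(data)
requests encodes the list as repeated fields[]= parameters, matching the URL captured from the Network tab.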
Is it even possible at this point? I'd really appreciate someone who actually knows what they're talking about, unlike me, taking a look at the page and telling me what I'm missing or pointing me in the right direction. I need this finished by today, I've been pressed on this for 2 days now, and I am starting to give up.

using python's requests-html to bypass adblock by executing js

I was aiming to use requests-html to get text loaded by JS. The static HTML displays a message requesting that AdBlock be turned off. I'm using the code below, which is supposed to execute the JS, but the HTML I get is still the static one. Could anyone offer advice on how to resolve this, or at least explain why I'm not getting the dynamic content, and why the AdBlock message still turns up?
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://wyborcza.pl/7,173236,25975806,free-khaled-drareni.html')
r.html.render()
print(r.html.html)
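One thing worth trying, as a sketch reusing the r from the snippet above (no guarantee it gets past the AdBlock message): render() accepts sleep and timeout arguments, which give the page's JavaScript more time to finish before the HTML is captured.
r.html.render(sleep=3, timeout=20)  # pause 3 s after rendering, allow up to 20 s overall
print(r.html.html)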

PhantomJS not retrieving correct data

I am trying to scrape a web page which has JavaScript in it using PhantomJS. I found an element for the button, and when I click it, it should render the next link. But I am not getting the exact output that I want; instead, I am getting a different output which is not required.
The code is:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; headers was defined elsewhere
s = requests.session()
fg = s.get('https://in.bookmyshow.com/booktickets/INCM/32076', headers=headers)
so = BeautifulSoup(fg.text, "html.parser")
texts = so.findAll("div", {"class": "__buytickets"})
print(texts[0].a['href'])
print(fg.url)
driver = webdriver.PhantomJS()
driver.get(movie_links[0])  # movie_links is built elsewhere in the script
element = driver.find_element_by_class_name('__buytickets')
element.click()
print(driver.current_url)
I am getting the output as:
javascript:;
https://in.bookmyshow.com/booktickets/INCM/32076
https://in.bookmyshow.com/booktickets/INVB/47680
What I have to get is:
javascript:;
https://in.bookmyshow.com/booktickets/INCM/32076
https://in.bookmyshow.com/booktickets/INCM/32076#Seatlayout
Actually, the link which I have to get is generated by the JavaScript of the previous link. How do I get this link (the seat-layout link)? Please help! Thanks in advance.
In my experience PhantomJS doesn't work well; Chrome and Firefox are better. Vitaly Slobodin (https://github.com/Vitallium) said he will no longer develop PhantomJS.
Use headless Chrome or Firefox.
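A minimal sketch of the same flow with headless Chrome (assuming chromedriver is on your PATH, and reusing the URL and class name from the question):
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # no visible browser window
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://in.bookmyshow.com/booktickets/INCM/32076')
driver.find_element_by_class_name('__buytickets').click()
print(driver.current_url)  # should now reflect the JS-generated link
driver.quit()
The chrome_options keyword matches the Selenium versions used elsewhere in this thread; newer releases take options= instead.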

Selenium Python: Cannot find element after javascript runs

I am trying to automate some SAP job monitoring with Python. I want to create a script that does the following:
Connect and log in to the SAP environment -> Open the SM37 transaction -> Send job parameters (name-user-from-to) -> Read the output and store it in a database.
I don't know of any module or library that would let me do that directly, so I checked that the WebGUI is already enabled. I am able to open the environment through a browser, so a browsing module should allow me to do everything I need.
I tried Mechanize and RoboBrowser. They work, but the WebGUI runs a lot of JavaScript for rendering, and those modules don't handle JavaScript.
There is one more shot: Selenium.
I was able to connect and log in to the environment. But when trying to select an element from the new page (the main menu), Selenium cannot locate the element.
Printing the source code, I realized that the main menu page is rendered with JavaScript. The source code doesn't contain the element at all, only the title ("Welcome "). That means the login was successful.
I read a lot of posts asking about this, and everybody recommends using WebDriverWait with some explicit conditions.
I tried this, but it didn't work:
driver.get("http://mysapserver.domain:8000/sap/bc/gui/sap/its/webgui?sap-client=300&sap-language=ES")
wait = WebDriverWait(driver, 30)
element = wait.until(EC.presence_of_element_located((By.ID, 'ToolbarOkCode')))
EDIT:
There are two source codes: SC-1 is the one that Selenium reads; SC-2 is the one that appears once the JavaScript renders the site (the one from "Inspect Element").
The full SC-1 is this:
https://pastebin.com/5xURA0Dc
The SC-2 for the element itself is the following:
<input id="ToolbarOkCode" ct="I" lsdata="{0:'ToolbarOkCode',1:'Comando',4:200,13:'150px',23:true}" lsevents="{Change:[{ClientAction:'none'},{type:'TOOLBARINPUTFIELD'}],Enter:[{ClientAction:'submit',PrepareScript:'return\x20its.XControlSubmit\x28\x29\x3b',ResponseData:'delta',TransportMethod:'partial'},{Submit:'X',type:'TOOLBARINPUTFIELD'}]}" type="text" maxlength="200" tabindex="0" ti="0" title="Comando" class="urEdf2TxtRadius urEdf2TxtEnbl urEdfVAlign" value="" autocomplete="on" autocorrect="off" name="ToolbarOkCode" style="width:150px;">
Still can't locate the element. How can I solve it?
Thanks in advance.
The solution was to go into the iframe that contains the rendered HTML (with the control):
driver2.get("http://mysapserver.domain:8000/sap/bc/gui/sap/its/webgui?sap-client=300&sap-language=ES")
iframe = driver2.find_elements_by_tag_name('iframe')[0]
driver2.switch_to_default_content()
driver2.switch_to_frame(iframe)
driver2.find_element_by_id("ToolbarOkCode").send_keys("SM37")
driver2.find_element_by_id("ToolbarOkCode").send_keys(Keys.ENTER)
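To make this less timing-sensitive, the frame switch can be wrapped in an explicit wait; a sketch under the same assumptions (first iframe on the page, same control ID):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver2, 30)
# Waits for the iframe to appear, then switches into it in one step.
wait.until(EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME, 'iframe')))
ok_code = wait.until(EC.presence_of_element_located((By.ID, 'ToolbarOkCode')))
ok_code.send_keys('SM37')
frame_to_be_available_and_switch_to_it waits for the frame and performs the switch in one step, which avoids racing the JavaScript that builds the page.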

Cannot select element from a javascript webpage using Python scrapy and Selenium

I'm attempting to create an app that gathers some data. I'm using Python 2.7 with Scrapy and Selenium on Windows 10. I've done this with only a few web pages previously; however, I'm not able to select or click on a button on the following website.
https://aca3.accela.com/Atlanta_Ga/Default.aspx
I'm not able to click on the button labeled "Search Permits/Complaints".
I've used the Chrome dev tools in order to inspect the XPaths, etc.
Here's the code that I'm using:
import time

import scrapy
from selenium import webdriver

class PermitsSpider(scrapy.Spider):
    name = "atlanta"
    url = "https://aca3.accela.com/Atlanta_Ga/Default.aspx"
    start_urls = ['https://aca3.accela.com/Atlanta_Ga/Default.aspx',]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.driver.implicitly_wait(20)

    def parse(self, response):
        self.driver.get(self.url)
        time.sleep(15)
        search_button = self.driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_TabDataList_TabsDataList_ctl01_LinksDataList_ctl00_LinkItemUrl"]')
        search_button.click()
When running that code, I get the following error:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="ctl00_PlaceHolderMain_TabDataList_TabsDataList_ctl01_LinksDataList_ctl00_LinkItemUrl"]"}
I've added all manner of sleeps and waits, etc., to ensure that the page is fully loaded before attempting the selection. I've also attempted to select by link text and other methods of selecting the element. Not sure why this method isn't working on this page when it works for me on others. While running the code, the WebDriver does open the page on my screen, and I can see that the page has loaded, etc.
Any help would be appreciated. Thank you...
The target link is located inside an iframe, so you have to switch to that frame to be able to handle embedded elements:
self.driver.switch_to_frame('ACAFrame')
search_button = self.driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_TabDataList_TabsDataList_ctl01_LinksDataList_ctl00_LinkItemUrl"]')
You also might need to switch back with driver.switch_to_default_content()
