I'm attempting to create an app that gathers some data. I'm using Python 2.7 with Scrapy and Selenium on Windows 10. I've done this with a few web pages previously; however, I'm not able to select or click on a button on the following website.
https://aca3.accela.com/Atlanta_Ga/Default.aspx
I'm not able to click on the button labeled "Search Permits/Complaints".
I've used the Chrome dev tools in order to inspect the XPaths, etc.
Here's the code that I'm using:
import time

import scrapy
from selenium import webdriver


class PermitsSpider(scrapy.Spider):
    name = "atlanta"
    url = "https://aca3.accela.com/Atlanta_Ga/Default.aspx"
    start_urls = ['https://aca3.accela.com/Atlanta_Ga/Default.aspx', ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.driver.implicitly_wait(20)

    def parse(self, response):
        self.driver.get(self.url)
        time.sleep(15)
        search_button = self.driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_TabDataList_TabsDataList_ctl01_LinksDataList_ctl00_LinkItemUrl"]')
        search_button.click()
When running that code, I get the following error:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="ctl00_PlaceHolderMain_TabDataList_TabsDataList_ctl01_LinksDataList_ctl00_LinkItemUrl"]"}
I've added all manner of sleeps, waits, etc., to ensure that the page is fully loaded before attempting the selection. I've also attempted to select by link text, and tried other methods for selecting the element. I'm not sure why this method isn't working on this page when it works for me on others. While running the code, the WebDriver does open the page on my screen, and I see the page loaded.
Any help would be appreciated. Thank you...
The target link is located inside an iframe, so you have to switch to that frame before you can interact with the embedded elements:
self.driver.switch_to.frame('ACAFrame')
search_button = self.driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_TabDataList_TabsDataList_ctl01_LinksDataList_ctl00_LinkItemUrl"]')
You might also need to switch back afterwards with driver.switch_to.default_content().
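Putting that together with the spider from the question, the parse method might look roughly like this (a sketch only; 'ACAFrame' is the frame name mentioned above, so verify it against the page's markup in the dev tools):

def parse(self, response):
    self.driver.get(self.url)
    time.sleep(15)  # crude wait from the question; gives the page time to load
    # the search links live inside the ACAFrame iframe, so switch into it first
    self.driver.switch_to.frame('ACAFrame')
    search_button = self.driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_TabDataList_TabsDataList_ctl01_LinksDataList_ctl00_LinkItemUrl"]')
    search_button.click()
    # switch back to the top-level document before touching anything outside the frame
    self.driver.switch_to.default_content()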
Related
I am trying to create a script to download an ebook as a PDF. When I try to use BeautifulSoup to print the contents of a single page, I get a message in the console stating "Oh no! It looks like JavaScript is disabled in your browser. Please re-enable to access the reader."
I have already enabled JavaScript in Chrome, and this same piece of code works for a page like a Stack Overflow answer page. What could be blocking JavaScript on this page, and how can I bypass it?
My code for reference:
import bs4
import requests

url = requests.get("https://platform.virdocs.com/r/s/0/doc/350551/sp/14552484/mi/47443495/?cfi=%2F4%2F2%5BP7001013978000000000000000003FF2%5D%2F2%2F2%5BP7001013978000000000000000010019%5D%2F2%2C%2F1%3A0%2C%2F1%3A0")
url.raise_for_status()
soup = bs4.BeautifulSoup(url.text, "html.parser")
elems = soup.select("p")
print(elems[0].getText())
The problem is that the page actually contains no content; to load the content, it needs to run some JavaScript. The requests.get method does not run JavaScript, it just loads the raw HTML.
What you need to do is emulate a browser, i.e. 'open' the page, run the JavaScript, and then scrape the content. One way to do it is to use a browser driver, as described here: https://stackoverflow.com/a/57912823/9805867
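A minimal sketch of that approach with Selenium plus BeautifulSoup might look like the following (assuming Chrome and chromedriver are installed; note the reader may also require you to be logged in, so an authenticated browser session could still be necessary):

import time

import bs4
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://platform.virdocs.com/r/s/0/doc/350551/sp/14552484/mi/47443495/?cfi=%2F4%2F2%5BP7001013978000000000000000003FF2%5D%2F2%2F2%5BP7001013978000000000000000010019%5D%2F2%2C%2F1%3A0%2C%2F1%3A0")
time.sleep(5)  # crude wait; give the reader's JavaScript time to render the text

# page_source now contains the DOM after the JavaScript has run
soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
elems = soup.select("p")
if elems:
    print(elems[0].getText())
driver.quit()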
So, I'm trying to access some data from this webpage: http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm. I'm trying to click on the button named "Setor de atuação" with Selenium. The problem is that the requests lib returns different HTML from the one I see when I inspect the page. I already tried sending a header with my request, but it wasn't the solution. Also, when I print the content in
browser.page_source
I still get an incomplete part of the page that I want. While trying to solve the problem, I've seen that two requests are posted when the site initializes:
(screenshot of the two network requests not reproduced here)
Well, I'm not sure what to do now. If anyone can help me, send me a tutorial, or explain what is happening, I would be really glad. Thanks in advance. I've only done simple web scraping before, so I'm not sure how to proceed; I've also checked other questions in the forums and none seem similar to my problem.
import bs4 as bs
import requests
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')  # private/incognito mode
#options.add_argument('--headless')  # headless mode does not open a browser window
browser = webdriver.Chrome('/home/itamar/Desktop/chromedriver', chrome_options=options)
site = 'http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm'
browser.get(site)
That's my code so far. I'm having trouble finding and clicking the button element "Setor de Atuação". I've tried XPath, class, and id, but nothing seems to work.
The button you're after is inside an iframe, so you'll have to use the switch_to function of your Selenium driver to move the driver into the iframe's DOM; only then can you look for the button. I've played with the page provided and it worked, using only Selenium with no need for Beautiful Soup. This is my code:
from selenium import webdriver
import time


class B3:
    def __init__(self):
        self.bot = webdriver.Firefox()

    def start(self):
        bot = self.bot
        bot.get('http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm')
        time.sleep(2)
        iframe = bot.find_element_by_xpath('//iframe[@id="bvmf_iframe"]')
        bot.switch_to.frame(iframe)
        bot.implicitly_wait(30)
        tab = bot.find_element_by_xpath('//a[@id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]')
        time.sleep(3)
        tab.click()
        time.sleep(2)


if __name__ == "__main__":
    worker = B3()
    worker.start()
Hope it suits you well!
refs:
https://www.techbeamers.com/switch-between-iframes-selenium-python/
In this case I suggest you work only with Selenium, because the page depends on JavaScript processing.
You can inspect the elements and use XPath to choose and select them.
XPath:
//*[#id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]/span/span
So your code will look something like this:
elementSelect = driver.find_elements_by_xpath('//*[#id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]/span/span')
elementSelect[0].click()
time.sleep(5)  # Wait for the page to load.
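For completeness, a rough sketch of how that selection could sit in a full script is below (assuming Chrome, and noting, as the earlier answer points out, that the tab lives inside the bvmf_iframe frame, so you have to switch into it before the XPath will match):

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm')

# the tab is rendered inside an iframe, so switch into it before selecting
driver.switch_to.frame(driver.find_element_by_xpath('//iframe[@id="bvmf_iframe"]'))

elementSelect = driver.find_elements_by_xpath('//*[@id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]/span/span')
elementSelect[0].click()
time.sleep(5)  # wait for the page to load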
PS: I recommend you look for an API service for B3 data. I found this link, but I didn't read it; maybe they already make this kind of data available.
About XPath: https://www.guru99.com/xpath-selenium.html
I can't understand the problem, so if you can show a code snippet it would be better.
I also suggest you use BeautifulSoup for web scraping.
I am trying to scrape a web page which has JavaScript in it using PhantomJS. I found the element for the button, and when I click it, it should render the next link. But I am not getting the exact output I want; instead, I am getting a different output which is not what I need.
The code is:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver

# headers and movie_links are defined earlier in my script (not shown here)
s = requests.session()
fg = s.get('https://in.bookmyshow.com/booktickets/INCM/32076', headers=headers)
so = BeautifulSoup(fg.text, "html.parser")
texts = so.findAll("div", {"class": "__buytickets"})
print(texts[0].a['href'])
print(fg.url)
driver = webdriver.PhantomJS()
driver.get(movie_links[0])
element = driver.find_element_by_class_name('__buytickets')
element.click()
print(driver.current_url)
I am getting the output as:
javascript:;
https://in.bookmyshow.com/booktickets/INCM/32076
https://in.bookmyshow.com/booktickets/INVB/47680
What I need to get is:
javascript:;
https://in.bookmyshow.com/booktickets/INCM/32076
https://in.bookmyshow.com/booktickets/INCM/32076#Seatlayout
Actually, the link I need to get is generated by the JavaScript of the previous link. How do I get this (seat layout) link? Please help! Thanks in advance.
In my experience PhantomJS doesn't work well; Chrome and Firefox are better. Vitaly Slobodin (https://github.com/Vitallium) has said he will not develop PhantomJS any further. Use headless Chrome or Firefox instead.
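A rough sketch of the switch to headless Chrome is below; the scraping flow stays the same as in the question, and movie_links is assumed to be the same list built earlier in the question's script:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(chrome_options=options)

# movie_links is assumed to come from the question's script (not shown here)
driver.get(movie_links[0])
element = driver.find_element_by_class_name('__buytickets')
element.click()
print(driver.current_url)
driver.quit()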
I'm trying to scrape info from a popup page. It has the names of NGOs in a table format, and clicking on each name opens a popup page. In my code below, I'm extracting the onclick attribute for each NGO and storing it in a variable. I want to use this variable to make a POST request to get the popup page. (I've also tried accessing it using Selenium; that didn't work.)
How should I get my code to open these pop up links for scraping data off them?
HTML behind the page: the snippet is not reproduced here; each NGO name is rendered as a link with an onclick attribute.
The code portion is below:
import re

import requests
from bs4 import BeautifulSoup

html = requests.get("http://ngodarpan.gov.in/index.php/home/statewise_ngo/31/35/1")
soup = BeautifulSoup(html.text, 'lxml')
first_div = soup.find('div', class_="ibox-content")
get_tr = first_div.find_all('a', onclick=True)
for ngoinfo in get_tr:
    try:
        if re.match('show_ngo_info', ngoinfo['onclick']):
            k = ngoinfo['onclick']
            p = re.sub(r"\D", "", k)  # strip everything except the digits (the NGO id)
    except:
        pass
When you have dynamic information loaded on a web page, you should inspect what calls the page makes to get that dynamic information. You can use your web browser's developer tools to find that.
Inspecting the page, I saw that when you click on one of the links to show the popup, the page makes two requests: the first gets a CSRF token and the second gets the information that is displayed in the popup.
I think you should try to simulate these calls with Python. I couldn't test this, but I think this is the approach.
First: GET http://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf
Second: POST http://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info
With the second request you should send the id that you extracted before.
I discovered this by inspecting the network tab in the browser's developer tools.
You will need to do these calls for each link you are extracting.
I hope that can help you.
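A rough sketch of what those two calls might look like with requests is below, where p is the NGO id extracted in the question's loop. The response format of the CSRF endpoint and the POST field names are assumptions, so check them against the actual requests in the network tab before relying on this:

import requests

session = requests.Session()

# first call: fetch a fresh CSRF token (endpoint taken from the network tab)
csrf_resp = session.get("http://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf")
# ASSUMPTION: the token comes back as JSON under this key; adjust to the real response body
token = csrf_resp.json().get("csrf_token")

# second call: post the NGO id together with the token
# ASSUMPTION: the field names below must be copied from the real POST request
popup = session.post(
    "http://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info",
    data={"id": p, "csrf_test_name": token},
)
print(popup.text)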
I have a page with self-refreshing content (via WebSocket), like this one. While the content is constantly changing, my Firefox WebDriver can only see the initial content. I could get the fresh content by refreshing the page with
driver.refresh()
but this causes unnecessary traffic, and besides, the new content already appears in the Firefox window.
My question is: Can I get the fresh html as I can observe in the Firefox window without reloading the whole page?
If the page contents change over a period of time, one option is to check the page source every n seconds. A simple way to do this would be to import time and use time.sleep(5) to wait 5 seconds before getting the page source. You can also put this in a loop; if the page contents have changed within the succeeding 5-second periods, Selenium should be able to pick up the updated contents when you check. I haven't tested this, but feel free to check if it works for you.
EDIT: Added sample code. Make sure that you have Marionette properly installed and configured. You can check my answer here if you are an Ubuntu user: https://stackoverflow.com/a/39536091/6284629
# this code would print the source of a page every second
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
# side note, how to get marionette working for firefox:
# https://stackoverflow.com/a/39536091/6284629
capabilities = DesiredCapabilities.FIREFOX
capabilities["marionette"] = True
browser = webdriver.Firefox(capabilities=capabilities)
# load the page
browser.get("http://url-to-the-site.xyz")
while True:
    # print the page source
    print(browser.page_source)
    # wait for one second before looping to print the source again
    time.sleep(1)
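If you only want to react when the content actually changes, a small variation of the same loop could compare snapshots of the page source (a sketch; for WebSocket-driven pages the full source may change noisily, so you might compare only the element you care about instead):

previous = browser.page_source
while True:
    time.sleep(5)
    current = browser.page_source
    if current != previous:
        # the DOM changed since the last check; handle the new content here
        print("content updated")
        previous = current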