Python: How to pass blogspot warning page (guestAuth) - javascript

I have written a python script to display blog pages and their properties.
This works pretty well, except for some pages on Blogspot that require manual validation on a warning page, like this one:
http://ferdinandkreozot.blogspot.com/2015/12/busy-as.html
So I don't know how to programmatically validate to get to the page contents.
The warning page's validation button has a URL like:
http://ferdinandkreozot.blogspot.com/2015/12/busy-as.html?guestAuth='SOME VERY LONG ID'
Here is part of my current script, which prints image URLs (the url variable is set to the Blogspot URL above):
import os, sys, urllib, httplib2, validators, time
from bs4 import BeautifulSoup, SoupStrainer

url = 'http://ferdinandkreozot.blogspot.com/2015/12/busy-as.html'
http = httplib2.Http()
status, response = http.request(url)
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('img')):
    print(link['src'])
How can I proceed to finally get to the page contents?
How is this ID generated?
Any example or information I could look at?
Do I need to use GoogleAuth or google-auth-httplib2, for instance, even though no Google account is required to get through manually?
I found nothing so far.
Thank you very much for your help.
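For what it's worth, here is the direction I imagine (a sketch only, using requests instead of httplib2 so cookies are kept between requests, and assuming the warning page itself contains an anchor carrying the guestAuth token, which I have not been able to confirm):

# Sketch only: assumes the interstitial warning page contains an anchor
# whose href carries the guestAuth token (not confirmed).
import requests
from bs4 import BeautifulSoup, SoupStrainer

url = 'http://ferdinandkreozot.blogspot.com/2015/12/busy-as.html'
session = requests.Session()          # keep cookies between the two requests
warning_page = session.get(url)

guest_link = None
for a in BeautifulSoup(warning_page.text, 'html.parser', parse_only=SoupStrainer('a')):
    if a.has_attr('href') and 'guestAuth=' in a['href']:
        guest_link = a['href']
        break

if guest_link:
    page = session.get(guest_link)    # follow the validation link
    for img in BeautifulSoup(page.text, 'html.parser', parse_only=SoupStrainer('img')):
        if img.has_attr('src'):
            print(img['src'])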

Related

How to get past Javascript is disabled in your browser error when web scraping with Python

I am trying to create a script to download an ebook into a PDF. When I try to use BeautifulSoup to print the contents of a single page, I get a message in the console stating "Oh no! It looks like JavaScript is disabled in your browser. Please re-enable to access the reader."
I have already enabled JavaScript in Chrome, and this same piece of code works for a page like a Stack Overflow answer page. What could be blocking JavaScript on this page, and how can I bypass it?
My code for reference:
import bs4
import requests

url = requests.get("https://platform.virdocs.com/r/s/0/doc/350551/sp/14552484/mi/47443495/?cfi=%2F4%2F2%5BP7001013978000000000000000003FF2%5D%2F2%2F2%5BP7001013978000000000000000010019%5D%2F2%2C%2F1%3A0%2C%2F1%3A0")
url.raise_for_status()
soup = bs4.BeautifulSoup(url.text, "html.parser")
elems = soup.select("p")
print(elems[0].getText())
The problem is that the page actually contains no content. To load the content it needs to run some JS code. The requests.get method does not run JS, it just loads the basic HTML.
What you need to do is emulate a browser, i.e. 'open' the page, run the JS, and then scrape the content. One way to do it is to use a browser driver, as described here: https://stackoverflow.com/a/57912823/9805867
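For example, a minimal sketch of that approach with Selenium and headless Chrome (the URL and the p selector come from the question; the ChromeDriver setup and the wait time are assumptions):

import time

import bs4
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')          # run Chrome without a visible window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on the PATH

driver.get("https://platform.virdocs.com/r/s/0/doc/350551/sp/14552484/mi/47443495/?cfi=%2F4%2F2%5BP7001013978000000000000000003FF2%5D%2F2%2F2%5BP7001013978000000000000000010019%5D%2F2%2C%2F1%3A0%2C%2F1%3A0")
time.sleep(5)  # crude wait so the reader's JS has time to inject the content

soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
elems = soup.select("p")
if elems:
    print(elems[0].getText())
driver.quit()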

using python's requests-html to bypass adblock by executing js

I was aiming to use requests-html to get text loaded by JS. The static HTML displays a message asking to turn off AdBlock. I'm using the code below, which is supposed to execute the JS, but the HTML I get is still the static one. Could anyone offer advice on how to resolve this, or at least explain why I'm not getting the dynamic content, and why the AdBlock message still turns up?
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://wyborcza.pl/7,173236,25975806,free-khaled-drareni.html')
r.html.render()
print(r.html.html)
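In case it helps narrow things down: render() runs the page in a bundled Chromium and accepts sleep and timeout arguments, so slow scripts get more time to run before the HTML snapshot is taken (whether that is enough for this particular site is an assumption I can't verify):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://wyborcza.pl/7,173236,25975806,free-khaled-drareni.html')
# Let the page's scripts run for a couple of seconds before grabbing the HTML.
r.html.render(sleep=2, timeout=20)
print(r.html.html)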

Webscraping in Python Selenium - Can't find button

So, I'm trying to access some data from this webpage: http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm . I'm trying to click on the button named "Setor de atuação" with Selenium. The problem is that the requests lib returns a different HTML from the one I see when I inspect the page. I already tried sending a header with my request, but that wasn't the solution. Also, when I print the content of
browser.page_source
I still get only an incomplete part of the page I want. While trying to solve the problem, I noticed that two requests are posted when the site initializes:
(screenshot of those requests not reproduced here)
Well, I'm not sure what to do now. If anyone can help me, send me a tutorial, or explain what is happening, I would be really glad. Thanks in advance. I've only done simple web scraping so far, so I'm not sure how to proceed; I've also checked other questions on the forum and none seem similar to my problem.
import bs4 as bs
import requests
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito') #private
#options.add_argument('--headless') # doesnt open page
browser = webdriver.Chrome('/home/itamar/Desktop/chromedriver', chrome_options=options)
site = 'http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm'
browser.get(site)
That's my code so far. I'm having trouble finding and clicking the "Setor de Atuação" button element. I've tried XPath, class and id, but nothing seems to work.
The button you are after is inside an iframe, so in this case you'll have to use the switch_to function of your Selenium driver to switch it into the iframe's DOM; only then can you look for the button. I've played with the page provided and it worked, using Selenium only, no need for Beautiful Soup. This is my code:
from selenium import webdriver
import time

class B3:
    def __init__(self):
        self.bot = webdriver.Firefox()

    def start(self):
        bot = self.bot
        bot.get('http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm')
        time.sleep(2)
        iframe = bot.find_element_by_xpath('//iframe[@id="bvmf_iframe"]')
        bot.switch_to.frame(iframe)
        bot.implicitly_wait(30)
        tab = bot.find_element_by_xpath('//a[@id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]')
        time.sleep(3)
        tab.click()
        time.sleep(2)

if __name__ == "__main__":
    worker = B3()
    worker.start()
Hope it suits you well!
refs:
https://www.techbeamers.com/switch-between-iframes-selenium-python/
In this case I suggest you work only with Selenium, because the page depends on JavaScript processing.
You can inspect the elements and use XPath to select them.
XPath:
//*[#id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]/span/span
So your code will look like:
elementSelect = driver.find_elements_by_xpath('//*[#id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]/span/span')
elementSelect[0].click()
time.sleep(5) # Wait the page to load.
PS: I recommend you look for an API service for B3. I found this link, but I didn't read it. Maybe they have already made this kind of data available.
About XPath: https://www.guru99.com/xpath-selenium.html
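As an aside, the fixed time.sleep can be replaced by an explicit wait; a minimal sketch, assuming the same XPath and that the driver has already switched into the iframe mentioned in the other answer:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

xpath = '//*[@id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]/span/span'
# Wait up to 10 seconds for the tab to become clickable instead of sleeping blindly.
tab = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath)))
tab.click()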
I can't quite understand the problem; if you can show a code snippet, that would help.
I also suggest using BeautifulSoup for web scraping.

PhantomJS not retrieving correct data

I am trying to scrape a web page that contains JavaScript, using PhantomJS. I found the button element and, when I click it, it should render the next link. But I am not getting the exact output I want; instead, I am getting a different output that I don't need.
The code is:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver

s = requests.session()
# 'headers' is defined earlier in the full script (not shown here)
fg = s.get('https://in.bookmyshow.com/booktickets/INCM/32076', headers=headers)
so = BeautifulSoup(fg.text, "html.parser")
texts = so.findAll("div", {"class": "__buytickets"})
print(texts[0].a['href'])
print(fg.url)
driver = webdriver.PhantomJS()
# 'movie_links' is also built earlier in the full script (not shown here)
driver.get(movie_links[0])
element = driver.find_element_by_class_name('__buytickets')
element.click()
print(driver.current_url)
I am getting this output:
javascript:;
https://in.bookmyshow.com/booktickets/INCM/32076
https://in.bookmyshow.com/booktickets/INVB/47680
What I have to get is:
javascript:;
https://in.bookmyshow.com/booktickets/INCM/32076
https://in.bookmyshow.com/booktickets/INCM/32076#Seatlayout
Actually, the link I have to get is generated by the JavaScript of the previous link. How do I get this (Seatlayout) link? Please help! Thanks in advance.
In my experience PhantomJS doesn't work well; Chrome and Firefox are better.
Vitaly Slobodin (https://github.com/Vitallium) said he will not develop PhantomJS any further.
Use headless Chrome or Firefox.
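For example, a minimal sketch of swapping the PhantomJS driver for headless Chrome, reusing the same find_element_by_class_name call from the question (the ChromeDriver setup is an assumption):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')          # run Chrome without a visible window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on the PATH

driver.get('https://in.bookmyshow.com/booktickets/INCM/32076')
element = driver.find_element_by_class_name('__buytickets')
element.click()
# After the click, the JS-driven navigation should be reflected here.
print(driver.current_url)
driver.quit()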

How to Scrape a popup page using python and selenium and beautiful soup

I'm trying to scrape info from a pop-up page. It has the names of NGOs in a table format, and clicking on each name opens a pop-up page. In my code below, I'm extracting the onclick attribute for each NGO and storing it in a variable. I want to use this variable to make a POST request to get the pop-up page. (I've also tried accessing it using Selenium; it didn't work.)
How should I get my code to open these pop up links for scraping data off them?
HTML behind the page (snippet not reproduced here)
The code portion is below:
import re
import requests
from bs4 import BeautifulSoup

html = requests.get("http://ngodarpan.gov.in/index.php/home/statewise_ngo/31/35/1")
soup = BeautifulSoup(html.text, 'lxml')
first_div = soup.find('div', class_="ibox-content")
get_tr = first_div.find_all('a', onclick=True)
for ngoinfo in get_tr:
    try:
        if re.match('show_ngo_info', ngoinfo['onclick']):
            k = ngoinfo['onclick']
            p = re.sub(r"\D", "", k)  # keep only the digits (the NGO id)
    except:
        pass
When a web page loads information dynamically, you should inspect which calls the page makes to get that information. You can use your web browser's inspection tools to find them.
Inspecting the page, I saw that when you click one of the links to show the pop-up, the page makes two requests: the first gets a CSRF token and the second gets the information displayed in the pop-up.
I think you should try to simulate these calls with Python. I couldn't test this, but I think this is the approach.
First: GET http://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf
Second: POST http://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info
You should send the id that you got before
I discovered this by inspecting the network tab in the browser's inspection tools.
You will need to make these calls for each link you are extracting.
I hope that can help you.
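A minimal sketch of that two-step flow with requests (the endpoint URLs are the two listed above; the JSON key and POST field names, csrf_token, id and csrf_test_name, are assumptions that should be checked against the requests the browser actually sends):

import requests

session = requests.Session()
ngo_id = "123456"  # hypothetical id; in practice, the digits extracted from the onclick attribute

# Step 1: fetch a fresh CSRF token from the first endpoint.
csrf = session.get("http://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf").json()

# Step 2: post the NGO id together with the token to fetch the pop-up data.
# The field names below are assumptions; inspect the browser's POST to confirm them.
resp = session.post(
    "http://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info",
    data={"id": ngo_id, "csrf_test_name": csrf.get("csrf_token")},
)
print(resp.text)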
