Using Python's requests-html to bypass AdBlock by executing JS

I was aiming to use requests-html to get text that is loaded by JS. The static HTML displays a message asking me to turn off AdBlock. I'm using the code below, which is supposed to execute the JS, but the HTML I get back is still the static version. Could anyone offer advice on how to resolve this, or at least explain why I'm not getting the dynamic content, and why the AdBlock message still shows up?
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://wyborcza.pl/7,173236,25975806,free-khaled-drareni.html')
r.html.render()  # supposed to execute the page's JS in headless Chromium
print(r.html.html)
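For what it's worth, render() runs the page in headless Chromium (downloaded automatically on first use) via pyppeteer, so the JS should execute. Two things worth checking are whether the page's scripts simply need more time than the defaults allow, and whether the site's anti-adblock script detects the headless browser, which could explain why the message still appears. A minimal sketch using render()'s sleep and timeout parameters (the values are guesses, not tested against this site):

import time
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://wyborcza.pl/7,173236,25975806,free-khaled-drareni.html')

# Give the page's scripts extra time after load and allow a longer overall render.
r.html.render(sleep=2, timeout=30)

# If the dynamic content loaded, the AdBlock message should be gone and
# the rendered HTML should be larger than the static response.
print('adblock' in r.html.html.lower())
print(len(r.html.html))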

Related

How to get past a "JavaScript is disabled in your browser" error when web scraping with Python

I am trying to create a script to download an ebook as a PDF. When I try to use BeautifulSoup to print the contents of a single page, I get a message in the console stating "Oh no! It looks like JavaScript is disabled in your browser. Please re-enable to access the reader."
I have already enabled JavaScript in Chrome, and this same piece of code works for a page like a Stack Overflow answer page. What could be blocking JavaScript on this page, and how can I bypass it?
My code for reference:
import bs4
import requests

url = requests.get("https://platform.virdocs.com/r/s/0/doc/350551/sp/14552484/mi/47443495/?cfi=%2F4%2F2%5BP7001013978000000000000000003FF2%5D%2F2%2F2%5BP7001013978000000000000000010019%5D%2F2%2C%2F1%3A0%2C%2F1%3A0")
url.raise_for_status()
soup = bs4.BeautifulSoup(url.text, "html.parser")
elems = soup.select("p")
print(elems[0].getText())
The problem is that the page initially contains no content; to load it, the page needs to run some JS code. The requests.get method does not run JS, it just fetches the raw HTML.
What you need to do is emulate a browser: 'open' the page, let the JS run, and then scrape the content. One way to do this is to use a browser driver, as described here: https://stackoverflow.com/a/57912823/9805867
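For illustration, a minimal sketch of that approach with Selenium and headless Chrome (untested against this reader; it assumes chromedriver is available on your PATH, and the fixed sleep is a crude stand-in for a proper wait):

import time
import bs4
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

# Load the reader page in a real browser engine so its JS can build the DOM.
driver.get("https://platform.virdocs.com/r/s/0/doc/350551/sp/14552484/mi/47443495/"
           "?cfi=%2F4%2F2%5BP7001013978000000000000000003FF2%5D%2F2%2F2"
           "%5BP7001013978000000000000000010019%5D%2F2%2C%2F1%3A0%2C%2F1%3A0")
time.sleep(5)  # crude wait for the JS to finish; a WebDriverWait would be more robust

# page_source now reflects the rendered DOM, so BeautifulSoup can find the text.
soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
elems = soup.select("p")
if elems:
    print(elems[0].getText())
driver.quit()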

Web scraping in Python with Selenium - Can't find button

So, I'm trying to access some data from this webpage: http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm . I'm trying to click on the button named "Setor de atuação" with Selenium. The problem is that the requests lib returns me a different HTML from the one I see when I inspect the page. I already tried sending a header with my request, but it wasn't the solution. Also, when I print the content of
browser.page_source
I still get an incomplete part of the page that I want. While trying to solve the problem, I've noticed that two requests are posted when the site initializes.
Well, I'm not sure what to do now. If anyone can help me, point me to a tutorial, or explain what is happening, I would be really glad. Thanks in advance. I've only done simple web scraping before, so I'm not sure how to proceed; I've also checked other questions on the forum, and none seem similar to my problem.
import bs4 as bs
import requests
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')  # private browsing
#options.add_argument('--headless')  # doesn't open a visible window
browser = webdriver.Chrome('/home/itamar/Desktop/chromedriver', chrome_options=options)
site = 'http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm'
browser.get(site)
That's my code so far. I'm having trouble finding and clicking the button element "Setor de Atuação". I've tried XPath, class, and id, but nothing seems to work.
The button you're aiming for is inside an iframe. In this case you'll have to use the switch_to function of your Selenium driver to switch the driver into the iframe's DOM; only then can you look for the button. I've played with the page you provided and it worked, using only Selenium, no need for Beautiful Soup. This is my code:
from selenium import webdriver
import time

class B3:
    def __init__(self):
        self.bot = webdriver.Firefox()

    def start(self):
        bot = self.bot
        bot.get('http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm')
        time.sleep(2)
        iframe = bot.find_element_by_xpath('//iframe[@id="bvmf_iframe"]')
        bot.switch_to.frame(iframe)
        bot.implicitly_wait(30)
        tab = bot.find_element_by_xpath('//a[@id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]')
        time.sleep(3)
        tab.click()
        time.sleep(2)

if __name__ == "__main__":
    worker = B3()
    worker.start()
Hope it suits you well!
refs:
https://www.techbeamers.com/switch-between-iframes-selenium-python/
In this case I suggest you work only with Selenium, because the page depends on JavaScript processing.
You can inspect the elements and use XPath to select them.
XPath:
//*[@id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]/span/span
So your code will look like:
elementSelect = driver.find_elements_by_xpath('//*[@id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]/span/span')
elementSelect[0].click()
time.sleep(5)  # Wait for the page to load.
PS: I recommend you look for an API service for the B3 data. I found this link, but I didn't read it; maybe they already make this kind of data available.
About XPath: https://www.guru99.com/xpath-selenium.html
I can't understand the problem, so if you can show a code snippet it would be better.
I also suggest you use BeautifulSoup for web scraping.

Python: How to get past the Blogspot warning page (guestAuth)

I have written a Python script to display blog pages and their properties.
This works pretty well, except that some blogs on Blogspot require manual validation on a warning page like this one:
http://ferdinandkreozot.blogspot.com/2015/12/busy-as.html
So I don't know how to programmatically validate to get to the page contents.
The warning validation button has a URL like:
http://ferdinandkreozot.blogspot.com/2015/12/busy-as.html?guestAuth='SOME VERY LONG ID'
Here is the part of my current script that prints image URLs (replace the url variable with the Blogspot URL above):
import os, sys, urllib, httplib2, validators, time
from bs4 import BeautifulSoup, SoupStrainer

url = 'http://ferdinandkreozot.blogspot.com/2015/12/busy-as.html'  # the Blogspot URL above
http = httplib2.Http()
status, response = http.request(url)
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('img')):
    print(link['src'])
How can I proceed to finally get to the page contents?
How is this ID generated?
Is there any example or information I could look at?
Do I need to use GoogleAuth or google-auth-httplib2, for instance? Even though no Google account is required to get through manually.
I found nothing so far.
Thank you very much for your help.
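Not a full answer, but one thing worth trying before reverse-engineering the ID: the warning page itself may contain the guestAuth continuation link in its HTML, in which case you can extract it and follow it within the same session. A sketch of that idea using requests and BeautifulSoup (the assumption that the link is present in the interstitial is unverified):

import requests
from bs4 import BeautifulSoup

url = 'http://ferdinandkreozot.blogspot.com/2015/12/busy-as.html'
session = requests.Session()  # keeps cookies across the interstitial and the blog

soup = BeautifulSoup(session.get(url).text, 'html.parser')

# Assumption: the warning page links to the post with a guestAuth token appended.
proceed = soup.find('a', href=lambda h: h and 'guestAuth' in h)
if proceed:
    soup = BeautifulSoup(session.get(proceed['href']).text, 'html.parser')

for img in soup.find_all('img'):
    print(img.get('src'))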

HtmlUnit Load Facebook Photos

So, I have a project where I need to get the photos from a profile.
I am able to navigate to the photos page of a profile, but I believe the JavaScript is not loading.
I am currently using HtmlUnit but if you know of another Java API that would help I'm all ears.
Basically, when I view Facebook in a normal browser, it will load all of the pages and I can inspect the elements.
When inspecting, there is a div called fbStarGrid and a few other modifiers. This div contains all the images for a user's profile.
When I use HTMLUnit, I cannot find the div. I had it print the full page XML to a file, and I found that the div is commented out. I believe this means the Javascript never ran to load the content.
After browsing a lot of javascript help on SO, I have found a few things that help with debugging but can't seem to fix the problem.
The first thing I've done is create an instance of a JavaScriptJobManager. I used it to see how much JavaScript has not completed. After waiting for a while (10+ seconds), it says there are still 3 JS jobs incomplete. After a very long time (about 60 seconds), it says there are 2 JS jobs incomplete.
I do not know what is hanging with those JS jobs.
I get a warning upon page load about application/ld+json not running but I do not believe that part of the website is related to the photos.
Is there something I can do to force the JS to run? Is there a job it's stuck on and won't proceed to the next job?
I've also wondered if it's an issue with the page not re-syncing.
I've tried two solutions related to this:
Setting the AjaxController to NicelyResynchronizingAjaxController()
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
And someone suggested creating a custom controller that forces syncing.
webClient.setAjaxController(new AjaxController() {
    @Override
    public boolean processSynchron(HtmlPage page, WebRequest request, boolean async)
    {
        return true;
    }
});
Neither of these seemed to affect the page.
If HtmlUnit is not the right library for the job, any other ideas? I need this to be headless/GUI-less to run on a Linux server. Java is preferred, but I can switch languages if necessary.
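Since you say you can switch languages if necessary: a common fallback when HtmlUnit's JS engine stalls is to drive headless Chrome instead, which runs the same engine as a desktop browser and stays GUI-less on a Linux server. A rough sketch with Python's Selenium bindings (the URL is hypothetical, the div class is taken from your description, and logging in / Facebook's terms of service are separate problems):

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # no GUI needed, fine on a Linux server
driver = webdriver.Chrome(options=options)

driver.get('https://www.facebook.com/example.profile/photos')  # hypothetical URL
time.sleep(5)  # crude wait for the photo grid's JS to finish rendering

# 'fbStarGrid' is the div you observed in the browser inspector.
for img in driver.find_elements(By.CSS_SELECTOR, 'div.fbStarGrid img'):
    print(img.get_attribute('src'))

driver.quit()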

Shopify App - Using Script Tags with Ruby on Rails Application

I'm trying to familiarize myself with the concept of using script tags. I'm making a Ruby on Rails app that does something as simple as alerting "Hi" when a customer visits a page. I am testing this public app on a local server, and I have the shopify_app gem installed. The app has been authenticated, and I have access to the store's data. I've read the Shopify API documentation on script tags and looked at the Shopify Embedded App example on GitHub. The documentation details the properties of a script tag and gives examples with their properties defined, but it doesn't say anything about where to place the script tag in an application, or how to configure an environment so that the JS file in the script tag will be loaded.
I've discovered that a JS file added with a script tag will only work if the file is hosted online, so I've uploaded the JS file to Google Drive. I have the code for the script tag in the index action of my HomeController (the default page for the app). This is the code I'm using:
def index
  if response = request.env['omniauth.auth']
    sess = ShopifyAPI::Session.new(params[:shop], response[:credentials][:token])
    session[:shopify] = sess
    ShopifyAPI::Base.activate_session(sess)
    ShopifyAPI::ScriptTag.create(
      :event => "onload",
      :src => "https://drive.google.com/..."
    )
  end
end
I think the problem may be tied to request.env. The response is not being read as request.env['omniauth.auth'], and I believe that the response coming back as valid may be required for the script tag to go through.
The method I tried above is from the second answer given in this topic: How to develop rails app for shopify with ScriptTags.
The first answer suggested using this code:
ShopifyAPI::Base.site = token
s = ShopifyAPI::ScriptTag.create(:events => "onload",:src => "your javascript url")
However, it doesn't say where to place these two lines of code in a Rails application. I tried putting the second line in a JS file in my Rails application, but it did not work.
I don't know if I'm encountering problems because I'm running the app on a local server or if there is something missing from the configuration of my application.
I'd appreciate it if anyone could point me in the right direction.
Try putting something like this in config/initializers/shopify_app.rb
ShopifyApp.configure do |config|
  config.api_key = "xxx-xxxx-xxx-xxx"
  config.secret = "xxx-xxxx-xxx-xxx"
  config.scope = "read_orders, read_products"
  config.embedded_app = true
  config.scripttags = [
    {event: 'onload', src: 'https://yourdomain.herokuapp.com/javascripts/yourjs.js'}
  ]
end
Yes, you are correct that the JS file you want to include via your script tag must be publicly available; if you are using localhost for development, look into ngrok.
Do yourself the favor of ensuring your callbacks use SSL when interacting with the Shopify API (i.e. configure your app with https://localhost/ as a callback setting in the Shopify app settings). I went through the trouble of configuring thin as the local web server with a self-signed SSL certificate.
With a proper setup you should be able to debug why the response is failing the omniauth check.
I'm new to the Shopify API(s), but not to Rails. Their documentation leaves a lot to be desired.
Good luck to you, sir.
