BeautifulSoup & requests_html unable to find element - javascript

I need to scrape information from this page: https://professionals.tarkett.co.uk/en_GB/collection-C001030-arcade/arcade-b023-2128.
There are 6000+ of those pages I need to scrape. I really don't want to use selenium as it is too slow for this type of job.
The information I am trying to scrape is the 'Documents' section at the bottom of the page, just above the 'Case studies' and 'About' sections. There is a PDF datasheet and several other key bits of information in that area I need to scrape, but I am finding it absolutely impossible.
I have tried everything at this point: requests, dryscrape, requests_html, etc., and nothing works.
It seems the information I need is being rendered by JavaScript. I have tried using libraries that supposedly work for these types of issues, but in my case it isn't working.
Here's a snippet of code to show what I mean:
from requests_html import HTMLSession
from bs4 import BeautifulSoup as bs

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder header; I tried with and without headers

session = HTMLSession()
resp = session.get("https://professionals.tarkett.co.uk/en_GB/collection-C001030-arcade/arcade-b023-2128", headers=headers)
resp.html.render()
soup = bs(resp.html.html, "html.parser")
print(soup.find("section", {"id": "collection-documentation"}))
Output:
<section data-v-6f916884="" data-v-aed8933c="" id="collection-documentation"><!-- --></section>
No matter what I try, the information just isn't there. This is one element specifically that I am trying to get:
<a data-v-5a3c0164="" href="https://media.tarkett-image.com/docs/DS_INT_Arcade.pdf" target="_blank" rel="noopener noreferrer" class="basic-clickable tksb-secondary-link-with-large-icon" is-white-icon="true"><svg xmlns="http://www.w3.org/2000/svg" width="47" height="47" viewBox="0 0 47 47" class="tksb-secondary-link-with-large-icon__icon"><g transform="translate(-1180 -560)"><path d="M1203.5,560a23.5,23.5,0,1,1-23.5,23.5A23.473,23.473,0,0,1,1203.5,560Z" class="download-icon__background"></path></g> <g><path d="M29.5,22.2l-5.1,5.5V10.3H22.6V27.7l-5.1-5.5-1.4,1.2,7.4,7.9,7.4-7.9Z" class="download-icon__arrow-fill"></path> <g><path d="M31.6,37.6H15.4V31.3h1.8v4.5H29.8V31.3h1.8Z" class="download-icon__arrow-fill"></path></g></g></svg> <span class="tksb-secondary-link-with-large-icon__text-container"><span class="tksb-secondary-link-with-large-icon__label">Datasheet</span> <span class="tksb-secondary-link-with-large-icon__description">PDF</span></span></a>
The best I've come up with so far is finding this URL from using Chrome Dev tools and inspecting the network tab to see what happens when I scroll into view of the data I want; https://professionals.tarkett.co.uk/en_GB/collection-product-formats-json/fb02/C001030/b023-2128?fields[]=sku_design&fields[]=sku_design_key&fields[]=sku_thumbnail&fields[]=sku_hex_color_code&fields[]=sku_color_family&fields[]=sku_delivery_sla&fields[]=sku_is_new&fields[]=sku_part_number&fields[]=sku_sap_number&fields[]=sku_id&fields[]=sku_format_type&fields[]=sku_format_shape&fields[]=sku_format&fields[]=sku_backing&fields[]=sku_items_per_box&fields[]=sku_surface_per_box&fields[]=sku_box_per_pallet&fields[]=sku_packing_unit_code&fields[]=sku_collection_names&fields[]=sku_category_b2b_names&fields[]=sku_sap_sales_org&fields[]=sku_minimum_order_qty&fields[]=sku_base_unit_sap&fields[]=sku_selling_units&fields[]=sku_pim_prices&fields[]=sku_retailers_prices&fields[]=sku_retailers_prices_unit&fields[]=sku_installation_method.
Now I could easily scrape the information I want from here (which at this rate I will probably have to), as it does have key information I need that isn't being loaded in the HTML. All I'd have to do is extract each product's ID code and modify the URL accordingly. But even then, the ONLY bit of information this still doesn't have is the datasheet URL. I thought I had figured it all out when I discovered this, but no, this still leaves me stuck in the mud, and I am sinking fast trying to find any solution other than selenium for extracting this one bit of info using requests and similar libraries. I'm implementing threading as well, which is why it's really important for me to be able to do this without loading up a browser the way selenium does.
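For reference, hitting that JSON endpoint directly with plain requests looks roughly like this (a sketch only: the field list is trimmed to a few of the ones above, and I'm assuming every product page follows the same collection/colour code pattern):
import requests

# Collection and colour codes taken from the example URL above; swap in each product's codes.
url = ("https://professionals.tarkett.co.uk/en_GB/collection-product-formats-json/"
       "fb02/C001030/b023-2128")
fields = ["sku_design", "sku_part_number", "sku_format", "sku_collection_names"]

resp = requests.get(url, params=[("fields[]", f) for f in fields])
resp.raise_for_status()
print(resp.json())  # structure not verified here; inspect it before parsing further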
Is it even possible at this point? I'd really appreciate it if someone who actually knows what they're talking about, unlike me, could take a look at the page, tell me what I'm missing, or point me in the right direction. I need this finished today, I've been pressed on this for 2 days now, and I am starting to give up.

Related

Webscraping in Python Selenium - Can't find button

So, I'm trying to access some data from this webpage http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm . I'm trying to click on the button named "Setor de atuação" with selenium. The problem is that the requests lib is returning different HTML from the one I see when I inspect the page. I already tried to send a header with my request but it wasn't the solution. Also, when I print the content of
browser.page_source
I still get an incomplete part of the page that I want. In order to try to solve the problem, I've seen that two requests are posted when the site initializes:
[screenshot of the two requests in the Network tab]
Well, I'm not sure what to do now. If anyone can help me, send me a tutorial, or explain what is happening, I would be really glad. Thanks in advance. I've only done simple web scraping before, so I'm not sure how to proceed; I've also checked other questions on the forums and none seem to be similar to my problem.
import bs4 as bs
import requests
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')  # private browsing
#options.add_argument('--headless')  # doesn't open a visible page when enabled
browser = webdriver.Chrome('/home/itamar/Desktop/chromedriver', chrome_options=options)
site = 'http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm'
browser.get(site)
That's my code so far. I'm having trouble finding and clicking the button element "Setor de Atuação". I've tried XPath, class, and id, but nothing seems to work.
The button you're after is inside an iframe, so in this case you'll have to use the switch_to function of your selenium driver, switching the driver into the iframe's DOM; only then can you look for the button. I've played with the page provided and it worked, using only Selenium, with no need for Beautiful Soup. This is my code:
from selenium import webdriver
import time

class B3:
    def __init__(self):
        self.bot = webdriver.Firefox()

    def start(self):
        bot = self.bot
        bot.get('http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm')
        time.sleep(2)
        # The button lives inside an iframe, so switch the driver into it first.
        iframe = bot.find_element_by_xpath('//iframe[@id="bvmf_iframe"]')
        bot.switch_to.frame(iframe)
        bot.implicitly_wait(30)
        tab = bot.find_element_by_xpath('//a[@id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]')
        time.sleep(3)
        tab.click()
        time.sleep(2)

if __name__ == "__main__":
    worker = B3()
    worker.start()
Hope it suits you well!
refs:
https://www.techbeamers.com/switch-between-iframes-selenium-python/
In this case I suggest working only with Selenium, because the page depends on JavaScript processing.
You can inspect the elements and use XPath to select them.
XPath:
//*[@id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]/span/span
So your code will look like:
elementSelect = driver.find_elements_by_xpath('//*[@id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]/span/span')
elementSelect[0].click()
time.sleep(5)  # Wait for the page to load.
PS: I recommend searching for an API service for B3. I found this link, but I didn't read it. Maybe they have already made this kind of data available.
About XPath: https://www.guru99.com/xpath-selenium.html
I can't understand the problem, so if you can show a code snippet it would be better. I also suggest using BeautifulSoup for web scraping.

Python Web Scraping with JavaScript Do Postback

I have been trying to:
Go to:
mdoe.state.mi.us/moecs/PublicCredentialSearch.aspx
Enter a certificate number (for the sake of illustration, you can just search for "Davidson" as the last name).
Click on a link corresponding to "Professional Teaching Certificate".
Copy and paste the resulting table.
The rub seems to be with the JavaScript doPostBack() part, as it requires rendering, I believe, to get the data.
When viewing the source code, see how the href part identifies an individual link like this? (for the 6th link down):
href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gViewCredentialSearchList$ctl07$link1','')
From this:
<td class="MOECSNormal" style="border-color:Black;border-width:1px;border-style:Solid;">Professional Teaching Certificate Renewal</td><td class="MOECSNormal" style="border-color:Black;border-width:1px;border-style:Solid;">
<a id="ContentPlaceHolder1_gViewCredentialSearchList_link1_5" ItemStyle-BorderColor="Black" ItemStyle-BorderStyle="Solid" ItemStyle-BorderWidth="1px" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gViewCredentialSearchList$ctl07$link1','')">CC-XWT990004102</a>
</td>
I'm looking for a way (via Python) to get the data I need into a table, given a certification number and certificate name (i.e. "Professional Teaching Certificate").
I have tried following a tutorial using PyQt4, but installing it alone was traumatic.
Thanks in advance!
You can open the page in a browser, e.g. Chrome, and study how the interaction between the page and the server is done; normally this information can be seen in the Network tab of the developer tools. That way you can formulate a Python script to reproduce the steps, maybe using the requests library,
or
you can use selenium-python to simulate your browser interaction (including JavaScript calls) until you get to the page your data of interest belongs to.
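A rough sketch of the first approach, assuming the page follows the usual ASP.NET WebForms pattern (hidden __VIEWSTATE-style fields plus __EVENTTARGET / __EVENTARGUMENT); the event target below is just the one quoted in the question, and the real form will likely also need the search fields filled in:
import requests
from bs4 import BeautifulSoup

URL = "https://mdoe.state.mi.us/moecs/PublicCredentialSearch.aspx"  # scheme assumed

with requests.Session() as session:
    # Load the page once to collect the hidden ASP.NET state fields (__VIEWSTATE etc.).
    soup = BeautifulSoup(session.get(URL).text, "html.parser")
    form = {
        tag["name"]: tag.get("value", "")
        for tag in soup.find_all("input", {"type": "hidden"})
        if tag.get("name")
    }

    # Mimic the javascript:__doPostBack(...) call from the link quoted in the question.
    form["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$gViewCredentialSearchList$ctl07$link1"
    form["__EVENTARGUMENT"] = ""
    result = session.post(URL, data=form)
    print(result.status_code)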

Untraceable HTTP redirection?

I'm currently working on a project to track products from several websites. I use a python scraper to retrieve all the URLs related to the listed products, and later, regularly check if these URLs are still active.
To do so I use the Python requests module, run a get request and look at the response's status code. Usually I get 200, 301, 302 or 404 as expected, except in the following case:
http://www.sephora.fr/Parfum/Parfum-Femme/Totem-Orange-Eau-de-Toilette/P2232006
This product has been removed and while opening the link (sorry it's in French), I am briefly shown a placeholder page saying the product is not available anymore and then redirected to the home page (www.sephora.fr).
Oddly, Python still returns a 200 status code and so do various redirect tracers such as wheregoes.com or redirectdetective.com. The worst part is that the response URL still is the original, so I can't even trace it that way.
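For reference, the check I'm running is essentially this simplified sketch:
import requests

resp = requests.get("http://www.sephora.fr/Parfum/Parfum-Femme/Totem-Orange-Eau-de-Toilette/P2232006")
print(resp.status_code)  # 200, even though the product is gone
print(resp.url)          # still the original product URL
print(resp.history)      # no server-side redirect shows up here either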
When analyzing with Chrome DevTools and preserving the logs, I see that at some point the page is reloaded. However I'm unable to find out where.
I'm guessing this is done client-side via Javascript, but I'm not quite sure how. Furthermore, I'd really need to be able to detect this change from within Python.
As a reference, here's a link to a working product:
http://www.sephora.fr/Parfum/Parfum-Femme/Kenzo-Jeu-d-Amour-Eau-de-Parfum/P1894014
Any leads?
Thank you !
Ludwig
The page has a meta tag that redirects it to the root URL:
<meta http-equiv="refresh" content="0; URL=/" />
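So from Python you can detect this case by looking for that tag in the body you already get back; a minimal sketch with requests and BeautifulSoup, assuming the removed-product placeholder always carries this lowercase meta refresh tag:
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://www.sephora.fr/Parfum/Parfum-Femme/Totem-Orange-Eau-de-Toilette/P2232006")
soup = BeautifulSoup(resp.text, "html.parser")

# Look for <meta http-equiv="refresh" content="0; URL=/"> in the returned HTML.
meta = soup.find("meta", attrs={"http-equiv": "refresh"})
if meta is not None:
    target = meta.get("content", "").split("URL=", 1)[-1].strip()
    print("client-side redirect to:", target)
else:
    print("no meta refresh; the product page looks live")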

Clicking a Javascript link to make a post request in Python

I'm writing a webscraper/automation tool. This tool needs to use POST requests to submit form data. The final action uses this link:
<a id="linkSaveDestination" href='javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("linkSaveDestination", "", true, "", "", false, true))'>Save URL on All Search Engines</a>
to submit data from this form:
<input name="sem_ad_group__destination_url" type="text" maxlength="1024" id="sem_ad_group__destination_url" class="TextValueStyle" style="width:800px;">
I've been using requests and BeautifulSoup. I understand that these libraries can't interact with Javascript, and people recommend Selenium. But as I understand it Selenium can't do POSTs. How can I handle this? Is it possible to do without opening an actual browser like Selenium does?
Yes. You can absolutely duplicate what the link is doing by just submitting a POST to the proper url (this is, in reality, eventually going to be the same thing that the javascript that fires when the link is clicked does).
You'll find the relevant section in the requests docs here: http://docs.python-requests.org/en/latest/user/quickstart/#more-complicated-post-requests
So, that'll look something like this for your particular case:
payload = {'sem_ad_group__destination_url': 'yourTextValueHere'}
r = requests.post("theActionUrlForTheFormHere", data=payload)
If you're having trouble figuring out which URL it's actually being posted to, just monitor the Network tab (in Chrome dev tools) while you manually click the link yourself; you should be able to find the right request and pull any information you need off of it.
Good Luck!
With selenium you mimic real-user interactions in a real browser: tell it to locate an input, type text into it, click a button, etc. It's a high-level approach; you don't even need to know what is there under the hood, you see what a real user sees. The downside is that there is a real browser involved, which, at the very least, slows things down. You can, though, automate a headless browser (PhantomJS), or use an Xvfb virtual framebuffer if you aren't able to open up a browser with a UI. Example:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('url here')
button = driver.find_element_by_id('linkSaveDestination')
button.click()
With requests+BeautifulSoup, you are going down to the bare metal: using the browser developer tools you research/analyze what requests are made to the server and mimic them in your code. Sometimes the way a page is constructed and the requests it makes are too complicated to automate, or there are anti-web-scraping techniques in use.
There are pros & cons to both approaches; which option to choose depends on many things.

Is there a way to mitigate downloading of resources (images/css and js files) with Javascript?

I have an HTML page on my localhost: get_description.html.
The snippet below is part of the code:
<input type="text" id="url"/>
<button id="get_description_button">Get description</button>
<iframe id="description_container" src="#"/>
When the button is clicked, the src of the iframe is set to the URL entered in the textbox. The pages fetched this way are very big, with lots of linked files. What I am interested in on the page is a block of text contained in a <div id="description"> element.
Is there a way to mitigate downloading of resources linked in the page that loads into the iframe?
I don't want to use curl because the data is only available to logged-in users and the steps to take with curl to get the content are too complicated. The iframe is simple, as I use this on a box which sends the right cookies to identify the request as coming from a logged-in user, but the problem is that it is very wasteful to fetch nearly 1 MB of data to keep 1 KB of it and throw out the rest.
Edit
If the proposed method just works in Firefox it is fine, so I added Firefox tag. Also, it is possible that the answer actually is from the realm of Firefox add-on techniques, so I added that tag as well.
The problem is not that I cannot get at what I'm looking for, rather, the problem is the easy iframe method is wasteful.
I know that Firefox does allow loading only the text of a page. If you open a page and press Ctrl+U you are taken to the 'view page source' window. There, links behave as normal and are clickable; if you click a link in source view, the source of the new page is loaded into the view-source window without the linked resources being downloaded, which is exactly what I'm trying to get. But I don't know how to access this behaviour.
Another example is the Adblock add-on. It somehow kills elements before they get loaded. With plain JavaScript this is not possible, because it is only triggered too late to intervene in time.
The Same Origin Policy forbids any web page from accessing the contents of any other web page in a different domain, so basically you cannot do that.
However, it seems that some browsers allow a web page's content to be accessed if you are doing it from a local web page, which seems to be your case.
Safari and IE 6/7/8 are browsers that allow a local web page to do so via XMLHttpRequest (source: Google Browser Security Handbook), so you may want to choose one of those browsers to do what you need (note that future versions of those browsers may not allow this anymore).
Apart from this solution I only see two possibilities:
If the web pages you need to fetch content from are somehow controlled by you, you can create a simpler interface to let other web pages get the content you need (for example by allowing JSONP requests).
If the web pages you need to fetch content from are not controlled by you, the only solution I see is to fetch the content server side, logging in from the server directly (I know that you don't want to do so, but I don't see any other possibility if the previous ones are not practicable).
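A minimal sketch of that second, server-side option, assuming you can reuse the session cookie from a logged-in browser session (the cookie name and URL below are hypothetical):
import requests
from bs4 import BeautifulSoup

# Hypothetical cookie copied from the logged-in browser session.
cookies = {"sessionid": "paste-your-session-cookie-here"}

resp = requests.get("https://example.com/some-protected-page", cookies=cookies)
soup = BeautifulSoup(resp.text, "html.parser")

# Keep only the block you care about and discard the rest.
description = soup.find("div", id="description")
print(description.get_text(strip=True) if description else "no description found")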
Hope it helps.
Actually I've seen Cross Domain jQuery .load request before, here: http://james.padolsey.com/javascript/cross-domain-requests-with-jquery/
The author claims that code like this, found on that page,
$('#container').load('http://google.com'); // SERIOUSLY!
$.ajax({
    url: 'http://news.bbc.co.uk',
    type: 'GET',
    success: function(res) {
        var headline = $(res.responseText).find('a.tsh').text();
        alert(headline);
    }
});
// Works with $.get too!
would work. (The BBC code might not work because of the recent redesign, but you get the idea)
Apparently it is using YQL wrapped into a jQuery plugin to do the trick. Now I cannot say I fully understand what he is doing there but it appears to work, and fits the bill. Once you load the data I suppose it is a simple matter of filtering out the data that you need.
If you prefer something that works at the browser level, may I suggest Mozilla's Jetpack framework for lightweight extensions. I've not yet read the documentations in its entirety but it should contain the APIs needed for this to work.
There are various ways to go about this in AJAX; I'm going to show the jQuery way for brevity as one option, though you could do this in vanilla JavaScript as well.
Instead of an <iframe> you can just use a container, let's say a <div> like this:
<div id="description_container"></div>
Then to load it:
$(function() {
    $("#get_description_button").click(function() {
        $("#description_container").load($("input").val() + " #description");
    });
});
This uses the .load() method, which takes a string in the format .load("url selector"), takes that element from the fetched page, and places its content inside the container you're loading into, in this case #description_container.
This is just the jQuery route, mainly to illustrate that yes, you can do what you want, but you don't have to do it exactly like this; the concept is simply getting what you want from an AJAX request rather than an <iframe>.
Your description sounds like you are fetching pages from the same domain (you said that you need to be logged in and have session credentials), so have you tried using an async request via XMLHttpRequest? It might complain if the HTML on a page is particularly messed up, but you should still be able to get the raw text via .responseText and extract what you need with a regex.
