python BeautifulSoup not getting text from webpage - javascript

I am trying to get the product name from a webpage using Python, but it returns only empty tags. I also tried the requests library and lxml parsing in BeautifulSoup. Please help me fix this problem. Thanks in advance :-)
HTML in site:
<div class="product-name">SWAN</div>
<div class="product-price">
<span class="final-price">₹10650</span>
</div>
<div class="specification">
<div>Specifications</div>
<table>
<tr>
<td>....</td>
</tr>
<tr>
<td>....</td>
</tr>
</table>
</div>
python code:
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
url = "http://opor.in/ProductDetail/Index?ProductId=212"
page = urlopen(url).read()
html = bs(page, 'html.parser')
model_name = html.find('div', attrs={'class':'product-name'})
spec = html.find('div', attrs={'class':'specification'})
print(model_name)
print(spec)
Output:
<div class="product-name"></div>
<div class="specification">
<div>Specifications</div>
<table></table>
</div>

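The data is loaded by JavaScript, so it is not in the static HTML that urlopen returns. However, if you inspect the page source, the same data is available inside a script tag as a variable named productDetail. You can take that value from the script tag, load it as JSON, and then read the keys you need.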
Code:
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import json
url = "http://opor.in/ProductDetail/Index?ProductId=212"
page = urlopen(url).read()
soup = bs(page, 'html.parser')
for script in soup.find_all('script'):
    if 'productDetail' in script.text:
        # Cut out the object literal assigned to productDetail and restore its closing brace
        data = script.text.split('var productDetail =')[-1].split('};')[0] + "}"
        datajson = json.loads(data.strip())
        print('Product Code :' + datajson['ProductCode'])
        for item in datajson['ProductSpecification']:
            print(item['SpecificationName'] + " : " + item['SpecificationValue'])
Output:
Product Code :1601KFMB
MEMBRANE : MEMBRELLA -ALPHA- 80 GPD (2 NOS)
PUMP : KEMFLO 48 V
APPLICATION : SUITABLE FOR BRACKISH WATER
FILTER LIFE : APPROX 3000 LITRE / 6 MONTHS
FILTERS : SEDIMENT, PRECARBON, POST CARBON
FLOAT : MEMBRELLA
FR : MEMBRELLA /KFL
INLINE SET : MEMBRELLA
INPUT VOLTAGE : 100-300 VOLT AC (50Hz)
INSTALLATION : COUNTER TOP
MAX.OPERATION TDS : 4000 PPM
MEMBRANE TYPE : THIN FILM COMPOSITE
MIN.INLET PRESSURE / TEMP : 0.3 kg / cm2, 10 °C
MODEL : WHALE 25
OPERATING VOLTAGE : 48 VOLT (DC)
PRODUCT DIMENSION : 21.1 (H) x 9.9 (W) x 16.7 (L)
PURIFICATION CAPACITY : 25 LITRES PER HOUR
RECOVERY RATE : MORE THAN 30% AT 27°c ± 2°c
SMPS : MEMBRELLA / EQUALIANT
SOLENOID VALVE : MEMBRELLA / SLX
STORAGE CAPACITY : 20 LITRES
TECHNOLOGY : REVERSE OSMOSIS SYSTEM
TOTAL POWER CONSUMPTION : 50 W
TUBE 1/4 : 5 METERS
TUBE 3/8 : 2 METERS
WEIGHT : 18 kg (Approx)
WARRENTY & SUPPORT : Since Whale designs its purifiers and many of its parts are a truly integrated system. Dealer only can provide one-stop service ,guaranty and support for any service and maintenance, so most issues can be resolved in a single visit
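Splitting on '};' can cut the data short if that sequence ever appears inside a string value. A minimal sketch of an alternative, assuming (as the code above already does) that the object assigned to productDetail is valid JSON: json.JSONDecoder().raw_decode parses one JSON value starting at a given position and ignores whatever follows it. The sample_script text and the extract_js_object helper are made up for illustration and are not taken from the site.

```python
import json
import re

# Made-up script content standing in for the text of the page's <script> tag.
sample_script = """
var other = 1;
var productDetail = {"ProductCode": "1601KFMB", "Note": "contains }; inside", "ProductSpecification": [{"SpecificationName": "PUMP", "SpecificationValue": "KEMFLO 48 V"}]};
var after = {};
"""


def extract_js_object(script_text, var_name):
    """Return the JSON value assigned to `var <var_name> =`, or None if absent."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", script_text)
    if match is None:
        return None
    # raw_decode parses a single JSON value starting at the given index
    # and returns it together with the index where the value ends.
    obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return obj


data = extract_js_object(sample_script, "productDetail")
print("Product Code :" + data["ProductCode"])
for spec in data["ProductSpecification"]:
    print(spec["SpecificationName"] + " : " + spec["SpecificationValue"])
```

With the real page, script.text from the loop above would take the place of sample_script.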

The data you are looking for is actually loaded by JavaScript. One way to retrieve it is to use a package such as Selenium, which runs the page's scripts in a real browser.
You can try this:
CODE:
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.firefox.options import Options as FirefoxOptions
# Use options to run Selenium headless
options = FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
url = "http://opor.in/ProductDetail/Index?ProductId=212"
driver.get(url)
page = driver.page_source
# Close the browser once the rendered HTML has been captured
driver.quit()
html = bs(page, 'html.parser')
model_name = html.find('div', {'class':'product-name'})
spec = html.find('div', {'class':'specification'})
print(model_name)
print(spec)
RESULTS:
<div class="product-name">WHALE 25 LPH</div>
<div class="specification">
<div>Specifications</div>
<table><tr><td class="specification-group" colspan="2"><div>General</div></td></tr><tr><td>Product Code</td><td>1601KFMB</td></tr><tr><td>MEMBRANE</td><td>MEMBRELLA -ALPHA- 80 GPD (2 NOS)</td></tr><tr><td>PUMP</td><td>KEMFLO 48 V</td></tr><tr><td class="specification-group" colspan="2"><div>Specifications</div></td></tr><tr><td>APPLICATION</td><td>SUITABLE FOR BRACKISH WATER</td></tr><tr><td>FILTER LIFE</td><td>APPROX 3000 LITRE / 6 MONTHS</td></tr><tr><td>FILTERS</td><td>SEDIMENT, PRECARBON, POST CARBON</td></tr><tr><td>FLOAT</td><td>MEMBRELLA</td></tr><tr><td>FR</td><td>MEMBRELLA /KFL</td></tr><tr><td>INLINE SET</td><td>MEMBRELLA</td></tr><tr><td>INPUT VOLTAGE</td><td>100-300 VOLT AC (50Hz)</td></tr><tr><td>INSTALLATION</td><td>COUNTER TOP</td></tr><tr><td>MAX.OPERATION TDS</td><td>4000 PPM</td></tr><tr><td>MEMBRANE TYPE</td><td>THIN FILM COMPOSITE</td></tr><tr><td>MIN.INLET PRESSURE / TEMP</td><td>0.3 kg / cm2, 10 °C</td></tr><tr><td>MODEL</td><td>WHALE 25</td></tr><tr><td>OPERATING VOLTAGE</td><td>48 VOLT (DC)</td></tr><tr><td>PRODUCT DIMENSION</td><td>21.1 (H) x 9.9 (W) x 16.7 (L)</td></tr><tr><td>PURIFICATION CAPACITY</td><td>25 LITRES PER HOUR</td></tr><tr><td>RECOVERY RATE</td><td>MORE THAN 30% AT 27°c ± 2°c</td></tr><tr><td>SMPS</td><td>MEMBRELLA / EQUALIANT</td></tr><tr><td>SOLENOID VALVE</td><td>MEMBRELLA / SLX</td></tr><tr><td>STORAGE CAPACITY</td><td>20 LITRES</td></tr><tr><td>TECHNOLOGY</td><td>REVERSE OSMOSIS SYSTEM</td></tr><tr><td>TOTAL POWER CONSUMPTION</td><td>50 W</td></tr><tr><td>TUBE 1/4</td><td>5 METERS</td></tr><tr><td>TUBE 3/8</td><td>2 METERS</td></tr><tr><td>WEIGHT</td><td>18 kg (Approx)</td></tr><tr><td>WARRENTY & SUPPORT</td><td>Since Whale designs its purifiers and many of its parts are a truly integrated system. Dealer only can provide one-stop service ,guaranty and support for any service and maintenance, so most issues can be resolved in a single visit</td></tr></table>
</div>
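Once the rendered HTML is available, the specification rows can be collected into a dictionary. A minimal sketch, assuming the table layout shown in the results above (two-cell rows for name and value, single-cell rows with colspan for group headings). Here sample_html is a shortened copy of that output and spec_table_to_dict is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Shortened copy of the rendered output shown above.
sample_html = """
<div class="specification">
<div>Specifications</div>
<table><tr><td class="specification-group" colspan="2"><div>General</div></td></tr>
<tr><td>Product Code</td><td>1601KFMB</td></tr>
<tr><td>PUMP</td><td>KEMFLO 48 V</td></tr></table>
</div>
"""


def spec_table_to_dict(html):
    """Map each specification name to its value, skipping group heading rows."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="specification")
    specs = {}
    for row in container.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 2:  # group headings have a single cell with colspan
            name = cells[0].get_text(strip=True)
            specs[name] = cells[1].get_text(strip=True)
    return specs


print(spec_table_to_dict(sample_html))
```

With the Selenium code above, driver.page_source would be passed in place of sample_html.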

Related

How can I extract specific text and link from div class using a BeautifulSoup

I am trying to extract text and links from this website: https://www.rexelusa.com/s/terminal-block-end-stops?cat=61imhp2p
In my code, I am first trying to extract all the CAT# numbers.
This is my code:
import selenium.webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.firefox.options import Options
options = Options()
options.binary_location = r"C:\Program Files\Mozilla Firefox\firefox.exe"
url = "https://www.rexelusa.com/s/terminal-block-end-stops?cat=61imhp2p"
# Raw string so the backslashes in the Windows path are not read as escapes
driver = selenium.webdriver.Firefox(options=options, executable_path=r'C:\webdrivers\geckodriver.exe')
driver.get(url)
soup = BeautifulSoup(driver.page_source,"html.parser")
all_div = soup.find_all("div", class_= 'row no-gutters')
#print(all_div)
for div in all_div:
    all_items = div.find_all(class_= 'pr-4 col col-auto')
    for item in all_items:
        print(item)
driver.quit()
My expected output is every CAT# number (92 in total) together with its category details, for example:
CAT #: 1492-EAJ35
Categories
Control & Automation
Terminal Blocks
Terminal Blocks Accessories
Terminal Block End Stops

Use Javascript to Scrape table by selecting some other elements and clicking on button

I'm working with financial data and I want to scrape a table from this website using JavaScript, adding the script.js to my index.html file (https://www.sikafinance.com/marches/historiques?s=BRVMAG).
The table on that site depends on four arguments:
•Ticker, which is BRVMAG in the minimal example
•dlPeriod
•datefrom
•dateto
The request is then submitted by clicking the button
btnChange = "OK".
With the R code below I get the table, but how can I achieve the same thing with fetch?
I would like to be able to get the other tables when I change the start and end dates, now using JavaScript in Visual Studio Code.
I have been searching everywhere for an answer since yesterday, without result.
Does anyone have any idea how to retrieve the whole table and return it as HTML?
The bottom images show what I noticed when I inspected their site.
I think the whole table may be available and the site filters it by date (the gap between the dates must not exceed 3 months).
library(httr)
library(rvest)
first_date <- as.Date("2022-02-01")
end_date <- as.Date("2022-03-29")
query_params <- list(dlPeriod = "Journalière",
                     datefrom = first_date,
                     dateto = end_date,
                     btnChange = "OK")
parameter_response <- GET("https://www.sikafinance.com/marches/historiques?s=BRVMAG", query_params)
parameter_response1 <- httr::content(parameter_response, as = "text", encoding = "UTF-8")
parameter_response2 <- read_html(parameter_response1) %>%
  html_node('#tblhistos') %>%
  html_table()
parameter_response2
# Date Clôture `Plus bas` `Plus haut` Ouverture `Volume Titres` `Volume FCFA` `Variation %`
# <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr>
# 1 29/04/2022 312,09 312,09 312,09 312,09 - 0 2,53%
# 2 28/04/2022 304,38 304,38 304,38 304,38 - 0 0,00%
# 3 27/04/2022 304,38 304,38 304,38 304,38 - 0 2,69%
# 4 26/04/2022 296,42 296,42 296,42 296,42 - 0 0,81%
# 5 25/04/2022 294,05 294,05 294,05 294,05 - 0 1,34%
# 6 22/04/2022 290,17 290,17 290,17 290,17 - 0 0,36%

Execute JS using Python and store results in an array

I am working on this website where there is an SVG map and radio buttons as filters :
To get the result of the filter in the map (countries colored in blue), I execute this javascript snippet :
var n = $("input[name=adoptionStatus]:checked").val();
(n == undefined || n === "") && (n = "00000000000000000000000000000000");
$(".status-text").hide();
$("#" + n).show();
$("#vmap").vectorMap("set", "colors", resetColorsData);
resetColorsData = {};
colorsData = {};
$("#countries_list a").each(function(t, i) {
$(i).data("adoption-status").indexOf(n) >= 0 && (colorsData[$(i).data("country-code")] = "#2f98cb", resetColorsData[$(i).data("country-codecountry-code")] = "#fefefe")
});
$("#vmap").vectorMap("set", "colors", colorsData)
The variable n is used to store the value of the radio button like in this case cae64c6b731d47cca7565b2a74d11d53 :
<div class="map-filter-radio radio">
<label>
<input type="radio" name="adoptionStatus" alt="IFRS Standards are permitted, but not required, for use by at least some domestic publicly accountable entities, including listed companies and financial institutions." title="IFRS Standards are permitted, but not required, for use by at least some domestic publicly accountable entities, including listed companies and financial institutions." value="cae64c6b731d47cca7565b2a74d11d53">
IFRS Standards are permitted but not required for domestic public companies
</label>
</div>
When I execute the Javascript in the console and try to get the colorsData, I get the list of the countries colored in blue like below :
bm: "#2f98cb"
ch: "#2f98cb"
gt: "#2f98cb"
iq: "#2f98cb"
jp: "#2f98cb"
ky: "#2f98cb"
mg: "#2f98cb"
ni: "#2f98cb"
pa: "#2f98cb"
py: "#2f98cb"
sr: "#2f98cb"
tl: "#2f98cb"
How can I execute the JS script on the webpage and get the result of the colored countries in an array using python ?
Looking at the list of countries under #countries_list, you get a list of a tags like the following:
<a id="country_bd" data-country-code="bd" data-adoption-status="97f9b22998d546f7856bb1b4f0586521|3adc18f07ff64c908a6d835e08344531|ff784361818644798ea899f81b8b6d61" href="/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/bangladesh/">
<img src="/-/media/7bfd06a698594c2cb3614578a41caa9e.ashx" alt="Bangladesh">
Bangladesh
</a>
The data-adoption-status attribute is a list of adoptionStatus values delimited by |.
You just need to split it and keep only the countries that reference the value of the selected adoptionStatus input, like this:
if selectedAdoptionStatus in t["data-adoption-status"].split("|")
The following code lists all input tags and extracts the adoptionStatus of each one, prompts the user to choose a filter (0 to 4), and gets the selected countries by filtering on the data-adoption-status attribute:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/")
soup = BeautifulSoup(r.text, "html.parser")
choiceContainer = soup.find("div", {"class":"map-filters"})
choices = [
    (t["title"], t["value"])
    for t in choiceContainer.find_all("input")
]
for idx, choice in enumerate(choices):
    print(f"[{idx}] {choice[0]}")
val = input("Choose a filter index : ")
choice = choices[int(val)]
print(f"You have chosen {choice[0]}")
selectedAdoptionStatus = choice[1]
countryList = soup.find("div", {"id":"countries_list"})
selectedCountries = [
    {
        "countryCode": t["data-country-code"],
        "adoptionStatus": t["data-adoption-status"].split("|"),
        "link": t["href"],
        "country": t.find("img")["alt"]
    }
    for t in countryList.find_all("a")
    if selectedAdoptionStatus in t["data-adoption-status"].split("|")
]
for it in selectedCountries:
    print(it["country"])
Sample output
[0] IFRS Standards are required for use by all or most domestic publicly accountable entities.
[1] IFRS Standards are permitted, but not required, for use by at least some domestic publicly accountable entities, including listed companies and financial institutions.
[2] IFRS Standards are required or permitted for use by foreign securities issuers.
[3] In most cases an SME may also choose full IFRS Standards. In some cases, an SME may also choose local standards for SMEs.
[4] The body with authority to adopt financial reporting standards is actively studying whether to adopt the <em>IFRS for SMEs</em> Standard.
Choose a filter index : 1
You have chosen IFRS Standards are permitted, but not required, for use by at least some domestic publicly accountable entities, including listed companies and financial institutions.
Bermuda
Cayman Islands
Guatemala
Iraq
Japan
Madagascar
Nicaragua
Panama
Paraguay
Suriname
Switzerland
Timor-Leste

Can't pass values to or from Python using JQuery

I am trying to pass values from a website to Python, and get a return passed back to be displayed. The input location is as follows:
<form> <!-- create inputs -->
<input type="text" id="WS" style="font-size:10pt; height:25px" required><br> <!-- Wind speed input box -->
<br>
</form>
The user then clicks a button (in this case Lin):
<button id="Lin" style="height:30px; width:10%; background-color:#5188e0; border-color: black; color: white; font-weight: bold" title="Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data">Linear Regression</button>
This should pass the data to the following script:
$("#Lin").click(
    function(e) {
        $.get('/api/Lin/' + $('#WS').val(), function(data) {
            $('#Power_Est').val(data.value);
        });
    });
The output box is:
<form>
<label>Estimated power output (KW/h):</label> <!-- Power label -->
<input class="form-control" id="Power_Est" type="text" style="font-size:10pt; height:25px" placeholder="Power Estimate" readonly><br> <!-- Power Estimate box -->
<br>
</form>
The Python script I have is:
import flask as fl
import numpy as np
import joblib
app = fl.Flask(__name__) # Create a new web app.
@app.route("/") # Add root route.
def home(): # Home page
    return app.send_static_file("Front_Page.html") # Return the index.html file
@app.route("/api/Lin/<float:speed>", methods = ["GET"]) # If the Linear Regression button is chosen
def Lin_Reg(speed): # Call the linear regression function
    lin_model_load = joblib.load("Models/lin_reg.pkl") # Reimport the linear regression model
    power_est = np.round(lin_model_load.predict(speed)[0], 3) # Use the linear regression model to estimate the power for the user inputted speed
    return power_est # Return the power estimate
Whenever I run the above with Flask and open http://127.0.0.1:5000/, I get the following error message:
127.0.0.1 - - [30/Dec/2020 21:07:19] "GET /api/Lin/20 HTTP/1.1" 404 -
Any suggestions on how to correct this?
Edit 1
Using the below:
def Lin_Reg(speed): # Call the linear regression function
    print(speed)
    speed = speed.reshape(1, -1)
    lin_model_load = joblib.load("Models/lin_reg.pkl") # Reimport the linear regression model
    power_est = lin_model_load.predict(speed) # Use the linear regression model to estimate the power for the user inputted speed
    return power_est # Return the power estimate
print(Lin_Reg(20.0))
The error is:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
in
      6     return power_est # Return the power estimate
      7
----> 8 print(Lin_Reg(20.0))
in Lin_Reg(speed)
      1 def Lin_Reg(speed): # Call the linear regression function
      2     print(speed)
----> 3     speed = speed.reshape(1, -1)
      4     lin_model_load = joblib.load("Models/lin_reg.pkl") # Reimport the linear regression model
      5     power_est = lin_model_load.predict(speed) # Use the linear regression model to estimate the power for the user inputted speed
AttributeError: 'float' object has no attribute 'reshape'
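Make sure the URL contains a float rather than an integer. Flask's float converter only matches values with a decimal point, so /api/Lin/20 returns 404. parseFloat alone is not enough, because JavaScript writes the number 20 back into the URL as "20"; format it with toFixed so the decimal point is always present.
$.get('/api/Lin/' + parseFloat($('#WS').val()).toFixed(2), function(data) {
    $('#Power_Est').val(data.value);
});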
Also call predict with a 2D array.
power_est = np.round(lin_model_load.predict([[speed]])[0], 3)
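One more thing worth checking, sketched here under assumptions: the jQuery callback reads data.value, so the route needs to return JSON with a value key rather than a bare number. The sketch below uses a hypothetical predict_power function with a made-up coefficient in place of the joblib model, so that it runs on its own. It also registers an int route next to the float one, so that /api/Lin/20 matches as well.

```python
from flask import Flask, jsonify

app = Flask(__name__)


def predict_power(speed):
    """Hypothetical stand-in for lin_model_load.predict([[speed]])[0]."""
    return 2.5 * speed  # made-up coefficient, for illustration only


@app.route("/api/Lin/<int:speed>")    # matches /api/Lin/20
@app.route("/api/Lin/<float:speed>")  # matches /api/Lin/20.5
def lin_reg(speed):
    # Convert to a plain Python float so the value is JSON serialisable
    power_est = round(float(predict_power(float(speed))), 3)
    # The front end reads data.value, so return an object with that key
    return jsonify(value=power_est)


if __name__ == "__main__":
    # Exercise the routes without starting a server
    with app.test_client() as client:
        print(client.get("/api/Lin/20").get_json())
        print(client.get("/api/Lin/20.5").get_json())
```

In the real application, predict_power would be replaced by the call to the loaded model shown above.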

Convert text to integer and if higher than 50 alert

Our current website is hosted offsite by another company and is built entirely with .NET; I only have access to the HTML, JS, and CSS files. A lot of data is output on the page via tokens. One of them is a weight token, which grabs the item's weight and outputs it inside a span tag. If you view the source, it shows the following:
<span id="order_summary_weight">78.000000 lbs</span>
The token outputs the lbs by default. What I need is for JavaScript to grab the 78.000000, convert it to a number, and, if that number is greater than 50.000000, append a line after the 78.000000 lbs saying "Your weight total is over 50 lbs, we will contact you directly with a shipping charge." Note that some weight totals may be as small as 0.010000.
I'm coming to you fine folks here because I am at a complete loss as to where to start.
Something like this ? :
html :
<div class="wrap">
<span class="price" id="order_summary_weight">78.000000 lbs</span>
</div>
<hr>
<div class="wrap">
<span class="price" id="order_summary">50.000000 lbs</span>
</div>
JS :
$('.wrap').each(function(){
    var price = $(this).find('.price').text();
    price = price.replace(' lbs', '');
    // parseFloat keeps the decimals, so 50.5 still counts as over 50
    price = parseFloat(price);
    if(price > 50){
        $(this).append('<div class="alert">Your weight total is over 50 lbs, we will contact you directly with a shipping charge.</div>');
    }
});
DEMO : http://jsfiddle.net/w3qg4/1/
function getWeight()
{
    // Read the span's text ("78.000000 lbs") and parse the leading number
    var x = parseFloat(document.getElementById("order_summary_weight").textContent);
    if(x > 50){
        alert("Your weight total is over 50 lbs, we will contact you directly with a shipping charge.");
    }
}
