I have an HTML page and I want to extract the title, which sits inside a <script> tag in the object _BFD.BFD_INFO. I have accessed all the data inside, but it contains a lot of other data (links etc.) and now I don't know how to get at the title I want to extract. Kindly help me with it.
The code I have written so far is:
import bs4 as bs
import requests

sauce = requests.get('https://www.meishij.net/zuofa/huaguluobodunpaigutang.html')
print(sauce.status_code)
soup = bs.BeautifulSoup(sauce.content, 'html.parser')
print(soup.find_all("script", type="text/javascript")[9])
and this is the HTML:
<script type="text/javascript">
_czc.push(['_trackEvent','pc','pc_news']);
_czc.push(['_trackEvent','pc','pc_news_class_6']);
window["_BFD"] = window["_BFD"] || {};
_BFD.BFD_INFO = {
"title" :"花菇萝卜炖排骨汤",
</script>
I am not that good at regex, which could find the 'title' in a single line, but I think the code below should work.
import requests
from bs4 import BeautifulSoup

url = 'https://www.meishij.net/zuofa/huaguluobodunpaigutang.html'
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
link = requests.get(url, headers=headers)
soup = BeautifulSoup(link.content, "lxml")
scripts = soup.find_all("script")
for script in scripts:
    if "_BFD.BFD_INFO" in script.text:
        text = script.text
        m_text = text.split('=')       # the object literal follows the second '='
        m_text = m_text[2].split(":")  # the "title" value follows the first ':'
        m_text = m_text[1].split(',')  # drop everything after the trailing comma
        print(m_text[0])
Update for fetching pic:
for script in scripts:
    text = script.text
    m_text = text.split(',')
    for n in m_text:
        if 'pic' in n:
            print(n)
Output:
C:\Users\siva\Desktop>python test.py
"pic" :"http://s1.st.meishij.net/r/216/197/6174466/a6174466_152117574296827.jpg"
Update 2:
for script in scripts:
    text = script.text
    m_text = text.split('_BFD.BFD_INFO')
    for t in m_text:
        if "title" in t:
            print(t.split(","))
Output:
C:\Users\SSubra02\Desktop>python test.py
[' = {\r\n"title" :"????????"', '\r\n"pic" :"http://s1.st.meishij.net/r/216/197/
6174466/a6174466_152117574296827.jpg"', '\r\n"id" :"1883528"', '\r\n"url" :"http
s://www.meishij.net/zuofa/huaguluobodunpaigutang.html"', '\r\n"category" :[["??"
', '"https://www.meishij.net/chufang/diy/recaipu/"]', '["??"', '"https://www.mei
shij.net/chufang/diy/tangbaocaipu/"]', '["???"', '"https://www.meishij.net/chufa
ng/diy/jiangchangcaipu/"]', '["??"', '"https://www.meishij.net/chufang/diy/wucan
/"]', '["??"', '"https://www.meishij.net/chufang/diy/wancan/"]]', '\r\n"tag" :["
??"', '"??"', '"??"', '"????"', '"????"', '"????"]', '\r\n"author":"????"', '\r\
n"pinglun":"3"', '\r\n"renqi":"4868"', '\r\n"step":"7?"', '\r\n"gongyi":"?"', '\
r\n"nandu":"????"', '\r\n"renshu":"4??"', '\r\n"kouwei":"???"', '\r\n"zbshijian"
:"10??"', '\r\n"prshijian":"<90??"', '\r\n"page_type" :"detail"\r\n};window["_BF
D"] = window["_BFD"] || {};_BFD.client_id = "Cmeishijie";_BFD.script = document.
createElement("script");_BFD.script.type = "text/javascript";_BFD.script.async =
true;_BFD.script.charset = "utf-8";_BFD.script.src =((\'https:\' == document.lo
cation.protocol?\'https://ssl-static1\':\'http://static1\')+\'.baifendian.com/se
rvice/meishijie/meishijie.js\');']
Let me know if you are facing any issues.
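If you would rather avoid the brittle chain of split() calls, a short regex over the same script text also works. This is only a sketch, run against the sample script content quoted in the question rather than the live page:

```python
import re

# Stand-in for script.text, taken from the sample in the question.
script_text = '''
window["_BFD"] = window["_BFD"] || {};
_BFD.BFD_INFO = {
"title" :"花菇萝卜炖排骨汤",
"pic" :"http://s1.st.meishij.net/r/216/197/6174466/a6174466_152117574296827.jpg",
'''

# Capture the quoted value that follows "title" :
match = re.search(r'"title"\s*:\s*"([^"]+)"', script_text)
if match:
    print(match.group(1))  # 花菇萝卜炖排骨汤
```

The same pattern with "pic" in place of "title" extracts the image URL.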
Related
I want to use Selenium WebDriver to capture a piece of information.
I want to capture the following value:
7197409
Below is the site's HTML; I want to extract "7197409":
<script type="text/javascript">
var messageid = 7197409;
var highlight_id = -1;
var authorOnly = "N";
var ftype = 'MB';
var adsenseFront = '<table width="99%" cellspacing="0" cellpadding="0" style="background-color: #000000; margin-left: auto; margin-right: auto;"><tr><td style="width: 100%; background-color: #F7F3F7;">';
var adsenseEnd = '</td></tr></table>';
var Submitted = false;
var subject = true;
var HiddenThreads = new Array(26); //Temp variable to save the threads temporary
var blocked_list = Sys.Serialization.JavaScriptSerializer.deserialize('[]');
var currentUser = undefined;
var followList = [];
var lock = false;
</script>
I checked that its full XPath is /html/body/form/div[5]/div/div/div[2]/div[1]/script/text()
I use the following code:
from datetime import date,datetime
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
import numpy as np
import xlrd
import csv
import codecs
import time
url = "https://forumd.hkgolden.com/view.aspx?type=MB&message=7197409"
driver_blank=webdriver.Chrome('./chromedriver')
driver_blank.get(url)
id=driver_blank.find_element_by_xpath("/html/body/form/div[5]/div/div/div[2]/div[1]/script/text()")
print("ID:"+id.text)
driver_blank.close()
However, I got the following error message. It says that the result of the xpath expression "/html/body/form/div[5]/div/div/div[2]/div[1]/script/text()" is [object Text], and it should be an element.
DevTools listening on
ws://127.0.0.1:50519/devtools/browser/845d0800-1dd9-4f8a-a847-7d955c8cc5e3
libpng warning: iCCP: cHRM chunk does not match sRGB
[16136:16764:0411/213956.920:ERROR:ssl_client_socket_impl.cc(941)] handshake failed; returned -1, SSL error code 1, net_error -107
[16136:16764:0411/213957.351:ERROR:ssl_client_socket_impl.cc(941)] handshake failed; returned -1, SSL error code 1, net_error -107
Traceback (most recent call last):
File ".\test.py", line 28, in
id=driver_blank.find_element_by_xpath("/html/body/form/div[5]/div/div/div[2]/div1/script/text()")
File "C:\Program Files\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py",
line 394, in find_element_by_xpath
return self.find_element(by=By.XPATH, value=xpath)
File "C:\Program Files\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py",
line 978, in find_element
'value': value})['value']
File "C:\Program Files\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py",
line 321, in execute
self.error_handler.check_response(response)
File "C:\Program Files\Python37\lib\site-packages\selenium\webdriver\remote\errorhandler.py",
line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidSelectorException: Message: invalid selector: The result of the xpath expression
"/html/body/form/div[5]/div/div/div[2]/div1/script/text()" is:
[object Text]. It should be an element.
(Session info: chrome=80.0.3987.132)
I want to ask two questions:
How do I solve the error?
How do I get only the text 7197409 from the same XPath location?
Can anyone help me? Thanks.
First find the script WebElement:
div = driver.find_element_by_id("ctl00_ContentPlaceHolder1_view_form")
script = div.find_element_by_tag_name('script')
Get the script InnerHTML:
text = script.get_attribute('innerHTML')
print(text)
Find the line containing "var messageid":
line = [l for l in text.split("\n") if "var messageid" in l][0]
print("Line:", line)
Get the number from the line:
ix_1 = line.find("=")
ix_2 = line.find(";")
number = int(line[ix_1+1:ix_2])
print("Number:", number)
Out (Tested in Chromium 80.x):
Number: 7197409
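Instead of slicing the line with find("=") and find(";"), the number can also be pulled with a single regex over the script's innerHTML. A sketch against the script text quoted in the question:

```python
import re

# Stand-in for script.get_attribute('innerHTML') from the page.
text = '''
var messageid = 7197409;
var highlight_id = -1;
'''

match = re.search(r"var messageid\s*=\s*(\d+)\s*;", text)
if match:
    print("Number:", int(match.group(1)))  # Number: 7197409
```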
I am trying to write a script that sends an email when you press a button in React; my problem is that I haven't found any good way to do so. I currently have the function in a views file, as follows:
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def sender(request):
    me = "username"
    my_password = r"my password"
    you = info.__str__
    msg = MIMEMultipart('alternative')
    msg['Subject'] = "Alert"
    msg['From'] = me
    msg['To'] = you
    html = '<html><body><p>hello world</p></body></html>'
    part2 = MIMEText(html, 'html')
    msg.attach(part2)
    s = smtplib.SMTP_SSL('smtp.gmail.com')
    s.login(me, my_password)
    s.sendmail(me, you, msg.as_string())
    s.quit()
and import it in the urls file:
urlpatterns = [path('send/', views.sender)]
I am using axios on the react front with the following code
axios.get('http://127.0.0.1:8000/api/send/')
and it gives me this error when I try to run it:
AttributeError: 'function' object has no attribute 'encode'
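The traceback points at a likely cause: info.__str__ (without parentheses) is a bound method object, not a string, so msg['To'] ends up holding a callable and the email library fails when it tries to encode the headers. A minimal demonstration, using a hypothetical Info class as a stand-in for the question's info object:

```python
class Info:
    """Hypothetical stand-in for the question's `info` object."""
    def __str__(self):
        return "recipient@example.com"

info = Info()

you = info.__str__      # a bound method object, NOT a string
print(callable(you))    # True -- this is what later fails to .encode()

you = str(info)         # calling str() yields the actual string
print(you)              # recipient@example.com
```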
I'm trying to use a weather API for a basic website and I'd like to use its icons too. The request works in both environments, but in my local environment I get an error for the icon:
GET file://cdn.apixu.com/weather/64x64/night/116.png net::ERR_FILE_NOT_FOUND
I thought it was related to https, but probably not, since it's only the image that won't load.
const key = 'b7e1e81e6228412cbfe203819180104';
const url = `https://api.apixu.com/v1/current.json?key=${key}&q=auto:ip`
const main = document.getElementById('main');
$.getJSON( url, function(json) {
const loc = json.location;
const cur = json.current;
const condition = {text: cur.condition.text, icon: cur.condition.icon}
main.innerHTML = `<img src = ${condition.icon}><div>${condition.text}</div>`
})
so ${cur.condition.text} will display "partly cloudy", but the icon does not display. Any advice?
Update: it seems to be working fine with live-server.
It may be that Cross-Origin Resource Sharing (CORS) does not allow it. Please make sure that you are allowed to access those resources.
See https://enable-cors.org/ to read up more about CORS.
Secondly,
<img src = ${condition.icon}>
should be
<img src="${condition.icon}">
You are forgetting the quotation marks.
https://www.w3schools.com/tags/tag_img.asp - Read more on image tags.
Additionally, add http: to the image src, like <img src="http:${condition.icon}">, and use the code below:
const key = 'b7e1e81e6228412cbfe203819180104';
const url = `https://api.apixu.com/v1/current.json?key=${key}&q=auto:ip`
const main = document.getElementById('main');
$.getJSON(url, function(json) {
const loc = json.location;
const cur = json.current;
const condition = {
text: cur.condition.text,
icon: cur.condition.icon
}
main.innerHTML = `<img src="http:${condition.icon}"><div>${condition.text}</div>`
})
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="main"></div>
The icon is returned in the JSON as a protocol-relative URL (one without a scheme), i.e. //url.
Locally the page uses the file:// protocol, which assumes the resource is on the local machine, but it is not.
To avoid this issue locally, add http: or https: to the image src, like <img src="http:${condition.icon}">.
const key = 'b7e1e81e6228412cbfe203819180104';
const url = `https://api.apixu.com/v1/current.json?key=${key}&q=auto:ip`
const main = document.getElementById('main');
$.getJSON(url, function(json) {
const loc = json.location;
const cur = json.current;
const condition = {
text: cur.condition.text,
icon: cur.condition.icon
}
main.innerHTML = `<img src =http:${condition.icon}><div>${condition.text}</div>`
})
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="main"></div>
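The same scheme-relative resolution can be reproduced with Python's standard urllib.parse.urljoin, which makes it clear why the icon loads from an http(s) page but not from a local file:// page:

```python
from urllib.parse import urljoin

icon = "//cdn.apixu.com/weather/64x64/night/116.png"

# Served over https, the browser fills in the page's scheme:
print(urljoin("https://example.com/page", icon))
# -> https://cdn.apixu.com/weather/64x64/night/116.png

# Opened from disk, the scheme is file:// and the request fails:
print(urljoin("file:///C:/site/index.html", icon))
# -> file://cdn.apixu.com/weather/64x64/night/116.png
```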
Here is my code. When I execute it, some information from the original website's source code is included in the results:
import urllib.request
from bs4 import BeautifulSoup
import re

URLdict = dict()
pageNum = 1
while pageNum < 2:
    user_agent = 'Chrome/58.0(compatible;MSIE 5.5; Windows 10)'
    headers = {'User-Agent': user_agent}
    if pageNum == 0:
        response = urllib.request.urlopen('http://www.1905.com/list-p-catid-221.html')
    else:
        url = 'http://www.1905.com/list-p-catid-221.html' + '?refresh=1321407488&page=' + str(pageNum)
        request = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(request)
    first = response.read().decode('utf-8')
    BSobj = BeautifulSoup(first, "html.parser")
    for a in BSobj.findAll("a", href=True):
        if re.findall('/news/', a['href']):
            URLdict[a['href']] = a.get_text()
    #print(URLdict)
    for link, title in URLdict.items():
        print(title, ":", link)
        ContentRequest = urllib.request.Request(link, headers=headers)
        ContentResponse = urllib.request.urlopen(ContentRequest)
        ContentHTMLText = ContentResponse.read().decode('utf-8')
        ContentBSobj = BeautifulSoup(ContentHTMLText, "html.parser")
        Content = ContentBSobj.find("div", {"class": "mod-content"})
        if Content is not None:
            print(Content.get_text())
    pageNum = pageNum + 1
I checked the original source; this information comes from <script> tags, and it looks like this:
var ATLASCONFIG = {
    id:"1197971",
    prevurl:"http:www.1905.com/news/20170704/1197970.shtml#p1",
    nexturl:"http:www.1905.com/news/20170704/1197970.shtml#p1",
    shareIframe:"http://www.1905.com/api/share2.php?...."
}
This information appears in the results even though it is not in my code. (I can't post more than two links, so I deleted the "//".) I want to ask how to remove it.
Use delete to remove a property from an object.
var ATLASCONFIG = {
id:"1197971",
prevurl:"http:www.1905.com/news/20170704/1197970.shtml#p1",
nexturl:"http:www.1905.com/news/20170704/1197970.shtml#p1",
shareIframe:"http://www.1905.com/api/share2.php?...."
}
delete ATLASCONFIG["shareIframe"];
console.log(ATLASCONFIG);
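Note that delete removes a property from a live JavaScript object in the browser. Since the scraping here is done in Python, another option is to strip the <script> tags from the parsed content before calling get_text(). A sketch with BeautifulSoup's decompose(), run against a minimal stand-in for the page:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the scraped "mod-content" block.
html = '''<div class="mod-content">
<p>Article text.</p>
<script>var ATLASCONFIG = {id:"1197971"};</script>
</div>'''

content = BeautifulSoup(html, "html.parser").find("div", {"class": "mod-content"})
# Remove every <script> element before extracting the text.
for script in content.find_all("script"):
    script.decompose()
print(content.get_text())
```

After decompose(), get_text() returns only the article text, with no ATLASCONFIG residue.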
I have this code:
import urllib
from bs4 import BeautifulSoup
import time
url = "http://www.downloadcrew.com/article/31121-magix_movie_edit_pro_2014_premium"
pageUrl = urllib.urlopen(url)
time.sleep(2)
soup = BeautifulSoup(pageUrl)
for a in soup.select("div.downloadLink a[href]"):
print "downloadlink: "+a["href"]
for b in soup.select("h1#articleTitle"):
print b
for c in soup.select("table.detailsTable"):
print c
What I want is the application name, date updated, developer, and download link.
When I tried to run it, the output was everything inside each tag.
Here is the code that gets what you want:
import urllib
from bs4 import BeautifulSoup
import time
url = "http://www.downloadcrew.com/article/31121-magix_movie_edit_pro_2014_premium"
pageUrl = urllib.urlopen(url)
time.sleep(2)
soup = BeautifulSoup(pageUrl)
for a in soup.select("div.downloadLink a[href]"):
print "downloadlink: " + "?" + a["href"].split("?")[1].split(",")[0]
for b in soup.select("h1#articleTitle"):
print b.contents[0].strip()
for c in soup.findAll("th"):
if c.text == "Date Updated:":
print c.parent.td.text
elif c.text == "Developer:":
print c.parent.td.text
But you can't download the file with that URL. You will need to check JavaScript source files to see what javascript:checkDownload() does to get the actual file location.
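For reference, the answer above is Python 2 (print statements, urllib.urlopen). The same extraction ported to Python 3, run against a hypothetical fragment that mimics the page structure the selectors assume (the real markup may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloadcrew page structure.
html = '''
<h1 id="articleTitle">MAGIX Movie Edit Pro 2014 Premium</h1>
<table class="detailsTable">
<tr><th>Date Updated:</th><td>2013-10-01</td></tr>
<tr><th>Developer:</th><td>MAGIX</td></tr>
</table>
<div class="downloadLink"><a href="/download?id=31121,0">Download</a></div>
'''

soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("h1#articleTitle").get_text(strip=True))
for th in soup.find_all("th"):
    if th.get_text() in ("Date Updated:", "Developer:"):
        print(th.find_next_sibling("td").get_text())
print("downloadlink:", soup.select_one("div.downloadLink a[href]")["href"])
```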