How to remove the <script> content from news pages I crawled

Here is my code. When I run it, the results include some text that comes from the source code of the original website.
import urllib.request
from bs4 import BeautifulSoup
import re

URLdict = dict()
pageNum = 1
while pageNum < 2:
    user_agent = 'Chrome/58.0(compatible;MSIE 5.5; Windows 10)'
    headers = {'User-Agent': user_agent}
    if pageNum == 0:
        response = urllib.request.urlopen('http://www.1905.com/list-p-catid-221.html')
    else:
        url = 'http://www.1905.com/list-p-catid-221.html' + '?refresh=1321407488&page=' + str(pageNum)
        request = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(request)
    first = response.read().decode('utf-8')
    BSobj = BeautifulSoup(first, "html.parser")
    # Collect every link whose href contains /news/
    for a in BSobj.findAll("a", href=True):
        if re.findall('/news/', a['href']):
            URLdict[a['href']] = a.get_text()
    #print(URLdict)
    # Fetch each article and print the text of its content div
    for link, title in URLdict.items():
        print(title, ":", link)
        ContentRequest = urllib.request.Request(link, headers=headers)
        ContentResponse = urllib.request.urlopen(ContentRequest)
        ContentHTMLText = ContentResponse.read().decode('utf-8')
        ContentBSobj = BeautifulSoup(ContentHTMLText, "html.parser")
        Content = ContentBSobj.find("div", {"class": "mod-content"})
        if Content is not None:
            print(Content.get_text())
    pageNum = pageNum + 1
I checked the page source. This text comes from a <script> element and looks like this:
var ATLASCONFIG = {
id:"1197971",
prevurl:"http:www.1905.com/news/20170704/1197970.shtml#p1",
nexturl:"http:www.1905.com/news/20170704/1197970.shtml#p1",
shareIframe:"http://www.1905.com/api/share2.php?....
}
This text appears in the results, not in my code. (I cannot post more than two links, so I removed the "//" from the URLs above.) How can I remove it from the output?

Use delete to remove a property from an object.
var ATLASCONFIG = {
id:"1197971",
prevurl:"http:www.1905.com/news/20170704/1197970.shtml#p1",
nexturl:"http:www.1905.com/news/20170704/1197970.shtml#p1",
shareIframe:"http://www.1905.com/api/share2.php?...."
}
delete ATLASCONFIG["shareIframe"];
console.log(ATLASCONFIG);
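
The snippet above changes a JavaScript object in the browser. If the goal is instead to keep script text out of the scraped output on the Python side, one option is to skip <script> elements while collecting text. The following is a minimal sketch using only the standard library's html.parser rather than BeautifulSoup; the class name VisibleTextParser, the helper visible_text, and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample resembling the article markup in the question.
SAMPLE_HTML = """<div class="mod-content">
<p>First paragraph of the article.</p>
<script>var ATLASCONFIG = { id:"1197971" };</script>
<p>Second paragraph.</p>
</div>"""

result = visible_text(SAMPLE_HTML)
print(result)
```

With BeautifulSoup, as used in the question, a comparable approach is to call decompose() on each tag returned by Content.find_all("script") before calling Content.get_text().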

Related

How to run a script using django rest framework integrated with react

I am trying to call a script that sends an email when a button is pressed in React, but I haven't found a good way to do so. I currently have the function in a views file, as follows:
def sender(request):
    me = "username"
    my_password = r"my password"
    you = info.__str__
    msg = MIMEMultipart('alternative')
    msg['Subject'] = "Alert"
    msg['From'] = me
    msg['To'] = you
    html = '<html><body><p>hello world</p></body></html>'
    part2 = MIMEText(html, 'html')
    msg.attach(part2)
    s = smtplib.SMTP_SSL('smtp.gmail.com')
    s.login(me, my_password)
    s.sendmail(me, you, msg.as_string())
    s.quit()
and I import it in the URL file:
urlpatterns = [path('send/', views.sender)]
I am using axios on the React front end with the following code:
axios.get('http://127.0.0.1:8000/api/send/')
and it gives me this error when I try to run it:
AttributeError: 'function' object has no attribute 'encode'
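
One possible reading of this error, offered as an assumption because the question does not show how info is defined: if info is a class, then info.__str__ is a plain function rather than a string, and storing it in msg['To'] makes as_string() fail when that header is serialized. The sketch below reproduces this with a hypothetical Info class and a placeholder address, then builds the same header from str() of an instance.

```python
from email.mime.multipart import MIMEMultipart


class Info:
    """Hypothetical stand-in for the `info` object in the question."""

    def __str__(self):
        return "recipient@example.com"  # placeholder address


def build_message(recipient):
    msg = MIMEMultipart('alternative')
    msg['Subject'] = "Alert"
    msg['To'] = recipient
    return msg


# Passing the method itself (not its result) stores a function in the header.
try:
    build_message(Info.__str__).as_string()
    error = None
except AttributeError as exc:
    error = str(exc)

# Calling str() on an instance yields a real string, which serializes normally.
serialized = build_message(str(Info())).as_string()

print(error)
print("To: recipient@example.com" in serialized)
```

Under that assumption, the change in the view would be to pass a string such as str(some_info_instance) as the recipient instead of info.__str__.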

Extract title inside html <script> using BeautifulSoup in python3

I have an HTML page and I want to extract the title, which is inside a <script> tag in the object _BFD.BFD_INFO. I can access all of the data inside the tag, but it contains a lot of other data such as links, and I don't know how to get just the title. Kindly help me with it.
The code I have written so far is:
import bs4 as bs
import urllib3.request
import requests
sauce=requests.get('https://www.meishij.net/zuofa/huaguluobodunpaigutang.html')
print(sauce.status_code)
soup=bs.BeautifulSoup(sauce.content,'html.parser')
#print(soup.find_all("script", type="text/javascript")[9])
print(soup.find("script",type="text/javascript")[9])
and this is the html
<script type="text/javascript">
_czc.push(['_trackEvent','pc','pc_news']);
_czc.push(['_trackEvent','pc','pc_news_class_6']);
window["_BFD"] = window["_BFD"] || {};
_BFD.BFD_INFO = {
"title" :"花菇萝卜炖排骨汤",
</script>

A regex could probably find the 'title' in a single line, but I am not that good at regex. I think the code below should work.
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.meishij.net/zuofa/huaguluobodunpaigutang.html'
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
Link = requests.get(url, headers=headers)
soup =BeautifulSoup(Link.content,"lxml")
scripts = soup.find_all("script")
for script in scripts:
    if "_BFD.BFD_INFO" in script.text:
        text = script.text
        m_text = text.split('=')
        m_text = m_text[2].split(":")
        m_text = m_text[1].split(',')
        encoded = m_text[0].encode('utf-8')
        print(encoded.decode('utf-8'))
Update for fetching pic:
for script in scripts:
    text = script.text
    m_text = text.split(',')
    for n in m_text:
        if 'pic' in n:
            print(n)
Output:
C:\Users\siva\Desktop>python test.py
"pic" :"http://s1.st.meishij.net/r/216/197/6174466/a6174466_152117574296827.jpg"
Update 2:
for script in scripts:
    text = script.text
    m_text = text.split('_BFD.BFD_INFO')
    for t in m_text:
        if "title" in t:
            print(t.split(","))
Output:
C:\Users\SSubra02\Desktop>python test.py
[' = {\r\n"title" :"????????"', '\r\n"pic" :"http://s1.st.meishij.net/r/216/197/
6174466/a6174466_152117574296827.jpg"', '\r\n"id" :"1883528"', '\r\n"url" :"http
s://www.meishij.net/zuofa/huaguluobodunpaigutang.html"', '\r\n"category" :[["??"
', '"https://www.meishij.net/chufang/diy/recaipu/"]', '["??"', '"https://www.mei
shij.net/chufang/diy/tangbaocaipu/"]', '["???"', '"https://www.meishij.net/chufa
ng/diy/jiangchangcaipu/"]', '["??"', '"https://www.meishij.net/chufang/diy/wucan
/"]', '["??"', '"https://www.meishij.net/chufang/diy/wancan/"]]', '\r\n"tag" :["
??"', '"??"', '"??"', '"????"', '"????"', '"????"]', '\r\n"author":"????"', '\r\
n"pinglun":"3"', '\r\n"renqi":"4868"', '\r\n"step":"7?"', '\r\n"gongyi":"?"', '\
r\n"nandu":"????"', '\r\n"renshu":"4??"', '\r\n"kouwei":"???"', '\r\n"zbshijian"
:"10??"', '\r\n"prshijian":"<90??"', '\r\n"page_type" :"detail"\r\n};window["_BF
D"] = window["_BFD"] || {};_BFD.client_id = "Cmeishijie";_BFD.script = document.
createElement("script");_BFD.script.type = "text/javascript";_BFD.script.async =
true;_BFD.script.charset = "utf-8";_BFD.script.src =((\'https:\' == document.lo
cation.protocol?\'https://ssl-static1\':\'http://static1\')+\'.baifendian.com/se
rvice/meishijie/meishijie.js\');']
Let me know if you are facing any issues.
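
For the single-line regex the answer mentions, here is a minimal sketch. It assumes the script text has the shape shown in the question (a "title" key inside the _BFD.BFD_INFO object, with a double-quoted value); the names TITLE_RE and extract_title are made up for illustration, and SCRIPT_TEXT is a shortened copy of the question's snippet rather than a live download.

```python
import re

# Shortened copy of the <script> body from the question.
SCRIPT_TEXT = '''
_czc.push(['_trackEvent','pc','pc_news']);
window["_BFD"] = window["_BFD"] || {};
_BFD.BFD_INFO = {
"title" :"花菇萝卜炖排骨汤",
"pic" :"http://s1.st.meishij.net/r/216/197/6174466/a6174466_152117574296827.jpg",
'''

# Find the first "title" value that follows the _BFD.BFD_INFO assignment.
TITLE_RE = re.compile(
    r'_BFD\.BFD_INFO\s*=\s*\{.*?"title"\s*:\s*"([^"]*)"',
    re.DOTALL,
)


def extract_title(script_text):
    """Return the title string, or None if the pattern is not present."""
    match = TITLE_RE.search(script_text)
    return match.group(1) if match else None


title = extract_title(SCRIPT_TEXT)
print(title)
```

In the answer's loop this could be applied to script.text for each script, stopping at the first result that is not None. Because the pattern anchors on the key name, it does not depend on how many '=' or ':' characters appear earlier in the script.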

Javascript-Using Parsed Data From a Query String as a Heading

I am wondering how to take the information from a parsed query string and display it at the top of my page. Ignore the window.alert part of the code; I was just using it to verify that the function worked.
For example: if the user had a choice of Spring, Summer, Winter, and Fall, whichever they chose would be displayed as a header on the next page. So if seasonArray[i] is Fall, I want to transfer that information into the form and display it as a heading element. I'm sure this is easily done, but I can't figure it out. Thanks in advance.
function seasonDisplay() {
  var seasonVariable = location.search;
  seasonVariable = seasonVariable.substring(1, seasonVariable.length);
  while (seasonVariable.indexOf("+") != -1) {
    seasonVariable = seasonVariable.replace("+", " ");
  }
  seasonVariable = unescape(seasonVariable);
  var seasonArray = seasonVariable.split("&");
  for (var i = 0; i < seasonArray.length; ++i) {
    window.alert(seasonArray[i]);
  }
  if (window != top)
    top.location.href = location.href
}

<h1 id="DynamicHeader"></h1>
Replace the alert line with:
document.getElementById("DynamicHeader").insertAdjacentHTML('beforeend',seasonArray[i]);

Create Dropdown list from API Query

I am trying to create a script that pulls information from an XML document returned by an API and puts it into a 2D array.
Making the GET request
https://api.example.com/v1.svc/users?apikey=MY-KEY&source=MY-APP&limit=1000
returns XML for each user that looks like this:
<User>
<Id>Rdh9Rsi3k4U1</Id>
<UserName>firstlast#email.com</UserName>
<FirstName>First</FirstName>
<LastName>Last</LastName>
<Active>true</Active>
<Email>firstlast#email.com</Email>
<AccessLevel>Learner</AccessLevel>
</User>
Each user produces a similar block, one after another. How can this be parsed into an array? For example, the array would have 7 "columns" holding the fields shown above, with one row per user.

I figured it out, for anyone looking for an answer to this type of question in the future. The API I was trying to reach (not actually "citrowske.com" as shown in the example) did not allow CORS or JSONP, which left using a proxy as my only option.
Below is an example of code similar to what I ended up using, along with the test XML file it requests.
A basic explanation of how this works: it uses the proxy to get the XML file, which is passed to the callback as "xml" (the parameter in "function(xml)"). The XML document is then searched, and for each "User" element the "FirstName" and "LastName" values are pulled out and appended to the dropdown in the HTML section with the id "yourdropdownbox".
$.ajaxPrefilter( function (options) {
  if (options.crossDomain && jQuery.support.cors) {
    var http = (window.location.protocol === 'http:' ? 'http:' : 'https:');
    options.url = http + '//cors-anywhere.herokuapp.com/' + options.url;
    //options.url = "http://cors.corsproxy.io/url=" + options.url;
  }
});
$.get(
  'http://citrowske.com/xml.xml',
  function (xml) {
    //console.log("> ", xml);
    //$("#viewer").html(xml);
    ////////////////////////////////////
    var select = $('#yourdropdownbox');
    select.append('<option value="">Select a User</option>');
    $(xml).find('User').each(function(){
      var FirstNames = $(this).find('FirstName').text();
      var LastNames = $(this).find('LastName').text();
      select.append("<option value='"+ FirstNames +"'>"+FirstNames+" "+LastNames+"</option>");
    });
  }
  ////////////////////////////////////
);
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script>
<select id="yourdropdownbox">
</select>
As a note, proxies are not known for being extremely secure, so be careful what you use this for.
Also, if I wanted to collect the data into arrays instead of only appending it each time, I could have added
var firstnamesarray = ["0"];
var lastnamesarray = ["0"];
var i = 0;
above the top row of forward slashes, and then replaced:
var FirstNames = $(this).find('FirstName').text();
var LastNames = $(this).find('LastName').text();
with
firstnamesarray[i] = $(this).find('FirstName').text();
lastnamesarray[i] = $(this).find('LastName').text();
replaced FirstNames and LastNames in the "select.append" call with
firstnamesarray[i] and lastnamesarray[i]
and added, after the "select.append" call,
i = i+1;
To view a working example, check out the jsfiddle here
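
The question originally asked for a 2D array with one row per user and seven columns. As a minimal sketch of that shape, written in Python with the standard library's xml.etree.ElementTree rather than the jQuery used above: it assumes the response wraps the <User> elements in a single root element (a hypothetical <Users> here), and the names COLUMNS and users_to_rows are illustrative. The first user is copied from the question; the second is a made-up placeholder.

```python
import xml.etree.ElementTree as ET

# Sample response; the <Users> wrapper and the second user are assumptions.
SAMPLE_XML = """<Users>
  <User>
    <Id>Rdh9Rsi3k4U1</Id>
    <UserName>firstlast#email.com</UserName>
    <FirstName>First</FirstName>
    <LastName>Last</LastName>
    <Active>true</Active>
    <Email>firstlast#email.com</Email>
    <AccessLevel>Learner</AccessLevel>
  </User>
  <User>
    <Id>PLACEHOLDER2</Id>
    <UserName>second#email.com</UserName>
    <FirstName>Second</FirstName>
    <LastName>User</LastName>
    <Active>false</Active>
    <Email>second#email.com</Email>
    <AccessLevel>Learner</AccessLevel>
  </User>
</Users>"""

# Column order for each row, taken from the XML shown in the question.
COLUMNS = ["Id", "UserName", "FirstName", "LastName",
           "Active", "Email", "AccessLevel"]


def users_to_rows(xml_text):
    """Return one list per <User>, with values in COLUMNS order."""
    root = ET.fromstring(xml_text)
    return [
        [user.findtext(column, default="") for column in COLUMNS]
        for user in root.iter("User")
    ]


rows = users_to_rows(SAMPLE_XML)
for row in rows:
    print(row)
```

A missing field becomes an empty string instead of raising, so every row keeps the same seven columns.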

How to filter the information in a page?

I have this code:
import urllib
from bs4 import BeautifulSoup
import time
url = "http://www.downloadcrew.com/article/31121-magix_movie_edit_pro_2014_premium"
pageUrl = urllib.urlopen(url)
time.sleep(2)
soup = BeautifulSoup(pageUrl)
for a in soup.select("div.downloadLink a[href]"):
    print "downloadlink: "+a["href"]
for b in soup.select("h1#articleTitle"):
    print b
for c in soup.select("table.detailsTable"):
    print c
What I want is the application name, date updated, developer, and download link.
When I run it, the output is everything inside each tag.

Here is the code that gets what you want:
import urllib
from bs4 import BeautifulSoup
import time
url = "http://www.downloadcrew.com/article/31121-magix_movie_edit_pro_2014_premium"
pageUrl = urllib.urlopen(url)
time.sleep(2)
soup = BeautifulSoup(pageUrl)
for a in soup.select("div.downloadLink a[href]"):
    print "downloadlink: " + "?" + a["href"].split("?")[1].split(",")[0]
for b in soup.select("h1#articleTitle"):
    print b.contents[0].strip()
for c in soup.findAll("th"):
    if c.text == "Date Updated:":
        print c.parent.td.text
    elif c.text == "Developer:":
        print c.parent.td.text
But you can't download the file with that URL. You will need to check JavaScript source files to see what javascript:checkDownload() does to get the actual file location.
