I built a lightweight web scraper to parse a user's profile on ResearchGate. With the code below I was previously able to discover and visit href links, but after running the program a few times I now only receive a single href pointing to instructions on how to enable JavaScript.
My code is as follows:
import requests
from bs4 import BeautifulSoup

main_url = 'https://www.researchgate.net/profile/Luqun_Li3'
url = main_url + '/research'
page = requests.get(url)
bs = BeautifulSoup(page.content, features='lxml')
pub_links = []
for link in bs.findAll('a'):
    print(link)
    href = link.get('href')
    # Guard against anchors with no href attribute
    if href and 'publication/' in href:
        pub_links.append(href)
        print('found link')
visiting_links = remove_dupes(pub_links)  # remove_dupes is a helper defined elsewhere
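remove_dupes isn't shown in the question; a minimal sketch of what it presumably does (de-duplicating the links while preserving order) might look like this:
def remove_dupes(links):
    # Keep only the first occurrence of each link, preserving order
    seen = set()
    unique = []
    for link in links:
        if link not in seen:
            seen.add(link)
            unique.append(link)
    return unique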
Previously when I executed the code above I was able to view and discover links starting with 'publication/', but now there is only one link available, which says:
<a href="http://www.enable-javascript.com/" rel="nofollow noopener" target="_blank">instructions how to enable JavaScript in your web browser</a>
Can someone help me deal with this JavaScript requirement so that I can keep using this program for parsing?
Related
I am trying to scrape a website that has multiple JavaScript-rendered pages (https://openlibrary.ecampusontario.ca/catalogue/). I can get the content from the first page, but I am not sure how to get my script to click the buttons on the subsequent pages to get their content. Here is my script:
import time
from bs4 import BeautifulSoup as soup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import json
# The path to where you have your chrome webdriver stored:
webdriver_path = '/Users/rawlins/Downloads/chromedriver'
# Add arguments telling Selenium to not actually open a window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--window-size=1920x1080')
# Fire up the headless browser
browser = webdriver.Chrome(executable_path=webdriver_path,
                           chrome_options=chrome_options)
# Load webpage
url = "https://openlibrary.ecampusontario.ca/catalogue/"
browser.get(url)
# to ensure that the page has loaded completely.
time.sleep(3)
data = []
# Parse HTML, close browser
page_soup = soup(browser.page_source, 'lxml')
containers = page_soup.findAll("div", {"class":"result-item tooltip"})
for container in containers:
    item = {}
    item['type'] = "Textbook"
    item['title'] = container.find('h4', {'class': 'textbook-title'}).text.strip()
    item['author'] = container.find('p', {'class': 'textbook-authors'}).text.strip()
    item['link'] = "https://openlibrary.ecampusontario.ca/catalogue/" + container.find('h4', {'class': 'textbook-title'}).a["href"]
    item['source'] = "eCampus Ontario"
    item['base_url'] = "https://openlibrary.ecampusontario.ca/catalogue/"
    data.append(item)  # add the item to the list
with open("js-webscrape-2.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
browser.quit()
You do not have to actually click on any button. For example, to search for items with the keyword 'electricity', you navigate to the url
https://openlibrary-repo.ecampusontario.ca/rest/filtered-items?query_field%5B%5D=*&query_op%5B%5D=matches&query_val%5B%5D=(%3Fi)electricity&filters=is_not_withdrawn&offset=0&limit=10000
This will return a json string of items with the first item being:
{"items":[{"uuid":"6af61402-b0ec-40b1-ace2-1aa674c2de9f","name":"Introduction to Electricity, Magnetism, and Circuits","handle":"123456789/579","type":"item","expand":["metadata","parentCollection","parentCollectionList","parentCommunityList","bitstreams","all"],"lastModified":"2019-05-09 15:51:06.91","parentCollection":null,"parentCollectionList":null,"parentCommunityList":null,"bitstreams":null,"withdrawn":"false","archived":"true","link":"/rest/items/6af61402-b0ec-40b1-ace2-1aa674c2de9f","metadata":null}, ...
Now, to get that item, you use its uuid, and navigate to:
https://openlibrary.ecampusontario.ca/catalogue/item/?id=6af61402-b0ec-40b1-ace2-1aa674c2de9f
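As a minimal sketch of that approach (assuming the REST endpoint and its query parameters still behave as shown above), you could fetch the item list and rebuild the catalogue URLs directly with requests:
import requests

SEARCH_URL = ("https://openlibrary-repo.ecampusontario.ca/rest/filtered-items"
              "?query_field%5B%5D=*&query_op%5B%5D=matches&query_val%5B%5D=(%3Fi)electricity"
              "&filters=is_not_withdrawn&offset=0&limit=10000")

response = requests.get(SEARCH_URL)
items = response.json()["items"]  # JSON structure as in the example above
for item in items:
    # Build the catalogue URL for each item from its uuid
    print("https://openlibrary.ecampusontario.ca/catalogue/item/?id=" + item["uuid"])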
You can proceed like this for any interaction with that website (this doesn't always work for every website, but it works for yours).
To find out which URLs are requested when you click a given button or enter text (which is what I did for the URLs above), you can use Fiddler.
I made a little script that can help you (selenium).
What this script does is: "while the last page of the catalogue is not selected (in this case, while its class does not contain 'selected'), scrape, then click next".
while "selected" not in driver.find_elements_by_css_selector("[id='results-pagecounter-pages'] a")[-1].get_attribute("class"):
#your scraping here
driver.find_element_by_css_selector("[id='next-btn']").click()
There is one problem you'll probably run into with this method: it doesn't wait for the results to load after clicking, but you can figure out what to do from here onwards.
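One possible way to handle that (a sketch only, assuming the result items keep the result-item class from the question's code and that the page replaces them in the DOM when it re-renders) is an explicit wait for the old results to go stale after clicking next:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

first_old_result = driver.find_element_by_css_selector("div.result-item")
driver.find_element_by_css_selector("[id='next-btn']").click()
# Wait until the first result from the previous page is detached, i.e. the page has re-rendered
WebDriverWait(driver, 10).until(EC.staleness_of(first_old_result))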
Hope it helps
I'm currently trying to download a PDF from a website (I'm trying to automate the process) and I have tried numerous different approaches. I'm currently using Python and Selenium/PhantomJS to first find the PDF href link in the webpage source and then use something like wget to download and store the PDF on my local drive.
While I have no issues finding all the href links with find_elements_by_xpath("//a/@href") on the page, or narrowing in on the element that has the URL path with find_element_by_link_text('Active Saver') and then printing it using the get_attribute('href') method, it does not display the link correctly.
This is the source element (an a tag) that I need the link from:
<a href="#" data-ng-mouseup="loadProductSheetPdf($event, download.ProductType)" target="_blank" data-ng-click="$event.preventDefault()" analytics-event="{event_name:'file_download', event_category: 'download', event_label:'product summary'}" class="ng-binding ng-isolate-scope">Active Saver</a>
As you can see the href attribute is href="#" and when I run get_attribute('href') on this element I get:
https://www.bupa.com.au/health-insurance/cover/active-saver#
Which is not the link to the PDF. I know this because when I open the page in Firefox and inspect the element I can see the actual, JavaScript-executed source:
href="https://bupaanzstdhtauspub01.blob.core.windows.net/productfiles/J6_ActiveSaver_NSWACT_20180401_000000.pdf" data-ng-mouseup="loadProductSheetPdf($event, download.ProductType)" target="_blank" data-ng-click="$event.preventDefault()" analytics-event="{event_name:'file_download', event_category: 'download', event_label:'product summary'}" class="ng-binding ng-isolate-scope">Active Saver<
This https://bupaanzstdhtauspub01.blob.core.windows.net/productfiles/J6_ActiveSaver_NSWACT_20180401_000000.pdf is the link I need.
https://www.bupa.com.au/health-insurance/cover/active-saver is the link to the page that houses the PDF. As you can see the PDF is stored on another domain, not www.bupa.com.au.
Any help with this would be very appreciated.
I realised that this is actually an AJAX request which, when executed, obtains the PDF URL that I'm after. I'm now trying to figure out how to extract that URL from the response object returned by the POST request.
My code so far is:
import requests
from lxml.etree import fromstring
url = "post_url"
data = {...}  # data dictionary to send with the request, extracted from dev tools
response = requests.post(url, data)
response.json()
However, I keep getting an error indicating that no JSON object could be decoded. I can look at the response using response.text, and I get:
u'<html>\r\n<head>\r\n<META NAME="robots" CONTENT="noindex,nofollow">\r\n<script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3">\r\n</script>\r\n<script>\r\n(function() { \r\nvar z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D227374617274223B7661722074696D696E673D6E65772041727261792833293B77696E646F772E6F6E756E6C6F61643D66756E6374696F6E28297B74696D696E675B325D3D22723A222B286E6577204461746528292E67657454696D6528292D74293B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B69662877696E646F772E584D4C4874747052657175657374297B7868723D6E657720584D4C48747470526571756573747D656C73657B7868723D6E657720416374697665584F626A65637428224D6963726F736F66742E584D4C4854545022297D7868722E6F6E726561647973746174656368616E67653D66756E6374696F6E28297B737769746368287868722E72656164795374617465297B6361736520303A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374206E6F7420696E697469616C697A656420223B627265616B3B6361736520313A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2073657276657220636F6E6E656374696F6E2065737461626C6973686564223B627265616B3B6361736520323A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374207265636569766564223B627265616B3B6361736520333A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2070726F63657373696E672072657175657374223B627265616B3B6361736520343A7374617475733D22636F6D706C657465223B74696D696E675B315D3D22633A222B286E6577204461746528292E67657454696D6528292D74293B6966287868722E7374617475733D3D323030297B706172656E742E6C6F636174696F6E2E72656C6F616428297D627265616B7D7D3B74696D696E675B305D3D22733A222B286E6577204461746528292E67657454696D6528292D74293B7868722E6F70656E2822474554222C222F5F496E63617073756C615F5265736F757263653F535748414E45444C3D363634323839373431333131303432323133352C353234303631363938363836323232363836382C393038303935393835353935393539353435312C31303035363336222C66616C7365293B7868722E73656E64286E756C6C297D63617463682863297B7374617475732B3D6E6577204461746528292E67657454696D6528292D742B2220696E6361705F6578633A20222B633B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval(\'String.fromCharCode(\'+z+\')\'));})();\r\n</script></head>\r\n<body>\r\n<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>\r\n</body></html>'
This clearly does not contain the URL I'm after. The frustrating thing is that I can see the URL that was obtained when I use Firefox's dev tools:
[Screenshot of Firefox dev tools showing the link]
Can anyone help me with this?
I was able to solve this by ensuring that both my header information and the request payload (data) sent with the POST request were complete and accurate (obtained from the Firefox dev tools web console). Once I was able to receive the response data for the POST request, it was relatively trivial to extract the URL linking to the PDF file I wanted to download. I then downloaded the PDF using urlretrieve from the urllib module.
I modeled my script on the script from this page. However, I ended up using urllib2.Request from the urllib2 module instead of requests.post from the requests module; for some reason the urllib2 module worked more consistently than the requests module. My working code ended up looking like this (these two methods come from my class object, but they show the working code):
....
# (module-level imports assumed elsewhere: os, json, urllib2, and urllib as ul)
def post_request(self, url, data):
    self.data = data
    self.url = url
    req = urllib2.Request(self.url)
    req.add_header('Content-Type', 'application/json')
    res = urllib2.urlopen(req, self.data)
    out = json.load(res)
    return out

def get_pdf(self):
    link = 'https://www.bupa.com.au/api/cover/datasheets/search'
    directory = '/Users/U1085012/OneDrive/PDS data project/Bupa/PDS Files/'
    excess = [None, 0, 50, 100, 500]
    # singles
    for product in get_product_names_singles():
        self.search_request['PackageEntityName'] = product
        print product
        if 'extras' in product:
            self.search_request['ProductType'] = 2
        else:
            self.search_request['ProductType'] = 1
        for i in range(len(excess)):
            try:
                self.search_request['Excess'] = excess[i]
                payload = json.dumps(self.search_request)
                output = self.post_request(link, payload)
            except urllib2.HTTPError:
                continue
            else:
                break
        path = output['FilePath'].encode('ascii')
        file_name = output['FileName'].encode('ascii')
        # check to see if the file exists; if not, retrieve it
        if os.path.exists(directory + file_name):
            pass
        else:
            ul.urlretrieve(path, directory + file_name)
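For reference, here is a rough Python 3 sketch of the same flow using requests, assuming the endpoint and the FilePath/FileName fields in the JSON response behave as described above; the payload values and local directory are placeholders:
import os
import json
import requests
from urllib.request import urlretrieve

# Hypothetical payload; the real search_request dict comes from the dev tools capture
search_request = {'PackageEntityName': 'active-saver', 'ProductType': 1, 'Excess': None}

response = requests.post(
    'https://www.bupa.com.au/api/cover/datasheets/search',
    data=json.dumps(search_request),
    headers={'Content-Type': 'application/json'},
)
output = response.json()
pdf_url = output['FilePath']       # link to the PDF on the blob-storage domain
file_name = output['FileName']

directory = '/path/to/pdf/files/'  # hypothetical local directory
if not os.path.exists(directory + file_name):
    urlretrieve(pdf_url, directory + file_name)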
http://www.wfri.re.kr/client/PublishHp.do?command=view&list_dis_txt=PUB&current_page=1&isu_year=all&list_unq_no=RP00000001847&search_category=&search_keyword=&pub_dt=20170203&topMenuNo=H20000&leftMenuNo=H20100
I'm crawling this site using Python 3 and BeautifulSoup. I want to download the PDF file on the page, but my crawler cannot find any tags there; BeautifulSoup cannot scrape any tags from this site. Why?
import requests
from bs4 import BeautifulSoup

def second_crawler(second_url):
    second_url = 'http://www.wfri.re.kr/client/PublishHp.do?command=view&list_dis_txt=PUB&current_page=1&isu_year=all&list_unq_no=RP00000001847&search_category=&search_keyword=&pub_dt=20170203&topMenuNo=H20000&leftMenuNo=H20100'
    source_code = requests.get(second_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'lxml')
    print(soup)  # for debug
    # tdTag = soup.findAll('td', class_='view_cont')
    # print(len(tdTag))  # result is 0. Why??
The website uses the JavaScript function javascript:fnc_filedown() instead of a plain URL to provide the download functionality for PDF files.
For example, when I visit one of the posts: http://www.wfri.re.kr/client/PublishHp.do?command=view&list_dis_txt=PUB&current_page=1&isu_year=all&list_unq_no=RP00000001847&search_category=&search_keyword=&pub_dt=20170203&topMenuNo=H20000&leftMenuNo=H20100
the download is only triggered by a call like the following:
javascript:fnc_filedown( 'XXX.pdf', '148636884482283162132' );
because the reference link is stored in the anchor like this:
<a href="javascript:fnc_filedown( 'XXX.pdf', '148636884482283162132' );">XXX.pdf</a>
I'd suggest modifying your crawler to account for how this website works.
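One possible approach (a sketch only, assuming Selenium can render the page and that the download links expose fnc_filedown(...) in their href attributes as shown above) is to render the page with Selenium and pull the function arguments out of the anchors:
import re
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.wfri.re.kr/client/PublishHp.do?command=view&list_dis_txt=PUB'
           '&current_page=1&isu_year=all&list_unq_no=RP00000001847&search_category='
           '&search_keyword=&pub_dt=20170203&topMenuNo=H20000&leftMenuNo=H20100')

# Collect the arguments passed to fnc_filedown() from every anchor's href
for a in driver.find_elements_by_tag_name('a'):
    href = a.get_attribute('href') or ''
    match = re.search(r"fnc_filedown\(\s*'([^']+)'\s*,\s*'([^']+)'", href)
    if match:
        file_name, file_id = match.groups()
        print(file_name, file_id)  # feed these into the site's download request

driver.quit()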
Below is a bit of code I am puzzled with. I have been successful at scraping info from other sites, but this one I can't get my head around. I believe I am missing something, maybe due to JS.
My end code will take mainurl and theurl (which is a link) and add them together. However, I can't even seem to print anything from theurl. When I go through the Inspect panel I can see what I need, but in the page source it is not there. Am I missing something in my code for the JS?
import requests
from bs4 import BeautifulSoup
import csv
b = open('csv/homedepot.csv', 'w', newline='')
a = csv.writer(b,delimiter=',')
mainurl = "http://www.homedepot.ca" ## Main website
theurl = "https://www.homedepot.ca/en/home/categories/appliances/refrigerators-and-freezers.html" ##Target website
r = requests.get(theurl)
soup = BeautifulSoup(r.content, "lxml")
for link in soup.findAll('a'):
    print(link.get('href'))
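If the product links are injected by JavaScript, requests will never see them in the raw HTML; a minimal sketch of the usual workaround (rendering the page with Selenium first, as in the earlier answers, before handing the HTML to BeautifulSoup) might look like this:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.homedepot.ca/en/home/categories/appliances/refrigerators-and-freezers.html")
time.sleep(5)  # crude wait for the JS-rendered content, as in the catalogue example above

soup = BeautifulSoup(driver.page_source, "lxml")
for link in soup.findAll('a'):
    href = link.get('href')
    if href:
        print(href)

driver.quit()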
I've written a script to test a process involving data input and several pages, but after writing it I've found that the forms and main content are generated from JavaScript.
The following is a snippet of the script I wrote; after that initial link, the content is generated by JS (it's my first Python script, so excuse any mistakes):
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time
browser = webdriver.Firefox()
browser.get('http://127.0.0.1:46727/?ajax=1')
assert "Home" in browser.title
# Find and click the Employer Database link
empDatabaseLink = browser.find_element_by_link_text('Employer Database')
click = ActionChains(browser).click(on_element = empDatabaseLink)
click.perform()
# Content loaded by the above link is generated by the JS
# Find and click the Add Employer button
addEmployerButton = browser.find_element_by_id('Add Employer')
addEmployer = ActionChains(browser).click(on_element = addEmployerButton)
addEmployer.perform()
browser.save_screenshot(r'images\Add_Employer_Form.png')
# Input Employer name
employerName = browser.find_element_by_id('name')
employerName.send_keys("Selenium")
browser.save_screenshot(r'images\Entered_Employer_Name.png')
# Move to next
nextButton = browser.find_element_by_name('button_next')
moveForward = ActionChains(browser).click(on_element = nextButton)
# Move through various steps
# Then
# Move to Finish
moveForward = ActionChains(browser).click(on_element = nextButton)
How do you access page elements that aren't in the source? I've been looking around and found getEval, but I haven't found anything that I can use.
Well, to the people of the future: our conversation above appears to have led to the conclusion that XPath is what Mark was looking for. So remember to try XPath, and to use the Selenium IDE and Firebug to locate particularly obstinate page elements.
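As a small illustration of that idea (a sketch only; it reuses the browser object from the question's script, and the XPath targets the 'Add Employer' id mentioned there, so the exact locator depends on the JS-generated markup):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait for the JS-generated element to exist, then locate it by XPath and click it
wait = WebDriverWait(browser, 10)
addEmployerButton = wait.until(
    EC.presence_of_element_located((By.XPATH, "//*[@id='Add Employer']"))
)
addEmployerButton.click()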