I recently started learning web scraping and have tried it on various pages. For now, I am trying to scrape the following site - http://www.pizzahut.com.cn/StoreList
So far I've used Selenium to get the longitude and latitude scraped. However, my code right now only extracts the first page. I know dynamic web scraping can execute JavaScript and load different pages, but I've had a hard time finding the right solution. I was wondering if there's a way to access the other 49 or so pages, because when I click "next page" the URL does not change, so I cannot just iterate over a different URL each time.
Following is my code so far:
import os
import requests
import csv
import sys
import time
from bs4 import BeautifulSoup
page = requests.get('http://www.pizzahut.com.cn/StoreList')
soup = BeautifulSoup(page.text, 'html.parser')
for row in soup.find_all('div', class_='re_RNew'):
    name = row.find('p', class_='re_NameNew').string
    info = row.find('input').get('value')
    location = info.split('|')
    location_data = location[0].split(',')
    longitude = location_data[0]
    latitude = location_data[1]
    print(longitude, latitude)
Thank you so much for helping out. Much appreciated
Steps to get the data:
Open the developer tools in your browser (for Google Chrome it's Ctrl+Shift+I). Then go to the XHR tab, which is located inside the Network tab.
After doing that, click the next-page button and a new request will show up in the list.
Click on that request. In the General block you'll see the two things we need: the Request URL (http://www.pizzahut.com.cn/StoreList/Index) and the Request Method (POST).
Scrolling down to the Form Data section, you can see the three form variables: pageIndex, pageSize and keyword.
Here you can see that changing the value of pageIndex is what gives you all the required pages.
Now that we've got all the required data, we can send a POST request to the URL http://www.pizzahut.com.cn/StoreList/Index with the above form data.
Code:
I'll show you the code to scrape the first 2 pages; you can scrape any number of pages by changing the range().
import requests
from bs4 import BeautifulSoup

for page_no in range(1, 3):
    data = {
        'pageIndex': page_no,
        'pageSize': 10,
        'keyword': '输入餐厅地址或餐厅名称'
    }
    page = requests.post('http://www.pizzahut.com.cn/StoreList/Index', data=data)
    soup = BeautifulSoup(page.text, 'html.parser')

    print('PAGE', page_no)
    for row in soup.find_all('div', class_='re_RNew'):
        name = row.find('p', class_='re_NameNew').string
        info = row.find('input').get('value')
        location = info.split('|')
        location_data = location[0].split(',')
        longitude = location_data[0]
        latitude = location_data[1]
        print(longitude, latitude)
Output:
PAGE 1
31.085877 121.399176
31.271117 121.587577
31.098122 121.413396
31.331458 121.440183
31.094581 121.503654
31.270737000 121.481178000
31.138214 121.386943
30.915685 121.482079
31.279029 121.529255
31.168283 121.283322
PAGE 2
31.388674 121.35918
31.231706 121.472644
31.094857 121.219961
31.228564 121.516609
31.235717 121.478692
31.288498 121.521882
31.155139 121.428885
31.235249 121.474639
30.728829 121.341429
31.260372 121.343066
Note: You can change the results per page by changing the value of pageSize (currently it's 10).
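Putting it together with the csv module your original script already imports, a minimal sketch for saving every store to a CSV file could look like the following; the total of 50 pages and the output filename pizzahut_stores.csv are assumptions on my part:
import csv
import requests
from bs4 import BeautifulSoup

# minimal sketch: the page count (50) and the output filename are assumptions
with open('pizzahut_stores.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'longitude', 'latitude'])
    for page_no in range(1, 51):
        data = {
            'pageIndex': page_no,
            'pageSize': 10,
            'keyword': '输入餐厅地址或餐厅名称'
        }
        page = requests.post('http://www.pizzahut.com.cn/StoreList/Index', data=data)
        soup = BeautifulSoup(page.text, 'html.parser')
        for row in soup.find_all('div', class_='re_RNew'):
            name = row.find('p', class_='re_NameNew').string
            info = row.find('input').get('value')
            location_data = info.split('|')[0].split(',')
            writer.writerow([name, location_data[0], location_data[1]])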
Related
I am trying to web scrape a website that has multiple JavaScript-rendered pages (https://openlibrary.ecampusontario.ca/catalogue/). I am able to get the content from the first page, but I am not sure how to get my script to click the buttons on the subsequent pages to get that content. Here is my script.
import time
from bs4 import BeautifulSoup as soup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import json
# The path to where you have your chrome webdriver stored:
webdriver_path = '/Users/rawlins/Downloads/chromedriver'
# Add arguments telling Selenium to not actually open a window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--window-size=1920x1080')
# Fire up the headless browser
browser = webdriver.Chrome(executable_path=webdriver_path,
                           chrome_options=chrome_options)
# Load webpage
url = "https://openlibrary.ecampusontario.ca/catalogue/"
browser.get(url)
# to ensure that the page has loaded completely.
time.sleep(3)
data = []
# Parse HTML, close browser
page_soup = soup(browser.page_source, 'lxml')
containers = page_soup.findAll("div", {"class":"result-item tooltip"})
for container in containers:
    item = {}
    item['type'] = "Textbook"
    item['title'] = container.find('h4', {'class' : 'textbook-title'}).text.strip()
    item['author'] = container.find('p', {'class' : 'textbook-authors'}).text.strip()
    item['link'] = "https://openlibrary.ecampusontario.ca/catalogue/" + container.find('h4', {'class' : 'textbook-title'}).a["href"]
    item['source'] = "eCampus Ontario"
    item['base_url'] = "https://openlibrary.ecampusontario.ca/catalogue/"
    data.append(item)  # add the item to the list

with open("js-webscrape-2.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
browser.quit()
You do not have to actually click on any button. For example, to search for items with the keyword 'electricity', you navigate to the url
https://openlibrary-repo.ecampusontario.ca/rest/filtered-items?query_field%5B%5D=*&query_op%5B%5D=matches&query_val%5B%5D=(%3Fi)electricity&filters=is_not_withdrawn&offset=0&limit=10000
This will return a json string of items with the first item being:
{"items":[{"uuid":"6af61402-b0ec-40b1-ace2-1aa674c2de9f","name":"Introduction to Electricity, Magnetism, and Circuits","handle":"123456789/579","type":"item","expand":["metadata","parentCollection","parentCollectionList","parentCommunityList","bitstreams","all"],"lastModified":"2019-05-09 15:51:06.91","parentCollection":null,"parentCollectionList":null,"parentCommunityList":null,"bitstreams":null,"withdrawn":"false","archived":"true","link":"/rest/items/6af61402-b0ec-40b1-ace2-1aa674c2de9f","metadata":null}, ...
Now, to get that item, you use its uuid, and navigate to:
https://openlibrary.ecampusontario.ca/catalogue/item/?id=6af61402-b0ec-40b1-ace2-1aa674c2de9f
You can proceed like this for any interaction with that website (it doesn't work for every website, but it does work for yours).
To find out which URLs are requested when you click a given button or enter text (which is how I found the URLs above), you can use Fiddler.
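For example, a minimal sketch that queries the filtered-items endpoint above with requests and builds the catalogue links from the returned uuids could look like this; the keyword 'electricity' is just the example from above, and printing the name and link is my own choice:
import requests

# query the filtered-items REST endpoint described above; 'electricity' is just the example keyword
params = {
    'query_field[]': '*',
    'query_op[]': 'matches',
    'query_val[]': '(?i)electricity',
    'filters': 'is_not_withdrawn',
    'offset': 0,
    'limit': 10000,
}
response = requests.get('https://openlibrary-repo.ecampusontario.ca/rest/filtered-items', params=params)

# each returned item carries a uuid that identifies its catalogue page
for item in response.json()['items']:
    print(item['name'], 'https://openlibrary.ecampusontario.ca/catalogue/item/?id=' + item['uuid'])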
I made a little script with Selenium that can help you.
What it does is: while the last page of the catalogue is not selected (in this case, while the last page-counter link does not contain 'selected' in its class), scrape, then click next.
while "selected" not in driver.find_elements_by_css_selector("[id='results-pagecounter-pages'] a")[-1].get_attribute("class"):
#your scraping here
driver.find_element_by_css_selector("[id='next-btn']").click()
There's one problem you'll probably run into with this method: it doesn't wait for the new results to load after the click. You can figure out what to do from here onwards, but one way to handle it is sketched below.
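A minimal sketch with an explicit wait, assuming the result containers keep the result-item class used in your script and are replaced in the DOM when the next page loads:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
while "selected" not in driver.find_elements_by_css_selector("[id='results-pagecounter-pages'] a")[-1].get_attribute("class"):
    # your scraping here
    # keep a handle to one current result so we can detect when the page content is replaced
    old_result = driver.find_element_by_css_selector("div.result-item")
    driver.find_element_by_css_selector("[id='next-btn']").click()
    # wait until the old result goes stale, i.e. the next page has rendered
    wait.until(EC.staleness_of(old_result))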
Hope it helps
I’m currently trying to download a PDF from a website (I’m trying to automate the process) and I have tried numerous different approaches. I’m using Python and Selenium/PhantomJS to first find the PDF href link in the page source, and then something like wget to download and store the PDF on my local drive.
Whilst I have no issues finding all the href links with find_elements_by_xpath("//a/@href") on the page, or narrowing in on the element that holds the URL with find_element_by_link_text('Active Saver') and then printing it using the get_attribute('href') method, it does not display the link correctly.
This is the source element (an a tag) that I need the link from:
href="#" data-ng-mouseup="loadProductSheetPdf($event, download.ProductType)" target="_blank" data-ng-click="$event.preventDefault()" analytics-event="{event_name:'file_download', event_category: 'download', event_label:'product summary'}" class="ng-binding ng-isolate-scope">Active Saver<
As you can see the href attribute is href="#" and when I run get_attribute('href') on this element I get:
https://www.bupa.com.au/health-insurance/cover/active-saver#
Which is not the link to the PDF. I know this because when I open the page in Firefox and inspect the element I can see the actual, JavaScript executed source:
href="https://bupaanzstdhtauspub01.blob.core.windows.net/productfiles/J6_ActiveSaver_NSWACT_20180401_000000.pdf" data-ng-mouseup="loadProductSheetPdf($event, download.ProductType)" target="_blank" data-ng-click="$event.preventDefault()" analytics-event="{event_name:'file_download', event_category: 'download', event_label:'product summary'}" class="ng-binding ng-isolate-scope">Active Saver<
This https://bupaanzstdhtauspub01.blob.core.windows.net/productfiles/J6_ActiveSaver_NSWACT_20180401_000000.pdf is the link I need.
https://www.bupa.com.au/health-insurance/cover/active-saver is the link to the page that houses the PDF. As you can see the PDF is stored on another domain, not www.bupa.com.au.
Any help with this would be very appreciated.
I realised that this is actually an AJAX request which, when executed, obtains the PDF URL that I'm after. I'm now trying to figure out how to extract that URL from the response object returned by a POST request.
My code so far is:
import requests
from lxml.etree import fromstring
url = "post_url"
data = {data dictionary to send with the request, extracted from dev tools}
response = requests.post(url, data)
response.json()
However, I keep getting an error indicating that no JSON object could be decoded. I can look at the response using response.text, and I get:
u'<html>\r\n<head>\r\n<META NAME="robots" CONTENT="noindex,nofollow">\r\n<script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3">\r\n</script>\r\n<script>\r\n(function() { \r\nvar z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D227374617274223B7661722074696D696E673D6E65772041727261792833293B77696E646F772E6F6E756E6C6F61643D66756E6374696F6E28297B74696D696E675B325D3D22723A222B286E6577204461746528292E67657454696D6528292D74293B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B69662877696E646F772E584D4C4874747052657175657374297B7868723D6E657720584D4C48747470526571756573747D656C73657B7868723D6E657720416374697665584F626A65637428224D6963726F736F66742E584D4C4854545022297D7868722E6F6E726561647973746174656368616E67653D66756E6374696F6E28297B737769746368287868722E72656164795374617465297B6361736520303A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374206E6F7420696E697469616C697A656420223B627265616B3B6361736520313A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2073657276657220636F6E6E656374696F6E2065737461626C6973686564223B627265616B3B6361736520323A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374207265636569766564223B627265616B3B6361736520333A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2070726F63657373696E672072657175657374223B627265616B3B6361736520343A7374617475733D22636F6D706C657465223B74696D696E675B315D3D22633A222B286E6577204461746528292E67657454696D6528292D74293B6966287868722E7374617475733D3D323030297B706172656E742E6C6F636174696F6E2E72656C6F616428297D627265616B7D7D3B74696D696E675B305D3D22733A222B286E6577204461746528292E67657454696D6528292D74293B7868722E6F70656E2822474554222C222F5F496E63617073756C615F5265736F757263653F535748414E45444C3D363634323839373431333131303432323133352C353234303631363938363836323232363836382C393038303935393835353935393539353435312C31303035363336222C66616C7365293B7868722E73656E64286E756C6C297D63617463682863297B7374617475732B3D6E6577204461746528292E67657454696D6528292D742B2220696E6361705F6578633A20222B633B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval(\'String.fromCharCode(\'+z+\')\'));})();\r\n</script></head>\r\n<body>\r\n<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>\r\n</body></html>'
This clearly does not contain the URL I'm after. The frustrating thing is that I can see the URL being obtained when I use Firefox's dev tools:
(Screenshot of Firefox dev tools showing the link.)
Can anyone help me with this?
I was able to solve this by ensuring that both the header information and the request payload (data) sent with the POST request were complete and accurate (obtained from the Firefox dev tools web console). Once I received the response data for the POST request, it was relatively trivial to extract the URL linking to the PDF file I wanted to download. I then downloaded the PDF using urlretrieve from the urllib module. I modelled my script on the script from this page. However, I also ended up using urllib2.Request from the urllib2 module instead of requests.post from the requests module; for some reason urllib2 worked more consistently than requests. My working code ended up looking like this (these two methods come from my class object, but they show the working code):
....
def post_request(self, url, data):
    self.data = data
    self.url = url
    req = urllib2.Request(self.url)
    req.add_header('Content-Type', 'application/json')
    res = urllib2.urlopen(req, self.data)
    out = json.load(res)
    return out

def get_pdf(self):
    link = 'https://www.bupa.com.au/api/cover/datasheets/search'
    directory = '/Users/U1085012/OneDrive/PDS data project/Bupa/PDS Files/'
    excess = [None, 0, 50, 100, 500]
    # singles
    for product in get_product_names_singles():
        self.search_request['PackageEntityName'] = product
        print product
        if 'extras' in product:
            self.search_request['ProductType'] = 2
        else:
            self.search_request['ProductType'] = 1
        for i in range(len(excess)):
            try:
                self.search_request['Excess'] = excess[i]
                payload = json.dumps(self.search_request)
                output = self.post_request(link, payload)
            except urllib2.HTTPError:
                continue
            else:
                break
        path = output['FilePath'].encode('ascii')
        file_name = output['FileName'].encode('ascii')
        # check to see if the file exists; if not, retrieve it
        if os.path.exists(directory + file_name):
            pass
        else:
            ul.urlretrieve(path, directory + file_name)
I would like to extract the market information from the following url and all of its subsequent pages:
https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1
I have successfully parsed the data that I want from the first page using some code from the following url:
https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages
I have also been able to parse out the URL for the next page to feed into a loop in order to grab data from that page. The problem is that it crashes before the next page loads, for a reason I don't fully understand.
I have a hunch that the class I have borrowed from 'impythonist' may be causing the problem. I don't know enough object-oriented programming to work out the issue. Here is my code, much of which is borrowed from the URL above:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html
import re
from bs4 import BeautifulSoup
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

base_url = 'https://uk.reuters.com'
complete_next_page = 'https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1'

# LOOP TO RENDER PAGES AND GRAB DATA
while complete_next_page != '':
    print('NEXT PAGE: ', complete_next_page, '\n')
    r = Render(complete_next_page)  # USE THE CLASS TO RENDER JAVASCRIPT FROM PAGE
    result = r.frame.toHtml()       # ERROR IS THROWN HERE ON 2nd PAGE

    # PARSE THE HTML
    soup = BeautifulSoup(result, 'lxml')
    row_data = soup.find('div', attrs={'class': 'column1 gridPanel grid8'})
    print(len(row_data))

    # PARSE ALL ROW DATA
    stripe_rows = row_data.findAll('tr', attrs={'class': 'stripe'})
    non_stripe_rows = row_data.findAll('tr', attrs={'class': ''})
    print(len(stripe_rows))
    print(len(non_stripe_rows))

    # PARSE SPECIFIC ROW DATA FROM INDEX COMPONENTS
    # non_stripe_rows: from 4 to 18 (inclusive) contain data
    # stripe_rows: from 2 to 16 (inclusive) contain data
    i = 2
    while i < len(stripe_rows):
        print('CURRENT LINE IS: ', str(i))
        print(stripe_rows[i])
        print('###############################################')
        print(non_stripe_rows[i + 2])
        print('\n')
        i += 1

    # GETS LINK TO NEXT PAGE
    next_page = str(soup.find('div', attrs={'class': 'pageNavigation'}).find('li', attrs={'class': 'next'}).find('a')['href'])
    complete_next_page = base_url + next_page
I have annotated the bits of code that I wrote and understand, but I don't know enough about what's going on in the 'Render' class to diagnose the error. Or is it something else?
Here is the error:
result = r.frame.toHtml()
AttributeError: 'Render' object has no attribute 'frame'
I don't need to keep the information in the class once I have parsed it out, so I was thinking perhaps it could be cleared or reset somehow and then updated to hold the new URL information from pages 2 to n, but I have no idea how to do this.
Alternatively, if anyone knows another way to grab this specific data from this page and the following ones, that would be equally helpful.
Many thanks in advance.
How about using Selenium and PhantomJS instead of PyQt?
You can easily get selenium by executing "pip install selenium".
If you use Mac you can get phantomjs by executing "brew install phantomjs".
If you are on Windows, use choco instead of brew; on Ubuntu, use apt-get.
from selenium import webdriver
from bs4 import BeautifulSoup
base_url = "https://uk.reuters.com"
first_page = "/business/markets/index/.FTSE?sortBy=&sortDir=&pn=1"
browser = webdriver.PhantomJS()
# PARSE THE HTML
browser.get(base_url + first_page)
soup = BeautifulSoup(browser.page_source, "lxml")
row_data = soup.find('div', attrs={'class':'column1 gridPanel grid8'})
# PARSE ALL ROW DATA
stripe_rows = row_data.findAll('tr', attrs={'class':'stripe'})
non_stripe_rows = row_data.findAll('tr', attrs={'class':''})
print(len(stripe_rows), len(non_stripe_rows))
# GO TO THE NEXT PAGE
next_button = soup.find("li", attrs={"class":"next"})
while next_button:
    next_page = next_button.find("a")["href"]
    browser.get(base_url + next_page)
    soup = BeautifulSoup(browser.page_source, "lxml")
    row_data = soup.find('div', attrs={'class': 'column1 gridPanel grid8'})
    stripe_rows = row_data.findAll('tr', attrs={'class': 'stripe'})
    non_stripe_rows = row_data.findAll('tr', attrs={'class': ''})
    print(len(stripe_rows), len(non_stripe_rows))
    next_button = soup.find("li", attrs={"class": "next"})
# DONT FORGET THIS!!
browser.quit()
I know the code above is not efficient (it feels too slow), but I think it will bring you the results you desire. In addition, if the web page you want to scrape does not rely on JavaScript, PhantomJS and Selenium are unnecessary and the requests module alone is enough. I used PhantomJS and Selenium here because I wanted to show the contrast with PyQt; a requests-only version is sketched below.
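This is only a rough sketch and assumes the table rows are already present in the static HTML, which you would need to verify for this page:
import requests
from bs4 import BeautifulSoup

base_url = "https://uk.reuters.com"
first_page = "/business/markets/index/.FTSE?sortBy=&sortDir=&pn=1"

# fetch the page without a browser; this only helps if the table is in the static HTML
response = requests.get(base_url + first_page)
soup = BeautifulSoup(response.text, "lxml")
row_data = soup.find('div', attrs={'class': 'column1 gridPanel grid8'})
if row_data is not None:
    stripe_rows = row_data.findAll('tr', attrs={'class': 'stripe'})
    non_stripe_rows = row_data.findAll('tr', attrs={'class': ''})
    print(len(stripe_rows), len(non_stripe_rows))
else:
    print("Rows not found in the static HTML; the page probably needs JavaScript.")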
Below is a bit of code I am puzzled by. I have been successful scraping info from other sites, but I can't get my head around this one. I believe I am missing something, maybe because of JS.
My end code will take the mainurl and the scraped link and add them together. However, I can't even seem to display the links at all. When I use Inspect I can see what I need, but it is not there in the page source. Am I missing something in my code for the JS?
import requests
from bs4 import BeautifulSoup
import csv
b = open('csv/homedepot.csv', 'w', newline='')
a = csv.writer(b,delimiter=',')
mainurl = "http://www.homedepot.ca" ## Main website
theurl = "https://www.homedepot.ca/en/home/categories/appliances/refrigerators-and-freezers.html" ##Target website
r = requests.get(theurl)
soup = BeautifulSoup(r.content, "lxml")
for link in soup.findAll('a'):
    print(link.get('href'))
When you use Bing and search for "fussball bundesliga", Bing displays the last week, the current week and the next week. The games are normally on the weekend. If none of the games for the current week has been played yet, you get probabilities for each team to win, lose or draw.
I can already get the results/predictions that don't need to be expanded, since they are in the HTML on load. To see more I need to expand that view somehow (marked by the circle in the picture). In a human-controlled browser this is easy.
The problem is that clicking that arrow issues an onclick() event which executes JavaScript, so I thought using something with JavaScript support might help. So far I have not been able to get the missing games, since I'm not able to programmatically click that arrow and load the content. Here is my code:
from bs4 import BeautifulSoup
from bs4.element import NavigableString
import requests
import sys
from lxml import html
import spynner
from time import sleep
import dryscrape
from bs4 import BeautifulSoup
if __name__ == "__main__":
url = "https://www.bing.com/search?q=fussball+bundesliga"
sess = dryscrape.Session()
sess.visit(url)
response = sess.body()
dryscrype_soup = BeautifulSoup(response,"lxml")
#test = dryscrype_soup.findAll("div",{"id":"tab_3_dynamic"})
dryscrape_actual_week = dryscrype_soup.findAll("div",{"id":"sp-full-29"})
dryscrape_text = [i for i in dryscrype_soup.recursiveChildGenerator() if type(i) == NavigableString]
dryscrape_all_text = dryscrape_actual_week[0].findAll(text=True)
browser = spynner.Browser(debug_level=spynner.DEBUG)
browser.show(True,True)
browser.load(url)
browser.runjs("sj_evt.fire('ExpandClick', '29', '');",True)
#browser.wk_click(".//*[#id='sp-expandTop-more-29']", wait_load=True)
#browser.wk_click_ajax(selector=".//*[#id='sp-expandTop-more-29']")
browser.wait_load()
markup = browser._get_html()
spynner_soup = BeautifulSoup(markup,"lxml")
spynner_actual_week = spynner_soup.findAll("div",{"id":"sp-full-29"})
spynner_all_text = spynner_actual_week[0].findAll(text=True)
Don't mind the imports; I have already tried several things. I tried Microsoft's Azure API, but it only delivers links, not these predictions. When you have a look at the parsed HTML, or at the variables spynner_all_text and dryscrape_all_text, you will notice that they contain only the results from the non-expanded page. I hope someone can help me with that.