Python doesn't grab info due to JS

Below is a bit of code I am puzzled by. I have successfully scraped info from other sites, but this one I can't get my head around. I believe I am missing something, possibly because of JS.
My end goal is to take mainurl and theurl (which is a link) and join them together. However, I can't even seem to print the links from theurl. When I inspect the page in the browser I can see what I need, but it is not there in the page source. Am I missing something in my code to handle the JS?
import requests
from bs4 import BeautifulSoup
import csv
b = open('csv/homedepot.csv', 'w', newline='')
a = csv.writer(b,delimiter=',')
mainurl = "http://www.homedepot.ca" ## Main website
theurl = "https://www.homedepot.ca/en/home/categories/appliances/refrigerators-and-freezers.html" ##Target website
r = requests.get(theurl)
soup = BeautifulSoup(r.content, "lxml")
for link in soup.findAll('a'):
    print(link.get('href'))
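The links you can see in the inspector but not in "view source" are added by JavaScript after the initial HTML arrives, so requests alone will never see them. One common workaround is to drive a real browser. Below is a minimal sketch of that idea using Selenium; it is an illustration rather than a verified answer, and it simply collects every anchor tag and joins relative hrefs onto mainurl, the same as the loop above intends.
from urllib.parse import urljoin
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

mainurl = "http://www.homedepot.ca"
theurl = "https://www.homedepot.ca/en/home/categories/appliances/refrigerators-and-freezers.html"

driver = webdriver.Chrome()
try:
    driver.get(theurl)
    # Wait until anchor tags are present after the page's JS has run
    WebDriverWait(driver, 15).until(
        EC.presence_of_all_elements_located((By.TAG_NAME, "a"))
    )
    for link in driver.find_elements(By.TAG_NAME, "a"):
        href = link.get_attribute("href")
        if href:
            # urljoin handles both relative and absolute hrefs
            print(urljoin(mainurl, href))
finally:
    driver.quit()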

Related

Selenium chromedriver not running pages js scripts

The page normally loads with the scorecard results visible, but when it is opened with chromedriver via Selenium in Python the results never appear.
I have looked up how to trigger JS scripts on a page; rocket-loader.min.js is a consistent thing I see in the page source, but nothing I try works (JavascriptExecutor, implicit wait, explicit wait, time.sleep()). Nothing seems to get the page to load so I can scrape the results from it. Here is my code for reference:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

date = ["2022-11-02","2022-10-26","2022-10-19","2022-10-05"]
html = []
url = 'https://lfstats.com/scorecards/nightly?gametype=social&centerID=10&leagueID=0&isComp=0&date='
for x in date:
    driver = webdriver.Chrome('C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe')
    driver.get(url + str(date))
    time.sleep(10)
    lnks = driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/ul/div[1]/div[1]/a')
    print(lnks)
    try:
        for link in lnks:
            get = link.find_element(By.CSS_SELECTOR, 'a')
            hyper = get.get_attribute('href')
            html.append(hyper)
    except:
        print("unable to find link in group")
Any help would be greatly appreciated. I am moderately good at Python but new to web scraping and HTML. Please feel free to comment with how big of an idiot I am. Thank you.
This is not related to JavaScript being enabled or disabled. The actual issue is that you are not passing the dates from the 'date' list to the URL correctly: url + str(date) appends the whole list instead of a single date.
In the for loop, you have to make the change below:
for x in range(len(date)):  # loop over the length of the 'date' list
    driver = webdriver.Chrome('C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe')
    driver.get(url + str(date[x]))  # pass one date per request
    time.sleep(10)
    # use find_elements (plural), with an updated locator that matches the link headings
    lnks = driver.find_elements(By.XPATH, ".//*[@class='list-group-item-heading']")
    try:
        for link in lnks:
            get = link.find_element(By.CSS_SELECTOR, 'a')
            hyper = get.get_attribute('href')
            html.append(hyper)
    except:
        print("unable to find link in group")

How to scrape multiple pages with an unchanging URL - Python 3

I recently got into web scraping and have tried to scrape various pages. For now, I am trying to scrape the following site - http://www.pizzahut.com.cn/StoreList
So far I've used Selenium to get the longitude and latitude scraped. However, my code right now only extracts the first page. I know dynamic web scraping can execute JavaScript and load different pages, but I had a hard time finding the right solution. I was wondering if there's a way to access the other 49 pages or so, because when I click next page the URL does not change (it is fixed), so I cannot just iterate over a different URL each time.
Following is my code so far:
import os
import requests
import csv
import sys
import time
from bs4 import BeautifulSoup

page = requests.get('http://www.pizzahut.com.cn/StoreList')
soup = BeautifulSoup(page.text, 'html.parser')

for row in soup.find_all('div', class_='re_RNew'):
    name = row.find('p', class_='re_NameNew').string
    info = row.find('input').get('value')
    location = info.split('|')
    location_data = location[0].split(',')
    longitude = location_data[0]
    latitude = location_data[1]
    print(longitude, latitude)
Thank you so much for helping out. Much appreciated
Steps to get the data:
Open the developer tools in your browser (for Google Chrome it's Ctrl+Shift+I). Now, go to the XHR filter, which is located inside the Network tab.
After doing that, click the next-page button on the site. A new request to StoreList/Index will appear in the list.
Click on that request. In the General block you'll see the two things we need: the Request URL (http://www.pizzahut.com.cn/StoreList/Index) and the Request Method (POST).
Scrolling down, in the Form Data section you can see the three fields that are sent: pageIndex, pageSize and keyword.
Here, you can see that changing the value of pageIndex will give all the pages required.
Now that we've got all the required data, we can send a POST request to the URL http://www.pizzahut.com.cn/StoreList/Index using the above data.
Code:
I'll show you the code to scrape the first 2 pages; you can scrape any number of pages by changing the range().
import requests
from bs4 import BeautifulSoup

for page_no in range(1, 3):
    data = {
        'pageIndex': page_no,
        'pageSize': 10,
        'keyword': '输入餐厅地址或餐厅名称'
    }
    page = requests.post('http://www.pizzahut.com.cn/StoreList/Index', data=data)
    soup = BeautifulSoup(page.text, 'html.parser')
    print('PAGE', page_no)
    for row in soup.find_all('div', class_='re_RNew'):
        name = row.find('p', class_='re_NameNew').string
        info = row.find('input').get('value')
        location = info.split('|')
        location_data = location[0].split(',')
        longitude = location_data[0]
        latitude = location_data[1]
        print(longitude, latitude)
Output:
PAGE 1
31.085877 121.399176
31.271117 121.587577
31.098122 121.413396
31.331458 121.440183
31.094581 121.503654
31.270737000 121.481178000
31.138214 121.386943
30.915685 121.482079
31.279029 121.529255
31.168283 121.283322
PAGE 2
31.388674 121.35918
31.231706 121.472644
31.094857 121.219961
31.228564 121.516609
31.235717 121.478692
31.288498 121.521882
31.155139 121.428885
31.235249 121.474639
30.728829 121.341429
31.260372 121.343066
Note: You can change the results per page by changing the value of pageSize (currently it's 10).
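If the total number of pages isn't known in advance, one option (a sketch, assuming a page past the end simply returns no store rows) is to keep increasing pageIndex until nothing comes back:
import requests
from bs4 import BeautifulSoup

page_no = 1
while True:
    data = {'pageIndex': page_no, 'pageSize': 10, 'keyword': '输入餐厅地址或餐厅名称'}
    page = requests.post('http://www.pizzahut.com.cn/StoreList/Index', data=data)
    soup = BeautifulSoup(page.text, 'html.parser')
    rows = soup.find_all('div', class_='re_RNew')
    if not rows:
        break  # no more stores returned, assume this was past the last page
    print('PAGE', page_no, '-', len(rows), 'stores')
    page_no += 1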

Parsing html from a javascript rendered url with python object

I would like to extract the market information from the following url and all of its subsequent pages:
https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1
I have successfully parsed the data that I want from the first page using some code from the following url:
https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages
I have also been able to parse out the url for the next page to feed into a loop in order to grab data from the next page. The problem is it crashes before the next page loads for a reason I don't fully understand.
I have a hunch that the class I have borrowed from 'impythonist' may be causing the problem. I don't know enough object-oriented programming to work out the problem. Here is my code, much of which is borrowed from the url above:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html
import re
from bs4 import BeautifulSoup

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

base_url = 'https://uk.reuters.com'
complete_next_page = 'https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1'

# LOOP TO RENDER PAGES AND GRAB DATA
while complete_next_page != '':
    print('NEXT PAGE: ', complete_next_page, '\n')
    r = Render(complete_next_page)  # USE THE CLASS TO RENDER JAVASCRIPT FROM PAGE
    result = r.frame.toHtml()  # ERROR IS THROWN HERE ON 2nd PAGE

    # PARSE THE HTML
    soup = BeautifulSoup(result, 'lxml')
    row_data = soup.find('div', attrs={'class':'column1 gridPanel grid8'})
    print(len(row_data))

    # PARSE ALL ROW DATA
    stripe_rows = row_data.findAll('tr', attrs={'class':'stripe'})
    non_stripe_rows = row_data.findAll('tr', attrs={'class':''})
    print(len(stripe_rows))
    print(len(non_stripe_rows))

    # PARSE SPECIFIC ROW DATA FROM INDEX COMPONENTS
    # non_stripe_rows: from 4 to 18 (inclusive) contain data
    # stripe_rows: from 2 to 16 (inclusive) contain data
    i = 2
    while i < len(stripe_rows):
        print('CURRENT LINE IS: ', str(i))
        print(stripe_rows[i])
        print('###############################################')
        print(non_stripe_rows[i+2])
        print('\n')
        i += 1

    # GETS LINK TO NEXT PAGE
    next_page = str(soup.find('div', attrs={'class':'pageNavigation'}).find('li', attrs={'class':'next'}).find('a')['href'])
    complete_next_page = base_url + next_page
I have annotated the bits of code that I have written and understand, but I don't really know enough about what's going on in the 'Render' class to diagnose the error. Unless it's something else?
Here is the error:
result = r.frame.toHtml()
AttributeError: 'Render' object has no attribute 'frame'
I don't need to keep the information in the class once I have parsed it out, so I was thinking perhaps it could be cleared or reset somehow and then updated to hold the new url information for pages 2..n, but I have no idea how to do this.
Alternatively, if anyone knows another way to grab this specific data from this page and the following ones, that would be equally helpful.
Many thanks in advance.
How about using Selenium and PhantomJS instead of PyQt?
You can easily get Selenium by executing "pip install selenium".
If you are on a Mac you can get PhantomJS by executing "brew install phantomjs".
On Windows use choco instead of brew; on Ubuntu use apt-get.
from selenium import webdriver
from bs4 import BeautifulSoup

base_url = "https://uk.reuters.com"
first_page = "/business/markets/index/.FTSE?sortBy=&sortDir=&pn=1"
browser = webdriver.PhantomJS()

# PARSE THE HTML
browser.get(base_url + first_page)
soup = BeautifulSoup(browser.page_source, "lxml")
row_data = soup.find('div', attrs={'class':'column1 gridPanel grid8'})

# PARSE ALL ROW DATA
stripe_rows = row_data.findAll('tr', attrs={'class':'stripe'})
non_stripe_rows = row_data.findAll('tr', attrs={'class':''})
print(len(stripe_rows), len(non_stripe_rows))

# GO TO THE NEXT PAGE
next_button = soup.find("li", attrs={"class":"next"})
while next_button:
    next_page = next_button.find("a")["href"]
    browser.get(base_url + next_page)
    soup = BeautifulSoup(browser.page_source, "lxml")
    row_data = soup.find('div', attrs={'class':'column1 gridPanel grid8'})
    stripe_rows = row_data.findAll('tr', attrs={'class':'stripe'})
    non_stripe_rows = row_data.findAll('tr', attrs={'class':''})
    print(len(stripe_rows), len(non_stripe_rows))
    next_button = soup.find("li", attrs={"class":"next"})

# DON'T FORGET THIS!!
browser.quit()
I know the code above is not efficient (it feels too slow), but I think it will bring you the results you desire. In addition, if the web page you want to scrape does not use JavaScript, PhantomJS and Selenium are unnecessary and you can use the requests module. However, since I wanted to show the contrast with PyQt, I used PhantomJS and Selenium in this answer.
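For reference, here is a minimal sketch of what the requests-only version would look like if the table happened to be in the static HTML (an assumption; for this Reuters page it is not, which is exactly why the browser-based approach above is needed):
import requests
from bs4 import BeautifulSoup

url = "https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, "lxml")
row_data = soup.find('div', attrs={'class': 'column1 gridPanel grid8'})
# row_data will be None when the rows are injected by JavaScript rather than served statically
print(row_data is not None)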

Can I extract comments of any page from https://www.rt.com/ using python3?

I am writing a web crawler. I extracted the heading and main discussion of this link, but I am unable to find any of the comments (Ctrl+U -> Ctrl+F, then searching for the comment text). I think the comments are loaded by JavaScript. Can I extract them?
RT are using a service from spot.im for comments.
You need to make two POST requests: first https://api.spot.im/me/network-token/spotim to get a token, then https://api.spot.im/conversation-read/spot/sp_6phY2k0C/post/353493/get to get the comments as JSON.
I wrote a quick script to do this:
import requests
import re
import json

def get_rt_comments(article_url):
    spotim_spotId = 'sp_6phY2k0C'  # spot.im id for RT
    post_id = re.search('([0-9]+)', article_url).group(0)
    r1 = requests.post('https://api.spot.im/me/network-token/spotim').json()
    spotim_token = r1['token']
    payload = {
        "count": 25,  # number of comments to fetch
        "sort_by": "best",
        "cursor": {"offset": 0, "comments_read": 0},
        "host_url": article_url,
        "canonical_url": article_url
    }
    r2_url = 'https://api.spot.im/conversation-read/spot/' + spotim_spotId + '/post/' + post_id + '/get'
    r2 = requests.post(r2_url, data=json.dumps(payload), headers={'X-Spotim-Token': spotim_token, "Content-Type": "application/json"})
    return r2.json()

if __name__ == '__main__':
    url = 'https://www.rt.com/usa/353493-clinton-speech-affairs-silence/'
    comments = get_rt_comments(url)
    print(comments)
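The structure of the returned JSON isn't shown here, so one way to explore it before picking out the comment bodies is to pretty-print the response from the function above (a small illustrative addition, not part of the original answer):
import json

comments = get_rt_comments('https://www.rt.com/usa/353493-clinton-speech-affairs-silence/')
# Pretty-print so the nesting (and where the comment text lives) is easy to see
print(json.dumps(comments, indent=2, ensure_ascii=False))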
Yes, if it can be viewed with a web browser, you can extract it.
If you look at the source, the comment section is really an iframe that loads a piece of JavaScript, which then creates a new script tag in the document whose source loads bundle.js, which contains the commenting software. This in turn fetches the actual comments.
Instead of going through this manually, you could consider using, for example, WebKit to create a headless browser that executes the JavaScript like an ordinary browser. Then you can scrape from that instead of having to make your crawler fetch the external resources manually.
Examples of such headless browsers include Spynner, dryscrape, or the PhantomJS-derived PhantomPy (the latter seems to be an abandoned project now).
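A rough sketch of that headless-browser idea using Selenium follows; the iframe locator and the comment selector are guesses and would need to be confirmed in the browser's inspector before relying on them:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.rt.com/usa/353493-clinton-speech-affairs-silence/")
    # Wait for the comments iframe to be attached, then switch into it
    WebDriverWait(driver, 20).until(
        EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME, "iframe"))
    )
    # Hypothetical selector: inspect the iframe to find the real comment elements
    for el in driver.find_elements(By.CSS_SELECTOR, "[data-role='comment-text']"):
        print(el.text)
finally:
    driver.quit()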

Scrape football predictions from bing

When you use Bing and search for "fussball bundesliga", Bing displays the last week, the current week and the next week. The games are normally on the weekend. If none of the games for the current week have been played yet, you will get probabilities for each team to win, lose or draw.
I can already get the results/predictions that don't need to be expanded, since they are in the HTML on load. To see more I need to expand that view somehow (via the expand arrow on the results widget). In a human-controlled browser this is easy.
The problem is that clicking on that arrow issues an onclick() event which executes some JavaScript, so I thought that using something with JavaScript support might help. So far I could not get the missing games, since I'm not able to programmatically click that arrow and load the page. Here is my code:
from bs4 import BeautifulSoup
from bs4.element import NavigableString
import requests
import sys
from lxml import html
import spynner
from time import sleep
import dryscrape

if __name__ == "__main__":
    url = "https://www.bing.com/search?q=fussball+bundesliga"

    sess = dryscrape.Session()
    sess.visit(url)
    response = sess.body()
    dryscrype_soup = BeautifulSoup(response, "lxml")
    #test = dryscrype_soup.findAll("div", {"id":"tab_3_dynamic"})
    dryscrape_actual_week = dryscrype_soup.findAll("div", {"id":"sp-full-29"})
    dryscrape_text = [i for i in dryscrype_soup.recursiveChildGenerator() if type(i) == NavigableString]
    dryscrape_all_text = dryscrape_actual_week[0].findAll(text=True)

    browser = spynner.Browser(debug_level=spynner.DEBUG)
    browser.show(True, True)
    browser.load(url)
    browser.runjs("sj_evt.fire('ExpandClick', '29', '');", True)
    #browser.wk_click(".//*[@id='sp-expandTop-more-29']", wait_load=True)
    #browser.wk_click_ajax(selector=".//*[@id='sp-expandTop-more-29']")
    browser.wait_load()
    markup = browser._get_html()
    spynner_soup = BeautifulSoup(markup, "lxml")
    spynner_actual_week = spynner_soup.findAll("div", {"id":"sp-full-29"})
    spynner_all_text = spynner_actual_week[0].findAll(text=True)
Don't mind the imports; I tried several things already. I tried Microsoft's Azure API, but it only delivers links, not these predictions. When you have a look at the HTML that is parsed, or at the variables spynner_all_text and dryscrape_all_text, you will notice that they contain only the results from the non-expanded webpage. I hope someone can help me with that.
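One direction that might be worth trying (a sketch only, not a tested answer: the element ids sp-expandTop-more-29 and sp-full-29 are taken from the commented-out selectors above and may differ between searches) is to drive a real browser with Selenium and click the expand arrow before parsing:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
try:
    driver.get("https://www.bing.com/search?q=fussball+bundesliga")
    # Click the expand arrow once it becomes clickable (id taken from the question, may vary)
    arrow = WebDriverWait(driver, 15).until(
        EC.element_to_be_clickable((By.ID, "sp-expandTop-more-29"))
    )
    arrow.click()
    # Wait for the expanded widget, then parse the full HTML
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.ID, "sp-full-29"))
    )
    soup = BeautifulSoup(driver.page_source, "lxml")
    actual_week = soup.find_all("div", {"id": "sp-full-29"})
    print(actual_week[0].find_all(text=True) if actual_week else "widget not found")
finally:
    driver.quit()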
