Selenium chromedriver not running pages js scripts - javascript

The page normally loads like this
but when it is opened with chromedriver via Selenium in Python it loads like this
I have looked up how to start JS scripts on a page. rocket-loader.min.js is a consistent thing I see in the page's source, but nothing I try works (JavascriptExecutor, implicit wait, explicit wait, time.sleep()); nothing seems to get the page to load so I can scrape the results from it. Here is my code for reference:
date = ["2022-11-02","2022-10-26","2022-10-19","2022-10-05"]
html = []
url = 'https://lfstats.com/scorecards/nightly?gametype=social&centerID=10&leagueID=0&isComp=0&date='
for x in date:
    driver = webdriver.Chrome('C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe')
    driver.get(url + str(date))
    time.sleep(10)
    lnks = driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/ul/div[1]/div[1]/a')
    print(lnks)
    try:
        for link in lnks:
            get = link.find_element(By.CSS_SELECTOR, 'a')
            hyper = (get.get_attribute('href'))
            html.append(hyper)
    except:
        print("unable to find link in group")
Any help would be greatly appreciated. I am moderately good at Python but new to web scraping and HTML. Please feel free to comment with how big of an idiot I am. Thank you.

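This is not related to JavaScript being enabled or disabled. The actual issue is that you are not passing the date from the 'date' list to the URL correctly: str(date) converts the whole list to a string, so every request gets all four dates appended at once instead of a single one.
In the for loop, you have to make the change below: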
for x in range(len(date)): # you have to loop through the length of the 'date' list
    driver = webdriver.Chrome('C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe')
    driver.get(url + str(date[x]))
    time.sleep(10)
    # in the below line you have to use find_elements, also I updated the correct locator
    lnks = driver.find_elements(By.XPATH, ".//*[@class='list-group-item-heading']")
    try:
        for link in lnks:
            get = link.find_element(By.CSS_SELECTOR, 'a')
            hyper = (get.get_attribute('href'))
            html.append(hyper)
    except:
        print("unable to find link in group")
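To make the difference concrete, here is a small sketch that only builds the URLs, with no browser involved. The names `base_url`, `wrong_url`, and `urls` are illustrative and not from the original code, and the date list is shortened to two entries.

```python
dates = ["2022-11-02", "2022-10-26"]
base_url = ("https://lfstats.com/scorecards/nightly"
            "?gametype=social&centerID=10&leagueID=0&isComp=0&date=")

# What the question's loop requested: str() of the whole list is appended.
wrong_url = base_url + str(dates)

# Iterating over the list itself yields exactly one date per request.
urls = [base_url + d for d in dates]

print(wrong_url)
for u in urls:
    print(u)
```

Iterating with `for d in dates` is equivalent to indexing with `range(len(date))` as in the answer; it just avoids the index variable.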

Related

Python Selenium Take a Screenshot Of An Whole Page Without Using Headless Mode

I need to take a screenshot of an element that is very long and does not fit on the screen. I could use headless mode to do this, but the site doesn't let me in even with a custom user-agent and other tweaks.
I can access the site with undetectedChromeDriver, though, and there is an extension called 'HTML Elements Screenshot' that does this.
That extension lets you select an element and takes a screenshot of the whole element for you.
I automated that process with pyautogui and cv2, but I want to do it without these libraries. Is there any JavaScript code to do it? I did my research but couldn't find anything useful. Thanks in advance.
The code I tried:
def save_screenshot(driver, path: str = 'screenshot.png') -> None:
    input('Let it go when u ready.')
    driver.switch_to.window(driver.window_handles[-1])
    original_size = driver.get_window_size()
    required_width = driver.execute_script('return document.body.parentNode.scrollWidth')
    required_height = driver.execute_script('return document.body.parentNode.scrollHeight')
    driver.set_window_size(required_width, required_height)
    driver.save_screenshot(path) # has scrollbar
    #driver.find_element_by_tag_name('body').screenshot(path) # avoids scrollbar
    driver.set_window_size(original_size['width'], original_size['height'])

def saveScreenshot(driver,path: str="screenshot.png"):
    input('Let it go when u ready.')
    driver.switch_to.window(driver.window_handles[-1])
    el = driver.find_element(By.TAG_NAME,"body")
    el.screenshot(path)
    driver.quit()
For Python you can use pyppeteer.
For JavaScript you can use puppeteer.
You can find the documentation here
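As a rough illustration of the pyppeteer suggestion, the sketch below opens a page in a visible (non-headless) Chromium window and saves a full-page screenshot. The function name `capture_full_page` and its parameters are made up for this example, and whether the site in question accepts a pyppeteer-driven browser is an open assumption.

```python
import asyncio


async def capture_full_page(url: str, path: str = "screenshot.png") -> None:
    """Open url in a visible Chromium window and save a full-page screenshot."""
    # Imported here so that merely loading this module does not need pyppeteer.
    from pyppeteer import launch

    # headless=False opens a normal window, since the question rules out headless mode.
    browser = await launch(headless=False)
    try:
        page = await browser.newPage()
        await page.goto(url)
        # fullPage=True captures the whole scrollable page, not just the viewport.
        await page.screenshot({"path": path, "fullPage": True})
    finally:
        await browser.close()
```

It could be run with `asyncio.run(capture_full_page("https://example.com"))`. To capture a single element instead of the page, `await page.querySelector(selector)` returns an element handle that has its own `screenshot` method.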

scrape data from 1 page go to next page and so on, javascript used on website

I've tried many different web scraping avenues and am hoping to get some help here. I have some Python code that gets me what I want from page 1 of my website.
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
results = soup.find(id='preblockBody')
results
print(results.prettify())
job_elems = results.find_all('table', class_='pbListingTable')
for job_elem in job_elems:
    title_elem2 = job_elem.find_all('tr', class_='pbListingTable1')
    for pbListingTable1 in job_elem.find_all('tr', {'class':'pbListingTable1'}):
        print(pbListingTable1.text)
    title_elem = job_elem.find_all('tr', class_='pbListingTable0')
    for pbListingTable0 in job_elem.find_all('tr', {'class':'pbListingTable0'}):
        print(pbListingTable0.text)
Then I would like to go to the next page and do the same thing, looping through all pages until the end and combining everything. However, I've had some trouble as the next page action is in javascript like so:
2
After further inspecting the website, I see that the action is:
<script language="JavaScript">
function sortPage(i) {
    document.location = baseHref + "website" + i;
}
function gotoNextPage(i) {
    document.location = baseHref + "website" + i;
}
I'm fairly new to all this so I'm pretty stuck. Any guidance is appreciated greatly. How can I get to the next page, loop through them all, and then combine?
Does this help?
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
driver = webdriver.Firefox()
driver.get(url)
##your code goes here##
nxt=driver.find_element_by_xpath("//a[contains(@href, 'gotoNextPage(2)')]")
driver.execute_script("arguments[0].click();", nxt)
Thanks again for everyone's suggestions. I've rewritten this using Selenium, and now I'm just hoping to get some help looping through all the pages and appending to a single output. Here is my code:
table = driver.find_element_by_id('preblockBody')
job_elems = table.find_elements_by_xpath("//*[contains(@class,'pbListingTable')]")
for value in job_elems:
    print(value.text)
nxt=driver.find_element_by_xpath("//a[contains(@href, 'gotoNextPage(2)')]")
driver.execute_script("arguments[0].click();", nxt)
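Since gotoNextPage(i) only sets document.location to baseHref + "website" + i, one possible way to loop over all pages is to build those addresses directly and collect the rows from each. The sketch below is a minimal version of that idea, not a drop-in solution: `page_url`, `scrape_all_pages`, `fetch_rows`, and `fake_site` are hypothetical names, and it assumes pages are numbered from 1 and that a page past the end yields no rows. The fetch-and-parse step is passed in as a function so the loop can be tried offline.

```python
from typing import Callable, List


def page_url(base_href: str, page: int) -> str:
    """Build the address that gotoNextPage(page) would navigate to."""
    # "website" is the placeholder path that appears in the question's script.
    return base_href + "website" + str(page)


def scrape_all_pages(fetch_rows: Callable[[str], List[str]],
                     base_href: str,
                     max_pages: int = 100) -> List[str]:
    """Collect rows from page 1 onward until a page yields nothing."""
    all_rows: List[str] = []
    for page in range(1, max_pages + 1):
        rows = fetch_rows(page_url(base_href, page))
        if not rows:  # assumed end-of-results signal
            break
        all_rows.extend(rows)
    return all_rows


# Stand-in for a real fetch-and-parse step, so the loop can be run offline.
fake_site = {
    "https://example.com/website1": ["row a", "row b"],
    "https://example.com/website2": ["row c"],
}
combined = scrape_all_pages(lambda u: fake_site.get(u, []), "https://example.com/")
print(combined)
```

In real use, `fetch_rows` would wrap either the requests/BeautifulSoup code or the Selenium code shown earlier in this thread and return the text of the pbListingTable rows for one page.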

Can I run Javascript code on a given URL via Python or PHP on the server side?

I'm building a website right now in which I'm trying to execute Javascript code to obtain specific elements of a totally separate site's web page given its URL. I've figured out how to use Selenium.Webdriver on my laptop in Python to achieve this by:
driver = webdriver.Firefox()
driver.get(url)
price = driver.execute_script("var priceElements = document.getElementById('priceblock_ourprice'); var prices = []; for (var i = 0; i < 3; i++) { prices.push(priceElements.children[i].innerHTML); } var price = prices[1] + '.' + prices[2]; return price;")
driver.close()
where it opens up the Firefox browser, runs the JS and then finds this price value that I want. I know there is also a way to do it which is headless and it won't need to physically open up a browser, but I'm trying to figure this out for the case of doing something similar on the website that I'm building.
Is there some way that I can achieve this result so that, instead of running on my personal machine, the code runs on the Apache web server? I'm just not sure if this is possible without the machine it's running on having a browser that it can open.
I hope I'm clear enough on what I'm doing, but if anyone needs clarification then I'll be happy to answer any questions you may have about this situation.
I did something similar with "lxml" and "urllib".
Here is an example:
from lxml import html
from urllib.request import urlopen  # Python 3 location of urlopen
pageLink = ""  # URL of the page to fetch
page = urlopen(pageLink)
pageString = page.read()
pageContent = html.fromstring(pageString)
# XPath: the src attribute of the <img> whose id is "img-1"
imgLink = pageContent.xpath('//img[@id="img-1"]/@src')[0]
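For comparison, here is a dependency-free variant of the same idea that swaps lxml for `html.parser` from the standard library and runs on an inline HTML string rather than a fetched page. The class name `ImgSrcById` and the sample markup are invented for illustration. Like the lxml version, it only sees the HTML as delivered, so it would not pick up values that the page fills in with JavaScript.

```python
from html.parser import HTMLParser


class ImgSrcById(HTMLParser):
    """Record the src attribute of the first <img> with a given id."""

    def __init__(self, target_id: str) -> None:
        super().__init__()
        self.target_id = target_id
        self.src = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "img"
                and attributes.get("id") == self.target_id
                and self.src is None):
            self.src = attributes.get("src")


# Made-up markup standing in for a downloaded page.
sample_html = (
    '<html><body>'
    '<img id="img-0" src="/a.png">'
    '<img id="img-1" src="/b.png">'
    '</body></html>'
)

parser = ImgSrcById("img-1")
parser.feed(sample_html)
print(parser.src)
```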

Can I extract comments of any page from https://www.rt.com/ using python3?

I am writing a web crawler. I extracted the heading and main discussion of this link, but I am unable to find any of the comments in the page source (Ctrl+U, then Ctrl+F for the comment text). I think the comments are loaded by JavaScript. Can I extract them?
RT uses a service from spot.im for comments.
You need to make two POST requests: first to https://api.spot.im/me/network-token/spotim to get a token, then to https://api.spot.im/conversation-read/spot/sp_6phY2k0C/post/353493/get to get the comments as JSON.
I wrote a quick script to do this:
import requests
import re
import json

def get_rt_comments(article_url):
    spotim_spotId = 'sp_6phY2k0C' # spotim id for RT
    post_id = re.search('([0-9]+)', article_url).group(0)
    r1 = requests.post('https://api.spot.im/me/network-token/spotim').json()
    spotim_token = r1['token']
    payload = {
        "count": 25, #number of comments to fetch
        "sort_by":"best",
        "cursor":{"offset":0,"comments_read":0},
        "host_url": article_url,
        "canonical_url": article_url
    }
    r2_url ='https://api.spot.im/conversation-read/spot/' + spotim_spotId + '/post/'+ post_id +'/get'
    r2 = requests.post(r2_url, data=json.dumps(payload), headers={'X-Spotim-Token': spotim_token , "Content-Type": "application/json"})
    return r2.json()

if __name__ == '__main__':
    url = 'https://www.rt.com/usa/353493-clinton-speech-affairs-silence/'
    comments = get_rt_comments(url)
    print(comments)
Yes, if it can be viewed with a web browser, you can extract it.
If you look at the source, it is really an iframe that loads a piece of JavaScript, which then creates a new script tag in the document whose source is bundle.js, which contains the actual commenting software. This in turn fetches the actual comments.
Instead of going through this manually, you could consider using, for example, WebKit to create a headless browser that executes the JavaScript like an ordinary browser. Then you can scrape from that instead of having to make your crawler fetch the external resources by hand.
Examples of such headless browsers are Spynner, dryscrape, or the PhantomJS-derived PhantomPy (the latter seems to be an abandoned project now).

invoking onclick event with beautifulsoup python

I am trying to fetch the links to all accommodations in Cyprus from this website:
http://www.zoover.nl/cyprus
So far I can retrieve the first 15, which are already shown. So now I have to invoke a click on the "volgende" link. However, I don't know how to do that, and in the source code I am not able to track down the function being called so that I could use something like what is posted here:
Issues with invoking "on click event" on the html page using beautiful soup in Python
I only need the step where the "clicking" happens so I can fetch the next 15 links and so on.
Does anybody know how to help?
Thanks already!
EDIT:
My code looks like this now:
def getZooverLinks(country):
    zooverWeb = "http://www.zoover.nl/"
    url = zooverWeb + country
    parsedZooverWeb = parseURL(url)
    driver = webdriver.Firefox()
    driver.get(url)
    button = driver.find_element_by_class_name("next")
    links = []
    for page in xrange(1,3):
        for item in parsedZooverWeb.find_all(attrs={'class': 'blue2'}):
            for link in item.find_all('a'):
                newLink = zooverWeb + link.get('href')
                links.append(newLink)
        button.click()
and I get the following error:
selenium.common.exceptions.StaleElementReferenceException: Message: Element is no longer attached to the DOM
Stacktrace:
    at fxdriver.cache.getElementAt (resource://fxdriver/modules/web-element-cache.js:8956)
    at Utils.getElementAt (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:8546)
    at fxdriver.preconditions.visible (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:9585)
    at DelayedCommand.prototype.checkPreconditions_ (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:12257)
    at DelayedCommand.prototype.executeInternal_/h (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:12274)
    at DelayedCommand.prototype.executeInternal_ (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:12279)
    at DelayedCommand.prototype.execute/< (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:12221)
I'm confused :/
While it might be tempting to try this with an evaluateJavaScript call (which comes from Qt's WebKit bindings, not from BeautifulSoup), in the end BeautifulSoup is a parser rather than an interactive web browsing client.
You should seriously consider solving this with selenium, as briefly shown in this answer. There are pretty good Python bindings available for selenium.
You could just use selenium to find the element and click it, and then pass the page on to Beautifulsoup, and use your existing code to fetch the links.
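A minimal sketch of that flow is below. It also addresses the StaleElementReferenceException from the question by looking the "next" button up again on every page instead of reusing a reference found before the first click, and by parsing the HTML the browser currently shows rather than the first page's copy. The names `collect_links`, `extract_links`, and `NEXT_BUTTON` are hypothetical, and the parsing step (for example the existing BeautifulSoup code) is passed in as a function.

```python
from typing import Callable, List, Tuple

# ("class name", "next") is the plain-tuple form of (By.CLASS_NAME, "next").
NEXT_BUTTON: Tuple[str, str] = ("class name", "next")


def collect_links(driver,
                  extract_links: Callable[[str], List[str]],
                  pages: int,
                  next_locator: Tuple[str, str] = NEXT_BUTTON) -> List[str]:
    """Gather links from `pages` consecutive result pages."""
    links: List[str] = []
    for page in range(pages):
        # Parse what the browser shows now, not a copy fetched earlier.
        links.extend(extract_links(driver.page_source))
        if page < pages - 1:
            # Find the button again: a reference from before the previous
            # click no longer points into the current DOM.
            driver.find_element(*next_locator).click()
    return links
```

With a real driver, a wait after each click (for instance an explicit wait for the new results) would likely be needed before reading `page_source`; that part is left out here to keep the sketch short.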
Alternatively, you could use the JavaScript that's listed in the onclick handler. I pulled this from the source: EntityQuery('Ns=pPopularityScore%7c1&No=30&props=15292&dims=530&As=&N=0+3+10500915');. The No parameter increments by 15 for each page, but the props parameter has me guessing. I'd recommend not getting into this, though, and just interacting with the website as a client would, using selenium. That's much more robust to changes on their side, as well.
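For what it's worth, the arithmetic behind the No parameter can be sketched as follows. This assumes the first page corresponds to No=0 (so the quoted No=30 would be the third page) and that the other parameters stay fixed, neither of which is confirmed by the source; `query_for_page` and `PAGE_SIZE` are names invented here.

```python
import re

PAGE_SIZE = 15  # per the answer, No grows by 15 per page

# Query string copied from the onclick handler quoted above (No=30).
SAMPLE_QUERY = "Ns=pPopularityScore%7c1&No=30&props=15292&dims=530&As=&N=0+3+10500915"


def query_for_page(page_index: int, query: str = SAMPLE_QUERY) -> str:
    """Return the query with No set to the offset of a zero-based page index."""
    offset = page_index * PAGE_SIZE
    # Replace only the digits that follow "&No=".
    return re.sub(r"(?<=&No=)\d+", str(offset), query, count=1)


for index in range(3):
    print(query_for_page(index))
```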
I tried the following code and was able to load next page. Hope this will help you too.
Code:
from selenium import webdriver
import os
chromedriver = r"C:\Users\pappuj\Downloads\chromedriver"  # raw string so the backslashes are not read as escape sequences
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
url='http://www.zoover.nl/cyprus'
driver.get(url)
driver.find_element_by_class_name('next').click()
Thanks
