I am trying to fetch the links to all accommodations in Cyprus from this website:
http://www.zoover.nl/cyprus
So far I can retrieve the first 15, which are shown by default. Now I have to invoke a click on the "volgende" (next) link. However, I don't know how to do that, and in the page source I cannot track down the function that is called, so I can't use something like the approach posted here:
Issues with invoking "on click event" on the html page using beautiful soup in Python
I only need the step where the "clicking" happens so I can fetch the next 15 links and so on.
Does anybody know how to help?
Thanks in advance!
EDIT:
My code looks like this now:
def getZooverLinks(country):
    zooverWeb = "http://www.zoover.nl/"
    url = zooverWeb + country
    parsedZooverWeb = parseURL(url)
    driver = webdriver.Firefox()
    driver.get(url)
    button = driver.find_element_by_class_name("next")
    links = []

    for page in xrange(1, 3):
        for item in parsedZooverWeb.find_all(attrs={'class': 'blue2'}):
            for link in item.find_all('a'):
                newLink = zooverWeb + link.get('href')
                links.append(newLink)
        button.click()
and I get the following error:
selenium.common.exceptions.StaleElementReferenceException: Message: Element is no longer attached to the DOM
Stacktrace:
at fxdriver.cache.getElementAt (resource://fxdriver/modules/web-element-cache.js:8956)
at Utils.getElementAt (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:8546)
at fxdriver.preconditions.visible (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:9585)
at DelayedCommand.prototype.checkPreconditions_ (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:12257)
at DelayedCommand.prototype.executeInternal_/h (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:12274)
at DelayedCommand.prototype.executeInternal_ (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:12279)
at DelayedCommand.prototype.execute/< (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:12221)
I'm confused :/
While it might be tempting to look for something like an evaluateJavaScript method in BeautifulSoup, in the end BeautifulSoup is a parser rather than an interactive web-browsing client.
You should seriously consider solving this with Selenium, as briefly shown in this answer. There are pretty good Python bindings available for Selenium.
You could just use Selenium to find the element and click it, then pass the page on to BeautifulSoup and use your existing code to fetch the links.
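For example, something along these lines (a rough, untested sketch; the 'blue2' and 'next' class names come from your snippet and may have changed on the live site). Re-locating the "next" button on every pass also avoids the StaleElementReferenceException from your edit:
from bs4 import BeautifulSoup
from selenium import webdriver

def get_zoover_links(country, pages=3):
    driver = webdriver.Firefox()
    driver.get("http://www.zoover.nl/" + country)
    links = []
    for _ in range(pages):
        # Parse whatever the browser is currently showing
        soup = BeautifulSoup(driver.page_source, "html.parser")
        for item in soup.find_all(attrs={'class': 'blue2'}):
            for link in item.find_all('a'):
                links.append("http://www.zoover.nl/" + link.get('href'))
        # Look the button up again on every pass: the old reference goes
        # stale once the page content has been re-rendered.
        driver.find_element_by_class_name("next").click()
    driver.quit()
    return links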
Alternatively, you could use the JavaScript that's listed in the onclick handler. I pulled this from the source: EntityQuery('Ns=pPopularityScore%7c1&No=30&props=15292&dims=530&As=&N=0+3+10500915');. The No parameter increments by 15 for each page, but the props parameter has me guessing. I'd recommend not going down that route, though, and just interacting with the website as a client would, using Selenium. That is also much more robust against changes on their side.
I tried the following code and was able to load the next page. Hope this will help you too.
Code:
from selenium import webdriver
import os

# Use a raw string (or escaped backslashes) for the Windows path
chromedriver = r"C:\Users\pappuj\Downloads\chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
url = 'http://www.zoover.nl/cyprus'
driver.get(url)
driver.find_element_by_class_name('next').click()
Thanks
I need to take a screenshot of an element which is very long and does not fit on the screen. I could use headless mode to do this, but the site doesn't let me in, even with a custom user-agent and other tweaks.
But I can access the site with undetected_chromedriver, and there's an extension for this called 'HTML Elements Screenshot'.
That extension lets you select an element and takes a screenshot of the whole element for you.
I automated that process with pyautogui and cv2, but I want to do it without those libraries. Is there any JavaScript code to do it? I did my research but couldn't find anything useful. Thanks in advance.
The code I tried:
def save_screenshot(driver, path: str = 'screenshot.png') -> None:
    input('Let it go when u ready.')
    driver.switch_to.window(driver.window_handles[-1])
    original_size = driver.get_window_size()
    required_width = driver.execute_script('return document.body.parentNode.scrollWidth')
    required_height = driver.execute_script('return document.body.parentNode.scrollHeight')
    driver.set_window_size(required_width, required_height)
    driver.save_screenshot(path)  # has scrollbar
    # driver.find_element_by_tag_name('body').screenshot(path)  # avoids scrollbar
    driver.set_window_size(original_size['width'], original_size['height'])
def saveScreenshot(driver, path: str = "screenshot.png"):
    input('Let it go when u ready.')
    driver.switch_to.window(driver.window_handles[-1])
    el = driver.find_element(By.TAG_NAME, "body")
    el.screenshot(path)
    driver.quit()
For Python you can use pyppeteer.
For JavaScript you can use puppeteer.
You can find the documentation here.
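For example, something along these lines with pyppeteer (the URL and CSS selector are placeholders, and I haven't tested whether the site blocks this headless browser as well):
import asyncio
from pyppeteer import launch

async def element_screenshot(url, selector, path):
    browser = await launch()
    page = await browser.newPage()
    await page.goto(url)
    element = await page.querySelector(selector)  # ElementHandle, or None if nothing matches
    await element.screenshot(path=path)           # the shot is cropped to the element's box
    await browser.close()

asyncio.get_event_loop().run_until_complete(
    element_screenshot('https://example.com', '#long-element', 'element.png'))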
The page normally loads like this,
but when it is opened with ChromeDriver via Selenium in Python it loads like this.
I have looked up how to run JS scripts on a page; rocket-loader.min.js is a consistent thing I see in the page source, but nothing I try works (JavascriptExecutor, implicit wait, explicit wait, time.sleep()). Nothing seems to get the page to load so I can scrape the results from it. Here is my code for reference:
date = ["2022-11-02", "2022-10-26", "2022-10-19", "2022-10-05"]
html = []
url = 'https://lfstats.com/scorecards/nightly?gametype=social&centerID=10&leagueID=0&isComp=0&date='

for x in date:
    driver = webdriver.Chrome('C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe')
    driver.get(url + str(date))
    time.sleep(10)
    lnks = driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/ul/div[1]/div[1]/a')
    print(lnks)
    try:
        for link in lnks:
            get = link.find_element(By.CSS_SELECTOR, 'a')
            hyper = (get.get_attribute('href'))
            html.append(hyper)
    except:
        print("unable to find link in group")
Any help would be greatly appreciated. I am moderately good at Python but new to web scraping and HTML. Please feel free to comment on how big of an idiot I am. Thank you.
This is not related to JavaScript being enabled or disabled. The actual issue is that you are not passing the dates from the 'date' list into the URL correctly.
In the for loop, you have to make the change below:
for x in range(len(date)):  # you have to loop through the length of the 'date' list
    driver = webdriver.Chrome('C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe')
    driver.get(url + str(date[x]))
    time.sleep(10)
    # in the below line you have to use find_elements, also I updated the correct locator
    lnks = driver.find_elements(By.XPATH, ".//*[@class='list-group-item-heading']")
    try:
        for link in lnks:
            get = link.find_element(By.CSS_SELECTOR, 'a')
            hyper = (get.get_attribute('href'))
            html.append(hyper)
    except:
        print("unable to find link in group")
I am writing a web crawler. I extracted the heading and the main discussion of this link, but I am unable to find any of the comments (Ctrl+U -> Ctrl+F -> comment text). I think the comments are rendered with JavaScript. Can I extract them?
RT are using a service from spot.im for the comments.
You need to make two POST requests: first to https://api.spot.im/me/network-token/spotim to get a token, then to https://api.spot.im/conversation-read/spot/sp_6phY2k0C/post/353493/get to get the comments as JSON.
I wrote a quick script to do this:
import requests
import re
import json

def get_rt_comments(article_url):
    spotim_spotId = 'sp_6phY2k0C'  # spotim id for RT
    post_id = re.search('([0-9]+)', article_url).group(0)
    r1 = requests.post('https://api.spot.im/me/network-token/spotim').json()
    spotim_token = r1['token']
    payload = {
        "count": 25,  # number of comments to fetch
        "sort_by": "best",
        "cursor": {"offset": 0, "comments_read": 0},
        "host_url": article_url,
        "canonical_url": article_url
    }
    r2_url = 'https://api.spot.im/conversation-read/spot/' + spotim_spotId + '/post/' + post_id + '/get'
    r2 = requests.post(r2_url, data=json.dumps(payload), headers={'X-Spotim-Token': spotim_token, "Content-Type": "application/json"})
    return r2.json()

if __name__ == '__main__':
    url = 'https://www.rt.com/usa/353493-clinton-speech-affairs-silence/'
    comments = get_rt_comments(url)
    print(comments)
Yes, if it can be viewed with a web browser, you can extract it.
If you look at the source, it is really an iframe that loads a piece of JavaScript, which then creates a new script tag in the document whose source loads bundle.js, and that contains the actual commenting software. This in turn fetches the actual comments.
Instead of going through this manually, you could consider using, for example, WebKit to create a headless browser that executes the JavaScript like an ordinary browser. Then you can scrape from that, instead of having to make your crawler fetch the external resources by hand.
Examples of such headless browsers are Spynner, dryscrape, or the PhantomJS-derived PhantomPy (the latter seems to be an abandoned project now).
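For example, a quick dryscrape sketch (it needs dryscrape installed and, on a machine without a display, Xvfb; untested against RT's current pages):
import dryscrape
from bs4 import BeautifulSoup

dryscrape.start_xvfb()  # only needed when there is no display available
session = dryscrape.Session()
session.visit('https://www.rt.com/usa/353493-clinton-speech-affairs-silence/')
# session.body() returns the DOM after the embedded scripts have run,
# so the injected comment markup can be parsed like any other HTML.
soup = BeautifulSoup(session.body(), 'html.parser')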
I want to use HtmlUnit (v2.21) to get some search result pages from Google. This requires me to click on the "people also looked for" link when searching for a person (right side, see example link), which triggers some JavaScript and changes the content of the current page. But this gives me a wrapped JavaScript exception (see below).
Clickable example link: https://www.google.de/search?ie=UTF-8&safe=off&q=nicki+minaj
Simple TestCase with errors:
String url = "https://www.google.de/search?ie=UTF-8&safe=off&q=nicki+minaj";
WebClient client = new WebClient(BrowserVersion.BEST_SUPPORTED);
HtmlPage page = client.getPage(url);
HtmlElement link = page.getFirstByXPath("//a[@class='_Zjg']");
HtmlPage newPage = link.click(); // throws exception
this.storeResultFile(newPage.asXml(), "test");
client.close();
Result:
net.sourceforge.htmlunit.corejs.javascript.WrappedException: Wrapped java.lang.NullPointerException
at net.sourceforge.htmlunit.corejs.javascript.Context.throwAsScriptRuntimeEx(Context.java:2053)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.doProcessPostponedActions(JavaScriptEngine.java:947)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.processPostponedActions(JavaScriptEngine.java:1012)
at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:799)
at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:742)
at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:689)
I stored the XML of the "page" object and made sure that the XPath expression is valid and returns results.
Anybody got any ideas?
It looks like the JavaScript engine (based on Rhino) is very easy to upset and quits on some script issues where other browsers are still able to run the script.
I don't know if there is a mistake in Google's scripts, but these two lines solved it for me:
JavaScriptEngine engine = client.getJavaScriptEngine();
engine.holdPosponedActions();
Nevertheless, when running multiple HtmlUnit objects in multiple threads it is still possible to run across this error. This is more a workaround than a solution.
I'm trying to migrate from Feedly, as it is unacceptable (at least to me) that search is (fully) enabled only in the pro version.
Anyhow, to export my lengthy "saved for later" list I found some lovely scripts:
a simple script that exports a user's "Saved For Later" list out of Feedly as a JSON string, and feedly-to-pocket, where I am instructed:
You must switch off SSL (http rather than https) or jQuery won't load!
So I thought I did, by adding (Ubuntu 14.04 / Chrome 40 x64)
--ssl-version-min=tls1
to my /usr/share/applications/google-chrome.desktop file (all lines starting with Exec=). However, when I try to run it in the browser console I get:
This request has been blocked; the content must be served over HTTPS.
So, any suggestions? (Also, excuse my noobness.)
Go to your Feedly "saved" list and scroll down until all articles have loaded.
Open the console and paste the following JavaScript into it:
function loadJQuery() {
    script = document.createElement('script');
    script.setAttribute('src', '//code.jquery.com/jquery-2.1.3.js');
    script.setAttribute('type', 'text/javascript');
    script.onload = loadSaveAs;
    document.getElementsByTagName('head')[0].appendChild(script);
}

function loadSaveAs() {
    saveAsScript = document.createElement('script');
    saveAsScript.setAttribute('src', 'https://cdn.rawgit.com/eligrey/FileSaver.js/5733e40e5af936eb3f48554cf6a8a7075d71d18a/FileSaver.js');
    saveAsScript.setAttribute('type', 'text/javascript');
    saveAsScript.onload = saveToFile;
    document.getElementsByTagName('head')[0].appendChild(saveAsScript);
}

function saveToFile() {
    // Loop through the DOM, grabbing the information from each bookmark
    map = jQuery(".entry.quicklisted").map(function(i, el) {
        var $el = jQuery(el);
        var regex = /Published:(.*)(.*)/i;
        return {
            title: $el.attr("data-title"),
            url: $el.attr("data-alternate-link"),
            summary: $el.find(".summary")[0].innerHTML,
            time: regex.exec($el.find("span.ago").attr("title"))[1]
        };
    }).get(); // Convert jQuery object into an array

    // Convert to a nicely indented JSON string
    json = JSON.stringify(map, undefined, 2);
    var blob = new Blob([json], {type: "text/plain;charset=utf-8"});
    saveAs(blob, "FeedlySavedForLater" + Date.now().toString() + ".txt");
}

loadJQuery()
Source: Feedly-Export-Save4Later
Not JavaScript, but here is how I saved an HTML page with all the links and excerpts...
Open the saved pages in Feedly in Chrome
Scroll down so they are all there
Inspect any element (the top article is a good choice) so it opens the generated HTML
Find the div id="section0_column0" node
Right-click & copy it
Paste into Notepad++
This HTML is untidy, so carry on...
Do a regex find & replace (the same substitution can also be scripted; see the Python sketch after these steps)
find: (?s)<div id=.+?_main.+?>.+?(<a href=")(.+?)(").+?sans-serif">(.+?)</span>.+?</div>.+?</div>.+?</div>
replace: <div>$1$2$3>$2</a></div> <div> $4<br /> <br /></div>
Save the HTML page.
Open it in Chrome
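If you would rather script that substitution, a rough Python equivalent (the filenames here are just examples) could look like this:
import re

# Read the node copied out of Chrome's inspector
with open('feedly_section.html', encoding='utf-8') as f:
    raw = f.read()

pattern = re.compile(r'(?s)<div id=.+?_main.+?>.+?(<a href=")(.+?)(").+?sans-serif">(.+?)</span>.+?</div>.+?</div>.+?</div>')
# \1..\4 mirror the $1..$4 backreferences from the Notepad++ replacement
tidy = pattern.sub(r'<div>\1\2\3>\2</a></div> <div> \4<br /> <br /></div>', raw)

with open('feedly_links.html', 'w', encoding='utf-8') as f:
    f.write(tidy)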
I posted the question in the jQuery forum and the solution was rather simple (remove http from the attribute string):
line 34 should be
script.setAttribute('src', '//code.jquery.com/jquery-latest.min.js');
So to close the loop: for a fully searchable/archived list of links, not only by title/URL but by context as well(!), you can:
Follow the instructions in https://github.com/ShockwaveNN/feedly-to-pocket (with the correction suggested by kind stranger jakecigar; you also have to register a Pocket app and obtain a consumer key for the Ruby script to work)
Export the HTML list from your Pocket account
Import the Pocket list into a Kifi library
and at last be Feedly-free, with my own personal search engine.
I know I'm a bit late to the party, but I've been hunting around for a few days for a reasonably simple solution, none of which has been laid out clearly or concisely on Stack Overflow or elsewhere on the web. I have in fact found a much easier way to do this.
Use the JavaScript from this Gist just as it instructs: https://gist.github.com/ShockwaveNN/a0baf2ca26d1711f10e2 (note this is referenced above and found through the link @gep shared in step one)
Once the JS has completed running it will download a text file. (It does still run successfully, and on large numbers; I just exported almost 2500 articles.)
Create a blank test.json in Sublime Text.
Copy all entries from your exported text file into this JSON file.
Weirdly, it does seem you need to copy and paste; I tried just renaming the text file, and when I did that I got errors on the next step.
Make sure you are signed into Pocket.
Go here: https://getpocket.com/import/springpad
Select your newly created test.json.
Upload.
Note: on large uploads the import page fails to refresh (this did not seem to be an issue, as all my articles did make it into my account).
This lets you upload JSON directly into your Pocket account, so no more messing around with random supposed other fixes. I hope this makes it a lot easier for everyone in the future.