In the web console, getting the selected (highlighted) text is a simple matter:
window.getSelection().toString()
How about doing this in a headless browser? In particular, I'm using Selenium with its Python API. I cannot find any method similar to getSelection() on driver:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get("http://www.python.org")
For example, suppose I have selected/highlighted (with the cursor) the string "suppose I have " on this page; the desired output should be "suppose I have ". In case no text is selected/highlighted, the empty string "" should be returned.
The answer I found is to execute JavaScript directly within Selenium. For example, to accomplish this, run the following script.
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
# Manually highlight some text with your cursor, then:
selected = driver.execute_script("return window.getSelection().toString();")
print(selected)  # the highlighted text, or "" if nothing is selected
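In a truly headless session there is no cursor to make a selection with, but you can build one programmatically with the standard DOM Range API and read it back the same way. A minimal sketch, assuming the page has an h1 element (the selector is only an illustration):
# Hypothetical: programmatically select the contents of the first <h1>,
# then read the selection back. The 'h1' selector is an assumption.
driver.execute_script("""
    var el = document.querySelector('h1');
    var range = document.createRange();
    range.selectNodeContents(el);    // build a selection over the element
    var sel = window.getSelection();
    sel.removeAllRanges();           // clear any existing selection
    sel.addRange(range);             // apply the new one
""")
print(driver.execute_script("return window.getSelection().toString();"))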
Slightly unrelated but useful: this works within the currently focused window. To switch among different windows, see [1].
[1] Python Selenium get current window handle
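For reference, switching windows looks roughly like this (a minimal sketch; the handle-picking logic is only illustrative):
# Switch to the first window handle that is not the current one.
current = driver.current_window_handle
for handle in driver.window_handles:
    if handle != current:
        driver.switch_to.window(handle)  # focus moves to that window
        break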
I am trying to use the Requests framework with Python (http://docs.python-requests.org/en/latest/), but the page I am trying to get uses JavaScript to fetch the info that I want.
I have tried to search the web for a solution, but because I am searching with the keyword "javascript", most of what I get is about how to scrape with the JavaScript language itself.
Is there any way to use the Requests framework with pages that use JavaScript?
Good news: there is now a requests module that supports JavaScript: https://pypi.org/project/requests-html/
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://www.yourjspage.com')
r.html.render() # this call executes the js in the page
As a bonus this wraps BeautifulSoup, I think, so you can do things like
r.html.find('#myElementID', first=True).text
which returns the content of the HTML element as you'd expect (without first=True, find() returns a list of all matches).
You are going to have to make the same request (using the Requests library) that the JavaScript is making. You can use any number of tools (including those built into Chrome and Firefox) to inspect the HTTP request coming from the JavaScript, and then simply make that request yourself from Python.
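As a sketch of that workflow: suppose the Network tab shows the page fetching a JSON endpoint. The URL, parameter, and header below are purely hypothetical stand-ins for whatever you actually find in dev tools:
import requests

# Hypothetical endpoint discovered in the browser's Network tab.
url = "https://example.com/api/items"
params = {"page": 1}  # assumed query parameter, for illustration only
headers = {"User-Agent": "Mozilla/5.0"}  # some endpoints reject requests without one

response = requests.get(url, params=params, headers=headers)
data = response.json()  # the same JSON the page's JavaScript would have consumed
print(data)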
While Selenium might seem tempting and useful, it has one main problem that can't be fixed: performance. Because it computes every single thing a browser does, it needs a lot more power. Even PhantomJS does not compete with a simple request. I recommend using Selenium only when you really need to click buttons. If you only need JavaScript, I recommend PyQt (check https://www.youtube.com/watch?v=FSH77vnOGqU to learn it).
However, if you want to use Selenium, I recommend Chrome over PhantomJS. Many users have problems with PhantomJS where a website simply does not work in Phantom. Chrome can be headless (non-graphical) too!
First, make sure you have installed ChromeDriver, which Selenium depends on for using Google Chrome.
Then, make sure you have Google Chrome version 60 or higher by checking chrome://settings/help in the URL bar.
Now, all you need to do is the following code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
If you do not know how to use Selenium, here is a quick overview:
driver.get("https://www.google.com") #Browser goes to google.com
Finding elements:
Use either the ELEMENTS or ELEMENT method. Examples:
driver.find_element_by_css_selector("div.logo-subtext") #Find your country in Google. (singular)
driver.find_element(s)_by_css_selector(css_selector) # Every element that matches this CSS selector
driver.find_element(s)_by_class_name(class_name) # Every element with the following class
driver.find_element(s)_by_id(id) # Every element with the following ID
driver.find_element(s)_by_link_text(link_text) # Every link (<a>) with the full link text
driver.find_element(s)_by_partial_link_text(partial_link_text) # Every link (<a>) with partial link text
driver.find_element(s)_by_name(name) # Every element where name=argument
driver.find_element(s)_by_tag_name(tag_name) # Every element with the tag name argument
Ok! I found an element (or elements list). But what do I do now?
Here are the methods you can do on an element elem:
elem.tag_name # Could return "button" for a <button> element.
elem.get_attribute("id") # Returns the ID of an element.
elem.text # The inner text of an element.
elem.clear() # Clears a text input.
elem.is_displayed() # True for visible elements, False for invisible elements.
elem.is_enabled() # True for an enabled input, False otherwise.
elem.is_selected() # Is this radio button or checkbox element selected?
elem.location # A dictionary representing the X and Y location of an element on the screen.
elem.click() # Click elem.
elem.send_keys("thelegend27") # Type thelegend27 into elem (useful for text inputs)
elem.submit() # Submit the form in which elem takes part.
Special commands:
driver.back() # Click the Back button.
driver.forward() # Click the Forward button.
driver.refresh() # Refresh the page.
driver.quit() # Close the browser including all the tabs.
foo = driver.execute_script("return 'hello';") # Execute javascript (COULD TAKE RETURN VALUES!)
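Putting the overview together, a minimal end-to-end sketch might look like this (the search box name "q" is an assumption about python.org's markup):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

driver.get("https://www.python.org")  # browser goes to python.org
box = driver.find_element_by_name("q")  # the search box (name assumed)
box.send_keys("selenium")  # type a query
box.submit()  # submit the surrounding form
print(driver.title)  # title of the results page
driver.quit()  # close the browser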
Using Selenium or JavaScript-enabled requests is slow. It is more efficient to find out which cookie the website generates after its JavaScript check runs in the browser, and then send that cookie with each of your requests.
In one example it worked through the following cookie:
The cookie generated after the JavaScript check in this example is "cf_clearance".
So simply create a session and update the cookie and headers like so:
import requests

s = requests.Session()
s.cookies["cf_clearance"] = "cb4c883efc59d0e990caf7508902591f4569e7bf-1617321078-0-150"
s.headers.update({
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
})
s.get(url)
and you are good to go, with no need for a JavaScript solution such as Selenium. This is much faster and more efficient. You just have to get the cookie once after opening up the browser.
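If you need a browser to mint that cookie in the first place, one way to automate the handoff is to open the page once with Selenium, copy its cookies into a requests session, and carry on without the browser. A sketch, reusing the url from above:
from selenium import webdriver
import requests

driver = webdriver.Chrome()
driver.get(url)  # let the browser pass the JavaScript check once

s = requests.Session()
for cookie in driver.get_cookies():  # copy every browser cookie...
    s.cookies.set(cookie["name"], cookie["value"])  # ...into the requests session
driver.quit()

s.get(url)  # plain requests now carry the cf_clearance cookie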
One way to do that is to invoke your request by using Selenium.
Let's install the dependencies by using pip or pip3:
pip install selenium
If you run the script with python3, use instead:
pip3 install selenium
(...)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
url = 'http://myurl.com'
driver.get(url)

# Please wait until the page is ready:
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.some_placeholder")))
text = element.text  # <-- Here it is! I got what I wanted :)
It's a wrapper around pyppeteer or something? :( I thought it was something different.
@property
async def browser(self):
    if not hasattr(self, "_browser"):
        self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, args=self.__browser_args)
    return self._browser
I am trying to figure out how to scrape this website, https://cnx.org/search?q=subject:%22Arts%22, which is rendered via JavaScript. When I view the page source, there is very little code. I know that BeautifulSoup can't do this. I have tried Selenium, but I am new to it. Any suggestions on how scraping this site could be accomplished?
You can use Selenium to do this. You won't look at the HTML source code, though. Press F12 in Chrome (or install Firebug in Firefox) to get into the developer tools. Once there, you can select elements (pointer icon at the top left of the dev tools window). Once you click what you want, you can right-click the highlighted portion in the "Elements" column and choose Copy -> XPath. Be careful to use proper quotes in your code, because XPaths usually use double quotes, which are also common when using the find_element_by_xpath method.
Essentially you instantiate your browser, go to the page, find the element by xpath (an XML language to just go to a specific spot on a page that uses javascript). It's roughly like this:
from selenium import webdriver

driver = webdriver.Chrome()

# Load page
driver.get("https://www.instagram.com/accounts/login/")

# Find your element via its XPath (see above on how to copy it)
# The "Madlavning" entry on the page would be:
element = driver.find_element_by_xpath('//*[@id="results"]/div/table/tbody/tr[1]/td[2]/h4/a')

# Pull the text:
print(element.text)

# Ensure you don't get zombie/defunct chrome/firefox instances that suck up resources
driver.quit()
Selenium can be used for plenty of scraping; you just need to know what you want to do once you find the info.
You can use the API that the web page gets its data from (using JavaScript) directly, https://archive.cnx.org/search?q=subject:%22Arts%22. It returns JSON, so you just need to parse it.
import requests
import json

url = "https://archive.cnx.org/search?q=subject:%22Arts%22"
r = requests.get(url)
j = r.json()

# Print the json object
print(json.dumps(j, indent=4, sort_keys=True))

# Or print specific values
for i in j['results']['items']:
    print(i['title'])
    print(i['summarySnippet'])
Try Puppeteer, Google's official Node library for driving headless Chrome.
Install:
npm i puppeteer
Usage:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
It's easy to use and has good documentation.
I am trying to scrape a web page which has JavaScript in it using PhantomJS. I found the button element, and when I click it, it should render the next link. But I am not getting the exact output I want; instead, I am getting a different, unwanted output.
The code is:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver

s = requests.session()
fg = s.get('https://in.bookmyshow.com/booktickets/INCM/32076', headers=headers)  # headers is defined earlier in my script
so = BeautifulSoup(fg.text, "html.parser")
texts = so.findAll("div", {"class": "__buytickets"})
print(texts[0].a['href'])
print(fg.url)

driver = webdriver.PhantomJS()
driver.get(movie_links[0])  # movie_links is collected earlier in my script
element = driver.find_element_by_class_name('__buytickets')
element.click()
print(driver.current_url)
I am getting the output as :
javascript:;
https://in.bookmyshow.com/booktickets/INCM/32076
https://in.bookmyshow.com/booktickets/INVB/47680
what i have to get is:
javascript:;
https://in.bookmyshow.com/booktickets/INCM/32076
https://in.bookmyshow.com/booktickets/INCM/32076#Seatlayout
Actually, the link which I have to get is generated by the JavaScript of the previous link. How do I get this (Seatlayout) link? Please help! Thanks in advance.
PhantomJS, in my experience, doesn't work well.
Chrome and Firefox are better.
Vitaly Slobodin (https://github.com/Vitallium) said he will no longer develop PhantomJS.
Use headless Chrome or Firefox.
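For instance, headless Firefox can be set up like this (a minimal sketch of the options API; the URL is the one from the question):
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")  # run Firefox without a visible window
driver = webdriver.Firefox(options=options)
driver.get("https://in.bookmyshow.com/booktickets/INCM/32076")
print(driver.current_url)
driver.quit()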
I'm attempting to create an app that gathers some data. I'm using Python 2.7 with Scrapy and Selenium on Windows 10. I've done this with only a few web pages previously, however, I'm not able to select or click on a button from the following website.
https://aca3.accela.com/Atlanta_Ga/Default.aspx
I am not able to click on the button labeled "Search Permits/Complaints".
I've used the Chrome dev tools in order to inspect the XPaths, etc.
Here's the code that I'm using:
import time

import scrapy
from selenium import webdriver

class PermitsSpider(scrapy.Spider):
    name = "atlanta"
    url = "https://aca3.accela.com/Atlanta_Ga/Default.aspx"
    start_urls = ['https://aca3.accela.com/Atlanta_Ga/Default.aspx',]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.driver.implicitly_wait(20)

    def parse(self, response):
        self.driver.get(self.url)
        time.sleep(15)
        search_button = self.driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_TabDataList_TabsDataList_ctl01_LinksDataList_ctl00_LinkItemUrl"]')
        search_button.click()
When running that code, I get the following error:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="ctl00_PlaceHolderMain_TabDataList_TabsDataList_ctl01_LinksDataList_ctl00_LinkItemUrl"]"}
I've added all manner of sleeps and waits, etc., to ensure that the page is fully loaded before attempting the selection. I've also attempted to select by link text, and tried other methods of selecting the element. Not sure why this method isn't working on this page when it works for me on others. While running the code, the WebDriver does open the page on my screen, and I see the page loaded, etc.
Any help would be appreciated. Thank you...
The target link is located inside an iframe, so you have to switch to that frame to be able to handle the embedded elements:
self.driver.switch_to.frame('ACAFrame')
search_button = self.driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_TabDataList_TabsDataList_ctl01_LinksDataList_ctl00_LinkItemUrl"]')
You also might need to switch back with driver.switch_to.default_content().
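Put together with the spider above, the parse step would look roughly like this (a sketch; the frame name and XPath are the ones from this answer):
def parse(self, response):
    self.driver.get(self.url)
    # Enter the iframe that hosts the link...
    self.driver.switch_to.frame('ACAFrame')
    search_button = self.driver.find_element_by_xpath(
        '//*[@id="ctl00_PlaceHolderMain_TabDataList_TabsDataList_ctl01_LinksDataList_ctl00_LinkItemUrl"]')
    search_button.click()
    # ...and return to the top-level document afterwards.
    self.driver.switch_to.default_content()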
I am doing some testing on my site, and I have a Python program which does GETs on a few different pages. Some of these pages have $(document).ready(function(). I noticed that when I do a GET through Python, I get the code, but, for example, $(document).ready(function() doesn't run.
How can I run the $(document).ready(function() of the site I am doing a GET on?
Thank you for help.
You should go for Selenium; it lets you control a real browser from your Python code. That means your JavaScript will be executed by the browser.
Example code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")  # the search box
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)  # submit the search
assert "No results found." not in driver.page_source
driver.close()