I am trying to use Selenium and PhantomJS to get the dynamic content of a website. Here's my code
from scrapy import Spider, Selector
from selenium import webdriver

class judge(Spider):
    name = "judge"
    start_urls = ["http://wenshu.court.gov.cn/List/List?sorttype=1&conditions=searchWord+2+AJLX++%E6%A1%88%E4%BB%B6%E7%B1%BB%E5%9E%8B:%E6%B0%91%E4%BA%8B%E6%A1%88%E4%BB%B6"]

    def init_driver(self):
        driver = webdriver.Chrome()
        return driver

    def parse(self, response):
        driver = self.init_driver()
        driver.get(self.start_urls[0])
        sel = Selector(text=driver.page_source)
        self.logger.info(u'---------------Parsing----------------')
        print sel.xpath("//div[@class='dataItem'][1]/table/tbody/tr[1]/td/div[@class='wstitle']/a/text()").extract()
        self.logger.info(u'---------------success----------------')
When I run my script with driver = webdriver.Chrome(), sel.xpath("//div[@class='dataItem']...") returns the desired content and everything works fine. But when I use driver = webdriver.PhantomJS() instead, the same XPath returns nothing. I have tried adding a WebDriverWait after driver.get() so that the page is fully loaded, but it does not help.
You might try this:
driver = webdriver.PhantomJS('add your directory of phantomjs here')
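In practice that just means pointing Selenium at the PhantomJS binary explicitly; the path below is only a placeholder, substitute the location of your own phantomjs executable:
from selenium import webdriver

# placeholder path; replace with wherever your phantomjs binary actually lives
driver = webdriver.PhantomJS(executable_path=r'/usr/local/bin/phantomjs')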
I want to collect all URLs in Git repositories where any e-mail addresses occur.
I use https://grep.app for the search.
The code:
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://grep.app/search?current=100&q=%40gmail.com'
chrome = "/home/dev/chromedriver"
browser = webdriver.Chrome(executable_path=chrome)
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
tags = soup.select('a')
print(tags)
When the code runs, Chrome starts, the results page loads, and in Chrome's Developer Tools I can see plenty of <a> tags with href attributes for the URLs in the page source.
For example, the rendered source contains links like:
lib/plugins/revert/lang/eu/lang.php
But my code returns only the tags from the footer:
"[<span class="slashes">//</span>grep.app, Contact]"
As I understand it, something is going wrong with the JavaScript-rendered content.
Please advise: what am I doing wrong?
Code:
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://grep.app/search?current=100&q=%40gmail.com'
chrome = "/home/dev/chromedriver"
browser = webdriver.Chrome(executable_path=chrome)
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
links = []
tags = soup.find_all('a', href=True)
for tag in tags:
    links.append(tag['href'])
print(links)
Output:
['/', 'mailto:hello@grep.app']
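The results on grep.app are rendered client-side, so page_source is being read before the result links exist; only the static footer is there. A minimal sketch of one possible fix, using an explicit wait before parsing (the CSS selector for the result links is a guess and would need to be checked against the live page):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://grep.app/search?current=100&q=%40gmail.com'
browser = webdriver.Chrome(executable_path="/home/dev/chromedriver")
browser.get(url)
# wait up to 15 s until at least one result link shows up; "div.search-results a" is an assumption
WebDriverWait(browser, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.search-results a")))
soup = BeautifulSoup(browser.page_source, 'lxml')
links = [tag['href'] for tag in soup.find_all('a', href=True)]
print(links)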
I am trying to prompt the user for a number on a web page while using Selenium in Python.
This is the code I have written, but it returns None:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.google.com')
input_number = driver.execute_script('return parseInt(prompt("Enter a number", 20));')
print(input_number)
So I figured out the answer to my question.
Here is the code for anyone who might have the same issue. While the prompt is open, every WebDriver call raises UnexpectedAlertPresentException, so the loop simply retries; once the user has answered, the value has been written to an attribute on <body> and can be read back:
from time import sleep

from selenium import webdriver
from selenium.common.exceptions import UnexpectedAlertPresentException

driver = webdriver.Chrome()
driver.get('https://www.google.com')

while True:
    try:
        driver.execute_script("var a = prompt('Enter a number');document.body.setAttribute('user-manual-input', a)")
        sleep(10)  # must wait so the user has time to answer the prompt
        print(driver.find_element_by_tag_name('body').get_attribute('user-manual-input'))  # get the text
        break
    except UnexpectedAlertPresentException:
        pass
I am trying to webscrape a website that has multiple javascript rendered pages (https://openlibrary.ecampusontario.ca/catalogue/). I am able to get the content from the first page, but I am not sure how to get my script to click on the buttons on the subsequent pages to get that content. Here is my script.
import time
from bs4 import BeautifulSoup as soup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import json
# The path to where you have your chrome webdriver stored:
webdriver_path = '/Users/rawlins/Downloads/chromedriver'
# Add arguments telling Selenium to not actually open a window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--window-size=1920x1080')
# Fire up the headless browser
browser = webdriver.Chrome(executable_path=webdriver_path, chrome_options=chrome_options)
# Load webpage
url = "https://openlibrary.ecampusontario.ca/catalogue/"
browser.get(url)
# to ensure that the page has loaded completely.
time.sleep(3)
data = []
# Parse HTML, close browser
page_soup = soup(browser.page_source, 'lxml')
containers = page_soup.findAll("div", {"class":"result-item tooltip"})
for container in containers:
    item = {}
    item['type'] = "Textbook"
    item['title'] = container.find('h4', {'class' : 'textbook-title'}).text.strip()
    item['author'] = container.find('p', {'class' : 'textbook-authors'}).text.strip()
    item['link'] = "https://openlibrary.ecampusontario.ca/catalogue/" + container.find('h4', {'class' : 'textbook-title'}).a["href"]
    item['source'] = "eCampus Ontario"
    item['base_url'] = "https://openlibrary.ecampusontario.ca/catalogue/"
    data.append(item)  # add the item to the list

with open("js-webscrape-2.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
browser.quit()
You do not have to actually click on any button. For example, to search for items with the keyword 'electricity', you navigate to the URL:
https://openlibrary-repo.ecampusontario.ca/rest/filtered-items?query_field%5B%5D=*&query_op%5B%5D=matches&query_val%5B%5D=(%3Fi)electricity&filters=is_not_withdrawn&offset=0&limit=10000
This will return a json string of items with the first item being:
{"items":[{"uuid":"6af61402-b0ec-40b1-ace2-1aa674c2de9f","name":"Introduction to Electricity, Magnetism, and Circuits","handle":"123456789/579","type":"item","expand":["metadata","parentCollection","parentCollectionList","parentCommunityList","bitstreams","all"],"lastModified":"2019-05-09 15:51:06.91","parentCollection":null,"parentCollectionList":null,"parentCommunityList":null,"bitstreams":null,"withdrawn":"false","archived":"true","link":"/rest/items/6af61402-b0ec-40b1-ace2-1aa674c2de9f","metadata":null}, ...
Now, to get that item, you use its uuid, and navigate to:
https://openlibrary.ecampusontario.ca/catalogue/item/?id=6af61402-b0ec-40b1-ace2-1aa674c2de9f
You can proceed like this for any interaction with that website (this does not always work for every website, but it works for yours).
To find out which URLs are requested when you click a given button or enter text (which is how I found the URLs above), you can use Fiddler.
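A rough sketch of that approach with requests, assuming the endpoint keeps returning the JSON structure shown above (an "items" list whose entries carry "uuid" and "name"):
import requests

api_url = ("https://openlibrary-repo.ecampusontario.ca/rest/filtered-items?"
           "query_field%5B%5D=*&query_op%5B%5D=matches&query_val%5B%5D=(%3Fi)electricity"
           "&filters=is_not_withdrawn&offset=0&limit=10000")
items = requests.get(api_url).json()["items"]
for item in items:
    # build the catalogue URL for each item from its uuid
    print(item["name"], "https://openlibrary.ecampusontario.ca/catalogue/item/?id=" + item["uuid"])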
I made a little script that can help you (Selenium).
What this script does is: while the last page of the catalogue is not selected (in this case, while its class does not contain 'selected'), scrape, then click next:
while "selected" not in driver.find_elements_by_css_selector("[id='results-pagecounter-pages'] a")[-1].get_attribute("class"):
#your scraping here
driver.find_element_by_css_selector("[id='next-btn']").click()
There's one problem you'll probably run into with this method: it doesn't wait for the results to load after each click, but you can figure out what to do from here onwards.
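One way to plug that gap, sketched under the assumption that clicking next replaces the current result nodes in the DOM, is to grab an element from the current page, click next, and wait for that element to go stale:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
while "selected" not in driver.find_elements_by_css_selector("[id='results-pagecounter-pages'] a")[-1].get_attribute("class"):
    # your scraping here
    old_result = driver.find_element_by_css_selector("div.result-item")  # any element from the current page
    driver.find_element_by_css_selector("[id='next-btn']").click()
    wait.until(EC.staleness_of(old_result))  # the next page has loaded once the old element is gone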
Hope it helps
My intent is to run a Scrapy crawler on this web page: http://visit.rio/en/o-que-fazer/outdoors/. However, some resources inside id="container" are only loaded by clicking a JavaScript button ("VER MAIS"). I've read some material about Selenium, but I've gotten nowhere.
You read that right; your best bet would be Scrapy + Selenium, using a Firefox browser or a headless one like PhantomJS for faster scraping.
Example adapted from https://stackoverflow.com/a/17979285/2781701
import scrapy
from selenium import webdriver

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['visit.rio']
    start_urls = ['http://visit.rio/en/o-que-fazer/outdoors']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        while True:
            try:
                next = self.driver.find_element_by_xpath('//div[@id="show_more"]/a')
                next.click()
                # get the data and write it to scrapy items
            except:
                break
        self.driver.close()
I run a query on one web page and get a result URL. If I right-click and view the page source, I can see the HTML code generated by JavaScript. If I simply use urllib, Python cannot get that JS-generated code. So I looked at solutions using Selenium. Here's my code:
from selenium import webdriver
url = 'http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2'
driver = webdriver.PhantomJS(executable_path=r'C:\python27\scripts\phantomjs.exe')
driver.get(url)
print driver.page_source
The output is just: <html><head></head><body></body></html>
Obviously that's not right!
Here's the source code I need, from the right-click view-source window (I want the INFORMATION part):
</script></div><div class="searchColRight"><div id="topActions" class="clearfix
noPrint"><div id="breadcrumbs" class="left"><a title="Results Summary"
href="Default.aspx? _act=VitalSearchR ...... <<INFORMATION I NEED>> ...
to view the entire record.</p></div><script xmlns:msxsl="urn:schemas-microsoft-com:xslt">
jQuery(document).ready(function() {
jQuery(".ancestry-information-tooltip").actooltip({
href: "#AncestryInformationTooltip", orientation: "bottomleft"});
});
So my question is: How to get the information generated by JS?
You will need to get the document via JavaScript; you can use Selenium's execute_script function:
from time import sleep # this should go at the top of the file
sleep(5)
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
print html
That will get everything inside of the <html> tag
It's not necessary to use that workaround; instead you can use:
driver = webdriver.PhantomJS()
driver.get('http://www.google.com/')
html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')
I had the same problem getting JavaScript-generated source code from the Internet, and I solved it using Victory's suggestion above.
*First: execute_script
driver=webdriver.Chrome()
driver.get(urls)
innerHTML = driver.execute_script("return document.body.innerHTML")
#print(driver.page_source)
*Second: parse the HTML using BeautifulSoup (you can install BeautifulSoup with the pip command)
import bs4 #import beautifulsoup
import re
from time import sleep
sleep(1) #wait one second
root=bs4.BeautifulSoup(innerHTML,"lxml") #parse HTML using beautifulsoup
viewcount=root.find_all("span",attrs={'class':'short-view-count style-scope yt-view-count-renderer'}) #find the value which you need.
*Third: print out the value you need
for span in viewcount:
    print(span.string)
*Full code
from selenium import webdriver
import lxml
urls="http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2"
driver = webdriver.PhantomJS()
##driver=webdriver.Chrome()
driver.get(urls)
innerHTML = driver.execute_script("return document.body.innerHTML")
##print(driver.page_source)
import bs4
import re
from time import sleep
sleep(1)
root=bs4.BeautifulSoup(innerHTML,"lxml")
viewcount=root.find_all("span",attrs={'class':'short-view-count style-scope yt-view-count-renderer'})
for span in viewcount:
    print(span.string)
driver.quit()
I am thinking that you are getting the source code before the JavaScript has rendered the dynamic HTML.
Initially, try putting a few seconds of sleep between navigating and getting the page source.
If this works, then you can change to a different wait strategy.
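For example, a rough sketch of an explicit wait; the class name comes from the source excerpt above and may need adjusting to match the element you actually care about:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(url)
# wait up to 15 seconds for a JS-rendered element to appear before reading the source
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "searchColRight")))
print(driver.page_source)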
You could try Dryscrape; this headless browser fully supports heavy JavaScript. Try it, I hope it works for you.
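A minimal sketch of what that might look like, assuming dryscrape and its webkit-server dependency are installed:
import dryscrape

url = 'http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2'
session = dryscrape.Session()
session.visit(url)     # renders the page, running its JavaScript
html = session.body()  # page source after the JavaScript has run
print(html)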
I met the same problem and finally solved it by using desired_capabilities.
from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy
from selenium.webdriver.common.proxy import ProxyType
proxy = Proxy(
    {
        'proxyType': ProxyType.MANUAL,
        'httpProxy': 'ip_or_host:port'
    }
)
desired_capabilities = webdriver.DesiredCapabilities.PHANTOMJS.copy()
proxy.add_to_capabilities(desired_capabilities)
driver = webdriver.PhantomJS(desired_capabilities=desired_capabilities)
driver.get('test_url')
print driver.page_source