PhantomJS not retrieving correct data - javascript

I am trying to scrape a web page that contains JavaScript, using PhantomJS. I found the element for the button, and when I click it, it should render the next link. But I am not getting the output I expect; instead I am getting a different, unwanted output.
The code is:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver

# 'headers' was used but never defined in the original snippet;
# a plain User-Agent is assumed here
headers = {'User-Agent': 'Mozilla/5.0'}

s = requests.session()
fg = s.get('https://in.bookmyshow.com/booktickets/INCM/32076', headers=headers)
so = BeautifulSoup(fg.text, "html.parser")
texts = so.findAll("div", {"class": "__buytickets"})
print(texts[0].a['href'])
print(fg.url)

driver = webdriver.PhantomJS()
# 'movie_links[0]' was undefined in the original; the page fetched above is meant
driver.get(fg.url)
element = driver.find_element_by_class_name('__buytickets')
element.click()
print(driver.current_url)
The output I am getting is:
javascript:;
https://in.bookmyshow.com/booktickets/INCM/32076
https://in.bookmyshow.com/booktickets/INVB/47680
What I need to get is:
javascript:;
https://in.bookmyshow.com/booktickets/INCM/32076
https://in.bookmyshow.com/booktickets/INCM/32076#Seatlayout
The link I need is generated by the JavaScript on the previous page. How do I get this link (the seat-layout link)? Please help! Thanks in advance.

In my experience PhantomJS doesn't work well; Chrome and Firefox are better.
Vitaly Slobodin (https://github.com/Vitallium) has said he will no longer develop PhantomJS.
Use headless Chrome or Firefox instead.
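For the question above, a minimal sketch of the same click flow with headless Chrome (assuming chromedriver is installed and on your PATH; the URL and class name are taken from the question):
from selenium import webdriver

# Headless Chrome in place of the discontinued PhantomJS
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(chrome_options=options)

driver.get('https://in.bookmyshow.com/booktickets/INCM/32076')
element = driver.find_element_by_class_name('__buytickets')
element.click()
print(driver.current_url)  # ideally ends in #Seatlayout after the click
driver.quit()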

Related

How to get past a "JavaScript is disabled in your browser" error when web scraping with Python

I am trying to create a script to download an ebook as a PDF. When I use BeautifulSoup to print the contents of a single page, I get a message in the console stating "Oh no! It looks like JavaScript is disabled in your browser. Please re-enable to access the reader."
I have already enabled JavaScript in Chrome, and this same piece of code works for a page like a Stack Overflow answer page. What could be blocking JavaScript on this page, and how can I bypass it?
My code for reference:
# imports were missing from the original snippet
import requests
import bs4

url = requests.get("https://platform.virdocs.com/r/s/0/doc/350551/sp/14552484/mi/47443495/?cfi=%2F4%2F2%5BP7001013978000000000000000003FF2%5D%2F2%2F2%5BP7001013978000000000000000010019%5D%2F2%2C%2F1%3A0%2C%2F1%3A0")
url.raise_for_status()
soup = bs4.BeautifulSoup(url.text, "html.parser")
elems = soup.select("p")
print(elems[0].getText())
The problem is that the page itself contains no content; the content is only loaded by running some JS code. The requests.get method does not run JS, it just fetches the basic HTML.
What you need to do is emulate a browser: 'open' the page, run the JS, and then scrape the content. One way to do that is with a browser driver, as described here: https://stackoverflow.com/a/57912823/9805867
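As a minimal sketch of that approach (assuming chromedriver is installed; the URL and the p selector come from the question):
from selenium import webdriver
import bs4
import time

driver = webdriver.Chrome()
driver.get("https://platform.virdocs.com/r/s/0/doc/350551/sp/14552484/mi/47443495/?cfi=%2F4%2F2%5BP7001013978000000000000000003FF2%5D%2F2%2F2%5BP7001013978000000000000000010019%5D%2F2%2C%2F1%3A0%2C%2F1%3A0")
time.sleep(5)  # crude; an explicit wait for the content would be more robust
# page_source holds the DOM after the reader's JS has run
soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
elems = soup.select("p")
if elems:
    print(elems[0].getText())
driver.quit()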

Webscraping in Python Selenium - Can't find button

So, I'm trying to access some data from this web page: http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm . I'm trying to click the button named "Setor de atuação" with Selenium. The problem is that the requests lib returns a different HTML from the one I see when I inspect the page. I already tried sending a header with my request, but that wasn't the solution. Also, when I print the content of
browser.page_source
I still get an incomplete part of the page I want. Trying to solve the problem, I've seen that two requests are posted when the site initializes:
(screenshot of the requests omitted)
Well, I'm not sure what to do now. If anyone can help me, send me a tutorial, or explain what is happening, I would be really glad. Thanks in advance. I've only done simple web scraping before, so I'm not sure how to proceed; I've also checked other questions in the forums and none seem similar to my problem.
import bs4 as bs
import requests
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito') #private
#options.add_argument('--headless') # doesnt open page
browser = webdriver.Chrome('/home/itamar/Desktop/chromedriver', chrome_options=options)
site = 'http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm'
browser.get(site)
That's my code so far. I'm having trouble finding and clicking the button element "Setor de Atuação". I've tried XPath, class, and id, but nothing seems to work.
The button you're aiming at is inside an iframe. In this case you have to use the switch_to function of your Selenium driver to switch the driver into the iframe's DOM; only then can you look for the button. I've played with the page you provided and it worked, using only Selenium though, no need for Beautiful Soup. This is my code:
from selenium import webdriver
import time

class B3:
    def __init__(self):
        self.bot = webdriver.Firefox()

    def start(self):
        bot = self.bot
        bot.get('http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm')
        time.sleep(2)
        iframe = bot.find_element_by_xpath('//iframe[@id="bvmf_iframe"]')
        bot.switch_to.frame(iframe)
        bot.implicitly_wait(30)
        tab = bot.find_element_by_xpath('//a[@id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]')
        time.sleep(3)
        tab.click()
        time.sleep(2)

if __name__ == "__main__":
    worker = B3()
    worker.start()
Hope it suits you well!
refs:
https://www.techbeamers.com/switch-between-iframes-selenium-python/
In this case I suggest you work only with Selenium, because the page depends on JavaScript processing.
You can inspect the elements and use XPath to choose and select them.
XPath:
//*[@id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]/span/span
So your code will look like:
elementSelect = driver.find_elements_by_xpath('//*[@id="ctl00_contentPlaceHolderConteudo_tabMenuEmpresaListada_tabSetor"]/span/span')
elementSelect[0].click()
time.sleep(5)  # Wait for the page to load.
PS: I recommend you look for an API service for B3. I found this link, but I didn't read it; maybe they have already made this kind of data available.
About XPath: https://www.guru99.com/xpath-selenium.html
I can't understand the problem; if you can show a code snippet, that would be better.
I also suggest you use BeautifulSoup for web scraping.
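If you do combine the two, a minimal sketch (assuming the Chrome setup from the question) is to hand Selenium's rendered page_source to BeautifulSoup:
import bs4 as bs
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://www.b3.com.br/pt_br/produtos-e-servicos/negociacao/renda-variavel/empresas-listadas.htm')
# page_source reflects the DOM after the JavaScript has run,
# unlike the bare HTML that requests returns
soup = bs.BeautifulSoup(browser.page_source, 'html.parser')
print(soup.title.text)
browser.quit()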

Web-scraping a JS-rendered website

I am trying to figure out how to scrape this website, https://cnx.org/search?q=subject:%22Arts%22 , which is rendered via JavaScript. When I view the page source, there is very little code. I know that BeautifulSoup can't do this. I have tried Selenium, but I am new to it. Any suggestions on how scraping this site could be accomplished?
You can use Selenium to do this. You won't be looking at the HTML source code, though. Press F12 in Chrome (or install Firebug on Firefox) to get into the developer tools. Once there, you can select elements (pointer icon at the top left of the dev-tools window). Once you click what you want, right-click the highlighted portion in the "Elements" column and Copy -> XPath. Be careful to use proper quotes in your code, because XPaths usually contain double quotes, which also need to survive in the string you pass to the find_element_by_xpath method.
Essentially you instantiate your browser, go to the page, and find the element by its XPath (an XML path language for addressing a specific spot on a page that uses JavaScript). It's roughly like this:
from selenium import webdriver

driver = webdriver.Chrome()
# Load the page (the original answer pasted an unrelated Instagram URL here;
# the search page from the question is assumed instead)
driver.get("https://cnx.org/search?q=subject:%22Arts%22")
# Find your element via its XPath (see above for how to copy it)
element = driver.find_element_by_xpath('//*[@id="results"]/div/table/tbody/tr[1]/td[2]/h4/a')
# Pull the text:
print(element.text)
# Ensure you don't get zombie/defunct chrome/firefox instances that suck up resources
driver.quit()
Selenium can be used for plenty of scraping; you just need to know what you want to do once you find the info.
You can use the API that the web page gets its data from (using JavaScript) directly: https://archive.cnx.org/search?q=subject:%22Arts%22 . It returns JSON, so you just need to parse it.
import requests
import json
url = "https://archive.cnx.org/search?q=subject:%22Arts%22"
r = requests.get(url)
j = r.json()
# Print the json object
print (json.dumps(j, indent=4, sort_keys=True))
# Or print specific values
for i in j['results']['items']:
    print(i['title'])
    print(i['summarySnippet'])
Try Google's official headless browser wrapper around Chrome, puppeteer.
Install:
npm i puppeteer
Usage:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
It's easy to use and has good documentation.
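If you would rather stay in Python, the pyppeteer port (a third-party reimplementation, installed with pip install pyppeteer) exposes roughly the same API; a sketch:
import asyncio
from pyppeteer import launch

async def main():
    # launch headless Chromium, render the page, take a screenshot
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://example.com')
    await page.screenshot({'path': 'example.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())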

How can I get embedded JSON data from a website with Python?

I have a device that collects energy data, with a web interface on it and sadly no API.
There is a JSON object stored in window.dataJSON.
I can get its value with console.log(JSON.stringify(window.dataJSON)); via the Chrome debugger.
But my question is: how can I get this data with Python?
I know I can get the source code of the page with:
import urllib2
# a URL scheme is needed here; the bare "10.10.10.10" from the original would raise ValueError
response = urllib2.urlopen("http://10.10.10.10")
page_source = response.read()
But how can I read the JSON stored in window.dataJSON?
Thank you in advance!
The window object exists only in a browser, so to get a property of window you need a browser to do it for you.
You can use Selenium :
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://www.example.com')
result = driver.execute_script('return JSON.stringify(window.dataJSON)')
You can switch the webdriver to headless Chrome or PhantomJS if you don't want a browser window to show up.
You may also need to tell the driver to wait, in case dataJSON is assigned to window asynchronously.
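For example, a sketch using Selenium's explicit waits (the 10-second timeout is an arbitrary choice):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('http://10.10.10.10')
# Poll until window.dataJSON has been assigned, then serialize it
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script('return window.dataJSON !== undefined'))
result = driver.execute_script('return JSON.stringify(window.dataJSON)')
print(result)
driver.quit()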

Cannot select element from a javascript webpage using Python scrapy and Selenium

I'm attempting to create an app that gathers some data. I'm using Python 2.7 with Scrapy and Selenium on Windows 10. I've done this with a few web pages previously; however, I'm not able to select or click on a button on the following website:
https://aca3.accela.com/Atlanta_Ga/Default.aspx
I am not able to click on the button labeled "Search Permits/Complaints".
I've used the Chrome dev tools to inspect the XPaths, etc.
Here's the code that I'm using:
import time  # time.sleep is used below but was never imported in the original
import scrapy
from selenium import webdriver

class PermitsSpider(scrapy.Spider):
    name = "atlanta"
    url = "https://aca3.accela.com/Atlanta_Ga/Default.aspx"
    start_urls = ['https://aca3.accela.com/Atlanta_Ga/Default.aspx',]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.driver.implicitly_wait(20)

    def parse(self, response):
        self.driver.get(self.url)
        time.sleep(15)
        search_button = self.driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_TabDataList_TabsDataList_ctl01_LinksDataList_ctl00_LinkItemUrl"]')
        search_button.click()
When Running that code, I get the following error:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="ctl00_PlaceHolderMain_TabDataList_TabsDataList_ctl01_LinksDataList_ctl00_LinkItemUrl"]"}
I've added all manner of sleeps and waits, etc., to ensure that the page is fully loaded before attempting the selection. I've also attempted to select by link text and other methods of selecting the element. I'm not sure why this method isn't working on this page when it works for me on others. While running the code, the WebDriver does open the page on my screen, and I see the page loaded, etc.
Any help would be appreciated. Thank you...
The target link is located inside an iframe, so you have to switch to that frame to be able to handle the embedded elements:
self.driver.switch_to_frame('ACAFrame')
search_button = self.driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_TabDataList_TabsDataList_ctl01_LinksDataList_ctl00_LinkItemUrl"]')
You might also need to switch back afterwards with driver.switch_to_default_content().
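Putting it together, a sketch of the whole flow (the frame name 'ACAFrame' and the XPath come from this answer and the question):
from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(20)
driver.get('https://aca3.accela.com/Atlanta_Ga/Default.aspx')
# Enter the iframe, click the link, then return to the top-level document
driver.switch_to_frame('ACAFrame')
driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_TabDataList_TabsDataList_ctl01_LinksDataList_ctl00_LinkItemUrl"]').click()
driver.switch_to_default_content()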
