How to let phantomjs show dynamically loaded webpage? - javascript

When I access https://www.ncbi.nlm.nih.gov/pubmed/?term=cell in a browser, I see "Results by year", a histogram below it, and a "Download CSV" link.
But when I fetch the same URL with the following script, they are missing. Does anybody know why?
Is there a way to get the histogram along with "Download CSV" using a command-line scraper? Thanks.
$ cat phjsget.py
#!/usr/bin/env python
import sys
from selenium import webdriver
browser = webdriver.PhantomJS(service_log_path='/dev/null')
browser.get(sys.argv[1])
print browser.page_source.encode('utf-8')
browser.close()
$ ./phjsget.py https://www.ncbi.nlm.nih.gov/pubmed/?term=cell

"Results by year" and "Download CSV" are inserted by JavaScript after the initial page load. wget will not execute JavaScript. You can use a tool like PhantomJS or Selenium, which simulates real browser behavior and executes JavaScript.
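As a rough illustration of the point above (not part of the original answer): content injected by JavaScript may not yet be in page_source right after get() returns, so one option is to poll until a known marker such as "Download CSV" appears. The helper name wait_for_text, the timeout values, and the choice of headless Chrome are assumptions made for this sketch; Selenium's own WebDriverWait serves the same purpose.

```python
import time


def wait_for_text(driver, marker, timeout=10.0, poll=0.5):
    """Poll driver.page_source until marker appears, then return the source.

    Works with any object that exposes a page_source attribute.
    Raises TimeoutError if the marker never shows up.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = driver.page_source
        if marker in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError("%r did not appear within %s s" % (marker, timeout))
        time.sleep(poll)


def main(url):
    # Imported here so wait_for_text can be used without Selenium installed.
    from selenium import webdriver

    # Assumption: headless Chrome and a matching driver are available.
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        print(wait_for_text(browser, "Download CSV"))
    finally:
        browser.quit()
```

Calling main("https://www.ncbi.nlm.nih.gov/pubmed/?term=cell") would then print the source only once the marker is present, or fail with a TimeoutError if it never appears.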

Related

Selenium does not run a javascript

I'm trying to execute a simple JavaScript command in Selenium using Python to scroll part of a web page.
Here's the code:
command_js = 'document.querySelector(' + "'" + 'css_div_example[css_example="this is just a example for css selector"]' + "').scroll(0, 10000)"
#string joined "document.querySelector('css_div_example[css_example="this is just a example for css selector"]').scroll(0, 10000)"
driver.execute_script(command_js)
Notes:
I tried executing it directly from the Firefox console and it worked very well.
Selenium does not return any error; it just does not execute the command.
I tried WebDriverWait with EC and time.sleep().
I am using Python 3+, the PyCharm IDE, and the Firefox webdriver.
Can anyone help me?
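The string concatenation in the question is easy to get wrong when the selector itself contains quotes. A minimal sketch, assuming only that the command should have the same shape as the one above, builds it with json.dumps so the selector is always a valid JavaScript string literal. The helper name build_scroll_command is hypothetical, and this does not by itself explain why the scroll has no effect under Selenium.

```python
import json


def build_scroll_command(css_selector, x=0, y=10000):
    """Return a JavaScript snippet that scrolls the element matching css_selector."""
    # json.dumps yields a double-quoted literal with inner quotes escaped,
    # which is also a valid JavaScript string literal.
    return "document.querySelector(%s).scroll(%d, %d)" % (
        json.dumps(css_selector),
        x,
        y,
    )
```

The result can be passed to driver.execute_script(...) exactly as command_js is in the question.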

How to get past Javascript is disabled in your browser error when web scraping with Python

I am trying to create a script to download an ebook as a PDF. When I use BeautifulSoup to print the contents of a single page, I get a message in the console stating "Oh no! It looks like JavaScript is disabled in your browser. Please re-enable to access the reader."
I have already enabled JavaScript in Chrome, and the same code works for a page such as a Stack Overflow answer page. What could be blocking JavaScript on this page, and how can I bypass it?
My code for reference:
import bs4
import requests
url = requests.get("https://platform.virdocs.com/r/s/0/doc/350551/sp/14552484/mi/47443495/?cfi=%2F4%2F2%5BP7001013978000000000000000003FF2%5D%2F2%2F2%5BP7001013978000000000000000010019%5D%2F2%2C%2F1%3A0%2C%2F1%3A0")
url.raise_for_status()
soup = bs4.BeautifulSoup(url.text, "html.parser")
elems = soup.select("p")
print(elems[0].getText())
The problem is that the page as served contains no content. To load the content, the page needs to run some JS code. The requests.get method does not run JS; it only downloads the basic HTML.
What you need to do is emulate a browser, i.e. 'open' the page, run the JS, and then scrape the content. One way to do it is to use a browser driver as described here: https://stackoverflow.com/a/57912823/9805867
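To make the difference concrete, here is a small sketch (not from the linked answer) that extracts paragraph text from an HTML string using only the standard library, plus a helper that obtains the HTML from a real browser. The names ParagraphCollector, paragraphs, and rendered_html are made up for this example, and headless Firefox is an assumption; any Selenium-supported browser would do.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element in an HTML string."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._parts = []

    def _flush(self):
        # Store the paragraph collected so far, if any, and reset state.
        if self._in_p:
            self.paragraphs.append("".join(self._parts).strip())
        self._in_p = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_p:
            self._parts.append(data)


def paragraphs(html):
    """Return the text of all <p> elements found in html."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # pick up a trailing unclosed <p>
    return collector.paragraphs


def rendered_html(url):
    """Load url in headless Firefox (an assumption) and return the DOM source."""
    # Imported here so paragraphs() works without Selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    browser = webdriver.Firefox(options=options)
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()
```

If paragraphs(requests.get(url).text) comes back empty while paragraphs(rendered_html(url)) does not, the content is being added by JavaScript. Pages that load content late may additionally need a wait before page_source is read.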

PhantomJS not retrieving correct data

I am trying to scrape a web page that uses JavaScript with PhantomJS. I found the button element, and clicking it should render the next link. However, I am not getting the output I want; I get a different, unwanted URL instead.
The code is:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
s = requests.session()
fg =s.get('https://in.bookmyshow.com/booktickets/INCM/32076',headers=headers)
so = BeautifulSoup(fg.text,"html.parser")
texts = so.findAll("div",{"class":"__buytickets"})
print(texts[0].a['href'])
print(fg.url)
driver = webdriver.PhantomJS()
driver.get(movie_links[0])
element = driver.find_element_by_class_name('__buytickets')
element.click()
print(driver.current_url)
The output I get is:
javascript:;
https://in.bookmyshow.com/booktickets/INCM/32076
https://in.bookmyshow.com/booktickets/INVB/47680
What I need to get is:
javascript:;
https://in.bookmyshow.com/booktickets/INCM/32076
https://in.bookmyshow.com/booktickets/INCM/32076#Seatlayout
The link I need is generated by JavaScript on the previous page. How can I get this link (the Seatlayout link)? Please help! Thanks in advance.
In my experience, PhantomJS doesn't work well.
Chrome and Firefox work better.
Vitaly Slobodin (https://github.com/Vitallium) said he will no longer develop PhantomJS.
Use headless Chrome or Firefox.

Python Get on a website and running $(document).ready(function()

I am doing some testing on my site, and I have a Python program that performs GET requests on a few different pages. Some of these pages have $(document).ready(function(). I noticed that when I do a GET through Python, I get the code, but $(document).ready(function() doesn't run, for example.
How can I run the $(document).ready(function() of the site I am doing a GET on?
Thank you for your help.
You should go for Selenium; it lets you control a real browser from your Python code. That means your JavaScript will be executed by the browser.
Example code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()

Faking Flash plugin info in PhantomJS and Python

I'm a newbie to PhantomJS. I'm using PhantomJS via Selenium WebDriver with Python as my language. I want to fake my Flash plugin info, which is clearly visible to JavaScript.
I want to do something like this (done in JavaScript) in Python using Selenium WebDriver.
page.onInitialized = function () {
    page.evaluate(function () {
        (function () {
            window.navigator.plugins = {
                'length': 1,
                'Shockwave Flash': {
                    'description': 'fakeflash'
                }
            };
        })();
    });
};
I don't know how to implement page.onInitialized and other functions in Python (with Selenium WebDriver).
Any help will be appreciated.
Personally, I couldn't find a way to get this to work either, so I instead went with using Firefox via selenium webdriver with gnash installed as a flash plugin. I know this is not quite what you are looking for, but it does have the desired effect in the end, as long as you have the system memory to support it.
from pyvirtualdisplay import Display
from selenium import webdriver
display = Display(visible=0, size=(800, 600))
display.start()
browser = webdriver.Firefox()
browser.get(url)
print browser.page_source
browser.quit()
display.stop()
Or to be more safe (and never leave behind nasty Xvfb and firefox processes!):
from pyvirtualdisplay import Display
from selenium import webdriver
display = None
browser = None
try:
    display = Display(visible=0, size=(800, 600))
    display.start()
    browser = webdriver.Firefox()
    browser.get(url)
    print(browser.page_source)
finally:
    # Clean up whatever was started, even if an earlier step failed.
    if browser:
        browser.quit()
    if display:
        display.stop()
I suppose it could be done with Chrome in a virtual display as well. If someone ever does share the magic to let webdriver.PhantomJS preload in a flash faker for us, I'd be happy to switch over as it has far less system resource requirements.
