How can I get embedded JSON data from a website with Python? - javascript

I have a device that collects energy data. It has a web interface but, sadly, no API.
There is a JSON object stored in window.dataJSON.
I can get its value in the Chrome debugger with: console.log(JSON.stringify(window.dataJSON));
But my question is: how can I get this data with Python?
I know I can get the source code of the page with:
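# Python 3: urllib2 became urllib.request
from urllib.request import urlopen
# urlopen needs a full URL, including the scheme
response = urlopen("http://10.10.10.10")
page_source = response.read()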
But how can I read the JSON stored in window.dataJSON?
Thank you in advance!

The window object exists only in a browser, so to read a property of window you need a browser to evaluate it.
You can use Selenium:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://www.example.com')
result = driver.execute_script('return JSON.stringify(window.dataJSON)')
You can switch the webdriver to headless Chrome or PhantomJS if you don't want a browser window to show up.
You may also need to tell the driver to wait if dataJSON is assigned to window asynchronously.
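The waiting step can be sketched as a small polling helper. This is only a sketch under assumptions: the helper name wait_for_data_json and its timeout and poll parameters are made up for illustration, and it relies on nothing but the driver's execute_script method. Selenium also ships WebDriverWait for this purpose; a plain loop is used here to keep the example free of extra imports.

```python
import json
import time


def wait_for_data_json(driver, timeout=10.0, poll=0.5):
    """Poll the page until window.dataJSON is defined, then return it parsed.

    `driver` is any object with an execute_script(script) method,
    such as a Selenium WebDriver. The helper name is hypothetical.
    """
    # Return null while the property is still undefined, so Python sees None.
    script = (
        "return window.dataJSON === undefined"
        " ? null : JSON.stringify(window.dataJSON);"
    )
    deadline = time.monotonic() + timeout
    while True:
        raw = driver.execute_script(script)
        if raw is not None:
            # The browser hands back a JSON string; decode it in Python.
            return json.loads(raw)
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "window.dataJSON was not set within {} s".format(timeout)
            )
        time.sleep(poll)
```

With the driver from the answer above, the call would be data = wait_for_data_json(driver), after driver.get(...) has returned.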

Related

Selected text with Selenium and Python?

In the web console, getting the selected (highlighted) text is simple:
window.getSelection().toString()
How about doing this in a headless browser? In particular, I'm using Selenium with its Python API. I cannot find a method similar to getSelection() on the driver:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get("http://www.python.org")
For example, suppose I have selected/highlighted (with the cursor) the string "suppose I have " on this page, the desired output should be "suppose I have ". In case no text is selected/highlighted, return the empty string "".
The answer I found is to execute Javascript directly within selenium. For example, to fulfill what I want, run the following script.
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
# Manually highlight some text with your cursor.
selected_text = driver.execute_script("return window.getSelection().toString()")
print(selected_text)
Slightly unrelated but useful: This works within the currently selected window. To switch among different windows, see [1].
[1] Python Selenium get current window handle

Webscrape JS rendered Website

I am trying to figure out how to scrape this website, https://cnx.org/search?q=subject:%22Arts%22, which is rendered via JavaScript. When I view the page source, there is very little code. I know that BeautifulSoup can't do this. I have tried Selenium, but I am new to it. Any suggestions on how scraping this site could be accomplished?
You can use Selenium to do this. You won't be looking at the HTML source code, though. Press F12 in Chrome (or install Firebug on Firefox) to open the developer tools. Once there, you can select elements (the pointer icon at the top left of the dev tools window). Once you click what you want, right-click the highlighted portion in the "Elements" panel and choose Copy -> XPath. Be careful with quoting in your code: copied XPaths usually contain double quotes, so wrap the XPath in single quotes when you pass it to find_element_by_xpath.
Essentially, you instantiate your browser, go to the page, and find the element by XPath (a query language for pointing at a specific node in the page, including nodes that JavaScript added after load). It's roughly like this:
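from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome()
# Load the page (the search page from the question)
driver.get("https://cnx.org/search?q=subject:%22Arts%22")
# Find your element via its XPath (see above for how to get it).
# Attributes are matched with @ (for example @id), not with #.
# The "Madlavning" entry on the page would be:
xpath = '//*[@id="results"]/div/table/tbody/tr[1]/td[2]/h4/a'
# find_element(By.XPATH, ...) replaces find_element_by_xpath in newer Selenium releases.
# The wait gives the page's JavaScript up to 10 seconds to render the element.
element = WebDriverWait(driver, 10).until(lambda d: d.find_element(By.XPATH, xpath))
# Pull the text:
print(element.text)
# Ensure you don't leave zombie/defunct Chrome/Firefox instances that suck up resources
driver.quit()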
selenium can be used for plenty of scraping, you just need to know what you want to do once you find the info.
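To see what the XPath from the example selects without starting a browser, the same path can be run against a small hand-written document using Python's standard library. The markup and titles below are made up, and xml.etree.ElementTree supports only a subset of XPath (paths must be relative, hence the leading dot), so treat this as an illustration of the path syntax and the quoting, not of Selenium itself.

```python
import xml.etree.ElementTree as ET

# Made-up markup that mimics the nesting the XPath expects.
SAMPLE_PAGE = """
<html><body>
<div id="results"><div><table><tbody>
<tr><td>1</td><td><h4><a href="/contents/a">First title</a></h4></td></tr>
<tr><td>2</td><td><h4><a href="/contents/b">Second title</a></h4></td></tr>
</tbody></table></div></div>
</body></html>
"""

root = ET.fromstring(SAMPLE_PAGE.strip())

# Single quotes around the Python string, double quotes inside the XPath.
# [@id="results"] matches on the id attribute; tr[1] and td[2] are 1-based.
xpath = './/*[@id="results"]/div/table/tbody/tr[1]/td[2]/h4/a'
link = root.find(xpath)

print(link.text)
print(link.get("href"))
```

Changing tr[1] to tr[2] selects the link in the second row instead.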
You can directly use the API that the web page gets its data from (using JavaScript): https://archive.cnx.org/search?q=subject:%22Arts%22 It returns JSON, so you just need to parse the JSON.
import requests
import json
url = "https://archive.cnx.org/search?q=subject:%22Arts%22"
r = requests.get(url)
j = r.json()
# Print the JSON object
print(json.dumps(j, indent=4, sort_keys=True))
# Or print specific values
for i in j['results']['items']:
    print(i['title'])
    print(i['summarySnippet'])
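The traversal can also be tried offline. The sketch below runs it on a hand-written sample: only the key names (results, items, title, summarySnippet) come from the answer above, the values are invented, and the helper extract_items is a hypothetical name. It uses .get so that an item missing a key does not raise KeyError, which is an assumption about how forgiving you want the parsing to be.

```python
import json

# Invented sample that copies only the key names used in the answer.
SAMPLE_RESPONSE = """
{"results": {"items": [
  {"title": "Example title A", "summarySnippet": "Example snippet A"},
  {"title": "Example title B"}
]}}
"""


def extract_items(payload):
    """Return (title, snippet) pairs, using "" for any missing field."""
    items = payload.get("results", {}).get("items", [])
    return [
        (item.get("title", ""), item.get("summarySnippet", ""))
        for item in items
    ]


rows = extract_items(json.loads(SAMPLE_RESPONSE))
for title, snippet in rows:
    print(title, "-", snippet)
```

With a live response, extract_items(r.json()) would take the place of the json.loads call.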
Try Google's official headless browser wrapper around Chrome, puppeteer.
Install:
npm i puppeteer
Usage:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
It's easy to use and has good documentation.

PhantomJS not retrieving correct data

I am trying to scrape a web page that uses JavaScript, using PhantomJS. I found the button element, and clicking it should render the next link. But I am not getting the output I want; I get a different URL instead.
The code is:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
s = requests.session()
fg =s.get('https://in.bookmyshow.com/booktickets/INCM/32076',headers=headers)
so = BeautifulSoup(fg.text,"html.parser")
texts = so.findAll("div",{"class":"__buytickets"})
print(texts[0].a['href'])
print(fg.url)
driver = webdriver.PhantomJS()
driver.get(movie_links[0])
element = driver.find_element_by_class_name('__buytickets')
element.click()
print(driver.current_url)
I am getting this output:
javascript:;
https://in.bookmyshow.com/booktickets/INCM/32076
https://in.bookmyshow.com/booktickets/INVB/47680
What I need to get is:
javascript:;
https://in.bookmyshow.com/booktickets/INCM/32076
https://in.bookmyshow.com/booktickets/INCM/32076#Seatlayout
The link I need is generated by the JavaScript of the previous page. How can I get this link (the Seatlayout link)? Please help! Thanks in advance.
In my experience, PhantomJS doesn't work well.
Chrome and Firefox work better.
Vitaly Slobodin (https://github.com/Vitallium) said he will no longer develop PhantomJS.
Use headless Chrome or Firefox.
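A minimal sketch of starting headless Chrome from Selenium, in case it helps with the switch away from PhantomJS. The function name make_headless_chrome and the window size are made up for illustration, and "--headless=new" assumes a fairly recent Chrome build (older builds may need plain "--headless"). It also assumes Chrome and a matching driver are installed.

```python
# Command-line switches passed to Chrome. "--headless=new" is an assumption
# about the installed Chrome version; older builds may need "--headless".
HEADLESS_ARGS = ["--headless=new", "--window-size=1280,800"]


def make_headless_chrome(arguments=HEADLESS_ARGS):
    """Start Chrome without a visible window and return the driver."""
    # Imported here so this module loads even where Selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in arguments:
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


# Usage, mirroring the question (not run here, it launches a browser):
# driver = make_headless_chrome()
# driver.get("https://in.bookmyshow.com/booktickets/INCM/32076")
# ... click the button as in the question, then:
# print(driver.current_url)
# driver.quit()
```

Whether the click then lands on the Seatlayout URL depends on the site's JavaScript; the sketch only replaces webdriver.PhantomJS() with a headless Chrome driver.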

Multiple screenshot with Firefox Developer Tools

I am trying to take screenshots of a page that loads a series of content (a slideshow) via JavaScript. I can take screenshots of individual items with Firefox DevTools just fine. However, it's tedious to do so by hand.
I can think of a few options:
Run the 'screenshot' command in a loop and call a JS function in each loop to load the next content. However I can't find any documentation to script the developer tools or call JS functions from within it.
Run a JS script on the page to load the contents at an interval and call the devtools to take a screenshot each time. But I can't find any documentation on calling devtools from JS in webpage.
Have Devtools take screenshots in response to a page event. But I can't find any documentation on this either.
How do I do this?
Your first question is how to take screenshots with JavaScript programmatically:
Use Selenium WebDriver to drive the browser instead of trying to script the developer tools of a specific browser.
Using WebdriverJS as a framework, you can script anything you need around the WebDriver itself.
Your second question is how to script the Firefox dev tools:
- no answer from my side -
I will second Ralf R's recommendation to use WebDriver instead of trying to wrangle the Firefox devtools.
Here's a webdriverjs script that goes to a webpage with a slow loading carousel, and takes a screenshot as soon as the image I request is fully loaded (with this carousel, I tell it to wait until the css opacity is 1). You can just loop this through however many slide images you have.
var webdriver = require('selenium-webdriver');
var By = webdriver.By;
var until = webdriver.until;
var fs = require("fs");
var driver = new webdriver.Builder().forBrowser("chrome").build();
//Go to website
driver.get("http://output.jsbin.com/cerutusihe");
//Tell webdriver to wait until the opacity is 1
driver.wait(function(){
  //first store the element you want to find in a variable.
  var theEl = driver.findElement(By.css(".mySlides:nth-child(1)"));
  //return the css value (it can be any value you like), then return a boolean (that the 'result' of the getCssValue request equals 1)
  return theEl.getCssValue('opacity').then(function(result){
    return result == 1;
  })
}, 60000) //specify a wait of 60 seconds.
//call webdriver's takeScreenshot method.
driver.takeScreenshot().then(function(data) {
  //use the node file system module 'fs' to write the file to your disk. In this case, it writes it to the root directory of my webdriver project.
  fs.writeFileSync("pic2.png", data, 'base64');
});
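The loop over all slides mentioned above could look roughly like this in Python. This is a sketch under assumptions: the function names, the file naming (slide1.png, slide2.png, ...) and the idea that the carousel advances by itself are not from the answer; only the .mySlides:nth-child(n) selector and the opacity check are. The helpers use just two driver methods, execute_script and save_screenshot, which the Selenium Python WebDriver provides.

```python
import time

# Reads the computed opacity of the first element matching the selector.
OPACITY_JS = (
    "var el = document.querySelector(arguments[0]);"
    "return el ? parseFloat(getComputedStyle(el).opacity) : null;"
)


def wait_until_opaque(driver, css_selector, timeout=60.0, poll=0.25):
    """Return True once the element's opacity is 1, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.execute_script(OPACITY_JS, css_selector) == 1:
            return True
        time.sleep(poll)
    return False


def capture_slides(driver, slide_count, timeout=60.0, poll=0.25):
    """Screenshot each slide once it is fully visible; return the file names."""
    saved = []
    for index in range(1, slide_count + 1):
        selector = ".mySlides:nth-child({})".format(index)
        # Skip a slide that never becomes fully opaque within the timeout.
        if wait_until_opaque(driver, selector, timeout, poll):
            filename = "slide{}.png".format(index)
            driver.save_screenshot(filename)
            saved.append(filename)
    return saved
```

After driver.get(...) on the carousel page, capture_slides(driver, 3) would try to save three screenshots. If the carousel needs a click to advance, that step would have to be added inside the loop.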

Python Get on a website and running $(document).ready(function()

I am doing some testing on my site, and I have a Python program that does GETs on a few different pages. Some of these pages have $(document).ready(function(). I noticed that when I do a GET through Python, I get the code, but the $(document).ready(function() handler, for example, doesn't run.
How can I run the $(document).ready(function() of the site I am doing a GET on?
Thank you for your help.
You should go for Selenium; it lets you control a real browser from your Python code. That means your JavaScript will be executed by the browser.
Example code:
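from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
# find_element(By.NAME, ...) replaces find_element_by_name in newer Selenium releases
elem = driver.find_element(By.NAME, "q")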
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()
