I am working on a web-scraping project. One of the websites I am working with renders its data with JavaScript.
There was a suggestion on one of my earlier questions that I can call the JavaScript directly from Python, but I'm not sure how to accomplish this.
For example, if a JavaScript function is defined as add_2(var, var2):
How would I call that JavaScript function from Python?
Find a JavaScript interpreter that has Python bindings (try Rhino? V8? SpiderMonkey?). When you have found one, it should come with examples of how to use it from Python.
Python itself, however, does not include a JavaScript interpreter.
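For instance, a minimal sketch using PyMiniRacer, Python bindings for the V8 engine (this assumes the py-mini-racer package is installed; it is one option among several, not the only one):
from py_mini_racer import MiniRacer

ctx = MiniRacer()
# Define the JS function inside the embedded V8 context, then call it.
ctx.eval('function add_2(a, b) { return a + b; }')
print(ctx.call('add_2', 1, 2))  # -> 3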
To interact with JavaScript from Python I use WebKit, the browser rendering engine behind Safari (and, at the time, Chrome). There are Python bindings to WebKit through Qt. In particular, there is a function for executing JavaScript called evaluateJavaScript().
Here is a full example to execute JavaScript and extract the final HTML.
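A minimal sketch of such an example, assuming PyQt4 with the QtWebKit module (newer Qt releases replaced QtWebKit with QtWebEngine, where the equivalent call is runJavaScript()):
import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL, wait for the load to finish, run JavaScript, keep the HTML."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until _load_finished calls quit()

    def _load_finished(self, ok):
        frame = self.mainFrame()
        frame.evaluateJavaScript("document.title = 'modified from Python'")
        self.html = frame.toHtml()  # the DOM after JavaScript has run
        self.app.quit()

page = Render('http://example.com')
print(page.html)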
An interesting alternative I discovered recently is the Python bond module, which can be used to communicate with a Node.js process (V8 engine).
Usage is very similar to the PyV8 bindings, but you can directly use any Node.js library without modification, which is a major selling point for me.
Your Python code would look like this:
from bond import make_bond  # the bond module (PyPI: python-bond); assumes Node.js is installed

js = make_bond('JavaScript')  # spawns a Node.js subprocess
js.eval_block('function add2(a, b) { return a + b; }')  # define or load the JS side
val = js.call('add2', var1, var2)
or even:
add2 = js.callable('add2')
val = add2(var1, var2)
Calling functions through bond is definitely slower than with PyV8, though, so it greatly depends on your needs. If you need to use an npm package that does a lot of heavy lifting, bond is great; you can even have several Node.js processes running in parallel.
But if you just need to call a bunch of JS functions (for instance, to share the same validation functions between the browser and the backend), PyV8 will definitely be a lot faster.
You could extract the JavaScript from the page and execute it through an interpreter (such as V8 or Rhino). However, you can get a good result much more easily by using a functional testing tool such as Selenium or Splinter. These solutions launch a browser and actually load the page; this can be slow, but it guarantees that the content the browser would display is available.
For example, consider the HTML document below:
<html>
  <head>
    <title>Test</title>
    <script type="text/javascript">
      function addContent(divId) {
        var div = document.getElementById(divId);
        div.innerHTML = '<em>My content!</em>';
      }
    </script>
  </head>
  <body>
    <p>The element below will receive content</p>
    <div id="mydiv"></div>
    <script type="text/javascript">addContent('mydiv')</script>
  </body>
</html>
The script below uses Splinter. Splinter launches Firefox and, after the page has fully loaded, retrieves the content added to the div by JavaScript:
from splinter.browser import Browser
import os.path

browser = Browser()
browser.visit('file://' + os.path.realpath('test.html'))
elements = browser.find_by_css("#mydiv")
div = elements[0]
print(div.value)
browser.quit()
The result is the JavaScript-generated content, printed to stdout.
You might call Node.js through Popen. Here is an example of how to do it (one possible implementation of the execute helper, which the snippet relies on, is sketched after it):
print(execute('''function (args) {
    var result = 0;
    args.map(function (i) {
        result += i;
    });
    return result;
}''', args=[[1, 2, 3, 4, 5]]))
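The execute helper was not included in the original answer, so here is one possible sketch: JSON-encode the arguments, wrap the function source in an immediately-applied expression, and run it with node -e (assumes node is on the PATH):
import json
from subprocess import Popen, PIPE

def execute(js_source, args):
    # Apply the JS function to the JSON-encoded arguments and print the
    # result as JSON on stdout, where Python can read it back.
    wrapper = 'console.log(JSON.stringify((%s).apply(null, %s)));' % (
        js_source, json.dumps(args))
    proc = Popen(['node', '-e', wrapper], stdout=PIPE, stderr=PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(err.decode())
    return json.loads(out.decode())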
One possible solution is to use AJAX with Flask to communicate between JavaScript and Python. You run a server with Flask and then open the website in a browser. That way you can call Python code from JavaScript, either when the page is created or from a button click, as is done in this example.
HTML code:
<html>
  <script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
  <script>
    function pycall() {
      $.getJSON('/pycall', {content: "content from js"}, function(data) {
        alert(data.result);
      });
    }
  </script>
  <button type="button" onclick="pycall()">click me</button>
</html>
Python code:
from flask import Flask, jsonify, render_template, request

app = Flask(__name__)

def load_file(file_name):
    data = None
    with open(file_name, 'r') as file:
        data = file.read()
    return data

@app.route('/pycall')
def pycall():
    content = request.args.get('content', 0, type=str)
    print("call_received", content)
    return jsonify(result="data from python")

@app.route('/')
def index():
    return load_file("basic.html")

import webbrowser

print("opening localhost")
url = "http://127.0.0.1:5000/"
webbrowser.open(url)
app.run()
Output in Python:
call_received content from js
Alert in the browser:
data from python
This worked for me for a simple JS file. Source:
https://www.geeksforgeeks.org/how-to-run-javascript-from-python/
pip install js2py
pip install temp
file.py
import js2py
eval_res, tempfile = js2py.run_file("scripts/dev/test.js")
tempfile.wish("GeeksforGeeks")
scripts/dev/test.js
function wish(name) {
    console.log("Hello, " + name + "!")
}
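For a one-off function, js2py can also evaluate the JS inline; a small sketch using its eval_js helper:
import js2py

# Translate a JS function into a callable Python object.
add = js2py.eval_js('function add(a, b) { return a + b; }')
print(add(1, 2))  # -> 3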
Did a whole run-down of the different methods recently:
PyQt4
Node.js/zombie.js
PhantomJS
PhantomJS was the winner, hands down: very straightforward, with lots of examples.
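A small sketch of driving PhantomJS from Python via Selenium (recent Selenium releases deprecated the PhantomJS driver; this assumes the phantomjs binary is on the PATH):
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://example.com')
# execute_script runs JavaScript in the page and returns the result.
result = driver.execute_script('return 1 + 2;')
print(result)  # -> 3
driver.quit()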
Related
My main goal here is to execute a Python script I have written when I run a function triggered through HTML. Here is how I currently have things set up:
I have a JavaScript file containing the Python-run functions:
const PythonShell = require('python-shell').PythonShell;

class AHK {
    static async runScript() {
        PythonShell.run('/ahk/script.py', null, function (err) {
            if (err) throw err;
            console.log('finished');
        });
    }
}

module.exports = AHK;
I have my main.js file, which is the JS code for the HTML to handle. I'd like it to take in the AHK module. Something simple like this:
const AHK = require('./ahk');

function runFunction(x) {
    if (x === 1)  // comparison, not assignment
        AHK.runScript();
}
And then I have some HTML with a JavaScript tag:
<script type="text/javascript">
    let x = 1; // this is just to show x is getting populated; in the actual code it's constantly changing values
    async function predict() {
        if (x > 1)
            runFunction(x)
    }
</script>
Biggest issue I'm facing:
I've become aware that browser JavaScript doesn't support Node-style require/modules. For example, the main.js file doesn't like having a require at the top. I've tried tools like RequireJS, but I can't figure out how to make something like this work. I basically need it so that when the condition is met and runFunction runs, the Python script is executed on my machine.
Important to note that this is all for a personal project on my computer, so it will never not be local.
Make the application on your PC an API and have the web page send a request to the API telling it which Python script to run. I haven't used Python too much, but I believe you can build an API with it. Then you can make a button for each Python program you want to run and have each button send a request to the API.
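A minimal sketch of that idea, assuming Flask; the endpoint shape and the script whitelist are hypothetical:
import subprocess
from flask import Flask, jsonify

app = Flask(__name__)

# Whitelist of scripts the page is allowed to trigger.
SCRIPTS = {'ahk': 'ahk/script.py'}

@app.route('/run/<name>')
def run_script(name):
    if name not in SCRIPTS:
        return jsonify(error='unknown script'), 404
    result = subprocess.run(['python', SCRIPTS[name]],
                            capture_output=True, text=True)
    return jsonify(returncode=result.returncode, stdout=result.stdout)

if __name__ == '__main__':
    app.run(port=5000)
The browser page then only needs something like fetch('http://127.0.0.1:5000/run/ahk') instead of a require-based module.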
I'm trying to get a Python ReactiveX stream (using the RxPY library) sent to JavaScript in a web UI component, but I can't seem to find a way to do so. I might also need to turn the incoming data stream into an RxJS Observable of sorts on the JavaScript side for further processing.
Could you please help me understand how to achieve this?
I'm still getting a grip on ReactiveX, so maybe there are some fundamental concepts I'm missing, but I'm struggling to find anything similar to this around the net.
This issue has come up as I'm working on a desktop app that takes data from a CSV or a ZeroMQ endpoint and streams it to a UI, where the data will be plotted dynamically (updating the plot as new data comes in). I'm using Electron to build my app, with Python as my backend code. Python is a must, as I will be extending the app with some TensorFlow models.
Following fyears' really well-made example as an initial structure, I have written some sample code to play with, but I can't seem to get it to work.
I manage to get from the UI button all the way to the Python scripts, but I get stuck at the return of the PricesApi.get_stream(...) method.
index.html
The front end is straightforward.
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>Electron Application</title>
  </head>
  <body>
    <button id="super-button">Trigger Python Code</button>
    <div id="py-output">
    </div>
  </body>
  <script src="renderer.js"></script>
</html>
api.py:
The ZeroRPC server file is like the one in the above-mentioned link.
import gevent
import json
import signal
import zerorpc

from core_operator import stream

class PricesApi(object):
    def get_stream(self, filename):
        return stream(filename)

    def stop(self):
        print('Stopping strategy.')

    def echo(self, text):
        """echo any text"""
        return text

def load_settings():
    with open('settings.json') as json_settings:
        settings_dictionary = json.load(json_settings)
    return settings_dictionary

def main():
    settings = load_settings()
    s = zerorpc.Server(PricesApi())
    s.bind(settings['address'])
    print(f"Initialising server on {settings['address']}")
    s.run()

if __name__ == '__main__':
    main()
core_operator.py
This is the file where the major logic will sit to get prices from the ZeroMQ subscription; currently it just creates an Observable from a CSV.
import sys
import rx
from csv import DictReader

def prepare_csv_timeseries_stream(filename):
    return rx.from_(DictReader(open(filename, 'r')))

def stream(filename):
    price_observable = prepare_csv_timeseries_stream(filename)
    return price_observable
renderer.js
Finally, the JavaScript that should be receiving the stream:
const zerorpc = require('zerorpc');
const fs = require('fs');

const settings_block = JSON.parse(fs.readFileSync('./settings.json').toString());

let client = new zerorpc.Client();
client.connect(settings_block['address']);

let button = document.querySelector('#super-button');
let pyOutput = document.querySelector('#py-output');
let filename = '%path-to-file%'

button.addEventListener('click', () => {
    let line_to_write = '1'
    console.log('button click received.')
    client.invoke('get_stream', filename, (error, result) => {
        if (error) {
            console.error(error);
        } else {
            // append each result as a new list item under #py-output
            var messages = pyOutput;
            let message = document.createElement('li');
            let content = document.createTextNode(result.data);
            message.appendChild(content);
            messages.appendChild(message);
        }
    })
})
I have been looking into using WebSockets, but failed to understand how to implement them. I did find some examples using a Tornado server; however, I am trying to keep this as pure as possible, and it feels odd that, already having a client/server structure from Electron, I'm not able to use it directly.
Also, I'm trying to keep the entire system a PUSH structure, as the data requirements don't allow for a PULL pattern with regular polling, etc.
Thank you very much in advance for any time you can dedicate to this, and please let me know if you require any further details or explanations.
I found a solution using an amazing library called Eel (described as "A little Python library for making simple Electron-like HTML/JS GUI apps"). Its absolute simplicity and intuitiveness allowed me to achieve what I wanted in a few simple lines.
Follow the intro to understand the layout.
Then, in your main Python file (which I conveniently named main.py), you expose the stream function to Eel so it can be called from the JS file, and pipe the stream into the JavaScript receive_price function, which is exposed from the JS file!
import sys
import eel
import rx
from rx import operators as ops, pipe
from csv import DictReader

def prepare_csv_timeseries_stream(filename):
    return rx.from_(DictReader(open(filename, 'r')))

def process_logic():
    return pipe(
        ops.do_action(lambda p: print(p)),  # side effect only, so the price keeps flowing downstream
        ops.map(lambda p: eel.receive_price(p)),  # KEY FUNCTION in the JS file, exposed via eel, called for each price
    )

@eel.expose  # decorator so this function can be triggered from JavaScript
def stream(filename):
    price_observable = prepare_csv_timeseries_stream(filename)
    price_observable.pipe(process_logic()).subscribe()  # apply the pipe and subscribe to trigger the stream

eel.init('web')
eel.start('main.html')  # look at how beautiful and elegant this is!
Now we create the price_processing.js file (placed in the 'web' folder, as per the Eel instructions) to incorporate the exposed functions:
let button = document.querySelector('#super-button');
let pyOutput = document.querySelector('#py-output');
let filename = '%path-to-file%'

console.log("ready to receive data!")

eel.expose(receive_price); // Exposing the function to Python, to process each price
function receive_price(result) {
    var messages = pyOutput;
    let message = document.createElement('li');
    let content = document.createTextNode(result);
    message.appendChild(content);
    messages.appendChild(message);
    // in here you can add more functions to process data, e.g. logging, charting and so on..
};

button.addEventListener('click', () => {
    console.log('Button clicked magnificently! Bloody good job')
    eel.stream(filename); // calling the Python function exposed through Eel to start the stream
})
The HTML stays almost the same, apart from changing the script refs: /eel.js, as per the Eel documentation, and our price_processing.js file.
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>Let's try Eel</title>
  </head>
  <body>
    <h1>Eel-saved-my-life: the App!</h1>
    <button id="super-button">Trigger Python Code</button>
    <div id="py-output">
    </div>
  </body>
  <script type="text/javascript" src="/eel.js"></script>
  <script type="text/javascript" src="price_processing.js"></script>
</html>
I hope this can help anyone struggling with the same problem.
I am writing a web crawler. I extracted the heading and main discussion of this link, but I am unable to find any of the comments (Ctrl+U → Ctrl+F, then searching for the comment text). I think the comments are rendered by JavaScript. Can I extract them?
RT are using a service from spot.im for comments.
You need to make two POST requests: first to https://api.spot.im/me/network-token/spotim to get a token, then to https://api.spot.im/conversation-read/spot/sp_6phY2k0C/post/353493/get to get the comments as JSON.
I wrote a quick script to do this:
import requests
import re
import json

def get_rt_comments(article_url):
    spotim_spotId = 'sp_6phY2k0C'  # spotim id for RT
    post_id = re.search('([0-9]+)', article_url).group(0)
    r1 = requests.post('https://api.spot.im/me/network-token/spotim').json()
    spotim_token = r1['token']
    payload = {
        "count": 25,  # number of comments to fetch
        "sort_by": "best",
        "cursor": {"offset": 0, "comments_read": 0},
        "host_url": article_url,
        "canonical_url": article_url
    }
    r2_url = 'https://api.spot.im/conversation-read/spot/' + spotim_spotId + '/post/' + post_id + '/get'
    r2 = requests.post(r2_url, data=json.dumps(payload), headers={'X-Spotim-Token': spotim_token, "Content-Type": "application/json"})
    return r2.json()

if __name__ == '__main__':
    url = 'https://www.rt.com/usa/353493-clinton-speech-affairs-silence/'
    comments = get_rt_comments(url)
    print(comments)
Yes, if it can be viewed in a web browser, you can extract it.
If you look at the source, the comment section is really an iframe that loads a piece of JavaScript, which then creates a new script tag in the document whose source loads bundle.js; that file contains the actual commenting software, which in turn fetches the comments.
Instead of going through this manually, you could consider using, for example, WebKit to create a headless browser that executes the JavaScript like an ordinary browser. Then you can scrape from that instead of having to make your crawler fetch the external resources manually.
Examples of such headless browsers are Spynner, Dryscrape, or the PhantomJS-derived PhantomPy (the latter seems to be an abandoned project now).
I found a guide here:
https://docs.python.org/2/library/htmlparser.html
but the function HTMLParser.feed(data) takes data as the HTML itself.
Is there a way to do a similar feed, but with just the web address? Something like this:
HTMLParser.feed("www.a.com")?
Generally, I want to take a variable from different web pages, load it into a Python variable with a Python script, and compare between them.
Thanks.
import urllib2

f = urllib2.urlopen(url)
page_data = f.read()
f.close()

# do stuff with the html; note that feed() must be called on an
# HTMLParser instance (normally a subclass), not on the class itself
HTMLParser().feed(page_data)
This will fetch the raw HTML of the page; you can then parse it and find whatever you want (a small parser sketch follows). Not sure if there is a faster solution.
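For instance, a minimal sketch of an HTMLParser subclass that collects link targets (subclassing is needed because the base class's handler methods do nothing):
from HTMLParser import HTMLParser  # html.parser in Python 3

class LinkParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href of every anchor tag encountered.
        if tag == 'a':
            self.links.extend(value for name, value in attrs if name == 'href')

parser = LinkParser()
parser.feed(page_data)
print(parser.links)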
Maybe python-requests?
import requests
r = requests.get("https://github.com/")
r.content
Later, if you want to parse the content, you can use lxml.
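A small sketch of that, assuming the lxml package is installed:
from lxml import html

# Build an element tree from the fetched bytes and pull out the title.
tree = html.fromstring(r.content)
print(tree.findtext('.//title'))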
I have a Windows application (a .EXE file, written in C and built with MS Visual Studio) that outputs ASCII text to stdout. I'm looking to enhance the ASCII text to include limited HTML with a few links. I'd like to invoke this application (.EXE file), take its output, and pipe it into a browser. This is not a one-time thing; each new web page would be another run of the local application!
The HTML/JavaScript application below has worked for me to execute the application, but the output has gone into a DOS box window rather than being piped into the browser. I'd like to update this HTML application so that the browser captures that text (enhanced with HTML) and displays it.
<body>
  <script>
    function go() {
      w = new ActiveXObject("WScript.Shell");
      w.run('C:/DL/Browser/mk_html.exe');
      return true;
    }
  </script>
  <form>
    Run My Application (Window with explorer only)
    <input type="button" value="Go" onClick="return go()">
  </form>
</body>
Have the executable listen on a port, following the HTTP protocol.
Then have the web page make AJAX-style HTTP requests to the local port with JavaScript.
The executable returns text.
The web page updates itself through DOM manipulation in JavaScript.
Yes, this works; it is happening five feet away from me right now in another cubicle.
This is called CGI.
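A sketch of that idea in Python terms: a tiny local HTTP server that reruns the EXE on every request and returns its stdout to the browser (the EXE path is taken from the question):
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class ExeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Run the program and capture whatever it prints to stdout.
        output = subprocess.run(['C:/DL/Browser/mk_html.exe'],
                                capture_output=True).stdout
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.end_headers()
        self.wfile.write(output)

# Browse to http://127.0.0.1:8000/ to see the program's output.
HTTPServer(('127.0.0.1', 8000), ExeHandler).serve_forever()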
You're already using WScript to launch; it can also read StdOut:
<html>
  <head>
    <script type="text/javascript">
      function foo() {
        var WshShell = new ActiveXObject("WScript.Shell");
        var oExec = WshShell.Exec("ipconfig.exe");
        var input = "";
        while (!oExec.StdOut.AtEndOfStream) {
          input += oExec.StdOut.ReadLine() + "<br />";
        }
        if (input)
          document.getElementById("plop").innerHTML = input;
      }
    </script>
  </head>
  <body onload="foo();">
    <code id="plop"></code>
  </body>
</html>
It would be easier to have your EXE create a temp file containing the HTML, then just tell Windows to open that temp HTML file in the browser.
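In Python terms, one way to sketch that approach (here Python runs the EXE and writes the temp file itself; the EXE path is from the question, and tempfile/webbrowser are standard library):
import subprocess
import tempfile
import webbrowser

# Capture the program's HTML-enhanced output.
output = subprocess.run(['C:/DL/Browser/mk_html.exe'],
                        capture_output=True).stdout

# Write it to a temporary .html file and open it in the default browser.
with tempfile.NamedTemporaryFile('wb', suffix='.html', delete=False) as fh:
    fh.write(output)
webbrowser.open('file://' + fh.name)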