Scraping javascript page with PyQt5 and QWebEngineView

Scraping javascript page with PyQt5 and QWebEngineView - javascript

I'm trying to render a javascripted webpage into populated HTML for scraping. Researching different solutions (selenium, reverse-engineering the page etc.) led me to this technique but I can't get it working. BTW I am new to python, basically at the cut/paste/experiment stage. Got past installation and indentation issues but I'm stuck now.
In the test code below, print(sample_html) works and returns the original html of the target page but print(render(sample_html)) always returns the word 'None'.
Interestingly, if you run this on amazon.com they detect it is not a real browser and return html with a warning about automated access. However the other test pages provide true html that should render, except it doesn't.
How do I troubleshoot the result always returning "None'?
def render(source_html):
"""Fully render HTML, JavaScript and all."""
import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEngineView
class Render(QWebEngineView):
def __init__(self, html):
self.html = None
self.app = QApplication(sys.argv)
QWebEngineView.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.setHtml(html)
self.app.exec_()
def _loadFinished(self, result):
# This is an async call, you need to wait for this
# to be called before closing the app
self.page().toHtml(self.callable)
def callable(self, data):
self.html = data
# Data has been stored, it's safe to quit the app
self.app.quit()
return Render(source_html).html
import requests
#url = 'http://webscraping.com'
#url='http://www.amazon.com'
url='https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'
sample_html = requests.get(url).text
print(sample_html)
print(render(sample_html))
EDIT: Thanks for the responses which were incorporated into the code. But now it returns an error and the script hangs until I kill the python launcher which then causes a segfault:
This is the revised code:
def render(source_url):
"""Fully render HTML, JavaScript and all."""
import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEngineView
class Render(QWebEngineView):
def __init__(self, url):
self.html = None
self.app = QApplication(sys.argv)
QWebEngineView.__init__(self)
self.loadFinished.connect(self._loadFinished)
# self.setHtml(html)
self.load(QUrl(url))
self.app.exec_()
def _loadFinished(self, result):
# This is an async call, you need to wait for this
# to be called before closing the app
self.page().toHtml(self._callable)
def _callable(self, data):
self.html = data
# Data has been stored, it's safe to quit the app
self.app.quit()
return Render(source_url).html
# url = 'http://webscraping.com'
# url='http://www.amazon.com'
url = "https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1"
print(render(url))
Which throws these errors:
$ python3 -tt fees-pkg-v2.py
Traceback (most recent call last):
File "fees-pkg-v2.py", line 30, in _callable
self.html = data
AttributeError: 'method' object has no attribute 'html'
None (hangs here until force-quit python launcher)
Segmentation fault: 11
$
I already started reading up on python classes to fully understand what I'm doing (always a good thing). I'm thinking something in my environment could be the problems (OSX Yosemite, Python 3.4.3, Qt5.4.1, sip-4.16.6). Any other suggestions?

The problem was the environment. I had manually installed Python 3.4.3, Qt5.4.1, and sip-4.16.6 and must have mucked something up. After installing Anaconda, the script started working. Thanks again.

Related

Child_Processes calling function from javascript node js to python breaking with imports

I am using node js for my application as well as react.
I am trying to use child_process to call a python function from my node js code, using this post to guide me. It seems pretty straightforward, until I realized that the child_process was used to call a python script, not a main method with functions.
I am relatively new to Python, so forgive me if I mix up terminology here and there. Here is a very basic version of my python file below:
import sys
# other imports
def __main__():
one = function_one()
two = function_two()
arr = [one, two]
print(arr)
sys.stdout.flush()
def function_one():
# do stuff, pretend it returns 'hello'
def function_two():
# do stuff, pretend it returns 'world'
if __name__ == '__main__':
__main__()
The end result should be ['hello', 'world'], but it seems like I am not getting returned anything. As you can see, I am printing arr and flushing it afterwards, so it should work.
The only way I can get it to work is if my file looks like this:
import sys
print('hello world')
sys.stdout.flush()
As you can see, without the main method or additional functions. Is there a reason to this or am I just implementing it wrongly? Thanks!
EDIT:
After trying a couple different things, I found out that some of my imports were breaking, such as pandas or seaborn. It works if I remove those specific imports, even if I'm calling __main__ through the if statement. Any ideas why?

You declare a function called __main__ and then never call it. I think you've mixed things up with a known pattern that prevents code from being executed when a script is imported as a module. The correct implementation of this pattern would look like this:
import sys
def myMainFunction(argv):
# the code here
if __name__ == "__main__":
myMainFunction(sys.argv)
__name__ is a special variable in Python. It is automatically set to "__main__" if the script was executed directly. If it was imported, it would hold the module's name instead.
Back to your problem. If you don't plan to import that script as a module, here's a solution I suggested in the comments:
Node.js:
const spawn = require('child_process').spawn;
const pythonProcess = spawn('python', ['script.py']);
pythonProcess.stdout.on('data', (data) => {
console.log(data.toString());
});
Python:
import sys
def function_one():
return 'hello'
def function_two():
return 'world'
one = function_one()
two = function_two()
arr = [one, two]
print(arr)
sys.stdout.flush()

So, my problem was that in my spawn, python was pointing to version 2.7, not 3.6. Just alter the spawn to point to python3.6 and it'll fix the import issues.

getting dynamic data using python

I'm new to Python and got interested in writing scripts. I'm currently building a crawler that goes on a page and extract copy from tags. Write now I can only list tags; I'm having trouble getting the text out of tags and I'm not sure why exactly. I'm also using BeautifulSoup and PyQt4 to get dynamic data(this might need a new question).
So based on this code below, I should be getting the "Images" copy from the Google homepage, or at least the span tag itself. I'm getting returned NONE
I tried reading the docs for BeautifulSoup and it was a little overwhelming. I'm still reading it, but I think I keep going down a rabbit hole. I can print all anchor tags or all divs, but targeting a specific one is where I'm struggling.
import urllib
import re
from bs4 import BeautifulSoup, Comment
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
class Render(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
url = 'http://google.com'
source = urllib.urlopen(url).read()
soup = BeautifulSoup(source, 'html.parser')
js_test = soup.find("a", class_="gb_P")
print js_test

How to retrieve the exact HTML as in a browser

I'm using a Python script to render web pages and retrieve their HTML's. It works fine with most of the pages, but with some of them the HTML retrieved is incomplete. And I don't quite understand why. This is the script I'm using to scrap this page, for some reason, the link to every product is not in the HTML:
Link: http://www.pullandbear.com/es/es/mujer/vestidos-c29016.html
Python script:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from PyQt4 import QtNetwork
from PyQt4 import QtCore
url = sys.argv[1]
path = sys.argv[2]
class Render(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.request = QtNetwork.QNetworkRequest()
self.request.setUrl(QtCore.QUrl(url))
self.request.setRawHeader("Accept-Language", QtCore.QByteArray ("es ,*"))
self.mainFrame().load(self.request)
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
r = Render(url)
result = r.frame.toHtml()
html_file = open(path, "w")
html_file.write("%s" % result.encode("utf-8"))
html_file.close()
sys.exit(app.exec_())
This code was taken from here: https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/
Am I missing something? What are the limitations of this framework?
Thanks in advance,

If you want headless browsing you can combine phantomjs with selenium, the following gets all the source:
url = "http://www.pullandbear.com/es/es/mujer/vestidos-c29016.html"
from selenium import webdriver
dr = webdriver.PhantomJS()
dr.get(url)
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element = WebDriverWait(dr, 5).until(
EC.presence_of_element_located((By.CLASS_NAME, "grid_itemContainer"))
)
Just using selenium without the WebDriverWait did not always return the full source, adding the wait until the a tags with the grid_itemContainer class were visible makes sure the html has been generated, the xpath below returns all your links:
print([a.get_attribute('href') for a in dr.find_elements_by_xpath("//a[#class='grid_itemContainer']")])
[u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-detalle-crochet-pechera-c29016p100064004.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-bordado-escote-pico-c29016p100123006.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-manga-larga-espalda-abierta-c29016p100147503.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-hombros-descubiertos-beads-c29016p100182001.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-jacquard-capa-c29016p100255505.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-vaquero-eyelets-c29016p100336010.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-liso-oversized-c29016p100289013.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-liso-oversized-c29016p100289013.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-camisero-oversized-c29016p100036616.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-pico-c29016p100166506.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-estampado-rayas-c29016p100234507.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-manga-corta-liso-c29016p100262008.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-cuello-halter-liso-c29016p100036162.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-capa-jacquard-%C3%A9tnico-c29016p100259002.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-cuello-halter-rayas-c29016p100036161.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-capa-jacquard-tri%C3%A1ngulo-c29016p100255506.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-marinero-escote-bardot-c29016p100259003.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-rayas-escote-espalda-c29016p100262007.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cruzado-c29016p100216013.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-flores-canes%C3%BA-bordado-c29016p100203011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-bordados-c29016p100037160.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-flores-volante-c29016p100216014.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-lencero-c29016p100104515.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuadros-detalle-encaje-c29016p100216016.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-drapeado-abertura-bajo-c29016p100129011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-drapeado-abertura-bajo-c29016p100129011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-vaquero-bolsillo-plastr%C3%B3n-c29016p100036822.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-rayas-bajo-desigual-c29016p100123010.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-camisero-vaquero-c29016p100036575.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-midi-estampado-rayas-c29016p100189011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-midi-rayas-manga-3-4-c29016p100149507.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-midi-canal%C3%A9-ajustado-c29016p100149508.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-estampado-bolsillos-c29016p100212503.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-corte-evas%C3%A9-bolsillos-c29016p100189012.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-vaquero-camisero-cuadros-c29016p100036624.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/pichi-vaquero-c29016p100073526.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-estampado-geom%C3%A9trico-cuello-halter-c29016p100037021.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-jacquard-evas%C3%A9-c29016p100037207.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-estampado-flores-manga-3-4-c29016p100036932.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-estampado-flores-manga-3-4-c29016p100037280.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-parche-c29016p100037464.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-liso-manga-3-4-c29016p100036930.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-liso-manga-3-4-c29016p100036930.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-alto-liso-c29016p100037156.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-alto-estampado-flores-c29016p100036921.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-alto-estampado-corbatero-c29016p100037155.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-manga-sisa-c29016p100170011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-manga-sisa-rayas-c29016p100170012.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-manga-acampanada-c29016p100149506.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-punto-espalda-abierta-c29016p100195504.html']
If you want to write the source:
with open("out.html", "w") as f:
f.write(dr.page_source)

I think you can use http://ghost-py.readthedocs.org/en/latest/ for this case. It's loads web page like real browser and run JavaScript.
Also you can try PhantomJS for example, but it written on nodeJS.

Send data from javascript to python in google app engine

i want to create an application in google app engine using python. I have to send a list of json string to python script.I have done the following code but ain't worked.
$.post("/javascriptdata",{v:s},function(data,status) {});
In python script i have a class named javascriptdata to where data has to be send to
import wsgiref.handlers
import json
import sys
import cgi
from google.appengine.ext import webapp
from google.appengine.ext.webapp import Request
from google.appengine.ext.webapp.util import run_wsgi_app
from google.appengine.ext.webapp import template
class mainh(webapp.RequestHandler):
def get(self):
self.response.out.write(template.render("paint.html",{}))
class javascriptdata(webapp.RequestHandler):
def post(self):
self.response.headers['content-Type'] = 'html'
data1=self.request.get('v')
self.response.out.write("""<html><body>""")
self.response.out.write(data1)
self.response.out.write("""</body></html>""")
def main():
app = webapp.WSGIApplication([
('/',mainh),("/save",javascriptdata)], debug=True)
wsgiref.handlers.CGIHandler().run(app)
if __name__ == "__main__":
main()
The javascriptdata is associated with the url "/save". I have created a submit button named "save" that would redirect to /save but iam not getting any output. I know it may be a silly mistake but Iam struggling to sort it out. Please suggest me how to post and read the data for this code.

This seems suspicious:
$.post("/javascriptdata",{v:s},function(data,status) {});
since you don't have a /javascriptdata URL mapped in the python code. Perhaps you meant
$.post('/save', ...?
Alternatively, you could change the WSGIApplication init to be:
...("/javascriptdata",javascriptdata)...

Flask, html and javascript desktop app

I want to build a desktop application using python, html and javascript. So far i have followed the tuts on flask and have a hello world working example. What should i do now to make it working? how do the html files "talk" to the python scripts below them?
here is my code so far :
from flask import Flask, url_for, render_template, redirect
app = Flask(__name__)
#app.route('/hello/')
#app.route('/hello/<name>')
def hello(name=None):
return render_template('hello.html', name=name)
#app.route('/')
def index():
return redirect(url_for('init'))
#app.route('/init/')
def init():
css = url_for('static', filename='zaab.css')
return render_template('init.html', csse=css)
if __name__ == '__main__':
app.run()

You can use HTML forms just as you normally would in your Jinja templates - then in your handler you use the following:
from flask import Flask, url_for, render_template, redirect
from flask import request # <-- add this
# ... snip setup code ...
# We need to specify the methods that we accept
#app.route("/test-post", methods=["GET","POST"])
def test_post():
# method tells us if the user submitted the form
if request.method == "POST":
name = request.form.name
email = request.form.email
return render_template("form_page.html", name=name, email=email)
If you wanted to use GET instaed of POST to submit the form you would just check request.args rather than request.form (See flask.Request's documentation for more information). If you are going to be doing much with forms though, I recommend checking out the excellent WTForms project and the Flask-WTForms extension.

We Keep Coding

JavaScript is the programming language of the Web.

Scraping javascript page with PyQt5 and QWebEngineView - javascript

The problem was the environment. I had manually installed Python 3.4.3, Qt5.4.1, and sip-4.16.6 and must have mucked something up. After installing Anaconda, the script started working. Thanks again.

Related

Child_Processes calling function from javascript node js to python breaking with imports

getting dynamic data using python

How to retrieve the exact HTML as in a browser

Send data from javascript to python in google app engine

Flask, html and javascript desktop app

Categories

Resources