Download entire webpage (html, image, JS) by Selenium Python - javascript

I have to download the source code of a website like www.humkinar.pk as plain HTML. The content on the site is generated dynamically. I have tried Selenium's driver.page_source, but it does not download the page completely: image and JavaScript files are left out. How can I download the complete page? Is there a better and easier solution available in Python?

Using Selenium
I know your question is about Selenium, but in my experience Selenium is better suited to testing than to scraping. It is very slow: even with multiple instances of a headless browser (Chrome, in your case), the results take too long to arrive.
Recommendation
Python 2, 3
This trio will help you a lot and save you a bunch of time.
Dryscrape
BeautifulSoup
ThreadPoolExecutor
Do not use dryscrape's own parser; it is very slow and buggy. For this situation, use BeautifulSoup with the lxml parser instead. Use dryscrape to fetch JavaScript-generated content, plain HTML, and images.
If you are scraping a lot of links simultaneously, I highly recommend using something like ThreadPoolExecutor.
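A minimal sketch of the ThreadPoolExecutor idea, using only the standard library (concurrent.futures ships with Python 3). The helper name fetch_all and its fetch parameter are hypothetical: fetch stands for whatever callable downloads one page, for example a wrapper around a dryscrape session or requests.get. Whether one dryscrape session can be shared between threads is not something I have verified, so creating a session inside fetch is probably the safer assumption.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL in a thread pool; return {url: result}.

    `fetch` is a hypothetical callable that takes a URL and returns the
    page body. URLs that fail map to the exception that was raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it was submitted for.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one page fails
                results[url] = exc
    return results
```

Threads fit here because the work is mostly waiting on the network; max_workers=8 is an arbitrary starting point and should be tuned to what the target site tolerates.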
Edit #1
dryscrape + BeautifulSoup usage (Python 3+)
from dryscrape import start_xvfb
from dryscrape.session import Session
from dryscrape.mixins import WaitTimeoutError
from bs4 import BeautifulSoup
def new_session():
    session = Session()
    session.set_attribute('auto_load_images', False)
    session.set_header('User-Agent', 'SomeUserAgent')
    return session
def session_reset(session):
    return session.reset()
def session_visit(session, url, check):
    session.visit(url)
    # ensure that the market table is visible first
    if check:
        try:
            session.wait_for(lambda: session.at_css(
                'SOME#CSS.SELECTOR.HERE'))
        except WaitTimeoutError:
            pass
    body = session.body()
    session_reset(session)
    return body
# start xvfb in case no X is running (server)
start_xvfb()
SESSION = new_session()
URL = 'https://stackoverflow.com/questions/45796411/download-entire-webpage-html-image-js-by-selenium-python/45824047#45824047'
CHECK = False
BODY = session_visit(SESSION, URL, CHECK)
soup = BeautifulSoup(BODY, 'lxml')
RESULT = soup.find('div', {'id': 'answer-45824047'})
print(RESULT)

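I hope the code below will work to download the content of the page.
import requests
# driver is an existing Selenium WebDriver instance
driver.get("http://testurl.com")
pageurl = driver.current_url
page = requests.get(pageurl)
pagecontent = page.content
`pagecontent` will contain the HTML that the server returns for that URL. Note that requests does not execute JavaScript, so content generated in the browser is not included.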
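To also get the images and JavaScript files the question asks about, one option is to pull the asset URLs out of the HTML and download each one separately. The sketch below uses only the standard library; AssetCollector and collect_assets are hypothetical names, and the html argument can be driver.page_source when the page is rendered by JavaScript. It only looks at img, script and stylesheet link tags, which is an assumption about what counts as a "complete" page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect absolute URLs of images, scripts and stylesheets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            # Resolve relative paths against the page URL.
            self.assets.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.assets.append(urljoin(self.base_url, attrs["href"]))


def collect_assets(html, base_url):
    """Return asset URLs in document order (inline scripts are skipped)."""
    parser = AssetCollector(base_url)
    parser.feed(html)
    return parser.assets
```

Each returned URL can then be fetched with requests.get(url).content and written to disk next to the saved HTML.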

Downloading a website without permission is not allowed. Besides, part of the code lives on the hosting server, where you as a visitor have no access to it.

Related

JavaScript in requests package python

I want to get text from a site using Python.
But the site uses JavaScript, and the requests package receives only the JavaScript code.
Is there a way to get text without using Selenium?
import requests as r
a=r.get('https://aparat.com/').text
If the site loads content using JavaScript, then the JavaScript has to be run in order to get the content. I ran into this issue a while back when I did some web scraping, and ended up using Selenium. Yes, it's slower than BeautifulSoup, but it's the easiest solution.
If you know how the server works, you could send a request to it directly and it should return content of some kind (whether that be HTML, JSON, etc.).
Edit: Load the developer tools, go to network tab and refresh the page. Look for an XHR request and the URL it uses. You may be able to use this data for your needs.
For example I found these URLs:
https://www.aparat.com/api/fa/v1/etc/page/config/mode/full
https://www.aparat.com/api/fa/v1/video/video/list/tagid/1?next=1
If you navigate to these in your browser you will see JSON content, which you might be able to use. Some of the text is encoded as Unicode escape sequences, e.g. \u062e\u0644\u0627\u0635\u0647 \u0628\u0627\u0632\u06cc -> خلاصه بازی
I don't know which specific Python implementation you might use. Look for libraries that support making HTTP requests and receiving data. That way you can avoid Selenium. But you must know the URLs beforehand, as shown above.
For example this is what I would do:
Make an HTTP request to the URL you find in the developer tools.
With JSON content, use a JSON parser to get a table/array/dictionary natively. You can then traverse this in the native programming language.
Use a Unicode decoder to get the text in normal text format. There might be a library to do this, but for example on this website, using "Decode/Unescape Unicode Entities", I was able to get the text.
I hope this helps.
Sample code:
import requests
req = requests.get('https://www.aparat.com/api/fa/v1/video/video/show/videohash/IueKs?pr=1&mf=1&referer=direct')
res = req.json()
#do stuff with res
print(res)
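On the Unicode step: in Python a separate decoder is usually not needed, because the JSON parser turns \uXXXX escapes into normal characters while parsing (req.json() goes through the same parsing). A small sketch with the standard json module; the "title" key is a hypothetical field name, and the value is the escape sequence quoted earlier.

```python
import json

# Raw text as it might appear in an API response body.
# The "title" key is made up for illustration.
raw = '{"title": "\\u062e\\u0644\\u0627\\u0635\\u0647 \\u0628\\u0627\\u0632\\u06cc"}'

data = json.loads(raw)  # \uXXXX escapes are decoded during parsing
print(data["title"])

# ensure_ascii=False keeps the readable characters when writing JSON back out
print(json.dumps(data, ensure_ascii=False))
```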

How can I check whether a website has javascript or not?

I'm building a web scraper using BeautifulSoup. Some websites have JavaScript content and do not load using urllib3, hence I use Selenium for them. But Selenium takes too long to respond, and I need a more efficient scraper since I have to use the same generalized scraper for multiple websites. So I'm wondering whether there is some way to find out if a website has JS content; only then would I use Selenium, otherwise I'd go with the faster urllib.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
browser = webdriver.Chrome()
strt=time.time()
y=browser.get("https://www.amazon.jobs/en/locations/bangalore-india")
#time.sleep(10)
html = browser.page_source
soup = BeautifulSoup(html,'lxml')
li=soup.find_all('ul')
print(li)
print('load time='+str(time.time()-strt))
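Here is a simple check using Selenium:
from selenium.webdriver.common.by import By
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which was removed in Selenium 4
jsSize = len(driver.find_elements(By.XPATH, "/html/head/script"))
if jsSize > 0:
    print("Page contains javascript")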
The script tag is used to define a client-side script (JavaScript).
The element either contains script statements, or it points to an external script file through the src attribute.
Right click on the webpage you want to scrape >> Go to View Page Source >> look for the tag named script. A script tag indicates that the web page you are trying to scrape also contains JavaScript.
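Since almost every page has at least one script tag, its presence alone may not tell you whether Selenium is needed. A rough heuristic, offered only as a sketch: fetch the static HTML with the fast client, then check whether it has scripts but very little visible text. PageStats, probably_needs_js and the 200-character threshold are all assumptions for illustration, not an established rule.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Count <script> tags and measure visible text in static HTML."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text a reader would actually see.
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def probably_needs_js(html, min_text_chars=200):
    """Guess: scripts are present but the static HTML is nearly empty."""
    stats = PageStats()
    stats.feed(html)
    return stats.script_count > 0 and stats.text_chars < min_text_chars
```

If probably_needs_js returns True for the urllib response, fall back to Selenium for that site; otherwise parse the static HTML directly. Pages that load only part of their data via JavaScript will slip through this check.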

Downloading file from "javascript:__doPostBack" link using Python

I have an existing Python script that was written using urllib2 to download from a http:// link:
import urllib2
import os.path
import os
from os import chdir, getcwd, listdir, path
print "downloading with urllib2"
f = urllib2.urlopen('http://www.dcregs.dc.gov/Notice/DownLoad.aspx?VersionID=4613531')
data = f.read()
with open("11-B300.doc", "wb") as code:
    code.write(data)
print "All done downloads!"
The source web page has been reformatted to use a "javascript:__doPostBack" address:
javascript:__doPostBack('ctl00$MainContent$rpt_ruleList$ctl02$Label1','')
My presumption is that there is some form of package, similar to urllib2, that will allow me to download the same information via the "javascript:__doPostBack" formatted address or to call the http url, where the information is located, from which I can then download the information.
The existing script was working well for my purposes, so I would like to limit the additional coding, if possible.
Is there an alternative to urllib2 that will allow me to download the information in a similar manner?
Or am I going to have to get more sophisticated in my solution (e.g., using Selenium to scrape the information)? (Do I want to get more sophisticated so that I don't have to manage updates to individual urls?)
Thanks for your help in advance.
This is because the site you're using is built on .NET WebForms, which manages the state of the page and its interactions in hidden form variables.
So in short, you'll need to click the link via something like Selenium, as you say.
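For reference, a __doPostBack link works by writing its two arguments into the hidden fields __EVENTTARGET and __EVENTARGUMENT and submitting the page's form together with the other hidden variables (such as __VIEWSTATE). If you want to try this without a browser, the sketch below builds that form data from the page HTML using only the standard library. HiddenFields and postback_payload are hypothetical helper names, and whether the server accepts a replayed form depends on the site.

```python
import re
from html.parser import HTMLParser


class HiddenFields(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            if attrs.get("name"):
                self.fields[attrs["name"]] = attrs.get("value") or ""


# Matches javascript:__doPostBack('target','argument')
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)',\s*'([^']*)'\)")


def postback_payload(page_html, href):
    """Build the form data a click on a __doPostBack link would submit."""
    match = POSTBACK_RE.search(href)
    if match is None:
        raise ValueError("not a __doPostBack link: %r" % href)
    parser = HiddenFields()
    parser.feed(page_html)
    payload = dict(parser.fields)  # __VIEWSTATE, __EVENTVALIDATION, ...
    payload["__EVENTTARGET"] = match.group(1)
    payload["__EVENTARGUMENT"] = match.group(2)
    return payload
```

The payload would then be sent as a POST to the page's own URL, ideally with the same session cookies used to load the page. Other visible form fields may also be required, so Selenium remains the more dependable route.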

Python get URL contents when page requires JavaScript enabled

I am looking to get the contents of a text file hosted on my website using Python. The server requires JavaScript to be enabled on your browser. Therefore when I run:
import urllib2
target_url = "http://09hannd.me/ai/request.txt"
data = urllib2.urlopen(target_url)
I receive an HTML page telling me to enable JavaScript.
I was wondering if there was a way of faking having JS enabled or something.
Thanks
Selenium is the way to go here, but there is another "hacky" option.
Based on this answer: https://stackoverflow.com/a/26393257/2517622
import requests
url = 'http://09hannd.me/ai/request.txt'
response = requests.get(url, cookies={'__test': '2501c0bc9fd535a3dc831e57dc8b1eb0'})
print(response.content) # Output: find me a cafe nearby
I would probably suggest tools like this. https://github.com/niklasb/dryscrape
Additionally you can see more info here: Using python with selenium to scrape dynamic web pages

Data extraction by Python from a dynamic javascript page

I have to extract the data from the table from the following website:
http://www.mcxindia.com/SitePages/indexhistory.aspx
When I click on GO, a table is appended to the page dynamically. I want to export that data from the page to a CSV file (which I know how to handle), but the page source does not contain any of the data points.
I have tried looking into the JavaScript code. When I inspect the elements after the table is generated, I can see the data points, but they are not in the source. I am using mechanize in Python.
I think it is because the page is loaded dynamically. What should I do/use?
mechanize doesn't/can't evaluate javascript. The easiest way that I've seen to evaluate javascript is by using Selenium, which will open a browser on your computer and communicate with python.
I answered a similar question here
I agree with Matthew Wesly's comment. We can get the dynamic page using Selenium or an add-on like iMacro. It captures the dynamic page's response based on our recording. It also has JS scripting capability.
I think, though, that for easy extraction we should use the normal content-fetch logic with the urllib2 and urllib packages.
First get the page's 'viewstate' parameter, i.e. get all the hidden element information from the home page and pass the form information just as the JS script does.
Also pass the Content-Type value exactly. Here the response is in the form of "text/plain; charset=utf-8".
To avoid using JavaScript-aware transports you need to:
Install a web debugger in your browser.
Go to that page. Press F12 to open the debugger. Reload the page.
Analyze the contents of the 'network' tab. Usually AJAX pages download data as HTML fragments or as JSON. Just look at the response tab of each request made after pressing 'GO' and you will find the familiar data.
Now you can create a simple urllib/urllib2 downloader for that URL.
Parse that data and convert it to CSV.
http://www.mcxindia.com/SitePages/indexhistory.aspx sends a POST request with the search parameters on each 'GO' and receives an HTML fragment, which you need to parse and convert into CSV.
So if you simulate that POST, you don't need a browser window at all.
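The last step of that list (parse the HTML fragment and convert it to CSV) can be sketched with the standard library alone. TableRows and html_table_to_csv are hypothetical names, and the sketch assumes the fragment contains a plain table of tr/td/th elements without nested tables; the real response from the site may be shaped differently.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_csv(fragment):
    """Return the table rows found in an HTML fragment as CSV text."""
    parser = TableRows()
    parser.feed(fragment)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

Feed it the body of the POST response and write the returned string to a .csv file; the csv module takes care of quoting cells that contain commas.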
This worked!!!
import httplib
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
url = 'http://www.mcxindia.com/SitePages/indexhistory.aspx'
br.open(url)
response = br.response().read()
br.select_form(nr=0)
br.set_all_readonly(False)
br.form['mTbFromDate']='08/01/2013'
br.form['mTbToDate']='08/08/2013'
response = br.submit(name='mBtnGo').read()
print response
What I personally do when dealing with dynamic web pages is use PyQt WebKit to mimic a browser: pass the URL to it, and get the HTML after all the JavaScript has been rendered.
Example code:
import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage
import bs4 as bs
class Client(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        # quit the event loop once the page has finished loading
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # exec_() works on both Python 2 and 3
    def on_page_load(self):
        self.app.quit()
url = 'YOUR_URL'  # replace with your URL
client_response = Client(url)
source = client_response.mainFrame().toHtml()
soup = bs.BeautifulSoup(source, "lxml")
# BeautifulSoup stuff
