Python web scraping with BS4 and HTML not downloading correctly - JavaScript

I've been trying to scrape information from the website:
https://www.tddirectinvesting.co.uk/share-dealing/daily-trading-ideas
The information I want is in the elements with the class "RecogniaEventSummaryBodyLinks".
But when I downloaded the HTML and printed it, the file appeared not to have downloaded correctly. What I mean is that when I copied the whole HTML text my Python code returned into Notepad++ and used Ctrl+F to search for those elements, they weren't there.
I also tried downloading the file manually from the website, but that didn't work either.
Here's my code (Python):
import mechanize
import cookielib
from bs4 import BeautifulSoup

def viewPage(url, proxy, userAgent):
    br = mechanize.Browser()
    cookieJar = cookielib.LWPCookieJar()
    br.set_cookiejar(cookieJar)
    br.set_proxies(proxy)
    br.addheaders = userAgent
    page = br.open(url)
    htmlFile = page.read()
    for cookie in cookieJar:
        print("cookie: " + str(cookie))
        print("")
    return htmlFile

def ScrapeFigures(url):
    html = viewPage(url, proxyAdress, agentStringSample)
    soup = BeautifulSoup(html, "html.parser")
    info = soup.find("a", attrs={"class": "RecogniaEventSummaryBodyLinks"})
I tried printing out the variable info, but it returned None.
However, I then copied the Python output for the whole soup variable into another text file and saved it as an HTML file. When I opened that file in my web browser (Chrome), the elements I needed were on the page, despite not being present in the HTML as text. So I wondered: is this caused by some sort of JS in the background that's triggered when the page is opened?
My question is: how can I scrape the elements described above? Is there a way to get around this?
Thank you for your time
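
Yes: the elements are injected by JavaScript after the page loads, so the HTML that mechanize downloads will never contain them; you need a tool that actually executes the scripts. A minimal sketch using Selenium with headless Chrome (assuming chromedriver is installed; the class name is the one from the question):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

# run Chrome headless so no browser window appears
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://www.tddirectinvesting.co.uk/share-dealing/daily-trading-ideas")

# wait (up to 10s) until the JavaScript has injected the elements we want
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "RecogniaEventSummaryBodyLinks"))
)

# page_source now reflects the DOM *after* the scripts have run
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for link in soup.find_all("a", attrs={"class": "RecogniaEventSummaryBodyLinks"}):
    print(link.get_text(strip=True), link.get("href"))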

Related

How to pull the table data that is generated with javascript from a website using Python?

My goal is to get only the data from a specific table on a website. If possible, the opened browser should run in the background rather than popping up.
I have tried applying the answer from Get page generated with Javascript in Python, but it dumps all the data on the site rather than only the table, so my best option there would be to parse the table out of that dump. As an alternative I tried the answer from How to extract tables from websites in Python, but the returned list object is empty.
As an example, I want to pull the table (the information under Symbol, Size, Entry_Price, Mark_Price, PNL) located at the bottom of the page on this site.
How can I achieve this? And while doing it, can the opened browser run in the background without taking focus?
You can use BeautifulSoup (bs4) together with requests, for example:
import requests
from bs4 import BeautifulSoup

r = requests.get("")  # URL over here; this will not open a browser
soup = BeautifulSoup(r.text, 'html.parser')
for table in soup.find_all('table'):
    print(table)
This code will print all the tables; you can then look at the data and use if statements to pick out what you need.
This code prints all HTML tables by index 0, 1, and so on:
import pandas as pd

tables1 = pd.read_html('https://www.binance.com/en/futuresng-activity/leaderboard/7E32B49490355C9FCCAA709A1D364AA6?tradeType=PERPETUAL')
tables1[0]
# or
# tables1[1]
# or
# tables1[2]
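
Since the leaderboard table on that page is generated by JavaScript, a plain requests fetch may come back without it. A hedged sketch combining headless Selenium (so the browser stays in the background, as asked) with pandas.read_html:

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# headless mode keeps the browser from popping up or stealing focus
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://www.binance.com/en/futuresng-activity/leaderboard/7E32B49490355C9FCCAA709A1D364AA6?tradeType=PERPETUAL")
time.sleep(5)  # crude wait for the JavaScript table to render; an explicit wait is cleaner

# read_html parses every <table> in the rendered source into DataFrames
tables = pd.read_html(driver.page_source)
driver.quit()
print(tables[0])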

How can I check whether a website has javascript or not?

I'm building a web scraper using BeautifulSoup. Some websites have JavaScript content and don't load using urllib3, so I use Selenium for those. But Selenium takes too long to respond, and I need a more efficient scraper, since I use the same generalized scraper for multiple websites. So I'm wondering: is there a way to find out whether a website has JS content, so that I only use Selenium when it does and otherwise go with the faster urllib?
from selenium import webdriver
from bs4 import BeautifulSoup
import time

browser = webdriver.Chrome()
strt = time.time()
browser.get("https://www.amazon.jobs/en/locations/bangalore-india")
# time.sleep(10)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
li = soup.find_all('ul')
print(li)
print('load time=' + str(time.time() - strt))
Here is a simple check using Selenium:
jsSize = len(driver.find_elements_by_xpath("/html/head/script"))
if jsSize > 0:
    print("Page contains javascript")
The script tag is used to define a client-side script (JavaScript). The element either contains script statements, or it points to an external script file through the src attribute.
Right-click on the web page you want to scrape, go to View Page Source, and look for a tag named script; the presence of a script tag indicates that the page you are trying to scrape also contains JavaScript.
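
If starting a browser just for the check is too slow, a lighter-weight sketch (a rough heuristic, not a guarantee) fetches the raw HTML with requests and counts script tags:

import requests
from bs4 import BeautifulSoup

def has_javascript(url):
    # rough check: does the raw HTML contain any <script> tags?
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    return len(soup.find_all('script')) > 0

if has_javascript("https://www.amazon.jobs/en/locations/bangalore-india"):
    print("Page contains javascript")  # fall back to Selenium here
else:
    print("Plain HTML; urllib/requests should be enough")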

Download entire webpage (html, image, JS) by Selenium Python

I have to download the source code of a website like www.humkinar.pk as plain HTML. Content on the site is generated dynamically. I have tried Selenium's driver.page_source, but it does not download the page completely; for example, image and JavaScript files are left out. How can I download the complete page? Is there a better and easier solution available in Python?
Using Selenium
I know your question is about Selenium, but from my experience I can tell you that Selenium is recommended for testing, NOT for scraping. It is very SLOW. Even with multiple instances of headless browsers (Chrome, in your situation), the results are delayed far too much.
Recommendation
Python 2, 3
This trio will help you a lot and save you a bunch of time.
Dryscrape
BeautifulSoup
ThreadPoolExecutor
Do not use dryscrape's built-in parser; it is very SLOW and buggy. For this situation, use BeautifulSoup with the lxml parser instead. Use dryscrape to fetch JavaScript-generated content, plain HTML, and images.
If you are scraping a lot of links simultaneously, I highly recommend using something like ThreadPoolExecutor (a short sketch follows the example below).
Edit #1
dryscrape + BeautifulSoup usage (Python 3+)
from dryscrape import start_xvfb
from dryscrape.session import Session
from dryscrape.mixins import WaitTimeoutError
from bs4 import BeautifulSoup

def new_session():
    session = Session()
    session.set_attribute('auto_load_images', False)
    session.set_header('User-Agent', 'SomeUserAgent')
    return session

def session_reset(session):
    return session.reset()

def session_visit(session, url, check):
    session.visit(url)
    # ensure that the market table is visible first
    if check:
        try:
            session.wait_for(lambda: session.at_css('SOME#CSS.SELECTOR.HERE'))
        except WaitTimeoutError:
            pass
    body = session.body()
    session_reset(session)
    return body

# start xvfb in case no X is running (server)
start_xvfb()

SESSION = new_session()
URL = 'https://stackoverflow.com/questions/45796411/download-entire-webpage-html-image-js-by-selenium-python/45824047#45824047'
CHECK = False
BODY = session_visit(SESSION, URL, CHECK)
soup = BeautifulSoup(BODY, 'lxml')
RESULT = soup.find('div', {'id': 'answer-45824047'})
print(RESULT)
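
And a sketch of the ThreadPoolExecutor part recommended above (the URL list is hypothetical; one session is created per task, on the assumption that sharing a dryscrape session across threads is unsafe):

from concurrent.futures import ThreadPoolExecutor

URLS = ['https://example.com/page1', 'https://example.com/page2']  # hypothetical

def fetch(url):
    # a fresh session per task, reusing new_session()/session_visit() from above
    session = new_session()
    return session_visit(session, url, False)

with ThreadPoolExecutor(max_workers=4) as pool:
    bodies = list(pool.map(fetch, URLS))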
I hope the code below works to download the complete content of the page:
import requests

# assumes driver is an already-initialized Selenium webdriver
driver.get("http://testurl.com")
pageurl = driver.current_url
page = requests.get(pageurl)
pagecontent = page.content
`pagecontent` will contain the complete code content
It's not allowed to download a website without permission. If you knew that, you would also know that there is hidden code on the hosting server which you, as a visitor, have no access to.

python: how to save dynamically rendered html web page code

I have a setup where a web page in a local server (localhost:8080) is changed dynamically by sending sockets that load some scripts (d3 code mainly).
In Chrome I can inspect the "rendered HTML status" of the page, i.e., the resulting HTML of the d3/JavaScript code that has run. Now, I need to save that "full HTML snapshot" of the rendered web page so I can view it later, in a "static" way.
I have tried many solutions in Python which work well for loading a page and saving its "on-load" d3/JavaScript-processed content, but they DO NOT capture the code generated after the load.
I could also use javascript to make this if no python solution is found.
Remember that I need to retrieve the full html rendered code that has been "dynamically" modified in time, in a chosen moment of time.
Here is a list of questions found on Stack Overflow that are related but do not answer this question.
Not answered:
How to save dynamically changed HTML?
Answered but not for dynamically changed html:
Using PyQt4 to return Javascript generated HTML
Not Answered:
How to save dynamically added data to update the page (using jQuery)
Not dynamic:
Python to Save Web Pages
The question could be solved using selenium-python (thanks to @Juca's suggestion to use Selenium).
Once installed (pip install selenium), this code does the trick:
from selenium import webdriver

# initiate the browser: it will open the url, and we can
# access all its content and perform actions on it
browser = webdriver.Firefox()

url = 'http://localhost:8080/test.html'
# the page test.html is constantly changing its content by receiving sockets, etc.,
# so we need to save its "status" at the moment we choose, for later retrieval
browser.get(url)

# wait until we want to save the content (this could be a button/UI action, etc.):
raw_input("Press to print web page")

# save the html content rendered at that moment:
html_source = browser.page_source

# display to check:
print html_source
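
To actually save the snapshot rather than just print it, one small addition (io.open is used here so the same lines run on both Python 2 and 3):

import io

# write the rendered snapshot to a static file for later viewing
with io.open('snapshot.html', 'w', encoding='utf-8') as f:
    f.write(browser.page_source)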

Instructing Python to click a button using urllib2

I'm writing a web scraper using urllib2 and BeautifulSoup in Python, and I'm looking for a way to instruct Python to click a button on a page whose HTML source it has read.
The following snippet of my script reads URLs from a csv file and is meant to scrape data from the specified web pages, but an intermediate step is to click a "submit" button that exists on the web page loaded from each of the csv's URLs.
for line in triplines:
    FromTo = line.split(",")
    From = FromTo[0].strip()
    print(From)
    To = FromTo[1].strip()
    print(To)
    url = KCString1 + From + KCString2 + To + KCString3
    print(url)
    page = urllib2.urlopen(url)
    page_source = page.read()
    soup = BeautifulSoup(page_source)
    print(soup.prettify())
Is there a way to utilize urllib2 functionality in such a way as to say "follow the URL that is obtained from clicking this button"? I imagine I may need to find the JavaScript source to identify the button's identifiers first.
Buttons do not typically have URLs attached to them. They normally need JavaScript interaction, which has to be emulated. If you want to click a button, you should use a browser emulator like Ghost instead of a parser like BeautifulSoup.
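
As a concrete illustration of the browser-emulation route (sketched with Selenium rather than Ghost, since the idea is the same; the CSS selector is a placeholder you would replace after inspecting the page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get(url)  # the URL built from the csv fields, as in the question

# locate and click the submit button; the selector here is hypothetical
driver.find_element(By.CSS_SELECTOR, "input[type='submit']").click()

# after the click, page_source reflects the resulting page
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.prettify())
driver.quit()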
