I need to get a bit of data from an HTML tag that only appears when you're signed into a site. I need to do it in either Python or JavaScript. In JavaScript, the browser's same-origin policy is an obstacle (the site would have to allow my origin via Cross-Origin Resource Sharing, CORS).
I can't use server-side code.
I can't use iframes.
The data is readily available if you open the page URL in Chrome or Firefox because the browser keeps you signed in, much like Facebook, so we'll use it as an example. We'll say I want to get the data from the first element of my Facebook news feed.
I've tried scraping the webpage and passing in the User-Agent value with Python's urllib module. I've tried using Yahoo's YQL tool with JavaScript. Both returned the HTML I wanted, but without the values I need. This is because the request doesn't go through my browser, which stores the cookies required to populate those values.
So is there a way to scrape a webpage that's already open? Say I had Facebook open and I ran some code that got my news feed data from the browser.
Is there some other method I haven't mentioned to accomplish this?
Background: I'm creating an autobumper for a forum (within the site rules) and need some generated values from the site's HTML, but I will get no cooperation towards that end from the owner.
You can try the following with Python's Selenium WebDriver, as it allows you to log in and get the HTML source.
You will have to pip install selenium first and download chromedriver.exe from the Selenium website: http://docs.seleniumhq.org/
Here is sample code I use on Gmail:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
# You have to download chromedriver from the Selenium HQ homepage
chromedriver_path = r'your chromedriver.exe path here'
# Create the webdriver object and open the URL
# (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(service=Service(chromedriver_path))
driver.implicitly_wait(1)
driver.get('https://www.google.com/gmail')
# Log in (the find_element_by_* helpers were removed in Selenium 4)
driver.find_element(By.CSS_SELECTOR, '#Email').send_keys('email@gmail.com')
driver.find_element(By.CSS_SELECTOR, '#next').click()
driver.find_element(By.CSS_SELECTOR, '#Passwd').send_keys('1234')
driver.find_element(By.CSS_SELECTOR, '#signIn').click()
# Get the HTML of the signed-in page
html = driver.page_source
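Once `driver.page_source` holds the signed-in HTML, the value still has to be pulled out of it. Below is a minimal sketch using only the standard library. It assumes the generated value sits in a hidden `<input>` element, which the question does not confirm; the field name `auth_token` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class InputValueParser(HTMLParser):
    """Collect the value attribute of every named <input> element."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:
            # value is None when the input has no value attribute
            self.values[name] = attributes.get("value")


def extract_input_values(html):
    """Return a dict mapping input names to their values."""
    parser = InputValueParser()
    parser.feed(html)
    parser.close()
    return parser.values


# Hypothetical stand-in for driver.page_source
sample_html = (
    '<form><input type="hidden" name="auth_token" value="abc123">'
    '<input name="q"></form>'
)
values = extract_input_values(sample_html)
print(values["auth_token"])
```

In real use, `sample_html` would be replaced by the `html` variable from the Selenium code above, and the field name by whatever the forum's form actually uses.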
I'm building a web scraper using BeautifulSoup. Some websites have JavaScript content that does not load with urllib3, so I use Selenium for them. But Selenium takes too long to respond, and I need a more efficient scraper because I use the same generalized scraper for multiple websites. So I'm wondering whether there is some way to find out if a website has JS content, so that I use Selenium only then and otherwise go with the faster urllib.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
browser = webdriver.Chrome()
strt=time.time()
y=browser.get("https://www.amazon.jobs/en/locations/bangalore-india")
#time.sleep(10)
html = browser.page_source
soup = BeautifulSoup(html,'lxml')
li=soup.find_all('ul')
print(li)
print('load time='+str(time.time()-strt))
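Here is a simple check using Selenium:
from selenium.webdriver.common.by import By
# driver is the webdriver instance (named browser in the question)
# Count the <script> elements inside the page's <head>
jsSize = len(driver.find_elements(By.XPATH, "/html/head/script"))
if jsSize > 0:
    print("Page contains javascript")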
The script tag is used to define a client-side script (JavaScript).
The element either contains script statements, or it points to an external script file through the src attribute.
Right-click the webpage you want to scrape >> View Page Source >>
look for tags named script. A script tag indicates that the web page you are trying to scrape also contains JavaScript.
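The same check can be done without starting a browser, which is the point of the question. The sketch below counts `<script>` tags in HTML that has already been downloaded (for example with urllib), using only the standard library. The sample HTML is made up. Note that this is a weak heuristic: most pages contain some script tags even when the content you want is present in the static HTML, so it may be more reliable to also test whether the target element is already there.

```python
from html.parser import HTMLParser


class ScriptCounter(HTMLParser):
    """Count inline and external <script> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.inline = 0
        self.external = 0

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        if dict(attrs).get("src"):
            self.external += 1
        else:
            self.inline += 1


def count_scripts(html):
    """Return (inline_count, external_count) for the given HTML text."""
    counter = ScriptCounter()
    counter.feed(html)
    counter.close()
    return counter.inline, counter.external


# Made-up static HTML standing in for a urllib response body
sample_html = (
    '<html><head><script src="app.js"></script>'
    "<script>var x = 1;</script></head>"
    "<body><ul></ul></body></html>"
)
inline_count, external_count = count_scripts(sample_html)
print(inline_count, external_count)
```

A scraper could fall back to Selenium only when the counts are non-zero and the element it is looking for is missing from the static HTML.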
The page I am trying to crawl includes JavaScript code (possibly using AJAX?). When I crawl the page based on the HTML code, I can't get the part generated by JavaScript. How can I do that?
I think I need a Python library that can crawl the JavaScript-generated content along with the HTML.
Please give me some advice.
Below is the page link:
view-source:http://www.bobaedream.co.kr/mycar/popup/mycarChart_4.php?zone=C&cno=652691&tbl=cyber
I recommend two ways.
First, request the AJAX URL directly and parse the returned HTML.
import requests
url = "http://www.bobaedream.co.kr/mycar/proc/mycar_regist_option.php"
data = {'param': 'ALL'}
response = requests.post(url, data=data)
# parse
...
Second, use a web driver, such as geckodriver or PhantomJS, through the Selenium library.
The library drives a browser (possibly a headless one) that runs the JavaScript and renders the DOM it produces.
See the public Selenium documentation.
I am looking to get the contents of a text file hosted on my website using Python. The server requires JavaScript to be enabled in your browser. Therefore, when I run:
import urllib2
target_url = "http://09hannd.me/ai/request.txt"
data = urllib2.urlopen(target_url)
I receive an HTML page telling me to enable JavaScript.
I was wondering if there is a way of faking having JS enabled, or something similar.
Thanks
Selenium is the way to go here, but there is another "hacky" option.
Based on this answer: https://stackoverflow.com/a/26393257/2517622
import requests
url = 'http://09hannd.me/ai/request.txt'
response = requests.get(url, cookies={'__test': '2501c0bc9fd535a3dc831e57dc8b1eb0'})
print(response.content) # Output: find me a cafe nearby
I would suggest a tool like dryscrape: https://github.com/niklasb/dryscrape
Additionally you can see more info here: Using python with selenium to scrape dynamic web pages
I am using Python to pull the HTML of a website to get satellite locations. Of course, since I am not actually accessing the site via a browser, I am not retrieving any HTML that would be populated by JavaScript calls.
import urllib.request
page = urllib.request.urlopen('http://n2yo.com/?s=20217')
file = open("textFile", "wb")
satelliteText = page.read()
file.write(satelliteText)
file.close()
I've explored libraries like Windmill that literally run a browser so that you can get that javascript created html, but I am using a Raspberry Pi. I'd rather not install an additional browser.
Is there any way that I can make the AJAX GET calls that the website is making myself, and retrieve just the data I need?
Looking at this source here: http://www.n2yo.com/js/passes.js it appears that it is calling http://www.n2yo.com/inc/all.php to get the data. By reading through passes.js carefully you should be able to figure out how to parse it.
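Making the call yourself could look roughly like the sketch below, which uses only the standard library and so needs no extra browser on the Raspberry Pi. The endpoint is the one named above, but the query parameter `s` is only an assumption carried over from the page URL in the question; the real parameter names have to be read out of passes.js. The `Referer` and `X-Requested-With` headers are what a browser's AJAX request typically sends and may not be required by this site.

```python
import urllib.parse
import urllib.request


def build_ajax_request(endpoint, params, referer):
    """Build a GET request that resembles a browser's AJAX call."""
    query = urllib.parse.urlencode(params)
    url = endpoint + "?" + query
    headers = {
        "Referer": referer,
        "X-Requested-With": "XMLHttpRequest",
    }
    return urllib.request.Request(url, headers=headers)


request = build_ajax_request(
    "http://www.n2yo.com/inc/all.php",
    {"s": "20217"},  # hypothetical parameter name, check passes.js
    "http://n2yo.com/?s=20217",
)
print(request.full_url)

# Uncomment to actually send the request:
# with urllib.request.urlopen(request, timeout=10) as response:
#     data = response.read()
```

The request is built separately from sending it so that the URL and headers can be inspected and compared with what the browser's network tab shows.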
I have a setup where a web page in a local server (localhost:8080) is changed dynamically by sending sockets that load some scripts (d3 code mainly).
In chrome I can inspect the "rendered html status" of the page, i.e., the resulting html code of the d3/javascript loaded codes. Now, I need to save that "full html snapshot" of the rendered web-page to be able to see it later, in a "static" way.
I have tried many solutions in Python that work well for loading a web page and saving its "on-load" d3/JavaScript-processed content, but they DO NOT capture the code generated "after" the load.
I could also use JavaScript to do this if no Python solution is found.
Remember that I need to retrieve the full rendered HTML, as "dynamically" modified over time, at a moment of my choosing.
Here is a list of questions found on Stack Overflow that are related but do not answer this question.
Not answered:
How to save dynamically changed HTML?
Answered but not for dynamically changed html:
Using PyQt4 to return Javascript generated HTML
Not Answered:
How to save dynamically added data to update the page (using jQuery)
Not dynamic:
Python to Save Web Pages
The question could be solved using selenium-python (thanks to @Juca's suggestion to use Selenium).
Once installed (pip install selenium), this code does the trick:
from selenium import webdriver
# Start the browser. It will open the URL,
# and we can access all of its content and perform actions on it.
browser = webdriver.Firefox()
url = 'http://localhost:8080/test.html'
# The page test.html constantly changes its content by receiving sockets, etc.,
# so we need to save its "status" at the moment we choose, for later retrieval.
browser.get(url)
# Wait until we want to save the content (this could be a button UI action, etc.):
input("Press Enter to print the web page")
# Save the rendered HTML content at that moment:
html_source = browser.page_source
# Display to check (Python 3 print function):
print(html_source)
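To view the snapshot later in a "static" way, the string can be written to a file instead of printed. This is a minimal sketch, not part of the original answer; the helper name `save_snapshot`, the `snapshots` directory and the file naming scheme are assumptions chosen for illustration.

```python
import time
from pathlib import Path


def save_snapshot(html, directory="snapshots"):
    """Write html to a timestamped file and return the file's path."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    # One file per second; a second snapshot within the same second
    # would overwrite the first.
    filename = time.strftime("snapshot-%Y%m%d-%H%M%S.html")
    path = folder / filename
    path.write_text(html, encoding="utf-8")
    return path


# Stand-in for browser.page_source
saved = save_snapshot("<html><body>demo</body></html>")
print(saved)
```

With the Selenium code above, the call would be `save_snapshot(browser.page_source)`. Scripts in the saved file still run when it is opened, so the stored page may need its `<script>` tags removed to stay truly static.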