Python: Issue getting HTML created via JavaScript calls in browser

I am using Python to pull the HTML of a website to get satellite locations. Of course, since I am not actually accessing the site via a browser, I am not retrieving any HTML that would be populated by JavaScript calls.
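import urllib.request

# Fetch the raw HTML; content generated by JavaScript is not included.
with urllib.request.urlopen('http://n2yo.com/?s=20217') as page:
    satelliteText = page.read()

# Save the bytes exactly as received.
with open("textFile", "wb") as file:
    file.write(satelliteText)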
I've explored libraries like Windmill that literally run a browser so that you can get the JavaScript-created HTML, but I am using a Raspberry Pi and would rather not install an additional browser.
Is there any way that I can make the AJAX GET calls that the website is making myself and retrieve just the data I need?

Looking at the source at http://www.n2yo.com/js/passes.js, it appears that the page calls http://www.n2yo.com/inc/all.php to get the data. By reading through passes.js carefully, you should be able to figure out how to parse the response.
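A minimal sketch of calling that endpoint directly with the standard library, so no browser is needed on the Pi. The query parameter name "s" is only an assumption borrowed from the page URL in the question; the real parameters have to be read out of passes.js. The helper names build_url and fetch_text are made up for this example.

```python
import urllib.parse
import urllib.request

# Endpoint named in the answer above.
ENDPOINT = "http://www.n2yo.com/inc/all.php"


def build_url(base, params):
    """Append URL-encoded query parameters to a base URL."""
    return base + "?" + urllib.parse.urlencode(params)


def fetch_text(url, timeout=10):
    """Download a URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# "s" is an ASSUMED parameter name (taken from the page URL in the question);
# check passes.js for the parameters the site actually sends.
url = build_url(ENDPOINT, {"s": 20217})
print(url)
```

Calling fetch_text(url) would then return whatever the endpoint sends back, which still has to be parsed the way passes.js does it.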

Related

JavaScript in requests package python

I want to get text from a site using Python.
But the site uses JavaScript, and the requests package receives only the JavaScript code.
Is there a way to get text without using Selenium?
import requests as r
a=r.get('https://aparat.com/').text
If the site loads its content using JavaScript, then the JavaScript has to be run in order to get the content. I ran into this issue a while back when I did some web scraping, and ended up using Selenium. Yes, it's slower than BeautifulSoup, but it's the easiest solution.
If you know how the server works, you could send a request yourself and it should return content of some kind (HTML, JSON, etc.).
Edit: Open the developer tools, go to the Network tab, and refresh the page. Look for an XHR request and the URL it uses. You may be able to use this data for your needs.
For example I found these URLs:
https://www.aparat.com/api/fa/v1/etc/page/config/mode/full
https://www.aparat.com/api/fa/v1/video/video/list/tagid/1?next=1
If you navigate to these in your browser, you will see JSON content that you might be able to use. Some of the text is written as Unicode escapes, e.g. \u062e\u0644\u0627\u0635\u0647 \u0628\u0627\u0632\u06cc -> خلاصه بازی
I don't know the specific Python implementation you might use. Look for libraries that support making HTTP requests and receiving data; that way you can avoid Selenium. But you must know the URLs beforehand, as shown above.
For example, this is what I would do:
Make an HTTP request to the URL you find in the developer tools.
With JSON content, use a JSON parser to get a table/array/dictionary natively. You can then traverse this in the native programming language.
Use a Unicode decoder to get the text in normal text format. There might be a library to do this, but for example on this website, using the "Decode/Unescape Unicode Entities" option, I was able to get the text.
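In Python, the last two steps usually collapse into one: the standard json module turns \uXXXX escapes into normal text while parsing. A small sketch, using a made-up payload (the "title" key is an assumption; the escaped string is the one quoted above):

```python
import json

# Raw JSON text as it might arrive over the wire. The payload shape and the
# "title" key are hypothetical; only the escaped string comes from the answer.
raw = '{"title": "\\u062e\\u0644\\u0627\\u0635\\u0647 \\u0628\\u0627\\u0632\\u06cc"}'

# json.loads parses the structure and decodes the \uXXXX escapes in one step.
data = json.loads(raw)
title = data["title"]
print(title)
```

So a separate Unicode decoding step is probably not needed when the response is parsed with json.loads or with the .json() method of a requests response.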
I hope this helps.
Sample code:
import requests

req = requests.get('https://www.aparat.com/api/fa/v1/video/video/show/videohash/IueKs?pr=1&mf=1&referer=direct')
res = req.json()
# do stuff with res
print(res)

Downloading file from "javascript:__doPostBack" link using Python

I have an existing Python script that was written using urllib2 to download from an http:// link:
import urllib2
import os.path
import os
from os import chdir, getcwd, listdir, path
print "downloading with urllib2"
f = urllib2.urlopen('http://www.dcregs.dc.gov/Notice/DownLoad.aspx?VersionID=4613531')
data = f.read()
with open("11-B300.doc", "wb") as code:
    code.write(data)
print "All done downloads!"
The source web page has been reformatted to use a "javascript:__doPostBack" address:
javascript:__doPostBack('ctl00$MainContent$rpt_ruleList$ctl02$Label1','')
My presumption is that there is some form of package, similar to urllib2, that will allow me to download the same information via the "javascript:__doPostBack" formatted address or to call the http url, where the information is located, from which I can then download the information.
The existing script was working well for my purposes, so I would like to limit the additional coding, if possible.
Is there an alternative to urllib2 that will allow me to download the information in a similar manner?
Or am I going to have to get more sophisticated in my solution (e.g., using Selenium to scrape the information)? (Do I want to get more sophisticated so that I don't have to manage updates to individual urls?)
Thanks for your help in advance.
This is because the site you're using is built on .NET WebForms, which manages the state of the page and its interactions through hidden form variables.
So, in short, you'll need to click the link via something like Selenium, as you say.
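To illustrate what those hidden form variables look like: on a standard WebForms page, __doPostBack fills the hidden fields __EVENTTARGET and __EVENTARGUMENT and submits the form together with the other hidden inputs, such as __VIEWSTATE. Below is a rough sketch, assuming the page follows that standard pattern, that rebuilds those form fields from the page HTML and the link. The class and function names are made up for illustration.

```python
import re
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


def build_postback_payload(page_html, href):
    """Build the form fields that a __doPostBack link would submit."""
    match = re.search(r"__doPostBack\('([^']*)','([^']*)'\)", href)
    if match is None:
        raise ValueError("not a __doPostBack link")
    collector = HiddenInputCollector()
    collector.feed(page_html)
    payload = dict(collector.fields)
    # __doPostBack writes its two arguments into these two hidden fields.
    payload["__EVENTTARGET"] = match.group(1)
    payload["__EVENTARGUMENT"] = match.group(2)
    return payload
```

In principle the resulting dictionary could be sent as the POST body to the page's own URL, but whether this particular site accepts it is untested, so Selenium remains the safer route.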

Crawl some of Javascript codes in a web-page

The page I am trying to crawl includes JavaScript code (possibly using AJAX?). When I crawl the page based on its HTML source, I can't get the part generated by JavaScript. How can I do that?
I think I need a Python library that can crawl the JavaScript-generated content as well as the HTML.
Please give me some advice.
Below is the page link:
view-source:http://www.bobaedream.co.kr/mycar/popup/mycarChart_4.php?zone=C&cno=652691&tbl=cyber
I recommend two approaches.
First, request the AJAX URL directly and parse the returned HTML.
import requests
url = "http://www.bobaedream.co.kr/mycar/proc/mycar_regist_option.php"
data = {'param': 'ALL'}
response = requests.post(url, data=data)
# parse
...
Second, use a web driver such as geckodriver or PhantomJS through the Selenium library.
That library drives a browser, runs the JavaScript, and then gives you the DOM rendered by the JavaScript.
See the public Selenium documentation.

Scrape currently opened webpage or get live HTML with another method?

I need to get a bit of data from an HTML tag that only appears when you're signed in to a site. I need to do it in either Python or JavaScript. In JavaScript, the browser's same-origin policy (CORS restrictions) is an obstacle.
I can't use server-side code.
I can't use iframes.
The data is readily available if you open the page URL in Chrome or FireFox because it keeps you signed in, much like Facebook, so we'll use it as an example. We'll say I want to get the data from the first element of my Facebook news feed.
I've tried scraping the webpage and passing in the User-Agent value with Python's urllib module. I've tried using Yahoo's YQL tool with JavaScript. Both returned the HTML I wanted, but without the values I need. This is because the request doesn't come from my browser, which stores the cookies required to populate those values.
So is there a way to scrape a webpage that's already open? Say I had Facebook open and I ran some code that got my news feed data from the browser.
Is there some other method I haven't mentioned to accomplish this?
Background: I'm creating an autobumper for a forum (within the site rules) and need some generated values from the site HTML, but will get no cooperation towards that end from the owner.
You can try the following with the Python Selenium WebDriver, as it allows you to log in and get the HTML source.
You will have to pip install selenium first and download chromedriver.exe from the Selenium website, http://docs.seleniumhq.org/
Here is sample code I use on Gmail:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# you have to download the chromedriver from the Selenium HQ homepage
chromedriver_path = r'your chromedriver.exe path here'
# create the webdriver object (Selenium 4 style) and open the URL
driver = webdriver.Chrome(service=Service(chromedriver_path))
driver.implicitly_wait(1)
driver.get('https://www.google.com/gmail')
# log in
driver.find_element(By.CSS_SELECTOR, '#Email').send_keys('email@gmail.com')
driver.find_element(By.CSS_SELECTOR, '#next').click()
driver.find_element(By.CSS_SELECTOR, '#Passwd').send_keys('1234')
driver.find_element(By.CSS_SELECTOR, '#signIn').click()
# get the HTML after the JavaScript has run
html = driver.page_source
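As a lighter alternative to driving a browser, and only under the assumption that the site identifies a signed-in user purely by cookies, the cookie values can be copied from the browser's developer tools and sent along with a plain request. A sketch with the standard library; the URL, the cookie name "sessionid" and the helper name are placeholders, not values from any real site:

```python
import urllib.request


def build_authenticated_request(url, cookies, user_agent):
    """Create a request that carries the given cookies and User-Agent."""
    cookie_header = "; ".join(
        f"{name}={value}" for name, value in cookies.items()
    )
    return urllib.request.Request(
        url,
        headers={"Cookie": cookie_header, "User-Agent": user_agent},
    )


# All values below are placeholders; copy the real cookie names and values
# from the browser session that is signed in.
request = build_authenticated_request(
    "https://forum.example.com/thread/1",
    {"sessionid": "PASTE-VALUE-FROM-BROWSER"},
    "Mozilla/5.0",
)
print(request.get_header("Cookie"))
```

Passing the request to urllib.request.urlopen would then fetch the page as the signed-in user. If the values are filled in by JavaScript after the page loads, this will not be enough and Selenium is still needed.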

Python get URL contents when page requires JavaScript enabled

I am looking to get the contents of a text file hosted on my website using Python. The server requires JavaScript to be enabled in the browser. Therefore, when I run:
import urllib2
target_url = "http://09hannd.me/ai/request.txt"
data = urllib2.urlopen(target_url)
I receive an HTML page telling me to enable JavaScript.
I was wondering if there is a way of faking having JS enabled, or something similar.
Thanks
Selenium is the way to go here, but there is another "hacky" option.
Based on this answer: https://stackoverflow.com/a/26393257/2517622
import requests
url = 'http://09hannd.me/ai/request.txt'
response = requests.get(url, cookies={'__test': '2501c0bc9fd535a3dc831e57dc8b1eb0'})
print(response.content) # Output: find me a cafe nearby
I would probably suggest tools like this. https://github.com/niklasb/dryscrape
Additionally you can see more info here: Using python with selenium to scrape dynamic web pages
