The page I am trying to crawl includes JavaScript code (possibly using AJAX?). When I crawl the page based on the HTML alone, I can't get the JavaScript-generated part. How can I do that?
I think I need a library in Python that can crawl the JavaScript-generated content along with the HTML.
Please give me some advice.
Below is the page link:
view-source:http://www.bobaedream.co.kr/mycar/popup/mycarChart_4.php?zone=C&cno=652691&tbl=cyber
I recommend two approaches.
First, request the AJAX URL directly and parse the HTML it returns.
import requests
url = "http://www.bobaedream.co.kr/mycar/proc/mycar_regist_option.php"
data = {'param': 'ALL'}
response = requests.post(url, data=data)
# parse
...
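For the parsing step, a minimal sketch with BeautifulSoup might look like the following (the response is assumed to be an HTML fragment, and the td selector is just an illustrative guess; inspect the actual response to pick the right elements):
from bs4 import BeautifulSoup  # pip install beautifulsoup4

soup = BeautifulSoup(response.text, 'html.parser')
# the <td> selector below is an assumption; adjust to the real markup
for cell in soup.find_all('td'):
    print(cell.get_text(strip=True))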
Second, use a web driver (geckodriver, PhantomJS, and so on) through the Selenium library.
Selenium drives a real (or headless) browser, runs the JavaScript, and renders the DOM that the JavaScript produces.
See the official Selenium documentation for details.
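As a rough sketch of that second approach (assuming Firefox with geckodriver is installed; the URL is the one from the question, and the five-second wait is an arbitrary guess at how long the AJAX calls need):
from selenium import webdriver
import time

browser = webdriver.Firefox()  # requires geckodriver on PATH
try:
    browser.get("http://www.bobaedream.co.kr/mycar/popup/mycarChart_4.php?zone=C&cno=652691&tbl=cyber")
    time.sleep(5)  # crude wait for the JavaScript/AJAX content (an assumption)
    html = browser.page_source  # the DOM after JavaScript has run
finally:
    browser.quit()
# parse html here as in the first approach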
I have this website and I want to download the content of the page.
I tried Selenium, and clicking the button with it, but without success.
#!/usr/bin/env python
from contextlib import closing
from selenium.webdriver import Firefox
import time
# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    # setting the url
    browser.get("http://bonusbagging.co.uk/oddsmatching.php#")
    # finding and clicking the button
    button = browser.find_element_by_id('select_button')
    button.click()
    page = browser.page_source
    time.sleep(5)
    print(page.encode("utf8"))
This code only downloads the page source, in which the data are hidden.
Can someone show me the right way to do this, or tell me how the hidden data can be downloaded?
Thanks in advance!
I always try to avoid Selenium like the plague when scraping; it's very slow and almost never the best way to go about things. You should dig into the source more before scraping. On this page it was clear that the HTML comes in first and then a separate call is made to fetch the table's data. Why not make the same call the page makes? It's lightning fast and requires no HTML parsing; it just returns the raw data, which seems to be what you're looking for. The Python requests library is perfect for this. Happy scraping!
import requests
table_data = requests.get('http://bonusbagging.co.uk/odds-server/getdata_slow.php').content
PS: The best way to look for these calls is to open the dev console and check out the Network tab. You can see what calls are being made there. Another way is to go to the Sources tab, look for some JavaScript, and search for AJAX calls (that's where I got the URL I'm calling above; the path was top/odds-server.com/odds-server/js/table_slow.js). The latter option is sometimes easier, sometimes nearly impossible (if the file is minified/uglified). Do whatever works for you!
Check out the Network tab in Chrome Dev tools. Nab the URL out of there.
What you're looking at is a DataTable. You can use their API to fetch what you need.
Adjust the "start" and/or "length" parameters to fetch the data page-by-page.
It's JSON data, so it'll be super easy to parse.
But be nice and don't hammer this poor guy's server.
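A hedged sketch of that with requests (the endpoint URL comes from the other answer; whether it honours DataTables-style "start"/"length" parameters, and the exact JSON shape, are assumptions you should confirm in the Network tab):
import requests

url = 'http://bonusbagging.co.uk/odds-server/getdata_slow.php'
params = {'start': 0, 'length': 100}  # hypothetical paging parameters
response = requests.get(url, params=params)
data = response.json()  # raises ValueError if the response isn't JSON
print(data)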
I am looking to get the contents of a text file hosted on my website using Python. The server requires JavaScript to be enabled on your browser. Therefore when I run:
import urllib2
target_url = "http://09hannd.me/ai/request.txt"
data = urllib2.urlopen(target_url)
I receive an HTML page telling me to enable JavaScript.
I was wondering if there was a way of faking having JS enabled or something.
Thanks
Selenium is the way to go here, but there is another "hacky" option.
Based on this answer: https://stackoverflow.com/a/26393257/2517622
import requests
url = 'http://09hannd.me/ai/request.txt'
response = requests.get(url, cookies={'__test': '2501c0bc9fd535a3dc831e57dc8b1eb0'})
print(response.content) # Output: find me a cafe nearby
I would probably suggest a tool like dryscrape: https://github.com/niklasb/dryscrape
Additionally, you can find more info here: Using Python with Selenium to scrape dynamic web pages
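For completeness, a minimal dryscrape sketch (assuming dryscrape and its webkit-server dependency install on your platform) might look like this:
import dryscrape

session = dryscrape.Session()  # headless WebKit, so the JavaScript check should pass
session.visit('http://09hannd.me/ai/request.txt')
print(session.body())  # page body after JavaScript has run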
I am using Python to pull the HTML of a website to get satellite locations. Of course, since I am not actually accessing the site via a browser, I am not retrieving any HTML that would be populated by JavaScript calls.
import urllib.request
page = urllib.request.urlopen('http://n2yo.com/?s=20217')
file = open("textFile", "wb")
satelliteText = page.read()
file.write(satelliteText)
file.close()
I've explored libraries like Windmill that literally run a browser so you can get the JavaScript-created HTML, but I am using a Raspberry Pi and would rather not install an additional browser.
Is there any way I can make the AJAX GET calls the website is making myself and retrieve just the data I need?
Looking at this source: http://www.n2yo.com/js/passes.js, it appears to be calling http://www.n2yo.com/inc/all.php to get the data. By reading through passes.js carefully you should be able to figure out how to parse the response.
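A hedged sketch of making that call yourself with requests (the endpoint comes from passes.js; whether it expects a satellite-id parameter, and what that parameter is called, are assumptions to verify in the browser's Network tab):
import requests

url = 'http://www.n2yo.com/inc/all.php'
# Hypothetical parameter: the page URL uses ?s=20217, so the endpoint may
# expect the satellite id in a similar form; check passes.js to be sure.
response = requests.get(url, params={'s': '20217'})
print(response.text)  # raw data, to be parsed the way passes.js does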
I'd like to make an XML document in JavaScript then have a save dialog appear.
It's OK if they have to click before the save can occur.
It's *not* OK if I *have* to use IE to achieve this (I don't even need to support it at all). However, Windows is a required platform (so Firefox or Chrome are the preferred browsers if I can only do this in one browser).
It's *not* OK if I need a web server. But conversely, I don't want to require the JavaScript to be run on a local file only, i.e. elevated privileges -- if possible. That is, I'd like it to run locally or on a *static* host. But just locally is OK.
It's OK to have to bend over backwards to do this. The file won't be very big, but internet access might either be there, be spotty or just not be a possibility at all -- see (3).
So far the only ideas I have seen are to save the XML to an iframe and save that document -- but it seems that you can only do this in IE? Also, that I could construct a data URI and place that in a link. My fear here is that it will just open the XML file in the window, rather than prompt the user to save it.
I know that if I require the JavaScript to be local, I can raise privileges and just directly save the file (or hopefully cause a save dialog box to appear). However, I'd much prefer a solution where I do not require raised privileges (even a Firefox 3.6 only solution).
I apologize if this offends anyone's sensibilities (for example, not supporting every browser). I basically want to write an offline application and Javascript/HTML/CSS seem to be the best candidate considering the complexity of the requirements and the time available. However, I have this single requirement of being able to save data that must be overcome before I can choose this line of development.
How about the Downloadify script?
It is based on Flash and jQuery, and it can prompt a dialog box to save the file on your computer.
Downloadify.create('downloadify', {
  filename: function(){
    return document.getElementById('filename').value;
  },
  data: function(){
    return document.getElementById('data').value;
  },
  onComplete: function(){
    alert('Your File Has Been Saved!');
  },
  onCancel: function(){
    alert('You have cancelled the saving of this file.');
  },
  onError: function(){
    alert('You must put something in the File Contents or there will be nothing to save!');
  },
  swf: 'media/downloadify.swf',
  downloadImage: 'images/download.png',
  width: 100,
  height: 30,
  transparent: true,
  append: false
});
Using a base64-encoded data URI, this is possible with only HTML and JS. What you can do is encode the data you want to save (in your case, a string of XML data) into base64, using a JS library like jquery-base64 by carlo. Then put the encoded string into a link and add that link to the DOM.
Example using the library I mentioned (as well as jquery):
<html>
  <head>
    <title>Example</title>
  </head>
  <body>
    <script>
      //include jquery and jquery-base64 here (or whatever library you want to use)
      //$.base64.encode comes from the jquery-base64 plugin mentioned above
      document.write('<a href="data:application/octet-stream;base64,'
        + $.base64.encode('your string of XML here')
        + '">click to make save dialog</a>');
    </script>
  </body>
</html>
...and remember to make the content-type something like application/octet-stream so the browser doesn't try to open it.
Warning: some older IE versions don't support base64, but you said that didn't matter, so this should work fine for you.
Without any more insight into your specific requirements, I would not recommend a pure Javascript/HTML solution. From a user perspective you would probably get the best results writing a native application. However if it will be faster to use Javascript/HTML, I recommend using a local application hosting a lightweight web server to serve up your content. That way you can cleanly handle the file saving server-side while focusing the bulk of your effort on the front-end application.
You can code up a web server in - for example - Python or Ruby using very few lines of code and without 3rd party libraries. For example, see:
Making a simple web server in python
WEBrick - Writing a custom servlet
python-trick-really-little-http-server - This one is really simple, and will easily let you serve up all of your HTML/CSS/JS files:
"""
Serves files out of its current directory.
Doesn't handle POST requests.
"""
import SocketServer
import SimpleHTTPServer
PORT = 8080
def move():
""" sample function to be called via a URL"""
return 'hi'
class CustomHandler(SimpleHTTPServer.SimpleHTTPRequestHandler):
def do_GET(self):
#Sample values in self for URL: http://localhost:8080/jsxmlrpc-0.3/
#self.path '/jsxmlrpc-0.3/'
#self.raw_requestline 'GET /jsxmlrpc-0.3/ HTTP/1.1rn'
#self.client_address ('127.0.0.1', 3727)
if self.path=='/move':
#This URL will trigger our sample function and send what it returns back to the browser
self.send_response(200)
self.send_header('Content-type','text/html')
self.end_headers()
self.wfile.write(move()) #call sample function here
return
else:
#serve files, and directory listings by following self.path from
#current working directory
SimpleHTTPServer.SimpleHTTPRequestHandler.do_GET(self)
httpd = SocketServer.ThreadingTCPServer(('localhost', PORT),CustomHandler)
print "serving at port", PORT
httpd.serve_forever()
Finally - Depending on who will be using your application, you also have the option of compiling a Python program into a Frozen Binary so the end user does not have to have Python installed on their machine.
Javascript is not allowed to write to a local machine. Your question is similar to this one.
I suggest creating a simple desktop app.
Is a localhost PHP server OK? Web pages traditionally can't save to the hard drive because of security concerns. PHP can push files, though it requires a server.
Print-to-PDF plugins are available for all browsers. Install once, print to PDF forever. Then you can use JavaScript or Flash to call a Print function.
Also, if you are developing for an environment where internet access is spotty, consider using VB.NET or some other desktop language.
EDIT:
You can use the browser's Print function.
Are you looking for something like this?
If PHP is OK, it would be much easier.
With IE you could use document.execCommand, but I note that IE is not an option.
Here's something that looks like it might help, although it will not prompt with a Save As dialog: https://developer.mozilla.org/en/Code_snippets/File_I%2F%2FOL.
One simple but odd way to do this that doesn't require any Flash is to create an <a/> with a data URI for its href. This even has pretty good cross-browser support, although for IE it must be at least version 8 and the URI must be < 32k. It looks like someone else on SO has more to say on the topic.
Why not use a hybrid approach: Flash on the client and some solution on the server side? Most people have Flash, so you can default to the client side to conserve resources on the server.