I'm using Python 3.3 and Requests 2.2.1.
I'm trying to POST to a website whose URL ends in .jsp, which then changes to a .doh ending. Using the same basic requests code outline I'm able to successfully log in to and scrape other websites, but the JavaScript-driven part of this site is not working. This is my code:
import requests
url = 'https://prodpci.etimspayments.com/pbw/include/sanfrancisco/input.jsp'
payload = {'plateNumber':'notshown', 'statePlate':'CA'} #tried CA and California
s = requests.Session() #Tried 'session' and 'Session' following different advice
post = s.post(url, data=payload)
r = s.get('https://prodpci.etimspayments.com/pbw/include/sanfrancisco/input.jsp')
print(r.text)
Finally, when manually entering data into the webpage through the Firefox browser, the page changes and the URL becomes https://prodpci.etimspayments.com/pbw/inputAction.doh, which only has content if you are redirected there after typing in a license plate.
From the printed text, I know I'm getting content from the page as it would be without POSTing anything, but I need the content for the page once I've POSTed the payload.
For the POST payload, do I need to include something like 'submit':'submit' to simulate clicking the search button?
Am I doing the GET request from the right url, considering the url I POST to?
You're making a POST request and after that another GET request, which is why you get back the same page with the empty form. Print the response of the POST request instead:
response = s.post(url, data=payload)
print(response.text)
Also, if you check the form markup, you'll find its action is /pbw/inputAction.doh, and the form additionally sends a few parameters from hidden inputs. Therefore you should POST to that URL and probably include the values from the hidden inputs.
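For instance, here is a minimal sketch of that idea (my own, untested against this site) that scrapes the hidden inputs with BeautifulSoup and posts them together with the visible fields to the action URL; it assumes the beautifulsoup4 package is available:
import requests
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

form_url = 'https://prodpci.etimspayments.com/pbw/include/sanfrancisco/input.jsp'
action_url = 'https://prodpci.etimspayments.com/pbw/inputAction.doh'

s = requests.Session()

# Load the form page and collect the hidden inputs it would submit
soup = BeautifulSoup(s.get(form_url).text, 'html.parser')
payload = {
    inp['name']: inp.get('value', '')
    for inp in soup.select('input[type=hidden]')
    if inp.has_attr('name')
}

# Add the visible fields and POST to the form's action URL
payload.update({'plateNumber': 'notshown', 'statePlate': 'CA'})
response = s.post(action_url, data=payload)
print(response.text)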
With the following code I'm able to retrieve the same response as a regular request made through the browser:
import requests
url = 'https://prodpci.etimspayments.com/pbw/inputAction.doh'
payload = {
    'plateNumber': 'notshown',
    'statePlate': 'CA',
    'requestType': 'submit',
    'clientcode': 19,
    'requestCount': 1,
    'clientAccount': 5,
}
s = requests.Session()
response = s.post(url, data=payload)
print(response.text)
This matches what you see in the browser after submitting the same request via the form:
...
<td colspan="2"> <li class="error">Plate is not found</li></td>
...
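If you want a programmatic check that the POST reached the form handler, a rough test based on the error markup above could be:
# Rough check based on the markup shown above; the exact markup may differ
if 'class="error"' in response.text:
    print('The form was processed, but the plate was not found')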
In Google Forms, it's possible to print a single response.
This opens a new tab with a URL such as:
https://docs.google.com/forms/u/0/d/1VqMbpn69qCApBZKXzbjmjxz1TLQ8VyxR-2aC2WqO2z8/printresponse?viewresponse=ACYDBNhGZ47ckBgoyjBgpb_r9sVdxYlo10w6MoLTV0zP
The response ID at the end of the URL seems to differ from the ID that you get from FormResponse.getId(), since the following does not work:
let printUrl = FormApp.getActiveForm().getEditUrl().replace('/edit', '/printresponse?viewresponse=') + FormApp.getActiveForm().getResponses()[0].getId();
How do I get this "print response URL" via Apps Script?
Currently, the response ID shown in the UI is different from the ID returned by FormResponse.getId() and from the IDs embedded in the URLs returned by FormResponse.getEditResponseUrl() or FormResponse.toPrefilledUrl().
The closest you can get from Apps Script is either getEditResponseUrl(), which requires the Allow Response Edits option to be enabled, or toPrefilledUrl(), which generates a pre-filled form with the same answers that you can then submit.
I'm almost there with my first attempt at using Scrapy and Selenium to collect data from a website with JavaScript-loaded content.
Here is my code:
# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.webdriver.common.by import By
import time


class FreePlayersSpider(scrapy.Spider):
    name = 'free_players'
    allowed_domains = ['www.forge-db.com']
    start_urls = ['https://www.forge-db.com/fr/fr11/players/?server=fr11']
    driver = {}

    def __init__(self):
        self.driver = webdriver.Chrome('/home/alain/Documents/repository/web/foe-python/chromedriver')
        self.driver.get('https://forge-db.com/fr/fr11/players/?server=fr11')

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        #time.sleep(1)
        sel = Selector(text=self.driver.page_source)
        players = sel.xpath('.//table/tbody/tr')
        for player in players:
            joueur = player.xpath('.//td[3]/a/text()').get()
            guilde = player.xpath('.//td[4]/a/text()').get()
            yield {
                'player': joueur,
                'guild': guilde
            }

        next_page_btn = self.driver.find_element_by_xpath('//a[@class="paginate_button next"]')
        if next_page_btn:
            time.sleep(2)
            next_page_btn.click()
            yield scrapy.Request(url=self.start_urls, callback=self.parse)

        # Close the selenium driver, so in fact it closes the testing browser
        self.driver.quit()

    def parse_players(self):
        pass
I want to collect user names and their respective guilds and output them to a CSV file.
For now my issue is proceeding to the NEXT page and parsing again the content loaded by JavaScript.
Even if I'm able to simulate a click on the NEXT button, I'm not 100% sure the code will go through all pages, and I'm not able to parse the new content using the same function.
Any idea how I could solve this issue?
Thanks.
Instead of using Selenium, you should try to recreate the request that updates the table. If you look closely at the HTML under Chrome DevTools, you can see that the request is made with parameters and a response is sent back with the data in a nice structured format.
Please see here with regards to dynamic content in Scrapy. As it explains, the first step is to think about whether it's necessary to recreate browser activity at all, or whether you can get the information you need by reverse engineering the HTTP GET requests. Sometimes the information is hidden within <script></script> tags and you can use a regex or some string methods to extract what you want. Rendering the page and then using browser activity should be thought of as a last resort.
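As a hedged illustration of that <script>-tag case (the URL and the playersData variable name below are hypothetical), pulling embedded JSON out with a regex could look like this:
import json
import re

import scrapy


class ScriptDataSpider(scrapy.Spider):
    """Hypothetical sketch of the 'data hidden in <script> tags' case."""
    name = 'script_data'
    start_urls = ['https://example.com/page-with-embedded-json']

    def parse(self, response):
        # Find the script block that defines the (hypothetical) playersData variable
        script = response.xpath('//script[contains(text(), "playersData")]/text()').get()
        if not script:
            return
        match = re.search(r'playersData\s*=\s*(\[.*?\]);', script, re.DOTALL)
        if match:
            for row in json.loads(match.group(1)):
                yield {'row': row}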
Now, before I go into some background on reverse engineering the requests: the website you're trying to get information from only requires reverse engineering the HTTP requests.
Reverse Engineering HTTP requests in Scrapy
Now, in terms of the actual website, we can use Chrome DevTools by right-clicking the page and choosing Inspect. Clicking the Network tab allows you to see all the requests the browser makes to render the page. In this case you want to see what happens when you click Next.
Image1: here
Here you can see all the requests made when you click next on the page. I always look for the biggest sized response as that'll most likely have your data.
Image2: here
Here you can see the request headers, params, etc. ... the things you need to make a proper HTTP request. We can see that the request URL is actually getPlayers.php, with all the params needed to get the next page appended to it. If you scroll down you can see all the parameters it sends to getPlayers.php. Keep this in mind; sometimes we need to send headers, cookies and parameters.
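For completeness, a sketch of how such copied values could be passed along in a Scrapy request; the parameter names below are purely illustrative, not the site's real ones:
import scrapy
from urllib.parse import urlencode


class PagedPlayersSpider(scrapy.Spider):
    """Sketch only: sends parameters, headers and cookies copied from DevTools."""
    name = 'paged_players'

    def start_requests(self):
        params = {'server': 'fr11', 'page': 2}  # illustrative parameter names
        yield scrapy.Request(
            url='https://www.forge-db.com/fr/fr11/getPlayers.php?' + urlencode(params),
            headers={'X-Requested-With': 'XMLHttpRequest'},
            cookies={},  # add cookies here only if DevTools shows they are required
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info(response.text[:200])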
Image3: here
Here is the preview of the data we would get back from the server if we make the correct request; it's a nice, neat format which is great for scraping.
Now, you could copy the headers, parameters and cookies here into Scrapy, but it's always worth checking first whether just making an HTTP request to the URL on its own gives you the data you want; if so, that is the simplest way.
In this case it's true, and in fact you get all the data back in a nice, neat format.
Code example
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['forge-db.com']

    def start_requests(self):
        url = 'https://www.forge-db.com/fr/fr11/getPlayers.php?'
        yield scrapy.Request(url=url)

    def parse(self, response):
        for row in response.json()['data']:
            yield {'name': row[2], 'guild': row[3]}
Settings
In settings.py, you need to set ROBOTSTXT_OBEY = False. The site doesn't want you to access this data, so we need to set it to False. Be careful: you could end up getting banned from the server.
I would also suggest a couple of other settings to be respectful and to cache the results, so that if you want to play around with this large dataset you don't hammer the server:
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 3
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'
Comments on the code
We make a request to https://www.forge-db.com/fr/fr11/getPlayers.php? and if you were to print the response you would get all the data from the table; it's quite a lot... It looks like it's in JSON format, so we use Scrapy's relatively new response.json() feature to handle the JSON and convert it into a Python dictionary; be sure that you have an up-to-date Scrapy to take advantage of this. Otherwise you could use the json library that Python provides to do the same thing.
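As a rough sketch of that fallback, a drop-in replacement for the parse method above using the standard json library:
import json

def parse(self, response):
    # Fallback for older Scrapy versions without response.json()
    data = json.loads(response.text)
    for row in data['data']:
        yield {'name': row[2], 'guild': row[3]}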
Now you have to look at the preview data a bit here, but the individual rows are within response.json()['data'][i], where i is the index of the row. The name and guild are within response.json()['data'][i][2] and response.json()['data'][i][3], so we loop over every row in response.json()['data'] and grab the name and guild.
If the data weren't as structured as it is here and needed modifying, I would strongly urge you to use Items or ItemLoaders to create the fields through which you then output the data. You can modify the extracted data more easily with ItemLoaders, and you can handle duplicate items etc. using a pipeline. These are just some thoughts for the future; I almost never yield plain dictionaries when extracting data, particularly for large datasets.
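A rough sketch of that Item/ItemLoader approach, with hypothetical field names matching the data extracted above:
import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader


class PlayerItem(scrapy.Item):
    # Hypothetical item with the two fields extracted above
    name = scrapy.Field()
    guild = scrapy.Field()


class PlayerLoader(ItemLoader):
    default_item_class = PlayerItem
    default_input_processor = MapCompose(str, str.strip)
    default_output_processor = TakeFirst()


class PlayersItemSpider(scrapy.Spider):
    # Same request as the spider above, but yielding Items via an ItemLoader
    name = 'players_items'
    allowed_domains = ['forge-db.com']

    def start_requests(self):
        yield scrapy.Request(url='https://www.forge-db.com/fr/fr11/getPlayers.php?')

    def parse(self, response):
        for row in response.json()['data']:
            loader = PlayerLoader()
            loader.add_value('name', row[2])
            loader.add_value('guild', row[3])
            yield loader.load_item()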
Let me preface by saying I have very little programming experience. I've learned a bunch in the last few days trying to write this program. I am running Python 2.7 on Windows 7 using PyCharm, requests, Beautiful Soup, and lxml.
I am trying to scrape data from a website that relies heavily on Javascript. I have two options:
1) The data I need is populated through JavaScript and does not necessarily need a login. However, I have not been able to figure out how to get at this data. I've monitored headers with the Live HTTP Headers Chrome plugin and I think I've found the JavaScript that does it, but it's beyond my means to figure it out. It's a long bit of code; I'll post it if anyone is interested in taking a look.
or
2) On one of the main pages I found a series of ID numbers which I can use to generate URLs for each of the individual items I am analyzing. The problem is I have to be logged in to see these individual item pages. My code is as follows:
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager
from BeautifulSoup import BeautifulSoup
import ssl

# Request a date from user
UDate = "06/22/2015"  # raw_input('Enter a date mm/dd/yyyy\n')

# Open TLSv1 Adapter (Whatever that means)
class MyAdapter(HTTPAdapter):
    def init_poolmanager(self, connections, maxsize, block=False):
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=ssl.PROTOCOL_TLSv1)

# Begin a requests session. Every get from here on out will use TLSv1 Protocol
import requests

payload = {
    'LogName': 'xxxxxxxx',
    'LogPass': 'xxxxxxxx'
}

s = requests.Session()
s.mount('https://xxxx.xxx', MyAdapter())

# Login with post and Request source code from main page.
log = s.post('LoginURL', data=payload)
print log.text

result = s.get(url)
soup = BeautifulSoup(result.content)
print soup
Neither the POST nor the GET shows me a logged-in website. The login form IDs from the HTML source code look like this:
<div id="DivLogForm">
<label for="BadText"><div id="BadText" class="BadText" style="display:none" tabindex="-2">User Name or Password is Invalid</div></label>
<div class="LogLabel">
<label for="LogName" > User Name </label><input tabindex="0" id="LogName" class="LogInput" value="" />
</div>
<div class="LogLabel">
<label for="LogPass" >User Password </label><input tabindex="0"id="LogPass" type="password" class="LogInput" value="" />
</div>
So I'm passing LogName and LogPass with the POST.
There is also a logform.js with this bit of code:
$("#LogButton").click(function()
{ //$('#divLogForm').hide();
//$('#divLoading').show();
var uName = $("#LogName").val();
var uPass = $("#LogPass").val();
var url = "/index.cfm";
$.post(url, {ZACTION:'AJAX',ZMETHOD:'LOGIN',func:'LOGIN',USERNAME:uName, USERPASS:uPass},
function(data){if (data.isOk =="YES"){location.href="/index.cfm";}
else {$('.BadText').show(); $('#BadText').focus();};
},"json");
});
The LoginURL in my code is taken from the var url in this script. I have tried using USERNAME & USERPASS, and I have tried uName and uPass with my POST, but these didn't work either.
Not sure how to move forward here. Any help is greatly appreciated.
The last bit of javascript you posted gives a clue as to why your login POST request isn't working.
According to the javascript, you should be sending a dictionary that looks like the following with your login POST:
{
    'ZACTION': 'AJAX',
    'ZMETHOD': 'LOGIN',
    'func': 'LOGIN',
    'USERNAME': '<enter username>',
    'USERPASS': '<enter password>'
}
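A minimal sketch of sending that payload with the session from the question (the https://xxxx.xxx placeholder is the one already used there, and /index.cfm comes from the JavaScript above):
login_payload = {
    'ZACTION': 'AJAX',
    'ZMETHOD': 'LOGIN',
    'func': 'LOGIN',
    'USERNAME': 'xxxxxxxx',
    'USERPASS': 'xxxxxxxx',
}

# The form's JavaScript posts to /index.cfm, so post there with the mounted session
login = s.post('https://xxxx.xxx/index.cfm', data=login_payload)
print(login.text)  # the script checks data.isOk == "YES" in the JSON response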
OK, so this is my first post on Stack Overflow, so go easy on me. I'm literally stuck with my Python script. I've looked all over the web and cannot find a solution!
So I used mechanize to login to a website (example: http://www.foobar.com/)
HTML of form to login:
form id="loginForm" method="post" action="/z/0.123/?cmd=login-post" onsubmit="return someSubmitfunction();"
The form for the login of that website looks like this:
<HiddenControl(__FOO=someLongString) (readonly)>
<TextControl(emailAddress=)>
<PasswordControl(password=)>
<CheckboxControl(persist=[*on])>
I WAS able to login to the website and redirect to an internal link (see further in code).
Here is the code for the login... Note: Request Method is a POST
import urllib, urllib2
import cookielib
import mechanize
# Note this is the FORM, but missing the HIDDEN value, LOOK lower in code
EmailAddress = 'someusername'
Password = 'somepassword'
Persist = ['on',]
browser = mechanize.Browser()
# Enable cookie support
cookiejar = cookielib.LWPCookieJar()
browser.set_cookiejar( cookiejar )
# Browser options
browser.set_handle_equiv( True )
browser.set_handle_redirect( True )
browser.set_handle_referer( True )
browser.set_handle_robots( False )
# Pretend that I am a browser
browser.set_handle_refresh( mechanize._http.HTTPRefreshProcessor(), max_time = 1 )
browser.addheaders = [ ( 'User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1' ) ]
# Open webpage & add form fields
browser.open('http://www.foobar.com/')
browser.select_form(nr = 0) #select the ONLY form (Login form)
browser.form['emailAddress'] = EmailAddress
browser.form['password'] = Password
browser.form['persist'] = Persist
# Submit for FORM is an action, find it and redirect to internal page
# Create new control & submit to internal page
browser.new_control("HIDDEN", "action", {})
control = browser.form.find_control("action")
control.readonly = False
browser["action"] = "/z/0.123/?cmd=login-post"
browser.method = "post"
browser.action = 'http://www.foobar.com/user/summary/'
browser.submit()
Alrighty, up to this point, I am fine. I SUCCESSFULLY logged in and was redirected to http://www.foobar.com/user/summary/ just like I wanted.
url = browser.open('http://www.foobar.com/user/summary/')
print url.read() # - see content of url HTML ### THIS WORKS ###
Now I use BeautifulSoup() to parse the HTML of /user/summary/ and successfully grab another form on this page. This form doesn't have an action like the login form does, but this is how it looks...
I need help from here down. I have trouble inputting my text (myInput) into the form and submitting!
HTML of form from .../user/summary/:
form method="post" id="foobar" name="foobar">
This is the submit button for the form:
onkeypress="return submitFormKey(event, '','foobar', 'foobar', 'pcm');">
img src="someuglyimage.jpg" class="submit" id="btn_Submit" onclick="submitForm('foobar', 'foobar', 'pcm');" alt="Foo"
This is the actual form: (THE ONLY FORM on this page, once again!)
<HiddenControl(hdnCmd=foobar) (readonly)>
<TextControl(inputvalue=)>
I tried many methods of submitting. I tried using Selenium, Splinter, urllib (and urllib2), and even JSON, JavaScript, iframe, embed, etc. I'm stuck, please help!
I thought this would work; I tried it with and without the hidden control:
browser.select_form(nr = 0) #select the 1st form for inputting value
browser.form['inputvalue'] = myInput #MY INPUT I WANT THERE
browser.new_control("hidden", "foobar", {})
control = browser.form.find_control("foobar")
control.readonly = False
#browser["foobar"] = "/?cmd=foobar&from=/user/summary"
browser.method = "post"
response = browser.submit()
print response.read()
MY RESULTS:
It seems to redirect me to the homepage of the website (302 redirect). So I know that most likely it's something to do with the hidden value and passing it to the JavaScript/Ajax call (onclick="submitForm") when I submit. I read about CSRF tokens and it could be that, but if anyone has any ideas on how to do this, let me know, because I'm in desperate need of help.
And somehow I cannot find the form on .../user/summary/ (the console tells me this) because I am redirected to the homepage, even though I don't submit the browser until after I input all the form fields...
I can read the HTML of .../user/summary/ and find the "foobar" form! This is why I am so confused. I can read it and parse it, but when I try to input myInput into the form, somehow I get redirected to the homepage, yet I am still logged in!
Thanks, hopefully I was clear!
I'm using jQuery with Django on the server side. What I'm trying to do is get some text from the user through a form and simultaneously display that text in a canvas area, like about.me and flavors.me do. The user then drags the text in the canvas area to the desired position, and when they click the next button, the data must be stored in the database and the user redirected to the homepage. Everything works (the data is stored in the database) except the button click, for which I set window.location to "http://127.0.0.1:8000". I'm not getting to that page when I click the button.
I'm getting some errors from the Django server:
error: [Errno 32] Broken pipe
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 51161)
Traceback (most recent call last):
File "/usr/lib/python2.7/SocketServer.py", line 284, in _handle_request_noblock
Here is my html:
https://gist.github.com/2359541
Django views.py:
from cover.models import CoverModel
from django.http import HttpResponseRedirect


def coverview(request):
    if request.is_ajax():
        t = request.POST.get('top')
        l = request.POST.get('left')
        n = request.POST.get('name')
        h = request.POST.get('headline')
        try:
            g = CoverModel.objects.get(user=request.user)
        except CoverModel.DoesNotExist:
            co = CoverModel(top=t, left=l, name=n, headline=h)
            co.user = request.user
            co.save()
        else:
            g.top = t
            g.left = l
            g.name = n
            g.headline = h
            g.save()
    return HttpResponseRedirect("/")
urls.py:
url(r'^cover/check/$', 'cover.views.coverview'),
url(r'^cover/$', login_required(direct_to_template), {'template': 'cover.html'}),
Could anyone help me?
Thanks!
There's really not enough information in your question to properly diagnose this, but you can try this:
It's always a bad idea to hard-code a domain name in your JS. What happens when you take this to production, for example? If you want to send the user to the homepage (presumed from the location being set to http://127.0.0.1:8000/), then set the location simply to /. That will ensure that it will always go to the site root regardless of the IP address, domain name or port.
Part of the problem is that you're trying to post data and then immediately leaving the page by setting window.location. You should only change window.location once you get the response back from the $.post():
$.post("check/", { top: t, left: l, name: n, headline: h}, function(data) {
window.location.href = "/";
});
Notice also that I removed the hardcoded URL. Use a relative one here, like Chris said.
If it still isn't working, you need to check for JavaScript errors in the lines above. Use Firebug, Chrome Dev Tools, Opera Dragonfly, something. Check to make sure your POST is actually going through, and post more details about that back here.
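On the server side, one option (my own suggestion, not part of the original answer) is to have the view return a plain success response to the AJAX POST and let the callback above perform the redirect:
# Sketch only: respond to the AJAX POST with a simple success message so the
# $.post callback fires; the client-side code then sets window.location.
from django.http import HttpResponse

def coverview(request):
    if request.is_ajax():
        # ... save or update CoverModel exactly as in the question ...
        return HttpResponse("OK")
    return HttpResponse(status=405)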