JavaScript error: Python routine to log into a website

I have been trying to log into a website and fetch some data using Python, but I am facing an error when I run my script. Here I am just trying to log in to the website and print the entire page text. The script and error are below:
Script:
import requests

with requests.session() as s:
    # Proxy credentials use the "user:password@host:port" form
    proxy_url = "http://{0}:{1}@proxy.blah.blah.com:8099".format('user_id', 'Password')
    s.proxies = {'http': proxy_url, 'https': proxy_url}
    user_id_url = "https://example.ex.com/my.policy"
    headers = {
        'Host': 'example.ex.com',
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Accept-Language': 'en-US,en;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Cookie': '_ga=GA1.2.1822238803.1429212674; LastMRH_Session=0a0d8c67; MRHSession=ded054e0afe1bb151c3d35cb0a0d8c67; TIN=273000',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
    }
    data = {'username': 'website_uid', 'password': 'website_password'}
    r = s.post(user_id_url, data=data, headers=headers)
    print(r.status_code)
    print(r.text)
Error:
<script language="javascript">
setDefaultLang();
</script><noscript>
<div id="noscript_warning_red">JavaScript is not enabled. Please enable JavaScript in your browser or contact your system administrator for assistance.</div>
<div id="noscript_warning_newsession">To open a new session, please click here.</div>
</noscript>
PS: I am able to print the HTML text of the page, but I am not able to log in correctly, hence the error output above.
JavaScript is enabled in my browser; I double-checked it while posting this question.
Any help is really appreciated.
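The <noscript> block in the response is the real clue: the portal's login flow depends on JavaScript, which requests never executes, regardless of what your browser has enabled. If the site offers no plain-HTTP login endpoint, driving a real browser is the usual fallback (the same suggestion comes up in the related questions below). A minimal Selenium sketch, assuming the form fields are named username and password as in your POST data (verify against the page source):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.ex.com/my.policy")
driver.find_element(By.NAME, "username").send_keys("website_uid")
driver.find_element(By.NAME, "password").send_keys("website_password")
driver.find_element(By.NAME, "password").submit()  # submits the enclosing form
print(driver.page_source)  # should now be the post-login page
driver.quit()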

Related

403 Response code - Request Blocked when using Cowin Setu APIs

I was just trying to make a covid vaccine alert using the Cowin Setu API (India) in Node.js. But I am facing something strange: whenever I send a GET request I get a 403 response code from CloudFront saying 'Request Blocked', but the same request works from Postman as well as from the browser. Please help me with this.
Getting this error:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
</HEAD><BODY>
<H1>403 ERROR</H1>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
Request blocked.
We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner.
<BR clear="all">
If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.
<BR clear="all">
<HR noshade size="1px">
<PRE>
Generated by cloudfront (CloudFront)
Request ID: Q1RZ94qgFp6AjUUKE4e9urMB85VejcqMbaJO6Y8Xq5Qp4kNjDBre9A==
</PRE>
<ADDRESS>
</ADDRESS>
</BODY></HTML>
Here's my nodejs code:
var express = require("express");
var app = express();
var bodyParser = require("body-parser");
const axios = require("axios");
const { Telegram } = require("telegraf");
const fetch = require("node-fetch");
var cors = require("cors");
var request = require("request");

const tg = new Telegram(process.env.BOT_TOKEN);
const bot = new Telegram(process.env.BOT_TOKEN, {
  polling: true
});

//bot.start((ctx) => ctx.reply('Welcome to Covid Vaccine Finder'))
/*bot.hears("about", ctx => {
  ctx.reply("Hey, I am CoviBot!");
});
bot.launch();*/

app.use(bodyParser.json());
app.use(cors());
app.use(
  bodyParser.urlencoded({
    extended: true
  })
);

app.get("/", function(req, res) {
  res.send("Welcome to Covid Vaccine Finder");
});

app.get("/test", function(req, res, next) {
  var d = new Date();
  var dateFormatOptions = {
    year: "numeric",
    month: "2-digit",
    day: "2-digit"
  };
  var date = String(d.toLocaleDateString("en", dateFormatOptions));
  date = date.replace(/\//g, "-");
  console.log(date);

  // Note: the query string needs a "?" before the parameters
  const URL =
    "https://cdn-api.co-vin.in/api/v2/appointment/sessions/public/findByPin?pincode=110088&date=13-05-2021";
  var options = {
    url: URL,
    method: "GET",
    headers: {
      "Accept-Encoding": "gzip, deflate",
      "Accept-Language": "en-GB,en;q=0.8,en-US;q=0.6,hu;q=0.4",
      "Cache-Control": "max-age=0",
      Connection: "keep-alive",
      Host: "cdn-api.co-vin.in",
      "User-Agent": "request"
    }
  };
  request(options, function(err, res, body) {
    let json = body;
    console.log(json);
  });

  const txt = "Finding vaccine centres for you....";
  //tg.sendMessage(process.env.GROUP_ID, txt);
  res.send(txt);
});

// Finally, start our server
app.listen(process.env.PORT, function() {
  console.log("Covid app listening on port " + process.env.PORT + "!");
});
I hope this problem can be solved.
Thanks
I added a user-agent header to the request so that the API would recognize that my request is coming from a browser, rather than a script.
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
}
# date is a DD-MM-YYYY string, e.g. "13-05-2021"
url = "https://cdn-api.co-vin.in/api/v2/appointment/sessions/public/calendarByDistrict?district_id=303&date=" + date
response = requests.get(url, headers=headers)
Use the following:

var options = {
  url: URL,
  method: 'GET',
  headers: {
    Host: 'cdn-api.co-vin.in',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
  }
};
Try these headers. They worked for me on a local server (not in production):

let options = {
  headers: {
    "user-agent":
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
  },
};
These will not work in production because the Cowin APIs are geofenced and can't be accessed from any IP address that isn't Indian. Most free hosting sites, like Heroku, don't offer an Indian IP as an option. So an alternative solution might be to use AWS, GCP, or Azure with an Indian server (not tried yet).
Reference - https://github.com/cowinapi/developer.cowin/issues/228
It seems the API is blocked from use outside India. Try routing through an Indian proxy or running on an Indian server.
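If you try the proxy route, requests supports it directly. A minimal sketch, assuming a hypothetical Indian HTTP(S) proxy at proxy.example.in:8080 (the address, and any credentials, are placeholders):

import requests

# Hypothetical Indian proxy; replace with a real host:port (and user:password@ if required)
proxies = {
    'http': 'http://proxy.example.in:8080',
    'https': 'http://proxy.example.in:8080',
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}

url = 'https://cdn-api.co-vin.in/api/v2/admin/location/states'
response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
print(response.status_code)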
You have to use the User Agent Identifier API.
Please refer to this:
https://devcenter.heroku.com/articles/useragentidentifier#using-with-python
You have to make your request in the following format; I am attaching a sample for the states metadata API:
curl --location --request GET 'https://cdn-api.co-vin.in/api/v2/admin/location/states' --header 'Accept-Language: hi_IN' --header 'Accept: application/json' --header 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
It's not about the request's user-agent or format. I faced the same issue, and further testing showed that CloudFront blocks the IP if multiple requests come from the same IP back to back. It also unblocks after a couple of minutes.
Basically they don't want these alerting tools; the traffic is probably overloading their server.
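If rate limiting is indeed the cause, spacing requests out and retrying after the block clears is the pragmatic workaround. A minimal sketch, assuming the couple-of-minutes unblock window described above (the exact wait time is a guess):

import time
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
url = 'https://cdn-api.co-vin.in/api/v2/admin/location/states'

for attempt in range(5):
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code != 403:
        print(response.json())
        break
    # Blocked by CloudFront: wait a couple of minutes, since the block
    # reportedly clears on its own
    time.sleep(120)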
OK, if you want to work locally you can use:

let headers = {
  'accept': 'application/json',
  'Accept-Language': 'hi_IN',
  'X-Requested-With': 'XMLHttpRequest',
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
}

Now if you deploy to Heroku or Firebase, it will return 403; I think they are blocking any hit from an IP outside Indian servers.
GitHub link: https://github.com/manojkumar3692/reactjs_nodejs_cowin
I will keep you posted here.

Web scraping click to download

I would like to use Python to automate the following task:
given fileid 8426 and date 03312021
go to the website:
https://cdr.ffiec.gov/Public/ViewFacsimileDirect.aspx?ds=call&idType=fdiccert&id=8426&date=03312021
click "Download PDF"
Save file to directory
I did some research and found the Python module Requests: https://docs.python-requests.org/en/master/user/quickstart/
It looks like I should be able to declare a data object and pass it in order to send the request:

import requests

r = requests.post('https://my_url', data={'key': 'value'})
with open("test.pdf", "wb") as f:
    f.write(r.content)

However, I have trouble finding the proper attributes for the data object in my case. I have tried some and was unable to fetch the desired PDF file. Any help would be greatly appreciated!
So, in the case of the requests.post() method, the data argument is a dictionary representing the key-value pairs of the HTML POST form. To find them, open DevTools in your browser (Ctrl-Shift-I in Chrome and Mozilla), go to the Network tab, and submit the form you want to inspect. In your case the form is represented by a single <input type="submit" ...> element (styled as the "Download PDF" button). After you hit this input, the browser makes a well-formed POST request to the server, and you can see the correct HTTP headers and key-values of that request on the Network tab. Just grab them and build two dicts in your Python script: one with the headers and one with the POST-form values.
An example with the URL you posted above:

import requests

# HTTP headers (copied from the DevTools Network tab)
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7,de;q=0.6',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Length': '1017',  # copied from DevTools; requests normally computes this itself
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'ASP.NET_SessionId=okonm4wfhg5ddup5e0wkp0ur; BIGipServerfdic_Forward_prod_80=172495532.20480.0000; _ga=GA1.2.77529009.1621351450; _gid=GA1.2.1620156842.1621351450',
    'Host': 'cdr.ffiec.gov',
    'Origin': 'https://cdr.ffiec.gov',
    'Referer': 'https://cdr.ffiec.gov/Public/ViewFacsimileDirect.aspx?ds=call&idType=fdiccert&id=8426&date=03312021',
    'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36'
}

# Form data
post_form_data = {
    "__EVENTTARGET": "",
    "__EVENTARGUMENT": "",
    "__VIEWSTATE": "/wEPDwULLTE0NTY3MjMzNTQPFggeHVZpZXdQREZGYWNzaW1pbGVfU3VibWlzc2lvbklEApTmYR4UVmlld1BERkZhY3NpbWlsZU1vZGULKX1DZHIuUGRkLlVJLkNvbnRyb2xzLlVJSGVscGVyK1ZpZXdGYWNzaW1pbGVNb2RlLCBDZHIuUGRkLlVJLlByb2Nlc3NlcywgVmVyc2lvbj03LjEuMTMzLjAsIEN1bHR1cmU9bmV1dHJhbCwgUHVibGljS2V5VG9rZW49bnVsbAAeBkZJTmFtZQV4MVNUIFNVTU1JVCBCQU5LICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgHg5GRElDQ2VydE51bWJlcgUEODQyNhYCZg9kFgICAQ9kFgICBw9kFgYCAQ9kFgQCAQ8PFgYeBFRleHRkHghDc3NDbGFzcwUJZG9jaGVhZGVyHgRfIVNCAgJkZAIDDw8WBh8EZB8FBQZoZWFkZXIfBgICZGQCAw9kFgICAQ8UKwACZBQrAAUUKwAIaAUFUHJpbnRoaGRoZ2QUKwAIZwUNRG93bmxvYWQgWEJSTGdoZGhnZBQrAAhnBQxEb3dubG9hZCBQREZnaGRoZ2QUKwAIZwUMRG93bmxvYWQgU0RGZ2hkaGdkFCsACGcFEURvd25sb2FkIFRheG9ub215Z2hkaGhkZAIFDw8WAh4HVmlzaWJsZWhkZGTtXpFTz1TYX73fKLF2ros5Z2CvJ/pDUy88F6s57Qs97Q==",
    "__VIEWSTATEGENERATOR": "A250BEAE",
    "ctl00$MainContentHolder$viewTabStrip$Download_PDF_2": "Download PDF"
}

# URL to submit the form to
url = 'https://cdr.ffiec.gov/Public/ViewPDFFacsimile.aspx?ds=call&idType=fdiccert&id=8426&date=03312021'

# Making the request
resp = requests.post(url, headers=headers, data=post_form_data)

# Writing the file from the response content
with open('file_name.pdf', 'wb') as file:
    file.write(resp.content)
To find the document with a specific fileid and date: this information is given in the URL params: ... /ViewPDFFacsimile.aspx?ds=call&idType=fdiccert&id=8426&date=03312021
You can also find it on the Network tab (in Chrome it is called "Query String Parameters"). To pass it in the request, use the params argument of the requests.post() method.

url_params = {
    "ds": "call",
    "idType": "fdiccert",
    "id": "8426",
    "date": "03312021"
}
requests.post(url, headers=headers, data=post_form_data, params=url_params)
I know you asked about 'requests', but I think it's easy with Selenium. Try this if you want:

from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

file_id = input("id: ")
date = input("date: ")
url = f"https://cdr.ffiec.gov/Public/ViewFacsimileDirect.aspx?ds=call&idType=fdiccert&id={file_id}&date={date}"

browser = webdriver.Chrome()
browser.get(url)
el = browser.find_element(By.ID, "Download_PDF_2")  # find_element_by_id was removed in Selenium 4
el.click()
sleep(5)  # give the download time to finish
browser.quit()

You can change how you get the id and the date value, and the sleep time as well.
Be sure to make chromedriver available on PATH or keep it in the same directory as the script.
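Since the original task also asks to save the file to a directory, Chrome's download location can be configured before the driver starts. A short sketch, assuming Chrome; the "downloads" path is a placeholder:

import os
from selenium import webdriver

download_dir = os.path.abspath("downloads")  # placeholder target directory
os.makedirs(download_dir, exist_ok=True)

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": download_dir,
    "download.prompt_for_download": False,
})
browser = webdriver.Chrome(options=options)
# ...then navigate and click "Download PDF" as above; the file lands in download_dir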

How to login website with javascript in Python?

I want to log in to the website https://creis.fang.com/.
My code is:
import requests

url = 'https://creis.fang.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'}
# '企业版' means "Enterprise Edition" (the product selected on the login form)
data = {'cnname': 'login_id', 'cnpassword': 'pass1', 'cntempcode': 'pass2', 'cnproductselect': '企业版'}
s = requests.Session()
res = s.post(url=url, data=data, headers=headers, allow_redirects=False)
However, it failed.
What should I do?
Thanks
In the headers, you need to pass the type of object you are sending.
Try:
headers = {'Content-Type': 'application/json'}
By the way, you should check which endpoint the server exposes for login; https://creis.fang.com/ is the domain, but endpoints are unique and you need to know which one it is.
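Once the real endpoint is visible in the DevTools Network tab, the post could look like the sketch below. The /login path is purely a placeholder; also note that passing json= makes requests set the Content-Type header for you:

import requests

s = requests.Session()
login_url = 'https://creis.fang.com/login'  # hypothetical endpoint; copy the real one from DevTools
payload = {'cnname': 'login_id', 'cnpassword': 'pass1', 'cntempcode': 'pass2', 'cnproductselect': '企业版'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'}
# json= serializes the payload and sets Content-Type: application/json automatically
res = s.post(login_url, json=payload, headers=headers)
print(res.status_code, res.text[:200])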
Maybe you need to simulate the browser with Selenium:
https://selenium-python.readthedocs.io/

Scrapy: POST request returning JSON response (200 OK) but with incomplete data

My spider is trying to replicate the load-more click, which dynamically loads more items on the web page. This continues until nothing more is left to load.
yield FormRequest(
    url,
    headers=header,
    formdata={'entity_id': '70431', 'profile_action': 'review-top', 'page': str(p), 'limit': '5'},
    callback=self.parse_review,
)

header = {
    #'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0',
    'X-Requested-With': 'XMLHttpRequest',
    'Host': 'www.zomato.com',
    'Accept': '*/*',
    'Referer': 'https://www.zomato.com',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'dont_filter': 'True'
}
url = 'https://www.zomato.com/php/social_load_more.php'
The response received is a JSON response.
jsonresponse = json.loads(response.text)
And I do see:
('data==', {u'status': u'success', u'left_count': 0, u'html': u"<script type='text/javascript'>if (typeof initiateLaziness == 'function') initiateLaziness() </script>", u'page': u'1', u'more': 0})
You see, I get a response for status, left_count, page, and more.
However, I am interested in 'html'. Unfortunately, I receive an incorrect value compared with what I get through the browser (I inspected the network calls and verified this).
The expected 'html' is:
<div><a> very long html stuff...............................................<div><script type='text/javascript'>if (typeof initiateLaziness == 'function') initiateLaziness() </script>
I am receiving only the latter part:
<script>...................................</script>
The real HTML content is missing.
The thing to note is that I do receive a response, but an incomplete one for 'html' only; everything else is fine. I believe it might be related to dynamically generated HTML, but I am not getting any clue about it.
No Content-Length is added by the Scrapy middleware, and it won't let me add one either: the response fails with 400 when I add it to the header.
The request header actually being sent to the server:
{'Accept-Language': ['en'], 'Accept-Encoding': ['gzip, deflate,br'], 'Dont_Filter': ['True'], 'Connection': ['keep-alive'], 'Accept': ['*/*'], 'User-Agent': ['Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0'], 'Host': ['www.zomato.com'], 'X-Requested-With': ['XMLHttpRequest'], 'Cookie': ['zl=en; fbtrack=9be27330646d24088c56c2531ea2fbf5; fbcity=7; PHPSESSID=2338004ce3fd540477242c3eaee685168163bd05'], 'Referer': ['https://www.zomato.com'], 'Content-Type': ['application/x-www-form-urlencoded; charset=UTF-8']})
Can anyone please tell me if I am missing anything here?
Or is there some way I can send the Content-Length, or make the middleware send it for me?
Many thanks.
You won't get the HTML content in the response because you are not using cookies. In the actual request header that you mentioned, there is a cookie attribute, but in the AJAX request you send through your code there is no cookie field.
First, a cookie is set in the response to the request made from Zomato's restaurant page at https://www.zomato.com/city/restaurant/reviews. Then, when the load-more button is clicked, a request is sent to 'https://www.zomato.com/php/social_load_more.php' with a cookie field containing the cookie set by the server in the previous response. So every time an AJAX request is made, the cookie set in the previous response should be sent in the request header, and a new cookie is set in the response to the present request.
So, in order to manage these cookies, I used the session object of the requests package. The script can be written without Scrapy as well. Since you wrote your code in Scrapy, see whether Scrapy offers an equivalent way to manage the cookies.
My code:

import requests

url = 'https://www.zomato.com/city/restaurant/reviews'
s = requests.Session()
resp = s.get(url, headers=header)
The above code sends a request to the URL of the restaurant reviews page. This is essential because the first cookie is set in the response to this request.
params = {
    'entity_id': res_id,
    'profile_action': 'reviews-dd',
    'page': '1',
    'limit': '5'
}
header = {
    'origin': 'https://www.zomato.com',
    'Referer': 'https://www.zomato.com/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
    'x-requested-with': 'XMLHttpRequest',
    'Accept-Encoding': 'gzip, deflate, br'
}
loadreviews_text = s.post("https://www.zomato.com/php/social_load_more.php", data=params, headers=header)
loadreviews = loadreviews_text.json()

Now a request is made to social_load_more.php. The session object 's' manages the cookies, and the variable loadreviews now holds the HTML data in JSON format.
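For the Scrapy side of the question: Scrapy's built-in cookies middleware already persists cookies between requests in a spider, so the equivalent fix is to request the reviews page first and only then hit the AJAX URL. A minimal sketch, assuming the same Zomato URLs as above (the restaurant path and entity_id are placeholders):

import scrapy
from scrapy import FormRequest

class ReviewsSpider(scrapy.Spider):
    name = "reviews"
    # Placeholder restaurant page; the first cookie is set by this response
    start_urls = ["https://www.zomato.com/city/restaurant/reviews"]

    def parse(self, response):
        # The cookies middleware carries the session cookie forward automatically
        yield FormRequest(
            "https://www.zomato.com/php/social_load_more.php",
            formdata={'entity_id': '70431', 'profile_action': 'review-top',
                      'page': '1', 'limit': '5'},
            callback=self.parse_review,
        )

    def parse_review(self, response):
        data = response.json()  # available in Scrapy >= 2.2
        self.logger.info(data.get('html', '')[:200])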

Unable to obtain desired response from .aspx login page using Python and the Requests module

I have been trying to log in to an .aspx site (https://web.iress.com.au/html/LogonForm.aspx - for source / initial cookie reference) which uses a JavaScript function __doPostBack(eventTarget, eventArgument) to submit the form (my JavaScript knowledge is very limited, so this is a best guess).
My current understanding of HTTP requests is that, in the context of forms, they are mainly POST requests. I used Chrome to sniff out the request headers and form data used when my credentials weren't typed in (for security's sake), and they are as follows:
Remote Address:##BLANKEDOUT##
Request URL:https://web.iress.com.au/html/logon.aspx
Request Method:POST
Status Code:302 Found
**Request Headers**
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate
Accept-Language:en-US,en;q=0.8
Cache-Control:no-cache
Connection:keep-alive
Content-Length:585
Content-Type:application/x-www-form-urlencoded
Cookie:ASP.NET_SessionId=##SESSION ID STRING##
Host:web.iress.com.au
Origin:https://web.iress.com.au
Pragma:no-cache
Referer:https://web.iress.com.au/html/LogonForm.aspx
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/##ADDRESS## Safari/537.36
**Form Data**
__EVENTTARGET:
__EVENTARGUMENT:
__VIEWSTATE: ##VIEWSTATE STRING##
__VIEWSTATEGENERATOR:##VIEWSTATEGENERATOR KEY##
__PREVIOUSPAGE: ##PREVIOSUPAGE STRING##
__EVENTVALIDATION: ##STRING##
fu:LogonForm.aspx
su:Default.aspx
un: # Would be my username if i had typed it in
pw: # Would be password
ImageButton1.x:45 # These two values change depending on where i click the submit button
ImageButton1.y:13
and this is the code I'm using to attempt a login:
from requests import session

payload = {
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': '##STRING FOUND FROM CHROME SNIFF##',
    '__VIEWSTATEGENERATOR': '##STRING FOUND FROM CHROME SNIFF##',
    '__PREVIOUSPAGE': '##STRING FOUND FROM CHROME SNIFF##',
    '__EVENTVALIDATION': '##STRING FOUND FROM CHROME SNIFF##',
    'fu': 'LogonForm.aspx',
    'su': 'Default.aspx',
    'un': 'myuser#company',
    'pw': 'mypassword',
    'ImageButton1.x': '0',
    'ImageButton1.y': '0'
}
requestheaders = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip,deflate',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': 'web.iress.com.au',
    'Origin': 'https://web.iress.com.au',
    'Cookie': '',
    'Pragma': 'no-cache',
    'Referer': 'https://web.iress.com.au/html/LogonForm.aspx',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/##ADDRESS AS ABOVE## Safari/537.36'
}

with session() as sesh:
    LOGINURL = 'https://web.iress.com.au/html/LogonForm.aspx'
    sesh.get(LOGINURL)  # GET request to obtain the session ID cookie
    sessionID = sesh.cookies['ASP.NET_SessionId']  # grab the session ID value
    sessionIDname = 'ASP.NET_SessionId='
    sessionIDheader = str(sessionIDname + sessionID)  # prepare the session ID header
    requestheaders['Cookie'] = sessionIDheader  # add the session ID header to the requestheaders dictionary
    response = sesh.post('https://web.iress.com.au/html/LogonForm.aspx', data=payload, headers=requestheaders)
    print(response.headers)
    print(response.content)
All I seem to get as a response is the source of the login page (https://web.iress.com.au/html/LogonForm.aspx) and its headers. I am not sure if it has anything to do with the __ variables either, but they don't seem to change, __PREVIOUSPAGE being the exception. Would I possibly have to extract these __ variables to use them in my request headers?
You are posting to the wrong URL; your own data shows the form posts to https://web.iress.com.au/html/logon.aspx, but you are posting to /LogonForm.aspx instead.
Note that the session object will take care of the cookie for you; do not set the Cookie header yourself. You should also avoid setting the Host, Origin and Content-Type headers, and the Cache-Control, Accept* and Pragma headers are not going to have any influence on how this works.
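To answer the last part of the question: yes, the __VIEWSTATE-style fields should be extracted fresh from the login page on every run and sent as form data (not as headers), since values like __PREVIOUSPAGE change per session. A minimal sketch, assuming BeautifulSoup is installed (pip install beautifulsoup4); the visible field values come from the form dump above:

import requests
from bs4 import BeautifulSoup

LOGIN_PAGE = 'https://web.iress.com.au/html/LogonForm.aspx'
POST_URL = 'https://web.iress.com.au/html/logon.aspx'  # the URL the form actually posts to

with requests.Session() as sesh:
    # The session keeps the ASP.NET_SessionId cookie for us
    soup = BeautifulSoup(sesh.get(LOGIN_PAGE).text, 'html.parser')

    # Collect every hidden input (__VIEWSTATE, __EVENTVALIDATION, ...) fresh each run
    payload = {tag['name']: tag.get('value', '')
               for tag in soup.find_all('input', type='hidden') if tag.get('name')}
    payload.update({
        'fu': 'LogonForm.aspx',
        'su': 'Default.aspx',
        'un': 'myuser#company',
        'pw': 'mypassword',
        'ImageButton1.x': '0',
        'ImageButton1.y': '0',
    })
    response = sesh.post(POST_URL, data=payload)
    print(response.status_code)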
