Web scraping click to download - javascript

I would like to use Python to automate the following task:
given fileid 8426 and date 03312021
go to the website:
https://cdr.ffiec.gov/Public/ViewFacsimileDirect.aspx?ds=call&idType=fdiccert&id=8426&date=03312021
click "Download PDF"
Save file to directory
I did some research and found the Python module Requests: https://docs.python-requests.org/en/master/user/quickstart/
It looks like I should be able to declare a data object and pass it in order to send the request:
r = requests.post('https://my_url', data={'key': 'value'})
with open("test.pdf", "wb") as f:
    f.write(r.content)
However, I have trouble finding the proper attributes to put inside the data object in my case. I have tried a few but was unable to fetch the desired PDF file. Any help would be greatly appreciated!

So, in the case of the requests.post() method, the data argument is a dictionary that represents the key-value pairs of the HTML POST form. To find them, open DevTools in your browser (Shift+Ctrl+I in Chrome and Firefox), open the Network tab and submit the form you need to inspect - in your case the form is represented by a single <input type="submit" ... > element (styled as the "Download PDF" button). After you click this input, the browser makes a well-formed POST request to the server and you can see the request headers and the form key-values of that request on the Network tab - just grab them and build two dicts in your Python script: one with the headers and one with the post-form values.
An example with the URL you posted above:
# http headers
headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7,de;q=0.6',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Content-Length': '1017',
'Content-Type': 'application/x-www-form-urlencoded',
'Cookie': 'ASP.NET_SessionId=okonm4wfhg5ddup5e0wkp0ur; BIGipServerfdic_Forward_prod_80=172495532.20480.0000; _ga=GA1.2.77529009.1621351450; _gid=GA1.2.1620156842.1621351450',
'Host': 'cdr.ffiec.gov',
'Origin': 'https://cdr.ffiec.gov',
'Referer': 'https://cdr.ffiec.gov/Public/ViewFacsimileDirect.aspx?ds=call&idType=fdiccert&id=8426&date=03312021',
'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36' }
# Form data
post_form_data = { "__EVENTTARGET": "",
"__EVENTARGUMENT": "",
"__VIEWSTATE": "/wEPDwULLTE0NTY3MjMzNTQPFggeHVZpZXdQREZGYWNzaW1pbGVfU3VibWlzc2lvbklEApTmYR4UVmlld1BERkZhY3NpbWlsZU1vZGULKX1DZHIuUGRkLlVJLkNvbnRyb2xzLlVJSGVscGVyK1ZpZXdGYWNzaW1pbGVNb2RlLCBDZHIuUGRkLlVJLlByb2Nlc3NlcywgVmVyc2lvbj03LjEuMTMzLjAsIEN1bHR1cmU9bmV1dHJhbCwgUHVibGljS2V5VG9rZW49bnVsbAAeBkZJTmFtZQV4MVNUIFNVTU1JVCBCQU5LICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgHg5GRElDQ2VydE51bWJlcgUEODQyNhYCZg9kFgICAQ9kFgICBw9kFgYCAQ9kFgQCAQ8PFgYeBFRleHRkHghDc3NDbGFzcwUJZG9jaGVhZGVyHgRfIVNCAgJkZAIDDw8WBh8EZB8FBQZoZWFkZXIfBgICZGQCAw9kFgICAQ8UKwACZBQrAAUUKwAIaAUFUHJpbnRoaGRoZ2QUKwAIZwUNRG93bmxvYWQgWEJSTGdoZGhnZBQrAAhnBQxEb3dubG9hZCBQREZnaGRoZ2QUKwAIZwUMRG93bmxvYWQgU0RGZ2hkaGdkFCsACGcFEURvd25sb2FkIFRheG9ub215Z2hkaGhkZAIFDw8WAh4HVmlzaWJsZWhkZGTtXpFTz1TYX73fKLF2ros5Z2CvJ/pDUy88F6s57Qs97Q==",
"__VIEWSTATEGENERATOR": "A250BEAE",
"ctl00$MainContentHolder$viewTabStrip$Download_PDF_2": "Download PDF" }
# url to submit the form
url = 'https://cdr.ffiec.gov/Public/ViewPDFFacsimile.aspx?ds=call&idType=fdiccert&id=8426&date=03312021'
# making request
resp = requests.post(url, headers=headers, data=post_form_data)
# writing the file from response content
with open('file_name.pdf', 'wb') as file:
    file.write(resp.content)
To find the document with a specific fileid and date:
this information is given in the URL params: .../ViewPDFFacsimile.aspx?ds=call&idType=fdiccert&id=8426&date=03312021
You can also find it on the Network tab (in Chrome it is called "Query String Parameters"). To pass it in the request, use the params argument of the requests.post() method.
url_params = {
"ds": "call",
"idType": "fdiccert",
"id": "8426",
"date": "03312021" }
requests.post(url, headers=headers, data=post_form_data, params=url_params)
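One caveat: the __VIEWSTATE, __VIEWSTATEGENERATOR and session Cookie captured above belong to a single browser session, so hard-coding them will eventually stop working. A rough sketch (my own suggestion, not verified against the live site) of fetching them fresh on each run by parsing the hidden ASP.NET fields with BeautifulSoup:
import requests
from bs4 import BeautifulSoup

params = {"ds": "call", "idType": "fdiccert", "id": "8426", "date": "03312021"}
page_url = "https://cdr.ffiec.gov/Public/ViewFacsimileDirect.aspx"

with requests.Session() as s:
    s.headers["User-Agent"] = "Mozilla/5.0"
    # GET the page first: this sets the session cookie and returns the
    # hidden ASP.NET form fields (__VIEWSTATE, __VIEWSTATEGENERATOR, ...)
    page = s.get(page_url, params=params)
    soup = BeautifulSoup(page.text, "html.parser")
    form = soup.find("form")
    form_data = {tag["name"]: tag.get("value", "")
                 for tag in form.select("input[type=hidden]") if tag.get("name")}
    # name of the submit input as seen in DevTools; adjust if the page differs
    form_data["ctl00$MainContentHolder$viewTabStrip$Download_PDF_2"] = "Download PDF"
    # post back to the form's own action URL
    action = requests.compat.urljoin(page.url, form.get("action", ""))
    pdf = s.post(action, data=form_data)
    with open("file_name.pdf", "wb") as f:
        f.write(pdf.content)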

I know you asked about 'requests', but I think it's easy with Selenium. Try this if you want:
from selenium import webdriver
from time import sleep
id = input("id: ")
date = input("date: ")
url = f"https://cdr.ffiec.gov/Public/ViewFacsimileDirect.aspx?ds=call&idType=fdiccert&id={id}&date={date}"
browser = webdriver.Chrome()
browser.get(url)
el = browser.find_element_by_id("Download_PDF_2")
el.click()
sleep(5)
browser.quit()
You can change how you get the id and date values, and the sleep time as well.
Be sure to make chromedriver available in your PATH or keep it in the same directory as the script.
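If you also need the PDF to land in a specific directory rather than the browser's default download folder, you can pass Chrome preferences to the driver. A sketch (the pref keys are standard Chrome download preferences; the directory path is a placeholder):
from selenium import webdriver

download_dir = "/path/to/save"  # placeholder: directory where the PDF should be saved
options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": download_dir,
    "download.prompt_for_download": False,
    # download PDFs instead of opening them in Chrome's built-in viewer
    "plugins.always_open_pdf_externally": True,
})
browser = webdriver.Chrome(options=options)
The rest of the script stays the same; only the webdriver.Chrome() call changes.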

Related

Instagram API only working in the browser

I'm trying to get the Instagram userId of a user from their username. For that, I use the following endpoint of Instagram's API: https://www.instagram.com/<username>/?__a=1.
Accessing that endpoint in the browser yields some JSON (application/json) with the info I need.
ex. https://www.instagram.com/deletethistheo/?__a=1
Now if I go to the Network tab and copy that request as fetch():
const res = await fetch("https://www.instagram.com/deletethistheo/?__a=1", {
"headers": {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-language": "en-US,en;q=0.9,fr-FR;q=0.8,fr;q=0.7",
"cache-control": "max-age=0",
"sec-ch-ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"90\", \"Google Chrome\";v=\"90\"",
"sec-ch-ua-mobile": "?0",
"sec-fetch-dest": "document",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "cross-site",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1"
},
"referrerPolicy": "strict-origin-when-cross-origin",
"body": null,
"method": "GET",
"mode": "cors",
"credentials": "omit"
});
console.log(await res.json());
and run that in the Chrome console or in a Node.js program, I get an HTML response of a blank page.
I also tried setting the user-agent header to mine: user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36 but got the same result.
EDIT: Something weird too: sometimes when I copy as fetch and run it in the Chrome console, I get the expected JSON result. And when I re-run that same code a few minutes later, I get an error.
I don't understand what could be causing that difference in behaviour given that the request made is the same. Any ideas?
Cheers!
From their API docs it sounds like you'll need to include an access token with any requests:
https://developers.facebook.com/docs/instagram-basic-display-api/guides/getting-profiles-and-media
Specifically this section of the docs:
Step 2: Query the User node
Send a request to the following endpoint:
GET /me?fields={fields}&access_token={access-token}
Replace {fields} with a comma-separated list of User fields you want returned and {access-token} with the user’s access token. The GET /me endpoint will determine the user’s ID from the token and redirect the request to the User node.
Sample Request
curl -X GET \
'https://graph.instagram.com/me?fields=id,username&access_token=IGQVJ...'
Sample Response
{
"id": "17841405793187218",
"username": "jayposiris"
}
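For completeness, the same call from Python with requests (a sketch; the IGQVJ... token is the placeholder from the docs and has to be replaced with a real user access token):
import requests

access_token = "IGQVJ..."  # placeholder: a real Instagram user access token
resp = requests.get(
    "https://graph.instagram.com/me",
    params={"fields": "id,username", "access_token": access_token},
)
print(resp.json())  # expected shape: {"id": "...", "username": "..."}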

Webscraping behind Log-In with X-Auth and Bearer token

I am creating a little script which would save countless hours for me and my colleagues. The thing is I need to get data about my clients from a web page based on their number (CLIENT_NO). The whole page is of course behind a login page, but I manually sign in in the browser and copy the Bearer and X-Auth tokens, which should be enough to authorize these requests, right?
Then I use the URL "https://moje.csobstavebni-oz.cz/group/nel/vysledky-vyhledavani?searchText=CLIENT_NO", which mimics a search request from the search bar. This gets me to the desired page. I am looking for data such as "birthNumberIco" and others, as highlighted in the screenshot.
A little problem I see is that the Request URL is of course different from the one mentioned above. But I cannot use the Request URL, because it contains CLIENT_ID, not CLIENT_NO, and I don't know that value.
Unfortunately, I can't get anything from it; Python always returns a blank list []. I suspect it is because of all the authorization keys and tokens (as you can see in my headers, they are of course not written out completely for obvious reasons).
I tried several options I found on YouTube, but as of right now I am completely desperate and don't know what else to do. Maybe there is just some small mistake I made that, once corrected, will fix the whole thing.
Thank you so much in advance!
import scrapy
import json
class KlientUdaje(scrapy.Spider):
    name = 'klient_udaje'
    start_urls = ['https://moje.csobstavebni-oz.cz/group/nel']
    headers = {
"Accept": "*/*",
"Accept-Encoding": " gzip, deflate, br",
"Accept-Language": " en-US,en;q=0.9,cs;q=0.8",
"Authorization": " Bearer d2ba2XXXXXX",
"Cache-Control": " no-cache",
"Connection": "keep-alive",
"Host": " moje.csobstavebni.cz",
"Origin": " https://moje.csobstavebni-oz.cz",
"Pragma": " no-cache",
"Referer": " https://moje.csobstavebni-oz.cz/",
"RequestId": " cklydjuq000073q679q5kd2tb",
"Sec-Fetch-Dest": " empty",
"Sec-Fetch-Mode": " cors",
"Sec-Fetch-Site": " cross-site",
"SystemId": ": 47",
"User-Agent": "Mozila/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.72 Safari/537.36 Edg/89.0.774.45",
"X-Auth-Token": "eyAidHlwIjogIkpXVCIsICJraWQiOiAiT2pDY3ErdklKTXXXXX"
}
    def parse(self, response):
        url = 'https://moje.csobstavebni-oz.cz/group/nel/vysledky-vyhledavani?searchText=CLIENT_NO'
        yield scrapy.Request(url,
                             callback=self.parse_api,
                             headers=self.headers)

    def parse_api(self, response):
        raw_data = response.body
        data = json.loads(raw_data)
        rodne_cislo = data['birthNumberIco']
        print(rodne_cislo)

Press Button with POST request and scraping data from popup in Python

I would like to press the "Suche starten" button and scrape the results for a research project from this page (basically it can be pressed without filling in any forms - a popup then opens that holds the data I want).
https://www.insolvenzbekanntmachungen.de/cgi-bin/bl_suche.pl
Basically it is the German public announcement site for companies that go bankrupt.
I have already spent some considerable time trying to get it going, but somehow I can't get it to work.
I know I could also try the Selenium headless browser, but first of all I'd prefer the cleaner requests solution, and second I'd love to be able to run the script continuously from a server with little effort and without a screen.
So what I have done so far is to check the POST request my browser sends using the Firefox Dev Tools and to try to emulate it. The problem is that I can only get the standard data from the initial window, but not from the popup window which holds all the data I want.
So I imported the requests library and created a custom request with headers and a payload.
headers={
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0',
"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-GB,en;q=0.5",
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"Content-Length": "413",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "www.insolvenzbekanntmachungen.de",
"Pragma": "no-cache",
"Referer": "https://www.insolvenzbekanntmachungen.de/cgi-bin/bl_suche.pl",
"Upgrade-Insecure-Requests": "1"
}
payload={
'Suchfunktion': 'uneingeschr',
'Absenden': 'Suche+starten',
'Bundesland': '-Hamburg',
'Gericht': 'Hamburg',
'Datum1':'',
'Datum2':'',
'Name':'',
'Sitz':'',
'Abteilungsnr':'',
'Registerzeichen': '--',
'Lfdnr':'',
'Jahreszahl': '--',
'Registerart': '--+keine+Angabe+--',
'select_registergericht':'',
'Registergericht': '--+keine+Angabe+--',
'Registernummer':'',
'Gegenstand': '--+Alle+Bekanntmachungen+innerhalb+des+Verfahrens+--',
'matchesperpage': '10',
'page': '1',
'sortedby': 'Datum',
'submit': 'return validate_globe(this)',
}
And then I make the following request:
r = requests.post('https://www.insolvenzbekanntmachungen.de/cgi-bin/bl_suche.pl',headers=headers,data=payload)
Unfortunately print(r.text) will not give me the data from the popup that would appear in a browser.
Any help would be very greatly appreciated!
Jasper
A quick and easy fix would be something like below. Give it a go:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.insolvenzbekanntmachungen.de/cgi-bin/bl_suche.pl'
payload = 'Suchfunktion=uneingeschr&Absenden=Suche+starten&Bundesland=--+Alle+Bundesl%E4nder+--&Gericht=--+Alle+Insolvenzgerichte+--&Datum1=&Datum2=&Name=&Sitz=&Abteilungsnr=&Registerzeichen=--&Lfdnr=&Jahreszahl=--&Registerart=--+keine+Angabe+--&select_registergericht=&Registergericht=--+keine+Angabe+--&Registernummer=&Gegenstand=--+Alle+Bekanntmachungen+innerhalb+des+Verfahrens+--&matchesperpage=10&page=1&sortedby=Datum'
with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    s.headers.update({'Content-Type': 'application/x-www-form-urlencoded'})
    res = s.post(URL, data=payload)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("b li a"):
        print(item.get_text(strip=True))
Output:
2018-07-05A & A Eco Clean Gebäudereinigung GmbH, München, 1503 IN 1836/16, Registergericht München, HRB 189121
2018-07-05A & A Eco Clean Gebäudereinigung GmbH, München, 1503 IN 1836/16, Registergericht München, HRB 189121
2018-07-05A + S Wohnungsbau Besitz GmbH & Co.KG, Kandel, 3 IN 96/12, Registergericht Landau in der Pfalz, HRA 21214
2018-07-05Abb Nicola, Untersöchering, IN 462/11
2018-07-05Abb Nicola, Untersöchering, IN 462/11
2018-07-05Abdul Basit Qureshi, Kirchheim, 13 IN 23/17
2018-07-05Abdul Basit Qureshi, Kirchheim, 13 IN 23/17
2018-07-05Abdul Basit Qureshi, Kirchheim, 13 IN 23/17
2018-07-05Abdulrahman, Oulat, Bottrop, 162 IN 76/12
2018-07-05Abdurachid Hassan, München, 1500 IK 2170/17
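A note on the design: the payload is passed as an already URL-encoded string (e.g. %E4 for the ä in Bundesländer, i.e. Latin-1), so requests sends it byte for byte. If you rebuild it as a dict, requests re-encodes the values itself (UTF-8 by default), which this older CGI endpoint may not interpret the same way, so keeping the raw string captured from the browser is likely the safer option here.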

Scrapy: POST request returning JSON response (200 OK) but with incomplete data

My spider is trying to replicate the load-more click, which dynamically loads more items on the web page. This continues until nothing more is left to load.
yield FormRequest(url,headers=header,formdata={'entity_id': '70431','profile_action': 'review-top','page':str(p), 'limit': '5'},callback=self.parse_review)
header = {#'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0',
'X-Requested-With': 'XMLHttpRequest',
'Host': 'www.zomato.com',
'Accept': '*/*',
'Referer': 'https://www.zomato.com',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'dont_filter':'True' }
url = 'https://www.zomato.com/php/social_load_more.php'
The response received is a JSON response.
jsonresponse = json.load(response)
And I do see -
('data==', {u'status': u'success', u'left_count': 0, u'html': u"<script type='text/javascript'>if (typeof initiateLaziness == 'function') initiateLaziness() </script>", u'page': u'1', u'more': 0})
You see, I get values for status, left_count, page, and more.
However, I am interested in 'html'. Unfortunately, the value is incorrect compared with what I receive when the request is made through the browser (I inspected the network calls and verified this).
The expected 'html' is:
<div><a> very long html stuff...............................................<div><script type='text/javascript'>if (typeof initiateLaziness == 'function') initiateLaziness() </script>
I am receiving only the latter part
<script>...................................</script>.
The real HTML content is missing.
The thing to note is that I do receive a response, but an incomplete one for 'html' only; all good for the rest. I believe it might be something related to dynamically generated HTML, but I am not getting any clue about it.
No Content-Length is added by the Scrapy middleware, and it does not allow me to add one either: the response fails with 400 when I add it to the header.
Request headers actually being sent to the server:
{'Accept-Language': ['en'], 'Accept-Encoding': ['gzip, deflate,br'], 'Dont_Filter': ['True'], 'Connection': ['keep-alive'], 'Accept': ['*/*'], 'User-Agent': ['Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0'], 'Host': ['www.zomato.com'], 'X-Requested-With': ['XMLHttpRequest'], 'Cookie': ['zl=en; fbtrack=9be27330646d24088c56c2531ea2fbf5; fbcity=7; PHPSESSID=2338004ce3fd540477242c3eaee685168163bd05'], 'Referer': ['https://www.zomato.com'], 'Content-Type': ['application/x-www-form-urlencoded; charset=UTF-8']})
Can anyone please help me if I am missing anything here?
Or is there some way I can send the Content-Length, or make the middleware send it for me?
Many thanks.
You won't get the HTML content in the response because you are not using cookies. In the actual request header that you mentioned, there is a Cookie attribute, but in the AJAX request you are sending through your code, there is no cookie field.
First, a cookie is set in the response to the request made from Zomato's restaurant page with the URL https://www.zomato.com/city/restaurant/reviews. Now, when the load-more button is clicked, a request is sent to 'https://www.zomato.com/php/social_load_more.php' with the cookie field containing the cookie set by the server in the previous response. So, every time an AJAX request is made, the cookie that was set in the previous response should be sent in the request header, and a new cookie will be set in the response to the present request.
So, in order to manage these cookies, I used the Session object of the requests package. The script can be written without using Scrapy as well. As you wrote your code in Scrapy, see if there is a similar way to manage cookies there (a sketch of that follows at the end of this answer).
My code:
import requests
url = 'https://www.zomato.com/city/restaurant/reviews'
s = requests.Session()
resp = s.get(url, headers=header)
The above code is to send requests to the url of the restaurant reviews. This is essential because the first cookie is set in the response to this request.
params={
'entity_id':res_id,
'profile_action':'reviews-dd',
'page':'1',
'limit':'5'
}
header = {"origin":"https://www.zomato.com","Referer":"https://www.zomato.com/","user-agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0", "x-requested-with":"XMLHttpRequest", 'Accept-Encoding': 'gzip, deflate, br'}
loadreviews_text = s.post("https://www.zomato.com/php/social_load_more.php", data=params, headers=header)
loadreviews = loadreviews_text.json()
Now a request is made to social_load_more.php. The Session object 's' manages the cookies. The variable loadreviews will now have the HTML data in JSON format.
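For the Scrapy side mentioned above: Scrapy's built-in cookies middleware already keeps cookies per spider, so it should be enough to request the restaurant page first and only issue the AJAX FormRequest from its callback. A rough sketch (the restaurant URL and entity_id are placeholders taken from the question, not values I have verified):
import json
import scrapy

class ReviewsSpider(scrapy.Spider):
    name = 'reviews'
    # placeholder restaurant page; the first cookie is set by this response
    start_urls = ['https://www.zomato.com/city/restaurant/reviews']

    def parse(self, response):
        # cookies from this response are stored by Scrapy's CookiesMiddleware
        # and sent automatically with the next request
        yield scrapy.FormRequest(
            'https://www.zomato.com/php/social_load_more.php',
            headers={'X-Requested-With': 'XMLHttpRequest'},
            formdata={'entity_id': '70431', 'profile_action': 'review-top',
                      'page': '1', 'limit': '5'},
            callback=self.parse_review,
        )

    def parse_review(self, response):
        data = json.loads(response.text)
        self.logger.info('html length: %d', len(data.get('html', '')))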

Javascript error: Python routine to log into a website

I have been trying to get into a website and fetch some data using Python, but I am facing an error when I run my script. Here I am just trying to log in to the website and print the entire page text. The script and error are below:
Script:
import requests
with requests.session() as s:
    proxy_url = "http://{0}:{1}#proxy.blah.blah.com:8099".format('user_id', 'Password')
    s.proxies = {'http': proxy_url, 'https': proxy_url}
    user_id_url = "https://example.ex.com/my.policy"
    Headers = {'Host': 'example.ex.com', 'Connection': 'keep-alive','Cache-Control': 'max-age=0', 'Accept-Language': 'en-US,en;q=0.8', 'Accept-Encoding': 'gzip, deflate, sdch','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Cookie': '_ga=GA1.2.1822238803.1429212674; LastMRH_Session=0a0d8c67; MRHSession=ded054e0afe1bb151c3d35cb0a0d8c67; TIN=273000', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'}
    data = {'username': 'website_uid', 'password': 'website_password'}
    r = s.post(user_id_url, data=data, headers=Headers)
    print r.status_code
    print r.text
Error:
<script language="javascript">
setDefaultLang();
</script><noscript>
<div id="noscript_warning_red">JavaScript is not enabled. Please enable JavaScript in your browser or contact your system administrator for assistance.</div>
<div id="noscript_warning_newsession">To open a new session, please click here.</div>
</noscript>
PS: I am able to print the HTML text of the page, but I am not able to log in correctly, hence the error output above.
JavaScript is enabled in my browser; I double-checked it even while posting this question.
Any help is really appreciated
