How to deal with the captcha when doing Web Scraping in Puppeteer?


I'm using Puppeteer for Web Scraping and I have just noticed that sometimes, the website I'm trying to scrape asks for a captcha due to the amount of visits I'm doing from my computer. The captcha form looks like this one:
So, I would need help about how to handle this. I have been thinking about sending the captcha form to the client-side since I use Express and EJS in order to send the values to my index website, but I don't know if Puppeteer can send something like that.
Any ideas?

This is a reCAPTCHA (version 2, check out demos here). It is shown to you because the owner of the page does not want the page to be crawled automatically.
Your options are the following:
Option 1: Stop crawling or try to use an official API
As the owner of the page does not want you to crawl that page, you could simply respect that decision and stop crawling. Maybe there is a documented API that you can use.
Option 2: Automate/Outsource the captcha solving
There is an entire industry which has people (often in developing countries) filling out captchas for other people's bots. I will not link to any particular site, but you can check out the other answer from Md. Abu Taher for more information on the topic or search for captcha solver.
Option 3: Solve the captcha yourself
For this, let me explain how reCAPTCHA works and what happens when you visit a page using it.
How reCAPTCHA (v2) works
Each page has an ID, which you can check by looking at the source code, example:
<div class="g-recaptcha form-field" data-sitekey="ID_OF_THE_WEBSITE_LONG_RANDOM_STRING"></div>
When the reCAPTCHA code is loaded it will add a response textarea to the form with no value. It will look like this:
<textarea id="g-recaptcha-response" name="g-recaptcha-response" class="g-recaptcha-response" style="... display: none;"></textarea>
After you solved the challenge, reCAPTCHA will add a very long string to this text field (which can then later be checked by the server/reCAPTCHA service in the backend) when the form is submitted.
How to solve the captcha yourself
By copying the value of the textarea field you can transfer the "solved challenge" from one browser to another (this is also what the solving services do for you). The full process looks like this:
1. Detect if the page uses reCAPTCHA (e.g. check for .g-recaptcha) in the "crawling" browser
2. Open a second browser in non-headless mode with the same URL
3. Solve the captcha yourself
4. Read the value from: document.querySelector('#g-recaptcha-response').value
5. Put that value into the first browser: document.querySelector('#g-recaptcha-response').value = '...'
6. Submit the form
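The six steps above can be sketched with Puppeteer roughly as follows. This is a minimal sketch under assumptions, not a tested recipe: crawlerPage and helperPage are assumed to be Puppeteer Page objects already opened on the same URL (the helper one in a non-headless browser), and all function names are made up for illustration. Keep in mind that a response token is short-lived and can only be verified once, so it has to be used promptly.

```javascript
// Selectors come from the markup shown above; function names are hypothetical.
const WIDGET = '.g-recaptcha';
const RESPONSE = '#g-recaptcha-response';

// Step 1: detect whether the page embeds a reCAPTCHA widget.
async function hasRecaptcha(page) {
  return (await page.$(WIDGET)) !== null;
}

// Steps 3-4: wait until a human has solved the challenge, then read the token.
async function waitForToken(helperPage) {
  await helperPage.waitForFunction(
    (sel) => {
      const el = document.querySelector(sel);
      return Boolean(el && el.value);
    },
    { timeout: 0 }, // 0 disables the timeout; a human may need a while
    RESPONSE
  );
  return helperPage.evaluate((sel) => document.querySelector(sel).value, RESPONSE);
}

// Step 5: copy the token into the crawling browser.
async function injectToken(crawlerPage, token) {
  await crawlerPage.evaluate(
    (sel, value) => {
      document.querySelector(sel).value = value;
    },
    RESPONSE,
    token
  );
}

// Steps 1-5 combined; step 6 (submitting the form) is site-specific.
async function transferSolvedCaptcha(helperPage, crawlerPage) {
  if (!(await hasRecaptcha(crawlerPage))) return false;
  const token = await waitForToken(helperPage);
  await injectToken(crawlerPage, token);
  return true;
}
```

Whether the target site accepts a token produced in a different browser session is something you would have to verify for each site.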
Further information/reading
There is not much public information from Google how exactly reCAPTCHA works as this is a cat-and-mouse game between bot creators and Google detection algorithms, but there are some resources online with more information:
Official docs from Google: Obviously, they just explain the basics and not how it works "in the back"
InsideReCaptcha: This is a project from 2014 which tries to "reverse-engineer" reCAPTCHA. Although this is quite old, there is still a lot of useful information on the page.
Another question on Stack Overflow: This question contains some useful information about reCAPTCHA, but also many speculative (and very likely outdated) approaches on how to fool a reCAPTCHA.

You should use a combination of the following:
Use an API if the target website provides one. It's the most legitimate way.
Increase the wait time between scraping requests; do not send mass requests to the server.
Change/rotate your IP frequently.
Change the user agent, browser viewport size and fingerprint.
Use third-party solutions for the captcha.
Resolve the captcha yourself; check the answer by Thomas Dondorf. Basically, you need to wait for the captcha to appear in another browser and solve it from there. Third-party solutions do this for you.
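As an illustration of the "increase wait time" point, here is a minimal sketch of visiting URLs one at a time with a randomized pause in between. The helper names and the default numbers are assumptions for the sake of the example, and visit is a callback you would supply yourself (for instance one that calls page.goto).

```javascript
// Hypothetical helper names; the default numbers are arbitrary starting points.
function politeDelayMs(baseMs, jitterMs, random = Math.random) {
  // A fixed base plus a random extra so requests are not evenly spaced.
  return baseMs + Math.floor(random() * jitterMs);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visit URLs sequentially instead of firing all requests at once.
async function crawlSlowly(urls, visit, { baseMs = 3000, jitterMs = 2000 } = {}) {
  const results = [];
  for (let i = 0; i < urls.length; i += 1) {
    if (i > 0) await sleep(politeDelayMs(baseMs, jitterMs)); // pause between visits
    results.push(await visit(urls[i]));
  }
  return results;
}
```

A call could look like crawlSlowly(urls, (url) => page.goto(url)), with page being an existing Puppeteer page.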
Disclaimer: Do not use anti-captcha plugins/services to misuse resources. Resources are expensive.
Basically, the idea is to use an anti-captcha service such as 2captcha to deal with persistent reCAPTCHAs.
You can use this plugin called puppeteer-extra-plugin-recaptcha by berstend.
// puppeteer-extra is a drop-in replacement for puppeteer,
// it augments the installed puppeteer with plugin functionality
const puppeteer = require('puppeteer-extra')
// add recaptcha plugin and provide it your 2captcha token
// 2captcha is the builtin solution provider but others work as well.
const RecaptchaPlugin = require('puppeteer-extra-plugin-recaptcha')
puppeteer.use(
  RecaptchaPlugin({
    provider: { id: '2captcha', token: 'XXXXXXX' },
    visualFeedback: true // colorize reCAPTCHAs (violet = detected, green = solved)
  })
)
Afterwards you can run the browser as usual. It will pick up any captcha on the page and attempt to resolve it. You have to find the submit button which varies from site to site if it exists.
// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
  const page = await browser.newPage()
  await page.goto('https://www.google.com/recaptcha/api2/demo')
  // That's it, a single line of code to solve reCAPTCHAs 🎉
  await page.solveRecaptchas()
  await Promise.all([
    page.waitForNavigation(),
    page.click(`#recaptcha-demo-submit`)
  ])
  await page.screenshot({ path: 'response.png', fullPage: true })
  await browser.close()
})
PS:
There are other plugins; I even made a very simple one myself, because captchas are getting harder to solve even for a human like me. You can read the code here.
I am not affiliated in any way with 2Captcha or any other third-party service mentioned above.
I had created my own solution, similar to the other answer by Thomas Dondorf, but gave up soon since captchas are getting more ridiculous and I do not have the mental energy to solve them.

Proxy servers can be used so that the destination site does not detect a large number of requests coming from a single IP address.
(Translated with Google Translate)
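A minimal sketch of how this might look with Puppeteer: Chromium accepts a --proxy-server command-line switch, which can be passed through the args launch option. The proxy addresses and the helper name below are placeholders, not real endpoints.

```javascript
// Hypothetical proxy addresses; replace with proxies you are allowed to use.
const PROXIES = ['http://10.0.0.1:3128', 'http://10.0.0.2:3128'];

// Round-robin: pick the proxy for the n-th browser launch.
function launchOptionsFor(n, proxies = PROXIES) {
  const proxy = proxies[n % proxies.length];
  return { headless: true, args: [`--proxy-server=${proxy}`] };
}

// Usage (requires puppeteer to be installed):
//   const browser = await puppeteer.launch(launchOptionsFor(0))
```

Because the switch applies to the whole browser process, rotating the proxy in this sketch means launching a new browser.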

I tried @Thomas Dondorf's suggestion, but I think the problem with the steps described in the "How to solve the captcha yourself" section is that the CAPTCHA token is valid only once.
I'll try to explain everything in detail below.
WHAT I'M USING
As the first browser (the one that will not solve the captcha) I'm using Google Chrome, and as the second browser (the one where I solve the captcha and take the token) Firefox.
STEPS
I manually solve the captcha on this site https://recaptcha-demo.appspot.com/recaptcha-v2-checkbox.php
I type the following code document.querySelector('#g-recaptcha-response').value in the Google Chrome console, but I get an error (VM22:1 Uncaught TypeError: Cannot read property 'value' of null at <anonymous>:1:48), so I just search for the token by opening Elements in Google Chrome and searching for g-recaptcha-response with CTRL+F
I copy the token of the recaptcha (here is an image to show where the token is, after the text highlighted in green)
I type the following code document.querySelector('#g-recaptcha-response').value = '...' in the Firefox console, replacing the "..." with the reCAPTCHA token just copied
I get the following error. If you click on the linked documentation, you'll read that the error occurs because a token can be used only once, and it has of course already been used for the CAPTCHA you just solved to obtain the token itself. So it seems that the only purpose of the token is to indicate that the CAPTCHA has already been solved; it appears to be a defensive measure to prevent replay attacks, as stated in the official reCAPTCHA documentation.

Related

TikTok Login Kit web flow - keep getting Redirect URI error (code 10006)

I'm having an issue getting Login Kit to work. Similar to the question asked here I have the correct redirect domain listed in tiktok settings and the redirect_uri is basically just "domain/tiktok" but no matter what I do I get the same error message:
Below is my backend code - it's basically exactly the same as what is listed in the tiktok docs. Any help on this would be much appreciated!
const CLIENT_KEY = 'my_key'
const DOMAIN = 'dev.mydomain.com'
const csrfState = Math.random().toString(36).substring(2);
res.cookie('csrfState', csrfState, { maxAge: 60000 });
const redirect = encodeURIComponent(`https://${DOMAIN}/tiktok`)
let url = 'https://www.tiktok.com/auth/authorize/';
url += '?client_key=' + CLIENT_KEY;
url += '&scope=user.info.basic,video.list';
url += '&response_type=code';
url += '&redirect_uri=' + redirect;
url += '&state=' + csrfState;
res.redirect(url);
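Two details in the snippet above are easy to get wrong: Math.random() is not intended for security tokens, and hand-concatenated query strings are easy to mis-encode. Below is a minimal sketch of the same redirect built with Node's crypto module and the URL class. The endpoint and parameter names are copied from the snippet above; buildAuthorizeUrl and newCsrfState are names invented here, and this does not address the redirect URI error itself.

```javascript
const crypto = require('crypto');

// Random, unguessable value for the `state` parameter.
function newCsrfState() {
  return crypto.randomBytes(16).toString('hex');
}

// Assemble the authorize URL; searchParams handles the percent-encoding.
function buildAuthorizeUrl({ clientKey, redirectUri, scopes, state }) {
  const url = new URL('https://www.tiktok.com/auth/authorize/');
  url.searchParams.set('client_key', clientKey);
  url.searchParams.set('scope', scopes.join(','));
  url.searchParams.set('response_type', 'code');
  url.searchParams.set('redirect_uri', redirectUri);
  url.searchParams.set('state', state);
  return url.toString();
}
```

In the Express handler this would replace the manual string building: store the result of newCsrfState() in the cookie, then call res.redirect(buildAuthorizeUrl({ ... })).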
UPDATE 8/13/2022
I submitted the app for review and was approved so the status is now "Live in production" instead of "staging". The issue is still there - still showing error message no matter what domain / callback URL I use
UPDATE 8/16/2022
OK so I've made some progress on this.
First off - I was able to get the authentication/login screen to finally show up. I realized to do this you need to:
Make sure that the status of your app is "Live in production" and not "Staging". Even though when you create a new app you may see client_key and client_secret show up don't let that fool you - Login Kit WILL NOT WORK unless your app is submitted and approved
The redirect_uri you include in your server flow must match EXACTLY to whatever value you entered in "Registered domains" in the Settings page. So if you entered "dev.mydomain.com" in Settings then redirect_uri can only be "dev.mydomain.com" not "dev.mydomain.com/tiktok".
I think I might know what the issue is. My guess is that, previously, the Settings page required the FULL redirect URL (not just the domain), and the redirect URI included in the authorization query was checked against that value as saved in TikTok's database. At some point recently, the front end was changed so that you can only enter a domain (e.g., mydomain.com) on the Settings page, without any protocol. However, TikTok's backend logic was never updated, so during the login flow it still checks for an EXACT match against whatever was saved in the DB as the redirect URI. This would explain why an app that registered a redirect URI that DOES include a protocol before the change (e.g., Later.com, whose redirect URI is https://app.later.com/users/auth/tiktok/callback) continues to work, and why any app that now attempts to use a redirect URI WITH a protocol gets the error screen. My gut feeling is that the error is not on my part and that this is actually a bug in TikTok's API. My guess is it can be addressed either by changing the front end of the Settings page to allow paths/protocols (I think this is the ideal approach) or by changing the backend so that a redirect URI only has to include one of the listed redirect domains.
I've been emailing with the TikTok team - their email is tiktokplatform@tiktok.com - and proposed the two solutions I mentioned above. I suggest if you're having the same issue you email them as well and maybe even link this StackOverflow question so that maybe it will get higher priority if enough people message them about it.
If you're looking for a short-term hack, I'd recommend creating a dedicated app on AWS or Heroku with a clean domain (e.g., https://mydomain-tiktok.herokuapp.com) and then redirecting to either your dev or production environment by adding a prefix to the "state" query parameter (e.g., "dev_[STATE_ID]"). I'll just reiterate that I consider this a very "hacky" approach to handling callbacks and would definitely not want to use something like this in production.
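To make the state-prefix idea concrete, here is a minimal sketch of the routing logic the relay app would need. The TARGETS table, the production URL and both function names are assumptions for illustration only; the "dev_" prefix format follows the example above.

```javascript
// Hypothetical environment table for the relay app described above.
const TARGETS = {
  dev: 'https://dev.mydomain.com/tiktok',
  prod: 'https://mydomain.com/tiktok',
};

// Split "dev_abc123" into its environment prefix and the original state ID.
function parseRelayState(state) {
  const i = state.indexOf('_');
  if (i <= 0) return null;
  const env = state.slice(0, i);
  if (!Object.prototype.hasOwnProperty.call(TARGETS, env)) return null;
  return { env, stateId: state.slice(i + 1) };
}

// Build the URL the relay app forwards the browser to, or null if unknown.
function relayUrl(state, code) {
  const parsed = parseRelayState(state);
  if (!parsed) return null;
  const url = new URL(TARGETS[parsed.env]);
  url.searchParams.set('code', code);
  url.searchParams.set('state', parsed.stateId);
  return url.toString();
}
```

Only prefixes listed in the table are forwarded, so an unexpected state value cannot steer the redirect to an arbitrary host.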

In my case, the integration worked after doing the following steps:
In the TikTok developers page:
1. Like @eugene-blinn said: make sure your app is in Live in production status (I couldn't find anything in the documentation about why Staging apps don't work);
2. Add the Login Kit product to your app and set the Redirect domain field to your host domain, for example: mywebsite.com.
In your code:
3. From my tests, I could add whatever URL path I wanted; the only constraint was that the domain should match the one from step 2. So, yes, you can use https://mywebsite.com/whatever/path/you/want in the redirect_uri parameter.
That's it. It should work with these 3 steps.
Additionally, I ran into another issue related to using specific features in the scope property (like uploading or reading videos, etc.), so here is the solution for that as well:
Adding only the Video Kit product to the TikTok app and setting video.upload or video.list in the scope of the authorize request won't work unless you also add the TikTok API product to your TikTok app. By the way, it needs to be approved too.

TikTok fixed the bug that caused a redirect URL with a path to be rejected as a mismatch with the redirect domain. However, they fixed it only for paths (e.g., /auth/tiktok); PORT additions still result in an error, so www.domain.com:8080/auth/tiktok won't work but www.domain.com/auth/tiktok WILL work
UPDATE 10/3/2022
Got the following response directly from TikTok engineering team:
At this point, we only support production integrations with TikTok for Developers and require that you have a URL without port number. However, we understand from your communication that this makes it harder for you to build, test, and iterate your integration with us. Unfortunately, at this time, we do not have a timeline for when this additional support for development servers will be added. We request that you only redirect to URLs without port numbers. Thank you for the feedback.
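Based only on the behaviour reported in this thread (protocol required in the request, registered domain must match, no port numbers), a local pre-flight check could look like the sketch below. It is not TikTok's actual validation logic; the function name and the exact-hostname comparison are assumptions.

```javascript
// Returns null when no problem is found, otherwise a short description.
// Mirrors only the rules reported in this thread, not TikTok's real checks.
function checkRedirectUri(redirectUri, registeredDomains) {
  let url;
  try {
    url = new URL(redirectUri);
  } catch {
    return 'not an absolute URL (include http:// or https://)';
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    return 'must start with http:// or https://';
  }
  if (url.port !== '') return 'port numbers are rejected';
  if (!registeredDomains.includes(url.hostname)) {
    return `host ${url.hostname} is not a registered redirect domain`;
  }
  return null;
}
```

Running it against the redirect_uri before starting the login flow can catch a stray port or a missing protocol on a development machine.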

The frontend of the developer's dashboard still rejects protocol and path in validation. However, the backend skips the path validation.
To be able to update the "Redirect domain" simply:
1. Open dev tools in Chrome and go to the "Network" tab.
2. Click the "Save changes" button on the dashboard.
3. Right-click the "publish" request that appeared and copy it as cURL.
4. Modify the "redirect_domains" field in the request before pasting it into the terminal.
I believe the app still needs to be approved and in production to get it to work. I'm still waiting for approval and it has been a couple of weeks.
UPDATE 9/17/2022
Just like @mauricio-ribeiro, my app worked after it was approved to production. Setting up the redirect domain without path and scheme works just fine.

I had the same problem, my solution:
1.- In my TikTok App dashboard, the “redirect_uri” is: mydomain.com, without http/https and without a path (/my-redirect-url). You can also add subdomains using this rule.
2.- In my code, I have to add http or https to the redirect_uri, and I am free to use a path (/my-redirect-uri).
I hope this helps.

Unsplash API: How to retrieve Access_Token for authenticated access-login by browser?

I need to apply for production (approved-account) access to the Unsplash API, and the approval itself requires access to certain links. Since the support team has taken more than a few days to reply, I'm looking for additional help with retrieving the access_token for new requests made via GET/POST.
The original website was working perfectly until I started preparing it for production and for a potential increase in requests to the Unsplash API.
However, the approval process involves certain setup criteria, which I totally missed during my development phase and wanted to iron out as soon as possible. One of the key components is to set up your UTM links; the ideal reference is here: https://help.unsplash.com/en/articles/2511315-guideline-attribution.
My challenge was that I use the official JavaScript library, unsplash-js (https://github.com/unsplash/unsplash-js#authorization), to make the authentication and request processes simpler for my web app.
Most GET requests work. However, the guidelines require using a specific "download_location" URL instead (https://help.unsplash.com/en/articles/2511258-guideline-triggering-a-download), and that needs an authenticated request for every new request the web app makes.
The final challenge is that it is not clear how unsplash-js actually makes the "authenticated" request. I was unable to find this on the website, so I cannot retrieve the current access_token to use in my own requests.
The basic code I am using with the library is below. I am also confused about the maximum number of results I can request per page: I am hoping to get the details of 100 images, but I only get a maximum of 30 at a time. Can anyone confirm whether there is a workaround to increase this from 30 to 100?
Retrieving a Collection of Photos
unsplash.collections.getCollectionPhotos(urlAPI, 1, 100, "Popular")
.then(toJson)
.then(jsonData => {
console.log("jsonData", jsonData);
});
So, my website has been unable to launch for over a week now, as I am still waiting for final confirmation or additional help from the official Unsplash support team.
I am hopeful that someone can help me clarify the code, so that I can get one step closer to sorting out this "official authenticated" process and one step closer to getting approved for production.
Thank you in advance!

After multiple tries, I wasn't able to retrieve the Access_Token, because there is a pre-authorization step for which I couldn't find any working solution.
The current and clear limitations to the API are:
A maximum of 30 images per GET request.
The official JavaScript library, unsplash-js (https://github.com/unsplash/unsplash-js#authorization), works, but there is no clear or easy way to retrieve the "Access_Token" for use within a session.
Multiple async axios/fetch requests started inside a ReactJS context provider may not have completed before the first render, so the initial render shows an empty array.
Ultimately, my chosen solution for now is to cut the image list down to the highest-priority items, accept the limit of 30 images per retrieval, and still store them in the original collection and retrieve them from there.
The other alternative is to download the images and serve them from your own server, which may also be a faster route.
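Another way to work within the 30-per-request cap is to request several pages and concatenate them. The sketch below assumes a caller-supplied fetchPage(page, perPage) function that resolves to an array of photos; both that function and the name fetchPhotos are hypothetical, not part of unsplash-js.

```javascript
// Collect up to `total` photos by requesting consecutive pages.
// `fetchPage(page, perPage)` is supplied by the caller (hypothetical here)
// and must resolve to an array of photos.
async function fetchPhotos(fetchPage, total, maxPerPage = 30) {
  const photos = [];
  for (let page = 1; photos.length < total; page += 1) {
    const batch = await fetchPage(page, maxPerPage);
    if (batch.length === 0) break; // collection exhausted
    photos.push(...batch);
  }
  // The last page may overshoot, so trim to the requested total.
  return photos.slice(0, total);
}
```

With the call from the question, fetchPage might be written as (page, perPage) => unsplash.collections.getCollectionPhotos(urlAPI, page, perPage, "Popular").then(toJson). Note that 100 photos then cost four requests against the rate limit.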
Sadly, the Unsplash API team doesn't respond to requests for assistance very often. My last contact was roughly 1 month ago, and although I have tried to update my app to meet their requirements, there has been no feedback since.
Thus, for now it is probably better to build an alternative solution than to rely on the team for feedback, unless you are a paying client.
Good luck to the others on this! Cheers!

Google sign in button with automatic approval of authorizations

I've seen many threads about it but cannot find a satisfying answer: when using the Google sign-in button (https://developers.google.com/identity/sign-in/web/sign-in), is it possible to have the authorizations already accepted? For example, by adding the client ID of my app somewhere in the Google console?
For now I'm calling auth2.grantOfflineAccess when the button is clicked (so I can pass the returned code to my backend and make sure the user is from the expected domain).
If you're able to answer the first question and - bonus point - know if what I'm doing after clicking the button is right, you'd be awesome!

Thanks to Steven's comment, I'm now able to have the authorizations accepted by default. Be aware that there will still be a second popup (after the one that requests your email and password) informing you that your admin has granted the app access to your data. This happens only on your first connection, though.
So what you need to do is follow the third step of this document. It says you only need the plus.me and userinfo.email scopes if you only request the user's basic profile, but that did not work in my case; I also needed the userinfo.profile scope (perhaps because I use grantOfflineAccess()?).

Can't link Twitter with Satellizer.js

The Facebook, Google and Yahoo logins for satellizer.js were pretty straightforward. All I had to do was create apps with their respective APIs and configure them with my homepage's URL. Then I added the app IDs to the app.js file:
$authProvider.facebook({
clientId: 'xxxxxxxxxxxxxxxxx'
});
and lastly I added the app id and secret in the config.js file server-side.
I thought this would be the process with Twitter too, but their API keeps giving me an error saying
Something is technically wrong. Thanks for noticing—we're going to fix it up and have things back to normal soon.
and no additional information is given, which leaves me with finding the needle in the haystack.
Are there any additional measures that I need to take to make the Twitter authentication work?

Be sure to set the right Twitter secret and key.
I had the same problem until I realised that there was a space as the first character in the string of my Twitter key.
This causes the same behavior you described.
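A stray space like that is hard to spot by eye, so a small start-up check can help. This is only a sketch; the function name and the example config keys are made up for illustration.

```javascript
// Hypothetical helper: list config entries with leading/trailing whitespace.
function findPaddedKeys(config) {
  return Object.keys(config).filter(
    (name) => typeof config[name] === 'string' && config[name] !== config[name].trim()
  );
}

// Example with made-up values: only TWITTER_KEY has a leading space.
const padded = findPaddedKeys({ TWITTER_KEY: ' abc', TWITTER_SECRET: 'def' });
if (padded.length > 0) {
  console.warn('Config values with stray whitespace:', padded.join(', '));
}
```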

Android HTTP GET cookie / javascript issue?

I wrote an Android app that should 'connect' to a (private) forum using HTTP GET (and sometimes POST) requests. The basic idea is as such:
Login page where users submit their credentials. Login is performed by doing an HTTP POST (tried GET too, same result) to the Login page of the forum, with their username and password as the parameters. The request should return some cookies that I store in a BasicCookieStore.
Every page of the forum they want to visit is retrieved using HTTP GET. I parse the HTML source that I obtain and show them only the relevant info. In order to authenticate the users, the same BasicCookieStore that I used for login (step 1) is set as the cookiestore for the HttpClient.
This method has been working all the time during my testing, and has worked for my beta testers too. Now that I released the app, it became apparent that many users were having issues, especially on mobile connections (Wifi seems to be no problem).
By logging the HTML source that was returned in all the HTTP GET requests, I have a strong suspicion that the actual login works fine, but somehow the cookies don't get returned or stored or something in that direction. The problem is that the HTML source of the first page they will receive should be the list of forums. In the case of users with problems however, they get served a page that basically reads "You must enable Javascript to view this page".
The strange thing is, I don't receive that page when testing, nor do many of my users. Even worse: some users are now reporting it worked fine for them for days or weeks, and has now stopped working. Others have the exact opposite: not working for days, suddenly working now. One user has reported he was in Greece for 2 weeks, where it worked flawlessly, then he got back to Germany, and it stopped working again.
There seems to be a random component at play here.
I have tried various things, mostly with the way I do the HTTP GET requests. I started out using the normal DefaultHttpClient, with various settings, such as this:
HttpClient httpClient = new DefaultHttpClient();
// Define parameters
HttpParams httpParams = httpClient.getParams();
HttpConnectionParams.setConnectionTimeout(httpParams, TIMEOUT);
HttpConnectionParams.setSoTimeout(httpParams, TIMEOUT);
HttpProtocolParams.setVersion(httpParams, HttpVersion.HTTP_1_1);
// Set cookiestore (getCookieStore returns the same cookiestore)
HttpContext localContext = new BasicHttpContext();
localContext.setAttribute(ClientContext.COOKIE_STORE, getCookieStore());
HttpGet http = new HttpGet(url);
http.addHeader("Accept", ACCEPT_STRING);
http.addHeader("Content-Type", "application/x-www-form-urlencoded; charset=utf-8");
// Execute
HttpResponse response = httpClient.execute(http, localContext);
//... Process result (omitted)
Now I have switched to using AndroidHttpClient instead, with the rest of the code basically unchanged, and seem to get the same result.
I have also tried using the AsyncHttpClient library, which works quite differently, but once again the same result. I tried using its PersistentCookieStore as well, and you guessed it - same result.
I am clueless at this point. Am I looking in the wrong direction? The fact that a website would respond with "you need to enable Javascript" for some users but not for all seems to indicate an issue with cookies. I don't know how a website determines whether JavaScript is enabled, but surely with an HTTP GET request there is no JavaScript at play. So why do I (and many other users) get to the page without any problems, while others get the 'no javascript' message? The only reason I can think of is cookies, but I have no clue what exactly the problem is.
Any help would be much appreciated!

I doubt the problem is cookies. A network configuration problem is more likely.
For example, your user might have connected to a wifi hotspot with a captive portal page (which uses javascript to make you sign in before you can use the hotspot). In this case they should first open the browser, try to browse to (e.g.) http://google.com, get redirected, sign in, and then launch your app.
Or, your user might be connecting through a proxy. Many mobile carriers around the world will proxy their users' HTTP connections, sometimes doing horrible things to the content. Switching to HTTPS might help with that.
