Analysis of Bot Protection systems with available countermeasures 🚿. How to defeat anti-bot system 👻 and get around browser fingerprinting scripts 🕵️‍♂️ when scraping the web?

xuan2261/browser-fingerprinting


Browser Fingerprinting, Bot Detection 👨‍🔧

Here I study various aspects of existing evasion techniques to get around anti-bot systems. The technical findings that I am sharing below are based on observations of running web scraping scripts for a few months against websites protected by:

and a few other custom-built ones (incl. social media platforms). Having trouble bypassing one of them?


Looking for a stellar web scraping service? Check ScrapingBee service that runs in-cloud scraping bots with no extra charges for traffic from premium and residential proxies, and has battle-tested anti-fingerprinting features.


A ⭐ on this repo will be appreciated!

Technicalities

I constantly add stuff to this section. Over time I will try to make it look and feel more structured.

Random, maybe useful

✔️ Win / ❌ Fail / 🤷 Tie:

  • ✔️ Client Hints - Shipped recently. In line with Chromium cpp implementation.
  • ✔️ General navigator and window properties
  • ✔️ Chrome plugins and native extensions - This includes the Widevine DRM extension as well as Google Hangouts, safe-browsing etc.
  • 🤷 p0f - detect host OS from TCP struct - Not possible to fix via Puppeteer APIs. Used in Akamai Bot Manager to match against JS and browser headers (Client Hints and User-Agent). There is a detailed explanation of the issue. The most reliable evasion seems to be not spoofing the host OS at all, or using OSfooler-ng.
  • 🤷 Browser dimensions - Although the stealth plugin provides the window.outerdimensions evasion, it won't work without correct config on a non-default OS in headless mode; it almost always fails when viewport size >= screen resolution (low-resolution display on the host).
  • ❌ core-estimator - This can detect a mismatch between navigator.hardwareConcurrency and the SW/WW execution profile. Not possible to limit/bump the ServiceWorker/WebWorker thread limit via existing Puppeteer APIs.
  • ❌ WebGL extensions profiling - desc. tbd
  • ❌ RTCPeerConnection when behind a proxy - Applies to both SOCKS and HTTP(S) proxies.
  • ❌ Performance.now - desc. tbd (red pill)
  • ❌ WebGL profiling - desc. tbd
  • ❌ Behavior Detection - desc. tbd (events, params, ML+AI buzz)
  • ❌ Font fingerprinting - desc. tbd (list+version+renderer via HTML&canvas)
  • ❌ Network Latency - desc. tbd (integrity check: proxy det., JS networkinfo, dns resolv profiling&timing)
  • ❌ Battery API - desc. tbd
  • ❌ Gyroscope and other (mostly mobile) device sensors - desc. tbd
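
To make the p0f and browser dimensions items more concrete, here is a minimal sketch of the kind of cross-check a detector could run on a collected profile. It is not how Akamai or any other vendor implements it. The profile shape (`userAgent`, `secChUaPlatform`, `tcpOsGuess`, `innerWidth`, `innerHeight`, `screenWidth`, `screenHeight`) and both function names are hypothetical, and `tcpOsGuess` is assumed to be a p0f-style result already normalized to the same OS labels.

```javascript
// Map a User-Agent string to a coarse OS label (same labels as Sec-CH-UA-Platform).
function osFromUserAgent(userAgent) {
  if (/iPhone|iPad/.test(userAgent)) return 'iOS';
  if (/Windows NT/.test(userAgent)) return 'Windows';
  if (/Android/.test(userAgent)) return 'Android'; // must come before Linux
  if (/Mac OS X/.test(userAgent)) return 'macOS';
  if (/Linux/.test(userAgent)) return 'Linux';
  return 'unknown';
}

// Hypothetical helper: returns a list of human-readable inconsistencies.
function findInconsistencies(profile) {
  const issues = [];
  const uaOs = osFromUserAgent(profile.userAgent || '');

  // The Sec-CH-UA-Platform header value is quoted, e.g. "Windows".
  const hintOs = (profile.secChUaPlatform || '').replace(/"/g, '');
  if (hintOs && uaOs !== 'unknown' && hintOs !== uaOs) {
    issues.push(`Client Hints platform (${hintOs}) != User-Agent OS (${uaOs})`);
  }

  // Assumption: tcpOsGuess is a normalized passive TCP fingerprint result.
  if (profile.tcpOsGuess && uaOs !== 'unknown' && profile.tcpOsGuess !== uaOs) {
    issues.push(`TCP stack OS (${profile.tcpOsGuess}) != User-Agent OS (${uaOs})`);
  }

  // A viewport that fills or exceeds the screen leaves no room for browser UI.
  if (profile.innerWidth >= profile.screenWidth &&
      profile.innerHeight >= profile.screenHeight) {
    issues.push('viewport size >= screen resolution');
  }
  return issues;
}
```

A spoofed Windows User-Agent sent from a Linux host with a 1920x1080 viewport on a 1920x1080 screen would come back with two issues, which is why not spoofing the host OS at all tends to be the safer option.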

Multilogin, Kameleo and others 💰🤠

  • ❌ General navigator and window properties - As per Multilogin documentation, custom browser builds typically lag behind the latest additions made by browser vendors. In this case, a modified Chromium M7X is used (almost 10 versions behind when writing this).
  • 🤷 Font masking - Font fingerprinting still leaks the host OS due to the use of different font rendering backends on Win/Lin/Mac. However, the basic "font whitelisting" technique can help to slightly rotate the browser fingerprint.
  • ❌ Inconsistencies - Profile misconfiguration leads to early detection of property/behavior inconsistencies.
  • ❌ Native extensions - Unlike puppeteer-extra-plugin-stealth, custom Chromium builds such as ML and Kameleo provide at most an override for native plugins and extensions shipped with Google Chrome.
  • ❌ AudioContext APIs and WebGL property override - Manipulation of the original canvas and audio waveform can be detected with custom JS.
  • ✔️ Audio and GL noise
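
As an illustration of how manipulation can be detected with custom JS, here is a minimal sketch of one possible check: draw a known solid colour, read it back several times, and compare. It assumes that an unmodified browser returns an opaque solid fill exactly as drawn. `detectCanvasTampering` and the `drawAndRead` callback are hypothetical names; in a browser the callback would wrap `fillRect` and `getImageData` on a canvas 2D context.

```javascript
// drawAndRead(colour) is a hypothetical callback: it paints `colour`
// ([r, g, b, a]) and returns the [r, g, b, a] value read back.
function detectCanvasTampering(drawAndRead, rounds = 3) {
  const expected = [255, 0, 0, 255]; // opaque red
  const reads = [];
  for (let i = 0; i < rounds; i++) {
    reads.push(drawAndRead(expected));
  }

  // Any deviation from the drawn colour suggests the readback is altered.
  const matchesExpected = reads.every(
    (px) => px.every((value, i) => value === expected[i])
  );
  // Different results for identical draws suggest per-read random noise.
  const stable = reads.every(
    (px) => px.every((value, i) => value === reads[0][i])
  );

  return { tampered: !matchesExpected, randomizedPerRead: !stable };
}
```

Noise that stays constant within a session only trips the first flag, while noise regenerated on every read trips both.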

tbd (if you have an active subscription in any of these services and don't mind sharing an account drop me an email ❤️)

Available stealth browsers with automation features

Important: Use this software at your own risk. Some of these browsers contain malware, just FYI. I do not recommend using them.

| Stealth Browser | Puppeteer | Selenium | Evasions | SDK/Tooling | Origin |
| --- | --- | --- | --- | --- | --- |
| GoLogin | ✔️ | ✔️ | 🤮 | 👍 | 🇺🇸 + 🇷🇺 |
| Incogniton | ✔️ | ✔️ | 🤮 | ✔️ | ❓ |
| ClonBrowser | ✔️ | ✔️ | 🤮 | ✔️ | ❓ |
| MultiLogin | ✔️ | ✔️ | 🤮 | ✔️ | 🇪🇪 + 🇷🇺 |
| Indigo Browser | ✔️ | ✔️ | 🤮 | ✔️ | ❓ |
| GhostBrowser | ❌ | ❌ | ❌ | 👍 | ❓ |
| Kameleo | ✔️ | ✔️ | 🤮 | ✔️ | ❓ |
| AntBrowser | ❌ | ❌ | ❌ | ❌ | 🇷🇺 |
| CheBrowser | ❌ | ❌ | 🤮/✔️ | 👍 | 🇷🇺 |

Legend: 🤮 - Evasion based on noise. ❌ - No. ✔️ - Acceptable (with support libraries or not). 👍 - Very nice.

Fingerprint test pages

These websites may be useful for testing fingerprinting techniques against web scraping software:

| Test page | Notes |
| --- | --- |
| https://pixelscan.net/ | Not 100% reliable, as it often reports Chrome as "inconsistent" right after a new update, but worth checking as the author adds interesting new detection features every now and then |
| https://browserleaks.com/ | Doesn't need an introduction 😉 |
| https://f.vision/ | Good quality test page from some 🇷🇺 guys |
| https://www.ipqualityscore.com/ip-reputation-check | Commercial service with a free reputation check against popular blacklists |
| https://antcpt.com/eng/information/demo-form/recaptcha-3-test-score.html | ReCaptcha score as well as some interesting notes on how to optimize captcha solving costs |
| https://ja3er.com/ | SSL/TLS fingerprint |
| https://fingerprintjs.com/demo/ | Good for basic tests - from people who claim they can create unique fingerprints "99.5%" of the time |
| https://coveryourtracks.eff.org/ | - |
| https://www.deviceinfo.me/ | - |
| https://amiunique.org/ | - |
| http://uniquemachine.org/ | - |
| http://dnscookie.com/ | - |
| https://whatleaks.com/ | - |
| https://kitchensink.ssl.fun/vendor/shape/fp | - |

Non-technical notes

I need to make a general remark to people who are evaluating and/or planning to introduce anti-bot software on their websites. Anti-bot software is nonsense. It's snake oil sold to people without technical knowledge for big bucks.

Blocking bot traffic is based on the premise that you (or your technology provider) can distinguish bots from real users. To make this happen, various privacy-invasive techniques are applied. To date, none of them has proven successful against specialized web scraping tools. Anti-bot software is all about reducing cheap bot traffic. It makes the process of scraping more expensive and complicated, but does not make it entirely impossible.

Anti-bot software vendors use detection techniques that fall into one of these two categories:

Binary detection

This applies when no specialized web scraping software is used. The vendor can detect the bad traffic based on information openly disclosed by the scraper, e.g. the User-Agent header, connection parameters, etc.

As a result, only bots that do not target a specific website are blocked. This will make most managers happy, because the overall amount of bad traffic goes down and it may almost look like there is no more bot traffic on the website. Wrong.
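
A minimal sketch of what such a binary check can look like on the server side. This is an illustration only, not any vendor's rule set: the pattern list, the missing `Accept-Language` heuristic and the `binaryDetect` name are all assumptions.

```javascript
// User-Agent tokens that common tools send by default (illustrative list).
const AUTOMATION_UA_PATTERNS = [
  /HeadlessChrome/i,
  /python-requests/i,
  /curl\//i,
  /Scrapy/i,
  /PhantomJS/i,
];

// Hypothetical helper: classify a request from its headers alone.
function binaryDetect(headers) {
  // Header names are case-insensitive, so normalize them first.
  const h = {};
  for (const [name, value] of Object.entries(headers)) {
    h[name.toLowerCase()] = value;
  }

  const userAgent = h['user-agent'] || '';
  if (!userAgent) {
    return { bot: true, reason: 'missing User-Agent' };
  }
  const hit = AUTOMATION_UA_PATTERNS.find((re) => re.test(userAgent));
  if (hit) {
    return { bot: true, reason: `User-Agent matches ${hit}` };
  }
  // Assumption: regular browsers send Accept-Language on navigation requests.
  if (!h['accept-language']) {
    return { bot: true, reason: 'missing Accept-Language' };
  }
  return { bot: false, reason: 'no openly disclosed signal' };
}
```

Any scraper that simply copies a real browser's headers passes this check, which is exactly why it only stops untargeted bots.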

Traffic clustering

More advanced web scrapers make use of residential proxies and implement complex evasion techniques to fool anti-bot software into thinking that the web scraper is a real user. No detection mechanism can reliably get around this, due to the technical limitations of web browsers.

In this case, most of the time the vendor will only be able to cluster the bad traffic by finding patterns in bot traffic and behavior. This is where browser fingerprinting comes into play. The problem with banning the traffic here is that it may turn out to be a risky operation when bots are successfully mimicking real users. There is a chance that by blocking bots the website will become unavailable to real visitors.
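
A minimal sketch of the clustering idea, assuming each request has already been reduced to a few fingerprint attributes. The field names (`userAgent`, `screen`, `timezone`, `webglRenderer`, `ip`), the `clusterByFingerprint` name and the threshold are hypothetical; real systems use far more signals.

```javascript
// Group requests by a fingerprint key and keep clusters that are
// spread across many IP addresses (a typical residential-proxy pattern).
function clusterByFingerprint(requests, minDistinctIps = 3) {
  const clusters = new Map();

  for (const req of requests) {
    // Assumption: these four attributes stand in for a full fingerprint.
    const key = [req.userAgent, req.screen, req.timezone, req.webglRenderer].join('|');
    if (!clusters.has(key)) {
      clusters.set(key, { key, count: 0, ips: new Set() });
    }
    const cluster = clusters.get(key);
    cluster.count += 1;
    cluster.ips.add(req.ip);
  }

  return [...clusters.values()].filter((c) => c.ips.size >= minDistinctIps);
}
```

The same output also shows the risk: a popular phone model with default settings produces one identical fingerprint across thousands of real users and IPs, so banning a whole cluster can lock out real visitors.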

Gateways, captchas & co

If you think this is the way to go, google "captcha resolve api".

Tester

Check out my tester application: https://niespodd.github.io/browser-fingerprinting/

Support

If you have problems with scraping a specific website, write me a short email at [email protected]. Let's have a quick tête-à-tête consultation via Skype 😊.

Have I mentioned a ⭐ would be appreciated? :-)

➡️ Ethereum address 0x380a4b41fB5e0e1EB8c616eBD56f62f8F934Bab6
