Hands On #1: Testing the Bright Data Web Unblocker proxy

What is Bright Data Web Unblocker

Digging a bit deeper into Bright Data’s website, we can better understand how this works. Directly from the product page:

Limits requests per IP

Manage IP usage rates so you don’t ask for a suspicious amount of data from any one IP

Emulates a real user

Automated user emulation including: starting on the target’s homepage, clicking their links, & making human mouse movements

Imitates the right devices

Web Unlocker emulates the right devices that servers expect to see

Calibrates referrer header

Makes sure the target website sees that you are landing on their page from a popular website

Identifies honeypots

Honeypots are links that sites use to expose your crawlers. Automatically detect them and avoid their trap

Sets intervals between requests

Automated delays are randomly set between requests

All these features can be summed up with the following picture.

Figure: Bright Data Web Unblocker page

It looks like a good solution, and one that is easy to integrate into our scrapers, since it basically amounts to adding a proxy to them.
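To give a rough idea of what "adding a proxy" means for a Scrapy spider: Scrapy's built-in HttpProxyMiddleware reads the proxy address from the `proxy` key of a request's `meta`. The sketch below only builds that dictionary. The host, port, and credentials are placeholders rather than Bright Data's actual values, and the helper name `proxy_meta` is made up for this example.

```python
from urllib.parse import quote


def proxy_meta(host: str, port: int, user: str, password: str) -> dict:
    """Build the meta dict that Scrapy's HttpProxyMiddleware looks for."""
    # Percent-encode the credentials so characters like '@' or ':'
    # do not break the proxy URL.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return {"proxy": f"http://{credentials}@{host}:{port}"}


# Placeholder values (assumptions): replace them with the ones
# from your own proxy account.
meta = proxy_meta("proxy.example.com", 8080, "my-user", "my-p@ssword")
print(meta["proxy"])

# Inside a spider, the dict would travel with each request, e.g.
#   yield scrapy.Request(url, meta=meta, callback=self.parse)
```

Keeping the proxy in `meta` means the spider's parsing logic stays untouched, which is what makes this kind of product quick to try on an existing scraper.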

Our testing methodology

To test this kind of product, I’ve developed a plain Scrapy spider that retrieves 10 pages from each of 5 different websites, one for each anti-bot solution tested (Datadome, Cloudflare, Kasada, F5, PerimeterX). For every page it returns the HTTP status code, a string extracted from the page (needed to check whether the page was loaded correctly), the website, and the name of the anti-bot solution.
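The record returned for each page, and the check on it, could look roughly like the sketch below. The field names and the exact rule (HTTP 200 plus a non-empty extracted string) are assumptions made for illustration; the real spider may be stricter or looser.

```python
from dataclasses import dataclass


@dataclass
class PageResult:
    website: str
    antibot: str
    status_code: int
    page_string: str  # text extracted from the page; empty if not found


def loaded_correctly(result: PageResult) -> bool:
    """Count a page as retrieved only if the request succeeded
    and the expected string was actually extracted."""
    return result.status_code == 200 and bool(result.page_string.strip())


# Made-up records: a good page, a blocked request, and a challenge
# page that answers 200 but does not contain the expected content.
ok = PageResult("shop.example.com", "Cloudflare", 200, "Blue running shoes")
blocked = PageResult("shop.example.com", "Cloudflare", 403, "")
challenge = PageResult("shop.example.com", "Cloudflare", 200, "")
print([loaded_correctly(r) for r in (ok, blocked, challenge)])
```

Checking the extracted string as well as the status code matters because some anti-bot solutions serve their challenge page with a 200 status.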

The base scraper cannot correctly retrieve any of the records, and this will be our baseline result.

As a result of the test, we’ll assign a score from 0 to 100, depending on how many URLs are retrieved correctly over two runs, one in a local environment and the other from a server. A score of 100 means that the anti-bot was bypassed for every URL given as input in both runs, while our starting scraper has a score of 0, since it could not get past the anti-bot for any of the records.
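The article does not spell out the exact formula, so the following is only one plausible reading: the score is the percentage of URLs retrieved correctly across the two runs. The function name and its parameters are hypothetical.

```python
def antibot_score(local_ok: int, server_ok: int, urls_per_run: int) -> float:
    """Percentage of URLs retrieved correctly across the two runs."""
    if urls_per_run <= 0:
        raise ValueError("urls_per_run must be positive")
    return 100 * (local_ok + server_ok) / (2 * urls_per_run)


# Example with 10 URLs per run: every page retrieved locally,
# only half of them from the server.
print(antibot_score(local_ok=10, server_ok=5, urls_per_run=10))

# The base scraper retrieves nothing in either run.
print(antibot_score(local_ok=0, server_ok=0, urls_per_run=10))
```

Under this reading, a tool that works perfectly from a laptop but fails completely from a server cannot score above 50.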

You can find the code of the test scraper in our GitHub repository, which is open to all our readers.
