Step by Step: How to Bypass Cloudflare and Improve Your Web Scraping Efforts

Mihnea-Octavian Manolache on February 22, 2023

Nowadays, web scraping has become a really challenging task. If you have ever built a web scraper with a headless browser, you have most likely come across some anti-bot systems. And bypassing Cloudflare, DataDome or any other anti-bot provider, for that matter, is no easy task. You first have to think of evasion strategies and then implement them in production. And even then, there might be scenarios that you never accounted for.

However, without at least a minimal implementation of these evasion techniques, your chances of getting caught and blocked are very high. That is why today's article is about how to bypass Cloudflare detection. Most of the techniques we're going to discuss apply to other detection providers as well. As a reference, we'll focus on Selenium and see if we can make it stealthy. Throughout the article, among other things, I am going to discuss:

Bot detection methods
General anti-bot evasion techniques
Advanced evasions for Selenium

How does Cloudflare detect headless browsers?

Cloudflare is a tech company with a giant network. They focus on services like CDN, DNS and various online security systems. Their Web Application Firewall is designed to protect against attacks such as DDoS or cross-site scripting. In recent years, Cloudflare and other providers in the field have introduced fingerprinting systems capable of detecting headless browsers. As you might guess, one of the first tools affected by these techniques is Selenium. And since the web scraping industry relies heavily on this technology, scrapers are directly affected as well.

Before moving on to evasion techniques, I think it is important to discuss how Cloudflare detects Selenium. Well, the system can be very complex. For example, there are properties in a regular browser that a web driver lacks. The `navigator` interface in a browser even has a property called `webdriver` that indicates whether the browser is controlled by automation. And that is an instant giveaway. If you want to experiment with it:

Open your browser's developer tools
Navigate to the console
Type the following command: `navigator.webdriver`

In your case, it should return `false`. But if you try it with Puppeteer or Selenium, you will get `true`. If you are wondering how Cloudflare takes advantage of this to detect bots, it's quite simple. All they need to do is inject a script like the following into their partner's website:

// detection-script.js
const webdriver = navigator.webdriver

// If webdriver returns true, display a reCaptcha.
// In this example, I am transferring the user to a Cloudflare challenge page.
// But you get the idea.
if (webdriver) location.replace('https://cloudflarechallenge.com')
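
The same flag can be read from the automation side. Here is a minimal Python sketch, assuming `driver` is a Selenium WebDriver instance you have already created; the helper name `is_flagged_as_automated` is my own and not part of Selenium:

```python
def is_flagged_as_automated(driver) -> bool:
    """Return True if the current page sees navigator.webdriver as truthy.

    `driver` is assumed to be a Selenium WebDriver (anything that exposes
    an `execute_script` method works).
    """
    # execute_script runs JavaScript in the page and hands the result back to Python
    return bool(driver.execute_script("return navigator.webdriver"))
```

Calling it right after `driver.get(...)` tells you whether a script like the one above would have caught your browser.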

Of course, in real life, these providers use many more levels of detection. Even the screen size, the keyboard layout or the installed plugins are used to fingerprint a browser. If you're interested in how browserless detection works, check out my simple service worker test. And that is only the browser side. Providers can also detect bot activity by looking at the IP address the request originates from. For example, if you're using datacenter IPs, your chances of getting blocked increase with every request. That is why it's recommended to use residential or ISP proxies when you're building a web scraper.
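
To make the idea of layered detection more concrete, here is a toy sketch of how several signals could be combined into one score. The signal names and weights are entirely hypothetical. Real providers do not publish their rules, so treat this as an illustration of the principle only:

```python
from dataclasses import dataclass


@dataclass
class Signals:
    webdriver: bool       # value of navigator.webdriver
    plugin_count: int     # number of entries in navigator.plugins
    screen_width: int     # reported screen width in pixels
    screen_height: int    # reported screen height in pixels
    datacenter_ip: bool   # request came from a known datacenter range


def suspicion_score(signals: Signals) -> int:
    """Add up made-up weights for each suspicious signal."""
    score = 0
    if signals.webdriver:
        score += 3  # the automation flag is the strongest hint here
    if signals.plugin_count == 0:
        score += 1  # a browser reporting no plugins looks unusual
    if signals.screen_width == 0 or signals.screen_height == 0:
        score += 1  # a missing screen size suggests no real display
    if signals.datacenter_ip:
        score += 2  # datacenter traffic is rarely a human visitor
    return score


bot = Signals(webdriver=True, plugin_count=0, screen_width=0,
              screen_height=0, datacenter_ip=True)
human = Signals(webdriver=False, plugin_count=3, screen_width=1920,
                screen_height=1080, datacenter_ip=False)
print(suspicion_score(bot), suspicion_score(human))
```

The point is that fixing a single signal, such as `navigator.webdriver`, still leaves the others to add up.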

How to bypass Cloudflare with Selenium

Fortunately, the web scraping community is very active. And since there is so much demand for bypassing Cloudflare and other anti-bot providers, there are open source solutions in this area. Great things can be achieved when programming communities work together. Moving forward, I suggest we follow these steps:

Run some tests to see if default Selenium can bypass Cloudflare
Add some extra evasions to make our scripts stealthier

So let's start with our first step:

#1: Can default Selenium bypass Cloudflare?

I'm not one to make assumptions, especially since we don't know for sure how Cloudflare's systems work. They use all kinds of obfuscation in their code, which makes reverse engineering difficult. That is why, throughout my experience as a developer, I have learned that testing is the best way to understand how a system works. So let's build a basic scraper and see how it behaves on a real target protected by Cloudflare.

1.1. Set up the environment

With Python, it is best to isolate each project inside its own directory. So let's create a new folder, open a terminal window and navigate to it:

# Create a new virtual environment and activate it
~ » python3 -m venv env && source env/bin/activate

# Install the dependencies
~ » python3 -m pip install selenium

# Create a new .py file and open the project inside your IDE
~ » touch app.py && code .
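
If you are not sure whether the virtual environment is actually active, a quick check from Python can help. This is a small sketch using only the standard library; the function name `in_virtualenv` is just an illustrative choice:

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs from a venv.

    Inside a venv, sys.prefix points at the environment, while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
```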

1.2. Build a simple web scraper with Selenium

Now that you have successfully set up your project, it's time to add some code. We're not going to build anything fancy here. We just need this script for testing purposes. If you want to learn about advanced scraping, check out this tutorial on Pyppeteer.

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Set Chrome to open in headless mode
# (the `options.headless` attribute was deprecated and later removed in Selenium 4)
options = Options()
options.add_argument('--headless=new')

# Create a new Chrome instance and navigate to the target
driver = webdriver.Chrome(options=options)
driver.get('https://www.snipesusa.com/')

# Give it some time to load
time.sleep(10)

# Take a screenshot of the page
driver.get_screenshot_as_file('screenshot.png')

# Close the browser
driver.quit()
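
A fixed ten-second sleep is fine for a one-off test. As an alternative, here is a hedged sketch that polls `document.readyState` instead, assuming `driver` is the Selenium instance from the script above. The helper `wait_for_page_ready` is my own name, not a Selenium API. Keep in mind that a challenge page can also finish loading, so this shortens the wait but does not prove you reached the real content:

```python
import time


def wait_for_page_ready(driver, timeout: float = 10.0, interval: float = 0.5) -> bool:
    """Poll the page until document.readyState is 'complete' or the timeout expires.

    Returns True if the page reported 'complete' in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.execute_script("return document.readyState") == "complete":
            return True
        time.sleep(interval)  # back off briefly before asking again
    return False
```

You would call it in place of `time.sleep(10)`, right after `driver.get(...)`.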

Now take a look at the screenshot. This is what I got:

I think we can conclude that the test failed. The target website is protected by Cloudflare and, as you can see, we are being blocked. So by default, Selenium is not able to bypass Cloudflare. I'm not going to dig deeper and check other bot detection providers. If you want to run more tests, here are some targets and their providers:

#2: Can stealth Selenium bypass Cloudflare?

First of all, let me clarify the terms. By stealth Selenium I mean a version of Selenium that can go undetected and bypass Cloudflare. I am not referring to any specific stealth technique. There are a couple of ways to implement evasion techniques in Selenium. There are packages that handle it, or you can use `execute_cdp_cmd` to interact directly with the Chrome DevTools Protocol (CDP). The latter gives you more control, but requires more work. Here is an example of how you could use it to change the user agent's value:

driver.execute_cdp_cmd('Emulation.setUserAgentOverride', {
    "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win32; x86) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36",
    "platform": "Win32",
    "acceptLanguage": "ro-RO"
})
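
Another CDP method that is often used for this purpose is `Page.addScriptToEvaluateOnNewDocument`, which runs a script before any page script gets a chance to execute. The sketch below uses it to redefine `navigator.webdriver`. The helper name and the JavaScript patch are my own illustration, and on its own this single patch is unlikely to be enough against a real anti-bot system:

```python
# JavaScript injected into every new document before the site's own scripts run
HIDE_WEBDRIVER_JS = """
Object.defineProperty(navigator, 'webdriver', {
  get: () => undefined
});
"""


def hide_webdriver_flag(driver) -> None:
    """Register the patch through CDP.

    `driver` is assumed to be a Chromium-based Selenium driver,
    since `execute_cdp_cmd` is only available on those.
    """
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": HIDE_WEBDRIVER_JS},
    )
```

Call `hide_webdriver_flag(driver)` before `driver.get(...)`, because the script only applies to documents loaded after it was registered.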

But you would have to go through the CDP documentation and identify the APIs that let you make all the necessary changes. So for the time being, let's try some packages.

2.1. Stealth Selenium

There are at least two packages you can use to make Selenium stealthy. At this point, however, neither of them is guaranteed to bypass Cloudflare. Once again, we have to test and see if either of them works. First, let's take a look at `selenium-stealth`. This package is a Python re-implementation of `puppeteer-extra-plugin-stealth`, which makes it possible to use Puppeteer's evasions with Selenium in Python. To use it, you have to install it first. Open a terminal window and enter this command:

# Install selenium-stealth
~ » python3 -m pip install selenium-stealth

You're all set. We can now use it to make our previous scraper stealthier:

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium_stealth import stealth

# Set Chrome to open in headless mode
options = Options()
options.add_argument('--headless=new')

# Create a new Chrome instance
driver = webdriver.Chrome(options=options)

# Apply stealth to your webdriver
stealth(driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

# Navigate to the target
driver.get('https://www.snipesusa.com/')

# Give it some time to load
time.sleep(10)

# Take a screenshot of the page
driver.get_screenshot_as_file('stealth.png')

# Close the browser
driver.quit()
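
To see which properties the stealth call actually changed, one option is to read a few `navigator` values back from a loaded page, once with a default driver and once with the stealth driver, and compare them. This is a sketch with helper names of my own choosing. It assumes each `driver` has already navigated to a page and has not been closed yet:

```python
# Properties read back from the page; a small, arbitrary selection
FINGERPRINT_JS = """
return {
  webdriver: navigator.webdriver,
  languages: navigator.languages,
  vendor: navigator.vendor,
  platform: navigator.platform,
  userAgent: navigator.userAgent
};
"""


def collect_fingerprint(driver) -> dict:
    """Return the selected navigator properties as a Python dict."""
    return dict(driver.execute_script(FINGERPRINT_JS))


def diff_fingerprints(before: dict, after: dict) -> dict:
    """Map each property that differs to a (before, after) tuple."""
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }
```

For example, `diff_fingerprints(collect_fingerprint(plain_driver), collect_fingerprint(stealth_driver))` lists only the values that the evasions touched, where `plain_driver` and `stealth_driver` are placeholders for your two browser instances.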

Running the script produced a different outcome for me this time, compared to the default Selenium settings:

The second option you may use is `undetected_chromedriver`. This one is described as an ‘optimized Selenium chromedriver’. Let’s test it out:

# Install undetected_chromedriver
~ » python3 -m pip install undetected_chromedriver

The code is very similar to our default script. The main difference is the package name. Here is a basic scraper built with `undetected_chromedriver`, so let's see if it can bypass Cloudflare:

import time

import undetected_chromedriver as uc

options = uc.ChromeOptions()

# Create a new headless Chrome instance and maximize the window.
# `browser_executable_path` points at the Chrome binary
# (macOS path shown here; adjust it for your system)
driver = uc.Chrome(
    options=options,
    headless=True,
    browser_executable_path='/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
)
driver.maximize_window()

# Navigate to the target
driver.get('https://www.snipesusa.com/')

# Give it some time to load
time.sleep(10)

# Take a screenshot of the page
driver.get_screenshot_as_file('stealth-uc.png')

# Close the browser
driver.quit()

Yet again, running the script turned out well for me. It seems that at least these two packages can successfully bypass Cloudflare protection, at least in the short term. Truth be told, chances are that if you use these scripts extensively, Cloudflare will catch on to your IP address and block it.
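
To notice when that happens without checking screenshots by hand, a scraper can look for text that typically appears on challenge or block pages. The markers below are assumptions based on pages commonly seen in the wild. They are not an official list and may change at any time, so treat this as a rough heuristic:

```python
# Lowercase snippets often found on challenge or block pages (assumed, not official)
CHALLENGE_MARKERS = (
    "just a moment",
    "attention required",
    "checking your browser",
)


def looks_like_challenge(title: str, page_source: str = "") -> bool:
    """Return True if the title or the start of the HTML contains a known marker."""
    # Only the first part of the source is scanned to keep the check cheap
    haystack = f"{title}\n{page_source[:5000]}".lower()
    return any(marker in haystack for marker in CHALLENGE_MARKERS)
```

With Selenium you could call `looks_like_challenge(driver.title, driver.page_source)` after the page has loaded and retry or rotate your setup when it returns `True`.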

So let me introduce you to a third option: Web Scraping API.

2.2. Selenium with Web Scraping API

Web Scraping API has this amazing feature called Proxy Mode. You can read more about it here. But what I want to note is that our Proxy Mode can be successfully integrated with Selenium. This way, you get access to all the evasion features we've implemented. And let me tell you that we have a dedicated team working on custom evasion techniques. In technical terms, we're handling IP rotation, we're using various proxies, we're solving captchas and we're using Chrome's API to continuously change our fingerprint. In non-technical terms, this translates to less hassle on your side and a greater success rate. You basically get the stealthiest version of Selenium there is. And here's how it's done:

# Install selenium-wire
~ » python3 -m pip install selenium-wire

We are using `selenium-wire` so that Selenium can work with a proxy. Now here's the script:

from seleniumwire import webdriver

# Encode the parameters as dot-separated key=value pairs
def get_params(params):
    return '.'.join(f"{key}={value}" for key, value in params.items())

# Your WSA API key
API_KEY = '<YOUR_API_KEY>'

# Default proxy mode parameters
PARAMETERS = {
    "proxy_type": "datacenter",
    "device": "desktop",
    "render_js": 1
}

# Set Selenium to use a proxy
options = {
    'proxy': {
        "http": f"http://webscrapingapi.{get_params(PARAMETERS)}:{API_KEY}@proxy.webscrapingapi.com:80",
    }
}

# Create a new Chrome instance
driver = webdriver.Chrome(seleniumwire_options=options)

# Navigate to the target
driver.get('https://www.httpbin.org/get')

# Retrieve the HTML document from the page
html = driver.page_source
print(html)

# Close the browser
driver.quit()

If you run this script a couple of times, you will see the IP address change every time. That is our IP rotation system. In the background, it also adds evasion techniques. You don't even need to worry about them. We take care of the Cloudflare evasion part so that you can focus more on parsing the data.
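
If you would rather compare the IP addresses programmatically than read the printed HTML, here is a small sketch that pulls the `origin` field out of the httpbin response. It assumes Chrome wraps the JSON body in a `<pre>` element, which is what I usually see but is not guaranteed. The function name `extract_origin` is my own:

```python
import html
import json
import re


def extract_origin(page_source: str):
    """Return the 'origin' value from an httpbin /get response, or None.

    Accepts either raw JSON or the HTML that Chrome generates around it.
    """
    match = re.search(r"<pre[^>]*>(.*?)</pre>", page_source, flags=re.DOTALL)
    # Undo HTML escaping when the JSON came from inside the <pre> element
    raw = html.unescape(match.group(1)) if match else page_source
    try:
        return json.loads(raw).get("origin")
    except (json.JSONDecodeError, AttributeError):
        # Not JSON at all, or JSON that is not an object
        return None
```

Collecting `extract_origin(driver.page_source)` over a few runs and putting the results in a `set` shows how many distinct addresses you were given.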

Conclusions

If you want to build a scraper that can bypass Cloudflare, you need to account for a lot of things. A dedicated team can work 24/7 and there’s still no guarantee the evasions will work every time. That’s because with every browser version release, there is a chance new features are added to the API. And some of these features can be used to fingerprint and detect bots.

I would even say that the best browser for bypassing Cloudflare and other providers is the one you build yourself. And we built one at Web Scraping API. Now we're sharing it with you. So enjoy scraping!