# Scraping

2025-09-23

Web scraping is the process of automatically accessing and collecting information from websites using bots. It's essential for web search, LLM recommendations, live web monitoring and so on, but it has the downside of generating unwanted traffic and costs. For this reason some websites deploy various anti-bot measures, and in this arms race over web content the scrapers inevitably grow more complex. There are three main ways to recognize and mitigate bot traffic:

1. displaying captchas for risky browser fingerprints
2. blocking or rate-limiting abusive IPs
3. making scraping cost more than it's worth

On the dynamic web it's no longer feasible to simply fetch a page's HTML and parse it. Often there is some JS involved that populates the content, and some of the aforementioned anti-bot measures can be in play too, so an entire browser has to be driven during scraping to circumvent protections and render the target content properly. There are two main browser automation toolkits: (1) [selenium](https://www.selenium.dev/) and (2) [playwright](https://playwright.dev/). Both have some facilities to make the browser appear more like a real user: for selenium there is [undetected-chromedriver](https://github.com/ultrafunkamsterdam/undetected-chromedriver), and for playwright there are extensions such as [puppeteer-extra-plugin-stealth](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth) (a minimal setup is sketched at the end of this post). The choice between the two pretty much comes down to personal preference and external tooling support.

To avoid IP blocking, scrapers have to rotate their addresses regularly, using various proxy services (also sketched below). Some of them are pretty shady, for example residential and mobile IP proxies like the ones from brightdata (previously luminati). The Evil™ Corp pays mobile devs to include its Evil™ SDK in their apps, which allows other people to use YOUR device as a proxy. Of course it all has to be obfuscated somewhere in the terms of service, but nobody reads a fucking 200-page TOS for a stupid mobile game. The malicious Evil™ SDK ends up in many places where you don't suspect it: mobile apps, browser extensions, games, the .exe you downloaded last time, and many more. The more you know...

Even though it's not hard to build a scraping setup like this yourself, it's much more convenient to have it all bundled in one service. Unsurprisingly, there's a lot to choose from: scrape-this, scrape-that, .com, .io, .dev, whatever. There are even some designed for plug-and-play use with LLM tools, for example browserbase and firecrawl. Recently I used browserbase to scrape a list of about 200 ecom websites and extract contact info from them. Data like this can later be used for marketing campaigns (so-called "lead generation"), robocalls, email spam, phishing, or other cyberattacks. It's easy with browserbase because I don't have to grep for contact links and emails: with stagehand I can have an LLM locate this data for me, even navigate the page, and in the end extract only what I need. Browserbase exposes the playwright API for its browser sessions, so I can also take screenshots and snapshots of these websites to analyze them more thoroughly later. Theo's recommendation once again comes in clutch. I hear more and more about LLM-based scrapers in the browser and other tools, so I suspect this is only going to get easier with time.
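For reference, here's a minimal sketch of the stealth setup in TypeScript, using playwright-extra (the playwright counterpart of puppeteer-extra) to load the stealth plugin mentioned above; the target URL is just a placeholder:

```ts
// playwright-extra wraps playwright and accepts puppeteer-extra plugins,
// including the stealth plugin that patches common fingerprint leaks
// (navigator.webdriver, missing plugins, etc.).
import { chromium } from "playwright-extra";
import stealth from "puppeteer-extra-plugin-stealth";

chromium.use(stealth());

async function main() {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://example.com"); // placeholder target
  console.log(await page.title());
  await browser.close();
}

main().catch(console.error);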
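IP rotation with playwright is mostly a matter of giving each browser context a different proxy. A rough sketch, with a hypothetical proxy pool standing in for a real provider's endpoints:

```ts
import { chromium } from "playwright";

// Hypothetical pool; real setups fetch these from a proxy provider's API.
const proxies = [
  { server: "http://proxy1.example.com:8000", username: "user", password: "pass" },
  { server: "http://proxy2.example.com:8000", username: "user", password: "pass" },
];

async function fetchWithRotation(urls: string[]) {
  const browser = await chromium.launch();
  for (const [i, url] of urls.entries()) {
    // Each context takes the next proxy from the pool, so consecutive
    // requests leave from different IP addresses.
    const context = await browser.newContext({ proxy: proxies[i % proxies.length] });
    const page = await context.newPage();
    await page.goto(url);
    console.log(url, "->", await page.title());
    await context.close();
  }
  await browser.close();
}
```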
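The stagehand flow for the contact-info job looked roughly like this (a sketch from memory, so treat the exact option names as assumptions; the zod schema describes what the LLM should extract):

```ts
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

async function extractContacts(url: string) {
  // env: "BROWSERBASE" runs the session on browserbase's infrastructure;
  // API keys are picked up from environment variables.
  const stagehand = new Stagehand({ env: "BROWSERBASE" });
  await stagehand.init();

  const page = stagehand.page; // a playwright page with LLM helpers on top
  await page.goto(url);

  // Let the LLM locate the data instead of grepping for selectors by hand.
  const contact = await page.extract({
    instruction: "find the company's contact email and phone number",
    schema: z.object({
      email: z.string(),
      phone: z.string().optional(),
    }),
  });

  await stagehand.close();
  return contact;
}
```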
I think the third way of mitigating bot traffic, making it economically infeasible, is currently the best solution. I am very much rooting for [Anubis](https://github.com/TecharoHQ/anubis) to win the battle against the bots and balance the cost of serving content against the cost of scraping it. There must be a better and less hostile way to obtain a good data set than just wild data harvesting. And there must be better ways to protect your services from it.
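The core trick behind Anubis is a proof-of-work challenge: the client has to burn CPU before it gets content, so scraping at scale costs real money. A toy sketch of the idea, not Anubis's actual implementation:

```ts
import { createHash } from "node:crypto";

// The server hands out a random challenge and a difficulty; the client
// must find a nonce whose sha256 hash has the required leading zeros.
function solve(challenge: string, difficulty: number): number {
  const prefix = "0".repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    const digest = createHash("sha256").update(challenge + nonce).digest("hex");
    if (digest.startsWith(prefix)) return nonce; // valid proof found
  }
}

// Verification is a single hash, so the cost is asymmetric:
// cheap for the server, expensive (per request) for the scraper.
function verify(challenge: string, nonce: number, difficulty: number): boolean {
  const digest = createHash("sha256").update(challenge + nonce).digest("hex");
  return digest.startsWith("0".repeat(difficulty));
}
```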