Last active: November 4, 2025
u/mike_hearn on botguard
I'm the guy who wrote/designed the first version of Google's framework for this (a.k.a. BotGuard), way back in 2010. Indeed we were up to "good", like detecting spambots and click fraud. People often think these things are attempts to build supercookies, but they aren't; they are only designed to detect the difference between automated and non-automated clients.
There seem to be quite a few VM-based JS obfuscation schemes appearing these days, but judging from the blog posts about people attempting to reverse them, the designers haven't fully understood how best to exploit the technique. Given that the whole point is to make understanding how these programs work hard, that's not a huge surprise.
Building a VM is not an end for obfuscation purposes, it's a means. The actual end goal is to deploy the hash-and-decrypt pattern. I learned this technique from Nate Lawson (via this blog post) and the way his company had used it to great effect in BD+.

A custom VM is powerful not only because it puts the debugger on the wrong level of abstraction, but because you can make one of the registers hold decryption state that's applied to the opcode stream. The register can then be initialized from the output of a hash function applied to measurements of the execution environment. By carefully selecting what's measured you can encrypt each stage of the program under a piece of state that the reverse engineer must tamper with to explore what the program is doing, which will then break the decryption for the next stage. That stage in turn contains a salt combined with another measurement to compute the next key, and so on and so forth. In this way you can build a number of "gates" through which the adversary must pass to reach their end goal - usually a (server side) encrypted token of some sort that must be re-submitted to the server to authorize an action. This sort of thing can make reverse engineering really quite tedious even for experienced developers.
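The chaining described above can be sketched in a few lines. This is a toy illustration, not BotGuard's design: the measurement string, salts and "opcode streams" are all invented for illustration, and a real system would hash live browser measurements and decrypt actual VM bytecode.

```python
import hashlib

def measure_environment() -> bytes:
    # Stand-in for a real probe of the execution environment (e.g. some
    # browser quirk). If a reverse engineer stubs or tampers with this,
    # the derived key changes and every later stage decrypts to garbage.
    return b"quirk:toString-order=1"

def derive_key(measurement: bytes, salt: bytes) -> bytes:
    return hashlib.sha256(salt + measurement).digest()

def xor_stream(data: bytes, key: bytes) -> bytes:
    # Symmetric toy cipher: the same call encrypts and decrypts.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# "Server side": build a two-stage program for the expected environment.
expected = measure_environment()
salt1, salt2 = b"stage1", b"stage2"
key1 = derive_key(expected, salt1)
# Stage 1's plaintext carries the salt for stage 2; stage 2's key chains
# key1 into the next measurement, forming a second gate behind the first.
stage1_plain = b"OPCODES-1|" + salt2
stage2_plain = b"OPCODES-2"
stage1_ct = xor_stream(stage1_plain, key1)
stage2_ct = xor_stream(stage2_plain, derive_key(expected + key1, salt2))

# "Client side": the VM re-derives the keys from live measurements.
m = measure_environment()
k1 = derive_key(m, salt1)
opcodes1, next_salt = xor_stream(stage1_ct, k1).split(b"|")
stage2 = xor_stream(stage2_ct, derive_key(m + k1, next_salt))
```

If `measure_environment` is patched to return anything else, `k1` no longer matches `key1`, the `split` step typically fails outright, and even if it succeeds `stage2` is noise: a gate broken by tampering silently breaks every gate behind it.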
There are a few important things to observe at this point:
It can work astoundingly well. The average spammer is not a good programmer. Spam is not that profitable once the lower-hanging fruit has been harvested. Programming tasks that might sound easy to you or me are not always easy, or even possible, for your actual real-world adversaries.
You can build many such gates. The first version of BotGuard had on the order of 7 or 8 I think, but that was an MVP designed to demonstrate the concept to a sceptical set of colleagues. I'd assume that the latest versions have more.
If you construct your programs correctly you will kill off non-browser-embedding bots with 100% success. Spammers hate this because they are (or were) very frequently CPU constrained for various reasons, even though you'd imagine botnets would solve this.
There are many tricks to detect browser automation and some of them are very non-obvious. The original signals I came up with to justify the project were never rediscovered outside Google as far as I know, although I doubt they're useful for much these days. Don't underestimate what can be done here!
Reverse engineering one of the programs once is not sufficient to beat a good system. A high quality VM-based obfuscator will be randomizing everything: the programs, the gates and the VM itself. That means it's insufficient to carefully take apart one program. You have to be able to do it automatically for any program. Also, you will need to be able to automatically de-randomize and de-obfuscate the programs to a good enough semantic level to detect if the program is doing something "new" that might detect your bot, as otherwise you're going to get detected at some point without realizing, and then three weeks later all your IPs/accounts/domains will burn or - even better - all your customers' IPs/accounts/domains. They will be upset!
---
Only the parts related to abuse detection are obfuscated like that. The app JS is of course minified as per usual but that's for size and efficiency reasons, not signal protection. Still, if you build one of these then it's a general platform so you can hide anything inside it. At the time I left Google they were writing programs in the custom hand-crafted assembly; there was no higher level language. It's hard to represent encrypted control flow in normal languages. The programs aren't that large so it wasn't a big deal. That was nearly a decade ago though. Probably they do have higher level languages targeting the platform by now.

Performance was fine even on old browsers. Even a basic JIT eats that stuff for breakfast because it's lots of tight loops and bitwise operations. It can go wrong though. One of the more irritating bugs I had to track down was a miscompile in Opera's JIT (which dates this story - back then Opera was still a thing and used its own engine). Once the hash function got hot enough it would be "optimized" in such a way that the function succeeded but the results became wrong. If the output of a hash function is an encryption key to decrypt your program, that's going to hurt! Luckily there was a workaround.

---
In such a system, how do you deal with real users 'failing' the gates?

For example, if they are using some obscure braille browser, or an old smart TV?

For things like video view counting, you can just not count those users. But for things like account creation, the business people presumably don't want to lock out 1% of the users. Yet if you present a captcha, then that can be farmed out to people in low-wage countries and all your protections are gone.

Is there a fix?
Handled on an app-by-app basis. There's usually some fallback. For account creation it was phone verification, unless the signal of automation was unambiguous, for example (I know it sounds unlikely but these signals are often not statistical, so you can have signals with no false positives or negatives, albeit with poor coverage). I don't know what they do these days.
---
As a follow-up, how does correctly constructing the program kill off non-browser-embedding bots so effectively?
Please see the linked blog post by Nate for the general principles, or if you're really keen read the Pirate Cat Book. Briefly, the idea is to randomly measure the environment in ways that are infeasibly expensive to simulate, and use those measurements to derive new keys that allow execution to pass through the gates. The effort needed to correctly implement the browser APIs inside your bot eventually approaches the effort needed to write a browser, which is impractical, thus forcing the adversary into using real browsers ... which aren't designed for use by spammers.
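As a toy model of that idea: suppose the server, having run a real browser once, knows what a set of environment probes should return, and the concatenated answers are hashed straight into a decryption key. The probe strings and answers below are invented placeholders, not real signals:

```python
import hashlib
import random

# Hypothetical probe -> answer table, learned by running a real browser
# server-side. A bot has to reproduce every probed behaviour faithfully;
# with enough random probes over enough API surface, faking this
# approaches the effort of writing a browser.
KNOWN_ANSWERS = {
    "typeof window.chrome": "object",
    "Function('return this')().name": "",
    "(0.1 + 0.2).toString()": "0.30000000000000004",
}

def gate_key(probe_names, answers):
    h = hashlib.sha256()
    for name in probe_names:
        # The raw answers are hashed, never compared: there is no
        # "if (answer == expected)" branch for a bot author to patch.
        h.update(name.encode() + b"=" + answers[name].encode())
    return h.digest()

# Each generated program embeds a different random subset of probes.
probes = random.sample(sorted(KNOWN_ANSWERS), k=2)
server_key = gate_key(probes, KNOWN_ANSWERS)

# On the client the VM would evaluate each probe in the live environment;
# here the table stands in for a genuine browser answering correctly.
client_key = gate_key(probes, KNOWN_ANSWERS)
```

A single wrong answer changes every bit of the key, and the bot learns nothing about which probe it failed.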
---

The game continues. Back in 2010 when I was writing the first in-browser bot detection signals for Google (so BotGuard could spot embedded Internet Explorers) I wondered how long they might last. Surely at some point embedded browsers would become undetectable? It never happened - browsers are so complex that there will probably always be ways to detect when they're being automated.

There are some less obvious aspects to this that matter a lot in practice:
1. You have to force the code to actually run inside a real browser in the first place, not simply inside a fast emulator that sends back a clean response. This is by itself a big part of the challenge.
2. Doing so is useful even if you miss some automated browsers, because adversaries are often CPU and RAM constrained in ways you may not expect.
3. You have to do something sensible if the User-Agent claims to be something obscure, old or, alternatively, too new for you to have seen before.
4. The signals have to be well protected, otherwise bot authors will just read your JS to see what they have to patch next. Signal collection and obfuscation work best when the two are tightly integrated together.
These days there are quite a few companies doing JS-based bot detection but I noticed from write-ups by reverse engineers that they don't seem to be obfuscating what they're doing as well as they could. It's like they heard that a custom VM is a good form of obfuscation but missed some of the reasons why. I wrote a bit about why the pattern is actually useful a month ago when TikTok's bot detector was being blogged about:

https://www.reddit.com/r/programming/comments/10755l2/reverse_engineering_tiktoks_vm_obfuscation_part_2/
tl;dr you want to use a mesh-oriented obfuscation, and a custom VM makes that easier. It's a means, not an end.
Ad: Occasionally I do private consulting on this topic, mostly for tech firms. Bot detectors tend to be either something home-grown by tech/social networking firms, or these days sold as a service by companies like DataDome, HUMAN etc. Companies that want to own their anti-abuse stack have to start from scratch every time, and often end up with something subpar because it's very difficult to hire for this set of skills. You often end up hiring people with a generic ML background but then they struggle to obtain good enough signals and the model produces noise. You do want some ML in the mix (or just statistics) to establish a base level of protection and to ensure that when bots are caught their resources are burned, but it's not enough by itself anymore. I offer training courses on how to construct high quality JS anti-bot systems and am thinking of maybe offering, in future, a reference codebase you can license and then fork. If anyone reading this is interested, drop me an email: [email protected]
---

What do you mean by a "mesh-oriented obfuscation"? My best guess is: serving a different subset of the VM detection code to each client?
There are lots of techniques that can fall under that heading. The idea is to tie together your logic and obfuscation so that the things you have to do to undo the obfuscation end up breaking access to other parts of the program. Using the output of hash functions as decryption keys is one famous approach but there are others.
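One way to picture that tying-together, sketched in Python with invented stage names: the running code of one stage is itself the key material for the next, so any patch to the first stage silently corrupts the second.

```python
import hashlib

def code_hash(func) -> bytes:
    # Key material is the live code object of the previous stage, so a
    # patched stage_a yields a different key and garbage for stage_b.
    c = func.__code__
    return hashlib.sha256(c.co_code + repr(c.co_consts).encode()).digest()

def xor_stream(data: bytes, key: bytes) -> bytes:
    # Symmetric toy cipher; a real system would use real VM decryption.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def stage_a(x: int) -> int:
    return x * 2 + 1

# "Server side": encrypt stage B under a hash of stage A's code.
stage_b_plain = b"stage-b-opcodes"
stage_b_ct = xor_stream(stage_b_plain, code_hash(stage_a))

# Normal execution: hash the code that actually ran, then decrypt.
stage_b = xor_stream(stage_b_ct, code_hash(stage_a))
```

Patching `stage_a` (for instance, changing a constant to neuter a check) changes the hash and therefore scrambles `stage_b`, which is what makes the obfuscation a mesh rather than a stack of independent layers you can peel one at a time.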
---

Google's obfuscating VM-based anti-bot system (BotGuard) was very effective. Source: I wrote it. We used it to completely wipe out numerous botnets that were abusing Google's products, e.g. posting spam, clickfraud, phishing campaigns. BotGuard is still deployed on basically every Google product and they later did similar systems for Android and iOS, so I guess it continues to work well.
AFAIK Google was the first to use VM-based obfuscation in JavaScript. Nobody was using this technique at the time for anti-spam, so I was inspired primarily by the work Nate Lawson did on Blu-ray.
What most people didn't realize back then is that if you can force your adversary to run a full-blown web browser there are numerous tricks to detect that the browser is being automated. When BotGuard was new most of those tricks were specific to Internet Explorer, none were already known (I had to discover them myself) and I never found any evidence that any of them were rediscovered outside of Google. The original bag of tricks is obsolete now of course, nobody is using Internet Explorer anymore. I don't know what it does these days.
The VM isn't merely about protecting the tricks, though. That's useful but not the main reason for it. The main reason is to make it easier to generate random encrypted programs for the VM, and thus harder to write a static analysis. If you can't write a static analysis for the program supplied by your adversary, you're forced to actually execute it and therefore can't write a "safe" bot. If the program changes in ways designed to detect your bot, and the system is done well, there's no good way to detect this and bring the botnet to a safe halt, because you don't know what the program is actually doing at the semantic level. The generated programs can therefore detect your bot and report back to the server what they found, triggering delayed IP/account/phone number bans. It's very expensive for abusers to go through these bans, but because they have to blindly execute the generated programs they can't easily reduce the risk. Once the profit margin shrinks below the margin from abusing a different website, they leave and you win.
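A minimal sketch of that program-generation idea (a toy stack VM with invented opcode names): every generated program gets its own random opcode numbering, so a static signature learned from one sample tells the adversary nothing about the next.

```python
import random

OPS = ("push", "add", "mul")

def generate_program():
    # Fresh random opcode numbering per program; the matching dispatch
    # table is what the emitted interpreter (the "VM") would embed.
    numbering = dict(zip(OPS, random.sample(range(256), len(OPS))))
    # This program computes (3 + 4) * 2 regardless of the numbering.
    code = [numbering["push"], 3, numbering["push"], 4, numbering["add"],
            numbering["push"], 2, numbering["mul"]]
    return code, {v: k for k, v in numbering.items()}

def run(code, dispatch):
    stack, i = [], 0
    while i < len(code):
        op = dispatch[code[i]]
        if op == "push":
            stack.append(code[i + 1]); i += 2
        elif op == "add":
            b, a = stack.pop(), stack.pop(); stack.append(a + b); i += 1
        elif op == "mul":
            b, a = stack.pop(), stack.pop(); stack.append(a * b); i += 1
    return stack.pop()

code, dispatch = generate_program()
result = run(code, dispatch)   # always 14, but the bytes differ per program
```

A real generator would also randomize operand encodings, shuffle the VM's dispatch logic and encrypt the opcode stream under environment-derived keys; the point here is only that pattern-matching on one captured program has nothing stable to latch onto in the next.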