February 17, 2008: What Google's honeynet found
This paper (pdf) is absolutely fascinating, in multiple ways.
The first thing that fascinated me was the infrastructure Google is dedicating to this malware analysis. They do a fairly deep analysis of a million web pages a day, by firing up a virtual copy of Windows (unpatched), telling it to go hit a web page, then seeing if anything bad happens to that (virtual) system within the next two minutes. A million pages at two minutes each is two million VM-minutes a day; given that there are 1,440 minutes in a day, this means that at any given time, they are running about 1,400 virtual Windows systems in their honeynet.
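For anyone who wants to check that estimate, here is a minimal sketch of the arithmetic in Python. The numbers come straight from the paragraph above; the function name is mine, and the sketch assumes each virtual machine handles exactly one page for the full two minutes, with no allowance for boot or teardown time.

```python
# Back-of-the-envelope estimate of how many VMs are busy at once.
# Assumption: one page per VM, two minutes per page, no startup overhead.
PAGES_PER_DAY = 1_000_000   # pages analyzed per day
MINUTES_PER_PAGE = 2        # observation window per page
MINUTES_PER_DAY = 24 * 60   # 1,440


def concurrent_vms(pages_per_day, minutes_per_page, minutes_per_day=MINUTES_PER_DAY):
    """Average VMs running at once: total VM-minutes needed / minutes available."""
    return pages_per_day * minutes_per_page / minutes_per_day


estimate = concurrent_vms(PAGES_PER_DAY, MINUTES_PER_PAGE)
print(f"about {estimate:.0f} virtual machines running at any moment")
```

Any real setup would need headroom above this average, so treat the figure as a floor rather than a capacity plan.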
Of course the other fascinating thing was the actual point of the paper: the malware analysis. There are a fair number of correlations they’ve found - 65% of malware sites are in China, there are twice as many infected IIS sites as Apache ones, society and computer sites are more likely to have malware than porn sites (!!), etc. But two things really jumped out at me.
First, ActiveX is getting a bad rap. The paper reiterates that most of the client-side exploits occur via JavaScript.
The second insight requires an explanation of the malware distribution model Google found. Here’s a quick diagram:

Basically, in (1) the client hits a landing site which contains a hidden link to one or more hop points, each of which redirects the browser to either another hop point or to the final link in the chain: the distribution site, which is the site hosting the malware for download by the client browser. That malware may go on to download more malware, but the distribution sites are the key.
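The chain described above can be sketched as a simple lookup table. This is only an illustration of the model, not anything from the paper: the host names are made up, each site is assumed to point at a single next site (the real thing can fan out to several hop points), and a site with no onward redirect is treated as the distribution site.

```python
# Hypothetical redirect table: each site maps to the next site it sends the browser to.
# A site that does not appear as a key is treated as the end of the chain.
REDIRECTS = {
    "landing.example": "hop1.example",
    "hop1.example": "hop2.example",
    "hop2.example": "distribution.example",
}


def follow_chain(start, redirects, max_hops=10):
    """Return the sites visited in order, from the landing site to the last one reached."""
    chain = [start]
    seen = {start}
    while chain[-1] in redirects:
        nxt = redirects[chain[-1]]
        if nxt in seen or len(chain) > max_hops:
            break  # stop on redirect loops or suspiciously long chains
        chain.append(nxt)
        seen.add(nxt)
    return chain


chain = follow_chain("landing.example", REDIRECTS)
print(" -> ".join(chain))
```

The point of the model is that many different landing sites can end at the same last element, which is why counting distribution sites separately from landing sites matters.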
And in 10 months, Google found 9,340 distribution sites. They found over 180,000 landing sites (a/k/a sites with links pointing to the distribution sites), which is troubling (because it means those sites were likely compromised in some way). But fewer than 10,000 sites hosting the actual malware is startling (in a good way!), because it means that it wouldn’t be very hard for a dedicated task force to shut a lot of them down. Or to simply tell all web browsers: don’t download from these 10,000 web sites, m’kay?
It gets better. All of the distribution sites fall within only 500 Autonomous Systems (AS). An AS is essentially a group of IP addresses under the control of a single entity such as a corporation or web host. And, get this: 95% of the malware distribution sites are under the aegis of only 210 Autonomous Systems.
So, Google could give their list of malware distribution sites to the 500 entities which own those 500 AS’s, and each AS entity would have (on average) about 19 sites to clean up, or roughly 42 apiece for the 210 AS’s that host 95% of them. And malware via web exploits would be an almost solved problem.
Until the malware distributors found new web hosts, that is …
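For the curious, the per-AS averages can be checked from the figures quoted in this post. This is just my own arithmetic on those numbers (the variable names are mine), and it assumes the sites are spread evenly within each group, which they almost certainly are not.

```python
# Figures quoted in the post.
DISTRIBUTION_SITES = 9_340
ALL_AS = 500       # Autonomous Systems hosting any distribution site
TOP_AS = 210       # Autonomous Systems hosting 95% of them
TOP_SHARE = 0.95

# Average cleanup load if every AS shared the sites evenly.
avg_all = DISTRIBUTION_SITES / ALL_AS

# Average load for the 210 AS's that carry 95% of the sites.
avg_top = DISTRIBUTION_SITES * TOP_SHARE / TOP_AS

# Average load for the remaining 290 AS's, which carry the last 5%.
avg_rest = DISTRIBUTION_SITES * (1 - TOP_SHARE) / (ALL_AS - TOP_AS)

print(f"all {ALL_AS} AS's: {avg_all:.1f} sites each")
print(f"top {TOP_AS} AS's: {avg_top:.1f} sites each")
print(f"other {ALL_AS - TOP_AS} AS's: {avg_rest:.1f} sites each")
```

So the bulk of the cleanup falls on a couple hundred hosts with a few dozen sites apiece, while the rest would each have only a site or two to deal with.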