Chad Perrin: SOB

10 April 2007

how to analyze software security statistics

Filed under: Security — apotheon @ 10:38

Yesterday’s analysis covered the “Phishing, Spam, and Security Risk Highlights” of Symantec’s Internet Security Threat Report, volume XI. Today, rather than continue that analysis, I’m taking a break to address the important matter of how you can conduct a secondary analysis in the comfort of your own mind. I will continue to analyze the rest of the ISTR XI on an almost daily basis, until I decide I’m finished.

Do try this at home! The most important system security utility in your arsenal is, and likely always will be, your mind.

Techniques of Informal Secondary Analysis:

The following points are things to keep in mind when reading a statistical analysis of data related to computer security, or when reading some reporter’s take on such an analysis. These principles of secondary analysis may even prove useful for reading my own analyses. Take no such analysis merely at face value — instead, consider all such sources suspect, and always apply critical thought to them.

  • limited statistical sampling: The sample analyzed is probably artificially limited in some way that creates erroneous impressions that are not obvious. For instance, the Symantec Internet Security Threat Report XI analysis of operating systems was limited to MS Windows, MacOS X, Sun Solaris, HP-UX, and RHEL/Fedora. A lot of people, including supposed experts who should know better, took this to mean that some kind of analysis of “Linux” in general was done, rather than an analysis of a single OS that happens to have a Linux kernel. Similarly, sweeping generalizations were made about all available OSes based on this very limited sampling.
  • conflicts of interest: The people presenting the analysis probably operate under conditions of a conflict of interest. The vast majority of TCO comparisons between MS Windows and any Linux distribution have been conducted by someone paid by an interested party who benefits materially from a given result. The Symantec Internet Security Threat Reports are all created by a corporation that stands to lose a lot of money if people migrate away from MS Windows in droves. Sometimes, these biases and conflicts of interest can be very difficult to detect, but failing to research the matter in some depth can result in accepting data as unbiased that should be viewed with suspicion.
  • statistically insignificant sampling: It is dismayingly rare to get an analysis of a statistically significant sample. Most of the time, samples are pathetically small, which can severely limit accuracy: estimates drawn from more data are generally more reliable, because small samples are too easily skewed by externalities and transient effects. For instance, the Symantec Internet Security Threat Reports are often held up as support for some grand, sweeping conclusion. A couple of information technology news outlets declared MS Windows the “most secure” OS based on a mere six months of data, collected during a period when MS Windows was in transition between releases (new MS Windows releases tend to enjoy a brief period of relatively low exploit activity), among other effects that temporarily flattered MS Windows security statistics.
  • cherry-picking statistical data: Cherry-picking comparison data to make a point is common, particularly in vendor-sponsored reports and superficial secondary reporting of data. For instance, at least one trade rag reported on the supposedly heightened security of MS Windows based on the total number of vulnerabilities discovered during the six-month period of the Symantec Internet Security Threat Report XI, but neglected to mention that in that period only two RHEL/Fedora vulnerabilities and one MacOS X vulnerability were of high criticality, while a dozen MS Windows vulnerabilities in the same comparison were of high criticality.
  • improperly excluded comparison data: That’s not the only biased choice of data one may find. Often, data may be excluded from specific comparisons in a manner that creates false, or biased, impressions of the results. For instance, in the Symantec Internet Security Threat Report XI, much is made of the apparently low number of vulnerabilities and the quick patch times for MS Windows, but the MS Windows numbers didn’t include several classes of software vulnerabilities that bear directly on MS Windows security, such as those in Microsoft’s Achilles heel, Internet Explorer. MS Windows was reported as having only 36 vulnerabilities during that six-month period, while IE was reported as having 54 of them in the same period. Clearly, the 36 reported for MS Windows is not the whole story, especially considering that IE is effectively indivisible from the OS.
  • improperly excluded study data: This is a bit more difficult to detect than improperly excluded comparison data. With comparison data, you can find other data that seems relevant to a comparison but was not part of it if you look elsewhere within the report. In the case of improperly excluded study data, you usually need to be able to read between the lines exceedingly well, and double-check your guesses against some hard data that can prove your guesses correct. At this time, I am not certain of any improperly excluded study data in the Symantec Internet Security Threat Report XI, so I will not speculate in examples.
  • improperly included comparison data: The flip-side of improper exclusion is the case of including data in a comparison that should not be included. For instance, in Symantec’s Internet Security Threat Report XI, the RHEL/Fedora statistics give false impressions in comparison with other OSes in a couple of ways. In one case, they include mutually exclusive software package version vulnerabilities from two actually separate OSes as though they were the same OS. In another case, they include thousands of third-party applications and other pieces of software in the analysis, just because they are provided via the YUM package management system for user convenience. That is equivalent to counting vulnerabilities in Adobe Photoshop, the World of Warcraft game, Symantec Norton Antivirus, Roxio Easy Media Creator Suite, Mozilla Firefox, and thousands of other third-party applications as MS Windows vulnerabilities. In fact, many that were included in RHEL/Fedora’s count could be included in the MS Windows count as well, because they will run on either platform, but they were only counted toward RHEL/Fedora.
  • disputing data: Often, you will find that two reporting organizations provide contradictory data. No statistical analysis should be viewed in a vacuum, if it is at all possible (and practical) to avoid doing so. For instance, the data related to Sun Solaris is materially disputed by Sun Microsystems. If Sun’s charges of inaccurate data are correct, that does not bode well for the accuracy of data elsewhere in Symantec’s report.
  • verifiability of data: The more you can check the data for independent verification, the more you can usually trust it. For instance, Symantec’s data for its Internet Security Threat Reports is famously unverifiable, with the exception of certain classes of statistical vulnerability data. Even this is usually not verifiable in practical terms for anyone who doesn’t have tens of thousands of dollars to spend on the process, because it is usually not packaged for easy parsing and statistical analysis by the general public.
  • scary, but misrepresented, data: Often, data is presented in a confusing or deceptive manner. In fact, this is often done accidentally, as many so-called experts do not even realize the errors of assumption they make. For instance, the number of discovered vulnerabilities is a common statistic used by people of all supposed levels of expertise to bludgeon each other over the relative security of different OSes. This is a generally poor metric in itself for measuring security. In open source software development, in particular, the number of discovered vulnerabilities in a given time period is under most circumstances a misleading metric, as the open source development model lends itself to faster and more productive vulnerability identification by the “good guys”, who can then patch previously unknown vulnerabilities before the “bad guys” figure out the vulnerabilities on their own and develop exploits. The converse is generally true of closed source, proprietary software. MS Windows, for instance, suffers from poor internal vulnerability detection because it must contend with very limited resources in that regard, as compared with the hundreds of thousands of potential bug fixers in the open source development community. As a result, MS Windows labors under the weight of always playing catch-up with zero-day exploits. The first thing you should ask yourself when someone quotes vulnerability discovery numbers at you is “Does this mean it’s more vulnerable, or that the contrasted software has more vulnerabilities that haven’t been discovered by the ‘good guys’?” Do you feel safer when the community or organization that maintains your OS finds the vulnerabilities first, or when malicious security crackers find them first?
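
To make a couple of the points above concrete, here is a rough Python sketch of the arithmetic behind the excluded-data and cherry-picking examples. It uses only the figures quoted in this post (36, 54, and the high-criticality counts, with “a dozen” taken as 12). The helper names are made up for illustration, and adding the MS Windows and IE counts together assumes the two counts do not overlap, which is my assumption rather than anything Symantec states.

```python
# Figures quoted in the list above, all from Symantec's ISTR XI
# (a single six-month reporting period).
windows_vulns = 36   # MS Windows, as counted by the report
ie_vulns = 54        # Internet Explorer, reported separately

# High-criticality counts quoted above ("a dozen" taken as 12).
high_criticality = {"MS Windows": 12, "RHEL/Fedora": 2, "MacOS X": 1}


def combined_total(*counts):
    """Add up counts that were reported separately.

    Assumption (mine, not Symantec's): the counts do not overlap.
    """
    return sum(counts)


def ratio(a, b):
    """Return how many times larger a is than b."""
    return a / b


windows_with_ie = combined_total(windows_vulns, ie_vulns)
ie_share = ie_vulns / windows_with_ie

print("MS Windows alone:", windows_vulns)
print("MS Windows plus IE:", windows_with_ie)
print("Share of combined count from IE: {:.0%}".format(ie_share))

# Compare high-criticality counts instead of raw totals.
for name, count in high_criticality.items():
    if name != "MS Windows":
        times = ratio(high_criticality["MS Windows"], count)
        print("MS Windows high-criticality count is "
              "{:.0f}x that of {}".format(times, name))
```

Which denominator and which subset of the data you choose changes the story considerably, and that is exactly why it pays to check what was left out of a headline number.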

Related Information:

  • Security through visibility: The secrets of open source security, an article I wrote about how the open source development model affects software security. It was referenced as an authoritative resource by the Second Life Open Source FAQ.
  • eEye Security zero-day exploit tracker, a place to track some of the most widely distributed software vulnerabilities that create current zero-day exploit exposure. Note the RPC exhaustion vulnerability for MS Windows that has remained unpatched (as of this writing) for 510 days and counting, since November 2005.
  • Comparison of web browsers at Wikipedia, an overall comparison of the characteristics of many web browsers, including (sadly incomplete, but usually surprisingly up to date) current vulnerability information. These numbers actually come from Secunia and Security Focus, so you may find more information at their websites, but they are conveniently collected in one place and broken down by severity here if you want a quick glance. As of this writing, the relevant chart on this page shows IE7 and IE6 as contenders for first place in both the highest number of unpatched vulnerabilities and the greatest average severity of vulnerabilities; which of the two is in worse condition depends on which security analyst’s numbers you choose.


This has been the first intermission in my security analysis of the Symantec Internet Security Threat Report, volume XI. This is a series of (mostly) daily posts collected under the SOB category Security. You may follow this series (and further security-specific posts) via RSS using the Security Category RSS Feed.

Next, I will conclude my overview of Symantec’s “Executive Summary Highlights”, with specific attention to Symantec’s discussion of zero-day exploits, in brief.

All original content Copyright Chad Perrin: Distributed under the terms of the Open Works License