From: Steven M. Christey (coleymitre.org)
Date: Thu Oct 16 2008 - 18:53:37 CDT
(apologies if this was posted twice)
I supported the NIST effort, primarily as part of my CWE work at
MITRE. Mostly, I was an evaluator of the tool results, but I've also
assisted in some of the design and interpretation of results.
I'll assume that people have read the SATE page and are somewhat
familiar with how this project was run.
First and foremost:
SATE was an "exposition," not a scientific "experiment." We were
trying to figure out the things that could be done to evaluate tools,
NOT to actually evaluate tools. We've learned a lot, but the
resulting human analysis is neither reliable nor repeatable. I'm not
talking about the raw data that was generated by tools and dumped into
a shareable format - the raw data generated from the tools is what it
is. It's our interpretations of the results, and the conclusions
that should be drawn from them, that pose the greatest challenge
(especially because you know damn well that someone somewhere is going
to stack stats against each other and compare tools using data that we
keep saying is not appropriate for that purpose).
Some terminology, as used below:
- "NIST" - the SATE project leads working at NIST, namely Paul
Black, Vadim Okun, and Romain Gaucher.
- "we" - me and other people who evaluated tool results in SATE,
some of whom do not work at NIST.
- "tool" - a code scanning tool or service that participated in SATE.
- "test case" - one of the open source packages that was analyzed.
- "bug report" - a single item as identified by a tool.
- "raw data" - raw data generated from the tools, including bug
reports and supporting documents.
- "evaluation" - the HUMAN determination as to whether the tool's
bug report is correct or not (e.g. true/false positive).
OK. SATE was a big job and we did what we could with the resources we
had. There were several major factors affecting the analysis:
- Number of bug reports. We only manually reviewed about 10% of
47,000 individual bug reports. Yes, 47 thousand. There are a lot of
reasons why the number was that high: running tools in default mode,
running early-generation tools like flawfinder, running tools
against software that wasn't written with security in mind, etc.
- After the exposition was underway, we realized that we needed to
more precisely define what a "true" and "false" positive really
meant. Consider a buffer overflow in a command line argument to an
application that's only intended to be run by the administrator.
Sometimes this was evaluated as a false positive, sometimes as a
true positive. It's a genuine bug - but how much do you care? Then
there's stuff that's true, but it's not necessarily a bug, and you
don't necessarily care - consider the failure to use symbolic names
in security-critical constants (CWE-547) or general code-quality
assessments (not security, but important for many people).
That's some of the easy stuff. Eventually, we (the evaluator team)
settled on a crude definition for true/false positives, but that was
*after* we'd already been evaluating bug reports for a while. Due
to the scale of the effort, we didn't always go back to fix our
results. So, the evaluation data is inconsistent.
Then there's the question of false negatives. If tool X doesn't
even TEST for a specific type of issue, then is it really a false
negative?
- NIST designed a database/web interface for importing the results
from various tools, and evaluating each bug report. They used a
simple exchange format that, in retrospect, was not expressive
enough. We did NOT run the tools live. I think NIST did a solid
job in developing the database and web interface, especially in such
a short amount of time. For example, you can look at a line of code
and quickly find which other tools reported issues for surrounding
lines of code.
Since we used the database instead of the live tools, we did not
have access to the analysis environment that the tools had. Some of
these tools have powerful capabilities (e.g. navigation and
visualization) that help to interpret results. So, this led to
increased labor and a lot of ambiguous results (*you* try following
a logic chain that's 20-deep. Well OK, *you* can because you're
special, but I can't. Not all day anyway.)
So, while the database was nice for browsing through multiple issues
from multiple tools, and doing lots of basic data hacking, the lack
of integration of live tools was really limiting.
- We had a fairly large number of issues that we thought were
ambiguous, i.e. we couldn't tell if they were correct or not, even
accounting for the lack of live analysis environment from the tools.
Sometimes, this was because we needed to know more about the test
case's operating context than we had. For example, I remember one
issue where I couldn't be sure if it was a true/false positive. The
issue was a false positive on every platform but Solaris. But for
Solaris, I had to know whether a certain field of a low-level OS
data structure would ever exceed 47 bytes. In the Solaris include
files that I looked at, it didn't, but I didn't look at every
relevant version either (and then there's SPARC vs. x86).
- We did not directly address potential sources of bias. There was a
general focus on "high-priority" issues (however THAT's defined -
tools varied), but there were lots of differences in how we did the
human evaluation. For example, during the course of my evaluation
period (a few weeks), I concentrated mostly on the C programs, not
Java; sometimes I concentrated on a single test case or one file,
sometimes on a single type of vulnerability; sometimes just on a
single tool (maybe I was trolling for new CWE entries, or just
curious); and sometimes, I just casually browsed through the raw
results and grabbed whatever I thought looked interesting.
Generally, I stuck with one particular test case.
Obviously, my individual approach was very informal. Others on the
team were more focused, but there were still different biases at
play. So, the evaluation data is NOT representative.
- We did not schedule enough time for a review period, so tool vendors
didn't have enough time to get back to us on the results of our
evaluations. So the evaluation data - already imperfect for reasons
I've discussed - has not been sufficiently validated.
- As implied by previous results, false positive rates cannot be
determined because we weren't always correct in identifying them,
and our analysis only covered 10% of the data.
- People are very interested in false negatives. A multi-tool
evaluation could help estimate these rates (just see which true
positives weren't mentioned by other tools) - however, you can't do
this with incorrect data! Plus what I mentioned before - is it
really a false negative if a tool doesn't look for that type of
issue?
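To make the false-negative point concrete, here is a minimal sketch in Python of the multi-tool comparison described above. Everything in it is an assumption for illustration: the Finding record, the tool names, the file paths, and the idea that each tool publishes the weakness classes it checks for. It is not how the SATE database worked. It also matches findings by exact file and line, which is too strict whenever two tools report different links in the same chain.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    """A confirmed true positive (hypothetical record layout)."""
    path: str
    line: int
    cwe: str  # weakness class, e.g. "CWE-120"


def candidate_false_negatives(confirmed, reported_by, checks_for, tool):
    """Split the confirmed findings that `tool` did not report.

    Returns (in_scope, out_of_scope). in_scope holds misses in weakness
    classes the tool claims to check; out_of_scope holds misses in
    classes the tool never looks for.
    """
    missed = confirmed - reported_by[tool]
    in_scope = {f for f in missed if f.cwe in checks_for[tool]}
    return in_scope, missed - in_scope


# Made-up data: three confirmed findings and two hypothetical tools.
a = Finding("src/net.c", 120, "CWE-120")
b = Finding("src/auth.c", 45, "CWE-476")
c = Finding("src/db.c", 88, "CWE-89")
confirmed = {a, b, c}
reported_by = {"tool_x": {a}, "tool_y": {a, b, c}}
checks_for = {
    "tool_x": {"CWE-120", "CWE-476"},
    "tool_y": {"CWE-120", "CWE-476", "CWE-89"},
}

in_scope, out_of_scope = candidate_false_negatives(
    confirmed, reported_by, checks_for, "tool_x"
)
print(len(in_scope), len(out_of_scope))  # 1 1
```

Here tool_x misses two confirmed findings, but only one of them is in a class it claims to check. Whether the other one counts as a false negative is exactly the open question, and none of this helps if the "confirmed" set is itself wrong.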
All that said:
- I'm glad to have participated in SATE. People will criticize it,
and/or misinterpret the results, but hopefully everyone will learn
- I believe that one of the best outcomes of this exposition, which
everybody will ignore, is the lessons learned with respect to the
database, web interface, exchange format, and the improved awareness
of how tools might generate different reports for the same thing.
- We did find some examples of false negatives. For my contribution
to the December release, I plan to include specific code examples to
help highlight some of these areas; if one tool missed it, then
maybe others will, too.
- We did find differences between tool results. In December, I'll
provide more details on WHY I think some of those differences might
exist. In one example, tool X flagged the failure to check the
result of a malloc() call; tool Y flagged the line of code that did
the NULL pointer dereference. They were reporting two links in the
same chain, but it looks like the results are different. Consider
layering issues. A software security auditor might flag the whole
test case, saying "you don't have a centralized input validation
mechanism." That design-level feature could translate into numerous
XSS or SQL injection errors by a code-auditing tool. The issues are
related, and kind of the same, but mostly not.
There are other counting issues, too. Consider a library function
that, if called incorrectly, contains a buffer overflow. One tool
might flag the library function as one bug; a different tool might
flag 20 distinct code paths, where each called that library function
with potentially dangerous input. Is it one bug or 20? (Ask that
question about strcpy() if you don't get my drift). Obviously,
these differences will seriously affect your results.
- If it was sometimes difficult for me to interpret the results (even
allowing for the lack of access to live tools), then it will
probably be a challenge for a lot of developers who don't do
security all the time. Maybe the code samples we analyzed were
particularly complex, but I don't think so. There will probably be
significant disagreement with me on this sobering conclusion.
- The SAMATE people at NIST hope to perform a modified "exposition" in
2009, and this will probably involve important design enhancements.
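The "one bug or 20?" counting question above can be shown with a small Python sketch. The report tuples, tool names, function name, and file paths are all made up for illustration, and the exchange format SATE actually used is not shown here. The same underlying flaw yields very different totals depending on whether you count every report or every distinct vulnerable function.

```python
from collections import Counter

# Hypothetical bug reports: (tool, vulnerable function, reported location).
# tool_x flags the library function once; tool_y flags 20 call paths.
reports = [("tool_x", "copy_name", "lib/util.c:30")]
reports += [
    ("tool_y", "copy_name", f"src/caller{i}.c:{10 + i}") for i in range(20)
]


def count_by_report(reports):
    """Count every bug report as its own bug."""
    return Counter(tool for tool, _sink, _site in reports)


def count_by_sink(reports):
    """Count one bug per distinct vulnerable function, per tool."""
    distinct = {(tool, sink) for tool, sink, _site in reports}
    return Counter(tool for tool, _sink in distinct)


print(dict(count_by_report(reports)))  # {'tool_x': 1, 'tool_y': 20}
print(dict(count_by_sink(reports)))    # {'tool_x': 1, 'tool_y': 1}
```

Neither count is "right." Any per-tool statistic just inherits whichever counting rule was picked, which is one more reason the numbers shouldn't be stacked against each other.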
In closing - we didn't do SATE to compare tools; we wanted to explore
methods for understanding their capabilities. The evaluation data
that was generated is not suitable for comparing tools because it's
inaccurate and incomplete. But there was a lot of useful raw data,
the database design and exchange format were a good start, and we
learned a lot that will be covered more extensively in December.
Dailydave mailing list