Get the spreadsheet here -> Latent Defect Estimation Spreadsheet
Not all software is perfect the moment it is written by a sleep-deprived twenty-year-old developer coming off a Game of Thrones marathon weekend. Software has defects. Maybe minor, maybe not, but it's more likely than not that software has undiscovered defects. One problem is knowing when it's safe to ship the version you have: should testing continue, or will customers be better off having this version, which solves new problems for them, now? It's not about zero (known) defects. It's about getting value to the customer faster so their feedback can help drive future product direction. There is risk in too much testing and beta trial time.
Yes, you heard right. We want an estimate of something we haven't found yet. In fact, we want an estimate of "if it is there, how likely is it we would have seen it." A technique used by biologists for counting fish in a pond becomes a handy tool for answering this fishy question as well. How many undiscovered defects are in my code? Can (or should) we ship yet?
The Capture-Recapture Method
The method described here is a way to estimate how well the current investigation for defects is working. The basic principle is to have multiple individuals or groups analyze the same feature or code and record their findings. The ratio of overlap (found by both groups) and unique discovery (found by just one of the groups) gives an indication of how much more there might be to find.
I first encountered this approach in the work of Watts Humphrey, notable for the Team Software Process (TSP) out of Carnegie Mellon University's Software Engineering Institute (SEI). He first included capture-recapture as a method for estimating latent defect count as part of the TSP. Joe Schofield has also published more recent papers on applying this technique to defect estimation, and it's his example I borrow here (see references at the end of this post).
I feel compelled to say that not coding a defect in the first place is superior to estimating how many you have to fix, so this analysis isn't permission to skip defect-avoidance practices (pair programming, test-driven development, code reviews, earlier feedback). It is far cheaper to avoid defects than to fix them later. This estimation process should be an "also," and that's where statistical sampling techniques work best. Sampling is a cost-effective way to build confidence that if something big is there, chances are we would have seen it.
The capture-recapture method assigns one group to find as many defects as they can in a feature, area of code, or documentation. A second (and third or fourth) group tests the same material and records all defects they find. Some defects found will be duplicates, and some will be uniquely discovered by just one of the groups.
This is a common technique used to answer biological population questions. Estimating how many fish are in a pond is achieved by tagging a proportion of the fish, returning them to the pond, and then recapturing a sample. The ratio of tagged to untagged fish in that sample allows the total number of fish in the pond to be estimated. Rather than fish, we treat the defects found by one group as the tagged fish and compare them with the defects found by a second group. The degree of overlap between the two sets of defects gives an estimate of how thorough defect discovery has been.
If two independent groups find exactly the same defects, it is likely that the latent defect count is extremely low. If each group finds entirely unique defects, then test coverage probably isn't high, a large number of defects remain to be found, and testing should continue. Figure 1 shows this relationship.
The capture-recapture method uses the overlap between multiple groups' findings to estimate how many undiscovered defects still exist. It assumes each group feels it has thoroughly tested the feature or product.
Equation 2 shows the two-part calculation required to estimate the number of undiscovered defects. First, the total number of defects is estimated by multiplying the count of defects found by group A by the count found by group B, then dividing by the number of defects found by both (the overlap). The second step subtracts the count of defects found so far (no matter who found them) from that estimated total. The result is the number of defects still undiscovered.
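As a sketch, the two-step calculation in Equation 2 can be coded directly. The counts below are illustrative only (not the numbers from Figure 2), and the function name is my own:

```python
def estimate_latent_defects(found_a, found_b, found_both):
    """Estimate undiscovered defects from two groups' independent findings.

    found_a    -- number of defects reported by group A
    found_b    -- number of defects reported by group B
    found_both -- number of defects reported by BOTH groups (the overlap)
    """
    if found_both == 0:
        # No overlap at all: the estimate is unbounded -- keep testing.
        raise ValueError("No overlap between groups; estimate is unbounded.")
    estimated_total = (found_a * found_b) / found_both  # Equation 2, step 1
    unique_found = found_a + found_b - found_both       # defects found so far
    return estimated_total - unique_found               # still undiscovered

# Illustrative: A found 6 defects, B found 5, and 3 were found by both.
# Estimated total = 6 * 5 / 3 = 10; found so far = 8; latent = 2.
print(estimate_latent_defects(6, 5, 3))  # → 2.0
```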
Figure 2 shows a worked example: capturing which defects each group discovered and using Equation 2 to compute the total estimated defect count and the estimated latent (undiscovered) defect count. Three defects are estimated to still be lurking. This estimate doesn't say how big they are, or whether it's worth proceeding with more testing, but it does say that it's likely two-thirds of the defects have been found, and that the most egregious defects have probably been found by at least one of the two groups. Confidence building.
Example capture-recapture table and calculation to determine how many defects remain undiscovered.
To understand why Equation 2 works and how we got there, we take the generic fish-in-the-pond capture-recapture equation and rearrange it to solve for the total fish in the pond, which in our context is the total number of defects in our feature or code. Equation 3 shows this transition step by step (thanks to my lovely wife for the algebra help!).
The geeky math. You don’t need to remember this. It shows how to get from the fish in the pond equation to the total defects equation.
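In outline, the rearrangement runs like this, writing A and B for the two groups' defect counts and AB for the overlap (treating group A's defects as the tagged fish and group B's findings as the recaptured sample):

```latex
% The fraction of tagged fish in the recaptured sample should match
% the fraction tagged in the whole pond:
\frac{\text{tagged in sample}}{\text{sample size}}
  = \frac{\text{tagged in pond}}{\text{total in pond}}
\quad\Longrightarrow\quad
\frac{AB}{B} = \frac{A}{\text{Total}}

% Cross-multiply and solve for the total:
\text{Total} = \frac{A \times B}{AB}

% Subtract everything already found (by either group) for the latent count:
\text{Latent} = \frac{A \times B}{AB} - (A + B - AB)
```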
Like all sampling methods, it's only as valid as the samples. The hardest part, one I consistently struggle with, is getting multiple groups to report everything they see. The duplicates matter, and people are so used to NOT reporting something already known that it's hard to get them to do it. I suggest going analog with a simple paper system. Give each group a different color of post-it note pad and collect the notes only at the conclusion of their testing. Collate them on a whiteboard, sticking them together if they describe the same defect, as shown in Figure 3. It's relatively easy to count the total from each group (yellow stickies and blue stickies) and the total found by both (the ones attached to each other). Removing the electronic tool avoids people prematurely seeing what the other group has found.
Tracking defects reported using post-it notes. Stick post-its together when found by both groups.
Having an intentional process for setting up a capture-recapture experiment is key. This type of analysis takes effort, but the information it yields is a valuable yardstick of how releasable a feature currently is. It's not a total measure of quality; the market may still not like the solution as developed, which is why there is risk in not deploying it, but customers certainly won't like it more if it is defect ridden. Customers need a stable product to give reliable feedback about improving the solution you imagined, rather than just "this looks wrong." The two main capture-recapture experiment vehicles are bug-bash days and customer beta test programs.
Some companies have bug-bash days, where all developers are given dedicated time to look for defects in certain features. These are ideal days to set multiple people the task of testing the same code area and performing this latent defect analysis. It helps to have a variety of skillsets and skill levels perform this testing; it's the different approaches and expectations in using a product that kick up the most defect dust. The only change from a traditionally run bug-bash day is that each group keeps individual records of the defects they find.
To set up the capture-recapture experiment, dedicate time for multiple groups of people to test independently, as individuals or small groups. Two or three groups work best. Working independently is key: they should record their defects without seeing what the other groups have found. Avoid having the groups use a common tool; even though you instruct them not to look at other groups' logged defects, they might (use post-it notes as shown earlier in Figure 3). They should be told to log every defect they find, even if it's minor, and to stop only once they feel they have given the feature a good thorough look and would be surprised if they had missed something big.
Performing this analysis for every feature might be too expensive, so consider doing a sample of features. Choose a variety of features that might be key indicators of customer satisfaction.
Customer Beta Programs
Another way of getting this data is by delivering the product you have to real customers as part of a beta test program. Allocate members at random to two groups; they don't even have to know what group they are in, you just need to know during analysis. Capture every report from every person, even if it's a duplicate of a known issue previously reported. Analyze the data from the two groups for overlap and uniqueness using this method to get an estimate of latent defects.
Disciplined data capture requires that you know which group each beta tester is in. A quick way is to use the first letter of the customer's last name: A–K is group A, L–Z is group B. It won't give exactly equal membership counts, but it is an easy way to get roughly two groups. Find an easy way in your defect tracking system to record which groups reported which defects. You need a total count found by group A, a total count found by group B, a count of defects found by both, and a total number of unique defects reported. If you can, add columns or tags to record "Found by A" and "Found by B" in your electronic tools and find a way of counting based on these fields. If this is difficult, set a standard for defect titles by appending an "(A)", "(B)", or "(AB)" string to the end of the title. You can then count the defects found only by A, only by B, and by both by hand (or, if clever, by search).
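A minimal sketch of that title-suffix bookkeeping, assuming titles end with "(A)", "(B)", or "(AB)" as described (the function name and sample titles are illustrative, not from a real tracker):

```python
def tally_beta_defects(titles):
    """Tally defects by group from "(A)"/"(B)"/"(AB)" title suffixes and
    return (found_by_a, found_by_b, found_by_both, estimated_latent)."""
    both = sum(1 for t in titles if t.endswith("(AB)"))
    only_a = sum(1 for t in titles if t.endswith("(A)"))
    only_b = sum(1 for t in titles if t.endswith("(B)"))
    a = only_a + both                       # everything group A saw
    b = only_b + both                       # everything group B saw
    total = (a * b) / both                  # Equation 2, step 1
    latent = total - (only_a + only_b + both)  # subtract unique defects found
    return a, b, both, latent

titles = [
    "Crash on save (A)", "Typo on login page (AB)", "Slow search (B)",
    "Wrong totals in report (AB)", "Icon misaligned (A)",
]
# A saw 4, B saw 3, 2 in common: total = 4 * 3 / 2 = 6, found = 5, latent = 1.
print(tally_beta_defects(titles))  # → (4, 3, 2, 1.0)
```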
There will be a point of diminishing returns in continuing the beta, and this capture-recapture process can be used as a "go" indicator that the feature is ready to go live. In this case, keep the analysis running until the latent defect count hits a lower trigger value that indicates deployment quality. Using this analysis could shorten a beta period and get a loved product into customers' hands earlier, with the revenue benefits that brings.
Summary – don’t do this by hand
We of course have a spreadsheet for this purpose. We are still getting it to shareable quality, but the equations and the mathematics match this article and have been used successfully in commercial settings. Please give it a try and let us know how it works for you.
Get the spreadsheet here -> Latent Defect Estimation Spreadsheet
Humphrey, Watts. Introduction to the Team Software Process; 2000; pgs. 345–350