Top Ten Data and Forecasting Tips


Here is a list of the top 10 tips I find myself giving out. It’s not in any particular order of importance, just the order they came to mind. It’s a long weekend, so writing things down helps me relax. I’d love to hear yours, so please add them to the comments.

1. If two measures correlate, stop measuring the one that takes more effort. For example, if story counts correlate with story point forecasts, stop estimating story points and just count.

2. Always balance measures. Use at least one measure from each of the following four domains: Quality (how well), Productivity (how much, pace), Responsiveness (how fast from committing), and Predictability (how repeatable). (That’s Larry Maccherone’s framing.)

3. Measure the work, not the worker. Favor flow of value over how busy people appear. It’s also less advantageous to game, giving more reliable results in the long run. Measuring (and embarrassing) people causes poor data.

4. Look for exceptions, don’t just explain the normal. Find ways to detect exceptions in measures earlier. Trends are more insightful than individual measures for seeing exceptions.

5. Capture, at a minimum: (1) the date work was started, (2) the date it was delivered, and (3) the type of work (so we can see whether an item is normal within the same type of work).

6. Scope risks play a big role in forecasts. Scope risks are things that might have to be done, but we aren’t sure yet. Track items that might fail and need reworking, for example server performance criteria or memory usage. Look for ways to detect these earlier and remove them. Removing isn’t the goal – knowing early whether they will definitely occur adds more certainty to the forecast.

7. Don’t exclude “outliers” without good reason. Have a rule, for example 10 times the most common value. Often these are really multiple pieces of work that haven’t been broken down yet, so they can’t be ignored.

8. Work often gets split into smaller pieces before delivery. Don’t use the completion rate as the forecast rate for the “un-split” backlog items. Adjust the backlog by this split rate. A split rate of 1 to 3 is the most common for software backlogs (but measure your own and adjust).

9. If work sits idle for long periods waiting, don’t expect effort estimates for an item to match calendar delivery time. In these cases, forecast system throughput rather than item sizes (story points).

10. Probabilistic forecasting is easier than most people expect. If averages are used to forecast (as in traditional burndown charts), the date they give has around a 50% chance of being hit – a coin toss. Capture historical data, or estimate in ranges, and use that.


Do Story Size Estimates Matter? Do your own experiment


This is one of the most common questions I receive when introducing forecasting. Don’t we need to know the size of the individual items to forecast accurately?

My answer: Probably not.

It depends on your development and delivery process, but often system factors account for more of the elapsed delivery time than different story sizes.

Why might story point estimation NOT be a good forecaster?

Consider commuting to work by car each day. If the road is clear of traffic, the distance travelled is probably the major driver of travel time. At peak commute time, it’s more likely that weather and traffic congestion influence travel time more than distance alone. For software development, if one person (or a team) could take a story and work undisturbed from start to delivery, then story point effort estimates would correlate with and match elapsed delivery time. If there are hand-offs to people with other specialist skills, dependencies on other teams, expedited production issues to solve, or other delays, then the story size estimate will diverge from elapsed delivery time.

The ratio of hands-on time to total elapsed time is called “process efficiency.” For software development this is often between 5% and 15% (for example, two days of hands-on work inside a four-week calendar delivery is about 10%). Meaning, even if we nailed the effort estimates in points, we would be accurately predicting only 5-15% of elapsed delivery time! We need to find ways to accurately forecast (or remove) the non-work time influenced by the entire system.

This is why using a forecasting technique that reflects the system delivery performance of actual delivered work is necessary for forecasting elapsed time. To some degree, traditional story point “velocity” does represent a pace that includes process efficiency, but it has very little more predictive power than story counts alone. So, if you are looking for an easy way to improve process efficiency, dropping the time staff spend on estimation might be a good first step.

Running your own experiment

You should run your own experiment. Prove in your environment whether story point estimates and velocity perform better than story counts and throughput for forecasting. The experiment is pretty simple: go back three months and see which method better predicts the actual, now known, outcome. You can use our forecasting spreadsheets to do this.

  1. Download the forecasting spreadsheet Throughput Forecaster.xlsx
  2. Make two copies of it; call one “Velocity Forecast.xlsx” and the other “Throughput Forecast.xlsx”
  3. Pick a prior period of time, say three months, and gather the following historical data –
    1. Number of completed stories per sprint or week. A set of 6 to 12 throughput samples.
    2. Sum of story points completed per sprint or week. A set of 6 to 12 velocity samples.
  4. For each spreadsheet, enter the known starting date, the historical data for throughput or velocity, and the sum of all samples (a total of ALL completed work over this period) as the starting story count or story point total (in the respective spreadsheets).
  5. Confirm which method predicted closest to the known completion date.

This experiment is called backtesting. We are using a historically known outcome to confirm that our forecasting tool and technique hit something we know to have occurred.
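
If you prefer a script to a spreadsheet, here is a minimal sketch of the same backtest in Python. It is an illustration only, not the spreadsheets’ exact algorithm, and the sample numbers are made up – swap in your own three months of data.

```python
import random

def simulate_completion_weeks(samples, remaining, trials=10000):
    """Monte Carlo: replay randomly chosen historical weeks until 'remaining' work is done."""
    results = []
    for _ in range(trials):
        done, weeks = 0, 0
        while done < remaining:
            done += random.choice(samples)   # pick one historical week at random
            weeks += 1
        results.append(weeks)
    return sorted(results)

# Hypothetical samples from the SAME three-month period (replace with your data).
throughput_per_week = [4, 6, 3, 5, 7, 4, 5, 6]          # stories finished each week
velocity_per_week   = [13, 21, 8, 18, 25, 11, 16, 20]   # points finished each week

# Backtest: the "backlog" is everything actually completed in that period,
# so a good forecast should land near the 8 weeks it really took.
for name, samples in [("throughput", throughput_per_week), ("velocity", velocity_per_week)]:
    runs = simulate_completion_weeks(samples, sum(samples))
    print("%s forecast: 50%% likely within %d weeks, 85%% likely within %d weeks"
          % (name, runs[len(runs) // 2], runs[int(len(runs) * 0.85)]))
```

If both forecasts land close to the known eight weeks, the extra effort of point estimation bought you very little.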

If performed correctly, both spreadsheets will be accurate. Given that, is the effort of story point estimation still worth it?

Troy


Forecasting techniques – effort versus reward


Why should I use probabilistic forecasting? This is a common question I have to answer with new clients. I always use and recommend the simplest technique that answers the specific question being asked, progressing in complexity only when absolutely necessary. I see forecasting capability in five stages of incremental improvement, each at an effort cost. Here is my simple five-level progression of forecasting techniques –

forecasting levels of capability

Level 1 – Average regression

Traditional Agile forecasting relies on using a running average and projecting that average out over time for the remaining work being forecast. This is level 1 on our capability measure. Does it work? Mostly. But it does rely on the future pace being similar to the past, and it suffers from the Flaw of Averages (read about it in the book The Flaw of Averages by Sam Savage). The flaw of averages is the term for errors in judgement made because a single value is used to describe a result that is really many possible outcomes, each more or less likely. When we project the historical average pace (story point velocity or throughput), the answer we calculate has around a 50% chance of being achieved – a coin toss away from being late. We often want better odds than that when committing real money and people to a project.


Levels 2, 3, and 4 – Probabilistic Forecasting

Probabilistic forecasting returns a fuller range of possibilities that allows the likelihood of a result to be calculated. In the software forecasting world, this is normally “on or before date x.” In a probabilistic forecast we look at what percentage of all the results we calculated were actually “on or before date x.” This allows us to say things like, “We are 85% certain to deliver by 7th August.”

Probabilistic forecasting relies upon the input parameters being non-exact. A simple range estimate like 1 to 5 days (or points, or whatever unit pace is measured in) for each of the remaining 100 items is enough to perform a probabilistic forecast. It’s the simplest probabilistic model and gets us to level 2 in our capability. The goal is that the eventual actual result for an item really does fall between 1 and 5 days. Our spreadsheet tools use this technique when estimates are set to “Range estimate” (download it here).
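
To show how little machinery a level 2 forecast needs, here is a minimal sketch in Python (an illustration, not our spreadsheet’s formulas). It treats the remaining items as worked one after another, which is a deliberate simplification – parallel work is better handled with a throughput-style model.

```python
import random

def range_estimate_forecast(items, low, high, trials=10000, percentile=0.85):
    """Level 2: every remaining item takes 'somewhere between low and high' days."""
    totals = []
    for _ in range(trials):
        totals.append(sum(random.uniform(low, high) for _ in range(items)))
    return sorted(totals)[int(percentile * trials)]

# 100 remaining items, each estimated at 1 to 5 days of elapsed time.
print("85%% confident total: %.0f days" % range_estimate_forecast(100, 1, 5))
```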

Levels 3 and 4 are more refined range-estimate forecasts. Level 3 specifies a probability distribution that lets you say whether part of the range estimate is more likely than another. Low-Most Likely-High estimates are this type of distribution. It helps firm up the probabilistic forecast by giving preference to some range-estimate values based on our knowledge of the work. Over the years different processes have demonstrated different distribution curves; for example, manufacturing often shows a bell curve (Normal distribution), while software work shows a skewed distribution where the lower values are more likely and there is a long tail of higher values. This allows us to take a good “guess,” given what we know about which values are more likely, and encode this guess in our tools. It is more complex, and to be honest, we only use it after exhausting a straight range estimate and proving an input factor makes a material difference in the forecast. Out of ten inputs there might be two that fall into this category.

Level 4 forecasts use historical data. Historical data is a mix of a range estimate (it has a natural lowest and highest value) and a probability for each value. Some values occur more often than others, and when we use the data for forecasting, those values are given more weight. This naturally means our forecasts match the historical behaviour of the system, giving reliable results. Our spreadsheet tools use this technique when estimates are set to “Historical data” (download it here).
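
A minimal sketch of the level 4 idea (again an illustration, not the spreadsheet’s exact formulas; the throughput history and backlog size are hypothetical): resampling your real weekly throughput until the backlog drains automatically gives the frequent values more weight.

```python
import random
from collections import Counter

# Hypothetical weekly throughput history. Values that occurred more often are
# naturally resampled more often, which is exactly the weighting described above.
history = [3, 5, 4, 7, 2, 6, 5, 4, 5, 3]
print(Counter(history))   # Counter({5: 3, 3: 2, 4: 2, 7: 1, 2: 1, 6: 1})

def weeks_to_finish(backlog, trials=10000, percentile=0.85):
    """Resample real weekly throughput until the backlog drains; report a likelihood."""
    results = []
    for _ in range(trials):
        remaining, weeks = backlog, 0
        while remaining > 0:
            remaining -= random.choice(history)   # frequent values get picked more often
            weeks += 1
        results.append(weeks)
    return sorted(results)[int(percentile * trials)]

print("85% likely to finish a 60-story backlog within", weeks_to_finish(60), "weeks")
```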

Level 5 – Simulation + Probabilistic Forecasting

Level 5 forecasts model the interactions of a process through simulation. This is the domain of our KanbanSim and ScrumSim tool (see Downloads to download this tool). It allows you to make as simple or as complex a model as you need that exhibits the same response as your organizational process. This not only helps you understand the system and forecast in detail, it also allows you to perform “what-if” experiments to detect which factors and process setup/assumptions give a desirable result. This what-if analysis is often called sensitivity analysis, and we use it to answer complex process questions with reliable results. But it takes some work, and if your process is changing, inconsistent, or unstable, then this may not be the best investment of time. We can help advise whether we think you need this level of forecasting.

Which one should you use?

Avoid any regression-based forecasting. With our free spreadsheets and tools there is little upside in doing it the “traditional” way and risking the Flaw of Averages causing you to make a judgment error.

Our advice: use a probabilistic technique at level 2 if you have no historical data, or level 4 if you do. All of our spreadsheet tools allow you to use either range estimates or data for the forecast inputs. Given it’s free, we can’t break down the barrier to entry any more than we have – download it here.

Use our simulator if you have complex questions, and we are here to help you make that step when you need it.

Troy.

 


KanbanSim and ScrumSim v2.0 Released + Simplified Licensing


We are growing up. We made it to V2.0 of our flagship product, KanbanSim and ScrumSim. We have added over 100 new features since we launched.

We have invested heavily in improving the interactive modeling features that customers use to quickly experiment with model input impact analysis and find optimal solutions (e.g., drag the number-of-developers slider and see the cost and date impact). We have also invested heavily in the model editor, adding code completion, inline documentation, and model snippets that make creating new models faster.

Our licensing has also been updated to match how we really did it anyway, and it’s to your benefit –

  1. KanbanSim and ScrumSim is FREE (no catch) for individuals and for companies with up to 10 employees.
  2. If your company has more than 10 employees (it’s the honor system), licenses are $995 per person.
  3. If your company wants annual software maintenance and support, it’s $4,995 per 10-license pack per division, and then 20% a year to renew.

We simplified our licensing because we wanted no barriers to getting started, and we found that even our generous 6-12 month trial period made some customers uncomfortable about starting. We also found that larger companies felt uncomfortable having to pay so little! So, we want to help them feel “at ease” knowing they get every version the moment it’s released, plus email and phone support if necessary.

See our Downloads page to get the latest version. And please, tell your friends.


Latent Defect Estimation – How many bugs remain?


Get the spreadsheet here -> Latent Defect Estimation Spreadsheet

Not all software is perfect the moment it is written by a sleep-deprived twenty-year-old developer coming off a Game of Thrones marathon weekend. Software has defects. Maybe minor, maybe not, but it’s more likely than not that software has undiscovered defects. One problem is knowing when it’s safe to ship the version you have: should testing continue, or will customers be better off having this version, which solves new problems for them? It’s not about zero (known) defects. It’s about getting value to the customer faster so their feedback can help drive future product direction. There is risk in too much testing and beta trial time.

Yes, you heard right. We want an estimate of something we haven’t found yet. In actual fact, we want an estimate of “if it is there, how likely would we have been to see it.” A technique used by biologists for counting fish in a pond becomes a handy tool for answering this fishy question as well. How many undiscovered defects are in my code? Can (or should) we ship yet?

The Capture-Recapture Method

The method described here is a way to estimate how well the current investigation for defects is working. The basic principle is to have multiple individuals or groups analyze the same feature or code and record their findings. The ratio of overlap (found by both groups) and unique discovery (found by just one of the groups) gives an indication of how much more there might be to find.

I first encountered this approach by reading the work of Watts Humphrey, who is notable for the Team Software Process (TSP) and worked out of Carnegie Mellon University’s Software Engineering Institute (SEI). He first included capture-recapture as a method for estimating the latent defect count as part of the TSP. Joe Schofield has also published more recent papers on implementing this technique for defect estimation, and it’s his example I borrow here (see references at the end of this post).

I feel compelled to say that not coding a defect in the first place is superior to estimating how many you have to fix, so this analysis isn’t permission to skip avoiding defects by any and all extraordinary methods (pair programming, test-driven development, code reviews, earlier feedback). It is far cheaper to avoid defects than to fix them later. This estimation process should be an “also,” and that’s where statistical sampling techniques work best. Sampling is a cost-effective way to build confidence that if something big is there, chances are we would have seen it.

The capture-recapture method assigns one group to find as many defects as they can for a feature or area of code or documentation. A second (and third or fourth) group tests and records all defects they find. Some defects will be duplicates, and some will be uniquely discovered by just one of the groups.

This is a common technique used to answer biological population questions. Estimating how many fish are in a pond is achieved by tagging a proportion of the fish, returning them to the pond, and then recapturing a sample. The ratio of tagged versus untagged fish in that sample allows the total number of fish in the pond to be estimated. Rather than fish, we treat the defects found by one group as the tagged fish and compare them with the defects found by a second group. The degree of commonality between the defects found gives an estimate of how thorough defect discovery has been.

If two independent groups find exactly the same defects, it is likely that the latent defect count is extremely low. If each independent group finds only unique defects, then it’s likely that test coverage isn’t high, a large number of defects remain to be found, and testing should continue. Figure 1 shows this relationship.

Figure 1

The capture-recapture method uses the overlap from multiple groups to estimate how many undiscovered defects still exist. This assumes both groups feel they have thoroughly tested the feature or product.

Capture-recapture overlap Venn diagrams

Equation 2 shows the two-part calculation required to estimate the number of undiscovered defects. First, the total number of defects is estimated by multiplying the count of defects found by group A by the count of defects found by group B, then dividing by the count of defects found by both (the overlap). The second step subtracts the count of defects found so far (it doesn’t matter who found them) from the estimated total. The result is the number of defects still undiscovered.

Equation 2

Capture-recapture equations
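
The same two-step calculation as a minimal Python sketch; the defect counts are hypothetical, purely to show the arithmetic.

```python
def latent_defects(found_by_a, found_by_b, found_by_both):
    """Capture-recapture estimate of defects not yet discovered.

    Step 1: estimated total = (A x B) / overlap.
    Step 2: latent = estimated total - everything found so far (by anyone).
    """
    if found_by_both == 0:
        raise ValueError("No overlap yet - too little testing to estimate a total.")
    estimated_total = found_by_a * found_by_b / found_by_both
    unique_found = found_by_a + found_by_b - found_by_both
    return estimated_total, estimated_total - unique_found

# Hypothetical bug bash: group A logged 25 defects, group B logged 20, 10 were the same.
total, latent = latent_defects(25, 20, 10)
print("Estimated total defects: %.0f" % total)        # 50
print("Estimated still undiscovered: %.0f" % latent)  # 15
```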

Figure 2 shows a worked example of capturing which defects each group discovered and using Equation 2 to compute the total estimated defect count and the estimated latent (undiscovered) defect count. Three defects are estimated to be still lurking. This estimate doesn’t say how big they are, or whether it’s worth proceeding with more testing, but it does say that it’s likely two-thirds of the defects have been found, and that the most egregious defects are likely to have been found by one of the two groups. Confidence building.

Figure 2

Example capture-recapture table and calculation to determine how many defects remain undiscovered.

Capture-recapture defect table analysis

To understand why Equation 2 works and how we got there, we take the generic fish-in-the-pond capture-recapture equation and rearrange it to solve for the total fish in the pond, which in our context is the total number of defects for our feature or code. Equation 3 shows this transition step by step (thanks to my lovely wife for the algebra help!).

Equation 3

The geeky math. You don’t need to remember this. It shows how to get from the fish in the pond equation to the total defects equation.

 

Latent defect estimation using capture-recapture algebra
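
For reference, the rearrangement is the standard Lincoln–Petersen capture-recapture argument; a sketch of the algebra in the defect wording of this article is:

```latex
% Group A "tags" defects; group B "recaptures" some of them.
% The proportion of tagged defects in B's catch estimates the proportion tagged overall:
\frac{\text{FoundByBoth}}{B} = \frac{A}{\text{Total}}
\;\Rightarrow\;
\text{Total} = \frac{A \times B}{\text{FoundByBoth}}
\;\Rightarrow\;
\text{Latent} = \text{Total} - \left(A + B - \text{FoundByBoth}\right)
```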

Like all sampling methods, it’s only as valid as the samples. The hardest part I consistently struggle with is getting multiple groups to report everything they see. The duplicates matter, and people are so used to NOT reporting something already known that it’s hard to get them to do it. I suggest going to a simple paper system. Give each group a different color of post-it note pad and collect the notes only at the conclusion of their testing. Collate them on a whiteboard, sticking them together if they are the same defect, as shown in Figure 3. It’s relatively easy to count the total from each group (yellow stickies and blue stickies) and the total found by both (the ones attached to each other). Removing the electronic tool avoids people seeing prematurely what the other groups have found.

Figure 3

Tracking defects reported using post-it notes. Stick post-its together when found by both groups.

Example of capture-recapture of defects using post-it notes.

Having an intentional process for setting up a capture-recapture experiment is key. This type of analysis takes effort, but the information it yields is a valuable yardstick of how releasable a feature currently is. It’s not a total measure of quality – the market may still not like the solution as developed, which is why there is risk in not deploying it – but customers certainly won’t like it more if it is defect-ridden. Customers need a stable product to give reliable feedback about improving the solution you imagined, rather than just “this looks wrong.” The two main capture-recapture experiment vehicles are bug-bash days and customer beta test programs.

Bug-Bash Days

Some companies have bug-bash days, where all developers are given dedicated time to look for defects in certain features. These are ideal days to set multiple people the task of testing the same code area and performing this latent defect analysis. It helps to have a variety of skill sets and skill levels perform this testing; it’s the different approaches to and expectations of using a product that kick up the most defect dust. The only change from a traditional bug-bash day is that each group keeps individual records of the defects they find.

To set up the capture-recapture experiment, dedicate time for multiple groups of people to test independently as individuals or small groups. Two or three groups work best. Working independently is key: they should record their defects without seeing what else the other groups have found. Avoid having the groups use a common tool, because even though you instruct them not to look at other groups’ logged defects, they might (use post-it notes as shown earlier in Figure 3). They should be told to log every defect they find, even if it’s minor, and to stop only once they feel they have given the feature a good thorough look and would be surprised if they missed something big.

Performing this analysis for every feature might be too expensive, so consider doing a sample of features. Choose a variety of features that might be key indicators of customer satisfaction.

Customer Beta Programs

Another way of getting this data is by delivering the product you have to real customers as part of a beta test program. Allocate members at random to two groups; they don’t even have to know what group they are in – you just need to know during analysis. Capture every report from every person, even if it’s a duplicate of a known issue previously reported. Analyze the data from the two groups for overlap and uniqueness using this method to get an estimate of latent defects.

Disciplined data capture requires that you know what group each beta tester is in. A quick way is to use the first letter of the customer’s last name: A-K is group A, L-Z is group B. The membership counts won’t be exactly equal, but it is an easy way to get roughly two groups. Find an easy way in your defect tracking system to record which groups reported which defects. You need a total count found by group A, a total count found by group B, a count of defects found by both, and a total number of unique defects reported. If you can, add columns or tags to record “Found by A” and “Found by B” in your electronic tools and find a way of counting based on these fields. If this is difficult, set a standard for defect titles by appending an “(A)”, “(B)” or “(AB)” string to the end of the title. Then you can count the defects found only by A, only by B, and by both by hand (or, if clever, by search).
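
If you do end up appending “(A)”, “(B)” or “(AB)” to defect titles, the counting and the Equation 2 estimate can be scripted in a few lines. A minimal sketch with made-up defect titles:

```python
defect_titles = [                       # hypothetical exported defect list
    "Login fails on Safari (AB)",
    "Cart total off by one cent (A)",
    "Crash when profile photo is too large (B)",
    "Search ignores accents (A)",
    "Checkout button overlaps footer (AB)",
]

found_by_a    = sum(t.endswith("(A)") or t.endswith("(AB)") for t in defect_titles)
found_by_b    = sum(t.endswith("(B)") or t.endswith("(AB)") for t in defect_titles)
found_by_both = sum(t.endswith("(AB)") for t in defect_titles)

estimated_total = found_by_a * found_by_b / found_by_both               # Equation 2, step 1
latent = estimated_total - (found_by_a + found_by_b - found_by_both)    # Equation 2, step 2

print("A: %d, B: %d, both: %d" % (found_by_a, found_by_b, found_by_both))
print("Estimated latent defects: %.1f" % latent)
```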

There will be a point of diminishing returns in continuing the beta, and this capture-recapture process could be used as a “go” indicator that the feature is ready to go live. In this case, keep the analysis running until the latent defect estimate falls below a trigger value that indicates deployment quality. Using this analysis could shorten a beta period and get a loved product into customers’ hands earlier, with the revenue benefits that brings.

Summary – don’t do this by hand

We of course have a spreadsheet for this purpose. We are still getting it to shareable quality, but the equations and mathematics match this article and have been used successfully in commercial settings. Please give it a try and let us know how it works for you.

Get the spreadsheet here -> Latent Defect Estimation Spreadsheet

Capture-recapture spreadsheet.

References

http://www.ifpug.org/Conference%20Proceedings/ISMA3-2008/ISMA2008-22-Schofield-estimating-latent-defects-using-capture-recapture-lessons-from-biology.pdf

http://joejr.com/CRMQAI.pdf

Humphrey, W. (2000). Introduction to the Team Software Process, pp. 345–350.

 


Metrics don’t have to be evil – 5 Traps and tips for using metrics wisely


Problem in a nutshell: Metrics can be misused. Metrics can (and will) be gamed. This doesn’t mean we should avoid using any quantitative measures for team and project decision making – we just need to know why and what we are measuring, and interpret the results accordingly.

“Just like dynamite, it would appear that metrics can be used for good as well as evil. It all depends on how you use them.”

1. Don’t embarrass people

Embarrassing people is easy to do when showing metrics they feel responsible for. This causes data to be hidden, obscured, and misreported, leaving you with an incomplete and inaccurate picture even when you have data. Once you embarrass someone, that’s the last time they will trust any metric, and the last time you have an accurate metric.

Do

  • Focus on trends rather than single point values.
  • Leave axis values off charts where possible; focus people on trends.
  • Exclude any name information. It’s OK for a team to identify themselves, but NOT for others to point out another team.

 

Figure 1 – It’s the trend that matters. Leaving off team names and axis values helps people compare the trend.

2. Focus on Trends Not Individual Values

Trends are charts of the same measure over time. They help make sense of noisy data by showing the relative direction of change. Figure 1 shows a trend-line applied to cycle time data. The orange line is the team looking at its own data; the grey line is the trend of the same measure for the rest of the company. This chart shows that the team is driving down its cycle time average over time, whereas the company trend is flat.

Do

  • Capture data that helps show trend values over time
  • Add a linear trend-line to the data to help see the big picture of change
  • Help teams see how their trend tracks against “others” in similar situations
  • “Others” means teams in SIMILAR situations; don’t compare apples with oranges, e.g. sustainment teams versus production support teams.

3. Use Balanced Metrics

Tracking just one metric promotes overdriving that metric at the expense of everything else. Multiple opposing metrics should be shown together with equal weight, with the emphasis on trading off a measure where you are above trend for one that is trending worse than others. Changing one metric is easy; changing that metric without decimating another is much harder.

Larry Maccherone, in his “Software Development Performance Index,” uses one metric from each of the following quadrants (a rough computational sketch follows the list) –

Responsiveness – Time in Process average (often called cycle time).

Productivity – Throughput / team size (dividing by team size helps normalize, making bigger-team and smaller-team trends comparable)

Predictability – Variability of the throughput / team size values. Helps teams identify that they have peaks and troughs rather than smooth flow

Quality – How ready to release is the codebase? Could be the number of open blocking P1 or P2 defects, or a score based on passing tests, number of unmerged feature branches, and performance regressions. This is always the most difficult to define for each company. Avoid defect counts alone. Find ways to make quality mean improved customer experience.
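
As a rough illustration of turning weekly team data into these four numbers, here is a minimal sketch. The sample values, the coefficient-of-variation choice for predictability, and the blocking-defect count for quality are my assumptions, not Maccherone’s exact index definitions.

```python
from statistics import mean, pstdev

# Hypothetical weekly samples for one team.
cycle_times_days  = [4.5, 6.0, 3.0, 8.0, 5.0, 4.0]   # time in process per finished item
weekly_throughput = [6, 9, 4, 8, 7, 5]               # items finished per week
team_size = 5

responsiveness = mean(cycle_times_days)              # average time in process (lower is better)
productivity = mean(weekly_throughput) / team_size   # normalized so team sizes are comparable
predictability = pstdev(weekly_throughput) / mean(weekly_throughput)  # variation; lower is smoother
quality = 2   # hypothetical count of open release-blocking defects

print("Responsiveness (avg days in process):  %.1f" % responsiveness)
print("Productivity (items/week/person):      %.2f" % productivity)
print("Predictability (throughput variation): %.2f" % predictability)
print("Quality (open blocking defects):       %d" % quality)
```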

Do –

  • Look for opposing measures. NO team should be able to be BEST at all of them, just one or two.
  • Being BEST in a measure is an alarm! It means the team may be overdriving one measure at the sacrifice of others.
  • Always show the measures together so people can see the tradeoffs they are making
Always show balanced metrics together. Avoids focus on just one.

4. Use Sampling – Track some metrics just sometimes

Some metrics are expensive to capture, and you don’t need every metric all of the time. Sampling allows data to be captured for a short period to get a snapshot of how high or low the metric is compared to what you estimated. For example, how much interrupt-driven work is the team fielding? Get the team to stick a post-it note on a whiteboard every time they do a “small job.” Over the week you will get a good indication of the percentage and can make appropriate process changes. You can repeat for one week next month and not track for the other three. This makes the cost of getting this metric a quarter of the original and gives the same result! Sampling is a powerful and underused technique.

Do

  • For measures that rely on people doing extra work to capture, use sampling. For example, track one week a month.
  • It takes less data than you think. Eleven samples give a representative picture of a measure; by 30 samples you are almost certain the result is similar to measuring every sample.

5. What, So What, Now What – Help people see the point

There has to be a reason for tracking and showing a metric. Make it clear how a metric’s trend leads to better decisions and improvement. If people don’t know why a metric is being tracked, they will assume it’s to track them personally! Help them see it’s about the work and the system, not the worker and their livelihood!

Do

  • Promote system metrics rather than personal metrics
  • Promote team metrics rather than personal metrics
  • Share how a trend of a metric has led to a better decision or improvement
  • Be vigilant about dropping metrics that are tracked just because they are available to capture – have a reason

In summary

Metrics aren’t evil. Although they are often misused, they don’t need to be. Make people responsible for determining actions on their own metrics. Send us ideas and stories about what you have seen work and fail.

 


Risks – Things that could make a big difference


Problem in a nutshell: Sometimes extra work needs to be done before delivery because something went wrong, or because when a feature was built something was learned that means additional innovation is required. How can these factors be managed in a forecast and dealt with earlier? We find asking the simple question “What could go wrong?” helps us be more right when forecasting.

Features or project work starts with a guessed amount of work. As the feature is built, technical learning can cause delays. For example, when a feature for giving suggestions about what other products you might buy turns out to be too slow to be useful during real-time shopping, additional work may be needed to build an index server specifically to make these results return faster. From a probabilistic perspective, there is a known amount of work (the original feature) and an additional “possible” amount of work if it performs poorly. This is a risk. It has a probability of being needed (less than 100%) and an impact if (and ONLY if) it comes true.

If we performed a simple Monte Carlo simulation for this scenario and said that there was a 50% chance performance would fail, the result would be an equal chance of an early date and a later date. There would also be a distribution of uncertainty around each of these dates. The result would be “multi-modal” – jargon meaning more than one peak of high probability. The average delivery date is early July, but it has almost NO CHANCE of occurring! It will be around mid-June or early September, based mainly on whether this risk comes true.

Figure 1 – Monte Carlo of a 50% risk, produced with our Single Feature Forecaster spreadsheet.

What does this mean? A few things –

  1. Estimating and quibbling over whether a story is a 5-point or 8-point story is pointless. That changes the result in this case by a few weeks. Stop estimating stories and start brainstorming risks.
  2. If we know that risks can cause these bi-modal probability forecasts, we need to stop using AVERAGES, which would give us the nonsense July delivery date that won’t happen.
  3. Probabilistic forecasting is necessary to make sense of this type of outcome. But how?

How do you forecast these risks?

It seems harder than it is. Here is how I generated the above forecast (Figure 1) using the Single Feature Forecaster spreadsheet, which uses no macros or programmatic add-ins – it’s PURE formula, so it’s not that complex to follow. Monte Carlo forecasting plays out feature completion thousands of times. In the chart image shown in Figure 1 above, you can see the first 50 hypothetical project outcomes in the lower chart (it looks like lightning strikes). There are two predominant ways the forecast plays out, with some variability based on our range estimates for the number of stories and for throughput (it could be actual throughput data; I just started with a range of 1 to 5 stories per week, but use data when you can). It’s either shorter or longer, with not a lot of chance in between.
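
Outside the spreadsheet, the same bi-modal behaviour takes only a few lines to reproduce. A minimal Python sketch: the 1 to 5 stories-per-week range and the 50%-likely, 30-40-story risk come from this example, while the 40-60 story base backlog is a made-up placeholder because the figure’s exact inputs aren’t listed here.

```python
import random
from collections import Counter

def one_run():
    stories = random.randint(40, 60)        # base feature size (hypothetical range)
    if random.random() < 0.50:              # 50% chance the performance risk comes true...
        stories += random.randint(30, 40)   # ...adding 30-40 stories for an index server
    weeks = 0
    while stories > 0:
        stories -= random.randint(1, 5)     # throughput: 1 to 5 stories per week
        weeks += 1
    return weeks

runs = sorted(one_run() for _ in range(10000))
print("50% of runs finish within", runs[5000], "weeks")   # with a 50/50 risk this lands between the peaks
print("85% of runs finish within", runs[8500], "weeks")
print(Counter(w // 4 for w in runs))   # crude monthly buckets; two clusters, not one bell curve
```

Reading likelihoods off the sorted runs is, conceptually, all a probabilistic forecast is doing.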

Here are the basic forecast guesses for this feature –

Figure 2 – The main forecast data to deliver a feature.

Once we have this data, let’s enter the risks. In this case, just one –

Figure 3 – Risks definition

The inputs in Figure 3 represent a risk that has a 50% chance of occurring and, if it does, requires 30 to 40 more stories to implement an index server. This risk (30-40 stories picked at random) is added to the forecast 50% of the time. The results shown in Figure 1 clearly show that, to be predictable in forecasting the delivery date, determining which peak is more likely is critically important. If the longer date is unacceptable, reducing the probability of that risk early is beneficial. As a team or a coach, I would set the team a goal of halving the probability of needing an index server (from 50% to 25%), or determining early whether an index server is certainly needed and the later date is real.

For example, a technical spike determines it is less likely that an index server is needed. The team agrees there is now a 25% chance, having ruled out three of the four reasons an index server might be needed. The only change in the spreadsheet is the risk likelihood being reduced to 25% (from the 50% shown in Figure 3). The forecast now looks like this –

Figure 4 – 25% chance of performance risk.

It’s clear that there is now a 75% chance of hitting June versus September. This is well worth knowing, and until we can show how the things that could go wrong cause the stress we feel when asked to estimate a delivery date, the conversation is seen as the team being evasive rather than carefully considering what they know.

This example is for a single major delivery-blocker risk. It’s common for there to be 3 to 5 risks like this in significant features or projects. The same modeling and forecasting techniques work, but rather than just two peaks, there will be more peaks and troughs. The strategy stays the same: reduce likelihoods, and prove early whether a risk is certain. Then make good decisions with a forecast that constantly shows the remaining uncertainty.

Conclusion

If you aren’t brainstorming risks and forecasting them using Monte Carlo simulation, you are likely to miss dates. Averages aren’t useful when forecasting the multi-modal outcomes common to IT projects. Estimating work items is the least of your worries in projects and features where technical risks abound. We find that around three risks commonly cause most of the chaos, and we rarely find none.

Main point – it’s easier than you think to model risk factors, and we suggest that you take a look at our spreadsheets that support this type of analysis.

Troy

 

 
