Metrics don’t have to be evil – 5 Traps and tips for using metrics wisely

Posted by in Featured, Forecasting |

Problem in a nutshell: Metrics can be misused. Metrics can (and will) be gamed. This doesn’t mean we should avoid using any quantitative measures for team and project decision making – we just need to know why and what we are measuring, and interpret the results accordingly.

“Just like dynamite, it would appear that metrics can be used for good as well as evil. It all depends on how you use them.”

1. Don’t embarrass people

Embarrassing people is easy to do when showing metrics they feel responsible for. It causes data to be hidden, obscured, and mis-reported, which leaves you with an incomplete and inaccurate picture even when you have data. Once you embarrass someone, that’s the last time they will trust any metric, and the last time you have an accurate metric.


  • Focus on trends rather than single point values.
  • Leave axis values off charts where possible; focus people on trends.
  • Exclude any name information. It’s OK for a team to identify itself, but NOT for others to point out another team.


Figure 1 – It’s the trend that matters. Leaving off team names and axis values helps compare “trend”

2. Focus on Trends Not Individual Values

Trends are charts of the same measure over time. They help make sense of noisy data by showing the relative direction of change. Figure 1 shows a trend-line applied to cycle time data. The orange line is the team looking at its own data; the grey line is the trend of the same measure for the rest of the company. This chart shows that the team is driving down its average cycle time over time, whereas the company trend is level.


  • Capture data that helps show trend values over time
  • Add linear trend-line to data to help see the big picture of change
  • Help teams see how their trend tracks against “others” in similar situations
  • “Others” means teams in SIMILAR situations; don’t compare apples with oranges, e.g. sustainment teams versus production support teams.
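As a rough illustration of the linear trend-line tip, here is a minimal Python sketch that fits a least-squares line to a series of measurements and compares the slope of a team against the slope of the wider group. The weekly cycle time numbers are made up for illustration, and the function name is hypothetical; any charting tool's built-in trend-line does the same job.

```python
def linear_trend(values):
    """Fit a least-squares line through (0, v0), (1, v1), ... and return (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical weekly average cycle times in days (illustrative numbers only).
team = [12, 11, 13, 10, 9, 10, 8, 7]
company = [10, 11, 10, 10, 11, 10, 10, 11]

team_slope, _ = linear_trend(team)
company_slope, _ = linear_trend(company)

# A negative slope means cycle time is trending down; only the direction is compared.
print("team improving faster than company:", team_slope < company_slope)
```

Only the sign and relative size of the slopes are used here, which fits the advice to leave absolute axis values out of the conversation.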

3. Use Balanced Metrics

Tracking just one metric promotes overdriving that metric at the loss of everything else. Multiple opposing metrics should be shown with equal prominence, with the emphasis on trading something where you are above the trend for something that is trending worse than others. Changing one metric is easy; changing that metric without decimating another is much harder.

Larry Maccherone in his “Software Development Performance Index” uses a metric from multiple quadrants –

Responsiveness – Time in Process average (often called cycle time).

Productivity – Throughput / team size (dividing by team size helps normalize the measure, making bigger and smaller team trends comparable)

Predictability – variability of throughput / size values. Helps teams identify they have peaks and troughs rather than smooth flow

Quality – How ready to release is the codebase? Could be number of open blocking P1 or P2 defects, or a score based on passing tests, number of un-merged feature branches, performance regressions. This is always the most difficult to find for each company. Avoid defect counts alone. Find ways to make quality mean improved customer experience.

Do –

  • Look for opposable measures. NO team should be able to be BEST at all, just one or two
  • Being BEST in a measure is an alarm! It means that they may be overdriving one measure at the sacrifice of others
  • Always show the measures together so people can see the tradeoffs they are making

Always show balanced metrics together. Avoids focus on just one.
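To make the four quadrants concrete, here is a small Python sketch that computes one value per quadrant from raw team data so they can be reported side by side. This is not Larry Maccherone's actual index calculation. The function name, the use of the coefficient of variation for predictability, and the open blocker count for quality are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def balanced_metrics(cycle_times, weekly_throughput, team_size, open_blockers):
    """Return one value per quadrant so they can always be shown together."""
    per_person = [t / team_size for t in weekly_throughput]
    return {
        # Responsiveness: average time in process (cycle time), in days.
        "responsiveness": mean(cycle_times),
        # Productivity: throughput normalized by team size.
        "productivity": mean(per_person),
        # Predictability: variability of throughput / size (assumed here to be
        # the coefficient of variation; lower means smoother flow).
        "predictability": pstdev(per_person) / mean(per_person),
        # Quality: assumed here to be open blocking defects; use whatever
        # measure reflects customer experience in your company.
        "quality": open_blockers,
    }


# Hypothetical team data, for illustration only.
print(balanced_metrics([3, 5, 8, 4], [6, 9, 4, 7], team_size=5, open_blockers=2))
```

Returning all four values from one call is deliberate: it makes it awkward to report a single number on its own.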

4. Use Sampling – Track some metrics just sometimes

Some metrics are expensive to capture. You don’t need every metric all of the time. Sampling captures data for a short period to get a snapshot of how high or how low the metric is compared to your estimate. For example, how much interrupt-driven work is the team fielding requests for? Get the team to stick a post-it note on a whiteboard every time they do a “small job.” Over the week you will get a good indication of the percentage and can make appropriate process changes. You can repeat for one week next month and not track for the other three. This cuts the cost of getting this metric to 1/4 of the original cost and gives much the same result! Sampling is a powerful and underused technique.


  • For measures that rely on people doing extra work to capture, use sampling. For example, track one week a month.
  • It takes less data than you think. 11 samples give a representative picture of a measure, by 30 samples you are almost certain the result is similar to every sample.

5. What, So What, Now What – Help people see the point

There has to be a reason for tracking and showing a metric. Make it clear how a metric trend leads to better decisions and improvement. If people don’t know why a metric is being tracked, they will assume it’s to track them personally! Help them see it’s about the work and the system, not the worker and their livelihood!


  • Promote system metrics rather than personal metrics
  • Promote team metrics rather than personal metrics
  • Share how a trend of a metric has led to a better decision or improvement
  • Be vigilant about dropping metrics that are just available to capture – have a reason

In summary

Metrics aren’t evil. Although they are often mis-used, they don’t need to be. Make people responsible for determining actions on their own metrics. Send ideas and stories on what you have seen work and fail.



Risks – Things that could make a big difference

Posted by in Featured, Forecasting |

Problem in a nutshell: Sometimes extra work needs to be done before delivery because something went wrong, or because something learnt while building a feature means additional innovation is required. How can these factors be included in a forecast early and dealt with earlier? We find asking the simple question “What could go wrong?” helps us be more right when forecasting.

Features or project work start with a guessed amount of work. As the feature is built, other technical learning can cause delays. For example, when a feature for giving suggestions about what other products you might buy turns out to be too slow to be useful during real-time shopping, additional work may be needed to build an index server specifically to make these results return faster. From a probabilistic perspective, there is a known amount of work (the original feature) and an additional “possible” amount of work if it performs poorly. This is a risk. It has a probability of being needed (less than 100%) and an impact if (and ONLY if) it comes true.

If we performed a simple Monte Carlo simulation for this scenario, and said that there was a 50% chance performance would fail, the result would be an equal chance of an early date and a later date. There would also be a normal distribution of uncertainty around each of these dates. The result would be “Multi-Modal” – jargon meaning more than one peak of highest probability. The average delivery date is early July, but it has almost NO CHANCE! Delivery will be around mid June or early September, based mainly on whether this risk comes true.

Figure 1 – Monte Carlo forecast of a 50% risk, produced with our Single Feature Forecaster spreadsheet.

What does this mean? A few things –

  1. Estimating and quibbling over whether a story is a 5 point or 8 point story is pointless. That changes the result in this case by a few weeks. Stop estimating stories and start brainstorming risks.
  2. If we know that risks can cause these bi-modal probability forecasts, we need to stop using AVERAGE which would give us the nonsense July delivery that won’t happen.
  3. Probabilistic forecasting is necessary to make sense of this type of forecasting. But how?

How do you forecast these risks?

It seems harder than it is. Here is how I generated the above forecast (figure 1) using the Single Feature Forecast spreadsheet, which uses no macros or programmatic add-ins – it’s PURE formula, so it’s not that complex to follow. Monte Carlo forecasting plays out feature completion 1000’s of times. In the chart image shown in figure 1 above, you can see the first 50 hypothetical project outcomes in the lower chart (it looks like lightning strikes). You can see that there are two predominant ways the forecast plays out, with some variability based on our range estimates for number of stories and throughput (it could be actual throughput data; I just started with a range of 1 to 5 stories per week, but use data when you can). It’s either shorter or longer, with not a lot of chance in between.

Here are the basic forecast guesses for this feature –

Figure 2 – The main forecast data to deliver a feature.

Once we have this data, let’s enter the risks. In this case, just one –

Figure 3 – Risks definition

The inputs in figure 3 represent a risk that has a 50% chance of occurring, and if it does, 30 to 40 more stories are needed to implement an index server. This risk (30-40 stories picked at random) is added to the forecast 50% of the time. The results in figure 1 clearly show that, to be predictable in forecasting the delivery date, determining which peak is more likely is critically important. If the longer date is unacceptable, reducing the probability of that risk early is beneficial. As a team or a coach, I would set the team a goal of halving the probability of needing an index server (from 50% to 25%), or of determining early whether it’s certain an index server is needed and the later date is real.
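The spreadsheet does this with formulas, but the same idea can be sketched in a few lines of Python. The risk inputs (50% likelihood, 30 to 40 extra stories) and the throughput range (1 to 5 stories per week) come from the text. The base story range of 40 to 50 is a hypothetical stand-in because the figure 2 values are not reproduced here, and uniform sampling within each range is an assumption, not necessarily what the spreadsheet does.

```python
import random


def simulate_weeks(rng, base_stories=(40, 50), risk_prob=0.5,
                   risk_stories=(30, 40), throughput=(1, 5)):
    """Play out one hypothetical delivery and return the weeks it took."""
    remaining = rng.randint(*base_stories)       # hypothetical base scope
    if rng.random() < risk_prob:                 # the risk comes true
        remaining += rng.randint(*risk_stories)  # extra index server stories
    weeks = 0
    while remaining > 0:
        remaining -= rng.randint(*throughput)    # stories finished this week
        weeks += 1
    return weeks


def forecast(trials=10_000, seed=1, **inputs):
    """Run many trials and return the sorted list of week counts."""
    rng = random.Random(seed)
    return sorted(simulate_weeks(rng, **inputs) for _ in range(trials))


def percentile(sorted_values, p):
    """Value below which roughly fraction p of the trials fall."""
    index = min(len(sorted_values) - 1, int(p * len(sorted_values)))
    return sorted_values[index]


for risk in (0.5, 0.25):
    outcomes = forecast(risk_prob=risk)
    average = sum(outcomes) / len(outcomes)
    print(f"risk {risk:.0%}: average {average:.1f} weeks, "
          f"50th percentile {percentile(outcomes, 0.5)}, "
          f"85th percentile {percentile(outcomes, 0.85)}")
```

With inputs like these the outcomes cluster around two peaks, so the average tends to land in the trough between them. Reading percentiles rather than the average is one way to avoid quoting a date that almost never happens.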

For example, by doing a technical spike it is determined that an index server is less likely to be needed. The team agrees there is a 25% chance, having ruled out 3 out of 4 reasons an index server might be needed. The only change in the spreadsheet is the risk likelihood being reduced to 25% (from 50% as shown in Figure 3). The forecast now looks like this –

Figure 4 – 25% chance of performance risk.

It’s clear to see that there is now a 75% chance of hitting June versus September. This is well worth knowing. Until we can show how things going wrong cause us stress when asked to estimate a delivery date, the conversation is seen as the team being evasive rather than carefully considering what they know.

This example is for a single major delivery blocker risk. It’s common for there to be 3 to 5 risks like this in significant features or projects. The same modeling and forecasting techniques work, but rather than just two peaks, there will be more peaks and troughs. The strategy stays the same: reduce likelihoods, and prove early if a risk is certain. Then make good decisions with a forecast that constantly shows its uncertainty.


If you aren’t brainstorming risks and forecasting them using Monte Carlo forecasting you are likely to miss dates. Averages cannot be useful when forecasting multi-modal forecast outcomes common to IT projects. Estimating work items is the least of your worries in projects and features where technical risks abound. We find three risks commonly cause most of the chaos and rarely find none.

Main point – it’s easier than you think to model risk factors, and we suggest that you take a look at our spreadsheets that support this type of analysis.





Calendar Days vs Work Days (Storing and using cycle time data)

Posted by in Featured, Forecasting |

Problem in a nutshell: Should work time in process (cycle time) and lead time be workdays or calendar days? Does it matter which we use for forecasting?

Just want the spreadsheet: Get it here: Cycle%20Time%20Adjustments.xlsx

We get this question a lot. Should weekends and holidays be captured in cycle time data? Our answer is along the lines of “whatever you have.” It doesn’t matter from a forecasting perspective, as long as you are consistent. Here are the issues that may sway you one way or the other –

  1. If your work item estimates are time based, and they are expressed in work days, it may be easier to use work days as time in process (cycle time) numbers.
  2. If you are capturing item data as date started and date ended, then it will naturally fall out to include weekends, and we can remove those days to get work days.

We often have to convert from one to the other. It’s easy if we have dates to work from, because Excel has some helpful functions for computing workdays between dates (removing non-work days and a list of public holidays). Look up the NETWORKDAYS documentation. This is a lossless conversion.
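Outside of Excel, the date-based conversion is simple to reproduce. The sketch below is a plain Python approximation of what NETWORKDAYS does for a start date that is not after the end date: it counts Monday to Friday days in an inclusive range and skips any listed holidays. The function name is hypothetical, and a Monday to Friday working week is assumed.

```python
from datetime import date, timedelta


def network_days(start, end, holidays=()):
    """Count Monday-Friday days from start to end (inclusive), skipping holidays."""
    if end < start:
        raise ValueError("end must not be before start")
    skipped = set(holidays)
    count = 0
    day = start
    while day <= end:
        # weekday() is 0 for Monday through 6 for Sunday.
        if day.weekday() < 5 and day not in skipped:
            count += 1
        day += timedelta(days=1)
    return count


# Started Monday 2016-04-18, completed the following Monday.
started, completed = date(2016, 4, 18), date(2016, 4, 25)
print("calendar days:", (completed - started).days + 1)
print("work days:", network_days(started, completed))
```

Counting both the start and end date is a convention. Whichever convention you pick, apply it the same way to every item.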

Trickier is when we just have a number of days. We spend some time checking into how these were calculated. If it is calendar days and we cannot get the raw date data, we use a statistical approach for removing an approximate number of days from each sample. Here is how our algorithm works (it’s a little complicated, but it is the best we have found).

For every multiple of 7 days in the original cycle time we can remove 2 days. For example, if a cycle time is 7 days and we know the company works 5 days a week (Monday to Friday), we can remove two days. When a cycle time is less than 7 days, we have to guess what day of the week the work started. For example, if a cycle time is 3 days, the valid starting days where all 3 days fit into a working week are Monday, Tuesday and Wednesday. If the work started on Thursday, 1 day would fall on the weekend; if it started on Friday, 2 days would. If every day of the working week has an equal chance of being the start day, there is a 3/5 chance the right value is 3, a 1/5 chance it’s 2, and a 1/5 chance it’s 1. We use these probabilities and adjust based on a uniform random chance.
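The approach described above can be sketched in Python as follows. This is one reading of the description, not a copy of the spreadsheet formulas: it assumes a Monday to Friday working week, that work always starts on a working day, and that each working day is equally likely as the start day. The function names are hypothetical.

```python
import random


def weekend_days_in_window(start_weekday, length):
    """Weekend days in `length` consecutive days starting on start_weekday (0 = Monday)."""
    return sum(1 for i in range(length) if (start_weekday + i) % 7 >= 5)


def to_work_days(calendar_days, rng=random):
    """Approximate work days for a cycle time given only as calendar days."""
    full_weeks, remainder = divmod(calendar_days, 7)
    # Guess the start day: Monday to Friday, each equally likely (assumption).
    start_weekday = rng.randrange(5)
    # Each full week loses exactly 2 days; the leftover days lose 0, 1 or 2.
    return full_weeks * 5 + remainder - weekend_days_in_window(start_weekday, remainder)


# The 3-day example from the text: one possible result per start day, Monday to Friday.
print([3 - weekend_days_in_window(day, 3) for day in range(5)])

# Converting a hypothetical list of calendar-day cycle times.
rng = random.Random(42)
print([to_work_days(days, rng) for days in [3, 7, 10, 14, 2]])
```

Because the start day is a guess, individual values can be off by a day or two; the aim is for the adjusted sample as a whole to look like work-day data.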

Don’t worry. We of course have encoded all of this logic into a spreadsheet. Our Cycle Time Adjustments.xlsx spreadsheet can convert in both directions for dates and numerical cycle time inputs. It can never be exact for numerical cycle times, but it is pretty close from our round trip testing (dates -> numerical -> dates).

Get it here: Cycle%20Time%20Adjustments.xlsx

You can see our time-based probability logic in the time based setup worksheet.

Setup for the probabilities of cycle time adjustment.


For recommendations about data capture of cycle time and lead times, we suggest –

  1. Capture date started (committed to start delivery), date completed and date captured as an option to be considered (often, date created).
  2. Store cycle time data as dates, don’t convert to days until the last moment you need to.
  3. Be consistent with date format. We like yyyy-mmm-dd (e.g. 2016-Apr-20) as a format that is unlikely to be confusing whatever the native date format is in your country or region.
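As a small illustration of recommendations 2 and 3, the Python sketch below stores dates as yyyy-mmm-dd text and only converts to a number of days at the point of use. The helper names are hypothetical, and the month abbreviations are spelled out in a table so the result does not depend on the machine's locale settings.

```python
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def format_date(d):
    """Format a date as yyyy-mmm-dd, e.g. 2016-Apr-20."""
    return f"{d.year:04d}-{MONTHS[d.month - 1]}-{d.day:02d}"


def parse_date(text):
    """Parse a yyyy-mmm-dd string back into a date."""
    year, month, day = text.split("-")
    return date(int(year), MONTHS.index(month) + 1, int(day))


def cycle_time_days(started, completed):
    """Convert stored date strings to elapsed calendar days only when needed."""
    return (parse_date(completed) - parse_date(started)).days


print(format_date(date(2016, 4, 20)))
print(cycle_time_days("2016-Apr-18", "2016-Apr-20"))
```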



Decision Making

Posted by in Featured, Forecasting | 3 comments

I recently had the opportunity to do training with Michael Tardiff, a gifted facilitator and trainer for Solutions IQ. One of Michael’s specialty subjects is group decision making. We take different approaches to teaching this topic: I’m more about getting to any answer, while Michael is more about knowing the method used to get to the answer so that it has the greatest chance of surviving use over time. Michael is right of course; the goal of decision making is to get to the right answer (for now) and to avoid future “I never agreed to that” problems. Whilst consensus isn’t necessarily the key, finding agreement that persists over stress and time is the purpose and goal.

Michael says there are four basic types of decision making process, and others are a combination of these –

  1. King Rules (gets to live while we like their decisions, then beheaded)
    speed: fastest, risk: high if technical, long-lasting: until change of king
  2. Majority Rules (works while the minority of the last decision believes they will be the majority one day)
  3. Consent (staying silent means you agree)
  4. Consensus (hardest to achieve, but once agreed it was so hard, it tends to stick)
    speed: slowest, risk: low if the right people reach consensus, long-lasting: good

It’s important to call out (when it’s unsaid) how a decision is being made or has been made. Consensus takes the longest and is the hardest to achieve, but tends to stick because people are invested in the decision. Consent offers middle ground if there is time and capability to handle objections. If your system demands King Rules, just acknowledge it. Majority rules is a muddy area. You haven’t managed to sway the minority opinion, who might believe their day will come. But if a decision is needed by a certain time, or total agreement may never be achieved, it is (often) a fair way to resolve decisions. It just may not stick for long.

Hofstede’s Cultural Dimension Theory (see here)

Decision making styles can be culturally impacted. Even within one country, styles of lively discussion differ from one coast to the other (in the USA, the West coast tends to be more consensus-seeking and introverted, and the East coast more extroverted). When working with experts across geographies, pay attention to how much the ability to challenge authority varies; you may only think you have consensus. The classic measure of this is Hofstede’s Cultural Dimension Theory, which ranks countries on a set of dimensions relevant to decision making. I’ve found awareness of two of them important. Power distance index (PDI) is defined as “the extent to which the less powerful members of organizations and institutions (like the family) accept and expect that power is distributed unequally.” Long-term orientation vs. short-term orientation (LTO) associates the connection of the past with current and future actions and challenges; a lower degree of this index (short-term) indicates that traditions are honored and kept, while steadfastness is valued. Both are key to understanding some group dynamics. More ideas can be found in these articles and books: Wikipedia: Cross Cultural Decision Making, and the book Advances Cross Cultural Decision Factors Ergonomics.

It’s key that even the introverts who know why a decision is a poor or impossible choice get heard by the group, independent of salary or positional power. If the decision is more technical than opinion, weight the technical voices in the room higher than the opinion voices.

Reducing Thrashing

To reduce elongated analysis time, I often nudge teams in the following directions –

  1. “Good for now” Agree how long you are going to test the decision before revisiting it for further analysis. Often, by helping people remember a decision isn’t set in stone, just good for now, they overcome hesitancy to commit based on uncertainty.
  2. “Close the gap” Narrow in on a few actionable things. Even if you can’t decide on the whole solution, can you agree on first steps? Often, the team realizes that most of the value in the decision is achieved.
  3. “Guard Rails” Identify which events would invalidate key assumptions and mean the decision needs revisiting. This helps people agree for now and feel that doomsday scenarios are protected against.
  4. “Agree on Research” If agreement on the decision can’t be reached, identify what research inputs are needed to proceed and get a decision. Document what is in the way of reaching a decision and what data would add clarity or reduce uncertainty.
  5. And Sebastian Eichner (@stdout) mentioned another important tool. “Roll a dice and pick at random.” Often people find reasons why the one picked at random isn’t a viable choice, or if the options really are that similar in risk and reward, it’s as good a choice as any! Use it to draw out opinions.

It’s good to have teams make smaller, less risky decisions to practice putting contrary views in a productive way. Decision making is a skill to be built in a team, and a great indicator of team maturity.

One final point is often mentioned: “Who is responsible for a decision if one can’t be reached?” There is an eventual moment where King Rules needs to and should apply. If the cost of no decision outweighs the risk of moving forward, someone has to make the best decision they can. If that’s you, and you are in a position of power, you have a couple of acceptable choices –

  1. Delegate to the most informed expert, and say “which one? We need a choice and I think you have the most information” and then cover them if it goes badly.
  2. Break the deadlock. If two options are equally liked by different people, make it clear that no decision is worse and that you are going with decision A for two or three months (as long as you need to see if it was likely right). By making it clear you are only stepping in as a tie-breaker because of the cost of no decision, you still give the team a good chance of making their own choices. If this is recurring, you need to make staff changes!



Does setting arbitrary goals (times or dates) work?

Posted by in Featured, Forecasting |

Problem in a nutshell: Work should be released when it reaches the quality needed with the features required. Of course small releases give the fastest feedback, and this post isn’t saying you should do larger releases. This post looks at whether setting a date or time goal does impact delivery.

Runners in the New York marathon finish in higher concentration just prior to hour, half-hour and fifteen-minute elapsed time boundaries. Why? It is speculated that there is a mental race going on in each runner’s head as they try to reach the next personal goal-post. Don’t they just run as fast as they can? Sure, but they also need something to pace themselves against in order to judge their ongoing pace and balance it against exhaustion (it should be noted I’ve never run a marathon!).

Clustering of finish times.

It’s 1.4 times more likely to finish at 3:59 than at 4:01.

Having a goal in mind means that constant adjustments throughout the marathon help the runner finish on goal. Whilst runners can’t pick up half an hour on their personal best, they get early feedback that they are off pace and adjust early to maybe finish a few minutes before a boundary.

I think the same needs to happen when we set goals for software delivery teams. They need constant feedback on whether they are on pace so they can adjust early – NOT cram at the end. Having a date in mind is the only way to compare the actual delivery pace against the pace required to achieve that delivery without heroics. Heroics are failure. They put teams in burnout mode, and teams fail to keep a consistent pace after a crunch, making it impossible to forecast reliably. If I see that a team moving into a feature or project crunched in a prior delivery, I halve my throughput estimates for one to two times the crunch period. It’s just NOT cost effective to have teams crunch.

My advice –

  1. If teams have crunched, reduce throughput estimates by 1/2 for 2 times the crunch period they endured
  2. DON’T use throughput samples during crunch mode. They are artificially high and cause crunch mode in the next plan!
  3. Set a delivery date and work out what team size and scope will fit into that period (using our spreadsheets of course :))
  4. Track delivery pace against this plan. The moment delivery falls behind, revisit the scope expected and communicate it is at risk. Get small actions taken earlier
  5. Track when teams are crunching versus sustainable. I put a C and an S in the notes of any throughput weeks I capture in our spreadsheets. Any team spending more than 10% of the year crunching is costing the company delivery pace and money. Compute this by estimating the salary of the team and computing what running at half pace for 2x the crunch period costs.
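The arithmetic behind points 1, 2 and 5 can be sketched in a few lines of Python. The C and S week notes, the half pace and the 2x recovery period come from the advice above. The function names, the 52-week salary split and the example figures are assumptions made for illustration.

```python
def crunch_share(week_notes):
    """Fraction of tracked weeks marked "C" (crunch) rather than "S" (sustainable)."""
    return week_notes.count("C") / len(week_notes)


def sustainable_samples(weekly_throughput, week_notes):
    """Keep only throughput samples from sustainable weeks for forecasting."""
    return [t for t, note in zip(weekly_throughput, week_notes) if note == "S"]


def recovery_cost(annual_team_salary, crunch_weeks, pace=0.5, recovery_multiple=2):
    """Salary paid for output lost while the team recovers at reduced pace."""
    weekly_salary = annual_team_salary / 52           # assumes an even weekly split
    recovery_weeks = recovery_multiple * crunch_weeks
    return weekly_salary * (1 - pace) * recovery_weeks


# Hypothetical ten weeks of data: the last two were crunch weeks.
notes = ["S"] * 8 + ["C"] * 2
throughput = [3, 4, 3, 5, 4, 3, 4, 3, 9, 8]

print("share of weeks crunching:", crunch_share(notes))
print("samples to forecast with:", sustainable_samples(throughput, notes))
print("cost of a 4 week crunch:", recovery_cost(520_000, crunch_weeks=4))
```

Under these assumptions, half pace for twice the crunch period costs one full week of team salary per week of crunch, which makes the trade easy to state when someone proposes a crunch.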



