This post was originally published on my newsletter Stratechgist.
The sales team promised: "if we had reporting, this big user would sign a 5-year contract". I prioritized the reporting functionality, we shipped it, the user signed, a success all around. Months later, the user discovered a bug: reports were wrongly aggregated if they straddled more than one calendar year. I didn't want to prioritize the bugfix because it was a rare use case, and this was the only report of it. "Just tell them to combine multiple annual reports in Excel," I told their account manager.
She agreed. We both moved on, to the chagrin of the user who had signed that 5-year contract.
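For the curious, this class of bug looks roughly like the toy Python sketch below (not our actual reporting code; the function and data are made up for illustration): totals are grouped by calendar year, so a report whose date range straddles January 1st comes back as two partial aggregates instead of one combined total.

```python
from datetime import date

def aggregate_report(rows, start: date, end: date):
    """Toy aggregation that (incorrectly) groups totals by calendar year."""
    totals = {}
    for day, amount in rows:
        if start <= day <= end:
            totals[day.year] = totals.get(day.year, 0) + amount
    return totals

rows = [(date(2023, 12, 15), 100), (date(2024, 1, 10), 50)]
print(aggregate_report(rows, date(2023, 12, 1), date(2024, 1, 31)))
# {2023: 100, 2024: 50} -> the user expected a single combined total of 150
```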
Software development constantly generates imperfections. Not everything can or should be fixed. No one has unlimited resources; if you're a team leader, you know this firsthand. It's difficult to prioritize a bug that affects a handful of users over new features that will drive more adoption.
After a few years of prioritizing features, you end up with a laundry list of bugs that, at some point, you will have to take care of. As a leader, you'll have to decide when. If you don't, the alternative is wasted effort on low-impact fixes, critical issues festering, and team morale drained by never-ending backlogs.
Every team needs a conscious, strategic approach to managing bugs. You need a framework. Let me tell you about mine: the three Fs. It stands for "Fix, Flag or Forget". Yes, forget, as in leave it broken.
The "three Fs framework" categorizes any issue or bug into one of three buckets:

- Fix: the impact justifies resolving it now, ahead of other work.
- Flag: acknowledge it, record it, and deliberately defer the fix.
- Forget: consciously leave it broken.
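To make the buckets concrete, here is a minimal sketch of the framework as a decision function. The names, inputs, and thresholds are mine and purely illustrative, not a prescribed formula; the real call weighs the factors discussed in the rest of this post.

```python
from enum import Enum

class Triage(Enum):
    FIX = "fix it now"
    FLAG = "acknowledge and defer"
    FORGET = "consciously leave it broken"

def triage(severity: int, users_affected: int, fix_cost_days: float,
           strategically_aligned: bool) -> Triage:
    """Toy decision logic; severity is 1 (annoyance) to 5 (data loss or outage)."""
    if severity >= 4 or (severity >= 3 and users_affected > 100):
        return Triage.FIX      # showstoppers get fixed regardless of cost
    if strategically_aligned or fix_cost_days <= 1:
        return Triage.FLAG     # cheap or aligned: keep it on an actively reviewed list
    return Triage.FORGET       # low impact, high cost, no alignment

print(triage(severity=2, users_affected=1, fix_cost_days=5, strategically_aligned=False))
# Triage.FORGET
```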
Let's dig into the factors I weigh when deciding which bucket an issue falls into.
First things first: impact! Impact is communicated as "severity", a proxy for "how bad is it when it happens?". Severity spans a wide spectrum; I've seen issues ranging from minor user annoyances to system crashes and data loss.
Each bug has an associated business cost. It comes in two flavors: quantifiable and qualitative. In my experience, the quantifiable cost is easier to dig out, as it comes as one (or a combination) of the following:
The qualitative cost is the user's pain. My own proxy for it is "how much of the user's day have we ruined" when they run into the issue. It manifests as frustration, confusion, yelling at support agents, and the difficulty of working around the bug itself.
I consider these factors essential when assessing the impact of any bug or regression, as they will allow me to better prioritize the fix.
Every fix has an associated effort, which translates into cost. For exceptional, high-severity issues (showstoppers, for example) the cost is irrelevant: they get fixed regardless. In all other cases, the cost is essential to prioritizing a fix.
I find engineering cost estimates most useful when they are expressed in time (hours, days) or story points (if the team's process is Scrum-like). As a team leader, I take understanding the effort to fix a bug very seriously. We all have limited time and have to be careful how we deploy it, so a good understanding of the effort is essential to prioritizing well.
The cost of a fix correlates with the complexity of the issue at hand. A simple tweak (e.g. a copy change) can take minutes, while a structural problem that requires a major refactor can take months. The complexity of a fix is also related to the knowledge required to implement it: some fixes are small only to engineers with specialized knowledge. Picking the person with the right expertise to execute (or oversee) the fix can influence its cost.
Another factor that complicates a fix, and increases its cost, is dependencies. Issues can span multiple systems, services, or products. For example, a bug caused by a timestamp field without timezone support requires a fix in the service backend, a change of the column type in the respective database table, a protobuf schema change, bumping the schema version on downstream services so they adopt the new field, and rolling out all of those services one by one.
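As a minimal illustration of that root cause (a Python sketch, not the actual services involved): a timestamp stored without timezone information is ambiguous, and every consumer has to agree on its interpretation once the field is fixed, which is why the change fans out across so many components.

```python
from datetime import datetime, timezone

naive = datetime(2024, 3, 10, 2, 30)                        # 02:30, but in which timezone?
aware = datetime(2024, 3, 10, 2, 30, tzinfo=timezone.utc)   # an unambiguous point in time

print(naive.tzinfo)       # None -> every downstream consumer has to guess
print(aware.isoformat())  # 2024-03-10T02:30:00+00:00
```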
To complicate things further, this long chain of changes and dependencies can get blocked. It doesn't have to be a huge blocker: even a service deployment lock due to an unrelated incident can postpone the bugfix hitting production by hours, days or even weeks.
Finally, I like to consider the blast radius. Major fixes like refactors are risky: without excellent test coverage they can introduce new, worse problems and destabilize the system. I prefer fixes with surgical precision (small, targeted, low-risk changes that are easy to test and revert) over open-heart surgeries.
When it comes to fixing bugs, as engineers we don't always think about how strategically aligned those fixes are with our current or future goals. More interestingly: does not fixing them block our goals? If a bug has low impact and isn't aligned with the product's direction, it's difficult to justify prioritizing it.
Also, the issue might distract the team from higher-priority strategic initiatives. If the team doesn't have the bandwidth, then maybe the bug should not be prioritized now (or ever?).
Sometimes, even though there's no strategic alignment, there might be technical alignment. Some bugs are rooted in technical debt. In such cases, the bug is additional fuel for prioritizing the fix, as it's a double whammy: you're not only fixing a bug, you're also paying down debt. I have witnessed bugs pop up in legacy parts of the system, and it's always neat to use them as an excuse to migrate (part of) the system to newer foundations. As an engineering leader, it's easier to justify a migration when it's tied to a regression.
Bugs can regress the system over time or along another dimension (e.g. traffic or load). In the past I launched a new API with a memory leak. In the first days the leak's memory impact was negligible, but after a couple of days it was completely depleting the server's memory. We observed that as traffic grew, the leak grew with it, so we not only had to reboot the server every day but also stay vigilant on days with elevated traffic. Eventually, we found the leak and fixed it.
The example shows how an issue can creep up and get worse over time or with increased use. If we didn't have good monitoring in place, the memory would leak out and the operating system's OOM killer would reap the server process, leading to downtime.
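A common shape of such a leak looks like the toy sketch below (illustrative Python, not the actual API in question): a module-level cache that is written to on every request and never evicted, so retained memory grows with traffic.

```python
import tracemalloc

# Unbounded module-level cache: every request adds an entry, nothing evicts.
_RESPONSE_CACHE: dict[str, bytes] = {}

def handle_request(request_id: str) -> bytes:
    payload = request_id.encode() * 1_000      # stand-in for a real response body
    _RESPONSE_CACHE[request_id] = payload      # the leak: entries accumulate forever
    return payload

tracemalloc.start()
for i in range(10_000):                        # more traffic means more retained memory
    handle_request(f"req-{i}")
current, _peak = tracemalloc.get_traced_memory()
print(f"retained ~{current / 1e6:.1f} MB after 10,000 requests")
```

Each individual entry is tiny, which is exactly why this kind of problem only shows up after days of sustained traffic.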
Issues such as the memory leak also have a delayed cost: the more complex a codebase becomes, the harder it is to root-cause the leak. That's why it's better to work on such issues sooner rather than later.
The worst-case scenario is such issues becoming a blocker for new feature development; in other words, neglected issues halting innovation. A nightmare.
As software engineers, we all build intuition for when an issue has to be fixed now versus later (or never). As an aspiring engineering leader, I've been thinking about codifying this: a set of practical criteria for when to characterize an issue as a must-fix-now.
Here's my loose categorization:
Any of the above should end up in the "Fix it" category. But just prioritizing them is not enough. Teams must build the muscle to react to this class of bugs; for example:
The bottom line is that any engineering team must have mechanisms to tackle important bugs rapidly, with as little interruption as possible. The leader(s) of the team must protect the team during that ordeal and communicate outwards (i.e. to stakeholders and leadership) about any side effects on ongoing projects.
Flagging is great. It allows you to acknowledge a bug yet defer its fix. But it can be a trap: flagging can turn the backlog into a graveyard where tickets go to die. To avoid the graveyard situation, flagging must be an active bug-management process, not a knee-jerk reaction.
Here's my categorization for flagging:
As I mentioned before, flagging is tricky. To be effective you need a tight process, and you have to be a bit dogmatic about following it. If you skimp on hygiene, your backlog will soon overflow with bugs you can't even remember the reason for in the first place.
Here's what effective flagging requires:
The bottom line is that any engineering team must have mechanisms to flag issues consistently. Allocate specific time for reviewing flagged items, and resist the pressure to immediately fix anything that doesn't meet the "Fix it" criteria.
Sometimes, the best decision is to do nothing. As a lead, what I always make sure to highlight is that deciding not to fix something is not neglect but strategic non-action. This requires courage and clear communication to your users, your stakeholders, and your leadership.
When to leave things broken:
As a leader, deciding to leave something broken is not just closing the Jira ticket as "Won't do". The decision requires auxiliary artifacts so people (including your future self) can refer back to them whenever needed.
Here's what I suggest:
As with any framework, there are pitfalls that are easy to fall into if you don't apply critical thinking to your use of it. Some examples:
If you are looking to grow as an engineer or engineering leader, let's chat more and schedule your 7-day trial mentorship for free.