Any developer, professional or otherwise, needs to be able to investigate bugs as they arise. The process may seem straightforward at first glance, but when a bug is difficult to reproduce or its cause is hard to find, applying a sound method becomes crucial to your success. The sections below outline the method I have created and refined through my own experience.
Symptoms and Causes
We define a symptom as an observable issue, bug, defect, or malfunction, and a cause as the reason a symptom appears. It is important to approach the investigation of a bug / defect with the understanding that symptoms and causes are two separate things, and that the root cause or causes may remain hidden even when all of the symptoms have been discovered. Although we handle symptoms and causes separately, a symptom of one cause can itself be the cause of another symptom. In other words, we can have a chain in which one root cause produces a symptom, which then produces another, and so on, similar to a row of falling dominoes.
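To make the chain idea concrete, here is a minimal sketch in Python. The observations and the links between them are entirely hypothetical and chosen only for illustration; in a real investigation the links are what you discover along the way, and they may branch rather than form a single line.

```python
# Hypothetical example: each observation maps to the thing that caused it.
# None marks an observation with no known deeper cause (a root-cause candidate).
caused_by = {
    "checkout page shows an error": "order service returns HTTP 500",
    "order service returns HTTP 500": "database query times out",
    "database query times out": "missing index on orders table",
    "missing index on orders table": None,
}


def trace_to_root(symptom, caused_by):
    """Follow the chain from a symptom back to the deepest known cause."""
    chain = [symptom]
    seen = {symptom}
    while caused_by.get(chain[-1]) is not None:
        next_cause = caused_by[chain[-1]]
        if next_cause in seen:  # guard against a cycle in the recorded links
            break
        chain.append(next_cause)
        seen.add(next_cause)
    return chain


chain = trace_to_root("checkout page shows an error", caused_by)
for depth, observation in enumerate(chain):
    print("  " * depth + observation)
```

Every entry except the last is both a symptom (of the entry after it) and a cause (of the entry before it), which is why the two roles are worth keeping distinct in your notes.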
Acknowledge and Address any Biases and Assumptions
Some of the most common reasons why a developer or team will fail in their investigation of a bug or defect relate to preconceived notions and biases, such as:
- Assumptions about the quality of a module or piece of code, good or bad, based on which team wrote it, and thus about the likelihood that it is causing the bug / defect in question.
- Biases against looking into certain possibilities because the repercussions of the discovery would be undesirable.
- Biases toward favoring or avoiding certain possibilities due to some personal motivation.
- Avoidance of certain observations or avenues of investigation due to the political environment of the organization.
If left unchecked, assumptions and biases may manifest as listed above and hinder, or even halt, an investigative process.
Study the Initial Report / Observation
Generally a bug / defect is brought to your attention in one of these ways:
- You discover a bug yourself while developing or reviewing the codebase.
- Someone else informs you of the bug / defect in an unofficial capacity.
- You are officially assigned to a defect and given a report with a way to reproduce the defect and a description of it.
When beginning an analysis based on an initial report or discovery, you must keep in mind the possibility that the findings, descriptions, and suppositions are partially or wholly incorrect. The same symptom can be produced by different types of issues, and some symptoms may be "disguised" as other symptoms. Even when the initial analysis is fully correct, we must not rely on it too heavily. Doing so is a common cognitive bias known as the anchoring effect: relying too heavily on the first piece of information given.
With this in mind, the initial report / description still provides a good place to start the analysis. The content can be used to reproduce some of the symptoms of the defect (though there may be other ways to reproduce them that are not specified), give some insight into the overall context and use case(s), and provide a starting point for light conjecture about the causes, so long as it is treated as conjecture and thus subject to change. Most importantly, with these considerations in mind, you should be able to derive some next steps for the analysis / investigation.
Make Observations of the Symptoms and Surrounding Context
One useful avenue of investigation is to check if the bug / defect can be reproduced in scenarios other than the one in the initial observation. The questions below can aid in this type of analysis:
- Can we reproduce the symptom in other conditions? Will a different combination of inputs, system states, error states, etc. also cause the same symptom or related symptoms?
- Will any slight, or major, changes in the inputs prevent the symptom from occurring? It may be possible to narrow down the possibilities by removing extraneous conditions.
- Will any slight, or major, changes in the inputs worsen the situation? This may also help narrow down the root cause.
- Does a change in context remove or worsen the bug / defect? For example, will following the same steps with a different user account, different time of day, different visual theme, etc. change the outcome?
- Is it difficult to reproduce the scenario, or does it only appear intermittently? If the same set of inputs / circumstances does not always produce the bug / defect, then either something is missing from the initial analysis or something is happening outside of your control (such as a dependency itself intermittently failing).
Some of the points above may be irrelevant depending upon the situation, though the overall approach is still valuable. The facts gained from this step may help narrow down the search for the root cause(s) and may even be used to enhance the original report.
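The questions above can often be explored systematically rather than by hand. The sketch below is one possible way to do that in Python, under stated assumptions: `buggy_discount` is a hypothetical stand-in for the system under investigation, and the condition names and values are invented for illustration. It runs the function under every combination of conditions and reports which individual conditions always coincide with a failure.

```python
from itertools import product


def buggy_discount(price, quantity, theme):
    """Hypothetical function under investigation (stands in for the real system)."""
    if quantity == 0:
        raise ZeroDivisionError("unit price undefined for zero quantity")
    return price / quantity


def reproduction_matrix(func, conditions):
    """Run func under every combination of conditions; record which ones fail."""
    names = list(conditions)
    results = []
    for values in product(*(conditions[name] for name in names)):
        kwargs = dict(zip(names, values))
        try:
            func(**kwargs)
            failed = False
        except Exception:
            failed = True
        results.append((kwargs, failed))
    return results


def factors_always_failing(results):
    """Return (name, value) pairs for which every run containing them failed."""
    outcomes = {}
    for kwargs, failed in results:
        for item in kwargs.items():
            outcomes.setdefault(item, []).append(failed)
    return sorted(item for item, fails in outcomes.items() if all(fails))


# Hypothetical conditions to vary: inputs plus one piece of surrounding context.
conditions = {"price": [10, 20], "quantity": [0, 1, 5], "theme": ["light", "dark"]}
results = reproduction_matrix(buggy_discount, conditions)
suspects = factors_always_failing(results)
print(suspects)  # [('quantity', 0)]
```

Here the visual theme and the price turn out to be extraneous, while a quantity of zero fails in every run. A real system will rarely be this clean, and an intermittent defect would need each combination to be run many times, but the same idea of varying one thing at a time still applies.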
Follow the Path(s) to the Cause(s)
The steps above will have yielded enough information for you to see at least one path to one or more possible root causes. If more than one path presents itself, they can be formally or informally ranked based on likelihood or ease of investigation, though we must remember to address biases and assumptions at this step as well. The remaining steps can be summarized as follows (for each path):
- Follow the path through the code / data flow: in a sense, we will be traversing the code / data flow backwards, in the reverse of how the system / flow actually executes.
- If something unexpected is found, follow the process laid out here: when we reach a point where something is wrong, broken, or unexpected, we can take this same approach of debugging / investigating, using this observation as the subject of investigation. In effect, we are reducing the original defect to the one found here (though in a real-life scenario it may not be quite that simple). At this point, we may or may not have found a root cause.
- Analyze the root cause and consider methods of correcting it: the actual correction of a bug / defect falls outside of the scope of this article, but the investigative process laid out here should give hints and possibly a course of action as to how to resolve the root cause.
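The backwards traversal described in the steps above can be sketched as a small recursive function. This is only an illustration under assumptions: the stage names, the `upstream` dependency map, and the `looks_correct` observations are all hypothetical, and in practice "looks correct" is a judgment you make by inspecting logs, data, or a debugger at each point.

```python
# Hypothetical data flow: each stage lists the upstream stages it depends on.
upstream = {
    "report": ["totals"],
    "totals": ["parsed_rows", "tax_table"],
    "parsed_rows": ["raw_file"],
    "tax_table": [],
    "raw_file": [],
}

# Hypothetical observations: whether each stage's output looked correct on inspection.
looks_correct = {
    "report": False,
    "totals": False,
    "parsed_rows": False,
    "tax_table": True,
    "raw_file": True,
}


def find_root_candidates(stage, upstream, looks_correct):
    """Walk backwards from a bad stage and return the deepest stages that are
    bad even though all of their own inputs look correct."""
    bad_inputs = [s for s in upstream[stage] if not looks_correct[s]]
    if not bad_inputs:
        # Inputs are fine but the output is not: the fault likely lies here.
        return [stage]
    candidates = []
    for bad_stage in bad_inputs:
        # Reduce the problem: the bad input becomes the new subject of investigation.
        for candidate in find_root_candidates(bad_stage, upstream, looks_correct):
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates


candidates = find_root_candidates("report", upstream, looks_correct)
print(candidates)  # ['parsed_rows']
```

In this made-up flow the raw file is fine but the parsed rows are not, so the parsing stage is where the investigation should focus. It remains a candidate rather than a confirmed root cause until the code at that stage has actually been examined.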
Based on the steps above, we can consider the method to be a continual process of reducing the observations until we get to the root cause(s) (recursive, anyone?). This idea of reduction is not a crucial part of the method, but it may be useful to keep in mind throughout the process. Overall, this method is simple and perhaps even somewhat obvious, but often the simplest things are the most elusive.