  • Customer Support for Software Engineers — Part I

    How to manage production defects after (and before) releases

    Customer support is critical to the success of a software company. Software will have defects and usability issues, and it will face unforeseen circumstances such as downtime. But what can we do while we develop software to support support? Puns aside, how can we manage software supportability?

    Make sure to check Part II of this series, where I talk about measurement, estimation, and people allocation towards Service Level Agreements (SLA)/Service Level Objectives (SLO).

    Let’s review some things that many organizations fail to do and that would ultimately save them time and money, increase customer satisfaction, improve their estimates, and support their Service Level Objectives.

    FAQs and TSGs

    The following might seem obvious, but you would be surprised by the number of companies that release software to production and forget that somebody will have to support the users.

    Frequently Asked Questions (FAQ) can be provided ahead of time if engineers can already anticipate some of the questions that can arise. It can be done by analyzing:

    • Features that are not explicit in the UI or maybe hidden in menus;
    • Business processes or activities that are supported by a combination of different actions;
    • Functionality that didn’t leave the backlog (even if it’s to say that this is not supported yet — better yet if we can say that it’s not supported yet but our team is considering it).

    A simple count of how many times the same question gets asked is also a good indicator of an improvement that could be made to the system to save the CS team time, improving the overall SLO. FAQs are usually made available to users so they can help themselves.
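
    The question count mentioned above can be sketched in a few lines of Python. This is only an illustration: the ticket fields (`kind`, `topic`), the sample data, and the `min_count` cutoff are assumptions, and a real ticketing tool would expose its own fields.

```python
from collections import Counter


def top_questions(tickets, min_count=3):
    """Return (topic, count) pairs asked at least min_count times, most frequent first."""
    counts = Counter(t["topic"] for t in tickets if t["kind"] == "question")
    return [(topic, n) for topic, n in counts.most_common() if n >= min_count]


# Hypothetical ticket log; in practice this would come from the ticketing tool.
tickets = [
    {"kind": "question", "topic": "export report to PDF"},
    {"kind": "question", "topic": "reset password"},
    {"kind": "incident", "topic": "checkout timeout"},
    {"kind": "question", "topic": "export report to PDF"},
    {"kind": "question", "topic": "reset password"},
    {"kind": "question", "topic": "export report to PDF"},
    {"kind": "question", "topic": "change invoice address"},
]

for topic, n in top_questions(tickets, min_count=2):
    print(f"{n}x {topic}")
```

    Topics that keep showing up at the top of this list are candidates for a new FAQ entry first and for a change to the system later.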

    Troubleshooting Guides (TSG) should be provided when incidents require a procedure to be executed by the user, the customer support agent, or both. While FAQs are designed to answer questions and may provide instructions or steps to accomplish some goal, a TSG will help identify (and hopefully fix) an actual problem. Remember that while the TSG is being executed, the results of the steps should be collected as they will be valuable information to the engineering team in case the TSG is not enough to address the incident reported by the user.
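
    One way to make sure the step results are not lost is to record them as the guide is executed. The sketch below is a minimal example under assumed names (`TsgRun`, `StepResult`, the incident ID format); it is not tied to any particular support tool.

```python
from dataclasses import dataclass, field


@dataclass
class StepResult:
    step: str
    outcome: str
    resolved: bool = False


@dataclass
class TsgRun:
    """One execution of a troubleshooting guide for one incident (hypothetical model)."""
    incident_id: str
    guide: str
    results: list = field(default_factory=list)

    def record(self, step, outcome, resolved=False):
        self.results.append(StepResult(step, outcome, resolved))

    @property
    def resolved(self):
        return any(r.resolved for r in self.results)

    def escalation_notes(self):
        """Summary to hand to engineering when the guide did not fix the problem."""
        lines = [f"Incident {self.incident_id}: ran '{self.guide}'"]
        lines += [f"{i}. {r.step} -> {r.outcome}" for i, r in enumerate(self.results, 1)]
        return "\n".join(lines)


run = TsgRun("INC-1", "Login fails")
run.record("Clear browser cache", "still failing")
run.record("Reset password", "reset email never arrives")

if not run.resolved:
    print(run.escalation_notes())
```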

    FAQs and TSGs should be provided to CS upfront.

    Teams that build features to be released to production should seriously consider building and updating this documentation, making it part of their Definition of Done.

    These documents are also to be kept updated by CS agents as they find out how to answer new questions and troubleshoot new incidents, composing a knowledge base that will reduce the time needed to support clients.

    While some teams may consider this overhead, they should think of the time it will save everyone (including themselves) if the customer support agent (CSA) can promptly support the user without having to interrupt anybody’s planned work. Often, TSGs arise simply from workarounds that are defined as part of managing an incident.

    Incident vs. Defect vs. Improvement

    When a user reports something, they report an incident. An incident is something they need help with, but it is not necessarily a defect. A user asking how to do something that the system does not do is not a defect but can be interpreted as a possible improvement (if many users ask about it, it might be worth building it).

    However, if the user is complaining about not being able to complete some action that the system is supposed to support according to the requirements, it is a defect. It doesn’t matter if the user is doing it right or wrong. If there is a way to get it done and the user can’t figure it out, you have a usability defect.

    Note that managing the incident does not require the defect to be fixed. The priority should always be to support the user by fixing that occurrence or providing a workaround while the team works on a permanent solution. When a workaround is defined, it should be integrated into the TSG or FAQ until a permanent fix is worked on. If you have a defect (the system’s behavior is different from what the requirements specify), you should probably add it to the TSG. If you have a usability defect, you will likely want to add it to the FAQ.
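
    The distinctions in this section can be written down as a small decision function. This is a sketch of the rules as described here, with hypothetical names (`Kind`, `classify`, `document_in`); sending improvements to the FAQ follows the earlier suggestion to list functionality that is not supported yet.

```python
from enum import Enum


class Kind(Enum):
    IMPROVEMENT = "improvement"
    DEFECT = "defect"
    USABILITY_DEFECT = "usability defect"


def classify(in_requirements: bool, works_as_specified: bool) -> Kind:
    """Classify a reported incident using the two questions from the text."""
    if not in_requirements:
        # The system was never supposed to do this: a possible improvement.
        return Kind.IMPROVEMENT
    if not works_as_specified:
        # Behavior differs from what the requirements specify.
        return Kind.DEFECT
    # It works, but the user could not figure out how to do it.
    return Kind.USABILITY_DEFECT


def document_in(kind: Kind) -> str:
    """Where the workaround or answer should be documented."""
    return "TSG" if kind is Kind.DEFECT else "FAQ"


kind = classify(in_requirements=True, works_as_specified=False)
print(kind.value, "->", document_in(kind))
```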

    Triage

    Triage is the process the Customer Support team executes to prioritize tickets according to their impact on the customer or the organization regarding financial implications, image/brand, etc.

    It’s important to understand that the time it takes to fix a defect, or whether a workaround can be established, has no relation to the impact the defect has on the business or the customer.
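
    In code, this means the triage order is computed from impact alone. The sketch below assumes a four-level impact scale and a `fix_hours` estimate on each ticket; both are illustrative, and the estimate is deliberately ignored when ordering.

```python
# Assumed impact scale; lower rank means it is handled first.
IMPACT_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def triage_order(tickets):
    """Order tickets by business/customer impact only; fix effort plays no role."""
    return sorted(tickets, key=lambda t: IMPACT_RANK[t["impact"]])


tickets = [
    {"id": "T1", "impact": "low", "fix_hours": 1},
    {"id": "T2", "impact": "critical", "fix_hours": 40},
    {"id": "T3", "impact": "high", "fix_hours": 2},
]

for t in triage_order(tickets):
    print(t["id"], t["impact"])
```

    The one-hour fix (T1) still comes last because its impact is low.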

    What is important is that, during the triage process, the customer support agents gather as much information as they can to support engineers in designing a workaround or providing a fix.

    Therefore, it is the responsibility of the engineers to clearly document which information needs to be collected and what the steps are to collect it in a way that customer support agents can be trained and can understand how to do it.
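
    A lightweight way for engineers to document what they need is a checklist per product area that the support tool can validate against. The areas and items below are made up for illustration.

```python
# Hypothetical checklist, maintained by the engineering team per product area.
REQUIRED_INFO = {
    "login": ["browser and version", "time of the attempt", "error message shown"],
    "billing": ["invoice number", "payment method"],
}


def missing_info(area, collected):
    """List the required items the agent has not collected yet for this area."""
    return [item for item in REQUIRED_INFO.get(area, []) if not collected.get(item)]


collected = {"browser and version": "Firefox 128"}
print(missing_info("login", collected))
```

    An agent can escalate once this list is empty, and engineers get tickets that already contain what they asked for.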

    Root Cause Investigation

    Root cause investigation is the process of finding the cause of an incident or a group of similar incidents. There are different approaches to root cause investigation.

    For high-traffic software, often one-time incidents are just treated at the incident level, and no root cause investigation is performed until the incident happens above a specific frequency. Sometimes, users make mistakes or have unique situations, and it might not be worth making permanent changes to the software just for one case or another.
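
    A frequency rule like this is easy to automate. The sketch below assumes each incident carries a `signature` that groups similar incidents and a `reported` date; the 30-day window and the threshold of 3 are arbitrary example values.

```python
from collections import Counter
from datetime import date, timedelta


def signatures_to_investigate(incidents, today, window_days=30, threshold=3):
    """Return incident signatures seen at least `threshold` times in the window."""
    cutoff = today - timedelta(days=window_days)
    counts = Counter(i["signature"] for i in incidents if i["reported"] >= cutoff)
    return sorted(sig for sig, n in counts.items() if n >= threshold)


# Hypothetical incident log.
incidents = [
    {"signature": "checkout-timeout", "reported": date(2024, 1, 5)},
    {"signature": "checkout-timeout", "reported": date(2024, 6, 10)},
    {"signature": "checkout-timeout", "reported": date(2024, 6, 15)},
    {"signature": "checkout-timeout", "reported": date(2024, 6, 20)},
    {"signature": "login-loop", "reported": date(2024, 6, 12)},
    {"signature": "export-crash", "reported": date(2024, 2, 1)},
    {"signature": "export-crash", "reported": date(2024, 3, 1)},
    {"signature": "export-crash", "reported": date(2024, 6, 25)},
]

print(signatures_to_investigate(incidents, today=date(2024, 6, 30)))
```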

    When conducting a root cause investigation, teams often stop at the cause at the code level and fix it there. That fault may be the one causing the software failure, but it might not be the root cause.

    In a software development process, defects propagate. If a defect is present in a requirements document, it will likely propagate to the technical designs built from that document. The code should also replicate the defect because it should adhere to the requirements. Likewise, the tests are not executed to find defects in the requirements but to assess the system’s adherence to them. That is, if the software was built with internal consistency. I had a professor at the university who told me, “If there is a lie in the requirements and you couldn’t detect it, provided that you did the rest of the job right, you told the same lie until the end.”

    That means that while the root cause investigation might have detected the fault in the code that is causing the defective behavior, it is NOT the root cause. Finding the actual root cause (the origin of the defect) will enable you to start focusing on how to deliver better in future releases by preventing defects from happening in the first place.

    Defect Avoidance

    I think that the most important piece of this post might be this one. This is how we actually improve our process to reduce the number of defects. All the things we discussed so far are tied to improving the management of the incident, not having fewer incidents to manage.

    So, what can we do?

    I’ve mentioned the defect propagation effect. When we make a mistake and include a fault in any software artifact, any other software artifact built based on the faulty one will likely carry the fault forward (if we develop with consistency).

    Artifacts built at the beginning of the process (requirements, mockups, etc.) will have faults. These faults will carry over to intermediate artifacts (like technical specifications, models, etc.). When the last* artifacts (code, tests, manuals, installation guides, etc.) are developed, these faults will be built into them. This is expected. We even have a name for the quality assurance process that makes sure that we have done all these transformations consistently, which is verification.

    *by “last” I mean artifacts delivered without necessarily being used as input to build other artifacts in the same process.

    A defective behavior in the software is called a failure. This behavior may be caused by one or more faults in the code. While a fault may have been introduced directly into the code, it may just as well have been carried over from another artifact.

    By gathering the information on the actual root cause, the origin of the defect, we can detect which steps of our process are more prone to insert faults (or which are the cheapest defects we can avoid). We can then decide to take different actions to minimize the rate at which we introduce defects. Some approaches can be:

    • Improve the templates to capture missed information and minimize ambiguity for both the person writing the artifact and the person reading it.
    • Add or improve QA activities. For code, we can do code reviews and tests. For tests, we can do mutation testing. For models, we can adopt checklists. For textual documents, we can also adopt checklists, reviews, inspections, perspective-based reading techniques, etc. Many practices can be employed to improve the quality of these artifacts. The best one will depend on each scenario.
    • Invest in training. Often, people don’t understand how what they are building will be used in future activities, or sometimes they don’t know how to best use the tools and templates adopted by the organization. Or they may simply need the training to perform their roles better.

    By identifying these improvement opportunities, you can reduce the number of defects being added. There will be fewer defects to be captured and fewer defects escaping to production. The cumulative result will lead to fewer reported incidents.
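
    Finding these opportunities starts with a tally of where defects originated. The sketch below assumes each defect record has an `origin` field filled in during root cause investigation; the activity names and the sample data are illustrative.

```python
from collections import Counter


def origin_breakdown(defects):
    """Return (origin, count, percent) rows, most frequent origin first."""
    counts = Counter(d["origin"] for d in defects)
    total = sum(counts.values())
    return [(origin, n, round(100 * n / total)) for origin, n in counts.most_common()]


# Hypothetical defect records with the activity that introduced each defect.
defects = [
    {"id": "D1", "origin": "requirements"},
    {"id": "D2", "origin": "code"},
    {"id": "D3", "origin": "requirements"},
    {"id": "D4", "origin": "requirements"},
]

for origin, n, pct in origin_breakdown(defects):
    print(f"{origin}: {n} ({pct}%)")
```

    In this made-up sample, most defects trace back to requirements, which would point to templates or reviews for that activity rather than more code-level testing.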

    Fewer reported incidents mean less effort fixing defects, fewer high-severity, stressful incidents, fewer interruptions to the planned work, higher productivity, greater satisfaction (on all sides), and a better image for the business and the product.

    The same approach should also be followed once a defect is detected during any QA activity, whether a test, a review, or an inspection. It allows for fixing the entire software consistently (not just the code).

    While doing that only for defects found in production will keep you focused on the type of faults that escape QA, doing it for all of them will ensure you save time and effort throughout the process and increase productivity.

    Support Tier 2

    When support can’t resolve the incidents by applying their knowledge about the system, their business knowledge, and the knowledge base (FAQs and TSGs), the incident needs to be escalated to a technical team.

    Organizations either allocate these incidents to the same teams that are working on improving the software or keep that as a last resort and allocate the incident to an engineering team specialized in working on the incidents.

    Both strategies have been proven to be successful, so I will just mention a few caveats with each approach.

    Support Tier 2 teams are usually not very experienced with specific parts of the system. Their focus is on providing a workaround and performing root cause analysis. Often the root cause fix will be performed by the team responsible for the scope where the cause was identified. It is important that the Support Tier 2 team involves the team that owns that scope in any changes that need to be made to the system. They should:

    • Fix the defect in the system and ask for the owning team to perform code reviews and design tests.
    • Identify and report any artifacts that will need to be fixed (requirements, technical specifications, manuals, etc.).
    • Update FAQs and TSGs accordingly.

    Therefore, the owning team will still have responsibilities related to the incident; it’s just that these responsibilities won’t affect the SLO anymore. Also, sometimes high-severity incidents go straight to the owning team.

    I’ve seen some organizations complaining that the owning teams would care less about the quality of the product because they don’t handle incidents. That is usually not true, but if that is a concern, a winning strategy is to have the owning teams handle all incidents related to the most recent release.

    If there is no Support Tier 2 team, the incidents go straight to the owning team.

    This team will have to reserve effort from the sprint (usually part of the velocity) to handle incidents that cannot wait for the next sprint plan to be prioritized due to the SLO. In Part II of this series, I’ll talk about strategies to do just that.
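
    The escalation rules described in this section can be summarized as a routing function. This is a sketch with assumed field names (`severity`, `release`) and policy flags; which rules an organization actually turns on is its own decision.

```python
def route(incident, has_tier2=True, high_severity_to_owner=True, latest_release=None):
    """Decide who receives an incident that support could not resolve."""
    if not has_tier2:
        # No specialized team: everything goes to the team that owns the scope.
        return "owning team"
    if high_severity_to_owner and incident["severity"] == "high":
        # Optional policy: high-severity incidents skip Tier 2.
        return "owning team"
    if latest_release is not None and incident["release"] == latest_release:
        # Optional policy: the owning team handles incidents on its newest release.
        return "owning team"
    return "support tier 2"


incident = {"id": "INC-7", "severity": "medium", "release": "2.3"}
print(route(incident, latest_release="2.4"))
print(route(incident, latest_release="2.3"))
```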
