What Is the Customer Impact?
Failures happen and there is no avoiding them. However, we should strive to ensure they cause the minimum possible customer impact. I’m going to use a trivial story of something that happened to me recently to show why being able to measure customer impact during failures is critical.
I recently planned my week around an important delivery. Late in the afternoon — when my packages typically arrive — I received a delivery status update that read: “DELAYED: Due to operating conditions we no longer know when your package will be delivered.” This was a confusing message and the delivery time changed from “today” to “unknown.” I reached out to UPS customer support and finally learned the package didn’t make it on the truck. Their only suggestion was using MyChoice and electing to pick up the package in person, which would add at least another day and involve a drive to another city.
I fully expect failures, especially shipping delays around the holidays. Handled well, the customer experience after the initial failure can not only resolve the issue with minimal impact but also create a more loyal customer. Instead, UPS made things worse by adding uncertainty and requiring more of my time. They now have a frustrated customer rather than a loyal one.
We can break down what happened in my case to see how they created a frustrated customer. First, I went from trying to check delivery status on the site to completely confused about the status. This is the first failure. Even worse than no messaging is confusing messaging. They placed the burden of understanding on me, the customer. I had to reach out to support instead of them reaching out to me. My time costs are now increasing. The rep was perfectly pleasant and sympathetic, but also of no help. Their only offered course of action was something that further increased the impact on me by taking more of my time and putting all of the burden on me. I would need to change the delivery options via the website, then I would need to go pick it up.
This entire process made a simple, common failure worse for a customer. This is the opposite of how support should work. If I were managing this area at UPS, I would be doing retros and failure analysis on the error resolution process. While I would want to minimize delivery issues, my focus would be on minimizing the customer impact of our failures. I would want all possible data and profiles on customers. Are they a new or recurring customer? How has their experience been previously (what’s their overall satisfaction been)? Have they reached out to support previously? Is this their first time having this specific issue? What are the metrics on our Twitter feed? I want to get a picture of our customer satisfaction and then do whatever I can to improve the customer experience. I might prioritize solving the resolution issues over solving the actual shipping failures. If we handle our customer experience correctly, this will provide us cover for the inevitable failures we do hit. UPS may be, and hopefully is, doing all of these things, which would just go to show how difficult they are. Customer focus must be a part of your culture, and changing culture is difficult.
Why Apple and Amazon Are Eating Your Lunch
I am not mad that UPS screwed up the delivery. Things happen, I am a realist. I expect failure. I judge you not by your failures, but by how you handle your failures. I’m now an Apple fanboy, not because I believe they have the best tech or make the best products, but because every issue I have encountered has been taken care of with the best customer service I’ve ever seen. I’ve had no bad customer experience issues with Apple. Their service is good enough that even an average experience is shocking. Apple values their customers and it shows. In appreciation for them valuing me, I continue to pay for their products and services even when I know the competition can provide better products. This plus convenience is also why if I’m not using Apple I’m using Amazon. UPS’s handling of this issue wasn’t terrible. I just have higher expectations now. I’m used to the world of Amazon and Apple.
Not only have Amazon and Apple changed my perspective, but I was lucky enough to work at a company that valued customers in the same way and has the results to show it. Workiva is a truly customer-focused organization. During my time there, Workiva maintained a Net Promoter Score at levels often higher than Apple’s. As much as I want to credit the wDesk software I helped build with Workiva’s success, I believe the dominant factor was their focus on the customer. This not only led to better, customer-centric software, but to an entire organization and set of processes built around the customer. As engineers we spent tons of time fully engaged with our customer support staff. Especially in the early days of the company, our customer success (CS) reps often played the role of the product manager and served as a great proxy for the customer when we didn’t have direct access to the customer. We spent significant time at their desks ensuring issues were resolved quickly. Before the data analysis organization was formed, we had to rely on customer interactions to get an intuition for customer experience. This direct engagement allowed us to internalize the customer impact of our engineering failures. When we set out to solve issues, it wasn’t just about fixing a bug; it was about building a system that could handle failure. We ended up dedicating months of time to building in failure handling systems and processes. We knew that there was no way to ensure a connection from a client to a server would be successful, so we built a framework on the client to handle those issues. We hardened the client software to the point that the backend system could go down entirely for minutes without any significant customer impact. This mindset led to my partner Robert Kluin creating an Incident Response process designed to ensure customer impact was minimized and thoroughly understood. We’ll follow this post up with the details of that process and the impact it had.
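To make the idea of a client that tolerates a failed connection more concrete, here is a minimal sketch of one common technique: retrying with exponential backoff. It is not the framework we built at Workiva. The function name `call_with_retry`, its parameters, and the default values are all assumptions chosen for illustration.

```python
import time


def call_with_retry(request, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call request() and retry on ConnectionError with exponential backoff.

    All names and defaults here are hypothetical. `request` is any
    zero-argument callable that talks to the server.
    """
    for attempt in range(attempts):
        try:
            return request()
        except ConnectionError:
            # Out of attempts: surface the error to the caller.
            if attempt == attempts - 1:
                raise
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

A real client would likely also queue the user's work locally while the backend is unreachable, so that a short outage shows up as a brief delay rather than lost work.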
These companies all share a culture that puts the customer first. It is ingrained deeply into their cultures. It cannot easily be injected later, at least not in any case I’ve seen succeed. Why is Amazon investigating building their own delivery service? One critical reason is so they can control that process and ensure that it provides the best possible customer experience. They need a process built with their culture and values. Their culture and values are fanatically customer-centric. They certainly aren’t going to inject their specific culture of caring about customers into a partner like UPS. Even if they were to buy UPS, it would be nearly impossible to shift the culture of an org of that size and age. With that in mind, I expect Amazon to really hurt UPS once it decides to go all in on delivery.
How do we get this wrong? A common scenario I see following a failure is a manager saying: “Explain to me how this will never happen again.” While I appreciate attempting to fix the problem, this is relatively low on my priority list after a failure has occurred. We have other processes in place, among engineering practices, tests, QA, and so on, to ensure that this specific issue won’t occur again anytime soon. I don’t care about the specific failure because failures are going to happen. What I care about is how we handle failures and that we are learning to minimize the customer impact from failures. I expect failure is going to happen again; I want to be good at minimizing the impact and recovering quickly.
Customer Impact
When assessing a failure, the single most important question I want answered is “what was the customer impact?” I don’t want meaningless numbers quantifying the impact. I want you to do the best you can to actually measure the impact on the customer’s experience. Was the issue just a blip in their day and they barely noticed? Or, did it cost them hours of productivity? Do we now have customers that went from loyal to disillusioned? I want a summary that includes the number of impacted customers, the impact per customer, and details on any critical customers impacted.
Yes, you have high-impact customers. The obvious high-impact customers are the ones that pay you the most money. However, you should also consider influential customers, such as marquee customers and those highly visible on social media. These are often customers you’ll put on watch lists after they have a particularly negative experience. I want to know if they were impacted by this outage and how severely.
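The summary I am asking for can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the `ImpactRecord` type, the `minutes_lost` estimate, and the `is_critical` flag are hypothetical stand-ins for whatever impact measure and watch list your organization actually uses.

```python
from dataclasses import dataclass


@dataclass
class ImpactRecord:
    """Hypothetical per-customer impact estimate for one incident."""
    customer_id: str
    minutes_lost: float        # estimated productivity lost (an assumed measure)
    is_critical: bool = False  # e.g. top revenue, marquee, or on a watch list


def summarize_impact(records):
    """Return impacted count, impact per customer, and critical customers hit."""
    impacted = [r for r in records if r.minutes_lost > 0]
    total = sum(r.minutes_lost for r in impacted)
    return {
        "impacted_customers": len(impacted),
        "avg_minutes_lost": total / len(impacted) if impacted else 0.0,
        "worst_minutes_lost": max((r.minutes_lost for r in impacted), default=0.0),
        "critical_customers": sorted(
            r.customer_id for r in impacted if r.is_critical
        ),
    }
```

The hard part is not the arithmetic but producing an honest `minutes_lost` (or equivalent) for each customer, which is exactly the measurement work this section argues for.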
This is the beauty of Software-as-a-Service (SaaS) businesses. The business’s and customers’ incentives are much better aligned. Customers pay you on an ongoing basis to provide value and customer service that meets their expectations. As long as there is competition, customers have alternatives, and if there are no big up-front costs, there’s a much lower transition cost. This means we must always measure customer satisfaction to ensure we can keep them as customers. Specifically, for a SaaS company to be successful, you must have high levels of customer retention to retain the recurring revenues that offset the R&D costs you paid up-front plus your ongoing expenditures. Not to mention, the best way to drive new business is often through existing customers. You typically won’t have high retention rates, and certainly won’t get help driving sales, unless you have strong customer loyalty. The a16z podcast episode “The Rise of the CCO” (Chief Customer Officer) focused on this topic. I recommend giving it a listen, especially if you are a SaaS company.
Response
The next thing I want to know after an incident is “how was our response?” Were we aware there was an issue before we started getting reports from customers? Who was aware: operations, development, quality assurance, customer support, sales? I want to know who was made aware and how. For those directly responsible for resolution (often engineering and operations), did we have the tools in place to detect this issue? What was the level of customer impact by the time we were aware? Did we have tools and processes in place to effectively communicate with stakeholders? In this case, stakeholders means everybody from customers to our executive management and support staff. We should have a predetermined list of stakeholders that are notified depending on the severity (both potential and realized) of the issue. Speaking of impact, did we correctly measure and report the customer impact? Different stakeholders have different priorities, making it difficult to have a single quantifiable measurement of impact.
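A predetermined notification list can be as simple as a table mapping each stakeholder group to the severity at which it gets notified. The sketch below is an assumption-laden illustration: the severity levels, the group names, and the thresholds in `NOTIFY_AT` are all hypothetical, and a real policy would be agreed on by the stakeholders themselves.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical policy: the minimum severity at which each group is notified.
NOTIFY_AT = {
    "engineering": Severity.LOW,
    "operations": Severity.LOW,
    "customer_support": Severity.MEDIUM,
    "customers": Severity.HIGH,
    "executives": Severity.HIGH,
}


def stakeholders_to_notify(potential, realized):
    """Pick groups using the worse of the potential and realized severity."""
    level = max(potential, realized)
    return sorted(
        name for name, threshold in NOTIFY_AT.items() if level >= threshold
    )
```

Taking the maximum of potential and realized severity means an issue that could have been serious still reaches the right people, even if we got lucky on actual impact.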
How was our communication process overall? Did stakeholders get the correct notifications via our predetermined communication channels? We don’t want the CEO to first hear about an issue from a customer’s executive management. Ideally we also wouldn’t have stakeholders hearing about issues from social media accounts not controlled by us. It is critical that we control the message, but also that we’re open and honest about the issue to ensure we maintain (or gain) trust with our stakeholders. If we lie and they get the real story later, then all trust is gone. If we hold information back and they still obtain it, their trust in us will be diminished.
The Failure
After we’ve covered customer impact and our response, I want to know about the failure itself. However, as I mentioned, I don’t really care about the specifics of the failure. I am interested in making sure we learned from the failure so that we can avoid similar failures in the future. For an issue to actually impact customers, it almost certainly wasn’t a single failure. We too often focus on the technical specifics of the issue that occurred. And, while I want the technical issue fixed, I also want to address all of the other failures in the system. Failure is inevitable; my goal is to keep customer impact costs as close to zero as possible. This means building a fault tolerant environment. Fault tolerance is different from failure avoidance: total avoidance isn’t actually possible.
This isn’t to say that we shouldn’t spend time attempting to avoid failure. We should put boundaries and processes in place that help us avoid as much failure as possible. However, we are a business that needs to ship to survive and take risks to thrive. Thus we should target a “good enough” threshold for delivery, which is typically something less than perfection. I highly recommend watching Brian Troutwine’s excellent presentation on Fault Tolerance to get a better understanding of why we can’t build perfect solutions. In his presentation he walks through the costs of failure avoidance, using NASA as an example.
Being tolerant means having processes in place to handle failures. Those processes should be as automated and instantaneous as possible. If an issue resulted in measurable customer impact, our failure-handling processes did not handle the issue correctly. Latency was encountered somewhere in our process. It’s our goal to find the areas that were not responsive enough and improve them as much as possible.
The Post-mortem
The post-mortem is where we spend time investigating and learning from not just the failures in the system, but our failures in responding to the issue. Too often I see post-mortems only focused on the solution to the specific issue. Instead we must use the post-mortem to improve our failure handling process. As a general rule, if a failure caused any customer impact, a post-mortem should be done. The goal is for any failure in our system to have nominal customer impact. This does not just mean seeing no issues in our monitoring dashboards, but reducing actual customer impact. A large increase in latency of a core service causing at most a slight increase in perceived latency by the customer is how I want failures to be handled. A slight increase in a service’s latency causing cascading failures and thus noticeable customer impact is not tolerable.
This is the danger of only relying on dashboards and system metrics. Those numbers are valuable to folks in operations but often mean nothing to our customers. During our post-mortem, we should be discussing not only the issues impacting our customers but also how we can better learn to quantify and display customer experience. This will likely require finding ways to correlate system metrics with customer experience. This is quite difficult to do, but it is an improvement worth making.
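As a first, rough step toward that correlation, you could compute a Pearson correlation between a system metric and some customer experience signal sampled over the same time windows. The sketch below assumes you already have two aligned series, for example hourly p95 latency and hourly support tickets. Those series and the function name `pearson` are illustrative assumptions, and correlation alone will not tell you what caused what.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two aligned, equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least two points")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

A value near 1 suggests the system metric tracks the customer signal closely and may deserve a place on the dashboard. A value near 0 suggests the metric says little about what customers actually experience.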
If you don’t have a process in place to learn from failures and better handle future failures, you should start today. Personally, I would invest significantly more into the tools and processes for handling issues that happen in production than I would in pre-production development. Yet, over and over I see massive amounts of money spent on upfront development tools and processes while the tools and systems to help diagnose and triage production systems are often neglected until an issue occurs. And, unfortunately, before we know it, we’re just going from fire to fire trying to keep up. One of the best ways to get out in front is to force your staff to use production tools for development, for example by learning to use logs and metrics rather than debuggers and SSH. Sadly, this is rarely done. We sacrifice customer experience in the name of shipping, when all we’re really doing is sacrificing ship time in the future for ship time now.
I understand that if you are a startup, shipping now is almost certainly the highest priority and you’ll have no choice but to sacrifice time in the future for time now. This is OK. Since you’re a startup, you likely have few customers. And as Geoffrey Moore points out in his book Crossing the Chasm, your early customers are what he categorizes as “Innovators” and “Early Adopters”. They are going to be more willing to tolerate failure and even poorer customer experience than the bulk of your future customers. It’s when you attempt to go from Early Adopters to the Early Majority (the chasm) that many companies fail. It’s here where your customer experience can save you. Use that time with your Innovators and Early Adopters to build out your failure handling protocols so that as you bring on more conservative customers, hopefully in bulk, you don’t quickly lose them and fall into the chasm.
Understanding these things now, along with the potential ways to build them out, will also help you lay the foundation early. You know you need to have engineers working with customers to build the best product. Don’t let the conversation end there. Ensure your engineers are there with your customer support to interact with customers when issues arise. Create a customer success department and hire a passionate leader. They can come from the product side, but it’s not necessary that they do. Just embed them early. Make them part of the development process. This is what Workiva did from day one and it was critical to their success.
If you can’t get your team to embrace using production tools, you should start with the manual processes for handling failures and the post-mortem process. You can’t optimize a process if the process doesn’t exist. Whether a startup or a mature company, you should not be skipping retros. Ever. Once you have the process, follow it and stick to it. It’s like a diet. It’s always easier to cheat, but in the long term it’s worth sticking with it. Once you have a process that you can optimize, you’re only steps away from a process that you can automate as you scale as a company. If you can automate failure recovery, you’ll be ahead of the curve and almost certainly have a competitive advantage.
We have seen cultures at companies best described as “everything is a fire.” It’s like an extension of the startup phase where we pretend every issue is new and critical: instead of having a standard process that ensures a disciplined, calm response, we run around panicking and screaming about the issues. This is where the heroes like to show up and save the day. This culture will not scale. Your customer experience will degrade. Engineers and teams will burn out. There aren’t enough of the hero folks. And your system has likely grown too complex to support a hero saving the day anyway. You must fix this culture. The only solution is to build a standard process, with full company buy-in, around handling failure. Once you are handling failure without significant customer impact, issues stop being emergencies handled in survival mode and become part of an optimization process.