Cloud strategies for COVID-19

Tyler Treat
Published in Real Kinetic Blog
6 min read · Apr 6, 2020

Written by Nick Joyce

COVID-19 presents many challenges for all of us, personally and financially. Governments have asked some service businesses that normally require close contact between people to close for the foreseeable future so that we, as a global community, can flatten the curve of the outbreak, saving many more lives and allowing health care services to bend under the pressure rather than break.

Demand for online services has surged as people follow social distancing guidelines. Video and streaming, team collaboration, and delivery services have all seen a sharp increase in traffic. Indeed, Microsoft recently announced that it had seen a 775% increase in demand for cloud resources in affected areas as businesses struggle to scale up to meet the tidal wave.

I’d like to outline some business and technical strategies that can work together to help maintain application reliability and scalability during these unprecedented times. I’m going to use the example of an online e-commerce store throughout the article, but the key points should apply to any online service.

Expectation setting

This is probably the item with the biggest impact on the business, both short and long term. If you align expectations with reality, you will rarely disappoint. The idea here is to give users confidence in the service and to build trust and loyalty that will, ideally, last long after the pandemic has subsided.

  • Be crystal clear about what the business can deliver right now. If goods cannot be delivered in a timely fashion, tell users early, ideally before they attempt to check out. One approach that might work well is to use the client’s IP address as a rough geographical locator. If they are coming from an impacted area, ask for a more specific location so you can give immediate feedback on when a delivery slot is most likely to be available.
  • Build a buffer into all timelines. These are the early days of the pandemic, with suggestions that social distancing may have to continue for 3–6 months. This is a very fluid situation, and things may get worse before they get better.
  • Communicate timelines early and effectively. Users like to feel confident that “you got this.” Do everything you can to ensure that any communicated timelines are upheld. If they feel like they need to be concerned, they are going to visit your service again and again, causing, in aggregate, a potentially significant amount of unnecessary traffic.
  • If timelines are set, be specific, and make sure any potential delays are communicated as early as possible to better manage expectations. Include any advice on what they can do to ensure reliable delivery.
  • In the worst case, if you cannot provide the service at all, suggest alternative ways for users to get what they were looking for, even if that means pointing them to a competitor. This builds trust that you’re looking out for them and have their best interests in mind.
  • Be wary of negative user experiences. Forcing a new user to sign up to your service just to tell them you can’t fulfill what they were looking for in the first place is going to leave the person frustrated. Because of social distancing, people have more time on their hands to tell others about a poor interaction, so this could have a significant longer-term impact on the brand and reputation of the business.
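
As a rough illustration of the IP-based feedback idea above, here is a minimal Python sketch. Everything in it is an assumption for the example: the network-to-region table, the region names, and the slot lead times are invented, and a real store would use a GeoIP database and its fulfilment system instead.

```python
import ipaddress
from typing import Optional

# Hypothetical data. The networks are reserved documentation ranges, and the
# lead times would really come from the fulfilment system.
REGION_BY_NETWORK = {
    ipaddress.ip_network("203.0.113.0/24"): "region-a",
    ipaddress.ip_network("198.51.100.0/24"): "region-b",
}
NEXT_SLOT_DAYS = {"region-a": 2, "region-b": 9}  # days until the next slot


def region_for_ip(client_ip: str) -> Optional[str]:
    """Return the coarse region for an IP address, or None if unknown."""
    address = ipaddress.ip_address(client_ip)
    for network, region in REGION_BY_NETWORK.items():
        if address in network:
            return region
    return None


def delivery_notice(client_ip: str) -> str:
    """Build the message shown on product pages, before checkout."""
    region = region_for_ip(client_ip)
    if region is None:
        # IP lookup is only a guess, so fall back to asking the user.
        return "Enter your postcode to see delivery availability."
    days = NEXT_SLOT_DAYS[region]
    return f"Earliest delivery slot in your area: about {days} days from now."
```

IP geolocation is coarse, so the notice is best treated as a first estimate that the user can refine with a specific location.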

Focus on functionality

Unprecedented demand means unprecedented scaling of the underlying infrastructure, including web servers, databases, and messaging systems. If the reliability of the application is starting to drop due to overloaded components of the system, it’s a good idea to focus on what core aspects of the service you can reliably provide.

For example, an e-commerce application is composed of many systems, including customer, order, payment, and review services. This is obviously an arbitrary example, but go with it. It could be that the overall system is struggling to serve product pages to new and existing customers. Suppose the issue is isolated to the review service, which is overwhelmed by the sheer amount of traffic and is responding to queries with significantly increased latency.

In this situation, the first step would be to disable the review service, which improves the user experience by significantly reducing the time it takes to render a page. Customer reviews are an important part of the normal buying process, but they are not critical, so not showing reviews in this case is okay. Load shedding and graceful degradation of the application are essential when dealing with massive spikes in traffic.
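
A minimal Python sketch of this kind of graceful degradation might look like the following. The names `render_product_page`, `load_reviews`, `SHOW_REVIEWS`, and the 0.2 second budget are all hypothetical, and `fetch_reviews` stands in for whatever client calls the review service.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict, List, Optional

_pool = ThreadPoolExecutor(max_workers=8)
REVIEW_BUDGET_SECONDS = 0.2  # assumed time budget for the optional section
SHOW_REVIEWS = True          # manual kill switch for the review section

ReviewFetcher = Callable[[str], List[str]]


def load_reviews(fetch_reviews: ReviewFetcher, product_id: str) -> Optional[List[str]]:
    """Return reviews, or None if they are disabled, slow, or failing."""
    if not SHOW_REVIEWS:
        return None
    future = _pool.submit(fetch_reviews, product_id)
    try:
        return future.result(timeout=REVIEW_BUDGET_SECONDS)
    except FutureTimeout:
        # cancel() cannot stop a call that is already running; it only
        # prevents a still-queued call from starting.
        future.cancel()
        return None
    except Exception:
        return None


def render_product_page(product_id: str, fetch_reviews: ReviewFetcher) -> Dict[str, object]:
    """Always render the core page; add reviews only when they arrive in time."""
    page: Dict[str, object] = {"product_id": product_id, "title": f"Product {product_id}"}
    reviews = load_reviews(fetch_reviews, product_id)
    if reviews is not None:
        page["reviews"] = reviews
    return page
```

The core page never waits longer than the budget for an optional section, and setting `SHOW_REVIEWS` to `False` sheds the review traffic entirely.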

The next step is to add a circuit breaker to the review service: if latencies rise beyond a threshold, an automated switch flips and prevents any requests from hitting the review service. Care should be taken to ensure that the breaker reintroduces the review service slowly and carefully so as not to cause further infrastructure instability.
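
To make the idea concrete, here is a simplified, single-threaded Python sketch of a latency-based breaker. It is not how any particular library implements it; the class name, thresholds, and the doubling ramp used to reintroduce traffic are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class LatencyCircuitBreaker:
    """Trips on repeated slow or failed calls, then ramps traffic back up."""

    def __init__(self, latency_threshold: float = 0.5, trip_after: int = 5,
                 cooldown: float = 30.0, initial_probe_ratio: float = 0.1,
                 clock: Callable[[], float] = time.monotonic,
                 rng: Callable[[], float] = random.random) -> None:
        self.latency_threshold = latency_threshold
        self.trip_after = trip_after
        self.cooldown = cooldown
        self.initial_probe_ratio = initial_probe_ratio
        self._clock = clock
        self._rng = rng
        self._bad_streak = 0
        self._opened_at: Optional[float] = None  # set while the breaker is open
        self._admit_ratio = 1.0                  # below 1.0 while recovering

    def call(self, func: Callable[[], T], fallback: T) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown:
                return fallback  # open: shed all traffic
            # Cooldown is over: start recovering with a small share of traffic.
            self._opened_at = None
            self._admit_ratio = self.initial_probe_ratio
        if self._admit_ratio < 1.0 and self._rng() >= self._admit_ratio:
            return fallback  # recovering: most requests are still shed
        start = self._clock()
        try:
            result = func()
            healthy = (self._clock() - start) <= self.latency_threshold
        except Exception:
            result, healthy = fallback, False
        self._record(healthy)
        return result

    def _record(self, healthy: bool) -> None:
        if healthy:
            self._bad_streak = 0
            # Double the admitted share after each healthy call, up to 100%.
            self._admit_ratio = min(1.0, self._admit_ratio * 2)
            return
        self._bad_streak += 1
        recovering = self._admit_ratio < 1.0
        if recovering or self._bad_streak >= self.trip_after:
            self._opened_at = self._clock()  # trip (or re-trip) the breaker
            self._bad_streak = 0
```

A call site would look something like `breaker.call(lambda: fetch_reviews(product_id), fallback=[])`, where `fetch_reviews` is a hypothetical client function. A single bad call during recovery re-opens the breaker, which keeps a struggling service from being flooded again.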

Another valuable runtime tool is feature flags. These are switches that can be flipped at runtime to show or hide large swathes of functionality. While this technique is typically used for testing new functionality on a small subset of users (in conjunction with A/B testing), it can be very useful when the load on the system is starting to overwhelm the underlying infrastructure. Feature flags allow us to separate deployments (of artifacts) from releases (of features), which lets us roll out changes gradually while managing risk.
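
As a sketch of how such a switch can work, the Python below keeps flags in memory and assigns each user to a stable percentage bucket. The `Flag` and `FeatureFlags` names and the `product_reviews` flag are made up for this example; a real deployment would typically load flags from a shared configuration store or a flag service.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass
class Flag:
    enabled: bool = True        # master switch for the feature
    rollout_percent: int = 100  # share of users (0-100) who see it


class FeatureFlags:
    """In-memory flag store; a real one would read from shared config."""

    def __init__(self, flags: Dict[str, Flag]) -> None:
        self._flags = dict(flags)

    def update(self, name: str, flag: Flag) -> None:
        """Flip or resize a flag at runtime, with no redeploy."""
        self._flags[name] = flag

    def is_enabled(self, name: str, user_id: str) -> bool:
        flag = self._flags.get(name)
        if flag is None or not flag.enabled:
            return False  # unknown flags default to off
        return self._bucket(name, user_id) < flag.rollout_percent

    @staticmethod
    def _bucket(name: str, user_id: str) -> int:
        """Map a user to a stable bucket in [0, 100) for this flag."""
        digest = hashlib.sha256(f"{name}:{user_id}".encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % 100


flags = FeatureFlags({"product_reviews": Flag()})
# Under load, hide reviews for everyone without shipping new code.
flags.update("product_reviews", Flag(enabled=False))
```

Hashing the flag name together with the user ID keeps each user's experience consistent between requests, so a partial rollout does not flicker on and off for the same person.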

Infrastructure strategies

Managing cloud compute resources effectively can go a long way to providing a good user experience. Some strategies include:

  • Most major cloud vendors provide multiple availability zones (AZs) per region. As discussed in previous editions, using them is part of a set of best practices for giving your service the best availability and resiliency.
  • Make sure that all zones within a region are configured for deployment (even if they are not used). A number of GCP and AWS regions have more than 3 AZs available to deploy infrastructure, yet we have seen more than one client that has configured a maximum of only 3. The issue arises when a request is made to the cloud vendor to spin up a new VM, but there are insufficient physical hardware resources available in the configured zones, so the request is denied. This can then potentially become a thundering-herd problem, so make sure that all the zones are available for use.
  • Use caching aggressively. For an e-commerce solution, it is typical that anonymous content is statically generated and served to customers via a CDN. This keeps expensive requests, such as those that hit the database, limited to users who are logged in. Introducing caches such as Redis or Memcached, or their cloud equivalents, can greatly reduce the load on database and application servers.
  • Use circuit breaking, as discussed earlier. Netflix’s Hystrix provides a good reference implementation of this.
  • Set limits for everything, at both the infrastructure and application level. Specifying the maximum RAM, CPU, and storage for each instance of the application allows autoscalers and provisioners to react more intelligently to load as it ebbs and flows. Set dispatch and concurrency limits on queues to avoid DDoSing yourself, particularly if you have integrations with more statically scaled infrastructure such as on-prem services.
  • In the interest of meeting expectations, application limits can (and, where applicable, should) be placed on the number of goods a customer can purchase in a single transaction or over a window of time. The same idea can also be used to form a queue whereby only a limited number of customers can use the site at any one time. This is not an ideal user experience, but it does mean that the site stays up and the warehouse backing the site is able to cope with the volume of orders.
  • Look at purchasing reserved instances. This is where you make an upfront payment to purchase an amount of compute capacity from the cloud vendor. This allows the vendor to perform better capacity planning and, depending on the vendor and the type of reservation, can also guarantee that you have the compute resources available when you need them.
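
To illustrate the per-customer purchase limit from the list above, here is a minimal in-memory Python sketch of a limit over a sliding time window. The `PurchaseLimiter` class and its parameters are hypothetical, and a multi-server store would need to keep this state somewhere shared, such as a database or cache.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, DefaultDict, Tuple


class PurchaseLimiter:
    """Caps the units of one item a customer can buy within a time window."""

    def __init__(self, max_units: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_units = max_units
        self.window_seconds = window_seconds
        self._clock = clock
        # (customer_id, sku) -> recent purchases as (timestamp, quantity)
        self._history: DefaultDict[Tuple[str, str], Deque[Tuple[float, int]]] = defaultdict(deque)

    def try_purchase(self, customer_id: str, sku: str, quantity: int) -> bool:
        """Record the purchase and return True if it fits within the limit."""
        now = self._clock()
        history = self._history[(customer_id, sku)]
        # Drop purchases that have aged out of the window.
        while history and now - history[0][0] >= self.window_seconds:
            history.popleft()
        already_bought = sum(units for _, units in history)
        if quantity <= 0 or already_bought + quantity > self.max_units:
            return False
        history.append((now, quantity))
        return True


# Example policy: at most 2 units of any one item per customer per 24 hours.
limiter = PurchaseLimiter(max_units=2, window_seconds=24 * 60 * 60)
```

A rejected purchase is a good moment to apply the expectation-setting advice from earlier: tell the customer what the limit is and when they can order again.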

Conclusion

COVID-19 is the kind of event no one has seen for a long time. Ensuring that people stay safe, healthy, and productive is of paramount importance. Hopefully, some of the items discussed here will help you ensure that your business is able to ride the wave and come out stronger on the other side.

Do you have any other ideas or suggestions for what might be useful at this stage? We’d love to hear from you.

Good luck out there.

Managing Partner at Real Kinetic. Interested in distributed systems, messaging infrastructure, and resilience engineering.