Introducing Istio to a production cluster
Service meshes are a powerful way to manage network traffic at runtime. They work best when the mesh encompasses every endpoint. If you already have a Kubernetes cluster running in production, introducing a service mesh such as Istio can be hard.
Real Kinetic has helped clients deploy Istio to production with great effect, and I wanted to talk through some of the tips and strategies we’ve employed to achieve that.
Getting started
To start with, install Istio into lower, non-prod environments. There is a plethora of Helm chart configuration options for fine-tuning the deployment of Istio to the cluster. When starting out, we suggest just using the defaults.
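If you’re installing from the Helm charts bundled with an Istio 1.x release (which is what this post assumes), a minimal default install looks something like this; chart paths and flags vary by version, so check the docs for yours:

```sh
# Helm 2-style install of the bundled chart with default values.
kubectl create namespace istio-system
helm install install/kubernetes/helm/istio \
  --name istio \
  --namespace istio-system
```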
Autoscaling various components (e.g., Mixer, Pilot) may become necessary as the load through the mesh increases, but by default, Istio handles quite a bit of load. The metrics that drive autoscaling of Istio components are usually the rate of configuration change within the mesh and the sheer number of sidecar proxies that need to be managed.
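For reference, the Helm charts expose per-component autoscaling knobs. Here’s a sketch of a values override; the key names assume the Istio 1.x charts, so verify them against your chart version:

```yaml
# values-autoscaling.yaml, passed to helm with -f
pilot:
  autoscaleEnabled: true
  autoscaleMin: 2   # keep headroom for proxy churn
  autoscaleMax: 5
mixer:
  telemetry:
    autoscaleEnabled: true
    autoscaleMin: 2
    autoscaleMax: 5
```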
For an initial introduction, Istio should be configured with mTLS set to “PERMISSIVE”, which allows non-mTLS requests to enter the mesh while services are migrated gradually.
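Using the Istio 1.x authentication API (newer releases express this with a PeerAuthentication resource instead), a mesh-wide PERMISSIVE policy looks like this:

```yaml
apiVersion: authentication.istio.io/v1alpha1
kind: MeshPolicy
metadata:
  name: default   # the mesh-wide policy must be named "default"
spec:
  peers:
  - mtls:
      mode: PERMISSIVE   # accept both mTLS and plain-text requests
```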
Your first service
When migrating the first service into the mesh, a low-impact service should be selected. We have found that admin/support portals are an excellent place to start. If there is an issue during deployment, Istio can easily be removed from the deployment config and the service can run as usual, with minimal impact on consuming users/services.
Start by using istioctl kube-inject, which decorates the Kubernetes YAML with the sidecar configuration when running kubectl apply (or via Helm), and add a toggle flag to your deploy scripts so injection can be enabled/disabled as required. Istio features such as traffic management (VirtualService, DestinationRule), reliability (retries, rate-limiting), or network policies should NOT be used initially. This allows confidence in Istio to grow while it acts as a simple pass-through proxy.
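The manual injection step looks like this, where deployment.yaml stands in for your own manifest:

```sh
# Render the manifest with the Envoy sidecar added, then apply it.
istioctl kube-inject -f deployment.yaml | kubectl apply -f -
```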
One thing that trips people up when starting to use Istio is that all egress from the mesh is subject to network policies. The default policy for all egress traffic is to DENY. If a service accesses an S3 or GCS bucket, by default, Istio prevents traffic from exiting the mesh to the external service. A ServiceEntry must be used to configure Istio to allow traffic to flow to the external service.
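As a sketch, here’s a ServiceEntry that lets in-mesh services reach GCS; tailor the hosts and ports to whatever external dependency your service actually calls:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: google-cloud-storage
spec:
  hosts:
  - storage.googleapis.com
  location: MESH_EXTERNAL   # the endpoint lives outside the mesh
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  resolution: DNS
```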
Ingress and load balancing
Once you’ve got a few services deployed using Istio, the next step is to start looking at services that handle ingress traffic external to the cluster. Define an Ingress Gateway (or use the default that is created as part of the initial install). Configure your load balancers (ALB, GCLB, Nginx, Traefik, etc.) to include the gateway as part of the pool of endpoints able to accept traffic. Traffic will now either go directly to the pods or through the service mesh. Define a VirtualService bound to the gateway to tell Istio where to route the traffic once the gateway has received it, or you’ll receive a blank 404 page. Note that we’re still not configuring any advanced traffic-management features yet, just directing the traffic where it is meant to go.
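Here’s a sketch of the two resources working together, using a hypothetical admin-portal service and hostname:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: admin-portal-gateway
spec:
  selector:
    istio: ingressgateway   # bind to Istio's default ingress gateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "admin.example.com"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: admin-portal
spec:
  hosts:
  - "admin.example.com"
  gateways:
  - admin-portal-gateway    # without this binding you get the blank 404
  http:
  - route:
    - destination:
        host: admin-portal  # the Kubernetes service name
        port:
          number: 80
```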
Auto-injection
Istio has a neat feature where you can label a namespace with “istio-injection=enabled” to automatically inject the necessary Kubernetes config to deploy the sidecar into each pod. Every pod deployed within the namespace will automatically be added to the mesh without the need for the istioctl kube-inject step listed above. Since the label applies to a whole namespace, focus on migrating services namespace by namespace. When complete, the istioctl kube-inject toggle can be removed from the deploy scripts.
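Enabling it is a one-liner, where my-namespace is a placeholder for your own namespace:

```sh
# Every pod deployed to this namespace now gets the sidecar automatically.
kubectl label namespace my-namespace istio-injection=enabled
```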
Be sure to label the “istio-system” namespace with “istio-injection=disabled” — we learned this one the hard way and had to start the Istio components manually. This didn’t cause any downtime because the mesh was just static for a while, but it meant that updates could not be deployed to the mesh. It was a fun one to debug. :)
Advanced features
Once a decent number of services are deployed into the mesh, more advanced traffic management and policies can be implemented.
RBAC
By default, Istio allows all services to talk to all other services in the mesh. Depending on your setup, this may not be ideal. Indeed, following the principle of least privilege, the reach of each service within the mesh should be restricted. This can be achieved using Istio’s RBAC features: the ServiceRole and ServiceRoleBinding objects. The shape of allowed requests (path, method, etc.) can also be defined. With this in place, if a remote attacker were able to gain access to a shell running in a pod, they would only be able to talk to the predefined set of services.
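A sketch using the Istio 1.x RBAC API (later versions replace these objects with AuthorizationPolicy); the service and service-account names here are hypothetical, and note that RBAC must first be enabled mesh- or namespace-wide via ClusterRbacConfig (RbacConfig on older releases):

```yaml
apiVersion: rbac.istio.io/v1alpha1
kind: ServiceRole
metadata:
  name: orders-viewer
  namespace: default
spec:
  rules:
  - services: ["orders.default.svc.cluster.local"]
    methods: ["GET"]          # restrict the request shape, too
    paths: ["/orders/*"]
---
apiVersion: rbac.istio.io/v1alpha1
kind: ServiceRoleBinding
metadata:
  name: bind-orders-viewer
  namespace: default
spec:
  subjects:
  - user: "cluster.local/ns/default/sa/frontend"  # the caller's service account
  roleRef:
    kind: ServiceRole
    name: orders-viewer
```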
Reliability
One of the value-adds of a service mesh is the ability to monitor and control network traffic at runtime. Defining the maximum number of concurrent requests that can hit a service, rate limiting, circuit breaking, and timeouts can provide huge improvements in resiliency for your services, and therefore for the users consuming those services.
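Connection-pool limits and circuit breaking both live on a DestinationRule; here’s a sketch for a hypothetical orders service (the thresholds are illustrative, not recommendations):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: orders
spec:
  host: orders
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # cap concurrent connections to the service
      http:
        http1MaxPendingRequests: 50  # queue depth before requests are rejected
    outlierDetection:                # circuit breaking: eject unhealthy pods
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```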
Istio supports retries, but we tend to promote having the mesh abstract these away from the underlying service. That said, mesh-level retries can mask failures a service should react to: for example, it may be important for a service to know when it is struggling to get a connection to a database so that it can fail fast. Also, requests must be idempotent in order to support retries safely; think through your RPC semantics before flipping retries on.
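If you do opt in, retries are configured per route on a VirtualService; here’s a sketch for a hypothetical payments service (the retryOn field, available in Istio 1.1 and later, maps to Envoy’s retry policies):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments
spec:
  hosts:
  - payments
  http:
  - route:
    - destination:
        host: payments
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx,connect-failure  # only safe if the handlers are idempotent
```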
Automated deployment and rollback
Istio’s fine-grained traffic-management features allow for sophisticated deployment strategies, such as canary releases, within the Kubernetes cluster. Tools like Spinnaker and Flagger can make canary deployment a breeze, including the automation of rollbacks in case there is an issue. I recently gave a talk on canary deployments with Flagger and will be turning this into a blog post shortly.
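Under the hood, these tools shift traffic using weighted routes; here’s a sketch splitting traffic between two subsets of a hypothetical orders service (the v1/v2 subsets would be defined in a matching DestinationRule):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: orders
spec:
  hosts:
  - orders
  http:
  - route:
    - destination:
        host: orders
        subset: v1
      weight: 90
    - destination:
        host: orders
        subset: v2   # the canary
      weight: 10
```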
Observability
Observability is a huge feature of service meshes. The ability to see how a request progressed through the mesh, along with valuable metadata such as timing, is invaluable for debugging, especially in a distributed, microservices environment. Istio supports tracing out of the box via the Zipkin-style B3 headers. To increase the value of the traces that Istio generates, any code that makes requests to other services in the mesh should be configured to forward these headers. That allows Istio to track the context of a request as it spans multiple services.
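Per Istio’s distributed-tracing documentation, these are the headers an application should copy from each incoming request onto its outgoing requests:

```
x-request-id
x-b3-traceid
x-b3-spanid
x-b3-parentspanid
x-b3-sampled
x-b3-flags
x-ot-span-context
```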
Upgrading
Upgrading infrastructure you don’t own is fraught with risk. Istio is a distributed system and has a lot of moving parts. The documentation is useful but sparse and doesn’t have a troubleshooting guide. How to go about upgrading Istio is outside the scope of this post, but if possible, we typically recommend having experts manage this for you. For example, if you’re using GKE, you can enable managed Istio for your cluster. Same goes for Azure and IBM Cloud. I am not aware of such an offering for AWS at this time.
Conclusion
Introducing Istio into a production cluster can be a daunting task. Hopefully, the tips and strategies laid out here help you to roll out a service mesh successfully. We’ve done this successfully at Real Kinetic with a number of our clients — come talk to us.