Try it now Joe: avoiding costly mistakes with admin portals
When you are building a new application or service, how do you change runtime settings and configuration? How do you enable new features? How do you debug data issues? Do you manually change settings in the database? Do you handle all configuration via command flags? Do you connect your database’s CLI and run queries manually? Maybe you are advanced, so you run phpMyAdmin or pgAdmin rather than a CLI? Perhaps you SSH into a server to run diagnostic commands and utilities? Or, if you’re at a larger company, does an ops person do those things for you?
These are common problems and, unfortunately, common solutions. The solution we see most often is a database CLI plus command-line utilities that get run on production servers. When I started writing client-server applications, that is what I did. It was an extremely common practice, albeit a poor and dangerous one.
The problems with this approach are more obvious in the world of distributed systems, cloud, and especially DevOps. Now we desire safe, repeatable, auditable processes to access and modify runtime configuration and access diagnostic data for our services and applications. Our goal is to avoid humans SSHing into servers or attaching CLI tools to a running database. It is dangerous for our systems, our users, and ourselves.
I painfully learned the dangers of SSH and manually run tools fifteen years ago. I managed a system backed by a geographically distributed, multi-master database. The database and the application were designed to support this model, and it worked very well, most of the time. Occasionally there would be a replication failure that required my intervention to correct. Nearly all of the recovery procedure was automated; I only needed to SSH into the servers and run the repair tools I’d written. My repair tool first backed up each individual production database and then restored them to a special cluster for analysis. The analysis validated that the tooling could correctly and safely repair the issue. There was just one small issue: me. I was distracted and mistakenly ran the restore and analysis tool on the production cluster. Whoops. This destroyed a massive amount of production data, over $100 million worth of financial transactions scheduled for payment the following day. Ouch. After a moment of sheer terror, I remembered that the automated tools dumped an extra backup during one portion of the analysis; I just had to recover the data from that dump. An implementation detail of my automation saved me. This mistake could have been avoided entirely, and the system brought back online many hours faster, if I had not been a crucial step in the process. A simple mixup of my terminal windows caused a potentially huge issue.
A better solution is to invest the time and energy into developing an admin portal to access diagnostic tools and configuration settings for your application and services. The investment often pays off many times over. I “discovered” this pattern when I started building applications on Google App Engine. I could no longer connect a CLI or GUI to my database, nor could I SSH into a server to run diagnostic tools because it is a serverless PaaS. That forced me to build something specific to my app. This turned out to be an amazing learning opportunity for me. Going forward, I’ll refer to this concept generically as an “admin portal.”
There are general tools for exploring a database, but I think admin portals are best when made application-specific. After all, the goal is to eliminate the need to manually run commands, scripts, and queries. Building for a specific application also allows us to pull together information from multiple sources into one view. Automation helps protect our users and our systems, and it can relieve stress in times of duress, such as when dealing with support issues. If these tools understand how the data should be structured and what it should look like, they can help us quickly identify cases where the data is broken or not as expected.
An admin portal provides a way to enforce and validate your business logic and rules. You can also mask sensitive data while performing diagnostics. That allows support people to confirm that there are no data anomalies and that the data’s structure matches your application’s expectations, all without exposing sensitive data. Simple data views are valuable on their own, but the ability to build in app-specific logic is what matters most. Rather than a human needing to run a series of queries and scripts and then interpret the output, the admin pages can do it. That means you can build things like an “account health” tool that checks to ensure basic expectations are met. These tools can include repair logic for issues that are difficult, impossible, or impractical to fully correct with database-wide migrations or repairs.
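To make the “account health” idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the Account fields, the list of valid plans, and the function names are hypothetical, and a real portal would load the record from its own data store and encode its own business rules.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    # Hypothetical account record; the field names are illustrative only.
    account_id: str
    email: str
    plan: str
    balance_cents: int
    owner_ids: list = field(default_factory=list)


def mask_email(email: str) -> str:
    """Hide most of the local part so the full address is never displayed."""
    local, _, domain = email.partition("@")
    if not domain:
        return "***"
    return f"{local[:1]}***@{domain}"


def check_account_health(account: Account) -> list:
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    if not account.owner_ids:
        problems.append("account has no owner")
    if account.plan not in {"free", "pro", "enterprise"}:  # assumed plan names
        problems.append(f"unknown plan: {account.plan!r}")
    if account.balance_cents < 0:
        problems.append("negative balance")
    if "@" not in account.email:
        problems.append("email is malformed")
    return problems


def health_report(account: Account) -> dict:
    """Build the data an admin page would render, with the email masked."""
    problems = check_account_health(account)
    return {
        "account_id": account.account_id,
        "email": mask_email(account.email),
        "healthy": not problems,
        "problems": problems,
    }
```

A support person viewing this report sees whether the account meets basic expectations and which checks failed, but never the full email address.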
The admin portal is effectively another application, and that is why the team developing the services and application should be the ones who start building the basic tools. They will have a good idea of what types of checks should be built in because they will already be running those checks manually as they are building the application. They will also have a good feel for the types of common issues and bugs they hit based on their experience developing the application. Because the admin portal is an application, support roles (such as SREs or other technical support roles) can contribute to the code base to implement additional checks and fixes as issues are hit during testing and in production settings. It is a living, evolving system.
Another massive advantage? You can implement permissions. That means you can safely grant developers and other support people access to analytics and diagnostics tools while restricting access to tools that expose sensitive data. Additionally, all actions can be logged to your normal application audit logs so that you know which data was viewed, which settings were modified, and what fixes were applied. If your application exposes audit logs to customers, this means your customers know who within your organization has accessed their data, giving them increased confidence as well. That’s good for your staff and your users.
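One way to combine permissions and audit logging is to wrap every admin tool in a single check. The sketch below is a hypothetical illustration, not a prescribed design: the permission strings, the shape of the user dictionary, and the in-memory AUDIT_LOG list are all assumptions standing in for your real authorization system and audit log.

```python
import functools
from datetime import datetime, timezone

# In-memory stand-in for the application's real audit log (an assumption).
AUDIT_LOG = []


class PermissionDenied(Exception):
    """Raised when a user calls an admin tool they are not allowed to use."""


def admin_action(required_permission: str):
    """Wrap an admin tool so every call is permission-checked and audited."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(user: dict, *args, **kwargs):
            allowed = required_permission in user.get("permissions", ())
            # Record the attempt whether or not it is allowed.
            AUDIT_LOG.append({
                "at": datetime.now(timezone.utc).isoformat(),
                "user": user["name"],
                "action": func.__name__,
                "args": args,
                "kwargs": kwargs,
                "allowed": allowed,
            })
            if not allowed:
                raise PermissionDenied(
                    f"{user['name']} lacks {required_permission}"
                )
            return func(user, *args, **kwargs)
        return wrapper
    return decorator


@admin_action("diagnostics.view")
def view_account_health(user, account_id):
    return {"account_id": account_id, "healthy": True}  # placeholder result


@admin_action("settings.modify")
def set_feature_flag(user, flag, enabled):
    return {"flag": flag, "enabled": enabled}  # placeholder result
```

Because the check and the log entry live in one decorator, a new tool cannot be added to the portal without also being permission-checked and audited. Denied attempts are logged too, which is often as interesting as the successful ones.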
These types of tools become even more valuable as you add customer support staff. They give less technical support staff a way to safely run diagnostics, perform initial troubleshooting steps, and even repair common problems. Even if you have no dedicated front-line support personnel, these tools will save your developers time.
We worked with a group who developed a very advanced diagnostic and repair tool. It allowed them to quickly develop “scripts” that went through their full SDLC (mandatory code reviews and automated testing) and could be rapidly deployed to production, enabling them to follow good processes even in the midst of production incidents. Once their scripts had been code-reviewed, the automated tests would run to validate the behavior, and then the scripts would automatically become available in their admin portal. They provided application-specific hooks that allowed them to safely deploy “repair” tools that could be scoped to specific user accounts or data. That allowed them to rapidly respond to customer issues while keeping customer data safe and reducing mistakes. The additional benefit was that they built up a catalog of bug fixes, meaning subsequent customers impacted by the same issue benefited from a much quicker time to resolution.
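We do not have that group’s code to share, but the general shape of a repair catalog can be sketched in a few lines of Python. The names here (REPAIRS, repair, run_repair, and the example fix) are hypothetical, and the accounts are plain dictionaries standing in for real records. The two ideas being illustrated are that only registered repairs can be run, and that each run is scoped to a single account.

```python
REPAIRS = {}  # name -> repair function; a hypothetical in-process catalog


def repair(name: str):
    """Register a reviewed repair so it appears in the portal's catalog."""
    def decorator(func):
        if name in REPAIRS:
            raise ValueError(f"duplicate repair name: {name}")
        REPAIRS[name] = func
        return func
    return decorator


@repair("fix-missing-default-plan")
def fix_missing_default_plan(account: dict) -> bool:
    """Return True if the account was changed, False if no repair was needed."""
    if account.get("plan"):
        return False
    account["plan"] = "free"  # assumed default, for illustration only
    return True


def run_repair(name: str, accounts: dict, account_id: str,
               dry_run: bool = True) -> dict:
    """Run one catalogued repair against exactly one account."""
    func = REPAIRS[name]  # raises KeyError for anything not in the catalog
    target = accounts[account_id]
    # A dry run works on a shallow copy, which is enough for the flat
    # records used here; nested data would need a deep copy.
    working = dict(target) if dry_run else target
    changed = func(working)
    return {
        "repair": name,
        "account_id": account_id,
        "changed": changed,
        "dry_run": dry_run,
    }
```

Defaulting to a dry run means the operator has to opt in to changing data, and the returned summary is the kind of record that belongs in the audit log.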
Admin portals take an investment of developer time to build out. You can reduce the cost per service by providing a simple framework that handles authentication and authorization, auditing, and basic UI components, so development teams have a common base for their portals. Your developers already run the queries and fixes as they are building out their applications; admin portals can help make those repeatable. The investment in building admin portals will pay off many times over through reduced time-to-resolution, increased safety, better transparency about who is accessing data, and the ability to empower lower-level support staff to resolve customer issues.
Want to ship more confidently by implementing admin portals and other best practices? We work with clients to accelerate their delivery and improve confidence — contact us to learn more.