Averting Disaster: Data Engineering Inspired by Platform Engineering
For the better part of the past decade, I have spent my career building, or thinking about, data systems. I like to think of myself as a software engineer with data engineering sympathies. I have developed and maintained modern data stacks in AWS and GCP, contributed to platform engineering teams creating Internal Developer Platforms (IDPs) to simplify engineers’ workflows, and managed large Kubernetes clusters hosting applications responsible for generating hundreds of millions in revenue.
Throughout these experiences, I’ve consistently observed a troubling trend: companies with mature R&D and product organizations aim to leverage their operational and analytical data for competitive advantage, yet often treat their data teams as second-class citizens. These teams are frequently placed in illogical organizational structures or isolated as cost centers, unlike their software engineering counterparts, who enjoy full access to essential systems. It’s no surprise that Monte Carlo’s annual survey on data quality shows the trends moving in the wrong direction.
The world is in the grip of an AI fever dream, pouring billions into technologies whose lifeblood, data, is managed by these undervalued teams. According to a survey by Ernst & Young, only 36% of senior leaders at these companies are investing in data infrastructure. One does not need to be an oracle to see disaster on the horizon; data quality issues have already cost companies billions in losses.
This is a call to action: companies must adopt best practices from software engineering and apply them to data engineering. In this post, we will explore how data engineers can learn from their platform engineering counterparts in building and deploying robust data systems.
Data Engineering
Data engineers love their shiny new toys; just look at last year’s data landscape, which has effectively become a meme. Tools are great, but no single one of them is a silver bullet for your data woes. The industry needs to take a holistic look at delivering high-quality data to drive value. This approach is championed by Joe Reis and Matt Housley in Fundamentals of Data Engineering. In particular, by defining the data engineering lifecycle, they have codified a way for engineers and leaders to think about the main pillars of data engineering. What may surprise some is how much of it overlaps with software engineering, especially the undercurrents (security, data management, DataOps, data architecture, orchestration, and software engineering itself) that permeate the entirety of the lifecycle.
Figure: The data engineering lifecycle, from Fundamentals of Data Engineering by Joe Reis and Matt Housley
Taking a closer look at the data engineering lifecycle, two critical and often overlooked components are the management of the stack’s infrastructure through Infrastructure as Code (IaC) and the deployment of that infrastructure to the cloud. A data engineer working with a modern stack would gladly grab a beer with you and share war stories of lost weekends spent slogging through undifferentiated work, trying to get a stack deployed into the correct VPC with appropriately scoped IAM permissions. And let’s be honest: if you’re on AWS, the IAM star policy probably made an appearance. As mentioned above, most companies treat their data teams as outside the R&D and product development lifecycle, so those teams are not integrated into the DevOps practices of their software engineering counterparts. Consequently, they often have to navigate these waters on their own.
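To make the IAM point concrete, here is a minimal sketch of the difference between the star policy and a least-privilege alternative, using boto3. The bucket, paths, and policy name are hypothetical placeholders, and your pipeline will likely need a different set of actions.

```python
import json

import boto3

# The infamous "star" policy: every action on every resource. It gets a
# stack deployed on a Friday afternoon and remains a liability forever after.
STAR_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
}

# A least-privilege alternative: the pipeline can list one bucket, read raw
# objects, and write processed ones, and nothing else.
SCOPED_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::example-data-lake"],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::example-data-lake/raw/*"],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": ["arn:aws:s3:::example-data-lake/processed/*"],
        },
    ],
}

# Register the scoped policy so it can be attached to the pipeline's role.
# STAR_POLICY is shown only as the anti-pattern and should never ship.
iam = boto3.client("iam")
iam.create_policy(
    PolicyName="pipeline-s3-least-privilege",
    PolicyDocument=json.dumps(SCOPED_POLICY),
)
```

Scoping permissions this way is tedious to do by hand, which is exactly why it belongs in a reusable IaC module rather than a late-night console session.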
I know most companies are not ready to migrate to a data mesh architecture. It requires fundamentally reshaping how an organization operates in order to become data-driven and treat data as a first-class concern. However, I believe there are excellent ideas within this paradigm shift that any team can start incorporating into their data practices today. In particular, the idea of data infrastructure as a platform.
“We need to shift to a paradigm that draws from modern distributed architecture: considering domains as the first class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product.” — Zhamak Dehghani, creator of Data Mesh.
Platform Engineering
The folks at platformengineering.org define platform engineering as follows: “Platform engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era.” The Microsoft team puts it this way: “Platform engineering is a practice built upon DevOps principles that seeks to improve each development team’s security, compliance, costs, and time-to-business value through improved developer experiences and self-service within a secure, governed framework. It’s both a product-based mindset shift and a set of tools and systems to support it.” I find the latter to be a particularly apt definition of the practice. These toolchains are instantiated via an IDP, through which developers gain access to common CI/CD workflows, test suites, and infrastructure management in the form of IaC modules. As I have written before, data engineering work is fundamentally software engineering work, and in today’s development context that means deploying workloads to the cloud. Creating CI/CD workflows that can deploy and maintain both pipelines and infrastructure is therefore paramount to delivering value to the business.
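To illustrate what “infrastructure management in the form of IaC modules” can look like, here is a minimal sketch of a paved-road module a platform team might publish. I’m using Pulumi’s Python SDK so the examples in this post stay in one language; the same idea maps directly onto a Terraform module. The component name, tags, and team are hypothetical.

```python
import pulumi
import pulumi_aws as aws


class DataBucket(pulumi.ComponentResource):
    """A hypothetical paved-road component: versioning and ownership tags
    are decided once by the platform team, and every data team consumes
    them through this single interface."""

    def __init__(self, name: str, team: str, opts=None):
        super().__init__("platform:data:DataBucket", name, None, opts)

        self.bucket = aws.s3.Bucket(
            f"{name}-bucket",
            versioning=aws.s3.BucketVersioningArgs(enabled=True),
            tags={"team": team, "managed-by": "platform"},
            opts=pulumi.ResourceOptions(parent=self),
        )
        self.register_outputs({"bucket_name": self.bucket.id})


# A data team gets a compliant bucket in one line instead of hand-rolling
# (and inevitably mis-configuring) the underlying resources.
raw_events = DataBucket("raw-events", team="analytics")
pulumi.export("raw_events_bucket", raw_events.bucket.id)
```

The value is not the few lines saved; it’s that security, tagging, and naming decisions are made once, reviewed once, and inherited everywhere.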
The Intersection
As a data engineer, you can assist your platform team in encapsulating workflows for the infrastructure resources you use in your data stack, such as AWS EMR, Glue, and Google Cloud Dataproc, as well as storage and data warehousing solutions like AWS S3, Redshift, and BigQuery. If you don’t have a platform engineering team, you can initiate this practice independently by leveraging common tools like Terraform to stand up your data infrastructure. Start building CI/CD for your pipelines as well: begin with a single pipeline, then develop a pattern you can apply across all of them. Champion your data organization by proactively setting up meetings with your infrastructure or platform teams to discuss your stack, explain how resources interact with your data pipelines, and identify who consumes your datasets. Learn how they monitor existing infrastructure and applications, ask questions, and explore how you can plug your stack into the observability and monitoring systems already in place. These are all opportunities to start integrating data engineering into the larger organization.
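If you’re starting from zero, here is a minimal sketch of what standing up a slice of a data stack as code can look like: a landing bucket, a Glue catalog database, and a crawler that keeps the catalog in sync. Again I’m using Pulumi’s Python SDK to keep one language across this post; the resources map one-to-one onto Terraform’s aws_s3_bucket, aws_glue_catalog_database, and aws_glue_crawler. All names and the crawler’s IAM role ARN are hypothetical placeholders.

```python
import pulumi
import pulumi_aws as aws

# A landing bucket for raw data arriving from upstream sources.
landing = aws.s3.Bucket("landing-zone")

# A Glue catalog database to hold table metadata for the landing zone.
catalog = aws.glue.CatalogDatabase("analytics-catalog", name="analytics")

# A crawler that scans the landing bucket and registers table schemas in
# the catalog, so downstream queries always see fresh metadata.
crawler = aws.glue.Crawler(
    "landing-crawler",
    database_name=catalog.name,
    role="arn:aws:iam::123456789012:role/glue-crawler-role",  # placeholder
    s3_targets=[
        aws.glue.CrawlerS3TargetArgs(
            path=landing.bucket.apply(lambda name: f"s3://{name}"),
        ),
    ],
)

pulumi.export("landing_bucket", landing.bucket)
```

Commit this alongside your pipeline code and run it from CI (pulumi preview on pull requests and pulumi up on merge, or the terraform plan and apply equivalents), and you have the seed of the pattern described above: reviewable, repeatable infrastructure instead of console clicks.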
In recent years, industry leaders have consistently stated that the primary challenge for data teams is not a lack of tools, but the ongoing struggle to demonstrate their value to the business. The data field often feels like Sisyphus pushing a boulder up a hill, as new data-centric technologies gain attention only to roll back down again. Data teams are expected to be DevOps engineers, infrastructure engineers, DBAs, and software engineers, all while possessing a deep understanding of the business to deliver insights from operational and analytical data. It’s high time for us to finally push the boulder over the top of the hill by integrating ourselves into the fabric of the greater engineering community.
If this post has resonated with you, but you’re struggling to actualize the vision for your data teams, please reach out to us at Real Kinetic for a consultation. Our team of veteran data and platform engineers has a proven track record of helping clients reach their potential. We’ve even begun building our own enterprise-ready developer platform, ensuring that we support data stacks as first-class workloads, making them a central focus of our solutions.