The Tale of Two Teams: Requirements-Driven vs. Resume-Driven Development
Over the years, I’ve observed a recurring pattern in how different organizations approach modernizing their data systems. While I’m cautious about making blanket statements, this theme has been persistent enough that it prompted me to write this post. I’ll be examining two common archetypes for approaching this problem:
- Team A: This team has a strong understanding of the business, with the primitives of that business codified in existing (legacy) systems. However, they may lack familiarity with new technologies and hold few opinions about the tech stack needed to complete the project.
- Team B: This team has a more nebulous definition of the business facts but has strong opinions on the tech stack and possibly the engineering skills to implement said stack for their new solution. They, too, are working with legacy systems, which complicates their modernization efforts.
Which team do you think will have more success in creating value for the business through their new solution? The answer may not be obvious given Team B’s technical acumen, though seasoned engineers reading this might already see where I’m heading. Yet there are undoubtedly those who don’t recognize the potential issues. After all, by picking the best-in-breed product for each layer of your stack (storage, ingestion, processing, orchestration), you should be good to go and can start writing up the procurement documentation for your director, right?
In this article, I hope to shed light on the pitfalls of such choices. Both teams I’ve described above ultimately need to prove how they’re delivering value to the business.
Examining Team B: The Resume-Driven Approach
Let’s examine Team B first. They are a group of engineers who keep up with the latest and greatest tools and are ready to build out a modern data stack. They want a scalable data lakehouse architecture using Apache Iceberg in conjunction with BigQuery and the BigLake metastore. They also want to use Apache Beam, which can be leveraged for both batch and stream processing to make the whole pipeline much faster. In terms of orchestration, they know that Airflow is super popular, and their legacy system is not cutting it.
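For a sense of what that stack entails, here is a minimal sketch of the kind of Beam pipeline Team B had in mind, reduced to a single batch read-transform-write into BigQuery. The project, bucket, table, and schema names are all hypothetical placeholders, and the Iceberg/BigLake wiring is omitted:

```python
# A minimal sketch of Team B's envisioned Beam pipeline (all names are hypothetical).
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.bigquery import WriteToBigQuery


def normalize(record: dict) -> dict:
    # Placeholder for whatever per-record cleanup the legacy job performed.
    return {"id": str(record["id"]), "payload": json.dumps(record)}


options = PipelineOptions(
    runner="DataflowRunner",                      # a distributed runner
    project="example-project",                    # hypothetical GCP project
    region="us-central1",
    temp_location="gs://example-bucket/tmp",      # staging bucket for the runner
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadExport" >> beam.io.ReadFromText("gs://example-bucket/export/*.json")
        | "Parse" >> beam.Map(json.loads)
        | "Normalize" >> beam.Map(normalize)
        | "WriteBQ" >> WriteToBigQuery(
            table="example-project:warehouse.records",
            schema="id:STRING,payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        )
    )
```

Even this stripped-down version drags in a distributed runner, a staging bucket, and warehouse-style table management; keep that in mind as the conversation below unfolds.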
Alright, they have their stack and are ready to start migrating to this new solution. They have a proof of concept set up, but they start to bump into some issues. As the data architect brought in to assist, you gather information to see how you can get them back on track. To understand their situation, you ask some basic questions about the system, and the conversation goes something like this:
Conversational Insights with Team B
Working from first principles, you ask,
“What are the basic facts of the business?”
“Well, we are the team responsible for all the data. We work with roughly 300 tables, and it’s a legacy system that takes forever, so we want to throw as much horsepower at it as possible.”
“What is the source system you are ingesting from?”
“A traditional RDBMS.”
“How much data are you ingesting with this pipeline?”
“A couple of gigabytes each run.”
“Who will be the end-user of this data?”
“This data will end up in a key-value datastore and will be used by downstream engineering teams.”
“How frequently do you need this data updated?”
“We update this data about once every few weeks; it doesn’t change much, but when it does, it takes a long time to run.”
“How many pipelines will you plan on having?”
“A single DAG.”
“Can this work be broken up?”
“We have started work on decoupling the pipeline, but our data model is complicated, so we are going to migrate first and worry about that later.”
“Will other teams in the organization be writing pipelines?”
“No.”
“Will this data be used for analytics?”
“No, it will drive the APIs that are core to the business.”
“What is the business value you plan to bring with this new solution?”
“We want it to be faster; we would like it to be easier to debug and spot problems early on.”
Identifying Issues with Team B’s Approach
This team was doomed to fail from the outset, for a few reasons. First and foremost, they could not concretely answer the questions that plague so many teams: what are the facts of the business? Does your data model represent those facts accurately? Furthermore, taking a closer look at the chosen stack: Apache Iceberg is a table format that enables ACID transactions against an object store, typically used in analytics architectures. Apache Beam is a highly scalable, highly parallelizable data processing framework meant for large-scale data pipelines, not a few gigabytes of data every few weeks. Finally, do they really want the overhead of managing Airflow for a single DAG that runs every few weeks? You may have noticed we did not dig deeper into the current data model or data operating model. We did not need to; it was already obvious that the team was going down the wrong path.
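For contrast, consider what the workload they described actually demands: a few gigabytes moving from an RDBMS into a key-value store every few weeks. A plain script on a basic scheduler would arguably do the job. Here is a minimal sketch, assuming a Postgres source and Redis as the key-value store; both are stand-ins, and the connection details are hypothetical:

```python
# A minimal sketch of the actual workload: a few GB from an RDBMS into a
# key-value store. Connection strings and table names are hypothetical.
import json

import pandas as pd
import redis
from sqlalchemy import create_engine

engine = create_engine("postgresql://etl_user:secret@legacy-db:5432/prod")
kv = redis.Redis(host="cache.internal", port=6379)

# Stream the source table in chunks to keep memory bounded.
for chunk in pd.read_sql("SELECT id, payload FROM records", engine, chunksize=50_000):
    pipe = kv.pipeline()  # batch the writes per chunk
    for row in chunk.itertuples(index=False):
        pipe.set(f"records:{row.id}", json.dumps({"payload": row.payload}))
    pipe.execute()
```

Run from cron (or even by hand) every few weeks, something like this does the described job with none of the operational overhead of a lakehouse, a distributed processing framework, or an orchestrator.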
Exploring Team A: The Requirements-Driven Approach
Now let’s look at Team A. They are a team of veteran engineers who are highly specialized in proprietary systems but do not hold strong opinions on the tech stack needed for the new solution. Before they even begin to develop anything, they bring you in for assistance. You ask the same line of questions.
Conversational Insights with Team A
Working from first principles, you ask,
“What are the basic facts of the business?”
“We are a large construction equipment supply company. We are in charge of handling the pre-cast molds used in box-culvert construction. We are responsible for customer invoices and our inventory tables. We use a dimensional data model with fact and dimension tables that we have built out over the last 20 years.”
“What is the source system you are ingesting from?”
“A legacy on-premises database system that has been running for decades.”
“Who will be the end-user of this data?”
“Another team will be ingesting this data; we need to call their API because they have some validation logic built in.”
“How frequently do you need this data updated?”
“The data we are working with does not have a scheduled time for arrival, so it will need to be event-driven.”
“How many pipelines will you plan on having?”
“It needs to be a single pipeline that does some simple transformations before forwarding that data to the API. Additionally, it will need to encrypt sensitive data related to the invoices.”
“Will other teams in the organization be writing pipelines?”
“No.”
“Will this data be used for analytics?”
“Yes; however, we will not be responsible for storing that data for analytics. Once the data is forwarded to the API endpoint and verified as successfully written, we would like it to be deleted.”
“What is the business value you plan to bring with this new solution?”
“We would like the system to run as autonomously as possible and be able to notify the proper channels of any issues, updates, etc.”
Solution Development for Team A
Now that you have the problem clearly defined, you start to put together an event-driven system around S3 as a landing zone, incorporating some Lambda functions to handle the simple transformations and a couple of SQS queues to decouple the transformation logic from the logic needed to send out notifications. You help the team deliver, and everyone gets a win!
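As a rough illustration of that design, here is a hedged sketch of the transform Lambda: it fires when an object lands in S3, applies a simple transformation, encrypts the sensitive invoice fields via KMS, forwards the result to the downstream team’s validation API, deletes the source object once the write is verified, and drops a status message on an SQS queue that a separate notification Lambda consumes. Every name, URL, field, and key alias below is a hypothetical placeholder:

```python
# Sketch of the S3-triggered transform Lambda (all names/URLs are hypothetical).
import base64
import json
import urllib.parse
import urllib.request

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
kms = boto3.client("kms")

DOWNSTREAM_API = "https://internal.example.com/ingest"  # downstream validation API
NOTIFY_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-status"
KMS_KEY_ID = "alias/invoice-fields"  # hypothetical key alias


def encrypt_field(value: str) -> str:
    # Field-level encryption for sensitive invoice data.
    blob = kms.encrypt(KeyId=KMS_KEY_ID, Plaintext=value.encode())["CiphertextBlob"]
    return base64.b64encode(blob).decode()


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
        body["card_number"] = encrypt_field(body["card_number"])  # hypothetical field

        req = urllib.request.Request(
            DOWNSTREAM_API,
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
        )
        # A non-2xx response raises here, which fails the invocation and lets
        # Lambda's retry behavior take over.
        with urllib.request.urlopen(req) as resp:
            verified = resp.status == 200

        if verified:
            # Requirement: delete the data once the downstream write is verified.
            s3.delete_object(Bucket=bucket, Key=key)

        # Notification logic is decoupled behind SQS; a second Lambda consumes
        # this queue and alerts the proper channels.
        sqs.send_message(
            QueueUrl=NOTIFY_QUEUE,
            MessageBody=json.dumps({"object": key, "delivered": verified}),
        )
```

The SQS hop is the design choice worth noting: the transform path never needs to know how notifications are delivered, so either side can change independently.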
The Breakdown
These two teams faced relatively similar problems: moving data from a source system, performing transformations, and forwarding the data on. Neither team needed to set up any analytics pipelines. However, in both organizations, these were the “data teams” composed of data engineers. The significant difference was that Team A had strong knowledge of their business facts and a solid understanding of their data model, while Team B did not. Team B pursued the modern data stack simply because they are a data team, choosing tools they found during their research. In contrast, Team A successfully delivered value to their business while learning about “good” architecture along the way.
Sadly, this scenario is not uncommon. Team A’s approach demonstrates the power of deep business knowledge, while Team B’s experience serves as a cautionary tale about the pitfalls of prioritizing trendy tools over foundational understanding.
Another common issue in system migrations is teams anchoring themselves to the current solution. For Team B, this manifests as reimplementing the existing solution with new tools rather than leveraging these tools as intended. Team A, while potentially more open to different approaches, faces a different challenge. Instead of recreating the legacy solution, they may need to invest more time in understanding and mastering the new technologies. This approach, while potentially leading to a longer ramp-up period, ultimately allows Team A to fully utilize the capabilities of the new tools and avoid simply replicating old systems in a new environment.
As I have written about before, data engineers are fundamentally software engineers. They need to understand the problem before engineering a solution. They must be able to articulate the value they bring to the business; failure to do so often leads to project failure. Ultimately, data leaders must ensure they are solving real problems rather than hunting for issues that fit specific tools.
If you’re looking to unlock the true potential of your data and ensure your platform engineering efforts drive real value for your organization, I encourage you to reach out to Real Kinetic. Our dedicated teams are here to help you navigate the complexities of data systems and empower your business with tailored solutions that meet your unique challenges.