Data Engineering: A Software Engineer’s Approach

Matt Perreault
Published in Real Kinetic Blog
4 min read · Jan 25, 2024

The tech industry is divided on the role of the data engineer. A quick internet search returns plenty of discussions comparing software engineers to their data counterparts. We see a large disparity from company to company in the tasks a data engineer is expected to perform. This holds particular significance for hiring managers seeking to build high-performing data engineering teams.

It is becoming commonplace for companies to task DBAs, business analysts, and data scientists with standing up and managing complex data infrastructure, including container orchestration tools and networking, and with developing complex ETL code using frameworks like Apache Beam, Kafka, or Kinesis. These professionals are not typically trained to manage complex code bases and infrastructure, so their actual strengths go underutilized. The employer does not get results, and the employee does not get to do the work they were hoping to do. As a result, a company’s data initiatives falter or fail altogether.

Companies Want to be Data-Driven

Companies today are expected to be data-driven, using data to analyze customers, identify sales opportunities, predict customer churn, and optimize service usage. In this pursuit, the industry gravitates towards modern, cloud-native data stack tools such as Dataflow, Kinesis, Redshift, BigQuery, and Snowflake, to name a few. It’s essential to recognize that implementing and managing these tools falls squarely within the domain of software engineering. The orchestration, scalability, and optimization of a stack of this caliber demand a skill set typically associated with software engineers. This contrasts with analytics and modeling, which belong to the realm of business analysts and data scientists.

Solution: Data Engineers Need Software Engineering Skills and Practices

Data engineering thrives when steeped in software engineering principles and practices. Here’s a guide to empower data engineers with the necessary skills and practices:

1. Treat Systems as a Product, Not a Project

Leaders need to shift their mindset towards treating data systems as ongoing products rather than finite projects. This approach emphasizes continuous improvement, adaptability, and responsiveness to evolving business needs. This means building for the long run, not just ad-hoc pipelines or one-off dashboards.

2. Define the Problem Space, End-Users, and Stakeholders

It’s crucial to know who stands to gain from the data and what key business metrics the data will illuminate. Addressing questions such as these ensures alignment with organizational objectives. In cases where multiple teams are involved, initiate communication early in the process. Failure to emphasize this step may lead to the creation of underutilized pipelines, unnecessary dashboards, and a lack of clear ownership.

3. Create Architecture Diagrams

Produce precise and easily understandable architecture diagrams to convey the configuration of your data systems, just as you would when architecting software applications and systems. Define and implement consistent patterns throughout the software development life cycle, and document them with architecture decision records (ADRs). This approach helps secure early buy-in from leadership and the team, along with the support and resources needed for a successful implementation. A well-defined understanding of the system’s behavior and its constituent components makes it easier to maintain and adapt in the future, particularly when new tools or patterns must be introduced to meet evolving system demands.

4. Adopt Version Control Systems Early On

Incorporate a version control system such as Git from the outset of your data engineering projects. Many legacy data tools are UI-oriented and do not treat data pipelines as code. As a result, the concept of a software development lifecycle (SDLC) can be foreign to professionals who now find themselves in the role of a data engineer. Version control, and more generally an SDLC, facilitates collaboration, tracks changes, and maintains code integrity and quality throughout the development lifecycle. No more making changes directly to your DAGs without a commit trail!

5. Establish Development, Staging, and Production Environments

Implement an environment strategy with dedicated spaces for development, staging, and production. This ensures controlled testing and deployment, preventing unforeseen issues in live environments. You can still sample production data into your dev database to test your pipelines, but the environments themselves should remain segregated. A clear environment strategy plays a critical role in having a well-defined SDLC.
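As a rough illustration of the environment strategy above, here is a minimal Python sketch in which a pipeline picks its settings from one place based on the environment it runs in. Everything named here is an assumption made for the example: the `PIPELINE_ENV` variable, the dataset names, and the sampling fractions are hypothetical and would come from your own infrastructure.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    """Settings that differ between environments (all values hypothetical)."""
    name: str
    warehouse_dataset: str   # target dataset or schema the pipeline writes to
    sample_fraction: float   # share of source rows to process
    is_production: bool


# Hypothetical values; replace with names from your own stack.
ENVIRONMENTS = {
    "dev": EnvConfig("dev", "analytics_dev", 0.01, False),
    "staging": EnvConfig("staging", "analytics_staging", 0.10, False),
    "prod": EnvConfig("prod", "analytics", 1.0, True),
}


def load_config(env_name=None):
    """Return the config for env_name, falling back to PIPELINE_ENV, then dev."""
    name = env_name or os.environ.get("PIPELINE_ENV", "dev")
    try:
        return ENVIRONMENTS[name]
    except KeyError:
        raise ValueError(
            f"Unknown environment {name!r}; expected one of {sorted(ENVIRONMENTS)}"
        ) from None
```

Defaulting to `dev` when nothing is set is a deliberate choice in this sketch: a misconfigured job then writes to the development dataset rather than to production.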

6. Implement Continuous Integration/Continuous Deployment (CI/CD)

Streamline development processes by adopting CI/CD pipelines. Automate testing and deployment to enhance efficiency and maintain a reliable and agile development workflow.

7. Testing

Prioritize testing as a fundamental practice in data engineering. Write unit tests against your transformation code to validate that it meets business requirements. Most data challenges stem from data integrity issues, so test data integrity rigorously to ensure the whole pipeline is dependable. One way to do this is to check the schema and types of your data before loading it into a sink.
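To make the two ideas above concrete, here is a small Python sketch of a transformation with a unit test, plus a schema and type check that runs before records reach a sink. The order schema, field names, and helper functions are hypothetical and exist only for this example; a real pipeline would likely use its framework's own validation tooling.

```python
from datetime import date

# Hypothetical schema: field name -> required Python type.
ORDER_SCHEMA = {"order_id": str, "amount_cents": int, "order_date": date}


def to_order(raw):
    """Transform a raw, string-valued row into a typed order record."""
    return {
        "order_id": raw["id"].strip(),
        "amount_cents": round(float(raw["amount"]) * 100),
        "order_date": date.fromisoformat(raw["date"]),
    }


def validate_record(record, schema):
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return errors


def split_valid_invalid(records, schema):
    """Separate loadable records from rejected ones before writing to a sink."""
    valid, rejected = [], []
    for record in records:
        errors = validate_record(record, schema)
        if errors:
            rejected.append((record, errors))
        else:
            valid.append(record)
    return valid, rejected


def test_to_order_produces_typed_record():
    """Unit test for the transformation; a runner such as pytest can collect it."""
    row = {"id": " A-1 ", "amount": "19.99", "date": "2024-01-25"}
    order = to_order(row)
    assert order == {
        "order_id": "A-1",
        "amount_cents": 1999,
        "order_date": date(2024, 1, 25),
    }
    assert validate_record(order, ORDER_SCHEMA) == []
```

Rejected records are returned alongside their error messages rather than silently dropped, so they can be logged or routed to a quarantine location for inspection.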

Recognize the Rich Tapestry of Data Professionals

In wrapping up, it’s time to recognize the diverse roles of data professionals. As the spotlight on data engineering intensifies, fueled by breakthroughs in the Large Language Model (LLM) space and the resounding success stories of data-driven companies like Netflix, Airbnb, and others, it’s high time for leadership to sit up and take notice of their data teams.

This isn’t about downplaying the vital roles of business analysts or data scientists. It’s about acknowledging that you wouldn’t expect a data engineer to suddenly become a statistician analyzing complex models. Likewise, data scientists and business analysts should not be expected to stand up infrastructure and write fault-tolerant code, as that is outside the scope of their expertise.

If you or your organization needs assistance in navigating and optimizing your data initiatives through a software engineering approach, let’s get in touch!

Based in Colorado. In my day job I build and architect data-intensive systems in the cloud.