Understanding Google Cloud Dataflow: Common Mistakes with Flex-Templates

Matt Perreault · Real Kinetic Blog · Aug 8, 2024

What I Got Wrong About Dataflow Flex Templates


Over the past few weeks, I have been in the trenches of the Google Cloud Dataflow system. Why, you may ask? Here at Real Kinetic we are on a mission to support all the cloud resources a data engineer needs to set up a modern data platform on GCP through our product, Konfig. Our goals are to eliminate the undifferentiated work surrounding cloud infrastructure management, establish fine-grained access controls via IAM, set up VPC networks, and manage resource updates using best practices, so you can focus on delivering value to your business. As I have been building out support for Dataflow Flex Template creation and management, I have come across some gotchas that I would like to share in the hopes that, if you are not using Konfig and find yourself bumping into any of these issues, it will save you time and preserve your sanity.

The Dataflow Pipeline Lifecycle

Understanding the Dataflow pipeline lifecycle is important for gaining context into why flex-templates are built the way they are, and it will reveal some potential pitfalls. The lifecycle involves two key components: the “launcher” container and the “custom pipeline” container. Below I describe what each does and then explain the gotchas.

Launcher Container

When a Dataflow job starts, the first thing that happens is the creation of a launcher container. It is created from the image referenced in the flex-template JSON file stored in GCS, typically built with `gcloud dataflow flex-template build`. The launcher container runs on a GCE “launcher” VM and creates the execution graph. A job can be initiated using the `gcloud` CLI, the Dataflow console, the REST API, or other means such as an Airflow operator. When you launch a Dataflow flex-template job, Apache Beam runs your code on the launcher machine to construct the pipeline graph, processing all the apply steps (represented by `|` in Python) defined in your pipeline code. For each step in your pipeline, a node is added to the graph. This is known as graph construction time and is typically the longest-running process during the job launch phase. During this time, Dataflow will also verify that any resources the pipeline interacts with are available, ensure the Dataflow worker service account has the appropriate access to those resources, and attempt to optimize your pipeline execution graph through fusion optimization.
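To make graph construction concrete, here is a minimal pipeline sketch (the bucket paths and step names are placeholders, not from a real project). Each labeled `|` apply step becomes a node in the execution graph Beam builds on the launcher VM:

# my_pipeline.py -- a minimal sketch; every `|` apply step becomes a graph node.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    # Picks up standard flags such as --runner, --temp_location, --staging_location.
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://your-bucket/input/*.txt")   # node 1
            | "Strip" >> beam.Map(str.strip)                                   # node 2
            | "DropEmpty" >> beam.Filter(bool)                                 # node 3
            | "Write" >> beam.io.WriteToText("gs://your-bucket/output/out")    # node 4
        )


if __name__ == "__main__":
    run()

Adjacent element-wise steps like the `Map` and `Filter` above are exactly the kind of nodes fusion optimization may merge into a single execution stage.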

Gotcha:

The image you use to launch your job must be built on one of Google’s Dataflow launcher base images. Unfortunately, the cloud console gives little feedback during this part of the job, and I spent too much time trying to use a smaller base image (one meant for non-flex templates) to speed up CI build times before realizing this was a requirement.

How Konfig can help:

Making sure your Dataflow job can communicate with its input and output resources is crucial, and is something we looked to support natively in Konfig. When you create a Dataflow Flex Template workload in Konfig, you simply define the resources your Dataflow pipeline will interact with, and the IAM is set up automatically with the least-privileged access your pipeline needs to function properly and securely, so you don’t have to manage it yourself.

Custom Pipeline Container

After the execution graph is created, it is executed as a job on worker machines. This is where the custom pipeline container comes into play. Workers typically use the official Apache Beam image, which can be overridden if needed; this override image is what the Dataflow documentation refers to as a “custom image”. A custom container can be hardcoded into the pipeline if the image is tightly coupled to pipeline code, or passed in at runtime with the `--sdk_container_image` pipeline option. You can pre-load this image with dependencies to make spinning up worker VMs quicker.
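As a hedged sketch of both approaches (the image path and project names are placeholders), the option can be supplied on the command line or hardcoded via `PipelineOptions` kwargs:

from apache_beam.options.pipeline_options import PipelineOptions

# Option 1: supply the image at runtime, e.g.
#   python my_pipeline.py --runner DataflowRunner \
#     --sdk_container_image=us-docker.pkg.dev/my-project/my-repo/my-image:1.0
runtime_options = PipelineOptions()  # parses command-line flags from sys.argv

# Option 2: hardcode it when the image is tightly coupled to the pipeline code.
hardcoded_options = PipelineOptions(
    sdk_container_image="us-docker.pkg.dev/my-project/my-repo/my-image:1.0"
)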

Gotcha:

If you find yourself needing to use a custom container, the best practice is to combine the launcher image and the SDK image in the same Dockerfile as a multi-stage build. This way, a single Docker image can serve as both the launcher container and the SDK container on the worker VMs, reducing management overhead. Here’s an example:

# Stage 1: Google's Flex Template launcher base image (substitute a real IMAGE_NAME:TAG).
FROM gcr.io/dataflow-templates-base/IMAGE_NAME:TAG AS template_launcher

# Stage 2: the official Apache Beam SDK image is the final stage, so its
# ENTRYPOINT (/opt/apache/beam/boot) is inherited by the worker container.
FROM apache/beam_python3.10_sdk:2.57.0

# RUN <...Make image customizations here...>
# See https://cloud.google.com/dataflow/docs/guides/build-container-image

# Configure the Flex Template: copy in the launcher binary and the pipeline code.
COPY --from=template_launcher /opt/google/dataflow/python_template_launcher /opt/google/dataflow/python_template_launcher
COPY my_pipeline.py /template/
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="/template/my_pipeline.py"

Staging and Temporary Location Buckets

The staging and temporary locations are used by Dataflow during the execution of your jobs. If you look into your staging bucket after a job run, you’ll see a `staging/` object path. This path stores your job binaries, dependencies, and other artifacts such as job run logs. Your temporary location is for files generated during pipeline execution, which may include shuffle data that spills over from the VM, side inputs, and other temporary data needed for processing. Depending on which Beam SDK you use in your Dataflow jobs, Google suggests varying best practices for managing these locations. Google provides two places to specify the staging and temporary storage locations. First, at build time, using the `gcloud dataflow flex-template build` command with the `--temp-location` and `--staging-location` flags. Second, at run time, which may differ depending on how you’re launching your jobs: through an Airflow operator as part of a DAG, on a schedule with a Dataflow data pipeline, via a REST API call, or through the `gcloud dataflow flex-template run` command.
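For instance, a minimal build-time sketch (the template path, image, and bucket names are placeholders):

gcloud dataflow flex-template build gs://your-bucket/templates/my_pipeline.json \
  --image "us-docker.pkg.dev/my-project/my-repo/my-image:1.0" \
  --sdk-language PYTHON \
  --staging-location gs://your-dataflow-staging-bucket-path/staging \
  --temp-location gs://your-dataflow-staging-bucket-path/tmp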

Gotcha:

However you decide to launch your job, even if you have already specified your staging and temp locations at build time, you must also specify them at run time. Not realizing this was the cause of much confusion for me and honestly it feels like a bug. If you specify the staging and temp locations at build time, you’ll see them defined in your flex-template JSON file under the `defaultEnvironment` object. If you don’t re-specify them at run time with the appropriate object paths (`gs://your-dataflow-staging-bucket-path/staging` and `gs://your-dataflow-staging-bucket-path/tmp` for staging and temp files respectively), Dataflow will override them and generate a default bucket for you. However, guidance from Google states that the default bucket should not be used and that using your own bucket is best practice.
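A run-time sketch that re-specifies both locations (the job name, template path, and region are placeholders):

gcloud dataflow flex-template run "my-job" \
  --template-file-gcs-location gs://your-bucket/templates/my_pipeline.json \
  --region us-central1 \
  --staging-location gs://your-dataflow-staging-bucket-path/staging \
  --temp-location gs://your-dataflow-staging-bucket-path/tmp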

How Konfig can help:

We took this into consideration with Konfig, which overrides the default bucket for you with proper object paths for both the temp and staging locations, while also managing the buckets and setting the appropriate IAM roles, so you don’t have to think about it and know you’re following best practices.

IAM and Service Accounts

IAM is pretty straightforward; however, I encountered some issues while launching my Dataflow jobs where the error messages did not clearly identify the problem, so I figured this information might help someone else out. With Dataflow, there are two service accounts:

Service-Agent:

This is managed by Google, and not much thought needs to be put into it as it is created automatically when the Dataflow API is enabled.

Worker Service Account:

Whenever you run your Dataflow job, this is the service account you provide. At a bare minimum, it needs the access listed below (a sketch of granting these roles with `gcloud` follows the list). You will also need to grant the appropriate roles for your input and output resources.

Required IAM Roles for the Worker Service Account

Dataflow Worker:

  • roles/dataflow.worker: required to run the jobs.

Storage Buckets:

  • roles/storage.objectUser on the flex-template storage bucket (where the flex-template JSON files live).
  • roles/storage.objectUser on the staging/temp GCS storage bucket.

Artifact Registry Repository:

  • roles/artifactregistry.reader on the repository storing the Dataflow image referenced in your flex-template JSON.
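As promised, a minimal sketch of granting these roles with `gcloud` (the service account, project, bucket, and repository names are all placeholders):

# Worker service account (placeholder name).
SA="dataflow-worker@my-project.iam.gserviceaccount.com"

# Dataflow worker role at the project level.
gcloud projects add-iam-policy-binding my-project \
  --member "serviceAccount:$SA" --role roles/dataflow.worker

# Object access on the flex-template and staging/temp buckets.
gcloud storage buckets add-iam-policy-binding gs://my-flex-template-bucket \
  --member "serviceAccount:$SA" --role roles/storage.objectUser
gcloud storage buckets add-iam-policy-binding gs://my-dataflow-staging-bucket \
  --member "serviceAccount:$SA" --role roles/storage.objectUser

# Read access on the Artifact Registry repository hosting the pipeline image.
gcloud artifacts repositories add-iam-policy-binding my-repo \
  --location us-central1 \
  --member "serviceAccount:$SA" --role roles/artifactregistry.reader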

Gotcha:

This last one is tricky! If you do not grant your worker service account the Artifact Registry reader role on the repository where your images are hosted, your job will fail, because the launcher VM needs it to pull the image from the Artifact Registry repository. If you hit this issue, the launcher error logs are not helpful at all: the error message will state that it was unable to access the GCS staging location, and the launcher fails during the execution-graph creation phase without writing anything to the staging bucket.

How Konfig can help:

Konfig creates a custom service account for your Dataflow workload and attaches the least-privileged roles to it, including access to the resources you identify as necessary for Dataflow. This way, you won’t have to spend time deciphering obscure error messages related to IAM.

As I mentioned above, these are some painful things that I have learned working with flex-templates. Hopefully some of this information will help someone out there struggling to get started with Dataflow. Much of it could not be found in the Google documentation and required searching other sources, some of which are quite old, which prompted me to create this post. So shout-outs are due: Neil Kolban’s Medium article on flex-templates and Travis Webb’s post on custom containers with flex-templates. If either of you see this, thank you for helping me understand this system a bit more!

If you find yourself struggling with infrastructure, trying to cobble together solutions in hopes that they will work but still getting stuck, reach out to us at Real Kinetic. We are a team of experienced engineers who have navigated these challenges so you don’t have to! If you’re looking for a reliable solution, check out Konfig, our platform designed to help you set up and manage your cloud infrastructure in a straightforward and efficient manner.

This post was originally published on our Substack. Subscribe to get this content automatically!
