How to Automate Dataflow Flex-Template Deployments with GitLab CI/CD

Matt Perreault
Published in Real Kinetic Blog · 8 min read · Feb 23, 2024

If you find value in this article, please applaud and share it. If you spot any discrepancies or just want to chat, leave a comment and I will get back to you. This article was not generated by a GPT but by a real-life human who wants to interact with you!

In my last article, I urged companies to approach data engineering the way they approach their software products. In this article I apply software engineering SDLC principles by walking through how to automate your Google Cloud Dataflow development life cycle with GitLab CI/CD pipelines. I will briefly discuss what a Dataflow flex-template is and which components it is made of. Then we will walk through how to build a GitLab CI/CD pipeline that deploys a flex-template and how to verify that the pipeline is working. I am assuming you already know what Dataflow is and what problem(s) it solves.

First, let’s take a look at what a Dataflow flex-template is. A flex-template is made up of two main components: a Dockerfile and a template specification JSON file. The Dockerfile is used to bundle up your Apache Beam pipeline code. The template spec file is used by Dataflow to know which options your pipeline will need at runtime and holds a pointer to the image created by the Dockerfile. You can optionally define a metadata JSON file to validate the pipeline parameters. This setup has the advantage of separating the implementation and deployment of your Dataflow pipeline from its execution. A data engineer can write and deploy their Beam code, and another team, say a business analyst or data science team, can then run that pipeline. For more information on flex-templates, reference the Google documentation on the topic.
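For illustration, a minimal metadata file might look like the following (the pipeline name and the input_file parameter are placeholders for this example, not values required by the format):

{
  "name": "Count Produce",
  "description": "Example batch pipeline that counts produce items.",
  "parameters": [
    {
      "name": "input_file",
      "label": "Input file",
      "helpText": "Cloud Storage path to the input file, for example gs://my-bucket/produce.csv",
      "isOptional": false,
      "regexes": ["^gs:\\/\\/.+$"]
    }
  ]
}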

For our deployment technology I have decided to use GitLab CI/CD. It is becoming an increasingly popular tool among enterprises large and small. I find its UI intuitive, and with CI linting it catches those annoying YAML errors early and often in the development process, making the engineer’s life easier. It also has nice security and environment integrations out of the box, such as environment-specific and protected variables. For more information on CI/CD templates, check out GitLab’s documentation.

Before we get started, there are some housekeeping pieces to get out of the way. You will need Python 3.11, Docker, and the gcloud CLI installed. I am assuming that you already have a GCP project that you plan to work within. This is where we will upload our Docker image (Artifact Registry) as well as our template specification file (GCS). Let’s get started!
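A quick way to sanity-check the prerequisites (the exact versions will differ on your machine):

$ python3 --version   # should report 3.11.x
$ docker --version
$ gcloud --version
$ gcloud config get-value project   # confirm you are pointed at the right GCP project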

Dockerfile

The Dockerfile is pretty straightforward. We will utilize the current flex-template base image provided by Google; here is the reference for their list of base images. We make sure to set the required environment variables: FLEX_TEMPLATE_PYTHON_PY_FILE points to the Python file that defines the entry point for your Beam pipeline, and FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE points to the requirements.txt. When running a Python Dataflow job we need to prepackage the dependencies so they load properly onto the VM instances running the job, so we download the precompiled dependencies into the Dataflow staging directory /tmp/dataflow-requirements-cache.

FROM gcr.io/dataflow-templates-base/python311-template-launcher-base

ENV FLEX_TEMPLATE_PYTHON_PY_FILE="/template/count_produce.py"
ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="/template/requirements.txt"

COPY . /template

RUN apt-get update \
    && rm -rf /var/lib/apt/lists/* \
    && pip install --no-cache-dir --upgrade pip \
    # install dependencies
    && pip install --no-cache-dir -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE \
    # download dependencies into Dataflow staging directory
    && pip download --no-cache-dir --dest /tmp/dataflow-requirements-cache -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE

# Prevents pip from redownloading and recompiling all of the dependencies
ENV PIP_NO_DEPS=true
ENTRYPOINT [ "/opt/google/dataflow/python_template_launcher" ]

Test that the image builds:

$ docker build . -t dataflow-pipeline

GitLab CI/CD

Now that we have our Dockerfile set up to build our Beam pipeline, we can build out the GitLab CI/CD pipeline. This process breaks down into three main steps: build the Docker image, push that image to Artifact Registry, and build the template specification file. We will also set up some workflow rules that define when the pipeline is kicked off. There are a couple of concepts about GitLab’s approach to CI/CD that need to be understood, which I will describe as we walk through the code.

A pipeline consists of jobs, which describe the tasks to perform, such as building containers or deploying artifacts to a registry, and stages, which define when those jobs run. Below is how to define the stages the jobs will run in for this pipeline.

.gitlab-ci.yml

stages:
  - build-image
  - deploy-image
  - deploy-template

Now we need to define the jobs that run in these stages. First we build the image from the Dockerfile referenced above and push it to the project’s GitLab container registry. This initial job keeps the pipeline clean and simple and lets us break it up into logical steps. It also has the benefit of using the GitLab registry as a build location for managing your image promotions, rather than having to copy images between GCP projects.

There are two common ways to build Docker images in a pipeline: with Docker itself, using the Docker-in-Docker (dind) service, or with kaniko. I have found kaniko very nice and straightforward to work with, and it builds the image without needing a privileged Docker daemon. Below is the code that builds our Dockerfile and pushes the resulting image to the GitLab registry.

.gitlab-ci.yml

variables:
  DOCKER_IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA

docker-build:
  stage: build-image
  image:
    name: gcr.io/kaniko-project/executor:v1.14.0-debug
    entrypoint: [""]
  script:
    - /kaniko/executor
      --context "${CI_PROJECT_DIR}"
      --dockerfile "${CI_PROJECT_DIR}/Dockerfile"
      --destination "${DOCKER_IMAGE}"

If you have not yet created an Artifact Registry repository in your GCP project, go ahead and do that now. Here are the docs for getting that set up.
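If you prefer the command line, a Docker-format repository can also be created with gcloud; the repository name, location, and description below are placeholders, so match them to whatever your $GCP_ARTIFACT_REPO variable points at:

$ gcloud artifacts repositories create my-docker-repo \
    --repository-format=docker \
    --location=us-central1 \
    --description="Dataflow flex-template images"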

Next we will create a job that pushes the newly built image to a GCP Artifact Registry repository.

In the next code snippet you will notice that I make heavy use of GitLab’s CI/CD variables. These are set on the Settings -> CI/CD page of your repository. The most notable pieces for this next segment are the Docker settings that need to be in place for it to run. First, the services: docker:dind entry is needed to start the Docker daemon. Second, the DOCKER_* variables make sure the Docker TLS certificates are available to the docker commands being run.

.gitlab-ci.yml

docker-push:
  stage: deploy-image
  image: google/cloud-sdk:alpine
  # Start the Docker daemon
  services:
    - docker:20.10.17-dind
  variables:
    GCP_DOCKER_IMAGE: "${GCP_ARTIFACT_REPO}/dataflow/${CI_COMMIT_SHORT_SHA}"
    # Docker host address
    DOCKER_HOST: tcp://docker:2376
    # Docker certificate location
    DOCKER_TLS_CERTDIR: "/certs"
    # Enforce TLS verification
    DOCKER_TLS_VERIFY: 1
    # Location of the Docker TLS client certificates
    DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client"
  before_script:
    # Log in to the GitLab container registry
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    # Authenticate with GCP
    - echo $GCP_SA_KEY_FILE | base64 -d | gcloud auth activate-service-account --key-file=-
    - gcloud config set project $GCP_PROJECT_ID
    # Authenticate with GCP Artifact Registry
    - gcloud auth configure-docker --quiet us-central1-docker.pkg.dev
  script:
    # Pull image from GitLab container registry
    - docker image pull $DOCKER_IMAGE
    # Re-tag image with the GCP Docker image name
    - docker image tag $DOCKER_IMAGE $GCP_DOCKER_IMAGE
    # Push image to GCP Artifact Registry
    - docker image push $GCP_DOCKER_IMAGE

Now that we have our Docker image pushed and tagged in the GCP Artifact Registry repository, we are ready to build the template specification file. For this job we leverage the gcloud dataflow flex-template build command to create the template spec file. The one requirement is that a GCS bucket must already exist to hold the template file, so make sure you have one created.
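If the bucket does not exist yet, it only takes one command to create it; the bucket name and location here are placeholders:

$ gcloud storage buckets create gs://my-dataflow-templates --location=us-central1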

.gitlab-ci.yml

build-spec-template:
  stage: deploy-template
  image: google/cloud-sdk:alpine
  variables:
    # Job-level variables are not shared between jobs, so the image name
    # must match the one defined in the docker-push job above.
    GCP_DOCKER_IMAGE: "${GCP_ARTIFACT_REPO}/dataflow/${CI_COMMIT_SHORT_SHA}"
  before_script:
    - echo $GCP_SA_KEY_FILE | base64 -d | gcloud auth activate-service-account --key-file=-
    - gcloud config set project $GCP_PROJECT_ID
  script:
    # Build flex-template
    - echo Creating template spec file
    - |
      gcloud dataflow flex-template build gs://$GCS_BUCKET/dataflow-template.json \
        --image "$GCP_DOCKER_IMAGE" \
        --sdk-language "PYTHON" \
        --metadata-file "metadata.json"

You can find the complete source code at our Real Kinetic repo:

https://gitlab.com/real-kinetic-oss/code-lab/dataflow-cicd

Troubleshooting

If your pipeline fails, you can debug it by navigating to the Build -> Pipelines page and clicking into the failed pipeline. From there you can zero in on the issue. For example, if you have YAML syntax errors you will see an indicator letting you know and a modal detailing where the issue is. If a particular job is giving you trouble, you can go into the Jobs section, where you will find the output logs of the specific runner for that job.

[Screenshot: GitLab Build settings]
[Screenshot: example YAML syntax error]

Verify

After the pipeline has run there are a couple of things to verify. First, navigate to the Artifact Registry repository to make sure that the Docker image was pushed there with the correct commit SHA.

[Screenshot: Google Cloud Artifact Registry console]

Next, verify that your template file was created by going to the Google Cloud Storage console. Open up your template file and make sure that the image field of the JSON object points to the image above.
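The exact contents of the spec file depend on your gcloud version, but it should look roughly like the trimmed sketch below, with image pointing at the Artifact Registry image pushed in the previous stage and metadata mirroring your metadata.json (the values shown are placeholders):

{
  "image": "us-central1-docker.pkg.dev/my-project/my-docker-repo/dataflow/abc1234",
  "sdkInfo": {
    "language": "PYTHON"
  },
  "metadata": {
    "name": "Count Produce",
    "parameters": [...]
  }
}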

Finally, you can run your job. There are a few ways to start it. The simplest is via the Dataflow console in GCP: navigate to the Dataflow page, click “Create Job from Template”, select the “Custom Template” option, then enter the GCS path where you placed your template file along with any parameters your job needs. Alternatively, you can run it via the gcloud dataflow flex-template run command or through some other means such as Apache Airflow.
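As a sketch, running the template from the command line looks something like this; the job name, region, and the input_file parameter are illustrative and depend on your own pipeline:

$ gcloud dataflow flex-template run "count-produce-$(date +%Y%m%d-%H%M%S)" \
    --template-file-gcs-location "gs://$GCS_BUCKET/dataflow-template.json" \
    --region us-central1 \
    --parameters input_file=gs://my-bucket/produce.csv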

[Screenshot: execution stages of a Dataflow job, represented as a workflow graph]

Conclusion

In this tutorial we walked through preparing a Dataflow flex-template deployment pipeline. We started by defining the Dockerfile where the Apache Beam code is built. Then we crafted the GitLab CI/CD pipeline. We tackled some common pitfalls, such as making sure to set the correct Docker environment variables in the CI/CD pipeline. We also briefly discussed some best practices for managing Docker images in a production environment, such as having an image promotion story in place. With these pieces in place, your data organization is now applying SDLC best practices to its batch and streaming Dataflow pipelines! If you or your organization are looking for more guidance on modernizing your data stack with GCP, please reach out to us at Real Kinetic; we would love to assist.


Based in Colorado. In my day job I build and architect data-intensive systems in the cloud.