Reverse ETL for Granular Billing Data with BigQuery and Dataflow
Reporting on Cloud Spend with GCP
Reporting on cloud spend has become a top priority for many business leaders. Today, I will walk you through how to work with GCP to get granular data on cloud service costs. This involves utilizing the Billing export job in BigQuery, leveraging CTEs and nested records in analytics queries, and writing a Dataflow job with custom transformation logic to make the data digestible for other parts of your organization.
At Real Kinetic, we’ve applied this process to display cost data for our customer environments, integrating it directly into our Konfigurate platform. We built Konfigurate with these challenges in mind, ensuring that cost tracking is streamlined and accessible at every level of the platform hierarchy. One key takeaway from our approach is the importance of organization and project modeling. It’s a simple yet often overlooked aspect of platform engineering, but failing to get it right can lead to sunk time and escalating costs over the years.
My goal is to show you how to build a full-stack reverse ETL pipeline, where each step of the process adds value for your service stakeholders. Let’s start by breaking down the stack and the pipeline.
BigQuery Billing Export
We’ll use the BigQuery Billing export job powered by BigQuery Data Transfer Service to upload daily cloud spend data into BigQuery. Depending on the level of detail you need, you can choose between different types of exports. For this pipeline, we’ll use the detailed usage report, which provides service-level details for a wide range of GCP products.
When setting up your billing export, you’ll need a billing project. This is the project where your BigQuery dataset and table for the exported data will reside. Ensure the Billing API is enabled for this project along with the BigQuery Data Transfer API. Once that’s done, create a BigQuery dataset to house your detailed usage report table. Give the dataset a clear, descriptive name, as it should be dedicated solely to your billing data.
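If you prefer to script that setup rather than click through the console, a minimal sketch with the google-cloud-bigquery client might look like this; the project ID, dataset name, and location are placeholders for your own values, and enabling the APIs themselves is still a console or gcloud step:

from google.cloud import bigquery

# Placeholder project and dataset names; substitute your billing project and preferred region.
client = bigquery.Client(project="my-billing-project")
dataset = bigquery.Dataset("my-billing-project.billing_export")
dataset.location = "US"
dataset.description = "Dedicated dataset for the detailed Cloud Billing export"
client.create_dataset(dataset, exists_ok=True)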
With all the export pieces in place, we can create the export job. Navigate to the Billing section of the GCP console. From the left-hand menu, click Billing Export, and under Detailed Usage Cost, click Edit Settings. Here, select the billing project and the BigQuery dataset you created earlier. Save the settings, and data will begin loading into your dataset.
The table will be created automatically with a name in the following format: <dataset>.gcp_billing_export_resource_v1_<billing_account_id>
Note that it may take a few days for the initial dataset to be populated, so don’t worry if you don’t see data immediately.
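Once data starts arriving, you can sanity-check the export by listing the tables in the dataset; again, a rough sketch with placeholder names:

from google.cloud import bigquery

client = bigquery.Client(project="my-billing-project")
for table in client.list_tables("my-billing-project.billing_export"):
    # Expect a table named gcp_billing_export_resource_v1_<billing_account_id>
    print(table.table_id)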
BigQuery Data Analysis
Next, we’ll analyze the exported data to ensure it provides the granularity needed for each of our services. This is where using conformed dimensions across your services pays dividends. We’ll rely on labels to break down granular cost data.
This approach can be extended further within a Kubernetes cluster. By labeling Kubernetes objects, such as deployments, you can derive deployment-level cost statistics, offering even more precise insights into resource usage. To get GKE cost data, you will need to enable GKE cost allocation when configuring the export. Keep in mind that this will increase your BigQuery storage costs.
Implementing consistent labels across all organizations within a business is essential for achieving accurate and reliable cost analysis. Without standardized labels, attributing costs to specific teams, departments, or services becomes significantly more challenging.
Below is an example query that uses a simple CTE to retrieve department-level cost data. For organizations using domain-driven design, this could map directly to a domain:
WITH department_cost AS (
  -- Monthly cost per department, derived from the 'department' project label
  SELECT
    project_labels.value AS department,
    SUM(CAST(cost AS NUMERIC)) AS department_monthly_cost
  FROM `<dataset>.<data-table>`
  JOIN UNNEST(project.labels) AS project_labels
    ON project_labels.key = 'department'
  GROUP BY project_labels.value
)
-- Per-service monthly cost, joined back to its department total
SELECT
  project_labels.value AS department,
  project.name AS service,
  SUM(CAST(cost AS NUMERIC)) AS service_monthly_cost,
  dc.department_monthly_cost
FROM `<dataset>.<data-table>`
JOIN UNNEST(project.labels) AS project_labels
  ON project_labels.key = 'department'
JOIN department_cost AS dc
  ON dc.department = project_labels.value
GROUP BY project.name, project_labels.value, dc.department_monthly_cost
ORDER BY department ASC, service_monthly_cost DESC;
In this example output, we retrieve the department, the service in that department, the monthly cost incurred by the department, and the monthly cost of each service within that department. Notice the use of JOIN UNNEST(project.labels). This technique is a powerful way to work with RECORD types in BigQuery, allowing you to flatten and filter them as needed. In this example, we rely on the fact that each service is labeled with its corresponding department.
You may notice that this query breaks a cardinal rule of Kimball dimensional modeling by mixing grains in a single output table (department and service). This was a conscious decision for this use case, as our end users require department and service data to be presented in a single row. If your use case requires strict adherence to dimensional modeling principles, you could separate these queries into views and merge their outputs in a Dataflow pipeline. The design decision depends on your specific requirements.
For my needs, where no business intelligence engineers are running queries against these outputs, this approach works well.
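For illustration, if you did split the department and service grains into separate views, merging them back together in Beam (the same SDK used in the Dataflow section below) could look roughly like this; the view names are hypothetical:

import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery

with beam.Pipeline() as p:
    # Hypothetical views, one per grain
    departments = p | "Read department view" >> ReadFromBigQuery(
        query="SELECT * FROM `<dataset>.department_monthly_cost_view`",
        use_standard_sql=True,
    )
    services = p | "Read service view" >> ReadFromBigQuery(
        query="SELECT * FROM `<dataset>.service_monthly_cost_view`",
        use_standard_sql=True,
    )
    # Re-key each grain by department, then join them downstream
    merged = (
        {
            "department": departments | "Key departments" >> beam.Map(lambda r: (r["department"], r)),
            "service": services | "Key services" >> beam.Map(lambda r: (r["department"], r)),
        }
        | "Merge grains" >> beam.CoGroupByKey()
    )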
A few important notes about the table created by the billing export job:
- Partitioning: The table is partitioned by DAY, which can improve query performance on larger datasets when you filter on the _PARTITIONTIME column appropriately in your queries (a sketch follows this list).
- Write Disposition: The job uses a WRITE_TRUNCATE disposition, meaning the table is overwritten multiple times a day. You can confirm this by examining the export_time column in the table.
- Pipeline Considerations: Keep this write behavior in mind when designing your data pipelines to pull from this table, as the frequent updates may impact your pipeline’s logic.
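To take advantage of that partitioning from a pipeline, it’s worth constraining the scan to recent partitions. Here is a minimal sketch of how that query string could be built in Python; the 35-day lookback is an arbitrary assumption, not something the export requires:

# Hypothetical lookback window; tune it to your reporting needs.
LOOKBACK_DAYS = 35

partition_filtered_query = f"""
SELECT *
FROM `<dataset>.<data-table>`
WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {LOOKBACK_DAYS} DAY)
"""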
Note: Keep an eye out for a follow-up post on how I reduced query slot time and shuffle by over 50%!
Reverse ETL with a Dataflow Job and Pipeline
Speaking of your data pipeline, we’ll now write a Dataflow job to transform the analyzed data and load it into a Firestore database. This step ensures that the data is accessible to downstream consumers and completes the reverse ETL process — extracting data from an analytics datastore and moving it into a non-analytical datastore. I won’t show the entire pipeline here, as each of you will have your own transformations tailored to meet business needs. However, I’ll highlight a few key points.
First, I am using the Python Beam SDK, as it’s what I am most comfortable with. Second, I utilize the ReadFromBigQuery PTransform from the apache_beam.io.gcp.bigquery module, passing in a parameterized version of the query you saw earlier, with the BigQuery dataset and table filled in. You will want to parameterize this query to whatever degree makes sense for your use case. Make sure to set the method to “DIRECT_READ”. This ensures the data is read directly from storage using the BigQuery Storage Read API.
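To make that concrete, here is a rough sketch of how monthly_cost_query could be assembled from the query in the analysis section; the table reference and label key are placeholders you would substitute with your own values:

# Placeholders: substitute your billing dataset, export table, and label key.
BILLING_TABLE = "<dataset>.<data-table>"
DEPARTMENT_LABEL_KEY = "department"

monthly_cost_query = f"""
WITH department_cost AS (
  SELECT
    project_labels.value AS department,
    SUM(CAST(cost AS NUMERIC)) AS department_monthly_cost
  FROM `{BILLING_TABLE}`
  JOIN UNNEST(project.labels) AS project_labels
    ON project_labels.key = '{DEPARTMENT_LABEL_KEY}'
  GROUP BY project_labels.value
)
SELECT
  project_labels.value AS department,
  project.name AS service,
  SUM(CAST(cost AS NUMERIC)) AS service_monthly_cost,
  dc.department_monthly_cost
FROM `{BILLING_TABLE}`
JOIN UNNEST(project.labels) AS project_labels
  ON project_labels.key = '{DEPARTMENT_LABEL_KEY}'
JOIN department_cost AS dc
  ON dc.department = project_labels.value
GROUP BY project.name, project_labels.value, dc.department_monthly_cost
ORDER BY department ASC, service_monthly_cost DESC
"""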
import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery

with beam.Pipeline(options=run_options) as p:
    query_results = p | "BigQuery query request" >> ReadFromBigQuery(
        query=monthly_cost_query,
        use_standard_sql=True,
        method="DIRECT_READ",  # Secret sauce
    )
    query_results | "Write to Firestore" >> beam.ParDo(
        BigQueryToFirestore(collection="gcpBilling")
    )
Now, write your custom transform.
import logging

import apache_beam as beam
from google.api_core import retry

class BigQueryToFirestore(beam.DoFn):
    def __init__(self, collection: str):
        self.collection = collection

    def setup(self):
        # Dataflow distributes your DoFn's across workers.
        # To prevent import errors, import your client in setup.
        from google.cloud import firestore

        try:
            self.client = firestore.Client(database="billing")
        except Exception as e:
            logging.error(f"Failed to initialize Firestore client: {e}")

    def teardown(self):
        self.client.close()

    def process(self, row):
        # BigQuery uses the DECIMAL/NUMERIC type, so cast appropriately
        from decimal import Decimal

        transformed_row = {}
        for key, value in row.items():
            if isinstance(value, Decimal):
                transformed_row[key] = float(value)
            else:
                transformed_row[key] = value
        self.write_to_firestore(transformed_row)

    @retry.Retry(predicate=retry.if_exception_type(Exception))
    def write_to_firestore(self, row):
        collection_ref = self.client.collection(self.collection)
        # Look up an existing document for this department/service pair
        query = (
            collection_ref.where("department", "==", row["department"])
            .where("service", "==", row["service"])
            .limit(1)
        )
        doc = query.get()
        updated_cost_data = {
            "department_monthly_cost": row["department_monthly_cost"],
            "service_monthly_cost": row["service_monthly_cost"],
        }
        # Only update the cost data if the department and service exist
        if doc:
            doc[0].reference.update(updated_cost_data)
        else:
            new_cost_data = {
                "service": row["service"],
                "department": row["department"],
                "department_monthly_cost": row["department_monthly_cost"],
                "service_monthly_cost": row["service_monthly_cost"],
            }
            collection_ref.add(new_cost_data)
An important requirement of data pipelines is idempotency. This means that whether you are running a backfill or a production pipeline, the results should be the same. In this case, we are dealing with a rapidly changing dimension in the cost data, as it updates daily. My requirement was to ensure that no duplicate departments or services are created. As such, my pipeline meets the idempotency requirement for this project.
If you have different requirements, you can leverage the event_time or _PARTITIONTIME columns to ensure the data is loaded correctly.
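Another pattern worth mentioning, separate from filtering on those columns, is to make the Firestore write itself idempotent by deriving a deterministic document ID from the department and service, so repeated runs overwrite the same document instead of querying first. This is not what the pipeline above does, just a sketch of an alternative; the double-underscore ID scheme is an assumption:

# Alternative sketch: upsert keyed on a deterministic document ID.
# The "gcpBilling" collection matches the pipeline above; the ID scheme is an assumption.
def write_idempotently(client, collection: str, row: dict) -> None:
    doc_id = f"{row['department']}__{row['service']}"
    client.collection(collection).document(doc_id).set(
        {
            "department": row["department"],
            "service": row["service"],
            "department_monthly_cost": row["department_monthly_cost"],
            "service_monthly_cost": row["service_monthly_cost"],
        },
        merge=True,  # Update cost fields without clobbering other fields on the document
    )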
During development, I would suggest running this pipeline locally with the direct runner to make sure that everything is working as expected.
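For reference, a minimal sketch of what those local run options might look like; the project and bucket names are placeholders:

from apache_beam.options.pipeline_options import PipelineOptions

# Local development options; swap DirectRunner for DataflowRunner when deploying.
run_options = PipelineOptions(
    runner="DirectRunner",
    project="my-billing-project",             # placeholder project
    temp_location="gs://my-temp-bucket/tmp",  # placeholder bucket for temporary files
)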
At this point, you can package your Dataflow pipeline into a Flex Template and take advantage of Dataflow’s newer feature for setting up recurring pipelines. This is particularly useful because you can now leverage Cloud Scheduler to run your pipeline on a cron schedule. Alternatively, if your pipeline is part of a larger DAG, you can include it in your Airflow/GCP Composer DAG using the Dataflow Data Pipeline operator.
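If your pipeline does live in a Composer DAG, a sketch of launching the Flex Template might look like the following. Note that I’m showing the Google provider’s DataflowStartFlexTemplateOperator here rather than the Data Pipelines operators, and the DAG ID, schedule, template path, and parameters are all hypothetical:

import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowStartFlexTemplateOperator,
)

# Hypothetical DAG and template values; adjust to your project, region, and spec path.
with DAG(
    dag_id="billing_reverse_etl",
    schedule="0 6 * * *",  # daily, after the billing export has refreshed
    start_date=datetime.datetime(2024, 1, 1),
    catchup=False,
) as dag:
    start_billing_pipeline = DataflowStartFlexTemplateOperator(
        task_id="start_billing_reverse_etl",
        project_id="my-billing-project",
        location="us-central1",
        body={
            "launchParameter": {
                "jobName": "billing-reverse-etl",
                "containerSpecGcsPath": "gs://my-templates/billing_reverse_etl.json",
                "parameters": {"collection": "gcpBilling"},
            }
        },
    )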
If you run into any issues setting up your Flex Template, I’ve written about some common gotchas that you can read about here. Additionally, if you’re looking to productionize your Flex Template with a CI/CD pipeline, I’ve got you covered. You can find more details on that here.
Conclusion
There you have it! You’re now equipped to start extracting detailed billing and usage data from your cloud stack using GCP’s native tools. While we’ve covered a lot in this short article, the key takeaway is that enterprise-wide conformed dimensions — specifically labels in this example — are essential for ensuring consistent and reliable data.
It sounds simple, and it is in theory, but in practice, it can be quite challenging. You’ll need to champion your cause, coordinating with stakeholders and service owners to ensure your requirements are met if you want to realize the business value. Unfortunately, this phase often fails in organizations today. What you’re asking for is the integration of best practices in data engineering and platform engineering across departments and teams.
At Real Kinetic, we’ve solved this problem dozens of times. It’s one of the driving factors behind the development of the Konfigurate platform. With Konfigurate, you get built-in service catalog details for all of your deployed services, along with granular cost data at every organizational level. If you have any questions or run into issues, I would be happy to chat in the comments. And if you are a leader looking to tackle the organization and management of your cloud resources to get insights like these, please don’t hesitate to reach out!