Data Build Tool (dbt): The Beat story

Rahul Jain
Beat Engineering Blog
7 min read · Mar 29, 2022

Leveraging dbt as a modern data stack tool at Beat.

Overview

By now, most people working in Data and Analytics have heard of dbt. If you haven’t, what rock have you been living under 🙃? All jokes aside, dbt (written in lowercase) is a framework for writing data transformations in SQL. In doing so, dbt makes use of well-known software and data engineering practices.

This post covers our experience with dbt at Beat in the hope that others on a similar journey can learn from it.

A quick primer on dbt

dbt is an open-source library written in Python. A commercial product (dbt Cloud) is available as an add-on if one needs a fully managed solution. However, the bulk of the functionality is contained in the open-source core modules and plugins.

dbt focuses on the T (Transform) step of the ELT workflow. Once raw data is ingested into an analytics data store (such as a data warehouse), Data Analysts can use dbt to write transformation workflows on top of it. dbt makes heavy use of templated SQL (using Jinja) for code reusability, just like functions or modules in a programming language.
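
As a minimal illustration (the model and column names here are made up for this post, not Beat’s actual models), a dbt model is just a SELECT statement in its own file, with Jinja filling in references to other models:

    -- models/marts/daily_rides.sql (hypothetical example)
    select
        date_trunc('day', requested_at) as ride_date,
        count(*)                        as total_rides
    from {{ ref('stg_rides') }}  -- dbt resolves this reference and builds the dependency graph
    group by 1

The ref() call is what lets dbt infer the execution order of models and render lineage graphs like the one shown later in this post.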

dbt also queries the underlying storage to document the transformation workflows and artifacts. Additional features, like the ability to add tests and auto-generate SQL boilerplate, make dbt an excellent tool for analytics engineering. Before dbt, such capabilities were largely missing from data transformation systems.
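
For example, a “singular” test in dbt is simply a SQL file under the tests/ directory that selects rows violating an assumption; the test fails if any rows come back (the model and column below are hypothetical):

    -- tests/assert_no_negative_fares.sql (hypothetical example)
    -- The test fails if this query returns one or more rows.
    select *
    from {{ ref('fct_rides') }}
    where fare_amount < 0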

The need for dbt mainly arose to improve the productivity of Data Analysts. Further, recent developments in data engineering architecture have made it possible for Analysts to focus on core logic without worrying about boilerplate. Specifically:

  • The birth of highly performant cloud data warehouses (starting with Amazon Redshift in 2012) and the SQL renaissance. This enabled Data Analysts to write transformations in SQL instead of having to learn another programming language like Python or Scala.
  • Availability of cheap storage in public clouds giving rise to ELT (Extract Load Transform) architecture, a divergence from its predecessor, ETL (Extract Transform Load). In ELT, raw, untransformed data is first extracted from data sources and loaded into the data warehouse. Subsequently, Data Analysts can transform this raw data without worrying about the complexity of extracting and loading it.
  • The growing recognition of the benefits of software engineering practices like version control (see our post on this), CI/CD, modular architecture and reusable components in data analytics workflows.
  • The urgent need to rein in the complexity of data pipelines, especially in rapidly growing organizations, by establishing well-defined data models, a common grammar, documentation, data quality checks and governance.

dbt at Beat

As our business at Beat grows, so does our reliance on data for decision making. We run several complex data pipelines each day. However, these pipelines are a mishmash of SQL scripts, Apache Spark jobs and other languages and frameworks. Many of them do not follow common standards, are error-prone and are often difficult to read or debug. As part of our core culture at Beat, we do more with less by identifying opportunities to optimize the use of our resources. Thus, we are now taking a closer look at our data tools and processes to explore ways to reduce this complexity.

dbt couldn’t have come at a better time for us. It fits our needs well and we are very excited to take advantage of its capabilities in our ecosystem. The strongly positive experience of other organizations with dbt has further strengthened our confidence in this tool.

The Setup

We store our analytics data in a data lake (AWS S3 with a Hive metastore) and mainly use Apache Spark and Trino to transform it.

Our dbt project uses the dbt-trino plugin and all models are stored in a single Git repository. The structure of the models follows a data-mart approach.

We use Argo for workflow orchestration. When Argo triggers a workflow, it copies the models into a Docker image that has dbt installed and runs the container on Kubernetes. We use the dbt build command to run both the models and their corresponding tests in the same workflow.

This is a simple setup but sufficient as a starting point.

dbt Deployment Architecture

For documentation, we have a separate Argo job that runs dbt docs generate and then hosts the generated artifacts on a web server running in Kubernetes. The documentation workflow is triggered every hour to keep it up to date and in sync with the latest materialized dbt models.

Lineage graph for one of the dbt workflows

Lessons

As we migrate our data models to dbt, we are collecting many interesting lessons, observations and gotchas. Some of these are mentioned below.

Model Design and Development

  • Migrating the models from raw SQL to dbt is pretty easy and straightforward. We lift and shift our existing models instead of redesigning them. This helps us stay focused on the migration and easily compare results between the old and new systems.
  • Many of our models rely on several “intermediate/temp” models that are written for modularity and performance but shouldn’t be visible to end users. dbt makes it easy to write them to a temporary schema (see the first sketch after this list). However, unless using CTEs, every intermediate model must be specified in its own file. This is painful and ugly. There is an open GitHub issue to support “model blocks”, which will hopefully solve this.
  • dbt does not yet support proper namespacing, so everything lives in a global namespace. This is a big limitation when writing modular code. Currently, the best way to namespace models is to use fully qualified model names, but these can get very long and unreadable very fast. GitHub issue here.
  • dbt DAGs can easily be run and tested from one’s laptop. This is very helpful for development and debugging.
  • We have an interesting use case that requires deleting some data from a model. This is not encouraged by dbt’s design. However, we found that the pre-hook feature can be used for this purpose (see the second sketch after this list) until we get the chance to redesign our models.
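
To make the last two points more concrete, here are two minimal, hypothetical sketches (the names, retention rule and configuration are illustrative, not our production code). First, an intermediate model lives in its own file and can be routed to a temporary schema through its config block:

    -- models/intermediate/int_ride_durations.sql (hypothetical)
    {{ config(materialized='view', schema='tmp') }}

    select
        ride_id,
        ended_at - started_at as ride_duration
    from {{ ref('stg_rides') }}

Second, the pre-hook workaround for deleting data can be expressed as a statement that runs against the model’s own relation just before the model is built:

    -- models/marts/fct_rides.sql (hypothetical)
    {{ config(
        materialized='incremental',
        pre_hook="delete from {{ this }} where ride_date < current_date - interval '365' day"
    ) }}

    select ride_id, ride_date, fare_amount
    from {{ ref('stg_rides') }}
    {% if is_incremental() %}
    where ride_date >= (select max(ride_date) from {{ this }})
    {% endif %}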

Code Reuse

  • dbt macros are very powerful and one can get really creative with them. At Beat, we often run the same transformation workflow for the different regional markets that we do business in. Each market’s data is stored in a different data warehouse, isolated from the other markets. These transformations are alike except for a few aspects such as time zones or schema names. Before dbt, we were forced to maintain multiple copies of our transformation code, one for each market. With dbt’s support for macros, we can templatize the SQL and inject such parameters from Argo.
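
As an illustration (the macro, variable and model names below are invented for this post), a small macro plus a dbt variable is enough to make a model market-agnostic:

    -- macros/to_market_tz.sql (hypothetical)
    {% macro to_market_tz(column) %}
        {{ column }} at time zone '{{ var("market_timezone", "UTC") }}'
    {% endmacro %}

    -- models/marts/daily_rides_local.sql (hypothetical)
    select
        ride_id,
        {{ to_market_tz('requested_at') }} as requested_at_local
    from {{ ref('stg_rides') }}

Argo then injects the market-specific value at run time (for example through dbt’s --vars flag), so the same model compiles correctly for every market.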

Code extensibility

  • Custom materializations are a powerful feature. dbt-trino did not support incremental materialization until recently, but it was fairly easy for us to write our own custom materialization for this purpose.
  • We also envision a use case for a no-op custom materialization for existing models that we have not yet ported to dbt (a rough sketch follows this list). This would help us bring external models into dbt by adding them to the schema.yaml files. Once added, these “external” models can then be enriched with tests and documentation within dbt.
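
For reference, a custom materialization is a named Jinja block that dbt dispatches to via the materialized config. The no-op idea could look roughly like the following untested sketch (the name and the assumption that simply returning the existing relation is sufficient are ours; a real implementation would need to follow dbt’s materialization contract more carefully):

    -- macros/materializations/noop.sql (hypothetical, untested sketch)
    {% materialization noop, default %}

        {{ run_hooks(pre_hooks) }}

        {# Run no DDL/DML: the relation is created and maintained outside
           of dbt. We only hand it back so that dbt can attach tests,
           documentation and lineage to it. #}
        {%- set target_relation = this -%}

        {{ run_hooks(post_hooks) }}

        {{ return({'relations': [target_relation]}) }}

    {% endmaterialization %}

A stub model file configured with materialized='noop' (its SQL body never executed by this materialization) would then be enough to register the external table, after which it can be described and tested in the schema.yaml files as usual.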

Documentation

  • dbt supports hiding intermediate models in the documentation. But this is clunky, as one has to explicitly mark them as hidden in the schema.yaml file by setting “docs: show: false”.
  • The generated documentation is good but could be better. For example, the lineage graph can get too messy too soon. There is a download-to-PNG option, but the PNG is rather pixelated for large lineage graphs. An option to download the graph as SVG or PDF would be very useful.

Conclusion and Next Steps

Overall, dbt fits very well in our data engineering ecosystem. dbt creates a standardized central framework for data transformations. It promotes code reuse and reduces boilerplate so data analysts can focus on the core transformation logic. This improves productivity and creates a shared set of processes and practices for data transformations. It encourages best practices like documentation and testing. Last but not least, it compels data analysts to think of transformations as a modular, model-based system.

dbt is continuously evolving and shaping the best practices for analytical data transformations. At the same time, there is a need for practitioners to tailor the tool to their specific requirements. Thankfully, dbt’s open-core model makes it easy to extend.

The next step on our roadmap is to migrate our existing data workflows to dbt and promote it as a tool of choice across all our data teams.

If you found this article interesting and are looking for a new challenge in data and engineering, check out our open roles. Join us on the ride!

About the Authors

We are the Data and Analytics Engineering Team at Beat. We are responsible for modeling and transforming raw data into meaningful analytical models to drive Business Analytics at Beat. We also help promote better data governance practices at Beat.

We are Marios Alexiou, Alexandros Mavrommatis, Athina Kalampogia, Anna Maria Zografou, Anastasiia Gulevskaia, Ricardo Angel Granados Lopez, Anton Zayniev, Ioannis Agathangelos and Rahul Jain.
