Git branching: Best practices for BI and Data Teams

Evaluating various Git branching practices for Beat’s BI and Analytics Engineering Team

Rahul Jain
Beat Engineering Blog

--


Introduction

Data informs and motivates every aspect of decision making at Beat, from how our product evolves to serve the growing needs of our customers, to how we operate our services in some of the most complex markets in the world. The central Data Science and Analytics tribe (nicknamed DNA) at Beat is responsible for enabling every corner of the company to exploit data models and customer insights. Within DNA, the Analytics and Big Data Engineering team owns and maintains several automated data pipelines to model, enrich and transform data in a highly scalable manner.

Like most other teams at Beat, we use Git (GitHub) for source code versioning and management. As a version control system, Git has been around for more than 15 years and has become the de facto source control tool for software teams. Unlike its predecessors, Git (like other distributed version control tools) allows easy and inexpensive branching of code, enabling multiple engineers and analysts to work simultaneously on a shared codebase.

Although version control has long been standard practice among software engineering teams, it is only recently¹ ² that data teams developing analytical assets such as data pipelines, automated reports and dashboards have started adopting version control systems to track and manage their artefacts.

At Beat, we follow a data-pipeline-as-code approach to author our data products and systems, so our development process depends heavily on Git and its capabilities. As we gear up towards a more headless approach to BI, with most of the modelling and transformation embedded in source code, it becomes important to adopt a development process that caters to our specific needs.

In this post we will walk you through the branching strategies that we currently use at Beat. Over the past few years, we have tested these strategies in production and identified their strengths and weaknesses, which we describe in the following sections.

Git Branching Patterns

A typical development workflow³ involves splitting a branch off the mainline⁴, making changes to it and then merging it back into the mainline at a later point. Several such branches can exist simultaneously, often one per feature, each merged back when its feature is complete.
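This branch-and-merge cycle can be sketched end to end in a scratch repository. This is a hypothetical walk-through; the branch name, file name and throwaway identity are purely illustrative:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "demo@example.com"   # throwaway identity for the demo
git config user.name "Demo"
git commit -q --allow-empty -m "initial commit"
mainline=$(git symbolic-ref --short HEAD)  # mainline branch, whatever git names it

git checkout -q -b feature/new-model       # split a branch off the mainline
echo "select 1;" > model.sql               # hypothetical change
git add model.sql
git commit -q -m "add example model"       # work happens on the branch

git checkout -q "$mainline"                # back on the mainline
git merge -q --no-ff feature/new-model -m "merge feature/new-model"
git branch -q -d feature/new-model         # the branch has served its purpose
```

The `--no-ff` flag keeps an explicit merge commit, so the feature's history remains visible in the mainline.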

Over the history of version control tools, several patterns have emerged to enable this workflow. The choice of these patterns often depends not just on the type of artefacts being developed but also on the topology of the team itself.

Martin Fowler has written an extensive wiki⁵ on the various patterns of source code branching, mainly aimed at software development teams.

Gitflow

One of these patterns is the famous Gitflow process. Originally introduced eleven years ago by Vincent Driessen in his now seminal post — A successful Git branching model, Gitflow has become extremely popular as a branching strategy among software teams.

Gitflow is very flexible. It shines when different teams share the same code repository and, more importantly, when multiple versions of the code may need to be maintained in production simultaneously (e.g. packaged software distributed to multiple customers). Each “release” point is clearly tagged in the mainline, making it very easy to roll back to an older release (or apply a hotfix) when things go wrong.
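The tagging and rollback that make this attractive can be sketched as follows. This is a hypothetical example: the version numbers are made up, and a hard reset is only one of several ways to return to a tagged release:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "demo@example.com" && git config user.name "Demo"

git commit -q --allow-empty -m "release: version 1.4.0"
git tag -a v1.4.0 -m "release 1.4.0"   # tag the release point on the mainline

git commit -q --allow-empty -m "a change that misbehaves in production"
git reset -q --hard v1.4.0             # roll straight back to the tagged release
```

Because every release is an annotated tag, finding the last known-good state is a lookup rather than an archaeology exercise.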

In the Analytics Engineering team, we have traditionally followed the Gitflow approach for branching and merging Apache Spark applications. Gitflow was chosen mainly because the code repository for Spark jobs was shared with other data teams at Beat, the repository contained a few reusable code components (e.g. utility libraries), and we wanted stricter control over the release process.

Considering Alternatives

In the spirit of our core values of “Challenge Often, Commit always” and “Do more with less”, everyone at Beat is encouraged to frequently challenge the status quo and find smarter, more efficient ways of working. This is how we continuously improve. It was with exactly this in mind that the Analytics Engineering team recently sat down to take a fresh look at our branching strategy and realised the following:

  • We are moving to our own independent code repositories from the shared monorepo.
  • Achieving higher velocity in development and deployment is extremely important to us.
  • Each data pipeline is an independent artefact that is often developed and deployed independently from others.
  • We need a faster release cadence with the ability to deploy several times a day, if needed.
  • Unlike packaged software, our code artefacts are consumed locally. In other words, the code (Spark and SQL pipelines) is run by the team, and only the outputs (transformed tables, reports or dashboards) are shared with data consumers.
  • Code rollbacks happen but only rarely.

In this setup, Gitflow, while safer and more organised, felt too complicated for our needs. Its release process is slow and time-consuming, and it discouraged us from making frequent deployments to production.

This compelled us to look at alternatives that support a faster development cycle with less ceremony.

If Gitflow sits at one end of the spectrum of branching techniques, the other end is occupied by trunk-based development, popularised by Paul Hammant. Trunk-based development eschews long-running feature branches in favour of small, frequent, incremental commits straight to the mainline (or via merges from very short-lived feature branches), preferably accompanied by automated builds and testing. The trunk-based process also enables continuous integration and deployment.
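In miniature, the trunk-based loop might look like this. This is a hypothetical sketch: the SQL file is made up, and the grep merely stands in for a real automated test suite gating each commit:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "demo@example.com" && git config user.name "Demo"
git commit -q --allow-empty -m "initial commit"

# First small change: straight to the mainline, gated by a check.
echo "select count(*) from trips;" > daily_trips.sql
grep -q "select" daily_trips.sql           # stand-in for an automated test
git add daily_trips.sql
git commit -q -m "add daily trips model"

# Next small change follows the same pattern shortly after.
echo "select count(*) from trips where day = current_date;" > daily_trips.sql
grep -q "select" daily_trips.sql
git add daily_trips.sql
git commit -q -m "restrict daily trips model to current day"
```

The point is cadence: many small, individually tested commits landing on the trunk, rather than one large merge at the end of a long-lived branch.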

We also looked at other alternatives such as GitHub flow and found them quite similar to the trunk-based process, differing only in minor details.

Evaluating all the options, we decided that trunk-based development would fit our needs best, as it would enable better process efficiency for our data transformation pipelines.

This approach allows our Data Analysts and Analytics Engineers, who often work independently or in small teams, to commit code many times a day. Commits are small and branches are short-lived, allowing the team to achieve high velocity.

For our Spark pipelines, we still use Gitflow and will gradually migrate the ETL code to the trunk-based process. As a first step, though, we pulled the more frequently changing components, such as configuration files, Helm charts and orchestration workflows, out into a separate repository where we follow the trunk-based process.

Conclusion

We believe that trunk-based development is a more suitable branching strategy for most data activities, as it allows data teams to achieve higher development velocity and enables continuous integration and deployment of their artefacts in production.

We will end this post with a few tips for effectively using version control and branching in your data teams:

  • Use a distributed version control system, preferably Git.
  • Follow the principles of data-pipeline-as-code. More on this in another post, but in short: treat all content as source code, including SQL, dashboards, pipeline schedules, data transformations, data quality checks and even documentation. Doing this lets you apply the best software engineering practices, which help create a more maintainable data ecosystem, more predictable KPIs and shorter cycle times for data analytics.
  • Use trunk-based development (tailored to your needs if required) for your development and deployment workflows.
  • When buying an external BI tool, check if it supports version control. Most (though not all) modern BI tools support this.

If you found this article interesting and are looking for a new challenge in data and engineering, check out our open roles. Join us on the ride!

Footnotes

[1]: https://towardsdatascience.com/introduction-to-github-for-data-scientists-2cf8b9b25fba

[2]: A quick Google search reveals that most of the discourse around version control in data communities is fairly recent (circa 2018 onwards)

[3]: Though not the only one

[4]: Mainline refers to the current “published” state of code at any time — https://martinfowler.com/articles/branching-patterns.html#mainline

[5]: He prefers to call it Bliki, a portmanteau of Blog + Wiki (https://martinfowler.com/bliki/WhatIsaBliki.html)

About the Authors

The Analytics and Big Data Engineering team at Beat is responsible for modelling and transforming raw data into meaningful analytical models that drive our Data Science and Analytics function. The team owns and maintains several automated data pipelines to model, enrich and transform data for further analysis. In addition, we help promote better data governance practices within Beat.

We are Marios Alexiou, Anastasia Miliopoulou, Rahul Jain, Alexandros Mavrommatis, Athina Kalampogia, Petros Vamvakaris, Anna Maria Zografou and Ioannis Aganthangelos.
