CI/CD in Data Engineering

Posted by Andrew Ridgway on Thu 15 June 2023

Data Engineering has traditionally been considered the bastard step child of work that would once have been considered administrative in the tech world. Predominantly we write SQL and then deploy that SQL onto one or more databases. In fact, a lot of the traditional methodologies around data almost assume this is the core of how an organisation manages the majority of its data. In the last couple of years, though, there has been a very steady move towards bringing software engineering techniques into the Data Engineering SQL workload. With the popularity of tools like dbt and the latest newcomer on the block, SQLMesh, the opportunity has arisen to align our Data Engineering workloads with different environments and move much more efficiently towards a Continuous Integration and Deployment methodology in our workflows.

For the Data Engineering space the move to the cloud has been a breath of fresh air (not so much in some other IT disciplines). I am relatively young, so I don't 100% remember, but my experience has taught me that not so long ago there were three options here:

Expensive:

  • SAS
  • SSIS/SSRS
  • COGNOS/TM1

Rickety:

  • Just write stored procedures!
  • Startup script on my laptop XD
  • "Don't touch that machine over there, No one knows what it does but if it's turned off our financial reports don't work" (This is a third hand story I heard, seriously!)
  • "I need to an upgrade to my laptop, Excel needs more than 8GB of RAM"

Hard:

  • Hadoop
  • Spark (hadoop but whatever)
  • Python
  • R

(The reason I've listed them as hard is because self-hosting Hadoop/Spark and managing a truckload of Python or R scripts, whilst it could have been "cheap", required a team of devs who really, really knew what they were doing... so not really cheap, and also really hard.)

Then there was getting git behind all the SQL scripts and modelling, let alone CI/CD. IF it existed, it was custom and bespoke, and likely had a single point of failure in the person who knew how git merge worked. At least... that's what I was told, I'm not that old ;p.

These days we are pretty blessed. With the democratisation of clusters and data infrastructure in the cloud we no longer need a team of sysadmins who know how to tune a cluster to the Nth degree to get the best out of our data workloads (well... we do, but we pay the cloud guys for that!). However, we still need to know about the idiosyncrasies of this infrastructure, when it is appropriate to use, and how we want to control and maintain the workloads.

In general when I am designing a system I normally like to break it into three parts:

  • Storage
  • Compute
  • Code

In general, Storage and Compute will be infrastructure related. "Code" is sort of a catch-all for my modelling: normally SQL, Python/R or Spark scripts that are used to provide system or business logic, anything really that's going to get data to the end user/analyst/data scientist/annoying person.
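
To make that a little more concrete, here's a toy sketch of what I mean by the "Code" layer. It uses Python's built-in sqlite3 purely as a stand-in for the warehouse (in a real stack this would be Redshift/BigQuery/Trino, and the SQL would live in a dbt or SQLMesh model); the table and logic are made up for illustration.

```python
import sqlite3

# Toy stand-in for the warehouse. In a real stack this connection would be
# to Redshift/BigQuery/Trino and the raw table would land via your EL tool.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, customer_id INTEGER, amount REAL, status TEXT);
    INSERT INTO raw_orders VALUES
        (1, 100, 25.0, 'complete'),
        (2, 100, 40.0, 'complete'),
        (3, 101, 10.0, 'cancelled');
""")

# The "Code" layer: business logic that turns raw data into something an
# analyst can actually use (here, completed revenue per customer).
model_sql = """
    SELECT customer_id, SUM(amount) AS total_revenue
    FROM raw_orders
    WHERE status = 'complete'
    GROUP BY customer_id
"""

for row in con.execute(model_sql):
    print(row)  # (100, 65.0)
```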

Traditionally the compute layer only really had two considerations:

  • SQL or Logic engine (normally a flavour of Spark (Glue) and then something like Redshift/Athena/Trino/BigQuery)
  • Orchestration Layer (Airflow, Dagster; a minimal sketch of this follows below)
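
As a rough idea of what that orchestration layer looks like in practice, here's a minimal Airflow sketch that just shells out to dbt. It assumes Airflow 2.x and an existing dbt project with a `prod` target; the dag_id, schedule and commands are illustrative, not a recommended setup.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal illustrative DAG: build the models, then test them.
with DAG(
    dag_id="daily_dbt_build",       # hypothetical name
    start_date=datetime(2023, 6, 1),
    schedule="@daily",              # Airflow 2.4+; older 2.x uses schedule_interval
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --target prod",   # assumes a 'prod' target in profiles.yml
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --target prod",
    )

    dbt_run >> dbt_test  # run the models before testing them
```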

But with the advent of SQL-engine-agnostic modelling we potentially now need to also consider:

  • Model Compilation

Now, on the surface it seems counterintuitive to separate the models from the logic layer, but let's consider the following scenario:

Redshift is costing too much and is getting slow; we want to try BigQuery. How much investment will it be to change over?

Now, if the entirety of your modelling is written in the warehouse's dialect and deployed to it directly, this would not only involve the investment of spinning up the BigQuery account and either connecting or migrating your data over; you would also need to consider how in the bloody hell you convert all your existing models and workflows over.

With something like dbt or SQLMesh you change your compilation target and it's done for you. It also means the Data Engineer doesn't necessarily need to understand the esoteric nature of the target, at least for simple models (which, let's be real, most are).
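
SQLMesh does this on top of sqlglot, which transpiles between SQL dialects (dbt gets you similar portability via adapters and Jinja macros rather than transpilation). Here's a rough illustration of the idea using sqlglot directly, with a made-up query; the exact output will depend on your sqlglot version.

```python
import sqlglot

# A Redshift-flavoured query: GETDATE() and DATEADD() are Redshift-isms.
redshift_sql = """
SELECT
    order_id,
    DATEADD(day, 7, created_at) AS follow_up_date
FROM orders
WHERE created_at > DATEADD(month, -1, GETDATE())
"""

# Re-target the same logic at BigQuery. Dialect-specific functions are
# rewritten into their BigQuery equivalents where sqlglot knows the mapping.
bigquery_sql = sqlglot.transpile(redshift_sql, read="redshift", write="bigquery")[0]
print(bigquery_sql)
```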

BUT, now we have a lot of software and infrastructure. A simplified common data stack will look something like the below (assuming ELT; ETL is a bit different but needs more or less the same components).