Compare commits

3 commits, master...unfinished:

+ fd6b2a45a3
+ 9f1dc86819
+ 4182185595

1 .gitignore vendored Normal file

@@ -0,0 +1 @@
.venv*

69 src/content/CI_CD_in_data.md Normal file

@@ -0,0 +1,69 @@

Title: CI/CD in Data Engineering
Date: 2023-06-15 20:00
Modified: 2023-06-15 20:00
Category: Data Engineering
Tags: data engineering, DBT, Terraform, IAC
Slug: CI/CD in Data and Data Infrastructure
Authors: Andrew Ridgway
Summary: When to use IaC CI/CD techniques or Software CI/CD techniques in Data Architecture

Data Engineering has traditionally been considered the bastard step-child of work that would once have been considered administrative in the tech world. Predominantly, we write SQL and then deploy that SQL onto one or more databases. In fact, a lot of the traditional methodologies around data almost assume this is the core of how an organisation manages the majority of its data. In the last couple of years, though, there has been a very steady move towards treating the SQL workload of Data Engineering with Software Engineering techniques. With the popularity of tools like [DBT](https://www.dbtlabs.com) and the latest newcomer on the block, [SQL-MESH](https://www.sqlmesh.com), the opportunity has arisen to align our Data Engineering workloads with different environments and move much more efficiently towards a Continuous Integration and Deployment methodology in our workflows.

For the Data Engineering space, the move to the cloud has been a breath of fresh air (not so in some other IT disciplines). I am relatively young, so I don't 100% remember, but my experience has taught me that not so long ago there were three options:

_Expensive:_

+ SAS
+ SSIS/SSRS
+ COGNOS/TM1

_Rickety:_

+ Just write stored procedures!
+ Startup script on my laptop XD
+ "Don't touch that machine over there. No one knows what it does, but if it's turned off our financial reports don't work" (this is a third-hand story I heard, seriously!)
+ "I need an upgrade to my laptop, Excel needs more than 8GB of RAM"

_Hard:_

+ Hadoop
+ Spark (Hadoop but whatever)
+ Python
+ R

_(The reason I've listed them as hard is because self-hosting Hadoop/Spark and managing a truckload of Python or R scripts, whilst it could have been "cheap", required a team of devs who really, really knew what they were doing... so not really cheap and also **really hard**.)_

Then there was getting git behind all the SQL scripts and modelling, let alone CI/CD. **IF** it existed, it was custom and bespoke, and likely had a single point of failure in the one person who knew how `git merge` worked. At least... that's what I was told, I'm not *that* old ;p.

These days we are pretty blessed: with the democratisation of clusters and data infrastructure in the cloud, we no longer need a team of sysadmins who know how to tune a cluster to the Nth degree to get the best out of our data workloads (well... we do, but we pay the cloud guys for that!). However, we still need to know about the idiosyncrasies of this infrastructure, when it is appropriate to use it, and how we want to control and maintain the workloads.

In general, when I am designing a system, I normally like to break it into three parts:

+ Storage
+ Compute
+ Code

*In general*, Storage and Compute will be infrastructure related; "Code" is sort of a catch-all for my modelling: normally SQL, Python/R, or Spark scripts used to provide system or business logic, really anything that is going to get data to the end user/analyst/data scientist/annoying person.

Traditionally, the compute layer only really had two considerations:

+ SQL or logic engine (normally a flavour of Spark (Glue) and then something like Redshift/Athena/Trino/BigQuery)
+ Orchestration layer (Airflow, Dagster)

But with the advent of SQL-engine-agnostic modelling, we now potentially also need to consider:

+ Model Compilation

Now, on the surface it seems counterintuitive to separate the models from the logic layer, but let's consider the following scenario:

> Redshift is costing too much and is getting slow, we want to try BigQuery.
> How much investment will it be to change over?

Now, if the entirety of your modelling is stored in and deployed directly to BigQuery, this would not only involve the investment of spinning up the BigQuery account and either connecting or migrating your data over; you would also need to consider how in the bloody hell you convert all your existing models and workflows.

With something like DBT or sqlmesh, you change your compilation target and it's done for you. It also means the Data Engineer no longer necessarily needs to understand the esoteric nature of the target, at least for simple models (which, let's be real, most are).

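To make that concrete, here is a minimal sketch of what the swap can look like in a DBT `profiles.yml`. The profile name, host, and credentials below are all hypothetical, and sqlmesh has an analogous concept (gateways) in its config:

```yaml
# Hypothetical profiles.yml for a project called "analytics".
# For simple models, re-pointing the whole project at a different
# warehouse is just a matter of changing `target:`.
analytics:
  target: redshift_prod   # swap to bigquery_trial to recompile everything for BigQuery
  outputs:
    redshift_prod:
      type: redshift
      host: example-cluster.redshift.amazonaws.com   # hypothetical cluster
      user: dbt_user
      password: "{{ env_var('REDSHIFT_PASSWORD') }}"
      port: 5439
      dbname: analytics
      schema: analytics
      threads: 4
    bigquery_trial:
      type: bigquery
      method: oauth
      project: example-gcp-project   # hypothetical GCP project
      dataset: analytics
      threads: 4
```

The models themselves don't change; DBT compiles them against whichever output is targeted, at least where they stick to portable SQL and macros.
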
BUT, now we have *a lot* of software and infrastructure. A simplified common data stack will look something like the below (assuming ELT; ETL is a bit different but needs more or less the same components):

<img src="{static}/images/DataStackSimplified.png" width="600" height="295" />
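This stack is also where the split in CI/CD techniques shows up: Storage and Compute suit IaC techniques (plan, review, apply), while Code suits software techniques (build and test against an isolated target, then deploy). As a rough sketch in generic GitHub-Actions-style YAML, with hypothetical job names, paths, targets, and secrets:

```yaml
# Hypothetical pipeline: IaC CI/CD for storage/compute, software CI/CD for the models.
name: data-platform-ci
on:
  pull_request:
  push:
    branches: [master]

jobs:
  infrastructure:   # Storage + Compute: IaC techniques
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: terraform init
        working-directory: infra
      - run: terraform plan -out=tfplan   # the plan is the review artefact on PRs
        working-directory: infra
      - run: terraform apply -auto-approve tfplan
        if: github.ref == 'refs/heads/master'   # only apply on merge
        working-directory: infra

  models:           # Code: software techniques
    runs-on: ubuntu-latest
    needs: infrastructure
    steps:
      - uses: actions/checkout@v3
      - run: pip install dbt-redshift
      - run: dbt build --target ci   # hypothetical throwaway target for tests
        env:
          REDSHIFT_PASSWORD: ${{ secrets.REDSHIFT_PASSWORD }}
      - run: dbt build --target redshift_prod   # deploy the models on merge
        if: github.ref == 'refs/heads/master'
        env:
          REDSHIFT_PASSWORD: ${{ secrets.REDSHIFT_PASSWORD }}
```

The specific runner matters less than the shape: infrastructure changes get a plan/apply gate, model changes get a build/test gate, and both ride the same merge to master.
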
BIN src/content/images/DataStackSimplified.png Normal file

Binary file not shown. After Width: | Height: | Size: 323 KiB

222 src/output/CI/CD in Data and Data Infrastructure.html Normal file

@@ -82,6 +82,8 @@
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<dl>
<dt>Thu 15 June 2023</dt>
<dd><a href="http://localhost:8000/CI/CD in Data and Data Infrastructure.html">CI/CD in Data Engineering</a></dd>
<dt>Tue 23 May 2023</dt>
<dd><a href="http://localhost:8000/appflow-production.html">Implmenting Appflow in a Production Datalake</a></dd>
<dt>Wed 10 May 2023</dt>

@@ -81,6 +81,19 @@
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<div class="post-preview">
<a href="http://localhost:8000/CI/CD in Data and Data Infrastructure.html" rel="bookmark" title="Permalink to CI/CD in Data Engineering">
<h2 class="post-title">
CI/CD in Data Engineering
</h2>
</a>
<p>When to use IaC CI/CD techniques or Software CI/CD techniques in Data Architecture</p>
<p class="post-meta">Posted by
<a href="http://localhost:8000/author/andrew-ridgway.html">Andrew Ridgway</a>
on Thu 15 June 2023
</p>
</div>
<hr>
<div class="post-preview">
<a href="http://localhost:8000/appflow-production.html" rel="bookmark" title="Permalink to Implmenting Appflow in a Production Datalake">
<h2 class="post-title">

@@ -84,7 +84,7 @@
<div class="post-preview">
<a href="http://localhost:8000/author/andrew-ridgway.html" rel="bookmark">
<h2 class="post-title">
Andrew Ridgway (2)
Andrew Ridgway (3)
</h2>
</a>
</div>

@@ -1,5 +1,54 @@
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/all-en.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-05-23T20:00:00+10:00</updated><entry><title>Implmenting Appflow in a Production Datalake</title><link href="http://localhost:8000/appflow-production.html" rel="alternate"></link><published>2023-05-23T20:00:00+10:00</published><updated>2023-05-17T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-05-23:/appflow-production.html</id><summary type="html"><p>How Appflow simplified a major extract layer and when I choose Managed Services</p></summary><content type="html"><p>I recently attended a meetup where there was a talk by an AWS spokesperson. Now don't get me wrong, I normally take these things with a grain of salt. At this talk there was this tiny tiny little segment about a product that AWS had released called <a href="https://aws.amazon.com/appflow/">Amazon Appflow</a>. This product <em>claimed</em> to be able to automate and make easy the link between different API endpoints, REST or otherwise and send that data to another point, whether that is Redshift, Aurora, a general relational db in RDS or otherwise or s3.</p>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/all-en.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-06-15T20:00:00+10:00</updated><entry><title>CI/CD in Data Engineering</title><link href="http://localhost:8000/CI/CD%20in%20Data%20and%20Data%20Infrastructure.html" rel="alternate"></link><published>2023-06-15T20:00:00+10:00</published><updated>2023-06-15T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-06-15:/CI/CD in Data and Data Infrastructure.html</id><summary type="html"><p>When to use IaC CI/CD techniques or Software CI/CD techniques in Data Architecture</p></summary><content type="html"><p>Data Engineering has traditionally been considered the bastard step child of work that would have once been considered Administrative in the Tech world. Predominately we write SQL and then deploy that SQL onto one or more Databases. In fact a lot of the traditional methodologies around data almost assume this is the core of how an organistation is managing the majority of it's data. In the last couple of years though there has been a very steady move towards having the Data Engineering workload of SQL move towards Software Engineering techniques. With the popularity of tools like <a href="https://www.dbtlabs.com">DBT</a> and the latest newcommer on the block, <a href="https://www.sqlmesh.com">SQL-MESH</a> The oppportunity has started to arise where we can align our Data Engineering workloads with different environments and move much more efficiently towards a Continous Integration and Deployment methodology in our workflows. </p>
@@ -1,2 +1,2 @@
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"><channel><title>Andrew Ridgway's Blog - Andrew Ridgway</title><link>http://localhost:8000/</link><description></description><lastBuildDate>Tue, 23 May 2023 20:00:00 +1000</lastBuildDate><item><title>Implmenting Appflow in a Production Datalake</title><link>http://localhost:8000/appflow-production.html</link><description><p>How Appflow simplified a major extract layer and when I choose Managed Services</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Ridgway</dc:creator><pubDate>Tue, 23 May 2023 20:00:00 +1000</pubDate><guid isPermaLink="false">tag:localhost,2023-05-23:/appflow-production.html</guid><category>Data Engineering</category><category>data engineering</category><category>Amazon</category><category>Managed Services</category></item><item><title>Dawn of another blog attempt</title><link>http://localhost:8000/how-i-built-the-damn-thing.html</link><description><p>Containers and How I take my learnings from home and apply them to work</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Ridgway</dc:creator><pubDate>Wed, 10 May 2023 20:00:00 +1000</pubDate><guid isPermaLink="false">tag:localhost,2023-05-10:/how-i-built-the-damn-thing.html</guid><category>Data Engineering</category><category>data engineering</category><category>containers</category></item></channel></rss>
<rss version="2.0"><channel><title>Andrew Ridgway's Blog - Andrew Ridgway</title><link>http://localhost:8000/</link><description></description><lastBuildDate>Thu, 15 Jun 2023 20:00:00 +1000</lastBuildDate><item><title>CI/CD in Data Engineering</title><link>http://localhost:8000/CI/CD%20in%20Data%20and%20Data%20Infrastructure.html</link><description><p>When to use IaC CI/CD techniques or Software CI/CD techniques in Data Architecture</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Ridgway</dc:creator><pubDate>Thu, 15 Jun 2023 20:00:00 +1000</pubDate><guid isPermaLink="false">tag:localhost,2023-06-15:/CI/CD in Data and Data Infrastructure.html</guid><category>Data Engineering</category><category>data engineering</category><category>DBT</category><category>Terraform</category><category>IAC</category></item><item><title>Implmenting Appflow in a Production Datalake</title><link>http://localhost:8000/appflow-production.html</link><description><p>How Appflow simplified a major extract layer and when I choose Managed Services</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Ridgway</dc:creator><pubDate>Tue, 23 May 2023 20:00:00 +1000</pubDate><guid isPermaLink="false">tag:localhost,2023-05-23:/appflow-production.html</guid><category>Data Engineering</category><category>data engineering</category><category>Amazon</category><category>Managed Services</category></item><item><title>Dawn of another blog attempt</title><link>http://localhost:8000/how-i-built-the-damn-thing.html</link><description><p>Containers and How I take my learnings from home and apply them to work</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Ridgway</dc:creator><pubDate>Wed, 10 May 2023 20:00:00 +1000</pubDate><guid isPermaLink="false">tag:localhost,2023-05-10:/how-i-built-the-damn-thing.html</guid><category>Data Engineering</category><category>data engineering</category><category>containers</category></item></channel></rss>
BIN src/output/images/DataStackSimplified.png Normal file

Binary file not shown. After Width: | Height: | Size: 323 KiB

0 src/output/tag/dbt.html Normal file
0 src/output/tag/iac.html Normal file
0 src/output/tag/terraform.html Normal file

@@ -83,8 +83,11 @@
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<h1>Tags for Andrew Ridgway's Blog</h1> <li><a href="http://localhost:8000/tag/amazon.html">Amazon</a> (1)</li>
<li><a href="http://localhost:8000/tag/containers.html">containers</a> (1)</li>
<li><a href="http://localhost:8000/tag/data-engineering.html">data engineering</a> (2)</li>
<li><a href="http://localhost:8000/tag/data-engineering.html">data engineering</a> (3)</li>
<li><a href="http://localhost:8000/tag/dbt.html">DBT</a> (1)</li>
<li><a href="http://localhost:8000/tag/iac.html">IAC</a> (1)</li>
<li><a href="http://localhost:8000/tag/managed-services.html">Managed Services</a> (1)</li>
<li><a href="http://localhost:8000/tag/terraform.html">Terraform</a> (1)</li>
</div>
</div>
</div>