Compare commits

...

1 Commit

Author SHA1 Message Date
b6cec40d8b switching branches 2024-07-25 13:46:43 +10:00
19 changed files with 239 additions and 14 deletions

View File

@@ -0,0 +1,10 @@
Title: Dynamically Generating a DBT sources.yml With Datahub
Date: 2023-12-15 20:00
Modified: 2023-12-15 20:00
Category: Data Engineering
Tags: data engineering, dbt, datahub
Slug: datahub-dbt-sources
Authors: Andrew Ridgway
Summary: Leveraging the power of Datahub schemas to dynamically generate dbt sources.yml
I find that in our space the terms data catalog, data governance and data definitions can be dirty words. I challenge any data professional to say that these are anything more than afterthoughts in a stack. Normally the technologies that govern these areas of a business's data architecture are the unsexy ones, and there are good reasons for this. It is not fun to get multiple people in a room and have them agree on any given metric. As my current boss is fond of saying, "You get 3 people in the room to define how we measure a sale I will give you 3 completely different and yet valid answers". This is the core of the problem with Data governance, Catalogs and Definitions... no one can agree, and the result is that engineers either ignore it... or put it on the back burner because it's going to be an awful experience.
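
The idea in the summary — turning Datahub schema metadata into a dbt sources.yml — can be sketched roughly as follows. This is a minimal illustration, not the post's actual implementation: the input dictionaries (and their `platform`, `schema`, `name`, `columns` keys) are assumptions standing in for whatever shape Datahub's metadata API returns.

```python
# Minimal sketch: group DataHub-style dataset metadata into the nested
# structure dbt expects in a sources.yml. The input shape is assumed.
def build_sources_yml(datasets):
    """Group datasets by (platform, schema) and emit a dbt sources dict."""
    sources = {}
    for ds in datasets:
        key = (ds["platform"], ds["schema"])
        # One dbt source per platform/schema pair, accumulating its tables.
        src = sources.setdefault(key, {
            "name": ds["platform"],
            "schema": ds["schema"],
            "tables": [],
        })
        src["tables"].append({
            "name": ds["name"],
            "columns": [{"name": c} for c in ds.get("columns", [])],
        })
    return {"version": 2, "sources": list(sources.values())}

datasets = [
    {"platform": "postgres", "schema": "raw", "name": "orders",
     "columns": ["order_id", "amount"]},
    {"platform": "postgres", "schema": "raw", "name": "customers",
     "columns": ["customer_id"]},
]
print(build_sources_yml(datasets))
```

In practice the resulting dict would be serialised with something like PyYAML's `yaml.safe_dump` and written out as `sources.yml` for dbt to pick up.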

View File

@@ -82,7 +82,9 @@
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<dl>
<dt>Wed 18 October 2023</dt>
<dt>Fri 15 December 2023</dt>
<dd><a href="http://localhost:8000/datahub-dbt-sources.html">Dynamically Generating a DBT sources.yml With Datahub</a></dd>
<dt>Wed 15 November 2023</dt>
<dd><a href="http://localhost:8000/metabase-duckdb.html">Metabase and DuckDB</a></dd>
<dt>Tue 23 May 2023</dt>
<dd><a href="http://localhost:8000/appflow-production.html">Implmenting Appflow in a Production Datalake</a></dd>

View File

@@ -81,6 +81,19 @@
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<div class="post-preview">
<a href="http://localhost:8000/datahub-dbt-sources.html" rel="bookmark" title="Permalink to Dynamically Generating a DBT sources.yml With Datahub">
<h2 class="post-title">
Dynamically Generating a DBT sources.yml With Datahub
</h2>
</a>
<p>Leveraging the power of Datahub schemas to dynamically generate dbt sources.yml</p>
<p class="post-meta">Posted by
<a href="http://localhost:8000/author/andrew-ridgway.html">Andrew Ridgway</a>
on Fri 15 December 2023
</p>
</div>
<hr>
<div class="post-preview">
<a href="http://localhost:8000/metabase-duckdb.html" rel="bookmark" title="Permalink to Metabase and DuckDB">
<h2 class="post-title">
@@ -90,7 +103,7 @@
<p>Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible</p>
<p class="post-meta">Posted by
<a href="http://localhost:8000/author/andrew-ridgway.html">Andrew Ridgway</a>
on Wed 18 October 2023
on Wed 15 November 2023
</p>
</div>
<hr>

View File

@@ -84,7 +84,7 @@
<div class="post-preview">
<a href="http://localhost:8000/author/andrew-ridgway.html" rel="bookmark">
<h2 class="post-title">
Andrew Ridgway (3)
Andrew Ridgway (4)
</h2>
</a>
</div>

View File

@@ -91,7 +91,7 @@
<p>Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible</p>
<p class="post-meta">Posted by
<a href="http://localhost:8000/author/andrew-ridgway.html">Andrew Ridgway</a>
on Wed 18 October 2023
on Wed 15 November 2023
</p>
</div>
<hr>

View File

@@ -82,6 +82,19 @@
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<div class="post-preview">
<a href="http://localhost:8000/datahub-dbt-sources.html" rel="bookmark" title="Permalink to Dynamically Generating a DBT sources.yml With Datahub">
<h2 class="post-title">
Dynamically Generating a DBT sources.yml With Datahub
</h2>
</a>
<p>Leveraging the power of Datahub schemas to dynamically generate dbt sources.yml</p>
<p class="post-meta">Posted by
<a href="http://localhost:8000/author/andrew-ridgway.html">Andrew Ridgway</a>
on Fri 15 December 2023
</p>
</div>
<hr>
<div class="post-preview">
<a href="http://localhost:8000/appflow-production.html" rel="bookmark" title="Permalink to Implmenting Appflow in a Production Datalake">
<h2 class="post-title">

View File

@@ -0,0 +1,172 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
<meta name="author" content="">
<title>Andrew Ridgway's Blog</title>
<link href="http://localhost:8000/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Andrew Ridgway's Blog Full Atom Feed" />
<link href="http://localhost:8000/feeds/data-engineering.atom.xml" type="application/atom+xml" rel="alternate" title="Andrew Ridgway's Blog Categories Atom Feed" />
<!-- Bootstrap Core CSS -->
<link href="http://localhost:8000/theme/css/bootstrap.min.css" rel="stylesheet">
<!-- Custom CSS -->
<link href="http://localhost:8000/theme/css/clean-blog.min.css" rel="stylesheet">
<!-- Code highlight color scheme -->
<link href="http://localhost:8000/theme/css/code_blocks/tomorrow.css" rel="stylesheet">
<!-- Custom Fonts -->
<link href="http://maxcdn.bootstrapcdn.com/font-awesome/4.1.0/css/font-awesome.min.css" rel="stylesheet" type="text/css">
<link href='http://fonts.googleapis.com/css?family=Lora:400,700,400italic,700italic' rel='stylesheet' type='text/css'>
<link href='http://fonts.googleapis.com/css?family=Open+Sans:300italic,400italic,600italic,700italic,800italic,400,300,600,700,800' rel='stylesheet' type='text/css'>
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
<meta name="tags" content="data engineering" />
<meta name="tags" content="dbt" />
<meta name="tags" content="datahub" />
<meta property="og:locale" content="en">
<meta property="og:site_name" content="Andrew Ridgway's Blog">
<meta property="og:type" content="article">
<meta property="article:author" content="">
<meta property="og:url" content="http://localhost:8000/datahub-dbt-sources.html">
<meta property="og:title" content="Dynamically Generating a DBT sources.yml With Datahub">
<meta property="og:description" content="">
<meta property="og:image" content="http://localhost:8000/">
<meta property="article:published_time" content="2023-12-15 20:00:00+10:00">
</head>
<body>
<!-- Navigation -->
<nav class="navbar navbar-default navbar-custom navbar-fixed-top">
<div class="container-fluid">
<!-- Brand and toggle get grouped for better mobile display -->
<div class="navbar-header page-scroll">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="http://localhost:8000/">Andrew Ridgway's Blog</a>
</div>
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav navbar-right">
</ul>
</div>
<!-- /.navbar-collapse -->
</div>
<!-- /.container -->
</nav>
<!-- Page Header -->
<header class="intro-header" style="background-image: url('http://localhost:8000/theme/images/post-bg.jpg')">
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<div class="post-heading">
<h1>Dynamically Generating a DBT sources.yml With Datahub</h1>
<span class="meta">Posted by
<a href="http://localhost:8000/author/andrew-ridgway.html">Andrew Ridgway</a>
on Fri 15 December 2023
</span>
</div>
</div>
</div>
</div>
</header>
<!-- Main Content -->
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<!-- Post Content -->
<article>
<p>I find that in our space the terms data catalog, data governance and data definitions can be dirty words. I challenge any data professional to say that these are anything more than afterthoughts in a stack. Normally the technologies that govern these areas of a business's data architecture are the unsexy ones, and there are good reasons for this. It is not fun to get multiple people in a room and have them agree on any given metric. As my current boss is fond of saying, "You get 3 people in the room to define how we measure a sale I will give you 3 completely different and yet valid answers". This is the core of the problem with Data governance, Catalogs and Definitions... no one can agree, and the result is that engineers either ignore it... or put it on the back burner because it's going to be an awful experience.</p>
</article>
<hr>
</div>
</div>
</div>
<hr>
<!-- Footer -->
<footer>
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<p>
<script type="text/javascript" src="https://sessionize.com/api/speaker/sessions/83c5d14a-bd19-46b4-8335-0ac8358ac46d/0x0x91929ax">
</script>
</p>
<ul class="list-inline text-center">
<li>
<a href="https://twitter.com/ar17787">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-twitter fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
<li>
<a href="https://facebook.com/ar17787">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-facebook fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
<li>
<a href="https://github.com/armistace">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-github fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
</ul>
<p class="copyright text-muted">Blog powered by <a href="http://getpelican.com">Pelican</a>,
which takes great advantage of <a href="http://python.org">Python</a>.</p>
</div>
</div>
</div>
</footer>
<!-- jQuery -->
<script src="http://localhost:8000/theme/js/jquery.js"></script>
<!-- Bootstrap Core JavaScript -->
<script src="http://localhost:8000/theme/js/bootstrap.min.js"></script>
<!-- Custom Theme JavaScript -->
<script src="http://localhost:8000/theme/js/clean-blog.min.js"></script>
</body>
</html>

View File

@@ -1,5 +1,5 @@
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/all-en.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-10-18T20:00:00+10:00</updated><entry><title>Metabase and DuckDB</title><link href="http://localhost:8000/metabase-duckdb.html" rel="alternate"></link><published>2023-10-18T20:00:00+10:00</published><updated>2023-10-18T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-10-18:/metabase-duckdb.html</id><summary type="html">&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</summary><content type="html">&lt;p&gt;Ahhhh &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt; if you're even partly floating around in the data space you've probably been hearing ALOT about it and it's &lt;em&gt;"Datawarehouse on your laptop"&lt;/em&gt; mantra. However, the OTHER application that sometimes gets missed is &lt;em&gt;"SQLite for OLAP workloads"&lt;/em&gt; and it was this concept that once I grasped it gave me a very interesting idea.... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to presentation layer of the lake, reducing network latency and... hopefully... have presentation reports running over very large workloads in the blink of an eye. It might even be fast enough that it could be deployed and embedded &lt;/p&gt;
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/all-en.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-12-15T20:00:00+10:00</updated><entry><title>Dynamically Generating a DBT sources.yml With Datahub</title><link href="http://localhost:8000/datahub-dbt-sources.html" rel="alternate"></link><published>2023-12-15T20:00:00+10:00</published><updated>2023-12-15T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-12-15:/datahub-dbt-sources.html</id><summary type="html">&lt;p&gt;Leveraging the power of Datahub schemas to dynamically generate dbt sources.yml&lt;/p&gt;</summary><content type="html">&lt;p&gt;I find that in our space the terms data catalog, data governance and data definitions can be dirty terms. I challenge any data professional to not say that these are at best after thoughts in a stack. Normally technologies that govern these areas of businesses data architecture are the unsexy ones, and there are good reasons for this. It is not fun to try and get multiple people in the room and get them to agree on any given metric. As my current boss is and has been fond of saying, "You get 3 people in the room to define how we measure a sale I will give you 3 completely different and yet valid answers". This is the core of the problem with Data governance, Catalogs and Definitions... no one can agree, and the result of this is engineers either ignore it... 
or put it on the back burner because its going to be an awful experience&lt;/p&gt;</content><category term="Data Engineering"></category><category term="data engineering"></category><category term="dbt"></category><category term="datahub"></category></entry><entry><title>Metabase and DuckDB</title><link href="http://localhost:8000/metabase-duckdb.html" rel="alternate"></link><published>2023-11-15T20:00:00+10:00</published><updated>2023-11-15T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-11-15:/metabase-duckdb.html</id><summary type="html">&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</summary><content type="html">&lt;p&gt;Ahhhh &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt; if you're even partly floating around in the data space you've probably been hearing ALOT about it and it's &lt;em&gt;"Datawarehouse on your laptop"&lt;/em&gt; mantra. However, the OTHER application that sometimes gets missed is &lt;em&gt;"SQLite for OLAP workloads"&lt;/em&gt; and it was this concept that once I grasped it gave me a very interesting idea.... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to presentation layer of the lake, reducing network latency and... hopefully... have presentation reports running over very large workloads in the blink of an eye. It might even be fast enough that it could be deployed and embedded &lt;/p&gt;
&lt;p&gt;However, for this to work we need some form of containerised reporting application.... lucky for us there is &lt;a href="https://www.metabase.com/"&gt;Metabase&lt;/a&gt; which is a fantastic little reporting application that has an open core. So this got me thinking... Can I put these two applications together and create a Reporting Layer with report embedding capabilities that is deployable in the cluster and has an admin UI accessible over a web page all whilst keeping the data locked to our network?&lt;/p&gt;
&lt;h3&gt;The Beginnings of an Idea&lt;/h3&gt;
&lt;p&gt;Ok so... Big first question. Can Duckdb and Metabase talk? Well... not quite. But first lets take a quick look at the architecture we'll be employing here &lt;/p&gt;

View File

@@ -1,5 +1,5 @@
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/all.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-10-18T20:00:00+10:00</updated><entry><title>Metabase and DuckDB</title><link href="http://localhost:8000/metabase-duckdb.html" rel="alternate"></link><published>2023-10-18T20:00:00+10:00</published><updated>2023-10-18T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-10-18:/metabase-duckdb.html</id><summary type="html">&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</summary><content type="html">&lt;p&gt;Ahhhh &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt; if you're even partly floating around in the data space you've probably been hearing ALOT about it and it's &lt;em&gt;"Datawarehouse on your laptop"&lt;/em&gt; mantra. However, the OTHER application that sometimes gets missed is &lt;em&gt;"SQLite for OLAP workloads"&lt;/em&gt; and it was this concept that once I grasped it gave me a very interesting idea.... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to presentation layer of the lake, reducing network latency and... hopefully... have presentation reports running over very large workloads in the blink of an eye. It might even be fast enough that it could be deployed and embedded &lt;/p&gt;
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/all.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-12-15T20:00:00+10:00</updated><entry><title>Dynamically Generating a DBT sources.yml With Datahub</title><link href="http://localhost:8000/datahub-dbt-sources.html" rel="alternate"></link><published>2023-12-15T20:00:00+10:00</published><updated>2023-12-15T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-12-15:/datahub-dbt-sources.html</id><summary type="html">&lt;p&gt;Leveraging the power of Datahub schemas to dynamically generate dbt sources.yml&lt;/p&gt;</summary><content type="html">&lt;p&gt;I find that in our space the terms data catalog, data governance and data definitions can be dirty terms. I challenge any data professional to not say that these are at best after thoughts in a stack. Normally technologies that govern these areas of businesses data architecture are the unsexy ones, and there are good reasons for this. It is not fun to try and get multiple people in the room and get them to agree on any given metric. As my current boss is and has been fond of saying, "You get 3 people in the room to define how we measure a sale I will give you 3 completely different and yet valid answers". This is the core of the problem with Data governance, Catalogs and Definitions... no one can agree, and the result of this is engineers either ignore it... 
or put it on the back burner because its going to be an awful experience&lt;/p&gt;</content><category term="Data Engineering"></category><category term="data engineering"></category><category term="dbt"></category><category term="datahub"></category></entry><entry><title>Metabase and DuckDB</title><link href="http://localhost:8000/metabase-duckdb.html" rel="alternate"></link><published>2023-11-15T20:00:00+10:00</published><updated>2023-11-15T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-11-15:/metabase-duckdb.html</id><summary type="html">&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</summary><content type="html">&lt;p&gt;Ahhhh &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt; if you're even partly floating around in the data space you've probably been hearing ALOT about it and it's &lt;em&gt;"Datawarehouse on your laptop"&lt;/em&gt; mantra. However, the OTHER application that sometimes gets missed is &lt;em&gt;"SQLite for OLAP workloads"&lt;/em&gt; and it was this concept that once I grasped it gave me a very interesting idea.... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to presentation layer of the lake, reducing network latency and... hopefully... have presentation reports running over very large workloads in the blink of an eye. It might even be fast enough that it could be deployed and embedded &lt;/p&gt;
&lt;p&gt;However, for this to work we need some form of containerised reporting application.... lucky for us there is &lt;a href="https://www.metabase.com/"&gt;Metabase&lt;/a&gt; which is a fantastic little reporting application that has an open core. So this got me thinking... Can I put these two applications together and create a Reporting Layer with report embedding capabilities that is deployable in the cluster and has an admin UI accessible over a web page all whilst keeping the data locked to our network?&lt;/p&gt;
&lt;h3&gt;The Beginnings of an Idea&lt;/h3&gt;
&lt;p&gt;Ok so... Big first question. Can Duckdb and Metabase talk? Well... not quite. But first lets take a quick look at the architecture we'll be employing here &lt;/p&gt;

View File

@@ -1,5 +1,5 @@
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog - Andrew Ridgway</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/andrew-ridgway.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-10-18T20:00:00+10:00</updated><entry><title>Metabase and DuckDB</title><link href="http://localhost:8000/metabase-duckdb.html" rel="alternate"></link><published>2023-10-18T20:00:00+10:00</published><updated>2023-10-18T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-10-18:/metabase-duckdb.html</id><summary type="html">&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</summary><content type="html">&lt;p&gt;Ahhhh &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt; if you're even partly floating around in the data space you've probably been hearing ALOT about it and it's &lt;em&gt;"Datawarehouse on your laptop"&lt;/em&gt; mantra. However, the OTHER application that sometimes gets missed is &lt;em&gt;"SQLite for OLAP workloads"&lt;/em&gt; and it was this concept that once I grasped it gave me a very interesting idea.... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to presentation layer of the lake, reducing network latency and... hopefully... have presentation reports running over very large workloads in the blink of an eye. It might even be fast enough that it could be deployed and embedded &lt;/p&gt;
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog - Andrew Ridgway</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/andrew-ridgway.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-12-15T20:00:00+10:00</updated><entry><title>Dynamically Generating a DBT sources.yml With Datahub</title><link href="http://localhost:8000/datahub-dbt-sources.html" rel="alternate"></link><published>2023-12-15T20:00:00+10:00</published><updated>2023-12-15T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-12-15:/datahub-dbt-sources.html</id><summary type="html">&lt;p&gt;Leveraging the power of Datahub schemas to dynamically generate dbt sources.yml&lt;/p&gt;</summary><content type="html">&lt;p&gt;I find that in our space the terms data catalog, data governance and data definitions can be dirty terms. I challenge any data professional to not say that these are at best after thoughts in a stack. Normally technologies that govern these areas of businesses data architecture are the unsexy ones, and there are good reasons for this. It is not fun to try and get multiple people in the room and get them to agree on any given metric. As my current boss is and has been fond of saying, "You get 3 people in the room to define how we measure a sale I will give you 3 completely different and yet valid answers". This is the core of the problem with Data governance, Catalogs and Definitions... no one can agree, and the result of this is engineers either ignore it... 
or put it on the back burner because its going to be an awful experience&lt;/p&gt;</content><category term="Data Engineering"></category><category term="data engineering"></category><category term="dbt"></category><category term="datahub"></category></entry><entry><title>Metabase and DuckDB</title><link href="http://localhost:8000/metabase-duckdb.html" rel="alternate"></link><published>2023-11-15T20:00:00+10:00</published><updated>2023-11-15T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-11-15:/metabase-duckdb.html</id><summary type="html">&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</summary><content type="html">&lt;p&gt;Ahhhh &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt; if you're even partly floating around in the data space you've probably been hearing ALOT about it and it's &lt;em&gt;"Datawarehouse on your laptop"&lt;/em&gt; mantra. However, the OTHER application that sometimes gets missed is &lt;em&gt;"SQLite for OLAP workloads"&lt;/em&gt; and it was this concept that once I grasped it gave me a very interesting idea.... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to presentation layer of the lake, reducing network latency and... hopefully... have presentation reports running over very large workloads in the blink of an eye. It might even be fast enough that it could be deployed and embedded &lt;/p&gt;
&lt;p&gt;However, for this to work we need some form of containerised reporting application.... lucky for us there is &lt;a href="https://www.metabase.com/"&gt;Metabase&lt;/a&gt; which is a fantastic little reporting application that has an open core. So this got me thinking... Can I put these two applications together and create a Reporting Layer with report embedding capabilities that is deployable in the cluster and has an admin UI accessible over a web page all whilst keeping the data locked to our network?&lt;/p&gt;
&lt;h3&gt;The Beginnings of an Idea&lt;/h3&gt;
&lt;p&gt;Ok so... Big first question. Can Duckdb and Metabase talk? Well... not quite. But first lets take a quick look at the architecture we'll be employing here &lt;/p&gt;

View File

@@ -1,2 +1,2 @@
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"><channel><title>Andrew Ridgway's Blog - Andrew Ridgway</title><link>http://localhost:8000/</link><description></description><lastBuildDate>Wed, 18 Oct 2023 20:00:00 +1000</lastBuildDate><item><title>Metabase and DuckDB</title><link>http://localhost:8000/metabase-duckdb.html</link><description>&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Ridgway</dc:creator><pubDate>Wed, 18 Oct 2023 20:00:00 +1000</pubDate><guid isPermaLink="false">tag:localhost,2023-10-18:/metabase-duckdb.html</guid><category>Business Intelligence</category><category>data engineering</category><category>Metabase</category><category>DuckDB</category><category>embedded</category></item><item><title>Implmenting Appflow in a Production Datalake</title><link>http://localhost:8000/appflow-production.html</link><description>&lt;p&gt;How Appflow simplified a major extract layer and when I choose Managed Services&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Ridgway</dc:creator><pubDate>Tue, 23 May 2023 20:00:00 +1000</pubDate><guid isPermaLink="false">tag:localhost,2023-05-23:/appflow-production.html</guid><category>Data Engineering</category><category>data engineering</category><category>Amazon</category><category>Managed Services</category></item><item><title>Dawn of another blog attempt</title><link>http://localhost:8000/how-i-built-the-damn-thing.html</link><description>&lt;p&gt;Containers and How I take my learnings from home and apply them to work&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Ridgway</dc:creator><pubDate>Wed, 10 May 2023 20:00:00 +1000</pubDate><guid isPermaLink="false">tag:localhost,2023-05-10:/how-i-built-the-damn-thing.html</guid><category>Data Engineering</category><category>data 
engineering</category><category>containers</category></item></channel></rss>
<rss version="2.0"><channel><title>Andrew Ridgway's Blog - Andrew Ridgway</title><link>http://localhost:8000/</link><description></description><lastBuildDate>Fri, 15 Dec 2023 20:00:00 +1000</lastBuildDate><item><title>Dynamically Generating a DBT sources.yml With Datahub</title><link>http://localhost:8000/datahub-dbt-sources.html</link><description>&lt;p&gt;Leveraging the power of Datahub schemas to dynamically generate dbt sources.yml&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Ridgway</dc:creator><pubDate>Fri, 15 Dec 2023 20:00:00 +1000</pubDate><guid isPermaLink="false">tag:localhost,2023-12-15:/datahub-dbt-sources.html</guid><category>Data Engineering</category><category>data engineering</category><category>dbt</category><category>datahub</category></item><item><title>Metabase and DuckDB</title><link>http://localhost:8000/metabase-duckdb.html</link><description>&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Ridgway</dc:creator><pubDate>Wed, 15 Nov 2023 20:00:00 +1000</pubDate><guid isPermaLink="false">tag:localhost,2023-11-15:/metabase-duckdb.html</guid><category>Business Intelligence</category><category>data engineering</category><category>Metabase</category><category>DuckDB</category><category>embedded</category></item><item><title>Implmenting Appflow in a Production Datalake</title><link>http://localhost:8000/appflow-production.html</link><description>&lt;p&gt;How Appflow simplified a major extract layer and when I choose Managed Services&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Ridgway</dc:creator><pubDate>Tue, 23 May 2023 20:00:00 +1000</pubDate><guid isPermaLink="false">tag:localhost,2023-05-23:/appflow-production.html</guid><category>Data Engineering</category><category>data 
engineering</category><category>Amazon</category><category>Managed Services</category></item><item><title>Dawn of another blog attempt</title><link>http://localhost:8000/how-i-built-the-damn-thing.html</link><description>&lt;p&gt;Containers and How I take my learnings from home and apply them to work&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Ridgway</dc:creator><pubDate>Wed, 10 May 2023 20:00:00 +1000</pubDate><guid isPermaLink="false">tag:localhost,2023-05-10:/how-i-built-the-damn-thing.html</guid><category>Data Engineering</category><category>data engineering</category><category>containers</category></item></channel></rss>

View File

@@ -1,5 +1,5 @@
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog - Business Intelligence</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/business-intelligence.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-10-18T20:00:00+10:00</updated><entry><title>Metabase and DuckDB</title><link href="http://localhost:8000/metabase-duckdb.html" rel="alternate"></link><published>2023-10-18T20:00:00+10:00</published><updated>2023-10-18T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-10-18:/metabase-duckdb.html</id><summary type="html">&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</summary><content type="html">&lt;p&gt;Ahhhh &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt;, if you're even partly floating around in the data space you've probably been hearing a lot about it and its &lt;em&gt;"Datawarehouse on your laptop"&lt;/em&gt; mantra. However, the OTHER application that sometimes gets missed is &lt;em&gt;"SQLite for OLAP workloads"&lt;/em&gt;, and it was this concept that, once I grasped it, gave me a very interesting idea.... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to the presentation layer of the lake, reducing network latency and... hopefully... have presentation reports running over very large workloads in the blink of an eye. It might even be fast enough that it could be deployed and embedded &lt;/p&gt;
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog - Business Intelligence</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/business-intelligence.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-11-15T20:00:00+10:00</updated><entry><title>Metabase and DuckDB</title><link href="http://localhost:8000/metabase-duckdb.html" rel="alternate"></link><published>2023-11-15T20:00:00+10:00</published><updated>2023-11-15T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-11-15:/metabase-duckdb.html</id><summary type="html">&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</summary><content type="html">&lt;p&gt;Ahhhh &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt;, if you're even partly floating around in the data space you've probably been hearing a lot about it and its &lt;em&gt;"Datawarehouse on your laptop"&lt;/em&gt; mantra. However, the OTHER application that sometimes gets missed is &lt;em&gt;"SQLite for OLAP workloads"&lt;/em&gt;, and it was this concept that, once I grasped it, gave me a very interesting idea.... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to the presentation layer of the lake, reducing network latency and... hopefully... have presentation reports running over very large workloads in the blink of an eye. It might even be fast enough that it could be deployed and embedded &lt;/p&gt;
&lt;p&gt;However, for this to work we need some form of containerised reporting application.... lucky for us there is &lt;a href="https://www.metabase.com/"&gt;Metabase&lt;/a&gt; which is a fantastic little reporting application that has an open core. So this got me thinking... Can I put these two applications together and create a Reporting Layer with report embedding capabilities that is deployable in the cluster and has an admin UI accessible over a web page, all whilst keeping the data locked to our network?&lt;/p&gt;
&lt;h3&gt;The Beginnings of an Idea&lt;/h3&gt;
&lt;p&gt;Ok so... Big first question. Can DuckDB and Metabase talk? Well... not quite. But first let's take a quick look at the architecture we'll be employing here &lt;/p&gt;

View File

@ -1,5 +1,5 @@
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog - Data Engineering</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/data-engineering.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-05-23T20:00:00+10:00</updated><entry><title>Implementing Appflow in a Production Datalake</title><link href="http://localhost:8000/appflow-production.html" rel="alternate"></link><published>2023-05-23T20:00:00+10:00</published><updated>2023-05-17T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-05-23:/appflow-production.html</id><summary type="html">&lt;p&gt;How Appflow simplified a major extract layer and when I choose Managed Services&lt;/p&gt;</summary><content type="html">&lt;p&gt;I recently attended a meetup where there was a talk by an AWS spokesperson. Now don't get me wrong, I normally take these things with a grain of salt. At this talk there was this tiny tiny little segment about a product that AWS had released called &lt;a href="https://aws.amazon.com/appflow/"&gt;Amazon Appflow&lt;/a&gt;. This product &lt;em&gt;claimed&lt;/em&gt; to be able to automate and make easy the link between different API endpoints, REST or otherwise and send that data to another point, whether that is Redshift, Aurora, a general relational db in RDS or otherwise or s3.&lt;/p&gt;
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog - Data Engineering</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/data-engineering.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-12-15T20:00:00+10:00</updated><entry><title>Dynamically Generating a DBT sources.yml With Datahub</title><link href="http://localhost:8000/datahub-dbt-sources.html" rel="alternate"></link><published>2023-12-15T20:00:00+10:00</published><updated>2023-12-15T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-12-15:/datahub-dbt-sources.html</id><summary type="html">&lt;p&gt;Leveraging the power of Datahub schemas to dynamically generate dbt sources.yml&lt;/p&gt;</summary><content type="html">&lt;p&gt;I find that in our space the terms data catalog, data governance and data definitions can be dirty terms. I challenge any data professional to not say that these are at best afterthoughts in a stack. Normally the technologies that govern these areas of a business's data architecture are the unsexy ones, and there are good reasons for this. It is not fun to try and get multiple people in the room and get them to agree on any given metric. As my current boss is and has been fond of saying, "You get 3 people in the room to define how we measure a sale I will give you 3 completely different and yet valid answers". This is the core of the problem with Data governance, Catalogs and Definitions... no one can agree, and the result of this is engineers either ignore it... or put it on the back burner because it's going to be an awful experience&lt;/p&gt;</content><category term="Data Engineering"></category><category term="data engineering"></category><category term="dbt"></category><category term="datahub"></category></entry>
<entry><title>Implementing Appflow in a Production Datalake</title><link href="http://localhost:8000/appflow-production.html" rel="alternate"></link><published>2023-05-23T20:00:00+10:00</published><updated>2023-05-17T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-05-23:/appflow-production.html</id><summary type="html">&lt;p&gt;How Appflow simplified a major extract layer and when I choose Managed Services&lt;/p&gt;</summary><content type="html">&lt;p&gt;I recently attended a meetup where there was a talk by an AWS spokesperson. Now don't get me wrong, I normally take these things with a grain of salt. At this talk there was this tiny tiny little segment about a product that AWS had released called &lt;a href="https://aws.amazon.com/appflow/"&gt;Amazon Appflow&lt;/a&gt;. This product &lt;em&gt;claimed&lt;/em&gt; to be able to automate and make easy the link between different API endpoints, REST or otherwise and send that data to another point, whether that is Redshift, Aurora, a general relational db in RDS or otherwise or s3.&lt;/p&gt;
&lt;p&gt;This was particularly interesting to me because I had recently finished creating an s3 datalake in AWS for the company I work for. Today, I finally put my first Appflow integration to the Datalake into production and I have to say there are some rough edges to the deployment, but it has been more or less as described on the box. &lt;/p&gt;
&lt;p&gt;Over the course of the next few paragraphs I'd like to explain the thinking I had as I investigated the product, and then ultimately why I chose a managed service for this over implementing something myself in Python using Dagster, which I have also spun up within our cluster on AWS.&lt;/p&gt;
&lt;h3&gt;Datalake Extraction Layer&lt;/h3&gt;

View File

@ -84,6 +84,19 @@
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<div class="post-preview">
<a href="http://localhost:8000/datahub-dbt-sources.html" rel="bookmark" title="Permalink to Dynamically Generating a DBT sources.yml With Datahub">
<h2 class="post-title">
Dynamically Generating a DBT sources.yml With Datahub
</h2>
</a>
<p>Leveraging the power of Datahub schemas to dynamically generate dbt sources.yml</p>
<p class="post-meta">Posted by
<a href="http://localhost:8000/author/andrew-ridgway.html">Andrew Ridgway</a>
on Fri 15 December 2023
</p>
</div>
<hr>
<div class="post-preview">
<a href="http://localhost:8000/metabase-duckdb.html" rel="bookmark" title="Permalink to Metabase and DuckDB">
<h2 class="post-title">
@ -93,7 +106,7 @@
<p>Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible</p>
<p class="post-meta">Posted by
<a href="http://localhost:8000/author/andrew-ridgway.html">Andrew Ridgway</a>
on Wed 18 October 2023
on Wed 15 November 2023
</p>
</div>
<hr>

View File

@ -52,7 +52,7 @@
<meta property="og:title" content="Metabase and DuckDB">
<meta property="og:description" content="">
<meta property="og:image" content="http://localhost:8000/">
<meta property="article:published_time" content="2023-10-18 20:00:00+10:00">
<meta property="article:published_time" content="2023-11-15 20:00:00+10:00">
</head>
<body>
@ -91,7 +91,7 @@
<h1>Metabase and DuckDB</h1>
<span class="meta">Posted by
<a href="http://localhost:8000/author/andrew-ridgway.html">Andrew Ridgway</a>
on Wed 18 October 2023
on Wed 15 November 2023
</span>
</div>

View File

0
src/output/tag/dbt.html Normal file
View File

View File

@ -83,7 +83,9 @@
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<h1>Tags for Andrew Ridgway's Blog</h1> <li><a href="http://localhost:8000/tag/amazon.html">Amazon</a> (1)</li>
<li><a href="http://localhost:8000/tag/containers.html">containers</a> (1)</li>
<li><a href="http://localhost:8000/tag/data-engineering.html">data engineering</a> (3)</li>
<li><a href="http://localhost:8000/tag/data-engineering.html">data engineering</a> (4)</li>
<li><a href="http://localhost:8000/tag/datahub.html">datahub</a> (1)</li>
<li><a href="http://localhost:8000/tag/dbt.html">dbt</a> (1)</li>
<li><a href="http://localhost:8000/tag/duckdb.html">DuckDB</a> (1)</li>
<li><a href="http://localhost:8000/tag/embedded.html">embedded</a> (1)</li>
<li><a href="http://localhost:8000/tag/managed-services.html">Managed Services</a> (1)</li>