Merge pull request #1 from armistace/metabase_duckdb

Metabase duckdb
This commit is contained in:
Andrew Ridgway 2023-11-16 15:07:01 +10:00 committed by GitHub
commit 2f96fc0b90
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
26 changed files with 1187 additions and 7 deletions

View File

@@ -1 +1,2 @@
pelican[markdown]
markdown-markup-emoji

Binary file not shown.


View File

@@ -0,0 +1,105 @@
Title: Metabase and DuckDB
Date: 2023-10-18 20:00
Modified: 2023-10-18 20:00
Category: Business Intelligence
Tags: data engineering, Metabase, DuckDB, embedded
Slug: metabase-duckdb
Authors: Andrew Ridgway
Summary: Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible
Ahhhh [DuckDB](https://duckdb.org/)... if you're even partly floating around in the data space you've probably been hearing A LOT about it and its _"Datawarehouse on your laptop"_ mantra. However, the OTHER application that sometimes gets missed is _"SQLite for OLAP workloads"_, and it was this concept that, once I grasped it, gave me a very interesting idea.... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to the presentation layer of the lake, reducing network latency and... hopefully... having presentation reports run over very large workloads in the blink of an eye? It might even be fast enough that it could be deployed and embedded.
However, for this to work we need some form of containerised reporting application.... lucky for us there is [Metabase](https://www.metabase.com/), which is a fantastic little reporting application with an open core. So this got me thinking... can I put these two applications together and create a Reporting Layer with report-embedding capabilities that is deployable in the cluster and has an admin UI accessible over a web page, all whilst keeping the data locked to our network?
### The Beginnings of an Idea
Ok so... big first question: can DuckDB and Metabase talk? Well... not quite. But first let's take a quick look at the architecture we'll be employing here:
<img alt="Duckdb Architecture" height="auto" width="100%" src="{attach}/images/metabase_duckdb.png">
But you'll notice a pretty glossed-over line there, "Connector", and that right there is the clincher. So what is this "Connector"?
A deep dive into this would take a whole blog, so to give you something to quickly wrap your head around: it's the glue that lets Metabase query your data source. In reality it's a JDBC driver compiled against Metabase.
Thankfully Metabase points you to a [community driver](https://github.com/AlexR2D2/metabase_duckdb_driver) for linking to DuckDB (hopefully it will be brought into Metabase proper sooner rather than later).
Now, the current release of this driver is still compiled against DuckDB 0.8 while 0.9 is the latest stable, but hopefully the [PR](https://github.com/AlexR2D2/metabase_duckdb_driver/pull/19) for this will land very soon, giving a quick way to link to the latest and greatest in DuckDB from Metabase.
### But How do we get Data?
Brilliant. Using the recommended Dockerfile we can load up a Metabase container with the DuckDB driver pre-built:
```dockerfile
FROM openjdk:19-buster
ENV MB_PLUGINS_DIR=/home/plugins/
ADD https://downloads.metabase.com/v0.46.2/metabase.jar /home
ADD https://github.com/AlexR2D2/metabase_duckdb_driver/releases/download/0.1.6/duckdb.metabase-driver.jar /home/plugins/
RUN chmod 744 /home/plugins/duckdb.metabase-driver.jar
CMD ["java", "-jar", "/home/metabase.jar"]
```
Great. Now the big question: how do we get the data into the damn thing? Interestingly, when I was initially designing this I had the thought of leveraging the in-memory capabilities of DuckDB and pulling in from the parquet on s3 directly as needed; after all, the cluster is on AWS, so the s3 API requests should be unbelievably fast anyway, so why bother with a persistent database?
Now that we have the default credentials chain, it is trivial to query parquet straight from s3:
```sql
SELECT * FROM read_parquet('s3://<bucket>/<file>');
```
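In Python terms, the in-memory idea looks something like this. This is only a minimal sketch, assuming the `duckdb` Python client with the httpfs extension available; the bucket and file names are placeholders, not the real ones:

```python
import duckdb

# No path given, so this is an in-memory database: nothing persists to disk
conn = duckdb.connect()

# httpfs provides the s3:// filesystem; load_aws_credentials (the same call
# the job below uses) picks up the default AWS credentials chain
conn.sql("INSTALL httpfs")
conn.sql("LOAD httpfs")
conn.sql("CALL load_aws_credentials()")

# Placeholder bucket/file names for illustration
rows = conn.sql("SELECT * FROM read_parquet('s3://my-bucket/my-file.parquet')").fetchall()
```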
However, if you're reading directly off parquet, all of a sudden you need to consider the partitioning, and I also found out that, if the parquet is being actively written to at the time of querying, DuckDB has a hissy fit about metadata not matching the query. Needless to say, DuckDB and streaming parquet are not happy bedfellows (*and frankly were not designed to be, so this is ok*). And the idea of trying to explain all this to the run-of-the-mill reporting analyst, who it is my hope is a business sort of person, not tech, honestly gave me hives... so I had to make it easier.
The compromise occurred to me... the curated layer is only built daily for reporting, and, using that, I could create a DuckDB file on disk that could be loaded into the Metabase container itself.
With some very simple Python as an operation in our orchestrator, I had a job that would read directly from our curated parquet and create a DuckDB file from it... without giving away too much, the job primarily consisted of this:
```python
import logging
import duckdb

log = logging.getLogger(__name__)

# aws_profile and curated_bucket are assumed to come from the
# orchestrator's config; they aren't defined in this snippet
def duckdb_builder(table):
    conn = duckdb.connect("curated_duckdb.duckdb")
    conn.sql(f"CALL load_aws_credentials('{aws_profile}')")
    # This removes a lot of weirdass ANSI in logs you DO NOT WANT
    conn.execute("PRAGMA enable_progress_bar=false")
    log.info(f"Create {table} in duckdb")
    sql = f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM read_parquet('s3://{curated_bucket}/{table}/*')"
    conn.sql(sql)
    log.info(f"{table} Created")
```
And then an upload to an s3 bucket, along the lines of the sketch below.
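The upload step itself isn't shown in the job above, but with boto3 it's essentially a one-liner; a minimal sketch, with a hypothetical bucket name standing in for the real one:

```python
import boto3

# "reporting-bucket" is a placeholder, not the real bucket
s3 = boto3.client("s3")
s3.upload_file("curated_duckdb.duckdb", "reporting-bucket", "curated_duckdb.duckdb")
```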
This of course necessitated a cron job baked into the Metabase container itself to actually pull the DuckDB file in every morning. After some careful analysis of timing (because I'm too lazy to implement message queues) I set up an s3 cp job that could be cronned directly from the container itself (a sketch of that helper script follows). This gives us a self-updating Metabase container with a DuckDB backend for client-facing reporting right in the interface. AND because the DuckDB file is baked right into the container... there are NO associated s3 or DPU costs (merely the cost of running a relatively large container).
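The details of `helper_scripts/download_duckdb.py` aren't shown here, but a minimal version of that cronned "s3 cp" could look like this, assuming boto3 and placeholder bucket/key names:

```python
import boto3

# Placeholder names; the real bucket/key live in the actual helper script
BUCKET = "reporting-bucket"
KEY = "curated_duckdb.duckdb"

def main():
    s3 = boto3.client("s3")
    # Overwrite the container's local copy with the freshly built file
    s3.download_file(BUCKET, KEY, "/duckdb_data/curated_duckdb.duckdb")

if __name__ == "__main__":
    main()
```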
The final Dockerfile looks like this:
```dockerfile
FROM openjdk:19-buster
ENV MB_PLUGINS_DIR=/home/plugins/
ADD https://downloads.metabase.com/v0.47.6/metabase.jar /home
ADD duckdb.metabase-driver.jar /home/plugins/
RUN chmod 744 /home/plugins/duckdb.metabase-driver.jar
RUN mkdir -p /duckdb_data
COPY entrypoint.sh /home
# Put the helper where the cron entry below expects it
COPY helper_scripts/download_duckdb.py /home/helper_scripts/
RUN apt-get update -y && apt-get upgrade -y
RUN apt-get install python3 python3-pip cron -y
RUN pip3 install boto3
RUN crontab -l | { cat; echo "0 */6 * * * python3 /home/helper_scripts/download_duckdb.py"; } | crontab -
CMD ["bash", "/home/entrypoint.sh"]
```
And there we have it... an in-memory, containerised reporting solution with blazing-fast capability to aggregate and build reports based on curated data direct from the business... fully automated, deployable via CI/CD, and providing daily data updates.
Now the embedded part... which isn't built yet, but I'll make sure to update you once we have/if we do, because the architecture is very exciting for an embedded reporting workflow that is deployable via CI/CD processes to applications. As a little taster I'll point you to the [Metabase documentation](https://www.metabase.com/learn/administration/git-based-workflow); the unfortunate thing is that Metabase *have* hidden this behind the enterprise license... but I can absolutely see why. If we get to implementing this I'll be sure to update you here on the learnings.
Until then....

View File

@@ -16,6 +16,5 @@ TWITTER_URL = 'https://twitter.com/ar17787'
FACEBOOK_URL = 'https://facebook.com/ar17787'
DEFAULT_PAGINATION = 10
# Uncomment following line if you want document-relative URLs when developing
#RELATIVE_URLS = True

View File

@@ -82,6 +82,8 @@
<div class="row"> <div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1"> <div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<dl> <dl>
<dt>Wed 18 October 2023</dt>
<dd><a href="http://localhost:8000/metabase-duckdb.html">Metabase and DuckDB</a></dd>
<dt>Tue 23 May 2023</dt>
<dd><a href="http://localhost:8000/appflow-production.html">Implmenting Appflow in a Production Datalake</a></dd>
<dt>Wed 10 May 2023</dt>

View File

@@ -81,6 +81,19 @@
<div class="container"> <div class="container">
<div class="row"> <div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1"> <div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<div class="post-preview">
<a href="http://localhost:8000/metabase-duckdb.html" rel="bookmark" title="Permalink to Metabase and DuckDB">
<h2 class="post-title">
Metabase and DuckDB
</h2>
</a>
<p>Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible</p>
<p class="post-meta">Posted by
<a href="http://localhost:8000/author/andrew-ridgway.html">Andrew Ridgway</a>
on Wed 18 October 2023
</p>
</div>
<hr>
<div class="post-preview"> <div class="post-preview">
<a href="http://localhost:8000/appflow-production.html" rel="bookmark" title="Permalink to Implmenting Appflow in a Production Datalake"> <a href="http://localhost:8000/appflow-production.html" rel="bookmark" title="Permalink to Implmenting Appflow in a Production Datalake">
<h2 class="post-title"> <h2 class="post-title">

View File

@@ -84,7 +84,7 @@
<div class="post-preview"> <div class="post-preview">
<a href="http://localhost:8000/author/andrew-ridgway.html" rel="bookmark"> <a href="http://localhost:8000/author/andrew-ridgway.html" rel="bookmark">
<h2 class="post-title"> <h2 class="post-title">
Andrew Ridgway (2) Andrew Ridgway (3)
</h2> </h2>
</a> </a>
</div> </div>

View File

@@ -82,6 +82,7 @@
<div class="row"> <div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1"> <div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<ul> <ul>
<li><a href="http://localhost:8000/category/business-intelligence.html">Business Intelligence</a></li>
<li><a href="http://localhost:8000/category/data-engineering.html">Data Engineering</a></li>
</ul>
</div>

View File

@@ -0,0 +1,165 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
<meta name="author" content="">
<title>Andrew Ridgway's Blog - Articles in the Business Intelligence category</title>
<link href="http://localhost:8000/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Andrew Ridgway's Blog Full Atom Feed" />
<link href="http://localhost:8000/feeds/business-intelligence.atom.xml" type="application/atom+xml" rel="alternate" title="Andrew Ridgway's Blog Categories Atom Feed" />
<!-- Bootstrap Core CSS -->
<link href="http://localhost:8000/theme/css/bootstrap.min.css" rel="stylesheet">
<!-- Custom CSS -->
<link href="http://localhost:8000/theme/css/clean-blog.min.css" rel="stylesheet">
<!-- Code highlight color scheme -->
<link href="http://localhost:8000/theme/css/code_blocks/tomorrow.css" rel="stylesheet">
<!-- Custom Fonts -->
<link href="http://maxcdn.bootstrapcdn.com/font-awesome/4.1.0/css/font-awesome.min.css" rel="stylesheet" type="text/css">
<link href='http://fonts.googleapis.com/css?family=Lora:400,700,400italic,700italic' rel='stylesheet' type='text/css'>
<link href='http://fonts.googleapis.com/css?family=Open+Sans:300italic,400italic,600italic,700italic,800italic,400,300,600,700,800' rel='stylesheet' type='text/css'>
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
<meta property="og:locale" content="en">
<meta property="og:site_name" content="Andrew Ridgway's Blog">
</head>
<body>
<!-- Navigation -->
<nav class="navbar navbar-default navbar-custom navbar-fixed-top">
<div class="container-fluid">
<!-- Brand and toggle get grouped for better mobile display -->
<div class="navbar-header page-scroll">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="http://localhost:8000/">Andrew Ridgway's Blog</a>
</div>
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav navbar-right">
</ul>
</div>
<!-- /.navbar-collapse -->
</div>
<!-- /.container -->
</nav>
<!-- Page Header -->
<header class="intro-header" style="background-image: url('https://wallpaperaccess.com/full/3239444.jpg')">
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<div class="post-heading">
<h1>Articles in the Business Intelligence category</h1>
</div>
</div>
</div>
</div>
</header>
<!-- Main Content -->
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<div class="post-preview">
<a href="http://localhost:8000/metabase-duckdb.html" rel="bookmark" title="Permalink to Metabase and DuckDB">
<h2 class="post-title">
Metabase and DuckDB
</h2>
</a>
<p>Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible</p>
<p class="post-meta">Posted by
<a href="http://localhost:8000/author/andrew-ridgway.html">Andrew Ridgway</a>
on Wed 18 October 2023
</p>
</div>
<hr>
<!-- Pager -->
<ul class="pager">
<li class="next">
</li>
</ul>
Page 1 / 1
<hr>
</div>
</div>
</div>
<hr>
<!-- Footer -->
<footer>
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<p>
<script type="text/javascript" src="https://sessionize.com/api/speaker/sessions/83c5d14a-bd19-46b4-8335-0ac8358ac46d/0x0x91929ax">
</script>
</p>
<ul class="list-inline text-center">
<li>
<a href="https://twitter.com/ar17787">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-twitter fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
<li>
<a href="https://facebook.com/ar17787">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-facebook fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
<li>
<a href="https://github.com/armistace">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-github fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
</ul>
<p class="copyright text-muted">Blog powered by <a href="http://getpelican.com">Pelican</a>,
which takes great advantage of <a href="http://python.org">Python</a>.</p>
</div>
</div>
</div>
</footer>
<!-- jQuery -->
<script src="http://localhost:8000/theme/js/jquery.js"></script>
<!-- Bootstrap Core JavaScript -->
<script src="http://localhost:8000/theme/js/bootstrap.min.js"></script>
<!-- Custom Theme JavaScript -->
<script src="http://localhost:8000/theme/js/clean-blog.min.js"></script>
</body>
</html>

View File

@@ -0,0 +1,165 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
<meta name="author" content="">
<title>Andrew Ridgway's Blog - Articles in the Data Analytics category</title>
<link href="http://localhost:8000/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Andrew Ridgway's Blog Full Atom Feed" />
<link href="http://localhost:8000/feeds/data-analytics.atom.xml" type="application/atom+xml" rel="alternate" title="Andrew Ridgway's Blog Categories Atom Feed" />
<!-- Bootstrap Core CSS -->
<link href="http://localhost:8000/theme/css/bootstrap.min.css" rel="stylesheet">
<!-- Custom CSS -->
<link href="http://localhost:8000/theme/css/clean-blog.min.css" rel="stylesheet">
<!-- Code highlight color scheme -->
<link href="http://localhost:8000/theme/css/code_blocks/tomorrow.css" rel="stylesheet">
<!-- Custom Fonts -->
<link href="http://maxcdn.bootstrapcdn.com/font-awesome/4.1.0/css/font-awesome.min.css" rel="stylesheet" type="text/css">
<link href='http://fonts.googleapis.com/css?family=Lora:400,700,400italic,700italic' rel='stylesheet' type='text/css'>
<link href='http://fonts.googleapis.com/css?family=Open+Sans:300italic,400italic,600italic,700italic,800italic,400,300,600,700,800' rel='stylesheet' type='text/css'>
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
<meta property="og:locale" content="en">
<meta property="og:site_name" content="Andrew Ridgway's Blog">
</head>
<body>
<!-- Navigation -->
<nav class="navbar navbar-default navbar-custom navbar-fixed-top">
<div class="container-fluid">
<!-- Brand and toggle get grouped for better mobile display -->
<div class="navbar-header page-scroll">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="http://localhost:8000/">Andrew Ridgway's Blog</a>
</div>
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav navbar-right">
</ul>
</div>
<!-- /.navbar-collapse -->
</div>
<!-- /.container -->
</nav>
<!-- Page Header -->
<header class="intro-header" style="background-image: url('https://wallpaperaccess.com/full/3239444.jpg')">
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<div class="post-heading">
<h1>Articles in the Data Analytics category</h1>
</div>
</div>
</div>
</div>
</header>
<!-- Main Content -->
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<div class="post-preview">
<a href="http://localhost:8000/notebook-or-bi.html" rel="bookmark" title="Permalink to Notebook or BI, What is the most appropiate communication medium">
<h2 class="post-title">
Notebook or BI, What is the most appropiate communication medium
</h2>
</a>
<p>When is a notebook enough or when do we need a dashboard</p>
<p class="post-meta">Posted by
<a href="http://localhost:8000/author/andrew-ridgway.html">Andrew Ridgway</a>
on Thu 13 July 2023
</p>
</div>
<hr>
<!-- Pager -->
<ul class="pager">
<li class="next">
</li>
</ul>
Page 1 / 1
<hr>
</div>
</div>
</div>
<hr>
<!-- Footer -->
<footer>
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<p>
<script type="text/javascript" src="https://sessionize.com/api/speaker/sessions/83c5d14a-bd19-46b4-8335-0ac8358ac46d/0x0x91929ax">
</script>
</p>
<ul class="list-inline text-center">
<li>
<a href="https://twitter.com/ar17787">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-twitter fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
<li>
<a href="https://facebook.com/ar17787">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-facebook fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
<li>
<a href="https://github.com/armistace">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-github fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
</ul>
<p class="copyright text-muted">Blog powered by <a href="http://getpelican.com">Pelican</a>,
which takes great advantage of <a href="http://python.org">Python</a>.</p>
</div>
</div>
</div>
</footer>
<!-- jQuery -->
<script src="http://localhost:8000/theme/js/jquery.js"></script>
<!-- Bootstrap Core JavaScript -->
<script src="http://localhost:8000/theme/js/bootstrap.min.js"></script>
<!-- Custom Theme JavaScript -->
<script src="http://localhost:8000/theme/js/clean-blog.min.js"></script>
</body>
</html>

View File

@@ -1,5 +1,78 @@
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/all-en.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-10-18T20:00:00+10:00</updated><entry><title>Metabase and DuckDB</title><link href="http://localhost:8000/metabase-duckdb.html" rel="alternate"></link><published>2023-10-18T20:00:00+10:00</published><updated>2023-10-18T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-10-18:/metabase-duckdb.html</id><summary type="html">&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</summary><content type="html">&lt;p&gt;Ahhhh &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt; if you're even partly floating around in the data space you've probably been hearing ALOT about it and it's &lt;em&gt;"Datawarehouse on your laptop"&lt;/em&gt; mantra. However, the OTHER application that sometimes gets missed is &lt;em&gt;"SQLite for OLAP workloads"&lt;/em&gt; and it was this concept that once I grasped it gave me a very interesting idea.... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to presentation layer of the lake, reducing network latency and... hopefully... have presentation reports running over very large workloads in the blink of an eye. It might even be fast enough that it could be deployed and embedded &lt;/p&gt;
&lt;p&gt;However, for this to work we need some form of conatinerised reporting application.... lucky for us there is &lt;a href="https://www.metabase.com/"&gt;Metabase&lt;/a&gt; which is a fantastic little reporting application that has an open core. So this got me thinking... Can I put these two applications together and create a Reporting Layer with report embedding capabilities that is deployable in the cluster and has a admin UI accesible over a web page all whilst keeping the data locked to our network?&lt;/p&gt;
&lt;h3&gt;The Beginnings of an Idea&lt;/h3&gt;
&lt;p&gt;Ok so... Big first question. Can Duckdb and Metabase talk? Well... not quite. But first lets take a quick look at the architecture we'll be employing here &lt;/p&gt;
&lt;p&gt;&lt;img alt="Duckdb Architecture" height="auto" width="100%" src="http://localhost:8000/images/metabase_duckdb.png"&gt;&lt;/p&gt;
&lt;p&gt;But you'll notice this pretty glossed over line, "Connector", that right there is the clincher. So what is this "Connector"?. &lt;/p&gt;
&lt;p&gt;To Deep dive into this would take a whole blog so to give you something to quickly wrap your head around its the glue that will make metabase be able to query your data source. The reality is its a jdbc driver compiled against metabase. &lt;/p&gt;
&lt;p&gt;Thankfully Metabase point you to a &lt;a href="https://github.com/AlexR2D2/metabase_duckdb_driver"&gt;community driver&lt;/a&gt; for linking to duckdb ( hopefully it will be brought into metabase proper sooner rather than later ) &lt;/p&gt;
&lt;p&gt;Now the release of this driver is still compiled against 0.8 of duckdb and 0.9 is the latest stable but hopefully the &lt;a href="https://github.com/AlexR2D2/metabase_duckdb_driver/pull/19"&gt;PR&lt;/a&gt; for this will land very soon giving a good quick way to link to the latest and greatest in duckdb from metabase&lt;/p&gt;
&lt;h3&gt;But How do we get Data?&lt;/h3&gt;
&lt;p&gt;Brilliant, using the recomended DockerFile we can load up a metabase container with the duckdb driver pre built&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;openjdk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;buster&lt;/span&gt;
&lt;span class="n"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MB_PLUGINS_DIR&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;downloads&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;v0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;46.2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;github&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;AlexR2D2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;metabase_duckdb_driver&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;releases&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;chmod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;744&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;
&lt;span class="n"&gt;CMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;java&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;-jar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/home/metabase.jar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Great Now the big question. How do we get the data into the damn thing. Interestingly initially when I was designing this I had the thought of leveraging the in memory capabilities of duckdb and pulling in from the parquet on s3 directly as needed, after all the cluster is on AWS so the s3 API requests should be unbelievably fast anyway so why bother with a persistent database? &lt;/p&gt;
&lt;p&gt;Now that we have the default credentials chain it is trivial to call parquet from s3&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s3://&amp;lt;bucket&amp;gt;/&amp;lt;file&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;However, if you're reading direct off parquet all of a sudden you need to consider the partioning and I also found out that, if the parquet is being actively written to at the time of quering, duckdb has a hissyfit about metadata not matching the query. Needless to say duckdb and streaming parquet are not happy bed fellows (&lt;em&gt;and frankly were not desined to be so this is ok&lt;/em&gt;). And the idea of trying to explain all this to the run of the mill reporting analyst whom it is my hope is a business sort of person not tech honestly gave me hives.. so I had to make it easier&lt;/p&gt;
&lt;p&gt;The compromise occured to me... the curated layer is only built daily for reporting, and using that, I could create a duckdb file on disk that could be loaded into the metabase container itself.&lt;/p&gt;
&lt;p&gt;With some very simple python as an operation in our orchestrator I had a job that would read direct from our curated parquet and create a duckdb file with it.. without giving away to much the job primarily consisted of this &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;duckdb_builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;curated_duckdb.duckdb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CALL load_aws_credentials(&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;aws_profile&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#This removes a lot of weirdass ANSI in logs you DO NOT WANT&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;PRAGMA enable_progress_bar=false&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Create &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; in duckdb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CREATE OR REPLACE TABLE &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; AS SELECT * FROM read_parquet(&amp;#39;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;curated_bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/*&amp;#39;)&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; Created&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And then an upload to an s3 bucket&lt;/p&gt;
&lt;p&gt;This of course necessated a cron job baked in to the metabase container itself to actually pull the duckdb in every morning. After some carefuly analysis of time (because I'm do lazy to implement message queues) I set up a s3 cp job that could be cronned direct from the container itself. This gives us a self updating metabase container pulling with a duckdb backend for client facing reporting right in the interface. AND because of the fact the duckdb is baked right into the container... there are NO associated s3 or dpu costs (merely the cost of running a relatively large container)&lt;/p&gt;
&lt;p&gt;The final Dockerfile looks like this&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;openjdk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;buster&lt;/span&gt;
&lt;span class="n"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MB_PLUGINS_DIR&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;downloads&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;v0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;47.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;chmod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;744&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb_data&lt;/span&gt;
&lt;span class="n"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;entrypoint&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sh&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;helper_scripts&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;download_duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cron&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pip3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;crontab&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0 */6 * * * python3 /home/helper_scripts/download_duckdb.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;crontab&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;
&lt;span class="n"&gt;CMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;bash&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/home/entrypoint.sh&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And there we have it... an in memory containerised reporting solution with blazing fast capability to aggregate and build reports based on curated data direct from the business.. fully automated and deployable via CI/CD, that provides data updates daily.&lt;/p&gt;
&lt;p&gt;Now the embedded part.. which isn't built yet but I'll make sure to update you once we have/if we do because the architecture is very exciting for an embbdedded reporting workflow that is deployable via CI/CD processes to applications. As a little taster I'll point you to the &lt;a href="https://www.metabase.com/learn/administration/git-based-workflow"&gt;metabase documentation&lt;/a&gt;, the unfortunate thing about it is Metabase &lt;em&gt;have&lt;/em&gt; hidden this behind the enterprise license.. but I can absolutely see why. If we get to implementing this I'll be sure to update you here on the learnings.&lt;/p&gt;
&lt;p&gt;Until then....&lt;/p&gt;</content><category term="Business Intelligence"></category><category term="data engineering"></category><category term="Metabase"></category><category term="DuckDB"></category><category term="embedded"></category></entry><entry><title>Implmenting Appflow in a Production Datalake</title><link href="http://localhost:8000/appflow-production.html" rel="alternate"></link><published>2023-05-23T20:00:00+10:00</published><updated>2023-05-17T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-05-23:/appflow-production.html</id><summary type="html">&lt;p&gt;How Appflow simplified a major extract layer and when I choose Managed Services&lt;/p&gt;</summary><content type="html">&lt;p&gt;I recently attended a meetup where there was a talk by an AWS spokesperson. Now don't get me wrong, I normally take these things with a grain of salt. At this talk there was this tiny tiny little segment about a product that AWS had released called &lt;a href="https://aws.amazon.com/appflow/"&gt;Amazon Appflow&lt;/a&gt;. This product &lt;em&gt;claimed&lt;/em&gt; to be able to automate and make easy the link between different API endpoints, REST or otherwise and send that data to another point, whether that is Redshift, Aurora, a general relational db in RDS or otherwise or s3.&lt;/p&gt;
&lt;p&gt;This was particularly interesting to me because I had recently finished creating and s3 datalake in AWS for the company I work for. Today, I finally put my first Appflow integration to the Datalake into production and I have to say there are some rough edges to the deployment but it has been more or less as described on the box. &lt;/p&gt;
&lt;p&gt;Over the course of the next few paragraphs I'd like to explain the thinking I had as I investigated the product and then ultimately why I chose a managed service for this over implementing something myself in python using Dagster which I have also spun up within our cluster on AWS.&lt;/p&gt;
&lt;h3&gt;Datalake Extraction Layer&lt;/h3&gt;

View File

@@ -1,5 +1,78 @@
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/all.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-10-18T20:00:00+10:00</updated><entry><title>Metabase and DuckDB</title><link href="http://localhost:8000/metabase-duckdb.html" rel="alternate"></link><published>2023-10-18T20:00:00+10:00</published><updated>2023-10-18T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-10-18:/metabase-duckdb.html</id><summary type="html">&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</summary><content type="html">&lt;p&gt;Ahhhh &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt; if you're even partly floating around in the data space you've probably been hearing ALOT about it and it's &lt;em&gt;"Datawarehouse on your laptop"&lt;/em&gt; mantra. However, the OTHER application that sometimes gets missed is &lt;em&gt;"SQLite for OLAP workloads"&lt;/em&gt; and it was this concept that once I grasped it gave me a very interesting idea.... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to presentation layer of the lake, reducing network latency and... hopefully... have presentation reports running over very large workloads in the blink of an eye. It might even be fast enough that it could be deployed and embedded &lt;/p&gt;
&lt;p&gt;However, for this to work we need some form of conatinerised reporting application.... lucky for us there is &lt;a href="https://www.metabase.com/"&gt;Metabase&lt;/a&gt; which is a fantastic little reporting application that has an open core. So this got me thinking... Can I put these two applications together and create a Reporting Layer with report embedding capabilities that is deployable in the cluster and has a admin UI accesible over a web page all whilst keeping the data locked to our network?&lt;/p&gt;
&lt;h3&gt;The Beginnings of an Idea&lt;/h3&gt;
&lt;p&gt;Ok so... Big first question. Can Duckdb and Metabase talk? Well... not quite. But first lets take a quick look at the architecture we'll be employing here &lt;/p&gt;
&lt;p&gt;&lt;img alt="Duckdb Architecture" height="auto" width="100%" src="http://localhost:8000/images/metabase_duckdb.png"&gt;&lt;/p&gt;
&lt;p&gt;But you'll notice this pretty glossed over line, "Connector", that right there is the clincher. So what is this "Connector"?. &lt;/p&gt;
&lt;p&gt;To Deep dive into this would take a whole blog so to give you something to quickly wrap your head around its the glue that will make metabase be able to query your data source. The reality is its a jdbc driver compiled against metabase. &lt;/p&gt;
&lt;p&gt;Thankfully Metabase point you to a &lt;a href="https://github.com/AlexR2D2/metabase_duckdb_driver"&gt;community driver&lt;/a&gt; for linking to duckdb ( hopefully it will be brought into metabase proper sooner rather than later ) &lt;/p&gt;
&lt;p&gt;Now the release of this driver is still compiled against 0.8 of duckdb and 0.9 is the latest stable but hopefully the &lt;a href="https://github.com/AlexR2D2/metabase_duckdb_driver/pull/19"&gt;PR&lt;/a&gt; for this will land very soon giving a good quick way to link to the latest and greatest in duckdb from metabase&lt;/p&gt;
&lt;h3&gt;But How do we get Data?&lt;/h3&gt;
&lt;p&gt;Brilliant, using the recomended DockerFile we can load up a metabase container with the duckdb driver pre built&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;openjdk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;buster&lt;/span&gt;
&lt;span class="n"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MB_PLUGINS_DIR&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;downloads&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;v0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;46.2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;github&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;AlexR2D2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;metabase_duckdb_driver&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;releases&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;chmod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;744&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;
&lt;span class="n"&gt;CMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;java&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;-jar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/home/metabase.jar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Great Now the big question. How do we get the data into the damn thing. Interestingly initially when I was designing this I had the thought of leveraging the in memory capabilities of duckdb and pulling in from the parquet on s3 directly as needed, after all the cluster is on AWS so the s3 API requests should be unbelievably fast anyway so why bother with a persistent database? &lt;/p&gt;
&lt;p&gt;Now that we have the default credentials chain it is trivial to call parquet from s3&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s3://&amp;lt;bucket&amp;gt;/&amp;lt;file&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
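&lt;p&gt;To make that concrete, here's a minimal sketch of the in-memory idea from python. This is my illustration rather than anything from the production job: the bucket and prefix are placeholders, and I'm assuming the httpfs and aws extensions for the s3 reads and the credentials chain.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import duckdb

# No file argument = a purely in-memory database
conn = duckdb.connect()

# httpfs provides s3:// reads; the aws extension provides load_aws_credentials()
conn.execute("INSTALL httpfs")
conn.execute("LOAD httpfs")
conn.execute("INSTALL aws")
conn.execute("LOAD aws")
conn.execute("CALL load_aws_credentials()")

# Placeholder bucket/prefix: query the parquet directly off s3, no persistence
count = conn.execute("SELECT count(*) FROM read_parquet('s3://my-bucket/my-table/*.parquet')").fetchone()[0]
print(count)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;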
&lt;p&gt;However, if you're reading direct off parquet all of a sudden you need to consider the partitioning, and I also found out that, if the parquet is being actively written to at the time of querying, duckdb has a hissy fit about metadata not matching the query. Needless to say duckdb and streaming parquet are not happy bedfellows (&lt;em&gt;and frankly were not designed to be, so this is ok&lt;/em&gt;). And the idea of trying to explain all this to the run-of-the-mill reporting analyst, who I hope is a business sort of person rather than a tech one, honestly gave me hives... so I had to make it easier&lt;/p&gt;
&lt;p&gt;The compromise occurred to me... the curated layer is only built daily for reporting, and using that, I could create a duckdb file on disk that could be loaded into the metabase container itself.&lt;/p&gt;
&lt;p&gt;With some very simple python as an operation in our orchestrator I had a job that would read direct from our curated parquet and create a duckdb file from it... without giving away too much, the job primarily consisted of this &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;duckdb_builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;curated_duckdb.duckdb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CALL load_aws_credentials(&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;aws_profile&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#This removes a lot of weirdass ANSI in logs you DO NOT WANT&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;PRAGMA enable_progress_bar=false&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Create &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; in duckdb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CREATE OR REPLACE TABLE &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; AS SELECT * FROM read_parquet(&amp;#39;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;curated_bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/*&amp;#39;)&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; Created&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
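&lt;p&gt;The surrounding operation then just maps that builder over the curated tables, something like this (the table names are made up and the real list comes from the orchestrator):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Hypothetical table list; one CREATE OR REPLACE TABLE per curated table
for table in ["sales", "customers", "inventory"]:
    duckdb_builder(table)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;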
&lt;p&gt;And then an upload to an s3 bucket&lt;/p&gt;
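&lt;p&gt;Nothing fancy in that step; a boto3 sketch of it, with a made-up bucket and key, looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import boto3

# Made-up bucket/key; the only requirement is somewhere the container can later fetch from
s3 = boto3.client("s3")
s3.upload_file("curated_duckdb.duckdb", "my-reporting-bucket", "duckdb/curated_duckdb.duckdb")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;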
&lt;p&gt;This of course necessitated a cron job baked into the metabase container itself to actually pull the duckdb in every morning. After some careful analysis of timings (because I'm too lazy to implement message queues) I set up an s3 cp job that could be cronned direct from the container itself. This gives us a self-updating metabase container with a duckdb backend for client-facing reporting right in the interface. AND because the duckdb is baked right into the container... there are NO associated s3 or dpu costs (merely the cost of running a relatively large container)&lt;/p&gt;
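&lt;p&gt;In spirit the cronned copy job is just the reverse of that upload. A sketch of what the helper script could look like (bucket, key and local path are again my assumptions, with /duckdb_data matching the directory created in the Dockerfile below):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import boto3

# Overwrite the local duckdb file with the latest daily build
s3 = boto3.client("s3")
s3.download_file("my-reporting-bucket", "duckdb/curated_duckdb.duckdb", "/duckdb_data/curated_duckdb.duckdb")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;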
&lt;p&gt;The final Dockerfile looks like this&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;openjdk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;buster&lt;/span&gt;
&lt;span class="n"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MB_PLUGINS_DIR&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;downloads&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;v0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;47.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;chmod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;744&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb_data&lt;/span&gt;
&lt;span class="n"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;entrypoint&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sh&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;helper_scripts&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;download_duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cron&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pip3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;crontab&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0 */6 * * * python3 /home/helper_scripts/download_duckdb.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;crontab&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;
&lt;span class="n"&gt;CMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;bash&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/home/entrypoint.sh&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And there we have it... an in-memory, containerised reporting solution with blazing fast capability to aggregate and build reports based on curated data direct from the business... fully automated, deployable via CI/CD, and providing daily data updates.&lt;/p&gt;
&lt;p&gt;Now the embedded part... which isn't built yet, but I'll make sure to update you once we have/if we do, because the architecture is very exciting for an embedded reporting workflow that is deployable via CI/CD processes to applications. As a little taster I'll point you to the &lt;a href="https://www.metabase.com/learn/administration/git-based-workflow"&gt;metabase documentation&lt;/a&gt;; the unfortunate thing about it is Metabase &lt;em&gt;have&lt;/em&gt; hidden this behind the enterprise license... but I can absolutely see why. If we get to implementing this I'll be sure to update you here on the learnings.&lt;/p&gt;
&lt;p&gt;Until then....&lt;/p&gt;</content><category term="Business Intelligence"></category><category term="data engineering"></category><category term="Metabase"></category><category term="DuckDB"></category><category term="embedded"></category></entry><entry><title>Implementing Appflow in a Production Datalake</title><link href="http://localhost:8000/appflow-production.html" rel="alternate"></link><published>2023-05-23T20:00:00+10:00</published><updated>2023-05-17T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-05-23:/appflow-production.html</id><summary type="html">&lt;p&gt;How Appflow simplified a major extract layer and when I choose Managed Services&lt;/p&gt;</summary><content type="html">&lt;p&gt;I recently attended a meetup where there was a talk by an AWS spokesperson. Now don't get me wrong, I normally take these things with a grain of salt. At this talk there was this tiny tiny little segment about a product that AWS had released called &lt;a href="https://aws.amazon.com/appflow/"&gt;Amazon Appflow&lt;/a&gt;. This product &lt;em&gt;claimed&lt;/em&gt; to be able to automate and simplify the link between different API endpoints, REST or otherwise, and send that data to another point, whether that is Redshift, Aurora, a general relational db in RDS or otherwise, or s3.&lt;/p&gt;
&lt;p&gt;This was particularly interesting to me because I had recently finished creating an s3 datalake in AWS for the company I work for. Today, I finally put my first Appflow integration to the Datalake into production and I have to say there are some rough edges to the deployment, but it has been more or less as described on the box. &lt;/p&gt;
&lt;p&gt;Over the course of the next few paragraphs I'd like to explain the thinking I had as I investigated the product and then ultimately why I chose a managed service for this over implementing something myself in python using Dagster which I have also spun up within our cluster on AWS.&lt;/p&gt;
&lt;h3&gt;Datalake Extraction Layer&lt;/h3&gt;

View File

@ -1,5 +1,78 @@
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog - Andrew Ridgway</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/andrew-ridgway.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-10-18T20:00:00+10:00</updated><entry><title>Metabase and DuckDB</title><link href="http://localhost:8000/metabase-duckdb.html" rel="alternate"></link><published>2023-10-18T20:00:00+10:00</published><updated>2023-10-18T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-10-18:/metabase-duckdb.html</id><summary type="html">&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</summary><content type="html">&lt;p&gt;Ahhhh &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt;; if you're even partly floating around in the data space you've probably been hearing A LOT about it and its &lt;em&gt;"Datawarehouse on your laptop"&lt;/em&gt; mantra. However, the OTHER application that sometimes gets missed is &lt;em&gt;"SQLite for OLAP workloads"&lt;/em&gt;, and it was this concept that, once I grasped it, gave me a very interesting idea.... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to the presentation layer of the lake, reducing network latency and... hopefully... have presentation reports running over very large workloads in the blink of an eye? It might even be fast enough that it could be deployed and embedded &lt;/p&gt;
&lt;p&gt;However, for this to work we need some form of containerised reporting application.... lucky for us there is &lt;a href="https://www.metabase.com/"&gt;Metabase&lt;/a&gt;, which is a fantastic little reporting application that has an open core. So this got me thinking... Can I put these two applications together and create a Reporting Layer with report embedding capabilities that is deployable in the cluster and has an admin UI accessible over a web page, all whilst keeping the data locked to our network?&lt;/p&gt;
&lt;h3&gt;The Beginnings of an Idea&lt;/h3&gt;
&lt;p&gt;Ok so... Big first question. Can Duckdb and Metabase talk? Well... not quite. But first let's take a quick look at the architecture we'll be employing here &lt;/p&gt;
&lt;p&gt;&lt;img alt="Duckdb Architecture" height="auto" width="100%" src="http://localhost:8000/images/metabase_duckdb.png"&gt;&lt;/p&gt;
&lt;p&gt;But you'll notice this pretty glossed-over line, "Connector"; that right there is the clincher. So what is this "Connector"? &lt;/p&gt;
&lt;p&gt;To deep dive into this would take a whole blog, so to give you something to quickly wrap your head around: it's the glue that lets metabase query your data source. The reality is it's a jdbc driver compiled against metabase. &lt;/p&gt;
&lt;p&gt;Thankfully Metabase point you to a &lt;a href="https://github.com/AlexR2D2/metabase_duckdb_driver"&gt;community driver&lt;/a&gt; for linking to duckdb (hopefully it will be brought into metabase proper sooner rather than later) &lt;/p&gt;
&lt;p&gt;Now the released driver is still compiled against 0.8 of duckdb while 0.9 is the latest stable, but hopefully the &lt;a href="https://github.com/AlexR2D2/metabase_duckdb_driver/pull/19"&gt;PR&lt;/a&gt; for this will land very soon, giving a good quick way to link to the latest and greatest in duckdb from metabase&lt;/p&gt;
&lt;h3&gt;But How do we get Data?&lt;/h3&gt;
&lt;p&gt;Brilliant. Using the recommended Dockerfile we can load up a metabase container with the duckdb driver pre-built&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;openjdk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;buster&lt;/span&gt;
&lt;span class="n"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MB_PLUGINS_DIR&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;downloads&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;v0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;46.2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;github&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;AlexR2D2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;metabase_duckdb_driver&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;releases&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;chmod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;744&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;
&lt;span class="n"&gt;CMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;java&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;-jar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/home/metabase.jar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Great. Now the big question: how do we get the data into the damn thing? Interestingly, when I was initially designing this I had the thought of leveraging the in-memory capabilities of duckdb and pulling in from the parquet on s3 directly as needed; after all, the cluster is on AWS so the s3 API requests should be unbelievably fast anyway, so why bother with a persistent database? &lt;/p&gt;
&lt;p&gt;Now that we have the default credentials chain it is trivial to query parquet straight from s3&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s3://&amp;lt;bucket&amp;gt;/&amp;lt;file&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;However, if you're reading direct off parquet all of a sudden you need to consider the partitioning, and I also found out that, if the parquet is being actively written to at the time of querying, duckdb has a hissy fit about metadata not matching the query. Needless to say duckdb and streaming parquet are not happy bedfellows (&lt;em&gt;and frankly were not designed to be, so this is ok&lt;/em&gt;). And the idea of trying to explain all this to the run-of-the-mill reporting analyst, who I hope is a business sort of person rather than a tech one, honestly gave me hives... so I had to make it easier&lt;/p&gt;
&lt;p&gt;The compromise occurred to me... the curated layer is only built daily for reporting, and using that, I could create a duckdb file on disk that could be loaded into the metabase container itself.&lt;/p&gt;
&lt;p&gt;With some very simple python as an operation in our orchestrator I had a job that would read direct from our curated parquet and create a duckdb file from it... without giving away too much, the job primarily consisted of this &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;duckdb_builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;curated_duckdb.duckdb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CALL load_aws_credentials(&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;aws_profile&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#This removes a lot of weirdass ANSI in logs you DO NOT WANT&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;PRAGMA enable_progress_bar=false&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Create &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; in duckdb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CREATE OR REPLACE TABLE &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; AS SELECT * FROM read_parquet(&amp;#39;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;curated_bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/*&amp;#39;)&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; Created&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And then an upload to an s3 bucket&lt;/p&gt;
&lt;p&gt;This of course necessitated a cron job baked into the metabase container itself to actually pull the duckdb in every morning. After some careful analysis of timings (because I'm too lazy to implement message queues) I set up an s3 cp job that could be cronned direct from the container itself. This gives us a self-updating metabase container with a duckdb backend for client-facing reporting right in the interface. AND because the duckdb is baked right into the container... there are NO associated s3 or dpu costs (merely the cost of running a relatively large container)&lt;/p&gt;
&lt;p&gt;The final Dockerfile looks like this&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;openjdk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;buster&lt;/span&gt;
&lt;span class="n"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MB_PLUGINS_DIR&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;downloads&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;v0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;47.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;chmod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;744&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb_data&lt;/span&gt;
&lt;span class="n"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;entrypoint&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sh&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;helper_scripts&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;download_duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cron&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pip3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;crontab&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0 */6 * * * python3 /home/helper_scripts/download_duckdb.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;crontab&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;
&lt;span class="n"&gt;CMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;bash&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/home/entrypoint.sh&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And there we have it... an in-memory, containerised reporting solution with blazing fast capability to aggregate and build reports based on curated data direct from the business... fully automated, deployable via CI/CD, and providing daily data updates.&lt;/p&gt;
&lt;p&gt;Now the embedded part... which isn't built yet, but I'll make sure to update you once we have/if we do, because the architecture is very exciting for an embedded reporting workflow that is deployable via CI/CD processes to applications. As a little taster I'll point you to the &lt;a href="https://www.metabase.com/learn/administration/git-based-workflow"&gt;metabase documentation&lt;/a&gt;; the unfortunate thing about it is Metabase &lt;em&gt;have&lt;/em&gt; hidden this behind the enterprise license... but I can absolutely see why. If we get to implementing this I'll be sure to update you here on the learnings.&lt;/p&gt;
&lt;p&gt;Until then....&lt;/p&gt;</content><category term="Business Intelligence"></category><category term="data engineering"></category><category term="Metabase"></category><category term="DuckDB"></category><category term="embedded"></category></entry><entry><title>Implementing Appflow in a Production Datalake</title><link href="http://localhost:8000/appflow-production.html" rel="alternate"></link><published>2023-05-23T20:00:00+10:00</published><updated>2023-05-17T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-05-23:/appflow-production.html</id><summary type="html">&lt;p&gt;How Appflow simplified a major extract layer and when I choose Managed Services&lt;/p&gt;</summary><content type="html">&lt;p&gt;I recently attended a meetup where there was a talk by an AWS spokesperson. Now don't get me wrong, I normally take these things with a grain of salt. At this talk there was this tiny tiny little segment about a product that AWS had released called &lt;a href="https://aws.amazon.com/appflow/"&gt;Amazon Appflow&lt;/a&gt;. This product &lt;em&gt;claimed&lt;/em&gt; to be able to automate and simplify the link between different API endpoints, REST or otherwise, and send that data to another point, whether that is Redshift, Aurora, a general relational db in RDS or otherwise, or s3.&lt;/p&gt;
&lt;p&gt;This was particularly interesting to me because I had recently finished creating an s3 datalake in AWS for the company I work for. Today, I finally put my first Appflow integration to the Datalake into production and I have to say there are some rough edges to the deployment, but it has been more or less as described on the box. &lt;/p&gt;
&lt;p&gt;Over the course of the next few paragraphs I'd like to explain the thinking I had as I investigated the product and then ultimately why I chose a managed service for this over implementing something myself in python using Dagster which I have also spun up within our cluster on AWS.&lt;/p&gt;
&lt;h3&gt;Datalake Extraction Layer&lt;/h3&gt;

View File

@ -1,2 +1,2 @@
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"><channel><title>Andrew Ridgway's Blog - Andrew Ridgway</title><link>http://localhost:8000/</link><description></description><lastBuildDate>Wed, 18 Oct 2023 20:00:00 +1000</lastBuildDate><item><title>Metabase and DuckDB</title><link>http://localhost:8000/metabase-duckdb.html</link><description>&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Ridgway</dc:creator><pubDate>Wed, 18 Oct 2023 20:00:00 +1000</pubDate><guid isPermaLink="false">tag:localhost,2023-10-18:/metabase-duckdb.html</guid><category>Business Intelligence</category><category>data engineering</category><category>Metabase</category><category>DuckDB</category><category>embedded</category></item><item><title>Implementing Appflow in a Production Datalake</title><link>http://localhost:8000/appflow-production.html</link><description>&lt;p&gt;How Appflow simplified a major extract layer and when I choose Managed Services&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Ridgway</dc:creator><pubDate>Tue, 23 May 2023 20:00:00 +1000</pubDate><guid isPermaLink="false">tag:localhost,2023-05-23:/appflow-production.html</guid><category>Data Engineering</category><category>data engineering</category><category>Amazon</category><category>Managed Services</category></item><item><title>Dawn of another blog attempt</title><link>http://localhost:8000/how-i-built-the-damn-thing.html</link><description>&lt;p&gt;Containers and How I take my learnings from home and apply them to work&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Ridgway</dc:creator><pubDate>Wed, 10 May 2023 20:00:00 +1000</pubDate><guid isPermaLink="false">tag:localhost,2023-05-10:/how-i-built-the-damn-thing.html</guid><category>Data Engineering</category><category>data engineering</category><category>containers</category></item></channel></rss>

View File

@ -0,0 +1,75 @@
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog - Business Intelligence</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/business-intelligence.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-10-18T20:00:00+10:00</updated><entry><title>Metabase and DuckDB</title><link href="http://localhost:8000/metabase-duckdb.html" rel="alternate"></link><published>2023-10-18T20:00:00+10:00</published><updated>2023-10-18T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-10-18:/metabase-duckdb.html</id><summary type="html">&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</summary><content type="html">&lt;p&gt;Ahhhh &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt;; if you're even partly floating around in the data space you've probably been hearing A LOT about it and its &lt;em&gt;"Datawarehouse on your laptop"&lt;/em&gt; mantra. However, the OTHER application that sometimes gets missed is &lt;em&gt;"SQLite for OLAP workloads"&lt;/em&gt;, and it was this concept that, once I grasped it, gave me a very interesting idea.... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to the presentation layer of the lake, reducing network latency and... hopefully... have presentation reports running over very large workloads in the blink of an eye? It might even be fast enough that it could be deployed and embedded &lt;/p&gt;
&lt;p&gt;However, for this to work we need some form of containerised reporting application.... lucky for us there is &lt;a href="https://www.metabase.com/"&gt;Metabase&lt;/a&gt;, which is a fantastic little reporting application that has an open core. So this got me thinking... Can I put these two applications together and create a Reporting Layer with report embedding capabilities that is deployable in the cluster and has an admin UI accessible over a web page, all whilst keeping the data locked to our network?&lt;/p&gt;
&lt;h3&gt;The Beginnings of an Idea&lt;/h3&gt;
&lt;p&gt;Ok so... Big first question. Can Duckdb and Metabase talk? Well... not quite. But first let's take a quick look at the architecture we'll be employing here &lt;/p&gt;
&lt;p&gt;&lt;img alt="Duckdb Architecture" height="auto" width="100%" src="http://localhost:8000/images/metabase_duckdb.png"&gt;&lt;/p&gt;
&lt;p&gt;But you'll notice this pretty glossed-over line, "Connector"; that right there is the clincher. So what is this "Connector"? &lt;/p&gt;
&lt;p&gt;To deep dive into this would take a whole blog, so to give you something to quickly wrap your head around: it's the glue that lets metabase query your data source. The reality is it's a jdbc driver compiled against metabase. &lt;/p&gt;
&lt;p&gt;Thankfully Metabase point you to a &lt;a href="https://github.com/AlexR2D2/metabase_duckdb_driver"&gt;community driver&lt;/a&gt; for linking to duckdb (hopefully it will be brought into metabase proper sooner rather than later) &lt;/p&gt;
&lt;p&gt;Now the released driver is still compiled against 0.8 of duckdb while 0.9 is the latest stable, but hopefully the &lt;a href="https://github.com/AlexR2D2/metabase_duckdb_driver/pull/19"&gt;PR&lt;/a&gt; for this will land very soon, giving a good quick way to link to the latest and greatest in duckdb from metabase&lt;/p&gt;
&lt;h3&gt;But How do we get Data?&lt;/h3&gt;
&lt;p&gt;Brilliant. Using the recommended Dockerfile we can load up a metabase container with the duckdb driver pre-built&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;openjdk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;buster&lt;/span&gt;
&lt;span class="n"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MB_PLUGINS_DIR&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;downloads&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;v0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;46.2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;github&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;AlexR2D2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;metabase_duckdb_driver&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;releases&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;chmod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;744&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;
&lt;span class="n"&gt;CMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;java&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;-jar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/home/metabase.jar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Great. Now the big question: how do we get the data into the damn thing? Interestingly, when I was initially designing this I had the thought of leveraging the in-memory capabilities of duckdb and pulling in from the parquet on s3 directly as needed; after all, the cluster is on AWS so the s3 API requests should be unbelievably fast anyway, so why bother with a persistent database? &lt;/p&gt;
&lt;p&gt;Now that we have the default credentials chain it is trivial to query parquet straight from s3&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s3://&amp;lt;bucket&amp;gt;/&amp;lt;file&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;However, if you're reading direct off parquet all of a sudden you need to consider the partitioning, and I also found out that, if the parquet is being actively written to at the time of querying, duckdb has a hissy fit about metadata not matching the query. Needless to say duckdb and streaming parquet are not happy bedfellows (&lt;em&gt;and frankly were not designed to be, so this is ok&lt;/em&gt;). And the idea of trying to explain all this to the run-of-the-mill reporting analyst, who I hope is a business sort of person rather than a tech one, honestly gave me hives... so I had to make it easier&lt;/p&gt;
&lt;p&gt;The compromise occurred to me... the curated layer is only built daily for reporting, and using that, I could create a duckdb file on disk that could be loaded into the metabase container itself.&lt;/p&gt;
&lt;p&gt;With some very simple python as an operation in our orchestrator I had a job that would read direct from our curated parquet and create a duckdb file from it... without giving away too much, the job primarily consisted of this &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;duckdb_builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;curated_duckdb.duckdb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CALL load_aws_credentials(&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;aws_profile&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#This removes a lot of weirdass ANSI in logs you DO NOT WANT&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;PRAGMA enable_progress_bar=false&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Create &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; in duckdb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CREATE OR REPLACE TABLE &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; AS SELECT * FROM read_parquet(&amp;#39;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;curated_bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/*&amp;#39;)&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; Created&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And then an upload to an s3 bucket.&lt;/p&gt;
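&lt;p&gt;For illustration, a minimal sketch of that upload step might look like the following; the bucket name is made up, and boto3 is the only assumption beyond the build job above.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import boto3

def upload_duckdb(file_name="curated_duckdb.duckdb", bucket="reporting-artifacts"):
    # Push the freshly built duckdb file up for the metabase container to collect
    # (bucket name is illustrative)
    s3 = boto3.client("s3")
    s3.upload_file(file_name, bucket, file_name)

upload_duckdb()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;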
&lt;p&gt;This of course necessitated a cron job baked into the metabase container itself to actually pull the duckdb file in regularly. After some careful analysis of timing (because I'm too lazy to implement message queues) I set up an s3 cp job that could be cronned directly from the container itself. This gives us a self-updating metabase container with a duckdb backend for client-facing reporting right in the interface. AND because the duckdb file is baked right into the container... there are NO associated s3 or DPU costs (merely the cost of running a relatively large container).&lt;/p&gt;
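&lt;p&gt;The helper script itself isn't shown here, but a minimal sketch of what download_duckdb.py might do is below; the bucket name is again illustrative, and /duckdb_data matches the directory created in the Dockerfile below.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import boto3

BUCKET = "reporting-artifacts"  # illustrative: whatever bucket the orchestrator uploads to
KEY = "curated_duckdb.duckdb"

def main():
    s3 = boto3.client("s3")
    # Overwrite the file the metabase duckdb connection points at
    s3.download_file(BUCKET, KEY, f"/duckdb_data/{KEY}")

if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;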
&lt;p&gt;The final Dockerfile looks like this&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;openjdk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;buster&lt;/span&gt;
&lt;span class="n"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MB_PLUGINS_DIR&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;downloads&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;v0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;47.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;chmod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;744&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb_data&lt;/span&gt;
&lt;span class="n"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;entrypoint&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sh&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;helper_scripts&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;download_duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cron&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pip3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;crontab&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0 */6 * * * python3 /home/helper_scripts/download_duckdb.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;crontab&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;
&lt;span class="n"&gt;CMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;bash&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/home/entrypoint.sh&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And there we have it... an in-memory containerised reporting solution with blazing-fast capability to aggregate and build reports based on curated data directly from the business... fully automated, deployable via CI/CD, and providing daily data updates.&lt;/p&gt;
&lt;p&gt;Now the embedded part... which isn't built yet, but I'll make sure to update you once we have (if we do) because the architecture is very exciting for an embedded reporting workflow that is deployable via CI/CD processes to applications. As a little taster I'll point you to the &lt;a href="https://www.metabase.com/learn/administration/git-based-workflow"&gt;metabase documentation&lt;/a&gt;; the unfortunate thing is that Metabase &lt;em&gt;have&lt;/em&gt; hidden this behind the enterprise license... but I can absolutely see why. If we get to implementing this I'll be sure to update you here on the learnings.&lt;/p&gt;
&lt;p&gt;Until then....&lt;/p&gt;</content><category term="Business Intelligence"></category><category term="data engineering"></category><category term="Metabase"></category><category term="DuckDB"></category><category term="embedded"></category></entry></feed>

View File

@ -0,0 +1,2 @@
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog - Data Analytics</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/data-analytics.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-07-13T20:00:00+10:00</updated><entry><title>Notebook or BI, What is the most appropiate communication medium</title><link href="http://localhost:8000/notebook-or-bi.html" rel="alternate"></link><published>2023-07-13T20:00:00+10:00</published><updated>2023-07-13T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-07-13:/notebook-or-bi.html</id><summary type="html">&lt;p&gt;When is a notebook enough or when do we need a dashboard&lt;/p&gt;</summary><content type="html">&lt;p&gt;I want to preface this post by saying I think "Dashboards" or "BI" as terms are wayyyyyyyyyyyyyyyyy over saturated in the market. There seems to be a belief that any question answerable in data deserves the work associated with a dashboard when in fact a simple one off report, or notebook, would be more than enough.&lt;/p&gt;</content><category term="Data Analytics"></category><category term="data engineering"></category><category term="Data Analytics"></category></entry></feed>

Binary file not shown.

After

Width:  |  Height:  |  Size: 146 KiB

View File

@ -84,6 +84,19 @@
<div class="container"> <div class="container">
<div class="row"> <div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1"> <div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<div class="post-preview">
<a href="http://localhost:8000/metabase-duckdb.html" rel="bookmark" title="Permalink to Metabase and DuckDB">
<h2 class="post-title">
Metabase and DuckDB
</h2>
</a>
<p>Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible</p>
<p class="post-meta">Posted by
<a href="http://localhost:8000/author/andrew-ridgway.html">Andrew Ridgway</a>
on Wed 18 October 2023
</p>
</div>
<hr>
<div class="post-preview"> <div class="post-preview">
<a href="http://localhost:8000/appflow-production.html" rel="bookmark" title="Permalink to Implmenting Appflow in a Production Datalake"> <a href="http://localhost:8000/appflow-production.html" rel="bookmark" title="Permalink to Implmenting Appflow in a Production Datalake">
<h2 class="post-title"> <h2 class="post-title">

View File

@ -0,0 +1,246 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
<meta name="author" content="">
<title>Andrew Ridgway's Blog</title>
<link href="http://localhost:8000/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Andrew Ridgway's Blog Full Atom Feed" />
<link href="http://localhost:8000/feeds/business-intelligence.atom.xml" type="application/atom+xml" rel="alternate" title="Andrew Ridgway's Blog Categories Atom Feed" />
<!-- Bootstrap Core CSS -->
<link href="http://localhost:8000/theme/css/bootstrap.min.css" rel="stylesheet">
<!-- Custom CSS -->
<link href="http://localhost:8000/theme/css/clean-blog.min.css" rel="stylesheet">
<!-- Code highlight color scheme -->
<link href="http://localhost:8000/theme/css/code_blocks/tomorrow.css" rel="stylesheet">
<!-- Custom Fonts -->
<link href="http://maxcdn.bootstrapcdn.com/font-awesome/4.1.0/css/font-awesome.min.css" rel="stylesheet" type="text/css">
<link href='http://fonts.googleapis.com/css?family=Lora:400,700,400italic,700italic' rel='stylesheet' type='text/css'>
<link href='http://fonts.googleapis.com/css?family=Open+Sans:300italic,400italic,600italic,700italic,800italic,400,300,600,700,800' rel='stylesheet' type='text/css'>
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
<meta name="tags" contents="data engineering" />
<meta name="tags" contents="Metabase" />
<meta name="tags" contents="DuckDB" />
<meta name="tags" contents="embedded" />
<meta property="og:locale" content="en">
<meta property="og:site_name" content="Andrew Ridgway's Blog">
<meta property="og:type" content="article">
<meta property="article:author" content="">
<meta property="og:url" content="http://localhost:8000/metabase-duckdb.html">
<meta property="og:title" content="Metabase and DuckDB">
<meta property="og:description" content="">
<meta property="og:image" content="http://localhost:8000/">
<meta property="article:published_time" content="2023-10-18 20:00:00+10:00">
</head>
<body>
<!-- Navigation -->
<nav class="navbar navbar-default navbar-custom navbar-fixed-top">
<div class="container-fluid">
<!-- Brand and toggle get grouped for better mobile display -->
<div class="navbar-header page-scroll">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="http://localhost:8000/">Andrew Ridgway's Blog</a>
</div>
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav navbar-right">
</ul>
</div>
<!-- /.navbar-collapse -->
</div>
<!-- /.container -->
</nav>
<!-- Page Header -->
<header class="intro-header" style="background-image: url('http://localhost:8000/theme/images/post-bg.jpg')">
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<div class="post-heading">
<h1>Metabase and DuckDB</h1>
<span class="meta">Posted by
<a href="http://localhost:8000/author/andrew-ridgway.html">Andrew Ridgway</a>
on Wed 18 October 2023
</span>
</div>
</div>
</div>
</div>
</header>
<!-- Main Content -->
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<!-- Post Content -->
<article>
<p>Ahhhh <a href="https://duckdb.org/">DuckDB</a> if you're even partly floating around in the data space you've probably been hearing ALOT about it and it's <em>"Datawarehouse on your laptop"</em> mantra. However, the OTHER application that sometimes gets missed is <em>"SQLite for OLAP workloads"</em> and it was this concept that once I grasped it gave me a very interesting idea.... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to presentation layer of the lake, reducing network latency and... hopefully... have presentation reports running over very large workloads in the blink of an eye. It might even be fast enough that it could be deployed and embedded </p>
<p>However, for this to work we need some form of conatinerised reporting application.... lucky for us there is <a href="https://www.metabase.com/">Metabase</a> which is a fantastic little reporting application that has an open core. So this got me thinking... Can I put these two applications together and create a Reporting Layer with report embedding capabilities that is deployable in the cluster and has a admin UI accesible over a web page all whilst keeping the data locked to our network?</p>
<h3>The Beginnings of an Idea</h3>
<p>Ok so... Big first question: can DuckDB and Metabase talk? Well... not quite. But first let's take a quick look at the architecture we'll be employing here</p>
<p><img alt="Duckdb Architecture" height="auto" width="100%" src="http://localhost:8000/images/metabase_duckdb.png"></p>
<p>But you'll notice this pretty glossed-over line, "Connector": that right there is the clincher. So what is this "Connector"?</p>
<p>To deep dive into this would take a whole blog, so to give you something to quickly wrap your head around: it's the glue that lets metabase query your data source. The reality is it's a jdbc driver compiled against metabase.</p>
<p>Thankfully Metabase point you to a <a href="https://github.com/AlexR2D2/metabase_duckdb_driver">community driver</a> for linking to duckdb (hopefully it will be brought into metabase proper sooner rather than later).</p>
<p>Now, the released driver is still compiled against duckdb 0.8 while 0.9 is the latest stable, but hopefully the <a href="https://github.com/AlexR2D2/metabase_duckdb_driver/pull/19">PR</a> for this will land very soon, giving a quick way to link metabase to the latest and greatest in duckdb.</p>
<h3>But How do we get Data?</h3>
<p>Brilliant. Using the recommended Dockerfile we can load up a metabase container with the duckdb driver pre-built.</p>
<div class="highlight"><pre><span></span><code><span class="n">FROM</span><span class="w"> </span><span class="n">openjdk</span><span class="p">:</span><span class="mi">19</span><span class="o">-</span><span class="n">buster</span>
<span class="n">ENV</span><span class="w"> </span><span class="n">MB_PLUGINS_DIR</span><span class="o">=/</span><span class="n">home</span><span class="o">/</span><span class="n">plugins</span><span class="o">/</span>
<span class="n">ADD</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">downloads</span><span class="o">.</span><span class="n">metabase</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">v0</span><span class="o">.</span><span class="mf">46.2</span><span class="o">/</span><span class="n">metabase</span><span class="o">.</span><span class="n">jar</span><span class="w"> </span><span class="o">/</span><span class="n">home</span>
<span class="n">ADD</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">AlexR2D2</span><span class="o">/</span><span class="n">metabase_duckdb_driver</span><span class="o">/</span><span class="n">releases</span><span class="o">/</span><span class="n">download</span><span class="o">/</span><span class="mf">0.1</span><span class="o">.</span><span class="mi">6</span><span class="o">/</span><span class="n">duckdb</span><span class="o">.</span><span class="n">metabase</span><span class="o">-</span><span class="n">driver</span><span class="o">.</span><span class="n">jar</span><span class="w"> </span><span class="o">/</span><span class="n">home</span><span class="o">/</span><span class="n">plugins</span><span class="o">/</span>
<span class="n">RUN</span><span class="w"> </span><span class="n">chmod</span><span class="w"> </span><span class="mi">744</span><span class="w"> </span><span class="o">/</span><span class="n">home</span><span class="o">/</span><span class="n">plugins</span><span class="o">/</span><span class="n">duckdb</span><span class="o">.</span><span class="n">metabase</span><span class="o">-</span><span class="n">driver</span><span class="o">.</span><span class="n">jar</span>
<span class="n">CMD</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;java&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;-jar&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;/home/metabase.jar&quot;</span><span class="p">]</span>
</code></pre></div>
<p>Great. Now the big question: how do we get the data into the damn thing? Interestingly, when I was initially designing this I had the thought of leveraging the in-memory capabilities of duckdb and pulling in from the parquet on s3 directly as needed; after all, the cluster is on AWS, so the s3 API requests should be unbelievably fast anyway. Why bother with a persistent database?</p>
<p>Now that we have the default credentials chain, it is trivial to query parquet straight from s3</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">read_parquet</span><span class="p">(</span><span class="s1">&#39;s3://&lt;bucket&gt;/&lt;file&gt;&#39;</span><span class="p">);</span>
</code></pre></div>
<p>However, if you're reading directly off parquet you suddenly need to consider the partitioning, and I also found that, if the parquet is being actively written to at the time of querying, duckdb has a hissy fit about metadata not matching the query. Needless to say, duckdb and streaming parquet are not happy bedfellows (<em>and frankly were not designed to be, so this is ok</em>). And the idea of trying to explain all this to the run-of-the-mill reporting analyst, who I hope is a business sort of person rather than a tech one, honestly gave me hives... so I had to make it easier.</p>
<p>The compromise occurred to me... the curated layer is only built daily for reporting, and using that, I could create a duckdb file on disk that could be loaded into the metabase container itself.</p>
<p>With some very simple python as an operation in our orchestrator, I had a job that would read directly from our curated parquet and create a duckdb file with it... without giving away too much, the job primarily consisted of this (I've filled in the imports and config values so it stands alone; those names are illustrative):</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">duckdb_builder</span><span class="p">(</span><span class="n">table</span><span class="p">):</span>
<span class="n">conn</span> <span class="o">=</span> <span class="n">duckdb</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="s2">&quot;curated_duckdb.duckdb&quot;</span><span class="p">)</span>
<span class="n">conn</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;CALL load_aws_credentials(&#39;</span><span class="si">{</span><span class="n">aws_profile</span><span class="si">}</span><span class="s2">&#39;)&quot;</span><span class="p">)</span>
<span class="c1">#This removes a lot of weirdass ANSI in logs you DO NOT WANT</span>
<span class="n">conn</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;PRAGMA enable_progress_bar=false&quot;</span><span class="p">)</span>
<span class="n">log</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Create </span><span class="si">{</span><span class="n">table</span><span class="si">}</span><span class="s2"> in duckdb&quot;</span><span class="p">)</span>
<span class="n">sql</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">&quot;CREATE OR REPLACE TABLE </span><span class="si">{</span><span class="n">table</span><span class="si">}</span><span class="s2"> AS SELECT * FROM read_parquet(&#39;s3://</span><span class="si">{</span><span class="n">curated_bucket</span><span class="si">}</span><span class="s2">/</span><span class="si">{</span><span class="n">table</span><span class="si">}</span><span class="s2">/*&#39;)&quot;</span>
<span class="n">conn</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="n">sql</span><span class="p">)</span>
<span class="n">log</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><span class="n">table</span><span class="si">}</span><span class="s2"> Created&quot;</span><span class="p">)</span>
</code></pre></div>
<p>And then an upload to an s3 bucket.</p>
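<p>For illustration, a minimal sketch of that upload step might look like the following; the bucket name is made up, and boto3 is the only assumption beyond the build job above.</p>
<div class="highlight"><pre><span></span><code>import boto3

def upload_duckdb(file_name="curated_duckdb.duckdb", bucket="reporting-artifacts"):
    # Push the freshly built duckdb file up for the metabase container to collect
    # (bucket name is illustrative)
    s3 = boto3.client("s3")
    s3.upload_file(file_name, bucket, file_name)

upload_duckdb()
</code></pre></div>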
<p>This of course necessitated a cron job baked into the metabase container itself to actually pull the duckdb file in regularly. After some careful analysis of timing (because I'm too lazy to implement message queues) I set up an s3 cp job that could be cronned directly from the container itself. This gives us a self-updating metabase container with a duckdb backend for client-facing reporting right in the interface. AND because the duckdb file is baked right into the container... there are NO associated s3 or DPU costs (merely the cost of running a relatively large container).</p>
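<p>The helper script itself isn't shown here, but a minimal sketch of what download_duckdb.py might do is below; the bucket name is again illustrative, and /duckdb_data matches the directory created in the Dockerfile below.</p>
<div class="highlight"><pre><span></span><code>import boto3

BUCKET = "reporting-artifacts"  # illustrative: whatever bucket the orchestrator uploads to
KEY = "curated_duckdb.duckdb"

def main():
    s3 = boto3.client("s3")
    # Overwrite the file the metabase duckdb connection points at
    s3.download_file(BUCKET, KEY, f"/duckdb_data/{KEY}")

if __name__ == "__main__":
    main()
</code></pre></div>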
<p>The final Dockerfile looks like this</p>
<div class="highlight"><pre><span></span><code><span class="n">FROM</span><span class="w"> </span><span class="n">openjdk</span><span class="p">:</span><span class="mi">19</span><span class="o">-</span><span class="n">buster</span>
<span class="n">ENV</span><span class="w"> </span><span class="n">MB_PLUGINS_DIR</span><span class="o">=/</span><span class="n">home</span><span class="o">/</span><span class="n">plugins</span><span class="o">/</span>
<span class="n">ADD</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">downloads</span><span class="o">.</span><span class="n">metabase</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">v0</span><span class="o">.</span><span class="mf">47.6</span><span class="o">/</span><span class="n">metabase</span><span class="o">.</span><span class="n">jar</span><span class="w"> </span><span class="o">/</span><span class="n">home</span>
<span class="n">ADD</span><span class="w"> </span><span class="n">duckdb</span><span class="o">.</span><span class="n">metabase</span><span class="o">-</span><span class="n">driver</span><span class="o">.</span><span class="n">jar</span><span class="w"> </span><span class="o">/</span><span class="n">home</span><span class="o">/</span><span class="n">plugins</span><span class="o">/</span>
<span class="n">RUN</span><span class="w"> </span><span class="n">chmod</span><span class="w"> </span><span class="mi">744</span><span class="w"> </span><span class="o">/</span><span class="n">home</span><span class="o">/</span><span class="n">plugins</span><span class="o">/</span><span class="n">duckdb</span><span class="o">.</span><span class="n">metabase</span><span class="o">-</span><span class="n">driver</span><span class="o">.</span><span class="n">jar</span>
<span class="n">RUN</span><span class="w"> </span><span class="n">mkdir</span><span class="w"> </span><span class="o">-</span><span class="n">p</span><span class="w"> </span><span class="o">/</span><span class="n">duckdb_data</span>
<span class="n">COPY</span><span class="w"> </span><span class="n">entrypoint</span><span class="o">.</span><span class="n">sh</span><span class="w"> </span><span class="o">/</span><span class="n">home</span>
<span class="n">COPY</span><span class="w"> </span><span class="n">helper_scripts</span><span class="o">/</span><span class="n">download_duckdb</span><span class="o">.</span><span class="n">py</span><span class="w"> </span><span class="o">/</span><span class="n">home</span>
<span class="n">RUN</span><span class="w"> </span><span class="n">apt</span><span class="o">-</span><span class="n">get</span><span class="w"> </span><span class="n">update</span><span class="w"> </span><span class="o">-</span><span class="n">y</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w"> </span><span class="n">apt</span><span class="o">-</span><span class="n">get</span><span class="w"> </span><span class="n">upgrade</span><span class="w"> </span><span class="o">-</span><span class="n">y</span>
<span class="n">RUN</span><span class="w"> </span><span class="n">apt</span><span class="o">-</span><span class="n">get</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">python3</span><span class="w"> </span><span class="n">python3</span><span class="o">-</span><span class="n">pip</span><span class="w"> </span><span class="n">cron</span><span class="w"> </span><span class="o">-</span><span class="n">y</span>
<span class="n">RUN</span><span class="w"> </span><span class="n">pip3</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">boto3</span>
<span class="n">RUN</span><span class="w"> </span><span class="n">crontab</span><span class="w"> </span><span class="o">-</span><span class="n">l</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">cat</span><span class="p">;</span><span class="w"> </span><span class="n">echo</span><span class="w"> </span><span class="s2">&quot;0 */6 * * * python3 /home/helper_scripts/download_duckdb.py&quot;</span><span class="p">;</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">crontab</span><span class="w"> </span><span class="o">-</span>
<span class="n">CMD</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;bash&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;/home/entrypoint.sh&quot;</span><span class="p">]</span>
</code></pre></div>
<p>And there we have it... an in-memory containerised reporting solution with blazing-fast capability to aggregate and build reports based on curated data directly from the business... fully automated, deployable via CI/CD, and providing daily data updates.</p>
<p>Now the embedded part... which isn't built yet, but I'll make sure to update you once we have (if we do) because the architecture is very exciting for an embedded reporting workflow that is deployable via CI/CD processes to applications. As a little taster I'll point you to the <a href="https://www.metabase.com/learn/administration/git-based-workflow">metabase documentation</a>; the unfortunate thing is that Metabase <em>have</em> hidden this behind the enterprise license... but I can absolutely see why. If we get to implementing this I'll be sure to update you here on the learnings.</p>
<p>Until then....</p>
</article>
<hr>
</div>
</div>
</div>
<hr>
<!-- Footer -->
<footer>
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<p>
<script type="text/javascript" src="https://sessionize.com/api/speaker/sessions/83c5d14a-bd19-46b4-8335-0ac8358ac46d/0x0x91929ax">
</script>
</p>
<ul class="list-inline text-center">
<li>
<a href="https://twitter.com/ar17787">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-twitter fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
<li>
<a href="https://facebook.com/ar17787">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-facebook fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
<li>
<a href="https://github.com/armistace">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-github fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
</ul>
<p class="copyright text-muted">Blog powered by <a href="http://getpelican.com">Pelican</a>,
which takes great advantage of <a href="http://python.org">Python</a>.</p>
</div>
</div>
</div>
</footer>
<!-- jQuery -->
<script src="http://localhost:8000/theme/js/jquery.js"></script>
<!-- Bootstrap Core JavaScript -->
<script src="http://localhost:8000/theme/js/bootstrap.min.js"></script>
<!-- Custom Theme JavaScript -->
<script src="http://localhost:8000/theme/js/clean-blog.min.js"></script>
</body>
</html>

View File

@ -0,0 +1,171 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
<meta name="author" content="">
<title>Andrew Ridgway's Blog</title>
<link href="http://localhost:8000/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Andrew Ridgway's Blog Full Atom Feed" />
<link href="http://localhost:8000/feeds/data-analytics.atom.xml" type="application/atom+xml" rel="alternate" title="Andrew Ridgway's Blog Categories Atom Feed" />
<!-- Bootstrap Core CSS -->
<link href="http://localhost:8000/theme/css/bootstrap.min.css" rel="stylesheet">
<!-- Custom CSS -->
<link href="http://localhost:8000/theme/css/clean-blog.min.css" rel="stylesheet">
<!-- Code highlight color scheme -->
<link href="http://localhost:8000/theme/css/code_blocks/tomorrow.css" rel="stylesheet">
<!-- Custom Fonts -->
<link href="http://maxcdn.bootstrapcdn.com/font-awesome/4.1.0/css/font-awesome.min.css" rel="stylesheet" type="text/css">
<link href='http://fonts.googleapis.com/css?family=Lora:400,700,400italic,700italic' rel='stylesheet' type='text/css'>
<link href='http://fonts.googleapis.com/css?family=Open+Sans:300italic,400italic,600italic,700italic,800italic,400,300,600,700,800' rel='stylesheet' type='text/css'>
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
<meta name="tags" contents="data engineering" />
<meta name="tags" contents="Data Analytics" />
<meta property="og:locale" content="en">
<meta property="og:site_name" content="Andrew Ridgway's Blog">
<meta property="og:type" content="article">
<meta property="article:author" content="">
<meta property="og:url" content="http://localhost:8000/notebook-or-bi.html">
<meta property="og:title" content="Notebook or BI, What is the most appropiate communication medium">
<meta property="og:description" content="">
<meta property="og:image" content="http://localhost:8000/">
<meta property="article:published_time" content="2023-07-13 20:00:00+10:00">
</head>
<body>
<!-- Navigation -->
<nav class="navbar navbar-default navbar-custom navbar-fixed-top">
<div class="container-fluid">
<!-- Brand and toggle get grouped for better mobile display -->
<div class="navbar-header page-scroll">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="http://localhost:8000/">Andrew Ridgway's Blog</a>
</div>
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav navbar-right">
</ul>
</div>
<!-- /.navbar-collapse -->
</div>
<!-- /.container -->
</nav>
<!-- Page Header -->
<header class="intro-header" style="background-image: url('http://localhost:8000/theme/images/post-bg.jpg')">
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<div class="post-heading">
<h1>Notebook or BI, What is the most appropriate communication medium</h1>
<span class="meta">Posted by
<a href="http://localhost:8000/author/andrew-ridgway.html">Andrew Ridgway</a>
on Thu 13 July 2023
</span>
</div>
</div>
</div>
</div>
</header>
<!-- Main Content -->
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<!-- Post Content -->
<article>
<p>I want to preface this post by saying I think "Dashboards" or "BI" as terms are wayyyyyyyyyyyyyyyyy oversaturated in the market. There seems to be a belief that any question answerable in data deserves the work associated with a dashboard when in fact a simple one-off report, or notebook, would be more than enough.</p>
</article>
<hr>
</div>
</div>
</div>
<hr>
<!-- Footer -->
<footer>
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<p>
<script type="text/javascript" src="https://sessionize.com/api/speaker/sessions/83c5d14a-bd19-46b4-8335-0ac8358ac46d/0x0x91929ax">
</script>
</p>
<ul class="list-inline text-center">
<li>
<a href="https://twitter.com/ar17787">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-twitter fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
<li>
<a href="https://facebook.com/ar17787">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-facebook fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
<li>
<a href="https://github.com/armistace">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-github fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
</ul>
<p class="copyright text-muted">Blog powered by <a href="http://getpelican.com">Pelican</a>,
which takes great advantage of <a href="http://python.org">Python</a>.</p>
</div>
</div>
</div>
</footer>
<!-- jQuery -->
<script src="http://localhost:8000/theme/js/jquery.js"></script>
<!-- Bootstrap Core JavaScript -->
<script src="http://localhost:8000/theme/js/bootstrap.min.js"></script>
<!-- Custom Theme JavaScript -->
<script src="http://localhost:8000/theme/js/clean-blog.min.js"></script>
</body>
</html>

View File

View File

View File

View File

View File

@ -83,8 +83,11 @@
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1"> <div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<h1>Tags for Andrew Ridgway's Blog</h1> <li><a href="http://localhost:8000/tag/amazon.html">Amazon</a> (1)</li> <h1>Tags for Andrew Ridgway's Blog</h1> <li><a href="http://localhost:8000/tag/amazon.html">Amazon</a> (1)</li>
<li><a href="http://localhost:8000/tag/containers.html">containers</a> (1)</li> <li><a href="http://localhost:8000/tag/containers.html">containers</a> (1)</li>
<li><a href="http://localhost:8000/tag/data-engineering.html">data engineering</a> (2)</li> <li><a href="http://localhost:8000/tag/data-engineering.html">data engineering</a> (3)</li>
<li><a href="http://localhost:8000/tag/duckdb.html">DuckDB</a> (1)</li>
<li><a href="http://localhost:8000/tag/embedded.html">embedded</a> (1)</li>
<li><a href="http://localhost:8000/tag/managed-services.html">Managed Services</a> (1)</li> <li><a href="http://localhost:8000/tag/managed-services.html">Managed Services</a> (1)</li>
<li><a href="http://localhost:8000/tag/metabase.html">Metabase</a> (1)</li>
</div> </div>
</div> </div>
</div> </div>