blog/src/output/feeds/all-en.atom.xml

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Andrew Ridgway's Blog</title><link href="http://localhost:8000/" rel="alternate"></link><link href="http://localhost:8000/feeds/all-en.atom.xml" rel="self"></link><id>http://localhost:8000/</id><updated>2023-10-18T20:00:00+10:00</updated><entry><title>Metabase and DuckDB</title><link href="http://localhost:8000/metabase-duckdb.html" rel="alternate"></link><published>2023-10-18T20:00:00+10:00</published><updated>2023-10-18T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-10-18:/metabase-duckdb.html</id><summary type="html">&lt;p&gt;Using Metabase and DuckDB to create an embedded Reporting Container bringing the data as close to the report as possible&lt;/p&gt;</summary><content type="html">&lt;p&gt;Ahhhh &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt;. If you're even partly floating around in the data space you've probably been hearing A LOT about it and its &lt;em&gt;"Datawarehouse on your laptop"&lt;/em&gt; mantra. However, the OTHER application that sometimes gets missed is &lt;em&gt;"SQLite for OLAP workloads"&lt;/em&gt;, and it was this concept that, once I grasped it, gave me a very interesting idea... What if we could take the very pretty aggregate layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to the presentation layer of the lake, reducing network latency and... hopefully... having presentation reports run over very large workloads in the blink of an eye. It might even be fast enough to be deployed and embedded.&lt;/p&gt;
&lt;p&gt;However, for this to work we need some form of containerised reporting application... lucky for us there is &lt;a href="https://www.metabase.com/"&gt;Metabase&lt;/a&gt;, a fantastic little reporting application that has an open core. So this got me thinking... can I put these two applications together and create a reporting layer with report-embedding capabilities that is deployable in the cluster and has an admin UI accessible over a web page, all whilst keeping the data locked to our network?&lt;/p&gt;
&lt;h3&gt;The Beginnings of an Idea&lt;/h3&gt;
&lt;p&gt;Ok so... big first question: can DuckDB and Metabase talk? Well... not quite. But first let's take a quick look at the architecture we'll be employing here.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Duckdb Architecture" height="auto" width="100%" src="http://localhost:8000/images/metabase_duckdb.png"&gt;&lt;/p&gt;
&lt;p&gt;But you'll notice this pretty glossed-over line, "Connector". That right there is the clincher. So what is this "Connector"?&lt;/p&gt;
&lt;p&gt;To deep dive into this would take a whole blog of its own, so to give you something to quickly wrap your head around: it's the glue that lets Metabase query your data source. In reality it's a JDBC driver compiled against Metabase.&lt;/p&gt;
&lt;p&gt;Thankfully Metabase points you to a &lt;a href="https://github.com/AlexR2D2/metabase_duckdb_driver"&gt;community driver&lt;/a&gt; for linking to DuckDB (hopefully it will be brought into Metabase proper sooner rather than later).&lt;/p&gt;
&lt;p&gt;The current release of this driver is still compiled against DuckDB 0.8 while 0.9 is the latest stable, but hopefully the &lt;a href="https://github.com/AlexR2D2/metabase_duckdb_driver/pull/19"&gt;PR&lt;/a&gt; for this will land very soon, giving a quick way to link to the latest and greatest in DuckDB from Metabase.&lt;/p&gt;
&lt;h3&gt;But How do we get Data?&lt;/h3&gt;
&lt;p&gt;Brilliant. Using the recommended Dockerfile we can load up a Metabase container with the DuckDB driver pre-built:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;openjdk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;buster&lt;/span&gt;
&lt;span class="n"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MB_PLUGINS_DIR&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;downloads&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;v0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;46.2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;github&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;AlexR2D2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;metabase_duckdb_driver&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;releases&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;chmod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;744&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;
&lt;span class="n"&gt;CMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;java&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;-jar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/home/metabase.jar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Great. Now the big question: how do we get the data into the damn thing? Interestingly, when I was initially designing this I had the thought of leveraging the in-memory capabilities of DuckDB and pulling in the parquet from s3 directly as needed; after all, the cluster is on AWS, so the s3 API requests should be unbelievably fast anyway, so why bother with a persistent database?&lt;/p&gt;
&lt;p&gt;Now that we have the default credentials chain available, it is trivial to read parquet straight from s3:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s3://&amp;lt;bucket&amp;gt;/&amp;lt;file&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;However, if you're reading directly off parquet you suddenly need to consider the partitioning, and I also found out that, if the parquet is being actively written to at the time of querying, DuckDB has a hissy fit about metadata not matching the query. Needless to say, DuckDB and streaming parquet are not happy bedfellows (&lt;em&gt;and frankly were not designed to be, so this is ok&lt;/em&gt;). And the idea of trying to explain all this to the run-of-the-mill reporting analyst, who I hope is a business sort of person rather than a tech one, honestly gave me hives... so I had to make it easier.&lt;/p&gt;
&lt;p&gt;Then the compromise occurred to me... the curated layer is only built daily for reporting, and using that, I could create a DuckDB file on disk that could be loaded into the Metabase container itself.&lt;/p&gt;
&lt;p&gt;With some very simple Python as an operation in our orchestrator I had a job that would read directly from our curated parquet and create a DuckDB file from it. Without giving away too much, the job primarily consisted of this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;duckdb_builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;curated_duckdb.duckdb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CALL load_aws_credentials(&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;aws_profile&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#39;)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;#This removes a lot of weirdass ANSI in logs you DO NOT WANT&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;PRAGMA enable_progress_bar=false&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Create &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; in duckdb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CREATE OR REPLACE TABLE &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; AS SELECT * FROM read_parquet(&amp;#39;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;curated_bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/*&amp;#39;)&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; Created&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And then an upload to an s3 bucket&lt;/p&gt;
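&lt;p&gt;To give a flavour of how those two steps hang together, a rough sketch of the driver code is below. This is illustrative only, not the production job: the table list, bucket name and key are made up, and duckdb_builder is the function from the snippet above.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import boto3

# hypothetical list of curated tables to copy into the duckdb file
curated_tables = ["fact_sales_daily", "dim_customer"]

# duckdb_builder (defined above) writes each table into curated_duckdb.duckdb
for table in curated_tables:
    duckdb_builder(table)

# push the finished duckdb file somewhere the metabase container can fetch it
# (bucket and key are placeholders)
s3 = boto3.client("s3")
s3.upload_file("curated_duckdb.duckdb", "example-reporting-bucket", "duckdb/curated_duckdb.duckdb")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;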
&lt;p&gt;This of course necessitated a cron job baked into the Metabase container itself to actually pull the DuckDB file in every morning. After some careful analysis of timing (because I'm too lazy to implement message queues) I set up an s3 cp job that could be cronned directly from the container itself. This gives us a self-updating Metabase container with a DuckDB backend for client-facing reporting right in the interface. AND because the DuckDB file is baked right into the container... there are NO associated s3 or DPU costs (merely the cost of running a relatively large container).&lt;/p&gt;
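&lt;p&gt;On the container side, the cronned helper script referenced in the Dockerfile below (helper_scripts/download_duckdb.py) only needs to pull that file back down. A minimal sketch of what it could look like, again with a placeholder bucket and key, is:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import boto3

# overwrite the duckdb file that the Metabase DuckDB connection points at
s3 = boto3.client("s3")
s3.download_file(
    "example-reporting-bucket",            # placeholder bucket
    "duckdb/curated_duckdb.duckdb",        # placeholder key
    "/duckdb_data/curated_duckdb.duckdb",  # /duckdb_data is created in the Dockerfile
)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;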
&lt;p&gt;The final Dockerfile looks like this&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;openjdk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;buster&lt;/span&gt;
&lt;span class="n"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MB_PLUGINS_DIR&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;downloads&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;v0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;47.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;chmod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;744&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metabase&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;duckdb_data&lt;/span&gt;
&lt;span class="n"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;entrypoint&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sh&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;helper_scripts&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;download_duckdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cron&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pip3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;crontab&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0 */6 * * * python3 /home/helper_scripts/download_duckdb.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;crontab&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;
&lt;span class="n"&gt;CMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;bash&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/home/entrypoint.sh&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And there we have it... an in-memory, containerised reporting solution with blazing-fast capability to aggregate and build reports over curated data direct from the business... fully automated, deployable via CI/CD, and providing daily data updates.&lt;/p&gt;
&lt;p&gt;Now for the embedded part... which isn't built yet, but I'll make sure to update you once/if we do, because the architecture is very exciting for an embedded reporting workflow that is deployable to applications via CI/CD processes. As a little taster I'll point you to the &lt;a href="https://www.metabase.com/learn/administration/git-based-workflow"&gt;metabase documentation&lt;/a&gt;; the unfortunate thing is that Metabase &lt;em&gt;have&lt;/em&gt; hidden this behind the enterprise license... but I can absolutely see why. If we get to implementing this I'll be sure to update you here on the learnings.&lt;/p&gt;
&lt;p&gt;Until then....&lt;/p&gt;</content><category term="Business Intelligence"></category><category term="data engineering"></category><category term="Metabase"></category><category term="DuckDB"></category><category term="embedded"></category></entry><entry><title>Implementing AppFlow in a Production Datalake</title><link href="http://localhost:8000/appflow-production.html" rel="alternate"></link><published>2023-05-23T20:00:00+10:00</published><updated>2023-05-17T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-05-23:/appflow-production.html</id><summary type="html">&lt;p&gt;How AppFlow simplified a major extraction layer and when I choose managed services&lt;/p&gt;</summary><content type="html">&lt;p&gt;I recently attended a meetup where there was a talk by an AWS spokesperson. Now don't get me wrong, I normally take these things with a grain of salt. At this talk there was this tiny tiny little segment about a product that AWS had released called &lt;a href="https://aws.amazon.com/appflow/"&gt;Amazon AppFlow&lt;/a&gt;. This product &lt;em&gt;claimed&lt;/em&gt; to be able to automate and simplify the link between different API endpoints, REST or otherwise, and send that data on to another destination, whether that is Redshift, Aurora, a general relational DB in RDS, or s3.&lt;/p&gt;
&lt;p&gt;This was particularly interesting to me because I had recently finished creating an s3 datalake in AWS for the company I work for. Today, I finally put my first AppFlow integration to the datalake into production, and I have to say there are some rough edges to the deployment, but it has been more or less as described on the box.&lt;/p&gt;
&lt;p&gt;Over the course of the next few paragraphs I'd like to explain the thinking I had as I investigated the product, and ultimately why I chose a managed service for this over implementing something myself in Python using Dagster, which I have also spun up within our cluster on AWS.&lt;/p&gt;
&lt;h3&gt;Datalake Extraction Layer&lt;/h3&gt;
&lt;p&gt;I often find that the flakiest part of any data solution, or at least a data solution that consumes data other applications create, is the extraction layer. If you are going to get a bug it's going to be here; not always, but in my experience the first port of call is... did it load? :/&lt;/p&gt;
&lt;p&gt;It is why I believe one of the most saturated parts of the enterprise data market is in fact the extraction layer. Every man and his dog (not to mention start-up) seems to be trying to "solve" this problem. The result is often that, as a data architect, you are spoilt for choice. BUT it seems that every different type of connection requires a different extractor, all for varying costs and with varying success.&lt;/p&gt;
&lt;p&gt;The RDBMS extraction space is largely solved, and there are products like &lt;a href="https://www.qlik.com/us/products/qlik-replicate"&gt;Qlik Replicate&lt;/a&gt; or &lt;a href="https://aws.amazon.com/dms/"&gt;AWS DMS&lt;/a&gt;, as well as countless others, that can do this at the CDC level, and they work relatively well, albeit at a considerable cost.&lt;/p&gt;
&lt;p&gt;The API landscape for extraction is particularly saturated. I believe I saw on LinkedIn a graphic showing no less than 50 companies offering extraction from API endpoints. I'm not across all of them, but they largely seem to &lt;em&gt;claim&lt;/em&gt; to achieve the same goal, with varying levels of depth.&lt;/p&gt;
&lt;p&gt;This proliferation of API extractors obviously coincides with the proliferation of SaaS products taking over from the bespoke software that enterprises would once have run, hooked up to their existing enterprise DBs, and used. This new landscape also shows that rather than an enterprise owning their data, they often need the skills, and increasingly the $$$'s, to access it.&lt;/p&gt;
&lt;p&gt;This complexity of access is normally coupled with poor documentation, where it's a crapshoot as to whether there is a Swagger UI, let alone useful API documentation (this is getting better though).&lt;/p&gt;
&lt;h3&gt;So why Managed for Extraction?&lt;/h3&gt;
&lt;p&gt;As you can see above, when you're extracting data it is so often a crapshoot, and writing something bespoke is so incredibly risky that the idea of it gives me hives. I could write a containerised Python function for each of my API extractions, or a small batch loader for RDBMS myself, and have a small cluster of these things extracting from tables and API endpoints, but the thought of managing all of that, especially in a one-man DataOps team, is far too overwhelming.&lt;/p&gt;
&lt;p&gt;And right there are my criteria for choosing a managed service:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Do I want to manage this myself?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Is there any benefit to me managing this?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Is it more cost effective to have someone else manage it?&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Invariably the extraction layer, at least when answering the questions above, gives me the irks, and I just decide to run with a simple managed service where I can point at the source and target, click go, and watch it go brrrrrrrrrrrrr.&lt;/p&gt;
&lt;p&gt;When you couple ease of use with relative reliability, the value proposition of designing bespoke applications for the extraction task rapidly decreases, at least for me.&lt;/p&gt;
&lt;p&gt;And this is why extraction, at least in systems I design, is more often than not handled by a managed service, and why AppFlow, with the concept of a managed service for API calls to s3, was a cool piece of tech I had to swing a chance to play with.&lt;/p&gt;
&lt;h3&gt;AppFlow, The Good, The Bad, The Ugly&lt;/h3&gt;
&lt;p&gt;Using AppFlow turned out to be a largely simple affair, even in Terraform. Once you have the correct authentication tokens it's more or less: select the service you want and then create a "flow" for each endpoint. The complex part is the "Map_All" function for the endpoint. When triggered it automatically creates a 1-to-1 mapping for all fields in the endpoint into the target file (in my case parquet), BUT this actually fundamentally changes the flow you have created and thus causes Terraform to shit the bed. This can be dealt with via a lifecycle rule, but it means schema changes in the endpoint could cause issues in the future.&lt;/p&gt;
&lt;p&gt;All in all, having a managed service handle API endpoint extraction has been great. It has enabled the expansion of a datalake with no bespoke application code managing the extraction of information from API endpoints, which has proved to be a massive time and money saver overall.&lt;/p&gt;
&lt;p&gt;I am yet to play with establishing a custom endpoint and it will be interesting to see just how much work this is compared with writing the code for a bespoke application... sounds like a good blog post if I get to do it one day.&lt;/p&gt;</content><category term="Data Engineering"></category><category term="data engineering"></category><category term="Amazon"></category><category term="Managed Services"></category></entry><entry><title>Dawn of another blog attempt</title><link href="http://localhost:8000/how-i-built-the-damn-thing.html" rel="alternate"></link><published>2023-05-10T20:00:00+10:00</published><updated>2023-05-10T20:00:00+10:00</updated><author><name>Andrew Ridgway</name></author><id>tag:localhost,2023-05-10:/how-i-built-the-damn-thing.html</id><summary type="html">&lt;p&gt;Containers and How I take my learnings from home and apply them to work&lt;/p&gt;</summary><content type="html">&lt;p&gt;So, once again I'm trying this blog thing out. For the first time though I'm not going to make it niche, or cultural, but just whatever I feel like writing about. For a number of years now my day job has been in and around the world of data: starting out as a "Workforce Analyst" (read: downloading CSVs of payroll data and making Excel reports) and over time moving to my current role where I build and design systems for ingesting data from various systems to support analysts and Data Scientists. My hobby, however, has been... well... tech. These two things have over time merged into the weirdness that is my professional life, and I'd like to take elements of this life and share my learnings.&lt;/p&gt;
&lt;p&gt;The core reason for this is that I keep reading that it's great to write. The other is that I've decided getting my thoughts into some form of order might be beneficial, both to me and perhaps to a wider audience. There are so many things I've attempted, succeeded and failed at that, at the very least, it will be worth getting them into a central repository of knowledge so that I, and maybe others, can share and use them as time progresses. I also keep seeing on &lt;a href="https://news.ycombinator.com"&gt;Hacker News&lt;/a&gt; a lot of references to the folks who've been writing blogs since the early days of the internet, and I want to contribute my little piece to what I want the internet to be.&lt;/p&gt;
&lt;p&gt;So strap yourselves in as I take you on my data/self-hosting journey, sprinkled with a little DevOps and data engineering to whet your appetite over the next little while. Sometimes I might even throw in some cultural or political commentary just to keep things spicy!&lt;/p&gt;</content><category term="Data Engineering"></category><category term="data engineering"></category><category term="containers"></category></entry></feed>