Ahhhh DuckDB. If you're even partly floating around in the data space you've probably been hearing a LOT about it and its "data warehouse on your laptop" mantra. However, the OTHER application that sometimes gets missed is "SQLite for OLAP workloads", and it was this concept that, once I grasped it, gave me a very interesting idea... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to the presentation layer of the lake, reducing network latency and... hopefully... have presentation reports running over very large workloads in the blink of an eye? It might even be fast enough to be deployed and embedded in client-facing applications.
However, for this to work we need some form of containerised reporting application... lucky for us there is Metabase, a fantastic little reporting application with an open core. So this got me thinking... can I put these two applications together and create a Reporting Layer with report embedding capabilities that is deployable in the cluster, has an admin UI accessible over a web page, all whilst keeping the data locked to our network?
The Beginnings of an Idea
Ok so... big first question. Can DuckDB and Metabase talk? Well... not quite. But first let's take a quick look at the architecture we'll be employing here.
But you'll notice this pretty glossed-over line, "Connector". That right there is the clincher. So what is this "Connector"?
A deep dive into this would take a whole blog post, so to give you something to quickly wrap your head around: it's the glue that lets Metabase query your data source. In reality it's a JDBC driver compiled against Metabase.
Thankfully Metabase points you to a community driver for DuckDB (hopefully it will be brought into Metabase proper sooner rather than later).
The current release of this driver is still compiled against DuckDB 0.8, while 0.9 is the latest stable, but hopefully the PR for this will land very soon, giving a quick way to link to the latest and greatest DuckDB from Metabase.
But How Do We Get Data?
Brilliant. Using the recommended Dockerfile we can load up a Metabase container with the DuckDB driver pre-built:
FROM openjdk:19-buster

ENV MB_PLUGINS_DIR=/home/plugins/

ADD https://downloads.metabase.com/v0.46.2/metabase.jar /home
ADD https://github.com/AlexR2D2/metabase_duckdb_driver/releases/download/0.1.6/duckdb.metabase-driver.jar /home/plugins/

RUN chmod 744 /home/plugins/duckdb.metabase-driver.jar

CMD ["java", "-jar", "/home/metabase.jar"]

Great. Now the big question: how do we get the data into the damn thing? Interestingly, when I was initially designing this I had the thought of leveraging the in-memory capabilities of DuckDB and pulling from the parquet on S3 directly as needed. After all, the cluster is on AWS, so the S3 API requests should be unbelievably fast anyway, so why bother with a persistent database?
Now that we have the default credentials chain it is trivial to query parquet straight from S3:
SELECT * FROM read_parquet('s3://<bucket>/<file>');

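In Python terms, the in-memory approach looks roughly like this. This is a sketch only: the bucket/file path stays a placeholder, and I'm assuming the httpfs and aws extensions can be installed and loaded in your environment.

import duckdb

# Purely in-memory: nothing is persisted to disk
conn = duckdb.connect()

# Assumes the httpfs/aws extensions are available to install and load here
conn.execute("INSTALL httpfs")
conn.execute("LOAD httpfs")
conn.execute("INSTALL aws")
conn.execute("LOAD aws")
conn.execute("CALL load_aws_credentials()")  # picks up the default credentials chain

# '<bucket>' and '<file>' are placeholders, as above
result = conn.sql("SELECT * FROM read_parquet('s3://<bucket>/<file>')").df()
print(result.head())
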
However, if you're reading directly off parquet, all of a sudden you need to consider the partitioning, and I also found out that, if the parquet is being actively written to at the time of querying, DuckDB has a hissy fit about the metadata not matching the query. Needless to say, DuckDB and streaming parquet are not happy bedfellows (and frankly were not designed to be, so this is ok). And the idea of trying to explain all this to the run-of-the-mill reporting analyst, who I hope is a business sort of person rather than a techie, honestly gave me hives... so I had to make it easier.
The compromise occurred to me... the curated layer is only built daily for reporting, and using that, I could create a DuckDB file on disk that could be loaded into the Metabase container itself.
With some very simple Python as an operation in our orchestrator, I had a job that would read directly from our curated parquet and create a DuckDB file from it. Without giving away too much, the job primarily consisted of this:
import duckdb

# aws_profile, curated_bucket and log are configured elsewhere in the orchestrator job
def duckdb_builder(table):
    conn = duckdb.connect("curated_duckdb.duckdb")
    conn.sql(f"CALL load_aws_credentials('{aws_profile}')")
    # This removes a lot of weirdass ANSI in logs you DO NOT WANT
    conn.execute("PRAGMA enable_progress_bar=false")
    log.info(f"Create {table} in duckdb")
    sql = f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM read_parquet('s3://{curated_bucket}/{table}/*')"
    conn.sql(sql)
    log.info(f"{table} Created")

And then an upload to an S3 bucket.
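In practice that looks something like the following. Again a sketch only: the table list, bucket name and key are illustrative stand-ins, not the real job config.

import boto3

# Illustrative placeholders, not the real configuration
tables = ["sales", "customers", "orders"]
duckdb_bucket = "curated-duckdb-artifacts"

# Build every curated table into the single DuckDB file...
for table in tables:
    duckdb_builder(table)

# ...then ship the file to S3 so the Metabase container can pull it down later
s3 = boto3.client("s3")
s3.upload_file("curated_duckdb.duckdb", duckdb_bucket, "curated_duckdb.duckdb")
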
This of course necessitated a cron job baked into the Metabase container itself to actually pull the DuckDB file in every morning. After some careful analysis of timings (because I'm too lazy to implement message queues), I set up an s3 cp style job that could be cronned directly from the container itself. This gives us a self-updating Metabase container with a DuckDB backend for client-facing reporting right in the interface. AND, because the DuckDB file is baked right into the container... there are NO associated S3 or DPU costs (merely the cost of running a relatively large container).
The final Dockerfile looks like this:
FROM openjdk:19-buster

ENV MB_PLUGINS_DIR=/home/plugins/

ADD https://downloads.metabase.com/v0.47.6/metabase.jar /home
ADD duckdb.metabase-driver.jar /home/plugins/

RUN chmod 744 /home/plugins/duckdb.metabase-driver.jar

RUN mkdir -p /duckdb_data

COPY entrypoint.sh /home

COPY helper_scripts/download_duckdb.py /home/helper_scripts/

RUN apt-get update -y && apt-get upgrade -y

RUN apt-get install python3 python3-pip cron -y

RUN pip3 install boto3

RUN crontab -l | { cat; echo "0 */6 * * * python3 /home/helper_scripts/download_duckdb.py"; } | crontab -

CMD ["bash", "/home/entrypoint.sh"]

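The download_duckdb.py helper that the crontab calls isn't shown here in full, but it is essentially the reverse of the upload step. A rough sketch, assuming the same placeholder bucket and key as the upload above:

import boto3

# Assumed placeholders matching the upload sketch; the real values live in the job config
duckdb_bucket = "curated-duckdb-artifacts"
duckdb_key = "curated_duckdb.duckdb"

# Pull the latest nightly build of the DuckDB file into the container's data directory
s3 = boto3.client("s3")
s3.download_file(duckdb_bucket, duckdb_key, "/duckdb_data/curated_duckdb.duckdb")
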
And there we have it... a containerised reporting solution with blazing-fast capability to aggregate and build reports over curated data direct from the business, fully automated, deployable via CI/CD, and refreshed with new data daily.
Now for the embedded part... which isn't built yet, but I'll make sure to update you once we have it (if we do), because the architecture is very exciting for an embedded reporting workflow that is deployable via CI/CD processes into applications. As a little taster I'll point you to the Metabase documentation. The unfortunate thing is that Metabase have hidden this behind the enterprise licence... but I can absolutely see why. If we get to implementing this I'll be sure to update you here on the learnings.
Until then....