Ahhhh, DuckDB. If you're even partly floating around in the data space you've probably been hearing a LOT about it and its "data warehouse on your laptop" mantra. However, the OTHER application that sometimes gets missed is "SQLite for OLAP workloads", and it was this concept that, once I grasped it, gave me a very interesting idea... What if we could take the very pretty aggregate layer of our data (warehouse/lakehouse/lake) and put that data right next to the presentation layer, reducing network latency and... hopefully... having presentation reports run over very large workloads in the blink of an eye? It might even be fast enough that it could be deployed and embedded directly in other applications.
However, for this to work we need some form of containerised reporting application... lucky for us there is Metabase, a fantastic little reporting application with an open core. So this got me thinking... can I put these two applications together and create a reporting layer with report embedding capabilities, deployable in the cluster, with an admin UI accessible over a web page, all whilst keeping the data locked to our network?
The Beginnings of an Idea
Ok so... big first question. Can DuckDB and Metabase talk? Well... not quite. But first, let's take a quick look at the architecture we'll be employing here.
But you'll notice one pretty glossed-over piece, the "Connector", and that right there is the clincher. So what is this "Connector"?
A deep dive into this would take a whole blog post of its own, so to give you something to quickly wrap your head around: it's the glue that lets Metabase query your data source. In reality it's a JDBC driver compiled against Metabase.
Thankfully Metabase points you to a community driver for linking to DuckDB (hopefully it will be brought into Metabase proper sooner rather than later).
At the time of writing, the release of this driver is still compiled against DuckDB 0.8, while 0.9 is the latest stable, but hopefully the PR for this will land very soon, giving a quick way to link to the latest and greatest in DuckDB from Metabase.
But How Do We Get Data?
Brilliant. Using the recommended Dockerfile we can load up a Metabase container with the DuckDB driver pre-built:
# Base image with a JDK to run Metabase
FROM openjdk:19-buster
# Metabase scans this directory for third-party drivers at startup
ENV MB_PLUGINS_DIR=/home/plugins/
# Pull in the Metabase jar and the community DuckDB driver
ADD https://downloads.metabase.com/v0.46.2/metabase.jar /home
ADD https://github.com/AlexR2D2/metabase_duckdb_driver/releases/download/0.1.6/duckdb.metabase-driver.jar /home/plugins/
# Make sure the driver is readable so Metabase can load it
RUN chmod 744 /home/plugins/duckdb.metabase-driver.jar
CMD ["java", "-jar", "/home/metabase.jar"]
Great. Now the big question: how do we get the data into the damn thing? Interestingly, when I was initially designing this I had the thought of leveraging the in-memory capabilities of DuckDB and pulling in the parquet from S3 directly as needed. After all, the cluster is on AWS, so the S3 API requests should be unbelievably fast anyway, so why bother with a persistent database?
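As a rough sketch of that in-memory idea (the bucket path, column names and region below are purely illustrative, not our actual aggregate layer), DuckDB's httpfs extension lets you query parquet straight out of S3 without persisting anything locally:

import duckdb

# Purely in-memory instance: nothing is written to disk
con = duckdb.connect(":memory:")

# httpfs lets DuckDB read parquet straight from S3
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region='eu-west-1'")  # illustrative; credentials go in via SET s3_access_key_id / s3_secret_access_key

# Query the aggregate-layer parquet directly, no load step required
rows = con.execute("""
    SELECT report_date, SUM(revenue) AS total_revenue
    FROM read_parquet('s3://my-lake/aggregates/sales/*.parquet')
    GROUP BY report_date
    ORDER BY report_date
""").fetchall()

for row in rows:
    print(row)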