Andrew Ridgway's Blog

Building a 5 node Proxmox cluster!

2024-07-24T20:00:00+10:00

A quick summary of this post by AI

I'm going to use AI to summarise this post here because it ended up quite long. I've edited it ;)

Summary:

Quick look at some of the things I've used Proxmox for

I've set up LXC containers for various services like Plex, databases (PostgreSQL, MySQL, MongoDB), Nginx, and file serving, taking advantage of Proxmox's ease of use and integration with standard Linux tools.
I'm using QEMU-based virtual machines (VMs) sparingly due to resource concerns, but have set up a simple Kubernetes cluster across three nodes (Intel NUCs) using VMs. Additionally, I have a development VM for remote coding environments.
My current plans include writing about my Kubernetes setup, Gitea CI/CD pipelines, and other tools like n8n, Grafana, and Matrix.

As part of the summary it came up with this interesting idea of a "follow up". I'm leaving it here as I thought it was an interesting take on what I can write about in the future.

Follow-up Questions:

Kubernetes Cluster:
What challenges did you face while setting up your Kubernetes cluster with k3s and Longhorn? How did you troubleshoot and eventually stabilize the system?
How have you configured resource allocation for your Kubernetes nodes to balance performance and efficiency?
CI/CD with Gitea:
Can you provide more details on how you're integrating LXC containers with your Gitea CI/CD pipelines? What steps are involved in setting up this process?
What triggers deployments or builds in your CI/CD setup, and how do you handle failures or errors?
Monitoring and Logging:
How have you configured monitoring and logging for your Proxmox setup? Are you using tools like Prometheus, Grafana, or others to keep track of your systems' health?
How do you ensure the security and privacy of your data while utilizing these tools?
Future Plans:
You mentioned exploring the idea of having Mistral AI write blog posts based on your notes. Can you elaborate more on this concept? What challenges might arise, and how do you plan to address them?
Are there any other new technologies or projects you're considering for your homelab in the near future?

A Picture is worth a thousand words

Yes, I know the setup is a bit hacky but it works. Below is an image of the original architecture. It's changed a bit but you sort of get what's going on.

The idea

For some time now I have been toying with the idea of a hypervisor. Initially my thoughts were to get some old blade servers and use those. That was until someone pointed out their power requirements. Looking at specs for some of these machines, the power supplies would be 600 to 800 watts, which is fine until you realise that these have redundant power supplies and are now potentially pulling up to 1.5kW of power... I'm not made of money!

I eventually decided I'd use some hardware I already had lying around, including the old server, as well as 3 old Intel NUCs I could pick up for under $100 (4th gen Core i5s upgraded to 16GB DDR3 RAM). I'd also use an old Dell workstation I had lying around to provide space for some storage; it currently has 4TB in RAID 1 on BTRFS, shared via NFS.

All together the 5 machines draw less than 600W of power. Cool, hardware sorted (at least for a little hobby cluster).
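To put those wattages in perspective, here is a rough back-of-the-envelope sketch. It assumes the worst case where each setup draws its full figure around the clock, which real hardware rarely does, so treat the numbers as upper bounds.

```python
HOURS_PER_YEAR = 24 * 365


def annual_kwh(watts: float) -> float:
    """Energy used in a year by a load that draws `watts` continuously."""
    return watts / 1000 * HOURS_PER_YEAR


# Upper-bound figures from the post: one blade server on redundant PSUs
# versus the whole five-machine cluster.
for label, watts in [("blade server", 1500), ("5-node cluster", 600)]:
    print(f"{label}: {annual_kwh(watts):,.0f} kWh/year")
```

That works out to roughly 13,140 kWh a year against 5,256 kWh. Multiply by your own electricity tariff to turn it into dollars.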

The platform for the Idea!

After doing some amazing reddit research and looking at various homelab ideas for doing what I wanted, it became very, very clear that Proxmox was going to be the solution. It's a Debian-based, open source hypervisor that, for the cost of an annoying little nag when you log in and some manual deb repo config, gives you an enterprise-grade hypervisor ready to spin up VMs and "LXCs" (Linux containers, a bit like jails)... These have turned out to be really, really useful, but more on that later.

First, let's define what on earth Proxmox is.

Proxmox

Proxmox VE (Virtual Environment) is an open-source server virtualization platform that has gained significant popularity among home lab enthusiasts due to its robustness, ease of use, and impressive feature set. Here's why Proxmox stands out as a fantastic choice for homelab clusters:

Simultaneous Management of LXC Containers and VMs: Proxmox VE allows you to manage both Linux Container (LXC) guests and Virtual Machines (VMs) under a single, intuitive web interface or via the command line. This makes it incredibly convenient to run diverse workloads on your homelab cluster.

For instance, you might use LXC containers for lightweight tasks like web servers, mail servers, or development environments due to their low overhead and fast start-up times. Meanwhile, VMs are perfect for heavier workloads that require more resources or require full system isolation, such as database servers or Windows-based applications.

Efficient Resource Allocation: Proxmox VE provides fine-grained control over resource allocation, allowing you to specify resource limits (CPU, memory, disk I/O) for both LXC containers and VMs on a per-guest basis. This ensures that your resources are used efficiently, even when running mixed workloads.
Live Migration: One of the standout features of Proxmox VE is its support for live migration of both LXC containers and VMs between nodes in your cluster. This enables you to balance workloads dynamically, perform maintenance tasks without downtime, and make the most out of your hardware resources.
High Availability: The built-in high availability feature allows you to set up automatic failover for your critical services running as LXC containers or VMs. In case of a node failure, Proxmox VE will automatically migrate the guests to another node in the cluster, ensuring minimal downtime.
Open-Source and Free: Being open-source and free (with optional paid support), Proxmox VE is an attractive choice for budget-conscious home lab enthusiasts who want to explore server virtualization without breaking the bank. It also offers a large community of users and developers, ensuring continuous improvement and innovation.

Proxmox VE is an incredibly useful platform for homelab clusters due to its ability to manage both LXC containers and VMs efficiently, along with its advanced features like live migration and high availability. Whether you're looking to run diverse workloads or experiment with virtualization technologies, Proxmox VE is definitely worth considering.

Relevant Links:

Official Proxmox VE website: https://www.proxmox.com/
Proxmox VE documentation: https://pve-proxmox-community.org/
Proxmox VE forums: https://forum.proxmox.com/

I'd like to thank the mistral-nemo LLM for writing that ;)

LXC's

To start to understand Proxmox we do need to focus in on one important piece: LXCs. These are containers, but not Docker containers. Below I've had mistral summarise some of the differences.

Isolation Level:

LXC uses Linux's built-in features like cgroups and namespaces for containerization. This provides a high degree of isolation between containers.
Docker also uses these features but it adds an additional layer called the "Docker Engine" which manages many aspects of the containers, including networking, storage, etc.

System Call Filtering:

LXC supports system call filtering through seccomp profiles, but how restrictive the filtering is depends on how the container is configured.
Docker applies a default seccomp profile that blocks a set of system calls, and containers can be restricted further with tools like AppArmor or by running in "rootless" mode.

Resource Management:

LXC has built-in support for cgroup hierarchy management and does not enforce strict limits by default.
Docker does not enforce resource limits by default either; limits are set per container with flags such as --memory and --cpus.

Networking:

In LXC, each container gets its own network namespace, typically attached to a bridge on the host. Networking is managed using traditional Linux tools like ip or bridge-utils.
Docker provides a custom networking model with features like user-defined networks, service discovery, and automatic swarm mode integration.

What LXC is Focused On:

Given these differences, here's what LXC primarily focuses on:

Simplicity and Lightweightness: LXC aims to provide a lightweight containerization solution by utilizing only Linux's built-in features with minimal overhead. This makes it appealing for systems where resource usage needs to be kept at a minimum.
Control and Flexibility: By not adding an extra layer like Docker Engine, LXC gives users more direct control over their containers. This can make it easier to manage complex setups or integrate with other tools.
Integration with Traditional Linux Tools: Since LXC uses standard Linux tools for networking (like ip and bridge-utils) and does not add its own layer, it integrates well with traditional Linux systems administration practices.
Use Cases Where Fine-grained Control is Required: Because of its flexible nature, LXC can be useful in scenarios where fine-grained control over containerization is required. For example, in scientific computing clusters or high-performance computing environments where every bit of performance matters.

So, while Docker provides a more polished and feature-rich container ecosystem, LXC offers a simple, lightweight, and flexible alternative for those who prefer to have more direct control over their containers and prefer using standard Linux tools.
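Both LXC and Docker lean on kernel namespaces, and you can see them for yourself. The sketch below is only an illustration (the function names are made up for this example): it reads the namespace links Linux exposes under /proc/<pid>/ns and returns an empty result on systems where that directory isn't available.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace name: identifier} for a process, read from /proc/<pid>/ns."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result  # not Linux, or the process is not visible to us
    for name in names:
        try:
            # Each entry is a symlink such as 'net:[4026531840]'
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return result


def shares_namespace(a, b, kind):
    """True when both mappings carry the same identifier for the given namespace kind."""
    return kind in a and a.get(kind) == b.get(kind)


if __name__ == "__main__":
    for name, ident in list_namespaces().items():
        print(f"{name:10s} {ident}")
```

Run it once on the Proxmox host and once inside a container, then compare the `net` or `pid` entries. Different identifiers mean the container really is living in its own namespace.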

Ever since I discovered Proxmox LXC containers, my server management has been a breeze. For my Plex setup, it's perfect: it isolates each instance and keeps resources in check, and by using device passthrough I can get a graphics card in there for some sweet, sweet hardware decoding. Same goes for my databases; PostgreSQL, MySQL, and MongoDB all run smoothly as individual LXCs. Nginx, too, has found its home here, handling reverse proxy duties without breaking a sweat. And for file serving, what could be better than having a dedicated LXC for that? It's like having my own little server farm right at my fingertips!

The LXCs have also been super easy to set up with the help of tteck's Proxmox Helper Scripts. It was very sad to hear he had gotten sick and I really hope he gets well soon!

VM's

Proxmox uses the open-source QEMU hypervisor for hardware virtualization, enabling it to create and manage multiple isolated virtual machines on a single physical host. QEMU, which stands for Quick Emulator, is a full system emulator that can run different operating systems on a host machine. When used in conjunction with Proxmox's built-in web-based interface and clustering capabilities, QEMU provides numerous advantages for VM management. These include live migration of running VMs between nodes without downtime, efficient resource allocation, support for KVM (Kernel-based Virtual Machine) full virtualization using hardware-assisted virtualization technologies like Intel VT-x or AMD-V, and the ability to manage and monitor VMs through Proxmox's intuitive web interface. Additionally, QEMU's open-source nature allows Proxmox users to leverage a large community of developers for ongoing improvements and troubleshooting!

Again I'd like to thank mistral-nemo for that very informative piece of prose ;)

The big question here is what do I use the VM capability of Proxmox for?

I actually try to avoid their use as I don't want the massive use of resources. However, part of the hardware design I came up with was to use the 3 old Intel NUCs as predominantly a Kubernetes cluster, and so I have 3 VMs spread across those nodes that act as my very simple Kubernetes cluster. I also have a VM I turn on and off as required that can act as a development machine and gives me remote VS Code or Zed environments. (I look forward to writing a blog post on Zed and how that's gone for me.)

I do look forward to writing a separate post about how the Kubernetes cluster has gone. I have used k3s and Longhorn and it hasn't been a rosy picture, but after a couple of months I finally seem to have landed on a stable system.

Anyways, hopefully this gives a pretty quick overview of my new cluster and some of the technologies it uses. I hope to write a post in the future about the Gitea CI/CD I have set up that leverages Kubernetes and LXCs to get deployment pipelines, as well as some of the things I'm using n8n, Grafana and Matrix for, but I think for right now mistral and I need to sign off and get posting.

Thanks for reading this surprisingly long post (if you got here) and I look forward to updating you on some of the other cool things I'm experimenting with in this new homelab. (Including an idea I'm starting to form of having my mistral instance actually start to write some blogs on this site using notes I write so that my posting can increase... but I need to experiment with that a bit more.)

A Cover Letter

2024-02-23T20:00:00+10:00

To whom it may concern

My name is Andrew Ridgway and I am a Data and Technology professional looking to embark on the next step in my career.

I have over 10 years’ experience in System and Data Architecture, Data Modelling and Orchestration, Business and Technical Analysis and System and Development Process Design. Most of this has been in developing Cloud architectures and workloads on AWS and GCP, including ML workloads using Sagemaker.

In my current role I have proposed, designed and built the data platform currently used by the business. This includes internal and external data products as well as the infrastructure and modelling to support these. This role has seen me liaise with stakeholders at all levels of the business, from Analysts in the Customer Experience team right up to C suite executives, and prepare material for board members. I understand the complexity of communicating complex system design to stakeholders at different levels, and the complexities involved in communicating with both technical and less technical employees, particularly in relation to data and ML technologies.

I have also worked as a technical consultant to many businesses and have assisted with the design and implementation of systems for a wide range of industries including financial services, mining and retail. I understand the complexities created by regulation in these environments and understand that this can sometimes necessitate the use of technologies and designs, including legacy systems and designs, I wouldn’t normally use. I also have a passion for designing systems that enable these organisations to realise the benefits of CI/CD on workloads where they would not traditionally use this capability. In particular I took a very traditional legacy Data Warehousing team and implemented a solution that meant version control was no longer controlled by a daily copy and paste of folders with dates on major updates. My solution involved establishing guidelines for the use of git version control so that this could happen automatically as people committed new code to the core code base. As I have moved into cloud architecture I have made sure to use best practice and ensure everything I build isn’t considered production ready until it is in IAC and deployed through a CI/CD pipeline.

In a personal capacity I am an avid tech and ML enthusiast. I have designed my own cluster, including monitoring and deployment, that runs several services that my family uses including chat and DNS, and I am in the process of designing a “set and forget” system that will allow me to have multi-user tenancies on hardware I operate. This should enable us to have the niceties of cloud services like email, storage and scheduling with the safety of knowing where that data is stored and exactly how it is used. I also like to design small IoT devices out of Arduino boards, allowing me to monitor and control different facets of our house like temperature and light.

Currently I am working on a project to merge my skill in SQL Modelling and Orchestration with GPT API’s to try and lessen that burden. You can see some of this work in its very early stages here:

gpt-sql-generator

dbt_sources_generator

I look forward to hearing from you soon.

Sincerely,

Andrew Ridgway

A Resume

2024-02-23T20:00:00+10:00

OVERVIEW

I am a Senior Data Engineer looking to transition my skills to Data and Solution Architecting as well as project management. I have spent the better part of the last decade refining my abilities in taking business requirements and turning those into actionable data engineering, analytics, and software projects with trackable metrics. I believe in agnosticism when it comes to coding languages and have experimented in my own time with many different languages. In my career I have used Python, .NET, PowerShell, TSQL, VB and SAS (multiple products) in an Enterprise capacity. I also have experience using Google Cloud Platform and AWS tools for ETL and data platform development as well as git for version control and deployment using various IAC tools. I have also conducted data analysis and modelling on business metrics to find relationships between both staff and customer behavior and produced actionable recommendations based on the conclusions. In a private context I have also experimented with C, C# and Kotlin. I am looking to further my career by taking my passion for data engineering and analysis as well as web and software development and applying it in a strategic context.

SKILLS & ABILITIES

Python (scripting, compiling, notebooks – Sagemaker, Jupyter)
git
SAS (Base, EG, VA)
Various Google Cloud Tools (Data Fusion, Compute Engine, Cloud Functions)
Various Amazon Tools (EC2, RDS, Kinesis, Glue, Redshift, Lambda, ECS, ECR, EKS)
Streaming Technologies (Kafka, Hive, Spark Streaming)
Various DB platforms both on Prem and Serverless (MariaDB/MySql, Postgres/Redshift, SQL Server, RDS/Aurora variants)
Various Microsoft Products (PowerBI, TSQL, Excel, VBA)
Linux Server Administration (cron, bash, systemD)
ETL/ELT Development
Basic Data Modelling (Kimball, SCD Type 2)
IAC (Cloud Formation, Terraform)
Datahub Deployment
Dagster Orchestration Deployments
DBT Modelling and Design Deployments
Containerised and Cloud Driven Data Architecture

EXPERIENCE

Cloud Data Architect

Redeye Apps

May 2022 - Present

Greenfields Research, Design and Deployment of S3 datalake (Parquet)
AWS DMS, S3, Athena, Glue
Research Design and Deployment of Catalog (Datahub)
Design of Data Governance Process (Datahub driven)
Research Design and Deployment of Orchestration and Modelling for Transforms (Dagster/DBT into Mesos)
CI/CD design and deployment of modelling and orchestration using Gitlab
Research, Design and Deployment of ML Ops Dev pipelines and deployment strategy
Design of ETL/Pipelines (DBT)
Design of Customer Facing Data Products and deployment methodologies (Fully automated via Kafka/Dagster/DBT)

Data Engineer

TechConnect IT Solutions

August 2021 – May 2022

Design of Cloud Data Batch ETL solutions using Python (Glue)
Design of Cloud Data Streaming ETL solution using Python (Kinesis)
Solve complex client business problems using software to join and transform data from DB’s, Web API’s, Application API’s and System logs
Build CI/CD pipelines to ensure smooth deployments (Bitbucket, gitlab)
Apply Prebuilt ML models to software solutions (Sagemaker)
Assist with the architecting of Containerisation solutions (Docker, ECS, ECR)
API testing and development (gRPC, Rest)

Enterprise Data Warehouse Developer

Auto and General Insurance

August 2019 - August 2021

ETL development of CRM, WFP, Outbound Dialer, Inbound switch in Google Cloud, SAS, TSQL
Bringing new data to the business to analyse for new insights
Redeveloped Version Control and brought git to the data team
Introduced python for API enablement in the Enterprise Data Warehouse
Partnering with the business to focus data project on actual need and translating into technical requirements

Business Analyst

Auto and General Insurance

January 2018 - August 2019

Automate Service Performance Reporting using PowerShell/VBA/SAS
Learn and leverage SAS EG and VA to streamline Microsoft Excel Reporting
Identify and develop data pipelines to source data from multiple sources easily and collate into a single source to identify relationships and trends
Technologies used include VBA, PowerShell, SQL, Web API’s, SAS
Where SAS is inappropriate use VBA to automate processes in Microsoft Access and Excel
Gather Requirements to build meaningful reporting solutions
Provide meaningful analysis on business performance and provide relevant presentations and reports to senior stakeholders.

Forecasting and Capacity Analyst

Auto and General Insurance

January 2017 – January 2018

Develop the outbound forecasting model for the Auto and General sales call center by analysing the relationship between customer decisions and workload drivers
This includes the complete data pipeline for the model from identifying and sourcing data, building the reporting and analysing the data and associated drivers.
Forecast inbound workload requirements for the Auto and General sales call center using time series analysis
Learn and leverage the Aspect Workforce Management System to ensure efficiency of forecast generation
Learn and leverage the capabilities of SAS Enterprise Guide to improve accuracy
Liaise with people across the business to ensure meaningful, accurate analysis is provided to senior stakeholders
Analyse monthly, weekly and intraday requirements and ensure forecast is accurately predicting workload for breaks, meetings and Leave

Senior HR Performance Analyst

Queensland Department of Justice and Attorney General

June 2016 - January 2017

Harmonise various systems to develop a unified workforce reporting and analysis framework with appropriate metrics
Use VBA to automate regular reporting in Microsoft Access and Excel
Participate in government process through the production of briefs including Questions on Notice and Estimates Briefs for departmental executives

Workforce Business Analyst

Queensland Department of Justice and Attorney General

July 2015 – June 2016

Develop and refine current workforce analysis techniques and databases
Use VBA to automate regular reporting in Microsoft Access and Excel
Act as liaison between shared service providers and executives and facilitate communication during the implementation of a payroll leave audit
Gather reporting requirements from various business areas and produce ad-hoc and regular reports as required
Participate in government process through the production of briefs including Questions on Notice and Estimates Briefs for departmental executives

EDUCATION

2011 Bachelor of Business Management, University of Queensland
2008 Bachelor of Arts, University of Queensland

REFERENCES

Anthony Stiller Lead Developer, Data warehousing, Queensland Health

0428 038 031

Jaime Brian Head of Cloud Ninjas, TechConnect

0422 012 17

Metabase and DuckDB

2023-11-15T20:00:00+10:00

Ahhhh DuckDB. If you're even partly floating around in the data space you've probably been hearing A LOT about it and its "Datawarehouse on your laptop" mantra. However, the OTHER application that sometimes gets missed is "SQLite for OLAP workloads", and it was this concept that, once I grasped it, gave me a very interesting idea.... What if we could take the very pretty Aggregate Layer of our Data(warehouse/LakeHouse/Lake) and put that data right next to the presentation layer of the lake, reducing network latency and... hopefully... have presentation reports running over very large workloads in the blink of an eye? It might even be fast enough that it could be deployed and embedded.

However, for this to work we need some form of containerised reporting application.... lucky for us there is Metabase, which is a fantastic little reporting application that has an open core. So this got me thinking... Can I put these two applications together and create a Reporting Layer with report embedding capabilities that is deployable in the cluster and has an admin UI accessible over a web page, all whilst keeping the data locked to our network?

The Beginnings of an Idea

Ok so... Big first question. Can DuckDB and Metabase talk? Well... not quite. But first let's take a quick look at the architecture we'll be employing here.

But you'll notice this pretty glossed over line, "Connector". That right there is the clincher. So what is this "Connector"?

To deep dive into this would take a whole blog, so to give you something to quickly wrap your head around: it's the glue that lets Metabase query your data source. In reality it's a JDBC driver compiled against Metabase.

Thankfully Metabase points you to a community driver for linking to DuckDB (hopefully it will be brought into Metabase proper sooner rather than later).

Now, the release of this driver is still compiled against DuckDB 0.8 and 0.9 is the latest stable, but hopefully the PR for this will land very soon, giving a good quick way to link to the latest and greatest in DuckDB from Metabase.

But How do we get Data?

Brilliant. Using the recommended Dockerfile we can load up a Metabase container with the DuckDB driver pre-built:

FROM openjdk:19-buster

ENV MB_PLUGINS_DIR=/home/plugins/

ADD https://downloads.metabase.com/v0.46.2/metabase.jar /home
ADD https://github.com/AlexR2D2/metabase_duckdb_driver/releases/download/0.1.6/duckdb.metabase-driver.jar /home/plugins/

RUN chmod 744 /home/plugins/duckdb.metabase-driver.jar

CMD ["java", "-jar", "/home/metabase.jar"]

Great. Now the big question: how do we get the data into the damn thing? Interestingly, when I was initially designing this I had the thought of leveraging the in-memory capabilities of DuckDB and pulling in from the parquet on s3 directly as needed. After all, the cluster is on AWS so the s3 API requests should be unbelievably fast anyway, so why bother with a persistent database?

Now that we have the default credentials chain it is trivial to call parquet from s3

SELECT * FROM read_parquet('s3://<bucket>/<file>');

However, if you're reading direct off parquet, all of a sudden you need to consider the partitioning, and I also found out that if the parquet is being actively written to at the time of querying, DuckDB has a hissy fit about metadata not matching the query. Needless to say DuckDB and streaming parquet are not happy bedfellows (and frankly were not designed to be, so this is ok). And the idea of trying to explain all this to the run-of-the-mill reporting analyst, who I hope is a business sort of person rather than a tech one, honestly gave me hives... so I had to make it easier.

The compromise occurred to me... the curated layer is only built daily for reporting, and using that, I could create a DuckDB file on disk that could be loaded into the Metabase container itself.

With some very simple Python as an operation in our orchestrator I had a job that would read direct from our curated parquet and create a DuckDB file with it. Without giving away too much, the job primarily consisted of this:

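import logging
import duckdb

log = logging.getLogger(__name__)
aws_profile = "default"      # placeholder: the AWS profile to load credentials from
curated_bucket = "<bucket>"  # placeholder: the bucket holding the curated parquet

def duckdb_builder(table):
    conn = duckdb.connect("curated_duckdb.duckdb")
    conn.sql(f"CALL load_aws_credentials('{aws_profile}')")
    #This removes a lot of weirdass ANSI in logs you DO NOT WANT
    conn.execute("PRAGMA enable_progress_bar=false")
    log.info(f"Create {table} in duckdb")
    sql = f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM read_parquet('s3://{curated_bucket}/{table}/*')"
    conn.sql(sql)
    log.info(f"{table} Created")
    conn.close()  # make sure everything is on disk before the file is uploaded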

And then an upload to an s3 bucket
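The upload step isn't shown here, so the following is only a minimal sketch of what it could look like. The function name, the bucket and the key are made up for illustration, and the S3 client is passed in rather than created inside the function. The one library call it relies on is boto3's `upload_file(Filename, Bucket, Key)` on an S3 client.

```python
import os


def upload_duckdb(s3_client, local_path, bucket, key):
    """Upload the built DuckDB file and return its s3:// URI.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    # Refuse to publish an empty file; getsize raises if the file is missing.
    if os.path.getsize(local_path) == 0:
        raise ValueError(f"{local_path} is empty, refusing to upload")
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   upload_duckdb(boto3.client("s3"), "curated_duckdb.duckdb",
#                 "my-reporting-bucket", "duckdb/curated_duckdb.duckdb")
```

Passing the client in keeps the function easy to test with a stand-in object, which is handy when the job runs inside an orchestrator.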

This of course necessitated a cron job baked into the Metabase container itself to actually pull the DuckDB file in every morning. After some careful analysis of timing (because I'm too lazy to implement message queues) I set up an s3 cp job that could be cronned direct from the container itself. This gives us a self-updating Metabase container with a DuckDB backend for client-facing reporting right in the interface. AND because the DuckDB file is baked right into the container... there are NO associated s3 or DPU costs (merely the cost of running a relatively large container).
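For illustration, a scheduled download of this kind could look something like the sketch below. It is not the actual helper script: the function name, the `.tmp` suffix and the empty-file check are all assumptions. It relies on boto3's `download_file(Bucket, Key, Filename)` and on `os.replace`, which swaps the new file in atomically when both paths are on the same filesystem.

```python
import os


def refresh_duckdb(s3_client, bucket, key, dest_path):
    """Download the latest DuckDB file, then swap it into place in one step.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    tmp_path = dest_path + ".tmp"
    s3_client.download_file(bucket, key, tmp_path)
    if os.path.getsize(tmp_path) == 0:
        os.remove(tmp_path)
        raise ValueError("downloaded file is empty, keeping the previous copy")
    # Readers see either the old file or the new one, never a half-written one.
    os.replace(tmp_path, dest_path)
    return dest_path


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   refresh_duckdb(boto3.client("s3"), "my-reporting-bucket",
#                  "duckdb/curated_duckdb.duckdb", "/duckdb_data/curated_duckdb.duckdb")
```

Downloading to a temporary name first means a failed or partial download never clobbers the copy the reports are reading from. Whether Metabase picks up the swapped file without reconnecting is something to check in your own setup.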

The final Dockerfile looks like this

FROM openjdk:19-buster

ENV MB_PLUGINS_DIR=/home/plugins/

ADD https://downloads.metabase.com/v0.47.6/metabase.jar /home
ADD duckdb.metabase-driver.jar /home/plugins/

RUN chmod 744 /home/plugins/duckdb.metabase-driver.jar

RUN mkdir -p /duckdb_data

COPY entrypoint.sh /home

COPY helper_scripts/download_duckdb.py /home

RUN apt-get update -y && apt-get upgrade -y

RUN apt-get install python3 python3-pip cron -y

RUN pip3 install boto3

RUN crontab -l | { cat; echo "0 */6 * * * python3 /home/download_duckdb.py"; } | crontab -

CMD ["bash", "/home/entrypoint.sh"]

And there we have it... an in memory containerised reporting solution with blazing fast capability to aggregate and build reports based on curated data direct from the business.. fully automated and deployable via CI/CD, that provides data updates daily.

Now the embedded part... which isn't built yet, but I'll make sure to update you once we have/if we do because the architecture is very exciting for an embedded reporting workflow that is deployable via CI/CD processes to applications. As a little taster I'll point you to the Metabase documentation. The unfortunate thing about it is Metabase have hidden this behind the enterprise license... but I can absolutely see why. If we get to implementing this I'll be sure to update you here on the learnings.

Until then....

Implementing Appflow in a Production Datalake

2023-05-23T20:00:00+10:00

I recently attended a meetup where there was a talk by an AWS spokesperson. Now don't get me wrong, I normally take these things with a grain of salt. At this talk there was this tiny tiny little segment about a product that AWS had released called Amazon Appflow. This product claimed to be able to automate and make easy the link between different API endpoints, REST or otherwise and send that data to another point, whether that is Redshift, Aurora, a general relational db in RDS or otherwise or s3.

This was particularly interesting to me because I had recently finished creating an s3 datalake in AWS for the company I work for. Today, I finally put my first Appflow integration to the Datalake into production and I have to say there are some rough edges to the deployment, but it has been more or less as described on the box.

Over the course of the next few paragraphs I'd like to explain the thinking I had as I investigated the product and then ultimately why I chose a managed service for this over implementing something myself in python using Dagster which I have also spun up within our cluster on AWS.

Datalake Extraction Layer

I often find that the flakiest part of any data solution, or at least a data solution that consumes data other applications create, is the extraction layer. If you are going to get a bug it's going to be here. Not always, but in my experience the first port of call is... did it load :/

It is why I believe one of the most saturated parts of the enterprise data market is in fact the extraction layer. It seems every man and his dog (not to mention startup) is trying to "solve" this problem. The result is often that, as a data architect, you are spoilt for choice. BUT it seems that every different type of connection requires a different extractor, all for varying costs and with varying success.

The RDBMS extraction space is largely solved, and there are products like Qlik Replicate or AWS DMS, as well as countless others, that can do this at the CDC level, and they work relatively well, albeit at a considerable cost.

The API landscape for extraction is particularly saturated. I believe I saw on LinkedIn a graphic showing no less than 50 companies offering extraction from API endpoints. I'm not au fait with all of them but they largely seem to claim to achieve the same goal, with varying levels of depth.

This proliferation of API extractors obviously coincides with the proliferation of SaaS products taking over from the bespoke software that enterprises would once have run, hooked up to their existing enterprise DBs. This new landscape also seems to show that rather than an enterprise owning their data, they often need the skills, and increasingly the $$$'s, to access it.

This complexity of access is normally coupled with poor documentation, where it's a crapshoot as to whether there is a Swagger UI, let alone useful API documentation (this is getting better though).

So why Managed for Extraction?

As you see above, when you're extracting data it is so often a crapshoot, and writing something bespoke is so incredibly risky that the idea of it gives me hives. I could write a containerised Python function for each of my API extractions, or a small batch loader for RDBMS myself, and have a small cluster of these things extracting from tables and API endpoints, but the thought of managing all of that, especially in a one-man DataOps team, is far too overwhelming.
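To make the "bespoke" option concrete, here is a minimal sketch of what just one of those Python extractors has to take care of: pagination and retries. Everything in it is an assumption for illustration. The cursor-style paging, the `records` / `next_cursor` keys and the fake endpoint are hypothetical and not taken from any real API I use.

```python
import json
import time
from typing import Callable, Iterable, Iterator, Optional


def extract_all(
    fetch_page: Callable[[Optional[str]], dict],
    max_retries: int = 3,
    backoff_seconds: float = 1.0,
) -> Iterator[dict]:
    """Yield every record from a (hypothetical) cursor-paginated endpoint."""
    cursor = None
    while True:
        # Retry a failed page a few times before giving up entirely.
        for attempt in range(1, max_retries + 1):
            try:
                page = fetch_page(cursor)
                break
            except ConnectionError:
                if attempt == max_retries:
                    raise
                time.sleep(backoff_seconds * attempt)
        yield from page["records"]
        cursor = page.get("next_cursor")
        if cursor is None:
            return


def to_json_lines(records: Iterable[dict]) -> str:
    """Serialise records as JSON Lines, ready to be written to a file or bucket."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


def make_flaky_fake_endpoint() -> Callable[[Optional[str]], dict]:
    """Stand-in for a real API: two pages, and the very first call fails."""
    pages = {
        None: {"records": [{"id": 1}, {"id": 2}], "next_cursor": "p2"},
        "p2": {"records": [{"id": 3}], "next_cursor": None},
    }
    calls = {"count": 0}

    def fetch_page(cursor: Optional[str]) -> dict:
        calls["count"] += 1
        if calls["count"] == 1:
            raise ConnectionError("simulated timeout")
        return pages[cursor]

    return fetch_page


if __name__ == "__main__":
    records = extract_all(make_flaky_fake_endpoint(), backoff_seconds=0)
    print(to_json_lines(records))
```

Even this toy leaves out authentication, rate limits, schema changes, scheduling and alerting, and you would need a variation of it per endpoint. That is the management overhead I am talking about.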

And right there are my criteria for choosing a managed service.

Do I want to manage this myself?
Is there any benefit to me managing this?
Is it more cost effective to have someone else manage it?

Invariably, the extraction layer, at least when answering the questions above, gives me the irks, and I just decide to run with a simple managed service where I can point at the source and target, click go and watch it go brrrrrrrrrrrrr.

When you couple ease of use with the relative reliability, the value proposition of designing bespoke applications for the extraction task rapidly decreases, at least for me.

And this is why extraction, at least in systems I design, is more often than not handled by a managed service, and why AppFlow, with the concept of a managed service for API calls to S3, was a cool tech I had to swing a chance to play with.

AppFlow, The Good, The Bad, The Ugly

Using AppFlow turned out to be a largely simple affair, even in Terraform. Once you have the correct authentication tokens it's more or less a case of selecting the service you want and then creating a "flow" for each endpoint. The complex part is the "Map_All" function for the endpoint. When triggered it automatically creates a 1 - 1 mapping for all fields in the endpoint into the target file (in my case parquet), BUT this actually fundamentally changes the flow you have created and thus causes Terraform to shit the bed. This can be dealt with via a lifecycle rule, but it means schema changes in the endpoint could cause issues in the future.

All in all, having a managed service to handle API endpoint extraction has been great. It has enabled the expansion of the datalake with no bespoke application code to manage for extracting information from API endpoints, which has proved to be a massive time and money saver overall.
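The Map_All problem is easier to see as a toy model. The sketch below is plain Python, not Terraform or the AppFlow API. The `tasks` attribute, the field names and both helper functions are hypothetical and only mimic the behaviour described above. It also assumes the lifecycle rule is the kind that tells Terraform to ignore changes to one attribute.

```python
def expand_map_all(flow: dict, source_fields: list) -> dict:
    """Toy version of the service replacing Map_All with one mapping per field."""
    tasks = [
        {"type": "Map", "source": field, "destination": field}
        for field in source_fields
    ]
    return {**flow, "tasks": tasks}


def plan_diff(declared: dict, actual: dict, ignore: frozenset = frozenset()) -> dict:
    """Return attributes where declared and actual differ, skipping ignored ones."""
    return {
        key: (declared.get(key), actual.get(key))
        for key in sorted(set(declared) | set(actual))
        if key not in ignore and declared.get(key) != actual.get(key)
    }


# What I wrote in my (hypothetical) configuration.
declared = {"name": "crm_contacts", "format": "parquet", "tasks": [{"type": "Map_All"}]}

# What exists after the flow has been triggered once.
actual = expand_map_all(declared, ["id", "email"])

# Without a lifecycle rule, every plan sees the flow as changed.
drift = plan_diff(declared, actual)

# Ignoring the mapping attribute silences the drift...
quiet = plan_diff(declared, actual, ignore=frozenset({"tasks"}))

# ...but a new upstream field is now invisible to the plan as well.
actual_later = expand_map_all(declared, ["id", "email", "phone"])
still_quiet = plan_diff(declared, actual_later, ignore=frozenset({"tasks"}))

if __name__ == "__main__":
    print(sorted(drift), quiet, still_quiet)
```

The last step is the trade-off: the rule that stops the plan from complaining also hides genuine schema changes, so they have to be caught somewhere else.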

I am yet to play with establishing a custom endpoint and it will be interesting to see just how much work this is compared with writing the code for a bespoke application... sounds like a good blog post if I get to do it one day.

Dawn of another blog attempt

2023-05-10T20:00:00+10:00

So, once again I'm trying this blog thing out. For the first time though I'm not going to make it niche, or cultural, but just whatever I feel like writing about. For a number of years now my day job has been in and around the world of data. I started out as a "Workforce Analyst" (read: downloading CSVs of payroll data and making Excel reports) and over time moved to my current role, where I build and design systems for ingesting data from various systems for use by analysts and Data Scientists. My hobby however has been... well.. tech. These two things have over time merged into the weirdness that is my professional life and I'd like to take elements of this life and share my learnings.

The core reason for this is that I keep reading that it's great to write. The other is I've decided that getting my thoughts into some form of order might be beneficial both to me and perhaps a wider audience. There are so many things I've attempted, succeeded and failed at, that, at the very least, it will be worth getting them into a central repository of knowledge that I, and maybe others, can share and use as time progresses. I also keep seeing on Hacker News a lot of references to the guys who've been writing blogs since the early days of the internet, and I want to contribute my little piece to what I want the internet to be.

So strap yourselves in as I take you on my data/self hosting journey, sprinkled with a little dev ops and data engineering to whet your appetite over the next little while. Sometimes I might even throw in some cultural or political commentary just to keep things spicy!