I recently attended a meetup that included a talk by an AWS spokesperson. Now don't get me wrong, I normally take these things with a grain of salt. At this talk there was a tiny little segment about a product AWS had released called Amazon AppFlow. This product claimed to automate and simplify the link between different API endpoints, REST or otherwise, and send that data on to another destination, whether that is Redshift, Aurora, a general relational database in RDS, or S3.
This was particularly interesting to me because I had recently finished building an S3 data lake in AWS for the company I work for. Today, I finally put my first AppFlow integration to the data lake into production, and I have to say there are some rough edges to the deployment, but it has been more or less as described on the box.
Over the course of the next few paragraphs I'd like to explain the thinking I had as I investigated the product, and ultimately why I chose a managed service for this over implementing something myself in Python using Dagster, which I have also spun up within our cluster on AWS.
Datalake Extraction Layer
I often find that the flakiest part of any data solution, or at least a data solution that consumes data other applications create, is the extraction layer. If you're going to get a bug, it's going to be here. Not always, but in my experience the first port of call is... did it load? :/
It is why I believe one of the most saturated parts of the enterprise data market is in fact the extraction layer. Every man and his dog (not to mention every start-up) seems to be trying to "solve" this problem. The result is that, as a data architect, you are spoilt for choice. BUT it seems that every different type of connection requires a different extractor, all at varying costs and with varying success.
The RDBMS extraction space is largely solved. There are products like Qlik Replicate and AWS DMS, as well as countless others, that can do this at the CDC level, and they work relatively well, albeit at a considerable cost.
The API landscape for extraction is particularly saturated. I believe I saw a graphic on LinkedIn showing no less than 50 companies offering extraction from API endpoints. I'm not au fait with all of them, but they largely claim to achieve the same goal, with varying levels of depth.
This proliferation of API extractors obviously coincides with the proliferation of SaaS products taking over from the bespoke software that enterprises would once have run, hooked up to their existing enterprise DBs. This new landscape also means that rather than an enterprise owning their data, they often need the skills, and increasingly the $$$, just to access it.
This complexity of access is normally coupled with poor documentation, where it's a crapshoot as to whether there is even a Swagger UI, let alone useful API documentation (this is getting better, though).
So why Managed for Extraction?
As you can see above, extracting data is so often a crapshoot, and writing something bespoke is so incredibly risky, that the idea of it gives me hives. I could write a containerised Python function for each of my API extractions, or a small batch loader for each RDBMS, and have a small cluster of these things extracting from tables and API endpoints, but the thought of managing all of that, especially in a one-man DataOps team, is far too overwhelming.
And right there are my criteria for choosing a managed service:
- Do I want to manage this myself?
- Is there any benefit to me managing this?
- Is it more cost effective to have someone else manage it?
Invariably, the extraction layer, at least when answering the questions above, gives me the irks, and I just decide to run with a simple managed service where I can point at the source and target, click go, and watch it go brrrrrrrrrrrrr.
When you couple that ease of use with relative reliability, the value proposition of designing bespoke applications for the extraction task rapidly decreases, at least for me.
And this is why extraction, at least in the systems I design, is more often than not handled by a managed service, and why AppFlow, with its concept of a managed service for API calls to S3, was a piece of tech I had to swing a chance to play with.
AppFlow: The Good, The Bad, The Ugly
Using AppFlow turned out to be a largely simple affair, even in Terraform. Once you have the correct authentication tokens, it's more or less a case of selecting the service you want and then creating a "flow" for each endpoint. The complex part is the "Map_All" function for the endpoint. When triggered, it automatically creates a 1-to-1 mapping for every field in the endpoint into the target file (in my case parquet), BUT this fundamentally changes the flow you have created and thus causes Terraform to shit the bed. This can be dealt with via a lifecycle rule, but it means schema changes in the endpoint could cause issues in the future.
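For anyone wondering what that lifecycle rule looks like in practice, here is a minimal sketch of the shape I'm describing, using a hypothetical Salesforce-to-S3 flow. The connector profile, object and bucket names are placeholders, and the exact contents of the task block may differ for your connector; the important bit is the ignore_changes on task, which stops Terraform trying to undo the per-field mappings that Map_All generates.

```hcl
# Hypothetical flow: pull a Salesforce object into the raw zone of an S3
# data lake as parquet. Profile, object and bucket names are placeholders.
resource "aws_appflow_flow" "salesforce_account" {
  name = "salesforce-account-to-datalake"

  source_flow_config {
    connector_type         = "Salesforce"
    connector_profile_name = "my-salesforce-profile" # existing AppFlow connection
    source_connector_properties {
      salesforce {
        object = "Account"
      }
    }
  }

  destination_flow_config {
    connector_type = "S3"
    destination_connector_properties {
      s3 {
        bucket_name   = "my-datalake-raw-bucket"
        bucket_prefix = "salesforce/account"
        s3_output_format_config {
          file_type = "PARQUET"
        }
      }
    }
  }

  # A single Map_all task asks AppFlow to generate the 1-to-1 field mappings
  # itself. Once that happens, AppFlow rewrites the flow into one task per
  # field, which no longer matches what Terraform has in state.
  task {
    source_fields     = [""]
    destination_field = ""
    task_type         = "Map_all"

    connector_operator {
      salesforce = "NO_OP"
    }
  }

  trigger_config {
    trigger_type = "OnDemand"
  }

  # Without this, every plan after the mappings are generated wants to
  # collapse the per-field tasks back into the single Map_all task.
  lifecycle {
    ignore_changes = [task]
  }
}
```

The trade-off, as mentioned above, is that ignoring changes to the task block also means Terraform will stay silent if the endpoint's schema drifts later on.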
All in all, having a managed service handle API endpoint extraction has been great. It has enabled the expansion of the data lake with no bespoke application code to maintain, which has proved to be a massive time and money saver overall.
I am yet to play with establishing a custom endpoint, and it will be interesting to see just how much work that is compared with writing the code for a bespoke application... sounds like a good blog post if I get to do it one day.