The promised holy grail of data integration
Get all your ELT data pipelines running in minutes, even your custom ones. Let your team focus on insights and innovation.
Boom, bold statement. Taken from the Airbyte website. What a selling point, and one of the promises that piqued our interest in using one of these tools. What else do these tools promise to provide, I hear you say?
I'll combine the main selling points listed on the websites of both platforms for you:
- Faster development speed
- Unification of all pipelines into one platform and standard
- A long list of existing connectors
- Integration with DBT for transformation
- A UI for monitoring the status of your ETL pipelines
Each of the two platforms has unique benefits of its own, so I have not included them in the above list. For these, see the tool comparison later in this post.
I'm sure at this point you understand why we were — let's say — intrigued. Build more robust integrations faster, while following a standard that lets you utilize community-created connectors, does some basic post-integration transformation, and gives the entire team a UI to monitor the status of your pipelines? Insert some heavy drooling here!
Ultimate standoff: Dragon vs. Octopus
Right, so we've set the stage: We're all pretty excited to POC the hell out of these two tools. Here is what we learned:
First off: The tools have too much in common to describe here. But there are a few very distinct differences in the philosophies they follow, so I'll dive into those:
Accessibility versus Version Control
You can feel the emphasis Airbyte puts on accessibility. The web UI is fantastic and leads you through every step of adding new connections, configuring them, and setting up a new ETL pipeline. This is the only way to do it in Airbyte, though, and the result is saved to a database.
Meltano, on the other hand, relies on ETL as code, version-controllable as a YAML file. The YAML can be written by hand, generated using CLI commands, or through a rather rudimentary web UI. Steeper learning curve; potentially more robust.
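To make this concrete: a Meltano project's whole pipeline setup lives in a `meltano.yml` file at the project root. A minimal sketch might look like the following — the plugin names, variants, and config values here are illustrative, not taken from our actual setup:

```yaml
version: 1
default_environment: dev
environments:
  - name: dev
  - name: prod
plugins:
  extractors:
    - name: tap-github          # example extractor, pulled from the plugin registry
      variant: meltanolabs
      config:
        repositories: ["my-org/my-repo"]
  loaders:
    - name: target-postgres     # example loader
      variant: meltanolabs
```

This is the file you check into git, which is exactly what makes the pipeline setup version-controllable.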
Meltano takes the point!
Which one you prefer I'll leave up to you. I personally would go for Meltano on this one. I like being able to version control the pipeline setup and interact in a more direct fashion with the created configuration.
Development experience and deploying a custom tap
It's the worst-case scenario: The connector you need does not exist and you have to get your hands dirty yourself. For which tool would I rather do it?
Frankly — I wouldn't mind either. Both SDKs are very similar. While Airbyte relies on Docker and plain Python CLI commands for running the connector during development and for deploying it, Meltano relies on a few Python tools like Poetry and cookiecutter for the same thing. In my opinion, both options work totally fine, are easy to use, and get the job done.
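For context on what these SDKs generate: Meltano taps follow the Singer specification, where a tap is just a program that emits JSON messages (SCHEMA, RECORD, STATE) on stdout; Airbyte's protocol is similar in spirit. Stripped of any SDK, a toy Singer-style tap can be sketched in plain Python — the stream name, schema, and records below are made up for illustration:

```python
import json

# Hypothetical stream schema; a real tap would derive this from the source.
SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
    },
}

def extract_records():
    # Stand-in for a real API call; hardcoded records for illustration.
    return [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]

def tap_messages(stream="users"):
    """Yield Singer messages: one SCHEMA, then one RECORD per row."""
    yield {
        "type": "SCHEMA",
        "stream": stream,
        "schema": SCHEMA,
        "key_properties": ["id"],
    }
    for record in extract_records():
        yield {"type": "RECORD", "stream": stream, "record": record}

if __name__ == "__main__":
    # Singer taps write one JSON message per line to stdout;
    # a target (the "L" of the pipeline) reads them from stdin.
    for message in tap_messages():
        print(json.dumps(message))
```

The SDKs mostly exist to generate this plumbing for you, so you only fill in the source-specific extraction logic.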
Meltano provides a simple way of setting up developer environments though, with different credentials, endpoints, and so on. Running the full pipeline for testing purposes locally — not a problem with Meltano.
Aside from this, what really sets the two tools apart is the documentation. Here, to me, we have a clear winner: Airbyte. It is better structured, more complete, and easier to follow, and it provides clear guidelines and an instructional video for first-time connector developers. It is a whole lot easier to get started with. After the initial learning curve, though, I don't think this will make much of a difference.
It's a draw!
So all in all, for newcomers Airbyte is a lot easier to swallow. In the long run, Meltano with its simple way of setting up dev environments may be the way to go though.
Integrate the modern data stack
Yay, buzzwords. Both tools integrate with DBT; Meltano additionally integrates with Airflow. In fact, Meltano does not just integrate with them: it can spin both up and manage them itself.
To be fair, though, it looks like this is only worth it if you're not using these tools already. Meltano seems to be quite strict about how they are set up — which is not an issue if you have Meltano spin them up for you, but integrating with existing deployments seems a bit harder.
Airbyte, on the other hand, seems to integrate fine with external deployments of DBT, so that's something. But if you use Airflow anyway to schedule your integration pipelines, you may as well have your transform step as a separate task in your DAG.
No hits dealt!
All in all — if you're not using DBT yet or you don't plan on scheduling your integration pipelines with an external Airflow deployment, this could be useful for you. For us, not so much.
Scale them to the moon
Airbyte is made up of several components under the hood, but the one component doing the heavy lifting is the worker. Airbyte provides extensive documentation on how to scale up and out to several workers to handle workloads of any size.
Meltano, on the other hand, is a little cryptic about scaling. I could not find anything about it in their own documentation; the only pointer I found, in their community Slack, was to scale out by deploying a full Meltano instance to a separate Kubernetes pod for every single production job.
50 points to Airbyte!
For me, Airbyte is the winner here. First of all, they actually provide documentation on how to do it. But scaling workers up and out also feels like the right approach to me, rather than having Meltano run in 50 different pods.
So if I had to — which tool would I use?
It is honestly a hard choice. Both tools have their strengths and weaknesses. My choice would depend on the makeup of your data team.
If you have dedicated data engineers, I would suggest Meltano. The steeper learning curve is, in my opinion, less of an issue, and the ability to version control your entire ETL pipeline setup is definitely worth the trade-off.
If you do not have dedicated data engineers and you just want a great user experience when developing simple taps and getting your pipelines up and running, I would strongly suggest Airbyte.
In the end, they are similar enough that either way you will end up with a very useful tool in your belt. Plus, I have found mentions of both tools planning to support the other tool's standard in a future version, so the lock-in may not be very big either.
So why did we not go forward with these tools?
We wanted these tools to mainly solve 3 pain points for us:
- The lack of unit tests we had for our data integration
- Move the data integrations that do not need the power of Spark away from it. Currently, all our integrations, including batch processing, run on Databricks
- Improve the dev speed of integrating a new data source
Points 1 and 2 would definitely have been solvable by using either of these tools.
But we don't think it would actually increase our speed at this point. Too few of our data sources have existing connectors yet, and developing a new tap against either of these standards takes a bit longer than just doing it freestyle.
The dragon/octopus slayer
The moment we've all been waiting for: the final nail in the coffin of our integration dreams. A round of questionable applause for our dealbreaker!
These tools do not fit into our current integration strategy.
They currently do not allow a data backup step between the E and L of the ELT pipeline, which is the strategy we follow. We back up all the data we pull to a data lake before we even impose a schema on it. It does not look like this is currently possible with these tools.
We could probably MacGyver a solution to this by creating the taps and sources necessary and scheduling two pipelines together in Airflow … but this seems like a terrible way of using these tools. So we will wait for someone to create a clean solution to this (please).
TODO: Insert existential crisis here
Well … not quite. We decided to take a more incremental approach to solving our issues instead. We'll start by firing up our unit testing efforts, as for this we just need to figure out a way of doing it in Databricks (there may be a post coming about this soon).
Then, we'll tackle our integration dev speed by creating more reusable code across our integration pipelines.
I hope this entire post will be outdated very soon and that Airbyte and/or Meltano become mature enough for me to revisit them, as I firmly believe in their potential. I just don't think they are there yet.
If you disagree with the findings presented in this post, either because the tools have changed or because you think I missed something, please let me know.
You can find the Meltano tap I developed during the POC in this GitHub repository.
Finally — a massive thanks to my employer Kolibri Games for letting me share our findings. Feel free to take a gander at our open positions; it is a fantastic place to work!