New features in Airbyte and lakeFS make it easy to send data replicated by Airbyte into a lakeFS repo. See how to leverage this integration in your data pipelines!
If you work in data, chances are you rely on replicating data between different systems to centralize it for analysis. Modern companies produce data from all kinds of systems including relational DBs, marketing products, clickstream tracking, and even the dreaded Excel spreadsheet.
The good news is there are a growing number of services that can handle the rote aspects of data replication for you. One of the fastest-growing is Airbyte, which coincidentally celebrated its first anniversary last week (along with lakeFS, of course).
A look at their Sources page shows nearly 100(!) data sources the platform supports. With Airbyte you can sync data from any of these sources to the destination of your choice.
Airbyte Sync to S3
One of Airbyte’s most common destinations is S3, given its hard-to-beat combination of price, durability, and interoperability.
Storing your data in S3 is pretty neat, but it isn’t perfect either.
First, it provides no way to isolate data producers from data consumers without copying data across multiple buckets or prefixes. Second, it lacks the ability to synchronize multiple datasets that reference each other. Failing to account for this can result in subtle data errors: reporting sales for a product that doesn’t exist, or missing sales for a product that does.
Let me ask you a question: If there was a one-click setting in S3 that you could activate to keep all of its benefits while mitigating the downsides, would you use it?
Airbyte Sync to S3: Enhanced by lakeFS
As mentioned above, when using Airbyte to send data to S3, the default settings write to a typical S3 bucket. With a small settings change, you can instead write to a lakeFS repository created over that bucket.
What are the benefits?
There are several reasons to favor a lakeFS repository over a basic S3 bucket. As the name suggests, within a repo you can perform git-like operations over your data that go beyond the default S3 commands you are used to, such as branch, commit, merge, and revert.
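To make these git-like operations concrete, here is a sketch of the branch-and-merge workflow using the lakectl CLI. The repository and branch names are placeholders, and this assumes a running lakeFS installation:

```shell
# Create an isolated branch off main, e.g. for an ingestion run
lakectl branch create lakefs://example-repo/airbyte-sync \
  --source lakefs://example-repo/main

# Inspect what changed on the branch relative to main
lakectl diff lakefs://example-repo/main lakefs://example-repo/airbyte-sync

# Atomically merge the branch back into main
lakectl merge lakefs://example-repo/airbyte-sync lakefs://example-repo/main
```

Because the merge is atomic, consumers reading from main see either all of the branch’s changes or none of them.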
Remember the issues mentioned earlier related to isolation and consistency?
With lakeFS branches and merge operations, data isolation can be guaranteed. And through atomic merge operations, we can maintain consistency, even across datasets.
What's that setting again?
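Concretely, the change is to point Airbyte’s S3 destination at your lakeFS installation’s S3-compatible gateway instead of S3 itself. A rough sketch of the relevant destination fields follows; the names and values here are illustrative and may differ slightly between Airbyte versions:

```
S3 Endpoint:       https://lakefs.example.com   # your lakeFS S3 gateway, not AWS
S3 Bucket Name:    example-repo                 # the lakeFS repository name
S3 Bucket Path:    main/airbyte                 # leading path element is the branch
S3 Access Key ID:  <lakeFS access key>
S3 Secret Key:     <lakeFS secret key>
```

Everything else about the Airbyte sync stays the same; lakeFS translates the S3 API calls onto versioned objects in the repository.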
Once we save the settings and run a sync, data flows immediately into our lakeFS repository. Switch to the lakeFS UI and you’ll see the replicated data in the repo.
After each Airbyte sync, we recommend committing the data in lakeFS. This allows you to easily time travel to any historical state of the data and recover from unwanted or troublesome syncs.
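A commit after each sync can be a one-liner with lakectl, and time travel back is just reading the data as of an older commit. Repository, branch, and path names below are placeholders:

```shell
# Commit the freshly synced data with a descriptive message
lakectl commit lakefs://example-repo/main -m "Airbyte sync $(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Later: list the commit history, then read data as of any historical commit
lakectl log lakefs://example-repo/main
lakectl fs ls lakefs://example-repo/<commit-id>/airbyte/
```

Each commit is immutable, so a bad sync never overwrites your last known-good state.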
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.