New features in Airbyte and lakeFS make it easy to send data replicated by Airbyte into a lakeFS repo. See how to leverage this integration in your data pipelines!


If you work in data, chances are you rely on replicating data between different systems to centralize it for analysis. Modern companies produce data from all kinds of systems including relational DBs, marketing products, clickstream tracking, and even the dreaded Excel spreadsheet.
The good news is there are a growing number of services that can handle the rote aspects of data replication for you. One of the fastest growing is Airbyte, which coincidentally celebrated its one-year anniversary last week (along with lakeFS, of course).
Taking a look at their Sources page shows nearly 100(!) data sources the platform supports. With Airbyte you can sync data from any of these sources to the destination of your choice.
Airbyte Sync to S3
One of Airbyte’s most common destinations is S3, given its hard-to-beat combination of price, durability, and interoperability.
Storing your data in S3 is pretty neat, but it isn’t perfect either.
First, it provides no way to achieve isolation between data producers and consumers without copying data across multiple buckets or prefixes. Second, it lacks the ability to synchronize multiple datasets that reference each other. Failing to account for this can result in subtle data errors: reporting sales for a product that doesn’t exist, or missing sales for a product that does.
Let me ask you a question: If there was a one-click setting in S3 that you could activate to keep all of its benefits while mitigating the downsides, would you use it?
Airbyte Sync to S3: Enhanced by lakeFS
As already mentioned, when using Airbyte to send data to S3, you can use default settings to send data to a typical S3 bucket. Or with a small settings change, you can send data instead to a lakeFS repository created over a bucket.
What are the benefits?
There are several reasons to favor a lakeFS repository over your basic S3 bucket. As the name suggests, within a repo you can perform git-like operations over data that are more useful than the default S3 commands you are used to, like get_bucket() and list_objects_v2().
Remember the issues mentioned earlier related to isolation and consistency?
With lakeFS branches and merge operations, data isolation can be guaranteed. And through atomic merge operations, we can maintain consistency, even across datasets.
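To make the isolation idea concrete, here is a minimal sketch of how lakeFS's S3-compatible gateway addresses objects: the repository name takes the place of the S3 bucket, and the branch is the first component of the object key. The helper function and the repo/branch names below are illustrative, not part of any official SDK.

```python
def s3_address(repo: str, branch: str, path: str) -> tuple:
    """Return the (bucket, key) pair an S3 client would use to reach
    `path` on `branch` of a lakeFS repository via the S3 gateway."""
    return repo, f"{branch}/{path}"

# Writing a sync to a short-lived branch keeps consumers reading from
# main isolated until the branch is merged:
bucket, key = s3_address("airbyte-data", "sync-2021-10", "users/part-0.parquet")
print(bucket, key)  # airbyte-data sync-2021-10/users/part-0.parquet
```

Because a merge in lakeFS is atomic, promoting several synced datasets from a branch to main exposes them to consumers all at once, which is what prevents the cross-dataset inconsistencies described above.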
What's that setting again?
When configuring the Destination settings for S3, we can use the Endpoint field (released in June 2021) to route data through a lakeFS installation.
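The same settings can be expressed as a configuration sketch. The field names below mirror Airbyte's S3 destination spec at the time of writing, but check your Airbyte version for the exact schema; the endpoint URL, repository, and branch names are placeholders for your own installation.

```python
# Assumed field names, modeled on Airbyte's S3 destination configuration.
# The endpoint points at the lakeFS S3 gateway, the "bucket" is the lakeFS
# repository, and the path prefix begins with the target branch.
destination_config = {
    "s3_endpoint": "https://lakefs.example.com",  # your lakeFS installation
    "s3_bucket_name": "airbyte-data",             # lakeFS repository name
    "s3_bucket_path": "main/airbyte",             # branch first, then a prefix
    "access_key_id": "<lakeFS access key>",
    "secret_access_key": "<lakeFS secret key>",
}

# The branch is the first path segment of the bucket path:
branch = destination_config["s3_bucket_path"].split("/")[0]
print(branch)  # main
```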


Once we save the settings and run it, data will sync immediately to our lakeFS repository. If you switch to the lakeFS UI, you’ll see the replicated data in the repo, as shown below.


After each sync of data by Airbyte we recommend making a commit of the data in lakeFS. This allows you to easily time travel to any historical state of the data and recover from unwanted or troublesome data syncs.
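A commit can be created through the lakeFS REST API once a sync finishes. The sketch below builds the request using only the standard library; the endpoint path follows the lakeFS OpenAPI spec (`POST /api/v1/repositories/{repo}/branches/{branch}/commits`), while the host, repository, and branch names are placeholders for your setup.

```python
import json
import urllib.request

def build_commit_request(base_url: str, repo: str, branch: str, message: str):
    """Build (but do not send) a lakeFS commit request for `branch` of `repo`."""
    url = f"{base_url}/api/v1/repositories/{repo}/branches/{branch}/commits"
    body = json.dumps({"message": message}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )

req = build_commit_request(
    "https://lakefs.example.com", "airbyte-data", "main",
    "Commit Airbyte sync",
)
# urllib.request.urlopen(req)  # send it, after adding your auth credentials
```

Running this after every Airbyte sync gives each sync its own commit, so any historical state can be read back (or reverted to) by reference to that commit.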
About lakeFS
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.