Itai Admi

August 5, 2021
New features in Airbyte and lakeFS make it easy to send data replicated by Airbyte into a lakeFS repo. See how to leverage this integration in your data pipelines!

If you work in data, chances are you rely on replicating data between different systems to centralize it for analysis. Modern companies produce data from all kinds of systems including relational DBs, marketing products, clickstream tracking, and even the dreaded Excel spreadsheet.

The good news is that a growing number of services can handle the rote aspects of data replication for you. One of the fastest growing is Airbyte, which coincidentally celebrated its first anniversary last week (along with lakeFS, of course).

A glance at the Airbyte Sources page shows nearly 100(!) supported data sources. With Airbyte you can sync data from any of them to the destination of your choice.

Airbyte Sync to S3

One of Airbyte’s most common destinations is S3, given its hard-to-beat combination of price, durability, and interoperability.

Storing your data in S3 is pretty neat, but it isn’t perfect either.

First, it provides no way to achieve isolation between data producers and consumers without copying data across multiple buckets or prefixes. Second, it lacks the ability to synchronize multiple datasets that reference each other. Failing to account for this can result in subtle data errors, where you report sales for a product that doesn’t exist, or miss sales for a product that does.

Let me ask you a question: if there were a one-click setting in S3 that you could activate to keep all of its benefits while mitigating the downsides, would you use it?

Airbyte Sync to S3: Enhanced by lakeFS

As already mentioned, when using Airbyte to send data to S3, you can use default settings to send data to a typical S3 bucket. Or with a small settings change, you can send data instead to a lakeFS repository created over a bucket.

What are the benefits?

There are several reasons to favor a lakeFS repository over a basic S3 bucket. As the name suggests, within a repo you can perform git-like operations over your data, which go beyond the default S3 commands you are used to, like get_bucket() and list_objects_v2().

Remember the issues mentioned earlier related to isolation and consistency? 

With lakeFS branches and merge operations, data isolation can be guaranteed. And through atomic merge operations, we can maintain consistency, even across datasets.
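To make this concrete, here’s a minimal sketch of the isolate-then-merge flow using the lakefs_client Python package. The endpoint, credentials, repository name (airbyte-repo), and branch name (airbyte-sync) below are all placeholders for your own setup:

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# Placeholder endpoint and credentials -- substitute your own lakeFS installation.
configuration = lakefs_client.Configuration()
configuration.host = 'https://lakefs.example.com'
configuration.username = '<LAKEFS_ACCESS_KEY_ID>'
configuration.password = '<LAKEFS_SECRET_ACCESS_KEY>'
client = LakeFSClient(configuration)

# Create an isolated branch for the incoming data; producers write here
# while consumers keep reading a stable main branch.
client.branches.create_branch(
    repository='airbyte-repo',
    branch_creation=models.BranchCreation(name='airbyte-sync', source='main'),
)

# ... Airbyte (or any other producer) writes data to the airbyte-sync branch ...

# Once the new data is validated, a single atomic merge exposes all of it
# to consumers at once -- no partially updated state is ever visible on main.
client.refs.merge_into_branch(
    repository='airbyte-repo',
    source_ref='airbyte-sync',
    destination_branch='main',
)
```

Until the merge, consumers reading main never see partially synced data; after it, they see all of the new data at once.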

What's that setting again?

When configuring the Destination settings for S3, we can use the Endpoint field (released in June 2021) to route data through a lakeFS installation.
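This works because lakeFS exposes an S3-compatible endpoint, so any standard S3 client pointed at it behaves as usual. As a rough illustration of the same idea outside Airbyte (the endpoint, keys, and repository name are placeholders), here is boto3 listing objects through a lakeFS installation:

```python
import boto3

# Point a standard S3 client at the lakeFS endpoint -- the same idea as
# filling in Airbyte's Endpoint field. All values below are placeholders.
s3 = boto3.client(
    's3',
    endpoint_url='https://lakefs.example.com',
    aws_access_key_id='<LAKEFS_ACCESS_KEY_ID>',
    aws_secret_access_key='<LAKEFS_SECRET_ACCESS_KEY>',
)

# Through lakeFS's S3 gateway, the "bucket" is the repository name and the
# first path element is the branch.
resp = s3.list_objects_v2(Bucket='airbyte-repo', Prefix='main/')
for obj in resp.get('Contents', []):
    print(obj['Key'])
```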

Configuring the S3 destination in Airbyte to sync to lakeFS

Once we save the settings and run a sync, data will immediately flow to our lakeFS repository. If you switch to the lakeFS UI, you’ll see the replicated data in the repo, as shown below.

lakeFS repo after a sync of Postgres data

After each Airbyte sync, we recommend making a commit of the data in lakeFS. This allows you to easily time travel to any historical state of the data and recover from unwanted or troublesome syncs.
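As a sketch of how that post-sync commit might be automated, again using the lakefs_client package with the same placeholder endpoint, credentials, and repository name as before:

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# Placeholder configuration -- same assumptions as the earlier sketch.
configuration = lakefs_client.Configuration()
configuration.host = 'https://lakefs.example.com'
configuration.username = '<LAKEFS_ACCESS_KEY_ID>'
configuration.password = '<LAKEFS_SECRET_ACCESS_KEY>'
client = LakeFSClient(configuration)

# Commit whatever the latest Airbyte sync wrote, so this exact state of the
# data can be referenced -- or reverted to -- later.
client.commits.commit(
    repository='airbyte-repo',
    branch='main',
    commit_creation=models.CommitCreation(
        message='Airbyte sync: postgres replication',
        metadata={'source': 'airbyte'},
    ),
)
```

Every commit gets an immutable identifier that you can later browse, diff against, or revert to.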

About lakeFS

The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.

Our mission is to maximize the manageability of open source data analytics solutions that scale.


Note: This article was contributed to by Paul Singman, Developer Advocate at lakeFS.
