Itai Admi
August 5, 2021
New features in Airbyte and lakeFS make it easy to send data replicated by Airbyte into a lakeFS repo. See how to leverage this integration in your data pipelines!
airbyte-lakefs

If you work in data, chances are you rely on replicating data between different systems to centralize it for analysis. Modern companies produce data from all kinds of systems including relational DBs, marketing products, clickstream tracking, and even the dreaded Excel spreadsheet.

The good news is there are a growing number of services that can handle the rote aspects of data replication for you. One of the fastest growing is Airbyte, which coincidentally celebrated its one-year anniversary last week (along with lakeFS, of course).

Taking a look at their Sources page shows nearly 100(!) data sources the platform supports. With Airbyte you can sync data from any of these sources to the destination of your choice. 

Airbyte Sync to S3

One of Airbyte’s most common destinations is S3, given its hard-to-beat combination of price, durability, and interoperability.

Storing your data in S3 is pretty neat, but it isn’t perfect either.

First, it provides no way to achieve isolation between data producers and consumers without copying data across multiple buckets or prefixes. Second, it lacks the ability to synchronize multiple datasets that reference each other. Failing to account for this can result in subtle data errors where you report sales for a product that doesn’t exist, or miss sales for a product that does.

Let me ask you a question: If there was a one-click setting in S3 that you could activate to keep all of its benefits while mitigating the downsides, would you use it?

Airbyte Sync to S3: Enhanced by lakeFS

As already mentioned, when using Airbyte to send data to S3, you can use default settings to send data to a typical S3 bucket. Or with a small settings change, you can send data instead to a lakeFS repository created over a bucket.

What are the benefits?

There are several reasons to favor a lakeFS repository over a basic S3 bucket. As the name suggests, within a repo you can perform git-like operations over data that are more useful than the default S3 commands you are used to, like get_object() and list_objects_v2().

Remember the issues mentioned earlier related to isolation and consistency? 

With lakeFS branches and merge operations, data isolation can be guaranteed. And through atomic merge operations, we can maintain consistency, even across datasets.
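To make this concrete, here is a minimal sketch of that workflow using the lakectl CLI. The repository name (`analytics`) and branch names are placeholders for illustration:

```shell
# Create an isolated branch for incoming data; consumers keep
# reading from main while the sync lands here.
lakectl branch create lakefs://analytics/airbyte-sync \
  --source lakefs://analytics/main

# Once the synced data is validated, merge it back atomically.
# Consumers see all of the new data at once, or none of it.
lakectl merge lakefs://analytics/airbyte-sync lakefs://analytics/main
```

Because the merge is atomic, related datasets that were synced to the same branch become visible to consumers together, avoiding the cross-dataset inconsistency described above.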

What's that setting again?

When configuring the Destination settings for S3, we can use the Endpoint field (released in June 2021) to route data through a lakeFS installation.

airbyte-lakefs-endpoint
Configuring the S3 destination in Airbyte to sync to lakeFS

Once we save the settings and run the connection, data will sync immediately to our lakeFS repository. If you switch to the lakeFS UI, you’ll see the replicated data in the repo, as shown below.

postgres-sync
lakeFS repo after sync of postgres data

After each Airbyte sync, we recommend making a commit of the data in lakeFS. This allows you to easily time travel to any historical state of the data and recover from unwanted or troublesome data syncs.
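A commit after each sync might look like the following with the lakectl CLI (the repository name, branch name, and commit message are illustrative):

```shell
# Record the state of the branch after an Airbyte sync completes.
# "analytics" and "main" are placeholder repo/branch names.
lakectl commit lakefs://analytics/main \
  -m "Airbyte sync of postgres data"

# The commit log then lets you locate any prior state to revert to:
lakectl log lakefs://analytics/main
```

This step can be automated as a post-sync hook in your orchestration tool, so every Airbyte run leaves behind a recoverable checkpoint.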

Want to Learn More?

Learn how lakeFS transforms object stores into git-like repositories.

Read Related Articles.

Note: This article was contributed to by Paul Singman, Developer Advocate at lakeFS.
