Robin Moffatt

July 17, 2023

In this article, I’m going to work through an example workflow in which an R developer wants to test some code in isolation that will remove a subset of the data and write it in a new format. Once happy that the code works, they apply the change to the main dataset—all of that using the open-source tool lakeFS.

Introduction

Today I want to show you how lakeFS can be used with one of the most established languages in the data science space, R.

By using lakeFS with R, you gain the ability to:

  • Version the data you’re working with
  • Work with production data in a sandbox environment
  • Roll back changes to data

lakeFS is an open-source tool that provides a Git-like interface on top of your data lake. With lakeFS, your object store—whether S3, MinIO, GCS, or ADLS—can provide isolated and versioned branches of your data using copy-on-write to ensure low footprint and overhead. Developers can work on the same sets of data without treading on each other’s toes or can choose to interact and share access to the same branch of data.

lakeFS integrates with pretty much all tools and technologies in the data world. S3 has long been the de facto object store interface that any respectable tool has to support—and since lakeFS provides an S3 gateway, it gains instant integration capabilities to a vast range of tools.

Using R with lakeFS

R works with lakeFS through the lakeFS S3 gateway for reading and writing data, and through the lakeFS API for other operations, including creating branches and committing data.
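
The code in this article assumes that a few connection details have been configured up front. Here’s a minimal setup sketch; the endpoint, repository name, and environment variable names are assumptions for a local installation, so adjust them to your environment:

library(jsonlite)  # fromJSON, for the initial data load
library(aws.s3)    # reads and writes via the lakeFS S3 gateway
library(httr)      # calls to the lakeFS REST API
library(arrow)     # Parquet support, used later on

lakefs_api_url  <- "http://localhost:8000/api/v1"   # assumed local lakeFS endpoint
repo_name       <- "example-repo"                   # hypothetical repository name
lakefsAccessKey <- Sys.getenv("LAKEFS_ACCESS_KEY")  # assumed env var names
lakefsSecretKey <- Sys.getenv("LAKEFS_SECRET_KEY")
useHTTPS        <- FALSE                            # the local S3 gateway is plain HTTP

# aws.s3 picks up the gateway endpoint and credentials from environment variables
Sys.setenv("AWS_S3_ENDPOINT"       = "localhost:8000",
           "AWS_ACCESS_KEY_ID"     = lakefsAccessKey,
           "AWS_SECRET_ACCESS_KEY" = lakefsSecretKey)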

In this article, I’ll show how an R developer can take a dataset from the data lake, manipulate it in isolation and validate the changes, and then write it back for others to use. The benefit is that the changes aren’t exposed to anyone else until they’re finalised.

Initial load and commit of the dataset

Let’s start by loading some data and inspecting it. I’m using an extract of the NYC Film Permits dataset which I’ll start by loading from JSON into R:

# Parse the JSON extract into a data.frame (fromJSON comes from jsonlite)
nyc_data <- fromJSON("/data/nyc_film_permits.json")

We can inspect the data with str(nyc_data) to get an idea of its contents:

'data.frame':   1000 obs. of  14 variables:
 $ eventid         : chr  "691875" "691797" "691774" "691762" ...
 $ eventtype       : chr  "Shooting Permit" "Shooting Permit" "Shooting Permit" "Shooting Permit" ...
 $ startdatetime   : chr  "2023-01-20T06:00:00.000" "2023-01-20T09:00:00.000" "2023-01-20T11:30:00.000" "2023-01-20T02:30:00.000" ...
 $ enddatetime     : chr  "2023-01-20T22:00:00.000" "2023-01-21T01:00:00.000" "2023-01-21T01:00:00.000" "2023-01-20T23:00:00.000" ...
 $ enteredon       : chr  "2023-01-18T14:34:06.000" "2023-01-18T11:48:09.000" "2023-01-18T10:47:25.000" "2023-01-18T09:57:45.000" ...
 $ eventagency     : chr  "Mayor's Office of Film, Theatre & Broadcasting" "Mayor's Office of Film, Theatre & Broadcasting" "Mayor's Office of Film, Theatre & Broadcasting" "Mayor's Office of Film, Theatre & Broadcasting" ...
 $ parkingheld     : chr  "31 STREET between 47 AVENUE and 48 AVENUE" "3 AVENUE between BROOK AVENUE and EAST  162 STREET,  BROOK AVENUE between 3 AVENUE and EAST  161 STREET,  BROOK"| __truncated__ "WEST   15 STREET between 9 AVENUE and 10 AVENUE,  WEST   14 STREET between 10 AVENUE and WASHINGTON STREET,  WA"| __truncated__ "KINGSLAND AVENUE between GREENPOINT AVENUE and NORMAN AVENUE,  MONITOR STREET between GREENPOINT AVENUE and NOR"| __truncated__ ...
 $ borough         : chr  "Queens" "Bronx" "Manhattan" "Brooklyn" ...
 $ communityboard_s: chr  "2" "1, 3" "2, 4" "1, 2" ...
 $ policeprecinct_s: chr  "108" "40, 42" "10, 6" "108, 94" ...
 $ category        : chr  "Television" "Television" "Television" "Television" ...
 $ subcategoryname : chr  "Episodic series" "Episodic series" "Episodic series" "Episodic series" ...
 $ country         : chr  "United States of America" "United States of America" "United States of America" "United States of America" ...
 $ zipcode_s       : chr  "11101" "10451" "10011, 10014" "11101, 11109, 11222" ...

The data has several dimensions within it, including the borough for which the permit was issued. We can summarise the data to see how many permits were issued for each borough:

table(nyc_data$borough)
        Bronx      Brooklyn     Manhattan        Queens Staten Island
           28           334           463           168             7

In the rest of this article, I’m going to work through an example workflow in which the developer wants to test some code in isolation that will remove a subset of the data and write it in a new format. Once happy that the code works, they’ll apply the change to the main dataset.

To get started, we’ll write and commit the original set of data as an R object to the main branch of the lakeFS repository. The lakeFS S3 gateway exposes repositories as buckets, and branches as the first part of the object path.

# Write the data
aws.s3::s3saveRDS(x = nyc_data,
                  bucket = repo_name, object = "main/nyc/nyc_permits.R", 
                  region="", use_https=useHTTPS)

# Build a commit message
body=list(message="Initial data load", 
          metadata=list(
              client="httr", author="rmoff"))

# Commit the data
branch <- "main"
r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches/",branch,"/commits"), 
       authenticate(lakefsAccessKey, lakefsSecretKey),
       body=body, encode="json" )

Create a dev branch for isolated development

Just as we create branches in Git to isolate development work that we do on code, we can do the same in lakeFS for working with data. Branches contain the full set of data of the branch from which they’re created.

Because lakeFS uses copy-on-write, there’s no actual data duplication, so branches are cheap: data is only written back to the object store once it changes on the branch (and then, only what has changed).

We’ll create a branch called dev from the main branch that we wrote to above:

branch <- "dev"

r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches"), 
       body=list(name=branch, source="main"),
       authenticate(lakefsAccessKey, lakefsSecretKey),
       encode="json" )

In practice, one user might have written the data to the main branch and another branched it into this new dev branch; here we can wear both hats and look at the data in the dev branch ourselves.

Let’s read the data from the branch into a new variable and confirm that it matches what we wrote to main above:

# read data from the dev branch into a new variable
nyc_data_dev <- aws.s3::s3readRDS(object = "dev/nyc/nyc_permits.R", 
                                  bucket = repo_name, 
                                  region="",
                                  use_https=useHTTPS)
                                  
table(nyc_data_dev$borough)
        Bronx      Brooklyn     Manhattan        Queens Staten Island
           28           334           463           168             7

Looks identical! So now we can go and make some changes to it, safe in the knowledge that we’re working in isolation from main. That is to say, any changes we make on this dev branch won’t show up on main.

Making changes on the dev branch

The example we’re going to use here is deleting some data and storing what remains in a new format (Parquet). We want to make sure we get that deletion right and that the results look OK before making the change on the live data.

First, we delete the data for Manhattan:

nyc_data_dev <- nyc_data_dev[nyc_data_dev$borough != "Manhattan", ]

table(nyc_data_dev$borough)
        Bronx      Brooklyn        Queens Staten Island
           28           334           168             7

Then we write the amended dataset back to the branch. Because we want to write it as Parquet, I’m using the arrow package, which has its own S3 support, so the calls look a little different from the aws.s3 ones above.
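
The lakefs object used below is an Arrow S3 filesystem handle pointing at the lakeFS S3 gateway. Here’s a minimal sketch of how it might be created, reusing the connection details from earlier (the endpoint and scheme are assumptions for a local HTTP setup):

# An Arrow filesystem backed by the lakeFS S3 gateway
lakefs <- S3FileSystem$create(
  access_key        = lakefsAccessKey,
  secret_key        = lakefsSecretKey,
  endpoint_override = "localhost:8000",   # assumed lakeFS S3 gateway address
  scheme            = "http"              # matches useHTTPS = FALSE above
)

With that in place, we can write the Parquet file: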

write_parquet(x = nyc_data_dev,
              sink = lakefs$path(paste0(repo_name, "/dev/nyc/nyc_permits.parquet")))

Finally, we’ll clean up after ourselves and remove the original R object:

# Delete the superseded .R object from the dev branch via the Arrow filesystem
lakefs$DeleteFile(paste0(repo_name, "/dev/nyc/nyc_permits.R"))

What does the dev branch look like now?

We can inspect the state of the dev branch programmatically:

r=GET(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches/",branch,"/diff"), 
      authenticate(lakefsAccessKey, lakefsSecretKey))

str(content(r)$results)
List of 3
 $ :List of 4
  ..$ path      : chr "nyc/"
  ..$ path_type : chr "object"
  ..$ size_bytes: int 48278
  ..$ type      : chr "added"
 $ :List of 4
  ..$ path      : chr "nyc/nyc_permits.R"
  ..$ path_type : chr "object"
  ..$ size_bytes: int 48278
  ..$ type      : chr "removed"
 $ :List of 4
  ..$ path      : chr "nyc/nyc_permits.parquet"
  ..$ path_type : chr "object"
  ..$ size_bytes: int 48278
  ..$ type      : chr "added"

But perhaps seeing it visually is more useful, and we can do that through the lakeFS UI:

lakeFS web UI showing Uncommitted changes on the dev branch

Under Uncommitted Changes we can see that the .R file has been removed and the .parquet file added. We can take advantage of the built-in object browser in lakeFS to inspect the Parquet file, and even query it:

Using the lakeFS web UI to query a parquet file with SQL

The Parquet file shows that the data for Manhattan has been removed, which is what we would expect.
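
We can also check this from R, reading the Parquet file straight back from the dev branch (a quick sanity check; it should return the same four-borough counts shown above):

# Read the Parquet file back from the dev branch to double-check the deletion
check <- read_parquet(lakefs$path(paste0(repo_name, "/dev/nyc/nyc_permits.parquet")))
table(check$borough)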

Side Note: What about the main branch?

So we’ve made these changes to the dev branch—but where does that leave main? Well, exactly where we would hope: entirely untouched.

🥱 Somewhat boringly, it looks exactly as it did before we started work:

The lakeFS web UI showing the main branch in the same state as it was originally

We can double-check this by looking at the Uncommitted Changes too:

The lakeFS web UI showing no uncommitted changes on the main branch
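
We can confirm the same from R by reading the original object straight back from main; it’s still the .R file, complete with all five boroughs:

# main still serves the original .R object, untouched by the work on dev
nyc_check <- aws.s3::s3readRDS(object = "main/nyc/nyc_permits.R",
                               bucket = repo_name,
                               region = "", use_https = useHTTPS)
table(nyc_check$borough)   # all five boroughs, including Manhattan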

Finalising our work

Having made the changes that we wanted and verified that they’ve worked as intended, we’re ready to commit them and merge them back into main for everyone else to see and use.

First, we commit the changes:

# Build the commit message
body=list(message="remove data for Manhattan, write as parquet, remove original file", 
          metadata=list(
              client="httr", author="rmoff"))

# Make the commit
r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches/dev/commits"), 
       authenticate(lakefsAccessKey, lakefsSecretKey),
       body=body, encode="json" )

Then we merge it back into main:

r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/refs/dev/merge/main"), 
       authenticate(lakefsAccessKey, lakefsSecretKey),
       body=list(message="merge updated data to main branch"), encode="json" )

If we look at the main branch now, we’ll see it’s got the Parquet file as it should do:

The lakeFS web UI showing the main branch with the changes from dev applied

Looking at the Parquet file, we can see that it is holding the data with Manhattan removed:

nyc_data <- read_parquet(lakefs$path(paste0(repo_name, "/main/nyc/nyc_permits.parquet")))

table(nyc_data$borough)
        Bronx      Brooklyn        Queens Staten Island
           28           334           168             7

An R client for lakeFS?

The examples above all use httr to call the lakeFS REST API directly for creating branches, commits, and merges. lakeFS publishes an OpenAPI specification, and there is some work underway to look at the feasibility of an R client. For more information, see #6177 and the accompanying sample notebook.
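
In the meantime, if you find yourself making the same httr calls repeatedly, a small wrapper function can tidy things up. This is just a hypothetical convenience helper built on the same commit API call used above, not an official client:

# Hypothetical helper: commit staged changes on a branch via the lakeFS API
lakefs_commit <- function(repo, branch, message, author = "rmoff") {
  r <- httr::POST(
    url = paste0(lakefs_api_url, "/repositories/", repo,
                 "/branches/", branch, "/commits"),
    httr::authenticate(lakefsAccessKey, lakefsSecretKey),
    body = list(message = message,
                metadata = list(client = "httr", author = author)),
    encode = "json")
  httr::stop_for_status(r)   # fail loudly if the commit was rejected
  httr::content(r)           # parsed response, including the new commit ID
}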

Try it out yourself 🔧

lakeFS is open-source and you can download it from GitHub. You can use it with your existing R environment, or try the lakeFS-samples repository, which includes a Jupyter Notebook with an R kernel and an optional Docker Compose environment that runs lakeFS as a self-contained stack.

The NYC Film Permit example used in this blog is available as a notebook, along with several other R examples.

For help with getting started with lakeFS or any questions you may have, be sure to join our Slack group.
