In this article, I’m going to work through an example workflow in which an R developer wants to test some code in isolation that removes a subset of the data and writes it in a new format. Once happy that the code works, they apply the change to the main dataset, all using the open-source tool lakeFS.
Introduction
Today I want to show you how lakeFS can be used with one of the most established languages in the data science space, R.
By using lakeFS with R, you gain the ability to:
- Version the data you’re working with
- Work with production data in a sandbox environment
- Roll back changes to data
lakeFS is an open-source tool that provides a Git-like interface on top of your data lake. With lakeFS, your object store—whether S3, MinIO, GCS, or ADLS—can provide isolated and versioned branches of your data using copy-on-write to ensure low footprint and overhead. Developers can work on the same sets of data without treading on each other’s toes or can choose to interact and share access to the same branch of data.
lakeFS integrates with pretty much every tool and technology in the data world. S3 has long been the de facto object store interface that any respectable tool has to support, and since lakeFS provides an S3 gateway, it gains instant integration with a vast range of tools.
Using R with lakeFS
R works with lakeFS through the S3 support in lakeFS for reading and writing, and the lakeFS API for other operations including creating branches and committing data.
In this article, I’ll show how an R developer can take a dataset from the data lake, manipulate it in isolation and validate the changes, and then write it back for others to use. The benefit is that the changes aren’t exposed to anyone else until they’re finalised.
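Throughout the examples that follow I’ll use a handful of connection variables. Here’s a minimal setup sketch; the endpoint, credentials, and repository name are assumptions that you’d replace with the values for your own lakeFS instance:
library(aws.s3)
library(httr)

lakefsAccessKey <- "AKIAIOSFODNN7EXAMPLE"                      # assumption: your lakeFS access key
lakefsSecretKey <- "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"  # assumption: your lakeFS secret key
lakefs_api_url  <- "http://localhost:8000/api/v1"              # assumption: lakeFS running locally
repo_name       <- "example"                                   # assumption: an existing repository
useHTTPS        <- FALSE                                       # TRUE if your endpoint uses TLS

# Point aws.s3 at the lakeFS S3 gateway rather than at AWS itself
Sys.setenv("AWS_ACCESS_KEY_ID"     = lakefsAccessKey,
           "AWS_SECRET_ACCESS_KEY" = lakefsSecretKey,
           "AWS_S3_ENDPOINT"       = "localhost:8000")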
Initial load and commit of the dataset
Let’s start by loading some data and inspecting it. I’m using an extract of the NYC Film Permits dataset which I’ll start by loading from JSON into R:
library(jsonlite)

nyc_data <- fromJSON("/data/nyc_film_permits.json")
We can inspect the data with str(nyc_data) to get an idea of its contents:
'data.frame': 1000 obs. of 14 variables:
$ eventid : chr "691875" "691797" "691774" "691762" ...
$ eventtype : chr "Shooting Permit" "Shooting Permit" "Shooting Permit" "Shooting Permit" ...
$ startdatetime : chr "2023-01-20T06:00:00.000" "2023-01-20T09:00:00.000" "2023-01-20T11:30:00.000" "2023-01-20T02:30:00.000" ...
$ enddatetime : chr "2023-01-20T22:00:00.000" "2023-01-21T01:00:00.000" "2023-01-21T01:00:00.000" "2023-01-20T23:00:00.000" ...
$ enteredon : chr "2023-01-18T14:34:06.000" "2023-01-18T11:48:09.000" "2023-01-18T10:47:25.000" "2023-01-18T09:57:45.000" ...
$ eventagency : chr "Mayor's Office of Film, Theatre & Broadcasting" "Mayor's Office of Film, Theatre & Broadcasting" "Mayor's Office of Film, Theatre & Broadcasting" "Mayor's Office of Film, Theatre & Broadcasting" ...
$ parkingheld : chr "31 STREET between 47 AVENUE and 48 AVENUE" "3 AVENUE between BROOK AVENUE and EAST 162 STREET, BROOK AVENUE between 3 AVENUE and EAST 161 STREET, BROOK"| __truncated__ "WEST 15 STREET between 9 AVENUE and 10 AVENUE, WEST 14 STREET between 10 AVENUE and WASHINGTON STREET, WA"| __truncated__ "KINGSLAND AVENUE between GREENPOINT AVENUE and NORMAN AVENUE, MONITOR STREET between GREENPOINT AVENUE and NOR"| __truncated__ ...
$ borough : chr "Queens" "Bronx" "Manhattan" "Brooklyn" ...
$ communityboard_s: chr "2" "1, 3" "2, 4" "1, 2" ...
$ policeprecinct_s: chr "108" "40, 42" "10, 6" "108, 94" ...
$ category : chr "Television" "Television" "Television" "Television" ...
$ subcategoryname : chr "Episodic series" "Episodic series" "Episodic series" "Episodic series" ...
$ country : chr "United States of America" "United States of America" "United States of America" "United States of America" ...
$ zipcode_s : chr "11101" "10451" "10011, 10014" "11101, 11109, 11222" ...
The data has several dimensions within it, including the borough for which the permit was issued. We can summarise the data to see how many permits were issued for each borough:
table(nyc_data$borough)
Bronx Brooklyn Manhattan Queens Staten Island
   28      334       463    168             7
In the rest of this article, I’m going to work through an example workflow in which the developer wants to test some code in isolation that will remove a subset of the data and write it in a new format. Once happy that the code works, they’ll apply the change to the main dataset.
To get started, we’ll write and commit the original set of data as an R object to the main branch of the lakeFS repository. The lakeFS S3 gateway exposes repositories as buckets, and branches as the first part of the object path.
# Write the data
aws.s3::s3saveRDS(x = nyc_data,
                  bucket = repo_name,
                  object = "main/nyc/nyc_permits.R",
                  region = "",
                  use_https = useHTTPS)
# Build a commit message
body <- list(message = "Initial data load",
             metadata = list(client = "httr",
                             author = "rmoff"))

# Commit the data
branch <- "main"
r <- POST(url = paste0(lakefs_api_url, "/repositories/", repo_name,
                       "/branches/", branch, "/commits"),
          authenticate(lakefsAccessKey, lakefsSecretKey),
          body = body, encode = "json")
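The lakeFS API returns standard HTTP status codes, so it’s worth confirming that the commit actually succeeded before moving on. A small sketch using httr’s built-in helper:
# Raise an error if the commit failed; otherwise show the commit details
stop_for_status(r, task = "commit to main")
content(r)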
Create a dev branch for isolated development
Just as we create branches in Git to isolate development work that we do on code, we can do the same in lakeFS for working with data. Branches contain the full set of data of the branch from which they’re created.
Because lakeFS uses copy-on-write, there’s no actual data duplication, so branches are cheap; data is only written back to the object store once it changes on the branch (and then, only what has changed).
We’ll create a branch called dev from the main branch that we wrote to above:
branch <- "dev"
r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches"),
body=list(name=branch, source="main"),
authenticate(lakefsAccessKey, lakefsSecretKey),
encode="json" )
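As with the commit, we can check the response, and we can even look the new branch up via the API to see which commit it points at. A hedged sketch:
# Confirm the branch was created
stop_for_status(r, task = "create dev branch")

# Look up the new branch to see the commit it references
r <- GET(url = paste0(lakefs_api_url, "/repositories/", repo_name,
                      "/branches/", branch),
         authenticate(lakefsAccessKey, lakefsSecretKey))
content(r)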
Whilst in practice one user may have written the data to the main branch and another may have branched it to this new dev branch, we can still wear both hats and now look at the data in the dev branch ourselves.
Let’s read the data from the branch into a new variable and confirm that it matches what we wrote to main above:
# Read data from the dev branch into a new variable
nyc_data_dev <- aws.s3::s3readRDS(object = "dev/nyc/nyc_permits.R",
                                  bucket = repo_name,
                                  region = "",
                                  use_https = useHTTPS)
table(nyc_data_dev$borough)
Bronx Brooklyn Manhattan Queens Staten Island
   28      334       463    168             7
Looks identical! So now we can go and make some changes to it, safe in the knowledge that we’re working in isolation from main. That is to say, any changes we make on this dev branch won’t show up on main.
Making changes on the dev branch
The example we’re going to use here is deleting some data and storing the remainder in a new format (Parquet). We want to make sure we get that deletion right and that the results look OK before making the change on the live data.
First, we delete the data for Manhattan:
nyc_data_dev <- nyc_data_dev[nyc_data_dev$borough != "Manhattan", ]
table(nyc_data_dev$borough)
Bronx Brooklyn Queens Staten Island
   28      334    168             7
Then we write the amended dataset back to the branch. Because we want to write it as Parquet, I’m using the Arrow library with its own support for S3, so it looks a little different from the aws.s3 calls above:
write_parquet(x = nyc_data_dev,
sink = lakefs$path(paste0(repo_name, "/dev/nyc/nyc_permits.parquet")))
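The lakefs object used as the sink here is an Arrow S3FileSystem pointing at the lakeFS S3 gateway. If you haven’t created one yet, here’s a sketch of what that setup might look like; the endpoint and scheme are assumptions that should match your own lakeFS instance:
library(arrow)

# An Arrow filesystem handle for the lakeFS S3 gateway
lakefs <- S3FileSystem$create(access_key = lakefsAccessKey,
                              secret_key = lakefsSecretKey,
                              scheme = "http",                 # assumption: no TLS locally
                              endpoint_override = "localhost:8000")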
Finally, we’ll clean up after ourselves and remove the original R object:
lakefs$DeleteFile(paste0(repo_name, "/dev/nyc/nyc_permits.R"))
What does the dev branch look like now?
We can inspect the state of the dev branch programmatically:
r <- GET(url = paste0(lakefs_api_url, "/repositories/", repo_name,
                      "/branches/", branch, "/diff"),
         authenticate(lakefsAccessKey, lakefsSecretKey))
str(content(r)$results)
List of 3
$ :List of 4
..$ path : chr "nyc/"
..$ path_type : chr "object"
..$ size_bytes: int 48278
..$ type : chr "added"
$ :List of 4
..$ path : chr "nyc/nyc_permits.R"
..$ path_type : chr "object"
..$ size_bytes: int 48278
..$ type : chr "removed"
$ :List of 4
..$ path : chr "nyc/nyc_permits.parquet"
..$ path_type : chr "object"
..$ size_bytes: int 48278
..$ type : chr "added"
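If you’d rather see the diff in a friendlier shape than nested lists, it’s easy enough to flatten it into a data frame; a quick sketch:
# Flatten the diff results into one row per changed path
diff_df <- do.call(rbind, lapply(content(r)$results, as.data.frame))
diff_df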
But perhaps seeing it visually is more useful, and we can do that through the lakeFS UI:
[Screenshot: the lakeFS UI showing the changes on the dev branch]
Under Uncommitted Changes we can see that the .R file has been removed and the .parquet file added. We can take advantage of the built-in object browser in lakeFS to inspect the Parquet file and even query it too:
[Screenshot: inspecting and querying the Parquet file in the lakeFS object browser]
The Parquet file shows that the data for Manhattan has been removed, which is what we would expect.
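We don’t have to rely on the UI for that check; reading the Parquet file back from the dev branch in R shows the same thing:
# Read the Parquet file back from the dev branch to double-check
check <- read_parquet(lakefs$path(paste0(repo_name, "/dev/nyc/nyc_permits.parquet")))
table(check$borough)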
Side Note: What about the main branch?
So we’ve made these changes to the dev branch, but where does that leave main? Well, exactly where we would hope: entirely untouched.
🥱 Somewhat boringly, it looks exactly as it did before we started work:
[Screenshot: the main branch in the lakeFS UI, unchanged]
We can double-check this by looking at the Uncommitted Changes too:
[Screenshot: no Uncommitted Changes on the main branch]
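We can make the same check from R itself by re-reading the original object from main, which still includes the Manhattan data:
# main still holds the original R object, Manhattan included
nyc_data_main <- aws.s3::s3readRDS(object = "main/nyc/nyc_permits.R",
                                   bucket = repo_name,
                                   region = "",
                                   use_https = useHTTPS)
table(nyc_data_main$borough)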
Finalising our work
Having made the changes we wanted and verified that they’ve worked as intended, we’re ready to commit them and merge them back into main for everyone else to see and use.
First, we commit the changes:
# Build the commit message
body <- list(message = "remove data for Manhattan, write as parquet, remove original file",
             metadata = list(client = "httr",
                             author = "rmoff"))

# Make the commit
r <- POST(url = paste0(lakefs_api_url, "/repositories/", repo_name,
                       "/branches/dev/commits"),
          authenticate(lakefsAccessKey, lakefsSecretKey),
          body = body, encode = "json")
Then we merge it back into main:
r <- POST(url = paste0(lakefs_api_url, "/repositories/", repo_name,
                       "/refs/dev/merge/main"),
          authenticate(lakefsAccessKey, lakefsSecretKey),
          body = list(message = "merge updated data to main branch"),
          encode = "json")
If we look at the main branch now, we’ll see it’s got the Parquet file as it should do:
[Screenshot: the main branch showing the Parquet file]
Reading the Parquet file from main, we can see that it holds the data with Manhattan removed:
nyc_data <- read_parquet(lakefs$path(paste0(repo_name, "/main/nyc/nyc_permits.parquet")))
table(nyc_data$borough)
Bronx Brooklyn Queens Staten Island
   28      334    168             7
An R client for lakeFS?
The examples above all use httr to directly call the lakeFS REST API for creating branches, commits, and merges. lakeFS publishes an OpenAPI specification, and there is some work being done to look at the feasibility of an R client. For more information, please see #6177 and the sample notebook here.
Try it out yourself 🔧
lakeFS is open-source and you can download it from GitHub. You can use it with your existing R environment, or try out the lakeFS-samples repository, which includes a Jupyter Notebook with an R kernel and an optional Docker Compose file to run lakeFS too, as a self-contained stack.
The NYC Film Permit example used in this blog is available as a notebook, along with several other R examples.
For help with getting started with lakeFS or any questions you may have, be sure to join our Slack group.