Building an Auto-Upgrading lakeFS Environment

Nirav Adunuthula

Last updated on June 20, 2023


One look at the cheerful aquamarine axolotl resting atop the lakeFS homepage was all it took to assure me that an internship here would make this summer different than any other...

This summer I had the opportunity to intern with the amazing developer team at Treeverse and work on the lakeFS project.

This opportunity represents an important progression of my programming journey, which started humbly with a little orange cat from MIT’s block-based programming language – Scratch! After my introduction to programming in grade school, it didn’t take long before I was sending the Scratch Cat hurtling across the screen with the orange and blue puzzle pieces that encoded logic and motion.

This process of defining a world piece by piece, of coding a certain Italian plumber to jump or making a dragon spout flames, was thrilling. I’m sure anyone who works with code can appreciate the moments of success in programming, when all your lines of text finally start working together to make the magic happen!

I continued creating little worlds of code throughout school, completing computer science coursework and staying up throughout the night, frantically completing hackathon projects. By the time I moved into my university dorms, I had realized something – creating an application was a difficult (and rewarding) process, but it was only one part of the whole. Deploying the code and enabling other people to use it was a separate aspect that required additional learning.

This combination of software engineering and DevOps would go on to shape my experience at Treeverse.

Diving into DevOps

After learning the ins and outs of the lakeFS codebase, I was ready to start work on my intern project!

My task: to deploy the application and create a deployment pipeline that would allow developers to have a long-lived, upgradable lakeFS environment. This environment would let us:

Gain valuable insight into the deployment experience of lakeFS users and the issues they might encounter in their own CI/CD processes.
Pave the road for developers at Treeverse to demonstrate lakeFS’ capabilities in a production environment.

The mission seemed daunting at first – I had little to no experience with Docker containers or AWS, and deployment was a fancy buzzword thrown around by engineers at tech companies. But with the help of Itai, my amazing mentor, and the documentation, I was quickly able to learn the tools necessary for the job.

Using Terraform & AWS

At the center of the lakeFS deployment is Terraform, an Infrastructure as Code (IaC) tool that allowed me to spin up AWS resources from configuration files. A look at earlier work from Yoni Augarten and the lakeFS deployment page gave me an idea of what resources I would need to deploy lakeFS.

We went with ECS for container orchestration, ALB for load balancing, Route 53 for the DNS records, and RDS for the Postgres database required by lakeFS. I created and linked each of these resources using Terraform’s configuration language, and set the requisite lakeFS environment variables within the ECS task definition.

After the setup, all it took was a terraform apply command to spin up lakeFS!

Automatic Upgrades and Those Pesky Migrations

Once I had lakeFS running and connected to our S3 bucket, I needed to determine the upgrade process. The goal of the project was to create a long-lived lakeFS instance, so automated upgrades were an important functionality.

lakeFS doesn’t store state in the application itself, instead offloading it to a Postgres database. Based on this design, it became clear that database migrations were going to be a primary concern.

Running lakefs migrate up from a new lakeFS image upgrades the database schema using the DDL scripts within the binary. I leveraged ECS’ container dependency feature to create a container that ran the “migrate up” command before running lakeFS from a separate container.
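The container dependency described above can be sketched roughly as follows. This is a minimal illustration in Python that builds the containerDefinitions part of an ECS task definition, not a copy of the actual Terraform. The container names, image tag, and connection string are hypothetical placeholders, the environment variable name is an assumption about the lakeFS configuration of the time, and the commands assume the image's entrypoint is the lakefs binary.

```python
import json

# Hypothetical placeholders, for illustration only.
LAKEFS_IMAGE = "treeverse/lakefs:<version>"
SHARED_ENV = [
    # Assumed variable name; the value is a placeholder.
    {"name": "LAKEFS_DATABASE_CONNECTION_STRING", "value": "<postgres-connection-string>"},
]


def container_definitions(image, env):
    """Build two containers: one that migrates the database, one that runs lakeFS."""
    migrate = {
        "name": "lakefs-migrate",
        "image": image,
        # Assumes the image entrypoint is the lakefs binary.
        "command": ["migrate", "up"],
        # The migration container exits when done, so it must not be essential.
        "essential": False,
        "environment": env,
    }
    server = {
        "name": "lakefs",
        "image": image,
        "command": ["run"],
        "essential": True,
        "environment": env,
        # Start only after the migration container exits with code 0.
        "dependsOn": [{"containerName": "lakefs-migrate", "condition": "SUCCESS"}],
    }
    return [migrate, server]


if __name__ == "__main__":
    print(json.dumps(container_definitions(LAKEFS_IMAGE, SHARED_ENV), indent=2))
```

Because both containers use the same image, changing the image tag upgrades the schema and the server together.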

Testing this upgrade strategy locally with Terraform was successful! In the process though, I did run into some interesting (and destructive) migration behavior in the code which I’ll discuss in more detail later.

Automating With GitHub Actions

With the lakeFS upgrade process baked into the Terraform command, I had to determine how to start the deployment after a push to the lakeFS GitHub repository – it was time for some GitOps!

lakeFS already uses GitHub workflows to build and push Docker images, so I decided to trigger the deployment using GitHub workflows as well. Here’s a diagram of the workflow:

As seen above, we decided to trigger the deployment of this long-lived lakeFS instance on a push to a specific branch in the lakeFS repository.

I took advantage of GitHub’s workflow dispatch triggers to chain workflows together, both within the public lakeFS repository and an internal Terraform repository.

After the pushed code successfully passes system tests, the lakeFS version is passed over to the Terraform repository, where the relevant workflow pulls the specified image from ECR and runs terraform apply to spin up or update the lakeFS instance.
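To make the hand-off between repositories more concrete, here is a hedged sketch of how one workflow could trigger another through GitHub's REST API for workflow dispatch events. It only builds the HTTP request and does not send it. The owner, repository, workflow file name, and the lakefs_version input name are all hypothetical; the real workflows may pass the version differently.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, workflow_file, ref, lakefs_version, token):
    """Build (but do not send) a workflow_dispatch request for a target repository."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = json.dumps(
        {
            "ref": ref,  # branch or tag whose workflow file should run
            # Hypothetical input name; it must match an input declared by the workflow.
            "inputs": {"lakefs_version": lakefs_version},
        }
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_dispatch_request(
        "example-org", "example-terraform-repo", "deploy.yaml", "main", "<version>", "<token>"
    )
    print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it; the receiving workflow needs a workflow_dispatch trigger that declares the matching input.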

Voilà! We get a long-lived lakeFS environment that can be upgraded with a push to a branch!

Migration Difficulties

While lakeFS upgrades work great, the deployment process isn’t fully ironed out. Specifically, I ran into problems while trying the less common, but still important counterpart – downgrades.

The reason for this is related to the .up.sql and .down.sql database migration files that lakeFS uses to build and tear down the Postgres database.

Running lakefs migrate up runs the .up.sql files one after another to update the database to the latest version. When trying to run lakefs migrate down, I realized that all of the Postgres information was deleted! The command ran the .down.sql scripts which destroyed all the database tables; obviously not the ideal behavior.
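The behavior is easier to see in a toy model. The sketch below is not lakeFS's migration code; it simulates numbered up/down migrations against an in-memory list of tables, with made-up table names, to show why running every down script leaves nothing behind.

```python
# Hypothetical table names: migration N creates MIGRATIONS[N - 1] on the way up
# and drops it again on the way down.
MIGRATIONS = ["repositories", "branches", "commits"]


class ToyDatabase:
    """In-memory stand-in for a database with a schema version."""

    def __init__(self):
        self.version = 0
        self.tables = []

    def migrate_up(self):
        """Apply every pending up migration, in order."""
        while self.version < len(MIGRATIONS):
            self.tables.append(MIGRATIONS[self.version])
            self.version += 1

    def migrate_down(self):
        """Apply every down migration, newest first, all the way to version 0."""
        while self.version > 0:
            self.version -= 1
            self.tables.remove(MIGRATIONS[self.version])


if __name__ == "__main__":
    db = ToyDatabase()
    db.migrate_up()
    print(db.version, db.tables)
    db.migrate_down()
    print(db.version, db.tables)
```

In this model, an unbounded "down" has no stopping point other than an empty database, which is what makes it unsuitable for rolling back by a single release.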

We quickly removed the offending command, but now we needed another way of running downward migrations. A new command, lakefs migrate goto, was proposed as an alternative, but it ran into a chicken-and-egg complication.

Our proposal required getting the appropriate Postgres schema version from the intended version of lakeFS. However, at the time, it wasn’t possible to know the appropriate Postgres schema version without running the image first.

This need was the motivation behind one of my final contributions to lakeFS – the db-schema-version label that appears on lakeFS Docker images starting from the v0.48.0 release. With the addition of the label, future versions of lakeFS will have a much more streamlined rollback process!
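As a rough sketch of how a deployment script might use the label, the Python below reads db-schema-version from the JSON printed by docker inspect for an image and works out which direction a migration would need to go. It assumes the label value is an integer migration number; the helper names are mine and are not part of lakeFS.

```python
import json

LABEL = "db-schema-version"


def schema_version_from_inspect(inspect_json):
    """Extract the schema version label from `docker inspect <image>` JSON output."""
    image = json.loads(inspect_json)[0]  # docker inspect prints a JSON array
    labels = image.get("Config", {}).get("Labels") or {}
    if LABEL not in labels:
        raise KeyError(f"image has no {LABEL!r} label")
    # Assumption: the label holds an integer migration number.
    return int(labels[LABEL])


def plan_migration(current, target):
    """Return the direction and number of schema steps between two versions."""
    if target > current:
        return ("up", target - current)
    if target < current:
        return ("down", current - target)
    return ("none", 0)


if __name__ == "__main__":
    sample = '[{"Config": {"Labels": {"db-schema-version": "7"}}}]'
    target = schema_version_from_inspect(sample)
    print(plan_migration(9, target))
```

Reading the label only requires the image metadata, so the target schema version is known before any container from that image is started.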

Final Thoughts

Working at a burgeoning start-up has been an amazing experience. Each day was filled with the opportunity to get hands-on and dive into the ocean of different technologies that make up the platform. I learned more than I thought possible about software engineering and DevOps principles in such a short period of time.

I want to make sure to thank all the people that guided me through the internship. I received help from everyone and could not have done it without their patient assistance.

Most of all, I would like to thank Oz for helping me think through my goals for my career and at Treeverse, Ariel for helping me become a (slightly 😊) better programmer, Paul for his feedback on this blog post, and Itai for being the best mentor throughout these months!

I am incredibly thankful that I was able to work at Treeverse this summer, and I hope I get to work with this amazing team in the future!