

Nadav Steindler

Nadav Steindler is a software engineer specializing in high-performance backend...

Published on November 19, 2024

The world is changing rapidly. The data revolution of the past couple of decades continues to gain steam as it spreads beyond Big Tech to many traditional businesses. Innovations in Mobile, IoT, Analytics, AI, and the Public Cloud have many firms discovering the benefits they can reap by collecting and analyzing more data. When a company decides to join the Data Revolution, the amount of data it collects can rapidly increase by orders of magnitude. Not only that, once an early adopter sees success with this approach, the trend spreads rapidly through its whole industry. When that happens, managing storage costs becomes a strategic concern, keeping the business more efficient and profitable than the competition.

The Cloud Storage industry continues to grow rapidly as more industries transform their businesses. The Business Research Company’s Cloud Storage Global Market Report 2024 predicts 19.1% annual industry growth (CAGR) over the next four years, while TechNavio predicts 19.6%.

Metrics Beat Assumptions

Let me share a little cost-savings anecdote from my past that some of you may relate to.

Representatives from every department had gathered in the executive conference room on the top floor for a periodic discussion of cloud costs. One of the new cloud engineers was sharing his insights into the system by going through a never-ending PowerPoint, showing every available cost metric on every obscure cloud service. He was rapidly losing his audience, with the participants in the room split between those fighting off sleep and those counting down the minutes until lunch. Suddenly one of the Principal Engineers interjected loudly:

“Your slide is wrong! We don’t spend nearly that much on storage. Our biggest cloud expenditure is on servers; everyone knows that!”

The experienced team members in the room began nodding. The cloud engineer giving the presentation looked confused. He was a new hire. He wasn’t aware of the years of experience we all had with our customers, and the knowledge that they practice heavy analytics on relatively small datasets. Why, we had just done an in-depth study of cloud costs 4 or 5 years ago that confirmed all of this and showed that compute accounted for 80% of our cloud costs, with Storage being relatively insignificant.

The cloud engineer opened the AWS console, navigated service by service, and showed us the raw metrics. Storage costs now accounted for 60% of total costs! Storage utilization had been climbing rapidly over the past couple of years but no one had raised a flag until now!

I can still remember the shocked silence in the room. That silence spoke volumes; it was the sound of assumptions crumbling to dust.

Once everyone got over the initial shock, the team began to work on a new strategy for storage cost savings. This plan would go on to save the firm millions of dollars in operating expenses over the coming years.

Techniques for Cloud Storage Cost Savings

There are many infrastructural techniques for storage efficiency on one’s Data Lake, like moving archival data to cheaper storage tiers and using compressed file formats for greater efficiency. These are good solutions since they can be implemented by Data Engineers in a manner largely transparent to the Data Analysts. That said, the way Data Analysts work can also have a big impact on storage costs.
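As a rough sketch of the first technique (the bucket name, prefix, and 90-day threshold below are placeholders, not recommendations), an S3 lifecycle rule can transition archival objects to a colder storage class automatically:

```bash
# Illustrative lifecycle rule: objects under the "archive/" prefix move to
# Glacier after 90 days. Bucket name, prefix, and threshold are placeholders.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "archive-old-data",
      "Status": "Enabled",
      "Filter": { "Prefix": "archive/" },
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-data-lake \
  --lifecycle-configuration file://lifecycle.json
```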

I’ve seen datasets that have been duplicated dozens of times by analytical teams. There are many reasons that datasets get duplicated: for the purpose of experimentation, to give each team member their own sandbox to work in, or even simply for governance reasons, so as to preserve the specific version of the data that was used to build a particular model. The problem with such duplications is that all these copies of the data are hard to manage and the Cloud Engineer in charge of cost savings has no idea what can be safely deleted and what can’t. There are various tools for detecting duplicate data and measuring duplication levels, but fixing these problems is never easy since it gets at the heart of how the analytical teams work.

Enter lakeFS, the version control system for object storage (such as S3, MinIO, GCS, and Azure Blob Storage). lakeFS uses Git-like operations to version-control the data, regardless of data format or size. Because the data is versioned at the metadata level only, compliance and security requirements are satisfied without incurring the cost of duplicating data.
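As a minimal sketch (the repository, bucket, and path names below are made up for illustration), an existing dataset can be placed under lakeFS management by creating a repository over a dedicated storage namespace and then pulling the data in with a zero-copy import:

```bash
# Create a lakeFS repository backed by a dedicated S3 storage namespace
# (repository and bucket names are illustrative).
lakectl repo create lakefs://usage-data s3://company-data-lake/lakefs/usage-data

# Zero-copy import of the existing objects: only metadata is written;
# the objects themselves are not copied.
lakectl import \
  --from s3://company-data-lake/usage-2024/ \
  --to lakefs://usage-data/main/usage-2024/
```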

Not only that, but lakeFS allows administrators to define granular policies for data retention. In this way, clear rules are set up for how long data is kept around. Datasets required for governance can be retained for a longer period, while experimental data can have a shorter lifetime. This provides an elegant answer to what can and cannot be deleted, and means we are not paying to store data that is no longer needed.

These lakeFS features ultimately lead to massive storage cost savings. Let’s take a look at how this can work.

A Simple Branching Example

There are three teams of Analysts in our hypothetical organization. They all need access to our 2024 Usage dataset, but each requires its own copy of the data for the sake of governance and to avoid interfering with the others. As such, each team has cloned the tables in the dataset under its own path. And just like that, we quadrupled the original 3TB dataset to 12TB.
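In object-storage terms, this duplication typically looks something like the sketch below (bucket and prefix names are invented for the example):

```bash
# Without data version control: each team copies the full 3TB dataset
# under its own prefix (names are illustrative).
aws s3 sync s3://data-lake/usage-2024/ s3://data-lake/team-a/usage-2024/
aws s3 sync s3://data-lake/usage-2024/ s3://data-lake/team-b/usage-2024/
aws s3 sync s3://data-lake/usage-2024/ s3://data-lake/team-c/usage-2024/
# The same 3TB now exists four times: roughly 12TB of billed storage.
```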

Now let’s give these teams Data Version Control with lakeFS. Rather than copying the data, each team can create a branch. The branch gives the illusion of working in a private copy of the data, but without actually duplicating the files. The teams can commit models to their branch without interfering with one another and without incurring additional storage overhead.
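With lakectl, lakeFS’s CLI, the same isolation becomes a metadata-only operation; the repository and branch names below are illustrative:

```bash
# Each team gets a zero-copy branch off main instead of a physical copy.
lakectl branch create lakefs://usage-data/team-a --source lakefs://usage-data/main
lakectl branch create lakefs://usage-data/team-b --source lakefs://usage-data/main
lakectl branch create lakefs://usage-data/team-c --source lakefs://usage-data/main

# Teams commit their own changes independently; only new or changed objects
# add to the storage bill.
lakectl commit lakefs://usage-data/team-a -m "Add churn-model feature tables"
```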

This simple example shows how data version control prevents data duplication by analytical teams. In reality, a single team may require multiple versions of a dataset for different projects or for different team members. lakeFS branches are cost-effective, with negligible overhead, and can be created to match any working methodology: per team, per project, or per individual analyst.

Retention Example

But what about deleting old, stale data? Retention Policies can be defined per repository and per branch.

Let’s look at a simple example of two repositories. The Usage Data repository records the usage logs of our application while the BI Aggregations repository contains OLAP cubes derived from the Usage Data and other repositories.

The BI team only needs Usage Data from the past few months to compute the OLAP cubes; the cubes themselves need to be retained for a full year to serve our BI dashboards.

At the same time, the Machine Learning team also uses the Usage Data repository and has a number of branches where they perform transformations on the raw Usage Data to prepare it for their analytical work. These analytical branches are only relevant for a short time and therefore can have a retention policy of 1 month.
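As a sketch of how this could be expressed (repository and branch names are illustrative, and the numbers are just examples), lakeFS garbage collection rules combine a repository-wide default with per-branch overrides, applied here with lakectl’s gc set-config command (the rules can also be managed through the lakeFS UI or API):

```bash
# Garbage collection rules for the Usage Data repository: main keeps data
# for ~3 months for the BI team, while the ML team's prep branches keep it
# for 1 month. All names and values are illustrative.
cat > gc-rules.json <<'EOF'
{
  "default_retention_days": 30,
  "branches": [
    { "branch_id": "main", "retention_days": 90 },
    { "branch_id": "ml-prep", "retention_days": 30 }
  ]
}
EOF

# Apply the rules to the repository.
lakectl gc set-config lakefs://usage-data -f gc-rules.json
```

The BI Aggregations repository would carry its own set of rules, for example a 365-day retention on its main branch so that the OLAP cubes stay available to the dashboards for a full year.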

Reduce storage costs with lakeFS

With lakeFS Retention Policies, each team’s business need can be easily expressed, and the system will automatically free up storage that is no longer needed.

Summary

There are many tools and technologies out there to help companies manage their ballooning cloud costs. As part of a cost-saving strategy, it is important to get teams working efficiently with large datasets.

The Data Engineering industry still lacks a standard method for handling data duplication and clean-up. lakeFS nips the data duplication problem in the bud by letting analysts define different branches on their data instead of copying it, and by providing a simple language for data retention rules. lakeFS can turn a colossal $100,000/year data lake into a modest $20,000/year expense, and it saves the Data Analysts the trouble of copying large datasets and keeping track of different data versions.
