Case Study

How Enigma Improved Their Research Velocity with lakeFS

Ryan Green, Author

Ryan Green is the Chief Technology Officer at Enigma Technologies,...

Last updated on September 24, 2024
Company

Enigma Technologies is a data science company that specializes in providing business data and must continually create and refine ML-based algorithms in a highly research-driven development process.

Challenge

Enigma needed to quickly deploy changes in an isolated data pipeline and measure the results against what’s currently running in the production environment.

Result

Following a successful partnership, Enigma has fully migrated their data pipelines to lakeFS and has already reduced testing time by 80% on two projects.

The company

Enigma Technologies is a New York City-based data science company that specializes in providing business data. Their goal of building a complete set of authoritative profiles of U.S. small businesses calls for integrating hundreds of different data sets. Enigma teams must continually create and refine machine learning–based algorithms in a highly research-driven development process.


The challenges

Flexible and robust experimentation

For Enigma, flexibility means that the team can continually integrate new data sources and algorithms, then rapidly experiment with and validate them. Conversely, robustness means running a complex data pipeline at scale while spending minimal time on maintenance.

The tension between flexibility and robustness arises when the team gets excited about a potential research breakthrough and wants to test its effects on the data asset rapidly. 

Enigma was looking to quickly deploy the change in an isolated data pipeline and measure the results against what’s currently running in their production environment.


Data validation

Enigma’s data pipeline consists of a series of stages. The team runs validation logic between each stage before promoting the results to the next stage. However, the hand-crafted promotion logic could be simplified.


Data set tagging

Another challenge the company faced was determining which data set versions contributed to the final data asset. 


Adopted solution

Challenge solved: Flexible and robust experimentation

Enigma began to explore lakeFS for data branching to resolve the tension between flexibility and robustness in the experimentation process. 

Data branching overlays a Git-like abstraction on top of the physical data. A data branch is a logically isolated view of the data that can be modified and merged into other branches.

Data branching makes it trivial for researchers to create an environment based on the latest production data. With a simple command, a data scientist can create an isolated data branch for their experiment that’s guaranteed to be identical to production except for the specific changes they make. This empowers data scientists to work independently of data engineers.

Data branching resolves issues of environment contention by allowing for the creation of isolated experimental environments (each experiment runs on a different branch).

lakeFS branches solve the isolation challenge straightforwardly. Today, every developer and researcher creates separate data branches, which include a complete snapshot of the production data (at no additional storage cost). Users can make changes and review their impact on the final data set without fear of interfering with someone else’s work or polluting the production data.
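The case study doesn’t include code, but a minimal sketch of this workflow with the lakeFS high-level Python SDK (the lakefs package) could look like the following. The repository, branch, and object names are hypothetical, not Enigma’s, and credentials are assumed to be configured via environment variables or ~/.lakectl.yaml.

```python
# Sketch only: isolating an experiment on a lakeFS branch with the
# high-level Python SDK (the "lakefs" package). Repository, branch, and
# object names are hypothetical; credentials are assumed to come from
# environment variables or ~/.lakectl.yaml.
import lakefs

repo = lakefs.Repository("business-profiles")

# Branching is a metadata-only operation, so this "snapshot" of the
# production data adds no storage cost.
experiment = repo.branch("exp-new-matching-model").create(source_reference="main")

# Experiment outputs go to the branch, never to the production data on main.
experiment.object("experiments/matching/results.csv").upload(
    data="entity_id,score\n42,0.97\n", content_type="text/csv"
)
experiment.commit(message="Trial run of new matching model")
```

Because the branch is a logical view rather than a copy, creating it takes seconds regardless of how large the production data set is.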

It’s also much easier for Enigma to run parallel pipelines and maintain stable pipelines for customers who want to upgrade at a slower cadence. 


Challenge solved: Data validation

Enigma’s team runs validation logic between stages before promoting the results to the next stage. lakeFS simplified the process by letting the team run their candidate pipeline on a branch and merge that branch into main if validation succeeds.
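A hedged sketch of that promote-on-success pattern follows, reusing the hypothetical names from the previous snippet; run_pipeline and validate stand in for Enigma’s own stage and validation logic.

```python
# Sketch only: run a candidate pipeline on a branch, then merge into main
# only if validation passes. run_pipeline() and validate() are hypothetical
# placeholders for the team's own stage and validation logic.
import lakefs

def run_pipeline(branch):
    # Placeholder: a real stage would write its intermediate outputs here.
    branch.object("stage2/output.csv").upload(data="a,b\n1,2\n", content_type="text/csv")

def validate(branch):
    # Placeholder check: require that the stage produced any output at all.
    return any(True for _ in branch.objects(prefix="stage2/"))

repo = lakefs.Repository("business-profiles")
main = repo.branch("main")
candidate = repo.branch("candidate-stage2").create(source_reference="main")

run_pipeline(candidate)
candidate.commit(message="Candidate output for stage 2")

if validate(candidate):
    # Promotion is just a merge; the validated data appears on main atomically.
    candidate.merge_into(main)
else:
    # Nothing is promoted; the production data on main stays untouched.
    candidate.delete()
```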


Challenge solved: Data set tagging

To determine which data set versions contributed to the final data asset, Enigma can tag a branch to provide clear semantics about the complete set of intermediate data sets that went into the pipeline. This is extremely helpful when diagnosing issues and anomalies in the final data asset.
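As a rough sketch, with the same hypothetical repository name as above and the tag-creation call based on my reading of the lakefs SDK rather than on the case study, tagging and later inspecting a published state could look like this:

```python
# Sketch only: tag the state of main behind a published data asset so the
# contributing intermediate data sets can be traced later. Names are
# hypothetical, and the Tag.create call reflects my reading of the SDK.
import lakefs

repo = lakefs.Repository("business-profiles")

# After a successful promotion, tag the current state of main.
repo.tag("asset-2024-09-24").create("main")

# Later, when diagnosing an anomaly in the final asset, list exactly the
# intermediate data sets that went into that tagged pipeline run.
for obj in repo.ref("asset-2024-09-24").objects(prefix="intermediate/"):
    print(obj.path)
```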


Results

Following a successful partnership, Enigma has fully migrated their data pipelines to lakeFS. Moving to a data-branching solution has paid off quickly for the company. Within a few days of completing the migration, testing time had already been reduced by 80% on two different projects. lakeFS is set to increase Enigma’s product velocity in the future.

“For the most part, the product has met or exceeded our expectations. Where it hasn’t (mainly response time on certain endpoints) the lakeFS team has been fully engaged in turning around solutions — often in a matter of days. The attentiveness and sense of urgency is refreshing.

“Moving to a data branching solution has paid off quickly for us. A few days after completing the migration, we’ve already reduced testing time by 80% on two different projects. And we’re excited to see how data branching increases our product velocity in the coming quarter.”

Ryan Green, Chief Technology Officer, Enigma Technologies