Scalable Data Version Control
Manage your data as code using Git-like operations and achieve reproducible, high-quality data pipelines. Available as open source or in the cloud.


Take control of your data




COMPUTE ENGINES
lakeFS supports all standard computation engines.
lakeFS
lakeFS uses metadata to manage data versions. Its versioning engine is highly scalable, with minimal impact on storage performance.
Formats
lakeFS is format agnostic: it works with structured, unstructured, and open table formats, or anything else.
Object Storage
lakeFS supports data in all object stores, including the major cloud providers (AWS S3, Azure Blob Storage, Google Cloud Storage) and on-prem systems such as MinIO, Ceph, Dell EMC, and any other S3-compatible storage.
Use Cases
lakeFS helps data engineers and data scientists in every field manage their data like code, at scale.
- Data Science
- Data Engineering
Robust Data Pre-Processing
Data cleaning, outlier handling, filling in missing values, and more: make sure your pre-processing pipelines are robust and produce high-quality data.
Deduplicated Experimentation
Use lakeFS branches to run experiments in parallel with zero-copy clones in a fully deduplicated data lake, allowing you to effectively compare them to select the best one.
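To make the zero-copy idea concrete, here is a minimal conceptual sketch in Python. It is not the lakeFS API or its internal implementation; `ToyLake` and all of its methods are hypothetical names invented for illustration. It only shows the general principle that a branch can be a small metadata map pointing at shared, content-addressed objects, so creating a branch duplicates no data.

```python
import hashlib


class ToyLake:
    """Conceptual model only: a branch is a metadata map from path to content hash."""

    def __init__(self):
        self.objects = {}             # content hash -> bytes, each stored once
        self.branches = {"main": {}}  # branch name -> {path: content hash}

    def put(self, branch, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.objects.setdefault(digest, data)  # identical content is deduplicated
        self.branches[branch][path] = digest

    def create_branch(self, name, source="main"):
        # Copy only the path -> hash map; no object bytes are duplicated.
        self.branches[name] = dict(self.branches[source])

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]


lake = ToyLake()
lake.put("main", "train.csv", b"a,b\n1,2\n")
lake.create_branch("exp-1")
lake.create_branch("exp-2")
lake.put("exp-1", "train.csv", b"a,b\n1,2\n3,4\n")  # change is visible only on exp-1
```

In this sketch, three branches reference the data but only two distinct objects are stored: the original file shared by `main` and `exp-2`, and the modified file on `exp-1`.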
Reproducible Feature Engineering & Model Training
Commit the results of your experiments and use the lakeFS Git integration to reproduce any experiment with the right version of the data, the code and the model weights.


Isolated Dev/Test Environments
Create isolated dev/test environments using lakeFS branches and reduce your testing time by 80%.
Promote Only High Quality Data to Production
Implement CI/CD for data with lakeFS hooks, allowing for automation of quality validation checks.
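As an illustration of the kind of quality check such automation could run before data is promoted, here is a small Python sketch. The function names `validate_rows` and `pre_merge_gate`, the row format, and the result dictionary are all assumptions made for this example; they are not part of lakeFS, and wiring the check into an actual hook is left out.

```python
def validate_rows(rows, required_columns):
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    for index, row in enumerate(rows):
        for column in required_columns:
            if row.get(column) is None:
                problems.append(f"row {index}: missing value for '{column}'")
    return problems


def pre_merge_gate(rows, required_columns):
    """Hypothetical gate: allow promotion only when validation finds no problems."""
    problems = validate_rows(rows, required_columns)
    return {"allow_merge": not problems, "problems": problems}


good = [{"id": 1, "price": 9.5}, {"id": 2, "price": 3.0}]
bad = [{"id": 1, "price": None}, {"price": 4.0}]

good_result = pre_merge_gate(good, ["id", "price"])
bad_result = pre_merge_gate(bad, ["id", "price"])
```

Here `good_result` allows the merge, while `bad_result` blocks it and lists each row that failed, which is the information a pipeline owner would need to fix the data on the branch before retrying.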
Fix Bad Data with Production Rollback
Save entire consistent snapshots of your data using commits, allowing you to roll back to a previous commit in case of bad data.
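The snapshot-and-rollback idea can be sketched in a few lines of Python. This is a conceptual toy, not lakeFS code: `ToyBranch` and its methods are hypothetical, and the sketch assumes a rollback is recorded as a new commit that republishes an earlier snapshot, so the history of what happened stays intact.

```python
class ToyBranch:
    """Conceptual model only: a branch is an append-only list of immutable snapshots."""

    def __init__(self):
        self.commits = [{}]  # each commit is a full path -> value snapshot

    def commit(self, changes):
        snapshot = dict(self.commits[-1])  # start from the latest snapshot
        snapshot.update(changes)
        self.commits.append(snapshot)
        return len(self.commits) - 1       # commit id is its position in history

    def rollback(self, commit_id):
        if not 0 <= commit_id < len(self.commits):
            raise ValueError(f"unknown commit {commit_id}")
        # Republish the earlier snapshot as a new commit; nothing is deleted.
        self.commits.append(dict(self.commits[commit_id]))
        return len(self.commits) - 1

    def read(self, path):
        return self.commits[-1][path]


prod = ToyBranch()
good = prod.commit({"sales.parquet": "v1 (validated)"})
prod.commit({"sales.parquet": "v2 (corrupt)"})
restored = prod.rollback(good)
```

After the rollback, readers of `prod` see the validated version again, and the corrupt commit remains in the history for later investigation.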


lakeFS is already helping thousands of developers
UP TO 80%
Reduce storage costs
2X
Double efficiency
UP TO 99%
Increase production outage recovery
Trusted by
lakeFS saved us from the analysis paralysis of overthinking how to test new software on our data lake at Netflix scale. In less than 20 min I had lakeFS up and running, and was able to run tests against my production data in isolation and validate the software change thoroughly before pushing to production. With lakeFS, we improved the robustness and flexibility of our data systems.
Open Source Engineer


Moving to a data branching solution has paid off quickly for us. A few days after completing the migration, we’ve already reduced testing time by 80% on two different projects. And we’re excited to see how data branching increases our product velocity.
CTO


The cloud never warned us about the data getting clouded. As the blessing of infinite storage quickly became an unmanageable mess, there is a need for technologies like lakeFS to make data accessible again
CTO
With lakeFS we can easily achieve advanced use cases with data, such as running parallel pipelines with different logic to experiment or conduct what-if analysis, compare large result sets for data science and machine learning, and more
Raghvendra Verma,
Cory Matheson
Since introducing lakeFS to our production data environment, we’ve enjoyed the benefits of atomic and isolated operations in our data pipelines. This has allowed us to spend more time improving other aspects of our data platform, and less time dealing with the fallout from race conditions and partially failed operations
Data Platform Team Lead
By using lakeFS we produce a commit history on the production branch that easily allows for rollbacks. In the case of data quality issues in production, this allows us to simply revert to the previous high quality snapshot of our data.
Big Data R&D Team Lead
Seamless integration with your entire data stack
lakeFS connects to every object storage that uses the S3 interface
lakeFS supports all broadly used compute engines
All common ingest technologies are integrated into lakeFS
lakeFS is format agnostic! Whatever format you’re using, lakeFS supports it.
Open table
Unstructured
Data quality is essential to the health of your data lake. Ensure and maintain the highest data quality with lakeFS.
lakeFS Data Version Control Blog
ML Data Version Control and Reproducibility at Scale
In the ever-evolving landscape of machine learning (ML), data stands as the cornerstone upon which triumphant models are built. However, as...
Databricks Unity Catalog: A Comprehensive Guide to Streamlining Your Data Assets
As data quantities increase and data sources diversify, teams are under pressure to implement comprehensive data catalog solutions. Databricks Unity Catalog is...
Jupyter Notebook & 10 Alternatives: Data Notebook Review [2023]
The tech industry responded to the needs of data practitioners with various IDE solutions for developing code and presenting findings in a...