Increase data quality and reduce the painful cost of errors
Data engineering best practices using Git-like operations on data
lakeFS is an open-source data version control system for data lakes.
It enables zero-copy, isolated dev/test environments, continuous quality validation, atomic rollback of bad data, reproducibility, and more.
Trusted by
lakeFS saved us from the analysis paralysis of overthinking how to test new software on our data lake at Netflix scale. In less than 20 min I had lakeFS up and running, and was able to run tests against my production data in isolation and validate the software change thoroughly before pushing to production. With lakeFS, we improved the robustness and flexibility of our data systems.
Open Source Engineer


Moving to a data branching solution has paid off quickly for us. A few days after completing the migration, we’ve already reduced testing time by 80% on two different projects. And we’re excited to see how data branching increases our product velocity.
CTO


The cloud never warned us about the data getting clouded. As the blessing of infinite storage quickly became an unmanageable mess, there is a need for technologies like lakeFS to make data accessible again.
CTO
With lakeFS we can easily achieve advanced use cases with data, such as running parallel pipelines with different logic to experiment or conduct what-if analysis, compare large result sets for data science and machine learning, and more.
Raghvendra Verma, Cory Matheson
Since introducing lakeFS to our production data environment, we’ve enjoyed the benefits of atomic and isolated operations in our data pipelines. This has allowed us to spend more time improving other aspects of our data platform, and less time dealing with the fallout from race conditions and partially failed operations.
Data Platform Team Lead
By using lakeFS we produce a commit history on the production branch that easily allows for rollbacks. In the case of data quality issues in production, this allows us to simply revert to the previous high quality snapshot of our data.
Big Data R&D Team Lead
Big Data engineering requires data version control
Data is transient, and managing it is an inefficient, manual task. With lakeFS, your data lake is versioned and you can easily time-travel between consistent snapshots of the lake.
EASIER ETL TESTING
Test your ETL on top of production data, in isolation
Safely experiment, test, and collaborate with your team on full production data without incurring extra storage costs.
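The idea of testing in isolation without copying data can be pictured with a toy model. The sketch below is not lakeFS code and does not use the lakeFS API; the `ToyLake` class, its methods, and the file path are all hypothetical, invented for illustration. It assumes only that a branch is a mapping from paths to stored objects, so creating a branch copies pointers rather than data.

```python
class ToyLake:
    """Toy model of zero-copy branching (hypothetical, not the lakeFS API)."""

    def __init__(self):
        self.objects = {}             # object id -> bytes, each stored once
        self.branches = {"main": {}}  # branch name -> {path: object id}
        self._next_id = 0

    def create_branch(self, name, source):
        # Copy only the path -> id mapping; the data objects stay shared.
        self.branches[name] = dict(self.branches[source])

    def put(self, branch, path, data):
        # Writing stores a new object and repoints the path on this branch only.
        object_id = self._next_id
        self._next_id += 1
        self.objects[object_id] = data
        self.branches[branch][path] = object_id

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]


lake = ToyLake()
lake.put("main", "events/day1.csv", b"id,value\n1,10\n")

# Branch off production, then change the file on the test branch only.
lake.create_branch("etl-test", source="main")
lake.put("etl-test", "events/day1.csv", b"id,value\n1,999\n")
```

In this model, reading `events/day1.csv` from `main` still returns the original bytes after the write to `etl-test`, and creating the branch stored no additional data objects.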
CI/CD FOR DATA
Promote only high-quality data to production
Automate data quality checks so that only validated data is promoted to production and bad data is kept out.
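A quality gate of this kind can be sketched in a few lines. The example below is a simplified illustration, not lakeFS functionality: the `promote` function, the `amounts_are_valid` check, and the row layout are all assumptions made up for the sketch. It shows the general pattern of validating a candidate snapshot and leaving production untouched when the check fails.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Snapshot = List[Row]


def amounts_are_valid(snapshot: Snapshot) -> bool:
    """Hypothetical check: every row has a non-negative 'amount'."""
    return all("amount" in row and row["amount"] >= 0 for row in snapshot)


def promote(production: Snapshot, candidate: Snapshot,
            check: Callable[[Snapshot], bool]) -> Tuple[Snapshot, bool]:
    """Return (new production snapshot, whether the candidate was promoted)."""
    if check(candidate):
        return candidate, True
    # Failed validation: production stays exactly as it was.
    return production, False


production = [{"amount": 10.0}]
good_batch = production + [{"amount": 5.0}]
bad_batch = production + [{"amount": -1.0}]

production, promoted_good = promote(production, good_batch, amounts_are_valid)
production, promoted_bad = promote(production, bad_batch, amounts_are_valid)
```

Here the first batch passes the check and becomes the new production snapshot, while the second batch contains a negative amount and is rejected, so production keeps the two valid rows.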


REPRODUCIBLE EXPERIMENTS
Re-run experiments, regardless of their version
Time travel with your data: return to any state of your experiments as it was during development, making past experiments easy to reproduce.
Data Version Control that works seamlessly with today’s data stack
lakeFS is fully compatible with a wide ecosystem of data engineering tools and technologies
3000+ Installations
2.2K GitHub Stars
1800+ Community members
Manage your data like code with data version control
Your data stays in place while lakeFS provides highly scalable, format-agnostic, zero-copy data version control over it.
20%-80% Storage Cost Reduction
2x Data Engineering Efficiency
2 Seconds Average Time to Roll Back Bad Data
Seamless integration with your entire data stack
lakeFS connects to any object store that uses the S3 interface.
lakeFS supports all widely used compute engines.
All common ingest technologies integrate with lakeFS.
lakeFS is format-agnostic: whatever format you use, lakeFS supports it.
Open table
Unstructured
Data quality is essential to the health of your data lake. Ensure and maintain the highest data quality with lakeFS.
Stay updated
OLTP: Guide to Enterprise Data Architecture Part 1
Data is a goldmine for every organization, no matter the industry. But to make the most of it, businesses need technology to...
The State of Data Engineering 2023
A lot has happened since 2022, from the rise of Generative AI to the economic slowdown and job losses impacting data practitioners...
Data Engineering Best Practices
The world of software engineering underwent a huge acceleration in the past decades. This was possible thanks to the emergence of methodologies...