What is lakeFS?
As data practitioners, we use many different terms to talk about what we do – we call it business intelligence, analytics, data pipelines, or insights. But there’s one term that captures what we do really well: delivering products.
When we were leading a large R&D organization, we couldn’t help but wonder about the gap between the best practices used by teams delivering software applications and those delivering data-intensive applications.
It seemed to us that data products were struggling with quality issues and a high cost of error, while software teams were reaping the benefits of engineering best practices such as agile development methodologies and ALM tooling.
lakeFS was designed to let data practitioners apply engineering best practices to their data, by providing highly scalable, high-performance Data Version Control for data lakes.
What issues does lakeFS solve?
The move to data lakes, with their virtually infinite scale and low costs, also introduced a new challenge: maintaining the resilience and reliability of the data lake as time goes by. Naturally, the quality of the data we introduce determines the overall trust in our data lake.
Despite the scalability and performance advantages of running a data lake on top of object stores, enforcing best practices, ensuring high data quality, and recovering quickly from errors remain extremely challenging.
Through its versioning engine, lakeFS provides the following built-in operations, familiar from Git, bringing best practices from the world of code into the world of data engineering:
- branch: a consistent copy of a repository, isolated from other branches and their changes. Initial creation of a branch is a metadata operation that does not duplicate objects.
- commit: an immutable checkpoint containing a complete snapshot of a repository.
- merge: performed between two branches — merges atomically update one branch with the changes from another.
- reset: return a repository to the exact state of a previous commit.
- tag: a pointer to a single immutable commit with a readable, meaningful name.
Incorporating these operations into your data lake pipelines provides the same engineering best practices and benefits as you get when managing application code with source control.
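To make these semantics concrete, here is a toy Python sketch of the copy-on-write idea behind them. This is purely illustrative and not lakeFS's actual implementation; all names are made up. Branches are just pointers to immutable commits, so creating a branch is a metadata operation that duplicates no objects, and a merge atomically moves the destination branch to a new commit:

```python
# Toy sketch of Git-like versioning over object metadata.
# Illustrative only -- not how lakeFS is implemented internally.

class Repo:
    def __init__(self):
        self.commits = {}                 # commit id -> (parent, {path: object id})
        self.branches = {"main": None}    # branch name -> head commit id
        self._next = 0

    def commit(self, branch, snapshot):
        """Record an immutable snapshot and advance the branch head."""
        cid = f"c{self._next}"
        self._next += 1
        self.commits[cid] = (self.branches[branch], dict(snapshot))
        self.branches[branch] = cid
        return cid

    def branch(self, name, source):
        """Metadata-only: copy the head pointer, duplicating no objects."""
        self.branches[name] = self.branches[source]

    def snapshot(self, branch):
        cid = self.branches[branch]
        return {} if cid is None else self.commits[cid][1]

    def merge(self, src, dst):
        """Atomically move dst to a commit combining both snapshots
        (a real merge performs a three-way diff)."""
        return self.commit(dst, {**self.snapshot(dst), **self.snapshot(src)})

repo = Repo()
repo.commit("main", {"events/1.parquet": "obj-a"})
repo.branch("etl-test", "main")            # instant: no data is copied
repo.commit("etl-test", {"events/1.parquet": "obj-a",
                         "events/2.parquet": "obj-b"})
repo.merge("etl-test", "main")             # main now sees both objects
print(sorted(repo.snapshot("main")))       # ['events/1.parquet', 'events/2.parquet']
```

In lakeFS itself, these operations are exposed through its API, language clients, and the lakectl CLI, operating over your existing object store.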
As a data-driven organization, what are the benefits of using lakeFS?
When using lakeFS on your object store, you improve the entire process of data management within your organization and enjoy the following benefits:
- Data team efficiency – lakeFS automates many of the repetitive, labor-intensive tasks that data engineers deal with daily, such as manually rolling back production data (have you ever tried to restore data accidentally deleted by a retention algorithm?) or debugging production issues without a reliable version of the data at the time of failure. Freed from these tasks, your data engineers can focus on what they really know and love to do: developing richer, more efficient data sources and algorithms for your organization.
- High-quality data products – lakeFS lets you validate data entering the data lake, or created within it, before it is exposed to consumers. The ability to prevent inconsistencies and errors before they occur is one of lakeFS's strongest capabilities.
- Data resilience – Data resilience means that even when mistakes and inconsistencies do occur, we can recover from them quickly. One of lakeFS's core capabilities is rolling back the entire data lake (or the part of it you choose to manage in your lakeFS repository) to its previous consistent state. This valuable feature enables organizations to eliminate data downtime. In addition, data engineers can access the data exactly as it was at the time of failure, dramatically reducing the time they spend debugging.
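Because every commit is an immutable snapshot, a rollback in this model is just moving a branch head to an older commit id; nothing is copied or rewritten. A minimal, purely illustrative sketch (not lakeFS internals; the commit ids are made up):

```python
# Illustrative only: rollback as an O(1) pointer move to an older commit.
history = ["c0-good", "c1-good", "c2-bad-load"]   # commit ids, oldest first

def rollback(history, steps=1):
    """Return the commit id the branch head should point to after
    undoing the last `steps` commits."""
    return history[-1 - steps]

head = rollback(history)
print(head)   # c1-good -- consumers immediately see the last good snapshot
```

This is why recovery is fast regardless of data volume: the objects written by the bad commit simply stop being referenced by the branch.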
Why are we offering lakeFS as a cloud service?
As the lakeFS community continued growing, and more and more organizations adopted lakeFS to implement data engineering best practices, we witnessed growing demand from our open-source users for a managed version of lakeFS. This is because many organizations regard the infrastructure required by their extensive technology stacks not as an asset but as a liability: it requires assigning DevOps and infrastructure engineers to deploy these software solutions, upgrade them, scale them as needed, maintain their uptime and SLAs, and ensure they are deployed with the proper security and compliance measures.
Adopting lakeFS cloud is a great way for organizations to pay only for the services they consume and to scale according to their needs, without having to staff their teams with additional DevOps and infrastructure engineers.
lakeFS cloud allows organizations to enjoy the full value of lakeFS, with these additional benefits:
- Assurance of high availability, uptime and all the required security guarantees.
- The full value of lakeFS at the click of a button – no need to download, deploy, scale, or maintain it.
- Enterprise support and SOC2 compliance guarantees.
Our commitment to open source remains
The lakeFS project will forever be open source. We built lakeFS as open source because we believe in bottom-up adoption of technologies, and because we believe collaborative communities have the power to produce the best solutions. Furthermore, every engineer should be able to use, contribute to, and influence cutting-edge technologies, so they can innovate in their own domain.

The lakeFS core capabilities will always be open source, including all the versioning primitives that make lakeFS so powerful and useful. We are deeply committed to our community of engineers who use and contribute to the project, and we will continue to be highly responsive, shaping lakeFS together to provide the data lake management capabilities we are all looking for. Read more about our commitment to the open-source community in our documentation.