What is lakeFS?
lakeFS is a platform that helps data engineers build scalable and resilient data lakes running on object storage. It provides version control, branching, and merging capabilities for data at petabyte scale, on or off premises. lakeFS enables teams to collaborate and manage data effectively by applying engineering best practices to data management.
lakeFS was created by Treeverse and it is an open-source project under the Apache 2.0 license. In addition to the open-source project, lakeFS is available as a SaaS offering or an Enterprise software solution.
This article reviews the differences between the three options.
As mentioned, lakeFS is open-source software (OSS). We chose the open-core model, meaning the core of the software product is released as OSS, while additional features and functionality are provided as commercial add-ons under a proprietary license. (You might be familiar with this model from companies like GitLab, dbt, Confluent, Databricks, Dagster, Preset, and others.)
Our commitment to open source is to keep lakeFS’s versioning capabilities, Git-like interface, public object store APIs, CLI, API, and GUI in the open-source project.
In other words, you will be able to use all the versioning capabilities at scale for your data using lakeFS.
- Data Version Control
- Format Agnostic
- Zero Clone Copy utilizing Branches, creating isolated environments
- Atomic promotion of data utilizing Merges
- Enhance your data CI/CD with lakeFS Hooks: configure actions to trigger when predefined events occur
- A loooooong list of out-of-the-box Integrations, such as (but not limited to):
  - Delta Lake
  - AWS Glue
  - AWS EMR
  - Amazon SageMaker
  - And more…
- Keep your storage cost low by:
  - lakeFS zero clone copy
  - Deduplication of files on the object store over time
  - Garbage collection
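To make the branch, hook, and merge ideas in the list above more concrete, here is a minimal conceptual sketch in Python. It is not how lakeFS is implemented and it does not use the lakeFS API: `ToyRepo`, its methods, and the `s3://bucket/obj-…` addresses are hypothetical names chosen for illustration. The sketch assumes a branch is just a mapping from logical paths to physical object addresses, so creating a branch copies pointers rather than data, and a merge becomes visible in one step only if every pre-merge check passes. The merge logic is deliberately naive (source entries overwrite destination entries, with no conflict or delete handling).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical toy model: a snapshot maps logical paths to object addresses.
Snapshot = Dict[str, str]


@dataclass
class ToyRepo:
    branches: Dict[str, Snapshot] = field(default_factory=lambda: {"main": {}})
    pre_merge_hooks: List[Callable[[Snapshot], bool]] = field(default_factory=list)

    def create_branch(self, name: str, source: str) -> None:
        # Only the path -> address mapping is copied; no object data is duplicated.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch: str, path: str, address: str) -> None:
        # Writes on one branch are invisible to every other branch.
        self.branches[branch][path] = address

    def merge(self, source: str, dest: str) -> bool:
        # Naive merge: entries from the source branch win over the destination.
        candidate = {**self.branches[dest], **self.branches[source]}
        # Every hook must pass before the destination is allowed to change.
        if not all(hook(candidate) for hook in self.pre_merge_hooks):
            return False
        # Single reassignment: readers see either the old or the new snapshot.
        self.branches[dest] = candidate
        return True


repo = ToyRepo()
repo.write("main", "sales/2023.parquet", "s3://bucket/obj-001")
repo.create_branch("etl-test", source="main")
repo.write("etl-test", "sales/2024.parquet", "s3://bucket/obj-002")

# Example data-quality gate: only Parquet files may reach the destination.
repo.pre_merge_hooks.append(
    lambda snapshot: all(path.endswith(".parquet") for path in snapshot)
)
merged = repo.merge("etl-test", "main")
```

In this toy model the new file stays isolated on `etl-test` until the merge succeeds, which mirrors the "isolated environment, then atomic promotion" workflow described above.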
In addition to all the benefits of the lakeFS OSS solution, the Enterprise version grants customers access to enterprise features, as well as a support SLA and customer success services.
SSO (OIDC Integration)
Achieve Single Sign On (SSO) using the OpenID Connect (OIDC) protocol built on top of the OAuth 2.0 framework.
With SSO OIDC integration, users can authenticate once with the identity provider (IDP) of their choice, and then gain access to lakeFS without having to enter their login credentials again. The IDP acts as a trusted third party that vouches for the user’s identity, and the application can verify the user’s identity by validating the authentication token issued by the IDP.
SSO OIDC integration provides a more secure and user-friendly experience, as users only have to remember one set of credentials. It also reduces the burden on your IT department by simplifying user management and access control.
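As a rough illustration of what "validating the authentication token" involves, the sketch below checks the standard OIDC ID token claims (`iss`, `aud`, `exp`) on an already-decoded token. It is an assumption-laden example, not lakeFS code: the function name `check_id_token_claims`, the issuer URL, and the client ID are hypothetical, and the cryptographic signature check that a real deployment performs against the IDP’s published keys is intentionally left out.

```python
import time
from typing import Any, Dict, Optional


def check_id_token_claims(
    claims: Dict[str, Any],
    expected_issuer: str,
    client_id: str,
    now: Optional[float] = None,
) -> bool:
    """Check basic ID token claims. Signature verification is NOT done here."""
    now = time.time() if now is None else now
    aud = claims.get("aud")
    # The audience claim may be a single string or a list of strings.
    audiences = aud if isinstance(aud, list) else [aud]
    exp = claims.get("exp")
    return (
        claims.get("iss") == expected_issuer      # token came from the trusted IDP
        and client_id in audiences                # token was issued for this application
        and isinstance(exp, (int, float))
        and exp > now                             # token has not expired
    )


# Hypothetical decoded claims, with a fixed clock to keep the example deterministic.
claims = {
    "iss": "https://idp.example.com",
    "aud": "lakefs-client",
    "sub": "user-123",
    "exp": 2_000,
}
is_valid = check_id_token_claims(
    claims, "https://idp.example.com", "lakefs-client", now=1_000
)
```

In practice these checks are handled by a maintained OIDC or JWT library rather than hand-written code; the point here is only to show which facts the application relies on when it trusts the IDP.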
lakeFS Enterprise enables Role-Based Access Control (RBAC), which is achieved by a mechanism similar to AWS Identity and Access Management (IAM).
IAM allows for centralized management of user identities and access privileges, enabling administrators to easily assign roles and permissions based on job responsibilities and organizational hierarchy. This enhances security by ensuring that users have access only to the resources they need to perform their jobs, and reduces the risk of unauthorized access or accidental data breaches.
lakeFS authorization provides the granularity needed to control data access for different users, or groups of users, to different resources, down to an individual object or dataset in a repository.
Overall, using IAM for RBAC improves security, streamlines access management, and enhances compliance with regulations and standards.
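The IAM-style model described above can be sketched as a list of policy statements, each pairing actions with a resource pattern. The example below is a simplified illustration, not the lakeFS authorization engine: the `is_allowed` function and the policy contents are hypothetical, and the action names and resource strings are written in an IAM-like style for illustration only. It assumes the common IAM convention that nothing is allowed unless a statement allows it, and that an explicit deny always wins.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, List

Statement = Dict[str, Any]


def is_allowed(statements: List[Statement], action: str, resource: str) -> bool:
    """Evaluate IAM-style statements: default deny, explicit deny overrides allow."""
    allowed = False
    for statement in statements:
        action_matches = any(fnmatchcase(action, p) for p in statement["action"])
        resource_matches = fnmatchcase(resource, statement["resource"])
        if action_matches and resource_matches:
            if statement["effect"] == "deny":
                return False  # an explicit deny ends the evaluation
            allowed = True
    return allowed


# Hypothetical policy: analysts may read the "sales" repository, except PII paths.
analyst_policy = [
    {
        "effect": "allow",
        "action": ["fs:ReadObject", "fs:ListObjects"],
        "resource": "arn:lakefs:fs:::repository/sales/object/*",
    },
    {
        "effect": "deny",
        "action": ["fs:*"],
        "resource": "arn:lakefs:fs:::repository/sales/object/pii/*",
    },
]

report = "arn:lakefs:fs:::repository/sales/object/reports/q1.parquet"
pii = "arn:lakefs:fs:::repository/sales/object/pii/customers.parquet"

can_read_report = is_allowed(analyst_policy, "fs:ReadObject", report)
can_read_pii = is_allowed(analyst_policy, "fs:ReadObject", pii)
can_write_report = is_allowed(analyst_policy, "fs:WriteObject", report)
```

Attaching a policy like this to a group rather than to individual users is what turns per-object permissions into role-based access control.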
The lakeFS Customer Success team will ensure you are successful in achieving your desired outcomes and goals throughout the use of lakeFS. We take a proactive approach to maximize the value of lakeFS.
Some of our services include:
We will assist in the initial install, integration and setup of lakeFS. Our experts will walk you through:
- lakeFS basic training
- Deployment sizing and planning exercise
- Use case review and optimization
Quarterly technical reviews
Once a quarter, we will discuss new feature releases and how you can benefit from them.
Our customers also provide input on the roadmap, helping to shape future features (including but not limited to the OSS).
Enterprise Support SLA
We are committed to providing high-quality support and will work to resolve any issues that arise in a timely manner. This can help to minimize downtime and disruptions to business operations, which can be particularly important for mission-critical data applications.
Our lakeFS enterprise customers will be notified in case of security threats or known bugs that will impact them, and we will proactively assist in resolving issues even before they are “discovered” by our customers.
Our support SLA helps you plan and budget for support costs more effectively, because you know the expected response and resolution times up front. lakeFS support consistently receives top customer satisfaction scores, reflecting our commitment to making your satisfaction our priority.
lakeFS Cloud is a hosted Software as a Service (SaaS) version of lakeFS Enterprise that includes some additional features.
Why consume lakeFS as SaaS?
First, lakeFS Cloud reduces the time and cost required to deploy and maintain the software, enabling you to focus on unlocking the full benefits of using lakeFS. Additionally, lakeFS Cloud offers out-of-the-box scalability and flexibility, as the service will auto-scale to meet your data management needs. lakeFS Cloud will automatically update and upgrade versions, ensuring you always have access to the latest features and security patches without needing to perform manual upgrades.
Overall, utilizing the service drives cost savings, flexibility, scalability, and ease of maintenance, making it our customers’ most popular choice.
Additional Features of lakeFS Cloud
Remember that time data was changed and you needed to trace it back? This is exactly why an audit log is critical. The lakeFS Audit Log allows you to view all relevant user action information in a clear and organized table, including when the action was performed, by whom, and what they did. This is a key feature in data governance: it improves security by identifying potential threats and unauthorized access, helps you comply with regulatory requirements and standards, and helps you troubleshoot issues by identifying the cause of errors or problems.
Our customers find that this feature also holds users accountable for their actions, and provides valuable information for forensic analysis in the event of a security incident or data breach.
Managed Garbage Collection
Managing ephemeral data objects can be challenging. This is why we developed Garbage Collection. Garbage collection (GC) rules in lakeFS define how long to retain objects after they have been deleted. lakeFS OSS provides a Spark program to hard-delete objects that have been deleted and whose retention period has ended according to the GC rules. Using OSS, you will need to configure, run, troubleshoot, and maintain the GC execution yourself.
With lakeFS Cloud, GC is a fully transparent managed service. In other words, once you define your GC rules, lakeFS Cloud will automatically and continuously manage the execution of the garbage collection. This keeps your storage costs low, while simultaneously allowing you to roll back your data according to your policies.
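The retention logic behind GC rules can be illustrated with a short sketch. This is a simplified model of the description above, not the lakeFS GC job itself (the real process works over a repository’s commit history): the helper names `retention_days` and `is_hard_delete_candidate` are hypothetical, and the rule layout with a default retention period plus per-branch overrides is an assumption made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict


def retention_days(rules: Dict[str, Any], branch: str) -> int:
    """Return the branch-specific retention period, or the default if none is set."""
    for override in rules.get("branches", []):
        if override["branch_id"] == branch:
            return override["retention_days"]
    return rules["default_retention_days"]


def is_hard_delete_candidate(
    rules: Dict[str, Any], branch: str, deleted_at: datetime, now: datetime
) -> bool:
    """An object qualifies once it has been deleted for longer than the retention period."""
    return now - deleted_at > timedelta(days=retention_days(rules, branch))


# Hypothetical rules: keep deleted objects 21 days by default, 28 days on main.
gc_rules = {
    "default_retention_days": 21,
    "branches": [{"branch_id": "main", "retention_days": 28}],
}

now = datetime(2024, 3, 1, tzinfo=timezone.utc)
deleted_at = datetime(2024, 2, 5, tzinfo=timezone.utc)  # 25 days before `now`

expired_on_main = is_hard_delete_candidate(gc_rules, "main", deleted_at, now)
expired_on_dev = is_hard_delete_candidate(gc_rules, "dev", deleted_at, now)
```

A longer retention period on a production branch keeps rollback possible for longer, while shorter periods on short-lived branches reclaim storage sooner; that is the trade-off the rules let you tune.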
lakeFS Cloud meets rigorous security standards and is designed to protect sensitive data and systems from unauthorized access, theft, or misuse. Choosing SOC 2 compliant software helps our customers comply with regulatory requirements, demonstrate due diligence to their customers and stakeholders, and mitigate the risk of data breaches and cyber-attacks.
Upcoming lakeFS Cloud features
In addition to the features available today, we are working on some exciting features for the future. Our roadmap includes things like:
- Managed ingestion into Snowflake
- Managed Data Observability checks (using integrations with leading data observability providers such as Monte Carlo, Great Expectations and dbt Tests)
- Enhanced DataOps visibility
- Automated DataOps management (auto-delete unused branches, find data duplications and more)
- Cross region replication
Compute cost savings with lakeFS Cloud
As with lakeFS OSS and Enterprise, your data stays in place when you use lakeFS Cloud. This means all your files, including the metadata files that lakeFS manages, sit in your own buckets (S3, Azure Blob, Google Cloud Storage) within your Virtual Private Cloud / Network.
However, with lakeFS Cloud, the compute required for the lakeFS Server runs outside your account, saving you costs with your cloud vendor. Furthermore, additional compute operations such as garbage collection executions don’t run within your account, saving you additional costs.
lakeFS OSS is used by hundreds of organizations (that we know of) today. We are committed to continuing support and further improving our open-source solution. Having said that, you might be looking for additional features to allow your organization to maximize the benefits of lakeFS. Below is a table comparing the three solutions, side by side:
| Feature | lakeFS Open Source | lakeFS Enterprise | lakeFS Cloud (SaaS) |
|---|---|---|---|
| Format-agnostic data version control | V | V | V |
| Zero Clone copy for isolated environment (via branches) | V | V | V |
| Atomic Data Promotion (via merges) | V | V | V |
| Data stays in place | V | V | V |
| Configurable Garbage Collection | V | V | V |
| Data CI/CD using lakeFS hooks | V | V | V |
| Integrates with your data stack | V | V | V |
| Role-Based Access Control | | V | V |
| Managed Service (Auto-updates, Auto-scaling, Disaster Recovery, etc.) | | | V |
| Managed Garbage Collection | | | V |
| SOC 2 Compliant | | | V |