If you’re deploying lakeFS on Azure, you’ll want to choose the right set of components for the job. One of the things lakeFS needs in order to run is somewhere to store its metadata, and as of the v0.103 release of lakeFS we support CosmosDB for this.
In this blog post, we’ll explore the new integration between lakeFS and CosmosDB, highlight its benefits, and walk through deploying and using the two together.
lakeFS provides git-like capabilities by managing a metadata layer that points to the actual objects stored in the object store. lakeFS metadata is stored both in the object store (mostly metadata for committed objects) and in a database for everything else (uncommitted metadata, repository and branch pointers, and more). CosmosDB joins PostgreSQL as a second database option for Azure installations.
What is CosmosDB?
Azure CosmosDB is a fully managed, globally distributed, multi-model database service with comprehensive support for NoSQL data models. By integrating CosmosDB with lakeFS, we gain access to features such as automatic scaling, guaranteed high availability, and managed replication. CosmosDB’s flexible data models include key-value, which is a natural fit for the key-value store that lakeFS uses for its metadata, known as the kvstore.
Deploying lakeFS with CosmosDB
Setting up lakeFS with CosmosDB is a straightforward process that requires a few key steps:
The first step involves creating an Azure CosmosDB account within the Azure portal.
After creating the CosmosDB account, the easiest approach is to run lakeFS and pass it the endpoint, database name, and container name. lakeFS will create the database and container with the appropriate partition key and consistency level.
For authentication, you can either pass the primary key of the CosmosDB account to lakeFS as another parameter or, if running in the Azure cloud, rely on Azure Managed Identities and assign the proper role to your lakeFS app.
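As a rough illustration of the parameters described above, here is a minimal Python sketch that assembles them into environment variables for a lakeFS process. The helper name `lakefs_cosmosdb_env` is hypothetical, and the variable names are an assumption based on the lakeFS convention of exposing configuration keys as `LAKEFS_`-prefixed environment variables, so check them against the configuration reference for your lakeFS version. The endpoint, database, and container values are placeholders.

```python
import shlex
from typing import Dict, Optional


def lakefs_cosmosdb_env(
    endpoint: str,
    database: str,
    container: str,
    key: Optional[str] = None,
) -> Dict[str, str]:
    """Build environment variables pointing lakeFS at a CosmosDB account.

    The variable names are assumed, not taken from the post; verify them
    against the lakeFS configuration reference before relying on them.
    """
    if not endpoint.startswith("https://"):
        raise ValueError("endpoint should be the account's https:// URI")

    env = {
        "LAKEFS_DATABASE_TYPE": "cosmosdb",
        "LAKEFS_DATABASE_COSMOSDB_ENDPOINT": endpoint,
        "LAKEFS_DATABASE_COSMOSDB_DATABASE": database,
        "LAKEFS_DATABASE_COSMOSDB_CONTAINER": container,
    }
    # Only set the key when one is supplied. Leaving it out corresponds to
    # the managed-identity option, where no account key is passed at all.
    if key is not None:
        env["LAKEFS_DATABASE_COSMOSDB_KEY"] = key
    return env


if __name__ == "__main__":
    # Placeholder values; substitute those of your own CosmosDB account.
    settings = lakefs_cosmosdb_env(
        endpoint="https://example-account.documents.azure.com:443/",
        database="lakefs-db",
        container="lakefs-kv",
    )
    for name, value in settings.items():
        print(f"export {name}={shlex.quote(value)}")
```

Running the script prints `export` lines that can be sourced before starting lakeFS. Omitting `key` mirrors the Managed Identities path, in which case the role assignment on the Azure side still has to be done separately.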
Benefits of Combining lakeFS with CosmosDB
Using CosmosDB as the K/V store for lakeFS is the recommended option when deploying lakeFS on Azure, for the following reasons:
- Scalability: CosmosDB’s automatic scaling capabilities, combined with lakeFS’s ability to handle large datasets, let the deployment scale smoothly. As data volumes grow, both components can dynamically adjust to accommodate increased storage and performance requirements.
- High Availability: CosmosDB’s multi-region replication ensures that data remains highly available even in the event of a regional outage or failure.
- Data Consistency: CosmosDB provides strong consistency guarantees, ensuring that every read returns the most up-to-date data. lakeFS builds upon this foundation, offering consistent snapshots and atomic commits, which are vital for managing complex data workflows and ensuring data integrity.
- Performance: By using CosmosDB’s low-latency access and efficient indexing mechanisms, lakeFS enables swift metadata operations and efficient query execution. This improves the performance of versioning operations, giving faster access to historical data and reducing the time required for analysis and experimentation.
- Faster Provisioning: It is much quicker to spin up a CosmosDB container than to create a PostgreSQL server.