In the world of data management, security is a paramount concern.
The more data we generate and store, the more critical it becomes to ensure that data is both accessible and protected.
lakeFS, a powerful and innovative data version control system, takes data security to the next level by offering a unique feature: the ability to manage data it cannot access!
This is made possible by leveraging pre-signed URLs, which are available on all common object stores (Amazon S3, Google Cloud Storage, Azure Blob Storage, MinIO).
In this blog post, we’ll explore the security benefits of this approach and how it enhances data management for lakeFS users.
The Challenge of Data Lake Security
Data lakes and object stores have become the backbone of modern data infrastructure, allowing organizations to store vast amounts of structured and unstructured data. While the scalability and flexibility of these solutions are undeniable, ensuring data security within them can be challenging.
Traditional access control mechanisms, such as IAM (Identity and Access Management) policies, may not be sufficient for more complex environments – especially when the service or person accessing the data isn’t the end user.
This is a very common setup for data practitioners. For example, in most modern data stacks, we introduce a compute system that actually interfaces with the storage layer – be it Snowflake, Databricks, Trino, or any other engine. Typically, that compute layer is the one holding the IAM permissions to access the underlying data. The end user – usually an analyst, data engineer, or data scientist – isn’t really consuming raw objects; they consume tables, columns, queries, and dashboards!
lakeFS: A Secure Data Version Control System
lakeFS is designed to help organizations overcome these challenges. It acts as a data layer that sits on top of your existing data lake or object store, providing powerful version control and data management capabilities. One of its standout features is the ability to manage data it cannot even access!
This is made possible by utilizing pre-signed URLs: time-limited URLs that grant temporary access to specific objects in an object store, such as Amazon S3 or Google Cloud Storage. These URLs can be generated and provided to users or applications as needed, without compromising the credentials (or keys) used to generate them.
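As a quick illustration of the mechanism, here is a minimal sketch of generating a pre-signed download URL for S3 with boto3 – the bucket and key names are hypothetical:
import boto3

# Credentials are resolved from the environment and never leave this process.
s3 = boto3.client("s3")

# Generate a URL that allows downloading one specific object for 15 minutes.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-data-lake", "Key": "tables/orders/1.csv"},  # hypothetical names
    ExpiresIn=900,  # the URL expires after 900 seconds
)
print(url)  # anyone holding this URL can GET that one object until it expires
The other object stores mentioned above expose equivalent signing APIs.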
This allows intermediary systems – such as the compute engines listed above, as well as lakeFS (more on that later) – to layer another, typically more specific, authorization mechanism on top of the object store, essentially plugging in their own authorization logic. In a typical scenario, the authorization system performs two important tasks:
- It translates the business context of the user’s request into the underlying data required (e.g. from the “orders” table → an object store path)
- It generates the required pre-signed URLs, allowing the user to interact with the storage system without ever being handed storage credentials – see the sketch just after this list
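To make this concrete, here is a minimal sketch of such an authorization layer in Python. It assumes a hypothetical table-to-prefix mapping and a stub permission check, and uses boto3 to do the signing – it illustrates the pattern, and is not part of lakeFS:
import boto3

s3 = boto3.client("s3")

# Hypothetical mapping from business-level table names to object store prefixes.
TABLE_TO_PREFIX = {"orders": "tables/orders/"}

def user_may_read(user, table):
    # Stub: plug in your own authorization logic here (roles, ACLs, row filters, ...).
    return table in TABLE_TO_PREFIX

def presign_table_object(user, table, filename, bucket="my-data-lake"):
    # Task 1: decide whether this user may read the table at all, then translate
    # the business request ("orders") into an object store path.
    if not user_may_read(user, table):
        raise PermissionError(f"{user} may not read {table}")
    key = TABLE_TO_PREFIX[table] + filename

    # Task 2: hand back a short-lived URL instead of storage credentials.
    # Note: signing happens locally – no call to the object store is made here.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=300,  # valid for 5 minutes
    )
The user receives a URL they can fetch directly from the object store, while the storage keys never leave the authorizing service.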
While simple in concept, this is a very powerful approach. Those of you familiar with the concept (or simply very observant) will notice that the authorizing system never actually contacts the object store to fulfill this role!
Enhanced Security Benefits
1. Granular Access Control – Pre-signed URLs enable fine-grained access control. By generating URLs with specific permissions for each user or application, you can restrict access to only what is necessary. This reduces the risk of unauthorized access, limiting the potential for data breaches.
2. Temporary Access – Pre-signed URLs have a limited validity period. Once the defined time frame expires, the URL becomes invalid, rendering it useless for unauthorized access. This feature ensures that sensitive data is only accessible for a limited time, reducing the window of vulnerability.
3. Isolation of lakeFS – Since lakeFS manages data it cannot access directly, the data remains isolated from the lakeFS environment. Even if an unauthorized entity gains access to lakeFS, it cannot access the data stored in the underlying object store without meeting other security criteria first: Network perimeters, MFA, etc. This extra layer of security protects your data lake from potential breaches.
In practice: using lakeFS with pre-signed URLs
Let’s see a practical example – in this case, using lakeFS Cloud alongside an AWS S3 bucket.
Let’s imagine a scenario where our organization doesn’t allow data to move in or out of the network perimeter of our AWS account.
lakeFS Cloud is highly secure, SOC 2 compliant, and provides single-tenant isolation – but it runs inside its own managed VPC!
How can we still meet the strict network requirements? The answer should be obvious by now: pre-signed URLs!
Let’s start by making sure lakeFS Cloud indeed cannot access our data, by restricting our bucket’s data access operations so they can only happen from within the perimeter of our VPC network:
{
  "Sid": "allowLakeFSRoleFromCompanyOnly",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]"
  },
  "Action": [
    "s3:GetObject",
    "s3:PutObject"
  ],
  "Resource": [
    "arn:aws:s3:::[BUCKET]/*"
  ],
  "Condition": {
    "StringEquals": {
      "aws:SourceVpc": "vpc-123"
    }
  }
}
Any GetObject (download) or PutObject (upload) operation that takes place outside our VPC (vpc-123 in this example) would simply fail. That means that the lakeFS Cloud server, even though it might have a role that permits these operations, will not be able to execute them, because it runs in another VPC!
A full example is available in the storage guide on the lakeFS docs.
Now that we’ve configured our server, let’s see how to configure different clients to use pre-signed URLs:
Example #1 – Using pre-signed URLs with the lakeFS Python SDK
Pre-signed URLs are fully supported by the lakeFS Python SDK. To get started, first install the lakefs-sdk package from PyPI:
$ pip install lakefs-sdk~=1.1.0
Once installed, import and configure a lakeFS client:
from lakefs_sdk import Configuration
from lakefs_sdk.client import LakeFSClient
client = LakeFSClient(Configuration(
    host="https://<YOUR LAKEFS SERVER URL>",
    username="AKIAIOSFODNN7EXAMPLE",  # lakeFS access key ID
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"))  # lakeFS secret access key
Now that we have a client, let’s download an object from lakeFS by utilizing pre-signed URLs:
client.objects_api.get_object(
    repository="my-repository",
    ref="main",
    path="tables/orders/1.csv",
    presign=True,  # This is where the magic happens!
)
By simply passing presign=True to the get_object function, lakeFS returns a pre-signed URL for the underlying object store, and the Python client then performs an HTTP GET request against that URL. No data goes through lakeFS!
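If you’d rather see the pre-signed URL itself (and fetch it yourself), a sketch along these lines should work – it assumes stat_object also accepts a presign=True flag, in which case the returned physical address is a time-limited URL pointing straight at the object store:
import requests

# Ask lakeFS for the object's metadata, with a pre-signed physical address.
stats = client.objects_api.stat_object(
    repository="my-repository",
    ref="main",
    path="tables/orders/1.csv",
    presign=True,  # assumed to mirror get_object's presign flag
)

# Fetch the data directly from the object store – not through lakeFS.
response = requests.get(stats.physical_address)
response.raise_for_status()
print(len(response.content), "bytes downloaded")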
See more information on using lakeFS with Python
Example #2 – Using pre-signed URLs in Apache Spark
Let’s do something more ambitious to see how well this scales! Let’s put a distributed compute framework in front of lakeFS – in this case, Apache Spark.
Let’s start by setting up your Spark environment with the lakeFS file system implementation – this is typically only a few lines of configuration:
$ spark-shell --packages io.lakefs:hadoop-lakefs-assembly:0.2.1
Next, set spark.hadoop.fs.lakefs.access.mode to presigned, and provide your Spark environment with connection details for lakeFS:
spark-shell --conf spark.hadoop.fs.lakefs.access.mode=presigned \
--conf spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem \
--conf spark.hadoop.fs.lakefs.access.key=AKIAlakefs12345EXAMPLE \
--conf spark.hadoop.fs.lakefs.secret.key=abc/lakefs/1234567bPxRfiCYEXAMPLEKEY \
--conf spark.hadoop.fs.lakefs.endpoint=https://example-org.us-east-1.lakefscloud.io/api/v1 \
--packages io.lakefs:hadoop-lakefs-assembly:0.2.1
In a managed environment such as AWS EMR or Databricks, you’d probably not do this directly on the command line – see the full reference to learn how to set these values in your environment.
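For example, in a PySpark notebook you could set the same properties on the session builder instead of passing them as command-line flags – a sketch that reuses the example values above and assumes the hadoop-lakefs-assembly jar is already available on the cluster:
from pyspark.sql import SparkSession

# The same settings as the spark-shell flags above, set programmatically.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    .config("spark.hadoop.fs.lakefs.access.mode", "presigned")
    .config("spark.hadoop.fs.lakefs.access.key", "AKIAlakefs12345EXAMPLE")
    .config("spark.hadoop.fs.lakefs.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
    .config("spark.hadoop.fs.lakefs.endpoint", "https://example-org.us-east-1.lakefscloud.io/api/v1")
    .getOrCreate()
)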
Now that the jar has been added and the configuration is set, go ahead and read/write from lakeFS like you would from any storage service, using the lakefs:// URI:
df = spark.read.parquet('lakefs://my-repository/main/tables/orders/')
df.write.parquet('lakefs://my-repository/main/other/path')
All I/O operations will use pre-signed URLs provided by the lakeFS server – the lakeFS Spark client abstracts this away for you!
Example #3 – Using pre-signed URLs in lakectl – the lakeFS command-line client
lakeFS comes with a powerful command-line utility called lakectl – it lets you fully manage, read, and write data in a lakeFS server.
This client is pretty clever! It requires no additional configuration to use pre-signed URLs:
$ lakectl fs download lakefs://my-repository/main/tables/orders/1.csv # will use a pre-signed URL if supported by the underlying storage!
$ lakectl fs upload 2.csv lakefs://my-repository/main/tables/orders/2.csv # Also works when writing!
Install lakectl and read more on how to use it
Conclusion
The use of pre-signed URLs by lakeFS to manage data it cannot directly access significantly enhances data security for its users.
This approach provides granular access control, temporary access, and isolation of lakeFS from the underlying data.
By combining these features with lakeFS’s data version control capabilities, organizations can confidently manage and protect their data, even in complex multi-user and multi-environment settings.
With data security concerns on the rise, lakeFS offers a compelling solution for safeguarding your critical data assets.