Eden Ohana
March 11, 2022

Introduction

It may seem strange at first, but increasingly, when putting data into or getting data from an object store, we cannot be sure the service handling the request is the one we think it is. Imitation is the sincerest form of flattery, and given cloud object stores’ popularity, they have naturally attracted imitators.

The most popular cloud object store is AWS’ S3 service. Imitating S3 amounts to maintaining an “S3-compatible” API: implementing the same API methods for putting and retrieving data, and returning responses in the same structures. Two of the most popular S3-compatible services are MinIO, valued at $1 billion, and the Linux Foundation’s Ceph, with over 10k GitHub stars. Both Ceph and MinIO aim to provide open source storage solutions that allow for self-hosting, cost savings, and greater customization than S3.

S3-compatible services aren’t limited to providing mere storage, though. Take, for example, the project whose blog you are reading: lakeFS, whose monthly installations grew 2000% in 2021. We open sourced lakeFS to extend object stores with git-like functionality that improves the manageability of data lakes.

A key feature of lakeFS is its S3-compatible API Gateway, which reduces integrations with other tools in the ecosystem to a single-line config update. 
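For example, pointing an S3 client or connector at lakeFS usually comes down to swapping the endpoint it talks to. With Spark’s S3A connector, the change might look something like this (the hostname below is a placeholder for your own lakeFS installation):

```
spark.hadoop.fs.s3a.endpoint = https://lakefs.example.com
```

Everything else in the application keeps speaking the S3 protocol it already knows.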

What's the point?

A side effect of this S3 compatibility is a newfound freedom to pick and choose which storage to use for a particular purpose. You might use the actual S3 cloud service for production workloads, for example, but a self-hosted MinIO cluster for development.

The challenge is keeping the complexity of a multi-service object store architecture from being fully reflected in our code. This is not so simple, because a single instance of the Boto S3 client (the popular Python SDK for AWS) can only interact with one storage endpoint.

Multiple storage endpoints therefore require multiple boto client instantiations, and what ends up happening is that you manually maintain multiple differently-named client connections in your code. This might not be a big deal going from one storage endpoint to two — but it makes scaling any further seem not worth the trouble.

We don’t want this to stop us, or others, from trying out these types of architectures, so we decided to do something about it.

Our solution is the Boto S3 Router package, available for download on PyPI. Let’s walk through when it makes sense to use it and how it works!

A Deeper Motivating Example

s3://example-bucket/
	       team1/
	       team2/
	       team3/

Consider a case where a company uses the above storage structure to separate the data of multiple teams. Now suppose one of those teams wants to adopt MinIO for a new project.

Code your company uses today to store objects might look like this…

s3_client.put_object(Bucket="example-bucket", Key="team1/obj1", Body=obj1)
s3_client.put_object(Bucket="example-bucket", Key="team2/obj2", Body=obj2)
s3_client.put_object(Bucket="example-bucket", Key="team3/obj3", Body=obj3)

…will now have to evolve to look more like the below.

import boto3
from botocore.config import Config

s3_client.put_object(Bucket="example-bucket", Key="team1/obj1", Body=obj1)

# A second client is now needed for MinIO, alongside the existing S3 client
minio_client = boto3.client('s3', endpoint_url='http://minio-url',
                            aws_access_key_id='YOUR-ACCESSKEYID',
                            aws_secret_access_key='YOUR-SECRETACCESSKEY',
                            config=Config(signature_version='s3v4'))

# One client for MinIO, one for S3
minio_client.put_object(Bucket="minio-bucket", Key="data/team2/obj2", Body=obj2)
s3_client.put_object(Bucket="example-bucket", Key="team3/obj3", Body=obj3)

It may not seem like a big deal for code performing a single put operation. But in reality, things easily get more confusing.

For example, let’s assume we have a very_complex_data_pipeline that is, well, complex. Think of your own pipelines – they probably contain calls to code written by other teams or a third party you don’t control. To use multiple clients instead of just one, you would likely need to change every function down the stack, and add custom logic to decide when to use which client.

This is where the Boto S3 Router Package enters the picture. It is a boto3-compatible client that simplifies working with multiple S3-compatible services simultaneously by routing requests to the relevant service according to the bucket name and key. This way, you can interact with multiple S3-compatible services alongside S3, without instantiating multiple clients and updating code everywhere it is referenced.
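To make the routing idea concrete, here is a simplified, illustrative sketch of the dispatch logic — not the package’s actual implementation — using glob-style pattern matching to pick a client by bucket and key (the profile structure mirrors the one shown below; the string client names stand in for real boto3 clients):

```python
from fnmatch import fnmatch

def route(profiles, client_mapping, bucket, key):
    """Return the client whose profile patterns match bucket/key,
    falling back to the 'default' client. Simplified sketch only."""
    for name, profile in profiles.items():
        if (fnmatch(bucket, profile.get("source_bucket_pattern", "*"))
                and fnmatch(key, profile.get("source_key_pattern", "*"))):
            return client_mapping[name]
    return client_mapping["default"]

profiles = {"minio": {"source_bucket_pattern": "example-bucket",
                      "source_key_pattern": "team2/*"}}
client_mapping = {"minio": "minio_client", "default": "s3_client"}

print(route(profiles, client_mapping, "example-bucket", "team2/obj2"))  # minio_client
print(route(profiles, client_mapping, "example-bucket", "team1/obj1"))  # s3_client
```

Every method call on the router client goes through a decision like this, then is forwarded to the chosen underlying client.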

Using the Boto S3 Router

The Boto S3 Router package can be simply installed via pip:

pip install boto-s3-router

Now let’s make it work for our example above. Firstly, we initialize our two “real” clients: one for S3 and one for MinIO.

import boto3
import boto_s3_router as s3r

s3_client = boto3.client('s3')
minio_client = boto3.client('s3', endpoint_url='http://minio-url',
                    aws_access_key_id='YOUR-ACCESSKEYID',
                    aws_secret_access_key='YOUR-SECRETACCESSKEY')

Next, we tell the router which requests go to which client. In our case, all requests to keys under “team2” get routed to MinIO, and all other requests go to S3. Crucially, this client mapping is the only place we need to refer to the MinIO and S3 clients.

profiles = {
    "minio": {
        "source_bucket_pattern": "example-bucket",
        "source_key_pattern": "team2/*",
        "mapped_bucket_name": "minio-bucket",
        "mapped_prefix": "data/team2",
    }
}

# Define the mapping between the profiles to the Boto clients:
client_mapping = {"minio": minio_client, "default": s3_client}

Finally, we initialize the router and use it the same way we used our original S3 client. We can also pass it into other functions without worry. No further code changes are necessary!


s3r_client = s3r.client(client_mapping, profiles)

very_complex_data_pipeline(s3r_client, Bucket="example-bucket", Paths=["team1/", "team2/", "team3/"])
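The key point is that very_complex_data_pipeline itself never changes: it receives something that looks like a boto3 S3 client and calls it as usual. A hypothetical sketch (both the pipeline body and the stub client below are ours, for illustration only):

```python
def very_complex_data_pipeline(s3_client, Bucket, Paths):
    """Hypothetical pipeline: it works with anything exposing the boto3
    S3 client interface, so passing the router client needs no changes."""
    for path in Paths:
        s3_client.put_object(Bucket=Bucket, Key=f"{path}marker", Body=b"done")

class StubClient:
    """Stand-in for a boto3 or router client; records calls for inspection."""
    def __init__(self):
        self.calls = []
    def put_object(self, **kwargs):
        self.calls.append(kwargs)

stub = StubClient()
very_complex_data_pipeline(stub, Bucket="example-bucket",
                           Paths=["team1/", "team2/", "team3/"])
print(len(stub.calls))  # 3
```

Swapping the stub for the router client is exactly the single change the package is designed to make possible.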

Another Use-case: Cross-region S3 Access

When your data is split between buckets across different regions, the Boto S3 Router can help you access it using a single client.

import boto3
import boto_s3_router as s3r

# Initialize two boto S3 clients:
s3_east = boto3.client('s3', region_name='us-east-1')
s3_west = boto3.client('s3', region_name='us-west-1')

profiles = {
    "s3_west": {
        "source_bucket_pattern": "us-east-bucket",
    },
    "s3_east": {
        "source_bucket_pattern": "us-west-bucket",
    }
}

client_mapping = {"s3_west": s3_west, "s3_east": s3_east, "default":s3_east}

client = s3r.client(client_mapping, profiles)

client.put_object(Bucket="us-west-bucket", Prefix="a/b/obj") # routes to s3_west
client.put_object(Bucket="us-east-bucket", Prefix="a/b/obj") # routes to s3_east

Handling multi-region services is much less messy when you can map them all within a single client! And of course, this is also a great way to incrementally adopt lakeFS for your data lake 🙂

Looking Ahead

In addition to Boto, another popular way to interact with S3 from Python is S3FS, a Pythonic file interface to S3 built on top of botocore. Given that it is the most prevalent way to interact with S3 from Pandas, we believe it is important to support it as well.

And finally of course we cannot ignore Spark. We are in the process of developing a Hadoop FileSystem implementation which will allow users to use multiple storage services side-by-side in Spark, with minimal changes to the existing code.

About lakeFS

The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.

Our mission is to maximize the manageability of open source data analytics solutions that scale.
