
Paul Singman

April 20, 2021

It makes perfect sense that if you type aws s3 ls s3://my-bucket to list the contents of an S3 bucket, you would expect to connect to the genuine bucket and have its contents listed.

But there’s no hard rule that you have to connect to the real bucket. And in fact, there’s a simple parameter you can pass to the above CLI command to easily connect instead to any URL of your choice. 

Consider the following:

$ aws s3 ls s3://my-bucket --endpoint-url http://localhost:5000

Notice the argument --endpoint-url pointing to localhost on port 5000? Rather than list the contents of my-bucket on some AWS server somewhere, we will instead check our own machine for a bucket-like resource that can respond to the ls command.

This raises the question: why would AWS expose a parameter like this, and why would we ever want to use it?

Well, it turns out there are good reasons for both. In this article, we’ll cover two situations where messing with the endpoint_url proves useful.

Use Case #1: Local Testing

Anyone who’s tested their code locally knows how fraught the situation can become when calling external dependencies. We can agree that it is preferable for a unit test to not actually insert a test record into our database, for example.

To get around this issue, there are a few strategies at our disposal. We can mock. We can monkeypatch. Or, least pleasant of all, we can add testing-specific logic to our code.
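For contrast, here is a minimal sketch of the mock-and-patch approach; the upload_report function and its bucket and key names are made up for illustration, and every boto3 call it makes has to be intercepted and asserted by hand:

from unittest.mock import patch

import boto3

def upload_report(bucket: str) -> None:
    # Illustrative code under test: writes one small object to S3.
    s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key="report.txt", Body=b"ok")

def test_upload_report():
    # Replace the client factory so no real AWS request is ever made.
    with patch("boto3.client") as fake_client:
        upload_report("my-bucket")
        fake_client.return_value.put_object.assert_called_once_with(
            Bucket="my-bucket", Key="report.txt", Body=b"ok"
        )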

By running a local instance of an AWS service and then setting the endpoint_url parameter in code to localhost and the correct port, we can effectively mock a service in a painless way.

Let’s see this in action.

An Example using moto

For a full-fledged example, I’ll point to the excellent docs for moto, a package created specifically to mock AWS services.

Moto has a Stand-alone Server Mode that makes how this works especially clear. After pip installing the package, start the local moto server:

$ moto_server s3
* Running on http://127.0.0.1:5000/
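Before writing any code, you can sanity-check the server from the CLI using the same --endpoint-url trick from earlier (the bucket name here is just an example):

$ aws s3 mb s3://my-bucket --endpoint-url http://localhost:5000
make_bucket: my-bucket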

Then in a piece of code, we can use the endpoint_url to point to the local moto S3 instance:

import boto3

# Point the boto3 S3 resource at the local moto server instead of AWS
mock_s3 = boto3.resource(
    service_name='s3',
    region_name='us-east-1',
    endpoint_url='http://localhost:5000',
)

Now we can do whatever operations we want on the mock_s3 object without worrying about altering the contents of the actual bucket.
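For instance, a few throwaway operations (the bucket and key names are made up) round-trip entirely against the local server:

# Everything below hits the local moto server, never AWS
mock_s3.create_bucket(Bucket="my-bucket")
mock_s3.Object("my-bucket", "hello.txt").put(Body=b"hello world")

for obj in mock_s3.Bucket("my-bucket").objects.all():
    print(obj.key)  # -> hello.txt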

Use Case #2: Seamless Integrations

Given the “rise of the cloud”, it shouldn’t be surprising that the APIs of cloud services have also become ubiquitous. For better or worse, the AWS S3 API is now “the de facto standard in the object storage world.” [1]

When multiple technologies adopt the same standard, the pain of integrating each pairwise combination goes away.

lakeFS, MinIO, and Ceph are examples of technologies that speak the S3 language. Or put more accurately, they maintain compatibility with a meaningful subset of the S3 API.

As a result, any tool that expects to connect to S3 can also seamlessly integrate with these tools… via Endpoint URLs!
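As a quick illustration, here is a sketch of pointing boto3 at a local MinIO server instead of AWS; the port and credentials below are MinIO’s documented defaults, and the rest is assumed for the example:

import boto3

# Talk to a local MinIO deployment using the plain S3 API
minio = boto3.resource(
    service_name="s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

for bucket in minio.buckets.all():
    print(bucket.name)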

An Example Featuring Spark and lakeFS

Spark is an example of a technology that often interacts with S3, most commonly reading data into a DataFrame, applying some transformation, and then writing the result back to S3.

lakeFS is designed to enhance the functionality of data lakes over object storage, making possible operations such as branching, merging, and reverting.

If you wanted to take advantage of both Spark and lakeFS and there were no way to set a custom endpoint URL, a bespoke integration would have to be developed between the two.

Luckily, with configurable endpoints, the integration becomes a one-liner pointing Spark’s S3 Endpoint to a lakeFS installation:

spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.lakefs.example.com")

Now accessing data in lakeFS from Spark is exactly the same as accessing S3 data from Spark!
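Here is a minimal PySpark sketch of what that looks like; the endpoint, credentials, repository (example-repo), and branch (main) are placeholders, and lakeFS exposes objects under s3a://<repository>/<branch>/<path>:

from pyspark.sql import SparkSession

# Placeholder endpoint and lakeFS credentials; substitute your own
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.lakefs.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "<LAKEFS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<LAKEFS_SECRET_ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read from and write to a lakeFS branch exactly as if it were S3
df = spark.read.parquet("s3a://example-repo/main/tables/events/")
df.groupBy("event_type").count().write.parquet(
    "s3a://example-repo/main/tables/event_counts/"
)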

Final Thoughts

While this is only an introduction to the topic, hopefully you now have a better understanding of what endpoint URLs are and when it makes sense to change them from their default value.


If you enjoyed this article, check out our GitHub repo, Slack group, and related posts.
