Barak Amar

December 9, 2020

TL;DR

With some thoughtful engineering, we can achieve a lot of the benefits that come with a microservice oriented architecture, while retaining the simplicity and low operating cost of being a monolith.

Monolith in Utah Canyon (By Photo Looks)

What is lakeFS?

lakeFS is an open source tool that delivers resilience and manageability to object-storage based data lakes. lakeFS provides Git-like capabilities over your object storage environment and works seamlessly with all modern data frameworks such as Spark, Hive, AWS Athena, Presto, etc.

lakeFS is built as a monolith: a single binary that holds UI, REST API, S3-Compatible API, and command-line functionality all packed in a single executable, living in a single Git repository.

What is a “service” in lakeFS?

When people say “service” in 2020, they usually mean a microservice: a highly scoped, independent, replaceable and upgradeable component.

Our services follow the same principles, even though they are running inside a single process.

Each service is a unit of functionality, working independently with an explicit API. Services can use other services, but this dependency is made explicit only when initializing each service, and doesn’t change afterwards.

To demonstrate this concept, let’s take a look at the lakeFS block adapter. This service is responsible for reading and writing data to an underlying object store (such as S3, GCS, or even a local filesystem).

Pseudo code based on adapter.go

type Adapter interface {
	...
	Put(path string, reader io.Reader) error
	Get(path string) (io.ReadCloser, error)
	...
}

Notice that the interface methods do not depend on other services. Dependencies between services are defined when the services are initialized; in this case, each implementation of the adapter holds the relevant storage client library.
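For illustration, here is a minimal sketch of what a local-filesystem implementation of this interface could look like. It is hypothetical, not the actual lakeFS adapter, and assumes the standard io, os, and path/filepath packages:

// Hypothetical local-filesystem adapter (illustrative, not lakeFS code).
// The dependency, here a root directory, is captured at construction time.
type localAdapter struct {
	root string
}

func NewLocalAdapter(root string) *localAdapter {
	return &localAdapter{root: root}
}

func (a *localAdapter) Put(path string, reader io.Reader) error {
	full := filepath.Join(a.root, path)
	if err := os.MkdirAll(filepath.Dir(full), 0755); err != nil {
		return err
	}
	f, err := os.Create(full)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, reader)
	return err
}

func (a *localAdapter) Get(path string) (io.ReadCloser, error) {
	return os.Open(filepath.Join(a.root, path))
}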

A service that requires access to an object store will ask for a `block.Adapter`. It doesn’t care which implementation will be passed to it. For example, ManifestManager, a service that uses the block adapter, can be injected with any service implementing the block adapter interface.

Example code:

type Manifest struct {...}

type ManifestManager struct {
	adapter block.Adapter
}

func NewManifestManager(adapter block.Adapter) *ManifestManager {
	return &ManifestManager{adapter: adapter}
}

func (m *ManifestManager) Load(name string) (*Manifest, error) {
	reader, err := m.adapter.Get("manifests/" + name)
	if err != nil {
		return nil, err
	}
	defer reader.Close()

	var manifest Manifest
	if err := json.NewDecoder(reader).Decode(&manifest); err != nil && err != io.EOF {
		return nil, err
	}
	return &manifest, nil
}

Using plain types in our service interface, rather than references to other services, makes it easy to mock the interface and even to wrap it with a communication layer so it can stand alone as a microservice.
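As a sketch of what that mocking could look like, here is an in-memory implementation of the adapter interface. It is purely illustrative (not lakeFS code) and assumes the bytes, io/ioutil, and os packages:

// Illustrative in-memory mock of block.Adapter for tests.
type mockAdapter struct {
	objects map[string][]byte
}

func newMockAdapter() *mockAdapter {
	return &mockAdapter{objects: map[string][]byte{}}
}

func (m *mockAdapter) Put(path string, reader io.Reader) error {
	data, err := ioutil.ReadAll(reader)
	if err != nil {
		return err
	}
	m.objects[path] = data
	return nil
}

func (m *mockAdapter) Get(path string) (io.ReadCloser, error) {
	data, ok := m.objects[path]
	if !ok {
		return nil, os.ErrNotExist
	}
	return ioutil.NopCloser(bytes.NewReader(data)), nil
}

Any test that exercises ManifestManager can now be injected with newMockAdapter() and never touch a real object store.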

A good litmus test for whether a service is properly scoped is how easy it would be to turn the Go interface definition into a gRPC service definition, or any other form of RPC.

Doing so should have only a marginal effect on the rest of the code, since dependencies are explicit and clear (leaving aside the world of complexity added by communication over a network, more on that later).
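To make the litmus test concrete, here is a rough, purely hypothetical sketch of how the same adapter could be expressed in an RPC-flavored style (these names are not lakeFS or generated gRPC code; streaming and error details are omitted):

// Hypothetical RPC-style view of the adapter. Because Put and Get already
// deal in plain paths and byte streams rather than in other services,
// the translation is mostly mechanical.
type PutRequest struct {
	Path string
	Body []byte // a real definition would likely stream this instead
}

type GetRequest struct {
	Path string
}

type GetResponse struct {
	Body []byte
}

type BlockService interface {
	Put(ctx context.Context, req *PutRequest) error
	Get(ctx context.Context, req *GetRequest) (*GetResponse, error)
}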

Dynamic vs static service binding

One of the big advantages of having your services live in a monolith is that they are bound statically. Our “service discovery” is the main() function. When you create a new service instance, it accepts all the dependent services it works with.

Pseudo code based on our service initialization

authService := auth.NewDBAuthService(db, crypt.NewSecretStore())

// ... more initialization code

s3gatewayHandler := gateway.NewHandler(blockAdapter, authService)

Here we can see how the DBAuthService and the gateway handler are composed from different parts.

Each component is written in a different package and supplies an isolated interface to work with.

Any change in requirements, change to an interface, or added dependency is checked at compile time and will fail the build.

In a microservices environment, this binding is done at runtime. Each service calls another service’s endpoint, and the interface requirements are met at the protocol level.

The dependency graph is intentionally simple: more specialized services depend on a few “core” infrastructure building blocks. Specialized services will not depend on each other. Neither will the core services. This makes the setup process very straightforward.

Initialization starts from the lower layers and proceeds up to the entry points of the application, avoiding circular dependencies between code packages.

In a dynamic environment, you need your deployment process to take care of setup and teardown: deploying the right version and rolling back when an incompatibility is discovered at runtime.

Entry point and composition

Having access to all our services inside the same project still requires us to have an entry point that composes all the services into our working solution.

We can also have multiple entry points into the monolith: one can compose all the API services, while another performs specific tasks using our services.

Each entry point can have a different composition depending on the operation you want to perform.

Let’s take, for example, our main server entry point: after we create and set up the services, we ask each service to expose an HTTP handler and compose them with a simple composition handler, `httputil.HostMux`, which we can pass to Go’s `http.Server`.

In the current implementation we chose to listen on a single port to keep server deployment simple, but we could do the same by listening on multiple endpoints (a sketch of that alternative follows the example below).

httputil package

Pseudo code based on our entry point

// ... more setup code

apiHandler := api.NewHandler(blockStore, authService,
	authMetadataManager,
	bufferedCollector,
	logger.WithField("service", "api_gateway"),
)

// bind all handlers under our server listening endpoint
server := &http.Server{
	Addr: GetListenAddress(),
	Handler: httputil.HostMux(
		// api as default handler
		httputil.HostHandler(apiHandler).Default(),
		// s3 gateway for its bare domain and subdomains
		httputil.HostHandler(s3gatewayHandler, 
			httputil.Exact(GetS3GatewayDomainName()),
			httputil.SubdomainsOf(GetS3GatewayDomainName())),
	),
}
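As noted above, the same composition also works with multiple listeners. A hypothetical sketch (not the lakeFS entry point), reusing the handlers from the example above and assuming the log package, could look like this:

// Hypothetical alternative: one http.Server per endpoint instead of a
// single HostMux on one port. The composition of services is unchanged.
apiServer := &http.Server{
	Addr:    ":8000",
	Handler: apiHandler,
}
s3Server := &http.Server{
	Addr:    ":8001",
	Handler: s3gatewayHandler,
}

go func() {
	if err := apiServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		log.Fatal(err)
	}
}()
if err := s3Server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
	log.Fatal(err)
}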

A command-line tool that performs a specific operation looks very similar: we first compose the required services, then call the right methods to perform the operation.
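A sketch of what such a command-line entry point could look like, reusing the hypothetical local adapter from earlier (the command, path, and output format are all illustrative assumptions):

// Hypothetical CLI entry point: compose only the services this command
// needs, then call the relevant method directly.
func runLoadManifest(name string) error {
	adapter := NewLocalAdapter("/var/lib/lakefs/data") // any block.Adapter works here
	manifests := NewManifestManager(adapter)

	manifest, err := manifests.Load(name)
	if err != nil {
		return err
	}
	return json.NewEncoder(os.Stdout).Encode(manifest)
}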

Understanding the trade-offs

In a modern microservices architecture, it is a given that services communicate with each other over a network. This makes sense, since the main drivers for microservices adoption are the scalability of the organization building the system and the desire for multiple, independent release queues.

This is a big tradeoff, however. Communicating over a network means we have a lot of new problems to solve now (observability, data consistency, handling cascading failures, and many other issues). 

With lakeFS being an open source project, we essentially have a single release queue more suitable for a monolith. We can keep most of the good parts that come with a clear service separation, without paying the complexity tax.

It’s not a novel concept; we believe it’s just good engineering.

Summary

I hope this article was useful and that it helps you with your future project design choices, especially if you don’t need, or can’t have, multiple release queues.


If you enjoyed this article, check out our GitHub repo, Slack group, and related posts.
