Paul Singman
August 25, 2021

Introduction

An important part of developing an open source project like lakeFS is assisting and advising our users. When they run into an issue and feel pain, we want to feel that pain, too. Quite literally.

This means recreating the environment, running the same code, and raising the same error.

In complex, modern data stacks this is easier said than done. Drawing on our experience over the past year, we have built a setup that helps us in this pursuit. We affectionately call it the Everything Bagel.

What is the Everything Bagel?

The Everything Bagel is a multi-container Docker environment that spins up locally with a single command. It contains many of the technologies we see lakeFS commonly deployed with, including:

  • Spark
  • Hive
  • Trino
  • MinIO 

The best part – it’s publicly available right in the lakeFS GitHub repo!

In this post I’ll cover how to get the Docker Everything Bagel up and running on your own laptop. In the process, I’ll also cover how it works and some of the cool things you can do with your very own Everything Bagel.

Diagram of the Everything Bagel Environment

Spin It Up!

The only prerequisite is to have Docker installed on your machine. Once it is installed, the steps to get the Everything Bagel running are as follows:

  1. Clone the lakeFS repo: git clone https://github.com/treeverse/lakeFS.git
  2. Navigate to the deployments/compose directory inside the cloned repo: cd lakeFS/deployments/compose
  3. Run: docker compose up -d
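
The three steps above can be collected into a small shell function. This is only a sketch: bagel_up, run, and the DRY_RUN variable are hypothetical names introduced for illustration (they are not part of lakeFS or Docker), and it assumes git and a Docker version that provides the docker compose subcommand are installed.

```shell
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: clone lakeFS (if needed) and start the Everything Bagel.
bagel_up() {
  # Skip the clone when a lakeFS directory already exists here.
  [ -d lakeFS ] || run git clone https://github.com/treeverse/lakeFS.git || return 1
  # Use a subshell so the caller's working directory is left unchanged.
  ( run cd lakeFS/deployments/compose && run docker compose up -d )
}
```

Calling DRY_RUN=1 bagel_up prints the commands without executing them, which is a cheap way to check what would happen; plain bagel_up runs them.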

That’s it! The different containers will start to spin up. Once startup completes (it takes a few minutes the first time), you can check the status of the resulting containers by running docker compose ps. You should see output in your terminal like:

NAME                     COMMAND                   SERVICE             STATUS              PORTS
compose_spark-worker_1   "/opt/bitnami/script…"    spark-worker        running             0.0.0.0:53129->8081/tcp
compose_spark-worker_2   "/opt/bitnami/script…"    spark-worker        running             0.0.0.0:53126->8081/tcp
compose_spark-worker_3   "/opt/bitnami/script…"    spark-worker        running             0.0.0.0:53128->8081/tcp
compose_spark_1          "/opt/bitnami/script…"    spark               running             0.0.0.0:18080->8080/tcp, :::18080->8080/tcp
hive                     "/bin/sh -c \"/entryp…"   hive-metastore      running             0.0.0.0:9083->9083/tcp, :::9083->9083/tcp
hiveserver2              "hive --service hive…"    hive-server         running
lakefs                   "/app/wait-for postg…"    lakefs              running             0.0.0.0:8000->8000/tcp, :::8000->8000/tcp
lakefs-setup             "/app/wait-for postg…"    lakefs-setup        exited (0)
mariadb                  "docker-entrypoint.s…"    mariadb             running             3306/tcp
minio                    "minio server /data …"    minio               running             0.0.0.0:9000->9000/tcp, :::9000->9000/tcp, 0.0.0.0:9001->9001/tcp, :::9001->9001/tcp
minio-setup              "mc mb lakefs/example"    minio-setup         exited (0)
postgres                 "docker-entrypoint.s…"    postgres            running             5432/tcp
trino                    "/usr/lib/trino/bin/…"    trino               running             8080/tcp
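
With this many containers, it can be handy to filter the listing for anything that did not come up cleanly. The function below is a rough sketch, not an official tool: unhealthy_containers is a hypothetical name, and it assumes the STATUS column reads "running" or "exited (0)" exactly as in the sample above. Other Compose versions may word the status differently, in which case the patterns would need adjusting.

```shell
# Hypothetical helper: read `docker compose ps` output on stdin and print
# the NAME of every container that is neither "running" nor "exited (0)".
# The first line is skipped because it is the column header.
unhealthy_containers() {
  awk 'NR > 1 && !/ running( |$)/ && !/ exited \(0\)( |$)/ { print $1 }'
}

# Intended usage (requires the Bagel to be up, so it is left commented out):
#   docker compose ps | unhealthy_containers
```

An empty result means every container is either still running or finished its one-off setup task successfully.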

A Quick Note on Memory

When creating the 10+ Everything Bagel containers, it’s important to make sure the Docker application has enough memory allocated to it.

To adjust this setting, go to Docker’s preferences page. From there, go to the Resources tab and make sure the Memory setting is set to at least 4 GB.
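
The same check can be made from the command line. The sketch below assumes that docker info --format '{{.MemTotal}}' prints the memory available to Docker in bytes; bytes_to_mib and warn_if_low_memory are hypothetical helper names. The reported total can differ slightly from the value set in the preferences, so treat the threshold as approximate.

```shell
# Hypothetical helper: convert a byte count to whole MiB.
bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

# Hypothetical helper: warn_if_low_memory BYTES [MIN_MIB]
# Returns non-zero when BYTES is below MIN_MIB (default 4096 MiB, i.e. 4 GB).
warn_if_low_memory() {
  local mib min_mib
  mib="$(bytes_to_mib "$1")"
  min_mib="${2:-4096}"
  if [ "$mib" -lt "$min_mib" ]; then
    echo "Docker has ${mib} MiB; at least ${min_mib} MiB is recommended"
    return 1
  fi
  echo "Docker has ${mib} MiB: OK"
}

# Intended usage (requires a running Docker daemon, so it is left commented out):
#   warn_if_low_memory "$(docker info --format '{{.MemTotal}}')"
```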

How It Works

The key to understanding how the Docker Everything Bagel works is the docker-compose.yml file, which the docker compose up -d command references by default (the -d flag runs the containers in “detached” mode, in the background).

Although this article won’t be a comprehensive review of Docker Compose files, let’s take a peek at a few of the service sections in the config to understand what’s happening.

version: "3"
services:

  lakefs-setup:
    image: treeverse/lakefs:latest
    container_name: lakefs-setup
    depends_on:
      - postgres
      - minio-setup
    environment:
      - LAKEFS_AUTH_ENCRYPT_SECRET_KEY=some random secret string
      - LAKEFS_DATABASE_CONNECTION_STRING=postgres://lakefs:lakefs@postgres/postgres?sslmode=disable
      - LAKECTL_CREDENTIALS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
      - LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      - LAKECTL_SERVER_ENDPOINT_URL=http://lakefs:8000
    entrypoint: ["/app/wait-for", "postgres:5432", "--", "sh", "-c",
      "lakefs setup --user-name docker --access-key-id AKIAIOSFODNN7EXAMPLE --secret-access-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY && lakectl repo create lakefs://example s3://example"
      ]

  minio-setup:
    image: minio/mc
    container_name: minio-setup
    environment:
      - MC_HOST_lakefs=http://minioadmin:minioadmin@minio:9000
    depends_on:
      - minio
    command: ["mb", "lakefs/example"]
    
  postgres:
    image: postgres:11
    container_name: postgres
    environment:
      POSTGRES_USER: lakefs
      POSTGRES_PASSWORD: lakefs

This snippet shows three service sections: postgres (a lakeFS dependency), minio-setup (which creates a bucket in MinIO, the underlying object store), and lakefs-setup (which initializes lakeFS).

Docker Compose Specifics

Within a service’s section, perhaps the most important setting is the image.

This specifies the image the container is created from. In most cases, you can find an official image for a service on Docker Hub that installs all the packages you need to run the service. For example, the lakeFS image on Docker Hub is updated automatically by a GitHub Action workflow every time there’s a new official release of lakeFS, and a container created from it starts running lakeFS automatically.

Next, the depends_on: key is used to control the order in which containers are created. Before starting the lakeFS service we make sure a container running postgres is up first, as well as a service that starts MinIO and creates a bucket. If you aren’t familiar with MinIO, it’s an open-source object store that maintains compatibility with the S3 API. This makes it convenient for simulating S3 in local environments as we are doing here.
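
One caveat worth spelling out: in the short list form used here, depends_on controls start order but does not wait until a service is actually ready to accept connections. That is why the lakefs-setup entrypoint in the snippet above goes through /app/wait-for postgres:5432 before doing anything else. The idea behind such a wait can be sketched as a generic retry loop; wait_until is a hypothetical name for illustration and is not the script shipped in the lakeFS image.

```shell
# Hypothetical helper: wait_until MAX_TRIES DELAY_SECONDS COMMAND...
# Runs COMMAND repeatedly until it succeeds or MAX_TRIES attempts have failed.
wait_until() {
  local tries="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    # Success: the dependency is reachable, stop waiting.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  # Every attempt failed.
  return 1
}

# Intended usage from the host (assumes curl is installed and lakeFS is
# published on port 8000 as in the listing above; left commented out):
#   wait_until 30 2 curl -fsS http://localhost:8000
```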

Any environment variables can be listed within the environment: key, as shown in all three services above. Similarly, we can keep entire config files on the host and mount them into the container as a volume. The spark service contains an example of this.

The last thing I’d like to cover is the entrypoint: key. This is the command (or commands) to run when the container starts. Usually this simply starts the relevant service. In the lakefs-setup service, however, we perform the additional steps of creating a lakeFS user (via lakefs setup) and a repository (via the lakectl command line tool), so these steps don’t need to be performed manually each time.

After all, there is little value in a repository-less lakeFS instance.

Getting these settings right lets us run pretty much any service we want inside an isolated Docker environment. Pretty cool!

Using the Everything Bagel

Once you have the Everything Bagel up and running, you can do a variety of things, such as:

  1. Connecting to the Hive and Trino clients [docker compose --profile client run --rm trino-client] and creating tables or querying data
  2. Hopping into the spark container [docker compose exec spark bash] and testing out spark-submit jobs
  3. Logging into the lakeFS (http://localhost:8000) and MinIO (http://localhost:9000) UIs to see how Spark, Hive, and Trino’s operations are reflected.

Try different things out and let us know how it goes!

Looking Ahead

This is an introduction to the Bagel. I hope you learned a bit about Docker and how it can be used to easily create complex environments locally. Next time we’ll dive into a more advanced use case showing the Everything Bagel in action!

We’re always continuing to improve it as well, making the Docker environment simpler and adding other relevant technologies.

Thoughts or suggestions on how to make the Everything Bagel better?
