An important part of developing an open source project like lakeFS is assisting and advising our users. When they run into an issue and feel pain, we want to feel that pain, too. Quite literally.
This means recreating the environment, running the same code, and raising the same error.
In complex, modern data stacks this is easier said than done. Drawing on experience from the past year, we have built a setup that helps us in this pursuit. We affectionately call it the Everything Bagel.
What is the Everything Bagel?
The Everything Bagel is a multi-container Docker environment that spins up locally with a single command. It contains many of the technologies we see lakeFS commonly deployed with, including MinIO, Spark, Hive, and Trino.
The best part – it’s publicly available right in the lakeFS GitHub repo!
In this post I’ll cover how to get the Docker Everything Bagel up and running on your own laptop. In the process, I’ll also cover how it works and some of the cool things you can do with your very own Everything Bagel.
Spin It Up!
The only prerequisite is to have Docker installed on your machine. Once it is installed, the steps to get the Everything Bagel running are as follows:
- Clone the lakeFS repo:
git clone https://github.com/treeverse/lakeFS.git
- Navigate to the deployments/compose directory:
cd lakeFS/deployments/compose
- Start the containers:
docker compose up -d
That’s it! The different containers will start to spin up. Once this completes (it will take a few minutes the first time), you can check the status of the resulting containers by running docker compose ps. You should see a response in your terminal like:
NAME                     COMMAND                   SERVICE          STATUS       PORTS
compose_spark-worker_1   "/opt/bitnami/script…"    spark-worker     running      0.0.0.0:53129->8081/tcp
compose_spark-worker_2   "/opt/bitnami/script…"    spark-worker     running      0.0.0.0:53126->8081/tcp
compose_spark-worker_3   "/opt/bitnami/script…"    spark-worker     running      0.0.0.0:53128->8081/tcp
compose_spark_1          "/opt/bitnami/script…"    spark            running      0.0.0.0:18080->8080/tcp, :::18080->8080/tcp
hive                     "/bin/sh -c \"/entryp…"   hive-metastore   running      0.0.0.0:9083->9083/tcp, :::9083->9083/tcp
hiveserver2              "hive --service hive…"    hive-server      running
lakefs                   "/app/wait-for postg…"    lakefs           running      0.0.0.0:8000->8000/tcp, :::8000->8000/tcp
lakefs-setup             "/app/wait-for postg…"    lakefs-setup     exited (0)
mariadb                  "docker-entrypoint.s…"    mariadb          running      3306/tcp
minio                    "minio server /data …"    minio            running      0.0.0.0:9000->9000/tcp, :::9000->9000/tcp, 0.0.0.0:9001->9001/tcp, :::9001->9001/tcp
minio-setup              "mc mb lakefs/example"    minio-setup      exited (0)
postgres                 "docker-entrypoint.s…"    postgres         running      5432/tcp
trino                    "/usr/lib/trino/bin/…"    trino            running      8080/tcp
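If you would rather not read that table by eye, the same check can be scripted. Here is a minimal Python sketch, assuming your Compose version supports docker compose ps --all --format json and emits the Service, State, and ExitCode fields (some versions print one JSON array, others one JSON object per line, so the parser accepts both). The function names are my own and not part of lakeFS or Docker.

```python
import json
import subprocess


def parse_compose_ps(raw: str) -> list[dict]:
    """Parse the JSON output of `docker compose ps --format json`.

    Accepts either a single JSON array or one JSON object per line.
    """
    raw = raw.strip()
    if not raw:
        return []
    if raw.startswith("["):
        return json.loads(raw)
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def unhealthy_services(containers: list[dict]) -> list[str]:
    """Return services that are neither running nor cleanly exited."""
    bad = []
    for container in containers:
        state = container.get("State", "")
        exit_code = container.get("ExitCode", 0)
        if state == "running":
            continue
        if state == "exited" and exit_code == 0:
            # One-shot containers such as lakefs-setup and minio-setup
            # are expected to finish with exit code 0.
            continue
        bad.append(container.get("Service", container.get("Name", "<unknown>")))
    return bad


def fetch_compose_ps() -> str:
    """Run the command in the current directory (requires Docker)."""
    result = subprocess.run(
        ["docker", "compose", "ps", "--all", "--format", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

From the deployments/compose directory, unhealthy_services(parse_compose_ps(fetch_compose_ps())) should come back as an empty list once everything has started.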
A Quick Note on Memory
When creating the 10+ Everything Bagel containers, it’s important to make sure the Docker application has enough memory allocated to it.
To adjust this setting, go to Docker’s preferences page. From there, go to the Resources tab and make sure the Memory setting is set to at least 4 GB.
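You can also check the allocation from a script. The sketch below assumes docker info --format '{{.MemTotal}}' prints the memory available to Docker in bytes. The value Docker reports can be slightly below the number set in the preferences page, so I allow a small slack; the 10% figure is my own arbitrary choice, and I read "4 GB" as 4 GiB.

```python
import subprocess

GIB = 1024**3


def has_enough_memory(mem_total_bytes: int,
                      minimum_gib: float = 4.0,
                      slack: float = 0.1) -> bool:
    """Check the reported memory against the minimum, allowing some slack."""
    return mem_total_bytes / GIB >= minimum_gib * (1 - slack)


def docker_memory_bytes() -> int:
    """Ask the Docker daemon how much memory it has (requires Docker)."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{.MemTotal}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())
```

Calling has_enough_memory(docker_memory_bytes()) before docker compose up -d gives an early warning instead of containers that quietly get killed.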
How It Works
The key to understanding how the Docker Everything Bagel works is to look at the docker-compose.yml file that the docker compose up -d command references by default (the -d flag means run in “detached” state).
Although this article won’t be a comprehensive review of Docker Compose files, let’s take a peek at a few of the service sections in the config to understand what’s happening.
version: "3"
services:
  lakefs-setup:
    image: treeverse/lakefs:latest
    container_name: lakefs-setup
    depends_on:
      - postgres
      - minio-setup
    environment:
      - LAKEFS_AUTH_ENCRYPT_SECRET_KEY=some random secret string
      - LAKEFS_DATABASE_CONNECTION_STRING=postgres://lakefs:lakefs@postgres/postgres?sslmode=disable
      - LAKECTL_CREDENTIALS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
      - LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      - LAKECTL_SERVER_ENDPOINT_URL=http://lakefs:8000
    entrypoint: ["/app/wait-for", "postgres:5432", "--", "sh", "-c",
      "lakefs setup --user-name docker --access-key-id AKIAIOSFODNN7EXAMPLE --secret-access-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY && lakectl repo create lakefs://example s3://example" ]
  minio-setup:
    image: minio/mc
    container_name: minio-setup
    environment:
      - MC_HOST_lakefs=http://minioadmin:minioadmin@minio:9000
    depends_on:
      - minio
    command: ["mb", "lakefs/example"]
  postgres:
    image: postgres:11
    container_name: postgres
    environment:
      POSTGRES_USER: lakefs
      POSTGRES_PASSWORD: lakefs
This snippet shows the sections that create the postgres container (a lakeFS dependency), along with the setup containers for MinIO (the underlying object store) and lakeFS.
Docker Compose Specifics
Within a service’s section, perhaps the most important setting is the image: key. This specifies the base image used in building the container. In most cases, you can find an official image for a service on Docker Hub that installs all the packages you need to run the service. For example, the lakeFS image on Docker Hub is updated automatically by a GitHub Action workflow every time there’s a new official release of lakeFS, and it starts running lakeFS automatically when the container starts.
The depends_on: key is used to control the order in which containers are created. Before starting the lakeFS service we make sure a container running postgres is up first, as well as a service that starts MinIO and creates a bucket. If you aren’t familiar with MinIO, it’s an open-source object store that maintains compatibility with the S3 API. This makes it convenient for simulating S3 in local environments, as we are doing here.
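To make the ordering concrete, here is a small Python sketch that derives one valid start order from the depends_on relationships in the snippet. This is only an illustration using the standard library's graphlib (Python 3.9+), not how Docker Compose is implemented. Note that in this list form depends_on only orders container start; it does not wait for a service to be ready, which is why the lakefs-setup entrypoint also calls /app/wait-for.

```python
from graphlib import TopologicalSorter

# depends_on relationships copied from the snippet; services that declare
# no depends_on key (or that fall outside the snippet, like minio) map to
# an empty list.
DEPENDS_ON = {
    "lakefs-setup": ["postgres", "minio-setup"],
    "minio-setup": ["minio"],
    "postgres": [],
    "minio": [],
}


def start_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return one order in which every service starts after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(" -> ".join(start_order(DEPENDS_ON)))
```

Several orders are valid here (postgres and minio are independent of each other), so the exact sequence printed is not significant; what matters is that lakefs-setup always comes last.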
Any environment variables can be listed within the environment: key, as shown in all three services above. Similarly, we can save entire config files and copy them into the container as a volume. The spark service contains an example of this.
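One detail worth noticing in the snippet is that environment: appears in two shapes: a list of KEY=value strings (lakefs-setup, minio-setup) and a mapping (postgres). Both express the same thing. The hypothetical helper below normalizes either shape into a plain Python dict; it is a simplified sketch that treats entries without a value as empty strings.

```python
def normalize_environment(env) -> dict[str, str]:
    """Turn a Compose environment list or mapping into a dict of strings."""
    if isinstance(env, dict):
        return {str(key): "" if value is None else str(value)
                for key, value in env.items()}
    result = {}
    for item in env:
        # Split on the first "=" only, so values that contain "="
        # (such as connection strings) stay intact.
        key, _, value = item.partition("=")
        result[key] = value
    return result


if __name__ == "__main__":
    list_form = ["MC_HOST_lakefs=http://minioadmin:minioadmin@minio:9000"]
    map_form = {"POSTGRES_USER": "lakefs", "POSTGRES_PASSWORD": "lakefs"}
    print(normalize_environment(list_form))
    print(normalize_environment(map_form))
```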
The last thing I’d like to cover is the entrypoint: key. This is the command or commands we want to run inside the container. Usually this simply starts the relevant service. In the lakefs-setup service, however, we perform the additional steps of creating a lakeFS user and repository (via the lakectl command line tool) so that these steps don’t need to be performed manually each time.
After all, there is little value in a repository-less lakeFS instance.
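The entrypoint above begins with /app/wait-for postgres:5432, which holds off the setup commands until postgres accepts connections. To show the idea, here is a rough Python sketch of that "wait for a TCP port, then continue" pattern. It is my own illustration, not the actual wait-for script shipped in the lakeFS image, and the timeout and interval defaults are arbitrary.

```python
import socket
import time


def wait_for_port(host: str, port: int,
                  timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Retry a TCP connection until it succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Inside the Compose network the host would be "postgres"; the port is
    # not published to the host machine in this environment.
    if wait_for_port("postgres", 5432, timeout=5.0):
        print("postgres is accepting connections")
    else:
        print("gave up waiting for postgres")
```

A listening port does not guarantee the service has finished initializing, but it is usually a good enough signal for a local environment like this one.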
Getting these settings right lets us run pretty much any service we want inside an isolated Docker environment. Pretty cool!
Using the Everything Bagel
Once you have the Everything Bagel up and running, you can do a variety of things, such as:
- Connecting to the Hive and Trino clients [docker compose --profile client run --rm trino-client] and creating tables or querying data
- Hopping into the spark container [docker compose exec spark bash] and testing out spark-submit jobs
- Logging into the lakeFS (http://localhost:8000) and MinIO (http://localhost:9000) UIs to see how Spark, Hive, and Trino’s operations are reflected.
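Before opening a browser, it can be handy to confirm that the two UIs listed above are actually answering. The following Python sketch probes the URLs from the list using only the standard library; the function names are hypothetical, and any HTTP response (even an error status) is treated as "the server is up".

```python
import urllib.error
import urllib.request

# URLs taken from the list above.
UI_ENDPOINTS = {
    "lakeFS": "http://localhost:8000",
    "MinIO": "http://localhost:9000",
}


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        return False


def check_endpoints(endpoints: dict[str, str], probe=is_reachable) -> dict[str, bool]:
    """Probe every endpoint and report which ones respond."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Running check_endpoints(UI_ENDPOINTS) on your laptop returns a dict mapping each UI name to True or False.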
Try different things out, let us know how it goes!
This is an introduction to the Bagel. I hope you learned a bit about Docker and how it can be used to easily create complex environments locally. Next time we’ll dive into a more advanced use case showing the Everything Bagel in action!
We’re always continuing to improve it as well, making the Docker environment simpler and adding other relevant technologies.
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.