Paul Singman, Guy Hardonag
December 15, 2021

Introduction: Dockerize Your Data Pipeline

I can remember times when my company started using a new technology — be it Redis, Kafka, or Spark — and in order to try it out I found myself staring at a screen like this:

install redis kafka spark

At the time I thought nothing of doing this. And even wore it as a badge of pride that I was capable of figuring out how to install and run cutting-edge technologies on my own.

And while that is a valuable skill, from a company’s perspective this form of manual, local installation performed by each developer is a fantastic waste of productivity.

Especially when minor details like release version and dependencies start getting jumbled. To illustrate, one way to know you’ve hit rock bottom: when you send a .jar file to your colleague over Slack .

A Better Approach to Local Installation

One of the ways to solve this problem is to standardize and simplify the local installation process. In a previous article, I covered how this can work with the Docker Everything Bagel.

As a quick summary — the Everything Bagel is a Docker Compose environment that lets you run Big Data technologies like MinIO, lakeFS, Hive, Spark, and Trino with a single command. Using the Docker concepts demonstrated by The Bagel, nearly any data pipeline can be replicated locally in this way.

everything bagel docker hive trino lakefs minio
Diagram of the public lakeFS docker-compose environment.

In this article, what I’d like to do is take it a step further and show it in action. Following the wonderful demo prepared by Brian Olsen of Starburst, we’ll show off querying data with Trino on different lakeFS branches.

Connecting to the Trino Container with a Database Manager

The previous article left off with jumping into the Trino container and running a few query commands to show it working. The truth is though, no one wants to run queries from the command line.

First things first, we need to connect a SQL client to the Trino container. Same as Brian, I’m gonna use the free DBeaver tool. Step 1 is to create a “New Database Connection” from the Database drop-down menu.

From that screen, you’ll find Trino under the section “Hadoop/Big Data”

After selecting the Trino space bunny, you’ll enter the connection details:

dbeaver trino connection jdbc details

According to the Trino documentation, the JDBC URL will take the form: 

jdbc:trino://{host}:{port}

For a Trino installation running locally in Docker, host = localhost and the port will be whichever you chose to expose on that container (the Bagel uses 8080). 

After a bit of trial-and-error, I figured out that for Username you can put any string and Password should be left blank.

Now we can connect!

Note: We can also connect to the lakeFS and MinIO UI’s. In your browser go to localhost:9001 for MinIO (username & password = minioadmin) and localhost:8000 for lakeFS UI (credentials are hardcoded in the docker-compose in the lakefs-setup environment vars).

Trino + lakeFS

Now connected, we can define tables in Hive on the main branch of the lakeFS repository and run queries over them via Trino’s distributed query engine. Let’s walk through how to do this.

Step 1: Create a schema — This provides a separate name space for the new tables. We’ll point the location to the main branch of lakeFS.

create schema s3.tiny
with (location = 's3a://example/main/tiny');

Step 2: Create tables in the schema — We’ll replicate the customers and orders tables.

create table s3.tiny.customer
with (
format = 'ORC',
external_location = 's3a://example/main/tiny/customer'
) as select * from tpch.tiny.customer;
create table s3.tiny.orders
with (
format = 'ORC',
external_location = 's3a://example/main/tiny/orders'
) as select * from tpch.tiny.orders;

Step 3: Write Queries! — Here’s an example joining both tables for the orders before March ‘95.

select orderkey, orderdate, shippriority
from s3.tiny.customer c, s3.tiny.orders o
where c.custkey = o.custkey and orderdate < date'1995-03-15'
group by orderkey, orderdate, shippriority
order by orderdate;

What Happens When I Create a New Branch?

Of course, querying data like this is cool, but it is also a waste of the Everything Bagel’s potential. Instead of merely one version of the data, let’s use a lakeFS branch to create a new version.

The first step is to create a new branch in lakeFS I’ll call v2.

Creating a new branch from the Branches tab of the example repo.

Before we can query the v2 version of the tables, we need to re-define the schema pointing to the new location in S3. As you can see below, this simply involves replacing “main” with “v2” in the location path.

create schema s3.tiny_v2
with (location = 's3a://example/v2/tiny');

Lastly, we need to redefine the tables in the new schema. We can do this by running show create table s3.tiny.orders; on the table created before, and then copying the generated table DDL making the same branch name replacements. 

It’ll look like this:

CREATE TABLE s3.tiny_v2.orders (
orderkey bigint,
custkey bigint,
orderstatus varchar(1),
totalprice double,
orderdate date,
orderpriority varchar(15),
clerk varchar(15),
shippriority integer,
comment varchar(79)
) WITH (
external_location = 's3a://example/main/tiny_v2/orders',
format = 'ORC'
);

Now we have a completely independent version of the table that we can make any changes to that we want 1) without affecting the version on the main branch and 2) not duplicating any data.

Why Would We Use This?

Think of the data table assets most commonly used at your company. In the time you’ve been using them, they have never changed, right?

Probably not. On the contrary, they are changing all the time. Metric definitions get updated, columns are added, etc.

With dependencies between tables, it is a delicate, time-consuming process to roll out these types of changes in a way that ensures there are no unintended consequences to consumers of the data.

Adopting this branching-based workflow simplifies deployments as both versions of a table can exist side-by-side without unnecessarily duplicating data (important for large tables). In addition, git commands (exposed by lakeFS) are at your disposal to manage the lifecycle of the branches.

Final Thoughts

The next time you’re about to make a change, pause for a second. Resist the urge to copy data. Instead, adopt the same workflow used for code and create a new data branch.

This achieves the same effect and offers source control benefits you wouldn’t get otherwise.

About lakeFS

The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.

Our mission is to maximize the manageability of open source data analytics solutions that scale.

To learn more...

Read Related Articles.

LakeFS

  • Get Started
    Get Started
  • Git for Data - What, How and Why Now?

    Read the article
    +