Paul Singman
March 30, 2021

Rid yourself of these troubling habits and start the journey towards data lake mastery!

Data Lake Anti-Patterns

Introduction

Data lakes offer tantalizing performance upside, which is a major reason for their high rate of adoption. Sometimes, though, the promise of technological performance can overshadow an unpleasant developer experience.

This is troublesome, since I believe the developer experience is as important as performance, if not more so, in proving the worth of a technology or paradigm.

When creating and maintaining a complex system like a data lake, unfriendly user workflows and interfaces can sap productivity, similar to an application with too much tech debt or poor documentation.

Data Lake Anti-Pattern #1

You click around the S3 (or comparable) storage console often

One symptom of unfriendly workflows with a data lake is spending too much time in the lake’s storage service of choice.

Earlier in my career I worked with an S3 data lake, and one day I realized how much time I was spending poking around the AWS console to see the number of files present in a partition or what path a certain table’s data was stored under. It was unhealthy.

As a comparison, no one is going into their database’s internals to see where on the B-tree a table is stored.

Tactics to Mitigate:

Issues will inevitably occur within your data lake. A vendor will have an issue sending over a data file one day. Someone will accidentally re-run an ingestion job multiple times when they shouldn’t.

In the process of debugging such issues (or, in some cases, false alarms), it will be useful to play detective and investigate the evidence of what happened by checking the last_modified_date of files or checking the number of rows in today’s partition.

My advice to make this a friendlier process is twofold:

  1. Maintain an internal Debugging Playbook doc page specifying issues that arise and detailing the steps taken to identify and resolve them. This helps prevent knowledge hoarding amongst the team and spares every poor sap on it from having to figure out the best means of resolution on their own.
  2. When common debugging patterns emerge, like file-count checks, develop a way to automate the reporting of these metrics in an internal table or dashboard. The next time an issue occurs, you should have a single-pane view that quickly exposes your lake’s recent behavior.

The goal of these practices is to keep you out of the storage console for most issues, and to make the investigation simple, quick, and targeted when you do need to go there.
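As a rough illustration of the second point, here is a minimal Python sketch that rolls object listings up into per-partition metrics (file count, total bytes, latest modification time). The table name, partition layout, and function names are hypothetical. The aggregation works on plain dicts shaped like the "Contents" entries of an S3 ListObjectsV2 response, and the boto3 listing function is only one possible way to feed it.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_of(key: str) -> str:
    """Return the 'directory' part of an object key, treated here as the partition."""
    return key.rsplit("/", 1)[0] if "/" in key else ""


def summarize_partitions(objects):
    """Aggregate file count, total bytes, and latest modification time per partition.

    `objects` is any iterable of dicts with "Key", "Size", and "LastModified",
    the same shape as the "Contents" entries of an S3 ListObjectsV2 response.
    """
    summary = defaultdict(
        lambda: {"file_count": 0, "total_bytes": 0, "last_modified": None}
    )
    for obj in objects:
        stats = summary[partition_of(obj["Key"])]
        stats["file_count"] += 1
        stats["total_bytes"] += obj["Size"]
        if stats["last_modified"] is None or obj["LastModified"] > stats["last_modified"]:
            stats["last_modified"] = obj["LastModified"]
    return dict(summary)


def list_s3_objects(bucket: str, prefix: str):
    """Yield object records from S3. Needs boto3 and AWS credentials (assumed setup)."""
    import boto3  # imported lazily so the aggregation above runs without it

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        yield from page.get("Contents", [])


# Hand-written records standing in for a real listing (hypothetical keys).
sample = [
    {"Key": "events/dt=2021-03-29/part-0.parquet", "Size": 1200,
     "LastModified": datetime(2021, 3, 29, 2, 0, tzinfo=timezone.utc)},
    {"Key": "events/dt=2021-03-30/part-0.parquet", "Size": 1100,
     "LastModified": datetime(2021, 3, 30, 2, 0, tzinfo=timezone.utc)},
    {"Key": "events/dt=2021-03-30/part-1.parquet", "Size": 1150,
     "LastModified": datetime(2021, 3, 30, 2, 5, tzinfo=timezone.utc)},
]

report = summarize_partitions(sample)
for partition, stats in sorted(report.items()):
    print(partition, stats["file_count"], stats["total_bytes"],
          stats["last_modified"].isoformat())
```

Writing a report like this to an internal table on a schedule means that a partition with two files where one is expected stands out immediately, without anyone opening the console.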

Data Lake Anti-Pattern #2

You physically copy files to create multiple versions of the same dataset

If you open up your object store’s console and see several directories that each contain an almost-exact duplicate of the same dataset, I would immediately know you’re a data lake amateur.

I would know because physically duplicating files is a tedious process that incurs unnecessary cost. And it’s a problem that grows with the scale of your data.

Instead of copying files, what if you could write one shell command and get the same effect?

> lakectl branch create <branch uri> --source <source ref uri>

How does this work?

Luckily, this problem is mostly solved by layering a git-for-data tool on top of your lake.

By leveraging metadata about the files and their contents in your lake, familiar operations like branching or committing collections become possible. Iterating and experimenting on a data lake is faster and safer when you have these actions at your disposal.

Visual of a branch added virtually to an S3 collection.
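To make the metadata idea concrete, here is a toy Python model of zero-copy branching. It is a conceptual sketch only and does not describe how lakeFS is implemented: the class, method names, and paths are all hypothetical. The point is that a branch can be nothing more than a mapping from logical paths to stored objects, so creating one copies pointers rather than data.

```python
class ToyRepo:
    """Toy model: each branch maps logical paths to addresses of stored objects."""

    def __init__(self):
        self.objects = {}             # address -> bytes (stands in for the object store)
        self.branches = {"main": {}}  # branch name -> {logical path: address}

    def put(self, branch: str, path: str, data: bytes) -> None:
        """Store new data once and point `path` on `branch` at it."""
        address = f"obj-{len(self.objects)}"
        self.objects[address] = data
        self.branches[branch][path] = address

    def get(self, branch: str, path: str) -> bytes:
        """Resolve `path` through the branch's pointer map."""
        return self.objects[self.branches[branch][path]]

    def create_branch(self, name: str, source: str) -> None:
        """Create a branch by copying the pointer map only; no data is duplicated."""
        self.branches[name] = dict(self.branches[source])


repo = ToyRepo()
repo.put("main", "events/part-0", b"v1")
repo.put("main", "events/part-1", b"other")

repo.create_branch("experiment", source="main")
repo.put("experiment", "events/part-0", b"v2")  # change is visible on the branch only

print(repo.get("main", "events/part-0"))
print(repo.get("experiment", "events/part-0"))
print(len(repo.objects))  # number of physical objects actually stored
```

In this toy, the untouched file is shared by both branches, and only the rewritten file adds a new physical object.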

For more information, check out the open-source project lakeFS and their documentation pages!

Data Lake Anti-Pattern #3

You have complex data processing logic based on file path patterns

Like Eve standing before the apple tree, you will be tempted to use logic on an object’s filepath string to determine its fate in your data lake.

DO NOT DO THIS!
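For illustration only, here is a hypothetical Python sketch of the kind of code being warned against. It is not the author's original example, and every prefix and dataset name in it is invented. A single function inspects each key's path string and decides where the object should go.

```python
def destination_for(key: str) -> str:
    """Anti-pattern: one function decides every dataset's fate from its path string."""
    if key.startswith("landing/"):
        rest = key[len("landing/"):]
        if "vendor_a" in key:
            if key.endswith(".csv"):
                return "raw/vendor_a/csv/" + rest
            return "raw/vendor_a/other/" + rest
        elif "orders" in key:
            if "test" in key:
                return "scratch/" + rest
            return "raw/orders/" + rest
    return "quarantine/" + key


# A new source whose name happens to contain "vendor_a" is silently
# treated as vendor A data.
print(destination_for("landing/vendor_a_orders/2021-03-30.csv"))

# "latest" contains the substring "test", so production data lands in scratch.
print(destination_for("landing/orders/latest/2021-03-30.json"))
```

Both calls return a destination nobody intended, and neither raises an error.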

If you route every dataset through one function that branches on filepath patterns, then yes, you have managed to write a single function that moves different datasets to their appropriate locations in the data lake.

What you’ve actually done, though, is increase the cyclomatic complexity of your data processing code and complicate its maintenance.

Any time a new data source is added, you have to think through the implications of where it’s being landed and, based on this centralized function, where it’ll end up. Any change to one dataset requires understanding whether it has unintended impacts on others.

What to do instead:

The solution is straightforward: use individual methods to process each dataset in your lake. Common helper functions should still be used within these methods for simple actions, like copying a file from one S3 key to another.

To be clear, some logic on filepath strings is unavoidable. Be wary, though, of nested logic, and make the code as explicit and readable as possible.
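A minimal Python sketch of that structure might look like the following. All function names and key prefixes are hypothetical, and a plain dict stands in for the bucket so the example runs anywhere. Each dataset gets its own small, explicit function, and the only shared code is a simple copy helper.

```python
def copy_object(store: dict, src_key: str, dst_key: str) -> None:
    """Shared helper: copy one object from one key to another.

    A dict stands in for the bucket here. Against real S3 this could wrap
    boto3's client.copy_object (an assumption about your setup).
    """
    store[dst_key] = store[src_key]


def filename_of(key: str) -> str:
    """Return the last path segment of a key."""
    return key.rsplit("/", 1)[-1]


def process_vendor_a_file(store: dict, src_key: str) -> str:
    """Vendor A files always land under raw/vendor_a/."""
    dst_key = "raw/vendor_a/" + filename_of(src_key)
    copy_object(store, src_key, dst_key)
    return dst_key


def process_orders_file(store: dict, src_key: str) -> str:
    """Orders files always land under raw/orders/."""
    dst_key = "raw/orders/" + filename_of(src_key)
    copy_object(store, src_key, dst_key)
    return dst_key


bucket = {
    "landing/vendor_a/2021-03-30.csv": b"a,b\n1,2\n",
    "landing/orders/2021-03-30.json": b"{}",
}

# Each ingestion job calls the function for its own dataset explicitly.
print(process_vendor_a_file(bucket, "landing/vendor_a/2021-03-30.csv"))
print(process_orders_file(bucket, "landing/orders/2021-03-30.json"))
```

Adding a new data source then means adding one new function, and a change to how orders are handled cannot affect vendor A data.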

Wrapping up

In this article we discussed why it’s a troubling sign if you are:

  1. Clicking around a storage console to debug every issue.
  2. Physically copying files in your data lake.
  3. Overly dependent on filepath names in processing code.

Avoiding these anti-patterns and following the practices above instead should help make maintaining your data lake a breeze!
