Oz Katz
December 22, 2022

We’re happy to introduce experimental support for running SQL queries directly on objects (your files!) in lakeFS – all through the lakeFS UI, with nothing to install or configure.

TLDR – You can now run SQL queries on Parquet and other tabular formats, directly from the lakeFS UI! Explore data, inspect its schema, compare versions and more – without installing or configuring any additional tools or components.

Exploring Data is hard

Running a modern Data Lake sure has its upsides:

Cloud storage is cheaper than ever, more scalable than ever, and allows a plethora of use cases and compute frameworks to all access a single, centrally managed storage system.

But with all the benefits, it’s pretty obvious that object stores (the bedrock of every cloud data lake) were never designed to be Data Warehouses.

With lakeFS, we set out to change that. The lakeFS UI already allows users to easily explore their data at logical, consistent versions and attach meaningful branch names and commit messages. It even renders Markdown files so that documentation can coexist with data.

However, up until now, actually interacting with the data required using an external tool: Spark, Trino, Athena, Azure Synapse or others – just to make sense of the data files within a given repository or commit.

Enter: DuckDB

DuckDB is an open source in-process SQL OLAP database written in C++. It allows running very fast (vectorized) SQL queries on data files such as CSVs and Apache Parquet.

What is a vectorized SQL query? I’ve got you covered!

In vectorized query execution, multiple rows are grouped into batches by column, and operators iterate over the data within each batch using simple loops. This approach greatly reduces CPU usage for reading, writing and query operations like scanning and filtering.
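To make the contrast concrete, here is a minimal sketch in plain Python – not DuckDB’s actual C++ engine, just an illustration of the two iteration styles (the data and batch size are made up for the example):

```python
# Illustrative sketch only: contrasts row-at-a-time execution with
# batched, column-oriented ("vectorized") execution.

rows = [(i, i * 2.5) for i in range(10_000)]  # (id, price) tuples

def filter_rows(rows, threshold):
    """Row-at-a-time: per-row interpretation overhead is paid on every row."""
    out = []
    for _id, price in rows:
        if price > threshold:
            out.append(_id)
    return out

def filter_vectorized(ids, prices, threshold, batch_size=1024):
    """Vectorized: data is stored column-wise and processed in batches,
    so per-row overhead is amortized over each batch with a tight loop."""
    out = []
    for start in range(0, len(ids), batch_size):
        id_batch = ids[start:start + batch_size]
        price_batch = prices[start:start + batch_size]
        out.extend(i for i, p in zip(id_batch, price_batch) if p > threshold)
    return out

# Same data, stored column-wise for the vectorized version
ids = [r[0] for r in rows]
prices = [r[1] for r in rows]
assert filter_rows(rows, 20_000.0) == filter_vectorized(ids, prices, 20_000.0)
```

In a real vectorized engine the inner loop runs over contiguous, typed column buffers, which is also what makes CPU-level optimizations like SIMD possible.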

Now that we know what a vectorized SQL query is, let’s continue exploring DuckDB!

For good reason, DuckDB has been exploding in popularity recently – data engineers are looking for simple, performant ways to explore data without having to run expensive and complex distributed systems.

DuckDB star history graph, courtesy of https://star-history.com/

One really interesting aspect of the project is that it compiles to WebAssembly – a relatively new standard that allows compiling native code in C++, Rust and many other languages into a set of instructions that can run directly in a web browser 🤯

Running DuckDB, embedded in the lakeFS UI

So, what is this all about and why should you care? By embedding DuckDB into the lakeFS UI, we now offer experimental support for running SQL queries directly on objects (your files!) in lakeFS – no need to install or configure anything.

To use it, simply click on a supported file (CSV, TSV and Parquet) in the lakeFS UI.
You’ll be greeted with a SQL shell, with a default query that selects a sample of rows from the chosen file.

You can use the full range of supported SQL functions (even JOINs!) right from the comfort of your web browser of choice. Use it to DESCRIBE the contents of a Parquet file to understand its schema, compare objects between different versions to understand how the data changed, or simply select a small sample to get a feel for an unfamiliar dataset. The possibilities are endless!
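For a taste of what such queries might look like, here are a few illustrative DuckDB-style statements – the repository name, branch names and object paths below are hypothetical examples, not taken from a real installation:

```sql
-- Inspect the schema of a Parquet object (path is a hypothetical example)
DESCRIBE SELECT * FROM read_parquet('lakefs://my-repo/main/events.parquet');

-- Select a small sample of rows to explore an unfamiliar dataset
SELECT * FROM read_parquet('lakefs://my-repo/main/events.parquet') LIMIT 10;

-- Compare two versions of the same object: rows present on the branch
-- but missing from main (again, paths are illustrative)
SELECT * FROM read_parquet('lakefs://my-repo/feature-branch/events.parquet')
EXCEPT
SELECT * FROM read_parquet('lakefs://my-repo/main/events.parquet');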

Getting Started

To get started querying data in the lakeFS UI, make sure your lakeFS installation is on version 0.88.0 or higher. No additional configuration is required.

If you’re not using lakeFS, you can quickly set up a local lakeFS instance and import your existing data into it. The nice thing is that importing data into lakeFS doesn’t copy or modify any data – it is a simple metadata operation in which lakeFS creates pointers to your existing data objects.

Use this one-liner to run lakeFS with an S3 storage adapter – this will allow you to import data from your existing object store:

docker run --pull always -p 8000:8000 \
   -e LAKEFS_BLOCKSTORE_TYPE='s3' \
   -e AWS_ACCESS_KEY_ID='YourAccessKeyValue' \
   -e AWS_SECRET_ACCESS_KEY='YourSecretKeyValue' \
   treeverse/lakefs run --local-settings

For instructions on how to run lakeFS with Azure Blob Storage, Google Cloud Storage or MinIO – see the lakeFS quick start guide.

Once up, use the zero-copy import tool to get your data showing in lakeFS.
