Paul Singman
March 22, 2021

What is a Data Lake?

A data lake is a system of technologies that allows data stored in files or blob objects to be queried. When employed effectively, data lakes enable the analysis of structured and unstructured data assets at tremendous scale and cost-efficiency.

The number of organizations employing data lake architectures has grown rapidly since the term “data lake” was coined in 2010. These architectures support a diverse set of analytic functions, ranging from basic SQL querying, to real-time analytics, to machine learning use cases.

Primary Components

Data lakes are composed of four primary components: the storage, format, compute, and metadata layers.

Why Use a Data Lake?

Data lake architectures encourage the consolidation of data assets into a centralized repository. This repository then serves as the foundation for cross-functional analysis of previously siloed data. Furthermore, insights derived from a data lake help cultivate a culture of data-driven decision making and improve resulting outcomes.

Any organization with large-scale unstructured data from sources like IoT sensors or mobile app clickstreams benefits from employing a data lake. The flexibility and cost-efficiency of data lakes make them especially relevant for these use cases.

Data Lake vs Data Warehouse

Data lakes and data warehouses are similar in that they both enable the analysis of large datasets. However, their approaches to achieving this differ in several key ways.

Modularity: Data warehouses are typically proprietary, monolithic applications that offer managed convenience at the price of higher cost and vendor lock-in. Data lakes, on the other hand, are marked by the modularity of their components, which are mostly open-source technologies and open formats. This modularity allows for mixing and matching the technologies most appropriate for a given workload.

Schema Enforcement: Data warehouses require data to conform to a DDL-defined schema immediately on write or ingest. In contrast, data lakes allow data to be landed freely, with schema validation occurring on-read.
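
To make the distinction concrete, here is a minimal Python sketch of schema-on-read. It assumes raw records were landed as JSON lines without any checks, and that a schema is applied only when the data is read. The names `EXPECTED_SCHEMA` and `read_with_schema` are hypothetical and not part of any particular tool.

```python
import json

# Hypothetical schema: field name -> required Python type.
EXPECTED_SCHEMA = {"user_id": int, "event": str, "ts": float}

def read_with_schema(lines, schema=EXPECTED_SCHEMA):
    """Parse raw JSON lines, validating each record only at read time.

    Returns (valid_records, rejected_lines) so that bad rows are
    surfaced to the caller instead of being silently dropped.
    """
    valid, rejected = [], []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        matches = isinstance(record, dict) and all(
            isinstance(record.get(name), expected)
            for name, expected in schema.items()
        )
        if matches:
            valid.append(record)
        else:
            rejected.append(line)
    return valid, rejected

# The data was landed freely: the second line does not match the schema.
raw_lines = [
    '{"user_id": 1, "event": "click", "ts": 1616400000.0}',
    '{"user_id": "oops", "event": "click"}',
]
valid, rejected = read_with_schema(raw_lines)
```

A warehouse would have refused the second record at ingest time. Here it sits in storage untouched and is only flagged when a reader applies the schema.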

Cost vs Performance: Data warehouses typically offer high performance at a higher price. Users often have to limit their storage history or apply aggregations before inserting data into a table to avoid prohibitive costs.

Data lakes store data inside highly cost-effective storage services and therefore incur no storage charges above the bare service costs. Compute resources elastically scale up and down to optimally meet the needs of workloads without extra cost.

Structured vs Unstructured Data: Data warehouses are designed for structured, tabular datasets. Meanwhile, data lakes can be used to analyze data in unstructured or semi-structured formats as well.

Depending on the data and expected use cases, it can make sense to use either a data warehouse or data lake. Maybe even both simultaneously!

How to Build a Data Lake?

Landing data in a highly-available storage service is an important first step to creating a data lake. One should store the data “as is” in its native format, before transforming it into a format more suitable for analysis.

Next, connecting a compute engine such as Spark or Presto to run computations over the data creates a working, minimally viable data lake. From there, one can take additional steps to optimize storage patterns via partitioning, and enforce schema validations to improve the lake’s functionality.

1. Land raw data in the object store
2. Optimize raw data files for analysis by size and format
3. Add metadata tooling to define schema and enable versioning and discovery
4. Integrate downstream consumers with the optimized data assets
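
The four steps above can be sketched end to end in a few lines of Python. This is an illustration under loud assumptions: a local temporary directory stands in for the object store, CSV stands in for an optimized format such as Parquet, and a plain dictionary stands in for a metadata catalog. All function names are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path

def land_raw(store: Path, name: str, payload: str) -> Path:
    """Step 1: write the payload unchanged under raw/."""
    path = store / "raw" / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(payload)
    return path

def optimize(store: Path, raw_paths, table: str, fields) -> Path:
    """Step 2: merge raw JSON-lines files into one analysis-friendly file.

    CSV is a stdlib-only stand-in; a real lake would more likely
    write a columnar format.
    """
    out = store / "optimized" / table / "part-0000.csv"
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for raw in raw_paths:
            for line in raw.read_text().splitlines():
                writer.writerow(json.loads(line))
    return out

def register(catalog: dict, table: str, path: Path, fields) -> None:
    """Step 3: record schema and location so the table can be discovered."""
    catalog[table] = {"location": str(path.parent), "columns": list(fields)}

def query(catalog: dict, table: str):
    """Step 4: a downstream consumer resolves the table via the catalog."""
    rows = []
    for part in sorted(Path(catalog[table]["location"]).glob("*.csv")):
        with part.open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows

store = Path(tempfile.mkdtemp())  # stand-in for an object store bucket
catalog = {}                      # stand-in for a metastore
fields = ["user_id", "event"]
raw = [
    land_raw(store, "a.jsonl", '{"user_id": 1, "event": "click"}\n'),
    land_raw(store, "b.jsonl", '{"user_id": 2, "event": "view"}\n'),
]
part = optimize(store, raw, "events", fields)
register(catalog, "events", part, fields)
rows = query(catalog, "events")
```

Note that the raw files stay in place after optimization, so the optimized table can always be rebuilt from the original data.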

Data Lake Tools and Providers

At each layer in the stack, there are a number of technologies that can be combined to create the data lake. Once a lake contains usable data assets, those assets become available for consumption by different downstream clients and libraries.

Storage: The storage services of the major cloud providers (AWS S3, Azure Data Lake Storage (ADLS), and Google Cloud Storage (GCS)) are most commonly used for a data lake’s storage layer. A number of other storage providers, both managed and open-source, are also perfectly capable of supporting a data lake, including MinIO, HDFS, IBM Cloud Storage, Alibaba Object Storage, Wasabi, Ceph, Oracle Cloud Storage, SwiftStack, and Spaces Object Storage.

Data Format: The simplest format examples are CSV and JSON, which data lakes support due to their ubiquity. More specialized formats designed for data lake use cases also exist, such as Parquet, Delta, Iceberg, Avro, and Hudi. These formats enhance the efficiency of lake operations and make functionality like transaction atomicity and time travel possible.

Unstructured data formats associated primarily with media-rich image, video, and audio files are also commonly found in data lakes.

Compute: A vast number of compute engines built atop elastic compute primitives are available to run operations over data distributed in a lake. Examples include technologies like MapReduce and Hadoop, managed services such as AWS Athena, Azure HDInsight, Google Cloud Dataproc, and applications like Spark and Presto.

Metadata: As important as the data itself is the data about the data, or metadata, in a lake. Tools like Hive, AWS Glue, and Dataproc Metastore maintain information about schema and partitions to provide structure.

Clients & Libraries: Through JDBC/ODBC and other data-transfer interfaces, there are countless clients and libraries that can access data in a lake. The ubiquity of the S3 API (and the APIs of similar storage services) means that nearly every programming language, BI tool, and SQL client can consume the data lake.

Use Cases 

Data lakes are suitable for a wide range of analytics initiatives. Some common examples are:

  • Data-in-place Analytics: Once one lands data in a lake, there’s no need to move it elsewhere for SQL-based analysis. Let analysts run queries over data lake data to identify trends and calculate metrics about the business.
  • Machine Learning Model Training: Machine learning models often need large volumes of data to train on to optimize their parameters and achieve high levels of accuracy. Data lakes make it possible to repeatably create training and testing sets for data scientists to optimize models.
  • Archival and Historical Data Storage: Apart from the immediate business value data lakes provide, one can use them as storage for archiving historical data.
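
One way to make training and testing sets repeatable, as described in the machine learning item above, is to derive the split from a hash of each record’s identifier rather than from random sampling. The sketch below assumes every record has a stable string ID; the function name `assign_split` is hypothetical.

```python
import hashlib

def assign_split(record_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a record ID to 'train' or 'test'.

    Hashing the ID (instead of sampling randomly) means the same record
    always lands in the same set, however often the split is rebuilt.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Reduce the first 8 bytes of the hash to a bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "test" if bucket < test_fraction * 10_000 else "train"

record_ids = [f"user-{i}" for i in range(1000)]
first_run = {rid: assign_split(rid) for rid in record_ids}
second_run = {rid: assign_split(rid) for rid in record_ids}
```

Both runs produce identical assignments, and records appended to the lake later do not change the set that existing records belong to.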

Challenges of Data Lakes

The ecosystem around data lakes is relatively new and the technologies used are still maturing in some cases. As a result, data lakes are susceptible to facing a few common problems.

Small Files: One such problem is the “small file problem,” which occurs when a large number of files, each containing a small amount of data, accumulate in a data lake. The issue with small files is that they are inefficient to run computations over and to keep up-to-date metadata statistics on.

The solution to the small file problem is to run periodic maintenance jobs that compact data into the ideal size for efficient analysis.
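
A compaction job of this kind can be sketched as follows. The sketch assumes line-oriented files (JSON lines) in which every record ends with a newline, so plain concatenation is safe; columnar formats would need a format-aware rewrite instead. The function name `compact` and the `part-NNNNN.jsonl` naming are hypothetical.

```python
import tempfile
from pathlib import Path

def compact(src_dir: Path, dst_dir: Path, target_bytes: int) -> list:
    """Merge many small line-oriented files into files of roughly target_bytes."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    outputs, buffer, size = [], [], 0

    def flush():
        nonlocal buffer, size
        if buffer:
            out = dst_dir / f"part-{len(outputs):05d}.jsonl"
            out.write_bytes(b"".join(buffer))
            outputs.append(out)
            buffer, size = [], 0

    for small in sorted(src_dir.glob("*.jsonl")):
        data = small.read_bytes()
        buffer.append(data)
        size += len(data)
        if size >= target_bytes:
            flush()  # the current output reached the target size
    flush()  # write whatever is left over
    return outputs

# Ten tiny files, each holding a single record.
src = Path(tempfile.mkdtemp())
dst = Path(tempfile.mkdtemp())
for i in range(10):
    (src / f"event-{i:02d}.jsonl").write_bytes(f'{{"id": {i}}}\n'.encode())

compacted = compact(src, dst, target_bytes=40)
```

In practice the target size is far larger than in this toy example, and the small source files are removed or expired once the compacted files have been verified.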

Partitioning and Query Efficiency: Similar to adding indices to a warehouse table, partitioning optimizes data lake assets for aggregation or filtering on certain fields. Partitioning refers to the physical organization of data by a specific field or set of fields on blob storage.

Without realizing it, a user can incur large costs and/or wait times by running queries that aren’t well suited to a table’s partition structure.
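
The effect of partition structure on a query can be illustrated with a small sketch. It assumes a Hive-style `key=value` directory layout on a local directory standing in for blob storage; the helper names are hypothetical. A query that filters on the partition field touches only the matching directories, while any other query has to scan every file.

```python
import tempfile
from pathlib import Path

def partition_path(root: Path, table: str, record: dict, keys) -> Path:
    """Build a Hive-style directory such as events/date=2021-03-22/."""
    path = root / table
    for key in keys:
        path = path / f"{key}={record[key]}"
    return path

def write_partitioned(root: Path, table: str, records, keys) -> None:
    """Write each record into the directory for its partition values."""
    for i, record in enumerate(records):
        directory = partition_path(root, table, record, keys)
        directory.mkdir(parents=True, exist_ok=True)
        (directory / f"row-{i}.txt").write_text(str(record))

def files_to_scan(root: Path, table: str, **filters) -> list:
    """Return only the files whose partition directories match the filters.

    For brevity this lists every file and then filters; a real engine
    would avoid listing non-matching directories in the first place.
    """
    table_dir = root / table
    selected = []
    for f in sorted(table_dir.rglob("*.txt")):
        dirs = f.relative_to(table_dir).parts[:-1]
        parts = dict(d.split("=", 1) for d in dirs)
        if all(parts.get(k) == v for k, v in filters.items()):
            selected.append(f)
    return selected

root = Path(tempfile.mkdtemp())
records = [
    {"date": "2021-03-21", "event": "click"},
    {"date": "2021-03-22", "event": "view"},
    {"date": "2021-03-22", "event": "click"},
]
write_partitioned(root, "events", records, keys=["date"])

one_day = files_to_scan(root, "events", date="2021-03-22")  # pruned scan
everything = files_to_scan(root, "events")                  # full scan
```

A filter on `event` cannot be answered from the directory names in this layout, which is exactly the situation in which a query ends up reading the whole table.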

The Shared Drive: Without proper workflows and governance, a data lake can easily resemble a shared folder in which multiple people place files without regard for the schema requirements of other users. Proper schema enforcement and thoughtful data lake workflows are necessary to avoid a lake becoming a data swamp.

Let’s look a bit deeper into how to avoid this.

How to Avoid a Data Swamp?

A data swamp is the degenerate state of a data lake: tables within the lake return inaccurate data, or files become corrupted and queries stop running altogether.

It is imperative to maintain awareness of the quality and attributes of any data inserted into production datasets. 

This can be difficult to achieve given the standard APIs available to inspect and manipulate data on common storage services. It is therefore wise to adopt technologies that expose richer primitives for manipulating data objects.
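
As one illustration of such a primitive, the sketch below gates writes to a production dataset: files are first written to a staging area, checked, and only then moved into place. It is a local-filesystem stand-in rather than the API of any specific tool, and the names `validate` and `promote` are hypothetical. It relies on `os.replace`, which is atomic when source and destination are on the same filesystem.

```python
import json
import os
import tempfile
from pathlib import Path

def validate(path: Path, required_fields) -> bool:
    """Check that every line is a JSON object with the required fields."""
    try:
        for line in path.read_text().splitlines():
            record = json.loads(line)
            if not isinstance(record, dict):
                return False
            if not set(required_fields) <= record.keys():
                return False
    except json.JSONDecodeError:
        return False
    return True

def promote(staged: Path, production_dir: Path, required_fields) -> bool:
    """Move a staged file into production only if it passes validation."""
    if not validate(staged, required_fields):
        return False  # the bad file stays in staging for inspection
    production_dir.mkdir(parents=True, exist_ok=True)
    os.replace(staged, production_dir / staged.name)
    return True

root = Path(tempfile.mkdtemp())
staging = root / "staging"
staging.mkdir()
production = root / "production"

good = staging / "good.jsonl"
good.write_text('{"user_id": 1, "event": "click"}\n')
bad = staging / "bad.jsonl"
bad.write_text('{"user_id": 2}\nnot json\n')

good_ok = promote(good, production, ["user_id", "event"])
bad_ok = promote(bad, production, ["user_id", "event"])
```

Readers of the production directory never see a file that failed its checks, which is the property that keeps a lake from drifting toward a swamp.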

Conclusion

Apart from containing traditional operational structured data, the digital universe is now full of massive volumes of data coming from billions of devices and users. It is consequently a priority to leverage data lake systems that can handle enormous data workloads and an influx of data types.
