Ariel Shaqed (Scolnicov)
September 25, 2021

This week I’m pleased to share “Tardy Data”, a chapter I wrote for the new book 97 Things Every Data Engineer Should Know, adapted to blog format. It’s about what “late” means in data engineering. This content is licensed under CC BY 4.0.

Tardy Data

When collecting time-based data, some are born late, some achieve lateness, and some have lateness thrust upon them. That makes processing the “latest” data challenging. For instance,

  • Trace data is usually indexed by start time. Data for an ongoing interval are born late and cannot be generated yet.
  • Collection systems can work slower in the presence of failures or bursts, achieving lateness of generated data.
  • Distributed collection systems can delay some data, thrusting lateness upon them.

Lateness occurs at all levels of a collection pipeline. Most collection pipelines are distributed, and late data arrives significantly out of order.

Lateness is unavoidable; handling it robustly is essential.

At the same time, providing repeatable queries is desirable for some purposes, and adding late data can directly clash with it. For instance, aggregation must take late data into account.

Common strategies differ in how they store and how they query late data. Which to choose depends as much on business logic as it does on technical advantages.

Strategy #1: Update It

Conceptually simplest is to update existing data with late data.  Each item of data, no matter how late, is inserted according to its timestamp. 

This can be done in a straightforward manner with many databases. It can be performed with simple data storage. But any scaling is hard – e.g., new data files or partitions need to be generated for late data. There is no repeatability (very late data might have arrived between repetitions of the same query), and any stored aggregations must be augmented, reprocessed or dropped. 

Thus this method is mostly suited for smaller scales.
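
As a rough illustration of why updating in place gives up repeatability, here is a minimal in-memory sketch. The names `insert` and `total` and the integer timestamps are assumptions made for this example, not part of any particular database.

```python
import bisect

# Events are kept sorted by their own timestamp, however late they arrive.
events = []


def insert(events, timestamp, value):
    """Place an item at the position given by its timestamp."""
    bisect.insort(events, (timestamp, value))


def total(events, start, end):
    """Sum the values of all items with start <= timestamp < end."""
    return sum(value for timestamp, value in events if start <= timestamp < end)


insert(events, 10, 1.0)
insert(events, 30, 2.0)
first = total(events, 0, 60)

# A late item shows up for a timestamp that was already queried.
insert(events, 20, 5.0)
second = total(events, 0, 60)

print(first, second)  # the same query now gives a different answer
```

The two calls to `total` use identical arguments but return different results, and any aggregate stored after the first call is now stale.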

Strategy #2: Two Times

Bi-temporal modeling lets us add repeatability: add a second field, a serialized storage arrival time, to all data. Every query for analytics or aggregation can filter first by timestamp and then by some storage arrival time that is known (by serialization) to be in the past.

Aggregates include the upper storage arrival time in their metadata. A query can then reuse an aggregate and filter the primary data only for items that arrived after that time.
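
One way to picture the two time fields is the sketch below. It assumes a simple counter as the serialized arrival time; the class `BitemporalStore` and its methods are hypothetical names for this example only.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Record:
    event_time: int  # when the event happened
    arrival: int     # serialized storage arrival time
    value: float


class BitemporalStore:
    def __init__(self):
        self._records = []
        self._clock = count(1)  # assumed: a strictly increasing arrival counter

    def append(self, event_time, value):
        arrival = next(self._clock)
        self._records.append(Record(event_time, arrival, value))
        return arrival

    def total(self, start, end, as_of, after=0):
        """Sum values by event time, using only data with after < arrival <= as_of."""
        return sum(
            r.value
            for r in self._records
            if start <= r.event_time < end and after < r.arrival <= as_of
        )


store = BitemporalStore()
store.append(10, 1.0)
checkpoint = store.append(30, 2.0)
before = store.total(0, 60, as_of=checkpoint)

late_arrival = store.append(20, 5.0)  # late data, event time 20
repeated = store.total(0, 60, as_of=checkpoint)   # same answer as before
latest = store.total(0, 60, as_of=late_arrival)   # includes the late item

# Reuse a stored aggregate: add only what arrived after its upper arrival time.
incremental = before + store.total(0, 60, as_of=late_arrival, after=checkpoint)
```

Because `as_of` is part of the query, repeating it returns the same result, while `incremental` matches a full recomputation without rescanning older arrivals.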

Strategy #3: Ignore It!

Yet another option is to ignore late data. Set some fixed deadline interval. Any data arriving later than that deadline is dropped (preferably with some observability). Release data for access after the deadline.

This is a simple option to understand, implement, and scale. But for repeatability it delays all data by the deadline interval, even when all data arrive on time. So it is directly useful if there is a relevant deadline value.
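
A minimal sketch of a single fixed deadline follows. It assumes each item carries both an event time and an arrival time in the same units; `DeadlineCollector` and its fields are illustrative names, and the `dropped` counter stands in for real observability.

```python
class DeadlineCollector:
    def __init__(self, deadline):
        self.deadline = deadline
        self.accepted = []
        self.dropped = 0  # count of dropped items, for observability

    def offer(self, event_time, arrival_time, value):
        """Accept an item unless it arrived after its deadline."""
        if arrival_time - event_time > self.deadline:
            self.dropped += 1
            return False
        self.accepted.append((event_time, value))
        return True

    def readable(self, now):
        """Release only items whose deadline has passed, so they can no longer change."""
        return sorted(
            (t, v) for t, v in self.accepted if t + self.deadline < now
        )


collector = DeadlineCollector(deadline=5)
on_time = collector.offer(event_time=10, arrival_time=12, value=1.0)
too_late = collector.offer(event_time=10, arrival_time=20, value=9.0)

not_yet = collector.readable(now=15)   # deadline for event time 10 not yet passed
released = collector.readable(now=16)  # now safe to read
```

Note that the on-time item is still held back until its deadline passes, which is the delay described above.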

Combine ignoring late data with an effective arrival time by layering multiple instances of this strategy. Set a sequence of deadline intervals. Each item goes into the first layer whose deadline it has not missed, giving a quantized arrival time.

Equivalently, data collection keeps a sequence of buckets with increasing deadlines.  When a deadline expires its bucket is sealed and a new one is opened with the same deadline. 

Queries are for a particular time and use only buckets that were sealed at that time; the time is part of the query, ensuring repeatability.
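
The layering can be sketched as follows. This is a simplification under stated assumptions: instead of managing physical buckets, each item records the time at which its layer would be sealed, and the deadline values and the name `LayeredCollector` are hypothetical.

```python
import bisect


class LayeredCollector:
    def __init__(self, deadlines):
        self.deadlines = sorted(deadlines)
        self.items = []   # (event_time, sealed_at, value)
        self.dropped = 0

    def offer(self, event_time, arrival_time, value):
        """Put the item in the first layer whose deadline it has not missed."""
        lateness = arrival_time - event_time
        layer = bisect.bisect_left(self.deadlines, lateness)
        if layer == len(self.deadlines):
            self.dropped += 1  # later than every deadline
            return None
        sealed_at = event_time + self.deadlines[layer]  # quantized arrival time
        self.items.append((event_time, sealed_at, value))
        return sealed_at

    def total(self, start, end, as_of):
        """Sum by event time, using only data already sealed at time as_of."""
        return sum(
            v for t, sealed_at, v in self.items
            if start <= t < end and sealed_at < as_of
        )


layers = LayeredCollector(deadlines=[5, 60])
layers.offer(event_time=10, arrival_time=12, value=1.0)
early_view = layers.total(0, 60, as_of=20)

layers.offer(event_time=10, arrival_time=40, value=5.0)   # lands in the second layer
layers.offer(event_time=10, arrival_time=500, value=9.0)  # misses every deadline

early_again = layers.total(0, 60, as_of=20)  # unchanged by the later arrivals
full_view = layers.total(0, 60, as_of=100)
```

Since `as_of` is part of the query, `early_again` equals `early_view` even though more data has arrived in between.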
