This week I’m pleased to share “Tardy Data”, a chapter I wrote for the new book 97 Things Every Data Engineer Should Know, adapted to blog format. It’s about what “late” means in data engineering. This content is licensed under the CC BY 4.0.
Tardy Data
When collecting time-based data, some are born late, some achieve lateness, and some have lateness thrust upon them. That makes processing the “latest” data challenging. For instance,
Trace data is usually indexed by start time. Data for an interval that is still ongoing cannot be generated yet, so it is born late.
Collection systems can slow down in the presence of failures or bursts, so data that was generated on time achieves lateness.
Distributed collection systems can delay some data, thrusting lateness upon them.
Lateness occurs at all levels of a collection pipeline. Most collection pipelines are distributed, and late data arrives significantly out of order.
Lateness is unavoidable; handling it robustly is essential.
At the same time, providing repeatable queries is desirable for some purposes, and adding late data can directly clash with it. For instance, any aggregation must take late data into account.
Common strategies differ in how they store and how they query late data. Which to choose depends as much on business logic as it does on technical advantages.
Strategy #1: Update It
Conceptually simplest is to update existing data with late data. Each item of data, no matter how late, is inserted according to its timestamp.
Many databases support this in a straightforward manner, and it works with simple data storage. But scaling is hard: new data files or partitions need to be generated for late data, there is no repeatability (very late data might arrive between repetitions of the same query), and any stored aggregations must be augmented, reprocessed, or dropped.
Thus this method is mostly suited for smaller scales.
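A minimal in-memory sketch of this strategy (the store and its methods are illustrative, not any particular database API) shows both the simplicity and the repeatability problem: a late insert silently changes the answer to an earlier query.

```python
from bisect import insort

class TimeSeriesStore:
    """Illustrative in-memory store: every item is inserted at its
    event timestamp, no matter how late it arrives."""

    def __init__(self):
        self._rows = []  # kept sorted as (event_ts, value)

    def insert(self, event_ts, value):
        # Late data is simply placed at its timestamp position.
        insort(self._rows, (event_ts, value))

    def query(self, start_ts, end_ts):
        # Returns rows with event_ts in [start_ts, end_ts). Repeating
        # the same query may give a different answer if late data
        # arrived in between -- this is the repeatability problem.
        return [(t, v) for t, v in self._rows if start_ts <= t < end_ts]

store = TimeSeriesStore()
store.insert(100, "a")
store.insert(300, "c")
store.insert(200, "b")  # arrives late, slotted in by timestamp
print(store.query(100, 301))  # [(100, 'a'), (200, 'b'), (300, 'c')]
```

At larger scales the same insert would mean rewriting data files or partitions, which is exactly why this approach stops scaling.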
Strategy #2: Two Times
Bi-temporal modeling lets us add repeatability: attach a second field, a serialized storage arrival time, to all data. Every query for analytics or aggregation can filter first by event timestamp and then by some storage arrival time that is known (by serialization) to be in the past.
Aggregates include the upper storage arrival time in their metadata; a query can then use an aggregate and filter the primary data for records that arrived after that time.
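The bi-temporal filter can be sketched as follows (field names like `event_ts` and `arrival_ts` are illustrative): fixing the arrival-time cutoff as part of the query is what makes it repeatable.

```python
# Each record carries two times: when the event happened (event_ts)
# and when it was stored (arrival_ts). Names are illustrative.
records = [
    {"event_ts": 100, "arrival_ts": 105, "value": 1},
    {"event_ts": 200, "arrival_ts": 205, "value": 2},
    {"event_ts": 150, "arrival_ts": 300, "value": 3},  # a late arrival
]

def query(records, start_ts, end_ts, as_of):
    """Repeatable query: filter by event time, then keep only rows
    whose storage arrival time is at or before a fixed 'as_of'."""
    return sorted(
        (r["event_ts"], r["value"])
        for r in records
        if start_ts <= r["event_ts"] < end_ts and r["arrival_ts"] <= as_of
    )

# The same (range, as_of) pair always returns the same answer:
print(query(records, 100, 250, as_of=210))  # [(100, 1), (200, 2)]
# A later as_of reveals the late row without changing older queries:
print(query(records, 100, 250, as_of=300))  # [(100, 1), (150, 3), (200, 2)]
```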
Strategy #3: Ignore It!
Yet another option is to ignore late data. Set some fixed deadline interval. Any data arriving later than that deadline is dropped (preferably with some observability). Release data for access after the deadline.
This is a simple option to understand, implement, and scale. But for repeatability it delays all data by the deadline interval, even when all data arrive on time. So it is only directly useful if the business logic supplies a relevant deadline value.
Combine ignoring late data with an effective arrival time by layering multiple instances of this strategy. Set a sequence of deadline intervals. Data go into the first layer whose deadline they have not yet missed, giving a quantized arrival time.
Equivalently, data collection keeps a sequence of buckets with increasing deadlines. When a deadline expires, its bucket is sealed and a new one is opened with the same deadline.
Queries specify a particular time and use only buckets that were already sealed at that time; because the time is part of the query, results are repeatable.
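A sketch of deadline buckets, under illustrative assumptions (fixed-width buckets, a single lateness layer, and made-up names like `LayeredBuckets`): data past its bucket's deadline is dropped but counted, and queries read only sealed buckets.

```python
class LayeredBuckets:
    """Illustrative deadline buckets. Each bucket covers a fixed
    interval of event time and is sealed once 'now' passes its
    deadline (interval end plus allowed lateness)."""

    def __init__(self, interval, lateness):
        self.interval = interval  # bucket width in event time
        self.lateness = lateness  # fixed deadline after interval end
        self.buckets = {}         # bucket start -> list of (ts, value)
        self.dropped = 0          # observability: count dropped late data

    def insert(self, event_ts, value, now):
        start = (event_ts // self.interval) * self.interval
        deadline = start + self.interval + self.lateness
        if now > deadline:
            self.dropped += 1     # too late: drop, but make it visible
            return False
        self.buckets.setdefault(start, []).append((event_ts, value))
        return True

    def query(self, as_of):
        """Use only buckets already sealed at 'as_of'; making the time
        part of the query keeps the result repeatable."""
        out = []
        for start, rows in sorted(self.buckets.items()):
            if start + self.interval + self.lateness <= as_of:
                out.extend(sorted(rows))
        return out

lb = LayeredBuckets(interval=10, lateness=5)
lb.insert(3, "a", now=4)    # on time, accepted
lb.insert(2, "b", now=20)   # past the deadline (15), dropped
print(lb.query(as_of=15))   # [(3, 'a')]
print(lb.dropped)           # 1
```

A real layered version would keep several such structures with increasing deadlines, routing each item to the first layer whose deadline it has not yet missed.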
Ariel Shaqed (Scolnicov) is a Principal Software Engineer at lakeFS. He has been generating big data from the days when 2 gigabytes was “big”. Since then he has worked for companies small and large on everything from genomics, to network tracing and monitoring, to cloud execution platforms. When not with his partner and three boys, or cooking, or working, he sometimes runs.