Processing what you’d call the “latest” data may sound simple, but in reality, it’s complex and challenging. When you gather time-based data, you’ll quickly notice that some records are born late, some achieve lateness, and others have lateness forced upon them.
How does that happen? Here are a few good reasons why late-arriving data are so common:
- Trace data is usually indexed by start time, so data for an interval that is still in progress is born late: it can’t be created until the interval ends.
- A failure or a burst of load can slow a collection system down, and a slower system generates its data late.
- Distributed collection systems can delay some of the data while the rest arrives on time.
Lateness can creep in at every stage of the collection process. And because most data-gathering pipelines are distributed, late data also arrives out of order.
When you can’t avoid late-arriving data, you need to deal with it.
Note: Some applications benefit from repeatable queries, where the same query returns the same result every time it runs, and adding late-arriving data can directly conflict with that. For example, saved aggregations must be updated to accommodate late data.
So, how do you handle late-arriving data? Here’s a selection of strategies.
Strategy 1: Update your data
The simplest fix is to update the stored data in place. Every piece of data, no matter how late it arrives, is added according to its timestamp.
Many databases make this easy. However, it is hard to scale, because late data requires the creation of new data files or partitions.
You also lose repeatability: extremely late data may arrive between two runs of the same query, so the results can differ. And you need to supplement, reprocess, or delete any saved aggregations that the late data affects.
You can probably tell by now that this approach works best for smaller data scales.
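Here is a minimal sketch of the update-in-place approach and its effect on repeatability. The `UpdatableStore` class, its hourly cache, and the sample values are assumptions made for illustration; a real database would handle storage and invalidation differently.

```python
class UpdatableStore:
    """Toy store: events are kept by timestamp and hourly sums are cached."""

    def __init__(self):
        self.events = {}        # event_time (seconds) -> value
        self.hourly_cache = {}  # hour index -> cached sum

    def upsert(self, event_time, value):
        # Late or not, the record lands at its own timestamp...
        self.events[event_time] = value
        # ...and any saved aggregate for that hour is now stale, so drop it.
        self.hourly_cache.pop(event_time // 3600, None)

    def hourly_sum(self, hour):
        if hour not in self.hourly_cache:
            self.hourly_cache[hour] = sum(
                v for t, v in self.events.items() if t // 3600 == hour
            )
        return self.hourly_cache[hour]


store = UpdatableStore()
store.upsert(10, 5)
first = store.hourly_sum(0)   # 5
store.upsert(20, 7)           # a late record for the same hour
second = store.hourly_sum(0)  # 12: the same query now returns a different answer
```

The two calls to `hourly_sum(0)` disagree, which is exactly the loss of repeatability described above.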
Strategy 2: Bi-temporal modeling
Bi-temporal modeling is another way to handle late-arriving data. It restores repeatability by adding a second time field to every record: a serialized storage arrival time, stored alongside the event timestamp.
Every analytics or aggregation query can then filter by timestamp and also by a storage arrival time that is known, thanks to serialization, to lie in the past.
Aggregates record the upper bound of the storage arrival times they cover in their metadata, so a query can reuse an aggregate and then filter the primary data for only the records that arrived later.
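The sketch below illustrates the idea with an in-memory store. It assumes a simple counter as the serialized arrival time; `BitemporalStore`, `total`, and `total_from_aggregate` are hypothetical names, not part of any particular product.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Record:
    event_time: int   # when the event happened
    arrival_seq: int  # serialized storage arrival order (assumed: a counter)
    value: int


class BitemporalStore:
    def __init__(self):
        self._seq = count(1)
        self.records = []

    def append(self, event_time, value):
        rec = Record(event_time, next(self._seq), value)
        self.records.append(rec)
        return rec.arrival_seq

    def total(self, start, end, as_of):
        # Filter by event time, then by arrival time no later than as_of.
        return sum(r.value for r in self.records
                   if start <= r.event_time < end and r.arrival_seq <= as_of)

    def total_from_aggregate(self, start, end, aggregate, as_of):
        # Reuse a saved aggregate and add only the records that arrived after it.
        agg_value, agg_as_of = aggregate
        delta = sum(r.value for r in self.records
                    if start <= r.event_time < end
                    and agg_as_of < r.arrival_seq <= as_of)
        return agg_value + delta


store = BitemporalStore()
store.append(100, 5)
watermark = store.append(110, 3)
saved = (store.total(100, 200, as_of=watermark), watermark)  # aggregate + metadata
latest = store.append(105, 4)                                # late arrival
repeat = store.total(100, 200, as_of=watermark)              # still 8
fresh = store.total_from_aggregate(100, 200, saved, as_of=latest)  # 12
```

A query pinned to `watermark` keeps returning the same result after the late record lands, while a query pinned to `latest` picks it up without recomputing the saved aggregate.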
Strategy 3: Just ignore your late-arriving data
Another alternative is to disregard late data. Set a fixed deadline interval and discard any data received after it (preferably with some observability in place, so you know how much is dropped). Release the data for access only once the deadline has passed.
This strategy is easy to apply and scale. However, to stay repeatable, it delays all data by the deadline period, even when everything arrives on schedule. It is most useful when the application has a meaningful deadline value.
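As a rough sketch, a single deadline could look like the following. The 300-second deadline, the `DeadlineStore` class, and the `dropped` counter (standing in for real observability) are all assumptions for illustration.

```python
class DeadlineStore:
    def __init__(self, deadline):
        self.deadline = deadline  # seconds a record may lag behind its event time
        self.accepted = []        # (event_time, value)
        self.dropped = 0          # minimal observability: count discarded records

    def ingest(self, event_time, arrival_time, value):
        if arrival_time - event_time > self.deadline:
            self.dropped += 1
            return False
        self.accepted.append((event_time, value))
        return True

    def query(self, start, end, now):
        # A window is released only after its deadline has fully passed,
        # so no accepted record can still be on its way.
        if now < end + self.deadline:
            raise ValueError("window not released yet")
        return sum(v for t, v in self.accepted if start <= t < end)


store = DeadlineStore(deadline=300)
on_time = store.ingest(event_time=0, arrival_time=100, value=5)
too_late = store.ingest(event_time=10, arrival_time=400, value=7)
result = store.query(0, 60, now=360)
```

Note that `query` refuses to answer before `end + deadline`, which is the delay this strategy imposes even when all data arrives on schedule.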
By layering several instances of this method, you can combine disregarding late data with an effective arrival time. Define a sequence of increasing deadline intervals; each record goes into the first layer whose deadline it meets, which gives it a quantized arrival time.
Put another way, data collection maintains a series of buckets with increasing deadlines. When a deadline passes, its bucket is sealed, and a new one with the same deadline is opened.
A query runs as of a certain time and uses only the buckets that were sealed by that time. Including that time in the query makes it repeatable.
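One way to picture the layered variant is the sketch below. It assumes two deadlines (60 and 3600 seconds) and models each layer as a list; `LayeredStore` and its methods are hypothetical names.

```python
import bisect


class LayeredStore:
    def __init__(self, deadlines):
        self.deadlines = sorted(deadlines)
        self.layers = [[] for _ in self.deadlines]
        self.dropped = 0

    def ingest(self, event_time, arrival_time, value):
        lateness = arrival_time - event_time
        # First layer whose deadline is not exceeded: the quantized arrival time.
        i = bisect.bisect_left(self.deadlines, lateness)
        if i == len(self.deadlines):
            self.dropped += 1  # later than every deadline
            return None
        self.layers[i].append((event_time, value))
        return i

    def query(self, start, end, as_of):
        total = 0
        for deadline, layer in zip(self.deadlines, self.layers):
            # Use a layer only if it was already sealed for this window at as_of.
            if as_of >= end + deadline:
                total += sum(v for t, v in layer if start <= t < end)
        return total


store = LayeredStore([60, 3600])
fast = store.ingest(0, 30, 5)        # layer 0
slow = store.ingest(10, 600, 7)      # layer 1
lost = store.ingest(20, 10000, 9)    # dropped
early_view = store.query(0, 60, as_of=120)    # only layer 0 is sealed
later_view = store.query(0, 60, as_of=3660)   # both layers are sealed
```

Because `as_of` is part of the query, rerunning `query(0, 60, as_of=120)` later still returns the same value, while a newer `as_of` sees the slower layer too.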
Strategies from Databricks and others
Databricks shared a helpful article on how to deal with late-arriving data. But before we go into the details…
Late-arriving data – dimensions vs. facts
Data engineering is all about processing facts and dimensions. You can find fact and dimension tables in a design known as a star schema. Its goal is to keep tables denormalized enough that data practitioners can write SQL queries simply and quickly.
A dimension in a warehouse is an entity: an individual, non-overlapping data element. Facts are the behavioral data generated as a consequence of an activity on or by a dimension. A fact table is surrounded by one or more dimension tables because it references them through their natural or surrogate keys.
Because facts depend on dimensions in this way, late-arriving dimensions are much more inconvenient than late facts. And you often need to handle dimensions first to ensure data correctness.
ETLs and late-arriving data
In a typical data warehouse scenario, ETL jobs are designed so that dimensions are fully processed first. When that is finished, the facts are loaded on the assumption that the relevant dimension data is already in place.
Late-arriving dimensions break this pattern because fact data is processed before the related dimension information. ETL procedures may encounter fact records that refer to new dimension members, that is, members that have yet to be processed.
Natural and surrogate keys
Natural keys and surrogate keys play distinct roles in the context of dimensional models.
When you process late-arriving dimension rows, the natural key helps determine whether a dimension record is new or already exists in the dimension table. The surrogate key connects the fact table with the dimension.
Depending on your use cases and the limits of your infrastructure, you can choose from the methods below.
Strategy 4: Process first, hydrate after
Facts land in the fact tables even if a dimension lookup fails, with default values in the dimension columns. A hydration procedure then runs regularly to fill in the missing dimension data in your fact tables.
The disadvantage of this strategy is that it doesn’t work when dimension keys are also foreign keys in the fact tables. Another drawback is that in some scenarios, facts may be written to many destinations, so the hydration process needs to update the missing dimension columns in all of them to keep the data consistent.
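A minimal sketch of this load-then-hydrate cycle, assuming a dictionary as the dimension table and an `"UNKNOWN"` placeholder as the default value (both are illustrative choices, as are the function and column names):

```python
UNKNOWN = "UNKNOWN"  # assumed default for a missing dimension value


def load_facts(events, dimension):
    """Load every event, even when its dimension lookup fails."""
    return [
        {**e, "customer_name": dimension.get(e["customer_code"], UNKNOWN)}
        for e in events
    ]


def hydrate(facts, dimension):
    """Fill in dimension values that have arrived since the facts were loaded."""
    updated = 0
    for fact in facts:
        code = fact["customer_code"]
        if fact["customer_name"] == UNKNOWN and code in dimension:
            fact["customer_name"] = dimension[code]
            updated += 1
    return updated


dimension = {"CUST-001": "Acme"}
facts = load_facts(
    [{"customer_code": "CUST-001", "amount": 10},
     {"customer_code": "CUST-004", "amount": 25}],
    dimension,
)
placeholder = facts[1]["customer_name"]  # "UNKNOWN" until the dimension arrives
dimension["CUST-004"] = "Globex"         # the late dimension shows up
hydrated = hydrate(facts, dimension)     # run on a schedule in practice
```

If the same facts were written to several destinations, `hydrate` would have to run against each of them, which is the consistency burden mentioned above.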
Strategy 5: Detect late-arriving dimensions at an early stage
You can detect early-arriving facts (facts for which the relevant dimension has not yet arrived), put them on hold, and retry them after a period of time until they are processed or retries are exhausted. This ensures the target tables’ data quality and consistency.
Databricks designed an internal framework that integrates into the streaming pipelines to manage such late-arriving dimensions. The system is based on a fundamental pattern shared by all streaming pipelines, described under Strategy 6.
Strategy 6: The reconciliation pattern
This pattern prepares the data for reconciliation in two steps.
First, write unjoined records to the streaming pipeline errors table. Next, set up a process that consolidates the failed retries for the same event into a new fact row with extra retry detail.
The second step can run as a scheduled batch job, which also controls how often events are retried.
The flow looks as follows:
- Read new events from Kafka.
- Read retryable events from the reconciliation table.
- UNION the new Kafka events with the retryable events.
- Join the unioned data to the users dimension table.
- If a fact arrives early (its dimension has not yet arrived), mark it as retryable.
- Write retryable records to the streaming pipeline errors table.
- MERGE all of the valid records into the target Delta table.
- In a batch operation, dedupe the new error records and MERGE them into the reconciliation table with an updated retry count and status.
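To make the flow concrete, here is a small in-memory sketch. It is not the Databricks framework: plain lists and dictionaries stand in for Kafka, the errors table, the reconciliation table, and the Delta target, and every name in it (`run_micro_batch`, `reconcile`, `MAX_ATTEMPTS`, the status values) is an assumption.

```python
MAX_ATTEMPTS = 3  # hypothetical retry budget


def run_micro_batch(new_events, reconciliation, users, target, errors):
    # Read retryable events and union them with the new events.
    retryable = [row["event"] for row in reconciliation.values()
                 if row["status"] == "RETRY"]
    for event in list(new_events) + retryable:
        # Join to the users dimension; a miss means the fact arrived early.
        name = users.get(event["user_id"])
        if name is None:
            errors.append(event)  # write to the errors table
        else:
            # Keyed write stands in for MERGE into the target table.
            target[event["event_id"]] = {**event, "user_name": name}
            if event["event_id"] in reconciliation:
                reconciliation[event["event_id"]]["status"] = "RESOLVED"


def reconcile(errors, reconciliation):
    # Batch step: dedupe new errors, then update attempt count and status.
    latest = {event["event_id"]: event for event in errors}
    errors.clear()
    for event_id, event in latest.items():
        row = reconciliation.setdefault(
            event_id, {"event": event, "attempts": 0, "status": "RETRY"})
        row["attempts"] += 1
        if row["attempts"] >= MAX_ATTEMPTS:
            row["status"] = "EXHAUSTED"


users = {"u1": "Ada"}
target, errors, reconciliation = {}, [], {}
run_micro_batch([{"event_id": "e1", "user_id": "u1"},
                 {"event_id": "e2", "user_id": "u2"}],
                reconciliation, users, target, errors)
reconcile(errors, reconciliation)
status_after_first = reconciliation["e2"]["status"]  # "RETRY"
users["u2"] = "Grace"                                # the dimension arrives
run_micro_batch([], reconciliation, users, target, errors)
reconcile(errors, reconciliation)
```

After the second micro-batch, event `e2` is in the target and its reconciliation row is marked resolved; an event whose dimension never arrives would stop being retried once it reaches `MAX_ATTEMPTS`.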
For more details, check out the Databricks article.
Strategy 7: Never Process Fact
Shared by Soumendra Mishra, this method looks up each incoming fact in the dimension table. If no corresponding dimension record is found, the record is discarded and not loaded into the fact table. In other words, never process the fact record until a dimension record is available.
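In code, this amounts to a lookup followed by a filter. The sketch below assumes a dictionary that maps the natural key to a surrogate key; the function name and the `CustomerKey` and `Amount` columns are hypothetical.

```python
def load_matching_facts(staged_rows, dim_keys):
    """Load only the rows whose dimension record already exists."""
    loaded, discarded = [], []
    for row in staged_rows:
        key = dim_keys.get(row["CustomerCode"])
        if key is None:
            discarded.append(row)  # no dimension record: never loaded
        else:
            loaded.append({"CustomerKey": key, "Amount": row["Amount"]})
    return loaded, discarded


dim_keys = {"CUST-001": 1}  # natural key -> surrogate key
loaded, discarded = load_matching_facts(
    [{"CustomerCode": "CUST-001", "Amount": 10},
     {"CustomerCode": "CUST-004", "Amount": 25}],
    dim_keys,
)
```

Returning the discarded rows, rather than silently dropping them, at least lets you count or log what was lost.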
Strategy 8: Park and Retry
If there is no related dimension record, another strategy is to put the mismatched record into a landing table and retry loading data into the fact table in the next batch operation.
In Mishra’s example, a record with the CustomerCode [CUST-004] goes into a landing table, and the next batch operation loads it from the landing table into the fact table if it now matches the associated dimension record.
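A minimal sketch of park and retry, with lists standing in for the landing and fact tables. The `load_batch` function and the `CustomerKey` and `Amount` columns are assumptions for illustration; only the CustomerCode value comes from the example.

```python
def load_batch(incoming, landing, dim_keys, fact_table):
    """Retry parked rows first, then process the new ones."""
    pending = landing + incoming
    landing.clear()
    for row in pending:
        key = dim_keys.get(row["CustomerCode"])
        if key is None:
            landing.append(row)  # park it until a later batch
        else:
            fact_table.append({"CustomerKey": key, "Amount": row["Amount"]})


dim_keys = {"CUST-001": 1}  # natural key -> surrogate key
landing, fact_table = [], []
load_batch([{"CustomerCode": "CUST-001", "Amount": 10},
            {"CustomerCode": "CUST-004", "Amount": 25}],
           landing, dim_keys, fact_table)
parked_after_first = len(landing)  # CUST-004 is waiting
dim_keys["CUST-004"] = 4           # the dimension record arrives
load_batch([], landing, dim_keys, fact_table)
```

Unlike the previous strategy, nothing is lost: the parked row is loaded as soon as a batch finds its dimension record.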
Strategy 9: Inferred Flag Lookup
This is another strategy shared by Mishra. If no matching dimension record exists, insert a new dimension record and push its reference key to the fact table. The new dimension record gets an autogenerated surrogate key, but all other column values are kept as N/A (not available).
The InferredFlag is set to True. When the actual dimension record becomes available, the columns marked N/A are updated with the new data, and the InferredFlag is set to False.
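The sketch below shows one way the inferred-member logic could work. The `CustomerDimension` class, its method names, and the `CustomerName` column are assumptions; only `InferredFlag`, the N/A placeholder, and the CustomerCode come from the description.

```python
from itertools import count


class CustomerDimension:
    def __init__(self):
        self._next_key = count(1)  # autogenerated surrogate keys
        self.rows = {}             # natural key (CustomerCode) -> row

    def lookup_or_infer(self, code):
        """Return the surrogate key, creating an inferred row if needed."""
        row = self.rows.get(code)
        if row is None:
            row = {"CustomerKey": next(self._next_key), "CustomerCode": code,
                   "CustomerName": "N/A", "InferredFlag": True}
            self.rows[code] = row
        return row["CustomerKey"]

    def upsert(self, code, name):
        """Load a real dimension record, filling in an inferred row if present."""
        row = self.rows.get(code)
        if row is None:
            row = {"CustomerKey": next(self._next_key), "CustomerCode": code}
            self.rows[code] = row
        row["CustomerName"] = name
        row["InferredFlag"] = False
        return row["CustomerKey"]


dim = CustomerDimension()
dim.upsert("CUST-001", "Acme")
fact_key = dim.lookup_or_infer("CUST-004")             # fact arrives first
inferred = dim.rows["CUST-004"]["InferredFlag"]        # True
final_key = dim.upsert("CUST-004", "Globex")           # real record arrives
```

The surrogate key handed to the fact table stays the same when the real record arrives, so the fact rows need no update.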
Late-arriving data is a typical problem in data warehouse solutions. Data practitioners have come up with myriad ways to deal with this challenge and make sure their systems work smoothly and productively.
We hope that our line-up of strategies for handling late-arriving data will help you navigate this tricky territory. In the end, the right choice will depend on many factors, from your business context to infrastructure limitations.