Databricks Delta Lake includes time travel features that let you access any previous version of the data that Delta automatically versions and stores in your data lake. This makes it simple to audit changes, roll back data after accidental bad writes or deletes, and reproduce reports and experiments.
How does Delta Time Travel work exactly? Keep reading for a primer on time travel in Databricks.
What is Delta Lake Time Travel?
With Delta Lake time travel, every operation that writes to a Delta table or directory is automatically versioned.
There are two methods by which you can access the various versions of the data:
1. Applying a timestamp
2. Using a version number
Delta Time Travel Syntax
To query a Delta table with time travel, you add a clause after the table name specification.
timestamp_expression may be any of the following:
- '2018-10-18T22:15:12.013Z', that is, a string that can be cast to a timestamp
- cast('2018-10-18 13:36:32 CEST' as timestamp)
- '2018-10-18', that is, a date string
- current_timestamp() - interval 12 hours
- date_sub(current_date(), 1)
Note that version and timestamp_expression are not subqueries.
Accepted formats are only date or timestamp strings: for example, "2019-01-01" and "2019-01-01T00:00:00.000Z".
Check out the code below for an illustration of time travel syntax:
SELECT * FROM people10m TIMESTAMP AS OF '2018-10-18T22:15:12.013Z';
SELECT * FROM people10m VERSION AS OF 123;
You may also include the version or timestamp in the table name using the @ syntax. The timestamp must be formatted as yyyyMMddHHmmssSSS, and you specify a version by prepending a v to the version number after the @.
Here’s a snippet showing this syntax:
SELECT * FROM people10m@20190101000000000
SELECT * FROM people10m@v123
Different Ways of Time Travel in Databricks
1. Timestamp
Scala
The timestamp or date string can be added to the DataFrame reader as an option:
val df = spark.read
.format("delta")
.option("timestampAsOf", "2019-01-01")
.load("/path/to/my/table")Python
Here’s an example of a timestamp in Python:
df = spark.read \
.format("delta") \
.option("timestampAsOf", "2019-01-01") \
.load("/path/to/my/table")SQL syntax
What about SQL? Here’s an example:
SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01"
SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01 01:30:00.000"You can still go back in time for a table by adding the timestamp in yyyyMMddHHmmssSSS format to the path:
val inputPath = "/path/to/my/table@20190101000000000"
val df = loadData(inputPath)
// Function in a library that you don't have access to
def loadData(inputPath : String) : DataFrame = {
spark.read
.format("delta")
.load(inputPath)
}
inputPath = "/path/to/my/table@20190101000000000"
df = loadData(inputPath)
# Function in a library that you don't have access to
def loadData(inputPath):
    return spark.read \
        .format("delta") \
        .load(inputPath)
2. Version Number
Every write in Delta has a version number, which you can use to go back in time.
Scala
Here’s an example code snippet that would be used in Scala:
val df = spark.read
.format("delta")
.option("versionAsOf", "5238")
.load("/path/to/my/table")
val df = spark.read
.format("delta")
.load("/path/to/my/table@v5238")Python
This is what things look like in Python:
df = spark.read \
.format("delta") \
.option("versionAsOf", "5238") \
.load("/path/to/my/table")
df = spark.read \
.format("delta") \
.load("/path/to/my/table@v5238")SQL syntax
And SQL:
SELECT count(*) FROM my_table VERSION AS OF 5238
SELECT count(*) FROM my_table@v5238
SELECT count(*) FROM delta.`/path/to/my/table@v5238`
How to Recover a Delta Lake Table Using Databricks Time Travel
Identify the Correct Version
First things first, you need to know which Delta table version you want to recover.
Consider table_name first. It identifies the Delta table to be restored, and Databricks forbids including a temporal specification in the table name here.
Now let’s consider the timestamp_expression. It can be any of the following:
- '2018-10-18T22:15:12.013Z', that is, a string that can be cast to a timestamp
- cast('2018-10-18 13:36:32 CEST' as timestamp)
- '2018-10-18', that is, a date string
- current_timestamp() - interval 12 hours
- date_sub(current_date(), 1)
- Any other expression that is or can be cast to a timestamp
Then there’s version. It’s a long value that can be obtained from the output of DESCRIBE HISTORY table_spec.
Note: Neither timestamp_expression nor version can be subqueries.
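For example, here is a minimal sketch of that lookup in a notebook, assuming a Delta table named my_table (the name is a placeholder):
# Illustrative only: list the versions recorded for a table named my_table
history = spark.sql("DESCRIBE HISTORY my_table")
# Each row records a version number, its timestamp, and the operation that produced it
history.select("version", "timestamp", "operation").show(truncate=False)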
Restore the Table
To restore a Delta table to an earlier state, use the RESTORE command. A Delta table internally preserves older versions of itself, which is what makes this restoration possible.
Note: You can’t restore a table to a previous version whose data files were manually or automatically removed. You can still partially restore to that version if spark.sql.files.ignoreMissingFiles is set to true.
When going back to an earlier timestamp, use the format yyyy-MM-dd HH:mm:ss. Alternatively, you can supply just a date string (yyyy-MM-dd).
Here’s what the command looks like in Databricks SQL:
RESTORE [ TABLE ] table_name [ TO ] time_travel_version
time_travel_version
{ TIMESTAMP AS OF timestamp_expression |
VERSION AS OF version }
Don’t forget that RESTORE is a data-modifying operation: Delta Lake log entries added by the RESTORE command have dataChange set to true. These entries are treated as new data updates, so a downstream application that processes updates to the Delta table, such as a Structured Streaming job, may produce duplicate data when it reprocesses them.
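To make this concrete, here is a hedged sketch of the command in use; the table name, version number, and timestamp are illustrative placeholders, not values from a real workspace:
# Optional, before restoring: tolerate removed data files for a partial restore (see the note above)
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")
# Illustrative placeholders: restore my_table to a specific version number...
spark.sql("RESTORE TABLE my_table TO VERSION AS OF 5238")
# ...or to its state as of a timestamp (yyyy-MM-dd or yyyy-MM-dd HH:mm:ss)
spark.sql("RESTORE TABLE my_table TO TIMESTAMP AS OF '2019-01-01'")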
Using Databricks Time Travel: Examples
Fixing accidental deletes to a table
Let’s say that you want to fix accidental deletes to a table for user 222. Here’s the SQL code you need:
INSERT INTO my_table
SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
WHERE userId = 222
Fix incorrect updates to a table
To fix an incorrect update to a table that happened accidentally, use this:
MERGE INTO my_table target
USING my_table TIMESTAMP AS OF date_sub(current_date(), 1) source
ON source.userId = target.userId
WHEN MATCHED THEN UPDATE SET *
Common Challenges With Changing Data
Auditing Data Changes
Auditing data changes is essential both for data compliance and for basic debugging, to understand how data has changed over time. Organizations transitioning from traditional data systems to big data technology and the cloud face many different challenges in this area.
Reproducing Experiments & Reports
Data practitioners carry out multiple experiments with varying parameters on a particular piece of data during model training.
When they later return to their experiments to reproduce the models, the original data has usually been changed by upstream pipelines. These upstream data changes often catch them off guard, making it difficult to repeat their research.
Some data scientists and organizations adopt the practice of making several copies of the data, which raises storage costs. The same applies to analysts writing reports.
Rollbacks
Data pipelines can occasionally produce inaccurate data for users farther down the line. This can happen for various reasons, such as pipeline bugs, messy data, and unstable infrastructure.
Date-based partitioning makes rollbacks easy for pipelines that do basic appends to directories or tables. This may get quite complicated when dealing with updates and deletes, and data engineers usually have to design a sophisticated pipeline to handle such circumstances.
Databricks Time Travel Key Functionalities
Audit Data Changes
The UI or the DESCRIBE HISTORY command allows you to view the history of table modifications.
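As a small sketch, assuming a Delta table named my_table (a placeholder), an audit query over its history might look like this:
# Illustrative audit: who changed my_table, when, and with which operation
history = spark.sql("DESCRIBE HISTORY my_table")
history.select("version", "timestamp", "userName", "operation", "operationParameters") \
    .orderBy("version", ascending=False) \
    .show(truncate=False)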
Reproduce Experiments & Reports
Time travel is crucial in data science and machine learning. For data scientists, reproducibility of experiments and models is crucial since they may develop hundreds or even thousands of models before putting one into production.
Throughout that process, they may sometimes need to return to prior models. This is extremely difficult to do because data science tools and data management systems are often separate.
By combining MLflow, an open-source platform for the machine learning lifecycle, with Delta’s time-travel capabilities, Databricks addresses the reproducibility issue.
You can easily track which version of the data was used for each training job by logging the timestamped path as an MLflow parameter, enabling reproducible ML training. This lets you reproduce earlier models by going back to the earlier datasets and parameters.
You don’t have to worry about copying data for other trials or coordinating with teams upstream on the data.
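As a rough sketch of this pattern (the path, version number, and parameter names below are illustrative assumptions, not a prescribed Databricks API), you could pin the training data to a Delta version and record it with MLflow:
import mlflow

# Hypothetical values: pin the training data to a specific Delta table version
data_path = "/path/to/my/table"
data_version = 5238
training_data = spark.read.format("delta") \
    .option("versionAsOf", data_version) \
    .load(data_path)

with mlflow.start_run():
    # Record which snapshot of the data this run was trained on
    mlflow.log_param("data_path", data_path)
    mlflow.log_param("data_version", data_version)
    # ... train and log the model against training_data here ...
With the version recorded alongside the run, you can later reload the exact same snapshot to reproduce the model.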
Rollbacks
Time travel also opens the door to simple rollbacks in the event of bad writes. For instance, if a bug in a GDPR pipeline job inadvertently deletes user data, you can quickly repair the pipeline using this:
INSERT INTO my_table
SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
WHERE userId = 111
And here’s how you can repair erroneous updates:
MERGE INTO my_table target
USING my_table TIMESTAMP AS OF date_sub(current_date(), 1) source
ON source.userId = target.userId
WHEN MATCHED THEN UPDATE SET *
Pinned View of a Continuously Updating Delta Table Across Multiple Downstream Jobs
You can now pin the snapshot of an ever-updating Delta table for several downstream tasks using AS OF queries.
Imagine a scenario where a downstream operation reads from a Delta table and updates several destinations regularly. The Delta table is updated continually, every 15 seconds. In these situations, you usually want the source Delta table to have a consistent view so that all destination tables show the same information.
You can deal with such situations easily:
version = spark.sql("SELECT max(version) FROM (DESCRIBE HISTORY my_table)").collect()
# Will use the latest version of the table for all operations below
data = spark.table("my_table@v%s" % version[0][0])
data.where("event_type = 'e1'").write.jdbc("table1")
data.where("event_type = 'e2'").write.jdbc("table2")
...
data.where("event_type = 'e10'").write.jdbc("table10")
Queries for Time Series Analytics Made Simple
Time travel also makes time series analytics simpler. Here’s a very basic query you might use to find out how many new clients you brought on in the past week:
SELECT count(distinct userId) - (
SELECT count(distinct userId)
FROM my_table TIMESTAMP AS OF date_sub(current_date(), 7))
FROM my_table
Single Table Time Travel vs. Data Version Control with lakeFS
How does the single table time travel capability differ between Databricks and lakeFS?
Delta tables let you travel back in time on a specific table, whereas lakeFS enables you to manage your data like code. A lakeFS repository can hold many Delta tables, even hundreds of thousands, and supports time travel across all of them simultaneously: by going back to a particular point, you see a snapshot of every table in the repository at that moment.
Moreover, changes made to your data can be committed, giving you always-accessible snapshots. To work in an isolated data environment, you can create a branch in a repository and later merge your data modifications into your primary production branch.
It’s also important to remember that Delta table compaction comes at the cost of losing history: compaction reduces the amount of history that can be accessed through Delta time travel. With data version control systems, historical data is never lost until absolutely necessary. When you compact Delta tables managed with lakeFS, the history of those tables is preserved in all prior commits; only the history within the specific commit you made is lost.
Although Delta tables provide strong per-table time travel capabilities, lakeFS offers a more complete and adaptable data management solution. Its capabilities for isolated environments, Write-Audit-Publish, debugging, and ML reproducibility, together with its format-agnostic nature, make it an indispensable tool for modern data science and data engineering operations.
lakeFS facilitates safe data promotion, isolated environments, and repeatable operations, expediting the creation and deployment of data/AI solutions. This promotes secure and effective collaboration while guaranteeing high standards of data integrity and quality.
Conclusion
In Delta, time travel significantly increases developer productivity. It helps data scientists carry out their research more effectively, lets data engineers fix bad writes and streamline their pipelines, and makes reporting simpler for data analysts.
Ultimately, time travel capabilities help teams standardize on a pristine, centralized, versioned big data repository in their cloud storage. And you can easily expand the single table time travel offered by Databricks with more data version control capabilities from lakeFS!
Check out lakeFS for Databricks to increase your data engineering velocity and reap benefits from full-scale version control for data.


