What is Data Lifecycle Management
Datasets are the foundational output of a data team. They do not appear out of thin air. No one has ever snapped their fingers and created an orders_history table.
Instead, useful sets of data are created and maintained through a process that involves several predictable steps. Managing this process is often referred to as data lifecycle management.
Put another way, data lifecycle management encapsulates two things:
- What is required to publish new data?
- What is required to ensure published data is useful?
Necessity Borne Out of Complexity
No single approach to managing a dataset’s lifecycle fits every team. As the complexity of the data environment increases¹, adding functionality that provides additional guarantees becomes more critical.
In other words, when things are simpler, you can get away with a more laissez-faire approach to lifecycle management.
With this in mind, we’ll start by discussing each step of the data lifecycle from a simple perspective and work our way up to what can be added to deal with complexity.
¹As measured by the volume of data, the number of jobs, the number of people collaborating, etc.
The Different Phases of Data Lifecycle Management
Instead of taking a “project management-y” perspective of creating entirely new datasets, I prefer to think in terms of the journey new records of data take to become part of a production dataset.
We break it down like so:
- Data Ingestion — Bringing raw data into the data environment
- Data Transformation — logic applied to landed data to produce clean datasets
- Testing & Deployment — Quality/validation tests applied during data publication
- Monitoring & Debugging — Tracking data health and finding cause of errors
The processes depicted in the above diagram aren’t created and run once, producing datasets that exist statically thereafter. Instead it’s a constantly running and evolving system that cycles every day, hour, or even second.
Let’s discuss in more detail!
Data Ingestion
New data needs to come from somewhere, and data ingestion is the process of bringing new data into the analytics environment.
The most basic strategy is to land raw data in a warehouse or object storage, often partitioned by date. Once the raw data is stored, it becomes available to downstream transformational processes that turn it into consumable datasets.
In simpler environments where some data downtime is tolerated, this strategy suffices since the time spent recovering from bad data that makes its way downstream is not prohibitive. For example, with a small enough dataset you can drop and recreate an entire production table in a blink.
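As a minimal sketch of date-partitioned landing, the snippet below writes a batch of raw records under a `dt=YYYY-MM-DD` prefix on the local filesystem. The `raw/<dataset>` layout, the `land_raw_batch` helper, and the JSON Lines format are illustrative assumptions; the same key layout would carry over to object storage.

```python
import json
from datetime import date
from pathlib import Path


def land_raw_batch(root: Path, dataset: str, records: list[dict], day: date) -> Path:
    """Write one batch of raw records under a date partition and return the file path."""
    partition = root / "raw" / dataset / f"dt={day.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    # Number files within the partition so repeated loads never overwrite each other.
    existing = len(list(partition.glob("part-*.jsonl")))
    target = partition / f"part-{existing:05d}.jsonl"
    with target.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    return target
```

With this layout, dropping and recreating a single day amounts to deleting one `dt=` directory and re-running the load for that date.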
As that “scorched table” approach becomes less feasible, the ingestion lifecycle can evolve to include validation tests executed over ingested data.
Managing the dependency between ingesting, testing, and transforming tasks is often the responsibility of an orchestration tool like Airflow. If validation tests fail, downstream transformations won’t execute, preventing low-quality data from polluting production.
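The gating behavior can be sketched without any particular orchestrator. The following is plain Python rather than Airflow's API, and the task names (`ingest`, `validate`, `transform`) and the `run_pipeline` helper are hypothetical. Tasks run in dependency order, and a failed validation stops everything downstream.

```python
from typing import Callable


class ValidationError(Exception):
    """Raised by a validation task when ingested data fails a check."""


def run_pipeline(tasks: list[tuple[str, Callable[[], None]]]) -> list[str]:
    """Run tasks in dependency order and stop at the first validation failure.

    Returns the names of the tasks that completed successfully.
    """
    completed: list[str] = []
    for name, task in tasks:
        try:
            task()
        except ValidationError as exc:
            print(f"{name} failed: {exc}; skipping downstream tasks")
            break
        completed.append(name)
    return completed


# Hypothetical batch with a missing order_id that validation should reject.
batch = [{"order_id": 1, "amount": 10.0}, {"order_id": None, "amount": 5.0}]
production: list[dict] = []


def ingest() -> None:
    pass  # the batch is already in memory for this sketch


def validate() -> None:
    if any(row["order_id"] is None for row in batch):
        raise ValidationError("order_id must not be null")


def transform() -> None:
    production.extend(batch)


completed = run_pipeline(
    [("ingest", ingest), ("validate", validate), ("transform", transform)]
)
```

Here `transform` never runs, so `production` stays empty and the bad batch does not reach consumers.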
As the number of datasets and tests increases, even this strategy will reach the limit of its manageability. If you find yourself in this situation, we strongly recommend taking a different approach to data ingestion, one with stronger isolation guarantees.
As shown above, the most resilient approach lands new data onto an isolated ingest branch sourced from the raw data. Validation tests are executed against the isolated ingest branch, and only once they pass does the new data become part of the raw dataset.
Critically, downstream jobs can now read from the raw data without first checking the state of a testing dependency. This eases the burden on the orchestration tool and reduces the brittleness of the overall system.
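To make the branch-then-merge flow concrete, here is a toy in-memory model. It is not the lakeFS API: `ToyRepo`, its methods, and the branch names are hypothetical stand-ins for create-branch, write, and merge operations.

```python
from typing import Callable


class ToyRepo:
    """Toy model of a repository whose branches map object paths to rows."""

    def __init__(self) -> None:
        self.branches: dict[str, dict[str, list[dict]]] = {"raw": {}}

    def create_branch(self, name: str, source: str) -> None:
        # The new branch starts as a snapshot of the source branch.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch: str, path: str, rows: list[dict]) -> None:
        self.branches[branch][path] = rows

    def merge(self, source: str, dest: str) -> None:
        self.branches[dest].update(self.branches[source])


def ingest_with_isolation(
    repo: ToyRepo,
    path: str,
    rows: list[dict],
    check: Callable[[list[dict]], bool],
) -> bool:
    """Land rows on an isolated branch; merge into 'raw' only if the check passes."""
    repo.create_branch("ingest", source="raw")
    repo.write("ingest", path, rows)
    passed = check(repo.branches["ingest"][path])
    if passed:
        repo.merge("ingest", "raw")
    del repo.branches["ingest"]  # discard the ingest branch either way
    return passed


def no_null_ids(rows: list[dict]) -> bool:
    return all(row.get("id") is not None for row in rows)
```

Because failed batches never reach `raw`, a downstream reader of `raw` needs no knowledge of whether validation is in progress.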
Data Transformation
Raw data is almost never ingested in the ideal final form for later analysis. Instead, the data requires transformations that do some or all of the following:
- Standardize fields
- Resolve duplicates
- Calculate time-based aggregations
- Apply business logic
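The four kinds of transformation listed above might look like this for a small batch of orders. The field names, the "last record wins" deduplication rule, and the `min_amount` business rule are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def standardize(row: dict) -> dict:
    """Normalize field formats: typed numbers, trimmed lowercase email, parsed timestamp."""
    return {
        "order_id": int(row["order_id"]),
        "email": row["email"].strip().lower(),
        "amount": float(row["amount"]),
        "ordered_at": datetime.fromisoformat(row["ordered_at"]),
    }


def deduplicate(rows: list[dict]) -> list[dict]:
    """Keep the last record seen for each order_id."""
    latest = {row["order_id"]: row for row in rows}
    return list(latest.values())


def daily_revenue(rows: list[dict], min_amount: float = 0.0) -> dict[str, float]:
    """Aggregate revenue per day, applying a business rule that ignores
    orders at or below min_amount (for example, zero-value test orders)."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        if row["amount"] > min_amount:
            totals[row["ordered_at"].date().isoformat()] += row["amount"]
    return dict(totals)


raw = [
    {"order_id": "1", "email": " A@Example.com ", "amount": "20.0", "ordered_at": "2022-06-27T09:00:00"},
    {"order_id": "1", "email": "a@example.com", "amount": "25.0", "ordered_at": "2022-06-27T09:05:00"},
    {"order_id": "2", "email": "b@example.com", "amount": "0", "ordered_at": "2022-06-27T10:00:00"},
    {"order_id": "3", "email": "c@example.com", "amount": "5.5", "ordered_at": "2022-06-28T11:00:00"},
]
clean = deduplicate([standardize(r) for r in raw])
revenue = daily_revenue(clean)
```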
In terms of managing data’s lifecycle, an important idea in data transformations is traceability. Given a row of transformed data, how easy is it to figure out:
- Which raw dataset it came from and what that dataset’s state was at the time of execution?
- Which process transformed it and what logs did it generate?
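One lightweight way to keep both questions answerable is to stamp every output row with provenance columns. This is a sketch under assumed names: the `Lineage` fields, the underscore-prefixed column convention, and the identifier values are hypothetical placeholders.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Lineage:
    """Provenance recorded once per job run and stamped on every output row."""

    source_dataset: str  # which raw dataset was read
    source_version: str  # its state at execution time, e.g. a commit or snapshot ID
    job_name: str  # which process produced the row
    run_id: str  # key for looking up that run's logs


def with_lineage(rows: list[dict], lineage: Lineage) -> list[dict]:
    """Return copies of rows with lineage columns attached."""
    meta = {f"_{key}": value for key, value in asdict(lineage).items()}
    return [{**row, **meta} for row in rows]


lineage = Lineage(
    source_dataset="raw/orders",
    source_version="snapshot-0001",  # placeholder identifier
    job_name="build_orders_history",
    run_id="run-2022-06-27-01",  # placeholder identifier
)
traced = with_lineage([{"order_id": 1, "amount": 25.0}], lineage)
```

Given any row in `traced`, the source version and the run whose logs to inspect can be read directly from the row itself.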
If answering the above questions is difficult, you will likely spend more time than you’d like figuring out what happened. Figuratively speaking, your data’s lifecycle will grind to a halt.
In my experience, in simpler environments the answers to these questions are simply remembered by the people working in them. As a fallback, the UIs provided by tools like Airflow or dbt help visualize job and data dependencies.
Testing & Deployment
A smart data lifecycle will include some sort of process for deployment… other than thoughtlessly adding data to production and hoping for the best.
Continuous deployment, a tried and true strategy from application development, also works for data. The recommended approach is to implement a CI/CD-type workflow with a combination of unit, data quality, and integration tests as part of the deployment process.
At lakeFS we strongly believe in enabling this type of workflow through the use of merging operations and pre-merge hooks. If you do not test your transformed data before exposing it to data consumers, you’re leaving the fate of your data quality to chance.
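As a rough sketch of the data quality portion of such a workflow, the checks below run against a candidate dataset and report which ones failed. In lakeFS, checks like these would be triggered by a pre-merge hook; the hook wiring is not shown here, and the check names and columns are hypothetical.

```python
from typing import Callable

Check = Callable[[list[dict]], bool]


def has_required_columns(rows: list[dict]) -> bool:
    required = {"order_id", "amount"}
    return all(required <= row.keys() for row in rows)


def order_ids_unique(rows: list[dict]) -> bool:
    ids = [row.get("order_id") for row in rows]
    return len(ids) == len(set(ids))


def amounts_non_negative(rows: list[dict]) -> bool:
    return all(
        isinstance(row.get("amount"), (int, float)) and row["amount"] >= 0
        for row in rows
    )


def run_checks(rows: list[dict], checks: list[Check]) -> list[str]:
    """Return the names of the failed checks; an empty list means safe to publish."""
    return [check.__name__ for check in checks if not check(rows)]


CHECKS: list[Check] = [has_required_columns, order_ids_unique, amounts_non_negative]

# Candidate data with a duplicate order_id and a negative amount.
candidate = [{"order_id": 1, "amount": 25.0}, {"order_id": 1, "amount": -3.0}]
failures = run_checks(candidate, CHECKS)
```

A deployment step would publish the candidate only when `failures` is empty, and otherwise surface the failed check names to whoever owns the dataset.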
Monitoring & Debugging
At the risk of sounding like a data hypochondriac, the health of the data environment and the datasets it produces needs constant monitoring.
A goal to aim for is to proactively catch issues in the data before external users of the data do. It is more reassuring to see a note in a dashboard saying “we’ve identified an error in this data and are working on a fix” than having the CEO of your company find it herself.
For this to happen, failures in job execution need to be captured and sent as high- or low-level alerts. Tests that check for unexpected changes in distribution (like the percentage of Android users in the population) should run periodically over critical datasets.
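A distribution check of this kind can be quite small. The sketch below compares an observed share against a baseline and maps the difference to an alert level; the `warn` and `critical` thresholds and the baseline value are arbitrary assumptions that would need tuning per dataset.

```python
from typing import Optional


def share(rows: list[dict], column: str, value: str) -> float:
    """Fraction of rows where column equals value (0.0 for an empty dataset)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) == value) / len(rows)


def drift_alert(
    observed: float,
    baseline: float,
    warn: float = 0.05,
    critical: float = 0.15,
) -> Optional[str]:
    """Classify the absolute change from baseline as None, 'low', or 'high'."""
    delta = abs(observed - baseline)
    if delta >= critical:
        return "high"
    if delta >= warn:
        return "low"
    return None


# Hypothetical snapshot: 2 of 10 users on Android, against a 45% baseline.
users = [{"platform": "android"}] * 2 + [{"platform": "ios"}] * 8
android_share = share(users, "platform", "android")
level = drift_alert(android_share, baseline=0.45)
```

The returned level can then decide where the alert goes, for example a dashboard note for `"low"` and a page to the on-call engineer for `"high"`.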
These strategies help to catch the errors; the next step is to fix them. In smaller environments, you can get away with manually running operations like deleting a partition in an object store or running a “delete from” query against the warehouse.
Although they get the job done, both are prone to human error and risk compounding an issue further. As complexity increases, the ability to revert data in the simplest way possible becomes a necessity instead of a nice-to-have.
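To illustrate why a built-in revert is less error-prone than hand-written deletes, here is a toy table that keeps every committed snapshot. `VersionedTable` is hypothetical and simplified: it drops the latest snapshot, whereas a real system would more likely record the revert as a new commit.

```python
class VersionedTable:
    """Toy table that keeps every committed snapshot so a bad load can be undone."""

    def __init__(self) -> None:
        self._commits: list[tuple[dict, ...]] = [()]  # start from an empty snapshot

    @property
    def rows(self) -> tuple[dict, ...]:
        return self._commits[-1]

    def commit(self, new_rows: list[dict]) -> int:
        """Append rows as a new snapshot and return its commit number."""
        self._commits.append(self.rows + tuple(new_rows))
        return len(self._commits) - 1

    def revert_last(self) -> None:
        """Drop the most recent snapshot, restoring the previous state exactly."""
        if len(self._commits) > 1:
            self._commits.pop()


table = VersionedTable()
table.commit([{"order_id": 1}])
table.commit([{"order_id": None}])  # a bad load reaches the table
table.revert_last()  # one operation, no hand-written filter to get wrong
```

Unlike a manual "delete from" query, the revert needs no predicate describing which rows are bad, so it cannot accidentally remove good data.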
A data environment that either doesn’t catch errors or makes reverting them difficult risks breaking the data lifecycle, preventing new development from even starting.
What are the Benefits of Good Data Lifecycle Management?
The goal of Data Lifecycle Management is to improve the practice of data practitioners by structuring how they think about managing data. It is common to manage data flowing from many input sources, all of which combine and transform to create valuable data assets used in reporting, machine learning, and operational functions.
Given this complexity, it is imperative to use consistent processes to manage the lifecycle of these data assets, from inception to consumption. This takes the form of centralized technologies and processes for ingesting, transforming, and monitoring data quality.
Ultimately, this lets a data team maximize its potential and impact. A data team following effective Data Lifecycle Management practices can manage more use cases of its data without getting bogged down investigating data quality or other issues.
Data Lifecycle Management Vs Application Lifecycle Management
Many concepts in the data world borrow from popular concepts in software engineering. Data Lifecycle Management is no exception, taking its name from the influential Application Lifecycle Management (ALM) concept.
Software applications go through the same process of development, deployment, and eventually retirement. As a result, there is a great deal of overlap between the fundamental concepts in data and application lifecycles.
Where they differ is where things get most interesting. One main difference stems from the fact that applications are relatively small. Most apps on your phone are no more than a few hundred MB.
Data assets, meanwhile, can grow to hundreds of terabytes. This makes it significantly more complicated to create copies of these data assets for testing before deployment. As a result, it is imperative to be able to clone large data objects by creating a new reference to them without duplicating the underlying objects in storage.
This provides a useful way to test even the largest data assets during deployment and lets Data Lifecycle Management more closely reflect the principles of Application Lifecycle Management.
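One common way to get a clone without copying data is to keep branches as small maps of pointers into shared, content-addressed storage. The toy below illustrates that idea only; `ObjectStore` and `Catalog` are hypothetical and are not a description of lakeFS internals.

```python
import hashlib


class ObjectStore:
    """Content-addressed storage: identical bytes are stored once."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.objects[key] = data
        return key


class Catalog:
    """Branches are small maps from logical path to object key."""

    def __init__(self, store: ObjectStore) -> None:
        self.store = store
        self.branches: dict[str, dict[str, str]] = {"main": {}}

    def write(self, branch: str, path: str, data: bytes) -> None:
        self.branches[branch][path] = self.store.put(data)

    def clone(self, source: str, name: str) -> None:
        # Copies only the path-to-key map; no object bytes are duplicated.
        self.branches[name] = dict(self.branches[source])

    def read(self, branch: str, path: str) -> bytes:
        return self.store.objects[self.branches[branch][path]]


store = ObjectStore()
catalog = Catalog(store)
catalog.write("main", "orders/part-0", b"large file contents")
catalog.clone("main", "test")
catalog.write("test", "orders/part-1", b"new data under test")
```

After the clone and the extra write, the store still holds only two objects, and `main` is unaffected by the write to `test`. The cost of a clone scales with the number of paths rather than the number of bytes.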
lakeFS as the Data Lifecycle Management Tool
So far we’ve discussed the data lifecycle at a mostly conceptual level, without getting into specific implementation details. You should, however, start to have an understanding of how lakeFS enables best practices in data lifecycle management at each step, particularly as complexity in the environment increases.
This is realized through enabling:
- Isolated ingestion workflows
- Automated testing-based deployment
- Easy data reproducibility and revert operations
For this reason, we like to think of lakeFS as the tool for adopting best practices in data lifecycle management.
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.