What is Data Lifecycle Management
Datasets are the foundational output of a data team. They do not appear out of thin air. No one has ever snapped their fingers and created an orders_history table.
Instead, useful sets of data are created and maintained through a process that involves several predictable steps. Managing this process is often referred to as data lifecycle management.
Put another way, data lifecycle management encapsulates two things:
- What is required to publish new data?
- What is required to ensure published data is useful?
Necessity Born Out of Complexity
No single approach to managing a dataset’s lifecycle fits every team. Instead, as the complexity of the data environment increases¹, it becomes more critical to add functionality that provides additional guarantees.
In other words, when things are simpler, you can get away with a more laissez-faire approach to lifecycle management.
With this in mind, we’ll start by discussing each step of the data lifecycle from a simple perspective and work our way up to what can be added to deal with complexity.
¹ As measured by volume of data, number of jobs, number of people collaborating, etc.
The Data Lifecycle Management Steps
Instead of taking a “project management-y” perspective of creating entirely new datasets, I prefer to think in terms of the journey new records of data take to become part of a production dataset.
We break it down like so:
- Data Ingestion: Bringing raw data into the data environment
- Data Transformation: Logic applied to landed data to produce clean datasets
- Testing & Deployment: Quality/validation tests applied during data publication
- Monitoring & Debugging: Tracking data health and finding the cause of errors
The processes depicted in the above diagram aren’t created and run once, producing datasets that exist statically thereafter. Instead, they form a constantly running and evolving system that cycles every day, hour, or even second.
Let’s discuss in more detail!
Data Ingestion
New data needs to come from somewhere, and data ingestion is the process of bringing new data into the analytics environment.
The most basic strategy is to land raw data in a warehouse or object storage, often partitioned by date. Once the raw data is stored, it becomes available to downstream transformation processes that turn it into consumable datasets.
In simpler environments where some data downtime is tolerated, this strategy suffices since the time spent recovering from bad data that makes its way downstream is not prohibitive. For example, with a small enough dataset you can drop and recreate an entire production table in a blink.
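To make the basic strategy concrete, here is a minimal sketch of landing a raw batch under a date partition. It uses a local temporary directory as a stand-in for a warehouse or object store; the `dt=YYYY-MM-DD` layout, the file name, and the `land_raw_records` helper are all assumptions for illustration.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def land_raw_records(root: Path, dataset: str, records: list[dict], day: date) -> Path:
    """Write records unchanged under <root>/<dataset>/dt=<day>/ (assumed layout)."""
    partition = root / dataset / f"dt={day.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / "part-0000.jsonl"
    # Opening with "w" overwrites the file, so re-landing the same day
    # replaces the partition contents instead of appending duplicates.
    with target.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    return target


landing_root = Path(tempfile.mkdtemp())
written = land_raw_records(
    landing_root,
    "orders",
    [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 35.5}],
    date(2024, 1, 15),
)
print(written.relative_to(landing_root))
```

Because each day lives in its own partition, recovering from a bad batch can be as simple as re-landing that one day.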
As that “scorched table” approach becomes less feasible, the ingestion lifecycle can evolve to include validation tests executed over ingested data.
Managing the dependency between ingesting, testing, and transforming tasks is often the responsibility of an orchestration tool like Airflow. If validation tests fail, downstream transformations won’t execute, preventing low-quality data from polluting production.
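The gating behavior can be sketched without any orchestrator at all. The snippet below is plain Python, not Airflow code; the `validate` rules and the `run_pipeline` helper are hypothetical and only illustrate that the transform step never runs when validation fails.

```python
def validate(records):
    """Return human-readable failures; an empty list means the batch passed."""
    failures = []
    if not records:
        failures.append("batch is empty")
    for i, rec in enumerate(records):
        if rec.get("order_id") is None:
            failures.append(f"record {i}: missing order_id")
        amount = rec.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            failures.append(f"record {i}: amount must be a non-negative number")
    return failures


def run_pipeline(ingest, transform):
    """Run ingest -> validate -> transform, stopping if validation fails."""
    records = ingest()
    failures = validate(records)
    if failures:
        # The transform task is skipped, so the bad batch never reaches production.
        return {"status": "failed_validation", "failures": failures, "output": None}
    return {"status": "ok", "failures": [], "output": transform(records)}


def total_amount(records):
    return sum(r["amount"] for r in records)


good = run_pipeline(lambda: [{"order_id": 1, "amount": 10.0}], total_amount)
bad = run_pipeline(lambda: [{"order_id": None, "amount": -5}], total_amount)
print(good["status"], bad["status"])
```

In a real orchestrator the same idea is expressed as task dependencies: the transform task depends on the validation task succeeding.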
As the number of datasets and tests increases, even this strategy will reach the limit of its manageability. If you find yourself in this situation, we strongly recommend taking a different approach to data ingestion, one with stronger isolation guarantees.
As shown above, the most resilient approach lands new data onto an isolated ingest branch sourced from the raw data. Validation tests are executed against the isolated ingest branch, and only upon passing does new data become part of the raw dataset.
Critically, downstream jobs can now read from the raw data without first checking the state of a testing dependency. This eases the burden on the orchestration tool and reduces the brittleness of the overall system.
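The following toy model sketches the isolation idea. It is an in-memory stand-in and deliberately not the lakeFS API: `ToyRepo`, its methods, and the branch names are hypothetical. The point it illustrates is that `main` only changes when every check passes on the ingest branch.

```python
class ToyRepo:
    """In-memory stand-in for a versioned data store (not the lakeFS API)."""

    def __init__(self):
        self.branches = {"main": []}

    def create_branch(self, name, source="main"):
        # A branch starts as a copy of its source, so later writes stay isolated.
        self.branches[name] = list(self.branches[source])

    def write(self, branch, records):
        self.branches[branch].extend(records)

    def merge(self, source, dest, checks):
        """Promote source into dest only if every check passes on source."""
        failed = [c.__name__ for c in checks if not c(self.branches[source])]
        if failed:
            return failed  # dest is left untouched
        self.branches[dest] = list(self.branches[source])
        return []


def no_null_ids(records):
    return all(r.get("order_id") is not None for r in records)


repo = ToyRepo()
repo.write("main", [{"order_id": 1}])

# A bad batch lands on its own branch and is rejected at merge time.
repo.create_branch("ingest-bad")
repo.write("ingest-bad", [{"order_id": None}])
rejected = repo.merge("ingest-bad", "main", [no_null_ids])

# A good batch passes the same check and becomes part of main.
repo.create_branch("ingest-good")
repo.write("ingest-good", [{"order_id": 2}])
accepted = repo.merge("ingest-good", "main", [no_null_ids])
print(rejected, accepted, repo.branches["main"])
```

Readers of `main` never need to ask whether tests have finished, because untested data is never there in the first place.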
Data Transformation
Raw ingested data is almost never in the ideal final form for later analysis. Instead, transformations are required that do some or all of the following:
- Standardize fields
- Resolve duplicates
- Calculate time-based aggregations
- Apply business logic
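Here is a small sketch that touches each of the four transformation types on a made-up orders batch. The field names, the sample rows, and the "large order" rule are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime

raw_orders = [
    {"Order_ID": "1", "amount": "20.00", "ts": "2024-01-15T09:30:00", "country": "us"},
    {"Order_ID": "1", "amount": "20.00", "ts": "2024-01-15T09:30:00", "country": "us"},
    {"Order_ID": "2", "amount": "150.00", "ts": "2024-01-15T17:05:00", "country": "DE"},
    {"Order_ID": "3", "amount": "80.00", "ts": "2024-01-16T11:00:00", "country": "de"},
]


def standardize(row):
    """Standardize fields: consistent names, types, and casing."""
    return {
        "order_id": int(row["Order_ID"]),
        "amount": float(row["amount"]),
        "ordered_at": datetime.fromisoformat(row["ts"]),
        "country": row["country"].upper(),
    }


def deduplicate(rows):
    """Resolve duplicates: keep the first row seen for each order_id."""
    seen = {}
    for row in rows:
        seen.setdefault(row["order_id"], row)
    return list(seen.values())


def apply_business_logic(row):
    """Hypothetical rule: orders of 100 or more are flagged as large."""
    return {**row, "is_large": row["amount"] >= 100}


def daily_revenue(rows):
    """Time-based aggregation: total amount per calendar day."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["ordered_at"].date().isoformat()] += row["amount"]
    return dict(totals)


standardized = [standardize(r) for r in raw_orders]
clean = [apply_business_logic(r) for r in deduplicate(standardized)]
revenue = daily_revenue(clean)
print(revenue)
```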
In terms of managing data’s lifecycle, an important idea in data transformations is traceability. Given a row of transformed data, how easy is it to figure out:
- Which raw dataset it came from and what that dataset’s state was at the time of execution?
- Which process transformed it and what logs did it generate?
If answering the above questions is difficult, you are likely to spend far too much time figuring out what is happening. Figuratively speaking, your data’s lifecycle will grind to a halt.
From my experience in simpler environments, the answers to these questions are simply remembered by the people working in them. As a fallback, the UIs provided by tools like Airflow or dbt help to visualize job and data dependencies.
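When memory and UIs stop being enough, one option is to have each job stamp its output with lineage metadata. The sketch below is one possible shape for that, not a standard: the `_lineage` field, the `fingerprint` helper, and the job and source names are all assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(rows):
    """Short content hash that identifies the exact state of the input."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def transform_with_lineage(rows, job_name, source_name, run_id):
    """Convert amounts to cents and attach where each row came from."""
    lineage = {
        "source": source_name,
        "source_fingerprint": fingerprint(rows),  # state of the raw data at run time
        "job": job_name,
        "run_id": run_id,  # key for looking up this run's logs
        "executed_at": datetime.now(timezone.utc).isoformat(),
    }
    return [
        {**row, "amount_cents": round(row["amount"] * 100), "_lineage": lineage}
        for row in rows
    ]


raw = [{"order_id": 1, "amount": 19.99}]
out = transform_with_lineage(raw, "orders_to_cents", "raw.orders", "run-001")
print(out[0]["_lineage"]["source"], out[0]["_lineage"]["run_id"])
```

Given any transformed row, both questions above can then be answered from the row itself.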
Testing & Deployment
A smart data lifecycle will include some sort of process for deployment… other than thoughtlessly adding data to production and hoping for the best.
Continuous deployment, the tried-and-true strategy from application development, also works for data. The recommended approach is to implement a CI/CD-type workflow with a combination of unit, data quality, and integration tests as part of the deployment process.
At lakeFS we strongly believe in enabling this type of workflow through the use of merging operations and pre-merge hooks. If you do not test your transformed data before exposing it to data consumers, you’re leaving the fate of your data quality to chance.
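As a rough sketch of what such a gate might check, the snippet below runs one test of each kind (unit, data quality, integration) and only allows publication when all pass. It does not use any lakeFS hook API; the function names, the report keys, and the sample data are hypothetical.

```python
def to_cents(amount):
    """Transformation logic under test."""
    return round(amount * 100)


def unit_test_to_cents():
    # Unit test: known input, known output, no real data involved.
    return to_cents(19.99) == 1999 and to_cents(0) == 0


def quality_test_unique_ids(staged):
    # Data quality test: the primary key must be unique in the staged output.
    ids = [r["order_id"] for r in staged]
    return len(ids) == len(set(ids))


def integration_test_row_count(source, staged):
    # Integration test: one staged row per distinct source order.
    return len({r["order_id"] for r in source}) == len(staged)


def pre_publish_report(source, staged):
    return {
        "unit:to_cents": unit_test_to_cents(),
        "quality:unique_ids": quality_test_unique_ids(staged),
        "integration:row_count": integration_test_row_count(source, staged),
    }


def can_publish(report):
    return all(report.values())


source = [{"order_id": 1}, {"order_id": 1}, {"order_id": 2}]
good_report = pre_publish_report(source, [{"order_id": 1}, {"order_id": 2}])
bad_report = pre_publish_report(source, [{"order_id": 1}, {"order_id": 1}, {"order_id": 2}])
print(can_publish(good_report), can_publish(bad_report))
```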
Monitoring & Debugging
At the risk of sounding like a data hypochondriac, the health of the data environment and the datasets it produces needs constant monitoring.
A goal to aim for is to proactively catch issues in the data before external users of the data do. It is more reassuring to see a note in a dashboard saying “we’ve identified an error in this data and are working on a fix” than to have the CEO of your company find it herself.
In order for this to happen, failures in job execution need to be captured and sent as high- or low-level alerts. Tests that check for unexpected changes in distribution (like the percent of Android users in the population) should run periodically over critical datasets.
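A distribution check like the Android example can be as simple as comparing today's share against a baseline. In the sketch below, the baseline of 45%, the tolerance of 10 percentage points, and the sample data are made-up values for illustration.

```python
def share(rows, field, value):
    """Fraction of rows where field equals value."""
    if not rows:
        raise ValueError("cannot compute a share over an empty dataset")
    return sum(1 for r in rows if r.get(field) == value) / len(rows)


def check_share_drift(rows, field, value, baseline, tolerance):
    """Flag an alert when the observed share moves too far from the baseline."""
    observed = share(rows, field, value)
    return {
        "observed": observed,
        "baseline": baseline,
        "alert": abs(observed - baseline) > tolerance,
    }


# 48 Android users out of 100: close to the assumed 45% baseline.
normal_day = [{"platform": "android"}] * 48 + [{"platform": "ios"}] * 52
# 20 Android users out of 100: perhaps an upstream tracking bug.
odd_day = [{"platform": "android"}] * 20 + [{"platform": "ios"}] * 80

normal = check_share_drift(normal_day, "platform", "android", baseline=0.45, tolerance=0.10)
odd = check_share_drift(odd_day, "platform", "android", baseline=0.45, tolerance=0.10)
print(normal["alert"], odd["alert"])
```

Run on a schedule over critical datasets, a check like this catches silent upstream breakage that job-failure alerts alone would miss.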
These strategies help to catch the errors; the next step is to fix them. In smaller environments, you can get away with manually running operations like deleting a partition in an object store or running a “delete from” query against the warehouse.
Although they get the job done, both are prone to human error and risk compounding an issue further. As complexity increases, the ability to revert data in the simplest way possible becomes a necessity instead of a nice-to-have.
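To show why a first-class revert is less error-prone than a hand-written delete, here is a toy versioned table. It is an illustration only, not how lakeFS or any warehouse implements versioning; `VersionedTable` and its methods are hypothetical.

```python
class VersionedTable:
    """Toy table that keeps a full snapshot per commit (illustration only)."""

    def __init__(self):
        self._commits = [[]]  # commit 0 is the empty table

    @property
    def head(self):
        return len(self._commits) - 1

    def rows(self):
        return list(self._commits[-1])

    def commit(self, new_rows):
        """Append rows as a new commit and return its id."""
        self._commits.append(self._commits[-1] + list(new_rows))
        return self.head

    def revert_to(self, commit_id):
        """Restore an earlier state as a new commit, keeping history intact."""
        self._commits.append(list(self._commits[commit_id]))
        return self.head


table = VersionedTable()
good_commit = table.commit([{"order_id": 1, "amount": 20.0}])
table.commit([{"order_id": 2, "amount": -999.0}])  # bad data reaches production

# One operation restores the last known-good state; no hand-written deletes.
table.revert_to(good_commit)
print(table.rows())
```

The revert names a known-good state instead of describing which rows to remove, so there is no predicate to get wrong.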
A data environment that either doesn’t catch errors or makes reverting them difficult risks breaking the data lifecycle, preventing new development from even starting.
lakeFS as the Data Lifecycle Management Tool
So far we’ve discussed the data lifecycle at a mostly conceptual level, without getting into specific implementation details. You should, however, start to have an understanding of how lakeFS enables best practices in data lifecycle management at each step, particularly as complexity in the environment increases.
This is realized through enabling:
- Isolated ingestion workflows
- Automated testing-based deployment
- Easy data reproducibility and revert operations
For this reason, we like to think of lakeFS as the tool for adopting best practices in data lifecycle management.