Based on my presentation at PyData Global 2025
When we – engineers – hear the word “compliance,” we tend to roll our eyes. We want to build features, not fill out forms. But here’s good news: the exact same tools that help you debug your code can also keep you out of trouble.
In this article, I’m going to show you how to apply the version control principles you already use for your code to your data, and how to build a data pipeline where compliance happens automatically.
Why should we care about compliance?
Imagine this scenario:
It’s November 3. Your team deployed the Alpha 1 credit risk model. It’s a masterpiece: in A/B testing, it increased revenue by 15%. Everyone is happy, you get a bonus, and champagne is popped.

But three months later, you get a subpoena. Regulators believe Alpha 1 is systematically discriminating against specific ZIP codes. They demand one thing: to show them the exact training data you used on November 3 to prove there was no bias.
You check your S3 bucket, but the November data is gone. It was overwritten by the ETL job that ran in January. You have the model. You have the model weights. You have the code. But you have lost your data history. You cannot prove you did nothing wrong.
This nightmare scenario is a real risk for ML teams today. We build amazing engines, but we don’t take care of the data that feeds them. I’m going to show you how to ensure this never happens to you.
A compliance strategy is a must-have
A few years ago, ML was the Wild West: we grabbed data, trained the model, and shipped it. That time is over.
If you’re doing business in Europe, you have the EU AI Act. It demands rigorous record-keeping for high-risk models. You have to prove exactly what data went into the system.
If you deal with user data, GDPR says users have a right to transparency in automated decisions. This means you need to be able to explain why the model rejected their loan application. To do that, you need to be able to reproduce the exact state of the world when the decision was made.
In healthcare, if you can’t prove who accessed a patient’s data and that it wasn’t altered, you’re in big trouble.
Failing to comply means big fines. For example, a GDPR fine for a violation involving PII (personally identifiable information) can reach 20 million euros or 4% of a company’s global annual revenue, whichever is higher.
But that isn’t actually the biggest problem for most engineering teams. The biggest problem is losing business.
If you’re selling B2B software today, your customers will demand certifications like SOC2 or ISO 42001. They will ask: Can you trace every decision back to the source data?
If your competitor can say, “Yes, here is the automated log,” and you say, “Let me ask the data guy,” you risk losing the deal.
Compliance isn’t just about avoiding jail. It’s a competitive advantage.
To see why compliance matters in the real world, let’s look at three scenarios involving a data engineer named Alice.

3 compliance issues data engineers face
1. PII leakage
Alice is smart and uses modern tools. But she’s about to have a very bad week. It starts on a random Tuesday with a PII leak: Alice ships a new dataset to improve accuracy. It works. She ships the model.
But two weeks later, someone realizes the dataset contained PII – maybe Social Security numbers – that she wasn’t allowed to use. Because she can’t untangle exactly which model version used that specific data, she has to scrap the entire project.
2. Reproducibility trap
Next, Alice faces the reproducibility trap. An auditor comes in and asks about a model prediction from six months ago. They demand: “Run the training code again and prove you get the same result.”
Alice runs the code, but the result is different because the database has changed since then. Records were updated. Users were deleted. The data is mutable, and the history is gone. She cannot prove how the original model actually worked.
3. Traceability gap
Finally, there’s the traceability gap. Legal sends an urgent email: we are being sued for copyright infringement. Did we use the “Book Three” dataset in the model we shipped on Tuesday?
Alice looks at her S3 bucket. It contains folders named V1, V2, V2_final, and V2_final_real. There is no automated link between the model binary running in production and the specific folder of files used to train it. She literally cannot answer the question.
Data was the common failure point for all three scenarios

All three issues boil down to one failure point: the data.
The data was mutable. The data had no history. It became a black box.
We treat our data like a swamp – and then we wonder why we get bogged down trying to explain our models.
I’m going to show you how to apply the same version control principles you use for your code to your data. You’ll see a pipeline where compliance happens automatically.
We will fix Alice’s workflow so that she can block bad data before it enters the system, reproduce any model from any point in time, and generate an audit trail without writing a single spreadsheet. Let’s dive in.
Solution? Data version control
If the problem is that our data is mutable, messy, and unversioned, the solution is actually staring us in the face.
We solved this problem for software engineering 20 years ago. We need to treat our data exactly like we treat our code. We need to be able to version it, break it, and revert it.
If we can bring the discipline of software development to our data lake, compliance nightmares – reproducibility, traceability, governance – get solved.
Imagine an object store with Git semantics.
You could take your production data lake and create a branch: an isolated sandbox where you can experiment, delete files, or add PII without affecting the main production view.
After you train the model, you create a commit: an immutable snapshot of that exact state of the data, with a unique ID, a timestamp, and the author. And then you could merge data changes only after they pass validation tests.
This is the mental model we need – and this is exactly what we built lakeFS to do.
What is lakeFS?
lakeFS is a layer that sits on top of your object storage and transforms it into a Git-like repository. It gives you the same commands – branch, commit, merge, revert – but for your data.
Now, I know what the data engineers are thinking: I have 500 terabytes of training data. If I create a branch, do I have to copy all the data? That will be slow and expensive.
The answer is no.
lakeFS sits between your storage and your compute. It manages pointers to files, not the files themselves. When you create a branch, it’s a metadata operation. It takes milliseconds, and it costs nothing. We use copy-on-write, so you only store the changes. You can create a thousand branches of a petabyte-scale lake without multiplying your storage costs.
The final piece of the puzzle is usability. You don’t want to rewrite your entire ML stack just to get compliance. With lakeFS, you don’t have to. It is transparent to your compute.
Whether you’re using Spark, Pandas, Databricks, or standard Python scripts, nothing changes in your code logic. You simply change the path. Instead of reading from a bucket, you read from a repository and a branch. It looks like a file store, but it acts like a Git repo.
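To make that concrete, here’s a minimal sketch of what “just change the path” looks like in Pandas – the bucket, repository, and branch names are hypothetical, and the endpoint assumes a locally running lakeFS server:

import pandas as pd

# Before: reading straight from a mutable bucket path
df = pd.read_csv("s3://my-bucket/training/users.csv")

# After: the same read through lakeFS – the bucket becomes repository + branch,
# served through lakeFS's S3-compatible endpoint
df = pd.read_csv(
    "s3://my-repo/main/training/users.csv",
    storage_options={
        "key": "<lakefs-access-key>",
        "secret": "<lakefs-secret-key>",
        "client_kwargs": {"endpoint_url": "http://localhost:8000"},
    },
)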
So, we have the concept of Git for data. Now, let’s see how this actually fixes Alice’s bad week.
Typical ML workflow vs. compliant architecture
We’re going to build a workflow that actually solves Alice’s problems – not by asking her to work harder, but by giving her better tools. In this pipeline, compliance isn’t a checklist you panic about at the end of the quarter. Instead, we bake it directly into the engineering process.
I’m going to show you three specific things:
- How to use lakeFS with a gatekeeper to automatically block PII from ever touching your production data.
- How to ensure that every single experiment you run is attached to a traceable, immutable dataset.
- How to generate an audit trail that will satisfy even the strictest auditor.
By the end, you’ll see how the infrastructure can do the heavy lifting for you.
The data trap of a typical ML workflow

Refer to the diagram above:
On the left, you have your raw data store. You run your ingestion process to move the data into your centralized feature store. This is where you curate datasets, clean them, and prepare them for training.
From the feature store, your training jobs pick up the data, crunch the numbers, and produce a binary. This binary gets pushed to a model registry. Finally, you deploy your model for serving – and in many modern GenAI recommendation stacks, this also involves a vector DB to handle embeddings and retrieval.
It looks robust. You have dedicated tools for every step. But there is a hidden flaw:
The model registry versions your binaries, and the feature store versions your definitions. But the actual underlying data – the Parquet files, the images, the raw text – everything is sitting in mutable object storage.
What does this mean?
- If you update a feature set, the old data is often overwritten.
- If you reindex your vector DB, the previous state is lost.
- There is no single snapshot that binds all the data in the feature store to the model in the registry and the embeddings in the vector DB – the link is broken.
Here’s what a compliant architecture looks like

The upper part hasn’t changed: you’re still using your favorite feature store and model registry. You don’t need to update your stack. You’re not replacing Airflow or MLflow.
Instead, we introduce lakeFS as a unified control plane for all these components. It sits between your applications and your raw storage layer. When your feature jobs write data, they’re writing to a versioned branch in lakeFS. When your training job runs, it reads from an immutable commit hash – not a floating path.
Because lakeFS manages the underlying storage, you get total cross-component consistency and reproducibility without changing your application.
We’re going to treat our production data exactly like we treat the main branch of our code repository.
And just like software engineers, we never write data directly to main. The main branch is our system of record – our source of truth. When new data arrives from ingestion jobs, it doesn’t go to main but to a dedicated ingest branch. And when a data scientist wants to run an experiment or clean up some features, they branch off: experiment-1, for example.
This structure gives us isolation. It means we can mess up, delete data, and test things without putting production data at risk. We only merge data from ingest to main after it passes specific quality gates.
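Here’s a rough sketch of that branch layout, assuming the high-level lakeFS Python SDK used later in this article – the repository name is illustrative, and the final merge is only meant to run after the quality gates described in the next section pass:

import lakefs

# Hypothetical repository holding the production data lake
repo = lakefs.Repository("credit-risk-data")

# New data never lands on main directly – it arrives on a dedicated ingest branch
ingest = repo.branch("ingest").create(source_reference="main", exist_ok=True)

# Data scientists branch off main to get an isolated sandbox for experiments
experiment = repo.branch("experiment-1").create(source_reference="main", exist_ok=True)

# Only after validation passes does ingest get merged back into main
ingest.merge_into(repo.branch("main"))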
How data version control fixes the 3 compliance issues
Preventing PII leakage
Remember how Alice accidentally trained the model on Social Security numbers? We need to make that impossible to do. In the software world, you wouldn’t let code merge to production if unit tests fail. We’re going to do the same thing for data using pre-merge hooks.
We configure a hook called “validate PII.” This is compliance-as-code, so we’re not relying on Alice to remember to run a script manually. We configure the lakeFS server to enforce a policy: no merge enters main unless it passes PII validation.
The hook doesn’t have to be complicated. Here’s an example:
function validate_pii()
    local sensitive_keywords = {"ssn", "social_security", "email", "credit_card"}
    for _, file in ipairs(action.files()) do
        if string.match(file.path, "%.parquet$") then
            local success, schema = pcall(parquet.get_schema, file.physical_address)
            if success then
                for _, col in ipairs(schema.columns) do
                    for _, keyword in ipairs(sensitive_keywords) do
                        if string.find(string.lower(col.name), keyword) then
                            error("PII detected in " .. file.path)
                        end
                    end
                end
            end
        end
    end
end

validate_pii()  -- run the check when the hook script executes

In this example, we use a simple script written by the trusted engineering team. It runs automatically whenever a merge request is made: it reads the specific files being changed, opens the schema of each Parquet file, and scans the column names for sensitive keywords. A stricter version could also scan the values themselves against regular expressions looking for patterns that match Social Security numbers, credit card formats, or email addresses.
If it finds a match, it raises an error and the hook fails. And because this runs at the infrastructure level, it acts as a hard gate: it runs every single time, for every single byte of data trying to enter production.
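For comparison, here’s what a value-level regex scan – the stricter check mentioned above – might look like in plain Python. This is a sketch, not the lakeFS hook itself; the patterns, column handling, and file path are illustrative:

import re
import pandas as pd

# Rough patterns for common PII formats – illustrative, not exhaustive
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def find_pii(df: pd.DataFrame) -> list[tuple[str, str]]:
    """Return (column, pattern_name) pairs where any value matches a PII pattern."""
    hits = []
    for col in df.select_dtypes(include="object").columns:
        values = df[col].dropna().astype(str)
        for name, pattern in PII_PATTERNS.items():
            if values.str.contains(pattern).any():
                hits.append((col, name))
    return hits

# Usage sketch: scan a file that is about to be merged (path is illustrative)
violations = find_pii(pd.read_parquet("features/users.parquet"))
if violations:
    raise SystemExit(f"PII detected: {violations}")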
So once Alice has done her ingestion and thinks the data is clean, she goes to the web UI and clicks the green merge button.
But we’ve installed the pre-merge hook in lakeFS to run on every merge attempt into main. Now look at the output:

The system rejected the operation. The merge is blocked. That sensitive data is trapped on the ingest branch and hasn’t polluted the main branch. It hasn’t been exposed to the training pipeline.
We successfully shifted compliance left. We caught the issue moments after ingestion instead of discovering it six months later during a frantic audit.
Solving the reproducibility trap
Once our data is clean and we successfully merge it, lakeFS creates a commit. In a standard file system, when you update a file, the old version is overwritten. It’s gone. In lakeFS, every change is immutable. A commit ID is a cryptographic hash: a unique fingerprint that represents the entire state of your data lake at a precise time.
To solve reproducibility, we need to change how we talk about training data.
We need to stop saying “I trained the model on the Q3 dataset” or “I trained it on the data in that folder Mark shared.” Those are ambiguous and bound to change. We need to start saying: I trained this model on commit A1B2C3D.
So how do we implement this in code?
Let’s look at a concrete example using tools you likely use every day: Python, Pandas, and the lakeFS SDK.
In this snippet, we are starting a new experiment:
import lakefs
import mlflow
import pandas as pd

repo = lakefs.Repository("my-repo")
exp_branch = repo.branch("experiment-1").create(source_reference="main", exist_ok=True)
head_commit_id = exp_branch.head.id

table_path = "my_table.csv"
dataset_source_url = f"s3://{repo.id}/{head_commit_id}/{table_path}"

# Read from lakeFS through its S3-compatible endpoint
raw_data = pd.read_csv(dataset_source_url, delimiter=";", storage_options={
    "key": "AKIAIOSFOLKFSSAMPLES",
    "secret": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "client_kwargs": {"endpoint_url": "http://localhost:8000"}
})

# Create an instance of a PandasDataset for experiment tracking
dataset = mlflow.data.from_pandas(raw_data, source=dataset_source_url, name="famous_people")

Instead of reading from a bucket and hoping no one changes it while we work, we explicitly create an isolated environment:
- First, we define our repository. Then – and this is the critical step – we create a new branch called experiment-1, branching off main. This happens instantly. Now we have a safe sandbox.
- Next, we tell Pandas where to find the data. We construct a precise S3 URL using the repository ID and the head commit ID of our branch. This path points directly to an immutable snapshot of data on that branch.
- Finally, we read the CSV into a Pandas DataFrame.
- Notice: we don’t need a special lakeFS plugin for Pandas. We just pass standard S3 storage options pointing to the lakeFS endpoint.
It looks like a standard S3 request to Pandas. But under the hood, we are reading from a version-controlled isolated branch.
Now we have our data loaded. Next we need to make sure our experiment tracking tool, MLflow, knows exactly where the data came from.
Standard logging tracks hyperparameters like learning rate or batch size. But that’s not enough for compliance. We need to track the state of our data.
Inside our MLflow run context, we do two things.
- First, we register the dataset object we used.
- Second – and most importantly – we set custom tags for the lakeFS repo, branch, and commit.
...
# Track the dataset versioned by lakeFS
with mlflow.start_run() as run:
    mlflow.log_input(dataset, context="training")
    mlflow.set_tag("lakefs_repo", repo.id)
    mlflow.set_tag("lakefs_branch", exp_branch.id)
    mlflow.set_tag("lakefs_commit", head_commit_id)

We are linking the commit ID, that unique cryptographic hash, to the metadata of the training run. This is our golden link.
By adding those three lines, we permanently bind the resulting model artifact to the exact immutable state of the world that created it.
Six months from now, when you look at this run in the MLflow UI, you won’t have to guess which folder was used. You will have the specific commit ID right there.
Let’s see the result in the MLflow dashboard:

You see the standard metrics: accuracy, loss, and so on. But in the run’s tags, you also see the data commit. Imagine an auditor walks in six months from now, points to a decision your model made, and says, “Prove to me this model was trained on valid data. Reproduce it.”
Without this, you’re stuck guessing what was in the folder back then. But with this, you can copy the commit ID. Using the lakeFS web UI, you can see all the files that are part of that commit.
And using the lakeFS CLI, you can check out that entire data environment locally.
Within moments, it’s ready for use – reverted to the exact state it was in when the training job ran. You can rerun the code and get the exact same result.
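As a minimal sketch of that replay – assuming the MLflow run ID is known and reusing the same lakeFS endpoint and sample credentials as earlier:

import mlflow
import pandas as pd

# Look up the lakeFS coordinates tagged on the original training run
run = mlflow.get_run("<run-id-of-the-audited-model>")  # placeholder run ID
repo_id = run.data.tags["lakefs_repo"]
commit_id = run.data.tags["lakefs_commit"]

# Read the training data pinned to that immutable commit – the exact bytes
# the model saw, no matter what has changed on main since then
raw_data = pd.read_csv(
    f"s3://{repo_id}/{commit_id}/my_table.csv",
    delimiter=";",
    storage_options={
        "key": "AKIAIOSFOLKFSSAMPLES",
        "secret": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "client_kwargs": {"endpoint_url": "http://localhost:8000"},
    },
)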

Addressing traceability
Traceability requires you to document the entire lifecycle of your data. Because we’re using a version control system, we get a full audit log for free.
This is the lakeFS commit view:

It looks like the commit history in GitHub, but for your data. For every change that has ever happened in your lake, you can see who made the change – was it Alice, or the automated ingest job? – and when they made it.
You can click into a commit and see exactly what files were added, modified, or deleted.
We’re no longer relying on manual spreadsheets or asking people to please sign the log book. The system captures lineage automatically. This closes the loop on the legal nightmare we started with.
Remember when legal asked: did we use the copyrighted Book Three dataset in the model we shipped on Tuesday? Now Alice can answer that question in seconds.
She looks at the model running in production. She grabs the commit ID tag. She goes to lakeFS, searches that commit for the Book Three file, and can definitively say: no, that file wasn’t present in this commit. We’re safe.
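A sketch of that lookup with the lakeFS Python SDK – the commit ID placeholder stands for the value taken from the model’s tags, the prefix and file name are illustrative, and the listing call assumes the high-level SDK:

import lakefs

repo = lakefs.Repository("my-repo")
ref = repo.ref("<commit-id-from-the-model-tags>")  # placeholder commit ID

# Walk the objects that were part of that exact commit and look for the dataset
present = any("book_three" in obj.path for obj in ref.objects(prefix="datasets/"))
print("Book Three present in the training commit:", present)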

We’ve established full lineage from the final model decision all the way back to the raw input data.
Wrap up
We started with Alice’s nightmare: losing data history, facing audits she couldn’t answer, and scrambling to prove her models weren’t biased. Her story isn’t unique – it’s the reality for most ML teams today.
The problem boils down to one thing: we treat our data like a swamp. It’s mutable, unversioned, and impossible to trace. Meanwhile, we solved this exact problem for code two decades ago with Git.
The solution? Apply version control principles to your data lake.
In this article, I showed you how data version control with lakeFS solves the three critical compliance challenges:
PII leakage
Pre-merge hooks act as automated gatekeepers, blocking sensitive data before it touches production. No more accidental Social Security numbers in your training sets.
Reproducibility
Every model is linked to an immutable commit ID. Six months later, you can check out that exact data state and rerun your training job with identical results.
Traceability
An automated audit trail captures who changed what and when. No manual spreadsheets, no guesswork – just a complete history you get for free.
The key takeaway is simple: version everything – your code, your environment, and your data.
Don’t let your data lake become a data swamp. Compliance shouldn’t slow you down. When you build it into your infrastructure from day one, you can stop worrying about lawsuits and start shipping with confidence.