The Write-Audit-Publish (WAP) pattern in data engineering gives teams greater control over data quality. But what does it entail, and how do you implement it? Keep reading to learn more about the Write-Audit-Publish pattern, examine its use cases, and get a practical implementation example.
What is Write-Audit-Publish all about?
Write-Audit-Publish (WAP) aims to boost trust and confidence in the data you deliver by verifying it before it’s made available to consumers, whether it’s the end users viewing the data in a dashboard or subsequent data processing jobs. How WAP is implemented will vary greatly depending on the specifics of your technology platforms, architecture designs, and other requirements.
A critical condition behind WAP is that while new data is being written and audited, consumers continue to see only the currently published data.

The idea behind Write-Audit-Publish is to make the data available in a regulated manner:
Write
Write the data you are processing somewhere downstream consumers cannot read it yet, such as a staging or temporary location, a branch, or another isolated area. The new data sits there until it has been checked.

Audit
Audit the data to ensure it meets your data quality standards. While the staged data is being checked, consumers of the published data still see only the existing data.

Publish
Publish the data by writing it to the location downstream consumers read from. Once the audit succeeds, the data is released and all downstream users can see it.

What does publishing data entail? It can take several forms, such as:
- Adding data from a staging table to the main database against which users execute their queries
- Merging a data branch into the trunk on platforms that support it
- Flipping a flag in a table so that people querying it now include that data in their results
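As a minimal sketch of the last option, suppose (hypothetically) that the table carries a batch_id and an is_published flag that every consumer query filters on. Publishing is then a single statement:
-- Hypothetical flag flip: readers only see rows WHERE is_published = TRUE,
-- so this statement is the "publish" step for the audited batch.
UPDATE sales
SET is_published = TRUE
WHERE batch_id = 'etl_job_42';
The write lands rows with is_published = FALSE, the audit runs against those rows, and the flag flip makes them visible.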
Why is Write-Audit-Publish important?
Write-Audit-Publish is quite similar to the concept of Blue-Green deployments in software engineering. Martin Fowler popularised blue-green deployment more than a decade ago. When you deploy something new, you don’t go “big bang” but rather route some or all traffic to the new deployment with the flexibility to switch back at any time.

This is quite similar to what WAP provides but for datasets rather than application servers.
A user queries the previously published data, and the "routing" is logical rather than an actual router: they run the same query against the same connection and simply receive different data once the new data is published.
This could involve branching. In the example below, the yellow box becomes the “main” trunk branch, and merging into it allows the user to see the most recent data. It could also be a more manual process, such as combining data from a staging table.

A user's query is logically routed to the "green", recently published data.
WAP is an excellent fit for recurring data pipelines and one-time data processing activities. But how do you implement WAP?
How to pick the right Write-Audit-Publish tool
When shopping for a WAP tool, consider the following aspects:
| Aspect | What to Consider |
|---|---|
| Granularity | At what level can you write, publish, and roll back data? Is it per file, table, catalog, or the whole lake? |
| Data format | Can you dissociate your technological decision from the data format, or are they inextricably linked? Do you need to keep any other data formats in the same logical group as those published? |
| Isolation | Can data that has been written but not yet published be read ("dirty read") by another client? This is critical if you want to conduct an asynchronous audit from a separate process or tool (such as an external data quality tool). |
| Ease of use | Do you have to contort yourself into knots to fit the tool into the WAP pattern, or does it flow naturally? |
| Cross-language support | Given that not everyone lives and breathes JVM, how accessible is WAP functionality to a novice SQL hack navigating this wild new world of data engineering? |
Implementing Write-Audit-Publish: 6 Options to Consider
1. Apache Iceberg
In the area of open table formats, Iceberg has made the most progress, with active discussion and development of WAP.
Initially, support required the laborious step of determining the writer's ID in a table before querying it for the audit, but with the addition of branches and tags in the 1.2.0 release (video), it's now very straightforward to use.
Write
The documentation describes the process clearly. The one thing that isn’t immediately obvious—and maybe a holdover from when branches weren’t supported but “integrated audit” was—is how to address a branch and what session settings you need to establish.
Regardless of how you do it, you must create a branch before making changes to the table:
ALTER TABLE db.table
CREATE BRANCH etl_job_42
From here, you can choose one of two approaches to writing:
The first is to add the branch_<branch-name> suffix to the table name and write directly to the branch:
DELETE FROM db.table.branch_etl_job_42 WHERE borough='Manhattan'
Alternatively, you can configure the table for WAP, set the spark.wap.branch session parameter to the branch, and then execute the write:
-- Enable WAP for the table
ALTER TABLE db.table
SET TBLPROPERTIES (
'write.wap.enabled'='true'
)
-- Set the branch used for this session for reads and writes
spark.conf.set('spark.wap.branch', 'etl_job_42')
-- Write.
DELETE FROM db.table WHERE borough='Manhattan'
Audit
After you’ve written the data, you can read it from the branch to verify it. The important part is that the primary table remains untouched, so consumers won’t see any of the new data at this point.
To read from the branch, you have three options:
1. Add a .branch_<branch-name> suffix to the table name:
SELECT foo, bar
FROM db.table.branch_etl_job_42;
2. Use the VERSION AS OF clause:
SELECT foo, bar
FROM db.table VERSION AS OF 'etl_job_42';
3. Set spark.wap.branch:
spark.conf.set('spark.wap.branch', 'etl_job_42')
SELECT foo, bar
FROM db.table;
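Whichever read option you pick, the audit itself is just one or more queries whose results decide whether the branch gets published. A minimal sketch, assuming a hypothetical rule that borough must never be NULL:
-- Hypothetical audit check against the branch: if this returns a
-- non-zero count, stop the pipeline and do not fast-forward the branch.
SELECT COUNT(*) AS bad_rows
FROM db.table.branch_etl_job_42
WHERE borough IS NULL;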
Publish
Once your audit is complete, you are ready to publish the branch. This is accomplished using a fast-forward branch merge via the manageSnapshots().fastForwardBranch API:
table
.manageSnapshots()
.fastForwardBranch("main", "etl_job_42")
.commit();
Unlike the manageSnapshots().cherrypick API, which has an accessible stored procedure (cherrypick_snapshot), fastForwardBranch does not have one as of 1.2.1, so you must use the API directly.
2. Apache Hudi
Hudi offers features similar to WAP, such as Pre-Commit Validators. It comes with three validators and can be extended to include your own. Note that it supports only the Hudi data format.
Unfortunately, the tool doesn’t support asynchronous auditing. Instead, it’s an all-or-nothing scenario. This example from the docs provides a sense of how it works.
df.write.format("hudi").mode(Overwrite).
option(TABLE_NAME, tableName).
option("hoodie.precommit.validators",
"org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator").
option("hoodie.precommit.validators.single.value.sql.queries",
"select count(*) from <TABLE_NAME> where col=null#0").
save(basePath)
When you write your data (Write), the pre-commit validator (Audit) is invoked, and if it succeeds, the write is completed (Publish).
If the validator fails, you can land the faulty data in a temporary area, inspect and fix it there, and then re-run the original process. There is no option to stage the written data for audit by a secondary process, which is limiting in terms of both tooling and troubleshooting.
Unlike Iceberg, which uses WAP-based configuration naming, Hudi doesn’t claim to implement WAP; rather, the pre-commit validators approximate the pattern. If Hudi is your tool and the validators work well with your pipeline implementation, there’s no reason not to use it.
3. Delta Lake
As far as we can tell, Delta Lake doesn’t support WAP.
If you use Databricks Runtime (DBR), you can use a shallow clone and roll your own Publish logic (compared to, for example, Iceberg's branch merge).
Their docs provide an example of a bespoke MERGE INTO statement specific to the fields and logic of the write operation, which strikes me as perhaps cumbersome and error-prone compared to simpler options.
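For the sake of illustration, a rough sketch of that approach on DBR might look like the following, where prod.sales, incoming_sales, amount, load_date, and id are all hypothetical names and the MERGE condition would need to match your own schema:
-- Write: stage new rows in a shallow clone of the production table.
CREATE OR REPLACE TABLE staging_sales SHALLOW CLONE prod.sales;
INSERT INTO staging_sales SELECT * FROM incoming_sales;

-- Audit: run quality checks against the clone only; prod.sales is untouched.
SELECT COUNT(*) AS bad_rows FROM staging_sales WHERE amount IS NULL;

-- Publish: hand-rolled MERGE of the audited batch into the table users query.
MERGE INTO prod.sales AS t
USING (SELECT * FROM staging_sales WHERE load_date = current_date()) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;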
4. RDBMS
In the same way that Apache Hudi offers a "Pre-Commit Validator," you can use BEGIN TRANSACTION in an RDBMS: write your data inside the transaction, audit it, and then COMMIT to publish.
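A minimal sketch with hypothetical sales and sales_incoming tables (exact syntax varies by database; this assumes read-committed isolation or stricter, so uncommitted rows stay invisible to other sessions):
BEGIN TRANSACTION;

-- Write: load the new rows; other sessions cannot see them yet.
INSERT INTO sales
SELECT * FROM sales_incoming;

-- Audit: check quality inside the same transaction.
-- If the check fails, issue ROLLBACK instead of COMMIT.
SELECT COUNT(*) AS bad_rows FROM sales WHERE amount IS NULL;

-- Publish: committing makes the rows visible to everyone.
COMMIT;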
Partition exchange, also known as SQL Server's SWITCH IN, is a more flexible approach to the WAP pattern that allows for asynchronous auditing by an external process. It involves writing to a temporary table, auditing it (if desired) from another process, and then publishing it as a partition of the table users read from.
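In SQL Server terms, the publish step could then be a metadata-only partition switch of the already-audited staging table into the main table. A sketch with hypothetical object names (the staging table must match the target's schema, and a check constraint must confine its rows to the target partition's range):
-- Write + Audit happen against sales_staging, possibly from separate processes.
-- Publish: switch the staged rows in as partition 42 of the table users query.
ALTER TABLE sales_staging
SWITCH TO sales PARTITION 42;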
5. Project Nessie
Nessie is a Transactional Catalog for Data Lakes that offers Git-like semantics. It controls objects at the catalog level, which means the integration is tighter but also more constrained.
Nessie was created in close collaboration with the Apache Iceberg project and only supports it.
Write
After you’ve set up Nessie as your catalog (for example, in Spark), you build a branch:
CREATE BRANCH etl_job_42
IN my_catalog FROM main
You then change the context for your session to this branch:
USE REFERENCE etl_job_42 IN my_catalog
Following this, you can modify as many tables as you wish:
DELETE FROM my_catalog.table WHERE borough='Manhattan'
Audit
Once you’ve written the data, you can check it by reading it from the branch, which can be done within the same process or through an external one. It’s important to note that the table(s) in the main branch will remain intact so that users won’t notice any changes.
To read from the branch, you have two options:
The first is to set the context as you did for the write (in fact, if you're in the same session, you don't need to set it again because it will still be in effect):
USE REFERENCE etl_job_42 IN my_catalog;
SELECT my, audit, query FROM my_catalog.table;
Alternatively, you can reference the table directly:
SELECT my, audit, query FROM my_catalog.`table@etl_job_42`;
Publish
To publish the data, you merge your branch back into main for everyone to see.
MERGE BRANCH etl_job_42 INTO main IN my_catalog
6. lakeFS
So far, the best example of a tool that supports WAP has been Apache Iceberg, with its explicit use of branches and merging. lakeFS is built entirely around branches and merging.
lakeFS supports any data format, including open-table formats (Iceberg, Hudi, and Delta), file formats (Parquet, CSV, etc.), and binary files (e.g., pictures).
lakeFS is a layer on top of your S3-compatible storage that includes native Spark clients and an S3-compatible gateway, allowing it to work easily with almost any data processing tool.
This is demonstrated in this notebook.
The only thing to understand is that instead of addressing your data with literal object store paths, such as
s3://my-bucket/foo/bar/table
you use lakeFS-based paths, which include the name of the repository and the current branch:
s3a://my-repo/branch/foo/bar/table
Write
Branches in lakeFS are copy-on-write, meaning they’re practically “free” as you only start saving data (other than metadata) on disk after you write changes to the branch, and then only the altered data.
lakeFS allows you to create a branch in a variety of ways, including from the Python client:
lakefs.create_branch(
    repository="my-repo",
    branch_creation=BranchCreation(
        name="my-etl-job",
        source="main"
    )
)
You can alternatively use the CLI (lakectl branch create), the web UI, or the REST API directly.
After creating a branch, you can modify one or more tables in any format you like. When writing the data, you use the branch name from the path to which you are writing:
repo="my-repo"
branch="my-etl-job"
base_datapath = f"s3a://{repo}/{branch}"
df.write.mode('overwrite').save(f"{base_datapath}/my/table")
The beauty of lakeFS is that you may use simple Parquet files or open table formats such as Delta Lake (or Hudi and Iceberg as well).
At this point, the data on the main branch is completely untouched and unaffected by any processing done here.
Audit
Anything that can read data from S3 can also read data from lakeFS, allowing the audit to be performed completely independently.
For example, you might look for NULLs in the loaded data of a Delta table:
SELECT COUNT(*) AS ct
FROM delta.`s3a://my-repo/my-etl-job/this_is_a_delta_table`
WHERE year IS NULL
You can also inspect a CSV file that was saved during the Write phase:
spark.read.text('s3a://my-repo/my-etl-job/some_src_data.csv').sample(fraction=0.2, seed=42).show()
lakeFS supports many data processing tools, including Spark, Trino, Presto, Python, and others.
Publish
If you're satisfied with the audit, you publish by merging the branch into main. You can do this in various ways, including with the lakectl tool:
lakectl merge \
lakefs://my-repo/my-etl-job \
lakefs://my-repo/main
Unpublish (that's right!)
If necessary, you can unpublish the data, more commonly known as a rollback or revert.
lakectl branch revert \
lakefs://my-repo/main \
main --parent-number 1 --yes
Wrap up
In this article, we discussed several options for implementing the Write-Audit-Publish pattern, using tools ranging from Apache Iceberg and Hudi to Project Nessie and lakeFS.
Now that you know your options, you’re well on your way to building a solid Write-Audit-Publish process. lakeFS is an excellent choice of WAP tool if you want to use Apache Hudi or Delta Lake but also need complete support for the WAP pattern.
If you’d like a more in-depth look into WAP, check out this 3-part series:
1. Data Engineering Patterns: Write-Audit-Publish (WAP)
2. How to Implement Write-Audit-Publish (WAP)
3. Putting the Write-Audit-Publish Pattern into Practice with lakeFS


