The Write-Audit-Publish (WAP) pattern in data engineering gives teams greater control over data quality. But what does it entail, and how do you implement it? Keep reading to learn more about the Write-Audit-Publish pattern, examine its use cases, and get a practical implementation example.
What is Write-Audit-Publish all about?
Write-Audit-Publish (WAP) aims to boost trust and confidence in the data you deliver by verifying it before it’s made available to consumers, whether it’s the end users viewing the data in a dashboard or subsequent data processing jobs. How WAP is implemented will vary greatly depending on the specifics of your technology platforms, architecture designs, and other requirements.
A critical condition behind WAP is that while new data is being written and audited, consumers continue to see only the currently published data.

The idea behind Write-Audit-Publish is to make the data available in a regulated manner:
Write
Write the data you are processing somewhere downstream consumers cannot read it yet, such as a staging or temporary location, a branch, or another isolated area. The new data sits there until it has been checked.

Audit
Audit the data to ensure it meets your data quality standards. While the staged data is being checked, consumers of the published data still see only the existing data.

Publish
Publish the data by writing it to the location downstream consumers read from. Once the audit succeeds, the data is released and all downstream users can see it.

What does publishing data entail? It can take several forms, such as:
- Adding data from a staging table to the main database against which users execute their queries
- Merging a data branch into the trunk on platforms that support it
- Flipping a flag in a table so that people querying it now include that data in their results
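As a minimal sketch of the last option, suppose (hypothetically) that the table carries a batch_id and an is_published flag that every consumer query filters on. Publishing is then a single statement:
-- Hypothetical flag flip: readers only see rows WHERE is_published = TRUE,
-- so this statement is the "publish" step for the audited batch.
UPDATE sales
SET is_published = TRUE
WHERE batch_id = 'etl_job_42';
The write lands rows with is_published = FALSE, the audit runs against those rows, and the flag flip makes them visible.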
Why is Write-Audit-Publish important?
Write-Audit-Publish is quite similar to the concept of Blue-Green deployments in software engineering. Martin Fowler popularised blue-green deployment more than a decade ago. When you deploy something new, you don’t go “big bang” but rather route some or all traffic to the new deployment with the flexibility to switch back at any time.

This is quite similar to what WAP provides but for datasets rather than application servers.
A user queries the previously published data, and the "routing" is logical rather than an actual router: they run the same query against the same connection and simply receive different data once the new data is published.
This could involve branching. In the example below, the yellow box becomes the “main” trunk branch, and merging into it allows the user to see the most recent data. It could also be a more manual process, such as combining data from a staging table.

A user's query is logically routed to the "green", recently published data.
WAP is an excellent fit for recurring data pipelines and one-time data processing activities. But how do you implement WAP?
How to pick the right Write-Audit-Publish tool
When shopping for a WAP tool, consider the following aspects:
| Aspect | What to Consider |
|---|---|
| Granularity | At what level can you write, publish, and roll back data? Is it per file, table, catalog, or the whole lake? |
| Data format | Can you dissociate your technological decision from the data format, or are they inextricably linked? Do you need to keep any other data formats in the same logical group as those published? |
| Isolation | Can data that has been written but not yet published be read ("dirty read") by another client? This is critical if you want to conduct an asynchronous audit from a separate process or tool (such as an external data quality tool). |
| Ease of use | Do you have to contort yourself into knots to fit the tool into the WAP pattern, or does it flow naturally? |
| Cross-language support | Given that not everyone lives and breathes JVM, how accessible is WAP functionality to a novice SQL hack navigating this wild new world of data engineering? |
Implementing Write-Audit-Publish: 6 Options to Consider
1. Apache Iceberg
In the area of open table formats, Iceberg has made the most progress, with active discussion and development of WAP.
Initially, support required the laborious step of determining the writer's ID in a table before querying it for the audit, but with the addition of branches and tags in the 1.2.0 release (video), it's now very straightforward to use.
Write
The documentation describes the process clearly. The one thing that isn’t immediately obvious—and maybe a holdover from when branches weren’t supported but “integrated audit” was—is how to address a branch and what session settings you need to establish.
Regardless of how you do it, you must create a branch before making changes to the table:
ALTER TABLE db.table
CREATE BRANCH etl_job_42
From here, you can choose one of two approaches to writing:
The first is to add the branch_<branch-name> suffix to the table name and write directly to the branch:
DELETE FROM db.table.branch_etl_job_42 WHERE borough='Manhattan'
Alternatively, you can configure the table for WAP, set the spark.wap.branch session parameter to the branch, and then execute the write:
-- Enable WAP for the table
ALTER TABLE db.table
SET TBLPROPERTIES (
'write.wap.enabled'='true'
)
-- Set the branch used for this session for reads and writes
spark.conf.set('spark.wap.branch', 'etl_job_42')
-- Write.
DELETE FROM db.table WHERE borough='Manhattan'
Audit
After you’ve written the data, you can read it from the branch to verify it. The important part is that the primary table remains untouched, so consumers won’t see any of the new data at this point.
To read from the branch, you have three options:
1. Add a .branch_<branch-name> suffix to the table name:
SELECT foo, bar
FROM db.table.branch_etl_job_42;
2. Use the VERSION AS OF clause:
SELECT foo, bar
FROM db.table VERSION AS OF 'etl_job_42';
3. Set spark.wap.branch:
spark.conf.set('spark.wap.branch', 'etl_job_42')
SELECT foo, bar
FROM db.table;
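Whichever read option you pick, the audit itself is just one or more queries whose results decide whether the branch gets published. A minimal sketch, assuming a hypothetical rule that borough must never be NULL:
-- Hypothetical audit check against the branch: if this returns a
-- non-zero count, stop the pipeline and do not fast-forward the branch.
SELECT COUNT(*) AS bad_rows
FROM db.table.branch_etl_job_42
WHERE borough IS NULL;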
Publish
Once your audit is complete, you are ready to publish the branch. This is accomplished using a fast-forward branch merge via the manageSnapshots().fastForwardBranch API:
table
.manageSnapshots()
.fastForwardBranch("main", "etl_job_42")
.commit();
Unlike the manageSnapshots().cherrypick API, which has an accessible stored procedure (cherrypick_snapshot), fastForwardBranch does not have one as of 1.2.1, so you must use the API directly.
2. Apache Hudi
Hudi offers features similar to WAP, such as Pre-Commit Validators. It comes with three validators and can be extended to include your own. Note that it supports only the Hudi data format.
Unfortunately, the tool doesn’t support asynchronous auditing. Instead, it’s an all-or-nothing scenario. This example from the docs provides a sense of how it works.
df.write.format("hudi").mode(Overwrite).
option(TABLE_NAME, tableName).
option("hoodie.precommit.validators",
"org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator").
option("hoodie.precommit.validators.single.value.sql.queries",
"select count(*) from <TABLE_NAME> where col=null#0").
save(basePath)
When you write your data (Write), the pre-commit validator (Audit) is invoked, and if it succeeds, the write is completed (Publish).
If the validator fails, you can land the faulty data in a temporary area, inspect and fix it there, and then re-run the original process. There is no option to stage the written data for audit by a secondary process, which is limiting in terms of both tooling and troubleshooting.
Unlike Iceberg, which uses WAP-based configuration naming, Hudi doesn’t claim to implement WAP; rather, the pre-commit validators approximate the pattern. If Hudi is your tool and the validators work well with your pipeline implementation, there’s no reason not to use it.
3. Delta Lake
As far as we can tell, Delta Lake doesn’t support WAP.
If you use Databricks Runtime (DBR), you can use a shallow clone and roll your own Publish logic (compared to, for example, Iceberg's branch merge).
Their docs provide an example of a bespoke MERGE INTO statement specific to the fields and logic of the write operation, which strikes me as perhaps cumbersome and error-prone compared to simpler options.
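For the sake of illustration, a rough sketch of that approach on DBR might look like the following, where prod.sales, incoming_sales, amount, load_date, and id are all hypothetical names and the MERGE condition would need to match your own schema:
-- Write: stage new rows in a shallow clone of the production table.
CREATE OR REPLACE TABLE staging_sales SHALLOW CLONE prod.sales;
INSERT INTO staging_sales SELECT * FROM incoming_sales;

-- Audit: run quality checks against the clone only; prod.sales is untouched.
SELECT COUNT(*) AS bad_rows FROM staging_sales WHERE amount IS NULL;

-- Publish: hand-rolled MERGE of the audited batch into the table users query.
MERGE INTO prod.sales AS t
USING (SELECT * FROM staging_sales WHERE load_date = current_date()) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;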
4. RDBMS
In the same way that Apache Hudi offers a "Pre-Commit Validator," you can use BEGIN TRANSACTION in an RDBMS: write your data inside the transaction, audit it, and then COMMIT to publish.
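A minimal sketch with hypothetical sales and sales_incoming tables (exact syntax varies by database; this assumes read-committed isolation or stricter, so uncommitted rows stay invisible to other sessions):
BEGIN TRANSACTION;

-- Write: load the new rows; other sessions cannot see them yet.
INSERT INTO sales
SELECT * FROM sales_incoming;

-- Audit: check quality inside the same transaction.
-- If the check fails, issue ROLLBACK instead of COMMIT.
SELECT COUNT(*) AS bad_rows FROM sales WHERE amount IS NULL;

-- Publish: committing makes the rows visible to everyone.
COMMIT;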
Partition exchange, also known as SQL Server's SWITCH IN, is a more flexible approach to the WAP pattern that allows for asynchronous auditing by an external process. It involves writing to a temporary table, auditing it (if desired) from another process, and then publishing it as a partition of the table users read from.
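In SQL Server terms, the publish step could then be a metadata-only partition switch of the already-audited staging table into the main table. A sketch with hypothetical object names (the staging table must match the target's schema, and a check constraint must confine its rows to the target partition's range):
-- Write + Audit happen against sales_staging, possibly from separate processes.
-- Publish: switch the staged rows in as partition 42 of the table users query.
ALTER TABLE sales_staging
SWITCH TO sales PARTITION 42;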
5. Project Nessie
Nessie is a Transactional Catalog for Data Lakes that offers Git-like semantics. It controls objects at the catalog level, which means the integration is tighter but also more constrained.
Nessie was created in close collaboration with the Apache Iceberg project and only supports it.
Write
After you’ve set up Nessie as your catalog (for example, in Spark), you build a branch:
CREATE BRANCH etl_job_42
IN my_catalog FROM main
You then change the context for your session to this branch:
USE REFERENCE etl_job_42 IN my_catalog
Following this, you can modify as many tables as you wish:
DELETE FROM my_catalog.table WHERE borough='Manhattan'
Audit
Once you’ve written the data, you can check it by reading it from the branch, which can be done within the same process or through an external one. It’s important to note that the table(s) in the main branch will remain intact so that users won’t notice any changes.
To read from the branch, you have two options:
The first is to set the context as you did for the write (in fact, if you're in the same session, you don't need to set it again because it will still be in effect):
USE REFERENCE etl_job_42 IN my_catalog;
SELECT my, audit, query FROM my_catalog.table;
Alternatively, you can reference the table directly:
SELECT my, audit, query FROM my_catalog.`table@etl_job_42`;
Publish
To publish the data, you merge your branch back into main for everyone to see.
MERGE BRANCH etl_job_42 INTO main IN my_catalog
6. lakeFS
So far, the best example of a tool that supports WAP has been Apache Iceberg, with its explicit use of branches and merging. lakeFS is built entirely around branches and merging.
lakeFS supports any data format, including open-table formats (Iceberg, Hudi, and Delta), file formats (Parquet, CSV, etc.), and binary files (e.g., pictures).
lakeFS is a layer on top of your S3-compatible storage that includes native Spark clients and an S3-compatible gateway, allowing it to work easily with almost any data processing tool.
This is demonstrated in this notebook.
The only thing to understand is that instead of addressing your data with literal object store paths, such as
s3://my-bucket/foo/bar/table
you use lakeFS-based paths, which include the name of the repository and the current branch:
s3a://my-repo/branch/foo/bar/table
Write
Branches in lakeFS are copy-on-write, meaning they’re practically “free” as you only start saving data (other than metadata) on disk after you write changes to the branch, and then only the altered data.
lakeFS allows you to create a branch in a variety of ways, including from the Python client:
lakefs.create_branch(
    repository="my-repo",
    branch_creation=BranchCreation(
        name="my-etl-job",
        source="main"
    )
)
You can alternatively use the CLI (lakectl branch create), the web UI, or the REST API directly.
After creating a branch, you can modify one or more tables in any format you like. When writing the data, you use the branch name from the path to which you are writing:
repo="my-repo"
branch="my-etl-job"
base_datapath = f"s3a://{repo}/{branch}"
df.write.mode('overwrite').save(f"{base_datapath}/my/table")
The beauty of lakeFS is that you may use simple Parquet files or open table formats such as Delta Lake (or Hudi and Iceberg as well).
At this point, the data on the main branch is completely untouched and unaffected by any processing done here.
Audit
Anything that can read data from S3 can also read data from lakeFS, allowing the audit to be performed completely independently.
For example, you might look for NULLs in the loaded data of a Delta table:
SELECT COUNT(*) AS ct
FROM delta.`s3a://my-repo/my-etl-job/this_is_a_delta_table`
WHERE year IS NULL
You can also inspect a CSV file that was saved during the Write phase:
spark.read.text('s3a://my-repo/my-etl-job/some_src_data.csv').sample(fraction=0.2, seed=42).show()
lakeFS supports many data processing tools, including Spark, Trino, Presto, Python, and others.
Publish
If you're satisfied with the audit, you publish by merging the branch into main. You can do this in various ways, including with the lakectl tool:
lakectl merge \
lakefs://my-repo/my-etl-job \
lakefs://my-repo/main
Unpublish (that's right!)
If necessary, you can unpublish the data, more commonly known as a rollback or revert.
lakectl branch revert \
lakefs://my-repo/main \
main --parent-number 1 --yes
Wrap up
In this article, we discussed several options for implementing the Write-Audit-Publish pattern, using tools ranging from Apache Iceberg and Hudi to Project Nessie and lakeFS.
Now that you know your options, you’re well on your way to building a solid Write-Audit-Publish process. lakeFS is an excellent choice of WAP tool if you want to use Apache Hudi or Delta Lake but also need complete support for the WAP pattern.
If you’d like a more in-depth look into WAP, check out this 3-part series:
1. Data Engineering Patterns: Write-Audit-Publish (WAP)
2. How to Implement Write-Audit-Publish (WAP)
3. Putting the Write-Audit-Publish Pattern into Practice with lakeFS


