When testing ETLs for big data applications, data engineers usually face a challenge that originates in the very nature of data lakes. Since we’re writing or streaming huge volumes of data to a central location, it only makes sense to carry out data testing against equally massive amounts of data.
You need to test at the real scale of data – not just the number of users, but the volume, complexity, and variety of your data set. The tools and software versions in the test and production environments also need to match.
That makes sense, doesn’t it? Only it’s easier said than done. Replicating production data is time-consuming and expensive, and in today’s economic reality, requesting a budget just for data testing is a hard sell.
Object stores may be cheap, but they’re certainly not free. In the big data world, teams deal with data lakes that are often petabytes in size – and rapidly growing as the organization expands. Copying files to separate buckets for production-like testing can take hours.
If your data lake holds 100 TB of data on Amazon S3, keeping a single extra copy of it for a continuous testing environment will cost roughly $25,000 a year in storage alone. Want to run several test environments in parallel? Just multiply that figure – and prepare for a tricky conversation with the finance manager!
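For a quick sanity check on that figure, here’s the back-of-the-envelope math. The per-GB price is an assumption; S3 Standard pricing varies by region and tier:

```python
# Rough annual storage cost of one extra copy of a 100 TB data lake on S3.
# PRICE_PER_GB_MONTH is an assumed S3 Standard rate; check current pricing for your region.
LAKE_SIZE_GB = 100 * 1024        # 100 TB expressed in GB
PRICE_PER_GB_MONTH = 0.021       # USD, assumed

annual_cost = LAKE_SIZE_GB * PRICE_PER_GB_MONTH * 12
print(f"~${annual_cost:,.0f} per year per copy")  # ~$25,805 per year per copy
```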
Luckily, there’s a way out of this mess. Keep reading to explore all the nuances of big data testing and get a practical solution.
The first step to big data testing is the engineering culture
In the software development world, there’s no better way to improve software quality than via rigorous testing. The same is true for data engineering; teams need to build and execute a comprehensive testing strategy to achieve the Holy Grail of high-quality data in production.
Since data teams often face hard deadlines, it’s common for engineers to put together data pipelines that are functional but not necessarily built with best practices in mind.
That’s why organizations aiming for high data quality need to build a culture that supports incorporating the best practices that pay off in the long run.
This is especially important given that not all data engineers (or data engineering leaders, for that matter) have a software engineering background, so they may not be familiar with software engineering principles and best practices.
Note that the industry itself is still catching up with these practices. Running an automated test suite and automating the deployment and release of data products still can’t be considered mainstream.
Finally, there’s the complexity that comes from big data itself. In ETL testing, data engineers need to compare huge volumes of data (on the scale of millions of records), often coming from different source systems. This includes comparing transformed data resulting from complex SQL queries or Spark jobs.
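For example, a minimal output comparison in Spark might diff the transformed result against an expected dataset. This is just a sketch; the table names and SparkSession setup are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-output-check").getOrCreate()

# Hypothetical tables: the pipeline's actual output vs. a trusted expected result.
actual = spark.table("analytics.daily_revenue")
expected = spark.table("qa.daily_revenue_expected")

# Rows present in one dataset but not the other, checked in both directions.
missing_count = expected.exceptAll(actual).count()
unexpected_count = actual.exceptAll(expected).count()

assert missing_count == 0, f"{missing_count} expected rows are missing from the output"
assert unexpected_count == 0, f"{unexpected_count} unexpected rows appeared in the output"
```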
Big data testing is a data-centric testing process. To test data pipelines effectively, engineers need production-like data across all three dimensions: volume, variety, and velocity.
What data do teams use to replicate production scale for data testing?
Testing against production data is risky, so many data engineering teams resort to various tactics that give them access to production-like data for testing purposes.
1. Using mock data
Many data engineers use this approach because creating mock data is relatively easy thanks to the plethora of synthetic data generation tools such as Faker. However, mock data doesn’t reflect production data in terms of volume, variety, or velocity. You won’t be testing the full picture and might miss issues that snowball into real problems later on.
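For instance, generating a small batch of fake order records with Faker takes only a few lines (the schema below is purely illustrative):

```python
from faker import Faker

fake = Faker()
Faker.seed(42)  # make the generated data reproducible

# A tiny, illustrative "orders" dataset: easy to produce, but nowhere near
# production volume, variety, or velocity.
mock_orders = [
    {
        "order_id": fake.uuid4(),
        "customer": fake.name(),
        "email": fake.email(),
        "amount": round(fake.pyfloat(min_value=1, max_value=500), 2),
        "ordered_at": fake.date_time_this_year().isoformat(),
    }
    for _ in range(1_000)
]
```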
2. Sampling production data to the test/dev environment
Another common tactic is copying a fraction of the production data instead of the entire thing, and then testing against that. If you go for this approach, make sure to use the right sampling strategy so the sample reflects real-world production data. Even so, tests that pass on sampled prod data can easily fail on the real thing, because the sample doesn’t guarantee the full volume and variety.
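If you do sample, stratified sampling by a key dimension at least preserves some of the variety. Here’s a sketch in PySpark; the table and column names are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prod-sampling").getOrCreate()

events = spark.table("prod.events")  # hypothetical production table

# Stratified sample: keep 1% of every region so rare regions aren't dropped entirely.
fractions = {row["region"]: 0.01 for row in events.select("region").distinct().collect()}
sample = events.sampleBy("region", fractions=fractions, seed=42)

sample.write.mode("overwrite").saveAsTable("dev.events_sample")
```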
3. Copying all the production data to the test environment
If you do that, you’ll have all the real-world production data available for testing. Sounds too good to be true? That’s because it is.
First, if your production data contains PII, copying it for testing purposes might lead to data privacy violations. Second, if your production data changes constantly, the copy in the test/dev environment will become stale, and you’ll need to keep updating it. So, while copying prod data guarantees volume and variety, it doesn’t guarantee velocity.
4. Copying anonymized production data to the test environment
This tactic again makes all the real-world production data available for your data testing initiative while keeping you compliant with data privacy regulations.
But constantly changing prod data will again create a challenge for the team: the data in the test environment becomes stale quickly, and you’ll have to refresh it regularly.
You’ll also need to run PII anonymization every time you copy data out of prod. Running these anonymization steps manually and maintaining a long-running test data environment is error-prone and resource-intensive, and it adds significant overhead to an already busy data engineering team.
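To illustrate that overhead, a typical anonymization step hashes or blanks out PII columns before the data leaves production. A minimal Spark sketch, with assumed table and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, col, lit

spark = SparkSession.builder.appName("pii-anonymization").getOrCreate()

customers = spark.table("prod.customers")  # hypothetical table containing PII

anonymized = (
    customers
    # One-way hash of direct identifiers: joins still work, but values are unreadable.
    .withColumn("email", sha2(col("email"), 256))
    .withColumn("full_name", sha2(col("full_name"), 256))
    # Blank out fields the tests don't need at all.
    .withColumn("phone_number", lit(None).cast("string"))
)

anonymized.write.mode("overwrite").saveAsTable("dev.customers_anonymized")
```

And this step has to run on every refresh, which is exactly the overhead described above.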
5. Using a data versioning tool to fully mimic production data to dev/test env
This tactic is the most future-proof of the five. You get access to real-world production data in automated, short-lived test environments available through a Git-like API. All you need to do is add one new tool to your existing data stack.
To help you understand how this last approach works, let’s take a look at this practical example of using lakeFS for big data testing.
Big data testing & data versioning in practice
Tools like lakeFS let you test your ETL pipelines against production data directly – without ever wasting time and energy on copying or anonymizing it.
Using lakeFS, you can create an environment that’s exactly like production – just as complex and massive, with the same configurations. All of this is possible thanks to a data versioning mechanism.
How does data versioning help in big data testing?
lakeFS is an object-based file system that sits on top of cloud storage and provides Git-like capabilities (merge, branch, revert, and commit) via an API, a command line interface, or a graphical UI.
Your existing ecosystem of tools can access the storage either as usual or through lakeFS to gain versioning capabilities. The only difference is that where you previously addressed a collection of objects in a bucket – on Amazon Simple Storage Service (S3), Azure Blob Storage, or any object store that supports the S3 protocol – you now include the name of a branch or a commit identifier in the path: main, prod, gold, or whatever you call your production data branch.
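For example, an existing boto3-based job can keep using the S3 API and simply point at lakeFS’s S3-compatible gateway, with the branch name as the first part of the key. The endpoint, credentials, repository, and path below are placeholders:

```python
import boto3

# Point the S3 client at the lakeFS gateway instead of the cloud provider's endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",   # your lakeFS installation (placeholder)
    aws_access_key_id="<lakefs-access-key>",     # placeholder credentials
    aws_secret_access_key="<lakefs-secret-key>",
)

# Bucket = repository name; the key starts with a branch name (or commit ID),
# followed by the object path inside the lake.
obj = s3.get_object(
    Bucket="my-data-lake",                       # lakeFS repository (placeholder)
    Key="main/events/2024-01-01/part-0000.parquet",
)
data = obj["Body"].read()
```

Switching the same job from production to a test branch is then just a matter of changing `main` to the test branch’s name in the key.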
How does lakeFS do that without copying any data?
Let’s look at how lakeFS works under the hood and manages metadata. Every commit in lakeFS is a collection of pointers to objects. Because objects in the store are immutable, lakeFS can use a copy-on-write approach: when an object changes, a new object is written, and the next commit points to that new object.
Files that don’t change between commits are shared, so multiple commits point to the same physical object. This is very useful for developing in isolation against production data: to create an environment identical to production, you simply create a branch, and branching takes milliseconds because lakeFS doesn’t copy the data.
Many lakeFS users have data lakes spanning petabytes, yet they can still branch out in milliseconds because it’s a metadata-only operation.
What does this mean for testing purposes? You can easily create 20, 50, or 100 test environments without copying a single physical object from the production environment, which cuts storage fees by 20 to 80%. And that’s just one of several benefits of data versioning for this use case.
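In practice, a short-lived test environment then boils down to creating and deleting a branch. Here’s a sketch using the high-level lakeFS Python SDK (the `lakefs` package); the repository and branch names are placeholders, and the exact method signatures may differ between SDK versions:

```python
import lakefs  # high-level lakeFS Python SDK; assumes credentials are already configured

repo = lakefs.repository("my-data-lake")  # placeholder repository name

# Create an isolated test branch from production: a metadata-only operation,
# so it completes in milliseconds regardless of how large the lake is.
test_branch = repo.branch("etl-test-run-123").create(source_reference="main")

# ... run the ETL and its tests against lakefs://my-data-lake/etl-test-run-123 ...

# Throw the environment away once the test run is done.
test_branch.delete()
```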
Advantages of using lakeFS in big data testing
Reduction of storage costs by 20-80%
When you use lakeFS, you can test your applications and operate on your data lake without ever copying an object. Each developer gets a fully functioning, production-identical environment while you save on storage costs.
Furthermore, data lakes retain most of the same files for long periods of time, and only a subset of files changes on a regular basis. lakeFS provides deduplication for the entire data lake over time.
Increased engineering productivity
Applying engineering best practices makes engineers more effective and less frustrated. They can quickly get the environment they need, work on it, and then throw it away or merge it back in seconds.
99% faster recovery from production outage
If something bad happens in your production environment, you can easily roll back to the last known good state of your entire data lake in milliseconds.
You can get all these benefits through our open-source solution, which you host yourself, or through our managed offering, which comes with service-level guarantees and other advantages.
Take lakeFS for a spin in the lakeFS playground to see how it works for use cases ranging from big data testing to developing ETL pipelines in isolation.

