Iddo Avneri
Last updated on April 29, 2024

ETL testing is critical when integrating or migrating your data to a new system. It acts as a safety net for your data, ensuring completeness, accuracy, and reliability so you can make better-informed decisions.

ETL testing may be complex owing to the volume of data involved. Furthermore, the data is almost always varied, adding an extra layer of complexity. 

However, it doesn’t have to be this way. With the right approach and tools, even new team members can get started with ETL testing quickly.

This post explains everything you need to know about ETL testing. Continue reading to learn about recommended practices for ETL testing and how to enhance data quality in your business.

What is ETL?

An ETL process extracts data from multiple sources, often in a mix of structured and unstructured formats.

Then, an ETL process transforms the data into a format that will be easier to consume for specific business use cases. This transformation typically includes:

  • Data cleaning (e.g., data deduplication, standardizing date and time formats, etc.)
  • Data restructuring (e.g., joining different data sets into a single table, summarization, etc.)

Finally, the transformed data is loaded into a central repository (for example, a data warehouse or data lake). 

What is ETL Testing?

ETL testing is the process of evaluating and verifying that the ETL (Extract, Transform, Load) processes work correctly and deliver data of high quality. It includes aspects such as database testing, data warehouse testing, performance testing, data accuracy testing, data transformation testing, and more. ETL testing tools can cover the entire range of your data pipeline, from data sources to the existing data warehouse or data lake. 

Tasks involved in ETL testing

ETL Testing often involves the following tasks:

  • Understanding the data that will be used for reporting
  • Examining the data model 
  • Source to target mapping
  • Data checks on source data
  • Package and schema validation
  • Data verification in the target system
  • Validation of data transformation and aggregation rules
  • Data comparison between the source and destination systems
  • Data integrity and quality controls in the target system
  • Performance testing on data

ETL Testing Tutorial – Practical Steps with lakeFS

What is lakeFS?

lakeFS is an open-source solution that applies software engineering best practices to data engineering. Concepts such as Dev/Test environments and CI/CD are harder to implement in data engineering since you’re not just managing code but also data. 

lakeFS provides version control over the data lake and uses Git-like semantics to create and access those versions, so every engineer feels at home with lakeFS in a few minutes.

Scalability

lakeFS’s data versioning works by manipulating metadata rather than copying objects. So, with lakeFS, setting up a new environment identical to production will:

  1. Be done in milliseconds for data lakes of any scale.  
  2. Not consume additional storage.

Using lakeFS, each developer can test ETLs by branching out of production as many times as they want, immediately and at no additional cost.

Reproducibility

Once you’ve developed new versions of your ETL, those can easily be tested against historical commits. That way, you can compare the output of different versions of your ETL against the same input.
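
For example, a branch can be created from a specific historical commit instead of the head of main. Here is a minimal sketch with lakectl, assuming the same placeholder repository name used in the tutorial below (the commit ID is a placeholder you would take from the log output):

# List recent commits on main to pick the historical input you want to reproduce
lakectl log lakefs://some-repo-name/main

# Branch off that specific commit and run the new ETL version against it
lakectl branch create lakefs://some-repo-name/test-etl-repro -s lakefs://some-repo-name/<commit_id>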

Accuracy

Since branches are created directly on top of production data, ETL testing with lakeFS runs against production-identical data, including all the complexities that come with it. Rest assured that this is precisely what the production environment looks like.

Isolation

Have you accidentally deleted all your data? No problem. Delete the branch and start testing again with your new code against a fresh branch. 
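
As a quick sketch, using the same placeholder names as the tutorial below, recovering from a broken experiment takes two commands:

# Throw away the broken branch (production data on main is untouched)
lakectl branch delete lakefs://some-repo-name/test-etl

# Recreate a fresh branch from the current state of production
lakectl branch create lakefs://some-repo-name/test-etl -s lakefs://some-repo-name/main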

Automation

Take advantage of lakeFS post-commit hooks to automate checks by triggering them at specific moments in your data’s journey through your pipeline.
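
Hooks in lakeFS are defined in action files stored under the _lakefs_actions/ path of the repository. Below is a minimal sketch of what such a definition might look like; the repository, branch, and webhook URL are placeholders, and the exact schema may differ slightly between lakeFS versions:

# Write a minimal post-commit action that calls a webhook running your checks
cat <<'EOF' > etl_checks.yaml
name: ETL quality checks
on:
  post-commit:
    branches:
      - test-etl
hooks:
  - id: run_etl_checks
    type: webhook
    properties:
      url: http://example.com/run-etl-checks   # hypothetical endpoint running your validations
EOF

# Upload the action file into the repository and commit it so the hook takes effect
lakectl fs upload lakefs://some-repo-name/test-etl/_lakefs_actions/etl_checks.yaml -s etl_checks.yaml
lakectl commit lakefs://some-repo-name/test-etl -m "Add post-commit ETL quality checks"

From that point on, every commit to the test-etl branch triggers the webhook, so your checks run automatically at that stage of the pipeline.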

Get started: Test your first ETL with lakeFS

1. Deploy a lakeFS Server

docker run --pull always -p 8000:8000 treeverse/lakefs run --local-settings

2. Configure server

You might want to configure the server to run against your cloud storage.
For example, to connect to AWS:

docker run --pull always -p 8000:8000 \
  -e LAKEFS_BLOCKSTORE_TYPE='s3' \
  -e AWS_ACCESS_KEY_ID='YourAccessKeyValue' \
  -e AWS_SECRET_ACCESS_KEY='YourSecretKeyValue' \
  treeverse/lakefs run --local-settings

3. Create a new repository

Create a new repository (in this example, I’m using the lakectl command line interface, but this can be achieved from the UI as well):

lakectl repo create lakefs://some-repo-name s3://some-bucket-name

4. Import your data

Import your data into lakeFS (this only creates metadata and does not copy any of the objects). Once again, this can also be done from the web user interface:

lakectl import --from s3://bucket/my-data/path/ --to lakefs://some-repo-name/main/path/ --merge

5. Create a development branch

Create a development branch out of your production data (or via UI):

lakectl branch create lakefs://some-repo-name/test-etl -s lakefs://some-repo-name/main
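
Once your changes look good on the test branch, a possible follow-up (again using the same placeholder names) is to commit the results and either merge them back to main or simply discard the branch:

# Commit the output of the test run on the development branch
lakectl commit lakefs://some-repo-name/test-etl -m "Run updated ETL on test branch"

# Promote the tested changes to production in a single atomic merge
lakectl merge lakefs://some-repo-name/test-etl lakefs://some-repo-name/main

# Or throw the experiment away without affecting production
lakectl branch delete lakefs://some-repo-name/test-etl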


ETL Test Case Examples

Data validation for consumer data

This test case verifies that customer information is accurately validated before being loaded into the target system.

Data transformation for sales data

This test case verifies that sales data is correctly transformed on its way from the source to the destination system.

Data loading test for financial transactions

This test case ensures that financial transactions are entered into the target system without mistakes or duplication.

Data reconciliation test for inventory levels

This test aims to ensure that inventory levels in the source and destination systems match.

Data completeness test for employee records

This test verifies that all necessary employee information is present and that no records or fields are missing.

Common Types of ETL Bugs

  • Cosmetic and user interface issues – Bugs related directly to the software’s GUI, such as font style, font size, colours, alignment, spelling errors, and navigation.
  • Boundary Value Analysis (BVA)-related bugs – Failures around minimum and maximum input values.
  • Equivalence Class Partitioning (ECP)-related bugs – Incorrect handling of valid and invalid input types.
  • Input/output bugs – Valid values are not accepted, or invalid values are accepted.
  • Calculation bugs – Mathematical mistakes, or the final output is incorrect.
  • Load condition bugs – The system does not support multiple users or the customer’s expected load.
  • Version control issues – The logo or version information does not match, or no version information is provided; these typically surface during regression testing.
  • Hardware (H/W) bugs – The device does not respond to the application.
  • Help source bugs – Mistakes in the help documents.

Bonus: Don’t Only Test ETLs; Promote Data Quality

Data quality testing is a vital component of every data-driven organization’s ETL (Extract, Transform, Load) process. ETL is used to extract data from numerous sources, convert it to an appropriate format, and load it into a destination system, which is usually a data warehouse or a data lake. The data entered into the target system is then used for reporting, analytics, and decision-making.

Here are a few reasons why data quality testing is an essential component of the ETL process:

  • Accurate Decision Making – The correctness and reliability of the data used in the decision-making process have a significant impact on the organization’s decision quality. If the data is of low quality, choices based on it may result in inaccurate conclusions and wrong actions.
  • Consistency and Uniformity – Data quality testing ensures that data follows a set of predetermined norms and standards, fostering consistency and uniformity among the organization’s datasets. This uniformity is critical for producing accurate and comparable reports and analyses.
  • Compliance and Governance – Many businesses are subject to stringent regulatory and data governance laws. Data quality testing guarantees that the data adheres to these requirements, reducing legal and compliance concerns.
  • Reduced Errors and Costs – Detecting data quality concerns early in the ETL process lowers the chance of errors in downstream applications. This, in turn, lowers the expenses associated with resolving data-related issues later in the process.
  • Enhanced Data Integration – Data is often collected from various sources, each with its own format and structure. Data quality testing guarantees that data integration is effective and that modified data is compatible with other datasets.
  • Improved Data Trust – High-quality data inspires trust in users, analysts, and stakeholders who rely on it. When data quality is ensured, consumers can rely on the information, allowing them to make better-informed decisions.
  • Preventing Data Silos – Poor data quality can lead to the formation of data silos, in which separate teams or departments keep their own datasets because they don’t trust data from other sources. Data quality testing helps to break down these divisions and promote cooperation by establishing a single source of truth.
  • Detection of Anomalies and Outliers – Data quality testing assists in identifying anomalies, outliers, and inconsistencies in the data. These insights can lead to more investigations and help identify potential data quality concerns that might otherwise go undiscovered.
  • Efficient Resource Utilization –  By identifying and resolving data quality concerns early on, resources are better utilized since less time and effort is wasted on rework or debugging downstream operations.
  • Continuous Improvement –  Data quality assessment should be a continuous process rather than a one-time exercise. Continuously monitoring data quality helps companies uncover trends and patterns, allowing them to optimize their data collection, transformation, and loading processes over time.

Conclusion

ETL testing is a key process that confirms that data from source systems has been properly extracted, accurately transformed, and loaded in the appropriate format. Ultimately, it indicates whether your data is of good quality. It will detect duplicate data, data loss, and any missing or inaccurate data, leading you towards higher data quality and better outcomes.
