The lakeFS team
May 31, 2022

As organizations develop new product offerings and data streams, data engineers deal with the largest and most complex datasets ever. Add growing teams and new data orchestration tools into the mix, and you've got yourself a pretty complicated data landscape.

As the global cloud storage market grows, so does the complexity of managing data. This infographic shows a glimpse into its current and future status:

Source: https://www.emergenresearch.com/industry-report/cloud-object-storage-market

In a world where data volumes keep scaling exponentially, failures and inconsistencies in production can become a common pain point for data teams. No team can afford to deal with them manually on top of the many other challenges it faces every day.

In this article, we will discuss three real-life examples of such common pains and share some proven strategies that effective data engineers practice as a habit to achieve efficient data architecture and management. 

Avoiding data failures during the development stage 

Best practice #1: Run experiments and test your data under different configurations, ETL code versions, computation tools, and compression algorithms

Teams should be able to try new tools, upgrade versions, and evaluate code changes quickly. How else can they drive innovation?

To run experiments with peace of mind, data engineers need a way to isolate a data segment from the data lake. This is where versioning helps by allowing teams to create a separate branch with data for worry-free experimentation and testing.

Versioning also opens the door to comparing results between branches that hold different experiments. Understanding the impact of a potential change is easier when you can compare your branch directly to the main branch.
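
To make the comparison idea concrete, here is a minimal sketch in plain Python. It is a toy model, not the lakeFS API: each branch is assumed to be a mapping from object path to a content checksum, and the function name `diff_branches`, the paths, and the checksums are all hypothetical.

```python
def diff_branches(main, branch):
    """Compare two snapshots that map object path -> content checksum."""
    added = sorted(p for p in branch if p not in main)
    removed = sorted(p for p in main if p not in branch)
    changed = sorted(p for p in branch if p in main and branch[p] != main[p])
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots of the main branch and an experiment branch.
main = {
    "events/part-0.parquet": "a1",
    "events/part-1.parquet": "b2",
}
experiment = {
    "events/part-0.parquet": "a1",
    "events/part-1.parquet": "c3",  # rewritten by the experiment
    "events/part-2.parquet": "d4",  # new output of the experiment
}

print(diff_branches(main, experiment))
```

A real versioning system would compute this from its own metadata, but the output has the same shape: what the experiment added, removed, or rewrote relative to main.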

Best practice #2: Experiment in isolation without making multiple copies of your data

Engineers can run experiments and test code in full isolation by turning to data versioning as well. Why is working in isolation so beneficial? By creating a branch of the data, teams get an isolated snapshot where they can try the riskiest moves without worrying that other users get exposed to them. 

One common mistake data engineers make is copying large amounts of data because the only alternative seems to be compromising test quality by running on a data subset or an outdated version of the data lake.

Versioning offers a way out: teams can catch massive data quality issues before they reach production, without copying the entire data lake and testing a new job on that copy. No team wants to end up with multiple clones of a data lake that need managing and maintaining.
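
Why does a branch not require a copy of the data? The sketch below is one way to picture it, under the assumption that a branch is only a small mapping from paths to stored objects. `ToyRepo` and its methods are hypothetical names for illustration and do not describe how any particular product is implemented.

```python
class ToyRepo:
    """Toy model: a branch maps path -> object id; the bytes are stored once."""

    def __init__(self):
        self.objects = {}              # object id -> bytes
        self.branches = {"main": {}}   # branch name -> {path: object id}

    def put(self, branch, path, data):
        object_id = f"obj-{len(self.objects)}"
        self.objects[object_id] = data
        self.branches[branch][path] = object_id

    def create_branch(self, name, source="main"):
        # Only the path -> id mapping is duplicated; the stored bytes are shared.
        self.branches[name] = dict(self.branches[source])

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]


repo = ToyRepo()
repo.put("main", "sales/2022.parquet", b"production rows")
repo.create_branch("experiment")
repo.put("experiment", "sales/2022.parquet", b"rows from the new job")

print(repo.get("main", "sales/2022.parquet"))        # main is untouched
print(repo.get("experiment", "sales/2022.parquet"))  # only the branch changed
print(len(repo.objects))                             # objects stored in total
```

Creating the branch stored no new data; a new object appeared only when the experiment wrote something different.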

Best practice #3: Scan commit history for consistency to identify potential bugs

Debugging an issue in a data lake is hard when you don’t know the exact state of your data when the error occurred.

The best way around that is to check specific commits in the repository’s commit history. This is how teams can generate consistent historical versions of their data. And when troubleshooting, you can access the state of the data lake right at the point in time of the issue to identify its root cause faster.
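
As a rough illustration of looking up the state of the lake at the time of an issue, the sketch below finds the latest commit made at or before a given moment. The commit log, its ids, and the function `commit_at` are assumptions made for this example.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical commit log, oldest first: (commit time, commit id).
commit_log = [
    (datetime(2022, 5, 1, 2, 0), "c1"),
    (datetime(2022, 5, 2, 2, 0), "c2"),
    (datetime(2022, 5, 3, 2, 0), "c3"),
]


def commit_at(log, moment):
    """Return the id of the latest commit made at or before `moment`, or None."""
    times = [committed_at for committed_at, _ in log]
    index = bisect_right(times, moment)
    return log[index - 1][1] if index else None


incident = datetime(2022, 5, 2, 14, 30)
print(commit_at(commit_log, incident))  # c2
```

With that commit id in hand, you would read the data exactly as it was when the error occurred instead of guessing from the current state.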

Real-life example – Upgrading Spark and using Reset action 

The issue: Imagine that you’ve just installed the latest version of Apache Spark. You’re now ready to test your Spark jobs to verify that the upgrade doesn’t have any undesired side effects. Some jobs fail halfway through, leaving you with intermediate partitions, data, and metadata.  

For lack of a better thing: If a Spark job fails, you’ll be forced to spend time on manual cleanup.

Doing it right: You can create a branch that will only be used to test your Spark jobs. If any of them fail, you can easily reset the branch to its original state without any concern about the intermediate results of your last experiment. Then, you can perform another test in an isolated branch and hope to succeed! The good news? Reset actions are atomic and immediate, so there’s no need to do any manual cleanup. Once you complete the testing, you can delete this experimental branch and be sure that all the data that isn’t used on any other branches will be deleted with it.
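
The reset-on-failure pattern can be sketched as follows. This is a toy model: the branch is a plain dictionary, and the saved copy stands in for the branch's last commit. In a real versioning system a reset would move a pointer rather than copy state. The names `run_with_reset` and `failing_job` are hypothetical.

```python
import copy


def run_with_reset(branch_state, job):
    """Run `job` against a branch; restore the last committed state if it fails."""
    committed = copy.deepcopy(branch_state)  # stands in for the last commit
    try:
        job(branch_state)
        return True
    except Exception:
        branch_state.clear()
        branch_state.update(committed)  # discard all partial output at once
        return False


def failing_job(state):
    state["out/part-0.parquet"] = "partial"  # written before the crash
    raise RuntimeError("executor lost")


test_branch = {"in/events.parquet": "v1"}
succeeded = run_with_reset(test_branch, failing_job)
print(succeeded, test_branch)
```

After the failed run, the branch contains only its original input, so the next test starts from a clean state.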

branching_1
Source: lakeFS

Developing excellence in the deployment stage

Best practice #1: Validate new data before it goes into the lake 

Even when you think your data lake has reached stability, mistakes might still happen – a data source may contain corrupted data, or someone might add a job without testing it properly. 

Fixing such issues after they occur is sometimes unavoidable. But imagine being able to prevent them in the first place. That’s a whole new level of efficiency. 

Software developers do that by running CI/CD tests on any new code merged into the main codebase. Why not practice something similar for data?

This is where pre-merge hooks come in. They detect any issue in the data entering the lake and prevent problems from cropping up in production data. Engineers can define pre-merge and pre-commit hooks to run tests that enforce a schema and validate data properties to identify issues before they reach production.
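
A hook that enforces a schema might boil down to a check like the one below. It is a sketch under a stated assumption: "compatible" here means existing columns keep their names and types, while new columns are allowed. The function name and the schemas are hypothetical, and a real hook would read the schemas from the data files rather than from dictionaries.

```python
def schema_problems(old_schema, new_schema):
    """List changes that break existing readers: removed columns, changed types."""
    problems = []
    for column, dtype in old_schema.items():
        if column not in new_schema:
            problems.append(f"column removed: {column}")
        elif new_schema[column] != dtype:
            problems.append(
                f"type changed: {column} {dtype} -> {new_schema[column]}"
            )
    return problems


current = {"id": "int64", "amount": "double"}
with_new_column = {"id": "int64", "amount": "double", "currency": "string"}
breaking = {"id": "string"}

print(schema_problems(current, with_new_column))  # no problems: only an addition
print(schema_problems(current, breaking))
```

A hook would block the commit or merge whenever the returned list is not empty.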

Best practice #2: Ensure that only quality data streams into the lake

That’s where version control helps data engineers again. By using a version control solution, they can retain commits for a configurable duration. That way, readers can query data from the latest commit (or any other point in time), and writers introduce new data atomically, preventing inconsistent data views. This has a massive impact on the quality of data in the lake.

Real-life example – Validating new data 

The issue: Organizations often enforce validation checks such as:

  • No user_* columns except under /private/…
  • Only (*.parquet | *.orc | _delta_log/*.json) files allowed
  • Only backward-compatible schema changes are allowed under /production
  • New tables on the main branch need to be registered in the metadata repository first (with owner and SLA)
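
The first two rules in the list above could be automated along these lines. This is a minimal sketch: the function `check_object`, the example paths, and the column names are hypothetical, and the column list is assumed to be supplied by whatever reads the file's schema.

```python
from fnmatch import fnmatchcase

# Patterns taken from the rule "only parquet, orc, or Delta log JSON files".
ALLOWED_PATTERNS = ("*.parquet", "*.orc", "_delta_log/*.json", "*/_delta_log/*.json")


def check_object(path, columns):
    """Return the rule violations for one object and its column names."""
    violations = []
    if not any(fnmatchcase(path, pattern) for pattern in ALLOWED_PATTERNS):
        violations.append(f"{path}: file type not allowed")
    if not path.startswith("/private/"):
        for column in columns:
            if column.startswith("user_"):
                violations.append(
                    f"{path}: column {column} is only allowed under /private/"
                )
    return violations


print(check_object("/private/users/part-0.parquet", ["user_email"]))
print(check_object("/public/orders/part-0.parquet", ["order_id", "user_email"]))
print(check_object("/public/orders/notes.csv", []))
```

Each rule is a small, independent function of the object being written, which is what makes these checks easy to run automatically instead of by hand.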

For lack of a better thing: Data engineers end up performing these checks manually, which is time-consuming.

Doing it right: By using a versioning solution for data, you can create a designated branch to ingest new data (“new-data-1” in the image). There, you can run automated tests to validate predefined best practices as pre-merge hooks. New data will be automatically merged into the main branch if it passes validation. And if you encounter a hiccup and the validation fails, you’ll get alerted about this and new data won’t be exposed to consumers.
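
The gate itself can be pictured as a function that merges only when every check comes back clean. The sketch below is a toy model in which branches are dictionaries of path to file size; `merge_if_valid` and `no_empty_files` are hypothetical names, and the empty-file rule is just a stand-in for your own validations.

```python
def merge_if_valid(main, ingest_branch, hooks):
    """Merge `ingest_branch` into `main` only if every hook reports no failures."""
    failures = [message for hook in hooks for message in hook(ingest_branch)]
    if not failures:
        main.update(ingest_branch)  # consumers see the new data only now
    return failures


def no_empty_files(branch):
    # Hypothetical check: every object must have a non-zero size.
    return [f"{path}: empty file" for path, size in branch.items() if size == 0]


main = {"events/day=1.parquet": 1024}
new_data_1 = {"events/day=2.parquet": 0}

failures = merge_if_valid(main, new_data_1, [no_empty_files])
print(failures)  # the reasons to alert on
print(main)      # unchanged, because validation failed
```

When the list of failures is empty, the merge goes through and the new data becomes visible on main in one step.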

Pre-merge hooks ensure that the main lake is never compromised.

branching_4
Source: lakeFS

Fast recovery in production

Automation goes hand in hand with faster response times, and that includes working with data lakes. When data is in production, engineers rarely keep versions of it. But those versions are incredibly valuable for troubleshooting. Teams can use them to reproduce errors in production and understand why a model trained on the data produces different results. 

The ability to travel back in time between different data versions is a fundamental capability. That way, teams no longer need to keep and maintain a copy of the entire data lake for each model they train.

Best practice #2: Onboard version control to quickly recover from issues

When you encounter errors in production data, you might be tempted to just fix them manually instead of reverting to the previous high-quality version of your data lake. After all, these issues usually bring a lot of pain and require quick fixes. And analyzing the root cause to resolve the problem takes time that you might not have.

This is where automated rollback helps. Data teams can roll back to the last high-quality version of the data lake to buy time for diagnosing and fixing the root cause. You can only imagine how much drama that helps to avoid. 
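
An automated rollback needs two things: a way to pick the last known-good version and a way to point the branch back at it. The sketch below assumes each commit records whether its quality checks passed. The `checks_passed` field, the commit ids, and `last_good_commit` are hypothetical.

```python
def last_good_commit(history):
    """Return the id of the newest commit whose quality checks passed, or None."""
    for commit in reversed(history):
        if commit["checks_passed"]:
            return commit["id"]
    return None


# Hypothetical history of the main branch, oldest first.
history = [
    {"id": "c1", "checks_passed": True},
    {"id": "c2", "checks_passed": True},
    {"id": "c3", "checks_passed": False},  # the bad deployment
]

branch_heads = {"main": "c3"}
target = last_good_commit(history)
if target is not None:
    branch_heads["main"] = target  # rollback moves a pointer, not the data
print(branch_heads)
```

The faulty commit stays in the history, so it remains available for diagnosing the root cause after consumers are back on good data.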

Troubleshooting is easier that way since engineers can investigate production errors on a snapshot of the dataset’s state at the time of failure. 

Real-life example – Troubleshooting and reproducing a bug in production

The issue: Suppose you upgraded Spark and deployed some changes in production. After a while, you identify a data quality issue, a performance degradation, or a spike in your infrastructure costs. It’s a bug that needs fixing!

For lack of a better thing: You have to manually search for the bug, having no snapshot of the dataset state at the time when the issue occurred.

Doing it right: Using a versioning solution, you can open a branch of your lake from the specific merge or commit that introduced the changes to production. Thanks to the metadata saved there, you can quickly reproduce all aspects of the environment and the issue itself to debug it. In the meantime, you can revert the main branch to the previous point in time or keep it as is.

branching_3
Source: lakeFS

Wrap up

Data teams are already under a lot of pressure and will be facing even more challenges in the future. The best practices we shared above offer a productive response to problems that arise when organizations manage data at scale. 

Handling data similarly to code pays off. Using Git-like practices such as versioning, branching, rollback, and hooks, data teams can leverage an ecosystem of proven methods to automate routine tasks and keep their data lakes free of low-quality data and bugs.

How can you implement these practices in your organization? Start by equipping your data team with the right solution that brings Git-like operations to the world of data. 

lakeFS is an open-source project that does just that. Your team can try it in the playground version, without installing anything, to see how this approach increases the quality of data lakes.
