In today’s world of data engineering, we need to store far more than simple text in relational or non-relational databases, tables, or documents. The data we handle includes email, images, video, web pages, audio files, datasets, sensor data, and other types of media content. In short, a large share of it is unstructured.
A commonly cited estimate is that around 80% of the data in an organization is unstructured. In large enterprises and organizations, storing and managing this volume of data can be challenging and very costly.
The move to data lake architecture
Object-based storage systems emerged as a solution: their ease of use and the benefits they bring made them a preferred method for data archiving, backup, and storing more or less any type of static content. Many would expect such a large volume of data to be stored poorly, but with object storage you can still ensure high quality through a data lake model. This is a main reason the data lake model is taking over, and why object storage has become the preferred option for static data management.
Data quality challenge within the data lake
The move to data lakes introduces a new challenge in maintaining and ensuring data quality within the data lake as time goes by. Naturally, the quality of the data we introduce determines the overall reliability of our data lake. Specifically, the data ingestion stage is critical for ensuring the soundness of our service and data.
Despite the scalability and performance advantages of running a data lake on top of object stores, enforcing best practices and ensuring high data quality remains extremely challenging.
With that in mind, data engineers should continuously test newly ingested data to ensure it meets data quality requirements, much like software engineers run automated tests on new code. Then, when a mistake happens and ‘bad data’ is ingested into the lake, they have a feasible way to reproduce the ingestion error as of the time of failure and roll back to the previous high-quality snapshot of their data. Sounds right, doesn’t it?
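To make the idea of testing newly ingested data concrete, here is a minimal sketch in plain Python. The record layout, the `Check` class, and the two rules are hypothetical and chosen only for illustration; a real pipeline would plug in its own rules and data framework.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict  # one ingested row, e.g. {"user_id": 1, "amount": 9.5} (hypothetical layout)


@dataclass
class Check:
    name: str
    predicate: Callable[[Record], bool]  # returns True when the record passes


def validate_batch(records, checks):
    """Return (row_index, check_name) for every record that fails a check."""
    failures = []
    for i, record in enumerate(records):
        for check in checks:
            if not check.predicate(record):
                failures.append((i, check.name))
    return failures


# Hypothetical data quality requirements for this batch.
CHECKS = [
    Check("user_id is present", lambda r: r.get("user_id") is not None),
    Check(
        "amount is non-negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]

batch = [
    {"user_id": 1, "amount": 9.5},
    {"user_id": None, "amount": 3.0},  # fails the first check
    {"user_id": 3, "amount": -1},      # fails the second check
]

failures = validate_batch(batch, CHECKS)
if failures:
    print("Rejecting batch:", failures)
```

The point of returning the failing row indexes, rather than a bare pass/fail flag, is that it leaves a record of what went wrong at ingestion time.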
Data engineers are the first responders in the data-quality battle
Data engineers, who are responsible for implementing and maintaining the entire data pipeline structure of the company, are becoming more and more fundamental to their organizations. However, their success depends on being armed with appropriate tools that will allow them to optimize efficiency and make the most of their time.
What is it like to be a data engineer in a large-scale, data-driven organization? A core mission of data engineers is implementing and maintaining an ever-growing number of data sources while serving the requirements of their peers – data analysts, data scientists, backend engineers, and more.
Some requirements also come from other stakeholders within the organization, such as compliance and information security officers. Often, these stakeholders ask data engineers to implement data governance processes such as data retention and deletion of sensitive data.
In an ever growing, complex data lake with thousands of running ETL jobs, very complex DAGs that orchestrate the entire pipeline, etc., each task has the potential of becoming a hideous nightmare. Add to that the need to overcome issues in production, and the challenge just grows and grows.
Here are 5 common mistakes that data engineers tend to make. They distract engineers from their main missions, increase their frustration, create repetitive and exhausting work, and waste their precious time.
Don’t worry, there are also multiple solution paths for you to choose from!
1. During the development stage, data engineers are forced to create multiple copies of the entire lake in order to test their code in isolation
When adding a data source to a data lake or changing an existing one, testing it first on top of the existing data is a step that should never be skipped. However, in order to test the pipeline on the entire existing data, there are currently two non-ideal (to say the least) practices:
a. Testing on the real production data, which is a very bad tactic: a faulty job can corrupt the very data that consumers rely on.
b. To avoid testing on production data, there’s the popular practice of copying the entire data lake and testing the new job on this copy. And this, you guessed it, is a bad practice as well. It creates multiple clones of the data lake, which is very costly for the organization. Maintaining these copies becomes impractical, and if development of the data source takes a long time, the copy becomes obsolete and no longer reflects the data lake, so the entire data lake has to be copied all over again. Do you see where I’m going?
2. Also in development, data teams compromise on the quality of the data by testing it on a subset of the data or an out-of-date version of the data lake
To avoid multiplying copies of the data lake, some data engineers argue that testing on a small subset of the data – for example, 20% of it – is good enough and acts as a reasonable solution. Well, is it really? When the data lake becomes huge, even these partial copies become costly. Besides, we often hear scary stories about how this “best practice” led to missing serious data quality issues that only became visible in production – which is exactly what testing is supposed to prevent in the first place.
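A rough back-of-the-envelope calculation shows why a subset can hide problems. The sketch below assumes a uniform random sample drawn without replacement and a handful of bad rows scattered through the data; the numbers are made up for illustration, and real quality issues are often not uniformly distributed, which can make things better or worse.

```python
from math import comb


def prob_sample_misses_all(total_rows: int, bad_rows: int, sample_rows: int) -> float:
    """Probability that a uniform random sample (without replacement)
    contains none of the bad rows (hypergeometric, zero successes)."""
    if sample_rows > total_rows - bad_rows:
        return 0.0  # the sample is too large to avoid every bad row
    return comb(total_rows - bad_rows, sample_rows) / comb(total_rows, sample_rows)


# Illustrative numbers: 5 bad rows out of 1,000, tested on a 20% sample.
p = prob_sample_misses_all(total_rows=1000, bad_rows=5, sample_rows=200)
print(f"Chance the 20% sample contains no bad row at all: {p:.1%}")
```

With these made-up numbers the sample misses every bad row roughly one time in three, so a green test run on the subset says little about the full data set.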
3. When deploying code to production, data engineers change the data pipeline logic, which changes the data available in the data lake, without validating the resulting data before merging their code
Even when the data lake reaches stability, all the data flows properly and everybody is happy, mistakes still tend to happen. For instance, a data source may bring in corrupted data, or someone might add a job without first testing it properly. You know, things happen.
Solving such issues after they occur is better than nothing, but identifying them ahead of time and preventing them before they happen – that’s a different level of efficiency.
In code, this is easy and is common practice today: CI/CD tests run before any new code is merged into the main codebase, thereby ensuring the quality of the code that runs in production.
How come the same isn’t common practice in data, with pre-merge hooks that detect any issue with the data entering the lake and prevent issues in production data?
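As a sketch of what such a pre-merge gate could look like, the toy example below keeps "production" and "staged" data as plain Python lists and only promotes the staged rows if every check passes. The function names and checks are hypothetical and do not represent the hook mechanism of any particular tool.

```python
def run_pre_merge_checks(staged_rows, checks):
    """Run every check against the staged data; return the names of failed checks."""
    return [name for name, check in checks.items() if not check(staged_rows)]


def merge_if_valid(main_rows, staged_rows, checks):
    """Return (new_main, failed_checks). Main is returned unchanged if any check fails."""
    failed = run_pre_merge_checks(staged_rows, checks)
    if failed:
        return main_rows, failed
    return main_rows + staged_rows, []


# Hypothetical checks that look at the staged data as a whole.
CHECKS = {
    "not empty": lambda rows: len(rows) > 0,
    "no null ids": lambda rows: all(r.get("id") is not None for r in rows),
}

main = [{"id": 1}]

# A bad batch is blocked and main stays as it was.
main_after_bad, failed_bad = merge_if_valid(main, [{"id": 2}, {"id": None}], CHECKS)

# A clean batch passes the gate and is merged.
main_after_good, failed_good = merge_if_valid(main, [{"id": 2}], CHECKS)
```

The design choice that matters is that validation runs against the staged data before it becomes visible to consumers, the same way CI runs before a code merge.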
4. When the data is in production, engineers don’t keep versions of the production data to enable troubleshooting
Keeping versions of the production data helps reproduce errors in production, as well as understand why a model that was trained on the data produces different results.
The ability to travel back in time between versions is a fundamental capability that eases the lives of data engineers. It removes the need to manually keep and maintain copies of the entire data lake that match each trained model or a given timestamp.
5. When errors in production data occur, engineers fix them manually instead of reverting to the previous high-quality version
As much as we try to avoid them, production data issues are inevitable, and when they do happen they are usually very painful and require rapid fixes. Imagine that the view that serves the management team in making strategic decisions suddenly shows data that is completely inconsistent with the common trends. Or that one of the data sources corrupted the data, but remediation isn’t instant as you need time to analyze the root cause and resolve the problem.
The ability to roll back the data lake is an efficient way to buy time until the root cause is diagnosed and a fix is in place. It enables the organization to return to the latest high-quality version of the data, and reduces a lot of the drama that such errors can cause.
How to avoid these common mistakes in data engineering
At lakeFS, being data engineers and data consumers ourselves, we struggled a lot with all of the above.
When we dove into the root causes, we realized that these mistakes share a common denominator and can therefore all benefit from one conceptual solution: treating our data estates as products and, as such, enabling full lifecycle management of the data. By that we mean adding the following capabilities:
Revert as quality issues occur in production data – If we encounter a quality issue in production, we’d want to be able to simply revert the data to the last commit.
Working in isolation – We might want to work in isolation, so we should be able to branch our data repository and get an isolated environment for our data. We can either develop on it or use it as a staging environment.
Reproducing results – To reproduce results, we need to return to a commit of data in the repository. It’s a consistent snapshot of our repository from the past, so it contains the exact data those results were produced from.
Ensuring that changes are safe and atomic – To make sure that all the changes are safe and atomic, we can use the concept of merging. For example, we would carry out some work in a branch, test it, and – once we’ve ensured that it’s high-quality – merge it into our main branch and expose it to our users.
Experimenting with large and potentially destructive changes – If we want to do something drastic, like chaos data engineering or testing a significant change, we can create a new branch, delete stuff, try running a newer version of Spark, etc. Once we’re done, we can discard the branch while staying confident that nothing will affect production.
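The capabilities listed above can be illustrated with a deliberately tiny in-memory model. This is a toy, not how lakeFS or any real tool is implemented: `ToyDataRepo` and its methods are hypothetical, "merge" is a simple fast-forward that assumes the destination has not changed since branching, and "revert" just moves the branch pointer back to the parent commit.

```python
class ToyDataRepo:
    """Toy model: every commit is an immutable snapshot mapping path -> content."""

    def __init__(self):
        self.commits = [{}]          # commit 0 is the empty snapshot
        self.parents = [None]        # parent commit id of each commit
        self.branches = {"main": 0}  # branch name -> commit id

    def branch(self, name, source="main"):
        """Create a branch pointing at the same commit as source (no data copied)."""
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        """Apply changes (path -> content, None deletes) as a new commit on branch."""
        parent = self.branches[branch]
        snapshot = dict(self.commits[parent])
        for path, content in changes.items():
            if content is None:
                snapshot.pop(path, None)
            else:
                snapshot[path] = content
        self.commits.append(snapshot)
        self.parents.append(parent)
        self.branches[branch] = len(self.commits) - 1
        return self.branches[branch]

    def read(self, ref, path):
        """Read a path from a branch name or a commit id."""
        commit_id = self.branches.get(ref, ref)
        return self.commits[commit_id].get(path)

    def merge(self, source, dest="main"):
        """Fast-forward dest to source: all changes become visible in one step."""
        self.branches[dest] = self.branches[source]

    def revert(self, branch):
        """Point the branch back at the parent of its latest commit."""
        parent = self.parents[self.branches[branch]]
        if parent is not None:
            self.branches[branch] = parent


repo = ToyDataRepo()
good = repo.commit("main", {"events/day1.csv": "id,amount\n1,10"})

# Isolation: a destructive experiment on a branch does not touch main.
repo.branch("experiment")
repo.commit("experiment", {"events/day1.csv": None})

# Atomic exposure: work on a staging branch, then merge it in one step.
repo.branch("staging")
repo.commit("staging", {"events/day2.csv": "id,amount\n2,20"})
repo.merge("staging")

# Revert: suppose day2 turned out to be bad, so main goes back one commit.
repo.revert("main")
```

After the revert, `main` points at the `good` commit again, and reading by commit id (`repo.read(good, "events/day1.csv")`) returns the same snapshot no matter what happened on the branches since.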
Git-like version control helps avoid common data engineering mistakes
Solutions (OSS or cloud) like lakeFS help data teams transform data lakes into a Git-like repository to quickly implement parallel pipelines for experimentation, reproducibility, and CI/CD for data. You can try it out in the playground to see how it works without installing anything.
Basically, such tools let you manage data lakes like code: they provide an interface to the object storage that exposes ‘Git-like’ operations on top of the data through an API. This way you can keep working with the data tools you are used to.
You’ll have a production branch and working branches that you can use long-term or short-term, or discard altogether. You can protect these branches and use commits, commit IDs, tags, and merges. When you want to use data, instead of accessing the object storage directly you access it through lakeFS, specifying a branch or commit ID.
For example, if a customer calls and says they are experiencing a data bug, no problem – you can create a branch right now to get the snapshot of the data as it appeared to that customer and quickly troubleshoot the error.
If you want to try a new algorithm that may delete files, you can create a branch for experimentation and develop in isolation. Once you’re done, you can merge this data back to production or discard the branch.
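To illustrate what "specifying a branch or commit ID" means for the paths your tools read, here is a small helper that builds a versioned address. It assumes a `<scheme>://<repository>/<ref>/<path>` layout, where the ref is a branch name, tag, or commit ID; the helper name, repository name, and paths are hypothetical, so check your tool's documentation for the exact addressing it expects.

```python
def versioned_path(repository: str, ref: str, path: str, scheme: str = "lakefs") -> str:
    """Build an address that pins a read to a branch, tag, or commit ID.

    Assumes the layout <scheme>://<repository>/<ref>/<path>.
    """
    return f"{scheme}://{repository}/{ref}/{path.lstrip('/')}"


# Same object, two different versions of the lake (all names are made up).
production = versioned_path("example-repo", "main", "events/2024/part-0.parquet")
experiment = versioned_path("example-repo", "try-new-algo", "/events/2024/part-0.parquet")

print(production)
print(experiment)
```

Because only the ref segment changes, a job can be pointed at an experiment branch or a past commit without touching the rest of its code.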
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.