With a new year just around the corner, it is time to look ahead and make some new year resolutions. It is a chance to adopt habits that will make you more successful and more impactful, and also a great opportunity to drop habits that hold you back, professionally and personally.
The life of a data engineer
As a data engineer, especially in a data-driven company, you spend a lot of time serving the needs of various stakeholders and departments. Whether they are data analysts, data scientists, backend engineers, or senior management – they all have requirements and requests. As the person in charge of the data pipelines and data integrations, it is your job to make sure they can do their work successfully and that the data products your company generates are ones consumers can confidently rely on. This is not an easy job.
Then things get complicated
The routine tasks that are part of your life as a data engineer – writing and updating ETLs, testing new and changed ETLs, promoting these pipelines to production – can easily turn into time-consuming nightmares. Deploying a new data pipeline that wasn’t tested well against production data can damage the data and make it unreliable. Analyzing these failures can become a tough mission and a serious time sink – especially without the proper tools. Working in a large data engineering team on the same data sets means you can damage other people’s work, and vice versa, which wastes everyone’s time and energy. And deleting data? Some data engineers get scared just imagining it. Wouldn’t you like this year to be a bit different?
Some advice from our community
As data engineers ourselves, we suffer from many of these pains. Here is our shortlist of new year resolutions – mainly things we will stop doing this year – to eliminate some of the pains and agonies that come with our job.
1. This year I will not delete data
Deleting production data is sometimes mandatory, but it can be very risky. Done incorrectly, it can have severe consequences for the business, because analytics and algorithms that rely on the data will be damaged. A good way to delete data safely is to use data version control: you create a branch of the production data, apply the deletion on that branch, and run your data quality checks there. Only then do you merge the tested data set back to production. And if you later discover that you accidentally deleted things that shouldn’t have been deleted, you can easily revert to the state of the data before the deletion.
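The branch–delete–validate–merge–revert flow above can be sketched with a toy, in-memory model. This is conceptual only: the class and method names here are illustrative, not the API of lakeFS or any real data version control system, which versions objects in the data lake itself.

```python
# Toy model of branch/merge/revert semantics for safe deletion.
# VersionedStore and its methods are hypothetical, for illustration only.
import copy

class VersionedStore:
    def __init__(self, data):
        self.branches = {"main": data}
        self.history = []  # snapshots of main taken before each merge

    def create_branch(self, name, source="main"):
        # A branch starts as a zero-risk copy of the source data.
        self.branches[name] = copy.deepcopy(self.branches[source])

    def merge(self, source, dest="main"):
        # Keep a snapshot of the destination so the merge can be undone.
        self.history.append(copy.deepcopy(self.branches[dest]))
        self.branches[dest] = copy.deepcopy(self.branches[source])

    def revert(self, dest="main"):
        # Restore the destination to its state before the last merge.
        self.branches[dest] = self.history.pop()

store = VersionedStore({"users.csv": ["alice", "bob"], "tmp.csv": ["junk"]})
store.create_branch("cleanup")
del store.branches["cleanup"]["tmp.csv"]     # delete only on the branch
assert "tmp.csv" in store.branches["main"]   # production is untouched
store.merge("cleanup")                       # quality checks passed
store.revert()                               # oops – bring the data back
assert "tmp.csv" in store.branches["main"]
```

The point of the sketch: production (`main`) is never modified directly, and every merge is reversible.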
2. This year I will not test things on production data
Testing new and changed ETLs on top of production data is mandatory – it is the only way to ensure your ETLs do what they should, without bugs. But testing on the production bucket itself is a bad practice that can have terrible consequences. Would you run a new job that deletes data against production, at the risk that it contains a bug? The existing workarounds – copying the entire data lake and testing on the copy, or testing on a subset of it – are error prone and require a lot of extra work. Here, too, the solution is data version control: you can safely test your ETLs on top of the production data without risking production itself. Using branching, you create a branch with the full production data, test your ETL on it, and only then deploy the ETL to production.
3. This year I will not destroy someone else’s work
Collaboration is mandatory if you want to build big things together with your team. But overriding other people’s work is counterproductive – and it happens when the team is not working with the right tools to facilitate collaboration. With data lakes it is very common for several data engineers, analysts, and data scientists to work on the same bucket and the same data set. Without the necessary precautions, they can unintentionally interfere with each other’s work and possibly harm it. The answer is again data version control, just as software engineers use source version control to collaborate effectively. Each data engineer works on their own branch – an isolated copy of the data – and merges into the main bucket only after quality tests and validations. This way they can collaborate while keeping their independence, without stepping on their teammates’ work.
4. This year I will not look for the code that corrupted the data for 6 whole weeks
Every data engineer will relate to this situation: one of the data consumers calls you because something is wrong with the data. When this happens, a series of fun events begins, usually with you trying to reverse engineer everything that happened to the data lake before the error appeared, so you can identify the root cause and fix it. In this case, data version control helps by providing reproducibility: when an issue appears, all you need to do is check on which version of the data it first occurred, and by traveling between versions you can see exactly what changed and reproduce the failure. Weeks of exhausting analysis are replaced by the simple act of moving between versions and debugging what changed and what caused the change.
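With every change captured as a version, "which change corrupted the data?" becomes a search over an ordered history rather than weeks of reverse engineering. A minimal sketch, assuming the corruption persists once introduced (so the history is healthy-then-broken); the version list and health check below are hypothetical stand-ins for commits and a data quality test:

```python
# Binary-search an ordered version history for the first bad version.
# `versions` and `is_healthy` are illustrative; in practice each entry
# would be a commit ID and the check a real data quality test.
def first_bad_version(versions, is_healthy):
    lo, hi = 0, len(versions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_healthy(versions[mid]):
            lo = mid + 1   # corruption happened later
        else:
            hi = mid       # this version is already bad
    return versions[lo]

# Example: row counts per version; the sudden drop at "v4" is the bug.
history = [("v1", 100), ("v2", 102), ("v3", 105), ("v4", 7), ("v5", 7)]
bad = first_bad_version(history, lambda v: v[1] >= 50)
print(bad[0])  # -> v4
```

Only a handful of versions need to be checked even for a long history, which is exactly the "travel between versions" workflow described above.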
5. This year, I will push only validated data to production
Data engineers spend too much time analyzing errors and inconsistencies in the data, usually because the issues are found only after the data is already in production. The smart way to avoid this is to ensure the data is safe before it becomes production data. For code, this is called CI/CD: every piece of code is automatically tested and validated with quality gates before it becomes part of production. Data version control enables the same for data – you can add hooks that run before your data becomes production data. We call it CI/CD for data. It is the safest way to ensure that the data entering the production bucket can be trusted.
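A pre-merge quality gate is, at its core, a set of checks that must pass before a branch is allowed into production. Here is a minimal sketch as a plain function; the specific rules (required columns, minimum row count, no nulls) are illustrative examples of what such a hook might enforce, not a prescribed rule set.

```python
# Sketch of a pre-merge data quality gate ("CI/CD for data").
# In practice this logic would run as a hook before merging to production;
# the validation rules below are illustrative.
def validate_dataset(rows, required_columns, min_rows=1):
    """Return a list of problems; an empty list means the data may merge."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        missing = required_columns - row.keys()
        if missing:
            problems.append(f"row {i} is missing columns: {sorted(missing)}")
        if any(v is None for v in row.values()):
            problems.append(f"row {i} contains null values")
    return problems

rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": None}]
issues = validate_dataset(rows, required_columns={"id", "name"})
assert issues == ["row 1 contains null values"]  # gate fails: block the merge
```

Because the checks run before the merge, bad data never reaches the production bucket, and the failure report points directly at the offending rows.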
Using data version control can help you keep your new year promises and avoid common data engineering pitfalls. Open source solutions like lakeFS help data teams transform data lakes into a Git-like repository to quickly implement parallel pipelines for experimentation, reproducibility, and CI/CD for data.
You’ll have a production branch and working branches that you can keep long-term or short-term, or discard altogether. You can protect these branches and use commits, commit IDs, tags, and merges. When you want to use the data, instead of accessing the object storage directly, you access it through lakeFS, specifying a branch or commit ID.
We hope that this year’s resolutions will include adopting data version control – and we will be happy to hear whether it improves your day-to-day work. Happy new year!