Michal Wosk
December 11, 2022

Introduction

As 2022 comes to an end, many engineering leaders use this time to reflect on the past year and start planning for the year ahead. Data engineering teams are usually swamped with tasks and requirements, often accompanied by failures and issues that demand immediate attention. Using new year planning to reestablish the best practices and methodologies that should be reinforced within the teams is therefore always a recommended approach.

The past year has been very challenging for the tech industry. Even companies that were not directly impacted by the economic slowdown started reevaluating their spending and headcount in order to adapt to a recessionary climate. This means cutting whatever costs can be avoided, and trying to keep the business growing and generating revenue with fewer resources.

What will 2023 look like?

After the three dramatic years the tech industry has experienced since the start of the pandemic, trying to predict what the future holds may sound like a doomed exercise. However, some principles and behaviors are always relevant to running a successful business, and in times of economic uncertainty they become even more important. Cutting operational costs is always needed, and now more than ever. When cutting costs and reducing staff, organizations need to make better use of their existing resources. Teams affected by layoffs and headcount reductions may struggle with their reduced capacity and need to be guided toward a focused, delivery-oriented mindset. And as always, but now more than ever, engineering deliveries should be fast and have a positive impact on customers.

In light of these principles, we created a short list of new year resolutions for every engineering team that deals with big data to consider.

4 new year resolutions to consider

1. Cut storage costs

Reducing costs is the first thing that comes to mind when planning for a year with uncertain economic conditions. As organizations become increasingly reliant on their data, it may seem that this line in the cloud budget is doomed to grow. However, while data will almost certainly keep growing, the good news is that there are effective ways to control the cost of storing it.

Methods of that sort include:

  • Using tiered storage
  • Considering a move from a tabular data architecture to object storage
  • Making sure the data architecture is optimal for the data types the organization uses
  • Using branching to develop and test data in isolation, and more
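To make the tiered-storage idea concrete, here is a minimal sketch of age-based tiering logic. The tier names and the day thresholds are hypothetical examples, not recommendations – in practice they depend on your access patterns and your cloud provider's pricing, and are usually applied via the provider's lifecycle rules rather than hand-rolled code:

```python
from datetime import datetime, timedelta

# Hypothetical age thresholds (in days) for moving objects to colder,
# cheaper storage tiers. Real values depend on access patterns and pricing.
TIER_THRESHOLDS = [
    ("archive", 365),    # rarely accessed: archive tier
    ("infrequent", 90),  # occasionally accessed: infrequent-access tier
]

def choose_tier(last_accessed: datetime, now: datetime) -> str:
    """Pick a storage tier for an object based on how long it has gone unread."""
    age_days = (now - last_accessed).days
    for tier, min_age in TIER_THRESHOLDS:
        if age_days >= min_age:
            return tier
    return "standard"

now = datetime(2023, 1, 1)
print(choose_tier(now - timedelta(days=400), now))  # archive
print(choose_tier(now - timedelta(days=120), now))  # infrequent
print(choose_tier(now - timedelta(days=10), now))   # standard
```

The same policy maps directly onto lifecycle rules that cloud object stores evaluate automatically, so cold data stops accruing hot-storage prices without any manual sweeps.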

2. Increase data engineering teams’ efficiency

In challenging economic times, when forced to freeze hiring, and in some cases even downsize data engineering teams, it only makes sense to do everything we can to increase the existing team’s productivity and impact. We can achieve that very effectively by equipping our data teams with appropriate tools and best practices. These help them stop spending time on repetitive, time-consuming tasks that are mandatory but wasteful. One such best practice is enabling them to develop and test new data sources and ETLs in a designated, isolated environment, without needing to copy anything. This way they save valuable time by not having to create and manage multiple clones of the data. More importantly, they are empowered to thoroughly test their work on the freshest production data and ensure that what they build is of the highest quality and will not create regressions, inconsistencies, or other data failures.
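The “develop in isolation without copying anything” idea can be illustrated with a toy copy-on-write model. This is a deliberately simplified sketch of the concept – not how any real data version control system implements branching – where creating a branch only duplicates lightweight pointers, never the data itself:

```python
class ToyVersionedStore:
    """Toy copy-on-write store: a new branch starts as a set of pointers
    to the same objects as its source, so creating it copies no data."""

    def __init__(self):
        self.branches = {"main": {}}  # branch name -> {path: value}

    def create_branch(self, name, source="main"):
        # Shallow copy: only path->value pointers are duplicated,
        # never the underlying objects themselves.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch, path, value):
        self.branches[branch][path] = value

    def read(self, branch, path):
        return self.branches[branch].get(path)

store = ToyVersionedStore()
store.write("main", "events/2022.parquet", "v1")
store.create_branch("test-new-etl")                       # zero data copied
store.write("test-new-etl", "events/2022.parquet", "v2")  # isolated change
print(store.read("main", "events/2022.parquet"))          # v1: production untouched
print(store.read("test-new-etl", "events/2022.parquet"))  # v2: visible only on the branch
```

The engineer tests the new ETL against production-fresh data on the branch, while `main` stays exactly as it was – no clones to create, track, or clean up.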

3. Empower data engineering teams by helping them focus on value-creating tasks

Data engineers hold a central position in any big-data-driven organization, because their work serves a large number of customers both inside and outside the organization. As data becomes more and more central in many organizations, this becomes even more challenging: whenever an error or inconsistency occurs, it has the potential to become an issue that impacts the business itself.

Reproducing data failures in big data environments – especially in object stores – can be one of the most frustrating, time-consuming tasks a data engineer is required to perform. With data lakes, there is no built-in way to move back to the exact state of the data at the time of failure. Some solutions can help an engineer go back in time for a specific table or set of tables, but in many cases failures and inconsistencies result from changes made widely across the data lake.

Making sure engineers are equipped with the right tools to perform their job to the best of their ability is a managerial responsibility, and the good news is that there are many open source tools that can help. One such tool is lakeFS, which enables reproducibility in testing and production through the data version control capabilities it adds to the data lake.
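Why does lake-wide versioning make reproduction easy? Because a single commit identifier captures every table at once, not just one. The toy sketch below illustrates that idea – it is a conceptual model, not lakeFS’s actual implementation or API:

```python
class ToyCommitLog:
    """Toy lake-wide commit log: each commit snapshots every table
    together, so any past state is recoverable from one commit id."""

    def __init__(self):
        self.commits = []  # list of {path: value} snapshots
        self.working = {}

    def write(self, path, value):
        self.working[path] = value

    def commit(self) -> int:
        self.commits.append(dict(self.working))
        return len(self.commits) - 1  # the new commit id

    def checkout(self, commit_id: int) -> dict:
        # Returns the entire lake exactly as it was at that commit.
        return dict(self.commits[commit_id])

log = ToyCommitLog()
log.write("users", "u1")
log.write("orders", "o1")
good = log.commit()                # state before the bad deploy
log.write("orders", "o2-corrupt")  # a wide change across the lake
log.commit()
snapshot = log.checkout(good)      # reproduce the exact pre-failure state
print(snapshot["orders"])          # o1
```

Per-table time travel would require the engineer to line up a consistent point across many tables by hand; a lake-wide commit id makes “the state at the time of failure” a single, unambiguous reference.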

The investment in integrating such tools pays off quickly, because they help data engineers focus on the analysis itself instead of wasting time on repetitive tasks that an automated tool can easily handle.

4. Rapidly ship valuable products – without any compromise on data quality

In times of economic uncertainty, more than ever, organizations need to be laser-focused on the value they bring to their customers. In software engineering this means shipping valuable, high-quality products in short cycles, in order to retain existing customers and increase their satisfaction. This requires easy-to-apply processes for building, testing, and shipping products quickly, without any compromise on quality.

In software development, adopting CI/CD helped organizations transition from a long, sluggish waterfall approach into innovative, agile, competitive organizations that continuously ship high-quality products to their customers.

Data engineering organizations can and should adopt this approach as well, with the right tools. Data versioning tools are a good option to consider here: protecting the data lake with quality-validating hooks ensures that the data flowing in is of the highest quality, without manual validation.

Summary

Working with data organizations worldwide and engaging with their engineering decision makers has given us a small glimpse into the unique set of business, technology, and management challenges they face. Our recommendations are based on these observations, and we believe that applying even some of these resolutions, one step at a time, can contribute to your success. We’d be happy to hear back from you – what are your new year resolutions?

Git for Data – lakeFS
