
lakeFS Community
Einat Orr, PhD.
May 18, 2023

The world of software engineering underwent a huge acceleration in the past decades. This was possible thanks to the emergence of methodologies and tools that helped establish and apply new successful engineering best practices. 

The leading example is the move from a waterfall software development process to DevOps: the idea that at any moment in time there is a working version of the software that can be instantly shipped and deployed to end customers.

This approach became the leading way software is developed thanks to the frameworks and tools that enabled the change. We refer here to tools such as Git, which enabled collaboration between teams and continuous development; Jenkins, which enabled continuous integration of the software; Docker, which made it easy to test and ship the software to its consumers; and many more.

Source: PagerDuty

These tools enabled the DevOps approach to take over software development processes, with CI/CD being a key ingredient: Continuous integration and continuous deployment of the software.

Source: Synopsis

The software engineering world changed over time

The world of data engineering has undergone quite a few revolutions over time as well, and tools and technologies emerged in order to cater for the growing needs of this domain.

Source: lakeFS

Some of the changes this domain has undergone:

  • The growing variety of data types that need to be stored drove the move from tabular, structured data to unstructured data and new forms of data storage.
  • The growing demand for AI in almost every digital application drove the need to develop complex algorithms and analytics on top of dedicated compute and analytics engines.
  • The variety of sources, and the need for fast, distributed ingestion, led to advanced technologies that enable ETL and streaming from every data source one could imagine.
  • The need to develop and maintain complex AI algorithms for production systems at scale led to the emerging world of AI & ML tools that operationalize the algorithm development process as much as possible.

And these are only some of the advancements this field has seen.

Data Engineering, or: The elephant in the room

With all these technological advancements, data products are still very slow to develop, ship and maintain. Something is still missing in data engineering that would enable the rapid development of data products at scale, at a pace similar to that of software product development. In software engineering, the DevOps approach made the difference. It broke the engineering silos between teams. It made sure there is always a working, quality version of the software ready to be shipped to consumers. And it enabled fast deployment of these software products into the hands of the consumers, closing the rapid feedback loop.

So what is missing in data engineering? Can it be that by understanding these best practices and applying them to data engineering we can revolutionize the entire process of developing data products? We argue that this is indeed the case.

The main ingredients of a Data Engineering best practices framework

What is required from a data engineering team in order to rapidly build and ship quality data products? What is the set of principles at the heart of the software engineering best practices framework that should be adopted in data engineering as well?

Working with hundreds of data teams worldwide, listening to their pains and challenges, helped us establish a list of assumptions and behaviors that can serve as the foundation for data engineering best practices. Here are the 7 best practices for data engineering:

  1. Data is a product

A data product is any tool or application that processes data and generates insights. These insights are aimed at helping businesses make better decisions for the future. Stored data can then be sold or consumed by users internally, or by customer organizations which then process the data as needed. 

In order to adopt a data products approach, we need to apply the following principles:

  1. Product Management Methodology – including the people and processes that are in charge of building the definitions, requirements and KPIs
  2. Appropriate product delivery methods – including all the other engineering best practices that are required to continuously deliver quality data products
  3. Measurement and improvement processes – including relentless monitoring and validation of data quality, in all its aspects, and SLAs that cover not only the availability of the interface but also the freshness of the data (a freshness check is sketched below).
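
To make the freshness part of such an SLA concrete, here is a minimal sketch of a freshness check. The table name, the way the latest partition timestamp is obtained and the four-hour threshold are illustrative assumptions, not a prescribed standard.

```python
from datetime import datetime, timedelta, timezone

# Illustrative threshold: the data must never lag more than four hours behind.
FRESHNESS_SLA = timedelta(hours=4)

def check_freshness(table: str, latest_partition: datetime) -> None:
    """Fail loudly if the newest partition is older than the freshness SLA."""
    lag = datetime.now(timezone.utc) - latest_partition
    if lag > FRESHNESS_SLA:
        raise RuntimeError(f"{table} is {lag} behind; the SLA allows {FRESHNESS_SLA}")

# In a real pipeline the timestamp would come from the catalog or object store.
check_freshness(
    "events_daily",
    latest_partition=datetime.now(timezone.utc) - timedelta(hours=1),
)
```
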
  2. Building data products requires collaboration

Very much like in software engineering, in data too: teams that develop, test, ship and maintain complex data products are usually composed of several members, alongside members of other teams that consume and change the data. This requires team members to collaborate and contribute to each other’s work while still being able to work independently, at their own pace. That, in turn, requires tools that enable safe development in an isolated environment, and the ability to continuously merge each other’s work so there is always a working version ready to be consumed.
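
The workflow this implies looks roughly like the sketch below. The `DataRepo` class is a hypothetical stand-in for whatever data version control tool is actually used; only the shape of the branch-and-merge flow matters here.

```python
# Hypothetical data version control client; branch names and changes are examples.
class DataRepo:
    def __init__(self):
        self.branches = {"main": []}                    # branch name -> commits

    def create_branch(self, name: str, source: str = "main") -> str:
        # A branch starts as an isolated view of the source branch's history.
        self.branches[name] = list(self.branches[source])
        return name

    def commit(self, branch: str, change: str) -> None:
        self.branches[branch].append(change)

    def merge(self, source: str, into: str = "main") -> None:
        # Promote the isolated work so the shared version stays consumable.
        self.branches[into] = list(self.branches[source])

repo = DataRepo()
feature = repo.create_branch("etl-fix-late-events")     # work in isolation
repo.commit(feature, "backfill 2023-05-01 partition")
repo.commit(feature, "dedup late-arriving events")
repo.merge(feature, into="main")                        # main stays always-working
print(repo.branches["main"])
```
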

  3. Resilient data products require quick recovery from errors

Resilient, high-quality products are usually not born this way. Errors and bugs do happen, even in the most experienced teams. Rapidly shipping high-quality data products means fast root cause analysis, followed by fast recovery from quality issues and fast deployment of fixed versions. To enable this, we need a system that lets us identify root causes as quickly and easily as possible, and test and deploy a fix just as fast – or, in other words, reproducibility during development and testing.
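
Here is a minimal sketch of what that reproducibility looks like in practice. The `read_at` helper and the commit ID are assumptions for illustration; the point is that a failed run is debugged against the exact version of the data it saw, not against whatever the data looks like now.

```python
FAILED_RUN_COMMIT = "c7f3a9e"   # assumed ID of the data version at failure time

def read_at(table: str, commit_id: str) -> list[dict]:
    """Return the rows of `table` exactly as they were at `commit_id`."""
    # In practice this is a query against the data version control system;
    # stubbed here with the problematic rows for illustration.
    return [{"order_id": 1, "amount": None}]

def transform(rows: list[dict]) -> list[float]:
    # The transformation under investigation.
    return [row["amount"] * 1.17 for row in rows]

rows = read_at("orders", FAILED_RUN_COMMIT)
try:
    transform(rows)
except TypeError as err:
    print(f"Reproduced the failure on commit {FAILED_RUN_COMMIT}: {err}")
```
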

  4. CI/CD for data enables continuous delivery of quality data products

Solving data quality issues in data products is important, but continuously ensuring the quality of these products is the best practice for achieving fast development and deployment cycles. This is the way to detect and prevent errors before they even appear. To achieve that, software engineering applied the concept of hooks and pre-merge validations. This can and should be applied to data as well, by creating hooks that test new data before it becomes production data and prevent erroneous data from ever reaching production. This concept is at the core of the CI/CD approach, which, with the right tooling, can and should be applied to data.
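
A pre-merge hook could look like the sketch below. The required columns, the checks and the batch shape are illustrative assumptions; a real hook would be triggered by the data version control system or the CI runner before a branch is merged into production.

```python
REQUIRED_COLUMNS = {"order_id", "amount", "event_time"}

def pre_merge_hook(new_batch: list[dict]) -> None:
    """Reject the merge if the incoming data fails basic quality checks."""
    if not new_batch:
        raise ValueError("empty batch: refusing to merge")
    for row in new_batch:
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            raise ValueError(f"schema check failed, missing columns: {missing}")
        if row["amount"] is None or row["amount"] < 0:
            raise ValueError(f"quality check failed for order {row['order_id']}")

batch = [{"order_id": 1, "amount": 42.0, "event_time": "2023-05-18T09:00:00Z"}]
pre_merge_hook(batch)   # passes, so the branch may become production data
print("pre-merge checks passed")
```
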

  5. Data versioning is the conceptual solution to all of these needs

Data versioning is a core enabler of the best practices listed above. It refers to a system that keeps a version of the data for every change applied to it. This enables collaboration within teams, because each team member can get a branch – their own copy of the data – to safely develop and test on without impacting the work of the other team members. It enables reproducibility, because data engineers can always time-travel to the version of the data as it was at the time of failure. And it ultimately enables CI/CD, because whenever new data is generated, a new version of the data is created and tested, and if it fails a quality test it does not become the main version until the issue is resolved.

  6. Data versioning requires appropriate tooling – Data Version Control

Data versioning is indeed a convincing concept, but without appropriate tooling it will remain at the conceptual level. With tools and services that provide branching of data to work in isolation, time travel to enable reproducibility, and hooks to enable full CI/CD on the data, this entire mode of work comes to life. Data version control tools are the implementation of the data versioning methodology for data engineering, and there are various such tools to choose from.

  7. The standardized approach to version control is Git

When considering a data version control system, the most straightforward approach would be Git, since it created and established a standard taxonomy for version control capabilities. The Git interface allows straightforward actions for all of the features we mentioned above: branching, merging, moving between versions and hooks. Its interface has become very intuitive for developers and is easy to integrate into almost any existing stack. There are several solutions that provide Git for data – they vary in features such as scalability, supported file formats, support for tabular and unstructured data, the volumes of data supported, and more.

Summary

Adopting and applying proven best practices from software engineering can help the world of data engineering keep pace with the rhythm needed for delivering digital products. As organizations shift toward a continuous integration and delivery mindset, along with the necessary cultural and behavioral changes, we will start seeing smarter digital products powered by resilient, high-quality data products.
