TL;DR: we worked with an object-storage based data lake, it’s an excellent data architecture but you have no systematic way of:
- Communicating between writers and readers (if you are thinking MetaStore as a solution, keep reading).
- Avoiding the data Swamp everybody is warning you about.
- Creating data pipelines that are resilient to changes in data and code
The long version starts when we* finished a migration of our on-prem Hadoop clusters to AWS in mid 2017. Our architecture included a Kafka based ingest process of over hundreds of different sources, managed by the data collection group. The data was saved to S3 and partitioned by arrival time. We had 4 different engineering groups on the consuming side:
- Data science: converting raw data into estimations. The group consisted of data scientists, data engineering and DevOps teams to support a DAG of almost 1000 jobs running daily to produce our production data.
- Web application: mainly consuming the output of the data science group.
- Business intelligence: BI engineers and analysts, running counts on the amount of data we ingested by device, application, country, etc’,
- Professional services: analysts providing ad hoc reports to customers based on any data set that supports their analysis.
We were happy. Finally throughput was not a problem, cost effectiveness of the storage (S3) was high, and with some tweaks on prefixes, throughput on read was satisfactory.
In retrospect we still think we made the right choice with this architecture. Before the migration we were dealing with “this is not feasible” issues and afterwards transitioned to the “this is not manageable” kind. And when data is your product, “this is not manageable” means production issues due to error prone operations.
Here are some examples of the “this is not manageable” problem space:
Fragile Writers and Readers: how to signal consumers the data is ready?
Pain Point 1: The data science group starts running its DAG only after data from several different sources (that were written independently) were complete. There’s no simple way for the writing teams to signal the completion of all writes to the reading teams. To work around this, we used time-based synchronization. At midnight, readers start consuming the data. Of course, this neglected late arrivals and in-flight writes. It was good enough most of the time, but a real pain when trying to reproduce results.
Pain Point 2: Consider the DAG of 1000 Spark jobs. Each job depends on several ancestor jobs’ output as its input. Basically the same as pain point #1, only with Airflow orchestrating a DAG of spark jobs. We used Spark SUCCESS files as a hook for Airflow. This doesn’t work in case the success file fails, or if for some reason someone intervenes manually.
Pain Point 3: In certain instances, correctness of the readers depends on a synchronization of several collections. For example, sometimes we needed the same data in two storage formats. ORC for AWS Athena, and Parquet for Spark. Another example: the professional services group have permission to access any data set serving their current ad hoc analysis. How do you make sure everything they access is synchronized to the same time/other parameters to maintain consistency? We used Hive Metastore as a workaround. Each write process updates the Metastore when completed, and the reader waiting on those inputs would wait until all updates were announced. In some cases it meant trading consistency for availability, but we weren’t the first to encounter this tradeoff :-).
A word about the workarounds: every time we introduce new data or analysis the solution needs to be implemented again. If you neglect to do so, you will suffer the consequences. There was no one systematic and easy solution for all these use-cases. In addition, each solution failed sometimes, causing a correctness issue later in the data pipeline.
Data Swamp: how to ensure visibility and governance?
Pain point 1: Many organizations avoid limiting data access using permissions as they want to democratize the data in the organization, and as long as regulation permits, allow data consumers to generate as much value as possible from the data. We hold this philosophy. The challenge is that there’s no systematic way to ensure isolation, i.e., to ensure no one changes the data while you’re using it. This is why copying is so common in object storage. The copy you create yourself usually has a very meaningful name, such as “Einat_final_final_V2”, or better yet “Production_temp”. You don’t have lineage capabilities that indicate which data is behind that name, unless you enforce naming conventions…Good luck with that 🙂
Over time, your lake becomes a swamp, cost increases and you have no real control over your data from a compliance standpoint.
Pain point 2: If you are working to avoid the swamp, you’re probably running retention jobs to ensure stuff doesn’t get out of hand. Consider a home grown retention job running periodically over the lake. If it has a bug and deletes valuable data, according to some logic it will start to spread across many collections. Although you have backups of all the objects in the lake, you don’t have a snapshot of the lake to revert to. Recovering fully from such an error may take weeks. Amazon’s S3 object-level versioning will not save you here.
The need for Data CI/CD: how to ensure data quality?
Pain point: Data is useless, unless it’s trustworthy. When your delivery is data, it is not enough to ensure the correctness of the code. It’s also critical to protect the properties of the data that you assumed you had when you developed the code. Now, remember the 1000 spark jobs orchestrated by Airflow that run every day? These run an algorithm, so each job includes assumptions on the data. If those assumptions are no longer valid, jobs may fail, or worse, the quality of the output data will decrease dramatically. Why is the second scenario worse? Because it’s harder to detect.
If we could run a job in isolation, test the results, and merge back automatically only after validating schema and data correctness, then it would be possible to identify issues earlier, putting us in a much better position to deliver quality data.
We were thinking: Git interface with MVCC capabilities
While each one of these challenges may have a workaround we can use, or a homegrown solution we can develop, there is no conceptual solution that simply makes the work manageable and hence resilient. We want a concept and a language that provides a solution for all of those challenges, and for the challenges that are yet to come.
On the one hand, database systems use transactions to provide systematic guarantees over the data. This is usually implemented using Multi-Version Concurrency Control.
On the other hand, there is a standard for code versioning. The Git terminology is a common language for developers to deal with versions of things. So we were thinking, if we build an MVCC management layer for Data Lakes, using a Git-like interface, we will have an intuitive way of getting the guarantees we want for our data lake, in the performance and reliability we need. The name “lakeFS” followed soon after.
Introducing lakeFS: versioned, manageable and resilient data lake
lakeFS is an open source platform that delivers resilience and manageability to your existing object-storage based data lake. With lakeFS you can build repeatable, atomic and versioned data lake operations – from complex ETL jobs to data science and analytics. Here you can try lakeFS without installing, to fully understand what it’s all about.
Main features include:
- Cross-Lake Isolation – Creating a lakeFS branch provides you with a snapshot of the entire lake at a given point in time (no copying involved). Guaranteeing that all reads from that branch will always return the same results.
- Object-level Consistency – Ensuring all operations within a branch are strongly consistent (read-after-write, list-after-write, read-after-delete, etc).
- Cross-collection Consistency – Branches provide writers consistency guarantees across different logical collections. Merging to “main” is only done after several datasets are created successfully.
- Versioning – Retain commits for a configurable duration, so readers can query data from the latest commit or any other point in time. Writers can atomically and safely rollback changes to previous versions.
- Data CI/CD – Define automated rules and tests mandatory to pass before committing or merging changes to data.
Try it out! It will solve the challenges you have now, and will prevent you from running into undesired issues in the future.
* referring to the amazing R&D team at SimilarWeb. We are proud to have been a part of before embarking on the lakeFS adventure