
Case Study

How Windward Leverages lakeFS for Resilient Data Ingestion

Lior Resisi, Author

Last updated on August 22, 2024
Company

Windward is the company behind Maritime Artificial Intelligence Analytics (MAIA™️), a platform that delivers predictive intelligence on global maritime conditions to hundreds of businesses.

Problem

Windward’s pipelines need to be fault-tolerant. In particular, late-arriving data must be moved, with transactional guarantees, to the partition that matches when the events actually took place, and doing this on an object store can become quite complex.

Solution

By utilizing lakeFS branches and commits, the team can guarantee that late-arriving data is moved to the correct partition with transactional guarantees and without interfering with downstream data consumers.

The company

Windward is the company behind Maritime Artificial Intelligence Analytics (MAIA™️), a platform that delivers predictive intelligence on global maritime conditions to hundreds of businesses. Customers across different industries—like oil and energy, commodity trading, financial institutions, and more—leverage MAIA™️ to optimize operations and mitigate risk at sea.

Thanks to big data processing, the platform can aggregate over 30 unique sources consisting of billions of data points. This includes proprietary and open-source data, including AIS, GIS layers, weather conditions, satellite images, and nautical charts. This information is processed as inputs to proprietary algorithms that accurately determine vessel identity, location, cargo visibility, voyage patterns, and more.

Underpinning the platform are scalable, resilient data pipelines that incorporate and process the data sources mentioned above. If these pipelines were to fail or show incorrect numbers, the fallout could cost the company’s customers and their businesses millions of dollars. Therefore, Windward’s pipelines need to be fault-tolerant.


The challenges

Challenge: Managing late-arriving data with transactional guarantees

The first step for data entering the platform is to land in S3, partitioned hourly by the time it was received. This is often, but not always, the same as when the actual events occurred.

S3 Data Lake

In the case of late-arriving events, it is necessary to move the data to the correct partition according to when it actually took place. To do this, Windward runs a separate process that looks at the events contained within a file and, if needed, moves them to the correct partition.
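The repartitioning decision itself can be sketched in a few lines. This is a hypothetical illustration, not Windward's actual code: the helper names and the `events/YYYY/MM/DD/HH/` layout are assumptions standing in for whatever hourly scheme the platform uses.

```python
from datetime import datetime, timezone

def hourly_partition(ts: datetime) -> str:
    """Return the hourly partition prefix an event belongs in,
    e.g. 2024-08-22 13:05 UTC -> 'events/2024/08/22/13/'."""
    return ts.astimezone(timezone.utc).strftime("events/%Y/%m/%d/%H/")

def needs_move(received_at: datetime, occurred_at: datetime) -> bool:
    """An event is late-arriving if its true event time falls in a
    different hourly partition than the one it landed in on arrival."""
    return hourly_partition(received_at) != hourly_partition(occurred_at)
```

For example, an event received at 14:00 UTC that actually occurred at 13:59 UTC belongs in the 13:00 partition and must be moved.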

Unfortunately, getting the type of transactional guarantees the team would like for this copy-and-delete operation on an object store like S3 is not so simple. 

For example, how do you recover when the copy operation succeeds but then the deletion of the original file fails? Even more concerning, how can a downstream job prevent the ingested dataset from being read right after a copy but before the deletion?
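To make that failure mode concrete, here is a self-contained sketch using an in-memory dict in place of S3. On a real object store there is no atomic rename: the move is a copy call followed by a separate delete call, with no transaction spanning the two.

```python
class FlakyStore:
    """Toy stand-in for an object store: copy and delete are two
    independent calls, and the second one can fail on its own."""
    def __init__(self, objects, fail_delete=False):
        self.objects = dict(objects)
        self.fail_delete = fail_delete

    def copy(self, src, dst):
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        if self.fail_delete:
            raise ConnectionError("delete failed after copy succeeded")
        del self.objects[key]

def move(store, src, dst):
    # No atomic rename on object stores: copy first, then delete.
    store.copy(src, dst)
    store.delete(src)

store = FlakyStore({"2024/08/22/14/late.json": b"{}"}, fail_delete=True)
try:
    move(store, "2024/08/22/14/late.json", "2024/08/22/13/late.json")
except ConnectionError:
    pass

# The copy landed but the delete did not: the same events now exist in
# two partitions, and any reader that listed between the two calls
# would have seen them twice as well.
duplicated = {"2024/08/22/14/late.json",
              "2024/08/22/13/late.json"} <= store.objects.keys()
```

Any consumer aggregating over both partitions in this window double-counts those events, which is exactly the kind of silent error the team needed to rule out.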


Adopted solution

Challenge solved: Managing late-arriving data with transactional guarantees

Windward learned about lakeFS and how it can provide isolation and transactional guarantees for operations over an object store.

After deploying lakeFS in Windward’s data environment and creating a repository, the data ingestion process looks as follows:

By utilizing lakeFS branches and commits, the team can guarantee that new data gets moved to the correct partition without interfering with downstream data consumers.
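In outline, the branch-and-merge pattern looks like the sketch below, written against the high-level `lakefs` Python SDK. The repository, branch, and object names are made up, and this is a schematic of the pattern rather than Windward's actual job; consult the lakeFS SDK documentation for exact signatures.

```python
try:
    import lakefs  # high-level lakeFS Python SDK: pip install lakefs
except ImportError:  # keep the sketch importable when the SDK is absent
    lakefs = None

def repartition_late_events(repo_name: str, src_key: str, dst_key: str) -> None:
    repo = lakefs.repository(repo_name)

    # 1. Isolate the work on a branch; consumers of main are unaffected.
    fix = repo.branch("fix-late-events").create(source_reference="main",
                                                exist_ok=True)

    # 2. Perform the copy-and-delete on the branch. If anything fails
    #    here, main never saw a half-moved state.
    data = fix.object(src_key).reader().read()
    fix.object(dst_key).upload(data=data)
    fix.object(src_key).delete()

    # 3. Commit, then merge. The merge is atomic: downstream readers of
    #    main see either the old layout or the new one, never both.
    fix.commit(message=f"move late events {src_key} -> {dst_key}")
    fix.merge_into(repo.branch("main"))
```

The key point is step 3: because the merge exposes all changes to main at once, the copy-succeeded-but-delete-failed window from the earlier example can no longer leak into what consumers read.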


Incremental adoption with lakeFS Exports

While a lakeFS repository and its related operations work well for many datasets, Windward didn’t want to depend on repositories for every dataset. Instead, the company wanted some datasets to remain addressable by their regular S3 prefix.

To get the best of both worlds, Windward used lakeFS export operations as a final step in its jobs. An export copies all data from a given lakeFS commit to a designated S3 path, so applications further down the stack can read directly from the exported S3 location without needing to be aware of lakeFS.


Results

Since introducing lakeFS into its production data environment, Windward has enjoyed the benefits of atomic and isolated operations in its data pipelines. This has allowed the team to spend more time improving other aspects of its data platform and less time dealing with the fallout of race conditions and partially failed operations.

