Why Data is Killing Your AI Projects And What To Do About It

Iddo Avneri

November 4, 2025

This article is a summary of a joint session, Why Data Is Killing Your AI Project and What to Do About It, featuring a great panel of experts. The past year saw companies and teams weather a rollercoaster of highs and lows, riding the hype cycle of AI excitement followed by disillusionment. Most teams are […]

Best Practices Data Engineering Machine Learning

How lakeFS Transactional Mirroring Keeps Your Data Available During Cloud Outages

Idan Novogroder

October 23, 2025

When AWS Goes Down, Your Data Shouldn’t. On October 20th, 2025, AWS experienced a significant outage centered in the us-east-1 region. What started as a DNS resolution issue affecting DynamoDB quickly cascaded into widespread failures across major services and applications. From gaming platforms like Fortnite and social apps like Snapchat to enterprise systems and IoT

Best Practices Data Engineering Machine Learning

Bound by Physics: Why Data Version Control is Critical for Real-World AI

Vince Antinozzi, Yoav Yetinson

September 23, 2025

TL;DR: Software-only systems can be rerun from the source, but physics-bound workflows face a tougher challenge. Once a moment is gone, it’s gone. Sensor drift, hardware changes, and environmental uniqueness make it impossible to recreate the exact conditions. For audits, safety, and machine learning, you need full data provenance, including raw data, derived outputs, and

Best Practices Data Engineering Machine Learning

Versioning Data Labels: Integrating Labeling Tools with lakeFS

Iddo Avneri

September 10, 2025

In this post, we explore how lakeFS can integrate with popular data labeling solutions, the differences between labeling tools’ built-in dataset management and lakeFS data version control, and why combining them is invaluable. We’ll also highlight use cases – from autonomous vehicles to healthcare – where rigorous data versioning alongside labeling is essential. Overview of

Best Practices Product Tutorials

Versioned Data with Apache Iceberg Using lakeFS Iceberg REST Catalog

Amit Kesarwani

August 14, 2025

lakeFS Enterprise offers a fully standards-compliant implementation of the Apache Iceberg REST Catalog, enabling Git-style version control for structured data at scale. This integration allows teams to use Iceberg-compatible tools like Spark, Trino, and PyIceberg without any vendor lock-in or proprietary formats. By treating Iceberg tables as versioned entities within lakeFS repositories and branches, users

Best Practices Machine Learning Thought Leadership

OpenAI’s Open Source Revolution: Why Enterprise AI Infrastructure Matters More Than Ever

Gottfried Sehringer

August 6, 2025

Yesterday, OpenAI launched gpt-oss-120b and gpt-oss-20b, marking the company’s first open-weight models since GPT-2 in 2019. This strategic shift represents far more than a product release: it signals a fundamental transformation in how large organizations, particularly in regulated industries, approach AI infrastructure and data management. OpenAI’s Strategic Return to Open Source: The gpt-oss models (gpt-oss-120b and gpt-oss-20b) are

Best Practices Product Thought Leadership

The Evolving Equation: When Do You Move From Open Source to Enterprise with Data Version Control

Tal Sofer

July 16, 2025

Open source software has fundamentally reshaped technology—delivering unmatched flexibility, low friction, and rapid innovation. For some teams, it’s a philosophical commitment. For others, it’s the fastest path to building. lakeFS supports both models. For most data teams, the journey starts with open source and evolves over time. lakeFS open source offers a robust foundation for

Best Practices Machine Learning

AI-Ready Data: Characteristics, Challenges & Best Practices

Tal Sofer

July 14, 2025

Despite the increasing adoption of Artificial Intelligence (AI) applications, most organizations are bound to see implementation challenges. One of the issues lies in the data itself. A recent survey showed 80% of companies believe their data is suitable for AI, but more than half are actually dealing with challenges like internal data quality and categorization

Best Practices Machine Learning Product Tutorials

A Single Pane of Glass to Your Data: Multiple Storage Backends Support in lakeFS

Tal Sofer

May 13, 2025

Today’s organizations don’t just use a single data storage solution – they operate across on-prem servers, multiple cloud providers, and hybrid environments. This distributed approach has become necessary, but it comes with significant costs: teams struggle with siloed tools, duplicated processes, and an endless cycle of environment management that diverts focus from delivering actual value.

Best Practices Data Engineering Machine Learning

What is AI infrastructure? Benefits & how to build one

Idan Novogroder

April 29, 2025

A solid AI infrastructure is essential for efficiently developing and deploying AI and machine learning (ML) applications – from facial and speech recognition to text processing and computer vision. Before we dive into why AI infrastructure is crucial and how it works, let’s define it first. What is AI infrastructure? AI infrastructure, also known as

Best Practices Data Engineering Machine Learning

6 Types of Metadata: Examples, Tools & Frameworks

Idan Novogroder

April 22, 2025

With the volumes of generated data increasing, metadata has become an essential component in organizing and comprehending massive datasets. Metadata plays a key role in any modern data strategy, especially among organizations that treat data as one of their most precious assets. This article dives into all the different metadata types, tools, and frameworks to

Best Practices Machine Learning

What is AI Data Storage? Benefits, Challenges & Best Practices

Tal Sofer

April 17, 2025

Many companies are modernizing their data storage infrastructure to capitalize on the opportunities of machine learning (ML) and advanced analytics. However, teams face several unique data management challenges, such as the increasing time required for AI training and inference workloads, as well as the cost and scarcity of resources, particularly GPUs. Storage is a key

Best Practices

Pick up the Slack with lakeFS