Best Practices Data Engineering Machine Learning

Versioning Data Labels: Integrating Labeling Tools with lakeFS

Iddo Avneri

September 10, 2025

In this post, we explore how lakeFS can integrate with popular data labeling solutions, the differences between labeling tools’ built-in dataset management and lakeFS data version control, and why combining them is invaluable. We’ll also highlight use cases – from autonomous vehicles to healthcare – where rigorous data versioning alongside labeling is essential. Overview of […]

Data Engineering Machine Learning

Unified Data Management: Types, Challenges & Best Practices

Idan Novogroder

September 2, 2025

Historically, companies have developed their IT systems on an ad hoc basis, installing various software and adopting data management approaches as their needs changed. The resulting environment is fragmented, with multiple tools and datasets that serve the same function. Data tends to be segregated and dispersed across teams and areas, with little to no

Data Engineering Machine Learning

What is Metadata Filtering? Benefits, Best Practices & Tools

Idan Novogroder

August 21, 2025

Vector databases are a critical enabler for expanding the use of LLMs. They power applications such as Retrieval Augmented Generation (RAG), pattern matching, anomaly detection, and recommendation systems by retrieving relevant data for your application. A vector database needs to carry out efficient similarity searches across vector embeddings of both unstructured and structured data. You

Data Engineering Machine Learning Product

How lakeFS Helps Ensure Data Compliance

Tal Sofer

August 18, 2025

Data compliance is all about adhering to laws, regulations, standards, and internal policies regarding data use. Organizations must comply with regulations like the General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA), and the California Consumer Privacy Act (CCPA), as well as SOC 2 standards, to protect sensitive information and maintain trust. Data compliance plays

Data Engineering Machine Learning

What is Data Compliance? Tools, Benefits & Key Metrics

Tal Sofer

August 13, 2025

Organizations deal with ever-increasing volumes of data. More data translates into more risk, as attackers have a larger attack surface. This is where data compliance comes in. It helps mitigate these threats and protect consumer data by setting compliance standards that companies and individuals must adhere to while working with data. How does data compliance

Data Engineering Machine Learning Product

How We Built Our lakeFS Iceberg Catalog

Itai Gilo

August 4, 2025

A behind-the-scenes look at the design decisions, architecture, and lessons learned while bringing the Apache Iceberg REST Catalog to lakeFS. When we first announced our native lakeFS Iceberg REST Catalog, we focused on what it means for data teams: seamless, Git-like version control for structured and unstructured data, at any scale. But how did we

Data Engineering Machine Learning

What is Data Virtualization? Benefits, Use Cases & Tools

Tal Sofer

July 22, 2025

Data integration is a vital first step in developing any AI application. This is where data virtualization comes in to help organizations accelerate application development and deployment. By virtualizing data, teams can unlock its full potential by providing real-time AI insights for applications like predictive maintenance, fraud detection, and demand forecasting. Virtualizing data centralizes and

Data Engineering Machine Learning Product

Git-Like Data Versioning Meets MLOps: lakeFS with MLflow, DataChain, Neptune & Quilt

Iddo Avneri

June 26, 2025

Modern machine learning pipelines involve a mix of tools for experiment tracking, data preparation, model registry, and more. MLflow, DataChain, Neptune, and Quilt are some MLOps tools serving these needs. However, one critical piece underpins them all: data version control. This is where lakeFS comes in. lakeFS is not an experiment tracker or ML platform;

Data Engineering Machine Learning

What is Data Discovery, How It Works & Why It Matters

Tal Sofer

June 4, 2025

Most organizations collect massive amounts of data from various sources, including customer interactions, supply networks, financial systems, and more. As a result, teams may feel overwhelmed by a flood of data while seeking key insights, and the question of data manageability becomes more pressing than ever. This is where data discovery comes in. Data discovery

Data Engineering Machine Learning Thought Leadership

The State of Data and AI Engineering 2025

Einat Orr, PhD

May 29, 2025

Since 2021, we’ve published the annual State of Data Engineering Report, which includes a summary of all key categories that directly impact data engineering infrastructure. In 2025, we see five primary trends that influence the categories that will be covered in this report. Trend #1: MLOps space is slowly diminishing. The MLOps space is slowly

Data Engineering Machine Learning

What Is an AI Factory and How Does It Work?

Tal Sofer

May 28, 2025

During the 2025 Nvidia GTC conference, one of the keywords that drew a lot of attention was “AI factory.” An AI factory is Nvidia’s idea for producing large-scale AI systems. This concept aligns AI development with the industrial process, in which raw data is received, improved through computation, and converted into valuable products via data-driven

Best Practices Data Engineering Machine Learning

What is AI infrastructure? Benefits & how to build one

Idan Novogroder

April 29, 2025

A solid AI infrastructure is essential for efficiently developing and deploying AI and machine learning (ML) applications – from facial and speech recognition to text processing and computer vision. Before we dive into why AI infrastructure is crucial and how it works, let’s define it first. What is AI infrastructure? AI infrastructure, also known as

Data Engineering

Pick up the Slack with lakeFS