Machine Learning

Machine Learning

AI Data Infrastructure: Components, Challenges & Best Practices

Tal Sofer

A solid AI data infrastructure is a key enabler for teams looking to efficiently deploy ML applications. It delivers the fundamental features required to enable the whole machine learning lifecycle, from data import and model training to deployment, monitoring, and scaling. Without infrastructure, you’re bound to face obstacles in performance, cooperation, and dependability. This article […]

Machine Learning

What is a Data Registry? Benefits, Use Cases & Best Practices

Tal Sofer

One of the major pain points in ML is the lack of transparency, consistency, and control over data assets. Without a centralized system, teams often struggle with fragmented datasets, unclear version histories, and poor documentation, which may lead to reproducibility failures, compliance risks, and wasted effort. A data registry solves this by offering a structured

Best Practices Data Engineering Machine Learning

Bound by Physics: Why Data Version Control is Critical for Real-World AI

Vince Antinozzi, Yoav Yetinson

TL;DR: Software-only systems can be rerun from the source, but physics-bound workflows face a tougher challenge. Once a moment is gone, it’s gone. Sensor drift, hardware changes, and environmental uniqueness make it impossible to recreate the exact conditions. For audits, safety, and machine learning, you need full data provenance, including raw data, derived outputs, and

Best Practices Data Engineering Machine Learning

Versioning Data Labels: Integrating Labeling Tools with lakeFS

Iddo Avneri

In this post, we explore how lakeFS can integrate with popular data labeling solutions, the differences between labeling tools’ built-in dataset management and lakeFS data version control, and why combining them is invaluable. We’ll also highlight use cases – from autonomous vehicles to healthcare – where rigorous data versioning alongside labeling is essential. Overview of

Data Engineering Machine Learning

Unified Data Management: Types, Challenges & Best Practices

Idan Novogroder

Historically, companies have developed their IT systems on an ad hoc basis, installing various software and taking on data management approaches as their needs changed. The resulting organization is diverse, with multiple tools and data that serve the same function. Data tends to be segregated and dispersed across teams and areas, with little to no

Data Engineering Machine Learning

What is Metadata Filtering? Benefits, Best Practices & Tools

Idan Novogroder

Vector databases are a critical enabler for expanding the use of LLMs. They power applications such as Retrieval Augmented Generation (RAG), pattern matching, anomaly detection, and recommendation systems by retrieving relevant data for your application. A vector database needs to carry out efficient similarity searches across vector embeddings of both unstructured and structured data. You

Data Engineering Machine Learning Product

How lakeFS Helps Ensure Data Compliance

Tal Sofer

Data compliance is all about adhering to laws, regulations, standards, and internal policies regarding data use. Organizations must comply with regulations like the General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA), and the California Consumer Privacy Act (CCPA), as well as SOC 2 standards, to protect sensitive information and maintain trust. Data compliance plays

Data Engineering Machine Learning

What is Data Compliance? Tools, Benefits & Key Metrics

Tal Sofer

Organizations deal with ever-increasing volumes of data. More data translates into more risk, as hackers have a larger target area. This is where data compliance comes in. It helps mitigate these threats and protect consumer data by setting compliance standards that companies and individuals must adhere to while working with data. How does data compliance

Best Practices Machine Learning Thought Leadership

OpenAI’s Open Source Revolution: Why Enterprise AI Infrastructure Matters More Than Ever

Gottfried Sehringer

Yesterday, OpenAI launched gpt-oss-120b and gpt-oss-20b, marking the company’s first open-weight models since GPT-2 in 2019. This strategic shift represents far more than a product release—it signals a fundamental transformation in how large organizations, particularly in regulated industries, approach AI infrastructure and data management. OpenAI’s Strategic Return to Open Source The gpt-oss models—gpt-oss-120b and gpt-oss-20b—are

Data Engineering Machine Learning Product

How We Built Our lakeFS Iceberg Catalog

Itai Gilo

A behind-the-scenes look at the design decisions, architecture, and lessons learned while bringing the Apache Iceberg REST Catalog to lakeFS. When we first announced our native lakeFS Iceberg REST Catalog, we focused on what it means for data teams: seamless, Git-like version control for structured and unstructured data, at any scale. But how did we

Data Engineering Machine Learning

What is Data Virtualization? Benefits, Use Cases & Tools

Tal Sofer

Data integration is a vital first step in developing any AI application. This is where data virtualization comes in to help organizations accelerate application development and deployment. By virtualizing data, teams can unlock its full potential by providing real-time AI insights for applications like predictive maintenance, fraud detection, and demand forecasting. Virtualizing data centralizes and

Best Practices Machine Learning

AI-Ready Data: Characteristics, Challenges & Best Practices

Tal Sofer

Despite the increasing adoption of Artificial Intelligence (AI) applications, most organizations are bound to see implementation challenges. One of the issues lies in the data itself. A recent survey showed 80% of companies believe their data is suitable for AI, but more than half are actually dealing with challenges like internal data quality and categorization