Making Sense of Large-Scale Data Through Metadata
Picture this: Your ML team needs to find all images labeled “defective” from Q3 production runs tagged by a specific annotation workflow to retrain a quality control model. In a data lake with 10 billion objects, how do you find them?
For most teams, the answer is: you can’t. At least not without building custom infrastructure, maintaining complex indexing systems, or manually scanning through files.
As data lakes scale to billions of objects, teams face a double challenge: finding the right data and understanding what it means. In modern AI and analytics environments, data is constantly changing – files are replaced, pipelines are rerun, and workflows evolve daily. Answering even basic questions like “Which files are missing required metadata fields?”, “Has the distribution of our training data shifted over time?”, or “When did schema or tagging errors start appearing?” becomes increasingly difficult.
Metadata provides the context that makes data meaningful: when it was created, by whom, how it’s used, and what it represents. As we explored in our post on AI metadata management, this contextual information is essential for building reliable AI systems. Yet as organizations collect more metadata, they often struggle to make it accessible and actionable.
lakeFS Metadata Search changes that. It brings structure, discoverability, and reproducibility to metadata-rich environments – making it possible to explore and query your data lake directly through metadata, at any scale.
The Challenge: Metadata Management at Scale
Every object in a data lake includes details about its system properties (like size, creation time, or type) and its contextual attributes (like annotation labels, sensitivity classification, or pipeline identifiers). But to make metadata truly useful, it needs to be queryable. That typically means building and maintaining separate indexing systems or custom metadata catalogs – expensive, brittle, and hard to scale.
The result is predictable: metadata becomes scattered across storage backends, workflow logs, and feature stores. Queries are slow and return only partial results. Data changes over time, making queries non-reproducible. Teams spend more time maintaining metadata infrastructure than using it – and in many cases, using metadata for search at scale simply isn’t feasible.
What Is lakeFS Metadata Search?
Instead of building custom systems to track your data and metadata, you can now query it directly:
USE "repo.main.system";
SELECT path
FROM object_metadata
WHERE user_metadata['class'] = 'horse'
AND user_metadata['workflow_id'] = 'customer-etl-v2'
AND size > 1073741824;That’s it. No custom indexing.
With Metadata Search, metadata becomes the interface to your data lake – so you can reason about behavior, lineage, and quality without reading a single file. It enables teams to run queries like:
- Show all Parquet files > 1 GB tagged `pii=true` under `/raw/2025/` (sketched in SQL below).
- List files produced by `workflow_id=abc` last week.
- Which paths have `schema_version=3` and were committed after January 1st?
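As a sketch, the first of these might translate to the following query (the path convention and the 1 GB threshold mirror the bullet above; adjust them to your own layout):

```sql
USE "repo.main.system";

-- Parquet files over 1 GB tagged pii=true under /raw/2025/ (sketch).
SELECT path, size
FROM object_metadata
WHERE path LIKE 'raw/2025/%.parquet'
  AND user_metadata['pii'] = 'true'
  AND size > 1073741824;  -- 1 GB in bytes
```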
Metadata Search delivers two key capabilities:
- Searchability at scale via SQL – find any data across the lake using familiar syntax and tools.
- Reproducible queries – anchor each query to an immutable data version to get the same answer every time.
How it works
Built on Iceberg
Metadata Search indexes every object’s metadata in your lakeFS repository and exposes it as an Iceberg table you can query with standard SQL via any Iceberg-compatible client (e.g., Trino, Spark, DuckDB, PyIceberg).
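For instance, here’s a minimal sketch of reading the table with PyIceberg through an Iceberg REST catalog. The endpoint URL, catalog name, and token are placeholders – the exact catalog configuration depends on your deployment – and the table identifier follows the `repo.main.system` namespace used in the queries above:

```python
from pyiceberg.catalog import load_catalog

# Catalog name, URI, and token below are placeholders for your deployment.
catalog = load_catalog(
    "lakefs",
    uri="https://lakefs.example.com/iceberg/api",
    token="<access-token>",
)

# Table identifier follows the namespaces used above: repo.main.system.
table = catalog.load_table("repo.main.system.object_metadata")

# Push the size predicate down to the scan, then refine on user_metadata
# client-side (map columns materialize as key/value pairs in pandas).
df = table.scan(row_filter="size > 1073741824").to_pandas()
horses = df[df["user_metadata"].apply(lambda m: dict(m or {}).get("class") == "horse")]
print(horses["path"].tolist()[:10])
```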
Two Types of Metadata
Metadata Search indexes two types of object metadata:
- System metadata: automatically captured information such as `path`, `size`, or `commit_id`.
- User-defined metadata: key-value pairs attached to objects by users or workflows, such as `workflow_id`, `annotation`, or `pii=true` (see the upload sketch below).
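One common way to attach user-defined metadata is at upload time. Here’s a hedged sketch using boto3 against lakeFS’s S3-compatible gateway, where user metadata headers are stored alongside the object; the endpoint, repository, and credentials are placeholders:

```python
import boto3

# Endpoint and credentials below are placeholders for your lakeFS deployment.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="<lakefs-access-key-id>",
    aws_secret_access_key="<lakefs-secret-access-key>",
)

# Through the S3 gateway, the repository is the bucket and the key is
# prefixed with the branch name.
with open("img_0001.png", "rb") as f:
    s3.put_object(
        Bucket="repo",
        Key="main/images/img_0001.png",
        Body=f,
        Metadata={"class": "horse", "workflow_id": "customer-etl-v2"},
    )
```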
Under the hood, lakeFS continuously builds a queryable table called `object_metadata`. The table is eventually consistent and updates after data is committed to lakeFS.
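This makes questions like the earlier “Which files are missing required metadata fields?” a short query. A sketch, assuming `pii` is a field your conventions require:

```sql
USE "repo.main.system";

-- Objects missing a required user-metadata key ('pii' is an assumed field).
-- element_at returns NULL for absent keys in Trino; in Spark SQL the
-- user_metadata['pii'] subscript form behaves the same way.
SELECT path
FROM object_metadata
WHERE element_at(user_metadata, 'pii') IS NULL;
```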
Writing Reproducible Queries
Asking “show me all horse images” on Monday gives different results than Friday if your data lake is actively changing. This is especially problematic in two common scenarios: dynamic environments where data is constantly updated by automated pipelines, and shared environments where multiple teams need to reference the same datasets. Without reproducibility, AI governance, regulatory compliance, and team collaboration all break down.
In lakeFS, you solve this by querying immutable versions – specific commit hashes or tags – instead of querying moving branch heads. This guarantees the same query always returns the same results, regardless of when or where it’s executed.
If you query a branch, you query the “latest” data:
USE "repo.main.system";
SELECT *
FROM object_metadata
WHERE user_metadata['class'] = 'horse';To get reproducible results, pin the query to a specific commit
USE "repo.c123abc.system";
SELECT *
FROM object_metadata
WHERE user_metadata['class'] = 'horse';Or tag
USE "repo.v1.2.0.system";
SELECT path, size, commit_id
FROM object_metadata
WHERE user_metadata['class'] = 'horse';Running the same query against the same commit or tag will always return the same result, regardless of when or where it’s executed. This makes it straightforward to share data curation results, validate experiments, and satisfy audit requirements.
The Bottom Line
As data lakes grow to billions of objects, metadata becomes the only practical way to navigate, understand, and govern your data at scale. lakeFS Metadata Search transforms system and user-defined metadata from scattered documentation into a queryable, reproducible interface – letting you find the right data, understand its context, and ensure consistent results across AI and analytics workflows. Stop building custom metadata infrastructure and start querying your data lake directly. Explore the full documentation or contact us to get started!


