
Idan Novogroder

Last updated on May 17, 2024

Many businesses are dealing with growing volumes of data spread across multiple databases and repositories spanning on-premises systems, cloud services, and IoT devices. This complicates data management, undermines data quality, and prevents data practitioners from locating important data and unlocking insights from it.

This is where data catalogs come in. Early data catalogs required bespoke scripts to crawl data and capture information; modern systems can automatically and dynamically detect data properties, types, and profiles.

Keep reading to learn more about the impact of data catalogs on organizations and to explore 15 data catalog tools on the market.

What is a Data Catalog?

A data catalog is a collection of metadata paired with data management and search capabilities that helps data users find the data they want. It serves as an inventory of accessible data and provides information to assess the fitness of data for its intended applications.

This definition touches on several aspects of data catalogs, including data management, search, inventory, and fitness assessment, all of which rest on the catalog's core capability: a curated collection of metadata.
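To make the inventory idea concrete, here is a minimal sketch in Python of the kind of record a catalog might keep for each asset. The field names are illustrative only, not any particular product's schema:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One inventory record in a data catalog (illustrative fields only)."""
    name: str                    # e.g. a table or file-set name
    location: str                # where the asset physically lives
    owner: str                   # who is responsible for this data
    description: str = ""        # business meaning of the asset
    tags: list[str] = field(default_factory=list)  # labels used for search

entry = CatalogEntry(
    name="orders",
    location="s3://warehouse/sales/orders/",
    owner="sales-data-team",
    description="One row per customer order, updated hourly.",
    tags=["sales", "core"],
)
print(entry)
```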

Data catalog tools make it easier to find data and determine its purpose, delivering benefits such as:

  • Fast asset discovery: A data catalog makes it easier to identify data, which helps employees be more productive. It provides an overview of where data originates, how it flows through systems, and how it is altered along the way.
  • Improved data quality: When a company receives new data, employees document it by filling out fields in the data catalog. When users later browse the catalog, they can see the data's origins, transformation steps, and edit dates, which gives them more confidence in the data they work with.
  • Increased efficiency: A data catalog promotes uniformity in names, definitions, and measurements, ensuring that diverse teams within an organization share a common understanding and usage of data.
  • Enhanced security: Enterprise data catalogs ensure that sensitive data is managed correctly and that access is granted appropriately. Organizations can trace where their data originates, who has accessed it, and how it is used, which supports regulatory compliance efforts.

What Does a Data Catalog Do?

Data catalogs can give a unified picture of an enterprise’s data assets. The catalog concept has existed since the early days of relational databases, when teams needed to trace how data sets were connected, joined, and modified across SQL tables.

Modern data catalog solutions inventory and gather information from a broader range of data sources, including data lakes, data warehouses, NoSQL databases, cloud object storage, and others.

They are also frequently coupled with data governance software to help organizations keep up with changing regulatory compliance requirements and other aspects of governance initiatives. Furthermore, these tools are increasingly adopting natural language search, machine learning, and other AI capabilities.

Top 15 Data Catalog Tools

1. Amundsen Data Catalog

Source: Amundsen

Amundsen was designed to help users answer questions about data availability, trustworthiness, ownership, usage, and reusability. Its main features include simple metadata ingestion, search, discovery, lineage, and visualization. The Amundsen project is now overseen by the Linux Foundation’s AI & Data branch.

Amundsen’s architecture consists of several services, including the metadata service, the search service, the frontend service, and the databuilder. These services rely on technologies such as Neo4j and Elasticsearch, so you’ll need working knowledge of those systems to troubleshoot issues as they arise.
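Because the metadata service persists its graph in Neo4j, operators often end up querying it directly when debugging. Here is a minimal sketch, assuming a default local deployment and Amundsen's Table node label; the connection details are placeholders:

```python
# pip install neo4j
from neo4j import GraphDatabase

# Connection details are placeholders; adjust them to your deployment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))

# Find tables whose name mentions "orders" in Amundsen's metadata graph
# (Amundsen models tables as nodes with the Table label).
with driver.session() as session:
    result = session.run(
        "MATCH (t:Table) WHERE t.name CONTAINS $kw RETURN t.name AS name LIMIT 10",
        kw="orders",
    )
    for record in result:
        print(record["name"])

driver.close()
```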

2. Marquez Data Catalog

Source: Marquez

Marquez was designed to tackle metadata management at WeWork. Its primary goals were to make data assets searchable and visualizable, and to show how they connect to one another and how they change as they move from a data source to a target environment. Marquez also paved the way for OpenLineage, a standard for recording, manipulating, and preserving data lineage in real time.

Marquez’s key features are metadata management and lineage visualization, with a particular emphasis on interacting with technologies like dbt and Apache Airflow. Marquez seeks to increase data trust, provide (lineage) context, and enable users to self-serve the required data.
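To sketch how lineage events reach Marquez, here is a minimal example using the openlineage-python client. The URL, namespace, and job name are placeholders, and a default local Marquez API on port 5000 is assumed:

```python
# pip install openlineage-python
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

# Point the client at Marquez's API (placeholder URL).
client = OpenLineageClient(url="http://localhost:5000")

job = Job(namespace="demo", name="daily_orders_etl")  # placeholder names
run = Run(runId=str(uuid.uuid4()))

# Emit a START event; a matching COMPLETE event would be sent when the
# job finishes, and Marquez assembles the lineage graph from these events.
client.emit(
    RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=run,
        job=job,
        producer="https://example.com/demo-producer",
        inputs=[],
        outputs=[],
    )
)
```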

Marquez is now incubating with the Linux Foundation’s AI & Data project. Although there is no public roadmap, the blog, community Slack channel, and documentation provide enough information to keep you updated on the project’s development.

3. Apache Atlas Data Catalog

Source: Apache Atlas

Apache Atlas represents data as types and entities, allowing enterprises to create, organize, and administer their data assets on Hadoop clusters. These “entities” are instances of metadata types that capture information about metadata objects and their relationships.

Atlas also provides a modeling capability for describing the origins of your data, together with all of its transformations and artifacts. It reduces the complexity of managing metadata by attaching metadata to objects through labels and classifications: anybody may create and assign labels to items, while system administrators control classifications using Atlas policies.
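For a feel of the types-and-entities model, here is a hedged sketch that registers a single entity through Atlas's v2 REST API; the host, credentials, and attribute values are placeholders:

```python
# pip install requests
import requests

ATLAS_URL = "http://localhost:21000/api/atlas/v2/entity"  # default Atlas port

# A minimal entity: one instance of Atlas's built-in hdfs_path type.
payload = {
    "entity": {
        "typeName": "hdfs_path",
        "attributes": {
            "qualifiedName": "hdfs://namenode/data/orders@cluster1",
            "name": "orders",
            "path": "/data/orders",
        },
    }
}

# Placeholder credentials; Atlas returns the GUID it assigned to the entity.
resp = requests.post(ATLAS_URL, json=payload, auth=("admin", "admin"))
resp.raise_for_status()
print(resp.json())
```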

4. DataHub Data Catalog

Source: DataHub

DataHub is an event-based data catalog that, judging by its feature set, might be regarded as a metadata platform similar to OpenMetadata. LinkedIn created it for internal use and open-sourced it in early 2020. Since then, adoption and the community around it have expanded dramatically.

Acryl is now the primary developer and maintainer of DataHub, and it also offers a DataHub SaaS service in its product range. Nonetheless, Acryl remains deeply committed to the open-source model, which means most features (if not all) are and will remain part of the open-source release.

Because DataHub is event-based, each interaction with the user interface that impacts metadata, or metadata ingestion, generates an event in a Kafka topic. That event is detected by the backend service, which updates the database. 

This functionality can be delegated to two other services that can be maintained separately: the Metadata Change Event (MCE) consumer service and the Metadata Audit Event (MAE) consumer service. 

Metadata ingestion runs either in a dedicated container or in the frontend container once it has been set up and launched. Alternatively, you may use the Python SDK to ingest metadata programmatically, as sketched below.
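Here is a minimal sketch of that programmatic path, pushing one metadata aspect to a local DataHub instance with the Python SDK; the server URL, platform, and dataset name are placeholders:

```python
# pip install acryl-datahub
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Point the emitter at DataHub's metadata service (placeholder URL).
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Describe one dataset; the URN identifies it within DataHub.
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="mysql", name="sales.orders", env="PROD"),
    aspect=DatasetPropertiesClass(description="One row per customer order."),
)

emitter.emit(mcp)
```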

Using DataHub’s helm charts and the suggested default configuration (MySQL as database, dedicated Kafka instance, Elasticsearch for search and graph index, MCE and MAE in the backend), you can deploy DataHub quickly and easily.

5. IBM Knowledge Catalog Data Catalog

Source: IBM Knowledge

IBM Knowledge Catalog is a metadata store intended to enable AI, machine learning, and other analytics operations. It integrates with the core InfoSphere Information Governance Catalog to assist enterprises in discovering and managing data from both cloud and on-premises sources. 

The tool can catalog a wide range of data and analytics assets, including machine learning models and structured, unstructured, and semi-structured data. It enables intelligent cataloging and data discovery aided by automatic search suggestions.

The data catalog software also includes a self-service portal and automated data governance features such as active policy management, role-based access control, and dynamic masking of sensitive data. You can run it in the cloud, on-premises, or as a fully managed service on the IBM Cloud Pak for Data platform.

The current version, which was released in November 2023 alongside Cloud Pak for Data 4.8, includes new data sources for importing metadata, relationship diagrams to visualize complex asset relationships, new user permissions for data quality controls, automatic mapping of logical data models, data privacy enhancements, and other features.

6. Boomi Data Catalog and Preparation

Source: Boomi

Boomi Data Catalog and Preparation is part of the company’s AtomSphere Platform, a suite of products that also includes data integration, master data management, and other features. The tool includes a data catalog with data preparation tools. 

Organizations can use this catalog to compile a consolidated business dictionary of data for tracking data sets, processing tasks, and workflow schedules. Then, they can run a data prep recommendation engine to automatically cleanse, enrich, normalize, and convert data.

The catalog tool has interfaces to over 1,000 endpoints, including more than 200 apps. IT and data management teams may also use data pipelines to automate workflows for analytics, machine learning, and AI procedures. Moreover, data governance and security features can improve controls across many apps and business processes.

Boomi Data Catalog and Preparation comes with the following capabilities:

  • Support for natural language questions and personalized searches
  • Deployment in the cloud, on-premises, or in a hybrid environment
  • Collaboration capabilities, including the ability to rate and comment on data and to ask data stewards for access to specific datasets

7. Dataproc Metastore (Google Cloud Data Catalog) 

Source: Dataproc Metastore

Dataproc Metastore (Google Cloud Data Catalog) is a fully managed data discovery and metadata management solution that supports both cloud and on-premises data sources. It’s intended to allow both data experts and business users to search a catalog using natural language queries and annotate data at scale. 

The tool includes built-in connections with Google BigQuery, Pub/Sub, Dataproc Metastore, and Cloud Storage data services. It also works with the company’s IAM and Cloud Data Loss Prevention services to help with data security and compliance management as part of data governance projects.

The data catalog software is delivered as a serverless service, eliminating the need for customers to set up or manage infrastructure. You can catalog data assets and access additional features through the UI in Google’s Dataplex data fabric environment, as well as through a CLI and a set of APIs and client libraries. The tool can hold both technical and business metadata, including tags and tag templates, and it can store custom metadata types as well as file-set schemas from the Cloud Storage service.
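For a sense of the client-library route, here is a small sketch that searches the catalog from Python; the project ID and query string are placeholders:

```python
# pip install google-cloud-datacatalog
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Limit the search to one project (placeholder ID) and look for "orders".
scope = datacatalog_v1.SearchCatalogRequest.Scope(
    include_project_ids=["my-project-id"]
)

# Each result names a cataloged resource such as a BigQuery table.
for result in client.search_catalog(request={"scope": scope, "query": "orders"}):
    print(result.relative_resource_name)
```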

Other features of this data catalog tool include:

  • Automatic synchronization of technical metadata
  • Automatic labeling of sensitive data
  • A unified view of data from cloud and on-premises platforms

8. Atlan Data Discovery & Catalog

Source: Atlan

Atlan is a third-generation data catalog built on design concepts from GitHub, Slack, and other end-user technologies. In particular, Atlan Data Discovery & Catalog facilitates collaboration by integrating smoothly with standard data workflows.

For example, data teams can use it to identify issues that must be handled promptly. It allows for contextual conversations in Slack chats that can take advantage of a reverse metadata feature, and individual users can submit Jira tickets to report concerns while browsing data sets.

The platform also provides the following capabilities to facilitate integration with common data sources and data quality tools:

  • Open APIs that allow for fully customized metadata ingestion
  • Programmable bots to help with task automation using proprietary machine learning and data science methods
  • A plugin marketplace that connects to numerous data tools and platforms

9. Collibra Data Catalog

Source: Collibra

Collibra provides a Data Intelligence Cloud platform centered on the Collibra Data Catalog. Its data catalog capabilities include a wide set of automated features for data discovery and classification powered by a proprietary machine learning algorithm. Other capabilities include machine-learning-driven data curation and data lineage.

Collibra also supports graph-based metadata management approaches, which help you understand data quality and lineage.

Collibra Data Catalog features prebuilt interfaces for consuming metadata from various data repositories and popular business applications, BI platforms, and data science tools. It also has embedded data governance capabilities, guided data stewardship features, and granular controls for enforcing data security and privacy safeguards, all from a single dashboard.

In addition, Collibra has the following features:

  • A business glossary that standardizes terminology, plus automated data governance workflows and dashboards
  • Collaboration options, including crowdsourced input on data assets via ratings, reviews, and comments
  • A “data shopping experience” that allows users to find relevant data without any SQL scripting

10. Alation Data Catalog

Source: Alation

Alation’s data catalog uses AI, machine learning, automation, and NLP techniques to simplify data discovery, automatically generate business glossaries, and power its Behavioral Analysis Engine. The latter analyzes data usage patterns to streamline data stewardship, data governance, and query optimization.

Alation offers guided navigation and a variety of collaborative tools. For example, it can automatically find data stewards or other subject matter experts to answer data-related inquiries, and users can build wiki pages and searchable chats. Data scientists can also subscribe to receive automated notifications if datasets or articles are updated. Prebuilt analytics dashboards provide customizable reporting, while Alation Cloud Service delivers data insight as a service.

Other significant aspects of the Alation tool are:

  • The capacity to identify data quality concerns and establish enterprise data governance standards
  • Prebuilt connections for various data sources, as well as an Open Connector Framework SDK for creating unique ones
  • A built-in SQL editor that can be used instead of natural language search

11. Data.world Data Catalog

Source: Data.world

Data.world is a cloud-native data catalog product that is available as a SaaS platform. It’s known for its knowledge graph methodology, which provides teams with a semantically structured view of enterprise data assets and related metadata across several platforms. This makes it easier for business and analytics users to locate important data and comprehend its context.

Data.world introduced data catalog services driven by knowledge graphs in 2022. The Eureka package contains automation for deploying and managing data catalogs and an Action Center dashboard with metrics, alerts, suggestions, and other features.

The data catalog tool also has generative AI capabilities to enhance data discovery. AI bots can help with data searches, suggest research questions and analytics hypotheses, turn natural language questions into SQL code, and produce natural language descriptions for metadata resources.
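On the programmatic side, data.world also provides a Python SDK. Here is a brief sketch that runs SQL against a dataset; the dataset key and table name are placeholders, and an API token is assumed to be configured:

```python
# pip install datadotworld  (expects a DW_AUTH_TOKEN from your account)
import datadotworld as dw

# Run SQL against a dataset identified by "owner/dataset-id" (placeholder key).
results = dw.query(
    "my-org/sales-data",
    "SELECT product, SUM(amount) AS total FROM orders GROUP BY product",
)

# Results come back ready for analysis as a pandas DataFrame.
print(results.dataframe.head())
```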

Other key features include:

  • Collaboration features can assist in expediting workflows and facilitating knowledge exchange between data producers and users
  • Metadata may be automatically organized, aggregated, and presented in a way that is easy to use and share across collaborators
  • Data access can be virtualized or federated, and data governance controls are built in

12. Select Star Data Catalog

Source: Select Star

Select Star is a relatively new data discovery platform designed for the cloud. Its highly automated platform and accessible UI provide insights into your data model, allowing data engineers and non-technical stakeholders to quickly comprehend the context of their data. You can set up your catalog in less than an hour thanks to native connections with common data warehouses, ETL, and BI tools.

Select Star automates lineage, ERDs, and documentation/tag propagation, reducing the manual labor necessary to curate your data. It also has a universal search feature that leverages popularity to bring up the most relevant results from all of your data sources.

Select Star’s open API makes it simple to manage your data programmatically or integrate with other applications, while permission-based access management gives data teams complete control over their metadata.

13. OpenMetadata Data Catalog

Source: OpenMetadata

OpenMetadata, developed by the team behind Uber’s data architecture, approaches the metadata problem from a new viewpoint by avoiding technical decisions typical of previous solutions. Rather than adopting a full-fledged graph database such as JanusGraph or Neo4j, it stores entity relationships in PostgreSQL.

It likewise eschews a Lucene-based full-text search engine such as Apache Solr or Elasticsearch, relying instead on PostgreSQL’s flexible design to handle that workload. Otherwise, OpenMetadata’s feature set is similar to that of most other open-source data cataloging tools.

OpenMetadata strives to centralize metadata for governance, quality, profiling, provenance, and collaboration. It’s backed by a diverse set of connectors and integrations for cloud and data platforms. The tool is extensively used and under active development.
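As a brief sketch of its REST surface, the following lists tables from a local OpenMetadata server; the base URL and token are placeholders:

```python
# pip install requests
import requests

BASE_URL = "http://localhost:8585/api/v1"  # default local OpenMetadata server
TOKEN = "YOUR_JWT_TOKEN"                   # placeholder bot/ingestion token

resp = requests.get(
    f"{BASE_URL}/tables",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"limit": 5},
)
resp.raise_for_status()

# The response wraps results in a "data" array of table entities.
for table in resp.json()["data"]:
    print(table["fullyQualifiedName"])
```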

14. Zeenea Data Catalog

Source: Zeenea

Zeenea is a scalable SaaS solution that you can connect to any data source. Its physical and logical metamodels let you view and document your data and its relationships.

Zeenea provides users several options for finding the data they need, including a simple keyword search with a clever filtering system and direct catalog browsing. You can analyze data lineage using a user-friendly lineage graph, increasing trust in the accessible data throughout the company. This capability can be boosted with data version control tools available on the market.

Zeenea also provides traceability features for compliance reporting, and its corporate lexicon ensures terminology uniformity across the firm.

15. Gravitino Data Catalog

Source: Gravitino

Gravitino is an open-source, high-performance metadata lake that is geographically distributed and federated. It integrates deeply with Apache Iceberg and manages metadata directly on the server side across many sources, types, and regions. It also gives users consistent access to metadata for data and AI assets via REST APIs.

Gravitino offers a consistent interface for managing Iceberg metadata and supports the Iceberg REST catalog interface for interoperability with existing data ecosystems. This makes Gravitino a data hub that connects any data, regardless of type or location. Supporting the Iceberg REST catalog service aligns neatly with Gravitino’s goals.

Gravitino’s REST catalog, like a conventional catalog service, supports all namespace and table operations, such as table creation, deletion, modification, and renaming. Beyond these fundamentals, more advanced features are available or planned.
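Because Gravitino exposes the standard Iceberg REST catalog protocol, off-the-shelf Iceberg clients can talk to it. Here is a hedged sketch using PyIceberg, where the endpoint URI and table name are placeholders that depend on your deployment:

```python
# pip install pyiceberg
from pyiceberg.catalog import load_catalog

# Connect to the Iceberg REST endpoint Gravitino exposes (placeholder URI;
# the actual host, port, and path depend on your deployment).
catalog = load_catalog(
    "gravitino",
    type="rest",
    uri="http://localhost:9001/iceberg/",
)

# Standard Iceberg catalog operations work against the REST service.
for namespace in catalog.list_namespaces():
    print(namespace)

table = catalog.load_table("sales.orders")  # placeholder namespace.table
print(table.schema())
```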

Gravitino’s basic design idea is to make modules pluggable to meet various requirements. For example, it provides pluggable authorization and authentication interfaces, metric storage interfaces, and event listeners.

Conclusion

Modern data catalog platforms use various approaches to improve usability and productivity. Automation helps teams manage a data catalog with less effort. Integration features enable the catalog to retrieve metadata from various sources automatically.

Data catalog search tools go beyond simple keyword searches to make suggestions. They also provide filters, allowing users to discover data based on various parameters. The user experience is similar to current search engines, with relevant, ranked, and quickly accessible results. Effective data retrieval saves time while promoting data discovery and exploration.

Data catalog tools act as a global dictionary, providing consistent definitions of terms and metrics throughout an organization. They guarantee that each metadata term has a single, explicit definition. This is especially important for ensuring data integrity and encouraging clear communication across diverse teams.
