Einat Orr, PhD Author

Einat Orr is the CEO and Co-founder of lakeFS, a...

Last updated on April 26, 2024

[2024 Update]

Vector databases first emerged a few years ago to power a new generation of search engines based on neural networks. Today, they play a new role: helping organizations deploy applications based on large language models like GPT-4.

Vector databases differ from standard relational databases, such as PostgreSQL, which were built to store tabular data in rows and columns. They’re also distinct from newer NoSQL databases like MongoDB that store data as JSON. That’s because a vector database is designed to store and retrieve just one type of data: vector embeddings.

Vector embeddings are numerical representations of data (text, images, audio, and so on) produced by a trained machine learning model. Items with similar meaning end up close together in the vector space, which is what lets an application compare fresh data against stored data during inference.
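To make the idea of comparing embeddings concrete, here is a minimal sketch of cosine similarity, one common way to measure how close two vectors are. The three-dimensional vectors and their labels are made up for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, invented for this example.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
}

# Rank every other item by its similarity to "cat", most similar first.
query = embeddings["cat"]
ranked = sorted(
    (name for name in embeddings if name != "cat"),
    key=lambda name: cosine_similarity(query, embeddings[name]),
    reverse=True,
)
print(ranked)
```

A vector database performs this kind of comparison at scale, typically with approximate indexes instead of the full scan shown here.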

What vector database solutions are available today to help you store and retrieve high dimensional vectors? Before we move on to the review of the 16 most promising vector databases and libraries, let’s clarify the difference between these two technologies.

Vector libraries vs. vector databases

While specialized vector databases are storage systems developed for efficient management of dense vectors, vector libraries are integrated into existing database management systems (DBMS) or search engines to provide similarity search.

Vector libraries are a good choice for static data applications such as academic information retrieval benchmarks. Vector databases are useful for applications that require frequent data changes, such as e-commerce suggestions, image search, and semantic search.

So which vector databases lead the market today, and which one fits your project? The overview below covers the 16 most popular options to help you compare them and pick the right tool.

16 best vector databases you should consider in 2024

1. Pinecone

Pinecone
Pinecone: https://www.pinecone.io/

Open source? No 

GitHub stars: –

What problem does it solve?

Pinecone is a managed, cloud-native vector database with a straightforward API and no infrastructure requirements. Users can launch, operate, and expand their AI solutions without the need for any infrastructure maintenance, service monitoring, or algorithm troubleshooting. 

The solution processes data quickly and lets users use metadata filters and sparse-dense index support for high-quality relevance, guaranteeing speedy and accurate results across a wide range of search needs.

Key features

  • Duplicate detection and deduplication.
  • Rank tracking.
  • Data search.
  • Classification.

2. MongoDB

MongoDB vector search

MongoDB: https://www.mongodb.com/

GitHub stars: 25.2k

What problem does it solve?

MongoDB Atlas is a popular managed developer data platform that can handle a large variety of transactional and search workloads. Atlas Vector Search uses a specialized vector index that is automatically synced with the core database and can be configured to run on separate infrastructure. This combines the benefits of an integrated database with the independent scaling that often leads users to look for a dedicated vector database.

Key features

  • Integrated database + vector search capabilities
  • Independent provisioning for database and search index  
  • Up to 16 MB of data per document
  • High availability, strong transaction guarantees, multiple levels of data durability, archiving, and backup
  • Industry leader in transactional data encryption
  • Hybrid search

3. Milvus 

Milvus
Milvus: https://milvus.io/

Open source? Yes 

GitHub stars: 21.1k

What problem does it solve?

Milvus is an open-source vector database designed to support vector embeddings, efficient similarity search, and AI applications. It was released in October 2019 under the open-source Apache License 2.0 and is now a graduated project under the auspices of the LF AI & Data Foundation.

The tool simplifies unstructured data search and delivers a uniform user experience independent of the deployment environment. To improve elasticity and adaptability, all components in the refactored version of Milvus 2.0 are stateless.

Use cases for Milvus include image search, chatbots, and chemical structure search.

Key features

  • Millisecond search on trillion-vector datasets.
  • Simplified unstructured data management.
  • Reliable vector database that is always available.
  • Highly scalable and adaptable.
  • Hybrid search.
  • Unified Lambda structure.
  • Supported by the community and acknowledged by the industry.

4. Chroma

Chroma
Chroma: https://www.trychroma.com/

Open source? Yes 

GitHub stars: 7k

What problem does it solve?

Chroma DB is an open-source, AI-native embedding database that aims to simplify building LLM applications. It does so by making knowledge, facts, and skills pluggable for machine learning models at the scale of LLMs, which also helps reduce hallucinations.

Many engineers have expressed a desire for “ChatGPT but for data,” and Chroma offers this link via embedding-based document retrieval. It also provides ‘batteries included’ with everything teams need to store, embed, and query data, including strong capabilities like filtering, with more features like intelligent grouping and query relevance on the way.

Key features

  • Feature-rich: queries, filtering, density estimates, and many other features.
  • Integrations: LangChain (Python and JavaScript) and LlamaIndex, with more to be added shortly.
  • The same API that runs in your Python notebook scales to your cluster for development, testing, and production.

5. Weaviate

Weaviate
Weaviate: https://github.com/weaviate/weaviate

Open source? Yes 

GitHub stars: 6.7k

What problem does it solve?

Weaviate is a cloud-native, open-source vector database that is resilient, scalable, and quick. The tool can convert text, photos, and other data into a searchable vector database using cutting-edge machine learning models and algorithms.

It can perform a 10-nearest-neighbor (10-NN) search in single-digit milliseconds over millions of items. Engineers can use it to vectorize their data during the import process or submit their own vectors, ultimately creating systems for question-and-answer extraction, summarization, and categorization.

Weaviate modules enable the use of prominent services and model hubs like OpenAI, Cohere, or HuggingFace, as well as the use of local and bespoke models. Weaviate is designed with scale, replication, and security in mind. 

Key features

  • Built-in modules for AI-powered searches, Q&A, combining LLMs with your data, and automated categorization.
  • Complete CRUD capabilities.
  • Cloud-native, distributed, grows with your workloads and operates nicely on Kubernetes.
  • Helps move ML models smoothly into MLOps workflows.

6. Deep Lake 

Deep Lake
Deep Lake: https://github.com/activeloopai/deeplake

Open source? Yes 

GitHub stars: 6.4k

What problem does it solve?

Deep Lake is an AI database powered by its own storage format, designed specifically for deep-learning and LLM-based applications that leverage natural language processing. It helps engineers deploy enterprise-grade LLM-based products faster via vector storage and an array of features.

Deep Lake works with data of any size, is serverless, and allows you to store all data in a single location. 

It also offers tool integrations to help streamline your deep learning operations. For example, using Deep Lake and Weights & Biases, you can track experiments and achieve full model repeatability. The integration delivers dataset-related information (URL, commit hash, view ID) to your W&B runs automatically. 

Key features

  • Storage for all data types (embeddings, audio, text, videos, images, PDFs, annotations, and so on).
  • Querying and vector search.
  • Data streaming during training models at scale.
  • Data versioning and lineage for workloads. 
  • Integrations with tools like LangChain, LlamaIndex, Weights & Biases, and many more.

7. Qdrant

Qdrant
Qdrant: https://github.com/qdrant/qdrant

Open source? Yes 

GitHub stars: 11.5k

What problem does it solve?

Qdrant is an open-source vector similarity search engine and database. It offers a production-ready service with an easy-to-use API for storing, searching, and managing points, that is, high-dimensional vectors with an additional payload.

The tool was designed to provide extensive filtering support. Qdrant’s versatility makes it a good pick for neural network or semantic-based matching, faceted search, and other applications.

Key features

  • JSON payloads can be connected with vectors, allowing for payload-based storage and filtering. 
  • Supports a wide range of data types and query criteria, such as text matching, numerical ranges, geo-locations, and others. 
  • The query planner makes use of cached payload information to improve query execution.
  • Write-ahead logging protects data during power outages: the update log records all operations, allowing for easy reconstruction of the most recent database state.
  • Qdrant functions independently of external databases or orchestration controllers, which simplifies configuration.
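The payload-based filtering described above can be sketched in a few lines of plain Python. This is a brute-force illustration of the concept only, not Qdrant's API or its internal algorithm: the `points` records, the `filtered_search` helper, and the payload fields are all hypothetical, and a real engine would use an approximate index rather than scanning every point.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical points: each vector carries a JSON-like payload.
points = [
    {"id": 1, "vector": [0.1, 0.9], "payload": {"city": "Berlin", "price": 40}},
    {"id": 2, "vector": [0.2, 0.8], "payload": {"city": "London", "price": 25}},
    {"id": 3, "vector": [0.9, 0.1], "payload": {"city": "Berlin", "price": 15}},
    {"id": 4, "vector": [0.15, 0.85], "payload": {"city": "Berlin", "price": 90}},
]

def filtered_search(query, predicate, limit):
    """Keep only points whose payload passes the predicate, then rank by distance."""
    candidates = [p for p in points if predicate(p["payload"])]
    candidates.sort(key=lambda p: l2(query, p["vector"]))
    return [p["id"] for p in candidates[:limit]]

# Text match on "city" combined with a numerical range on "price".
result = filtered_search(
    [0.1, 0.9],
    lambda payload: payload["city"] == "Berlin" and payload["price"] <= 50,
    limit=2,
)
print(result)
```

Point 4 is the second-closest vector to the query but is excluded by the price condition, which shows how the filter and the similarity ranking interact.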

8. Elasticsearch 

Elasticsearch
Elasticsearch: https://www.elastic.co/elasticsearch/

Open source? Yes 

GitHub stars: 64.4k

What problem does it solve?

Elasticsearch is an open-source, distributed, and RESTful analytics engine that can handle textual, numerical, geographic, structured, and unstructured data. Based on Apache Lucene, it was initially published in 2010 by Elasticsearch N.V. (now Elastic). Elasticsearch is part of the Elastic Stack, a suite of free and open tools for data intake, enrichment, storage, analysis, and visualization. 

Elasticsearch can handle a wide range of use cases: it centrally stores your data for fast search, fine-tuned relevance, and sophisticated analytics that scale easily. It scales horizontally to handle high event throughput while automatically managing how indexes and queries are distributed across the cluster for smooth operations.

Key features

  • Clustering and high availability.
  • Automatic node recovery and data rebalancing.
  • Horizontal scalability.
  • Cross-cluster and cross-datacenter replication, which allows a secondary cluster to operate as a hot backup.
  • Elasticsearch identifies errors in order to keep clusters (and data) secure and accessible. 
  • Works in a distributed architecture that was built from the ground up to provide constant peace of mind.

9. Vespa

Vespa
Vespa: https://vespa.ai/

Open source? Yes 

GitHub stars: 4.5k

What problem does it solve?

Vespa is an open-source data serving engine that allows users to store, search, organize, and make machine-learned judgments over massive data at serving time.

Huge data sets must be dispersed over numerous nodes and examined in parallel, and Vespa is a platform that handles these tasks for you while maintaining excellent availability and performance. 

Key features

  • Writes are acknowledged to the client within a few milliseconds, once they are durable and visible in queries.
  • While serving requests, each node can sustain a continuous write rate of thousands to tens of thousands of writes per second.
  • Data is replicated with configurable redundancy.
  • Queries can include any combination of structured filters, free text search operators, and vector search operators, as well as large tensors and vectors.
  • Matches to a query can be grouped and aggregated based on a query definition.
  • All matches are included, even when the query executes on several machines in parallel.

10. Vald 

Vald
Vald: https://vald.vdaas.org/

Open source? Yes 

GitHub stars: 1274

What problem does it solve?

Vald is a distributed, scalable, and fast vector search engine. Built on a cloud-native architecture, it uses NGT, which the project describes as the fastest ANN algorithm, to find neighbors.

Vald offers automated vector indexing and index backup, as well as horizontal scaling, allowing it to search across billions of feature vectors. It's simple to use and highly configurable. For example, you can customize the Ingress/Egress filter to work with the gRPC interface.

Key features

  • Vald offers automatic backups through Object Storage or Persistent Volume, allowing for disaster recovery.
  • It distributes vector indexes to numerous agents, each of which retains a unique index.
  • The tool replicates indexes by storing each index in multiple agents. When a Vald agent goes down, the replica is rebalanced automatically.
  • Highly adaptable – you may choose the number of vector dimensions, replicas, and so forth.
  • Python, Golang, Java, Node.js, and more programming languages are supported.

11. ScaNN 

ScaNN
ScaNN: https://github.com/google-research/google-research/tree/master/scann

Open source? Yes 

GitHub stars: –

What problem does it solve?

ScaNN (Scalable Nearest Neighbors) is a method for efficient vector similarity search at scale. Google's ScaNN introduces a new compression method that significantly increases accuracy. This allows it to outperform other vector similarity search libraries by a factor of two, according to ann-benchmarks.com.

It includes search space trimming and quantization for Maximum Inner Product Search, as well as additional distance functions like Euclidean distance. The implementation is intended for x86 processors that support AVX2. 

12. Pgvector

Pgvector
Pgvector: https://github.com/pgvector/pgvector

Open source? Yes 

GitHub stars: 4.5k

What problem does it solve?

pgvector is a PostgreSQL extension for vector similarity search. You can also use it to store embeddings alongside the rest of your data, so pgvector helps you keep all of the application data in one place.

Its users get to benefit from ACID compliance, point-in-time recovery, JOINs, and all of the other fantastic features we love PostgreSQL for.

Key features

  • Exact and approximate nearest neighbor search
  • L2 distance, inner product, and cosine distance
  • Works with any language that has a PostgreSQL client
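The three distance measures listed above can be written out in plain Python to show what each one computes. This is an illustrative sketch, not pgvector's implementation; the function names are our own. In pgvector's SQL syntax, these correspond to the `<->` (L2 distance), `<#>` (negative inner product), and `<=>` (cosine distance) operators.

```python
import math

def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    """Inner product with the sign flipped, so that smaller still means more similar."""
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for vectors pointing in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
print(l2_distance(a, b), negative_inner_product(a, b), cosine_distance(a, b))
```

Flipping the sign of the inner product lets all three measures be sorted in ascending order, which is convenient for an `ORDER BY ... LIMIT k` style query.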

13. Faiss 

Faiss
Faiss: https://github.com/facebookresearch/faiss

Open source? Yes 

GitHub stars: 23k

What problem does it solve?

Developed by Facebook AI Research, Faiss is an open-source library for fast similarity search and clustering of dense vectors. It includes methods for searching sets of vectors of any size, up to those that may not fit in RAM. It also comes with code for evaluation and parameter tuning.

Faiss is built around an index type that stores a set of vectors and offers a function for searching through them using L2 and/or dot product vector comparison. Some index types, such as exact search, are simple baselines.

Key features

  • Returns not just the nearest neighbor but also the second nearest, third nearest, and k-th nearest neighbor.
  • You can search several vectors at once rather than just one (batch processing). 
  • Supports maximum inner product search as an alternative to minimum Euclidean distance search.
  • Other distances (L1, Linf, etc.) are also supported to a lesser extent.
  • Returns all elements within a specified radius of the query location (range search).
  • Instead of storing the index in RAM, you can save it to disk.
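Several of the features above (k-nearest-neighbor results, batch queries, and range search) can be illustrated with an exact brute-force search in plain Python. This sketch does not use the Faiss API; `knn_search` and `range_search` are hypothetical helpers that mimic what an exact-search baseline returns. It ranks by squared L2 distance, which preserves the same ordering as the true distance.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance; ranking by it matches ranking by true distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_search(index, queries, k):
    """Batch search: for each query, return the k closest (distance, position) pairs."""
    return [
        heapq.nsmallest(k, ((squared_l2(q, v), i) for i, v in enumerate(index)))
        for q in queries
    ]

def range_search(index, query, radius):
    """Return positions of all vectors within `radius` of the query."""
    return [i for i, v in enumerate(index) if squared_l2(query, v) <= radius ** 2]

# A tiny made-up "index" of four 2-dimensional vectors.
index = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

batch = knn_search(index, [[0.1, 0.0], [2.9, 3.0]], k=2)
nearby = range_search(index, [0.0, 0.0], radius=1.5)
print([[i for _, i in hits] for hits in batch], nearby)
```

A full scan like this costs time proportional to the number of stored vectors per query, which is why libraries such as Faiss also provide approximate index types for large collections.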

14. ClickHouse

ClickHouse
ClickHouse: https://clickhouse.com/

Open source? Yes

GitHub stars: 31.8k

What problem does it solve?

ClickHouse is an open-source column-oriented DBMS for online analytical processing that enables users to produce analytical reports in real time by running SQL queries. The actual column-oriented DBMS design is at the heart of ClickHouse’s uniqueness. This distinct design provides compact storage with no unnecessary data accompanying the values, which significantly improves processing performance.

It processes data in vectors (chunks of column values) rather than one value at a time, which improves CPU efficiency and contributes to ClickHouse's exceptional speed.

Key features

  • Data compression is a feature that significantly improves ClickHouse’s performance.
  • ClickHouse combines low-latency data extraction with the cost-effectiveness of employing standard hard drives.
  • It uses multicore and multiserver setups to accelerate massive queries, which is a rare feature in columnar DBMSs.
  • With robust SQL support, ClickHouse excels at processing a wide range of queries.
  • ClickHouse’s continuous data addition and quick indexing meet real-time demands.
  • Its low latency provides quick query processing, which is critical for online activities.

15. OpenSearch

OpenSearch
OpenSearch: https://opensearch.org/

Open source? Yes 

GitHub stars: –

What problem does it solve?

OpenSearch stands out among vector databases because it brings together classical search, analytics, and vector search in a single solution. Its vector database features help speed up AI application development by minimizing the work required for developers to operationalize, manage, and integrate AI-generated assets.

You can bring in your models, vectors, and information to enable vector, lexical, and hybrid search and analytics, with built-in performance and scalability.
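One way to picture hybrid search is as the merging of two ranked result lists, one from lexical matching and one from vector similarity. The sketch below uses reciprocal rank fusion (RRF), a common general-purpose fusion technique. It is shown here as an assumption for illustration, not as a description of how OpenSearch combines results; the document IDs and the constant `k=60` are likewise illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical result lists from a keyword query and a vector query.
lexical_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
print(fused)
```

Documents that appear in both lists rise to the top, while documents found by only one method are still kept, which is the main appeal of combining lexical and vector search.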

Key features

  • As a vector database, OpenSearch may be used for a variety of purposes, such as search, personalization, data quality, and vector database engine.
  • Among its search use cases, you can find multimodal search, semantic search, visual search, and gen AI agents.
  • You can create product and user embeddings using collaborative filtering techniques and fuel your recommendation engine with OpenSearch.
  • To aid data quality operations, OpenSearch users can use similarity search to automate pattern matching and duplicate detection in data.
  • The solution lets you build a platform on an integrated, Apache 2.0-licensed vector database that offers a dependable and scalable way to store embeddings and power vector search.

16. Apache Cassandra

Apache Cassandra
Apache Cassandra: https://cassandra.apache.org/

Open source? Yes 

GitHub stars: 8.3k

What problem does it solve?

Cassandra is a distributed, wide-column store, NoSQL database management system that is free and open-source. It was designed to handle massive volumes of data across many commodity servers while maintaining high availability with no single point of failure.

Cassandra will soon be equipped with vector search, which demonstrates the Cassandra community's dedication to delivering dependable innovations quickly. Cassandra's popularity is growing among AI developers and businesses dealing with big data volumes, as it provides them with the capabilities to build complex, data-driven applications.

Key features

  • Cassandra will have a new data type to facilitate the storage of high dimensional vectors. This will allow for the manipulation and storage of Float32 embeddings, which are extensively used in AI applications.
  • The tool will also provide a new storage-attached index (SAI) dubbed “VectorMemtableIndex” to support approximate nearest neighbor (ANN) search capabilities.
  • It will offer a new Cassandra Query Language (CQL) operator, ANN OF, to make it easier for users to run ANN searches on their data.
  • Cassandra’s new vector search feature is designed as an extension to the existing SAI framework, eliminating the need to redesign the fundamental indexing engine.

How to choose the right vector database for your project

When picking a vector database for your project, consider the following factors:

  • Hosting: do you have an engineering team to host the database, or do you need a fully managed database?
  • Embeddings: do you already have vector embeddings, or do you need the vector database to generate them?
  • Latency requirements, such as batch or online.
  • Developer experience in the team.
  • The learning curve of the given tool.
  • Solution reliability.
  • Implementation and maintenance costs.
  • Security and compliance.

As AI continues to take the tech industry by storm, vector databases will become commonplace, and it’s only natural that we’ll see more and more tools emerge in this market.

How are you using vector database solutions today? Let us know in our Slack community!

Frequently Asked Questions

The answer to this question will depend on many factors: your requirements, experience in the team, availability of vector embeddings, and more. Some of the primary considerations for selecting vector databases can be speed, scalability, developer experience, community, and pricing.

An AI project can definitely benefit from vector databases thanks to capabilities such as similarity search. A good vector database offers a basis for applications by including characteristics such as data management (including vector data), fault tolerance, important security features, and a query engine. Users can rely on these features to operationalize their workloads, scale them more easily, and meet security needs.

Vector data is particularly suitable for storing and expressing data with distinct boundaries, such as borders or building footprints, roadways and other modes of transportation, and position points.

Many of the vector databases mentioned in our overview are licensed as open source and you can use them free of charge.
