The lakeFS team

July 31, 2023

Vector databases first emerged a few years ago to power a new generation of search engines based on neural networks. Today, they play a new role: helping organizations deploy applications based on large language models like GPT-4.

Vector databases differ from standard relational databases, such as PostgreSQL, which were built to store tabular data in rows and columns. They’re also distinct from newer NoSQL databases like MongoDB that store data as JSON. That’s because a vector database is designed to store and retrieve just one type of data: vector embeddings.

Vector embeddings are numerical representations of data such as text, images, or audio, produced by a machine learning model. Items with similar meaning end up close together in the vector space, so at inference time fresh data can be embedded the same way and matched against the stored vectors.
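To make the idea concrete, here is a minimal Python sketch of comparing embeddings by cosine similarity. The three-dimensional vectors and their labels are made up for illustration; real embedding models typically produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (assumed values, not produced by a real model).
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
}

def most_similar(query, items):
    """Return the key whose embedding is most similar to the query vector."""
    return max(items, key=lambda name: cosine_similarity(query, items[name]))

print(most_similar([0.1, 0.8, 0.3], embeddings))  # car
```

A vector database performs the same kind of comparison, but over millions of stored vectors and with indexes that avoid scanning every item.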

What vector databases are available on the market today? Before we move on to the review of the 12 most promising vector databases and libraries, let’s clarify the difference between these two technologies.

Vector libraries vs. vector databases

While vector databases are specialized storage systems developed for efficient management of dense vectors, vector libraries are integrated into existing database management systems (DBMS) or search engines to provide similarity search.

Vector libraries are a good choice for static data applications such as academic information retrieval benchmarks. Vector databases are useful for applications that require frequent data changes, such as e-commerce suggestions, image search, and semantic similarity.
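The difference shows up as soon as the data changes. The sketch below is a toy in-memory store, not the API of any product in this list; the class name `TinyVectorStore` and its methods are hypothetical. It illustrates the upsert and delete operations that a database has to support between searches, which a static, build-once index does not need.

```python
class TinyVectorStore:
    """Toy in-memory store: vectors can be added, replaced, or removed between searches."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, item_id, vector):
        # Insert a new vector or overwrite the existing one for this id.
        self._vectors[item_id] = list(vector)

    def delete(self, item_id):
        # Removing an unknown id is a no-op.
        self._vectors.pop(item_id, None)

    def search(self, query, k=1):
        # Brute-force scan ranked by squared Euclidean distance (smallest first).
        def squared_l2(vector):
            return sum((q - v) ** 2 for q, v in zip(query, vector))

        ranked = sorted(self._vectors, key=lambda item_id: squared_l2(self._vectors[item_id]))
        return ranked[:k]

store = TinyVectorStore()
store.upsert("shoe-1", [1.0, 0.0])
store.upsert("shoe-2", [0.9, 0.1])
store.upsert("hat-1", [0.0, 1.0])

before = store.search([1.0, 0.0], k=2)  # ['shoe-1', 'shoe-2']
store.delete("shoe-1")                  # e.g. the product was discontinued
after = store.search([1.0, 0.0], k=2)   # ['shoe-2', 'hat-1']
```

Real vector databases replace the brute-force scan with an approximate index and add persistence, but the read-write contract is the same.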

12 vector databases you should consider in 2023

1. Pinecone


Open source? No 

GitHub stars: –

What problem does it solve?

Pinecone is a managed, cloud-native vector database with a straightforward API and no infrastructure requirements. Users can launch, operate, and expand their AI solutions without the need for any infrastructure maintenance, service monitoring, or algorithm troubleshooting. 

The solution processes data quickly and supports metadata filters and sparse-dense indexes, which help it return fast, relevant results across a wide range of search needs.
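One common way to combine sparse (keyword-like) and dense (embedding) signals is a weighted sum of the two dot products. The Python sketch below shows that idea only; it is not Pinecone's API, and the function names, the `alpha` weight, and the sample documents are all assumptions made for illustration.

```python
def dense_dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sparse_dot(a, b):
    """Dot product of sparse vectors stored as {dimension_index: weight} dicts."""
    smaller, larger = (a, b) if len(a) <= len(b) else (b, a)
    return sum(weight * larger.get(index, 0.0) for index, weight in smaller.items())

def hybrid_score(query, doc, alpha):
    """alpha=1.0 uses only the dense score, alpha=0.0 only the sparse score."""
    dense = dense_dot(query["dense"], doc["dense"])
    sparse = sparse_dot(query["sparse"], doc["sparse"])
    return alpha * dense + (1 - alpha) * sparse

def hybrid_search(query, docs, alpha=0.5):
    ranked = sorted(docs, key=lambda doc: hybrid_score(query, doc, alpha), reverse=True)
    return [doc["id"] for doc in ranked]

# Hypothetical documents: "a" matches the query semantically, "c" matches its keywords.
docs = [
    {"id": "a", "dense": [1.0, 0.0], "sparse": {7: 1.0}},
    {"id": "b", "dense": [0.0, 1.0], "sparse": {3: 2.0}},
    {"id": "c", "dense": [0.6, 0.8], "sparse": {3: 3.0}},
]
query = {"dense": [1.0, 0.0], "sparse": {3: 1.0}}

print(hybrid_search(query, docs, alpha=1.0))  # ['a', 'c', 'b']
print(hybrid_search(query, docs, alpha=0.0))  # ['c', 'b', 'a']
```

Tuning `alpha` shifts the ranking between semantic and keyword matches, which is the trade-off a sparse-dense index exposes.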

Key features

  • Duplicate detection and deduplication.
  • Rank tracking.
  • Data search.
  • Classification.

2. Milvus 


Open source? Yes 

GitHub stars: 21.1k

What problem does it solve?

Milvus is an open-source vector database designed to facilitate embedding similarity search and AI applications. It was released in October 2019 under the open-source Apache License 2.0 and is now a graduated project of the LF AI & Data Foundation.

The tool simplifies unstructured data search and delivers a uniform user experience independent of the deployment environment. In the refactored Milvus 2.0, all components are stateless, which improves elasticity and flexibility.

Use cases for Milvus include image search, chatbots, and chemical structure search.

Key features

  • Millisecond search over trillion-vector datasets.
  • Simplified unstructured data management.
  • Reliable vector database that is always available.
  • Highly scalable and adaptable.
  • Hybrid search.
  • Unified Lambda architecture.
  • Supported by the community and acknowledged by the industry.

3. Chroma


Open source? Yes 

GitHub stars: 7k

What problem does it solve?

Chroma DB is an open-source, AI-native embedding database that aims to simplify the process of creating LLM applications by making knowledge, facts, and skills pluggable for LLMs, and by helping to reduce hallucinations.

Many engineers have expressed a desire for “ChatGPT but for data,” and Chroma provides this link via embedding-based document retrieval. It also comes ‘batteries included’, with everything teams need to store, embed, and query data, including strong capabilities like filtering, with more features like intelligent grouping and query relevance on the way.

Key features

  • Feature-rich: queries, filtering, density estimates, and many other features.
  • Integrations: LangChain (Python and JavaScript) and LlamaIndex, with more to be added shortly.
  • The same API that runs in your Python notebook scales to your cluster for development, testing, and production.

4. Weaviate


Open source? Yes 

GitHub stars: 6.7k

What problem does it solve?

Weaviate is a cloud-native, open-source vector database that is resilient, scalable, and quick. The tool can convert text, photos, and other data into a searchable vector database using cutting-edge machine learning algorithms.

It can perform a 10-nearest-neighbor (10-NN) search in single-digit milliseconds over millions of items. Engineers can use it to vectorize their data during the import process or submit their own vectors, ultimately creating systems for question-and-answer extraction, summarization, and categorization.

Weaviate modules enable the use of prominent services and model hubs like OpenAI, Cohere, or HuggingFace, as well as the use of local and bespoke models. Weaviate is designed with scale, replication, and security in mind. 

Key features

  • Built-in modules for AI-powered searches, Q&A, combining LLMs with your data, and automated categorization.
  • Complete CRUD capabilities.
  • Cloud-native, distributed, grows with your workloads and operates nicely on Kubernetes.
  • Seamlessly transfer ML models to MLOps using this database.

5. Deep Lake 


Open source? Yes 

GitHub stars: 6.4k

What problem does it solve?

Deep Lake is an AI database powered by a proprietary storage format designed specifically for deep-learning and LLM-based applications. It helps engineers deploy enterprise-grade LLM-based products faster via vector storage and an array of features.

Deep Lake works with data of any size, is serverless, and allows you to store all data in a single location. 

It also offers tool integrations to help streamline your deep learning operations. For example, using Deep Lake and Weights & Biases, you can track experiments and achieve full model reproducibility. The integration delivers dataset-related information (URL, commit hash, view id) to your W&B runs automatically.

Key features

  • Storage for all data types (embeddings, audio, text, videos, images, PDFs, annotations, and so on).
  • Querying and vector search.
  • Data streaming during training models at scale.
  • Data versioning and lineage for workloads. 
  • Integrations with tools like LangChain, LlamaIndex, Weights & Biases, and many more.

6. Qdrant


Open source? Yes 

GitHub stars: 11.5k

What problem does it solve?

Qdrant is an open-source vector similarity search engine and database. It offers a production-ready service with an easy-to-use API for storing, searching, and managing points (vectors with an additional payload).

The tool was designed to provide extensive filtering support. Qdrant’s versatility makes it a good pick for neural network or semantic-based matching, faceted search, and other applications.

Key features

  • JSON payloads can be connected with vectors, allowing for payload-based storage and filtering. 
  • Supports a wide range of data types and query criteria, such as text matching, numerical ranges, geo-locations, and others. 
  • The query planner makes use of cached payload information to improve query execution.
  • Write-ahead logging: the update log records all operations, so the most recent database state can be reconstructed after a power outage.
  • Qdrant functions independently of external databases or orchestration controllers, which simplifies configuration.
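Payload-based filtering can be pictured as a predicate applied to each point before (or while) vectors are ranked. The Python sketch below is a simplified illustration, not Qdrant's API: the `(field, op, value)` condition format, the function names, and the sample points are hypothetical.

```python
def matches(payload, conditions):
    """Check a payload dict against a list of (field, op, value) conditions."""
    for field, op, value in conditions:
        if field not in payload:
            return False
        actual = payload[field]
        if op == "match":
            if actual != value:
                return False
        elif op == "range":
            low, high = value
            if not (low <= actual <= high):
                return False
        else:
            raise ValueError(f"unsupported condition: {op}")
    return True

def filtered_search(points, query, conditions, k=3):
    """Rank only the points whose payload passes every condition (dot-product score)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    eligible = [p for p in points if matches(p["payload"], conditions)]
    eligible.sort(key=lambda p: dot(query, p["vector"]), reverse=True)
    return [p["id"] for p in eligible[:k]]

# Hypothetical points: each vector carries a JSON-like payload.
points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"city": "Berlin", "price": 30}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"city": "Berlin", "price": 120}},
    {"id": 3, "vector": [0.8, 0.2], "payload": {"city": "London", "price": 40}},
    {"id": 4, "vector": [0.2, 0.8], "payload": {"city": "Berlin", "price": 50}},
]
conditions = [("city", "match", "Berlin"), ("price", "range", (0, 100))]

print(filtered_search(points, [1.0, 0.0], conditions))  # [1, 4]
```

A production engine would use payload indexes and a query planner to avoid scanning every point, but the filter semantics are the same.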

7. Elasticsearch 


Open source? Yes 

GitHub stars: 64.4k

What problem does it solve?

Elasticsearch is an open-source, distributed, and RESTful analytics engine that can handle textual, numerical, geographic, structured, and unstructured data. Based on Apache Lucene, it was first released in 2010 by Elasticsearch N.V. (now Elastic). Elasticsearch is part of the Elastic Stack, a suite of free and open tools for data ingestion, enrichment, storage, analysis, and visualization.

Elasticsearch can handle a wide range of use cases – it centrally stores your data for lightning-fast search, fine-tuned relevance, and sophisticated analytics that scale easily. It scales horizontally to accommodate billions of events per second while automatically managing how indexes and queries are distributed across the cluster for smooth operations.

Key features

  • Clustering and high availability.
  • Automatic node recovery and data rebalancing.
  • Horizontal scalability.
  • Cross-cluster and cross-datacenter replication, which allows a secondary cluster to operate as a hot backup.
  • Failure detection that keeps clusters (and data) secure and accessible.
  • A distributed architecture built from the ground up for resilience.

8. Vespa


Open source? Yes 

GitHub stars: 4.5k

What problem does it solve?

Vespa is an open-source data serving engine that allows users to store, search, organize, and make machine-learned inferences over massive data at serving time.

Huge data sets must be distributed over numerous nodes and examined in parallel, and Vespa is a platform that handles these tasks for you while maintaining high availability and performance.

Key features

  • Writes are acknowledged back to the client that issued them within a few milliseconds, once they are durable and visible in queries.
  • Writes can be sustained at a rate of thousands to tens of thousands per node per second while queries are being served.
  • Data is replicated with configurable redundancy.
  • Queries can include any combination of structured filters, free text search operators, and vector search operators, and can pass in large tensors and vectors.
  • Matches to a query can be grouped and aggregated based on a query definition.
  • Grouping and aggregation cover all matches, even when they are spread across several machines.

9. Vald 


Open source? Yes 

GitHub stars: 1.3k

What problem does it solve?

Vald is a distributed, scalable, and fast approximate nearest neighbor (ANN) dense vector search engine. Designed with cloud-native architecture in mind, it uses NGT, one of the fastest ANN algorithms, to find neighbors.

Vald offers automated vector indexing and index backup, as well as horizontal scaling, allowing it to search across billions of feature vectors. It’s simple to use and extremely configurable – for example, you can customize the Ingress/Egress filter to work with the gRPC interface.

Key features

  • Vald offers automatic backups through Object Storage or Persistent Volume, allowing for disaster recovery.
  • It distributes vector indexes to numerous agents, each of which retains a unique index.
  • The tool replicates indexes by storing each index in several agents. When a Vald agent goes down, the replicas are rebalanced automatically.
  • Highly adaptable – you may choose the number of vector dimensions, replicas, and so forth.
  • Python, Golang, Java, Node.js, and more programming languages are supported.

10. ScaNN 


Open source? Yes 

GitHub stars: –

What problem does it solve?

ScaNN (Scalable Nearest Neighbors) is a method for efficiently searching for vector similarity at scale. Google’s ScaNN proposes a brand-new compression method that significantly increases accuracy. This reportedly allows it to outperform other vector similarity search libraries by a factor of two.

It includes search space pruning and quantization for Maximum Inner Product Search, as well as additional distance functions like Euclidean distance. The implementation is intended for x86 processors that support AVX2.
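Quantization trades a little accuracy for much smaller vectors. ScaNN's own scheme (anisotropic vector quantization) is more sophisticated than what fits here, so the Python sketch below swaps in plain 8-bit scalar quantization to show the general idea. All function names are hypothetical and none of this is ScaNN's API.

```python
def quantize(vector, levels=255):
    """Map each float to an integer code in [0, levels] using the vector's own range."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 when all values are equal
    codes = [round((x - lo) / scale) for x in vector]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Reconstruct approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical database vector and query.
stored = [0.12, -0.5, 0.98, 0.33]
query = [1.0, 1.0, 1.0, 1.0]

codes, lo, scale = quantize(stored)
approx = inner_product(query, dequantize(codes, lo, scale))
exact = inner_product(query, stored)
print(abs(approx - exact) < 0.02)  # True
```

Each reconstructed component is off by at most half a quantization step, so the inner product stays close to the exact value while the stored vector shrinks to one byte per dimension.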

11. pgvector


Open source? Yes 

GitHub stars: 4.5k

What problem does it solve?

pgvector is a PostgreSQL extension for vector similarity search. You can also use it to store embeddings. Ultimately, pgvector helps you keep all of the application data in one place.

Its users get to benefit from ACID compliance, point-in-time recovery, JOINs, and all of the other fantastic features we love PostgreSQL for.

Key features

  • Exact and approximate nearest neighbor search.
  • L2 distance, inner product, and cosine distance.
  • Support for any language with a PostgreSQL client.
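For readers who want to see what the three metrics compute, here is a short Python sketch of the underlying math. In pgvector itself these are exposed as SQL operators (`<->` for L2 distance, `<#>` for negative inner product, `<=>` for cosine distance); the Python function names below are illustrative and not part of pgvector.

```python
import math

def l2_distance(a, b):
    """Euclidean distance (pgvector operator: <->)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Dot product. pgvector's <#> operator returns the negative of this value,
    so that ordering ascending puts the largest inner product first."""
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus cosine similarity (pgvector operator: <=>)."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / norm

print(l2_distance([1, 2, 3], [4, 6, 3]))    # 5.0
print(inner_product([1, 2, 3], [4, 5, 6]))  # 32
```

Cosine distance ignores vector length, which makes it a common default for text embeddings; L2 and inner product are sensitive to magnitude.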

12. Faiss 


Open source? Yes 

GitHub stars: 23k

What problem does it solve?

Developed by Facebook AI Research, Faiss is an open-source library for fast dense vector similarity search and clustering. It includes methods for searching sets of vectors of any size, up to those that may not fit in RAM. It also comes with code for evaluation and parameter tuning.

Faiss is built around index types that store a set of vectors and provide a function for searching through them using L2 and/or dot product vector comparison. Some index types, such as exact search, are simple baselines.

Key features

  • Returns not just the nearest neighbor but also the second nearest, third nearest, and k-th nearest neighbor.
  • You can search several vectors at once rather than just one (batch processing). 
  • Supports maximum inner product search instead of minimum Euclidean distance search.
  • Other distances (L1, Linf, etc.) are also supported to a lesser extent.
  • Returns all elements within a specified radius of the query point (range search).
  • Can store the index on disk rather than in RAM.
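The k-nearest-neighbor, batch, and range search features listed above can be sketched with a brute-force scan in plain Python. This mirrors what an exact-search baseline computes, but it is not Faiss code: the function names and sample data are hypothetical, and Faiss's real indexes are far faster.

```python
import heapq

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_search(database, queries, k):
    """Batch search: for each query, return the k nearest (squared distance, index) pairs."""
    results = []
    for query in queries:
        scored = ((squared_l2(query, vector), index) for index, vector in enumerate(database))
        results.append(heapq.nsmallest(k, scored))  # nearest first
    return results

def range_search(database, query, radius):
    """Return the indices of all vectors within `radius` (Euclidean) of the query."""
    limit = radius ** 2
    return [index for index, vector in enumerate(database) if squared_l2(query, vector) <= limit]

# Hypothetical 2-dimensional database and a batch of two queries.
database = [[0, 0], [1, 0], [0, 2], [5, 5]]
queries = [[0, 0], [5, 4]]

hits = knn_search(database, queries, k=2)
print([[index for _, index in per_query] for per_query in hits])  # [[0, 1], [3, 2]]
print(range_search(database, [0, 0], radius=1.5))                 # [0, 1]
```

Note the difference in result shape: k-NN always returns exactly k hits per query, while range search returns however many vectors fall inside the radius.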

How to choose the right vector database for your project

When picking the database for your project, consider the following factors:

  • Do you have an engineering team to host the database, or do you need a fully managed database?
  • Do you have the embeddings, or do you need a vector database to generate them?
  • Latency requirements, such as batch or online.
  • Developer experience in the team.
  • The learning curve of the given tool.
  • Solution reliability.
  • Implementation and maintenance costs.
  • Security and compliance.

As AI continues to take the tech industry by storm, vector databases will become commonplace and it’s only natural that we’ll see more and more tools emerge in this market.

How are you using vector databases today? Let us know in our Slack community!
