Databases for AI: Should You Use a Vector Database?

The first article in this series laid out the role of vectors in machine learning and LLMs, along with vector representations in software. Now we’ll embark on a quick tour of databases and how they meet the modern challenges of AI.

Vector Databases

Since AI is basically vectors and vectors are AI, why not design databases specifically around the vector operations described in the previous article? Over the past decade, many such databases have been created. The free software options include Milvus, Qdrant, Weaviate, Vespa, ChromaDB, and LanceDB.
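
To make "vector operations" concrete, here is a minimal sketch of the core one: finding the stored vectors most similar to a query. It scans every vector and ranks by cosine similarity, whereas vector databases generally rely on approximate indexes to avoid a full scan. The function names and the three-dimensional toy data are assumptions made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, items, k=2):
    """Return the k (name, score) pairs most similar to query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional "embeddings" (hypothetical values); real embeddings
# typically have hundreds or thousands of dimensions.
ITEMS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

results = nearest([1.0, 0.0, 0.0], ITEMS)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

The brute-force scan is fine for a few thousand vectors. Much of what a dedicated vector database adds is the indexing that keeps this lookup fast at far larger scales.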

What are the differences between these databases? A few are compared in this article. Some databases are optimized for small projects such as prototypes, while others are optimized for production. Given the importance of querying major AI services, as described in the previous article in this series, the databases vie for supremacy in their integration with outside services. They also offer different types of indexes and various tools for data management, such as replication. Many companies offer cloud access to the databases.

An article by TigerData (previously named Timescale), a company that offers tools for manipulating vectors within PostgreSQL, argues that vector databases are inappropriate for AI applications. The thrust of TigerData’s argument is that the vectors are usually consulted together with the content (text, audio, video, chemical and biological sequences, etc.) from which the vectors were generated in the first place. It makes sense, in the article’s argument, to keep all the varied types of data together. The article also points out that the vectors are “derived data” in database terminology, that they need to change in tandem with the source data, and therefore that vectors lose much of their value when separated from that data.
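
The “derived data” point can be illustrated with a small sketch: if each vector is stored next to a fingerprint of the text it came from, the application can tell when the vector has gone stale. This is only one possible bookkeeping scheme, not the mechanism of any product named here. The EmbeddingStore class and the toy_embed function are hypothetical, and toy_embed merely stands in for a real embedding model.

```python
import hashlib

def content_hash(text):
    """Fingerprint of the source text, used to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def toy_embed(text):
    """Hypothetical stand-in for a real embedding model."""
    return [len(text), text.count(" ")]

class EmbeddingStore:
    """Keeps each vector next to a hash of the text it was derived from."""

    def __init__(self):
        self._rows = {}

    def put(self, doc_id, text):
        self._rows[doc_id] = {"hash": content_hash(text), "vector": toy_embed(text)}

    def is_stale(self, doc_id, current_text):
        """True if no vector is stored or the source text has changed since."""
        row = self._rows.get(doc_id)
        return row is None or row["hash"] != content_hash(current_text)

store = EmbeddingStore()
store.put("doc-1", "Vectors are derived data.")
stale_before_edit = store.is_stale("doc-1", "Vectors are derived data.")
stale_after_edit = store.is_stale("doc-1", "Vectors are derived from source data.")
print(stale_before_edit, stale_after_edit)
```

When the vectors live in a separate system from the source data, this kind of check has to be run by the application, which is the extra coordination the argument is warning about.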

I will not make a judgment about the validity of the arguments in the article. Another article makes a more balanced assessment of the tradeoffs between traditional databases and vector databases. I think that many organizations use popular services or outside sources to generate vectors, and don’t want to store the source data or refer to it after generating the vectors. Some vector databases such as Weaviate provide storage for source data as well as output vectors. Furthermore, if a vector database performs its role in the application extremely well, managing its integration with other parts of the environment might be worthwhile.

Glauber Costa, the founder of Turso, also cites many observers who say that text search, working on the original data, can produce better results than vector searches.

Each organization has to figure out its own needs and what type of database best meets those needs. General-purpose databases are adding AI features of their own, so we’ll look at those next.

SQL for Vector Operations

With the VECTOR data type in SQL, relational databases have evolved to support AI. Vectors are supported in MySQL and its fork, MariaDB, with the basic functions to measure the distance between vectors. Mark Hinkle, CEO of Peripety Labs, an AI consultancy, says that MariaDB has more vector support currently than MySQL, but that both are at early stages.
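
The following sketch gives a feel for what a distance function inside SQL looks like. It deliberately does not use MySQL, MariaDB, or PostgreSQL syntax: as a stand-in that runs anywhere, it uses Python’s built-in sqlite3 module, stores each vector as JSON text, and registers a hypothetical l2_distance function. Databases with a native vector type provide their own types and built-in functions, whose names and syntax differ from what is shown here.

```python
import json
import math
import sqlite3

def l2_distance(a_json, b_json):
    """Euclidean distance between two vectors stored as JSON arrays."""
    a, b = json.loads(a_json), json.loads(b_json)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under a name chosen for this sketch.
conn.create_function("l2_distance", 2, l2_distance)

conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, embedding TEXT)")
conn.executemany(
    "INSERT INTO docs (title, embedding) VALUES (?, ?)",
    [
        ("cats", json.dumps([0.9, 0.1, 0.0])),
        ("dogs", json.dumps([0.8, 0.2, 0.1])),
        ("cars", json.dumps([0.0, 0.1, 0.9])),
    ],
)

# Nearest-neighbor query: order rows by distance to the query vector.
query = json.dumps([1.0, 0.0, 0.0])
rows = conn.execute(
    "SELECT title, l2_distance(embedding, ?) AS dist FROM docs ORDER BY dist LIMIT 2",
    (query,),
).fetchall()
for title, dist in rows:
    print(f"{title}: {dist:.3f}")
```

The appeal of this style is that the similarity ranking is ordinary SQL, so it can be combined with joins and WHERE clauses over the rest of the row.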

Whereas MySQL/MariaDB were always designed to be stripped down and simple (it was years before they even supported transactions), PostgreSQL seems to try to be all things to all people. Look for something that can run on an SQL database—PostgreSQL is almost certain to have an operation or extension that does it.

Two extensions, pgvector and TigerData’s pgai, connect PostgreSQL to the larger AI world. pgvector offers vector similarity search, while pgai automatically updates the vectors when the source data changes in the database.

GPU acceleration is also available for PostgreSQL through third-party extensions. The TigerData article cites benchmarks to support its claim that PostgreSQL can handle vectors much faster than a vector database.

Turso has created a new open source database, derived from SQLite, in which a column can hold a vector encoding that is updated automatically when changes are made to the row.
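
The idea of a vector column that follows its row can be sketched in plain SQLite with triggers. This is not Turso’s mechanism or API, only an illustration of the concept using Python’s built-in sqlite3 module. The notes table, the trigger names, and the toy_embed function (a stand-in for a real embedding model) are all hypothetical.

```python
import json
import sqlite3

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: counts of each vowel."""
    lowered = text.lower()
    return json.dumps([lowered.count(v) for v in "aeiou"])

conn = sqlite3.connect(":memory:")
# Allow application-defined functions inside triggers (the default in most builds).
conn.execute("PRAGMA trusted_schema = ON")
conn.create_function("toy_embed", 1, toy_embed)

conn.executescript("""
CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT, embedding TEXT);

-- Fill in the vector when a row is first inserted.
CREATE TRIGGER notes_embed_insert AFTER INSERT ON notes
BEGIN
    UPDATE notes SET embedding = toy_embed(NEW.body) WHERE id = NEW.id;
END;

-- Recompute the vector whenever the source text changes.
CREATE TRIGGER notes_embed_update AFTER UPDATE OF body ON notes
BEGIN
    UPDATE notes SET embedding = toy_embed(NEW.body) WHERE id = NEW.id;
END;
""")

conn.execute("INSERT INTO notes (body) VALUES (?)", ("banana",))
before = conn.execute("SELECT embedding FROM notes WHERE id = 1").fetchone()[0]

conn.execute("UPDATE notes SET body = ? WHERE id = 1", ("blueberry",))
after = conn.execute("SELECT embedding FROM notes WHERE id = 1").fetchone()[0]
print(before, after)
```

Because the database itself keeps the vector in step with the text, the application never has to remember to re-embed a row after editing it.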

Other open source databases that offer SQL interfaces along with specialized AI support include Yugabyte and ClickHouse.

The choice of a vector database versus a general-purpose one might turn on whether you need to search other data besides embeddings. A variety of “hybrid search” techniques retrieve rows from databases based on some balance between vector searches and more standard queries. Karl Fogel, a partner at Open Tech Strategies, LLC, advises LLM users to store data in a general-purpose database that has vector support, except for specialized applications with performance needs in specific operations that can be met only by a specific vector database.
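
As a rough sketch of what “some balance” can mean, the example below blends a vector similarity score with a simple keyword score using a weight called alpha. The scoring functions, the alpha parameter, and the sample documents are assumptions for illustration. Production systems tend to use stronger text ranking such as BM25, and often merge the two result lists by rank rather than by raw score.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def keyword_score(query, text):
    """Fraction of query words that appear in the text."""
    words = query.lower().split()
    text_words = set(text.lower().split())
    return sum(w in text_words for w in words) / len(words)

def hybrid_search(query_text, query_vec, docs, alpha=0.5):
    """Rank docs by alpha * vector similarity + (1 - alpha) * keyword score."""
    ranked = []
    for doc in docs:
        vector_part = cosine_similarity(query_vec, doc["vector"])
        keyword_part = keyword_score(query_text, doc["text"])
        ranked.append((doc["title"], alpha * vector_part + (1 - alpha) * keyword_part))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Hypothetical documents with toy 2-dimensional embeddings.
DOCS = [
    {"title": "Pet care", "text": "how to care for a cat", "vector": [0.9, 0.1]},
    {"title": "Cat videos", "text": "funny cat clips", "vector": [0.2, 0.8]},
    {"title": "Dog care", "text": "how to care for a dog", "vector": [0.95, 0.05]},
]

vector_only = hybrid_search("cat care", [1.0, 0.0], DOCS, alpha=1.0)
blended = hybrid_search("cat care", [1.0, 0.0], DOCS, alpha=0.5)
print(vector_only[0][0], blended[0][0])
```

With vectors alone, the dog article edges out the cat article because its embedding happens to sit slightly closer to the query. Adding the keyword score moves the article that actually mentions both query words to the top.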

And All the Rest

Nobody wants to be left out of the new AI territory. As the old cliché goes, “Be there or be square” (which is limited to only two dimensions in a field that manipulates millions).

MongoDB boasts of its Atlas Vector Search, which retrieves data from the major services. Their documentation shows an example of storing data in the database. Another important database engine from the NoSQL era, Cassandra, claims to be useful for generative AI.

Elasticsearch, a database with a long history in the area of text search, now has an extension to index, store, and search for vectors.

Even graph databases have gotten into the field of AI, as illustrated by an article titled “Neo4j for GenAI.”

An AI Ecosystem

The term “ecosystem” is overused (particularly in computing) but it’s useful when conceiving the riches of different software tools that work together. Particularly in free and open source software, a wealth of different tools come at trending topics from a variety of angles, each offering one little piece of the application pipeline.

The essence of AI is creating useful structures from unstructured data. I hope this article has helped to show how structured data can support modern AI.

Author

  • Andrew Oram

    Andy is a writer and editor in the computer field. His editorial projects at O'Reilly Media ranged from a legal guide covering intellectual property to a graphic novel about teenage hackers. Andy also writes often on health IT, on policy issues related to the Internet, and on trends affecting technical innovation and its effects on society. Print publications where his work has appeared include The Economist, Communications of the ACM, Copyright World, the Journal of Information Technology & Politics, Vanguardia Dossier, and Internet Law and Business. Conferences where he has presented talks include O'Reilly's Open Source Convention, FISL (Brazil), FOSDEM (Brussels), DebConf, and LibrePlanet. Andy participates in the Association for Computing Machinery's policy organization, USTPC.