
The first article in this series laid out the role of vectors in machine learning and LLMs, along with vector representations in software. Now we’ll embark on a quick tour of databases and how they meet the modern challenges of AI.
Since AI is basically vectors and vectors are AI, why not design databases specially around the vector operations listed in the previous section? Over the past decade, many such databases have been created. The free software options include Milvus, Qdrant, Weaviate, Vespa, ChromaDB, and LanceDB.
What are the differences between these databases? A few are compared in this article. Some databases are optimized for small projects such as prototypes, while others are optimized for production. Given the importance of querying major AI services, as described in the previous article in this series, the databases vie for supremacy in their integration with outside services. They also offer different types of indexes and various tools for data management, such as replication. Many companies offer cloud access to the databases.
An article by TigerData (previously named Timescale), a company that offers tools for manipulating vectors within PostgreSQL, argues that vector databases are inappropriate for AI applications. The thrust of TigerData’s argument is that the vectors are usually consulted together with the content (text, audio, video, chemical and biological sequences, etc.) from which the vectors were generated in the first place. It makes sense, in the article’s argument, to keep all the varied types of data together. The article also points out that the vectors are “derived data” in database terminology, that they need to change in tandem with changes to the source data, and therefore that vectors lose much of their value when separated from that source data.
I will not make a judgment about the validity of the arguments in the article. Another article makes a more balanced assessment of the tradeoffs between traditional databases and vector databases. I think that many organizations use popular services or outside sources to generate vectors, and don’t want to store the source data or refer to it after generating the vectors. Some vector databases such as Weaviate provide storage for source data as well as output vectors. Furthermore, if a vector database performs its role in the application extremely well, managing its integration with other parts of the environment might be worthwhile.
Glauber Costa, the founder of Turso, also cites many observers who say that text search, working on the original data, can produce better results than vector searches.
Each organization has to figure out its own needs and what type of database best meets those needs. General-purpose databases are also adapting to serve AI, so we’ll look at those next.
With the VECTOR data type in SQL, relational databases have evolved to support AI. Vectors are supported in MySQL and its fork, MariaDB, with the basic functions to measure the distance between vectors. Mark Hinkle, CEO of Peripety Labs, an AI consultancy, says that MariaDB has more vector support currently than MySQL, but that both are at early stages.
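To make "measuring the distance between vectors" concrete, here is a minimal sketch in plain Python of two measures commonly offered for vector columns: Euclidean distance and cosine distance. The function names are my own, chosen for illustration; they are not the SQL function names used by MySQL or MariaDB.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two vectors.

    0 means the vectors point the same way; 1 means they are orthogonal.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

A database that supports vectors typically runs a calculation of this kind inside the engine, so that a query can sort rows by their distance from a query vector.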
Whereas MySQL/MariaDB were always designed to be stripped down and simple (it was years before they even supported transactions), PostgreSQL seems to try to be all things to all people. Look for something that can run on an SQL database—PostgreSQL is almost certain to have an operation or extension that does it.
Two extensions connect PostgreSQL to the larger AI world: pgai, from TigerData, and pgvector, an independent open source project. pgai automatically updates the vectors when the source data changes in the database, while pgvector offers vector similarity search.
GPU acceleration is also available for PostgreSQL through extensions. The TigerData article cites benchmarks to support its claim that PostgreSQL can handle vectors much faster than a vector database.
The Turso company has created a new open source database, derived from SQLite, in which a table can include a column that contains a vector encoding and is updated automatically when changes are made to the row.
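The idea of a vector column that stays in step with its row can be imitated in application code. The following sketch uses Python's standard sqlite3 module, not Turso itself, and writes the content and its vector in a single transaction so the two cannot drift apart. Everything named here is an assumption made for illustration: `toy_embed` is a hypothetical stand-in for a real embedding model, and the `docs` table layout is invented.

```python
import json
import sqlite3


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: counts of each vowel."""
    lowered = text.lower()
    return [float(lowered.count(vowel)) for vowel in "aeiou"]


def open_store():
    """Create an in-memory database with a content column and a vector column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs ("
        "id INTEGER PRIMARY KEY, "
        "content TEXT NOT NULL, "
        "embedding TEXT NOT NULL)"  # vector stored as JSON text in this sketch
    )
    return conn


def save_document(conn, doc_id, content):
    """Insert or overwrite a row, regenerating the vector from the new content."""
    embedding = json.dumps(toy_embed(content))
    with conn:  # commits both columns together, or rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO docs (id, content, embedding) VALUES (?, ?, ?)",
            (doc_id, content, embedding),
        )


def load_embedding(conn, doc_id):
    """Return the stored vector for a row, or None if the row does not exist."""
    row = conn.execute(
        "SELECT embedding FROM docs WHERE id = ?", (doc_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Because every write goes through `save_document`, the stored vector always reflects the current content. A database that performs this step itself removes the risk that some code path updates the content and forgets the vector.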
Other open source databases that offer SQL interfaces along with specialized AI support include Yugabyte and ClickHouse.
The choice of a vector database versus a general-purpose one might turn on whether you need to search other data besides embeddings. A variety of “hybrid search” techniques retrieve rows from databases based on some balance between vector searches and more standard queries. Karl Fogel, a partner at Open Tech Strategies, LLC, advises LLM users to store data in a general-purpose database that has vector support, except for specialized applications with performance needs in specific operations that can be met only by a specific vector database.
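One simple way to picture hybrid search is as a weighted sum of a vector similarity score and a keyword match score. The sketch below is only that, a sketch: the row layout, the `alpha` weight, and the function names are hypothetical, and real systems may combine the two rankings in other ways.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keyword_score(query_terms, text):
    """Fraction of the query terms that appear as whole words in the text."""
    if not query_terms:
        return 0.0
    words = set(text.lower().split())
    hits = sum(1 for term in query_terms if term.lower() in words)
    return hits / len(query_terms)


def hybrid_search(rows, query_vector, query_terms, alpha=0.5, limit=3):
    """Rank rows by alpha * vector score + (1 - alpha) * keyword score.

    Each row is assumed to be a dict with "id", "text", and "embedding" keys.
    """
    scored = []
    for row in rows:
        vector_part = cosine_similarity(query_vector, row["embedding"])
        keyword_part = keyword_score(query_terms, row["text"])
        score = alpha * vector_part + (1 - alpha) * keyword_part
        scored.append((score, row["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:limit]]
```

Setting `alpha` to 1.0 gives a pure vector search and 0.0 a pure keyword search; values in between express the "balance" between the two kinds of query.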
Nobody wants to be left out of the new AI territory. As the old cliché goes, “Be there or be square” (which is limited to only two dimensions in a field that manipulates millions).
MongoDB boasts of its Atlas Vector Search, which retrieves data from the major services. Their documentation shows an example of storing data in the database. Another important database engine from the NoSQL era, Cassandra, claims to be useful for generative AI.
Elasticsearch, a database with a long history in the area of text search, now has an extension to index, store, and search for vectors.
Even graph databases have gotten into the field of AI, as illustrated by an article titled “Neo4j for GenAI.”
The term “ecosystem” is overused (particularly in computing) but it’s useful when conceiving the riches of different software tools that work together. Particularly in free and open source software, a wealth of different tools come at trending topics from a variety of angles, each offering one little piece of the application pipeline.
The essence of AI is creating useful structures from unstructured data. I hope this article has helped to show how structured data can support modern AI.