Databases for AI: Vectors, Embeddings, and Architecture

Choosing a database for your AI application?

Everyone who comments on AI—whether machine learning or more contemporary techniques such as LLMs—notes the enormous volumes of data it requires, as well as the radically novel ways in which data is used. Thousands of new data centers are currently being built—and they’re getting larger.

So how is all that data getting stored? What does a database look like in this brave new world?

Actually, some old databases are becoming new again, with upgraded data types and APIs for AI. And there are some new types of databases, too. This article looks at how database developers everywhere are positioning themselves to capture a bit of the manic energy in the AI revolution, and what the major options are.

Everything is a Vector

To understand AI data and how it needs to be processed and stored, we have to home in on the concept of a vector.

If you got through high-school linear algebra, you studied vectors and matrices. A vector in computing terms is simply an array: an ordered collection of floating-point values. What’s special about a vector is that each element refers to some aspect of the real world you’re measuring: one number might be age, another might be height, another one weight, etc. Therefore, the different elements are called dimensions, and they conceptually represent a point in a multi-dimensional space. That’s why data scientists use the term “dimension” when their AI models consider the attributes of a customer or the factors that could lead to a machine part’s failure.
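The idea is easy to make concrete in a few lines of Python. The attribute names and values here are invented for illustration:

```python
import numpy as np

# A customer represented as a point in three-dimensional space.
# Each element (dimension) measures one real-world attribute.
DIMENSIONS = ["age", "height_cm", "weight_kg"]  # hypothetical attributes

customer = np.array([34.0, 172.0, 68.5])

for name, value in zip(DIMENSIONS, customer):
    print(f"{name}: {value}")
```

A real embedding works the same way, except that its hundreds or thousands of dimensions are learned by a model rather than chosen by hand.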

Vector arithmetic is extremely efficient for both machine learning and LLMs, both of which are obsessed with how similar or different things are. An LLM, for instance, will realize that the letters E and L are related in Spanish, where they make the common article el. The same letters have a similar relationship in French, but there they form the article le.

At a higher level, an LLM will recognize similarities between dogs and wolves (and even get them mixed up) and learn that they’re more similar than dogs and cats.

Similarities and differences in AI are represented mathematically as the distance between vectors. There are various ways to measure it: Euclidean and Manhattan distance, and the closely related similarity measures of cosine similarity and the dot product. Finding vectors (also called “embeddings” when they represent real-world items such as words or documents) that are close together, such as those for dogs and wolves, is called “vector similarity search.” Classic linear-algebra operations, such as Gaussian elimination, also apply to matrices and higher-dimensional structures (tensors). And a preprocessing step called “chunking” splits large documents into mini-documents, each of which gets its own embedding, so that similarity search can locate the relevant passage within a document.
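The distance and similarity measures just listed each take only a line of NumPy. The two vectors here are made-up toy embeddings, not output from a real model:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])   # toy "dog" embedding (invented values)
b = np.array([1.5, 1.8, 3.2])   # toy "wolf" embedding (invented values)

euclidean = np.linalg.norm(a - b)   # straight-line distance
manhattan = np.sum(np.abs(a - b))   # sum of per-dimension differences
dot = np.dot(a, b)                  # raw dot product (higher = more similar)
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based, in [-1, 1]

print(euclidean, manhattan, dot, cosine)
```

Which measure a system uses depends on how the embeddings were trained; cosine similarity is popular because it ignores vector magnitude and compares only direction.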

More and more databases are defining vector data types. The interfaces differ from one database to another, but I predict that they’ll converge on a standard API and that SQL will be enhanced to represent critical operations such as text-to-vector conversion.

Vectors stored in a database can benefit from features common to other kinds of data. They can be indexed to help searches turn up relevant vectors more quickly, and they can be compacted (a general term that covers many types of optimization) to reduce storage space and redundancy.
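As a rough sketch of one such optimization, scalar quantization (a common compaction technique in vector databases) stores each 32-bit float as a single byte, cutting storage by 4x at the cost of some precision. The data here is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((1000, 128), dtype=np.float32)  # 1,000 random 128-dim vectors

# Map each value from its observed [min, max] range onto the 256 levels of uint8.
lo, hi = vectors.min(), vectors.max()
codes = np.round((vectors - lo) / (hi - lo) * 255).astype(np.uint8)

# Reconstruct approximate values when the vectors are needed again.
restored = codes.astype(np.float32) / 255 * (hi - lo) + lo

print(vectors.nbytes, codes.nbytes)          # 4x storage reduction
print(np.max(np.abs(vectors - restored)))    # small reconstruction error
```

Production systems use more elaborate schemes (product quantization, for example), but the trade-off is the same: less storage and faster scans in exchange for a small loss of accuracy.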

Special Factors in AI Data Handling

Using data for machine learning and LLM development is more complicated than traditional data manipulation because AI requires such enormous amounts of data. Traditional analytics used to run on data within the organization, such as finding correlations between age and purchase behavior among a retailer’s customers. But a single organization rarely has enough data to produce accurate results for machine learning, and certainly not for an LLM. Thus, modern AI goes searching far afield for input. (Hence the myriad lawsuits that content creators are currently raising against AI developers. Full disclosure: books I’ve worked on are involved in the Anthropic Copyright Settlement.)

Similar trends apply to storing data. Very few organizations can store all the data needed to create modern AI models. And these organizations are notoriously climbing over each other to build storage, as discussed in my recent article, “We will need all those data centers.” Most organizations won’t even create the incredibly complex systems that generate the critical vector data (embeddings) on which they base their AI. Instead, most will query a powerful AI platform such as OpenAI’s models, Google’s Gemini, or Anthropic’s Claude.

The errors and grotesquely wrong results (“hallucinations”) in LLMs have disappointed and even alarmed users from the start. Users also discover that query results for rare and obscure topics, where relatively little material is available online, are less accurate than for common topics. So in the past couple of years, model developers have made improvements through an enhancement they call retrieval-augmented generation (RAG).

The simple idea behind RAG is that, instead of relying solely on whatever the model absorbed from the cruft out on the internet, developers retrieve relevant passages from vetted databases of reliable information and hand them to the model along with the user’s query, producing better answers. RAG is a key part of what developers now like to call “real-world models,” an upgrade to large language models. Another term gaining popularity for this narrowing of tasks is “domain-specific language models” (DSLMs), which connects the trend to other computing topics that focus on particular applications, such as “domain-specific languages.”
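The retrieval step can be sketched in a toy form. This sketch scores documents by word overlap purely so the example is self-contained; a real RAG pipeline would use embeddings and a vector index, and the documents here are invented:

```python
import re

# A tiny "vetted database" of reliable statements (invented for illustration).
documents = [
    "Wolves are wild canids that live and hunt in packs.",
    "Dogs were domesticated from wolves thousands of years ago.",
    "Cats are solitary hunters in the felid family.",
]

def tokens(text: str) -> set:
    """Lowercase a string and split it into a set of words."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str) -> str:
    """Return the document sharing the most words with the query."""
    return max(documents, key=lambda doc: len(tokens(query) & tokens(doc)))

query = "When were dogs domesticated?"
context = retrieve(query)

# Prepend the retrieved context to the prompt sent to the model.
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```

The key point is that the model answers from retrieved, trusted text rather than from its training data alone, which is what cuts down on hallucinations.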

To a layperson like me, exploiting carefully vetted data sources evokes a “Duh”—isn’t that solution obvious? But philosophically it’s a major shift in AI. From the earliest days of artificial intelligence in the 1950s, tech optimists aimed for artificial general intelligence (AGI). They believed that AI would do all the things the human mind could do, perhaps even better. AI would not only create music but could think up whole new genres such as bebop and hip-hop. AI would not only find the proteins that cure diseases but would think up whole new concepts in health, such as mRNA vaccines.

Of course, AI researchers from the 1950s through the 1980s had to relinquish such wild hopes and ratchet the field back to limited, domain-specific products such as knowledge-based or expert systems, developed through symbolic AI. But the success of machine learning and LLMs started to resurrect messianic hopes for general intelligence. Now, the trend toward RAG and real-world models shows that developers are retrenching to practical applications.

Mark Hinkle, CEO of Peripety Labs, an AI consultancy, who reviewed this article, pushes back on the AGI framing entirely. “AGI is a red herring,” he says. “The goal isn’t reaching human-level intelligence—it’s getting capable enough to offload menial work and, in the physical world, dangerous work to digital employees. RAG is developers building something actually useful.”

So currently, users seeking an accurate representation of reality in their particular domain of knowledge start with a general set of vectors obtained from one of the major AI vendors, and then supplement those generic vectors with embeddings of the organization’s internal data to capture the unique insights that only its own data can provide. This data might be research results, postings derived from social media, survey responses, and so on.

This article has described basic operations on vectors. Faiss (with C++ and Python interfaces) is an example of an open source library that performs vector operations. It can use GPUs, which are fundamental in AI, to speed up processing.
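The core of what a Faiss “flat” index does can be sketched in plain NumPy: store the embeddings in one matrix and find the nearest neighbors by brute-force distance. (Faiss’s own API differs, and its value lies in GPU acceleration and smarter index structures; the data here is random.)

```python
import numpy as np

rng = np.random.default_rng(42)
embeddings = rng.random((500, 64), dtype=np.float32)  # toy collection of 500 vectors

def search(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k stored vectors closest to the query."""
    distances = np.linalg.norm(embeddings - query, axis=1)  # Euclidean distance to each
    return np.argsort(distances)[:k]

# A query almost identical to stored vector 123 should return 123 first.
query = embeddings[123] + 0.001
print(search(query))
```

Brute force is fine for a few thousand vectors; libraries like Faiss exist because approximate indexes (IVF, HNSW, and the like) keep search fast when collections grow to billions.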

LangChain (with a Python interface) and LlamaIndex (with Python and TypeScript interfaces) are examples of open source libraries that let programmers promenade as billionaire-backed AI developers, querying the popular AI services and other large data sets such as Wikipedia. These libraries are orchestration layers; the GPU-intensive work happens in the models and vector stores they connect to.

With this background, the next article in this series explores some databases out in the field.

Author

  • Andrew Oram

    Andy is a writer and editor in the computer field. His editorial projects at O'Reilly Media ranged from a legal guide covering intellectual property to a graphic novel about teenage hackers. Andy also writes often on health IT, on policy issues related to the Internet, and on trends affecting technical innovation and its effects on society. Print publications where his work has appeared include The Economist, Communications of the ACM, Copyright World, the Journal of Information Technology & Politics, Vanguardia Dossier, and Internet Law and Business. Conferences where he has presented talks include O'Reilly's Open Source Convention, FISL (Brazil), FOSDEM (Brussels), DebConf, and LibrePlanet. Andy participates in the Association for Computing Machinery's policy organization, USTPC.
