AI, 23 September 2025

RAG, embeddings and vector search, explained simply

Ask ChatGPT about your company's notice periods or your product's warranty terms and it will often answer confidently, and wrongly. A language model knows what it saw during training, and your internal documents weren't in there. Yet AI assistants that answer precisely from a company's own contracts, manuals and wikis exist and work well. The mechanism behind almost all of them is called RAG, and it is simpler than it sounds.

The problem: the model has never seen your data

There are two obvious-sounding fixes, and both are usually wrong. Retraining (fine-tuning) the model on your documents is expensive, slow to update (every changed document means another training round) and unreliable at reproducing exact facts such as dates, amounts or clause wording. Pasting all your documents into the prompt fails too: models have a limited context window, and even large ones get slower, costlier and less accurate when you flood them with thousands of pages of mostly irrelevant text.

The practical fix: don't teach the model your documents. Hand it the right three paragraphs at the moment it needs them. Everything that follows is about how to find those three paragraphs.

Embeddings: meaning as coordinates

An embedding model takes a piece of text and turns it into a long list of numbers — a vector, typically with hundreds or a few thousand dimensions. Think of it as coordinates on a map of meaning: texts about similar things land close together, unrelated texts land far apart. "How do I terminate the contract?" and "cancellation notice period" share almost no words, but their vectors sit close, because they mean nearly the same thing.
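To make "close" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors are made up by hand for illustration; a real embedding model would produce them, with far more dimensions. The third text is a hypothetical unrelated example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made stand-ins for real embeddings (illustrative values only).
terminate_contract = [0.9, 0.1, 0.2]  # "How do I terminate the contract?"
notice_period = [0.8, 0.2, 0.3]       # "cancellation notice period"
lunch_menu = [0.1, 0.9, 0.1]          # "canteen lunch menu" (hypothetical unrelated text)

similar = cosine_similarity(terminate_contract, notice_period)
unrelated = cosine_similarity(terminate_contract, lunch_menu)
print(f"related texts:   {similar:.2f}")
print(f"unrelated texts: {unrelated:.2f}")
```

The two contract-related vectors score close to 1, the unrelated one much lower. That gap is all a retrieval system needs to rank passages.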

That is the property that makes everything work: similarity of meaning becomes distance you can measure — which classic keyword search fundamentally can't do.

Vector search: finding the nearest meaning

To prepare a company's knowledge, you split every document into chunks — sections of a few paragraphs — compute an embedding for each chunk and store the pairs in a vector database (pgvector inside PostgreSQL, Qdrant, Pinecone and similar). When a user asks a question, you embed the question with the same model and ask the database for the chunks whose vectors lie nearest to it.
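The indexing and lookup described above can be sketched in a few lines. This is a toy under loud assumptions: `toy_embed` is a hand-written stand-in for an embedding model (it only counts words from three hard-coded topics), `InMemoryVectorIndex` is a hypothetical brute-force replacement for a vector database, and the sample document is invented. Real vector databases typically use approximate nearest-neighbour indexes rather than comparing against every stored vector.

```python
import math
import re

# Toy stand-in for an embedding model: one dimension per hand-picked topic.
TOPICS = [
    {"terminate", "termination", "cancel", "cancellation", "notice"},
    {"warranty", "defect", "repair", "replacement"},
    {"invoice", "payment", "pay", "due"},
]

def toy_embed(text):
    """Count how many words of the text fall into each topic."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(sum(w in topic for w in words)) for topic in TOPICS]

def cosine_similarity(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0  # a text with no known topic words matches nothing
    return sum(x * y for x, y in zip(a, b)) / norm

def split_into_chunks(text, paragraphs_per_chunk=2):
    """Group paragraphs (separated by blank lines) into chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return ["\n\n".join(paragraphs[i:i + paragraphs_per_chunk])
            for i in range(0, len(paragraphs), paragraphs_per_chunk)]

class InMemoryVectorIndex:
    """Hypothetical brute-force index: stores (chunk, vector) pairs in a list."""
    def __init__(self):
        self.entries = []

    def add(self, chunk):
        self.entries.append((chunk, toy_embed(chunk)))

    def search(self, question, k=2):
        """Return the k (similarity, chunk) pairs nearest to the question."""
        query_vector = toy_embed(question)  # same "model" as used for the chunks
        scored = [(cosine_similarity(query_vector, v), c) for c, v in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

# Invented sample document, three paragraphs.
document = """Either party may cancel with three months' notice.

The warranty covers any defect for two years; repair is free.

Each invoice is due within 30 days of receipt."""

index = InMemoryVectorIndex()
for chunk in split_into_chunks(document, paragraphs_per_chunk=1):
    index.add(chunk)

top = index.search("How do I terminate the contract?", k=1)
print(top[0][1])
```

The question and the winning chunk share no topic word ("terminate" versus "cancel" and "notice"), yet the toy model maps both to the same dimension, which is the behaviour a real embedding model provides at scale.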

The result: out of ten thousand pages, you get the handful of passages that actually talk about what the user asked, typically in milliseconds and regardless of whether the user chose the same words as the document.

RAG: putting it together

RAG stands for retrieval-augmented generation, and the name is the whole recipe. For every question, the assistant runs three steps:

Retrieve — embed the question and pull the most relevant chunks from the vector database.
Augment — build a prompt that contains the question plus those chunks, with an instruction to answer only from the supplied material.
Generate — let the language model write the answer, citing the documents the chunks came from.
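The three steps above can be sketched as one small function. This is an illustration, not a definitive implementation: `fake_retrieve` and `fake_generate` are hypothetical stand-ins for a vector database query and a language model call, the prompt wording is just one plausible phrasing, and the file name is invented.

```python
def build_prompt(question, chunks):
    """Augment: combine the question, the retrieved chunks and a grounding instruction."""
    sources = "\n\n".join(
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the sources below. "
        "Cite sources by their number. "
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )

def answer(question, retrieve, generate, k=3):
    chunks = retrieve(question, k)           # Retrieve
    prompt = build_prompt(question, chunks)  # Augment
    return generate(prompt)                  # Generate

# Hypothetical stand-ins so the sketch runs without a database or a model.
def fake_retrieve(question, k):
    found = [{"source": "supply-contract.pdf",
              "text": "Either party may cancel with three months' notice."}]
    return found[:k]

def fake_generate(prompt):
    # A real system would send the prompt to a language model here.
    return "PROMPT SENT TO MODEL:\n" + prompt

result = answer("How do I terminate the contract?", fake_retrieve, fake_generate)
print(result)
```

Passing `retrieve` and `generate` in as functions keeps the loop independent of any particular database or model, which also makes it easy to test.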

The model never gets retrained. Update a document, re-embed its chunks, and the assistant answers from the new version minutes later. Because the answer is grounded in retrieved text, the assistant can cite its sources. It can also say "I don't know" when retrieval finds nothing relevant enough (in practice, when no chunk clears a similarity threshold) instead of inventing something plausible.
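Two details in that paragraph are worth sketching. Nearest-neighbour search always returns its closest chunks, however weak the match, so declining to answer usually needs an explicit cutoff. Updating a document means replacing its old chunks rather than adding to them. Everything named here is hypothetical: `ChunkStore`, the stand-in functions, and especially the `min_score` value of 0.75, which would need tuning against real questions.

```python
class ChunkStore:
    """Hypothetical store mapping each document id to its (chunk, vector) pairs."""
    def __init__(self, embed):
        self.embed = embed
        self.by_document = {}

    def replace_document(self, doc_id, chunks):
        # Overwriting the entry drops every chunk of the old version at once.
        self.by_document[doc_id] = [(c, self.embed(c)) for c in chunks]

    def all_chunks(self):
        return [pair for pairs in self.by_document.values() for pair in pairs]

def answer_or_decline(question, scored_chunks, generate, min_score=0.75):
    """scored_chunks: (similarity, chunk_text) pairs as returned by retrieval."""
    relevant = [chunk for score, chunk in scored_chunks if score >= min_score]
    if not relevant:
        return "I don't know."  # nothing cleared the cutoff: do not call the model
    return generate(question, relevant)

# Hypothetical stand-ins for an embedding model and a language model.
def fake_embed(text):
    return [float(len(text))]

def fake_generate(question, chunks):
    return f"Answer based on {len(chunks)} chunk(s)."

store = ChunkStore(fake_embed)
store.replace_document("policy.md", ["Notice period: one month."])
store.replace_document("policy.md", ["Notice period: three months."])  # updated version
current_texts = [text for text, _ in store.all_chunks()]

# The similarity scores below are invented to show both outcomes.
grounded = answer_or_decline("notice period?", [(0.91, current_texts[0])], fake_generate)
declined = answer_or_decline("parking rules?", [(0.12, current_texts[0])], fake_generate)
print(current_texts, grounded, declined, sep="\n")
```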

What a real deployment adds

The loop above fits in a weekend prototype. Production systems differ in the unglamorous parts:

Chunking strategy — splitting contracts mid-clause ruins answers; good systems split along document structure.
Hybrid search — combining vector similarity with classic keyword matching, because exact terms (invoice numbers, product codes) deserve exact matches.
Permissions — the assistant must not retrieve documents the asking user isn't allowed to see; access control belongs in the retrieval layer, not in the prompt.
Evaluation — a test set of real questions with known correct answers, run on every change, so quality is measured rather than felt.
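Two of the items above, permissions and hybrid search, can be sketched together. The fusion method shown is reciprocal rank fusion, one common way to merge ranked lists and not necessarily what any given product uses. The chunk data, group names and the hard-coded `vector_ranking` (standing in for the output of a real vector search over the visible chunks) are all assumptions for illustration.

```python
def allowed_for(user_groups, chunks):
    """Permissions: drop chunks the user may not see before any ranking happens."""
    return [c for c in chunks if c["allowed_groups"] & user_groups]

def keyword_ranking(query, chunks):
    """Crude keyword search: rank chunks by how many query terms appear verbatim."""
    terms = query.lower().split()
    hits = [(sum(t in c["text"].lower() for t in terms), c["id"]) for c in chunks]
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [chunk_id for count, chunk_id in hits if count > 0]

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of ids: each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Invented chunks with access-control metadata attached at indexing time.
chunks = [
    {"id": "a", "text": "Invoice INV-2041 was paid on 3 March.",
     "allowed_groups": {"finance"}},
    {"id": "b", "text": "Invoices are due within 30 days.",
     "allowed_groups": {"finance", "sales"}},
    {"id": "c", "text": "Salary bands for 2025.",
     "allowed_groups": {"hr"}},
]

visible = allowed_for({"finance"}, chunks)
visible_ids = [c["id"] for c in visible]

query = "INV-2041 status"
vector_ranking = ["b", "a"]  # assumed result of vector search over `visible`
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking(query, visible)])
print(visible_ids, fused)
```

Chunk "c" never reaches ranking because it is filtered out first, so no prompt instruction has to keep it secret. Chunk "a" moves ahead of "b" after fusion because the exact invoice number matches it, even though the assumed vector search preferred the more general passage.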

This architecture — assistant on top, retrieval and permissions underneath — is what we build in custom AI applications and wire into existing systems through AI integrations. You can see it applied to real workflows in our data analysis use cases.

The takeaway

RAG is not magic and not a buzzword: embeddings turn meaning into geometry, vector search finds the nearest passages, and the model writes an answer grounded in them. The quality of the result is decided less by the model and more by the engineering around retrieval — which is good news, because engineering is something you can specify, test and improve.