In the world of artificial intelligence, the term RAG – which stands for retrieval-augmented generation – is becoming increasingly common. But what does it mean, and why is this technology becoming so important?
In short, RAG is a method that combines large language models such as ChatGPT with additional external knowledge. This ensures that responses are more accurate, up to date and better tailored to the context.
How does a RAG system work?
A RAG system consists of two important parts:
1. Retrieval
The system searches through a large collection of texts – for example, documents, websites or internal manuals. To do this, it uses vector databases (such as Qdrant or Faiss), which store texts as sequences of numbers (known as embeddings). This enables the system to find the passages that match the query at lightning speed.
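The retrieval step can be sketched with toy vectors standing in for real embeddings; both the passages and the numbers below are made up for illustration, and a real system would get its vectors from an embedding model and store them in a vector database:

```python
import numpy as np

def cosine_similarity(a, b):
    # Similarity between two embedding vectors (1.0 = identical direction).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "database": each passage is stored together with its embedding vector.
passages = {
    "Travel must be approved in advance.": np.array([0.9, 0.1, 0.0]),
    "The cafeteria opens at 08:00.":       np.array([0.0, 0.2, 0.9]),
}

def retrieve(query_embedding, top_k=1):
    # Rank all stored passages by similarity to the query vector.
    ranked = sorted(passages.items(),
                    key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]

# A query vector "about travel" lands closest to the first passage.
print(retrieve(np.array([0.8, 0.2, 0.1])))
```

Vector databases such as Qdrant or Faiss do essentially this ranking, but over millions of vectors with specialised index structures instead of a full scan.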
2. Generation
The text passages found are then passed on to a language model (LLM). This uses the context to formulate a clear, precise answer.
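A minimal sketch of how the retrieved passages and the question might be combined into a single prompt for the language model; the template wording here is an illustrative assumption, not a fixed standard:

```python
def build_prompt(question, passages):
    # Combine the retrieved passages and the user's question into one prompt.
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "What is the reimbursement limit for travel?",
    ["The maximum reimbursement is CHF 1,200 if the trip was approved."],
)
print(prompt)
```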
Why is RAG so useful?
Normal language models have two major limitations:
- They only know what they learned up to their last training date – they are not always up to date.
- They do not have access to private or company-specific data.
A RAG system works around these limitations. It lets you:
- Integrate your own documents, PDFs or websites into the system.
- Answer specific questions that are only contained in this data.
- Significantly reduce incorrect or fabricated answers (so-called hallucinations).
- Use current knowledge without having to retrain the model itself.
How does a request work in a RAG system?
User asks a question
↓
System searches for the most similar text passages in the database
↓
Found texts and the question are sent to the language model
↓
Language model writes a suitable answer
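The four steps above can be sketched end to end. The embedding function and the LLM call are stubbed out here, since both depend on the tools you choose; the keyword-counting "embeddings" and the canned answer are purely illustrative:

```python
def embed(text):
    # Stub: a real system would call an embedding model here.
    # This toy version just counts keyword occurrences.
    keywords = ["travel", "reimbursement", "cafeteria"]
    return [text.lower().count(k) for k in keywords]

def search(query_vec, db, top_k=1):
    # Step 2: find the most similar stored passages (dot product as score).
    score = lambda vec: sum(a * b for a, b in zip(query_vec, vec))
    return sorted(db, key=lambda passage: score(embed(passage)), reverse=True)[:top_k]

def llm(prompt):
    # Stub: a real system would send the prompt to a language model.
    return "The maximum reimbursement is CHF 1,200 if the trip was approved."

def answer(question, db):
    passages = search(embed(question), db)                  # steps 1-2
    prompt = f"Context: {passages}\nQuestion: {question}"   # step 3
    return llm(prompt)                                      # step 4

db = ["Travel reimbursement rules ...", "Cafeteria opening hours ..."]
print(answer("What is the reimbursement limit for travel?", db))
```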
Example: Chatbot for company documents
Let's say an employee asks, ‘What is the reimbursement limit for travel?’
The RAG system searches through 80,000 text passages from the company manuals and finds the appropriate passage. This is presented to the language model, which then responds, ‘The maximum reimbursement is CHF 1,200 if the trip was approved.’
What does a RAG system consist of?
A RAG system consists of several components – here are the most important ones and some popular tools for them:
- Creating embeddings: Tools such as OpenAI or Huggingface Transformers convert text into sequences of numbers.
- Storing text in chunks: Vector databases such as Qdrant, Faiss or Weaviate are used for this purpose.
- Implementing search: Search can be integrated via REST APIs or programming languages such as Python, Node.js, PHP or Rust.
- Using language models (LLM): OpenAI, Mistral or local models (e.g. LLaMA) help to generate responses.
- Generating responses with context: Context is optimally transferred to the model via prompt templates, for example with a system message.
| 🧱 Building block | ⚙️ Possible tools/technologies |
|---|---|
| Create embeddings | OpenAI, Huggingface Transformers |
| Save texts in chunks | Vector databases such as Qdrant, Faiss, Weaviate |
| Implement search | REST APIs, Python, Node.js, PHP, Rust |
| Use language model (LLM) | OpenAI API, Mistral, local models (e.g. LLaMA, Mixtral) |
| Generate response with context | Prompt templates (e.g. a system message for the LLM) |
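Before texts can be saved in a vector database, they first have to be split into chunks. A simple sketch with a fixed chunk size and a small overlap so that sentences cut at a boundary still appear intact in one chunk; the sizes are illustrative, and real systems often split on sentence or paragraph boundaries instead:

```python
def chunk_text(text, chunk_size=50, overlap=10):
    # Split text into fixed-size character chunks that overlap slightly.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("A" * 120, chunk_size=50, overlap=10)
print(len(chunks), [len(c) for c in chunks])
```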
Advantages of RAG at a glance
- No expensive retraining of the language model (LLM) necessary.
- Use your own private data – without sending it to third parties.
- Modular and flexible – suitable for various programming languages and applications.
- Combination of intelligent search and smart text generation (via LLM) leads to better answers.
Sample setups
- PHP web server + Qdrant + OpenAI API
- PHP web server + MariaDB 11.8 with vectorisation (HNSW) + DeepSeek AI API
- Rust or Python programme + Faiss + local language model
- Simple Bash script + REST API + JSON data (minimal example)