In the world of artificial intelligence, the term RAG – which stands for retrieval-augmented generation – is becoming increasingly common. But what does it mean, and why is this technology becoming so important?
In short, RAG is a method that combines large language models such as ChatGPT with additional external knowledge. This ensures that responses are more accurate, up to date and better tailored to the context.
How does a RAG system work?
A RAG system consists of two important parts:
1. Retrieval
The system searches through a large collection of texts – for example, documents, websites or internal manuals. To do this, it uses vector databases (such as Qdrant or Faiss), which store texts as sequences of numbers (known as embeddings). This enables the system to find the passages that match the query at lightning speed.
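The retrieval step can be sketched with toy vectors standing in for real embeddings; both the passages and the numbers below are made up for illustration, and a real system would get its vectors from an embedding model and store them in a vector database:

```python
import numpy as np

def cosine_similarity(a, b):
    # Similarity between two embedding vectors (1.0 = identical direction).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "database": each passage is stored together with its embedding vector.
passages = {
    "Travel must be approved in advance.": np.array([0.9, 0.1, 0.0]),
    "The cafeteria opens at 08:00.":       np.array([0.0, 0.2, 0.9]),
}

def retrieve(query_embedding, top_k=1):
    # Rank all stored passages by similarity to the query vector.
    ranked = sorted(passages.items(),
                    key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]

# A query vector "about travel" lands closest to the first passage.
print(retrieve(np.array([0.8, 0.2, 0.1])))
```

Vector databases such as Qdrant or Faiss do essentially this ranking, but over millions of vectors with specialised index structures instead of a full scan.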
2. Generation
The text passages found are then passed on to a language model (LLM). This uses the context to formulate a clear, precise answer.
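A minimal sketch of how the retrieved passages and the question might be combined into a single prompt for the language model; the template wording here is an illustrative assumption, not a fixed standard:

```python
def build_prompt(question, passages):
    # Combine the retrieved passages and the user's question into one prompt.
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "What is the reimbursement limit for travel?",
    ["The maximum reimbursement is CHF 1,200 if the trip was approved."],
)
print(prompt)
```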
Why is RAG so useful?
Normal language models have two major limitations:
- They only know what they learned up to their last training date – they are not always up to date.
- They do not have access to private or company-specific data.
A RAG system works around these limitations. It lets you:
- Integrate your own documents, PDFs or websites into the system.
- Answer specific questions that are only contained in this data.
- Significantly reduce incorrect or fabricated answers (so-called hallucinations).
- Use current knowledge without having to retrain the model itself.
How does a request work in a RAG system?
User asks a question
↓
System searches for the most similar text passages in the database
↓
Found texts and the question are sent to the language model
↓
Language model writes a suitable answer
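The four steps above can be sketched end to end. The embedding function and the LLM call are stubbed out here, since both depend on the tools you choose; the keyword-counting "embeddings" and the canned answer are purely illustrative:

```python
def embed(text):
    # Stub: a real system would call an embedding model here.
    # This toy version just counts keyword occurrences.
    keywords = ["travel", "reimbursement", "cafeteria"]
    return [text.lower().count(k) for k in keywords]

def search(query_vec, db, top_k=1):
    # Step 2: find the most similar stored passages (dot product as score).
    score = lambda vec: sum(a * b for a, b in zip(query_vec, vec))
    return sorted(db, key=lambda passage: score(embed(passage)), reverse=True)[:top_k]

def llm(prompt):
    # Stub: a real system would send the prompt to a language model.
    return "The maximum reimbursement is CHF 1,200 if the trip was approved."

def answer(question, db):
    passages = search(embed(question), db)                  # steps 1-2
    prompt = f"Context: {passages}\nQuestion: {question}"   # step 3
    return llm(prompt)                                      # step 4

db = ["Travel reimbursement rules ...", "Cafeteria opening hours ..."]
print(answer("What is the reimbursement limit for travel?", db))
```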
Example: Chatbot for company documents
Let's say an employee asks, ‘What is the reimbursement limit for travel?’
The RAG system searches through 80,000 text passages from the company manuals and finds the appropriate passage. This is presented to the language model, which then responds, ‘The maximum reimbursement is CHF 1,200 if the trip was approved.’
What does a RAG system consist of?
A RAG system consists of several components – here are the most important ones and some popular tools for them:
- Creating embeddings: Tools such as OpenAI or Huggingface Transformers convert text into sequences of numbers.
- Storing text in chunks: Vector databases such as Qdrant, Faiss or Weaviate are used for this purpose.
- Implementing search: Search can be integrated via REST APIs or programming languages such as Python, Node.js, PHP or Rust.
- Using language models (LLM): OpenAI, Mistral or local models (e.g. LLaMA) help to generate responses.
- Generating responses with context: Context is optimally transferred to the model via prompt templates, for example with a system message.
| 🧱 Building block | ⚙️ Possible tools/technologies |
|---|---|
| Create embeddings | OpenAI, Huggingface Transformers |
| Save texts in chunks | Vector databases such as Qdrant, Faiss, Weaviate |
| Implement search | REST APIs, Python, Node.js, PHP, Rust |
| Use language model (LLM) | OpenAI API, Mistral, local models (e.g. LLaMA, Mixtral) |
| Generate response with context | Prompt templates (e.g. a system message for the LLM) |
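Before texts can be saved in a vector database, they first have to be split into chunks. A simple sketch with a fixed chunk size and a small overlap so that sentences cut at a boundary still appear intact in one chunk; the sizes are illustrative, and real systems often split on sentence or paragraph boundaries instead:

```python
def chunk_text(text, chunk_size=50, overlap=10):
    # Split text into fixed-size character chunks that overlap slightly.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("A" * 120, chunk_size=50, overlap=10)
print(len(chunks), [len(c) for c in chunks])
```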
Advantages of RAG at a glance
- No expensive retraining of the language model (LLM) necessary.
- Use your own private data – without sending it to third parties.
- Modular and flexible – suitable for various programming languages and applications.
- Combination of intelligent search and smart text generation (via LLM) leads to better answers.
Sample setups
- PHP web server + Qdrant + OpenAI API
- PHP web server + MariaDB 11.8 with vectorisation (HNSW) + DeepSeek AI API
- Rust or Python programme + Faiss + local language model
- Simple Bash script + REST API + JSON data (minimal example)