In the world of artificial intelligence, the term RAG – which stands for retrieval-augmented generation – is becoming increasingly common. But what does it mean, and why is this technology becoming so important?

In short, RAG is a method that combines large language models (LLMs) such as ChatGPT with additional external knowledge. This makes responses more accurate, more up to date and better tailored to the context.

 

How does a RAG system work?

A RAG system consists of two important parts:

1. Retrieval

The system searches through a large collection of texts – for example, documents, websites or internal manuals. To do this, it uses a vector database (such as Qdrant or Faiss), which stores texts as numerical vectors (known as embeddings). This lets the system find the passages that best match the query at lightning speed.
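The retrieval step can be sketched with a toy similarity search. The bag-of-words "embedding" below is a deliberate simplification for illustration; a real system would use an embedding model and a vector database such as Qdrant or Faiss. All data and names here are made up:

```python
from math import sqrt

# Toy corpus standing in for stored document chunks (hypothetical data).
chunks = [
    "Travel expenses must be approved in advance.",
    "The office is open from 8 am to 6 pm.",
    "Reimbursement requires original receipts.",
]

def embed(text: str) -> dict[str, int]:
    """Crude bag-of-words 'embedding' for illustration only.
    A real system would use a trained embedding model."""
    vec: dict[str, int] = {}
    for word in text.lower().replace(".", "").split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

retrieve("How do I get reimbursement for travel receipts?")
# → ['Reimbursement requires original receipts.']
```

A vector database does essentially this, but with learned embeddings and an index that stays fast across millions of chunks.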

2. Generation

The text passages found are then passed to a language model (LLM). The model uses this context to formulate a clear, precise answer.
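A common way to hand the retrieved passages to the LLM is via a system message. A minimal sketch of the prompt assembly; the exact wording and message format are assumptions, not a fixed standard:

```python
def build_prompt(question: str, passages: list[str]) -> list[dict]:
    """Assemble chat messages that ground the LLM in the retrieved passages."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    system = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

messages = build_prompt(
    "What is the reimbursement limit?",
    ["The maximum reimbursement is CHF 1,200 if the trip was approved."],
)
```

The resulting message list can then be sent to any chat-style LLM API; the instruction to answer only from the context is what helps reduce hallucinations.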

 

Why is RAG so useful?

Normal language models have two major limitations:

  • They only know what they learned up to their training cutoff – their knowledge is not always up to date.
  • They do not have access to private or company-specific data.

A RAG system addresses these problems. With it, you can:

  • Integrate your own documents, PDFs or websites into the system.
  • Answer specific questions whose answers exist only in this data.
  • Significantly reduce incorrect or fabricated answers (so-called hallucinations).
  • Use current knowledge without having to retrain the model itself.

 

How does a request work in a RAG system?

User asks a question
     ↓
System searches for the most similar text passages in the database
     ↓
Found texts and the question are sent to the language model
     ↓
Language model writes a suitable answer
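The four steps above can be sketched end to end. The retriever and the LLM below are dummy stand-ins, not real services; in practice they would be a vector-database query and an LLM API call:

```python
def answer(question, retrieve, llm):
    """Run the request flow: retrieve passages, build a prompt, call the LLM."""
    passages = retrieve(question)  # step 2: similarity search in the database
    prompt = "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {question}"  # step 3
    return llm(prompt)  # step 4: the language model writes the answer

# Usage with dummy stand-ins for the retriever and the language model:
reply = answer(
    "What is the limit?",
    retrieve=lambda q: ["The limit is CHF 1,200."],
    llm=lambda p: "Based on the context: " + p.splitlines()[1],
)
# reply == "Based on the context: The limit is CHF 1,200."
```

Because the pieces are passed in as functions, the same flow works whether the retriever is Qdrant or Faiss and whether the LLM is a hosted API or a local model.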

 

Example: Chatbot for company documents

Let's say an employee asks, ‘What is the reimbursement limit for travel?’

The RAG system searches through 80,000 text passages from the company manuals and finds the appropriate passage. This is presented to the language model, which then responds, ‘The maximum reimbursement is CHF 1,200 if the trip was approved.’

 

What does a RAG system consist of?

A RAG system consists of several components – here are the most important ones and some popular tools for them:

  • Creating embeddings: Tools such as the OpenAI API or Huggingface Transformers convert text into numerical vectors.
  • Storing text in chunks: Vector databases such as Qdrant, Faiss or Weaviate store these chunks and their vectors.
  • Implementing search: Search can be integrated via a REST API or in programming languages such as Python, Node.js, PHP or Rust.
  • Using a language model (LLM): OpenAI, Mistral or local models (e.g. LLaMA) generate the responses.
  • Generating responses with context: The retrieved context is passed to the model via prompt templates, for example in a system message.
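Before texts can be stored, they first have to be split into chunks. A minimal sketch of one common strategy, overlapping fixed-size windows; the sizes are illustrative defaults, not recommendations:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows.

    The overlap ensures that a sentence cut at a chunk boundary still
    appears intact in the neighbouring chunk.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

parts = chunk_text("a" * 500, size=200, overlap=50)
# 4 chunks: positions 0–200, 150–350, 300–500, 450–500
```

Real systems often chunk along sentence or paragraph boundaries instead of raw character counts, but the overlap idea is the same.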

 

🧱 Building block                  ⚙️ Possible tools/technologies
Create embeddings                  OpenAI, Huggingface Transformers
Save texts in chunks               Vector databases such as Qdrant, Faiss, Weaviate
Implement search                   REST APIs, Python, Node.js, PHP, Rust
Use language model (LLM)           OpenAI API, Mistral, local models (e.g. LLaMA, Mixtral)
Generate response with context     Prompt templates (e.g. system message for the LLM)

 

Advantages of RAG at a glance

  • No expensive retraining of the language model (LLM) necessary.
  • Use your own private data – without sending it to third parties.
  • Modular and flexible – suitable for various programming languages and applications.
  • Combination of intelligent search and smart text generation (via LLM) leads to better answers.

 
