Retrieval-Augmented Generation (RAG) for grounding LLM responses with external knowledge. Covers document chunking, embeddings, vector stores (pandas, ChromaDB), similarity search, and conversational RAG pipelines.
Installation
Details
Usage
After installing, this skill will be available to your AI coding assistant.
Verify installation:
npx agent-skills-cli listSkill Instructions
name: rag description: | Retrieval-Augmented Generation (RAG) for grounding LLM responses with external knowledge. Covers document chunking, embeddings, vector stores (pandas, ChromaDB), similarity search, and conversational RAG pipelines.
Retrieval-Augmented Generation (RAG)
Overview
RAG enhances LLM responses by retrieving relevant context from a knowledge base before generation. This grounds responses in specific documents and reduces hallucination.
Quick Reference
| Step | Component |
|---|---|
| 1. Chunk | Split documents into segments |
| 2. Embed | Convert chunks to vectors |
| 3. Store | Save in vector database |
| 4. Retrieve | Find relevant chunks |
| 5. Generate | LLM answers with context |
Basic RAG Pipeline
1. Document Chunking
import textwrap
document = """
Your long document text here...
Multiple paragraphs of content...
"""
# Chunk into segments of max 1000 characters
chunks = textwrap.wrap(document, width=1000)
print(f"Created {len(chunks)} chunks")
2. Generate Embeddings
import os
from openai import OpenAI
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
EMBED_MODEL = "llama3.2:latest"
client = OpenAI(base_url=f"{OLLAMA_HOST}/v1", api_key="ollama")
def get_embedding(text):
response = client.embeddings.create(
model=EMBED_MODEL,
input=text
)
return response.data[0].embedding
# Embed all chunks
embeddings = [get_embedding(chunk) for chunk in chunks]
print(f"Embedding dimensions: {len(embeddings[0])}")
3. Create Vector Database (Pandas)
import pandas as pd
import numpy as np
vector_db = pd.DataFrame({
"text": chunks,
"embeddings": [np.array(e) for e in embeddings]
})
4. Similarity Search
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
def search(query, n_results=5):
query_embedding = get_embedding(query)
similarities = vector_db["embeddings"].apply(
lambda x: cosine_similarity(query_embedding, x)
)
top_indices = similarities.nlargest(n_results).index
return vector_db.loc[top_indices, "text"].tolist()
# Find relevant chunks
relevant = search("What are the symptoms?", n_results=3)
5. Generate with Context
LLM_MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"
def rag_query(question, n_docs=5):
# Retrieve context
context_chunks = search(question, n_results=n_docs)
context = "\n\n".join(context_chunks)
# Build prompt
messages = [
{"role": "system", "content": f"Answer based on this context:\n\n{context}"},
{"role": "user", "content": question}
]
# Generate
response = client.chat.completions.create(
model=LLM_MODEL,
messages=messages,
max_tokens=500
)
return response.choices[0].message.content
answer = rag_query("What are the main symptoms of Omicron?")
LangChain RAG with ChromaDB
Setup
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"
# LLM for generation
llm = ChatOpenAI(
base_url=f"{OLLAMA_HOST}/v1",
api_key="ollama",
model=MODEL
)
# Embeddings
embeddings = OpenAIEmbeddings(
base_url=f"{OLLAMA_HOST}/v1",
api_key="ollama",
model="llama3.2:latest"
)
Create Vector Store
import textwrap
# Chunk document
document = "Your document text..."
chunks = textwrap.wrap(document, width=1000)
# Create ChromaDB store
vectorstore = Chroma.from_texts(
texts=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
# Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
Build RAG Chain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_classic.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
# Prompt template
prompt = ChatPromptTemplate.from_messages([
("system", "Answer based on this context:\n\n{context}"),
MessagesPlaceholder(variable_name="chat_history"),
("human", "{input}")
])
# Create chains
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
Conversational RAG
from langchain_core.messages import HumanMessage, AIMessage
chat_history = []
def chat(question):
result = rag_chain.invoke({
"input": question,
"chat_history": chat_history
})
# Update history
chat_history.append(HumanMessage(content=question))
chat_history.append(AIMessage(content=result["answer"]))
return result["answer"]
# Multi-turn conversation
print(chat("What is Omicron?"))
print(chat("What are its symptoms?"))
print(chat("How does it compare to Delta?"))
Chunking Strategies
Fixed Size
def fixed_chunks(text, size=1000):
return textwrap.wrap(text, width=size)
Sentence-Based
import re
def sentence_chunks(text, max_sentences=5):
sentences = re.split(r'(?<=[.!?])\s+', text)
chunks = []
current = []
for sent in sentences:
current.append(sent)
if len(current) >= max_sentences:
chunks.append(" ".join(current))
current = []
if current:
chunks.append(" ".join(current))
return chunks
Overlap Chunks
def overlap_chunks(text, size=1000, overlap=200):
chunks = []
start = 0
while start < len(text):
end = start + size
chunks.append(text[start:end])
start = end - overlap
return chunks
Vector Store Options
Pandas DataFrame (Simple)
import pandas as pd
vector_db = pd.DataFrame({
"text": chunks,
"embeddings": embeddings
})
ChromaDB (Persistent)
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_texts(
texts=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
FAISS (Fast)
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_texts(chunks, embeddings)
vectorstore.save_local("./faiss_index")
Troubleshooting
Poor Retrieval Quality
Symptom: Retrieved chunks not relevant
Fix:
- Increase chunk overlap
- Use smaller chunk sizes
- Try different embedding models
- Increase
kin retriever
Slow Embedding
Symptom: Takes long to embed documents
Fix:
- Batch embeddings
- Use smaller embedding model
- Cache embeddings to disk
Out of Context
Symptom: LLM ignores retrieved context
Fix:
- Increase
max_tokens - Use explicit system prompt
- Reduce number of retrieved chunks
When to Use This Skill
Use when:
- LLM needs to answer from specific documents
- Reducing hallucination is critical
- Building Q&A systems over documents
- Need up-to-date information not in training data
Cross-References
overthink-jupyter:langchain- LangChain fundamentalsoverthink-jupyter:evaluation- Evaluate RAG qualityoverthink-ollama:python- Ollama embeddings API
More by atrawog
View allTransformer architecture fundamentals. Covers self-attention mechanism, multi-head attention, feed-forward networks, layer normalization, and residual connections. Essential concepts for understanding LLMs.
Direct REST API operations for Ollama using the requests library. Covers all /api/* endpoints for model management, text generation, chat completion, embeddings, and streaming responses.
JupyterLab ML/AI development environment management via Podman Quadlet. Supports multi-instance deployment, GPU acceleration (NVIDIA/AMD/Intel), token authentication, and per-instance configuration. Use when users need to configure, start, stop, or manage JupyterLab containers for ML development.
GPU monitoring and performance metrics for Ollama inference. Check GPU status, VRAM usage, loaded models, and inference performance metrics like tokens per second.
