Run local LLMs with Ollama. Configure models, manage resources, and integrate with applications. Use for local AI, private LLM deployments, and offline inference.
Installation
Usage
After installing, this skill will be available to your AI coding assistant.
Verify installation:
npx agent-skills-cli list
Skill Instructions
name: ollama
description: "Run local LLMs with Ollama. Configure models, manage resources, and integrate with applications. Use for local AI, private LLM deployments, and offline inference."
Ollama Skill
Complete guide for Ollama - run LLMs locally.
Quick Reference
Popular Models
| Model | Size | Use Case |
|---|---|---|
| llama3.2 | 1B/3B | General purpose |
| mistral | 7B | Fast, capable |
| codellama | 7B/13B/34B | Code generation |
| phi3 | 3.8B | Compact, fast |
| gemma2 | 2B/9B/27B | Google's model |
| qwen2.5 | 0.5B-72B | Multilingual |
| deepseek-coder | 6.7B/33B | Code specialist |
Commands
ollama run <model> # Interactive chat
ollama pull <model> # Download model
ollama list # List installed
ollama rm <model> # Remove model
ollama serve # Start server
1. Installation
macOS
# Download from ollama.ai or:
brew install ollama
Linux
curl -fsSL https://ollama.ai/install.sh | sh
Windows
# Download installer from ollama.ai
# Or use WSL2 with Linux instructions
Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# With GPU support
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
2. Basic Usage
Run Models
# Run interactively
ollama run llama3.2
# Run with prompt
ollama run llama3.2 "Explain quantum computing"
# Run a specific size
ollama run llama3.2:1b
ollama run llama3.2:3b
# Set a system prompt from inside the interactive session
ollama run llama3.2
>>> /set system You are a helpful coding assistant
Model Management
# Pull model
ollama pull mistral
# List models
ollama list
# Show model info
ollama show llama3.2
# Show model file
ollama show llama3.2 --modelfile
# Copy model
ollama cp llama3.2 my-llama
# Remove model
ollama rm mistral
3. REST API
Generate Completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
Chat Completion
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false
}'
Streaming
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Write a poem"}],
  "stream": true
}'
Embeddings
curl http://localhost:11434/api/embed -d '{
  "model": "llama3.2",
  "input": "The quick brown fox"
}'
List Models (API)
curl http://localhost:11434/api/tags
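If you'd rather not depend on a client library, the same endpoints can be called from any HTTP client. A minimal sketch in Python using requests (the payload mirrors the chat example above; requests is an assumed extra dependency):

```python
import requests

# Non-streaming chat request against the local Ollama server
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": False,
    },
    timeout=120,  # local generation can be slow on CPU
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```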
4. Python Integration
Official Library
pip install ollama
Basic Usage
import ollama
# Generate
response = ollama.generate(
    model='llama3.2',
    prompt='What is Python?'
)
print(response['response'])

# Chat
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Hello!'}
    ]
)
print(response['message']['content'])
Streaming
# Stream generate
for chunk in ollama.generate(model='llama3.2', prompt='Hello', stream=True):
    print(chunk['response'], end='', flush=True)

# Stream chat
for chunk in ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a story'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
Embeddings
# Single embedding
response = ollama.embed(
    model='llama3.2',
    input='Hello, world!'
)
embedding = response['embeddings'][0]

# Multiple embeddings
response = ollama.embed(
    model='llama3.2',
    input=['Hello', 'World']
)
embeddings = response['embeddings']
Async Support
import asyncio
import ollama
async def main():
    response = await ollama.AsyncClient().chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Hello!'}]
    )
    print(response['message']['content'])

asyncio.run(main())
5. LangChain Integration
Setup
from langchain_ollama import ChatOllama, OllamaEmbeddings
# Chat model
llm = ChatOllama(
    model="llama3.2",
    temperature=0.7
)

# Embeddings
embeddings = OllamaEmbeddings(model="llama3.2")
Use in Chains
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
prompt = ChatPromptTemplate.from_template("Explain {topic} simply")
chain = prompt | llm | StrOutputParser()
result = chain.invoke({"topic": "machine learning"})
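Chains built this way can also stream. LCEL runnables expose a stream() method, so a sketch like this should emit tokens as they arrive:

```python
# Stream the chain's output token by token
for chunk in chain.stream({"topic": "machine learning"}):
    print(chunk, end="", flush=True)
```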
6. Custom Models (Modelfile)
Basic Modelfile
# Modelfile
FROM llama3.2
# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
# Set system prompt
SYSTEM """You are a helpful coding assistant specialized in Python.
Always provide clear, well-commented code examples."""
Create Custom Model
# Create model from Modelfile
ollama create my-coder -f ./Modelfile
# Run custom model
ollama run my-coder
Advanced Modelfile
FROM llama3.2:3b
# Model parameters
PARAMETER temperature 0.8
PARAMETER top_k 40
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 8192
PARAMETER num_predict 2048
PARAMETER stop "<|im_end|>"
# System message
SYSTEM """You are an expert software architect. You provide:
1. Clear architectural recommendations
2. Design pattern suggestions
3. Best practices for scalability
4. Security considerations"""
# Template (for custom formats)
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>"""
Import GGUF Models
# Import from GGUF file
FROM ./model.gguf
PARAMETER temperature 0.7
SYSTEM "You are a helpful assistant."
ollama create custom-model -f Modelfile
7. Vision Models
Using Vision
import ollama
import base64
# From file
with open('image.jpg', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode()

response = ollama.chat(
    model='llava',
    messages=[{
        'role': 'user',
        'content': 'What is in this image?',
        'images': [image_data]
    }]
)
print(response['message']['content'])
Via API
curl http://localhost:11434/api/chat -d '{
  "model": "llava",
  "messages": [{
    "role": "user",
    "content": "Describe this image",
    "images": ["base64-encoded-image"]
  }]
}'
8. Code Models
CodeLlama
# Pull code model
ollama pull codellama
# Or specialized variants
ollama pull codellama:7b-instruct
ollama pull codellama:13b-python
Code Generation
response = ollama.generate(
    model='codellama',
    prompt='''Write a Python function that:
1. Takes a list of numbers
2. Returns the median value
3. Handles empty lists'''
)
print(response['response'])
DeepSeek Coder
ollama pull deepseek-coder:6.7b
response = ollama.chat(
    model='deepseek-coder:6.7b',
    messages=[{
        'role': 'user',
        'content': 'Write a REST API in FastAPI for user management'
    }]
)
9. Performance Tuning
Context Length
# Increase context window
response = ollama.generate(
    model='llama3.2',
    prompt='Long document here...',
    options={
        'num_ctx': 8192  # Default is 2048
    }
)
GPU Layers
# Control GPU usage
response = ollama.generate(
    model='llama3.2',
    prompt='Hello',
    options={
        'num_gpu': 50  # Number of layers on GPU
    }
)
Parameters
response = ollama.generate(
    model='llama3.2',
    prompt='Creative writing prompt',
    options={
        'temperature': 0.9,     # Creativity (0-2)
        'top_p': 0.95,          # Nucleus sampling
        'top_k': 40,            # Top-k sampling
        'repeat_penalty': 1.1,  # Reduce repetition
        'num_predict': 500,     # Max tokens
        'seed': 42              # Reproducibility
    }
)
10. Server Configuration
Environment Variables
# Change host/port
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Custom model directory
OLLAMA_MODELS=/path/to/models ollama serve
# Keep models loaded longer between requests (default is 5m)
OLLAMA_KEEP_ALIVE=30m ollama serve
Docker Compose
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open-webui:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  open-webui:
11. Common Patterns
RAG with Ollama
import ollama
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma
# Create embeddings
embeddings = OllamaEmbeddings(model="llama3.2")
# Create vector store
vectorstore = Chroma.from_texts(
    texts=["Document 1...", "Document 2..."],
    embedding=embeddings
)
# Query
def rag_query(question: str) -> str:
    # Retrieve relevant docs
    docs = vectorstore.similarity_search(question, k=3)
    context = "\n".join(doc.page_content for doc in docs)
    # Generate answer
    response = ollama.chat(
        model='llama3.2',
        messages=[
            {'role': 'system', 'content': f'Answer using this context:\n{context}'},
            {'role': 'user', 'content': question}
        ]
    )
    return response['message']['content']
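Example call (the question is illustrative):

```python
print(rag_query("What does Document 1 cover?"))
```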
Function Calling
import json
tools = [
    {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
        }
    }
]

response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': f'You have these tools: {json.dumps(tools)}. Call them by returning JSON with "tool" and "arguments".'},
        {'role': 'user', 'content': 'What is the weather in Paris?'}
    ]
)
# Parse tool call from response
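A sketch of that parsing step, assuming the model followed the JSON instruction above (the get_weather implementation is a hypothetical stub, and real responses often need more defensive handling). Note that newer Ollama releases also accept a native tools= argument on chat for models trained for tool use.

```python
def get_weather(location: str) -> str:
    # Hypothetical stub - swap in a real weather lookup
    return f"Sunny in {location}"

try:
    call = json.loads(response['message']['content'])
    if call.get('tool') == 'get_weather':
        print(get_weather(**call['arguments']))
except json.JSONDecodeError:
    # The model answered in plain text rather than a tool call
    print(response['message']['content'])
```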
Batch Processing
import ollama
from concurrent.futures import ThreadPoolExecutor
def process_item(item):
    response = ollama.generate(
        model='llama3.2',
        prompt=f"Summarize: {item}"
    )
    return response['response']

items = ["Document 1", "Document 2", "Document 3"]
with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(process_item, items))
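For larger batches, an async variant with a concurrency cap avoids spawning one thread per item. A sketch built on the AsyncClient shown earlier (the limit of 3 is arbitrary):

```python
import asyncio
import ollama

async def summarize_all(items, limit=3):
    client = ollama.AsyncClient()
    sem = asyncio.Semaphore(limit)  # cap concurrent requests

    async def summarize(item):
        async with sem:
            r = await client.generate(model='llama3.2', prompt=f"Summarize: {item}")
            return r['response']

    return await asyncio.gather(*(summarize(i) for i in items))

results = asyncio.run(summarize_all(["Document 1", "Document 2", "Document 3"]))
```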
12. Troubleshooting
Common Issues
Model not found:
# Pull the model first
ollama pull llama3.2
# Check available models
ollama list
Out of memory:
# Use a smaller model
ollama run llama3.2:1b # instead of 3b
# Or reduce the context window from inside the session
ollama run llama3.2
>>> /set parameter num_ctx 2048
Slow generation:
# Check GPU usage
nvidia-smi
# Ensure model fits in VRAM
# Or use a quantized variant (check available tags on ollama.com/library)
ollama pull llama3.2:3b-instruct-q4_K_M
Connection refused:
# Start server first
ollama serve
# Check if running
curl http://localhost:11434/api/tags
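For application code, the same check can be wrapped in a small helper (the function name and default host are illustrative; requests is an assumed dependency):

```python
import requests

def ollama_is_up(host: str = "http://localhost:11434") -> bool:
    # Probe the documented /api/tags endpoint
    try:
        return requests.get(f"{host}/api/tags", timeout=5).ok
    except requests.RequestException:
        return False
```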
Best Practices
- Right-size models - Use smallest that works
- Quantization - Use Q4 for speed
- Custom models - Tune for your use case
- Batch requests - Reduce overhead
- Cache responses - Avoid repeat queries (see the sketch after this list)
- Monitor resources - Watch GPU/CPU
- Use streaming - Better UX
- Set timeouts - Handle slow responses
- Test prompts - Iterate on system messages
- Keep updated - New models regularly
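A minimal sketch of the response-caching idea, keyed on model and prompt (seed and temperature are pinned so repeated prompts are comparable; the cache shape and lack of eviction are assumptions to adapt):

```python
import ollama

_cache: dict[tuple[str, str], str] = {}

def cached_generate(model: str, prompt: str) -> str:
    key = (model, prompt)
    if key not in _cache:
        resp = ollama.generate(
            model=model,
            prompt=prompt,
            options={'seed': 42, 'temperature': 0},  # deterministic-ish output
        )
        _cache[key] = resp['response']
    return _cache[key]
```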