Run local LLMs with Ollama. Configure models, manage resources, and integrate with applications. Use for local AI, private LLM deployments, and offline inference.
Installation
Usage
After installing, this skill will be available to your AI coding assistant.
Verify installation:
npx agent-skills-cli list
Skill Instructions
name: ollama
description: "Run local LLMs with Ollama. Configure models, manage resources, and integrate with applications. Use for local AI, private LLM deployments, and offline inference."
Ollama Skill
Complete guide for Ollama - run LLMs locally.
Quick Reference
Popular Models
| Model | Size | Use Case |
|---|---|---|
| llama3.2 | 1B/3B | General purpose |
| mistral | 7B | Fast, capable |
| codellama | 7B/13B/34B | Code generation |
| phi3 | 3.8B | Compact, fast |
| gemma2 | 2B/9B/27B | Google's model |
| qwen2.5 | 0.5B-72B | Multilingual |
| deepseek-coder | 6.7B/33B | Code specialist |
Commands
ollama run <model> # Interactive chat
ollama pull <model> # Download model
ollama list # List installed
ollama rm <model> # Remove model
ollama serve # Start server
1. Installation
macOS
# Download from ollama.ai or:
brew install ollama
Linux
curl -fsSL https://ollama.ai/install.sh | sh
Windows
# Download installer from ollama.ai
# Or use WSL2 with Linux instructions
Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# With GPU support
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
2. Basic Usage
Run Models
# Run interactively
ollama run llama3.2
# Run with prompt
ollama run llama3.2 "Explain quantum computing"
# Run a specific size
ollama run llama3.2:1b
ollama run llama3.2:3b
# Set a system prompt from inside the interactive session
ollama run llama3.2
>>> /set system You are a helpful coding assistant
Model Management
# Pull model
ollama pull mistral
# List models
ollama list
# Show model info
ollama show llama3.2
# Show model file
ollama show llama3.2 --modelfile
# Copy model
ollama cp llama3.2 my-llama
# Remove model
ollama rm mistral
3. REST API
Generate Completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
Chat Completion
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false
}'
Streaming
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Write a poem"}],
  "stream": true
}'
Embeddings
curl http://localhost:11434/api/embed -d '{
  "model": "llama3.2",
  "input": "The quick brown fox"
}'
List Models (API)
curl http://localhost:11434/api/tags
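If you'd rather not depend on a client library, the same endpoints can be called from any HTTP client. A minimal sketch in Python using requests (the payload mirrors the chat example above; requests is an assumed extra dependency):

```python
import requests

# Non-streaming chat request against the local Ollama server
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": False,
    },
    timeout=120,  # local generation can be slow on CPU
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```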
4. Python Integration
Official Library
pip install ollama
Basic Usage
import ollama
# Generate
response = ollama.generate(
    model='llama3.2',
    prompt='What is Python?'
)
print(response['response'])

# Chat
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Hello!'}
    ]
)
print(response['message']['content'])
Streaming
# Stream generate
for chunk in ollama.generate(model='llama3.2', prompt='Hello', stream=True):
    print(chunk['response'], end='', flush=True)

# Stream chat
for chunk in ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a story'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
Embeddings
# Single embedding
response = ollama.embed(
    model='llama3.2',
    input='Hello, world!'
)
embedding = response['embeddings'][0]

# Multiple embeddings
response = ollama.embed(
    model='llama3.2',
    input=['Hello', 'World']
)
embeddings = response['embeddings']
Async Support
import asyncio
import ollama
async def main():
    response = await ollama.AsyncClient().chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Hello!'}]
    )
    print(response['message']['content'])

asyncio.run(main())
5. LangChain Integration
Setup
from langchain_ollama import ChatOllama, OllamaEmbeddings
# Chat model
llm = ChatOllama(
    model="llama3.2",
    temperature=0.7
)

# Embeddings
embeddings = OllamaEmbeddings(model="llama3.2")
Use in Chains
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
prompt = ChatPromptTemplate.from_template("Explain {topic} simply")
chain = prompt | llm | StrOutputParser()
result = chain.invoke({"topic": "machine learning"})
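Chains built this way can also stream. LCEL runnables expose a stream() method, so a sketch like this should emit tokens as they arrive:

```python
# Stream the chain's output token by token
for chunk in chain.stream({"topic": "machine learning"}):
    print(chunk, end="", flush=True)
```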
6. Custom Models (Modelfile)
Basic Modelfile
# Modelfile
FROM llama3.2
# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
# Set system prompt
SYSTEM """You are a helpful coding assistant specialized in Python.
Always provide clear, well-commented code examples."""
Create Custom Model
# Create model from Modelfile
ollama create my-coder -f ./Modelfile
# Run custom model
ollama run my-coder
Advanced Modelfile
FROM llama3.2:3b
# Model parameters
PARAMETER temperature 0.8
PARAMETER top_k 40
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 8192
PARAMETER num_predict 2048
PARAMETER stop "<|im_end|>"
# System message
SYSTEM """You are an expert software architect. You provide:
1. Clear architectural recommendations
2. Design pattern suggestions
3. Best practices for scalability
4. Security considerations"""
# Template (for custom formats)
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>"""
Import GGUF Models
# Import from GGUF file
FROM ./model.gguf
PARAMETER temperature 0.7
SYSTEM "You are a helpful assistant."
ollama create custom-model -f Modelfile
7. Vision Models
Using Vision
import ollama
import base64
# From file
with open('image.jpg', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode()

response = ollama.chat(
    model='llava',
    messages=[{
        'role': 'user',
        'content': 'What is in this image?',
        'images': [image_data]
    }]
)
print(response['message']['content'])
Via API
curl http://localhost:11434/api/chat -d '{
  "model": "llava",
  "messages": [{
    "role": "user",
    "content": "Describe this image",
    "images": ["base64-encoded-image"]
  }]
}'
8. Code Models
CodeLlama
# Pull code model
ollama pull codellama
# Or specialized variants
ollama pull codellama:7b-instruct
ollama pull codellama:13b-python
Code Generation
response = ollama.generate(
    model='codellama',
    prompt='''Write a Python function that:
1. Takes a list of numbers
2. Returns the median value
3. Handles empty lists'''
)
print(response['response'])
DeepSeek Coder
ollama pull deepseek-coder:6.7b
response = ollama.chat(
    model='deepseek-coder:6.7b',
    messages=[{
        'role': 'user',
        'content': 'Write a REST API in FastAPI for user management'
    }]
)
9. Performance Tuning
Context Length
# Increase context window
response = ollama.generate(
    model='llama3.2',
    prompt='Long document here...',
    options={
        'num_ctx': 8192  # Default is 2048
    }
)
GPU Layers
# Control GPU usage
response = ollama.generate(
    model='llama3.2',
    prompt='Hello',
    options={
        'num_gpu': 50  # Number of layers on GPU
    }
)
Parameters
response = ollama.generate(
    model='llama3.2',
    prompt='Creative writing prompt',
    options={
        'temperature': 0.9,     # Creativity (0-2)
        'top_p': 0.95,          # Nucleus sampling
        'top_k': 40,            # Top-k sampling
        'repeat_penalty': 1.1,  # Reduce repetition
        'num_predict': 500,     # Max tokens
        'seed': 42              # Reproducibility
    }
)
10. Server Configuration
Environment Variables
# Change host/port
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Custom model directory
OLLAMA_MODELS=/path/to/models ollama serve
# Keep models loaded longer between requests (default is 5m)
OLLAMA_KEEP_ALIVE=30m ollama serve
Docker Compose
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open-webui:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  open-webui:
11. Common Patterns
RAG with Ollama
import ollama
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma
# Create embeddings
embeddings = OllamaEmbeddings(model="llama3.2")
# Create vector store
vectorstore = Chroma.from_texts(
    texts=["Document 1...", "Document 2..."],
    embedding=embeddings
)
# Query
def rag_query(question: str) -> str:
    # Retrieve relevant docs
    docs = vectorstore.similarity_search(question, k=3)
    context = "\n".join(doc.page_content for doc in docs)
    # Generate answer
    response = ollama.chat(
        model='llama3.2',
        messages=[
            {'role': 'system', 'content': f'Answer using this context:\n{context}'},
            {'role': 'user', 'content': question}
        ]
    )
    return response['message']['content']
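Example call (the question is illustrative):

```python
print(rag_query("What does Document 1 cover?"))
```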
Function Calling
import json
tools = [
    {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
        }
    }
]

response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': f'You have these tools: {json.dumps(tools)}. Call them by returning JSON with "tool" and "arguments".'},
        {'role': 'user', 'content': 'What is the weather in Paris?'}
    ]
)
# Parse tool call from response
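A sketch of that parsing step, assuming the model followed the JSON instruction above (the get_weather implementation is a hypothetical stub, and real responses often need more defensive handling). Note that newer Ollama releases also accept a native tools= argument on chat for models trained for tool use.

```python
def get_weather(location: str) -> str:
    # Hypothetical stub - swap in a real weather lookup
    return f"Sunny in {location}"

try:
    call = json.loads(response['message']['content'])
    if call.get('tool') == 'get_weather':
        print(get_weather(**call['arguments']))
except json.JSONDecodeError:
    # The model answered in plain text rather than a tool call
    print(response['message']['content'])
```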
Batch Processing
import ollama
from concurrent.futures import ThreadPoolExecutor
def process_item(item):
    response = ollama.generate(
        model='llama3.2',
        prompt=f"Summarize: {item}"
    )
    return response['response']

items = ["Document 1", "Document 2", "Document 3"]
with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(process_item, items))
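For larger batches, an async variant with a concurrency cap avoids spawning one thread per item. A sketch built on the AsyncClient shown earlier (the limit of 3 is arbitrary):

```python
import asyncio
import ollama

async def summarize_all(items, limit=3):
    client = ollama.AsyncClient()
    sem = asyncio.Semaphore(limit)  # cap concurrent requests

    async def summarize(item):
        async with sem:
            r = await client.generate(model='llama3.2', prompt=f"Summarize: {item}")
            return r['response']

    return await asyncio.gather(*(summarize(i) for i in items))

results = asyncio.run(summarize_all(["Document 1", "Document 2", "Document 3"]))
```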
12. Troubleshooting
Common Issues
Model not found:
# Pull the model first
ollama pull llama3.2
# Check available models
ollama list
Out of memory:
# Use a smaller model
ollama run llama3.2:1b # instead of 3b
# Or reduce the context window from inside the session
ollama run llama3.2
>>> /set parameter num_ctx 2048
Slow generation:
# Check GPU usage
nvidia-smi
# Ensure model fits in VRAM
# Or use a quantized variant (check available tags on ollama.com/library)
ollama pull llama3.2:3b-instruct-q4_K_M
Connection refused:
# Start server first
ollama serve
# Check if running
curl http://localhost:11434/api/tags
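For application code, the same check can be wrapped in a small helper (the function name and default host are illustrative; requests is an assumed dependency):

```python
import requests

def ollama_is_up(host: str = "http://localhost:11434") -> bool:
    # Probe the documented /api/tags endpoint
    try:
        return requests.get(f"{host}/api/tags", timeout=5).ok
    except requests.RequestException:
        return False
```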
Best Practices
- Right-size models - Use smallest that works
- Quantization - Use Q4 for speed
- Custom models - Tune for your use case
- Batch requests - Reduce overhead
- Cache responses - Avoid repeat queries (see the sketch after this list)
- Monitor resources - Watch GPU/CPU
- Use streaming - Better UX
- Set timeouts - Handle slow responses
- Test prompts - Iterate on system messages
- Keep updated - New models regularly
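A minimal sketch of the response-caching idea, keyed on model and prompt (seed and temperature are pinned so repeated prompts are comparable; the cache shape and lack of eviction are assumptions to adapt):

```python
import ollama

_cache: dict[tuple[str, str], str] = {}

def cached_generate(model: str, prompt: str) -> str:
    key = (model, prompt)
    if key not in _cache:
        resp = ollama.generate(
            model=model,
            prompt=prompt,
            options={'seed': 42, 'temperature': 0},  # deterministic-ish output
        )
        _cache[key] = resp['response']
    return _cache[key]
```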