Train GPT-2 scale models (~124M parameters) efficiently on a single GPU. Covers GPT-124M architecture, FineWeb dataset loading, modern optimizers (Muon, AdamW), mixed precision training, and training loop implementation.
Installation
Usage
After installing, this skill will be available to your AI coding assistant.
Verify installation:
skills list

Skill Instructions
name: nanogpt-training
description: Train GPT-2 scale models (~124M parameters) efficiently on a single GPU. Covers GPT-124M architecture, FineWeb dataset loading, modern optimizers (Muon, AdamW), mixed precision training, and training loop implementation.
NanoGPT Training
Overview
This skill trains GPT-2 scale models (~124M parameters) efficiently on a single GPU. It provides:
- GPT-124M Architecture: Standard transformer with RoPE and modern optimizations (RoPE sketch below, after the Quick Reference table)
- FineWeb Dataset: Pre-tokenized data loading with memory-mapped files (loader sketch under "When to Use What")
- Modern Optimizers: Muon optimizer with Newton-Schulz orthogonalization (see the sketch after this list)
- Mixed Precision: bfloat16 training on A100 for a roughly 2x speedup
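Muon's distinctive step orthogonalizes each 2D parameter's momentum matrix before the update. The sketch below is adapted from the modded-nanogpt repository linked under External Resources; treat it as illustrative rather than this skill's exact implementation.

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration driving G toward the nearest
    # semi-orthogonal matrix; coefficients follow modded-nanogpt.
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    X = X / (X.norm() + 1e-7)  # bound the spectral norm so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X
```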
Two training modes:
- Baseline GPT: Standard residual connections
- mHC GPT: HyperConnections with doubly stochastic constraints (illustrative sketch below)
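The exact mHC formulation lives in the referenced training code. Purely as an illustration of the doubly stochastic idea, the sketch below constrains a learned stream-mixing matrix with log-domain Sinkhorn normalization; the module name, tensor layout, and defaults are hypothetical, not this skill's API.

```python
import torch
import torch.nn as nn

def sinkhorn(logits: torch.Tensor, iters: int = 10) -> torch.Tensor:
    # Alternate row/column normalization in log space so the result is
    # approximately doubly stochastic (rows and columns each sum to 1).
    Z = logits
    for _ in range(iters):
        Z = Z - Z.logsumexp(dim=-1, keepdim=True)  # normalize rows
        Z = Z - Z.logsumexp(dim=-2, keepdim=True)  # normalize columns
    return Z.exp()

class StreamMixer(nn.Module):
    # Hypothetical mixer over num_streams parallel residual streams.
    def __init__(self, num_streams: int = 4):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_streams, num_streams))

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (batch, num_streams, seq_len, n_embd)
        M = sinkhorn(self.logits)
        return torch.einsum('ij,bjse->bise', M, streams)
```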
Quick Reference
| Topic | Reference |
|---|---|
| Model Architecture | GPT Architecture |
| Data Loading | FineWeb Data |
| Optimizers | Optimizers |
| Training Loop | Training Loop |
| Hyperparameters | Hyperparameters |
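The "GPT Architecture" reference covers the transformer with RoPE mentioned in the Overview. One common interleaved formulation of the rotary embedding is sketched below; the function name and tensor layout are assumptions, not necessarily the reference's exact code.

```python
import torch

def apply_rope(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    # x: (batch, n_head, seq_len, head_dim) queries or keys.
    B, H, T, D = x.shape
    freqs = 1.0 / (theta ** (torch.arange(0, D, 2, device=x.device).float() / D))
    angles = torch.outer(torch.arange(T, device=x.device).float(), freqs)  # (T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]   # interleaved feature pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # rotate each pair by its position-dependent angle
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```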
Installation
```bash
pip install torch einops numpy huggingface_hub
```
Minimal Example
```python
import modal

app = modal.App("gpt-training")

image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "einops", "numpy", "huggingface_hub"
)

@app.function(gpu="A100", image=image, timeout=3600)
def train():
    import torch
    from dataclasses import dataclass

    @dataclass
    class GPTConfig:
        block_size: int = 1024
        vocab_size: int = 50257  # GPT-2 BPE vocabulary
        n_layer: int = 12
        n_head: int = 12
        n_embd: int = 768
        dropout: float = 0.0
        bias: bool = False

    # Download data, build model, train
    # ... (see references for full implementation)
    final_loss = None  # set by the training loop elided above
    return {"final_loss": final_loss}

@app.local_entrypoint()
def main():
    results = train.remote()
    print(results)
```
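Assuming the snippet is saved as train.py, launch it with `modal run train.py`; Modal provisions the A100 remotely and streams output back to the local terminal.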
Common Imports
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler
from dataclasses import dataclass
from einops import rearrange, repeat, reduce
import numpy as np
import math
```
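Note that `torch.cuda.amp` is deprecated in favor of `torch.amp` in newer PyTorch releases. On an A100, bfloat16 autocast is the usual choice and needs no GradScaler, since bfloat16 keeps float32's exponent range. A minimal sketch with a stand-in model and loss:

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768).cuda()     # stand-in for the GPT model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(8, 768, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).square().mean()    # forward pass runs in bfloat16
loss.backward()                        # gradients stay float32, like the weights
opt.step()
opt.zero_grad(set_to_none=True)
```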
When to Use What
| Scenario | Approach |
|---|---|
| Standard GPT training | Use baseline model with standard residuals |
| Stability experiments | Use mHC with num_streams=4 |
| Small experiments | Use T4/A10G GPU |
| Full training | Use A100 with bfloat16 |
| Custom data | Modify FineWebDataset class (sketch below) |
| Different model size | Adjust GPTConfig parameters |
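The `FineWebDataset` class in the full implementation wraps the memory-mapped pattern mentioned in the Overview; as a minimal sketch, assuming nanoGPT-style pre-tokenized uint16 shards in a `.bin` file (the path and defaults here are illustrative):

```python
import numpy as np
import torch

def get_batch(bin_path: str, block_size: int = 1024, batch_size: int = 32,
              device: str = "cuda"):
    # Memory-map the shard so only the sampled windows are read from disk.
    data = np.memmap(bin_path, dtype=np.uint16, mode="r")
    ix = torch.randint(len(data) - block_size - 1, (batch_size,)).tolist()
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)  # inputs and next-token targets
```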
Expected Results
| Metric | Baseline | mHC |
|---|---|---|
| Val Loss | < 4.5 | < 4.5 |
| Grad Norm Variance | Higher | Lower |
| Training Stability | Some spikes | More stable |
| Training Time (A100) | ~15 min | ~20 min |
External Resources
- nanoGPT: https://github.com/karpathy/nanoGPT
- build-nanogpt: https://github.com/karpathy/build-nanogpt
- modded-nanogpt: https://github.com/KellerJordan/modded-nanogpt
- FineWeb: https://huggingface.co/datasets/HuggingFaceFW/fineweb