nanogpt-training

@benchflow-ai/nanogpt-training
benchflow-ai
230 stars · 165 forks · Updated 1/18/2026

Train GPT-2 scale models (~124M parameters) efficiently on a single GPU. Covers GPT-124M architecture, FineWeb dataset loading, modern optimizers (Muon, AdamW), mixed precision training, and training loop implementation.

Installation

skills install @benchflow-ai/nanogpt-training

Supported assistants: Claude Code, Cursor, Copilot, Codex, Antigravity

Details

Path: tasks/mhc-layer-impl/environment/skills/nanogpt-training/SKILL.md
Branch: main
Scoped Name: @benchflow-ai/nanogpt-training

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

skills list

Skill Instructions


name: nanogpt-training
description: Train GPT-2 scale models (~124M parameters) efficiently on a single GPU. Covers GPT-124M architecture, FineWeb dataset loading, modern optimizers (Muon, AdamW), mixed precision training, and training loop implementation.

NanoGPT Training

Overview

This skill trains GPT-2 scale models (~124M parameters) efficiently on a single GPU. It provides:

  • GPT-124M Architecture: Standard transformer with RoPE and modern optimizations
  • FineWeb Dataset: Pre-tokenized data loading with memory-mapped files
  • Modern Optimizers: Muon optimizer with Newton-Schulz orthogonalization (sketched after this list)
  • Mixed Precision: bfloat16 training on A100 for 2x speedup
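
The orthogonalization step inside Muon is worth seeing concretely. The sketch below shows the quintic Newton-Schulz iteration commonly used to approximately orthogonalize a 2D gradient update; the full optimizer (momentum, update scaling, and an AdamW fallback for embeddings and other non-matrix parameters) is in the referenced implementation, and the function name here is illustrative.

import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately orthogonalize a 2D update matrix with a quintic
    # Newton-Schulz iteration, the core step of the Muon optimizer.
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315   # widely used quintic coefficients
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)           # bound the spectral norm so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:                      # iterate on the wide orientation for efficiency
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)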

Two training modes:

  • Baseline GPT: Standard residual connections
  • mHC GPT: HyperConnections with doubly stochastic constraints (an illustrative sketch of the doubly stochastic mixing idea follows this list)
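
The mHC layer itself is specified in the task references. Purely to illustrate the doubly stochastic idea, and not as the skill's actual implementation, the sketch below keeps a learned mixing matrix over residual streams doubly stochastic with a few Sinkhorn normalization steps; the class name, shapes, and iteration count are assumptions.

import torch
import torch.nn as nn

class DoublyStochasticMix(nn.Module):
    # Illustrative only: mixes num_streams residual streams with a matrix whose
    # rows and columns each sum to 1, obtained by Sinkhorn-normalizing learned logits.
    def __init__(self, num_streams: int = 4, sinkhorn_iters: int = 5):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_streams, num_streams))
        self.sinkhorn_iters = sinkhorn_iters

    def mixing_matrix(self) -> torch.Tensor:
        M = self.logits.exp()
        for _ in range(self.sinkhorn_iters):
            M = M / M.sum(dim=1, keepdim=True)  # normalize rows
            M = M / M.sum(dim=0, keepdim=True)  # normalize columns
        return M

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (num_streams, batch, seq_len, n_embd)
        return torch.einsum("ij,jbtd->ibtd", self.mixing_matrix(), streams)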

Quick Reference

Topic              | Reference
Model Architecture | GPT Architecture
Data Loading       | FineWeb Data
Optimizers         | Optimizers
Training Loop      | Training Loop
Hyperparameters    | Hyperparameters

Installation

pip install torch einops numpy huggingface_hub
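
The Minimal Example below also assumes the Modal client is installed locally (pip install modal). Before launching a long run, a quick generic check (not part of the skill) confirms that PyTorch sees a GPU with bfloat16 support:

import torch

print("torch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("bfloat16 supported:", torch.cuda.is_bf16_supported())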

Minimal Example

import modal

app = modal.App("gpt-training")

image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "einops", "numpy", "huggingface_hub"
)

@app.function(gpu="A100", image=image, timeout=3600)
def train():
    import torch
    from dataclasses import dataclass

    @dataclass
    class GPTConfig:
        block_size: int = 1024
        vocab_size: int = 50257
        n_layer: int = 12
        n_head: int = 12
        n_embd: int = 768
        dropout: float = 0.0
        bias: bool = False

    # Download data, build model, train
    # ... (see references for full implementation)
    final_loss = None  # placeholder: the full training loop records the final validation loss here

    return {"final_loss": final_loss}

@app.local_entrypoint()
def main():
    results = train.remote()
    print(results)
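
Assuming the example is saved locally as, say, train_gpt.py (the filename is arbitrary), it can be launched with modal run train_gpt.py: Modal builds the image, attaches an A100, executes train() remotely, and runs main() locally to print the returned results.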

Common Imports

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler
from dataclasses import dataclass
from einops import rearrange, repeat, reduce
import numpy as np
import math
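
As a sketch of how these mixed-precision pieces fit together on an A100 (assuming a nanoGPT-style model whose forward returns a (logits, loss) pair; the gradient-clipping value is illustrative), one training step might look like:

import torch

def train_step(model, optimizer, x, y):
    # bfloat16 autocast; GradScaler is not needed for bf16 because it keeps float32's exponent range
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits, loss = model(x, y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # illustrative clip value
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()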

When to Use What

Scenario              | Approach
Standard GPT training | Use baseline model with standard residuals
Stability experiments | Use mHC with num_streams=4
Small experiments     | Use T4/A10G GPU
Full training         | Use A100 with bfloat16
Custom data           | Modify FineWebDataset class (see the sketch after this table)
Different model size  | Adjust GPTConfig parameters
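
The FineWebDataset class itself is defined in the references. As a rough sketch of the memory-mapped pattern it is described as using (the shard layout, dtype, and method names here are assumptions), pre-tokenized uint16 token files can be served as (x, y) batches like this:

import numpy as np
import torch

class MemmapTokenDataset:
    # Sketch of a memory-mapped token loader in the spirit of FineWebDataset.
    # Assumes a single .bin shard of uint16 GPT-2 token ids; the real class may
    # handle multiple shards, train/val splits, and Hugging Face downloads.
    def __init__(self, path: str, block_size: int = 1024):
        self.tokens = np.memmap(path, dtype=np.uint16, mode="r")
        self.block_size = block_size

    def get_batch(self, batch_size: int, device: str = "cuda"):
        # Sample random windows; y is x shifted by one token for next-token prediction.
        ix = torch.randint(0, len(self.tokens) - self.block_size - 1, (batch_size,)).tolist()
        x = torch.stack([torch.from_numpy(self.tokens[i:i + self.block_size].astype(np.int64)) for i in ix])
        y = torch.stack([torch.from_numpy(self.tokens[i + 1:i + 1 + self.block_size].astype(np.int64)) for i in ix])
        return x.to(device, non_blocking=True), y.to(device, non_blocking=True)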

Expected Results

Metric               | Baseline    | mHC
Val Loss             | < 4.5       | < 4.5
Grad Norm Variance   | Higher      | Lower
Training Stability   | Some spikes | More stable
Training Time (A100) | ~15 min     | ~20 min

External Resources