validation-scripts

@benchflow-ai/validation-scripts

benchflow-ai

230

165 forks

Updated 1/18/2026

View on GitHub

Data validation and pipeline testing utilities for ML training projects. Validates datasets, model checkpoints, training pipelines, and dependencies. Use when validating training data, checking model outputs, testing ML pipelines, verifying dependencies, debugging training failures, or ensuring data quality before training.

Installation

$skills install @benchflow-ai/validation-scripts

Claude Code

Cursor

Copilot

Codex

Antigravity

Details

Repositorybenchflow-ai/skillsbench

Pathregistry/terminal_bench_1.0/predict-customer-churn/environment/skills/validation-scripts/SKILL.md

Branchmain

Scoped Name@benchflow-ai/validation-scripts

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

skills list

Skill Instructions

name: validation-scripts description: Data validation and pipeline testing utilities for ML training projects. Validates datasets, model checkpoints, training pipelines, and dependencies. Use when validating training data, checking model outputs, testing ML pipelines, verifying dependencies, debugging training failures, or ensuring data quality before training. allowed-tools: Bash, Read, Write, Grep, Glob, Edit

ML Training Validation Scripts

Purpose: Production-ready validation and testing utilities for ML training workflows. Ensures data quality, model integrity, pipeline correctness, and dependency availability before and during training.

Activation Triggers:

Validating training datasets before fine-tuning
Checking model checkpoints and outputs
Testing end-to-end training pipelines
Verifying system dependencies and GPU availability
Debugging training failures or data issues
Ensuring data format compliance
Validating model configurations
Testing inference pipelines

Key Resources:

scripts/validate-data.sh - Comprehensive dataset validation
scripts/validate-model.sh - Model checkpoint and config validation
scripts/test-pipeline.sh - End-to-end pipeline testing
scripts/check-dependencies.sh - System and package dependency checking
templates/test-config.yaml - Pipeline test configuration template
templates/validation-schema.json - Data validation schema template
examples/data-validation-example.md - Dataset validation workflow
examples/pipeline-testing-example.md - Complete pipeline testing guide

Quick Start

Validate Training Data

bash scripts/validate-data.sh \
  --data-path ./data/train.jsonl \
  --format jsonl \
  --schema templates/validation-schema.json

Validate Model Checkpoint

bash scripts/validate-model.sh \
  --model-path ./checkpoints/epoch-3 \
  --framework pytorch \
  --check-weights

Test Complete Pipeline

bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --data ./data/sample.jsonl \
  --verbose

Check System Dependencies

bash scripts/check-dependencies.sh \
  --framework pytorch \
  --gpu-required \
  --min-vram 16

Validation Scripts

1. Data Validation (validate-data.sh)

Purpose: Validate training datasets for format compliance, data quality, and schema conformance.

Usage:

bash scripts/validate-data.sh [OPTIONS]

Options:

--data-path PATH - Path to dataset file or directory (required)
--format FORMAT - Data format: jsonl, csv, parquet, arrow (default: jsonl)
--schema PATH - Path to validation schema JSON file
--sample-size N - Number of samples to validate (default: all)
--check-duplicates - Check for duplicate entries
--check-null - Check for null/missing values
--check-length - Validate text length constraints
--check-tokens - Validate tokenization compatibility
--tokenizer MODEL - Tokenizer to use for token validation (default: gpt2)
--max-length N - Maximum sequence length (default: 2048)
--output REPORT - Output validation report path (default: validation-report.json)

Validation Checks:

Format Compliance: Verify file format and structure
Schema Validation: Check against JSON schema if provided
Data Types: Validate field types (string, int, float, etc.)
Required Fields: Ensure all required fields are present
Value Ranges: Check numeric values within expected ranges
Text Quality: Detect empty strings, excessive whitespace, encoding issues
Duplicates: Identify duplicate entries (exact or near-duplicate)
Token Counts: Verify sequences fit within model context length
Label Distribution: Check class balance for classification tasks
Missing Values: Detect and report null/missing values

Example Output:

{
  "status": "PASS",
  "total_samples": 10000,
  "valid_samples": 9987,
  "invalid_samples": 13,
  "validation_errors": [
    {
      "sample_id": 42,
      "field": "text",
      "error": "Exceeds max token length: 2150 > 2048"
    },
    {
      "sample_id": 156,
      "field": "label",
      "error": "Invalid label value: 'unknwn' (typo)"
    }
  ],
  "statistics": {
    "avg_text_length": 487,
    "avg_token_count": 128,
    "label_distribution": {
      "positive": 4892,
      "negative": 4895,
      "neutral": 213
    }
  },
  "recommendations": [
    "Remove or fix 13 invalid samples before training",
    "Label distribution is imbalanced - consider class weighting"
  ]
}

Exit Codes:

0 - All validations passed
1 - Validation errors found
2 - Script error (invalid arguments, file not found)

2. Model Validation (validate-model.sh)

Purpose: Validate model checkpoints, configurations, and inference readiness.

Usage:

bash scripts/validate-model.sh [OPTIONS]

Options:

--model-path PATH - Path to model checkpoint directory (required)
--framework FRAMEWORK - Framework: pytorch, tensorflow, jax (default: pytorch)
--check-weights - Verify model weights are loadable
--check-config - Validate model configuration file
--check-tokenizer - Verify tokenizer files are present
--check-inference - Test inference with sample input
--sample-input TEXT - Sample text for inference test
--expected-output TEXT - Expected output for verification
--output REPORT - Output validation report path

Validation Checks:

File Structure: Verify required files are present
Config Validation: Check model_config.json for correctness
Weight Integrity: Load and verify model weights
Tokenizer Files: Ensure tokenizer files exist and are loadable
Model Architecture: Validate architecture matches config
Memory Requirements: Estimate GPU/CPU memory needed
Inference Test: Run sample inference if requested
Quantization: Verify quantized models load correctly
LoRA/PEFT: Validate adapter weights and configuration

Example Output:

{
  "status": "PASS",
  "model_path": "./checkpoints/llama-7b-finetuned",
  "framework": "pytorch",
  "checks": {
    "file_structure": "PASS",
    "config": "PASS",
    "weights": "PASS",
    "tokenizer": "PASS",
    "inference": "PASS"
  },
  "model_info": {
    "architecture": "LlamaForCausalLM",
    "parameters": "7.2B",
    "precision": "float16",
    "lora_enabled": true,
    "lora_rank": 16
  },
  "memory_estimate": {
    "model_size_gb": 13.5,
    "inference_vram_gb": 16.2,
    "training_vram_gb": 24.8
  },
  "inference_test": {
    "input": "Hello, world!",
    "output": "Hello, world! How can I help you today?",
    "latency_ms": 142
  }
}

3. Pipeline Testing (test-pipeline.sh)

Purpose: Test end-to-end ML training and inference pipelines with sample data.

Usage:

bash scripts/test-pipeline.sh [OPTIONS]

Options:

--config PATH - Pipeline test configuration file (required)
--data PATH - Sample data for testing
--steps STEPS - Comma-separated steps to test (default: all)
--verbose - Enable detailed output
--output REPORT - Output test report path
--fail-fast - Stop on first failure
--cleanup - Clean up temporary files after test

Pipeline Steps:

data_loading: Test data loading and preprocessing
tokenization: Verify tokenization pipeline
model_loading: Load model and verify initialization
training_step: Run single training step
validation_step: Run single validation step
checkpoint_save: Test checkpoint saving
checkpoint_load: Test checkpoint loading
inference: Test inference pipeline
metrics: Verify metrics calculation

Test Configuration (test-config.yaml):

pipeline:
  name: llama-7b-finetuning
  framework: pytorch

data:
  train_path: ./data/sample-train.jsonl
  val_path: ./data/sample-val.jsonl
  format: jsonl

model:
  base_model: meta-llama/Llama-2-7b-hf
  checkpoint_path: ./checkpoints/test
  load_in_8bit: false
  lora:
    enabled: true
    r: 16
    alpha: 32

training:
  batch_size: 1
  gradient_accumulation: 1
  learning_rate: 2e-4
  max_steps: 5

testing:
  sample_size: 10
  timeout_seconds: 300
  expected_loss_range: [0.5, 3.0]

Example Output:

Pipeline Test Report
====================

Pipeline: llama-7b-finetuning
Started: 2025-11-01 12:34:56
Duration: 127 seconds

Test Results:
✓ data_loading (2.3s) - Loaded 10 samples successfully
✓ tokenization (1.1s) - Tokenized all samples, avg length: 128 tokens
✓ model_loading (8.7s) - Model loaded, 7.2B parameters
✓ training_step (15.4s) - Training step completed, loss: 1.847
✓ validation_step (12.1s) - Validation step completed, loss: 1.923
✓ checkpoint_save (3.2s) - Checkpoint saved to ./checkpoints/test
✓ checkpoint_load (6.8s) - Checkpoint loaded successfully
✓ inference (2.9s) - Inference completed, latency: 142ms
✓ metrics (0.4s) - Metrics calculated correctly

Overall: PASS (8/8 tests passed)

Performance Metrics:
- Total time: 127s
- GPU memory used: 15.2 GB
- CPU memory used: 8.4 GB

Recommendations:
- Pipeline is ready for full training
- Consider increasing batch_size to improve throughput

4. Dependency Checking (check-dependencies.sh)

Purpose: Verify all required system dependencies, packages, and GPU availability.

Usage:

bash scripts/check-dependencies.sh [OPTIONS]

Options:

--framework FRAMEWORK - ML framework: pytorch, tensorflow, jax (default: pytorch)
--gpu-required - Require GPU availability
--min-vram GB - Minimum GPU VRAM required (GB)
--cuda-version VERSION - Required CUDA version
--packages FILE - Path to requirements.txt or packages list
--platform PLATFORM - Platform: modal, lambda, runpod, local (default: local)
--fix - Attempt to install missing packages
--output REPORT - Output dependency report path

Dependency Checks:

Python Version: Verify compatible Python version
ML Framework: Check PyTorch/TensorFlow/JAX installation
CUDA/cuDNN: Verify CUDA toolkit and cuDNN
GPU Availability: Detect and validate GPUs
VRAM: Check GPU memory capacity
System Packages: Verify system-level dependencies
Python Packages: Check required pip packages
Platform Tools: Validate platform-specific tools (Modal, Lambda CLI)
Storage: Check available disk space
Network: Test internet connectivity for model downloads

Example Output:

{
  "status": "PASS",
  "platform": "modal",
  "checks": {
    "python": {
      "status": "PASS",
      "version": "3.10.12",
      "required": ">=3.9"
    },
    "pytorch": {
      "status": "PASS",
      "version": "2.1.0",
      "cuda_available": true,
      "cuda_version": "12.1"
    },
    "gpu": {
      "status": "PASS",
      "count": 1,
      "type": "NVIDIA A100",
      "vram_gb": 40,
      "driver_version": "535.129.03"
    },
    "packages": {
      "status": "PASS",
      "installed": 42,
      "missing": 0,
      "outdated": 3
    },
    "storage": {
      "status": "PASS",
      "available_gb": 128,
      "required_gb": 50
    }
  },
  "recommendations": [
    "Update transformers to latest version (4.36.0 -> 4.37.2)",
    "Consider upgrading to PyTorch 2.2.0 for better performance"
  ]
}

Templates

Validation Schema Template (templates/validation-schema.json)

Defines expected data structure and validation rules for training datasets.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["text", "label"],
  "properties": {
    "text": {
      "type": "string",
      "minLength": 10,
      "maxLength": 5000,
      "description": "Input text for training"
    },
    "label": {
      "type": "string",
      "enum": ["positive", "negative", "neutral"],
      "description": "Classification label"
    },
    "metadata": {
      "type": "object",
      "properties": {
        "source": {
          "type": "string"
        },
        "timestamp": {
          "type": "string",
          "format": "date-time"
        }
      }
    }
  },
  "validation_rules": {
    "max_token_length": 2048,
    "tokenizer": "meta-llama/Llama-2-7b-hf",
    "check_duplicates": true,
    "min_label_count": 100,
    "max_label_imbalance_ratio": 10.0
  }
}

Customization:

Modify required fields for your use case
Add custom properties and validation rules
Set appropriate length constraints
Define label enums for classification
Configure tokenizer and max length

Test Configuration Template (templates/test-config.yaml)

Defines pipeline testing parameters and expected behaviors.

pipeline:
  name: test-pipeline
  framework: pytorch
  platform: modal  # modal, lambda, runpod, local

data:
  train_path: ./data/sample-train.jsonl
  val_path: ./data/sample-val.jsonl
  format: jsonl  # jsonl, csv, parquet
  sample_size: 10  # Number of samples to use for testing

model:
  base_model: meta-llama/Llama-2-7b-hf
  checkpoint_path: ./checkpoints/test
  quantization: null  # null, 8bit, 4bit
  lora:
    enabled: true
    r: 16
    alpha: 32
    dropout: 0.05
    target_modules: ["q_proj", "v_proj"]

training:
  batch_size: 1
  gradient_accumulation: 1
  learning_rate: 2e-4
  max_steps: 5
  warmup_steps: 0
  eval_steps: 2

testing:
  sample_size: 10
  timeout_seconds: 300
  expected_loss_range: [0.5, 3.0]
  fail_fast: true
  cleanup: true

validation:
  metrics: ["loss", "perplexity"]
  check_gradients: true
  check_memory_leak: true

gpu:
  required: true
  min_vram_gb: 16
  allow_cpu_fallback: false

Integration with ML Training Workflow

Pre-Training Validation

# 1. Check system dependencies
bash scripts/check-dependencies.sh \
  --framework pytorch \
  --gpu-required \
  --min-vram 16

# 2. Validate training data
bash scripts/validate-data.sh \
  --data-path ./data/train.jsonl \
  --schema templates/validation-schema.json \
  --check-duplicates \
  --check-tokens

# 3. Test pipeline with sample data
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --verbose

Post-Training Validation

# 1. Validate model checkpoint
bash scripts/validate-model.sh \
  --model-path ./checkpoints/final \
  --framework pytorch \
  --check-weights \
  --check-inference

# 2. Test inference pipeline
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --steps inference,metrics

CI/CD Integration

# .github/workflows/validate-training.yml
- name: Validate Training Data
  run: |
    bash plugins/ml-training/skills/validation-scripts/scripts/validate-data.sh \
      --data-path ./data/train.jsonl \
      --schema ./validation-schema.json

- name: Test Training Pipeline
  run: |
    bash plugins/ml-training/skills/validation-scripts/scripts/test-pipeline.sh \
      --config ./test-config.yaml \
      --fail-fast

Error Handling and Debugging

Common Validation Failures

Data Validation Failures:

Token length exceeded → Truncate or filter long sequences
Missing required fields → Fix data preprocessing
Duplicate entries → Deduplicate dataset
Invalid labels → Clean label data

Model Validation Failures:

Missing config files → Ensure complete checkpoint
Weight loading errors → Check framework compatibility
Tokenizer mismatch → Verify tokenizer version

Pipeline Test Failures:

Out of memory → Reduce batch size, enable gradient checkpointing
CUDA errors → Check GPU drivers, CUDA version
Slow performance → Profile and optimize data loading

Debug Mode

All scripts support verbose debugging:

bash scripts/validate-data.sh --data-path ./data/train.jsonl --verbose --debug

Outputs:

Detailed progress logs
Stack traces for errors
Performance metrics
Memory usage statistics

Best Practices

Always validate before training - Catch data issues early
Use schema validation - Enforce data quality standards
Test with sample data first - Verify pipeline before full training
Check dependencies on new platforms - Avoid runtime failures
Validate checkpoints - Ensure model integrity before deployment
Monitor validation metrics - Track data quality over time
Automate validation in CI/CD - Prevent bad data from entering pipeline
Keep validation schemas updated - Reflect current data requirements

Performance Tips

Use --sample-size for quick validation of large datasets
Run dependency checks once per environment, cache results
Parallelize validation with GNU parallel for multi-file datasets
Use --fail-fast in CI/CD to save time
Cache tokenizers to avoid re-downloading

Troubleshooting

Script Not Found:

# Ensure you're in the correct directory
cd /path/to/ml-training/skills/validation-scripts
bash scripts/validate-data.sh --help

Permission Denied:

# Make scripts executable
chmod +x scripts/*.sh

Missing Dependencies:

# Install required tools
pip install jsonschema pandas pyarrow
sudo apt-get install jq bc

Plugin: ml-training Version: 1.0.0 Supported Frameworks: PyTorch, TensorFlow, JAX Platforms: Modal, Lambda Labs, RunPod, Local

More by benchflow-ai

View all

fjsp-baseline-repair-with-downtime-and-policy

230

Repair an (often imperfect) Flexible Job Shop Scheduling baseline into a downtime-feasible, precedence-correct schedule while staying within policy budgets and matching the evaluator’s exact metrics and “local minimal right-shift” checks.

temporal-python-testing

230

Test Temporal workflows with pytest, time-skipping, and mocking strategies. Covers unit testing, integration testing, replay testing, and local development setup. Use when implementing Temporal workflow tests or debugging test failures.

locational-marginal-prices

230

Extract locational marginal prices (LMPs) from DC-OPF solutions using dual values. Use when computing nodal electricity prices, reserve clearing prices, or performing price impact analysis.

Managed Package Architecture

230

This skill should be used when the user asks to "design package structure", "create managed package", "configure 2GP", "set up namespace", "version management", or mentions managed package topics like "LMA", "subscriber orgs", or "package versioning". Provides comprehensive guidance for second-generation managed package (2GP) architecture, ISV development patterns, and package lifecycle management.