Data validation and pipeline testing utilities for ML training projects. Validates datasets, model checkpoints, training pipelines, and dependencies. Use when validating training data, checking model outputs, testing ML pipelines, verifying dependencies, debugging training failures, or ensuring data quality before training.
Installation
After installing, this skill will be available to your AI coding assistant.
Verify installation:
skills list
Skill Instructions
name: validation-scripts
description: Data validation and pipeline testing utilities for ML training projects. Validates datasets, model checkpoints, training pipelines, and dependencies. Use when validating training data, checking model outputs, testing ML pipelines, verifying dependencies, debugging training failures, or ensuring data quality before training.
allowed-tools: Bash, Read, Write, Grep, Glob, Edit
ML Training Validation Scripts
Purpose: Production-ready validation and testing utilities for ML training workflows. Ensures data quality, model integrity, pipeline correctness, and dependency availability before and during training.
Activation Triggers:
- Validating training datasets before fine-tuning
- Checking model checkpoints and outputs
- Testing end-to-end training pipelines
- Verifying system dependencies and GPU availability
- Debugging training failures or data issues
- Ensuring data format compliance
- Validating model configurations
- Testing inference pipelines
Key Resources:
- scripts/validate-data.sh - Comprehensive dataset validation
- scripts/validate-model.sh - Model checkpoint and config validation
- scripts/test-pipeline.sh - End-to-end pipeline testing
- scripts/check-dependencies.sh - System and package dependency checking
- templates/test-config.yaml - Pipeline test configuration template
- templates/validation-schema.json - Data validation schema template
- examples/data-validation-example.md - Dataset validation workflow
- examples/pipeline-testing-example.md - Complete pipeline testing guide
Quick Start
Validate Training Data
bash scripts/validate-data.sh \
--data-path ./data/train.jsonl \
--format jsonl \
--schema templates/validation-schema.json
Validate Model Checkpoint
bash scripts/validate-model.sh \
--model-path ./checkpoints/epoch-3 \
--framework pytorch \
--check-weights
Test Complete Pipeline
bash scripts/test-pipeline.sh \
--config templates/test-config.yaml \
--data ./data/sample.jsonl \
--verbose
Check System Dependencies
bash scripts/check-dependencies.sh \
--framework pytorch \
--gpu-required \
--min-vram 16
Validation Scripts
1. Data Validation (validate-data.sh)
Purpose: Validate training datasets for format compliance, data quality, and schema conformance.
Usage:
bash scripts/validate-data.sh [OPTIONS]
Options:
- --data-path PATH - Path to dataset file or directory (required)
- --format FORMAT - Data format: jsonl, csv, parquet, arrow (default: jsonl)
- --schema PATH - Path to validation schema JSON file
- --sample-size N - Number of samples to validate (default: all)
- --check-duplicates - Check for duplicate entries
- --check-null - Check for null/missing values
- --check-length - Validate text length constraints
- --check-tokens - Validate tokenization compatibility
- --tokenizer MODEL - Tokenizer to use for token validation (default: gpt2)
- --max-length N - Maximum sequence length (default: 2048)
- --output REPORT - Output validation report path (default: validation-report.json)
Validation Checks:
- Format Compliance: Verify file format and structure
- Schema Validation: Check against JSON schema if provided
- Data Types: Validate field types (string, int, float, etc.)
- Required Fields: Ensure all required fields are present
- Value Ranges: Check numeric values within expected ranges
- Text Quality: Detect empty strings, excessive whitespace, encoding issues
- Duplicates: Identify duplicate entries (exact or near-duplicate)
- Token Counts: Verify sequences fit within model context length
- Label Distribution: Check class balance for classification tasks
- Missing Values: Detect and report null/missing values
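For a quick spot-check before running the full validator, a couple of the checks above can be approximated from the shell with jq (installed under Troubleshooting below). A minimal sketch, assuming a JSONL file of objects at ./data/train.jsonl:
# Count byte-identical duplicate lines (exact duplicates only).
sort ./data/train.jsonl | uniq -d | wc -l
# List records whose "text" field is null or missing entirely.
jq -c 'select(.text == null)' ./data/train.jsonl | head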
Example Output:
{
"status": "PASS",
"total_samples": 10000,
"valid_samples": 9987,
"invalid_samples": 13,
"validation_errors": [
{
"sample_id": 42,
"field": "text",
"error": "Exceeds max token length: 2150 > 2048"
},
{
"sample_id": 156,
"field": "label",
"error": "Invalid label value: 'unknwn' (typo)"
}
],
"statistics": {
"avg_text_length": 487,
"avg_token_count": 128,
"label_distribution": {
"positive": 4892,
"negative": 4895,
"neutral": 213
}
},
"recommendations": [
"Remove or fix 13 invalid samples before training",
"Label distribution is imbalanced - consider class weighting"
]
}
Exit Codes:
- 0 - All validations passed
- 1 - Validation errors found
- 2 - Script error (invalid arguments, file not found)
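These exit codes make the validator easy to gate on in automation. A minimal sketch, distinguishing bad data (1) from a misconfigured run (2):
bash scripts/validate-data.sh --data-path ./data/train.jsonl
case $? in
  0) echo "data OK, safe to start training" ;;
  1) echo "validation errors found, see validation-report.json" >&2; exit 1 ;;
  2) echo "validator misconfigured (bad arguments or missing file)" >&2; exit 2 ;;
esac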
2. Model Validation (validate-model.sh)
Purpose: Validate model checkpoints, configurations, and inference readiness.
Usage:
bash scripts/validate-model.sh [OPTIONS]
Options:
- --model-path PATH - Path to model checkpoint directory (required)
- --framework FRAMEWORK - Framework: pytorch, tensorflow, jax (default: pytorch)
- --check-weights - Verify model weights are loadable
- --check-config - Validate model configuration file
- --check-tokenizer - Verify tokenizer files are present
- --check-inference - Test inference with sample input
- --sample-input TEXT - Sample text for inference test
- --expected-output TEXT - Expected output for verification
- --output REPORT - Output validation report path
Validation Checks:
- File Structure: Verify required files are present
- Config Validation: Check model_config.json for correctness
- Weight Integrity: Load and verify model weights
- Tokenizer Files: Ensure tokenizer files exist and are loadable
- Model Architecture: Validate architecture matches config
- Memory Requirements: Estimate GPU/CPU memory needed
- Inference Test: Run sample inference if requested
- Quantization: Verify quantized models load correctly
- LoRA/PEFT: Validate adapter weights and configuration
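A hand-rolled stand-in for the file-structure check can be useful when the script is unavailable. This sketch assumes a Hugging Face-style PyTorch checkpoint layout; the file names are conventions, not something the script guarantees:
CKPT=./checkpoints/epoch-3
for f in config.json tokenizer_config.json; do
  [ -f "$CKPT/$f" ] || echo "missing: $f"
done
# Weights may be a single file or sharded; accept safetensors or .bin.
ls "$CKPT"/*.safetensors "$CKPT"/pytorch_model*.bin 2>/dev/null | grep -q . \
  || echo "missing: model weights"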
Example Output:
{
"status": "PASS",
"model_path": "./checkpoints/llama-7b-finetuned",
"framework": "pytorch",
"checks": {
"file_structure": "PASS",
"config": "PASS",
"weights": "PASS",
"tokenizer": "PASS",
"inference": "PASS"
},
"model_info": {
"architecture": "LlamaForCausalLM",
"parameters": "7.2B",
"precision": "float16",
"lora_enabled": true,
"lora_rank": 16
},
"memory_estimate": {
"model_size_gb": 13.5,
"inference_vram_gb": 16.2,
"training_vram_gb": 24.8
},
"inference_test": {
"input": "Hello, world!",
"output": "Hello, world! How can I help you today?",
"latency_ms": 142
}
}
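The model_size_gb figure follows directly from parameter count times bytes per parameter; a quick check with bc (installed under Troubleshooting below), assuming float16 at 2 bytes per parameter:
# 7.2e9 params x 2 bytes ≈ 13.4 GiB, matching model_size_gb above.
# Inference and training estimates add activations, KV cache, and
# optimizer state on top of the raw weights.
echo "7.2 * 10^9 * 2 / 2^30" | bc -l   # ≈ 13.41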
3. Pipeline Testing (test-pipeline.sh)
Purpose: Test end-to-end ML training and inference pipelines with sample data.
Usage:
bash scripts/test-pipeline.sh [OPTIONS]
Options:
- --config PATH - Pipeline test configuration file (required)
- --data PATH - Sample data for testing
- --steps STEPS - Comma-separated steps to test (default: all)
- --verbose - Enable detailed output
- --output REPORT - Output test report path
- --fail-fast - Stop on first failure
- --cleanup - Clean up temporary files after test
Pipeline Steps:
- data_loading: Test data loading and preprocessing
- tokenization: Verify tokenization pipeline
- model_loading: Load model and verify initialization
- training_step: Run single training step
- validation_step: Run single validation step
- checkpoint_save: Test checkpoint saving
- checkpoint_load: Test checkpoint loading
- inference: Test inference pipeline
- metrics: Verify metrics calculation
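Because steps are individually selectable, a common pattern while debugging is to re-run only the cheap front half of the pipeline, then the full suite once loading is stable:
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --steps data_loading,tokenization \
  --fail-fast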
Test Configuration (test-config.yaml):
pipeline:
name: llama-7b-finetuning
framework: pytorch
data:
train_path: ./data/sample-train.jsonl
val_path: ./data/sample-val.jsonl
format: jsonl
model:
base_model: meta-llama/Llama-2-7b-hf
checkpoint_path: ./checkpoints/test
load_in_8bit: false
lora:
enabled: true
r: 16
alpha: 32
training:
batch_size: 1
gradient_accumulation: 1
learning_rate: 2e-4
max_steps: 5
testing:
sample_size: 10
timeout_seconds: 300
expected_loss_range: [0.5, 3.0]
Example Output:
Pipeline Test Report
====================
Pipeline: llama-7b-finetuning
Started: 2025-11-01 12:34:56
Duration: 127 seconds
Test Results:
✓ data_loading (2.3s) - Loaded 10 samples successfully
✓ tokenization (1.1s) - Tokenized all samples, avg length: 128 tokens
✓ model_loading (8.7s) - Model loaded, 7.2B parameters
✓ training_step (15.4s) - Training step completed, loss: 1.847
✓ validation_step (12.1s) - Validation step completed, loss: 1.923
✓ checkpoint_save (3.2s) - Checkpoint saved to ./checkpoints/test
✓ checkpoint_load (6.8s) - Checkpoint loaded successfully
✓ inference (2.9s) - Inference completed, latency: 142ms
✓ metrics (0.4s) - Metrics calculated correctly
Overall: PASS (8/8 tests passed)
Performance Metrics:
- Total time: 127s
- GPU memory used: 15.2 GB
- CPU memory used: 8.4 GB
Recommendations:
- Pipeline is ready for full training
- Consider increasing batch_size to improve throughput
4. Dependency Checking (check-dependencies.sh)
Purpose: Verify all required system dependencies, packages, and GPU availability.
Usage:
bash scripts/check-dependencies.sh [OPTIONS]
Options:
- --framework FRAMEWORK - ML framework: pytorch, tensorflow, jax (default: pytorch)
- --gpu-required - Require GPU availability
- --min-vram GB - Minimum GPU VRAM required (GB)
- --cuda-version VERSION - Required CUDA version
- --packages FILE - Path to requirements.txt or packages list
- --platform PLATFORM - Platform: modal, lambda, runpod, local (default: local)
- --fix - Attempt to install missing packages
- --output REPORT - Output dependency report path
Dependency Checks:
- Python Version: Verify compatible Python version
- ML Framework: Check PyTorch/TensorFlow/JAX installation
- CUDA/cuDNN: Verify CUDA toolkit and cuDNN
- GPU Availability: Detect and validate GPUs
- VRAM: Check GPU memory capacity
- System Packages: Verify system-level dependencies
- Python Packages: Check required pip packages
- Platform Tools: Validate platform-specific tools (Modal, Lambda CLI)
- Storage: Check available disk space
- Network: Test internet connectivity for model downloads
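A few of these checks have direct one-liner equivalents, handy on a fresh machine before the script is set up (assumes an NVIDIA GPU for the nvidia-smi line):
python3 --version
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader
df -h .    # free disk space in the working directory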
Example Output:
{
"status": "PASS",
"platform": "modal",
"checks": {
"python": {
"status": "PASS",
"version": "3.10.12",
"required": ">=3.9"
},
"pytorch": {
"status": "PASS",
"version": "2.1.0",
"cuda_available": true,
"cuda_version": "12.1"
},
"gpu": {
"status": "PASS",
"count": 1,
"type": "NVIDIA A100",
"vram_gb": 40,
"driver_version": "535.129.03"
},
"packages": {
"status": "PASS",
"installed": 42,
"missing": 0,
"outdated": 3
},
"storage": {
"status": "PASS",
"available_gb": 128,
"required_gb": 50
}
},
"recommendations": [
"Update transformers to latest version (4.36.0 -> 4.37.2)",
"Consider upgrading to PyTorch 2.2.0 for better performance"
]
}
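Since the report is JSON, later jobs can gate on a cached copy instead of re-running the check; jq -e exits nonzero when the expression is false or null:
bash scripts/check-dependencies.sh --framework pytorch --output dep-report.json
jq -e '.status == "PASS"' dep-report.json > /dev/null \
  || { echo "dependency check failed" >&2; exit 1; }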
Templates
Validation Schema Template (templates/validation-schema.json)
Defines expected data structure and validation rules for training datasets.
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"required": ["text", "label"],
"properties": {
"text": {
"type": "string",
"minLength": 10,
"maxLength": 5000,
"description": "Input text for training"
},
"label": {
"type": "string",
"enum": ["positive", "negative", "neutral"],
"description": "Classification label"
},
"metadata": {
"type": "object",
"properties": {
"source": {
"type": "string"
},
"timestamp": {
"type": "string",
"format": "date-time"
}
}
}
},
"validation_rules": {
"max_token_length": 2048,
"tokenizer": "meta-llama/Llama-2-7b-hf",
"check_duplicates": true,
"min_label_count": 100,
"max_label_imbalance_ratio": 10.0
}
}
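The schema can also be applied by hand with the jsonschema package (installed under Troubleshooting below). A minimal sketch, assuming the data lives at data/train.jsonl; note the custom validation_rules block is an extension interpreted by the script, and standard draft-07 validators simply ignore it:
python3 - <<'EOF'
import json
from jsonschema import Draft7Validator

schema = json.load(open("templates/validation-schema.json"))
validator = Draft7Validator(schema)

# Report every schema violation per line, rather than stopping at the first.
with open("data/train.jsonl") as f:
    for i, line in enumerate(f):
        for err in validator.iter_errors(json.loads(line)):
            print(f"sample {i}: {err.message}")
EOF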
Customization:
- Modify required fields for your use case
- Add custom properties and validation rules
- Set appropriate length constraints
- Define label enums for classification
- Configure tokenizer and max length
Test Configuration Template (templates/test-config.yaml)
Defines pipeline testing parameters and expected behaviors.
pipeline:
name: test-pipeline
framework: pytorch
platform: modal # modal, lambda, runpod, local
data:
train_path: ./data/sample-train.jsonl
val_path: ./data/sample-val.jsonl
format: jsonl # jsonl, csv, parquet
sample_size: 10 # Number of samples to use for testing
model:
base_model: meta-llama/Llama-2-7b-hf
checkpoint_path: ./checkpoints/test
quantization: null # null, 8bit, 4bit
lora:
enabled: true
r: 16
alpha: 32
dropout: 0.05
target_modules: ["q_proj", "v_proj"]
training:
batch_size: 1
gradient_accumulation: 1
learning_rate: 2e-4
max_steps: 5
warmup_steps: 0
eval_steps: 2
testing:
sample_size: 10
timeout_seconds: 300
expected_loss_range: [0.5, 3.0]
fail_fast: true
cleanup: true
validation:
metrics: ["loss", "perplexity"]
check_gradients: true
check_memory_leak: true
gpu:
required: true
min_vram_gb: 16
allow_cpu_fallback: false
Integration with ML Training Workflow
Pre-Training Validation
# 1. Check system dependencies
bash scripts/check-dependencies.sh \
--framework pytorch \
--gpu-required \
--min-vram 16
# 2. Validate training data
bash scripts/validate-data.sh \
--data-path ./data/train.jsonl \
--schema templates/validation-schema.json \
--check-duplicates \
--check-tokens
# 3. Test pipeline with sample data
bash scripts/test-pipeline.sh \
--config templates/test-config.yaml \
--verbose
Post-Training Validation
# 1. Validate model checkpoint
bash scripts/validate-model.sh \
--model-path ./checkpoints/final \
--framework pytorch \
--check-weights \
--check-inference
# 2. Test inference pipeline
bash scripts/test-pipeline.sh \
--config templates/test-config.yaml \
--steps inference,metrics
CI/CD Integration
# .github/workflows/validate-training.yml
- name: Validate Training Data
run: |
bash plugins/ml-training/skills/validation-scripts/scripts/validate-data.sh \
--data-path ./data/train.jsonl \
--schema ./validation-schema.json
- name: Test Training Pipeline
run: |
bash plugins/ml-training/skills/validation-scripts/scripts/test-pipeline.sh \
--config ./test-config.yaml \
--fail-fast
Error Handling and Debugging
Common Validation Failures
Data Validation Failures:
- Token length exceeded → Truncate or filter long sequences
- Missing required fields → Fix data preprocessing
- Duplicate entries → Deduplicate dataset
- Invalid labels → Clean label data
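The first three remediations have quick shell equivalents. A sketch with awk and jq, assuming the schema's labels and illustrative file names; character length is only a rough proxy for token length, so re-validate after filtering:
# Drop exact duplicates, preserving order.
awk '!seen[$0]++' data/train.jsonl > data/train.dedup.jsonl
# Drop records whose "text" exceeds a rough character budget.
jq -c 'select((.text // "" | length) <= 8000)' data/train.dedup.jsonl \
  > data/train.filtered.jsonl
# Keep only records with a valid label.
jq -c 'select(.label == "positive" or .label == "negative" or .label == "neutral")' \
  data/train.filtered.jsonl > data/train.clean.jsonl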
Model Validation Failures:
- Missing config files → Ensure complete checkpoint
- Weight loading errors → Check framework compatibility
- Tokenizer mismatch → Verify tokenizer version
Pipeline Test Failures:
- Out of memory → Reduce batch size, enable gradient checkpointing
- CUDA errors → Check GPU drivers, CUDA version
- Slow performance → Profile and optimize data loading
Debug Mode
All scripts support verbose debugging:
bash scripts/validate-data.sh --data-path ./data/train.jsonl --verbose --debug
Outputs:
- Detailed progress logs
- Stack traces for errors
- Performance metrics
- Memory usage statistics
Best Practices
- Always validate before training - Catch data issues early
- Use schema validation - Enforce data quality standards
- Test with sample data first - Verify pipeline before full training
- Check dependencies on new platforms - Avoid runtime failures
- Validate checkpoints - Ensure model integrity before deployment
- Monitor validation metrics - Track data quality over time
- Automate validation in CI/CD - Prevent bad data from entering pipeline
- Keep validation schemas updated - Reflect current data requirements
Performance Tips
- Use --sample-size for quick validation of large datasets
- Run dependency checks once per environment, cache results
- Parallelize validation with GNU parallel for multi-file datasets (see the sketch below)
- Use --fail-fast in CI/CD to save time
- Cache tokenizers to avoid re-downloading
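A sketch of the GNU parallel pattern for sharded datasets (the shard path is illustrative); --halt soon,fail=1 stops scheduling new jobs after the first failure, mirroring --fail-fast:
parallel --halt soon,fail=1 \
  'bash scripts/validate-data.sh --data-path {} --output {.}-report.json' \
  ::: data/shards/*.jsonl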
Troubleshooting
Script Not Found:
# Ensure you're in the correct directory
cd /path/to/ml-training/skills/validation-scripts
bash scripts/validate-data.sh --help
Permission Denied:
# Make scripts executable
chmod +x scripts/*.sh
Missing Dependencies:
# Install required tools
pip install jsonschema pandas pyarrow
sudo apt-get install jq bc
Plugin: ml-training
Version: 1.0.0
Supported Frameworks: PyTorch, TensorFlow, JAX
Platforms: Modal, Lambda Labs, RunPod, Local