---
name: profiling
description: Profile code performance using callgrind and valgrind with nextest integration for analyzing instruction counts, cache behavior, and identifying bottlenecks
---
Profiling with Valgrind, Callgrind, and Nextest
The facet project has pre-configured valgrind integration for debugging crashes, memory leaks, and performance profiling.
Quick Usage
# Run test under valgrind (memory errors + leaks)
cargo nextest run --profile valgrind -p PACKAGE TEST_FILTER
# Run test under callgrind (profiling)
# (--trace-children=yes so the test binaries spawned by cargo are profiled,
# not just the cargo process itself)
valgrind --tool=callgrind --trace-children=yes --callgrind-out-file=callgrind.out \
  cargo nextest run --no-fail-fast -p PACKAGE TEST_FILTER
# Analyze callgrind output
callgrind_annotate callgrind.out
# or with GUI
kcachegrind callgrind.out # Linux
qcachegrind callgrind.out # macOS
Nextest Valgrind Profile
The project has a pre-configured valgrind profile in .config/nextest.toml:
Configuration
[scripts.wrapper.valgrind]
# Leak checking configuration
command = 'valgrind --leak-check=full --show-leak-kinds=all --errors-for-leak-kinds=definite,indirect --error-exitcode=1'
[profile.valgrind]
# Apply to all tests on Linux
platform = 'cfg(target_os = "linux")'
filter = 'all()'
run-wrapper = 'valgrind'
What it does:
- `--leak-check=full` - Show details for each leak
- `--show-leak-kinds=all` - Show all leak types for diagnostics
- `--errors-for-leak-kinds=definite,indirect` - Only fail on real leaks (not "still reachable")
- `--error-exitcode=1` - Exit with code 1 if errors are found
Usage
# Run specific test
cargo nextest run --profile valgrind -p facet-format-json test_simple_struct
# Run all tests in a file
cargo nextest run --profile valgrind -p facet-format-json --test jit_deserialize
# Run with filter
cargo nextest run --profile valgrind -p facet-json booleans
Benefits:
- ✅ Automatic configuration - no manual valgrind commands
- ✅ Consistent flags across team
- ✅ Integrated with nextest filtering
- ✅ Clean, formatted output
Profiling with Callgrind
Callgrind is a valgrind tool for profiling instruction counts and function call graphs.
Basic Profiling
# Profile a specific test
# (--trace-children=yes so the spawned test binary is profiled, not cargo)
valgrind --tool=callgrind \
  --trace-children=yes \
  --callgrind-out-file=callgrind.out \
  cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME
# Analyze output
callgrind_annotate callgrind.out
Advanced Options
# Collect cache simulation data (slower but more detailed)
valgrind --tool=callgrind \
--cache-sim=yes \
--branch-sim=yes \
--callgrind-out-file=callgrind.out \
cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME
# Focus on specific function
valgrind --tool=callgrind \
--toggle-collect=main \
--callgrind-out-file=callgrind.out \
cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME
# Compress output (can get large)
valgrind --tool=callgrind \
--compress-strings=yes \
--compress-pos=yes \
--callgrind-out-file=callgrind.out.gz \
cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME
Analyzing Callgrind Output
Command Line (callgrind_annotate)
# Full report
callgrind_annotate callgrind.out
# Focus on specific functions
callgrind_annotate --include='facet::' callgrind.out
# Show only top functions
callgrind_annotate --auto=yes --threshold=1 callgrind.out
# Compare two runs
callgrind_annotate --diff callgrind.old.out callgrind.new.out
Reading the output:
Ir # Instruction reads (total)
I1mr # L1 instruction cache misses
ILmr # Last-level instruction cache misses
Dr # Data reads
Dw # Data writes
D1mr, D1mw # L1 data cache read/write misses
DLmr, DLmw # Last-level data cache read/write misses
--------------------------------------------------------------------------------
Ir file:function
--------------------------------------------------------------------------------
1,234,567 (45%) facet_format_json::deserialize
987,654 (35%) facet_format::parse_value
...
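When you need these numbers in a script, the `Ir (percent) function` summary lines shown above can be pulled out with a few lines of Python. This is a sketch, not part of the project's tooling; the exact column layout of `callgrind_annotate` can vary between valgrind versions:

```python
import re

def top_functions(annotate_text, n=5):
    """Extract (Ir, percent, function) rows from callgrind_annotate's summary."""
    rows = []
    for line in annotate_text.splitlines():
        # e.g. "1,234,567 (45.0%)  facet_format_json::deserialize"
        m = re.match(r"\s*([\d,]+)\s+\(\s*([\d.]+)%\)\s+(\S.*)", line)
        if m:
            ir = int(m.group(1).replace(",", ""))
            rows.append((ir, float(m.group(2)), m.group(3)))
    rows.sort(reverse=True)
    return rows[:n]

sample = """\
1,234,567 (45.0%)  facet_format_json::deserialize
  987,654 (35.0%)  facet_format::parse_value
"""
for ir, pct, func in top_functions(sample):
    print(f"{ir:>12,}  {pct:5.1f}%  {func}")
```

This is handy for CI checks, e.g. failing a job when one function's share of `Ir` exceeds a budget.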
GUI (KCachegrind/QCachegrind)
Install:
# Linux
sudo apt install kcachegrind
# macOS
brew install qcachegrind
# Windows (WSL)
sudo apt install kcachegrind
Launch:
kcachegrind callgrind.out # Linux
qcachegrind callgrind.out # macOS
GUI features:
- Call graph visualization
- Flamegraph-like views
- Source code annotation (if debug symbols available)
- Caller/callee relationships
- Multiple metrics (instructions, cache misses, branches)
Profiling Benchmarks
The generated benchmark tests (from benchmarks.kdl) can be profiled:
1. As Tests (Recommended for Callgrind)
# Profile a benchmark test under callgrind
valgrind --tool=callgrind \
--callgrind-out-file=callgrind_simple_struct.out \
cargo nextest run --profile valgrind -p facet-json test_simple_struct
# Analyze
callgrind_annotate callgrind_simple_struct.out
Why use tests:
- Single iteration = cleaner callgrind output
- No benchmark harness overhead
- Easier to focus on hot path
- Faster to run
2. As Benchmarks (For Realistic Instruction Counts)
The benchmark harness (gungraun) already uses valgrind internally:
# Run gungraun benchmark (uses callgrind automatically)
cargo bench --bench unified_benchmarks_gungraun --features jit simple_struct
# Check output in bench-reports/gungraun-*.txt
gungraun automatically collects:
- Instructions executed
- Estimated cycles
- L1/LL cache hits
- RAM hits
- Total read/write operations
This data appears in bench-reports/perf/RESULTS.md.
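The "estimated cycles" figure is derived from the hit counts. gungraun's exact weights live in its source; the convention used by iai-callgrind-style harnesses (assumed here, following the pythonspeed cachegrind methodology) weights an L1 hit at 1 cycle, a last-level hit at 5, and a RAM hit at 35:

```python
def estimated_cycles(l1_hits, ll_hits, ram_hits):
    """Cycle estimate: L1 hit ~ 1 cycle, LL hit ~ 5, RAM hit ~ 35.
    (Assumed weighting; check gungraun's source for its exact formula.)"""
    return l1_hits + 5 * ll_hits + 35 * ram_hits

# A run with 1,000,000 L1 hits, 2,000 LL hits, 500 RAM hits:
print(estimated_cycles(1_000_000, 2_000, 500))  # 1027500
```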
Common Profiling Workflows
Debug a Crash
# 1. Run under valgrind to find memory error
cargo nextest run --profile valgrind -p PACKAGE TEST_NAME
# 2. Read valgrind output for exact error location
# Example: "Invalid read of size 8 at 0x123456"
# 3. Fix the bug
# 4. Verify fix
cargo nextest run -p PACKAGE TEST_NAME
Find Performance Bottleneck
# 1. Profile with callgrind
valgrind --tool=callgrind \
--callgrind-out-file=profile.out \
cargo nextest run --no-fail-fast -p facet-json test_booleans
# 2. Analyze
callgrind_annotate --auto=yes profile.out | head -30
# 3. Identify hot functions (high instruction counts)
# 4. Optimize hot functions
# 5. Re-profile and compare
valgrind --tool=callgrind \
--callgrind-out-file=profile_after.out \
cargo nextest run --no-fail-fast -p facet-json test_booleans
callgrind_annotate --diff profile.out profile_after.out
Optimize Tier-2 JIT
# 1. Check RESULTS.md for slow benchmarks
grep "⚠" bench-reports/perf/RESULTS.md
# 2. Profile the slow benchmark test
valgrind --tool=callgrind \
--callgrind-out-file=jit_profile.out \
cargo nextest run --profile valgrind -p facet-json test_long_strings --features jit
# 3. Analyze with GUI for visual call graph
kcachegrind jit_profile.out
# 4. Look for:
# - Helper function calls in tight loops
# - Redundant alignment checks
# - Allocation hot spots
# 5. Optimize based on findings
# 6. Verify with benchmarks
cargo xtask bench long_strings
Compare Before/After Optimization
# Before
git checkout main
valgrind --tool=callgrind --callgrind-out-file=before.out \
cargo nextest run --no-fail-fast -p facet-json test_target
# After
git checkout my-optimization-branch
valgrind --tool=callgrind --callgrind-out-file=after.out \
cargo nextest run --no-fail-fast -p facet-json test_target
# Compare
callgrind_annotate --diff before.out after.out
Interpreting Valgrind Output
Memory Error Example
==12345== Invalid read of size 8
==12345== at 0x123456: facet_format_json::parse_number (parse.rs:42)
==12345== by 0x234567: facet_format_json::deserialize (lib.rs:123)
==12345== Address 0x789abc is 0 bytes after a block of size 16 alloc'd
==12345== at 0x345678: alloc (alloc.rs:88)
==12345== by 0x456789: Vec::push (vec.rs:1234)
Translation:
- Reading 8 bytes from invalid address
- Happened in `parse_number` at line 42
- Address is just past the end of a 16-byte allocation
- Fix: Check bounds before reading, or fix off-by-one error
Leak Example
==12345== 128 bytes in 1 blocks are definitely lost in loss record 1 of 10
==12345== at 0x123456: malloc (vg_replace_malloc.c:299)
==12345== by 0x234567: alloc (alloc.rs:88)
==12345== by 0x345678: Box::new (boxed.rs:123)
==12345== by 0x456789: setup_jit (jit.rs:456)
Translation:
- 128 bytes allocated but never freed
- Allocated in the `setup_jit` function
- Fix: Ensure cleanup/`Drop` implementation
Cachegrind Output Example
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
1,234,567 123 45 456,789 234 12 123,456 67 8 facet::deserialize
987,654 98 32 345,678 189 9 98,765 43 5 - facet::parse_value
234,567 23 10 98,765 45 2 23,456 12 1 - facet::parse_string
Key metrics:
- `Ir` - Instructions executed (most important for optimization)
- `D1mr`/`D1mw` - L1 data cache misses (indicates poor locality)
- `DLmr`/`DLmw` - Last-level cache misses (very expensive)
Optimization targets:
- High `Ir` count = time-consuming function
- High `D1mr` = poor data locality, consider restructuring
- High `DLmr` = main memory accesses, critical to optimize
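Raw counts are easier to judge as rates. A hypothetical helper (not part of the project's tooling), fed with the counters from the `facet::deserialize` row above:

```python
def miss_rates(ir, i1mr, dr, d1mr, dlmr, dw, d1mw, dlmw):
    """Convert cachegrind-style counters into cache miss rates."""
    return {
        "I1 miss rate": i1mr / ir,
        "D1 read miss rate": d1mr / dr,
        "D1 write miss rate": d1mw / dw,
        "LL data miss rate": (dlmr + dlmw) / (dr + dw),
    }

# Counters from the facet::deserialize row above
rates = miss_rates(
    ir=1_234_567, i1mr=123,
    dr=456_789, d1mr=234, dlmr=12,
    dw=123_456, d1mw=67, dlmw=8,
)
for name, rate in rates.items():
    print(f"{name}: {rate:.4%}")
```

A D1 miss rate well above ~1% is usually worth a look; anything measurable at the LL level is expensive.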
Profiling Flags
Valgrind (Memory Debugging)
--leak-check=full # Detailed leak info
--show-leak-kinds=all # Show all leak types
--track-origins=yes # Track uninitialized values (slower)
--verbose # More diagnostic info
--log-file=valgrind.log # Save output to file
Callgrind (Profiling)
--callgrind-out-file=FILE # Output file (default: callgrind.out.<pid>)
--cache-sim=yes # Simulate cache behavior
--branch-sim=yes # Simulate branch prediction
--collect-jumps=yes # Collect jump information
--dump-instr=yes # Dump instruction info
--compress-strings=yes # Compress output (smaller files)
--trace-children=yes # Also profile child processes (e.g. when valgrind wraps cargo)
Cargo Nextest
--no-fail-fast # Continue running after first failure
--profile valgrind # Use valgrind profile from nextest.toml
--test-threads=1 # Run single-threaded (better for profiling)
Tips and Tricks
Speed Up Profiling
- Profile in release mode (but keep debug symbols):

  # Add to Cargo.toml
  [profile.release]
  debug = true

- Use `--no-fail-fast` to avoid stopping early
- Filter to specific tests - don't profile everything at once
- Disable address randomization for reproducible runs:

  setarch $(uname -m) -R valgrind --tool=callgrind ...
Read Callgrind Data Programmatically
# Example: Parse callgrind_annotate output for automation
import re

def parse_callgrind(filename):
    costs = {}
    with open(filename) as f:
        for line in f:
            # Costs are comma-formatted, e.g. "1,234,567  facet::deserialize"
            if m := re.match(r"\s*([\d,]+)\s+(.+)", line):
                cost, func = m.groups()
                costs[func] = int(cost.replace(",", ""))
    return costs

# Compare two profiles
before = parse_callgrind('before.out')
after = parse_callgrind('after.out')
for func in before:
    if func in after and before[func]:
        delta = after[func] - before[func]
        percent = (delta / before[func]) * 100
        if abs(percent) > 5:  # More than 5% change
            print(f"{func}: {percent:+.1f}% ({delta:+,} instructions)")
Don't Do This
❌ Run valgrind without the nextest profile - inconsistent flags
❌ Profile debug builds - too slow and unrepresentative
❌ Ignore "still reachable" leaks in FFI code - sometimes OK
❌ Profile with multiple test threads - non-deterministic results
❌ Forget to clean between profiling runs - stale data
Do This Instead
✅ Use --profile valgrind for memory debugging
✅ Use callgrind for performance profiling
✅ Profile release builds with debug symbols
✅ Focus on hot paths (high Ir counts)
✅ Compare before/after with --diff
✅ Use GUI tools (kcachegrind) for complex call graphs
Files and Locations
.config/nextest.toml # Valgrind profile configuration
callgrind.out.* # Callgrind output files (gitignored)
bench-reports/gungraun-*.txt # Gungraun output (includes instruction counts)
Troubleshooting
Valgrind complains about "unrecognized instruction"
- Update valgrind: `sudo apt update && sudo apt install valgrind`
- Or use `--vex-iropt-register-updates=allregs-at-mem-access`
Callgrind output is huge
- Use `--compress-strings=yes --compress-pos=yes`
- Or filter to specific functions with `--toggle-collect=function_name`
Profile doesn't match benchmark results
- Ensure you're profiling the same code path
- Check if JIT compilation is cached (use setup functions in gungraun)
- Profile release build, not debug
Can't open callgrind file in GUI
- Check file permissions
- Ensure file isn't corrupted (run `callgrind_annotate` first)
- Try a different viewer (kcachegrind vs qcachegrind)
See Also
- Valgrind manual: https://valgrind.org/docs/manual/manual.html
- Callgrind manual: https://valgrind.org/docs/manual/cl-manual.html
- Nextest wrapper scripts: https://nexte.st/docs/configuration/wrapper-scripts/
- KCachegrind handbook: https://docs.kde.org/stable5/en/kcachegrind/
- Project nextest config: `.config/nextest.toml`
- Benchmark debugging: see `benchmarking.md`