---
name: profiling
description: Profile code performance using callgrind and valgrind with nextest integration for analyzing instruction counts, cache behavior, and identifying bottlenecks
---
Profiling with Valgrind, Callgrind, and Nextest
The facet project has pre-configured valgrind integration for debugging crashes, memory leaks, and performance profiling.
Quick Usage
# Run test under valgrind (memory errors + leaks)
cargo nextest run --profile valgrind -p PACKAGE TEST_FILTER
# Run test under callgrind (profiling)
# (--trace-children=yes so the test binaries spawned by cargo are profiled,
# not just the cargo process itself)
valgrind --tool=callgrind --trace-children=yes --callgrind-out-file=callgrind.out \
  cargo nextest run --no-fail-fast -p PACKAGE TEST_FILTER
# Analyze callgrind output
callgrind_annotate callgrind.out
# or with GUI
kcachegrind callgrind.out # Linux
qcachegrind callgrind.out # macOS
Nextest Valgrind Profile
The project has a pre-configured valgrind profile in .config/nextest.toml:
Configuration
[scripts.wrapper.valgrind]
# Leak checking configuration
command = 'valgrind --leak-check=full --show-leak-kinds=all --errors-for-leak-kinds=definite,indirect --error-exitcode=1'
[profile.valgrind]
# Apply to all tests on Linux
platform = 'cfg(target_os = "linux")'
filter = 'all()'
run-wrapper = 'valgrind'
What it does:
- `--leak-check=full` - Show details for each leak
- `--show-leak-kinds=all` - Show all leak types for diagnostics
- `--errors-for-leak-kinds=definite,indirect` - Only fail on real leaks (not "still reachable")
- `--error-exitcode=1` - Exit with code 1 if errors are found
Usage
# Run specific test
cargo nextest run --profile valgrind -p facet-format-json test_simple_struct
# Run all tests in a file
cargo nextest run --profile valgrind -p facet-format-json --test jit_deserialize
# Run with filter
cargo nextest run --profile valgrind -p facet-json booleans
Benefits:
- ✅ Automatic configuration - no manual valgrind commands
- ✅ Consistent flags across team
- ✅ Integrated with nextest filtering
- ✅ Clean, formatted output
Profiling with Callgrind
Callgrind is a valgrind tool for profiling instruction counts and function call graphs.
Basic Profiling
# Profile a specific test
# (--trace-children=yes so the spawned test binary is profiled, not cargo)
valgrind --tool=callgrind \
  --trace-children=yes \
  --callgrind-out-file=callgrind.out \
  cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME
# Analyze output
callgrind_annotate callgrind.out
Advanced Options
# Collect cache simulation data (slower but more detailed)
valgrind --tool=callgrind \
--cache-sim=yes \
--branch-sim=yes \
--callgrind-out-file=callgrind.out \
cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME
# Focus on specific function
valgrind --tool=callgrind \
--toggle-collect=main \
--callgrind-out-file=callgrind.out \
cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME
# Compress output (can get large)
valgrind --tool=callgrind \
--compress-strings=yes \
--compress-pos=yes \
--callgrind-out-file=callgrind.out.gz \
cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME
Analyzing Callgrind Output
Command Line (callgrind_annotate)
# Full report
callgrind_annotate callgrind.out
# Focus on specific functions
callgrind_annotate --include='facet::' callgrind.out
# Show only top functions
callgrind_annotate --auto=yes --threshold=1 callgrind.out
# Compare two runs
callgrind_annotate --diff callgrind.old.out callgrind.new.out
Reading the output:
Ir # Instruction reads (total)
I1mr # L1 instruction cache misses
ILmr # Last-level instruction cache misses
Dr # Data reads
Dw # Data writes
D1mr, D1mw # L1 data cache read/write misses
DLmr, DLmw # Last-level data cache read/write misses
--------------------------------------------------------------------------------
Ir file:function
--------------------------------------------------------------------------------
1,234,567 (45%) facet_format_json::deserialize
987,654 (35%) facet_format::parse_value
...
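When you need these numbers in a script, the `Ir (percent) function` summary lines shown above can be pulled out with a few lines of Python. This is a sketch, not part of the project's tooling; the exact column layout of `callgrind_annotate` can vary between valgrind versions:

```python
import re

def top_functions(annotate_text, n=5):
    """Extract (Ir, percent, function) rows from callgrind_annotate's summary."""
    rows = []
    for line in annotate_text.splitlines():
        # e.g. "1,234,567 (45.0%)  facet_format_json::deserialize"
        m = re.match(r"\s*([\d,]+)\s+\(\s*([\d.]+)%\)\s+(\S.*)", line)
        if m:
            ir = int(m.group(1).replace(",", ""))
            rows.append((ir, float(m.group(2)), m.group(3)))
    rows.sort(reverse=True)
    return rows[:n]

sample = """\
1,234,567 (45.0%)  facet_format_json::deserialize
  987,654 (35.0%)  facet_format::parse_value
"""
for ir, pct, func in top_functions(sample):
    print(f"{ir:>12,}  {pct:5.1f}%  {func}")
```

This is handy for CI checks, e.g. failing a job when one function's share of `Ir` exceeds a budget.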
GUI (KCachegrind/QCachegrind)
Install:
# Linux
sudo apt install kcachegrind
# macOS
brew install qcachegrind
# Windows (WSL)
sudo apt install kcachegrind
Launch:
kcachegrind callgrind.out # Linux
qcachegrind callgrind.out # macOS
GUI features:
- Call graph visualization
- Flamegraph-like views
- Source code annotation (if debug symbols available)
- Caller/callee relationships
- Multiple metrics (instructions, cache misses, branches)
Profiling Benchmarks
The generated benchmark tests (from benchmarks.kdl) can be profiled:
1. As Tests (Recommended for Callgrind)
# Profile a benchmark test under callgrind
valgrind --tool=callgrind \
--callgrind-out-file=callgrind_simple_struct.out \
cargo nextest run --profile valgrind -p facet-json test_simple_struct
# Analyze
callgrind_annotate callgrind_simple_struct.out
Why use tests:
- Single iteration = cleaner callgrind output
- No benchmark harness overhead
- Easier to focus on hot path
- Faster to run
2. As Benchmarks (For Realistic Instruction Counts)
The benchmark harness (gungraun) already uses valgrind internally:
# Run gungraun benchmark (uses callgrind automatically)
cargo bench --bench unified_benchmarks_gungraun --features jit simple_struct
# Check output in bench-reports/gungraun-*.txt
gungraun automatically collects:
- Instructions executed
- Estimated cycles
- L1/LL cache hits
- RAM hits
- Total read/write operations
This data appears in bench-reports/perf/RESULTS.md.
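The "estimated cycles" figure is derived from the hit counts. gungraun's exact weights live in its source; the convention used by iai-callgrind-style harnesses (assumed here, following the pythonspeed cachegrind methodology) weights an L1 hit at 1 cycle, a last-level hit at 5, and a RAM hit at 35:

```python
def estimated_cycles(l1_hits, ll_hits, ram_hits):
    """Cycle estimate: L1 hit ~ 1 cycle, LL hit ~ 5, RAM hit ~ 35.
    (Assumed weighting; check gungraun's source for its exact formula.)"""
    return l1_hits + 5 * ll_hits + 35 * ram_hits

# A run with 1,000,000 L1 hits, 2,000 LL hits, 500 RAM hits:
print(estimated_cycles(1_000_000, 2_000, 500))  # 1027500
```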
Common Profiling Workflows
Debug a Crash
# 1. Run under valgrind to find memory error
cargo nextest run --profile valgrind -p PACKAGE TEST_NAME
# 2. Read valgrind output for exact error location
# Example: "Invalid read of size 8 at 0x123456"
# 3. Fix the bug
# 4. Verify fix
cargo nextest run -p PACKAGE TEST_NAME
Find Performance Bottleneck
# 1. Profile with callgrind
valgrind --tool=callgrind \
--callgrind-out-file=profile.out \
cargo nextest run --no-fail-fast -p facet-json test_booleans
# 2. Analyze
callgrind_annotate --auto=yes profile.out | head -30
# 3. Identify hot functions (high instruction counts)
# 4. Optimize hot functions
# 5. Re-profile and compare
valgrind --tool=callgrind \
--callgrind-out-file=profile_after.out \
cargo nextest run --no-fail-fast -p facet-json test_booleans
callgrind_annotate --diff profile.out profile_after.out
Optimize Tier-2 JIT
# 1. Check RESULTS.md for slow benchmarks
grep "⚠" bench-reports/perf/RESULTS.md
# 2. Profile the slow benchmark test
valgrind --tool=callgrind \
--callgrind-out-file=jit_profile.out \
cargo nextest run --profile valgrind -p facet-json test_long_strings --features jit
# 3. Analyze with GUI for visual call graph
kcachegrind jit_profile.out
# 4. Look for:
# - Helper function calls in tight loops
# - Redundant alignment checks
# - Allocation hot spots
# 5. Optimize based on findings
# 6. Verify with benchmarks
cargo xtask bench long_strings
Compare Before/After Optimization
# Before
git checkout main
valgrind --tool=callgrind --callgrind-out-file=before.out \
cargo nextest run --no-fail-fast -p facet-json test_target
# After
git checkout my-optimization-branch
valgrind --tool=callgrind --callgrind-out-file=after.out \
cargo nextest run --no-fail-fast -p facet-json test_target
# Compare
callgrind_annotate --diff before.out after.out
Interpreting Valgrind Output
Memory Error Example
==12345== Invalid read of size 8
==12345== at 0x123456: facet_format_json::parse_number (parse.rs:42)
==12345== by 0x234567: facet_format_json::deserialize (lib.rs:123)
==12345== Address 0x789abc is 0 bytes after a block of size 16 alloc'd
==12345== at 0x345678: alloc (alloc.rs:88)
==12345== by 0x456789: Vec::push (vec.rs:1234)
Translation:
- Reading 8 bytes from invalid address
- Happened in `parse_number` at line 42
- Address is just past the end of a 16-byte allocation
- Fix: Check bounds before reading, or fix off-by-one error
Leak Example
==12345== 128 bytes in 1 blocks are definitely lost in loss record 1 of 10
==12345== at 0x123456: malloc (vg_replace_malloc.c:299)
==12345== by 0x234567: alloc (alloc.rs:88)
==12345== by 0x345678: Box::new (boxed.rs:123)
==12345== by 0x456789: setup_jit (jit.rs:456)
Translation:
- 128 bytes allocated but never freed
- Allocated in the `setup_jit` function
- Fix: Ensure cleanup/`Drop` implementation
Cachegrind Output Example
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
1,234,567 123 45 456,789 234 12 123,456 67 8 facet::deserialize
987,654 98 32 345,678 189 9 98,765 43 5 - facet::parse_value
234,567 23 10 98,765 45 2 23,456 12 1 - facet::parse_string
Key metrics:
- `Ir` - Instructions executed (most important for optimization)
- `D1mr`/`D1mw` - L1 data cache misses (indicates poor locality)
- `DLmr`/`DLmw` - Last-level cache misses (very expensive)
Optimization targets:
- High `Ir` count = time-consuming function
- High `D1mr` = poor data locality, consider restructuring
- High `DLmr` = main memory accesses, critical to optimize
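Raw counts are easier to judge as rates. A hypothetical helper (not part of the project's tooling), fed with the counters from the `facet::deserialize` row above:

```python
def miss_rates(ir, i1mr, dr, d1mr, dlmr, dw, d1mw, dlmw):
    """Convert cachegrind-style counters into cache miss rates."""
    return {
        "I1 miss rate": i1mr / ir,
        "D1 read miss rate": d1mr / dr,
        "D1 write miss rate": d1mw / dw,
        "LL data miss rate": (dlmr + dlmw) / (dr + dw),
    }

# Counters from the facet::deserialize row above
rates = miss_rates(
    ir=1_234_567, i1mr=123,
    dr=456_789, d1mr=234, dlmr=12,
    dw=123_456, d1mw=67, dlmw=8,
)
for name, rate in rates.items():
    print(f"{name}: {rate:.4%}")
```

A D1 miss rate well above ~1% is usually worth a look; anything measurable at the LL level is expensive.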
Profiling Flags
Valgrind (Memory Debugging)
--leak-check=full # Detailed leak info
--show-leak-kinds=all # Show all leak types
--track-origins=yes # Track uninitialized values (slower)
--verbose # More diagnostic info
--log-file=valgrind.log # Save output to file
Callgrind (Profiling)
--callgrind-out-file=FILE # Output file (default: callgrind.out.<pid>)
--cache-sim=yes # Simulate cache behavior
--branch-sim=yes # Simulate branch prediction
--collect-jumps=yes # Collect jump information
--dump-instr=yes # Dump instruction info
--compress-strings=yes # Compress output (smaller files)
--trace-children=yes # Also profile child processes (e.g. when valgrind wraps cargo)
Cargo Nextest
--no-fail-fast # Continue running after first failure
--profile valgrind # Use valgrind profile from nextest.toml
--test-threads=1 # Run single-threaded (better for profiling)
Tips and Tricks
Speed Up Profiling
- Profile in release mode (but keep debug symbols):

  # Add to Cargo.toml
  [profile.release]
  debug = true

- Use `--no-fail-fast` to avoid stopping early
- Filter to specific tests - don't profile everything at once
- Disable address randomization for reproducible runs:

  setarch $(uname -m) -R valgrind --tool=callgrind ...
Read Callgrind Data Programmatically
# Example: Parse callgrind_annotate output for automation
import re

def parse_callgrind(filename):
    costs = {}
    with open(filename) as f:
        for line in f:
            # Costs are comma-formatted, e.g. "1,234,567  facet::deserialize"
            if m := re.match(r"\s*([\d,]+)\s+(.+)", line):
                cost, func = m.groups()
                costs[func] = int(cost.replace(",", ""))
    return costs

# Compare two profiles
before = parse_callgrind('before.out')
after = parse_callgrind('after.out')
for func in before:
    if func in after and before[func]:
        delta = after[func] - before[func]
        percent = (delta / before[func]) * 100
        if abs(percent) > 5:  # More than 5% change
            print(f"{func}: {percent:+.1f}% ({delta:+,} instructions)")
Don't Do This
❌ Run valgrind without the nextest profile - inconsistent flags
❌ Profile debug builds - too slow and unrepresentative
❌ Ignore "still reachable" leaks in FFI code - sometimes OK
❌ Profile with multiple test threads - non-deterministic results
❌ Forget to clean between profiling runs - stale data
Do This Instead
✅ Use --profile valgrind for memory debugging
✅ Use callgrind for performance profiling
✅ Profile release builds with debug symbols
✅ Focus on hot paths (high Ir counts)
✅ Compare before/after with --diff
✅ Use GUI tools (kcachegrind) for complex call graphs
Files and Locations
.config/nextest.toml # Valgrind profile configuration
callgrind.out.* # Callgrind output files (gitignored)
bench-reports/gungraun-*.txt # Gungraun output (includes instruction counts)
Troubleshooting
Valgrind complains about "unrecognized instruction"
- Update valgrind: `sudo apt update && sudo apt install valgrind`
- Or use `--vex-iropt-register-updates=allregs-at-mem-access`
Callgrind output is huge
- Use `--compress-strings=yes --compress-pos=yes`
- Or filter to specific functions with `--toggle-collect=function_name`
Profile doesn't match benchmark results
- Ensure you're profiling the same code path
- Check if JIT compilation is cached (use setup functions in gungraun)
- Profile release build, not debug
Can't open callgrind file in GUI
- Check file permissions
- Ensure file isn't corrupted (run `callgrind_annotate` first)
- Try a different viewer (kcachegrind vs qcachegrind)
See Also
- Valgrind manual: https://valgrind.org/docs/manual/manual.html
- Callgrind manual: https://valgrind.org/docs/manual/cl-manual.html
- Nextest wrapper scripts: https://nexte.st/docs/configuration/wrapper-scripts/
- KCachegrind handbook: https://docs.kde.org/stable5/en/kcachegrind/
- Project nextest config: `.config/nextest.toml`
- Benchmark debugging: see `benchmarking.md`