Risk-controlled formal judging for math reasoning

Trust Lean only when coverage is enough.

COVCAL turns Lean traces into a selective prediction system: accept formal answers when the proof signal is covered and calibrated; abstain when unresolved rival mass makes the judgment unsafe.

Read the paper ↗ Try the coverage demo

Pauline Bourigault · Xiaotong Ji · Matthieu Zimmer · Rasul Tutunov · Haitham Bou-Ammar

lean-trace.log

1answer_classes = {1/2, 3/4, 0.5}
2proved class: 1/2
3unresolved rival: 3/4
4Ctyp = 0.82 · Cprf = 0.76
5margin = 0.51
6decision: accept formal answer

finiteselective-risk certificate

partialobservations, not labels

safeabstention under low coverage

The problem

Proof failure is not wrongness.

A Lean proof is strong positive evidence for the statement it checked. But missing proof can mean bad formalization, missing library context, timeout, or semantic drift. COVCAL treats these cases as unresolved instead of negative labels.

✓

Positive formal evidence

When a candidate class is kernel-proved, the proof gives strong evidence for that generated Lean statement.

Unresolved rivals

Unproved answer classes may still be correct, unformalized, or merely out of prover budget. They cannot be treated as false.

⌁

Coverage cliff

The formal selector becomes reliable only when enough answer mass is typed, proved, and separated from unresolved rivals.

Method

A selective wrapper around Lean.

The pipeline groups raw candidates into normalized answer classes, records Lean statuses, computes coverage diagnostics, then accepts only when a calibrated rule certifies the risk of formal predictions.

Candidates

Normalize answers

Cluster equivalent strings like 1/2, 0.5, and 2/4 into answer classes with self-consistency weights.

Lean trace

Observe statuses

Each class is marked proved, typechecked, timeout, illtyped, or unformalized from the formal pipeline.

Diagnostics

Measure coverage

Compute typed coverage, proved coverage, unresolved rival mass, and formal margin.

Certificate

Accept or abstain

Bonferroni or dev-then-cal calibration certifies accepted formal answers, not fallback predictions.

Results

The signal is real — but coverage-governed.

On MATH-500, the formal signal is sharply coverage-dependent. Better autoformalization changes certificate feasibility more than the certificate machinery itself.

96%

proof-winning answer accuracy at high proved coverage, compared with only 20% at low proved coverage.

28%problems with a proved class using the 7B autoformalizer.

≈43%proved statements judged faithful in manual audit.

17/20Bonferroni partitions certified with Goedel-Prover-V2-8B.

0.98accepted accuracy when the prover-specialized formalizer reaches high coverage.

Low coverage

20%

High coverage

96%

High margin

98%

Interactive intuition

Move the diagnostics. Watch the judge decide.

This toy panel mirrors the COVCAL decision logic: high typed coverage, high proved coverage, and a positive formal margin make formal selection trustworthy; otherwise the system abstains.

Typed coverage: 0.78 Proved coverage: 0.74 Formal margin: 0.42

Coverage sufficient

Accept formal answer.

The proved class dominates the unresolved rival mass, so the formal trace is inside the trusted region.

Read the full paper.

Full pdf here.

Paper ↗