Methodology

How problems are built and verified. Full contract: data/SCHEMA.md.

sources probmods · dippl · forest · posteriordb
problem statement given / model / query
+ typed answer spec
realizations WebPPL · Pyro · Stan
two gates solver re-derivation
cross-language
verified dataset 160 problems
Statements are language-neutral prose that pin the answer and leave the program open. A ground truth ships after surviving both gates.

Measured tolerance

ground truth k seeded runs
noise floor max pairwise distance
tolerance margin × floor
Each problem’s threshold comes from its own ground-truth noise: deterministic problems demand near-exact answers; stochastic ones get exactly the slack their sampling justifies.

Gate 1 — solver re-derivation

rendered statement prose only — no code shown
independent solvers 2 × LLM
judged vs ground truth within measured tolerance
accept
Solvers agreeing with each other but not the GT marks the GT suspect — investigated to a measured root cause before anything changes.

Gate 2 — cross-language consistency

WebPPL / gold-reference GT k seeded runs
agree within margin × max(floors) idiomatic library use, audited by judgment
Pyro / Stan GT k seeded runs
Stronger than solver consensus, which shares the GT's blind spots — this gate caught a biased, already-accepted WebPPL ground truth. The posteriordb corpus crosschecks its Stan column against gold reference draws.

Provenance

Ground truth matching its textbook source is authoritative; statements are the rewritable layer. Source overrides require documented evidence of internal inconsistency. Retired problems are parked with reasons.