Methodology

How problems are built and verified. Full contract: data/SCHEMA.md.

sources probmods · dippl · forest · posteriordb

problem statement given / model / query
+ typed answer spec

realizations WebPPL · Pyro · Stan

two gates solver re-derivation
cross-language

verified dataset 160 problems

Statements are language-neutral prose that pin the answer and leave the program open. A ground truth ships after surviving both gates.

Measured tolerance

ground truth k seeded runs

noise floor max pairwise distance

tolerance margin × floor

Each problem’s threshold comes from its own ground-truth noise: deterministic problems demand near-exact answers; stochastic ones get exactly the slack their sampling justifies.

Gate 1 — solver re-derivation

rendered statement prose only — no code shown

independent solvers 2 × LLM

judged vs ground truth within measured tolerance

Solvers agreeing with each other but not the GT marks the GT suspect — investigated to a measured root cause before anything changes.

Gate 2 — cross-language consistency

WebPPL / gold-reference GT k seeded runs

agree within margin × max(floors) idiomatic library use, audited by judgment

Pyro / Stan GT k seeded runs

Stronger than solver consensus, which shares the GT's blind spots — this gate caught a biased, already-accepted WebPPL ground truth. The posteriordb corpus crosschecks its Stan column against gold reference draws.

Provenance

Ground truth matching its textbook source is authoritative; statements are the rewritable layer. Source overrides require documented evidence of internal inconsistency. Retired problems are parked with reasons.