Situated Evals
Situated Evals builds uncontaminated, closed-evaluation benchmarks that measure how faithfully AI represents specific, hard-to-reach populations — scored on data that has never entered a pretraining corpus.
01 · About
AI systems are increasingly used to stand in for people: to pre-test survey instruments, impute missing responses, and simulate how specific communities might answer. But most of the evidence on whether these simulacra are faithful comes from public benchmarks that models may already have seen during training.
Situated Evals runs on real-world microdata that has never been publicly released, scored under strictly proper scoring rules, inside a closed-evaluation architecture where submissions travel to the data, not the other way around. Nothing is exfiltrated; nothing leaks into the next model's training run.
Our focus is deliberately on the populations where AI representations are weakest and matter most: communities outside WEIRD samples, in development and humanitarian contexts, where collapsing heterogeneity and miscalibrated confidence carry real cost.
Uncontaminated
evaluation data held out of every public pretraining corpus
Closed
evaluation architecture — code travels to the data, records never leave
Proper
scoring rules that reward calibrated, truthful probabilities
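To make "proper" concrete, here is a minimal sketch of two standard strictly proper scoring rules, the multiclass Brier score and the log score. The function names and the example distribution are illustrative assumptions; they are not the rules or data Situated Evals necessarily uses. The sketch shows that, in expectation, reporting the true response distribution scores better than an overconfident or a flattened one.

```python
import math

def brier_score(probs, outcome):
    """Multiclass Brier score for one observed outcome index (lower is better)."""
    return sum((p - (1.0 if k == outcome else 0.0)) ** 2
               for k, p in enumerate(probs))

def log_score(probs, outcome):
    """Negative log-likelihood of the observed outcome (lower is better)."""
    return -math.log(probs[outcome])

def expected_score(rule, report, truth):
    """Expected score of a reported distribution when outcomes follow `truth`."""
    return sum(t * rule(report, k) for k, t in enumerate(truth) if t > 0)

# Hypothetical true distribution over three answer options in some population.
truth = [0.6, 0.3, 0.1]
honest = truth                       # reports the true distribution
overconfident = [0.9, 0.05, 0.05]    # miscalibrated confidence in the modal answer
flattened = [1 / 3, 1 / 3, 1 / 3]    # ignores the population's actual structure

for name, rule in [("brier", brier_score), ("log", log_score)]:
    for label, report in [("honest", honest),
                          ("overconfident", overconfident),
                          ("flattened", flattened)]:
        print(f"{name:5s} {label:13s} {expected_score(rule, report, truth):.4f}")
```

Under both rules the honest report has the lowest expected penalty, which is the property that makes a scoring rule strictly proper: no other report can beat the truth in expectation.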
02 · Competitions
Register your interest to hear when new evaluations and the SimulacraBench leaderboard go live. If you steward population data and want it evaluated without releasing it, we'd like to talk.