Situated Evals

Evaluating AI where the data is hardest to reach.

Situated Evals builds uncontaminated, closed-evaluation benchmarks that measure how faithfully AI represents specific, hard-to-reach populations — scored on data that has never entered a pretraining corpus.

Flagship: SimulacraBench · NeurIPS 2026
Method: Uncontaminated · closed evaluation
Partners: Stanford · UNICEF · UNHCR


01 · About

Benchmarks lie when the test set is in the training set. We build evaluations that can't be memorized.

AI systems are increasingly used to stand in for people — to pre-test survey instruments, impute missing responses, and simulate how specific communities might answer. But the evidence on whether these simulacra are faithful comes from public benchmarks that models have already seen.

A different design.

Situated Evals runs on real-world microdata that has never been publicly released, scored under strictly proper scoring rules, inside a closed-evaluation architecture where submissions travel to the data, not the other way around. Nothing is exfiltrated; nothing leaks into the next model's training run.
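As a rough illustration of "submissions travel to the data", here is a minimal sketch of a closed-evaluation loop. It is not the actual Situated Evals harness: the function name `run_closed_evaluation`, the record layout, and the `answer` field are all hypothetical, and a real deployment would also need sandboxing and network isolation, which this sketch does not attempt. The point it shows is only that the submitted code sees features, the organizer keeps the answers, and nothing but an aggregate score comes back.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical type: a submission is any callable that maps one respondent's
# features to a probability for each answer option.
Predictor = Callable[[Dict[str, object]], Sequence[float]]


def run_closed_evaluation(
    predict: Predictor,
    private_records: List[Dict[str, object]],
    answer_key: str = "answer",  # assumed field name, for illustration only
) -> Dict[str, float]:
    """Score a submission on held-out records and return only aggregates."""
    if not private_records:
        raise ValueError("no records to evaluate")

    total_log_loss = 0.0
    for record in private_records:
        # The submission never sees the observed answer.
        features = {k: v for k, v in record.items() if k != answer_key}
        probs = predict(features)
        observed = record[answer_key]
        # Clamp to avoid log(0) when a submission assigns zero probability.
        total_log_loss += -math.log(max(probs[observed], 1e-12))

    n = len(private_records)
    # Only aggregate numbers leave this function; no record is returned.
    return {"n_records": float(n), "mean_log_loss": total_log_loss / n}


# Made-up toy records standing in for the private microdata.
toy_records = [
    {"region": "A", "age_band": "18-29", "answer": 0},
    {"region": "B", "age_band": "30-44", "answer": 1},
    {"region": "A", "age_band": "45-59", "answer": 1},
]


def uniform_submission(features: Dict[str, object]) -> Sequence[float]:
    """A trivial submission that ignores its input."""
    return [0.5, 0.5]


report = run_closed_evaluation(uniform_submission, toy_records)
print(report)
```

Returning only a record count and a mean score is one simple way to keep individual responses inside the evaluation boundary; how much aggregate feedback is safe to release is a separate design question.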

Our focus is deliberately on the populations where AI representations are weakest and matter most: communities outside WEIRD (Western, Educated, Industrialized, Rich, Democratic) samples, in development and humanitarian contexts, where collapsing heterogeneity and miscalibrated confidence carry real cost.

Uncontaminated

evaluation data held out of every public pretraining corpus

Closed

evaluation architecture — code travels to the data, records never leave

Proper

scoring rules that reward calibrated, truthful probabilities
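To make "proper" concrete: a scoring rule is strictly proper when a forecaster gets the best expected score only by reporting the true answer distribution. The sketch below uses two standard examples, the Brier score and the log score. Whether SimulacraBench uses either of these is an assumption, and the three-option answer distribution is invented for illustration.

```python
import math
from typing import Callable, Sequence


def brier_score(probs: Sequence[float], outcome: int) -> float:
    """Multiclass Brier score against the one-hot outcome (lower is better)."""
    return sum(
        (p - (1.0 if i == outcome else 0.0)) ** 2 for i, p in enumerate(probs)
    )


def log_score(probs: Sequence[float], outcome: int) -> float:
    """Negative log-probability of the observed outcome (lower is better)."""
    return -math.log(probs[outcome])


def expected_score(
    score_fn: Callable[[Sequence[float], int], float],
    forecast: Sequence[float],
    truth: Sequence[float],
) -> float:
    """Expected score of `forecast` when outcomes are drawn from `truth`."""
    return sum(q * score_fn(forecast, k) for k, q in enumerate(truth) if q > 0)


# Invented answer distribution for one survey question with three options.
truth = [0.6, 0.3, 0.1]

forecasts = {
    "honest": [0.6, 0.3, 0.1],          # reports the true distribution
    "overconfident": [0.9, 0.05, 0.05], # collapses heterogeneity
    "uniform": [1 / 3, 1 / 3, 1 / 3],   # ignores the population entirely
}

for name, forecast in forecasts.items():
    brier = expected_score(brier_score, forecast, truth)
    logs = expected_score(log_score, forecast, truth)
    print(f"{name:>13}: expected Brier = {brier:.3f}, expected log = {logs:.3f}")
```

Under both rules the honest forecast has the lowest expected penalty, while the overconfident forecast, which puts almost all its mass on the majority answer, scores worst of the three. That is the sense in which such rules reward calibrated, truthful probabilities rather than confident guesses.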

02 · Competitions

Our work runs as open competitions. Here's the first.

Now · NeurIPS 2026 Competition Track

SimulacraBench

Closing the simulacra gap in development data. Validate AI representations of hard-to-reach populations on ~55,000 unreleased UNICEF and UNHCR respondents across 23 countries. Public launch August 1, 2026.


Follow the work.