Situated Evals

Evaluating AI where the data is hardest to reach.

Situated Evals builds uncontaminated, closed-evaluation benchmarks that measure how faithfully AI represents specific, hard-to-reach populations — scored on data that has never entered a pretraining corpus.

01 · About

Benchmarks lie when the test set is in the training set. We build evaluations that can't be memorized.

AI systems are increasingly used to stand in for people: to pre-test survey instruments, impute missing responses, and simulate how specific communities might answer. But the evidence on whether these simulacra are faithful comes from public benchmarks that models may already have seen in training.

A different design.

Situated Evals runs on real-world microdata that has never been publicly released, scored under strictly proper scoring rules, inside a closed-evaluation architecture where submissions travel to the data, not the other way around. Nothing is exfiltrated; nothing leaks into the next model's training run.

Our focus is deliberately on the populations where AI representations are weakest and matter most: communities outside WEIRD (Western, Educated, Industrialized, Rich, Democratic) samples, in development and humanitarian contexts, where collapsed heterogeneity and miscalibrated confidence carry real costs.

Uncontaminated

evaluation data held out of every public pretraining corpus

Closed

evaluation architecture — code travels to the data, records never leave

Proper

scoring rules that reward calibrated, truthful probabilities
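To make "proper" concrete, here is a minimal sketch of why a strictly proper scoring rule rewards truthful probabilities. It assumes the logarithmic and Brier scores as examples; this page does not say which rules Situated Evals uses, and the response distribution and forecasts below are hypothetical numbers chosen only for illustration.

```python
import math

def log_score(forecast, outcome):
    """Logarithmic score: log-probability given to the realized outcome (higher is better)."""
    return math.log(forecast[outcome])

def brier_score(forecast, outcome):
    """Multi-category Brier score: squared error against the one-hot outcome (lower is better)."""
    return sum((p - (1.0 if k == outcome else 0.0)) ** 2 for k, p in forecast.items())

def expected_score(score_fn, forecast, truth):
    """Expected score of `forecast` when outcomes are drawn from the distribution `truth`."""
    return sum(q * score_fn(forecast, k) for k, q in truth.items())

# Hypothetical answer distribution for one survey item (illustrative only).
truth = {"yes": 0.6, "no": 0.3, "unsure": 0.1}

# Two candidate submissions: one reports the true distribution,
# the other collapses heterogeneity toward the majority answer.
honest = dict(truth)
overconfident = {"yes": 0.9, "no": 0.08, "unsure": 0.02}

honest_log = expected_score(log_score, honest, truth)
overconfident_log = expected_score(log_score, overconfident, truth)
honest_brier = expected_score(brier_score, honest, truth)
overconfident_brier = expected_score(brier_score, overconfident, truth)

print("expected log score   honest:", honest_log, " overconfident:", overconfident_log)
print("expected Brier score honest:", honest_brier, " overconfident:", overconfident_brier)
```

Under both rules the honest forecast does better in expectation: its log score is higher and its Brier score is lower. That is the property a strictly proper rule guarantees, so a submission cannot gain by overstating its confidence.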

02 · Competitions

Our work runs as open competitions. Here's the first.

Now · NeurIPS 2026 Competition Track

SimulacraBench

Closing the simulacra gap in development data. Validate AI representations of hard-to-reach populations on unreleased microdata from ~55,000 UNICEF and UNHCR respondents across 23 countries. Public launch August 1, 2026.

Explore the competition →

Follow the work.

Register your interest to hear when new evaluations and the SimulacraBench leaderboard go live. If you steward population data and want models evaluated on it without releasing it, we'd like to talk.