Andreas Haupt
Stanford HAI · Digital Economy Lab
HAI Postdoctoral Fellow jointly in Stanford's Economics and Computer Science departments. PhD from MIT; co-author of the forthcoming textbook Machine Learning from Human Preferences.
Situated Evals · NeurIPS 2026 Competition Track
SimulacraBench is a Situated Evals competition to validate AI representations of hard-to-reach populations, evaluated on unreleased UN microdata.
01 · About
UNICEF, UNHCR, and other humanitarian programs increasingly rely on rapid behavioural surveys to set policy and programming. Field data collection is slow and expensive, and coverage is patchy, which drives interest in using LLMs as simulacra of specific populations to pre-test instruments, impute non-response, and run subgroup what-ifs.
Today's evidence on whether these simulacra are faithful is contaminated. Public benchmarks (ANES, GSS, World Values Survey) are in pretraining corpora, so models can memorize them. Outside WEIRD (Western, Educated, Industrialized, Rich, Democratic) subpopulations, simulacra collapse heterogeneity and miscalibrate confidence, and they do so silently.
This competition runs on UN behavioural microdata that has never been publicly released, scored under a strictly proper rule, with a closed-evaluation architecture where submissions travel to the data rather than the other way around.
~55,000 respondents across 23 countries in unreleased UNICEF and UNHCR microdata
0 of these microdata in any model's pretraining corpus
4 live UN survey programs (CRA 2.0, Faith & Immunisation, MENA Climate KAP, UNHCR ERPIS)
02 · Why compete
03 · The Task
N respondents × K items (≈ 55,000 × ~150 in this competition)
Given $X \in \mathbb{R}^{N \times K}$ of survey responses with mask $\Omega_{\text{train}}$, learn $\hat{p}(X_{ij} \mid \text{context})$. Score by held-out log-loss, a strictly proper rule.
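One common way to write a held-out log-loss is sketched below. The symbol $\Omega_{\text{test}}$ (the set of held-out cells), the realized response $x_{ij}$, and the choice to average uniformly over cells are assumptions made for illustration; the organizers' exact aggregation is not stated here.

$$
\mathcal{L}(\hat{p}) \;=\; -\frac{1}{|\Omega_{\text{test}}|} \sum_{(i,j) \in \Omega_{\text{test}}} \log \hat{p}\,(X_{ij} = x_{ij} \mid \text{context})
$$

The rule is strictly proper because, for a single cell with true response distribution $q$, the expected loss decomposes as

$$
-\sum_{c} q(c) \log \hat{p}(c) \;=\; H(q) + \mathrm{KL}(q \,\|\, \hat{p}),
$$

and the KL term is zero only when $\hat{p} = q$. Reporting anything other than one's honest predictive distribution therefore increases the expected loss.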
Items are categorical: binary, unordered nominal, ordered Likert, multi-select, and binned continuous. Skip-logic gating is treated as a distinguished response level (NA_GATED), not as missingness — participants must place mass on it where appropriate.
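A minimal sketch of how log-loss might treat NA_GATED as an ordinary response level for a single-choice item. The function name `item_log_loss`, the toy labels, and the clipping threshold are hypothetical, and the handling of multi-select items is not covered; this is not the official scorer.

```python
import numpy as np

NA_GATED = "NA_GATED"  # level name taken from the task description


def item_log_loss(probs, levels, observed, eps=1e-12):
    """Mean negative log-likelihood for one single-choice categorical item.

    probs    : (n_respondents, n_levels) predicted probabilities
    levels   : list of level labels, including NA_GATED
    observed : list of realized labels, one per respondent
    eps      : clipping floor to avoid log(0) (an assumption, not a stated rule)
    """
    probs = np.asarray(probs, dtype=float)
    if not np.allclose(probs.sum(axis=1), 1.0):
        raise ValueError("each row of probs must sum to 1")
    index = {level: k for k, level in enumerate(levels)}
    cols = np.array([index[x] for x in observed])
    # Probability the model assigned to the level that actually occurred.
    picked = probs[np.arange(len(observed)), cols]
    return float(-np.mean(np.log(np.clip(picked, eps, 1.0))))


# Toy item: two respondents answered, two were routed past it by skip logic.
levels = ["yes", "no", NA_GATED]
observed = ["yes", NA_GATED, "no", NA_GATED]

uniform = np.full((4, 3), 1 / 3)                 # uninformed baseline
ignores_gate = np.tile([0.5, 0.5, 0.0], (4, 1))  # puts no mass on NA_GATED
gate_aware = np.array([
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
    [0.1, 0.1, 0.8],
])

loss_uniform = item_log_loss(uniform, levels, observed)
loss_ignores_gate = item_log_loss(ignores_gate, levels, observed)
loss_gate_aware = item_log_loss(gate_aware, levels, observed)
```

In this toy case the uniform baseline scores log 3, and the model that places zero mass on NA_GATED is penalized heavily on the gated respondents, which is why the gated level needs explicit probability mass.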
Track A
For each training respondent, a subset of items is held out completely at random (MCAR): the non-response regime.
Track B
All non-sociodemographic items masked for a held-out subset of respondents — the simulacrum test.
Teams may submit to either track; top overall recognition requires strong performance on both.
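The two masking regimes can be sketched on a toy response matrix as follows. The matrix size, the hold-out rates, the assumption that socio-demographic items occupy the first columns, and the assumption that Track A may hide any item of a training respondent are all hypothetical; the actual split is made by the organizers on the closed data.

```python
import numpy as np

rng = np.random.default_rng(0)

n_respondents, n_items = 8, 6
n_sociodemographic = 2          # assumed: first columns are socio-demographic
item_holdout_rate = 0.25        # hypothetical Track A rate
respondent_holdout_rate = 0.25  # hypothetical Track B rate

# Track B: pick held-out respondents and hide every non-sociodemographic item.
n_heldout = int(round(respondent_holdout_rate * n_respondents))
heldout_rows = rng.choice(n_respondents, size=n_heldout, replace=False)
track_b_hidden = np.zeros((n_respondents, n_items), dtype=bool)
track_b_hidden[heldout_rows, n_sociodemographic:] = True

# Track A: for the remaining (training) respondents, hide each cell
# independently of its value and of every other cell, i.e. MCAR.
train_rows = np.setdiff1d(np.arange(n_respondents), heldout_rows)
track_a_hidden = np.zeros_like(track_b_hidden)
track_a_hidden[train_rows] = (
    rng.random((len(train_rows), n_items)) < item_holdout_rate
)

# Cells visible for training: everything not hidden by either track.
train_mask = ~(track_a_hidden | track_b_hidden)
```

Track B is the harder setting in this sketch: a held-out respondent contributes only socio-demographic columns, so every attitude or behaviour prediction must come from what the model learned about similar respondents.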
04 · Data
The three UNICEF assets (CRA 2.0, Faith & Immunisation, MENA Climate KAP) are confirmed; the UNHCR ERPIS instrument targets Syrian refugees in four host countries and is included conditional on UNHCR data-governance approval. Participants do not receive the microdata. Instead, they receive (i) a schema-only specification with column names, types, and response-category codes; (ii) a small synthetic sandbox to debug the submission pipeline; and (iii) the submission API specification.
More data may be added before launch. The instrument list below reflects what is confirmed today. Additional UN survey programs may join the benchmark up until the public launch on August 1, 2026 — the schema specification will be updated as new assets clear data governance.
| | CRA 2.0 | Faith & Immunisation | MENA Climate | ERPIS 2025 |
|---|---|---|---|---|
| Countries | 6 | 10 | 3 | 4 |
| Waves | 3 | 1 | 1 | 2 |
| N total | 20,229 | 19,847 | 1,236 | 13,821 |
| Items / wave | 72 | 26 | 168 | 110 |
| Socio-demographic vars | 10 | 5 | 16 | 15 |
| Attitude / behaviour vars | 50 | 13 | 129 | 115 |
05 · Submission & Rules
06 · Timeline
Jun 2026
Materials posted
Schema, sandbox, baselines public
Jul 2026
Dry run
Harness stress-tested with invited teams
Aug 1, 2026
Public launch
Development phase opens · daily leaderboard
Nov 1–14, 2026
Test phase
Final test submissions · leaderboard frozen
Dec 2026
NeurIPS results
Competition Track session · top-team talks
Q1 2027
Proceedings paper
Authorship for top-3 per track
07 · Recognition
Recognition categories
How teams are recognized
Organizing team
Stanford HAI · Digital Economy Lab
HAI Postdoctoral Fellow jointly in Stanford's Economics and Computer Science departments. PhD from MIT; co-author of the forthcoming textbook Machine Learning from Human Preferences.
UN Innovation Network
Senior Advisor on Behavioural Science to the Executive Office of the UN Secretary-General; leads the UN Behavioural Science Group. Convenes the UNICEF and UNHCR data-custodian counterparts.
UNHCR
Data and innovation specialist at UNHCR; supports the governance, preparation, and quality assurance of refugee and asylum-seeker microdata contributed to the benchmark.
UNICEF
Behavioural science global lead at UNICEF; data steward for the Community Rapid Assessment 2.0 and the Faith & Immunisation Survey.
UNHCR
Leads innovation data work at UNHCR over refugee and asylum-seeker microdata; owns the technical specification and ingestion pathway for UNHCR-contributed data.
Stanford
PhD researcher at Stanford working at the intersection of economics and machine learning; coordinates the competition design, evaluation protocol, and participant operations.
Stanford
Researcher at Stanford on the measurement and evaluation of AI systems; builds and maintains the closed-evaluation harness, submission API, and leaderboard.
Stanford CS · STAIR
Associate Professor of Computer Science at Stanford and director of Stanford Trustworthy AI Research (STAIR). Methodological expertise on trustworthy evaluation and benchmark design.
Stanford HAI · MIT
Toshiba Professor Emeritus at MIT, Professor (Research) at Stanford. Long-standing engagement with multilateral institutions on data governance for development and humanitarian contexts.
Registration opens with the public launch on August 1, 2026. Register your interest to be notified when the leaderboard and starter kit go live — or write to us with questions about the task, the data, or eligibility.