Situated Evals · NeurIPS 2026 Competition Track

Closing the Simulacra Gap in Development Data

SimulacraBench is a Situated Evals competition to validate AI representations of hard-to-reach populations, evaluated on unreleased UN microdata.

Public launch: August 1, 2026
Two tracks: Imputation · Cross-respondent
~55,000 respondents: 23 countries · UNICEF & UNHCR

01 · About

AI is already filling data gaps for hard-to-reach populations. Nobody knows how well.

UNICEF, UNHCR, and humanitarian programs increasingly rely on rapid behavioural surveys to set policy and programming. Field data collection is slow and expensive, and the resulting data has gaps. This is driving interest in using LLMs as simulacra of specific populations to pre-test instruments, impute non-response, and run subgroup what-ifs.

Why a new benchmark is needed.

Today's evidence on whether these simulacra are faithful is contaminated. Public benchmarks (ANES, GSS, World Values Survey) are in pretraining corpora, so models can memorize them. Outside WEIRD (Western, educated, industrialized, rich, democratic) subpopulations, simulacra silently collapse heterogeneity and miscalibrate confidence.

This competition runs on UN behavioural microdata that has never been publicly released, scored under a strictly proper rule, with a closed-evaluation architecture where submissions travel to the data rather than the other way around.

~55,000 respondents across 23 countries in unreleased UNICEF and UNHCR microdata

None of these microdata in any model's pretraining corpus

Four live UN survey programs (CRA 2.0, Faith & Immunisation, MENA Climate KAP, UNHCR ERPIS)

02 · Why compete

A clean dataset, a proper score, four baseline families. Apples-to-apples.

A new dataset, not benchmaxed. Four UN behavioural-science instruments not in any pretraining corpus.
Calibration-first, log-loss-scored evaluation. A strictly proper rule that rewards truthful probabilities, not accuracy on the modal class.
Apples-to-apples comparison across tabular diffusion, IRT, low-rank matrix completion, and LLM-prompted simulacra — on identical held-out masks.
Open-source starter kit. MIT-licensed baselines, schema-only specification, and a synthetic sandbox at launch.
Authorship path. Top-3 teams per track are invited to contribute to the Competition Track proceedings paper; the top overall team gets an invited talk slot.
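
A small sketch of why a strictly proper rule rewards truthful probabilities while accuracy does not. It assumes a single binary item whose true "yes" rate is 0.7; the function names and the grid are illustrative, not part of the starter kit.

```python
import math

def expected_log_loss(p_true: float, q_report: float) -> float:
    """Expected negative log-likelihood of reporting q when the true rate is p."""
    return -(p_true * math.log(q_report) + (1 - p_true) * math.log(1 - q_report))

def expected_accuracy(p_true: float, q_report: float) -> float:
    """Expected accuracy when predicting whichever class has more reported mass."""
    return p_true if q_report >= 0.5 else 1 - p_true

p = 0.7  # assumed true probability of "yes" (illustrative)
grid = [i / 100 for i in range(1, 100)]

# Log-loss has a unique minimum at the truthful report q = p.
best_q = min(grid, key=lambda q: expected_log_loss(p, q))

# Accuracy cannot tell a barely-confident report from an overconfident one.
same_accuracy = expected_accuracy(p, 0.51) == expected_accuracy(p, 0.99)
```

Under these assumptions, `best_q` lands on the true rate, while any report above 0.5 earns the same expected accuracy, which is why accuracy on the modal class says nothing about calibration.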

Anatomy of the competition

Task: Probabilistic completion of a respondent × question matrix
Tracks: (A) within-respondent imputation · (B) cross-respondent generalization
Data: UNICEF microdata (CRA 2.0, Faith & Immunisation, MENA KAP) plus UNHCR ERPIS
Architecture: Closed evaluation: organizers run code; data never leaves Stanford
Metric: Mean test log-loss — strictly proper, calibration-sensitive
Compute cap: 100 min on a single A100 at test; outbound network disabled

03 · The Task

Complete a respondent × question matrix — for both behaviours and opinions.

[Figure: respondent × question matrix of N respondents × K items (≈ 55,000 × ~150 in this competition). Column groups: reported behaviour, opinion. Cell legend: observed (training); held-out, predict P(answer); genuinely missing.]

Formally

Given a matrix X ∈ ℝ^(N×K) of survey responses with training mask Ω_train, learn p̂(X_ij | context) for each held-out cell (i, j). Score by held-out log-loss, a strictly proper rule.

Items are categorical: binary, unordered nominal, ordered Likert, multi-select, and binned continuous. Skip-logic gating is treated as a distinguished response level (NA_GATED), not as missingness — participants must place mass on it where appropriate.
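
As a rough illustration of the scoring described above, the sketch below computes mean log-loss over held-out cells, with NA_GATED treated as one more response level. The dictionary-based prediction format and the `eps` clipping floor are assumptions made for this example; the official harness may represent and clip predictions differently.

```python
import math

NA_GATED = "NA_GATED"  # distinguished skip-logic level named in the task description

def mean_log_loss(predictions, answers, eps=1e-12):
    """Mean negative log-probability assigned to the true level of each cell.

    predictions: list of dicts mapping response level -> probability (assumed format)
    answers:     list of true response levels, one per held-out cell
    eps:         assumed floor so a zero probability does not give infinite loss
    """
    total = 0.0
    for probs, truth in zip(predictions, answers):
        total += -math.log(max(probs.get(truth, 0.0), eps))
    return total / len(answers)

# Two held-out cells of a hypothetical yes/no item that can be gated by skip logic.
preds = [
    {"yes": 0.6, "no": 0.3, NA_GATED: 0.1},
    {"yes": 0.05, "no": 0.05, NA_GATED: 0.9},
]
truth = ["yes", NA_GATED]
score = mean_log_loss(preds, truth)
```

A model that put no mass on NA_GATED would be penalized heavily on the second cell, which is the practical consequence of treating gating as a response level rather than as missingness.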

Track A

Within-respondent imputation

A subset of items is held out missing completely at random (MCAR) for each training respondent: the non-response regime.

Track B

Cross-respondent generalization

All non-sociodemographic items are masked for a held-out subset of respondents: the simulacrum test.

Teams may submit to either track or both; top overall recognition requires strong performance on both.
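
The two masking regimes can be sketched as follows. This is an illustration only: the hold-out rates, the independent per-cell sampling used for Track A, and the function names are assumptions, and the organizers' actual mask construction is not specified here.

```python
import random

def track_a_mask(n_rows, n_cols, holdout_rate, seed=0):
    """Track A sketch: hold out each cell independently with the same probability,
    one simple way to obtain a missing-completely-at-random pattern."""
    rng = random.Random(seed)
    return [[rng.random() < holdout_rate for _ in range(n_cols)] for _ in range(n_rows)]

def track_b_mask(n_rows, item_is_sociodemographic, holdout_fraction, seed=0):
    """Track B sketch: for a random subset of respondents, hold out every item
    that is not sociodemographic."""
    rng = random.Random(seed)
    n_held = round(n_rows * holdout_fraction)
    held_rows = set(rng.sample(range(n_rows), n_held))
    return [
        [(i in held_rows) and not is_sd for is_sd in item_is_sociodemographic]
        for i in range(n_rows)
    ]

# Toy matrix: 6 respondents, 2 sociodemographic items followed by 3 other items.
item_is_sd = [True, True, False, False, False]
mask_a = track_a_mask(6, 5, holdout_rate=0.2)
mask_b = track_b_mask(6, item_is_sd, holdout_fraction=0.5)
```

In `mask_b`, a held-out respondent keeps only the sociodemographic columns visible, so a model must predict that respondent's answers from demographics plus whatever it learned from other respondents.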

04 · Data

Three UNICEF assets, one UNHCR instrument. None publicly available.

The three UNICEF assets (CRA 2.0, Faith & Immunisation, MENA Climate KAP) are confirmed; the UNHCR ERPIS instrument targets Syrian refugees in four host countries and is included conditional on UNHCR data-governance approval. Participants do not receive the microdata. Instead, they receive (i) a schema-only specification with column names, types, and response-category codes; (ii) a small synthetic sandbox to debug the submission pipeline; (iii) the submission API specification.

More data may be added before launch. The instrument list below reflects what is confirmed today. Additional UN survey programs may join the benchmark up until the public launch on August 1, 2026 — the schema specification will be updated as new assets clear data governance.

	CRA 2.0	Faith & Immunisation	MENA Climate	ERPIS 2025
Countries	6	10	3	4
Waves	3	1	1	2
N total	20,229	19,847	1,236	13,821
Items / wave	72	26	168	110
Socio-demographic vars	10	5	16	15
Attitude / behaviour vars	50	13	129	115

05 · Submission & Rules

Submit code, not predictions. Organizers run it inside the sandbox.

How submission works

Submit a containerized image or a Python script + environment file that implements the defined API.
The harness loads your container, instantiates the model, runs it against the held-out cells, and returns scalar per-track scores.
Code can train on the unmasked portion of the matrix before predicting.
Outbound network access from the submission container is disabled at evaluation time.
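
The submission API itself is defined in the specification released with the starter kit. Purely as a hypothetical illustration of the flow above (instantiate, train on unmasked cells, predict held-out cells), a model interface might look like the sketch below; every name here (`Model`, `fit`, `predict_proba`, `run_harness`, the cell addressing scheme) is invented for this example.

```python
from typing import Dict, List, Protocol, Tuple

Cell = Tuple[int, str]  # (respondent index, item name); hypothetical addressing scheme

class Model(Protocol):
    """Hypothetical interface; the real API is set by the competition specification."""
    def fit(self, observed: Dict[Cell, str], levels: Dict[str, List[str]]) -> None: ...
    def predict_proba(self, cells: List[Cell]) -> List[Dict[str, float]]: ...

class UniformModel:
    """Trivial model that spreads mass evenly over each item's response levels."""
    def fit(self, observed, levels):
        self.levels = levels

    def predict_proba(self, cells):
        out = []
        for _, item in cells:
            item_levels = self.levels[item]
            out.append({level: 1.0 / len(item_levels) for level in item_levels})
        return out

def run_harness(model: Model, observed, levels, held_out_cells):
    """Stand-in for the organizers' harness: train, then predict held-out cells."""
    model.fit(observed, levels)
    return model.predict_proba(held_out_cells)

levels = {"q1": ["yes", "no", "NA_GATED"]}
observed = {(0, "q1"): "yes"}
predictions = run_harness(UniformModel(), observed, levels, [(1, "q1")])
```

The point of the sketch is the direction of travel: the model object never sees held-out answers, only the cells it is asked to score.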

Compute & quotas

Single A100 GPU. 10-minute wall-clock budget per development submission; 100-minute budget per test submission.
Development phase (Aug 1 – Oct 31, 2026): 1 leaderboard submission per team per day, scored on a fresh random 10% shard.
Test phase (Nov 1 – Nov 14, 2026): 1 final test submission per team, evaluated on the full datasets.
Pretrained external weights are allowed if publicly downloadable at a fixed commit hash specified before test phase opens.
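
Because training happens inside the wall-clock budget, submissions may want to stop training early enough to leave time for prediction. One possible pattern, sketched under the assumption that training proceeds in repeatable steps; the function name and the safety margin are illustrative.

```python
import time

def train_within_budget(step_fn, budget_seconds, safety_margin_seconds,
                        clock=time.monotonic):
    """Call step_fn repeatedly until the elapsed time reaches the budget minus
    a safety margin reserved for prediction and I/O. Returns the step count."""
    start = clock()
    steps = 0
    while clock() - start < budget_seconds - safety_margin_seconds:
        step_fn()
        steps += 1
    return steps

# Example for a development submission (10-minute budget, 60 s held in reserve):
# steps = train_within_budget(one_training_step, 10 * 60, 60)
# where one_training_step is a hypothetical function running one optimizer update.
```

The check happens between steps, so a single step that runs longer than the margin can still overrun; the margin should exceed the slowest step plus prediction time.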

Baselines provided

Per-item marginal — trivial baseline.
2PL IRT with categorical-logistic likelihood — strong classical baseline.
Low-rank ALS matrix completion.
TabDDPM — modern tabular diffusion baseline.
All released under an MIT license alongside the schema and synthetic sandbox.
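
For orientation, a per-item marginal baseline can be as small as the sketch below. The additive smoothing constant `alpha` is an assumption of this example (it keeps unseen levels such as NA_GATED from receiving zero mass); the released baseline may be implemented differently.

```python
from collections import Counter

def fit_item_marginal(observed_answers, levels, alpha=1.0):
    """Smoothed empirical distribution over an item's response levels.

    observed_answers: training answers for one item
    levels:           every admissible response level for that item
    alpha:            additive smoothing pseudo-count (assumed, not from the spec)
    """
    counts = Counter(observed_answers)
    total = len(observed_answers) + alpha * len(levels)
    return {level: (counts[level] + alpha) / total for level in levels}

# Every held-out cell of this item receives the same predicted distribution.
marginal = fit_item_marginal(["yes", "yes", "no"], ["yes", "no", "NA_GATED"])
```

This ignores the respondent entirely, so any model that uses respondent context should beat it on log-loss.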

Eligibility & ethics

Open to teams from academia, industry, and independent research, except where precluded by sanctions or law.
Each submission must be accompanied by a 4-page method description; top-3 teams per track supply source code under a non-commercial research license.
Microdata are not released. Any attempt to exfiltrate records is grounds for disqualification.
Teams whose scores cannot be separated at paired-bootstrap significance are treated as tied and recognized jointly.
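
One way a paired-bootstrap tie check could work is sketched below. The resampling unit (held-out cells), the number of resamples, the 95% percentile interval, and the function names are all assumptions of this example rather than the organizers' stated procedure.

```python
import random

def paired_bootstrap_ci(losses_a, losses_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for the mean per-cell loss difference (A minus B).

    losses_a, losses_b: per-cell log-losses of two submissions on the SAME cells,
    so each resample draws paired differences with replacement.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(losses_a, losses_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def is_tie(losses_a, losses_b, **kwargs):
    """Two submissions tie when the interval for their difference contains zero."""
    lower, upper = paired_bootstrap_ci(losses_a, losses_b, **kwargs)
    return lower <= 0.0 <= upper
```

Pairing matters because both submissions are scored on identical cells: differencing first removes the variance shared across cells, which makes the comparison far more sensitive than comparing two independent intervals. For Track B, resampling whole respondents rather than cells may be the more appropriate unit.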

06 · Timeline

From public launch in August to results at NeurIPS in December 2026.

Jun 2026

Materials posted

Schema, sandbox, baselines public

Jul 2026

Dry run

Harness stress-tested with invited teams

Aug 1, 2026

Public launch

Development phase opens · daily leaderboard

Nov 1–14, 2026

Test phase

Final test submissions · leaderboard frozen

Dec 2026

NeurIPS results

Competition Track session · top-team talks

Q1 2027

Proceedings paper

Authorship for top-3 per track

07 · Recognition

Authorship, talks, and operational impact.

Recognition categories

Top overall: Strong performance on both tracks
Track A winner: Within-respondent imputation
Track B winner: Cross-respondent generalization
Travel grants: Reserved for LMIC and under-represented teams

How teams are recognized

Proceedings authorship: Top-3 per track invited to co-author the Competition Track paper
NeurIPS podium: 10-minute method talks for top-3 per track
Invited talk: Top overall team gets an invited slot at the session
UN Applied Impact commendation: Jointly awarded with the UN Behavioural Science Group for approaches considered for follow-up evaluation in UN operational workflows

Organizing team

Nine people across Stanford, UNICEF, UNHCR, and the UN Behavioural Science Group.

Lead organizers

Andreas Haupt

Stanford HAI · Digital Economy Lab

HAI Postdoctoral Fellow jointly in Stanford's Economics and Computer Science departments. PhD from MIT; co-author of the forthcoming textbook Machine Learning from Human Preferences.

Mary MacLennan

UN Innovation Network

Senior Advisor on Behavioural Science to the Executive Office of the UN Secretary-General; leads the UN Behavioural Science Group. Convenes the UNICEF and UNHCR data-custodian counterparts.

Data partners

Ahmed Galal Abukhashaba

UNHCR

Data and innovation specialist at UNHCR; supports the governance, preparation, and quality assurance of refugee and asylum-seeker microdata contributed to the benchmark.

Ukasha Ramli

UNICEF

Behavioural science global lead at UNICEF; data steward for the Community Rapid Assessment 2.0 and the Faith & Immunisation Survey.

Rebeca Moreno Jiménez

UNHCR

Leads innovation data work at UNHCR over refugee and asylum-seeker microdata; owns the technical specification and ingestion pathway for UNHCR-contributed data.

Competition

José Ramón Enríquez

Stanford

PhD researcher at Stanford working at the intersection of economics and machine learning; coordinates the competition design, evaluation protocol, and participant operations.

Yegor Denisov-Blanch

Stanford

Researcher at Stanford on the measurement and evaluation of AI systems; builds and maintains the closed-evaluation harness, submission API, and leaderboard.

Academic supervisors

Sanmi Koyejo

Stanford CS · STAIR

Associate Professor of Computer Science at Stanford and director of Stanford Trustworthy AI Research (STAIR). Methodological expertise on trustworthy evaluation and benchmark design.

Alex Pentland

Stanford HAI · MIT

Toshiba Professor Emeritus at MIT, Professor (Research) at Stanford. Long-standing engagement with multilateral institutions on data governance for development and humanitarian contexts.

Be ready for August 1.

Registration opens with the public launch on August 1, 2026. Register your interest to be notified when the leaderboard and starter kit go live — or write to us with questions about the task, the data, or eligibility.
