Call for
Shared Task
Social Sim'26
Social Sim '26 invites authors to submit contributions to a shared task. This track is separate from the workshop main track. It will be reviewed separately and listed alongside the workshop main track on the website. Submissions to the shared task can be presented at the workshop event.
Submission Portal: https://openreview.net/group?id=colmweb.org/COLM/2026/Workshop/Social_Sim_Shared_Task
Submission Deadline: August 7, 2026, AoE
Accept/Reject Notification: August 22, 2026, AoE
Workshop Date: October 9, 2026
LLM-based social simulation research is a young field, and the community has not converged on what makes a simulation good, faithful, or scientifically useful (see ICML Position Paper and contemporary evaluation proposals: PIMMUR, TRAILS, EASE, Sim2Real, …).
Is there something to be learned from a simulation when subtle design choices alter relevant outcomes?
Moreover, can we learn it in a structured way that supports frictionless reproducibility?
Let's find out together by designing and analyzing metrics in the context of adjudicating hypotheses about simulation results.
Help us get answers by contributing to this task!
Simulation metric analysis rests on a lot of infrastructure work, and submissions will be hard to compare if everyone brings their own bespoke implementation. We have therefore implemented the scenarios in the Silicon Society Sandbox (SiliSocS), an EASE-compliant, experimentation-oriented simulation codebase. All provided study data and code (beyond SiliSocS) can be found in a dedicated shared task repository. By providing this system, we hope to reduce the friction of participating while also facilitating comparison.
We provide 5 scenario studies (see below), together covering a breadth of phenomena relevant to social simulations (incentive understanding, reasoning over other agents, emergent social structures, persona expression, and opinion expression).
A scenario is a base EASE (Environment-Agents-Simulation engine-Evaluation) simulation configuration. We have left the Evaluation component for you to add.
Each scenario ships with a design.yaml specifying:
Anchors: contextualized published works that have presented/analyzed the phenomena
Hypotheses: claims derived from the anchors that can in principle be tested with the data provided
Variables: relevant config variables with distinct values
Sweep data: paths to a set of already-computed run sweeps over these values
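To make the relationship between Variables and Sweep data concrete, here is a minimal Python sketch of how a set of variables with distinct values expands into sweep conditions. It assumes a full factorial sweep, which this call does not state, and the variable names and values are hypothetical placeholders rather than the contents of any shipped design.yaml.

```python
from itertools import product

# Hypothetical variables and values, for illustration only; the real ones
# are listed in each scenario's design.yaml.
variables = {
    "persona_detail": ["minimal", "rich"],
    "temperature": [0.2, 0.8],
    "num_agents": [10, 50],
}


def sweep_grid(variables):
    """Expand {name: [values]} into one config dict per combination.

    Assumes a full factorial design; a real sweep may cover only a subset.
    """
    names = list(variables)
    value_lists = [variables[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


grid = sweep_grid(variables)
print(len(grid), "conditions; first:", grid[0])
```

An evaluation can then be indexed by these condition dicts, so that each result row states exactly which variable values it was computed under.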
Submissions will be reviewed (see criteria below), and accepted submissions will be posted on the workshop website. The workshop event will include a special session in which shared task submissions are presented and awards are given to notable submissions.
A submission should pick one or more scenarios and evaluate the listed hypotheses over the provided data, using evaluation metrics that its authors consider well suited to the task.
Shared-task submission (no simulation needed!).
Propose one or more evaluation metrics or methodologies for one or more of the 5 provided studies that can be run on the provided simulation data to establish the robustness of the associated hypotheses.
Write the code that implements the evaluation on the provided simulation output. Run your code on the corresponding simulation output, and store the results.
Then submit
A 2-page paper (unlimited appendix) that
presents the evaluation, outlining its motivation and methodology,
includes tables of results from the processed data, and
discusses the results in light of the study and the task.
A link to a code repository containing the post-processing code and the result table data.
Submit on OpenReview: https://openreview.net/group?id=colmweb.org/COLM/2026/Workshop/Social_Sim_Shared_Task
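The evaluation step described above (implement the metric, run it on the provided simulation output, store the results) could be organized roughly as in the following Python sketch. The record layout, the `condition` and `seed` fields, and the `cooperation_rate` metric are hypothetical stand-ins; the actual output format is defined by the shared task repository, and real code would load the provided sweep data instead of the inline list.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical per-run records standing in for the provided sweep data.
runs = [
    {"condition": "baseline", "seed": 0, "cooperation_rate": 0.42},
    {"condition": "baseline", "seed": 1, "cooperation_rate": 0.46},
    {"condition": "incentive", "seed": 0, "cooperation_rate": 0.61},
    {"condition": "incentive", "seed": 1, "cooperation_rate": 0.65},
]


def evaluate(runs, metric_key):
    """Aggregate one metric per condition into result-table rows."""
    by_condition = defaultdict(list)
    for run in runs:
        by_condition[run["condition"]].append(run[metric_key])
    return [
        {"condition": cond, "n_runs": len(vals), "mean": round(mean(vals), 4)}
        for cond, vals in sorted(by_condition.items())
    ]


def to_csv(rows):
    """Serialize result rows as CSV text (write this to a file to store it)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["condition", "n_runs", "mean"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


results = evaluate(runs, "cooperation_rate")
table = to_csv(results)
print(table)
```

Keeping aggregation and serialization in separate functions makes it easier to publish both the code and the stored result table in the linked repository.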
This is not a benchmark: we solicit evaluation proposals, so a submission is judged on how well it measures and on what it teaches us, operationalized as
Metric quality (30%)
Well-defined and reproducible. The evaluation maps the provided output logs to a result unambiguously, with documented, runnable code; any run variation it relies on is declared in the config, not hard-coded.
Discriminating. The metric moves with the signal of interest and stays flat under noise: it recovers planted results where we provide them (shared task track) and otherwise (open-task track) reveals the behaviour of the contributed study's target observable.
Robust. Its conclusions survive reasonable design perturbations and obvious shortcuts.
These are operationalized with a scoring rubric.
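As an illustration of what "discriminating" can mean in practice, the sketch below runs a two-sided permutation test on a metric's per-run values. It is one possible sanity check under stated assumptions, not the organizers' scoring rubric: the metric values are made up, and the grouping into a manipulated condition versus a reshuffled set of seeds is hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the absolute difference of means."""
    rng = random.Random(seed)  # fixed seed so the check is reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[: len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Made-up metric values per run, for illustration only.
baseline = [0.10, 0.12, 0.11, 0.09, 0.13]
treated = [0.30, 0.33, 0.29, 0.31, 0.32]   # condition expected to matter
reseeded = [0.11, 0.10, 0.12, 0.13, 0.09]  # same values, different seed order

p_signal = permutation_p_value(baseline, treated)
p_noise = permutation_p_value(baseline, reseeded)
print(f"signal p={p_signal:.4f}, noise p={p_noise:.4f}")
```

A metric that behaves as intended would give a small p-value for the manipulated condition and a large one for the seed-only comparison; the thresholds to use are a design choice that a submission should state.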
Conceptual contribution (70%)
Illuminating. It clarifies what we actually learn from the simulation, grounded in stated precedent (e.g. a field-specific epistemic norm) or principle. The analysis should begin to point to a source within the simulation rather than treating the simulator as a black box.
Distinctive. It opens new ways of thinking about simulation quality and articulates what LLM-based modeling makes visible that earlier approaches could not.
The submissions will be judged by an expert panel consisting of workshop speakers, organizers, and external referees. There will be awards for the winner and runner-up, as well as an award for creativity.