GitHub: https://github.com/neuro-symbolic-ai/semeval_2026_task_11
Slack Channel: semeval2026task11
SemEval-2026 Task 11 aims to investigate how Large Language Models (LLMs) can acquire content-independent reasoning mechanisms, thereby mitigating content biases that affect their logical reasoning capabilities in a multilingual setting.
The extent to which LLMs can learn generalizable, content-independent reasoning mechanisms is still largely disputed in the NLP research community. Recent work, for example, has shown that LLMs suffer from content biases when assessing or formulating logical arguments: they tend to overestimate the formal validity of arguments that are compatible with world knowledge, underestimate the formal validity of less plausible arguments, or exhibit biases depending on the content of the arguments (including the concrete entities, terms, and languages involved).
This phenomenon, known as content effect, suggests that formal reasoning in LLMs is inherently entangled with the world knowledge acquired during pre-training. This contributes to well-known limitations, such as susceptibility to spurious correlations and biases in decision-making, which ultimately affect the applicability of LLMs in critical real-world applications.
Because of its impact on reasoning and reliability, several works have proposed solutions for better disentangling content from formal reasoning, including neuro-symbolic methods, prompting techniques, supervised fine-tuning via reasoning demonstrations, and steering methods. However, there is still no effective solution to fully address this challenge. Moreover, current studies on content effect are still largely limited to English and have yet to be extended to multilingual settings.
We propose a shared task on multilingual syllogistic reasoning to improve our understanding of how to disentangle content from formal reasoning in LLMs. In this task, participants will be presented with syllogistic arguments in different languages that can be aligned (i.e. plausible) or misaligned (i.e. implausible) with world knowledge. The goal is to build models that can assess the formal validity of the arguments, regardless of their plausibility.
To this end, we will release a novel large-scale dataset spanning multiple syllogistic schemes in different languages, measuring both accuracy in assessing the validity of the arguments and how the content effect varies across languages. We hope this task will bring together different practitioners and researchers to shed light on potential solutions to mitigate such a timely and persistent challenge in the field.
While every type of solution is allowed, in the spirit of advancing scientific knowledge and understanding, we encourage participants to submit solutions based on open-source or open-weight models that are natively multilingual and that, at the same time, shed light on the internal reasoning mechanisms.
Valentino, M., Kim, G., Dalal, D., Zhao, Z., & Freitas, A. (2025). Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering. arXiv preprint arXiv:2505.12189.
Ranaldi, L., Valentino, M., and Freitas, A. (2025). Improving chain-of-thought reasoning via quasi-symbolic abstractions. ACL 2025.
Kim, G., Valentino, M., and Freitas, A. (2025). Reasoning circuits in language models: A mechanistic interpretation of syllogistic inference. Findings of ACL 2025.
Wysocka, M., Carvalho, D., Wysocki, O., Valentino, M., and Freitas, A. (2025). SylloBio-NLI: Evaluating large language models on biomedical syllogistic reasoning. NAACL 2025.
Seals, T. and Shalin, V. (2024). Evaluating the deductive competence of large language models. NAACL 2024.
Ozeki, K., Ando, R., Morishita, T., Abe, H., Mineshima, K., and Okada, M. (2024). Exploring reasoning biases in large language models through syllogism: Insights from the NeuBAroCo dataset. Findings of ACL 2024.
Dasgupta, I., Lampinen, A. K., Chan, S. C. Y., Sheahan, H. R., Creswell, A., Kumaran, D., McClelland, J. L., and Hill, F. (2022). Language models show human-like content effects on reasoning tasks. arXiv preprint arXiv:2207.07051.
Bertolazzi, L., Gatt, A., and Bernardi, R. (2024). A systematic analysis of large language models as soft reasoners: The case of syllogistic inferences. EMNLP 2024.
Eisape, T., Tessler, M., Dasgupta, I., Sha, F., Steenkiste, S., and Linzen, T. (2024). A systematic comparison of syllogistic reasoning in humans and language models. NAACL 2024.
Quan, X., Valentino, M., Dennis, L., and Freitas, A. (2024). Verification and refinement of natural language explanations through LLM-symbolic theorem proving. EMNLP 2024.
Lyu, Q., Havaldar, S., Stein, A., Zhang, L., Rao, D., Wong, E., Apidianaki, M., and Callison-Burch, C. (2023). Faithful chain-of-thought reasoning. AACL 2023.
Xu, J., Fei, H., Pan, L., Liu, Q., Lee, M., and Hsu, W. (2024). Faithful logical reasoning via symbolic chain-of-thought. arXiv preprint arXiv:2405.18357.
SemEval-2026 Task 11 challenges models to assess the formal validity of syllogistic arguments, a key test of logical reasoning. The task consists of four subtasks and an accompanying training set.
The training set will be exclusively in English, simulating a low-resource environment. In both the training and test sets, arguments can be either plausible (aligned with world knowledge) or implausible (misaligned with it). The primary objective is to predict the validity of each argument, independently of its plausibility. Plausibility information will be provided only in the training set to help participants understand this distinction.
Here is an example from the training set:
{
  "id": "0",
  "syllogism": "Not all canines are aquatic creatures known as fish. It is certain that no fish belong to the class of mammals. Therefore, every canine falls under the category of mammals.",
  "validity": false,
  "plausibility": true
}
Here, a model should predict the correct validity label (i.e., "false").
A pilot dataset is available here. The full training set is expected to be released in September 2025. Details on the specific languages for the multilingual subtasks will be announced soon.
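For illustration, the following Python sketch loads records in this format. Both the file name "train.json" and the JSON-array layout are assumptions made for the example; the official release may be organized differently.

import json

# Load the training examples (file name and layout are illustrative assumptions).
with open("train.json", encoding="utf-8") as f:
    examples = json.load(f)

for ex in examples:
    # "plausibility" is provided in the training set only, hence .get().
    print(ex["id"], "valid:", ex["validity"], "plausible:", ex.get("plausibility"))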
Subtask 1 evaluates a model's ability to determine the validity of syllogisms in English. The test set will contain the same types of syllogisms as the training data. We will measure the following metrics:
Accuracy: The percentage of correct validity predictions.
Intra-Plausibility Content Effect: The average difference in accuracy between valid and invalid arguments at a fixed plausibility value (this measures bias towards a specific validity label).
Cross-Plausibility Content Effect: The average difference in accuracy between plausible and implausible arguments at a fixed validity value (this measures bias induced by the arguments' plausibility).
Total Content Effect: The average of the intra- and cross-plausibility content effects. A lower content effect is preferable, as it indicates that the model relies on logical structure rather than real-world content or biases.
Ranking will be based on the ratio of accuracy to total content effect. A higher ratio indicates a model that is both accurate and robust against content bias.
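As one illustrative reading of these definitions (not the official scoring script), the metrics and ranking score could be computed as follows, assuming gold records and binary validity predictions aligned by index:

from statistics import mean

def accuracy(examples, preds):
    # Fraction of examples whose predicted validity matches the gold label.
    return mean(p == ex["validity"] for ex, p in zip(examples, preds))

def subset_accuracy(examples, preds, validity, plausibility):
    # Accuracy restricted to examples with the given gold validity and plausibility.
    subset = [(ex, p) for ex, p in zip(examples, preds)
              if ex["validity"] == validity and ex["plausibility"] == plausibility]
    return mean(p == ex["validity"] for ex, p in subset) if subset else 0.0

def total_content_effect(examples, preds):
    # Intra-plausibility: accuracy gap between valid and invalid arguments
    # at a fixed plausibility value, averaged over the two plausibility values.
    intra = mean(abs(subset_accuracy(examples, preds, True, pl)
                     - subset_accuracy(examples, preds, False, pl))
                 for pl in (True, False))
    # Cross-plausibility: accuracy gap between plausible and implausible
    # arguments at a fixed validity value, averaged over the two validity values.
    cross = mean(abs(subset_accuracy(examples, preds, v, True)
                     - subset_accuracy(examples, preds, v, False))
                 for v in (True, False))
    return (intra + cross) / 2

# Subtask 1 ranking score; how a zero content effect is handled is not
# specified here, so this line is only indicative:
# score = accuracy(examples, preds) / total_content_effect(examples, preds)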
Subtask 2 tests a model's robustness by including irrelevant or "noisy" premises. The main goal is still to determine the syllogism's validity, but with an added challenge: models must also identify and select only the premises relevant to the conclusion.
Binary Prediction Evaluation: We will use the same accuracy and content effect metrics as in the first subtask.
Premise Selection Evaluation: We will use the F1 score to measure how well the model identifies the correct, relevant premises.
Ranking will be based on the ratio of the average of accuracy and F1 to the total content effect. A higher ratio indicates a model that is both accurate and robust against content bias.
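A sketch of how the Subtask 2 score could combine the two evaluations, assuming premises are referenced by identifiers; the set-based F1 and the exact aggregation are assumptions for illustration:

def premise_selection_f1(gold_premises, predicted_premises):
    # Set-based F1 over premise identifiers (the submission format is assumed).
    gold, pred = set(gold_premises), set(predicted_premises)
    if not gold or not pred:
        return 0.0
    precision = len(gold & pred) / len(pred)
    recall = len(gold & pred) / len(gold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def subtask2_score(acc, f1, total_ce):
    # Ratio of the mean of accuracy and premise-selection F1 to the total content effect.
    return ((acc + f1) / 2) / total_ce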
Subtask 3 extends the binary validity classification to multiple languages. We will measure:
Accuracy: The percentage of correct validity predictions in each target language.
Multilingual Content Effect: This two-part metric will measure both the total content effect (calculated via the intra- and cross-plausibility metrics) within the target language (e.g., Chinese) and the difference in total content effect between the target language and English. A smaller difference suggests better cross-lingual generalization.
Ranking will be based on the ratio of accuracy to the multilingual content effect score.
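How the two parts of the multilingual content effect are combined into a single score has not been specified; the sketch below assumes a simple average, purely for illustration:

def multilingual_content_effect(total_ce_target, total_ce_english):
    # Two components: the total content effect within the target language and its
    # absolute difference from the English total content effect. The simple
    # average used here is an assumption, not the official formula.
    return (total_ce_target + abs(total_ce_target - total_ce_english)) / 2

def subtask3_score(acc_target, multilingual_ce):
    # Ranking: accuracy in the target language over the multilingual content effect score.
    return acc_target / multilingual_ce

The same ratio structure carries over to Subtask 4, with the average of accuracy and F1 in the numerator.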
Subtask 4 is the multilingual version of the robustness evaluation, challenging your model to handle noisy, irrelevant premises in multiple languages. Your model will be evaluated on both binary validity classification and premise selection.
Binary Prediction Evaluation: We will use the same accuracy and multilingual content effect metrics as in Subtask 3.
Premise Selection Evaluation: We will use the F1 score to measure how well your model identifies the correct, relevant premises.
Ranking will be based on the ratio of the average between accuracy and F1 to the multilingual content effect score.
The competition will be hosted on Codabench. Additional details will be announced soon!
University of Sheffield
University of Edinburgh
University of Aberdeen
University of Rome Tor Vergata
University of Manchester & Idiap Research Institute