GitHub: https://github.com/neuro-symbolic-ai/semeval_2026_task_11
Slack Channel: semeval2026task11
SemEval-2026 Task 11 aims to investigate how Large Language Models (LLMs) can acquire content-independent reasoning mechanisms, thereby mitigating content biases that affect their logical reasoning capabilities in a multilingual setting.
The extent to which LLMs can learn generalizable, content-independent reasoning mechanisms is still largely disputed in the NLP research community. Recent work, for example, has shown that LLMs suffer from content biases when assessing or formulating logical arguments: they tend to overestimate the formal validity of arguments that are compatible with world knowledge, underestimate the formal validity of less plausible arguments, or exhibit biases depending on the content of the arguments (including the concrete entities, terms, and languages involved).
This phenomenon, known as content effect, suggests that formal reasoning in LLMs is inherently entangled with the world knowledge acquired during pre-training. This contributes to well-known limitations, such as susceptibility to spurious correlations and biases in decision-making, which ultimately affect the applicability of LLMs in critical real-world applications.
Because of its impact on reasoning and reliability, several works have proposed solutions for better disentangling content from formal reasoning, including neuro-symbolic methods, prompting techniques, supervised fine-tuning via reasoning demonstrations, and steering methods. However, there is still no effective solution to fully address this challenge. Moreover, current studies on content effect are still largely limited to English and have yet to be extended to multilingual settings.
We propose a shared task on multilingual syllogistic reasoning to improve our understanding of how to disentangle content from formal reasoning in LLMs. In this task, participants will be presented with syllogistic arguments in different languages that can be aligned (i.e. plausible) or misaligned (i.e. implausible) with world knowledge. The goal is to build models that can assess the formal validity of the arguments, regardless of their plausibility.
To this end, we will release a novel large-scale dataset spanning multiple syllogistic schemes in different languages, measuring both accuracy in assessing the validity of the arguments and how the content effect varies across languages. We hope this task will bring together different practitioners and researchers to shed light on potential solutions to mitigate such a timely and persistent challenge in the field.
While every type of solution is allowed, in the spirit of advancing scientific knowledge and understanding, we encourage participants to submit solutions based on open-source or open-weight models that are natively multilingual and that, at the same time, shed light on the internal reasoning mechanisms.
Valentino, M., Kim, G., Dalal, D., Zhao, Z., & Freitas, A. (2025). Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering. arXiv preprint arXiv:2505.12189.
Ranaldi, L., Valentino, M., and Freitas, A. (2025). Improving chain-of-thought reasoning via quasi-symbolic abstractions. ACL 2025.
Kim, G., Valentino, M., and Freitas, A. (2025). Reasoning circuits in language models: A mechanistic interpretation of syllogistic inference. Findings of ACL 2025.
Wysocka, M., Carvalho, D., Wysocki, O., Valentino, M., and Freitas, A. (2025). SylloBio-NLI: Evaluating large language models on biomedical syllogistic reasoning. NAACL 2025.
Seals, T. and Shalin, V. (2024). Evaluating the deductive competence of large language models. NAACL 2024.
Ozeki, K., Ando, R., Morishita, T., Abe, H., Mineshima, K., and Okada, M. (2024). Exploring reasoning biases in large language models through syllogism: Insights from the NeuBAroCo dataset. Findings of ACL 2024.
Dasgupta, I., Lampinen, A. K., Chan, S. C. Y., Sheahan, H. R., Creswell, A., Kumaran, D., McClelland, J. L., and Hill, F. (2022). Language models show human-like content effects on reasoning tasks. arXiv preprint arXiv:2207.07051.
Bertolazzi, L., Gatt, A., and Bernardi, R. (2024). A systematic analysis of large language models as soft reasoners: The case of syllogistic inferences. EMNLP 2024.
Eisape, T., Tessler, M., Dasgupta, I., Sha, F., Steenkiste, S., and Linzen, T. (2024). A systematic comparison of syllogistic reasoning in humans and language models. NAACL 2024.
Quan, X., Valentino, M., Dennis, L., and Freitas, A. (2024). Verification and refinement of natural language explanations through LLM-symbolic theorem proving. EMNLP 2024.
Lyu, Q., Havaldar, S., Stein, A., Zhang, L., Rao, D., Wong, E., Apidianaki, M., and Callison-Burch, C. (2023). Faithful chain-of-thought reasoning. AACL 2023.
Xu, J., Fei, H., Pan, L., Liu, Q., Lee, M., and Hsu, W. (2024). Faithful logical reasoning via symbolic chain-of-thought. arXiv preprint arXiv:2405.18357.
SemEval-2026 Task 11 challenges models to assess the formal validity of syllogistic arguments, a key test of logical reasoning. The task consists of four subtasks and an accompanying training set.
The training set will be exclusively in English, simulating a low-resource environment. In both the training and test sets, arguments can be either plausible (aligned with world knowledge) or implausible (misaligned with it). The primary objective is to predict the validity of each argument, independently of its plausibility. Plausibility information will be provided only in the training set to help participants understand this distinction.
Here is an example from the training set:
{
  "id": "0",
  "syllogism": "Not all canines are aquatic creatures known as fish. It is certain that no fish belong to the class of mammals. Therefore, every canine falls under the category of mammals.",
  "validity": false,
  "plausibility": true
}
Here, a model should predict the correct validity label (i.e., "false").
A pilot dataset is available here. The full training set is expected to be released in September 2025. Details on the specific languages for the multilingual subtasks will be announced soon.
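For illustration, the following Python sketch loads records in this format. Both the file name "train.json" and the JSON-array layout are assumptions made for the example; the official release may be organized differently.

import json

# Load the training examples (file name and layout are illustrative assumptions).
with open("train.json", encoding="utf-8") as f:
    examples = json.load(f)

for ex in examples:
    # "plausibility" is provided in the training set only, hence .get().
    print(ex["id"], "valid:", ex["validity"], "plausible:", ex.get("plausibility"))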
Subtask 1 evaluates a model's ability to determine the validity of syllogisms in English. The test set will contain the same types of syllogisms as the training data. We will measure the following metrics:
Accuracy: The percentage of correct validity predictions.
Intra-Plausibility Content Effect: The average difference in accuracy between valid and invalid arguments at a fixed plausibility value (this measures bias towards a specific validity label).
Cross-Plausibility Content Effect: The average difference in accuracy between plausible and implausible arguments at a fixed validity value (this measures bias induced by the arguments' plausibility).
Total Content Effect: The average of the intra- and cross-plausibility content effects. A lower content effect is preferable, as it indicates that the model relies on logical structure rather than real-world content or biases.
Ranking will be based on the ratio of accuracy to total content effect. A higher ratio indicates a model that is both accurate and robust against content bias.
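As one illustrative reading of these definitions (not the official scoring script), the metrics and ranking score could be computed as follows, assuming gold records and binary validity predictions aligned by index:

from statistics import mean

def accuracy(examples, preds):
    # Fraction of examples whose predicted validity matches the gold label.
    return mean(p == ex["validity"] for ex, p in zip(examples, preds))

def subset_accuracy(examples, preds, validity, plausibility):
    # Accuracy restricted to examples with the given gold validity and plausibility.
    subset = [(ex, p) for ex, p in zip(examples, preds)
              if ex["validity"] == validity and ex["plausibility"] == plausibility]
    return mean(p == ex["validity"] for ex, p in subset) if subset else 0.0

def total_content_effect(examples, preds):
    # Intra-plausibility: accuracy gap between valid and invalid arguments
    # at a fixed plausibility value, averaged over the two plausibility values.
    intra = mean(abs(subset_accuracy(examples, preds, True, pl)
                     - subset_accuracy(examples, preds, False, pl))
                 for pl in (True, False))
    # Cross-plausibility: accuracy gap between plausible and implausible
    # arguments at a fixed validity value, averaged over the two validity values.
    cross = mean(abs(subset_accuracy(examples, preds, v, True)
                     - subset_accuracy(examples, preds, v, False))
                 for v in (True, False))
    return (intra + cross) / 2

# Subtask 1 ranking score; how a zero content effect is handled is not
# specified here, so this line is only indicative:
# score = accuracy(examples, preds) / total_content_effect(examples, preds)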
Subtask 2 tests a model's robustness by including irrelevant or "noisy" premises. The main goal is still to determine the syllogism's validity, but with an added challenge: models must also identify and select only the premises relevant to the conclusion.
Binary Prediction Evaluation: We will use the same accuracy and content effect metrics as in the first subtask.
Premise Selection Evaluation: We will use the F1 score to measure how well the model identifies the correct, relevant premises.
Ranking will be based on the ratio of the average of accuracy and F1 to the total content effect. A higher ratio indicates a model that is both accurate and robust against content bias.
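A sketch of how the Subtask 2 score could combine the two evaluations, assuming premises are referenced by identifiers; the set-based F1 and the exact aggregation are assumptions for illustration:

def premise_selection_f1(gold_premises, predicted_premises):
    # Set-based F1 over premise identifiers (the submission format is assumed).
    gold, pred = set(gold_premises), set(predicted_premises)
    if not gold or not pred:
        return 0.0
    precision = len(gold & pred) / len(pred)
    recall = len(gold & pred) / len(gold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def subtask2_score(acc, f1, total_ce):
    # Ratio of the mean of accuracy and premise-selection F1 to the total content effect.
    return ((acc + f1) / 2) / total_ce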
Subtask 3 extends the binary validity classification to multiple languages. We will measure:
Accuracy: The percentage of correct validity predictions in each target language.
Multilingual Content Effect: This two-part metric will measure both the total content effect (calculated via the intra- and cross-plausibility metrics) within the target language (e.g., Chinese) and the difference in total content effect between the target language and English. A smaller difference suggests better cross-lingual generalization.
Ranking will be based on the ratio of accuracy to the multilingual content effect score.
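How the two parts of the multilingual content effect are combined into a single score has not been specified; the sketch below assumes a simple average, purely for illustration:

def multilingual_content_effect(total_ce_target, total_ce_english):
    # Two components: the total content effect within the target language and its
    # absolute difference from the English total content effect. The simple
    # average used here is an assumption, not the official formula.
    return (total_ce_target + abs(total_ce_target - total_ce_english)) / 2

def subtask3_score(acc_target, multilingual_ce):
    # Ranking: accuracy in the target language over the multilingual content effect score.
    return acc_target / multilingual_ce

The same ratio structure carries over to Subtask 4, with the average of accuracy and F1 in the numerator.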
Subtask 4 is the multilingual version of the robustness evaluation, challenging your model to handle noisy, irrelevant premises in multiple languages. Your model will be evaluated on both binary validity classification and premise selection.
Binary Prediction Evaluation: We will use the same accuracy and multilingual content effect metrics as in Subtask 3.
Premise Selection Evaluation: We will use the F1 score to measure how well your model identifies the correct, relevant premises.
Ranking will be based on the ratio of the average between accuracy and F1 to the multilingual content effect score.
The competition will be hosted on Codabench. Additional details will be announced soon!
University of Sheffield
University of Edinburgh
University of Aberdeen
University of Rome Tor Vergata
University of Manchester & Idiap Research Institute