Cruciverb-IT at EVALITA 2026 consists of two tasks.
The first task consists of answering clues extracted from Italian crosswords. Specifically, the task is formatted as a question-answering problem: participants are presented with a set of clues C={c₁, c₂, ..., cₙ} and are asked to build a system that, for a given clue cᵢ, produces one or more candidate solutions S={s₁, s₂, ..., sₘ}, ideally containing the correct answer sᵢ. To simulate a more realistic crossword-solving scenario and to further guide the systems towards the correct answer space, each clue cᵢ is paired with the character length of the target answer sᵢ. For example: given the clue and target character length “Sono un fiore di straordinaria bellezza, 4” (“I am a flower of extraordinary beauty, 4”), the systems should produce a list of one or more candidates, e.g. {iris, rosa, rose, yuzu, fior, ...}, ideally containing the correct answer rosa.
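To make the expected input/output concrete, the sketch below shows one plausible representation of a clue and of a system's ranked output; the field names and structure are our own illustrative choice, not the official Cruciverb-IT data format.

```python
# Illustrative only: field names are assumptions, not the official format.
clue_record = {
    "clue": "Sono un fiore di straordinaria bellezza",  # the clue text
    "length": 4,                                        # target answer length
}

# Systems return candidates sorted by probability (most probable first),
# each respecting the target character length.
candidates = ["iris", "rosa", "rose", "fior"]
```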
The second task consists of autonomously solving Italian crossword grids. The participants are presented with a set of empty crossword grids G={G₁, G₂, ..., Gₖ}, where each grid Gᵢ is paired with a list of clues, each one annotated with the (x, y) coordinates of the square where the corresponding solution starts in the grid and with the direction, either down (verticale) or across (orizzontale). A crossword grid Gᵢ is an n × n matrix in which each square is either blank (fillable) or black (blocked). The developed systems should autonomously fill the grid with the appropriate solutions, yielding a fully or partially filled crossword grid that ensures a consistent overlap between the characters of crossing words and maximizes the number of appropriate solutions correctly placed in the grid.
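As a concrete illustration of this setup, a grid and its clue annotations could be represented as below; the encoding ('#' for black squares, '.' for blanks) and the field names are assumptions made for exposition, not the official release format.

```python
# Illustrative only: '#' marks a black square, '.' a blank square to fill.
grid = [
    list("....#"),
    list("....."),
    list("..#.."),
    list("....."),
    list("#...."),
]

# Each clue is annotated with the (x, y) start square and its direction.
clues = [
    {"x": 0, "y": 0, "direction": "orizzontale", "clue": "..."},
    {"x": 0, "y": 0, "direction": "verticale",   "clue": "..."},
]
```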
For the Cruciverb-IT tasks we relied on two datasets; please refer to the Data page for their description.
Participants can take part in either one of the tasks or in both. Importantly, for both tasks, leveraging external data sources that explicitly contain crossword clues is forbidden. All other kinds of external data and resources are allowed (e.g., dictionaries, Wikipedia, WordNet, pre-trained/fine-tuned language models, embeddings, and so on). In such cases, participants must also provide the list of external data and resources used to develop their systems. We enforce this rule because augmenting data with crossword-related external sources may result in training contamination and, given enough gold examples and highly engineered search techniques, such systems can achieve performance comparable or even superior to professional human solvers, as demonstrated by Dr. Fill. For both tasks, systems must produce for each clue a list of candidate solutions of arbitrary size, sorted by probability (i.e., the first candidate is the most probable).
For clues-answering, our baseline is obtained by approaching the task as an information retrieval problem: given a clue cᵢ from the test set Cₜₑₛₜ = {c₁, ..., cₙ}, our system ranks the most similar clues by computing a similarity score between cᵢ and each clue in the training set Cₜᵣₐᵢₙ = {c₁, ..., cₘ}. After selecting the top ten most similar clues, we extract the corresponding ten answers. The similarity scores between clues are estimated using the BM25 algorithm, a well-established ranking function in the field of Information Retrieval.
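A minimal sketch of this retrieval baseline, assuming the rank_bm25 package and simple whitespace tokenization (both assumptions on our part; the official baseline may differ in preprocessing and indexing):

```python
from rank_bm25 import BM25Okapi

# Toy training clues and their gold answers (illustrative only).
train_clues = ["sono un fiore di straordinaria bellezza", "la capitale d'italia"]
train_answers = ["rosa", "roma"]

# Index the tokenized training clues with BM25.
bm25 = BM25Okapi([c.split() for c in train_clues])

def answer_clue(clue: str, k: int = 10) -> list[str]:
    """Return the answers of the k training clues most similar to `clue`."""
    scores = bm25.get_scores(clue.split())
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Candidates could additionally be filtered by the target answer length.
    return [train_answers[i] for i in top]

print(answer_clue("un fiore di rara bellezza"))  # -> ['rosa', 'roma']
```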
For the task of solving crossword grids, our baseline combines the aforementioned ranker with an additional module that searches for an assignment maximizing the weighted satisfaction of soft constraints while respecting the grid’s hard constraints. By treating crossword puzzles as a weighted Max-SMT problem, as partially described in Kulshreshtha et al. (2022), we defined a set of hard and soft logical constraints over the grid variables (squares): each clue corresponds to a sequence of grid variables constrained to match one of its candidate answers, obtained through the Task 1 baseline, forming a disjunctive (OR) group. These candidate-level constraints are then combined conjunctively (AND) across all clues. Intersections are enforced implicitly by shared cell variables (i.e., crossing words write into the same cell), ensuring character consistency between overlapping horizontal and vertical words. Each candidate is paired with its ranking weight so that candidate importance is treated as a soft preference during maximization. The final formulation is passed to the Z3 optimizer (de Moura and Bjørner, 2008), for which we modify an open-source implementation: https://github.com/pncnmnp/Crossword-Solver. The optimizer satisfies all hard constraints and maximizes the weighted satisfaction of soft constraints. Importantly, our baseline approach can yield partially filled grids. We run the solver with a candidate size of 10 per clue.
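As a minimal sketch of this formulation, the snippet below encodes a single crossing of two two-letter slots with the z3-solver Python bindings; the slots, candidate words, and weights are invented for illustration, and the actual baseline (the modified Crossword-Solver code) additionally handles full grids, black squares, and slots left empty for partial fills.

```python
from z3 import And, Int, Optimize, Or, sat

opt = Optimize()

# One Int variable per fillable cell, holding a character code ('a'..'z').
positions = [(0, 0), (0, 1), (1, 0)]
cells = {pos: Int(f"cell_{pos[0]}_{pos[1]}") for pos in positions}
for v in cells.values():
    opt.add(v >= ord("a"), v <= ord("z"))  # hard: cell domain

def placed(slot, word):
    """Constraint: `word` is written into the cells of `slot`."""
    return And([cells[pos] == ord(ch) for pos, ch in zip(slot, word)])

# Two crossing slots sharing cell (0, 0); candidate words with toy weights.
slots = {"1-across": [(0, 0), (0, 1)], "1-down": [(0, 0), (1, 0)]}
candidates = {
    "1-across": [("re", 10), ("do", 7)],
    "1-down":   [("do", 9), ("re", 5)],
}

for slot_id, slot in slots.items():
    words = candidates[slot_id]
    # Hard: each slot matches one of its candidates (OR group); the groups of
    # all slots are conjoined (AND). The crossing is enforced implicitly,
    # since both slots write into the shared cell variable at (0, 0).
    opt.add(Or([placed(slot, w) for w, _ in words]))
    # Soft: prefer higher-ranked candidates via weighted soft constraints.
    for w, weight in words:
        opt.add_soft(placed(slot, w), weight)

if opt.check() == sat:
    model = opt.model()
    print({pos: chr(model[v].as_long()) for pos, v in cells.items()})
```

On this toy instance the optimizer fills both slots with “do” (total soft weight 9 + 7 = 16), beating the “re”/“re” assignment (10 + 5 = 15), which illustrates how the weighted maximization trades off candidate ranks across crossing slots.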
The evaluation of the systems will be conducted with the following metrics:
Task 1: Accuracy@1 and Accuracy@10, that is, the accuracy in retrieving the correct solution word given the corresponding clue, considering the top 1 and top 10 candidates produced by a system; Mean Reciprocal Rank (MRR), that is, the average of the reciprocal ranks of the first relevant item across all clues.
Task 2: % of correct characters (CharAcc), that is, the accuracy in inserting the correct characters in the correct squares; % of correct words (WordAcc), that is, the accuracy in inserting the correct words in the correct slots; % of grids solved correctly (GridAcc), that is, the accuracy in solving the entire grid. A partially filled grid will be evaluated counting empty squares as errors; a computation sketch for all metrics follows.
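The sketch below shows how these metrics could be computed; the function names, the '#' convention for black squares, and the use of None for empty squares are our assumptions, and the official scorer may differ in details such as normalization or tie handling.

```python
def accuracy_at_k(ranked, gold, k):
    """Accuracy@k: fraction of clues whose gold answer is in the top-k list."""
    return sum(g in cands[:k] for cands, g in zip(ranked, gold)) / len(gold)

def mrr(ranked, gold):
    """MRR: mean of 1/rank of the gold answer (0 when absent) over all clues."""
    rr = [1.0 / (cands.index(g) + 1) if g in cands else 0.0
          for cands, g in zip(ranked, gold)]
    return sum(rr) / len(rr)

def char_acc(pred_grid, gold_grid):
    """CharAcc: fraction of fillable squares holding the correct character.
    Empty squares (None) in a partially filled grid count as errors."""
    pairs = [(p, g) for prow, grow in zip(pred_grid, gold_grid)
             for p, g in zip(prow, grow) if g != "#"]  # skip black squares
    return sum(p == g for p, g in pairs) / len(pairs)

def word_acc(pred_words, gold_words):
    """WordAcc: fraction of slots filled with the correct word (None = error)."""
    return sum(p == g for p, g in zip(pred_words, gold_words)) / len(gold_words)

def grid_acc(pred_grids, gold_grids):
    """GridAcc: fraction of grids that are entirely correct."""
    return sum(p == g for p, g in zip(pred_grids, gold_grids)) / len(gold_grids)
```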