LLMs4OL 2026: Large Language Models for Ontology Learning
The 3rd LLMs4OL Challenge @ ISWC 2026
ISWC 2026, Bari, Italy | 25-29 October
End-to-End Ontology Learning
Definition: Given raw text, construct a primitive ontology including terms, types, taxonomy, and relations.
Motivation: Real-world ontology construction is not a collection of independent subtasks. It requires joint reasoning across term/type discovery, term typing, taxonomy discovery, and non-taxonomic relationship extraction. This task evaluates whether LLM-based systems can move from component intelligence to pipeline intelligence.
Given raw domain text, participants must construct a structured ontology by integrating all major Ontology Learning (OL) stages into a single, coherent pipeline. Unlike prior OL benchmarks that isolate subtasks (e.g., only taxonomy extraction), this flagship task evaluates how effectively systems—especially LLM-augmented systems—can compose multiple OL stages into an end-to-end ontology construction workflow.
Expected outcome: a system that, starting from unstructured text, produces a primitive ontology containing:
Types (concepts or classes)
Terms (instances)
Term Typings (instances mapped to classes)
Taxonomic relations (is-a/subclass)
Non-taxonomic relations
A connected, coherent ontology graph
All stages must be automatically derived from text via an integrated pipeline.
Provided raw text:
In a smart home system, sensors monitor environmental conditions. A temperature sensor measures room temperature. A motion sensor detects movement and triggers the alarm system. The smart thermostat receives temperature readings from the temperature sensor and adjusts the heating system. A mobile app allows the user to control the smart thermostat remotely.
➡️ 1. Term Extraction
➡️ 2. Type Extraction
➡️ 3. Term Typing
➡️ 4. Taxonomic Discovery
➡️ 5. Non-Taxonomic RE
The final output would be as follows:
(temperature sensor, instance-of, sensor)
(motion sensor, instance-of, sensor)
(alarm system, instance-of, system)
(smart thermostat, instance-of, device)
(heating system, instance-of, system)
(mobile app, instance-of, application)
(user, instance-of, person)
(sensor, is-a, device)
(system, is-a, device)
(application, is-a, device)
(temperature sensor, measures, room temperature)
(motion sensor, detects, movement)
(motion sensor, triggers, alarm system)
(smart thermostat, receives, temperature reading)
(smart thermostat, controls, heating system)
(mobile app, controls, smart thermostat)
(user, uses, mobile app)
(sensor, monitors, environmental condition)
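The worked example above can be collected into a small in-memory ontology. The following is a minimal Python sketch (the function and variable names are illustrative, not part of the official format) showing how the final triples decompose back into the pipeline's three relation kinds:

```python
# A subset of the triples from the worked example above,
# as (subject, predicate, object) tuples.
triples = [
    ("temperature sensor", "instance-of", "sensor"),
    ("motion sensor", "instance-of", "sensor"),
    ("sensor", "is-a", "device"),
    ("system", "is-a", "device"),
    ("motion sensor", "triggers", "alarm system"),
    ("user", "uses", "mobile app"),
]

TAXONOMIC = {"is-a", "instance-of"}

def split_ontology(triples):
    """Separate term typings, taxonomic relations, and non-taxonomic relations."""
    typings = [t for t in triples if t[1] == "instance-of"]
    taxonomy = [t for t in triples if t[1] == "is-a"]
    relations = [t for t in triples if t[1] not in TAXONOMIC]
    return typings, taxonomy, relations

typings, taxonomy, relations = split_ontology(triples)
```

Together, the three parts form the primitive ontology graph that the task evaluates as a whole.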
The dataset for this task consists of 4,303 training samples, each with an id, a context, and primitive-ontology-triples. Participants may use the training data for fine-tuning or for developing their own approaches. In the test set, each sample contains only the id and context; participants must submit the primitive-ontology-triples for evaluation.
The following is an example from the dataset:
{
"id": "205b1042e99d4a109edde273e33b4071",
"context": "Title: The Interplay of Performance and Expression in Music \n\n Content: \nA music group operates as a cohesive group, bringing together artists to interpret and perform creative works. The foundation of these performances often rests upon specific forms of musical expression. For example, a libretto defines the text and narrative of an opera, while lyrics articulate the thematic content of a song. Musicians also rely on the score, a detailed notation that guides their execution. The result of this collaboration is sound, the audible phenomenon that carries the artistic message. In contemporary practice, this audio is frequently processed into a signal for recording or transmission. Historically, an analogue signal was used to maintain continuous variation matching the original waveform, but modern technology favors the digital signal, which converts the data into binary format. Despite the medium, the analogue signal, the digital signal, and the acoustic sound all function as distinct mechanisms for conveying the underlying musical expression.",
"primitive-ontology-triples": [
["music group", "is-a", "group"],
["libretto", "is-a", "musical expression"],
["signal", "is-a", "musical expression"],
["analogue signal", "is-a", "signal"],
["digital signal", "is-a", "signal"],
["lyrics", "is-a", "musical expression"],
["sound", "is-a", "musical expression"],
["score", "is-a", "musical expression"],
["digital signal", "disjoint with", "analogue signal"],
["analogue signal", "disjoint with", "digital signal"]
]
}
The context for each sample consists of a title and a content body, which are combined to form the context from which the primitive ontology is derived. The content is a short scientific passage elaborating on the title.
The Flagship Task train dataset is available for download via https://github.com/sciknoworg/LLMs4OL-Challenge/tree/main/2026/TaskA-Flagship
Important notes:
The dataset spans multiple domains; although domain labels are not provided, considering a multi-domain modeling perspective is encouraged but not mandatory.
The dataset went through a multi-phase quality check to ensure it is well suited for modeling, which also makes it a solid basis for data augmentation if needed.
Note that some samples may contain only types, while others may contain only terms.
List of ontologies to avoid using for training in compliance with the challenge policy: OBI, FOAF, CopyrightOnto, Metadata4Ing, PROCO, PTO, SWEET, SPDocument, MDSOnto, MatOnto, AgrO, TimelineOntology, MusicOntology, GTS, PeriodicTable, GND, QUDT, SchemaOrg, GeoNames, FoodOn, DOID, GoodRelations, BFO, ENVO, Conference, VIMMP, VIBSO, OM, DoCO, AUTO, Wine, DOLCE, CCO, DBpedia, MaterialInformation, LexInfo.
Standard Metrics: Precision, Recall, F1
Task-specific Metrics: Precision, Recall, and F1 scores for Term Typing, Taxonomy Discovery, and Non-Taxonomic RE.
Graph Similarity Metric
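One plausible way to compute the standard metrics is exact-match set overlap between predicted and gold triples. The sketch below assumes exact string matching; the official scorer may normalize or match triples differently:

```python
def triple_prf(predicted, gold):
    """Precision, recall, and F1 over exact-match (subject, relation, object) triples."""
    pred, ref = set(map(tuple, predicted)), set(map(tuple, gold))
    tp = len(pred & ref)  # true positives: triples present in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1
```

Restricting the input triples to a single relation kind (e.g., only `is-a` edges) yields the corresponding task-specific scores.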
Fine-tuning an LLM for triple extraction: fine-tuning a smaller LLM (e.g., LLaMA 3, Mistral) with structured-output supervision on the full pipeline.
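One way to prepare such structured-output supervision is to serialize each dataset sample into an instruction/response record. This is only a sketch; the instruction wording is illustrative, not a prescribed prompt:

```python
import json

def to_sft_record(sample):
    """Turn one dataset sample into an instruction-tuning record.

    The instruction text below is illustrative, not an official prompt.
    """
    return {
        "instruction": "Extract primitive ontology triples from the text.",
        "input": sample["context"],
        "output": json.dumps(sample["primitive-ontology-triples"]),
    }
```

Serializing the gold triples as JSON in the `output` field lets the model's generations be parsed back into triples for evaluation.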
Use classic NLP + LLMs in combination: Named Entity Recognition (NER) for term extraction, Relation Extraction (RE) models (e.g., fine-tuned BERT/LLM) for non-taxonomic relations, LLM prompting for typing and taxonomy induction.
Agentic AI approach: Moving beyond single-step predictions to multi-step reasoning systems that use different techniques to build the triplets.
Iterative / Self-Refinement Prompting: Have the LLM generate an initial set of ontology triples, then critique and refine them in subsequent passes, checking for consistency (e.g., ensuring that is-a hierarchies are acyclic and that all terms have a type). This can improve graph coherence, which is explicitly evaluated.
or ...
There is no restriction in terms of approaches!
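Whichever approach is chosen, consistency checks such as the acyclicity constraint mentioned under self-refinement can be automated. Below is a minimal sketch using only the Python standard library; the function name is illustrative:

```python
def is_a_acyclic(triples):
    """Return True if the is-a edges in the triples form a DAG (no subclass cycles)."""
    edges = {}
    for s, rel, o in triples:
        if rel == "is-a":
            edges.setdefault(s, []).append(o)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = {}

    def dfs(node):
        color[node] = GRAY
        for nxt in edges.get(node, []):
            c = color.get(nxt, WHITE)
            if c == GRAY:  # back edge to an ancestor -> cycle
                return False
            if c == WHITE and not dfs(nxt):
                return False
        color[node] = BLACK
        return True

    return all(dfs(n) for n in list(edges) if color.get(n, WHITE) == WHITE)
```

Running such a check between refinement passes gives the LLM concrete, verifiable feedback rather than relying on self-critique alone.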