LLMs4OL 2026: Large Language Models for Ontology Learning
The 3rd LLMs4OL Challenge @ ISWC 2026
ISWC 2026, Bari, Italy | 25-29 October
End-to-End Ontology Learning
Definition: Given raw text, construct a primitive ontology including terms, types, taxonomy, and relations.
Motivation: Real-world ontology construction is not a collection of independent subtasks. It requires joint reasoning across term/type discovery, term typing, taxonomy discovery, and non-taxonomic relationship extraction. This task evaluates whether LLM-based systems can move from component intelligence to pipeline intelligence.
Given raw domain text, participants must construct a structured ontology by integrating all major Ontology Learning (OL) stages into a single, coherent pipeline. Unlike prior OL benchmarks that isolate subtasks (e.g., only taxonomy extraction), this flagship task evaluates how effectively systems—especially LLM-augmented systems—can compose multiple OL stages into an end-to-end ontology construction workflow.
Expected outcome: a system that, starting from unstructured text, produces a primitive ontology containing:
Types (concepts or classes)
Terms (instances)
Term Typings (instances mapped to classes)
Taxonomic relations (is-a/subclass)
Non-taxonomic relations
A connected, coherent ontology graph
All stages must be automatically derived from text via an integrated pipeline.
Provided raw text:
In a smart home system, sensors monitor environmental conditions. A temperature sensor measures room temperature. A motion sensor detects movement and triggers the alarm system. The smart thermostat receives temperature readings from the temperature sensor and adjusts the heating system. A mobile app allows the user to control the smart thermostat remotely.
➡️ 1. Term Extraction
➡️ 2. Type Extraction
➡️ 3. Term Typing
➡️ 4. Taxonomic Discovery
➡️ 5. Non-Taxonomic RE
The final output would be as follows:
(temperature sensor, instance-of, sensor)
(motion sensor, instance-of, sensor)
(alarm system, instance-of, system)
(smart thermostat, instance-of, device)
(heating system, instance-of, system)
(mobile app, instance-of, application)
(user, instance-of, person)
(sensor, is-a, device)
(system, is-a, device)
(application, is-a, device)
(temperature sensor, measures, room temperature)
(motion sensor, detects, movement)
(motion sensor, triggers, alarm system)
(smart thermostat, receives, temperature reading)
(smart thermostat, controls, heating system)
(mobile app, controls, smart thermostat)
(user, uses, mobile app)
(sensor, monitors, environmental condition)
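The worked example above can be collected into a small in-memory ontology. The following is a minimal Python sketch (the function and variable names are illustrative, not part of the official format) showing how the final triples decompose back into the pipeline's three relation kinds:

```python
# A subset of the triples from the worked example above,
# as (subject, predicate, object) tuples.
triples = [
    ("temperature sensor", "instance-of", "sensor"),
    ("motion sensor", "instance-of", "sensor"),
    ("sensor", "is-a", "device"),
    ("system", "is-a", "device"),
    ("motion sensor", "triggers", "alarm system"),
    ("user", "uses", "mobile app"),
]

TAXONOMIC = {"is-a", "instance-of"}

def split_ontology(triples):
    """Separate term typings, taxonomic relations, and non-taxonomic relations."""
    typings = [t for t in triples if t[1] == "instance-of"]
    taxonomy = [t for t in triples if t[1] == "is-a"]
    relations = [t for t in triples if t[1] not in TAXONOMIC]
    return typings, taxonomy, relations

typings, taxonomy, relations = split_ontology(triples)
```

Together, the three parts form the primitive ontology graph that the task evaluates as a whole.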
The dataset for this task consists of 4,303 training samples, each with an id, a context, and primitive-ontology-triples. Participants may use the training data for fine-tuning or for developing their own approaches. In the test set, each sample contains only the id and context; participants must submit the primitive-ontology-triples for evaluation.
The following is an example from the dataset:
{
"id": "205b1042e99d4a109edde273e33b4071",
"context": "Title: The Interplay of Performance and Expression in Music \n\n Content: \nA music group operates as a cohesive group, bringing together artists to interpret and perform creative works. The foundation of these performances often rests upon specific forms of musical expression. For example, a libretto defines the text and narrative of an opera, while lyrics articulate the thematic content of a song. Musicians also rely on the score, a detailed notation that guides their execution. The result of this collaboration is sound, the audible phenomenon that carries the artistic message. In contemporary practice, this audio is frequently processed into a signal for recording or transmission. Historically, an analogue signal was used to maintain continuous variation matching the original waveform, but modern technology favors the digital signal, which converts the data into binary format. Despite the medium, the analogue signal, the digital signal, and the acoustic sound all function as distinct mechanisms for conveying the underlying musical expression.",
"primitive-ontology-triples": [
["music group", "is-a", "group"],
["libretto", "is-a", "musical expression"],
["signal", "is-a", "musical expression"],
["analogue signal", "is-a", "signal"],
["digital signal", "is-a", "signal"],
["lyrics", "is-a", "musical expression"],
["sound", "is-a", "musical expression"],
["score", "is-a", "musical expression"],
["digital signal", "disjoint with", "analogue signal"],
["analogue signal", "disjoint with", "digital signal"]
]
}
The context for each sample consists of a title and a content body, which are combined to form the context from which the primitive ontology is derived. The content is a short scientific passage elaborating on the title.
The Flagship Task train dataset is available for download via https://github.com/sciknoworg/LLMs4OL-Challenge/tree/main/2026/TaskA-Flagship
Important notes:
The dataset spans multiple domains; although domain labels are not provided, considering a multi-domain modeling perspective is encouraged but not mandatory.
The dataset went through a multi-phase quality check to ensure it is well suited for modeling, which also makes it a solid basis for data augmentation if needed.
Note that some samples may contain only types, while others may contain only terms.
List of ontologies to avoid using for training in compliance with the challenge policy: OBI, FOAF, CopyrightOnto, Metadata4Ing, PROCO, PTO, SWEET, SPDocument, MDSOnto, MatOnto, AgrO, TimelineOntology, MusicOntology, GTS, PeriodicTable, GND, QUDT, SchemaOrg, GeoNames, FoodOn, DOID, GoodRelations, BFO, ENVO, Conference, VIMMP, VIBSO, OM, DoCO, AUTO, Wine, DOLCE, CCO, DBpedia, MaterialInformation, LexInfo.
Standard Metrics: Precision, Recall, F1
Task-specific Metrics: Precision, Recall, and F1 scores for Term Typing, Taxonomy Discovery, and Non-Taxonomic RE.
Graph Similarity Metric
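One plausible way to compute the standard metrics is exact-match set overlap between predicted and gold triples. The sketch below assumes exact string matching; the official scorer may normalize or match triples differently:

```python
def triple_prf(predicted, gold):
    """Precision, recall, and F1 over exact-match (subject, relation, object) triples."""
    pred, ref = set(map(tuple, predicted)), set(map(tuple, gold))
    tp = len(pred & ref)  # true positives: triples present in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1
```

Restricting the input triples to a single relation kind (e.g., only `is-a` edges) yields the corresponding task-specific scores.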
Fine-tuning an LLM for triple extraction: fine-tuning a smaller LLM (e.g., LLaMA 3, Mistral) with structured-output supervision on the full pipeline.
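One way to prepare such structured-output supervision is to serialize each dataset sample into an instruction/response record. This is only a sketch; the instruction wording is illustrative, not a prescribed prompt:

```python
import json

def to_sft_record(sample):
    """Turn one dataset sample into an instruction-tuning record.

    The instruction text below is illustrative, not an official prompt.
    """
    return {
        "instruction": "Extract primitive ontology triples from the text.",
        "input": sample["context"],
        "output": json.dumps(sample["primitive-ontology-triples"]),
    }
```

Serializing the gold triples as JSON in the `output` field lets the model's generations be parsed back into triples for evaluation.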
Use classic NLP + LLMs in combination: Named Entity Recognition (NER) for term extraction, Relation Extraction (RE) models (e.g., fine-tuned BERT/LLM) for non-taxonomic relations, LLM prompting for typing and taxonomy induction.
Agentic AI approach: Moving beyond single-step predictions to multi-step reasoning systems that use different techniques to build the triplets.
Iterative / Self-Refinement Prompting: Have the LLM generate an initial set of ontology triples, then critique and refine them in subsequent passes, checking for consistency (e.g., ensuring that is-a hierarchies are acyclic and that all terms have a type). This can improve graph coherence, which is explicitly evaluated.
or ...
There is no restriction in terms of approaches!
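Whichever approach is chosen, consistency checks such as the acyclicity constraint mentioned under self-refinement can be automated. Below is a minimal sketch using only the Python standard library; the function name is illustrative:

```python
def is_a_acyclic(triples):
    """Return True if the is-a edges in the triples form a DAG (no subclass cycles)."""
    edges = {}
    for s, rel, o in triples:
        if rel == "is-a":
            edges.setdefault(s, []).append(o)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = {}

    def dfs(node):
        color[node] = GRAY
        for nxt in edges.get(node, []):
            c = color.get(nxt, WHITE)
            if c == GRAY:  # back edge to an ancestor -> cycle
                return False
            if c == WHITE and not dfs(nxt):
                return False
        color[node] = BLACK
        return True

    return all(dfs(n) for n in list(edges) if color.get(n, WHITE) == WHITE)
```

Running such a check between refinement passes gives the LLM concrete, verifiable feedback rather than relying on self-critique alone.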