Research Highlight Generation from Scientific Papers

(SciHigh)

@ FIRE-2025

17th - 20th December, 2025

Results of the SciHigh-2025 Shared Task are now available here

Task Description

This shared task focuses on automatically generating research highlights from scientific paper abstracts using the MixSub dataset, which we introduced in earlier work [1]. Both research highlights and abstracts serve as summaries of a research paper, but highlights offer a more structured and concise version of the key contributions. The goal of this task is to develop machine learning models that can generate high-quality highlights similar to those written by authors.
Participants will explore different summarization techniques, including transformer-based models, retrieval-augmented approaches, and fine-tuned neural networks.
The task aims to improve the efficiency and accuracy of highlight generation, which can benefit researchers and academic indexing platforms.

Use Cases

The generated highlights can be useful in multiple scenarios, such as:

✔ Helping researchers quickly understand the key contributions of scientific papers. Highlights are often easier to read and grasp than a longer paragraph, especially on hand-held devices.

✔ Reducing the time needed to extract relevant information from research articles.

✔ Enhancing metadata for academic search engines and digital libraries.

✔ Evaluating the effectiveness of different summarization techniques for scientific papers.

Important Dates

25th May, 2025 - Training and Validation data release (Training_dataset_download, Validation_dataset_download)

15th June, 2025 - Test data release (Test_dataset_download (Masked), Test_dataset_download)

30th June, 2025 - Run submission deadline

10th July, 2025 - Run submission deadline (extended)

15th July, 2025 - Results declared

30th July, 2025 - Results declared (extended)

30th August, 2025 - Working notes due

30th September, 2025 - Camera-ready copies of working notes and overview paper due

17th December, 2025 - FIRE conference

NOTE: All dates are in the AoE (Anywhere on Earth) time zone.

Python script to evaluate your submission

Dataset

For this shared task, we utilize a subset of the MixSub dataset [1], referred to as MixSub-SciHigh. The MixSub corpus was created by collecting research articles from ScienceDirect, encompassing a diverse range of scientific domains. It comprises 19,785 research papers published in the year 2020. Each data instance is structured as a pair consisting of the abstract and the corresponding author-written research highlights.

Each entry in the dataset includes:

Abstract: A concise summary of the research paper.

Research Highlights: Key contributions manually written by the authors.

An example is given below.

An (abstract, highlights) pair from the MixSub dataset. Taken from https://www.sciencedirect.com/science/article/pii/S0001457519307213

The MixSub-SciHigh dataset is split into three sets:

Training Set: 10,000 data instances,

Validation Set: 1,985 data instances,

Test Set: 1,840 data instances (masked ground truth).

Format: CSV files containing three columns: Filename, Abstract, Highlights
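As a rough illustration of the format described above, the sketch below reads (Filename, Abstract, Highlights) rows with Python's standard csv module. It assumes the files carry a header row with exactly those column names, which is not confirmed here; the function name read_pairs and the sample row are invented for the example.

```python
import csv
import io


def read_pairs(handle):
    """Yield (filename, abstract, highlights) tuples from an open CSV handle.

    Assumption: the first row is a header naming the three columns
    Filename, Abstract and Highlights.
    """
    reader = csv.DictReader(handle)
    for row in reader:
        yield row["Filename"], row["Abstract"], row["Highlights"]


# Tiny in-memory stand-in for a real split file (contents are invented).
sample = io.StringIO(
    "Filename,Abstract,Highlights\n"
    'paper_001,"We study X, Y and Z.",X is studied. Y and Z are compared.\n'
)
pairs = list(read_pairs(sample))
```

For a file on disk, the same function could be called on a handle opened with open(path, newline="", encoding="utf-8"). Quoted fields let an abstract contain commas without breaking the three-column layout.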

Evaluation Plan

Submissions will be evaluated using the following automatic metrics:

ROUGE-1, ROUGE-2, ROUGE-L: Measure lexical overlap with the reference highlights.
METEOR: Evaluates similarity to the reference highlights, matching stems and synonyms in addition to exact words.

✔ Participants are encouraged to analyse common challenges such as hallucinations (incorrect information) and factual inconsistencies in the generated highlights.

✔ The submitted entries will be ranked using the F1-score for the ROUGE-L metric. Innovative ideas in the proposed solution will be appreciated.
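To make the ranking metric concrete, here is a minimal sketch of a ROUGE-L F1 computation based on the longest common subsequence (LCS) of tokens. It is not the official evaluation script: it assumes lowercased whitespace tokenization with no stemming and treats each text as a single sequence, so its scores may differ from those of the organizers' script. The function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between a reference and a candidate string.

    Assumption: lowercased whitespace tokens, no stemming.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
```

In this example the LCS is "the cat on the mat" (5 of 6 tokens on each side), so precision, recall, and F1 all equal 5/6.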

Submission Format

You should submit a single .zip file containing all required files to this email ID only: tohidarehman.it@jadavpuruniversity.in

Submission Requirements:

Participants must upload their trained model checkpoint to Hugging Face and share the link to the uploaded model.
Along with the Hugging Face model link, you must include a .csv file containing your results.

CSV File Format:

Each line in the .csv file should contain three comma-separated columns in the following order:

Filename
Abstract
Predicted_Highlights
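One possible way to produce a file in this column order is sketched below with Python's standard csv module, which quotes any field containing a comma so that each record still parses as three columns. Whether a header row is expected is an assumption here, and write_submission and the sample row are invented for illustration.

```python
import csv
import io


def write_submission(handle, rows):
    """Write (filename, abstract, predicted_highlights) rows as CSV.

    Assumption: a header row with the three column names is acceptable.
    """
    writer = csv.writer(handle, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["Filename", "Abstract", "Predicted_Highlights"])
    writer.writerows(rows)


# In-memory demonstration; a real run would write to <team_name>_<run_identifier>.csv.
buffer = io.StringIO()
write_submission(
    buffer,
    [("paper_001", "We study X, Y and Z.", "X is studied.")],
)
```

When writing to disk, opening the file with newline="" avoids doubled line endings on some platforms.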

File Naming Conventions:

The .csv file must be named using the following format: <team_name>_<run_identifier>.csv
Example: TeamA_run1.csv, where:
- TeamA is your team name
- run1 identifies the specific run
✅ Use underscores (_) only as shown above.
❌ Do not use blank spaces, tabs, or additional underscores in the file name.
The .zip file should be named after the email ID used for registration (excluding the domain).
For example, if your registered email ID is hello@gmail.com, then your zip file should be named:
hello.zip
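The naming rules above can be checked before submitting with a small helper such as the sketch below. It assumes the team name and run identifier may contain any characters other than whitespace and underscores, which is one reading of the rules and not an official validator. Both function names are hypothetical.

```python
import re

# Exactly one underscore, separating a team name from a run identifier.
# Assumption: neither part may contain whitespace or underscores.
CSV_NAME = re.compile(r"([^\s_]+)_([^\s_]+)\.csv")


def parse_csv_name(name):
    """Return (team_name, run_identifier), or None if the name is malformed."""
    match = CSV_NAME.fullmatch(name)
    return match.groups() if match else None


def zip_name_for(email):
    """Name the .zip after the part of the registered email before '@'."""
    local_part = email.split("@", 1)[0]
    return local_part + ".zip"


good = parse_csv_name("TeamA_run1.csv")
bad = parse_csv_name("Team A_run_1.csv")
archive = zip_name_for("hello@gmail.com")
```

Here good is ("TeamA", "run1"), bad is None because of the space and the extra underscore, and archive is "hello.zip".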

Additional Notes:

You may submit up to two solutions for this task to demonstrate different approaches or refinements of your work.

Contact

Email: tohidarehman.it@jadavpuruniversity.in

References

[1] Tohida Rehman, Debarshi Kumar Sanyal, Samiran Chattopadhyay, Plaban Kumar Bhowmick, and Partha Pratim Das. Generation of Highlights From Research Papers Using Pointer-Generator Networks and SciBERT Embeddings. IEEE Access, Volume 11, pages 91358–91374, 2023. DOI: https://doi.org/10.1109/ACCESS.2023.3292300.
