Research Highlight Generation from Scientific Papers

(SciHigh-2026)

@ FIRE-2026

17th - 20th December, 2026

Task Description

This shared task comprises three subtasks:

Research Highlight Generation:

This subtask focuses on automatically generating research highlights from scientific paper abstracts using the MixSub dataset proposed in our earlier work [1]. Both research highlights and abstracts serve as summaries of a research paper, but highlights offer a more structured and concise version of the key contributions. The goal of this task is to develop machine learning models that can generate high-quality highlights similar to those written by authors.
Participants will explore different summarization techniques, including transformer-based models, retrieval-augmented approaches, and fine-tuned neural networks.
We fine-tuned a transformer-based model on the training subset of MixSub and evaluated it on a separate validation set using ROUGE-1, ROUGE-2, ROUGE-L, METEOR, and BERTScore. The fine-tuned Pegasus model achieved a ROUGE-L F1 score of 23.45%, securing 1st position in the SciHigh track of the FIRE 2025 conference.
The task aims to improve the efficiency and accuracy of highlight generation, which can benefit researchers and academic indexing platforms.

Title Generation from Abstracts:

This subtask focuses on generating concise and informative titles from scientific paper abstracts. Unlike general summarization, title generation requires capturing the main contribution in a highly compressed form while preserving the clarity, relevance, and main theme of the abstract [2].
Participants are expected to explore different models that can generate accurate, relevant and meaningful titles that reflect the main idea of the abstract. This task is particularly challenging as it involves abstraction, semantic understanding, and precise wording.
Titles are generated automatically from scientific paper abstracts using the SpringerSSAT dataset, which we proposed in [3]. We curated this dataset from papers in four Springer journals in the domain of social sciences. Specifically, we collected 1,152 abstract–title pairs from SN Social Sciences, 1,135 pairs from Society, 416 pairs from Race and Social Problems, and 770 pairs from Social Justice Research. In total, the SpringerSSAT dataset consists of 3,473 abstract–title pairs. We split the dataset into 2,778 papers for training, 347 for validation, and 348 for testing.
This subtask aims to improve the efficiency and correctness of title generation, thereby supporting novice researchers and enhancing academic writing assistance tools.

English–Bengali Title Translation:

To facilitate multilingual accessibility of scientific content, we propose a subtask on translating scientific paper titles from English to Bengali. We construct a curated dataset, SpringerSSAT-Tiny-Multilingual, by selecting a subset of 100 papers from the SpringerSSAT dataset. The titles of these papers are manually translated into Bengali with the help of domain experts to ensure high-quality and semantically faithful translations.
The resulting dataset consists of triplets of the form (abstract, title in English, title in Bengali). It is further divided into training, development, and test splits in a 3:1:1 ratio. The task is to develop models that can accurately translate English titles into Bengali while preserving meaning, technical terminology, and conciseness.
The generated translations will be evaluated using a combination of automatic and human evaluation metrics.

Use Cases

The proposed shared task has multiple practical applications across all three subtasks:

Research Highlight Generation:

The generated highlights can be useful in multiple scenarios, such as:

Helping researchers quickly understand the key contributions of scientific papers. Highlights are often easier to read and grasp than a longer paragraph, especially on hand-held devices.
Reducing the time required to extract relevant information from a large number of research articles.
Enhancing metadata for academic search engines and digital libraries.
Evaluating the effectiveness of different summarization techniques for scientific papers.

Title Generation from Abstracts:

The generated titles can be beneficial in the following scenarios:

Assisting novice researchers in writing concise, relevant and informative titles for their papers.
Enhancing discoverability of research papers in search engines and digital libraries through better title formulation.
Supporting editorial and indexing workflows by providing candidate titles for review.
Contributing to research in text generation by benchmarking models on highly compressed semantic representation tasks.

English–Bengali Title Translation:

Multilingual titles enhance the reach of scientific papers by making them accessible to non-English-speaking audiences, particularly Bengali-speaking researchers, students, and practitioners.

Important Dates

30th June, 2026 - Training and Validation data release

Subtask 1: MixSub Dataset (Train, Val)

Subtask 2: SpringerSSAT Dataset (Train, Val)

Subtask 3: SpringerSSAT-Tiny-Multilingual Dataset (Train, Val)

25th July, 2026 - Test data release

30th July, 2026 - Run submission deadline

28th August, 2026 - Results declared

15th September, 2026 - Working notes due

30th September, 2026 - Camera-ready copies of working notes and overview paper due

17th December, 2026 - FIRE conference

NOTE: All dates are in the AoE (Anywhere on Earth) timezone.

Dataset

For this shared task, we utilize a subset of the MixSub dataset [1], referred to as MixSub-SciHigh. The MixSub corpus was created by collecting research articles from ScienceDirect, encompassing a diverse range of scientific domains. It comprises 19,785 research papers published in the year 2020. Each data instance is structured as a pair consisting of the abstract and the corresponding author-written research highlights.

Each entry in the dataset includes:

Abstract: A concise summary of the research paper.

Research Highlights: Key contributions manually written by the authors.

An example is given below.

An (abstract, highlights) pair from the MixSub dataset. Taken from https://www.sciencedirect.com/science/article/pii/S0001457519307213

Dataset for Research Highlight Generation

The MixSub-SciHigh dataset is split into three sets:
Training Set: 10,000 data instances
Validation Set: 1,985 data instances
Test Set: 1,840 data instances (Masked ground-truth)
Format: CSV files containing three columns: Filename, Abstract, Highlights
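
The following is a minimal sketch of reading a file in this layout with Python's standard csv module. It assumes the files carry a header row with exactly the column names Filename, Abstract, and Highlights; the sample rows and the helper name load_pairs are hypothetical and only illustrate the layout.

```python
import csv
import io

# Hypothetical two-row sample standing in for a MixSub-SciHigh CSV file.
SAMPLE_CSV = (
    "Filename,Abstract,Highlights\n"
    'paper_001,"We study X, Y and Z.","X is studied. Y improves Z."\n'
    'paper_002,"A second abstract.","A second highlight."\n'
)

# Assumption: the header row uses exactly these column names.
EXPECTED_COLUMNS = ["Filename", "Abstract", "Highlights"]


def load_pairs(handle):
    """Read (filename, abstract, highlights) tuples from an open CSV file."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return [(row["Filename"], row["Abstract"], row["Highlights"]) for row in reader]


pairs = load_pairs(io.StringIO(SAMPLE_CSV))
print(len(pairs), pairs[0][0])
```

For a file on disk, the same function can be given a handle opened with newline="" and encoding="utf-8", which lets the csv module handle commas and line breaks inside quoted abstracts.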

Dataset for Title Generation from Abstracts

For the second subtask, we introduce the SpringerSSAT dataset [3], curated from four Springer journals in the domain of social sciences: SN Social Sciences, Society, Race and Social Problems, and Social Justice Research.
Each instance in this dataset is structured as a pair: (abstract, title).
The dataset consists of 3,473 abstract–title pairs, including:
Training Set: 2,778 instances
Validation Set: 347 instances
Test Set: 348 instances (Masked ground-truth)
Format: CSV files containing abstracts and corresponding titles
For evaluation, the test set will not include ground-truth titles to maintain a blind evaluation process.

Dataset for Bengali Title Generation from Abstracts

For the third subtask, we introduce the SpringerSSAT-Tiny-Multilingual dataset, a subset of 100 papers from the SpringerSSAT dataset for the translation task.
Each instance in this dataset is structured as a triplet: (abstract, title in English, title in Bengali).
The dataset consists of 100 such triplets, including:
Training Set: 60 instances
Validation Set: 20 instances
Test Set: 20 instances (Masked ground-truth)
Format: CSV files containing abstracts and the corresponding English and Bengali titles
For evaluation, the test set will not include ground-truth titles to maintain a blind evaluation process.
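
As a quick arithmetic check, the sketch below shows how a 3:1:1 ratio applied to 100 instances yields the 60/20/20 split listed above. The helper split_sizes is hypothetical and is not part of any released tooling; the official splits are fixed by the organizers.

```python
def split_sizes(total, ratio):
    """Divide `total` items according to integer `ratio` parts.

    Any remainder left by integer division is given to the first split.
    """
    parts = sum(ratio)
    sizes = [total * r // parts for r in ratio]
    sizes[0] += total - sum(sizes)
    return sizes


train, val, test = split_sizes(100, (3, 1, 1))
print(train, val, test)  # 60 20 20
```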

Evaluation Plan

Submissions for all subtasks will be evaluated using the following automatic metrics:

ROUGE-1, ROUGE-2, ROUGE-L: Measure lexical overlap with the reference highlights or titles.
METEOR: Evaluates similarity using stem and synonym matching.
BERTScore: Evaluates semantic similarity using BERT embeddings.
COMET: Measures translation quality; applicable to the English–Bengali Title Translation subtask.

✓ Participants are encouraged to analyse common challenges such as hallucinations (incorrect information) and factual inconsistencies in the generated highlights or titles.

✓ The submitted systems will be ranked based on the ROUGE-L F1-score. In addition to quantitative performance, novel and innovative approaches in model design and methodology will be appreciated.
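
Since ranking is based on the ROUGE-L F1-score, the sketch below illustrates how that score is derived from the longest common subsequence (LCS) between a reference and a generated text. It is a simplified illustration, not the official scorer: it assumes lowercased whitespace tokenization with no stemming, and the official evaluation may use different preprocessing. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 under simplified tokenization (an assumption, see lead-in)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)  # share of candidate tokens in the LCS
    recall = lcs / len(ref_tokens)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 4))
```

In this example the LCS is "the cat on the mat" (5 tokens) and both texts have 6 tokens, so precision and recall are both 5/6.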

Submission Format

You should submit a single .zip file containing all required files to this email ID only: tohidarehman.it@jadavpuruniversity.in

Submission Requirements:

📁 1. Required Files Inside the ZIP

Each submission must include:

(a) Trained Model

Upload your trained model checkpoint to Hugging Face
Include the Hugging Face model link in the submission

(b) Prediction File (.csv)

A .csv file containing model outputs for the test set.

📄 General CSV Format: Each line in the .csv file should contain three comma-separated columns in the following order: Filename, Abstract, Predicted_Output

For Task 1 → Predicted_Output = Highlights
For Task 2 → Predicted_Output = Title
For Task 3 → Predicted_Output = Bengali Title

🧠 Task-wise Format Summary

🔹 Task 1: Research Highlight Generation

Filename, Abstract, Predicted_Highlights

🔹 Task 2: Title Generation from Abstracts

Abstract, Predicted_Title

🔹 Task 3: Bengali Title Generation

Abstract, Predicted_Title_English, Predicted_Title_Bengali
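
A minimal sketch of writing a Task 1 prediction file with Python's standard csv module is given below. Abstracts and highlights usually contain commas and quotation marks, so the fields must be quoted correctly, which the csv module does automatically. The sample rows and the function name are hypothetical, and the presence of a header row is an assumption.

```python
import csv
import io

# Hypothetical predictions: (filename, abstract, generated highlights).
predictions = [
    ("paper_001", "We study X, Y and Z.", "X is studied. Y improves Z."),
    ("paper_002", 'An abstract with "quotes".', "A second highlight."),
]


def write_task1_predictions(handle, rows):
    """Write rows in the order Filename, Abstract, Predicted_Highlights."""
    writer = csv.writer(handle, quoting=csv.QUOTE_MINIMAL)
    # Assumption: a header row is expected as the first line.
    writer.writerow(["Filename", "Abstract", "Predicted_Highlights"])
    writer.writerows(rows)


buffer = io.StringIO()
write_task1_predictions(buffer, predictions)

# Round-trip check: reading the text back yields the same three fields per row.
buffer.seek(0)
restored = list(csv.reader(buffer))
print(len(restored))
```

When writing to disk, opening the file with newline="" and encoding="utf-8" avoids blank lines on some platforms and keeps Bengali text intact for Task 3.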

File Naming Conventions:

The .csv file must be named using the following format: <team_name>_<Task_Number>_<run_identifier>.csv
Example: TeamA_Task1_run1.csv, where:
- TeamA is your team name
- Task1 is your sub-task number
- run1 identifies the specific run
✅ Use underscores (_) only as shown above.
❌ Do not use blank spaces, tabs, or additional underscores in the file name.
The .zip file should be named after the email ID used for registration (excluding the domain).
For example, if your registered email ID is hello@gmail.com, then your zip file should be named:
hello.zip
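
The naming rules can be checked before submission with a short script such as the sketch below. It follows the example TeamA_Task1_run1.csv and assumes that team names and run identifiers are purely alphanumeric and that the task number is 1, 2, or 3; both helper functions are hypothetical.

```python
import re

# Pattern modelled on the example TeamA_Task1_run1.csv.
# Assumption: team name and run identifier are alphanumeric only,
# and the task number is 1, 2, or 3.
RUN_FILE_PATTERN = re.compile(r"[A-Za-z0-9]+_Task[123]_[A-Za-z0-9]+\.csv")


def is_valid_run_filename(name):
    """Return True if `name` matches the expected run-file pattern."""
    return RUN_FILE_PATTERN.fullmatch(name) is not None


def zip_name_for(email):
    """Derive the .zip name from the part of the email before the '@'."""
    local_part = email.split("@", 1)[0]
    return f"{local_part}.zip"


print(is_valid_run_filename("TeamA_Task1_run1.csv"))   # True
print(is_valid_run_filename("Team A_Task1_run1.csv"))  # False (contains a space)
print(zip_name_for("hello@gmail.com"))                 # hello.zip
```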

Additional Notes:

You may submit up to two solutions for each sub-task to demonstrate different approaches or refinements of your work.

Contact

Email: tohidarehman.it@jadavpuruniversity.in

References

[1] Tohida Rehman, Debarshi Kumar Sanyal, Samiran Chattopadhyay, Plaban Kumar Bhowmick, and Partha Pratim Das. Generation of Highlights From Research Papers Using Pointer-Generator Networks and SciBERT Embeddings. IEEE Access, Volume 11, pages 91358–91374, 2023. DOI: https://doi.org/10.1109/ACCESS.2023.3292300

[2] Tohida Rehman, Debarshi Kumar Sanyal, and Samiran Chattopadhyay. Can pre-trained language models generate titles for research papers? International Conference on Asian Digital Libraries. Singapore: Springer Nature Singapore, 2024. DOI: https://doi.org/10.1007/978-981-96-0865-2_13

[3] Tohida Rehman, Debarshi Kumar Sanyal, and Samiran Chattopadhyay. Automatic Generation of Titles for Research Papers Using Language Models. Accepted for publication in the International Journal on Digital Libraries, 2026.
