Research contribution-centric named entity recognition (NER) in Computer Science

Rule-based Titles Parser

Both rule-based systems operate only on the titles of scholarly articles in Computational Linguistics (CL).

The first system, CL-TitleParser, parses and types scientific entities from the titles of Computational Linguistics scholarly articles written in English. Specifically, it types each entity as one of six concepts: research problem, solution, resource, language, tool, and method.

Code: https://github.com/jd-coderepos/cl-titles-parser/

Publication: Jennifer D’Souza and Sören Auer (2021). Pattern-Based Acquisition of Scientific Entities from Scholarly Article Titles. In: Ke H.R., Lee C.S., Sugiyama K. (eds) Towards Open and Trustworthy Digital Societies. ICADL 2021. Lecture Notes in Computer Science, vol 13133. Springer, Cham. (Preprint available at https://arxiv.org/abs/2109.00199)
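To illustrate the general idea of pattern-based typing of title phrases, here is a minimal sketch in Python. It is not the CL-TitleParser implementation: the two connector rules ("for" and "using"), the types they assign, and the names `RULES` and `type_title` are assumptions chosen for illustration only. The actual rules are in the linked repository and publication.

```python
import re

# Hypothetical connector rules (NOT the actual CL-TitleParser rules):
# each entry is (pattern, type of the left phrase, type of the right phrase).
RULES = [
    (re.compile(r"^(?P<left>.+?)\s+for\s+(?P<right>.+)$", re.IGNORECASE),
     "solution", "research problem"),
    (re.compile(r"^(?P<left>.+?)\s+using\s+(?P<right>.+)$", re.IGNORECASE),
     "research problem", "method"),
]


def type_title(title):
    """Return (phrase, type) pairs from the first matching rule, or [] if none match."""
    cleaned = title.strip()
    for pattern, left_type, right_type in RULES:
        match = pattern.match(cleaned)
        if match:
            return [
                (match.group("left"), left_type),
                (match.group("right"), right_type),
            ]
    return []


if __name__ == "__main__":
    print(type_title("A Neural Parser for Dependency Parsing"))
    print(type_title("Sentiment Analysis using Conditional Random Fields"))
```

Under these assumed rules, a title of the form "X for Y" yields X as a solution and Y as a research problem, while a title without a known connector yields no entities. A real rule set would need many more patterns and an ordering among them.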


The second system, CL-ShortTitles-Parser, parses phrases from the titles of Computational Linguistics scholarly articles written in English and types them as scientific entities. It assigns each entity one of the following seven semantic concepts: research problem, solution, resource, language, tool, method, and dataset.

Code: https://github.com/jd-coderepos/cl-shorttitles-parser

Machine-learning-based Titles and Abstracts Parser

The ORKG CS-NER system is based on a standardized set of seven contribution-centric scholarly entity types: research problem, solution, resource, language, tool, method, and dataset. It automatically extracts all seven entity types from Computer Science publication titles. In addition, it extracts the research problem and method entity types from Computer Science publication abstracts. Details of the sequence-labeling machine learner can be found in our preprint:

D'Souza, Jennifer, and Sören Auer. Computer Science Named Entity Recognition in the Open Research Knowledge Graph. arXiv preprint arXiv:2203.14579 (2022).
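Since the system is a sequence labeler, its output is one tag per token, and these tags must be grouped into typed entity spans. The sketch below shows one common way to do that, assuming a BIO tagging scheme. The scheme, the tag names such as `B-research_problem`, and the function `decode_bio` are assumptions for illustration, not a description of the format used by ORKG CS-NER or the released dataset.

```python
def decode_bio(tokens, tags):
    """Group per-token BIO tags into (entity text, entity type) pairs.

    Assumes tags of the form "B-<type>", "I-<type>", or "O".
    An "I-" tag that does not continue an open entity of the same type is ignored.
    """
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Emit the entity collected so far, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


if __name__ == "__main__":
    tokens = ["Neural", "machine", "translation", "with", "attention"]
    tags = ["B-research_problem", "I-research_problem",
            "I-research_problem", "O", "B-method"]
    print(decode_bio(tokens, tags))
```

For the example title, this groups "Neural machine translation" as a research problem and "attention" as a method.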

Download our dataset: https://github.com/jd-coderepos/contributions-ner-cs

Funding Statement

This work is supported by the TIB Leibniz Information Centre for Science and Technology and the EU H2020 ERC project ScienceGRaph (GA ID: 819536).