Keyword Extraction in Scientific Documents

at SwissText 2022

Announcements

Welcome to the workshop!

If you have not done so yet, please fill out this background probe Google form so that we can get to know you better!

Also join the Slack channel [updated!] for more streamlined communication!

July 6: Our proceedings submission is now on arXiv!
June 13: Proceedings information is now available on a new page!
June 7: The workshop slides are online!
June 7: More information and the Slack channel published!
June 4: The complete dataset and the Google Colaboratory Jupyter Notebooks were released!
May 27: Exact venue and room announced
May 17: The training dataset was released!

About the Workshop

The number of scientific publications is growing exponentially, which makes it increasingly challenging to keep track of trends and changes. Understanding scientific documents is an important step for downstream tasks such as knowledge graph building, text mining, and discipline classification.

In this workshop, we aim to provide a better understanding of keyword and keyphrase extraction from the abstracts of scientific publications. Beyond this workshop, the methods are also applicable to other kinds of text data, such as media texts.

Program Overview

The workshop, as part of SwissText 2022, will take place in Lugano-Viganello at the East Campus of USI-SUPSI, in Room A1.03.

It is scheduled for Wednesday, 8 June 2022, from 13:00 to 16:30.

The tentative schedule is as follows:

13:00 – 13:30 Workshop introduction
13:30 – 16:00 Hands-on session, team formation
- 13:30 – 14:00 Research presentation
- 14:15 – 15:00 System coaching
  - System 1: TextRank algorithm
  - System 2: TextRank algorithm and clustering
  - System 3: Named-Entity Recognition (NER)
- 16:00 Submission of the final results (optional)
16:00 – 16:30 Exchange and feedback
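
Systems 1 and 2 in the schedule build on TextRank, which ranks words by running a PageRank-style iteration over a word co-occurrence graph. The following is a minimal sketch in plain Python, not the implementation used in the workshop notebooks: the function name `textrank_keywords`, the tiny stopword list, the window size, and the damping factor are all illustrative assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real systems typically
# use a fuller list or part-of-speech filtering instead).
STOPWORDS = {
    "a", "an", "the", "of", "and", "or", "in", "on", "for", "to", "is",
    "are", "with", "by", "from", "we", "this", "that", "as", "it", "be",
}


def textrank_keywords(text, window=3, damping=0.85, iterations=50, top_k=5):
    """Return up to top_k single-word keywords ranked by a TextRank-style score."""
    # Simplification: stopwords are dropped before windowing, so words that
    # were separated only by stopwords count as adjacent.
    words = [
        w for w in re.findall(r"[a-z]+", text.lower())
        if w not in STOPWORDS and len(w) > 2
    ]

    # Undirected co-occurrence graph: two words are linked if they appear
    # within `window` consecutive tokens of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    if not neighbors:
        return []

    # PageRank-style iteration on the unweighted graph.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


sample = (
    "Understanding scientific documents is an important step in downstream "
    "tasks, such as knowledge graph building, text mining, and discipline "
    "classification."
)
print(textrank_keywords(sample))
```

This sketch returns single words only; forming multi-word keyphrases, for example by merging adjacent top-ranked words, is a common extension that is left out here.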

Note:

If you would like your results from this workshop to be included in the workshop proceedings, make sure you fill in your information in the Google form and send us your results by June 20. More details are on the 'Proceedings Information' page, accessible from the top bar.

Google Colaboratory

We will work solely on Google Colaboratory so that no local installation on your laptop is necessary.

The Jupyter Notebook for each system can be found at the following links:

For the code to be executable, the notebook must first be copied to your own Google Drive, so a Google sign-in is required.

Furthermore, we also provide an evaluation function for all systems at the following link:

Note:

Evaluating keyword extraction performance is very tricky! If you are interested, you can find a further exploration of different evaluation metrics in this Medium post.
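
To make the difficulty concrete, one common baseline is exact-match precision, recall, and F1 between the extracted keywords and the author-provided ones. The sketch below is an assumption for illustration and is not the evaluation function provided for the workshop; the name `evaluate_keywords` and the normalization (lowercasing and whitespace stripping) are hypothetical choices.

```python
def evaluate_keywords(predicted, gold):
    """Exact-match precision, recall, and F1 between two keyword lists."""
    # Normalize so that "TextRank" and " textrank " count as the same keyword.
    pred = {k.strip().lower() for k in predicted if k.strip()}
    ref = {k.strip().lower() for k in gold if k.strip()}

    matched = len(pred & ref)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


predicted = ["TextRank", "graph ranking", "neural network"]
gold = ["textrank", "keyword extraction", "Graph Ranking", "clustering"]
precision, recall, f1 = evaluate_keywords(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Exact matching gives no credit for near misses such as "keyword extraction" versus "keyphrase extraction", which is one reason why choosing an evaluation metric for this task is hard.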

Data

We will use the abstracts and keywords of approximately 40,000 scientific papers as the dataset.

The complete dataset can be found in this ETH Polybox folder.

Note:

The same keyword extraction and named-entity recognition methods work for other types of text. For example, check out the 20 Newsgroups dataset, which is often used as a baseline for natural language processing tasks.

About the Organizers

This workshop is organized by ETH Zürich and Neue Zürcher Zeitung (NZZ).

Susie Xi Rao (ETH Zurich, DS3Lab and Chair of Applied Economics), srao@ethz.ch
Parijat Ghoshal (NZZ), parijat.ghoshal@nzz.ch
Piriyakorn Piriyatamwong (ETH Zurich, DS3Lab and Chair of Applied Economics), ppiriyata@student.ethz.ch