A sequence labeling approach to deriving word variants

About


The Word Variant Derivation task by suffixation can be described as follows: given a word, determine its unchangeable beginning portion (loosely, the base), identify any remaining ending as replaceable (loosely, the suffix), and propose candidate suffixes that can replace the ending or be appended to the base to form valid derived words (e.g., happ-iness from happ-y).
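The base-plus-suffix view of the task can be illustrated with a minimal sketch. The base/suffix split and the candidate suffix list below are hand-picked for the example, not produced by the system described here:

```python
# Illustrative sketch only: derive word variants by appending candidate
# suffixes to a fixed base. The split "happ" + {"y", "iness", "ily"} is
# an assumed example, mirroring the happ-y / happ-iness case above.

def derive_variants(base, candidate_suffixes):
    """Append each candidate suffix to the base to form derived words."""
    return [base + suffix for suffix in candidate_suffixes]

print(derive_variants("happ", ["y", "iness", "ily"]))
# ['happy', 'happiness', 'happily']
```

The actual system must of course also decide where the base ends and which suffixes are valid candidates; that is what the learning approach below is for.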

The National Library of Medicine's (NLM) Lexical Systems Group has initiated a project for automatic word variant derivation. Their project page describes, step by step, the components of their system for automatically deriving variants of words. Briefly, they approach the task by encoding observable grammatical and linguistic patterns for deriving one word from another as rules. Rule-based systems are advantageous in that they can model human intuition for a task most precisely, but their main limitation is that they apply only within the domain of the observed patterns. In addition, the ever-evolving nature of language leads to the emergence of new transformation rules between words, and sometimes even to the loss of earlier ones. In light of these considerations, we propose a learning-based approach to automatic word variant derivation, usable either standalone or in a hybrid architecture together with rule-based systems.

Our sequence learning approach for the Word Variant Derivation task is described in the following paper.

Jennifer D'Souza. 2015. A Sequence Labeling Approach to Deriving Word Variants. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), Student Abstract and Poster Program. AAAI Press, Austin, Texas, USA.

This page serves to provide additional information about our sequence learning approach for word variant derivation by suffixation, as well as to make accessible the datasets used in our experiments.


A typical machine learning experimental setup


A typical setup involves the following components.

  • A machine learning algorithm.

  • A labeled dataset for the task-at-hand on which to train and test the chosen machine learning algorithm.

  • A set of features that capture the information in the dataset relevant to the task.

  • And, a method to evaluate the machine learning algorithm's performance in identifying the correct labels given features of the data.


Our Experimental Setup


  • Chosen machine learning algorithm. We have selected Conditional Random Fields (CRFs), an algorithm for labeling sequential data, for our task of automatically deriving word variants. We use the CRF++ tool, a freely available, open-source implementation of the algorithm.
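CRF++ is configured with a feature template file. The templates used in our experiments are not reproduced here; the fragment below is a minimal illustrative template in standard CRF++ syntax, expanding each token's first feature column in a window around the current position:

```
# Unigram feature templates: column 0 of the previous,
# current, and next token
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]

# Bigram template over adjacent output labels
B
```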

  • Raw unlabeled data. While our proposed approach is domain-independent, for simplicity, our experimental analyses for automatically deriving word variants using a learning-based approach are restricted to words from clinical notes.

Download bag-of-words extracted from about 400,000 clinical notes.

  • Features. Our chosen features to represent the data are: 1) the letters of the word; 2) spelled phoneme pronunciations for each letter in the word, automatically extracted using DirecTL+; and 3) hyphenation breakpoints of the word, automatically obtained using a Python implementation of Frank Liang's hyphenation algorithm.

Download bag-of-words with their features.
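Assembling the three per-character feature streams can be sketched as follows. The phoneme symbols and breakpoints below are made-up placeholders; in the actual setup they are produced by DirecTL+ and by the hyphenation algorithm, respectively:

```python
# Hedged sketch: pair each letter of a word with its (placeholder)
# phoneme and a flag marking whether a hyphenation breakpoint falls
# at that letter's position.

def featurize(word, phonemes, breakpoints):
    """Return one (letter, phoneme, breakpoint-flag) row per character."""
    assert len(word) == len(phonemes)
    return [(letter, phonemes[i], "BP" if i in breakpoints else "O")
            for i, letter in enumerate(word)]

# "happiness" with placeholder ARPAbet-style phonemes and the
# breakpoints of "hap-pi-ness" (at letter indices 2 and 4):
for row in featurize("happiness",
                     ["HH", "AE", "P", "P", "IY", "N", "AH", "S", "S"],
                     {2, 4}):
    print("\t".join(row))
```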

  • Labeled training and testing data. Creating the final dataset involved several steps: representing each word in the bag-of-words in terms of its features, forming derivational word variant groups, and separating the now-labeled data into training and testing sets.


Download experimental data directly usable by the CRF++ toolkit.
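CRF++ expects one token per line, with whitespace-separated feature columns, the gold label in the last column, and a blank line between sequences. The fragment below is a hypothetical illustration of one word laid out character by character; the KEEP/DEL label scheme (marking the base versus the replaceable suffix, following the happ-y / happ-iness example above) is our illustrative assumption, not necessarily the exact labels used in the experiments:

```
h	HH	O	KEEP
a	AE	O	KEEP
p	P	BP	KEEP
p	P	O	KEEP
i	IY	BP	DEL
n	N	O	DEL
e	AH	O	DEL
s	S	O	DEL
s	S	O	DEL
```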

  • Results. We then organize the output predictions from the sequence learning algorithm into clusters of derivational word variant groups, merging a word and its predicted variants with other words and their predicted variants whenever the groups share a word in common. We can then evaluate the performance of our approach by comparing the predicted derivational word variant groups against the actual groups in the test data.
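The merging step above amounts to computing connected components over groups that share a word. One way to sketch it (our illustrative implementation, using union-find; the group contents are made-up examples):

```python
# Sketch of the clustering step: each predicted group is a word plus its
# predicted variants; groups sharing any word are merged into one cluster.

def cluster_groups(groups):
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w

    def union(a, b):
        parent[find(a)] = find(b)

    # Union every word in a group with the group's first word.
    for group in groups:
        for word in group[1:]:
            union(group[0], word)

    # Collect words by their root representative.
    clusters = {}
    for group in groups:
        for word in group:
            clusters.setdefault(find(word), set()).add(word)
    return [sorted(c) for c in clusters.values()]

predicted = [
    ["happy", "happiness"],       # a word and its predicted variants
    ["happiness", "happily"],     # shares "happiness" with the group above
    ["derive", "derivation"],
]
print(cluster_groups(predicted))
# [['happily', 'happiness', 'happy'], ['derivation', 'derive']]
```

The resulting clusters can then be compared directly against the gold derivational word variant groups in the test data.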