Description & Ideas

Here, we describe the ideas and motivation behind subword and character-level modeling of natural language. We hope that this will help researchers identify interesting research topics in this area of NLP. We also provide a list of references to previous work in this direction. (Henceforth, the term "character-level" refers to both "character-level" and "subword-level".)


Text processing generally starts with tokenization: the text is first segmented into tokens, most commonly by means of rules (e.g., split on delimiters like white space and punctuation). Traditionally, the tokens were either used as unanalyzed symbols or morphologically analyzed.
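
As a minimal sketch of such rule-based tokenization, the following Python snippet splits on white space and separates punctuation marks into their own tokens. The pattern and the name rule_tokenize are illustrative assumptions, not the rules of any particular tokenizer.

```python
import re
from typing import List

# Hypothetical rule: a token is either a run of word characters or a
# single character that is neither a word character nor white space.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def rule_tokenize(text: str) -> List[str]:
    """Split on white space and separate punctuation into its own tokens."""
    return TOKEN_PATTERN.findall(text)


print(rule_tokenize("Tokenization works well, usually."))
# ['Tokenization', 'works', 'well', ',', 'usually', '.']
```

The same rule splits a name like "Yahoo!" into two tokens, which illustrates how delimiter-based rules can go wrong on text that does not follow standard conventions.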

Tokenization works well for clean, well-edited English text. But as we will discuss below, tokenization is problematic for:

    • noisy text
    • many languages other than English
    • adaptation to new domains

This has motivated recent character-level work that has departed from traditional tokenization-based NLP in three (not necessarily mutually exclusive) respects.

  1. Character-level word models. This line of work still starts with tokenization, but tokens are analyzed on the character level, i.e., not treated as unanalyzed symbols. These models either work together with a morphological subsystem or are proposed as an alternative to morphological analysis.
  2. Character-level embeddings. This line of work learns representations on the character level, either instead of or in addition to representations on the word and phrase level. The resulting text representation is based on a tokenized form of the input in some cases and on raw untokenized text in other cases.
  3. End-to-end approaches. This line of work rejects tokenization as feature engineering and attempts to train NLP systems end to end. The input is the raw character (or byte) sequence.

A large number of publications on character-level models have appeared in the last 2-3 years, which demonstrates great interest in this emerging area.

Motivation & Ideas

Human natural language processing (NLP) is robust against noise in the sense that small perturbations of the input do not affect processing negatively. Such perturbations often cause token-based processing to fail. They include letter insertions, deletions, substitutions and transpositions, the insertion of spaces ("guacamole" becoming "gua camole") and the deletion of spaces ("ran fast" becoming "ranfast"). Character-level models may be better equipped to deal with this noise.
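
The effect of a single inserted space can be illustrated with a small Python sketch. It compares the overlap between a clean and a noisy string when both are represented as sets of white-space tokens and as sets of character trigrams. The helper names, the "#" boundary marker and the choice of Jaccard overlap are assumptions made for illustration only; they do not correspond to any specific published model.

```python
from typing import Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of text, padded with '#' markers."""
    padded = "#" + text + "#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets (defined as 1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


clean, noisy = "guacamole", "gua camole"

# Token view: the inserted space yields two unknown tokens, so nothing matches.
token_overlap = jaccard(set(clean.split()), set(noisy.split()))

# Character view: most trigrams survive the perturbation.
char_overlap = jaccard(char_ngrams(clean), char_ngrams(noisy))

print(token_overlap)           # 0.0
print(round(char_overlap, 2))  # 0.58
```

In this toy setting the token-level representations share nothing, whereas the character-level representations still share 7 of 12 distinct trigrams.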

Orthographic productivity, form-meaning representations. The truth possibly lies between the following two extremes. The character sequence of a token:

  1. is arbitrary and uninformative;
  2. perfectly and compositionally predicts the token's linguistic properties.

Morphology is the most important phenomenon of limited predictability that lies between these two extremes. But there are many other, less prominent phenomena that, taken together, have the potential to improve NLP models considerably if they can be handled at the level of human competence:

  1. properties of names predictable from character patterns, e.g., "Yaghoobzadeh" is identifiable as a Farsi surname, "Darnique" and "Delonda" are identifiable as names of girls, and "osinopril" is most likely a medication;
  2. blends and modifications of existing words, e.g., "staycation", "Obamacare", "mockumentary", "dramedy";
  3. non-morphological orthographic productivity in certain registers, domains and genres: character repetition in tweets ("coooooooooool"), shm-reduplication ("fancy-shmancy"), the pseudo-derivational suffix "-gate" signifying "scandal" ("Irangate", "Dieselgate");
  4. sound symbolism and phonesthemes, e.g., "gl-" ("gleam", "glint", "glisten", "glow");
  5. onomatopoeia, e.g., "oink", "sizzle", "tick tock".

An advantage of character-level models is that the problem of out-of-vocabulary (OOV) words disappears. Of course, if the NLP system encounters an OOV that is opaque even to a human reader, then the character-level model has no advantage. However, in many cases a great deal can be predicted from the character string of an OOV.
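
The contrast can be sketched in a few lines of Python. A word-level vocabulary built from a toy training sentence maps an unseen word to an unknown-word id, whereas a byte-level encoding covers any input string. The training sentence, the UNK id and the function names are hypothetical and chosen only for illustration.

```python
from typing import List

# Toy training data (hypothetical) and the word vocabulary derived from it.
train_tokens = "the patient was given aspirin".split()
word_vocab = {word: idx for idx, word in enumerate(sorted(set(train_tokens)))}
UNK = -1  # hypothetical id reserved for out-of-vocabulary words


def word_ids(text: str) -> List[int]:
    """Map each white-space token to its vocabulary id, or UNK if unseen."""
    return [word_vocab.get(word, UNK) for word in text.split()]


def byte_ids(text: str) -> List[int]:
    """Map text to its UTF-8 byte values; every id lies in range(256)."""
    return list(text.encode("utf-8"))


test_sentence = "the patient was given osinopril"

print(word_ids(test_sentence)[-1] == UNK)                     # True: word is OOV
print(all(0 <= b < 256 for b in byte_ids(test_sentence)))     # True: no OOV bytes
print(bytes(byte_ids(test_sentence)).decode("utf-8") == test_sentence)  # True
```

The byte-level view only guarantees that the input can be represented. Whether a model can predict anything useful from the character string of an OOV word is a separate, empirical question.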

The relationship between morphology and character-level models is currently unclear. Are these two independent subsystems that both provide analyses to downstream components of the pipeline? Are they arranged in sequence as elements of the pipeline, first a character-based model, then a morphological analyzer, or vice versa? Do they work together to deal with problems like noise and orthographic creativity?

Most linguists and most NLP researchers (including the authors of this proposal) have a strong intuition that morphology is a subsystem of language whose handling should not be handed over to a black-box system. Such a system deals with a disparate set of phenomena, most of which have properties that are fundamentally different from those of inflection and derivation. A key question is how domain knowledge about morphology can be exploited for character-level models, e.g., in Bayesian approaches or by informing the architecture of neural network models.

Tokenization-free models. One of the drawbacks of token-based models is that they usually tokenize text early on and it is difficult to correct these early tokenization decisions later. While it is theoretically possible to generate all possible tokenizations and pass any tokenization ambiguity through the entire NLP pipeline (e.g., by using lattices), this is inefficient and often incompatible with the requirements of subsequent processing modules. For this reason, text-to-text machine translation systems usually only consider a single tokenization of source and target.
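
To give a rough sense of the cost, the following Python sketch enumerates every way of cutting a string into contiguous pieces. A string of n characters has 2^(n-1) such segmentations, since each of the n-1 gaps between characters is either a boundary or not. This brute-force enumeration is only an illustration of the size of the search space; it is an assumption-laden toy, not how lattice-based systems represent ambiguity.

```python
from typing import Iterator, List


def segmentations(text: str) -> Iterator[List[str]]:
    """Yield every segmentation of text into contiguous, non-empty pieces."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        # Combine the first piece with every segmentation of the remainder.
        for rest in segmentations(text[end:]):
            yield [head] + rest


all_splits = list(segmentations("abcd"))
print(len(all_splits))  # 8, i.e., 2 ** (len("abcd") - 1)
```

A lattice can encode these alternatives compactly, but each downstream module then has to accept lattice input, which is the compatibility problem mentioned above.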

Tokenization causes limited damage in English, although even in English there are difficult cases like "Yahoo!", "San Francisco-Los Angeles flights", "[she] was selected 'The Apprentice'-style" and hashtags like "#starwars". In other languages, tokenization is even more problematic. In Chinese, tokens are not separated by spaces or other typesetting conventions. For most NLP applications, German compounds should be split. Tokens in agglutinative languages like Turkish also pose difficulties.

Morphology and phrases are two sides of the same coin. Tokenization rules often fail to capture structure both within tokens (e.g., morphology) and across multiple tokens (e.g., multi-word expressions).

End-to-end learning, direct models of the data. Most of statistical natural language processing relies heavily on feature engineering. In contrast, the philosophy of end-to-end learning is that manual feature design is prone to errors and omissions, and that a good set of features can best be found by training a well-designed model in a well-designed experimental setup. Thus, an important question is whether character-level models will match or surpass token-based models and, if so, in which subareas of NLP.

OOV generation. There is currently no principled and general way for token-based end-to-end systems to generate tokens that are not part of the training vocabulary, e.g., for new named entities. Character-based systems in principle can handle OOV generation without the use of special mechanisms.

Domain knowledge. Successful machine learning requires inductive bias. If domain knowledge injected into models no longer consists of tokenization rules and morphological expertise, what would replace it?

Many current character-level models are less computationally efficient than word-level models because detecting syntactic and semantic relationships at the character level is more expensive than at the word level. How can we address the resulting scalability challenges for character-level models?