# Collation Concepts

The contents below is out of date.

The previous section demonstrated many of the requirements imposed on string comparison routines that try to correctly collate strings according to conventions of more than a hundred different languages, written in many different scripts. This section describes the principles and architecture behind the ICU Collation Service.

## Sortkeys vs Comparison

Sort keys are most useful in databases, where the overhead of calling a function for each comparison is very large.

Generating a sort key from a Collator is many times more expensive than doing a compare with the Collator (for common use cases). That's if the two functions are called from Java or C. So for those languages, unless there is a very large number of comparisons, it is better to call the compare function.

Here is an example, with a little back-of-the-envelope calculation. Let's suppose that with a given language on a given platform, the compare performance (CP) is 100 faster than sortKey performance (SP), and that you are doing a binary search of a list with 1,000 elements. The binary comparison performance is BP. We'd do about 10 comparisons, getting:

compare: 10 * CP

sortkey: 1 * SP + 10 * BP

Even if BP is free, compare would be better. One has to get up to where log2(n) = 100 before they break even.

But even this calculation is only a rough guide. First, the binary comparison is not completely free. Secondly, the performance of compare function varies radically with the source data. We optimized for maximizing performance of collation in sorting and binary search, so comparing strings that are "close" is optimized to be much faster than comparing strings that are "far away". That optimization is important because normal sort/lookup operations compare close strings far more often -- think of binary search, where the last few comparisons are always with the closest strings. So even the above calculation is not very accurate.

## Comparison Levels

In general, when comparing and sorting objects, some properties can take precedence over others. For example, in geometry, you might consider first the number of sides a shape has, followed by the number of sides of equal length. This causes triangles to be sorted together, then rectangles, then pentagons, etc. Within each category, the shapes would be ordered according to whether they had 0, 2, 3 or more sides of the same length. However, this is not the only way the shapes can be sorted. For example, it might be preferable to sort shapes by color first, so that all red shapes are grouped together, then blue, etc. Another approach would be to sort the shapes by the amount of area they enclose.

Similarly, character strings have properties, some of which can take precedence over others. There is more than one way to prioritize the properties.

For example, a common approach is to distinguish characters first by their unadorned base letter (for example, without accents, vowels or tone marks), then by accents, and then by the case of the letter (upper vs. lower). Ideographic characters might be sorted by their component radicals and then by the number of strokes it takes to draw the character.

An alternative ordering would be to sort these characters by strokes first and then by their radicals.

The ICU Collation Service supports many levels of comparison (named "Levels", but also known as "Strengths"). Having these categories enables ICU to sort strings precisely according to local conventions. However, by allowing the levels to be selectively employed, searching for a string in text can be performed with various matching conditions.

Performance optimizations have been made for ICU collation with the default level settings. Performance specific impacts are discussed in the Performance section below.

Following is a list of the names for each level and an example usage:

1. Primary Level: Typically, this is used to denote differences between base characters (for example, "a" < "b"). It is the strongest difference. For example, dictionaries are divided into different sections by base character. This is also called the level-1 strength.

2. Secondary Level: Accents in the characters are considered secondary differences (for example, "as" < "às" < "at"). Other differences between letters can also be considered secondary differences, depending on the language. A secondary difference is ignored when there is a primary difference anywhere in the strings. This is also called the level-2 strength.

3. Note: In some languages (such as Danish), certain accented letters are considered to be separate base characters. In most languages, however, an accented letter only has a secondary difference from the unaccented version of that letter.

4. Tertiary Level: Upper and lower case differences in characters are distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). In addition, a variant of a letter differs from the base form on the tertiary level (such as "A" and "Ⓐ"). Another example is the difference between large and small Kana. A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. This is also called the level-3 strength.

5. Quaternary Level: When punctuation is ignored (see Ignoring Punctuations (§)) at level 1-3, an additional level can be used to distinguish words with and without punctuation (for example, "ab" < "a-b" < "aB"). This difference is ignored when there is a primary, secondary or tertiary difference. This is also known as the level-4 strength. The quaternary level should only be used if ignoring punctuation is required or when processing Japanese text (see Hiragana processing (§)).

6. Identical Level: When all other levels are equal, the identical level is used as a tiebreaker. The Unicode code point values of the NFD form of each string are compared at this level, just in case there is no difference at levels 1-4 . For example, Hebrew cantillation marks are only distinguished at this level. This level should be used sparingly, as only code point values differences between two strings is an extremely rare occurrence. Using this level substantially decreases the performance for

7. both incremental comparison and sort key generation (as well as increasing the sort key length). It is also known as level 5 strength.

## Backward Secondary Sorting

Some languages require words to be ordered on the secondary level according to the last accent difference, as opposed to the first accent difference. This was previously the default for all French locales, based on some French dictionary ordering traditions, but is currently only applicable to Canadian French (locale fr_CA), for conformance with the Canadian sorting standard. The difference in ordering is only noticeable for a small number of pairs of real words. For more information see UCA: Contextual Sensitivity.

Example:

Forward

secondary

cote

coté

côte

côté

Backward

secondary

cote

côte

coté

côté

## Contractions

A contraction is a sequence consisting of two or more letters. It is considered a single letter in sorting.

For example, in the traditional Spanish sorting order, "ch" is considered a single letter. All words that begin with "ch" sort after all other words beginning with "c", but before words starting with "d".

Other examples of contractions are "ch" in Czech, which sorts after "h", and "lj" and "nj" in Croatian and Latin Serbian, which sort after "l" and "n" respectively.

Example:

Order without contraction

la

li

lj

lja

ljz

lk

lz

ma

Order with contraction "lj" sorting after letter "l"

la

li

lk

lz

lj

lja

ljz

ma

Contracting sequences such as the above are not very common in most languages. They are very important, however, since they are also used in the ordering of accented letters. This is because the implementation of ICU treats tailored precomposed characters (such as Ã in Spanish) as contracting sequence (e.g. N + ~).

Since ICU 2.2, and as required by the UCA, if a completely ignorable code point appears in text in the middle of contraction, it will not break the contraction. For example, in Czech sorting, cU+0000h will sort as it were ch

## Expansions

If a letter sorts as if it were a sequence of more than one letter, it is called an expansion.

For example, in German phonebook sorting (de@collation=phonebook or BCP 47 de-u-co-phonebk), "ä" sorts as though it were equivalent to the sequence "ae." All words starting with "ä" will sort between words starting with "ad" and words starting with "af".

In the case of Unicode encoding, characters can often be represented either as pre-composed characters or in decomposed form. For example, the letter "à" can be represented in its decomposed (a+`) and pre-composed (à) form. Most applications do not want to distinguish text by the way it is encoded. A search for "à" should find all instances of the letter, regardless of whether the instance is in pre-composed or decomposed form. Therefore, either form of the letter must result in the same sort ordering. The architecture of the ICU Collation Service supports this.

## Contractions Producing Expansions

It is possible to have contractions that produce expansions.

One example occurs in Japanese, where the vowel with a prolonged sound mark is treated to be equivalent to the long vowel version:

カアー<<< カイー and

キイー<<< キイー

Since ICU 2.0 Japanese tailoring uses prefix analysis (§) instead of contraction producing expansions.

## Normalization

In the section on expansions, we discussed that text in Unicode can often be represented in either pre-composed or decomposed forms. There are other types of equivalences possible with Unicode, including Canonical and Compatibility. The process of

Normalization ensures that text is written in a predictable way so that searches are not made unnecessarily complicated by having to match on equivalences. Not all text is normalized, however, so it is useful to have a collation service that can address text that is not normalized, but do so with efficiency.

The ICU Collation Service handles un-normalized text properly, producing the same results as if the text were normalized.

In practice, most data that is encountered is in normalized or semi-normalized form already. The ICU Collation Service is designed so that it can process a wide range of normalized or un-normalized text without a need for normalization processing. When a case is encountered that requires normalization, the ICU Collation Service drops into code specific to this purpose. This maximizes performance for the majority of text that does not require normalization.

In addition, if the text is known with certainty not to contain un-normalized text, then even the overhead of checking for normalization can be eliminated. The ICU Collation Service has the ability to turn Normalization Checking either on or off. If Normalization Checking is turned off, it is the user's responsibility to insure that all text is already in the appropriate form. This is true in a great majority of the world languages, so normalization checking is turned off by default for most locales.

If the text requires normalization processing, Normalization Checking should be on. Any language that uses multiple combining characters such as Arabic, ancient Greek, Hebrew, Hindi, Thai or Vietnamese either requires Normalization Checking to be on, or the text to go through a normalization process before collation.

ICU supports two modes of normalization: on and off. Java.text.* classes offer compatibility decomposition mode, which is not supported in ICU.

## Ignoring Punctuation

In some cases, punctuation can be ignored while searching or sorting data. For example, this enables a search for "biweekly" to also return instances of "bi-weekly". In other cases, it is desirable for punctuated text to be distinguished from text without punctuation, but to have the text sort close together.

These two behaviors can be accomplished if there is a way for a character to be ignored on all levels except for the quaternary level. If this is the case, then two strings which compare as identical on the first three levels (base letter, accents, and case) are then distinguished at the fourth level based on their punctuation (if any). If the comparison function ignores differences at the fourth level, then strings that differ by punctuation only are compared as equal.

The following table shows the results of sorting a list of terms in 3 different ways. In the first column, punctuation characters (space " ", and hyphen "-") are not ignored (" " < "-" < "b"). In the second column, punctuation characters are ignored in the first 3 levels and compared only in the fourth level. In the third column, punctuation characters are ignored in the first 3 levels and the fourth level is not considered. In the last column, punctuated terms are equivalent to the identical terms without punctuation.

For more options and details see the “Ignore Punctuation” Options page.

Non-ignorable

black bird

black Bird

black birds

black-bird

black-Bird

black-birds

blackbird

blackBird

blackbirds

Ignorable and Quaternary strength

black bird

black-bird

blackbird

black Bird

black-Bird

blackBird

black birds

black-birds

blackbirds

Ignorable and Tertiary strength

black bird

black-bird

blackbird

black Bird

black-Bird

blackBird

black birds

black-birds

blackbirds

The strings with the same font format in the last column are compared as equal by ICU Collator.

Since ICU 2.2 and as prescribed by the UCA, primary ignorable code points that follow shifted code points will be completely ignored. This means that an accent following a space will compare as if it was a space alone.

## Case Ordering

The tertiary level is used to distinguish text by case, by small versus large Kana, and other letter variants as noted above.

Some applications prefer to emphasize case differences so that words starting with the same case sort together. Some Japanese applications require the difference between small and large Kana be emphasized over other tertiary differences.

The UCA does not provide means to separate out either case or Kana differences from the remaining tertiary differences. However, the ICU Collation Service has two options that help in customize case and/or Kana differences. Both options are turned off by default.

### CaseFirst

The Case-first option makes case the most significant part of the tertiary level. Primary and secondary levels are unaffected. With this option, words starting with the same case sort together. The Case-first option can be set to make either lowercase sort before

uppercase or uppercase sort before lowercase.

Note: The case-first option does not constitute a separate level; it is simply a reordering of the tertiary level.

ICU makes use of the following three case categories for sorting

1. uppercase: "ABC"

2. mixed case: "Abc", "aBc"

3. normal (lowercase or no case): "abc", "123"

Mixed case is always sorted between uppercase and normal case when the "case-first" option is set.

### CaseLevel

The Case Level option makes a separate level for case differences. This is an extra level positioned between secondary and tertiary. The case level is used in Japanese to make the difference between small and large Kana more important than the other tertiary differences. It also can be used to ignore other tertiary differences, or even secondary differences. This is especially useful in matching. For example, if the strength is set to primary only (level-1) and the case level is turned on, the comparison ignores accents and tertiary differences except for case. The contents of the case level are affected by the case-first option.

The case level is independent from the strength of comparison. It is possible to have a collator set to primary strength with the case level turned on. This provides for comparison that takes into account the case differences, while at the same time ignoring accents and tertiary differences other than case. This may be used in searching.

Example:

## Script Reordering

Script reordering allows scripts and some other groups of characters to be moved relative to each other. This reordering is done on top of the DUCET/CLDR standard collation order. Reordering can specify groups to be placed at the start and/or the end of the collation order.

By default, reordering codes specified for the start of the order are placed in the order given after several special non-script blocks. These special groups of characters are space, punctuation, symbol, currency, and digit. Script groups can be intermingled with these special non-script groups if those special groups are explicitly specified in the reordering.

The special code others stands for any script that is not explicitly mentioned in the list. Anything that is after others will go at the very end of the list in the order given. For example, [Grek, others, Latn] will result in an ordering that puts all scripts other than Greek and Latin between them.

### Examples:

Note: All examples below use the string equivalents for the scripts and reorder codes that would be used in collator rules. The script and reorder code constants that would be used in API calls will be different.

Example 1: set reorder code - [Grek] result - [space, punctuation, symbol, currency, digit, Grek, others]Example 2: set reorder code - [Grek] result - [space, punctuation, symbol, currency, digit, Grek, others] followed by: set reorder code - [Hani] result - [space, punctuation, symbol, currency, digit, Hani, others] That is, setting a reordering always modifies the DUCET/CLDR order, replacing whatever was previously set, rather than adding on to it. In order to cumulatively modify an ordering, you have to retrieve the existing ordering, modify it, and then set it. Example 3: set reorder code - [others, digit] result - [space, punctuation, symbol, currency, others, digit]Example 4: set reorder code - [space, Grek, punctuation] result - [symbol, currency, digit, space, Grek, punctuation, others]Example 5: set reorder code - [Grek, others, Hani] result - [space, punctuation, symbol, currency, digit, Grek, others, Hani]Example 6: set reorder code - [Grek, others, Hani, symbol, Tglg] result - [space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg] followed by: set reorder code - [NONE] result - DUCET/CLDR Example 7: set reorder code - [Grek, others, Hani, symbol, Tglg] result - [space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg] followed by: set reorder code - [DEFAULT] result - original reordering for the locale which may or may not be DUCET/CLDR Example 8: set reorder code - [Grek, others, Hani, symbol, Tglg] result - [space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg] followed by: set reorder code - [] result - original reordering for the locale which may or may not be DUCET/CLDR Example 9: set reorder code - [Hebr, Phnx] result - error

Beginning with ICU 55, scripts only reorder together if they are primary-equal, for example Hiragana and Katakana.

ICU 4.8-54:

• Scripts were reordered in groups, each normally starting with a Recommended Script.

• Reorder codes moved as a group (were “equivalent”) if their scripts shared a primary-weight lead byte.

• For example, Hebr and Phnx were “equivalent” reordering codes and were reordered together. Their order relative to each other could not be changed.

• Only any one code out of any group could be reordered, not multiple of the same group.

## Sorting of Japanese Text (JIS X 4061)

Japanese standard JIS X 4061 requires two changes to the collation procedures: special processing of Hiragana characters and (for performance reasons) prefix analysis of text.

### Hiragana Processing

JIS X 4061 standard requires more levels than provided by the UCA. To offer conformant sorting order, ICU uses the quaternary level to distinguish between Hiragana and Katakana. Hiragana symbols are given smaller values than Katakana symbols on quaternary level, thus causing Hiragana sequences to sort before corresponding Katakana sequences.

### Prefix Analysis

Another characteristics of sorting according to the JIS X 4061 is a large number of contractions followed by expansions (see Contractions Producing Expansions (§)). This causes all the Hiragana and Katakana codepoints to be treated as contractions, which reduces performance. The solution we adopted introduces the prefix concept which allows us to improve the performance of Japanese sorting. More about this can be found in the customization chapter .

## Thai/Lao reordering

UCA requires that certain Thai and Lao prevowels be reordered with a code point following them. This option is always on in the ICU implementation, as prescribed by the UCA.

This rule takes effect when:

1. A Thai vowel of the range \U0E40-\U0E44 precedes a Thai consonant of the range \U0E01-\U0E2E

2. or

3. A Lao vowel of the range \U0EC0-\U0EC4 precedes a Lao consonant of the range \U0E81-\U0EAE. In these cases the vowel is placed after the consonant for collation purposes.

There is a difference between java.text.* classes and ICU in regard to Thai reordering. Java.text.* classes allow tailorings to turn off reordering by using the '!' modifier. ICU ignores the '!' modifier and always reorders Thai prevowels.

In many database products, fields are padded with null. To get correct results, the input to a Collator should omit any superfluous trailing padding spaces. The problem arises with contractions, expansions, or normalization. Suppose that there are two fields, one containing "aed" and the other with "äd". German phonebook sorting (de@collation=phonebook or BCP 47 de-u-co-phonebk) will compare "ä" as if it were "ae" (on a primary level), so the order will be "äd" < "aed". But if both fields are padded with spaces to a length of 3, then this will reverse the order, since the first will compare as if it were one character longer. In other words, when you start with strings 1 and 2

they end up being compared on a primary level as if they were 1' and 2'

Since 2' has an extra character (the extra space), it counts as having a primary difference when it shouldn't. The correct result occurs when the trailing padding spaces are removed, as in 1" and 2"

1".

2".

## Collator naming scheme

Starting with ICU 54, the following naming scheme and its API functions are deprecated. Use ucol_open() with language tag collation keywords instead (see Collation API Details). For example, ucol_open("de-u-co-phonebk-ka-shifted", &errorCode) for German Phonebook order with "ignore punctuation" mode.

When collating or matching text, a number of attributes can be used to affect the desired result. The following describes the attributes, their values, their effects, their normal usage, and the string comparison performance and sort key length implications. It also includes single-letter abbreviations for both the attributes and their values. These abbreviations allow a 'short-form' specification of a set of collation options, such as "UCA4.0.0_AS_LSV_S", which can be used to specific that the desired options are: UCA version 4.0.0; ignore spaces, punctuation and symbols; use Swedish linguistic conventions; compare case-insensitively.

A number of attribute values are common across different attributes; these include Default (abbreviated as D), On (O), and Off (X). Unless otherwise stated, the examples use the UCA alone with default settings.

a

a

e

e

d

d

In order to achieve uniqueness, collator name always has the attribute abbreviations sorted.

### Main References

If attributes in collator name are not overridden, it is assumed that they are the same as for the given locale. For example, a collator opened with an empty string has the same attribute settings as AN_CX_EX_FX_HX_KX_NX_S3_T0000.