Customization

From https://sites.google.com/site/icuprojectuserguide/collation/customization

ICU uses the CLDR root collation order as a default starting point for ordering. (The CLDR root collation is based on the UCA DUCET.) Not all languages have sorting sequences that correspond with the root collation order because no single sort order can simultaneously encompass the specifics of all the languages. In particular, languages that share a script may sort the same letters differently.

Therefore, ICU provides a data-driven, flexible, and run-time-customizable mechanism called "tailoring". Tailoring overrides the default order of code points and the values of the ICU Collation Service attributes.

Collation Rule

A RuleBasedCollator is built from a rule string which changes the sort order of some characters and strings relative to the default order. An empty string (or one with only white space and comments) results in a collator that behaves like the root collator.

A tailoring is specified via a string containing a set of rules. ICU implements the (CLDR) LDML collation rule syntax; see the LDML specification for details.

Each rule contains a string of ordered characters that starts with an anchor point or a reset value. For example, the rule "&a < g" places "g" after "a" and before "b", and the "a" does not change place. This rule has the following sorting consequences:

Without rule

apple

Abernathy

bird

Boston

green

Graham

With rule

apple

Abernathy

green

bird

Boston

Graham

Note that only the word that starts with "g" has changed place. All the words that were previously sorted after "a" and "A" are now sorted after "g".

This is a simple example of a tailoring rule. A tailoring consists of zero or more rules and zero or more options. There must be at least one rule or at least one option. The rule syntax is discussed in more detail in the following sections.

Note that the tailoring rules override the UCA ordering. In addition, if a character is reordered, it automatically reorders any other equivalent characters. For example, if the rule "&e<a" is used to reorder "a" in the list, "á" is also greater than "é".

Syntax

The following table summarizes the basic syntax necessary for most usages:

In releases prior to 1.8, ICU used the notation ';' to represent secondary relations and ',' to represent tertiary relations. Starting in release 1.8, use '<<' to represent secondary relations and '<<<' to represent tertiary relations. Rules that use the ';' and ',' notations are still processed by ICU for compatibility; also, some of the data used for tailoring to particular locales has not yet been updated to the new syntax. However, these symbols should be considered deprecated.
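The mapping from the legacy notation to the current one can be sketched as a small text rewrite. This is an illustration only: the class LegacyRelations and its method modernize are hypothetical names, not part of ICU, and the sketch assumes that ';' and ',' keep their literal meaning inside single quotes or after a backslash.

```java
/** Hypothetical helper, not part of ICU: rewrites ';' and ',' as '<<' and '<<<'. */
final class LegacyRelations {

    static String modernize(String rules) {
        StringBuilder out = new StringBuilder();
        boolean quoted = false; // true while inside a single-quoted literal
        for (int i = 0; i < rules.length(); i++) {
            char c = rules.charAt(i);
            if (c == '\\' && i + 1 < rules.length()) {
                // Assumption: a backslash escapes the next character, so copy both unchanged.
                out.append(c).append(rules.charAt(++i));
            } else if (c == '\'') {
                quoted = !quoted;
                out.append(c);
            } else if (!quoted && c == ';') {
                out.append("<<");  // legacy secondary relation
            } else if (!quoted && c == ',') {
                out.append("<<<"); // legacy tertiary relation
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(modernize("& a ; b , c")); // prints: & a << b <<< c
    }
}
```

A quoted comma, as in "& m < ','", is left untouched by this sketch.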

See the LDML collation rule syntax and Properties and ICU Rule Syntax for information regarding syntax characters.

Repeated use of the same relation can be abbreviated, for example &a <* bcd-gp-s for &a < b < c < d < e < f < g < p < q < r < s. For details see the LDML collation spec, section Orderings.
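To make the shorthand concrete, the following sketch expands a starred list into the long form by hand. StarShorthand and expand are hypothetical names used only for illustration; ICU performs this expansion internally when it parses rules. The sketch assumes no quoting or escaping and treats every '-' between two characters as a range operator.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of ICU: expands the starred shorthand by hand. */
final class StarShorthand {

    /** Builds the long form of "&reset relation* list", for example ("a", "<", "bcd-gp-s"). */
    static String expand(String reset, String relation, String list) {
        List<String> items = new ArrayList<>();
        int[] cps = list.codePoints().toArray();
        int previous = -1; // last code point emitted; serves as the start of a range
        for (int i = 0; i < cps.length; i++) {
            if (cps[i] == '-' && previous >= 0 && i + 1 < cps.length) {
                int end = cps[++i];
                // The range start was already emitted; add the rest up to and including the end.
                for (int cp = previous + 1; cp <= end; cp++) {
                    items.add(new String(Character.toChars(cp)));
                }
                previous = end;
            } else {
                items.add(new String(Character.toChars(cps[i])));
                previous = cps[i];
            }
        }
        StringBuilder rule = new StringBuilder("&").append(reset);
        for (String item : items) {
            rule.append(' ').append(relation).append(' ').append(item);
        }
        return rule.toString();
    }

    public static void main(String[] args) {
        // prints: &a < b < c < d < e < f < g < p < q < r < s
        System.out.println(expand("a", "<", "bcd-gp-s"));
    }
}
```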

Escaping Rules

Most characters can be used literally in rules. However, white space is skipped, and all ASCII characters that are not digits or letters are considered to be part of the syntax. To use such characters literally in rules, they need to be escaped. Escaping can be done in several ways:

    • Single characters can be escaped using backslash \ (U+005C).

    • Strings can be escaped by putting them between single quotes 'like this'.

    • The single quote (ASCII apostrophe) can be quoted using two single quotes '', both inside and outside single-quote-escaped strings.
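When rule strings are assembled in code, the quoting convention above can be applied mechanically. The following is a minimal sketch; RuleQuoting, needsEscape and quote are hypothetical names, not ICU API. It quotes a literal only when the literal contains white space or an ASCII character that is not a letter or digit, and doubles any apostrophes.

```java
/** Hypothetical helper, not part of ICU: quotes literal text for use in a rule string. */
final class RuleQuoting {

    /** True for white space and for ASCII characters that are not letters or digits. */
    static boolean needsEscape(int cp) {
        if (Character.isWhitespace(cp)) {
            return true;
        }
        boolean asciiLetterOrDigit = (cp >= '0' && cp <= '9')
                || (cp >= 'A' && cp <= 'Z')
                || (cp >= 'a' && cp <= 'z');
        return cp < 0x80 && !asciiLetterOrDigit;
    }

    /** Wraps the literal in single quotes if needed, doubling any apostrophes inside it. */
    static String quote(String literal) {
        if (literal.codePoints().noneMatch(RuleQuoting::needsEscape)) {
            return literal; // nothing to escape
        }
        return "'" + literal.replace("'", "''") + "'";
    }

    public static void main(String[] args) {
        System.out.println("& m < " + quote("%")); // prints: & m < '%'
        System.out.println(quote("it's"));         // prints: 'it''s'
    }
}
```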

Simple Tailoring Examples

Serbian (Latin) or Croatian: & C < č <<< Č < ć <<< Ć

This rule is needed because the root collation order usually treats accents as secondary differences relative to the base character. The tailoring ensures that 'č' and 'ć' are treated as base letters.

UCA

CUKIĆ RADOJICA

ČUKIĆ SLOBODAN

CUKIĆ SVETOZAR

ČUKIĆ ZORAN

CURIĆ MILOŠ

ĆURIĆ MILOŠ

CVRKALJ ĐURO

Tailoring: & C < č <<< Č < ć <<< Ć

CUKIĆ RADOJICA

CUKIĆ SVETOZAR

CURIĆ MILOŠ

CVRKALJ ĐURO

ČUKIĆ SLOBODAN

ČUKIĆ ZORAN

ĆURIĆ MILOŠ

Serbian (Latin) or Croatian: & Đ < dž <<< Dž <<< DŽ

This rule is an example of a contraction. "D" alone is sorted after "C" and "Ž" is sorted after "Z", but "DŽ", due to the tailoring rule, is treated as a single letter that gets sorted after "Đ" and before "E" ("Đ" sorts as a base letter after "D" in the UCA). Another thing to note in this example is capitalization of the letter "DŽ". There are three versions, since all three can legally appear in text. The fourth version "dŽ" is omitted since it does not occur.

UCA

dan

dubok

džabe

džin

Džin

DŽIN

đak

Evropa

Tailoring: & Đ < dž <<< Dž <<< DŽ

dan

dubok

đak

džabe

džin

Džin

DŽIN

Evropa

Danish: &V <<< w <<< W

The letter 'W' is sorted after 'V', but is treated as a tertiary difference similar to the difference between 'v' and 'V'.

UCA

va

Va

VA

vb

Vb

VB

vz

Vz

VZ

wa

Wa

WA

wb

Wb

WB

wz

Wz

WZ

Tailoring: &V <<< w <<< W

va

Va

VA

wa

Wa

WA

vb

Vb

VB

wb

Wb

WB

vz

Vz

VZ

wz

Wz

WZ

Default Options

ICU implements the LDML collation options/settings; see the LDML specification for more information.

The tailoring inherits all the attribute values from the root collator unless they are explicitly redefined in the tailoring. The following table summarizes the option settings. Default options are in emphasis.

A tailoring that consists only of options is also valid and has the same basic ordering as the root collation. For example, the Greek tailoring has option settings only: [normalization on][reorder Grek]

(The examples in this chapter might refer to older versions of data for particular languages. Check CLDR or ICU for actual, current tailorings.)

The following tailoring example reorders uppercase and lowercase and uses backwards-secondary ordering:

[caseFirst upper]

[backwards 2]

& C < č , Č

& G < ģ , Ģ

& I < y, Y

& K < ķ , Ķ

& L < ļ , Ļ

& N < ņ , Ņ

& S < š , Š

& Z < ž , Ž

Values for Reorder Codes

In addition, ISO 4-letter script codes can be used. Codes for scripts that do not have Unicode characters (according to the Unicode Script property values) are ignored.

Limitations of ICU 4.8-52 (lifted in later versions, except that Kore is still not usable because it refers to multiple scripts that do not sort primary-equal):

    • For Chinese, use script code Hani, not Hans or Hant.

    • For Japanese, use both Kana and Hani (not Hira).

    • For Korean, use both Hang and Hani (not Kore).

Semantics of a List of Reorder Codes

This section is relevant for both the [reorder ...] rule syntax and the Collator.setReorderCodes() API.

For an introduction and examples see the section “Script Reordering” in the Collation Concepts chapter.

On the API, the special groups are represented with Collator.ReorderCodes (UColReorderCode) values rather than UScript (UScriptCode) values.

In ICU 4.8-54, not every script could be reordered independently. CLDR and ICU supported reordering of groups of scripts, each of which started with one of the Recommended Scripts. A script that is not Recommended always moved together with the Recommended Script that precedes it in DUCET order. (Hiragana sorts together with Katakana, Coptic with Greek, etc.) ICU allowed any one script of a (Recommended Script + DUCET-following) group in the [reorder] list, moving the whole set of scripts together. However, it was strongly recommended that only Recommended Scripts be used.

Beginning with ICU 55, scripts only reorder together if they are primary-equal, for example Hiragana and Katakana.

Zyyy=Common and Zinh=Inherited cannot be reordered.

The special code Zzzz (= Unknown script = UScript.UNKNOWN = Collator.ReorderCodes.OTHERS = "others") stands for any script that is not explicitly mentioned in the list of reordering codes. If Zzzz is mentioned in the list, then any groups and scripts mentioned later in the list will go at the very end of the reordering, in the order given. If Zzzz is not mentioned, then all scripts that are not explicitly listed follow at the end in DUCET order.

The special reorder code Collator.ReorderCodes.NONE (= UScript.UNKNOWN), when used alone (same as [reorder Zzzz] or not specifying a [reorder] rule in a tailoring), will remove any reordering for this collator. The result of setting no reordering will be to use the DUCET/CLDR order.

On the API (not applicable to rule syntax), the special reorder code Collator.ReorderCodes.DEFAULT (= UScript.INHERITED) will reset the reordering for the collator to its default order. The default reordering may be the DUCET/CLDR order or may be a reordering that was specified when this collator was created from resource data or from rules. The DEFAULT code must be the sole code supplied when it is used.

For details see the section “Collation Reordering” in the LDML collation spec.
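The placement of Zzzz described above can be modeled with plain lists. This is a simplified sketch, not ICU code: ReorderSketch and apply are hypothetical names, the special groups and the Zyyy/Zinh restriction are ignored, and every listed code is assumed to occur in the default order that is passed in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical model, not ICU code: applies a reorder list to scripts given in default order. */
final class ReorderSketch {

    static List<String> apply(List<String> defaultOrder, List<String> reorder) {
        int others = reorder.indexOf("Zzzz");
        // Codes before Zzzz (or all codes, if Zzzz is absent) move to the front.
        List<String> before = others < 0 ? reorder : reorder.subList(0, others);
        // Codes after Zzzz move to the very end, in the order given.
        List<String> after = others < 0
                ? Collections.<String>emptyList()
                : reorder.subList(others + 1, reorder.size());

        List<String> result = new ArrayList<>(before);
        for (String script : defaultOrder) {
            if (!reorder.contains(script)) {
                result.add(script); // unlisted scripts keep their default relative order
            }
        }
        result.addAll(after);
        return result;
    }

    public static void main(String[] args) {
        List<String> ducet = Arrays.asList("Latn", "Grek", "Cyrl", "Hani");
        System.out.println(apply(ducet, Arrays.asList("Grek")));                 // [Grek, Latn, Cyrl, Hani]
        System.out.println(apply(ducet, Arrays.asList("Grek", "Zzzz", "Latn"))); // [Grek, Cyrl, Hani, Latn]
    }
}
```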

Advanced Syntactical Elements

Several other syntactical elements are needed in more specific situations. These elements are summarized in the following table:

Indirect Positioning of Collation Elements

Since ICU version 2.0, ICU allows for indirect positioning of collation elements (CE). Similar to the option top, these options allow for positioning of the tailoring relative to significant sections of the UCA table. You can use the [before] reset option to position before these sections.

Not all of the indirect-positioning anchors are useful. Most of the 'first' elements should be used with the [before] directive, in order to make sure that your tailoring will sort before an interesting section.

Complex Tailoring Examples

Following are several fragments of real tailorings, illustrating some of the advanced syntactical elements:

Expansion Example:

Swedish:

&t<<<þ/h

&T<<<Þ/H

The letter 'þ' (THORN) is normally treated by UCA/root collation as a separate letter that has primary-level sorting after 'z'. However, in Swedish and some other Scandinavian languages, 'þ' and 'Þ' should be treated as just a tertiary-level difference from the letters "th" and "TH" respectively. This is an example of an expansion.

UCA

az

Az

tha

Tha

THa

thz

za

Za

zz

þa

Þa

þz

Tailoring: &t<<<þ/h &T<<<Þ/H

az

Az

tha

þa

Tha

THa

Þa

thz

þz

za

Za

zz

Prefix Example:

Prefixes are used in Japanese tailorings to reduce the number of contractions. A large number of contractions is a performance burden on the commonly used base characters, as their processing is much more complicated than the processing of regular elements.

A prefix rule conditionally changes the CE of the character or string (e.g., ー) after the | symbol; unlike a contraction, it does not affect the CE of the preceding text (e.g., ァ). (By contrast, a contraction like ァー consumes both characters and can assign them a CE or expansion unrelated to ァ's CE.) A prefix rule is especially useful if the character or string (ー) after the | symbol occurs significantly less often than the first character of the prefix (ァ).

&[before 3]ァ <<< ァ|ー = ｧ|ー = ぁ|ー

This could have been written as a series of contractions followed by expansion:

&[before 3]ァー <<< ァー = ｧー = ぁー

However, in that case ァ, ｧ and ぁ would start contractions. Since the prolonged sound mark (ー) occurs much less frequently than the other letters of Japanese Katakana and Hiragana, it is much more prudent to put the extra processing on it by using prefixes.

Reset example:

A "reset" always uses only the base character as the insertion point even if there is an expansion. So the following rule,

& J <<< K / B & K <<< M

is equivalent to

& J <<< K / B <<< M

Which produces the following sort order:

"JA"

"MA"

"KA"

"KC"

"JC"

"MC"

Assuming the letters "J", "K" and "M" have equal primary weights, the second letter contains the differences among these strings. However, the letter "K" is treated as if it always has a letter "B" following it while the letters "J" and "M" do not.

The following is an example of collation elements for these strings resulting from the specified rules:

Tailoring Issues

ICU uses canonical closure. This means that for each code point in Unicode, if the canonically composed form of a tailored string produces different collation elements than the canonically decomposed form, then the canonically composed form is effectively added to the ordering. If 'a' is tailored, for example, all of the accented 'a' characters are also tailored. Canonical closure allows collators to process Unicode strings in the FCD form as well as in NFD. (Note: Most but not all NFC strings are also in FCD. See http://www.unicode.org/notes/tn5/#FCD)

However, compatibility equivalents are NOT automatically added. If the rule "&b < a" is in tailoring, and the order of ⓐ (circled a) is important, it needs to be tailored explicitly.
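The difference between canonical and compatibility equivalents can be checked with the JDK's java.text.Normalizer, independently of ICU. In this sketch the class name EquivalenceCheck and its helper methods are hypothetical; only Normalizer is an existing API.

```java
import java.text.Normalizer;

/** Illustration using the JDK normalizer; the class and method names are hypothetical. */
final class EquivalenceCheck {

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String aAcute = "\u00E1";   // a with acute
        String circledA = "\u24D0"; // circled a

        // Canonical equivalent: NFD yields a + U+0301, so canonical closure covers it.
        System.out.println(nfd(aAcute).equals("a\u0301"));   // true

        // Compatibility equivalent only: NFD leaves it unchanged, NFKD maps it to "a".
        System.out.println(nfd(circledA).equals(circledA));  // true
        System.out.println(nfkd(circledA).equals("a"));      // true
    }
}
```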

Redundant tailoring rules are removed, with later rules "winning". The strengths around the removed rules are also fixed.

Example:

The following table summarizes effects of different redundant rules.

If two different reset lists use the same character it is removed from the first one (see 1 in the table above). If the second character is a reset, the second list is inserted in the first (see 2). If both are resets, then the same thing happens (see 3). Whenever such an insertion occurs, the second strength "postpones" the position (see 4).

If there is a "[before N]" on the reset, then the reset character is effectively replaced by the item that would be before it, either in a previous tailoring (if the letter occurs in one - see 5) or in the UCA. The N determines the 'distance' before, based on the strength of the difference (see 6-8). However, this is subject to postponement (see 9), so be careful!

Reset semantics

The reset semantics in ICU 1.8 and above differ from those of earlier ICU releases. Prior to version 1.8, the reset relation modifier was applicable only to the entry immediately following the reset entry. Since version 1.8, the relation modifier applies to all entries that occur until the next reset or primary relation.

For example,

&xyz << e <<< f

was equivalent to

&x << e/yz <<< f

prior to ICU version 1.8.

Starting with ICU version 1.8, the rule is equivalent to

&x << e/yz <<< f/yz

The new semantic produces more intuitive results, especially when the character after the reset is decomposable. Since all rules are converted to NFD before they are interpreted, this can result in contractions that the rule-writer might not be aware of. Expansion propagates only until the next reset or primary relation occurs.

For example, the following rule:

&ab = c <<< d << e <<< f < g <<< h

was equivalent to the following prior to ICU 1.8 and in Java:

&a = c/b <<< d << e <<< f < g <<< h

Starting with 1.8, it is equivalent to

&a = c / b <<< d / b << e / b <<< f / b < g <<< h
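The propagation shown in these examples can be reproduced with a small text rewrite. This is only a sketch of the described semantics: ResetExpansion and makeExplicit are hypothetical names, tokens must be separated by white space, nothing may be quoted, the rule has a single reset that is assumed not to be a contraction, and any primary relation is treated as ending the propagation, as in the example above.

```java
/** Hypothetical helper, not part of ICU: makes the implied expansions of a reset explicit. */
final class ResetExpansion {

    static String makeExplicit(String rule) {
        // Drop the leading '&' and split into: reset, relation, item, relation, item, ...
        String[] tokens = rule.trim().substring(1).trim().split("\\s+");
        String reset = tokens[0];
        int baseEnd = reset.offsetByCodePoints(0, 1);
        String base = reset.substring(0, baseEnd);   // first character of the reset
        String expansion = reset.substring(baseEnd); // remaining characters become the expansion

        StringBuilder out = new StringBuilder("&").append(base);
        boolean propagate = !expansion.isEmpty();
        for (int i = 1; i + 1 < tokens.length; i += 2) {
            String relation = tokens[i];
            String item = tokens[i + 1];
            if (relation.equals("<")) {
                propagate = false; // a primary relation ends the propagation
            }
            out.append(' ').append(relation).append(' ').append(item);
            if (propagate) {
                out.append(" / ").append(expansion);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: &a = c / b <<< d / b << e / b <<< f / b < g <<< h
        System.out.println(makeExplicit("&ab = c <<< d << e <<< f < g <<< h"));
    }
}
```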

Known Limitations

The following are known limitations of the ICU collation implementation. They are theoretical, since there are no known languages for which they are an issue; for completeness, however, they should be fixed in a future version after 1.8.1. The examples given are designed for simplicity in testing, and do not match any real languages.

Expansion

The goal of expansion is to sort as if the expansion text were inserted right after the character. For example, with the rule

&a <<< c / e

The text "...c..." should sort as if it were right after "...ae..." with a tertiary difference. There are a few cases where this is not currently true.

Recursive Expansion

Given the rules

&a <<< c / e

&g <<< e / i

Expansion should sort the text "...c..." as if it were just after "...ae...", and that should also sort as if it were just after "...agi...". This requires that the compilation of expansions be recursive (and check for loops as well!). ICU currently does not do this.

Rules

& a = b / c

& d = c / e

Desired Order

add

b

adf

Current Order

b

add

adf

Contractions Spanning Expansions

ICU currently always pre-compiles the expansion into an internal format (a list of one or more collation elements) when the rule is compiled. If there is a contraction that spans the end of the expanded text and the start of the original text, however, that contraction will not match. A test case that illustrates this is:

Rules

& a <<< c / e

& g <<< eh

Desired Order

ad

c

af

g

ch

h

Current Order

ad

c

ch

af

g

h

Since the pre-compiled expansions are a huge performance gain, we will probably keep the implementation the way it is, but in the future allow additional syntax to indicate those few expansions that need to behave as if the text were inserted because of the existence of another contraction. Note that such expansions need to be recursively expanded (as in #1), but rather than at pre-compile time, these need to be done at runtime.

While it is possible to automatically detect these cases, it would be better to allow explicit control in case spanning is not desired. An example of such syntax might be something like:

&a <<< c // e

Notes: ICU does handle the case where there is a contraction that is completely inside the expansion.

Suppose that someone had the rules:

&a = c / e

&x = ae

These do not cause c to sort as if it were ae, nor should they.

Normalization

The Unicode Collation Algorithm specifies that all text sort as if it were first normalized into NFD. For performance reasons, ICU collation data is pre-processed so that there is no need to perform normalization on strings that are in FCD and do not contain any composite combining marks. The vast majority of strings are in this form. The composite combining marks are { U+0344, U+0F73, U+0F75, U+0F81 }, which corresponds to the set [[:^lccc=0:]&[:toNFD=/../:]]. These characters must be decomposed for discontiguous contractions to work properly; their use is discouraged by the Unicode Standard.

Nulls in Contractions

Nulls should not be used in contractions that could invoke normalization.

Rules

& a <<< '\u0000'^

Desired Order

a

'\u0000'^

Current Order

'\u0000'^

a

Contractions Spanning Normalization

The following rule specifies that a grave accent followed by a b is a contraction, and sorts as if it were an e.

& e <<< ` b

On this basis, "...àb..." should sort as if it were just after "...ae...". Because of the preprocessing, however, the contraction will not match if this text is represented with the pre-composed character à, but will match if given the decomposed sequence a + grave accent. The same thing happens if the contraction spans the start of a normalized sequence.
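Why the precomposed form cannot match is visible in the code units themselves. The following sketch uses the JDK's java.text.Normalizer and assumes that the grave accent in the rule stands for the combining grave accent U+0300; ContractionSpan and its methods are hypothetical names.

```java
import java.text.Normalizer;

/** Illustration with the JDK normalizer; class and method names are hypothetical. */
final class ContractionSpan {

    /** True if the combining grave accent (U+0300) is directly followed by 'b' in the text. */
    static boolean hasGraveB(String text) {
        return text.contains("\u0300b");
    }

    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00E0b";              // precomposed a-grave, then b
        String decomposed = decompose(composed);  // a, U+0300, b

        System.out.println(hasGraveB(composed));   // false: the accent is hidden inside U+00E0
        System.out.println(hasGraveB(decomposed)); // true: the contraction text is present
    }
}
```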

Variable Top

ICU lets you set the top of the variable range. This can be done, for example, to allow you to ignore just SPACES, and not punctuation.

Variable Top Exclusion

There is currently a limitation that causes variable top to (perhaps) exclude more characters than it should. This happens if you not only set variable top, but also tailor a number of characters around it with primary differences. The exact number that you can tailor depends on the internal "gaps" between the characters in the pre-compiled UCA table. Normally there is a gap of one. There are larger gaps between scripts (such as between Latin and Greek), and after certain other special characters. For example, if variable top is set to be at SPACE ('\u0020'), then it works correctly with up to 70 characters also tailored after space. However, if variable top is set to be equal to HYPHEN ('\u2010'), only one other value can be accommodated.

With ICU 1.8.1, the user is advised not to tailor the variable top to customize more than two primary relations (for example, "& x < y < [variable top]"). Starting in ICU 2.0, setVariableTop() allows the user to set the variable top programmatically to a legal single character or a valid contracting sequence. In addition, the string that variable top is set to should not be treated as either inclusive or exclusive in the rules.

Case Level/First/Second

In ICU, it is possible to override the tertiary settings programmatically. This is used to change the default case behavior to be all upper first or all lower first. It can also be used for a separate case level, or to ignore all other tertiary differences (such as between circled and non-circled letters, or between half-width and full-width katakana). The case values are derived directly from the Unicode character properties, and not set by the rules.

Mixed Case Contractions

There is currently a limitation that all contractions of multiple characters can only have three special case values: upper, lower, and mixed. All mixed-case contractions are grouped together, and are not affected by the upper first vs. lower first flag.

Rules

& c < ch

<<< cH

<<< Ch

<<< CH

Desired Order

UPPER_FIRST

C

CH

Ch

cH

ch

Current Order

c

CH

cH

Ch

ch
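The three case values can be illustrated with a simple classifier. This is a model of the limitation as described, not ICU's internal code: ContractionCase, CaseClass and classify are hypothetical names, and the JDK's Character.isUpperCase and Character.isLowerCase stand in for the Unicode case properties.

```java
/** Hypothetical model, not ICU code: the three case classes available to a contraction. */
final class ContractionCase {

    enum CaseClass { UPPER, LOWER, MIXED }

    static CaseClass classify(String contraction) {
        boolean hasUpper = contraction.codePoints().anyMatch(Character::isUpperCase);
        boolean hasLower = contraction.codePoints().anyMatch(Character::isLowerCase);
        if (hasUpper && hasLower) {
            return CaseClass.MIXED;
        }
        // Simplification: strings without any cased letters are reported as LOWER.
        return hasUpper ? CaseClass.UPPER : CaseClass.LOWER;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"ch", "cH", "Ch", "CH"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

In this model "cH" and "Ch" fall into the same class, which is why an upper-first or lower-first setting cannot order them relative to each other.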

Building on Existing Locales

All of the collation rules are additive; that is, they override what any previous rule expressed. That means that you can build on existing rules for given locales. Here is an example of this, which fetches the rules for a particular locale (Danish), then overrides some part (sorting '%' after 'm'). The syntax is Java, but C/C++ has similar features.

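    // Requires ICU4J on the classpath; assertEquals comes from the test framework in use.
    import com.ibm.icu.text.Collator;
    import com.ibm.icu.text.RuleBasedCollator;
    import com.ibm.icu.util.ULocale;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.TreeSet;

    ULocale myLocale = new ULocale("da");
    try {
        // Fetch the Danish tailoring rules and append a rule that sorts '%' after 'm'.
        RuleBasedCollator col = (RuleBasedCollator) Collator.getInstance(myLocale);
        String rules = col.getRules();
        String myRules = "& m < '%'";
        RuleBasedCollator col2 = new RuleBasedCollator(rules + myRules);

        // check the values: the expected list is already in the customized sort order
        List<String> expected = Arrays.asList("a;m;%;z;aa".split(";"));
        TreeSet<String> sorted = new TreeSet<String>(col2);
        sorted.addAll(expected);
        ArrayList<String> actual = new ArrayList<String>(sorted);
        assertEquals("Customized rules with %", expected, actual);
    } catch (Exception e) {
        throw new IllegalArgumentException("Failed to create customized rules", e);
    }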

The root collator has an empty rules string (getRules() returns ""). Any collator's tailoring rules string defines how that collator differs from the root collator, and it was the input for building the tailored collator. By contrast, the root collator itself is built from a file with explicit mappings from characters/contractions to collation elements. This file represents the DUCET as modified by CLDR.

There are "extended" versions of getRules() which, when called with delta=UCOL_FULL_RULES (C/C++) or fullrules=true (Java), return "full rules" which are a concatenation of the "UCA rules" and the collator's tailoring. The "UCA rules" are published as UCA_Rules.txt in every UCA release.

    • "UCA rules" is a historical misnomer. The UCA specifies an Algorithm which applies to all collators, and provides the DUCET as its Default table.

    • ICU's root collator implements the CLDR-modified collation element table. The "UCA rules" returned from ICU functions are equivalently modified rules compared with those for the DUCET.

The "UCA rules" are an approximation of the root collator's sort order, but there are some differences because not all of the details of the root collator mappings can be expressed in rule syntax. In particular, a collator built from UCARules.txt has at least the following issues compared with the real root collator:

    • inefficient (long) collation element weights

    • CODAN (numeric collation) will not work (the 0 digit's primary weight is hardcoded, or specified in FractionalUCA.txt)

    • script reordering will not work

    • alternate=shifted will not work

    • the sort order has some differences from the regular root collator, including additional tertiary differences

The "full rules" are almost never used, or useful, at runtime. They are included in ICU for historical reasons and for UCA consistency tests. They might be usable for emulating the CLDR/ICU sort order with a collation implementation not based on CLDR/ICU.

Collation rule strings in general are not commonly used but are a significant portion of the data size in ICU collation resource bundles, especially for CJK languages. The rule strings can be omitted from those resource bundles by adding the --omitCollationRules option to the relevant genrb invocations (e.g., in ICU's source/data/Makefile.in).

If the tailoring rules are needed but the 150kB or so of "UCA rules" are not, then the line

UCARules:process(uca_rules){"../unidata/UCARules.txt"}

in source/data/coll/root.txt can be commented out or deleted.

Cautions

The following are not known rule limitations, but rather cautions.

Resets

Since resets always work on the existing state, the user is required to make sure that the rule entries are in the proper order.

Rules

& a < b

& a < c

Order

a

c

b

Comment

The rules mean: put b after a, then put c after a (inserting before the b).

Postpone Insertion

When using a reset to insert a value X with a certain strength difference after a value Y, it actually is inserted just before the next item of the same strength or higher following Y. Thus, the following are equivalent:

... m < a = c <<< d << e <<< f < g <<< h & a << x

... m < a = c <<< d << x << e <<< f < g <<< h

This is different from the Java semantics. In Java, the value is inserted immediately after the reset character.
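The postponement can be modeled on a flat chain of items and relations. The following is a sketch of the behavior described in this section and in the Resets caution, not ICU's implementation: PostponeInsertion, rank and insert are hypothetical names, and the chain must consist of white-space-separated tokens that alternate between items and relations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model, not ICU code: inserts an item into a chain with postponement. */
final class PostponeInsertion {

    /** Strength rank of a relation; a smaller number means a stronger difference. */
    static int rank(String relation) {
        switch (relation) {
            case "<": return 1;
            case "<<": return 2;
            case "<<<": return 3;
            case "=": return 4;
            default: throw new IllegalArgumentException("unknown relation: " + relation);
        }
    }

    /** Inserts "relation item" after anchor, just before the next entry of equal or higher strength. */
    static String insert(String chain, String anchor, String relation, String item) {
        List<String> tokens = new ArrayList<>(Arrays.asList(chain.trim().split("\\s+")));
        int anchorIndex = tokens.indexOf(anchor);
        if (anchorIndex < 0) {
            throw new IllegalArgumentException("anchor not in chain: " + anchor);
        }
        int pos = anchorIndex + 1; // index of the relation that follows the anchor
        // Skip entries that are weaker than the new relation.
        while (pos < tokens.size() && rank(tokens.get(pos)) > rank(relation)) {
            pos += 2;
        }
        tokens.add(pos, item);
        tokens.add(pos, relation);
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // prints: m < a = c <<< d << x << e <<< f < g <<< h
        System.out.println(insert("m < a = c <<< d << e <<< f < g <<< h", "a", "<<", "x"));
        // prints: a < c < b
        System.out.println(insert("a < b", "a", "<", "c"));
    }
}
```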

Jamo Tailoring

If Jamo characters are tailored, that causes the code to go through a slow path, which will have a significant effect on performance.

Compatibility Decompositions

When tailoring a letter, the customization affects all of its canonical equivalents. That is, if a tailoring rule sorts an 'a' after 'e', for example, then "à", "á", ... are also sorted after 'e'. This is not true for compatibility equivalents. If the desired sorting order is for a superscript-a ("ª") to be after "e", it is necessary to specify the rule for that.

Case Differences

Similarly, tailoring an "a" to be sorted after "e" does not move "A". If "A" should be sorted after "e" as well, a specific rule for that sorting sequence is required.

Automatic Expansions

ICU will automatically form expansions whenever a reset is to a multi-character value that is not a contraction. For example, & ab <<< c is equivalent to & a <<< c / b. The user may be unaware of this happening, since it may not be obvious that the reset is to a multi-character value. For example, & à <<< d is equivalent to & a <<< d / `.