Clarification of implicit weights for ideographs in UCA
L2/xxx
To: UTC
From: Mark Davis, Markus Scherer
Re: Clarification of implicit weights for ideographs in UCA
Date: 2009-02-20
UCA has the following description of how to generate implicit weights:
The value for BASE depends on the type of character:
BASE    Type of character
FB40    CJK Ideograph
FB80    CJK Ideograph Extension A/B
FBC0    Any other code point
http://unicode.org/reports/tr10/#Implicit_Weights
Unfortunately, this description is ambiguous. It will also need to be updated whenever additional CJK ideographic blocks are added.
The goal was to include all the Unified Ideographs. The issue is what counts as "CJK Ideograph", and what counts as "CJK Ideograph Extension A/B". Our presumption is that:
"CJK Ideograph" means the ideographic blocks defined in Unicode 1.1, that is, those in [:block=CJK Unified Ideographs:] plus [:block=cjk compatibility ideographs:]. Because of the way UCA works, the latter block only matters for its 12 characters that are unchanged by normalization (NFC); the others have canonical decompositions and are replaced during normalization, before implicit weights are assigned.
"CJK Ideograph Extension A/B" means all other blocks that contain unified ideographs; currently that is [[:block=cjk unified ideographs extension a:][:block=cjk unified ideographs extension b:]].
Note that, because these definitions are block-based, both categories include some unassigned code points.
This is what we have followed for some time in the ICU implementation. We'd like to fix the text to correspond to the above.
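The presumption above can be sketched in code. The following is a minimal illustration, not the ICU implementation: the function names are hypothetical, the block ranges are the published block boundaries, and the AAAA/BBBB formula is taken from the Implicit Weights section of UCA linked above rather than from this memo. It assumes the input code point has already been normalized.

```python
# Inclusive block boundaries (including unassigned code points in each block).
CJK_UNIFIED = (0x4E00, 0x9FFF)    # CJK Unified Ideographs
CJK_COMPAT = (0xF900, 0xFAFF)     # CJK Compatibility Ideographs
EXT_A = (0x3400, 0x4DBF)          # CJK Unified Ideographs Extension A
EXT_B = (0x20000, 0x2A6DF)        # CJK Unified Ideographs Extension B


def in_block(cp, block):
    """Return True if code point cp lies inside the inclusive block range."""
    first, last = block
    return first <= cp <= last


def implicit_base(cp):
    """Pick BASE by block, following the presumption stated in this memo."""
    if in_block(cp, CJK_UNIFIED) or in_block(cp, CJK_COMPAT):
        return 0xFB40             # "CJK Ideograph"
    if in_block(cp, EXT_A) or in_block(cp, EXT_B):
        return 0xFB80             # "CJK Ideograph Extension A/B"
    return 0xFBC0                 # any other code point


def implicit_weights(cp):
    """Return the pair (AAAA, BBBB) of implicit primary weights for cp."""
    aaaa = implicit_base(cp) + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb


if __name__ == "__main__":
    for cp in (0x4E00, 0x3400, 0x20000, 0xFA0E, 0x10FFFD):
        aaaa, bbbb = implicit_weights(cp)
        print(f"U+{cp:04X} -> {aaaa:04X} {bbbb:04X}")
```

Because the test is purely range-based, an unassigned code point inside one of these blocks receives the same BASE as its assigned neighbors, which is the behavior the note above describes.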
Other Questions:
Is it worth adding a note that the set of assigned characters in the CJK Unified Ideographs block grew in Unicode 4.1 and Unicode 5.1?
Should we "future-proof" the second category by making it Extension A block + Plane 2 (minus non-characters)? What about Plane 3?
The text also needs a bit of editing because it uses "character" sometimes when "code point" is meant.
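To make the future-proofing question concrete, the sketch below compares the current second category (Extension A block + Extension B block) with the proposed alternative (Extension A block + Plane 2 minus noncharacters). The function names are hypothetical and exist only for illustration; Plane 3 is deliberately left out because that part of the question is still open.

```python
# Inclusive ranges expressed as Python ranges (the end is exclusive, hence + 1).
EXT_A = range(0x3400, 0x4DBF + 1)      # Extension A block
EXT_B = range(0x20000, 0x2A6DF + 1)    # Extension B block
PLANE_2 = range(0x20000, 0x2FFFD + 1)  # Plane 2 minus U+2FFFE and U+2FFFF


def second_category_current(cp):
    """Membership under the block-based definition presumed in this memo."""
    return cp in EXT_A or cp in EXT_B


def second_category_future_proof(cp):
    """Membership under the proposed Extension A + Plane 2 definition."""
    return cp in EXT_A or cp in PLANE_2


def would_change(cp):
    """True if the proposal would move cp into or out of the second category."""
    return second_category_current(cp) != second_category_future_proof(cp)


if __name__ == "__main__":
    for cp in (0x20000, 0x2A6DF, 0x2A6E0, 0x2FFFD, 0x2FFFE, 0x30000):
        print(f"U+{cp:04X} changes category: {would_change(cp)}")
```

Under these assumptions, only the Plane 2 code points after the end of the Extension B block (U+2A6E0..U+2FFFD) would change BASE, from FBC0 to FB80.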