Emoji/ARIB Symbols Encoding Principles (Rough Draft)

Previous versions:
2008-08-12: L2/08-308 Symbols Encoding Principles (Rough Draft)

[TBD:
  Change the grammar to not use "we", and otherwise use the imperative mood.
  Add an example of each case, where possible.
]

This document collects the decisions made by the UTC in response to the various questions raised about how to incorporate the Emoji and ARIB sets into Unicode. The intention is to apply these principles to future cases of similar sets of characters.

Principles for the encoding of Emoji symbols (or similar sets) as assigned characters in the Unicode Standard: (See also Chapter 2, Section 2.2, "Unicode Design Principles", of the Unicode Standard.)
  1. Already encoded: Symbols considered for encoding should already be encoded in a character set, called a source character set. Such a character set may be defined by a standards organization, a company, consortium or other organization. Such a character set should be in widespread use. Not every symbol in that character set need be in widespread use.
    • The character set may consist of a set of Unicode Private Use Area (PUA) code points, or it may use a non-Unicode encoding, or both (with a mapping table).
    • The source character set for Emoji consists of the union of the Emoji set proper and cp932 (also called the Windows extensions to Shift JIS), since cp932 is used together with the Emoji.
  2. Source separation rule: If a single source character set separates two characters (anywhere in the character set, so including standard JIS codes), then we map them to two separate Unicode characters. (For Emoji this is a hard and fast rule, but not for ARIB.)
  3. Reuse: We map to existing Unicode symbols where appropriate. (Unification with existing characters.)
  4. Separating generic symbols: If Unicode has a set of related symbols, but no character in that set is as generic as the corresponding symbol in the proposed symbol sets, then we encode a new character. For example, the Emoji symbol sets do not distinguish between waxing and waning crescent moons.
  5. Colors and Animation: We encode symbols as characters, abstracting away from colors and animation. We only distinguish by nominal color or animation for the source separation rule. (See naming below.)
  6. Existing cross-mapping tables: Where cross-mapping tables are established among related symbol character sets, we follow the tables as much as possible and unify among the symbol character sets, but we disunify in cases where the visual images are very different and not semantically associated. For example, among Emoji symbol character sets:
    1. We disunified the 'M' symbol for Metro from the Metro train image. The 'M' symbol would have translation problems. (This is similar to the problems with the international currency symbol and the proposal for a "generic decimal separator".)
    2. On the other hand, we unified the sets of Zodiac symbols, even though the images shown by carriers vary widely. This is because they clearly belong to a cohesive set which corresponds across carriers.
  7. Least-marked common symbol: For a set of symbols from related symbol character sets which each could map to an existing Unicode code point, we choose the symbol that is shared among the most carriers (according to the cross-mapping tables) and has the least-marked form.
  8. Naming: Character names are typically based on the glosses of the vendor symbols or the visual appearance. We follow the conventions for existing Unicode characters where possible, in particular using "BLACK" for "filled" and "WHITE" for "hollow". We exclude nominal color and animation from proposed character names except where necessary for distinction.
    • It is preferred to choose symbol character names by appearance rather than semantics because symbols tend to be used for different purposes and selected for desired appearance. An example is the Emoji Bank symbol, which includes the letters BK, and is used for "bakkureru" as well as Bank.
  9. Characters, not glyphs: We should avoid encoding glyph variations as characters.
    • For example, the ARIB standards have several Kanji at 70% of full size (ARIB row 92 cells 26..31). These should not be encoded separately.
    • We generally do not use variation sequences, but we reserve the ability to add them for cases where we have unified characters.
    • We may have characters which would have compatibility mappings (see the ARIB set for examples).
  10. Combining enclosing marks: For some symbols, it may be appropriate to encode them as sequences of an existing Unicode character with a combining enclosing mark of the right shape (circle, square, keycap, etc.). However, this cannot be done for enclosing multiple base characters, and should not be done for heavily styled characters where the enclosing mark does not express the styling well.
  11. When characters look like sequences of existing characters:
    1. They may be encoded as compatibility decomposables if they are similar to existing compatibility decomposables.
    2. They may be encoded without decompositions if they are not similar.
    3. If they are better represented as sequences, they would not be encoded.
  12. Duplicates within a source set: If the source separation rule does not apply (ARIB), we may unify characters.
  13. Properties: Where the sets include characters that are commonly used as punctuation or other kinds of characters, character properties are a factor in unification decisions.
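
Several of the principles above refer to conventions that can be observed in characters already encoded. The following is a minimal sketch, not part of the UTC decisions, using Python's standard unicodedata module. The characters chosen (U+2605, U+2606, U+20E3, U+2460) are illustrative picks from the existing repertoire, not characters from the Emoji or ARIB proposals.

```python
import unicodedata

# Principle 8 (naming): existing names use BLACK for filled shapes
# and WHITE for hollow shapes.
filled_star = "\u2605"
hollow_star = "\u2606"
print(unicodedata.name(filled_star))   # BLACK STAR
print(unicodedata.name(hollow_star))   # WHITE STAR

# Principle 10 (combining enclosing marks): a symbol represented as a
# sequence of an existing base character plus an enclosing mark.
enclosing_keycap = "\u20E3"
keycap_one = "1" + enclosing_keycap
print(unicodedata.name(enclosing_keycap))      # COMBINING ENCLOSING KEYCAP
print(unicodedata.category(enclosing_keycap))  # Me (Mark, enclosing)
print(len(keycap_one))                         # 2 code points, one base only

# Principle 11 (compatibility decomposables): an existing character that
# looks like a sequence and carries a tagged compatibility decomposition.
circled_one = "\u2460"
print(unicodedata.decomposition(circled_one))      # <circle> 0031
print(unicodedata.normalize("NFKC", circled_one))  # 1
```

The keycap sequence shows why principle 10 is limited to a single base character: the enclosing mark attaches to the one character that precedes it.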

Code point assignment guidelines

Fill existing blocks in the BMP, but do not create new blocks in that plane. Although these symbols are in modern use, it is felt that the few remaining spaces in the BMP should be reserved for scripts, not new symbols. New blocks are therefore allocated in supplementary plane 1 (the SMP) to accommodate characters that do not fit in existing BMP blocks.
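
As a rough illustration of the plane distinction in this guideline, the sketch below classifies a code point by plane. The function names (plane_of, is_bmp, is_smp) are hypothetical helpers introduced here, and the sample code points are arbitrary values chosen only to exercise each plane.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) of a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane holds 0x10000 code points, so the plane is the value
    # of the bits above the low 16.
    return code_point >> 16


def is_bmp(code_point: int) -> bool:
    """True for plane 0, the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


def is_smp(code_point: int) -> bool:
    """True for plane 1, the Supplementary Multilingual Plane."""
    return plane_of(code_point) == 1


for cp in (0x2605, 0xFFFF, 0x10000, 0x1F300):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```

Under the guideline, a symbol that fits a gap in an existing BMP block would satisfy is_bmp, while a symbol needing a new block would be assigned where is_smp holds.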

Comments