Proposal for Encoding Emoji Symbols

(Japanese translation)

N3582

L2/09-025R2

Date: 2009-Mar-05

Authors:

Markus Scherer, Mark Davis, Kat Momoi, Darick Tong (Google Inc.)

Yasuo Kida, Peter Edberg (Apple Inc.)

Proposal

The Unicode Consortium proposes adding 674 UCS characters to enable complete UCS representation of the fixed set of Emoji that have been encoded in carrier-specific extended versions of Shift-JIS or ISO-2022-JP by the three primary cell phone carriers in Japan: NTT DoCoMo, KDDI, and Softbank. These characters are interchanged with many other systems. The Emoji symbols proposed here are not yet encoded in the UCS, while the remaining 114 Emoji symbols are represented by existing UCS characters.

Documents:

    • N3582 = L2/09-025R2: This proposal document

    • N3583 = L2/09-026R: Chart of proposed characters

    • N3585 = L2/09-078: Sources File

Rationale

As with the already-approved ARIB symbols, the purpose for adding these symbols is interoperability. Interoperability is important for those cellphone carriers and their clients, but not only for them. What is even more important is that the data be handled without loss or corruption by a vast array of other interoperating systems that use the UCS: email systems, search engines, publishing systems, databases, and so on. These symbols are also supported in web mail services by Yahoo! Mail and Google Mail, and in the Apple iPhone.

Users treat Emoji as text and expect that they can be interchanged like any other element of text. There are mapping tables in use in the industry between these character sets, with both roundtrip and fallback mappings. However, the only UCS representation is via Private-Use characters which are subject to misinterpretation and data corruption.

This core set of Emoji was encoded as extensions to Shift-JIS / ISO-2022-JP following a history of similar vendor-specific extensions by other companies and for other purposes in Japan. (For example, the Japanese recording industry standard RIS-506-1996 specifies an extension of Shift-JIS for use in Music CD text, and includes a number of characters similar to the Emoji in the current proposal.) This approach has enabled these Emoji to be used in plain text contexts such as SMS text messages, e-mail subject lines and address book entries, and many users depend on being able to use Emoji from the core set in this way.

For Emoji beyond this core set, vendors have added rich text support, and use approaches such as embedded graphics. Similar techniques (embedded graphics or escaped tags designating Emoji) are also used for emoji support in China and the Republic of Korea, and there is no demonstrated need for encoding of Emoji beyond the proposed core set. The core set cannot be further extended to support additional Emoji without creating interoperability problems even within the cell phone carriers' own networks. However, the existing core set of Emoji will continue to be represented as encoded characters, because users depend on capabilities that result from this and because this core set is already widespread in existing encoded data.

As of December 2008, there are 110.4 million cell phone users in Japan (about 87% of the population), and about 90.6% of the cell phones are 3G-enabled for internet use. Emoji are widely used, especially by people under 30. However, a June 2007 survey of 13,000 users — 80% of whom were 30 or older — found that even among this older group, 78% "often" or "sometimes" used Emoji in emails. Respondents reported using a wide variety of Emoji, including Emoji for faces, emotions, weather, vehicles and buildings, food and drink, animals, etc. Especially among younger users, email is mostly or exclusively used on cell phones instead of computers. Among cell phone users, 90% use email primarily on cell phones, and 60% use email exclusively on cell phones. Emoji have been used on Japanese cell phones for 10 years, and there is no evidence that use of Emoji is decreasing.

The UCS encoding of the proposed core set of Emoji is intended for interoperability with existing data generated by Japanese cell phone users. This interoperability requirement is analogous to the rationale for encoding Dingbats, for example. Because the overriding requirement is interoperability, the identity of the proposed characters is determined to some extent by their mappings to and from the vendor-specific encodings, and therefore this proposal includes a Sources table. Establishing character identity by mapping to sources follows the model used for East Asian ideographs.

Note that encoding of the complete set is essential in order to satisfy the interoperability requirement, otherwise data is lost in round-trip conversion and data interchange.

Sources File

The Unicode Consortium requests to make the emoji_sources.txt file (N3585=L2/09-078) a normative part of the contents of ISO/IEC 10646. This can be done by reference.

References

Factors

The following factors were considered in developing the proposal.

    1. Complete set: The proposed symbols are designed for complete round-trip conversion of each of the major carriers' symbols, taking unifications into account. This is necessary for interoperability. See the emoji_sources.txt file (N3585=L2/09-078) for complete source information.

    2. Source separation rule: If a single carrier separates two characters (anywhere in the character set, so including standard JIS codes), then they are mapped to two separate UCS characters. Similar to the case of CJK Unified Ideographs, this is a hard and fast rule.

    3. Reuse: Existing UCS symbols are used where appropriate. This includes unifications with "upcoming" characters in ISO/IEC 10646 AMD6 and Unicode 5.2. In particular, some unifications are with symbols from the ARIB set.

    4. Separating generic symbols: If the UCS had a set of related symbols, but no one character in the set was as generic as in the Emoji symbol sets, then a new character is added. For example, the Emoji sets do not distinguish between waxing and waning crescent moons.

    5. Colors and Animation: Colors and animation are not part of the encoding. The characters may have associated colors or animation in particular implementations but that is not part of their identity.

    6. Existing cross-mapping tables: The mapping tables mentioned above were followed as much as possible. For example, the sets of Zodiac symbols are unified, even though the images shown by carriers vary widely. This is because they clearly belong to a cohesive set which corresponds across carriers and is mapped across carriers. Characters were disunified in a small number of cases, where the visual images were very different and not semantically associated. For example, the 'M' symbol for Metro is disunified from the Metro train image. Round-trip mappings between carrier Shift-JIS character sets can be maintained, by having the mapping tables between Unicode and each carrier's Shift-JIS version use appropriate fallback mappings.

Symbol Identifiers

During proposal development, stable, internal identifiers were used, for example e-02A for the ALARM CLOCK symbol. These are used only for reference during development.

Code point assignments

Some symbols that are closely related to existing ones are allocated in the same blocks, or in related "supplementary" blocks, if there are unassigned code points available there. Most of the symbols are proposed to be assigned code points in a new block on the SMP. No new BMP blocks are proposed for Emoji symbols. Of the 674 symbols, 9 are proposed for encoding on the BMP, and the remaining 665 are proposed for encoding on the SMP.

Special, rarely used, carrier-specific symbols are proposed for encoding in the Emoji compatibility symbols block. They are needed to complete the set for interoperability but are only identified by their source mappings (N3585), not specific glyphs and names.

Properties

Most symbols are proposed with standard symbols properties, with Bidi_Mirroring=False, for example like

2702;BLACK SCISSORS;So;0;ON;;;;;N;;;;;

Exceptions:

    • One symbol, e-B08 LOOPED LENGTH MARK, is proposed as a punctuation character (gc=Pd). (This is related to U+3030 WAVY DASH which also has gc=Pd.)

    • There are several symbols with compatibility decompositions. These decompositions are noted in the chart (N3583). The set of characters with decompositions follows established practice in Unicode/ISO 10646.

Collation

Default collation order (DUCET/ISO 14651) follows the example of Dingbats, except:

    • e-B08 LOOPED LENGTH MARK sorts together with U+3030 WAVY DASH.

    • Enclosed characters with decompositions sort accordingly, as usual, with tertiary differences from their normalized equivalents.

Proposal Summary Form

ISO/IEC JTC 1/SC 2/WG 2

PROPOSAL SUMMARY FORM TO ACCOMPANY SUBMISSIONS

FOR ADDITIONS TO THE REPERTOIRE OF ISO/IEC 10646

Please fill all the sections A, B and C below.

Please read Principles and Procedures Document (P & P) from

http://www.dkuug.dk/JTC1/SC2/WG2/docs/principles.html for guidelines and details before filling this form.

Please ensure you are using the latest Form from http://www.dkuug.dk/JTC1/SC2/WG2/docs/summaryform.html.

See also http://www.dkuug.dk/JTC1/SC2/WG2/docs/roadmaps.html for latest Roadmaps.

Form number: N3452-F (Original 1994-10-14; Revised 1995-01, 1995-04, 1996-04, 1996-08, 1999-03, 2001-05, 2001-09, 2003-11, 2005-01, 2005-09, 2005-10, 2007-03, 2008-05)

A. Administrative

B. Technical - General

C. Technical - Justification