1. Summary
The Java Locale has fallen out of date and needs to be enhanced to avoid loss of data. Relatively small changes to Locale can update it to current standards and avoid significant problems for companies using Java. This proposal recommends a series of enhancements to the JDK Locale in order to bring Java into conformance with IETF BCP 47 and UTR35 (CLDR/LDML).
1.1 Background
Many years ago, the internal structure for Locale was modeled after IETF RFC 1766, which was the industry standard for the representation of languages and locales at the time. But the industry has moved on since then.
1. RFC 1766 is long obsolete. It has been superseded by IETF BCP 47, which makes a number of important additions needed for the representation of languages. BCP 47 is now the standard used and required by HTML, XML, HTTP, and many other specifications and programs. Among other features, BCP 47 provides the following:
These limitations are already causing significant implementation problems. For example, when a J2EE Servlet container implementation parses a language tag from an Accept-Language HTTP header and creates a Locale instance, it cannot map a script code into the new Locale object.
2. BCP 47 Unicode extensions are needed. The Unicode Consortium launched the CLDR (Common Locale Data Repository) project years ago to construct and maintain the standard repository of locale data. Sun Java 6 is also a consumer of CLDR. The Unicode locale model provides an extension of BCP 47 to add keywords and codes needed in IT. These are needed to properly represent locale variants used in industry, such as dictionary vs. phonebook sort orders for German.
1.2 API
The proposed API additions are:
A key requirement is backwards compatibility.
2. IETF BCP 47 Language Tag and Locale Identifiers in UTS#35 (CLDR/LDML)
The syntax of language tags (or language identifiers) is currently defined by RFC5646 (which is part of BCP 47).
A language tag is normally composed of
A Unicode locale identifier as defined in the Unicode Technical Standard #35 UNICODE LOCALE DATA MARKUP LANGUAGE (LDML) inherits the basic structure of the BCP47 language tag. The differences are as follows:
For example, "de-Latn-DE-u-ca-gregory-co-phonebk" is a valid BCP47 language tag as well as a valid Unicode locale identifier, and is interpreted as:
Language: German ("de")
Script: Latin ("Latn")
Territory: Germany ("DE")
Calendar ("ca"): Gregorian ("gregory")
Collation ("co"): Phonebook ("phonebk")
2.1 Compatibility
The ABNF for BCP 47 provides certain features for backwards compatibility with older versions of BCP 47, features that are not needed by Java.
There are a few special cases where this proposal for the Java locale deviates from BCP47 for compatibility reasons:
3. Locale Fields
The current JDK Locale class has three logical fields: language, country and variant. To represent BCP 47 language tags and Unicode locale identifiers without data loss, a few new fields must be added. Also, the definition of existing fields must be extended.
3.1. Language
According to the API specification of JDK Locale, languages are limited to lower-case two-letter codes as defined by ISO 639 part 1. However, the implementation in the Oracle JDK has never checked the length of the language argument in the constructors. The language subtag in BCP47 is either an ISO two-letter code or an ISO three-letter code registered in the IANA Language Subtag Registry.
Practically, there is no problem using three-letter ISO 639 language codes in the current JDK Locale. But the API specification should be updated to state clearly that ISO three-letter language codes are a valid language representation, as are the ISO two-letter language codes.
3.2. Script
The language tag specification in BCP 47 uses ISO 15924 codes to represent scripts. A script code is represented by four alphabetic characters in ISO 15924. The repertoire of valid script codes in BCP 47 is registered in the IANA Language Subtag Registry. Appendix A. Script codes shows all script codes that can be used in a BCP 47 language tag at this moment.
Support of script codes requires a structural change in the JDK Locale class. When a Locale object has a script, it should be treated as an independent logical field when populating the candidate Locale list for resource lookup. Thus, the script is stored in a new, logically separate field.
3.3. Region
BCP 47 region subtags are used to indicate linguistic variations associated with or appropriate to a specific country, territory, or region. A valid region subtag is either a two-letter ISO 3166 country code or a three-digit UN M.49 numeric region code.
Because ISO 3166 codes and UN M.49 codes are mutually exclusive in a single BCP 47 language tag, there is no need to introduce a new field in the JDK Locale, so the existing Country field is used.
The use of UN M.49 codes in BCP 47 is limited to the repertoire registered in the IANA Language Subtag Registry. Appendix B. Region codes shows all UN M.49 codes which can be used in a BCP 47 language tag. According to the API specification of JDK Locale, countries are limited to upper-case two-letter codes as defined by ISO 3166. However, the implementation in the Oracle JDK has never checked the character repertoire or the length of the country argument in the constructors. So the API specification can be updated to accept a UN M.49 region code without introducing any breaking changes.
3.4. Variants
The current JDK Locale was designed based on the old language tag specification (RFC1766). The variant field in the Locale class is used for the second and subsequent subtags. RFC1766 allows 1 to 8 letters for these subtags. In BCP 47, the syntax is defined by RFC5646, which allows only 5 to 8 characters when the value starts with a letter ([a-z][A-Z]), or exactly 4 characters when the value starts with a digit ([0-9]).
The variant in JDK Locale does not have any restriction on its length or character repertoire. The Oracle JDK uses the variant for some locales, such as JP (ja_JP_JP), TH (th_TH_TH) and NY (no_NO_NY). These variants are illegal in BCP 47. So BCP 47 variant subtags can be mapped to the variant field in the Locale class, but not the other way around without an additional transformation. The JDK implementation already assumes multiple variants are separated by LOW LINE ("_") characters. When variant subtags in BCP 47 are mapped into a JDK Locale, they are stored in a single field, variant, separated by LOW LINE characters.
3.5. BCP47 Extension/Private Use Subtag and Unicode Locale Extension
The current JDK Locale does not have any fields that can be used to store BCP47 extensions and private use subtags.
BCP47 extensions form a map between an extension type, represented by a single letter, and its value. BCP47 private use subtags have a similar structure identified by the letter 'x'/'X', although the syntax for the value is slightly relaxed. In this design proposal, BCP47 extensions and private use subtags are both treated as Locale extensions, represented in a single logical map using a single-character key.
Unicode extends BCP47 for use in locales. Unicode locale extension subtags have a well-defined substructure representing a key-value map. The keys are always 2-letter alphanum, and the value subtags are restricted to 3 to 8-letter alphanum. As discussed in the previous section, the Oracle JDK currently uses the variant field to modify the base locales, such as ja_JP_JP and th_TH_TH. The variant JP specifies that the Japanese imperial calendar is to be used for dates, and the variant TH specifies that Thai local digits are to be used for numbers. Such usage won't work well when a locale needs to be transformed into a BCP47 language tag or vice versa. The Unicode locale extension defined by LDML provides the solution. For example, the JDK Locale ja_JP_JP is mapped to the BCP47 language tag "ja-JP-u-ca-japanese". APIs dedicated to accessing each Unicode locale extension item are proposed, in addition to APIs for generic BCP47 extensions. With the API for accessing a BCP47 extension, a user can get the entire key-value map for the Unicode locale extension (e.g. get "ca-gregory-co-phonebk" from "de-DE-u-ca-gregory-co-phonebk-x-jdk" by specifying extension letter 'u'). With the API for accessing a Unicode locale extension, a user can get an individual Unicode locale extension value by specifying a key (e.g. get "phonebk" from "de-DE-u-ca-gregory-co-phonebk-x-jdk" by specifying Unicode locale extension key "co").
3.6. Summary of Proposed Locale Field Changes
The table below illustrates proposed changes for JDK Locale fields and mapping to BCP 47 language subtags.
The sections in bold/orange are new proposed enhancements.
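As a minimal sketch of the field layout discussed in section 3, the snippet below splits the example tags from sections 2 and 3.5 into their logical fields. It assumes a runtime where this proposal is available as java.util.Locale (Java 7 or later); the accessor names getScript, getExtension and getUnicodeLocaleType come from that shipped API rather than from the text above, and LocaleFieldsSketch and describe are hypothetical names used only for illustration.

```java
import java.util.Locale;

final class LocaleFieldsSketch {
    // Hypothetical helper: renders the logical fields of a language tag on one line.
    static String describe(String languageTag) {
        Locale loc = Locale.forLanguageTag(languageTag);
        // Whole Unicode locale extension ('u'); null when the tag has none.
        String u = loc.getExtension(Locale.UNICODE_LOCALE_EXTENSION);
        // Single Unicode locale keyword value for key "co"; null when absent.
        String co = loc.getUnicodeLocaleType("co");
        return "language=" + loc.getLanguage()
            + " script=" + loc.getScript()
            + " country=" + loc.getCountry()
            + " variant=" + loc.getVariant()
            + " u=" + u
            + " co=" + co;
    }

    public static void main(String[] args) {
        System.out.println(describe("de-Latn-DE-u-ca-gregory-co-phonebk"));
        System.out.println(describe("de-DE-u-ca-gregory-co-phonebk-x-jdk"));
    }
}
```

The script is reported separately from the country, and the extension value is available both as a whole and per keyword, which is the separation of fields that this section argues for.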
4. Equality of Locales
In this proposal, an instance of Locale is equal to another instance of Locale when each corresponding field has exactly the same value. Conversely, an instance of Locale is never equal to another instance of Locale when any corresponding field has a different value. This section discusses what needs to be done to keep this relationship across JDK releases.
4.1. Deprecated Language Codes
There are some ISO 639 two-letter language codes used by the JDK Locale that are already deprecated. The Java API reference clearly explains that these codes ("iw", "ji" and "in") are used by the JDK Locale class, even when the new codes ("he", "yi" and "id") are supplied to the Locale constructor. This proposal does not change the mapping, and the JDK Locale will continue to use the old codes for the language field.
4.2. Three-letter Language Codes
This proposal introduces the use of ISO 639 three-letter language codes. Many three-letter codes in ISO 639 also have ISO 639 Part 1 two-letter codes. For example, English is represented by "eng" in ISO 639 Part 2 and 3, and "en" in ISO 639 Part 1. Allowing both forms for the same language has no value, so the BCP 47 language tag specification and the Unicode language and locale identifiers allow only the shortest form; that is, an ISO 639 three-letter code can be used only when it does not have an ISO 639 two-letter definition.
When a three-letter code is specified in the JDK Locale constructors (although this is illegal by the current API definition), it could be mapped to the ISO 639 two-letter code if available. However, that behavior would break existing Java applications that might rely on the three-letter language codes being preserved in Locale instances. To avoid the backward compatibility problem, this proposal does not perform any such mapping. That is, new Locale("en") and new Locale("eng") will result in two totally different Locale instances. It is the responsibility of the user to perform such mapping if desired.
Similarly, three-digit region codes are not allowed in BCP47 when a two-letter region (country) code exists, but such a mapping and/or validation is again left to the user of Locale.
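Since the mapping from three-letter to two-letter language codes is left to the user, one possible user-side normalization is sketched below. It is only an illustration: LanguageCodeSketch, shortestLanguage and equalAsLocales are hypothetical names, and the table is derived from whatever two-letter codes the running JDK reports through Locale.getISOLanguages() and getISO3Language(), not from the IANA registry.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

@SuppressWarnings("deprecation") // Locale constructors are deprecated in recent JDKs
final class LanguageCodeSketch {
    // Three-letter code -> two-letter code, built from the codes known to the runtime.
    private static final Map<String, String> ISO3_TO_ISO2 = new HashMap<>();
    static {
        for (String iso2 : Locale.getISOLanguages()) {
            try {
                ISO3_TO_ISO2.put(new Locale(iso2).getISO3Language(), iso2);
            } catch (MissingResourceException e) {
                // The runtime has no three-letter code for this language; skip it.
            }
        }
    }

    // Returns the two-letter code when one is known, otherwise the lower-cased input.
    static String shortestLanguage(String language) {
        String lower = language.toLowerCase(Locale.ROOT);
        return ISO3_TO_ISO2.getOrDefault(lower, lower);
    }

    // Locale performs no mapping itself, so this is plain field-by-field equality.
    static boolean equalAsLocales(String languageA, String languageB) {
        return new Locale(languageA).equals(new Locale(languageB));
    }

    public static void main(String[] args) {
        System.out.println(equalAsLocales("en", "eng"));
        System.out.println(equalAsLocales("en", shortestLanguage("eng")));
    }
}
```

Applying such a normalization before constructing a Locale keeps the Locale class itself free of any implicit mapping, which is the compatibility property this section requires.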
4.3. Other considerations
Because of the lack of a script field, some Java users have used "zh_CN" as a language identifier for Simplified Chinese and "zh_TW" for Traditional Chinese. This does not work well for users who want to re-use the same resource for other Chinese locale variants. For example, some Java users may want to share the Traditional Chinese localized contents for zh_TW and zh_HK. To share the same Traditional Chinese resource, they currently have to write a custom ResourceBundle.Control (introduced in JDK 6). The introduction of script will allow them to maintain a single Traditional Chinese resource shared by multiple Chinese sub-locales without such customization. "zh_TW" by the definition of old JDK releases would normally be equivalent to "zh_Hant_TW". However, they are not exactly the same in terms of the BCP 47 language tag definition. Thus, this proposal interprets "zh_TW" and "zh_Hant_TW" as different locales, but such mapping is considered in locale resource/service lookup (please refer to Section 7, Locale Resource/Service Lookup, for the details).
5. Locale Construction
The existing Locale constructors are not sufficient to fill in the new Locale fields. Also, there is a need to create an instance of Locale from a BCP 47 language tag or a Unicode locale identifier. This section describes the proposed enhancements related to Locale instantiation.
5.1. Constructors
There are three constructors available in the current JDK Locale class.
public Locale(String language)
public Locale(String language, String country)
public Locale(String language, String country, String variant)
There are some minor changes necessary in these existing constructors, as follows:
5.2. Locale Builder
The extensions field itself is represented internally by a LocaleExtensions object. Because its syntax is quite strict, this proposal does not allow Java users to set the value directly. Because the extensions field might be used to customize the behavior of JDK locale service classes, the proposal allows finer-grained building of locales via a LocaleBuilder, a new class that allows Java users to edit each Locale field, including extensions, in a controlled way. An instance of LocaleBuilder is mutable, unlike Locale. Java users can construct a LocaleBuilder from scratch, or starting from an existing Locale, then call setter methods to modify field values. After editing the fields, the method public Locale build() returns a new instance of Locale (or possibly a cached Locale instance) with the customized field contents.
5.3 From BCP47 Language Tag
We expect that creating an instance of Locale from a BCP47 language tag will be common in Java applications. For example, a J2EE Servlet container creates a Locale from the HTTP Accept-Language header. The JDK currently does not have such an API, so Java application developers are writing their own code to parse BCP47 language tags. Some of them assume the first subtag is always a language, the second subtag is always a country, and the rest are the variant. Such an implementation would parse the input language tag "zh-Hant-TW" and create a new Locale using new Locale("zh", "Hant", "TW"), which is incorrect.
This proposal adds a new static method:
public static Locale forLanguageTag(String languageTag)
This method always returns an instance of Locale even if the input language tag is malformed. The implementation evaluates each subtag from the beginning. When it encounters a malformed subtag, that subtag and all following subtags are truncated. If the last valid subtag requires extra subtags (for example, an extension singleton letter followed by a malformed extension subtag), the extension itself is truncated. For example,
"en-US-12-345" -> en_US ("12" is illegal at the position)
"ja-JP-x-WindowsVista" -> ja_JP ("WindowsVista" exceeds 8 letters in length, and "x" alone is illegal)
"a-b" -> ROOT ("a" is illegal at the position, therefore the entire input is truncated)
Although this method is good enough for common use cases, some callers may want to report an error if an input language tag contains malformed subtags. The Locale Builder class described above supports strict syntax checking and error reporting in the method below.
public Builder setLanguageTag(String languageTag)
This method throws java.util.IllformedLocaleException (new) when the input language tag is malformed. The caller can access the error index through the exception. For example, for the input language tag "en-US-12-345", IllformedLocaleException#getErrorIndex() will return 6 (the start offset of subtag "12"). There are some design considerations with language tag parsing, as follows.
Grandfathered Tags
BCP47 supports some irregular language tags introduced in the past. Many of them are deprecated and have equivalent well-formed (langtag production of Language Tag) mappings. The JDK implementation will transform those grandfathered tags that have well-formed mappings in these methods. These special mappings are:
Other grandfathered tags do not have well-formed mappings. On input, they are converted as follows:
At this moment, these are all of the registered grandfathered tags. It is extremely unlikely that other grandfathered tags will be registered in the future. But if any new grandfathered tags are introduced that do not satisfy the regular RFC5646 langtag production of language tag syntax, they will be treated as invalid language tags in the JDK implementation.
Extlang
Validity of Subtags
Valid subtags are maintained in the IANA Language Subtag Registry. Any subtag not registered there is invalid for use in a BCP47 language tag. The registry is large and grows over time, so checking the validity of subtags would require the registry data to be imported into the JDK implementation, where it could easily become out of sync with the IANA registry. For this reason, this design does not check the validity of subtags. For example, the language subtag "ac" is not currently registered and cannot be used in a BCP47 language tag at this moment. But the proposed implementation does not reject "ac", because it still satisfies the RFC5646 language tag syntax and could become valid in the future.
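The difference between the lenient forLanguageTag method and the strict setLanguageTag method described in section 5.3 can be sketched as follows. This assumes the proposal as it shipped in java.util.Locale (Java 7 or later), where the builder is the nested class Locale.Builder; TagParsingSketch and its two helper methods are hypothetical names.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

final class TagParsingSketch {
    // Lenient parsing: malformed trailing subtags are silently dropped.
    static Locale lenient(String languageTag) {
        return Locale.forLanguageTag(languageTag);
    }

    // Strict parsing: returns the error index, or -1 when the tag is well-formed.
    static int strictErrorIndex(String languageTag) {
        try {
            new Locale.Builder().setLanguageTag(languageTag).build();
            return -1;
        } catch (IllformedLocaleException e) {
            return e.getErrorIndex();
        }
    }

    public static void main(String[] args) {
        String[] tags = {"en-US-12-345", "ja-JP-x-WindowsVista", "a-b", "ac"};
        for (String tag : tags) {
            System.out.println(tag + " -> lenient: '" + lenient(tag)
                + "', strict error index: " + strictErrorIndex(tag));
        }
    }
}
```

Note that "ac" passes both paths: it is syntactically well-formed, and neither method consults the IANA registry.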
5.4. System Default Locale
This proposal introduces a Locale representation that contains fields that were not in previous Java releases. However, one of our primary goals is not to break existing applications. Locales included in the supported Locale list in former Java releases should not be changed. For example, while a Locale with a script field, such as zh_Hant_HK, could be used as the Java system Locale, the JVM implementation should still use the existing form zh_HK. This proposal resolves such mapping in the lookup algorithm, so Java users can supply their own resource bundles with script information.
6. String Representation of Locale
The method public String toString() in the JDK Locale class returns the programmatic name of the Locale. The text value is used as a Locale ID in many Java applications. This proposal introduces a couple of new fields, which ideally should be included in the text representation. But such incompatible changes might break existing Java applications.
The solution adopted here is that values that cannot be represented in the older format are appended at the end of the string, after the special prefix character '#'. When there is already a variant value, "_#" followed by the value is appended. When there is no variant value, "#" + the value is appended. To old Java programs using a String representation of Locale, the value looks like an additional variant tag. Thus, resource lookup based on truncating fields from the right still returns the same result. Yet the information can be recovered from the string by recognizing the special '#' character. That character was chosen so that it would be extremely unlikely to collide with existing usage of Locale.
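The effect of the '#' suffix on legacy code that truncates locale IDs from the right can be sketched as below. The strings shown are those produced by java.util.Locale as shipped in Java 7 and later, which may differ in detail from the tables of this proposal; ToStringSketch, legacyId and truncateRight are hypothetical names.

```java
import java.util.Locale;

final class ToStringSketch {
    // String form of a locale created from a language tag.
    static String legacyId(String languageTag) {
        return Locale.forLanguageTag(languageTag).toString();
    }

    // What an older program does when it falls back to the parent locale ID:
    // cut everything from the last LOW LINE onward.
    static String truncateRight(String localeId) {
        int cut = localeId.lastIndexOf('_');
        return cut < 0 ? "" : localeId.substring(0, cut);
    }

    public static void main(String[] args) {
        String withScript = legacyId("zh-Hant-TW");
        String withExtension = legacyId("de-DE-u-co-phonebk");
        System.out.println(withScript + " -> parent " + truncateRight(withScript));
        System.out.println(withExtension + " -> parent " + truncateRight(withExtension));
    }
}
```

Because the new fields sit in the last segment, the first truncation step removes exactly them and leaves the ID an older program would have seen.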
The tables below illustrate how the script value and extension values are inserted in the result of toString().
6.1. Script
The script field in the JDK Locale will be appended at the end of its string representation, as illustrated in the following table:
6.2. Extensions
The extensions field in the JDK Locale will be appended at the end of its string representation, as illustrated in the following table:
6.3. BCP47 Language Tag
public static Locale forLanguageTag(String languageTag) creates an instance of Locale from a language tag. This proposal also adds the reverse method, which returns a language tag string for an instance of Locale:
public String toLanguageTag()
The toLanguageTag method will return a well-formed (langtag production of RFC5646 language tag) language tag for the Locale instance. There are a few complications when generating a well-formed language tag from a Locale:
Mandated Language Subtag
The langtag production of RFC5646 language tag requires a non-empty language subtag. The current JDK Locale allows an empty language field. When toLanguageTag() generates a language tag, "und" (Undetermined) will be used as the language subtag if the Locale's language field is empty.
Ill-formed Fields
Each subtag has its own syntax restriction. For example, a language subtag must be 2 to 8 ALPHA letters ([a-z][A-Z]). Although the current API documentation for the Locale constructors says a two-letter ISO 639 language code is used for the language argument, the implementation does not check whether the input language argument really satisfies the condition. For example, new Locale("Hello World!"); creates an instance of Locale with language field "hello world!". Of course, such a Locale cannot be mapped to a well-formed language tag. The method toLanguageTag() will omit any locale field that does not satisfy the language tag syntax requirements. (In this specific example, "hello world!" is not appropriate for a language subtag. Because an empty language subtag is not allowed, "und" is returned.)
Ill-formed Variant Field
This is a special case of an ill-formed field. The current JDK Locale allows any value in the variant field. Unlike the example above, in which "Hello World!" is an invalid language according to the existing Javadoc, such a value is a valid variant (there is no syntactical limitation on variant in the current Javadoc). The variant subtag in the language tag specification allows only 5 to 8 alphanum characters ([a-z][A-Z][0-9]) or a single digit followed by 3 alphanum characters. The implementation of toLanguageTag() will map the Locale's variant field to a variant subtag as long as it satisfies the syntax.
Failing this, if it satisfies the syntax of the private use subtag, the segment and the following part will be mapped to the private use subtag with the special prefix "-variant-". Finally, if the variant does not satisfy the private use syntax either, it is treated as a sequence of segments (delimited by LOW LINE), and the first failing segment and all following segments will be omitted. For example,
new Locale("en", "US", "POSIX").toLanguageTag() -> "en-US-POSIX" ("POSIX" satisfies the variant subtag syntax. Note: en-US-POSIX is actually invalid, because POSIX is not a registered variant subtag.)
new Locale("en", "US", "Windows_XP").toLanguageTag() -> "en-US-Windows-x-variant-XP" ("Windows" satisfies the variant syntax, but "XP" does not. But the private use syntax allows "XP".)
new Locale("en", "US", "Solaris10").toLanguageTag() -> "en-US" ("Solaris10" is 9 letters and illegal for any subtag; therefore, it is omitted.)
7. Locale Resource/Service Lookup
The current JDK implementation uses relatively simple logic in locale resource/service lookup. It begins with a given Locale and trims fields one by one, from variant and then country, to locate a matching resource or service. In the real world, representing the hierarchy by three fields (language, country and variant) is not sufficient. The introduction of the script field allows applications to improve their lookup strategy. But supporting backwards compatibility requires a few additional changes to the lookup logic. Our goal for the built-in lookup strategy was to enhance the lookup to function correctly whether or not the new fields are used, while maintaining backwards compatibility.
Note that resource bundle lookup order is different from resource item inheritance. The inheritance of resource items still follows a truncation model, since the resources are built presuming that model. For more information about lookup vs. inheritance, see LDML.
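The candidate list that drives resource bundle lookup can be inspected through ResourceBundle.Control (JDK 6), as in the hedged sketch below. LookupOrderSketch and candidates are hypothetical names, "Messages" is a placeholder base name, and the lists printed for the Chinese locales depend on the JDK in use, so no particular output is claimed for them here.

```java
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

final class LookupOrderSketch {
    // Returns the locales the default lookup would try, in order.
    static List<Locale> candidates(Locale requested) {
        ResourceBundle.Control control =
            ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        // "Messages" is a hypothetical bundle base name used only for this call.
        return control.getCandidateLocales("Messages", requested);
    }

    public static void main(String[] args) {
        // Plain truncation: country is trimmed first, then language, ending at ROOT.
        System.out.println(candidates(Locale.US));
        // Locales where the script field participates in the lookup.
        System.out.println(candidates(Locale.forLanguageTag("zh-TW")));
        System.out.println(candidates(Locale.forLanguageTag("zh-Hant-TW")));
    }
}
```

For a locale without script or variant, such as en_US, the list is the familiar truncation sequence; comparing it with the two Chinese lists shows where a given runtime inserts script-based candidates.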
7.1. Basic Lookup Strategy
For locale resource or service lookup, this proposal handles the extensions separately from the rest of the fields. This proposal uses the term "Base Locale", which consists of the language, script, country, and variant (in other words, all fields except extensions). The primary purpose of the new extensions field in this proposal is to customize the behavior of locale services. When resolving a resource bundle, only the base locale is used. When resolving a service associated with a locale, the lookup logic also uses only the base locale, but since the service implementation might refer to the information stored in the extensions, the extensions are not truncated when invoking the resolved locale service implementation.
7.2. Lookup Order
The current JDK Locale does not support the script field. Such a distinction in writing system is vital, but can only be achieved by associating the difference with a country or variant. The introduction of script provides a clean solution, but the new implementation must not break existing applications. This is handled by using a lookup order that manages the interaction between script, region, and language. The chart below illustrates the proposed lookup order.
Note: In the description below, L, S, C and V represent non-empty language, script, country and variant. A locale formed from those fields is represented by brackets around a list of these letters. Thus [L, C] represents a Locale that has a non-empty language and country. When indicating a specific value, a string follows the letter in parentheses; for example, L("xx") represents the language "xx".
7.3. Special Handling for Backward Compatibility Support
Chinese
The current JDK represents Simplified Chinese as zh_CN and Traditional Chinese as zh_TW. With the introduction of script, these should be represented as zh_Hans and zh_Hant. Even when someone tags resource bundles with zh_Hans or zh_Hant, however, the application still needs to support zh_CN or zh_TW. To make the behavior backward compatible, this proposal handles input of some Chinese locales without script as a special case. When Locale zh_CN is requested, the implementation adds the script internally before generating the candidate Locale list. This special expansion will be done only when the language field is "zh" and the country (region) is one of the codes below:
Also, existing applications are likely packaging Simplified Chinese bundles with zh_CN and Traditional Chinese bundles with zh_TW. Therefore, when locale zh_Hans is requested, zh_CN should be in the candidate list, and when locale zh_Hant is requested, zh_TW should be in the candidate list. The proposed implementation will supply country fields when a Chinese locale has a script, but empty country. The chart below illustrates the proposed behavior:
Norwegian
The current Oracle JDK uses Locale no_NO_NY to represent the locale Norwegian Nynorsk (Norway). This representation is illegal in a BCP47 language tag: it should actually use "nn" (Norwegian Nynorsk) for the language field. Also, the Oracle JDK treats Locale no as Norwegian Bokmål, which should be represented by "nb" if it is to be clearly distinguished from Norwegian Nynorsk. Because the current JDK implementation assigns special semantics to these locales that are not compatible with the rest of the world, special handling for Norwegian locale lookup is proposed as below:
8. Proposed New APIs
Earlier sections of this document discuss various technical topics. The table below gathers the proposed APIs, grouped by JDK class, together with their descriptions.
References
Appendix A. Script codes
The table below shows all ISO 15924 script codes which can be used for the Unicode script subtag.
Appendix B. Region codes
The table below shows all UN M.49 region codes which can be used for the Unicode region subtag.