Java Locale Enhancement Design Proposal

Revision 1.01
Authors Yoshito Umaoka
Steven Loomis
Mark Davis
Date 2010-06-25

1. Summary

The Java Locale has fallen out of date and needs to be enhanced to avoid loss of data. Relatively small changes to Locale can bring it up to current standards and avoid significant problems for companies using Java. This proposal recommends a series of enhancements to the JDK Locale in order to bring Java into conformance with IETF BCP 47 and UTS #35 (CLDR/LDML).

1.1 Background

Many years ago, the internal structure for Locale was modeled after IETF RFC 1766, which was the industry standard for the representation of languages and locales at the time. But the industry has moved on since then.

1. RFC 1766 is long obsolete. It has been superseded by IETF BCP 47, which makes a number of important additions needed for the representation of languages.  BCP 47 is now the standard used and required by HTML, XML, HTTP, and many other specifications and programs. Among other features, BCP 47 provides the following:

  1. Script codes needed for distinctions among languages that use different writing systems, such as Chinese simplified vs traditional script, or Uzbek in Arabic vs Latin script.
  2. Three-letter base language codes needed to represent such languages as Filipino (fil), the official language of the Philippines. (Three letters are needed for the over 8,000 world languages).
  3. Three-digit region codes for important variants used in IT such as Latin American Spanish ("es_419").

These limitations are already causing significant implementation problems. For example, when a J2EE Servlet container implementation parses a language tag from an Accept-Language HTTP header and creates a Locale instance, it has no way to store a script code in the resulting Locale object.

2. BCP 47 Unicode extensions are needed. The Unicode consortium launched the CLDR (Common Locale Data Repository) project years ago to construct and maintain the standard repository of locale data.  Sun Java 6 is also a consumer of CLDR.  The Unicode locale model provides an extension of BCP 47 to add keywords and codes needed in IT. These are needed to properly represent locale variants used in industry, such as dictionary vs phonebook sort orders for German.

1.2 API

The proposed API additions are:
  • getScript, getDisplayScript (plus LocaleNameProvider SPI, as with getDisplayRegion)
  • getExtension(Keys), getUnicodeLocaleType/Keys
  • to/forLanguageTag
  • A Locale.Builder class for building from parts.
  • Some changes to ResourceBundle (as of this writing, those changes hadn't been done yet, pending a green light on the proposal).
A key requirement is backwards compatibility.

2. IETF BCP 47 Language Tag and Locale Identifiers in UTS#35 (CLDR/LDML)

The syntax of language tags (or language identifiers) is currently defined by RFC5646 (which is part of BCP 47).

A language tag is normally composed of
  • a base language subtag (2 or 3 letter ISO language code)
  • optionally followed by a script subtag (4 letter ISO script code)
  • optionally followed by a region code (2 letter ISO country code or 3 digit UN M.49 area code)
  • optionally followed by one or more variant subtags
  • optionally followed by one or more extensions
  • optionally followed by private use subtags
 Below is the ABNF of the BCP 47 language tag. 

Language-Tag  = langtag             ; normal language tags
/ privateuse ; private use tag
/ grandfathered ; grandfathered tags

langtag = language
["-" script]
["-" region]
*("-" variant)
*("-" extension)
["-" privateuse]

language = 2*3ALPHA ; shortest ISO 639 code
["-" extlang] ; sometimes followed by
; extended language subtags
/ 4ALPHA ; or reserved for future use
/ 5*8ALPHA ; or registered language subtag

extlang = 3ALPHA ; selected ISO 639 codes
*2("-" 3ALPHA) ; permanently reserved

script = 4ALPHA ; ISO 15924 code

region = 2ALPHA ; ISO 3166-1 code
/ 3DIGIT ; UN M.49 code

variant = 5*8alphanum ; registered variants
/ (DIGIT 3alphanum)

extension = singleton 1*("-" (2*8alphanum))
; Single alphanumerics
; "x" reserved for private use
singleton = DIGIT ; 0 - 9
/ %x41-57 ; A - W
/ %x59-5A ; Y - Z
/ %x61-77 ; a - w
/ %x79-7A ; y - z

privateuse = "x" 1*("-" (1*8alphanum))

grandfathered = irregular ; non-redundant tags registered
/ regular ; during the RFC 3066 era

irregular = "en-GB-oed" ; irregular tags do not match
/ "i-ami" ; the 'langtag' production and
/ "i-bnn" ; would not otherwise be
/ "i-default" ; considered 'well-formed'
/ "i-enochian" ; These tags are all valid,
/ "i-hak" ; but most are deprecated
/ "i-klingon" ; in favor of more modern
/ "i-lux" ; subtags or subtag
/ "i-mingo" ; combination

A Unicode locale identifier as defined in the Unicode Technical Standard #35 UNICODE LOCALE DATA MARKUP LANGUAGE (LDML) inherits the basic structure of the BCP47 language tag.  The differences are as follows:
  • Allows use of LOW LINE ("_") characters for separating fields as well as HYPHEN ("-")
  • The extlang and grandfathered fields are not supported
  • One extension field is defined, as follows:
Unicode has defined the character 'u' to be used for Unicode locale extensions, using the mechanism that BCP47 provides for extending language tags for use in various applications. The 'u' extension is designed for specifying cultural preferences, such as calendar type, in a language tag. This meets industry requirements for uniform variations across locales, such as "traditional sort", and represents certain variations in current Java locales in a standard way. The syntax of the 'u' extension and valid extension subtags is defined by the LDML specification. For details, see Unicode Language and Locale Identifiers.

For example, "de-Latn-DE-u-ca-gregory-co-phonebk" is a valid BCP47 language tag as well as a valid Unicode locale identifier and interpreted as:

Language: German("de")
Script: Latin("Latn")
Territory: Germany("DE")
Calendar("ca"): Gregorian("gregory")
Collation("co"): Phonebook("phonebk")

These fields were chosen for illustration in the example above; this combination would not be normally needed.
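
The decomposition above can be reproduced in code. The sketch below assumes the accessors as they later shipped in java.util.Locale (Java 7 and later); the class name TagFieldsDemo and the output layout are illustrative only.

```java
import java.util.Locale;

public class TagFieldsDemo {
    // Parses a language tag and lists the fields discussed in the example above.
    static String describe(String languageTag) {
        Locale loc = Locale.forLanguageTag(languageTag);
        return "language=" + loc.getLanguage()
                + " script=" + loc.getScript()
                + " region=" + loc.getCountry()
                + " ca=" + loc.getUnicodeLocaleType("ca")
                + " co=" + loc.getUnicodeLocaleType("co");
    }

    public static void main(String[] args) {
        // prints: language=de script=Latn region=DE ca=gregory co=phonebk
        System.out.println(describe("de-Latn-DE-u-ca-gregory-co-phonebk"));
    }
}
```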

2.1 Compatibility

The ABNF for BCP 47 provides certain features for backwards compatibility with older versions of BCP 47, features that are not needed by Java. 

  • The extlang feature: this is not present in the canonical form of BCP47 tags; it maps to a language tag without an extlang.

  • The irregular tags are not necessary; they refer to constructs that are sufficiently ill-defined that there is no necessity for them.

There are a few special cases where this proposal for the Java locale deviates from BCP47 for compatibility reasons:

  • Three language codes are handled specially by the Java locale. For example, "he" is mapped to "iw". This mapping is maintained for compatibility. 

  • There are a few cases where some of the old locale structure needed to be grandfathered in, even though it does not follow BCP47 structure, such as

  • There are some special syntactic differences between toString and toLanguageTag for compatibility.

The proposal also maintains the previous practice of not validating codes (such as country codes), except syntactically. This is both for compatibility, and to prevent "overvalidation", whereby a subtag is valid in the current version of BCP47, but the user's version of Java does not know about that version.

3. Locale Fields

The current JDK Locale class has three logical fields - language, country and variant. To represent BCP 47 language tags and Unicode locale identifiers without data loss, a few new fields must be added. Also, the definition of existing fields must be extended.

3.1. Language

According to the API specification of JDK Locale, languages are limited to lower-case two-letter codes as defined by ISO 639 part 1. However, the implementation in the Oracle JDK has never checked the length of the language argument in the constructors. The language tag in BCP47 is either an ISO two-letter code or an ISO three-letter code registered in the IANA Language Subtag Registry.

Practically, there is no problem using three-letter ISO 639 language codes in the current JDK Locale. But the API specification should be updated and clearly state that ISO three-letter language codes are a valid language representation as well as the ISO two-letter language codes.

3.2. Script

The language tag specification in BCP 47 uses ISO 15924 codes to represent scripts. A script code is represented by four alphabetic characters in ISO 15924. The repertoire of valid script codes in BCP 47 is registered in the IANA Language Subtag Registry. Appendix A. Script codes shows all script codes that can be used in a BCP 47 language tag at this moment.

Support of script codes requires a structural change in the JDK Locale class. When a Locale object has a script, it should be treated as an independent logical field when populating the candidate Locale list for resource lookup. Thus, the script is stored in a new logically separated field.

3.3. Region

BCP 47 region subtags are used to indicate linguistic variations associated with or appropriate to a specific country, territory, or region. A valid region subtag is either a two-letter ISO 3166 country code or a three-digit UN M.49 numeric region code. Because ISO 3166 codes and UN M.49 codes are mutually exclusive in a single BCP 47 language tag, there is no need to introduce a new field in the JDK Locale, so the existing Country field is used.

The use of UN M.49 codes in BCP 47 is limited to the repertoire registered in the IANA Language Subtag registry. Appendix B. Region codes shows all UN M.49 codes which can be used in a BCP 47 language tag.

According to the API specification of JDK Locale, countries are limited to upper-case two-letter codes as defined by ISO 3166. However, the implementation in the Oracle JDK has never checked the character repertoire nor the length of country argument in the constructors. So the API specification can be updated to accept a UN M.49 region code without introducing any breaking changes.

3.4. Variants

The current JDK Locale was designed based on the old language tag specification (RFC1766). The variant field in the Locale class is used for the second and subsequent subtags. RFC1766 allows 1 to 8 letters for these subtags. In BCP 47, the syntax is defined by RFC5646, which only allows 5 to 8 alphanumeric characters, or exactly 4 alphanumeric characters when the value starts with a digit [0-9].

The variant in JDK Locale does not have any restriction on its length and character repertoire. The Oracle JDK uses variant for some locales, such as JP (ja_JP_JP), TH (th_TH_TH) and NY (no_NO_NY). These variants are illegal in BCP 47. So BCP 47 variant subtags can be mapped to the variant field in the Locale class, but not the other way without an additional transformation.

The JDK implementation already assumes multiple variants are separated by LOW LINE ("_") characters. When variant subtags in BCP 47 are mapped into a JDK Locale, they are stored in a single field - variant - separated by LOW LINE characters.
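
To make the syntax restriction concrete, here is a minimal sketch that checks a single subtag against the RFC5646 variant production (5*8alphanum / DIGIT 3alphanum). The class and method names are hypothetical and are not part of the proposed API.

```java
import java.util.regex.Pattern;

public class VariantSyntaxDemo {
    // RFC 5646: variant = 5*8alphanum / (DIGIT 3alphanum)
    private static final Pattern VARIANT =
            Pattern.compile("[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}");

    // True when the subtag is syntactically a BCP 47 variant (no registry check).
    static boolean isWellFormedVariant(String subtag) {
        return VARIANT.matcher(subtag).matches();
    }

    public static void main(String[] args) {
        // The legacy JDK variants are too short to be BCP 47 variants.
        for (String v : new String[] {"JP", "TH", "NY", "1996", "POSIX"}) {
            System.out.println(v + " -> " + isWellFormedVariant(v));
        }
    }
}
```

With this check, "JP", "TH" and "NY" are rejected, while "1996" and "POSIX" pass the syntax test.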

3.5. BCP47 Extension/Private Use Subtag and Unicode Locale Extension

The current JDK Locale does not have any fields that can be used to store BCP47 extensions and private use subtags. BCP47 extensions form a map between an extension type, represented by a single letter, and its value. BCP47 private use subtags have a similar structure identified by the letter 'x'/'X', although the syntax for the value is slightly relaxed. In this design proposal, BCP47 extensions and private use are both treated as Locale extensions represented in a single logical map using a single-character key.

Unicode extends BCP47 for use in locales. Unicode locale extension subtags have a well-defined substructure representing a key-value map. The keys are always 2-letter alphanum, and the value subtags are restricted to 3 to 8-letter alphanum.

As discussed in the previous section, the Oracle JDK currently uses the variant field to modify the base locales, such as ja_JP_JP and th_TH_TH. The variant JP specifies that the Japanese imperial calendar is to be used for dates and the variant TH specifies that Thai local digits are to be used for numbers. Such usage does not work well when a locale needs to be transformed into a BCP47 language tag or vice versa. The Unicode locale extension defined by LDML provides the solution. For example, the JDK Locale ja_JP_JP is mapped to BCP47 language tag "ja-JP-u-ca-japanese".

APIs dedicated to accessing each Unicode locale extension item are proposed, in addition to APIs for generic BCP47 extensions. With the API for accessing a BCP47 extension, a user can get the entire key-value map for the Unicode locale extension. (e.g. get "ca-gregory-co-phonebk" from "de-DE-u-ca-gregory-co-phonebk-x-jdk" by specifying extension letter 'u'.) With the API for accessing a Unicode locale extension, a user can get the individual Unicode locale extension value by specifying a key. (e.g. get "phonebk" from "de-DE-u-ca-gregory-co-phonebk-x-jdk" by specifying Unicode locale extension key "co".)
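
The two levels of access can be sketched as follows, assuming the method names as they later shipped in java.util.Locale (getExtension, getUnicodeLocaleType, getUnicodeLocaleKeys). The wrapper class and its helper methods are hypothetical.

```java
import java.util.Locale;
import java.util.Set;

public class ExtensionAccessDemo {
    static final String TAG = "de-DE-u-ca-gregory-co-phonebk-x-jdk";

    // Generic BCP47 access: the whole value stored under one singleton letter.
    static String wholeExtension(String tag, char singleton) {
        return Locale.forLanguageTag(tag).getExtension(singleton);
    }

    // Unicode-specific access: the type stored under one two-letter key.
    static String unicodeType(String tag, String key) {
        return Locale.forLanguageTag(tag).getUnicodeLocaleType(key);
    }

    // All Unicode locale keys present in the tag.
    static Set<String> unicodeKeys(String tag) {
        return Locale.forLanguageTag(tag).getUnicodeLocaleKeys();
    }

    public static void main(String[] args) {
        System.out.println(wholeExtension(TAG, 'u')); // ca-gregory-co-phonebk
        System.out.println(wholeExtension(TAG, 'x')); // jdk
        System.out.println(unicodeType(TAG, "co"));   // phonebk
        System.out.println(unicodeKeys(TAG));
    }
}
```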

3.6. Summary of Proposed Locale Field Changes

The table below illustrates proposed changes for JDK Locale fields and mapping to BCP 47 language subtags.  The sections in bold/orange are new proposed enhancements.

JDK Locale fields Available values Example values BCP 47 Language subtag representation (caseless) Note
ISO 639 Part 1 two-letter code
Lower case letters in JDK Locale.
ISO 639 Part 2 three-letter code (New) "kok"
ISO 639 Part 3 three-letter code (New) "aaa"
script (New) ISO 15924 four-letter script code "Hans"
(Simplified Chinese)
"Hans" The first letter is upper case and the rest of letters are lower case in JDK Locale.
(Traditional Chinese)
ISO 3166 two-letter country code
(United States)
Upper case letters when ISO 3166 code is used in JDK Locale.
UN M.49 three-digit region code (New) "029"
variant Variants supported by JDK



extension: "u-ca-japanese"



extension: "u-nu-thai"



language: "nn"
Registered BCP 47 variants


(de-1996, German orthography of 1996)



(en-scotland, Scottish Standard English)

User defined



extensions (New) Unicode Locale Extension Unicode locale key "cu": value "usd" "*-u-cu-usd" key/value pairs will be stored in a Map object in JDK Locale.
Unicode locale key "ca": value "japanese" "*-u-ca-japanese"
Generic Extension extension 'a': value "age-20"


(illustration, assuming that "a" is registered in IANA)

Private Use Extension
extension 'x' : value "type-admin" "x-type-admin"

4. Equality of Locales

In this proposal, an instance of Locale is equal to another instance of Locale when each corresponding field has the exact same value.  Also an instance of Locale is never equal to another instance of Locale when any corresponding field has different values.  This section will discuss what needs to be done to keep this relationship across JDK releases.

4.1. Deprecated Language Codes

There are some ISO 639 two-letter language codes used by the JDK Locale that are already deprecated.  The Java API reference clearly explains that these codes ("iw", "ji" and "in") are used by the JDK Locale class, even when the new codes ("he", "yi" and "id") are supplied to the Locale constructor.  This proposal does not change the mapping and the JDK Locale will continue to use the old codes for the language field.

4.2. Three-letter Language Codes

This proposal introduces the use of ISO 639 three-letter language codes.  Many three-letter codes in ISO 639 also have ISO 639 Part 1 two-letter codes.  For example, English is represented by "eng" in ISO 639 Part 2 and 3, and "en" in ISO 639 Part 1.  Allowing both forms would create redundant identifiers for the same language, so the BCP 47 language tag specification and the Unicode language and locale identifiers only allow the shortest form; that is, an ISO 639 three-letter code can be used only when the language does not have an ISO 639 two-letter code.

When a three-letter code is specified in the JDK Locale constructors (although this is illegal by the current API definition), it could be mapped to the ISO 639 two-letter code if one is available.  However, that behavior would break existing Java applications that might rely on the three-letter language codes being preserved in Locale instances.  To avoid the backward compatibility problem, this proposal does not perform any such mapping.  That is, new Locale("en") and new Locale("eng") will result in two distinct Locale instances. It is the responsibility of the user to perform such mapping if desired.

Similarly, three-digit region codes are not allowed in BCP47 when a two-letter region (country) code exists, but such a mapping and/or validation is again left to the user of Locale.
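
The following sketch illustrates both points: "en" and "eng" yield unequal locales, and an application that wants the shortest form has to map codes itself. It builds locales through Locale.Builder as it later shipped, which, like the constructors, performs no such mapping. The SHORTEST table is a hand-written, hypothetical excerpt and not real registry data.

```java
import java.util.Locale;
import java.util.Map;

public class ShortestFormDemo {
    // Hypothetical excerpt of a three-letter to two-letter mapping table.
    private static final Map<String, String> SHORTEST =
            Map.of("eng", "en", "deu", "de", "fra", "fr");

    // Builds a locale with exactly the language code given (no mapping).
    static Locale ofLanguage(String code) {
        return new Locale.Builder().setLanguage(code).build();
    }

    // User-side normalization: prefer the two-letter code when the table has one.
    static Locale ofShortestLanguage(String code) {
        String lower = code.toLowerCase(Locale.ROOT);
        return ofLanguage(SHORTEST.getOrDefault(lower, lower));
    }

    public static void main(String[] args) {
        System.out.println(ofLanguage("en").equals(ofLanguage("eng")));         // false
        System.out.println(ofShortestLanguage("eng").equals(ofLanguage("en"))); // true
        System.out.println(ofShortestLanguage("fil").getLanguage());            // fil
    }
}
```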

4.3. Other considerations

Because of the lack of a script field, some Java users have used "zh_CN" as a language identifier for Simplified Chinese and "zh_TW" for Traditional Chinese.  This does not work well for users who want to re-use the same resource for other Chinese locale variants.  For example, some Java users may want to share the Traditional Chinese localized contents for zh_TW and zh_HK.  To share the same Traditional Chinese resource, they currently have to write a custom ResourceBundle.Control (introduced in JDK 6).  The introduction of script will allow them to maintain a single Traditional Chinese resource shared by multiple Chinese sub-locales without such customization.  "zh_TW" by the definition of old JDK releases would normally be equivalent to "zh_Hant_TW".  However, they are not exactly the same in terms of the BCP 47 language tag definition.  Thus, this proposal interprets "zh_TW" and "zh_Hant_TW" as different locales, but such mapping is considered in locale resource/service lookup (Please refer to Section 7. Locale Resource/Service Lookup for the details).

5. Locale Construction

The existing Locale constructors are not sufficient to fill in the new Locale fields.  Also, there is a need to create an instance of Locale from a BCP 47 language tag or a Unicode locale identifier.  This section describes the proposed enhancements related to Locale instantiation.

5.1. Constructors

 There are three constructors available in the current JDK Locale class.

public Locale(String language)
public Locale(String language, String country)
public Locale(String language, String country, String variant)

A few minor changes to these existing constructors are necessary:
  • Update the API specification to accept ISO 639 three-letter language codes (3.1. Language)
  • Update the API specification to accept UN M.49 three-digit numeric area codes (3.3. Region)
This proposal introduces two new fields, script and extensions.  New constructors that take a script and extensions could be added alongside the existing ones.  However, many Java users have pointed out that an instance of Locale is immutable and that creating multiple equivalent Locale instances is not a good idea.  This proposal therefore defines a Locale Builder (described below) instead of new constructors.

5.2. Locale Builder

The extensions field itself is represented internally by a LocaleExtensions object.  Because its syntax is strict, this proposal does not allow Java users to set the value directly.  However, because the extensions field might be used to customize the behavior of JDK locale service classes, the proposal allows finer-grained building of locales via Locale.Builder, a new class that allows Java users to edit each Locale field, including extensions, in a controlled way.  An instance of Locale.Builder is mutable, unlike Locale.  Java users can construct a Locale.Builder from scratch or start from an existing Locale, then call setter methods to modify field values.  After editing the fields, the method public Locale build() returns a new instance of Locale (or possibly a cached instance) with the customized field contents.
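
As an illustration, the sketch below builds one locale from scratch and derives another from an existing Locale. It assumes the builder as it later shipped, as the nested class java.util.Locale.Builder with setLanguage, setScript, setRegion, setLocale and setUnicodeLocaleKeyword; the wrapper class and method names are hypothetical.

```java
import java.util.Locale;

public class BuilderDemo {
    // Builds a locale field by field, including the new script field.
    static Locale traditionalChineseTaiwan() {
        return new Locale.Builder()
                .setLanguage("zh")
                .setScript("Hant")
                .setRegion("TW")
                .build();
    }

    // Starts from an existing (immutable) Locale and adds a Unicode keyword.
    static Locale withPhonebookCollation(Locale base) {
        return new Locale.Builder()
                .setLocale(base)
                .setUnicodeLocaleKeyword("co", "phonebk")
                .build();
    }

    public static void main(String[] args) {
        System.out.println(traditionalChineseTaiwan().toLanguageTag());             // zh-Hant-TW
        System.out.println(withPhonebookCollation(Locale.GERMANY).toLanguageTag()); // de-DE-u-co-phonebk
    }
}
```

The builder is the mutable object here; the Locale passed to setLocale is left unchanged.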

5.3 From BCP47 Language Tag

We expect that creating an instance of Locale from a BCP47 language tag will be common in Java applications.  For example, a J2EE Servlet container creates a Locale from the HTTP Accept-Language header.  The JDK currently does not have such an API, so Java application developers are writing their own code to parse BCP47 language tags.  Some of them assume the first subtag is always a language, the second subtag is always a country, and the rest are the variant. Such an implementation would parse the input language tag "zh-Hant-TW" and create a new Locale using new Locale("zh", "Hant", "TW"), which is incorrect.

This proposal adds a new static method:

public static Locale forLanguageTag(String languageTag)

This method always returns an instance of Locale even if the input language tag is malformed.  The implementation evaluates each subtag from the beginning.  When it encounters a malformed subtag, that subtag and all following subtags are truncated.  If the last valid subtag requires extra subtags, for example, an extension singleton letter followed by a malformed extension subtag, the extension itself is truncated.  For example,

"en-US-12-345" -> en_US ("12" is illegal at the position)
"ja-JP-x-WindowsVista" -> ja_JP ("WindowsVista" exceeds 8 letters in length, and "x" alone is illegal)
"a-b" -> ROOT ("a" is illegal at the position, therefore, entire input is truncated)

Although this method is good enough for common use cases, some others may want to report an error if an input language tag contains malformed subtags.  The Locale Builder class described above supports strict syntax checking and error reporting in the method below.

public Builder setLanguageTag(String languageTag)

This method throws java.util.IllformedLocaleException (new) when the input language tag is invalid.  The caller can access the error index through the exception.  For example, for the input language tag "en-US-12-345", IllformedLocaleException#getErrorIndex() will return 6 (the start offset of subtag "12").
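
The contrast between the lenient and the strict entry points can be sketched as follows, assuming the API as it later shipped in Java 7 (Locale.forLanguageTag, Locale.Builder.setLanguageTag, IllformedLocaleException.getErrorIndex). The wrapper class and its helper methods are hypothetical.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

public class LanguageTagParsingDemo {
    // Lenient parsing: never throws; ill-formed trailing subtags are dropped.
    static String lenient(String tag) {
        return Locale.forLanguageTag(tag).toString();
    }

    // Strict parsing: returns the error index, or -1 when the tag is well-formed.
    static int errorIndexOf(String tag) {
        try {
            new Locale.Builder().setLanguageTag(tag);
            return -1;
        } catch (IllformedLocaleException e) {
            return e.getErrorIndex();
        }
    }

    public static void main(String[] args) {
        String[] tags = {"zh-Hant-TW", "en-US-12-345", "ja-JP-x-WindowsVista", "a-b"};
        for (String tag : tags) {
            System.out.println(tag + " lenient=[" + lenient(tag)
                    + "] errorIndex=" + errorIndexOf(tag));
        }
    }
}
```

A caller that only needs a best-effort Locale, such as a servlet container reading Accept-Language, would use the lenient form; a caller validating user input would use the strict form and report the error index.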

There are some design considerations with language tag parsing, as follows.

Grandfathered Tags

BCP47 supports some irregular language tags introduced in the past.  Many of them are deprecated and have equivalent well-formed (langtag production of Language Tag) mappings.  The implementation in JDK will transform those grandfathered tags that have well-formed mappings in these methods.  These special mappings are:

grandfathered tag
regular tag

Other grandfathered tags do not have well-formed mappings. In input, they convert as follows:

grandfathered tag
regular tag

At this moment, these are all registered grandfathered tags.  It is extremely unlikely that other grandfathered tags will be registered in the future.  But if any new grandfathered tags are introduced that do not satisfy the regular RFC5646 langtag production of language tag syntax, they will be treated as invalid language tags in the JDK implementation.


The canonical form of BCP47 is used, whereby any extlang field replaces the language field. For example, input of "zh-yue" produces the same result as "yue" alone.

Validity of Subtags

Valid subtags are maintained in the IANA Language Subtag Registry.  Any subtag that is not listed in the registry is invalid for use in a BCP47 language tag.  However, the registry is large and grows over time.  Checking the validity of subtags would require the registry data to be imported into the JDK implementation, and it could easily become out of sync with the IANA registry.  For this reason, this design does not check the validity of subtags.  For example, the language subtag "ac" is not currently registered, and cannot be used in a BCP47 language tag at this moment. But the proposed implementation does not reject "ac" because it still satisfies the RFC5646 language tag syntax, and could become valid in the future.

5.4. System Default Locale

This proposal introduces a Locale representation that contains fields that were not in previous Java releases.  However, one of our primary goals is not to break existing applications.  Locales included in the supported Locale list in former Java releases should not be changed.  For example, while a Locale with script field, such as zh_Hant_HK could be used as the Java system Locale, the JVM implementation should still use the existing form zh_HK.  This proposal resolves such mapping in the lookup algorithm, so Java users can put their own resource bundle with script information.

Although the JVM implementation will use the legacy Locale form by default, it does not mean a new-style Locale cannot be used as the Java system Locale.  If a user program intentionally creates a Locale and sets it as the default Locale (using Locale.setDefault(Locale)), the new style Locale will be used as the system Locale.  Another case is when the JVM detects a system locale that requires the enhancements discussed in this proposal.  For example, if the underlying platform uses a language which can be only represented by an ISO 639 three-letter code, the new JVM implementation will use this three-letter language code in the initial default Java Locale.

6. String Representation of Locale

The method public String toString() in the JDK Locale class returns the programmatic name of the Locale.  The text value is used as Locale ID in many Java applications.  This proposal introduces a couple of new fields, which ideally should be included in the text representation. But such incompatible changes might break existing Java applications. 

For example, the script subtag is placed between language and region in a BCP47 language tag.  This design reflects the fact that the script is more important than the region when two language tags are evaluated for matching.  The same principle applies to the JDK Locale: for locale resource lookup, the script field is more important than the country field. However, there are existing Java programs doing their own resource lookup based on the string representation of Locale.  For these applications, inserting the script value between language and country might result in an unexpected lookup result.  For example, if someone stores Traditional Chinese content for Taiwan in a resource tagged with zh_TW, this content would not be resolved from an input locale string zh_Hant_TW if the implementation simply truncated fields from the right until a match was found.

The solution adopted here is that values that cannot be represented in the older format are appended at the end of string, after the special prefix character '#'.  When there is already a variant value, "_#" followed by the value is appended.  When there is no variant value, "#" + the value is appended.  To old Java programs using a String representation of Locale, the value looks like an additional variant tag.  Thus, resource look-up based on truncating fields from right still returns the same result. Yet the information can be recovered from the string, by recognizing the special # character. That character was chosen so that it would be extremely unlikely to collide with existing usage of Locale.
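
To show why appending the new fields keeps legacy lookups working, here is a minimal sketch of a right-truncating lookup of the kind described above. The helper truncateFromRight is hypothetical. The input string comes from Locale.toString() as it later shipped in Java 7, whose exact output may differ in detail from the format proposed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LegacyTruncationDemo {
    // Legacy-style lookup: drop the last '_'-separated field until nothing is left.
    static List<String> truncateFromRight(String localeId) {
        List<String> candidates = new ArrayList<>();
        String id = localeId;
        while (!id.isEmpty()) {
            candidates.add(id);
            int cut = id.lastIndexOf('_');
            id = cut < 0 ? "" : id.substring(0, cut);
        }
        return candidates;
    }

    public static void main(String[] args) {
        String id = Locale.forLanguageTag("zh-Hant-TW").toString();
        System.out.println(id);                    // zh_TW_#Hant
        System.out.println(truncateFromRight(id)); // [zh_TW_#Hant, zh_TW, zh]
    }
}
```

Because the script travels at the end of the string, the second candidate is still zh_TW, so a resource tagged with zh_TW is found.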

The tables below illustrate how the script value and extension values are inserted in the result of toString().

6.1. Script

The script field in the JDK Locale will be appended at the end of its string representation, as illustrated in the following table:

Region Variant

- Latn
- Latn

- Latn

6.2. Extensions

The extensions fields in the JDK Locale will be appended at the end of its string representation, as illustrated in the following table:

Language Script
de -
extension 'u': cu-eur
extension 'u': cu-eur
extension 'u': cu-eur
extension 'u': co-phonebk
extension 'x': jdk-1-7

6.3. BCP47 Language Tag

public static Locale forLanguageTag(String languageTag) creates an instance of Locale from a language tag.  This proposal also adds the inverse method, which returns a language tag string for an instance of Locale:

public String toLanguageTag()

The toLanguageTag method will return a well-formed (langtag production of RFC5646 language tag) language tag for the Locale instance.  There are a few complications when generating a well-formed language tag from a Locale:

Mandated Language Subtag

The langtag production of RFC5646 language tag requires a non-empty language subtag.  The current JDK Locale allows an empty language field.  When toLanguageTag() generates a language tag, "und" (Undetermined) will be used as the language subtag if Locale's language field is empty.

Ill-formed Fields

Each subtag has its own syntax restriction.  For example, language subtag must be 2 to 8 letter ALPHA ([a-z][A-Z]).  Although the current API documentation for Locale constructors says two-letter ISO 639 language code is used for the language argument, the implementation does not check if the input language argument really satisfies the condition.  For example, new Locale("Hello World!"); creates an instance of Locale with language field "hello world!".  Of course, such a Locale cannot be mapped to a well-formed language tag.  The method toLanguageTag() will omit any locale field that does not satisfy the language tag syntax requirements. (In this specific example, "hello world!" is not appropriate for language subtag.  Because an empty language subtag is not allowed, "und" is returned.)
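
Both rules, the "und" substitution for an empty language and the omission of ill-formed fields, can be observed with the sketch below. It assumes toLanguageTag() as it later shipped in Java 7. The helper tagOf is hypothetical, and the deprecation suppression is only needed on recent JDKs where the Locale constructors are deprecated.

```java
import java.util.Locale;

public class UndeterminedLanguageDemo {
    // Builds a Locale through the legacy constructor, which performs no syntax checks.
    @SuppressWarnings("deprecation")
    static String tagOf(String language, String country) {
        return new Locale(language, country).toLanguageTag();
    }

    public static void main(String[] args) {
        System.out.println(tagOf("", ""));             // und
        System.out.println(tagOf("", "US"));           // und-US
        System.out.println(tagOf("Hello World!", "")); // und
    }
}
```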

Ill-formed Variant Field

This is a special case of ill-formed field.  The current JDK Locale allows any value in the variant field.  Unlike the example above, in which "Hello World!" is an invalid language according to existing javadoc, such a value is a valid variant (there is no syntactical limitation to variant in the current Javadoc). The variant subtag in the language tag specification only allows 5 to 8 letter alphanum ([a-z][A-Z][0-9]) or a single digit followed by 3 alphanum. The implementation of toLanguageTag() will map Locale's variant field to variant subtag as long as it satisfies the syntax.  Failing this, if it satisfies the syntax of the private use subtag, the segment and the following part will be mapped to the private use subtag with special prefix "-variant-".  Finally, if the variant does not satisfy the private use syntax either, it is considered as segments (delimited by LOW_LINE) and the first failing segment and all following segments will be omitted.  For example,

new Locale("en", "US", "POSIX").toLanguageTag() -> "en-US-POSIX" ("POSIX" satisfies the variant subtag syntax.  Note: en-US-POSIX is actually invalid, because POSIX is not a registered variant subtag.)
new Locale("en", "US", "Windows_XP").toLanguageTag() -> "en-US-Windows-x-variant-XP" ("Windows" satisfies the variant syntax, but "XP" does not.  But the private use syntax allows "XP".)
new Locale("en", "US", "Solaris10").toLanguageTag() -> "en-US" ("Solaris10" is 9 characters long and illegal for any subtag; therefore, it is omitted.)
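
The variant rule can be sketched as a standalone function. This is a minimal illustration of the rule as worded in this section, using the "variant" prefix from the examples above. It is not the JDK implementation, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class VariantMappingSketch {
    // RFC 5646: variant = 5*8alphanum / (DIGIT 3alphanum)
    static boolean isVariant(String s) {
        return s.matches("[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}");
    }

    // RFC 5646: each private use subtag is 1*8alphanum
    static boolean isPrivateUse(String s) {
        return s.matches("[A-Za-z0-9]{1,8}");
    }

    // Returns the subtags derived from a variant field, joined by '-', or "" if none.
    static String mapVariant(String variantField) {
        String[] segments = variantField.isEmpty() ? new String[0] : variantField.split("_");
        List<String> subtags = new ArrayList<>();
        int i = 0;
        // Leading segments that satisfy the variant syntax become variant subtags.
        while (i < segments.length && isVariant(segments[i])) {
            subtags.add(segments[i++]);
        }
        // Following segments that satisfy the private use syntax go under "x-variant".
        if (i < segments.length && isPrivateUse(segments[i])) {
            subtags.add("x");
            subtags.add("variant");
            while (i < segments.length && isPrivateUse(segments[i])) {
                subtags.add(segments[i++]);
            }
        }
        // The first segment that fits neither syntax, and everything after it, is dropped.
        return String.join("-", subtags);
    }

    public static void main(String[] args) {
        System.out.println(mapVariant("POSIX"));      // POSIX
        System.out.println(mapVariant("Windows_XP")); // Windows-x-variant-XP
        System.out.println(mapVariant("Solaris10"));  // (empty)
    }
}
```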

7. Locale Resource/Service Lookup 

The current JDK implementation uses relatively simple logic in locale resource/service lookup.  It begins with a given Locale and trims fields one by one from variant and then country to locate a matching resource or service.  In the real world, representing the hierarchy by three fields (language, country and variant) is not sufficient.  The introduction of the script field allows applications to improve their lookup strategy.  But supporting backwards compatibility requires a few additional changes to the lookup logic. Our goal for the built-in lookup strategy was to enhance the lookup to function correctly whether or not the new fields are used, while maintaining backwards compatibility.

Note that resource bundle lookup order is different than resource item inheritance. The inheritance of resource items still follows a truncation model, since the resources are built presuming that model. For more information about lookup vs. inheritance, see LDML.

7.1. Basic Lookup Strategy

For locale resource or service lookup, this proposal handles the extensions separately from the rest of the fields.  This proposal uses the term "Base Locale" which consists of the language, script, country, and variant (in other words, all fields except extensions).  The primary purpose of the new field extensions in this proposal is to customize the behavior of locale services.  When resolving a resource bundle, only the base locale is used.  When resolving a service associated with a locale, the lookup logic also only uses the base locale, but since the service implementation might refer to the information stored in the extensions, the extensions are not truncated when invoking the resolved locale service implementation.
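
The notion of a base locale can be illustrated with Locale.stripExtensions(), a method that was added to java.util.Locale in Java 8 and is not part of this proposal. The wrapper class and method name are hypothetical.

```java
import java.util.Locale;

public class BaseLocaleDemo {
    // Returns the locale with language, script, country and variant only.
    static Locale baseLocale(Locale locale) {
        return locale.stripExtensions();
    }

    public static void main(String[] args) {
        Locale requested = Locale.forLanguageTag("de-DE-u-co-phonebk");
        Locale base = baseLocale(requested);
        // Lookup would use the base locale ...
        System.out.println(base.equals(Locale.GERMANY));            // true
        // ... while the service still sees the extensions on the requested locale.
        System.out.println(requested.getUnicodeLocaleType("co"));   // phonebk
    }
}
```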

7.2. Lookup Order

The current JDK Locale does not support the script field. A distinction in writing system is vital, but today it can only be achieved by associating the difference with a country or variant.  The introduction of script provides a clean solution, but the new implementation must not break existing applications.  This is handled by using a lookup order that manages the interaction between script, region, and language. The chart below illustrates the proposed lookup order.

Note: In the description below, L, S, C and V represent non-empty language, script, country and variant. A locale formed from those fields is represented by brackets around a list of these letters. Thus [L, C] represents a Locale that has a non-empty language and country. When indicating a specific value, a string follows the letter in parentheses; for example, L("xx") represents the language "xx".

Request Locale    Candidate List
[L,S,C,V]         1. [L,S,C,V]
                  2. [L,S,C]
                  3. [L,S]
                  4. [L,S,V]
                  5. [L,C]
                  6. [L]
[L,S,C]           1. [L,S,C]
                  2. [L,S]
                  3. [L,C]
                  4. [L]
[L,S,V]           1. [L,S,V]
                  2. [L,S]
                  3. [L,V]
                  4. [L]
[L,S]             1. [L,S]
                  2. [L]
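For comparison, the JDK releases that later shipped a script-aware Locale expose a candidate list through ResourceBundle.Control.getCandidateLocales. The sketch below prints that list for a locale with language, script and country. The base name "Messages" is a placeholder (no bundle needs to exist), and the class and method names CandidateListDemo and candidateTags are hypothetical. The shipped list also ends with the root locale, which the chart above does not show, and it may differ from this proposal in other details.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

class CandidateListDemo {
    // Returns the default candidate locales for a request, as BCP 47 tags.
    static List<String> candidateTags(Locale requested) {
        ResourceBundle.Control control =
                ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        List<String> tags = new ArrayList<>();
        // "Messages" is only a placeholder base name; no bundle is loaded here.
        for (Locale candidate : control.getCandidateLocales("Messages", requested)) {
            tags.add(candidate.toLanguageTag());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Serbian in Latin script as used in Serbia: a request of the form [L,S,C].
        System.out.println(candidateTags(Locale.forLanguageTag("sr-Latn-RS")));
        // prints [sr-Latn-RS, sr-Latn, sr-RS, sr, und]  ("und" is the root locale)
    }
}
```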

7.3. Special Handling for Backward Compatibility Support


The current JDK represents Simplified Chinese as zh_CN and Traditional Chinese as zh_TW.  With the introduction of script, these should be represented as zh_Hans and zh_Hant.  Even when resource bundles are tagged with zh_Hans or zh_Hant, however, the application still needs to support requests for zh_CN or zh_TW.  To keep the behavior backward compatible, this proposal handles some Chinese locales without a script as a special case.  When Locale zh_CN is requested, the implementation adds the script internally before generating the candidate Locale list.  This special expansion will be done only when the language field is "zh" and the country (region) is one of the codes below:


Also, existing applications are likely to package Simplified Chinese bundles as zh_CN and Traditional Chinese bundles as zh_TW.  Therefore, when locale zh_Hans is requested, zh_CN should be in the candidate list, and when locale zh_Hant is requested, zh_TW should be in the candidate list.  The proposed implementation will supply the country field when a Chinese locale has a script but an empty country.  The chart below illustrates the proposed behavior:

Requested Locale        Candidate List
[L("zh"),S("Hans")]     1. [L("zh"),S("Hans"),C("CN")]
                        2. [L("zh"),S("Hans")]
                        3. [L("zh"),C("CN")]
                        4. [L("zh")]
[L("zh"),S("Hant")]     1. [L("zh"),S("Hant"),C("TW")]
                        2. [L("zh"),S("Hant")]
                        3. [L("zh"),C("TW")]
                        4. [L("zh")]
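The country-supplying step for Chinese can be sketched as follows. This is a minimal illustration of the chart above, not the proposed implementation: it ignores variants, handles only the Hans/CN and Hant/TW pairs named in the text, and writes candidates in the underscore notation used in this section. The names ChineseCandidateSketch, candidates and join are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ChineseCandidateSketch {
    // Builds the candidate list for a locale given as language, script and
    // country (empty string means "not set"). Variants are not handled.
    static List<String> candidates(String language, String script, String country) {
        // Special case from the proposal: a Chinese locale with a script but
        // no country is given the country that legacy bundles are tagged with.
        if (language.equals("zh") && !script.isEmpty() && country.isEmpty()) {
            if (script.equals("Hans")) {
                country = "CN";
            } else if (script.equals("Hant")) {
                country = "TW";
            }
        }
        List<String> result = new ArrayList<>();
        if (!script.isEmpty() && !country.isEmpty()) {
            result.add(join(language, script, country)); // [L,S,C]
        }
        if (!script.isEmpty()) {
            result.add(join(language, script));          // [L,S]
        }
        if (!country.isEmpty()) {
            result.add(join(language, country));         // [L,C]
        }
        result.add(language);                            // [L]
        return result;
    }

    private static String join(String... parts) {
        return String.join("_", parts);
    }

    public static void main(String[] args) {
        System.out.println(candidates("zh", "Hans", "")); // [zh_Hans_CN, zh_Hans, zh_CN, zh]
        System.out.println(candidates("zh", "Hant", "")); // [zh_Hant_TW, zh_Hant, zh_TW, zh]
    }
}
```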


The current Oracle JDK uses Locale no_NO_NY to represent Norwegian Nynorsk (Norway).  This representation is illegal in a BCP47 language tag: it should actually use "nn" (Norwegian Nynorsk) for the language field.  Also, the Oracle JDK treats Locale no as Norwegian Bokmål, which should be represented by "nb" if it needs to be clearly distinguished from Norwegian Nynorsk.  Because the current JDK implementation assigns special semantics to these locales that are not compatible with the rest of the world, special handling for Norwegian locale lookup is proposed as below:

Request Locale                    Candidate List
[L("no"),C("NO")]                 1. [L("no"),C("NO")]
                                  2. [L("nb"),C("NO")]
                                  3. [L("no")]
                                  4. [L("nb")]
[L("nb"),C("NO")]                 1. [L("nb"),C("NO")]
                                  2. [L("no"),C("NO")]
                                  3. [L("nb")]
                                  4. [L("no")]
[L("nn"),C("NO")]                 1. [L("nn"),C("NO")]
                                  2. [L("no"),C("NO"),V("NY")]
                                  3. [L("nn")]
                                  4. [L("no"),C("NO")]
                                  5. [L("no")]
[L("no"),C("NO"),V("NY")]         1. [L("no"),C("NO"),V("NY")]
                                  2. [L("nn"),C("NO")]
                                  3. [L("nn")]
                                  4. [L("no"),C("NO")]
                                  5. [L("no")]
[L("no")]                         1. [L("no")]
                                  2. [L("nb")]
[L("nb")]                         1. [L("nb")]
                                  2. [L("no")]
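The "no"/"nb" rows of the chart follow one pattern: at each level the requested language is tried first, then its synonym. A minimal sketch of that interleaving is shown below. It deliberately leaves out the Nynorsk rows ("nn" and no_NO_NY) and variants in general, and the names NorwegianCandidateSketch and candidates are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class NorwegianCandidateSketch {
    // Builds the candidate list for a language and country (empty string
    // means "not set"), treating "no" and "nb" as synonyms of each other.
    static List<String> candidates(String language, String country) {
        String synonym = null;
        if (language.equals("no")) {
            synonym = "nb";
        } else if (language.equals("nb")) {
            synonym = "no";
        }
        List<String> result = new ArrayList<>();
        if (!country.isEmpty()) {
            result.add(language + "_" + country);     // requested language first
            if (synonym != null) {
                result.add(synonym + "_" + country);  // then its synonym
            }
        }
        result.add(language);
        if (synonym != null) {
            result.add(synonym);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(candidates("no", "NO")); // [no_NO, nb_NO, no, nb]
        System.out.println(candidates("nb", ""));   // [nb, no]
    }
}
```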

8. Proposed New APIs

Earlier sections of this document discuss various technical topics.  The table below gathers the proposed APIs, grouped by JDK class, together with their descriptions.

Class Signature
java.util.Locale
  public String getScript()
Returns the script code for this locale, which should either be the empty string or an ISO 15924 4-letter script code.
  public String getExtension(char key)
Returns the extension (or private use) value associated with the specified singleton key, or null if there is no extension associated with the key.
  public Set<java.lang.Character> getExtensionKeys()
Returns the set of extension keys associated with this locale, or the empty set if it has no extensions.
  public String getUnicodeLocaleType(String key)
Returns the Unicode locale type associated with the specified Unicode locale key for this locale.
  public Set<java.lang.String> getUnicodeLocaleKeys()
Returns the set of keys for Unicode locale keywords defined by this locale, or null if this locale has no locale extension.
  public String toLanguageTag()
Returns a well-formed IETF BCP 47 language tag representing this locale.
  public static Locale forLanguageTag(String languageTag)
Returns a locale for the specified IETF BCP 47 language tag string.
  public String getDisplayScript()
Returns a name for the locale's script code that is appropriate for display to the user.
  public String getDisplayScript(Locale inLocale)
Returns a name for the locale's script code that is appropriate for display to the user.
  public static final char PRIVATE_USE_EXTENSION
The key for the private use extension ('x').
  public static final char UNICODE_LOCALE_EXTENSION
The key for the LDML extension ('u').
[New Class]
Builder is used to build instances of Locale from values configured by the setters. Unlike the Locale constructors, the Builder checks whether a value configured by a setter satisfies the syntactical requirements defined by the Locale class.

public Builder()
Constructs an empty Builder. The default value of all fields, extensions, and private use information is the empty string.
  public Builder(boolean isLenientVariant)
Constructs an empty Builder with an option whether to allow setVariant to accept a value that does not conform to the IETF BCP 47 variant subtag's syntax requirements.
  public boolean isLenientVariant()
Returns true if this Builder accepts a value that does not conform to the IETF BCP 47 variant subtag's syntax requirements in setVariant.
  public Builder setLocale(Locale locale)
Resets the Builder to match the provided locale.
  public Builder setLanguageTag(String languageTag)
Resets the builder to match the provided IETF BCP 47 language tag.
  public Builder setLanguage(String language)
Sets the language.
  public Builder setScript(String script)
Sets the script.
  public Builder setRegion(String region)
Sets the region.
  public Builder setVariant(String variant)
Sets the variant.
  public Builder setExtension(char key, String value)
Sets the extension for the given key.
  public Builder setUnicodeLocaleKeyword(String key, String type)
Sets the Unicode locale keyword type for the given key.
  public Builder clear()
Resets the builder to its initial, empty state.
  public Builder clearExtensions()
Resets the extensions to their initial, default state.
  public Locale build()
Returns an instance of Locale created from the fields set on this builder.
public abstract String getDisplayScript(String scriptCode, Locale locale)
Returns a localized name for the given IETF BCP47 script code and the given locale that is appropriate for display to the user.
[New Class]
Thrown by methods in java.util.Locale to indicate that a value is ill-formed.

public IllformedLocaleException(String message)
Constructs a new IllformedLocaleException with the given message and -1 as the error index.
  public IllformedLocaleException(String message, int errorIndex)
Constructs a new IllformedLocaleException with the given message and error index.
  public int getErrorIndex()
Returns the index where the error was found, or -1 if unknown.
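A short usage sketch of several of the APIs listed above follows. It assumes the names as they later shipped in java.util for JDK 7 and above (Builder nested in Locale as Locale.Builder, and java.util.IllformedLocaleException); the Builder(boolean isLenientVariant) constructor is not used. The class ProposedApiDemo and its helper methods are hypothetical names for this example only.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

class ProposedApiDemo {
    // Builds Simplified Chinese (China) with a Unicode collation keyword.
    static Locale simplifiedChinese() {
        return new Locale.Builder()
                .setLanguage("zh")
                .setScript("Hans")
                .setRegion("CN")
                .setUnicodeLocaleKeyword("co", "pinyin")
                .build();
    }

    // Reports whether the Builder accepts the given script code. Unlike the
    // Locale constructors, the Builder validates its input.
    static boolean isWellFormedScript(String script) {
        try {
            new Locale.Builder().setScript(script);
            return true;
        } catch (IllformedLocaleException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Locale locale = simplifiedChinese();
        System.out.println(locale.getScript());                // Hans
        System.out.println(locale.getUnicodeLocaleType("co")); // pinyin
        System.out.println(locale.toLanguageTag());            // zh-Hans-CN-u-co-pinyin

        // Round trip through the language tag.
        Locale parsed = Locale.forLanguageTag(locale.toLanguageTag());
        System.out.println(parsed.equals(locale));             // true

        System.out.println(isWellFormedScript("Latn"));        // true
        System.out.println(isWellFormedScript("Latin"));       // false: five letters
    }
}
```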


Appendix A. Script codes

The table below shows all of the ISO 15924 script codes that can be used for the Unicode script subtag.

Script Code Description
Arab Arabic
Armi Imperial Aramaic
Armn Armenian
Avst Avestan
Bali Balinese
Batk Batak
Beng Bengali
Blis Blissymbols
Bopo Bopomofo
Brah Brahmi
Brai Braille
Bugi Buginese
Buhd Buhid
Cakm Chakma
Cans Unified Canadian Aboriginal Syllabics
Cari Carian
Cham Cham
Cher Cherokee
Cirt Cirth
Copt Coptic
Cprt Cypriot
Cyrl Cyrillic
Cyrs Cyrillic (Old Church Slavonic variant)
Deva Devanagari (Nagari)
Dsrt Deseret (Mormon)
Egyd Egyptian demotic
Egyh Egyptian hieratic
Egyp Egyptian hieroglyphs
Ethi Ethiopic (Geʻez)
Geok Khutsuri (Asomtavruli and Nuskhuri)
Geor Georgian (Mkhedruli)
Glag Glagolitic
Goth Gothic
Grek Greek
Gujr Gujarati
Guru Gurmukhi
Hang Hangul (Hangŭl, Hangeul)
Hani Han (Hanzi, Kanji, Hanja)
Hano Hanunoo (Hanunóo)
Hans Han (Simplified variant)
Hant Han (Traditional variant)
Hebr Hebrew
Hira Hiragana
Hmng Pahawh Hmong
Hrkt (alias for Hiragana + Katakana)
Hung Old Hungarian
Inds Indus (Harappan)
Ital Old Italic (Etruscan, Oscan, etc.)
Java Javanese
Jpan Japanese (alias for Han + Hiragana + Katakana)
Kali Kayah Li
Kana Katakana
Khar Kharoshthi
Khmr Khmer
Knda Kannada
Kore Korean (alias for Hangul + Han)
Kthi Kaithi
Lana Lanna, Tai Tham
Laoo Lao
Latf Latin (Fraktur variant)
Latg Latin (Gaelic variant)
Latn Latin
Lepc Lepcha (Róng)
Limb Limbu
Lina Linear A
Linb Linear B
Lyci Lycian
Lydi Lydian
Mand Mandaic, Mandaean
Mani Manichaean
Maya Mayan hieroglyphs
Mero Meroitic
Mlym Malayalam
Mong Mongolian
Moon Moon, Moon code, Moon script, Moon type
Mtei Meitei Mayek, Meithei, Meetei
Mymr Myanmar (Burmese)
Nkoo N’Ko
Ogam Ogham
Olck Ol Chiki (Ol Cemet', Ol, Santali)
Orkh Orkhon
Orya Oriya
Osma Osmanya
Perm Old Permic
Phag Phags-pa
Phli Inscriptional Pahlavi
Phlp Psalter Pahlavi
Phlv Book Pahlavi
Phnx Phoenician
Plrd Pollard Phonetic
Prti Inscriptional Parthian
Rjng Rejang, Redjang, Kaganga
Roro Rongorongo
Runr Runic
Samr Samaritan
Sara Sarati
Saur Saurashtra
Sgnw SignWriting
Shaw Shavian (Shaw)
Sinh Sinhala
Sund Sundanese
Sylo Syloti Nagri
Syrc Syriac
Syre Syriac (Estrangelo variant)
Syrj Syriac (Western variant)
Syrn Syriac (Eastern variant)
Tagb Tagbanwa
Tale Tai Le
Talu New Tai Lue
Taml Tamil
Tavt Tai Viet
Telu Telugu
Teng Tengwar
Tfng Tifinagh (Berber)
Tglg Tagalog
Thaa Thaana
Thai Thai
Tibt Tibetan
Ugar Ugaritic
Vaii Vai
Visp Visible Speech
Xpeo Old Persian
Xsux Cuneiform, Sumero-Akkadian
Yiii Yi
Zmth Mathematical notation
Zsym Symbols
Zxxx Code for unwritten documents
Zyyy Code for undetermined script
Zzzz Code for uncoded script

Appendix B. Region codes

The table below shows all of the UN M.49 region codes that can be used for the Unicode region subtag.

Region code Description
001 World
002 Africa
005 South America
009 Oceania
011 Western Africa
013 Central America
014 Eastern Africa
015 Northern Africa
017 Middle Africa
018 Southern Africa
019 Americas
021 Northern America
029 Caribbean
030 Eastern Asia
034 Southern Asia
035 South-Eastern Asia
039 Southern Europe
053 Australia and New Zealand
054 Melanesia
057 Micronesia
061 Polynesia
142 Asia
143 Central Asia
145 Western Asia
150 Europe
151 Eastern Europe
154 Northern Europe
155 Western Europe
419 Latin America and the Caribbean