en, fr-CA, or zh-Hant. The standard Unicode language identifiers follow IETF BCP 47, with some small differences defined in UTS #35: Locale Data Markup Language (LDML). Locale identifiers use the same format, with certain possible extensions.Within programs and structured data, languages are indicated with stable identifiers of the form
Often it is not clear which language identifier to use. For example, what most people call Punjabi in Pakistan actually has the code 'lah', and formal name "Lahnda". There are many other cases where the same name is used for different languages, or where the name that people search for is not listed in the IANA registry. Moreover, a language identifier uses not only the 'base' language code, like 'en' for English or 'ku' for Kurdish, but also certain modifiers such as en-CA for Canadian English, or ku-Latn for Kurdish written in Latin script. Each of these modifiers are called subtags (or sometimes codes), and are separated by "-" or "_". The language identifier itself is also called a language tag, and sometimes a language code.
Here is an example of the steps to take to find the right language identifier to use. Let's say you to find the identifier for a language called "Ganda" which you know is spoken in Uganda. You'll first pick the base language subtag as described below, then add any necessary script/territory subtags, and then verify. If you can't find the name after following these steps or have other questions, ask on the Unicode CLDR Mailing List.
If you are looking at a prospective language code, like "swh", the process is similar; follow the steps below, starting with the verification.
ISO (and hence BCP 47) has the notion of an individual language (like en = English) versus a Collection or Macrolanguage. For compatibility, Unicode language and locale identifiers always use the Macrolanguage to identify the predominant form. Thus the Macrolanguage subtag "zh" (Chinese) is used instead of "cmn" (Mandarin). Similarly, suppose that you are looking for Kurdish written in Latin letters, as in Turkey. It is a mistake to think that because that is in the north, that you should use the subtag 'kmr' for Northern Kurdish. You should instead use ku-Latn-TR. See also: ISO 636 Deprecation Requests.
Unicode language identifiers do not allow the "extlang" form defined in BCP 47. For example, use "yue" instead of "zh-yue" for Cantonese.
When searching, such as site:ethnologue.com ganda, be sure to completely disregard matches in Ethnologue 14 -- these are out of date, and do not have the right codes!
The Ethnologue is a great source of information, but it must be approached with a certain degree of caution. Many of the population figures are far out of date, or not well substantiated. The Ethnologue also focus on native, spoken languages, whereas CLDR and many other systems are focused on written language, for computer UI and document translation, and on fluent speakers (not necessarily native speakers). So, for example, it would be a mistake to look at http://www.ethnologue.com/show_country.asp?name=EG and conclude that the right language subtag for the Arabic used in Egypt was "arz", which has the largest population. Instead, the right code is "ar", Standard Arabic, which would be the one used for document and UI translation.
Wikipedia is also a great source of information, but it must be approached with a certain degree of caution as well. Be sure to follow up on references, not just look at articles.
In some cases, systems (or companies) may have different conventions than the Preferred-Values in BCP 47 -- such as those in the Replacement column in the the online language identifier demo. For example, for backwards compatibility, "iw" is used with Java instead of "he" (Hebrew). When picking the right subtags, be aware of these compatibility issues. Implementations that use Unicode CLDR, would remap its standard codes for use within their system.
For compatibility, it is strongly recommended that all implementations accept both the preferred values and their alternates: for example, both "iw" and "he". Although BCP 47 itself only allows "-" as a separator; for compatibility, Unicode language identifiers allows both "-" and "_". Implementations should also accept both.
BCP 47 also provides for "variant subtags", such as zh-Latn-pinyin. When there are multiple variant subtags, the canonical format for Unicode language identifiers puts them in alphabetical order.