
Pragmatic South Asian Romanisation :: Khayaal aapka

Permalink: http://sites.google.com/site/sarvabhashin/articles/southasianromanisation


My other article, on the romanisation of Brahui and Balochi, invariably brings up the topic of romanisation of South Asian languages in general. With such romanisation affecting our daily lives – at least in India – in a considerable way, it seemed like a good excuse to delve into the matter of why there seems to be so much uncertainty about how to write names and words of Indian or South Asian origin in the Roman/Latin script.

Note: The term ‘transliteration’ normally refers to a one-to-one replacement of characters of one script with those of another script, while the term ‘romanisation’ can mean (i) transliteration only into the Roman script, or (ii) a sound-based transcription into the Roman script, with no regard to the characters used in the original script. In this article, I have used ‘romanisation’ to mean point (i) above. I have used the word ‘transcription’ to describe any occurrences of point (ii) above.

The origins of romanisation in South Asia are of course traceable to the British Raj, and a system of romanising and/or transcribing local place names for surveying purposes was developed by one William Wilson Hunter. The resulting system therefore came to be known as the Hunterian system of romanisation.

While the Hunterian system was reasonably suitable for romanising Hindi and related Indo-Aryan languages – albeit with some uncertainties – it did not provide the means to unambiguously romanise certain characters belonging to scripts of Dravidian or Tibeto-Burman languages, or even of other Indo-Aryan languages like Bengali.

In 1894, the International Alphabet of Sanskrit Transliteration (IAST) was established, but it was geared towards a lossless romanisation of Sanskrit – a language whose script matches its pronunciation almost completely.

This system, as its name suggested, was aimed at romanising only Sanskrit, and therefore failed to address the romanisation of characters not occurring in Sanskrit, such as some characters occurring only in the scripts of Sinhalese, Dravidian and Tibeto-Burman languages etc.

Also, since many modern South Asian languages that used Brahmi-derived scripts (also called Indic scripts) no longer had a one-to-one character-sound mapping, or in other words were no longer truly phonetic—due to inevitable evolution—their romanisation according to IAST threw up its fair share of issues, as it did with the Hunterian system.

The National Library at Calcutta romanisation (NLC) issued in 1988 expanded on the IAST to include missing romanisation for certain characters in the scripts of Dravidian and Eastern Indo-Aryan languages. However, it too did not provide romanisations for certain characters in Tibeto-Burman languages (Ladakhi, Dzongkha, Tibetan proper, Lepcha etc.).

In the meantime, there also appeared the UNGEGN (1972, focussing only on romanising place names), ALA-LC (1997 latest) and ISCII (1991) romanisation standards for various Indic scripts. Out of these, ISCII, the Indian Script Code for Information Interchange, was mainly designed as a system for representing Indic scripts on computers, but also addressed the issue of romanisation to quite an extent. It played a major part in the correlation and logical organisation of Brahmic script code points, on which the Unicode blocks for Indic scripts were later based, but it too had certain shortcomings.

More than a century after the IAST was invented, the ISO 15919 standard was introduced in 2001; it essentially tried to plug the holes present in the existing romanisation systems, and by those standards it was very comprehensive. It even included proposed romanisations for characters derived from Perso-Arabic in Brahmi-based scripts.

For some reason though, the ISO 15919 system still fell short in terms of the aforementioned romanisations for Tibeto-Burman languages. Also, since the Devanagari version of Kashmiri was not codified till later that decade, romanisation for Kashmiri too did not find mention in this system.

It may be argued that most of the above systems addressed only romanisations for Indic scripts, i.e., scripts derived from the Brahmi script, and therefore, are not intended to address the romanisations of Perso-Arabic-derived and Tibetan scripts.

However, the Tibetan script is Brahmi-derived, and therefore, would logically have formed part of any such system (a possible reason I could think of as to why Tibetan has been left uncovered is the presence of alternate romanisation systems for it such as Wylie and THL).

None of the above systems make any provisions for scripts not based on Brahmi or Perso-Arabic, such as Ol Chiki.

Add to all this the fact that, in addition to a character-for-character transliteration, there are other phonetic considerations (e.g. unpronounced or multiple-sound Indic characters) to be taken into account when devising a romanisation system. In some of these systems, such considerations are addressed incompletely or not at all.

These shortcomings essentially mean that (i) these romanisation systems are employable only in certain restricted contexts – either academic, or when referring only to a select few languages and scripts – and (ii) there exists no consistent romanisation scheme usable for all South Asian languages in general, irrespective of origin or script.

A comparison of the Hunterian and ISO 15919 systems, along with UN & ALA-LC romanisation schemes as applicable for certain Devanagari-based languages can be found here.


1) Hunterian System :: accepted in Kurrachee, not in Cawnpore

Examining the positive and not-so-positive points of each of the romanisation systems mentioned previously, let’s have a look first at the Hunterian system. It –

a) tried to ensure a character-for-character romanisation (except for diphthongs, aspirate consonants and some other consonants, whose romanisations used digraphs)

b) represented long vowels with an acute accent (later changed to a macron in 1954) over the romanised vowel character:

आ /aː/ = á
का /kaː/ = ká

c) took practicality into consideration (e.g. represented long vowels at the end of a word without an acute accent/macron, since word-ending vowels in many Indo-Aryan languages are pronounced long irrespective of whether they are written as long or short)

d) drew inspiration from existing English character-to-sound mappings:

श् / ष् /ɕ ~ ʃ/ = sh
च् /t͡ɕ/ = ch
ज् /d͡ʑ/ = j
य् /j/ = y

It also –

e) did not (initially) distinguish between retroflex and non-retroflex consonants:

ट् /ʈ/ and त् /t̪/ = t
ड् /ɖ/ and द् /d̪/ = d
ळ् /ɭ/ (Marathi et al) and ल् /l̪ ~ l/ = l
ड़् /ɽ/ and र् /r/ = r

The Hunterian system seems to have been supplemented later with underdots for retroflex characters, as can be found in some dictionaries published in British times, although I could not find any info on exactly when this happened or who was the first to do so.

f) romanised vowel nasalization /̃/, the dental/alveolar nasal stop /n̪ ~ n/ and the retroflex nasal stop /ɳ/ all as n

g) did not distinguish between multiple pronunciations of a particular character in the same language:
Marathi and Nepali च्, representing the sounds /t͡ɕ/ as well as /t͡sʰ/, was romanised ch for both sounds, presumably since the two different sounds are unmarked in their native scripts as well

h) did not distinguish between multiple pronunciations of a particular character in different languages:

ज्ञ = Hindi /ɡjə/ vs. Marathi /dnʲə/ (cf. Eastern Nagari equivalent character জ্ঞ with Bengali pronunciation /ɡɡɔ/)

i) did not provide for clear-cut transliteration of certain sounds occurring in Dravidian languages (and scripts):
எ ఎ ಎ എ – short /e/ (as opposed to long /eː/)
ஒ ఒ ಒ ഒ – short /o/ (as opposed to long /oː/)
ழ் ഴ് – /ɻ/
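The lossiness in points e) through i) can be made concrete with a few lines of code. The mapping below is a toy illustration using a handful of characters from point e), not an official Hunterian table: distinct Indic characters collapse onto the same Roman letter, so the romanisation cannot be reversed unambiguously.

```python
# Toy sketch of Hunterian-style lossy romanisation
# (illustrative subset only; not an official mapping table).
HUNTERIAN = {
    'ट': 't', 'त': 't',   # retroflex and dental stops collapse
    'ड': 'd', 'द': 'd',
    'ळ': 'l', 'ल': 'l',
    'ड़': 'r', 'र': 'r',
}

def romanise(ch: str) -> str:
    """Return the Hunterian-style romanisation of a single character."""
    return HUNTERIAN.get(ch, ch)

# Retroflex ट and dental त both come out as 't' -- a reader of the
# romanised form cannot recover which character was intended.
print(romanise('ट'), romanise('त'))  # t t
```

This is exactly why later systems resorted to underdots and other diacritics: without them, the Roman side of the mapping simply has fewer distinctions than the Indic side.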

In spite of these deficiencies, the Hunterian model, seemingly the first attempt at conjuring a logical romanisation system for Indian names and words, did usher in some sanity among all the orthographical madness. Most importantly, it provided the base for most other romanisation schemes that followed.

After all, if it were not for this system, we might still be referring to Pakistan’s largest city as Kurrachee, Scinde!


2) International Alphabet of Sanskrit Transliteration :: takṣaśilā gets the nod, /koːɻikkoːɽ/ doesn't

The International Alphabet of Sanskrit Transliteration (IAST) is intended as a lossless romanisation of Sanskrit and, according to Wikipedia, of Pali as well.

This romanisation system, while obviously building on the Hunterian system, makes a few changes –

a) replaces redundant or ambiguity-prone digraphs of the Hunterian system with single characters:

ङ् /ŋ/ = Hunterian ng, IAST ṅ
च् /t͡ɕ/ = Hunterian ch, IAST c
ञ् /ɲ/ = Hunterian ny, IAST ñ
श् /ɕ ~ ʃ/ = Hunterian sh, IAST ś

b) specifies an underdot for retroflex character romanisations to distinguish them from non-retroflex ones:

ट् /ʈ/ = Hunterian t, IAST ṭ
ड् /ɖ/ = Hunterian d, IAST ḍ
ष् /ʂ/ = Hunterian sh, IAST ṣ

c) provides a unique romanisation for ऋ /r̩/ = ṛ

d) provides a unique romanisation for anusvara and visarga:

ं (anusvara) = IAST ṃ
ः (visarga) /ɦ/ = IAST ḥ

e) uses the romanisation h to represent the glottal fricative as well as aspiration (for stop consonants), as did the Hunterian system:

/ɦ/ = Hunterian & IAST h
/kʰ/ = Hunterian & IAST kh
/bʱ/ = Hunterian & IAST bh
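To make the mechanics of such a character-for-character scheme concrete, here is a minimal sketch of a Devanagari-to-Roman transliterator over a toy subset of IAST-style mappings (the character tables are deliberately tiny and illustrative, not the full standard). The main point is the handling of the inherent vowel: every consonant carries an implicit a unless a matra or a virama follows, and a lossless system like IAST never deletes that schwa.

```python
# Toy IAST-style transliterator for a small subset of Devanagari.
# Illustrative only -- real IAST covers far more characters.
CONSONANTS = {'क': 'k', 'च': 'c', 'ट': 'ṭ', 'त': 't', 'द': 'd',
              'न': 'n', 'म': 'm', 'र': 'r', 'व': 'v', 'स': 's', 'श': 'ś'}
VOWELS = {'अ': 'a', 'आ': 'ā', 'इ': 'i', 'उ': 'u', 'ए': 'e', 'ओ': 'o'}
MATRAS = {'ा': 'ā', 'ि': 'i', 'ी': 'ī', 'ु': 'u', 'ू': 'ū', 'े': 'e', 'ो': 'o'}
VIRAMA = '्'

def to_iast(text: str) -> str:
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = chars[i + 1] if i + 1 < len(chars) else None
            if nxt == VIRAMA:
                i += 2          # bare consonant: no inherent vowel
                continue
            if nxt in MATRAS:
                out.append(MATRAS[nxt])
                i += 2
                continue
            out.append('a')     # inherent vowel, never deleted in IAST
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        i += 1
    return ''.join(out)

print(to_iast('कविता'))  # kavitā
```

A spoken-Hindi-oriented transcription would drop some of those inherent vowels (schwa deletion); a lossless transliteration, by design, cannot.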

However, since this system was intended purely for the romanisation of Sanskrit (and the derived Pali), characters not present in Sanskrit are not covered by this system. Hence the romanisation of any non-Sanskrit characters, such as:

– Tamil ழ் and Malayalam ഴ് – pronounced /ɻ/
– எ ఎ ಎ എ and ஒ ఒ ಒ ഒ – characters for the Dravidian short vowels /e/ and /o/
– Tibetan ཚ /t͡sʰ/ and ཞ /ʑ/
– characters in various scripts for the nasaliser (chandrabindu) ँ /̃/, and
– newly invented characters such as ऍ and ऑ, used to represent English /æ/ and /ɔ/

are out of the scope of this system.

Also, the IAST uses the romanisation ḷ for the character ऌ /l̩/ and its equivalents. This conflicts with ḷ as the romanisation for ळ /ɭ/, a character also used in Pali and Vedic Sanskrit. I couldn't find any info on whether an alternate, non-conflicting romanisation was provided for the latter.

In addition, if ever considered as a daily-life romanisation system for South Asian scripts, some may raise the following issues:

– how would the romanisation of scripts used for Indo-Aryan languages be affected by schwa deletion—a feature that is quite predictable in northern languages like Hindi, Punjabi et al, but not so much in southern ones like Marathi

– whether people can ‘adjust’ to the representation of च् /t͡ɕ/ = c, श् /ɕ ~ ʃ/ = ś and so on, since we are all ‘so used to’ च् /t͡ɕ/ = ch and श् /ɕ ~ ʃ/ = sh, due to English’s influence on our daily lives.


3) National Library at Calcutta System :: Truly national

The NLC romanisation system, issued in 1988, extends the IAST with the following characters:

a) எ ఎ ಎ എ – short /e/ (as opposed to long /eː/) = e
b) ஒ ఒ ಒ ഒ – short /o/ (as opposed to long /oː/) = o
c) ळ् ળ્ (ਲ਼੍) ଳ୍ ள் ళ్ ಳ್ ള് – /ɭ/ = ḷ
d) ற் ఱ్ ಱ్ റ് – alveolar /r, ɾ/ = ṟ
e) ன் – alveolar /n/ (as opposed to Tamil dental /n̪/) = ṉ
f) ழ் ഴ் – /ɻ/ = ḻ

The following modifications were made to existing characters in the IAST:

g) ए and its equivalents – /eː/ = ē
h) ओ and its equivalents – /oː/ = ō

that is, e and o without macrons represented their short versions only.

I have not been able to find clear-cut specifications on the romanisation of the following characters according to the NLC system:

i) ड़् /ɽ/ and ढ़् /ɽʱ/

An article on French Wikipedia says that the NLC romanisations for these characters are d̂ and d̂h. However, it also provides ṛ and ṛh as possible NLC romanisations, which conflicts with ऋ = ṛ. Specific sources for this info are not provided in the article.

I did find this link to the ISCII romanisation scheme, though. There is a mention of d̂ and d̂h in it, but no mention of whether they form part of the NLC system or not.

In addition, I remember seeing a reference to ऋ = r̥ and ड़् = ṛ a long time (almost 10 years) ago, but was unable to find it again on the internet.

j) Chandrabindu ँ /̃/ = ◌̃, m̐, n̐?

k) No mention of transliteration of scripts of Tibeto-Burman languages. Sinhalese also does not find a mention, although I would imagine that it was excluded as the NLC system was designed with ‘Indian’ languages in mind.


Other links:
NLC chart at IIT Madras ‘Acharya’ project


4) ISCII :: Information and script interchange

The Indian Script Code for Information Interchange (ISCII, also known as IS 13194) released in 1991 was mainly a coding scheme for computers and related devices, which had a system of ‘code points’ onto which equivalent letters from various Indic scripts were mapped. Thus, the following characters – क ক ਕ ક க క ಕ ക – were all mapped onto the same code point, as they were ‘equivalent’ characters, all representing the sound /k/.

By changing the script specification, say from Devanagari to Gurmukhi, the Devanagari character would be ‘transliterated’ into the equivalent Gurmukhi character. What actually would happen is that the code point would remain the same; only the ‘rendering’ would change as per the script specified.

In other words, the code points were the deep layer, and the script itself was the surface layer. Changing the surface layer would provide a ‘transliteration’ of a particular character or string of characters.
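Because Unicode largely preserved the ISCII layout within each Indic block, this deep-layer/surface-layer idea can still be demonstrated today: for the characters the scripts share, 'changing the rendering' amounts to a constant offset between Unicode blocks. The sketch below is a rough illustration of the principle, not a real transliterator – it produces wrong or unassigned code points precisely for the characters one script has and the other lacks.

```python
# Sketch: cross-script 'transliteration' as a Unicode block offset.
# This works only because Unicode largely preserves the ISCII layout
# within each Indic block; characters missing from the target script
# will map to wrong or unassigned code points.
BLOCK_START = {
    'devanagari': 0x0900,
    'gurmukhi':   0x0A00,
    'gujarati':   0x0A80,
    'tamil':      0x0B80,
}

def reskin(text: str, src: str, dst: str) -> str:
    """Re-render text from one Indic script to another by shifting
    each code point by the distance between the two Unicode blocks."""
    offset = BLOCK_START[dst] - BLOCK_START[src]
    return ''.join(chr(ord(ch) + offset) for ch in text)

print(reskin('क', 'devanagari', 'gurmukhi'))  # ਕ
print(reskin('क', 'devanagari', 'gujarati'))  # ક
```

The deep layer here is the position within the block (KA is always at offset 0x15); the surface layer is the block chosen.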

To make up for the lack of characters in Devanagari that would be equivalent to certain Dravidian-script characters, such as Tamil எ /e/, ஒ /o/, ன் /n/, ற் /r, ɾ/ and ழ் /ɻ/, ISCII introduced ‘invented’ Devanagari equivalents for these characters, namely ऎ, ऒ, ऩ्, ऱ् and ऴ् respectively.

According to the Wikipedia ISCII article (retrieved 2011-07-09):

“One motivation for the use of a single encoding is the idea that it will allow easy transliteration from one writing system to another. However, there are enough incompatibilities [to prove] that this is not really a practical idea.”

According to the same article:

“ISCII has not been widely used outside of certain government institutions and has now been rendered largely obsolete by Unicode.”

The article also states that Unicode “largely preserves the ISCII layout within each block”, seemingly a useful legacy of ISCII.

Speaking of transliteration, ISCII did provide a romanisation scheme as well (see ‘Other links’ below), making use of diacritical characters and based on the NLC system. Some of the main points were –

a) included romanisations for some obscure, Sanskrit-only characters such as ऌ /l̩/ and its equivalents = ḻ

b) transliterated Tamil ழ் and Malayalam ഴ് – /ɻ/ =

This obviously contradicts the NLC system, which romanises ழ் and ഴ் as ḻ.

Instead, ISCII uses ḻ as the romanisation for ऌ /l̩/ (see point a) above). I wasn't able to find enough references to throw light on this conflict.

c) transliterated ड़् /ɽ/ and ढ़् /ɽʱ/ (and its equivalents) as d̂ and d̂h respectively.

As mentioned in the previous section on the NLC system, I wasn’t able to find sufficient resources to verify whether this was a specification of the NLC system or an invention of the ISCII.

d) provided transliterations for certain Brahmic characters used to represent certain Perso-Arabic sounds such as /z/, /f/, /x/ and /ɣ/

e) did not include transliterations/romanisations for Brahmic scripts used for Tibeto-Burman languages


Other links:
ISCII overview


5) ISO 15919 :: From Pondicherry to Gangtok

The ISO 15919 standard, issued in 2001, is by far the most comprehensive romanisation standard for Brahmic scripts drawn up to date. It provides extensive information on not just romanisation, but also cross-transliteration from one Indic script into another.

It also covers the transliteration of certain Perso-Arabic characters into their equivalent Brahmic ones, and in doing so, makes a mention of their recommended romanisations as well, albeit with some restrictions.

The ISO 15919 is extremely detailed, and a site dedicated to explaining it can be found here.

A few salient points about the system are:

a) builds on the ISCII romanisation system

b) clarifies some conflicts in the IAST:

ड़् /ɽ/ and its equivalents = ṛ
ऋ /r̩/ and its equivalents = r̥
ळ् /ɭ/ and its equivalents = ḷ
ऌ /l̩/ and its equivalents = l̥

c) changes the ISCII romanisation of Tamil ழ் and Malayalam ഴ் – /ɻ/ – to ḻ

d) describes precisely how chandrabindu ँ /̃/ and its equivalents are to be romanised (including Gurmukhi bindi and tippi)

e) provides a romanisation for the Sinhalese script

f) deals with the romanisation of rarely used characters such as avagraha

g) provides (rather strangely, in my opinion) guidelines for the romanisation of Indic characters transliterated from Perso-Arabic-based scripts, not for the romanisation of the Perso-Arabic characters themselves.

A few things that seemed confusing to me are:

h) Sinhalese script ඇ /æ/ and ඈ /æː/ are romanised æ and ǣ respectively, where æ is a ligature of a and e, and ǣ is the same character with a macron above. However, Devanagari ऍ /æ ~ æː/ (also written अ‍ॅ) is romanised ê. If these characters have the same sound, then maybe they could have been romanised the same way?

i) Along the same lines, Bengali script অ‍্যা /æ ~ æː/ is romanised as a:yā, and not æ or ǣ

It’s possible that the logic behind this was to consider the characters/ligatures as historically different, and therefore to provide differing romanisations for them, irrespective of the fact that they have the same sound. After all, there are characters in different Brahmic scripts that have the same sound in modern times, but are romanised differently as their historical origins are different, such as Devanagari Hindi ज् and Eastern Nagari Bengali য্ (both pronounced /ʥ/).

This, however, means that ISO 15919, due to its emphasis on clarity and retraceability, sometimes loses out on aesthetics. For example, romanising অ‍্যাক্সিস ব্যাঙ্ক ‘Axis Bank’ as a:yāksisa byāṅka somehow seems ‘readable’ only in the Bengali script and not in the romanised form.

Of course, this is an extreme case where we’re considering the romanisation of words in the Bengali script which themselves are transcriptions of English words (axis & bank).

And for reasons of clarity and retraceability, there is of course no provision for schwa deletion in ISO 15919, which again means that using ISO 15919 in its current form as a ‘daily-life’ romanisation seems unfeasible, due to aesthetic considerations (readability) and the general prevailing trend of pronunciation-based loose romanisation.

Recall the renaming – if you can call it that – of Pondicherry to ‘Puducherry’, a hybrid, ad-hoc romanisation seemingly trying to incorporate phoneticity as well as readability. Under ISO 15919, it would be spelt putuccēri, after Tamil புதுச்சேரி /pud̪ɯʨʨeːri/.

While we debate Puducherry versus putuccēri, the jury is still out on སྒང་ཐོག་ /ɡŋtʰók/, since ISO 15919 too does not include the Tibetan script in its scope. For now, we’ll stick with romanising སྒང་ཐོག་ as ‘Gangtok’.


Existing Roman-script writing systems for South Asian languages

A few South Asian languages—such as Mizo, Konkani and Divehi—are already written, officially or unofficially, in the Roman script. Mizo is probably the only language having some level of recognition in India (official language of Mizoram) whose only script is the Roman script. Konkani is officially written in Devanagari, as decreed by the Government of Goa, but the Roman script is widely used for it and campaigns are ongoing for it to be recognised as an official script of Konkani (see my other article on Goan Place Names).

As Mizo was previously unwritten, its Roman script is ‘original’ and therefore cannot be called a transliteration. As for Konkani, its Roman-script system tends towards being a transcription, since there is no real one-to-one correspondence between it and Devanagari Konkani, and therefore it cannot be called a transliteration either.

Divehi—officially written in the Tana or Thaana script—has an official Roman transliteration system, with a one-to-one correspondence between particular Tana and Roman letters. Its aesthetics may be debated, but because it uses only the standard 26 letters of the Roman alphabet, with the apostrophe as its only diacritical mark, it is very easily reproducible. That, incidentally, was the intention behind its invention: it was designed for use on the Telex machines of the 1970s, which did not support the Tana script.

However, all these Roman script versions use varying sound-letter mappings for various languages, à la European languages written in the Roman script, and therefore have to be learnt individually.


Conclusion

No comprehensive and clear-cut romanisation system—either transliteration or transcription—for South Asian languages and scripts that is used as an academic as well as a daily-life standard—on the lines of Hanyu Pinyin for Mandarin or RR for Korean—seems to have yet emerged. It would, in my opinion, be highly useful to have such a standard for obvious purposes of convenience and clarity in information exchange, and also for (potentially) furthering literacy.

However, it’s inevitable that the makers of such a pan-South Asian standard will have an uphill task for the following reasons (among others) –

a) orthographic and phonetic fidelity often conflict in romanisation, i.e., preserving one often means losing the other.

b) as an extension of the previous point, there is the tricky problem of how to consistently deal with equivalent (strings of) characters that are pronounced differently in different languages.

e.g.
Devanagari (as applicable to Hindi) अरविन्द /ərʋɪn̪d̪/
Devanagari (as applicable to Marathi) अरविंद /ɤ̞rᵊwin̪d̪ᵊ/
Eastern Nagari (as applicable to Bengali) অরবিন্দ /ɔrobin̪d̪o/
Eastern Nagari (as applicable to Assamese) অৰবিন্দ /ɔrɔbindɔ/

c) choice of particular characters or signs might aid legibility for one script or language, and hinder it for another.
e.g. the choice of ē and ō for long /eː/ and long /oː/ respectively is a logical choice for Dravidian scripts and languages, to differentiate them from short e /e/ and short o /o/, but is mostly redundant for scripts for Indo-Aryan languages, which usually do not have short /e/ and /o/.

Maintaining the macron above these letters for scripts of Indo-Aryan languages will result in needless orthographic clutter, while eliminating the macron will lead to orthographic inconsistency with romanisations for other scripts, such as those for Dravidian languages.

d) the very choice of which scripts (and languages) are to be covered.

A Sanskrit quote which, in my opinion, sums up the situation very succinctly (romanisation in IAST, ‘standard’ pronunciation in IPA):

अमन्त्रमक्षरं नास्ति नास्ति मूलमनौषधम्‌ ।
अयोग्यः पुरुषो नास्ति योजकस्तत्र दुर्लभः॥

amantramakṣaraṃ nāsti nāsti mūlamanauṣadham
ayogyaḥ puruṣo nāsti yojakastatra durlabhaḥ

/əman̪t̪rəməkʂərə̃m naːs̪t̪i naːs̪t̪i muːləmənəwʂəd̪ʱə̃m
ajoːɡjəɦə puruʂoː naːs̪t̪i joːʥəkəs̪t̪ət̪rə d̪urləbʱəɦə/

“There is no syllable not a mantra, no plant not medicinal,
there is no person unworthy; what is lacking is an ‘enabler’”


Other links:
‘Vagaries of Transliteration’ at the IIT Madras ‘Acharya’ project website


P.S.: The words “khayaal aapka” in the title of this post are the tagline of the current ICICI Bank ad campaigns. Their ‘correct’ pronunciation is /xjaːl aːpkaˑ/, and they roughly mean “thinking of you” or “caring for you” in Hindi/Urdu. The words have been romanised in an ad-hoc manner from Devanagari Hindi ख़्याल आपका and Urdu خیال آپ کا.


DISCLAIMER:

This article is by no means supposed to be a comprehensive or scholarly work on the topic of South Asian transliteration and romanisation. There may be errors, and also many related areas and topics which have been left uncovered, either unintentionally or intentionally. This article has been written purely out of a personal interest in the topic, and as such I welcome any corrections/additions/criticisms regarding it.


Updated 2011-07-11
