Adding Middle Names to Chinese Names

A problem I've run into several times in my academic career is that when a Chinese person's name is mentioned in English, it can be very hard to find them by searching. For example, I was trying to search for a person named "Wang Yang".

However, Google Scholar lists 83 (!) people with the exact name Yang Wang. Many of them work in overlapping subfields; for example, many are computer scientists. Google Scholar also lists 80 people with the name Chang Liu and 50 people with the name Chen Xin. It's usually possible to find someone by adding some keywords to the search, but this has some extra problems:

  • Early career scientists will have few published materials about them

  • People may want to change their exact area and still be searchable

  • It still doesn't work well because some fields are very popular, like machine learning

  • Searchers are lazy and might give up if they can't find the person they're looking for

This isn't a hypothetical problem - several Chinese candidates who interviewed at my own lab were people I couldn't find by googling their name. In some cases it took several false positives and cross-checking before I could find the right person.

Why are there so many overlapping Chinese names? Perhaps there are multiple reasons, but one reason is that a single English spelling of a syllable can correspond to many different Chinese characters. To give an example, Wikipedia gives these examples for "Wang Yang": 王扬 and 汪洋.

Chinese names are almost always one character (one syllable) for the last name, and one or two characters (one or two syllables) for the given name. Many English names have more syllables than this, which makes the disambiguation problem less severe.

There is a good amount of information loss when going from the character to the English spelling, but there is even some information loss when going from the pronunciation (as written in pinyin) to the English spelling, as pinyin also marks the tone, which the English spelling drops. Mandarin has four numbered tones: (1) high, (2) rising, (3) falling-rising (dipping), and (4) falling (there is also a neutral tone that I'll ignore).

Many Chinese names have the same English spelling but different characters and different tones. The same tone can go with different characters, but different tones imply different characters. For example, with "Wang Yang" mentioned before, one way of writing it is Wang1 Yang2 (汪洋) and the other is Wang2 Yang2 (王扬).
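To make the information loss concrete, here is a small sketch in Python. It assumes names are given in the numbered notation used above (a syllable followed by a tone digit); the helper names `split_tone` and `romanize` are made up for this illustration.

```python
import re


def split_tone(syllable):
    """Split a numbered syllable such as 'Wang1' into ('Wang', 1)."""
    match = re.fullmatch(r"([A-Za-z]+)([1-4])", syllable)
    if match is None:
        raise ValueError(f"expected letters followed by a tone 1-4: {syllable!r}")
    return match.group(1), int(match.group(2))


def romanize(numbered_name):
    """Drop the tone digits, which is what the usual English spelling does."""
    return " ".join(split_tone(s)[0] for s in numbered_name.split())


def tones(numbered_name):
    """Return only the tone numbers, in order."""
    return [split_tone(s)[1] for s in numbered_name.split()]


for name in ["Wang1 Yang2", "Wang2 Yang2"]:
    # Both names print the same spelling but different tone lists.
    print(romanize(name), tones(name))
```

Both inputs collapse to the same spelling, and the tone list is exactly the part that gets thrown away.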

My modest proposal is to convert the tones of the characters into a middle name, which could either be written in full or abbreviated. An attractive feature of this proposal is that it makes it *easier* for someone reading the name in English to figure out how to write it in Chinese, as the tone disambiguates the character to some extent. Another nice property is that the middle name is a deterministic function of the Chinese name, so it could be easier to introduce and drop than a nickname. It is perhaps also less alienating than having to pick an English nickname like "John" or "Steven".

Middle names are ubiquitous in English, and it's well understood that they can usually be omitted but can also be used to disambiguate. A simple example: the two US presidents with the name George Bush. One is George H.W. Bush (Herbert Walker) and one is George W. Bush (Walker).

The way this would work is that the middle name would consist of the tone numbers of the characters, in order, spelled out as Chinese numerals: yi (1), er (2), san (3), si (4). I think that yiyi and erer sound strange, so if the same tone is used twice in a row, the pair would be written as "yiliang" or "erliang" (etc.), as liang means "couple".

So for example Wang Yang could become: Wang Yier Yang or Wang Erliang Yang, with the abbreviations Wang Y.E. Yang and Wang E.L. Yang.
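The rule can be sketched in a few lines of Python. The names `TONE_SYLLABLES` and `tone_middle_name` are made up for this illustration, and the handling of three equal tones in a row (pair up the first two, then start over) is an assumption read off the three-character list further down, not something spelled out as a rule.

```python
# Chinese numerals for tones 1-4, as proposed above.
TONE_SYLLABLES = {1: "yi", 2: "er", 3: "san", 4: "si"}


def tone_middle_name(tones):
    """Return (full, abbreviated) middle name for a sequence of tones 1-4."""
    syllables = []
    previous = None  # the tone a following repeat would pair up with
    for tone in tones:
        if tone == previous:
            syllables.append("liang")
            previous = None  # pair complete; a third repeat starts afresh
        else:
            syllables.append(TONE_SYLLABLES[tone])
            previous = tone
    full = "".join(syllables).capitalize()
    abbreviated = ".".join(s[0].upper() for s in syllables) + "."
    return full, abbreviated


for tones in ([1, 2], [2, 2]):
    full, short = tone_middle_name(tones)
    print(f"Wang {full} Yang / Wang {short} Yang")
```

Because the abbreviation keeps only the first letter of each syllable, it carries less information than the full middle name.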

One issue here is that san and si both start with "s", so the abbreviation is ambiguous. However, I don't see a nice way of fixing this that doesn't make things worse or counterintuitive, like using "A" for san.

For two-character names, all the possibilities are:

Yiliang, Yier, Yisan, Yisi

Eryi, Erliang, Ersan, Ersi

Sanyi, Saner, Sanliang, Sansi

Siyi, Sier, Sisan, Siliang

For three-character names, the possibilities that start with Yi are:

Yiliangyi, Yilianger, Yiliangsan, Yiliangsi, Yieryi, Yierliang, Yiersan, Yiersi, Yisanyi, Yisaner, Yisanliang, Yisansi, Yisiyi, Yisier, Yisisan, Yisiliang

--

As an alternative to using yi, er, san, and si for the numbering, we could use the first four of the ten heavenly stems for the numbering: jia, yi, bing, and ding. This would give the two character combinations:

jialiang, jiayi, jiabing, jiading

yijia, yiliang, yibing, yiding

bingjia, bingyi, bingliang, bingding

dingjia, dingyi, dingbing, dingliang
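Tables like this are mechanical to generate, which makes it easy to try out other syllable sets. A minimal sketch, where `two_character_table` is a made-up helper name and the input is any four syllables standing for tones 1 to 4:

```python
STEMS = ("jia", "yi", "bing", "ding")      # heavenly stems for tones 1-4
NUMERALS = ("yi", "er", "san", "si")       # numerals for tones 1-4


def two_character_table(syllables):
    """Rows of middle names, one row per first tone, one column per second tone."""
    return [
        [first + ("liang" if i == j else second)
         for j, second in enumerate(syllables)]
        for i, first in enumerate(syllables)
    ]


for row in two_character_table(STEMS):
    print(", ".join(row))
```

Passing `NUMERALS` instead of `STEMS` gives the yi/er/san/si table from earlier.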

These sound a bit better to me, but this might just be a question of taste.

--

How much does this reduce the information loss problem? I think it reduces it enough to make these Chinese names much easier to find. As an upper bound on performance, we can assume that the names are evenly distributed over the four tones. Then for two-character names, the number of overlapping names should drop by a factor of 4*4 = 16. In the case of "Wang Yang" on Google Scholar, this would reduce an unwieldy 83 results to a much more manageable 83/16 ≈ 5.2 results on average, which easily fits on a single page. This is an optimistic estimate (since in practice names will cluster on some tone combinations), but even then, it may only come to 10 results, which is much easier to disambiguate.
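The arithmetic can be sketched as follows. This assumes each character's tone is drawn independently from the same distribution, so two random people agree on one character's tone with probability equal to the sum of the squared tone probabilities. The function name `expected_matches` is made up, and the skewed distribution is invented purely to show the direction of the effect; it is not measured data.

```python
def expected_matches(total, tone_probs, characters=2):
    """Rough expected number of results left after filtering by tone pattern.

    Assumes independent tones per character, so the chance that a random
    namesake shares all tones is (sum of p*p) ** characters.
    """
    same_tone = sum(p * p for p in tone_probs)
    return total * same_tone ** characters


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.1, 0.5, 0.1, 0.3]  # made-up numbers, not real tone frequencies

print(expected_matches(83, uniform))   # 83 / 16, the even-tones case
print(expected_matches(83, skewed))    # larger than the uniform case
```

Any departure from an even split raises the sum of squares above 1/4, which is why the even split is the best case.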

The situation is even better with three-character names, with a disambiguation factor of 4*4*4 = 64. However, they don't need it as much, since the extra character already makes collisions less likely.

The nice thing about this is that it neither changes the underlying name nor adds some arbitrary nickname. Academics are naturally wary of changing their names because it can make their work harder to find or disambiguate later. I know some academics who have even left misspellings in their name to make their old work easier to find. The other nice thing is that middle names are a well known and ubiquitous pattern in Western names. So if a search system can't keep middle names straight, it will have a problem with Westerners too. Another nice thing is that the middle names are themselves made of real Chinese characters, so they could easily be written and pronounced in Chinese.