
Personal Names

Personal names are not used in the same way around the world, and some things taken for granted in the West have no counterpart elsewhere. As well as variations due to married names, alternative spellings, nicknames, spellings in alternative languages, optional name parts, and stage names, the very structure of a name may vary, leaving it with little uniqueness and no obvious mapping onto the Western given-name/middle-name/surname concepts. An in-depth discussion of the issues may be found under Worldwide Family History Data.

 

The handling of personal names needs to separate the acceptance and matching of name variants from the generation of canonical names during output. Both must also support the temporal dependencies of those names (e.g. changes on marriage, adoption, etc.) and potential overlaps of those time periods.

 

As a generic approach, STEMMA provides a prioritised set of patterns to match. A 'full name' is defined by a list of possible 'token sequences'. These are listed in priority order, which determines the order in which they should be tested. Each 'token sequence' is an ordered set from the following token types:

 

name              - simple name token, e.g. Tony

{name, ...}      - mandatory selection from alternative tokens

[name, ...]       - optional selection from alternative tokens

 

The following example might belong to someone called Grace Ann Murphy who doesn't always use her middle name and sometimes goes as Gracie. However, she's Irish and also has an Irish version of her name. This would require the following two 'token sequences':

 

{Grace,Gracie} [Ann] Murphy

Gráinne [Ann] Ní Murchú
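
The matching of these two sequences can be sketched in a few lines of Python. This is only an illustration and not part of STEMMA: the names Slot, matches, and first_match are hypothetical, and tokens are compared exactly, so the relaxed comparison discussed later is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """One position in a token sequence: a set of alternatives, possibly optional."""
    alternatives: tuple
    optional: bool = False


def matches(sequence, tokens):
    """Return True if the name tokens match the whole sequence of slots."""
    def walk(s, t):
        if s == len(sequence):
            return t == len(tokens)  # every slot used; no tokens may be left over
        slot = sequence[s]
        # Try to consume the next token with this slot.
        if t < len(tokens) and tokens[t] in slot.alternatives and walk(s + 1, t + 1):
            return True
        # Otherwise the slot may be skipped only if it is optional.
        return slot.optional and walk(s + 1, t)
    return walk(0, 0)


def first_match(sequences, tokens):
    """Test the sequences in priority order; return the index of the first match."""
    for index, sequence in enumerate(sequences):
        if matches(sequence, tokens):
            return index
    return None


# The two sequences from the example above, in priority order.
GRACE = [
    [Slot(("Grace", "Gracie")), Slot(("Ann",), optional=True), Slot(("Murphy",))],
    [Slot(("Gráinne",)), Slot(("Ann",), optional=True), Slot(("Ní",)), Slot(("Murchú",))],
]

print(first_match(GRACE, ["Gracie", "Murphy"]))
print(first_match(GRACE, ["Gráinne", "Ann", "Ní", "Murchú"]))
```

Because the comparison here is exact, a lookup for "Grace Anne Murphy" would fail; handling that kind of spelling variation is left to the software, as discussed next.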

 

An interesting issue here concerns the variations of individual name parts. In this example, Grace accepts "Gracie" as an informal version of her forename. However, the difference between Ann and Anne is more likely to be a spelling error, introduced during recording, transcription, or a subsequent lookup. This should be handled by the software, just as a Soundex match might be. The same could apply to the use of a middle initial, which is a very Western convention.

 

Such patterns are stored in STEMMA using the following elements:

 

NAME_VARIANTS=

 

<Names>

<Sequences [RANGE_FROM] [RANGE_TO]>

<Canonical [Style='name-style']> canonical-name </Canonical> ...

<Sequence [Type='name-type'] [NAME_ATTRIBUTE] ... >

<Tokens [Optional='boolean']>

{ <Token> name-token-ucf-text </Token> } ...

</Tokens>

[ NARRATIVE_TEXT ] ...

</Sequence> ...

</Sequences> …

</Names>

 

RANGE_FROM=

 

AfterEvent='key' | FromEvent='key' | After='std-date' | From='std-date'

 

RANGE_TO=

 

BeforeEvent='key' | UntilEvent='key' | Before='std-date' | Until='std-date'

 

NAME_ATTRIBUTE=

 

Language='code' | Phonetic='boolean' | Romanised='boolean'

 

As with Event constraints, After is >, From is >=, Before is <, and Until is <=.

 

The default setting for the Optional attribute is ‘0’ (i.e. False). The optional Event range attributes allow the applicability of a set of sequences to be constrained by relevant Events. If they are omitted, the sequences are always valid. A typical use of these attributes is to differentiate maiden names from married names, but they are applicable to any type of name change. During name matching, it is recommended that the Event range attributes are ignored in order to provide a more relaxed operation. However, when deriving a Person’s full formal name, they should be honoured, and applied in the order they are written, in case there is any overlap due to fuzzy Event dates.
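
The range comparisons can be illustrated with a small Python sketch. It is a simplification under stated assumptions: Event keys are taken to be already resolved to plain dates, the function names in_range and canonical_on are hypothetical, and the marriage date and the married name "Kelly" are invented purely for the example.

```python
from datetime import date


def in_range(when, after=None, from_=None, before=None, until=None):
    """Apply the constraints: After is >, From is >=, Before is <, Until is <=."""
    if after is not None and not when > after:
        return False
    if from_ is not None and not when >= from_:
        return False
    if before is not None and not when < before:
        return False
    if until is not None and not when <= until:
        return False
    return True  # no constraints at all means always valid


def canonical_on(name_sets, when):
    """Return the canonical name of the first set, in written order, valid on a date."""
    for name_set in name_sets:
        if in_range(when, **name_set["range"]):
            return name_set["canonical"]
    return None


MARRIAGE = date(1900, 6, 1)  # hypothetical date of a marriage Event

NAME_SETS = [
    {"canonical": "Grace Ann Murphy", "range": {"before": MARRIAGE}},
    {"canonical": "Grace Ann Kelly", "range": {"from_": MARRIAGE}},  # invented married name
]

print(canonical_on(NAME_SETS, date(1899, 12, 31)))
print(canonical_on(NAME_SETS, MARRIAGE))
```

Taking the first valid set in written order is what resolves any overlap between ranges when deriving a formal name. A matching routine, by contrast, would simply ignore the ranges.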

 

The name-attributes identifying the language, or whether the representation is phonetic, etc., probably need some clarification. There are a number of terms that often aren’t distinguished as well as they should be:

 

  • Transcription - Systematic representation of language (either spoken or prior textual form) in written form. May be phonetic transcription (mapping sounds) or orthographic transcription (mapping spoken words).

 

  • Translation - Conversion of a source language to a target language. Deals with the meaning expressed by the language.

 

 

  • Romanisation (or Latinisation) - Representation of language (either written or spoken) using the Latin script. May use transliteration for written text, or transcription for spoken words.

 

The name-style may be one of Formal, SemiFormal (default), Informal, and Listing, where ‘Listing’ is for sorting and collation purposes (e.g. Proctor, Anthony Charles). See Extended Vocabularies for defining custom name styles. Q: Do we need a sorts-as attribute for cases where name particles appear in an index but are invisible to the sorting process?

 

Q: Do we need to identify a subset of the tokens in a canonical name for highlighting as a surname, or family name, in software products? Note that a blind approach to marking tokens for highlighting avoids all the pitfalls associated with the rigorous categorisation of all name tokens.

 

The name-type may be one of the following. Again, see Extended Vocabularies for defining custom name types.

 

  • Alias – General pseudonym, including also-known-as, nom de plume, pen name, and nom de guerre. Some cases may have a specific type available for them.
  • Married – Name adopted after a marriage ceremony, or other type of union.
  • Nickname – Informal alias.
  • Personal (default) – Normal personal name.
  • Petname – Hypocorism. A term of endearment used in more intimate circumstances.
  • Private – For cases where a personal name is only used within certain circles, as with some Native American tribes.
  • Professional – Includes stage name.
  • Public – Some Native American tribes distinguish a private name, used within their own tribe, from a public name used outside of it.

 

In our example, Grace Murphy would be stored as:

 

<Names>

<Sequences>

<Canonical>Grace Ann Murphy</Canonical>

<Sequence>

<Tokens>

<Token>Grace</Token>

<Token>Gracie</Token>

</Tokens>

<Tokens Optional='1'>

<Token>Ann</Token>

</Tokens>

<Tokens>

<Token>Murphy</Token>

</Tokens>

</Sequence>

<Sequence Language='gle'>

<Tokens>

<Token>Gráinne</Token>

</Tokens>

<Tokens Optional='1'>

<Token>Ann</Token>

</Tokens>

<Tokens>

<Token>Ní</Token>

</Tokens>

<Tokens>

<Token>Murchú</Token>

</Tokens>

</Sequence>

</Sequences>

</Names>

 

This approach would be familiar to anyone with some knowledge of computer-language parsers. The interpretation of the tokens as given names, etc., might be done by a genealogical product but it is not inherent in the stored data.
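
As a rough illustration of how software might read such a block, the following Python sketch loads the stored form into a simple list of slots using the standard xml.etree.ElementTree module. The function name load_sequences and the shape of the returned dictionaries are assumptions made for this example, as is the decision to accept both '1' and 'true' as a true boolean. Note that the attribute values must use plain ASCII quotes for the XML to parse.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the Grace Murphy example, using plain ASCII quotes.
GRACE_XML = """
<Names>
  <Sequences>
    <Canonical>Grace Ann Murphy</Canonical>
    <Sequence>
      <Tokens><Token>Grace</Token><Token>Gracie</Token></Tokens>
      <Tokens Optional='1'><Token>Ann</Token></Tokens>
      <Tokens><Token>Murphy</Token></Tokens>
    </Sequence>
    <Sequence Language='gle'>
      <Tokens><Token>Gráinne</Token></Tokens>
      <Tokens Optional='1'><Token>Ann</Token></Tokens>
      <Tokens><Token>Ní</Token></Tokens>
      <Tokens><Token>Murchú</Token></Tokens>
    </Sequence>
  </Sequences>
</Names>
"""


def load_sequences(xml_text):
    """Return one dictionary per <Sequence>, in document (i.e. priority) order."""
    root = ET.fromstring(xml_text)
    result = []
    for sequences in root.iter("Sequences"):
        canonical = [c.text.strip() for c in sequences.findall("Canonical")]
        for sequence in sequences.findall("Sequence"):
            slots = []
            for tokens in sequence.findall("Tokens"):
                names = [t.text.strip() for t in tokens.findall("Token")]
                # Optional defaults to '0' (False); accepting 'true' is an assumption.
                optional = tokens.get("Optional", "0") in ("1", "true")
                slots.append((names, optional))
            result.append({
                "type": sequence.get("Type", "Personal"),  # Personal is the default
                "language": sequence.get("Language"),
                "canonical": canonical,
                "slots": slots,
            })
    return result


for entry in load_sequences(GRACE_XML):
    print(entry["language"], entry["slots"])
```

Nothing in the loaded structure labels a token as a given name or surname, which reflects the point above that such interpretation is not inherent in the stored data.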

 

Character matching should be relaxed here, as for Place names. The most obvious case of this, for people using a language written in the Latin script, is a "case-blind" match. In other Western locales, the next most common relaxed match is an "accent-blind" one, which means treating, say, A-acute the same as A. This is common in locales where accents are routinely dropped in uppercase text. There are also characters whose uppercase and lowercase representations differ greatly. For instance, the German lowercase sharp s (known as eszett) in "straße" usually, though not always, uppercases to "SS", i.e. "STRASSE". Finally, there are symbols with both "composed" forms (i.e. one Unicode character) and "decomposed" forms (i.e. two or more Unicode characters). For instance, the following should all be treated the same:

212B (Å) ANGSTROM SIGN
00C5 (Å) LATIN CAPITAL LETTER A WITH RING ABOVE
0041 (A) LATIN CAPITAL LETTER A + 030A (°) COMBINING RING ABOVE

Unicode makes specific recommendations about which composed and decomposed forms should be equivalent:
http://www.unicode.org/reports/tr15/.

 

In summary, any pair of tokens being compared must both be normalised to a “flattened” form that treats each of these categories as equivalent. Only the normalised forms should then be directly compared.
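
One possible way to produce such a flattened form is sketched below using Python's standard unicodedata module. The function name flatten is hypothetical, and the particular recipe (compatibility decomposition, case folding, then removal of combining marks) is just one reasonable choice rather than anything mandated by STEMMA. Stripping all accents may be too relaxed for some locales.

```python
import unicodedata


def flatten(token):
    """Return a case-blind, accent-blind, composition-blind form of a token."""
    # Decompose first so that composed and decomposed spellings converge.
    decomposed = unicodedata.normalize("NFKD", token)
    # casefold() is a stronger lower(): for example it maps the eszett to "ss".
    folded = unicodedata.normalize("NFKD", decomposed.casefold())
    # Drop combining marks (accents, rings, etc.) left behind by the decomposition.
    return "".join(ch for ch in folded if not unicodedata.combining(ch))


def same_token(a, b):
    """Compare two tokens using only their flattened forms."""
    return flatten(a) == flatten(b)


# The three representations listed above all flatten to the same string.
print(flatten("\u212B") == flatten("\u00C5") == flatten("A\u030A"))
print(same_token("straße", "STRASSE"))
print(same_token("Gráinne", "GRAINNE"))
```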

 

A final note on tokenisation of a name prior to applying the name-matching algorithm: certain punctuation characters, e.g. apostrophe and hyphen, should be used to separate the tokens but should not be present during the matching. Hence, Henri Cartier-Bresson should be tokenised as the set [Henri, Cartier, Bresson]. An exception to this might be the period, which would have to be retained. Hence, James O. O'Seven would be tokenised as the set [James, O., O, Seven] to ensure the initial is distinct from a single-character token. See Worldwide Family History Data for further discussion.
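
A minimal sketch of this tokenisation in Python follows. The name tokenise and the exact set of separator characters are assumptions: whitespace, hyphen, and both the plain and typographic apostrophe are treated as separators, while the period is kept as part of its token. The hyphenated name in the second call is invented for the example.

```python
import re

# Whitespace, apostrophes (plain and typographic), and hyphens separate tokens.
# The period is deliberately not listed, so an initial such as "O." keeps its dot.
SEPARATORS = re.compile(r"[\s'\u2019\-]+")


def tokenise(name):
    """Split a full name into tokens, discarding the separator characters."""
    return [token for token in SEPARATORS.split(name) if token]


print(tokenise("James O. O'Seven"))   # ['James', 'O.', 'O', 'Seven']
print(tokenise("Mary-Ann Murphy"))    # ['Mary', 'Ann', 'Murphy']
```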
