Text Encoding

The ARTFL Project recommends that all users encode their texts following the Text Encoding Initiative's (TEI) TEI Lite encoding scheme.

PhiloLogic is known to handle more than TEI Lite encoding, as well as variations in metadata, but it has not been extensively tested on very heavily encoded documents.

PhiloLogic Specific Encoding

For optimal functionality under PhiloLogic we recommend the following specifications:

  • The TEI Header
    Below is an example of a valid TEI Header known to run under PhiloLogic:

    <!DOCTYPE TEI.2 SYSTEM "teixlite.dtd">
                <title>TITLE of Electronic Resource</title>
                <author>AUTHOR of Electronic Resource</author>
                <sponsor>SPONSOR of Electronic Resource</sponsor>
                <funder>FUNDER of Electronic Resource</funder>
                <principal>PRINCIPAL RESEARCHER of Electronic Resource</principal>
                   <resp>STATEMENT OF RESPONSIBILITY for Electronic Resource</resp>
                <edition>EDITION of Electronic Resource <date>DATE</date> </edition>
             <extent>EXTENT of Electronic Resource</extent>
                <publisher>PUBLISHER of Electronic Resource</publisher>
                <idno>UNIQUE IDENTIFIER</idno>
                <distributor>DISTRIBUTOR of Electronic Resource</distributor>
                   <p>COPYRIGHT of Electronic Resource</p>
                <title>SERIES TITLE (to which Electronic Resource belongs)</title>
                   <resp>STATEMENT OF RESPONSIBILITY for SERIES</resp>
                <idno>UNIQUE IDENTIFIER of SERIES</idno>
                   <author>AUTHOR of SOURCE DOCUMENT <date>AUTHOR DATES</date> </author>
                   <title>TITLE of SOURCE DOCUMENT</title>
                   <editor>EDITOR of SOURCE DOCUMENT</editor>
                   <extent>EXTENT (page range) of SOURCE DOCUMENT</extent>
                      <pubPlace>PLACE of PUBLICATION for SOURCE DOCUMENT</pubPlace>
                      <publisher>PUBLISHER of SOURCE DOCUMENT</publisher>
                      <date>DATE of PUBLICATION for SOURCE DOCUMENT</date>
                <p>PROJECT DESCRIPTION (Encoding of SOURCE DOCUMENT)</p>
                <p>SAMPLING of TEXTS (for Corpus/Collection)</p>
                <p>CORRECTIONS to SOURCE DOCUMENT</p>
                <taxonomy id="genre">
                <taxonomy id="authorgender">
                      <catDesc>Author Gender</catDesc>
                <taxonomy id="period">
                <date>CREATION DATE of SOURCE DOCUMENT</date>
                   <addrLine>PLACE of CREATION</addrLine>
                <language>LANGUAGE of SOURCE DOCUMENT</language>
                <keywords scheme="genre">
                <keywords scheme="authorgender">
                <keywords scheme="period">

  • Notes

    Any note (end, foot, margin, etc.) occurring in the text should be coded in this manner:

      <ref type="note" id="ref1" target="n1" n="1"/>

    Here id="xxx" and target="xxx" are unique identifiers, and n="x" is the note reference as it appears in the text (usually a superscript numeral or an *). The id and target of an element must not be the same. By convention, we use a letter prefix (n or r) to distinguish them, e.g. id="r1" for refs and id="n1" for notes.

    In the "Notes" section of the document, this same note would appear as follows:

      <div1 type="notes">
      <pb n="nts"/>
      <note id="n1" place="foot" target="ref1" resp="Author">1 TEXT OF NOTE</note>

  • Internal Cross References

    Cross references to textual objects (sections, chapters, etc.) should be coded in this manner:

    The Object itself:

      <div2 type="Chapter" id="c2">

    The Reference to the object:

      <ref type="cross" target="c2">See chapter 2</ref>

    Note that both id="xxx" and target="xxx" use the same unique identifier.

  • Images in the Text

    References to images embedded in the text should be coded in this manner:

    <figure n="filename.ext">
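
The note and cross-reference conventions above can be checked mechanically before a text is loaded. The following is a minimal sketch in Python and is not part of PhiloLogic: the function name check_links is hypothetical, and it assumes the document is well-formed XML that the standard library parser can read. It reports duplicate id values, ref elements whose target does not exist, and note references whose note does not point back to them.

```python
import xml.etree.ElementTree as ET


def check_links(xml_text):
    """Return a list of problems with ref/note/div identifiers.

    Hypothetical helper, not part of PhiloLogic. Assumes well-formed XML.
    """
    root = ET.fromstring(xml_text)
    problems = []

    # Collect every id in the document and flag duplicates.
    ids = {}
    for elem in root.iter():
        ident = elem.get("id")
        if ident is None:
            continue
        if ident in ids:
            problems.append(f"duplicate id: {ident}")
        ids[ident] = elem

    # Every <ref> must point at an existing id.
    for ref in root.iter("ref"):
        target = ref.get("target")
        if target not in ids:
            problems.append(f"ref target not found: {target}")
            continue
        if ref.get("type") == "note":
            note = ids[target]
            # A note reference and its note must point at each other.
            if note.tag != "note":
                problems.append(f"ref target is not a note: {target}")
            elif note.get("target") != ref.get("id"):
                problems.append(
                    f"note {target} does not point back to {ref.get('id')}"
                )
    return problems


# Sample built from the fragments shown in this section.
sample = """<text>
<div1 type="body">
<div2 type="Chapter" id="c2">
<p>Text<ref type="note" id="ref1" target="n1" n="1"/>
<ref type="cross" target="c2">See chapter 2</ref></p>
</div2>
</div1>
<div1 type="notes">
<note id="n1" place="foot" target="ref1" resp="Author">1 TEXT OF NOTE</note>
</div1>
</text>"""

print(check_links(sample))
```

An empty list means every reference resolved. The sample wraps the fragments in a made-up text element only so that the parser has a single root; a real document would be read from a file instead.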

PhiloLogic using other encoding schemes

Currently PhiloLogic is known to run coherently on databases encoded using the following schemes:

  • MEP - The Model Editions Partnership (Example: The Sanger Archive in our Sample Databases)
  • CES - Corpus Encoding Standard (Example: BBC Urdu Sample - Restricted access)
  • ATE - ARTFL Text Encoding (Examples forthcoming). This is HTML, Dublin Core, and optional extensions for pages, notes, sentences, and the like. We specify a small subset of HTML that PhiloLogic actually acts on, and loading requires proper use of the <h1> to <hN> heading tags. PhiloLogic is known to load arbitrary HTML, but your mileage may vary. To load ATE and documents that look like ATE, run philoload DBNAME texttype=ate and set TextType in philo-db.cfg to ate.

  • DocBook. Proof of concept only. We loaded the only three samples of literary texts we were able to find. The loader and system could easily be expanded to handle most of DocBook if there is demand. We are not sure that text analysis of technical documentation, the primary use of DocBook, is all that worthwhile. Load and configure with texttype=docbook.
  • Plaintext. Tested on Gutenberg (plaintext) and Liberliber documents. Important caveat: input data files MUST be converted to UTF-8 before loading. Load and configure with texttype=plaintext. The loader will try to identify paragraphs, Gutenberg headers and trailers (available but not indexed for searching), "chunkify" the document into reasonable portions, and extract Author/Title info from Gutenberg files.
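
Because plaintext input must be UTF-8 before loading, older files usually need a conversion step first. The following is a minimal sketch in Python, not part of PhiloLogic: the function name to_utf8 is hypothetical, and it assumes you already know the source encoding (Latin-1 below is only an example). It does not try to detect the encoding.

```python
def to_utf8(raw_bytes, source_encoding="latin-1"):
    """Re-encode bytes from a known source encoding to UTF-8.

    Hypothetical helper; source_encoding must be supplied by the caller,
    this sketch does not attempt to guess it.
    """
    try:
        # Input that already decodes as UTF-8 is returned unchanged.
        raw_bytes.decode("utf-8")
        return raw_bytes
    except UnicodeDecodeError:
        return raw_bytes.decode(source_encoding).encode("utf-8")


# "Liberté" stored as Latin-1 bytes, as an older plaintext file might hold it.
latin1_text = "Libert\u00e9".encode("latin-1")
print(to_utf8(latin1_text))
```

To convert a file, read it in binary mode, pass the bytes through the function, and write the result back in binary mode. Note that a few non-UTF-8 byte sequences happen to decode as valid UTF-8, so it is worth spot-checking accented characters in the output.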