Text Encoding

The ARTFL Project recommends that all users encode their texts following the Text Encoding Initiative's (TEI) TEI Lite encoding scheme.

See the TEI Lite documentation for a comprehensive description of the scheme.

PhiloLogic is known to handle more than TEI Lite encoding, as well as variations in metadata, but it has not been extensively tested on very heavily encoded documents.

PhiloLogic Specific Encoding

For optimal functionality under PhiloLogic we recommend the following specifications:

The TEI Header
Below is an example of a valid TEI Header known to run under PhiloLogic:
<!DOCTYPE TEI.2 SYSTEM "teixlite.dtd">
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>TITLE of Electronic Resource</title>
        <author>AUTHOR of Electronic Resource</author>
        <sponsor>SPONSOR of Electronic Resource</sponsor>
        <funder>FUNDER of Electronic Resource</funder>
        <principal>PRINCIPAL RESEARCHER of Electronic Resource</principal>
        <respStmt>
          <resp>STATEMENT OF RESPONSIBILITY for Electronic Resource</resp>
          <name>NAME</name>
        </respStmt>
      </titleStmt>
      <editionStmt>
        <edition>EDITION of Electronic Resource <date>DATE</date> </edition>
      </editionStmt>
      <extent>EXTENT of Electronic Resource</extent>
      <publicationStmt>
        <publisher>PUBLISHER of Electronic Resource</publisher>
        <address>
          <addrLine>ADDRESS</addrLine>
        </address>
        <date>DATE</date>
        <idno>UNIQUE IDENTIFIER</idno>
        <distributor>DISTRIBUTOR of Electronic Resource</distributor>
        <availability>
          COPYRIGHT of Electronic Resource
        </availability>
      </publicationStmt>
      <seriesStmt>
        <title>SERIES TITLE (to which Electronic Resource belongs)</title>
        <respStmt>
          <resp>STATEMENT OF RESPONSIBILITY for SERIES</resp>
          <name>NAME</name>
        </respStmt>
        <idno>UNIQUE IDENTIFIER of SERIES</idno>
      </seriesStmt>
      <notesStmt>
        <note>NOTES</note>
      </notesStmt>
      <sourceDesc>
        <bibl>
          <author>AUTHOR of SOURCE DOCUMENT <date>AUTHOR DATES</date> </author>
          <title>TITLE of SOURCE DOCUMENT</title>
          <editor>EDITOR of SOURCE DOCUMENT</editor>
          <extent>EXTENT (page range) of SOURCE DOCUMENT</extent>
          <imprint>
            <pubPlace>PLACE of PUBLICATION for SOURCE DOCUMENT</pubPlace>
            <publisher>PUBLISHER of SOURCE DOCUMENT</publisher>
            <date>DATE of PUBLICATION for SOURCE DOCUMENT</date>
          </imprint>
        </bibl>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <projectDesc>
        PROJECT DESCRIPTION (Encoding of SOURCE DOCUMENT)
      </projectDesc>
      <samplingDecl>
        SAMPLING of TEXTS (for Corpus/Collection)
      </samplingDecl>
      <editorialDecl>
        CORRECTIONS to SOURCE DOCUMENT
      </editorialDecl>
      <classDecl>
        <taxonomy id="genre">
          <category>
            <catDesc>Genre</catDesc>
          </category>
        </taxonomy>
        <taxonomy id="authorgender">
          <category>
            <catDesc>Author Gender</catDesc>
          </category>
        </taxonomy>
        <taxonomy id="period">
          <category>
            <catDesc>Period</catDesc>
          </category>
        </taxonomy>
      </classDecl>
    </encodingDesc>
    <profileDesc>
      <creation>
        <date>CREATION DATE of SOURCE DOCUMENT</date>
        <address>
          <addrLine>PLACE of CREATION</addrLine>
        </address>
      </creation>
      <langUsage>
        <language>LANGUAGE of SOURCE DOCUMENT</language>
      </langUsage>
      <textClass>
        <keywords>
          <list>
            <item>KEYWORDS</item>
            <item>KEYWORDS</item>
          </list>
        </keywords>
        <keywords scheme="genre">
          <list>
            <item>GENRE</item>
          </list>
        </keywords>
        <keywords scheme="authorgender">
          <list>
            <item>GENDER</item>
          </list>
        </keywords>
        <keywords scheme="period">
          <list>
            <item>PERIOD</item>
          </list>
        </keywords>
      </textClass>
    </profileDesc>
    <revisionDesc>
      <change>
        <date>DATE</date>
        <respStmt>
          <resp>BY</resp>
          <name>NAME</name>
        </respStmt>
        <item>CHANGE</item>
      </change>
    </revisionDesc>
  </teiHeader>
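
In the sample header, each <keywords scheme="..."> in <textClass> uses the same name as a <taxonomy id="..."> in <classDecl>. As a rough sanity check before loading, a short script can confirm that every scheme has a matching taxonomy. The following is only a minimal sketch using Python's standard library. The function name unmatched_schemes and the trimmed-down sample are illustrative assumptions and are not part of PhiloLogic, and the sketch assumes the header parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# Trimmed-down header used only for illustration (not a complete TEI header).
SAMPLE_HEADER = """
<teiHeader>
  <encodingDesc>
    <classDecl>
      <taxonomy id="genre"><category><catDesc>Genre</catDesc></category></taxonomy>
      <taxonomy id="period"><category><catDesc>Period</catDesc></category></taxonomy>
    </classDecl>
  </encodingDesc>
  <profileDesc>
    <textClass>
      <keywords><list><item>KEYWORDS</item></list></keywords>
      <keywords scheme="genre"><list><item>GENRE</item></list></keywords>
      <keywords scheme="authorgender"><list><item>GENDER</item></list></keywords>
    </textClass>
  </profileDesc>
</teiHeader>
"""


def unmatched_schemes(header_xml):
    """Return the sorted keyword schemes that have no <taxonomy> with the same id.

    Hypothetical helper for illustration; <keywords> elements without a
    scheme attribute are ignored.
    """
    root = ET.fromstring(header_xml)
    taxonomy_ids = {t.get("id") for t in root.iter("taxonomy")}
    schemes = {k.get("scheme") for k in root.iter("keywords") if k.get("scheme")}
    return sorted(schemes - taxonomy_ids)


if __name__ == "__main__":
    # "authorgender" is deliberately left without a taxonomy in the sample.
    print(unmatched_schemes(SAMPLE_HEADER))
```

An empty list means every scheme name used in <textClass> is declared in <classDecl>.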

Notes
- Internal Cross References
 - Cross references to textual objects (sections, chapters, etc.) should be coded in this manner:
 - The Object itself:
 - <div2 type="Chapter" id="c2">
 - The Reference to the object:
 - <ref type="cross" target="c2">See chapter 2</ref>
 - Note that both id="xxx" and target="xxx" use the same unique identifier.
- Images in the Text
 - References to images embedded in the text should be coded in this manner:
 - <figure n="filename.ext">
 - <figDesc>Caption</figDesc>
 - </figure>
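
Because a cross reference only resolves when its target matches an existing id, it can be useful to check a file for dangling references before loading it. The sketch below is plain Python with hypothetical helper names (dangling_refs, figure_files) and a made-up sample, including the file name plate1.jpg. It lists <ref> targets that match no id and collects the file names given in <figure n="...">. It assumes the text parses as well-formed XML and is not part of PhiloLogic itself.

```python
import xml.etree.ElementTree as ET

# Small made-up text body; the second <ref> points at an id that does not exist.
SAMPLE_TEXT = """
<body>
  <div1 type="Book" id="b1">
    <div2 type="Chapter" id="c2">
      <p>Text of chapter 2.</p>
      <figure n="plate1.jpg"><figDesc>Caption</figDesc></figure>
    </div2>
    <div2 type="Chapter" id="c3">
      <p><ref type="cross" target="c2">See chapter 2</ref></p>
      <p><ref type="cross" target="c9">See chapter 9</ref></p>
    </div2>
  </div1>
</body>
"""


def dangling_refs(text_xml):
    """Return <ref> targets, in document order, that match no id attribute."""
    root = ET.fromstring(text_xml)
    ids = {el.get("id") for el in root.iter() if el.get("id")}
    targets = (r.get("target") for r in root.iter("ref"))
    return [t for t in targets if t and t not in ids]


def figure_files(text_xml):
    """Return the file names recorded in the n attribute of each <figure>."""
    root = ET.fromstring(text_xml)
    return [f.get("n") for f in root.iter("figure") if f.get("n")]


if __name__ == "__main__":
    print(dangling_refs(SAMPLE_TEXT))
    print(figure_files(SAMPLE_TEXT))
```

The list returned by figure_files can then be compared against the image files actually present alongside the text.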

PhiloLogic using other encoding schemes

Currently PhiloLogic is known to work with databases encoded using the following schemes:

MEP - The Model Editions Partnership (Example: The Sanger Archive in our Sample Databases)
CES - Corpus Encoding Standard (Example: BBC Urdu Sample - Restricted access)
ATE - ARTFL Text Encoding (Examples forthcoming). This is HTML with Dublin Core metadata and optional extensions for pages, notes, sentences, and the like. We specify a small subset of HTML that PhiloLogic actually acts on, and loading requires proper use of the <h1> through <hN> heading tags. PhiloLogic is known to load arbitrary HTML, but your mileage may vary. To load ATE and documents that look like ATE, run philoload DBNAME texttype=ate and set TextType in philo-db.cfg to ate.
DocBook. Proof of concept only. We loaded the only three samples of literary texts we were able to find. The loader and system could easily be expanded to handle most of DocBook if there is demand. We are not sure that text analysis of technical documentation, the primary use of DocBook, is all that worthwhile. Load and configure with texttype=docbook.
Plaintext. Tested on Gutenberg (plaintext) and Liberliber documents. Important caveat: input data files MUST be converted to UTF-8 before loading. Load and configure with texttype=plaintext. The loader will try to identify paragraphs, Gutenberg headers and trailers (available but not indexed for searching), "chunkify" the document into reasonable portions, and extract Author/Title info from Gutenberg files.
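
Since plaintext input must be UTF-8 before loading, files in another encoding need converting first. The following is a minimal sketch, assuming the source file is Latin-1 (ISO-8859-1). The function names convert_to_utf8 and is_valid_utf8 are placeholders, and the real source encoding has to be known or determined separately, because decoding with the wrong encoding produces wrong characters without raising an error.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(src, dst, source_encoding="latin-1"):
    """Re-encode src as UTF-8 and write the result to dst.

    Works on raw bytes so that line endings are left untouched.
    source_encoding is an assumption the caller must get right.
    """
    raw = Path(src).read_bytes()
    Path(dst).write_bytes(raw.decode(source_encoding).encode("utf-8"))
```

A typical use would be to call is_valid_utf8 on each input file and run convert_to_utf8 only on the files that fail the check, writing the output to a separate directory so the originals are kept.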
