ATE : ARTFL Text Encoding

This document is NOT complete. It was, at one point, cleaner, but I have been adding issues that we still need to resolve, so I need to write a public-release-level ATE specification. MVO.

Objective

The ATE (ARTFL Text Encoding) specification is used for internal data representation by PhiloLogic, our new generation of ARTFL search and reporting software. We have determined that it would be too costly to develop a large-scale text searching and analysis package that treats raw SGML documents by reading DTDs and automatically building a single consistent data representation. We therefore had to specify a data representation that can reasonably represent a wide variety of textual features with a basic level of consistency across documents. ATE is our first attempt to develop an encoding standard for use in PhiloLogic.

Rather than create "yet another standard", we decided that our internal data representation should leverage existing standards whenever possible, and use standards that are simple, well documented, stable, and supported by a large number of existing free or inexpensive tools. We determined that the most effective representation is the combination of the (unqualified) Dublin Core 1.0 metadata specification with very basic HTML. Only a few extensions are currently defined for building ARTFL-style databases: page identifiers, explicit sentence tags, and proper name tags, all of which are optional and may be omitted. PhiloLogic completely IGNORES all SGML tagging that we have not specifically described below. This allows compatibility with other systems, in that you can pass tags blindly through the system.

By using a very basic and well documented set of encoding specifications, the design of PhiloLogic and ATE will allow users to build databases from different encoding schemes (see our SGML to ATE document LINK COMING), allow individuals to develop ATE-compatible documents directly using existing tools, and allow us to use PhiloLogic to index arbitrary collections of HTML documents in a WWW space with minimal encoding and no modification. In contrast to the TEI and other specifications, which work from the top down to provide an infrastructure for all possible encodings in a system-independent fashion, ATE is system dependent and built from the bottom up. We specify only tagging that we actually have in production or that we are planning to treat in PhiloLogic. Extensions to this basic specification will be, as always, completely optional and will use existing schemes, preferably TEI specifications, whenever possible.

We believe that a simple representation is sufficiently rich for the creation of complex databases and easily used by many scholars in environments with which they are already familiar.

ATE: Dublin Core Header

We selected the Dublin Core specification because it is simple -- designed to be used by non-catalogers as well as resource description specialists -- interoperable across many domains, readily extensible, well supported, consistent with HTML documents, and garnering international support. Its simplicity and coverage allow us to map the kinds of bibliographic support that ARTFL has provided for years onto an easy-to-manage encoding.

For default functionality of PhiloLogic bibliographic control, we require only a selected subset of the 15 base data elements of the Dublin Core. The 15 base elements are, with a quick indication of the desired contents of each element:

<head>

</head>

This information is mapped to a refer bibliography, which PhiloLogic uses to build the bibliographic data handler. It contains the following information in the standard layout:

%a Rousseau, J.-J.

%T Discours sur Sciences et Arts

%D 1750

%Y traite ou essai

%P In Oeuvres Completes, T.3. Paris, Gallimard, 1964.

%S DiScA

At this time, the default bibliographic representation allows for searching on

Author (DC.creator)
Title (DC.title)
Date of Publication (DC.date)

and displays in addition to the above,

Publisher (DC.publisher)
Short Identifier (DC.identifier)

We will certainly be implementing optional treatment for other DC fields, as defined below and/or by the default DC specification. Most important will be

Genre or type (DC.type)
Subject (DC.subject)
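The mapping from the Dublin Core header to a refer record can be sketched as follows. This is a minimal illustration, not the PhiloLogic loader: the `<meta name="DC.xxx" content="...">` header syntax, the pairing of refer tags with DC elements (inferred from the sample record above), and the names `REFER_TAGS`, `DCMetaParser` and `dc_to_refer` are all assumptions.

```python
from html.parser import HTMLParser

# Assumed pairing of refer tags with Dublin Core elements, inferred from the
# sample record in this section; it is not taken from the PhiloLogic loader.
REFER_TAGS = [
    ("%a", "DC.creator"),
    ("%T", "DC.title"),
    ("%D", "DC.date"),
    ("%Y", "DC.type"),
    ("%P", "DC.publisher"),
    ("%S", "DC.identifier"),
]


class DCMetaParser(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> pairs from a document header."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        if name.lower().startswith("dc.") and attr.get("content"):
            # Normalise case so "DC.Creator" and "dc.creator" share one key;
            # the first occurrence of an element wins.
            self.fields.setdefault(name.lower(), attr["content"])


def dc_to_refer(header_html):
    """Return refer-style lines for the DC elements found in header_html."""
    parser = DCMetaParser()
    parser.feed(header_html)
    lines = []
    for refer_tag, dc_name in REFER_TAGS:
        value = parser.fields.get(dc_name.lower())
        if value:
            lines.append(f"{refer_tag} {value}")
    return lines


header = """<head>
<meta name="DC.creator" content="Rousseau, J.-J.">
<meta name="DC.title" content="Discours sur Sciences et Arts">
<meta name="DC.date" content="1750">
<meta name="DC.identifier" content="DiScA">
</head>"""
print("\n".join(dc_to_refer(header)))
```

Elements that are absent from the header are simply skipped, so a document carrying only the searchable fields still yields a usable record.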

ATE adapts the Dublin Core specification to reflect the kinds of information that we typically want to have in basic database functionality. Some of these adaptations may not fit the semantics currently being defined by DC. Finally, as noted below, PhiloLogic does not currently do anything with a number of DC elements. We strongly recommend that these be used, as specified by Dublin Core, for future reference. We expect that some or all of this material will be used in PhiloLogic and in other systems.

We have provided a more detailed outline of Recommended Dublin Core Contents for use under PhiloLogic. This document specifies in detail the format and contents of Dublin Core elements that will work best under PhiloLogic. PhiloLogic is compatible with basic Dublin Core as defined in the unqualified Dublin Core specification.

Randomly selected example:

Important note: I need to document alternative bibliographic control, the refer format, since we do load databases in this way as well. In fact, the loader extracts the bibliographic information from the DC representation and generates a refer bibliography with some additional information required by the loader.

Text Data Encoding

This section contains BOTH data specifications and behavior/implementation specifications, which need to be separated out!!

We will adopt a multi-level structural hierarchy for main textual objects. The top level is the document, and lower levels are as follows:

This is HTML and maps well to the SGML/TEI div0 through div_n (so <div1> ... </div1> --> <h1> ... </h1>, <div2> ... </div2> --> <h2> ... </h2>, etc.). At this time any division level beyond 3 will be translated as an HTML heading tag for display purposes only and will not indicate a structural level of textual objects. We are considering handling more than 3 nested levels, but to date we have encountered only one kind of document that requires more than 3 levels: certain EAD documents. A document need not contain all these divisions (it may only have <h1> and <h2>). The three-level hierarchy can reflect any number of possible structures, such as Book-Chapter-Verse, or Act-Scene (leaving <h3> unused). Another example might be the following:

<h1>Preface</h1>

some text and tags

<h1>Chapter One </h1>

some text and tags

some text and tags

<h1>Chapter Two </h1>

And so on...

Note: The system expects section breaks -- <h[1-n]> ... </h[1-n]> -- to appear on a separate line. It would be best if the entire header title appeared on a single line:

<h1>Header Title</h1>

These can be as long as you want, but remember that we use them to display tables of contents, which use indentations to indicate nesting, so really long header titles will wrap, making the display less effective.
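As an illustration of why headings should sit alone on one line, here is a minimal sketch that builds an indented table of contents from such lines. The function name `table_of_contents` and the two-space indent are assumptions made for the example; PhiloLogic's own display code is not shown in this document.

```python
import re

# A structural heading is an <h1>, <h2> or <h3> element alone on its line.
HEADING = re.compile(r"^\s*<h([1-3])>(.*?)</h\1>\s*$", re.IGNORECASE)


def table_of_contents(lines, indent="  "):
    """Return indented title lines for every <h1>-<h3> heading on its own line."""
    toc = []
    for line in lines:
        match = HEADING.match(line)
        if match:
            level = int(match.group(1))
            toc.append(indent * (level - 1) + match.group(2).strip())
    return toc


sample = [
    "<h1>Preface</h1>",
    "some text and tags",
    "<h1>Chapter One </h1>",
    "<h2>Section A</h2>",
    "<h4>Display only</h4>",
]
print("\n".join(table_of_contents(sample)))
```

Headings deeper than level 3 are ignored here, matching the statement that they are kept for display only and do not mark a structural level.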

Full document navigation is in full production for PhiloLogic databases as described in the sections on Retrieving and Navigating Documents and Navigating Documents from Word Searches of the PhiloLogic User's Manual.

Further object levels which descend from the lowest division level are as follows:

<p>         paragraph/stanza
sentence    delimited by punctuation or explicit tags (<sent>)
word        delimited by white space and punctuation

Pages will also constitute textual objects, even though they do not fit into the structural hierarchy outlined above. We will tag page breaks as <page n="[ANY STRING]">, where [ANY STRING] is a page object identifier (e.g. page number) from the source edition. Every text object should refer to its page identifier for display purposes: users will want to know the page identifier for the text object (such as a paragraph) containing the text they queried.

We have decided that the value of the page numbers noted here will NOT include spaces. Generally, I would keep these babies short, since it is a matter of display space in KWIC reports. In general, these look like:

but of course, pages can have letters and other oddities,

for page 12 of volume one. And, you can have alternates, just don't use a space:

but of course

<page n="23:3"> or

might be just as effective.

Important Note: Page tags must appear on their own lines.
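The page conventions above (tags on their own lines, identifiers without spaces, every text object knowing its page) can be sketched as follows. This is only an illustration under stated assumptions: the function name `paragraphs_with_pages` is hypothetical, and treating a line that starts with <p> as one paragraph object is a simplification.

```python
import re

# A page break is a <page n="..."> tag alone on its line.
PAGE_TAG = re.compile(r'^\s*<page n="([^"]*)">\s*$')


def paragraphs_with_pages(lines):
    """Pair each <p> line with the identifier of the page it starts on."""
    current_page = None  # no page tag seen yet; missing tags are tolerated
    result = []
    for line in lines:
        match = PAGE_TAG.match(line)
        if match:
            page_id = match.group(1)
            if any(ch.isspace() for ch in page_id):
                raise ValueError(f"page identifier contains a space: {page_id!r}")
            current_page = page_id
        elif line.lstrip().startswith("<p>"):
            result.append((current_page, line.strip()))
    return result


sample = [
    "<p>Before any page tag.",
    '<page n="23:3">',
    "<p>First paragraph.",
    '<page n="xii">',
    "<p>Second paragraph.",
]
for page, paragraph in paragraphs_with_pages(sample):
    print(page, paragraph)
```

A paragraph that precedes any page tag is reported with no page identifier rather than rejected, in keeping with the goal that missing page markers should not stop a file from loading.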

For search and retrieval purposes, any front matter (such as the editor's preface, title page, etc.) or back matter (appendices, indices, etc.) will be considered a logical unit of text tagged as a first-level structure.

The new loader should implement this system for tagging text structure, but it should also be able to handle texts already marked up in HTML without further modification. In other words, missing <page n="..."> markers or <h1> ... </h1> tags should not prevent a file from loading into a database.

SENTENCE TAGGING

Ideally, all sentences would have explicit tags to differentiate them from abbreviations and other uses of punctuation. However, if we are going to accept any properly tagged HTML document, we will have to allow for implied sentences.

Explicit sentences will be tagged as follows:

<sent>Blah blah blah blah, etc., but blah blah blah.<sent> Blah blah!

Sentences will start, as expected, with any <p> or division break <h[1-3]>.

Implicit sentences will be identified by the punctuation marks . ? ! and object tags (e.g. <h1>, <p>, etc.) as we have in the current loader.

Note: the system will, for every paragraph, check to see if there are any <sent> tags. If not, it will apply implied sentence recognition.
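The per-paragraph rule can be sketched like this. It is a simplified illustration, not the loader's code: it assumes <sent> marks the start of each sentence, as in the example above, and it treats . ? ! followed by white space as an implied sentence end. The name `split_sentences` is hypothetical.

```python
import re

# An implied sentence ends at . ? or ! followed by white space.
IMPLIED_END = re.compile(r"(?<=[.?!])\s+")


def split_sentences(paragraph):
    """Split one paragraph into sentences, preferring explicit <sent> tags."""
    if "<sent>" in paragraph:
        # Explicit tagging present: trust it and ignore punctuation.
        parts = paragraph.split("<sent>")
    else:
        # No tags in this paragraph: fall back to implied recognition.
        parts = IMPLIED_END.split(paragraph)
    return [part.strip() for part in parts if part.strip()]


print(split_sentences("<sent>Blah blah, etc., but blah blah.<sent> Blah blah!"))
print(split_sentences("Mr. Smith left. He was late!"))
```

The second call shows the weakness of implied recognition: the abbreviation "Mr." is cut off as a sentence of its own, which is exactly the case explicit <sent> tags are meant to avoid.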

WORD OBJECTS

In some cases, words have tags within them. In the OVI database, for example, we find medieval Italian words with italics inside: apri<i>le</i>, ma<i>no</i>. In order to index these words as word objects, the loader should not treat tags as spaces.
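A minimal sketch of this behavior, assuming a small set of inline tags: only <i> appears in the example above, and the other tag names are guesses based on the italics, bold, underline and super/subscript cases mentioned in the Nota Bene section. The function name `index_words` is hypothetical.

```python
import re

# Assumption: only <i> is shown in the text; the other tag names are
# illustrative guesses, not the loader's actual list.
INLINE_TAGS = ("i", "b", "u", "sup", "sub")
INLINE = re.compile(r"</?(?:%s)>" % "|".join(INLINE_TAGS), re.IGNORECASE)
OTHER_TAG = re.compile(r"<[^>]*>")


def index_words(text):
    """Return the words to index: inline tags vanish, other tags act as spaces."""
    text = INLINE.sub("", text)      # apri<i>le</i> -> aprile
    text = OTHER_TAG.sub(" ", text)  # any remaining tag separates words
    words = (word.strip(".,;:!?") for word in text.split())
    return [word for word in words if word]


print(index_words("apri<i>le</i>, ma<i>no</i>"))
```

Removing the inline tags before splitting is what keeps "aprile" and "mano" whole; replacing them with spaces would index "apri" and "le" as separate words.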

The current ARTFL loader handles all valid HTML special characters (for a list of these characters, see the official list). There are also many SGML character entities which do not map to ISO Latin, such as &obar; (the letter "o" with a macron over it). Commercially available data offers a wide variety of unofficial SGML character representations, so the preceding example is one of many possibilities. In order to account for unusual character representations, the loader should generate a words.R file where each entry has two fields:

pot|p&obar;t

The first field is a reduced search key, the second is the full word. We can generalize this and add entries to the words.R file whenever there is a word containing an SGML character entity of the form

&[ONECHAR][MANYCHARS];

which would then lead to adding an entry to the words.R file as follows:

xx[ONECHAR]xxx|xx&[ONECHAR][MANYCHARS];xxx
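The generalization above amounts to replacing each entity by its first character. A minimal sketch, assuming entity names consist of a letter followed by one or more letters or digits (the name `words_r_entry` is hypothetical):

```python
import re

# &[ONECHAR][MANYCHARS]; -> keep ONECHAR, drop the rest of the entity.
ENTITY = re.compile(r"&([A-Za-z])[A-Za-z0-9]+;")


def words_r_entry(word):
    """Return 'key|word' for a word with SGML entities, or None if it has none."""
    key = ENTITY.sub(r"\1", word)
    if key == word:
        return None  # nothing was reduced, so no words.R entry is needed
    return f"{key}|{word}"


print(words_r_entry("p&obar;t"))
```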

We can also use this two-field index file to search for words containing parentheses, brackets, and other punctuation that automatically delimited a new word in the old loading format. We can map words with spelling variations (as is common with variant editions of the same medieval text) to the same canonical form, such as

firenze|fire(n)ze
firenze|fire[nze]

While the simplified spelling will help many users find all the variations of a word or phrase, we should also allow for searches that specify a particular spelling: users should be able to search only for fire(n)ze if they wish.
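The two kinds of search can be sketched with a small in-memory index built from words.R lines. This is an illustration only; `load_words_r`, `lookup` and the `exact` flag are hypothetical names, and the real search engine is not described here.

```python
from collections import defaultdict


def load_words_r(lines):
    """Build {reduced key: [full forms]} from 'key|full word' lines."""
    index = defaultdict(list)
    for line in lines:
        key, _, full = line.strip().partition("|")
        if key and full:
            index[key].append(full)
    return index


def lookup(index, query, exact=False):
    """Return full forms matching query as a reduced key, or as an exact spelling."""
    if exact:
        return [full for forms in index.values() for full in forms if full == query]
    return list(index.get(query, []))


index = load_words_r(["firenze|fire(n)ze", "firenze|fire[nze]", "pot|p&obar;t"])
print(lookup(index, "firenze"))
print(lookup(index, "fire(n)ze", exact=True))
```

Searching on the reduced key returns every variant, while the exact mode restricts the result to the one spelling the user typed.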

In the future, we may be able to extend this representation of unusual spellings and characters to include fields for things such as parts of speech, root forms, etc.

firenze|fire[nze]|PROPER NOUN|ROOT_FORM|etc.

Users could then construct search queries such as "Find a verb followed by some form of 'Firenze'" by typing something like "VERB AND firenze."

PROPER NOUNS

As an extension to this tag set for future search engine developments, we will include tags for proper nouns (to distinguish them from words which begin sentences, for instance). The tags will be <pn>Proper Noun</pn>.

NOT IMPLEMENTED YET

Speaker Tags

I want to reserve <spkr>Speaker's Name</spkr> and possibly the alternative construct

<spkr name="SOME NAME"> .... text, maybe lots of it </spkr>

for future reference. Leonid and I may be embedding these babies in the main ARTFL database (TLF). I suspect that IF we implement something like that, it would be used mainly for theatre. I want to keep the door open. NOT IMPLEMENTED YET

Notes (footnotes and endnotes)

March '06: A while back, I decided that we would want to keep all notes in their own <h1 (or <div1) and link to them. This allows for coherent object processing in the text AND rapid selection of objects that have notes. We have preprocessors that move notes to the end of <divs or <h1s.

<h1>Notes</h1>

<notetext n="0" xpg="398" xpgobj="136">* Sono contrassegnati da

un asterisco i capitoli di altri a Veronica Franco.

</notetext>

Links to NOTES in the text (behaves like a ref):

deposited the mummies<note n="21" ref="21"> that had been

<note n="0" ref="a">, che degna gloria

where n=NOTE NUMBER -- the internal identifier which links to the note and ref=DISPLAY IDENTIFIER (the thing that gets displayed to indicate there is a note). We do not want numbers or other note identifiers in the running text, since these would be indexed as characters and break word adjacent searching.

The notetext tag appears as indicated. In 2t loaders, you want paragraphs between them; for 3t, the notetext tag will suffice.

<notetext n="21" xpg="58" xpgobj="58"> <i>mummies</i>.</notetext>

<notetext n="5" xpg="398" xpgobj="136">* Sono contrassegnati da

un asterisco i capitoli di altri a Veronica Franco.

where n=NOTE NUMBER -- a string tied to the note tag -- xpg=PAGE IDENTIFIER -- usually a page number -- and xpgobj is an integer counted from 0 in the document. Thus,

<notetext n="8" xpg="VIII" xpgobj="8">(9) Giulio Secondo....</notetext>

indicates that the page identifier is "VIII", which is the tag for the 8th page object. We normally echo out the Note Identifier at the beginning of the notetext, e.g. (9) or "*". It is called by

Most important is that the two values for n="VALUE" be identical, since this is what ties the reference to its notetext.
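Because the link depends entirely on matching n values, a quick consistency check before loading can be useful. The following is a minimal sketch under the tag shapes shown above; the function name `unmatched_notes` is hypothetical and this check is not part of the loader described here.

```python
import re

# <note n="..." ref="..."> is the reference; <notetext n="..." ...> is the note body.
NOTE_REF = re.compile(r'<note\s+n="([^"]*)"\s+ref="([^"]*)">')
NOTE_TEXT = re.compile(r'<notetext\s+n="([^"]*)"')


def unmatched_notes(text):
    """Return (references without a notetext, notetexts without a reference)."""
    refs = {m.group(1) for m in NOTE_REF.finditer(text)}
    bodies = {m.group(1) for m in NOTE_TEXT.finditer(text)}
    return sorted(refs - bodies), sorted(bodies - refs)


document = "\n".join([
    'deposited the mummies<note n="21" ref="21"> that had been',
    '<note n="0" ref="a">, che degna gloria',
    "<h1>Notes</h1>",
    '<notetext n="21" xpg="58" xpgobj="58"> <i>mummies</i>.</notetext>',
    '<notetext n="5" xpg="398" xpgobj="136">* Sono contrassegnati</notetext>',
])
print(unmatched_notes(document))
```

In this sample, reference 0 has no notetext and notetext 5 has no reference, so both are reported.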

Links to external resources through WWW/http

This is going to be a convention, since the main system will not have to handle this. Since we are using HTML as a base encoding, we can, of course, simply accept full URLs or PURLs. But we would rather NOT do this since it means hardwiring addresses in the database.

External object linkage is performed in format.ph, typically by expanding an internal tag to a URL:

s/<FIGURE INLINE="." SYSID="([^"]*)">/<IMG SRC=$image_server$1>/g;

where $image_server is set to an appropriate value in format.ph, such as:

$image_server = "http://www.lib.uchicago.edu/efts/EVD/figures/";

For administrative ease and consistency, it is wise to adopt general conventions which we can put in the default installed format.ph script. Conventions are noted below.

Images

Since almost all of our current databases only have images, this is the logical place to start. After looking at a variety of SGML specifications, I have to admit that I like the information provided in the Chadwyck-Healey notation:

This allows us to clearly build a link and determine how an image should be displayed. Thus,

<FIGURE SYSID="FILE_NAME.EXT">

<FIGURE INLINE="Y" SYSID="FILE_NAME.EXT">

will be functionally equivalent, building a link to an image for display in the WWW browser using the <IMG SRC=... HTML construction. Similarly,

<FIGURE INLINE="N" SYSID="FILE_NAME.EXT">

will be used to build clickable links to images using the <A HREF="..." construction.

There are a couple of assumptions here.

The variable $image_server holds the protocol://computer.address/directory/path/ to the images.
Image file names should be unique to the database, OR FILE_NAME.EXT should give a relative path from the path specified in $image_server.

Example Code:

$image_server = "http://www.lib.uchicago.edu/efts/VOLTAIRE/figures/";

# The following goes into the Object formatter

s/<FIGURE INLINE="Y" SYSID="([^"]*)">/<IMG SRC="$image_server$1">/g;

s/<FIGURE SYSID="([^"]*)">/<IMG SRC="$image_server$1">/g;

# You can modify the formatting of the link

s/<FIGURE INLINE="N" SYSID="([^"]*)">/[<a href="$image_server$1">image<\/a>]/g;

Audio and other specifications

[to be determined]

Nota Bene

This is a random collection of implementation specific notes that will have to be moved into more complete documentation.

Character separation. Apostrophes are assumed to be part of words, so you must explicitly split them. We have done this in order to allow the database administrator to determine word breaking. Thus, l' état should have a space while aujourd'hui probably should not.
Word separation. SGML/HTML elements that are not specifically used by PhiloLogic are typically passed blindly through the system as word-separating white space. In order to handle words that contain italics, bold, underlines and super/subscripts, we have decided NOT to treat the following tags as word separators:
Thus words separated only by such tags will be run together for indexing and display purposes. We have come across this in some SGML representations and warn you up front. The tags are removed from the INDEX for searching purposes (obviously).

April 2, 99: ALL hyphens will act as word separators!
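The apostrophe and hyphen rules in these notes can be illustrated with a very small tokenizer. This is a sketch under simple assumptions (a word is a run of letters, digits or apostrophes that starts with a letter or digit); the name `split_words` is hypothetical and the loader's real word-breaking rules may differ.

```python
import re

# Apostrophes stay inside words; hyphens, like other punctuation, separate them.
WORD = re.compile(r"\w[\w']*")


def split_words(text):
    """Return the word objects in text under the apostrophe and hyphen rules."""
    return WORD.findall(text)


print(split_words("l' état"))
print(split_words("aujourd'hui"))
print(split_words("arc-en-ciel"))
```

The space after the apostrophe in "l' état" is what makes it two words; without it, the tokenizer would keep "l'état" as one word, which is why the split has to be made explicitly in the data.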

Notes are a problem in general. We are putting them at the end of documents as an h3 ... in order to get page fetching functioning properly, we will also add a page tag before the h3 notes......

ARTFL Project, University of Chicago.

Revision Date: date here, please

Leonid Andreev, leonid@math.harvard.edu

Mark Olsen, mark@barkov.uchicago.edu
