This document is NOT completed. It was, at one point,
cleaner, but I have been adding some issues that we need to
resolve. So, I need to write a public release level ATE
The ATE (ARTFL Text Encoding) specification is
used for internal data representation by our new generation of
ARTFL search and reporting software, PhiloLogic. We have determined
that attempting to develop a large-scale text searching and analysis
package that would treat raw SGML documents by reading DTDs and
automatically building a single consistent data representation would
be too costly. Thus, we had to specify a data
representation that can reasonably represent a wide variety of
textual features with a basic level of consistency across various
documents. ATE is our first attempt to develop an encoding standard
for use in PhiloLogic.
Rather than create "yet another standard", we decided that our
internal data representation should leverage existing standards,
whenever possible, and use standards that are simple, well
documented, stable, and for which there are a large number of existing
and free/inexpensive tools. We determined that the most effective
representation is the combination of (unqualified)
Dublin Core 1.0
metadata specification with very basic HTML. Only a few extensions are
currently used to build ARTFL style databases: page identifiers, explicit
sentence tags, and proper name tags, all of which are optional.
PhiloLogic completely IGNORES ALL SGML tagging that we have
not specifically described below. This allows compatibility with
other systems, in that you can pass tags blindly through the system.
By using a very basic and well documented set of encoding
specifications, the design of PhiloLogic and ATE will allow users
to build databases from different encoding schemes (see our
SGML to ATE document LINK COMING), allow individuals to develop
directly ATE compatible documents using existing tools, and allow us
to use PhiloLogic to index arbitrary collections of HTML documents
in a WWW space with minimal encoding and no modification.
In contrast to the TEI and other specifications, which work from
the top-down to provide an infrastructure for all possible
encodings in a system independent fashion, ATE is system
dependent and built from the bottom-up. We specify
only tagging that we actually have in production or that we
are planning to treat in PhiloLogic. Extensions to this basic
specification will be, as always, completely optional and
will use existing schemes, preferably TEI specifications, whenever possible.
We believe that a simple representation is sufficiently rich for
the creation of complex databases and easily used by many scholars
in environments with which they are already familiar.
ATE: Dublin Core Header
We selected the Dublin Core specification because it is simple --
designed to be used by non-catalogers as well as resource description
specialists -- and because it works across many domains, is readily
extensible, well supported, and consistent with HTML documents, and is
garnering international support. Its simplicity and coverage allow us
to map the kinds of bibliographic support that ARTFL has provided
for years onto an easy to manage encoding.
For default functionality of PhiloLogic bibliographic control,
we require only a selected subset of the 15 base data elements of
the Dublin Core. The 15 base elements are, with a quick indication
of the desired contents of each element:
<meta name="DC.title" content="Complete Title">
<meta name="DC.creator" content="Author Name">
<meta name="DC.publisher" content="Publisher">
<meta name="DC.date" content="date">
<meta name="DC.type" content="Genre or type">
<meta name="DC.identifier" content="Short Identifier">
<meta name="DC.contributor" content="Editor or other">
<meta name="DC.subject" content="TO BE DETERMINED">
<meta name="DC.format" content="TO BE DETERMINED">
<meta name="DC.language" content="TO BE DETERMINED">
<meta name="DC.description" content="TO BE DETERMINED">
<meta name="DC.relation" content="TO BE DETERMINED">
<meta name="DC.coverage" content="TO BE DETERMINED">
<meta name="DC.source" content="TO BE DETERMINED">
<meta name="DC.rights" content="TO BE DETERMINED">
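As a rough illustration of how a header like the one above might be read, here is a minimal Python sketch. It is not the PhiloLogic loader; the class and function names (DublinCoreHeader, read_dc_header) are hypothetical, and it assumes only that each element appears as a <meta name="DC.x" content="..."> tag.

```python
from html.parser import HTMLParser


class DublinCoreHeader(HTMLParser):
    """Collect <meta name="DC.*" content="..."> pairs from an ATE header."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.lower().startswith("dc."):
            # Key on the element name without the "DC." prefix.
            self.fields[name[3:].lower()] = attributes.get("content") or ""


def read_dc_header(text):
    """Return a dict mapping Dublin Core element names to their contents."""
    parser = DublinCoreHeader()
    parser.feed(text)
    parser.close()
    return parser.fields


header = """
<meta name="DC.title" content="De rationali et ratione uti">
<meta name="DC.creator" content="Gerbertus Auriliacensis">
<meta name="DC.date" content="MED">
"""
fields = read_dc_header(header)
print(fields["title"])
```

Elements that are absent from the header are simply missing from the dictionary, which matches the idea that only a subset of the 15 elements is needed.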
This information is mapped to a refer bibliography, which PhiloLogic
uses to build the bibliographic data. A record in the standard layout
contains the following information:
%a Rousseau, J.-J.
%T Discours sur Sciences et Arts
%Y traite ou essai
%P In Oeuvres Completes, T.3. Paris, Gallimard, 1964.
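The step from Dublin Core values to a record like this one could be sketched as follows. This is an assumption-heavy illustration: the pairing of tags with DC elements (%a with creator, %T with title, %Y with type, %P with publisher) is inferred only from the sample record above and is not a documented mapping, and the names REFER_TAGS and dc_to_refer are hypothetical.

```python
# Hypothetical mapping, inferred only from the sample record above.
REFER_TAGS = [
    ("creator", "%a"),
    ("title", "%T"),
    ("type", "%Y"),
    ("publisher", "%P"),
]


def dc_to_refer(fields):
    """Render a dict of Dublin Core values as one refer-style record."""
    lines = []
    for element, tag in REFER_TAGS:
        value = fields.get(element, "").strip()
        if value:  # skip elements the document does not supply
            lines.append(f"{tag} {value}")
    return "\n".join(lines)


record = dc_to_refer({
    "creator": "Rousseau, J.-J.",
    "title": "Discours sur Sciences et Arts",
    "type": "traite ou essai",
    "publisher": "In Oeuvres Completes, T.3. Paris, Gallimard, 1964.",
})
print(record)
```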
At this time, the default bibliographic representation allows
for searching on
- Author (DC.creator)
- Title (DC.title)
- Date of Publication (DC.date)
and displays in addition to the above,
- Publisher (DC.publisher)
- Short Identifier (DC.identifier)
We will certainly be implementing optional treatment for other
DC fields, as defined below and/or by the default DC specification.
Most important will be
- Genre or type (DC.type)
- Subject (DC.subject)
ATE adapts the Dublin Core specification to reflect the kinds of
information that we typically want to have in basic database functionality.
Some of these adaptations may not fit the semantics being
defined by DC. Finally, as noted below, PhiloLogic does not currently
do anything with a number of DC elements. We strongly recommend that
these be used, as specified by Dublin Core, for future reference. We
expect that some or all of the materials will be used in PhiloLogic and for
We have provided a more detailed outline of
Recommended Dublin Core Contents for
use under PhiloLogic. This document specifies in detail the
format and contents of Dublin Core elements that will work
best under PhiloLogic. PhiloLogic is compatible with basic
Dublin Core as defined in the unqualified Dublin Core specification.
Randomly selected example:
<meta name="DC.title" content="De rationali et ratione uti">
<meta name="DC.creator" content="Gerbertus Auriliacensis">
<meta name="DC.publisher" content="Patrologia latina, vol. 139. J. P. Migne, ed. Parisiis: excudebat Migne, 1853">
<meta name="DC.date" content="MED">
<meta name="DC.identifier" content="GerAur, DeRaEtR">
<meta name="DC.contributor" content="Chadwyck-Healey (Release 5: 1995)">
<meta name="DC.format" content="ARTFL HTML-SGML">
<meta name="DC.language" content="la">
<meta name="DC.rights" content="c. 1995 Chadwyck-Healey Inc. Do not export or print from this database without checking your licence agreement to see what is permitted.">
Important note: I need to document alternative bibliographic control,
the refer format, since we do load databases in this way as well.
In fact, the loader extracts the bibliographic information from the
DC representation and generates a refer bibliography with
some additional information required by the loader.
Text Data Encoding
This section contains BOTH data specifications and behavior
(implementation) specifications, which need to be separated out!!
We will adopt a multi-level structural hierarchy for main textual objects.
The top level is the document, and lower levels are as follows:
<h1> ... </h1> ||1st level division
<h2> ... </h2> ||2nd level division
<h3> ... </h3> ||3rd level division
This is HTML and maps well to the SGML/TEI div0 through div_n (so <div1>
... </div1> --> <h1> ... </h1>, <div2> ... </div2> --> <h2>
... </h2>, etc.).
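To make the hierarchy concrete, the following Python sketch collects the headings that define structural levels and prints them as an indented table of contents. It assumes one heading per line and treats levels deeper than 3 as display-only, in line with the rest of this section; the names structural_headings and table_of_contents are hypothetical and this is not the loader's code.

```python
import re

# One heading per line, e.g. <h2>Chapter 1</h2>; tag case is ignored.
HEADING = re.compile(r"^\s*<h([1-9])[^>]*>(.*?)</h\1>\s*$", re.IGNORECASE)
MAX_STRUCTURAL_LEVEL = 3  # deeper headings are kept for display only


def structural_headings(lines):
    """Return (level, title) pairs for headings that define text objects."""
    found = []
    for line in lines:
        match = HEADING.match(line)
        if match and int(match.group(1)) <= MAX_STRUCTURAL_LEVEL:
            found.append((int(match.group(1)), match.group(2).strip()))
    return found


def table_of_contents(lines, indent="  "):
    """Indent each title by its level to show nesting."""
    return [indent * (level - 1) + title
            for level, title in structural_headings(lines)]


sample = [
    "<h1>Act I</h1>",
    "<h2>Scene 1</h2>",
    "<p>text and tags",
    "<h4>Aside</h4>",
    "<h2>Scene 2</h2>",
]
for entry in table_of_contents(sample):
    print(entry)
```

In this sample the <h4> heading is skipped because it lies below the third level, and the Act-Scene structure uses only two of the three levels.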
At this time any division level beyond 3 will be translated
as an HTML heading tag for display purposes only and will not indicate
a structural level of textual objects. We are considering handling
more than 3 nested levels, but to date we have only encountered 1 example
of documents that required more than 3 levels: certain
A document may not contain
all these divisions (it may only have <h1> and <h2>). The three
level hierarchy can reflect any number of possible structures, such as
Book-Chapter-Verse, or Act-Scene (leaving <h3> blank). Another
example might be the following:
text and tags
text and tags
text and tags
And so on...
Note: The system expects section breaks -- <h[1-n]> ... </h[1-n]> --
to appear on a separate line. It is best if the entire header
title appears on a single line:
These can be as long as you want, but remember that we use them to
display tables of contents, which use indentations to indicate nesting,
so really long header titles will wrap, making the display less
Full document navigation is in full production for PhiloLogic
databases as described in the
Retrieving and Navigating Documents and
Navigating Documents from Word Searches of
Further object levels which descend from the lowest division level
are as follows:
Sentences ||delimited by punctuation or explicit tags
Words ||delimited by white space and punctuation
Pages will also constitute textual objects, even though they
do not fit into the structural hierarchy outlined above. We will
tag page breaks as <page n="[ANY STRING]">
where [ANY STRING] is a page object identifier (e.g. page number) from
the source edition. Every text object should refer to its page identifier
for display purposes: users will want to know the page identifier
for the text object (such as a paragraph) containing the text they queried.
We have decided that the value of the page numbers noted here will
NOT include spaces. Generally, I would keep these babies
short, since it is a matter of display space in KWIC reports. In
general, these look like:
but of course, pages can have letters and other oddities,
for page 12 of volume one. And, you can have alternates, just don't
use a space:
but of course
<page n="23:3"> or
might be just as effective.
Page tags must appear on their own lines.
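The two rules for page tags (no spaces in the identifier, and the tag on its own line) can be checked before loading. The Python sketch below is only an illustration under those two assumptions; check_page_tags is a hypothetical helper, not part of PhiloLogic.

```python
import re

PAGE_TAG = re.compile(r'<page n="([^"]*)">', re.IGNORECASE)


def check_page_tags(lines):
    """Return (line_number, message) pairs for page tags that break the rules."""
    problems = []
    for number, line in enumerate(lines, start=1):
        for match in PAGE_TAG.finditer(line):
            identifier = match.group(1)
            if identifier == "" or any(ch.isspace() for ch in identifier):
                problems.append((number, f"bad page identifier {identifier!r}"))
            if line.strip() != match.group(0):
                problems.append((number, "page tag is not on its own line"))
    return problems


sample = [
    '<page n="23:3">',
    'some text <page n="24"> more text',
    '<page n="vol 1 p 12">',
]
for problem in check_page_tags(sample):
    print(problem)
```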
For search and retrieval purposes, any front matter (such as the editor's
preface, title page, etc.) or back matter (appendices, indices, etc.) will
be considered a logical unit of text tagged as a first-level structure.
The new loader should implement this system for tagging text structure,
but it should also be able to handle texts already marked up in HTML without
further modification. In other words, missing <page n="..."> markers
or <h1> ... </h1> tags should not prevent a file from loading into
Ideally, all sentences would have explicit tags to differentiate them from
abbreviations and other uses of punctuation. However, if we are going
to accept any properly tagged HTML document, we will have to allow for
Explicit sentences will be tagged as follows:
<sent>Blah blah blah blah, etc., but blah
blah blah.<sent> Blah blah!
Sentences will start,
as expected, with any <p> or division break <h[1-3]>.
Implicit sentences will be identified by the punctuation marks .
? ! and object tags (e.g. <h1>, <p>, etc.) as
we have in the current loader.
Note: the system will, for every paragraph, check to see if there are
any <sent> tags. If not, it will apply implied sentence recognition.
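The per-paragraph choice between explicit and implied sentences might look like the Python sketch below. It assumes that every <sent> tag (and </sent>, if present) marks a sentence boundary and that implied sentences end at . ? or ! followed by white space; the actual loader may apply finer rules, and split_sentences is a hypothetical name.

```python
import re

SENT_TAG = re.compile(r"</?sent>", re.IGNORECASE)
# Implicit rule: a sentence ends at . ? or ! followed by white space.
IMPLICIT_BREAK = re.compile(r"(?<=[.?!])\s+")


def split_sentences(paragraph):
    """Split one paragraph into sentences, preferring explicit <sent> tags."""
    if SENT_TAG.search(paragraph):
        # Explicit tagging: only the tags mark boundaries, so
        # abbreviations such as "etc." do not end a sentence.
        pieces = SENT_TAG.split(paragraph)
    else:
        pieces = IMPLICIT_BREAK.split(paragraph)
    # Normalize internal white space and drop empty pieces.
    return [" ".join(piece.split()) for piece in pieces if piece.strip()]


explicit = "<sent>Blah blah, etc., but blah\nblah blah.<sent> Blah blah!"
implicit = "Blah blah blah. Blah blah? Blah!"
print(split_sentences(explicit))
print(split_sentences(implicit))
```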
In some cases, words have tags within them. In the OVI database,
for example, we find medieval Italian words with italics inside:
In order to index these words as word objects, the loader should not
treat tags as spaces.
The current ARTFL loader handles all valid HTML special characters (for
a list of these characters, see the official
list). There are also many SGML character entities which do not
map to ISO Latin, such as &obar;
(the letter "o" with a macron over it). Commercially
available data offers a wide variety of unofficial SGML character representations,
so the preceding example is one of many possibilities. In order to
account for unusual character representations, the loader should generate
a words.R file
where each entry has two fields:
The first field is a reduced search key, the second is the full word.
We can generalize this and add entries to the words.R
file whenever there is a word containing an SGML character entity of the
which would then lead to adding an entry to the words.R
file as follows:
We can also use this two-field index file to search
for words containing parentheses, brackets, and other punctuation that
automatically delimited a new word in the old loading format. We
can map words with spelling variations (as is common with variant editions
of the same medieval text) to the same canonical form, such as
While the simplified spelling will help many users find all the variations
of a word or phrase, we should also allow for searches that specify a particular
spelling: users should be able to search only for fire(n)ze
if they wish.
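As a sketch of the two-field idea, the Python below derives a reduced search key from each full word and emits one entry per word whose key differs from the word itself. Several things here are assumptions made for illustration: the tab separator, the rule that an entity reduces to the first letter of its name, the removal of parentheses and brackets, and the sample word "c&obar;me"; the function names are hypothetical as well.

```python
import re

ENTITY = re.compile(r"&([A-Za-z])[A-Za-z0-9]*;")
BRACKETS = re.compile(r"[()\[\]]")


def reduced_key(word):
    """Build a simplified search key for a word as it appears in the text."""
    key = ENTITY.sub(lambda m: m.group(1), word)  # &obar; -> o (assumed rule)
    key = BRACKETS.sub("", key)                   # fire(n)ze -> firenze
    return key.lower()


def words_r_entries(words):
    """Yield 'key<TAB>word' lines for words whose key differs from the word."""
    seen = set()
    for word in words:
        key = reduced_key(word)
        if key != word and (key, word) not in seen:
            seen.add((key, word))
            yield f"{key}\t{word}"


for line in words_r_entries(["fire(n)ze", "firenze", "c&obar;me", "fire(n)ze"]):
    print(line)
```

A search on the key finds every variant, while a search on the second field can still be restricted to one particular spelling.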
In the future, we may be able to extend this representation
of unusual spellings and characters to include fields for things such as
parts of speech, root forms, etc.
Users could then construct search queries such as
"Find a verb followed by some form of 'Firenze'" by typing something like
"VERB AND firenze."
As an extension to this tag set for future search
engine developments, we will include tags for proper nouns (to distinguish
them from words which begin sentences, for instance). The tags will
NOT IMPLEMENTED YET
I want to reserve <spkr>Speaker's Name</spkr>
and possibly the
<spkr name="SOME NAME">
maybe lots of it </spkr>
for future reference. Leonid and
I may be embedding these babies in the main ARTFL database (TLF).
I suspect that IF we implement something like that, it would
be used mainly for theatre. I want to keep the door open.
NOT IMPLEMENTED YET
Notes (footnotes and endnotes)
March '06: A while back, I decided that we would want to keep all
notes in their own <h1> (or <div1>) and link to them. This allows
for coherent object processing in the text AND rapid selection of
objects that have notes. We have preprocessors that move notes to
the end of <div>s or <h1>s.
<notetext n="0" xpg="398" xpgobj="136">* Sono contrassegnati da
un asterisco i capitoli di altri a Veronica Franco.
Links to NOTES in the text (behaves like a ref):
deposited the mummies<note n="21" ref="21"> that had been
<note n="0" ref="a">, che degna gloria
<note n="5" ref="*">
where n=NOTE NUMBER -- the internal identifier which links to the
note and ref=DISPLAY IDENTIFIER (the thing that gets displayed to
indicate there is a note). We do not want numbers or other
note identifiers in the running text, since these would be indexed
as characters and break word adjacent searching.
The notetext tag appears as indicated. In 2t loaders, you
want paragraphs between them, for 3t, the notetext tag will suffice.
<notetext n="21" xpg="58" xpgobj="58"> <i>mummies</i>.</notetext>
<notetext n="5" xpg="398" xpgobj="136">* Sono contrassegnati da un
asterisco i capitoli di altri a Veronica Franco.
where n=NOTE NUMBER -- a string tied to the note --
xpg=PAGE IDENTIFIER -- usually a page number -- and xpgobj is
an INT counted from 0 in the doc. Thus,
<notetext n="8" xpg="VIII" xpgobj="8">(9) Giulio Secondo....</notetext>
indicates that the page identifier = "VIII" which is the tag for the
8th page object. We normally echo out the Note Identifier at the beginning
of the notetext, e.g. (9) or "*". It is called by
<note n="8" ref="9">
Most important is that the two values for n="VALUE" match,
since this ties the reference to the notetext.
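Because the link depends entirely on the two n values agreeing, a simple consistency check can be run over a document. The Python sketch below assumes the attribute order shown in the examples (n first, then ref) and is only an illustration; unmatched_notes is a hypothetical helper.

```python
import re

NOTE_REF = re.compile(r'<note\s+n="([^"]*)"\s+ref="([^"]*)">', re.IGNORECASE)
NOTE_TEXT = re.compile(r'<notetext\s+n="([^"]*)"', re.IGNORECASE)


def unmatched_notes(text):
    """Return (references without a notetext, notetexts never referenced)."""
    referenced = {m.group(1) for m in NOTE_REF.finditer(text)}
    defined = {m.group(1) for m in NOTE_TEXT.finditer(text)}
    return sorted(referenced - defined), sorted(defined - referenced)


document = (
    'deposited the mummies<note n="21" ref="21"> that had been\n'
    '<note n="5" ref="*">\n'
    '<notetext n="21" xpg="58" xpgobj="58"> <i>mummies</i>.</notetext>\n'
)
missing_text, unused_text = unmatched_notes(document)
print(missing_text, unused_text)
```

Here note 5 is reported because the sample contains a reference to it but no matching notetext.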
Links to external resources through WWW/http
This is going to be a convention, since the main system
will not have to handle this. Since we are using HTML as
a base encoding, we can, of course, simply accept full URLs or PURLs.
But we would rather NOT
do this since it means hardwiring
addresses in the database.
External object linkage is performed in format.ph, typically
by expanding an internal tag to a URL:
s/<FIGURE INLINE="." SYSID="([^"]*)">/<IMG SRC=$image_server$1>/g;
where $image_server is set to an appropriate value in format.ph
$image_server = "http://www.lib.uchicago.edu/efts/EVD/figures/";
For administrative ease and consistency, it is wise to adopt general
conventions which we can put in the default installed format.ph
script. Conventions are noted below.
Since almost all of our current databases only have images, this is
the logical place to start. After looking at a variety of SGML specifications,
I have to admit that I like the information provided in the
<figure sys.id="V0740035.TIF" inline=n figno="35">
This allows us to clearly build a link and determine how an image
should be displayed. Thus,
<FIGURE INLINE="Y" SYSID="FILE_NAME.EXT">
<FIGURE SYSID="FILE_NAME.EXT">
will be functionally equivalent, building a link to an image
for display in the WWW browser using the
<IMG SRC=... HTML construction. Similarly,
<FIGURE INLINE="N" SYSID="FILE_NAME.EXT">
will be used to build clickable image links using the
<A HREF="..." construction.
There are a couple of assumptions here.
- the variable set as $image_server will show the
protocol://computer.address/directory/path/ to the image.
- Image file names should be unique to the database OR
the FILE_NAME.EXT should have a relative path from
the path specified in $image_server
$image_server = "http://www.lib.uchicago.edu/efts/VOLTAIRE/figures/";
# The following goes into the Object formatter
s/<FIGURE INLINE="Y" SYSID="([^"]*)">/<IMG SRC="$image_server$1">/g;
s/<FIGURE SYSID="([^"]*)">/<IMG SRC="$image_server$1">/g;
# You can modify the formatting of the link
s/<FIGURE INLINE="N" SYSID="([^"]*)">/[<a href="$image_server$1">image<\/a>]/g;
Audio and other specifications
[to be determined]
This is a random collection of implementation specific notes that
will have to be moved into more complete documentation.
- Character separation. Apostrophes are assumed to be
part of words, so you must explicitly split them. We
have done this in order to allow the database administrator
to determine word breaking. Thus, l' état should
have a space while aujourd'hui probably should not.
- Word separation. SGML/HTML elements that are not
specifically used by PhiloLogic typically are passed blindly
through the system, as word separating white space. In order
to handle words that have italics, bold, underlines and super/sub
scripts, we have decided NOT
to treat the following tags as word separators:
  - underlining <u>, </u>, <U>, </U>
  - bold <b>, </b>, <B>, </B>
  - italics <i>, </i>, <I>, </I>
  - superscript <sup>, </sup>, <SUP>, </SUP>
  - subscript <sub>, </sub>, <SUB>, </SUB>
Thus words separated only by such tags will be run together for
indexing and display purposes. We have come across this in
some SGML representations and warn you up front. The tags are
removed from the INDEX for searching purposes (obviously).
April 2, 99: ALL hyphens will act as word separators!
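Taken together, the word-breaking notes above could be sketched in Python as follows. The sketch assumes exactly the rules stated here: the five inline tags are removed without splitting a word, every other tag acts as white space, apostrophes stay inside words, and hyphens separate words. The sample text and the name index_words are hypothetical, and the real loader may differ in detail.

```python
import re

INLINE_TAGS = re.compile(r"</?(?:u|b|i|sup|sub)>", re.IGNORECASE)
OTHER_TAGS = re.compile(r"<[^>]*>")
# Letters, digits and apostrophes form words; hyphens and the rest separate.
WORD = re.compile(r"[^\W_]+(?:'[^\W_]*)*")


def index_words(text):
    """Return the words of a tagged text as they would be indexed."""
    text = INLINE_TAGS.sub("", text)   # join: fire<i>n</i>ze -> firenze
    text = OTHER_TAGS.sub(" ", text)   # any other tag separates words
    return WORD.findall(text)


sample = "<p>l' état, aujourd'hui: fire<i>n</i>ze<br>arc-en-ciel"
print(index_words(sample))
```

Note that "l' état" yields two words only because the source text has an explicit space after the apostrophe, which is the database administrator's decision.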
Notes are a problem in general. We are putting them at the end
of documents as an h3 ... in order to get page fetching functioning
properly, we will also add a page tag before the h3 notes......
University of Chicago.
: date here, please