This document is NOT complete. It was, at one point, cleaner, but I have been adding issues that we need to resolve. I still need to write a public-release-level ATE specification. MVO.
Objective
The ATE (ARTFL Text Encoding) specification is used for internal data representation by our new generation of ARTFL search and reporting software, PhiloLogic. We have determined that attempting to develop a large-scale text searching and analysis package that would treat raw SGML documents by reading DTDs and automatically building a single consistent data representation would be too costly. Thus, we had to specify a data representation that can reasonably represent a wide variety of textual features with a basic level of consistency across various documents. ATE is our first attempt to develop an encoding standard for use in PhiloLogic.
Rather than create "yet another standard", we decided that our internal data representation should leverage existing standards whenever possible, and use standards that are simple, well documented, stable, and for which there are a large number of existing and free/inexpensive tools. We determined that the most effective representation is the combination of the (unqualified) Dublin Core 1.0 metadata specification with very basic HTML. Only a few extensions are currently used to build ARTFL-style databases: page identifiers, explicit sentence tags, and proper name tags, all of which are optional. PhiloLogic completely IGNORES ALL SGML tagging that we have not specifically described below. This allows compatibility with other systems, in that you can pass tags blindly through the system.
By using a very basic and well-documented set of encoding specifications, the design of PhiloLogic and ATE will allow users to build databases from different encoding schemes (see our SGML to ATE document LINK COMING), allow individuals to develop ATE-compatible documents directly using existing tools, and allow us to use PhiloLogic to index arbitrary collections of HTML documents in a WWW space with minimal encoding and no modification. In contrast to the TEI and other specifications, which work from the top down to provide an infrastructure for all possible encodings in a system-independent fashion, ATE is system dependent and built from the bottom up. We specify only tagging that we actually have in production or that we are planning to treat in PhiloLogic. Extensions to this basic specification will be, as always, completely optional and will use existing schemes, preferably TEI specifications, whenever possible. We believe that a simple representation is sufficiently rich for the creation of complex databases and easily used by many scholars in environments with which they are already familiar.
ATE: Dublin Core Header
We selected the Dublin Core specification because it is simple -- designed to be used by non-catalogers as well as resource description specialists -- interoperable across many domains, readily extensible, well supported, consistent with HTML documents, and is garnering international support. Its simplicity and coverage allow us to map the kinds of bibliographic support that ARTFL has provided for years onto an easy-to-manage encoding.
For default functionality of PhiloLogic bibliographic control, we require only a selected subset of the 15 base data elements of the Dublin Core.
The 15 base elements are, with a quick indication of the desired contents of each element:
<head>
This information is mapped to a refer bibliography containing the following information in the standard layout, which PhiloLogic uses to build the bibliographic data handler:
%a Rousseau, J.-J.
At this time, the default bibliographic representation allows for searching on
We have provided a more detailed outline of Recommended Dublin Core Contents for use under PhiloLogic. This document specifies in detail the format and contents of Dublin Core elements that will work best under PhiloLogic. PhiloLogic is compatible with basic Dublin Core as defined in the unqualified Dublin Core specification.
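
The mapping from a Dublin Core header to a refer-style record can be sketched as follows. This is only an illustration under stated assumptions, not PhiloLogic's loader: the `<meta name="DC.Element" content="...">` form is the common way of embedding Dublin Core in HTML and is assumed here, the function name and sample header are hypothetical, and only the `%a` field code is attested in this document (`%t` and `%d` are guesses).

```python
import re

# Hypothetical mapping from Dublin Core element names to refer-style codes.
# Only "%a" (author) appears in this document; "%t" and "%d" are assumptions.
DC_TO_REFER = {
    "creator": "%a",
    "title": "%t",
    "date": "%d",
}

# Assumed header form: <meta name="DC.Creator" content="...">
META_RE = re.compile(
    r'<meta\s+name="DC\.(\w+)"\s+content="([^"]*)"\s*/?>',
    re.IGNORECASE,
)


def dublin_core_to_refer(header: str) -> str:
    """Return one refer-style line per mapped Dublin Core element."""
    lines = []
    for element, content in META_RE.findall(header):
        code = DC_TO_REFER.get(element.lower())
        if code is not None:  # unmapped elements are skipped
            lines.append(f"{code} {content}")
    return "\n".join(lines)


# Made-up sample header, for illustration only.
sample_header = """<head>
<meta name="DC.Creator" content="Rousseau, J.-J.">
<meta name="DC.Title" content="Sample Title">
<meta name="DC.Publisher" content="Sample Publisher">
</head>"""

print(dublin_core_to_refer(sample_header))
```

Elements without an entry in the mapping (here, the publisher) are dropped, which mirrors the idea that only a subset of the 15 elements is needed for default bibliographic control.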
Randomly selected example:
Text Data Encoding
This section contains BOTH data specifications and behavior/implementation specifications, and the two need to be separated out!!
We will adopt a multi-level structural hierarchy for main textual objects.
The top level is the document, and lower levels are as follows:
Note: The system expects section breaks -- <h[1-n]> ... </h[1-n]> -- to appear on a separate line. It would be best if the entire header title appeared on a single line: <h1>Header Title</h1>
These can be as long as you want, but remember that we use them to display tables of contents, which use indentation to indicate nesting, so really long header titles will wrap, making the display less effective.
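
To show why single-line headers matter, here is a minimal sketch of how a table of contents could be derived from them. It assumes headers of the form `<hN>Title</hN>` alone on a line, as recommended above; the function name and the two-space indent are hypothetical, and this is not the PhiloLogic implementation.

```python
import re

# A section header on its own line: <h1>Title</h1>, <h2>Title</h2>, ...
HEADER_RE = re.compile(r"^\s*<h([1-9])>(.*?)</h\1>\s*$", re.IGNORECASE)


def table_of_contents(lines, indent="  "):
    """Return header titles, indented one step per nesting level."""
    entries = []
    for line in lines:
        match = HEADER_RE.match(line)
        if match:
            level = int(match.group(1))
            title = match.group(2).strip()
            entries.append(indent * (level - 1) + title)
    return entries


# Made-up sample text, for illustration only.
sample_lines = [
    "<h1>Preface</h1>",
    "<p>Some text.</p>",
    "<h1>Book One</h1>",
    "<h2>Chapter 1</h2>",
    "<p>More text.</p>",
]

for entry in table_of_contents(sample_lines):
    print(entry)
```

A header that shares its line with other text, or that is split across lines, is not recognized by this sketch, which is one practical reason for the one-header-per-line convention.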
Full document navigation is in full production for PhiloLogic databases, as described in the sections on Retrieving and Navigating Documents and Navigating Documents from Word Searches of the PhiloLogic User's Manual.
Pages will also constitute textual objects, even though they do not fit into the structural hierarchy outlined above. We will tag page breaks as <page n="[ANY STRING]">, where [ANY STRING] is a page object identifier (e.g. page number) from the source edition. Every text object should refer to its page identifier for display purposes: users will want to know the page identifier for the text object (such as a paragraph) containing the text they queried. We have decided that the value of the page numbers noted here will NOT include spaces. Generally, I would keep these babies short, since it is a matter of display space in KWIC reports. In general, these look like:
<page n="12">
Important Note: Page tags must appear on their own lines.
For search and retrieval purposes, any front matter (such as the editor's preface, title page, etc.) or back matter (appendices, indices, etc.) will be considered a logical unit of text tagged as a first-level structure.
The new loader should implement this system for tagging text structure, but it should also be able to handle texts already marked up in HTML without further modification. In other words, missing <page n="..."> markers or <h1> ... </h1> tags should not prevent a file from loading into a database.
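
The page-tracking behavior described above can be sketched as follows: each paragraph is paired with the most recent page identifier, and text that precedes any page marker (or a file with no markers at all) still loads. The function name and the choice of `None` for "no page seen yet" are assumptions for illustration, not the loader's actual behavior.

```python
import re

# A page break on its own line; the identifier may not contain spaces.
PAGE_RE = re.compile(r'^\s*<page n="([^"\s]+)">\s*$')


def paragraphs_with_pages(lines):
    """Pair each <p> line with the page identifier in effect at that point."""
    current_page = None  # no page marker seen yet
    results = []
    for line in lines:
        page = PAGE_RE.match(line)
        if page:
            current_page = page.group(1)
        elif line.strip().startswith("<p>"):
            results.append((current_page, line.strip()))
    return results


# Made-up sample text, for illustration only.
sample_lines = [
    "<p>Front matter without a page marker.</p>",
    '<page n="12">',
    "<p>First paragraph.</p>",
    '<page n="13">',
    "<p>Second paragraph.</p>",
]

for page_id, paragraph in paragraphs_with_pages(sample_lines):
    print(page_id, paragraph)
```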
SENTENCE TAGGING
Ideally, all sentences would have explicit tags to differentiate them from abbreviations and other uses of punctuation. However, if we are going to accept any properly tagged HTML document, we will have to allow for implied sentences.
Explicit sentences will be tagged as follows:
Implicit sentences will be identified by the punctuation marks . ? ! and object tags (e.g. <h1>, <p>, etc.) as we have in the current loader.
Note: the system will, for every paragraph, check to see if there are any <sent> tags. If not, it will apply implied sentence recognition.
WORD OBJECTS
In some cases, words have tags within them. In the OVI database, for example, we find medieval Italian words with italics inside: apri<i>le</i>, ma<i>no</i>. In order to index these words as word objects, the loader should not treat tags as spaces.
The current ARTFL loader handles all valid HTML special characters (for a list of these characters, see the official list). There are also many SGML character entities which do not map to ISO Latin, such as &obar; (the letter "o" with a macron over it). Commercially available data offers a wide variety of unofficial SGML character representations, so the preceding example is one of many possibilities. In order to account for unusual character representations, the loader should generate a words.R file where each entry has two fields:
firenze|fire[nze]
In the future, we may be able to extend this representation of unusual spellings and characters to include fields for things such as parts of speech, root forms, etc.
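
The word-object rule above (tags inside a word are not spaces) can be sketched as follows. This is an illustration under stated assumptions, not the ARTFL loader: the set of inline tags, the lowercasing, and the function name are hypothetical; other tags are treated as word separators; and character entities such as &obar; are not handled.

```python
import re

# Hypothetical set of inline tags that may occur inside a word.
INLINE_TAG_RE = re.compile(r"</?(?:i|b|em|u)>", re.IGNORECASE)
# Any remaining tag is treated as a separator between words.
ANY_TAG_RE = re.compile(r"<[^>]+>")
# A word is a run of letters, so hyphens and punctuation separate words.
WORD_RE = re.compile(r"[^\W\d_]+")


def word_objects(text):
    """Return the indexable words of a text, joining words split by inline tags."""
    text = INLINE_TAG_RE.sub("", text)  # inline tags vanish: apri<i>le</i> -> aprile
    text = ANY_TAG_RE.sub(" ", text)    # other tags act like spaces
    return [word.lower() for word in WORD_RE.findall(text)]


print(word_objects("apri<i>le</i>, ma<i>no</i>"))
print(word_objects("<p>porte-plume</p><p>fin</p>"))
```

Removing inline tags before tokenizing keeps "aprile" and "mano" whole, while replacing block tags with spaces prevents the last word of one paragraph from fusing with the first word of the next.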
PROPER NOUNS
As an extension to this tag set for future search engine developments, we will include tags for proper nouns (to distinguish them from words which begin sentences, for instance). The tags will be <pn>Proper Noun</pn>.
NOT IMPLEMENTED YET
Speaker Tags
I want to reserve <spkr>Speaker's Name</spkr> and possibly the alternative construct <spkr name="SOME NAME"> .... text, maybe lots of it </spkr> for future reference. Leonid and I may be embedding these babies in the main ARTFL database (TLF). I suspect that IF we implement something like that, it would be used mainly for theatre. I want to keep the door open.
NOT IMPLEMENTED YET
Notes (footnotes and endnotes)
March '06: A while back, I decided that we would want to keep all notes in their own <h1> (or <div1>) and link to them. This allows for coherent object processing in the text AND rapid selection of objects that have notes. We have preprocessors that move notes to the end of <div>s or <h1>s
<h1>Notes</h1>
Links to NOTES in the text (behaves like a ref):
deposited the mummies<note n="21" ref="21"> that had been
where n=NOTE NUMBER -- the internal identifier which links to the note -- and ref=DISPLAY IDENTIFIER (the thing that gets displayed to indicate there is a note). We do not want numbers or other note identifiers in the running text, since these would be indexed as characters and break word-adjacency searching.
The notetext tag appears as indicated. In 2t loaders, you want paragraphs between them; for 3t, the notetext tag will suffice.
<notetext n="21" xpg="58" xpgobj="58"> <i>mummies</i>.</notetext>
where n=NOTE NUMBER -- a string tied to the note tag -- xpg=PAGE IDENTIFIER -- usually a page number -- and xpgobj= is an INT counted from 0 in the doc. Thus,
<notetext n="8" xpg="VIII" xpgobj="8">(9) Giulio Secondo....</notetext>
indicates that the page identifier = "VIII", which is the tag for the 8th page object. We normally echo out the Note Identifier at the beginning of the notetext, e.g. (9) or "*". It is called by <note n="8" ref="9">
Most important is that the two values for n="VALUE" be identical, since this ties the reference to the notetext....
Links to external resources through WWW/http
External object linkage is performed in format.ph, typically by expanding an internal tag to a URL:
s/<FIGURE INLINE="." SYSID="([^"]*)">/<IMG SRC=$image_server$1>/g;
where $image_server is set to an appropriate value in format.ph, such as:
$image_server = "http://www.lib.uchicago.edu/efts/EVD/figures/";
For administrative ease and consistency, it is wise to adopt general conventions which we can put in the default installed format.ph script. Conventions are noted below.
Images
Since almost all of our current databases only have images, this is the logical place to start. After looking at a variety of SGML specifications, I have to admit that I like the information provided in the Chadwyck-Healey notation:
<figure sys.id="V0740035.TIF" inline=n figno="35">
This allows us to clearly build a link and determine how an image should be displayed. Thus,
There are a couple of assumptions here.
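
For readers more familiar with Python than Perl, the tag expansion performed in format.ph (described earlier in this section) can be sketched as below. This is an illustrative equivalent of the substitution quoted in this document, not part of PhiloLogic; the function name is hypothetical, and the sample tag combines the attribute form of the Perl pattern with the file name from the figure example.

```python
import re

# Base URL for images, as set in the format.ph example in this document.
IMAGE_SERVER = "http://www.lib.uchicago.edu/efts/EVD/figures/"

# Same pattern as the Perl substitution: any single-character INLINE value,
# with the SYSID value captured as the image file name.
FIGURE_RE = re.compile(r'<FIGURE INLINE="." SYSID="([^"]*)">')


def expand_figures(text, image_server=IMAGE_SERVER):
    """Replace internal FIGURE tags with IMG tags pointing at the image server."""
    return FIGURE_RE.sub(
        lambda match: f"<IMG SRC={image_server}{match.group(1)}>",
        text,
    )


sample = 'See <FIGURE INLINE="n" SYSID="V0740035.TIF"> for the plate.'
print(expand_figures(sample))
```

Keeping the server prefix in one variable means a collection can be moved to another host by changing a single setting, which is the point of putting such conventions in the default format.ph script.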
Example Code:
$image_server = "http://www.lib.uchicago.edu/efts/VOLTAIRE/figures/";
Audio and other specifications
[to be determined]
Nota Bene
This is a random collection of implementation-specific notes that will have to be moved into more complete documentation.
April 2, 99: ALL hyphens will act as word separators!
Notes are a problem in general. We are putting them at the end of documents as an h3 ... in order to get page fetching functioning properly, we will also add a page tag before the h3 notes......
ARTFL Project, University of Chicago. Revision Date: date here, please
Leonid Andreev, leonid@math.harvard.edu
Mark Olsen, mark@barkov.uchicago.edu