PhiloLogic 3 release notes

All test samples are loaded without any database-specific customizations (except to point dictionary linking to appropriate resources), using the load command

philoload DBNAME --loadsql --sqluser [DELETED] --sqlpass [DELETED] --linksourcetexts *.[tei/sgm/xml]

from the directory containing the texts. We want this to run out of the box for this set of tests.

PhiloLogic 3.1

This is a major rewrite of most of the cgi-bin code to support multilingual user interfaces. All 300 system messages are now found in language-specific arrays which can be specified by the database administrator. We currently have French and English messages. If you are using PhiloLogic and want to help by translating the interface into other languages, please let us know and we will be happy to assist you in any way that we can.

All language arrays are copied into each database lib directory. Selection of base language is done in philo-db.cfg.

Please note that we have NOT translated system-generated search forms. We have found that search forms and headers are frequently heavily modified by users and administrators. We have also opted not to support dynamic language selection in the distribution, but this would be trivial to add. If we find we need to do it, we will add the patch to the PhiloLogic wiki. If you add this, please let us know.

PhiloLogic 3.002

Added several minor upgrades and numerous small fixes.

BUTNOT operator: A word vector filter which subtracts a secondary pattern from a primary pattern, e.g. "christ* BUTNOT christm*" (no quotes in searches) -- or in concise notation "christ*|-christm*" -- to filter forms of christmas from christ* for searching. This works in standard PhiloLogic notation with AND, OR, NOT. BUTNOT is proposed because the NOT operator in PhiloLogic functions as a proximity NOT, so "NOT christ jesus NOT christ" finds instances of jesus when not preceded or followed by christ -- the personal jesus. Example:

http://philologic.uchicago.edu/philologic3/docsouth.html

Try "christ.* BUTNOT christ[im].*" (no quotes). Concise notation "christ.*|-christ[im].*" (Note the final ".*" required to distinguish from "[im]*").
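As a rough illustration of the subtraction that BUTNOT performs, the following Python sketch expands a primary and a secondary pattern against a word list and keeps only the forms that match the first but not the second. The function name butnot, the sample vocabulary, and the use of Python's re module are assumptions made for illustration; this is not PhiloLogic's own code.

```python
import re

def butnot(vocabulary, primary, secondary):
    """Return the words that fully match `primary` but not `secondary`."""
    keep = re.compile(primary)
    drop = re.compile(secondary)
    return [w for w in vocabulary
            if keep.fullmatch(w) and not drop.fullmatch(w)]

# Hypothetical word list standing in for a database's word index.
vocabulary = ["christ", "christian", "christmas", "christs", "church"]

# Similar in spirit to the search: christ.* BUTNOT christ[im].*
forms = butnot(vocabulary, r"christ.*", r"christ[im].*")
print(forms)
```

The surviving forms would then be used as the word list for the actual search.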

Ajax note handler ... a configuration-selectable note displayer that uses Ajax to get the note from the server and toggle its display in the text rather than in a pop-up browser window. Example:

http://cassat.uchicago.edu/cgi-bin/philologic/getobject.pl?c.8:3.lincoln

Works for ARTFL TEI formatted note tags and ATE (see below).

An experimental OS-X GUI loader. For those allergic to command line computing, this is an alternative to the command line loader and offers options. Proof-of-concept at this point.

NON-TEI encoding scheme support: ATE, DocBook, and plaintext.

Plaintext by popular demand. Yes indeed, we have had people ask for it to be included in PhiloLogic. Tested on Gutenberg and Liberliber documents. Character conversion to UTF-8 from whatever character encoding you might get them in is *strongly recommended*, because we can have result pages that mix materials from different documents. Example, Gutenberg Spanish and German documents:

http://philologic.uchicago.edu/philo3002demo/gutenberg.form.html

Note that PhiloLogic will handle earlier character representations, but some modification to headers, etc. would be required.

DocBook (Prototype support). Could be extended if there is interest. Again, proposed by a PhiloLogic user/hacker. Example:

http://philologic.uchicago.edu/philo3002demo/docbooklit.form.html

This is not fully supported, but could be if there is sufficient demand.

ATE: ARTFL Text Encoding. This is/was an intermediate internal encoding scheme consisting of Dublin Core headers, HTML (reasonably handled), with optional tagging for things like pages, notes, sentences, and so on, very lightly documented at http://philologic.uchicago.edu/ATE/ Example:

http://philologic.uchicago.edu/philo3002demo/lincoln.form.html

Caveat emptor: PhiloLogic will probably load arbitrary HTML, but this may not always work, particularly if you have use of

PhiloLogic 3.001

New internal search engine (search3). Resolves library incompatibility bugs in new Linux releases noted in search2. Extensible in new ways and supports full object searching. The Linux and OS-X installations now have 64-bit index addressing, so this should be able to handle about a terabyte of TEI-encoded text data.

NOT text search operator: Try "NOT christ jesus NOT christ" (no quotes) as a test in docsouth or EEBO. Concise regex notation: !chr.st.? jesus !chr.st.?
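To make the proximity behaviour of NOT concrete, here is a small Python sketch that finds occurrences of a target word whose neighbours do not match an excluded pattern. It assumes that "proximity" means the single adjacent word on either side; the real engine's window may differ, and the function name proximity_not and the sample sentence are invented for the example.

```python
import re

def proximity_not(tokens, target, excluded):
    """Return positions of `target` not adjacent to a word matching `excluded`.

    Assumption: only the immediately preceding and following words are checked.
    """
    bad = re.compile(excluded)
    hits = []
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        before = tokens[i - 1] if i > 0 else ""
        after = tokens[i + 1] if i + 1 < len(tokens) else ""
        if bad.fullmatch(before) or bad.fullmatch(after):
            continue  # target is next to an excluded word, so skip it
        hits.append(i)
    return hits

# Made-up sample text, already lower-cased and split on whitespace.
tokens = "jesus christ spoke and then jesus wept for christ jesus".split()

# Similar in spirit to: !chr.st.? jesus !chr.st.?
positions = proximity_not(tokens, "jesus", r"chr.st.?")
print(positions)
```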

Searching for, and within, divs by type and head, as well as by fields extracted from opener/closer, author/signed, dateline, salutation. The table for divs also has fields for id, n, and lang -- being populated if found -- and placename, classification and partofspeech (not populated at the moment, future use). Merges biblio and object searches.

Full word searching on selected subdiv objects: lg, note, epigraph, sp, and a couple of others. You can search on tag -- lg -- and type (hymn). Merges biblio and object searches. Fields in this table are tag, type, n, id, who, lang, which are being populated when data is found.

SQL subdoc object management. This includes dynamic terms buttons which give frequencies of values with other values selected in the same object level. This is also required support to standoff nested object mark-up.

Automatic generation of "whizbang" search form templates with examples drawn from your data.

Reimplemented "more hits" ... a sliding list of twenty blocks. The block size and number of blocks are set in philo-db.cfg.
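One way to picture a sliding list of result blocks is sketched below in Python. This is an interpretation, not the cgi-bin code: the function name block_window, the default block size of 25, and the choice to centre the window on the current block are all assumptions; only the figure of twenty blocks comes from the note above.

```python
def block_window(total_hits, current_block, block_size=25, num_blocks=20):
    """Return (first_hit, last_hit) ranges for the blocks listed around current_block.

    Blocks are numbered from 0; hit numbers are 1-based.
    """
    total_blocks = (total_hits + block_size - 1) // block_size
    if total_blocks == 0:
        return []
    # Slide the window so that the current block stays inside it,
    # without running past either end of the result set.
    first = max(0, min(current_block - num_blocks // 2, total_blocks - num_blocks))
    last = min(total_blocks, first + num_blocks)
    return [(b * block_size + 1, min((b + 1) * block_size, total_hits))
            for b in range(first, last)]

window = block_window(total_hits=1000, current_block=0)
print(window[0], window[-1])
```

With these assumed defaults, moving to a later block shifts the list forward rather than growing it.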

No limit on search results ... well, a million. This is set in the general philo configuration.

In single document searching, the user may select any object. Multiply included objects (for example, selecting a div1 and then a div3 inside that div1) are filtered out to avoid repeats.
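The filtering of multiply included objects can be sketched as follows, assuming each selected object is identified by a tuple path such as (div1, div2, div3). The representation and the function name drop_nested are hypothetical and chosen only to show the idea of discarding any object whose ancestor is already selected.

```python
def drop_nested(selected):
    """Keep only objects that have no selected ancestor."""
    chosen = set(selected)
    kept = []
    for obj in selected:
        # Every proper prefix of the path is an ancestor of the object.
        ancestors = (obj[:n] for n in range(1, len(obj)))
        if not any(a in chosen for a in ancestors):
            kept.append(obj)
    return kept

# A div1, a div3 inside that same div1, and an unrelated div2.
selection = [(1,), (1, 2, 4), (3, 1)]
filtered = drop_nested(selection)
print(filtered)
```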

KWIC resorting option on left and/or right contexts as well as selected bibliographic information.
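A minimal Python sketch of KWIC resorting is given below. It assumes each hit is a (left context, keyword, right context) triple and that left contexts are compared starting from the word nearest the keyword; both are assumptions for illustration, as are the function name sort_kwic and the sample lines.

```python
def sort_kwic(hits, side="right"):
    """Sort KWIC lines on the left or right context, ignoring case."""
    def key(hit):
        left, _keyword, right = hit
        if side == "left":
            # Nearest word to the keyword is compared first.
            return [w.lower() for w in reversed(left.split())]
        return [w.lower() for w in right.split()]
    return sorted(hits, key=key)

# Made-up concordance lines.
hits = [
    ("and then", "jesus", "wept alone"),
    ("the lord", "jesus", "christ spoke"),
    ("our blessed", "jesus", "said unto"),
]

by_right = sort_kwic(hits, side="right")
by_left = sort_kwic(hits, side="left")
print([h[2] for h in by_right])
print([h[0] for h in by_left])
```

Sorting on bibliographic fields would work the same way, with the key drawn from the document's metadata instead of the context.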

Extensive debugging information, enabled only from philo-db.cfg as a security measure.

Standard support for ARTFL TEI Lite recommendations, including metadata, notes, etc. Consult our local encoding recommendations.

Metadata extraction in the poor man's extractor for TEI, MEP, and CES. Textload is known to handle all three. ARTFL Text Encoding (ATE) is in another set of recognizers.

Textload now has a configuration file in the philologic home, which allows you to define parameters for the load.

Word count per document (standard) and FREQUENCY PACKAGE.

A standard installation of the frequency package. This reads the word/document data generated by textload. This is an integrated package which can be optionally built after the database load. It requires SQL and may take significant time to load. Examples (loaded with command DBDIR/frequencies/makefrequencies DBNAME):

Timeseries: DocSouth or BibLioNet
Frequencies: DocSouth or BibLioNet
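For readers unfamiliar with this kind of data, the Python sketch below shows how per-document word counts and word frequencies can be derived from word/document pairs. The input format, the function name document_frequencies, and the sample pairs are assumptions; the actual package reads textload's output and stores its results in SQL.

```python
from collections import Counter, defaultdict

def document_frequencies(word_doc_pairs):
    """Count how often each word occurs in each document."""
    counts = defaultdict(Counter)
    for word, doc in word_doc_pairs:
        counts[doc][word] += 1
    return counts

# Hypothetical (word, document) pairs, one per word occurrence.
pairs = [("jesus", "doc1"), ("wept", "doc1"), ("jesus", "doc1"), ("jesus", "doc2")]

freqs = document_frequencies(pairs)
# Total word count per document.
word_totals = {doc: sum(c.values()) for doc, c in freqs.items()}
print(freqs["doc1"].most_common(1), word_totals)
```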

ADD-ONS: Full support for various dictionary look-ups. Enabled from configuration.

# Enable dictionary look-up function. Set to 0 to turn it off.
# See quickdickjs for further details.
# Options: 1 = ARTFL one look dictionary function with morphological
#              package. Obviously for French.
#          2 = Oxford English Dictionary.
#          3 = ARTFL Websters Dictionary
#          4 = onelook.com

Additional code in "goodies" to hook-up PhiloLogic to TaporWare and to force a similarity search for "dirty OCR applications".

Discussions: some gory details mainly for developers

Bugs: Doubtless!! I think we've cleared most of the list. Let us know.

NOTE: Some nested subdiv objects, most notably sp, lg, and stage tags, may conflict with one another. This has to do with preset object depths as a holdover from the PhiloLogic2 series. See FUTURE DEVELOPMENTS below. Slight modifications to rules will fix most of these, but a long-term fix requires a deeper object index function.

Discussing: subdiv object report generator. Currently, if you search for subdiv objects (in a selected set of docs or whole database) it will simply report these by types and attributes by document. Not sure how these should be handled. Suggestions?

textload.cfg has an instruction to dump pretty raw XPATHS in div and subdiv tables. Might be useful.

Testing Needed: Well, everything, of course. But especially internal document navigation. This is based on a single object link table. Seems to work. Using this for notes, and other internal cross references, such as tables of contents, indexes, etc.

NOT Implemented at this time:

stylometric statistic generation: See the todo list. Reason: TEI data is far too variant to decide on what constitutes a block -- paragraph. Sentence recognition is still pretty basic and subject to variations. So the most interesting data for stylometrics is too dependent on parser behavior to be very reliable. A good idea, but probably not now....

Future Development in rough order of priority:

Internationalization: we need to move all user level error messages to an array and support multiple language interfaces.
Extended Object Index Depths: the new underlying text search module supports extended object indexing for searching and retrieval. Once we have a canonical PhiloLogic3 running with the fixed object depth indices, we will be implementing and testing extended object depth processing. This will require modifications to a variety of components. [Related to this will be new objects from NLP systems (noun/verb phrases) and possibly a word/object type field in the base indices].
XML-tools based text parser? Or a rewrite of the poor man's textload to behave more like the poor man's metadata extractor? Probably as part of Extended Object development. And probably as an option ... still lots of big SGML databases out there and around here.
Let us know if there are things we should add.