Developer Documentation

  1. Technical Overview
  2. Quick Start
  3. Installing PhiloLogic
  4. Preparing Your Collection
  5. Creating a Bibliography
  6. Loading a Database
  7. Loading Bibliographic Data into MySQL
  8. Test Searching Your Database
  9. Results Headers and Footers

Technical Overview

Originally implemented to support large databases of French literature, PhiloLogic has been extended to support a wide variety of textual and hypermedia databases in collaboration with numerous academic institutions and, more recently, commercial organizations. PhiloLogic is a modular system in which a textbase is treated as a set of coordinated databases, typically including an object database (units of text such as a letter, scene, or document), a word forms database, a word concordance index mapped to textual objects, and an object manager mapping text objects to byte offsets in data files. Each of these databases is stored and managed by its own subsystem.

Quick Start

Linux

Type the following commands in a terminal to get started with PhiloLogic using some of our standard configuration options.

Grab the source:

curl http://philologic.uchicago.edu/philologic3/distribution/philologic-v3.001.t3.tar.gz > philologic-v3.001.t3.tar.gz

Unzip and untar the source:

tar -zxvf philologic-v3.001.t3.tar.gz

CD to the source directory and configure the installation. Configuration values will vary across platforms. The default settings are described in the section below.

cd philologic-v3.001

./configure --with-authuser-group=a-group-you-trust

Run the make script:

make install

CD to a directory containing texts that you would like to load under PhiloLogic and run a standard load:

cd /path/to/textstobeloaded/

philoload databasename *.xml

If all goes well, you will receive a SUCCESS message at the end of the load. You can run some sample searches using PhiloLogic's default search form by pointing your browser here: http://yourserver/philologic/databasename.whizbang.form.html

OS X

Type the following commands in a terminal on OS X to get started with PhiloLogic using our standard configuration options.

Grab the source:

curl http://philologic.uchicago.edu/philologic3/distribution/philologic-v3.001.osx.t1.tar.gz > philologic-v3.001.osx.t1.tar.gz

Unzip and untar the source:

tar -zxvf philologic-v3.001.osx.t1.tar.gz

CD to the source directory and configure the installation:

cd philologic-v3.001.osx

CFLAGS="-I/sw/include -L/sw/lib" ./configure --with-cgi-path=/Library/WebServer/CGI-Executables/philologic --with-cgi-url=/cgi-bin/philologic --with-web-path=/Library/WebServer/Documents/philologic --with-web-url=/philologic --with-authuser-group=admin --with-init-d=$HOME --with-boot-init-d=$HOME

Run the make script:

make install

CD to a directory containing texts that you would like to load under PhiloLogic and run a standard load:

cd /path/to/textstobeloaded/

philoload databasename *.xml

If all goes well, you will receive a SUCCESS message at the end of the load. You can run some sample searches using PhiloLogic's default search form by pointing your browser here: http://yourserver/philologic/databasename.whizbang.form.html

Installing PhiloLogic

PhiloLogic's distribution is still very much beta-quality, and to date it has been installed almost exclusively by people who are experts in its intricacies. If you don't complete an installation, even if you just unpack it and lose interest, we'd really like to hear about it so we can make the package install in a more sensible way. Of course, we'd also love to hear that you have completed an installation and are searching tons of text. Please write to support@philologic.uchicago.edu with your reports.

Creating a database with PhiloLogic is designed to be a simple process, with several configurable options to tailor searching to fit your document set. You may choose to simply run a PhiloLogic load right out of the box using the default settings, which we have found sufficient for our own full-text searching needs.

Some dependencies: gawk, perl, gdbm/gdbm-dev, gnutar, egrep

Optional dependencies: mysql, agrep

The first step to installing PhiloLogic is to unzip the installation files and enter the top-level source directory:

tar -xvzf philologic-v3.001.t3.tar.gz

cd philologic-v3.001

If you're on Debian, proceed with the following:

$ ./configure --with-authuser-group=a-group-you-trust

$ su

# make install

# /etc/init.d/nserver start

$ less LOADING

If you're not on Debian (and quite possibly even if you are), you may well want to change some installation locations. These are the key ./configure arguments, with default values in brackets:

--with-authuser-group=groupname  name of group authorized to build PhiloLogic databases [philologic]

--with-cgi-path=DIR  sets filesystem path to cgi-bin directory [/usr/lib/cgi-bin/newphilo]

--with-cgi-url=DIR  sets URL path to cgi-bin directory [/cgi-bin/newphilo]

--with-web-path=DIR  sets filesystem path for sample search page and source downloads [/var/www/philologic]

--with-web-url=DIR  sets URL path for sample search page and source downloads [/philologic]

--with-init-d=DIR  sets path to initscripts directory [/etc/init.d]

--with-boot-init-d=DIR  sets path to default boot init.d directory [/etc/rc2.d]

If you are on Mac OS X, you will first need to download the OS X tarball, philologic-v3.001.osx.t1.tar.gz, and unpack it following the same directions as above. Second, if you haven't already, be sure to install the Xcode Developer Tools package that accompanies the OS installer discs. The dependencies most commonly unmet out of the box on OS X are autoconf (version 2.5 or greater), grep and gdbm, all of which can be installed using Fink, the Unix open source package management system for Darwin. Once all the basic dependencies have been met, generate a fresh configure script by typing autoconf at a terminal prompt. Once this is done (it should only take a few seconds), run the configure script with a few Darwin-specific options:

CFLAGS="-I/sw/include -L/sw/lib" ./configure \
  --with-cgi-path=/Library/WebServer/CGI-Executables/philologic \
  --with-cgi-url=/cgi-bin/philologic \
  --with-web-path=/Library/WebServer/Documents/philologic \
  --with-web-url=/philologic --with-authuser-group=admin \
  --with-init-d=$HOME --with-boot-init-d=$HOME

You might also want to set the autoconf variables --sysconfdir=[something other than /etc], --bindir=[something other than /bin] and --localstatedir=[something other than /var (by default most of the install ends up in /var/lib/philologic)]. You can NOT successfully specify a PREFIX with make install PREFIX=/usr/local; this just ends up being ignored.

You have to specify web and cgi directories twice, once for where they are on the filesystem and once for where they are in URLs. You can, of course, put these things wherever you want as long as your search pages point to the right cgi locations.

Preparing Your Collection

Once the necessary components of PhiloLogic are installed on your machine, running a load is a single command-line process that takes as arguments the file(s) you wish to include in your collection. To specify which files to load, you may either supply the location of a file containing a list of filenames or list the filenames on the command line. Remember that the order of the command-line arguments IS significant: the first argument must be the name of the database image you are creating, and the last arguments must be the files to be loaded. You can create your own file list containing the filenames of all the files you wish to load into your database. For example, suppose you are loading an entire directory of files named xmlfiles located in your home directory:

\ls /home/me/xmlfiles > ./files

Make sure there are no lines containing filenames for files that you don't want to load such as a DTD or an XML schema living in the same directory as your XML files. The \ before the ls ensures that any options passed to ls by shell aliases [e.g. syntax coloring] do not contaminate the file that you generate.
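One way to keep DTDs and schemas out of the list is to filter the directory listing by extension as it is written. The following is only a sketch: make_filelist is a hypothetical helper, not part of PhiloLogic, and the file names are made up for illustration. It assumes every text you want to load ends in .xml.

```shell
# make_filelist DIR OUTFILE: hypothetical helper, not part of PhiloLogic.
# Writes the names of the .xml files in DIR to OUTFILE, one per line.
# \ls bypasses shell aliases; grep drops anything not ending in .xml.
make_filelist() {
    \ls "$1" | grep '\.xml$' > "$2"
}

# Demonstration in a scratch directory with made-up file names.
srcdir=$(mktemp -d)
touch "$srcdir/hamlet.xml" "$srcdir/lear.xml" "$srcdir/teilite.dtd" "$srcdir/schema.xsd"

filelist=$(mktemp)
make_filelist "$srcdir" "$filelist"
cat "$filelist"
```

Writing the list outside the source directory avoids the list file appearing in its own listing.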

Creating a Bibliography

The next step in the load process is to create a bibliography for the document set that is to be loaded. A default PhiloLogic load should locate much of the bibliographic data within a text. You may, however, choose to provide your own pre-built bibliography, the only requirement being that it is in tab-delimited format. Otherwise, we have provided two mechanisms for creating a tab-delimited bibliography. There is a non-XML-aware bibliographic loader that specifies default locations for certain metadata such as title, author, year, etc. We have also provided a non-validating XML-aware version of the bibliographic loader, built on Michel Rodriguez's XML::Twig Perl module, that by default checks for the same metadata fields as the non-XML-aware version. Both of these scripts output a tab-delimited file in the load directory called bibliography. The exact structure of the bibliography is described below:

title \t author (a1; a2; a3;...) \t year \t genre/doctype \t publisher \t place of publication \t extent \t editor \t publication date \t creation date \t author date \t keywords \t language \t collection \t gender \t notes \t period \t document identifier (5-20 alpha-numerics) \t filename \t filesize \t philologic id number
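If you supply your own bibliography, a quick sanity check is to confirm that every row has the 21 tab-separated fields listed above. The sketch below assumes nothing about PhiloLogic beyond that field count; check_bibliography is a hypothetical helper and the sample row is invented.

```shell
# check_bibliography FILE: hypothetical helper, not part of PhiloLogic.
# Reports every row whose tab-separated field count is not 21 and
# returns non-zero if any such row was found.
check_bibliography() {
    awk -F '\t' '
        NF != 21 { printf "line %d: %d fields\n", NR, NF; bad = 1 }
        END      { exit bad + 0 }
    ' "$1"
}

# Build one invented, well-formed row: 20 tabs separate 21 fields.
demo=$(mktemp)
{
    printf '%s\t%s\t%s\t' 'Hamlet' 'Shakespeare, William' '1603'   # fields 1-3
    printf '\t\t\t\t\t\t\t'                                        # fields 4-10, empty
    printf '\t\t\t\t\t\t\t'                                        # fields 11-17, empty
    printf '%s\t%s\t%s\t%s\n' 'hamlet01' 'hamlet.xml' '180000' '0' # fields 18-21
} > "$demo"

check_bibliography "$demo" && echo "bibliography looks well-formed"
```

The check only counts fields; it does not verify that the values themselves are sensible.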

PhiloLogic was designed to be largely configurable, and the bibliography module is no exception. The first thing you'll need to decide is whether to use the non-XML-aware version, the XML-aware version, or a bibliography of your own. The non-XML-aware version assumes that the document header will be either a <teiHeader> or a <mepHeader>. The Twig-based metadata extractor allows you to specify the header tag, with the default being <teiHeader>. Be aware that even though we allow for ANY type of header, the paths the extractor looks in by default for bibliographic data are those that our sample TEILITE XML and MEP XML documents have shown to most commonly contain the relevant data. Also note that the non-XML-aware ("poor man's") metadata extractor will work with both XML and SGML, while the Twig-based version requires well-formed XML in order to parse correctly.

Reconfiguring the path arrays to point the metadata extractor at different elements is relatively straightforward. Note that the path syntax used by the "poor man's" metadata extractor is not XPath compliant, for three reasons: first, if a full path is not specified, two slashes must precede the first element to denote that the path is relative; second, you cannot terminate an xpath with a trailing slash; and last, the poor man's parser converts all elements to lowercase even though XML is case-sensitive. To add an xpath to a path array, edit mkbibliography.pl and find the array that corresponds to the field you need to update. For example, to update the title field, update @xptitles. Place your path inside quotes before the first element of the array, and add a comma after the closing quote (even if there are no other elements in the path array; better safe than sorry). Here is an example of an xpath that grabs a "subtitle" by pointing to an element whose type attribute has the value subordinate in a Shakespeare text:

the xpath = //teiHeader/fileDesc/titleStmt/title[@type="subordinate"]

the code:

========================== mkbibliography.pl ============================

@xptitles = ('//teiHeader/fileDesc/titleStmt/title[@type="subordinate"]',...);

Loading a Database

The PhiloLogic load process takes a number of TEI-encoded texts, processes them, and creates a directory tree that we call the database image or the system_dir. Loading a database under PhiloLogic, after all the initial preparations have been made, is a one-line command whose total execution time depends on which bibliographic loader is being used and on the size and complexity of the document set. We've written a command-line loader that should handle this, and more, for you. It's fairly new and probably immature, so you may find problems with it. This is how you use it:

- cd to the directory that holds the texts you wish to load.

- run a command like: philoload mydatabase *.xml

"mydatabase" is the database name; it must be composed of only the letters A-Za-z and the numerals 0-9, but otherwise it can be whatever you want it to be. It must always be the first argument to philoload.

*.xml is a shell glob pattern that tells the system to load all the files in the directory whose names end in ".xml". This glob expansion is handled by the shell and it should appear to philoload as though you listed the files as: philoload mydatabase file1.xml anotherfile.xml file3.xml file?.xml
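The naming rule for the database can be checked before a long load is started. This is a minimal sketch under the rule stated above (only A-Z, a-z and 0-9); valid_dbname is a hypothetical helper and does not reflect how philoload itself validates names.

```shell
# valid_dbname NAME: hypothetical helper, not part of PhiloLogic.
# Succeeds only if NAME is non-empty and consists solely of
# the letters A-Z, a-z and the numerals 0-9.
valid_dbname() {
    case "$1" in
        ''|*[!A-Za-z0-9]*) return 1 ;;
        *)                 return 0 ;;
    esac
}

for name in mydatabase shakespeare2 my-database 'two words'; do
    if valid_dbname "$name"; then
        echo "ok:       $name"
    else
        echo "rejected: $name"
    fi
done
```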

As noted previously, the order of the arguments to philoload is significant. The first argument must be the name of the database as it will appear in the philologic databases hash. The last arguments must be the names of the files to be loaded, or a shell glob pattern such as *.xml, unless the optional path to a file list is provided with --filelist. Running philoload with no arguments lists the additional arguments it accepts. The optional arguments that may occur in between are:

Philoload Options

filelist [--filelist=/path/to/filelist]: a path to the list of files to be loaded

image [--image=/path/to/database/target/]: a location in which the database image is to be stored (by default this location is /var/lib/philologic/databases/dbname)

loader [--loader=/path/to/textloader]: the path to an alternate text loader script

mkbibliography [--mkbibliography=/path/to/makebibliographyscript]: the location of the bibliographic extractor script to be used (by default newextract.pl)

bibliography [--bibliography=/path/to/bibliography]: an optional path to a preloaded tab-delimited bibliography

loadsql [--loadsql]: an option that tells the loader to load the bibliographic data into a MySQL table automatically

sqlpass [--sqlpass=mysqlpassword]: MySQL password

sqluser [--sqluser=mysqluser]: if MySQL does not correctly assume the username you may specify it

nosqlpass [--nosqlpass]: if no MySQL password is required to load data into a table

delete [--delete]: an option to delete any old versions of the same database

dontclean [--dontclean]: an option to save all the state files generated during the load.

Thus, we have:

cd ~/textsamples/xml/textstobeloaded/

philoload dbname [--filelist=files2load] [--image=/path/to/database/target/] [--loader=/path/to/textloader] [--mkbibliography=/path/to/makebibliographyscript] [--bibliography=/path/to/preloaded/bibliography] [--loadsql] [--sqluser=username] [--sqlpass=password] [--nosqlpass] [--delete] [--dontclean] [--linksourcetexts] file1.xml file2.xml ...

loader and mkbibliography would be local versions/modifications of xml-sgmlloader.pl and newextract.pl that you wanted to use instead of those, but if you're just downloading and testing you probably won't be interested in that for some time. --loadsql loads the bibliographic metadata into a MySQL database, and --delete deletes pre-existing database images if you know you want to replace them (instead of giving them a different name).

Also note that when specifying the location for the database image directory, you must terminate the path with a trailing slash.
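The trailing slash matters because paths inside the image appear to be built by plain string concatenation (for example, $mvosysdir . "lib/header-new.html" in philosubs.pl). If the image path comes from a variable, a small guard can add the slash when it is missing. This is only a sketch; with_trailing_slash is a hypothetical helper and the path is an example.

```shell
# with_trailing_slash PATH: hypothetical helper, not part of PhiloLogic.
# Prints PATH unchanged if it already ends in "/", otherwise appends one.
with_trailing_slash() {
    case "$1" in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}

image=$(with_trailing_slash /var/lib/philologic/databases/dbname)
# Only print the command here; run it yourself once the path looks right.
echo "philoload dbname --image=$image *.xml"
```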

As the database loads, you will see a lot of arcane symbols and warnings fly across the screen. If all goes well, none of this will be of any use to you. However, if there are any errors in the loading, it will be instructive to examine the loader log (the location of which will be provided for you by the loading script after a failure).

If all goes well, you will see a "SUCCESS" message. The script will inform you where the automatically generated search forms have been created (in the web path that you specified upon installation; by default http://yourserver/philologic/databasename.whizbang.form.html). Go to that location in your browser, attempt a search, and verify that everything is working.

If your load fails and you can't figure it out, write to support@philologic.uchicago.edu and include (preferably gzipped) /var/lib/philologic/work/LOADER.LOG

Loading Bibliographic Data into MySQL

To benefit from this feature of PhiloLogic, you will need to have a recent version of MySQL installed. There are (currently) three options to philoload that deal with MySQL functionality:

[--loadsql]: you have to give this option if you want SQL metadata handling; if you don't give it, you can load the metadata manually later (see below). When the load gets to the SQL step, mysql will prompt you for a password.

[--sqluser=foo]: by default --loadsql just calls "mysql -p < load.database.sql", which allows mysql to guess your username however it wants. With this option it calls "mysql -u foo -p < load.database.sql".

[--nosqlpass]: if the SQL user (either assumed by mysql or named with --sqluser) doesn't have a password, this suppresses the "-p" option to mysql.
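How these options combine can be sketched as a small function that only prints the resulting mysql command line. It mirrors the descriptions above rather than the loader's actual source, and mysql_load_cmd is a hypothetical name; nothing here runs mysql.

```shell
# mysql_load_cmd [--sqluser=NAME] [--nosqlpass]: hypothetical helper.
# Prints the mysql invocation implied by the options described above.
mysql_load_cmd() {
    user_arg=""
    pass_arg=" -p"                  # default: let mysql prompt for a password
    for opt in "$@"; do
        case "$opt" in
            --sqluser=*) user_arg=" -u ${opt#--sqluser=}" ;;
            --nosqlpass) pass_arg="" ;;
        esac
    done
    printf 'mysql%s%s < load.database.sql\n' "$user_arg" "$pass_arg"
}

mysql_load_cmd                              # mysql -p < load.database.sql
mysql_load_cmd --sqluser=foo                # mysql -u foo -p < load.database.sql
mysql_load_cmd --sqluser=foo --nosqlpass    # mysql -u foo < load.database.sql
```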

If you include the --loadsql argument when running the load, the bibliographic data will be loaded into an aptly named database and table automatically. If bibliographic searches are not returning correct results after a load with --loadsql, make sure that all of the MySQL-related settings are appropriately configured. The first thing to check is that $SOCKETARG has been set correctly; the mysql_socket argument differs widely across operating systems. If PhiloLogic is unable to locate the MySQL socket during installation, you can try locating it manually by typing one of the following at a command-line prompt:

locate mysql.sock

or

locate mysqld.sock

If you do not include the --loadsql argument, or if you need to reload the bibliographic data for any reason, you can load it manually later. By default (that is, without --loadsql) you are expected to load the bibliographic data manually. If --loadsql is given, you can suppress the "-p" option to mysql by also including the --nosqlpass option.

To load the SQL metadata manually after running a load, cd to the database image directory (by default this is /var/lib/philologic/databases/dbname) and locate the file load.database.sql. This is the SQL load script that will enter all of the bibliographic data into a MySQL table. Before importing the data however, you may need to make a few modifications to this file. If this is the first time you are loading a database of a given name, you will want to comment out the first line which drops a previously existing table of the same name if it exists:

========================== load.database.sql ============================

# DROP TABLE IF EXISTS dbname;

By default, the table name will be the same as the database name. You may choose to make the table name different. For example, if you are running an alternate load of a certain database and are unsure whether the bibliographies are identical, it's better to be on the safe side and create a new table rather than replace the old one. The table name will need to be edited in two places: in the CREATE TABLE line and in the LOAD DATA LOCAL INFILE line. Below is the MySQL load syntax and table structure:

CREATE TABLE dbname(

...

load data local infile "/var/lib/philologic/databases/dbname/imgname/bibliography" into table dbname

========================================================================

The command to load the bibliography manually is:

mysql --password=password < load.database.sql

The following variables will be set automatically in /path/to/database/target/lib/philo-db.cfg if the --loadsql flag is set on load:

$SQLenabled = 1;

$HOST = "localhost";

$DATABASE = "philologic";

$USER = "mysql_username";

$PASSWD = "mysql_password";

$TABLE = "dbname";

mv gimme gimme.egrep

mv gimme.sql gimme

field        SQL-type      gimme-type  Note
title        VARCHAR(250)  reg-exp
author       VARCHAR(250)  reg-exp     author(s) name(s)
date         SMALLINT(4)   numeric     Earliest year INT
genre        VARCHAR(250)  reg-exp
publisher    VARCHAR(250)  reg-exp
pubplace     VARCHAR(250)  reg-exp
extent       VARCHAR(250)  reg-exp
editor       VARCHAR(250)  reg-exp
pubdate      VARCHAR(250)  reg-exp     string, range, etc.
createdate   VARCHAR(250)  reg-exp     string, range, etc.
authordates  VARCHAR(250)  reg-exp     string, range, etc.
keywords     VARCHAR(250)  reg-exp     various types, LC subject, etc.
language     VARCHAR(250)  reg-exp     language(s) of document
collection   VARCHAR(250)  reg-exp     collection or series
gender       VARCHAR(250)  reg-exp     gender of author(s)
sourcenote   TEXT          reg-exp     notes regarding the document
period       VARCHAR(250)  reg-exp     period (string, e.g. renaissance)
shrtcite     VARCHAR(250)  reg-exp     required/reserved: often a local id
filename     VARCHAR(250)  reg-exp     required/reserved
filesize     VARCHAR(250)  reg-exp     required/reserved
philodocid   SMALLINT(4)   exact       required/reserved docs num from 0

If the MySQL bibliographic table does not have enough fields for your data, you may add as many fields as you need. There are two ways of doing this. You can run your load without the --loadsql flag and then load the bibliography manually as indicated above, after making the necessary modifications to the load.database.sql file in your database image directory. Or you can edit /var/lib/philologic/etc/load.database.sql and change the structure there. Note that modifying /var/lib/philologic/etc/load.database.sql will change the default bibliographic database structure for all future PhiloLogic loads.

We don't currently have explicit functionality for loading MySQL metadata on a separate server. In theory you could just change $HOST to point to the other server and load load.database.sql into the remote MySQL database. It's possible the code assumes in places that it is searching on the local host, though the presence of the $HOST variable in the first place suggests that it will work.

Test Searching Your Database

After you have finished the load process successfully, you will want to run a few test searches (you can also do this before loading the bibliographic data into MySQL). PhiloLogic generates a default search form that you can locate in a browser at the following default URL:

http://localhost/philologic/dbname.whizbang.form.html

If the searches are returning unexpected results or no results at all, please check your server error log for details.

Results Headers and Footers

The first thing you'll want to do after finishing a load is verify that it loaded properly by running a few searches. Every PhiloLogic load automatically generates a default "Whizbang" search form using a generic header and footer and an embedded stylesheet. You may choose to rework or replace the provided search form template, or perhaps reference your own external stylesheet. Likewise, you may wish to replace the default results header and footer with your own. To do this, edit the readnavbar function in philosubs.pl in the database image's lib directory. Suppose you wanted to use a header file you've created and saved as header-new.html; your changes to philosubs.pl would look like this:

cd /path/to/database/target/lib/

emacs philosubs.pl

========================== philosubs.pl ============================

sub readnavbar {

local ($navigbar, $navin, $mvosysdir);

$mvosysdir = $SYSTEM_DIR;

if (!$mvosysdir) {

$mvosysdir = $sys_dir;

}

open (NAVBAR, $mvosysdir . "lib/header-new.html");

And if you wanted to use your own footer file named footer-new.html, you would make the following modifications to the readfooternavbar function:

sub readfooternavbar {

local ($footernavbar, $navin, $mvosysdir);

$mvosysdir = $SYSTEM_DIR;

if (!$mvosysdir) {

$mvosysdir = $sys_dir;

}

open (NAVBAR, $mvosysdir . "lib/footer-new.html");

# objectheader: Reads the result header and gets the bibliography

# for the document. Objects are called for only 1 document.

# Called from:

# ----------------------------------------------------------------------

sub objectheader {

local ($txt, $qs, $test);

$txt = &readnavbar;

$txt .= &getbiblioLine ($doc,"link") . "\n";

$txt .= "<span class=mwright>";

$test = '[<a href="http://tapor.ualberta.ca/Tools/Dispatch/?tool=HyperPo';

$test .= '&delta_iLang=en&url=';

# You need the full URL since this is going as an argument.

$qs = "http://thyme.uchicago.edu/cgi-bin/xphilo/getobject_?";

$qs .= $ENV{'QUERY_STRING'};

$test .= $qs;

$test .= '" target=_blank>Analyze Part with HyperPo</a>]</span>';

$txt .= $test;

$txt .= "<hr noshade>\n";

return "$txt";

}

To push an entire document to HyperPo, add the link to the table of contents by modifying the NavigBiblio function. If the file is in WWW space, the link could simply point to it. To get the link from inside PhiloLogic, use the following:

# ----------------------------------------------------------------------

# NavigBiblio: generate the bibliography for document navigation/TOC.

# Called from: the cgi-bin function navigate

# ----------------------------------------------------------------------

sub NavigBiblio {

local ($doc, $rtn, $test, $txt);

$doc = $_[0];

$rtn = "<span class=navhead>Table of Contents</span><p>\n";

$rtn .= "<span class=navbiblio>";

$rtn .= &getbiblioLine($doc);

$rtn .= "</span>\n";

$txt = "<span class=mwright>";

$test = '[<a href="http://tapor.ualberta.ca/Tools/Dispatch/?tool=HyperPo';

$test .= '&delta_iLang=en&url=';

# You need the full URL since this is going as an argument.

$test .= "http://thyme.uchicago.edu/cgi-bin/xphilo/getrawdoc.pl?";

$test .= $SYSTEM_DIR . "." . $doc;

$test .= '" target=_blank>Analyze Document with HyperPo</a>]</span>';

$txt .= $test;

$rtn .= $txt;

$rtn .= "<p>\n";

return $rtn;

}

Note that this requires a little cgi-bin function called getrawdoc.pl. We will not put this in the standard release package (I don't think), so install the following:

-----------------------START getrawdoc.pl-------------------------------

#! /usr/bin/perl

# This is just a mechanism to send a completely raw document to

# a calling process. I'm not giving out a file name since I want

# to check to see if the file is in the docinfo and exists in the

# TEXTS directory as a security precaution.

$QS = $ENV{'QUERY_STRING'};

($SYSTEM_DIR, $doc) = split ('\.', $QS);

$i = $doc + 1;

open (DOCINFO, $SYSTEM_DIR . "docinfo");

while ( $i-- ) {

$c = <DOCINFO>;

}

close (DOCINFO);

$filename = (split (" ", $c))[0];

if (!$filename) {

print "Content-type: text/html; charset=UTF-8\n\n";

print "ERROR: No File";

exit;

}

$pathfile = $SYSTEM_DIR . "TEXTS/" . $filename;

if (open (RAWFILE, $pathfile)) {

print "Content-type: text/html; charset=UTF-8\n\n";

while (<RAWFILE>) {

print;

}

}

else {

print "Content-type: text/html; charset=UTF-8\n\n";

print "ERROR: No File";

}

--------------------END getrawdoc.pl------------------------------------

Note that I am setting a standard Content Type. This could be conditionalized by checking the kind of document you have, etc.

You may notice a file called format.ph. This file is a symlink to philosubs.pl and exists only for the purpose of internal record keeping. Any changes can be made directly to philosubs.pl.