Optional Code

1 KWIC report linking to paragraphs, other objects
2 Linking DIV-level search results to Pages
3 For Perseus Latin: modified crapser matching
4 Multi-byte dot searching for Greek
5 Conditionalizing the use of the QuickDict
6 Link to page images or other external resources
7 Generate Concordance, KWIC, and SortedKwic Results without Bibliography
8 Simple Time Period Histogram by Rate/1000 words
9 Hall of Shame Hack to handle more than 32K documents
10 Frequencies in chronological order

KWIC report linking to paragraphs, other objects

If your data has no page tags, the KWIC report's links to pages will not work. This bit of code takes care of that problem and can, with editing, allow links to different object levels.

In philosubs.pl, in sub getKWICtitlestandard, these bits of code can be swapped out:

if (!$ChainLinksRestricted) {

$href = "<A HREF=\"" . $CONTEXTUALIZER . "?p.";

$href .= $doc . "." . $dbname . "." . join (".", @o) . "\">";

}

else {

$hitlist =~ s/^.*\.//g;

$href = "<A HREF=\"" . $CONTEXTUALIZER . "?p.";

$href .= $hitlist . "." . $dbname . "." . ($counter-1);

$href .= "." . $CONTEXT . "." . $WORDS . ".0\">";

}

........

# Then the pagenumber and the link to the

# page if we have a page

if ($pagenum eq "?" || $pagenum eq "0") {

$title .= "<tt>p." . $pagenum . ")</tt>";

}

else {

$title .= $href . "<tt>p." . $pagenum . "</a>)</tt>";

}

with this:

if ($pagenum eq "?" || $pagenum eq "0" || $pagenum eq "na") {

$thisobject = $doc . ":" . $index[0];

$thisobject .= ":" . $index[1];

$thisobject .= ":" . $index[2];

$thisobject .= ":" . $index[3];

$offs = join(".", @o); ## adapt this above for pageless docs

$offs =~ s/\.[0-9]*$//;

$href = "<a href=\"" . $PHILOGETOBJECT . "?c." . $thisobject;

$href .= "." . $dbname . "." . $offs . "\">";

}

else {

$href = "<A HREF=\"" . $CONTEXTUALIZER . "?p.";

$href .= $doc . "." . $dbname . "." . join (".", @o) . "\">";

}

.............

# Then the pagenumber and the link to the

# page if we have a page

if ($pagenum eq "?" || $pagenum eq "0" || $pagenum eq "na") {

$title .= $href . "<tt>para</a>)</tt>";

}

else {

$title .= $href . "<tt>p." . $pagenum . "</a>)</tt>";

}
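
To try the replacement logic outside philosubs.pl, the two branches can be condensed into one small helper. This is only a sketch: the sub name kwic_href, its argument list, and the paths and database name in the sample call are hypothetical. In philosubs.pl the same values come from $PHILOGETOBJECT, $CONTEXTUALIZER, $dbname, @index and @o.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of philosubs.pl): returns the opening
# anchor tag for one KWIC hit.
#   $pagenum  page number of the hit, or "?", "0", "na" when there is none
#   $index    array ref playing the role of @index
#   $offsets  array ref playing the role of @o
sub kwic_href {
    my ($pagenum, $doc, $dbname, $index, $offsets, $getobject, $contextualizer) = @_;

    if ($pagenum eq "?" || $pagenum eq "0" || $pagenum eq "na") {
        # No usable page: address the object as doc:i0:i1:i2:i3
        my $thisobject = join(":", $doc, @{$index}[0 .. 3]);
        my $offs = join(".", @{$offsets});
        $offs =~ s/\.[0-9]*$//;    # drop the last offset, as in the snippet
        return "<a href=\"" . $getobject . "?c." . $thisobject
             . "." . $dbname . "." . $offs . "\">";
    }

    # Page available: keep the original page link
    return "<A HREF=\"" . $contextualizer . "?p." . $doc
         . "." . $dbname . "." . join(".", @{$offsets}) . "\">";
}

# Illustrative call; the paths and the database name are made up.
print kwic_href("na", 12, "mydb", [1, 2, 0, 0], [100, 200, 5],
                "/cgi-bin/getobject.pl", "/cgi-bin/contextualize.pl"), "\n";
```

Changing the slice 0 .. 3 is one way to experiment with links to a higher or lower object level.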

Linking DIV-level search results to Pages

By running a script called mk_art2object.pl, editing the DivDisplayLine subroutine in philosubs.pl, and adding a routine called DivPageLink, you can get page numbers to show up with links after them. Full instructions are on the page for mk_art2object.pl.

For Perseus Latin: modified crapser matching

This adds lowercase letters with macrons to crapser's capital-letter matching, along with I/J and U/V, because there are many orthographic variants.

Block copy this to crapser if you need macron searching in UTF-8. Recall that we like to match on various accented characters, typically French, by entering an upper case letter.

'C', "(c|\xc3\xa7)",

'N', "(n|\xc3\xb1)",

'Y', "(y|\xc3\xbf|\xc3\xbd|\xc8\xb3)" );

And absolutely don't forget to alter the regexes that look for accented characters:

$yeswords =~ s/[ACEIJNOUVY]/$ACCENTS{$&}/ge;

...

$butnotwords =~ s/[ACEIJNOUVY]/$ACCENTS{$&}/ge;
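
A minimal sketch of how this kind of expansion behaves, assuming a cut-down table. The C and N rows are copied from the excerpt above. The I/J and U/V rows are only a guess at how the orthographic variants could be written, and the sub name expand_accents is hypothetical.

```perl
use strict;
use warnings;

# Byte-oriented table in the style of the crapser entries.
# C and N come from the excerpt; I/J and U/V are assumptions.
my %ACCENTS = (
    'C' => "(c|\xc3\xa7)",
    'N' => "(n|\xc3\xb1)",
    'I' => "(i|j)",
    'J' => "(i|j)",
    'U' => "(u|v)",
    'V' => "(u|v)",
);

# Hypothetical wrapper around the substitution: every listed capital
# letter is replaced by its alternation.
sub expand_accents {
    my ($words) = @_;
    $words =~ s/([CIJNUV])/$ACCENTS{$1}/ge;
    return $words;
}

my $pattern = expand_accents("Iam");    # becomes "(i|j)am"
for my $form ("iam", "jam", "tam") {
    printf "%s %s\n", $form,
        ($form =~ /\A(?:$pattern)\z/ ? "matches" : "does not match");
}
```

A capital letter missing from the character class is passed through unchanged, which is why the class and the table have to be edited together.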

Multi-byte dot searching for Greek

To search for Unicode characters with the dot wildcard, edit this line in crapser:

$DOTPATTERN = "([a-zA-Z0-9]|[\xa0-\xc3][\xa0-\xc3])";

to include these ranges for Unicode Greek:

$DOTPATTERN = "([a-zA-Z0-9]|[\xce-\xcf][\x80-\xbf]|[\xe1-\xe2][\xbc-\xbf][\x80-\xbf])";

We think it works.
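
One way to check the two patterns side by side is to run them against raw UTF-8 bytes. The sketch below assumes that the dot in a query is simply replaced by $DOTPATTERN, which may not be exactly what crapser does. The sub name dot_to_regex and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $OLD_DOTPATTERN = "([a-zA-Z0-9]|[\xa0-\xc3][\xa0-\xc3])";
my $NEW_DOTPATTERN = "([a-zA-Z0-9]|[\xce-\xcf][\x80-\xbf]|[\xe1-\xe2][\xbc-\xbf][\x80-\xbf])";

# Assumption: each dot in the query stands for exactly one character.
sub dot_to_regex {
    my ($query, $dotpattern) = @_;
    $query =~ s/\./$dotpattern/g;
    return $query;
}

# The Greek word "logos" as UTF-8 bytes: lambda, omicron with tonos,
# gamma, omicron, final sigma. In the query the second letter is a dot.
my $logos = "\xce\xbb\xcf\x8c\xce\xb3\xce\xbf\xcf\x82";
my $query = "\xce\xbb.\xce\xb3\xce\xbf\xcf\x82";

for my $case (["old", $OLD_DOTPATTERN], ["new", $NEW_DOTPATTERN]) {
    my ($label, $dot) = @{$case};
    my $re = dot_to_regex($query, $dot);
    printf "%s pattern: %s\n", $label,
        ($logos =~ /\A$re\z/ ? "match" : "no match");
}
```

The third alternative in the new pattern covers three-byte sequences such as alpha with smooth breathing and acute accent (\xe1\xbc\x84).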

Conditionalizing the use of the QuickDict

See QuickDict.

Link to page images or other external resources

See ExternalLinks

Generate Concordance, KWIC, and SortedKwic Results without Bibliography

For a number of projects, we don't always want the bibliographic info displayed at the bottom of the page. This can be done conditionally by modifying the bibliography generators in the cgi-bin scripts artfl_conc.pl, artfl_kwic.pl, artfl_sortedkwic.pl, and kwicresort.pl.

In philo-db.cfg add:

$OMITKWICBIBLIOGRAPHY = 1;

$OMITCONCBIBLIOGRAPHY = 1;

$OMITKWICSORTBIBLIOGRAPHY = 1;

Then for each of the scripts, test for this. Example from artfl_conc.pl:

if (!$OMITCONCBIBLIOGRAPHY) {

print "<hr><h2>" . $philomessage[209] . "</h2>\n";

for $doc (@docs) {

print &getbiblioLine ( $doc, "link" ) . "<p>\n";

}

}

Others: artfl_kwic.pl

if (!$OMITKWICBIBLIOGRAPHY) {

print "<hr>\n<h2>" . $philomessage[209] . "</h2>";

for $doc (@docs) {

print &getbiblioLine ( $doc, "link" ) . "<p>\n";

}

}

artfl_sortedkwic.pl

if (!$OMITKWICSORTBIBLIOGRAPHY) {

print "<hr>\n<h2>" . $philomessage[209]. "</h2>";

for $doc (@docs) {

$kwicbibline = &getbiblioLine ( $doc, "link" ) . "<p>\n";

print $kwicbibline;

if ($KWIC_RESORT_ON) {

if (defined (&KwicResortBibKey)) {

$biblioLineRead = &KwicResortBibKey($biblioLineRead);

}

$allkwicrawblines{$doc} = $biblioLineRead;

$allkwicbiblines .= $kwicbibline;

}

}

}

kwicresort.pl

if (!$OMITKWICSORTBIBLIOGRAPHY) {

print "<center><h2>" . $philomessage[209] . "</h2></center>\n";

open (BIBFILE, $bibfile);

while ($linein = <BIBFILE>) {

print $linein;

}

close(BIBFILE);

}

Simple Time Period Histogram by Rate/1000 words

Martin Mueller posted an interesting suggestion on Humanist (May 12, 2007):

But if 'text' becomes 'data' in whatever environment and for whatever reason you don't 'read' but look at results that are in some form tabulated or quantified. For instance, the Philologic search engine lets you scan across some 600 million words of English, pick out 144,000 occurrences of various spellings and grammatical forms of 'liberty' and will return results by decade and frequency per 10,000 words. These results are much more easily interpreted as a chart because you "see" at once that there are quite sharp spikes in the 1650's and 1680's

So, I hacked up a quick proof of concept. It takes the frequency by period report and simply adds a little histogram of the relative rate for the selected time periods. Here is a little example:

Bibliographic criteria: date=1600-1899

Searching 1839 documents for tradition.*|coutu.*.

Number of Unique Forms: 64

Search Terms: coutumace | coutumaces | coutumance [truncated]....

Your search found 9851 occurrences

Period Rate Count Histogram (Rate * 10)

1600-24 0.05 18

1625-49 0.32 126 ***

1650-74 0.19 117 **

1675-99 0.80 479 ********

1700-24 0.61 339 ******

1725-49 1.01 747 **********

1750-74 0.74 759 *******

1775-99 0.68 705 *******

1800-24 0.73 575 *******

1825-49 1.02 1924 **********

1850-74 1.08 1555 ***********

1875-99 1.65 2507 ****************

The same search by half-centuries:

Period Rate Count Histogram (Rate * 10)

1600-49 0.19 144 **

1650-99 0.49 596 *****

1700-49 0.84 1086 ********

1750-99 0.71 1464 *******

1800-49 0.93 2499 *********

1850-99 1.37 4062 **************

And by decade:

Period Rate Count Histogram (Rate * 10)

1600-09 0.05 4

1610-19 0.07 10

1620-29 0.03 9

1630-39 0.46 54 *****

1640-49 0.49 67 *****

1650-59 0.06 27

1660-69 0.59 78 ******

1670-79 0.40 78 ****

1680-89 0.79 182 ********

1690-99 0.93 231 *********

1700-09 0.57 97 ******

1710-19 0.51 155 *****

1720-29 0.96 137 **********

1730-39 1.18 505 ************

1740-49 0.78 192 ********

1750-59 0.77 283 ********

1760-69 0.75 306 ********

1770-79 0.71 338 *******

1780-89 0.73 345 *******

1790-99 0.57 192 ******

1800-09 0.61 241 ******

1810-19 0.82 162 ********

1820-29 0.91 294 *********

1830-39 0.88 538 *********

1840-49 1.10 1264 ***********

1850-59 1.14 728 ***********

1860-69 1.16 660 ************

1870-79 0.73 359 *******

1880-89 1.24 693 ************

1890-99 2.29 1622 ***********************

This is not production code, but something that we might want to add in the next release. Add it at about line 2130 of search3t.

print "<hr>";

open (MVOGENERATED, "$PHILOTMP/mvogenerated.$$");

while (<MVOGENERATED>) {

s/\t\n//;

@periodfreq = split ('\t', $_);

if ($DF) {

$thisdataline = $periodfreq[1] . "\t" . $periodfreq[0];

}

else {

$thisdataline = $periodfreq[0] . "\t" . $periodfreq[1];

}

$histogramdata{$periodfreq[2]} = $thisdataline;

}

close (MVOGENERATED);

print "<table align=center>\n";

print "<tr><td>Period</td>\n<td>Rate</td><td>Count</td>";

print "<td> </td>";

print "<td>Histogram (Rate * 10)</td></tr>";

foreach $periodkey (sort keys(%histogramdata)) {

$thisline = $histogramdata{$periodkey};

@thisdataline = split("\t", $thisline);

print "<tr>";

print "<td>" . $periodkey . "</td>\n";

print "<td align=right>" . $thisdataline[0] . "</td>\n";

print "<td align=right>" . $thisdataline[1] . "</td>\n";

print "<td> </td>";

$numofxxx = sprintf("%0.f", $thisdataline[0] * 10);

if ($numofxxx > 1) {

$x = 0;

print "<td>";

while ($numofxxx > $x) {

print "*";

$x++;

}

}

else {

print "<td> ";

}

print "</td></tr>\n";

}

print "</table><hr>\n";
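
The bar length can be checked on its own, without the surrounding table code. The following is a minimal sketch with a hypothetical helper name, histogram_bar. It uses the same rounding and the same "more than one star" threshold as the loop above, with rates taken from the 25-year example.

```perl
use strict;
use warnings;

# Hypothetical helper: the bar for one period, given its rate.
sub histogram_bar {
    my ($rate) = @_;
    my $stars = sprintf("%.0f", $rate * 10);   # round to nearest integer
    return $stars > 1 ? "*" x $stars : "";     # same threshold as above
}

# Rates taken from the 25-year example table.
for my $row (["1625-49", 0.32], ["1650-74", 0.19], ["1675-99", 0.80]) {
    printf "%-8s %.2f %s\n", $row->[0], $row->[1], histogram_bar($row->[1]);
}
```

The repetition operator x does the work of the inner while loop.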

Hall of Shame Hack to handle more than 32K documents

We have encountered this problem here only once. PhiloLogic uses a 16-bit integer to address documents. We thought that we had this set to an unsigned integer, which would allow 64K documents, but a recent report from another team using PhiloLogic suggests that this is not the case. The problem shows up when you have more than 32,768 documents: a document id (philodocid) above that limit comes back as a negative integer, which cannot be a valid document id.

The only known problem with this is in getting text contexts from searches. In PhiloLogic, this is in search3t, artfl_conc.pl, artfl_kwic.pl, artfl_pole.pl, artfl_sortedkwic.pl, theme_rheme.pl.

While reading the index of each hit, we set the document id. The code varies, but in search3t it looks like this:

while ($hit = &GetHit){

@index = unpack ( "s" . 6 . "i" . $nw, $hit );

Add the following directly below (you can leave the comment out):

# Hall of Shame HACK. MVO May 20, 2004

# Int problem.... there are more than 32768 docs so.....

if ($index[0] < 0 ) {

$xxvv = (32768 + $index[0]) + 32768;

$index[0] = $xxvv;

}

In other functions, find code that looks like:

while (($hit = &GetHit) && ($counter < $finish + 1)) {

@o = unpack ($unpack, $hit);

undef (@index);

for ( $i = 0; $i < $CONTEXT; $i++ ) {

push (@index, shift (@o));

}

$doc = shift @index;

And again add the hack immediately following:

if ($doc < 0 ) {

$xxvv = (32768 + $doc) + 32768;

$doc = $xxvv;

}

This is known to work without problems. The problem arises (I think) in the unpack template, which specifies the first 6 integers as "s", 16-bit signed shorts. Oddly enough, we need signed integers for everything BUT documents, since we start counting from 0 and use -1 to indicate "not an object". Remember that this is a fixed-width indexing scheme. I suppose one could try something like

unpack ( "S" . 1 . "s" . 5 . "i" . $nw, $hit );

Next time I have a 32K+ document set, I'll try it. MVO

UPDATE -- for dbs with more than 32767 docs -- the philodocid must be set as MEDIUMINT in the load.database.sql script. CMC
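
The wrap-around and both remedies can be reproduced with a made-up hit record. This is a sketch only: the record holds just the six 16-bit fields, without the word integers that follow them in a real hit, and all field values are invented.

```perl
use strict;
use warnings;

# A made-up hit: document id 40000 followed by five object fields.
# The id is packed unsigned to simulate an index with more than
# 32767 documents.
my $hit = pack("S s5", 40000, 3, 1, 0, -1, -1);

# Reading all six fields as signed shorts, as in search3t
my @signed = unpack("s" . 6, $hit);

# The hack: map a negative id back into the range 32768..65535
my $doc = $signed[0];
if ($doc < 0) {
    $doc = (32768 + $doc) + 32768;
}

# The alternative template: unsigned for the document id only
my @mixed = unpack("S" . 1 . "s" . 5, $hit);

print "signed read: $signed[0]\n";
print "after hack:  $doc\n";
print "mixed read:  $mixed[0], last field $mixed[5]\n";
```

In this isolated test the mixed template keeps -1 in the object fields while returning the document id unchanged. Whether it behaves the same inside PhiloLogic is not verified here.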

Frequencies in chronological order

Building a histogram (see above) is an excellent idea, but fairly complex. PhiloLogic returns the numeric results in descending numeric order; one may prefer chronological order. This can be achieved with a minimal change!

In the file /usr/lib/cgi-bin/newphilo/search3t, everything happens at lines 2118 and following. The function builds a temporary file mvogenerated.$$ in which the first two fields are the frequencies and the third holds the dates of each period. So it is enough to change the parameters of the sort command at line 2119:

open(MVOSORTED, "| sort -nr +0 -1 > $PHILOTMP/mvogenerated.$$");

to be replaced with:

open(MVOSORTED, "| sort -n +2 -3 > $PHILOTMP/mvogenerated.$$");

For completeness, you can also edit messages 197 and 198 in the files /var/lib/philologic/databases/xxx/lib/english.messages.pl and /var/lib/philologic/databases/xxx/lib/french.messages.pl.
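
The effect of sorting on the third field can be previewed in plain Perl. The lines below are invented, in the three-field, tab-separated layout described for mvogenerated.$$, and period_start is a hypothetical helper. On systems where sort no longer accepts the old +2 -3 key syntax, sort -n -k 3,3 is the usual equivalent.

```perl
use strict;
use warnings;

# Made-up lines: two frequency fields, then the period, tab-separated.
my @lines = (
    "1.02\t1924\t1825-49",
    "0.05\t18\t1600-24",
    "0.80\t479\t1675-99",
);

# Hypothetical helper: first year of the period in the third field.
sub period_start {
    my ($line) = @_;
    my $period = (split /\t/, $line)[2];
    my ($year) = $period =~ /^(\d+)/;
    return $year;
}

# Numeric sort on the start year gives chronological order.
my @chronological = sort { period_start($a) <=> period_start($b) } @lines;
print "$_\n" for @chronological;
```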
