In the stock version of philo-db.cfg, we have:
This should actually be, I believe:
Paul Schaffner reports that a proximity search for the same word fails, returning all of the occurrences of the word, with hit highlighting broken. The phrase search "day day" performs properly, as does the phrase "day to day". A proximity search for "day day" fails. My initial assessment is that this is a search3 evaluation error. Judging by the duplicated "hit" highlighting, it anchors the proximity on one occurrence and then, when doing an index search for the next item, gets the same one and decides it is done. This appears to be a long-standing bug, as the same behavior is noted in search2. I believe this is the first report of this behavior in PhiloLogic 2 or 3 (or about 8 years).
This is a fairly important bug fix since searching for repeated words in close proximity is of interest to literary scholars.
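To make the suspected failure concrete, here is a minimal sketch in Python (not the search3 code, which I have not traced). It assumes each term has a list of word positions within one document; the function name and data layout are hypothetical. The point is only that when both terms are the same word, a position must not be allowed to pair with itself.

```python
def proximity_hits(positions_a, positions_b, max_distance):
    """Return sorted (first, second) position pairs within max_distance words.

    positions_a and positions_b are word offsets for the two search terms.
    A position never pairs with itself, so searching "day day" only reports
    two distinct occurrences of "day".
    """
    hits = set()
    for a in positions_a:
        for b in positions_b:
            # a != b is the guard the buggy behavior appears to lack.
            if a != b and abs(a - b) <= max_distance:
                hits.add((min(a, b), max(a, b)))
    return sorted(hits)


day = [3, 5, 40]
print(proximity_hits(day, day, 5))
```

With the positions above, only the occurrences at 3 and 5 form a hit; the isolated occurrence at 40 is not reported.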
I was wondering if we should not try to arrange the frequency result query and response differently. Currently, you get back, say, decades with lists of works. There is a turn-off title button. I'm wondering if we could not have that be the default mode... but then when you get the list of decades with their frequencies... you could click on the decade and the titles for that decade would open up below (not in a java box) for that decade...
Robert asks if we can support phrase searching for collocation table reports. Currently, I am limiting the permitted searches when getting a collocation report to single vectors of words, such as grand*. I did this because it would be difficult to decide what would be the "pole" in a multiword search. Robert suggests permitting searches for "grand* homme*". This will need modification to search3t and artfl_pole.pl. The logic of artfl_pole.pl is pretty hardwired to the single word per hit model, so it may need some significant hacking.
Robert suggests that we allow users to click on a word in the collocation report and run a search, presumably in the context of the pole word and bibliographic specification. This too will require modification to artfl_pole.pl. Looking at it, I think I erred in making artfl_pole.pl a standalone function. Next revision, I'll make it a search3t subroutine, which will make this and the above modification more coherent.
Martine Groulx reports that the Frequency by DivHead report sorts only on the frequencies of words appearing in each Div (Chambers Cyclopedia). These are typically dictionary or reference work entries. Charles also had to modify this report for certain functions in the Protestant database for ASP.
Solution is to leave hooks in either search3t: &dofreqbydiv or (more reasonably) philosubs.pl:&DivHeadFreqLinks (since we already have a subroutine), to generate a new sort key and resort on frequency and divhead. MVO might try to hack that and add it to Patches.
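As a rough illustration of the resort (in Python, not the Perl of philosubs.pl), the sketch below builds a compound sort key so that rows are ordered by descending frequency and then alphabetically by div head. The row layout of (divhead, frequency) is an assumption made for the example.

```python
def sort_div_frequencies(rows):
    """Sort (divhead, frequency) rows by frequency (high first), then divhead."""
    return sorted(rows, key=lambda row: (-row[1], row[0].lower()))


rows = [("Abacus", 2), ("Zodiac", 7), ("Comet", 2), ("Algebra", 7)]
for divhead, freq in sort_div_frequencies(rows):
    print(freq, divhead)
```

Negating the frequency keeps the whole thing a single ascending sort, so ties on frequency fall back to the div head in ordinary alphabetical order.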
If you have loaded a database with hyphens as a non-word-breaking character, and your search comes up with a result that has a hyphen in it, it won't highlight properly. The word will be highlighted until the hyphen, then not.
Perhaps the word pattern that is used to load the database could be saved in philosubs and then used to do the highlighting too?
You can solve this by editing ConcSpan in philosubs.pl and removing - or any other character that should be considered as part of a word from the pattern.
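The idea of keeping one word pattern for both loading and highlighting can be sketched like this. This is Python rather than the actual ConcSpan code, and the pattern, names, and markers are all hypothetical; the only point is that the tokenizer and the highlighter read the same definition of a word.

```python
import re

# Hypothetical shared definition: word characters plus the hyphen.
WORD_CHARS = r"\w-"
WORD_RE = re.compile(rf"[{WORD_CHARS}]+")


def tokenize(text):
    """Split text into words using the shared pattern (what a loader would do)."""
    return WORD_RE.findall(text)


def highlight(text, start, open_tag="<b>", close_tag="</b>"):
    """Highlight the whole word beginning at offset start, using the same pattern."""
    match = WORD_RE.match(text, start)
    if match is None:
        return text
    end = match.end()
    return text[:start] + open_tag + text[start:end] + close_tag + text[end:]


print(highlight("a well-known fact", 2))
```

Because both functions use WORD_RE, a word indexed with its hyphen is also highlighted through the hyphen.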
Vincenzo Lomiento lomiento_AT_alice.it reports that PhiloLogic produces each hit 3 times in the case of having a one document database and specifying some metadata in addition to the word to be searched. It works fine without specifying metadata, which is probably why we have not seen this. Vincenzo also reports that this occurs in cases where the search produces more than 9 hits.
Some tests to run: will this behavior be replicated in cases of large databases where the first loaded document -- document 0 -- is specified by metadata?
My guess is search3 is failing to initialize properly on the "0", probably finding it as a failed logic test, whereas it is a valid document id.
Another solution may be simply to uniq the hitlist -- recall that it produces EXACTLY the same hits, including byte offsets, which is impossible in normal behavior.
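Both ideas can be sketched briefly. This is Python, not search3, and whether search3 really trips on a boolean test of document id 0 is only my guess above. In Perl, as in Python, 0 is false in a boolean test, so a check written as "is there a document id?" silently rejects document 0. The hit layout of (doc_id, byte_offset) is an assumption for the example.

```python
def is_valid_doc_id(doc_id):
    """Treat 0 as a valid id; only a missing or negative id is invalid."""
    return doc_id is not None and doc_id >= 0


def unique_hits(hits):
    """Drop exact duplicate hits while keeping the original order."""
    seen = set()
    result = []
    for hit in hits:
        if hit not in seen:
            seen.add(hit)
            result.append(hit)
    return result


hits = [(0, 120), (0, 120), (0, 120), (0, 388)]
print(bool(0), is_valid_doc_id(0))
print(unique_hits(hits))
```

Deduplicating would hide the symptom rather than fix the cause, so it seems better suited as a safety net than as the real repair.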
Title says it all. Will be fixed in the next release.
Robert suggests that we add a little navigation function to allow users to click on a letter [a,b,c] for word count files. This needs to be hooked into getwordcount.pl. It will only work for alphabetized lists.
Allow for fuzzy matching with regexp:
TRE agrep can handle full regular expressions and approximate matching via edit distance. This would allow us to add similarity matching to regexp searches.
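For readers unfamiliar with edit distance, here is a small Python sketch of the underlying idea. It is not TRE's API and does not handle regular expressions; it only shows plain Levenshtein distance between whole words, with hypothetical function names.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_matches(query, words, max_errors):
    """Return the words within max_errors edits of query."""
    return [word for word in words if edit_distance(query, word) <= max_errors]


print(fuzzy_matches("homme", ["homme", "hommes", "femme", "grand"], 1))
```

A real implementation would run the approximate match against the word index rather than a Python list.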
If you click on the page number link from a results page, contextualize.pl will bring you the page content, but it doesn't know about where the document ends, so it will put up "Next" or "Previous" page arrows even if you are at the last or first page of a document.
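The missing boundary check amounts to something like the following Python sketch. It assumes the first and last page numbers of the document can be looked up, which contextualize.pl does not currently do; the function name is hypothetical.

```python
def page_links(page, first_page, last_page):
    """Return the navigation links that make sense for this page."""
    links = []
    if page > first_page:
        links.append("Previous")
    if page < last_page:
        links.append("Next")
    return links


print(page_links(1, 1, 250))
print(page_links(250, 1, 250))
```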
Robert Scholes (Brown) suggests that a PDF text recognizer (loader) would be a useful addition. This would probably work like a plaintext loader without document structure, but you could get most of the reporting functions. I'll have to think about it a bit more. Might need to create a "dummy" document in the background. MVO.
As noted in the Optional Code section, we still have a 32K document limit, which can easily be bumped to 64K documents. Probably just need to redo the unpack function when reading search hits.
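The 32K versus 64K difference is what you would expect if the document id is read as a signed 16-bit integer instead of an unsigned one. The Python sketch below illustrates that with the struct module; that the id really is a 16-bit field, and its byte order, are assumptions I have not checked against the search code.

```python
import struct


def read_doc_id_signed(raw):
    """Read two bytes as a signed 16-bit integer (maximum 32767)."""
    return struct.unpack("<h", raw)[0]


def read_doc_id_unsigned(raw):
    """Read the same two bytes as an unsigned 16-bit integer (maximum 65535)."""
    return struct.unpack("<H", raw)[0]


raw = struct.pack("<H", 40000)
print(read_doc_id_signed(raw), read_doc_id_unsigned(raw))
```

The same two bytes give a negative number when read as signed and the intended document id when read as unsigned.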
We currently have word frequencies for individual documents and various word counting functions from searches. Jacques Guilhaumou suggests that we implement a word count feature for groups of documents, such as a global word count for all documents by a particular author. This would not be difficult to do. One way would simply be to have a link from a bibliography search with total word count. This could push to a variant of the current getwordcount.pl, which would simply sum the counts from a list of philodocids: getwordcount.pl?DBNAME.OBJ:OBJ:OBJ:OBJn.sortorder. We could simply put a switch in to see if we have one OBJ or many, and then use the rest of the same code. We would need some thinking about this. 01/08
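A sketch of the summing step, in Python rather than Perl. The query layout follows the DBNAME.OBJ:OBJ:OBJ.sortorder shape mentioned above; the function names and the in-memory per-document counts are hypothetical stand-ins for whatever the word count files actually provide.

```python
from collections import Counter


def parse_wordcount_query(query):
    """Split 'DBNAME.OBJ:OBJ:OBJ.sortorder' into its three parts."""
    dbname, objects, sort_order = query.split(".")
    doc_ids = [int(obj) for obj in objects.split(":")]
    return dbname, doc_ids, sort_order


def group_word_count(per_doc_counts, doc_ids):
    """Sum word counts over one or many documents with the same code path."""
    total = Counter()
    for doc_id in doc_ids:
        total.update(per_doc_counts[doc_id])
    return total


per_doc = {0: Counter({"le": 2, "roi": 1}), 3: Counter({"le": 1})}
dbname, doc_ids, sort_order = parse_wordcount_query("mydb.0:3.freq")
print(group_word_count(per_doc, doc_ids).most_common())
```

Since a single OBJ is just a list of length one, no separate switch is needed in this version.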
Using a user-selected bibliography, searches that return no results display erroneous bibliographies
If you run a null search on a database (by simply pressing "enter" in the search box or hitting "Submit" without filling in any search criteria), then select a bibliography using the checkboxes, then run a search that returns no results, bibliography display contains duplicates and omits some entries. For example:
The solution is to sort the bibliographic result array before printing the bibliography.
It also seems that you still can't select the document with philodocid of 0. Try running a search here:
Select all documents... 0 doesn't appear.
Add a function to allow modification of result display for frequency by Div object, like chapters. Specific request to add author to Encyclopedie frequency by article report.
Add a function, like we have for bibliographic metadata, to allow the user to get counts by selected metadata. Again, this is specific to the Encyclopedie. Frequency by author, class of knowledge, etc.
We ought to have textload.cfg be database-specific in some manner, or at least put a copy of it into the database directory after running a load, so you can go back and see what parameters you loaded it under.
If you have words that are entirely made up of bytes that match the pattern in your CHARSNOTTOINDEX, they will be reduced to nothing and philoload will fail with an error about counts differing. You can solve this with a hack like this:
$oldword = $theword;
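Only the first line of the hack is shown above, so I won't guess at the rest of it. As a separate illustration of one way to handle the problem, here is a Python sketch that keeps the original word whenever stripping would leave nothing. The pattern and names are hypothetical, and falling back to the original word is an assumption, not necessarily what the hack does.

```python
import re

# Hypothetical stand-in for the CHARSNOTTOINDEX pattern.
CHARS_NOT_TO_INDEX = re.compile(r"[\[\]{}*]")


def strip_unindexed(word):
    """Strip unindexed characters, but never reduce a word to nothing."""
    old_word = word
    stripped = CHARS_NOT_TO_INDEX.sub("", word)
    # Assumption: fall back to the original so the word counts still agree.
    return stripped if stripped else old_word


print(strip_unindexed("[roi]"), strip_unindexed("***"))
```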
For example, indicate that Mac OS X is the main operating system we're supporting. Currently reads:
...if you were on Mac OS X (which is close to being supported but
Middle of Clause: out of 0
This should have a number and total count.
By default kwicresort.pl points the description of the page to the wrong message. It was 240. This should be 190. But we have to modify the display a bit. Might require a new message.
For high-frequency words, when metadata returns low byte ids, results from docid 256 may be appended.
This is a constantly growing list of things that we intend to fix for future revisions of PhiloLogic.
Add a routine to newextract (the bibliography generator) to check whether each file has the basic elements required for loading as a TEI/ATE or other file. This would check for a DIV level object, a P level object, and possibly for some CDATA contents. More later... This is to resolve the fact that one can have directories of XML files that will contain a few headers, including, and other stuff....
Martin Mueller (Oct 10, 2005) suggests a random hit function:
I think that adding a random sample feature would be very useful.
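A random sample of hits could work along these lines. This is a Python sketch with hypothetical names, assuming the full hit list is already in hand; it keeps the sampled hits in their original order so the results still read like a normal concordance.

```python
import random


def sample_hits(hits, sample_size, seed=None):
    """Return sample_size randomly chosen hits, kept in their original order."""
    if sample_size >= len(hits):
        return list(hits)
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(hits)), sample_size))
    return [hits[i] for i in chosen]


hits = [f"hit-{n}" for n in range(1000)]
print(sample_hits(hits, 5, seed=42))
```

Passing a seed makes the sample reproducible, which would let a user page through the same sample or share it.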
Not a TODO, but a kewl idea from Orion that I don't want to lose track of....:
Somebody from the New York Times is asking people to submit addresses
IWW style requeries, to re-present results in different ways, giving the user a "filter" (Julia's expression) approach to result sets. Russ and I are thinking of a dynamic results header as a drop-in block of code, which would keep LATENTQUERYSTRING on the server and parse it for different result sets.
Carole Mah sez: Put something about the sort order of basic text search results. These are in LOAD order. The general loader tries to sort out the load in chronological order (year only). This could simply be put in philosubs..... Oddly enuff, we won't always know about that. Geez, you wudda thunk we would have something like that, eh?
Create the philohistory directory either on install or when the Philo history function is run and does not find one. Check to see if it reads the PHILOTMP directive.
Add sort by frequency to Terms button? There may be speed problems with this. And I don't have a good idea about how to put the switch in the interface (a global selection)?
02-22-05: Allow specification for the sort -T location in general philo configuration, which will probably avoid the next. Note that it can be set in loader.xmake by changing
SORTFLAGS= -T . -y +0 -1 +1 -2n +2 -3n +3 -4n +4 -5n +5 -6n +6 -7n +7 -8n
to some other location with lots of space:
SORTFLAGS= -T /export/home/thymephilo/mark/temp/philosort/ -y +0 -1 +1 -2n +2 -3n +3 -4n +4 -5n +5 -6n +6 -7n +7 -8n
02-22-05: Trap for no space on device error on load. If we get this as we are reading texts in, it simply stops loading the offending batch and in certain circumstances will load the database without noticing it is missing a batch.
Loading 999 ===> TEXTS/pharisjn.xml...
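The trap could look roughly like this. It is a Python sketch, not the loader itself: the function and exception names are hypothetical, and load_one stands in for whatever reads a single text. The point is to stop the whole load on a "no space left on device" error instead of silently dropping the batch.

```python
import errno


class LoadAborted(Exception):
    """Raised when the load must stop rather than continue with a missing batch."""


def load_batch(paths, load_one):
    """Load each file; abort the whole load if the device runs out of space."""
    for path in paths:
        try:
            load_one(path)
        except OSError as err:
            if err.errno == errno.ENOSPC:
                raise LoadAborted(
                    f"no space left on device while loading {path}") from err
            raise  # other I/O errors are passed through unchanged
```

The caller would catch LoadAborted, report which file was being read, and refuse to build the database from a partial load.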
02-22-05: and while we're at it, let's encourage a default database directory that is NOT in the standard install location (/var/lib/philologic/databases/).