To do

  • 1 philo-db.cfg $BIBOPS{"dgphilodiv"} -> $BIBOPS{"dgphilodivid"}
  • 2 Proximity Searching for same word fails
  • 3 Toggle Display of Titles in Frequency by Period/Author Reports
  • 4 Permit phrase search on collocation report
  • 5 Search link back from collocation report
  • 6 Sort on DivHead words in Frequency by DivHead Report
  • 7 Hyphens break word highlighting
  • 8 Triple Search Results for document 0
  • 9 Missing 'f' in sprintf on line 960 in search3t
  • 10 Add sorted word count navigation
  • 11 Investigate TRE agrep support for approximate matches
  • 12 contextualize.pl doesn't know when we are at the end of the document, throws up bogus next arrows
  • 13 PDF text recognizer
  • 14 Fix 32K document limit
  • 15 Total word count for user selected groups of documents
  • 16 Using a user-selected bibliography, searches that return no results display erroneous bibliographies
  • 17 Add Frequency by Div object formatting to philosubs.pl
  • 18 Add field sorting/counting for Frequency by Div Objects
  • 19 textload.cfg should be database specific
  • 20 Textload can generate errors if CHARSNOTTOINDEX reduces word to null string
  • 21 Edit Install page to reflect *reality*
  • 22 Theme-Rheme fails to sum Middle
  • 23 Sorted KWIC description message is incorrect
  • 24 Compilation Failure of Index verification on 64 bit Ubuntu
  • 25 Docid 256 search results appended to low byte orders
  • 26 Older stuff

    philo-db.cfg $BIBOPS{"dgphilodiv"} -> $BIBOPS{"dgphilodivid"}

    In the stock version of philo-db.cfg, we have:

    # Define DIV LEVEL for searching

    $BIBOPS{"dgphilodiv"}="exact";

    This should actually be, I believe:

    # Define DIV LEVEL for searching

    $BIBOPS{"dgphilodivid"}="exact";


    Proximity Searching for same word fails

    Paul Schaffner reports that a proximity search for the same word fails: it returns every occurrence of the word, with hit highlighting broken. The phrase search "day day" performs properly, as does the phrase "day to day", but a proximity search for "day day" fails. My initial assessment is that this is a search3 evaluation error. Judging by the duplicated "hit" highlighting, it anchors the proximity on one occurrence, then finds that same occurrence again when it searches the index for the next item and decides it is done. This appears to be a long-standing bug, as the same behavior is noted in search2. I believe this is the first report of this behavior in PhiloLogic 2 or 3 (or in about 8 years).

    This is a fairly important bug fix since searching for repeated words in close proximity is of interest to literary scholars.
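
    As a rough illustration of the behavior we want, here is a minimal sketch that pairs up distinct occurrences of one word. It is not search3's code; the helper name and the flat list of word positions are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not search3's actual code: given the sorted word
# positions of one term in a document, return pairs of distinct
# occurrences that lie within $distance words of each other.
sub same_word_proximity {
    my ($positions, $distance) = @_;
    my @pairs;
    for my $i (0 .. $#$positions - 1) {
        # Start at $i + 1 so an occurrence is never matched with itself.
        for my $j ($i + 1 .. $#$positions) {
            last if $positions->[$j] - $positions->[$i] > $distance;
            push @pairs, [ $positions->[$i], $positions->[$j] ];
        }
    }
    return @pairs;
}

my @pairs = same_word_proximity([ 3, 5, 40, 41, 90 ], 2);
print join(", ", map { "$_->[0]-$_->[1]" } @pairs), "\n";   # prints "3-5, 40-41"
```

    The point is only that the second occurrence has to come from a later position than the anchor, so a lone occurrence never counts as a hit.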


    Toggle Display of Titles in Frequency by Period/Author Reports

    Robert suggests:

    I was wondering if we should not try to arrange the frequency result query and response differently. Currently, you get back, say, decades with lists of works. There is a turn-off title button. I'm wondering if we could not have that be the default mode... but then when you get the list of decades with their frequencies... you could click on the decade and the titles for that decade would open up below (not in a java box) for that decade...



    Permit phrase search on collocation report

    Robert asks if we can support phrase searching for collocation table reports. Currently, I am limiting the permitted searches when getting a collocation report to single vectors of words, such as grand*. I did this because it would be difficult to decide what would be the "pole" in a multiword search. Robert suggests permitting searches for "grand* homme*". This will need modification to search3t and artfl_pole.pl. The logic of artfl_pole.pl is pretty hardwired to the single word per hit model, so it may need some significant hacking.


    Search link back from collocation report

    Robert suggests that we allow users to click on a word in the collocation report and run a search, presumably in the context of the pole word and bibliographic specification. This too will require modification to artfl_pole.pl. Looking at it, I think I erred in making artfl_pole.pl a standalone function. Next revision, I'll make it a search3t subroutine, which will make this and the above modification more coherent.



    Sort on DivHead words in Frequency by DivHead Report

    Martine Groulx reports that the Frequency by DivHead report sorts only on the frequencies of words appearing in each Div (Chambers Cyclopedia). These are typically dictionary or reference work entries. Charles also had to modify this report for certain functions in the Protestant database for ASP.

    The solution is to leave hooks in either search3t: &dofreqbydiv or (more reasonably) philosubs.pl:&DivHeadFreqLinks (since we already have a subroutine), to generate a new sort key and re-sort on frequency and divhead. MVO might try to hack that and add it to Patches.



    Hyphens break word highlighting

    If you have loaded a database with hyphens as a non-word-breaking character, and your search comes up with a result that has a hyphen in it, it won't highlight properly. The word will be highlighted until the hyphen, then not.

    Perhaps the word pattern that is used to load the database could be saved in philosubs and then used to do the highlighting too?

    You can solve this by editing ConcSpan in philosubs.pl and removing - or any other character that should be considered as part of a word from the pattern.
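
    To make the idea of one shared word pattern concrete, here is a small sketch. It does not reproduce ConcSpan or anything else in philosubs.pl; the $WORDCHARS pattern and the highlight() helper are hypothetical names used only for illustration.

```perl
use strict;
use warnings;

# Assumed word pattern that treats "-" as part of a word; in practice this
# would be whatever pattern the database was loaded with.
my $WORDCHARS = qr/[\w-]/;

# Hypothetical helper: wrap every whole-word match of $term in <b>...</b>.
sub highlight {
    my ($text, $term) = @_;
    $text =~ s/(?<!$WORDCHARS)(\Q$term\E)(?!$WORDCHARS)/<b>$1<\/b>/g;
    return $text;
}

print highlight("a well-known case, well known", "well-known"), "\n";
```

    Because the hyphen counts as a word character here, "well-known" is highlighted as a single word, and a search for "well" alone would not highlight half of it.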


    Triple Search Results for document 0

    Vincenzo Lomiento lomiento_AT_alice.it reports that PhiloLogic produces each hit 3 times when a database holds a single document and the search specifies some metadata in addition to the word to be searched. It works fine without specifying metadata, which is probably why we have not seen this. Vincenzo also reports that this occurs in cases where the search produces more than 9 hits.

    Some tests to run: will this behavior be replicated in cases of large databases where the first loaded document -- document 0 -- is specified by metadata?

    My guess is search3 is failing to initialize properly on the "0", probably finding it as a failed logic test, whereas it is a valid document id.
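
    The kind of failed logic test I have in mind looks like the sketch below. This is not the actual search3 code, and the same pitfall exists in whatever language the check lives in; both function names are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical check, not the actual search3 code. In Perl the values 0 and
# "0" are false, so a plain truth test rejects the first document.
sub has_docid_naive {
    my ($docid) = @_;
    return $docid ? 1 : 0;
}

# Testing for definedness (and a numeric form) accepts document 0.
sub has_docid {
    my ($docid) = @_;
    return (defined $docid && $docid =~ /^\d+$/) ? 1 : 0;
}

printf "naive: %d, fixed: %d\n", has_docid_naive("0"), has_docid("0");   # naive: 0, fixed: 1
```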

    Another solution may be simply to uniq the hitlist. Recall that it produces EXACTLY the same hits, including byte offsets, which is impossible under normal behavior.
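
    A uniq pass over the hitlist could be as small as the following sketch. The "docid byte_offset" strings are an assumed stand-in, since the real hitlist layout is not described here.

```perl
use strict;
use warnings;

# Hypothetical hit records: "docid byte_offset". The real hitlist layout is
# not shown here, so this only illustrates order-preserving deduplication.
sub uniq_hits {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

my @hits = ("0 1024", "0 1024", "0 1024", "0 2048", "0 2048", "0 2048");
print "$_\n" for uniq_hits(@hits);
```

    This would hide the symptom rather than fix the cause, so it is probably best treated as a stopgap.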


    Missing 'f' in sprintf on line 960 in search3t

    Title says it all. Will be fixed in the next release.


    Add sorted word count navigation

    Robert suggests that we add a little navigation function to allow users to click on a letter [a,b,c] for word count files. This needs to be hooked into getwordcount.pl. It will only work for alphabetized lists.


    Investigate TRE agrep support for approximate matches

    Allow for fuzzy matching with regexp:

    http://laurikari.net/tre/download.html

    TRE agrep can handle full regular expressions and approximate matching via edit distance. This would allow us to add similarity matching to regexp searches.
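
    For anyone unfamiliar with the term, edit distance is the number of single-character insertions, deletions, and substitutions separating two strings. The sketch below is a plain Perl illustration of that measure only; it does not call TRE and says nothing about how TRE agrep is implemented or invoked.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein edit distance: the minimum number of single-character
# insertions, deletions, and substitutions needed to turn $s into $t.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Keep the words within one edit of the query.
my @words = qw(true trve trewe tree truth);
print join(" ", grep { edit_distance("true", $_) <= 1 } @words), "\n";   # prints "true trve tree"
```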


    contextualize.pl doesn't know when we are at the end of the document, throws up bogus next arrows

    If you click on the page number link from a results page, contextualize.pl will bring you the page content, but it doesn't know about where the document ends, so it will put up "Next" or "Previous" page arrows even if you are at the last or first page of a document.


    PDF text recognizer

    Robert Scholes (Brown) suggests that a PDF text recognizer (loader) would be a useful addition. This would probably work like a plaintext loader without document structure, but you could get most of the reporting functions. I'll have to think about it a bit more. Might need to create a "dummy" document in the background. MVO.


    Fix 32K document limit

    As noted in the Optional Code section, we still have a 32K document limit, which can easily be bumped to 64K documents. Probably just need to redo the unpack function when reading search hits.
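
    If the limit really does come from reading document ids as signed 16-bit values (an assumption on my part, not something checked against the code), the difference between the two unpack templates looks like this:

```perl
use strict;
use warnings;

# Assumption: document ids are stored as 16-bit values. Read as signed
# ("s"), ids above 32767 wrap negative; read as unsigned ("S"), the same
# two bytes cover 0..65535.
my $raw = pack("S", 40000);

my ($signed)   = unpack("s", $raw);
my ($unsigned) = unpack("S", $raw);

print "signed: $signed, unsigned: $unsigned\n";   # signed: -25536, unsigned: 40000
```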


    Total word count for user selected groups of documents

    We currently have word frequencies for individual documents and various word counting functions from searches. Jacques Guilhaumou suggests that we implement a word count feature for groups of documents, such as a global word count for all documents by a particular author. This would not be difficult to do. One way would be to add a link with the total word count to a bibliography search. This could point to a variant of the current getwordcount.pl, which would simply sum the counts from a list of philodocids: getwordcount.pl?DBNAME.OBJ:OBJ:OBJ:OBJn.sortorder We could put in a switch to see whether we have one OBJ or many, and then use the rest of the same code. This needs some more thought. 01/08
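
    The summing step might look something like this sketch. The in-memory %counts_by_doc hash and the total_word_counts() name are assumptions for illustration; the real getwordcount.pl reads its counts from files whose format is not shown here.

```perl
use strict;
use warnings;

# Hypothetical per-document counts keyed by philodocid.
my %counts_by_doc = (
    12 => { day => 4, night => 1 },
    13 => { day => 2, sun   => 5 },
);

# Sum the counts for a colon-separated list of document ids; a single id
# goes through the same code path.
sub total_word_counts {
    my ($counts, $objlist) = @_;
    my %total;
    for my $docid (split /:/, $objlist) {
        my $doc = $counts->{$docid} or next;   # skip unknown ids
        $total{$_} += $doc->{$_} for keys %$doc;
    }
    return \%total;
}

my $total = total_word_counts(\%counts_by_doc, "12:13");
print "$_ $total->{$_}\n" for sort keys %$total;
```

    With one code path for both cases, the "one OBJ or many" switch reduces to how the id list is split.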


    Using a user-selected bibliography, searches that return no results display erroneous bibliographies

    If you run a null search on a database (by simply pressing "enter" in the search box or hitting "Submit" without filling in any search criteria), then select a bibliography using the checkboxes, then run a search that returns no results, bibliography display contains duplicates and omits some entries. For example:

    http://philologic.uchicago.edu/cgi-bin/philologic3/search3t?dbname=docsouth&word=foober&CONJUNCT=PHRASE&DISTANCE=2&PROXY=or+fewer&multidocid=72&multidocid=738&multidocid=1032&multidocid=418&multidocid=53&OUTPUT=conc&DFPERIOD=1&POLESPAN=5&THMPRTLIMIT=1

    The solution is to sort the bibliographic result array before printing the bibliography.

    It also seems that you still can't select the document with philodocid of 0. Try running a search here:

    http://robespierre.uchicago.edu/philologic/pmt3.whizbang.form.html

    Select all documents... 0 doesn't appear.


    Add Frequency by Div object formatting to philosubs.pl

    Add a function to allow modification of result display for frequency by Div object, like chapters. Specific request to add author to Encyclopedie frequency by article report.


    Add field sorting/counting for Frequency by Div Objects

    Add a function, like we have for bibliographic metadata, to allow the user to get counts by selected metadata. Again, this is specific to the Encyclopedie. Frequency by author, class of knowledge, etc.


    textload.cfg should be database specific

    We ought to have textload.cfg be database-specific in some manner, or at least put a copy of it into the database directory after running a load, so you can go back and see what parameters you loaded it under.


    Textload can generate errors if CHARSNOTTOINDEX reduces word to null string

    If you have words that are entirely made up of bytes that match the pattern in your CHARSNOTTOINDEX, they will be reduced to nothing and philoload will fail with an error about counts differing. You can solve this with a hack like this:

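    # Keep the original word in case stripping leaves nothing.
    $oldword = $theword;
    if ($CHARSNOTTOINDEX) {
        $theword =~ s/($CHARSNOTTOINDEX)//g;
    }
    # Fall back to the unstripped word so the counts still match.
    if ($theword eq '') {
        $theword = $oldword;
    }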

    Edit Install page to reflect *reality*

    For example, indicate that Mac OS X is the main operating system we're supporting. The page currently reads:

    ...if you were on Mac OS X (which is close to being supported but
    is really cranky right now and I don't recommend trying it unless you
    want to bang your head against it to make it work...

    Theme-Rheme fails to sum Middle

    Middle of Clause: out of 0

    This should have a number and total count.


    Sorted KWIC description message is incorrect

    By default kwicresort.pl points the description of the page to the wrong message. It was 240. This should be 190. But we have to modify the display a bit. Might require a new message.


    Compilation Failure of Index verification on 64 bit Ubuntu

    [Blog Entry]


    Docid 256 search results appended to low byte orders

    For high frequency words when metadata returns low byte ids, results from docid 256 may be appended.



    Older stuff

    This is a constantly growing list of things that we intend to fix for future revisions of PhiloLogic.


    Add a routine to newextract (the bibliography generator) to check whether each file has the basic elements required for loading as a TEI/ATE or other file. This would check for a DIV level object, a P level object, and possibly for some CDATA contents. More later... This is to resolve the fact that one can have directories of XML files that will contain a few headers, including, and other stuff....
    Martin Mueller (Oct 10, 2005) suggests a random hit function:
    I think that adding a random sample  feature would be very useful.  
    For any set of returns that runs in the hundreds, not to speak
    thousands, it would be a terrific first orientation to have a random
    sample. It might have a minimum size--say 50, and then increase as a
    fraction of the total size until the sample size is such that
    increasing the sample size won't add much.

    As an example, I have a student who is interested in figuring out the
    relationship of cognitive and ethical meanings in 'true' from Chaucer
    to Shakespeare. There are close to 20,000 occurrences of tr[vu]e in
    that period. For her, a random sample of 1,500 would let her figure
    out in a day where the action is.
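
    A first cut at this could be a fixed-size draw like the sketch below. The random_sample() helper and the integer hit identifiers are assumptions for illustration, and the sketch does not attempt the growing sample size Martin describes.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: draw a uniform random sample of $size hits without
# replacement, or return every hit when there are no more than $size.
sub random_sample {
    my ($hits, $size) = @_;
    return @$hits if @$hits <= $size;
    return (shuffle @$hits)[0 .. $size - 1];
}

my @hits   = (1 .. 20_000);            # stand-in for hit identifiers
my @sample = random_sample(\@hits, 1_500);
print scalar(@sample), " hits sampled\n";
```

    Shuffling the whole hitlist is simple but touches every hit, so a very large result set might call for a different sampling method.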


    Not a TODO, but a kewl idea from Orion that I don't want to lose track of....:
    Somebody from the New York Times is asking people to submit addresses
    of things from books, for them to add to a map of Places Mentioned In
    Books, a "literary map of manhattan".

    http://www.nytimes.com/2005/05/01/books/review/01COHENHO.html?ex=1272600000&en=9093cefdfcdb6409&ei=5090&partner=rssuserland&emc=rss
    (tinyurl: http://tinyurl.com/9ew8h)

    Some of these have addresses ("The Talented Mr. Ripley") but most of
    them don't. A fun project -- for someone with lots of text and a fast
    search engine and Google Maps -- would be to map all of this
    automatically, parsing out addresses or intersections or what have
    you. Though of course it would be impossible to get everything that a
    human could.

    New results format: map geographically.

    IWW style requeries, to re-present results in different ways, giving the user a "filter" (Julia's expression) approach to result sets. Russ and I are thinking of a dynamic results header as a drop-in block of code, which would keep LATENTQUERYSTRING on the server and parse it for different result sets.
    Carole Mah sez: Put something about the sort order of basic text search results. These are in LOAD order. The general loader tries to sort out the load in chronological order (year only). This could simply be put in philosubs..... Oddly enuff, we won't always know about that. Geez, you wudda thunk we would have something like that, eh?
    Create the philohistory directory either on install or when the Philo history function is run and does not find one. Check to see if it reads the PHILOTMP directive.
    Add sort by frequency to Terms button? There may be speed problems with this. And I don't have a good idea about how to put the switch in the interface (a global selection)?
    02-22-05: Allow the sort -T location to be specified in the general philo configuration, which will probably avoid the next problem. Note that it can be set in loader.xmake by changing
    SORTFLAGS= -T . -y +0 -1 +1 -2n +2 -3n +3 -4n +4 -5n +5 -6n +6 -7n +7 -8n
    to some other location with lots of space.
    SORTFLAGS= -T /export/home/thymephilo/mark/temp/philosort/ -y +0 -1 +1 -2n +2 -3n +3 -4n +4 -5n +5 -6n +6 -7n +7 -8n

    02-22-05: Trap for no space on device error on load. If we get this as we are reading texts in, it simply stops loading the offending batch and in certain circumstances will load the database without noticing it is missing a batch.
    Loading 999 ===> TEXTS/pharisjn.xml... 
    /usr/local/bin/sort: write failed: ./sortU6aO_m: No space left on device

    02-22-05: and while we're at it, let's encourage a default database directory that is NOT in the standard install location (/var/lib/philologic/databases/).