Finding Co-occurrences of Words within Texts

By Sandra Schloen, November 2016

One of the most rewarding aspects of OCHRE's item-based data model is the ability to itemize Textual content down to a very granular level of detail -- down to signs and words if necessary -- and from there to build a corpus-based Dictionary/Glossary. Assorted features and tools are provided to help with the import of Texts and their subsequent processing.

Searching for a Single Word, for starters

A corpus-based Dictionary allows the scholar to collect variations of a "word" within a common lemma, articulating the grammatical forms and assorted spellings of the various instances found in the Texts. Because a word can be written in different ways, and because editorial conventions mark brokenness on clay tablets (e.g. "e-ma-[ru-um]"), a simple string-matching search for the (in this case Akkadian) word for "donkey," for example, would return very few hits.
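The limitation of literal string matching can be illustrated with a minimal Python sketch. This is not how OCHRE works internally; the token list and the function name strip_break_marks are hypothetical, and only the spelling "e-ma-[ru-um]" comes from the example above.

```python
def strip_break_marks(token: str) -> str:
    """Remove the square brackets that mark broken or restored signs."""
    return token.replace("[", "").replace("]", "")

# Hypothetical transliterated tokens, for illustration only.
tokens = ["e-ma-ru-um", "e-ma-[ru-um]", "[e]-ma-ru-um", "a-na"]

query = "e-ma-ru-um"

# Literal matching finds only the undamaged spelling.
literal_hits = [t for t in tokens if t == query]

# Ignoring break marks recovers the damaged instances as well.
normalized_hits = [t for t in tokens if strip_break_marks(t) == query]

print(literal_hits)
print(normalized_hits)
```

Even with break marks ignored, a search like this would still miss alternative spellings and inflected forms, which is why linking each attested instance to its lemma in the Dictionary matters.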

OCHRE's intelligent query capabilities, along with the relationships captured in the dictionary entry, enable smart searching for words in the Text corpus. To learn more about searching within Dictionaries (e.g., to find out how to determine the Akkadian word for donkey!) see the wiki article Search Dictionaries.

Searching for Co-occurring Words

To search for co-occurring words we use the Concepts category. Here we have added a new Concept -- one representing the co-occurrence of the words "donkey", "textile" (suggesting trade?), and "die." For lack of imagination we've called this concept "Donkey - Textiles - Die." Using the Linked Items pane in any of the usual ways, we've linked the words of interest, each represented by its lemma form in the relevant Dictionary, into the Components of the concept.

From a Query we can now reference this concept. Create a query that is Scoped to the Texts category. The COMPONENTS option will be one of the available meta-Variables and the Operator "that co-occur" will be available for this meta-Variable.

When you Perform this Query, OCHRE will find all Texts that contain ANY form of the word for "donkey" and ANY form of the word for "textile" and ANY form of the word for "die". When one of the resulting Texts is viewed from the Query Results list, every instance of each qualifying word will be highlighted.
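The matching logic described here can be sketched in a few lines of Python. This is only an illustration of the "that co-occur" semantics under simplifying assumptions, not OCHRE's implementation: each word instance is reduced to a pair of its written form and the lemma it is linked to, and the text names, form labels, and the function find_cooccurrences are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A word instance: (form as written, lemma it is linked to).
Word = Tuple[str, str]

def find_cooccurrences(
    texts: Dict[str, List[Word]], components: Set[str]
) -> Dict[str, List[int]]:
    """For each text that contains every component lemma, return the
    positions of the word instances that should be highlighted."""
    results: Dict[str, List[int]] = {}
    for name, words in texts.items():
        lemmas_present = {lemma for _, lemma in words}
        # Every component must be represented by at least one form.
        if components <= lemmas_present:
            results[name] = [
                i for i, (_, lemma) in enumerate(words) if lemma in components
            ]
    return results

# Hypothetical corpus; form labels stand in for attested spellings.
texts = {
    "Text A": [
        ("form-1a", "donkey"),
        ("form-2a", "textile"),
        ("form-x", "silver"),
        ("form-3a", "die"),
        ("form-1b", "donkey"),
    ],
    "Text B": [("form-1a", "donkey"), ("form-2b", "textile")],
}

hits = find_cooccurrences(texts, {"donkey", "textile", "die"})
print(hits)
```

Text B is excluded because no form of "die" occurs in it, while both forms of "donkey" in Text A are marked for highlighting.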

Note that you can also select any grammatical form, or any attested form, of any lemma entry, simply by navigating to the appropriate level of the entry in the Linked Items pane and linking in the desired form.

Other Text Components, on Concepts -- Property Values

Along with Dictionary entries there are other types of items that can be included as Text Components on a Concept and which will be used appropriately when searching for co-occurring words in Texts. Here we are specifying the Value of a Property "Cereal or grain." This will allow ANY dictionary entry that has been tagged as a "Cereal or grain" to qualify as a matching word in the co-occurrence query. In this example from the Persepolis Fortification Archive (PFA) project we are looking for co-occurring instances of any Elamite word tagged as a Cereal/grain, plus ANY instance of the word for storehouse, plus ANY instance of the given place-name.
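One way to picture a Property Value acting as a Text Component is as a test applied to dictionary entries rather than a link to one specific lemma. The Python sketch below assumes that view; the Entry class and the helper names are hypothetical, and the two sample texts are invented. Only the words "ŠE.BAR", "araš", and "Ziššawi-š" and the value "Cereal or grain" come from the PFA example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

@dataclass(frozen=True)
class Entry:
    """A dictionary entry with the property values it is tagged with."""
    lemma: str
    property_values: FrozenSet[str] = frozenset()

# A component is any test that a dictionary entry can pass or fail.
Component = Callable[[Entry], bool]

def is_lemma(lemma: str) -> Component:
    """Component satisfied only by one specific lemma."""
    return lambda entry: entry.lemma == lemma

def has_property_value(value: str) -> Component:
    """Component satisfied by ANY entry tagged with the given value."""
    return lambda entry: value in entry.property_values

def cooccur(entries: List[Entry], components: List[Component]) -> bool:
    """True if every component is satisfied by at least one entry."""
    return all(any(c(e) for e in entries) for c in components)

grain = Entry("ŠE.BAR", frozenset({"Cereal or grain"}))
storehouse = Entry("araš")
place = Entry("Ziššawi-š")

components = [
    has_property_value("Cereal or grain"),
    is_lemma("araš"),
    is_lemma("Ziššawi-š"),
]

text_1 = [grain, storehouse, place]  # hypothetical text: all three present
text_2 = [grain, storehouse]         # hypothetical text: place name missing

print(cooccur(text_1, components))
print(cooccur(text_2, components))
```

Any other entry tagged "Cereal or grain" would satisfy the first component in the same way, without being listed explicitly on the concept.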

Note too that other kinds of Properties can be specified on the query Criteria. Here we use the COMPONENTS-that-co-occur operator along with regular Properties restricting the matching Texts to those tagged as Category T ("Letters"). Scoping to a specific set of Text hierarchies on the query's Scope tab, or to those assigned to a selected Period, is also permitted.

Here is the View of one of the resulting Texts showing a word for grain ("ŠE.BAR"), the word for "storehouse" ("araš"), and the place name (from the lemma entry "Ziššawi-š") all highlighted.

Other Text Components, on Concepts -- Persons and Locations

If Prosopography or Gazetteer work has been done on the Texts using the Wizards provided, OCHRE can recognize those words tagged as Persons and Locations. The Person and Location items can thus be used as Text Components to be considered on co-occurrence queries. In this example from the Ras Shamra Tablet Inventory (RSTI) project we are looking for co-occurring instances of words representing the Person of the Queen of Ugarit, along with the Ugaritic word for "mother", along with the Location of "The kingdom of Ugarit."

The Query itself is straightforward, as before ...

... and the View of one of the resulting texts shows all of the matching words highlighted.