OCHRE Wiki - Text Importation

Text Importation

By Miller Prosser, October 2013

Updated January 2017

OCHRE can import texts copied in from other sources, like text editors and word processors.

There are many options to help import the text correctly.

The Pending Content Tab

Type or paste your text here.

Text entered in square and half brackets will be marked with the appropriate level of metadata upon import.

Entering Special Codes

Broken Sections

Use the following codes to indicate broken sections.

NOTE: the resulting display depends on the configuration of the epigraphic sigla at the project level.

A section of approximately 3 broken signs: [x$3$] will result in something like [x x x]

A section of an indeterminate number of broken signs: [x$~$] will result in something like [ ]

A section of an indeterminate number of broken signs equivalent to about 6 signs: [x$%6$] will result in a blank space six signs long.

A section of 3 broken lines: [X$3$]

NOTE: If you are providing line numbers in your transliteration, do not indicate a line number when using the missing lines code X$.

A section of an indeterminate number of broken lines: [X$~$]

A section represented as both blank and missing: [x$3$]

NOTE: You may use this last code to indicate blank space as well. Simply omit the brackets and use the code x$9$, replacing the number 9 with the approximate amount of space needed. However, keep in mind that the blank space will be assigned the "missing signs" metadata property even if the view of the text looks like simple blank space.
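The codes above can be read mechanically. The following Python sketch is not part of OCHRE; it is a hypothetical helper (the names BROKEN_CODE, BrokenSection, and parse_broken_code are invented here) that classifies a code according to the rules listed in this section, which may help when checking a transliteration before pasting it into the Pending Content tab.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for the codes described in this article.
# This is an illustration only, not OCHRE's own import logic.
BROKEN_CODE = re.compile(r"(\[?)([xX])\$(~|%?\d+)\$(\]?)")


class BrokenSection(NamedTuple):
    unit: str              # "signs" for x, "lines" for X
    count: Optional[int]   # None when the extent is indeterminate (~)
    spacing_only: bool     # True for the %N form (blank space about N signs long)
    bracketed: bool        # False for the bare x$9$ blank-space form


def parse_broken_code(code: str) -> BrokenSection:
    """Classify a single broken-section code such as [x$3$] or [X$~$]."""
    m = BROKEN_CODE.fullmatch(code.strip())
    if m is None:
        raise ValueError(f"not a broken-section code: {code!r}")
    opener, letter, arg, closer = m.groups()
    if bool(opener) != bool(closer):
        raise ValueError(f"unbalanced brackets in {code!r}")
    unit = "lines" if letter == "X" else "signs"
    if arg == "~":
        count, spacing_only = None, False
    elif arg.startswith("%"):
        count, spacing_only = int(arg[1:]), True
    else:
        count, spacing_only = int(arg), False
    return BrokenSection(unit, count, spacing_only, bool(opener))


print(parse_broken_code("[x$3$]"))
print(parse_broken_code("[X$~$]"))
```

A code that lacks its closing $, such as [x$13], raises ValueError in this sketch.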

NOTES:

- A space after the final $ will separate the broken section from what follows.
  - [x$~$] GÍN will result in [...] GÍN, creating two words
  - [x$~$]-pi-iš will result in [...]-pi-iš, creating only one word
- You cannot enter any other text on a line that begins with [X$ because this code indicates that the entire line is missing. To put it another way, if a line contains the broken line indicator [X$ it cannot contain anything else.
- If you want OCHRE to display every broken line as a separate line, define the epigraphic unit as Type: Line. Type: Region will summarize the broken lines.
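Two of the rules in this section lend themselves to a quick check before import: every code needs its closing $, and a missing-lines code must stand alone on its line. The sketch below is a hypothetical pre-import check written for this article (lint_pending_line and its messages are invented names); OCHRE itself may report such problems differently.

```python
import re
from typing import List

# Hypothetical patterns based only on the rules stated in this article.
CODE_START = re.compile(r"[xX]\$")                    # start of any broken code
COMPLETE_CODE = re.compile(r"[xX]\$(?:~|%?\d+)\$")    # a code with its closing $
MISSING_LINES = re.compile(r"\[X\$(?:~|\d+)\$\]")     # the missing-lines code


def lint_pending_line(line: str) -> List[str]:
    """Return a list of problems found in one line of pending content."""
    problems = []
    text = line.strip()
    # Every x$ or X$ opener should belong to a complete code.
    if len(CODE_START.findall(text)) > len(COMPLETE_CODE.findall(text)):
        problems.append("broken-section code is missing its closing $")
    # A missing-lines code must be the only thing on its line
    # (no line number, no other text).
    m = MISSING_LINES.search(text)
    if m and m.group(0) != text:
        problems.append("a line with [X$...$] must contain nothing else")
    return problems


for sample in ["[X$5$]", "6) k [x$14$]", "9) k . [x$13]", "[X$3$] mlk"]:
    print(repr(sample), lint_pending_line(sample))
```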

Examples of broken sections:

ml[k x$~$] ytn will be imported as ml[k ...] ytn with ytn being a separate word.

ml[k x$~$]ytn will interpret -ytn as the latter half of a word that began in the break.

mlk [x$9$] ytn will be imported as mlk [ . . . . . . . . .] ytn.

NOTE: the blank and missing space is usually represented simply by spaces in the text view; however, users can override this display and choose another character in their project-level preferences.

NOTE: If you import blank space in the middle of a word, this blank space will be added to the discourse unit. In most cases, this will cause the text analysis wizard to fail when looking up this attested form. You will need to remove the blank epigraphic unit from the discourse unit before it will match against an attested form in the glossary.

Here is a sample of Pending Content, followed by the resulting import:

Recto

[X$5$]

6) k [x$14$]

7) w [x$14$]

8) a͗⸢ḥ⸣[x$11$]

9) k . [x$13$]

10) w . ⸢ṯ⸣[x$12$]

11) w . y[x$13$]

Recto

[5 lines missing]

(6) k [--------------]

(7) w [--------------]

(8) a͗⸢ḥ⸣[-----------]

(9) k . [-------------]

(10) w . ⸢ṯ⸣[------------]

(11) w . y[-------------]
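To make the correspondence between the two blocks above concrete, here is a small Python sketch that mimics the sample's display. It is an approximation written for this article, not OCHRE's rendering code: the function name preview is invented, the dash is simply the fill character seen in this sample (the real display depends on the project's epigraphic sigla), and only the counted forms [x$N$] and [X$N$] are handled.

```python
import re

# Counted broken-section codes only; ~ and %N forms are left untouched.
COUNTED_CODE = re.compile(r"\[([xX])\$(\d+)\$\]")


def preview(line: str, fill: str = "-") -> str:
    """Approximate the displayed form of one pending-content line."""

    def expand(match: "re.Match[str]") -> str:
        letter, n = match.group(1), int(match.group(2))
        if letter == "X":
            return f"[{n} lines missing]"
        return "[" + fill * n + "]"

    # The sample shows line numbers typed as "6)" displayed as "(6)".
    line = re.sub(r"^(\d+)\)", r"(\1)", line)
    return COUNTED_CODE.sub(expand, line)


print(preview("[X$5$]"))
print(preview("6) k [x$14$]"))
```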

Ruled Lines

A similar strategy is used for noting the presence of ruled lines on the epigraphic representation of a Text. Use "s" to represent a single line; "S" to represent a double line. Include the length of the line as shown in the following examples:

s$36$ : a single line 36 characters long
[s$18$] : a damaged single line 18 characters long
S$18$ : a double line 18 characters long

The s- or S- notation marks the resulting epigraphic unit's Type as a "separator" and, in effect, assigns the corresponding settings, which are made explicit on the epigraphic unit's pane. If auto-line-numbering is on, the ruled line is not assigned a number.

NOTE: after import, all of these options can be changed in edit mode. See the wiki article on using metadata to indicate damage and missingness.
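The ruled-line notation can be summarized as three pieces of information: single or double, damaged or intact, and length. The Python sketch below decodes the three example codes on that basis. It is a hypothetical illustration (RuledLine and parse_ruled_line are invented names) and says nothing about how OCHRE stores separators internally.

```python
import re
from typing import NamedTuple

# Hypothetical pattern for the s/S ruled-line codes described above.
RULED_CODE = re.compile(r"(\[?)([sS])\$(\d+)\$(\]?)")


class RuledLine(NamedTuple):
    double: bool    # True for S (double line), False for s (single line)
    damaged: bool   # True when the code is wrapped in square brackets
    length: int     # length of the line in characters


def parse_ruled_line(code: str) -> RuledLine:
    """Decode a ruled-line code such as s$36$, [s$18$], or S$18$."""
    m = RULED_CODE.fullmatch(code.strip())
    if m is None or bool(m.group(1)) != bool(m.group(4)):
        raise ValueError(f"not a ruled-line code: {code!r}")
    return RuledLine(
        double=(m.group(2) == "S"),
        damaged=bool(m.group(1)),
        length=int(m.group(3)),
    )


for code in ["s$36$", "[s$18$]", "S$18$"]:
    print(code, parse_ruled_line(code))
```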

Other Codes

OCHRE provides two codes for creating compound discourse units during the import process. These compound units could be words with clitic elements (as is common in certain alphabetic languages), compound proper names, or entire phrases. Each project may decide for itself which character codes to use when marking up import text. The codes should be entered at the project level on the epigraphic sigla tab. We suggest the following two codes because they are relatively safe in that they do not communicate any inherent processing meaning for the database.

For compounds combined without intervening spaces:

Use @ to enter a compound word that results in a discourse unit with the elements nested beneath a parent which displays the constituent elements with no intervening spaces. For example, b@šnt will produce a single discourse unit for bšnt which is the parent of two discourse units b and šnt.

For compounds separated by spaces:

Use # to create a compound discourse unit, like a phrase, that results in a discourse unit with the elements nested beneath a parent which displays the constituent elements separated by spaces. This would be the approach to use if you needed to treat an entire phrase as a discourse unit, with its constituent elements as individual words.

NOTE: In the case where you want the compound items to be separated by a specific character, like a dash, configure the epigraphic sigla tab with your chosen character in the compound field. Be careful not to use a character that has other meaning on import.
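The parent-and-children structure described above can be pictured with a short Python sketch. This is only a model for illustration: DiscourseUnit and build_compound are invented names, the codes @ and # are the suggested defaults rather than fixed values, and the phrase bt#mlk is a made-up input, not an example from OCHRE.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiscourseUnit:
    text: str
    children: List["DiscourseUnit"] = field(default_factory=list)


def build_compound(token: str, joined: str = "@", spaced: str = "#") -> DiscourseUnit:
    """Model a compound: the parent shows the parts, the children hold them.

    `joined` parts are displayed with no intervening space,
    `spaced` parts are displayed separated by spaces.
    A token mixing both codes is not handled by this sketch.
    """
    for code, separator in ((joined, ""), (spaced, " ")):
        if code in token:
            parts = [part for part in token.split(code) if part]
            return DiscourseUnit(
                text=separator.join(parts),
                children=[DiscourseUnit(part) for part in parts],
            )
    return DiscourseUnit(token)  # an ordinary word with no compound code


word = build_compound("b@šnt")
print(word.text, [child.text for child in word.children])

phrase = build_compound("bt#mlk")  # hypothetical phrase for illustration
print(phrase.text, [child.text for child in phrase.children])
```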

The Import Tab

Link in the appropriate dictionaries and writing systems. The import will try to match against these, if the user so chooses.

On the Information Tab of the Text, it is helpful (and sometimes necessary) to select the type of writing and the language of the text.

Enable epigraphic-unit processing: this will atomize the text down to individual signs or letters.

Lookup script-units/readings in available writing systems: links each sign to a sign in a writing system.

Automatically number each line of text: select this option if you do not intend to provide line numbers in the Pending Content pane.

Expect section headers in the text: check this box if you intend to provide headers like Recto, Section, etc. In the Pending Content tab, each header must follow a blank line.

Leave 'x' and 'X' unformatted: check this box if you include 'x' or 'X' as indications of broken signs. This will tell OCHRE not to treat these as real letters.

Parse words into individual alphabetic characters: this is only available for texts of the Type: alphabetic. If unchecked, OCHRE will not fully atomize words into letters.

Apply language tags to numerals: if checked OCHRE will define logographic numerals (e.g. 5, 1/2, etc.) as being in the selected language.

Remove format: these check boxes allow the user to provide italicized text to be used for language recognition and to have that formatting removed upon import. These are useful if you are importing a bilingual text and you want the import to recognize Latin apart from Punic, for example.

There are some helpful notes at the bottom of the page explaining some of the assumptions OCHRE makes based on the input it receives in the Pending Content pane.

Project Specifications

Each project must specify its transcription system in the project specifications. OCHRE will use these specifications to recognize languages and writing systems. A project can choose which cuneiform transcription tradition it wishes to follow, for example whether signs are uppercase, lowercase, italicized, or superscript.

Additional Assumptions

OCHRE will ignore the '.' (period) and the '|' (vertical bar) when auto-generating discourse units, as these are typically used as word dividers. They are represented in the epigraphic structure only.
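As a rough illustration of this assumption, the Python sketch below drops word dividers when collecting the words of a line. It is not OCHRE's tokenizer: discourse_tokens is an invented name, and the sketch assumes the dividers are typed as separate, space-delimited tokens, as in the sample transliteration earlier in this article.

```python
from typing import List

# Characters treated as word dividers, per the assumption stated above.
WORD_DIVIDERS = {".", "|"}


def discourse_tokens(line: str) -> List[str]:
    """Return the whitespace-separated tokens of a line, minus word dividers."""
    return [token for token in line.split() if token not in WORD_DIVIDERS]


print(discourse_tokens("mlk . ytn"))
print(discourse_tokens("mlk | ytn"))
```

The dividers would still belong to the epigraphic hierarchy; only the list of words omits them.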

Workflow

1. Assign the Type and Language of the text on the main Information tab of the text.
2. Add text to the Pending transliteration pane on the Pending Content tab. NOTE: if the transliteration is highly formatted, it may be best to load the text from a docx file using the Word button on the Pending Content pane.
3. From the Import tab, click Copy pending transliteration.
4. Select the desired epigraphic and discourse unit processing options.
5. Click Process textual content to produce a preview.
6. Expand the Preview results to inspect the proposed Epigraphic and Discourse hierarchies.
7. Click the View Results button to see a composed view of the text as it would appear once imported.
8. Check the Not Found tab to verify that all signs were found in the appropriate writing systems. Please report any missing signs to ODS staff.
9. Make any necessary changes to the source document and repeat the Process step.
10. Click Accept result to post the text to the database.

Other Notes

On rare occasions, you may need to classify a sign as a specific type even though it is not a typical sign. Perhaps it does not match against a sign in a writing system. Perhaps you can tell the sign in question is a number, but you cannot determine the value. In these cases, you can use the sign type pick list to specify manually the type of sign.

This will allow OCHRE to handle and view the sign as needed. This manual choice should not be used in most cases.
