Identifying the subjects of ancient texts

Hackathon challenge

Tiresias is a database of ancient Mediterranean texts on religion, annotated with subject tags. The tags were derived by juxtaposing the subject indices and text indices of secondary literature, as described here. The derived database consists of references to ancient texts (each made up of an ancient work number and an internal reference, e.g., 1.200 for book 1, line 200) and the subject tags (e.g., “monotheism”) attached to those references.
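The internal reference format can be handled with a small helper. The sketch below assumes references are dot-separated integers, as in the 1.200 example; the function name `parse_reference` is hypothetical, and any reference with non-numeric parts would need extra handling.

```python
def parse_reference(ref: str) -> tuple[int, ...]:
    """Split a dotted internal reference such as "1.200" into integer parts.

    Assumption: every component is a plain integer (book, line, ...).
    """
    return tuple(int(part) for part in ref.strip().split("."))


# "1.200" means book 1, line 200, not the decimal number 1.2.
print(parse_reference("1.200"))

# Sorting by the parsed tuple keeps line 30 before line 200,
# which a string or float comparison would get wrong.
print(sorted(["1.200", "1.30", "2.5"], key=parse_reference))
```

Treating the reference as a tuple rather than a float avoids collapsing 1.200 and 1.2 into the same value.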

The challenge is to build on these pairs of subject tag and reference in order to tag additional ancient texts. This can be done by modelling the relationship between the existing tags and the full texts they refer to, and then using that model to tag further texts.
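One possible baseline for such a model, offered only as a rough sketch and not as the organizers' method, is a bag-of-words profile per tag with cosine similarity. The function names, the threshold value, and the toy passages below are all invented for illustration; real passages would come from the data files.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (English only)."""
    return re.findall(r"[a-z]+", text.lower())


def train(pairs):
    """Build one token-count profile per tag from (text, tag) pairs."""
    profiles = defaultdict(Counter)
    for text, tag in pairs:
        profiles[tag].update(tokenize(text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two token-count Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_tags(text, profiles, threshold=0.3):
    """Return tags whose profile is similar enough, best match first."""
    vec = Counter(tokenize(text))
    scores = {tag: cosine(vec, prof) for tag, prof in profiles.items()}
    keep = [tag for tag, score in scores.items() if score >= threshold]
    return sorted(keep, key=lambda tag: -scores[tag])


# Invented toy passages, not taken from the Tiresias data.
toy_pairs = [
    ("there is one god alone and no other", "monotheism"),
    ("the one god rules all things", "monotheism"),
    ("they offered a bull at the altar", "sacrifice"),
    ("the priest burned the offering on the altar", "sacrifice"),
]
profiles = train(toy_pairs)
print(suggest_tags("he proclaimed that god is one", profiles))
```

This baseline ignores stop words, word order, and the original-language texts; a stronger entry would likely replace it with TF-IDF weighting or a trained multi-label classifier.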

Data is available in CSV files:

1. subjects_with_refs.csv: Work numbers and two types of subject tags: full tags (i.e., several words, not stubbed) and a shortened, stubbed form.

2. titles_clean.csv: All work numbers and corresponding authors and titles.

3. texts_hackathon.csv: Work numbers, internal references, and the corresponding full text in the original languages and in English translation (in many cases, only the translation or only the original is available).

4. titles_list.csv: Work numbers, authors, and titles of the works for which full text or translation is available in texts_hackathon.csv.
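The files above might be combined into training pairs along the following lines. This is a sketch under assumptions: the column names (`work_id`, `ref`, `subject`, `text_en`) are guesses and should be checked against the actual CSV headers, and the in-memory strings stand in for the real files so that the example runs on its own.

```python
import csv
import io

# Stand-ins for subjects_with_refs.csv and texts_hackathon.csv.
# Column names and rows are hypothetical; check the real headers first.
SUBJECTS_CSV = """work_id,ref,subject
1,1.200,monotheism
1,1.200,prayer
2,3.15,sacrifice
"""
TEXTS_CSV = """work_id,ref,text_en
1,1.200,Toy passage about one god.
3,2.1,Untagged toy passage.
"""


def read_rows(handle):
    """Read a CSV file object into a list of dicts keyed by header."""
    return list(csv.DictReader(handle))


def build_training_pairs(subject_rows, text_rows):
    """Pair each tagged reference with its text, skipping missing texts."""
    texts = {(r["work_id"], r["ref"]): r["text_en"] for r in text_rows}
    pairs = []
    for row in subject_rows:
        text = texts.get((row["work_id"], row["ref"]))
        if text:
            pairs.append((text, row["subject"]))
    return pairs


def untagged_passages(subject_rows, text_rows):
    """Return text rows whose (work, reference) has no subject tag yet."""
    tagged = {(r["work_id"], r["ref"]) for r in subject_rows}
    return [r for r in text_rows if (r["work_id"], r["ref"]) not in tagged]


# With the real data, replace io.StringIO(...) with
# open("subjects_with_refs.csv", newline="", encoding="utf-8") and so on.
subjects = read_rows(io.StringIO(SUBJECTS_CSV))
texts = read_rows(io.StringIO(TEXTS_CSV))
print(build_training_pairs(subjects, texts))
print(untagged_passages(subjects, texts))
```

Keying on the (work number, internal reference) pair keeps one passage with several tags as several training pairs, and the untagged passages are the candidates for new tags.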