Faculty Collaborator: Christelle Alvarez
About:
Seong-Heon has been collaborating with Professor Christelle Alvarez on a two-part project. The first part involved conducting Data Science and AI workshops for her graduate seminar, EGYT 2200 Monumentality and Texts in Ancient Egypt. In these workshops, students gained hands-on experience building machine learning models and discussed the capabilities and responsibilities associated with AI.
The second, ongoing part of the project aims to develop an OCR (Optical Character Recognition) model for hieroglyphs. Currently, identifying inscriptions on surfaces covered with hieroglyphs requires human transcription. The goal of this project is to create a tool that can transform three-dimensional, photogrammetrically captured images of inscriptions into accessible, searchable, and analyzable content. This tool has the potential to significantly accelerate the pace at which researchers and learners can process large amounts of hieroglyphic data, thereby opening new opportunities in the study of ancient texts.
Project Goals
Develop a workshop introducing Data Science and Machine Learning to students in Egyptology.
Build a hieroglyphs OCR program for learners and researchers.
DS + ML Workshop: Goals
Students Can:
Productively discuss Egyptology literature containing significant Deep Learning/Machine Learning components.
Run Python code in Colab environments and make small modifications to it.
Sketch ideas for applying Data Science or Machine Learning in their own research projects.
DS + ML Workshop: Structure
Readings:
Curated 2 + 1 readings on recent applications of Data Science and Machine Learning to hieroglyphs.
Hands-On Workshop:
Offered a Colab workshop on building a clustering model for hieroglyphs.
Discussion Workshop:
Hosted a workshop on ML capabilities, limitations, and ethics.
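To give a flavor of what a hieroglyph clusterer involves, here is a minimal sketch of k-means clustering in plain Python. The workshop's actual method and data are not described here, so everything below is an assumption for illustration: the `kmeans` function, the two made-up features per glyph (aspect ratio and ink density), and the toy values.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Group equal-length feature vectors into k clusters (plain k-means)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical features per glyph image: [aspect ratio, ink density].
tall_thin_signs = [[0.30, 0.20], [0.35, 0.25], [0.28, 0.22]]
wide_dense_signs = [[1.80, 0.70], [1.75, 0.65], [1.90, 0.72]]
points = tall_thin_signs + wide_dense_signs

labels = kmeans(points, k=2)
print(labels)  # the first three glyphs share one label, the last three the other
```

In a real notebook the feature vectors would come from glyph images rather than hand-typed numbers, and a library implementation would likely replace this hand-written loop.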
Hieroglyph OCR: Motivation
Background:
Few Egyptology resources are easily available digitally; some are not digitized at all.
Problem:
Many hieroglyph publications are only available as physical copies with varying transcription formats.
Consequences:
Students have few resources for learning hieroglyphs.
Researchers waste time manually searching for resources and flipping pages.
Hieroglyph OCR: Strategy
Adapt an existing OCR engine, Tesseract, to hieroglyphs.
Locate various fonts supporting Egyptian Hieroglyphs.
Generate training data from fonts.
Train a Tesseract model from scratch with training data.
Evaluate on real facsimiles of hieroglyphs.
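The font-based data generation step above could look roughly like the following sketch. It is not the project's actual pipeline. It assumes Pillow is installed for rendering, that the font files passed in (hypothetical paths) cover Egyptian Hieroglyphs, and that the trainer expects each line image to be paired with a same-named `.gt.txt` transcription, as the tesstrain tooling for Tesseract does. Drawing signs uniformly at random is a simplification: real inscriptions have sign frequencies and groupings that random lines do not capture.

```python
import random
from pathlib import Path

# The 1,071 signs encoded at U+13000 to U+1342E in the Unicode
# Egyptian Hieroglyphs block.
HIEROGLYPHS = [chr(cp) for cp in range(0x13000, 0x1342F)]

def random_line(rng, min_len=3, max_len=12):
    """Return one synthetic ground-truth line of randomly chosen signs."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(HIEROGLYPHS) for _ in range(length))

def render_line(text, font_path, size=48, pad=8):
    """Render text as black-on-white grayscale using Pillow (imported lazily)."""
    from PIL import Image, ImageDraw, ImageFont  # assumption: Pillow is installed
    font = ImageFont.truetype(font_path, size)
    left, top, right, bottom = font.getbbox(text)
    image = Image.new("L", (right - left + 2 * pad, bottom - top + 2 * pad), 255)
    ImageDraw.Draw(image).text((pad - left, pad - top), text, font=font, fill=0)
    return image

def write_dataset(out_dir, font_paths, n_lines, seed=0):
    """Write n_lines image/transcription pairs, choosing a font per line."""
    rng = random.Random(seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(n_lines):
        text = random_line(rng)
        stem = out / f"line_{i:06d}"
        render_line(text, rng.choice(font_paths)).save(f"{stem}.png")
        Path(f"{stem}.gt.txt").write_text(text + "\n", encoding="utf-8")

# Example call (hypothetical font path and output folder):
# write_dataset("data/hiero-ground-truth", ["fonts/SomeHieroglyphFont.ttf"], 100)
```

Mixing several fonts per dataset, as `write_dataset` does, is one way to expose the model to varied glyph shapes before it is evaluated on real facsimiles.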
Hieroglyph OCR: Next Steps
Current Status:
Successfully generated a very small synthetic dataset and trained a model on it.
Scale Up:
Increase the training set in stages to 10k, 40k, and eventually 400k data points.
Action Items:
Secure computing resources.
Prepare validation dataset.
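Once a validation dataset exists, one common way to score OCR output against it is the character error rate (CER): the edit distance between the model's output and the reference transcription, divided by the reference length. The project's actual metric is not stated here, so the sketch below is an assumption, and the function names and sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete char_a
                curr[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute or match
            ))
        prev = curr
    return prev[-1]

def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

# Four reference signs; the hypothetical OCR output gets the third one wrong.
reference = "\U00013000\U00013001\U00013002\U00013003"
hypothesis = "\U00013000\U00013001\U00013010\U00013003"
print(character_error_rate(reference, hypothesis))  # 0.25
```

Because Python strings are indexed by code point, each hieroglyph counts as one character even though it lies outside the Basic Multilingual Plane.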
Challenges
Building the OCR Engine:
Building the engine turned out to be much more challenging than expected.
Solution: Prioritize the workshops, build the OCR in the background, and reach out to CCV (Center for Computation and Visualization).
Introducing DS and ML on a Tight Schedule:
Solution: Use low-code/no-code tools, along with methods other than hands-on programming, to help students become familiar with DS and ML.