Seong-Heon Jung

Seeing DATA1150 Final Presentation

Faculty Collaborator: Christelle Alvarez

About:

Seong-Heon has been collaborating with Professor Christelle Alvarez on a two-part project. The first part consisted of Data Science and AI workshops for her graduate seminar, EGYT 2200 Monumentality and Texts in Ancient Egypt. In these workshops, students gained hands-on experience building machine learning models and discussed the capabilities and responsibilities that come with AI.

The second, ongoing part of the project aims to develop an OCR (Optical Character Recognition) model for hieroglyphs. Currently, identifying inscriptions on surfaces covered with hieroglyphs requires human transcription. The goal of this project is to create a tool that can transform three-dimensional, photogrammetrically captured images of inscriptions into accessible, searchable, and analyzable content. This tool has the potential to significantly accelerate the pace at which researchers and learners can process large amounts of hieroglyphic data, thereby opening new opportunities in the study of ancient texts.

Transcript

Project Goals

Develop a workshop introducing Data Science and Machine Learning to students in Egyptology.
Build a hieroglyphs OCR program for learners and researchers.

DS + ML Workshop: Goals

Students Can:
- Productively discuss Egyptology literature containing significant Deep Learning/Machine Learning components.
- Utilize Python in Colab environments and make small modifications.
- Sketch ideas for applying Data Science or Machine Learning in their own research projects.

DS + ML Workshop: Structure

Readings:
- Curated 2 + 1 readings on recent applications of Data Science and Machine Learning to hieroglyphs.
Hands-On Workshop:
- Offered Colab workshop for building a clusterer for hieroglyphs.
Discussion Workshop:
- Hosted workshop on ML capabilities, limitations, and ethics.
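
To give a sense of what "building a clusterer" can involve, here is a minimal sketch in plain Python. The source does not say which algorithm, features, or libraries the Colab workshop used, so everything here is an assumption: k-means is chosen only because it is a common introductory clusterer, the two features (ink density and aspect ratio) are hypothetical, and the numbers are made up for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of feature tuples."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def kmeans(points, k, n_iter=20):
    """Cluster points into k groups; returns (labels, centroids)."""
    # Deterministic farthest-first initialization: start from the first point,
    # then repeatedly add the point farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )

    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


# Hypothetical (ink density, aspect ratio) features for six sign images.
points = [
    (0.10, 0.20), (0.15, 0.25), (0.20, 0.10),  # three visually similar signs
    (0.90, 0.80), (0.85, 0.90), (0.80, 0.95),  # three signs of another shape
]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In a classroom setting, the interesting discussion usually starts after the clustering: which signs ended up together, and whether the grouping reflects anything an Egyptologist would recognize.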

Hieroglyph OCR: Motivation

Background:
- Few Egyptology resources are easily available digitally; some are not digitized at all.
Problem:
- Many hieroglyph publications are only available as physical copies with varying transcription formats.
- Consequences:
  - Students have few resources for learning hieroglyphs.
  - Researchers waste time manually searching for resources and flipping pages.

Hieroglyph OCR: Strategy

Retrain an existing OCR model on hieroglyphs:
- Locate various fonts supporting Egyptian Hieroglyphs.
- Generate training data from the fonts.
- Train a Tesseract model from scratch with the generated training data.
- Evaluate on real facsimiles of hieroglyphs.
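
The data-generation step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual pipeline: it samples signs uniformly at random from the Egyptian Hieroglyphs Unicode block (real inscriptions are not random, so a real dataset would more likely draw on transcribed text), it separates signs with spaces purely for convenience, and the font name, file paths, and helper names are hypothetical. The second function only builds an argument list for `text2image`, the rendering tool that ships with Tesseract's training tools; it does not run it.

```python
import random

# Egyptian Hieroglyphs Unicode block as first encoded in Unicode 5.2.
FIRST_SIGN = 0x13000
LAST_SIGN = 0x1342E


def make_ground_truth_lines(n_lines, signs_per_line=8, seed=0):
    """Return n_lines strings of randomly sampled hieroglyph code points."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    lines = []
    for _ in range(n_lines):
        signs = [
            chr(rng.randint(FIRST_SIGN, LAST_SIGN)) for _ in range(signs_per_line)
        ]
        lines.append(" ".join(signs))  # space-separated: an assumption
    return lines


def text2image_command(text_path, output_base, font_name, fonts_dir):
    """Build (but do not run) a text2image call rendering text_path in one font."""
    return [
        "text2image",
        f"--text={text_path}",
        f"--outputbase={output_base}",
        f"--font={font_name}",
        f"--fonts_dir={fonts_dir}",
    ]


lines = make_ground_truth_lines(n_lines=3)
command = text2image_command(
    "train.txt",                       # hypothetical file holding the lines
    "out/egy.noto",                    # hypothetical output prefix
    "Noto Sans Egyptian Hieroglyphs",  # assumed font; the source names none
    "fonts/",                          # hypothetical font directory
)
```

Repeating the rendering step once per font yields several visual variants of each ground-truth line, which is the reason for collecting multiple fonts in the first place.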

Hieroglyph OCR: Next Steps

Current Status:
- Generated a very small dataset and successfully trained on it.
Scale Up:
- Increase the number of data points to 10k, 40k, and eventually 400k.
Action Items:
- Secure computing resources.
- Prepare validation dataset.
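
Once a validation set of real facsimiles with known transcriptions exists, one common way to score OCR output is the character error rate (CER): the edit distance between the model output and the reference, divided by the reference length. The source does not state which metric the project plans to use, so CER is an assumption here, and the function names are illustrative. Python strings are sequences of code points, so each hieroglyph counts as one character.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up example: four signs, one of which the model misreads.
reference = "\U00013000\U00013001\U00013002\U00013003"
hypothesis = "\U00013000\U00013001\U00013050\U00013003"
print(character_error_rate(reference, hypothesis))  # 0.25
```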

Challenges

Building the OCR Engine:
- Turned out to be much more challenging than expected.
- Solution: Prioritize the workshops, build the OCR in the background, and reach out to CCV (Center for Computation and Visualization).
Introducing DS and ML on a Tight Schedule:
- Solution: Use low-code/no-code tools and other ways of building familiarity with DS and ML besides hands-on programming.
