Professor Christoph Anderl (Ghent University)
The “Database of Medieval Chinese Texts” (DMCT) is a project that has been ongoing for roughly ten years at the Department of Languages and Cultures, Ghent University, together with its main partner, the Dharma Drum Institute of Liberal Arts (DILA), Taiwan. Its main aim is the creation of high-quality digital editions of non-canonical Buddhist texts extant in Dunhuang manuscripts. Originally a by-product of working with the manuscript material, the extraction of variant characters (yitizi 異體字) has developed into a major part of the database project, and the Variants Database module currently contains close to 100,000 character forms. DMCT also contains modules that are not yet accessible to the public, such as the Syntax and Sentence Analysis modules.
In the lecture, I will provide an overview of the structure and functions of the database, and focus on the way DMCT has “organically” grown in response to specific needs related to research, international collaboration (concretely, importing information from and exporting data to other database projects), and educational purposes. It has developed into an important tool in the infrastructure of our research group, as well as in the training of MA and PhD students at our department.
The last part of the presentation will focus on future perspectives in the development of DMCT.
Christoph Anderl is Professor of Chinese Language and Culture at the Department of Languages and Cultures, Ghent University, where he has been working since 2015. Before that, he held positions as Senior Research Fellow at KHK/CERES (Ruhr University Bochum, 2010-2014) and at the University of Oslo (2005-2010). He completed his PhD (Studies in the Language of Zǔtáng jí 祖堂集, University of Oslo) in 2005, and his MA in Chinese and Japanese Studies in 1995 (University of Vienna). Anderl’s research interests include Chinese Historical Linguistics with an emphasis on Medieval Chinese, non-canonical narratives in Dunhuang manuscripts, Chinese Chan Buddhism, the interrelation of text and image in Dunhuang art, and Digital Humanities. Anderl is currently the editor-in-chief of the international project “Database of Medieval Chinese Texts” (https://www.database-of-medieval-chinese-texts.be/), initiated in 2015 and housing one of the largest repositories of medieval Chinese character forms. Until recently, he was Co-investigator and Research Cluster Leader (“Typology of Text and Image Relations, Cliffs and Caves”) of the project “From the Ground Up: Buddhism and East Asian Religions” at UBC, Canada (2016-2024).
For an overview of ongoing projects, please consult: http://research.flw.ugent.be/en/christoph.anderl. For an overview of his publications, please see https://ugent.academia.edu/ChristophAnderl.
Dr. Yu-Chun Wang (Dharma Drum Institute of Liberal Arts)
This study presents an AI-assisted workflow for constructing a large-scale Buddhist Knowledge Graph (KG) by transforming dictionary-based conceptual definitions into machine-interpretable relational structures. Buddhist terminology is vast and specialized, making manual entity consolidation and relation extraction extremely labor-intensive. Leveraging recent advances in large language models (LLMs), we develop a multi-stage pipeline that merges synonymous senses, summarizes definitions, classifies doctrinal categories, and extracts subject–predicate–object triples from dictionary entries. Dictionaries provide curated conceptual foundations, allowing LLMs to generate consistent semantic representations suitable for KG construction. This work demonstrates that large-scale KG construction is now feasible through AI, and we are expanding the KG for future public release to support research in Buddhist studies and digital humanities.
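The triple-extraction stage described above can be sketched in miniature: an LLM is prompted to return structured output for a dictionary entry, and the pipeline then validates and normalizes it into subject–predicate–object triples. The sketch below stands in for the LLM call with a canned JSON response; the headword, predicates, and response format are all invented for illustration and do not reflect the project's actual schema.

```python
import json

# Hypothetical dictionary entry (content invented for illustration).
entry = {
    "headword": "四諦",
    "definition": "The Four Noble Truths: suffering, its origin, "
                  "its cessation, and the path to its cessation.",
}

# In the real pipeline an LLM would be prompted with the entry and asked
# for structured triples; here we substitute a canned JSON response.
llm_response = """
[
  {"subject": "四諦", "predicate": "instance_of", "object": "doctrinal concept"},
  {"subject": "四諦", "predicate": "comprises", "object": "suffering"},
  {"subject": "四諦", "predicate": "comprises", "object": "origin of suffering"}
]
"""

def parse_triples(raw: str):
    """Validate and normalize subject-predicate-object triples from JSON."""
    triples = []
    for item in json.loads(raw):
        s, p, o = item["subject"], item["predicate"], item["object"]
        triples.append((s.strip(), p.strip(), o.strip()))
    return triples

triples = parse_triples(llm_response)
```

In practice the validation layer is where consistency for KG construction is enforced: malformed or off-schema LLM output is rejected or re-queried rather than ingested.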
Yu-Chun Wang holds a Ph.D. in Computer Science and Information Engineering from National Taiwan University. He is currently an Assistant Professor in the Department of Buddhist Studies at Dharma Drum Institute of Liberal Arts, where he also serves as Director of the Information and Communication Section at the university library.
His research interests span natural language processing (NLP), information retrieval, computational linguistics, and digital humanities. His work focuses on integrating text processing technologies with humanities research, with the goal of enhancing scholarly methodologies through computational tools.
Dr. Wang leads and collaborates on multiple research projects related to NLP and digital humanities. He has developed tools for Chinese Buddhist text segmentation and named entity recognition in historical Buddhist biographies, addressing key challenges in processing classical Chinese texts. His work also extends to digital humanities projects involving event extraction from premodern Chinese historical documents and the diachronic phonological evolution of the Southern Min dialect. Through these projects, he actively explores the application of computational methods in textual analysis within the humanities.
Jen-Jou HUNG 洪振洲 (Dharma Drum Institute of Liberal Arts)
Full-text search is indispensable for researching the complex Buddhist corpus. The Chinese Buddhist Electronic Text Association (CBETA) currently serves over 12,000 queries daily. However, traditional keyword search faces significant challenges, including rigid query formulation, phrasing variations, heterographic variants, and semantic ambiguity. This presentation outlines CBETA’s technical evolution in addressing these challenges. We first review foundational solutions such as variant normalization and the adaptation of the Smith-Waterman algorithm (traditionally used in genomics) for “fuzzy” matching. This enables scholars to locate passages despite scribal errors or version differences. The core focus is the recent integration of Semantic Search using Retrieval-Augmented Generation (RAG). By utilizing vector embeddings, this system moves beyond exact character matching to answer conceptual queries. We present a comparative analysis of our RAG implementations (Beta 1 vs. Beta 2). The optimized architecture, which retrieves top-ranked passages via vector similarity before summarization, successfully reduced response times from 120 to 12 seconds and cut AI token costs by 66%. Finally, we address practical limitations such as hardware costs and hallucinations. We conclude with a roadmap for future developments, specifically the transition towards local LLMs to balance performance with sustainability. These insights aim to provide a valuable reference for the development of next-generation digital archives.
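The Smith-Waterman algorithm mentioned above finds the best-scoring *local* alignment between two strings, which is what makes it tolerant of scribal errors and variant characters. A minimal score-only version in Python is sketched below; this is a generic textbook implementation with arbitrary scoring parameters, not CBETA's actual adaptation.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score between strings a and b.

    Returns the score of the best local alignment; higher means a
    closer match. Scoring parameters here are illustrative defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 floor is what makes the alignment local rather than global.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# A one-character variant still scores close to an exact match:
score = smith_waterman("般若波羅蜜", "般若波羅密")
```

Because the dynamic-programming table floors scores at zero, a passage with a single heterographic variant or lacuna still surfaces with a near-maximal score instead of being missed entirely, which is precisely the behavior keyword search lacks.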
Jen-Jou (aka “Joey”) HUNG 洪振洲 was born in Taiwan in 1976. He earned his Ph.D. in Information Management from National Taiwan University of Science and Technology in 2006. He is currently a full-time professor in the Department of Buddhist Studies at Dharma Drum Institute of Liberal Arts, where he also serves as the Director of the Library and Information Center. Additionally, he holds the position of Executive Director of the CBETA Foundation. Dr. Hung actively participates in various digital archiving projects led by Dharma Drum Institute of Liberal Arts. His research interests include the analysis of the translators of Chinese Buddhist texts, the construction of digital archives, the development of digital humanities research resources, and the application of AI in Buddhist studies.
Dr. Patrick McAllister (Austrian Academy of Sciences)
Quite a few projects that produce editions of texts have a neat, linear sequence: following a phase of data collection and curation, a standardised edition is produced. While it is being revised, a workflow to generate outputs — typically an online edition and a printed version — is implemented. Finally, the raw edition with its metadata is put in a long-term archive backed by an institution, and the online edition is kept running for a few years, often at personal cost to researchers.
In this talk, I present some strategies for dealing with projects that do not follow this linear model. These projects typically lack a definitive completion point. My experience draws on projects that aim to produce unbounded text collections. The benefit of such efforts versus projects focused on a single text is obvious: a uniformly encoded group of texts will be more useful than an isolated edition. But it is equally obvious that this sort of open-ended editorial project poses specific problems. I will argue that three criteria, all organised around the central notion of “utility”, must be in place to give such projects a good chance of survival: there needs to be an editorial organisation that can survive long enough to support the project; FAIR principles need to be upheld throughout the project, not merely added at the end; and the technology underlying the project must remain both functional and open for experimentation. I propose that modern systems of reproducible software stacks can extend the FAIR principles to the software that makes the editions “useful” (or interoperable and reusable), and thereby satisfy the last criterion.
I will discuss two specific cases that illustrate how a project can go through lulls of inactivity and be revived if the organisational, editorial, and technological criteria are met in such a way that they jointly contribute to the project’s utility. One case I will discuss is SARIT, a long-running, loosely organised editorial project that is now waking up from hibernation; the other is VEGEST, a framework for running projects such as SARIT.
• SARIT: https://github.com/sarit/sarit-corpus
• VEGEST: https://gitlab.oeaw.ac.at/vegest/vegest-schema
Patrick McAllister received an MA in Philosophy from the University of Vienna in 2005, and a PhD from the Institute for South Asian, Tibetan and Buddhist Studies, University of Vienna, in 2011 (supervised by Helmut Krasser). He has been working at the Austrian Academy of Sciences’ Institute for the Cultural and Intellectual History of Asia since July 2016. While his primary research interest is the development of Buddhist epistemological theories during the 9th to 11th centuries (primarily in the works of Prajñākaragupta, Jñānaśrīmitra, and Ratnakīrti), he is also engaged in Digital Humanities. He contributes significantly to the conceptual, methodological, and technical development of two resources: EAST, a tool to collect bibliographical and prosopographical information on the South Asian and Tibetan philosophical literature dealing with logic and argumentation; and SARIT, a growing and dynamically developing library of Indic texts (mainly Sanskrit) encoded according to the TEI Guidelines. More recently, he has been working on “VEGEST—Vienna Encoding Guidelines for Editing Sanskrit Texts.”
ORCID: https://orcid.org/0000-0001-8043-7453
Bunchird CHAOWARITHREONGLITH, PhD. (Dhammachai Tipiṭaka Project, DCI Center for Buddhist Studies, Thailand)
This talk explores the integration of advanced digital tools into the field of Pāli textual criticism, specifically in the preparation of a critical edition of the Pāli canon. In Thailand, the Dhammachai Tipiṭaka Project (DTP) utilizes around twenty witnesses, if not more, for each text. These sources include four traditions of palm-leaf manuscripts alongside four printed editions of the Pāli canon: Burmese, Sinhalese, Thai, and European (Be, Ce, Se, and Ee). This is a painstaking task, particularly when incorporating such a high volume of palm-leaf manuscripts. Consequently, advanced digital tools are implemented throughout the entire workflow: manuscript surveying, image segmentation, transcription, text alignment, editing, and apparatus creation. I will demonstrate how we utilize these tools in our editorial work, while also discussing future developments aimed at reducing workload, increasing accuracy, and improving flexibility.
While the DTP has utilized digital tools for over a decade, the limitations of this legacy system—combined with recent advancements in digital and AI technology—have necessitated the creation of a completely new system. Manuscript transcription requires greater accuracy, and text alignment must be more effective when encountering complex passages. Furthermore, the tasks of editing text and preparing the critical apparatus need to be more intuitive. We are integrating these capabilities into our new system and are exploring all possibilities, including AI and OCR. This presentation outlines how we are addressing these challenges.
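The collation and apparatus-creation steps described above reduce, at their core, to grouping witnesses that share a reading at each point of variation. The toy sketch below formats a single negative-apparatus entry (only readings that differ from the lemma are reported); the sigla follow the Be/Ce/Se/Ee convention mentioned in the abstract, but the readings and the formatting are invented and do not represent the DTP's actual system.

```python
# Hypothetical witness readings for one word-slot (readings invented).
readings = {
    "Be": "dhammo",
    "Ce": "dhammo",
    "Se": "dhammā",
    "Ee": "dhammo",
}

def apparatus_entry(lemma, readings):
    """Build a negative apparatus entry: group witnesses by reading and
    report only those readings that differ from the adopted lemma."""
    groups = {}
    for siglum, reading in readings.items():
        if reading != lemma:
            groups.setdefault(reading, []).append(siglum)
    parts = [
        f"{reading} {' '.join(sorted(sigla))}"
        for reading, sigla in sorted(groups.items())
    ]
    return (f"{lemma}] " + "; ".join(parts)) if parts else lemma

entry = apparatus_entry("dhammo", readings)
```

With twenty or more witnesses per text, even this trivial grouping step becomes error-prone by hand, which is why the DTP automates alignment and apparatus generation.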
I am a researcher with a background in technology and Buddhist studies from Thailand, Japan, and the US. For nearly 15 years, I have been working on the Dhammachai Tipiṭaka Project, overseeing every process from manuscript digitization to digital platform development and critical edition preparation. Currently, we are finding ways to leverage the power of AI and digital humanities to efficiently analyze vast amounts of Pāli variant readings from four major traditions of palm-leaf manuscripts. This approach will significantly streamline the process of transcribing manuscripts, collating variants, editing texts, and creating the critical apparatus, ultimately leading to a more accurate and comprehensive critical edition of the Pāli Canon.
Sebastian Nehrdich (Tohoku University)
This presentation gives an overview of the current status and future outlook of Dharmamitra, a collaborative research ecosystem between Tohoku University, the Tsadra Foundation, and the Berkeley AI Research Lab. I will examine how recent advances in Large Language Models facilitate a new paradigm for Buddhist philology and how Dharmamitra explores these through three main technological vectors: first, optical character recognition to digitize classical texts with high accuracy; second, high-precision machine translation for Sanskrit, Pāli, Tibetan, and Chinese, based on LLMs and semantic search technology; and third, vector-based semantic retrieval that allows seamless search and retrieval across multilingual corpora. I will demonstrate how Dharmamitra can be used to dramatically shorten the time needed to find relevant passages in large, multilingual corpora, and in what way this might affect textual studies of Buddhist material in the future.
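Vector-based semantic retrieval of the kind described above typically ranks passages by cosine similarity between a query embedding and precomputed passage embeddings. The sketch below shows the ranking step only, with toy three-dimensional vectors and invented passage labels; a real system such as Dharmamitra would use a multilingual encoder producing vectors with hundreds of dimensions, plus an approximate-nearest-neighbor index for scale.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed nonzero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k(query_vec, corpus, k=2):
    """Return the ids of the k passages most similar to the query vector."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy "embeddings"; the labels are invented placeholders, not real text ids.
corpus = [
    ("passage A", (0.9, 0.1, 0.0)),
    ("passage B", (0.1, 0.9, 0.1)),
    ("passage C", (0.8, 0.2, 0.1)),
]
results = top_k((1.0, 0.0, 0.0), corpus, k=2)
```

Because similarity is computed in the shared embedding space rather than over surface strings, a query in one language can retrieve parallel passages in another, which is what enables search across Sanskrit, Pāli, Tibetan, and Chinese corpora at once.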
Sebastian Nehrdich is a tenure-track Assistant Professor at Tohoku University. He completed his PhD in Computational Linguistics at the University of Düsseldorf, co-supervised by Oliver Hellwig and Kurt Keutzer. He holds an MA in Buddhist Studies from the University of Hamburg. His work integrates digital philology, Buddhist textual analysis, and machine learning. He serves as Director of the Dharmamitra project, founded at the Berkeley AI Research Lab (BAIR), has managed the ML infrastructure of the ChronBMM project, and led the development of the BuddhaNexus platform from 2018 to 2023.