My research interests lie in addressing fundamental theoretical problems in data communication and storage, as well as developing practical techniques for large-scale systems such as distributed machine learning (ML). I focus on designing codes and algorithms, and on deriving theoretical guarantees on their performance, to improve system efficiency with respect to data reliability, storage overhead, latency, communication cost, and ML model accuracy. My research builds on mathematical foundations from coding theory, information theory, applied probability, and optimization, and is motivated by applications to modern storage and communication systems, including DNA-based data storage, file synchronization, and federated learning.
The exponential growth of digital information has made conventional storage technologies increasingly unsustainable, as most data today resides in large, energy-hungry data centers. DNA has emerged as a promising alternative medium for future data storage thanks to its exceptional density, durability, and sustainability. Despite several successful proof-of-concept demonstrations, scalability remains a major challenge. As illustrated in the figure to the right, a typical DNA storage pipeline involves encoding binary data into quaternary sequences over the alphabet {A, C, G, T}, writing them through DNA synthesis, storing them in molecular form, and reading them through DNA sequencing and digital decoding. A key challenge lies in the fact that DNA synthesis and sequencing are error-prone biochemical processes. Current systems rely on slow and costly high-fidelity biochemical technologies to mitigate these imperfections. Our research aims to address these limitations algorithmically rather than biochemically, by developing encoding and decoding techniques that enhance reliability and enable the use of faster, cheaper, but inherently more error-prone synthesis and sequencing technologies.
Writing and reading data in synthetic DNA
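The first step of the pipeline above can be illustrated with a toy sketch. The 2-bit-per-nucleotide mapping below is one common convention, not a fixed standard; practical encoders additionally enforce biochemical constraints such as balanced GC content and limited homopolymer runs, which this sketch omits.

```python
# Toy illustration of the encoding step: mapping binary data to a
# quaternary sequence over {A, C, G, T}, two bits per nucleotide.
# The particular 2-bit mapping is an assumption chosen for illustration.

BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(bits: str) -> str:
    """Map a binary string (even length) to a DNA sequence."""
    assert len(bits) % 2 == 0, "pad to an even number of bits first"
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(dna: str) -> str:
    """Invert the mapping: DNA sequence back to bits."""
    return "".join(BASE_TO_BITS[base] for base in dna)
```

For example, `encode("01110010")` yields `"CTAG"`, and `decode` inverts it exactly; the error-correction machinery discussed below is what protects this mapping against noisy synthesis and sequencing.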
At the core of this work is the study of error-correcting codes that efficiently handle random deletions, insertions, and substitutions, which are the dominant types of errors in DNA storage. These codes must operate at short blocklengths, as current synthesis technologies limit the length of synthetic DNA strands (oligos) to only a few hundred nucleotides, making classical constructions unsuitable. Furthermore, DNA sequencing produces multiple noisy reads of each oligo, creating a multi-read setting in which algorithms can exploit redundancy across reads to enhance decoding accuracy. By combining insights from information theory, coding theory, and algorithm design, our objective is to develop coding techniques that enable scalable and resource-efficient DNA storage systems, thereby helping bridge the gap between laboratory demonstrations and the eventual commercialization of this technology. At a theoretical level, we also seek to characterize the fundamental limits of reliable information storage and retrieval under the distinctive noise characteristics of the DNA channel.
R. Khabbaz, M. Antonini, S. Kas Hanna, Marker Guess & Check Plus (MGC+): An Efficient Short Blocklength Code for Random Edit Errors, 13th International Symposium on Topics in Coding (ISTC), Los Angeles, 2025.
S. Kas Hanna, On the Reliability of Information Retrieval From MDS Coded Data in DNA Storage, under review in IEEE Transactions on Molecular, Biological, and Multi-Scale Communications, arXiv:2502.06618 [cs.IT], 2025.
S. Kas Hanna, GC+ Code: A Systematic Short Blocklength Code for Correcting Random Edit Errors in DNA Storage, under review in IEEE Transactions on Molecular, Biological, and Multi-Scale Communications, arXiv:2402.01244 [cs.IT], 2025.
D. Lazzarotto, M. Testolina, S. Kas Hanna, O. Watanabe, E. San Antonio, X. Pic, M. Dimopoulou, M. Antonini, T. Ebrahimi, Overview of JPEG DNA Coding System for Image Storage on Synthetic DNA, SPIE Applications of Digital Image Processing XLVIII, San Diego, 2025.
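The multi-read setting mentioned above can be sketched in its simplest form: combining several noisy reads of the same oligo by a position-wise majority vote. This toy version assumes substitution errors only, so all reads have equal length; real DNA reads also suffer deletions and insertions and must first be aligned, which is precisely what makes the problem hard.

```python
from collections import Counter

def majority_decode(reads: list[str]) -> str:
    """Position-wise majority vote across equal-length noisy reads.
    A simplification: assumes substitution errors only (no indels)."""
    length = len(reads[0])
    assert all(len(r) == length for r in reads), "reads must be aligned"
    return "".join(
        Counter(r[i] for r in reads).most_common(1)[0][0]
        for i in range(length)
    )
```

With three reads of a four-nucleotide oligo, `majority_decode(["ACGT", "ACCT", "AGGT"])` returns `"ACGT"`, since each single-read substitution is outvoted by the other two reads.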
Deletion and insertion errors are directly associated with loss of synchronization between transmitted and received sequences. Channel coding for such errors is a fundamental problem in information theory, with applications extending beyond DNA-based data storage to areas such as file synchronization and communication networks where synchronization errors naturally arise. Our research investigates this problem from a fundamental perspective, focusing on the design of binary codes for correcting random deletions and insertions, detecting synchronization errors, and handling bursty or localized error patterns. These problems remain among the most challenging in the field, as deletions and insertions alter the alignment between transmitted and received sequences, creating channels with memory that are notoriously difficult to analyze. From a coding-theoretic standpoint, these errors make code design considerably more complex than for substitution or erasure errors, as deletions and insertions represent the most general form of errors. For example, a substitution can be viewed as a special case where a deletion is immediately followed by an insertion at the same position. In our research, we aim to develop explicit and efficient code constructions for synchronization errors, together with theoretical guarantees on their performance and optimality.
S. Kas Hanna, GC+ Code: A Systematic Short Blocklength Code for Correcting Random Edit Errors in DNA Storage, under review in IEEE Transactions on Molecular, Biological, and Multi-Scale Communications, arXiv:2402.01244 [cs.IT], 2025.
S. Kas Hanna, Optimal Codes Detecting Deletions in Binary Strings Applied to Coded Trace Reconstruction, IEEE Transactions on Information Theory, Vol. 69, No. 9, 2023.
S. Kas Hanna and S. El Rouayheb, Codes for Correcting Localized Deletions, IEEE Transactions on Information Theory, Vol. 67, No. 4, 2021.
R. Bitar, S. Kas Hanna, N. Polianskii, and I. Vorobyev, Optimal Codes Correcting Localized Deletions, IEEE International Symposium on Information Theory (ISIT), Melbourne, 2021.
S. Kas Hanna and S. El Rouayheb, Guess & Check Codes for Deletions, Insertions, and Synchronization, IEEE Transactions on Information Theory, Vol. 65, No. 1, 2019.
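A classical starting point for single-deletion correction, useful for building intuition about the constructions above, is the Varshamov-Tenengolts (VT) code: binary words whose weighted checksum vanishes modulo n+1. The sketch below decodes by brute force over all single insertions; linear-time VT decoders exist, but the brute-force version keeps the illustration short while the VT property still guarantees a unique answer.

```python
def vt_checksum(x: list[int]) -> int:
    """Varshamov-Tenengolts checksum: sum of i * x_i (1-indexed) mod (n + 1)."""
    return sum(i * b for i, b in enumerate(x, start=1)) % (len(x) + 1)

def vt_decode(y: list[int], n: int, a: int = 0) -> list[int]:
    """Recover the unique codeword of VT_a(n) from y, a copy of the codeword
    with one bit deleted. Brute-force over all single insertions; since the
    VT code corrects one deletion, exactly one codeword (as a string) fits."""
    candidates = set()
    for pos in range(n):
        for bit in (0, 1):
            cand = y[:pos] + [bit] + y[pos:]
            if vt_checksum(cand) == a:
                candidates.add(tuple(cand))
    assert len(candidates) == 1, "VT guarantees a unique consistent codeword"
    return list(candidates.pop())
```

For instance, x = [1,0,0,0,0,1] lies in VT_0(6) since 1 + 6 = 7 is divisible by 7; deleting any of its zeros yields [1,0,0,0,1], from which `vt_decode` recovers x exactly.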
Modern intelligent systems rely on large-scale iterative machine learning (ML) algorithms trained on vast and often decentralized datasets. Running these algorithms efficiently requires distributing computations across multiple servers or devices, giving rise to distributed and federated ML frameworks. Our research focuses on developing algorithmic strategies that optimize the performance of such systems under practical constraints, including computation delays, limited communication bandwidth, and data heterogeneity. We are particularly interested in characterizing the trade-offs between training error and runtime, and between communication cost and model accuracy, as well as designing data-driven strategies to mitigate the effect of stragglers, that is, slow or unresponsive workers that can bottleneck training. We study adaptive and coded computation techniques that accelerate convergence while preserving model accuracy, and we aim to provide theoretical guarantees on their performance using tools from optimization and applied probability. These ideas extend naturally to federated learning, where privacy, reliability, and efficiency must coexist across a network of decentralized clients.
O. Makkonen, S. Niemelä, C. Hollanti, and S. Kas Hanna, Approximate Gradient Coding for Privacy-Flexible Federated Learning with Non-IID Data, 32nd European Signal Processing Conference (EUSIPCO), Lyon, 2024.
M. Egger, S. Kas Hanna, and R. Bitar, Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load, IEEE International Symposium on Information Theory (ISIT), Taipei, 2023.
S. Kas Hanna, R. Bitar, P. Parag, V. Dasari, and S. El Rouayheb, Adaptive Distributed Stochastic Gradient Descent for Minimizing Delay in the Presence of Stragglers, 45th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, 2020.
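The straggler problem above can be made concrete with a toy simulation: a parameter server dispatches the current model to n workers, each holding a data shard, and updates using only the k fastest responses each round. This is a generic fastest-k sketch for illustration, not the specific adaptive or coded schemes of the papers above; delays are assumed exponential and the data are noiseless pairs y = 3x, so the true model is w = 3.

```python
import random

random.seed(0)
N_WORKERS, K_FASTEST = 8, 4  # wait for the 4 fastest of 8 workers

# Each worker holds a shard of noiseless data points (x, y) with y = 3x.
shards = [[(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(20))]
          for _ in range(N_WORKERS)]

def shard_gradient(w: float, shard) -> float:
    # Gradient of the average squared error (w*x - y)^2 over one shard.
    return sum(2 * x * (w * x - y) for x, y in shard) / len(shard)

def loss(w: float) -> float:
    total = sum(len(s) for s in shards)
    return sum((w * x - y) ** 2 for s in shards for x, y in s) / total

w = 0.0
for _ in range(50):
    # Simulated per-worker response times; ignore the slowest n - k workers.
    delays = [random.expovariate(1.0) for _ in range(N_WORKERS)]
    fastest = sorted(range(N_WORKERS), key=lambda i: delays[i])[:K_FASTEST]
    grad = sum(shard_gradient(w, shards[i]) for i in fastest) / K_FASTEST
    w -= 0.5 * grad  # fixed step size
```

Because every shard's gradient vanishes at w = 3, the fastest-k iterates still converge to the true model here; with heterogeneous (non-IID) shards, ignoring slow workers biases the update, which is exactly the accuracy-runtime trade-off studied in this line of work.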
Ensuring reliable communication with minimal delay is a central challenge in next-generation communication systems. We are interested in developing coding strategies that provide high reliability under strict latency constraints, motivated by applications such as XR, VR, and cloud gaming. We study forward error correction (FEC) techniques that balance delay, reliability, and complexity, with a particular focus on explicit low-delay codes capable of correcting burst and random erasures. We also seek theoretical guarantees on their performance and methods for optimizing code parameters to meet target reliability levels under varying network conditions.
S. Kas Hanna, Z. Tan, W. Xu, and A. Wachter-Zeh, Codes Correcting Burst and Arbitrary Erasures for Reliable Low-Latency Communication, 47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, 2023.
S. Kas Hanna, A. Wachter-Zeh, Z. Tan, W. Xu, and P. Dong, A Design Method of Network Coding with Maximum Delay Constraints, China National Intellectual Property Administration (CNIPA), CN116961824A, 2023.
Z. Tan, S. Kas Hanna, A. Wachter-Zeh, W. Xu, P. Dong, H. Zhu, and F. Wu, Method and Device to Determine and Control Feedback Information and Parameters for Network Coding, China National Intellectual Property Administration (CNIPA), CN116980072A, 2023.
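A minimal sketch of the burst-erasure idea, assuming a much simpler scheme than the constructions above: each codeword carries a single XOR parity over k data packets, and codewords are interleaved to depth d, so a burst of up to d consecutive lost packets erases at most one symbol per codeword, which the parity then repairs. The parameters and packet model below are illustrative assumptions.

```python
K, DEPTH = 4, 3  # k data packets per codeword, interleaving depth d

def fec_encode(data):
    """Encode K*DEPTH integer packets: de-interleave into DEPTH codewords,
    append one XOR parity to each, and transmit column by column."""
    words = [data[i::DEPTH] for i in range(DEPTH)]
    for w in words:
        parity = 0
        for p in w:
            parity ^= p
        w.append(parity)
    return [words[i][j] for j in range(K + 1) for i in range(DEPTH)]

def fec_decode(received):
    """Recover the data from a stream with at most DEPTH consecutive
    erasures (marked None): each codeword sees at most one erasure,
    filled in as the XOR of its surviving symbols."""
    words = [received[i::DEPTH] for i in range(DEPTH)]
    for w in words:
        if None in w:
            missing = w.index(None)
            w[missing] = 0
            for j, p in enumerate(w):
                if j != missing:
                    w[missing] ^= p
    return [words[i][j] for j in range(K) for i in range(DEPTH)]
```

The delay of this toy scheme is governed by the interleaving depth, which is the basic tension the low-delay constructions above are designed to resolve: deeper interleaving tolerates longer bursts but forces the receiver to wait longer before decoding.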