Posters

Poster Session 1

Tuesday 11 November, 15:45 – 17:00

DNA as a Rewritable Storage Medium

Anagha Rajesh - BioCompute Inc

As the field of DNA data storage advances toward commercialization, the high cost and slow turnaround time of DNA synthesis remain critical bottlenecks. Moreover, for DNA storage systems to become comparable to silicon-based memory, they must enable multiple read–write cycles rather than operate as read-once archives. This poster focuses on a new frontier: rewritable DNA data storage systems integrated with end-to-end hardware solutions. It explores how specially engineered enzymes, whose activity can be modulated by physical stimuli such as light, temperature, or electric fields, enable controlled, reversible modifications to DNA strands, in order to encode data. It also discusses the design of microfluidic and nanofluidic systems to automate encoding, editing, and retrieval workflows, thus reducing human intervention and operational cost. The poster aims to present a technical and strategic overview of how rewritable DNA systems, coupled with integrated hardware, could lower the barriers to widespread adoption of DNA-based storage technologies.

DNAmaker: Automated Construction of Long DNA Molecules for Digital Data Storage

Julien Leblanc - IRISA, CNRS

Today, the consensus in the scientific community for storing digital information in DNA is to use short single-stranded DNA molecules ranging from 100 to 350 nucleotides. However, this approach presents several limitations, including encoding constraints, DNA stability issues, difficulties in recovery, and limitations in sequencing technologies. To address these challenges, we propose storing information in long double-stranded DNA molecules ranging from 5,000 to 25,000 base pairs.

To this end, we have developed a fully in vitro method to construct such molecules in the simplest possible way.

But is it scalable?

Currently, long double-stranded DNA molecules are assembled manually. This process is time-consuming, not easily parallelizable, and tedious — taking up to four days to build a 24 kbp molecule. To make our DNA-based information storage solution convincing, it must offer a scalable experimental platform. We are therefore developing an automated system capable of constructing multiple long DNA molecules. As a proof of concept, we store the Universal Declaration of Human Rights in 12 DNA molecules of 3.5 kbp each.

Sequencing and Characterization of Polymers using Nanopores

Jette Bleicken - ENS de Lyon, CNRS

The ability to resolve the structure and composition of individual polymer molecules with high precision is a long-standing goal in the biological and materials sciences. Traditional bulk methods often mask molecular heterogeneity, which limiting our understanding of the functional and structural complexities of polymers. Nanopore-based techniques have emerged as a transformative technology for single-molecule analysis at the base pair level for biomolecules like DNA and proteins.

Current commercial nanopore technologies rely on ionic current measurements which is dependent on the pore size and polymer type.

For versatile characterization, our group has developed an optical method for real-time detection of single molecules translocating through nanopores. Based on this method, we can analyze the optical signals of several polymers.

During my PhD project I focus on storing and reading information in double-stranded DNA (dsDNA), DNA origami and synthetic polymers, which differ in size, mass and structure. In this case information is encoded as a sequence of a fluorescent blocks that are read optically during the translocation through a nanopore membrane via the nearfield effect (zero-mode waveguide). During my first year I investigated the translocation of six-helix bundles and DNA plates.

My work will be part of a project funded by PEPR MoleculArXiv (OPTOPOLYSEQ). The goal of this new project is to increase the reading throughput of the methodology using FRET and multicolor.

Cost-Effective and Scalable Short-Read Nanopore Sequencing for DNA Data Storage

Kornelija Kaminskaitė, Duygucan Gül, Ignas Galminas and Simonas Juzėnas - Genomika

DNA is emerging as a promising medium for long-term digital data storage due to its ultra-high density, durability, and longevity. However, practical implementation remains limited by the cost and efficiency of key processes - particularly sequencing of short synthetic oligonucleotides typically used to encode information. Oxford Nanopore Technologies (ONT) platforms offer portability and affordability, but their performance with short DNA fragments (<300 bp) is constrained by rapid molecule translocation and pore reloading inefficiencies, leading to low throughput and high sequencing costs.

As part of the EIC Pathfinder project DNAMIC (www.dnamic.org), we developed a High-Performance Protocol optimized for nanopore sequencing of ultra-short DNA (40–200 bp), based on ONT’s Ligation Sequencing Kit V14. This protocol achieves >10X higher sequencing output, 2–3X more purified library recovery, and 50% less reagent use, all without compromising data quality. Library preparation time is reduced by over half, supporting automation and large-scale applications.

Benchmarking showed similar average read quality (Q ≈ 11) compared to standard protocols, with fewer low-quality reads (Q ≤ 5) and slightly lower positional error rates. These advances enable a significantly lower cost-per-byte and more robust performance for short-read sequencing workflows.

The protocol is now integrated into the µFactory DNA data storage pipeline, a modular, end-to-end system developed under the DNAMIC project to prototype and scale DNA-based storage workflows. This integration enables faster iterations, reduced operating costs, and broader accessibility for real-world use cases.

By addressing one of the core limitations in nanopore sequencing for synthetic DNA, this work contributes to making DNA data storage more scalable, cost-effective, and ready for broader technological adoption.

A directed evolution plateau for plate experiment to help the scientific community

Kévin Ricard - ESPCI, CNRS

Part of the PEPR MoleculArXiv project involves exploring DNA data storage. This requires conducting directed evolution of polymerases to identify an enzyme feating the selection criteria. To this end, we have built a technical platform at the Gulliver laboratory (with Yannick Rondelez and Vasily Shenshin) to conduct these experiments. This technical platform is also intended to offer services to the scientific community, to benefit from our expertise and equipment in directed evolution.

We offer several services:

• Construction of enzyme and protein libraries, as well as individual mutants.

• Isolation and identification of the generated variants.

• High-throughput DNA Nanopore sequencing service (Flongle to Promethion).

• Enzyme activity measurements (depending on the enzyme type).

Using a plate-based experiment system, we can easily construct multiple enzyme libraries simultaneously and identify the generated variants. Our current limit is between 2,000 and 10,000 variants for identification, and around 10^10 variants. Our platform range goes from directed evolution reactions in tubes to high-throughput sequencing to identify mutants created by transforming our DNA into bacteria and amplification of DNA by PCR from them. Our activity measurements are performed in vitro, by producing enzymes in our plates, using the PURE system.

Since 2024, we have completed several projects to construct enzyme and protein libraries, and we have provided several high-throughput sequencing services for various partners near Paris. In 2026 we want to develop more the plateau around the world and be involved in different directed evolution projects.

PCR-free DNA data random access using tunable nanovesicles in reusable fiber matrices

Jokin Yeregui Elosua - CIC nanoGUNE BRTA , EHU/UPV

DNA has become a promising medium for storing digital data. To guarantee the longevity and preservation of the stored data, various techniques have been investigated to protect DNA from oxidative harm. Moreover, the on-demand retrieval of certain DNA sequences that are placed in the same area is a desired random-access feature of memory systems. In this context, state-of-the-art random access primarily depends on PCR selection. Even if this selection is feasible and successful, the process could be more prone to the introduction of errors and data corruption. It also renders the entire process more costly, energy-intensive, and time-consuming. Additionally, this method may result in the loss of non-selected information during the procedure.

In this work, we integrate electrospun fiber technology with a new approach for PCR-independent random access that relies on the selective release of target DNA from the matrix. For this purpose, multiple messages encoded in distinct DNA sequence sets are enclosed within various stimuli-responsive small unilamellar vesicles (SUVs) prior to incorporating them into the same electrospun water-resistant nanofiber mesh. We deliberately activate the release of targeted sequences by employing distinct physical stimuli (e.g. temperature, pH). The orthogonality of the employed triggers allows for the retention of SUVs that include the non-selected DNA data, kept within the fiber mesh for potential future retrieval purposes.

We demonstrate preliminary advancements in a molecular type of random access within water-resistant fibers suitable for DNA-based memory applications. Our system would provide a user-friendly, affordable, and eco-friendly option that avoids the need for enzymatic amplification for random access and minimizes dependence on sophisticated sorting technologies. By combining ideas of synthetic biology, materials science and nanotechnology, this work takes a further step towards the realization of a sustainable, feasible DNA data storage system with randomized, PCR-free data access.

In Vitro DNA Cryptography

Victoria Bloquert, Sandra Jaudou - ESPCI, CNRS

With digital data increasing exponentially, available space and longevity of energy demanding data centres are reaching their limit. With regards to storage density and durability, archiving data in DNA provides a sustainable solution. Rapidly evolving DNA synthesis and sequencing technologies have permitted such concepts to be achieved. Accordingly, raising questions around security and confidentiality is crucial to store information securely, particularly by regularly re-encrypting this massive amount of molecular data. An efficient strategy would be to perform these operations at the molecular level, without the use of a digital intermediary. In this work, we investigate an experimental in vitro DNA cryptography operation. Combining concepts of DNA origami and cryptography, a design involving a message containing DNA strand, written using a designed character to codon basis, can be encrypted by the formation of a ciphered DNA sequence.

A Nanopore-Based Assay for High-Throughput Evaluation of Transcriptional Orthogonality in DNA Storage Systems

Kanji Tomohara - ESPCI, CNRS

In living systems, DNA serves as a stable genomic repository, while RNA provides dynamic, environment-responsive outputs generated by transcription. A similar principle can be applied to DNA-based data storage: DNA acts as the immutable master archive, and RNA molecules serve as user-defined readouts triggered by specific queries. In this framework, transcription becomes the interface that selectively converts stored information into accessible formats.

Selective access critically depends on the orthogonality of RNA polymerase (RNAP)–promoter pairs. When each RNAP exclusively recognizes its cognate promoter, multiple independent information layers can coexist on the same DNA substrate. Distinct RNAP inputs thus operate as molecular queries, retrieving or activating different subsets of data without crosstalk. This enables parallel operations, hierarchical organization, and selective addressing within a single molecular storage pool.

Quantitative evaluation of RNAP–promoter orthogonality is therefore essential: it determines how reliably one channel can be accessed without interference and how many independent channels can be multiplexed. Beyond characterization, rational design or directed evolution of novel orthogonal RNAP–promoter pairs would expand the molecular “instruction set” available for DNA storage. These advances could transform DNA from a passive archival medium into an active, multi-layered storage and computation platform, where programmable transcriptional queries enable precise information retrieval.

Here, we present a nanopore-based assay to quantify RNAP–promoter orthogonality. Transcribed RNAs carry unique molecular identifiers (UMIs) that watermark their promoter of origin, allowing direct, scalable readout by nanopore sequencing. Unlike fluorescence-based assays constrained by spectral overlap, barcode-driven multiplexing supports simultaneous evaluation of tens to hundreds—or even thousands—of RNAP–promoter combinations. We will discuss ongoing results from this high-throughput framework and its potential to accelerate the development of orthogonal transcriptional channels for DNA-based data storage.

A multiplate experiment system for targeted evolution plateau services

Igor Koop - ESPCI, CNRS

We offer several services:

• Construction of enzyme and protein libraries, as well as individual mutants.

• Isolation and identification of the generated variants.

• High-throughput DNA Nanopore sequencing service (Flongle to Promethion).

• Enzyme activity measurements (depending on the enzyme type).

Programmable Nanopore-Based Biological DNA synthesis for a New Era of Information Encoding

Nicole Alizade - Sorbonne University

DNA synthesis is a cornerstone of biotechnological advancement, but it still remains heavily reliant on phosphoramidite chemistry, a decades-old method with considerable limitations in scalability, precision, and organic compatibility. The emergence of enzymatic synthesis has become a promising alternative; however, its widespread adoption is hindered by a lack of specificity and the absence of parallelized technologies. Thus, the development of programmable biological synthesis strategies could overcome these barriers, advancing molecular biology and expanding DNA’s role as a substrate for information encoding and storage.

For this goal, we are developing a platform for a polymerase-nanopore fusion system designed to achieve de novo DNA synthesis at a single-base resolution. High-fidelity sequence construction is made possible by synchronizing polymerase extension one base at a time, restricting nucleotide flow from an initial pool through external stimuli.

To accomplish this, gating mechanisms are positioned within nanopore constructs to modify base availability depending on the introduced signal. The feasibility of this selective alignment and enzyme-catalyzed synthesis is supported by computational modeling of protein coupling and dynamics of nucleotide trajectories. As this platform allows real-time control of nucleotide addition, we are also building a digital simulator allowing direct input of the desired sequence, followed by an automated synthesis in a miniaturized machine accommodating arrays of these nanopore constructs, eliminating the need for user interference throughout the process.

The platform offers advancements for current in vitro enzyme-driven synthesis and data storage pipelines, providing a sustainable, efficient and precise alternative to chemical methods. Because of the biological nature of nanopores, their implementation in DNA synthesis opens avenue for the first in vivo synthesis platform, potentially capturing cellular events as retrievable data.

Single-polymerase coupling to a nanopore, in combination with programmable stimulated control, can provide the foundation for novel high-fidelity, single-molecule, electronic DNA synthesis technologies.

Random walk on a DNA origami to solve mazes

Yarong Shi - ENS de Lyon

DNA programming is a field that uses biochemistry to design artificial systems made of DNA/RNA that can fold in 2D or 3D structures, embedding computation abilities. Those systems are based on the combination of two key techniques. DNA origami which enables the reliable and high-yield assembly of the initial support for the computation. And algorithmic self-assembly that involves the design of short DNA strands which will collectively assemble into a larger shape, while conducting computations.

Here, we develop a biophysical approach to address the problem of solving graph algorithms. Our goal is to design artificial DNA strands that attach to each other in such a way that they solve a maze designed on an origami platform. Such origami mazes have already been solved using DNA navigators that bind to the origami to draw irreversibly some random path on the maze (hairpins are consumed) and paths that do not reach the exit must be filtered out. In our project we aim at solving the maze using a random walk that will stop walking if and only if it reaches the maze exit.

To demonstrate feasibility, we simplified the maze into a linear rectangular path comprising two alternating strands (odd/even steps) and optimized the assembly using a kinetic model. We then designed a reversible random walk mechanism based on toehold-mediated strand displacement, where a single-stranded toehold domain initiates strand invasion and displacement. In our experiments, resulting assembly process is tuned (strands concentration, domain binding energy) and evaluated in through AFM measurements. Our preliminary results show a proof-of-principle of this successful approach, allowing us to assemble only correct path solutions. The next step is to extend this approach to more realistic 2D maze, thanks to a DNA-strand-displacement-powered random walk.

DNA Origami-Based Nanostructures for Rewritable Molecular Data Storage

Ilaria Sandei - Imperial College London

The exponential growth of digital data, projected to exceed 1,000 zettabytes by 2030, poses a significant challenge to conventional storage technologies, which suffer from limited capacity, high energy consumption, and considerable environmental impact. DNA-based storage has emerged as a promising alternative, offering theoretical storage densities up to 215 petabytes per gram and exceptional stability for long-term preservation. However, widespread adoption remains hindered by the high cost of DNA synthesis and the time- and resource-intensive nature of data encoding and retrieval.

Here, we present a novel approach leveraging DNA nanostructures as physical scaffolds for high-density, reusable data storage. Unlike conventional DNA storage systems that rely solely on single-stranded DNA sequences, DNA origami’s 3D architecture enables encoding in both nucleotide sequences and the physical configuration of the nanostructure. Its rigid, stable framework enhances resistance to environmental damage, improving durability. Binary information (0s and 1s) was encoded via single-stranded DNA oligonucleotides hybridized to predefined sites on a square DNA origami, enabling parallel and spatially organized data representation. Ten binary configurations were encoded and retrieved through selective dehybridization of ssDNA encoding sequences at 37°C and 50°C, followed by PCR amplification and next-generation sequencing reading. Results show accurate data recovery while preserving the origami’s structural integrity.

Our work highlights DNA origami’s potential to enhance information density, improve readout efficiency, and enable integration with microfluidic platforms for automated write/read cycles. Ongoing work aims to validate multiple rounds of data rewriting, positioning DNA nanostructures as a viable pathway toward scalable, sustainable, and reusable molecular data storage systems.

Repurposing Illumina chip to print data on DNA with patterned light

Adrien Rey - ENS-PSL, ESPCI

With the emergence of the second and now third generation of DNA sequencing, DNA data storage is more and more considered a good alternative to traditional data centre for the storage of the exponentially increasing amount of cold/archive data. The bottleneck to make it commercially viable now lies in the development of a fast, low-cost and sustainable DNA data writing tools. Here we present a method that uses Illumina chips as solid support for the highly parallel specific biosynthesis of DNA strands. By using light sensitive caged primers, we can extend DNA strands only in specific spots, allowing us to write spatial data.

Poster Session 2

Thursday 13 November, 15:45 – 17:00

Benchmarking Conventional and Learning-based Codecs for DNA Image Storage

Claire Couvreur - I3S, CNRS, UNICA

In response to the ever-growing data storage demands, research has been increasingly focusing on more sustainable solutions as alternative to traditional storage methods (e.g. HDD). One promising approach is storing cold data —that is, data that is rarely accessed— on synthetic DNA, which offers exceptional density, low energy consumption, and long-term durability. This paper explores and compares various image compression techniques tailored for DNA-based storage, evaluating both conventional transform-based methods and state-of-the-art learning-based approaches.

De-Bruijn Graph Partitioning for Scalable and Accurate DNA Storage Processing

Olivier Boullé - SPINTEC, CNRS

DNA-based data storage offers a compelling solution for long-term, high-density archiving. In this framework, accurately reconstructing high-quality encoded sequences after sequencing is critical, as it directly impacts the design of error-correcting codes optimized for DNA storage. Furthermore, efficient and scalable processing is essential to manage the large volumes of data expected in such applications.

This poster presents a novel method based on de-Bruijn graph partitioning, enabling fast and accurate processing of sequencing data regardless of the underlying sequencing technology and without requiring prior knowledge of the information encoded in the oligonucleotides.

The first processing step involves clustering the reads into multiple subsets of approximately equal size and containing all instances of a given original sequence. In the second step, de-Bruijn graphs are built based on k-mer counting for each subset. Multiple consensus sequences are then established by traversing the paths of each de-Bruijn graphs, which allows for the correction of potential errors. Since this exploration has exponential complexity, several techniques are employed to reduce processing time. First, the graph is progressively pruned by removing nodes with no outgoing edges. Second, when a path is terminated, the current sequence is analyzed for repeated k-mers. If repetitions are detected, one edge is removed from the graph to break the loop. This optimization is lossy, meaning that it may prevent some valid consensus sequences from being found. However, eliminating loops significantly accelerates the search process.

This process ensures the production of high-quality sequences, which can then be transmitted to the decoder.

Evaluated on both synthetic and real datasets, the method achieves excellent precision and recall. Our experiments show that a dataset of 89 million reads, corresponding to a 10 GB fasta file, can be fully processed in less than a minute on a standard 32-cores server.

DNAMIC Project: DNA Microfactory for Autonomous Archiving

Michaël El Kharroubi - University of Geneva

The DNAMIC project, part of the European Path Finder challenge for DNA-based digital data storage, aims to develop an autonomous, end-to-end DNA data storage solution based on a microfactory. Our main application is long-term data archiving, using the OAIS-compliant OLOS system (olos.swiss). This initiative brings together experts from academia and the private sector across Europe. At the University of Geneva, our team is focusing on the design of a CODEC that encodes binary data in DNA and decodes it back into binary while integrating it into the OLOS framework.

Our CODEC is based on a codon wheel to convert bits into nucleotides and avoid homopolymers greater than three while maintaining a balanced GC ratio. Each strand is 243 nucleotides long, including metadata, payload, and primers. To improve data integrity, we implemented the Reed-Solomon ECC scheme both within and between multiple strands. Additionally, we use a classical clustering, alignment, and consensus mechanism to identify and correct synchronization errors. We have successfully validated a preliminary version of our CODEC using six files (for a total of about 1 MB) synthesized by Twist Bioscience and sequenced by our partners at Genomika.

Currently, our efforts are focused on optimizing the decoding part by:

using a locality-sensitive hash clustering algorithm, implementing a sorting system to reduce cluster size, and assessing the use of 2D Reed-Solomon ECC or tag-based approach to handle InDel errors;
porting the algorithms to parallel architectures such as GPUs and FPGAs. The aim is to extend file storage to the terabyte scale, i.e., to decode 1 TB in less than 12 hours.

Enhanced Clustering Methods for Decayed Synthetic DNA in Data Storage Applications

Guy Assa - Technion (Israel Institute of Technology)

Clustering synthetic DNA presents significant challenges for data storage applications. Synthetic DNA breaks occur spontaneously and lack inherent structure, making reconstruction difficult. Additionally, storage conditions accelerate DNA decay, introducing unexpected breaks that further complicate clustering [1]. Existing methods address these challenges with varying trade-offs. Methods like k-median and k-means clustering is computationally greedy, with runtime scaling with input size [2], while other approaches face similar efficiency constraints [3]. Deep learning approaches [4] sacrifice accuracy for improved runtime, potentially compromising data integrity—the primary concern in storage applications. Our method, which leverages user-defined primers and inherent DNA features, demonstrates the potential for accurate recovery of designed sequences from decayed DNA. Our approach uses primer-base reconstruction, adding optimization to the approach described in [5]. Our preliminary analysis of data from Meiser et al. [1] reveals that the first 9 bases following the forward primer potentially allow detection of all 7373 original designed sequences. Combining the reverse primer information and using optimization and advanced statistical techniques suggests a promising enhanced data retrieval from degraded DNA.

References

[1] Meiser, L.C., Gimpel, A.L., Deshpande, T. et al. Information decay and enzymatic information recovery for DNA data storage. Commun Biol 5, 1117 (2022).

https://doi.org/10.1038/s42003-022-04062-9

[2] S. Guha, Y. Li, and Q. Zhang. Distributed Partial Clustering. arXiv preprint arXiv:1703.01539, 2017.

[3] M.-F. Balcan, Y. Liang, and P. Gupta. Robust Hierarchical Clustering. Journal of Machine Learning Research, 15(1):3831–

3871, 2014.

[4] B. Betancourt, G. Zanella, J. W. Miller, H. Wallach, A. Zaidi, and B. Steorts. Flexible Models for Microclustering with Application to Entity Resolution. In NIPS, 2016.

[5] Antkowiak, P.L., Lietard, J., Darestani, M.Z. et al. Low cost DNA data storage using photolithographic synthesis and advanced information reconstruction and error correction.

Marker Guess & Check Plus (MGC+): An Efficient Short Blocklength Edit-Correcting Code for DNA Data Storage

Ramy Khabbaz - I3S, CNRS, UNICA

DNA has emerged as a promising data storage medium due to its exceptional density and durability. However, ensuring data reliability remains challenging because of edit errors—including deletions, insertions (indels), and substitutions—arising during synthesis, sequencing, and storage. Current synthesis constraints require storing information in short DNA strands (oligos), making short blocklength edit-correcting codes essential. While most prior approaches either address a single error type or rely on asymptotic results for impractically long codes, practical systems often depend on sequencing redundancy (coverage) to enhance reliability.

Among practical solutions, GC+ codes correct indels by mapping them from the quaternary DNA alphabet to erasures and substitutions over a larger field, which are then corrected using Reed-Solomon (RS) codes via a guess-and-check decoding mechanism.

We introduce Marker-GC+ (MGC+), an enhanced variant of GC+ that integrates periodic markers and a novel Maximum A Posteriori (MAP)-based decoding algorithm. The periodic markers help synchronize decoding, while the MAP-based approach estimates the most likely offset patterns, significantly reducing the number of decoding hypotheses compared to exhaustive guessing. Although markers add redundancy, the MAP-driven decoding lowers the redundancy needed by the RS code, yielding an overall improvement in code rate.

Extensive simulations under realistic error models demonstrate that MGC+ achieves higher information density, reduced decoding latency, and improved reliability compared to both HEDGES, a state machine–based code for correcting indels, and GC+. Even when deletions, insertions, and substitutions occur in unequal proportions, MGC+ consistently delivers lower error rates and faster decoding while maintaining high information density. By directly correcting all three edit error types in short DNA sequences, MGC+ advances the practicality of DNA-based storage systems, reducing reliance on excessive sequencing coverage and offering improved efficiency for real-world deployments.

Data Storage in Random DNA

Omer Ella Caplan - Reichman University

The exponential growth of digital information is pushing conventional storage technologies to their physical and economic limits. DNA-based data storage offers an attractive alternative, combining ultra-high density, durability, and energy efficiency. In a typical workflow, binary data is encoded into sequences of the four DNA bases (A, C, G, T), synthesized into physical DNA molecules, stored, and later retrieved via sequencing and decoding.

A standard DNA data storage pipeline includes DNA synthesis, storage, and sequencing for data retrieval. Among these steps, the synthesis stage remains the major cost driver, presenting a significant barrier to large-scale adoption. This has motivated research into alternative synthesis strategies, improved encoding methods, and more efficient error correction.

This work proposes an approach to storing data in DNA synthesized with mechanisms involving randomness, that integrates digital and DNA-based storage, aiming to reduce overall costs compared to classical DNA data storage methods. Randomness in synthesis, explored in several recent studies, such as Lüschner et al.’s chemical stochastic synthesis techniques, Anavy et al.’s DNA-based data storage with fewer synthesis cycles, and Preuss et al.’s probabilistic sequence generation.

We perform a simulation of image storage using DNA sequences. Given a fixed number of bases per sequence and a predefined set of all possible sequences of that length, we demonstrate that storing only the corresponding primers digitally allows reconstruction of the original image associated with those primers. For our experiments, we use images from the MNIST dataset, applying various resolution-reduction steps to make each image more suitable for DNA-based representation. The simulation illustrates that, under random synthesis, a substantial portion of the generated data is irrelevant. however, with minimal digital storage, the desired data can be accurately and efficiently reconstructed.

Comparison of trace reconstruction algorithms

Sinan Yercan - I3S, CNRS, UNICA

Trace reconstruction is the process of recovering the original sequence from its many noisy copies. For DNA storage, where the sequencing output is composed of multiple noisy reads of the stored oligos, a trace reconstruction algorithm is integral to successful decoding. As such, there exist numerous approaches where the reconstruction process relies on different methods, such as properly aligning the traces, encoding the input data stream, etc.

We model the DNA storage channel as an insertion-deletion-substitution (IDS) channel and propose a marker-based trace reconstruction algorithm based on BCJR decoding. As it is well-known that joint decoding using the BCJR algorithm comes with high complexity, we provide a separate decoding strategy to reduce the decoding complexity. This separation comes at a cost of decoding performance, which we show can be largely mitigated by the addition of iterative decoding with only a slight increase in complexity. To illustrate the performance of the algorithm, we present our results in comparison to previously established literature.

Compact image representation for content-based image retrieval in DNA data storage

Sara Al Sayyed - INRIA Center at Rennes University

In this work, we propose a novel image compression method for content-based image retrieval in the context of DNA data storage. Storing data on DNA is an extremely promising

solution due to its compactness, long-term durability, and energy efficiency. However, its compactness introduces two challenges: the need for efficient data access and the ability to flexibly

handle new (a d not predefined) types of queries. To address the efficiency challenge, our approach enables direct image retrieval within the DNA domain. To ensure flexibility, we design a compact data identifier that is a semantic representation of the image and serves as a header at the beginning of the DNA strand. Our approach shows high visual and quantitative performance,

outperforming state-of-the-art method for various types of query. This highlights that hybridization can be effectively modeled using cosine similarity, without the need for training.

Enabling automated high density storage via assembly of single strand oligonucleotides from a small set

Kaya Selina Wernhart - Ekorefugium

Creation of DNA containing data traditionally relies on de novo synthesis of DNA strands containing pre-encoded information with error correction. However, this approach is costly and time consuming for large data. This motivates the investigation of an alternative strategy, in which a predefined, small set of single stranded oligonucleotides of equal length are assembled by predictable hybridization to encode arbitrary data. Such a set of oligos is called a library.

Any data can be represented by selecting and ordering these presynthesized oligos, potentially enabling low-cost encoding. This is due to the possibility that production of a library can be done by oligo production factories instead of chemical de novo synthesis, as the sequences are predefined. This approach supports multiple ways to represent the same data giving rise to different encoding schemes inherent to this approach. Depending on the encoding chosen the density and error correction capability of the encoded data can be varied, leading to comparably high data density in spite of the fact that only a small fraction of possible nucleotide combinations is used.

The design of the library must be constrained such that oligos hybridize predictably and build a gapless strand of double stranded DNA. Additionally, any combination of oligos of the library in any order must be able to be built. These constraints give rise to a specific oligo structure. The library, further, should be designed such that the edit distance between every oligo is large. This will add an additional layer of error correction that is independent of the encoding method used.

The feasibility of this approach is demonstrated by encoding images into DNA using several encoding methods compatible with the approach. This modular assembly based DNA storage could form the basis for automated cost-efficient molecular data storage systems.

Anchor-Based Architecture for Robust DNA Data Storage and Retrieval

Puru Sharma - National University of Singapore

DNA-based data storage promises exceptional density and durability, yet data recovery remains a major bottleneck due to sequencing noise and structural errors accumulated across the pipeline. We employ an anchor-based storage architecture that embeds predefined mid-strand anchors during encoding. These fixed substrings, synthesized as part of the strand, act as reliable reference points during readout to guide alignment and trace reconstruction. By partitioning reads around anchors, the decoder reduces search space, improves alignment accuracy in the presence of insertions and deletions, and decreases reliance on heavy error-correction redundancy.

We implement this design and evaluate it with wet-lab experiments, observing a 37.5% improvement in both read cost and end-to-end latency relative to popular baseline, with no increase in write cost and no loss in storage density. Beyond efficiency, anchors enhance robustness by stabilizing local alignments, which in turn lowers reconstruction errors under realistic sequencing noise. Because anchors target the synchronization challenge directly, they complement traditional error-correcting codes and enable reducing ECC overhead while maintaining or improving retrieval reliability.

Our results show that mid-strand anchors constitute a simple, practical architectural change that materially accelerates and stabilizes DNA data recovery. This approach advances the viability of DNA as a general-purpose archival medium by improving throughput and lowering operational cost without compromising capacity.

Primer design for DNA storage random access

Jérémy Mateos - PEARCODE, I3S, CNRS

DNA is a promising alternative to traditional storage media due to the molecule's high density and long term stability. However, this novel medium presents challenges, particularly regarding the addressing to allow retrieving specific data from pooled DNA sequences, a process known as random access. This is achieved by designating an addressing zone on each DNA sequence known as a primer, a short DNA segment that acts as a file identifier for the stored information. Establishing a random access is crucial to optimize the efficiency and flexibility of data retrieval.

The efficiency of the random access highly depends on the quality of the primer sequences which is strongly conditioned by structural constraints of DNA sequences.

We propose a methodology for generating high-stringency primers that meet specific biochemical constraints, avoiding sequences that can form undesired shapes or loops that hinder DNA amplification and data retrieval. The tool uses a computational approach to predict the binding affinity and specificity of primers based on thermodynamic calculations. Users can adjust parameters to align with their specific wet lab protocols, providing accurate simulations that enhance comprehension, optimization, and efficiency in data retrieval during biochemical processes.

Gilbert–Varshamov Bound for Codes in L1 Metric using Multivariate Analytic Combinatorics

Keshav Goyal - I3S, CNRS

Analytic combinatorics in several variables refers to a suite of tools that provide sharp asymptotic estimates for certain combinatorial quantities. In this paper, we apply these tools to determine the Gilbert–Varshamov lower bound on the rate of optimal codes in L1 metric. Several different code spaces are analyzed, including the simplex and the hypercube in Zn, all of which are inspired by concrete data storage and transmission models such as the permutation channel, the repetition channel, the adjacent transposition (bit-shift) channel, the multilevel flash memory channel, etc.

Page updated

Google Sites

Report abuse