What DNA is and why it is used for storage
Deoxyribonucleic acid (DNA) is a biological molecule made of two strands that wind around each other to form the double helix, and it serves as the store of genetic information in almost all living organisms. It is attractive for technical information storage because it can hold an enormous amount of data in a very small volume and remains stable for thousands of years, in contrast to conventional storage media, which age quickly and consume a lot of energy.
The evolution of information storage up to DNA
Information storage has gone through many stages, from stone inscriptions to paper and then to electronic media such as magnetic tape, hard disks, and flash memory. As the volume of human data grew, the search began for more efficient and more durable alternatives. Theoretical proposals for DNA-based storage date from the 1960s and the first demonstration from 1988; the idea is attractive because the genetic system is built on a four-letter code that can be represented digitally.
How information is stored inside DNA
Storing information in DNA relies on converting digital data (binary: 0 and 1) into a sequence of the four DNA bases (A, T, C, G). A synthetic DNA sequence is manufactured to represent the message or file to be kept, and the data can later be retrieved by sequencing the DNA, reading the bases, and converting them back into digital data.
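The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the scheme used by any particular system: the bit-pair-to-base table is an arbitrary assumption, and real encoders add sequence constraints and error correction.

```python
# Illustrative mapping only (an assumption): two bits per base.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a base sequence, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(sequence: str) -> bytes:
    """Turn a base sequence produced by encode() back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = "DNA".encode("utf-8")
strand = encode(message)
print(strand)
print(decode(strand).decode("utf-8"))
```

Each byte becomes four bases, so a three-byte message yields a twelve-base strand, and decoding simply reverses the table.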
The future of knowledge storage in genetic archiving
Storing information in DNA is considered one of the most promising ways to preserve human knowledge over very long periods: entire libraries, images, and documents could be stored in microscopic strands that need no energy while in storage and can last for thousands of years. Cost and speed are also improving steadily, and DNA is expected to become common for long-term archival use within the next decade.
How library and information science can benefit
In the near future, libraries and information institutions may use DNA as an archival medium for important documents and records to protect them from deterioration, since it allows a large amount of data to be stored in a very small space with a high level of security. Huge archives could thus be turned into tiny biological capsules that last for centuries.
Challenges and risks of storing information in DNA
The main challenges are the high cost of writing (synthesis) and reading (sequencing), the slow speed of writing and reading data, and the risk of errors in encoding and retrieval. As the technology matures, costs are gradually falling and accuracy and speed are improving, but relying entirely on DNA will require further innovation to ensure reliability, security, and privacy.
Nucleic acids are long-chain biological molecules built from units called nucleotides, and they store genetic information in all living organisms.
Deoxyribonucleic acid (DNA)
Structure: two strands wound around each other in a double helix
Pentose sugar: deoxyribose
Nitrogenous bases: four bases, adenine (A), thymine (T), guanine (G), and cytosine (C)
Main function: storing genetic information over the long term and passing it from one generation to the next
Ribonucleic acid (RNA)
Structure: usually single-stranded
Pentose sugar: ribose
Nitrogenous bases: adenine (A), uracil (U), guanine (G), and cytosine (C)
Main function: carrying genetic information and taking part in gene expression and protein synthesis
Why DNA is used for storage:
Natural properties:
Structural stability
The double-helix structure protects the stored information
The two complementary strands act as a backup for each other
The absence of the hydroxyl group (OH) at the 2' position of the sugar makes DNA chemically more stable than RNA
Information density
One gram of DNA can store about 215 petabytes (215 million gigabytes) of data
This density exceeds that of any electronic storage medium by thousands of times
Longevity and durability
Under suitable conditions, DNA can remain stable for thousands of years
Readable DNA has been recovered from fossil samples more than 700,000 years old
It needs no electrical power to retain data (unlike hard disks)
Self-correction system
Cells have mechanisms for repairing damage in DNA
Complementarity between the bases (A-T and G-C) makes errors easy to detect
Ability to be copied
DNA can be copied with very high accuracy (about one error per billion nucleotides copied)
This is why information is preserved across generations
Technical uses in artificial storage:
Space savings
An entire data center could be stored in a space the size of a sugar cube
An ideal answer to the emerging storage crisis
Security and privacy
Difficult to read without specialized, high-precision equipment
Data can be encrypted at the level of the sequence itself
Environmental sustainability
Produces no heat and consumes no energy during storage
Chemically degradable and environmentally friendly
Resistance to harsh conditions
Resistant to electromagnetic radiation and electromagnetic pulses (EMP)
Can be protected against damage by simple means
How to benefit from it in library and information science
Applications of DNA in library science
Area one: Archiving and long-term preservation
1. Archiving special and rare collections
Practical applications:
a. Manuscripts and historical documents:
Digitization: converting old manuscripts into DNA sequences
Backups: protection against fires and floods
Permanent preservation: ensuring that future generations can access historical documents
Example: the Vatican Library could store all of its manuscripts (millions of pages) in a DNA pellet the size of a grain of sugar and keep them for thousands of years [10]
b. Images and multimedia:
Storing historical photographs
Preserving audio recordings
Archiving old documentary films
2. Managing government and institutional archives
Uses:
Government records: official documents, treaties, laws
Corporate archives: contracts and patents
Academic records: theses, research, experimental data
Advantages over traditional archiving:
Space savings: an entire archive in a small box instead of huge warehouses
Lower costs: no need for large, air-conditioned buildings
Security: resistant to deterioration, fire, water, and electromagnetic pulses
Area two: Organizing and indexing information
1. Cataloging system
What is new:
Using short DNA strands as "tags" for library materials
Each book or document carries a unique "DNA barcode"
Full bibliographic information can be embedded in the tag
Applications:
a. Advanced cataloging:
Multidimensional cataloging: each DNA tag carries:
Book data (author, title, year of publication)
Subject classification
Loan history and borrowers
The book's condition and preservation record
b. Smart retrieval:
Reading thousands of titles in seconds
Locating any book precisely within the library
Automatic inventory of collections
2. Classification and indexing
Genetic classification:
Hierarchical classification: using a tree structure encoded in DNA
Cross-references: DNA sequences that point to related materials
Dynamic updating: new classification information can be added
Area three: Managing library collections
1. Tracking materials
New techniques:
a. DNA tags as an improvement on RFID:
Advantages:
Smaller (nanometer scale)
More secure (very difficult to forge)
Much larger information capacity
Need no power
b. Preventing theft and forgery:
Each rare book carries a unique "DNA fingerprint"
Authenticity can be verified easily
The tag is very difficult to copy or forge
2. Maintenance and preservation
Applications:
Condition monitoring: DNA tags that respond to humidity or temperature
Restoration history: recording each restoration in the tag
Early warning: detecting damage before it becomes serious
Area four: Other services
1. Personalization and recommendations
Smart systems:
User profile: preferences stored securely
Smart recommendations: algorithms that analyze reading preferences
Privacy: encrypted and protected data
2. Remote access
Genetic digital services:
Electronic library: digital copies stored in DNA
Electronic lending: sending digital copies of physical books
Permanent access: ensuring content remains available indefinitely
Area five: Search and retrieval
1. Genetic search engines
Future prospects:
a. Parallel search:
Using DNA reactions to search millions of records at once
Speeds orders of magnitude beyond conventional computers
b. Deep semantic search:
Understanding context with greater precision
Linking related concepts
More varied and comprehensive search results
2. Data mining
Applications:
Knowledge discovery: analyzing patterns in how knowledge is used
Predictive analytics: anticipating the needs of future users
Research trends: tracking the direction of scientific work
Area six: Cooperation and library networks
1. Union catalogs
Genetic networks:
Global catalog: a unified DNA database for all libraries
Rapid exchange: sharing bibliographic information immediately
Standardization: DNA standards for library cataloging
2. Interlibrary loan
Advanced techniques:
Tracking materials lent between libraries
Ensuring materials are returned
Reducing loss
Area seven: Education and training
1. New information science programs
Future curricula:
Genetic information science: a new specialty combining library science and molecular biology
Managing genetic archives: new skills for librarians
Biotechnology: using molecular tools in libraries
2. Information literacy
Educational programs:
Educating users about DNA technologies
Workshops on genetic archiving
Seminars on the future of information
Area eight: Ethics and policy
1. Privacy policy
Issues:
Protecting user data stored in DNA
Access and use policies
Informed consent for storage
2. Intellectual property rights
New challenges:
Copyright for content stored in DNA
Open access issues
Licensing and fair use
Practical application models
Model 1: The US Library of Congress
Proposed project:
Storing 17 million books in a single DNA capsule
Saving on storage costs
Ensuring preservation for future generations
Benefits:
Reducing the required space from miles of shelving to a small box
Protection against natural disasters
Multiple copies can be made easily
Model 2: A national library (any country)
Applications:
National archive: all government and historical documents
Cultural heritage: manuscripts and old books
Intellectual output: all local theses and research
Implementation strategy:
Phase 1 (2025-2028): digitizing the old collections
Phase 2 (2028-2032): storing the digital copies in DNA
Phase 3 (2032-2035): a complete retrieval system
Phase 4 (2035+): storing all new acquisitions in DNA
Model 3: Academic libraries
Specific applications:
Theses and dissertations: a permanent archive of all research
Research data: storing large datasets
Scientific journals: a complete archive of periodicals
Benefits for researchers:
Ensuring research data is not lost
Permanent access to scientific research
Supporting the production of new science
Implementation roadmap
Preparatory phase (2025-2027)
1. Assessment and planning:
Assessing current library collections
Setting priorities for digitization and storage
Developing a strategy for the transition
2. Infrastructure:
Establishing DNA laboratories within libraries or partnering with existing laboratories
Training staff in the new techniques
Developing cataloging and retrieval systems
3. Standards and protocols:
Establishing basic rules for storage
Taking part in international standardization initiatives
Developing usage guidelines
Pilot phase (2027-2030)
1. Pilot projects:
Selecting a limited collection for the trial (for example, 100 rare books)
Storing it in DNA and testing retrieval
Evaluating results and processes
2. Expansion:
Increasing the number of stored items
Working to lower costs
Sharing lessons learned
Full implementation phase (2030+)
1. Full integration:
Integrating DNA storage into all library operations
Storing all new acquisitions in DNA
Creating a genetic copy of all collections
2. Continuous development:
Keeping pace with the technology
Improving services for users
Continuing research and development
Conclusion
DNA storage promises a revolution in library and information science, offering:
Permanent preservation of human heritage
Unmatched efficiency in the use of space
Reliable security
Excellent sustainability
Librarians and information specialists should prepare for this shift by:
Learning about the new technologies
Taking part in developing standards
Planning for the transition
Collaborating with molecular biologists
Overview of DNA data storage
The idea of using molecules to store digital information was first proposed by American scientist Richard Feynman in a public lecture in 1959 [36]. In the mid-1960s, Mikhail Samoilovich Neiman and Norbert Wiener proposed, in theory, a miniature device for storing data with DNA molecules [37,38]. However, the first validated work on DNA data storage was done by Davis in 1988, who encoded the “Microvenus” icon into 28-base-pair long double-stranded DNA (dsDNA) inserted into Escherichia coli (E. coli) [39]. The maximum storage capacity of the subsequent period was only a few tens of bytes [40–42]. The major breakthroughs and real demonstrations occurred in the 2010s when Church et al. and Goldman et al. successfully stored several hundred kilobytes of data in DNA, taking advantage of array-based DNA synthesis technology [13,14]. Recently, with the development of high-throughput DNA synthesis and sequencing technologies, the capacity of digital data (pictures, books, films, etc.) to be stored in DNA is breaking through to the MB level [23,43,44], and even large-scale storage of 200 MB has been achieved [45].
As mentioned above, data storage with DNA involves several steps (Fig. 1): encoding, writing, preservation, reading, and decoding [8,12,18,28–35].
Fig. 1 Overview of the main steps of data storage using DNA. Data input: the 0/1 string of digital data is encoded into A/G/C/T base sequences according to an algorithm (encoding). In addition to the data coding area, each strand includes primer sequences at both ends for PCR, addressing sequences to mark positions, and error-correcting codes (ECC). The encoded DNA strands are synthesized by phosphoramidite chemistry method or enzymatic synthesis (writing). The DNA strands are stored in vivo or in vitro (preservation). Data output: the base sequences data of the DNA strands are obtained by DNA sequencing technology (reading); finally, the A/G/C/T sequences are retrieved into digital data according to the initial encoding algorithm (decoding).
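The strand layout in Fig. 1 (primers at both ends, an address, and a data region) can be sketched as follows. The primer sequences, field widths, and function names are hypothetical choices made for illustration, and the error-correcting code is left out to keep the sketch short.

```python
# Hypothetical primers and field sizes, chosen only for illustration.
FORWARD_PRIMER = "ACGTACGTAC"
REVERSE_PRIMER = "TGCATGCATG"
ADDRESS_LENGTH = 4    # bases reserved for the strand index
PAYLOAD_LENGTH = 20   # data bases carried by each strand
BASES = "ACGT"


def index_to_address(index: int) -> str:
    """Write the strand index as a fixed-width base-4 number in A/C/G/T."""
    if not 0 <= index < 4 ** ADDRESS_LENGTH:
        raise ValueError("index does not fit in the address field")
    digits = []
    for _ in range(ADDRESS_LENGTH):
        index, digit = divmod(index, 4)
        digits.append(BASES[digit])
    return "".join(reversed(digits))


def address_to_index(address: str) -> int:
    """Inverse of index_to_address()."""
    value = 0
    for base in address:
        value = value * 4 + BASES.index(base)
    return value


def build_strands(payload: str) -> list[str]:
    """Split an encoded payload into addressed strands flanked by primers."""
    strands = []
    for start in range(0, len(payload), PAYLOAD_LENGTH):
        chunk = payload[start:start + PAYLOAD_LENGTH]
        address = index_to_address(start // PAYLOAD_LENGTH)
        strands.append(FORWARD_PRIMER + address + chunk + REVERSE_PRIMER)
    return strands


def reassemble(strands: list[str]) -> str:
    """Strip primers, order the chunks by address, and join them."""
    chunks = {}
    for strand in strands:
        body = strand[len(FORWARD_PRIMER):len(strand) - len(REVERSE_PRIMER)]
        chunks[address_to_index(body[:ADDRESS_LENGTH])] = body[ADDRESS_LENGTH:]
    return "".join(chunks[i] for i in sorted(chunks))
```

Because every strand carries its own address, the pool can be read in any order and still be reassembled, which is what makes an unordered mixture of molecules usable as storage.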
needed for better understanding and optimizing the compatibility, stability, and functionality of the input DNA [70–72]. In addition, specific laboratory environments are often necessary for ensuring the genetic stability as well as the viability of the living organisms, unless considering using the tenacious candidates [73]. For DNA data storage applications, the most favorable method should be selected according to the frequency of access to data required in different scenarios. Generally, for DNA data storage, the amount of each sequence of the synthesized DNA is at a trace scale, because large-scale parallelization is needed to increase the speed and data density of the “writing step”. In order to ensure the effectiveness and reliability of the data, PCR amplification technology is necessary to increase the concentration of the synthesized products and back up the data as well [74–76].
“Reading” means that the stored DNA molecules are extracted by biochemical methods, and the target base sequences are identified one by one to obtain the written coding data. DNA sequencing technology is used to read and splice the base sequences carried by DNA fragments in the oligonucleotide pool. In early studies, data stored in DNA required sequencing all of the molecules. Later on, PCR-based random-access techniques were developed [17,45], allowing random access to a portion of the data without sequencing all of the DNA in the oligonucleotide pool. As a new trend, array-based technologies for DNA data storage may ease the workload for the PCR because the synthesized DNA is confined or immobilized to the designed location on the array and can be addressed directly via the chip. Approaches such as spot-specified digital microfluidics [77], sequencing-by-synthesis [78], DNA microdisks [79], and SlipChip [80] have contributed to a further step towards high manipulability and rapid access. The DNA is conjugated to the surface of the chip and is not damaged or lost during replication, which also allows for easy access and handling, reducing the need for PCR primer selection and large-scale PCR amplification.
Finally, “decoding” is the reconversion of base sequences into digital data and further restoration to the original format of the data. In the whole workflow of a storage cycle, biological and chemical reactions take on the function of writing/reading data.
Using DNA for data storage has several attractive advantages: (1) High storage density. Considering a coding density of 2 bits per base [13], DNA would have a theoretical data density of about 6 bits per nm [33], given that a nucleotide is ∼0.3 nm long. If we only consider the nature of the DNA molecule and put aside the complexity of the practical aspects including data retrieval, 1 gram of DNA can store about 4.5 × 10^7 GB of data given that only a single copy of each unique DNA sequence is present in the mixture, while the current technology only stores 10 terabytes (TB) on a 600 g HDD, which is a difference of 6 orders of magnitude [17,33,81]. To allow the data to be fully retrieved, using as few as 10 copies per sequence in the mixture would result in a storage density of 17 EB per gram [17], which is still a significant improvement compared with the current HDD. (2) Long preservation time and durability. Under suitable conditions (e.g., at room temperature in a dry atmosphere, or as lyophilized powder), DNA can remain stable for thousands of years and withstand temperatures as low as −196 °C (liquid nitrogen) and as high as 250 °C (silica) [82–85]. As for magnetic, silicon-based storage devices, the requirements for humidity, temperature and magnetic fields in the environment are stringent and the lifetime usually does not exceed 50 years [15]. However, long-term storage of DNA molecules does face some risks [68]. For example, the stored data may be contaminated by bacterial or human DNA [67]. In addition, natural DNA is highly susceptible to degradation by microorganisms and nuclease enzymes in the natural environment, while environmental factors can cause strand breaks, hydrolytic damage and UV-induced cross-linking, all of which can lead to partial data loss.
Mirror-image DNA has the same storage density as natural DNA, but also has a unique bio-orthogonality, which prevents it from being easily degraded by microorganisms and nucleases, and it has been successfully utilized in orthogonal information storage [86–88]. (3) Low maintenance cost and environmental friendliness. The inherent durability of DNA renders it highly amenable to preservation. Compared with the regular maintenance of conventional long-term storage equipment, which consumes a lot of electricity, energy, and land resources, the energy needed to store DNA is almost negligible [15,28,89]. In addition, the data stored in DNA can be easily backed up by PCR technology. A helpful comparison of the main performance indicators for various storage media was given by Linda C. Meiser et al. in 2022 (Fig. 2) [31]. Although current DNA synthesis methods cannot completely avoid using toxic chemicals, even in the case of enzymatic synthesis, DNA is still a friendlier option than competing data storage media, as DNA is biodegradable [90] and requires fewer heavy metals and rare elements for synthesis [31].
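The density comparison quoted under advantage (1) can be checked with a short calculation. This is a sketch under stated assumptions: it reads the per-gram DNA figure as 4.5 × 10^7 GB and takes 1 TB as 1000 GB for the hard disk.

```python
import math

BITS_PER_BASE = 2            # coding density quoted in the text
NUCLEOTIDE_LENGTH_NM = 0.3   # approximate nucleotide length quoted in the text

# Linear density along the strand, in bits per nanometre.
linear_density = BITS_PER_BASE / NUCLEOTIDE_LENGTH_NM

DNA_GB_PER_GRAM = 4.5e7          # assumption: quoted figure read as 4.5 x 10^7 GB
HDD_GB_PER_GRAM = 10_000 / 600   # 10 TB on a 600 g drive, with 1 TB = 1000 GB

ratio = DNA_GB_PER_GRAM / HDD_GB_PER_GRAM
orders_of_magnitude = math.log10(ratio)

print(f"linear density: {linear_density:.1f} bits/nm")
print(f"DNA vs HDD per gram: {ratio:.2e} ({orders_of_magnitude:.1f} orders of magnitude)")
```

The ratio comes out at a few million, which matches the roughly six orders of magnitude stated in the text.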
Fig. 2 A comparison of the various storage methods in terms of lifetime, capacity and cost. The cost of mainstream media is derived from the average consumer market price. The data survey was carried out during the writing period of ref. 31. Reproduced from ref. 31 with permission from Copyright 2022 Springer Nature.
In 2020, the world's leading enterprises including Microsoft Research, Illumina, Western Digital, Twist Bioscience, etc., founded the international organization “DNA Data Storage Alliance”. As the association is growing, the total number of members has now exceeded 40. It brings together the world's state-of-the-art information technology, DNA artificial synthesis, DNA sequencing, and integrated circuit manufacturing industries.
Their mission is to create and promote an interoperable storage ecosystem based on manufactured DNA as a data storage medium. The alliance launched the first version of its white paper in 2021 [3], outlining the background, strategy, and technical development of DNA data storage. It seems promising that the establishment of the consortium will accelerate cross-fertilization and breakthroughs in data encoding, high-throughput DNA synthesis, and sequencing technologies, and will vigorously promote the progress of DNA data storage technology.
However, current DNA data storage technology still faces several challenges: (1) Low throughput and speed. At present, the throughput of synthesis and sequencing technology is far from high enough for data storage, particularly for synthesis. Enzymatic synthesis offers a higher speed for “data writing” compared with the chemical approach. It has been demonstrated that the coupling time of enzymatic synthesis can be minimized to 10–20 s, while that of the chemical phosphoramidite synthesis is usually in the range of 4–10 min [91,92]. Lee et al. gave an estimate of 40 s per cycle for enzymatic synthesis, which is six times faster than phosphoramidite synthesis. However, this rate is still much slower than that of the state-of-the-art electronic devices [93,94]. (2) Difficult data access. Unlike conventional storage devices, it is not yet feasible to access random parts of the data or modify them in DNA molecules on a single device. (3) Workload in large-scale data reproduction. Although the PCR is no doubt a powerful tool for nucleic acid amplification and is generally acknowledged to be a high-fidelity process, it introduces bias through, e.g., the GC content of the strand, which may cause loss of the data strands containing a high GC content during amplification. This leads to a significantly different proportion of the sequences when the number of PCR cycles is large [95,96], and affects both data storage capacity and retrieval efficiency. Also, it is still difficult to amplify highly repetitive sequences by the PCR [97]. (4) High complexity and costs of integration. Most of today's DNA data storage strategies are realized on separate devices and locations for synthesis, preservation, replication, and sequencing sessions, making the process complex and time-consuming.
Besides, although the average cost of sequencing per megabase in 2021 was only $0.006 (calculated based on the production cost of sequencing one million bases, including equipment, reagents, administration, and overhead costs), significantly decreased from $5292.39 in 2001, according to the National Human Genome Research Institute (NHGRI) [98], the cost of DNA synthesis is still orders of magnitude higher than the cost of sequencing. According to the estimation by Meiser et al., storing 1 MB of encoded data in DNA would cost around $800 to $5000 [23,28,31], of which the cost of DNA synthesis makes up the major proportion [12,99]. Yet, tape storage costs just $16 per TB [33]. Antkowiak et al. gave a detailed estimation of each step of the DNA data storage workflow in 2020 [100]. The high cost greatly prevents DNA storage from becoming a commercial product [13,28,100]. Nevertheless, DNA data storage is still considered to be one of the most promising long-term storage solutions for the future, as the cost of synthesis and sequencing keeps falling dramatically and consistently over the years.
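The PCR bias described under challenge (3) is usually handled by screening candidate strands before synthesis. The sketch below is one way to do that in Python; the GC window and the homopolymer limit are hypothetical thresholds rather than values taken from the text, and the homopolymer check stands in for only the simplest kind of repetitive sequence.

```python
def gc_content(strand: str) -> float:
    """Fraction of bases in the strand that are G or C."""
    if not strand:
        raise ValueError("empty strand")
    strand = strand.upper()
    return (strand.count("G") + strand.count("C")) / len(strand)


def longest_homopolymer(strand: str) -> int:
    """Length of the longest run of one repeated base."""
    longest = run = 0
    previous = ""
    for base in strand.upper():
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


def is_pcr_friendly(strand: str,
                    gc_min: float = 0.4,   # hypothetical lower bound
                    gc_max: float = 0.6,   # hypothetical upper bound
                    max_run: int = 3) -> bool:  # hypothetical run limit
    """Accept a strand only if it is GC-balanced and has no long runs."""
    return (gc_min <= gc_content(strand) <= gc_max
            and longest_homopolymer(strand) <= max_run)


for candidate in ("ATGCATGC", "GGGGCCCC", "AAAATGCC"):
    print(candidate, is_pcr_friendly(candidate))
```

An encoder can reject or re-map any strand that fails the check, which trades a little coding density for more uniform amplification.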
3. DNA synthesis
Fundamentally, two methods have been developed for artificial DNA synthesis: chemical and enzymatic. Based on these two methods, the technological route of artificial DNA synthesis can be divided into three generations (Fig. 3) [91,94,101–103]. The first generation is the traditional column-based solid-phase phosphoramidite chemistry synthesis. Owing to the development of microchip technology, high-throughput array-based synthesis built on phosphoramidite chemistry has blossomed in recent years and is considered to be the second generation. Enzymatic synthesis, the third generation, could bring the synthetic biology industry into an exciting next stage.
Fig. 3 Overview of DNA synthesis techniques and their classification.
3.1. Chemical synthesis
The history of DNA synthesis began in the 1950s when Michelson and Todd published the first chemical synthesis of dinucleotides [104]. Subsequently, the phosphodiester [105] and phosphotriester [106] methods of oligo synthesis were developed. In 1981, Caruthers first described the solid-phase phosphoramidite method of oligo synthesis [107]. In this method, nucleotides are covalently immobilized on a solid-phase carrier. Phosphoramidite monomers, each carrying a base group, are used as the synthesis unit. The monomers undergo a series of chemical reactions to extend the nucleotide strand in a controlled manner. So far, this is still the standard protocol for chemical synthesis of DNA. The conventional solid-phase phosphoramidite chemistry method consists of four cyclic steps, which are displayed in Fig. 4 [94,101,108].
Fig. 4 A four-step cycle for the synthesis of oligonucleotides by solid-phase phosphoramidite chemistry method. ① Deprotection. The DMT group at the 5′ end of an oligonucleotide monomer is removed, and the hydroxyl group is exposed to start the reaction. ② Coupling. The desired free nucleotide monomer is attached to the 5′ end hydroxyl group of the previous monomer. ③ Capping. The unreacted 5′ end hydroxyl groups of the oligonucleotide are sealed to prevent unwanted strand extension. ④ Oxidation. Oxidation reagent oxidizes the linkage bonds between the coupled monomers to a more stable state. The cycle is repeated until the target sequences are achieved. Reproduced from ref. 94 with permission from Copyright 2014 Springer Nature.
investigated and the mechanism of the enzyme is well studied. However, several issues need to be considered carefully: how to add nucleotides in a controlled and precise manner? What moieties are used to modify the monomers? How much do the unreacted initiators contribute to the deletion error rate? What is the probability of side reactions occurring in the synthesis process? What level of scale can be achieved for target products [122]? How to improve the enzyme activity of the native TdT on 3′-end blocked dNTPs? In addition, further exploratory improvements in enzyme engineering and optimization of enzyme cycle reactions, etc., are still needed for large-scale industrialization. For example, Lu et al. demonstrated a two-step cyclic synthetic route using an engineered TdT from Zonotrichia albicollis (ZaTdT) with an average stepwise coupling efficiency of 98.7% for extending single nucleotides, which has some potential applications. The catalytic activity of this engineered enzyme was 3-fold higher than that of the normal TdT enzyme [128]. Verardo et al. [129] from DNA Script recently reported their approach to large-scale industrialization of TdT-based enzymatic DNA synthesis, which will be discussed in detail later in this review. This is a significant step towards the industrialization and parallelization of enzymatic synthesis.
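A stepwise coupling efficiency such as the 98.7% quoted above compounds over the length of the strand. A minimal sketch, assuming each addition succeeds independently with the same probability and that a single failed addition spoils the strand:

```python
def full_length_yield(coupling_efficiency: float, additions: int) -> float:
    """Fraction of strands on which every one of `additions` steps succeeded."""
    if not 0.0 <= coupling_efficiency <= 1.0:
        raise ValueError("coupling efficiency must be between 0 and 1")
    return coupling_efficiency ** additions


for length in (25, 50, 100, 200):
    share = full_length_yield(0.987, length)
    print(f"{length:>3} additions: {share:.1%} full-length product")
```

Under these assumptions only about a quarter of the strands are error-free after 100 additions, which is why small gains in per-step efficiency matter so much for long oligonucleotides.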
3.3. Technological development for DNA data storage
DNA synthesis technology is a key step in the process of DNA data storage. The speed, throughput, accuracy, and cost of synthesis all contribute to determining the availability of DNA data storage.
There are two strategies for improving the throughput of DNA synthesis: one is to simply increase the number of channels for the above column-based synthesis and expand the scale of parallel synthesis; the other is to increase the synthesis density and miniaturize the system as a whole. Miniaturized array-based synthesis allows more sequences to be synthesized in parallel in a limited space while reducing the amount of consumed liquids. The scale of the products at a single site in the array is much lower than that in the column. Furthermore, array-based synthesis costs only $0.00001–$0.0001 per base, while column-based synthesis costs $0.05–$0.10 per base, which is 2–4 orders of magnitude higher [94,101]. Array-based DNA synthesis is oriented to the synthetic biology field of gene splicing, library building, and other applications that require trace level (e.g., fmol) as well as multiple sequences. The automation and continuous miniaturization of the instrument further enhance the throughput of array-based synthesis, which makes it a more suitable platform for artificial DNA synthesis aimed at data storage.
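To put the per-base prices above in storage terms, the sketch below converts them into a raw synthesis cost per megabyte. It assumes 2 bits per base and 1 MB = 10^6 bytes, and it ignores primers, addresses, redundancy, and the number of copies per strand, so the results are lower bounds for illustration only.

```python
BITS_PER_BASE = 2  # assumption: ideal coding density with no overhead


def synthesis_cost_per_megabyte(cost_per_base: float,
                                bits_per_base: int = BITS_PER_BASE) -> float:
    """Raw synthesis cost in dollars for 10^6 bytes of payload."""
    bases_needed = 8 * 10**6 / bits_per_base
    return cost_per_base * bases_needed


# Price ranges per base quoted in the text.
prices = {
    "array (low)": 0.00001,
    "array (high)": 0.0001,
    "column (low)": 0.05,
    "column (high)": 0.10,
}

for label, price in prices.items():
    print(f"{label:<14} ${synthesis_cost_per_megabyte(price):,.0f} per MB")
```

The array-based figures land in the tens to hundreds of dollars per megabyte, while column-based synthesis reaches hundreds of thousands of dollars for the same payload.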
Fig. 5 shows the density of synthetic arrays required to achieve high-speed writing of large amounts of data. The total amount of data written per unit area can be calculated using the following equation:
C = Eυριt
where C is the total writing capacity of the data, E is the coding density of each base, υ is the synthesis time per base, ρ is the number of synthesis sites per cm², ι is the effective nucleotide strand length and t is the total synthesis time.
Fig. 5 The relationship between the amount of data written, the speed and site density of array-based DNA synthesis. The three curves represent the required synthesis site density and individual base synthesis time for writing speeds up to KB, MB, and GB levels respectively. Fixed parameters: synthesis length of 100 nt, single base coding density of 2 bits. To achieve high-throughput synthesis, high-speed data writing requires faster synthesis speeds and higher site densities.
Assuming that the amount of data shown in the figure needs to be achieved over an area of 1 square centimeter (cm²), and if a base can be encoded as 2 bits, a single synthesis site effectively encodes a nucleotide length of 100 nt, and bases can be synthesized at a rate of 1 base per second, then, to achieve TB (1 TB = 2^40 B) level data writing in one day, the scale of the array sites needs to be below the submicron level. However, current coding density is only able to reach 2 bits per base pair [11–15]. What's more, the “encoding” and “decoding” steps also lead to errors. To restore the original data, in addition to the information-containing fragments, a certain length of data redundancy sequence needs to be added to the synthesized DNA strand. This requires the length of the synthetic sequence to be longer than the effective coding sequence [35]. In sum, the above description implies that a much higher array density is required to achieve the TB level of data per day. To achieve such high-density arrays, micro and nanochips based on integrated circuit fabrication are the most suitable strategy.
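The scenario above can be turned into a rough estimate of the site density it implies. This sketch rests on an interpretation that is an assumption, not the authors' stated method: each site adds one base every υ seconds, completes a strand of ι bases, and immediately starts the next one with no turnaround time, so the data written per site is E · ι · (number of strands finished in time t). Redundancy and error correction are not included.

```python
import math


def bits_per_site(bits_per_base: float, seconds_per_base: float,
                  strand_length_nt: int, total_seconds: float) -> float:
    """Bits written at one site, counting only completed strands."""
    strands_completed = int(total_seconds // (seconds_per_base * strand_length_nt))
    return bits_per_base * strand_length_nt * strands_completed


def required_site_density(target_bytes: float, per_site_bits: float) -> float:
    """Sites per cm^2 needed to write target_bytes on one square centimetre."""
    return target_bytes * 8 / per_site_bits


def site_pitch_um(sites_per_cm2: float) -> float:
    """Centre-to-centre spacing of a square grid, in micrometres."""
    return 1e4 / math.sqrt(sites_per_cm2)


per_site = bits_per_site(bits_per_base=2, seconds_per_base=1.0,
                         strand_length_nt=100, total_seconds=24 * 3600)
density = required_site_density(target_bytes=2**40, per_site_bits=per_site)
print(f"bits per site per day: {per_site:,.0f}")
print(f"sites per cm^2 for 1 TB/day: {density:.2e}")
print(f"implied site pitch: {site_pitch_um(density):.2f} um")
```

Under these assumptions the required spacing between sites is on the order of a micrometre before any redundancy is added, so the features themselves, and any design that carries extra error-correcting bases, must be smaller still.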
Here, we aim to list and evaluate the diverse technological routes of utilizing integrated micro and nanoscale chips for DNA artificial synthesis. By weighing the pros and cons of each unique route, we hope that this review could provide a basic perspective on the trends in high-throughput DNA synthesis.
4. Array-based DNA synthesis chip
In the early 1990s, Affymetrix (acquired by Thermo Fisher) utilized integrated circuit technology to achieve high-density, 25 nt oligonucleotide synthesis on a single chip, opening up the route to DNA array-based synthesis technology [130,131]. After decades of technological innovation, a variety of DNA array-based synthesis technologies have been developed and commercialized. These technologies are based on the four-step phosphoramidite chemistry but with different deprotection mechanisms. These are realized by involving terminal protecting groups of different natures in the synthesis process, such as pH-sensitive protecting groups [80,116,132–135], temperature-sensitive protecting groups [32,136,137] or photolabile groups [138–140]. Several array-based DNA synthesis methods are briefly illustrated in Fig. 6. Each of these has its advantages and disadvantages in terms of synthesis density, coupling efficiency, length, fidelity, time, and cost. These parameters are also evaluation indicators of DNA synthesis technologies. The following sections will further describe several mainstream technologies and the state of technological development of representative companies employing these technologies.
Fig. 6 Schematic of array-based DNA synthesis. (a) Inkjet printing synthesis. Each nozzle is equipped with a different nucleotide monomer reagent, moving over the chip surface to deliver reagents to a designated site. (b) Thermal synthesis. The heating source makes the reaction site active for the bases to attach to it as the reagent flows across the entire chip. The cycles of heating and extension are repeated until thousands of different nucleotide sequences are synthesized in parallel on the chip. (c) Mask-based lithography synthesis. Different colored rectangles represent masks with different patterns. In each exposure, light is only allowed to pass through specific areas (bright color). The black round shape represents the protecting group, and red indicates that it is undergoing deprotection. Letters A/G/C/T represent four different nucleotide monomers. (d) Maskless digital micromirror lithography synthesis. The bright-colored square within the dashed line area indicates digital micromirror devices (DMD) are “on”, while gray indicates “off”. When the DMD are at “on” state, light is reflected onto the substrate for deprotection. (e) Electrochemical synthesis. The bright-colored electrodes indicate that an electric potential is applied to the appointed active spots to deblock the protecting group for further DNA synthesis process.
4.1. Inkjet printing synthesis
Piezoelectric inkjet DNA synthesis was first proposed by Blanchard and Hood.141 This approach loads the monomer reagents into tiny nozzles as the “ink”, and uses an inkjet printer, namely a microdroplet generator, to precisely deposit reagents onto the surface of a functionalized substrate to achieve large-scale parallel synthesis of oligonucleotide sequences.108,142 In brief, program-controlled nozzles move rapidly above the chip and spray chemical reagents onto specified synthesis sites one by one according to a designed sequence, as displayed in Fig. 7.143 The inkjet system is able to deposit the required monomer type at each site across the entire chip rapidly in one round of injection in the coupling step, while the steps of deprotection, capping, oxidation and cleaning are carried out in the tiling mode through multiple channels. Generally, a piezoelectric printhead can control the liquid volume to the picoliter (pL) level. The droplets spread on the array substrate to a diameter of tens to hundreds of microns in tens of microseconds. Owing to the small volume of the printed liquids, the reagent addition time is only hundreds of milliseconds. High-speed motors control the rapid movement of the microarray in the front and rear directions, and tens of thousands of dots can be printed in minutes. Additionally, 1,4-dicyanobutane (a more viscous, non-volatile solvent than acetonitrile) is used to dissolve the phosphoramidite monomers and the catalyst. This slows down the solvent evaporation, thus prolonging the reaction time between the reagent and substrate and ensuring coupling efficiency. However, as the nucleotide strand length extends, the surface properties of the array substrate may change, altering the size and location of the deposited droplets, which might result in cross-contamination between adjacent sites. Therefore, a flat chip structure cannot achieve extremely high-density loading.
Moreover, a large number of deletions begin to accumulate once the strand length exceeds 50 nt.108
Fig. 7 Schematic diagram of inkjet printing synthesis platform. (a) A program controls the motion of the inkjet print heads and prints trace amounts of phosphoramidite reagents on the slide surface.143 The slides are packed with tens of thousands of reaction chambers. Each of them can carry out a conventional four-step synthesis of phosphoramidite chemistry. Reproduced from ref. 143 with permission from Copyright 2013 Elsevier. (b) Twist's silicon-based DNA Synthesis platform. There are thousands of clusters on the chip, each consisting of 121 surface sites, performing different sequence synthesis.146
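To illustrate why errors accumulate with strand length in stepwise phosphoramidite synthesis, the following minimal sketch estimates the fraction of full-length strands from a per-step coupling efficiency. It assumes a constant efficiency at every cycle and that the first nucleotide is already attached to the support, so a strand of n nt needs n − 1 couplings. The efficiency values used are illustrative assumptions, not figures reported for any platform discussed here.

```python
def full_length_fraction(coupling_efficiency: float, length_nt: int) -> float:
    """Estimate the fraction of strands that reach full length.

    Assumes (hypothetically) a constant per-cycle coupling efficiency and
    that the first nucleotide is pre-attached, so length_nt - 1 couplings occur.
    """
    if not 0.0 <= coupling_efficiency <= 1.0:
        raise ValueError("coupling_efficiency must be between 0 and 1")
    if length_nt < 1:
        raise ValueError("length_nt must be at least 1")
    return coupling_efficiency ** (length_nt - 1)


if __name__ == "__main__":
    # Illustrative efficiencies only; real values depend on chemistry and platform.
    for eff in (0.99, 0.995):
        for n in (25, 50, 100, 200):
            frac = full_length_fraction(eff, n)
            print(f"efficiency={eff:.3f} length={n:>3} nt full-length={frac:.3f}")
```

Under these assumptions the full-length fraction decays geometrically with length, which is one reason coupling efficiency and achievable length are evaluated together.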
error checking to improve yields and the eventual assembly of the as-obtained dsDNA.136,148 The entire reaction processes are carried out at the reaction sites (called “virtual wells”) in a continuously flowing liquid system with thermosensitive reagents. Each heating site on the chip has a diameter of 100 μm and a space of 300 μm resulting in approximately 10 heaters per square millimeter.
Under the control of computer programs, thousands of sites can be independently activated and warmed to start the independent DNA synthesis cycles, respectively. The closed-loop thermal control system allows liquid in each virtual well to reach different temperatures within the same circulation system and avoids the thermal diffusion on each site that happens with an array of conventional heaters. Temperature sensors at the sites feed the actual temperature back to the computer system, and, then, an algorithm compares it with the target temperature to determine whether it needs to be warmed up or cooled down. This requires very precise scaling circuitry and algorithmic programming. To achieve both “warming & cooling” functions, the material with controlled thermal resistance is installed underneath the site, which draws heat from the site to achieve a cooling effect.149 As shown in Fig. 8a, firstly, the circuitry controls the generation of heat at the activated sites. The heat transfers to the liquid above and, as a result, the temperature-sensitive protecting groups are removed.137 Subsequently, a new monomer can be added to each oligonucleotide strand at the activated sites. The cycle of heating and extension is repeated until the target oligonucleotide strands are synthesized. After that, with the help of precise flow pumps and electromagnetic fields, the short ssDNA fragments are selectively released by heating and are transferred to the partner strands with complementary base sequences immobilized on the substrate. In this way, long dsDNA can be automatically assembled on the chip. In addition, mis-matched double strands are identified, once the oligos are annealed because they have a lower denaturation temperature than the desired DNA. Subsequently, unwanted DNA strands are removed by applying precise, sequence-dependent temperature followed by flushing liquid (Fig. 8b). 
The error correction and purification processes can minimize polluted fragments in the product and help to provide a higher yield. Finally, the successfully matched oligos continue to assemble into longer dsDNA by complementary pairing at the terminal (Fig. 8c).
Fig. 8 Schematic of thermally controlled oligonucleotide synthesis.103 (a) Thermally controlled strand extension process. The temperature-sensitive protecting group is removed by heating the selected site (site 1). The protecting group may alternatively be Boc, Fmoc, Bsmoc, and more examples could be found in ref. 137. Then, free oligonucleotide monomers are added onto the strand terminal. The cycles of heating and extension are repeated until the desired ssDNA fragments are achieved. (b) Thermally controlled cleavage and error-correction process. Deprotection and cleaving occur at different temperatures. The ssDNAs are released from site 3 by heating and then migrate toward partner strands with complementary base sequences which are immobilized on site 1; the mis-matches can be cleaved by applying a precise temperature during annealing and eventually washed away with the flowing liquid. (c) Thermally controlled assembly process. By heating site 5, the short dsDNAs are released and combined with another dsDNA (site 6) by the principle of complementary base pairing to assemble a longer strands; Heating site 4, short-stranded DNA continues to assemble at site 6. Those processes continue to produce desired long dsDNAs with high yield. Reproduced from ref. 103 with permission from Copyright 2023 Springer Nature
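The closed-loop idea described above (measure the site temperature, compare it with the target, then warm or cool) can be sketched as a simple on/off controller. This is a toy model under stated assumptions and is not the control algorithm of any commercial platform: the function names, the dead band, and the heating and cooling rates are all hypothetical.

```python
def control_step(measured_c: float, target_c: float, deadband_c: float = 0.5) -> str:
    """Decide the action for one site from its measured and target temperature."""
    if measured_c < target_c - deadband_c:
        return "heat"
    if measured_c > target_c + deadband_c:
        return "cool"
    return "hold"


def simulate_site(start_c: float, target_c: float, steps: int = 50,
                  heat_rate: float = 2.0, cool_rate: float = 1.5) -> list:
    """Simulate one site; the rates (deg C per step) are hypothetical values."""
    temp = start_c
    history = []
    for _ in range(steps):
        action = control_step(temp, target_c)
        if action == "heat":
            temp += heat_rate
        elif action == "cool":
            temp -= cool_rate
        history.append(temp)
    return history


if __name__ == "__main__":
    # Two sites in the same flow cell driven to different targets independently.
    site_a = simulate_site(start_c=25.0, target_c=90.0)
    site_b = simulate_site(start_c=25.0, target_c=60.0)
    print("site A final:", site_a[-1], "site B final:", site_b[-1])
```

With these illustrative values both simulated sites settle inside the dead band around their own targets, which mirrors the point that each virtual well is controlled independently.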
It is claimed that this technology platform is compatible with chemical and enzymatic DNA synthesis methods. However, it also faces some challenges. For example, appropriate protecting groups are selected according to the type of activating agent used in the heating step. When the activator is acidic (e.g., trifluoroacetic acid), tert-butyloxy carbonyl (Boc) or trityl (Trt) is mostly used. When the activator is basic (e.g., morpholine or piperidine), (1,1-dioxobenzo[b]thiophene-2-ylmethyloxycarbonyl (Bsmoc)) is preferable.137 It is challenging to develop highly temperature-sensitive protecting functional groups. Another serious difficulty is how to independently and precisely control the thermal behavior of micron reaction sites on the chip. To ensure the efficiency of synthesis, it is important to consider the approach that can help the generated heat dispersed evenly around the reaction site without conducting to the gap region or the adjacent sites. To efficiently conduct heat, Evonetix has developed a cooling system that consists of fluid flowing coolant, a thermoelectric cooler, and a copper substrate glued to the back side of the chip. Besides, there are other technical difficulties, such as: which microfluidic system to choose and what is the optimal flow rate? How to control the behavior of DNA under different thermal conditions? How to manufacture precisely assembled silicon wafer modules and avoid the risk of wafer explosion at the weakly bonded area during heating? How to prevent the chip from corroding when it is immersed in the strong acid/alkali reagent at a high temperature?
4.3. Photochemical synthesis
Photochemical synthesis is realized as follows: firstly, a laser with a specific wavelength is precisely projected onto the selected sites of the array substrate. On the irradiated sites, the protecting groups at the 5′ end of the nucleotides are removed. Subsequently, a series of chemical reactions including coupling and capping are performed in a tiling manner, while nucleotide strand extension only occurs on the irradiated sites. According to the mechanism of deprotection, there are two types of synthesis methods: photo-acids and photo-degradation.130,138,151,152 The principle of photo-acids is to decompose the photocatalyst by light exposure, generating acid to remove the protecting group (e.g., DMT).134,135 The photo-degradation approach, on the other hand, is based on direct decomposition of the photolabile protecting groups (e.g., 2-(2-nitrophenyl)-propoxycarbonyl (NPPOC) or benzoyl-2-(2-nitrophenyl) propoxycarbonyl (Bz-NPPOC)) caused by the projected light.138,140,152,153 According to the optical control system, photochemical synthesis is divided into two types: mask-based photolithography (used by Affymetrix) and maskless photolithography (used by Roche and LC Sciences).
Mask-based photolithography synthesis refers to the transmission of light through specifically designed physical masks placed over the synthesis surface. Light is only allowed to pass through the transparent area of the mask and is projected onto the substrate at certain locations.154 Affymetrix's commercial product GeneChip™ represents mask-based photolithographic in situ synthesis (Fig. 9a).155 This technique typically produces 20–25 nt oligonucleotide strands and more than 10⁶ feature sites per chip. With the development of the photolithography process, the feature size has evolved from 50 μm to 20 μm, 18 μm, 11 μm and eventually down to 5 μm on a 1.28 cm × 1.28 cm chip in 2005.139,156 Subsequently, simulations showed that a further reduction of the feature size to 1 μm, with densities up to 1 × 10⁸ cm⁻², is promising provided that diffraction is reasonably controlled.139 However, a unique mask is needed for almost every cycle of nucleotide strand extension. For long sequence synthesis, the mask photolithography method requires a large number of custom-made mask plates, which dramatically increases the cost of synthesis.
Fig. 9 Schematic diagram of two photochemical methods of DNA synthesis. (a) Mask-based photolithography synthesis.155 Top: Mask-based photolithography. UV light passes through a lithographic mask that acts as a filter to either transmit or block the light from the chemically protected microarray surface (wafer). The sequential application of specific lithographic masks determines the order of sequence synthesis on the surface. Bottom: Chemical synthesis cycle. UV light removes the protecting groups (squares) from the array surface, allowing the addition of one nucleotide. The sequential synthesis cycles result in multiple 25-mer probes on the array surface. Reproduced from ref. 155 with permission from Copyright 2015 Elsevier. (b) Maskless photolithography synthesis.152 Top: DMD. The 365 nm UV light from an LED is uniformly projected onto the DMD. Digital micromirrors in the “ON” state reflect the light onto the surface of selected synthesis sites. Bottom: The cycles of phosphoramidite synthesis with a Bz-NPPOC protecting group at the 5′ end that is used in this method. Reproduced from ref. 152 with permission from Copyright 2021 Oxford University Press.
semiconductor (CMOS) integrated circuit chips. These microelectrodes are treated with a porous reaction layer (sucrose) to improve the quality of nucleotides synthesis.164 To confine the diffusion of proton acid from the activated electrode sites to the neighboring ones, an opposite potential is applied to the electrodes around the synthesis sites to trigger a reduction electrochemical reaction that produces bases to neutralize the excessive acid.165 Their 12 K microarray chip product has a circular electrode diameter of 44 μm and can synthesize 12 472 oligos. The 90 K chip offers synthesis throughput of 92 918 and oligonucleotide libraries up to 170 nt in length with an error rate of less than 0.5%, and the electrode size is further reduced to 22 μm. On the company's website, it is announced that this is the highest density commercial oligo-synthetic chip at present, with a throughput of 8 million oligos per chip, and the number may reach 200 billion, potentially.166 In addition, the cost is affordable at less than $0.2 per base and the yield of each oligo is up to 1 fmol.166,167 Their chip products are starting to be used in DNA data storage research, which may bring the cost of data storage down to $50 per TB.24,168
Similar approaches were recently studied by Microsoft Research and the University of Washington. They have achieved parallel synthesis of arbitrary DNA sequences at submicron scale, increasing the synthesis density by three orders of magnitude compared with existing products. The electrodes are 650 nm in diameter and the corresponding pitch length is 2 μm. According to the density of electrodes, 2.5 × 10⁷ oligonucleotide strands can theoretically be synthesized in parallel on a 1 cm² area, which meets the electrode density required for data storage speeds of megabytes per second that we estimated in Fig. 5. In addition, the synthesis length is up to 180 nt, three times that of previous electrochemical microarray-based DNA synthesis methods.116,132 Furthermore, the total cumulative error rate, including deletions, insertions and substitutions, ranges from 4% to 8%, which is still within the 15% tolerance of DNA data storage technology employing an error-correcting system.14,100,133,169 They designed a special electrode array (Fig. 10c) to resolve the crosstalk problem among adjacent reactors. The synthesis sites (circular-shaped anodes) are at the bottom of a nanowell structure, where deprotection and coupling steps occur. One anode electrode is surrounded by four cathode electrodes (diamond-shaped) applying an opposite potential. As reported, the deprotection step involved the addition of methanol to acetonitrile in a ratio of 1 : 9, resulting in the generation of alkaline species that consumed the protonic acid at the cathodes and completed the electrochemical half-reaction. The alkaline methoxide anion chemically confines the acid within the synthesis site region effectively, preventing unwanted deprotection at the sites which are supposed to be “off” during a synthesis cycle. Additionally, the deep nanowell also provides a physical barrier to limit the acid cross-contamination.
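The relation between electrode pitch and theoretical site density quoted above can be checked with a short calculation. The sketch below assumes a regular square grid with one synthesis site per pitch × pitch cell; the function name is hypothetical and edge effects are ignored.

```python
def sites_per_cm2(pitch_um: float) -> float:
    """Theoretical sites per square centimetre for a square grid.

    Assumes one synthesis site per pitch x pitch cell (an idealisation).
    """
    if pitch_um <= 0:
        raise ValueError("pitch_um must be positive")
    sites_per_cm = 1e4 / pitch_um  # 1 cm = 10,000 micrometres
    return sites_per_cm ** 2


if __name__ == "__main__":
    for pitch in (2.0, 0.75):
        print(f"pitch {pitch} um -> {sites_per_cm2(pitch):.3g} sites per cm^2")
```

For a 2 μm pitch this gives 2.5 × 10⁷ sites per cm², matching the figure stated in the text. A 750 nm pitch would give roughly 1.8 × 10⁸ under the same idealisation.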
Fig. 10 An overview of the electrochemical DNA synthesis. (a) Schematic diagram of the electrochemical synthesis of nucleotide strand on an electrode. ① A positive potential is applied to the electrode, producing a protonic acid to remove the DMT protecting group and exposing the “–OH” to start the next cycle. ② A free phosphoramidite monomer with a protecting group (DMT) at its 5′ end is coupled to the “–OH” on the electrode/previous nucleotide. ③ The newly formed phosphite backbone linkage is oxidized to the more stable phosphate by an oxidizing agent. ④ The capping reagents seal off “–OH” groups that are not coupled to the monomer, making them unavailable for subsequent reactions. (b) An example of reaction of redox pairs at electrodes during the electrochemical deblock step. The anode undergoes an oxidation reaction to generate protons; the cathode undergoes a reduction reaction that consumes protons. Reproduced from ref. 133 with permission from Copyright 2021 AAAS. (c) (I) Cathodes (diamond-shapes) are connected together (dashed line) while four anodes (circle-shapes) of the same color connected together (solid line). (II) SEM image of a nanoscale electrode array. The 650-nm anodes with the pitch length of 2 μm are sunk in a 200-nm deep well and surrounded by four counter electrodes. (III) A fluorescent image of the array in (II) after parallel synthesis of two different sequences with different fluorophores. The clear demarcation of the different fluorescence proves that the acid generated by the electrodes is strictly confined and demonstrates independently controlled parallel synthesis. Reproduced from ref. 133 with permission from Copyright 2021 AAAS.
Further shrinking electrode feature size and shortening electrode pitch are effective solutions to greatly increase synthesis density and throughput. Typically, electrode sizes based on advanced semiconductor manufacturing technologies can reach submicron or even nanometer scale. It is relatively feasible to prepare ultra-dense micro/nanoelectrode arrays. In other applications, researchers have succeeded in narrowing down the diameter of micro-electrodes to 100–200 nm or even 10 nm, and the pitch of electrodes to 750 nm.170–172
However, the risk that the acid diffuses to neighboring electrodes rises at a higher density of electrodes.165,173 This results in unwanted deprotection on the surface of adjacent electrodes, which increases error rates and reduces synthesis yields. Currently, the biggest technical challenge in electrochemical DNA synthesis is to strictly confine the acid produced around the activated microelectrodes and prevent it from diffusing to the adjacent electrodes. A compromise between the synthesis density and the ion diffusion must be studied before a breakthrough technology that can resolve the conflict appears.116,132,163,174
Although the aforementioned array-based DNA synthesis technologies have improved the throughput by several orders of magnitude over the traditional column-based synthesis, they are not yet ready for applications such as DNA data storage. Each of these technologies faces its own challenges in substantially increasing throughput, reducing costs and speeding up synthesis while ensuring appropriate coupling efficiency: (1) For inkjet printing, the size, complexity and cost of piezo printheads are considered the crucial limitation. (2) The size and spacing of heater units and the precise control of heat are obstacles for thermal synthesis techniques. (3) Photochemical methods struggle with the inherent diffraction and refraction of light. Novel developments in applied physics, such as plasmonics, may bring disruptive technological innovation to overcome the physical constraints on chip size. It has been demonstrated that metallic nanostructures can generate localized surface plasmons when they couple to electromagnetic waves, resulting in thermoplasmonic effects such as localized heating,175 or subdiffraction-limited spatial resolution in optics,176 which might point the way toward further reducing the working units of chips for thermal or photochemical DNA synthesis. (4) Electrochemical synthesis has to overcome the proton acid diffusion and crosstalk effects between electrodes (Table 1). For example, technologists have proposed new ways of generating ions or protons, e.g., by using ion-releasing materials as a working electrode that releases protons directly instead of via the redox reaction in the solution, which may hold potential to further localize the protons and enable even higher synthesis density.177
DNA Sequencing in Space Timeline
Bacteria can be identified by their unique biological blueprint, contained within molecules of deoxyribonucleic acid (DNA). DNA is made up of four base molecules that link together to encode instructions for cell growth and behavior. Identifying the order of the bases using the process of DNA sequencing clues researchers in to the identity of the organisms and how they might behave.
The equipment required for DNA sequencing has historically been expensive and time intensive and has required specialized expertise to operate, limiting its use in space.
Explore how this technology has evolved to where researchers can now sequence DNA aboard the International Space Station:
February 1953 – Francis Crick, James Watson, and Rosalind Franklin discover the double helix structure that makes up DNA.
December 1977 – Frederick Sanger develops the first DNA sequencing method to read the genome of a virus.
July 1995 – First full bacterial genome sequenced (H. influenzae) with shotgun sequencing, which breaks the genome into small fragments that are sequenced individually using the chain termination method, then reassembled.
February 2012 – Oxford Nanopore Technologies debuts the first nanopore sequencer that uses next-generation sequencing (NGS) with the MinION.
April 2016 – As part of the NASA WetLab-2 study, NASA astronaut Jeff Williams performs the first RNA isolation in space from E. coli and collects data on the RNA expression levels in the microbe.
April 2016 – DNA is amplified for the first time aboard station by ESA (European Space Agency) astronaut Tim Peake using the first PCR machine sent to station by company miniPCR.
August 2016 – NASA astronaut Kate Rubins sequences DNA in space for the first time.
August 2017 – NASA astronaut Peggy Whitson combines the miniPCR and MinION, sequencing and identifying the first unknown microbe from the station.
August 2018 – NASA astronaut Ricky Arnold demonstrates Biomolecule Extraction and Sequencing Technology (BEST) by using culture-independent methods to sequence DNA on station for the first time with a “swab to sequencer” method. This process speeds up the rate of sequencing, no longer requiring the time and resources needed to grow the bacteria prior to analysis.
May 2019 – NASA astronaut Christina Koch performs the first CRISPR-Cas9 gene editing on station, using yeast to mimic the effects of space radiation on human DNA.
February 2021 – The crew performed more than 800 microbial sample collections throughout station for the 3DMM experiment. Scientists used DNA sequencing and other analyses to construct the first comprehensive 3D map of bacteria and bacterial products throughout the station.
May 2024 – Genes in Space Molecular Operations and Sequencing (GiSMOS) marks the first true on-site microbial profiling investigation of the space station water system. This is done with targeted gene sequencing, which allows rapid and accurate identification of bacteria and fungi species in a water sample.
DNA storage: The future direction for medical cold data storage
In the current era, we are witnessing an unprecedented surge in information. Between 2010 and 2018, the number of global data center compute instances increased by 550 % [1]. The corresponding energy consumption of these data centers, reaching 200 TW-hours (TWh) in 2018, rivals that of some countries at the time [2]. The International Data Corporation (IDC) predicted a staggering growth in global data volume from 33 Zettabytes (ZB) in 2018 to 175 ZB by 2025 [3]. In 2021, IDC projected a compound annual growth rate of over 20 % for the global data volume [4], estimating a global data burden of over 435 ZB by 2030.
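The 2030 estimate above is consistent with simple compound growth. The sketch below assumes the 175 ZB figure for 2025 as the base and exactly 20 % annual growth (the text says "over 20 %", so this is a lower bound); the function name is hypothetical.

```python
def project_volume(base_zb: float, annual_growth: float, years: int) -> float:
    """Compound a base data volume (in ZB) over a number of years."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return base_zb * (1.0 + annual_growth) ** years


if __name__ == "__main__":
    # Assumed inputs: 175 ZB in 2025, 20 % growth per year, projected to 2030.
    volume_2030 = project_volume(base_zb=175.0, annual_growth=0.20, years=5)
    print(f"Projected 2030 volume: {volume_2030:.1f} ZB")
```

Under these assumptions the projection comes to about 435.5 ZB, in line with the "over 435 ZB" estimate.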
This formidable data challenge to the existing storage capacities is primarily attributed to the gap between rapid data expansion and sluggish advancements in storage media. On one hand, the acceleration of fields such as medical health, precision medicine, bioengineering, artificial intelligence (AI), the Internet of Everything, and 5G communication substantially contributes to large-scale data production. For example, with the rapid development of the Internet of Things (IoT), IDC estimated that about 75 % of the world's population will interact with an IoT device every 18 s, generating a cumulative data volume surpassing 90 ZB by 2025 [3]. On the other hand, current storage media are approaching their limits and struggling to meet the needs of massive data storage. Traditional storage media like hard disk drives (HDD) and solid-state drives (SSD) currently dominate the field of data storage. Despite reported progress in storage density (from 380 Gigabyte per square inch (GB/inch²) to 1100 GB/inch² for HDD, from 200 GB/inch² to 2000 GB/inch² for NAND) and cost reduction (from 0.272 United States Dollars per Gigabyte (USD/GB) to 0.039 USD/GB for HDD, from 3.33 USD/GB to 0.320 USD/GB for NAND) from 2008 to 2016, they still fall short of Moore's Law predictions [5].
The healthcare industry, a paramount aspect of human well-being, is at the forefront of grappling with the limitations of current data storage. The diversity and complexity of medical results and records, along with the rapid advancements in genomics, have led to an unprecedented increase in healthcare data. In 2018, IDC estimated that healthcare data constituted 30 % of the global data volume and would reach 36 % by 2025 [3]. This growth rate surpasses that of manufacturing by 6 %, financial services by 10 %, and media and entertainment by 11 % [3]. Due to medical and legal requirements, healthcare data often necessitates long-term storage, consuming substantial resources, including physical storage space, manpower, materials, and daily maintenance. Therefore, there is an urgent need for an innovative, high-density storage medium to address the burgeoning data crisis in the field of healthcare.
Currently, the exploration of new storage media primarily revolves around two key tracks. One is silicon-based storage media, such as SSD and optical tapes [6]. The other is carbon-based storage media, encompassing specific small organic molecules [7], peptide sequences [8], synthetic metabolomes [9], and DNA. Among these options, we perceive DNA as the most promising candidate for medical data storage. DNA carries the genetic information essential to understanding the physiological and pathological conditions of every organism. Therefore, researchers have never ceased to unlock DNA's mysteries, discovering its capacity to store data using the permutation and combination of the four bases: Adenine (A), Guanine (G), Cytosine (C), and Thymine (T). This finding positions DNA not only as a substantial data source in healthcare but also as a potential solution for medical data storage.
Given the considerable prospects of DNA storage for medical data storage, we aim to review the state-of-the-art technologies related to the DNA storage workflow, explore their applications in medical data storage, and endeavor to design a DNA storage system based on existing technologies in the near future.
improving its resistance to UV radiation and heat in 2013 [19]. In 2018, Organick et al. further advanced in vitro DNA storage capacity, successfully storing and retrieving over 200 Megabytes (MB) of data using 13 million DNA oligonucleotides [20]. In 2023, we witnessed the launch of the first commercial in vitro DNA storage device, a 1000 USD DNA storage card capable of storing 1 KB of data [21]. It is designed to preserve precious text with unyielding reliability, maximum security, eco-friendliness, and timeless compatibility. Last year, Hou et al. further optimized the CRISPR/Cas system and developed the "Cell Disk", the first in vivo DNA storage system capable of random reading and rewriting as flexibly as modern hard drives [22]. Most excitingly, on March 12th, 2024, the DNA Data Storage Alliance, which includes major industry players like Illumina, Microsoft, and Twist Bioscience, introduced the first DNA data storage specifications [23]. The specifications are structured around two main components: Sector Zero and Sector One. Sector Zero contains essential information to identify the vendor responsible for synthesizing the DNA and the Coder-Decoder (CODEC) used for coding the data. This sector ensures that any DNA storage system can recognize the manufacturing company and the coding method, facilitating compatibility across different systems. Sector One includes metadata about the contents, a file table, and parameters required for data transfer to a sequencer. This design ensures that DNA-stored data can be effectively accessed and read, standardizing the retrieval process. The specifications aim to foster an interoperable DNA data storage ecosystem, facilitating cooperation among various systems and paving the way for broader commercialization of DNA-based data storage solutions. These historical milestones in the development of DNA storage are shown in Fig. 1.
Fig. 1. Historical milestones in the development of DNA storage.
The fundamental principle of DNA storage involves converting digital data into nucleotide sequences by coding the binary digits "0" and "1" from computers into DNA bases (A, T, C, G) during synthesis. The synthesized DNA is then stored under specific conditions for long-term preservation. Subsequently, DNA sequencing technology is used to read the stored sequences, allowing the data to be decoded back into its original digital format for computer recognition. At present, the basic workflow of DNA storage has been established, including six processes: data coding, DNA synthesis, DNA preservation, DNA acquisition, DNA sequencing, and data decoding (Fig. 2). Although several research teams have successfully designed DNA storage systems capable of executing the entire workflow, these systems remain far from practical application [24,25].
Fig. 2. Illustration of the DNA storage workflow. Digital data, such as images, videos, texts, and audio, is encoded into binary code, represented by "0" and "1"; and then converted into nucleotide sequences, including Adenine (A), Guanine (G), Cytosine (C), and Thymine (T). Subsequently, these nucleotide sequences are synthesized into DNA molecules for storage. To retrieve the original digital data, the data-embedded DNA is extracted, sequenced, and converted back into binary code for computer recognition.
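The coding and decoding steps of the workflow can be illustrated with the simplest possible mapping, two bits per base. This is a minimal sketch and not the scheme of any system cited here: the particular assignment (00→A, 01→C, 10→G, 11→T) is an arbitrary assumption, and biochemical constraints such as homopolymer runs and GC content, as well as error correction, are deliberately ignored.

```python
# Hypothetical two-bits-per-base mapping chosen only for illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Convert bytes to a nucleotide string, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(sequence: str) -> bytes:
    """Convert a nucleotide string produced by encode() back to bytes."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    message = b"Hi"
    strand = encode(message)
    print(strand)          # CAGACGGC
    print(decode(strand))  # b'Hi'
```

A practical system would add addressing, constraint filtering and error-correcting codes on top of such a mapping before synthesis.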
algorithm for DNA storage by simply mapping "0" to A/C and "1" to G/T, allowing for random substitution between A and C, as well as between G and T [17]. The design is based on restricted basic mapping relationships but does not adhere to biological restrictions. Other examples include the Goldman coding algorithm [26], Grass coding algorithm [27], and Blawat coding algorithm [63]. To address biological restrictions, coding algorithms with biological filtering have been developed, such as the Base64 coding algorithm [24], Fountain code [64], Yin-Yang code [65] and DNA-AEon [66]. These algorithms perform sequence filtering under biochemical constraints after basic mapping, ensuring the DNA sequence satisfies the preset biochemical constraints. Additionally, to prevent data loss during DNA synthesis, preservation, and sequencing, error-correcting codes (ECCs) such as Reed-Solomon Code [67], Hamming Code [68], and Raptor Code [69] are applied. Although there are many types of ECCs, most do not fully address the error types characteristic of DNA storage, such as substitutions, deletions, and insertions of bases. Fortunately, researchers have developed various ECCs specifically tailored for DNA storage in recent years. Press et al. designed an ECC for DNA storage called Hash Encoded, Decoded by Greedy Exhaustive Search (HEDGES) [70]. HEDGES corrects DNA errors by encoding data with a hash-generated pseudo-random sequence, allowing an algorithm to detect and fix insertions, deletions, or substitutions during decoding. It converts stubborn indels into substitutions, which a backup Reed-Solomon code then repairs across strands. Tested on 3.5 % error-prone DNA, it recovers >97 % data and scales to exabyte storage with <10 % errors. Song et al. developed a de Bruijn Graph-based DNA sequence reconstruction algorithm (DBGPS) capable of recovering the original DNA sequence without error from fragments with substitutions, insertions, and deletions [71].
The error-correction mechanism breaks DNA strands into small overlapping fragments (k-mers), filters out rare/noisy fragments (likely errors), then reconstructs the original sequence by finding the most probable path in a graph. This method reliably recovers data even from heavily damaged DNA (e.g., 96.3 % success after 70 days at 70 °C) by focusing on high-confidence fragments. Besides error correction algorithms, the decoding methods for ECCs have also been improved. Ding et al. developed a soft-decision decoding software called Derrick to improve error-correcting capability in DNA digital storage [72]. Compared to traditional hard-decision strategies, Derrick doubles the error correction capability of Reed-Solomon codes and reduces the probability of uncorrectable errors by several orders of magnitude. These advanced ECCs have significantly enhanced the error correction performance of DNA storage systems, providing robust support for medical data recovery.
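The k-mer filtering idea behind de Bruijn graph-based reconstruction can be sketched in a few lines. The toy example below is not the DBGPS implementation: it assumes a known starting k-mer and target length, uses a simple count threshold to discard rare k-mers, and extends greedily only while the next base is unambiguous. All names and the example reads are hypothetical.

```python
from collections import Counter


def kmers(sequence: str, k: int) -> list:
    """Return all overlapping k-mers of a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def solid_kmers(reads: list, k: int, min_count: int) -> set:
    """Keep k-mers seen at least min_count times; rare ones are likely errors."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, count in counts.items() if count >= min_count}


def reconstruct(start: str, solid: set, length: int) -> str:
    """Greedily extend from a starting k-mer while exactly one successor exists."""
    k = len(start)
    sequence = start
    while len(sequence) < length:
        suffix = sequence[-(k - 1):]
        candidates = [base for base in "ACGT" if suffix + base in solid]
        if len(candidates) != 1:
            break  # dead end or ambiguous branch; a real decoder would search paths
        sequence += candidates[0]
    return sequence


original = "ACGTTGCAAGCT"
reads = [
    original, original, original,
    "ACGTTGCTAGCT",  # one substitution
    "ACGTGCAAGCT",   # one deletion
]

if __name__ == "__main__":
    solid = solid_kmers(reads, k=4, min_count=3)
    print(reconstruct("ACGT", solid, len(original)))  # ACGTTGCAAGCT
```

In this toy case the erroneous k-mers each appear once and are filtered out, so the greedy walk recovers the original strand; real data need a more robust path search and integrity checks.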
The above methods outline the development process of coding techniques for DNA storage (Table 1). From simple mapping to biologically constrained mapping, and from basic data error correction to specialized biological error correction, the DNA storage coding process is becoming increasingly sophisticated and efficient. However, with the above methods, the coding density for DNA storage remains around 2 bits/nt. Therefore, specialized coding algorithms have been devised. Wu et al. developed an end-to-end DNA coding method for text and images, achieving logical storage densities exceeding 2 bits/nt for images and 3 bits/nt for text [73]. Notably, the example of the highest density reached 3.83 bits/nt for a 250 MB image. Furthermore, specialized DNA coding algorithms for video [74], images [75–77], English text [78], and Chinese text [79] have been developed. Recently, with the help of deep learning, Sun et al. proposed a more efficient paradigm for compressing digital data to DNA while excluding arbitrary sequence constraints [80]. Both standalone recurrent neural networks (RNNs) and pre-trained language models were used to extract intrinsic patterns in the data and generate probabilistic portrayals. These were then transformed into constraint-free nucleotide sequences using a hierarchical finite state machine. Utilizing these methods, a 12 %–26 % improvement in compression ratio was achieved for various data types, directly translating to up to a 26 % reduction in DNA synthesis costs. Zhang et al. utilized another neural network, a convolutional neural network (CNN), to train on 6507 randomly selected images from the ImageNet database and developed a coding algorithm for those images with a density of 23.72 bits/nt, approximately 10 times higher than the theoretical density [81].
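The densities quoted above can be put in context with a short calculation. With four bases, an uncompressed mapping can carry at most log2(4) = 2 bits per nucleotide, so logical densities above that value reflect compression of the source data before mapping. The sketch below uses hypothetical function names and simply compares reported figures against that limit.

```python
import math


def max_bits_per_nt(alphabet_size: int = 4) -> float:
    """Upper bound on bits per nucleotide for a direct, uncompressed mapping."""
    return math.log2(alphabet_size)


def logical_density(payload_bits: int, nucleotides: int) -> float:
    """Logical density: original payload bits divided by nucleotides used."""
    if nucleotides <= 0:
        raise ValueError("nucleotides must be positive")
    return payload_bits / nucleotides


if __name__ == "__main__":
    limit = max_bits_per_nt()
    for label, density in (("3.83 bits/nt", 3.83), ("23.72 bits/nt", 23.72)):
        print(f"{label} is {density / limit:.2f} times the 2 bits/nt limit")
```

For example, 16 payload bits written into 8 nucleotides gives 2 bits/nt, while the 23.72 bits/nt figure corresponds to about 11.9 times the direct-mapping limit.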