Data volumes are growing exponentially in the information age, and traditional storage devices can no longer meet the demands of big-data storage and long-term preservation of archived data. DNA digital storage (DDS) promises to address these issues. As a new storage technology, DDS uses deoxyribonucleic acid (DNA) molecules to store any form of digital information, such as files, pictures, and videos. DDS has significant advantages, including high storage density, long storage life, and parallel access.

DDS involves two processes: writing and reading. When writing, the DDS system encodes information into nucleotide sequences and then synthesizes, replicates, and stores the corresponding DNA molecules. To read the information back, the DNA molecules are sequenced, the sequencing reads are assembled into consensus sequences, and the consensus sequences are decoded. Because the synthesis, polymerase chain reaction (PCR) amplification, storage, and sequencing steps may all introduce errors, both random and systematic, error correction is vital to the accuracy of DDS, and error-correcting strategies must be developed to ensure that the restored information is accurate.

In the field of DNA digital storage, nearly all error-correction methods rely on information redundancy, either physical or logical, to ensure the correctness of the restored information. Traditional methods based on physical redundancy explicitly copy the DNA molecules one or more times and expect every piece of information to appear correctly in the majority of copies during decoding. Although physical redundancy can resolve random errors, systematic errors are usually beyond its capability; these include sequence dropout caused by stochastic PCR bias or synthesis bias, as well as strand breaks, rearrangements, and indels arising from PCR amplification and long-term storage. Therefore, state-of-the-art DDS systems mostly use logical redundancy in the form of error-correcting codes (ECC) instead. Logical
redundancy provides robust error correction, effectively handling both random and systematic errors. Moreover, it offers the significant advantage of substantially reducing sequencing costs. However, current methods that apply ECC to DDS suffer from a trade-off between error-correcting capability and the proportion of redundancy.

Our research introduces a soft-decision decoding approach into DDS by proposing a DNA-specific error prediction model and a series of novel strategies that improve the error-correcting capability of ECC without increasing the proportion of redundancy. We demonstrate the effectiveness of our approach through a proof-of-concept DDS system based on the Reed-Solomon (RS) code, named Derrick. Derrick shows a significant improvement in error-correcting capability without additional redundancy in both in vitro and in silico experiments. Our main innovations and results are as follows:

1. A soft-decision decoding strategy for DDS was proposed and refined. We exploited the uneven distribution of errors in DNA sequences: key error-related information, such as error positions and true values, can be predicted from the error-enriched patterns detected in the consensus sequence. This information gives the ECC an opportunity to recover blocks whose error counts exceed its original capability.

2. As a proof of concept, we developed a novel ECC system called Derrick, which uses RS codes. Derrick offers an efficient encoding and decoding process that requires only a single line of code to run. During encoding, we incorporated RS error-correcting codes and CRC64. During decoding, we employed the RS soft-decision decoding algorithm to correct errors. Additionally, we used a shift algorithm to determine the error type after decoding, thereby addressing the position offsets that result from insertion or deletion
errors. Moreover, to ensure the reliability of the decoding results, we deployed a backtracking verification algorithm that performs a secondary validation. This enabled us to backtrack and reapply error correction in cases where RS code collisions may have occurred.

3. Derrick’s performance in in silico tests. We prepared a larger file library, 11.7 MB in total, for the in silico tests; it contained 6 files of different types, such as videos, photos, and executable files. The files were merged, encoded at various code rates, and then subjected to sequencing simulation for currently popular sequencing technologies, including PacBio CLR, ONT, and Illumina. Specifically, the Illumina and ONT datasets were built with RS(255,211), RS(255,235), and RS(255,241), whereas the PacBio CLR datasets covered a wider range of RS codes, from RS(255,201) to RS(255,241) in steps of 4. The results on the simulated datasets demonstrated that, compared with hard-decision decoding, Derrick reduced the number of failed matrices from hundreds to single digits, and in most cases to 0. Furthermore, statistical predictions indicated that Derrick could increase the storage volume by 2 to 8 orders of magnitude compared with traditional hard-decision decoding.

4. Derrick’s performance in in vitro tests. We generated a megabyte-scale real dataset by combining DNA sequences from an E. coli genome and 18 Covid-19 genomes. The dataset then underwent Derrick’s processing steps, including compression, randomization, redundancy addition, index and primer addition, synthesis, sequencing, and subsampling. Sequencing was performed not only on the commonly used Illumina platform but also on the Nanopore platform, to take the portability of DDS into account. Measured by the number of solvable errors, the in vitro experiments with ONT sequencing showed that the soft-decision strategy could correct twice as many errors as the hard-decision strategy. The performance evaluation of Derrick
was also conducted on a per-matrix basis: the proportion of failed matrices after each decoding experiment was recorded and analyzed. For the datasets that could not be decoded by hard-decision decoding, Derrick either corrected all matrices or reduced the proportion of failed ones by 7- to 229-fold. Furthermore, a statistical model was constructed to evaluate Derrick’s performance based on the probability of an uncorrectable error. Derrick reduced the probability of an uncorrectable error in RS codes by approximately 10- to 500,000-fold on the Illumina and ONT datasets. The best improvement was achieved on an Illumina dataset with a sequencing depth of 10× and a code rate of 0.83, where the maximum storage volume increased from 2.77E+22 bytes to 1.39E+28 bytes, reaching the brontobyte scale. In addition, we evaluated the accuracy and sensitivity of the error prediction model, which achieved an average accuracy of up to 76.7%. We also found that the size of the prediction set was a critical factor influencing the accuracy and sensitivity of error correction. In our tests, the best performance was achieved when the prediction set size was equal to or larger than the size of the RS redundancy.

In conclusion, this study developed a highly accurate error prediction model for DNA digital storage channels, integrated it with ECC decoding, and introduced soft-decision decoding into DDS. The Derrick algorithm was devised for encoding and soft-decision decoding based on the RS code. In both simulations and real-world experiments, Derrick enhanced error-correction capability without compromising information density. This research addresses the problems of existing approaches in the field, which either rely on hard-decision decoding with limited error-correction capability or employ soft-decision decoding with lower accuracy and increased complexity. Derrick overcomes the limits that added redundancy imposes on error correction
capabilities, resulting in a substantial improvement in error correction through soft-decision decoding. This breakthrough enables DDS to be applied at larger data storage scales while maintaining high fidelity, thereby paving the way for the industrialization and widespread adoption of DNA digital storage and offering promising prospects for the future of data storage technologies.
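
As a rough illustration of why predicted error positions help, the following sketch applies the classical Reed-Solomon decoding bound to the RS(255,k) codes mentioned above. It is not Derrick's algorithm. It assumes, for illustration only, that every correctly predicted error position can be handed to the decoder as an erasure; the function names `rs_budget` and `is_decodable` are hypothetical. Under the classical bound, an RS(n,k) block is decodable when twice the number of errors at unknown positions plus the number of erasures at known positions does not exceed n - k.

```python
def rs_budget(n: int, k: int) -> dict:
    """Redundancy, code rate, and classical correction limits of RS(n, k)."""
    if not 0 < k < n:
        raise ValueError("require 0 < k < n")
    parity = n - k
    return {
        "parity_symbols": parity,
        "code_rate": k / n,
        # Hard-decision limit: error positions are unknown to the decoder.
        "max_errors_unknown_pos": parity // 2,
        # Limit when every error position is known in advance (erasures).
        "max_erasures_known_pos": parity,
    }


def is_decodable(n: int, k: int, errors: int, erasures: int) -> bool:
    """Classical RS bound: 2 * errors + erasures <= n - k."""
    return 2 * errors + erasures <= n - k


if __name__ == "__main__":
    # The three code parameters used for the Illumina and ONT datasets.
    for k in (211, 235, 241):
        b = rs_budget(255, k)
        print(
            f"RS(255,{k}): rate={b['code_rate']:.2f}, "
            f"unknown-position limit={b['max_errors_unknown_pos']}, "
            f"known-position limit={b['max_erasures_known_pos']}"
        )
```

For RS(255,211), which has a code rate of about 0.83, the sketch gives 44 parity symbols, a limit of 22 errors when positions are unknown, and a limit of 44 when all positions are known. This factor of two is consistent with the doubling reported for the ONT experiments, although in practice predicted positions that are wrong also consume part of the redundancy budget.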