With the exponential growth in the volume of information generated and the emerging need to store data for prolonged periods of time, there is a need for a storage medium with high capacity, high storage density, and the ability to withstand extreme environmental conditions. DNA emerges as a prospective medium for data storage because of these features. Diverse encoding models for reading and writing data onto DNA, codes for encrypting data which address issues of error generation, and approaches for developing codons and storage styles have been developed in the recent past. DNA has also been identified as a potential medium for secret writing, which opens the way towards DNA cryptography and steganography. DNA utilized as an organic memory device, along with big data storage and analytics in DNA, has paved the way towards DNA computing for solving computational problems. This paper critically analyzes the various methods used for encoding and encrypting data onto DNA while identifying the advantages of each scheme and its capability to overcome previously identified drawbacks. Cryptography and steganography techniques are analyzed critically, and the limitations of each method are identified. This paper also identifies the advantages and limitations of DNA as a memory device and of its memory applications.
The journey of data storage began with bones, rocks, and paper. It then moved to punched cards, magnetic tapes, gramophone records, floppy disks, and so forth. Afterwards, with the development of technology, optical discs including CDs, DVDs, and Blu-ray discs, as well as flash drives, came into operation. All of these are subject to decay. Being nonbiodegradable, these materials pollute the environment and also release large amounts of heat while consuming energy for operation [
With the employment of digital systems for the generation, transmission, and storage of information, there arises a need for active and ongoing maintenance of digital media. With the massive amounts of digital data that have to be stored for future use, storing such overwhelming amounts of data becomes a problem. The demand for data storage is increasing rapidly. The total information storage of the entire world was around 2.7 ZB in 2012. Every year the storage requirement increases by 50% [
Taking into account the way fossil bones preserve genetic material for ages, researchers turned their attention towards using deoxyribonucleic acid (DNA) as a storage medium.
DNA has an enormous storage capacity. Castillo states that all the information on the entire Internet could be held in a device smaller than one cubic inch [
DNA can withstand a broad range of temperatures (reported as −800°C–800°C). Its power usage is a million times more efficient than that of a modern personal computer. Additionally, it offers more storage options, as it stores data in a nonlinear structure, unlike most media, which store data in a linear structure. DNA promises more options to improve latency and extraction of data, as it allows data to be read in both directions. The fact that DNA is invisible to the human eye helps keep it secure and makes it hard for living organisms to harm [
This paper critically analyzes the techniques used for writing data into and reading data from DNA. Many encoding models have been used to encode data into DNA. The first section of this review provides a background for digital data storage in DNA, analyzing the suitability of DNA as a storage medium. The second section analyzes the results and methods of the models used for encoding data into DNA, while critically analyzing the manner in which each model overcomes the drawbacks identified in prior models. The first subsection of the second section highlights the function of DNA as a storage device. Firstly, the evolution and development of encoding models in the recent past are summarized. Secondly, the encryption schemes used for encrypting data in DNA are discussed in detail, identifying their significant features, advantages, and associated drawbacks. Thirdly, several approaches used for designing codons are discussed. Finally, basic data storage styles are analyzed critically by identifying the advantages of each over the other. The second subsection of Section
Two long strands of nucleotides together compose a DNA molecule. Each nucleotide contains one of four bases (A: adenine, G: guanine, C: cytosine, and T: thymine), along with a deoxyribose sugar and a phosphate group. The DNA molecule has a double-stranded structure, comprising two single-stranded chains of nucleotides bonded by A=T double hydrogen bonds or C≡G triple hydrogen bonds. These two single strands, which are held together by hydrogen bonds, are referred to as complementary strands. A single-stranded DNA runs between two ends, 5′ (5 prime) and 3′ (3 prime) [
Information is usually read in base pairs, as DNA is double stranded. Because it is a challenge to read consecutive G-C pairs, new DNA synthesis techniques are changing the way in which bases are grouped for encoding. The traditional base pairs are A-T and G-C. Newer technologies instead group the bases as A-C and G-T: A or C is employed to code 0 while G or T is used for 1 [
Schematic representation of storing data on DNA.
Encoding was achieved by placing one nucleotide for each run of repeated 1 or 0 bits, for instance, 100101 = CTCCC and 10101 = CCCCC. In decoding, “C” could be decoded as either a “1” or a “0,” because only the number of repeated bits was taken into consideration at the time of encoding. For instance, CTCCC could be decoded as 011010 or 100101. This encoding scheme was therefore inaccurate, as it was not uniquely decodable [
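The ambiguity described above can be reproduced with a minimal run-length sketch. The table mapping run lengths to bases is an assumption for illustration: only C (a run of one) and T (a run of two) can be inferred from the examples in the text, while A and G for longer runs, and both function names, are hypothetical.

```python
from itertools import groupby

# Assumed table: a run of n identical bits is written as a single base.
# Only the entries for C and T are supported by the examples in the text.
RUN_TO_BASE = {1: "C", 2: "T", 3: "A", 4: "G"}


def run_length_encode(bits: str) -> str:
    """Write one base per run of identical bits, discarding the bit value."""
    bases = []
    for _, run in groupby(bits):
        length = len(list(run))
        if length not in RUN_TO_BASE:
            raise ValueError(f"run of {length} bits has no base in this sketch")
        bases.append(RUN_TO_BASE[length])
    return "".join(bases)


def run_length_decode(dna: str, first_bit: str) -> str:
    """Decoding needs the value of the first bit, which the DNA does not carry."""
    base_to_run = {base: length for length, base in RUN_TO_BASE.items()}
    bit = first_bit
    out = []
    for base in dna:
        out.append(bit * base_to_run[base])
        bit = "1" if bit == "0" else "0"  # runs alternate between 0 and 1
    return "".join(out)


if __name__ == "__main__":
    print(run_length_encode("10101"))
    # Two different bit strings collapse onto the same DNA string.
    print(run_length_encode("100101") == run_length_encode("011010"))
```

Because the encoder keeps only run lengths, a bit string and its bitwise complement always produce the same DNA, which is exactly why the code is not uniquely decodable without extra information.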
These mutations in turn caused changes in the sentence. When the genes were decoded back into Morse code and then back into English, the original sentence had changed [
These two projects laid the foundation for encoding in DNA but were inaccurate: the Microvenus code was not uniquely decodable, and in the Genesis project the original content was altered by mutations.
Limbachiya and Gupta used microdots to store data. This approach was secure because of the size of the microdot: even if an adversary identified the microdot, it would be extremely difficult to read the data without knowledge of the primer sequence. The limitation was the scalability of the data encoded within the limited size of microdots, as only 136 bits of data could be encoded [
To overcome this limitation Kac proposed information DNA (iDNA) [
The main advantage of PCR based encoding models is high security, as the recipient must know the encryption key and the primers. The drawbacks of PCR methods are the need for PCR itself, the need for knowledge of the primers, widespread experimental obstacles, and practical problems. Insertion of errors in the template region makes recovering the encoded data infeasible. Moreover, data breakage could occur in the encoding and decoding procedures due to human error [
As a result, researchers turned their attention towards developing PCR independent data encoding models.
Yachie et al. proposed a method to copy and paste data within the genome sequence of an organism to achieve flexibility of storage and robustness of data inheritance, most appropriate for using DNA as trademarks/signatures of living modified organisms (LMOs) and as valuable heritable media [
Through this approach it is possible to retrieve data from living DNA without additional material such as template DNA; it needs only the sequencing of the complete genome.
Data retrieval by PCR based amplification is vulnerable to breakage at either of the DNA annealing sites, which are crucial for reading even parts of the encoded data [
Advantages of this approach are the greater speed and lower cost of reading DNA data and the lower cost of synthetic DNA. This approach ensures higher durability and data inheritance, as multiple copies of the data are available and each copy can be used to detect and correct errors in the other copies [
Data could be lost during evolution. To prevent this, Yachie et al. introduced different nucleotide sequences encoding the same data through multiple data compression paths [
The disadvantage of this approach is that multiplication of cassettes leads to redundant volumes. Parity effects cost a certain volume of data sequence; at the same time the data recovery rate is fragile and is proportional to the data breakage that occurs through long-range DNA deletions. The positions of the data breakages could be identified easily from the alignment results, although the data were not recoverable [
The main downside of this approach was the size limit of the cassette oligonucleotides used to encode the message. If the size exceeds a certain limit, the sequence may appear by chance in the host genome. In addition, sequencing of the entire genome is required to retrieve the data. Thereafter Ailenberg proposed an improved Huffman coding method in which nucleotides were used efficiently and specific primers were used for different types of files. Improved Huffman coding defines DNA codes for the entire keyboard, for unambiguous information coding. It is based on the construction of a plasmid library with specially designed primers embedded along with the message for fast retrieval. A good encoding scheme should use nucleotides economically; here about 3.5 nucleotides are used per character [
Tabatabaei Yazdi et al. encoded the Wikipedia pages of six universities, selected parts of the stored data, and edited the text written into DNA for three of the universities. Shifting from current read-only methods to rewritable methods requires addressing the drawbacks mentioned below [ The entire content needs to be rewritten in order to edit in a compressed domain. Fourfold coverage is used to ensure reliability of the information, which makes rewriting much more complicated because rewriting one base requires modifying four locations. The addressing method is used only to determine the position of a read and does not support selective reads.
This method can be effectively utilized for accessing random data sections and also for storing frequently updated data for which the editing history needs to be kept [
Blaum et al. used DNA sequences containing special address strings to access information randomly. These DNA sequences are also encoded with error correction mechanisms. Mutually uncorrelated addresses are designed so that they satisfy an error-control running digital sum constraint [
gBlock [
High cost is one of the drawbacks of this approach. Instead of using Sanger sequencing employing next generation sequencing methods [
The primary advantages of this approach over prior encoding approaches include the use of a one-bit-per-base representation (G or T for 1 and A or C for 0). This makes it possible to avoid sequences that are difficult to read or write because they contain repeats, secondary structures, or extreme GC content. By splitting the bit stream into data blocks, Church et al. avoid constructing long DNA, which is challenging to assemble when reading the information. Multiple copies of each oligo are synthesized, stored, and sequenced in order to avoid sequence-verified constructs and cloning. Each copy can correct the errors in other copies, as the errors are almost never coincident. With this approach the cost incurred is ~100,000-fold less than that of first generation technologies for encoding and decoding information [
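A minimal sketch of a one-bit-per-base mapping shows how the freedom to choose between two bases for each bit can be used to avoid repeated bases. The tie-breaking rule used here (take the first candidate that differs from the previous base) and the function names are assumptions for illustration, not the selection rule used by Church et al.

```python
ZERO_BASES = "AC"  # either base encodes bit 0
ONE_BASES = "GT"   # either base encodes bit 1


def encode_bits(bits: str) -> str:
    """Encode one bit per base, never repeating the previous base."""
    out = []
    for bit in bits:
        if bit not in "01":
            raise ValueError("input must be a string of 0s and 1s")
        choices = ONE_BASES if bit == "1" else ZERO_BASES
        previous = out[-1] if out else ""
        # Assumed rule: prefer the first candidate unless it would repeat.
        out.append(choices[0] if choices[0] != previous else choices[1])
    return "".join(out)


def decode_bases(dna: str) -> str:
    """Decoding ignores which of the two candidate bases was chosen."""
    return "".join("1" if base in ONE_BASES else "0" for base in dna)


if __name__ == "__main__":
    message = "0000111100110101"
    dna = encode_bits(message)
    print(dna)
    print(decode_bases(dna) == message)
```

The price of this flexibility is density: each base carries one bit rather than the two bits that four symbols could carry in principle.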
Challenges associated with this scheme include the cost, which is currently prohibitive, and the time needed for reading and writing DNA. However, the costs of synthesizing and sequencing DNA have been dropping exponentially, by 5- to 12-fold per year, which is much faster than for electronic media [
Church et al. did not address compression, parity checks, redundant encodings, or error rate. Attention therefore has to be focused on error correction in order to improve density and safety [
Hence, to overcome this issue, Goldman used an improved base 3 Huffman coding instead of the one-bit-per-base representation [
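A base 3 stream lends itself to a rotating code in which each trit selects one of the three bases that differ from the previous base, so that no base is ever repeated. The sketch below covers only that trit-to-nucleotide step. The cyclic ordering A, C, G, T, the starting base, and the function names are illustrative assumptions, and the Huffman step that produces the trits is omitted.

```python
BASES = "ACGT"


def trits_to_dna(trits, previous="A"):
    """Map each trit (0, 1, 2) to a base that differs from the previous base."""
    out = []
    for trit in trits:
        if trit not in (0, 1, 2):
            raise ValueError("trits must be 0, 1, or 2")
        # Step 1, 2, or 3 places around the cycle A -> C -> G -> T -> A,
        # which can never land on the previous base itself.
        base = BASES[(BASES.index(previous) + 1 + trit) % 4]
        out.append(base)
        previous = base
    return "".join(out)


def dna_to_trits(dna, previous="A"):
    """Invert trits_to_dna; a repeated base signals a corrupted strand."""
    trits = []
    for base in dna:
        trit = (BASES.index(base) - BASES.index(previous) - 1) % 4
        if trit == 3:
            raise ValueError("repeated base cannot come from this code")
        trits.append(trit)
        previous = base
    return trits


if __name__ == "__main__":
    data = [0, 0, 0, 0, 2, 1, 1, 0, 2]
    strand = trits_to_dna(data)
    print(strand)
    print(dna_to_trits(strand) == data)
```

Because a repeated base can never be produced by the encoder, the decoder can flag any homopolymer it encounters as an error rather than silently misreading it.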
At present many lossless compression algorithms are in use, but they still require ample context information for encoding purposes.
Burrows-Wheeler transform (BWT) [
The following three steps are performed to generate context information through this encoding scheme. Text file compression: the Huffman coding method was employed for this; the output is a set of binary sequences. Mapping function: two nucleotide base pairs were employed to represent four binary bits; the basis for choosing 4 bits for 3 nucleotides is that the output of Huffman coding is a hexadecimal value. Encryption: to maintain security, the encoded message ought to be encrypted; a One Time Pad (OTP), which requires a random key of the same length as the encoded message, is employed for this.
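The mapping and encryption steps listed above can be sketched as follows. This is a simplified illustration under stated assumptions: the compression step is omitted, the two-bits-per-base table is a hypothetical mapping chosen for the example rather than the mapping function of the scheme, and the pad is applied to bytes before they are written as bases.

```python
import secrets

# Assumed mapping function for this sketch: two bits per base.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def otp_xor(data: bytes, key: bytes) -> bytes:
    """One Time Pad: XOR with a random key of the same length as the data."""
    if len(key) != len(data):
        raise ValueError("the pad must be exactly as long as the message")
    return bytes(d ^ k for d, k in zip(data, key))


def bytes_to_dna(data: bytes) -> str:
    """Write each byte as four bases."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bytes(dna: str) -> bytes:
    """Read four bases back into one byte."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    message = b"small text file"
    pad = secrets.token_bytes(len(message))  # used once, then discarded
    strand = bytes_to_dna(otp_xor(message, pad))
    print(strand)
    print(otp_xor(dna_to_bytes(strand), pad) == message)
```

Applying the same pad a second time restores the message, which is why the pad must be shared in advance and never reused.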
This approach showed that maximum compression efficiency can be achieved by performing the transformation prior to compression, thereby reducing the number of nucleotides. This directly reduces the cost.
This scheme is most significant for military applications and for signatures of living modified organisms, as these are small messages which need to be stored for an extensive period of time [
This research did not implement the biological protocols needed to insert the sequence into the genome of bacteria.
Future work includes modifying the transformation algorithm and designing other mapping functions for encoding nucleotide sequences.
Basically, three codes have been used in the past to store information in DNA. All of these codes generally assume that an alphabetic language is being encoded in DNA. Although most research considered English as the alphabetic language, the codes could also have been used for shorthand, which is a writing scheme for phonetics.
For a code to be optimal it should satisfy two criteria. It should use DNA (nucleotides) economically, mainly because synthesizing long oligonucleotides is an expensive process, though replicating them appears to be comparatively economical. It should also allow the message to be reconstructed after the data have been encoded.
Although it is not considered essential, it would be a tremendous advantage if the coding scheme offered some error detection and protection mechanism. This feature is not considered vitally important because there are other mechanisms for addressing the issue, such as using multiple copies of the DNA. Written language also inherently contains self-correcting mechanisms, which makes error detection and correction less essential [
Unambiguity of the code is achieved when there is only one way in which the encrypted message can be read once the starting point is given [
Drawbacks of Huffman coding include its unsuitability for numbers and symbols. This is mainly because the frequency with which these symbols appear depends heavily on the text, so they cannot be included when formulating the Huffman code. Secondly, it is not suitable for long term storage, because codons of different lengths assembled together might not reveal a pattern. Future generations might therefore not be able to detect the significance of the pattern [
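The idea of a Huffman code over the four bases can be illustrated with a generic base-4 Huffman construction, in which frequent characters receive short codons and rare characters receive longer ones. This is a textbook construction offered as a sketch, not the specific codon table of any published DNA code; the function names and the branch order A, C, G, T are assumptions.

```python
import heapq
from collections import Counter
from itertools import count

BASES = "ACGT"


def build_dna_huffman(text):
    """Return {character: codon} for a base-4 Huffman code over A, C, G, T."""
    order = count()  # tie-breaker so heap entries never compare tree nodes
    heap = [(freq, next(order), char)
            for char, freq in sorted(Counter(text).items())]
    # Pad with zero-weight dummy leaves (None) so every merge takes four nodes.
    while (len(heap) - 1) % 3 != 0:
        heap.append((0, next(order), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(4)]
        weight = sum(item[0] for item in group)
        heapq.heappush(heap, (weight, next(order), [item[2] for item in group]))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, list):      # internal node: one branch per base
            for base, child in zip(BASES, node):
                walk(child, prefix + base)
        elif node is not None:          # real leaf; dummy leaves are skipped
            codes[node] = prefix or BASES[0]

    walk(heap[0][2], "")
    return codes


def encode(text, codes):
    return "".join(codes[char] for char in text)


def decode(dna, codes):
    """Prefix-free codons can be read greedily from a known starting point."""
    lookup = {codon: char for char, codon in codes.items()}
    out, current = [], ""
    for base in dna:
        current += base
        if current in lookup:
            out.append(lookup[current])
            current = ""
    if current:
        raise ValueError("trailing bases do not form a complete codon")
    return "".join(out)


if __name__ == "__main__":
    sample = "the economical use of nucleotides"
    table = build_dna_huffman(sample)
    strand = encode(sample, table)
    print(len(strand) / len(sample))  # average bases per character
    print(decode(strand, table) == sample)
```

The code table depends on the character frequencies of the text it was built from, which illustrates the drawback noted above: symbols whose frequency varies from text to text have no stable place in the table.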
The advantage of composing the message DNA with this scheme is its isothermal melting temperature. The dominant feature of the comma code is its reading frame of six bases including G, the comma, which is not achieved by other codes. This helps to identify a clear reading frame without the need to mention a starting point. This approach also guarantees protection from insertion and deletion mutations, which are much more complex to handle in the other codes [
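The automatic reading frame can be illustrated with a short sketch that looks for the offset at which every sixth base is G. It assumes, for illustration only, that the five bases between commas are drawn from A, C, and T so that G appears nowhere else; the example codons and function names are hypothetical.

```python
FRAME = 6  # one comma base G followed by a five-base codon


def find_reading_frame(dna):
    """Return every offset (0-5) at which each sixth base is the comma G."""
    return [
        offset
        for offset in range(min(FRAME, len(dna)))
        if all(dna[i] == "G" for i in range(offset, len(dna), FRAME))
    ]


def split_codons(dna, offset):
    """Return the complete five-base codons that follow each comma."""
    codons = []
    for comma in range(offset, len(dna), FRAME):
        codon = dna[comma + 1:comma + FRAME]
        if len(codon) == FRAME - 1:
            codons.append(codon)
    return codons


if __name__ == "__main__":
    message = "GATCAC" + "GCATTA" + "GTACCA"
    fragment = message[3:]            # reading starts mid-codon
    frames = find_reading_frame(fragment)
    print(frames)
    print(split_codons(fragment, frames[0]))
    damaged = fragment[:5] + fragment[6:]   # one base deleted
    print(find_reading_frame(damaged))      # no consistent frame remains
```

A fragment that starts in the middle of a codon still reveals its frame, and a single deletion leaves no offset at which every sixth base is G, which is how this layout exposes insertion and deletion mutations.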
The drawback of this code is that it is not economical, as it repeats the comma base G to create an automatic reading frame [
The alternating code also has repetitive features, which make it uneconomical [
Although the comma-free code is robust and its error correction protects against small-scale loss such as DNA point mutations, it cannot recover broken data when a large DNA segment is deleted from the data-encoding DNA region [
Other coding schemes have a low base-to-character ratio but are limited to a small number of characters, such as the English alphabet. When DNA is inserted into living organisms, it is subject to losing information through breakage by mutation, insertion, and deletion. This approach is a solution to this problem, as it is able to recover data from damaged DNA, and it therefore overcomes the drawback of the comma-free code. It is also able to identify any frame shift due to mutation or errors in sequencing. This method uses a unique primer design based on plasmid DNA libraries [
According to Doig, the necessity for a fixed codon length of 3 bases requires 42% more DNA than the minimum requirement [
As there is very little redundancy, any mutation would cause a change in amino acids. Additionally, if a mutation leads to a change in codon length, many of the mutations will be extensive frameshift mutations. According to Doig, a shift towards the perfect genetic code is impossible. As a simple example, Val is coded by four codons of 3 bases each; as the third base does not convey any information, it would be more efficient to code Val using two bases only. The difficulty for the machinery of shifting from a fixed codon length to a variable codon length, together with the drawbacks mentioned above, means that this code is not used, though it maximizes efficiency through variable length codons [
There is no standard structure in which code words are generated, because the importance of the constraints to be addressed differs according to the encoding model [
Data storage style is the manner in which DNA words are stored in a medium. In the two approaches discussed in detail, DNA words are stored in solid and liquid media. The data storage style depends on the word design. DNA chips, which are an immobilization technique, have been popular, as the weaker constraint on words limits the design problems. A systematic word design which avoids mishybridization serves both the surface based approach and the soluble approach [
Secret writing is used to prevent unauthorized parties from accessing information. Cryptography and steganography are two methods used for secret writing. Cryptography transforms information so that it cannot be understood, while steganography hides the existence of the information [ DNA methods: writing into DNA using insertion and creation; DNA sequencing: reading DNA.
The main objectives of cryptography are as follows. Authentication: confirmation of the identity of the entity from which information is received. Digital signatures, passwords, and trademarks are used to ensure authentication. Data confidentiality: the process of protecting confidential data from unauthorized personnel. In cryptography this goal is achieved through encryption. Data integrity: the guarantee that information is received in exactly the form in which it was sent by the authorized party, that is, that no modification or alteration has taken place during the cryptographic process [
Practical methods for DNA data embedding are twofold: DNA based steganographic methods were proposed by Bancroft and Clelland, in which information is physically concealed in a living organism so that PCR and a secret key are essential for retrieving it [ Cox proposed embedding information in living beings so that the information is carried by the organism through cell replication without affecting the biological properties of the organism. This can be done in two ways [
Replacement of DNA in noncoding segments, which are never translated into proteins: the drawback is that extreme care is needed to ensure that the insertion does not affect biological functions. Modifying coding DNA (cDNA) segments, which are translated into proteins: this approach is more systematic and safer [
The lack of a theoretical basis and the lack of knowledge associated with DNA cryptographic methods are major problems. High cost and difficulty of understanding also have an effect, as does the unsuitability of these methods for use by the general public, since the biological tests and trials have to be performed in highly equipped laboratories.
One Time Pad (OTP) generated keys are used as the encryption key. Each key is used only once, for exactly one message. The used pad is destroyed by the user after encryption. Figure
Block diagram of encrypting a message using DNA hybridization technique.
After decrypting the message, the receiver destroys the identical pad which he owns. For this reason the approach is extremely secure. In this algorithm single-stranded DNA is used as the OTP. The length of the OTP should be 10 times the length of the binary message [
Block diagram of decrypting a message using DNA hybridization.
DNA hybridization is slow at the beginning, because it is difficult for two complementary strands to come together, but it later becomes rapid. It can be effectively utilized in searching and parallel computation. The present restrictions of this process are time consumption and expense [
So, it already provides good security and takes little time for the message to be communicated [
The encryption process by chromosome DNA indexing is represented diagrammatically in Figure
Block diagram of encryption process using DNA Chromosome Indexing.
Block diagram of decryption process using DNA Chromosome Indexing.
This algorithm uses the vast randomness of the DNA medium. This is the downside of this approach, and because of it the algorithm cannot be termed a proper cryptographic algorithm. Hence, the OTP key can be used only once [
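The indexing idea can be sketched in a few lines: the plain text is converted to DNA words, each word is located in a long key sequence, and one of its positions is chosen at random as the cipher value. Everything specific here is an assumption for illustration: the two-bits-per-base mapping, the four-base word size, the function names, and the synthetic demo key, which stands in for a real chromosome sequence taken from a public database.

```python
import random
from itertools import product

# Assumed conventions for this sketch.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}
WORD = 4  # four bases represent one byte of plain text


def make_demo_key(rng, copies=3):
    """Build a stand-in key that is guaranteed to contain every 4-base word."""
    words = ["".join(p) for p in product("ACGT", repeat=WORD)]
    parts = []
    for _ in range(copies):
        rng.shuffle(words)
        parts.extend(words)
    return "".join(parts)


def build_index(key):
    """Record every position at which each 4-base word occurs in the key."""
    index = {}
    for i in range(len(key) - WORD + 1):
        index.setdefault(key[i:i + WORD], []).append(i)
    return index


def encrypt(message, key, rng):
    """Replace each byte by a randomly chosen position of its DNA word."""
    index = build_index(key)
    cipher = []
    for byte in message:
        bits = f"{byte:08b}"
        word = "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, 8, 2))
        cipher.append(rng.choice(index[word]))
    return cipher


def decrypt(cipher, key):
    """Read the 4-base word found at each position and convert it back."""
    out = bytearray()
    for position in cipher:
        word = key[position:position + WORD]
        out.append(int("".join(BASE_TO_BITS[base] for base in word), 2))
    return bytes(out)


if __name__ == "__main__":
    # random.Random is not cryptographically secure; random.SystemRandom()
    # offers the same shuffle and choice methods for anything beyond a demo.
    rng = random.Random(7)
    key = make_demo_key(rng)
    cipher = encrypt(b"DNA", key, rng)
    print(cipher)
    print(decrypt(cipher, key))
```

Because each word usually occurs at many positions in the key, the same plain text encrypts to different index lists on different runs, which is the role played by the randomly produced index.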
In the decryption process the OTP is used by the user to extract the encryption key. Self-assembly of the tiles used for encryption, in reverse order, yields the plain text [
Incorporating DNA into a living host that can withstand harsh ecological conditions, grow rapidly, and tolerate the addition of artificial gene sequences was the solution proposed for generating a reliable storage medium. Inoculating DNA sequences into an organism is a challenge because it is difficult to retrieve a message from a whole organism composed of many genomes. Another obstacle is the unpredictable nature of genomic mutation [
The DNA memory prototype consists of 4 main steps: encoding information as artificial DNA sequences; injecting the sequences into living organisms; allowing the organisms to grow; and extracting the information back from the organisms.
DNA memory can be effectively utilized in commercial applications and in national security for information hiding purposes and for data steganography.
There is competition among seed companies to protect their investments. Incorporating a DNA watermark in the seeds could therefore be a better approach to tracking their sales and protecting their copyrighted products against illegitimate planting [
In order to detect pollutants that contaminate ecological resources, researchers drill wells to gather samples of soil. Incorporating sufficient information in bacteria, which could continuously update the current status of the soil over time by tracking the bacteria’s spatial and temporal distribution using developed technologies, would be an effective approach for this purpose [
Endangered species could be identified by injecting a DNA watermark into the genome of the subject, replacing other synthetic identification. This could also be used to safely preserve a person’s personal information, such as medical information and family history, in their own body [
The era of DNA computing began with the identification of the limitations of electronic computers. The volume of data that can be stored in an electronic computer and the speed thresholds that can be reached, which are governed by the physical characteristics of computers, are the main limitations identified in big data storage [
Advantages offered by DNA computing include significantly lower energy consumption than electronic computers: the energy consumed by DNA computers is a billion times less. The storage space needed to store information is a trillion times less than that of electronic computers [
Adleman addressed the directed Hamiltonian path problem by exploring the possibilities of encoding information in DNA sequences and then performing simple strand manipulation operations. The simplified version of this problem was the salesman problem, which seeks the optimal path through a set of cities that a salesman has to visit. Adleman complicated this problem by restricting the connection routes between cities and specifying the start and end of the journey.
This approach solves the above-mentioned problem by generating random paths through the graph. Adleman encoded each node in the graph as a random strand of 20 bases. Each edge of the graph was represented by another oligonucleotide of 20 bases, complementary to the second half of the source node and the first half of the target node. This results in self-assembly and ligation of compatible edges through the action of the T4 DNA ligase enzyme. To filter the paths, the product of the self-assembly process is amplified by PCR. To select the paths with the exact length, DNA strands of that length are separated and recovered by agarose gel electrophoresis. Agarose gel electrophoresis [
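The encoding and filtering steps described above can be imitated in software as a rough sketch. The five-node graph, the function names, and the exhaustive enumeration of walks (in place of random self-assembly in a test tube) are assumptions for illustration; the final check that every node is present is added here so that the sketch returns only Hamiltonian paths.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")
HALF = 10  # each 20-base node strand is treated as two 10-base halves


def make_node_strands(names, rng):
    """Give each node a random 20-base strand with unique halves."""
    while True:
        strands = {n: "".join(rng.choice("ACGT") for _ in range(2 * HALF))
                   for n in names}
        firsts = {s[:HALF] for s in strands.values()}
        seconds = {s[HALF:] for s in strands.values()}
        if len(firsts) == len(names) and len(seconds) == len(names):
            return strands


def edge_oligo(src, dst):
    """Edge strand: complement of src's second half plus dst's first half."""
    return (src[HALF:] + dst[:HALF]).translate(COMPLEMENT)


def assemble_walks(strands, pool, max_vertices):
    """Enumerate every walk the edge pool allows, up to max_vertices nodes."""
    names = list(strands)
    adjacent = {u: [v for v in names
                    if edge_oligo(strands[u], strands[v]) in pool]
                for u in names}
    walks = [[v] for v in names]
    frontier = list(walks)
    for _ in range(max_vertices - 1):
        frontier = [w + [v] for w in frontier for v in adjacent[w[-1]]]
        walks.extend(frontier)
    return walks


def hamiltonian_paths(names, edges, start, end, seed=0):
    rng = random.Random(seed)
    strands = make_node_strands(names, rng)
    pool = {edge_oligo(strands[u], strands[v]) for u, v in edges}
    walks = assemble_walks(strands, pool, len(names))
    # Keep walks with the right start and end (the role of PCR).
    walks = [w for w in walks if w[0] == start and w[-1] == end]
    # Keep walks of exactly the right length (the role of gel electrophoresis).
    walks = [w for w in walks if len(w) == len(names)]
    # Keep walks that contain every node (check added in this sketch).
    return [w for w in walks if set(w) == set(names)]


if __name__ == "__main__":
    nodes = [0, 1, 2, 3, 4]
    links = [(0, 1), (0, 2), (1, 2), (2, 1), (1, 3), (2, 3), (3, 4), (2, 4)]
    print(hamiltonian_paths(nodes, links, start=0, end=4))
```

The software version enumerates walks one after another, whereas the appeal of the DNA version is that all candidate paths form in parallel in a single reaction.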
The practical approach required around 7 lab days. Adleman’s algorithm was labour intensive; process automation could address this. The algorithm was inefficient, as the number of oligonucleotides needed increased linearly with the number of edges and exponentially with the number of vertices. It is, however, energy efficient. It is not clear whether the large number of inexpensive operations could be used to solve real computational problems [ The number of operations it can perform in parallel. How many steps each process can perform per unit time.
The first measure favours DNA computers because of the vast parallelism they offer. The second measure favours electronic computers, since a personal computer can perform 100 million instructions per second [
With the enormous flexibility that electronic computers offer through their numerous operations, it is quite efficient to multiply two 100 digit numbers on them, whereas it is overwhelming to do so on a DNA computer with the protocols and enzymes presently available [
This is the largest problem solved so far with a DNA computer: a 20-variable instance of the three-satisfiability problem. This problem is NP-complete (nondeterministic polynomial time complete). As the problem complexity is very high, even the fastest sequential algorithms require exponential time to solve it [
DNA has been identified as a potential medium for data storage due to its vast storage capacity, high data density, ability to withstand extreme environmental conditions, and so forth. The evolution of data storage in DNA is described in detail in this review article. The idea of data storage in DNA emerged with the Microvenus project of Davis [
Comparison of encoding models.
Encoding model | Advantages | Disadvantages |
---|---|---|
Microvenus project | Laid the foundation for storing abiotic information in DNA | Being inaccurate and not distinctively decodable |
Genesis project | Laid the research work to explore the intricate relationship between biology, belief systems, information technology, dialogical interaction, ethics, and the Internet | Inaccurate as the original sentence was altered during mutation at the presence of ultraviolet light |
PCR based encoding models | High security because of the size of the microdots and even if an adversary identifies the microdot it would be extremely difficult without the knowledge of the primer sequence | Insertion of errors in template region making it unmanageable to recover the encoded data |
Alignment based encoding models | Independent of Polymerase Chain Reaction | Multiplication of cassettes leads to redundant volumes |
Rewritable and random access based DNA storage system | Random access to data blocks of DNA which promotes nonlinear access | High cost |
Next generation digital information storage | Employment of one-bit representation per base | Cost is unfeasible |
Encoding scheme for small text files | High volume data storage density | Have not proceeded in implementing the biological protocols to insert the sequence in genome of bacteria |
PCR based encoding models [
Alignment based encoding model [
Church and Goldman model [
Numerous encryption schemes have been used to encrypt data in DNA. Huffman code [
Comparison of encryption codes.
Feature | Huffman code | Comma code | Alternate code | Comma-free code | Improved Huffman code | Perfect genetic code |
---|---|---|---|---|---|---|
Base-to-character ratio | ~2.2 | ~6 | ~6 | Variable | ~3.5 | Variable |
Economical | Very economical | No | No | Yes | | |
Long term storage | No | Yes | Yes | | | |
Error correcting | Yes | Yes | Yes | | | |
Protection from mutation | Yes | Yes | No | | | |
Isothermal melting temperature | Yes | Yes | | | | |
Synthetic DNA | Yes | Yes | | | | |
Special features | Uses the principle of varying the length of symbols used for representation based on the recurrence of a character | Consists of fixed length reading frames of 6 bases including the comma, G | | Fixed length base frames without commas to separate the frames | Stores text, images, and music in DNA | 70% more efficient than the other codes due to the use of a variable code length |
DNA is effectively used for secret writing due to the security mechanisms it offers. DNA cryptosystems are more secure due to the huge size of the OTP key used for encryption. It is extremely hard to break the algorithm without knowing the primer sequences and the scientific specifications of the organism. The running time of the cryptographic systems is low. The larger key and the index arrays used in DNA systems need a large amount of memory; in practical situations a separate storage device might therefore be required. In the hybridization method, an OTP key segment 10 times bigger is generated for each binary bit of plain text. In DNA indexing the key obtained from the public database is far larger than that of the hybridization method. Encrypted data is obtained by scanning the key for the DNA form of the plain text and randomly picking one of the matching index numbers. The random pick of numbers along with the lengthy OTP key improves confidentiality. Implementation of these cryptographic systems is highly costly.
To provide more security, the length of the OTP key generated in the hybridization method can be increased further, by generating an OTP key segment of more than 10 bases (say 12 or more) for each bit. It can be concluded that “the higher the length of the key data, the higher the security.”
To enhance the security of cryptographic systems, a further step of hiding the data after encryption could be practiced. This can be achieved through the biological process of hiding the encrypted data between the primers in the DNA sequence. A comparison of the performance of the basic cryptography algorithms is summarized in Table
Comparison of the performance of cDNA secret writing techniques.
Features | DNA hybridization technique | Chromosome DNA indexing |
---|---|---|
Running time | Less | More |
Size of the key | Large depending on the input | Large independent of the input |
Strength of the algorithm | High based on the type, size, and the randomness of the key | High based on the key type and key size and the randomly produced index |
Memory space | Needs more memory space for storing the lengthy key and performing the operations involving it | More than the hybridization type because of the huge key length and the index array involved |
Cost | High | High |
Longevity | Believed to withstand any duration | Believed to withstand any duration |
Incorporating messages into human, mouse, or bacterial genomes was popular, but the challenges were extracting the information from the whole genome without any clues about the embedded message and the unpredictable nature of genomic mutations. Wong et al. [
Cost and the data retrieval rate are major problems in DNA based storage systems. At present data can be read at a rate of about 100 MB per second by storage devices, which is much higher than the rate for natural storage. Synthesis and sequencing are time consuming and require expert knowledge, which makes this method inaccessible to the general public. Even though DNA is scalable, robust, and stable, the above-mentioned drawbacks are of high concern. Making a custom DNA molecule is expensive, and this has been identified as the major obstacle for DNA based storage systems [
This review article critically analyzes the existing methods of storing data in DNA. Data is encrypted into DNA using diverse codes, and this article analyzes and discusses those codes. Multiple approaches for designing DNA codons and diverse data storage styles have been analyzed in detail, identifying the pros and cons of each approach. Secret writing techniques using DNA molecules for secure data storage are also discussed. DNA can be used as an organic memory to store massive amounts of data. This paper also analyzes the mechanism by which living organisms could be used as storage devices, while identifying limitations and appropriate applications. Challenges faced in applying organic memory concepts are also discussed. Big data storage and analytics, and the way they have led to DNA computing for solving hard computational problems, are discussed as well. The outcome of this study is a review article which identifies the limitations of existing encoding algorithms and proposes methods to overcome the identified limitations.
The authors declare that there are no competing interests regarding the publishing of this paper.
Pavani Yashodha De Silva would like to express her sincere gratitude to her supervisor, Dr. Gamage Upeksha Ganegoda, for the guidance and support extended to drive her on the correct path to carry out her independent study.