In a world flooded with data, figuring out where and how to store it efficiently and cheaply becomes a bigger problem every day. One of the most exotic solutions could turn out to be one of the best: archiving information in DNA molecules.
The predominant long-term cold-storage method, which dates from the 1950s, writes data to pizza-sized reels of magnetic tape. In comparison, DNA storage may be cheaper, more energy efficient, and more durable. Studies show that DNA properly encapsulated with a salt remains stable for decades at room temperature and should last much longer in the controlled environment of a data center. DNA does not require maintenance, and files stored in DNA can be copied easily at negligible cost.
Even better, DNA can archive a staggering amount of information in an almost inconceivably small volume. Consider this: humanity will generate an estimated 33 zettabytes of data by 2025 – that’s 3.3 followed by 22 zeros. DNA storage could squeeze all of that information into a volume the size of a ping pong ball. The 74 million million bytes of information in the Library of Congress could be packed into a DNA archive the size of a poppy seed – 6,000 times over. Split the seed in half and you could store all of Facebook’s data.
Science fiction? Hardly. DNA storage technology exists today, but to make it viable, researchers must overcome some formidable hurdles in integrating different technologies. As part of a large collaboration doing this work, our team at Los Alamos National Laboratory has developed a key enabling technology for molecular storage. Our software, the Adaptive DNA Storage Codex (ADS Codex), translates data files from the binary language of zeros and ones that computers understand into the four-letter code of biology.
ADS Codex is a key part of the Molecular Information Storage (MIST) program of the Intelligence Advanced Research Projects Activity (IARPA). MIST seeks to bring cheaper, larger, and more durable storage to big data operations in government and the private sector, with a short-term goal of writing one terabyte – one trillion bytes – and reading 10 terabytes within 24 hours at a cost of $1,000.
FROM COMPUTER CODE TO GENETIC CODE
When most people think of DNA, they think of life, not computers. But DNA is itself a four-letter code used to convey information about an organism. DNA molecules are made up of four types of bases or nucleotides, each identified by a letter: adenine (A), thymine (T), guanine (G), and cytosine (C). They are the basis of all DNA codes and provide the instruction manual for the construction of all living things on earth.
DNA synthesis is a fairly well understood technology that is widely used in medical, pharmaceutical, and biofuel development, to name a few applications. The technique organizes the bases in various arrangements indicated by specific sequences of A, C, G, and T. These bases wrap around each other in a twisted chain – the well-known double helix – to form the molecule. The arrangement of these letters in sequences creates a code that tells an organism how to form.
The entire set of DNA molecules makes up the genome – your body’s blueprint. By synthesizing DNA molecules from scratch, researchers have found that they can specify, or write, long strings of the letters A, C, G, and T and then read those sequences back. The process is analogous to how a computer stores binary information. From there it was a short conceptual step to encoding a binary computer file into a molecule.
The method has been shown to work, but reading and writing DNA-encoded files currently takes a long time. Attaching a single base to DNA takes about a second. Writing an archive file at this rate could take decades, but researchers are developing faster methods, including massively parallel operations that write to many molecules at the same time.
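To get a feel for those numbers, here is a back-of-the-envelope sketch. It assumes each base carries two bits (four letters, no error-correction overhead) and takes the one-second-per-base figure at face value; the function name and its parameters are ours, chosen purely for illustration.

```python
SECONDS_PER_BASE = 1.0                 # from the text: about a second per base
BITS_PER_BASE = 2                      # assumption: 4 letters -> 2 bits, no overhead
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def write_time_years(num_bytes: int, parallel_strands: int = 1) -> float:
    """Estimate the years needed to write num_bytes, one base at a time per strand."""
    bases = num_bytes * 8 / BITS_PER_BASE
    seconds = bases * SECONDS_PER_BASE / parallel_strands
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    one_gigabyte = 10**9
    print(f"serial:          {write_time_years(one_gigabyte):.1f} years")
    print(f"1M strands:      {write_time_years(one_gigabyte, 1_000_000) * SECONDS_PER_YEAR:.0f} seconds")
```

Under these assumptions, a single gigabyte written serially takes more than a century, while spreading the same work across a million strands brings it down to about 4,000 seconds. That gap is why parallel synthesis matters so much.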
NOTHING LOST IN TRANSLATION
ADS Codex specifies exactly how the zeros and ones are translated into sequences of four-letter combinations of A, C, G, and T. The Codex also handles decoding back into binary. DNA can be synthesized by a variety of methods, and ADS Codex can accommodate them all.
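As a rough illustration of the idea, and not the actual ADS Codex scheme, the simplest possible translation assigns two bits to each base. The mapping table and function names below are assumptions made for this sketch; a real codec also has to respect synthesis constraints and add error handling.

```python
# Hypothetical mapping: two bits per base. This is NOT the ADS Codex encoding.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Translate bytes into a strand of A, C, G, T (four bases per byte)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Translate a strand of bases back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    strand = encode(b"Hi")
    print(strand)
    print(decode(strand))
```

With this toy mapping, the two bytes of `Hi` become the eight-base strand `CAGACGGC`, and decoding that strand returns the original bytes.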
Unfortunately, the error rates when writing to molecular storage with DNA synthesis are very high compared to conventional digital systems. These errors come from a different source than in the digital world and are therefore more difficult to correct. Binary errors occur on a digital hard drive when a zero changes to a one or vice versa. With DNA, the problems arise from insertion and deletion errors. For example, you might intend to write ACGT, but sometimes you try to write A and nothing appears, so the sequence of letters shifts to the left, or you get AAA instead.
Regular error correction codes don’t work well with such problems, so ADS Codex adds error detection codes that validate the data. When the software converts the data back to binary, it checks to see if the codes match. If not, bases – letters – are removed or added until verification is successful.
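The repair loop can be sketched in miniature. The version below is a toy built on stated assumptions, not the ADS Codex algorithm: it uses a CRC-32 checksum as a stand-in for the error detection code, assumes the intended strand length is known, and handles only a single dropped or extra base.

```python
import zlib
from typing import Iterable, Optional

BASES = "ACGT"


def checksum(strand: str) -> int:
    """CRC-32 of the strand; a stand-in for a real error detection code."""
    return zlib.crc32(strand.encode("ascii"))


def repair(strand: str, expected_len: int, expected_crc: int) -> Optional[str]:
    """Add or remove one base until the checksum verifies; None if no fix is found."""
    if len(strand) == expected_len:
        candidates: Iterable[str] = [strand]
    elif len(strand) == expected_len - 1:
        # One base was dropped: try every base at every position.
        candidates = (strand[:i] + base + strand[i:]
                      for i in range(len(strand) + 1) for base in BASES)
    elif len(strand) == expected_len + 1:
        # One extra base was written: try removing each position in turn.
        candidates = (strand[:i] + strand[i + 1:] for i in range(len(strand)))
    else:
        return None  # more than one insertion or deletion: out of scope here
    for candidate in candidates:
        if checksum(candidate) == expected_crc:
            return candidate
    return None


if __name__ == "__main__":
    intended = "ACGTACGT"
    crc = checksum(intended)
    print(repair("ACGACGT", len(intended), crc))    # a T was dropped
    print(repair("ACGTAACGT", len(intended), crc))  # an extra A was written
```

The brute-force search is cheap for a single error on a short strand, but the number of candidates grows quickly with every additional insertion or deletion, which is one reason practical codes need something smarter than trial and error.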
We have completed version 1.0 of ADS Codex and plan to use it later this year to evaluate the storage and retrieval systems developed by the other MIST teams. The work fits in well with Los Alamos’ history of driving new developments in computing as part of our national security mission. Since the 1940s, as a result of these advances in computing, we have amassed some of the oldest and largest stores of digital-only data. That data still has tremendous value. Because we keep data forever, we’ve long been at the tip of the spear when it comes to finding a cold-storage solution, but we’re not alone.
All of the world’s data – all of your digital photos and tweets; all records of the global financial sector; all those satellite images of farmland, troop movements, and glacial melting; all the simulations that underlie so much of modern science; and so much more – have to go somewhere. The “cloud” is not a cloud at all. It is digital data centers housed in huge warehouses that consume vast amounts of electricity to store (and keep cool) trillions of millions of bytes. These data centers cost billions of dollars to build, power, and operate, and they may struggle to remain viable as the need for data storage continues to grow exponentially.
DNA holds great promise for satisfying the world’s voracious appetite for data storage. The technology requires new tools and new ways of applying familiar ones. But don’t be surprised if one day the world’s most valuable archives find a new home in a poppy-seed-sized collection of molecules.
Funding for ADS Codex came from the Intelligence Advanced Research Projects Activity (IARPA), a research agency within the Office of the Director of National Intelligence.
This is an opinion and analysis article.