As we head further into this book and talk about DNA in more detail, we’ll be able to better appreciate the wonder of genes. But even at this early stage there is one way in which we can ponder one important aspect in a relatable way: Size.
As mentioned before, we’re going to be measuring the size of DNA information in bytes (like the radical heretics that we are). We’ll get to how this is computed in a moment, but first let us consider: just how much DNA is there? Or rather, how much information do you need to create an independently living organism? Today we can answer that question pretty well: around 320 kilobytes. And if we stretch the definition somewhat, there is a parasitic bacterium weighing in at 150 kilobytes.
The weighing, by the way, is quite literal: 250 megabytes of DNA turns out to weigh almost exactly one picogram (one thousandth of a nanogram). Before the advent of DNA reading (“sequencing”) technology, this was the only way to know the size of an organism’s genome! There are still some jumbo-sized genomes that we don’t have the technology to sequence, but we can already tell that Paris japonica, a modest plant, has 50 times more DNA than we do.
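The weight-to-size relation can be sketched in a few lines. This is only an illustration of the arithmetic: the conversion factor is the rounded figure from the text, the 750 megabyte human genome is the number used in this chapter, and the function names are made up for this example.

```python
# Rounded conversion factor from the text: about 250 megabytes of DNA per picogram.
MEGABYTES_PER_PICOGRAM = 250


def genome_weight_pg(size_megabytes: float) -> float:
    """Estimate the weight of a genome in picograms from its size in megabytes."""
    return size_megabytes / MEGABYTES_PER_PICOGRAM


def genome_size_mb(weight_pg: float) -> float:
    """Estimate the size of a genome in megabytes from its weight in picograms."""
    return weight_pg * MEGABYTES_PER_PICOGRAM


human_mb = 750                 # human genome size used in this chapter
paris_mb = 50 * human_mb       # "50 times more DNA than we do"

print(genome_weight_pg(human_mb))   # 3.0 picograms
print(genome_weight_pg(paris_mb))   # 150.0 picograms
```

Running it the other way round, `genome_size_mb`, is how a weighed genome would have been turned into a size estimate before sequencing.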
Viruses meanwhile are much craftier, with one successful virus (Porcine circovirus type 2 strain MLP-22) containing only 450 bytes of information. Viruses get away with this by not technically being alive themselves, and relying on the host organism for most things.
For plants, human beings and other animals, things are a bit larger. If you store the human genome on disk, it takes up 750 megabytes [1].
To put that into context, 750 megabytes corresponds to one hour of uncompressed CD quality music. 450 bytes meanwhile corresponds to the first two paragraphs of this chapter.
Or comparing apples to apples, if we took our raw DNA data and if we could play it back as uncompressed CD quality audio, our genome would be roughly one hour of sound (the average run time of a full-length music album), the DNA of the smallest bacterium would be a 1.5 second sound clip, and the smallest virus would just emit a tiny blip two milliseconds long.
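For readers who want to check the audio comparison, here is a rough sketch of the arithmetic. It assumes standard CD audio (44,100 samples per second, two channels, 16 bits per sample) and decimal units (1 megabyte = 1,000,000 bytes); the names are only illustrative.

```python
# CD quality audio: 44,100 samples per second, 2 channels, 2 bytes (16 bits) per sample.
CD_BYTES_PER_SECOND = 44_100 * 2 * 2


def playback_seconds(size_bytes: float) -> float:
    """How long a blob of this many bytes lasts when played as raw CD audio."""
    return size_bytes / CD_BYTES_PER_SECOND


# Genome sizes as quoted in this chapter, in bytes (decimal units assumed).
genomes = {
    "human genome (750 MB)": 750e6,
    "free-living bacterium (320 kB)": 320e3,
    "parasitic bacterium (150 kB)": 150e3,
    "smallest virus (450 bytes)": 450,
}

for name, size in genomes.items():
    print(f"{name}: {playback_seconds(size):.4f} seconds")
```

Under these assumptions the human genome lasts about 71 minutes, the two bacteria roughly 1.8 and 0.85 seconds, and the virus about 2.6 milliseconds, which is in the same ballpark as the rounded figures above.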
To get a feel for the storage needed to hold the genome of a simple bacterium: this electron microscope picture of a colony of E. coli bacteria takes up the same amount of space on disk (as an uncompressed BMP file) as its DNA (320 kilobytes).
DNA, bits & bytes
So first, how do we get to these kilobyte and megabyte numbers? There are four DNA letters, A, C, G and T. If we string four of them together, there are 4*4*4*4 = 256 possible combinations. A computer byte of information can also store 256 values, so it is fair to say 4 DNA nucleotides [2] are the equivalent of one byte. This means we can divide the number of DNA letters by 4, and get the corresponding number of bytes.
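The four-letters-per-byte idea can be made concrete with a small sketch. The mapping of A, C, G and T to the numbers 0 to 3 is an arbitrary choice made for this example (any fixed assignment works), and `pack` and `unpack` are hypothetical helper names, not part of any standard bioinformatics format.

```python
# Arbitrary 2-bit code per DNA letter (an assumption for this example).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"


def pack(dna: str) -> bytes:
    """Pack a DNA string into bytes, four letters per byte."""
    if len(dna) % 4 != 0:
        raise ValueError("this sketch only handles lengths divisible by 4")
    out = bytearray()
    for i in range(0, len(dna), 4):
        value = 0
        for letter in dna[i:i + 4]:
            value = (value << 2) | CODES[letter]   # append 2 bits per letter
        out.append(value)
    return bytes(out)


def unpack(data: bytes) -> str:
    """Turn packed bytes back into the original DNA letters."""
    letters = []
    for byte in data:
        for shift in (6, 4, 2, 0):                 # read the four 2-bit fields
            letters.append(LETTERS[(byte >> shift) & 3])
    return "".join(letters)


packed = pack("GATTACAT")
print(len(packed))       # 2 bytes for 8 letters
print(unpack(packed))    # GATTACAT
```

Because the round trip is lossless, counting bytes this way loses nothing: every sequence of four letters corresponds to exactly one byte value and vice versa.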
This may still feel artificial: “you could call anything a byte that way!” But we have the ultimate proof that we can capture a living organism on disk. As early as 2002, researchers succeeded in recreating the poliovirus from scratch.
They did this by ordering up the synthesis of snippets of viral genetic material and joining these together into a full 7500 nucleotide long chromosome. This was then mixed with what I can best describe as human cell juice: the purified insides of human cells, with no DNA left in it.
When mixed, the cell juice & synthetic chromosome then proceeded to generate functioning poliovirus particles. Well done, I guess?
Now, this was just a virus, but in 2010, the J. Craig Venter Institute, famous for the human genome sequencing project, succeeded in synthesizing all the DNA of a bacterium called Mycoplasma mycoides. And not only that, they added some fun signatures to prove it was their creation. They then emptied out another bacterium, and inserted their own synthetic DNA. The resulting organism was viable.
Now, these are impressive stunts, and they definitely prove that we can sequence DNA to a file on disk, and then recreate that DNA, and that it will power life. But if you look carefully, both of these attempts needed help to be “bootstrapped” - the poliovirus took cell juice, the bacterium required an existing (empty) cell. We’ll revisit this subject later - because we have not yet settled if we can fully recreate life from scratch.
The fly in the ointment here is one of the central dogmas of life, “Omnis cellula e cellula”, or “All cells arise only from pre-existing cells”. This “Cell Theory” may oddly enough mean that a cell does not know itself how to make a cell - only how to divide one into two.
Information or data?
In information theory, there is a difference between information and data. If, for example, we measure a repeating signal, the information is that there is a 50 Hz sine wave of a certain amplitude and phase. The data however consists of the values of that signal as measured at a certain interval. We can keep collecting that data, but as long as the signal remains the same, we are not gathering any additional information.
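A small experiment makes the distinction tangible. The sketch below samples a 50 Hz sine wave for ever longer periods and then runs the samples through a general-purpose compressor. Compressed size is only a crude stand-in for information content, not a rigorous measure, and the sample rate, amplitude and 16-bit format are assumptions picked for this example.

```python
import math
import struct
import zlib

SAMPLE_RATE = 1000   # samples per second (assumed for this example)
FREQUENCY = 50       # the 50 Hz sine wave from the text
AMPLITUDE = 10_000   # arbitrary amplitude that fits in a 16-bit sample


def sample_signal(seconds: int) -> bytes:
    """Sample the sine wave and store every sample as a 16-bit integer."""
    samples = [
        round(AMPLITUDE * math.sin(2 * math.pi * FREQUENCY * n / SAMPLE_RATE))
        for n in range(seconds * SAMPLE_RATE)
    ]
    return struct.pack(f"<{len(samples)}h", *samples)


for seconds in (1, 10, 100):
    raw = sample_signal(seconds)          # the data: grows with measuring time
    squeezed = zlib.compress(raw, 9)      # rough proxy for the information
    print(f"{seconds:>3} s: {len(raw):>6} bytes of data, "
          f"{len(squeezed)} bytes after compression")
```

The raw data grows in step with the measuring time, while the compressed size stays tiny in comparison, because every new second of samples repeats what was already there.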
In my examples of genome sizes, I used some careful words. 750 megabytes indeed corresponds to one hour of CD quality audio - uncompressed. The 320 kilobyte image of the bacteria was also carefully uncompressed.
It turns out that while almost all viral or bacterial DNA is clearly necessary, the same does not hold for plants and animals, including us. Up to 97% of our DNA has no immediately apparent function. A lot of it may be noise. It may however have a supporting (indirect) role.
If we look purely at the parts that are directly used, the core of the human genome may be as small as 25 megabytes, a number that is as depressing as it is awe-inspiring.
To put that into context, 25 megabytes is smaller than the source code of most computer programs you use.
Perhaps you are thinking, “Is the essence of my being really this trivial?”
So how does our intuition fare here? A bacterium is a self-replicating nanoscale robot, able to source components from a vast array of materials. It observes its environment, it can seek out food, it protects itself against enemies. Bacteria can also survive an astounding range of temperatures and chemical conditions. Within every bacterium hides a 3D printer that is able to arrange for the creation of giant molecules with atomic level precision [3].
Can we imagine a computer programmer writing the firmware for such a nano-robot, and ending up with only 320 kilobytes? As a programmer, I find this hard to imagine, but it is at least within the range of our imagination. The “demoscene”, a (mainly European) technology and art subculture of programmers, routinely creates astounding programs within 64 or even 4 kilobytes.
The 25 versus 750 MB situation
So which is it? Do we have a 750 MB genome? Or a 25 MB one, accompanied by 725 MB of “junk”? As we progress in this book we’ll be able to shed more light on this, but here we can already draw some analogies.
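As a quick sanity check, the 25 MB and 750 MB figures line up with the “up to 97%” mentioned earlier. The sketch below only redoes that arithmetic with the rounded numbers from this chapter.

```python
TOTAL_MB = 750   # whole human genome, as used in this chapter
CORE_MB = 25     # the directly used "core", as used in this chapter

rest_mb = TOTAL_MB - CORE_MB

print(rest_mb)                          # 725
print(f"{CORE_MB / TOTAL_MB:.1%}")      # 3.3% directly used
print(f"{rest_mb / TOTAL_MB:.1%}")      # 96.7% without an apparent function
```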
We eukaryotes (“things with a nucleus”) tend to have genomes around 1000 times larger than bacteria (“prokaryotes”). But if we analyze a bacterial genome, we find that almost all of it ends up being used. For bacteria this makes sense - DNA takes time to replicate, and for bacteria replication speed is the name of the game. Any bits of superfluous DNA are soon gone.
For a variety of reasons, our complex genomes are generally not under this kind of pressure. And in the vast majority of eukaryotic genomes, we observe that more than 90% of the genetic data is not immediately used.
Yet, there are some notable counter-examples.
This lovely small plant has a 25 MB genome, almost all of which is functional. In other words, it is living proof that you can build complex eukaryotic life without all of the DNA that appears to be “non-coding”.
Does this mean you could also build a whole human being out of 25 MB? Or might the other 725 MB also be required?
When we look into the matter we find a lot of repetitive stuff in the 725 MB that harks back to periods when our DNA was plagued by selfishly copying elements, like viral graffiti.
One concrete question to ask is, does that 725 MB carry a lot of information? Or is it filler, stuff that has to be there, but could also have been entirely different DNA, without affecting the outcome a lot?
Within human genomics this remains a hotly debated topic. For now I’d like to leave it at this: if you take any kind of picture at high resolution and you zoom in, you’ll note there is a lot of arbitrary material in there. Speckles, flecks of dust, random patterns that together are an important part of the photo. Yet if you went around rotating or changing these zoomed-in bits, no one would be able to tell. The noisier elements of the photo could actually have been different noisy elements, and nothing would have changed.
Yet take away all that detail and the photo no longer looks the same.
The best way, at least in this chapter, to look at “junk DNA” may be like that: it has to be there for now to complete the picture, but it may not very specifically contribute to the outcome. Different non-coding DNA might have led to the same results.
In subsequent chapters we’ll delve into the cellar of non-coding DNA and investigate what fascinating things we find there. We’ll also cover how non-coding DNA still highly influences DNA folding and gene activation.
But for now, we may have to come to terms with the very real possibility that the core of our DNA may be smaller than the source code of even rather trivial computer programs.
Further reading
- A Mathematical Theory of Communication, Claude E. Shannon
- Mycoplasma laboratorium on Wikipedia
[1] Actually, because the state of bioinformatics is dire, it typically takes up 3.5 gigabytes. We’ll get to this in a later chapter.
[2] Sadly, the same thing in DNA goes by many names. And sometimes for a good reason. The chemical people got there first and named DNA letters ‘bases’. A base is the opposite of an acid. In this not very specific terminology, we speak of ‘kilobases’ or ‘megabases’. If a DNA letter finds itself in DNA, we can also call it a nucleotide, even if it is not in the nucleus, so this name too is not that satisfying. Sometimes the chosen name has a specific and relevant meaning, and in later chapters we’ll add some better resolution.
[3] One of my favorites is vitamin B12, which in all its complexity, hides a single cobalt atom at its core. Bacteria are able to assemble this molecule in such a way that it magically attracts such an atom. No technology that we have ever invented comes close to doing anything like this with any kind of efficiency – if at all.