How COVID-19 is diagnosed: bacterium assisted DNA searching

This post is dedicated to lab technicians everywhere doing the difficult work institutes and hospitals rely on to investigate disease and keep us healthy. Lab work requires high precision and deep understanding, is physically demanding, and can even be dangerous. Although our healthcare systems & universities would come to a grinding halt without lab technicians, they are often almost literally invisible somewhere far away. Thank you all for your hard work!

When COVID-19 (caused by “SARS-CoV-2”, and originally known as 2019-nCoV) was discovered, the world quickly determined the DNA (or actually RNA) of this virus. Within a few days reliable laboratory tests became available that were able to detect an infection in only a few hours. In this post I attempt to explain the magnificent technology used, which is called Reverse Transcription Real-Time Quantitative Polymerase Chain Reaction or qRT-PCR.

qRT-PCR is used to detect the presence of a specific bit of COVID-19 genetic material, and if we see enough of it, we then determine that the patient is infected. In essence this virus test is actually a DNA test.

NOTE: If you are a professional, you’ll note I take some shortcuts in the story. Please read on to the end, where some of these shortcuts are patched up. Feedback if I got it wrong is VERY welcome on bert@hubertnet.nl or @PowerDNS_Bert.

We’ve known about genes and DNA for longer than you might think. The first DNA sequences were painstakingly determined in the early 1970s, one DNA letter at a time. In 1978, the (tiny) genome of bacteriophage φX174 was determined and published. Only around the year 2000 did it become possible to read “whole genomes”, and then only with international efforts costing billions.

Because DNA sequencing technology was so limited, in the 1980s a lot of thought went into the question of how to detect specific bits of DNA without performing the (then) unimaginable act of reading gigabytes of DNA.

Note: Thanks are due to Dr Mamnun Khan and Erwin van Rijn for suggestions & feedback. Please note that all mistakes remain mine!

Enter the Polymerase Chain Reaction (PCR).


By Maurits Hubert

DNA

DNA is fully digital, consisting of strings of nucleotides, which are small molecules. We call these small molecules A, C, G and T. Human DNA consists of around 3 billion nucleotides, which comprise around 750MB of data.
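The 750MB figure follows from the fact that each nucleotide is one of four letters, and so carries 2 bits. A small sketch of that arithmetic (the names are mine, and 3 billion is the rounded number quoted above):

```python
# Each nucleotide is one of 4 letters (A, C, G, T), so it holds 2 bits.
NUCLEOTIDES = 3_000_000_000   # rounded size of the human genome, as quoted above
BITS_PER_NUCLEOTIDE = 2       # log2(4)

def genome_megabytes(nucleotides: int = NUCLEOTIDES) -> float:
    """Return the storage needed for `nucleotides` letters, in megabytes."""
    total_bits = nucleotides * BITS_PER_NUCLEOTIDE
    return total_bits / 8 / 1_000_000

print(genome_megabytes())  # 750.0
```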

Because our DNA is, so to say, important, nature stores it redundantly. For example, the sequence ACGTTCA is actually stored like this:

<-------
 ACGTTCA
 |||||||
 TGCAAGT
 ------->

This shows the two strands of DNA, where each A finds itself opposite a T and every C is attached to a G. If damage occurs, it can easily be repaired, because the opposite side forms a template to attract replacement nucleotides. A/T and C/G pairs can each be compared to north/south pole magnets: they attract each other strongly.
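The pairing rule is simple enough to write down as a lookup table. A minimal sketch, where the name `complement` is just my choice, that produces the opposite strand in the same left-to-right order as the diagram:

```python
# Pairing rules: A sits opposite T, C sits opposite G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the opposite strand, letter by letter, in the same left-to-right order."""
    return "".join(PAIR[n] for n in strand)

print(complement("ACGTTCA"))  # TGCAAGT
```

Applying `complement` twice gives back the original string, which is exactly the redundancy that makes repair possible.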

As an example, here a missing T and a missing G are repaired via their opposite sides:

<-------               <-------
 ACG TCA                ACGTTCA
 |||||||      -->       |||||||
 T CAAGT                TGCAAGT
 ------->               ------->

         repair process
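The repair in the diagram can be imitated in a few lines. This is only a sketch under my own conventions: gaps are written as spaces, and a position where both letters are missing is simply left alone.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def repair(top: str, bottom: str) -> tuple[str, str]:
    """Fill gaps (written as spaces) in either strand using the letter opposite."""
    fixed_top, fixed_bottom = [], []
    for t, b in zip(top, bottom):
        if t == " " and b != " ":
            t = PAIR[b]          # rebuild the top letter from the bottom one
        elif b == " " and t != " ":
            b = PAIR[t]          # rebuild the bottom letter from the top one
        fixed_top.append(t)
        fixed_bottom.append(b)
    return "".join(fixed_top), "".join(fixed_bottom)

print(repair("ACG TCA", "T CAAGT"))  # ('ACGTTCA', 'TGCAAGT')
```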

This redundancy also enables copying. First the two DNA strands are separated, leaving all the nucleotides “waiting to be repaired”. Repairs are then initiated, and the DNA molecule is “zipped up again”, leading to two functional copies of the DNA fragment.

This ‘zipping up’ process is called the polymerase reaction, and it will turn out to be vital for this story. All of life utilizes the polymerase reaction to duplicate DNA. DNA is fully compatible all the way from the lowliest virus to the mightiest tree.

Of specific note are the little arrows drawn in the DNA diagrams in this post: our genetic material has a preferred direction, and it can only be processed in that direction. This direction is reversed on opposite sides of the DNA.

           1                              2      &      3         
 
                 <-------          <-------             <-------
                  ACGTTCA           ACGTTCA              ACGTTCA
              ^   |||||||    -->    |||||||     -->      |||||||
             /                      TGCA                 TGCAAGT
<-------    /                       --->                 ------->
 ACGTTCA   /                                              Done!
 |||||||                                                  
 TGCAAGT                                              two copies now
 ------->  \                           <---             <-------
            \                          TTCA              ACGTTCA
             \    |||||||    -->    |||||||     -->      |||||||
              v   TGCAAGT           TGCAAGT              TGCAAGT
                  ------->          ------->             ------->

Summarising, to copy DNA:

  1. the strands are separated (‘denatured’)
  2. new nucleotides attach themselves
  3. the two new DNA molecules get zipped up (‘polymerised’)

The astute reader will have noted that this ‘doubling’ of DNA can be used to cause a chain reaction, where we first get one copy, then 2, then 4, then 8, 16, 32 etc.

If we could do, say, 40 rounds of duplication we could turn a single stretch of DNA into a trillion copies. This would then generate enough DNA to detect it “by eye” if necessary!
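To check the trillion, here is the doubling written out. A sketch assuming every round is a perfect duplication, which real reactions do not quite manage:

```python
def copies_after(rounds: int, start: int = 1) -> int:
    """Number of DNA copies after `rounds` perfect doublings."""
    return start * 2 ** rounds

for rounds in (1, 2, 3, 10, 40):
    print(rounds, copies_after(rounds))
# The last line printed is: 40 1099511627776 (just over a trillion)
```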

Note: to learn more about DNA, RNA, proteins & life, the briefest of summaries can be found in my post DNA: The Code of Life, which also includes links to >2 hours of video with q&a.

Adding some complication

Nature has, as far as we know, had over 4 billion years to work on “the architecture of life”. So it turns out that nothing is really simple.

In reality, life does not randomly go about copying bits of DNA. The ‘polymerase’ reaction as described above does not operate on fully denatured (separated) strands. Polymerase is instead used to complete a DNA copy, starting from a bit of existing dual stranded DNA. So polymerases can do this very well:

<-------           <-------
 ACGTTCA            ACGTTCA
 |||||||    ->      |||||||
 TGCA               TGCAAGT
 --->               ------->

But they can’t do this:

<-------           <-------
 ACGTTCA            ACGTTCA
 |||||||    ->      |||||||
                    TGCAAGT
                    ------->

In other words, polymerase can’t operate on a single strand alone - it needs to start with a short double stranded bit, and then polymerase can continue the work. This “starter bit” of dual stranded DNA can be created with a primer.

In the diagram above, we can make polymerase do its work by adding some “TGCA” single stranded DNA as a primer. It will bind solidly to the ‘ACGT’ part because it is the exact ‘complement’ (remember the A/T, C/G “magnetism”):

<-------                      <-------            <-------
 ACGTTCA                       ACGTTCA             ACGTTCA
 |||||||    + TGCA     -->     |||||||    -->      |||||||
              ---->            TGCA                TGCAAGT
                               ---->               ------->
             primer                     polymerase 
                                        reaction
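The "no primer, no copy" rule can be put into a toy model. This is a sketch with simplifications of my own: the function name `polymerase` is hypothetical, strands are written left to right as in the diagram, and the primer is assumed to sit at the left end of the template.

```python
from typing import Optional

PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def polymerase(template: str, primer: str) -> Optional[str]:
    """Complete the strand opposite `template`, starting from `primer`.

    Returns None when there is no primer, or when it does not pair
    with the left end of the template: nothing to start from.
    """
    start = "".join(PAIR[n] for n in template[:len(primer)])
    if not primer or primer != start:
        return None
    rest = "".join(PAIR[n] for n in template[len(primer):])
    return primer + rest

print(polymerase("ACGTTCA", "TGCA"))  # TGCAAGT
print(polymerase("ACGTTCA", ""))      # None
```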

Herein lies the key insight - by adding a primer, we can selectively make denatured DNA suitable for the polymerase reaction. Because fully single stranded DNA can’t be copied, copies will only be made where a primer has attached, and it will only attach to the DNA we care about.

Because no copying (‘polymerasing’) happens without a matching primer, we can use the primer as our “search term” in DNA. Ordered online & delivered as a bit of fluid, the primer is the selector of the DNA we are interested in.
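Seen this way, a primer behaves a lot like a substring search. A minimal sketch, assuming exact matches only (real primers tolerate some mismatches) and strands written left to right as in the diagrams; `binding_sites` is a name I made up:

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def binding_sites(template: str, primer: str) -> list[int]:
    """Positions in `template` where `primer` can pair letter for letter.

    The primer binds wherever its complement occurs in the template.
    """
    target = "".join(PAIR[n] for n in primer)
    return [i for i in range(len(template) - len(target) + 1)
            if template[i:i + len(target)] == target]

print(binding_sites("ACGTTCA", "TGCA"))  # [0]
print(binding_sites("GGGGGGG", "TGCA"))  # []
```

An empty list means the "search term" was not found, so in this model no copying would start.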

Primers

To detect a specific virus, primers are ordered that match up to bits of the viral genome. We do need to make sure, though, that we pick DNA that is not also present in other organisms.

A typical primer is 20 nucleotides long, and it can be extremely specific. Primers can be designed that don’t just match specific viruses, but also specific strains.

It may be somewhat surprising that “only” 20 letters suffice for such a strong match but this is due to statistics. 20 nucleotides represent 4 to the power of 20 possibilities, which is around 1 trillion. Most detections rely on two primers (see below), which multiplies the specificity by another factor of 1 trillion.
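The statistics are quick to verify. In this sketch the second half rests on an assumption of mine: that a genome behaves like random text, which real genomes only approximate.

```python
def possible_sequences(length: int) -> int:
    """Number of distinct DNA strings of the given length (4 letters per position)."""
    return 4 ** length

one_primer = possible_sequences(20)
print(one_primer)  # 1099511627776, around 1 trillion

# Assumption: the human genome (about 3 billion letters) behaves like random text.
GENOME_LENGTH = 3_000_000_000
print(GENOME_LENGTH / one_primer)  # roughly 0.0027 expected chance matches
```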

For robustness, multiple primer (pairs) can be used so that a virus (or gene) can be detected even if one part of it has mutated.

Copying selected DNA in the lab

Nature surely is clever about this copying, but can we replicate it in the lab? It turns out that by borrowing a bit from the bacterial kingdom, we can.

The ingredients required:

  1. The DNA we are interested in (from a patient perhaps)
  2. Primer material (order online)
  3. New nucleotides (off the shelf)
  4. Polymerase from a high-temperature bacterium (off the shelf)
  5. Food & nutrients for the polymerase (off the shelf)

Conveniently, we can mix all these together in a single vial. In this way, PCR is like a single pan recipe. If you are in a hurry, premixed vials containing ingredients 2, 3, 4 and 5 are available.

First we need to separate the strands. DNA does this automatically at around 95°C. So we heat up a test-tube (with all the ingredients) to this temperature, and wait two minutes.

Now our tube is full of single stranded DNA (and all the other ingredients). We then cool things down again, typically to 50-60°C. This makes the primer DNA bind to the single stranded DNA we are interested in. Because remember, the primer is the ‘search term’.

COVID-19 labels. Photo: Sam Nicholls / @samstudio8

After only 15 seconds, the primers will have bound to the right pieces of single stranded DNA. These bits of DNA are now ready to be copied.

We then raise the temperature to 68°C. Why 68°C? It turns out that the high-temperature bacterium (Thermus aquaticus aka Taq) we got the polymerase from does its best work at that temperature. After only 15 seconds, ‘Taq polymerase’ will have done its copying work on the single-stranded DNA bits that have bound to a primer (if they are short bits - longer stretches require more copying time).

With some luck many of the interesting single stranded bits of DNA have now been duplicated. In reality, we don’t exactly get two copies of every strand, but as long as we got more than one copy it is good.

We then heat up the tube to 95°C again and restart the process. This cycle is repeated dozens of times. Even if we only gained 30% material per cycle, after 27 cycles we would have ‘amplified’ the relevant bit of DNA by three orders of magnitude.
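How many cycles a given amplification takes depends strongly on how much material is gained per cycle. A small sketch of the compound growth, assuming the gain is the same in every cycle (in practice it tails off as ingredients run out):

```python
import math

def amplification(cycles: int, gain_per_cycle: float) -> float:
    """Factor by which the target DNA has grown after `cycles` cycles."""
    return (1 + gain_per_cycle) ** cycles

def cycles_needed(factor: float, gain_per_cycle: float) -> int:
    """Smallest whole number of cycles that reaches at least `factor`."""
    return math.ceil(math.log(factor) / math.log(1 + gain_per_cycle))

print(cycles_needed(1000, 0.30))  # 27 cycles at a 30% gain per cycle
print(cycles_needed(1000, 1.00))  # 10 cycles with perfect doubling
```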

Ok, then what?

If we have done our work right, the test tube is now teeming with copies of the bit of DNA we are interested in. But how could we tell? Various mechanisms are used. Let’s say the (viral) DNA we care about was not present. In that case all these temperature cycles have achieved almost nothing - the primer material didn’t latch on to anything, the polymerase had nothing to do.

This means that we could simply measure how much DNA there now is in the tube, and if this is significantly more than there was before the copying cycles, apparently we scored a hit.

Such detection is possible with “DNA staining”, which uses molecules that become fluorescent once they are attached to (any) double stranded DNA. By observing (with a suitable camera) whether the amount of light emitted increases over the cycles, we can tell if the PCR is actually multiplying DNA.
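To make the idea concrete, here is a toy version of "watch the light grow". Everything in it is made up for illustration: the light model, the background level and the factor-of-ten rule are my assumptions, not how a real instrument is calibrated.

```python
def simulate_light(cycles: int, start_dna: float, gain_per_cycle: float = 1.0,
                   background: float = 1.0) -> list[float]:
    """Made-up light readings: constant background plus a signal proportional to the DNA."""
    return [background + start_dna * (1 + gain_per_cycle) ** c
            for c in range(cycles + 1)]

def scored_a_hit(readings: list[float], min_ratio: float = 10.0) -> bool:
    """True if the final reading is at least `min_ratio` times the first one."""
    return readings[-1] >= min_ratio * readings[0]

with_target = simulate_light(30, start_dna=1e-6)    # a tiny amount of matching DNA
without_target = simulate_light(30, start_dna=0.0)  # nothing for the primers to bind to

print(scored_a_hit(with_target))     # True
print(scored_a_hit(without_target))  # False
```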

More precision

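If we look at the US CDC PCR information for COVID-19, we find that it lists primer information, but also a “probe”:

    Forward primer:  GACCCCAAAATCAGCGAAAT
    Reverse primer:  TCTGGTTACTGCCAGTTGAATCTG
    Probe:           ACCCCGCATTACGTTTGGTGGACC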

We’ll get to why there are two primers later, but the third line is the interesting one. A probe is again a bit of single-stranded DNA, like the primers, but it comes with a light generating molecule attached. If it manages to bind to a piece of (complementary) DNA, it becomes fluorescent.

If we again, like with the DNA staining, measure how the amount of light increases during PCR, we get a very precise confirmation that the PCR process is 1) amplifying something and 2) it is actually the DNA we were expecting.

(Note that probes are actually slightly more complicated than this - they bind to DNA, but during polymerasing get dislodged & then split up. It is this splitting that causes the fluorescence).

But why are there two primers?

As noted, DNA has two strands, and it can only be copied in one direction. Here is some actual COVID-19 DNA:

    ------------------------------------------------------------------------>
A:  GACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGA
    ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
B:  CTGGGGTTTTAGTCGCTTTACGTGGGGCGTAATGCAAACCACCTGGGAGTCTAAGTTGACCGTCATTGGTCT
   <------------------------------------------------------------------------

If we heat this bit of DNA up so it denatures, we end up with two single strands:

     ------------------------------------------------------------------------>
A:   GACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGA
     ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


     ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
B:   CTGGGGTTTTAGTCGCTTTACGTGGGGCGTAATGCAAACCACCTGGGAGTCTAAGTTGACCGTCATTGGTCT
    <------------------------------------------------------------------------

The first primer for COVID-19 is: GACCCCAAAATCAGCGAAAT, and we can indeed see that this stretch of the COVID-19 genome starts with that string. This means it would bind here, and initiate the polymerase reaction.

     ------------------------------------------------------------------------>
     GACCCCAAAATCAGCGAAAT
     ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
B:   CTGGGGTTTTAGTCGCTTTACGTGGGGCGTAATGCAAACCACCTGGGAGTCTAAGTTGACCGTCATTGGTCT

We would then have taken our original bit of DNA, denatured it into two single stranded stretches, and copied one of these into a double-stranded whole again. This is not amplification: we started with one double-stranded bit of DNA, and we also ended up with one!

So to actually make things work, we need a second primer for the other single-stranded stretch we produced.

The second COVID-19 primer is: TCTGGTTACTGCCAGTTGAATCTG, and lo, it matches the other single strand (once we “reverse complement” it):

A:   GACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGA
     ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
                                                     GTCTAAGTTGACCGTCATTGGTCT
    <------------------------------------------------------------------------

With this primer, polymerase can create the second double-stranded copy of DNA.

As a bonus, in the CDC page we found the DNA for the ‘probe’. It is ACCCCGCATTACGTTTGGTGGACC, which we indeed find in the middle:

                           ACCCCGCATTACGTTTGGTGGACC
     ------------------------------------------------------------------------>
A:   GACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGA
     ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
B:   CTGGGGTTTTAGTCGCTTTACGTGGGGCGTAATGCAAACCACCTGGGAGTCTAAGTTGACCGTCATTGGTCT
    <------------------------------------------------------------------------
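All of this is easy to check with a few lines of code. The sketch below uses only the sequences quoted in this post, with constant names of my own choosing, and looks for both primers and the probe in strand A:

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Complement every letter and read the result backwards."""
    return "".join(PAIR[n] for n in reversed(seq))

STRAND_A = "GACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGA"
FORWARD_PRIMER = "GACCCCAAAATCAGCGAAAT"
REVERSE_PRIMER = "TCTGGTTACTGCCAGTTGAATCTG"
PROBE = "ACCCCGCATTACGTTTGGTGGACC"

# str.find returns the 0-based position of the first match, or -1 if there is none.
print(STRAND_A.find(FORWARD_PRIMER))                      # 0
print(STRAND_A.find(reverse_complement(REVERSE_PRIMER)))  # 48
print(STRAND_A.find(PROBE))                               # 22
```

The forward primer sits at the very start, the reverse primer (once reverse complemented) covers the last 24 letters, and the probe lies in between.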

One final twist

At the very beginning of this article we noted the impressive name of this technique: Reverse Transcription Real-Time Quantitative Polymerase Chain Reaction or qRT-PCR.

It turns out that the virus behind COVID-19 is actually an RNA virus. DNA and RNA both carry genetic material. The techniques described above only work on DNA and not on RNA. Luckily nature has supplied us with an enzyme called ‘reverse transcriptase’. Once added to RNA, this produces the DNA variant of the same genetic material. This is the ‘Reverse transcription’ part of the name.
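As a rough picture of what reverse transcriptase does: RNA uses the letter U where DNA uses T, and the enzyme builds the complementary DNA strand. A minimal sketch, with the function name being mine and the output written in the same left-to-right order as the diagrams above:

```python
# RNA letter -> the DNA letter that pairs with it (U pairs with A).
RNA_TO_DNA_PAIR = {"A": "T", "U": "A", "C": "G", "G": "C"}

def reverse_transcribe(rna: str) -> str:
    """Return the DNA strand complementary to `rna`, letter by letter."""
    return "".join(RNA_TO_DNA_PAIR[n] for n in rna)

# The first 20 letters of strand A, written as RNA (T replaced by U).
print(reverse_transcribe("GACCCCAAAAUCAGCGAAAU"))  # CTGGGGTTTTAGTCGCTTTA
```

The result is the start of strand B, after which ordinary PCR can take over.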

That doesn’t sound so hard!

Well.. yes and no. We glossed over many important details. For example, how do you actually gain access to the DNA? This requires mechanical preprocessing. In addition, once the DNA has been released, we need to make sure that we extract it, and only it, and add it to the PCR vial. You can’t just put some snot in there and expect it to work (although it might, which might not be what you want).

In addition, for this to be useful and reliable, great care must be taken not to (cross-)contaminate samples. DNA is everywhere and can easily end up in places where it should not be. From my own DNA research, I fondly recall my system detecting ‘human or monkey DNA’ in every sample we tried, even though these were supposed to be bacterial samples.

We also sort of glossed over how we pick primers to detect specific diseases or organisms. It turns out that primer design is also somewhat of an art, and just picking some DNA will not end well.

So in short, although in this page I may have explained the basics of qRT-PCR, realize that people take multi-year courses to learn how to do this well.

I do hope that you found this entertaining, and as ever, feedback is very welcome on bert@hubertnet.nl or @PowerDNS_Bert.