Economist submission: Towards a multi-stranded genome

Note: For context, please see this article on the Economist Job. This is a condensed and updated version of my earlier post On the pan-genome.

Towards a multi-stranded genome

Given that the completion of the Human Genome Project was announced in 2003, one could be forgiven for thinking the kinks would have been worked out by now. It turns out, however, that the human DNA reference sequence as published today is neither complete nor a good description of mankind.

For example, recent sequencing of DNA of people of African descent uncovered that 10% of their genome was not present at all in the human reference sequence. DNA missing from the reference sequence is to a large extent invisible to scientists, leading to significant blind spots in research, for example on the effectiveness of medicines in the presence or absence of certain genes.

A key problem is that, even though it was generated from the DNA of 20 individuals, the reference genome as published lacks the ability to encode variations. In other words, the reference represents only a single genetic snapshot. Recent studies have uncovered that large swathes of the reference do not even describe the most common gene variants but to a large extent consist of DNA from a single subject (‘RPC-11’) who is now known to be at high risk for diabetes. Such uneven DNA coverage biases research.

Ideally, the reference genome would fully capture the variation occurring in our DNA, showing for each part which variants exist and how common they are. For some bacteria this goal has already been achieved with the publication of ‘pan-genomes’.

Human DNA, however, is vastly more complex: structural variations mean that stretches of the genome can be longer, shorter or completely absent in some people. As yet no technology has been developed to capture all such variations in a way that is suitable for research as currently practiced.

As it stands, each gene or mutation is described by a simple street-like address, which for example locates the insulin gene at position 2159779 of chromosome 11. A reference that ‘forks’ to capture multiple variants no longer allows for such simple linear addressing. Finding a solution that is simple to use yet generic enough to cover structural variation is challenging.
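
To make the addressing problem concrete, here is a minimal sketch in Python. It is a toy illustration only: the segment names, the sequences and the `GraphAddress` type are all hypothetical and do not correspond to any real pan-genome format or tool. It shows that once the reference forks, the same stretch of DNA ends up at a different linear position depending on which branch a person carries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearAddress:
    chromosome: str
    position: int  # offset along the single reference sequence


@dataclass(frozen=True)
class GraphAddress:
    node_id: str  # which segment of the forked reference (hypothetical naming)
    offset: int   # 0-based offset within that segment


# The street-like address quoted in the text above.
insulin = LinearAddress(chromosome="11", position=2159779)

# Toy forked reference: a shared start, two alternative middles, a shared end.
segments = {
    "start": "ACGT",
    "alt_short": "G",
    "alt_long": "GGTTA",
    "end": "CCAT",
}

# Each person follows one path through the fork.
paths = {
    "person_a": ["start", "alt_short", "end"],
    "person_b": ["start", "alt_long", "end"],
}


def linear_offset(path, address):
    """Convert a graph address to a 0-based offset along one person's path."""
    total = 0
    for node_id in path:
        if node_id == address.node_id:
            return total + address.offset
        total += len(segments[node_id])
    raise KeyError(f"segment {address.node_id!r} is not on this path")


# The first base of the shared "end" segment has one graph address...
first_base_of_end = GraphAddress("end", 0)

# ...but a different linear offset for each person.
offsets = {name: linear_offset(path, first_base_of_end) for name, path in paths.items()}
print(offsets)  # {'person_a': 5, 'person_b': 9}
```

The graph address stays stable across people, but it is no longer a single number, which is the property most existing software relies on.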

Recently, after consulting with over 65 basic research, clinical, and bioinformatic scientists, the US National Institutes of Health have put out funding to create such technology, but given the magnitude of the problem the advent of a solution may take a while.

Meanwhile Heng Li, a noted DNA bioinformatics pioneer now at Harvard Medical School, has released software that can process and address most but (crucially) not all structural variations. The risk is that standardising on an early solution that is not quite universal may create future blind spots that could be even harder to root out.

To tide us over, there are calls to rejig the existing DNA reference to a consensus sequence so that each gene stored in there is at least the most common one, allowing for the continued use of existing software (with its linear addressing) but with reduced genetic bias.
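
The idea of a consensus sequence can be sketched in a few lines of Python. This is a simplified illustration under strong assumptions: the sample sequences are invented, they are assumed to be already aligned and of equal length, and the function name `consensus` is hypothetical. A real effort would work from measured variant frequencies across large populations, and this simple majority vote does not handle structural variation at all.

```python
from collections import Counter


def consensus(aligned_sequences):
    """Return the most common base at each position of equal-length sequences."""
    if not aligned_sequences:
        raise ValueError("need at least one sequence")
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("sequences must be aligned to the same length")
    # zip(*...) walks the sequences column by column, one position at a time.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_sequences)
    )


# Four invented samples that differ at the third and fifth positions.
samples = ["ACGTA", "ACGTT", "ACCTA", "ACGTA"]
result = consensus(samples)
print(result)  # ACGTA
```

Because the result is still one linear string, existing position-based addressing keeps working, which is the appeal of this stopgap.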

Whatever the eventual outcome, the publication of a true human pan-genome that is fully inclusive would be a great boon to DNA research and would certainly lead to better and more universal understanding of life and health.