Pangenomes are here to bring equity to genomics
Reference genomes: what they are and why the next big thing in genomics is the human pangenome.
The Human Genome Project started in 1990 and was declared complete in April of 2003.
In reality, only about 90% of the genome was finished at that point: the pesky telomeres (ends of chromosomes), centromeres (the pinched-in regions where a chromosome's arms meet) and a bunch of lengthy repetitive regions remained elusive.
The human genome wasn't actually completed telomere-to-telomere (T2T) until March of 2022, and that's thanks to long-read sequencing and mapping technologies that didn't exist when the human genome was originally sequenced.
But 'the human genome' is a bit of a misnomer here. The samples used to sequence it were supposed to be pooled at random from 20 donors, but most of that pool turned out to be low quality. As a result, 66% of the original human genome sequence came from one male donor, RPCI-11, who was later determined to have roughly 50% European and 50% African ancestry.
While the completion of the 'human genome' in 2003 was historic, and we learned a lot from it, we also learned that a single reference genome that's supposed to represent the 'average' human actually misses a ton of important variation.
Studies since the completion of that genome have shown that it is missing a lot of the variation observed across ancestral populations!
What this means in practice is that something flagged as a 'variant' against our current reference genome might not be unusual at all: depending on that individual's genetic ancestry, it may simply be the common version of the sequence in their population.
This has HUGE implications for genomics, tracking disease, and identifying causal variants of those diseases in diverse populations!
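To make that concrete, here is a toy sketch of how the same sample can look 'variant' or 'normal' depending on what it is compared against. Everything in it is an assumption for illustration: the 10-base sequences are made up, `call_variants` is a hypothetical helper rather than a real variant caller, and the sequences are assumed to be pre-aligned and equal in length.

```python
def call_variants(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Toy caller: report (position, reference_base, sample_base) at every mismatch."""
    if len(reference) != len(sample):
        raise ValueError("toy example assumes equal-length, pre-aligned sequences")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Made-up sequences, not real genomic data.
linear_reference = "ACGTACGTAC"        # the single 'official' reference
population_b_haplotype = "ACGTTCGTAC"  # common sequence in a hypothetical population B
sample = "ACGTTCGTAC"                  # an individual from population B

variants_vs_linear = call_variants(linear_reference, sample)
variants_vs_population = call_variants(population_b_haplotype, sample)

print(variants_vs_linear)      # one mismatch at position 4 (A in the reference, T in the sample)
print(variants_vs_population)  # no mismatches
```

Against the single linear reference the sample carries a 'variant', but against a sequence that is common in its own population it carries none. Real variant callers are far more involved, but the dependence on the choice of reference is the same.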
So, to recap, we sequenced a handful of people, stitched the results into a single linear sequence that serves as our reference, and called the genome done.
And we've continued to use that genome, despite knowing that, at best, it's a huge compromise and probably doesn't do a super good job of helping us identify all of the relevant variation in diverse populations.
Fortunately, humans aren't the only species that has this genetic diversity problem, and microbes have served as a reasonable proxy for developing a better solution.
Early work with mixed bacterial populations showed us that 'graph' genomes are much more powerful than linear ones.
Graph genomes are special because instead of settling on a single consensus, they record all of the variation observed at each position in the genome as alternative paths through the graph, which means you actually retain all of the important ancestral variation.
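For the curious, here is a minimal sketch of that idea in Python. It is a toy under stated assumptions, not how any real pangenome tool is implemented: the `VariationGraph` class, its method names, and the short sequences are all hypothetical, and the graph is assumed to have no cycles. Shared stretches of sequence are stored once as segments, and a position where people differ becomes a fork with one branch per observed version.

```python
from collections import defaultdict


class VariationGraph:
    """Toy graph genome: segments of sequence joined by directed edges."""

    def __init__(self) -> None:
        self.segments: dict[str, str] = {}                       # segment id -> bases
        self.edges: defaultdict[str, list[str]] = defaultdict(list)  # id -> successor ids

    def add_segment(self, segment_id: str, bases: str) -> None:
        self.segments[segment_id] = bases

    def add_edge(self, source: str, target: str) -> None:
        self.edges[source].append(target)

    def haplotypes(self, start: str, end: str):
        """Yield the sequence spelled by every path from start to end (acyclic graph assumed)."""
        stack = [(start, self.segments[start])]
        while stack:
            segment_id, sequence = stack.pop()
            if segment_id == end:
                yield sequence
                continue
            for successor in self.edges[segment_id]:
                stack.append((successor, sequence + self.segments[successor]))


# Made-up example: two versions of one position (A or T) flanked by shared sequence.
graph = VariationGraph()
graph.add_segment("left", "ACGT")
graph.add_segment("allele_A", "A")
graph.add_segment("allele_T", "T")
graph.add_segment("right", "CGTAC")
graph.add_edge("left", "allele_A")
graph.add_edge("left", "allele_T")
graph.add_edge("allele_A", "right")
graph.add_edge("allele_T", "right")

known_haplotypes = sorted(graph.haplotypes("left", "right"))
print(known_haplotypes)
```

A linear reference would have to pick one of the two branches and treat the other as a deviation. The graph keeps both, so a sample that follows either path matches the reference exactly.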
Fortunately, we can do exactly the same thing in humans, and capture our species-wide diversity in a pangenome!
This work has already begun, and teams are contributing datasets to the Human Pangenome Reference Consortium, a group dedicated to bringing equity to the reference genome.