Why short-read sequencing can't ever get us a 'whole genome'

Short-reads miss things, or: How I learned to stop worrying and love the long-read.

This post originally appeared in the Omic.ly Premium 36 newsletter. To get Omic.ly Premium in your inbox every Sunday, subscribe to the Premium tier or higher.

Chromosome 21 is about 47 million contiguous bases and Chromosome 1 is about 248 million contiguous bases; the other 22 fall somewhere in between.

When we sequence a genome using short reads, the first step in the process after DNA isolation is fragmentation.

We take those big, beautiful, contiguous bases and chop them up into 200-300 base pair fragments.

I wonder if we lose important information when we do that?

We do!

And this is because when we 'align' those hundreds of millions of 200-300 base pair fragments, we’re relying on the unique sequence within each fragment, and on the overlaps between fragments, to figure out where they belong in the genome.

It’s like putting together a huge puzzle!

But what happens when those fragments aren't unique?

Roughly 20% of the genome is repetitive sequence longer than 200 bp, so we’re left with millions of fragments that can’t be aligned to the right places.
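Here's a toy sketch of that problem in Python. Everything in it is made up for illustration: the tiny `GENOME`, the `REPEAT` sequence and the `find_all` helper are assumptions, and real aligners are far more sophisticated than an exact string search. The idea is the same though: a read from unique sequence has one home, while a read that sits entirely inside a repeat has several equally good ones.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every 0-based position where the read matches the genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


def placement(hits: list[int]) -> str:
    """Label a read by how many places it could belong."""
    if not hits:
        return "unmapped"
    return "unique" if len(hits) == 1 else "ambiguous"


# Hypothetical toy genome: three unique stretches separated by two
# identical copies of the same repeat.
REPEAT = "ACGTTGCATCGGATACCGTAGCTA"
GENOME = "GGATCCTTAGCA" + REPEAT + "TTCAGGACCTAA" + REPEAT + "CCGATTAGGCTA"

unique_read = "TTCAGGACCTAA"   # taken from the unique stretch in the middle
repeat_read = REPEAT[4:20]     # lies entirely inside the repeat

unique_hits = find_all(GENOME, unique_read)
repeat_hits = find_all(GENOME, repeat_read)

print(unique_read, placement(unique_hits), unique_hits)
print(repeat_read, placement(repeat_hits), repeat_hits)
```

The second read matches both copies of the repeat equally well, so there is nothing in the read itself that tells us which copy it came from. A read long enough to reach into the unique sequence on either side would not have that problem.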

This becomes a big problem in disease diagnosis because if we don’t know the orientation or location of these fragments, or worse, they’re mapped to the wrong places, we can totally miss important indicators of genetic disease.

Short-reads are especially problematic for:

Pseudogenes - These are nonfunctional copies of real genes. Viruses liked to infect our ancestors, and sometimes they copied and repeated adjacent sequences as they entered and exited our genome. Those copies now sit in a genetic no-man’s land where they mostly aren’t expressed, and they serve only to confuse and deceive.

Translocations and Inversions - These happen when genetic content is rearranged and put back in the wrong place (translocation) or flipped (inversion). These errors can have a huge impact on gene regulation, especially when the rearrangement is unbalanced or fuses two genes together, and we mostly miss them with short reads.

Repeat Expansions - Trinucleotide repeat expansions cause around 50 genetic diseases, including Huntington’s, Fragile X and spinocerebellar ataxia. ALS can also be caused by a repeat expansion. Many of these expansions are longer than a short read, so they can’t be sized with short reads, and archaic techniques like Southern blot are still used to measure them.

Variant Haplotyping and Phasing - You have two copies of each chromosome, and the different versions of a sequence carried on those copies are called alleles. Knowing which copy each variant sits on is important for understanding a number of diseases, especially those caused by compound heterozygosity (two different mutations in the same gene, one on each copy, often spaced very far apart).

Mitochondrial Genome - Did you know your cells actually have two genomes? A lot of the content of your mitochondrial genome is also repeated in your nuclear genome. The copies in your nuclei don’t matter functionally, but short reads from them can be mistaken for real mitochondrial sequence.
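To make the repeat expansion case from the list above concrete, here's a minimal Python sketch. The flanking sequences, the 120-copy CAG tract and the `sized_repeat_count` helper are all hypothetical, chosen only to illustrate the point: a read can tell us how long a repeat is only if it spans the whole tract and reaches unique sequence on both sides.

```python
from typing import Optional


def sized_repeat_count(read: str, left_flank: str, right_flank: str,
                       unit: str = "CAG") -> Optional[int]:
    """Return the number of repeat units if the read contains both flanks, else None."""
    left = read.find(left_flank)
    if left == -1:
        return None
    start = left + len(left_flank)
    right = read.find(right_flank, start)
    if right == -1:
        return None  # the read ends inside the repeat, so its length is unknown
    tract = read[start:right]
    copies, remainder = divmod(len(tract), len(unit))
    if remainder != 0 or tract != unit * copies:
        return None  # the sequence between the flanks is not a clean repeat
    return copies


# Hypothetical locus: unique flanks around 120 copies of CAG (360 bases).
LEFT = "GATTCCGTAC"
RIGHT = "TTGACCGGAT"
locus = LEFT + "CAG" * 120 + RIGHT

short_read = locus[:300]   # a 300-base read that starts in the left flank
long_read = locus          # a read that covers the entire locus

print(sized_repeat_count(short_read, LEFT, RIGHT))
print(sized_repeat_count(long_read, LEFT, RIGHT))
```

The 300-base read runs out before it reaches the right flank, so all it can say is that the repeat is at least as long as the read. The read that spans the whole locus can simply be counted.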

How can we capture all of this information to make sure we’re getting an accurate representation of the whole genome without missing anything important?

We can use high quality long-reads, and luckily for us, we have a few good options to fill in these short-read gaps!

