Being accurate about sequencing accuracy

High-throughput sequencing metrics: let’s be accurate about accuracy.

This post originally appeared in the Omic.ly Premium 38 newsletter. To get Omic.ly Premium in your inbox every Sunday, subscribe to the Premium tier or higher.

If “What is a genome?” is the most loaded question in the world of short-reads, the most loaded question for long-reads is “What’s the accuracy?”

A lot of this comes down to the semantics that PacBio and Oxford Nanopore (ONT) use when talking about accuracy, because in conversation they tend to use the different types of accuracy essentially interchangeably.

There are 3 main types of accuracy in sequencing:

Base Accuracy - The accuracy of each individual base call. This is the traditional Phred Q-score and a general measure of quality used for assessing most short-read sequencing outputs. Element is Q40+ (99.99%), Illumina/Complete is Q30+ (99.9%), and Thermo is Q20+ (99%).

Read Accuracy - Up one level from base accuracy: the percentage of the bases in a read that are called correctly. PacBio HiFi reads are 99.9-99.999% accurate, ONT is 70-99.9% accurate.

Consensus Accuracy - The accuracy of the consensus sequence, that is, the accuracy after assembling all of the data and using multiple overlapping reads to correct the errors.
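To make the first two definitions concrete, here is a minimal Python sketch of the standard Phred relationship (error probability = 10^(-Q/10)) and of one common way to summarize a read: average the per-base error probabilities rather than the Q-scores themselves. The function names are made up for illustration and are not taken from any vendor's software.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Per-base accuracy implied by a Phred score: 1 - 10^(-Q/10)."""
    return 1.0 - 10 ** (-q / 10)


def accuracy_to_phred(accuracy: float) -> float:
    """Inverse conversion: Q = -10 * log10(error probability)."""
    return -10 * math.log10(1.0 - accuracy)


def mean_read_accuracy(base_qualities) -> float:
    """Expected read accuracy from per-base Phred scores.

    Averages error probabilities, not Q-scores, because Q is a
    log scale and a few bad bases dominate the error budget.
    """
    errors = [10 ** (-q / 10) for q in base_qualities]
    return 1.0 - sum(errors) / len(errors)


if __name__ == "__main__":
    for q in (20, 30, 40):
        print(f"Q{q}: {phred_to_accuracy(q):.4%} per-base accuracy")
```

Run in the other direction, `accuracy_to_phred(0.75)` comes out at roughly Q6, which is where a read with ~25% of its bases wrong sits on this scale.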

Before PacBio released HiFi reads, both PacBio and ONT required 'polishing' to produce useful datasets. Reads where ~25% of the bases are wrong are challenging to use for most sequencing applications.

Polishing uses short reads, with their very high accuracy, to correct the errors in the long reads. Because these hybrid sequencing approaches are very expensive, long-reads struggled to gain traction beyond niche genome scaffolding applications.

Another expensive method for correcting these sorts of errors is to just do more sequencing.

This works for PacBio because the errors in their data are random, so they rarely land on the same base twice and get outvoted as coverage increases. This is why HiFi circular consensus sequencing (CCS) has such high read accuracy: it’s a self-polishing method!

It also works for some ONT data errors, but their method is prone to systematic errors (errors that always happen in the same sequences), so sequencing more will not remove those. They require short-read polishing methods to be fully resolved.
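The difference between random and systematic errors can be sketched with a toy model. Assume each pass over a base is independently wrong with some probability, and that the consensus is a simple majority vote. This is a deliberate simplification: it is not how CCS or any real polisher computes a consensus, and the `systematic_fraction` parameter and function names are hypothetical, introduced only for illustration.

```python
from math import comb


def majority_error(per_pass_error: float, passes: int) -> float:
    """Probability that more than half of `passes` independent calls are wrong.

    Toy model: every call is simply right or wrong, and an odd number
    of passes is assumed so that ties cannot happen.
    """
    needed = passes // 2 + 1
    return sum(
        comb(passes, k)
        * per_pass_error ** k
        * (1 - per_pass_error) ** (passes - k)
        for k in range(needed, passes + 1)
    )


def consensus_error(random_error: float, systematic_fraction: float, passes: int) -> float:
    """Consensus error rate when some positions are wrong in every read.

    `systematic_fraction` is the (hypothetical) share of positions with a
    systematic error; extra passes never fix those, so they set a floor.
    """
    fixable = (1 - systematic_fraction) * majority_error(random_error, passes)
    return systematic_fraction + fixable


if __name__ == "__main__":
    for n in (1, 3, 9, 21):
        random_only = consensus_error(0.10, 0.0, n)
        with_systematic = consensus_error(0.10, 0.001, n)
        print(f"{n:>2} passes: random only {random_only:.2e}, "
              f"with systematic {with_systematic:.2e}")
```

In this model the random-only error rate keeps shrinking as passes are added, while the version with a systematic component flattens out at the systematic fraction no matter how much more sequencing is done.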

More recently, ONT have touted read accuracies at the same level as PacBio, with sequences for HG002 coming off their dual reader R10 pores with a 'duplex' read accuracy of 99.9%+.

But, at the end of the day, the key thing to know about accuracy is that traditional base accuracy is the more relevant measure of quality for short-reads, while read accuracy and consensus accuracy are more important for long-reads.

This will become more apparent as each of the two technologies settles into the niche where it excels: long-reads for high quality genomes, and short-reads for just about everything else!

