7. Quality Control (QC)
Whole-genome sequencing of viruses is a complex biotechnological process. Results can vary significantly in quality, in particular when the input material is scarce or degraded. Some parts of the sequence might be missing, and the bioinformatic analysis pipelines that turn raw data into a consensus genome sometimes produce artefacts. Such artefacts typically manifest as spurious differences between the sequence and the reference.
If such problematic sequences are included in a phylogenetic analysis, they can distort the resulting tree. The Nextstrain analysis pipeline, for example, therefore excludes sequences deemed problematic. Many such problems can be fixed by tweaking the pipeline or removing contaminants, so it is useful to spot them as early as possible.
Nextclade scans each query sequence for issues that may indicate problems during sequencing or assembly. It implements several Quality Control (QC) rules to flag sequences as potentially problematic. Each rule produces its own rule-specific metrics as well as a numeric quality score.
For each query sequence, each individual QC rule produces a quality score. These individual QC scores are empirically calibrated to fit the following thresholds:

| Score | Meaning | Color |
|---|---|---|
| 0 | the best quality | bright green |
| 0 to 29 | "good" quality | green to yellow |
| 30 to 99 | "mediocre" quality | yellow to orange |
| 100 and above | "bad" quality | red to bright red |
After all scores are calculated, the final QC score \( S \) is computed as follows:

\[ S = \sum_i \frac{S_i^2}{100} \]

where \( S_i \) is the score for an individual QC rule \( i \).
With this quadratic aggregation, multiple mildly concerning scores don’t result in a bad overall score, but a single bad score guarantees a bad overall score.
The final score has the same thresholds as the individual scores.
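The aggregation can be illustrated with a short Python sketch. It assumes the quadratic aggregation takes the form \( S = \sum_i S_i^2 / 100 \), which is consistent with the thresholds above (a single score of 100 yields a final score of 100). The function names are hypothetical and are not part of Nextclade; treating any score below 30 as "good" is also an assumption, made so that fractional final scores fall into a category.

```python
def aggregate_qc_score(rule_scores):
    """Combine individual rule scores S_i into a final score S.

    Assumed form of the quadratic aggregation: S = sum(S_i^2 / 100).
    """
    return sum(s * s / 100 for s in rule_scores)


def qc_status(score):
    """Map a score to the categories of the threshold table.

    Assumption: scores below 30 (including 0, "the best quality")
    are reported as "good", scores from 30 up to but not including
    100 as "mediocre", and scores of 100 and above as "bad".
    """
    if score >= 100:
        return "bad"
    if score >= 30:
        return "mediocre"
    return "good"


if __name__ == "__main__":
    for scores in ([20, 20, 20], [100, 0, 0], [29, 29, 29, 29]):
        total = aggregate_qc_score(scores)
        print(scores, round(total, 2), qc_status(total))
```

Under this assumed form, three rule scores of 20 aggregate to 12, which stays "good", whereas a single rule score of 100 already produces a final score of 100.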
Individual QC Rules
For SARS-CoV-2, we currently implement the following QC rules (the one-letter designations used in Nextclade Web are given in parentheses).
Missing data (N)
If your sequence misses more than 3000 sites (N characters), it will be flagged as bad.
Mixed sites (M)
Ambiguous nucleotides (such as Y, etc.) are often indicative of contamination (or superinfection), and more than 10 such non-ACGTN characters will result in a QC flag.
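The two character-count rules above (missing data and mixed sites) can be sketched in Python as follows. This is a minimal illustration and not Nextclade's implementation: the function names are hypothetical, only the two thresholds come from the text, and skipping the gap character `-` is an assumption made for aligned input.

```python
MISSING_THRESHOLD = 3000  # from the text: more than 3000 N characters
MIXED_THRESHOLD = 10      # from the text: more than 10 non-ACGTN characters

# Assumption: gap characters from an alignment count as neither
# missing data nor mixed sites.
NOT_MIXED = set("ACGTN-")


def count_missing_and_mixed(sequence):
    """Return (number of N characters, number of non-ACGTN characters)."""
    seq = sequence.upper()
    missing = seq.count("N")
    mixed = sum(1 for base in seq if base not in NOT_MIXED)
    return missing, mixed


def count_rule_flags(sequence):
    """Report whether the N and M thresholds from the text are exceeded."""
    missing, mixed = count_missing_and_mixed(sequence)
    return {
        "N": missing > MISSING_THRESHOLD,
        "M": mixed > MIXED_THRESHOLD,
    }


if __name__ == "__main__":
    example = "ACGT" * 10 + "N" * 3001 + "RY"
    print(count_missing_and_mixed(example))
    print(count_rule_flags(example))
```

The sketch only reports whether a threshold is exceeded; it does not attempt to reproduce how Nextclade converts these counts into numeric scores.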
Private mutations (P)
Sequences with more than 24 mutations relative to the closest sequence in the reference tree are flagged as bad. We will revise this threshold as the diversity of the SARS-CoV-2 population increases.
Mutation clusters (C)
If your sequence has clusters of 6 or more private differences within a 100-nucleotide window, it will be flagged as bad.
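One way to check for such clusters is a sliding window over the sorted positions of the private differences. The Python sketch below is an illustration under assumptions, not Nextclade's algorithm: the function names are hypothetical, positions are taken to be nucleotide coordinates, and two positions are considered to share a window when they are fewer than 100 nucleotides apart.

```python
def max_mutations_in_window(positions, window=100):
    """Largest number of positions that fit into any span of `window` nucleotides."""
    pos = sorted(set(positions))
    best = 0
    left = 0
    for right, current in enumerate(pos):
        # Shrink the window from the left until it spans fewer than
        # `window` nucleotides again.
        while current - pos[left] >= window:
            left += 1
        best = max(best, right - left + 1)
    return best


def has_mutation_cluster(positions, window=100, min_count=6):
    """True if at least `min_count` private differences share one window."""
    return max_mutations_in_window(positions, window) >= min_count


if __name__ == "__main__":
    clustered = [21000, 21010, 21025, 21040, 21060, 21090]
    scattered = [100, 3000, 8000, 15000, 21000, 28000]
    print(has_mutation_cluster(clustered))
    print(has_mutation_cluster(scattered))
```

The two-pointer scan visits each position once after sorting, so it stays cheap even for sequences with many private differences.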
Stop codons (S)
Replicating viruses cannot have premature stop codons in essential genes, so such premature stops are an indicator of problematic sequences.
However, some stop codons are known to be common even in functional viruses. Our stop codon rule excludes such known stop codons and assigns a QC score of 75 to each additional premature stop.
Frame shifts (F)
Frame-shifting insertions or deletions typically result in a garbled translation or a premature stop. Nextalign currently doesn’t translate frame-shifted coding sequences, and each frame shift is assigned a QC score of 75. Note, however, that clade 21H has a frame shift towards the end of ORF3a that results in a premature stop.
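The per-event scoring of the stop codon and frame shift rules can be sketched in Python as follows. Only the value of 75 per event comes from the text; the function names, the `(gene, codon)` representation, and the placeholder entry in the ignore list are hypothetical and do not reflect Nextclade's actual list of known stop codons.

```python
SCORE_PER_EVENT = 75  # from the text: each unexpected stop or frame shift scores 75

# Hypothetical placeholder for "known" stop codons that the rule skips.
KNOWN_STOPS = {("geneX", 27)}


def stop_codon_score(stops, known_stops=KNOWN_STOPS):
    """Score premature stops given as (gene, codon) pairs, skipping known ones."""
    unexpected = [stop for stop in stops if stop not in known_stops]
    return SCORE_PER_EVENT * len(unexpected)


def frame_shift_score(frame_shift_count):
    """Score frame shifts: 75 per detected frame shift."""
    return SCORE_PER_EVENT * frame_shift_count


if __name__ == "__main__":
    stops = [("geneX", 27), ("geneY", 5)]
    print(stop_codon_score(stops))
    print(frame_shift_score(2))
```

With the thresholds given earlier, a single unexpected stop or frame shift (75) falls in the "mediocre" range, while two such events in the same rule (150) fall in the "bad" range.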
Nextclade’s QC warnings don’t necessarily mean your sequences are problematic, but these issues warrant closer examination. You may explore the rest of the analysis results for the flagged sequences to make the decision.
The numeric QC scores are useful for a rough estimate of sequence quality. However, these values are empirical. They only hint at possible issues and their possible scale, and call for further investigation. They do not have any other meaning or application.
The Nextstrain SARS-CoV-2 pipeline uses similar (more lenient) QC criteria. For example, Nextstrain will exclude your sequence if it has fewer than 27000 valid bases (corresponding to roughly 3000 Ns) and doesn’t check for ambiguous characters. Sequences flagged for excess divergence and SNP clusters by Nextclade are likely excluded by Nextstrain.
Note that there are many additional potential problems that Nextclade does not check for. These include, for example, primer sequences, adapters, or chimeras between divergent SARS-CoV-2 strains.