6. Clade assignment
To simplify discussion of co-circulating virus variants, Nextstrain groups them into clades, which are defined by specific combinations of signature mutations. Clades are groups of related sequences that share a common ancestor. For SARS-CoV-2, we try to align these clades as much as possible with WHO variant designations.
In contrast to the analysis pipeline of Nextstrain.org, which requires setting up and running a heavy computational job to assign clades, Nextclade takes a lightweight approach and assigns your sequences to clades by placing them on a phylogenetic tree annotated with clade definitions. More specifically, Nextclade assigns the clade of the nearest reference node found during the Phylogenetic placement step. This is a trade-off between accuracy and runtime performance: Nextclade provides almost instantaneous results, but is expected to be slightly less accurate than the full pipeline. For more details, see the "Phylogenetic placement: Known limitations" section.
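The idea of reading the clade off the nearest reference node can be illustrated with a minimal sketch. This is not Nextclade's implementation: the `RefNode` type, the `assign_clade` function, the mutation strings and the clade labels are all hypothetical, and the distance used here (the number of mutations present in one set but not the other) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefNode:
    """Hypothetical reference tree node annotated with a clade."""
    name: str
    clade: str
    mutations: frozenset  # mutations relative to the reference sequence


def assign_clade(query_mutations, ref_nodes):
    """Return the clade of the reference node nearest to the query.

    Assumption: distance is the size of the symmetric difference between
    the query's mutations and the node's mutations.
    """
    query = frozenset(query_mutations)
    nearest = min(ref_nodes, key=lambda node: len(query ^ node.mutations))
    return nearest.clade


# Illustrative data only: names, mutations and clade labels are made up.
ref_nodes = [
    RefNode("node_a", "clade_A", frozenset({"C241T", "A23403G"})),
    RefNode("node_b", "clade_B", frozenset({"C241T", "A23403G", "G28881A"})),
]

# The query differs from node_b by one mutation and from node_a by two,
# so it inherits the clade of node_b.
print(assign_clade({"C241T", "A23403G", "G28881A", "T100C"}, ref_nodes))
```

Because the clade is inherited from an existing node, a query can only ever receive a clade that is already annotated somewhere in the reference set.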
⚠️ Nextclade only considers those clades which are present in the input reference tree. Only one of these clades, and no others, can be assigned to the analysed sequences. It is important to make sure that every clade that you expect to find in the results is well represented in the tree.
If unsure, use one of the trees from the default Nextclade datasets or any other well-known, up-to-date, sufficiently large and diverse tree.
💡 For regional, focused studies, it is recommended to use a tree which includes clades that are specific to your region.
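One way to check this before running an analysis is to count how many tree nodes carry each clade you expect. The sketch below assumes the reference tree is a JSON document whose root node sits under a `tree` key, with nested `children` lists and the clade stored at `node_attrs.clade_membership.value`; verify this layout against your own tree file. The function names and the example tree are hypothetical.

```python
from collections import Counter


def count_clade_nodes(root):
    """Count tree nodes per clade, walking the tree iteratively."""
    counts = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        # Assumed location of the clade annotation on each node.
        clade = node.get("node_attrs", {}).get("clade_membership", {}).get("value")
        if clade is not None:
            counts[clade] += 1
        stack.extend(node.get("children", []))
    return counts


def poorly_represented(tree_json, expected_clades, min_nodes=1):
    """Return expected clades carried by fewer than `min_nodes` tree nodes."""
    counts = count_clade_nodes(tree_json["tree"])
    return {clade for clade in expected_clades if counts[clade] < min_nodes}


# Illustrative tree only; a real reference tree is loaded with json.load().
example_tree = {
    "tree": {
        "name": "root",
        "node_attrs": {"clade_membership": {"value": "19A"}},
        "children": [
            {"name": "n1", "node_attrs": {"clade_membership": {"value": "20A"}}},
            {"name": "n2", "node_attrs": {"clade_membership": {"value": "20B"}}},
        ],
    }
}

print(poorly_represented(example_tree, {"20A", "21K"}))
```

Raising `min_nodes` turns the presence check into a rough representation check; a suitable threshold depends on the study and is not prescribed here.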
For SARS-CoV-2, Nextstrain maintains one of the 3 major clade systems: Nextstrain clades. Nextclade also assigns each sequence to a Pango lineage, another widely used clade system.
The Nextstrain clade system is outlined in this blog post.
The clades are hierarchically structured as follows:
You can find the exact, up-to-date clade definitions in github.com/nextstrain/ncov.
Nextclade also assigns each sequence to a Pango lineage in the same way clades are assigned, reading off the lineage of the nearest neighbour in the reference tree.
You can read more about the method and validation results in this report.
In short, for recent sequences (within the last 12 months), Nextclade’s Pango lineage assignments are about as accurate as pangoLEARN’s.
To keep the reference tree small, Nextclade does not include all early Pango lineages, so Nextclade Pango lineages should be treated with caution for samples older than 12 months.
Clades are reported in the “Clade” column of the results table in Nextclade Web, as well as in the JSON, CSV and TSV result files generated by Nextclade CLI or downloaded through the “Download” dialog of Nextclade Web.
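As a small illustration of working with the tabular output, the sketch below reads clade assignments from TSV text and tallies sequences per clade. The column names `seqName` and `clade` are assumptions made for this example, as are the sample rows; check the header of your own results file and adjust the parameters accordingly.

```python
import csv
import io
from collections import Counter


def clades_by_sequence(tsv_text, name_col="seqName", clade_col="clade"):
    """Map each sequence name to its assigned clade.

    The default column names are assumptions; pass the names that appear
    in the header of your results file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[name_col]: row[clade_col] for row in reader}


# Illustrative rows only; real input is read from the results file.
example_tsv = (
    "seqName\tclade\n"
    "sample_1\t20A\n"
    "sample_2\t21K\n"
    "sample_3\t21K\n"
)

assignments = clades_by_sequence(example_tsv)
per_clade = Counter(assignments.values())
print(per_clade)
```

For a file on disk, the text can be obtained with `open(path, newline="").read()` before passing it to the function.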
For SARS-CoV-2, Pango lineages are also displayed in the results. In CSV files, the column is named