Labeling clades

Clades in phylogenetic trees are often named to facilitate discussion of genetic diversity; see, for example, seasonal influenza on Nextstrain. Augur has a command to determine the position of such clade labels and assign sequences to clades. The definitions of these clades are provided in a tab-delimited file (TSV) using the following format:

clade	gene	site	alt
3b	HA1	145	S
3b	HA1	312	S
3b	nuc	1671	G
3c	HA1	48	I
3c	HA1	45	N
3c	nuc	456	T
3c2	HA2	160	N
3c2	nuc	693	A

Each line specifies a sequence feature of a clade. The column clade gives the name of the clade to which the feature belongs; the column gene can be any annotated gene or the underlying nucleotide sequence (nuc); the column site specifies the position (numbering starts at 1); and the column alt specifies the state at that position. A clade is often defined by multiple such sequence features.
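To make the file format concrete, here is a minimal Python sketch that reads clade definitions and checks a set of sequences against them. It is not Augur's implementation: the helper names read_clade_definitions and matches_clade, and the toy sequences, are hypothetical and only illustrate how the four columns and the 1-based site numbering fit together.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the example definitions above (tab-delimited).
EXAMPLE_TSV = (
    "clade\tgene\tsite\talt\n"
    "3b\tHA1\t145\tS\n"
    "3b\tHA1\t312\tS\n"
    "3b\tnuc\t1671\tG\n"
    "3c2\tHA2\t160\tN\n"
    "3c2\tnuc\t693\tA\n"
)


def read_clade_definitions(handle):
    """Group rows by clade name into lists of (gene, site, alt) features."""
    clades = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        clades[row["clade"]].append((row["gene"], int(row["site"]), row["alt"]))
    return dict(clades)


def matches_clade(features, sequences):
    """Return True if every feature of a clade is present in `sequences`.

    `sequences` maps a gene name (or "nuc") to its sequence string.
    Sites are 1-based in the file, so site N is string index N - 1.
    """
    return all(
        gene in sequences
        and len(sequences[gene]) >= site
        and sequences[gene][site - 1] == alt
        for gene, site, alt in features
    )


definitions = read_clade_definitions(io.StringIO(EXAMPLE_TSV))

# Hypothetical toy sequences carrying both features of clade 3c2.
toy_sequences = {"HA2": "A" * 159 + "N", "nuc": "C" * 692 + "A"}
is_3c2 = matches_clade(definitions["3c2"], toy_sequences)
is_3b = matches_clade(definitions["3b"], toy_sequences)
```

In this sketch a clade matches only if all of its listed features are present, which mirrors the statement that a clade is defined by multiple sequence features taken together.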

The augur command clades can be used to annotate such clades in your tree. A rule in a Snakefile would look like this:

rule clades:
    input:
        tree = rules.refine.output.tree,
        aa_muts = rules.translate.output.aa_data,
        nuc_muts = rules.ancestral.output.nt_data,
        clades = "config/clades.tsv"
    output:
        clade_data = "results/clades.json"
    shell:
        """
        augur clades --tree {input.tree} \
            --mutations {input.nuc_muts} {input.aa_muts} \
            --clades {input.clades} \
            --output-node-data {output.clade_data}
        """

As input, this command requires the tree, the output of the ancestral reconstruction step, the output of the translation step (needed if your clades are defined using translations), and the file with clade definitions.

The output of this command is a JSON file in the common augur format that specifies clade_membership for each node in the tree. Nodes that didn’t match any clade definition will be left unassigned. Internal nodes that form the root of each clade will also have a branch label assigned (you can prevent this by adding --label-name none).
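As a quick sanity check on the result, the clade assignments can be tallied with a few lines of Python. The sketch below assumes the node-data JSON has a top-level "nodes" object keyed by node name, with a clade_membership entry per node; the node names and values in the inline example are made up, and a node without the key is counted as "unassigned" here purely for illustration.

```python
import json
from collections import Counter


def count_clade_membership(node_data):
    """Count how many nodes were assigned to each clade.

    Assumes a layout like {"nodes": {name: {"clade_membership": ...}}}.
    """
    nodes = node_data.get("nodes", {})
    return Counter(
        attrs.get("clade_membership", "unassigned") for attrs in nodes.values()
    )


# Hypothetical stand-in for the contents of results/clades.json.
example = json.loads("""
{
  "nodes": {
    "NODE_0000001": {"clade_membership": "3c"},
    "NODE_0000002": {"clade_membership": "3c2"},
    "sample_A": {"clade_membership": "3c"},
    "sample_B": {"clade_membership": "unassigned"},
    "sample_C": {}
  }
}
""")

counts = count_clade_membership(example)
```

To run this on a real file, replace the inline example with json.load on the path written by the rule above, for instance results/clades.json.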