Manually Labeling Clades on a Nextstrain Treeď
If you look at the Nextstrain Seasonal Influenza tree, youâll see that the clades are labelled (for example, 3c2
, 3c2.A
, 3c3
etc.). However labelled clades are not a default in Nextstrain builds. You may want to add clade labelling to your Nextstrain trees to more easily describe certain aspects of the tree, point out agreed-upon groups that are accepted by the field, or to facilitate zoomed views on the tree when using Nextstrain narratives. This bit of documentation describes how to add clade labeling to your Nextstrain build.
Finding Clade Defining Nucleotide Mutationsď
You can find clade-defining mutations by exploring a Nextstrain tree that you have built previously. By default, Nextstrain trees will show amino acid mutations. While you can certainly define clades by amino acid changes, there may be many clades that you would like to label for which no amino acid mutations have occurred. Thus, it may work better to define clades using nucleotide mutations, which occur more frequently.
Then, head back to the tree. When you hover over a branch in the tree, a dark grey window pops up.
Inside this window, youâll see that thereâs a header that says Nucleotide Mutations:
.
In the image below, the clade descended from this branch is defined by a mutation from G to A at site 9417 in the genome.
This is the information that youâll need for defining your clades.
Make a TSV File Containing your Clade Mutationsď
The header of this TSV file should have the following entries: clade\t
gene\t
site\t
alt
clade
is the name of your clade, whatever you want to refer to it as.gene
should be entered asnuc
if you are using nucleotide mutations. If you are using amino acid mutations, it would be the gene the amino acid mutation is in, as that gene is defined within the build.site
is the position in the genome where the mutation occurred.alt
is the nucleotide that the site mutated to. For example if the nucleotide mutation isG9417A
, thenalt
would beA
.
Note that some clades will be defined by more than one mutation. To avoid any conflicts with other clades, you should list all the mutations in the TSV file (using the same clade name, but different site information). A picture of what this tsv file should look like, showing invisible characters, is given below. Note that clades c1
, c2
, c3
, c4
, and c5
are defined by only one mutation, but that c6
is defined by two mutations, and thus has two entries, one for each mutation. Along the same lines, there are six mutations defining clade c7
, and thus there are six entries in the tsv
file.
The
â
andÂŹ
characters are displayed by the Atom text editor to represent âtabâ and â(unix) end of lineâ characters, respectively.
It doesnât really matter where you save your clades.tsv
file (as long as you know the path to the file), we usually save this file within the config
directory of our workflows, as this is where other files such as colors.tsv
are usually kept within Nextstrain analysis directories.
Side note: Please avoid numerical labelling for the clades, which currently cause issues. I.e. instead of using
1
,2
,3
etc, add a letter ahead of the number - for instancec1
,c2
,c3
.
Changing your Workflow to Add Clade Labeling by Editing your Snakefileď
Clade labelling is not a default task in the standard Nextstrain workflow. However, you can certainly add it to your workflow! To do so, youâll need to edit your Snakefile.
The first thing youâll need to edit is rule files:
. Here, youâll need to add an entry to the params
section that references your file. For example, you could add clades = config/clades.tsv
to the params
section. The important thing here is that the path to your TSV file is correct. This is what this looks like for us in one of our Snakefiles:
rule files:
params:
input_fasta = "data/sequences.fasta",
metadata = "data/metadata.tsv",
dropped_strains = "config/dropped_strains.txt",
reference = "config/reference.gb",
colors = "config/colors.tsv",
lat_longs = "config/lat_longs.tsv",
auspice_config = "config/auspice_config.json",
root_name = "lab-outgroup",
clades = "config/clades.tsv"
Next, youâll need to add a rule to the Snakefile that tells the workflow to perform the clade labelling. We usually place this rule after the translate
rule.
Hereâs an example rule that we wrote for clade labeling.
rule clades:
message: " Labeling clades as specified in config/clades.tsv"
input:
tree = rules.prune_outgroup.output.tree,
aa_muts = rules.translate.output.node_data,
nuc_muts = rules.ancestral.output.node_data,
clades = files.clades
output:
clade_data = "results/clades.json"
shell:
"""
augur clades --tree {input.tree} \
--mutations {input.nuc_muts} {input.aa_muts} \
--clades {input.clades} \
--output {output.clade_data}
"""
Finally, youâll need to edit the export
rule to make sure that the clade labelling information gets added to the Auspice JSONs. Youâll need to add the clade data as an input to export
, and youâll also need to include it as an argument in the --node-data
flag of the auspice export command
.
For instance, this is what our export
command looks like:
rule export:
message: "Exporting data files for auspice"
input:
tree = rules.prune_outgroup.output.tree,
metadata = rules.combine_strain_epi_names.output.metadata,
branch_lengths = rules.refine.output.node_data,
traits = rules.traits.output.node_data,
nt_muts = rules.ancestral.output.node_data,
aa_muts = rules.translate.output.node_data,
colors = files.colors,
lat_longs = files.lat_longs,
clades = rules.clades.output.clade_data,
auspice_config = files.auspice_config
output:
auspice_tree = rules.all.input.auspice_tree,
auspice_meta = rules.all.input.auspice_meta
shell:
"""
augur export \
--tree {input.tree} \
--metadata {input.metadata} \
--node-data {input.branch_lengths} {input.traits} {input.nt_muts} {input.clades} {input.aa_muts} \
--colors {input.colors} \
--lat-longs {input.lat_longs} \
--auspice-config {input.auspice_config} \
--output-tree {output.auspice_tree} \
--output-meta {output.auspice_meta}
"""
Note that in the input we have clades = rules.clades.output.clade_data
and in the --node-data
flag, we have an argument {input.clades}
.
Important: You may have to edit the names depending on how you named your rules, your inputs, and your outputs. For example, in the export
rule above, we have a line that says clades = rules.clades.output.clade_data
. This is formatted this way because we have a rule called clades
and the output from that rule is called clade_data
. However if our rule had been called something different, say label_clades_on_my_tree
, then the referenced file would have to be labelled rules.label_clades_on_my_tree.output.clade_data
. This is also true for however you name the output files.