augur.ancestral

Infer ancestral sequences based on a tree.

The ancestral sequences are inferred using TreeTime. Each internal node gets assigned a nucleotide sequence that maximizes a likelihood on the tree given its descendants and its parent node. Each node then gets assigned a list of nucleotide mutations for any position that has a mismatch between its own sequence and its parent’s sequence. The node sequences and mutations are output to a node-data JSON file.

If amino acid options are provided, the ancestral amino acid sequences for each requested gene are inferred with the same method as the nucleotide sequences described above. The inferred amino acid mutations will be included in the output node-data JSON file, with the format equivalent to the output of augur translate.

The nucleotide and amino acid sequences are inferred separately in this command, which can potentially result in mismatches between the nucleotide and amino acid mutations. If you want amino acid mutations based on the inferred nucleotide sequences, please use augur translate.

Note

The mutation positions in the node-data JSON are one-based.

augur.ancestral.ancestral_sequence_inference(tree=None, aln=None, ref=None, infer_gtr=True, marginal=False, fill_overhangs=True, infer_tips=False, alphabet='nuc')

infer ancestral sequences using TreeTime

Parameters:
  • tree (Bio.Phylo.BaseTree.Tree or str) – tree or filename of tree

  • aln (Bio.Align.MultipleSeqAlignment or str) – alignment or filename of alignment

  • ref (str, optional) – reference sequence to pass to TreeTime’s TreeAnc class

  • infer_gtr (bool, optional) – Description

  • marginal (bool, optional) – Description

  • fill_overhangs (bool) – In some cases, the missing data on both ends of the alignment is filled with the gap character (‘-‘). If set to True, these end-gaps are converted to “ambiguous” characters (‘N’ for nucleotides, ‘X’ for aminoacids). Otherwise, the alignment is treated as-is

  • infer_tips (bool) – Since v0.7, TreeTime does not reconstruct tip states by default. This is only relevant when tip-state are not exactly specified, e.g. via characters that signify ambiguous states. To replace those with the most-likely state, set infer_tips=True

  • alphabet (str) – alphabet to use for ancestral sequence inference. Default is the nucleotide alphabet that included a gap character ‘nuc’. Alternative is aa for amino acids.

Returns:

treetime.TreeAnc instance

Return type:

treetime.TreeAnc

augur.ancestral.collect_mutations(tt, mask, character_map=None, reference_sequence=None, infer_ambiguous=False)

iterates of the tree and produces dictionaries with mutations and sequences for each node.

If a reference sequence is provided then mutations can be collected for the root node. Masked positions at the root-node will be treated specially: if we infer ambiguity, then we report no mutations (i.e. we assume the reference base holds), otherwise we’ll report a mutation from the <ref> to “N”.

Parameters:
  • tt (treetime.TreeTime) – instance of treetime with valid ancestral reconstruction

  • mask (numpy.ndarray(bool)) –

  • character_map (None, optional) – optional dictionary to map characters to a custom set.

  • reference_sequence (str, optional) –

Returns:

dict -> <node_name> -> [mut, mut, …] where mut is a string in the form <from><1-based-pos><to>

Return type:

dict

augur.ancestral.collect_sequences(tt, mask, reference_sequence=None, infer_ambiguous=False)

Create a full sequence for every node on the tree. Masked positions will have the reference base if we are inferring ambiguity, or the ambiguous character ‘N’.

Parameters:
  • tt (treetime.TreeTime) – instance of treetime with valid ancestral reconstruction

  • mask (numpy.ndarray(bool)) – Mask these positions by changing them to the ambiguous nucleotide

  • reference_sequence (str or None) –

  • infer_ambiguous (bool, optional) – if true, request the reconstructed sequences from treetime, otherwise retain input ambiguities

Returns:

dict -> <node_name> -> sequence_string

Return type:

dict

augur.ancestral.create_mask(is_vcf, tt, reference_sequence, aln)

Identify sites for which every terminal sequence is ambiguous. These sites will be masked to prevent rounding errors in the maximum likelihood inference from assigning an arbitrary nucleotide to sites at internal nodes.

Parameters:
  • is_vcf (bool) –

  • tt (treetime.TreeTime) – instance of treetime with valid ancestral reconstruction. Unused if is_vcf.

  • reference_sequence (str) – only used if is_vcf

  • aln (dict) – describes variation (relative to reference) per sample. Only used if is_vcf.

Return type:

numpy.ndarray(bool)

augur.ancestral.register_parser(parent_subparsers)
augur.ancestral.run(args)
augur.ancestral.run_ancestral(T, aln, reference_sequence=None, is_vcf=False, full_sequences=False, fill_overhangs=False, infer_ambiguous=False, marginal=False, alphabet='nuc')

ancestral nucleotide reconstruction using TreeTime

augur.ancestral.validate_arguments(args, is_vcf)

Check that provided arguments are compatible. Where possible we use argparse built-ins, but they don’t cover everything we want to check. This checking shouldn’t be used by downstream code to assume arguments exist, however by checking for invalid combinations up-front we can exit quickly.