augur.ancestral module

Infer ancestral sequences based on a tree.

The ancestral sequences are inferred using TreeTime. Each internal node gets assigned a nucleotide sequence that maximizes a likelihood on the tree given its descendants and its parent node. Each node then gets assigned a list of nucleotide mutations for any position that has a mismatch between its own sequence and its parent’s sequence. The node sequences and mutations are output to a node-data JSON file.

If amino acid options are provided, the ancestral amino acid sequences for each requested gene are inferred with the same method as the nucleotide sequences described above. The inferred amino acid mutations are included in the output node-data JSON file, in a format equivalent to the output of augur translate.

The nucleotide and amino acid sequences are inferred separately in this command, which can potentially result in mismatches between the nucleotide and amino acid mutations. If you want amino acid mutations based on the inferred nucleotide sequences, please use augur translate.

Note

The mutation positions in the node-data JSON are one-based.
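As an illustration of the one-based convention, the following minimal sketch compares a parent sequence with a child sequence and reports each mismatch as <from><1-based-pos><to>. The helper name find_mutations is hypothetical and is not part of augur; the actual module derives mutations from the TreeTime reconstruction.

```python
def find_mutations(parent_seq: str, child_seq: str) -> list[str]:
    """Return mismatches between two aligned sequences as
    '<from><1-based-pos><to>' strings (hypothetical helper, not augur API)."""
    if len(parent_seq) != len(child_seq):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{parent_base}{pos}{child_base}"
        # start=1 makes the reported positions one-based
        for pos, (parent_base, child_base) in enumerate(
            zip(parent_seq, child_seq), start=1
        )
        if parent_base != child_base
    ]


mutations = find_mutations("ACGT", "ACTA")
print(mutations)  # ['G3T', 'T4A']
```

To index back into a Python string, subtract one from the reported position.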

class augur.ancestral.Ancestral_JSON

Bases: TypedDict

annotations: Annotations_JSON

mask: NotRequired[str]

nodes: Any

reference: dict[str, str]

class augur.ancestral.Ancestral_Reconstruction

Bases: TypedDict

mutations: Mutations

root_seq: str

tt: Any

class augur.ancestral.Annotations_JSON

Bases: TypedDict

nuc: NotRequired[Nuc_Annotation]

augur.ancestral.GENE_PATTERN = '%GENE'

String pattern used for gene replacement in filenames etc.
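A minimal sketch of how a placeholder such as '%GENE' might be expanded into one filename per gene. The helper filename_for_gene and the example filename are hypothetical, and plain string replacement is an assumption about how the pattern is applied.

```python
GENE_PATTERN = "%GENE"  # same value as documented for augur.ancestral.GENE_PATTERN


def filename_for_gene(pattern: str, gene: str) -> str:
    """Substitute a gene name for the placeholder (hypothetical helper)."""
    if GENE_PATTERN not in pattern:
        raise ValueError(f"pattern {pattern!r} does not contain {GENE_PATTERN!r}")
    return pattern.replace(GENE_PATTERN, gene)


for gene in ["HA", "NA"]:
    print(filename_for_gene("aa_sequences_%GENE.fasta", gene))
```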

class augur.ancestral.Mode(is_vcf, nuc_reconstruction, aa_reconstruction=False)

Bases: object

aa_reconstruction: bool = False

is_vcf: bool

nuc_reconstruction: bool
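The constructor signature and typed fields above are consistent with a simple record type. The sketch below reproduces the same three fields as a dataclass; the name ModeSketch is hypothetical, and whether augur implements Mode as a dataclass is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ModeSketch:
    """Stand-in with the same fields and default as the documented Mode."""
    is_vcf: bool
    nuc_reconstruction: bool
    aa_reconstruction: bool = False


# FASTA input, nucleotide reconstruction only
mode = ModeSketch(is_vcf=False, nuc_reconstruction=True)
print(mode.aa_reconstruction)  # False
```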

class augur.ancestral.Mutations

Bases: TypedDict

mask: Any

nodes: Any

class augur.ancestral.Nuc_Annotation

Bases: TypedDict

end: int

start: int

strand: str

type: str

augur.ancestral.collect_mutations(tt, mask, reference_sequence=None, infer_ambiguous=False)

Iterates over the tree and produces dictionaries with mutations and sequences for each node.

If a reference sequence is provided, mutations can also be collected for the root node. Masked positions at the root node are treated specially: if ambiguity is being inferred, no mutation is reported (i.e. the reference base is assumed to hold); otherwise a mutation from the <ref> base to “N” is reported.

Parameters:

tt (treetime.TreeTime) -- instance of treetime with valid ancestral reconstruction
mask (numpy.ndarray(bool))
reference_sequence (str, optional)
infer_ambiguous (bool, optional)

Returns:

dict -> <node_name> -> [mut, mut, …] where mut is a string in the form <from><1-based-pos><to>

Return type:

dict
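Downstream code often needs to split a mutation string of the form <from><1-based-pos><to> back into its parts. A minimal sketch follows; the helper parse_mutation is hypothetical, and it assumes that the from and to states are single non-digit characters.

```python
import re

# one non-digit character, one or more digits, one non-digit character
MUTATION_RE = re.compile(r"^(\D)(\d+)(\D)$")


def parse_mutation(mut: str) -> tuple[str, int, str]:
    """Split '<from><1-based-pos><to>' into (from, position, to)."""
    match = MUTATION_RE.match(mut)
    if match is None:
        raise ValueError(f"not a valid mutation string: {mut!r}")
    src, pos, dst = match.groups()
    return src, int(pos), dst


src, pos, dst = parse_mutation("A123T")
zero_based_index = pos - 1  # convert to a Python string index
print(src, pos, dst, zero_based_index)  # A 123 T 122
```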

augur.ancestral.collect_sequences(tt, mask, reference_sequence=None, infer_ambiguous=False)

Create a full sequence for every node on the tree. Masked positions will have the reference base if we are inferring ambiguity, or the ambiguous character ‘N’ otherwise.

Parameters:

tt (treetime.TreeTime) -- instance of treetime with valid ancestral reconstruction
mask (numpy.ndarray(bool)) -- Mask these positions by changing them to the ambiguous nucleotide
reference_sequence (str or None)
infer_ambiguous (bool, optional) -- if true, request the reconstructed sequences from treetime, otherwise retain input ambiguities

Returns:

dict -> <node_name> -> sequence_string

Return type:

dict
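The treatment of masked positions described above can be sketched for a single sequence as follows. The helper fill_masked is hypothetical and operates on plain Python sequences rather than on a TreeTime instance; the mask is any iterable of booleans, which includes a numpy boolean array.

```python
def fill_masked(seq, mask, reference_sequence, infer_ambiguous=False):
    """Return seq with masked positions replaced (hypothetical helper).

    Masked positions take the reference base when infer_ambiguous is true,
    and the ambiguous character 'N' otherwise.
    """
    out = []
    for i, (base, masked) in enumerate(zip(seq, mask)):
        if not masked:
            out.append(base)
        elif infer_ambiguous:
            out.append(reference_sequence[i])
        else:
            out.append("N")
    return "".join(out)


mask = [False, True, False, True]
print(fill_masked("ACGT", mask, "TTTT", infer_ambiguous=True))   # ATGT
print(fill_masked("ACGT", mask, "TTTT", infer_ambiguous=False))  # ANGN
```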

augur.ancestral.construct_cds_feature(name, aa_len)

augur.ancestral.correct_alignment(aln_fname, correct_seq)

Read an alignment from a FASTA file and correct sequences using the provided correction function (from _make_seq_corrector).

Returns a MultipleSeqAlignment suitable for passing directly to TreeAnc.

Return type:

MultipleSeqAlignment

augur.ancestral.create_mask(is_vcf, tt, reference_sequence, aln)

Identify sites for which every terminal sequence is ambiguous. These sites will be masked to prevent rounding errors in the maximum likelihood inference from assigning an arbitrary nucleotide to sites at internal nodes.

Parameters:

is_vcf (bool)
tt (treetime.TreeTime) -- instance of treetime with valid ancestral reconstruction. Unused if is_vcf.
reference_sequence (str) -- only used if is_vcf
aln (dict) -- describes variation (relative to reference) per sample. Only used if is_vcf.

Return type:

numpy.ndarray(bool)
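The idea of masking sites at which every terminal sequence is ambiguous can be sketched without TreeTime. The helper all_ambiguous_sites is hypothetical; it takes a plain dict of aligned tip sequences, treats only 'N' as ambiguous (an assumption), and returns a list of booleans rather than the numpy.ndarray(bool) that create_mask returns.

```python
AMBIGUOUS = "N"  # assumption: only 'N' counts as ambiguous in this sketch


def all_ambiguous_sites(tip_sequences):
    """Return one boolean per site: True if every tip is ambiguous there."""
    seqs = list(tip_sequences.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("tip sequences must be aligned to the same length")
    return [all(s[i] == AMBIGUOUS for s in seqs) for i in range(length)]


tips = {"tip_a": "ANGT", "tip_b": "CNGN"}
print(all_ambiguous_sites(tips))  # [False, True, False, False]
```

Only the second site is masked: the fourth site is ambiguous in tip_b but resolved in tip_a, so the inference still has information there.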

augur.ancestral.reconstruct_translations(anc_seqs, nuc_ref, aa_ref_fname, T, genes, annotation_fname, translations_fname_pattern, infer_ambiguous, fill_overhangs, marginal, rng_seed, output_fname_pattern, report_inconsistent_translation)

Return type:

Ancestral_JSON

augur.ancestral.register_parser(parent_subparsers)

augur.ancestral.run(args)

augur.ancestral.run_ancestral(T, aln, reference_sequence=None, is_vcf=False, full_sequences=False, fill_overhangs=False, infer_ambiguous=False, marginal=False, alphabet='nuc', rng_seed=None)

Ancestral nucleotide reconstruction using TreeTime.

Return type:

Ancestral_Reconstruction

augur.ancestral.validate_arguments(args, genes)

Check that the provided arguments are compatible. Where possible we use argparse built-ins, but they don’t cover everything we want to check. Downstream code shouldn’t rely on this check to assume that arguments exist; its purpose is to catch invalid combinations up front so that we can exit quickly.

Return type:

Mode