augur translate

Translate gene regions from nucleotides to amino acids.

Translates the nucleotide sequences of nodes in a tree to amino acids for the gene regions annotated in the provided reference sequence. Each node is then assigned a list of amino acid mutations, one for every position where its own amino acid sequence differs from its parent’s sequence. The reference amino acid sequences, genome annotations, and node amino acid mutations are output to a node-data JSON file.
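The parent/child comparison described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (both sequences are already translated and have the same length); the function name `aa_mutations` is hypothetical and is not part of augur.

```python
def aa_mutations(parent_aa, child_aa):
    """List amino acid mismatches between a parent and a child sequence.

    Hypothetical helper for illustration only; not augur's implementation.
    Positions are one-based, matching the node-data JSON.
    """
    if len(parent_aa) != len(child_aa):
        raise ValueError("sequences must have the same length")
    return [
        f"{parent}{position}{child}"
        for position, (parent, child) in enumerate(zip(parent_aa, child_aa), start=1)
        if parent != child
    ]


# The sixth residue changes from R to K.
print(aa_mutations("MATLLRSLAL", "MATLLKSLAL"))  # ['R6K']
```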

Note

The mutation positions in the node-data JSON are one-based.
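Because positions are one-based, subtract one before indexing into a Python string. The sketch below assumes mutation strings have the form shown in the example output further down (parent residue, position, child residue, e.g. `S139N`); `parse_aa_mutation` is a hypothetical helper, not an augur function.

```python
import re

# Assumed format: one residue character, a one-based position, one residue character.
MUTATION_PATTERN = re.compile(r"^([A-Z*\-])(\d+)([A-Z*\-])$")


def parse_aa_mutation(mutation):
    """Split a mutation string such as 'S139N' into (parent, position, child)."""
    match = MUTATION_PATTERN.match(mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation string: {mutation!r}")
    parent, position, child = match.groups()
    return parent, int(position), child


parent, position, child = parse_aa_mutation("S139N")
zero_based_index = position - 1  # index to use with a Python string
print(parent, position, child, zero_based_index)  # S 139 N 138
```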

usage: augur translate [-h] --tree TREE --ancestral-sequences
                       ANCESTRAL_SEQUENCES --reference-sequence
                       REFERENCE_SEQUENCE [--genes GENES [GENES ...]]
                       [--output-node-data OUTPUT_NODE_DATA]
                       [--alignment-output ALIGNMENT_OUTPUT]
                       [--vcf-reference VCF_REFERENCE]
                       [--vcf-reference-output VCF_REFERENCE_OUTPUT]

Named Arguments

--tree

prebuilt Newick – no tree will be built if provided

--ancestral-sequences

JSON (fasta input) or VCF (VCF input) containing ancestral and tip sequences

--reference-sequence

GenBank or GFF file containing the annotation

--genes

genes to translate (list or file containing list)

--output-node-data

name of JSON file to save aa-mutations to

--alignment-output

write out translated gene alignments. For VCF input, a .vcf or .vcf.gz file is written here (depending on the file extension). For FASTA input, specify the file name like so: ‘my_alignment_%GENE.fasta’, where ‘%GENE’ will be replaced by the name of the gene
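To see which files a ‘%GENE’ template produces for FASTA input, the naming rule can be mimicked as follows. This only illustrates the substitution described above; `alignment_path` is a hypothetical name, and the gene names are taken from the example output in this document.

```python
def alignment_path(template, gene):
    """Return the file name for one gene by filling in the %GENE placeholder."""
    return template.replace("%GENE", gene)


for gene in ["GENE_1", "GENE_2"]:
    print(alignment_path("my_alignment_%GENE.fasta", gene))
# my_alignment_GENE_1.fasta
# my_alignment_GENE_2.fasta
```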

VCF specific

These arguments are only applicable if the input (--ancestral-sequences) is in VCF format.

--vcf-reference

fasta file of the sequence the VCF was mapped to

--vcf-reference-output

fasta file where reference sequence translations for VCF input will be written

Example Node Data JSON

Here’s an example of the output node-data JSON where NODE_1 has no mutations compared to its parent and NODE_2 has multiple mutations in multiple genes.

{
    "annotations": {
        "GENE_1": {
            "end": 1685,
            "seqid": "reference.gb",
            "start": 108,
            "strand": "+",
            "type": "CDS"
        },
        "GENE_2": {
            "end": 2705,
            "seqid": "reference.gb",
            "start": 1807,
            "strand": "+",
            "type": "CDS"
        }
    },
    "nodes": {
        "NODE_1": {
            "aa_muts": {}
        },
        "NODE_2": {
            "aa_muts": {
                "GENE_1": [
                    "S139N",
                    "R213K",
                    "R439G",
                    "V440A",
                    "D474N",
                    "S479W",
                    "S481T",
                    "P485L",
                    "R521K"
                ],
                "GENE_2": [
                    "P43S",
                    "D46N",
                    "C64R",
                    "R98K",
                    "D136G",
                    "M175V"
                ]
            }
        }
    },
    "reference": {
        "GENE_1": "MATLLRSLAL...",
        "GENE_2": "MAEEQARHVK..."
    }
}
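A node-data file with this layout can be summarized with the standard library alone. The sketch below assumes that each node’s "aa_muts" maps gene names to lists of mutation strings; the inline sample is an abbreviated stand-in for a real output file, and `count_mutations_per_gene` is a hypothetical helper.

```python
import json
from collections import Counter

# Abbreviated stand-in for a node-data JSON file written by --output-node-data.
SAMPLE = """
{
    "nodes": {
        "NODE_1": {"aa_muts": {}},
        "NODE_2": {"aa_muts": {"GENE_1": ["S139N", "R213K"], "GENE_2": ["P43S"]}}
    }
}
"""


def count_mutations_per_gene(node_data):
    """Tally amino acid mutations per gene across all nodes."""
    counts = Counter()
    for node in node_data["nodes"].values():
        for gene, mutations in node.get("aa_muts", {}).items():
            counts[gene] += len(mutations)
    return dict(counts)


counts = count_mutations_per_gene(json.loads(SAMPLE))
print(counts)  # {'GENE_1': 2, 'GENE_2': 1}
```

For a real file, replace `json.loads(SAMPLE)` with `json.load` on the opened output file.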