Inferring Sequence Traits (like Drug Resistance)

Unfortunately, this method currently works only with VCF input files. It will be updated to work with FASTA input soon!

The augur function sequence-traits can identify any trait associated with particular nucleotide or amino-acid mutations, but it's often used to identify drug resistance mutations (DRMs).

To tell augur which sites confer what trait (or drug resistance), you'll need to pass a file detailing these sites.

The file should usually contain five columns: GENE, SITE, ALT, DISPLAY_NAME, and FEATURE. DISPLAY_NAME can be blank, and the GENE column can be omitted if only nucleotide locations are used.

Amino Acid Sites

For example, for drug resistance in TB, we list the gene, the AA position in the gene, the alternative amino acid that confers resistance (you can list a site multiple times if multiple amino acids give resistance), and the name of the drug this mutation gives resistance to:

GENE    SITE    ALT DISPLAY_NAME    FEATURE
gyrB    461     N                   Fluoroquinolones
gyrB    499     D                   Fluoroquinolones
rpoB    432     E                   Rifampicin
rpoB    432     K                   Rifampicin

We can leave DISPLAY_NAME blank, as auspice will by default display the gene, site, and original and alternative base.

Nucleotide Sites

For mutations outside of protein-coding genes, we can specify their position using nucleotides, and specify how we'd like them to be named when displayed:

GENE    SITE    ALT DISPLAY_NAME    FEATURE
nuc     1472749 A   rrs: C904A      Streptomycin
nuc     1473246 G   rrs: A1401G     Amikacin Capreomycin Kanamycin
nuc     1673423 T   fabG1: G-17T    Isoniazid Ethionamide
nuc     1673425 T   fabG1: C-15T    Isoniazid Ethionamide

In the TB literature, these mutations are still referred to by their position within non-protein-coding genes (rrs) or location near genes (-17 fabG1), not their nucleotide location. We can ensure auspice displays the more useful common nomenclature by giving entries for the DISPLAY_NAME column.

If you are only using nucleotide sites, you can also omit the GENE column:

SITE    ALT DISPLAY_NAME    FEATURE
1472749 A   rrs: C904A      Streptomycin
1473246 G   rrs: A1401G     Amikacin Capreomycin Kanamycin

Both Nucleotide Sites and AA Sites

You can also mix sites identified by nucleotide position and those identified by AA position:

GENE    SITE    ALT DISPLAY_NAME    FEATURE
gyrB    461     N                   Fluoroquinolones
gyrB    499     D                   Fluoroquinolones
rpoB    432     E                   Rifampicin
rpoB    432     K                   Rifampicin
nuc     1472749 A   rrs: C904A      Streptomycin
nuc     1473246 G   rrs: A1401G     Amikacin Capreomycin Kanamycin
nuc     1673423 T   fabG1: G-17T    Isoniazid Ethionamide
nuc     1673425 T   fabG1: C-15T    Isoniazid Ethionamide
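
To make the file layout concrete, here is a minimal Python sketch of how such a file could be read. It is not augur's own parser. It assumes the file is tab-delimited, that a missing GENE column means every row is a nucleotide site (labelled nuc here), and that a blank DISPLAY_NAME is an empty field. The function name read_features and the two sample strings are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_features(text):
    """Parse a tab-delimited sites file (assumed layout) into a dict.

    Keys are (gene, site) pairs; values are lists of
    (alt, display_name, feature) tuples, so a site listed several
    times keeps every alternative allele.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    sites = defaultdict(list)
    for row in reader:
        # Assumption: if the GENE column is absent, treat the row as a nucleotide site.
        gene = (row.get("GENE") or "nuc").strip()
        site = int(row["SITE"])
        alt = row["ALT"].strip()
        # DISPLAY_NAME may be blank; keep it as an empty string.
        display_name = (row.get("DISPLAY_NAME") or "").strip()
        feature = row["FEATURE"].strip()
        sites[(gene, site)].append((alt, display_name, feature))
    return dict(sites)


# Mixed amino-acid and nucleotide sites, taken from the examples above.
MIXED = (
    "GENE\tSITE\tALT\tDISPLAY_NAME\tFEATURE\n"
    "rpoB\t432\tE\t\tRifampicin\n"
    "rpoB\t432\tK\t\tRifampicin\n"
    "nuc\t1472749\tA\trrs: C904A\tStreptomycin\n"
)

# Nucleotide-only file with the GENE column omitted.
NUC_ONLY = (
    "SITE\tALT\tDISPLAY_NAME\tFEATURE\n"
    "1472749\tA\trrs: C904A\tStreptomycin\n"
)

if __name__ == "__main__":
    for key, entries in read_features(MIXED).items():
        print(key, entries)
```

Grouping by (gene, site) keeps both rpoB 432 entries together, which mirrors the note above that a site can be listed more than once.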

Options

sequence-traits will return a value for each "feature" - for example, all the mutations on the tree that lead to resistance to Streptomycin. It will also generate a count either of the total number of "features" each node has (ex: the total number of drugs a sequence is resistant to), or the total number of mutations specified in the file each node has (ex: the total number of DRMs a sequence has, even if some are for the same drug).

You can specify a name for this count using the --label argument (ex: "Drug_Resistance"). The --count argument specifies what is counted: use traits to count the number of traits (ex: the number of drugs a sequence is resistant to), or use mutations to count the overall number of mutations.
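
The difference between the two counting modes can be illustrated with a small Python sketch. This is not augur's implementation. The function count_for_node and the input shape (a list of (mutation, feature) pairs already matched on one node) are hypothetical, and each mutation is assumed to map to a single feature.

```python
def count_for_node(matched, count="traits"):
    """Count traits or mutations for one node.

    `matched` is a list of (mutation_name, feature) pairs for the
    listed mutations found on that node (hypothetical input shape).
    """
    if count == "traits":
        # Distinct features, e.g. the number of drugs the sequence is resistant to.
        return len({feature for _, feature in matched})
    if count == "mutations":
        # Every listed mutation, even if several confer the same trait.
        return len(matched)
    raise ValueError("count must be 'traits' or 'mutations'")


# A node carrying two Fluoroquinolone mutations and one Rifampicin mutation.
node = [
    ("gyrB 461N", "Fluoroquinolones"),
    ("gyrB 499D", "Fluoroquinolones"),
    ("rpoB 432E", "Rifampicin"),
]

if __name__ == "__main__":
    print(count_for_node(node, "traits"))     # 2 distinct drugs
    print(count_for_node(node, "mutations"))  # 3 mutations in total
```

With traits, the two gyrB mutations count once because they confer resistance to the same drug; with mutations, each is counted separately.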