augur index

Count occurrences of bases in a set of sequences.

usage: augur index [-h] --sequences SEQUENCES --output OUTPUT [--verbose]

Named Arguments

--sequences, -s

sequences in FASTA or VCF format. Augur summarizes the content of FASTA sequences but reports only the names of strains found in a given VCF.

--output, -o

tab-delimited file containing the number of bases per sequence in the given file. Output columns include strain, length, and counts for A, C, G, T, N, other valid IUPAC characters, ambiguous characters ('?' and '-'), and other invalid characters.

--verbose, -v

print index statistics to stdout

Default: False

Speed up filtering with a sequence index

As we describe in the phylogenetic workflow tutorial, augur index precalculates the composition of each sequence (e.g., numbers of nucleotides, gaps, invalid characters, and total sequence length) prior to filtering. The resulting sequence index speeds up subsequent filter steps, especially in more complex workflows that filter the same sequences more than once.

mkdir -p results/
augur index \
    --sequences data/sequences.fasta \
    --output results/sequence_index.tsv

The first lines in the sequence index look like this.

strain                     length  A     C     G     T     N  other_IUPAC  -  ?  invalid_nucleotides
PAN/CDC_259359_V1_V3/2015  10771   2952  2379  3142  2298  0  0            0  0  0
COL/FLR_00024/2015         10659   2921  2344  3113  2281  0  0            0  0  0
PRVABC59                   10675   2923  2351  3115  2286  0  0            0  0  0
COL/FLR_00008/2015         10659   2924  2344  3110  2281  0  0            0  0  0
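To make the columns concrete, here is a minimal Python sketch of how one row of such an index could be computed. It is not Augur's implementation. The function name `index_row`, the case-insensitive counting, and the choice of R, Y, S, W, K, M, B, D, H, and V as the "other IUPAC" characters are assumptions made for illustration.

```python
from collections import Counter

# Assumption: IUPAC nucleotide ambiguity codes other than N.
OTHER_IUPAC = "RYSWKMBDHV"


def index_row(strain, sequence):
    """Return a dict shaped like one row of the sequence index (hypothetical helper)."""
    # Assumption: lowercase and uppercase characters are counted together.
    counts = Counter(sequence.upper())
    row = {"strain": strain, "length": len(sequence)}
    # Remove each recognized character from the counter as it is tallied.
    for base in "ACGTN":
        row[base] = counts.pop(base, 0)
    row["other_IUPAC"] = sum(counts.pop(char, 0) for char in OTHER_IUPAC)
    row["-"] = counts.pop("-", 0)
    row["?"] = counts.pop("?", 0)
    # Whatever is left in the counter matched none of the categories above.
    row["invalid_nucleotides"] = sum(counts.values())
    return row


row = index_row("example", "ACGTNNRY-?XacgT")
print(row)
```

In this sketch the per-category counts always sum to the value in the length column, which matches the sample rows above (for example, 2952 + 2379 + 3142 + 2298 = 10771).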

We then provide the sequence index as an input to augur filter commands to speed up filtering on sequence-specific attributes.

augur filter \
    --sequences data/sequences.fasta \
    --sequence-index results/sequence_index.tsv \
    --metadata data/metadata.tsv \
    --exclude config/dropped_strains.txt \
    --output results/filtered.fasta \
    --group-by country year month \
    --sequences-per-group 20 \
    --min-date 2012
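The reason the index helps is that a sequence-specific criterion, such as a minimum-length cutoff, can be evaluated from the precomputed counts without reading the FASTA file again. The Python sketch below illustrates that idea only and is not how augur filter is implemented. The helper `strains_with_min_length`, the cutoff of 10670, and the choice to count only A, C, G, and T toward the cutoff are assumptions; the sample rows are copied from the index shown above.

```python
import csv
import io

# First rows of the sequence index shown above, as tab-delimited text.
SAMPLE_INDEX = (
    "strain\tlength\tA\tC\tG\tT\tN\tother_IUPAC\t-\t?\tinvalid_nucleotides\n"
    "PAN/CDC_259359_V1_V3/2015\t10771\t2952\t2379\t3142\t2298\t0\t0\t0\t0\t0\n"
    "COL/FLR_00024/2015\t10659\t2921\t2344\t3113\t2281\t0\t0\t0\t0\t0\n"
    "PRVABC59\t10675\t2923\t2351\t3115\t2286\t0\t0\t0\t0\t0\n"
)


def strains_with_min_length(index_handle, min_length):
    """Return strains with at least min_length unambiguous bases (hypothetical helper)."""
    reader = csv.DictReader(index_handle, delimiter="\t")
    kept = []
    for row in reader:
        # Assumption: only A, C, G, and T count toward the cutoff.
        unambiguous = sum(int(row[base]) for base in "ACGT")
        if unambiguous >= min_length:
            kept.append(row["strain"])
    return kept


kept = strains_with_min_length(io.StringIO(SAMPLE_INDEX), 10670)
print(kept)
```

Each check reads only a few integers per strain, so the cost no longer depends on sequence length. To run the sketch on a real index, pass an open file handle for results/sequence_index.tsv instead of the in-memory sample.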