augur index
Count occurrences of bases in a set of sequences.
usage: augur index [-h] --sequences SEQUENCES --output OUTPUT [--verbose]
Named Arguments
- --sequences, -s
sequences in FASTA or VCF format. Augur summarizes the content of FASTA sequences but reports only the names of strains found in a given VCF.
- --output, -o
tab-delimited file containing the number of bases per sequence in the given file. Output columns include strain, length, and counts for A, C, G, T, N, other valid IUPAC characters, ambiguous characters ("?" and "-"), and other invalid characters.
- --verbose, -v
print index statistics to stdout
Default:
False
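To make the output columns concrete, here is a minimal Python sketch of how one index row could be computed for a single sequence. It is not augur's implementation: the function name `index_record`, the set of "other IUPAC" codes, the case-insensitive counting, and the strain name in the usage line are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: IUPAC ambiguity codes other than N; the set augur uses may differ.
OTHER_IUPAC = set("RYSWKMBDHV")

# Column order taken from the example index in this document.
COLUMNS = ["strain", "length", "A", "C", "G", "T", "N",
           "other_IUPAC", "-", "?", "invalid_nucleotides"]


def index_record(strain, sequence):
    """Return one index row (as a dict) for a single sequence."""
    # Assumption: characters are counted case-insensitively.
    counts = Counter(sequence.upper())
    row = {"strain": strain, "length": len(sequence)}
    for base in "ACGTN":
        row[base] = counts.get(base, 0)
    row["other_IUPAC"] = sum(counts[c] for c in OTHER_IUPAC)
    row["-"] = counts.get("-", 0)
    row["?"] = counts.get("?", 0)
    # Anything not classified above is treated as an invalid character.
    classified = sum(row[c] for c in COLUMNS[2:-1])
    row["invalid_nucleotides"] = row["length"] - classified
    return row


# Hypothetical strain name and sequence, for illustration only.
row = index_record("hypothetical/strain/2015", "ACGTNNRY-?acgtX")
print("\t".join(str(row[c]) for c in COLUMNS))
```

In this sketch the per-category counts always sum to the value in the length column, which matches the example rows shown later in this document.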
Speed up filtering with a sequence index
As we describe in the phylogenetic workflow tutorial, augur index precalculates the composition of the sequences (e.g., numbers of nucleotides, gaps, invalid characters, and total sequence length) prior to filtering. The resulting sequence index speeds up subsequent filter steps, especially in more complex workflows.
mkdir -p results/
augur index \
--sequences data/sequences.fasta \
--output results/sequence_index.tsv
The first lines in the sequence index look like this.
strain length A C G T N other_IUPAC - ? invalid_nucleotides
PAN/CDC_259359_V1_V3/2015 10771 2952 2379 3142 2298 0 0 0 0 0
COL/FLR_00024/2015 10659 2921 2344 3113 2281 0 0 0 0 0
PRVABC59 10675 2923 2351 3115 2286 0 0 0 0 0
COL/FLR_00008/2015 10659 2924 2344 3110 2281 0 0 0 0 0
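Because the index is a plain tab-delimited file, it can be inspected with standard tools. The following Python sketch parses the example rows above and checks that the per-character counts add up to the reported length. The helper name `read_index` is hypothetical, and the rows are embedded as a string only to keep the example self-contained; in practice you would open results/sequence_index.tsv instead.

```python
import csv
import io

# The example rows shown above, written out as tab-delimited text.
INDEX_TSV = (
    "strain\tlength\tA\tC\tG\tT\tN\tother_IUPAC\t-\t?\tinvalid_nucleotides\n"
    "PAN/CDC_259359_V1_V3/2015\t10771\t2952\t2379\t3142\t2298\t0\t0\t0\t0\t0\n"
    "COL/FLR_00024/2015\t10659\t2921\t2344\t3113\t2281\t0\t0\t0\t0\t0\n"
    "PRVABC59\t10675\t2923\t2351\t3115\t2286\t0\t0\t0\t0\t0\n"
    "COL/FLR_00008/2015\t10659\t2924\t2344\t3110\t2281\t0\t0\t0\t0\t0\n"
)


def read_index(handle):
    """Parse a sequence index into a list of dicts with integer counts."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        row = {"strain": record["strain"]}
        for key, value in record.items():
            if key != "strain":
                row[key] = int(value)
        rows.append(row)
    return rows


rows = read_index(io.StringIO(INDEX_TSV))
for row in rows:
    # Sum every count column (everything except strain and length).
    total = sum(v for k, v in row.items() if k not in ("strain", "length"))
    print(row["strain"], row["length"], total == row["length"])
```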
We then provide the sequence index as an input to augur filter commands to speed up filtering on sequence-specific attributes.
augur filter \
--sequences data/sequences.fasta \
--sequence-index results/sequence_index.tsv \
--metadata data/metadata.tsv \
--exclude config/dropped_strains.txt \
--output results/filtered.fasta \
--group-by country year month \
--sequences-per-group 20 \
--min-date 2012
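To illustrate why a precomputed index helps with sequence-specific filters, the sketch below selects strains by their number of unambiguous bases using only the index, without reading the FASTA file. This is an illustration, not how augur filter is implemented: the function name `strains_with_min_acgt` and the threshold of 10700 are hypothetical, and the rows are copied from the example index above.

```python
import csv
import io

# A few rows copied from the example index above, as tab-delimited text.
INDEX_TSV = (
    "strain\tlength\tA\tC\tG\tT\tN\tother_IUPAC\t-\t?\tinvalid_nucleotides\n"
    "PAN/CDC_259359_V1_V3/2015\t10771\t2952\t2379\t3142\t2298\t0\t0\t0\t0\t0\n"
    "COL/FLR_00024/2015\t10659\t2921\t2344\t3113\t2281\t0\t0\t0\t0\t0\n"
    "PRVABC59\t10675\t2923\t2351\t3115\t2286\t0\t0\t0\t0\t0\n"
)


def strains_with_min_acgt(handle, min_acgt):
    """Return strains whose A+C+G+T count is at least min_acgt.

    Only the index is read; the sequences themselves are never parsed.
    """
    keep = []
    for record in csv.DictReader(handle, delimiter="\t"):
        acgt = sum(int(record[base]) for base in "ACGT")
        if acgt >= min_acgt:
            keep.append(record["strain"])
    return keep


# Hypothetical threshold chosen so that only one example strain passes.
kept = strains_with_min_acgt(io.StringIO(INDEX_TSV), 10700)
print(kept)
```

Each check costs a handful of integer additions per strain, which is the kind of work the index saves compared with rescanning every sequence for every filter step.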