augur filter


Filter and subsample a sequence set.

usage: augur filter [-h] --metadata FILE [--sequences SEQUENCES]
                    [--sequence-index SEQUENCE_INDEX]
                    [--metadata-chunk-size METADATA_CHUNK_SIZE]
                    [--metadata-id-columns METADATA_ID_COLUMNS [METADATA_ID_COLUMNS ...]]
                    [--query QUERY] [--min-date MIN_DATE]
                    [--max-date MAX_DATE]
                    [--exclude-ambiguous-dates-by {any,day,month,year}]
                    [--exclude EXCLUDE [EXCLUDE ...]]
                    [--exclude-where EXCLUDE_WHERE [EXCLUDE_WHERE ...]]
                    [--exclude-all] [--include INCLUDE [INCLUDE ...]]
                    [--include-where INCLUDE_WHERE [INCLUDE_WHERE ...]]
                    [--min-length MIN_LENGTH] [--non-nucleotide]
                    [--group-by GROUP_BY [GROUP_BY ...]]
                    [--sequences-per-group SEQUENCES_PER_GROUP | --subsample-max-sequences SUBSAMPLE_MAX_SEQUENCES]
                    [--probabilistic-sampling | --no-probabilistic-sampling]
                    [--priority PRIORITY] [--subsample-seed SUBSAMPLE_SEED]
                    [--output OUTPUT] [--output-metadata OUTPUT_METADATA]
                    [--output-strains OUTPUT_STRAINS]
                    [--output-log OUTPUT_LOG]

inputs

metadata and sequences to be filtered

--metadata

sequence metadata, as CSV or TSV

--sequences, -s

sequences in FASTA or VCF format

--sequence-index

sequence composition report generated by augur index. If not provided, an index will be created on the fly.

--metadata-chunk-size

maximum number of metadata records to read into memory at a time. Increasing this number can speed up filtering at the cost of more memory used.

Default: 100000
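
The memory trade-off behind --metadata-chunk-size can be pictured with a minimal Python sketch that reads a TSV in fixed-size chunks. This is an illustration only, not Augur's implementation; the function name iter_metadata_chunks and the sample table are hypothetical.

```python
import csv
import io
from itertools import islice


def iter_metadata_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size metadata rows (dicts) from a TSV handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    while True:
        # Only chunk_size rows are held in memory at any one time.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical three-record metadata table.
tsv = "strain\tcountry\nA\tColombia\nB\tBrazil\nC\tColombia\n"
chunk_sizes = [len(chunk) for chunk in iter_metadata_chunks(io.StringIO(tsv), 2)]
```

A larger chunk size means fewer passes through the loop but more rows held in memory at once.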

--metadata-id-columns

names of valid metadata columns containing identifier information like ‘strain’ or ‘name’

Default: ['strain', 'name']

metadata filters

filters to apply to metadata

--query
Filter samples by attribute.

Uses Pandas DataFrame querying; see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-query for syntax (e.g., --query "country == 'Colombia'" or --query "(country == 'USA' & (division == 'Washington'))").

--min-date

minimal cutoff for date (inclusive); may be specified as an Augur-style numeric date (with the year as the integer part) or YYYY-MM-DD

--max-date

maximal cutoff for date (inclusive); may be specified as an Augur-style numeric date (with the year as the integer part) or YYYY-MM-DD
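
To make the two accepted date forms and the inclusive cutoffs concrete, here is a rough Python sketch. The conversion formula (year plus the elapsed fraction of that year) is an assumption chosen for illustration and may differ from the exact formula Augur uses; to_numeric_date and within_range are hypothetical names.

```python
from datetime import date


def to_numeric_date(value):
    """Convert YYYY-MM-DD to a decimal year; pass int/float values through.

    Assumption: the fractional part is the elapsed fraction of the year.
    """
    if isinstance(value, (int, float)):
        return float(value)
    day = date.fromisoformat(value)
    start = date(day.year, 1, 1)
    end = date(day.year + 1, 1, 1)
    return day.year + (day - start).days / (end - start).days


def within_range(sample_date, min_date=None, max_date=None):
    """Return True if sample_date lies within the inclusive [min_date, max_date]."""
    numeric = to_numeric_date(sample_date)
    if min_date is not None and numeric < to_numeric_date(min_date):
        return False
    if max_date is not None and numeric > to_numeric_date(max_date):
        return False
    return True
```

Under this sketch, a sample dated 2012-01-01 passes a minimum date of 2012 because the cutoff itself is included.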

--exclude-ambiguous-dates-by

Possible choices: any, day, month, year

Exclude ambiguous dates by day (e.g., 2020-09-XX), month (e.g., 2020-XX-XX), year (e.g., 200X-10-01), or any date fields. An ambiguous year makes the corresponding month and day ambiguous, too, even if those fields have unambiguous values (e.g., “201X-10-01”). Similarly, an ambiguous month makes the corresponding day ambiguous (e.g., “2010-XX-01”).
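
The cascade described above (an ambiguous year implies an ambiguous month and day, and an ambiguous month implies an ambiguous day) can be sketched in Python as follows. This is a minimal illustration that assumes dates are always written as three hyphen-separated fields with "X" marking unknown digits; is_date_ambiguous is a hypothetical name, not an Augur function.

```python
def is_date_ambiguous(date_string, ambiguous_by="any"):
    """Return True if date_string is ambiguous at the requested resolution."""
    year, month, day = date_string.split("-")
    year_ambiguous = "X" in year
    # Ambiguity cascades downward: year -> month -> day.
    month_ambiguous = year_ambiguous or "X" in month
    day_ambiguous = month_ambiguous or "X" in day

    if ambiguous_by == "year":
        return year_ambiguous
    if ambiguous_by == "month":
        return month_ambiguous
    # With the cascade, "day" and "any" both reduce to "any field has an X".
    return day_ambiguous
```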

--exclude

file(s) with list of strains to exclude

--exclude-where

Exclude samples matching these conditions. Ex: “host=rat” or “host!=rat”. Multiple values are processed as OR (matching any of those specified will be excluded), not AND
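
The OR semantics of multiple conditions can be illustrated with a small Python sketch. It is not Augur's implementation: the exact, case-sensitive string comparison and the treatment of a missing column are assumptions, and parse_condition and is_excluded are hypothetical names.

```python
def parse_condition(condition):
    """Split 'column=value' or 'column!=value' into (column, value, want_equal)."""
    if "!=" in condition:
        column, value = condition.split("!=", 1)
        return column, value, False
    column, value = condition.split("=", 1)
    return column, value, True


def is_excluded(record, conditions):
    """Return True if the record matches ANY of the conditions (OR, not AND)."""
    for condition in conditions:
        column, value, want_equal = parse_condition(condition)
        # Assumption: a missing column never equals the requested value.
        if (record.get(column) == value) == want_equal:
            return True
    return False
```

For example, with the conditions host=rat and country=Brazil, a rat sample from the USA is excluded because it matches the first condition, even though it does not match the second.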

--exclude-all

exclude all strains by default. Use this with the include arguments to select a specific subset of strains.

Default: False

--include

file(s) with list of strains to include regardless of priorities or subsampling

--include-where

Include samples with these values. ex: host=rat. Multiple values are processed as OR (having any of those specified will be included), not AND. This rule is applied last and ensures any sequences matching these rules will be included.
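
Because the include rules are applied last, force-included strains come back even if an earlier filter or the subsampling step dropped them. A minimal Python sketch of that ordering, assuming a trivial "take the first N" stand-in for real subsampling (select_strains is a hypothetical name):

```python
def select_strains(strains, excluded, force_included, max_sequences):
    """Filter, subsample, then add back force-included strains as the final step."""
    passed = [s for s in strains if s not in excluded]
    # Stand-in for real subsampling: keep the first max_sequences strains.
    subsampled = passed[:max_sequences]
    # Applied last: force-included strains return regardless of earlier steps.
    forced = [s for s in strains if s in force_included and s not in subsampled]
    return subsampled + forced
```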

sequence filters

filters to apply to sequence data

--min-length

minimal length of the sequences

--non-nucleotide

exclude sequences that contain illegal characters

Default: False
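
As a rough illustration of the two sequence filters, the following Python sketch checks a minimum length and an allowed alphabet. The alphabet used here (IUPAC nucleotide codes plus the gap character) and the use of the raw string length are assumptions; Augur may count length differently and may accept a different set of characters. passes_sequence_filters is a hypothetical name.

```python
# Assumption: IUPAC nucleotide codes plus "-" for gaps are considered legal.
VALID_NUCLEOTIDES = set("ACGTURYSWKMBDHVN-")


def passes_sequence_filters(sequence, min_length=0, require_nucleotides=False):
    """Return True if the sequence is long enough and, optionally, nucleotide-only."""
    sequence = sequence.upper()
    if len(sequence) < min_length:
        return False
    if require_nucleotides and not set(sequence) <= VALID_NUCLEOTIDES:
        return False
    return True
```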

subsampling

options to subsample filtered data

--group-by

categories by which to subsample; two virtual fields, “month” and “year”, are supported if they don’t already exist as real fields but a “date” field does exist
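
The virtual "year" and "month" fields can be thought of as values derived from "date" only when no real column of that name exists. A minimal Python sketch, assuming complete YYYY-MM-DD dates and string-valued keys (how Augur represents these values internally, and how it treats ambiguous dates, is not specified here; group_key is a hypothetical name):

```python
def group_key(record, group_by):
    """Build a grouping key, deriving 'year'/'month' from 'date' when needed."""
    values = []
    for field in group_by:
        if field in record:
            # A real column always takes precedence over a virtual field.
            values.append(record[field])
        elif field in ("year", "month") and "date" in record:
            year, month = record["date"].split("-")[:2]
            values.append(year if field == "year" else month)
        else:
            raise KeyError(f"cannot group by unknown field: {field}")
    return tuple(values)
```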

--sequences-per-group

subsample to no more than this number of sequences per category

--subsample-max-sequences

subsample to no more than this number of sequences; can be used without the --group-by argument

--probabilistic-sampling

Allow probabilistic sampling during subsampling. This is useful when there are more groups than requested sequences. This option only applies when --subsample-max-sequences is provided.

Default: True

--no-probabilistic-sampling

Default: True
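
When there are more groups than requested sequences, a fixed per-group quota would round down to zero. One plausible way to handle this, shown here purely as an assumption and not as a description of Augur's internals, is to draw each group's quota from a Poisson distribution whose mean is the requested total divided by the number of groups. The names poisson and probabilistic_group_sizes are hypothetical.

```python
import math
import random


def poisson(mean, rng):
    """Draw one Poisson-distributed integer using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def probabilistic_group_sizes(groups, max_sequences, seed=None):
    """Assign each group a random quota with expected value max_sequences / len(groups)."""
    rng = random.Random(seed)
    mean = max_sequences / len(groups)
    return {group: poisson(mean, rng) for group in groups}
```

With 100 groups and 10 requested sequences, most groups receive a quota of zero and a few receive one or more, so the total is close to 10 on average rather than exactly 10.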

--priority
tab-delimited file with list of priority scores for strains (e.g., “<strain>\t<priority>”) and no header.

When scores are provided, Augur converts scores to floating point values, sorts strains within each subsampling group from highest to lowest priority, and selects the top N strains per group where N is the calculated or requested number of strains per group. Higher numbers indicate higher priority. Since priorities represent relative values between strains, these values can be arbitrary.
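
The selection rule described above (convert scores to floats, sort each group from highest to lowest priority, keep the top N) can be sketched in Python. This is an illustration rather than Augur's code; in particular, treating strains without a score as lowest priority is an assumption, and read_priorities and top_by_priority are hypothetical names.

```python
from collections import defaultdict


def read_priorities(handle):
    """Parse headerless '<strain>\\t<priority>' lines into {strain: float}."""
    priorities = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        strain, score = line.split("\t")
        priorities[strain] = float(score)
    return priorities


def top_by_priority(strain_groups, priorities, n):
    """Return the n highest-priority strains for each group."""
    by_group = defaultdict(list)
    for strain, group in strain_groups.items():
        by_group[group].append(strain)
    return {
        group: sorted(
            members,
            # Assumption: strains without a score sort last.
            key=lambda s: priorities.get(s, float("-inf")),
            reverse=True,
        )[:n]
        for group, members in by_group.items()
    }
```

Only the relative order of scores matters in this sketch, which matches the note that the values themselves can be arbitrary.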

--subsample-seed

random number generator seed to allow reproducible subsampling (with same input data).

outputs

possible representations of filtered data (at least one required)

--output, --output-sequences, -o

filtered sequences in FASTA format

--output-metadata

metadata for strains that passed filters

--output-strains

list of strains that passed filters (no header)

--output-log

tab-delimited file with one row for each filtered strain and the reason it was filtered. Keyword arguments used for a given filter are reported in JSON format in a kwargs column.
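
A log of this shape can be summarized with a few lines of Python. In this sketch the column names "strain" and "filter" are assumptions (only the "kwargs" column is named above), and the sample rows, including the filter names, are made up for illustration; summarize_log is a hypothetical name.

```python
import csv
import io
import json
from collections import Counter


def summarize_log(handle):
    """Count filtered strains per reason and collect the decoded kwargs per strain."""
    reader = csv.DictReader(handle, delimiter="\t")
    reasons = Counter()
    kwargs_by_strain = {}
    for row in reader:
        reasons[row["filter"]] += 1  # assumed column name
        kwargs_by_strain[row["strain"]] = json.loads(row["kwargs"])  # assumed column name
    return reasons, kwargs_by_strain


# Hypothetical log content; not real Augur output.
log = (
    "strain\tfilter\tkwargs\n"
    'strain_1\texample_date_filter\t{"min_date": 2012.0}\n'
    'strain_2\texample_date_filter\t{"min_date": 2012.0}\n'
    "strain_3\texample_exclude_filter\t{}\n"
)
reasons, kwargs_by_strain = summarize_log(io.StringIO(log))
```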

How we subsample sequences in the Zika tutorial

As an example, we’ll look at the filter command in greater detail using material from the Zika tutorial. The filter command allows you to select various subsets of your input data for different types of analysis. A simple example use of this command would be

augur filter --sequences data/sequences.fasta --metadata data/metadata.tsv --min-date 2012 --output filtered.fasta

This command will select all sequences with a collection date in 2012 or later. The filter command has a large number of options that allow flexible filtering for many common situations. One such use case is the exclusion of sequences that are known to be outliers (e.g. because of sequencing errors, cell-culture adaptation, …). These can be specified in a separate file:

BRA/2016/FC_DQ75D1
COL/FLR_00034/2015
...

To drop such strains, you can pass the name of this file to the augur filter command:

augur filter --sequences data/sequences.fasta \
           --metadata data/metadata.tsv \
           --min-date 2012 \
           --exclude config/dropped_strains.txt \
           --output filtered.fasta

(To improve legibility, we have wrapped the command across multiple lines.) If you run this command (you should be able to copy-paste this into your terminal) on the data provided in the Zika tutorial, you should see that one of the sequences in the data set was dropped since its name was in the dropped_strains.txt file.
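
Conceptually, the exclusion step reads one strain name per line and drops any matching record. A minimal Python sketch of that idea, assuming blank lines are ignored (read_strain_list and drop_strains are hypothetical names, and the metadata records are invented for illustration):

```python
import io


def read_strain_list(handle):
    """Read one strain name per line, skipping blank lines."""
    return {line.strip() for line in handle if line.strip()}


def drop_strains(records, dropped):
    """Keep only records whose strain name is not in the dropped set."""
    return [record for record in records if record["strain"] not in dropped]


# Strain names taken from the example file above; the records are hypothetical.
dropped = read_strain_list(io.StringIO("BRA/2016/FC_DQ75D1\nCOL/FLR_00034/2015\n\n"))
records = [{"strain": "BRA/2016/FC_DQ75D1"}, {"strain": "hypothetical_strain"}]
kept = drop_strains(records, dropped)
```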

Another common filtering operation is subsetting data to achieve a more even spatio-temporal distribution or to cut the data set down to a more manageable size. The filter command allows you to select a specific number of sequences from specific groups, for example one sequence per month from each country:

augur filter \
  --sequences data/sequences.fasta \
  --metadata data/metadata.tsv \
  --min-date 2012 \
  --exclude config/dropped_strains.txt \
  --group-by country year month \
  --sequences-per-group 1 \
  --output filtered.fasta

This subsampling and filtering will reduce the number of sequences in the tutorial data set from 34 to 24.
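
The grouping logic behind this command can be sketched in Python: bucket records by (country, year, month) and draw at most N strains from each bucket. This is a simplified illustration, not Augur's implementation; it assumes the grouping fields are already present in each record, uses uniform random selection, and takes an optional seed for reproducibility. subsample_per_group and the records are hypothetical.

```python
import random
from collections import defaultdict


def subsample_per_group(records, group_by, sequences_per_group, seed=None):
    """Randomly keep at most sequences_per_group strains from each group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in group_by)
        groups[key].append(record["strain"])

    selected = []
    # Iterate groups in sorted order so a fixed seed gives a fixed result.
    for key in sorted(groups):
        members = groups[key]
        selected.extend(rng.sample(members, min(sequences_per_group, len(members))))
    return selected


# Hypothetical records: two share a (country, year, month) group.
records = [
    {"strain": "s1", "country": "Brazil", "year": "2016", "month": "03"},
    {"strain": "s2", "country": "Brazil", "year": "2016", "month": "03"},
    {"strain": "s3", "country": "Colombia", "year": "2015", "month": "12"},
]
picked = subsample_per_group(records, ["country", "year", "month"], 1, seed=0)
```

Here two strains survive: one of s1 or s2, plus s3, since each group contributes at most one sequence.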