augur subsample
Subsample sequences from an input dataset.
The input dataset can consist of a metadata file, a sequences file, or both.
See documentation page for details on configuration.
usage: augur subsample [-h] [--metadata FILE] [--sequences FILE]
                       [--sequence-index FILE] [--metadata-chunk-size N]
                       [--metadata-id-columns COLUMN [COLUMN ...]]
                       [--metadata-delimiters CHARACTER [CHARACTER ...]]
                       [--skip-checks] [--seq-type {nuc,aa}] --config FILE
                       [--config-section KEY [KEY ...]]
                       [--search-paths DIR [DIR ...]] [--nthreads N]
                       [--seed N] [--output-metadata FILE]
                       [--output-sequences FILE] [--output-log OUTPUT_LOG]
Input options
options related to input files
- --metadata
sequence metadata
- --sequences
sequences in FASTA or VCF format. For large inputs, consider using --sequence-index in addition to this option.
- --sequence-index
sequence composition report generated by augur index. If not provided, an index will be created on the fly. This should be generated from the same file as --sequences.
- --metadata-chunk-size
maximum number of metadata records to read into memory at a time. Increasing this number can reduce run times at the cost of more memory used.
Default:
100000
- --metadata-id-columns
names of possible metadata columns containing identifier information, ordered by priority. Only one ID column will be inferred.
Default:
('strain', 'name')
- --metadata-delimiters
delimiters to accept when reading a metadata file. Only one delimiter will be inferred.
Default:
(',', '\t')
- --skip-checks
use this option to skip checking for duplicates in sequences and whether ids in metadata have a sequence entry. Can improve performance on large files. Note that this should only be used if you are sure there are no duplicate sequences or mismatched ids since they can lead to errors in downstream Augur commands.
Default:
False
- --seq-type
Possible choices: nuc, aa
Sequence type: ‘nuc’ or ‘aa’
Default:
'nuc'
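
The priority ordering described for --metadata-id-columns can be pictured with a short sketch. The helper name `infer_id_column` is hypothetical and this is not Augur's implementation; it only illustrates "first candidate found in the header wins", using the documented default candidates.

```python
def infer_id_column(header, candidates=("strain", "name")):
    """Return the first candidate column that is present in the metadata header.

    Hypothetical helper: it illustrates priority ordering only and is not
    taken from Augur's source code.
    """
    for column in candidates:
        if column in header:
            return column
    raise ValueError(f"None of the ID columns {candidates!r} found in {header!r}")


# Both candidates are present, so the higher-priority 'strain' is chosen.
print(infer_id_column(["name", "date", "strain"]))  # strain
# Only 'name' is present, so it is used as the fallback.
print(infer_id_column(["name", "date"]))  # name
```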
Configuration options
options related to configuration
- --config
augur subsample config file. The expected config options must be defined at the top level, or within a specific section using --config-section.
- --config-section
Use a section of the file given to --config by listing the keys leading to the section. Provide one or more keys. (default: use the entire file)
- --search-paths, --search-path
One or more directories to search for relative filepaths specified in the config file. If a file exists in multiple directories, only the file from the first directory will be used. This can also be set via the environment variable ‘AUGUR_SEARCH_PATHS’. Specified directories will be considered before the defaults, which are: (1) directory containing the config file (2) current working directory
- --nthreads
Number of CPUs/cores/threads/jobs to utilize at once. This controls both parallelism across samples and threads within proximal samples. Individual filter samples are limited to a single thread, while proximal samples use all available threads. The final augur filter call can take advantage of multiple threads.
Default:
1
- --seed
random number generator seed for reproducible outputs (with same input data).
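
The lookup order described for --search-paths (explicit directories first, then the directory containing the config file, then the current working directory) could look roughly like the sketch below. `resolve_config_path` is a hypothetical name, and the error handling is an assumption rather than Augur's actual behavior.

```python
from pathlib import Path


def resolve_config_path(relative_path, config_file, search_paths=()):
    """Return the first existing file matching relative_path.

    Hypothetical helper. Directories are tried in this order:
    1. the explicit search paths, in the order given
    2. the directory containing the config file
    3. the current working directory
    """
    directories = [Path(p) for p in search_paths]
    directories.append(Path(config_file).resolve().parent)
    directories.append(Path.cwd())
    for directory in directories:
        candidate = directory / relative_path
        if candidate.is_file():
            # The first match wins; later directories are not consulted.
            return candidate
    raise FileNotFoundError(f"{relative_path!r} not found in any search path")
```

If the same file name exists in two search directories, only the copy in the directory listed first is returned, which matches the rule stated above.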
Output options
options related to output files
- --output-metadata
output metadata file
- --output-sequences
output sequences file
- --output-log
Tab-delimited file to debug sequence inclusion in samples. All sequences have a row with filter=filter_by_exclude_all. The sequences included in the output each have an additional row per sample that included it (there may be multiple). These rows have filter=force_include_strains with kwargs pointing to a temporary file that hints at the intermediate sample it came from.
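
To see which samples contributed a given sequence, the log can be grouped by strain. The sketch below assumes the log has a header row with columns named `strain`, `filter` and `kwargs`; the `strain` column name and the example rows are assumptions made for illustration, so check them against a real log before relying on this.

```python
import csv
import io
from collections import defaultdict


def samples_per_strain(log_file):
    """Map each included strain to the kwargs of its force_include_strains rows."""
    included = defaultdict(list)
    for row in csv.DictReader(log_file, delimiter="\t"):
        # Rows with filter_by_exclude_all exist for every sequence; skip them.
        if row["filter"] == "force_include_strains":
            included[row["strain"]].append(row["kwargs"])
    return dict(included)


# Made-up log content for illustration; real kwargs values will look different.
example_log = io.StringIO(
    "strain\tfilter\tkwargs\n"
    "A\tfilter_by_exclude_all\t[]\n"
    "B\tfilter_by_exclude_all\t[]\n"
    "A\tforce_include_strains\t/tmp/hypothetical/focal.txt\n"
    "A\tforce_include_strains\t/tmp/hypothetical/context.txt\n"
)
print(samples_per_strain(example_log))
```

In this made-up example, strain A was selected by two intermediate samples and strain B by none, so B would be absent from the final output.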
Terminology
- sample
This term can refer to either the process of creating a subset or the subset itself:
1. Process: Selecting a subset of sequences from a dataset according to specific parameters for filtering and subsampling (e.g. minimum/maximum date, minimum/maximum sequence length, sample size).
   Example: Run the focal sample …
2. Resulting subset: The set of sequences obtained from the process described in (1).
   Example: The contextual sample consisted of …
- filter sample
The most common type of sample, one which is generated internally via augur filter. Configured by specifying filtering parameters (date ranges, queries, etc.) and (optionally) a context sample to use as the input.
- proximal sample
A specific type of sample where we compare a (small) focal sample against a (large) context sample and find the closest genetic matches. Uses augur proximity under the hood.
Configuration
The --config option expects a YAML-formatted configuration file. This
section describes how the file should be structured.
defaults:
  # default sample options
samples:
  <sample 1>:
    # sample options
  <sample 2>:
    # sample options
  …
Tip
Use --config-section to read from a configuration file that puts these
options under a specific section.
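
Conceptually, --config-section walks down nested keys of the parsed YAML file until it reaches the mapping that holds `defaults` and `samples`. The sketch below shows that idea on an already-parsed dictionary; `select_section` and the `build`/`subsample` key names are hypothetical.

```python
def select_section(config, keys=()):
    """Follow keys through nested mappings; with no keys, return the whole config."""
    section = config
    for key in keys:
        if not isinstance(section, dict) or key not in section:
            raise KeyError(f"section {list(keys)!r} not found (missing {key!r})")
        section = section[key]
    return section


# Hypothetical parsed config where the subsample options live under build -> subsample.
config = {
    "build": {
        "subsample": {
            "samples": {"focal": {"max_sequences": 100}},
        }
    }
}
print(select_section(config, ["build", "subsample"]))
```

With this layout, the equivalent command line would pass the two keys in order, for example `--config-section build subsample`.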
defaults
The defaults section is optional and allows you to specify common options
that apply to all filter samples. This reduces repetition when multiple filter
samples share the same criteria.
Options specified in the defaults section can be overridden by individual
samples. If both defaults and a specific sample define the same option, the
sample-specific value takes precedence.
Note that some options are only available at the sample level and cannot be specified in defaults.
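
The precedence rule can be pictured as a dictionary merge in which sample-level keys win. This is a minimal sketch, assuming an overriding value replaces the default wholesale (for example, lists are not concatenated); `effective_options` is a hypothetical name.

```python
def effective_options(defaults, sample):
    """Combine defaults with one sample's options; sample-specific values win."""
    return {**defaults, **sample}


defaults = {"min_length": 9000, "exclude": ["exclude.txt"]}
sample = {"min_length": 10000, "max_sequences": 50}
print(effective_options(defaults, sample))
# {'min_length': 10000, 'exclude': ['exclude.txt'], 'max_sequences': 50}
```

Here `min_length` comes from the sample, `exclude` is inherited from the defaults, and `max_sequences` exists only at the sample level.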
| Option | Type | Description |
|---|---|---|
| exclude | string(s) | File(s) with list of strains to exclude. |
| exclude_all | boolean | Exclude all strains by default. Use this with the include arguments to select a specific subset of strains. |
| exclude_ambiguous_dates_by | one of: day, month, year, any | Exclude ambiguous dates by day (e.g., 2020-09-XX), month (e.g., 2020-XX-XX), year (e.g., 200X-10-01), or any date fields. An ambiguous year makes the corresponding month and day ambiguous, too, even if those fields have unambiguous values (e.g., “201X-10-01”). Similarly, an ambiguous month makes the corresponding day ambiguous (e.g., “2010-XX-01”). |
| exclude_where | string(s) | Exclude sequences matching these conditions. Ex: “host=rat” or “host!=rat”. Multiple values are processed as OR (matching any of those specified will be excluded), not AND. |
| include | string(s) | File(s) with list of strains to include regardless of priorities, subsampling, or absence of an entry in sequences. |
| include_where | string(s) | Include sequences with these values. Ex: “host=rat”. Multiple values are processed as OR (having any of those specified will be included), not AND. This rule is applied last and ensures any strains matching these rules will be included regardless of priorities, subsampling, or absence of an entry in sequences. |
| min_date | string or integer | Minimal cutoff for date (inclusive). Supported formats: |
| max_date | string or integer | Maximal cutoff for date (inclusive). Supported formats: |
| min_length | integer | Minimal length of the sequences, only counting valid characters (excluding gaps, ambiguous, and invalid characters). |
| max_length | integer | Maximum length of the sequences, only counting valid characters (excluding gaps, ambiguous, and invalid characters). |
| exclude_invalid | boolean | Exclude sequences that contain invalid characters. |
| non_nucleotide | boolean | Deprecated; use ‘exclude_invalid’ instead. Exclude sequences that contain invalid characters. |
| query | string | Filter sequences by attribute. Uses Pandas DataFrame query syntax (e.g., “country == ‘Colombia’” or “(country == ‘USA’ & (division == ‘Washington’))”). |
| query_columns | string(s) | Use alongside query to specify columns and data types in the format ‘column:type’, where type is one of (bool,float,int,str). Automatic type inference will be attempted on all unspecified columns used in the query. Example: region:str coverage:float. |
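
The cascading rule for exclude_ambiguous_dates_by (an ambiguous year also makes the month and day ambiguous, and an ambiguous month makes the day ambiguous) can be expressed in a few lines. This is an illustrative sketch, not Augur's code: `is_date_ambiguous` is a hypothetical name, and it assumes dates are always written as three `-`-separated fields with `X` marking unknown digits.

```python
def is_date_ambiguous(date, ambiguous_by="any"):
    """Return True if a 'YYYY-MM-DD' date is ambiguous at the given resolution."""
    year, month, day = date.split("-")
    year_ambiguous = "X" in year
    # Ambiguity cascades downward: year -> month -> day.
    month_ambiguous = year_ambiguous or "X" in month
    day_ambiguous = month_ambiguous or "X" in day
    if ambiguous_by == "year":
        return year_ambiguous
    if ambiguous_by == "month":
        return month_ambiguous
    if ambiguous_by in ("day", "any"):
        # With the cascade, "any field ambiguous" equals "day ambiguous".
        return day_ambiguous
    raise ValueError(f"unknown resolution: {ambiguous_by!r}")


print(is_date_ambiguous("2020-09-XX", "day"))    # True
print(is_date_ambiguous("2020-09-XX", "month"))  # False
print(is_date_ambiguous("201X-10-01", "month"))  # True
```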
samples
samples must contain at least one sample.
filter sample options
These options override any values set in the defaults section.
| Option | Type | Description |
|---|---|---|
| exclude | string(s) | File(s) with list of strains to exclude. |
| exclude_all | boolean | Exclude all strains by default. Use this with the include arguments to select a specific subset of strains. |
| exclude_ambiguous_dates_by | one of: day, month, year, any | Exclude ambiguous dates by day (e.g., 2020-09-XX), month (e.g., 2020-XX-XX), year (e.g., 200X-10-01), or any date fields. An ambiguous year makes the corresponding month and day ambiguous, too, even if those fields have unambiguous values (e.g., “201X-10-01”). Similarly, an ambiguous month makes the corresponding day ambiguous (e.g., “2010-XX-01”). |
| exclude_where | string(s) | Exclude sequences matching these conditions. Ex: “host=rat” or “host!=rat”. Multiple values are processed as OR (matching any of those specified will be excluded), not AND. |
| include | string(s) | File(s) with list of strains to include regardless of priorities, subsampling, or absence of an entry in sequences. |
| include_where | string(s) | Include sequences with these values. Ex: “host=rat”. Multiple values are processed as OR (having any of those specified will be included), not AND. This rule is applied last and ensures any strains matching these rules will be included regardless of priorities, subsampling, or absence of an entry in sequences. |
| min_date | string or integer | Minimal cutoff for date (inclusive). Supported formats: |
| max_date | string or integer | Maximal cutoff for date (inclusive). Supported formats: |
| min_length | integer | Minimal length of the sequences, only counting valid characters (excluding gaps, ambiguous, and invalid characters). |
| max_length | integer | Maximum length of the sequences, only counting valid characters (excluding gaps, ambiguous, and invalid characters). |
| exclude_invalid | boolean | Exclude sequences that contain invalid characters. |
| non_nucleotide | boolean | Deprecated; use ‘exclude_invalid’ instead. Exclude sequences that contain invalid characters. |
| query | string | Filter sequences by attribute. Uses Pandas DataFrame query syntax (e.g., “country == ‘Colombia’” or “(country == ‘USA’ & (division == ‘Washington’))”). |
| query_columns | string(s) | Use alongside query to specify columns and data types in the format ‘column:type’, where type is one of (bool,float,int,str). Automatic type inference will be attempted on all unspecified columns used in the query. Example: region:str coverage:float. |
| context_sample | string | Use the outputs from another sample as the inputs for this sample. Value must be a sample name. |
| drop_sample | boolean | Drop this sample from the final output. |
| group_by | string(s) | Grouping columns for subsampling. Notes: |
| group_by_weights | string | TSV file defining weights for grouping. Requirements: Notes: |
| probabilistic_sampling | boolean | Allow probabilistic sampling during subsampling. This is useful when there are more groups than requested sequences. This option only applies when a total sample size is provided. |
| sequences_per_group | integer | Select no more than this number of sequences per category. |
| max_sequences | integer | Select no more than this number of sequences (i.e. total sample size). Can be used without grouping columns. |
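
To make group_by and sequences_per_group concrete, here is a simplified sketch of per-group subsampling with a seeded random number generator. It is not Augur's algorithm (it ignores priorities, weights and probabilistic sampling); `subsample_per_group` and the example records are hypothetical.

```python
import random
from collections import defaultdict


def subsample_per_group(records, group_by, sequences_per_group, seed=None):
    """Keep at most sequences_per_group records from each group."""
    rng = random.Random(seed)  # a fixed seed gives reproducible selections
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[column] for column in group_by)
        groups[key].append(record)
    selected = []
    for key in sorted(groups):  # deterministic group order
        members = groups[key]
        count = min(sequences_per_group, len(members))
        selected.extend(rng.sample(members, count))
    return selected


# Made-up metadata records for illustration.
records = [
    {"strain": "A", "country": "USA", "year": "2020"},
    {"strain": "B", "country": "USA", "year": "2020"},
    {"strain": "C", "country": "USA", "year": "2020"},
    {"strain": "D", "country": "Peru", "year": "2021"},
]
picked = subsample_per_group(records, ["country", "year"], sequences_per_group=2, seed=1)
print([r["strain"] for r in picked])
```

The USA/2020 group has three members and is cut down to two, while the Peru/2021 group keeps its single member.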
proximal sample options
| Option | Type | Description |
|---|---|---|
| method | one of: | Proximity approach used. |
| focal_sample | string | FASTA file with aligned focal sequences to find neighbors for. |
| context_sample | string | FASTA file with aligned contextual sequences. |
| drop_sample | boolean | Drop this sample from the final output. |
| k | integer | Number of nearest neighbors to find per focal strain. |
| max_distance | integer | Maximum distance threshold for considering a sequence to match. |
| ignore_missing_data | string | All non-ATGC bases are converted to ‘N’, and then: ‘none’ treats ‘N’ as a normal base for comparison purposes; ‘all’ ignores positions where either sequence is ‘N’; ‘flanking’ ignores runs of Ns at the start/end of each sequence. |
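
The three ignore_missing_data modes are easier to compare side by side in code. The sketch below assumes a plain Hamming distance between aligned sequences, upper-cases input before comparison, and treats max_distance as inclusive; all three are assumptions, the function names are hypothetical, and augur proximity may compute distances differently.

```python
def normalise(sequence):
    """Convert every non-ATGC character to 'N' (after upper-casing, an assumption)."""
    return "".join(base if base in "ATGC" else "N" for base in sequence.upper())


def flank_bounds(sequence):
    """Return (start, end) of the region left after trimming leading/trailing Ns."""
    start = len(sequence) - len(sequence.lstrip("N"))
    end = len(sequence.rstrip("N"))
    return start, end


def distance(a, b, ignore_missing_data="none"):
    """Count mismatches between two aligned sequences under the chosen mode."""
    if ignore_missing_data not in ("none", "all", "flanking"):
        raise ValueError(f"unknown mode: {ignore_missing_data!r}")
    a, b = normalise(a), normalise(b)
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    start, end = 0, len(a)
    if ignore_missing_data == "flanking":
        # Compare only the region where neither sequence is inside a flanking run of Ns.
        a_start, a_end = flank_bounds(a)
        b_start, b_end = flank_bounds(b)
        start, end = max(a_start, b_start), min(a_end, b_end)
    mismatches = 0
    for x, y in zip(a[start:end], b[start:end]):
        if ignore_missing_data == "all" and "N" in (x, y):
            continue  # skip positions where either sequence is N
        if x != y:
            mismatches += 1
    return mismatches


def nearest_neighbours(focal, context, k, max_distance=None, ignore_missing_data="none"):
    """Return names of up to k context sequences closest to the focal sequence."""
    scored = sorted(
        (distance(focal, seq, ignore_missing_data), name) for name, seq in context.items()
    )
    if max_distance is not None:
        scored = [(d, name) for d, name in scored if d <= max_distance]
    return [name for _, name in scored[:k]]


print(distance("NNGT", "ACGT", "none"))      # 2
print(distance("NNGT", "ACGT", "flanking"))  # 0
print(distance("ACNT", "ACGT", "flanking"))  # 1
print(distance("ACNT", "ACGT", "all"))       # 0
```

An interior `N` still counts as a mismatch under ‘flanking’ but is skipped under ‘all’, which is the practical difference between the two modes.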
Implementation details
Configurations containing a single filter sample are run using a single call to augur filter.
Configurations containing multiple samples are run using multiple “intermediate” calls to augur commands. Each filter sample has its own call to augur filter and each proximal sample has its own call to augur proximity. Each intermediate call writes temporary files, which may include a strains list, a sequences FASTA, and/or a metadata TSV. The eventual output dataset is produced by a final augur filter call that uses the union of the requested samples.
Samples may be dropped from the final output (via the drop_sample config option). This is useful when a sample is needed only as an input for another sample.
As samples may depend on other samples (via the focal_sample and context_sample config values), internally we create a graph of samples which controls the order in which samples are evaluated.
Multithreading (via --nthreads) allows samples to run as efficiently as possible, according to the sample dependency graph. Proximal samples always use all available threads, which both lets them run as efficiently as possible (they’re often computationally expensive) and minimises the memory consumption of augur subsample.
CLI and YAML config options map closely to augur filter options.
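
The sample dependency graph mentioned above can be illustrated with the standard library's `graphlib.TopologicalSorter` (Python 3.9+). This is a sketch of the idea, not Augur's implementation; `evaluation_order` and the sample names are hypothetical, and a value is treated as a dependency only when it names another sample.

```python
from graphlib import TopologicalSorter


def evaluation_order(samples):
    """Return sample names ordered so that dependencies are evaluated first."""
    graph = {}
    for name, options in samples.items():
        # A sample depends on any sample named by focal_sample or context_sample.
        graph[name] = {
            options[key]
            for key in ("focal_sample", "context_sample")
            if options.get(key) in samples
        }
    # static_order() raises graphlib.CycleError if samples depend on each other in a loop.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical config: 'neighbours' needs both other samples, 'focal' needs 'context'.
samples = {
    "neighbours": {"focal_sample": "focal", "context_sample": "context"},
    "focal": {"context_sample": "context", "drop_sample": True},
    "context": {"max_sequences": 1000},
}
print(evaluation_order(samples))  # ['context', 'focal', 'neighbours']
```

Samples with no unmet dependencies could be run in parallel, which is where --nthreads comes in.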
The following table shows the mapping between augur subsample and augur filter CLI options.
| augur subsample CLI option | augur filter CLI option |
|---|---|
| --metadata | --metadata |
| --metadata-chunk-size | --metadata-chunk-size |
| --metadata-delimiters | --metadata-delimiters |
| --metadata-id-columns | --metadata-id-columns |
| --sequences | --sequences |
| --sequence-index | --sequence-index |
| --seed | --subsample-seed |
| --seq-type | --seq-type |
| --output-metadata | --output-metadata |
| --output-sequences | --output-sequences |
| --output-log | --output-log |
| --skip-checks | --skip-checks |
The following table shows the mapping between augur subsample filter sample configuration options and augur filter CLI options.
| YAML config option | augur filter CLI option |
|---|---|
| context_sample | |
| drop_sample | |
| exclude | --exclude |
| exclude_all | --exclude-all |
| exclude_ambiguous_dates_by | --exclude-ambiguous-dates-by |
| exclude_where | --exclude-where |
| include | --include |
| include_where | --include-where |
| min_date | --min-date |
| max_date | --max-date |
| min_length | --min-length |
| max_length | --max-length |
| exclude_invalid | --exclude-invalid |
| non_nucleotide | --exclude-invalid |
| query | --query |
| query_columns | --query-columns |
| group_by | --group-by |
| group_by_weights | --group-by-weights |
| probabilistic_sampling | --probabilistic-sampling / --no-probabilistic-sampling |
| sequences_per_group | --sequences-per-group |
| max_sequences | --subsample-max-sequences |
Note that the following augur filter options are not supported:
- --priority
- --output-group-by-sizes
- --output-strains
- --empty-output-reporting
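
The filter sample mapping above can be read as a translation from YAML keys to command-line arguments. The sketch below shows one way such a translation might work; `to_filter_args` is a hypothetical helper, and how Augur itself serialises booleans and lists is an assumption here.

```python
# YAML option -> augur filter flag, copied from the mapping table above.
FILTER_FLAGS = {
    "exclude": "--exclude",
    "exclude_all": "--exclude-all",
    "exclude_ambiguous_dates_by": "--exclude-ambiguous-dates-by",
    "exclude_where": "--exclude-where",
    "include": "--include",
    "include_where": "--include-where",
    "min_date": "--min-date",
    "max_date": "--max-date",
    "min_length": "--min-length",
    "max_length": "--max-length",
    "exclude_invalid": "--exclude-invalid",
    "non_nucleotide": "--exclude-invalid",
    "query": "--query",
    "query_columns": "--query-columns",
    "group_by": "--group-by",
    "group_by_weights": "--group-by-weights",
    "sequences_per_group": "--sequences-per-group",
    "max_sequences": "--subsample-max-sequences",
}
# Options handled by augur subsample itself, with no augur filter flag.
NO_FLAG = {"context_sample", "drop_sample"}


def to_filter_args(options):
    """Translate one filter sample's options into augur filter arguments."""
    args = []
    for option, value in options.items():
        if option in NO_FLAG:
            continue
        if option == "probabilistic_sampling":
            args.append("--probabilistic-sampling" if value else "--no-probabilistic-sampling")
            continue
        flag = FILTER_FLAGS[option]  # KeyError for unknown options
        if isinstance(value, bool):
            if value:
                args.append(flag)  # boolean options become bare flags
        elif isinstance(value, (list, tuple)):
            args.extend([flag, *map(str, value)])
        else:
            args.extend([flag, str(value)])
    return args


print(to_filter_args({"exclude_all": True, "include": ["a.txt", "b.txt"], "max_sequences": 100}))
# ['--exclude-all', '--include', 'a.txt', 'b.txt', '--subsample-max-sequences', '100']
```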
The following table shows the mapping between augur subsample proximal sample configuration options and augur proximity CLI options.
| YAML config option | augur proximity CLI option |
|---|---|
| focal_sample | |
| context_sample | |
| drop_sample | |
| method | --method |
| k | --k |
| max_distance | --max-distance |
| ignore_missing_data | --ignore-missing-data |