augur.subsample module

Subsample sequences from an input dataset.

The input dataset can consist of a metadata file, a sequences file, or both.

See documentation page for details on configuration.

augur.subsample.AugurArgs

Augur command arguments stored as a mapping from option (YAML key name) to value (command line arg name)

alias of Dict[str, Any]

augur.subsample.AugurOption

Type for an augur command line option. Either a single option or boolean pair of flags.

alias of str | Tuple[str, str | None]

augur.subsample.BooleanFlags

Type for a boolean pair of augur filter command line flags that configure the same option. When there is no second flag, absence of the first flag indicates the default behavior.

alias of Tuple[str, str | None]

augur.subsample.FILTER_FINAL_CLI_OPTIONS: Dict[str, str | Tuple[str, str | None]] = {'output_log': '--output-log', 'output_metadata': '--output-metadata', 'output_sequences': '--output-sequences', 'skip_checks': ('--skip-checks', None)}: Mapping of argparse namespace variable name to augur filter option. These are sent to only the final augur filter call.

augur.subsample.FILTER_GLOBAL_CLI_OPTIONS: Dict[str, str | Tuple[str, str | None]] = {'metadata': '--metadata', 'metadata_chunk_size': '--metadata-chunk-size', 'metadata_delimiters': '--metadata-delimiters', 'metadata_id_columns': '--metadata-id-columns', 'seed': '--subsample-seed', 'seq_type': '--seq-type', 'sequence_index': '--sequence-index', 'sequences': '--sequences'}: Mapping of argparse namespace variable name to augur filter option. These are sent to both intermediate and final augur filter calls.

augur.subsample.FILTER_SAMPLE_CONFIG: Dict[str, str | Tuple[str, str | None] | None] = {'context_sample': None, 'drop_sample': None, 'exclude': '--exclude', 'exclude_all': ('--exclude-all', None), 'exclude_ambiguous_dates_by': '--exclude-ambiguous-dates-by', 'exclude_invalid': ('--exclude-invalid', None), 'exclude_where': '--exclude-where', 'group_by': '--group-by', 'group_by_weights': '--group-by-weights', 'include': '--include', 'include_where': '--include-where', 'max_date': '--max-date', 'max_length': '--max-length', 'max_sequences': '--subsample-max-sequences', 'min_date': '--min-date', 'min_length': '--min-length', 'non_nucleotide': ('--exclude-invalid', None), 'probabilistic_sampling': ('--probabilistic-sampling', '--no-probabilistic-sampling'), 'query': '--query', 'query_columns': '--query-columns', 'sequences_per_group': '--sequences-per-group'}: Mapping of YAML configuration key name to augur filter option. These are sent to only the intermediate augur filter calls. A value of None is a sample config value which does not directly map to an augur filter argument and must be handled separately.
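The following is a minimal sketch of how a mapping of this shape could be turned into command line arguments. It is not the module's implementation. The helper `to_cli_args` and the `EXCERPT` dictionary are hypothetical names, and list values are assumed to expand to several arguments after the flag. The entries in `EXCERPT` are copied from FILTER_SAMPLE_CONFIG.

```python
from typing import Any, Dict, List, Optional, Tuple, Union

AugurOption = Union[str, Tuple[str, Optional[str]]]

# A few entries copied from FILTER_SAMPLE_CONFIG; the full mapping is larger.
EXCERPT: Dict[str, Optional[AugurOption]] = {
    "context_sample": None,
    "exclude_invalid": ("--exclude-invalid", None),
    "group_by": "--group-by",
    "max_sequences": "--subsample-max-sequences",
    "probabilistic_sampling": (
        "--probabilistic-sampling",
        "--no-probabilistic-sampling",
    ),
}


def to_cli_args(sample_config: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: translate YAML keys into command line arguments."""
    argv: List[str] = []
    for key, value in sample_config.items():
        option = EXCERPT[key]
        if option is None:
            # No direct flag exists; such keys must be handled separately.
            continue
        if isinstance(option, tuple):
            # Boolean pair: pick the flag matching the value. A missing
            # second flag means "emit nothing" for a false value.
            on_flag, off_flag = option
            flag = on_flag if value else off_flag
            if flag is not None:
                argv.append(flag)
        elif isinstance(value, list):
            # Assumption: list values become several arguments after the flag.
            argv.append(option)
            argv.extend(str(item) for item in value)
        else:
            argv.extend([option, str(value)])
    return argv


example = to_cli_args(
    {
        "group_by": ["region", "year"],
        "max_sequences": 100,
        "probabilistic_sampling": False,
        "exclude_invalid": False,
        "context_sample": "background",
    }
)
print(example)
```

In this sketch, a false value for exclude_invalid produces no argument at all, which matches the BooleanFlags description: with no second flag, absence of the first flag means the default behavior.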

class augur.subsample.FilterSample(name, config, global_filter_args, /, drop=False, context_sample=None)

Bases: Sample

connect_dependencies(samples_by_name)

Given all known samples_by_name (including the ones which this sample depends on), connect this sample to its dependencies. Essentially, use their _outputs_ as _inputs_ for this sample (i.e. self). This may require modifying upstream sample(s) to output sequences & metadata which this sample can consume.

When all upstream dependencies have been wired up, self.incomplete is set to False.

Return type: None
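The wiring described above can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not the FilterSample code. `SketchSample` is hypothetical, depends_on is assumed to map an input argument name to the name of an upstream sample, and the output file name is invented for illustration.

```python
from typing import Any, Dict


class SketchSample:
    """Hypothetical stand-in for a sample; not the augur implementation."""

    def __init__(self, name: str, depends_on: Dict[str, str]) -> None:
        self.name = name
        # Assumed shape: input argument name -> name of the upstream sample.
        self.depends_on = depends_on
        self.args: Dict[str, Any] = {}
        # A sample with dependencies is incomplete until they are wired up.
        self.incomplete = bool(depends_on)

    def connect_dependencies(
        self, samples_by_name: Dict[str, "SketchSample"]
    ) -> None:
        for input_arg, upstream_name in self.depends_on.items():
            upstream = samples_by_name[upstream_name]
            # Modify the upstream sample so that it writes an output file
            # (the file name here is purely illustrative) ...
            path = upstream.args.setdefault(
                "output_metadata", f"{upstream.name}.metadata.tsv"
            )
            # ... and consume that output as this sample's input.
            self.args[input_arg] = path
        self.incomplete = False


focal = SketchSample("focal", {})
context = SketchSample("context", {"metadata": "focal"})
context.connect_dependencies({"focal": focal, "context": context})
```

After the call, the upstream sample has gained an output argument it did not have before, and the dependent sample is no longer marked incomplete.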

augur.subsample.PROXIMAL_SAMPLE_CONFIG: Dict[str, str | Tuple[str, str | None] | None] = {'context_sample': None, 'drop_sample': None, 'focal_sample': None, 'ignore_missing_data': '--ignore-missing-data', 'k': '--k', 'max_distance': '--max-distance', 'method': '--method'}: Mapping of YAML configuration key name to augur proximity argument. A value of None is a config value which does not directly map to an augur proximity argument and must be handled separately.

class augur.subsample.ProximalSample(name, config, global_args, /, drop=False, context_sample=None, nthreads=1)

Bases: Sample

connect_dependencies(samples_by_name)

Connect this sample to its dependencies. Subclasses override as needed.

Return type: None

class augur.subsample.Sample(name, augur_cmd, drop, nthreads)

Bases: object

Base class containing information about a particular sample (represented by a ‘sample’ block in the YAML config). Information includes arguments & parameters for the underlying augur command, other samples which we depend on, details about (temporary) output files etc.

args: Dict[str, Any]

connect_dependencies(samples_by_name)

Connect this sample to its dependencies. Subclasses override as needed.

Return type: None

depends_on: dict[str, str]

incomplete: bool

remove_temporary_files(): Remove any temporary outputs which may have been created

run()

Run an augur command as a subprocess.

Notes:

A direct import of the command in Python is not used because all samples would share the same sys.stderr, which causes interleaved messages when processes are run in parallel.
shell=True is not used because it requires additional logic to carefully escape values such as “--metadata-delimiters , ”. This is also why run_shell_command() isn’t used here.

Return type: None
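The second note can be demonstrated with the standard library alone. The sketch below does not call augur. It launches a child Python process that echoes its arguments, which stands in for the real command. Because the command is passed as a list and no shell is involved, each element arrives as one unmodified argument.

```python
import subprocess
import sys

# Stand-in for an augur command: a child Python process that prints the
# arguments it received. Every list element is passed through verbatim,
# so delimiter values such as "," or ";" need no quoting or escaping.
argv = [
    sys.executable,
    "-c",
    "import sys; print(sys.argv[1:])",
    "--metadata-delimiters",
    ",",
    ";",
]

# check=True raises CalledProcessError if the child exits with an error.
result = subprocess.run(argv, capture_output=True, text=True, check=True)
echoed = result.stdout.strip()
print(echoed)
```

Running each command in its own process also gives every sample a separate stderr stream, which is the concern raised in the first note.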

augur.subsample.get_referenced_files(config_file, config_section=None, search_paths=None)

Get the files referenced in a subsample config file.

Extracts and resolves all filepath values referenced in the config, including defaults and individual sample options.

Parameters:

config_file (str) -- Path to the subsample config file.
config_section (Optional[List[str]]) -- Optional list of keys to navigate to a specific section of the config file.
search_paths (Optional[List[str]]) -- Optional list of directories to search for relative filepaths specified in the config file. If a file exists in multiple directories, only the file from the first directory will be used. This can also be set via the environment variable ‘AUGUR_SEARCH_PATHS’. Specified directories will be considered before the defaults, which are: (1) directory containing the config file (2) current working directory

Returns:

Resolved filepaths

Return type:

set
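The lookup order described for search_paths can be sketched as follows. `resolve_path` is a hypothetical helper, not part of augur.subsample. It ignores the AUGUR_SEARCH_PATHS environment variable, and returning None for an unresolved file is an assumption made for this example.

```python
import os
from typing import List, Optional


def resolve_path(
    filepath: str,
    config_file: str,
    search_paths: Optional[List[str]] = None,
) -> Optional[str]:
    """Hypothetical helper: return the first existing candidate, else None."""
    if os.path.isabs(filepath):
        return filepath if os.path.exists(filepath) else None

    # Explicitly specified directories are considered first ...
    directories = list(search_paths or [])
    # ... followed by the defaults: the directory containing the config
    # file, then the current working directory.
    directories.append(os.path.dirname(os.path.abspath(config_file)))
    directories.append(os.getcwd())

    for directory in directories:
        candidate = os.path.join(directory, filepath)
        if os.path.exists(candidate):
            # Only the file from the first matching directory is used.
            return candidate
    return None
```

For example, if include.txt exists both in a directory passed via search_paths and next to the config file, this sketch returns the copy from the search_paths directory.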

augur.subsample.merge_defaults(config)

Returns a config object without a defaults section.

Defaults are applied to every sample which is a “filter sample” (i.e. not a proximal sample), with the sample options taking precedence over the defaults.

Return type: Dict[str, Any]
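A rough illustration of the precedence rule, written as a standalone function rather than a call into augur. The top-level keys `defaults` and `samples` and the use of a `focal_sample` key to recognize a proximal sample are assumptions made for this sketch. The actual config schema is described on the documentation page.

```python
from typing import Any, Dict


def merge_defaults_sketch(config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical re-implementation, for illustration only."""
    defaults = config.get("defaults", {})
    # Copy everything except the defaults section.
    merged: Dict[str, Any] = {
        key: value for key, value in config.items() if key != "defaults"
    }
    merged["samples"] = {}
    for name, options in config.get("samples", {}).items():
        if "focal_sample" in options:
            # Assumed marker of a proximal sample: defaults are not applied.
            merged["samples"][name] = dict(options)
        else:
            # Filter sample: sample options take precedence over defaults.
            merged["samples"][name] = {**defaults, **options}
    return merged


config = {
    "defaults": {"min_length": 1000, "group_by": ["year"]},
    "samples": {
        "recent": {"group_by": ["region"], "max_sequences": 10},
        "near": {"focal_sample": "recent", "k": 5},
    },
}
merged = merge_defaults_sketch(config)
print(merged)
```

Here the sample named recent inherits min_length from the defaults but keeps its own group_by, while the sample named near is left untouched.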

augur.subsample.register_parser(parent_subparsers)

Return type: ArgumentParser

augur.subsample.requires_aligned_sequences(config_file, config_section=None)

NOTE: This function may change without warning & is for internal Snakemake development / testing purposes.

Checks whether the config specifies one or more proximal samples. If it does, then --sequences must be supplied to augur subsample and the sequences must be aligned.

Parameters:

config_file (str) -- Path to the subsample config file.
config_section (Optional[List[str]]) -- Optional list of keys to navigate to a specific section of the config file.

Returns:

Whether augur subsample requires aligned sequences.

Return type:

bool
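The sketch below shows one way the config_section navigation and the proximal-sample check could fit together. It operates on an already parsed config dictionary instead of a file path, so it needs no YAML library. Both helper names are hypothetical, and treating a `focal_sample` key under a `samples` section as the marker of a proximal sample is an assumption.

```python
from typing import Any, Dict, List, Optional


def select_section(
    config: Dict[str, Any], config_section: Optional[List[str]] = None
) -> Dict[str, Any]:
    """Hypothetical helper: follow a list of keys into a nested config."""
    section = config
    for key in config_section or []:
        section = section[key]
    return section


def has_proximal_sample(
    config: Dict[str, Any], config_section: Optional[List[str]] = None
) -> bool:
    """Hypothetical check: does any sample look like a proximal sample?"""
    samples = select_section(config, config_section).get("samples", {})
    return any("focal_sample" in options for options in samples.values())


nested = {
    "subsample": {
        "build_a": {
            "samples": {
                "recent": {"max_sequences": 10},
                "near": {"focal_sample": "recent", "k": 5},
            }
        },
        "build_b": {"samples": {"recent": {"max_sequences": 10}}},
    }
}
print(has_proximal_sample(nested, ["subsample", "build_a"]))
print(has_proximal_sample(nested, ["subsample", "build_b"]))
```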

augur.subsample.run(args)

Run augur subsample.

This is implemented by calling augur filter once for each sample in the config (i.e. the intermediate calls), then one more time to combine the samples (i.e. the final call). It was inspired by several pathogen repos adopting a similar approach using Snakemake rules.

Notes on performance:

If multiple intermediate calls use sequence-based filters and --sequence-index is not set, each call will build its own sequence index, so the same work is done at least twice. A more efficient approach would be to add a preliminary step that builds the sequence index once and passes it down to the intermediate calls. However, this complicates things and may not be worth it if sequence indexing is rewritten: <https://github.com/nextstrain/augur/issues/1846>
If multiple intermediate calls use the same default filters and those filters significantly reduce the size of the initial input dataset, each call will go through the large input dataset and apply the same filters, so the same work is done at least twice. A more efficient approach would be to run the default options through an initial augur filter call. This would output a much smaller intermediate dataset that can be used by the intermediate calls. However, this complicates things and may not be worth it if a proper input reuse approach such as database/parquet file support is adopted: <https://github.com/nextstrain/augur/issues/1574>

Return type: None
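The overall shape of the intermediate and final calls can be sketched as a function that only plans the command lines and runs nothing. `plan_filter_calls` is hypothetical. Combining samples through --output-strains files that are passed to --exclude-all --include in the final call is an assumption about how the pieces fit together, and the temporary file names are invented.

```python
from typing import Dict, List


def plan_filter_calls(
    samples: Dict[str, List[str]],
    global_args: List[str],
    final_args: List[str],
) -> List[List[str]]:
    """Hypothetical planner: one call per sample, then one final call."""
    calls: List[List[str]] = []
    strain_files: List[str] = []

    # Intermediate calls: global options plus the sample's own options.
    for name, sample_args in samples.items():
        strains = f"{name}.strains.txt"  # illustrative temporary output
        calls.append(
            ["augur", "filter", *global_args, *sample_args,
             "--output-strains", strains]
        )
        strain_files.append(strains)

    # Final call (assumed form): exclude everything, then include the
    # strains selected by each sample, and write the final outputs.
    calls.append(
        ["augur", "filter", *global_args, "--exclude-all",
         "--include", *strain_files, *final_args]
    )
    return calls


calls = plan_filter_calls(
    {"early": ["--max-date", "2020-01-01"], "regional": ["--group-by", "region"]},
    ["--metadata", "metadata.tsv"],
    ["--output-metadata", "subsampled.tsv"],
)
for call in calls:
    print(" ".join(call))
```

In this plan, every intermediate call reads the full metadata file, which is the repeated work that the performance notes describe.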