augur.subsample module
Subsample sequences from an input dataset.
The input dataset can consist of a metadata file, a sequences file, or both.
See documentation page for details on configuration.
- augur.subsample.AugurArgs
Augur command arguments stored as a mapping from option (YAML key name) to value (command line arg name).
- augur.subsample.AugurOption
Type for an augur command line option. Either a single option or boolean pair of flags.
- augur.subsample.BooleanFlags
Type for a boolean pair of augur filter command line flags that configure the same option. When there is no second flag, absence of the first flag indicates the default behavior.
- augur.subsample.FILTER_FINAL_CLI_OPTIONS: Dict[str, str | Tuple[str, str | None]] = {'output_log': '--output-log', 'output_metadata': '--output-metadata', 'output_sequences': '--output-sequences', 'skip_checks': ('--skip-checks', None)}
Mapping of argparse namespace variable name to augur filter option. These are sent to only the final augur filter call.
- augur.subsample.FILTER_GLOBAL_CLI_OPTIONS: Dict[str, str | Tuple[str, str | None]] = {'metadata': '--metadata', 'metadata_chunk_size': '--metadata-chunk-size', 'metadata_delimiters': '--metadata-delimiters', 'metadata_id_columns': '--metadata-id-columns', 'seed': '--subsample-seed', 'seq_type': '--seq-type', 'sequence_index': '--sequence-index', 'sequences': '--sequences'}
Mapping of argparse namespace variable name to augur filter option. These are sent to both intermediate and final augur filter calls.
- augur.subsample.FILTER_SAMPLE_CONFIG: Dict[str, str | Tuple[str, str | None] | None] = {'context_sample': None, 'drop_sample': None, 'exclude': '--exclude', 'exclude_all': ('--exclude-all', None), 'exclude_ambiguous_dates_by': '--exclude-ambiguous-dates-by', 'exclude_invalid': ('--exclude-invalid', None), 'exclude_where': '--exclude-where', 'group_by': '--group-by', 'group_by_weights': '--group-by-weights', 'include': '--include', 'include_where': '--include-where', 'max_date': '--max-date', 'max_length': '--max-length', 'max_sequences': '--subsample-max-sequences', 'min_date': '--min-date', 'min_length': '--min-length', 'non_nucleotide': ('--exclude-invalid', None), 'probabilistic_sampling': ('--probabilistic-sampling', '--no-probabilistic-sampling'), 'query': '--query', 'query_columns': '--query-columns', 'sequences_per_group': '--sequences-per-group'}
Mapping of YAML configuration key name to augur filter option. These are sent to only the intermediate augur filter calls. A value of None is a sample config value which does not directly map to an augur filter argument and must be handled separately.
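The option mappings above can be read as a translation table from config keys to command line arguments. The following is a minimal sketch of that translation, not the module's actual implementation: `to_cli_args` is a hypothetical helper, `OPTIONS` is a hand-copied excerpt of `FILTER_SAMPLE_CONFIG`, and the handling of list values is an assumption.

```python
from typing import Dict, List, Optional, Tuple, Union

# Local stand-ins for the documented types (assumed shapes, not imported from augur).
BooleanFlags = Tuple[str, Optional[str]]
AugurOption = Union[str, BooleanFlags]

# A small excerpt of the documented FILTER_SAMPLE_CONFIG mapping.
OPTIONS: Dict[str, Optional[AugurOption]] = {
    "context_sample": None,
    "exclude_all": ("--exclude-all", None),
    "max_sequences": "--subsample-max-sequences",
    "min_date": "--min-date",
    "probabilistic_sampling": ("--probabilistic-sampling", "--no-probabilistic-sampling"),
}


def to_cli_args(config: dict, options: Dict[str, Optional[AugurOption]] = OPTIONS) -> List[str]:
    """Hypothetical helper: translate a sample config into an argument list."""
    args: List[str] = []
    for key, value in config.items():
        option = options[key]
        if option is None:
            # No direct command line equivalent; must be handled separately.
            continue
        if isinstance(option, tuple):
            true_flag, false_flag = option
            if value:
                args.append(true_flag)
            elif false_flag is not None:
                args.append(false_flag)
            # With no second flag, omitting the first flag selects the default.
        elif isinstance(value, (list, tuple)):
            # Assumption: multi-valued options take all values after one flag.
            args.append(option)
            args.extend(str(v) for v in value)
        else:
            args.extend([option, str(value)])
    return args


example_args = to_cli_args({
    "min_date": "2020-01-01",
    "max_sequences": 100,
    "probabilistic_sampling": False,
    "exclude_all": False,
    "context_sample": "background",
})
```

Here `probabilistic_sampling: False` emits the explicit negative flag, while `exclude_all: False` emits nothing because its second flag is `None`.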
- class augur.subsample.FilterSample(name, config, global_filter_args, /, drop=False, context_sample=None)
Bases: Sample
- connect_dependencies(samples_by_name)
Given all known samples_by_name (i.e. ones which this sample depends on), connect this sample to its dependencies. Essentially use their _outputs_ as _inputs_ for our sample (i.e. self). This may require modifying upstream sample(s) to output sequences & metadata which we can consume!
When all upstream dependencies have been wired up, self.incomplete is set to False.
- Return type:
- augur.subsample.PROXIMAL_SAMPLE_CONFIG: Dict[str, str | Tuple[str, str | None] | None] = {'context_sample': None, 'drop_sample': None, 'focal_sample': None, 'ignore_missing_data': '--ignore-missing-data', 'k': '--k', 'max_distance': '--max-distance', 'method': '--method'}
Mapping of YAML configuration key name to augur proximity argument. A value of None is a config value which does not directly map to an augur proximity argument and must be handled separately.
- class augur.subsample.ProximalSample(name, config, global_args, /, drop=False, context_sample=None, nthreads=1)
Bases: Sample
- class augur.subsample.Sample(name, augur_cmd, drop, nthreads)
Bases: object
Base class containing information about a particular sample (represented by a "sample" block in the YAML config). Information includes arguments and parameters for the underlying augur command, other samples which we depend on, and details about (temporary) output files.
- connect_dependencies(samples_by_name)
Connect this sample to its dependencies. Subclasses override as needed.
- Return type:
- remove_temporary_files()
Remove any temporary outputs which may have been created.
- run()
Run an augur command as a subprocess.
Notes:
A direct import of the command in Python is not used because all samples would share the same sys.stderr, which causes interleaved messages when processes are run in parallel.
shell=True is not used because it requires additional logic to carefully escape values such as "--metadata-delimiters , ". This is also why run_shell_command() isn't used here.
- Return type:
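The point about avoiding shell=True can be illustrated with a minimal sketch. It does not call augur; instead it launches a child Python process that echoes the arguments it received, so the command shown is purely illustrative. Passing the command as a list hands each value to the child process verbatim, so a comma or a bare space needs no escaping.

```python
import subprocess
import sys

# Illustrative command: a child Python process that prints its own arguments.
# The delimiter values "," and " " would need careful quoting under shell=True.
cmd = [
    sys.executable, "-c", "import sys; print(sys.argv[1:])",
    "--metadata-delimiters", ",", " ",
]

# A list of arguments (and no shell=True) bypasses shell parsing entirely.
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
received = result.stdout.strip()
```

Running each command in its own subprocess also gives it a separate stderr stream, which is the reason the notes above give for not importing the command directly.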
- augur.subsample.get_referenced_files(config_file, config_section=None, search_paths=None)
Get the files referenced in a subsample config file.
Extracts and resolves all filepath values referenced in the config, including defaults and individual sample options.
- Parameters:
config_file (str) -- Path to the subsample config file.
config_section (Optional[List[str]]) -- Optional list of keys to navigate to a specific section of the config file.
search_paths (Optional[List[str]]) -- Optional list of directories to search for relative filepaths specified in the config file. If a file exists in multiple directories, only the file from the first directory will be used. This can also be set via the environment variable "AUGUR_SEARCH_PATHS". Specified directories will be considered before the defaults, which are: (1) directory containing the config file (2) current working directory
- Returns:
Resolved filepaths
- Return type:
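The search order described for search_paths can be sketched as follows. This is an illustration under stated assumptions rather than the function's real logic: `resolve_path` is a hypothetical helper, returning None for a missing file is an assumption, and the AUGUR_SEARCH_PATHS environment variable is left out because its format is not described here.

```python
from pathlib import Path
from typing import List, Optional


def resolve_path(filepath: str, config_file: str,
                 search_paths: Optional[List[str]] = None) -> Optional[Path]:
    """Hypothetical helper: return the first existing match for filepath."""
    path = Path(filepath)
    if path.is_absolute():
        return path if path.exists() else None

    # Specified directories come first, then the documented defaults:
    # (1) the directory containing the config file, (2) the working directory.
    directories = [Path(p) for p in (search_paths or [])]
    directories.append(Path(config_file).resolve().parent)
    directories.append(Path.cwd())

    for directory in directories:
        candidate = directory / path
        if candidate.exists():
            # Only the file from the first matching directory is used.
            return candidate
    return None
```

For example, `resolve_path("include.txt", "config/subsample.yaml", ["overrides"])` would prefer `overrides/include.txt` over `config/include.txt` when both exist (the file and directory names here are made up).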
- augur.subsample.register_parser(parent_subparsers)
- Return type:
- augur.subsample.requires_aligned_sequences(config_file, config_section=None)
NOTE: This function may change without warning & is for internal Snakemake development / testing purposes.
Checks whether the config specifies one or more proximal samples. If it does, then --sequences must be supplied to augur subsample and the sequences must be aligned.
- augur.subsample.run(args)
Run augur subsample.
This is implemented by calling augur filter once for each sample in the config (i.e. the intermediate calls), then one more time to combine the samples (i.e. the final call). It was inspired by several pathogen repos adopting a similar approach using Snakemake rules.
Notes on performance:
If multiple intermediate calls use sequence-based filters and --sequence-index is not set, each call will build its own sequence index, meaning the same work is done at least twice. A more optimal approach would be to add a preliminary step to build the sequence index then pass it down to the intermediate calls. However, this complicates things and may not be worth it if sequence indexing is rewritten: <https://github.com/nextstrain/augur/issues/1846>
If multiple intermediate calls use the same default filters that significantly reduce the size of the initial input dataset, each call will go through the large input dataset and filter it with the same filters, meaning the same work is done at least twice. A more optimal approach would be to run the default options through an initial augur filter call. This would output a much smaller intermediate dataset that can be used by the intermediate calls. However, this complicates things and may not be worth it if a proper input reuse approach such as database/parquet file support is adopted: <https://github.com/nextstrain/augur/issues/1574>
- Return type:
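The overall shape described above, one intermediate call per sample followed by a single final call, can be sketched as a list of planned commands. This is a rough sketch under loud assumptions: `plan_calls` and the temporary file names are hypothetical, and combining samples by writing strain lists with `--output-strains` and re-reading them with `--exclude-all --include` is one plausible mechanism, not something this page confirms.

```python
from typing import Dict, List


def plan_calls(global_args: List[str],
               samples: Dict[str, List[str]],
               final_args: List[str]) -> List[List[str]]:
    """Hypothetical helper: plan N intermediate calls plus one final call."""
    calls: List[List[str]] = []
    strain_files: List[str] = []

    for name, sample_args in samples.items():
        strains = f"{name}.strains.txt"  # hypothetical temporary output
        strain_files.append(strains)
        # Intermediate call: global options plus this sample's own options.
        calls.append(["augur", "filter", *global_args, *sample_args,
                      "--output-strains", strains])

    # Final call: global options plus the final-only options.
    # Assumption: samples are combined by excluding everything, then
    # force-including the strains selected by each intermediate call.
    calls.append(["augur", "filter", *global_args,
                  "--exclude-all", "--include", *strain_files, *final_args])
    return calls


planned = plan_calls(
    ["--metadata", "metadata.tsv"],
    {"recent": ["--min-date", "2024-01-01"], "early": ["--max-date", "2023-12-31"]},
    ["--output-metadata", "subsampled.tsv"],
)
```

The sketch also makes the performance notes concrete: every planned call repeats the global inputs, so each one reads the full input dataset again.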