augur.filter module
Filter and subsample a sequence set.
- exception augur.filter.FilterException
Bases: Exception
Representation of an error that occurred during filtering.
- class augur.filter.PriorityQueue(max_size)
Bases: object
A priority queue implementation that automatically replaces lower priority items in the heap with incoming higher priority items.
Add a single record to a heap with a maximum of 2 records.
>>> queue = PriorityQueue(max_size=2)
>>> queue.add({"strain": "strain1"}, 0.5)
1
Add another record with a higher priority. The queue should be at its maximum size.
>>> queue.add({"strain": "strain2"}, 1.0)
2
>>> queue.heap
[(0.5, 0, {'strain': 'strain1'}), (1.0, 1, {'strain': 'strain2'})]
>>> list(queue.get_items())
[{'strain': 'strain1'}, {'strain': 'strain2'}]
Add a higher priority record that causes the queue to exceed its maximum size. The resulting queue should contain the two highest priority records after the lowest priority record is removed.
>>> queue.add({"strain": "strain3"}, 2.0)
2
>>> list(queue.get_items())
[{'strain': 'strain2'}, {'strain': 'strain3'}]
Add a record with the same priority as another record, forcing the duplicate to be resolved by removing the oldest entry.
>>> queue.add({"strain": "strain4"}, 1.0)
2
>>> list(queue.get_items())
[{'strain': 'strain4'}, {'strain': 'strain3'}]
- add(item, priority)
Add an item to the queue with a given priority.
If adding the item causes the queue to exceed its maximum size, replace the lowest priority item with the given item. The queue stores items with an additional heap id value (a count) to resolve ties between items with equal priority (favoring the most recently added item).
- get_items()
Return each item in the queue in order.
- Yields
Any – Item stored in the queue.
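The queue behavior documented above can be approximated with Python's heapq module. This is a minimal sketch, not augur's source: the class name BoundedPriorityQueue is hypothetical, and it assumes a min-heap of (priority, heap id, item) tuples, as the queue.heap output in the examples suggests.

```python
import heapq
import itertools


class BoundedPriorityQueue:
    """Hypothetical stand-in for a size-limited priority queue."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.heap = []
        # Monotonic counter used as a tie-breaker, so the items
        # themselves (for example dicts) are never compared.
        self.counter = itertools.count()

    def add(self, item, priority):
        entry = (priority, next(self.counter), item)
        if len(self.heap) >= self.max_size:
            # Push the new entry, then discard the smallest entry.
            heapq.heappushpop(self.heap, entry)
        else:
            heapq.heappush(self.heap, entry)
        return len(self.heap)

    def get_items(self):
        # Yield items in heap order, without priorities or ids.
        for _priority, _heap_id, item in self.heap:
            yield item


queue = BoundedPriorityQueue(max_size=2)
for strain, priority in [("strain1", 0.5), ("strain2", 1.0), ("strain3", 2.0)]:
    queue.add({"strain": strain}, priority)
kept = sorted(record["strain"] for record in queue.get_items())
```

Because the counter grows with every insertion, an older entry with the same priority compares as smaller and is the one discarded, which matches the tie-breaking rule described for add().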
- exception augur.filter.TooManyGroupsError(msg)
Bases: ValueError
- augur.filter.apply_filters(metadata, exclude_by, include_by)
Apply a list of filters to exclude or force-include records from the given metadata and return the strains to keep, to exclude, and to force include.
- Parameters
metadata (pandas.DataFrame) – Metadata to filter
exclude_by (list[tuple]) – A list of 2-element tuples with a callable to filter by in the first index and a dictionary of kwargs to pass to the function in the second index.
include_by (list[tuple]) – A list of 2-element tuples in the same format as the exclude_by argument.
- Returns
set – Strains to keep (those that passed all filters)
list[dict] – Strains to exclude along with the function that filtered them and the arguments used to run the function.
list[dict] – Strains to force-include along with the function that filtered them and the arguments used to run the function.
For example, filter data by minimum date, but force the inclusion of strains
from Africa.
>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-10-02"}, {"region": "North America", "date": "2020-01-01"}], index=["strain1", "strain2", "strain3"])
>>> exclude_by = [(filter_by_date, {"min_date": numeric_date("2020-04-01")})]
>>> include_by = [(include_by_include_where, {"include_where": "region=Africa"})]
>>> strains_to_keep, strains_to_exclude, strains_to_include = apply_filters(metadata, exclude_by, include_by)
>>> strains_to_keep
{'strain2'}
>>> sorted(strains_to_exclude, key=lambda record: record["strain"])
[{'strain': 'strain1', 'filter': 'filter_by_date', 'kwargs': '[["min_date", 2020.25]]'}, {'strain': 'strain3', 'filter': 'filter_by_date', 'kwargs': '[["min_date", 2020.25]]'}]
>>> strains_to_include
[{'strain': 'strain1', 'filter': 'include_by_include_where', 'kwargs': '[["include_where", "region=Africa"]]'}]
We also want to filter by characteristics of the sequence data that we’ve
annotated in a sequence index.
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}, {"strain": "strain3", "A": 1250, "C": 1250, "G": 1250, "T": 1250}]).set_index("strain")
>>> exclude_by = [(filter_by_sequence_length, {"sequence_index": sequence_index, "min_length": 27000})]
>>> include_by = [(include_by_include_where, {"include_where": "region=Europe"})]
>>> strains_to_keep, strains_to_exclude, strains_to_include = apply_filters(metadata, exclude_by, include_by)
>>> strains_to_keep
{'strain1'}
>>> sorted(strains_to_exclude, key=lambda record: record["strain"])
[{'strain': 'strain2', 'filter': 'filter_by_sequence_length', 'kwargs': '[["min_length", 27000]]'}, {'strain': 'strain3', 'filter': 'filter_by_sequence_length', 'kwargs': '[["min_length", 27000]]'}]
>>> strains_to_include
[{'strain': 'strain2', 'filter': 'include_by_include_where', 'kwargs': '[["include_where", "region=Europe"]]'}]
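The control flow described for apply_filters can be sketched with a plain dictionary of records in place of a pandas.DataFrame. Everything named here (apply_filters_sketch, min_year, region_is) is hypothetical, and the kwargs are serialized with json.dumps only for illustration; augur's implementation may differ in detail.

```python
import json


def apply_filters_sketch(metadata, exclude_by, include_by):
    """metadata: dict mapping strain name -> record (dict)."""
    strains_to_keep = set(metadata)
    strains_to_exclude = []
    strains_to_include = []

    def report(strain, function, kwargs):
        return {
            "strain": strain,
            "filter": function.__name__,
            "kwargs": json.dumps(sorted(kwargs.items())),
        }

    # Record every strain matched by an inclusion rule.
    for include_function, kwargs in include_by:
        for strain in sorted(include_function(metadata, **kwargs)):
            strains_to_include.append(report(strain, include_function, kwargs))

    # Each filter returns the strains that pass; the rest of the strains
    # still under consideration are reported as excluded by that filter.
    for filter_function, kwargs in exclude_by:
        remaining = {strain: metadata[strain] for strain in strains_to_keep}
        passed = filter_function(remaining, **kwargs)
        for strain in sorted(strains_to_keep - passed):
            strains_to_exclude.append(report(strain, filter_function, kwargs))
        strains_to_keep &= passed

    return strains_to_keep, strains_to_exclude, strains_to_include


def min_year(metadata, year):
    return {s for s, record in metadata.items() if record["year"] >= year}


def region_is(metadata, region):
    return {s for s, record in metadata.items() if record["region"] == region}


metadata = {
    "strain1": {"region": "Africa", "year": 2019},
    "strain2": {"region": "Europe", "year": 2021},
    "strain3": {"region": "North America", "year": 2019},
}
keep, excluded, included = apply_filters_sketch(
    metadata,
    exclude_by=[(min_year, {"year": 2020})],
    include_by=[(region_is, {"region": "Africa"})],
)
```

As in the documented examples, a force-included strain is reported separately and is not merged into the set of kept strains; combining the two is left to the caller.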
- augur.filter.calculate_sequences_per_group(target_max_value, counts_per_group, allow_probabilistic=True)
Calculate the number of sequences per group for a given maximum number of sequences to be returned and the number of sequences in each requested group. Optionally, allow the result to be probabilistic such that the mean result of a Poisson process achieves the calculated sequences per group for the given maximum.
- Parameters
target_max_value (int) – Maximum number of sequences to return by subsampling at some calculated number of sequences per group for the given counts per group.
counts_per_group (list[int]) – A list with the number of sequences in each requested group.
allow_probabilistic (bool) – Whether to allow probabilistic subsampling when the number of groups exceeds the requested maximum.
- Raises
TooManyGroupsError – When there are more groups than sequences per group and probabilistic subsampling is not allowed.
- Returns
int or float – Number of sequences per group.
bool – Whether probabilistic subsampling was used.
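One way to compute such a value is sketched below. It is an illustration under stated assumptions rather than augur's algorithm: the function names are hypothetical, TooManyGroupsError is redefined locally as a stand-in, every group is assumed to contain at least one sequence, and the error is raised when the number of groups exceeds the requested maximum.

```python
class TooManyGroupsError(ValueError):
    """Local stand-in for the exception documented above."""


def total_sequences(per_group, counts_per_group):
    # A group can contribute at most as many sequences as it contains.
    return sum(min(per_group, count) for count in counts_per_group)


def sequences_per_group_sketch(target_max_value, counts_per_group,
                               allow_probabilistic=True):
    n_groups = len(counts_per_group)
    if n_groups > target_max_value:
        if not allow_probabilistic:
            raise TooManyGroupsError(
                f"{n_groups} groups exceed the requested maximum of "
                f"{target_max_value} sequences"
            )
        # Fractional rate: if each group's size is drawn from a Poisson
        # distribution with this mean, the expected total is roughly
        # target_max_value.
        return target_max_value / n_groups, True

    # Binary search for the largest integer whose capped total
    # stays within the target.
    lo, hi = 1, target_max_value
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if total_sequences(mid, counts_per_group) <= target_max_value:
            lo = mid
        else:
            hi = mid - 1
    return lo, False
```

With a target of 4 and groups of sizes 4 and 2, two sequences per group give a total of exactly 4, while three per group would give 5, so the sketch returns 2.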
- augur.filter.construct_filters(args, sequence_index)
Construct lists of filters and inclusion criteria based on user-provided arguments.
- Parameters
args (argparse.Namespace) – Command line arguments provided by the user.
sequence_index (pandas.DataFrame) – Sequence index for the provided arguments.
- Returns
list – A list of 2-element tuples with a callable to use as a filter and a dictionary of kwargs to pass to the callable.
list – A list of 2-element tuples with a callable and dictionary of kwargs that determines whether to force include strains in the final output.
- augur.filter.create_queues_by_group(groups, max_size, max_attempts=100, random_seed=None)
Create a dictionary of priority queues per group for the given maximum size.
When the maximum size is fractional, probabilistically sample the maximum size of each queue from a Poisson distribution. Make up to the given maximum number of attempts to create queues whose maximum sizes sum to more than zero.
Create queues for two groups with a fixed maximum size.
>>> groups = ("2015", "2016")
>>> queues = create_queues_by_group(groups, 2)
>>> sum(queue.max_size for queue in queues.values())
4
Create queues for two groups with a fractional maximum size. Their total max size should still be an integer value greater than zero.
>>> seed = 314159
>>> queues = create_queues_by_group(groups, 0.1, random_seed=seed)
>>> int(sum(queue.max_size for queue in queues.values())) > 0
True
A subsequent run of this function with the same groups and random seed should produce the same queues and queue sizes.
>>> more_queues = create_queues_by_group(groups, 0.1, random_seed=seed)
>>> [queue.max_size for queue in queues.values()] == [queue.max_size for queue in more_queues.values()]
True
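The retry logic for fractional sizes can be illustrated with the standard library alone. This sketch returns only the sampled maximum size per group instead of building queues, and it draws Poisson values with Knuth's multiplication method; the function names are hypothetical and the random number generator is an assumption, so sampled values will not match those of augur.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson-distributed integer (Knuth's multiplication method)."""
    threshold = math.exp(-mean)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def queue_sizes_by_group(groups, max_size, max_attempts=100, random_seed=None):
    rng = random.Random(random_seed)
    sizes = {}
    for _attempt in range(max_attempts):
        # Sorting makes the result independent of the input order.
        for group in sorted(groups):
            if max_size < 1.0:
                sizes[group] = sample_poisson(max_size, rng)
            else:
                sizes[group] = max_size
        # Stop as soon as at least one group can hold a record.
        if sum(sizes.values()) > 0:
            break
    return sizes


fixed = queue_sizes_by_group(("2015", "2016"), 2)
first = queue_sizes_by_group(("2015", "2016"), 0.1, random_seed=314159)
second = queue_sizes_by_group(("2015", "2016"), 0.1, random_seed=314159)
```

Retrying matters because small fractional means frequently sample zero for every group, which would leave nothing to subsample.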
- augur.filter.filter_by_ambiguous_date(metadata, date_column='date', ambiguity='any')
Filter metadata in the given pandas DataFrame where values in the given date column have a given level of ambiguity.
- Parameters
metadata (pandas.DataFrame) – Metadata indexed by strain name
date_column (str) – Column in the dataframe with dates.
ambiguity (str) – Level of date ambiguity to filter metadata by
- Returns
set[str] – Strains that pass the filter
>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-XX"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> filter_by_ambiguous_date(metadata)
{'strain2'}
>>> sorted(filter_by_ambiguous_date(metadata, ambiguity="month"))
['strain1', 'strain2']
If the requested date column does not exist, we quietly skip this filter.
>>> sorted(filter_by_ambiguous_date(metadata, date_column="missing_column"))
['strain1', 'strain2']
- augur.filter.filter_by_date(metadata, date_column='date', min_date=None, max_date=None)
Filter metadata by minimum or maximum date.
- Parameters
metadata (pandas.DataFrame) – Metadata indexed by strain name
date_column (str) – Column in the dataframe with dates.
min_date (float) – Minimum date
max_date (float) – Maximum date
- Returns
set[str] – Strains that pass the filter
>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> filter_by_date(metadata, min_date=numeric_date("2020-01-02"))
{'strain2'}
>>> filter_by_date(metadata, max_date=numeric_date("2020-01-01"))
{'strain1'}
>>> filter_by_date(metadata, min_date=numeric_date("2020-01-03"), max_date=numeric_date("2020-01-10"))
set()
>>> sorted(filter_by_date(metadata, min_date=numeric_date("2019-12-30"), max_date=numeric_date("2020-01-10")))
['strain1', 'strain2']
>>> sorted(filter_by_date(metadata))
['strain1', 'strain2']
If the requested date column does not exist, we quietly skip this filter.
>>> sorted(filter_by_date(metadata, date_column="missing_column", min_date=numeric_date("2020-01-02")))
['strain1', 'strain2']
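The examples compare dates as floating point years. The sketch below shows one way such a comparison could work for complete dates only. The helper to_decimal_year is hypothetical and uses a mid-day convention that happens to agree with the rounded values 2020.25 and 2020.17 shown elsewhere in this module's examples; it is not augur's numeric_date and does not handle ambiguous dates such as 2020-01-XX.

```python
from datetime import date


def to_decimal_year(iso_date):
    """Hypothetical helper: convert 'YYYY-MM-DD' to a fractional year."""
    day = date.fromisoformat(iso_date)
    days_in_year = (date(day.year + 1, 1, 1) - date(day.year, 1, 1)).days
    # Use the midpoint of the day within the year.
    return day.year + (day.timetuple().tm_yday - 0.5) / days_in_year


def filter_by_date_sketch(dates_by_strain, min_date=None, max_date=None):
    """dates_by_strain: dict mapping strain name -> 'YYYY-MM-DD' string."""
    passed = set()
    for strain, iso_date in dates_by_strain.items():
        value = to_decimal_year(iso_date)
        # Both bounds are inclusive; a missing bound is not applied.
        if min_date is not None and value < min_date:
            continue
        if max_date is not None and value > max_date:
            continue
        passed.add(strain)
    return passed


dates = {"strain1": "2020-01-01", "strain2": "2020-01-02"}
after = filter_by_date_sketch(dates, min_date=to_decimal_year("2020-01-02"))
before = filter_by_date_sketch(dates, max_date=to_decimal_year("2020-01-01"))
```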
- augur.filter.filter_by_exclude(metadata, exclude_file)
Exclude the given set of strains from the given metadata.
- Parameters
metadata (pandas.DataFrame) – Metadata indexed by strain name
exclude_file (str) – Filename with strain names to exclude from the given metadata
- Returns
set[str] – Strains that pass the filter
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> with NamedTemporaryFile(delete=False) as exclude_file:
...     characters_written = exclude_file.write(b'strain1')
>>> filter_by_exclude(metadata, exclude_file.name)
{'strain2'}
>>> os.unlink(exclude_file.name)
- augur.filter.filter_by_exclude_all(metadata)
Exclude all strains regardless of the given metadata content.
This is a placeholder function that can be called as part of a generalized loop through all possible functions.
- Parameters
metadata (pandas.DataFrame) – Metadata indexed by strain name
- Returns
set[str] – Empty set of strains
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> filter_by_exclude_all(metadata)
set()
- augur.filter.filter_by_exclude_where(metadata, exclude_where)
Exclude all strains from the given metadata that match the given exclusion query.
Unlike pandas query syntax, exclusion queries should follow the pattern of “property=value” or “property!=value”. Additionally, this filter treats all values like lowercase strings, so we convert all values to strings first and then lowercase them before testing the given query.
- Parameters
metadata (pandas.DataFrame) – Metadata indexed by strain name
exclude_where (str) – Filter query used to exclude strains
- Returns
set[str] – Strains that pass the filter
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> filter_by_exclude_where(metadata, "region!=Europe")
{'strain2'}
>>> filter_by_exclude_where(metadata, "region=Europe")
{'strain1'}
>>> filter_by_exclude_where(metadata, "region=europe")
{'strain1'}
If the column referenced in the given query does not exist, skip the filter.
>>> sorted(filter_by_exclude_where(metadata, "missing_column=value"))
['strain1', 'strain2']
- augur.filter.filter_by_non_nucleotide(metadata, sequence_index)
Filter metadata for strains with invalid nucleotide content.
- Parameters
metadata (pandas.DataFrame) – Metadata indexed by strain name
sequence_index (pandas.DataFrame) – Sequence index
- Returns
set[str] – Strains that pass the filter
>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "invalid_nucleotides": 0}, {"strain": "strain2", "invalid_nucleotides": 1}]).set_index("strain")
>>> filter_by_non_nucleotide(metadata, sequence_index)
{'strain1'}
- augur.filter.filter_by_query(metadata, query)
Filter metadata in the given pandas DataFrame with a query string and return the strain names that pass the filter.
- Parameters
metadata (pandas.DataFrame) – Metadata indexed by strain name
query (str) – Query string for the dataframe.
- Returns
set[str] – Strains that pass the filter
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> filter_by_query(metadata, "region == 'Africa'")
{'strain1'}
>>> filter_by_query(metadata, "region == 'North America'")
set()
- augur.filter.filter_by_sequence_index(metadata, sequence_index)
Filter metadata by presence of corresponding entries in a given sequence index. This filter effectively intersects the strain ids in the metadata and sequence index.
- Parameters
metadata (pandas.DataFrame) – Metadata indexed by strain name
sequence_index (pandas.DataFrame) – Sequence index
- Returns
set[str] – Strains that pass the filter
>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "ACGT": 28000}]).set_index("strain")
>>> filter_by_sequence_index(metadata, sequence_index)
{'strain1'}
- augur.filter.filter_by_sequence_length(metadata, sequence_index, min_length=0)
Filter metadata by sequence length from a given sequence index.
- Parameters
metadata (pandas.DataFrame) – Metadata indexed by strain name
sequence_index (pandas.DataFrame) – Sequence index
min_length (int) – Minimum number of standard nucleotide characters (A, C, G, or T) in each sequence
- Returns
set[str] – Strains that pass the filter
>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}]).set_index("strain")
>>> filter_by_sequence_length(metadata, sequence_index, min_length=27000)
{'strain1'}
It is possible for the sequence index to be missing strains present in the metadata.
>>> sequence_index = pd.DataFrame([{"strain": "strain3", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}]).set_index("strain")
>>> filter_by_sequence_length(metadata, sequence_index, min_length=27000)
set()
- augur.filter.filter_kwargs_to_str(kwargs)
Convert a dictionary of kwargs to a JSON string for downstream reporting.
This structured string can be converted back into a Python data structure later for more sophisticated reporting by specific kwargs.
This function excludes data types from arguments like pandas DataFrames and also converts floating point numbers to a fixed precision for better readability and reproducibility.
- Parameters
kwargs (dict) – Dictionary of kwargs passed to a given filter function.
- Returns
str – String representation of the kwargs for reporting.
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "ACGT": 28000}, {"strain": "strain2", "ACGT": 26000}, {"strain": "strain3", "ACGT": 5000}]).set_index("strain")
>>> exclude_by = [(filter_by_sequence_length, {"sequence_index": sequence_index, "min_length": 27000})]
>>> filter_kwargs_to_str(exclude_by[0][1])
'[["min_length", 27000]]'
>>> exclude_by = [(filter_by_date, {"max_date": numeric_date("2020-04-01"), "min_date": numeric_date("2020-03-01")})]
>>> filter_kwargs_to_str(exclude_by[0][1])
'[["max_date", 2020.25], ["min_date", 2020.17]]'
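The conversion described above can be approximated with the json module. In this sketch the function name is hypothetical, and values are kept only when they are simple scalars, which is an assumption standing in for the exclusion of pandas DataFrames; floats are rounded to two decimal places to match the precision shown in the examples.

```python
import json


def kwargs_to_str_sketch(kwargs):
    reportable = []
    for key, value in kwargs.items():
        # Skip bulky or non-serializable objects such as data frames.
        if not isinstance(value, (str, int, float, bool, type(None))):
            continue
        if isinstance(value, float):
            value = round(value, 2)
        reportable.append((key, value))
    # Sort by key so the output does not depend on argument order.
    return json.dumps(sorted(reportable))


report = kwargs_to_str_sketch({"min_date": 2020.1653, "max_date": 2020.25})
restored = json.loads(report)
```

Because the result is valid JSON, json.loads recovers a list of [key, value] pairs for later reporting.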
- augur.filter.get_groups_for_subsampling(strains, metadata, group_by=None)
Return a list of groups for each given strain based on the corresponding metadata and group by column.
- Parameters
strains (list) – A list of strains to get groups for.
metadata (pandas.DataFrame) – Metadata to inspect for the given strains.
group_by (list) – A list of metadata (or calculated) columns to group records by.
- Returns
dict – A mapping of strain names to tuples corresponding to the values of the strain’s group.
list – A list of dictionaries with strains that were skipped from grouping and the reason why (see also: apply_filters output).
>>> strains = ["strain1", "strain2"]
>>> metadata = pd.DataFrame([{"strain": "strain1", "date": "2020-01-01", "region": "Africa"}, {"strain": "strain2", "date": "2020-02-01", "region": "Europe"}]).set_index("strain")
>>> group_by = ["region"]
>>> group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by)
>>> group_by_strain
{'strain1': ('Africa',), 'strain2': ('Europe',)}
>>> skipped_strains
[]
If we group by year or month, these groups are calculated from the date
string.
>>> group_by = ["year", "month"]
>>> group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by)
>>> group_by_strain
{'strain1': (2020, (2020, 1)), 'strain2': (2020, (2020, 2))}
If we omit the grouping columns, the result will group by a dummy column.
>>> group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata)
>>> group_by_strain
{'strain1': ('_dummy',), 'strain2': ('_dummy',)}
If we try to group by columns that don’t exist, we get an error.
>>> group_by = ["missing_column"]
>>> get_groups_for_subsampling(strains, metadata, group_by)
Traceback (most recent call last):
  ...
augur.filter.FilterException: The specified group-by categories (['missing_column']) were not found. No sequences-per-group sampling will be done.
If we try to group by some columns that exist and some that don’t, we allow
grouping to continue and print a warning message to stderr.
>>> group_by = ["year", "month", "missing_column"]
>>> group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by)
>>> group_by_strain
{'strain1': (2020, (2020, 1), 'unknown'), 'strain2': (2020, (2020, 2), 'unknown')}
If we group by year or month and some records lack that information in
their date fields, we skip those records in the group output and
track which records were skipped and why.
>>> metadata = pd.DataFrame([{"strain": "strain1", "date": "", "region": "Africa"}, {"strain": "strain2", "date": "2020-02-01", "region": "Europe"}]).set_index("strain")
>>> group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, ["year"])
>>> group_by_strain
{'strain2': (2020,)}
>>> skipped_strains
[{'strain': 'strain1', 'filter': 'skip_group_by_with_ambiguous_year', 'kwargs': ''}]
Similarly, if we group by month, we should skip records that don’t have
month information in their date fields.
>>> metadata = pd.DataFrame([{"strain": "strain1", "date": "2020", "region": "Africa"}, {"strain": "strain2", "date": "2020-02-01", "region": "Europe"}]).set_index("strain")
>>> group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, ["month"])
>>> group_by_strain
{'strain2': ((2020, 2),)}
>>> skipped_strains
[{'strain': 'strain1', 'filter': 'skip_group_by_with_ambiguous_month', 'kwargs': ''}]
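The year and month handling shown in these examples can be sketched as follows. The function date_groups_sketch is hypothetical, takes a plain dictionary of date strings instead of a DataFrame, and supports only the calculated "year" and "month" columns; the order in which a missing year or month is reported is an assumption.

```python
def date_groups_sketch(dates_by_strain, group_by):
    """group_by may contain 'year' and/or 'month' only."""
    group_by_strain = {}
    skipped_strains = []

    for strain, date_string in dates_by_strain.items():
        parts = date_string.split("-")
        # Treat anything that is not a number (e.g. '' or 'XX') as missing.
        year = int(parts[0]) if parts[0].isdigit() else None
        month = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else None

        reason = None
        if year is None:
            reason = "skip_group_by_with_ambiguous_year"
        elif "month" in group_by and month is None:
            reason = "skip_group_by_with_ambiguous_month"

        if reason is not None:
            skipped_strains.append(
                {"strain": strain, "filter": reason, "kwargs": ""}
            )
            continue

        # Months are stored as (year, month) pairs so that the same month
        # in different years lands in different groups.
        values = {"year": year, "month": (year, month)}
        group_by_strain[strain] = tuple(values[column] for column in group_by)

    return group_by_strain, skipped_strains


groups, skipped = date_groups_sketch(
    {"strain1": "2020", "strain2": "2020-02-01"}, ["month"]
)
```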
- augur.filter.include(metadata, include_file)
Include strains in the given text file from the given metadata.
- Parameters
metadata (pandas.DataFrame) – Metadata indexed by strain name
include_file (str) – Filename with strain names to include from the given metadata
- Returns
set[str] – Strains that pass the filter
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> with NamedTemporaryFile(delete=False) as include_file:
...     characters_written = include_file.write(b'strain1')
>>> include(metadata, include_file.name)
{'strain1'}
>>> os.unlink(include_file.name)
- augur.filter.include_by_include_where(metadata, include_where)
Include all strains from the given metadata that match the given query.
Unlike pandas query syntax, inclusion queries should follow the pattern of “property=value” or “property!=value”. Additionally, this filter treats all values like lowercase strings, so we convert all values to strings first and then lowercase them before testing the given query.
- Parameters
metadata (pandas.DataFrame) – Metadata indexed by strain name
include_where (str) – Filter query used to include strains
- Returns
set[str] – Strains that pass the filter
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> include_by_include_where(metadata, "region!=Europe")
{'strain1'}
>>> include_by_include_where(metadata, "region=Europe")
{'strain2'}
>>> include_by_include_where(metadata, "region=europe")
{'strain2'}
If the column referenced in the given query does not exist, skip the filter.
>>> include_by_include_where(metadata, "missing_column=value")
set()
- augur.filter.parse_filter_query(query)
Parse an augur filter-style query and return the corresponding column, operator, and value for the query.
- Parameters
query (str) – augur filter-style query following the pattern of “property=value” or “property!=value”
- Returns
str – Name of column to query
callable – Operator function to test equality or non-equality of values
str – Value of column to query
>>> parse_filter_query("property=value")
('property', <built-in function eq>, 'value')
>>> parse_filter_query("property!=value")
('property', <built-in function ne>, 'value')
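A query of this form can be parsed with the standard re and operator modules. The sketch below is illustrative rather than augur's implementation: both function names are hypothetical, and the matching helper works on a plain dictionary of records, lowercasing both sides of the comparison as described for the include and exclude filters.

```python
import operator
import re


def parse_query_sketch(query):
    # The capture group keeps the operator symbol in the split result.
    parts = re.split(r"(!?=)", query, maxsplit=1)
    if len(parts) != 3:
        raise ValueError(
            f"Expected 'property=value' or 'property!=value', got {query!r}"
        )
    column, symbol, value = parts
    op = operator.ne if symbol == "!=" else operator.eq
    return column, op, value


def matching_strains(metadata, query):
    """metadata: dict mapping strain name -> record (dict)."""
    column, op, value = parse_query_sketch(query)
    return {
        strain
        for strain, record in metadata.items()
        # Records without the column never match.
        if column in record and op(str(record[column]).lower(), value.lower())
    }


metadata = {"strain1": {"region": "Africa"}, "strain2": {"region": "Europe"}}
europe = matching_strains(metadata, "region=europe")
not_europe = matching_strains(metadata, "region!=Europe")
```

An exclusion filter would return the complement of the matching set, while an inclusion filter would return the matching set itself.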
- augur.filter.read_priority_scores(fname)
- augur.filter.register_arguments(parser)
- augur.filter.run(args)
Filter and subsample a set of sequences into an analysis set.
- augur.filter.validate_arguments(args)
Validate arguments and return a boolean representing whether all validation rules succeeded.
- Parameters
args (argparse.Namespace) – Parsed arguments from argparse
- Returns
Validation succeeded.
- Return type
bool
- augur.filter.write_vcf(input_filename, output_filename, dropped_samps)