augur.filter module

Filter and subsample a sequence set.

exception augur.filter.FilterException

Bases: augur.utils.AugurError

Representation of an error that occurred during filtering.

class augur.filter.PriorityQueue(max_size)

Bases: object

A priority queue implementation that automatically replaces lower priority items in the heap with incoming higher priority items.

Add a single record to a heap with a maximum of 2 records.

>>> queue = PriorityQueue(max_size=2)
>>> queue.add({"strain": "strain1"}, 0.5)
1

Add another record with a higher priority. The queue should be at its maximum size.

>>> queue.add({"strain": "strain2"}, 1.0)
2
>>> queue.heap
[(0.5, 0, {'strain': 'strain1'}), (1.0, 1, {'strain': 'strain2'})]
>>> list(queue.get_items())
[{'strain': 'strain1'}, {'strain': 'strain2'}]

Add a higher priority record that causes the queue to exceed its maximum size. The resulting queue should contain the two highest priority records after the lowest priority record is removed.

>>> queue.add({"strain": "strain3"}, 2.0)
2
>>> list(queue.get_items())
[{'strain': 'strain2'}, {'strain': 'strain3'}]

Add a record with the same priority as another record, forcing the duplicate to be resolved by removing the oldest entry.

>>> queue.add({"strain": "strain4"}, 1.0)
2
>>> list(queue.get_items())
[{'strain': 'strain4'}, {'strain': 'strain3'}]

add(item, priority)

Add an item to the queue with a given priority.

If adding the item causes the queue to exceed its maximum size, replace the lowest priority item with the given item. The queue stores items with an additional heap id value (a count) to resolve ties between items with equal priority (favoring the most recently added item).

get_items()

Return each item in the queue in order.

Yields

Any – Item stored in the queue.
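
The replacement behaviour described above can be sketched with the standard library's heapq module. This is a minimal illustration rather than augur's own implementation: the class name SketchPriorityQueue is hypothetical, and relying on heapq.heappushpop is an assumption about one way to obtain the documented results.

```python
import heapq
import itertools


class SketchPriorityQueue:
    """Bounded queue that keeps the highest priority items (illustrative only)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.heap = []
        # Monotonic counter stored with each entry: it breaks ties between
        # equal priorities in favor of newer items and keeps the items
        # themselves from ever being compared.
        self.counter = itertools.count()

    def add(self, item, priority):
        entry = (priority, next(self.counter), item)
        if len(self.heap) >= self.max_size:
            # Push the new entry, then drop the smallest entry in the heap.
            heapq.heappushpop(self.heap, entry)
        else:
            heapq.heappush(self.heap, entry)
        return len(self.heap)

    def get_items(self):
        # Yield the stored items in heap order.
        for _priority, _count, item in self.heap:
            yield item
```

Because every entry carries a unique count, tuple comparison is always decided before the item is reached, so unorderable items such as dictionaries are safe to store.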

exception augur.filter.TooManyGroupsError(msg)

Bases: ValueError

augur.filter.apply_filters(metadata, exclude_by, include_by)

Apply a list of filters to exclude or force-include records from the given metadata and return the strains to keep, to exclude, and to force include.

Parameters
  • metadata (pandas.DataFrame) – Metadata to filter

  • exclude_by (list[tuple]) – A list of 2-element tuples with a callable to filter by in the first index and a dictionary of kwargs to pass to the function in the second index.

  • include_by (list[tuple]) – A list of 2-element tuples in the same format as the exclude_by argument.

Returns

  • set – Strains to keep (those that passed all filters)

  • list[dict] – Strains to exclude along with the function that filtered them and the arguments used to run the function.

  • list[dict] – Strains to force-include along with the function that filtered them and the arguments used to run the function.
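
The (callable, kwargs) convention used by exclude_by and include_by can be illustrated without pandas. The sketch below is an assumption-laden simplification, not augur's code: metadata is a plain dictionary keyed by strain, the helper names (apply_filters_sketch, keep_min_year, include_region) are hypothetical, and the reported records omit the kwargs field.

```python
def keep_min_year(metadata, min_year):
    """Hypothetical exclusion filter: return strains sampled in or after min_year."""
    return {s for s, record in metadata.items() if record["year"] >= min_year}


def include_region(metadata, region):
    """Hypothetical inclusion rule: return strains from the given region."""
    return {s for s, record in metadata.items() if record["region"] == region}


def apply_filters_sketch(metadata, exclude_by, include_by):
    strains_to_keep = set(metadata)
    strains_to_exclude = []
    strains_to_include = []

    # Record every strain that an inclusion rule asks to force-include.
    for include_function, kwargs in include_by:
        for strain in sorted(include_function(metadata, **kwargs)):
            strains_to_include.append(
                {"strain": strain, "filter": include_function.__name__}
            )

    # Run each exclusion filter on the strains that are still kept and
    # remember which filter removed which strain.
    for filter_function, kwargs in exclude_by:
        remaining = {strain: metadata[strain] for strain in strains_to_keep}
        passed = filter_function(remaining, **kwargs)
        for strain in sorted(strains_to_keep - passed):
            strains_to_exclude.append(
                {"strain": strain, "filter": filter_function.__name__}
            )
        strains_to_keep &= passed

    return strains_to_keep, strains_to_exclude, strains_to_include
```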

For example, filter data by minimum date, but force the inclusion of strains from Africa.

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-10-02"}, {"region": "North America", "date": "2020-01-01"}], index=["strain1", "strain2", "strain3"])
>>> exclude_by = [(filter_by_date, {"min_date": numeric_date("2020-04-01")})]
>>> include_by = [(include_by_include_where, {"include_where": "region=Africa"})]
>>> strains_to_keep, strains_to_exclude, strains_to_include = apply_filters(metadata, exclude_by, include_by)
>>> strains_to_keep
{'strain2'}
>>> sorted(strains_to_exclude, key=lambda record: record["strain"])
[{'strain': 'strain1', 'filter': 'filter_by_date', 'kwargs': '[["min_date", 2020.25]]'}, {'strain': 'strain3', 'filter': 'filter_by_date', 'kwargs': '[["min_date", 2020.25]]'}]
>>> strains_to_include
[{'strain': 'strain1', 'filter': 'include_by_include_where', 'kwargs': '[["include_where", "region=Africa"]]'}]

We also want to filter by characteristics of the sequence data that we’ve annotated in a sequence index.

>>> sequence_index = pd.DataFrame([{"strain": "strain1", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}, {"strain": "strain3", "A": 1250, "C": 1250, "G": 1250, "T": 1250}]).set_index("strain")
>>> exclude_by = [(filter_by_sequence_length, {"sequence_index": sequence_index, "min_length": 27000})]
>>> include_by = [(include_by_include_where, {"include_where": "region=Europe"})]
>>> strains_to_keep, strains_to_exclude, strains_to_include = apply_filters(metadata, exclude_by, include_by)
>>> strains_to_keep
{'strain1'}
>>> sorted(strains_to_exclude, key=lambda record: record["strain"])
[{'strain': 'strain2', 'filter': 'filter_by_sequence_length', 'kwargs': '[["min_length", 27000]]'}, {'strain': 'strain3', 'filter': 'filter_by_sequence_length', 'kwargs': '[["min_length", 27000]]'}]
>>> strains_to_include
[{'strain': 'strain2', 'filter': 'include_by_include_where', 'kwargs': '[["include_where", "region=Europe"]]'}]

augur.filter.calculate_sequences_per_group(target_max_value, counts_per_group, allow_probabilistic=True)

Calculate the number of sequences per group for a given maximum number of sequences to be returned and the number of sequences in each requested group. Optionally, allow the result to be probabilistic such that the mean result of a Poisson process achieves the calculated sequences per group for the given maximum.

Parameters
  • target_max_value (int) – Maximum number of sequences to return by subsampling at some calculated number of sequences per group for the given counts per group.

  • counts_per_group (list[int]) – A list with the number of sequences in each requested group.

  • allow_probabilistic (bool) – Whether to allow probabilistic subsampling when the number of groups exceeds the requested maximum.

Raises

TooManyGroupsError – When there are more groups than the requested maximum number of sequences and probabilistic subsampling is not allowed.

Returns

  • int or float – Number of sequences per group.

  • bool – Whether probabilistic subsampling was used.
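
One plausible reading of this calculation is sketched below. It is not augur's implementation: the names sequences_per_group_sketch and TooManyGroups are hypothetical, and the probabilistic branch assumes every group holds at least one sequence, in which case spreading the target evenly gives an expected total equal to the target.

```python
class TooManyGroups(ValueError):
    """Hypothetical stand-in for TooManyGroupsError."""


def sequences_per_group_sketch(target_max_value, counts_per_group,
                               allow_probabilistic=True):
    if len(counts_per_group) > target_max_value:
        if not allow_probabilistic:
            raise TooManyGroups(
                "Asked for at most %d sequences, but there are %d groups."
                % (target_max_value, len(counts_per_group))
            )
        # Fewer than one sequence per group on average: return a fractional
        # rate whose sum over all groups equals the target.
        return target_max_value / len(counts_per_group), True

    # Find the largest integer cap per group whose capped total still fits
    # within the target.
    best = 0
    for candidate in range(1, max(counts_per_group, default=0) + 1):
        total = sum(min(candidate, count) for count in counts_per_group)
        if total > target_max_value:
            break
        best = candidate
    return best, False
```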

augur.filter.construct_filters(args, sequence_index)

Construct lists of filters and inclusion criteria based on user-provided arguments.

Parameters
  • args (argparse.Namespace) – Command line arguments provided by the user.

  • sequence_index (pandas.DataFrame) – Sequence index for the provided arguments.

Returns

  • list – A list of 2-element tuples with a callable to use as a filter and a dictionary of kwargs to pass to the callable.

  • list – A list of 2-element tuples with a callable and dictionary of kwargs that determines whether to force include strains in the final output.

augur.filter.create_queues_by_group(groups, max_size, max_attempts=100, random_seed=None)

Create a dictionary of priority queues per group for the given maximum size.

When the maximum size is fractional, probabilistically sample the maximum size from a Poisson distribution. Make at most the given maximum number of attempts to create queues for which the sum of their maximum sizes is greater than zero.
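
The retry logic for fractional sizes can be sketched with the standard library alone. Everything here is an assumption made for illustration: the function names are hypothetical, the Poisson draw uses Knuth's multiplication method instead of whatever sampler augur relies on, and the function returns sizes per group rather than queue objects.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson-distributed integer with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def queue_sizes_by_group(groups, max_size, max_attempts=100, random_seed=None):
    """Return a maximum queue size for each group (illustrative only)."""
    if float(max_size).is_integer():
        # Whole-number sizes are used as given.
        return {group: int(max_size) for group in groups}

    # A seeded generator makes repeated runs reproducible.
    rng = random.Random(random_seed)
    for _ in range(max_attempts):
        sizes = {group: sample_poisson(max_size, rng) for group in groups}
        if sum(sizes.values()) > 0:
            return sizes
    raise RuntimeError("Could not sample a non-empty set of queue sizes.")
```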

Create queues for two groups with a fixed maximum size.

>>> groups = ("2015", "2016")
>>> queues = create_queues_by_group(groups, 2)
>>> sum(queue.max_size for queue in queues.values())
4

Create queues for two groups with a fractional maximum size. Their total max size should still be an integer value greater than zero.

>>> seed = 314159
>>> queues = create_queues_by_group(groups, 0.1, random_seed=seed)
>>> int(sum(queue.max_size for queue in queues.values())) > 0
True

A subsequent run of this function with the same groups and random seed should produce the same queues and queue sizes.

>>> more_queues = create_queues_by_group(groups, 0.1, random_seed=seed)
>>> [queue.max_size for queue in queues.values()] == [queue.max_size for queue in more_queues.values()]
True

augur.filter.filter_by_ambiguous_date(metadata, date_column='date', ambiguity='any')

Filter metadata in the given pandas DataFrame where values in the given date column have a given level of ambiguity.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • date_column (str) – Column in the dataframe with dates.

  • ambiguity (str) – Level of date ambiguity to filter metadata by

Returns

Strains that pass the filter

Return type

set[str]
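
The notion of an ambiguity level can be sketched as follows. This is a simplified, hypothetical helper (is_date_ambiguous_sketch and filter_ambiguous_sketch are illustrative names) that assumes dates are written as YYYY-MM-DD with "X" marking unknown digits, and that a missing component counts as ambiguous.

```python
def is_date_ambiguous_sketch(date, ambiguity="any"):
    parts = date.split("-")
    # Missing components are treated as fully ambiguous.
    year = parts[0] if parts[0] else "XXXX"
    month = parts[1] if len(parts) > 1 else "XX"
    day = parts[2] if len(parts) > 2 else "XX"

    if ambiguity == "year":
        return "X" in year
    if ambiguity == "month":
        return "X" in year or "X" in month
    # "day" and "any" both require a fully resolved date.
    return "X" in year or "X" in month or "X" in day


def filter_ambiguous_sketch(dates_by_strain, ambiguity="any"):
    """Return the strains whose dates are not ambiguous at the given level."""
    return {
        strain
        for strain, date in dates_by_strain.items()
        if not is_date_ambiguous_sketch(date, ambiguity)
    }
```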

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-XX"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> filter_by_ambiguous_date(metadata)
{'strain2'}
>>> sorted(filter_by_ambiguous_date(metadata, ambiguity="month"))
['strain1', 'strain2']

If the requested date column does not exist, we quietly skip this filter.

>>> sorted(filter_by_ambiguous_date(metadata, date_column="missing_column"))
['strain1', 'strain2']

augur.filter.filter_by_date(metadata, date_column='date', min_date=None, max_date=None)

Filter metadata by minimum or maximum date.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • date_column (str) – Column in the dataframe with dates.

  • min_date (float) – Minimum date

  • max_date (float) – Maximum date

Returns

Strains that pass the filter

Return type

set[str]
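
Because min_date and max_date are floats, dates are compared as decimal years. The sketch below shows one way such a comparison could work. It is not augur's numeric_date: the helper names are hypothetical, only complete ISO dates are handled, and the midday offset is an assumed convention.

```python
import calendar
import datetime


def to_numeric_date(iso_date):
    """Convert YYYY-MM-DD to a decimal year (assumed midday convention)."""
    date = datetime.date.fromisoformat(iso_date)
    days_in_year = 366 if calendar.isleap(date.year) else 365
    start_of_year = datetime.date(date.year, 1, 1)
    # Days elapsed since January 1, measured at the middle of the day.
    elapsed = date.toordinal() - start_of_year.toordinal() + 0.5
    return date.year + elapsed / days_in_year


def filter_dates_sketch(dates_by_strain, min_date=None, max_date=None):
    """Return strains whose date lies within the optional inclusive bounds."""
    passed = set()
    for strain, iso_date in dates_by_strain.items():
        value = to_numeric_date(iso_date)
        if min_date is not None and value < min_date:
            continue
        if max_date is not None and value > max_date:
            continue
        passed.add(strain)
    return passed
```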

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> filter_by_date(metadata, min_date=numeric_date("2020-01-02"))
{'strain2'}
>>> filter_by_date(metadata, max_date=numeric_date("2020-01-01"))
{'strain1'}
>>> filter_by_date(metadata, min_date=numeric_date("2020-01-03"), max_date=numeric_date("2020-01-10"))
set()
>>> sorted(filter_by_date(metadata, min_date=numeric_date("2019-12-30"), max_date=numeric_date("2020-01-10")))
['strain1', 'strain2']
>>> sorted(filter_by_date(metadata))
['strain1', 'strain2']

If the requested date column does not exist, we quietly skip this filter.

>>> sorted(filter_by_date(metadata, date_column="missing_column", min_date=numeric_date("2020-01-02")))
['strain1', 'strain2']

augur.filter.filter_by_exclude(metadata, exclude_file)

Exclude the given set of strains from the given metadata.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • exclude_file (str) – Filename with strain names to exclude from the given metadata

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> with NamedTemporaryFile(delete=False) as exclude_file:
...     characters_written = exclude_file.write(b'strain1')
>>> filter_by_exclude(metadata, exclude_file.name)
{'strain2'}
>>> os.unlink(exclude_file.name)

augur.filter.filter_by_exclude_all(metadata)

Exclude all strains regardless of the given metadata content.

This is a placeholder function that can be called as part of a generalized loop through all possible functions.

Parameters

metadata (pandas.DataFrame) – Metadata indexed by strain name

Returns

Empty set of strains

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> filter_by_exclude_all(metadata)
set()

augur.filter.filter_by_exclude_where(metadata, exclude_where)

Exclude all strains from the given metadata that match the given exclusion query.

Unlike pandas query syntax, exclusion queries should follow the pattern of “property=value” or “property!=value”. Additionally, this filter treats all values like lowercase strings, so we convert all values to strings first and then lowercase them before testing the given query.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • exclude_where (str) – Filter query used to exclude strains

Returns

Strains that pass the filter

Return type

set[str]
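
The “property=value” and “property!=value” patterns, together with the lowercase string comparison, can be sketched with the operator module. The function names below are hypothetical and metadata is a plain dictionary keyed by strain, so this is an illustration of the idea rather than the library's code.

```python
import operator


def parse_query_sketch(query):
    """Split a query into its column, comparison function, and value."""
    if "!=" in query:
        column, value = query.split("!=", 1)
        return column, operator.ne, value
    column, value = query.split("=", 1)
    return column, operator.eq, value


def exclude_where_sketch(metadata, query):
    column, compare, value = parse_query_sketch(query)
    if not any(column in record for record in metadata.values()):
        # Unknown column: skip the filter and keep every strain.
        return set(metadata)
    # Compare values as lowercase strings, as described above.
    excluded = {
        strain
        for strain, record in metadata.items()
        if compare(str(record.get(column, "")).lower(), value.lower())
    }
    return set(metadata) - excluded
```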

>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> filter_by_exclude_where(metadata, "region!=Europe")
{'strain2'}
>>> filter_by_exclude_where(metadata, "region=Europe")
{'strain1'}
>>> filter_by_exclude_where(metadata, "region=europe")
{'strain1'}

If the column referenced in the given query does not exist, skip the filter.

>>> sorted(filter_by_exclude_where(metadata, "missing_column=value"))
['strain1', 'strain2']

augur.filter.filter_by_max_date(metadata, max_date, **kwargs)

Filter metadata by maximum date.

Alias to filter_by_date using max_date only.

augur.filter.filter_by_min_date(metadata, min_date, **kwargs)

Filter metadata by minimum date.

Alias to filter_by_date using min_date only.

augur.filter.filter_by_non_nucleotide(metadata, sequence_index)

Filter metadata for strains with invalid nucleotide content.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • sequence_index (pandas.DataFrame) – Sequence index

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "invalid_nucleotides": 0}, {"strain": "strain2", "invalid_nucleotides": 1}]).set_index("strain")
>>> filter_by_non_nucleotide(metadata, sequence_index)
{'strain1'}

augur.filter.filter_by_query(metadata, query)

Filter metadata in the given pandas DataFrame with a query string and return the strain names that pass the filter.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • query (str) – Query string for the dataframe.

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> filter_by_query(metadata, "region == 'Africa'")
{'strain1'}
>>> filter_by_query(metadata, "region == 'North America'")
set()

augur.filter.filter_by_sequence_index(metadata, sequence_index)

Filter metadata by presence of corresponding entries in a given sequence index. This filter effectively intersects the strain ids in the metadata and sequence index.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • sequence_index (pandas.DataFrame) – Sequence index

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "ACGT": 28000}]).set_index("strain")
>>> filter_by_sequence_index(metadata, sequence_index)
{'strain1'}

augur.filter.filter_by_sequence_length(metadata, sequence_index, min_length=0)

Filter metadata by sequence length from a given sequence index.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • sequence_index (pandas.DataFrame) – Sequence index

  • min_length (int) – Minimum number of standard nucleotide characters (A, C, G, or T) in each sequence

Returns

Strains that pass the filter

Return type

set[str]
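
Sequence length here means the combined count of A, C, G, and T. A minimal sketch of that rule, assuming the sequence index is a plain dictionary of per-base counts and using the hypothetical name filter_by_length_sketch:

```python
def filter_by_length_sketch(strains, sequence_index, min_length=0):
    """Return strains with at least min_length standard nucleotides."""
    passed = set()
    for strain in strains:
        counts = sequence_index.get(strain)
        if counts is None:
            # The strain has no entry in the sequence index, so it cannot pass.
            continue
        if sum(counts.get(base, 0) for base in "ACGT") >= min_length:
            passed.add(strain)
    return passed
```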

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}]).set_index("strain")
>>> filter_by_sequence_length(metadata, sequence_index, min_length=27000)
{'strain1'}

It is possible for the sequence index to be missing strains present in the metadata.

>>> sequence_index = pd.DataFrame([{"strain": "strain3", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}]).set_index("strain")
>>> filter_by_sequence_length(metadata, sequence_index, min_length=27000)
set()

augur.filter.filter_kwargs_to_str(kwargs)

Convert a dictionary of kwargs to a JSON string for downstream reporting.

This structured string can be converted back into a Python data structure later for more sophisticated reporting by specific kwargs.

This function excludes arguments whose values are data types like pandas DataFrames, and it rounds floating point numbers to a fixed precision for better readability and reproducibility.

Parameters

kwargs (dict) – Dictionary of kwargs passed to a given filter function.

Returns

String representation of the kwargs for reporting.

Return type

str
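
A conversion of this kind can be sketched with the json module. The function name kwargs_to_str_sketch is hypothetical, and two details are assumptions chosen to reproduce the documented output: any value that is not a simple scalar is dropped, and floats are rounded to two decimal places.

```python
import json


def kwargs_to_str_sketch(kwargs):
    """Serialize simple kwargs as a JSON list of sorted [key, value] pairs."""
    simple_types = (str, int, float, bool, type(None))
    pairs = []
    for key, value in kwargs.items():
        if not isinstance(value, simple_types):
            # Skip complex objects such as data frames.
            continue
        if isinstance(value, float):
            value = round(value, 2)
        pairs.append((key, value))
    # Sort by key so the same kwargs always produce the same string.
    return json.dumps(sorted(pairs, key=lambda pair: pair[0]))
```

The resulting string can be turned back into a Python list with json.loads for later reporting.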

>>> sequence_index = pd.DataFrame([{"strain": "strain1", "ACGT": 28000}, {"strain": "strain2", "ACGT": 26000}, {"strain": "strain3", "ACGT": 5000}]).set_index("strain")
>>> exclude_by = [(filter_by_sequence_length, {"sequence_index": sequence_index, "min_length": 27000})]
>>> filter_kwargs_to_str(exclude_by[0][1])
'[["min_length", 27000]]'
>>> exclude_by = [(filter_by_date, {"max_date": numeric_date("2020-04-01"), "min_date": numeric_date("2020-03-01")})]
>>> filter_kwargs_to_str(exclude_by[0][1])
'[["max_date", 2020.25], ["min_date", 2020.17]]'

augur.filter.get_groups_for_subsampling(strains, metadata, group_by=None)

Return the group for each given strain based on the corresponding metadata and group-by columns.

Parameters
  • strains (list) – A list of strains to get groups for.

  • metadata (pandas.DataFrame) – Metadata to inspect for the given strains.

  • group_by (list) – A list of metadata (or calculated) columns to group records by.

Returns

  • dict – A mapping of strain names to tuples corresponding to the values of the strain’s group.

  • list – A list of dictionaries with strains that were skipped from grouping and the reason why (see also: apply_filters output).
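
The grouping rules, including the calculated year and month columns and the skipped-strain records, can be sketched over plain dictionaries. This is a simplified, hypothetical helper (group_strains_sketch): it assumes YYYY-MM-DD date strings and does not raise an error when none of the requested columns exist.

```python
def group_strains_sketch(strains, metadata, group_by=None):
    if not group_by:
        # Without grouping columns, every strain falls into one dummy group.
        return {strain: ("_dummy",) for strain in strains}, []

    groups, skipped = {}, []
    for strain in strains:
        record = metadata[strain]
        parts = record.get("date", "").split("-")
        year = parts[0] if parts[0].isdigit() else None
        month = parts[1] if len(parts) > 1 and parts[1].isdigit() else None

        # Skip strains whose dates lack the information the grouping needs.
        reason = None
        if "year" in group_by and year is None:
            reason = "skip_group_by_with_ambiguous_year"
        elif "month" in group_by and (year is None or month is None):
            reason = "skip_group_by_with_ambiguous_month"
        if reason:
            skipped.append({"strain": strain, "filter": reason, "kwargs": ""})
            continue

        group = []
        for column in group_by:
            if column == "year":
                group.append(int(year))
            elif column == "month":
                # Months are paired with their year to keep them distinct.
                group.append((int(year), int(month)))
            else:
                group.append(record.get(column, "unknown"))
        groups[strain] = tuple(group)
    return groups, skipped
```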

>>> strains = ["strain1", "strain2"]
>>> metadata = pd.DataFrame([{"strain": "strain1", "date": "2020-01-01", "region": "Africa"}, {"strain": "strain2", "date": "2020-02-01", "region": "Europe"}]).set_index("strain")
>>> group_by = ["region"]
>>> group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by)
>>> group_by_strain
{'strain1': ('Africa',), 'strain2': ('Europe',)}
>>> skipped_strains
[]

If we group by year or month, these groups are calculated from the date string.

>>> group_by = ["year", "month"]
>>> group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by)
>>> group_by_strain
{'strain1': (2020, (2020, 1)), 'strain2': (2020, (2020, 2))}

If we omit the grouping columns, the result will group by a dummy column.

>>> group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata)
>>> group_by_strain
{'strain1': ('_dummy',), 'strain2': ('_dummy',)}

If we try to group by columns that don’t exist, we get an error.

>>> group_by = ["missing_column"]
>>> get_groups_for_subsampling(strains, metadata, group_by)
Traceback (most recent call last):
  ...
augur.filter.FilterException: The specified group-by categories (['missing_column']) were not found.

If we try to group by some columns that exist and some that don’t, we allow grouping to continue and print a warning message to stderr.

>>> group_by = ["year", "month", "missing_column"]
>>> group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by)
>>> group_by_strain
{'strain1': (2020, (2020, 1), 'unknown'), 'strain2': (2020, (2020, 2), 'unknown')}

If we group by year or month and some records don’t have that information in their date fields, we skip those records from the group output and track which records were skipped and why.

>>> metadata = pd.DataFrame([{"strain": "strain1", "date": "", "region": "Africa"}, {"strain": "strain2", "date": "2020-02-01", "region": "Europe"}]).set_index("strain")
>>> group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, ["year"])
>>> group_by_strain
{'strain2': (2020,)}
>>> skipped_strains
[{'strain': 'strain1', 'filter': 'skip_group_by_with_ambiguous_year', 'kwargs': ''}]

Similarly, if we group by month, we should skip records that don’t have month information in their date fields.

>>> metadata = pd.DataFrame([{"strain": "strain1", "date": "2020", "region": "Africa"}, {"strain": "strain2", "date": "2020-02-01", "region": "Europe"}]).set_index("strain")
>>> group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, ["month"])
>>> group_by_strain
{'strain2': ((2020, 2),)}
>>> skipped_strains
[{'strain': 'strain1', 'filter': 'skip_group_by_with_ambiguous_month', 'kwargs': ''}]

augur.filter.include(metadata, include_file)

Include strains in the given text file from the given metadata.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • include_file (str) – Filename with strain names to include from the given metadata

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> with NamedTemporaryFile(delete=False) as include_file:
...     characters_written = include_file.write(b'strain1')
>>> include(metadata, include_file.name)
{'strain1'}
>>> os.unlink(include_file.name)

augur.filter.include_by_include_where(metadata, include_where)

Include all strains from the given metadata that match the given query.

Unlike pandas query syntax, inclusion queries should follow the pattern of “property=value” or “property!=value”. Additionally, this filter treats all values like lowercase strings, so we convert all values to strings first and then lowercase them before testing the given query.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • include_where (str) – Filter query used to include strains

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> include_by_include_where(metadata, "region!=Europe")
{'strain1'}
>>> include_by_include_where(metadata, "region=Europe")
{'strain2'}
>>> include_by_include_where(metadata, "region=europe")
{'strain2'}

If the column referenced in the given query does not exist, skip the filter.

>>> include_by_include_where(metadata, "missing_column=value")
set()

augur.filter.parse_filter_query(query)

Parse an augur filter-style query and return the corresponding column, operator, and value for the query.

Parameters

query (str) – augur filter-style query following the pattern of “property=value” or “property!=value”

Returns

  • str – Name of column to query

  • callable – Operator function to test equality or non-equality of values

  • str – Value of column to query

>>> parse_filter_query("property=value")
('property', <built-in function eq>, 'value')
>>> parse_filter_query("property!=value")
('property', <built-in function ne>, 'value')

augur.filter.read_priority_scores(fname)

augur.filter.register_arguments(parser)

augur.filter.run(args)

Filter and subsample a set of sequences into an analysis set.

augur.filter.validate_arguments(args)

Validate arguments and return a boolean representing whether all validation rules succeeded.

Parameters

args (argparse.Namespace) – Parsed arguments from argparse

Returns

True if all validation rules succeeded, False otherwise.

Return type

bool