augur.filter.subsample

class augur.filter.subsample.PriorityQueue(max_size)

Bases: object

A priority queue implementation that automatically replaces lower priority items in the heap with incoming higher priority items.

Examples

Add a single record to a heap with a maximum of 2 records.

>>> queue = PriorityQueue(max_size=2)
>>> queue.add({"strain": "strain1"}, 0.5)
1

Add another record with a higher priority. The queue should be at its maximum size.

>>> queue.add({"strain": "strain2"}, 1.0)
2
>>> queue.heap
[(0.5, 0, {'strain': 'strain1'}), (1.0, 1, {'strain': 'strain2'})]
>>> list(queue.get_items())
[{'strain': 'strain1'}, {'strain': 'strain2'}]

Add a higher priority record that causes the queue to exceed its maximum size. The resulting queue should contain the two highest priority records after the lowest priority record is removed.

>>> queue.add({"strain": "strain3"}, 2.0)
2
>>> list(queue.get_items())
[{'strain': 'strain2'}, {'strain': 'strain3'}]

Add a record with the same priority as an existing record. The tie is resolved by removing the oldest of the tied entries.

>>> queue.add({"strain": "strain4"}, 1.0)
2
>>> list(queue.get_items())
[{'strain': 'strain4'}, {'strain': 'strain3'}]

add(item, priority)

Add an item to the queue with a given priority.

If adding the item causes the queue to exceed its maximum size, replace the lowest priority item with the given item. The queue stores items with an additional heap id value (a count) to resolve ties between items with equal priority (favoring the most recently added item).
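
The class's implementation is not shown here, but the behavior described above can be sketched with a heapq-managed list and an itertools.count tie-breaker. This is an illustrative approximation, not the module's actual code:

import heapq
import itertools

class PriorityQueueSketch:
    """Illustrative sketch of the documented behavior; not the real class."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.heap = []
        self.counter = itertools.count()  # heap id used to break priority ties

    def add(self, item, priority):
        # Entries are (priority, heap id, item). On a priority tie, the entry
        # with the smaller (older) heap id sorts lower and is displaced first,
        # which favors the most recently added item.
        entry = (priority, next(self.counter), item)
        if len(self.heap) >= self.max_size:
            heapq.heappushpop(self.heap, entry)  # drop the lowest priority entry
        else:
            heapq.heappush(self.heap, entry)
        return len(self.heap)

    def get_items(self):
        for priority, heap_id, item in self.heap:
            yield item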

get_items()

Return each item in the queue in order.

Yields:

Any – Item stored in the queue.

exception augur.filter.subsample.TooManyGroupsError(msg)

Bases: ValueError

augur.filter.subsample.calculate_sequences_per_group(target_max_value, group_sizes, allow_probabilistic=True)

Calculate the number of sequences per group for a given maximum number of sequences to be returned and the number of sequences in each requested group. Optionally, allow the result to be probabilistic such that the mean result of a Poisson process achieves the calculated sequences per group for the given maximum.

Parameters:
  • target_max_value (int) – Maximum number of sequences to return by subsampling at some calculated number of sequences per group for the given counts per group.

  • group_sizes (list of int) – A list with the number of sequences in each requested group.

  • allow_probabilistic (bool) – Whether to allow probabilistic subsampling when the number of groups exceeds the requested maximum.

Raises:

TooManyGroupsError – When there are more groups than the requested maximum number of sequences and probabilistic subsampling is not allowed.

Returns:

  • int or float – Number of sequences per group.

  • bool – Whether probabilistic subsampling was used.
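
As a rough illustration of the calculation described above (hypothetical helper name; not the module's actual implementation), the logic could look like this:

from augur.filter.subsample import TooManyGroupsError

def sequences_per_group_sketch(target_max_value, group_sizes, allow_probabilistic=True):
    """Illustrative sketch only."""
    if len(group_sizes) > target_max_value:
        if not allow_probabilistic:
            raise TooManyGroupsError(
                f"{len(group_sizes)} groups exceed the requested maximum of {target_max_value} sequences"
            )
        # Fractional result: drawing each group's size from a Poisson
        # distribution with this mean yields the requested maximum on average.
        return target_max_value / len(group_sizes), True

    def total_sampled(per_group):
        # Groups smaller than the allowance contribute all of their sequences.
        return sum(min(size, per_group) for size in group_sizes)

    # Find the largest whole number of sequences per group whose capped total
    # stays within the requested maximum.
    per_group = 1
    while per_group < max(group_sizes) and total_sampled(per_group + 1) <= target_max_value:
        per_group += 1
    return per_group, False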

augur.filter.subsample.create_queues_by_group(groups, max_size, max_attempts=100, random_seed=None)

Create a dictionary of priority queues per group for the given maximum size.

When the maximum size is fractional, probabilistically sample each group's maximum size from a Poisson distribution. Make up to the given maximum number of attempts to create queues whose maximum sizes sum to a value greater than zero.
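
A rough sketch of this behavior, assuming NumPy's random generator for the Poisson draws and the PriorityQueue class documented above (the helper name is hypothetical and this is not the module's actual implementation):

import numpy as np
from augur.filter.subsample import PriorityQueue

def create_queues_by_group_sketch(groups, max_size, max_attempts=100, random_seed=None):
    """Illustrative sketch only."""
    rng = np.random.default_rng(random_seed)
    fractional = float(max_size) != int(max_size)
    queues_by_group = {}
    total_max_size = 0
    attempts = 0

    # Retry the Poisson draws until at least one queue can hold a record,
    # giving up after the requested maximum number of attempts.
    while total_max_size == 0 and attempts < max_attempts:
        queues_by_group = {
            group: PriorityQueue(rng.poisson(max_size) if fractional else max_size)
            for group in groups
        }
        total_max_size = sum(queue.max_size for queue in queues_by_group.values())
        attempts += 1

    return queues_by_group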

Examples

Create queues for two groups with a fixed maximum size.

>>> groups = ("2015", "2016")
>>> queues = create_queues_by_group(groups, 2)
>>> sum(queue.max_size for queue in queues.values())
4

Create queues for two groups with a fractional maximum size. Their total max size should still be an integer value greater than zero.

>>> seed = 314159
>>> queues = create_queues_by_group(groups, 0.1, random_seed=seed)
>>> int(sum(queue.max_size for queue in queues.values())) > 0
True

A subsequent run of this function with the same groups and random seed should produce the same queues and queue sizes.

>>> more_queues = create_queues_by_group(groups, 0.1, random_seed=seed)
>>> [queue.max_size for queue in queues.values()] == [queue.max_size for queue in more_queues.values()]
True

augur.filter.subsample.get_groups_for_subsampling(strains, metadata, group_by=None)

Return the group for each given strain based on the corresponding metadata and group-by columns.

Parameters:
  • strains (list) – A list of strains to get groups for.

  • metadata (pandas.DataFrame) – Metadata to inspect for the given strains.

  • group_by (list) – A list of metadata (or generated) columns to group records by.

Returns:

A mapping of strain names to tuples corresponding to the values of the strain’s group.

Return type:

dict
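
For plain metadata columns, the mapping can be pictured roughly as follows (illustrative only; the real function also supports generated columns such as year and month and handles missing columns, as the examples below show):

def groups_for_subsampling_sketch(strains, metadata, group_by=None):
    """Illustrative sketch only; assumes every group-by column exists in the metadata."""
    group_by = list(group_by) if group_by else ["_dummy"]
    # Restrict to the requested strains and supply the dummy column used when
    # no grouping columns are given.
    records = metadata.loc[list(strains)].assign(_dummy="_dummy")
    return {
        strain: tuple(record[column] for column in group_by)
        for strain, record in records.to_dict(orient="index").items()
    }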

Examples

>>> strains = ["strain1", "strain2"]
>>> metadata = pd.DataFrame([{"strain": "strain1", "date": "2020-01-01", "region": "Africa"}, {"strain": "strain2", "date": "2020-02-01", "region": "Europe"}]).set_index("strain")
>>> group_by = ["region"]
>>> group_by_strain = get_groups_for_subsampling(strains, metadata, group_by)
>>> group_by_strain
{'strain1': ('Africa',), 'strain2': ('Europe',)}

If we group by year and month, these groups are generated from the date string.

>>> group_by = ["year", "month"]
>>> group_by_strain = get_groups_for_subsampling(strains, metadata, group_by)
>>> group_by_strain
{'strain1': (2020, (2020, 1)), 'strain2': (2020, (2020, 2))}

If we omit the grouping columns, the result will group by a dummy column.

>>> group_by_strain = get_groups_for_subsampling(strains, metadata)
>>> group_by_strain
{'strain1': ('_dummy',), 'strain2': ('_dummy',)}

If we try to group by columns that don’t exist, we get an error.

>>> group_by = ["missing_column"]
>>> get_groups_for_subsampling(strains, metadata, group_by)
Traceback (most recent call last):
  ...
augur.errors.AugurError: The specified group-by categories (['missing_column']) were not found.

If we try to group by some columns that exist and some that don’t, we allow grouping to continue and print a warning message to stderr.

>>> group_by = ["year", "month", "missing_column"]
>>> group_by_strain = get_groups_for_subsampling(strains, metadata, group_by)
>>> group_by_strain
{'strain1': (2020, (2020, 1), 'unknown'), 'strain2': (2020, (2020, 2), 'unknown')}

We can group metadata without any non-ID columns.

>>> metadata = pd.DataFrame([{"strain": "strain1"}, {"strain": "strain2"}]).set_index("strain")
>>> get_groups_for_subsampling(strains, metadata, group_by=('_dummy',))
{'strain1': ('_dummy',), 'strain2': ('_dummy',)}