augur.filter.include_exclude_rules

augur.filter.include_exclude_rules.apply_filters(metadata, exclude_by, include_by)

Apply a list of filters to exclude or force-include records from the given metadata and return the strains to keep, to exclude, and to force include.

Parameters

metadata (pandas.DataFrame) – Metadata to filter
exclude_by (list[tuple]) – A list of 2-element tuples with a callable to filter by in the first index and a dictionary of kwargs to pass to the function in the second index.
include_by (list[tuple]) – A list of 2-element tuples in the same format as the exclude_by argument.

Returns

set – Strains to keep (those that passed all filters)
list[dict] – Strains to exclude along with the function that filtered them and the arguments used to run the function.
list[dict] – Strains to force-include along with the function that filtered them and the arguments used to run the function.

For example, filter data by minimum date, but force the inclusion of strains from Africa.

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-10-02"}, {"region": "North America", "date": "2020-01-01"}], index=["strain1", "strain2", "strain3"])
>>> exclude_by = [(filter_by_date, {"min_date": numeric_date("2020-04-01")})]
>>> include_by = [(force_include_where, {"include_where": "region=Africa"})]
>>> strains_to_keep, strains_to_exclude, strains_to_include = apply_filters(metadata, exclude_by, include_by)
>>> strains_to_keep
{'strain2'}
>>> sorted(strains_to_exclude, key=lambda record: record["strain"])
[{'strain': 'strain1', 'filter': 'filter_by_date', 'kwargs': '[["min_date", 2020.25]]'}, {'strain': 'strain3', 'filter': 'filter_by_date', 'kwargs': '[["min_date", 2020.25]]'}]
>>> strains_to_include
[{'strain': 'strain1', 'filter': 'force_include_where', 'kwargs': '[["include_where", "region=Africa"]]'}]

Filters can also use characteristics of the sequence data that we’ve annotated in a sequence index.

>>> sequence_index = pd.DataFrame([{"strain": "strain1", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}, {"strain": "strain3", "A": 1250, "C": 1250, "G": 1250, "T": 1250}]).set_index("strain")
>>> exclude_by = [(filter_by_sequence_length, {"sequence_index": sequence_index, "min_length": 27000})]
>>> include_by = [(force_include_where, {"include_where": "region=Europe"})]
>>> strains_to_keep, strains_to_exclude, strains_to_include = apply_filters(metadata, exclude_by, include_by)
>>> strains_to_keep
{'strain1'}
>>> sorted(strains_to_exclude, key=lambda record: record["strain"])
[{'strain': 'strain2', 'filter': 'filter_by_sequence_length', 'kwargs': '[["min_length", 27000]]'}, {'strain': 'strain3', 'filter': 'filter_by_sequence_length', 'kwargs': '[["min_length", 27000]]'}]
>>> strains_to_include
[{'strain': 'strain2', 'filter': 'force_include_where', 'kwargs': '[["include_where", "region=Europe"]]'}]
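
The loop that the two examples above exercise can be sketched in plain Python. This is a minimal illustration and not augur's implementation: `apply_filters_sketch`, `filter_by_min_year`, and `force_include_region` are hypothetical names, and a dictionary of strain records stands in for the pandas DataFrame. The sketch assumes each callable returns the set of strains that pass, and that the recorded `kwargs` string is the JSON encoding of the sorted keyword arguments, which is consistent with the output shown above.

```python
import json


def apply_filters_sketch(metadata, exclude_by, include_by):
    """Hypothetical sketch; metadata maps strain name -> record dict."""
    strains_to_keep = set(metadata)
    strains_to_exclude = []
    strains_to_include = []

    def describe(function, kwargs):
        # Record the function name and its kwargs as sorted [name, value] pairs.
        return {
            "filter": function.__name__,
            "kwargs": json.dumps(sorted(kwargs.items())),
        }

    for filter_function, kwargs in exclude_by:
        passed = filter_function(metadata, **kwargs)
        # Log every strain this filter rejected, then narrow the keep set.
        for strain in sorted(set(metadata) - passed):
            strains_to_exclude.append({"strain": strain, **describe(filter_function, kwargs)})
        strains_to_keep &= passed

    for include_function, kwargs in include_by:
        # Force-included strains are reported separately from the keep set.
        for strain in sorted(include_function(metadata, **kwargs)):
            strains_to_include.append({"strain": strain, **describe(include_function, kwargs)})

    return strains_to_keep, strains_to_exclude, strains_to_include


# Hypothetical filters used only to exercise the sketch.
def filter_by_min_year(metadata, min_year):
    return {strain for strain, record in metadata.items() if record["year"] >= min_year}


def force_include_region(metadata, region):
    return {strain for strain, record in metadata.items() if record["region"] == region}


metadata = {
    "strain1": {"region": "Africa", "year": 2019},
    "strain2": {"region": "Europe", "year": 2021},
}
keep, excluded, included = apply_filters_sketch(
    metadata,
    [(filter_by_min_year, {"min_year": 2020})],
    [(force_include_region, {"region": "Africa"})],
)
```

As in the first example above, a strain can appear both in the excluded list and in the force-included list. The sketch does not reproduce the second example, where the `sequence_index` argument is absent from the recorded kwargs.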

augur.filter.include_exclude_rules.construct_filters(args, sequence_index)

Construct lists of filters and inclusion criteria based on user-provided arguments.

Parameters

args (argparse.Namespace) – Command line arguments provided by the user.
sequence_index (pandas.DataFrame) – Sequence index for the provided arguments.

Returns

list – A list of 2-element tuples with a callable to use as a filter and a dictionary of kwargs to pass to the callable.
list – A list of 2-element tuples with a callable and dictionary of kwargs that determines whether to force include strains in the final output.

augur.filter.include_exclude_rules.filter_by_ambiguous_date(metadata, date_column='date', ambiguity='any')

Filter metadata in the given pandas DataFrame where values in the given date column have a given level of ambiguity.

Parameters

metadata (pandas.DataFrame) – Metadata indexed by strain name
date_column (str) – Column in the dataframe with dates.
ambiguity (str) – Level of date ambiguity to filter metadata by (for example, 'any' or 'month')

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-XX"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> filter_by_ambiguous_date(metadata)
{'strain2'}
>>> sorted(filter_by_ambiguous_date(metadata, ambiguity="month"))
['strain1', 'strain2']

If the requested date column does not exist, the filter is skipped silently and all strains pass.

>>> sorted(filter_by_ambiguous_date(metadata, date_column="missing_column"))
['strain1', 'strain2']
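
One way to read the ambiguity levels is as a check for the placeholder character "X" in a specific date component. The sketch below follows that reading and reproduces the two results above, but it is an assumption about the behavior rather than augur's code. The function names are hypothetical, a dictionary of strain name to date string stands in for the DataFrame, and the level names "day" and "year" are assumptions, since only 'any' and 'month' appear in this section.

```python
def is_date_ambiguous_sketch(date, ambiguity="any"):
    """Return True if the date string is ambiguous at the requested level."""
    # Pad missing components so "2020" is handled like "2020-XX-XX".
    parts = date.split("-")
    year, month, day = (parts + ["XX", "XX"])[:3]
    if ambiguity == "any":
        return "X" in year + month + day
    component = {"year": year, "month": month, "day": day}[ambiguity]
    return "X" in component


def filter_by_ambiguous_date_sketch(dates, ambiguity="any"):
    """Keep strains whose date is not ambiguous at the requested level."""
    return {
        strain
        for strain, date in dates.items()
        if not is_date_ambiguous_sketch(date, ambiguity)
    }


dates = {"strain1": "2020-01-XX", "strain2": "2020-01-02"}
unambiguous = filter_by_ambiguous_date_sketch(dates)
month_known = filter_by_ambiguous_date_sketch(dates, ambiguity="month")
```

With ambiguity="month", "2020-01-XX" passes because only its day is unknown.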

augur.filter.include_exclude_rules.filter_by_date(metadata, date_column='date', min_date=None, max_date=None)

Filter metadata by minimum or maximum date.

Parameters

metadata (pandas.DataFrame) – Metadata indexed by strain name
date_column (str) – Column in the dataframe with dates.
min_date (float) – Minimum date
max_date (float) – Maximum date

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> filter_by_date(metadata, min_date=numeric_date("2020-01-02"))
{'strain2'}
>>> filter_by_date(metadata, max_date=numeric_date("2020-01-01"))
{'strain1'}
>>> filter_by_date(metadata, min_date=numeric_date("2020-01-03"), max_date=numeric_date("2020-01-10"))
set()
>>> sorted(filter_by_date(metadata, min_date=numeric_date("2019-12-30"), max_date=numeric_date("2020-01-10")))
['strain1', 'strain2']
>>> sorted(filter_by_date(metadata))
['strain1', 'strain2']

If the requested date column does not exist, the filter is skipped silently and all strains pass.

>>> sorted(filter_by_date(metadata, date_column="missing_column", min_date=numeric_date("2020-01-02")))
['strain1', 'strain2']
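
The min_date and max_date arguments are floats, and the apply_filters example records numeric_date("2020-04-01") as 2020.25. The sketch below shows one decimal-year convention that yields that value (the year plus the midday day-of-year fraction) and applies inclusive bounds, as the examples above suggest. `decimal_year` and `filter_by_date_sketch` are hypothetical names; numeric_date itself may be implemented differently, and ambiguous dates such as "2020-01-XX" are not handled here.

```python
from calendar import isleap
from datetime import date


def decimal_year(iso_date):
    """Convert a complete YYYY-MM-DD string to a fractional year."""
    parsed = date.fromisoformat(iso_date)
    days_in_year = 366 if isleap(parsed.year) else 365
    # Subtract half a day so the value refers to midday of the given date.
    return parsed.year + (parsed.timetuple().tm_yday - 0.5) / days_in_year


def filter_by_date_sketch(dates, min_date=None, max_date=None):
    """Keep strains whose date lies within the inclusive bounds."""
    passed = set()
    for strain, iso_date in dates.items():
        value = decimal_year(iso_date)
        if min_date is not None and value < min_date:
            continue
        if max_date is not None and value > max_date:
            continue
        passed.add(strain)
    return passed


dates = {"strain1": "2020-01-01", "strain2": "2020-01-02"}
on_or_after = filter_by_date_sketch(dates, min_date=decimal_year("2020-01-02"))
on_or_before = filter_by_date_sketch(dates, max_date=decimal_year("2020-01-01"))
```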

augur.filter.include_exclude_rules.filter_by_exclude(metadata, exclude_file)

Exclude the given set of strains from the given metadata.

Parameters

metadata (pandas.DataFrame) – Metadata indexed by strain name
exclude_file (str) – Filename with strain names to exclude from the given metadata

Returns

Strains that pass the filter

Return type

set[str]

>>> import os
>>> from tempfile import NamedTemporaryFile
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> with NamedTemporaryFile(delete=False) as exclude_file:
...     characters_written = exclude_file.write(b'strain1')
>>> filter_by_exclude(metadata, exclude_file.name)
{'strain2'}
>>> os.unlink(exclude_file.name)
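
The exclusion amounts to a set difference between the metadata strains and the names read from the file. The sketch below assumes the file lists one strain name per line and that blank lines are ignored; this section does not specify the file format, so treat that as an assumption. `read_strain_names` and `filter_by_exclude_sketch` are hypothetical helpers, and a list of strain names stands in for the DataFrame index.

```python
import os
from tempfile import NamedTemporaryFile


def read_strain_names(path):
    """Read one strain name per line, skipping blank lines (assumed format)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_by_exclude_sketch(strains, exclude_file):
    """Keep the strains that are not listed in the exclude file."""
    return set(strains) - read_strain_names(exclude_file)


with NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as handle:
    handle.write("strain1\n\nstrain3\n")

kept = filter_by_exclude_sketch(["strain1", "strain2"], handle.name)
os.unlink(handle.name)
```

Names in the file that do not occur in the metadata, such as strain3 here, have no effect.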

augur.filter.include_exclude_rules.filter_by_exclude_all(metadata)

Exclude all strains regardless of the given metadata content.

This is a placeholder function that can be called as part of a generalized loop through all possible functions.

Parameters

metadata (pandas.DataFrame) – Metadata indexed by strain name

Returns

Empty set of strains

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> filter_by_exclude_all(metadata)
set()

augur.filter.include_exclude_rules.filter_by_exclude_where(metadata, exclude_where)

Exclude all strains from the given metadata that match the given exclusion query.

Unlike pandas query syntax, exclusion queries follow the pattern “property=value” or “property!=value”. The comparison is case-insensitive: all values are converted to strings and lowercased before the query is tested.

Parameters

metadata (pandas.DataFrame) – Metadata indexed by strain name
exclude_where (str) – Filter query used to exclude strains

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> filter_by_exclude_where(metadata, "region!=Europe")
{'strain2'}
>>> filter_by_exclude_where(metadata, "region=Europe")
{'strain1'}
>>> filter_by_exclude_where(metadata, "region=europe")
{'strain1'}

If the column referenced in the given query does not exist, the filter is skipped and all strains pass.

>>> sorted(filter_by_exclude_where(metadata, "missing_column=value"))
['strain1', 'strain2']
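
The case-insensitive matching and the missing-column behavior can be sketched as follows. This is an illustration under stated assumptions, not augur's code: `filter_by_exclude_where_sketch` is a hypothetical name, and a dictionary of strain records stands in for the DataFrame.

```python
import operator


def filter_by_exclude_where_sketch(metadata, exclude_where):
    """Hypothetical sketch; metadata maps strain name -> record dict."""
    # Check for "!=" first because it also contains "=".
    if "!=" in exclude_where:
        column, value = exclude_where.split("!=", 1)
        compare = operator.ne
    else:
        column, value = exclude_where.split("=", 1)
        compare = operator.eq

    # Skip the filter (keep every strain) when no record has the column.
    if not any(column in record for record in metadata.values()):
        return set(metadata)

    # Compare lowercased string forms so "Europe" matches "europe".
    value = value.lower()
    excluded = {
        strain
        for strain, record in metadata.items()
        if compare(str(record.get(column, "")).lower(), value)
    }
    return set(metadata) - excluded


metadata = {"strain1": {"region": "Africa"}, "strain2": {"region": "Europe"}}
not_europe_excluded = filter_by_exclude_where_sketch(metadata, "region!=Europe")
europe_excluded = filter_by_exclude_where_sketch(metadata, "region=europe")
skipped = filter_by_exclude_where_sketch(metadata, "missing_column=value")
```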

augur.filter.include_exclude_rules.filter_by_max_date(metadata, max_date, **kwargs)

Filter metadata by maximum date.

Alias to filter_by_date using max_date only.

augur.filter.include_exclude_rules.filter_by_min_date(metadata, min_date, **kwargs)

Filter metadata by minimum date.

Alias to filter_by_date using min_date only.

augur.filter.include_exclude_rules.filter_by_non_nucleotide(metadata, sequence_index)

Filter metadata for strains with invalid nucleotide content.

Parameters

metadata (pandas.DataFrame) – Metadata indexed by strain name
sequence_index (pandas.DataFrame) – Sequence index

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "invalid_nucleotides": 0}, {"strain": "strain2", "invalid_nucleotides": 1}]).set_index("strain")
>>> filter_by_non_nucleotide(metadata, sequence_index)
{'strain1'}

augur.filter.include_exclude_rules.filter_by_query(metadata, query)

Filter metadata in the given pandas DataFrame with a query string and return the strain names that pass the filter.

Parameters

metadata (pandas.DataFrame) – Metadata indexed by strain name
query (str) – Query string for the dataframe.

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> filter_by_query(metadata, "region == 'Africa'")
{'strain1'}
>>> filter_by_query(metadata, "region == 'North America'")
set()

augur.filter.include_exclude_rules.filter_by_sequence_index(metadata, sequence_index)

Filter metadata by presence of corresponding entries in a given sequence index. This filter effectively intersects the strain IDs in the metadata and sequence index.

Parameters

metadata (pandas.DataFrame) – Metadata indexed by strain name
sequence_index (pandas.DataFrame) – Sequence index

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "ACGT": 28000}]).set_index("strain")
>>> filter_by_sequence_index(metadata, sequence_index)
{'strain1'}

augur.filter.include_exclude_rules.filter_by_sequence_length(metadata, sequence_index, min_length=0)

Filter metadata by sequence length from a given sequence index.

Parameters

metadata (pandas.DataFrame) – Metadata indexed by strain name
sequence_index (pandas.DataFrame) – Sequence index
min_length (int) – Minimum number of standard nucleotide characters (A, C, G, or T) in each sequence

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}]).set_index("strain")
>>> filter_by_sequence_length(metadata, sequence_index, min_length=27000)
{'strain1'}

It is possible for the sequence index to be missing strains present in the metadata.

>>> sequence_index = pd.DataFrame([{"strain": "strain3", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}]).set_index("strain")
>>> filter_by_sequence_length(metadata, sequence_index, min_length=27000)
set()
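
Both examples above follow from summing the A, C, G, and T counts per strain and dropping strains that have no index entry. The sketch below shows that logic with plain dictionaries in place of DataFrames. `filter_by_sequence_length_sketch` is a hypothetical name, and the inclusive comparison (>=) is an assumption, since the examples do not test a sequence exactly at the threshold.

```python
def filter_by_sequence_length_sketch(strains, sequence_index, min_length=0):
    """Keep strains with at least min_length standard nucleotides."""
    passed = set()
    for strain in strains:
        counts = sequence_index.get(strain)
        if counts is None:
            # A strain without a sequence index entry cannot pass.
            continue
        standard_bases = sum(counts.get(base, 0) for base in "ACGT")
        if standard_bases >= min_length:
            passed.add(strain)
    return passed


strains = ["strain1", "strain2"]
full_index = {
    "strain1": {"A": 7000, "C": 7000, "G": 7000, "T": 7000},
    "strain2": {"A": 6500, "C": 6500, "G": 6500, "T": 6500},
}
partial_index = {
    "strain3": {"A": 7000, "C": 7000, "G": 7000, "T": 7000},
    "strain2": {"A": 6500, "C": 6500, "G": 6500, "T": 6500},
}
long_enough = filter_by_sequence_length_sketch(strains, full_index, min_length=27000)
none_pass = filter_by_sequence_length_sketch(strains, partial_index, min_length=27000)
```

In the second call, strain1 is missing from the index and strain2 has only 26000 standard nucleotides, so no strain passes.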

augur.filter.include_exclude_rules.force_include_strains(metadata, include_file)

Force-include the strains from the given metadata that are listed in the given text file.

Parameters

metadata (pandas.DataFrame) – Metadata indexed by strain name
include_file (str) – Filename with strain names to include from the given metadata

Returns

Strains that pass the filter

Return type

set[str]

>>> import os
>>> from tempfile import NamedTemporaryFile
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> with NamedTemporaryFile(delete=False) as include_file:
...     characters_written = include_file.write(b'strain1')
>>> force_include_strains(metadata, include_file.name)
{'strain1'}
>>> os.unlink(include_file.name)

augur.filter.include_exclude_rules.force_include_where(metadata, include_where)

Include all strains from the given metadata that match the given query.

Unlike pandas query syntax, inclusion queries follow the pattern “property=value” or “property!=value”. The comparison is case-insensitive: all values are converted to strings and lowercased before the query is tested.

Parameters

metadata (pandas.DataFrame) – Metadata indexed by strain name
include_where (str) – Filter query used to include strains

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> force_include_where(metadata, "region!=Europe")
{'strain1'}
>>> force_include_where(metadata, "region=Europe")
{'strain2'}
>>> force_include_where(metadata, "region=europe")
{'strain2'}

If the column referenced in the given query does not exist, the filter is skipped and no strains are force-included.

>>> force_include_where(metadata, "missing_column=value")
set()

augur.filter.include_exclude_rules.parse_filter_query(query)

Parse an augur filter-style query and return the corresponding column, operator, and value for the query.

Parameters

query (str) – augur filter-style query following the pattern of “property=value” or “property!=value”

Returns

str – Name of column to query
callable – Operator function to test equality or non-equality of values
str – Value of column to query

>>> parse_filter_query("property=value")
('property', <built-in function eq>, 'value')
>>> parse_filter_query("property!=value")
('property', <built-in function ne>, 'value')
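
A parser with this return shape can be sketched with a regular expression and the standard operator module. This is one possible approach, not necessarily how augur does it; `parse_filter_query_sketch` is a hypothetical name, and raising ValueError on a malformed query is an assumption.

```python
import operator
import re

# Column name, then "=" or "!=", then the value.
QUERY_PATTERN = re.compile(r"^(?P<column>[^=!]+)(?P<op>!?=)(?P<value>.*)$")


def parse_filter_query_sketch(query):
    """Split "property=value" or "property!=value" into its three parts."""
    match = QUERY_PATTERN.match(query)
    if match is None:
        # Assumed behavior: reject queries without "=" or "!=".
        raise ValueError(f"Could not parse filter query: {query!r}")
    compare = operator.ne if match.group("op") == "!=" else operator.eq
    return match.group("column"), compare, match.group("value")


column, compare, value = parse_filter_query_sketch("region!=Europe")
# The returned callable can be applied directly to a pair of values.
is_match = compare("Africa", value)
```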