augur.filter.include_exclude_rules

augur.filter.include_exclude_rules.apply_filters(metadata, exclude_by, include_by)

Apply a list of filters to exclude or force-include records from the given metadata and return the strains to keep, to exclude, and to force include.

Parameters
  • metadata (pandas.DataFrame) – Metadata to filter

  • exclude_by (list[tuple]) – A list of 2-element tuples, each with a callable to filter by as the first element and a dictionary of kwargs to pass to that callable as the second element.

  • include_by (list[tuple]) – A list of 2-element tuples in the same format as the exclude_by argument.

Returns

  • set – Strains to keep (those that passed all filters)

  • list[dict] – Strains to exclude along with the function that filtered them and the arguments used to run the function.

  • list[dict] – Strains to force-include along with the function that filtered them and the arguments used to run the function.

For example, filter data by minimum date, but force the inclusion of strains from Africa.

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-10-02"}, {"region": "North America", "date": "2020-01-01"}], index=["strain1", "strain2", "strain3"])
>>> exclude_by = [(filter_by_date, {"min_date": numeric_date("2020-04-01")})]
>>> include_by = [(force_include_where, {"include_where": "region=Africa"})]
>>> strains_to_keep, strains_to_exclude, strains_to_include = apply_filters(metadata, exclude_by, include_by)
>>> strains_to_keep
{'strain2'}
>>> sorted(strains_to_exclude, key=lambda record: record["strain"])
[{'strain': 'strain1', 'filter': 'filter_by_date', 'kwargs': '[["min_date", 2020.25]]'}, {'strain': 'strain3', 'filter': 'filter_by_date', 'kwargs': '[["min_date", 2020.25]]'}]
>>> strains_to_include
[{'strain': 'strain1', 'filter': 'force_include_where', 'kwargs': '[["include_where", "region=Africa"]]'}]

We can also filter by characteristics of the sequence data that are annotated in a sequence index, reusing the metadata from the previous example.

>>> sequence_index = pd.DataFrame([{"strain": "strain1", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}, {"strain": "strain3", "A": 1250, "C": 1250, "G": 1250, "T": 1250}]).set_index("strain")
>>> exclude_by = [(filter_by_sequence_length, {"sequence_index": sequence_index, "min_length": 27000})]
>>> include_by = [(force_include_where, {"include_where": "region=Europe"})]
>>> strains_to_keep, strains_to_exclude, strains_to_include = apply_filters(metadata, exclude_by, include_by)
>>> strains_to_keep
{'strain1'}
>>> sorted(strains_to_exclude, key=lambda record: record["strain"])
[{'strain': 'strain2', 'filter': 'filter_by_sequence_length', 'kwargs': '[["min_length", 27000]]'}, {'strain': 'strain3', 'filter': 'filter_by_sequence_length', 'kwargs': '[["min_length", 27000]]'}]
>>> strains_to_include
[{'strain': 'strain2', 'filter': 'force_include_where', 'kwargs': '[["include_where", "region=Europe"]]'}]
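
The loop over (callable, kwargs) tuples described above can be sketched as follows. This is a minimal illustration and not augur's implementation: a plain dictionary stands in for the pandas DataFrame, and sketch_apply_filters, keep_min_year, and include_region are hypothetical names. The JSON encoding of the kwargs is an assumption chosen to resemble the record format shown in the examples.

```python
import json


def keep_min_year(metadata, min_year):
    """Hypothetical exclusion filter: return strains sampled in min_year or later."""
    return {s for s, record in metadata.items() if record["year"] >= min_year}


def include_region(metadata, region):
    """Hypothetical force-inclusion rule: return strains from the given region."""
    return {s for s, record in metadata.items() if record["region"] == region}


def describe(strain, function, kwargs):
    """Record which function acted on a strain and with which arguments."""
    return {
        "strain": strain,
        "filter": function.__name__,
        "kwargs": json.dumps(sorted(kwargs.items())),
    }


def sketch_apply_filters(metadata, exclude_by, include_by):
    """Apply each exclusion filter in turn, then collect forced inclusions."""
    remaining = dict(metadata)
    excluded, included = [], []
    for function, kwargs in exclude_by:
        passed = function(remaining, **kwargs)
        # Strains that were present before this filter but did not pass it.
        for strain in sorted(set(remaining) - passed):
            excluded.append(describe(strain, function, kwargs))
        remaining = {s: r for s, r in remaining.items() if s in passed}
    for function, kwargs in include_by:
        # Inclusion rules look at the full metadata, not only the survivors.
        for strain in sorted(function(metadata, **kwargs)):
            included.append(describe(strain, function, kwargs))
    return set(remaining), excluded, included


metadata = {
    "strain1": {"region": "Africa", "year": 2019},
    "strain2": {"region": "Europe", "year": 2021},
}
keep, excluded, included = sketch_apply_filters(
    metadata,
    exclude_by=[(keep_min_year, {"min_year": 2020})],
    include_by=[(include_region, {"region": "Africa"})],
)
```

Keeping the exclusions and forced inclusions as separate lists leaves the caller free to decide how to combine them with the set of strains to keep.
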
augur.filter.include_exclude_rules.construct_filters(args, sequence_index)

Construct lists of filters and inclusion criteria based on user-provided arguments.

Parameters
  • args (argparse.Namespace) – Command line arguments provided by the user.

  • sequence_index (pandas.DataFrame) – Sequence index for the provided arguments.

Returns

  • list – A list of 2-element tuples with a callable to use as a filter and a dictionary of kwargs to pass to the callable.

  • list – A list of 2-element tuples with a callable and dictionary of kwargs that determines whether to force include strains in the final output.
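
One way to picture the two returned lists is the sketch below. It is an assumption-laden illustration: the attribute names min_date and include_where on the namespace, and the placeholder callables keep_after and include_matching, are hypothetical and only show the shape of the (callable, kwargs) tuples.

```python
import argparse


def keep_after(metadata, min_date):
    """Hypothetical exclusion filter; a real filter would inspect the metadata."""
    return set(metadata)


def include_matching(metadata, include_where):
    """Hypothetical force-inclusion rule; a real rule would test the query."""
    return set()


def sketch_construct_filters(args):
    """Translate parsed arguments into lists of (callable, kwargs) tuples."""
    exclude_by = []
    include_by = []
    # Only add a filter when the user supplied the corresponding option.
    if args.min_date is not None:
        exclude_by.append((keep_after, {"min_date": args.min_date}))
    # Add one inclusion rule per query the user provided.
    for query in args.include_where or []:
        include_by.append((include_matching, {"include_where": query}))
    return exclude_by, include_by


args = argparse.Namespace(min_date=2020.25, include_where=["region=Africa"])
exclude_by, include_by = sketch_construct_filters(args)

# With no options set, both lists are empty.
no_args = argparse.Namespace(min_date=None, include_where=None)
empty_exclude, empty_include = sketch_construct_filters(no_args)
```
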

augur.filter.include_exclude_rules.filter_by_ambiguous_date(metadata, date_column='date', ambiguity='any')

Filter out records in the given pandas DataFrame whose values in the given date column are ambiguous at the given level.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • date_column (str) – Column in the dataframe with dates.

  • ambiguity (str) – Level of date ambiguity to filter metadata by, such as “any” or “month”

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-XX"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> filter_by_ambiguous_date(metadata)
{'strain2'}
>>> sorted(filter_by_ambiguous_date(metadata, ambiguity="month"))
['strain1', 'strain2']

If the requested date column does not exist, the filter is skipped and all strains pass.

>>> sorted(filter_by_ambiguous_date(metadata, date_column="missing_column"))
['strain1', 'strain2']
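
A possible reading of the ambiguity levels is sketched below. It assumes that unknown digits are written as "X" (as in "2020-01-XX") and that ambiguity is hierarchical, so an unknown month also counts at the "month" level when the year is unknown. The function name and the "day" and "year" levels are assumptions that go beyond the examples above.

```python
def sketch_is_date_ambiguous(date, ambiguity="any"):
    """Return True when a date string is ambiguous at the given level."""
    # Pad missing components so that "2020" is treated like "2020-XX-XX".
    parts = date.split("-") + ["XX", "XX"]
    year, month, day = parts[:3]
    if ambiguity == "year":
        return "X" in year
    if ambiguity == "month":
        return "X" in year or "X" in month
    # "day" and "any" both flag an unknown character anywhere in the date.
    return "X" in year or "X" in month or "X" in day


dates = {"strain1": "2020-01-XX", "strain2": "2020-01-02"}
passed_any = {
    strain for strain, value in dates.items()
    if not sketch_is_date_ambiguous(value, "any")
}
passed_month = {
    strain for strain, value in dates.items()
    if not sketch_is_date_ambiguous(value, "month")
}
```
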
augur.filter.include_exclude_rules.filter_by_date(metadata, date_column='date', min_date=None, max_date=None)

Filter metadata by minimum or maximum date.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • date_column (str) – Column in the dataframe with dates.

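  • min_date (float) – Minimum date (inclusive), given as a numeric date such as the value returned by numeric_date

  • max_date (float) – Maximum date (inclusive), given as a numeric date such as the value returned by numeric_date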

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> filter_by_date(metadata, min_date=numeric_date("2020-01-02"))
{'strain2'}
>>> filter_by_date(metadata, max_date=numeric_date("2020-01-01"))
{'strain1'}
>>> filter_by_date(metadata, min_date=numeric_date("2020-01-03"), max_date=numeric_date("2020-01-10"))
set()
>>> sorted(filter_by_date(metadata, min_date=numeric_date("2019-12-30"), max_date=numeric_date("2020-01-10")))
['strain1', 'strain2']
>>> sorted(filter_by_date(metadata))
['strain1', 'strain2']

If the requested date column does not exist, the filter is skipped and all strains pass.

>>> sorted(filter_by_date(metadata, date_column="missing_column", min_date=numeric_date("2020-01-02")))
['strain1', 'strain2']
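
The optional, inclusive bounds shown in the examples can be sketched with the standard library alone. The helper to_decimal_year is a hypothetical stand-in for numeric_date and may not produce the same values; a plain dictionary of ISO dates replaces the DataFrame.

```python
from datetime import date


def to_decimal_year(iso_date):
    """Hypothetical helper: convert "YYYY-MM-DD" to a fractional year."""
    day = date.fromisoformat(iso_date)
    start = date(day.year, 1, 1)
    days_in_year = (date(day.year + 1, 1, 1) - start).days
    return day.year + (day - start).days / days_in_year


def sketch_filter_by_date(dates, min_date=None, max_date=None):
    """Keep strains whose date lies within the optional inclusive bounds."""
    passed = set()
    for strain, iso_date in dates.items():
        value = to_decimal_year(iso_date)
        # A bound of None means that side of the range is unconstrained.
        if min_date is not None and value < min_date:
            continue
        if max_date is not None and value > max_date:
            continue
        passed.add(strain)
    return passed


dates = {"strain1": "2020-01-01", "strain2": "2020-01-02"}
after = sketch_filter_by_date(dates, min_date=to_decimal_year("2020-01-02"))
before = sketch_filter_by_date(dates, max_date=to_decimal_year("2020-01-01"))
unbounded = sketch_filter_by_date(dates)
```
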
augur.filter.include_exclude_rules.filter_by_exclude(metadata, exclude_file)

Exclude the given set of strains from the given metadata.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • exclude_file (str) – Filename with strain names to exclude from the given metadata

Returns

Strains that pass the filter

Return type

set[str]

>>> import os
>>> from tempfile import NamedTemporaryFile
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> with NamedTemporaryFile(delete=False) as exclude_file:
...     characters_written = exclude_file.write(b'strain1')
>>> filter_by_exclude(metadata, exclude_file.name)
{'strain2'}
>>> os.unlink(exclude_file.name)
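
Conceptually, this filter is a set difference between the strains in the metadata and the names listed in the file. The sketch below assumes one strain name per line and ignores blank lines; the real file format may allow more than that, and both function names are hypothetical.

```python
import os
from tempfile import NamedTemporaryFile


def read_strain_names(path):
    """Read one strain name per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def sketch_filter_by_exclude(strains, exclude_file):
    """Return the strains that are not listed in the exclusion file."""
    return set(strains) - read_strain_names(exclude_file)


# Write a temporary exclusion file, apply the filter, then clean up.
with NamedTemporaryFile("w", delete=False, encoding="utf-8") as handle:
    handle.write("strain1\n\nstrain3\n")
remaining = sketch_filter_by_exclude(["strain1", "strain2"], handle.name)
os.unlink(handle.name)
```

Names in the file that do not appear in the metadata, such as strain3 here, have no effect on the result.
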
augur.filter.include_exclude_rules.filter_by_exclude_all(metadata)

Exclude all strains regardless of the given metadata content.

This is a placeholder function that can be called as part of a generalized loop through all possible functions.

Parameters

metadata (pandas.DataFrame) – Metadata indexed by strain name

Returns

Empty set of strains

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> filter_by_exclude_all(metadata)
set()

augur.filter.include_exclude_rules.filter_by_exclude_where(metadata, exclude_where)

Exclude all strains from the given metadata that match the given exclusion query.

Unlike pandas query syntax, exclusion queries should follow the pattern of “property=value” or “property!=value”. Matching is also case-insensitive: all values are converted to strings and lowercased before the query is tested.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • exclude_where (str) – Filter query used to exclude strains

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> filter_by_exclude_where(metadata, "region!=Europe")
{'strain2'}
>>> filter_by_exclude_where(metadata, "region=Europe")
{'strain1'}
>>> filter_by_exclude_where(metadata, "region=europe")
{'strain1'}

If the column referenced in the given query does not exist, the filter is skipped and all strains pass.

>>> sorted(filter_by_exclude_where(metadata, "missing_column=value"))
['strain1', 'strain2']
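
The case-insensitive comparison and the missing-column behavior can be sketched as follows. This is an illustration under assumptions: the query is taken as already split into a column, an operator, and a value, a plain dictionary stands in for the DataFrame, and sketch_exclude_where is a hypothetical name.

```python
import operator


def sketch_exclude_where(metadata, column, op, value):
    """Return the strains that do not match the exclusion condition."""
    # A missing column means the condition cannot be tested, so keep everything.
    if any(column not in record for record in metadata.values()):
        return set(metadata)
    excluded = {
        strain
        for strain, record in metadata.items()
        # Compare lowercase string forms so that "Europe" matches "europe".
        if op(str(record[column]).lower(), str(value).lower())
    }
    return set(metadata) - excluded


metadata = {"strain1": {"region": "Africa"}, "strain2": {"region": "Europe"}}
not_europe_excluded = sketch_exclude_where(metadata, "region", operator.ne, "Europe")
europe_excluded = sketch_exclude_where(metadata, "region", operator.eq, "europe")
skipped = sketch_exclude_where(metadata, "missing_column", operator.eq, "value")
```
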
augur.filter.include_exclude_rules.filter_by_max_date(metadata, max_date, **kwargs)

Filter metadata by maximum date.

Alias to filter_by_date using max_date only.

augur.filter.include_exclude_rules.filter_by_min_date(metadata, min_date, **kwargs)

Filter metadata by minimum date.

Alias to filter_by_date using min_date only.

augur.filter.include_exclude_rules.filter_by_non_nucleotide(metadata, sequence_index)

Exclude strains whose sequences contain invalid nucleotide characters, as recorded in the sequence index.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • sequence_index (pandas.DataFrame) – Sequence index

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "invalid_nucleotides": 0}, {"strain": "strain2", "invalid_nucleotides": 1}]).set_index("strain")
>>> filter_by_non_nucleotide(metadata, sequence_index)
{'strain1'}

augur.filter.include_exclude_rules.filter_by_query(metadata, query)

Filter metadata in the given pandas DataFrame with a query string and return the strain names that pass the filter.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • query (str) – Query string for the dataframe.

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> filter_by_query(metadata, "region == 'Africa'")
{'strain1'}
>>> filter_by_query(metadata, "region == 'North America'")
set()

augur.filter.include_exclude_rules.filter_by_sequence_index(metadata, sequence_index)

Filter metadata by presence of corresponding entries in a given sequence index. This filter effectively intersects the strain ids in the metadata and sequence index.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • sequence_index (pandas.DataFrame) – Sequence index

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "ACGT": 28000}]).set_index("strain")
>>> filter_by_sequence_index(metadata, sequence_index)
{'strain1'}

augur.filter.include_exclude_rules.filter_by_sequence_length(metadata, sequence_index, min_length=0)

Filter metadata by sequence length from a given sequence index.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • sequence_index (pandas.DataFrame) – Sequence index

  • min_length (int) – Minimum number of standard nucleotide characters (A, C, G, or T) in each sequence

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}]).set_index("strain")
>>> filter_by_sequence_length(metadata, sequence_index, min_length=27000)
{'strain1'}

The sequence index may be missing strains that are present in the metadata. Such strains do not pass the filter.

>>> sequence_index = pd.DataFrame([{"strain": "strain3", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}]).set_index("strain")
>>> filter_by_sequence_length(metadata, sequence_index, min_length=27000)
set()
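
The length test described above amounts to summing the A, C, G, and T counts for each strain and comparing the total with min_length. The sketch below uses plain dictionaries in place of DataFrames and a hypothetical function name; treating the bound as inclusive is an assumption based on the word "minimum".

```python
def sketch_filter_by_sequence_length(strains, sequence_index, min_length=0):
    """Keep strains with at least min_length standard nucleotides."""
    passed = set()
    for strain in strains:
        counts = sequence_index.get(strain)
        # Strains without an entry in the sequence index cannot pass.
        if counts is None:
            continue
        length = sum(counts[base] for base in "ACGT")
        if length >= min_length:
            passed.add(strain)
    return passed


strains = ["strain1", "strain2"]
full_index = {
    "strain1": {"A": 7000, "C": 7000, "G": 7000, "T": 7000},
    "strain2": {"A": 6500, "C": 6500, "G": 6500, "T": 6500},
}
# Same index, but with strain1 replaced by a strain absent from the metadata.
partial_index = {
    "strain3": {"A": 7000, "C": 7000, "G": 7000, "T": 7000},
    "strain2": {"A": 6500, "C": 6500, "G": 6500, "T": 6500},
}
long_enough = sketch_filter_by_sequence_length(strains, full_index, min_length=27000)
none_pass = sketch_filter_by_sequence_length(strains, partial_index, min_length=27000)
```
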
augur.filter.include_exclude_rules.force_include_strains(metadata, include_file)

Force-include strains from the given metadata that are listed in the given text file.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • include_file (str) – Filename with strain names to include from the given metadata

Returns

Strains that pass the filter

Return type

set[str]

>>> import os
>>> from tempfile import NamedTemporaryFile
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> with NamedTemporaryFile(delete=False) as include_file:
...     characters_written = include_file.write(b'strain1')
>>> force_include_strains(metadata, include_file.name)
{'strain1'}
>>> os.unlink(include_file.name)

augur.filter.include_exclude_rules.force_include_where(metadata, include_where)

Include all strains from the given metadata that match the given query.

Unlike pandas query syntax, inclusion queries should follow the pattern of “property=value” or “property!=value”. Matching is also case-insensitive: all values are converted to strings and lowercased before the query is tested.

Parameters
  • metadata (pandas.DataFrame) – Metadata indexed by strain name

  • include_where (str) – Filter query used to include strains

Returns

Strains that pass the filter

Return type

set[str]

>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> force_include_where(metadata, "region!=Europe")
{'strain1'}
>>> force_include_where(metadata, "region=Europe")
{'strain2'}
>>> force_include_where(metadata, "region=europe")
{'strain2'}

If the column referenced in the given query does not exist, the filter is skipped and no strains are force-included.

>>> force_include_where(metadata, "missing_column=value")
set()

augur.filter.include_exclude_rules.parse_filter_query(query)

Parse an augur filter-style query and return the corresponding column, operator, and value for the query.

Parameters

query (str) – augur filter-style query following the pattern of “property=value” or “property!=value”

Returns

  • str – Name of column to query

  • callable – Operator function to test equality or non-equality of values

  • str – Value of column to query

>>> parse_filter_query("property=value")
('property', <built-in function eq>, 'value')
>>> parse_filter_query("property!=value")
('property', <built-in function ne>, 'value')
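
A parser with the same return shape can be sketched with the standard operator and re modules. The function name is hypothetical, and raising ValueError for a query with no operator is an assumption; the documentation above does not say how malformed queries are handled.

```python
import operator
import re


def sketch_parse_filter_query(query):
    """Split "property=value" or "property!=value" into column, operator, value."""
    # Try "!=" before "=" so that the "!" is not left attached to the column name.
    match = re.match(r"^(?P<column>[^=!]+)(?P<op>!=|=)(?P<value>.*)$", query)
    if match is None:
        raise ValueError(f"Unrecognized filter query: {query!r}")
    op = operator.ne if match.group("op") == "!=" else operator.eq
    return match.group("column"), op, match.group("value")


equal_query = sketch_parse_filter_query("property=value")
not_equal_query = sketch_parse_filter_query("property!=value")
```

Returning the operator as a callable lets the caller apply it directly to pairs of values, for example op("africa", "europe").
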