augur.filter.include_exclude_rules module
- augur.filter.include_exclude_rules.apply_filters(metadata, exclude_by, include_by)
Apply a list of filters to exclude or force-include records from the given metadata and return the strains to keep, to exclude, and to force include.
- Parameters:
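metadata (pandas.DataFrame) – Metadata to filter
exclude_by (list of tuple) – Filters to apply, each a 2-element tuple of a filter function and a dict of keyword arguments to pass to it
include_by (list of tuple) – Force-inclusion rules in the same format as exclude_by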
- Returns:
set – Strains to keep (those that passed all filters)
list of dict – Strains to exclude along with the function that filtered them and the arguments used to run the function.
list of dict – Strains to force-include along with the function that filtered them and the arguments used to run the function.
For example, filter data by minimum date, but force the inclusion of strains from Africa.
Examples
>>> from augur.dates import numeric_date
>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-10-02"}, {"region": "North America", "date": "2020-01-01"}], index=["strain1", "strain2", "strain3"])
>>> exclude_by = [(filter_by_min_date, {"date_column": "date", "min_date": numeric_date("2020-04-01")})]
>>> include_by = [(force_include_where, {"include_where": "region=Africa"})]
>>> strains_to_keep, strains_to_exclude, strains_to_include = apply_filters(metadata, exclude_by, include_by)
>>> strains_to_keep
{'strain2'}
>>> sorted(strains_to_exclude, key=lambda record: record["strain"])
[{'strain': 'strain1', 'filter': 'filter_by_min_date', 'kwargs': '[["date_column", "date"], ["min_date", 2020.25]]'}, {'strain': 'strain3', 'filter': 'filter_by_min_date', 'kwargs': '[["date_column", "date"], ["min_date", 2020.25]]'}]
>>> strains_to_include
[{'strain': 'strain1', 'filter': 'force_include_where', 'kwargs': '[["include_where", "region=Africa"]]'}]
We also want to filter by characteristics of the sequence data that we’ve annotated in a sequence index.
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}, {"strain": "strain3", "A": 1250, "C": 1250, "G": 1250, "T": 1250}]).set_index("strain")
>>> exclude_by = [(filter_by_min_length, {"sequence_index": sequence_index, "min_length": 27000})]
>>> include_by = [(force_include_where, {"include_where": "region=Europe"})]
>>> strains_to_keep, strains_to_exclude, strains_to_include = apply_filters(metadata, exclude_by, include_by)
>>> strains_to_keep
{'strain1'}
>>> sorted(strains_to_exclude, key=lambda record: record["strain"])
[{'strain': 'strain2', 'filter': 'filter_by_min_length', 'kwargs': '[["min_length", 27000]]'}, {'strain': 'strain3', 'filter': 'filter_by_min_length', 'kwargs': '[["min_length", 27000]]'}]
>>> strains_to_include
[{'strain': 'strain2', 'filter': 'force_include_where', 'kwargs': '[["include_where", "region=Europe"]]'}]
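The bookkeeping behind these three return values can be sketched in plain Python. This is a minimal illustration under stated assumptions, not augur's implementation: `sketch_apply_filters`, `min_value`, `where_region`, and the dict-based metadata are hypothetical stand-ins for the DataFrame-based API, and serialising kwargs as a JSON list of sorted pairs is inferred from the example output above.

```python
import json


def sketch_apply_filters(metadata, exclude_by, include_by):
    """Hypothetical stand-in for apply_filters over a dict of strain -> record."""
    strains_to_keep = set(metadata)
    strains_to_exclude = []
    strains_to_include = []

    def describe(strain, function, kwargs):
        # Record which function acted on the strain and with which arguments.
        return {
            "strain": strain,
            "filter": function.__name__,
            "kwargs": json.dumps(sorted(kwargs.items())),
        }

    for function, kwargs in exclude_by:
        # Each exclusion filter returns the set of strains that pass it.
        remaining = {strain: metadata[strain] for strain in strains_to_keep}
        passed = function(remaining, **kwargs)
        for strain in sorted(strains_to_keep - passed):
            strains_to_exclude.append(describe(strain, function, kwargs))
        strains_to_keep &= passed

    for function, kwargs in include_by:
        # Force-inclusion rules are reported separately from strains_to_keep.
        for strain in sorted(function(metadata, **kwargs)):
            strains_to_include.append(describe(strain, function, kwargs))

    return strains_to_keep, strains_to_exclude, strains_to_include


def min_value(metadata, column, minimum):
    """Hypothetical filter: keep strains whose column value is at least minimum."""
    return {s for s, record in metadata.items() if record[column] >= minimum}


def where_region(metadata, region):
    """Hypothetical inclusion rule: select strains from the given region."""
    return {s for s, record in metadata.items() if record["region"] == region}


metadata = {
    "strain1": {"region": "Africa", "length": 100},
    "strain2": {"region": "Europe", "length": 300},
}
keep, excluded, included = sketch_apply_filters(
    metadata,
    exclude_by=[(min_value, {"column": "length", "minimum": 200})],
    include_by=[(where_region, {"region": "Africa"})],
)
```

As in the examples above, a force-included strain is reported in the third return value rather than added to the set of strains to keep.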
- augur.filter.include_exclude_rules.construct_filters(args, sequence_index)
Construct lists of filters and inclusion criteria based on user-provided arguments.
- augur.filter.include_exclude_rules.extract_variables(pandas_query)
Try extracting all variable names used in a pandas query string.
If successful, return the variable names as a set. Otherwise, return None.
Examples
>>> extract_variables("var1 == 'value'")
{'var1'}
>>> sorted(extract_variables("var1 == 'value' & var2 == 10"))
['var1', 'var2']
>>> extract_variables("var1.str.startswith('prefix')")
{'var1'}
>>> extract_variables("this query is invalid")
Backtick quoting is also supported.
>>> extract_variables("`include me` == 'but not `me`'")
{'include me'}
>>> extract_variables("`include me once` == 'a' or `include me once` == 'b'")
{'include me once'}
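The "try to parse, otherwise return nothing" behaviour can be approximated with the standard library alone. The sketch below assumes the query is also valid Python expression syntax and collects bare names with `ast`. `sketch_extract_variables` is a hypothetical name, and unlike the function documented here it does not handle backtick-quoted names, which fail to parse and so yield None.

```python
import ast


def sketch_extract_variables(query):
    """Return the set of bare names in a query, or None if it does not parse."""
    try:
        tree = ast.parse(query, mode="eval")
    except SyntaxError:
        # Mirror the documented behaviour: invalid queries yield nothing.
        return None
    # Attribute chains such as var1.str.startswith contribute only "var1".
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


simple = sketch_extract_variables("var1 == 'value'")
combined = sketch_extract_variables("var1 == 'value' & var2 == 10")
accessor = sketch_extract_variables("var1.str.startswith('prefix')")
invalid = sketch_extract_variables("this query is invalid")
```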
- augur.filter.include_exclude_rules.filter_by_ambiguous_date(metadata, date_column, ambiguity)
Filter metadata in the given pandas DataFrame where values in the given date column have a given level of ambiguity.
- Parameters:
metadata (pandas.DataFrame) – Metadata indexed by strain name
date_column (str) – Column in the dataframe with dates.
ambiguity (str) – Level of date ambiguity to filter metadata by
- Return type:
set of str
Examples
>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-XX"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> filter_by_ambiguous_date(metadata, date_column="date", ambiguity="any")
{'strain2'}
>>> sorted(filter_by_ambiguous_date(metadata, date_column="date", ambiguity="month"))
['strain1', 'strain2']
If the requested date column does not exist, we quietly skip this filter.
>>> sorted(filter_by_ambiguous_date(metadata, date_column="missing_column", ambiguity="any"))
['strain1', 'strain2']
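One way to read the ambiguity levels is as a hierarchy over the parts of a `YYYY-MM-DD` string, where `X` marks an unknown digit. The sketch below is an assumption about how such a check could work, not the library's code: the helper names are hypothetical, metadata is a plain dict of strain to record, and the levels `day` and `year` are assumed alongside the `any` and `month` levels shown above.

```python
def is_date_ambiguous(date, ambiguity):
    """Report whether a YYYY-MM-DD string is ambiguous at the requested level."""
    parts = date.split("-", 2)
    # Treat components that are absent as fully ambiguous.
    year, month, day = (parts + ["XX", "XX"])[:3]
    return (
        "X" in year
        or ("X" in month and ambiguity in ("any", "month", "day"))
        or ("X" in day and ambiguity in ("any", "day"))
    )


def sketch_filter_by_ambiguous_date(metadata, date_column, ambiguity):
    """Keep strains whose date is not ambiguous at the requested level."""
    # Quietly keep every strain when the date column is missing.
    if any(date_column not in record for record in metadata.values()):
        return set(metadata)
    return {
        strain
        for strain, record in metadata.items()
        if not is_date_ambiguous(record[date_column], ambiguity)
    }


metadata = {
    "strain1": {"date": "2020-01-XX"},
    "strain2": {"date": "2020-01-02"},
}
kept_any = sketch_filter_by_ambiguous_date(metadata, "date", "any")
kept_month = sketch_filter_by_ambiguous_date(metadata, "date", "month")
kept_missing = sketch_filter_by_ambiguous_date(metadata, "missing_column", "any")
```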
- augur.filter.include_exclude_rules.filter_by_exclude(metadata, exclude_file)
Exclude the given set of strains from the given metadata.
- Parameters:
metadata (pandas.DataFrame) – Metadata indexed by strain name
exclude_file (str) – Filename with strain names to exclude from the given metadata
- Return type:
set of str
Examples
>>> import os
>>> from tempfile import NamedTemporaryFile
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> with NamedTemporaryFile(delete=False) as exclude_file:
...     characters_written = exclude_file.write(b'strain1')
>>> filter_by_exclude(metadata, exclude_file.name)
{'strain2'}
>>> os.unlink(exclude_file.name)
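Conceptually this filter is a set difference between the metadata strains and the names listed in the file. The sketch below assumes a file with one strain name per line and skips blank lines. `read_strain_names` and `sketch_filter_by_exclude` are hypothetical names, and the real file format may accept more than this.

```python
import os
from tempfile import NamedTemporaryFile


def read_strain_names(path):
    """Read one strain name per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def sketch_filter_by_exclude(strains, exclude_file):
    """Keep the strains that are not named in the exclusion file."""
    return set(strains) - read_strain_names(exclude_file)


# Write a small exclusion list, apply it, then clean up the temporary file.
with NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("strain1\n\nstrain3\n")

kept = sketch_filter_by_exclude(["strain1", "strain2"], handle.name)
os.unlink(handle.name)
```

Names in the file that do not occur in the metadata, such as `strain3` here, have no effect on the result.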
- augur.filter.include_exclude_rules.filter_by_exclude_all(metadata)
Exclude all strains regardless of the given metadata content.
This is a placeholder function that can be called as part of a generalized loop through all possible functions.
- Parameters:
metadata (pandas.DataFrame) – Metadata indexed by strain name
- Return type:
set of str
Examples
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> filter_by_exclude_all(metadata)
set()
- augur.filter.include_exclude_rules.filter_by_exclude_where(metadata, exclude_where)
Exclude all strains from the given metadata that match the given exclusion query.
Unlike pandas query syntax, exclusion queries should follow the pattern of “property=value” or “property!=value”. Additionally, this filter treats all values like lowercase strings, so we convert all values to strings first and then lowercase them before testing the given query.
- Parameters:
metadata (pandas.DataFrame) – Metadata indexed by strain name
exclude_where (str) – Filter query used to exclude strains
- Return type:
set of str
Examples
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> filter_by_exclude_where(metadata, "region!=Europe")
{'strain2'}
>>> filter_by_exclude_where(metadata, "region=Europe")
{'strain1'}
>>> filter_by_exclude_where(metadata, "region=europe")
{'strain1'}
If the column referenced in the given query does not exist, skip the filter and keep all strains.
>>> sorted(filter_by_exclude_where(metadata, "missing_column=value"))
['strain1', 'strain2']
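The case-insensitive comparison described above can be sketched without pandas. The code below assumes the query has already been split into a column, an operator, and a value. `sketch_exclude_where` and the dict-based metadata are hypothetical, chosen only to show how both sides are converted to lowercase strings before the test.

```python
import operator


def sketch_exclude_where(metadata, column, op, value):
    """Keep strains that do NOT match the (column, op, value) exclusion test."""
    # Skip the filter when the column is absent from the records.
    if any(column not in record for record in metadata.values()):
        return set(metadata)
    target = value.lower()
    return {
        strain
        for strain, record in metadata.items()
        # Convert to str first so that non-string values compare as text.
        if not op(str(record[column]).lower(), target)
    }


metadata = {"strain1": {"region": "Africa"}, "strain2": {"region": "Europe"}}
kept_eq = sketch_exclude_where(metadata, "region", operator.eq, "europe")
kept_ne = sketch_exclude_where(metadata, "region", operator.ne, "Europe")
kept_missing = sketch_exclude_where(metadata, "missing_column", operator.eq, "value")
```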
- augur.filter.include_exclude_rules.filter_by_max_date(metadata, date_column, max_date)
Filter metadata by maximum date.
- Parameters:
metadata (pandas.DataFrame) – Metadata indexed by strain name
date_column (str) – Column in the dataframe with dates.
max_date (float) – Maximum date
- Return type:
set of str
Examples
>>> from augur.dates import numeric_date
>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> filter_by_max_date(metadata, date_column="date", max_date=numeric_date("2020-01-01"))
{'strain1'}
If the requested date column does not exist, we quietly skip this filter.
>>> sorted(filter_by_max_date(metadata, date_column="missing_column", max_date=numeric_date("2020-01-01")))
['strain1', 'strain2']
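The date limits are floats because dates are compared as decimal years. The sketch below uses a mid-day convention for the conversion, which is an assumption chosen because it reproduces the value 2020.25 shown for 2020-04-01 in the apply_filters example. `decimal_year` and `sketch_filter_by_max_date` are hypothetical names, and only exact ISO dates are handled, not ambiguous ones.

```python
from calendar import isleap
from datetime import date


def decimal_year(day):
    """Convert a datetime.date to a decimal year, measured at mid-day."""
    days_in_year = 366 if isleap(day.year) else 365
    return day.year + (day.timetuple().tm_yday - 0.5) / days_in_year


def sketch_filter_by_max_date(metadata, date_column, max_date):
    """Keep strains dated on or before max_date (a decimal year)."""
    # Quietly keep every strain when the date column is missing.
    if any(date_column not in record for record in metadata.values()):
        return set(metadata)
    return {
        strain
        for strain, record in metadata.items()
        if decimal_year(date.fromisoformat(record[date_column])) <= max_date
    }


metadata = {
    "strain1": {"date": "2020-01-01"},
    "strain2": {"date": "2020-01-02"},
}
april_first = decimal_year(date(2020, 4, 1))
kept = sketch_filter_by_max_date(metadata, "date", decimal_year(date(2020, 1, 1)))
```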
- augur.filter.include_exclude_rules.filter_by_max_length(metadata, sequence_index, max_length)
Filter metadata by maximum sequence length, using base counts from a given sequence index.
- Parameters:
metadata (pandas.DataFrame) – Metadata indexed by strain name
sequence_index (pandas.DataFrame) – Sequence index
max_length (int) – Maximum number of standard nucleotide characters (A, C, G, or T) in each sequence
- Return type:
set of str
Examples
>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}]).set_index("strain")
>>> filter_by_max_length(metadata, sequence_index, max_length=27000)
{'strain2'}
- augur.filter.include_exclude_rules.filter_by_min_date(metadata, date_column, min_date)
Filter metadata by minimum date.
- Parameters:
metadata (pandas.DataFrame) – Metadata indexed by strain name
date_column (str) – Column in the dataframe with dates.
min_date (float) – Minimum date
- Return type:
set of str
Examples
>>> from augur.dates import numeric_date
>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> filter_by_min_date(metadata, date_column="date", min_date=numeric_date("2020-01-02"))
{'strain2'}
If the requested date column does not exist, we quietly skip this filter.
>>> sorted(filter_by_min_date(metadata, date_column="missing_column", min_date=numeric_date("2020-01-02")))
['strain1', 'strain2']
- augur.filter.include_exclude_rules.filter_by_min_length(metadata, sequence_index, min_length)
Filter metadata by minimum sequence length, using base counts from a given sequence index.
- Parameters:
metadata (pandas.DataFrame) – Metadata indexed by strain name
sequence_index (pandas.DataFrame) – Sequence index
min_length (int) – Minimum number of standard nucleotide characters (A, C, G, or T) in each sequence
- Return type:
set of str
Examples
>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}]).set_index("strain")
>>> filter_by_min_length(metadata, sequence_index, min_length=27000)
{'strain1'}
It is possible for the sequence index to be missing strains present in the metadata. Such strains do not pass the filter.
>>> sequence_index = pd.DataFrame([{"strain": "strain3", "A": 7000, "C": 7000, "G": 7000, "T": 7000}, {"strain": "strain2", "A": 6500, "C": 6500, "G": 6500, "T": 6500}]).set_index("strain")
>>> filter_by_min_length(metadata, sequence_index, min_length=27000)
set()
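The length test amounts to summing the A, C, G, and T counts for each strain and comparing the total against the threshold. The sketch below uses plain dicts in place of DataFrames. `sketch_filter_by_min_length` is a hypothetical name, and treating strains absent from the index as failing follows the example above.

```python
def sketch_filter_by_min_length(strains, sequence_index, min_length):
    """Keep strains whose A + C + G + T count is at least min_length."""
    kept = set()
    for strain in strains:
        counts = sequence_index.get(strain)
        if counts is None:
            # Strains absent from the sequence index cannot pass.
            continue
        if sum(counts[base] for base in "ACGT") >= min_length:
            kept.add(strain)
    return kept


strains = ["strain1", "strain2"]
full_index = {
    "strain1": {"A": 7000, "C": 7000, "G": 7000, "T": 7000},
    "strain2": {"A": 6500, "C": 6500, "G": 6500, "T": 6500},
}
partial_index = {
    "strain3": {"A": 7000, "C": 7000, "G": 7000, "T": 7000},
    "strain2": {"A": 6500, "C": 6500, "G": 6500, "T": 6500},
}
kept_full = sketch_filter_by_min_length(strains, full_index, 27000)
kept_partial = sketch_filter_by_min_length(strains, partial_index, 27000)
```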
- augur.filter.include_exclude_rules.filter_by_non_nucleotide(metadata, sequence_index)
Filter metadata for strains with invalid nucleotide content.
- Parameters:
metadata (pandas.DataFrame) – Metadata indexed by strain name
sequence_index (pandas.DataFrame) – Sequence index
- Return type:
set of str
Examples
>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "invalid_nucleotides": 0}, {"strain": "strain2", "invalid_nucleotides": 1}]).set_index("strain")
>>> filter_by_non_nucleotide(metadata, sequence_index)
{'strain1'}
- augur.filter.include_exclude_rules.filter_by_query(metadata, query, column_types=None)
Filter metadata in the given pandas DataFrame with a query string and return the strain names that pass the filter.
- Parameters:
metadata (pandas.DataFrame) – Metadata indexed by strain name
query (str) – Query string for the dataframe.
column_types (dict, optional) – Mapping of column names to data types
- Return type:
set of str
Examples
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> filter_by_query(metadata, "region == 'Africa'")
{'strain1'}
>>> filter_by_query(metadata, "region == 'North America'")
set()
- augur.filter.include_exclude_rules.filter_by_sequence_index(metadata, sequence_index)
Filter metadata by presence of corresponding entries in a given sequence index. This filter effectively intersects the strain ids in the metadata and sequence index.
- Parameters:
metadata (pandas.DataFrame) – Metadata indexed by strain name
sequence_index (pandas.DataFrame) – Sequence index
- Return type:
set of str
Examples
>>> metadata = pd.DataFrame([{"region": "Africa", "date": "2020-01-01"}, {"region": "Europe", "date": "2020-01-02"}], index=["strain1", "strain2"])
>>> sequence_index = pd.DataFrame([{"strain": "strain1", "ACGT": 28000}]).set_index("strain")
>>> filter_by_sequence_index(metadata, sequence_index)
{'strain1'}
- augur.filter.include_exclude_rules.force_include_strains(metadata, include_file)
Include strains in the given text file from the given metadata.
- Parameters:
metadata (pandas.DataFrame) – Metadata indexed by strain name
include_file (str) – Filename with strain names to include from the given metadata
- Return type:
set of str
Examples
>>> import os
>>> from tempfile import NamedTemporaryFile
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> with NamedTemporaryFile(delete=False) as include_file:
...     characters_written = include_file.write(b'strain1')
>>> force_include_strains(metadata, include_file.name)
{'strain1'}
>>> os.unlink(include_file.name)
- augur.filter.include_exclude_rules.force_include_where(metadata, include_where)
Include all strains from the given metadata that match the given query.
Unlike pandas query syntax, inclusion queries should follow the pattern of “property=value” or “property!=value”. Additionally, this filter treats all values like lowercase strings, so we convert all values to strings first and then lowercase them before testing the given query.
- Parameters:
metadata (pandas.DataFrame) – Metadata indexed by strain name
include_where (str) – Filter query used to include strains
- Return type:
set of str
Examples
>>> metadata = pd.DataFrame([{"region": "Africa"}, {"region": "Europe"}], index=["strain1", "strain2"])
>>> force_include_where(metadata, "region!=Europe")
{'strain1'}
>>> force_include_where(metadata, "region=Europe")
{'strain2'}
>>> force_include_where(metadata, "region=europe")
{'strain2'}
If the column referenced in the given query does not exist, skip the filter; no strains are force-included.
>>> force_include_where(metadata, "missing_column=value")
set()
- augur.filter.include_exclude_rules.parse_filter_query(query)
Parse an augur filter-style query and return the corresponding column, operator, and value for the query.
- Parameters:
query (str) – augur filter-style query following the pattern of “property=value” or “property!=value”
- Returns:
str – Name of column to query
callable – Operator function to test equality or non-equality of values
str – Value of column to query
Examples
>>> parse_filter_query("property=value")
('property', <built-in function eq>, 'value')
>>> parse_filter_query("property!=value")
('property', <built-in function ne>, 'value')
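A query of this shape can be split with a single regular expression and mapped onto the functions in the standard `operator` module, which matches the `<built-in function eq>` and `<built-in function ne>` values shown above. The pattern, the name `sketch_parse_filter_query`, and the choice to raise `ValueError` on malformed input are assumptions made for this sketch.

```python
import operator
import re

# Column name (no "!" or "="), then "=" or "!=", then the value.
QUERY_PATTERN = re.compile(r"(?P<column>[^!=]+)(?P<op>!?=)(?P<value>.*)")


def sketch_parse_filter_query(query):
    """Split "property=value" or "property!=value" into (column, op, value)."""
    match = QUERY_PATTERN.fullmatch(query)
    if match is None:
        raise ValueError(
            f"Expected 'property=value' or 'property!=value', got {query!r}"
        )
    op = operator.ne if match["op"] == "!=" else operator.eq
    return match["column"], op, match["value"]


equal = sketch_parse_filter_query("property=value")
not_equal = sketch_parse_filter_query("property!=value")
```

Returning the operator as a callable lets the caller apply the same comparison code to both forms of query.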
- augur.filter.include_exclude_rules.skip_group_by_with_ambiguous_day(metadata, date_column)
Alias to filter_by_ambiguous_date for day. This is to have a named function available for the filter reason.
- augur.filter.include_exclude_rules.skip_group_by_with_ambiguous_month(metadata, date_column)
Alias to filter_by_ambiguous_date for month. This is to have a named function available for the filter reason.
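The value of a named alias shows up when the function name is reported as the reason a strain was filtered, as with the `'filter'` field in the apply_filters examples. The sketch below is illustrative only. `filter_by_level`, both wrappers, and the toy records are hypothetical and much simpler than the real date logic.

```python
def filter_by_level(records, level):
    """Hypothetical generic filter: keep names not flagged at the given level."""
    return {name for name, flags in records.items() if level not in flags}


def skip_ambiguous_day(records):
    # Thin named wrapper: same behaviour, but a distinct __name__ for reports.
    return filter_by_level(records, level="day")


def skip_ambiguous_month(records):
    # A second wrapper so that month-level skips get their own reason.
    return filter_by_level(records, level="month")


records = {"strain1": {"day"}, "strain2": set()}
reasons = [function.__name__ for function in (skip_ambiguous_day, skip_ambiguous_month)]
kept_by_day = skip_ambiguous_day(records)
kept_by_month = skip_ambiguous_month(records)
```

Calling the generic function directly with different arguments would report the same name for both cases, so the wrappers make the two reasons distinguishable.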