augur.io.metadata
- augur.io.metadata.read_metadata(metadata_file, id_columns=('strain', 'name'), chunk_size=None)
Read metadata from a given filename into a pandas DataFrame or TextFileReader object.
- Parameters:
metadata_file (str) – Path to a CSV or TSV metadata file
id_columns (list[str] or tuple[str], optional) – Possible id column names to check for, in order of preference
chunk_size (int, optional) – Number of rows to read at a time; when set, returns a TextFileReader for iterating through the metadata in fixed-size chunks instead of a single DataFrame
- Return type:
pandas.DataFrame or pandas.io.parsers.TextFileReader
- Raises:
KeyError – When the metadata file does not have any valid index columns.
Examples
For standard use, request a metadata file and get a pandas DataFrame.
>>> read_metadata("tests/functional/filter/data/metadata.tsv").index.values[0]
'COL/FLR_00024/2015'
Requesting an index column that doesn't exist should produce an error.
>>> read_metadata("tests/functional/filter/data/metadata.tsv", id_columns=("Virus name",))
Traceback (most recent call last):
  ...
Exception: None of the possible id columns (('Virus name',)) were found in the metadata's columns ('strain', 'virus', 'accession', 'date', 'region', 'country', 'division', 'city', 'db', 'segment', 'authors', 'url', 'title', 'journal', 'paper_url')
We also allow iterating through metadata in fixed chunk sizes.
>>> for chunk in read_metadata("tests/functional/filter/data/metadata.tsv", chunk_size=5):
...     print(chunk.shape)
...
(5, 14)
(5, 14)
(2, 14)
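Since each chunk is a regular pandas DataFrame, chunked reads can be combined with ordinary pandas operations. A minimal sketch, assuming a hypothetical metadata path and a region column like the one in the test file above (the path and values are illustrative, so no output is shown):
>>> import pandas as pd
>>> # Hypothetical path; stream a large metadata file in chunks and keep only
>>> # rows from one region before concatenating the per-chunk results.
>>> chunks = read_metadata("data/metadata.tsv", chunk_size=10000)
>>> combined = pd.concat(chunk[chunk["region"] == "South America"] for chunk in chunks)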
- augur.io.metadata.read_metadata_with_sequences(metadata, fasta, seq_id_column, seq_field='sequence', unmatched_reporting=DataErrorMethod.ERROR_FIRST, duplicate_reporting=DataErrorMethod.ERROR_FIRST)
Read rows from a metadata file and yield each row as a single dict updated with its corresponding sequence from the FASTA file. Metadata records are matched with sequences using the sequence id provided in the seq_id_column. To ensure that the sequences can be matched with the metadata, the FASTA headers must contain the matching sequence id. The FASTA headers may include additional description parts after the id, but these are not used to match the metadata. A usage sketch follows the parameter list below.
Will report unmatched records if requested via unmatched_reporting. Note the ERROR_FIRST method will raise an error at the first unmatched metadata record but not for an unmatched sequence record because we can only check for unmatched sequences after exhausting the metadata generator.
Will report duplicate records if requested via duplicate_reporting.
Reads the fasta file with pyfastx.Fasta, which creates an index for the file to allow random access of sequences via the sequence id. Will remove any existing index file named <fasta>.fxi to force the rebuilding of the index so that there's no chance of using stale cached indexes. See pyfastx docs for more details: https://pyfastx.readthedocs.io/en/latest/usage.html#fasta
- Parameters:
metadata (str) – Path to a CSV or TSV metadata file
fasta (str) – Path to a plain or gzipped FASTA file
seq_id_column (str) – The column in the metadata file that contains the sequence id for matching sequences
seq_field (str, optional) – The field name to use for the sequence in the updated record
unmatched_reporting (DataErrorMethod, optional) – How unmatched records should be reported
duplicate_reporting (DataErrorMethod, optional) – How duplicate records should be reported
- Yields:
dict – The parsed metadata record with the sequence
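For example, a minimal usage sketch with hypothetical file paths (the strain id column matches the default id columns used elsewhere in this module):
>>> # Hypothetical paths; each yielded record is the metadata row as a dict,
>>> # with its matching sequence stored under the default "sequence" field.
>>> records = read_metadata_with_sequences("data/metadata.tsv", "data/sequences.fasta", seq_id_column="strain")
>>> for record in records:
...     print(record["strain"], len(record["sequence"]))
Because each record is a plain dict, the stream can be filtered or transformed before being passed on, for example to write_records_to_tsv below.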
- augur.io.metadata.read_table_to_dict(table, duplicate_reporting=DataErrorMethod.ERROR_FIRST, id_column=None)
Read rows from a table file and yield each row as a single dict.
If requested via duplicate_reporting, duplicate records are reported based on the id_column after the generator has been exhausted; a usage sketch follows the Raises list below.
- Parameters:
table (str) – Path to a CSV or TSV file or IO buffer
duplicate_reporting (DataErrorMethod, optional) – How duplicate records should be reported
id_column (str, optional) – Name of the column that contains the record identifier used for reporting duplicates. Uses the first column of the metadata if not provided.
- Yields:
dict – The parsed row as a single record
- Raises:
AugurError – Raised for any of the following reasons:
1. There are parsing errors from the csv standard library
2. The provided id_column does not exist in the metadata
3. The duplicate_reporting method is set to ERROR_FIRST or ERROR_ALL and duplicate(s) are found
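As a minimal sketch, the table can also be an in-memory buffer; with two illustrative rows, this is expected to yield one dict per row:
>>> from io import StringIO
>>> table = StringIO("strain\tdate\nsampleA\t2020-01-01\nsampleB\t2020-02-15\n")
>>> for row in read_table_to_dict(table, id_column="strain"):
...     print(row["strain"], row["date"])
sampleA 2020-01-01
sampleB 2020-02-15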
- augur.io.metadata.write_records_to_tsv(records, output_file)
Write each record from records as a single row to a TSV output_file. The keys of the first record are used as the output column names; extra keys in later records are ignored, and records missing keys are written with an empty string for those columns. See the example below.
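A minimal sketch, assuming a hypothetical output path; the second record is missing the date key and the third carries an extra key not present in the first record:
>>> # Hypothetical records and output path for illustration only.
>>> records = iter([
...     {"strain": "sampleA", "date": "2020-01-01"},
...     {"strain": "sampleB"},
...     {"strain": "sampleC", "date": "2020-02-15", "notes": "ignored"},
... ])
>>> write_records_to_tsv(records, "output.tsv")
Per the behaviour described above, the resulting file has strain and date columns taken from the first record, sampleB's missing date is written as an empty string, and sampleC's extra notes key is dropped.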