augur parse


Parse delimited fields from FASTA sequence names into a TSV and FASTA file.

usage: augur parse [-h] --sequences SEQUENCES --output-sequences
                   OUTPUT_SEQUENCES --output-metadata OUTPUT_METADATA
                   [--output-id-field OUTPUT_ID_FIELD] --fields FIELDS
                   [FIELDS ...]
                   [--prettify-fields PRETTIFY_FIELDS [PRETTIFY_FIELDS ...]]
                   [--separator SEPARATOR] [--fix-dates {dayfirst,monthfirst}]

Named Arguments

--sequences, -s

sequences in fasta or VCF format

--output-sequences

output sequences file

--output-metadata

output metadata file

--output-id-field

The record field to use as the sequence identifier in the FASTA output. If not provided, this will use the first available of (‘strain’, ‘name’). If none of those are available, this will use the first field in the fasta header.

--fields

fields in fasta header

--prettify-fields

apply string prettifying operations (underscores to spaces, capitalization, etc) to specified metadata fields

--separator

separator of fasta header

Default: '|'

--fix-dates

Possible choices: dayfirst, monthfirst

attempt to parse non-standard dates and output them in standard YYYY-MM-DD format
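The fallback order described for --output-id-field can be sketched in Python as follows. This is only an illustration of the rule as documented above, not augur's own code: the function name choose_id_field is hypothetical, and the sketch does not check whether an explicitly requested field actually exists.

```python
def choose_id_field(fields, output_id_field=None):
    """Pick the field used as the sequence identifier (hypothetical helper)."""
    # An explicitly requested field wins; this sketch performs no validation.
    if output_id_field is not None:
        return output_id_field
    # Otherwise use the first available of 'strain' and 'name'.
    for candidate in ("strain", "name"):
        if candidate in fields:
            return candidate
    # Fall back to the first field in the fasta header.
    return fields[0]


print(choose_id_field(["accession", "name", "date"]))
print(choose_id_field(["accession", "date"]))
```

With these inputs, the first call selects name and the second falls back to accession.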

Example: how to parse metadata from fasta-headers

If you download sequence data from databases like GISAID or fludb, there is often an option to include metadata such as dates in the header of fasta files. The result might, for example, look like this:

>A/Canoas/LACENRS_1793/2015|A|H3N2|07/17/2015||Brazil|Human|KY925125
ATG…
>A/Canoas/LACENRS_773/2015|A|H3N2|05/06/2015||Brazil|Human|KY925599
ATG…
[…]

The fasta header contains information such as influenza lineage, dates (in a non-standard format), country, etc. To turn this metadata into a table, augur has a dedicated command called parse. A Snakemake rule to parse the above file could look like this:

rule parse:
    input:
        sequences = "data/h3n2_ha.fasta"
    output:
        sequences = "results/sequences_h3n2_ha.fasta",
        metadata = "results/metadata_h3n2_ha.tsv"
    params:
        fields = "strain type subtype date season country host accession"
    shell:
        """
        augur parse \
            --sequences {input.sequences} \
            --fields {params.fields} \
            --output-sequences {output.sequences} \
            --output-metadata {output.metadata} \
            --fix-dates monthfirst
        """

Note the additional argument --fix-dates monthfirst. This triggers an attempt to parse these dates and turn them into ISO format (YYYY-MM-DD), assuming that the month precedes the day in the input data. This is a brittle process, so the resulting dates should be spot-checked.
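To make the field mapping and the date ambiguity concrete, here is a minimal Python sketch of the two steps: splitting a header on the separator and rewriting the date. It is not augur's implementation. The helper names parse_header and fix_date are hypothetical, and the sketch assumes slash-separated dates and leaves unparseable values unchanged.

```python
from datetime import datetime

# Field names as passed to --fields in the rule above.
FIELDS = "strain type subtype date season country host accession".split()


def parse_header(header, fields=FIELDS, separator="|"):
    """Map the separator-delimited values of one header onto field names."""
    values = [value.strip() for value in header.lstrip(">").split(separator)]
    return dict(zip(fields, values))


def fix_date(raw, order="monthfirst"):
    """Rewrite a slash-separated date as YYYY-MM-DD, or return it unchanged."""
    # Assumption of this sketch: only MM/DD/YYYY or DD/MM/YYYY inputs are handled.
    fmt = "%m/%d/%Y" if order == "monthfirst" else "%d/%m/%Y"
    try:
        return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
    except ValueError:
        return raw


record = parse_header(
    ">A/Canoas/LACENRS_773/2015|A|H3N2|05/06/2015||Brazil|Human|KY925599"
)
print(record["strain"], record["country"])
print(fix_date(record["date"], "monthfirst"))  # 2015-05-06
print(fix_date(record["date"], "dayfirst"))    # 2015-06-05
```

The date 05/06/2015 parses without error under either ordering, so choosing the wrong one silently produces a valid-looking but incorrect date. This is why the output should be spot-checked.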