normalize-strings

Normalize strings to a Unicode normalization form and strip leading and trailing whitespace.

Strings need to be normalized for predictable string comparisons, especially in cases where strings contain diacritics (see https://unicode.org/faq/normalization.html).
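To make the comparison problem concrete, here is a minimal Python sketch using the standard library's `unicodedata` module. It illustrates the general issue only and is not taken from the command's implementation; the sample strings are made up.

```python
import unicodedata

# "é" as a single precomposed code point (U+00E9).
composed = "caf\u00e9"
# "e" followed by a combining acute accent (U+0065 U+0301).
decomposed = "cafe\u0301"

# The two strings render identically but differ code point by code point.
naive_equal = composed == decomposed

# Once both are normalized to the same form, they compare equal.
normalized_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(naive_equal, normalized_equal)
```

Without normalization, two visually identical values can be treated as different keys when records are matched or deduplicated.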

usage: augur curate normalize-strings [-h] [--metadata METADATA]
                                      [--id-column ID_COLUMN]
                                      [--metadata-delimiters METADATA_DELIMITERS [METADATA_DELIMITERS ...]]
                                      [--fasta FASTA]
                                      [--seq-id-column SEQ_ID_COLUMN]
                                      [--seq-field SEQ_FIELD]
                                      [--unmatched-reporting {error_first,error_all,warn,silent}]
                                      [--duplicate-reporting {error_first,error_all,warn,silent}]
                                      [--output-metadata OUTPUT_METADATA]
                                      [--output-fasta OUTPUT_FASTA]
                                      [--output-id-field OUTPUT_ID_FIELD]
                                      [--output-seq-field OUTPUT_SEQ_FIELD]
                                      [--form {NFC,NFKC,NFD,NFKD}]

INPUTS

Input options shared by all augur curate commands. If no input options are provided, commands will try to read NDJSON records from stdin.

--metadata

Input metadata file. Accepts ‘-’ to read metadata from stdin.

--id-column

Name of the metadata column that contains the record identifier for reporting duplicate records. Uses the first column of the metadata file if not provided. Ignored if also providing a FASTA file input.

--metadata-delimiters

Delimiters to accept when reading a metadata file. Only one delimiter will be inferred.

Default: (‘,’, ‘\t’)
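One way to infer a single delimiter from a set of candidates is Python's `csv.Sniffer`. The sketch below assumes comma and tab as the candidates and uses a made-up metadata sample; whether the command infers the delimiter this way is an assumption, not something this page states.

```python
import csv

# Hypothetical candidate delimiters: comma and tab.
candidates = (",", "\t")

# Made-up tab-separated metadata sample.
sample = (
    "strain\tdate\tcountry\n"
    "A\t2020-01-01\tUSA\n"
    "B\t2020-02-01\tCanada\n"
)

# Sniffer.sniff accepts the candidate delimiters as a single string
# and returns a dialect whose `delimiter` is the one it inferred.
dialect = csv.Sniffer().sniff(sample, delimiters="".join(candidates))
inferred = dialect.delimiter

# Parse the sample with the inferred delimiter.
rows = list(csv.reader(sample.splitlines(), delimiter=inferred))
print(repr(inferred), rows[0])
```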

--fasta

Plain or gzipped FASTA file. Headers can only contain the sequence id used to match a metadata record. Note that an index file will be generated for the FASTA file as <filename>.fasta.fxi.

--seq-id-column

Name of metadata column that contains the sequence id to match sequences in the FASTA file.

--seq-field

The name to use for the sequence field when joining sequences from a FASTA file.

--unmatched-reporting

Possible choices: error_first, error_all, warn, silent

How unmatched records from combined metadata/FASTA input should be reported.

Default: error_first

--duplicate-reporting

Possible choices: error_first, error_all, warn, silent

How duplicate records should be reported.

Default: error_first

OUTPUTS

Output options shared by all augur curate commands. If no output options are provided, commands will output NDJSON records to stdout.
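The default NDJSON-in, NDJSON-out flow can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the command's actual code: the helper names `normalize_record` and `normalize_ndjson` are hypothetical, and the sample record is made up. Only string values are normalized and stripped; other values pass through unchanged.

```python
import io
import json
import unicodedata


def normalize_record(record, form="NFC"):
    """Return a copy of `record` with string values normalized and stripped."""
    return {
        key: unicodedata.normalize(form, value).strip()
        if isinstance(value, str)
        else value
        for key, value in record.items()
    }


def normalize_ndjson(lines, form="NFC"):
    """Yield one normalized NDJSON line per non-empty input line."""
    for line in lines:
        if line.strip():
            record = json.loads(line)
            # ensure_ascii=False keeps non-ASCII characters readable in the output.
            yield json.dumps(normalize_record(record, form), ensure_ascii=False)


# Hypothetical input: one JSON record per line, with a decomposed "é"
# and stray whitespace around the value.
sample = io.StringIO('{"strain": " cafe\\u0301 ", "length": 29903}\n')
output_lines = list(normalize_ndjson(sample))
print(output_lines[0])
```

In a real pipeline the input would come from `sys.stdin` and each output line would be written to `sys.stdout`, so that commands can be chained with pipes.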

--output-metadata

Output metadata TSV file. Accepts ‘-’ to output TSV to stdout.

--output-fasta

Output FASTA file.

--output-id-field

The record field to use as the sequence identifier in the FASTA output.

--output-seq-field

The record field that contains the sequence for the FASTA output. This field will be deleted from the metadata output.

OPTIONAL

--form

Possible choices: NFC, NFKC, NFD, NFKD

Unicode normalization form to use for normalization.

Default: “NFC”
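The four forms differ in whether they compose or decompose characters (C vs. D) and whether they also replace compatibility characters such as ligatures (the K variants). The Python sketch below shows the effect on a made-up string with `unicodedata.normalize`; it illustrates the forms themselves and is not the command's code.

```python
import unicodedata

# Made-up sample: the "ﬁ" ligature (U+FB01), then "anc", then precomposed "é" (U+00E9).
text = "\ufb01anc\u00e9"

# Code point count after normalizing to each form.
lengths = {
    form: len(unicodedata.normalize(form, text))
    for form in ("NFC", "NFKC", "NFD", "NFKD")
}

# NFC keeps both the ligature and the precomposed "é".
# NFKC splits the ligature into "f" + "i" but keeps "é" composed.
# NFD keeps the ligature but splits "é" into "e" + a combining accent.
# NFKD splits both.
for form, length in lengths.items():
    print(form, length)
```

Because the K forms rewrite compatibility characters, they can change how text looks; NFC is the more conservative choice when the goal is only consistent comparison.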