Creating an ingest workflow

This tutorial dissects the ingest workflow of the pathogen-repo-guide and the decisions needed to create an ingest workflow for a new pathogen.

Note

You only need to create an ingest workflow if you do not want to use an existing pathogen ingest workflow maintained by Nextstrain.

Table of Contents

Prerequisites 

Install Nextstrain.
Run through the Running an ingest workflow tutorial. This will verify your installation and ensure that you are able to run an ingest workflow.

Additionally, to follow this tutorial, you will need:

An understanding of Snakemake workflows.
Pathogen-specific knowledge (e.g. WHO naming scheme) to help with decisions on how to set up the ingest workflow

The Nextstrain pathogen-repo-guide can be used for setting up a pathogen repository to hold the files necessary to run and maintain pathogen workflows. This tutorial will only focus on using the guide to set up the ingest workflow.

Go to the Nextstrain pathogen-repo-guide repository
Follow the GitHub guide for creating a repository from a template.
Follow the GitHub guide to download the new repository.
Change directory to your new pathogen repository

$ cd <new-pathogen-repository>

Decide on data source 

The first step for creating an ingest workflow is to decide on the data source for your pathogen’s data. The pathogen-repo-guide only focuses on downloading public data from NCBI, using the rules defined in ingest/rules/fetch_from_ncbi.smk.

Note

If your pathogen does not have sequences on NCBI, then you will need to explore other data sources that are not covered in this tutorial.

NCBI Datasets 

By default, the pathogen-repo-guide is set to use the NCBI Datasets CLI tool to download viral sequences using a provided NCBI taxonomy ID. This is the simplest route for setting up an ingest workflow, but it is limited to a standard set of fields that is parsed by NCBI Datasets.

You can decide whether NCBI Datasets include sufficient data for your pathogen by inspecting the uncurated data from NCBI Datasets CLI.

Add your pathogen’s NCBI taxonomy ID to the ncbi_taxon_id parameter in the ingest/defaults/config.yaml config file.
Dump the uncurated metadata by running

$ nextstrain build ingest dump_ncbi_dataset_report

Inspect the generated file ingest/data/ncbi_dataset_report_raw.tsv
If there are other fields in the raw file that you would like to include in the workflow, you can add them to the ncbi_datasets_fields parameter

If the data looks sufficient for your pathogen, then skip to the Curation steps.

NCBI Entrez 

If your pathogen requires data from other fields not parsed by NCBI Datasets, then you will need to use the NCBI Entrez tool to download all available data in a GenBank file.

Add an Entrez search term to the entrez_search_term parameter in the ingest/defaults/config.yaml config file.
Create a custom script to parse the GenBank file into a flat JSON Lines/NDJSON format. (We may provide an example script in the future, but this is currently not available.)
Edit the parse_genbank_to_ndjson rule in ingest/rules/fetch_from_ncbi.smk to use the custom script.
Switch the Snakemake ruleorder within the ingest/rules/fetch_from_ncbi.smk file.

ruleorder: format_ncbi_datasets_ndjson < parse_genbank_to_ndjson

Make sure the field_map parameters in the config file are using the field names of your custom NDJSON output.

Curation steps 

After the public data is downloaded, the next part of the workflow runs a pipeline of data curation commands and scripts to format the metadata and sequences.

The long term goal is to build out the augur curate suite of commands to include all of the custom curation steps. For now, we’ve bundled custom scripts into the ingest repository that is then vendored in the pathogen-repo-guide using git-subrepo. Please do not edit the vendored scripts in ingest/vendored directly. If you run into issues or encounter bugs with the vendored scripts, please make an issue in the ingest repository. Once the bug has been fixed in the original source code, you can follow the instructions to update the vendored scripts.

We highly encourage you to go through the commands and custom scripts used in the curate rule within ingest/rules/curate.smk to gain a deeper understanding of how they work. We will give a brief overview of each step and their relevant config parameters defined in ingest/defaults/config.yaml to help you get started.

Transform field names 

The ingest/vendored/transform-field-names script will rename the fields in the NDJSON records.

Note

This is the first step of the pipeline so any subsequent references to field names should use the new field names.

Config parameters

curate.field_map
- A dictionary where the key is the original field name and value is the new field name
- The default dictionary uses the original field names from NCBI Datasets and transforms them to the standard Nextstrain metadata fields.

Normalize strings 

The augur curate normalize-strings command will normalize string values in the NDJSON records for predictable string comparisons. Currently, there are no config parameters for this command.

Transform strain names 

The ingest/vendored/transform-strain-names script will verify the strain field values match an expected pattern.

Config parameters

curate.strain_regex
- Python regular expression pattern the strain names must match
- The default pattern (^.+$) accepts any non-empty string because we do not have a clear standard for strain names across pathogens
curate.strain_backup_fields
- List of other NDJSON fields to use as strain name if the strain fails to match expected pattern
- The default list uses the GenBank accession field as a stable back up field for messy strain fields.

Format dates 

The augur curate format-dates command will format date fields to ISO 8601 dates (YYYY-MM-DD), where incomplete dates are masked with ‘XX’ (e.g. 2023 -> 2023-XX-XX).

Config parameters

curate.date_fields
- List of NDJSON date fields to be formatted
- The default list includes the standard date fields that are expected from NCBI records
curate.expected_date_formats
- List of expected date formats in the provided date fields
- The default list includes the date formats that are expected from NCBI records

Transform GenBank location 

The ingest/vendored/transform-genbank-location script will try to parse locations in NDJSON records according to GenBank country qualifier. It parses the location field into three fields:

country
division
location

Currently, there are no config parameters for this script.

Titlecase 

The augur curate titlecase command will make the first letter of every word uppercase in provided string fields.

Config parameters

curate.titlecase.fields
- List of NDJSON fields to titlecase
- The default list includes all of the geolocation fields from NCBI records (after running transform-genbank-location)
curate.titlecase.abbreviations
- List of strings to keep as all uppercase
- The default list includes the country “USA” as an example
curate.titlecase.articles
- List of strings to keep as all lowercase
- The default list includes articles (e.g., ‘and’, ‘the’, ‘of’, etc) that we’ve encountered in past ingest pipelines

Transform authors 

The ingest/vendored/transform-authors script will abbreviate the authors list in the NDJSON records to <first author> et al..

Config parameters

curate.authors_field
- The NDJSON field that contains the authors list
- The default value uses the field expected from NCBI records
curate.authors_default_value
- The default string to use if the authors list is empty
- The default value ? will allow you to easily filter for records without authors.
curate.abbr_authors_field
- The field name to use for the new abbreviated authors field.
- If none are provided, the original authors field will be replaced with the abbreviated authors.
- The default field is abbr_authors so you can compare the original and abbreviated author values.

Apply geolocation rules 

The ingest/vendored/apply-geolocation-rules script will apply geolocation standardizations across all records.

Config parameters

curate.geolocation_rules_url
- The URL for a public set of geolocation rules
- The default URL points to the Nextstrain ncov-ingest geolocation rules, which is currently the most complete set of geolocation rules.
curate.local_geolocation_rules
- A path to a local set of geolocation rules used to override the general rules
- The default points to the empty file ingest/defaults/geolocation_rules.tsv where you can add your pathogen specific rules

Geolocation rules

Geolocation rules are defined in a TSV file with the format

region/country/division/location<\t>region/country/division/location

The first set of locations are the expected geolocations that are in the metadata and the second set of geolocations after the tab are the standard geolocations that will be applied to the metadata. Each geo resolution (region, country, division, location) is expected to be a field in the NDJSON. By using the region/country/division/location hierarchy, we ensure that locations with the same name (e.g., two cities with the same name but in different countries) are treated differently based on their full hierarchy. If there are rules that can be applied across multiple locations, then a wildcard (*) can be used instead of a specific value.

Let’s say you have the following locations in your NDJSON

{“region”: “North America”, “country”: “United States”, “division”: “New York”, “location”: “Buffalo”}
{“region”: “North America”, “country”: “United States”, “division”: “New York”, “location”: “New York”}

And you provide these geolocation rules

North America/United States/New York/New York               North America/United States/New York/New York City
North America/United States/New York/*      North America/United States/New York State/*
North America/United States/*/*     North America/USA/*/*

The first rule looks for the specific hierarchy to correct the location from “New York” to “New York City”. The second rule has a wildcard as the location, so it will correct all applicable divisions from “New York” to “New York State”. The third rule has wildcards for both division and location, so it will correct all applicable countries from “United States” to “USA”.

Running through the ingest/vendored/apply-geolocation-rules script should produce the following

{“region”: “North America”, “country”: “USA”, “division”: “New York State”, “location”: “Buffalo”}
{“region”: “North America”, “country”: “USA”, “division”: “New York State”, “location”: “New York City”}

Merge user metadata 

The ingest/vendored/merge-user-metadata script merges user curated annotations with the NDJSON records, with the user curations overwriting the existing fields.

Config parameters

curate.annotations
- A path to a file of user annotations
- The default points to the empty file ingest/defaults/annotations.tsv where you can add your pathogen-specific annotations
curate.annotations_id
- The NDJSON field that has the ID used to match records to annotations
- The default value uses the GenBank accession since they are guaranteed to be unique

User annotations

The user annotations are defined in a TSV file with the format

id<\t>field<\t>value

The id is used to match the NDJSON records. The field is the field you are trying to overwrite or add to the NDJSON record. The value is the value you are trying to add to the NDJSON record.

Let’s say you have the following NDJSON records

{“accession”: “AAAAA”, “country”: “United States”, “division”: “New York”, “location”: “Buffalo”}
{“accession”: “BBBBB”, “country”: “United States”, “division”: “New York”, “location”: “Buffalo”}

And you provide these user annotations

AAAAA       age     10
BBBBB       age     12
BBBBB       location        Niagara Falls

The first two annotations add the age field to the records and the third annotation overwrites the existing location field for the record BBBBB.

Running through the ingest/vendored/merge-user-metadata script should produce the following:

{“accession”: “AAAAA”, “country”: “United States”, “division”: “New York”, “location”: “Buffalo”, “age”: 10}
{“accession”: “BBBBB”, “country”: “United States”, “division”: “New York”, “location”: “Niagara Falls”, “age”: 12}

Passthru 

The augur curate passthru is being used to split the NDJSON records into the metadata TSV and sequences FASTA files.

Config parameters

curate.output_id_field
- The NDJSON field to use as the sequence identifiers in the FASTA file
- The default value uses the GenBank accession since they are guaranteed to be unique
curate.output_sequence_field
- The NDJSON field that contains the genomic sequence
- The default value uses sequence which is the field name we use for NCBI Datasets.

Subset metadata 

Finally we use the tsv-select command to subset the metadata to a list of metadata columns.

Config parameters

curate.metadata_columns
- A list of metadata columns to include in the final output metadata TSV
- The columns will be output in the order specified

Advanced usage 

The default ingest workflow of the pathogen-repo-guide is generalized to be able to work with any pathogen, but this means you will need to tailor the ingest workflow for pathogen specific steps.

Add custom curation steps 

The curation pipeline is designed to be extremely customizable, with each curation step reading NDJSON records from stdin and outputing modified NDJSON records to stdout. If you write a custom script that follows the same pattern, you can add your script as another step anywhere in the curation pipeline before the final augur curate passthru command.

A typical pathogen-specific step for curation is the standardization of strain names since pathogens usually have different naming conventions (e.g. influenza vs measles). For example, we’ve added a step in the curation pipeline to normalize the strain names for the Zika ingest workflow.

1. We added a custom Python script to the Zika repository which reads NDJSON records from stdin, edits the strain field per record, then outputs the modified records to stdout.

2. The script was added to the curation pipeline before the ingest/vendored/merge-user-metadata step to still allow user annotations to override the modified strain names if necessary.

Nextclade as part of ingest 

Nextstrain is pushing to standardize our core ingest workflows to include Nextclade runs, which allows us to merge clade/lineage designations and QC metrics with the metadata in our publicly hosted data. However, this is not possible until you have already created a Nextclade dataset for your pathogen.

Here’s our typical process for adding Nextclade to ingest workflows for new pathogens

Create an ingest workflow without Nextclade.
Run the ingest workflow to generate a set of curated metadata and sequences.
Use the curated metadata and sequences as input to generate a reference tree.
Create a Nextclade dataset by following the Nextclade dataset creation guide.
Update the ingest workflow to run Nextclade using the new Nextclade dataset.

If your pathogen already has a Nextclade dataset, you can use the pathogen-repo-guide’s ingest/defaults/nextclade_config.yaml config file to include the Nextclade rules from ingest/rules/nextclade.smk as part of the ingest workflow.

Add your Nextclade dataset name to the nextclade.dataset_name parameter
Run the ingest workflow with the additional config file

nextstrain build ingest --configfile defaults/nextclade_config.yaml

Example ingest workflows 

Although we strive to keep Nextstrain core ingest workflows standardized, we cannot guarantee that every pathogen ingest workflow will be kept up-to-date.

We recommend using the zika ingest workflow and the mpox ingest workflow as example workflows that demonstrate our latest developments.

Next steps 

Learn more about augur curate commands
We are planning to write another detailed tutorial for creating a phylogenetic workflow, but until that is ready you can follow the simple phylogenetic workflow tutorial.