Running an ingest workflow

This tutorial uses the Nextstrain CLI to help you get started running ingest workflows. Ingest workflows download public data from NCBI and output ingest datasets, which include curated metadata and sequences that can be used as input for phylogenetic or Nextclade workflows.

Note

You only need to run an ingest workflow if you do not want to use the data files already publicly hosted by Nextstrain. Individual pathogen repositories include documentation that links to their data files.

In this tutorial, you will run the ingest workflow of our Zika repository and view the outputs on your computer. By the end, you will have a basic understanding of how to run ingest workflows for other pathogens and a foundation for customizing them.

Table of Contents

Prerequisites
Download the Zika repository
Run the default workflow
Configuring the ingest workflow
- Inspecting the uncurated metadata
- Updating the workflow config
Advanced usage: Customizing the ingest workflow
Next steps

Prerequisites 

Install Nextstrain. These instructions will install all of the software you need to complete this tutorial.

Download the Zika repository 

All pathogen ingest workflows are stored in pathogen repositories (version-controlled folders) to track changes over time. Download the Zika repository.

$ git clone https://github.com/nextstrain/zika
Cloning into 'zika'...
[...more output...]

When it’s done, you’ll have a new directory called zika/.

Run the default workflow 

The Zika ingest workflow uses the NCBI Datasets CLI tools to download public data, then uses augur curate and other data manipulation tools to curate the downloaded data into a format suitable for phylogenetic workflows.

Change directory to the Zika pathogen repository downloaded in the previous step.

$ cd zika

Run the default ingest workflow with the Nextstrain CLI.

$ nextstrain build ingest
Using profile profiles/default and workflow specific profile profiles/default for setting default command line arguments.
Building DAG of jobs…
[...a lot of output...]

This should take just a few minutes to complete. There should be two final output files:

ingest/results/metadata.tsv
ingest/results/sequences.fasta

The output files should have the same data formats as the public data files hosted by Nextstrain, available at:

Your results may have additional records depending on whether new data has been released since the public data files were last uploaded.
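
As a quick sanity check on the two output files, you could count the records in each. The helper functions below are a minimal sketch and not part of the Nextstrain tooling; the names count_tsv_records and count_fasta_records are made up for this example, and the sketch assumes a TSV with a single header line and a FASTA file with one ">" line per sequence.

```shell
# Hypothetical helpers for illustration only; they rely on standard
# POSIX tools (tail, wc, tr, grep) and are not shipped with Nextstrain.

# Count data rows in a TSV file (every line after the header line).
count_tsv_records() {
  tail -n +2 "$1" | wc -l | tr -d ' '
}

# Count sequences in a FASTA file (lines that start with ">").
# grep exits non-zero when there are no matches but still prints 0,
# so "|| true" keeps the function usable under "set -e".
count_fasta_records() {
  grep -c '^>' "$1" || true
}
```

From the zika/ directory, you would call them as count_tsv_records ingest/results/metadata.tsv and count_fasta_records ingest/results/sequences.fasta. If the workflow writes one sequence per metadata record, which is an assumption here, the two numbers should match.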

Configuring the ingest workflow 

Now that you’ve seen the default outputs of the ingest workflow, you can try configuring the ingest workflow to change the outputs.

Inspecting the uncurated metadata 

To decide what changes you would like to make to the workflow, you can download and inspect the uncurated NCBI Datasets data yourself.

Hint

These commands are very similar to the commands run by the ingest workflow, with some minor differences. The ingest workflow restricts the columns to those defined in config["ncbi_datasets_fields"] and uses the more computer-friendly “Mnemonic” of each of the NCBI Datasets’ available fields as the header name.

Enter an interactive Nextstrain shell to be able to run the NCBI Datasets CLI commands without installing them separately.

$ nextstrain shell .

Create the ingest/data directory if it doesn’t already exist.

$ mkdir -p ingest/data

Download the dataset with the pathogen NCBI taxonomy ID.

$ datasets download virus genome taxon <taxon-id> \
    --filename ingest/data/ncbi_dataset.zip

Extract and format the metadata as a TSV file for easy inspection.

$ dataformat tsv virus-genome \
    --package ingest/data/ncbi_dataset.zip \
    > ingest/data/raw_metadata.tsv

Exit the Nextstrain shell to return to your usual shell environment.

$ exit

The produced ingest/data/raw_metadata.tsv will contain all of the fields available from NCBI Datasets.
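
Because the raw file can be very wide, it may help to list its column names one per line before opening it. The function below is a small sketch using only head, tr, and awk; the name list_tsv_columns is hypothetical and not part of any Nextstrain or NCBI tool. As the hint above suggests, the header names in this file may differ from the mnemonics used in the workflow config.

```shell
# Hypothetical helper for illustration only.
# Print the header of a TSV file as a numbered list, one column per line.
list_tsv_columns() {
  head -n 1 "$1" | tr '\t' '\n' | awk '{ print NR ": " $0 }'
}
```

For example, list_tsv_columns ingest/data/raw_metadata.tsv prints each column name prefixed by its position in the file.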

Updating the workflow config 

We’ll walk through an example custom config that includes an additional column in the curated output. For example, examining the raw NCBI metadata shows us that virus-name is an NCBI Datasets field that is not currently downloaded by the default Zika ingest workflow. If you wanted this field to be included in your outputs, you could perform the following steps.

Create a new build config directory ingest/build-configs/tutorial/.

$ mkdir ingest/build-configs/tutorial

Copy the default config to ingest/build-configs/tutorial/config.yaml.

$ cp ingest/defaults/config.yaml ingest/build-configs/tutorial/config.yaml

Modify the config parameters within your new custom config ingest/build-configs/tutorial/config.yaml.

Add virus-name to the ncbi_datasets_fields to make the workflow parse the column from the downloaded NCBI data.
Update the curate.field_map with an entry for the new field to match the underscore naming scheme of column names.
```
curate:
  field_map:
    virus-name: virus_name
```
Add virus_name to the curate.metadata_columns to configure the workflow to include the new column in the final output file.
(Optional) Remove any other config parameters that you are not modifying.

Note

Config parameters that are dictionaries will merge with the parameters defined in ingest/defaults/config.yaml while all other types will overwrite the default. See Snakemake documentation for more details on how configuration files work.

All config parameters available are listed in the ingest/defaults/config.yaml file. Any of the config parameters can be overridden in a custom config file.

Run the ingest workflow again with the custom config file.

$ nextstrain build ingest --configfile build-configs/tutorial/config.yaml --forceall
Using profile profiles/default and workflow specific profile profiles/default for setting default command line arguments.
Config file defaults/config.yaml is extended by additional config specified via the command line.
Building DAG of jobs…
[...a lot of output...]

Inspect the new ingest/results/metadata.tsv to see that it now includes the additional virus_name column.
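
One way to confirm the new column without opening the file is to test the header line directly. This is a minimal sketch that uses only awk; the function name tsv_has_column is made up for this example.

```shell
# Hypothetical helper for illustration only.
# Succeed (exit status 0) if the TSV header contains a column with exactly
# the given name, otherwise fail (exit status 1).
# Usage: tsv_has_column FILE COLUMN_NAME
tsv_has_column() {
  awk -F '\t' -v name="$2" '
    # Only the header line is examined; "exit" jumps straight to END.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) found = 1; exit }
    END { exit found ? 0 : 1 }
  ' "$1"
}
```

For example, tsv_has_column ingest/results/metadata.tsv virus_name && echo "virus_name is present" prints the message only when the column exists.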

Advanced usage: Customizing the ingest workflow 

Note

This section of the tutorial requires an understanding of Snakemake workflows.

In addition to configuring the ingest workflow, it is also possible to extend the ingest workflow with your own custom steps. We’ll walk through an example customization that joins additional metadata to the public data that you’ve curated in the previous steps.

Create an additional metadata file ingest/build-configs/tutorial/additional-metadata.tsv with tab-separated columns:

genbank_accession    column_A    column_B    column_C
AF013415    AAAAA    BBBBB    CCCCC
AF372422    AAAAA    BBBBB    CCCCC
AY326412    AAAAA    BBBBB    CCCCC
AY632535    AAAAA    BBBBB    CCCCC
EU303241    AAAAA    BBBBB    CCCCC
EU074027    AAAAA    BBBBB    CCCCC
EU545988    AAAAA    BBBBB    CCCCC
NC_012532    AAAAA    BBBBB    CCCCC
DQ859059    AAAAA    BBBBB    CCCCC
JN860885    AAAAA    BBBBB    CCCCC

Create a new rules file ingest/build-configs/tutorial/merge-metadata.smk:

rule merge_metadata:
  input:
    metadata="results/metadata.tsv",
    additional_metadata="build-configs/tutorial/additional-metadata.tsv",
  output:
    merged_metadata="results/merged-metadata.tsv"
  shell:
    """
    tsv-join -H \
      --filter-file {input.additional_metadata} \
      --key-fields "genbank_accession" \
      --append-fields "*" \
      --write-all "?" \
      {input.metadata} > {output.merged_metadata}
    """

This rule uses tsv-join to merge the additional metadata with the metadata output from the ingest workflow. The records will be merged using the genbank_accession column and all fields from the additional-metadata.tsv file will be appended to the metadata. Any record in the metadata.tsv that does not have a matching record in the additional-metadata.tsv will have a default ? value in the new columns.
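
To make the join behavior described above concrete, here is a rough awk imitation of it. This is not tsv-join and not what the workflow runs; it is only a sketch for reasoning about the expected output. The function name left_join_tsv is hypothetical, and the sketch assumes that both files have a header line, that the key column is not repeated among the appended columns, and that the additional metadata file is not empty.

```shell
# Illustrative left join in awk, mimicking the behavior described above.
# Hypothetical helper; the real rule uses tsv-join instead.
# Usage: left_join_tsv KEY_COLUMN METADATA_TSV ADDITIONAL_TSV
left_join_tsv() {
  awk -F '\t' -v OFS='\t' -v key="$1" '
    # First file read (additional metadata): index rows by the key column.
    NR == FNR {
      if (FNR == 1) {
        for (i = 1; i <= NF; i++) if ($i == key) kcol = i
        n_extra = NF
      }
      extra = ""
      for (i = 1; i <= NF; i++) if (i != kcol) extra = extra OFS $i
      if (FNR == 1) header = extra; else rows[$kcol] = extra
      next
    }
    # Second file read (metadata) header: find the key column, build the
    # "?" filler for unmatched records, and extend the header.
    FNR == 1 {
      for (i = 1; i <= NF; i++) if ($i == key) mcol = i
      fill = ""
      for (i = 1; i < n_extra; i++) fill = fill OFS "?"
      print $0 header
      next
    }
    # Every metadata record is kept; matched ones get the extra values.
    { print $0 (($mcol in rows) ? rows[$mcol] : fill) }
  ' "$3" "$2"
}
```

Calling left_join_tsv genbank_accession metadata.tsv additional-metadata.tsv on small test files shows the intended shape of the result: every metadata record is kept, matched records gain the extra values, and unmatched records gain ? in each new column.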

Add the following to the custom config file ingest/build-configs/tutorial/config.yaml:

custom_rules:
  - build-configs/tutorial/merge-metadata.smk

The custom_rules config tells the ingest workflow to include your custom rules so that you can run them as part of the workflow.

Run the ingest workflow again with the customized rule.

$ nextstrain build ingest merge_metadata --configfile build-configs/tutorial/config.yaml
Using profile profiles/default and workflow specific profile profiles/default for setting default command line arguments.
Config file defaults/config.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
[...a lot of output...]

Inspect the ingest/results/merged-metadata.tsv file to see that it includes the additional columns column_A, column_B, and column_C. The records with a genbank_accession listed in the additional-metadata.tsv file should have the placeholder data in the new columns, while all other records should have the default ? value.

Next steps 

Run the Zika phylogenetic workflow with the newly ingested data as input:
```
$ mv ingest/results/* phylogenetic/data/
$ nextstrain build phylogenetic
```
If you’ve customized the ingest workflow then you may need to modify the phylogenetic workflow to use the new ingested data file. We are planning to write another tutorial to cover other modifications to your phylogenetic workflow.
Learn how to create an ingest workflow