Run using custom data

This tutorial builds on the previous tutorial. You will learn how to run the workflow with your own genomic data, using the reference data from the previous tutorial as the genetic context for these new data. Although you will download data from GISAID for this tutorial, you can replace these data with your own local sequences and metadata in your future analyses.

Prerequisites

  1. Run using example data. That tutorial sets up the command-line environment used in this tutorial.

  2. Register for a GISAID account if you do not have one yet. Registration may take a few days. To continue with this tutorial in the meantime, follow one of the alternative data preparation methods in place of Curate data from GISAID.

Setup

If you are not already there, change directory to the ncov directory:

cd ncov

and activate the nextstrain conda environment:

conda activate nextstrain

Curate data from GISAID

We will retrieve 10 sequences from GISAID’s EpiCoV database.

  1. Navigate to GISAID and select Login.

    GISAID login link
  2. Log in to your GISAID account.

    GISAID login
  3. In the top left navigation bar, select EpiCoV then Search.

    GISAID EpiCoV Search
  4. Filter to sequences that pass the following criteria:

    1. Has a complete genome

    2. Has high coverage

    3. Has an exact collection date

    GISAID EpiCoV select first 10 sequences
  5. Select the first 10 sequences.

  6. Select Download in the bottom right of the search results.

  7. Select Input for the Augur pipeline as the download format.

    GISAID EpiCoV download as Input for the Augur pipeline

    Note

    You may see different download options, but it is fine as long as Input for the Augur pipeline is available.

  8. Select Download.

  9. Download or move the .tar file into the ncov/data/ directory.

  10. Extract the archive by opening the downloaded .tar file in your file explorer. It contains a folder prefixed with gisaid_auspice_input_hcov-19_ that holds two files: one ending with .metadata.tsv and another ending with .sequences.fasta.

  11. Rename the files to custom.metadata.tsv and custom.sequences.fasta, respectively.

  12. Move the files up to the ncov/data/ directory.

  13. Delete the empty gisaid_auspice_input_hcov-19_-prefixed folder and the .tar file if it is still there.
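
Steps 9 to 13 can also be done from the command line. The following is a minimal sketch, not part of the ncov workflow: import_gisaid_tar is a hypothetical helper name, and it assumes the archive unpacks to a single gisaid_auspice_input_hcov-19_-prefixed folder containing exactly one metadata file and one sequences file, as described above.

```shell
# Hypothetical helper (not part of ncov): unpack a GISAID download and
# rename its files to the names expected by custom-data.yaml.
# Usage: import_gisaid_tar <downloaded.tar> <data-dir>
import_gisaid_tar() {
  tar_file="$1"
  data_dir="$2"

  # Step 10: extract the archive into the data directory.
  tar -xf "$tar_file" -C "$data_dir" || return 1

  # Locate the extracted folder (assumed to be the only one with this prefix).
  set -- "$data_dir"/gisaid_auspice_input_hcov-19_*/
  extracted="$1"
  [ -d "$extracted" ] || { echo "extracted folder not found" >&2; return 1; }

  # Steps 11-12: rename the two files and move them up into the data directory.
  mv "$extracted"*.metadata.tsv "$data_dir/custom.metadata.tsv" || return 1
  mv "$extracted"*.sequences.fasta "$data_dir/custom.sequences.fasta" || return 1

  # Step 13: delete the now-empty folder and the archive.
  rmdir "$extracted" && rm -f "$tar_file"
}
```

From within the ncov/ directory, this would be called with the path to the downloaded archive (whatever name your browser saved it under) as the first argument and data as the second.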

Hint

Read the full data prep guide for other ways to curate custom data.
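
Before running the workflow, it can help to confirm that both custom files are in place. The sketch below is an illustration only: check_custom_inputs is a hypothetical helper name, and it assumes the metadata file has one header line followed by one row per sequence. It reports the two counts and leaves the comparison to you.

```shell
# Hypothetical helper (not part of ncov): confirm the custom input files
# exist and report how many records each one holds.
# Usage: check_custom_inputs [data-dir]   (defaults to "data")
check_custom_inputs() {
  data_dir="${1:-data}"
  metadata="$data_dir/custom.metadata.tsv"
  sequences="$data_dir/custom.sequences.fasta"

  # Both files must exist and be non-empty.
  for f in "$metadata" "$sequences"; do
    [ -s "$f" ] || { echo "missing or empty: $f" >&2; return 1; }
  done

  # Each FASTA record starts with a ">" header line.
  n_seq=$(grep -c '^>' "$sequences")
  # Metadata rows, assuming the first line is a header.
  n_meta=$(awk 'END { print NR - 1 }' "$metadata")

  echo "sequences: $n_seq"
  echo "metadata rows: $n_meta"
}
```

Run from within ncov/ with no arguments, it checks data/custom.metadata.tsv and data/custom.sequences.fasta. If the two counts differ, the download or renaming steps are worth revisiting.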

Run the workflow

From within the ncov/ directory, run the ncov workflow with a pre-written configuration file, passed using --configfile:

nextstrain build . --configfile ncov-tutorial/custom-data.yaml

Break down the command

The workflow can take several minutes to run. While it is running, you can investigate the contents of custom-data.yaml (comments excluded):

inputs:
  - name: reference_data
    metadata: https://data.nextstrain.org/files/ncov/open/reference/metadata.tsv.xz
    sequences: https://data.nextstrain.org/files/ncov/open/reference/sequences.fasta.xz
  - name: custom_data
    metadata: data/custom.metadata.tsv
    sequences: data/custom.sequences.fasta

refine:
  root: "Wuhan-Hu-1/2019"

builds:
  custom-build:
    title: "Build with custom data and example data"
    subsampling_scheme: all
    auspice_config: ncov-tutorial/auspice-config-custom-data.json

This is the same as the configuration file from the previous tutorial, with some additions:

  1. A second input for the custom data, referencing the metadata and sequences files downloaded from GISAID.

  2. A builds section that defines one output dataset using:

    1. A custom name custom-build, which will be used to create the dataset filename, in this case auspice/ncov_custom-build.json.

    2. A custom title Build with custom data and example data, which will be shown when you visualize the dataset in Auspice.

    3. A pre-defined subsampling scheme, all, that tells the workflow to skip subsampling and use all input data.

    4. An Auspice config file, ncov-tutorial/auspice-config-custom-data.json, that defines parameters for how Auspice should display the dataset produced by the workflow. It has the following contents:

      {
        "colorings": [
          {
            "key": "custom_data",
            "title": "Custom data",
            "type": "categorical"
          }
        ],
        "display_defaults": {
          "color_by": "custom_data"
        }
      }
      

      This JSON tells Auspice to:

      1. Create a new coloring custom_data that reflects a special metadata column generated by the ncov workflow. When there is more than one input, the workflow adds one column per input to the final metadata, with categorical values yes or no indicating whether each sequence came from that input.

      2. Set the default Color By as the new custom_data coloring.

    Note

    Build is a widely used term with various meanings. In the context of the ncov workflow, the builds: section defines output datasets to be generated by the workflow (i.e. “build” a dataset).
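
The yes/no column described above can be tallied directly from a metadata table. The sketch below is a generic illustration: count_column_values is a hypothetical helper name, it expects an uncompressed tab-separated file with a header row, and the location of the workflow's final metadata file is not covered in this tutorial, so the path you pass in is up to you.

```shell
# Hypothetical helper (not part of ncov): count how often each value
# occurs in a named column of a tab-separated file with a header row.
# Usage: count_column_values <file.tsv> <column-name>
count_column_values() {
  awk -F '\t' -v col="$2" '
    NR == 1 {
      # Find the index of the requested column in the header.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      next
    }
    { counts[$idx]++ }
    END { for (v in counts) print v "\t" counts[v] }
  ' "$1" | sort
}
```

Called with custom_data as the column name, it prints one line per value (yes or no) followed by its count.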

Visualize the results

Run this command to start the Auspice server, providing auspice/ as the directory containing output dataset files:

nextstrain view auspice/

Navigate to http://127.0.0.1:4000/ncov/custom-build. The resulting dataset should have a phylogeny similar to that of the previous dataset, with additional sequences:

Phylogenetic tree from the "custom data" tutorial as visualized in Auspice
  1. The custom dataset name custom-build can be seen in the dataset selector, as well as the dataset URL.

  2. The custom dataset title can be seen at the top of the page.

  3. The custom coloring is used by default. You can see which sequences are from the custom data added in this tutorial.

    Note

    You may not see all 10 custom sequences; some can be filtered out by quality checks built into the ncov workflow.
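
The relationship between the dataset file auspice/ncov_custom-build.json and the URL http://127.0.0.1:4000/ncov/custom-build can be sketched as follows. This is an illustration, not code from the Nextstrain tools: dataset_url is a hypothetical helper name, and it assumes the host and port shown in this tutorial and that underscores in the file name correspond to slashes in the URL path.

```shell
# Hypothetical helper: derive the local Auspice URL for a dataset file.
# Assumes the server address used in this tutorial (127.0.0.1:4000).
# Usage: dataset_url <path/to/dataset.json>
dataset_url() {
  # Drop the directory and the .json extension.
  name=$(basename "$1" .json)
  # Underscores in the file name become slashes in the URL path.
  path=$(printf '%s' "$name" | tr '_' '/')
  printf 'http://127.0.0.1:4000/%s\n' "$path"
}

dataset_url auspice/ncov_custom-build.json
# prints http://127.0.0.1:4000/ncov/custom-build
```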