Curate data from GISAID search and downloads
============================================
The following instructions describe how to curate data for a region-specific analysis (e.g., identifying recent introductions into Washington State) using GISAID's “Search” page and curated regional data from the “Downloads” window. Inferences about a sample's origin strongly depend on the composition of your dataset. For example, discrete trait analysis models cannot infer transmission from an origin that is not present in your data. We show how to overcome this issue by adding previously curated contextual sequences from Nextstrain to your region-specific dataset.

.. contents:: Table of Contents
   :local:

Login to GISAID
---------------
Navigate to `GISAID (gisaid.org) <https://gisaid.org/>`__ and select the “Login” link.

.. figure:: ../../images/gisaid-homepage.png
   :alt: GISAID homepage with login link

   GISAID homepage with login link

Log in to your GISAID account. If you do not have an account yet, register for one (it's free) by selecting the “Registration” link.

.. figure:: ../../images/gisaid-login.png
   :alt: GISAID login page with registration link

   GISAID login page with registration link

Select “EpiCoV” from the top navigation bar.

.. figure:: ../../images/gisaid-navigation-bar.png
   :alt: GISAID navigation bar with “EpiCoV” link

   GISAID navigation bar with “EpiCoV” link

Search for region-specific data
-------------------------------
Select “Search” from the EpiCoV navigation bar.

.. figure:: ../../images/gisaid-epicov-navigation-bar.png
   :alt: GISAID EpiCoV navigation bar with “Search” link

   GISAID EpiCoV navigation bar with “Search” link

Find the “Location” field and start typing “North America /”. As you type, the field will suggest more specific geographic scales.

.. figure:: ../../images/gisaid-initial-search-interface.png
   :alt: GISAID initial search interface

   GISAID initial search interface

Finish by typing “North America / USA / Washington”. Restrict the results to strains collected between May 1 and June 1 that have complete genome sequences and complete collection dates. Then click the checkbox in the header row of the results display to select all strains that match the search parameters.

.. figure:: ../../images/gisaid-search-results.png
   :alt: GISAID search results for “Washington”

   GISAID search results for “Washington”


.. warning::

   GISAID limits the number of records you can download at once to 5000. If you need more records, narrow your search to smaller collection-date windows and download the data in batches.

Select the “Download” button in the bottom right of the search results. There are two options to download data from GISAID, both of which we describe below.
Option 1: Download “Input for the Augur pipeline”
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From the resulting “Download” window, select “Input for the Augur pipeline” as the download format.

.. figure:: ../../images/gisaid-search-download-window.png
   :alt: GISAID search download window showing “Input for the Augur pipeline” option

   GISAID search download window showing “Input for the Augur pipeline” option

Select the “Download” button and save the resulting file to the ``data/`` directory with a descriptive name like ``gisaid_washington.tar``. This tar archive contains compressed metadata and sequences named like ``1622567829294.metadata.tsv.xz`` and ``1622567829294.sequences.fasta.xz``, respectively.
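
Before pointing the workflow at the archive, you may want to confirm that it holds both a metadata file and a sequences file. The following is a minimal sketch and not part of the Nextstrain workflow: ``check_gisaid_archive`` is a hypothetical helper name, and the filename suffixes it looks for are assumptions taken from the example names above.

```bash
# Hypothetical helper: report whether a GISAID "Input for the Augur pipeline"
# tar archive contains members whose names include the expected suffixes.
# The suffixes are assumptions based on the example filenames in this guide.
check_gisaid_archive() {
    local archive="$1"
    local members suffix
    local status=0

    # "tar tf" lists member names without extracting anything.
    members="$(tar tf "$archive")" || return 2

    for suffix in ".metadata.tsv.xz" ".sequences.fasta.xz"; do
        if printf '%s\n' "$members" | grep -qF -- "$suffix"; then
            echo "found: *$suffix"
        else
            echo "missing: *$suffix" >&2
            status=1
        fi
    done
    return "$status"
}

# Example usage:
#   check_gisaid_archive data/gisaid_washington.tar
```

The function returns a non-zero status when either file is missing, so it can guard a later command in a script.
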
You can use this tar file as an input for the Nextstrain workflow, as shown below. The workflow will extract the data for you. Create a new workflow config file in the top level of the ``ncov`` directory that defines your analysis or “builds”.

.. code:: yaml

   # Define inputs for the workflow.
   inputs:
     - name: washington
       # The workflow will detect and extract the metadata and sequences
       # from GISAID tar archives.
       metadata: data/gisaid_washington.tar
       sequences: data/gisaid_washington.tar

Next, you can move on to the section below to get contextual data for your region of interest. Alternatively, you can extract the tar file into the ``data/`` directory prior to analysis.

.. code:: bash

   # Extract into data/ (not the current directory) so the paths below match.
   tar xvf data/gisaid_washington.tar -C data/

Rename the extracted files to match the descriptive name of the original archive.

.. code:: bash

   mv data/1622567829294.metadata.tsv.xz data/gisaid_washington_metadata.tsv.xz
   mv data/1622567829294.sequences.fasta.xz data/gisaid_washington_sequences.fasta.xz

You can use these extracted files as inputs for the workflow.

.. code:: yaml

   # Define inputs for the workflow.
   inputs:
     - name: washington
       # The workflow also accepts compressed metadata and sequences
       # from GISAID.
       metadata: data/gisaid_washington_metadata.tsv.xz
       sequences: data/gisaid_washington_sequences.fasta.xz

Option 2: Download “Sequences” and “Patient status metadata”
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Alternatively, you can download sequences and metadata as two separate uncompressed files. First, select “Sequences (FASTA)” as the download format. Check the box for replacing spaces with underscores. Select the “Download” button and save the resulting file to the ``data/`` directory with a descriptive name like ``gisaid_washington_sequences.fasta``.

.. figure:: ../../images/gisaid-search-download-window-sequences.png
   :alt: GISAID search download window showing “Sequences (FASTA)” option

   GISAID search download window showing “Sequences (FASTA)” option

From the search results interface, select the “Download” button in the bottom right again. Select “Patient status metadata” as the download format. Select the “Download” button and save the file to ``data/`` with a descriptive name like ``gisaid_washington_metadata.tsv``.

.. figure:: ../../images/gisaid-search-download-window-metadata.png
   :alt: GISAID search download window showing “Patient status metadata” option

   GISAID search download window showing “Patient status metadata” option

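
If the 5000-record download limit forced you to download your search results in several batches, one possible approach is to merge the batches into single files before continuing. The sketch below is not part of the Nextstrain workflow: ``combine_fasta`` and ``combine_metadata`` are hypothetical helper names, the batch filenames in the usage comments are placeholders, and it assumes every metadata batch starts with the same header row.

```bash
# Hypothetical helpers for merging batch downloads.

# Concatenate FASTA files. "awk 1" prints every line and terminates each one
# with a newline, so a batch that lacks a trailing newline cannot run into
# the first header of the next batch.
combine_fasta() {
    awk 1 "$@"
}

# Concatenate TSV files, keeping the header row from the first file only.
combine_metadata() {
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Example usage with placeholder batch filenames:
#   combine_fasta data/batch1.fasta data/batch2.fasta \
#       > data/gisaid_washington_sequences.fasta
#   combine_metadata data/batch1.tsv data/batch2.tsv \
#       > data/gisaid_washington_metadata.tsv
```

These helpers do not remove duplicates, so they are only appropriate when the collection-date windows of your batches do not overlap.
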
You can use these files as inputs for the workflow like so.

.. code:: yaml

   # Define inputs for the workflow.
   inputs:
     - name: washington
       metadata: data/gisaid_washington_metadata.tsv
       sequences: data/gisaid_washington_sequences.fasta

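
Because the sequences and metadata come from two separate downloads of the same selection, they should normally describe the same number of records. The following sketch offers a quick check before you run the workflow. The function names are hypothetical, and it assumes the metadata file has exactly one header row and no record that spans multiple lines.

```bash
# Count sequences in a FASTA file: one header line starting with ">" each.
count_sequences() {
    grep -c '^>' "$1"
}

# Count metadata records: every non-empty line after the header row.
count_metadata_records() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Print both counts and return non-zero if they differ.
compare_counts() {
    local n_seq n_meta
    n_seq="$(count_sequences "$1")"
    n_meta="$(count_metadata_records "$2")"
    echo "sequences: $n_seq, metadata records: $n_meta"
    [ "$n_seq" -eq "$n_meta" ]
}

# Example usage:
#   compare_counts data/gisaid_washington_sequences.fasta \
#       data/gisaid_washington_metadata.tsv
```

A mismatch usually means one of the two downloads was made from a different selection, in which case repeating both downloads from the same search results is the simplest fix.
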
Download contextual data for your region of interest
----------------------------------------------------
Next, select the “Downloads” link from the EpiCoV navigation bar.

.. figure:: ../../images/gisaid-epicov-navigation-bar-with-downloads.png
   :alt: GISAID EpiCoV navigation bar with “Downloads” link

   GISAID EpiCoV navigation bar with “Downloads” link

Scroll to the “Genomic epidemiology” section and select the “nextregions” button.

.. figure:: ../../images/gisaid-downloads-window.png
   :alt: GISAID downloads window

   GISAID downloads window

Select the major region that corresponds to your region-specific data above (e.g., “North America”).

.. figure:: ../../images/gisaid-nextregions-download-window.png
   :alt: GISAID “nextregions” download window

   GISAID “nextregions” download window

Agree to the terms and conditions and download the corresponding file (named like ``ncov_north-america.tar.gz``) to the ``data/`` directory.

.. figure:: ../../images/gisaid-nextregions-download-terms-and-conditions.png
   :alt: GISAID “nextregions” download terms and conditions

   GISAID “nextregions” download terms and conditions

This compressed tar archive contains metadata and sequences corresponding to a recent Nextstrain build for that region, with names like ``ncov_north-america.tsv`` and ``ncov_north-america.fasta``, respectively. For example, the “North America” download contains data from Nextstrain's North America build. These regional Nextstrain builds contain data from a specific region and contextual data from all other regions in the world. By default, GISAID provides these “nextregions” data in the “Input for the Augur pipeline” format.
As with the tar archive from the search results above, you can use the “nextregions” compressed tar archives as input to the Nextstrain workflow and the workflow will extract the appropriate contents for you. For example, you could update your ``inputs`` in the workflow config file from above to include the North American data as follows.

.. code:: yaml

   # Define inputs for the workflow.
   inputs:
     - name: washington
       # The workflow will detect and extract the metadata and sequences
       # from GISAID tar archives.
       metadata: data/gisaid_washington.tar
       sequences: data/gisaid_washington.tar
     - name: north-america
       # The workflow will similarly detect and extract metadata and
       # sequences from compressed tar archives.
       metadata: data/ncov_north-america.tar.gz
       sequences: data/ncov_north-america.tar.gz

Alternatively, you can extract the data from the compressed tar archive into the ``data/`` directory.

.. code:: bash

   # Extract into data/ (not the current directory) so the paths below match.
   tar zxvf data/ncov_north-america.tar.gz -C data/

You can use these extracted files as inputs for the workflow.

.. code:: yaml

   # Define inputs for the workflow.
   inputs:
     - name: washington
       # The workflow will detect and extract the metadata and sequences
       # from GISAID tar archives.
       metadata: data/gisaid_washington.tar
       sequences: data/gisaid_washington.tar
     - name: north-america
       # The workflow supports uncompressed or compressed input files.
       metadata: data/ncov_north-america.tsv
       sequences: data/ncov_north-america.fasta

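
Since inferences about a sample's origin depend on which origins are present in your data, it can be useful to tally the contextual records by geographic region before running the workflow. The sketch below is one way to do that with standard tools. ``count_by_column`` is a hypothetical helper, and the ``region`` column name in the usage comment is an assumption about the extracted metadata file; check the header row of your own file and adjust the name if needed.

```bash
# Hypothetical helper: tally the values in one named column of a
# tab-delimited metadata file. The column is located by its header name.
# If the column is absent, a message goes to stderr and nothing is tallied.
count_by_column() {
    local file="$1" column="$2"
    awk -F '\t' -v col="$column" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { counts[$idx]++ }
        END { for (value in counts) print counts[value] "\t" value }
    ' "$file" | sort -k1,1nr
}

# Example usage (the "region" column name is an assumption):
#   count_by_column data/ncov_north-america.tsv region
```

Each output line is a count followed by a column value, with the largest counts first. A region that does not appear in the output cannot be inferred as a source of introductions.
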