Curate data from GISAID search and downloads
The following instructions describe how to curate data for a region-specific analysis (e.g., identifying recent introductions into Washington State) using GISAID’s “Search” page and curated regional data from the “Downloads” window. Inferences about a sample’s origin strongly depend on the composition of your dataset. For example, discrete trait analysis models cannot infer transmission from an origin that is not present in your data. We show how to overcome this issue by adding previously curated contextual sequences from Nextstrain to your region-specific dataset.
Login to GISAID
Navigate to GISAID (gisaid.org) and select the “Login” link.
Log in to your GISAID account. If you do not have an account yet, register for one (it’s free) by selecting the “Registration” link.
Select “EpiCoV” from the top navigation bar.
Search for region-specific data
Select “Search” from the EpiCoV navigation bar.
Find the “Location” field and start typing “North America /”. As you type, the field will suggest more specific geographic scales.
Finish by typing “North America / USA / Washington”. Select all strains collected between May 1 and June 1 with complete genome sequences and collection dates. Click the checkbox in the header row of the results display to select all strains that match the search parameters.
Warning
GISAID limits the number of records you can download at once to 5000. If you need to download more records, constrain your search results to smaller windows of time by collection date and download data in these smaller batches.
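To plan batched downloads, it can help to list the collection-date windows before entering them in the search form. The following is a minimal sketch, not part of GISAID or the Nextstrain workflow: the function name date_windows is hypothetical, the year 2021 and the 10-day window size are assumptions chosen for illustration, and the script assumes GNU date (the -d option).

```shell
#!/bin/sh
# Hypothetical helper: turn an ISO date (YYYY-MM-DD) into an integer
# such as 20210501 so that dates can be compared numerically.
as_int() {
    printf '%s' "$1" | tr -d '-'
}

# Hypothetical helper: print tab-separated start and end dates (inclusive)
# for consecutive windows of DAYS days covering START through END.
# Assumes GNU date, which supports relative dates through -d.
date_windows() (
    start=$1
    end=$2
    days=$3
    cur=$start
    while [ "$(as_int "$cur")" -le "$(as_int "$end")" ]; do
        stop=$(date -u -d "$cur + $((days - 1)) days" +%F)
        # Clip the final window to the requested end date.
        if [ "$(as_int "$stop")" -gt "$(as_int "$end")" ]; then
            stop=$end
        fi
        printf '%s\t%s\n' "$cur" "$stop"
        cur=$(date -u -d "$stop + 1 day" +%F)
    done
)

# Example: split May 1 through June 1 (year assumed) into 10-day windows.
date_windows 2021-05-01 2021-06-01 10
```

Each printed row gives the start and end collection dates for one search. If a window still returns more than 5000 records, rerun with a smaller number of days.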
Select the “Download” button in the bottom right of the search results. There are two options to download data from GISAID, both of which we describe below.
Option 1: Download “Input for the Augur pipeline”
From the resulting “Download” window, select “Input for the Augur pipeline” as the download format.
Select the “Download” button and save the resulting file to the data/ directory with a descriptive name like gisaid_washington.tar. This tar archive contains compressed metadata and sequences named like 1622567829294.metadata.tsv.xz and 1622567829294.sequences.fasta.xz, respectively.
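Before using the archive, you may want to confirm that it contains the two files described above. The following is a minimal sketch that lists the archive members with tar and counts the metadata and sequences files; the function name check_gisaid_archive is hypothetical and the filename patterns are assumptions based on the example names above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the tar archive at $1 holds exactly
# one metadata file and one sequences file (name patterns are assumed from
# the example filenames above).
check_gisaid_archive() (
    # List member names without extracting anything.
    members=$(tar -tf "$1") || exit 1
    n_metadata=$(printf '%s\n' "$members" | grep -c 'metadata\.tsv' || true)
    n_sequences=$(printf '%s\n' "$members" | grep -c 'sequences\.fasta' || true)
    printf 'metadata files: %s, sequence files: %s\n' "$n_metadata" "$n_sequences"
    [ "$n_metadata" -eq 1 ] && [ "$n_sequences" -eq 1 ]
)

# Only runs when the archive has already been saved to data/.
if [ -f data/gisaid_washington.tar ]; then
    check_gisaid_archive data/gisaid_washington.tar
fi
```

A non-zero exit status suggests an incomplete download or a different download format.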
You can use this tar file as an input for the Nextstrain workflow, as shown below. The workflow will extract the data for you. Create a new workflow config file in the top level of the ncov directory that defines your analysis or “builds”.
# Define inputs for the workflow.
inputs:
  - name: washington
    # The workflow will detect and extract the metadata and sequences
    # from GISAID tar archives.
    metadata: data/gisaid_washington.tar
    sequences: data/gisaid_washington.tar
Next, you can move on to the heading below to get contextual data for your region of interest. Alternately, you can extract the tar file into the data/ directory prior to analysis.
tar xvf data/gisaid_washington.tar -C data/
Rename the extracted files to match the descriptive name of the original archive.
mv data/1622567829294.metadata.tsv.xz data/gisaid_washington_metadata.tsv.xz
mv data/1622567829294.sequences.fasta.xz data/gisaid_washington_sequences.fasta.xz
You can use these extracted files as inputs for the workflow.
# Define inputs for the workflow.
inputs:
  - name: washington
    # The workflow also accepts compressed metadata and sequences
    # from GISAID.
    metadata: data/gisaid_washington_metadata.tsv.xz
    sequences: data/gisaid_washington_sequences.fasta.xz
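Because the numeric prefix of the extracted files changes with every download, the extract-and-rename steps above can be wrapped in a small helper. This is a sketch under stated assumptions rather than part of the Nextstrain workflow: the function name extract_and_rename is hypothetical, and it assumes the archive members end in .metadata.tsv.xz and .sequences.fasta.xz as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: extract a GISAID tar archive ($1) into a directory
# ($2) and rename the timestamped members using a descriptive prefix ($3).
extract_and_rename() (
    archive=$1
    outdir=$2
    prefix=$3
    # -C makes tar write the extracted files into the output directory.
    tar -xf "$archive" -C "$outdir" || exit 1
    # Walk the member list and rename by file suffix (assumed patterns).
    tar -tf "$archive" | while IFS= read -r member; do
        case "$member" in
            *.metadata.tsv.xz)
                mv "$outdir/$member" "$outdir/${prefix}_metadata.tsv.xz" ;;
            *.sequences.fasta.xz)
                mv "$outdir/$member" "$outdir/${prefix}_sequences.fasta.xz" ;;
        esac
    done
)

# Only runs when the archive has already been saved to data/.
if [ -f data/gisaid_washington.tar ]; then
    extract_and_rename data/gisaid_washington.tar data gisaid_washington
fi
```

The resulting filenames match the metadata and sequences paths used in the config above.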
Option 2: Download “Sequences” and “Patient status metadata”
Alternately, you can download sequences and metadata as two separate uncompressed files. First, select “Sequences (FASTA)” as the download format. Check the box for replacing spaces with underscores. Select the “Download” button and save the resulting file to the data/ directory with a descriptive name like gisaid_washington_sequences.fasta.
From the search results interface, select the “Download” button in the bottom right again. Select “Patient status metadata” as the download format. Select the “Download” button and save the file to data/ with a descriptive name like gisaid_washington_metadata.tsv.
You can use these files as inputs for the workflow like so.
# Define inputs for the workflow.
inputs:
  - name: washington
    metadata: data/gisaid_washington_metadata.tsv
    sequences: data/gisaid_washington_sequences.fasta
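Since the sequences and metadata come from two separate downloads in this option, a quick consistency check can catch a mismatched pair of files. The sketch below counts FASTA records and metadata rows and reports any sequence names that still contain spaces. The function names are hypothetical, and it assumes the metadata file has a single header line and one record per line.

```shell
#!/bin/sh
# Hypothetical helper: count FASTA records (lines starting with ">").
count_fasta_records() {
    grep -c '^>' "$1" || true
}

# Hypothetical helper: count non-empty metadata rows after the header.
# Assumes exactly one header line and one record per line.
count_metadata_rows() {
    tail -n +2 "$1" | grep -c '.' || true
}

# Hypothetical helper: count FASTA headers that still contain spaces,
# which would suggest the "replace spaces with underscores" box was missed.
count_headers_with_spaces() {
    grep '^>' "$1" | grep -c ' ' || true
}

sequences=data/gisaid_washington_sequences.fasta
metadata=data/gisaid_washington_metadata.tsv

# Only runs when both downloads have already been saved to data/.
if [ -f "$sequences" ] && [ -f "$metadata" ]; then
    echo "FASTA records: $(count_fasta_records "$sequences")"
    echo "Metadata rows: $(count_metadata_rows "$metadata")"
    echo "Headers with spaces: $(count_headers_with_spaces "$sequences")"
fi
```

If the two counts differ, the downloads probably came from different selections and one of them should be repeated.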
Download contextual data for your region of interest
Next, select the “Downloads” link from the EpiCoV navigation bar.
Scroll to the “Genomic epidemiology” section and select the “nextregions” button.
Select the major region that corresponds to your region-specific data above (e.g., “North America”).
Agree to the terms and conditions and download the corresponding file (named like ncov_north-america.tar.gz) to the data/ directory.
This compressed tar archive contains metadata and sequences corresponding to a recent Nextstrain build for that region with names like ncov_north-america.tsv and ncov_north-america.fasta, respectively. For example, the “North America” download contains data from Nextstrain’s North America build. These regional Nextstrain builds contain data from a specific region and contextual data from all other regions in the world. By default, GISAID provides these “nextregions” data in the “Input for the Augur pipeline” format.
As with the tar archive from the search results above, you can use the “nextregions” compressed tar archives as input to the Nextstrain workflow and the workflow will extract the appropriate contents for you. For example, you could update your inputs in the workflow config file from above to include the North American data as follows.
# Define inputs for the workflow.
inputs:
  - name: washington
    # The workflow will detect and extract the metadata and sequences
    # from GISAID tar archives.
    metadata: data/gisaid_washington.tar
    sequences: data/gisaid_washington.tar
  - name: north-america
    # The workflow will similarly detect and extract metadata and
    # sequences from compressed tar archives.
    metadata: data/ncov_north-america.tar.gz
    sequences: data/ncov_north-america.tar.gz
Alternately, you can extract the data from the compressed tar archive into the data/ directory.
tar zxvf data/ncov_north-america.tar.gz -C data/
You can use these extracted files as inputs for the workflow.
# Define inputs for the workflow.
inputs:
  - name: washington
    # The workflow will detect and extract the metadata and sequences
    # from GISAID tar archives.
    metadata: data/gisaid_washington.tar
    sequences: data/gisaid_washington.tar
  - name: north-america
    # The workflow supports uncompressed or compressed input files.
    metadata: data/ncov_north-america.tsv
    sequences: data/ncov_north-america.fasta