gdc-client

The National Cancer Institute (NCI) maintains data for research use through the Genomic Data Commons (GDC). One of the tools available for downloading data from the GDC is the gdc-client. This page describes how to use the gdc-client on the HPC cluster.

Understanding the workflow to optimize data transfers

The download itself must run on a data transfer node, such as dtn3.case.edu. This takes advantage of the science data firewall rules, which have been arranged to avoid packet-level inspection of large-scale data transfers.

The script structure is nested, so an outline is given here. The Slurm job submission script requests the resources needed to run a download script, which in turn makes the ssh call to the data transfer node. The resource requirement for this task is modest: 1 CPU and no significant memory. An example is shown here:

#!/bin/bash

#SBATCH   -n 1           # 1 task is enough for the ssh

#SBATCH   --mem 4GB      # memory proportional to 1 cpu

#SBATCH   -t 2-00:00:00  # 2 days, example job duration

#SBATCH   <other appropriate flags for the job>

cd <script directory>

./download.sh

The ssh call occurs within the download.sh script:

ssh 192.168.223.238 "cd <path-to-download-directory>; module load gdc-client; timeout 172800 gdc-client download -n 30 --debug -m <manifest-file-path> -t <gdc-token-file-path>"
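The value passed to timeout, 172800 seconds, matches the two-day limit requested with -t 2-00:00:00 in the job script (2 x 86400 seconds). If the job duration changes, the timeout can be derived from it rather than hard-coded. The following is a minimal sketch: the function name slurm_time_to_seconds is hypothetical, and only the D-HH:MM:SS form of the Slurm time limit is handled.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm or gdc-client): convert a time
# limit of the form D-HH:MM:SS into seconds for use with timeout.
slurm_time_to_seconds() {
    local limit="$1"
    local days="${limit%%-*}"      # text before the first '-'
    local clock="${limit#*-}"      # HH:MM:SS after the first '-'
    local hours="${clock%%:*}"
    local rest="${clock#*:}"
    local minutes="${rest%%:*}"
    local seconds="${rest#*:}"
    # Strip one leading zero from each two-digit field so that values
    # such as 08 and 09 are not interpreted as octal.
    echo $(( days * 86400 + ${hours#0} * 3600 + ${minutes#0} * 60 + ${seconds#0} ))
}

slurm_time_to_seconds "2-00:00:00"   # prints 172800
```

The printed value can then be substituted for the literal 172800 in the ssh command.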

If the token is invalid (expired, wrong file specified, etc.), the gdc-client emits an error message containing the string FORBIDDEN. To test for this scenario, redirect the command output to a file and search it with grep.

ssh 192.168.223.238 "cd <path-to-download-directory>; module load gdc-client; timeout 172800 gdc-client download -n 30 --debug -m <manifest-file-path> -t <gdc-token-file-path> > gdc-client.out"

if grep -Fq "FORBIDDEN" <path-to-download-directory>/gdc-client.out; then

    echo "Invalid token"

fi
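The token check can also be wrapped in a small function so that the job fails visibly when the token is rejected. This is only a sketch: the function name check_gdc_token and its return codes are assumptions, and it relies solely on the FORBIDDEN string described above.

```shell
#!/bin/bash
# Hypothetical helper: inspect a saved gdc-client output file.
# Returns 0 if no FORBIDDEN message is found, 1 if one is found,
# and 2 if the file cannot be read.
check_gdc_token() {
    local outfile="$1"
    if [ ! -r "$outfile" ]; then
        echo "Cannot read $outfile" >&2
        return 2
    fi
    if grep -Fq "FORBIDDEN" "$outfile"; then
        echo "Invalid token" >&2
        return 1
    fi
    return 0
}
```

In download.sh, the function could be called after the ssh command with the path to gdc-client.out, and the script could exit with a non-zero status on failure so that Slurm records the job as failed.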

gdc-client availability on the CWRU cluster

The software is available as a module and must be loaded within the ssh call made from the compute node. This establishes the environment needed to run gdc-client on the data transfer node. The module can be loaded on any node in the cluster; however, transfers will only run efficiently on a data transfer node.

Usage notes

The '--debug' flag is particularly useful when testing and may be omitted during 'production' transfers that use full manifest files. Further information is available from the built-in help, shown below, and from the GDC client user guide.
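Switching between test and production runs can be handled with a single argument. The sketch below only assembles and prints the command string; the function name gdc_download_command, the mode argument, and the file names manifest.txt and token.txt are hypothetical, while the flags are those used in the ssh command earlier on this page.

```shell
#!/bin/bash
# Hypothetical helper: print the gdc-client command line for a manifest
# and token file. Pass "test" as the third argument to add --debug;
# any other value (or none) gives the production form.
# Assumes the file paths contain no spaces.
gdc_download_command() {
    local manifest="$1"
    local token="$2"
    local mode="${3:-production}"
    local cmd="gdc-client download -n 30"
    if [ "$mode" = "test" ]; then
        cmd="$cmd --debug"
    fi
    echo "$cmd -m $manifest -t $token"
}

gdc_download_command manifest.txt token.txt test
gdc_download_command manifest.txt token.txt
```

The printed string can be placed inside the quoted ssh command in download.sh, after the cd and module load steps.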

[mrd20@dtn3 ~]$ gdc-client download --help

usage: gdc-client download [-h] [--debug] [--log-file LOG_FILE]

                           [-t TOKEN_FILE] [-d DIR] [-s server]

                           [--no-segment-md5sums] [--no-file-md5sum]

                           [-n N_PROCESSES]

                           [--http-chunk-size HTTP_CHUNK_SIZE]

                           [--save-interval SAVE_INTERVAL] [--no-verify]

                           [--no-related-files] [--no-annotations]

                           [--no-auto-retry] [--retry-amount RETRY_AMOUNT]

                           [--wait-time WAIT_TIME] [-u] [-m MANIFEST]

                           [file_id [file_id ...]]

positional arguments:

  file_id               The GDC UUID of the file(s) to download

optional arguments:

  -h, --help            show this help message and exit

  --debug               Enable debug logging. If a failure occurs, the program

                        will stop.

  --log-file LOG_FILE   Save logs to file. Amount logged affected by --debug

  -t TOKEN_FILE, --token-file TOKEN_FILE

                        GDC API auth token file

  -d DIR, --dir DIR     Directory to download files to. Defaults to current

                        dir

  -s server, --server server

                        The TCP server address server[:port]

  --no-segment-md5sums  Do not calculate inbound segment md5sums and/or do not

                        verify md5sums on restart

  --no-file-md5sum      Do not verify file md5sum after download

  -n N_PROCESSES, --n-processes N_PROCESSES

                        Number of client connections.

  --http-chunk-size HTTP_CHUNK_SIZE

                        Size in bytes of standard HTTP block size.

  --save-interval SAVE_INTERVAL

                        The number of chunks after which to flush state file.

                        A lower save interval will result in more frequent

                        printout but lower performance.

  --no-verify           Perform insecure SSL connection and transfer

  --no-related-files    Do not download related files.

  --no-annotations      Do not download annotations.

  --no-auto-retry       Ask before retrying to download a file

  --retry-amount RETRY_AMOUNT

                        Number of times to retry a download

  --wait-time WAIT_TIME

                        Amount of seconds to wait before retrying

  -u, --udt             Use the UDT protocol.

  -m MANIFEST, --manifest MANIFEST

                        GDC download manifest file