gdc-client
The National Cancer Institute maintains data for research use. One of the tools available to download data from the Genomic Data Commons (GDC) is the gdc-client. Using the gdc-client on the HPC cluster is described here.
Understanding the workflow to optimize data transfers
The actual download needs to make use of a data transfer node, such as dtn3.case.edu, in order to take advantage of the science data firewall rules that have been arranged to avoid packet-level inspection of large-scale data transfers. The outline is as follows:
Jobs are submitted through the SLURM scheduler of the HPC cluster
The job organizing the download (providing access to the GDC token and manifest files) runs on a compute node in the cluster. This compute node job is a shell script that includes an ssh session call to a data transfer node (see the example call in the 'gdc-client availability' section).
The ssh session call must specify the Linux environment in which to run the gdc-client process, and will include a flag setting the maximum number of gdc-client processes to run in parallel when connecting to the GDC server at NCI in Chicago.
gdc-client, running on the data transfer node, will establish up to the maximum number of processes, depending on the number of processes active on the data transfer node.
The gdc-client process must be limited to a finite execution time to avoid failed transfers overloading the transfer nodes. The examples below stop the gdc-client after two days (172,800 seconds) with the timeout command.
The compute node job may perform further tasks on the downloaded data.
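The timeout behavior described in the steps above can be checked in isolation; a minimal sketch, using a 1-second limit in place of the two-day limit from the examples:

```shell
# timeout sends SIGTERM once the limit expires and exits with status 124 --
# the same mechanism that stops a stalled gdc-client after two days
timeout 1 sleep 5
echo "exit status: $?"
```

An exit status of 124 distinguishes a transfer that was cut off by the time limit from one that finished (status 0) or failed on its own.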
The data transfer nodes have multiple interface addresses. The following IPs will access the 'Data Science Network', providing the greatest throughput:
dtn3: 192.168.223.218
dtn2: 192.168.223.219
dtn1: 192.168.223.238
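For scripts that target these interfaces, the addresses can be kept in one place; a minimal sketch (the variable names are illustrative, not a site convention):

```shell
# Data Science Network addresses of the transfer nodes (from the list above)
DTN3=192.168.223.218
DTN2=192.168.223.219
DTN1=192.168.223.238

# Select one node for the download commands that follow
echo "using dtn1 at ${DTN1}"
```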
The script structure is somewhat nested, so an outline is included here. The SLURM job submission script specifies the resources needed to run a download script that makes the ssh call to the data transfer node. The resource requirement for this task is modest: only 1 CPU and no significant memory. An example is shown here:
#!/bin/bash
#SBATCH -n 1 # 1 task is enough for the ssh
#SBATCH --mem 4GB # memory proportional to 1 cpu
#SBATCH -t 2-00:00:00 # 2 days, example job duration
#SBATCH <other appropriate flags for the job>
cd <script directory>
./download.sh
The ssh-call occurs within the download.sh script
ssh 192.168.223.238 "cd <path-to-download-directory>; module load gdc-client; timeout 172800 gdc-client download -n 30 --debug -m <manifest-file-path> -t <gdc-token-file-path>"
If the token is invalid (expired, wrong file specified, etc.), the gdc-client will emit an error message containing the string FORBIDDEN. To test for this scenario, you can redirect the command output to a file and test with grep.
ssh 192.168.223.238 "cd <path-to-download-directory>; module load gdc-client; timeout 172800 gdc-client download -n 30 --debug -m <manifest-file-path> -t <gdc-token-file-path> > gdc-client.out 2>&1"
if grep -Fq "FORBIDDEN" <path-to-download-directory>/gdc-client.out; then
    echo "Invalid token"
fi
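The grep test can be exercised without performing a live transfer; a minimal sketch using a mock log file (the log line is an illustrative example, not verbatim gdc-client output):

```shell
# Simulate the output file a failed download attempt might leave behind
echo "ERROR: 403 Client Error: FORBIDDEN" > gdc-client.out

# -F matches the literal string, -q suppresses output (exit status only)
if grep -Fq "FORBIDDEN" gdc-client.out; then
    echo "Invalid token"
fi
```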
gdc-client availability on the CWRU cluster
The software is available as a module on the data transfer nodes and needs to be loaded within the ssh call performed from the compute node; this establishes the environment on the data transfer node to run gdc-client. The module is available to load on all nodes in the cluster; however, the command will only run efficiently on the data transfer nodes.
Usage notes
The '--debug' flag is particularly useful when testing, and may be omitted during 'production' transfers using full manifest files. Further information is available as shown below. Linked here is the GDC client user guide.
[mrd20@dtn3 ~]$ gdc-client download --help
usage: gdc-client download [-h] [--debug] [--log-file LOG_FILE]
[-t TOKEN_FILE] [-d DIR] [-s server]
[--no-segment-md5sums] [--no-file-md5sum]
[-n N_PROCESSES]
[--http-chunk-size HTTP_CHUNK_SIZE]
[--save-interval SAVE_INTERVAL] [--no-verify]
[--no-related-files] [--no-annotations]
[--no-auto-retry] [--retry-amount RETRY_AMOUNT]
[--wait-time WAIT_TIME] [-u] [-m MANIFEST]
[file_id [file_id ...]]
positional arguments:
file_id The GDC UUID of the file(s) to download
optional arguments:
-h, --help show this help message and exit
--debug Enable debug logging. If a failure occurs, the program
will stop.
--log-file LOG_FILE Save logs to file. Amount logged affected by --debug
-t TOKEN_FILE, --token-file TOKEN_FILE
GDC API auth token file
-d DIR, --dir DIR Directory to download files to. Defaults to current
dir
-s server, --server server
The TCP server address server[:port]
--no-segment-md5sums Do not calculate inbound segment md5sums and/or do not
verify md5sums on restart
--no-file-md5sum Do not verify file md5sum after download
-n N_PROCESSES, --n-processes N_PROCESSES
Number of client connections.
--http-chunk-size HTTP_CHUNK_SIZE
Size in bytes of standard HTTP block size.
--save-interval SAVE_INTERVAL
The number of chunks after which to flush state file.
A lower save interval will result in more frequent
printout but lower performance.
--no-verify Perform insecure SSL connection and transfer
--no-related-files Do not download related files.
--no-annotations Do not download annotations.
--no-auto-retry Ask before retrying to download a file
--retry-amount RETRY_AMOUNT
Number of times to retry a download
--wait-time WAIT_TIME
Amount of seconds to wait before retrying
-u, --udt Use the UDT protocol.
-m MANIFEST, --manifest MANIFEST
GDC download manifest file