gdc-client
The National Cancer Institute maintains data for research use. One of the tools available to download data from the Genomic Data Commons (GDC) is the gdc-client. Using the gdc-client on the HPC cluster is described here.
Understanding the workflow to optimize data transfers
The actual download needs to make use of a data transfer node, such as dtn3.case.edu, in order to take advantage of the science data firewall rules that have been arranged to avoid packet-level inspection of large-scale data transfers. The outline is as follows:
Jobs are submitted through the SLURM scheduler of the HPC cluster
The job organizing the download (providing access to the GDC token and manifest files) runs on a compute node in the cluster. This compute node job is a shell script that includes an ssh session call to a data transfer node (see the example call in the 'gdc-client availability' section).
The ssh session call must specify the Linux environment in which to run the gdc-client process, and will include a flag setting the maximum number of gdc-client processes to run in parallel when connecting to the GDC server at NCI in Chicago.
gdc-client, running on the data transfer node, will establish up to the maximum number of processes, depending on the number of processes active on the data transfer node.
The gdc-client process must be limited to a finite execution time to avoid failed transfers overloading the transfer nodes. The examples below stop the gdc-client after two days (172,800 seconds) with the timeout command.
The compute node job may perform further tasks on the downloaded data.
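The timeout behavior described in the steps above can be checked in isolation; a minimal sketch, using a 1-second limit in place of the two-day limit from the examples:

```shell
# timeout sends SIGTERM once the limit expires and exits with status 124 --
# the same mechanism that stops a stalled gdc-client after two days
timeout 1 sleep 5
echo "exit status: $?"
```

An exit status of 124 distinguishes a transfer that was cut off by the time limit from one that finished (status 0) or failed on its own.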
The data transfer nodes have multiple interface addresses. The following IPs will access the 'Data Science Network', providing the greatest throughput:
dtn3: 192.168.223.218
dtn2: 192.168.223.219
dtn1: 192.168.223.238
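For scripts that target these interfaces, the addresses can be kept in one place; a minimal sketch (the variable names are illustrative, not a site convention):

```shell
# Data Science Network addresses of the transfer nodes (from the list above)
DTN3=192.168.223.218
DTN2=192.168.223.219
DTN1=192.168.223.238

# Select one node for the download commands that follow
echo "using dtn1 at ${DTN1}"
```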
The script structure is somewhat nested, so an outline is included here. The SLURM job submission script specifies the resources needed to run a download script that makes the ssh call to the data transfer node. The resource requirement for this task is modest: only 1 CPU and no significant memory. An example is shown here:
#!/bin/bash
#SBATCH -n 1 # 1 task is enough for the ssh
#SBATCH --mem 4GB # memory proportional to 1 cpu
#SBATCH -t 2-00:00:00 # 2 days, example job duration
#SBATCH <other appropriate flags for the job>
cd <script directory>
./download.sh
The ssh-call occurs within the download.sh script
ssh 192.168.223.238 "cd <path-to-download-directory>; module load gdc-client; timeout 172800 gdc-client download -n 30 --debug -m <manifest-file-path> -t <gdc-token-file-path>"
If the token is invalid (expired, wrong file specified, etc.), the gdc-client will emit an error message containing the string FORBIDDEN. To test for this scenario, you can redirect the command output to a file and test with grep.
ssh 192.168.223.238 "cd <path-to-download-directory>; module load gdc-client; timeout 172800 gdc-client download -n 30 --debug -m <manifest-file-path> -t <gdc-token-file-path> > gdc-client.out 2>&1"
if grep -Fq "FORBIDDEN" <path-to-download-directory>/gdc-client.out; then
    echo "Invalid token"
fi
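The grep test can be exercised without performing a live transfer; a minimal sketch using a mock log file (the log line is an illustrative example, not verbatim gdc-client output):

```shell
# Simulate the output file a failed download attempt might leave behind
echo "ERROR: 403 Client Error: FORBIDDEN" > gdc-client.out

# -F matches the literal string, -q suppresses output (exit status only)
if grep -Fq "FORBIDDEN" gdc-client.out; then
    echo "Invalid token"
fi
```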
gdc-client availability on the CWRU cluster
The software is available as a module on the data transfer nodes and needs to be loaded within the ssh call performed from the compute node; this establishes the environment on the data transfer node to run gdc-client. The module is available to load on all nodes in the cluster; however, the command will only run efficiently on the data transfer nodes.
Usage notes
The '--debug' flag is particularly useful when testing, and may be omitted during 'production' transfers using full manifest files. Further information is available as shown below. Linked here is the GDC client user guide.
[mrd20@dtn3 ~]$ gdc-client download --help
usage: gdc-client download [-h] [--debug] [--log-file LOG_FILE]
[-t TOKEN_FILE] [-d DIR] [-s server]
[--no-segment-md5sums] [--no-file-md5sum]
[-n N_PROCESSES]
[--http-chunk-size HTTP_CHUNK_SIZE]
[--save-interval SAVE_INTERVAL] [--no-verify]
[--no-related-files] [--no-annotations]
[--no-auto-retry] [--retry-amount RETRY_AMOUNT]
[--wait-time WAIT_TIME] [-u] [-m MANIFEST]
[file_id [file_id ...]]
positional arguments:
file_id The GDC UUID of the file(s) to download
optional arguments:
-h, --help show this help message and exit
--debug Enable debug logging. If a failure occurs, the program
will stop.
--log-file LOG_FILE Save logs to file. Amount logged affected by --debug
-t TOKEN_FILE, --token-file TOKEN_FILE
GDC API auth token file
-d DIR, --dir DIR Directory to download files to. Defaults to current
dir
-s server, --server server
The TCP server address server[:port]
--no-segment-md5sums Do not calculate inbound segment md5sums and/or do not
verify md5sums on restart
--no-file-md5sum Do not verify file md5sum after download
-n N_PROCESSES, --n-processes N_PROCESSES
Number of client connections.
--http-chunk-size HTTP_CHUNK_SIZE
Size in bytes of standard HTTP block size.
--save-interval SAVE_INTERVAL
The number of chunks after which to flush state file.
A lower save interval will result in more frequent
printout but lower performance.
--no-verify Perform insecure SSL connection and transfer
--no-related-files Do not download related files.
--no-annotations Do not download annotations.
--no-auto-retry Ask before retrying to download a file
--retry-amount RETRY_AMOUNT
Number of times to retry a download
--wait-time WAIT_TIME
Amount of seconds to wait before retrying
-u, --udt Use the UDT protocol.
-m MANIFEST, --manifest MANIFEST
GDC download manifest file