Google Drive [1] provides Google online storage for the Case community to store files. As of 2021 the storage is no longer unlimited, so consider carefully what data should be stored this way. Rclone [2] is a command-line program used to sync files and directories to and from Google Drive.
As a synchronization tool, rclone works differently from scp and sftp, and much more like the Linux 'rsync' program. The instructions for copying the contents of a directory are very simple, and appear below. Selectively transferring files requires rclone's own filter flags or a file manifest; shell-style wildcard expansion of remote paths on the command line is not supported.
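A sketch of the rsync-like semantics (the remote name 'remote' and the paths here are illustrative, and assume a remote has already been configured as described below): rclone copies the contents of the source directory into the destination and skips files that are already identical, so re-running the command is cheap.

```shell
# Illustrative only: copies the CONTENTS of ./results into remote:backup/results.
# Files that already match on the remote are skipped, so this is safe to re-run.
rclone copy ./results remote:backup/results

# Note the difference from scp -r, which copies the directory itself and
# would create a nested backup/results/results on a second run.
```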
Use an OnDemand Interactive Desktop. Alternatively, you can access the dtn3 node from a login node (use X forwarding for the login connection as well): ssh -X <caseID>@dtn3.case.edu
Load rclone module
module load rclone
Configure your Google Drive with config
rclone config
It will show your existing remotes, if any, and prompt you for input. Refer to the instructions at http://rclone.org/drive/ [3]. Instructions for Box and AWS S3 are at https://rclone.org/box/ and https://rclone.org/s3/ respectively.
Current remotes:
Name Type
==== ====
remote drive
e) Edit existing remote
n) New remote
d) Delete remote
s) Set configuration password
q) Quit config
e/n/d/s/q>
Accept the default values for most prompts. Answer 'No' for 'Edit advanced config'. Choose the option "rclone should act on behalf of Enterprise".
Note that with the auto-config option, when the authorization page opens, sign in with username <caseID>@case.edu (SSO credentials). If you have followed the instructions, "remote" will be listed under current remotes when you run "rclone config" again as above.
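After configuration, you can confirm the remote is usable; a quick sketch ('remote' is the name chosen during configuration):

```shell
# List all configured remote names; the output should include 'remote:'.
rclone listremotes

# List the top-level directories of your Google Drive through the new remote.
rclone lsd remote:
```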
You may want to look at the available rclone options:
rclone --help
output:
Sync files and directories to and from local and remote object stores - v1.29.
Syntax: [options] subcommand <parameters> <parameters...>
Subcommands:
copy source:path dest:path
mkdir remote:path
Make the path if it doesn't already exist
rmdir remote:path
...
Authentication involves token generation and verification. The HPC data transfer nodes have no local browser capability, so the command-line approach is needed.
Begin configuration: rclone config
Choose whether a token already exists or not
Leave 'fill-in' prompts blank (e.g. client_id), and select 'no' when prompted for advanced config and for auto config (auto config would require a local browser). A URL will be generated.
Open a browser (e.g. Firefox) on the current host (the HPC node, not your personal computer) and paste that URL into the browser's address bar.
The result will be a prompt to authenticate to Google. Do so with your Case credentials.
Once authenticated, token text will be displayed; copy it.
Paste the authorization token text back into the configuration shell session.
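If the token exchange succeeds, rclone stores the result in ~/.config/rclone/rclone.conf. A minimal sketch of the resulting entry (token values elided; the 'scope' value depends on your answers during configuration):

```
[remote]
type = drive
scope = drive
token = {"access_token":"...","token_type":"Bearer","refresh_token":"...","expiry":"..."}
```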
Note: You need an Access Key ID and Secret Access Key to connect to your AWS account and access the S3 bucket.
AWS Access Key ID - leave blank for anonymous access or runtime credentials.
access_key_id> XXX
AWS Secret Access Key (password) - leave blank for anonymous access or runtime credentials.
secret_access_key> YYY
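The completed S3 remote also ends up in rclone.conf. A minimal sketch (the remote name 'awsremote' and the region are illustrative assumptions, not values from this guide):

```
[awsremote]
type = s3
provider = AWS
access_key_id = XXX
secret_access_key = YYY
region = us-east-1
```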
Create a directory "admin" in your Google Drive "remote"
rclone mkdir remote:admin
Transfer a file or directory from the current location to the Google Drive
rclone copy <file/directory> remote:admin
output:
On the dtn3 node, this was the speed obtained for a 19 GB file:
Upload: Transferred: 19298061765 Bytes (15797.38 kByte/s)
List Directories in Remote Drive
rclone lsd remote:admin
output:
58472 2016-05-05 13:16:44 19 script
26009529096 2016-05-05 13:16:25 14512 viz-bk
rclone ls remote:admin/script
output:
4268 2016-05-05 13:13:22 2 job-monitor
4207 2016-05-05 13:14:23 2 test-maintenance
rclone purge remote:admin/script
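Note that purge removes the directory and everything in it on the remote. A cautious sketch: preview the operation with the standard --dry-run flag first, and consider 'rclone delete', which removes files but leaves the directory structure in place.

```shell
# Preview what purge would remove, without deleting anything.
rclone purge --dry-run remote:admin/script

# Alternative: delete files only, keeping the directory structure.
rclone delete --dry-run remote:admin/script
```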
Download the files from the remote drive ("remote") to your current working directory
rclone copy remote:admin .
Download speed on the dtn3 node: about 70 MB/s.
Note that the download speed is higher than the upload speed.
Files may be selectively copied from a directory using a "manifest file" containing the target file paths. The form of the command is:
rclone copy --files-from <manifest-file> [source] [dest]
for example
rclone copy --files-from scripts.txt /home/mrd20/scripts/ boxremote:scripts/
will copy the files listed in scripts.txt from the source directory /home/mrd20/scripts/ to the destination directory boxremote:scripts/.
The manifest file can reference a directory hierarchy. An example 'scripts.txt' is shown, along with the corresponding output at the destination after running the command.
scripts.txt:
hpcams/hellow.html
youdrawit/scrape.sh
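A manifest like this can also be generated rather than written by hand. A sketch using GNU find's -printf (GNU-specific; the files created here are synthetic examples, not the real source tree):

```shell
# Build a synthetic source tree to demonstrate against.
src=$(mktemp -d)
mkdir -p "$src/youdrawit" "$src/hpcams"
echo 'echo hello' > "$src/youdrawit/scrape.sh"
echo '<html></html>' > "$src/hpcams/hellow.html"

# %P prints each match relative to the starting point, which is
# the relative form that --files-from expects.
find "$src" -name '*.sh' -printf '%P\n' > manifest.txt
cat manifest.txt

# Then, for example: rclone copy --files-from manifest.txt "$src" boxremote:scripts/
```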
Contents of the directory boxremote:scripts/, using both 'lsd' to list only directories, and 'ls' which lists subdirectories and their contents:
[mrd20@hpctransfer]$ rclone lsd boxremote:scripts/
-1 2019-07-18 14:28:34 -1 hpcams
-1 2019-07-18 14:28:33 -1 youdrawit
[mrd20@hpctransfer]$ rclone ls boxremote:scripts/
805 youdrawit/scrape.sh
88 hpcams/hellow.html
For more information, see the rclone documentation on 'filtering' [4].
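Besides manifests, rclone's own filter flags can select files by glob pattern; these are interpreted by rclone itself, not expanded by the shell, which is why the pattern must be quoted. An illustrative sketch (the paths and remote name are hypothetical):

```shell
# Copy only .sh files from the source tree (quote the pattern so the
# shell does not expand it before rclone sees it).
rclone copy --include "*.sh" /home/<caseID>/scripts remote:scripts

# Or copy everything except a subdirectory.
rclone copy --exclude "old/**" /home/<caseID>/scripts remote:scripts
```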
rclone help flags | grep multi-thread
output:
--multi-thread-cutoff SizeSuffix Use multi-thread downloads for files above this size. (default 250M)
--multi-thread-streams int Max number of streams to use for multi-thread downloads. (default 4)
If --multi-thread-cutoff 250M and --multi-thread-streams 4 are in effect (the defaults), then per https://rclone.org/docs/:
0MB..250MB files will be downloaded with 1 stream
250MB..500MB files will be downloaded with 2 streams
500MB..750MB files will be downloaded with 3 streams
750MB+ files will be downloaded with 4 streams
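The rule above can be sketched as a small shell function (this is an illustrative restatement of the documented behavior, not rclone code; sizes are integer MB):

```shell
# Stream count for a given file size: one stream per full 250 MB
# cutoff plus one, capped at the 4-stream default maximum.
streams_for() {
    size_mb=$1
    cutoff=250
    max=4
    n=$(( size_mb / cutoff + 1 ))
    [ "$n" -gt "$max" ] && n=$max
    echo "$n"
}

streams_for 100    # 1 stream
streams_for 300    # 2 streams
streams_for 800    # 4 streams
```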
Transfer using rclone copy. Use the rclone sync command responsibly, i.e. do a trial run first with --dry-run, which makes no permanent changes. Limit --multi-thread-streams to 4, as it instantiates multiple threads (CPU usage up to 400% with 4 streams).
rclone copy -vv --multi-thread-cutoff 250M --multi-thread-streams 4 <remote-mount>:<path> .
output:
2020/10/05 15:41:19 DEBUG : 18mar31a_mk4_1_00005gr_00001sq_v03_00023hln_00005enn.frames.mrc: multi-thread copy: stream 1/4 (0-355991552) size 339.500M finished
2020/10/05 15:41:22 DEBUG : 18mar31a_mk4_1_00005gr_00001sq_v03_00023hln_00005enn.frames.mrc: multi-thread copy: stream 3/4 (711983104-1067974656) size 339.500M finished
2020/10/05 15:41:26 DEBUG : 18mar31a_mk4_1_00005gr_00001sq_v03_00023hln_00005enn.frames.mrc: multi-thread copy: stream 2/4 (355991552-711983104) size 339.500M finished
2020/10/05 15:41:26 DEBUG : 18mar31a_mk4_1_00005gr_00001sq_v03_00023hln_00005enn.frames.mrc: multi-thread copy: stream 4/4 (1067974656-1423899024) size 339.436M finished
2020/10/05 15:41:27 DEBUG : 18mar31a_mk4_1_00005gr_00001sq_v03_00023hln_00005enn.frames.mrc: Finished multi-thread copy with 4 parts of size 339.500M
2020/10/05 15:41:29 DEBUG : 18mar31a_mk4_1_00005gr_00001sq_v03_00023hln_00005enn.frames.mrc: MD5 = 167279fc343526d67314732f11dff76a OK
2020/10/05 15:41:29 INFO : 18mar31a_mk4_1_00005gr_00001sq_v03_00023hln_00005enn.frames.mrc: Multi-thread Copied (new)
2020/10/05 15:41:29 INFO :
Transferred: 6.631G / 6.631 GBytes, 100%, 114.151 MBytes/s, ETA 0s
Transferred: 5 / 5, 100%
Elapsed time: 1m0.3s
Example: syncing with an AWS S3 bucket (here, from the S3 bucket to a local directory)
rclone sync --multi-thread-cutoff 250M --multi-thread-streams 4 --s3-no-check-bucket <rclone-remote-name-for-AWS S3-bucket>:<S3-bucket-name>/<path-to-source-dir>/ <path-to-destination-dir>
Check the job script template (e.g. rcloneAWSjob.sh) below. The rclone path may have changed; check with "module display rclone". You can also distribute your transfers across the data transfer nodes (dtn2, dtn3).
#!/bin/bash
#SBATCH --time=48:00:00 # job will be terminated after 48 hours
while true; do
ssh dtn3 /usr/local/rclone/1.53.1/rclone sync --multi-thread-cutoff 250M --multi-thread-streams 4 --s3-no-check-bucket <source> <destination>
# wait for 60 minutes (i.e. 3600s) before initiating another rclone transfer command
sleep 3600
done
Run the job
sbatch rcloneAWSjob.sh
Below is a sample Slurm script that downloads a file from a remote (named BOX_DATA, with a folder named DATA), renames it, and uploads the renamed file back to Box, all from within the scratch directory.
#!/bin/bash
#SBATCH --time=5
#SBATCH -o dtn_transfer%j.log
# Change to scratch directory for job
cd $PFSDIR
# Transfer an input file from Box/DATA to the directory we are currently in
ssh dtn2 /usr/local/rclone/1.55.1/rclone copy BOX_DATA:/DATA/test_file.txt $PFSDIR
# Perform actions
mv test_file.txt result_file.txt
# Transfer a result file from the directory we are currently in to Box/DATA
ssh dtn2 /usr/local/rclone/1.55.1/rclone copy $PFSDIR/result_file.txt BOX_DATA:/DATA/
# Cleanup
rm result_file.txt
If you need to copy files from Google Drive to MS SharePoint via rclone installed on HPC, please contact hpc-supportATcaseDotEDU. To learn more about your options, please contact the CWRU Help Desk.
References:
[1] Google Drive: https://drive.google.com/
[2] Rclone Home: https://rclone.org/
[3] Rclone Google Drive remote: https://rclone.org/drive/
[4] Rclone filtering: https://rclone.org/filtering/