Globus
Globus is a file transfer tool used by many universities to help researchers move and share big data. It provides fast, reliable, and secure file transfer, sharing, and publishing. It is built on GridFTP and can break a large file into several independent data streams. Those streams are transmitted from source to destination over multiple parallel connections instead of a single one, which improves transfer speed. Globus also tracks the progress of each transfer, finds alternate paths, provides status updates, and sends a report to the initiator when the transfer completes. The web interface is a simple drag-and-drop GUI, but the same services can be called from the command line on the HPC or from inside a Slurm job.
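For example, the transfer step can run inside a Slurm batch job. The sketch below is illustrative only: the endpoint UUIDs, paths, and job settings are placeholders, not real CWRU values, and it assumes the Globus CLI is installed and a one-time `globus login` has already been completed on the account.

```shell
#!/bin/bash
#SBATCH --job-name=globus-transfer
#SBATCH --time=01:00:00

# Placeholder UUIDs -- look up real ones with: globus endpoint search "CWRU"
SRC_EP="<source-endpoint-uuid>"
DST_EP="<destination-endpoint-uuid>"

# Submit the transfer only if the CLI is available and already logged in.
if command -v globus >/dev/null 2>&1 && globus whoami >/dev/null 2>&1; then
    globus transfer --recursive --label "slurm transfer" \
        "$SRC_EP:/home/<caseID>/results/" "$DST_EP:/~/results/"
else
    echo "globus CLI not available or not logged in; see the HPC Guide to Globus CLI"
fi
```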
Any storage system that is enabled as a Globus endpoint can be easily configured to allow file sharing. Data can be transferred from HPC storage, or directly from your own machine or your lab's machine if they are registered as endpoints. For more detail on server setup see the Globus Connect Server page.
Globus's Frequently Asked Questions are a good place to look for assistance. When using the command line, issue globus --help or read through the Globus Command Line Interface (CLI) documentation.
Notes
Globus is integrated with Case's SSO login, and after authentication users are placed in their /home directory.
You need to re-authenticate every 24 hours. If your transfer takes more than 24 hours, first check whether you have a large number of small files (see the note under Archiving files below). Then, to avoid re-authentication, use the 'advanced' pull-down below the login and password fields, which lets you set a value longer than 24 hours in the "Credential Lifetime (in hours)" field.
If you were a Globus user before February 13, 2016, and logged in with a Globus username/password, you can continue to do so by choosing Globus ID in the login dropdown.
Globus separates a large dataset into multiple streams that transfer in parallel. The number of parallel streams depends on the file's size: files under 50 MB are broken into 2 streams, files between 50 MB and 250 MB into 4 streams, and files over 250 MB into 8 streams.
Archiving files prevents parallelism, so zipped or tarred files will transfer in a single stream only.
However, if you have too many small files in a directory, archive or tar them so that a single tar file is in the range of a few GB to 100 GB. Otherwise, Globus can enter an "endpoint is too busy" state and your job will time out, restart, and time out again (source: here).
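The bundling step can be done with standard tar before starting the transfer. The directory and file names below are made up for illustration; substitute your own data directory.

```shell
# Create a sample directory of many small files (stand-in for real data).
mkdir -p smallfiles
for i in $(seq 1 100); do
    echo "record $i" > "smallfiles/part_$i.txt"
done

# Bundle them into one compressed archive: one tar file transfers in a few
# parallel streams instead of opening a connection per tiny file.
tar -czf smallfiles.tar.gz smallfiles

# List the archive contents to confirm everything was captured.
tar -tzf smallfiles.tar.gz | head
```

After the transfer, unpack on the destination with `tar -xzf smallfiles.tar.gz`.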
Case has enrolled in the Globus Provider plan under "cwru". This allows Case users and external collaborators with affiliated CWRU accounts (see the affiliate request form) to share folders for uploading/downloading (transferring) files.
External collaborators do not need a CWRU-affiliated account if they just want to transfer files; the Globus sharing option (see the Sharing End Points section below) can be used instead.
Globus transfer test data is available at https://fasterdata.es.net/performance-testing/DTNs/
Login to Globus
Browse to https://www.globus.org/
Continue and use one of the HPC endpoints, 'CWRU DTN Collection' or 'CWRU hpctransfer', to transfer files. Use your Case SSO credentials.
Setting Up Globus Connect Personal
Please follow the steps on this page: Globus Connect Personal | globus
Transferring Files
1. Visit GlobusOnline and start copying your first file to the cluster by clicking Manage Data -> Start Transfer
2. Choose 'CWRU DTN Collection' as the destination, and your PC as the source
3. Enter your CaseID and password when prompted
4. Drag and drop files, or select files and use the directional arrows to start transferring
Now you are ready to drag and drop files online, but make sure your Globus Connect Personal desktop application is running while the file transfer happens.
If you are not using the VPN (get Case VPN), authenticate MyProxy using the CLI:
ssh <globus-user-name>@cli.globusonline.org
endpoint-activate "<cwru-end-point>"
You will be prompted for your Case Credentials:
Enter username (Default: 'xxxx'):<caseID>
Enter password:
Globus-Command-Line-Interface for File Transfer
For convenience, you may want to create a script on the HPC to transfer files or directories using the Globus Command Line Interface instead of the GUI. Please visit the HPC Guide to Globus CLI.
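As a sketch of what such a script might look like, the example below submits several file pairs as a single task using the CLI's batch mode. The endpoint UUIDs and paths are placeholders, not real CWRU values; note that on older CLI versions `--batch` reads the pair list from stdin rather than taking a filename.

```shell
# Batch file: one "SOURCE_PATH DESTINATION_PATH" pair per line.
cat > transfer_list.txt <<'EOF'
/home/<caseID>/results/run1.dat /~/results/run1.dat
/home/<caseID>/results/run2.dat /~/results/run2.dat
EOF

# Placeholder UUIDs -- look up real ones with: globus endpoint search "CWRU"
SRC_EP="<source-endpoint-uuid>"
DST_EP="<destination-endpoint-uuid>"

# Submit all pairs as one transfer task (requires globus-cli and a prior `globus login`).
if command -v globus >/dev/null 2>&1 && globus whoami >/dev/null 2>&1; then
    globus transfer "$SRC_EP" "$DST_EP" --batch transfer_list.txt --label "batch transfer"
else
    echo "globus CLI not available or not logged in; install with: pip install globus-cli"
fi
```

Progress of a submitted task can then be checked with `globus task show <task-id>`.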
Sharing End Points
Log in to the Globus web interface and click on the Groups tab.
If you have already joined a group (e.g. cwru), you will see it listed. You can search for cwru or another group and send a request to join it in order to access the shared endpoints and/or share your files with others. Please also email hpc-support@case.edu indicating that you are affiliated with Case if you have not used your CaseID in your Globus email contact.
Transfer and/or Share Files
Log in to Globus and select Transfer Files.
Choose the endpoints (cwru# for HPC and/or PC endpoints) and provide credentials if prompted (e.g. SSO credentials for HPC endpoints).
Select the folder you wish to share.
Click on "share" link in the middle. It will prompt you to create share and then allows you to add access permission after clicking "create share" button.
Select people or group you want to share the folder with.
Set read/write permissions.
Check the status of the shared endpoint ("ENDPOINTS" -> Administered by you)
For details, visit Globus Data Sharing
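The same grant can be made from the CLI with `globus endpoint permission create`. In the sketch below, the shared-endpoint UUID, the directory path, and the collaborator's identity are all placeholders; it also writes a small local log of the request, which is just a bookkeeping convenience, not part of Globus.

```shell
# Placeholder values -- substitute your shared endpoint's UUID and a real identity.
SHARE_EP="<shared-endpoint-uuid>"
COLLABORATOR="colleague@example.edu"

# Grant read-only access to one directory on the shared endpoint.
if command -v globus >/dev/null 2>&1 && globus whoami >/dev/null 2>&1; then
    globus endpoint permission create "$SHARE_EP:/projects/dataset1/" \
        --permissions r --identity "$COLLABORATOR"
else
    echo "globus CLI not available or not logged in"
fi

# Keep a local record of the requested grant (optional bookkeeping).
echo "requested: r access for $COLLABORATOR on $SHARE_EP:/projects/dataset1/" \
    > share_request.log
```

Use `--permissions rw` instead of `r` to allow collaborators to write into the folder.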
Mounting USB Hard Drive
Follow these steps to allow Globus to recognize a USB external drive connected to your personal machine:
Ensure you have the latest version of Globus Connect Personal installed.
Open the Globus Connect Personal settings window to add your USB drive. On Mac OS X, click "Preferences" in the Globus Connect Personal menu. On Windows, select the Tools -> Options menu option. On Linux, select the File -> Preferences menu option.
Click on the "+" button and select your USB drive. Optionally, change the directory path that you would like to access.
Now, when you access your Globus Connect Personal endpoint on the Start Transfer page, choose the personal endpoint and replace the path with the path of your USB drive to browse/transfer files. For example, on a Mac you can start with /Volumes
Troubleshooting
Please check the Globus FAQ (https://docs.globus.org/faq/transfer-sharing/). You can also get Globus help using your task ID (https://app.globus.org/help).
Problem 1: Data transfer started but stalled after some time:
Solution: By default, when the storage quota is exceeded or there is insufficient disk space during your transfer, Globus periodically retries the transfer. The Event Log/Debug on the web app's Activity page will show "storage quota exceeded" or FILE_ACCESS warnings. If your transfer makes no progress for three days, it fails and you are notified via email. However, if you select the "fail on quota error" option, your transfer fails immediately rather than retrying for three days, and you are notified immediately via email.
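From the CLI, the equivalent of the web option is the `--fail-on-quota-errors` flag on `globus transfer` (verify availability with `globus transfer --help` on your CLI version). Endpoint UUIDs and paths below are placeholders.

```shell
# Placeholder UUIDs -- look up real ones with: globus endpoint search "CWRU"
SRC_EP="<source-endpoint-uuid>"
DST_EP="<destination-endpoint-uuid>"

# Fail immediately on quota errors instead of retrying for three days.
if command -v globus >/dev/null 2>&1 && globus whoami >/dev/null 2>&1; then
    globus transfer --recursive --fail-on-quota-errors \
        "$SRC_EP:/data/" "$DST_EP:/~/data/"
else
    echo "globus CLI not available or not logged in"
fi
```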
Problem 2: Details: 500 Command failed. : Chunk not ready before timeout.\r\n
Solution: The files in question are likely no longer on disk. Recommended steps:
1) Cancel the job
2) Free up space on the 'CWRU Research Archive' endpoint to avoid further QUOTA_EXCEEDED faults
3) Resubmit the job with one of the sync options selected, so that only files that have not already been transferred are sent
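Step 3 maps to the CLI's `--sync-level` option, which skips files that already exist at the destination. The endpoint UUIDs and paths below are placeholders.

```shell
# Placeholder UUIDs -- look up real ones with: globus endpoint search "CWRU"
SRC_EP="<source-endpoint-uuid>"
DST_EP="<destination-endpoint-uuid>"

# --sync-level values: exists, size, mtime, checksum.
# checksum is the most thorough (re-sends files whose content changed) but slowest.
if command -v globus >/dev/null 2>&1 && globus whoami >/dev/null 2>&1; then
    globus transfer --recursive --sync-level checksum \
        "$SRC_EP:/archive/project/" "$DST_EP:/~/project/"
else
    echo "globus CLI not available or not logged in"
fi
```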
Problem 3: Globus going into "endpoint is too busy" state
Solution: If you have too many small files in a directory, archive or tar them so that a single tar file is in the range of a few GB to 100 GB. Otherwise, Globus can enter an "endpoint is too busy" state and your job will time out, restart, and time out again (source: here). From Globus support:
Because of the way that GCS->GCP transfers work, each file transferred will require its own data channel connection. A normal GCS endpoint will have 1000 ports available in its data port range for data channel connections. When this port range is exhausted, then attempts by endpoints to establish new data channel connections will generate the sort of error you are reporting here. If a GCS endpoint has many GCP endpoints attempting to transfer files from it, the chance of these sorts of errors becomes greater. This sort of error is especially common for GCS->GCP transfers involving so-called Lots of Small Files (LoSF) datasets, where the data port range on the GCS endpoint can be easily exhausted by one or more GCP endpoints transferring many, many small files faster than the ports in the data port range can be recycled by the operating system.
One solution for this sort of issue is to change the presentation of the datasets made available on the endpoint such that they are no longer LoSF in nature. The idea would be to change the datasets to be a few large archive files rather than many small loose files. This would make it so that each GCS->GCP transfer would use fewer data channel connections when conducting the transfer, and those connections would also be in use longer, thus ensuring that the OS on the GCS endpoint has much more time to recycle data port range ports so that they aren't quickly exhausted as can happen in an LoSF scenario. This is the most common solution for sites that are experiencing this sort of issue.