FAQ: Troubleshooting

This page is designed to help you resolve common issues. It covers problems you may encounter while accessing the cluster, submitting jobs, and managing files.

The viminfo file got corrupted

Problem: I am getting an error like "EXXX: viminfo: Missing '>' line ..." when trying to use the vi editor.

Solution: Your ~/.viminfo file is corrupted. You can simply delete it using:

rm -f ~/.viminfo

If you deleted home files accidentally, you may have deleted the ~/vim70 directory. Recover that directory from the snapshot.

My job enters the queue successfully, but it waits a long time before it gets to run

The job scheduling software on the cluster decides how best to allocate the cluster nodes to individual jobs and users. There are ways to make your job more likely to get nodes faster. Don't ask for more walltime or processors than your job requires. Debug your code at a small scale first, and then scale it up. In some cases, the users in your group may have already used most of the resources allocated to your group. In that case, email hpc-supportATcase.edu to become a member or to increase your shares. Note that members get priority over guests, and members with more shares have shorter queue times. Refer to access policies.

Unable to access Cluster via WiFi + VPN using your laptop

Please make sure that you are using your Case credentials (Case ID and SSO password) to log into the cluster. Also, run the password check and send us the result. You also need the Case VPN to connect to the cluster when you are not on the campus network or when you use the CaseGuest WiFi. The VPN can be obtained from vpnsetup.case.edu.

If you are not challenged with the password but are getting the following:

Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).

Check your ssh config file, ~/.ssh/config, for the parameter "PasswordAuthentication no" and comment it out by prefixing the line with #.
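
The edit can also be scripted. The following is a minimal sketch, assuming the setting appears on its own line in the space-separated form quoted above; the function name comment_out_password_auth is hypothetical, and a backup copy is written next to the original file.

```shell
# Sketch: comment out "PasswordAuthentication no" in an ssh client config.
# comment_out_password_auth is a hypothetical helper name, not a standard tool.
comment_out_password_auth() {
    config=${1:-$HOME/.ssh/config}
    tmp=$(mktemp) || return 1
    # Prefix the matching line with "# ", keeping its indentation.
    sed 's/^\([[:space:]]*\)\(PasswordAuthentication[[:space:]][[:space:]]*no\)/\1# \2/' \
        "$config" > "$tmp" || { rm -f "$tmp"; return 1; }
    cp "$config" "$config.bak"    # keep a backup of the original
    cat "$tmp" > "$config"        # overwrite contents, keeping file permissions
    rm -f "$tmp"
}

# Usage (on the machine you connect from):
#   comment_out_password_auth ~/.ssh/config
```

Lines that are already commented out are left untouched, because the pattern only matches when the keyword is the first non-blank text on the line.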

If you get an error similar to the one shown below while trying to access HPC using "ssh <caseID>@<login-node>.case.edu":

Connection closed by 129.22.x.x

Try connecting from another machine. If you can access the HPC cluster from there, the problem is likely with the VPN or your laptop configuration. Otherwise, email the HelpDesk regarding the VPN connection.

SSH Connection closed by remote host

If you are connecting via ssh, you might want to include the following option:

ssh -o ServerAliveInterval=60 <CaseID>@pioneer.case.edu

The connection may be closed when there is no ssh activity. If dropped ssh connections are a problem, we also suggest running the job in batch mode; see HPC Batch & Interactive Job.

SSH fails - Connection refused/terminated

If you connect using ssh -vvv <caseID>@<login-node>.case.edu and receive a connection error:

connect to address 192.168.252.113: Connection refused

then it is possible that the ssh daemon on the server has stopped. Also, your /home/<caseID> should have 75x permissions, NOT 77x.

ls -ld /home/<caseID>

drwxr-xr-x 94 <caseID> <GROUP> 53248 Aug 2 18:03 /home/<caseID> # it is 755

Please immediately contact us at hpc-supportATcase.edu

Bad owner or permissions on /home/<caseID>/.ssh/config

The permissions of the .ssh folder can't be 777. Please change them to 700.

chmod 700 ~/.ssh
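
Both permission problems (a 77x home directory and a 777 .ssh folder) can be checked in one pass. This is a rough sketch that reads the permission string printed by ls -ld; the function name is_group_or_other_writable is hypothetical.

```shell
# Sketch: report whether a directory is writable by group or others.
# is_group_or_other_writable is a hypothetical helper name.
is_group_or_other_writable() {
    # First 10 characters of "ls -ld" output, e.g. "drwxr-xr-x".
    perms=$(ls -ld "$1" | cut -c1-10)
    case "$perms" in
        ?????w*|????????w*) return 0 ;;   # group-write or other-write bit is set
        *) return 1 ;;
    esac
}

for dir in "$HOME" "$HOME/.ssh"; do
    if [ -d "$dir" ] && is_group_or_other_writable "$dir"; then
        echo "$dir is writable by group or others; fix it with chmod"
    fi
done
```

No output means neither directory is writable by group or others.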

Issue with ~/.gvfs

Possible symptoms are: (1) "df: '/home/$caseid/.gvfs': Transport endpoint is not connected" (2) "lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /home/$caseid/.gvfs Output information may be incomplete."

Solution: Issue the following command:

fusermount -u ~/.gvfs

SSH fails -- Permission denied (publickey,gssapi-keyex, gssapi-with-mic,password)

The ssh client configuration on your desktop or laptop has password authentication disabled. The solution is to set the following in /etc/ssh/ssh_config on your machine:

PasswordAuthentication yes

SSH Authentication Error/Unable to Copy File

You got an error similar to the following after submitting a job:

An error has occurred processing your job, see below.

Post job file processing error; job 4

Unable to copy file /var/spool/torque/spool/329467.hpcmaster.OU to abc123@hpclogin.tis.cwru.edu:

Solution: This is usually a symptom of the queuing system not being able to copy STDOUT/STDERR from the local disk on the compute nodes back to the login nodes. To do this, the queuing system requires SSH to be enabled with a passwordless key-pair. This should have been configured when your account was created, but could have been broken in a couple of ways:

1. Check the permissions of your home directory:

ls -l /home | grep abc123

drwxr-xr-x 53 abc123 xyz123 8192 Jun 9 08:00 abc123

You need to make sure your home directory (and the underlying .ssh subdirectory) is not writable by either the group or other users. SSH has a feature where it will ignore the contents of the .ssh subdirectory if another user is able to modify your authorized_keys file. If you need to fix the permissions of your home directory (or the underlying .ssh subdirectory), you can use the chmod command:

chmod 755 /home/abc123/

2. Check whether your SSH key-pair exists and is intact:

ls ~/.ssh/

authorized_keys id_rsa id_rsa.pub known_hosts

You should see your private key (id_rsa), your public key (id_rsa.pub) and a copy of your public key in the authorized_keys file.

If you need to restore your SSH key-pair, you can rerun the script that establishes the ssh key pairs:

---> Please type the command "sshpass" at the command prompt from your home directory /home/<caseID>.
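
Before restoring anything, the key-pair check from step 2 can be automated. The sketch below assumes the default file names listed above (id_rsa, id_rsa.pub, authorized_keys) and that the public key was copied into authorized_keys as an identical line; the function name pubkey_is_authorized is hypothetical.

```shell
# Sketch: verify that the key files exist and that the public key
# appears verbatim as a line in authorized_keys.
# pubkey_is_authorized is a hypothetical helper name.
pubkey_is_authorized() {
    ssh_dir=${1:-$HOME/.ssh}
    [ -f "$ssh_dir/id_rsa" ] || return 1
    [ -f "$ssh_dir/id_rsa.pub" ] || return 1
    [ -f "$ssh_dir/authorized_keys" ] || return 1
    # -F: fixed strings, -x: whole-line match, -f: patterns read from a file
    grep -qxF -f "$ssh_dir/id_rsa.pub" "$ssh_dir/authorized_keys"
}

# Usage:
#   pubkey_is_authorized && echo "key pair looks intact"
```

A failing check does not say which part is broken; listing the directory as in step 2 narrows it down.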

Getting bash prompt

If sshpass doesn't work, your .bashrc and .bash_profile may be missing or corrupted. You can recover them by copying the default versions from /etc/skel:

cp /etc/skel/.bashrc ~/.bashrc

cp /etc/skel/.bash_profile ~/.bash_profile

Host Key Has Changed

When connecting through ssh, you may encounter an error like the following:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!

Someone could be eavesdropping on you right now (man-in-the-middle attack)!

It is also possible that a host key has just been changed.

The fingerprint for the ED25519 key sent by the remote host is

...

Host key for markov.case.edu has changed and you have requested strict checking.

Host key verification failed.

When the host key changes (which occurs from time to time with routine admin activities on the login nodes), the cached copy in your ssh file ~/.ssh/known_hosts needs to be removed. You can edit the file directly and delete the lines containing 'markov' (or 'pioneer'), or even remove the file itself. Then, with the next new ssh session, the current host key will be recaptured into ~/.ssh/known_hosts. You may also need to do the same on the personal computer from which you are connecting to the cluster.
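
Deleting the matching lines can be scripted instead of done by hand. The sketch below assumes host names are stored in plain text in known_hosts; the function name remove_host_lines is hypothetical. If the entries are hashed, the standard OpenSSH command ssh-keygen -R <hostname> removes them instead.

```shell
# Sketch: drop every known_hosts line that contains a given host name.
# remove_host_lines is a hypothetical helper name.
remove_host_lines() {
    name=$1
    file=${2:-$HOME/.ssh/known_hosts}
    tmp=$(mktemp) || return 1
    status=0
    # grep -v exits with 1 when no lines remain; only a status above 1 is an error.
    grep -vF -- "$name" "$file" > "$tmp" || status=$?
    if [ "$status" -gt 1 ]; then
        rm -f "$tmp"
        return "$status"
    fi
    cp "$file" "$file.bak"    # keep a backup of the original
    cat "$tmp" > "$file"      # overwrite contents, keeping file permissions
    rm -f "$tmp"
}

# Usage:
#   remove_host_lines markov
#   remove_host_lines pioneer
```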

Forwarding & Connection Issues

Case 0: Ensure that the required visual application (e.g. Xming for PuTTY, XQuartz for Mac) is running on your local computer.

Also, consider using HPC Visual Access.

Case 1:

X connection to localhost:32.0 broken (explicit kill or server)

Solution: Start your Xming Server (for Cygwin, start XWin server)

Case 2:

X11 forwarding is disabled to avoid man-in-the-middle attacks

X connection to localhost:50.0 broken (explicit kill or server shutdown).

Solution: Confirm the issue by trying to log in from another machine. Try deleting the .Xauthority file in /home/<caseID>, as it might have been corrupted.

rm /home/<CaseID>/.Xauthority

Case 3:

/usr/bin/xauth: error in locking authority file /home/<user>/.Xauthority

Cause: The .Xauthority file was corrupted but could not be recreated because the hard storage limit had been reached.

Solution: Delete files to free up space; the change in usage may take some time to be reflected.

Case 4:

X11 connection rejected because of wrong authentication.

Can't open display: localhost:51.0

As in Case 2, remove the .Xauthority file; exit and reconnect.

Case 5:

connect /tmp/.X11-unix/X0: No such file or directory

Error: Can't open display: localhost:10.0

Check the DISPLAY environment variable on your local computer (echo $DISPLAY), ensure it includes 'localhost', and then reconnect.

Case 6:

srun: error: x11: unable to connect node <computeNode>

Solution: You can safely delete the file ~/.ssh/known_hosts, or just the offending entry for <compute node>.

rm ~/.ssh/known_hosts

Case 7:

If you get errors similar to:

"error in locking authority file" and "MoTTY X11 proxy: Authorization don't recognized"

Solution: Run the following command which should clear the locks on the .Xauthority file.

xauth -b

Offending RSA Key & HostKey verification Failed

You need to remove the line whose number appears after the colon (:) in the error message from your ~/.ssh/known_hosts file.

However, the message may show up again (it may not, as we have fixed the source of the conflict), in which case you would need to remove the known_hosts line again to be able to reconnect.
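
Removing a single line by its number can be done with sed. This is a minimal sketch; the function name delete_known_hosts_line is hypothetical, and the line number 12 in the usage comment is purely illustrative (use the number reported after the colon in your own error message).

```shell
# Sketch: delete one line, identified by its line number, from known_hosts.
# delete_known_hosts_line is a hypothetical helper name.
delete_known_hosts_line() {
    line=$1
    file=${2:-$HOME/.ssh/known_hosts}
    case "$line" in
        ''|*[!0-9]*)
            echo "line number must be a positive integer" >&2
            return 2 ;;
    esac
    tmp=$(mktemp) || return 1
    # "Nd" tells sed to delete line N and print everything else.
    sed "${line}d" "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    cat "$tmp" > "$file"      # overwrite contents, keeping file permissions
    rm -f "$tmp"
}

# Usage, for a message that ends in "known_hosts:12":
#   delete_known_hosts_line 12
```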

We suggest adding a config file at ~/.ssh/config that contains:

Host *

ServerAliveInterval 120

Still can't figure out the issue:

Please send us the verbose output obtained with the -vvv flag at hpc-supportATcase.edu

ssh -vvv <caseID>@<login-node>.case.edu

MATLAB License Checkout Failed

License checkout failed.

License Manager Error -18

...

Solution: Check the status of the licenses by clicking Software License Status. If all the licenses have been checked out, wait for some time. If they have been checked out for a long time, contact the user who has checked out most of the licenses. If you think the licenses are checked out most of the time, contact the Case Help Desk. You can also contact us.

MATLAB MDCS Validation failed

Refer to MATLAB site.

I can not submit my job using sbatch OR DOS/Linux Formatting issues

The error I got is as follows:

qsub: script is written in DOS/Windows text format

Solution: Convert the file from DOS to Unix format. Use the command below for all your script files.

dos2unix <slurm-script-file>

If you download an ASCII text file from the HPC to your Windows PC, its line endings may display incorrectly. In that case, convert the file to DOS format:

unix2dos <ASCII File>

(Note: These commands are only available on the login nodes.)
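
To find out whether a script needs converting at all, you can look for carriage-return characters, which DOS/Windows line endings add before each newline. A minimal sketch; the function name has_crlf and the file name job.slurm are hypothetical.

```shell
# Sketch: succeed if the file contains at least one carriage return (CR),
# which is the mark of DOS/Windows line endings.
# has_crlf is a hypothetical helper name.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Usage (job.slurm is a placeholder file name):
#   has_crlf job.slurm && dos2unix job.slurm
```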

I deleted important files on my home directory by accident. Is there a way to recover those files?

First of all, please double-check before using the unix remove commands (e.g. rm -rf <files>) next time. You can recover files either from snapshots or from backed-up storage. Please refer to Recovering Files.

In the text editor I am prompted about swap file. What is this and what should I do?

Sometimes an editor session is closed without quitting the editor properly. In that case, the editor leaves behind a swap file named .<filename>.swp. When you try to open the file again, you will see a screen similar to the following (using the Vi editor):

E325: ATTENTION

Found a swap file by the name ".<filename>.swp"

owned by: root dated: Wed Feb 12 17:03:27 2014

file name: <path-to-filename><filename>.sh

modified: YES

...

Solution:

In the screen itself, you should see what needs to be done. Recover your file and delete the swap file:

vi -r <filename>

rm .<filename>.swp

I am unable to delete my directory or files, what should I do?

If you do not have write permission for a file or directory, as shown below, you cannot delete it; the system complains about a write-protected directory. Contact us with the path to the directory or files to delete.

dr-xr-xr-x 2 sxg125 oscsys 20480 Mar 2 09:26 bin

dr-xr-xr-x 5 sxg125 oscsys 4096 Mar 2 09:26 boot

(Note: No w: write permission; Only rx: read & execute permission)

I am getting Globus Online Permission Denied Issue, what should I do?

Create the destination directory before including it in the Globus Online path. Globus Online cannot create a directory that does not already exist, which results in the permission denied error.

If you are using /mnt, provide the whole path /mnt/<path-to-directory> from where you are copying files.

I am getting the error "Write failed: Broken Pipe", what should I do?

You will get this error if you have too many processes open. In that case, you need to kill the defunct processes. Try to kill the processes from the specific login node first:

kill -9 `ps -ef | grep <caseID> | grep -v grep | awk '{print $2}'`

Replace <caseID> with your ID. If you are unable to do so, please contact us with the name of the login node.
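
Before killing everything, it can help to see which of your processes are actually defunct. The sketch below filters ps output for a process state beginning with Z, which is how ps marks defunct (zombie) processes; the function name filter_defunct is hypothetical, and it expects the PID in the first column and the state in the second.

```shell
# Sketch: read "PID STATE COMMAND" lines on stdin and print the PIDs
# whose state starts with Z (defunct).
# filter_defunct is a hypothetical helper name.
filter_defunct() {
    awk '$2 ~ /^Z/ { print $1 }'
}

# Usage on a login node (lists your own defunct processes):
#   ps -u "$(id -un)" -o pid= -o stat= -o comm= | filter_defunct
```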

I am getting the error "unrecognized command line option -std=c++11 or other gcc/g++ related issues", what should I do?

You need to load the gcc module. Check the available versions with "module spider gcc".

I am getting the qrun error while submitting the job

Unable to communicate with hpcmaster(192.168.207.245)

Cannot connect to specified server host 'hpcmaster'.

qsub: cannot connect to server hpcmaster (errno=111) Connection refused

Please contact us immediately.

I am affiliated with two or more groups but am getting a permission error; what should I do?

If you are affiliated with different groups and want to access a particular group's files, first change your group to that group using the command below:

newgrp <group-name>

If you want to run jobs under a group account that is not your default (primary) group, use the -A option in your job script as shown:

#SBATCH -A <group-name>

My job(s) quit prematurely; what could be the issue?

There may be an error in your script or your code. Please check your output file (*.o<jobID>). It can also happen because of insufficient memory. The default memory is 1gb, but you can request the desired value using:

#SBATCH --mem=5gb

Here, 5gb of memory has been requested. Sometimes a node becomes non-responsive due to either high memory usage or high load, and its time doesn't get updated appropriately. This causes skew between the Slurm server and client clocks and throws errors similar to the following:

slurmstepd: Munge decode failed: Expired credential

slurmstepd: Verifying authentication credential: Expired credential

/tmp/slurm/job1933654/slurm_script: line 22: 40975 Killed

You need to match your memory with the processor for high memory jobs. See the section "High Memory Job" at HPC Interactive and Batch Submission.
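
A quick way to check an output file for the messages quoted above is to search it for them. This is only a rough heuristic sketch: it looks for the two strings that appear in the example ("Killed" and "Expired credential"), and the function name find_kill_messages is hypothetical.

```shell
# Sketch: print (with line numbers) the lines of a job output file that
# contain the messages shown above. A match is a hint, not a diagnosis.
# find_kill_messages is a hypothetical helper name.
find_kill_messages() {
    grep -nE 'Killed|Expired credential' "$1"
}

# Usage (replace the placeholder with your job's output file):
#   find_kill_messages <output-file>
```

If nothing is printed, the job probably ended for another reason, such as an error in the script or the code.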

Why can't I delete my files?

There are files that you cannot delete on the cluster, such as .panfs.fcdfa8c0.1465877474774191000. This is a PanFS (Panasas) marker file that is tied to an open process working in a directory. Here is a more succinct explanation of this silly name: http://www.jasmin.ac.uk/faq/basic-linux/#panfs. Usually we find out about this file when we try to delete a directory that still has this .panfs... file in it. The file cannot be deleted, but it will disappear when the process associated with it is complete; typically it disappears when the user logs out.

I got a network error as shown below while accessing the cluster

"Network error: Software caused connection abort"

Refer to the network error section of the common error messages page for details: "This is a generic error produced by the Windows network code when it kills an established connection for some reason. For example, it might happen if you pull the network cable out of the back of an Ethernet-connected computer, or if Windows has any other similar reason to believe the entire network has become unreachable".

Is there a remote service in place to help resolve my issue related to HPC?

Besides email and office hours (Wed: 2-4 p.m.), a remote resolution service is available through Zoom.

Getting undefined symbol error immediately after issuing sbatch command?

There may be a conflict with other loaded modules. Make sure to have only the default modules, or run "module purge".

I am unable to access cluster via OnDemand (ondemand.case.edu)

This may be a DUO settings issue if you get an error like "We're sorry, access is not allowed because you are not enrolled ....". Refer to https://case.edu/utech/duo. Please email hpc-support[AT]case[DOT]edu with a screenshot if you are still unable to access it.

I am getting the following sbatch error after submitting the job using sbatch

sbatch: error: --gid only permitted by root user

Solution: change the group before submitting the job using:

newgrp <new-group>

The downloaded data are removed right away from the /scratch space

By default, when a file is downloaded, its timestamps are set to match those of the remote file, so a freshly downloaded file can look old enough to be removed from /scratch. With wget, use the --no-use-server-timestamps flag so that the local file gets the current timestamp when it is downloaded. Also, if you use rsync, don't use the -t or -a option.
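
For files that are already downloaded with old timestamps, the modification time can be reset to the current time with touch. A minimal sketch; the function name refresh_timestamps is hypothetical, and the rsync option set in the comment relies on -a being shorthand for -rlptgoD.

```shell
# Sketch: set the modification time of existing files to "now".
# refresh_timestamps is a hypothetical helper name.
refresh_timestamps() {
    # -c: do not create files that do not exist
    touch -c -- "$@"
}

# Usage:
#   refresh_timestamps <downloaded-file> ...
#
# When downloading, the same effect comes from the options named above:
#   wget --no-use-server-timestamps <URL>
#   rsync -rlpgoD <source> <destination>    # like -a, but without -t
```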

Trouble Opening the OnDemand link

Getting the following error:

Error -- invalid user name syntax: <CaseID>

Run 'nginx_stage --help' to see a full list of available command line options.

A best practice is to explicitly log out from a session and close the browser window to manage the session cookies. Removing the cookies through the browser should also help.
