Once you've launched a batch job on the high performance computer, you will want to know if it has started, if it is still running, and what happened once it ran.
Open a terminal on red.uits.iu.edu
Check if you have any jobs in the slurm queue
squeue -u $USER
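You may see one of a few things. In all cases, the first row is the column header, which with the default squeue settings looks like this:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)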
If that is all you see, your job is no longer in the queue: it has finished, failed, or been cancelled. See the next section about determining the fate of a job for next steps.
If there are lines below this, there is one line per job you have active. For example, you might see this
This shows that I have 6 jobs in the queue and tells me about each one. The '3474427_5' is the job number, which is helpful for checking in on your jobs while they run (for a job array, the part after the underscore is the task number within the array). The 'gpu-debug' is the partition, that is, the type of compute node the job is on. Yours should always be gpu. Let me know if it isn't. The 'lblVids' is the name of your job. The 'ehnewman' is you :-) . The 'PD' or 'R' indicates the job status: 'PD' means the job is pending and hasn't started yet, whereas 'R' means it is running. The numbers '0:00' or '0:01' show how long your job has been running. The last two columns tell you how many compute nodes are working on your job and which ones they are (for a pending job, the last column instead gives the reason it is still waiting).
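If you have a lot of jobs, counting the 'PD' and 'R' lines by eye gets tedious. Here is a small sketch of a helper that does the counting for you. The function name count_job_states is made up for this example, and it assumes squeue is printing its default columns, so that the status is the 5th column and the job name contains no spaces.

```shell
# Hypothetical helper: count jobs per status from squeue's default output.
# Assumes the status (ST) is the 5th whitespace-separated column.
count_job_states() {
    # Skip the header row (NR > 1), tally column 5, then print "STATE COUNT".
    awk 'NR > 1 { counts[$5]++ }
         END { for (s in counts) print s, counts[s] }' | sort
}
```

To use it, pipe the squeue output into it: squeue -u $USER | count_job_states. With a mix of waiting and running jobs you would get one line for PD and one line for R, each followed by the number of jobs in that state.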
Once your jobs are out of the queue, you'll want to know what they accomplished or whether they ended in error. To do this, you'll look at the *.out files the jobs left behind:
Open a terminal on red.uits.iu.edu
Navigate to your slurm job directory
cd ~/slurm_DLC
Check what *.out files are there
ls -lt *.out
The listing is sorted newest first, so scroll to the top to see which files were edited most recently
Look at the last few lines of the most recent file (For example, if this was a file called 'slurm-3474427_6.out')
tail slurm-3474427_6.out
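The last three steps can be rolled into one. This is only a sketch: the function name tail_latest_out is made up for this example, and it assumes your output files end in .out and have ordinary names without spaces.

```shell
# Hypothetical helper: show the end of the most recently modified *.out file.
# Usage: tail_latest_out [directory]   (defaults to the current directory)
tail_latest_out() {
    dir="${1:-.}"
    # ls -t sorts newest first, so the first entry is the most recent file.
    latest=$(ls -t "$dir"/*.out 2>/dev/null | head -n 1)
    if [ -z "$latest" ]; then
        echo "No .out files found in $dir" >&2
        return 1
    fi
    echo "== $latest =="
    tail "$latest"
}
```

For example, tail_latest_out ~/slurm_DLC prints the name of the newest *.out file followed by its last ten lines.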
At this point you could see several possible things. Here are a few possibilities based on a key word you might see among the text:
'COMPLETED!' If you were labeling videos, it means it finished labeling all videos! Hurrah!
'MEMORY' This shows up if your job ran out of memory. We can fix this by asking for more memory in the *.sh script.
'OUT OF TIME' This means your job didn't finish before the clock time ran out. If you were training a network, this is expected.
Random-looking text pointing to lines of code in various functions (a Python error traceback) - This means there is a problem in the python script. Tell me about it!
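If you have many *.out files to go through, the keyword check above can be sketched as a small function. The name job_fate is made up for this example. It searches the whole file rather than just the last few lines, and it assumes that a problem in the python script shows up as a standard Python traceback, which contains the word 'Traceback'. Anything it cannot place is reported as unknown, so you should still read those files yourself.

```shell
# Hypothetical helper: guess what happened to a job from keywords in its .out file.
# Usage: job_fate slurm-XXXX.out
job_fate() {
    file="$1"
    if grep -q 'COMPLETED!' "$file"; then
        echo "completed"
    elif grep -q 'MEMORY' "$file"; then
        echo "out of memory"
    elif grep -q 'OUT OF TIME' "$file"; then
        echo "out of time"
    elif grep -q 'Traceback' "$file"; then
        # Assumption: script errors arrive as a standard Python traceback.
        echo "python error"
    else
        echo "unknown"
    fi
}
```

To check every file in the job directory at once, you could run: for f in ~/slurm_DLC/*.out; do echo "$f: $(job_fate "$f")"; done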