The first thing you should always try is adding print statements or using a debugger like DDT to track down where jetstream is getting stuck. Did you change the code and accidentally cause jetstream to get stuck in a loop somewhere?
Both Randy Belanger and Nathalie Saadeh ran into hanging issues with large cases (>1000 processors). Nathalie in particular saw hangs inside the PETSc subroutines. Tom Reist traced it back to a change in the 2019 Intel compilers that didn't "play nice" with the GPFS file system. Niagara was looking into it, but they didn't find a fix (at least at the time). To check whether this is the issue:
Try a case with a smaller number of blocks/processors and see if the mesh movement hangs when you run this case. If it does, then the issue probably isn't this, and you'll have to keep digging. Sorry! But if it passes, then try...
Revert to the old compilers and Niagara environment by removing all the "module load <...>" lines from your .bashrc and replacing them with the following (or just execute these commands once for a temporary fix):
module --force purge
module load NiaEnv/2018a
module load intel/2018.3
module load intelmpi/2018.3
Then do a clean make of any relevant packages (e.g. PETSc), and check configure.log to make sure it uses the 2018 versions listed above. To be safe, you may also want to do a make clean and a full Jetstream build with ./make_jetstream. Note that if you need METIS for TACS, you will not be able to use it with the above environment and compilers, so compile it separately as described here. If you try to load metis with the NiaEnv/2018a stack, you will be prompted to load intel/2018.2, but this will lead to compilation issues in the aeroelastic module.
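The configure.log version check can be scripted. A minimal sketch, assuming $PETSC_DIR points at your PETSc source tree (a hypothetical variable; adjust to your layout):

```shell
# Assumes $PETSC_DIR points at your PETSc source tree (hypothetical).
# List lines of configure.log that mention the compiler year, so you
# can confirm the 2018 toolchain was picked up:
grep -n "2018" "$PETSC_DIR/configure.log" | head

# A stray 2019 path here suggests stale build artifacts from the old
# environment; do a clean rebuild if anything shows up:
grep -n "2019" "$PETSC_DIR/configure.log" | head
```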
You could alternatively try using the 2022 or later versions of the above modules (generally preferred over the previous solution, since the newer toolchains are better supported):
module load NiaEnv/2022a
module load intel/2022u2
module load intelmpi/2022u2+ucx-1.11.23
As of September 2024, this also appears to have solved the issue, though it has not been thoroughly tested. If issues reappear, please report them to the system administrator. If you go this route, you may also see the missing shared library issue described below, and you will need to reinstall PETSc using the newer toolchain.
Once again, please first make sure that you did not change any code, and try running your code on a single thread to ensure the issue is related to MPI.
In 2022, Randy Belanger experienced intermittent issues with collective I/O, i.e. his code would run for several time steps and up to several hours before seeing the error messages. The number of time steps or run time was not consistent between runs, indicating that this was not an issue with the code itself, but rather the compiler. After lengthy discussions with Niagara (ask Randy or Alex for details), two solutions were found:
Compile Jetstream with OpenMPI instead of IntelMPI. This requires no modifications to Make.in (because OpenMPI uses mpif90, which is the default for FC), but a change from mpifort to mpif90 in SPARSKIT, and potentially a modification to some MPI_Waitall commands to ignore statuses. Jetstream was compiled and run with the following modules loaded:
NiaEnv/2021a, intel/2021u3, openmpi/4.1.2+ucx-1.10.1
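The SPARSKIT compiler-wrapper change can be done with a one-line substitution. A sketch, assuming the wrapper name appears in SPARSKIT's makefile (the path is hypothetical; check the actual file and variable names in your copy first):

```shell
# Hypothetical path; point this at your SPARSKIT makefile.
SPARSKIT_MAKEFILE=path/to/SPARSKIT/makefile

# Swap the IntelMPI Fortran wrapper (mpifort) for OpenMPI's (mpif90):
sed -i 's/\bmpifort\b/mpif90/g' "$SPARSKIT_MAKEFILE"
grep -n mpif90 "$SPARSKIT_MAKEFILE"   # confirm the change took effect
```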
SciNet (Compute Canada) also suggested the following set of modules, but these were never tested:
NiaEnv/2021a, gcc/10.3.0, openmpi/4.1.1
Note: some concerns were raised by Thomas Reist about OpenMPI leading to nondeterministic behaviour due to how it handles collective operations such as MPI_Allreduce, but for steady cases this is unlikely to cause noticeable differences, and it was not investigated further.
SciNet claimed to have updated the GPFS version (on $PROJECT, $SCRATCH, and $HOME) and fixed the issue, but this was never tested, and it is possible the issue may come back.
Annie was experiencing this bug in September 2024, and it was resolved by recompiling PETSc (ensuring that debug=off), then recompiling Jetstream (with petsc=on). This is a good example of how a make clean can often solve a lot of issues.
/home/z/zingg/[user_name]/bin/jetstream_[ver]_x86_64: error while loading shared libraries: libmkl_intel_thread.so.2: cannot open shared object file: No such file or directory
Some shared libraries clearly cannot be found. You can confirm this by running ldd /path/to/jetstream_executable and checking for the missing library, which in this case would appear as libmkl_intel_thread.so.2 => not found. You may need to do some digging as to why the library is not being found; most likely, the actual path to the library is not included in $LD_LIBRARY_PATH.
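A sketch of the diagnosis on a Linux login node (the executable path is illustrative; substitute your own build):

```shell
# Hypothetical executable path; substitute your own build.
EXE=$HOME/bin/jetstream_x86_64

# Any library the dynamic loader cannot resolve shows up as "not found":
ldd "$EXE" | grep "not found"

# Print the library search path one directory per line and look for
# the MKL directory the binary was linked against:
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i mkl
```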
In this specific case, the required library is part of MKL, Intel's math kernel library. These libraries are usually found because they live under $MKLROOT, which is indeed part of $LD_LIBRARY_PATH; here, however, two different parts of jetstream were compiled to refer to different $MKLROOT directories and shared libraries. When compiling PETSc, for example, if the 2019 compilers (NiaEnv/2019b, intel/2019u4, intelmpi/2019u4) were used, then this part of the code will reference
$MKLROOT = /scinet/intel/2019u4/compilers_and_libraries_2019.4.243/linux/mkl
but if 2022 compilers (NiaEnv/2022, intel/2022u2, intelmpi/2022u2+ucx-1.11.2) were used then this part of the code will instead reference
$MKLROOT = /scinet/intel/oneapi/2022u2/mkl/2022.1.0
and as long as the same compilers are used to compile the rest of the code, there will be no issue. The problem arises when the rest of jetstream is compiled with different compilers. In this case, PETSc was compiled with the 2022 compilers, but jetstream with the 2019 compilers. As a result, the libraries found in the 2022 directory /scinet/intel/oneapi/2022u2/mkl/2022.1.0 were not part of $LD_LIBRARY_PATH, resulting in the error message above.
There are two possible fixes:
(Quick and dirty) Before running any program (e.g. in the runScript.slurm file), add the appropriate directory to $LD_LIBRARY_PATH, which in this case means adding the following line:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/scinet/intel/oneapi/2022u2/mkl/2022.1.0
(Preferred) Ensure that you compile all necessary modules using the same compilers (you will need to run a make clean first). Once you finish compiling, double-check with ldd /path/to/jetstream_executable that every library resolves.
First check Randy or Alex's scripts for ideas on how to do these transfers. Another great resource is the Niagara wiki: https://docs.scinet.utoronto.ca/index.php/HPSS
If you get the following error (or something similar) when running a command pipeline like hsi | pigz | tar:
*** get: Error -5001 on transfer. <stdout> from /archive/z/zingg/path/to/tarball
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Could not read tarball /archive/z/zingg/path/to/tarball
Or even
HPSS EIO error, will retry in 10 seconds
HPSS EIO error, will retry in 60 seconds
*** unable to reposition file to offset 325,008,228,352 after write error
*** put: Error -1 on transfer. <stdin> to /archive/z/zingg/path/to/tarball
Then it is likely that your tarballs are too big and are overwhelming the available RAM. Although the Niagara wiki says that file sizes <1TB are optimal, in practice you should keep tarballs under 200GB (though Randy has had luck pushing it up to 500GB). So try:
Break your data into smaller chunks of 200GB, or
Break the pipeline, i.e. first generate the compressed tarball, then transfer it to HPSS as a standalone hsi job. The inverse applies when transferring back, i.e. first read the tarball into $SCRATCH, then uncompress it. You may run all of these commands sequentially in the same submission script, just not together in a single pipeline (i.e. piped to each other with the | operator).
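Both suggestions can be sketched in one submission script. The case directory and archive path below are illustrative; note that the tar/compress step and the hsi transfer run as separate, sequential commands rather than one long pipeline:

```shell
# Illustrative paths; adjust the case directory and archive location.
cd "$SCRATCH/my_case"

# Compress and split into ~200 GB pieces locally
# (pigz is a parallel gzip; plain gzip also works):
tar -cf - results/ | pigz | split -b 200G - results.tar.gz.part_

# Then push each piece to HPSS in a standalone hsi step (no pipes):
for f in results.tar.gz.part_*; do
    hsi "cd /archive/z/zingg/$USER; put $f"
done

# To restore: pull the pieces back into $SCRATCH first, then
#   cat results.tar.gz.part_* | pigz -d | tar -xf -
```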