Submitting Distributed InVEST Jobs on Palmetto

The Integrated Valuation of Ecosystem Services and Tradeoffs tool, or InVEST, is a free and open-source suite of models developed by the Natural Capital Project at Stanford University for determining the value of various ecosystem functions. The models are written in Python; there are standard installers for Windows and Mac OS, but the tool also works on Linux by setting up a Python environment and installing InVEST as a Python package. This enables us to leverage Clemson's high-performance Palmetto Cluster to speed up the computations and to distribute multiple jobs across many different processing nodes.

This tutorial will show you how to submit distributed InVEST jobs on Palmetto, which allows you to run multiple models simultaneously from one terminal. We will process the carbon sequestration values for 50 land use rasters in parallel to demonstrate the time savings.

Prerequisites

- An account on the Palmetto Cluster. You can request one here.

- Basic familiarity with using a command line interface, including writing and modifying text files.

- Basic familiarity with the Palmetto Cluster (storage, resources, file transfer, submitting jobs, etc.).

- Successful completion of the "Using the InVEST Tools on the Palmetto Cluster" tutorial.

Getting the Data

Download all files from the bottom of this page. This includes sample data, a PBS submission file, a shell script, and two Python files. The sample data are 50 land use/land cover rasters from the National Land Cover Database (2011), each stored as a unique TIF file. These show the land use for watersheds at the 12-digit Hydrologic Unit Code (HUC) level. The carbon sequestration parameters for each land use type are contrived for this exercise and are provided as an example only.

Workflow Overview

The workflow in this tutorial starts and ends on your computer, but all of the heavy lifting is done by the Palmetto Cluster. Our scripts take advantage of the vast computational resources provided to Clemson University students, faculty, and staff for free, allowing you to spend less time running your models and more time on the results and any post-processing steps you may take.

You will upload the input land cover rasters, which for convenience of file transfer are zipped into two archives, data1.zip and data2.zip. These are uploaded to the Palmetto Cluster and extracted into your scratch2 directory. A shell script then unzips each land use raster and organizes it into its own directory, and a Python script creates the datastack files for each of the 50 jobs. From there, you will submit the PBS job file to the queue and the processing will occur.

Preparing the Data for the Models

Using your preferred SSH client, log into Palmetto. For example, the MobaXterm Portable client can be downloaded here. Once downloaded, extract the file and run the program.

Select Session in the upper left-hand corner, then select SSH.

For the Remote host, type: login.palmetto.clemson.edu. Click OK.

When prompted, enter your Clemson username and password, then complete the Duo two-factor authentication.

When you are successfully logged in, upload all the files to your /scratch2/username folder.

Next, start an interactive session. In the command below, the -I flag requests an interactive job and -l specifies the requested resources and walltime.

qsub -I -l select=1:ncpus=1:mem=6gb,walltime=2:00:00

Next, switch to your /scratch2/username directory (substituting your own username).

cd /scratch2/username

Create a new directory called "carbon" within your scratch2/username directory.

mkdir carbon

Now we need to provide some information to the Python file, PBS submission file, and shell script. We will start with the Python file datastack.py, which will create a datastack file for each input file.

Open and modify the datastack.py file using a text editor. For example, using nano:

nano datastack.py

There are two variables at the top of the file: username and num_zips. Within the quotation marks, type your username next to the username variable.

Next you will provide the number of zip files, each containing a ".tif" file and its corresponding data files. For this tutorial there are 50 zip files, so change the 0 next to num_zips to 50. DO NOT CHANGE ANYTHING ELSE.
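
After your edits, the top of datastack.py should read something like this, with your actual username in place of the placeholder:

username = "your_username"
num_zips = 50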

Save the file by pressing ctrl+o followed by enter, then press ctrl+x to exit nano.

Now open and modify subDatastackArray.sub using a text editor. For example, using nano:

nano subDatastackArray.sub

There is one line and one variable we need to edit. First, edit the line "#PBS -J 1-x" by changing x to the number of zip files you have (note: this should be the same number you provided for the num_zips variable in the last step); for this tutorial, change x to 50.

Next, do the same as in the last step and enter your username within the quotes next to the username variable. DO NOT CHANGE ANYTHING ELSE. Save the file by pressing ctrl+o followed by enter, then press ctrl+x to exit nano.

Now we need to extract the sample data from the two archives. From the command line in your scratch2/username directory, type the following commands and press enter after each:

unzip -j data1.zip -d .

unzip -j data2.zip -d .

The -j flag tells unzip to discard ("junk") the archive's internal directory structure so all files land in one place, and the -d flag tells unzip to extract them to a specific location, in this case "." which represents the current directory we are working in.

Once all the files have been extracted, verify the results by viewing the contents of your directory.

ls

You should see 50 zip files, numbered 1 through 50, listed in the console.

Next we need to unzip the 50 zip files we just extracted; we have provided a shell script, get_data.sh, to make the process easier. First, open and modify get_data.sh using a text editor. For example, using nano:

nano get_data.sh

On line four (4) we are going to change x to the number of zip files we have; in our case that is 50, so replace x with 50. DO NOT CHANGE ANYTHING ELSE. Save the file by pressing ctrl+o followed by enter, then press ctrl+x to exit nano.

Next we need to give ourselves permission to run the shell script. To do so, type the following command and press enter:

chmod 777 get_data.sh

Now that we have permission to run the file, do so by typing the following command and pressing enter:

./get_data.sh

The shell script unzips each of the numbered zip files one at a time and places its contents in a folder with a matching name. For example, the contents of 13.zip will be extracted to a folder called 13.
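
The provided script is a short shell loop, but as a point of reference, the same logic written in Python would look roughly like this (an illustrative sketch, not the provided get_data.sh):

import os
import zipfile

num_zips = 50  # the same count you entered on line 4 of get_data.sh

for i in range(1, num_zips + 1):
    folder = str(i)                          # e.g. "13"
    os.makedirs(folder, exist_ok=True)       # create the matching directory
    with zipfile.ZipFile(f"{i}.zip") as zf:  # open 1.zip, 2.zip, ...
        zf.extractall(folder)                # extract 13.zip into 13/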

Now we need to create a JSON file containing a list of inputs, referred to as a datastack, along with the workspace directory where any intermediate files and the final output will be written. It also requires the name of the model to be run. Each of the 50 folders contains its own inputs, so we need a JSON file for each folder. We will use a Python file, datastack.py, that writes out a JSON file for each folder.
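
For a concrete sense of the format, the datastack JSON for folder 13 might look roughly like the following (an illustrative sketch only; the exact argument names depend on the carbon model and your InVEST version):

{
    "model_name": "natcap.invest.carbon",
    "invest_version": "3.x",
    "args": {
        "lulc_cur_path": "/scratch2/username/13/13.tif",
        "carbon_pools_path": "/scratch2/username/13/carbon_pools.csv"
    }
}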

Once again we need to give ourselves permission to run the Python file. To do so, type the following command and press enter:

chmod 777 datastack.py

Now we have permission to run the file and we can do so by typing the following command and pressing enter:

python datastack.py

The Python file, as mentioned above, creates a JSON file for each folder of inputs and names it datastack*.invest.json, replacing the * with the number of the folder. For example, the inputs of folder 13 will have a corresponding JSON file called datastack13.invest.json.
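
The provided datastack.py is not reproduced here, but its core logic might look roughly like this sketch, which builds a dictionary like the one shown earlier for each folder and writes it out (file and argument names are illustrative):

import json

username = "your_username"  # the value entered at the top of the file
num_zips = 50               # the value entered for num_zips

for i in range(1, num_zips + 1):
    datastack = {
        "model_name": "natcap.invest.carbon",  # model to run
        "invest_version": "3.x",               # depends on your install
        "args": {
            # inputs found in folder i; actual names depend on the model
            "lulc_cur_path": f"/scratch2/{username}/{i}/{i}.tif",
            "carbon_pools_path": f"/scratch2/{username}/{i}/carbon_pools.csv",
        },
    }
    with open(f"/scratch2/{username}/datastack{i}.invest.json", "w") as f:
        json.dump(datastack, f, indent=4)  # e.g. datastack13.invest.json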

The last step in preparing our data is to give ourselves permission to run the PBS submission file. To do so, type the following command and press enter:

chmod 777 subDatastackArray.sub

We are now ready to submit our InVEST jobs to the Palmetto Cluster! Exit your interactive session and continue to the next section: use the exit command until you are back at the login node, "username@login001".

exit

Submitting your InVEST Job for Distributed Computing

In the previous tutorial we created a PBS job submission file to run one model. In this tutorial we have provided a job submission file that runs a model for each of the inputs we created in the previous steps, all simultaneously, as a job array. Our completed submission file for this tutorial looks like this:

#!/bin/bash
#PBS -N investC
#PBS -l select=1:ncpus=1:mem=6gb,walltime=1:00:00
#PBS -j oe
#PBS -J 1-50

username='your_username'

cd $PBS_O_WORKDIR
module add anaconda3/5.1.0
source activate invest

invest -v -y -l -d /scratch2/$username/datastack${PBS_ARRAY_INDEX}.invest.json -w /scratch2/$username/carbon/output-${PBS_ARRAY_INDEX}/ carbon

The only difference between this submission file and the one in the last tutorial is the addition of the line "#PBS -J 1-50". The -J flag tells the job scheduler on Palmetto that we wish to run the same commands in more than one instance, creating an array of those instances. In our case we are running 50 instances of the same carbon model, but with the path name tweaked to point to a different JSON file in each instance. We do this using the ${PBS_ARRAY_INDEX} variable provided by the PBS job scheduler environment, which holds the index of the instance currently running. For example, in the 13th instance ${PBS_ARRAY_INDEX} equals 13, so the model reads the inputs contained in datastack13.invest.json.

We also use the ${PBS_ARRAY_INDEX} variable to name the workspace location that stores the outputs of each instance; in our case these are sub-directories located within the carbon directory we created earlier. Each sub-directory's number matches the instance number of the job, the number of the JSON file, and the name of the folder containing the inputs. For example, the outputs for instance 13 will be stored in a directory called output-13, whose path is "/scratch2/username/carbon/output-13".

Navigate to your /scratch2/username directory.

cd /scratch2/username

Submit the job array to the queue.

qsub subDatastackArray.sub

Shortly after submission you should see output folders beginning to appear.

ls /scratch2/username/carbon

Verify that the models ran. You should see a tot_c_cur.tif file, containing the carbon storage, in each output directory.

ls /scratch2/username/carbon/output-1

Congratulations! You now have successfully submitted an InVEST job array to Palmetto!

Preparing Results for Export out of the Palmetto Cluster

Now that you have successfully run your models on Palmetto, it is time to prepare your results for export. For simplicity, the only thing we are preparing for export is the "tot_c_cur.tif" file in each of the output directories. To do that we have provided a Python file, "exportResults.py", that will do the following (a sketch of this logic appears after the list):

    1. Create a new directory, "output_tifs", to store only the ".tif" files from each output directory.

    2. Rename each of the ".tif" files to include the index of its output directory (e.g. "tot_c_cur.tif" in the directory output-13 will be renamed "tot_c_cur-13.tif").

    3. Copy the newly named ".tif" file to the "output_tifs" directory.

    4. Once all of the ".tif" files have been renamed and copied to the "output_tifs" directory, that directory will be zipped up into a zip file called "output_tifs.zip".
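
The provided exportResults.py is not reproduced here either, but the four steps above might be implemented roughly as follows (an illustrative sketch; paths assume the directory layout created earlier):

import os
import shutil

username = "your_username"  # the value you will enter at the top of the file
num_tifs = 50               # the value you will enter for num_tifs

base = f"/scratch2/{username}/carbon"
export_dir = os.path.join(base, "output_tifs")
os.makedirs(export_dir, exist_ok=True)  # step 1: create output_tifs

for i in range(1, num_tifs + 1):
    src = os.path.join(base, f"output-{i}", "tot_c_cur.tif")
    dst = os.path.join(export_dir, f"tot_c_cur-{i}.tif")  # step 2: index the name
    shutil.copyfile(src, dst)                             # step 3: copy it over

# step 4: zip the directory into output_tifs.zip
shutil.make_archive(os.path.join(base, "output_tifs"), "zip", export_dir)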

To get started we need to provide some information to the Python file, similar to what we did in previous steps. Open and modify exportResults.py using a text editor. For example, using nano:

nano exportResults.py

Just like before, there are two variables near the top of the file: username and num_tifs. Within the quotes, type your username next to the username variable. Next, change the 0 next to num_tifs to the number of ".tif" files you wish to export; in our case it is 50. DO NOT CHANGE ANYTHING ELSE. Save the file by pressing ctrl+o followed by enter, then press ctrl+x to exit nano.

Next we have to give ourselves permission to execute the Python file we just modified. To do so, type the following command and press enter:

chmod 777 exportResults.py

With permissions set we can now run the Python file. To do so, type the following command and press enter:

python exportResults.py

There should be no output on the command line if it runs successfully. To verify the results of the script, type the following command (substituting username with your username) and press enter:

ls /scratch2/username/carbon/

You should see the newly created output_tifs directory and the output_tifs.zip file.

Congratulations! Now you can export the output_tifs.zip file from the Palmetto Cluster and view the results in the GIS of your choice!