3.2.2.1) Checkpoint System
AERPAW provides a set of tools for establishing "checkpoints" in scripts. A checkpoint can block execution on one node until a process on a different node completes. As a secondary feature, checkpoints can pass more general information about the state of an experiment between the nodes used in that experiment.
A Python script, which experimenter programs can import, wraps the functionality described below in functions that can be called on the fly. The script is located at: AERPAW-Dev/AHN/E-VM/Profile_software/checkpoint_system/checkpoint.py
A more complete description of the checkpoint system's API appears further below.
Example Usage with checkpoint.py
To take advantage of the checkpoint script, first copy or link the file from AERPAW-Dev/AHN/E-VM/Profile_software/checkpoint_system/checkpoint.py into the same directory as the script(s) that will use the functions provided.
Below are two example Python scripts, intended to run on two different nodes, that communicate with the checkpoint server run by AERPAW. Each script blocks until the other reaches a given point: Script 2 waits for Script 1 to reset the server, and Script 1 waits for Script 2 to complete a task (in this case, a time.sleep() call).
Script 1
from checkpoint import *
C_VM_HOST = "192.168.32.25"
C_VM_PORT = 12435
# at start of experiment, reset state of checkpoint server
reset_checkpoint_server(C_VM_PORT, C_VM_HOST)
# set a flag indicating to the other script that the server has been reset
set_checkpoint("experiment_started", C_VM_PORT, C_VM_HOST)
# wait for a flag to be set by the other script after it performs a "task"
wait_for_checkpoint("task_complete", C_VM_PORT, C_VM_HOST)
# set a flag indicating that the experiment has completed
set_checkpoint("experiment_complete", C_VM_PORT, C_VM_HOST)
# print something out indicating completion
print("done!")
Script 2
import time  # imported explicitly; this script calls time.sleep() directly
from checkpoint import *
C_VM_HOST = "192.168.32.25"
C_VM_PORT = 12435
# wait for script 1 to reset the server and unset the "experiment_complete" flag from previous run(s)
while check_checkpoint("experiment_complete", C_VM_PORT, C_VM_HOST):
    time.sleep(1)
# verify that the experiment has started
wait_for_checkpoint("experiment_started", C_VM_PORT, C_VM_HOST)
# do a "task" that the first script waits for completion on
time.sleep(5)
set_checkpoint("task_complete", C_VM_PORT, C_VM_HOST)
print("done!")
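The scripts above block until a flag appears. When a peer script may fail before setting its flag, a bounded wait can be preferable. The following is a minimal sketch, not part of checkpoint.py: wait_with_timeout is a hypothetical helper that polls any zero-argument callable, so it makes no assumptions about the checkpoint functions beyond the truthy return value that Script 2 already relies on for check_checkpoint.

```python
import time


def wait_with_timeout(check, timeout_s=30.0, poll_s=1.0):
    """Poll check() until it returns a truthy value or timeout_s elapses.

    Hypothetical helper (not provided by checkpoint.py).
    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

With checkpoint.py imported as in the scripts above, a call might look like wait_with_timeout(lambda: check_checkpoint("task_complete", C_VM_PORT, C_VM_HOST), timeout_s=60), after which the script can decide how to handle a False result.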
Example Usage with bash (equivalent of Script 1)
#!/bin/bash
# reset checkpoint server
curl -X POST http://192.168.32.25:12435/checkpoint/reset
sleep 1
# set a flag indicating that the server has been reset
curl -X POST http://192.168.32.25:12435/checkpoint/bool/experiment_started
# wait for flag to be set by other script
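while ((1)) ; do
    echo "waiting for task_complete"
    # -s suppresses curl's progress output; only the reply body is captured
    response=$(curl -s -X GET http://192.168.32.25:12435/checkpoint/bool/task_complete)
    # quote the variable so an empty reply does not break the test
    if [ "$response" == "True" ]; then
        break
    fi
    sleep 1
done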
# set a flag indicating experiment completion
curl -X POST http://192.168.32.25:12435/checkpoint/bool/experiment_complete
Additional Checkpoint System Functionality
The checkpoint system also supports setting and retrieving variables of several types:
boolean -> either true or false, set once
int -> increments every time it is "set"; starts at zero
string -> holds arbitrary data as a URL encoded string
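Since string values travel as URL-encoded text, characters such as spaces, colons, and percent signs need escaping before they are placed in a request. The sketch below shows one way to do that with Python's standard urllib.parse. The helper name string_set_url and the variable name "status" are illustrative assumptions; the URL layout follows the API described in this section. Whether the server returns the value decoded or still encoded is not stated here, so unquote is shown only as the inverse operation.

```python
from urllib.parse import quote, unquote

# Address used by the examples in this section.
BASE_URL = "http://192.168.32.25:12435/checkpoint"


def string_set_url(name, value, base_url=BASE_URL):
    """Build the POST URL that sets string variable `name` to `value`.

    Hypothetical helper; safe='' makes quote() escape every reserved character.
    """
    return f"{base_url}/string/{quote(name, safe='')}?val={quote(value, safe='')}"


value = "leg 2 done: 50%"
url = string_set_url("status", value)
print(url)
# http://192.168.32.25:12435/checkpoint/string/status?val=leg%202%20done%3A%2050%25

# unquote() reverses the encoding if a reply comes back still encoded.
print(unquote("leg%202%20done%3A%2050%25"))  # leg 2 done: 50%
```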
The checkpoint server is implemented as an HTTP service that can be reached at http://192.168.32.25:12435/checkpoint/.... Examples of interfacing with the server can be found in the previously mentioned file at AERPAW-Dev/AHN/E-VM/Profile_software/checkpoint_system/checkpoint.py. A complete description of the API is below.
The API is designed to use HTTP POST and GET requests as follows:
HTTP POST .../bool/variable -> sets boolean flag "variable" to true
HTTP POST .../int/variable -> increments integer "variable" by 1
HTTP POST .../string/variable?val=set-value -> sets value of string "variable" to val (in this case, "set-value")
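HTTP GET .../bool/variable -> returns the value of boolean flag "variable"
HTTP GET .../int/variable -> returns the value of integer "variable"
HTTP GET .../string/variable -> returns the value of string "variable"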
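In reply to a GET, the server sends the data back as a raw value. For example, the bash script above compares the reply for a boolean flag against the text True.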
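For programs that cannot import checkpoint.py, the requests listed above can be issued directly. The following is a minimal sketch using only Python's standard library. The function names (endpoint, http_post, http_get, parse_bool) are hypothetical and are not the functions defined in checkpoint.py. It also assumes that the server accepts a POST with an empty body and that a set boolean reads back as the text True, as the bash example suggests.

```python
import urllib.request
from urllib.parse import quote

# Address used by the examples in this section.
BASE_URL = "http://192.168.32.25:12435/checkpoint"


def endpoint(kind, name, base_url=BASE_URL):
    """Return the URL for variable `name` of type `kind`."""
    if kind not in ("bool", "int", "string"):
        raise ValueError(f"unknown variable type: {kind!r}")
    return f"{base_url}/{kind}/{quote(name, safe='')}"


def http_post(url, timeout=5):
    """Send a POST with an empty body (an assumption) and return the reply text."""
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode().strip()


def http_get(url, timeout=5):
    """Send a GET and return the raw reply text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode().strip()


def parse_bool(raw):
    """Interpret a raw boolean reply; "True" is the text the bash example tests for."""
    return raw.strip() == "True"
```

Under these assumptions, http_post(endpoint("bool", "task_complete")) would set the flag, and parse_bool(http_get(endpoint("bool", "task_complete"))) would read it back.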
All variables on the server can be reset with a single POST:
HTTP POST .../reset -> resets all variables to default values
A reset should be performed at the start of every experiment to avoid persisting any state between experiment runs.
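Because an integer variable starts at zero after a reset and increments by one on each POST, it could serve as an arrival counter when more than two nodes need to synchronize. The sketch below illustrates that idea and is not a feature of the checkpoint system itself: barrier is a hypothetical helper, and its increment and read_count arguments stand in for whatever calls the experimenter uses to POST to and GET from .../int/variable.

```python
import time


def barrier(increment, read_count, n_nodes, timeout_s=60.0, poll_s=1.0):
    """Announce arrival, then wait until n_nodes nodes have arrived.

    increment:  callable that POSTs to .../int/<name> (adds 1 to the counter)
    read_count: callable that GETs .../int/<name> and returns it as an int
    Returns True once the counter reaches n_nodes, False on timeout.
    """
    increment()
    deadline = time.monotonic() + timeout_s
    while read_count() < n_nodes:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True
```

A counter used this way works once per reset, since the API above offers no way to decrement it; a second synchronization point in the same run would need a differently named variable.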