3.2.2.1) Checkpoint System

AERPAW provides a set of tools for establishing "checkpoints" in scripts. A checkpoint can block execution on one node until a process on a different node completes. As a secondary feature, checkpoints can pass more general information about the state of an experiment between the nodes being used within that experiment.

A Python script, checkpoint.py, wraps the functionality described below in functions that experimenter programs can import and call directly. The script is located at: AERPAW-Dev/AHN/E-VM/Profile_software/checkpoint_system/checkpoint.py

A more complete description of the API provided by the checkpoint system is further below.

Example Usage with checkpoint.py

To take advantage of the checkpoint script, first copy or link the file from AERPAW-Dev/AHN/E-VM/Profile_software/checkpoint_system/checkpoint.py into the same directory as the script(s) that will use the functions provided.

Below are two example Python scripts, intended to run on two different nodes, that communicate with the checkpoint server run by AERPAW. The scripts block and wait for each other to complete a task (in this case, a time.sleep() call).

Script 1

from checkpoint import *

C_VM_HOST = "192.168.32.25"
C_VM_PORT = 12435

# at start of experiment, reset state of checkpoint server
reset_checkpoint_server(C_VM_PORT, C_VM_HOST)

# set a flag indicating to the other script that the server has been reset
set_checkpoint("experiment_started", C_VM_PORT, C_VM_HOST)

# wait for a flag to be set by the other script after it performs a "task"
wait_for_checkpoint("task_complete", C_VM_PORT, C_VM_HOST)

# set a flag indicating that the experiment has completed
set_checkpoint("experiment_complete", C_VM_PORT, C_VM_HOST)

# print something out indicating completion
print("done!")


Script 2

import time  # imported explicitly rather than relying on the star import below

from checkpoint import *

C_VM_HOST = "192.168.32.25"
C_VM_PORT = 12435

# wait for script 1 to reset the server and unset the "experiment_complete" flag from previous run(s)
while check_checkpoint("experiment_complete", C_VM_PORT, C_VM_HOST):
    time.sleep(1)

# verify that the experiment has started
wait_for_checkpoint("experiment_started", C_VM_PORT, C_VM_HOST)

# do a "task" that the first script waits for completion on
time.sleep(5)
set_checkpoint("task_complete", C_VM_PORT, C_VM_HOST)

print("done!")
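If a script should give up when the other node never sets its flag, a bounded wait can be built on top of check_checkpoint. The sketch below is one way to do that. wait_with_timeout is a hypothetical helper, not part of checkpoint.py, and it assumes that check_checkpoint returns a truthy value once the flag is set (as the while loop in Script 2 suggests).

```python
import time


def wait_with_timeout(check, timeout_s, poll_s=1.0):
    """Poll check() until it returns a truthy value or timeout_s seconds pass.

    `check` is any zero-argument callable. Returns True if the flag was seen,
    False if the timeout expired first. (Hypothetical helper for illustration.)
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

With checkpoint.py imported as in the scripts above, a call might look like wait_with_timeout(lambda: check_checkpoint("task_complete", C_VM_PORT, C_VM_HOST), timeout_s=60), with the caller deciding how to handle a False result.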


Example usage with bash (duplicate of Script 1)

#!/bin/bash

# reset checkpoint server
curl -X POST http://192.168.32.25:12435/checkpoint/reset
sleep 1

# set a flag indicating that the server has been reset
curl -X POST http://192.168.32.25:12435/checkpoint/bool/experiment_started

# wait for flag to be set by other script
while true; do
    echo "waiting for task_complete"
    # -s hides curl's progress meter
    response=$(curl -s -X GET http://192.168.32.25:12435/checkpoint/bool/task_complete)
    # quote the variable so an empty reply does not break the test
    if [ "$response" == "True" ]; then
        break
    fi
    sleep 1
done

# set a flag indicating experiment completion
curl -X POST http://192.168.32.25:12435/checkpoint/bool/experiment_complete

Additional Checkpoint System Functionality

The checkpoint system also supports setting and retrieving values of several variable types, including the boolean flags used in the examples above.

The checkpoint server is implemented as an HTTP service that can be reached at http://192.168.32.25:12435/checkpoint/.... Examples of interfacing with the server can be found in the previously mentioned file at AERPAW-Dev/AHN/E-VM/Profile_software/checkpoint_system/checkpoint.py. A complete description of the API is below.

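The API uses HTTP POST requests to set values and GET requests to read them. For example, the boolean flags used in the examples above are set with a POST to http://192.168.32.25:12435/checkpoint/bool/<name> and read with a GET to the same URL.

On a GET, the server replies with the data as a raw value.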

All variables in the server can be reset by POSTing to http://192.168.32.25:12435/checkpoint/reset (the same request the bash example above issues first).

A reset should be performed at the start of every experiment to avoid persisting any state between experiment runs.
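For scripts that cannot include checkpoint.py, the requests described above can be issued directly from Python's standard library. The following is a minimal sketch that mirrors the curl calls in the bash example. All function names here are hypothetical and are not the names used in checkpoint.py. It assumes the server accepts a POST with an empty body, as curl sends, and it treats any GET reply other than "True" as "flag not set".

```python
import urllib.request

# Address used in the examples above; adjust for your experiment.
BASE_URL = "http://192.168.32.25:12435/checkpoint"


def flag_url(name, base_url=BASE_URL):
    """Build the URL for a boolean flag, e.g. <base>/bool/task_complete."""
    return f"{base_url}/bool/{name}"


def parse_bool_reply(raw):
    """Interpret the raw reply body of a GET on a boolean flag.

    The bash example compares the reply to "True"; treating every other
    reply as "not set" is an assumption of this sketch.
    """
    return raw.strip() == "True"


def http_post(url, timeout=5):
    """POST with an empty body, like `curl -X POST <url>`."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode()


def http_get(url, timeout=5):
    """GET the URL and return the reply body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode()


def reset_server(base_url=BASE_URL):
    """Clear all variables; intended for the start of an experiment."""
    return http_post(f"{base_url}/reset")


def set_flag(name, base_url=BASE_URL):
    """Set a boolean flag by POSTing to its URL."""
    return http_post(flag_url(name, base_url))


def flag_is_set(name, base_url=BASE_URL):
    """Read a boolean flag by GETting its URL and parsing the raw reply."""
    return parse_bool_reply(http_get(flag_url(name, base_url)))
```

Defining these helpers performs no network traffic. A node acting like Script 1 would call reset_server() once at startup, then set_flag("experiment_started"), and poll flag_is_set("task_complete") until it returns True.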