As shown in Fig. 2, we employ the mathematical model derived in [1] to simulate the acute inflammation process in response to an infection.
Fig. 2. Sepsis Simulator: 8 of the 19 physiological state features are observable to the agent, and the action affects only a subset of them: Na, PI, and AI.
State Transition: As shown in the feature interaction network, the state features influence one another's dynamics through a system of Ordinary Differential Equations (ODEs).
Whenever a blood purification operation is performed, three components in the circulation are eliminated: the activated neutrophils Na and the pro- and anti-inflammatory mediators PI and AI. Readers can refer to [1] for a detailed description of the 18 ODEs governing feature interactions and the 3 ODEs modeling the hypothesized mechanism of blood purification.
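To make the transition step concrete, the following is a minimal sketch of one simulation step. The toy 3-feature dynamics function `toy_dynamics` and the clearance fraction are illustrative placeholders, not the full 21-ODE system or the rates reported in [1]:

```python
# A minimal sketch of the state-transition step; the dynamics and the
# clearance fraction are illustrative stand-ins for the system in [1].
import numpy as np
from scipy.integrate import solve_ivp

def toy_dynamics(t, x):
    """Toy interaction among x = [Na, PI, AI] (illustrative only)."""
    na, pi, ai = x
    d_na = 0.1 * pi - 0.05 * na               # neutrophil activation driven by PI
    d_pi = 0.2 * na - 0.1 * ai - 0.02 * pi    # PI produced by Na, damped by AI
    d_ai = 0.05 * pi - 0.01 * ai              # AI follows PI with slow decay
    return [d_na, d_pi, d_ai]

def step(state, purify, dt=1.0, clearance=0.5):
    """Advance the ODEs by dt; if purify, remove part of Na, PI, AI."""
    sol = solve_ivp(toy_dynamics, (0.0, dt), state, t_eval=[dt])
    nxt = sol.y[:, -1]
    if purify:                          # blood purification eliminates a
        nxt = nxt * (1.0 - clearance)   # fraction of circulating Na, PI, AI
    return nxt

state = np.array([1.0, 0.5, 0.2])
state = step(state, purify=True)
```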
The parameters of the state transition functions (21 ODEs) characterize a subject, and we assume they follow normal distributions whose means and standard deviations are known from the existing literature. We sample subjects individually by generating ODE parameters via Markov-Chain Monte Carlo (MCMC) sampling of their posterior distributions. The parameters for each subject are accepted only if they are compatible with their respective distributions and the resulting simulated observations are close to the experimental data (measurements of 23 rats in [1]). To evaluate the treatment effects of different approaches more efficiently, we select subjects who would die without blood purification operations for our experiments.
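The sketch below illustrates the accept/reject logic of this subject-sampling loop using simple rejection sampling rather than a full MCMC chain; the prior moments, the stand-in simulator, the reference vector, and the tolerance `eps` are all hypothetical placeholders for the 21-parameter setup and rat measurements from [1]:

```python
# A hedged sketch of subject sampling: draw ODE parameters from their
# priors and accept only draws whose simulated observations are close
# to the experimental data. All numbers here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
prior_mean = np.array([0.1, 0.2, 0.05])    # literature means (illustrative)
prior_std = np.array([0.02, 0.05, 0.01])   # literature stds (illustrative)
reference = np.array([1.3, 0.8, 0.4])      # stand-in for rat measurements

def simulate(theta):
    """Stand-in for integrating the 21 ODEs with parameters theta."""
    return reference + rng.normal(0.0, 0.1, size=reference.shape)

def sample_subject(eps=0.3, max_tries=10_000):
    for _ in range(max_tries):
        theta = rng.normal(prior_mean, prior_std)  # draw from the priors
        if np.any(theta <= 0):                     # reject implausible draws
            continue
        obs = simulate(theta)
        if np.linalg.norm(obs - reference) < eps:  # close to experiments
            return theta
    raise RuntimeError("no acceptable subject found")

subject_params = sample_subject()
```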
We randomly sample 3,000 subjects for training, 1,000 for validation, and 1,000 for testing; all of them die when no blood purification operation is performed. The network structure of the deep RL approaches is similar to the models for the Cancer task, except that the backend network is LSTM-based due to partial observability.
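An illustrative sketch of such an LSTM-backed agent network is given below; the hidden size, the Q-value head, and the binary action space (purify or not) are assumptions, since the text only states that the backend is LSTM-based and that 8 features are observable:

```python
# Illustrative LSTM-backed Q-network for the partially observable setting;
# layer sizes and the action count are assumptions, not the paper's values.
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    def __init__(self, obs_dim=8, hidden=64, n_actions=2):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)   # Q-value per action

    def forward(self, obs_seq, hc=None):
        # obs_seq: (batch, time, obs_dim) sequence of observed features
        out, hc = self.lstm(obs_seq, hc)
        return self.head(out), hc   # Q-values at every step, carried state

q_values, _ = RecurrentQNet()(torch.randn(4, 16, 8))  # 4 trajectories, 16 steps
```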
Learning effective treatment policies for septic subjects is more difficult than for Cancer due to the larger state space and the partially observable environment. We therefore adopt the following training procedures to ensure robust learning: 1) mini-batch gradient descent with batch size 10,000 for both the reward estimator and the RL agents; 2) a learning rate of 0.01 for the RL agents and 0.001 for the reward estimator; 3) randomly sampling trajectory pairs from the latest 10 epochs for model updates.
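The following is a runnable sketch of this update schedule. The linear networks, the data collector, and the mean-output losses are hypothetical stand-ins; only the batch size, the two learning rates, and the latest-10-epoch sampling follow the settings above:

```python
# Sketch of the update schedule; networks, data, and losses are placeholders.
import random
import torch
import torch.nn as nn

agent = nn.Linear(8, 2)        # stand-in for the RL agent network
reward_net = nn.Linear(8, 1)   # stand-in for the reward estimator

def collect_trajectory_pairs(n=2_000):
    """Stand-in for gathering trajectory pairs in one epoch."""
    return [(torch.randn(8), torch.randn(8)) for _ in range(n)]

agent_opt = torch.optim.SGD(agent.parameters(), lr=0.01)         # RL agents
reward_opt = torch.optim.SGD(reward_net.parameters(), lr=0.001)  # estimator
BATCH = 10_000
recent = []   # per-epoch pools of trajectory pairs

for epoch in range(5):
    recent = (recent + [collect_trajectory_pairs()])[-10:]  # latest 10 epochs
    pool = [p for ep in recent for p in ep]
    batch = random.sample(pool, min(BATCH, len(pool)))
    xs = torch.stack([a for a, _ in batch])

    reward_opt.zero_grad()
    reward_net(xs).mean().backward()   # placeholder for the reward loss
    reward_opt.step()

    agent_opt.zero_grad()
    agent(xs).mean().backward()        # placeholder for the RL loss
    agent_opt.step()
```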
[1] Song, Sang OK, et al. "Ensemble models of neutrophil trafficking in severe sepsis." PLoS Computational Biology 8.3 (2012): e1002422.