As shown in Fig. 1, we use the mathematical model proposed by Zhao et al. [1] to simulate general cancer evolution and the effects of drug treatment.
Fig. 1. Cancer Simulator: each state transition during the 6-month simulation depends on the feature values from the previous step, the current dosage action, and survival analysis.
We randomly sample 10,000 subjects for training, 2,000 for validation, and 2,000 for testing. For all deep learning approaches, the reward and policy models share a similar neural network structure and hyper-parameters: two fully connected layers, the first followed by a ReLU activation and the second by a model-specific activation function. In each epoch, the agent and the reward model are updated with the trajectories of all training samples. The learning rate is set to 0.01, and all networks converge after 400 epochs. For deep RL methods, we set the discount factor γ to 1.
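The setup described above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in the experiments. The split sizes, learning rate, epoch count, discount factor, and two-layer ReLU structure come from the text; the input dimension, hidden width, weight initialization, and the choice of sigmoid and identity as the "model-specific" output activations are assumptions made only for illustration, and all function and class names are hypothetical.

```python
import math
import random

# Values stated in the text.
N_TRAIN, N_VAL, N_TEST = 10_000, 2_000, 2_000
LEARNING_RATE = 0.01
N_EPOCHS = 400
GAMMA = 1.0  # discount factor for the deep RL methods


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(x):
    # Assumed output activation (e.g. for a bounded dosage); not from the source.
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


def identity(x):
    # Assumed output activation for an unbounded reward; not from the source.
    return list(x)


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Uniform initialization scaled by fan-in (an illustrative choice)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class TwoLayerNet:
    """Two fully connected layers: ReLU after the first,
    a model-specific activation after the second."""

    def __init__(self, n_in, n_hidden, n_out, out_activation, rng):
        self.layer1 = init_layer(n_in, n_hidden, rng)
        self.layer2 = init_layer(n_hidden, n_out, rng)
        self.out_activation = out_activation

    def forward(self, x):
        hidden = relu(linear(x, *self.layer1))
        return self.out_activation(linear(hidden, *self.layer2))


def split_subjects(n_train=N_TRAIN, n_val=N_VAL, n_test=N_TEST, seed=0):
    """Randomly partition subject indices into train/validation/test sets."""
    rng = random.Random(seed)
    ids = list(range(n_train + n_val + n_test))
    rng.shuffle(ids)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])


def discounted_return(rewards, gamma=GAMMA):
    """Return sum_t gamma^t * r_t; with gamma = 1 this is the plain sum."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

For example, under these assumptions a policy network and a reward network would differ only in the output activation, as in `TwoLayerNet(2, 16, 1, sigmoid, rng)` versus `TwoLayerNet(2, 16, 1, identity, rng)`. With γ = 1, `discounted_return` weights every step of a trajectory equally.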
[1] Zhao, Yufan, Michael R. Kosorok, and Donglin Zeng. "Reinforcement learning design for cancer clinical trials." Statistics in Medicine 28.26 (2009): 3294-3315.