Extending existing learning approaches to include queueing-based models.
Models of many real-life applications, such as queueing models of communication networks, have a countably infinite state space. Learning procedures developed to produce optimal policies mainly focus on finite-state settings and do not apply to these models. To overcome this lacuna, in arXiv:2306.02574 we study optimal control of a family of discrete-time Markov Decision Processes (MDPs) governed by an unknown parameter, defined on a countably infinite state space with a finite action space and an unbounded cost function, with the goal of designing a learning algorithm that minimizes the regret relative to the long-term average-cost optimal policy. We propose an algorithm based on Thompson sampling with dynamic episodes: at the beginning of each episode, a posterior distribution over the unknown parameter is formed using Bayes' rule, a parameter estimate is sampled from this posterior, and the corresponding policy is used during the episode. In contrast to finite-state MDPs, where stability and the existence of a stationary distribution are easily ensured, the countable state-space setting requires additional conditions to guarantee ergodicity. By imposing such stability conditions and using the solution of the average-cost Bellman equation, we prove a sub-linear upper bound on the Bayesian regret. Finally, for two queueing models with unknown dynamics, we argue that our algorithm can be applied to develop approximately optimal control algorithms.
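To make the episodic structure concrete, the following is a minimal Python sketch of how such a Thompson-sampling control loop could be organized. It is illustrative only: the posterior object, the planner solve_avg_cost_policy, the environment interface, and the geometrically growing episode length are hypothetical placeholders standing in for the problem-specific components of the paper, not the paper's actual algorithm or implementation.

```python
def thompson_sampling_control(posterior, solve_avg_cost_policy, env, horizon):
    """Sketch of Thompson sampling with dynamically sized episodes.

    posterior:  assumed object with sample() and update(transition),
                encapsulating the Bayesian posterior over the unknown
                MDP parameter
    solve_avg_cost_policy: assumed planner returning an average-cost
                optimal policy for a given sampled parameter
    env:        assumed simulator of the true (unknown-parameter) MDP
                with reset() and step(action) -> (next_state, cost)
    """
    state = env.reset()
    t, episode = 0, 0

    while t < horizon:
        # Start of episode: draw a parameter from the current posterior and
        # compute the policy that would be optimal if that draw were true.
        theta_hat = posterior.sample()
        policy = solve_avg_cost_policy(theta_hat)

        # Follow the sampled policy for the whole episode.  A geometrically
        # growing episode length is used here purely as a placeholder for
        # the paper's dynamic episode schedule.
        episode_len = 2 ** episode
        for _ in range(episode_len):
            if t >= horizon:
                break
            action = policy(state)
            next_state, cost = env.step(action)
            # Fold the observed transition into the posterior (Bayes' rule).
            posterior.update((state, action, next_state, cost))
            state = next_state
            t += 1

        episode += 1
```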