Zixian Yang and Lei Ying, “Learning-Based Pricing and Matching for Two-Sided Queues,” arXiv preprint arXiv:2403.11093v2, 2024.
We consider a dynamic system with multiple types of customers and servers. Each type of waiting customer or server joins a separate queue, forming a bipartite graph with customer-side queues and server-side queues. The platform can match the servers and customers and then the matched pairs leave the system. The platform will charge a customer a price according to their type when they arrive and will pay a server a price according to their type. The arrival rate of each queue is determined by the price according to some unknown demand or supply functions. Applications include ridesharing platforms such as Uber and Lyft.
The goal is to design pricing and matching algorithms to maximize the total profit of the platform over a finite time horizon and minimize the queue lengths of both customers and servers.
We proposed a learning-based pricing algorithm that yields a sublinear regret regarding the profit of the platform and a sublinear queue length bound. We established a near-optimal tradeoff between regret and queue length.
Zixian Yang, R. Srikant, and Lei Ying, “Learning While Scheduling in Multi-Server Systems with Unknown Statistics: MaxWeight with Discounted UCB,” in Proc. 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:4275-4312, 2023.
Multi-server queueing systems are widely used models for job scheduling in machine learning, wireless networks, crowdsourcing, and healthcare systems. We consider a multi-server system with multiple servers and multiple types of jobs, where different job types require different amounts of processing time at different. The statistics of the processing times are unknow and even nonstationary.
The goal is to schedule jobs on servers without knowing the statistics of the processing times to fully utilize the processing power of the servers.
We proposed MaxWeight with Discounted UCB algorithm for the scheduling problem in multi-queue multi-server systems with unknown statistics. We proved that the proposed algorithm is stable in both stationary and nonstationary settings.
For the proposed algorithm, we proved that the asymptotic average queue length is orderwise optimal with respect to the traffic slackness. We also obtained an upper bound on the moment generation function of any-time queue length.
The traditional multi-armed bandit (MAB) model for recommendation systems assumes the user stays in the system for the entire learning horizon. In new online education platforms such as ALEKS or new video recommendation systems such as TikTok, the amount of time a user spends on the app depends on how engaging the recommended contents are. Users may temporarily leave the system if the recommended items cannot engage the users. We want to understand the exploration, exploitation, and engagement in these systems.
We propose a multi-armed bandit with abandonment model, where the abandonment probability depends on the current recommended item and the user’s past experience. Our goal is to maximize the total reward per episode, where an episode ends when the user abandons (leaves) the system temporarily.
We proposed ULCB and KL-ULCB algorithms, which do more exploration when the user likes the previous recommended item and less exploration when the user does not. We proved that KL-ULCB is asymptotically optimal. Numerical results show that our algorithms significantly outperform the traditional UCB and KL-UCB algorithms, and have order-wise lower regrets than generic reinforcement learning algorithms like Q-learning.