Wang Chi Cheung's Webpage

I research online learning in non-stationary environments, where the decision maker (DM) may start with partial knowledge of the unknown online model.

Non-stationarity: Classical research provides an important foundation for online learning under stationarity. In non-stationary worlds, the classical sense of convergence no longer applies.

How should the DM adapt to drifts, while accumulating online information?

Partial knowledge: At the start of the e-commerce era, learning from scratch was often the norm. Correspondingly, most existing online learning models start with little or no knowledge of the latent model. By contrast, in the current data-rich era, the DM often has non-trivial knowledge of the latent model.

How/when should the DM embed his/her auxiliary knowledge in online learning? What sort of auxiliary knowledge is useful?

My research agenda sheds light on the above in two models: multi-armed bandits (MABs) and online resource allocation.

Using (possibly biased) Offline Data in MABs

In the current data-rich era, we rarely start learning from scratch. Nevertheless, blindly incorporating offline data can be detrimental. When the offline distribution that generates the offline data is "far from" the online distribution that models the online rewards, we should stick to the classical UCB. Otherwise, we should reap the benefits of offline data and outperform UCB.

Can we design an online policy that achieves the best of both worlds? In an ICML 2024 (spotlight) paper, my PhD student Lixing and I found the answer to be mixed:

(Possibly biased) offline data alone is insufficient.

• No information on the bias: no online algorithm can outperform UCB.

• Known upper bound on the bias: we design MIN-UCB, which outperforms UCB.

Our analysis of MIN-UCB not only gives the best possible regret bounds, but also yields some surprising insights. For example, biases in offline data can sometimes be beneficial to online learning!
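To make the "known upper bound on the bias" case concrete, here is a minimal Python sketch of a UCB index that takes the smaller of two upper confidence bounds: a classical one built from online samples only, and a pooled one that adds offline samples plus a penalty for their worst-case bias. This is an illustration under stated assumptions, not the exact MIN-UCB index from the paper: the function names, the confidence radius sqrt(2 log t / n), the penalty form, and the assumption that rewards lie in [0, 1] are all choices made for this sketch.

```python
import math
import random


def min_ucb_index(online_sum, online_n, offline_sum, offline_n, t, bias_bound):
    """Smaller of two upper confidence bounds for one arm (rewards in [0, 1])."""
    if online_n == 0:
        return float("inf")  # force one online pull of every arm
    log_term = 2.0 * math.log(max(t, 2))
    # Bound 1: classical UCB built from online samples only.
    ucb_online = online_sum / online_n + math.sqrt(log_term / online_n)
    # Bound 2: pooled mean of online and offline samples, inflated by the
    # largest shift the offline samples could cause (assumed penalty form).
    n = online_n + offline_n
    pooled_mean = (online_sum + offline_sum) / n
    ucb_pooled = (pooled_mean + math.sqrt(log_term / n)
                  + (offline_n / n) * bias_bound)
    return min(ucb_online, ucb_pooled)


def run_min_ucb(online_means, offline_data, bias_bound, horizon, seed=0):
    """Simulate the sketch on Bernoulli arms; return the list of pulled arms."""
    rng = random.Random(seed)
    k = len(online_means)
    on_sum, on_n = [0.0] * k, [0] * k
    off_sum = [sum(d) for d in offline_data]
    off_n = [len(d) for d in offline_data]
    pulls = []
    for t in range(1, horizon + 1):
        indices = [min_ucb_index(on_sum[a], on_n[a], off_sum[a], off_n[a],
                                 t, bias_bound) for a in range(k)]
        arm = max(range(k), key=lambda a: indices[a])
        reward = 1.0 if rng.random() < online_means[arm] else 0.0
        on_sum[arm] += reward
        on_n[arm] += 1
        pulls.append(arm)
    return pulls
```

When the bias bound is large, the pooled bound is never the smaller one and the sketch behaves like the classical UCB. When the bound is small, the offline samples tighten the index and reduce unnecessary exploration.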

Online resource allocation with few samples

In online non-stationary resource allocation over a horizon of T steps, a central challenge is the following impossibility result: o(T) regret cannot be achieved if the DM has no offline data. An o(T) regret means that the DM’s decisions converge to optimal decisions as T increases. How much data do we need for an o(T) regret?

In an OR paper, Guodong and I developed the Offline-to-Online (O2O) framework. O2O achieves a square-root-in-T regret, even when the DM has only M = 1 sample trajectory. Here, a sample trajectory is a realization of the T non-stationary demands in the T time steps. In other words, a single sample from each time step suffices for an o(T) regret.

As shown in Figure 2, O2O first (a) collects M sample trajectories. Then, it (b) constructs a bootstrap of MT samples, which serves to “stationarize” the data. After that, we harness online convex optimization tools in (c) to condense the information in the bootstrap. The condensation leads to a collection of “shadow price” vectors that inform us of the opportunity costs of different resources. Lastly, the shadow price vectors (d) are randomly sampled to induce our desired online allocation.
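The four steps above can be illustrated with a heavily simplified Python sketch for a single resource. It is not the O2O algorithm from the paper: the function names, the (reward, usage) request format, the gradient-descent step size, and the accept rule "reward exceeds price times usage" are all assumptions made here to show how the pieces fit together.

```python
import random


def learn_shadow_prices(trajectories, capacity, step_size=0.05, seed=0):
    """Steps (a)-(c): pool offline samples, bootstrap them, run dual descent."""
    rng = random.Random(seed)
    horizon = len(trajectories[0])
    # (a) + (b): pool the M*T offline requests and resample with replacement,
    # so the resampled sequence no longer depends on the time index.
    pool = [request for trajectory in trajectories for request in trajectory]
    bootstrap = [rng.choice(pool) for _ in range(len(pool))]
    budget_rate = capacity / horizon   # resource available per time step
    price, prices = 0.0, []
    # (c): online gradient descent on the dual price of the resource.
    for reward, usage in bootstrap:
        prices.append(price)
        consumed = usage if reward > price * usage else 0.0
        # Raise the price when consumption outpaces the budget, else lower it.
        price = max(0.0, price + step_size * (consumed - budget_rate))
    return prices


def allocate_online(requests, prices, capacity, seed=1):
    """Step (d): draw a stored price at random for each arriving request."""
    rng = random.Random(seed)
    remaining, total_reward, decisions = capacity, 0.0, []
    for reward, usage in requests:
        price = rng.choice(prices)
        accept = reward > price * usage and usage <= remaining
        if accept:
            remaining -= usage
            total_reward += reward
        decisions.append(accept)
    return total_reward, decisions
```

In this sketch the stored prices play the role of opportunity costs: a request is served only when its reward exceeds the sampled price of the resource it would consume.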

Because M can be any positive integer, O2O is highly practical. In terms of sample complexity, O2O in fact offers an exponential improvement over the state of the art by Jiang et al. 2024.

Reusable resources

Traditional inventory models assume that once a resource is allocated, it is consumed (depleted). I lead a research agenda that modernizes this view for the Sharing Economy (e.g., cloud computing, rental fashion, workforce scheduling), where an allocated resource unit can be re-allocated after a usage duration.

My PhD students and I tackled the complexity of context-dependent usage durations, where the time a resource is occupied depends on who uses it. In a series of research works, we consider settings that range from stationary to slowly changing, and then to adversarially changing.

Managing the usage durations of resources adds a layer of complexity compared to the classical settings with non-reusable resource allocation. Despite the challenges, with my PhD student Xilin, we demonstrate how to estimate the underlying fluid relaxation online, while maintaining an online allocation policy. With my PhD student Tianming, we demonstrate the surprising effectiveness of protection level policies, where the protection levels crucially depend on the revenues and usage durations of different customers’ contexts.

Non-stationary MABs

Our study starts with the seminal paper, which proposes a stochastic K-armed bandit model with a general non-stationarity setting. In their setting, the amount of non-stationarity is characterized by a scalar quantity V, the variation budget. We continue the research journey in this paper by analysing a well-known heuristic, SW-UCB (Sliding Window UCB), and by eliminating the need to know V in our algorithm design.

Our first contribution provides tight regret bounds on SW-UCB. Our novel analysis of SW-UCB involves an explicit decomposition of the regret into “regret of drift” (bias) and “regret of uncertainty” (estimation error), characterizing exactly how the optimal window length trades off these two error sources.
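For readers unfamiliar with SW-UCB, the following Python sketch shows the basic mechanism: the index of each arm is computed only from observations inside a sliding window, so older (possibly drifted) data is forgotten. The function names, the confidence radius, and the toy environment in which the better arm switches at round 1000 are assumptions for illustration, not the tuned version analysed in the paper.

```python
import math
import random
from collections import deque


def sw_ucb(reward_fn, num_arms, horizon, window, seed=0):
    """Sliding-window UCB: indices use at most the last `window` observations."""
    rng = random.Random(seed)
    recent = deque()            # (arm, reward) pairs inside the window
    counts = [0] * num_arms     # pulls of each arm inside the window
    sums = [0.0] * num_arms     # reward totals of each arm inside the window
    pulls = []
    for t in range(horizon):
        log_term = 2.0 * math.log(max(2, min(t + 1, window)))

        def index(arm):
            if counts[arm] == 0:
                return float("inf")    # unseen inside the window: explore it
            mean = sums[arm] / counts[arm]
            return mean + math.sqrt(log_term / counts[arm])

        arm = max(range(num_arms), key=index)
        reward = reward_fn(arm, t, rng)
        recent.append((arm, reward))
        counts[arm] += 1
        sums[arm] += reward
        if len(recent) > window:       # forget the oldest observation
            old_arm, old_reward = recent.popleft()
            counts[old_arm] -= 1
            sums[old_arm] -= old_reward
        pulls.append(arm)
    return pulls


def switching_bernoulli(arm, t, rng):
    """Hypothetical environment: the better arm switches at round 1000."""
    good_arm = 0 if t < 1000 else 1
    mean = 0.9 if arm == good_arm else 0.1
    return 1.0 if rng.random() < mean else 0.0
```

A short window forgets stale data quickly (small drift error) but leaves few samples per arm (large estimation error); a long window does the opposite. This is the trade-off that the regret decomposition above quantifies.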

To overcome the challenge of unknown V, we introduce the “Bandit-over-Bandit” (BOB) algorithm, featuring an unconventional marriage between a stochastic bandit algorithm (SW-UCB) and an adversarial bandit algorithm (EXP3). The synergy allows the DM to be competitive against the optimally chosen window length (via EXP3), while achieving our desired regret bound via the statistical estimation by SW-UCB.
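The two-layer idea can be sketched in Python as follows. The horizon is split into blocks; at the start of each block, EXP3 picks a window length from a candidate set, a fresh SW-UCB runs with that window, and the block's average reward is fed back to EXP3. The block length, candidate windows, exploration rate gamma, and all function names are assumptions of this sketch, and rewards are assumed to lie in [0, 1]; the paper's parameter choices are not reproduced here.

```python
import math
import random
from collections import deque


def run_sw_ucb_block(reward_fn, num_arms, start, length, window, rng):
    """Run a fresh SW-UCB for `length` rounds; return the total reward."""
    recent, counts, sums = deque(), [0] * num_arms, [0.0] * num_arms
    log_term = 2.0 * math.log(max(2, window))
    total = 0.0

    def index(arm):
        if counts[arm] == 0:
            return float("inf")
        return sums[arm] / counts[arm] + math.sqrt(log_term / counts[arm])

    for t in range(start, start + length):
        arm = max(range(num_arms), key=index)
        reward = reward_fn(arm, t, rng)
        recent.append((arm, reward))
        counts[arm] += 1
        sums[arm] += reward
        if len(recent) > window:           # forget the oldest observation
            old_arm, old_reward = recent.popleft()
            counts[old_arm] -= 1
            sums[old_arm] -= old_reward
        total += reward
    return total


def bandit_over_bandit(reward_fn, num_arms, horizon, block_len, windows,
                       gamma=0.1, seed=0):
    """EXP3 over candidate window lengths, SW-UCB inside each block."""
    rng = random.Random(seed)
    k = len(windows)
    weights = [1.0] * k
    total, chosen = 0.0, []
    for start in range(0, horizon, block_len):
        length = min(block_len, horizon - start)
        weight_sum = sum(weights)
        probs = [(1 - gamma) * w / weight_sum + gamma / k for w in weights]
        j = rng.choices(range(k), weights=probs)[0]
        block_reward = run_sw_ucb_block(reward_fn, num_arms, start, length,
                                        windows[j], rng)
        total += block_reward
        # EXP3 update: importance-weighted block reward, scaled to [0, 1].
        estimate = (block_reward / length) / probs[j]
        weights[j] *= math.exp(gamma * estimate / k)
        top = max(weights)
        weights = [w / top for w in weights]   # rescale to avoid overflow
        chosen.append(windows[j])
    return total, chosen
```

Because EXP3 makes no stochastic assumption on the block rewards, the outer layer can track a good window length even though the inner SW-UCB runs face a drifting environment.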

The journey did not stop at MABs: we continued with online RL. Then, my PhD students and I dove further into the forefront of combining non-stationary online resource allocation models with MAB feedback; see this and this.
