PPO learning curves in the 20×20 market under the temporally extended feedback setting. Worker regret and firm regret decrease over training; social welfare increases and then converges. Friction loss converges to a non-zero value.
Comparison of PPO against CA-ETC in the 20×20 market under the temporally extended feedback setting. The results are consistent with those for the small market: PPO outperforms CA-ETC in both regret and social welfare, but CA-ETC has lower friction loss.
Left: interview coverage, i.e., the fraction of evaluation environments in which each pair (i, j) was interviewed at least once by the end of the episode.
Right: mean cumulative tenure of each pair (i, j) at the final period. Both panels are from the large market setting.
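The two statistics in this figure can be computed from episode logs roughly as follows. This is a minimal sketch, not the implementation used for the figure: the log layout (one list of periods per evaluation environment, each period a set of (i, j) pairs) and the function names `interview_coverage` and `mean_cumulative_tenure` are assumptions made for illustration, and cumulative tenure is assumed to mean the number of periods in which a pair was matched up to the final period.

```python
from typing import List, Set, Tuple

Pair = Tuple[int, int]          # (worker i, firm j)
Episode = List[Set[Pair]]       # one set of pairs per period (assumed log layout)


def interview_coverage(
    interviews: List[Episode], n_workers: int, n_firms: int
) -> List[List[float]]:
    """Fraction of environments in which pair (i, j) was interviewed at least once."""
    hits = [[0] * n_firms for _ in range(n_workers)]
    for episode in interviews:
        # Union over periods: pairs interviewed at least once in this episode.
        seen: Set[Pair] = set().union(*episode)
        for i, j in seen:
            hits[i][j] += 1
    n_envs = len(interviews)
    return [[h / n_envs for h in row] for row in hits]


def mean_cumulative_tenure(
    matches: List[Episode], n_workers: int, n_firms: int
) -> List[List[float]]:
    """Mean over environments of the number of periods pair (i, j) was matched."""
    total = [[0] * n_firms for _ in range(n_workers)]
    for episode in matches:
        for period in episode:
            for i, j in period:
                total[i][j] += 1
    n_envs = len(matches)
    return [[t / n_envs for t in row] for row in total]


# Toy example: 2 environments, a 2x2 market, 3 periods each (made-up data).
toy_interviews = [
    [{(0, 0)}, {(0, 0), (1, 1)}, set()],
    [{(0, 1)}, set(), {(1, 1)}],
]
toy_matches = [
    [{(0, 0)}, {(0, 0), (1, 1)}, {(0, 0), (1, 1)}],
    [{(0, 1)}, {(0, 1)}, {(1, 1)}],
]

coverage = interview_coverage(toy_interviews, n_workers=2, n_firms=2)
tenure = mean_cumulative_tenure(toy_matches, n_workers=2, n_firms=2)
```

In the toy data, pair (1, 0) is never interviewed or matched, so both of its entries are zero. Entries like this correspond to the unexplored pairs that appear as gaps in the coverage heatmap.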
Notably, although PPO achieves lower regret and higher social welfare than CA-ETC, its friction loss does not converge to zero, which indicates insufficient exploration. This gap arises because PPO samples only a limited set of pairs before committing to promising matches for long periods. In this sense, its behavior mirrors the career path dependence documented in labor economics, where relationships become more persistent with tenure and current matches shape future learning and mobility costs (Farber, 1996; Topel and Ward, 1992; Miller, 1984; Kambourov and Manovskii, 2009). While this commitment improves realized welfare through tenure-based learning, it leaves beliefs about many unobserved pairs close to their priors, highlighting the need to combine PPO-like adaptivity with CA-ETC-like coordinated exploration and stable-matching structure.