Two-sided matching markets often involve information that unfolds over time through interviews, repeated interaction, learning, and separation. Existing matching models typically reduce this process to immediate sub-Gaussian feedback about fixed preferences, and so miss settings in which payoff-relevant information is revealed gradually and changes future matching decisions. We introduce a framework with temporally extended feedback that formulates two-sided matching as a partially observable Markov game with costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution. We instantiate this framework in Learn2Match, a multi-agent reinforcement learning benchmark in which decentralized agents choose whom to interview, whom to match with, and when to separate, while learning from noisy, delayed signals. Comparing independent PPO with a bandit-style explore-then-commit baseline (CA-ETC), we find that PPO attains higher welfare and lower regret, while CA-ETC retains lower information-friction loss through coordinated exploration.