XIPER enables RL agents to learn from unlabeled cross-domain videos, without access to ground-truth task rewards or low-dimensional state information. XIPER works by learning a cross-domain video prediction model that consists of two parts: (i) an expert video prediction model, trained to model expert behaviors, and (ii) a domain translation model, trained to map agent-domain observations into the expert domain. The likelihood that the expert prediction model assigns to the agent's translated observations is then used as the reward signal to train RL agents.
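The reward computation described above can be sketched as follows. This is a minimal toy illustration, not the actual XIPER implementation: the function names (`translate`, `expert_log_likelihood`, `xiper_reward`), the linear translation map, and the Gaussian next-frame model are all simplifying assumptions chosen to make the likelihood-as-reward idea concrete.

```python
import numpy as np

def translate(obs, W):
    # Hypothetical domain translation model: maps an agent-domain
    # observation into the expert domain (a linear map for illustration;
    # in practice this would be a learned network).
    return obs @ W

def expert_log_likelihood(prev, nxt, A, sigma=1.0):
    # Hypothetical expert video prediction model: predicts the next
    # expert-domain frame from the previous one (linear dynamics here),
    # then scores the observed next frame under a Gaussian.
    pred = prev @ A
    d = nxt - pred
    k = d.size
    return -0.5 * (d @ d) / sigma**2 - 0.5 * k * np.log(2 * np.pi * sigma**2)

def xiper_reward(obs_t, obs_t1, W, A):
    # Reward for the transition (obs_t -> obs_t1): the log-likelihood
    # of the translated transition under the expert prediction model,
    # so agent behavior that looks expert-like earns higher reward.
    z_t, z_t1 = translate(obs_t, W), translate(obs_t1, W)
    return expert_log_likelihood(z_t, z_t1, A)
```

Under this sketch, transitions that match the expert model's prediction score higher than those that deviate from it, which is exactly the signal the RL agent maximizes.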