Active Reward Learning from Online Preferences