Human-Object Interaction Prediction in Videos through Gaze Following