Summary: We trained the proposed model on 4 of 5 video sequences of a manipulator performing tabletop manipulation. The remaining sequence, which contained previously unseen counterclockwise motion, was held out from training and used for testing. Each sequence places the gears at different locations while the arm visits them in the same order, so the task measures the capacity of the proposed approach to relate visual cues to its predictions. In such a setting, standard prediction models would be expected to overfit to the training data and fail to predict the intended motion correctly unless they capture environmental cues.
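The leave-one-sequence-out protocol described above can be sketched as follows. This is a minimal illustration, not the authors' code: the sequence identifiers and the index of the held-out sequence are assumptions for illustration only.

```python
# Hypothetical sketch of the leave-one-sequence-out split described above.
# Sequence names and the held-out choice are placeholders, not the paper's data.
sequences = ["seq_0", "seq_1", "seq_2", "seq_3", "seq_4"]

# Assumed: the last sequence is the one containing the unseen
# counterclockwise motion, so it is reserved for testing.
held_out = "seq_4"

train_sequences = [s for s in sequences if s != held_out]
test_sequences = [held_out]

print(train_sequences)  # the 4 sequences used for training
print(test_sequences)   # the single held-out test sequence
```

Because only one sequence is withheld, a model that memorizes arm trajectories rather than attending to gear positions would still fit the training set yet fail on the held-out counterclockwise motion.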