Original input video, from which we want to remove some background objects:
Mask video: indicates the regions to be removed from the given short video clip.
Note that the mask is automatically generated by a deep learning-based object segmentation model.
The user only needs to mark a rectangular box in the first frame of the given short video clip.
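As a rough illustration of the user input only, the sketch below turns a rectangular box on the first frame into a binary mask. The function name `box_to_mask`, the `(x0, y0, x1, y1)` box layout, and the frame size are assumptions made for this example; the segmentation model that refines the box into an object mask and tracks it through the remaining frames is not shown here.

```python
import numpy as np


def box_to_mask(height, width, box):
    """Return a uint8 mask for one frame: 1 inside the box, 0 elsewhere.

    `box` is a hypothetical (x0, y0, x1, y1) tuple in pixel coordinates,
    with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    # Clamp the box to the frame so out-of-range clicks do not fail.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[y0:y1, x0:x1] = 1
    return mask


# Example: a 320x240 first frame with a box drawn by the user.
first_frame_mask = box_to_mask(240, 320, (100, 50, 180, 120))
```

A box mask like this is only a coarse starting point; in the pipeline described above, the per-frame mask comes from the segmentation model rather than from the rectangle itself.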
Output video: the unwanted background object has been removed from the video clip.
See this presentation for more details on the proposed video inpainting method.