Figure: The proposed MV generation framework.
Automatic Music Video Generation Based on Simultaneous Soundtrack Recommendation and Video Editing
In the proposed music video (MV) generation system, uniform video segmentation is first applied to divide a queried long user-generated video (UGV) into several video segments.
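The uniform segmentation step can be sketched as splitting the video timeline into fixed-length windows. This is a minimal sketch; the segment length (2 seconds here) and the (start, end) timeline representation are illustrative assumptions, not values from the paper.

```python
def uniform_segment(duration_s: float, seg_len_s: float = 2.0) -> list[tuple[float, float]]:
    """Split [0, duration_s] into consecutive fixed-length segments.

    The last segment may be shorter than seg_len_s. Segment length is an
    assumed parameter, not the paper's setting.
    """
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + seg_len_s, duration_s)
        segments.append((start, end))
        start = end
    return segments

# Example: a 7-second UGV yields three full segments plus a 1-second remainder.
print(uniform_segment(7.0))  # [(0.0, 2.0), (2.0, 4.0), (4.0, 6.0), (6.0, 7.0)]
```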
For each video segment, a multi-task deep neural network (MDNN) is adopted to predict pseudo acoustic (music) features from the visual (video) features; this step is called pseudo song prediction.
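The multi-task structure can be illustrated by a shared trunk feeding several task-specific output heads. The sketch below is a forward pass only, with random weights; the layer sizes, the two example feature groups (timbre-like and rhythm-like), and the use of NumPy rather than a deep-learning framework are all illustrative assumptions, not the paper's trained MDNN.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class TinyMDNN:
    """Toy multi-task network: one shared layer, two task heads.

    All weights are random; a real MDNN would be trained so that each head
    regresses one group of acoustic features from the visual features.
    """

    def __init__(self, d_vis=128, d_hid=64, d_timbre=20, d_rhythm=8):
        self.W_shared = rng.normal(scale=0.1, size=(d_vis, d_hid))
        self.W_timbre = rng.normal(scale=0.1, size=(d_hid, d_timbre))
        self.W_rhythm = rng.normal(scale=0.1, size=(d_hid, d_rhythm))

    def predict(self, visual_feat):
        h = relu(visual_feat @ self.W_shared)          # shared representation
        return h @ self.W_timbre, h @ self.W_rhythm    # two task-specific heads

model = TinyMDNN()
pseudo_timbre, pseudo_rhythm = model.predict(rng.normal(size=128))
print(pseudo_timbre.shape, pseudo_rhythm.shape)  # (20,) (8,)
```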
A dynamic time warping (DTW) algorithm with a pseudo-song-based deep similarity matching (PDSM) metric is then used to align the UGV with each candidate music track in the acoustic feature space.
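The alignment step is standard DTW over two feature sequences. In this sketch a plain Euclidean frame distance stands in for the paper's learned PDSM metric, which is an assumption for illustration.

```python
import numpy as np

def dtw_cost(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """Accumulated DTW cost between two feature sequences (rows = frames).

    Uses Euclidean frame distance as a stand-in for the learned PDSM metric.
    """
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # Allow the usual match / insertion / deletion moves.
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

a = np.array([[0.0], [1.0], [2.0]])
print(dtw_cost(a, a))  # identical sequences align with zero cost: 0.0
```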
Based on the DTW alignment, the video editing module then selects and concatenates segments of the UGV according to target and concatenation costs, generating a music-compliant, professional-looking video for each candidate music track.
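Selection under a target cost plus a concatenation cost is a classic dynamic-programming (Viterbi-style) problem: pick one segment per music slot so the summed costs are minimal. The cost matrices below are toy numbers, and the exact cost definitions are assumptions; the paper defines its own target and concatenation costs.

```python
import numpy as np

def select_segments(target: np.ndarray, concat: np.ndarray) -> list[int]:
    """Minimize sum of target[t, s] plus concat[s_prev, s] over slots.

    target[t, s]: cost of placing segment s at music slot t (toy values).
    concat[s1, s2]: penalty for cutting from segment s1 to s2 (toy values).
    Returns one segment index per slot via dynamic programming.
    """
    n_slots, n_segs = target.shape
    best = target[0].copy()
    back = np.zeros((n_slots, n_segs), dtype=int)
    for t in range(1, n_slots):
        # total[s1, s2]: best cost ending in s1, then cutting to s2 at slot t.
        total = best[:, None] + concat + target[t][None, :]
        back[t] = np.argmin(total, axis=0)
        best = np.min(total, axis=0)
    path = [int(np.argmin(best))]
    for t in range(n_slots - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two slots, two segments: segment 0 fits slot 0, segment 1 fits slot 1.
target = np.array([[0.0, 5.0], [5.0, 0.0]])
concat = np.zeros((2, 2))
print(select_segments(target, concat))  # [0, 1]
```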
Finally, the cost ranking module ranks all generated MVs by their total cost and recommends the best one to the user.
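The final ranking step amounts to sorting candidates by total cost and returning the cheapest. The track names and cost values here are fabricated toy data for illustration only.

```python
# Each candidate MV pairs a music track with the total editing cost of its
# best alignment (toy data; real costs come from the editing module).
candidates = [("track_a.mp3", 12.4), ("track_b.mp3", 7.9), ("track_c.mp3", 9.1)]

# Sort ascending by total cost; the lowest-cost MV is recommended.
ranked = sorted(candidates, key=lambda mv: mv[1])
best_track, best_cost = ranked[0]
print(best_track)  # track_b.mp3
```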