Automatic Music Video Generation Based on Simultaneous Soundtrack Recommendation and Video Editing
In the proposed music video (MV) generation system, uniform video segmentation is first applied to divide a queried long user-generated video (UGV) into several video segments.
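Uniform segmentation can be sketched as splitting the video timeline into fixed-length windows; the segment length below is an illustrative assumption, not a value specified by the system.

```python
def uniform_segments(duration, seg_len=2.0):
    """Split a video of `duration` seconds into consecutive fixed-length
    segments, returning (start, end) times; the final segment may be shorter.
    `seg_len` is an assumed example value."""
    segments = []
    t = 0.0
    while t < duration:
        segments.append((t, min(t + seg_len, duration)))
        t += seg_len
    return segments
```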
For each video segment, a multi-task deep neural network (MDNN) is adopted to predict pseudo acoustic (music) features from the visual (video) features; this step is called pseudo song prediction.
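The multi-task idea can be illustrated with a toy forward pass: a shared hidden layer feeds one head that regresses the pseudo acoustic features and a second auxiliary head. All layer sizes, weights, and the auxiliary task here are illustrative assumptions, not the paper's actual MDNN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions): visual features, shared hidden
# layer, acoustic output, and an auxiliary task output.
D_VIS, D_HID, D_AC, D_AUX = 128, 64, 40, 10
W_shared = rng.standard_normal((D_VIS, D_HID)) * 0.1
W_acoustic = rng.standard_normal((D_HID, D_AC)) * 0.1
W_aux = rng.standard_normal((D_HID, D_AUX)) * 0.1

def predict_pseudo_song(visual_feats):
    """Forward pass of a toy multi-task net: visual features pass through a
    shared ReLU layer, then branch into two task-specific linear heads."""
    h = np.maximum(visual_feats @ W_shared, 0.0)   # shared representation
    return h @ W_acoustic, h @ W_aux               # pseudo song + auxiliary
```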
A dynamic time warping (DTW) algorithm with a pseudo-song-based deep similarity matching (PDSM) metric is then used to align the UGV with a candidate music track in the acoustic feature space.
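A minimal sketch of the DTW step, assuming a precomputed pairwise cost matrix between video segments and music segments (standing in for the PDSM metric, which is not reimplemented here):

```python
import numpy as np

def dtw_align(cost):
    """Standard DTW over a cost matrix cost[i, j] between video segment i
    and music segment j; returns the total alignment cost and the path."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack from the end to recover the warping path.
    path, (i, j) = [], (n, m)
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda s: acc[s])
    path.reverse()
    return acc[n, m], path
```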
The video editing module, based on target and concatenation costs, then selects and concatenates segments of the UGV according to the DTW alignment to generate a music-compliant, professional-looking video for each candidate music track.
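Selection under target and concatenation costs can be sketched as a Viterbi-style search: for each music slot, choose the video segment minimizing its target cost (fit to that slot) plus the concatenation cost to the previously chosen segment. The cost matrices here are hypothetical inputs, not the paper's actual cost definitions.

```python
import numpy as np

def select_segments(target_cost, concat_cost):
    """Viterbi-style segment selection.
    target_cost: (T, N) cost of placing video segment n in music slot t.
    concat_cost: (N, N) cost of cutting from segment i to segment j.
    Returns the minimum-cost segment sequence and its total cost."""
    T, N = target_cost.shape
    best = target_cost[0].copy()           # best cost ending in each segment
    back = np.zeros((T, N), dtype=int)     # backpointers
    for t in range(1, T):
        trans = best[:, None] + concat_cost            # (prev, cur)
        back[t] = np.argmin(trans, axis=0)
        best = trans[back[t], np.arange(N)] + target_cost[t]
    seq = [int(np.argmin(best))]
    for t in range(T - 1, 0, -1):          # backtrack to recover the path
        seq.append(int(back[t, seq[-1]]))
    return seq[::-1], float(best.min())
```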
Finally, the cost ranking module ranks all generated MVs and recommends the best one to the user.