Diffusion probabilistic models, also known as diffusion models, generate an image by gradually removing noise; they learn the latent structure of a dataset by being trained to reverse a step-by-step noising process.
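For reference, the usual simplified training objective from Ho et al.'s DDPM paper, where ε_θ is the network trained to predict the noise ε added to a clean image x_0, and ᾱ_t sets how noisy the image is at step t:
$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\lVert \epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t)\rVert^2\right]$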
I could not find a diffusion model that merges just two known sources. After reading more about how these models work, this makes sense: a diffusion model is trained on a large dataset to produce an image, rather than merging two specific pieces of data the way example-based synthesis and NST do.
Thus, I used Deforum, which can combine a text prompt with a video input to create an animation. I tried to match the prompt to the source exemplar.
ITERATION 1
I attempted to use the Video Input feature to keep the content of the original video and have Deforum add the source exemplar (through the text prompt, so not the exact same input) into the frames, but that isn't how this tool works. With the Video Input feature and these settings, Deforum appeared to use Stable Diffusion to re-generate the text prompt for each frame extracted from the original video. Instead of an output that merged the original video and the source exemplar/text prompt, I received a video of 291 different AI-generated images of the text prompt. Not what I wanted, and I also needed to change the fps, since the output ran 24 seconds instead of 10. However, I've seen examples of others using Video Input successfully, so I did some more searching on the Internet and was guided by a Reddit post in Iteration 2.
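The fps fix follows from simple arithmetic. Assuming the notebook's default output rate of 12 fps (my assumption), 291 frames plays for about 24 seconds, and fitting those frames back into the original 10-second clip needs roughly 29 fps, which is the value I set in Iteration 2:

n_frames = 291
print(n_frames / 12)         # ~24.25 s at an assumed default of 12 fps
print(round(n_frames / 10))  # 29 fps to match the original 10 s clip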
In the Deforum Google Colab v0.7, I changed the following from the Default settings, where text in blue indicates code/text in Colab (a sketch of the corresponding Colab cells follows the list):
Settings:
Changed animation_mode: None to animation_mode: Video Input
Zeroed all Motion Parameters
Under Video Input:, to get the original video from my Google Drive, changed video_init_path: "/content/drive/MyDrive/CrAIfinish/catVideo.MOV"
Prompts:
prompts = [
]
animation_prompts = {
0: "An Impressionist Painting by Mary Cassat titled The Cup of Tea",
}
Load Settings:
Kept seed_behavior: iter
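A hedged sketch of how those Iteration 1 choices look as code in the Deforum v0.7 notebook (variable names follow the notebook; the motion-parameter schedules are my reading of "zeroed", and your copy's defaults may differ):

animation_mode = "Video Input"  # was "None"
video_init_path = "/content/drive/MyDrive/CrAIfinish/catVideo.MOV"
# Motion parameters are schedule strings; "zeroed" here means no motion, e.g.:
angle = "0:(0)"
translation_x = "0:(0)"
translation_y = "0:(0)"
prompts = [
]
animation_prompts = {
0: "An Impressionist Painting by Mary Cassat titled The Cup of Tea",
}
seed_behavior = "iter"  # kept: the seed, and thus the generated image, changes each frame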
ITERATION 2
The changes from Iteration 1 helped the output conform better to the content of the original video, but it was still not exactly what I wanted. Deforum still emphasized the content of the images generated from the text prompt, so I tried to reduce that emphasis in Iteration 3.
Changing seed_behavior from iter to fixed meant that only one Stable Diffusion-generated image was merged, since the same seed, and therefore the same generated image, is reused for every frame; that change pushed the output closer to what I wanted. However, I don't know exactly what that generated image looks like or how close it is to my source exemplar.
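My understanding of why fixed helps, sketched as the kind of per-frame seed update Deforum performs between frames (a simplification of the actual implementation, which also supports behaviors like random):

import random

def next_seed(seed: int, behavior: str) -> int:
    # Sketch of Deforum-style seed scheduling between animation frames.
    if behavior == "iter":
        return seed + 1  # new seed, so a new generated image, every frame
    if behavior == "fixed":
        return seed      # same seed, so the same generated image, reused
    return random.randint(0, 2**32 - 1)  # e.g. "random"

With iter, Stable Diffusion re-generates the prompt from a fresh seed on every frame, which is why Iteration 1 produced 291 different images; with fixed, a single generated image gets blended across all the frames.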
In the Deforum Google Colab v0.7, I used the following non-default settings, where text in blue indicates code/text in Colab; the new changes relative to Iteration 1 are under Load Settings and Create Video From Frames (a sketch of the changed Colab cells follows the list):
Settings:
Changed animation_mode: None to animation_mode: Video Input
Zeroed all Motion Parameters
Under Video Input:, to get the original video from my Google Drive, changed video_init_path: "/content/drive/MyDrive/CrAIfinish/catVideo.MOV"
Prompts:
prompts = [
]
animation_prompts = {
0: "An Impressionist Painting by Mary Cassat titled The Cup of Tea",
}
Load Settings:
Under Batch Settings, changed seed_behavior: iter to seed_behavior: fixed
Under Init Settings, I checked use_init and changed strength: 0.5
Removed Default text from init_image
Create Video From Frames:
Changed to fps: 29
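A hedged sketch of just the cells that changed relative to Iteration 1, again following the v0.7 notebook's variable names (the comments are my understanding of each setting):

seed_behavior = "fixed"  # reuse one seed, so one generated image, across all frames
use_init = True          # blend each extracted video frame in as the init image
strength = 0.5           # how much of each init frame survives into the output
init_image = ""          # removed the Default text
fps = 29                 # 291 frames / 10 s ≈ 29 fps, matching the source clip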
ITERATION 3
Of these Deforum iterations, this one best met my desired output: the style from the source exemplar and the content from the original video are most explicit. This is the Deforum video shown on the Home Page.
Raising the strength emphasized the content of the video better than Iteration 2 did; a sketch of how strength maps onto denoising steps follows the settings list below.
The only thing I changed from the Iteration 2 settings, where text in blue indicates code/text in Colab, was:
Load Settings:
Under Init Settings, I changed strength: 0.5 to strength: 0.75
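For context on why raising strength works, here is a hedged sketch of how Deforum-style img2img uses it: the init frame is partially noised, and only the remaining fraction of the sampling steps is denoised on top of it, so a higher strength leaves more of the frame intact (the exact scaling is my assumption):

def steps_on_init(steps: int, strength: float) -> int:
    # Hedged sketch: only (1 - strength) of the sampling steps run on top
    # of the init frame; the rest of the frame's content is kept as-is.
    return int((1.0 - strength) * steps)

print(steps_on_init(50, 0.5))   # 25 denoising steps: more prompt influence
print(steps_on_init(50, 0.75))  # 12 denoising steps: the video frame dominates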