Which of these three images should I choose as the starting image of clip 2?
I like them all but decide to go with the third one. I feed that to Veo with this prompt:
The camera pans to the left as the witch puts the rat and the herbs she is holding into the bubbling, steaming cauldron. As she does so she says in a menacing tone: “Hubble, bubble, toil and trouble, fire burn and caldron bubble. Thrice the brinded cat hath mewed.” After dropping the rat and herbs into the caldron she takes a large ladle and stirs the mixture.
Within a couple of minutes I have a good result. She doesn't stir the mixture, though; I should have said 'spoon' rather than 'ladle'. And the camera doesn't pan to the left. Nice steam, though! I think 16 words is about the right amount of text to be spoken comfortably in an 8-second clip.
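As a rough check on that word budget, here is a minimal Python sketch that counts the spoken words in a line of dialogue and works out the speaking rate for an 8-second clip. The function names are my own, and the 8-second default simply reflects the clip length mentioned above; it is an illustration, not anything Veo provides.

```python
import re

def count_spoken_words(line: str) -> int:
    """Count the words in a line of dialogue, ignoring punctuation.

    Apostrophes and hyphens are kept inside words, so "'tis" and
    "hedge-pig" each count as one word.
    """
    return len(re.findall(r"[A-Za-z'’-]+", line))

def words_per_second(line: str, clip_seconds: float = 8.0) -> float:
    """Speaking rate needed to fit the line into a clip of the given length."""
    return count_spoken_words(line) / clip_seconds

# The speech from the clip 2 prompt above.
speech = ("Hubble, bubble, toil and trouble, fire burn and caldron bubble. "
          "Thrice the brinded cat hath mewed.")

print(count_spoken_words(speech))   # 16
print(words_per_second(speech))     # 2.0
```

So the clip 2 speech works out at two words per second, which can serve as a yardstick when writing the speech for later clips.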
I used this image
and this prompt:
The witch stirs the mixture with a large spoon. The creatures swim in the bubbling liquid. Steam rises. The camera slowly rotates and zooms in towards the cauldron. The witch speaks the magic formula "Thrice and once, the hedge-pig whined. Harpier cries:—'tis time! 'tis time! Round about the cauldron go." After a few seconds a skull rises to the surface of the liquid.
The fragment of incantation is from Shakespeare's Macbeth, Act 4, Scene 1. Actually, the text I'm using doesn't seem to be in the same order as the original. The complete text of Macbeth is accessible HERE in English and HERE in French.
Failure. No reason given. I asked ChatGPT to optimise the prompt and it suggested this:
A dark, torch-lit cavern. The witch stands over a large, steaming cauldron, stirring the bubbling green liquid with a heavy, ornate wooden spoon. A toad, rat and snake swim through the mixture as thick steam coils into the air. The camera begins a slow, dramatic rotation around the cauldron, gently zooming in. The witch chants: “Thrice and once, the hedge-pig whined. Harpier cries:—’tis time! ’tis time! Round about the cauldron go.” As her incantation ends, a human skull eerily rises from beneath the surface of the churning potion. Visual tone: Menacing, cinematic, dimly-lit with warm torch glow, fantasy realism Camera: Smooth dolly zoom and slow orbit (parallax effect), starting wide, ending in medium close-up of the cauldron Audio cues: Bubbling liquid, flickering flames, the witch’s incantation, distant echoes
Failure. No reason given. "Try again later", it says. I switch to Library view and a video is there, but it's built on the first prompt above, not the ChatGPT one. I switch back into Flow and the video is there too, but not usable because the witch now looks very different, with red hair and no makeup. The witch doesn't stir the liquid. There isn't a proper zoom and rotation, though the cauldron seems to grow a little. Also, I think her voice is different. Do you agree? The upscaled version had no sound so I used a video editor to copy the sound from the 720p version to the 1080p version to get this:
I try to use ChatGPT to make a similar image without the witch but I've hit my daily limit for image edits. Do I have to pay €23 per month to increase my daily image editing limit? Grok cannot match ChatGPT for image generation, but maybe Google Gemini can? I use the VPN to be able to edit images in Google Gemini but it does a terrible job, turning the first image below into the second when given this prompt:
Can you remove the skull from this image and make the liquid a little greener and the image a little lighter? Also, the rat and snake should be half-submerged. 16:9
It didn't even remove the skull from the image. And what is that weird animal with two tails?? Actually even the snake seems to have two tails and no head.
I want to remove the skull from the starting image so that it can rise to the surface during the clip. In other words, I would like that image with the skull to be the final image in the clip, but it doesn't seem possible for Veo 3 to work with a final image at this time. Is there a workaround? I believe it's possible to instruct Veo to jump from one view to another during the clip, so my idea is to use the image with skull as the starting image, then tell Veo to jump to a different view in the second frame, then tell it that the last frame should be the same as the first. In the resulting video I would then suppress the first frame. Am I being too clever? I try this prompt:
At the second frame, jump to a similar frame in which the skull is not visible in the cauldron. A witch in black robes stirs the mixture with a large spoon. Only her arm is visible. The creatures swim in the bubbling liquid. Steam rises. The camera slowly rotates and zooms in towards the cauldron. The witch speaks the magic formula "Thrice and once, the hedge-pig whined. Harpier cries:—'tis time! 'tis time! Round about the caldron go." After a few seconds a skull rises to the surface of the liquid. The last frame in the video should be the same as the first.
Failed generation, or so it said. This time, though, all I had to do was refresh the page to see that the generation had actually worked. The upscaled version again has no audio, so I had to download both the 720p version (which has audio) and the 1080p version (which doesn't) and then copy the audio from the low-res version to the high-res version. Far from ideal. Here below is the 1080p version with the recovered audio.
It's the first time I've tried to use a jump during a clip. There is a jump, but it comes about three seconds into the clip rather than at the second frame. After the jump, the skull is indeed hidden, and there is indeed some zoom effect. However, the final frame is not the same as the first frame, as I had requested. I could keep just the part after the jump, but I think I'll keep it all because otherwise I'll spoil the incantation.
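The audio transfer described above (copying the soundtrack of the 720p download onto the silent 1080p upscale) was done in a video editor, but it could probably be scripted as well. Here is a minimal sketch that builds an ffmpeg command for it, assuming ffmpeg is installed and that the two downloads have the same duration. The file names are hypothetical placeholders.

```python
import subprocess

def build_mux_command(video_1080p: str, audio_720p: str, output: str) -> list[str]:
    """Build an ffmpeg command that takes the picture from the upscaled
    file and the soundtrack from the low-res file, without re-encoding."""
    return [
        "ffmpeg", "-y",
        "-i", video_1080p,   # input 0: upscaled video, no audio
        "-i", audio_720p,    # input 1: original video, with audio
        "-map", "0:v:0",     # keep the video stream of input 0
        "-map", "1:a:0",     # keep the audio stream of input 1
        "-c", "copy",        # copy both streams as they are
        "-shortest",         # stop at the end of the shorter stream
        output,
    ]

def mux_audio(video_1080p: str, audio_720p: str, output: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_mux_command(video_1080p, audio_720p, output), check=True)

# Hypothetical file names, shown only to illustrate the command.
cmd = build_mux_command("clip3_1080p.mp4", "clip3_720p.mp4", "clip3_1080p_audio.mp4")
print(" ".join(cmd))
```

Calling `mux_audio` with the real file names would do the actual work. Because the streams are copied rather than re-encoded, the 1080p picture is left untouched.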
Even though I have a satisfactory clip, I thought I would try extending the previous clip to see whether that would give a better result. Veo can extend a clip if you give it a prompt describing what should happen in the extension, but this only works with Veo 2, and Veo 2 cannot generate audio. I tried it anyway, using this prompt (which, given that limitation, shouldn't have included the speech instructions).
The witch puts down the ladle and picks up a large wooden spoon which she uses to stir the potion. Steam rises. The camera slowly rises and approaches the cauldron and we see a skull float to the surface. The witch speaks the magic formula "Thrice and once, the hedge-pig whined. Harpier cries:—'tis time! 'tis time! Round about the caldron go."
Below is the result. Veo 2 doesn't offer the possibility of upscaling so the video is in 720p. The witch doesn't put down the ladle, instead using it to scoop up a very small skull. I don't know what she's trying to say.
It occurred to me that there might be a way of extending a clip while using Veo 3 and thus keeping its sound and upscaling features. I could try taking the last frame of the previous clip and using it as the first frame of the new one. How good will the continuity be between the new clip and the old? So I captured the last frame of clip 2 and fed it to Veo with the prompt:
The witch puts down the ladle and picks up a large wooden spoon which she uses to stir the potion. Steam rises. The camera slowly rises and approaches the cauldron and we see a skull float to the surface. The witch speaks the magic formula "Thrice and once, the hedge-pig whined. Harpier cries:—'tis time! 'tis time! Round about the caldron go."
Here's the result:
It's not bad. Instead of the camera approaching the cauldron, there is a fade transition that wasn't asked for. Neither was the little laugh at the end. Veo 3 clips quite often seem to end with some unexpected behavior in the last second. The voice is probably closer to that of clip 2 than in the other clip 3 attempts. The image seems overly sharp at the start, though, so the continuity with clip 2 would not be perfect. You can't really judge that without a side-by-side comparison of the last image from clip 2 and the first image of clip 3 (which was meant to be the same as the last image of clip 2). Here they are:
I think you can see (looking at the makeup, for example) that Veo has sharpened the image in the first frame of clip 3 rather than using the image it was given without modification. I suppose one could anticipate that modification by softening the image before feeding it to Veo, but is it worth the trouble?
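For anyone wanting to repeat this comparison, the two frames can be pulled out of the clips with ffmpeg. This is a minimal sketch, assuming ffmpeg is installed; the file names are hypothetical, and it is not necessarily how the frames shown above were captured.

```python
import subprocess

def first_frame_command(clip: str, image: str) -> list[str]:
    """ffmpeg command that saves the first video frame of a clip."""
    return ["ffmpeg", "-y", "-i", clip, "-frames:v", "1", image]

def last_frame_command(clip: str, image: str) -> list[str]:
    """ffmpeg command that saves the last video frame of a clip.

    -sseof -1 starts reading one second before the end of the file, and
    -update 1 keeps overwriting the same image, so the file left on disk
    at the end is the final frame.
    """
    return ["ffmpeg", "-y", "-sseof", "-1", "-i", clip, "-update", "1", image]

def run(command: list[str]) -> None:
    """Run an ffmpeg command; raises CalledProcessError on failure."""
    subprocess.run(command, check=True)

# Hypothetical file names for the two clips being compared.
last_cmd = last_frame_command("clip2.mp4", "clip2_last.png")
first_cmd = first_frame_command("clip3.mp4", "clip3_first.png")
print(" ".join(last_cmd))
print(" ".join(first_cmd))
```

The same `clip2_last.png` could then be uploaded to Veo as the starting image of the next clip.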
And then there's the problem of the voice. It seems to me that Veo 3 should allow people to specify in the prompt which voice should be used by each character, choosing from a set of standard voices, just as in ElevenLabs.
Update: up to this point I have experienced a lot of failures when attempting to generate videos with Veo 3 but from this point on failures become rare. I'm not sure why - either Veo 3 improved or it was a problem with the free VPN I was using, perhaps.
By the way, I haven't thought to specify in my prompts so far that British accents should be used, but American accents may not be appropriate for a movie set in medieval England.
Click HERE to see the next clips being developed.