As you've guessed, the reason that the people in the inn are peasants is that Sir Nigel, once he has been persuaded to drink lots of beer, is going to be robbed. But how exactly should I show that in the next clip? I decided a while back, somewhat randomly, that my video won't have a spoken narration, but now it occurs to me that panels with text could help to separate scenes that are disconnected in time or space, and could make the storyline clearer. After Julia proposes beer and a massage to Sir Nigel, for example, I could use a panel saying 'Two hours and four pints later...'
So I need to show Sir Nigel now drunk, but not quite passed out since it will be fun to hear him talking. His clothes will now be more tattered and his cape will be gone. I tell ChatGPT to modify the image of me against the white background like this:
This person has been aggressed. Can you make an image of this person without the cape and with his clothes in a somewhat tattered and dirty state?
Nice! I continue with:
Can you make an image in which we are looking down on the person is his disheveled state as he sits on the floor of the inn with his back against a wall. He is very drunk and is holding an empty wooden tankard in his hand. The woman who was holding the tankard of beer earlier is standing next to him, looking down at him. Hyper-realism. 3:2 landscape format.
The result looks nice, although she is not standing.
I lighten and widen that image and send it to Veo with this prompt:
The camera is hand held and slowly zooms in. We are in medieval England and everyone speaks with a British accent. In the background there is the continuous sound of the people talking and laughing. The man is drunk and says with slurred, drunken speech "Harriet, Harriet, where are you?" The woman reaches into a pocket inside his jacket and withdraws his purple velvet purse. She says, looking at the camera, "This should cover the beer, the massage, and the night asleep on my floor."
Despite a clear prompt, it is Julia who calls for Harriet, and not Sir Nigel. Also the purse is too modern, too big and taken from an external pocket rather than an internal one. I modify the prompt:
The camera is hand held and slowly zooms in. We are in medieval England and everyone speaks with a British accent. In the background there is the continuous sound of the people talking and laughing. The man is very drunk and he (not the woman) says with slurred, drunken speech "Harriet, Harriet, where are you?" Then the woman reaches into a pocket inside his jacket and withdraws a small purple velvet bag that is tied with a string at the top. She says, looking at the camera, "This should cover the beer, the massage, and the night asleep on my floor!" Hyper-realism.
That's better, though Sir Nigel isn't drunk enough, the bag is too big and I didn't ask for the burst of laughter at the end, which I will edit out. I wonder whether Veo put that in because I asked Julia to look towards the camera?
That clip only took an hour or two to make. I now have 22 usable eight second clips. That would mean my movie is now three minutes long. In the very beginning I had imagined that the finished movie might be five minutes long, and I still think that's about right. Some clips may get shortened a little, but the insertion of some narrative panels will compensate for that.
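To sanity-check that arithmetic, here is a minimal Python sketch. The clip count, clip length and five-minute target come from the paragraph above; the names and the assumption that every clip runs its full eight seconds with no panels or trimming are mine, so treat the result as a rough estimate only.

```python
# Figures taken from the text: 22 usable clips of eight seconds each.
CLIP_COUNT = 22
CLIP_SECONDS = 8
TARGET_SECONDS = 5 * 60  # the five-minute length imagined at the start


def format_mmss(total_seconds: int) -> str:
    """Format a duration in seconds as minutes:seconds."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


# Assumption: no clip is trimmed and no narrative panels are counted.
current_seconds = CLIP_COUNT * CLIP_SECONDS
remaining_seconds = TARGET_SECONDS - current_seconds

# Ceiling division: a partly needed clip still has to be generated in full.
clips_still_needed = -(-remaining_seconds // CLIP_SECONDS)

print(format_mmss(current_seconds))  # 2:56
print(clips_still_needed)            # 16
```

So "three minutes" is a slight rounding up of 2:56, and under these assumptions roughly 16 more clips would reach five minutes, fewer once the panels are added.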
A panel will say 'The following morning...'
I send the handsome image of Sir Nigel on horseback outside the inn to ChatGPT along with this prompt:
Can you modify this image so that the disheveled man that you just created for me is standing in the doorway, looking out. He is hungover and hunched over and he is supporting himself with his right hand on the door frame. The horse has gone. Golden hour lighting. Hyper-realism. 3:2 format.
After several adjustments, I get a usable image. The gibberish text on the sign by the door now clearly says 'TONIGHT: ALL HAIL NIGHTLY MINIONS', which is odd. I expand the image and send to Veo with this prompt:
The camera is hand held and slowly orbits right. We are in medieval England and everyone speaks with a British accent. The man is hungover. He says mournfully and shaking his head "My cape, my purse, my dignity, all gone." He staggers down the steps, looks around and raises his arms in despair, wailing "And where is Adam, my faithful steed?" Then he buries his face in his hands, weeping. Hyper-realism.
I like the result.
A panel will say 'Sir Nigel continues his quest on foot, and by nightfall arrives in a forest...'
I return to Midjourney to hopefully get a photo-realistic image based on this prompt. I upload the image of disheveled Sir Nigel as the reference image. This time I set the stylization value to 50. I use this prompt:
The reference character is seated by a campfire in a forest at night. The trees around him are visible, lit by moonlight. The ground is not flat. Hyper-realistic imagery.
It's a rather vague prompt - I'm giving Midjourney a lot of artistic licence. Here are the four results:
I find the first three to be pleasing but the last is a dull composition. The second is unusable because the clothes are not consistent with the reference image. So do you prefer the first or the third? I'm going with the third, although his clothes look heavier than they should.
I'm not sure what should happen in this scene. It would be an interesting test to have Sir Nigel sing something. Or a wolf could appear and be scared away by Sir Nigel waving a burning stick. Maybe we could squeeze all of that into 8 seconds? I don't think so, but maybe yes if Sir Nigel just hums a tune?
I send the third image to Veo with this prompt:
Hand held camera. The man is humming a tune and holding his hands towards the fire for warmth. We hear the sounds of the forest at night, including the hooting of an owl. As the camera orbits right we see a wolf approaching from the left, snarling and baring its teeth. The man grabs a burning stick and throws it at the wolf, shouting with a noble British accent "Away!". The wolf runs away, yelping. Hyper-realistic.
Apparently Veo is in an uncooperative mood tonight. There are so many things wrong with this clip. He shouts "Away!" before even having noticed the wolf. The wolf doesn't snarl and it gets too close. Sir Nigel grabs a burning stick out of mid air. There is no hooting of owls, only what I assume is meant to be humming, but Sir Nigel isn't humming so it must be the wolf? The camera doesn't orbit right, it zooms in, which is easier for the AI since it doesn't have to invent new scenery. At least the imagery is photo-realistic, including the fire. I modify the prompt:
The hand held camera orbits right. The man is holding his hands vertically towards the fire for warmth. We hear the sounds of the forest at night. We see a wolf approaching from the left, snarling loudly and menacingly and baring its teeth. Then the man suddenly sees the wolf. He leans forward and grabs a burning stick from the fire. With a violent gesture he then throws the burning stick at the wolf as hard as he can, shouting loudly with a noble British accent "Away! Away!". The wolf runs away, yelping. Hyper-realistic.
It's a little better, but now we have someone off camera shouting "Away, away!" at the beginning of the clip and as usual the camera refuses to orbit. I'm determined to make this clip work well so I decide to try the singing anyway as a separate clip to which I will join the clip of the 'terrifying wolf'. Let's see if Veo can come up with the words and a tune itself with this vague prompt:
The hand held camera orbits right. The man is holding his hands vertically towards the fire for warmth. We hear the sounds of the forest at night. The man sings a lonely song about the forest with a noble British accent, with NO musical accompaniment. Hyper-realistic.
The generation fails - maybe eight seconds is just not enough to 'sing a song' or maybe it needs me to provide the lyrics.
Let's see whether we can cut to a first person view as the wolf approaches. If it works then maybe I can mash this clip with the previous one to get something usable:
The hand held camera cuts immediately to a first person view through the eyes of the man. We see a wolf approaching fast, snarling loudly and menacingly and baring its teeth. The man forward and grabs a burning stick from the fire. With a violent gesture he then throws the burning stick at the wolf as hard as he can. The wolf runs away, yelping. Hyper-realistic.
Wonderful! It's not really a first person view through Sir Nigel's eyes, though - it's a view from his right, but the wolf is a lot more menacing and I'm sure I can combine this with the existing clips to get something usable. If you watch carefully you'll see that the wolf doesn't actually turn around - it suddenly snaps to a rear view. Maybe I'll edit out that glitch.
But it's all too quick - I really want to add some quiet time before the wolf arrives so I make another clip with this:
The hand held camera orbits left and zooms in on the man. The man holds his hands vertically towards the fire for warmth. We hear the sounds of the forest at night. The man says to himself with a noble British accent "I must stay awake and keep the fire going, for the forest is a dangerous place at night." Hyper-realistic.
The zoom is a bit sudden and the vertical raising of his hands doesn't seem to be for the purpose of staying warm, but this is very usable. Here's what I got from editing the clips together in CapCut. I didn't use the first one at all but I did add some snarling that I generated at elevenlabs.
A panel will say 'A week later, our intrepid knight is crossing the mountains...'
I give the disheveled reference image to Midjourney with this prompt:
A mid-length shot of this man walking though a snowy mountainscape, using a stick for balance. It is snowing and windy and he is suffering.
Not bad, but none of them is really good. None of the images is mid-length (waist up) and they have all transformed his shirt into a coat.
First one: not plausible to be crossing the mountains with bare feet.
Second: wrong clothes, wrong person.
Third: the best but his shirt has become a coat and the snowflakes are too conspicuous and too bright.
Fourth: not bad but his upper body looks too wide and again, he should not have a coat.
I modify the prompt and resubmit:
The reference man is walking through a snowy mountainscape, using a stick for balance. He has a shirt but no coat. It is snowing and windy and he is suffering. There is no vegetation on the ground.
The new set is worse than the old set, with more vegetation and less resemblance to the reference image.
I try sending the best (the third) from the first set with this prompt:
Shorten the coat in the reference image so that it is just a shirt, not a coat.
Midjourney totally messes up and gives me four images of men in coats, like a fashion show (not shown).
Maybe ChatGPT can do a better job? I upload the image and give this prompt:
Shorten the coat in the uploaded image so that it is just a shirt, not a coat, please. Photo-realistic. 3:2 format.
Now Sir Nigel is crossing the mountains in a T-shirt! What a man.
I try again:
Can you remake the image to look like the uploaded image except that the coat is shorter. It should be the same as before, just shorter. Long sleeves, same pattern. Not a T shirt! Photo-realistic. 3:2 format.
That'll do, though the face is less good. I widen the image and send it to Veo with the prompt:
The camera quickly pulls back, left and up and twists around until we have an aerial view of the man trekking through the mountains. We see vultures circling in the distance. It is snowy and windy - we hear the sound of the wind. Hyper-realistic.
But suddenly I seem to hit a major problem. As explained before, Veo does not allow images of photo-realistic people to be uploaded by people in the EU. You can make videos of rocks or plants or animals but not people!!! For more than a week I have simply used a VPN to pretend that I am in the US in order to be able to upload images of people, but suddenly today, August 13 2025, that no longer works and I get that blocking message again. I try switching to a different US city in the VPN and that doesn't work. I try restarting my computer and that doesn't work. When I paid for my subscription to Google One AI Pro I used a credit card and provided my French address - maybe that is the reason I am now being blocked. Or maybe it's the French address associated with the email address attached to my Google account. I check online whether other people are suddenly having the same problem as me, but can't find anything.
My project is 80% complete - do I now have to give up the idea of completing it? What are the possible solutions or workarounds?
If the 'no people' ban doesn't apply in the UK then maybe I could work something out, but just pointing my VPN there won't help.
I could try using Veo 3 within Pollo, which has just become possible, but it's much more expensive: 170 credits per video, and I have 674 credits left there, enough for just three clips.
I see that Veo 3 fast and quality are now both available via freepik.com in their paid-for plans.
Another possibility is to use text to video in Veo 3 rather than image to video, and provide very detailed descriptions of Sir Nigel's face, body and clothes, as previously discussed.
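For the Pollo option, the credit budget can be checked with a couple of lines of Python. This is only a sketch of the arithmetic using the two figures quoted above; the variable names are mine, and it assumes every generation costs the same 170 credits, including failed or unusable ones.

```python
# Figures taken from the text.
CREDITS_LEFT = 674
CREDITS_PER_VIDEO = 170

# divmod gives the number of whole videos and the credits left over.
videos_affordable, credits_leftover = divmod(CREDITS_LEFT, CREDITS_PER_VIDEO)

print(videos_affordable)  # 3
print(credits_leftover)   # 164
```

Three clips with 164 credits to spare, which is just six credits short of a fourth, and that leaves no room for retries.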
I research whether there are now any other AI video generators capable of generating audio like Veo 3 and the answer appears to be no for the time being. The first suggestion from Gemini is Sora: "OpenAI Sora: Although primarily known for its impressive video generation, Sora, integrated with ChatGPT Plus/Pro, allows for the creation of realistic and expressive videos, and its ecosystem facilitates integrating external audio solutions." I currently do have a paid subscription to OpenAI so I do have access to Sora and must try it.
The clip in the snow has no speech, so I don't really need Veo 3's audio capabilities (I can add wind sounds later), and I decide to try using Midjourney to make the video. But there is a misunderstanding: instead of generating a video with my Sir Nigel image as the first frame, it generates four still images like this, pretty but unusable. (And why so many vultures?)
I try again and get the message 'Failed to submit. Could not validate link'. I re-upload the image and try again.
The quality is okay but there is another misunderstanding: Midjourney thinks that 'twists around' refers to the man whereas it was meant to refer to the camera.
But wait! I notice that Midjourney has produced four videos, not one, and in one of them he doesn't twist around. Also, Midjourney upscales the video from 720p to 1080p just like Veo does, which is good, but it is silent of course, and even though the clip is only six seconds long, the face goes smudgy at the end. Does this person look more like a plump Matt Damon than Sir Nigel? I won't use Midjourney video again. I generate the sound of wind in elevenlabs, add that to the video in CapCut and end up with this:
The roaring wind sound is inconsistent with the slow and gentle falling of the snowflakes.
UPDATE: Later I am again able to use Veo 3 to generate clips, so I feed Veo the same starting image and this prompt:
The camera quickly pulls back, left and up and twists around until we have an aerial view of the man trekking through the mountains. We see vultures circling in the distance. It is snowy and windy - we hear the sound of the wind. Hyper-realistic.
The result is slightly better, with the screeching of the birds: