The next few clips will show Sir Nigel on his journey, his quest to find the Holy Snail. He will cross various inhospitable landscapes and this will give me an opportunity to be more artistic in my generation of starting images, perhaps using Midjourney (known for being artistic) rather than ChatGPT (more realistic). There won't be much talking so making these clips should be relatively fast.
The assumption is that I will continue to use the reference image of myself in medieval clothes against a white background, but at some point I may experiment with using a detailed textual description of my dressed self instead, as explained in this video. I can ask ChatGPT to analyse an image like this one and create a detailed description which can then be used in place of the reference image to try to ensure a consistent appearance.
I said:
Facial Features: Describe the facial features and hair you see in this image to help me prompt better images in Midjourney
ChatGPT said:
From the image, here’s a description of the facial features and hair you could use in a Midjourney prompt:
Hair: Medium-dark brown, straight, with a bit of wave; short length; natural texture; slightly tousled; side-parted but casually styled.
Hairline: Even, with no significant recession.
Forehead: Medium height, smooth.
Eyebrows: Medium thickness, naturally arched, dark brown.
Eyes: Blue-grey, almond-shaped, medium size, with a calm, neutral gaze.
Nose: Straight bridge, medium width, proportionate to face.
Mouth: Medium lips, slightly fuller lower lip, relaxed expression.
Jawline & Chin: Moderately defined jawline with a softly squared chin.
Skin tone: Light, warm undertone with even complexion.
Facial hair: Clean-shaven.
Age appearance: Around 40–50 years old.
You could translate this into a Midjourney-ready snippet like:
"A light-skinned man in his late 40s with short, slightly wavy medium-dark brown hair, blue-grey almond-shaped eyes, straight nose, medium lips, clean-shaven, and a moderately defined jawline. Natural warm undertone skin, calm expression."
Here’s a refined Midjourney prompt template based on your photo so you can adapt it to different scenarios while keeping your likeness consistent:
Base likeness prompt
A light-skinned man in his late 40s with short, slightly wavy medium-dark brown hair, blue-grey almond-shaped eyes, straight nose, medium lips, clean-shaven, and a moderately defined jawline. Warm undertone skin, calm and confident expression, even lighting, ultra-detailed, hyper-realistic
Example full prompt for Midjourney (you can swap the setting/clothing/pose as you wish):
A light-skinned man in his late 40s with short, slightly wavy medium-dark brown hair, blue-grey almond-shaped eyes, straight nose, medium lips, clean-shaven, and a moderately defined jawline. Warm undertone skin, calm and confident expression. Wearing elegant medieval nobleman’s clothing of dark green velvet with gold embroidery, standing in a medieval courtyard under a cloudy sky. Hyper-realistic, ultra-detailed, 85mm lens, f/2.8, cinematic lighting, sharp focus --ar 3:2 --v 6 --q 2
Tips to keep likeness consistent in Midjourney:
Repeat the key facial traits (hair color & style, eye color, skin tone, jawline) in every prompt.
Lock in the age range so Midjourney doesn’t “drift” into younger or older versions.
Specify the camera type and lens for consistent proportions (85mm lens works well for portraits).
Avoid vague terms like “handsome man” unless you want Midjourney to stylize more freely.
If you’d like, I can now give you four ready-to-use Midjourney prompts — each with a different setting but all keeping your face consistent. That way, you can drop them straight into MJ without extra editing.
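The tips above boil down to keeping one fixed block of likeness text and swapping only the scene. Here is a minimal sketch of that idea in Python. The function and constant names are my own invention for illustration (they are not part of Midjourney or ChatGPT), and the only Midjourney parameter used is `--ar`, which already appears in the example prompt above.

```python
# Hypothetical helper: these names are illustrative, not a Midjourney API.

# Fixed likeness text, copied from the base likeness prompt above.
BASE_LIKENESS = (
    "A light-skinned man in his late 40s with short, slightly wavy "
    "medium-dark brown hair, blue-grey almond-shaped eyes, straight nose, "
    "medium lips, clean-shaven, and a moderately defined jawline. "
    "Warm undertone skin, calm and confident expression."
)

# Fixed style and camera suffix, so lens and lighting do not drift between prompts.
STYLE = (
    "Hyper-realistic, ultra-detailed, 85mm lens, f/2.8, "
    "cinematic lighting, sharp focus"
)


def build_prompt(scene: str, aspect_ratio: str = "3:2") -> str:
    """Join the fixed likeness, a variable scene and the fixed style suffix."""
    scene_sentence = scene.strip().rstrip(".") + "."
    return f"{BASE_LIKENESS} {scene_sentence} {STYLE} --ar {aspect_ratio}"


if __name__ == "__main__":
    print(build_prompt(
        "Arriving at a rural inn on horseback in the golden hour", "16:9"
    ))
```

Only the `scene` argument changes from one image to the next, so the facial traits, age range and lens are repeated word for word in every prompt.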
Note that using a reference image is not the only way to try to obtain consistent characters by having ChatGPT make images based on them that can then be fed to Veo. It is also possible to try to get a textual description of the reference image which can then be incorporated into a prompt, perhaps meaning that one can avoid altogether the need to provide a starting image to Veo. Another advantage of using a textual description is that one could conceivably use two such descriptions in the same prompt - it is otherwise difficult to maintain consistent characters if there is more than one character in the scene. In this video he suggests this prompt to get ChatGPT to analyse and describe an image in a way that would be suitable for use in Midjourney:
I ask Grok to make an image based on that one-sentence description and it generates this, an excellent likeness of me (calm down, ladies):
STOP PRESS: As of today 08/08/2025 I can use ChatGPT-5 instead of GPT-4. ChatGPT 5 claims to deliver:
1. Quality & Realism
GPT-5 produces sharper, more detailed, and more realistic images, with better handling of textures, lighting, and depth. It’s better at avoiding “AI giveaway” artifacts like warped text, extra fingers, or strange object blending.
2. Accuracy to Prompt
GPT-5 tends to stick more closely to the description you give — meaning if you ask for “a 1920s street scene in Paris on a rainy night,” you’re more likely to get the correct time period, location feel, and atmosphere.
3. Consistency
GPT-5 can keep character faces, outfits, or objects consistent across multiple images in a series.
So from now on all my ChatGPT images should be better!
I tried running the same prompt that I had used with ChatGPT 4 to see if ChatGPT 5 could deliver a better likeness. Here is the prompt again and the image I used is the one at the top of this page with the blue background. I also tried using my photo editor, Affinity Photo, to stick my own head on the medieval body to see whether I could get a better likeness. Below are the ChatGPT 4 image, the ChatGPT 5 image and the image that I obtained using the photo editor, respectively:
The ChatGPT 5 image is no better than the ChatGPT 4 image. The two images are remarkably similar. It's possible that ChatGPT 5 generated such a similar image because this was requested within the same session where all the other images were generated. Might it have looked quite different in a new session? But the best likeness is in the last image, of course, as it should be because it's my actual face. I will use this image as the base from now on.
I think Sir Nigel will be tired after a few hours in the saddle as he begins his quest - he will probably want to stop at an inn for the night. So... should I continue using ChatGPT or switch to Midjourney for more 'artistic' images?
I first give the new image of myself (with my real head) and this prompt to ChatGPT and get this image (actually it gave me an image in 2:3 format so I made it start again):
Can you make an image of this person arriving at a rural inn on horse back (the same brown horse as in the previous images) in the golden hour. A sign above the door says ‘The Rabbit Hole’. Hyper-realistic. 3:2 format. Adjust brightness and color balance to mimic a professionally lit studio photo, but in an outdoor medieval setting.
I gave Midjourney the same image of me and a similar prompt (without the reference to previous images and with a request for 16:9 format) and got four images. They're all beautiful and detailed so I'm showing you the best three. The first one is more accurate in my features but they all have the common AI generator weakness of being unable to reproduce text well - only ChatGPT and Veo handle text well. They have a much more neutral color balance than the ChatGPT version but my prompt was self-contradictory - I asked for a 'golden hour' image (just before sunset) and also neutral colors. Midjourney is able to output in 16:9 format but ChatGPT can only do 3:2. Midjourney gave me a suitably sized horse whereas ChatGPT always gives me mini horses.
And the winner is... Midjourney. I'll use the first Midjourney image, but should I try to fix the name of the inn? Let's try asking ChatGPT to do that. I send ChatGPT the image of the sign only and (on my second attempt) I get this excellent result which I integrate into the Midjourney image.
I send the Midjourney image to Veo with this prompt (not forgetting that in a previous video Sir Nigel refused to dismount from this horse):
The hand-held camera orbits to the right. The man dismounts from his horse, attaches the bridle to a hook to the right of the door and enters the inn. The sound of boisterous laughter can be heard from inside the inn. Hyper-realistic.
Hilarious! Veo still refuses to make Sir Nigel dismount, and instead makes a Sir Nigel clone appear out of mid-air! Also, I don't hear much 'boisterous laughter' and the faces of the people inside are horribly distorted. I want my money back! At least I got the asked-for camera movement for once.
Where do I go from here? If I could use this image as the final image rather than the starting image then I could have Sir Nigel trotting in from the right, but using a final image means using Veo 2 which means no trotting sound. I could have the horse standing still while Sir Nigel shouts something into the pub, but the only way to use my cloned voice seems to be to have no background sounds, so no noise from inside the inn. I could try to make the clip using the Veo 3 Quality mode which gives better results but costs more credits, 100 credits per clip, and I only have 200 credits left. I make a big and painful decision: I'm not going to use my cloned voice after all - it's just too constraining. That means I will have to redo the last several clips to get the original voice back.
I give this new prompt to Veo:
The hand-held camera orbits to the right. The scene is set in medieval times. A scruffily boy dressed in peasant clothes appears via the doorway and descends the steps saying in a low class British accent "Welcome to the Rabbit Hole, your excellence" and the horseman replies with a noble British accent "Look after my steed, will you, young man". The boy takes hold of the horse's reins. There is, coming from inside the inn, the sound of a dozen British men and women peasants laughing, talking and shouting. Hyper-realistic.
I thought this second attempt had to come out better than the Sir Nigel clone clip but ... I was wrong. Does the new clip deliver a 'hyper-realistic' rendering of the peasant boy? No - he looks like a character from a Disney movie. And the clip has no sound! Veo warns that this happens sometimes with Veo 3, but it's the first time it's happened to me.
I have 170 credits left, enough to generate 8 clips in fast mode or 1 clip in quality mode. I won't get next month's 1000 credits for another 16 days, so I'm bound to run out before then, so I decide to buy extra credits now - I can buy 2500 credits for €28 and they are valid for 12 months (whereas unused monthly credits are not carried over from month to month).
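For anyone who wants to check my credit arithmetic, here is a rough sketch. The pack price, pack size and the 100 credits per quality-mode clip come from the figures in this post. The 20 credits per fast-mode clip is my assumption, inferred from 170 credits being enough for 8 fast clips, so treat it as a guess rather than an official price.

```python
# Figures from this post.
PACK_CREDITS = 2500       # credits in the extra pack
PACK_PRICE_EUR = 28.0     # price of the extra pack in euros
QUALITY_COST = 100        # credits per clip in quality mode

# ASSUMPTION: inferred from "170 credits = 8 clips in fast mode".
FAST_COST = 20            # credits per clip in fast mode


def clips_affordable(credits: int, cost_per_clip: int) -> int:
    """Whole clips that can be generated with the given credit balance."""
    return credits // cost_per_clip


def euros_per_clip(cost_per_clip: int) -> float:
    """Cost of one clip in euros when paying with purchased pack credits."""
    return PACK_PRICE_EUR / PACK_CREDITS * cost_per_clip


if __name__ == "__main__":
    print("Fast clips with 170 credits:", clips_affordable(170, FAST_COST))
    print("Quality clips with 170 credits:", clips_affordable(170, QUALITY_COST))
    print(f"Quality clip price: EUR {euros_per_clip(QUALITY_COST):.2f}")
    print(f"Fast clip price (assumed): EUR {euros_per_clip(FAST_COST):.2f}")
```

With these numbers a quality-mode clip works out at €1.12, and failed or silent generations cost the same as good ones if they are charged.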
Having made that decision, I decide to re-run the peasant boy clip generation in Veo 3 but using the quality mode and the same prompt except that I specify 'hyper-realistic peasant boy'. The generation fails. I try again and the generation 'succeeds' but unfortunately Veo understood from my poorly written prompt that the crowd should begin laughing only after Sir Nigel says his line. Also, the clip shows too much of the inside of the inn, giving me less control of that in the next clip, the crowd inside is happy but not boisterous, the peasant boy still looks Disney-like, there are unasked-for musical tones and Sir Nigel doesn't seem to be moving his lips. Apart from that it's fine!
Despite the high cost, I decide to try again in quality mode (€1.12 per generation). I reword the prompt like this:
The camera is hand-held. The scene is set in medieval times. In the background there is, coming from inside the inn, the continuous sound of a dozen British men and women peasants boisterously laughing, talking and shouting. A hyper-realistic scruffily boy dressed in peasant clothes appears via the doorway and descends the steps saying in a low class British accent "Welcome to the Rabbit Hole, your excellence" and the horseman replies with a noble British accent "Look after my steed, will you please, young man". The boy takes hold of the horse's reins. Hyper-realistic.
The generation fails. I try again and it generates a clip (not shown) which is okay except that it is silent. I get a notification: 'No audio generated; Speech generation by minors is currently unsupported'. But two minutes ago it DID generate the boy's speech! Is Google going to charge me anyway for the silent clip? Maybe I can use the audio from the clip above with the new silent clip? It probably won't work well but I'm a sucker for punishment so I try (also adding audio generated with ElevenLabs from the prompt 'a boisterous crowd of a dozen British peasants laughing, talking and shouting in medieval times'), and get this:
Not too bad - that'll have to do.
This will be a complex and dynamic scene, a challenge for Veo 3. I send this prompt and the usual reference image of Sir Nigel to Midjourney:
In medieval times we are in a small rural British inn where a dozen customers are drinking, laughing and talking in a boisterous atmosphere. We are looking over the shoulder of the reference man at a cheerful buxom blonde red-faced woman about 35 years old who is carrying a large tankard of beer. The colors are warm. Hyper-realistic.
Again I get four fabulously rich and beautiful images - here are the best two:
It's amazing to be able to generate images as rich and pleasing as these for just a few pennies, but in fact none of the images are usable. None of them are views over Sir Nigel's shoulder. There are almost no women in the images other than Julia. There are too many people, packed too closely together. Due to imprecise prompting, they all show people in very elegant clothes, even including some people in armour. Some of the images are not photo-realistic. Julia looks younger than 35 and in the second image is hardly buxom. The first image is the best, but she doesn't actually seem to be holding the tankard. I was expecting (but didn't specify) a pewter or wooden tankard - did glass tankards actually exist in medieval times? Yes they did, but I would still prefer wooden. I send this revised prompt:
In medieval times we are in a small rural British inn where a half dozen customers are drinking, laughing and talking in the company of several ugly fat women with whom they are flirting. Some people are standing and others are seated at tables in lascivious poses. There is a boisterous atmosphere. It is a low class inn and everyone is wearing peasant clothes. We are looking over the shoulder of the reference man at a beautiful cheerful buxom blonde red-faced woman about 40 years old who is carrying a large wooden tankard of beer. The colors are warm. Hyper-realistic.
Midjourney rejects the prompt with this message: 'Sorry, the AI moderator is unsure about this prompt.' I remove the words in bold and try again and the prompt is accepted and gives four more richly detailed images of which these are the three most interesting (the other image had only a dozen women standing with their backs to the camera):
When I say 'interesting' I don't mean 'acceptable' - the last image is a waste of my time and of computing power. The first image is the closest, with the right atmosphere, but most people have their backs to the camera, the woman at right doesn't look like a cheerful 40-year-old in peasant clothes, and there are too many customers and not enough women. In the second image there are again far more customers than I asked for and the woman at right again has her back to us. I try again with this:
In medieval times we are in a small rural British inn where five male customers are drinking, laughing and talking in the company of three ugly fat women. Some people are standing and others are seated at tables. There is a boisterous atmosphere. It is a low class inn and everyone is wearing peasant clothes. We are looking over the shoulder of the reference man at a beautiful cheerful buxom blonde red-faced woman about 45 years old who is facing the camera and carrying a large wooden tankard of beer. The colors are warm. Hyper-realistic.
Here are the two best images of the four generated:
Look at the details in the clothes! But I'm afraid that such a level of detail could pose a challenge for Veo - will I end up returning to ChatGPT? None of the four images include the beautiful woman looking at the camera, and now I understand the confusion: I haven't been submitting the reference image with these recent prompts. So sometimes I have been looking over the shoulder of a customer, sometimes of the beautiful woman, but never of Sir Nigel. These images still have too many people. Two have a limited depth of field, which is an interesting initiative by Midjourney, and two use a 'Dutch (slanted) angle'. 'In cinematography, the Dutch angle is one of many cinematic techniques often used to portray psychological uneasiness or tension in the subject being filmed.'
I try FOR THE LAST TIME with this (note the absence of a reference to the reference character). I also turn down the stylization value from 100 (out of 1000) to zero. The lower the stylization the less Midjourney will apply its artistic licence.
In medieval times we are in a small rural British inn where exactly five male customers are drinking, laughing and talking in the company of exactly three ugly fat women. Some people are standing and others are seated at tables. There is a boisterous atmosphere. It is a low class inn and everyone is wearing peasant clothes. A beautiful cheerful buxom blonde red-faced woman about 45 years old is facing the camera and carrying a large wooden tankard of beer. The colors are warm. Hyper-realistic.
None of these images are photo-realistic, but the prompt compliance is better than before, with fewer customers, as requested. So by turning down the stylization we are increasing prompt compliance at the expense of photo-realism? The last image might be okay if we can make it more photo-realistic, have her actually holding the tankard, and fix the smudged face at right.
I ask ChatGPT to fix the image:
Can you make this image photo-realistic after having first made it look like the central woman is actually holding the tankard and unblurring the face of the man at the right with his left forearm on the table? 3:2 format (landscape) please.
The image is more photo-realistic but ChatGPT's understanding of 'holding the tankard' leaves something to be desired. She is now holding ... the cryptogram!!! Did she steal it from Sir Nigel?
I continue my efforts. So much work for one 8 second clip, and we don't even have a usable starting image yet!
That is more photo-realistic, but the central woman should be holding the tankard (the wooden cylinder) and the cryptogram should not appear in the image.
Eureka! A usable image at last! I just noticed the guy on the floor to her left, apparently about to throw up into the bucket. Nice!
I widen the image to 16:9 in Pollo and send it to Veo with this prompt:
The camera is hand held. We are in medieval England and everyone speaks with a British accent. In the background there is the continuous sound of the people talking and laughing. The man behind the bucket at the bottom left lifts his head and vomits into the bucket. The woman at the center says with a welcoming smile and a low class accent while walking towards the camera and looking at the camera "Welcome, kind sir. I'm Julia and I run this rabbit hole. You look really thirsty - can I bring you a pint of ale?" She offers towards the camera the tankard of beer she is carrying.. Hyper-realism.
That worked really well except that the noise of the crowd is too quiet so in CapCut I add the same crowd noise that I used in a previous clip to get this. No, it would be a mistake to use the same audio so soon so I make a new one.
Very pleased with that, even if the vomit was not forthcoming. Having Midjourney make an image which I then ask ChatGPT to make more photo-realistic might be a workflow I use again. The problem is that I then pass from the correct 16:9 format of Midjourney to the incorrect 3:2 format of ChatGPT, meaning I then have to use Pollo to get the 16:9 format back again. Not ideal. Annoying and surprising that ChatGPT, perhaps the most famous of AIs, currently cannot generate in 16:9 format.
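To give an idea of how much image has to be invented when going from 3:2 to 16:9, here is a small sketch that computes the extra width needed on each side while keeping the height. It only does the arithmetic and has nothing to do with how Pollo works internally. The 1536x1024 size in the example is an assumed size for a 3:2 landscape image, not a figure I measured, and the function name is made up for illustration.

```python
from fractions import Fraction
from typing import Tuple


def widen_padding(
    width: int, height: int, target: Fraction = Fraction(16, 9)
) -> Tuple[int, int]:
    """Return (left, right) padding in pixels needed to reach the target
    aspect ratio while keeping the original height."""
    target_width = round(height * target)
    extra = target_width - width
    if extra < 0:
        raise ValueError("image is already wider than the target ratio")
    left = extra // 2
    return left, extra - left


if __name__ == "__main__":
    # ASSUMPTION: 1536x1024 is used here as an example 3:2 landscape size.
    print(widen_padding(1536, 1024))  # (142, 142)
```

In this example about 284 pixels of new content, roughly 16% of the final width, has to be generated by the widening tool, which is why the edges of a widened image deserve a second look.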
BUT WAIT! An hour after thinking I had at last finished making this clip I realized that I had forgotten something essential: to replace the Julia character's face with that of my sister Julia, just as I have done with several siblings and my father already. I have two options: backtrack and replace the face with real Julia's face, or decide that this person is not Julia after all and give that name to some other character that we have not yet met. I decide to go for the first option.
I give ChatGPT this image of my sister Julia and this prompt:
Can you make three changes to the image you just created - replace the face of the woman at the center with this face, lower her arm and the tankard she is holding, and raise the head of the person who is kneeling behind the bucket at bottom left.
The result isn't bad and I'll go with it. I make the colours less warm, widen the image in Pollo and send the image to Veo with the same prompt as before...
I'll show you the result in a moment, but it's about time for me to have another look at the Veo 'scene builder' because I would like to extend this clip and have Julia say more.
When I use the scene builder I don't use it correctly and this prompt does not give me any continuity with the first clip...
The camera is hand held. We are in medieval England and everyone speaks with a British accent. In the background there is the continuous sound of the people talking and laughing. The woman at the center says with a welcoming smile and a low class accent while walking towards the camera and looking at the camera "Or maybe you would prefer a very thorough full-body massage by our lovely Harriet" The crowd makes a loud burst of laughter as the woman at the right turns towards the camera, stands up, curtsies and smiles, revealing that she is very old and is missing half her teeth. Hyper-realism.
Here are clips 20 and 21 joined together, with a total discontinuity at the join.
You'll notice perhaps that the resolution of the clip is less good than the others. When using the scene builder it seems the resolution is limited to 1280x720 whereas when NOT using the scene builder one has the option of upscaling to 1920x1080. Upscaling could conceivably be done elsewhere e.g. in Pollo but that would be yet another step in the process.
The scene builder has an 'extend' option that you might think is the option I should have chosen but it's not compatible with Veo 3 and therefore switches you to Veo 2 which as you know cannot generate sound - a huge drawback.
It's possible in the scene builder to capture a frame of the original clip so that it becomes an 'asset' i.e. an image that you can use as a starting frame for the next clip. I'll try that now within the scene builder but it's what I've been doing sometimes without using the scene builder - the only difference I see is that if you use the scene builder you can't take advantage of the upscaling feature within Veo, so why use the scene builder at all?
So... I tried that but then put the two clips together as well in CapCut which is a much better video editor than the 'scene builder'. This is the CapCut result. You can see a change in quality where the two clips are joined but it'll have to do.
Cheers to you for getting this far! Click HERE for the next clips.