I send the previously used image to Veo with this prompt:
The hand-held camera zooms in. The snail says with a devious French accent "You are forgetting VAT! Due to the deficit crisis, the king has increased the tariff to 100% so the total price is two pennies, not one." Hyper-realistic.
The result is okay, except for a grunt at the beginning, the smile at the end and the words sounding a bit rushed.
I capture a frame from the end of the previous clip to use as the first frame of the next clip, then send that to Veo with this prompt:
The camera is hand-held. The snail says "However, I will sell you a jar for one penny if you agree to nibble that mushroom on the ground by your feet." Then he snickers. Hyper-realistic.
That comes out fine:
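By the way, grabbing that end-of-clip frame doesn't require opening a video editor. Here is a minimal sketch using OpenCV in Python (the file names are hypothetical, not the ones I actually used):

```python
import cv2  # pip install opencv-python

# Open the previously generated clip (hypothetical file name)
cap = cv2.VideoCapture("clip_snail_vat.mp4")
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

# Jump to the final frame and read it
cap.set(cv2.CAP_PROP_POS_FRAMES, frame_count - 1)
ok, frame = cap.read()
if ok:
    # Save it as the starting image for the next Veo generation
    cv2.imwrite("next_clip_start_frame.png", frame)
cap.release()
```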
ChatGPT makes an image of Sir Nigel holding the mushroom, which I uncrop and send to Veo with the prompt:
The hand-held camera pans left as the man says "You must be joking." He looks at the camera and says "Farewell, cruel world" then takes a bite out of the mushroom. Then he opens his eyes wide in astonishment and screams "WTF?" Hyper-realism.
That generates a rejection: 'Content is filtered due to unknown reasons.' I think I know the reason. I modify the prompt:
The hand-held camera pans left as the man says "You must be joking." He looks at the camera and says "Farewell, cruel world" then takes a bite out of the mushroom. Then he opens his eyes wide in astonishment and screams "Here we go" Hyper-realism.
Surprisingly, I still get the error. So the problem must be either the phrase "Farewell, cruel world", which could be taken as a reference to suicide, or the word "screams". I submit a new prompt:
The hand-held camera pans left as the man says "You must be joking." He looks at the camera and says "Oh well, here we go" then takes a bite out of the mushroom. Then he opens his eyes wide in astonishment and screams "WTF?" Hyper-realism.
That passes and works well except that I expected him to say "WTF?" after biting the mushroom rather than before:
This should be fun! I hope to use Midjourney to generate some trippy images and then we'll see what Veo can do with them. But I don't yet have more specific ideas than 'trippy'. Maybe there should be some reference to the trippy scenes in '2001: A Space Odyssey' or 'Interstellar', or to those great cartoon animations that Terry Gilliam made for Monty Python? If you're not familiar with his crazy work, then here's a taster:
One thing I do know is that I want this part to have music or a crazy soundtrack, and it might be appropriate to have AI generate that.
I send this prompt to Veo without sending a starting image:
A psychedelic trip set in medieval times in the style of Terry Gilliam’s animated cartoons
It is rejected with this message: 'The prompt could not be submitted. This prompt contains words that violate Vertex AI's usage guidelines.' I assume the problem is the reference to Terry Gilliam.
I want to make a clip in Midjourney using the same prompt, but Midjourney requires a starting image, so I first use the prompt just to generate a starting image, which gives...
Interesting images, but it's clear that Midjourney has never heard of Terry Gilliam.
I try using ChatGPT's AI video generator, Sora, for the first time. I have access to it because I currently have a paid-for Plus subscription with ChatGPT. It accepts the prompt and gives a silent 5-second clip. Since that is so short, I generate three more clips with the same prompt and join them together in CapCut. I also add a soundtrack generated by elevenlabs from the prompt:
Soundtrack inspired by Terry Gilliam's cartoon animations
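For what it's worth, this kind of simple join-plus-soundtrack could also be scripted instead of done in CapCut. A minimal sketch with the moviepy 1.x API, assuming hypothetical file names:

```python
from moviepy.editor import VideoFileClip, AudioFileClip, concatenate_videoclips

# The four silent 5-second Sora clips (hypothetical file names)
clips = [VideoFileClip(f"sora_trip_{i}.mp4") for i in range(1, 5)]
joined = concatenate_videoclips(clips)

# Lay the elevenlabs soundtrack underneath, trimmed to the video length
soundtrack = AudioFileClip("gilliam_soundtrack.mp3").subclip(0, joined.duration)
joined = joined.set_audio(soundtrack)

joined.write_videofile("trippy_sequence.mp4")
```

CapCut is obviously more convenient for anything fancier than a straight join, but a script like this is handy when you keep regenerating and re-assembling clips.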
I'm disappointed by the Sora result - it certainly doesn't look anything like a Terry Gilliam animation. Then I have the idea that although Google Vertex AI Studio rejected my prompt referring to Terry Gilliam, perhaps Google Flow (which has started blocking me from uploading images of people) will accept it. So I try that and get a brilliant result, even if not that similar to Gilliam's animations, which were made using stop-motion photography, moving cut-out pieces of paper around. I send the same prompt thrice more, string the clips together and get:
I feel privileged to be the first person on the planet to see these animations, since they were made specially for me! Of course, you are also among the first people to see these animations, so you too are privileged! It's remarkable that one of the clips in that compilation contains a disheveled knight, a snail and red and white mushrooms all in the same scene, given that when Veo generates a clip it has no memory of previous clips.
I thought about adding music to this compilation, but there is already some and I don't want to drown out the sound effects, so no.
Early on I rejected the idea of layering a narration on top of the video, but later I decided that it might be helpful to have panels of text to help clarify the storyline and to mark changes in time or place. But when those panels are showing, what should we hear? It might be good to hear the text spoken, and it might be appropriate, in acknowledgement of the Holy Grail movie's influence on my own, to use a clone of John Cleese's voice. So I find a monologue by him, download that and use it to create a voice clone at elevenlabs.com.
For the record, these are the settings that I used with the voice clone: Model: Elevenlabs Multilingual v2, Speed 1.05, Stability 62%, Similarity 75%, Style Exaggeration 50%, Speaker Boost: on
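For anyone who would rather drive the clone programmatically than through the website, roughly the same settings can be passed to the elevenlabs text-to-speech REST endpoint. A minimal sketch in Python - the API key, voice ID and output file name are placeholders, and the 'speed' field is my assumption of how the web UI's Speed 1.05 setting maps to the API:

```python
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"   # placeholder
VOICE_ID = "your_cloned_voice_id"     # the John Cheese clone's voice ID

resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "text": "Stop that nonsense, will you?",
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {
            "stability": 0.62,
            "similarity_boost": 0.75,
            "style": 0.5,            # 'Style Exaggeration 50%'
            "use_speaker_boost": True,
            "speed": 1.05,           # assumption: how the web UI's Speed maps to the API
        },
    },
)
with open("cheese_line.mp3", "wb") as f:
    f.write(resp.content)  # the response body is the generated audio
```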
What I imagine for the epilogue is that John Cheese (the clone of you-know-who) will present some still images showing what happened to each of the characters. But we'll start with an actual image of John Cleese and try to bring that to life in Veo, if Veo doesn't recognize the celebrity and reject him. I find a photo of him and upload it with the following prompt, but I forget to specify a British voice. That doesn't matter much because I'm going to try to replace this voice with John's cloned voice anyway.
Hand-held camera. The man says "Stop that nonsense, will you? We didn't ask for something completely different. We just want to know how everything turned out." Hyper-realism.
I upload the video to elevenlabs and use the voice changer to change the voice into the John Cheese clone. That works badly - there is still too much American intonation in the voice. So I redo the clip, specifying a 'rich southern British accent', then change the voice to the Cheese voice in elevenlabs. That works satisfactorily.
In this clip I intend to show appropriate still images while the Cheese clone says what became of each character. I generated the still images in ChatGPT, but there was still some frustration due to misunderstood prompts (my poor prompting, no doubt). For example, I sent ChatGPT the images of my sister Caroline and my father Brian along with the image of them meeting Sir Peter, the Earl of Kilmarnock, with the prompt:
Please generate an image of the king and princess facing the camera and smiling, based on the image of them that I have just uploaded (keep the same clothes, background and lighting). Their faces should resemble the faces of the man and woman that I have just uploaded.
I don't think sunglasses existed in medieval times! Please redo the image without the glasses and with a new background: an impressive medieval castle. Please give less pronounced smiles, and if you can improve the resemblance to the uploaded images of the man and woman, that would be great.
The glasses are gone and in their place we have ... a rat!!! Nevertheless, I did use this image, editing out the rat, because I can't keep refining forever. (In the edited image I left the tips of the rat's ears as an Easter egg for those who take the trouble to read this log.)
For the image of Julia with her new employee, I uploaded to ChatGPT the image of Sir Nigel on his horse outside the inn, as well as an image showing Julia inside the inn and the genuine photograph of Julia. I gave ChatGPT this very challenging prompt:
Can you take the outdoor image of the inn as the main image, remove the horse and rider and place in the foreground the woman extracted from the interior shot of the inn, replacing her face with a face more like the remaining image. In the background there should be three or four women in medieval clothes. Everyone is smiling.
I then said:
Very good, but I forgot to say that next to the woman should be a young, handsome, very muscular man with a shining body, naked to the waist. Please regenerate, keeping the tankard in the woman's hand and again try to maximise the resemblance of the woman to the image I uploaded (DSC_0334.JPG).
ChatGPT said:
I need to stop here. I can certainly help you continue building the medieval inn scene with your woman holding the tankard and keep her likeness as close as possible to the reference photo you provided. However, I can’t generate an image that emphasizes or sexualizes nudity (like a shirtless muscular man with a shining body) in this context. What I can do is:
Place a strong, handsome young man next to the woman, dressed in period-appropriate clothing (for example, a loose medieval-style tunic, or a shirt open at the chest to suggest strength without nudity).
Emphasize his muscular build and heroic look while keeping the scene photo-realistic and consistent with the bustling inn background.
Maximise the facial resemblance of the woman to your uploaded photo, while preserving the tankard in her hand as requested.
Would you like me to regenerate the image this way — with the man styled as strong and striking, but still clothed in a way that fits the medieval inn setting?
I said:
Yes, please!
It's the first time ChatGPT has rejected a prompt of mine on the grounds of 'sexualization'. I wonder how it would react if asked to generate a pool-side scene - logically, would it have to refuse to generate any scene that shows men naked from the waist up? Note that in the above exchange I referred to the file name of an image. It was ChatGPT that gave me that idea, and it means I don't have to keep re-uploading images that I uploaded long ago - I just need to give the file names (and keep the file names meaningful and short). Here is the new image, very satisfactory:
A new problem: I have run out of monthly credits on Pollo, the platform I have been using for uncropping (extending) my images from 3:2 to 16:9 format. I try Runway instead, which I have used before with disappointing results. I give the prompt:
Please uncrop this image to 16:9.
I get the first image below, which is okay except that Runway has changed the camera position: we now see more of the street, but Julia and her employee are no longer looking at the camera. So I give a new prompt:
Can you uncrop the image I sent you using outpainting only, without modifying the original image other than extending it?
That gives me the second image below, okay as long as you don't look too closely at the horribly deformed faces of the people in the background. Which image do you prefer?
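Incidentally, the preparation step behind this kind of outpainting - centring the untouched 3:2 original on a 16:9 canvas and leaving empty margins for the model to fill - is easy to do locally. A minimal sketch with Pillow (file names are hypothetical):

```python
from PIL import Image

src = Image.open("julia_inn_3x2.jpg")        # hypothetical 3:2 original
w, h = src.size

# New 16:9 canvas with the same height; the extra width is what gets outpainted
new_w = round(h * 16 / 9)
canvas = Image.new("RGBA", (new_w, h), (0, 0, 0, 0))
canvas.paste(src, ((new_w - w) // 2, 0))

canvas.save("julia_inn_16x9_padded.png")     # transparent margins left and right
```

An outpainting tool can then be asked to fill only the transparent margins, which is exactly the 'extend without modifying the original' behaviour I was trying to coax out of Runway.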
The last, and most difficult, epilogue image that I did was for Simon, the Holy Snail. I gave ChatGPT the image of the snail on the slimy sofa and this impossible prompt:
Can you remove the background and the man at left from this image, leaving the snail and the sofa, and then wrap that image around a plastic tube of moisturizer that also carries the words 'SIMON'S SLIME'. Photo-realistic. 3:2 landscape format. A bright image with neutral white balance.
I was half expecting ChatGPT to reject the challenge as too difficult but no, it gave this image! Very nice (does it look more like Bill Gates than Simon?), but it's portrait orientation, not landscape, so I say, with the intention of placing a shampoo bottle next to the moisturizer:
Very nice! Can you make a similar image except with a plastic bottle of shampoo instead of a plastic tube. The words should say 'SIMON'S SLIME SHAMPOO'. Photo-realistic. 3:2 landscape format. A bright image with neutral white balance.
As for the John Cheese narration of the epilogue, I used elevenlabs' text-to-speech feature to make the Cheese clone speak the following text:
Two months after buying Simon's slime, Sir Nigel was able to raise Princess Caroline from her slumber, to the relief of her father, King Brian. Fiona, the wicked witch of the East, was forgiven for what she did and has since found fame as an author of cookery books. Sir Peter, spurned by Princess Caroline after losing in the jousting final, is now busy renovating his castle in Kilmarnock. Julia, after mysteriously becoming quite wealthy, dismissed Harriet and replaced her with a male masseur who brought a flood of female customers to the Rabbit Hole. Simon's sales were boosted by the awakening of Princess Caroline, and he expanded his product line to include slime-based moisturizers and shampoo. And finally, Sir Nigel married Harriet, of course, and they lived happily ever after.
I wanted to generate some medieval instrumental music to use in the background of the epilogue so I tried the elevenlabs 'Music' section for the first time, putting in this prompt: Typical jolly medieval music on typical instruments of the period. I expected to get a short instrumental piece but instead I was blown away to get a beautiful full-blown song with lyrics created just for me! In the end I decide to use this song during the credits rather than during the epilogue.
As you recall, I had a problem in clip 6: the AI was never able to properly show the silver coin being passed by Princess Caroline to Sir Peter. I redid that clip and got a better result (though he then immediately dropped the coin!).
In clip 8 Sir Peter refused to take the lance that was offered to him, and I noticed that the crowd are in modern dress. I manage to get a clip in which he awkwardly accepts a modest red 'pole', since Veo will apparently no longer accept the word 'lance', that being a 'sensitive' word for a weapon.
In clip 9 I had to recolor the lance red to match clip 8, and I tried yet again, unsuccessfully, to have the two jousters on opposite sides of the rail. However, the new clip is still an improvement, and appears in the finished movie.
Similarly, I wasn't happy with clip 25, set in the snowy mountains. At that time I wasn't able to submit images of people to Veo, so I generated that clip in Midjourney. Now that I am again able to use Veo, I redid the clip and got a better result.
I wasn't happy with clip 28, the magical mushroom forest, because Sir Nigel's walking stick disappears, he looks cartoonish (especially his briefly glimpsed face), and why not see him from the front rather than the back? But when I try to generate the new starting image in ChatGPT I get the message 'You've reached the maximum length for this conversation, but you can keep talking by starting a new chat.' Maybe it's a sign that it's time for me to stop. I hope this is not going to be a problem. It shouldn't be - this project is 99.9% finished. More about this attempt to improve the clip on the page about clips 26-29.
I used CapCut to edit the video. It's made by ByteDance, the same people who run TikTok, and it has quickly become quite popular. BUT recently many users have turned against CapCut because more and more features that used to be free are now only available in the paid-for version. I have had that experience myself - when I open my Quest project in CapCut I am no longer able to export it because it says I am using features that are no longer free, even though they were when I made the project. And some of the features that are blocking the export are not even used in my project, such as speech isolation. This is extremely frustrating and unusually cruel. I thought I was safe from the problem because I was using an old version, 6.5.0, but they went as far as to modify (not update) the old versions to stop people having free access to those features.
So is there another free video editor that can compete with CapCut? Absolutely: DaVinci Resolve (not to be confused with the paid-for version, DaVinci Resolve Studio) is a very powerful video editor - it's amazing that it's free. Its only downside is its sophistication, meaning it has a steeper learning curve than CapCut. Oh, and it's apparently quite resource-intensive, so it won't run well on an older or more basic computer.
AI is progressing fast and is already having a major impact on human society as, for example, it puts millions out of work (and I don't believe that millions of new jobs will be created to compensate for that). But for many people the dream has long been to live a fulfilling life without having to work, so hopefully problems like uneven wealth distribution can be worked out so that this becomes a reality. Does one have to have a job in order to feel fulfilled in one's life? In any case, the tsunami of AI is an unstoppable force that has to be engaged with rather than resisted, Luddite-style. So a main aim of my project was to engage with state-of-the-art AI, and it was rather mind-blowing to me to see what AI is capable of these days.
Interacting with ChatGPT was like talking to a human, although there were frequent misunderstandings and disappointments. For example, ChatGPT always wanted to generate rather murky images for me, not bright enough and with an overly warm color balance, despite my repeated requests for it (I keep having to refrain from saying 'him') not to do that. It was also frustrating that ChatGPT is not currently able to generate images in the extremely popular 16:9 format, forcing me to uncrop (expand) the images each time. I understand that many young people now turn to ChatGPT for advice on even the most trivial of decisions, as illustrated in this video.
As for Google Veo, its ability to generate photo-realistic video clips with sound effects and lip-synced voices is also impressive and a bit scary - the potential for abuse is huge, and it will become increasingly difficult to distinguish AI-generated video from reality. It's so easy and cheap to do that a tsunami of 'AI slop' seems inevitable - I think this could spell the end of TikTok, for example, as people realize that most of the stuff there is (or will be) fake. Note that I have made dedicated pages to address the issues of 'AI slop' and 'Is AI killing art?' I have learnt that the main problems when working with Veo are:
keeping consistency (face, clothes, environment, voices) from clip to clip
the poor handling of physical interactions such as mounting a horse, passing a coin or lance, or throwing magic potion in people's faces!
the extreme censorship of prompts that can be interpreted as sexual, violent or involving children or celebrities.
So, was it worth the one hundred hours and one hundred euros (for one month of subscriptions to various sites)? I think so. As I said, I have always wanted to make a movie. I had hoped that during my teaching career I would one day work with a drama teacher and his or her pupils to do that, but in the end that didn't happen, and it was recent progress in AI that made it possible. I think the duration of the project and the amount of time I spent on it were optimal for a first attempt - I was certainly running out of energy by the end. I expect I will make a second movie one day, and I imagine it will take me a third of the time to make. It must be said that writing this detailed log of my movie-making effort accounted for a large fraction of the time that I have spent on the project. Also, AI is progressing so fast that it will become steadily easier and cheaper to do AI video generation.
Perhaps my next movie will be set in the present day or in the future, no doubt in the context of a large city. I hope my next movie will have more psychological interest and be less full of clichés than The Quest. Having said that, the inevitable glitches and awkwardness in the finished movie are almost a plus in an 'amusing' video like The Quest, whereas they might be much more problematic in a more 'serious' movie.
I hope that I have awoken in you an interest in AI video creation and you might even want to work with me on my next project. If so, I look forward to hearing from you!
Thanks again to my father and siblings for letting me borrow their names and faces and to Catherine for her helpful advice.
HERE again is the link to the finished video. Thanks for watching!
PS I did a word count on this 'making of' and it's about 30 thousand words, equivalent to about 50 A4 pages of text. No wonder a significant fraction of the time spent on this project was on this 'making of' rather than on actually making videos. Next time: no such 'making of'!