AI videos are very likely to include voices, so being able to control those voices is essential. That means controlling not only the words that are said but also the pace, the intonation, the accent and so on. You might even want to use your own voice in the videos, or the voices of people you know.
The current undisputed king of voice sites is elevenlabs.io. I'm not sure to what extent such sites use AI, but they can certainly be useful when developing AI videos. The free plan at elevenlabs gives you text to speech with a choice of voices, voice changer (speech to speech) with the same range of target voices as text to speech, and speech to text. With a paid-for plan you can also clone your voice, after which you can use your cloned voice as the target of text to speech or voice changer. To clone your voice you just need to provide recordings of yourself reading random text - I provided three 90 second recordings of me reading newspaper articles. Of course, if you have some very clear recordings of someone else's voice, you can clone their voice too, with their permission, after which you can make them say whatever you want!
I tested my voice clone by using it to replace the voice of Stewie Griffin in a one-minute clip from Family Guy. Brian (the dog) also speaks in the clip, so I had to change only Stewie's voice and not Brian's. I downloaded the clip, opened it in the free video editor CapCut, then exported the audio only, in MP3 format. Then I silenced the audio in the source video and imported the audio I had just exported - that separated the audio from the video so that I could work on it as needed. CapCut can also separate the audio in a single step, but that option is not available in the free version. Then I imported the audio into elevenlabs and used 'voice changer' to replace both voices in the audio with my voice clone. I exported the result and imported it into CapCut. Then I split the original audio into sections so as to separate the voices of Stewie and Brian, and split the voice clone audio track at exactly the same places. Then I deleted the segments containing Stewie's voice in the original audio and the segments corresponding to Brian in the voice clone track. To allow for direct comparisons, I also added to the end of my version a few seconds of the original video, then the same few seconds with my voice, then the same few seconds with a stock voice from elevenlabs called 'Blondie'. Finally I exported the video and uploaded it to YouTube. Does Stewie now sound like me? I'll let you decide. Stewie is American of course - I would have got a result closer to my voice if I had used speech spoken with a British accent.
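For anyone who prefers code to clicking, the splice described above (keep Brian from the original track, take Stewie from the voice clone track) can be sketched in a few lines of Python. This is only an illustration of the logic, not what CapCut does internally: the function name splice_tracks, the tiny lists of numbers standing in for audio samples, and the cut points are all made up for the example.

```python
from typing import List, Sequence, Tuple

# Each segment is (start_index, end_index, speaker); end_index is exclusive.
Segment = Tuple[int, int, str]


def splice_tracks(
    original: Sequence[int],
    cloned: Sequence[int],
    segments: List[Segment],
    replace_speaker: str,
) -> List[int]:
    """Use cloned samples where replace_speaker talks, original samples elsewhere."""
    if len(original) != len(cloned):
        raise ValueError("both tracks must have the same length")
    result = list(original)
    for start, end, speaker in segments:
        if speaker == replace_speaker:
            # Both tracks are cut at exactly the same places, so the
            # replacement slice has the same length as the one it overwrites.
            result[start:end] = cloned[start:end]
    return result


# Hypothetical stand-ins for real audio: small numbers are the original
# voices, large numbers are the voice clone track.
original = [1, 1, 1, 2, 2, 2, 1, 1]
cloned = [9, 9, 9, 8, 8, 8, 9, 9]
segments = [(0, 3, "Stewie"), (3, 6, "Brian"), (6, 8, "Stewie")]

mixed = splice_tracks(original, cloned, segments, "Stewie")
print(mixed)  # [9, 9, 9, 2, 2, 2, 9, 9]
```

The important detail is the same one as in the editor: the two tracks must be cut at identical positions, otherwise the voices drift out of sync with the picture.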
If you would like me to make a clone of your voice so that your character in my short video speaks with a voice resembling yours, here's what you need to do:
Find the best microphone you can lay your hands on - it may well be the one in your phone.
Find an environment with no background noise and minimal reverberation - a room with lots of soft furnishings rather than hard surfaces.
If you're using your phone, look for a voice recording app. There may already be one installed, but be aware that the one from Google probably won't store the recording on your phone but rather in the cloud (at recorder.google.com), which is less convenient. The free one in the Android Play Store which is made by 'quality apps' seems popular and okay.
Set the app to record in MP3 format with the highest possible quality.
Stand your phone somewhere it can rest on its own during the recording, so that you don't record the sound of your hand touching it.
Find some text to read, e.g. in a book, without too many proper nouns.
Read some text for about 90 seconds while you record. Speak with your normal voice (see below). Keep your mouth about 15 cm from the microphone (on my Pixel 6A the microphone used by the app seems to be on the bottom edge of the phone) and direct your voice slightly away from it, so as not to record the blasts of air that hit the mike when you speak 'plosive' consonants like 'k'.
Make two more 90 second recordings.
Send me the three MP3 files and I will use the Instant Voice Clone feature on elevenlabs.io to make your voice clone. I think sending the files by email is best - WhatsApp would probably reduce the quality.
If you strongly prefer making a video recording rather than an audio recording then I can use that by extracting the audio from the video - it just means you will be sending me larger files.
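If you are curious how the audio can be pulled out of a video recording, one common way is the free command-line tool ffmpeg. The sketch below only builds the command and prints it. It assumes ffmpeg is installed and was built with the libmp3lame MP3 encoder, and the file name sample1.mp4 and the function name build_extract_command are made up for the example.

```python
import subprocess
from pathlib import Path
from typing import List


def build_extract_command(video_path: str) -> List[str]:
    """Build an ffmpeg command that saves a video's audio track as an MP3 file."""
    audio_path = str(Path(video_path).with_suffix(".mp3"))
    return [
        "ffmpeg",
        "-i", video_path,      # input video file
        "-vn",                 # leave out the video stream
        "-c:a", "libmp3lame",  # encode the audio as MP3
        "-q:a", "0",           # best variable-bitrate quality for this encoder
        audio_path,
    ]


command = build_extract_command("sample1.mp4")
print(" ".join(command))

# To actually run it (needs ffmpeg on your computer):
# subprocess.run(command, check=True)
```

Note that the audio has then been compressed twice, once in the video and once in the MP3, which is another reason a direct audio recording is preferable.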
Warning: once I have your voice clone I will be able to make it say whatever I want, so don't let me clone your voice unless you trust me! Similarly, you might worry about your voice samples being sent to elevenlabs.io - I can assure you that this company is highly respected and used by many AI creators.
More advice on recording from this page on elevenlabs:
The AI will attempt to mimic everything it hears in the audio. This includes the speed of the person talking, the inflections, the accent, tonality, breathing pattern and strength, as well as noise and mouth clicks. Even noise and artefacts which can confuse it are factored in.
Ensure that the voice maintains a consistent tone throughout, with a consistent performance. Also, make sure that the audio quality of the voice remains consistent across all the samples. Feeding the AI audio that is very dynamic, meaning wide fluctuations in pitch and volume, will yield less predictable results.
Another important thing to keep in mind is that the AI will try to replicate the performance of the voice you provide. If you talk in a slow, monotone voice without much emotion, that is what the AI will mimic. On the other hand, if you talk quickly with much emotion, that is what the AI will try to replicate.
It is crucial that the voice remains consistent throughout all the samples, not only in tone but also in performance. If there is too much variance, it might confuse the AI, leading to more varied output between generations.
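One rough way to check the advice about consistent volume is to compare the average loudness of the recordings. The sketch below is my own illustration, not something elevenlabs provides. It assumes the audio has already been decoded into lists of samples between -1.0 and 1.0, and the names rms_dbfs and level_spread, as well as the made-up sample data, are hypothetical. It says nothing about pitch or performance, only about level.

```python
import math
from typing import Dict, Sequence


def rms_dbfs(samples: Sequence[float]) -> float:
    """Average (RMS) level in dB relative to full scale, for samples in -1.0..1.0."""
    if not samples:
        raise ValueError("no samples")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # pure silence
    return 10 * math.log10(mean_square)


def level_spread(recordings: Dict[str, Sequence[float]]) -> float:
    """Difference in dB between the loudest and the quietest recording."""
    levels = [rms_dbfs(samples) for samples in recordings.values()]
    return max(levels) - min(levels)


# Made-up data: the third take is recorded at half the amplitude of the others.
recordings = {
    "take1": [0.5, -0.5] * 4,
    "take2": [0.5, -0.5] * 4,
    "take3": [0.25, -0.25] * 4,
}

spread = level_spread(recordings)
print(f"Level spread between takes: {spread:.1f} dB")
```

Halving the amplitude shows up as a spread of about 6 dB. How much spread is too much is a judgment call; a large value simply suggests re-recording the odd take out at the same distance from the microphone.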