To train an AI voice model, we need high-quality audio of the target person's voice.
Look for studio acapella versions of the artist’s songs, or isolated vocal tracks from music albums or YouTube.
If you only have a full song, extract the vocals with LALAL.AI or Ultimate Vocal Remover (UVR).
Ensure the final files are:
At least 5 minutes of clean audio
WAV format (44.1kHz, mono/stereo)
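The two requirements above can be checked before training. The following is a minimal sketch using only Python's standard `wave` module; the function names `wav_info` and `check_dataset` are hypothetical helpers, not part of RVC, and the `wave` module only reads uncompressed PCM WAV files.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for one PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def check_dataset(paths, expected_rate=44100, min_total_seconds=300):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    total_seconds = 0.0
    for path in paths:
        rate, channels, duration = wav_info(path)
        total_seconds += duration
        if rate != expected_rate:
            problems.append(f"{path}: sample rate is {rate} Hz, expected {expected_rate} Hz")
        if channels not in (1, 2):
            problems.append(f"{path}: {channels} channels, expected mono or stereo")
    if total_seconds < min_total_seconds:
        problems.append(
            f"only {total_seconds:.1f} s of audio in total, need at least {min_total_seconds} s"
        )
    return problems
```

Calling `check_dataset` with the list of your vocal files returns a list of messages describing anything that needs fixing; the 300-second default corresponds to the 5-minute minimum above.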
Now that we have voice samples, let’s set up RVC AI to train our model.
Go to the RVC GitHub Repository: https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI
Clone or Download the repository:
git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI.git
cd Retrieval-based-Voice-Conversion-WebUI
Install dependencies:
pip install -r requirements.txt
If you’re using Google Colab, open the provided notebook and run the cells.
Now, we’ll train the AI to replicate Ariana Grande’s voice.
Open RVC WebUI by running:
python infer-web.py
Upload the voice dataset (cleaned Ariana Grande vocals).
Choose the "Train Model" option.
Select Hyperparameters:
Model type: v2
Training epochs: 50 (for a basic model), 200+ for better quality.
GPU: Enable (if available).
Wait for training to complete (this may take hours depending on your system).
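Epochs and training steps are easy to mix up, so here is a rough sketch of how they relate. It assumes one optimizer step per batch and that a partial final batch still counts as a step; the clip count and batch size in the usage note are made-up illustration values, not RVC defaults, and RVC's own data loader may batch differently.

```python
import math


def steps_per_epoch(num_clips, batch_size):
    """One epoch is one pass over the dataset, processed batch by batch."""
    return math.ceil(num_clips / batch_size)


def total_steps(num_clips, batch_size, epochs):
    """Approximate number of optimizer steps for a whole training run."""
    return steps_per_epoch(num_clips, batch_size) * epochs
```

For example, `total_steps(300, 8, 50)` describes a hypothetical dataset sliced into 300 clips, trained with a batch size of 8 for 50 epochs. Going from 50 to 200 epochs multiplies the step count, and roughly the training time, by four.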
Now that we have an AI model trained on Ariana Grande's voice, we can make it sing Humdard by Arijit Singh.
Download the original song (Humdard).
Use Ultimate Vocal Remover to split it into the instrumental (karaoke track) and Arijit Singh’s acapella.
Upload the acapella to RVC WebUI.
Choose Ariana Grande’s trained model.
Set pitch & formant settings:
Formant shift: Adjust to match Ariana’s natural tone.
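Pitch shift: 0 to +12 semitones (depending on gender & tone); +12 raises the voice by one octave, which suits a male-to-female conversion like this one.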
Click Convert and wait for processing.
Download the converted AI vocals.
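To get a feel for the pitch shift setting used in the steps above, it helps to see what a shift in semitones does to frequency. This is a small sketch of the standard equal-temperament relation (each semitone multiplies frequency by 2^(1/12)); the function names are hypothetical and this is not RVC code.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a pitch shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def shift_frequency(freq_hz, semitones):
    """Frequency that results from shifting freq_hz by the given semitones."""
    return freq_hz * semitones_to_ratio(semitones)
```

A shift of +12 doubles every frequency (one octave up), so a note sung at 220 Hz comes out at 440 Hz, while a shift of 0 leaves the pitch unchanged.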
Open Audacity (or any DAW software).
Merge the AI vocals with the instrumental track.
Adjust the volume and effects to make it sound natural.
Export as MP3/WAV.
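Conceptually, merging the two tracks is a weighted sum of their samples. The sketch below illustrates that idea only; it is not how Audacity is implemented. It assumes both tracks are mono, 16-bit, and share the same sample rate, and the function name and default gains are hypothetical.

```python
def mix_tracks(vocals, instrumental, vocal_gain=1.0, instrumental_gain=0.8):
    """Mix two sequences of 16-bit PCM samples into one list of samples.

    The shorter track is padded with silence, and the sum is clipped to
    the 16-bit range so loud passages cannot overflow.
    """
    length = max(len(vocals), len(instrumental))
    mixed = []
    for i in range(length):
        v = vocals[i] if i < len(vocals) else 0
        m = instrumental[i] if i < len(instrumental) else 0
        sample = int(round(v * vocal_gain + m * instrumental_gain))
        mixed.append(max(-32768, min(32767, sample)))
    return mixed
```

Lowering `instrumental_gain` is the code equivalent of pulling down the instrumental's volume slider so the AI vocals sit on top of the mix.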
Screengrabs from RVC Web UI setup:
The resulting audio track (the instrumental is noticeably choppy, but the lyrics and acapella turned out beyond expectations; I uploaded it to SoundCloud 😂):
An explanation of the pitch extraction algorithms used in RVC (Retrieval-based Voice Conversion) and when to use each:
PM (Parselmouth)
What it does: Estimates pitch with Parselmouth, a Python interface to Praat.
Best for: General voice conversion when speed is preferred over accuracy.
Pros: Fast and lightweight.
Cons: Less accurate than Crepe or RMVPE; may produce robotic sounds in complex vocals.
Harvest
What it does: A pitch detection algorithm from the WORLD vocoder that smooths pitch variations.
Best for: Natural-sounding results, especially for clean vocals.
Pros: Good balance between accuracy and smoothness.
Cons: Slower than PM, and sometimes struggles with extreme pitch shifts.
Crepe
What it does: Uses a deep learning model for pitch detection, achieving very high accuracy.
Best for: High-accuracy pitch tracking, especially for complex vocals.
Pros: Very accurate, captures subtle pitch variations.
Cons: Slow and computationally expensive.
RMVPE
What it does: A newer deep-learning-based pitch extractor designed for robustness.
Best for: Most situations; it gives the best overall results (recommended for modern RVC use).
Pros: More stable and less prone to errors than the other methods.
Cons: Slower than PM and benefits from a GPU, but generally the best choice.
For fast results: PM (Parselmouth)
For natural tone & smoothness: Harvest
For high accuracy in complex vocals: Crepe
For best overall quality: RMVPE (Recommended)
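To make "pitch extraction" concrete, here is a deliberately simple estimator based on autocorrelation. It is not PM, Harvest, Crepe, or RMVPE, all of which are far more sophisticated; it is only a sketch of the underlying task, which is finding the fundamental frequency (f0) of a short audio frame. The function names and the 50 to 1000 Hz search range are assumptions chosen for illustration.

```python
import math


def sine(freq_hz, sample_rate, num_samples):
    """Generate a test tone as a list of floats in [-1, 1]."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(num_samples)]


def estimate_f0(samples, sample_rate, fmin=50.0, fmax=1000.0):
    """Estimate the fundamental frequency of one frame via autocorrelation.

    For each candidate period (lag), the frame is multiplied with a copy of
    itself shifted by that lag; the lag with the highest correlation is
    taken as the pitch period. Returns 0.0 if no positive peak is found.
    """
    n = len(samples)
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), n - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0
```

Running `estimate_f0(sine(220.0, 16000, 2048), 16000)` returns a value close to 220 Hz; the result is limited to whole-sample lags, which is one reason real extractors add interpolation, smoothing, or learned models on top of this basic idea.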