AI Voice Cloning

Have you ever wondered how to make AI sing your favourite songs using a celebrity’s voice? With the Retrieval-Based Voice Conversion (RVC) model, you can train an AI voice clone and make it sing any song—for free!

In this guide, I’ll show you how to clone Ariana Grande’s voice and use it to sing Humdard by Arijit Singh. Let’s dive in!

Step 1: Collect Voice Samples

To train an AI voice model, we need high-quality audio of the target person's voice.

1.1 Find Clean Voice Clips

Look for studio acapella versions of the artist’s songs.
Use isolated vocal tracks from music albums or YouTube.
You can also extract vocals from songs using LALAL.AI or Ultimate Vocal Remover.

1.2 Extract & Prepare the Audio

If you have a full song, extract only the vocals using UVR (Ultimate Vocal Remover).
Ensure the final files are:
- At least 5 minutes of clean audio
- WAV format (44.1kHz, mono/stereo)
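
Before training, it can help to confirm that your clips actually meet the two targets above. Here is a minimal sketch using only Python's standard library; the function names and the folder layout are my own assumptions and not part of RVC, and the wave module only reads uncompressed PCM WAV files.

```python
import wave
from pathlib import Path

MIN_TOTAL_SECONDS = 5 * 60   # "at least 5 minutes" from the checklist above
EXPECTED_RATE = 44_100       # 44.1 kHz


def describe_wav(path):
    """Return (duration_seconds, sample_rate, channels) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return wav.getnframes() / rate, rate, wav.getnchannels()


def check_dataset(folder):
    """Add up the duration of every .wav in `folder` and list any problems."""
    total = 0.0
    problems = []
    for path in sorted(Path(folder).glob("*.wav")):
        seconds, rate, channels = describe_wav(path)
        total += seconds
        if rate != EXPECTED_RATE:
            problems.append(f"{path.name}: sample rate is {rate} Hz")
        if channels not in (1, 2):
            problems.append(f"{path.name}: {channels} channels (expected mono or stereo)")
    if total < MIN_TOTAL_SECONDS:
        problems.append(
            f"only {total:.0f} s of audio in total; aim for at least {MIN_TOTAL_SECONDS} s"
        )
    return total, problems
```

Calling check_dataset("my_dataset") (a hypothetical folder name) returns the total duration in seconds and a list of human-readable warnings; an empty list means the clips match the checklist.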

Step 2: Set Up RVC Model

Now that we have voice samples, let’s set up the RVC WebUI so we can train our model.

2.1 Download RVC WebUI

Go to the RVC GitHub Repository: https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI

Clone (or download) the repository and move into its folder:

git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI.git

cd Retrieval-based-Voice-Conversion-WebUI

Install dependencies:

pip install -r requirements.txt

If you’re using Google Colab, open the provided notebook and run the cells.
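
Once the dependencies are installed, a quick sanity check can save a failed training run later. RVC is built on PyTorch, so the sketch below simply asks PyTorch whether it can see a CUDA GPU. The gpu_status helper is my own name, and I am assuming PyTorch ended up in the same environment you installed the requirements into.

```python
import importlib.util


def gpu_status():
    """Return a short message describing whether PyTorch can see a CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch  # imported lazily so the check also works without PyTorch

    if torch.cuda.is_available():
        return f"CUDA GPU found: {torch.cuda.get_device_name(0)}"
    return "PyTorch is installed, but no CUDA GPU is visible (training will be slow)"


print(gpu_status())
```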

Step 3: Train the AI Voice Model

Now, we’ll train the AI to replicate Ariana Grande’s voice.

3.1 Train the Model

1. Open RVC WebUI by running:
  
  python infer-web.py
2. Upload the voice dataset (cleaned Ariana Grande vocals).
3. Choose the "Train Model" option.
4. Select Hyperparameters:

- Model type: v2
- Training epochs: 50 for a basic model, 200+ for better quality.
- GPU: Enable (if available).

Wait for training to complete (this may take hours depending on your system).
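
An epoch is one full pass over the dataset, so how much work "50 epochs" means depends on how much audio you have. The arithmetic below is a rough sketch of that relationship. The segment length and batch size are illustrative numbers I picked, not RVC's defaults.

```python
import math


def training_steps(total_seconds, segment_seconds, batch_size, epochs):
    """Estimate optimiser steps: (batches per pass over the data) x (passes)."""
    segments = math.ceil(total_seconds / segment_seconds)
    steps_per_epoch = math.ceil(segments / batch_size)
    return steps_per_epoch * epochs


# Hypothetical setup: 5 minutes of audio cut into 3-second segments, batch size 8.
print(training_steps(300, 3, 8, epochs=50))    # 650
print(training_steps(300, 3, 8, epochs=200))   # 2600
```

With these assumed numbers, going from 50 to 200 epochs means roughly four times the training time, which is why the higher-quality setting takes so much longer.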

Step 4: Convert a Song into AI Voice

Now that we have an AI model trained on Ariana Grande's voice, we can make it sing Humdard by Arijit Singh.

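4.1 Get the Song’s Acapella and Instrumental

Download the original song (Humdard).
Use Ultimate Vocal Remover to split it into the acapella (vocals only) and the instrumental (karaoke track). The acapella goes into RVC in the next step; the instrumental is needed again in Step 5.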

4.2 Convert Arijit Singh’s Vocals to Ariana Grande’s Voice

Upload Arijit Singh’s acapella to RVC WebUI.
Choose Ariana Grande’s trained model.
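Set pitch & formant settings:
- Formant shift: Adjust to match Ariana’s natural tone.
- Pitch shift: 0 to +12 semitones (depending on gender & tone; +12 raises the voice by one octave, which suits a male-to-female conversion like this one).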
Click Convert and wait for processing.
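
To get a feel for what a pitch shift value does, here is a small worked example. It assumes the setting is measured in semitones of the standard equal-tempered scale, where every 12 semitones doubles the frequency; the function names are my own.

```python
def semitone_ratio(semitones):
    """Frequency multiplier for a shift of `semitones` (equal temperament)."""
    return 2.0 ** (semitones / 12.0)


def shifted_frequency(freq_hz, semitones):
    """Frequency of a note after shifting it by `semitones`."""
    return freq_hz * semitone_ratio(semitones)


# A 220 Hz note shifted up 12 semitones lands one octave higher.
print(shifted_frequency(220.0, 12))   # 440.0
```

A shift of 0 leaves the melody where it is, while values between 0 and +12 move it part of the way up the octave, so it is worth trying a few settings to find what sits best in the target voice's range.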

Step 5: Merge the AI Vocals with Music

Download the converted AI vocals.
Open Audacity (or any DAW software).
Merge the AI vocals with the instrumental track.
Adjust the volume and effects to make it sound natural.
Export as MP3/WAV.
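
If you would rather script the merge than use a DAW, the basic idea is just adding the two signals sample by sample. The sketch below does that with Python's standard library only. It is a simplified stand-in for what Audacity does, assumes both files are 16-bit PCM WAV with the same sample rate and channel count, and the gain values are arbitrary starting points.

```python
import array
import sys
import wave


def read_pcm16(path):
    """Read a 16-bit PCM WAV file and return (params, int16 samples)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError(f"{path}: expected 16-bit PCM audio")
        params = wav.getparams()
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is stored little-endian
    return params, samples


def mix_wavs(vocals_path, instrumental_path, out_path, vocal_gain=1.0, inst_gain=0.8):
    """Sum two WAV files sample by sample and write the result to out_path."""
    v_params, vocals = read_pcm16(vocals_path)
    i_params, inst = read_pcm16(instrumental_path)
    if (v_params.framerate, v_params.nchannels) != (i_params.framerate, i_params.nchannels):
        raise ValueError("sample rate and channel count must match")

    length = max(len(vocals), len(inst))
    mixed = array.array("h", bytes(2 * length))  # start from silence
    for n in range(length):
        v = vocals[n] if n < len(vocals) else 0
        i = inst[n] if n < len(inst) else 0
        value = int(v * vocal_gain + i * inst_gain)
        mixed[n] = max(-32768, min(32767, value))  # hard clip to the int16 range

    if sys.byteorder == "big":
        mixed.byteswap()
    with wave.open(out_path, "wb") as out:
        out.setnchannels(v_params.nchannels)
        out.setsampwidth(2)
        out.setframerate(v_params.framerate)
        out.writeframes(mixed.tobytes())
```

For example, mix_wavs("ai_vocals.wav", "instrumental.wav", "final_mix.wav") (hypothetical file names) writes the combined track. There are no effects or fades here, so a DAW is still the better choice for making the result sound natural.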

Screengrabs from RVC Web UI setup:

The resulting audio track (the instrumental is quite choppy, as you can hear, but the lyrics and acapella are beyond expectations; I uploaded it to SoundCloud 😂):

output_1.wav

Here is an explanation of the pitch extraction algorithms used in RVC (Retrieval-Based Voice Conversion) and when to use each:

1. PM (Parselmouth)

What it does: Uses Parselmouth, a Python interface to the Praat speech analysis software, to estimate pitch.
Best for: General voice conversion when speed is preferred over accuracy.
Pros: Fast and lightweight.
Cons: Less accurate than Crepe or RMVPE, may produce robotic sounds in complex vocals.
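
To make "estimating pitch" concrete: Praat's classic pitch tracker is based on autocorrelation, which looks for the time lag at which a signal best matches a shifted copy of itself. The toy version below shows only that core idea on a synthetic tone. It is not Praat's or RVC's implementation, and the function name and default frequency range are my own choices.

```python
import math


def estimate_f0_autocorr(samples, sample_rate, f0_min=80.0, f0_max=500.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation."""
    min_lag = int(sample_rate / f0_max)   # shortest period we consider
    max_lag = int(sample_rate / f0_min)   # longest period we consider
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # How well does the frame match itself when shifted by `lag` samples?
        score = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic test tone: a pure 220 Hz sine wave sampled at 16 kHz.
sample_rate = 16_000
frame = [math.sin(2 * math.pi * 220.0 * n / sample_rate) for n in range(1024)]
print(estimate_f0_autocorr(frame, sample_rate))  # prints a value close to 220
```

A real singing voice is far messier than a sine wave (breath noise, vibrato, leftover reverb), which is where a simple estimator like this starts to make the errors that show up as robotic artefacts.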

2. Harvest

What it does: A pitch detection algorithm from the WORLD vocoder that smooths pitch variations.
Best for: Natural-sounding results, especially for clean vocals.
Pros: Good balance between accuracy and smoothness.
Cons: Slower than Parselmouth, and sometimes struggles with extreme pitch shifts.

3. Crepe

What it does: Uses a deep learning model for pitch detection, achieving very high accuracy.
Best for: High-accuracy pitch tracking, especially for complex vocals.
Pros: Very accurate, captures subtle pitch variations.
Cons: Slow and computationally expensive.

4. RMVPE (Robust Model for Vocal Pitch Estimation)

What it does: A newer deep-learning-based pitch extractor designed for robustness.
Best for: Best overall results in most situations (recommended for modern RVC use).
Pros: More stable and less prone to errors compared to other methods.
Cons: Slower than PM and relies on a pretrained neural model, but generally the best choice.

Which One Should You Use?

For fast results: PM (Parselmouth)
For natural tone & smoothness: Harvest
For high accuracy in complex vocals: Crepe
For best overall quality: RMVPE (Recommended)
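
The same cheat sheet can be written as a small lookup if you are scripting your conversions. The priority names are made up for this example, and I am assuming the method strings match the option labels shown in the WebUI (pm, harvest, crepe, rmvpe).

```python
# Maps what you care about most to a pitch extraction method (per the list above).
F0_METHOD_BY_PRIORITY = {
    "speed": "pm",
    "smoothness": "harvest",
    "accuracy": "crepe",
    "overall": "rmvpe",
}


def pick_f0_method(priority="overall"):
    """Return the suggested pitch extraction method for a given priority."""
    try:
        return F0_METHOD_BY_PRIORITY[priority]
    except KeyError:
        choices = ", ".join(sorted(F0_METHOD_BY_PRIORITY))
        raise ValueError(f"unknown priority {priority!r}; choose one of: {choices}") from None


print(pick_f0_method())          # rmvpe
print(pick_f0_method("speed"))   # pm
```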
