"Sovits" (so-vits-svc) is a project that leverages machine learning techniques to transform singing voices from one timbre to another while preserving the original performance's pitch and rhythm. The system integrates Soft Voice Conversion with Variational Inference Text-to-Speech models to achieve high-quality voice conversion.
Key Components:
Soft Voice Conversion (SoftVC): SoftVC extracts linguistic features from the input singing voice, capturing the content without the speaker's unique characteristics. It encodes the input audio into a representation (soft speech units derived from a HuBERT-style model) that emphasizes phonetic information while minimizing speaker-specific traits.
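As a rough illustration of this step, a HuBERT-style content encoder can be queried for frame-level features in a few lines of PyTorch. This is a minimal sketch, not the project's actual preprocessing code: the checkpoint name and file path are placeholders, and so-vits-svc distributes its own content-encoder checkpoints (e.g. ContentVec).

```python
import torch
import torchaudio
from transformers import HubertModel

# Illustrative checkpoint; so-vits-svc uses its own content encoders
# (e.g. HuBERT-soft or ContentVec checkpoints shipped with the project).
encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

# Load a dry vocal (placeholder path) and resample to the 16 kHz the encoder expects.
wav, sr = torchaudio.load("input_vocal.wav")
wav = wav.mean(dim=0, keepdim=True)                   # mix down to mono
wav = torchaudio.functional.resample(wav, sr, 16000)

with torch.no_grad():
    # Shape (1, frames, 768): frame-level "content" features that retain
    # phonetic information while shedding most speaker-specific timbre.
    content = encoder(wav).last_hidden_state
print(content.shape)
```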
Variational Inference Text-to-Speech (VITS): VITS is an end-to-end neural text-to-speech model that generates natural-sounding speech. In so-vits-svc, VITS is adapted to synthesize singing: instead of text, it is conditioned on the linguistic features extracted by SoftVC together with the desired target speaker's characteristics.
Conversion Process:
Feature Extraction: The input singing voice is processed by SoftVC to obtain linguistic features that represent the phonetic content.
Speaker Conditioning: These linguistic features are combined with the target speaker's embedding, which encodes the unique characteristics of the desired output voice.
Voice Synthesis: The VITS model synthesizes the singing voice by generating audio that matches the input's pitch and rhythm but with the timbre of the target speaker.
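To make the speaker-conditioning step above concrete, here is a toy sketch of how content features, a frame-level F0 contour, and a learned speaker embedding can be fused into one conditioning sequence. Every class name and dimension below is illustrative; this is not so-vits-svc's actual module, only the general pattern a VITS-style synthesizer conditions on.

```python
import torch
import torch.nn as nn

class SpeakerConditioner(nn.Module):
    """Toy fusion of content features, F0, and a speaker embedding.
    Illustrative only; not so-vits-svc's actual architecture."""

    def __init__(self, content_dim=768, spk_dim=256, hidden=512, n_speakers=8):
        super().__init__()
        self.spk_table = nn.Embedding(n_speakers, spk_dim)
        self.proj = nn.Linear(content_dim + 1 + spk_dim, hidden)

    def forward(self, content, f0, speaker_id):
        # content: (B, T, content_dim), f0: (B, T), speaker_id: (B,)
        spk = self.spk_table(speaker_id)                    # (B, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, content.size(1), -1)
        x = torch.cat([content, f0.unsqueeze(-1), spk], dim=-1)
        # In a real model this sequence feeds the VITS flow and decoder,
        # which generate audio in the target speaker's timbre.
        return self.proj(x)

cond = SpeakerConditioner()(torch.randn(1, 200, 768),   # content features
                            torch.rand(1, 200) * 300,   # F0 contour in Hz
                            torch.tensor([3]))          # target speaker id
print(cond.shape)  # (1, 200, 512)
```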
GitHub repository: https://github.com/svc-develop-team/so-vits-svc?tab=readme-ov-file
The Legal Controversy of Sovits
The primary issues revolved around the unauthorized use of data: many users trained AI models on the voices of celebrities and musicians without permission. This raised concerns about copyright infringement, misuse of voice cloning, and potential legal repercussions. As a result, Recell, the original uploader, took the project down in March 2023 and distanced themselves from any further development or applications.
Fortunately, the project was picked up and is now maintained by another team (svc-develop-team, linked above), and users can still download it today.
AI-Singing Workflow
1. Voice Model Training
Collect Clean Vocals: Gather dry (unaccompanied) vocal recordings of the target singer.
Train (Multiple) Models: Train a voice model of the target voice (a command-line sketch follows this step). Some creators train several Sovits models of the same voice to reflect different vocal registers, emotions, or timbral textures.
Output: A trained AI voice model that replicates the singer's vocal timbre.
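In practice, training runs through the repo's own scripts. The sketch below strings together the preprocessing and training entry points named in the so-vits-svc README, assuming the dry vocals have already been placed under dataset_raw/<speaker_name>/; exact flags vary between versions of the repo, so check the README of the version you cloned.

```python
import subprocess

# Preprocessing and training entry points from the so-vits-svc repo.
# Flags differ between repo versions; treat these as indicative.
steps = [
    ["python", "resample.py"],                 # resample raw vocals
    ["python", "preprocess_flist_config.py"],  # build file lists and config
    ["python", "preprocess_hubert_f0.py"],     # extract content features + F0
    ["python", "train.py", "-c", "configs/config.json", "-m", "44k"],
]
for cmd in steps:
    subprocess.run(cmd, check=True)
```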
2. Track Selection and Arrangement
Choose Target Audio: Select a source song as the vocal reference.
Segment-Based Selection: Creators sometimes divide the song and use different references for different sections (e.g., verse = Singer A's version, chorus = Singer B's version) to optimize phrasing and emotion.
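Splitting the source song into sections is ordinary audio slicing. For example, with pydub (the section boundaries below are made up for illustration):

```python
from pydub import AudioSegment

# Hypothetical section boundaries, in milliseconds, chosen by ear.
song = AudioSegment.from_wav("source_vocals.wav")
sections = {"verse": (0, 45_000), "chorus": (45_000, 78_000)}

for name, (start, end) in sections.items():
    # Each segment can then be converted with a different model or reference.
    song[start:end].export(f"{name}.wav", format="wav")
```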
3. Inference (Voice Conversion)
Run Inference: Use the trained model to replace the original voice on the source track while keeping pitch, timing, and expression intact (see the sketch after this list).
Layering and Blending: Combine different AI-generated stems or harmonize with other voice models to enrich the texture.
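A minimal inference call, again through the repo's own script; the checkpoint path, config, clip name, and speaker name below are placeholders, and the flag set varies by version:

```python
import subprocess

# Placeholder paths and names; -t transposes in semitones, -s selects the
# target speaker defined at training time. Check your version's README
# for the exact flags.
subprocess.run([
    "python", "inference_main.py",
    "-m", "logs/44k/G_30400.pth",   # trained generator checkpoint
    "-c", "configs/config.json",
    "-n", "chorus.wav",             # source clip placed under raw/
    "-t", "0",                      # pitch transpose in semitones
    "-s", "target_speaker",
], check=True)
```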
4. Post-Processing
Mixing: Blend the converted vocals with the instrumental; adjust EQ, compression, reverb, and volume balance (a minimal sketch follows this list).
Harmonization (Optional): Add harmonies manually or generate them using tools like pitch shifters or separate Sovits tracks.
Effects and Enhancement: Add stylistic effects like autotune, delay, vocal doubling, or saturation to match the genre or aesthetic.
Quality Control: Check for audio artifacts, timing issues, or tone mismatches across stitched parts.
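A bare-bones mixdown can be sketched with pydub; the file names and gain values are illustrative, and serious projects do this stage in a DAW:

```python
from pydub import AudioSegment

# Illustrative file names and gain values.
vocals = AudioSegment.from_wav("converted_vocals.wav").apply_gain(-2.0)
inst = AudioSegment.from_wav("instrumental.wav")

mix = inst.overlay(vocals)   # simple sum of instrumental and vocal
mix.export("final_mix.wav", format="wav")
```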
5. Publishing and Community Sharing
Upload to Bilibili: Publish the final product with tags like #AI歌姬 ("AI songstress"), #Sovits, or #AI-Dongxuelian.
Community Feedback Loop: Engage with danmu (bullet-screen) comments and fan discussions.