Hello! If you're here, you're probably interested in converting your UTAU voicebanks into DiffSinger. I've created this semi-automatic system to make that easier, and also to make the DiffSinger voicebank sound close to the original UTAU source audio rather than the raw samples. I hope you like my guide and have fun using it.
Here's a demo showcasing how the results sound using the lite database (a demo for the full database will be added in the future).
DISCLAIMER
PLEASE READ THE WHOLE GUIDE, WATCH THE VIDEO AND READ THE Q&A TAB BEFORE ASKING QUESTIONS. A lot of people keep asking the same questions over and over because nobody reads first. I've tried to simplify things, but I need you to READ before asking, especially the Q&A TAB.
With this guide, you are permitted to:
Convert your personal UTAU voicebanks into DiffSinger.
Convert voicebanks from other engines such as DeepVocal and VocalSharp (although I have not tested these).
Use the label files for study purposes to improve your personal dataset.
Use the label files for human singing or convert them into other languages.
Use the voicebank you have created commercially.
The following actions are prohibited:
Converting and distributing commercial voicebanks (such as porting Vocaloid, SynthV, or Voisona voicebanks).
Converting and distributing unauthorized voicebanks (for example, converting Teto UTAU into DiffSinger). You may only convert your personal voicebank or a voicebank you have permission to convert.
Creating “Jinriki” DiffSinger banks using unauthorized data, such as recordings of famous people singing.
Using this guide or its files to create unauthorized derivative voicebanks.
DISCLAIMER:
I will NOT be responsible for anyone who breaks these rules. I created this guide only to help people convert their UTAU voicebanks into DiffSinger. If I see anyone publicly distributing unauthorized voicebanks, I will take down the links and provide distribution only privately via email upon request.
Lite Database:
5 songs (10 minutes of data)
It needs to be trained alongside a corpus dataset (public datasets are recommended at the end of the guide).
Quick training and results
Works with any kind of Japanese UTAU voicebank (CV, CVVC, VCV, C+V)
Support for other languages will be added in the future
Full database:
20 Songs (1 hour of data)
Can be trained without other datasets in the corpus (if no corpus is used, the expressiveness and dynamics of the vocals will depend on the voice treatment).
Slower training; it will need more steps to sound decent.
Supports only Japanese voicebanks, like the lite database; support for other languages will be added in the future.
Releases in 2026
Q&A:
Is cross-language supported?
A: If you have cross-language datasets in your corpus files, it will work fine, but the pronunciation will depend on the data you're using.
Are voice colors supported too?
A: Yes, you can create multiple voice colors for your voicebanks; you just need to train them as separate speakers, as if each were another voicebank.
Do I need to use the entire dataset for the full voicebank?
A: No, you don't need to use the whole dataset. You can use just part of it, but keep in mind that the amount of data you use will determine how the voice will sound depending on your workflow. If you use less data, I highly recommend using some corpus dataset to expand the vocal range and improve pronunciation.
Can I tune the USTX files to generate autopitch?
A: Probably yes. I haven’t tried it yet, but it seems to work. I only recommend doing this with the full dataset, because pitch generation needs much more data than pronunciation and dynamics. Also remember that if you pitch-shifted the USTX file (one octave higher or lower, for example), you need to check the labels again before training.
Do I need to use OpenUtau, or can I use original UTAU/UTAU-Synth?
A: You can use whichever engine you prefer. The only thing I recommend paying attention to is that if you use multiple resamplers, use only the ones that give similar quality results to avoid voice cracking or strange noises in the DiffSinger voicebank.
Can I mix voicebanks into one?
A: I think so. You can mix samples from different voicebanks, but if the audio sources have different quality or the voice timbres are too different, I highly recommend making separate voice colors for each voicebank instead of mixing them. If the audio sources are similar, mixing should be fine.
Can I edit the label phonemes?
A: Yes, you can freely edit the label phonemes if you want your voicebank to pronounce whatever you edited in the USTX. A quick TIP: in the transitions between A and O, some labels have a W; you can remove it if you need to.
Can I train it using tension, voicing and other parameters?
A: Yes, you can train with other parameters, but I highly recommend creating a version without them, because these parameters are sometimes not stable enough to handle UTAU voices and can generate weird noises or pronunciation problems.
Can I change the key of the notes to make the USTX fit the voicebank better?
A: Yeah, you can freely move the notes up and down to make them sound better with the voice you are working with. If you swap lyrics, you will need to change the phonemes in the labels too, but if you just move notes up and down, you don't need to change the phonemes in the labels; just adjust the timing as the guide says.
EXTRA PHONEMES.
A: Some of you noticed that the USTX files include some extra phonemes like xxx. Those phonemes are not mandatory, and since people are struggling with them, instead of deleting them I will explain what they do:
xxx is a trash phoneme; use it when you want to discard some noise in the samples.
cl is a glottal stop phoneme.
vf is a vocal fry phoneme.
SP is a silence phoneme; it should go between phrases.
AP is a breath phoneme.
exh is an end breath phoneme.
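For reference, these phonemes end up as ordinary entries in the label files. In the HTS/NNSVS mono-label style, each line is `start end phoneme`, with times usually given in 100-nanosecond units. The timings and lyric below are made up purely for illustration; your exported labels will have their own values:

```
0        2000000  SP
2000000  2600000  AP
2600000  3100000  k
3100000  5200000  a
5200000  5500000  cl
5500000  7000000  SP
```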
Extra Dictionary
A: I've added a download for my dictionaries folder because some people got stuck on this part, so unless you want to make your own dictionary, you can use mine. If you are only training in Japanese, you can use only the Japanese dictionary (I recommend using it alongside nit70 and PJS in the datasets too, to avoid phoneme errors and cover all the Japanese phonetics). If you want cross-language support, use the other dictionaries and add or remove the phonemes you need.
Also a quick update: I've removed the xxx phonemes, so I don't think you will have any problems with that now. If you need to add any of these phonemes to the label files, feel free; just remember to add them to the dictionary before training.
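If you do end up editing a dictionary, DiffSinger training dictionaries are plain text with one entry per line mapping a syllable to its phoneme sequence (tab-separated, opencpop-extension style). A hypothetical Japanese fragment, just to show the shape, might look like:

```
a	a
ka	k a
kya	ky a
n	N
```

Extra phonemes like cl, vf or exh need their own entries (or to be listed in the phoneme set your trainer uses) before training, or labeling will fail with unknown-phoneme errors.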
Here are two of the three voicebanks I produced to make this (Chanmi and Apollo). Apollo doesn't have extra parameters like tension; for Chanmi I will provide two versions, one with tension and one without. They support the most recent OpenUtau updates and can sing in 7 languages (Japanese, English, French, Portuguese, Spanish, Mandarin and Korean).
Chanmi Legacy: https://drive.google.com/file/d/1LdfdBW_y4hBdjsJBdH75GaYl85XJGHqM/view?usp=drivesdk
Chanmi Legacy With Tension: https://drive.google.com/file/d/1juk3NVqsrKXOIWrvZdCybKDPPVACvgC1/view?usp=drivesdk
Apollo Legacy: https://drive.google.com/file/d/1OTiSmc4ateD42FkMT62ZPcjyyOQVQEvL/view?usp=drivesdk
PART 1: Download all the necessary tools:
For this guide, you will need to download a few things before starting. Here are the links for everything you will need:
BASE LABEL + USTX PACK: HERE
OpenUtau: https://www.openutau.com/
VLabeler: https://vlabeler.com/
Dictionary folder: https://drive.google.com/drive/folders/1BUZK2CltXQMhwSM5jTr1xWeGGCJft8nR?usp=sharing (Read question 11 on the Q&A TAB)
After downloading and extracting everything, you are ready to start working with the USTX files.
PART 2: RENDERING THE USTX FILES
Now we get to the part where we prepare the vocals for training. First of all, open the USTX files in OpenUtau and load your voicebank into it. For this tutorial I will be using the lite base pack, but the process is the same for the full one. You will notice there are 7 songs to render; never rename the USTX files, to avoid conflicts with the label files.
After you open the USTX file (or UST if you are using original UTAU), you will need to render the entire project with the same resampler. As a warning: DO NOT USE WORLDLINE-R, EVEN IF THE VOICE SOUNDS GOOD WITH IT; it will make DiffSinger generate weird frequencies and noises. I highly recommend using HI-FI Sampler, Moresampler or F2Resamp, but you can use whatever resampler you prefer, except for WORLDLINE-R.
After loading your vocals, render them into the wav folder inside the pack you're using.
PART 3: Preparing the label files
After you finish rendering the vocals, you need to pay attention to the most important requirement of the project: the .wav files NEED TO HAVE THE SAME NAMES AS THE .lab FILES INSIDE THE lab FOLDER.
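If you'd rather not eyeball this, here's a minimal Python sketch (my own helper, not part of the pack) that lists which .wav files are missing a matching .lab file. The `wav` and `lab` folder names are assumptions based on the pack layout described above:

```python
from pathlib import Path

def check_pairs(wav_dir, lab_dir):
    """Return (matched, missing): wav basenames with / without a .lab twin."""
    wav_stems = {p.stem for p in Path(wav_dir).glob("*.wav")}
    lab_stems = {p.stem for p in Path(lab_dir).glob("*.lab")}
    return sorted(wav_stems & lab_stems), sorted(wav_stems - lab_stems)

# Example: matched, missing = check_pairs("wav", "lab")
# Any name in `missing` needs a renamed .wav or a matching .lab before labeling.
```

Anything reported as missing should be renamed (or get a copied .lab) before you open the project in VLabeler.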
After preparing your files, open VLabeler and create a new project inside the folder where the lab and wav folders are located. (Create it as an NNSVS project.)
Now you just need to align the base labels with your vocals. This is very important because it determines the timing and pronunciation of the notes, and each voicebank has its own timing due to oto differences.
For people who have never used VLabeler, here's how you should align the labels: drag each label boundary to the end of the phoneme you're labeling (unlike UTAU's oto, where we configure the whole note, here we label phoneme by phoneme), like the image below:
After finishing the labels, save the project and export all labels using the export-all-label-files option.
PART 4: Extra step to improve range
This step is NOT MANDATORY; if you don't want to do it, you have already finished the porting and can go to the training part. But if you think your voicebank needs more vocal range, here's a little tip: render the USTX files again with the vocals one octave lower and one octave higher, into the wav folder (remember to name them with a -12 or +12 suffix depending on what you did). Also create copies of the lab files for the same songs and give them the same names as the new wav files. After that, just check in VLabeler that everything is fine, and if it is, you are ready to train your voicebank.
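Copying and renaming the lab files by hand gets tedious with many songs, so here's a small Python sketch (a hypothetical helper, not part of the pack) that clones each .lab with the -12/+12 suffixes so they match the pitch-shifted renders:

```python
import shutil
from pathlib import Path

def clone_labs_for_octaves(lab_dir, suffixes=("-12", "+12")):
    """Copy each .lab so pitch-shifted renders (e.g. song-12.wav) have a
    matching label file. Already-suffixed files are left alone."""
    created = []
    for lab in sorted(Path(lab_dir).glob("*.lab")):
        if any(lab.stem.endswith(s) for s in suffixes):
            continue  # don't clone a clone on a second run
        for s in suffixes:
            target = lab.with_name(f"{lab.stem}{s}.lab")
            if not target.exists():
                shutil.copy2(lab, target)
                created.append(target.name)
    return created

# Example: clone_labs_for_octaves("lab")
```

This only copies the timing files; you still need to open VLabeler afterwards and confirm the alignments hold at the shifted octaves, as the guide says.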
PART 5: Training Guide & Dataset Links
Here my job is done: I've provided the guide for voicebank making, but since I'm not a pro at training, I can't provide a training setup guide. Instead, I will provide some external links to training guides. I will also leave links to some public datasets to improve the range/pronunciation of the voicebanks.
ENGLISH GUIDE
JAPANESE GUIDE BY KANON HERE
PUBLIC CORPUS DL:
EXTRA: How to train using a corpus + Training Collab Link
I'm adding this extra step because some people will probably ask me how to do it, so I'm going to show how easy it is. Basically, you just need to put all the voices in the dataset folder, numbering them with different voicebank IDs. In order:
If it's THE SAME VOICEBANK BUT DIFFERENT LANGUAGE DATA, use the same ID number to train all the data together (you can do it with a single folder, but I highly recommend making a separate folder for each language to make things easier to edit later).
If it's THE SAME VOICEBANK BUT A VOICE COLOR, create another ID for it, as if it were an entirely new voicebank, so its data doesn't conflict with the other voice colors.
As an example, here's a corpus: I have Apollo with ID 01 for both the Japanese and Portuguese data, but Apollo Legacy, Soft and Hard with different IDs because they need to be treated as separate voicebanks, so I set their IDs to 15, 16, 17. Finally, nit70 and PJS are 02 and 03; they are public corpus datasets which will help your voicebanks sound better and more natural in Japanese, and they will train together with our voicebanks.
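Putting that example together, the dataset folder might look something like this (the exact folder naming scheme depends on your training config, so treat the folder names as placeholders; only the ID assignments come from the example above):

```
data/
├── 01_apollo_jp/       (Apollo, Japanese data, speaker ID 01)
├── 01_apollo_pt/       (Apollo, Portuguese data, same ID 01)
├── 02_nit70/           (public corpus, ID 02)
├── 03_pjs/             (public corpus, ID 03)
├── 15_apollo_legacy/   (each voice color gets its own ID)
├── 16_apollo_soft/
└── 17_apollo_hard/
```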
Alongside the Colab link, I will also link the DiffTrainer tool. I've never used it, but it's a GUI for local DiffSinger training, so if your machine can handle local training and you don't want to deal with Anaconda, you can use it.
PUBLIC COLAB LINK HERE
DIFFTRAINER HERE