An UTAU voicebank created with the voice of PAMA from Minecraft: Story Mode S1E7. This VB is not, nor will it ever be, up for distribution. NO AI was involved in the creation of this VB.
[DEFAULT]
Voice sourced straight from raw, unedited dialogue.
Language: Japanese
Configuration: CV (+VV vowels)
Optimum Range: A#2-G#5
Encoding/Aliasing: Hiragana + Romaji
[VOCODE]
Alternate/intended "vocal mode", achieved by running the samples through iZotope's VocalSynth 2 in Audacity to recreate the effect used on the in-game dialogue.
Language: Japanese
Configuration: CV (+VV vowels)
Optimum Range: F2-A#5
Encoding/Aliasing: Hiragana + Romaji
Voice source: PAMA (Minecraft: Story Mode) (VA: Jason Topolski)
Artist, Voicebank Configuration: Cyanide (me!)
Scroll down past these paragraphs if you want to get right to the tutorial.
Jinrikis-- if you're unfamiliar-- are UTAU voicebanks created using samples of a character or otherwise non-vocal-synth voice. Because of copyright risks, as well as the ethics surrounding the reproduction of a voice actor's voice without consent, distribution of these voicebanks is generally highly discouraged and frowned upon. Aside from just being fun projects, they've always been a good, creative, SAFE way to execute the idea of, "what if this character I really like sang this song?"
I first gained an interest in them back around 2022 from Chase's GLaDOS jinriki. I was like, woah, you can do that?? And it became a goal of mine to, someday, do something like that. Of course, I was barely 12 and lacked a computer to do that on, so it took me a while.
Last year, I got fixated on Minecraft: Story Mode.
MC:SM's always been a game that I've held close to my heart, and ever since I replayed it almost exactly 2 years before writing this (August 2023) PAMA has been at the top of my list of favourite characters of all time. Aside from all the other deranged reasons I love them, their voice is very unique and admittedly what I would personally want to sound like if I was a computer. And because vocal synths are my main interest and I had the files readily available, I created my first ever attempt at a jinriki in December 2024 after my second replay. (click image to be sent to a video of it)
It sounded like abysmal dogshit. It took me multiple attempts to get the [u] samples to sound like they weren't being drowned and strangled, and even then I still failed, plus they just ended up sounding like they were recovering from 60 years of smoking three packs a day.
So then, later, I sampled from different lines that were edited differently and recorded in a different tone. For the original attempt, I had used exclusively dialogue that was canon and in-game. But-- and I won't go into a ramble about it-- there's dozens of leftover recordings from a stage of development before their character was rewritten, so I decided to see how it would sound if I used those instead.
(click image to be sent to an example on Soundcloud)
So, while it was an improvement on the original, that was also horrible. It was still incredibly garbled, but at least it had a brighter tone, which was surprising, because the unused voice lines are actually lower-pitched/"huskier" than the canon ones!!
I came to the conclusion pretty quickly that the problem was a combination of two things: the incredibly short length of the recordings (<1 sec) and the vocoder filter on the voice.
It took me and another member of a Discord server I'm in months to finally figure out how to get it right. They tried to reverse engineer the effect and find out exactly what the carrier wave was, while I tried to replicate it with Audacity's built-in vocoder. Then in July, someone in an audio engineering/design server they'd joined told them it could be replicated with iZotope's VocalSynth 2 VST, and that information was forwarded to our server. That's all it was.
I procrastinated for a while, but 2 days before my Heat Abnormal cover released, I finally locked in and finished the final version of this stupid project I accidentally got way too serious about.
Anyways, you're probably here to learn how to make one yourself if you've gotten this far down.
I'll spare you the extra details about the vocoder. I already talked enough about that, and it only really applies to my situation.
Here's my semi-detailed tutorial on how to make your own jinriki.
Okay, so you've got a basic idea. Now you need to get your hands on source audio and Audacity.
Source audio can be a little tricky depending on what it is. If it's a character from a video game, you might be lucky like me and have all of their voice lines readily available, maybe with a little bit of archive extraction involved. For example, here's what Telltale Speech Extractor looked like when I sorted it by uncategorized dialogue to get access to the unedited audio (and the unused edited ones). You might have an easier time since yours will probably be labeled, because whoever named them is probably far more competent than Telltale's development team.
If your character is from a TV show, anime, etc., you might need to take some extra time finding stuff without background noise, or if you're really desperate, you can just filter it out yourself.
Take the time to go through all of the character's dialogue. You'll need enough to cover every Japanese (or English, if you hate yourself) vowel (あ/a, い/i, う/u, え/e, お/o, ん/n) as well as consonants, though to make things easier you can exclude [n], [w], [y], and maybe [ts] if you're lazy like me, since the first three can be replaced with the vowels (and ん), and [ts] can just be [t] and [s] stitched together.
So here comes the first fun part: chopping the fuck out of that shit!!!!!!!!
First things first, you're gonna want to start with the vowels. That's common sense. You can't make anything else without those.
Cut out a piece of the most held-out vocalization of the vowel that you can find. It's okay if it's pretty short.
Isolate that in its own track, then select the whole thing and stretch it out if needed. It should be around half a second-- that doesn't SOUND like a lot, but trust me, it is. Also, make sure to use high-quality stretching so that it doesn't stutter.
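If you want to know what number to type in before stretching, here's a tiny sketch of the arithmetic. It assumes you're stretching with Audacity's Change Tempo effect, which asks for a percent change (that's my assumption about the tool, and the function name is made up for illustration). Slowing down means a negative percentage.

```python
def tempo_change_percent(current_seconds: float, target_seconds: float) -> float:
    """Percent change to enter so a clip of current_seconds ends up target_seconds long."""
    if current_seconds <= 0 or target_seconds <= 0:
        raise ValueError("lengths must be positive")
    # The new tempo relative to the old one is current/target;
    # subtracting 1 and scaling by 100 turns that ratio into a percent change.
    return (current_seconds / target_seconds - 1.0) * 100.0


if __name__ == "__main__":
    # A 0.2 second vowel stretched out to roughly half a second.
    print(round(tempo_change_percent(0.2, 0.5), 1))
```

So a 0.2 second vowel needs about a -60% tempo change to land at half a second. Treat the result as a starting point and trust your ears over the math.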
Repeat this process for the rest of your vowels. Export all of them into your jinriki's folder with THESE SPECIFIC SETTINGS. Your voicebank will not function properly if any of these settings are incorrect, and if you export the entire project instead of just your selection, you will end up with the entire source audio instead of your chopped sample.
Now, for the consonants. Make sure you're only selecting the consonant and not any part of the vowel before/after-- it's alright if a little is included, but don't include too much.
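If you'd rather double-check your exports than find out later that something's broken, a quick script can read each file's header. Big caveat: the numbers below are NOT copied from the settings this tutorial points at. They're an assumption based on the format UTAU voicebanks conventionally use (mono, 16-bit PCM, 44100 Hz), so swap in whatever your settings actually are. The function name is hypothetical.

```python
import wave

# Assumed target format (common UTAU convention, not taken from the tutorial's settings).
EXPECTED_CHANNELS = 1      # mono
EXPECTED_SAMPLE_WIDTH = 2  # bytes per sample, i.e. 16-bit
EXPECTED_RATE = 44100      # Hz


def wav_problems(path: str) -> list[str]:
    """Return a list of mismatches for one file; an empty list means it matches."""
    problems = []
    try:
        with wave.open(path, "rb") as wav:
            if wav.getnchannels() != EXPECTED_CHANNELS:
                problems.append(f"channels: {wav.getnchannels()}")
            if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
                problems.append(f"bit depth: {wav.getsampwidth() * 8}-bit")
            if wav.getframerate() != EXPECTED_RATE:
                problems.append(f"sample rate: {wav.getframerate()} Hz")
    except wave.Error as err:
        # The wave module only reads plain PCM WAVs, so e.g. 32-bit float ends up here.
        problems.append(f"unreadable as PCM WAV: {err}")
    return problems
```

Run it over every .wav in your jinriki's folder and re-export anything that comes back with a non-empty list.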
There are a lot more of these, but I believe in you. Cut out every single consonant needed. If you can't keep track of them well, I recommend making a checklist.
CONSONANTS YOU NEED TO SAMPLE FOR A VERY SIMPLE CV VOICEBANK, PRETTY MUCH NO EXTRAS:
B
CH
D
F
G
H
J
K
M
N
P
R
S + SH
T + TS
Z
Please try your best not to use an English R. Mix some shit together if you need to.
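If a paper checklist isn't your thing, here's a rough sketch of an automatic one. It assumes you save each chopped consonant as its own file named after its romaji (k.wav, sh.wav and so on). That naming scheme and the function names are my assumptions for illustration, so rename things to match however you actually organize your samples.

```python
from pathlib import Path

# The consonants from the list above, as lowercase romaji file name stems (assumed naming).
CONSONANTS = ["b", "ch", "d", "f", "g", "h", "j", "k", "m",
              "n", "p", "r", "s", "sh", "t", "ts", "z"]


def checklist(folder: str) -> dict[str, bool]:
    """Map each consonant to whether '<consonant>.wav' exists in the folder."""
    root = Path(folder)
    return {c: (root / f"{c}.wav").is_file() for c in CONSONANTS}


def format_checklist(status: dict[str, bool]) -> str:
    """Render the status as '[x] K' / '[ ] K' lines, one per consonant."""
    return "\n".join(
        f"[{'x' if done else ' '}] {name.upper()}" for name, done in status.items()
    )
```

Calling print(format_checklist(checklist("path/to/your/samples"))) shows at a glance which consonants you still need to cut.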
Samples like [きゃ/kya] can be achieved by stitching together [k] and [や/ya].
That comes next.
There's probably a better word for it, but I like to call it stitching. It's the most time-consuming part of this, and also the second fun part.
It's the same for pretty much all samples. What you need to do is align the consonant with the vowel, and cut off any extra part of the consonant that might be too long.
Select the point where these two meet, then crossfade them together.
Be sure that you can hear the consonant properly, and that it actually sounds like what you need. Repeat this process for the entire reclist.
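For the curious, this is roughly what a crossfade does to the audio under the hood. It's a minimal sketch using a plain linear fade on lists of sample values. Audacity's own crossfade may use a different curve, and the function name is made up, so read it as an illustration of the idea rather than what Audacity actually runs.

```python
def crossfade_join(consonant: list[float], vowel: list[float], overlap: int) -> list[float]:
    """Join two sample lists, fading the last `overlap` samples of the consonant
    into the first `overlap` samples of the vowel."""
    if overlap < 0 or overlap > min(len(consonant), len(vowel)):
        raise ValueError("overlap must fit inside both clips")
    if overlap == 0:
        return consonant + vowel  # a hard cut, no fade at all

    split = len(consonant) - overlap
    blended = []
    for i in range(overlap):
        # The vowel's weight rises evenly across the overlap
        # while the consonant's weight falls by the same amount.
        weight = (i + 1) / (overlap + 1)
        blended.append(consonant[split + i] * (1.0 - weight) + vowel[i] * weight)

    # Untouched consonant start + blended middle + untouched vowel end.
    return consonant[:split] + blended + vowel[overlap:]
```

The joined clip is shorter than the two pieces laid end to end by exactly the overlap, which is why an overly long crossfade can eat into the consonant you need to hear.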
From here, you can oto your voicebank like normal. If you're completely new to UTAU voicebank development, I don't really know why you're here, but I'll help you anyway with the basics of oto-ing CV. Hopefully you already have something like setParam or vLabeler to do this with. vLabeler is newer and more recommended; however, at the time I originally wrote this tutorial, I used setParam. You can skip this part of the tutorial if you already have experience.
It's difficult to explain, but it's pretty simple in practice. You can click the sections between each label to listen and make sure you're doing everything right, and you can also always go back later to edit this file if anything goes wrong.
LEFT BLANK: Labeled as [L] and is represented by the green highlight in setParam/yellow highlight in vLabeler. This is how you cut off anything unnecessary before the audio you need, such as beginning silence. I recommend leaving a bit of blank space. The keyboard shortcut is F1.
OVERLAP: Labeled as [Ovl] and is represented by the green line in setParam and vLabeler. I kinda STILL don't know how to use it, but it basically marks where the previous note fades into the current one. I recommend setting this before the consonant "hits" for hard consonants like [k] and [ts], and setting it somewhere near the beginning or middle of the consonant for soft consonants like [n] and [s]. The keyboard shortcut is F2.
PREUTTERANCE: Labeled as [Pre/Preu] and is represented by the red line in setParam and vLabeler. You should set this between the consonant and vowel-- usually you can see pretty well in the spectrogram shown in the image above where that transition is if it isn't clear in the waveform. The keyboard shortcut is F3.
CONSONANT: Labeled as [Con/Fixed] and is represented by the blue highlight in setParam and vLabeler. The name of this thing in setParam is extremely fucking infuriating and confusing, but it highlights the part of the vowel after the consonant that is not looped AND marks the start of the vowel that WILL be looped. BASICALLY, just set it at the beginning of the most consistent part of the vowel. Don't make it too long, or else your voicebank might shit itself and bug out when it loops-- certain lengths might sound distorted or otherwise not how they should. The keyboard shortcut is F4. You still with me? God I hope so.
RIGHT BLANK: Labeled as [R] and is represented by the yellow highlight in setParam/white highlight in vLabeler. This marks the end of the consistent part of the vowel that will get stretched and looped when you lengthen the note in UTAU. Everything after this is cut off. The keyboard shortcut is F5.
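All five of those labels end up as numbers (in milliseconds) on one line per sample in your oto.ini, in the order file=alias,left blank,consonant,right blank,preutterance,overlap. Here's a small sketch for reading and writing those lines. The class and function names are made up for illustration, the example values are invented, and it assumes all five numbers are filled in on every line.

```python
from dataclasses import dataclass


@dataclass
class OtoEntry:
    filename: str
    alias: str
    offset: float        # left blank: ms from the start of the file
    consonant: float     # fixed, unlooped region: ms from the left blank
    cutoff: float        # right blank: negative means ms measured from the left blank
    preutterance: float  # ms from the left blank
    overlap: float       # ms from the left blank

    def to_line(self) -> str:
        """Format the entry as a single oto.ini line."""
        values = (self.offset, self.consonant, self.cutoff, self.preutterance, self.overlap)
        return f"{self.filename}={self.alias}," + ",".join(f"{v:g}" for v in values)


def parse_oto_line(line: str) -> OtoEntry:
    """Parse one oto.ini line (assumes all five numeric fields are present)."""
    filename, _, rest = line.strip().partition("=")
    alias, *numbers = rest.split(",")
    if len(numbers) != 5:
        raise ValueError(f"expected 5 values after the alias, got {len(numbers)}")
    return OtoEntry(filename, alias, *(float(n) for n in numbers))
```

For example, OtoEntry("ka.wav", "か", 50, 120, -400, 80, 30).to_line() gives ka.wav=か,50,120,-400,80,30. If you want both hiragana and romaji aliases, one way is a second line for the same file with the alias set to ka.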
Once you've spent the mind-numbing hours oto-ing this thing and permanently drilling the sound of your favourite character's voice chopped up into little pieces into your eardrums, you're finally done. Give your jinriki a character.txt and install them in UTAU/OpenUTAU.
Optionally, you can also give them a cute little icon (and portrait if you use OU like me)! Makes them a little more fun to use. Just a little.
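A character.txt is just a few key=value lines, so you can write it by hand in a text editor. If you'd rather script it, here's a hedged sketch. It only uses the name, image and author keys, the icon file name is a placeholder, and the Shift-JIS encoding is my assumption based on what classic UTAU expects (OpenUtau portraits are configured separately and aren't covered here). The function name is made up too.

```python
from pathlib import Path
from typing import Optional


def write_character_txt(folder: str, name: str, author: str,
                        image: Optional[str] = None) -> Path:
    """Write a minimal character.txt into the voicebank folder and return its path."""
    lines = [f"name={name}"]
    if image:
        # Icon file name relative to the voicebank folder (placeholder value in the usage note).
        lines.append(f"image={image}")
    lines.append(f"author={author}")

    path = Path(folder) / "character.txt"
    # Shift-JIS is an assumption; for plain ASCII names it's byte-identical to UTF-8.
    path.write_text("\n".join(lines) + "\n", encoding="shift_jis")
    return path
```

Something like write_character_txt("path/to/voicebank", "PAMA", "Cyanide", image="icon.bmp") would cover the basics for this voicebank.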
Have fun!!!!!!!!!! ^w^ (AND REMEMBER NOT TO GIVE OUT THE DL LINK. THE UTAU GODS WILL STRIKE YOU DOWN)