There wasn't an existing dataset for this problem, so I had to create one myself. Through lol.gamepedia.com I was able to scrape a list of every single game that "Phreak" had cast. These games range from 2012 all the way up to last week.
After a little wrangling I managed to get a list of the YouTube VoD links (which include the start time for each match), and began to download them using PyTube. Once each video was downloaded, I had roughly 330 videos' worth of content to process, varying in length and size. I used FFmpeg and the start time from each individual VoD's link to crop the audio from the start of the casting to 15 minutes into each match. I had to cut each match early due to size limitations when loading the Resemblyzer model, but 15 minutes should yield enough audio to produce sufficient transcripts. If the transcripts prove insufficient, I will revisit this later and look into splitting videos that are ≥ 30 minutes into two 15-minute videos.
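A minimal sketch of this cropping step (the helper name, file paths, and exact FFmpeg flags are my own illustration, not the project's actual code):

```python
# Sketch: build the FFmpeg command that extracts a 15-minute audio window
# starting at the cast start time parsed from the VoD's YouTube link.
# Each VoD was first downloaded with PyTube, e.g.
#   YouTube(vod_url).streams.filter(progressive=True).first().download()
def crop_cmd(video_path, start_s, out_path, duration_s=15 * 60):
    """FFmpeg invocation: seek to the cast start, keep 15 minutes, drop video."""
    return [
        "ffmpeg", "-y",
        "-ss", str(start_s),       # seek to the start time from the VoD link
        "-i", video_path,
        "-t", str(duration_s),     # keep 15 minutes of audio
        "-vn",                     # discard the video stream
        out_path,
    ]

# The command would then be run with subprocess.run(crop_cmd(...), check=True).
cmd = crop_cmd("vods/game_001.mp4", 412, "audio/game_001.mp3")
```

Seeking with `-ss` before `-i` keeps the crop fast, since FFmpeg jumps to the timestamp instead of decoding everything before it.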
During the speech diarisation* with Resemblyzer, I created snippets of audio wherever the confidence score reached a threshold of ≥ 0.75. Once each video had been processed into individual snippets, I removed snippets with less than 2 seconds of speech (this might be revisited if needed), as this helps ensure there is enough data for the downstream networks.
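The segmentation logic can be sketched as below. This is an assumption about how the per-frame similarity scores were turned into snippets (the frame size and function name are illustrative), but it shows both filters: the ≥ 0.75 confidence threshold and the 2-second minimum length.

```python
# Sketch: keep contiguous regions where the confidence against the target
# speaker's embedding stays >= 0.75, then drop snippets shorter than 2 s.
def select_snippets(scores, frame_s=0.1, threshold=0.75, min_len_s=2.0):
    """Turn a per-frame confidence series into (start_s, end_s) snippets."""
    snippets, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = round(i * frame_s, 3)       # a snippet opens here
        elif score < threshold and start is not None:
            end = round(i * frame_s, 3)         # the snippet closes
            if end - start >= min_len_s:        # enforce the 2-second minimum
                snippets.append((start, end))
            start = None
    total = round(len(scores) * frame_s, 3)
    if start is not None and total - start >= min_len_s:
        snippets.append((start, total))
    return snippets

# 30 confident frames (3 s) followed by 10 uncertain ones: one snippet survives.
print(select_snippets([0.9] * 30 + [0.4] * 10))  # → [(0.0, 3.0)]
```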
With the collection of snippets ready, I can run speech-to-text via SpeechRecognition on each one and start building the corpus of transcriptions that will be fed into my Text-GAN. The SpeechRecognition library offers an easy-to-use API for several popular speech-to-text services, including PocketSphinx, Google Cloud Speech, Wit.ai, and more.
Up until this point it has been a constant battle to keep the size of the data under 1 GB, but unfortunately the speech-to-text portion of this project requires a WAV, AIFF, or FLAC file, meaning the file sizes will grow considerably. To keep the project small, I will use FFmpeg to create a temporary .flac version of each original .mp4 file and delete the .flac once processing is done. While this slows the process down, it's only a temporary cost to create the transcription of each snippet.
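The convert-transcribe-delete cycle looks roughly like this. The recognizer calls follow SpeechRecognition's documented pattern (`Recognizer`, `AudioFile`, `record`, `recognize_google`); the helper names and paths are illustrative, and the Cloud-billed variant would use `recognize_google_cloud` with credentials instead.

```python
import os
import subprocess

def flac_name(mp4_path):
    """Derive the temporary .flac path from the original .mp4 path."""
    return mp4_path.rsplit(".", 1)[0] + ".flac"

def transcribe_snippet(mp4_path):
    flac_path = flac_name(mp4_path)
    # FFmpeg writes a lossless temporary copy that SpeechRecognition can read.
    subprocess.run(["ffmpeg", "-y", "-i", mp4_path, flac_path], check=True)
    try:
        import speech_recognition as sr
        r = sr.Recognizer()
        with sr.AudioFile(flac_path) as source:
            audio = r.record(source)    # load the entire snippet
        return r.recognize_google(audio)
    finally:
        os.remove(flac_path)            # the .flac only ever exists transiently
```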
I first attempted to use PocketSphinx for the speech-to-text portion, but ended up with less than desirable transcripts. I re-processed my audio files using Google Cloud Speech and ended up with much more coherent transcripts. After combining the 300+ transcripts into a final transcript (FT), I was able to pass the FT into the GPT-2 GAN, and it successfully generated a fake transcript.
Now that the fake transcript has been created using the GPT-2 GAN, I can utilize Real-Time Voice Cloning (RTVC), the parent project of Resemblyzer. RTVC takes a reference clip of audio (the target voice) and a source of text (the fake transcript), and synthesizes speech from the text in the target's voice, resulting in a fake caster reading a fake transcript.
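RTVC works in three stages: an encoder embeds the reference clip, a synthesizer turns (text, embedding) into a mel spectrogram, and a vocoder renders the waveform. The sketch below shows only that data flow; the lambdas are stand-ins for RTVC's real `encoder`, `synthesizer`, and `vocoder` modules, not their actual APIs.

```python
# Sketch of the three-stage RTVC pipeline, with each stage injected
# as a callable so the data flow can be checked with dummy stand-ins.
def clone_voice(reference_wav, fake_transcript, embed, synthesize, vocode):
    """encoder -> synthesizer -> vocoder, in order."""
    speaker_embedding = embed(reference_wav)              # who should speak
    mel = synthesize(fake_transcript, speaker_embedding)  # what to say
    return vocode(mel)                                    # audible waveform

# With dummy stages, the pipeline wiring can be exercised end-to-end:
fake_audio = clone_voice(
    [0.0, 0.1], "top lane is on fire",
    embed=lambda wav: [sum(wav)],
    synthesize=lambda text, emb: [(len(text), emb[0])],
    vocode=lambda mel: [v for frame in mel for v in frame],
)
```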
*During the speech diarisation portion I spot-checked snippets to see how accurate they were; while some of the clips aren't entirely my desired speaker, for the most part (~75% speaker confidence) the snippets seemed good.
I originally started out using a library called PocketSphinx to create the transcripts for my audio via speech-to-text (STT). Out of the box, with no modification, PocketSphinx was decently fast but produced horrible transcripts. Admittedly, I did not try to tweak PocketSphinx (if there are even options to do so); instead I looked for a different avenue for speech-to-text. It wasn't until after transcribing my corpus of audio (6+ days of serial transcription, roughly 15-30 minutes per audio file) and feeding it through the GPT-2 model that I realized how poor the text GAN's output really was.
After seeing the "results" from PocketSphinx, I decided to try an alternative solution: Google Cloud. Google's speech-to-text is truly amazing and produced fantastic results right from the start. Google gives new users $300 in free credit, and I used roughly $118 to transcribe over 4,000 minutes of audio. Not only did Google Cloud's STT outperform PocketSphinx, I was also able to process the files asynchronously in a much shorter time. Each audio file took roughly 5-7 minutes to transcribe, and I could submit the full corpus at once. I went from 6 days of serial transcription to ~10 minutes, with significant improvements in the transcripts.
Below you can see the results of the two libraries side by side for the same full 15-minute clip. While not perfect, Google Cloud produced significantly better results, which ultimately affects the GPT-2 model's output.
---------------------------------------------------------------------------------------------------------------------------------------
I was going to say is it just that I really think so is perfect partner for Tamala it feels like you almost always have to make that sacrifice of trying to get to her in the early stages because you can get so out of it you can really have no CC you really have to go into him play begging or you lose CSB you're down and you were becoming pretty toothless towards the end of that day but I just got to say I'm pretty surprised by just how dominant PG has been on this flak westside to say that she is on the map and say okay you know if you don't want to come to Laney shouldn't be an issue flak and make them play for a win or tiebreaker in the first place so that is going to be about as important as they are games yesterday and they lost game 3 to Cloud 9 and I think the question is is how are you going to continue to play this Tamala until she is really better than you can ever do at the marathon maktub you play her at 5 and 3 I think I'll pull out my for strategy help you find something in the next couple weeks and months and years
---------------------------------------------------------------------------------------------------------------------------------------
start the game out as they first pick will there be any outside of any Affiliates I would immediately think of something along the lines of Misfortune that is quite popular still in Pro play things like chase the rush or something along those lines but they also need something along the lines of Yasuo that is quite popular still in the ALCS I would think it would be well along those lines feel like they already have one snatch purchase Wish and Taylor and so is gluten in the midline but I am going to assume that assume that they do pick the holy Trinity so let's see if this is for amegg is a great team Fighter I'm excited to see how they need to get out of this one is actually a really a good fighter I think they're going to be good with both and see if I missed something important that I think is often overlooked or kind of uber under the radar as it as one of the most under-discussed very good top laner in the entire league Avenger to be to the top of the ladder as one of the top laners with the most to say in the top words by the way I have been up and down a lot these top three or four wins I think I should pick up to your behind
Overall I believe the project was a success in many ways. While some parts were better implemented or achieved better results than others, I achieved my project goal of generating a fake transcript based on real transcripts, and then having a fake shoutcaster "read" it.
Where do we go from here/what to improve on?
Increase the quality of the data
Better speech recognition (transcription as well as detection)
Improve text generation (upgrade to GPT-3)
Enhance speech synthesis and cloning
[1] Learn With League, edited by Indiana Black, Riot Games, oce.learnwithleague.com/shoutcasting-101/.
[2] Google Cloud Speech-to-Text, https://cloud.google.com/speech-to-text
[3] CMU Sphinx Wiki, https://cmusphinx.github.io/wiki/
[4] Real-Time Voice Cloning, https://github.com/CorentinJ/Real-Time-Voice-Cloning
[5] Resemblyzer, https://github.com/resemble-ai/Resemblyzer
[6] SpeechRecognition, https://github.com/Uberi/speech_recognition
[7] CycleGAN, https://junyanz.github.io/CycleGAN/