There wasn't an existing dataset for this problem, so I had to create one myself. Through lol.gamepedia.com I was able to scrape a list of every single game that "Phreak" had cast. These games range from 2012 all the way up to last week.
After a little wrangling I managed to get a list of the YouTube VoD links (which include the start time for each match), and began to download them using PyTube. Once each video was downloaded, I had roughly 330 videos' worth of content to process, varying in length and size. I used FFmpeg and the start time from each individual VoD's link to crop the audio from the start of the casting to 15 minutes into each match. I had to cut each match early due to size limitations when loading the Resemblyzer model, but 15 minutes should yield enough audio to produce sufficient transcripts. If the transcripts prove insufficient, I will revisit this later and look into splitting videos that are ≥ 30 minutes into two 15-minute videos.
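A minimal sketch of this cropping step (the helper name, file paths, and exact FFmpeg flags are my own illustration, not the project's actual code):

```python
# Sketch: build the FFmpeg command that extracts a 15-minute audio window
# starting at the cast start time parsed from the VoD's YouTube link.
# Each VoD was first downloaded with PyTube, e.g.
#   YouTube(vod_url).streams.filter(progressive=True).first().download()
def crop_cmd(video_path, start_s, out_path, duration_s=15 * 60):
    """FFmpeg invocation: seek to the cast start, keep 15 minutes, drop video."""
    return [
        "ffmpeg", "-y",
        "-ss", str(start_s),       # seek to the start time from the VoD link
        "-i", video_path,
        "-t", str(duration_s),     # keep 15 minutes of audio
        "-vn",                     # discard the video stream
        out_path,
    ]

# The command would then be run with subprocess.run(crop_cmd(...), check=True).
cmd = crop_cmd("vods/game_001.mp4", 412, "audio/game_001.mp3")
```

Seeking with `-ss` before `-i` keeps the crop fast, since FFmpeg jumps to the timestamp instead of decoding everything before it.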
During the speech diarisation* with Resemblyzer, I created snippets of audio wherever the confidence score reached a threshold of ≥ 0.75. Once each video had been processed into individual snippets, I removed snippets with less than 2 seconds of speech (this might be revisited if needed), as this helps ensure there is enough data for the downstream networks.
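The segmentation logic can be sketched as below. This is an assumption about how the per-frame similarity scores were turned into snippets (the frame size and function name are illustrative), but it shows both filters: the ≥ 0.75 confidence threshold and the 2-second minimum length.

```python
# Sketch: keep contiguous regions where the confidence against the target
# speaker's embedding stays >= 0.75, then drop snippets shorter than 2 s.
def select_snippets(scores, frame_s=0.1, threshold=0.75, min_len_s=2.0):
    """Turn a per-frame confidence series into (start_s, end_s) snippets."""
    snippets, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = round(i * frame_s, 3)       # a snippet opens here
        elif score < threshold and start is not None:
            end = round(i * frame_s, 3)         # the snippet closes
            if end - start >= min_len_s:        # enforce the 2-second minimum
                snippets.append((start, end))
            start = None
    total = round(len(scores) * frame_s, 3)
    if start is not None and total - start >= min_len_s:
        snippets.append((start, total))
    return snippets

# 30 confident frames (3 s) followed by 10 uncertain ones: one snippet survives.
print(select_snippets([0.9] * 30 + [0.4] * 10))  # → [(0.0, 3.0)]
```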
With the collection of snippets ready, I can run speech-to-text via SpeechRecognition on each one and start building the corpus of transcriptions that will be fed into my Text-GAN. The SpeechRecognition library offers an easy-to-use API for several popular speech-to-text services, including PocketSphinx, Google Cloud Speech, Wit.ai, and more.
Up until this point it has been a constant battle to keep the size of the data under 1 GB, but unfortunately the speech-to-text portion of this project requires a WAV, AIFF, or FLAC file, meaning the file sizes will grow considerably. To keep the project small, I will use FFmpeg to create a temporary .flac version of each original .mp4 file and delete the .flac once processing is done. While this slows the process down, it's only a temporary cost to create the transcription of each snippet.
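The convert-transcribe-delete cycle looks roughly like this. The recognizer calls follow SpeechRecognition's documented pattern (`Recognizer`, `AudioFile`, `record`, `recognize_google`); the helper names and paths are illustrative, and the Cloud-billed variant would use `recognize_google_cloud` with credentials instead.

```python
import os
import subprocess

def flac_name(mp4_path):
    """Derive the temporary .flac path from the original .mp4 path."""
    return mp4_path.rsplit(".", 1)[0] + ".flac"

def transcribe_snippet(mp4_path):
    flac_path = flac_name(mp4_path)
    # FFmpeg writes a lossless temporary copy that SpeechRecognition can read.
    subprocess.run(["ffmpeg", "-y", "-i", mp4_path, flac_path], check=True)
    try:
        import speech_recognition as sr
        r = sr.Recognizer()
        with sr.AudioFile(flac_path) as source:
            audio = r.record(source)    # load the entire snippet
        return r.recognize_google(audio)
    finally:
        os.remove(flac_path)            # the .flac only ever exists transiently
```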
I first attempted to use PocketSphinx for the speech-to-text portion, but ended up with less than desirable transcripts. I re-processed my audio files using Google Cloud Speech and ended up with much more coherent transcripts. After combining the 300+ transcripts into a final transcript (FT), I was able to pass the FT into the GPT-2 GAN, and it successfully generated a fake transcript.
Now that the fake transcript has been created using the GPT-2 GAN, I can utilize Real-Time Voice Cloning (RTVC), the parent project of Resemblyzer. RTVC takes a reference clip of audio (the target voice) and a source of text (the fake transcript), and synthesizes speech from the text in the target's voice, resulting in a fake caster reading a fake transcript.
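RTVC works in three stages: an encoder embeds the reference clip, a synthesizer turns (text, embedding) into a mel spectrogram, and a vocoder renders the waveform. The sketch below shows only that data flow; the lambdas are stand-ins for RTVC's real `encoder`, `synthesizer`, and `vocoder` modules, not their actual APIs.

```python
# Sketch of the three-stage RTVC pipeline, with each stage injected
# as a callable so the data flow can be checked with dummy stand-ins.
def clone_voice(reference_wav, fake_transcript, embed, synthesize, vocode):
    """encoder -> synthesizer -> vocoder, in order."""
    speaker_embedding = embed(reference_wav)              # who should speak
    mel = synthesize(fake_transcript, speaker_embedding)  # what to say
    return vocode(mel)                                    # audible waveform

# With dummy stages, the pipeline wiring can be exercised end-to-end:
fake_audio = clone_voice(
    [0.0, 0.1], "top lane is on fire",
    embed=lambda wav: [sum(wav)],
    synthesize=lambda text, emb: [(len(text), emb[0])],
    vocode=lambda mel: [v for frame in mel for v in frame],
)
```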
*During the speech diarisation portion I spot-checked snippets to see how accurate they were; while some of the clips aren't entirely my desired speaker, for the most part (~75% speaker confidence) the snippets seemed good.
I originally started out using a library called PocketSphinx to create the transcripts for my audio via speech-to-text (STT). Out of the box, with no modification, PocketSphinx was decently fast but produced horrible transcripts. Admittedly, I did not try to tweak PocketSphinx (if there are even options to do so); instead I looked for a different avenue for speech-to-text. It wasn't until after transcribing my corpus of audio (6+ days of serial transcription, roughly 15-30 minutes per audio file) and feeding it through the GPT-2 model that I realized how poor the text GAN's output really was.
After seeing the "results" from PocketSphinx, I decided to try an alternative solution: Google Cloud. Google's speech-to-text is truly amazing and produced fantastic results right from the start. Google gives new users $300 in free credit, and I used roughly $118 to transcribe over 4,000 minutes of audio. Not only did Google Cloud's STT outperform PocketSphinx, I was also able to process the files asynchronously in a much shorter time. Each audio file took roughly 5-7 minutes to transcribe, and I could submit the full corpus at once. I went from 6 days of serial transcription to ~10 minutes, with significant improvements in the transcripts.
Below you can see the results of the two libraries side by side for the same full 15-minute clip. While not perfect, Google Cloud produced significantly better results, which ultimately affects the GPT-2 model's output.
---------------------------------------------------------------------------------------------------------------------------------------
I was going to say is it just that I really think so is perfect partner for Tamala it feels like you almost always have to make that sacrifice of trying to get to her in the early stages because you can get so out of it you can really have no CC you really have to go into him play begging or you lose CSB you're down and you were becoming pretty toothless towards the end of that day but I just got to say I'm pretty surprised by just how dominant PG has been on this flak westside to say that she is on the map and say okay you know if you don't want to come to Laney shouldn't be an issue flak and make them play for a win or tiebreaker in the first place so that is going to be about as important as they are games yesterday and they lost game 3 to Cloud 9 and I think the question is is how are you going to continue to play this Tamala until she is really better than you can ever do at the marathon maktub you play her at 5 and 3 I think I'll pull out my for strategy help you find something in the next couple weeks and months and years
---------------------------------------------------------------------------------------------------------------------------------------
start the game out as they first pick will there be any outside of any Affiliates I would immediately think of something along the lines of Misfortune that is quite popular still in Pro play things like chase the rush or something along those lines but they also need something along the lines of Yasuo that is quite popular still in the ALCS I would think it would be well along those lines feel like they already have one snatch purchase Wish and Taylor and so is gluten in the midline but I am going to assume that assume that they do pick the holy Trinity so let's see if this is for amegg is a great team Fighter I'm excited to see how they need to get out of this one is actually a really a good fighter I think they're going to be good with both and see if I missed something important that I think is often overlooked or kind of uber under the radar as it as one of the most under-discussed very good top laner in the entire league Avenger to be to the top of the ladder as one of the top laners with the most to say in the top words by the way I have been up and down a lot these top three or four wins I think I should pick up to your behind
Overall I believe the project was a success in many ways. While some parts were better implemented or achieved better results than others, I achieved my project goal of generating a fake transcript based on real transcripts, and then having a fake shoutcaster "read" it.
Where do we go from here/what to improve on?
Increase the quality of the data
Better speech recognition (transcription as well as detection)
Improve text generation (upgrade to GPT-3)
Enhance speech synthesis and cloning
[1] Learn With League, edited by Indiana Black, Riot Games, oce.learnwithleague.com/shoutcasting-101/.
[2] Google Cloud Speech-to-Text, https://cloud.google.com/speech-to-text
[3] CMU Sphinx Wiki, https://cmusphinx.github.io/wiki/
[4] Real-Time Voice Cloning, https://github.com/CorentinJ/Real-Time-Voice-Cloning
[5] Resemblyzer, https://github.com/resemble-ai/Resemblyzer
[6] SpeechRecognition, https://github.com/Uberi/speech_recognition
[7] CycleGAN, https://junyanz.github.io/CycleGAN/