Went over various subteams, Gil and grad students presented on the different projects. I asked Richard to send me his DeepScore paper that he presented in Ireland.
Subteams not finalized yet. Found 3 YouTube videos for Ryan related to robot gestures. They are listed below:
Finalized subteams, I got my second choice of Shimi.
Split into subteams within subteams. I joined the Shimi Software group with Sam Lovejoy and Matt Kaufer. Richard told me to get familiar with NSynth. In order to do so, I did the following:
Explored the capabilities of NSynth by going through one of their provided Jupyter notebooks. Encoded a small 10-second .wav file on my computer, and slowed it down and sped it up using their pre-trained model.
Slowing the encoding down took over 30 minutes, as did speeding it up. I tried to interpolate the wav file with another file, but my computer doesn't have a GPU, and it still hadn't finished after 40 minutes, so I didn't continue with the interpolation that week. Unfortunately, at the time I didn't have BuzzCard access to room 104, so I couldn't use the computer with the GPU.
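For reference, the encode/time-stretch workflow from the notebook boils down to something like the sketch below. This is a hedged reconstruction, assuming the Magenta fastgen API used in the notebook (fastgen.encode / fastgen.synthesize) and a checkpoint path like wavenet-ckpt/model.ckpt-200000; the time-stretch simply resamples the encoding along its time axis.

```python
import numpy as np
import scipy.ndimage
from magenta.models.nsynth import utils
from magenta.models.nsynth.wavenet import fastgen

CKPT = 'wavenet-ckpt/model.ckpt-200000'  # pre-trained WaveNet checkpoint (path is an assumption)
SR = 16000                               # NSynth models expect 16 kHz audio

# Load ~10 seconds of audio and encode it into a (1, time, 16) embedding.
audio = utils.load_audio('input.wav', sample_length=SR * 10, sr=SR)
encoding = fastgen.encode(audio, CKPT, sample_length=audio.shape[0])

def timestretch(encoding, factor):
    """Stretch (factor > 1 slows down) or compress the encoding along time."""
    return np.array([scipy.ndimage.zoom(e, (factor, 1), order=1) for e in encoding])

slow = timestretch(encoding, 1.5)   # slowed-down version
fast = timestretch(encoding, 0.5)   # sped-up version

# Decode back to audio; this is the slow part (30+ minutes on CPU).
fastgen.synthesize(slow, save_paths=['slow.wav'], checkpoint_path=CKPT)
fastgen.synthesize(fast, save_paths=['fast.wav'], checkpoint_path=CKPT)
```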
Richard and I discussed potential outcomes for my project this semester. We decided it would be cool if Shimi could listen to someone speaking and then, as quickly as possible, regurgitate a new sound based on the person's voice. This would be done by interpolating the voice with other pre-recorded sounds. So, I leveraged some code found on StackOverflow to set up audio recording: it records for 10 seconds and stores the result in a .wav file, which can then be encoded and manipulated through NSynth. I had a friend sing "Somewhere Over the Rainbow", and although I accidentally deleted the original recording, I have the sped-up and slowed-down versions that I obtained through NSynth.
In both versions we can make out the word "somewhere" before the audio stops. I'm not 100 percent sure why the audio samples linked above aren't longer, as the original recording had my friend singing the full phrase "Somewhere Over the Rainbow."
The code for audio recording can be found at https://github.com/ysingh97/audio_synthesis in record_sound.py.
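record_sound.py is based on the standard PyAudio recording loop; a minimal sketch of that approach (the parameter values here are assumptions, not necessarily what the repo uses) looks like this:

```python
import pyaudio
import wave

CHUNK = 1024              # frames per buffer
FORMAT = pyaudio.paInt16  # 16-bit samples
CHANNELS = 1              # mono, since NSynth expects mono audio
RATE = 16000              # match NSynth's 16 kHz sample rate
RECORD_SECONDS = 10
OUTPUT = 'recording.wav'

p = pyaudio.PyAudio()
stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                input=True, frames_per_buffer=CHUNK)

# Capture RECORD_SECONDS worth of audio from the default microphone.
frames = []
for _ in range(int(RATE / CHUNK * RECORD_SECONDS)):
    frames.append(stream.read(CHUNK))

stream.stop_stream()
stream.close()
p.terminate()

# Write the captured frames out as a .wav file that NSynth can encode.
wf = wave.open(OUTPUT, 'wb')
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b''.join(frames))
wf.close()
```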
I obtained BuzzCard access to room 104, and so decided to take on the problem of interpolating different sounds. First, I had to set up the environment to run NSynth on the computer. This ended up taking between 3 and 4 hours, as the wrong version of CUDA for NSynth-GPU was installed on the computer, and there were a number of other snags in getting the environment set up. One thing I will do for future teams is to note the installation issues I encountered and describe how I got past them.
In its most basic form, interpolation in NSynth takes the embeddings of two .wav files and averages them together over time. The following graph, generated from the NSynth Jupyter notebook that I ran, illustrates this concept.
To get an idea of what interpolation should sound like, I used the hip-hop sample that was used in Week 5 and a cello sample linked below.
Interpolating the two samples gives an interesting, distorted blend of the two. Now that I had an idea of what should happen after interpolation, I moved on to my own recordings.
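For reference, the interpolation itself is just an average of the two encodings before re-synthesis. Here is a minimal sketch, assuming the same Magenta fastgen API as above and that both clips are loaded at the same length:

```python
from magenta.models.nsynth import utils
from magenta.models.nsynth.wavenet import fastgen

CKPT = 'wavenet-ckpt/model.ckpt-200000'
SR = 16000
SAMPLE_LENGTH = SR * 10  # both clips trimmed/padded to the same length

voice = utils.load_audio('voice.wav', sample_length=SAMPLE_LENGTH, sr=SR)
cello = utils.load_audio('cello.wav', sample_length=SAMPLE_LENGTH, sr=SR)

# Encode each clip into its (1, time, 16) embedding.
enc_voice = fastgen.encode(voice, CKPT, sample_length=SAMPLE_LENGTH)
enc_cello = fastgen.encode(cello, CKPT, sample_length=SAMPLE_LENGTH)

# Interpolation: a simple element-wise average of the two embeddings over time.
enc_mix = (enc_voice + enc_cello) / 2.0

# Decode the blended embedding back into audio (the slow step).
fastgen.synthesize(enc_mix, save_paths=['voice_cello_mix.wav'], checkpoint_path=CKPT)
```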
My goal was to take a voice recording and interpolate it with an instrumental. Below are some of my results:
The interpolation is cut short here because I terminated the program before it could finish. In the original voice recording, I spoke the words, "This is a test of the Georgia Tech Emergency Notification System." In the interpolation with the cello, these words can barely be distinguished.
In the voice recording, I repeatedly say "testing". The words are more audible in this interpolation. What's interesting is that they seem to undulate with the pitch of the secondary voice, the cello. In the interpolation, it almost sounds like I'm singing the word "testing" in a garbled voice.
I also played around with the crossfade function provided in the Jupyter notebook. The crossfade function takes the encodings of two pieces of audio and applies what they call a "hanning window" to fade between them. Here is what the fade does to one embedding:
Here is my result for applying a crossfade:
This works decently well, but I would like the voice to be more clear at the beginning.
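For reference, a Hanning-window crossfade over the embeddings can be sketched as below. This is my own reconstruction of the idea, not the notebook's exact code:

```python
import numpy as np

def crossfade(enc_a, enc_b):
    """Fade enc_a out and enc_b in over time using half-Hanning windows.

    Both encodings are assumed to have the same shape (1, time, channels),
    as produced by fastgen.encode.
    """
    length = enc_a.shape[1]
    window = np.hanning(2 * length)
    fade_in = window[:length].reshape(1, -1, 1)   # rising half of the window
    fade_out = window[length:].reshape(1, -1, 1)  # falling half of the window
    return enc_a * fade_out + enc_b * fade_in

# Usage (with the encodings from the earlier sketch):
# enc_mix = crossfade(enc_voice, enc_cello)
# fastgen.synthesize(enc_mix, save_paths=['crossfade.wav'], checkpoint_path=CKPT)
```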
The primary issue throughout all of my trials was that the voices were not very clear in the interpolations. While the sounds are supposed to be distorted, the voice was almost completely lost in the first interpolation. I suspect this is because my computer's microphone isn't great, so my voice just wasn't recorded clearly. I talked to Sam about this, and he mentioned that he has a recording setup with good equipment, so I'll contact him to get better audio samples for interpolation. Another issue was that the GPU-equipped computer in 104 couldn't record audio, so I had to record on my computer and send the sample over to use it there; I'll talk to Richard about that. Once that is fixed, I can pursue my original idea of recording a sound and immediately interpolating it with a preexisting sound. As of now, my program uses existing recordings in a subdirectory and interpolates them. A final issue is that the interpolation and crossfading can each take around 10 minutes, even with a 1080, so we may have to readjust our idea of what is possible.
Found a startup called Lyrebird that creates a voice avatar from enough recordings. You can then feed it new text, and it will say it in a voice similar to yours. I spent the rest of the semester figuring out how to use it. I made an account and provided it with 30 or so voice recordings. I then had to figure out how to use its API. It uses OAuth 2.0, so I had to figure out how to get an access token. Once you get an access token for a particular account, you can use it indefinitely to make authorized API calls, so I currently have the access token for my own account. If we wanted to add a new voice, we would have to make a new account, make more recordings, and get an access token for that account. I currently have it set up so that I can input text into a Python console, and it will output a .wav file with something resembling my voice.
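The text-to-wav step is essentially one authorized POST with the stored token. Here is a minimal sketch using requests; the endpoint URL and response handling are placeholders from memory, not verified against Lyrebird's current documentation:

```python
import requests

ACCESS_TOKEN = 'my-stored-oauth2-access-token'  # obtained once via the OAuth 2.0 flow
GENERATE_URL = 'https://avatar.lyrebird.ai/api/v0/generate'  # placeholder endpoint

def text_to_wav(text, out_path='output.wav'):
    """Send text to the voice avatar API and save the returned audio."""
    response = requests.post(
        GENERATE_URL,
        headers={'Authorization': 'Bearer ' + ACCESS_TOKEN},
        json={'text': text},
    )
    response.raise_for_status()
    with open(out_path, 'wb') as f:
        f.write(response.content)  # assumes the API returns raw audio bytes

if __name__ == '__main__':
    text_to_wav(input('Text to speak: '))
```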