Went over various subteams, Gil and grad students presented on the different projects. I asked Richard to send me his DeepScore paper that he presented in Ireland.
Subteams not finalized yet. Found 3 YouTube videos for Ryan related to robot gestures. They are listed below:
Finalized subteams, I got my second choice of Shimi.
Split into subteams within subteams. I joined the Shimi Software group with Sam Lovejoy and Matt Kaufer. Richard told me to get familiar with NSynth. In order to do so, I did the following:
Explored the capabilities of NSynth by going through one of their provided Jupyter notebooks. Encoded a small 10-second .wav file on my computer, and slowed it down and sped it up using their pre-trained model.
Slowing the encoding down took over 30 minutes, as did speeding it up. I tried to interpolate the wav file with another file, but my computer doesn't have a GPU, and it still hadn't finished after 40 minutes, so I didn't continue with the interpolation that week. Unfortunately, at the time I didn't have BuzzCard access to room 104, so I couldn't use the computer with the GPU.
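For reference, the encode/time-stretch workflow from the notebook boils down to something like the sketch below. This is a hedged reconstruction, assuming the Magenta fastgen API used in the notebook (fastgen.encode / fastgen.synthesize) and a checkpoint path like wavenet-ckpt/model.ckpt-200000; the time-stretch simply resamples the encoding along its time axis.

```python
import numpy as np
import scipy.ndimage
from magenta.models.nsynth import utils
from magenta.models.nsynth.wavenet import fastgen

CKPT = 'wavenet-ckpt/model.ckpt-200000'  # pre-trained WaveNet checkpoint (path is an assumption)
SR = 16000                               # NSynth models expect 16 kHz audio

# Load ~10 seconds of audio and encode it into a (1, time, 16) embedding.
audio = utils.load_audio('input.wav', sample_length=SR * 10, sr=SR)
encoding = fastgen.encode(audio, CKPT, sample_length=audio.shape[0])

def timestretch(encoding, factor):
    """Stretch (factor > 1 slows down) or compress the encoding along time."""
    return np.array([scipy.ndimage.zoom(e, (factor, 1), order=1) for e in encoding])

slow = timestretch(encoding, 1.5)   # slowed-down version
fast = timestretch(encoding, 0.5)   # sped-up version

# Decode back to audio; this is the slow part (30+ minutes on CPU).
fastgen.synthesize(slow, save_paths=['slow.wav'], checkpoint_path=CKPT)
fastgen.synthesize(fast, save_paths=['fast.wav'], checkpoint_path=CKPT)
```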
Richard and I discussed potential outcomes for my project this semester. We decided it would be cool if Shimi could listen to someone speaking and then, as quickly as possible, regurgitate a new sound based on the person's voice. This would be done by interpolating the voice with other pre-recorded sounds. So, I leveraged some code found on StackOverflow to set up audio recording: it records for 10 seconds and stores the result in a .wav file, which can then be encoded and manipulated through NSynth. I had a friend sing "Somewhere Over the Rainbow", and although I accidentally deleted the original recording, I have the sped-up and slowed-down versions that I obtained through NSynth.
In both versions we can make out the word "somewhere" before the audio stops. I'm not 100 percent sure why the audio samples linked above aren't longer, as the original recording had my friend singing the full phrase "Somewhere Over the Rainbow."
The code for audio recording can be found at https://github.com/ysingh97/audio_synthesis in record_sound.py.
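record_sound.py is based on the standard PyAudio recording loop; a minimal sketch of that approach (the parameter values here are assumptions, not necessarily what the repo uses) looks like this:

```python
import pyaudio
import wave

CHUNK = 1024              # frames per buffer
FORMAT = pyaudio.paInt16  # 16-bit samples
CHANNELS = 1              # mono, since NSynth expects mono audio
RATE = 16000              # match NSynth's 16 kHz sample rate
RECORD_SECONDS = 10
OUTPUT = 'recording.wav'

p = pyaudio.PyAudio()
stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                input=True, frames_per_buffer=CHUNK)

# Capture RECORD_SECONDS worth of audio from the default microphone.
frames = []
for _ in range(int(RATE / CHUNK * RECORD_SECONDS)):
    frames.append(stream.read(CHUNK))

stream.stop_stream()
stream.close()
p.terminate()

# Write the captured frames out as a .wav file that NSynth can encode.
wf = wave.open(OUTPUT, 'wb')
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b''.join(frames))
wf.close()
```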
I obtained BuzzCard access to room 104, and so decided to take on the problem of interpolating different sounds. First, I had to set up the environment to run NSynth on the computer. This ended up taking between 3 and 4 hours, as the wrong version of CUDA for NSynth-GPU was installed on the computer, and there were a number of other snags in getting the environment set up. One thing I will do for future teams is to note the installation issues I encountered and describe how I got past them.
In its most basic form, interpolation in NSynth takes the embeddings of two .wav files and averages them together over time. The following graph, generated from the NSynth Jupyter notebook that I ran, illustrates this concept.
To get an idea of what interpolation should sound like, I used the hip-hop sample that was used in Week 5 and a cello sample linked below.
Interpolating the two samples gives an interesting, distorted blend of the two. Now that I had an idea of what should happen after interpolation, I moved on to my own recordings.
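For reference, the interpolation itself is just an average of the two encodings before re-synthesis. Here is a minimal sketch, assuming the same Magenta fastgen API as above and that both clips are loaded at the same length:

```python
from magenta.models.nsynth import utils
from magenta.models.nsynth.wavenet import fastgen

CKPT = 'wavenet-ckpt/model.ckpt-200000'
SR = 16000
SAMPLE_LENGTH = SR * 10  # both clips trimmed/padded to the same length

voice = utils.load_audio('voice.wav', sample_length=SAMPLE_LENGTH, sr=SR)
cello = utils.load_audio('cello.wav', sample_length=SAMPLE_LENGTH, sr=SR)

# Encode each clip into its (1, time, 16) embedding.
enc_voice = fastgen.encode(voice, CKPT, sample_length=SAMPLE_LENGTH)
enc_cello = fastgen.encode(cello, CKPT, sample_length=SAMPLE_LENGTH)

# Interpolation: a simple element-wise average of the two embeddings over time.
enc_mix = (enc_voice + enc_cello) / 2.0

# Decode the blended embedding back into audio (the slow step).
fastgen.synthesize(enc_mix, save_paths=['voice_cello_mix.wav'], checkpoint_path=CKPT)
```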
My goal was to take a voice recording and interpolate it with an instrumental. Below are some of my results:
The interpolation is cut short here because I terminated the program before it could finish. In the original voice recording, I spoke the words, "This is a test of the Georgia Tech Emergency Notification System." In the interpolation with the cello, these words can barely be distinguished.
In the voice recording, I repeatedly say "testing". The words are more audible in this interpolation. What's interesting is that they seem to undulate with the pitch of the secondary voice, the cello. In the interpolation, it almost sounds like I'm singing the word "testing" in a garbled voice.
I also played around with the crossfade function provided in the Jupyter notebook. The crossfade function takes the encodings of two pieces of audio and applies what they call a "hanning window" to fade between them. Here is what the fade does to one embedding:
Here is my result for applying a crossfade:
This works decently well, but I would like the voice to be more clear at the beginning.
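For reference, a Hanning-window crossfade over the embeddings can be sketched as below. This is my own reconstruction of the idea, not the notebook's exact code:

```python
import numpy as np

def crossfade(enc_a, enc_b):
    """Fade enc_a out and enc_b in over time using half-Hanning windows.

    Both encodings are assumed to have the same shape (1, time, channels),
    as produced by fastgen.encode.
    """
    length = enc_a.shape[1]
    window = np.hanning(2 * length)
    fade_in = window[:length].reshape(1, -1, 1)   # rising half of the window
    fade_out = window[length:].reshape(1, -1, 1)  # falling half of the window
    return enc_a * fade_out + enc_b * fade_in

# Usage (with the encodings from the earlier sketch):
# enc_mix = crossfade(enc_voice, enc_cello)
# fastgen.synthesize(enc_mix, save_paths=['crossfade.wav'], checkpoint_path=CKPT)
```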
The primary issue throughout all of my trials was that the voices were not very clear in the interpolations. While the sounds are supposed to be distorted, the voice was almost completely lost in the first interpolation. I suspect this is because my computer's microphone isn't great, so my voice just wasn't recorded clearly. I talked to Sam about this, and he mentioned that he has a recording setup with good equipment, so I'll contact him to get better audio samples for interpolation. Another issue was that the GPU-equipped computer in 104 couldn't record audio, so I had to record on my computer and send the sample over to use it there; I'll talk to Richard about that. Once that is fixed, I can pursue my original idea of recording a sound and immediately interpolating it with a preexisting sound. As of now, my program uses existing recordings in a subdirectory and interpolates them. A final issue is that the interpolation and crossfading can each take around 10 minutes, even with a 1080, so we may have to readjust our idea of what is possible.
Found a startup called Lyrebird that creates a voice avatar from enough recordings. You can then feed it new text, and it will say it in a voice similar to yours. I spent the rest of the semester figuring out how to use it. I made an account and provided it with 30 or so voice recordings. I then had to figure out how to use its API. It uses OAuth 2.0, so I had to figure out how to get an access token. Once you get an access token for a particular account, you can use it indefinitely to make authorized API calls, so I currently have the access token for my own account. If we wanted to add a new voice, we would have to make a new account, make more recordings, and get an access token for that account. I currently have it set up so that I can input text into a Python console, and it will output a .wav file with something resembling my voice.
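The text-to-wav step is essentially one authorized POST with the stored token. Here is a minimal sketch using requests; the endpoint URL and response handling are placeholders from memory, not verified against Lyrebird's current documentation:

```python
import requests

ACCESS_TOKEN = 'my-stored-oauth2-access-token'  # obtained once via the OAuth 2.0 flow
GENERATE_URL = 'https://avatar.lyrebird.ai/api/v0/generate'  # placeholder endpoint

def text_to_wav(text, out_path='output.wav'):
    """Send text to the voice avatar API and save the returned audio."""
    response = requests.post(
        GENERATE_URL,
        headers={'Authorization': 'Bearer ' + ACCESS_TOKEN},
        json={'text': text},
    )
    response.raise_for_status()
    with open(out_path, 'wb') as f:
        f.write(response.content)  # assumes the API returns raw audio bytes

if __name__ == '__main__':
    text_to_wav(input('Text to speak: '))
```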