Mapping Music to Motion
Andrew Luck, Zach Markey Badger, Josh Clancy
What if everyone could experience the joy of playing and influencing music, without dedicating hundreds of hours to learning a musical instrument? Technology has been at the center of human music creation for over 35,000 years.
Games like Rock Band and Guitar Hero simulate musical performance, but people haven’t been able to create new, original music...until now...
Music is full of patterns, which means it's probabilistic...and computers are good at that. Music is also highly dimensional, with many possibilities...which makes the challenge harder. While computers are one of today's most popular tools for composing and recording music, interfacing with computers for real-time performance remains a developing space.
With this project, we aim to create a system for generating new and interesting music with motion.
Can you name a virtuosic musician who uses a computer as their instrument on stage?
We successfully created a musical instrument that utilizes three machine learning models for skeletal tracking, gesture recognition, and music generation. We embodied this A.I. system in a boombox. This project took place in our Hardware Software Lab at the Global Innovation Exchange at the University of Washington.
This page summarizes the entirety of the process, but if you just want to learn about the results, skip ahead to Milestone 3 at the bottom of this page.
Discovering this problem space was a unique process for our team, due to the design constraints of the project. Those constraints included:
1. We must collect data from sensors in the real world environment.
2. This data must be processed to build a machine learning model to infer meaningful predictions.
3. We must build the project for under $500; however, many free resources are available from our school, GIX, including most materials and tools (e.g., MDF and a laser cutter).
Our team formed under the premise that we would like to create an interface for AI generated music, and ultimately that interface would be real-time.
We understood from the beginning that speed, latency, and computational power would be critical to the success of our experience. The "uncanny valley" for a musical interface is under 10 ms of latency. Inferring gestures and relaying flags to a real-time musical system would require local compute, since a round trip to a cloud service would add too much latency.
To recognize people's gestures, we felt strongly that video processing would be critical. Skeletal tracking looked like a promising route to accurate recognition, so we began identifying hardware.
The idea of performing musical gestures, possibly as a band, bubbled to the top, and we imagined our music machine as a boombox..."boombox 2.0".
In our second week, we began to investigate techniques for hardware and software environments to bring these ideas to life.
Original Hardware Ideas
- Kinect 2 for Video/Skeletal Tracking
- NVIDIA Jetson TX2 for GPU video processing and TensorFlow GPU acceleration
Defining Input / Interactions
We knew these parts of the design were the core of the user experience. What would be fun and expressive? What could drive sound in an interesting way?
Some of our first ideas for input included tracking:
- amount of movement
- speed of movement
- proximity of motion
Our team also began to consider pre-existing techniques in DSP and video processing for tracking motion at this point.
We were sold on the idea of the Jetson TX2 as a fast, video-capable GPU, and luckily, we had access to one at school at no cost to our budget.
In the two days before our Milestone 1 goal, we began to realize that the Kinect was not an ideal choice, even though it had a depth camera. Newer machine learning pose estimators proved to be a better solution. Not using the Kinect also promised more ubiquity if a simple webcam could be employed.
At this point in the project, there was some disagreement regarding the technology. OpenPose was already built for Jetson and loaded onto the TX2 quickly, but PosenetJS was the new shiny thing in the room. Ultimately we moved forward with OpenPose and it did the job well. PosenetJS may have been easier to deploy to web audiences, however.
We started a Trello...
Milestone 1 - Week 3
At Milestone 1, we were confident that we had found the technologies needed to move forward. MagentaJS was generating compelling drum beats and OpenPose was running smoothly on the NVIDIA Jetson TX2.
At this stage, we began to consider more deeply what gestures and scenarios we would like to create musically. The idea of a 3 person air rock band was really exciting to us. It was social and interactive, and sounded like a blast.
Versioning Troubles and Hardware Limitations
We decided to start by creating a fully connected network with an Adam optimizer in TensorFlow. This proof of concept would test gesture inference with a single person. The first results at this stage were impressive and energizing: we quickly reached over 80% accuracy. This was exciting.
In the audio department, we were experiencing a particularly difficult set of challenges. Google's Magenta Python environment was incredibly difficult to set up, with hundreds of dependencies, bugs, and errors. After a week of trying to install it, the build was finally updated and we were able to run the environment in Conda. The MagentaJS tools were also promising, and we began researching alternate routes using a NodeJS server. However, Ubuntu 16.04 does not install a modern version of Chrome, only Chromium. We were able to install some custom WebMIDI and WebAudio plugins, but they did not work with WebGL on the Jetson TX2.
We also ran into issues with Ubuntu 16.04 due to the ARMv8 architecture. The most recent version of TensorFlow we were able to run was TensorFlow 1.9. Our model was originally trained in 1.12, and when it was rolled back to 1.9 we lost more points of accuracy. Magenta required TensorFlow 1.12, so we could not load the environment on the Jetson.
The next limitation we realized was that the Jetson TX2 only had HDMI audio. We chose to pivot to devoting the audio to a Beaglebone Black with the Bela.io DAC. This DAC promised low latency sound synthesis, and the prospect of delegating this compute to a separate machine from our visual processing was interesting.
At this stage the enclosure development began. The first version was designed in Rhino and milled on a ShopBot.
Milestone 2 - Week 6
The first form of our enclosure is together at week 6! Our AI has been embodied! This was an oversized version, with lots of room for putting our development boards on top of the box. You can see our 7" touch screen on the left, the webcam, and the Jetson TX2 on the top right. We knew we had to move quickly after the lengthy creation process of enclosure one, and we began developing the second iteration soon after Milestone 2.
At this stage (week 6), we are now recognizing gestures and sending flags via UDP and the Open Sound Control protocol to the Bela.io and SuperCollider!
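A gesture flag like the ones described above can be packed as an Open Sound Control message and sent over UDP using only the Python standard library. The sketch below is a minimal illustration rather than the project's actual code: the `/gesture` address, the argument layout, and the host are assumptions made for the example, and port 57120 is simply SuperCollider's default language port.

```python
import socket
import struct


def osc_string(text):
    """Encode an OSC string: ASCII bytes, null-terminated, padded to a multiple of 4."""
    data = text.encode("ascii") + b"\x00"
    padding = (4 - len(data) % 4) % 4
    return data + b"\x00" * padding


def osc_message(address, *args):
    """Build an OSC message that supports int, float, and str arguments."""
    type_tags = ","
    payload = b""
    for arg in args:
        if isinstance(arg, int):
            type_tags += "i"
            payload += struct.pack(">i", arg)   # 32-bit big-endian integer
        elif isinstance(arg, float):
            type_tags += "f"
            payload += struct.pack(">f", arg)   # 32-bit big-endian float
        elif isinstance(arg, str):
            type_tags += "s"
            payload += osc_string(arg)
        else:
            raise TypeError(f"unsupported OSC argument: {arg!r}")
    return osc_string(address) + osc_string(type_tags) + payload


def send_gesture(label, confidence, host="127.0.0.1", port=57120):
    """Send one gesture flag over UDP. Address, host, and port are hypothetical."""
    packet = osc_message("/gesture", label, confidence)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))
    return packet
```

Calling `send_gesture("guitar", 0.93)` would fire a single datagram with no handshake, which is part of why UDP suits low-latency control messages: a dropped flag is simply replaced by the next one.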
For our first proof of concept, we trigger buffers for audio playback and start to realize some of the limitations of the Bela.io platform with the Beaglebone. CPU load became a problem when too many buffer-playback nodes ran on the SuperCollider server.
Enclosure two featured multiple design changes and also a change in process. We did all of the 2D cutting on the Kearn laser cutter, and it was far faster. We cut through 0.5" and 0.75" MDF material in minutes, rather than the lengthy cuts of the ShopBot, shortening our process by about 2 hours. The engraving was performed on the ShopBot after laser cutting.
Key design changes:
- Thinner box (less volume)
- 2" narrower
- different shape fillet
- web cam port
- larger touchscreen area with ribbon holes
- port for cable
- trays for AC power blocks under
We also began to consider the visual aesthetic of the project. We decided to paint the MDF a matte black and apply a silver vinyl after testing. We printed some sketch templates and mocked up several designs with pencil.
At this stage, a significant amount of software development remained. While we wrote code in parallel, there were still quite a few stumbles. The Bela was not performing as expected. It was time to pivot to a tablet. We formed a spreadsheet and met to divide up the remaining development. We retrained with time series data and enhanced the gesture recognition model yet again.
It's alive! We pulled it together and had some amazing results in our development environment. Our demo attempt had gremlins, of course, likely due to background differences and tracking with OpenPose...
Milestone 3 - Week 10
This video features some of the discrete events the performer can trigger. When locking into modes like guitar, drums, or piano, messages re-trigger so long as motion remains. On the SuperCollider end, a random walk is produced based on the probability distribution of notes and rests derived from the song that Magenta has created.
Gesture Recognition Inference
Deep learning models are often best understood by starting with the inputs and outputs. Our input is a series of frames (a video). For each frame, OpenPose draws a skeleton of key points on every recognized human, so each human in frame is described by 25 key points, that is, 25 x and 25 y values. Our output consists of 13 labels: 'RR', 'guitar-knees', 'bass', 'cowbell', 'piano', 'guitar', 'goats', 'clap', 'drums', 'dab', 'stand', 'throat cut', and 'bow'. These were labeled as people acted out those movements.
In more detail, our data takes the form of windows in time. Each window holds 6 frames, and each frame has its own detected set of 50 key point values (25 x, 25 y), so we end up with an input of shape (1, 6, 50). We put these windows through two time distributed layers. Time distributed layers are simply dense layers that are applied separately at each time step, so that each 25 key point skeleton goes through its own dense network. Because the skeletons at every time step exhibit similar spatial patterns, we can use the same model for each time step. This does two things: it lets us use a smaller model that runs faster, and it forces the network to deal with spatial patterns first, allowing for specialization. After the time distributed layers, we connect the time steps with four regular dense layers. These begin to capture the temporal connections between the skeletons and categorize the movement. The dense layers taper off to the final output layer of 13 neurons, corresponding to our 13 labels. We use dropout and batch norm throughout the architecture.
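The data flow described above can be traced with a small, dependency-free Python sketch. It only illustrates the shapes and the weight sharing of the time distributed stage, not the trained model: the layer widths (32, 16, 64, 32, 16) are assumptions, the weights are random, the final layer returns raw scores rather than softmax probabilities, and dropout and batch norm are omitted.

```python
import random


def make_dense(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


def dense(vector, layer, relu=True):
    """Apply one dense layer to a flat vector, optionally with ReLU."""
    weights, biases = layer
    out = []
    for j, bias in enumerate(biases):
        total = bias + sum(x * weights[i][j] for i, x in enumerate(vector))
        out.append(max(0.0, total) if relu else total)
    return out


def time_distributed(window, layer):
    """Apply the SAME dense layer to every frame in the window."""
    return [dense(frame, layer) for frame in window]


def build_layers(seed=0):
    """Two shared per-frame layers, then four dense layers ending in 13 outputs."""
    rng = random.Random(seed)
    td_layers = [make_dense(50, 32, rng), make_dense(32, 16, rng)]
    widths = [6 * 16, 64, 32, 16, 13]  # hypothetical tapering widths
    dense_layers = [make_dense(a, b, rng) for a, b in zip(widths, widths[1:])]
    return td_layers, dense_layers


def forward(window, td_layers, dense_layers):
    """window: 6 frames x 50 values -> 13 raw label scores."""
    for layer in td_layers:
        window = time_distributed(window, layer)
    flat = [value for frame in window for value in frame]  # join the time steps
    for layer in dense_layers[:-1]:
        flat = dense(flat, layer)
    return dense(flat, dense_layers[-1], relu=False)
```

Because `time_distributed` reuses one set of weights for all 6 frames, the first stage needs only a sixth of the parameters that six independent per-frame networks would require.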
VAE model (working but not well)
Discussion about Model
OpenPose Machine Learning and Data Processing
Using the keypoint data we got back from OpenPose, we saved the last 6 frames of history and did various checks to see where the body parts were in relation to "lines" that we drew based on the coordinates of their body.
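Keeping the last 6 frames of history maps naturally onto a fixed-length queue. The sketch below assumes each frame arrives as a list of 25 (x, y) pairs for one person and stores it as 25 x values followed by 25 y values; the class name and that ordering are illustrative choices, not details taken from our code.

```python
from collections import deque

WINDOW_SIZE = 6      # frames of history
NUM_KEYPOINTS = 25   # key points per skeleton


class KeypointHistory:
    """Rolling buffer holding the most recent frames of key points."""

    def __init__(self, size=WINDOW_SIZE):
        # A deque with maxlen drops the oldest frame automatically.
        self.frames = deque(maxlen=size)

    def push(self, keypoints):
        """Add one frame given as a list of 25 (x, y) tuples."""
        if len(keypoints) != NUM_KEYPOINTS:
            raise ValueError(f"expected {NUM_KEYPOINTS} key points, got {len(keypoints)}")
        xs = [x for x, _ in keypoints]
        ys = [y for _, y in keypoints]
        self.frames.append(xs + ys)  # 50 values per frame

    def is_full(self):
        """True once a complete window is available."""
        return len(self.frames) == self.frames.maxlen

    def window(self):
        """Return the buffered frames, oldest first, as a 6 x 50 nested list."""
        return [list(frame) for frame in self.frames]
```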
For instance, we made a box around the hips and shoulders of the person in view. If the person's wrists are inside that box, we do nothing. If they are outside the box, we take note of the location of their wrists and send a message to SuperCollider with that information.
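A minimal sketch of that box check follows. It assumes the pose has already been converted into a dictionary of named (x, y) points in image coordinates; the key names such as `l_shoulder` are hypothetical and would need to be mapped from the OpenPose output.

```python
def torso_box(pose):
    """Axis-aligned box around both shoulders and both hips: (left, top, right, bottom)."""
    corners = [pose[name] for name in ("l_shoulder", "r_shoulder", "l_hip", "r_hip")]
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


def wrists_outside_box(pose):
    """Return the wrists that lie outside the torso box, keyed by name."""
    left, top, right, bottom = torso_box(pose)
    outside = {}
    for name in ("l_wrist", "r_wrist"):
        x, y = pose[name]
        inside = left <= x <= right and top <= y <= bottom
        if not inside:
            outside[name] = (x, y)
    return outside
```

An empty result means both wrists are inside the box and nothing needs to be sent; otherwise the returned positions are what would be forwarded to the audio server.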
As another example, to simulate the experience of strumming a guitar, we drew a line between opposite shoulders and hips. When the person's wrist crosses over that line, we send a "strum" message.
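One way to detect such a crossing is to check which side of the shoulder-to-hip line the wrist is on in consecutive frames, using the sign of a 2D cross product. This is a sketch of the general technique under that assumption, not necessarily the exact test we ran; the function names are illustrative.

```python
def side_of_line(point, a, b):
    """Return +1 or -1 for the side of line a->b the point is on, 0 if exactly on it."""
    cross = (b[0] - a[0]) * (point[1] - a[1]) - (b[1] - a[1]) * (point[0] - a[0])
    return (cross > 0) - (cross < 0)


def detect_strum(prev_wrist, curr_wrist, shoulder, hip):
    """True when the wrist moved from one side of the shoulder-hip line to the other."""
    before = side_of_line(prev_wrist, shoulder, hip)
    after = side_of_line(curr_wrist, shoulder, hip)
    return before != 0 and after != 0 and before != after
```

Comparing only the previous and current frame keeps the check cheap enough to run on every frame, and a strum in either direction registers the same way.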
Using a combination of these techniques in addition to using the machine learning model allowed us to give the user a set of ways to influence the music. Below was the list of motions
Music Generation and Synthesis with Google Magenta and SuperCollider
We chose the open source music programming environment SuperCollider for its high fidelity and wide platform support. To create our musical patterns and events, however, we generated MIDI with Google's Magenta project.
While it would have been ideal to generate MIDI tracks on the fly, we decided to pre-render 500 16-bar song parts from Magenta's MusicVAE trio checkpoint to offload compute. Since MIDI files are so small, this required little space. The video to the right is an example of a 3 part trio generated with this model, including drum, melody, and bass instrumentation.
From these 500 original 16-bar sequences, we created 250 sets of 4 single-bar interpolations between them. To do this, we iterated through the folder with a Python shell script to select each pair of MIDI files as inputs. These interpolations provided us with the ability to bring disparate (random) song ideas together. We utilized a special Magenta checkpoint for interpolations with the MusicVAE model for this.
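The pairing step can be sketched as below. It assumes the files are paired in sorted filename order, two at a time, which is one simple way to turn 500 files into 250 pairs; the folder layout and file names are hypothetical, and the actual interpolation call into Magenta is left out.

```python
from pathlib import Path


def list_midi_files(folder):
    """Collect the .mid files in a folder in a stable, sorted order."""
    return sorted(Path(folder).glob("*.mid"))


def pair_midi_files(paths):
    """Group files two at a time: (0, 1), (2, 3), and so on."""
    ordered = sorted(paths)
    if len(ordered) % 2 != 0:
        raise ValueError("need an even number of MIDI files to form pairs")
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered), 2)]
```

Each resulting pair would then be handed to the interpolation step as its start and end sequence.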
After completing the scripts, we had generated 12,000 bars of MIDI files, with sequences for Drums, Bass, and Melody.
SuperCollider handles realtime patterning with arrays of musical events that include timing deltas, duration, pitch and more. This information was extracted from the MIDI clips originally rendered in Magenta to build these arrays for patterning.
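To make the array layout concrete, the sketch below converts a list of notes into parallel arrays of timing deltas, durations, and pitches. It assumes notes are given as (start, end, pitch) tuples measured in beats and that the clip length is known; parsing the MIDI file itself is out of scope here, and the function name is illustrative.

```python
def notes_to_pattern(notes, total_length):
    """Convert (start, end, pitch) notes into delta, duration, and pitch arrays.

    delta is the time from one note onset to the next; the last delta
    runs to the end of the clip so the pattern loops cleanly.
    """
    ordered = sorted(notes, key=lambda note: (note[0], note[2]))
    onsets = [start for start, _, _ in ordered]
    next_onsets = onsets[1:] + [total_length]
    return {
        "delta": [nxt - cur for cur, nxt in zip(onsets, next_onsets)],
        "duration": [end - start for start, end, _ in ordered],
        "pitch": [pitch for _, _, pitch in ordered],
    }
```

Keeping delta and duration separate matters because a note can ring longer or shorter than the gap before the next onset.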
These note sequence arrays were also utilized to build random walks and song-based probability distributions for our pure DSP interactions, such as guitar, piano, and bass.
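A song-based distribution and a random walk over it can be sketched in a few lines of Python. This is one plausible reading of the idea rather than a transcription of our SuperCollider code: rests are represented as `None`, the walk moves at most `max_step` positions through the sorted pitches, and nearby pitches are weighted by how often they appear in the song.

```python
import random
from collections import Counter

REST = None  # marker for a rest event


def event_distribution(events):
    """Relative frequency of each pitch (and of rests) in a note sequence."""
    counts = Counter(events)
    total = sum(counts.values())
    return {event: count / total for event, count in counts.items()}


def weighted_random_walk(distribution, length, rng, max_step=2):
    """Walk through the song's pitches, favoring pitches the song uses often."""
    pitches = sorted(p for p in distribution if p is not REST)
    rest_prob = distribution.get(REST, 0.0)
    index = rng.randrange(len(pitches))
    walk = []
    for _ in range(length):
        if rng.random() < rest_prob:
            walk.append(REST)  # rest with the song's own rest probability
            continue
        low = max(0, index - max_step)
        high = min(len(pitches) - 1, index + max_step)
        candidates = list(range(low, high + 1))
        weights = [distribution[pitches[i]] for i in candidates]
        index = rng.choices(candidates, weights=weights, k=1)[0]
        walk.append(pitches[index])
    return walk
```

Passing a seeded `random.Random` instance makes a walk repeatable, which is handy when comparing how different songs shape the output.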
In SuperCollider, we created synth definitions for each instrument timbre. For instruments such as guitar, piano, and bass, this synth definition included code to interpret each note, transposing the sound via a MIDI ratio multiplier that scales the sample playback rate according to the desired note.
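The ratio itself is the standard equal-temperament conversion: shifting by n semitones multiplies the playback rate by 2^(n/12), which is what SuperCollider's `midiratio` method computes. The Python sketch below shows the arithmetic; the function names and the idea of a per-sample root note are illustrative assumptions.

```python
def midi_ratio(semitones):
    """Playback-rate multiplier for a shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)


def playback_rate(target_note, sample_root_note):
    """Rate at which to play a sample recorded at sample_root_note to sound target_note."""
    return midi_ratio(target_note - sample_root_note)
```

For a sample recorded at MIDI note 60, `playback_rate(72, 60)` doubles the rate (one octave up) and `playback_rate(48, 60)` halves it. Note that resampling this way also shortens or lengthens the sound.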
This playback and synthesis method proved to be compute heavy: for each musical event and note created, a synth node was initialized and only released after the sound's duration ended. At times we had nearly 400 synth nodes running on the server. For strongly timed real-time synthesis, this is an expensive task that neither the Beaglebone Black/Bela.io nor a Microsoft Surface Pro could handle. It was not until we employed a high-end Surface Book that we had few glitches or hiccups in audio.
All Together Now
Originally, we had planned for the instrumentation to accompany the Magenta generated backtrack as it played. We realized that this was quite busy with discrete musical events triggering at the same time and would like to explore hocketing techniques in the future. However, the results of multiplayer alone were quite fun and interesting.
This was an amazing project to work on with goals that were ambiguous in the beginning. We pulled off a system that utilized 3 machine learning models in unison with custom digital signal processing of the motion data to flag events to our audio synthesis server. The enclosure and audio components were equally pleasing.
One of the toughest challenges was managing the complexity of a generative music system that synthesizes high-resolution timbres for hundreds of strongly timed musical events... in (very near) real time. Unfortunately, our system did not fit entirely "in the box" because of this.
We see these audio challenges not as a failure but as an opportunity. Computers will become smaller and more powerful in the near future. If we can optimize the system, it could likely be implemented on a larger scale. In the end, we achieved great gesture recognition accuracy and very fast discrete event control, and we generated music with data and machine learning algorithms in a fun way... and we made some people smile.
 Sound on Sound, Dave Stewart, How much latency is acceptable for virtual piano?, Retrieved Mar 15, 2019, https://www.soundonsound.com/sound-advice/q-how-much-latency-acceptable-virtual-piano