Our proposed system comprises three elements: audio analysis, motion generation, and real-time synchronization. Audio analysis primarily involves automatic music transcription, melody detection, and musical instrument recognition. In the past, the diversity of signal characteristics and data labels made it difficult to establish an integrated solution for automatic music analysis. Nowadays, thanks to advances in neural networks (NN) and multi-task learning (MTL), deep learning systems that simultaneously detect pitches, note timings, and instrument types have become feasible. Moreover, we can superimpose different types of signal representations and let the convolution kernels of the NN select the relevant features automatically. Consequently, the trained model is more robust, achieves transposition invariance, and suppresses the overtone errors that commonly arise in audio processing. More specifically, our proposed method recasts music transcription as a semantic segmentation problem from computer vision: a U-Net-based architecture whose convolution kernels are equipped with attention or dilation mechanisms processes objects of different sizes simultaneously, for example identifying both short and long musical notes.
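As a concrete illustration, the following PyTorch sketch shows how a dilated-convolution encoder block can enlarge the receptive field of a U-Net-style segmentation network operating on stacked time-frequency representations. The class name, channel counts, and input shape are illustrative assumptions, not our exact implementation.

```python
# A minimal sketch (not our exact network) of a dilated-convolution block for a
# U-Net-style encoder, so that both short and long notes in a time-frequency
# representation fall inside the receptive field.
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """Two 3x3 convolutions whose dilation enlarges the temporal/spectral context."""
    def __init__(self, in_ch: int, out_ch: int, dilation: int = 1):
        super().__init__()
        pad = dilation  # keeps the feature-map size for a 3x3 kernel
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=pad, dilation=dilation),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=pad, dilation=dilation),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Example input: two superimposed spectral representations as channels,
# shaped (batch, channels, frequency bins, time frames); sizes are arbitrary.
features = torch.randn(1, 2, 352, 256)
encoder = DilatedConvBlock(in_ch=2, out_ch=16, dilation=2)
print(encoder(features).shape)  # -> torch.Size([1, 16, 352, 256])
```

Stacking such blocks with increasing dilation (or replacing dilation with attention) lets the same kernel budget cover both short note onsets and long sustained notes.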
To generate animated body movement, we have achieved preliminary results based on the motion of a violin player. Using a recording of a violin solo as the input signal, we automatically generate coordinate values of the body joints of a virtual violinist, while long-term body rhythm is determined by our music emotion recognition model. Instead of employing an end-to-end NN, we focus on more interpretable and controllable body-movement generation methods. Our proposed model consists of a bowing model for the right hand, a fingering (position) model for the left hand, and a musical emotion (expression) model for the upper body. The bowing model is built around an audio-based attack (bowing-change) detection network, whereas the fingering model maps detected pitches to strings and left-hand positions. From this information, the configuration of the generated skeleton can be determined. As for music emotion, since the periodic tilting of the head and upper body with the beat tends to follow the arousal of the music, we combine beat tracking from the audio model with an emotion predictor to control those aspects of body motion. These same principles can be applied to other kinds of stringed instruments. Generating body movements solely from audio content remains an open problem, with many possibilities for future development.
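To make the fingering model concrete, here is a simplified rule-based sketch that maps a detected MIDI pitch to a violin string and an approximate left-hand position. The function name, the string-selection rule, and the position heuristic are illustrative assumptions, not the exact model used in our system.

```python
# A simplified, rule-based sketch of a fingering model: map a detected MIDI pitch
# to a violin string and an approximate left-hand position. The rules below are
# illustrative heuristics, not the exact mapping used in our system.

# Open-string MIDI pitches of the violin: G3, D4, A4, E5.
OPEN_STRINGS = {"G": 55, "D": 62, "A": 69, "E": 76}

def pitch_to_fingering(midi_pitch: int) -> tuple[str, int]:
    """Return (string, position), preferring the string with the lowest hand position."""
    candidates = [(name, midi_pitch - open_pitch)
                  for name, open_pitch in OPEN_STRINGS.items()
                  if midi_pitch >= open_pitch]
    if not candidates:
        raise ValueError("pitch is below the violin range")
    string, offset = min(candidates, key=lambda c: c[1])  # smallest semitone offset
    if offset == 0:
        return string, 0  # open string
    # Heuristic: first position covers roughly five semitones above the open
    # string, and each higher position shifts the hand up by about two semitones.
    position = 1 + max(0, offset - 5) // 2
    return string, position

print(pitch_to_fingering(69))  # A4 -> ('A', 0): open A string
print(pitch_to_fingering(72))  # C5 -> ('A', 1): first position on the A string
print(pitch_to_fingering(88))  # E6 -> ('E', 4): a higher position on the E string
```

In the full system, such string/position decisions combine with the bowing-change detections to determine the pose of the generated skeleton at each frame.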
For real-time synchronization, our proposed system incorporates three elements: a music tracker, a music detector, and a position estimator. The music tracker runs online dynamic time-warping (ODTW) algorithms across multiple threads; each thread uses ODTW to estimate the current speed of the live performance, and the per-thread estimates are averaged to obtain a stable and accurate value. Relative tempo is obtained by comparing the live performance with a reference performance recording. The music detector automatically detects when the music starts, so there is no need to launch the real-time synchronization mechanism manually. Finally, since music exhibits many repeated segments, our position estimation mechanism tracks, in parallel, the candidate positions that the musician may currently be playing. Combining these three elements, we can immediately align the position of a live performance with a reference recording, allowing a program director to design responsive events based on that information. We have applied this system to music visualization, automatic accompaniment/ensemble, and generation of automatic body movements for a virtual musician.
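As a rough sketch of how the per-thread estimates might be fused, the code below averages the relative tempo reported by each ODTW thread and uses the result to advance the estimated position in the reference recording. The data structures and function names are assumptions for illustration, not the production implementation.

```python
# A minimal sketch (assumed interfaces) of fusing per-thread ODTW estimates:
# each tracker thread reports the local slope of its warping path, i.e. the
# relative tempo between the live performance and the reference recording,
# and the fused value drives the estimated reference position.
from dataclasses import dataclass
from statistics import mean

@dataclass
class TrackerReport:
    thread_id: int
    relative_tempo: float  # slope of the ODTW warping path over a recent window
                           # (> 1.0 means the live performance runs faster)

def fuse_tempo(reports: list[TrackerReport]) -> float:
    """Average the per-thread tempo estimates to suppress single-thread jitter."""
    return mean(r.relative_tempo for r in reports)

def advance_reference_position(position_sec: float, frame_sec: float,
                               reports: list[TrackerReport]) -> float:
    """Move the estimated position in the reference recording forward by one frame."""
    return position_sec + frame_sec * fuse_tempo(reports)

# Example: three ODTW threads report slightly different local tempi.
reports = [TrackerReport(0, 1.04), TrackerReport(1, 0.98), TrackerReport(2, 1.02)]
pos = advance_reference_position(position_sec=12.0, frame_sec=0.046, reports=reports)
print(round(pos, 3))  # ~12.047 seconds into the reference recording
```

The position estimator would maintain several such hypotheses at once when a repeated passage makes the current location ambiguous, keeping whichever hypothesis remains consistent with the incoming audio.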
Our system has been utilized in several live performances, including the Sound and Sense concert (in cooperation with the Pace Culture and Education Foundation, performed in the National Concert Hall), the opening ceremony of the NTHU AI Orchestra (in collaboration with the NTHU AI Orchestra), Whispers in the Night (in collaboration with flutist Sophia Lin, performed in the Weiwuying Auditorium), and Sound and Shape (in collaboration with Koko Lab. Inc., performed at Wetland Venue). These concerts were held not only to test our technology but also to facilitate in-depth conversations among music producers, performers, and music technology developers, with a view to introducing new-age music technology to the multimedia industry.
Our system is divided into three parts: audio analysis, motion generation, and real-time synchronization. Audio analysis covers automatic music transcription, melody detection, and instrument recognition. In the past, because these tasks involve different signal features and music data annotations, it was difficult to build an integrated solution for music analysis. Today, owing to the development of neural networks for multi-task learning, deep learning systems that jointly handle pitch, timing, and instrument type have become possible. We superimpose different types of signal representations and let the network perform feature selection, which increases the robustness of the trained model, achieves transposition invariance, and suppresses the overtone errors typical of audio processing. More precisely, our method recasts audio analysis as a semantic segmentation problem from computer vision: a U-Net-based architecture whose convolution kernels are equipped with attention or dilation mechanisms processes objects of different sizes at the same time, for example distinguishing short notes from long notes.
In our research on motion generation, we have obtained preliminary results focused on the movements of a violinist: given a recording of a violin solo as the input signal, the system automatically generates the body-joint coordinates of a virtual violinist, and a music emotion model determines the body's overall rhythm. In contrast to end-to-end neural-network training, these preliminary results emphasize interpretable, controllable, and parameterized body-movement generation. The method consists of a bowing model for the right hand, a fingering model for the left hand, and a music emotion model for the upper body. The right-hand model is realized with audio-based detection of bowing-change points; the left-hand model maps detected pitches to strings and hand positions; together, the bowing and fingering information determines the configuration of the generated skeleton. For music emotion, since the periodic tilting of the head and upper body along with the beat is related to the arousal of the music, we use beat tracking from the audio model and arousal prediction from the music emotion model to control the variation of head and upper-body tilt angles. The same principles apply to other kinds of stringed instruments. Generating body movement from audio alone is still at an early stage of development, with many possibilities for the future.
Finally, for real-time synchronization, our proposed system consists of three parts: a music tracker, a music detector, and a position estimator. The music tracker runs a multi-threaded online dynamic time warping (ODTW) algorithm: each thread uses ODTW to estimate the current tempo of the live performance, the per-thread results are averaged to obtain an accurate tempo estimate, and comparison with the reference recording yields the relative tempo. The music detector determines when the music starts, so the real-time synchronization system does not need to be launched manually. Finally, because music contains many repeated segments, the position estimation mechanism lets us track, in parallel, the positions that may currently be played. Combining these three parts, we can infer in real time the position of the live performance within the original score or reference recording, and the designer of the performance can map events to this information.
We have so far applied these techniques to three types of performance: music visualization, automatic accompaniment/ensemble, and automatic body-movement generation. Our system has been used in several live performances, including the Sound and Sense concert (in cooperation with the Pace Culture and Education Foundation, performed in the National Concert Hall), the opening ceremony of the NTHU AI Orchestra (in collaboration with the NTHU AI Orchestra), the Whispers in the Night concert (in collaboration with flutist Sophia Lin and others, performed in the Weiwuying Auditorium), and the Sound and Shape concert performed at the end of 2019 (in collaboration with Koko Lab. Inc., at Wetland Venue). Beyond validating our methods, these concerts have become an important platform for exchange between technology developers, producers, and performers, with the hope that such technology will take root as a core of the next generation of the multimedia industry.