Our research interest is machine musicianship: applying machine learning and signal processing techniques to enable a machine to understand, interact with, appreciate, and compose music like a human being. Major research topics encompass automatic music transcription, music and human-computer interaction, automatic music generation, music performance assessment and conversion, computational music analysis, and fundamental research on machine learning and signal processing. Active projects are as follows:
Polyphonic music is arguably one of the most complicated types of audio signals. It is hard to analyze because of harmonic peaks that overlap across multiple sources, the diversity of timbre, and the scarcity of human-labeled data. Approaches based on signal processing and on machine learning alike still face great challenges in solving the multi-pitch estimation problem. We propose the deep discrete Fourier transform (DDFT), which can estimate the fundamental frequencies (F0s) of a multi-component signal even under severe contamination by convolutive noise. By stacking multiple layers of discrete Fourier transforms, filters, and activation functions, the DDFT with the combined frequency and periodicity (CFP) pitch salience function exhibits improved multi-pitch estimation performance as the number of layers increases. Combining the spirit of homomorphic signal processing with that of the multi-layer perceptron, the DDFT also provides new insights into how deep learning works.
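As a rough illustration of the CFP idea behind the DDFT, the sketch below combines a magnitude spectrum (frequency domain) with a rectified generalized cepstrum (periodicity domain) on a common F0 grid. The function name, parameters, and the single extra layer are illustrative assumptions, not the released implementation.

```python
import numpy as np

def cfp_salience(frame, sr, f0_grid):
    """Sketch of a combined frequency and periodicity (CFP) pitch salience.
    Layer 1 is a DFT (magnitude spectrum); a log activation and a second DFT
    give a generalized cepstrum (periodicity / lag domain); both are read off
    at each candidate F0 and multiplied, so only F0s supported by harmonic
    peaks AND periodicity survive. Illustrative only: the real DDFT adds
    filtering and further layers."""
    n = len(frame)
    spec = np.abs(np.fft.rfft(frame))              # layer 1: magnitude spectrum
    ceps = np.fft.irfft(np.log1p(spec), n=n)       # layer 2: DFT of the activated spectrum
    ceps = np.maximum(ceps, 0.0)                   # rectification (nonlinear activation)
    salience = np.zeros(len(f0_grid))
    for i, f0 in enumerate(f0_grid):
        k = int(round(f0 * n / sr))                # spectral bin at the candidate F0
        q = int(round(sr / f0))                    # cepstral lag (samples) at period 1/f0
        if k < len(spec) and q < len(ceps):
            salience[i] = spec[k] * ceps[q]
    return salience

# toy check: a two-pitch mixture (220 Hz + 330 Hz)
sr, n = 16000, 4096
t = np.arange(n) / sr
frame = (np.sin(2 * np.pi * 220 * t) + np.sin(2 * np.pi * 330 * t)) * np.hanning(n)
f0_grid = np.arange(80.0, 800.0, 1.0)
sal = cfp_salience(frame, sr, f0_grid)
print(f0_grid[np.argsort(sal)[-5:]])               # top candidates should lie near 220 Hz and 330 Hz
```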
*This project is supported by the MOST Project 106-2218-E-001-003-MY3.
For a complete transcription of a guitar performance, the detection of playing techniques such as bends and vibrato is important, because playing techniques indicate how the melody is interpreted through the manipulation of the guitar strings. While existing work has mostly focused on playing technique detection for individual single notes, we attempt to expand this endeavor to recordings of guitar solo tracks. Specifically, we treat the task as a time-sequence pattern recognition problem and develop a two-stage framework for detecting five fundamental playing techniques used on the electric guitar. Given an audio track, the first stage identifies prominent candidates by analyzing the extracted melody contour, and the second stage applies a pre-trained classifier to the candidates, using a set of timbre and pitch features, to detect the playing techniques. Experiments show the potential of this system for transcribing real-world electric guitar solos with accompaniment.
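A minimal sketch of the two-stage structure, assuming a frame-level melody contour in Hz; the candidate criterion, the feature set, and the toy classifier below are illustrative stand-ins for the actual features and pre-trained model.

```python
import numpy as np
from sklearn.svm import SVC

def find_candidates(contour_hz, win=15, dev_cents=30.0):
    """Stage 1: flag frames of the extracted melody contour whose pitch deviates
    strongly from the local median -- prominent candidates for techniques such
    as bend, vibrato, or slide."""
    cands = []
    for i in range(win, len(contour_hz) - win):
        local = contour_hz[i - win:i + win + 1]
        voiced = local[local > 0]
        if len(voiced) == 0 or contour_hz[i] <= 0:
            continue
        dev = 1200.0 * abs(np.log2(contour_hz[i] / np.median(voiced)))
        if dev > dev_cents:
            cands.append(i)
    return cands

def candidate_features(contour_hz, idx, win=15):
    """Toy pitch features around a candidate (range and spread in cents);
    the actual system also uses timbre features."""
    seg = contour_hz[max(0, idx - win): idx + win + 1]
    seg = seg[seg > 0]
    cents = 1200.0 * np.log2(seg / seg[0])
    return [cents.max() - cents.min(), cents.std()]

# Stage 2: a classifier (pre-trained in the real system) labels each candidate.
# Here it is fit on a few made-up feature vectors just to make the sketch run.
clf = SVC().fit([[80.0, 25.0], [300.0, 90.0], [60.0, 20.0], [250.0, 80.0]],
                ["vibrato", "bend", "vibrato", "bend"])

t = np.arange(300) * 0.01                                        # 3 s at a 10 ms hop
contour = 220.0 * 2 ** (0.5 * np.sin(2 * np.pi * 6 * t) / 12)    # synthetic vibrato around A3
cands = find_candidates(contour)
if cands:
    print(len(cands), clf.predict([candidate_features(contour, i) for i in cands])[:3])
```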
Resource allocation is a critical issue in implementing a portable, low-latency, and efficient system for interactive experiences. To address this, we propose parallel dynamic time warping (PDTW), which employs a server-client architecture on a multi-core system, for real-time audio-to-audio music alignment. We also discuss an evaluation methodology that benchmarks the trade-offs among latency, alignment accuracy, and computing resources. The proposed system first divides the input audio stream into multiple short clips and then performs dynamic time warping (DTW) on every clip to obtain multiple estimates of the instantaneous speed of a live performance with respect to its reference performance. Fusing the estimates computed from the clips gives a stable estimate of the instantaneous tempo. With the aid of parallel computing, the system not only reduces processing time but also improves both the alignment accuracy and the robustness to tempo variation. On Nov. 2, 2017, we demonstrated the system in a classical music concert at the National Concert Hall of Taiwan.
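The per-clip speed estimation and fusion can be sketched as follows, using plain subsequence DTW on 1-D feature streams. The feature choice and clip length are illustrative; the real system additionally runs the independent per-clip DTWs in parallel on a multi-core server.

```python
import numpy as np

def subsequence_dtw_speed(clip, ref):
    """Match a short live clip inside the reference with subsequence DTW and
    return the local speed, i.e. reference frames covered per live frame."""
    n, m = len(clip), len(ref)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :] = 0.0                                  # the match may start anywhere in the reference
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(clip[i - 1] - ref[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    j = int(np.argmin(D[n, 1:])) + 1               # best end column of the match
    j_end, i = j, n
    while i > 1:                                   # backtrack to the start of the match
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return (j_end - j + 1) / n

# toy demo: the "live" feature stream traverses the same curve as the reference,
# only 1.25x faster; each clip is matched independently (hence parallelizable)
# and the per-clip speed estimates are fused with a median.
rng = np.random.default_rng(0)
ctrl = rng.standard_normal(50).cumsum()            # a smooth random stand-in feature curve
ref = np.interp(np.linspace(0, 49, 400), np.arange(50), ctrl)
live = np.interp(np.linspace(0, 49, 320), np.arange(50), ctrl)
clip_len = 40
speeds = [subsequence_dtw_speed(live[s:s + clip_len], ref)
          for s in range(0, len(live) - clip_len + 1, clip_len)]
print(np.median(speeds))                           # close to 1.25
```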
*This project is supported by the Data Science and AI Project of Academia Sinica.
Utilizing deep learning techniques to generate musical content has attracted intensive attention in recent years. Within this context, we investigate music style transfer, a specific yet practical music generation task that aims to rearrange the style of a given music piece from one style to another while preserving the essence of that piece, such as its melody or chord progression. In particular, we address the style transfer of homophonic music, which is composed of a predominant melody part and an accompaniment part; the latter is modified through Gibbs sampling on a generative model that combines recurrent neural networks and autoregressive models. Both objective and subjective experiments are presented to assess the capability of transferring the style of an arbitrary music piece with homophonic texture into two distinct target styles: Bach's chorales and jazz.
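A toy sketch of the sampling loop is given below: single-site Gibbs sampling over the accompaniment with the melody held fixed. The conditional model here is a hand-crafted placeholder; in the actual system this conditional would come from the trained RNN/autoregressive model.

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional_probs(melody_t, left, right, n_states):
    """Placeholder for the learned conditional p(a_t | melody_t, a_{t-1}, a_{t+1}).
    It merely prefers accompaniment pitches near the melody and smooth voice
    leading; the real conditional is produced by the trained generative model."""
    pitches = np.arange(n_states)
    logits = -0.5 * (pitches - melody_t) ** 2 / 9.0
    logits += -0.2 * np.abs(pitches - left) - 0.2 * np.abs(pitches - right)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def gibbs_restyle(melody, accomp, n_states=25, sweeps=50):
    """Single-site Gibbs sampling over the accompaniment: repeatedly resample
    each accompaniment step from its conditional while the melody stays fixed."""
    accomp = accomp.copy()
    T = len(melody)
    for _ in range(sweeps):
        for t in rng.permutation(T):               # random scan order
            left = accomp[t - 1] if t > 0 else melody[t]
            right = accomp[t + 1] if t < T - 1 else melody[t]
            p = conditional_probs(melody[t], left, right, n_states)
            accomp[t] = rng.choice(n_states, p=p)
    return accomp

melody = np.array([12, 14, 16, 17, 19, 17, 16, 14])    # toy pitch indices
accomp = rng.integers(0, 25, size=len(melody))          # random initialization
print(gibbs_restyle(melody, accomp))
```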
🎶 Singing Voice Pitch Correction - Sound sample 1
🎶 Singing Voice Pitch Correction - Sound sample 2
Expressive singing voice correction is an appealing but challenging problem. A robust time-warping algorithm that synchronizes two singing recordings can provide a promising solution. We therefore propose to address the problem with canonical time warping (CTW), which aligns amateur singing recordings to professional ones. A new pitch contour is generated from the alignment information, and the pitch-corrected singing is synthesized back through a vocoder. The objective evaluation shows that CTW is robust against pitch-shifting and time-stretching effects, and the subjective test demonstrates that CTW outperforms the other methods, including DTW and commercial auto-tuning software. Finally, we demonstrate the applicability of the proposed method in a practical, real-world scenario.
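The correction step can be sketched as follows, assuming the frame-level alignment path has already been computed (by CTW in our system) and using a crude sinusoidal oscillator as a stand-in for the vocoder; the toy contours and the linear path are illustrative only.

```python
import numpy as np

def correct_pitch(amateur_f0, pro_f0, path):
    """Given an alignment path [(i, j), ...] from amateur frame i to professional
    frame j, replace each amateur pitch with the professional pitch it is aligned to."""
    out = amateur_f0.copy()
    for i, j in path:
        out[i] = pro_f0[j]
    return out

def resynthesize(f0, sr=16000, hop=160):
    """Crude sinusoidal resynthesis as a stand-in for the vocoder: each frame's
    F0 drives an oscillator with continuous phase."""
    phase, samples = 0.0, []
    for f in f0:
        inc = 2 * np.pi * f / sr
        ph = phase + inc * np.arange(hop)
        samples.append(np.sin(ph))
        phase = ph[-1] + inc
    return np.concatenate(samples)

# toy demo: the amateur sings about a semitone flat; a (here simply linear)
# alignment path maps amateur frames to professional frames
pro = 220.0 * np.ones(200)
amateur = 208.0 * np.ones(200)
path = [(i, i) for i in range(200)]
audio = resynthesize(correct_pitch(amateur, pro, path))
print(audio.shape)
```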
The manipulation of different interpretational factors, including dynamics, duration, and vibrato, constitutes the realization of different expressions in music. A deeper understanding of the functions of these factors is therefore critical for advanced expressive synthesis and computer-aided music education. We propose the novel task of automatic expressive musical term classification as a direct means to study these interpretational factors. We consider up to 10 expressive musical terms, such as Scherzando and Tranquillo, and compile a new dataset of solo violin excerpts featuring the realization of different expressive terms by different musicians for the same set of classical music pieces. Under a score-informed scheme, we design and evaluate a number of note-level features characterizing the interpretational aspects of music for the classification task. Our evaluation shows that the proposed features lead to significantly higher classification accuracy than a baseline feature set commonly used in music information retrieval tasks. Moreover, taking the contrast of feature values between an expressive version and its corresponding non-expressive version (if given) of a music piece greatly improves the accuracy in classifying the expressive one. We also draw insights from analyzing the feature relevance and the class-wise accuracy of the predictions.
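A toy sketch of the score-informed contrast idea, using made-up note-level features and synthetic data in place of the violin dataset; the feature definitions, term behaviors, and classifier below are illustrative assumptions only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def note_features(durations, rms, f0_std_cents):
    """Toy note-level interpretational features: played duration, dynamic level,
    and local pitch variability (a rough vibrato cue) for each note."""
    return np.stack([durations, rms, f0_std_cents], axis=1)

def contrast_features(expressive, plain):
    """Score-informed contrast: notes of the expressive and the non-expressive
    rendition are matched one-to-one, and their feature differences are used."""
    return expressive - plain

# purely synthetic data: here "Tranquillo" is rendered with stretched durations
# and little vibrato, "Scherzando" with shortened durations and more variability.
rng = np.random.default_rng(0)

def rendition(stretch, vib, n_notes=16):
    return note_features(rng.uniform(0.2, 0.6, n_notes) * stretch,
                         rng.uniform(0.1, 0.3, n_notes),
                         np.abs(rng.normal(vib, 2.0, n_notes)))

X, y = [], []
for _ in range(20):
    plain = rendition(1.0, 8.0)
    X.append(contrast_features(rendition(1.3, 4.0), plain).mean(axis=0)); y.append("Tranquillo")
    X.append(contrast_features(rendition(0.7, 18.0), plain).mean(axis=0)); y.append("Scherzando")

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:-4], y[:-4])
print(clf.score(X[-4:], y[-4:]))                   # near 1.0 on this toy data
```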
Repetition is the basis of music as an art. Finding repeating patterns (e.g., motives and themes) in a music piece is therefore an important task in the analysis of musical form and structure. However, the pattern discovery task has received less attention than many other tasks in MIR, perhaps because it is highly challenging: a meaningful pattern may give rise to a number of variations, and may lie in any of the many voices or accompaniment parts of a music piece. One basic idea of pattern discovery is the so-called geometric approach, such as the SIA method described by Meredith et al., which takes a multidimensional representation of a music piece as input and finds patterns by searching, for each group of notes, the displacement vectors that translate it onto other notes. However, the conventional SIA approach suffers from two issues. First, since SIA employs a brute-force search strategy, most of the discovered patterns are of little musical interest. Second, SIA finds the set of occurrences of a given pattern iteratively and thus carries a heavy computational load. To address these issues, the proposed algorithm utilizes unsupervised machine learning, namely clustering methods, which directly yield all sets of occurrences for all patterns. The output is then constrained by the assumption that theme-like patterns should occur at the beginning of each repeated section.
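For reference, the core SIA step can be sketched in a few lines: grouping note pairs by their translation (displacement) vector yields, for each vector, the set of notes that can be translated onto other notes in the piece. The note representation and toy piece below are illustrative, and our clustering-based method is not shown here.

```python
from collections import defaultdict

def sia_patterns(notes):
    """Sketch of SIA's core step: notes are (onset, pitch) points, and grouping
    note pairs by their difference (translation) vector gives, for each vector,
    the maximal set of notes translatable onto other notes of the piece."""
    notes = sorted(notes)
    mtps = defaultdict(list)
    for i in range(len(notes)):
        for j in range(i + 1, len(notes)):         # only "forward" translations
            vec = (notes[j][0] - notes[i][0], notes[j][1] - notes[i][1])
            mtps[vec].append(notes[i])
    return mtps

# toy piece: a 3-note motif (onsets 0, 1, 2) repeated 4 beats later, a fifth higher
motif = [(0, 60), (1, 62), (2, 64)]
piece = motif + [(t + 4, p + 7) for t, p in motif] + [(8, 55)]
for vec, pattern in sia_patterns(piece).items():
    if len(pattern) >= 3:                          # keep only larger translatable groups
        print(vec, pattern)
```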
Signal processing and deep learning seem to be "the two cultures" in the field of mathematical modeling. However, they share something fundamental in common that might help unravel why deep learning works so well. In recent years, many attempts to combine concepts from classical signal processing with deep structures and nonlinearities have achieved great success in feature extraction. For example, the deep scattering transform, the Saak (subspace approximation with augmented kernels) transform, and the deep discrete Fourier transform (DDFT) all take advantage of depth to refine their features. Unlike purely data-driven models, these models are interpretable and therefore shed new light on the theoretical study of deep learning.