Papers

Poster Session 1

Video Playlists: Youtube | Bilibili

1-01 Discovering Music Relations with Sequential Attention

Junyan Jiang, Gus Xia and Taylor Berg-Kirkpatrick

The element-wise attention mechanism has been widely used in modern sequence models for text and music. The original attention mechanism focuses on token-level similarity to determine the attention weights. However, these models have difficulty capturing sequence-level relations in music, including repetition, retrograde, and sequences. In this paper, we introduce a new attention module, sequential attention (SeqAttn), which calculates attention weights based on the similarity between pairs of sub-sequences rather than individual tokens. We show that the module is more powerful at capturing sequence-level music relations than the original design. The module shows potential in both music relation discovery and music generation.
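
The gist of the idea can be sketched as below, under assumed shapes (the window length, dot-product similarity, and the `subsequence_attention` helper are illustrative, not the authors' implementation): attention scores are computed between windows of recent tokens rather than between single tokens.

```python
import torch
import torch.nn.functional as F

def subsequence_attention(x, window=4):
    """Toy sub-sequence attention: scores come from windows of tokens, not single tokens."""
    B, T, D = x.shape
    # Average the `window` most recent token embeddings at every position (zero-padded at the start).
    padded = F.pad(x, (0, 0, window - 1, 0))                                 # (B, T+window-1, D)
    windows = padded.unfold(dimension=1, size=window, step=1).mean(dim=-1)   # (B, T, D)
    scores = torch.matmul(windows, windows.transpose(1, 2)) / D ** 0.5       # (B, T, T)
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, x), weights

out, attn = subsequence_attention(torch.randn(2, 16, 32))
print(out.shape, attn.shape)
```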

Paper - Poster - Video (Youtube) - Video (bilibili)

1-02 Lyrics Information Processing: Analysis, Generation, and Applications

Kento Watanabe and Masataka Goto

In this paper, we propose Lyrics Information Processing (LIP) as a research field for technologies focusing on lyrics text, which has both linguistic and musical characteristics. This field could bridge the Natural Language Processing field and the Music Information Retrieval field, leverage technologies developed in those fields, and bring challenges that encourage the development of new technologies. We introduce three main approaches in LIP, 1) lyrics analysis, 2) lyrics generation and writing support, and 3) lyrics-centered applications, and briefly discuss their importance, current approaches, and limitations.

Paper - Poster - Video (Youtube) - Video (bilibili)

1-03 Prediction of user listening contexts for music playlists

Jeong Choi, Anis Khlif and Elena Epure

In this work, we set up a novel task of playlist context prediction. From a large playlist title corpus, we manually curate a subset of multi-lingual labels referring to user activities (e.g. 'jogging', 'meditation', 'au calme'), which we further consider in the prediction task. We explore different approaches to calculate and aggregate track-level contextual semantic embeddings in order to represent a playlist and predict the playlist context from this representation. Our baseline results show that the task can be addressed with a simple framework using information from either audio or distributional similarity of tracks in terms of track-context co-occurrences.
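
A minimal sketch of the aggregation step, using synthetic data (the track embeddings, dimensions, and label set below are hypothetical): mean-pool track vectors into a playlist vector, then train a plain classifier over the activity labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: each playlist is a list of 64-d track embeddings
# (audio-based or co-occurrence-based), paired with a context label.
playlists = [rng.normal(size=(rng.integers(5, 30), 64)) for _ in range(200)]
labels = rng.integers(0, 3, size=200)          # e.g. 0=jogging, 1=meditation, 2=au calme

# One simple aggregation: mean-pool track embeddings into a playlist vector.
X = np.stack([tracks.mean(axis=0) for tracks in playlists])

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.score(X, labels))
```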

Paper - Poster - Video (Youtube) - Video (bilibili)

1-04 Computational Linguistics Metrics for the Evaluation of Two-Part Counterpoint Generated with Neural Machine Translation

Stefano Kalonaris, Thomas McLachlan and Anna Aljanaki

In this paper, two-part music counterpoint is modeled as a neural machine translation (NMT) task, and the relevance of automatic metrics to human-targeted evaluation is investigated. To this end, we propose a novel metric and conduct a user study comparing it to the automatic scores of a base model known to perform well on language tasks, along with different models obtained with hyper-parameter tuning. Insights from this investigation are then speculatively extended to the evaluation of generative music systems in general, which still lacks a standardised procedure and general consensus.

Paper - Poster - Video (Youtube) - Video (bilibili)

1-05 BUTTER: A Representation Learning Framework for Bi-directional Music-Sentence Retrieval and Generation

Yixiao Zhang, Ziyu Wang, Dingsu Wang and Gus Xia

We propose BUTTER, a unified multimodal representation learning model for bi-directional music-sentence retrieval and generation. Based on the variational autoencoder framework, our model learns three interrelated latent representations: 1) a latent music representation, which can be used to reconstruct a short piece, 2) keyword embeddings of music descriptions, which can be used for caption generation, and 3) a cross-modal representation, which is disentangled into several different attributes of music by aligning the latent music representation and keyword embeddings. By mapping between different latent representations, our model can search/generate music given an input text description, and vice versa. Moreover, the model enables controlled music transfer by partially changing the keywords of corresponding descriptions.
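
A rough sketch of one ingredient, the alignment between a music latent and keyword embeddings (the encoder, dimensions, and MSE alignment loss below are illustrative assumptions, not the BUTTER architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch (not the authors' architecture): a music encoder producing a
# latent code, a keyword embedding table, and a loss that aligns slices of the
# music latent with the embeddings of the keywords describing the piece.
music_encoder = nn.GRU(input_size=128, hidden_size=64, batch_first=True)
keyword_emb = nn.Embedding(num_embeddings=500, embedding_dim=16)

piano_roll = torch.randn(8, 32, 128)        # (batch, time, pitch features)
keywords = torch.randint(0, 500, (8, 4))    # 4 keyword ids per piece

_, h = music_encoder(piano_roll)            # h: (1, batch, 64)
z = h.squeeze(0).view(8, 4, 16)             # split the latent into 4 attribute slices
align_loss = F.mse_loss(z, keyword_emb(keywords))
print(align_loss.item())
```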

Paper - Poster - Video (Youtube) - Video (bilibili)

1-06 Unsupervised Melody Segmentation Based on a Nested Pitman-Yor Language Model

Shun Sawada, Kazuyoshi Yoshii and Keiji Hirata

This paper presents unsupervised melody segmentation using a language model based on a nonparametric Bayesian model. We adapt unsupervised word segmentation with a nested Pitman-Yor language model (NPYLM), used in the field of natural language processing, to musical note sequences. Treating music as a language, we aim to extract fundamental units, similar to "words" in natural language, from symbolic musical note sequences using a data-driven approach, namely the NPYLM. We assume that musical note sequences are generated by a probabilistic model that integrates a note-level n-gram language model and a motif-level n-gram language model, and we extract fundamental units (motifs) from them. This enables us to carry out melody segmentation and obtain a language model for the segments directly from a musical note sequence without annotation. We discuss the characteristics of this model by comparing it with the rules and grouping structure of the Generative Theory of Tonal Music (GTTM).
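
The flavour of data-driven segmentation can be illustrated with a much simpler stand-in than the NPYLM: a Viterbi segmentation of a note sequence under a unigram motif model with a hand-specified lexicon (everything below is a toy assumption, not the paper's Bayesian model).

```python
import math

def segment(notes, motif_logprob, max_len=4):
    """Viterbi segmentation of a note sequence into motifs under a unigram motif model."""
    n = len(notes)
    best = [0.0] + [-math.inf] * n      # best[i]: best log-prob of notes[:i]
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            motif = tuple(notes[j:i])
            score = best[j] + motif_logprob.get(motif, math.log(1e-6))
            if score > best[i]:
                best[i], back[i] = score, j
    # Recover the segmentation from the backpointers.
    segments, i = [], n
    while i > 0:
        segments.append(notes[back[i]:i])
        i = back[i]
    return segments[::-1]

lexicon = {('C', 'E', 'G'): math.log(0.3), ('G', 'F', 'E', 'D'): math.log(0.2)}
print(segment(['C', 'E', 'G', 'G', 'F', 'E', 'D'], lexicon))
```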

Paper - Poster - Video (Youtube) - Video (bilibili)

1-07 Generation of lyrics lines conditioned on music audio clips

Olga Vechtomova, Gaurav Sahu and Dhruv Kumar

We present a system for generating novel lyrics lines conditioned on music audio. A bimodal neural network model learns to generate lines conditioned on any given short audio clip. The model consists of a spectrogram variational autoencoder (VAE) and a text VAE. Both automatic and human evaluations demonstrate the effectiveness of our model in generating lines that have an emotional impact matching a given audio clip. The system is intended to serve as a creativity tool for songwriters.
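
A minimal sketch of the conditioning idea, with assumed shapes and modules (not the authors' exact VAEs): an audio clip's spectrogram is encoded into a latent vector that initializes a text decoder generating a lyric line.

```python
import torch
import torch.nn as nn

# Minimal sketch: encode a mel-spectrogram clip into a latent vector and use it
# as the initial hidden state of a word-level decoder that generates a lyric line.
spec_encoder = nn.Sequential(nn.Flatten(), nn.Linear(80 * 64, 128), nn.Tanh())
decoder_rnn = nn.GRU(input_size=300, hidden_size=128, batch_first=True)
vocab_proj = nn.Linear(128, 10000)
word_emb = nn.Embedding(10000, 300)

spectrogram = torch.randn(4, 80, 64)            # (batch, mel bins, frames)
tokens = torch.randint(0, 10000, (4, 12))       # teacher-forced lyric line

z = spec_encoder(spectrogram)                   # (4, 128) audio latent
out, _ = decoder_rnn(word_emb(tokens), z.unsqueeze(0))
logits = vocab_proj(out)                        # (4, 12, vocab)
print(logits.shape)
```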

Paper - Poster - Video (Youtube) - Video (bilibili)

1-08 Classification of Nostalgic Music Through LDA Topic Modeling and Sentiment Analysis of YouTube Comments in Japanese Songs

Kongmeng Liew, Yukiko Uchida, Nao Maeura and Eiji Aramaki

Nostalgia has been defined as a bittersweet, social emotion that is often induced through music. In this paper, we examine how these may be expressed in Japanese YouTube comments of nostalgic (mid-2000s) and non-nostalgic (recent) songs (music videos). Specifically, we used sentiment analysis and Latent Dirichlet Allocation (LDA) topic modeling to examine emotion word usage and broader themes across comments. A gradient boosted decision tree classifier was then able to classify nostalgic and non-nostalgic music videos above chance level. This suggests that analyses on video/music comments may be a possible method to quantify expressions of listener emotions, and categorise musical stimuli.
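
A toy sketch of the feature pipeline using scikit-learn (the comments, sentiment scores, and labels below are made up for illustration): LDA topic proportions plus a sentiment score feed a gradient boosted classifier.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-in for the comment corpus; sentiment scores are appended to the
# LDA topic proportions as a hypothetical extra feature column.
comments = ["natsukashii this song takes me back", "great new track love it",
            "reminds me of my school days", "amazing video just released"]
video_is_nostalgic = np.array([1, 0, 1, 0])
sentiment = np.array([[0.4], [0.8], [0.3], [0.9]])

counts = CountVectorizer().fit_transform(comments)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)
X = np.hstack([topics, sentiment])

clf = GradientBoostingClassifier().fit(X, video_is_nostalgic)
print(clf.predict(X))
```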

Paper - Poster - Video (Youtube) - Video (bilibili)

1-09 Music autotagging as captioning

Tian Cai, Michael I Mandel and Di He

Music autotagging has typically been formulated as a multi-label classification problem. This approach assumes that tags associated with a clip of music are an unordered set. With the recent success of image and video captioning as well as environmental audio captioning, we propose formulating music autotagging as a captioning task, which automatically associates tags with a clip of music in the order a human would apply them. Under the formulation of captioning as a sequence-to-sequence problem, previous music autotagging systems can be used as the encoder, extracting a representation of the musical audio. An attention-based decoder is added to learn to predict a sequence of tags describing the given clip. Experiments are conducted on data collected from the MajorMiner game, which includes the order and timing that tags were applied to clips by individual users, and contains 3.95 captions per clip on average.
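
The decoding side can be sketched roughly as follows (the frame embeddings, start-tag id, and module sizes are assumptions, not the paper's model): an attention-based decoder consumes the encoder's frame representation and emits tags one at a time, so tag order matters.

```python
import torch
import torch.nn as nn

# Sketch of the captioning view: a pre-trained autotagger backbone is assumed to
# yield frame embeddings; a small attentive decoder then greedily emits tags.
frames = torch.randn(1, 50, 128)                     # encoder output (batch, frames, dim)
attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
tag_emb = nn.Embedding(1000, 128)
rnn = nn.GRUCell(256, 128)
out_proj = nn.Linear(128, 1000)

h = frames.mean(dim=1).squeeze(0)                    # (128,) initial decoder state
tag = torch.tensor([0])                              # hypothetical <start> tag id
for _ in range(5):                                   # greedy decoding of 5 tags
    context, _ = attn(h.view(1, 1, -1), frames, frames)
    h = rnn(torch.cat([tag_emb(tag), context.view(1, -1)], dim=-1), h.view(1, -1)).squeeze(0)
    tag = out_proj(h).argmax().view(1)
    print(int(tag))
```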

Paper - Poster - Video (Youtube) - Video (bilibili)

Poster Session 2

Video Playlists: Youtube | Bilibili

2-01 Did You “Read” the Next Episode? Using Textual Cues for Predicting Podcast Popularity

Brihi Joshi, Shravika Mittal and Aditya Chetan

Podcasts are an easily accessible medium of entertainment and information, often covering content from a variety of domains. However, only a few of them garner enough attention to be deemed 'popular'. In this work, we investigate the textual cues that assist in differentiating popular podcasts from unpopular ones. Despite having very similar polarity and subjectivity, the lexical cues contained in the podcasts are significantly different. Thus, we employ a triplet-based training method to learn a text-based representation of a podcast, which is then used for a downstream task of "popularity prediction". Our best model achieved an F1 score of 0.82, a relative improvement of 12.3% over the best baseline.
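
A minimal sketch of triplet-based representation learning for this setting (the encoder, feature dimension, and random inputs are placeholders): popular-podcast anchors are pulled towards popular positives and pushed away from unpopular negatives.

```python
import torch
import torch.nn as nn

# Hypothetical text features (e.g. averaged word vectors) for each podcast.
encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 64))
triplet_loss = nn.TripletMarginLoss(margin=1.0)

anchor   = torch.randn(32, 300)   # a popular podcast
positive = torch.randn(32, 300)   # another popular podcast
negative = torch.randn(32, 300)   # an unpopular podcast

loss = triplet_loss(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
print(loss.item())
```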

Paper - Poster - Video (Youtube) - Video (bilibili)

2-02 Using Latent Semantics of Playlist Titles and Descriptions to Enhance Music Recommendations

Yun Hao and J. Stephen Downie

Music playlists, either user-generated or curated by music streaming services, often come with titles and descriptions. While crucial to music recommendations, leveraging titles and descriptions is difficult due to sparsity and noise in the data. In this work, we propose to capture useful latent semantics behind playlist titles and descriptions through proper clustering of similar playlists. In particular, we clustered 20,065 playlists with both titles and descriptions into 562 groups using track vectors learned by a word2vec model on over 1 million playlists. By fitting a Naive Bayes model on titles and descriptions to predict cluster membership and using the cluster membership information for music recommendations, we present a simple and promising solution to the cold-start problem in music recommendation. We believe that when combined with other sources of features such as audio and user interaction, the proposed approach would bring further enhancement to music recommendations.
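
A simplified sketch of the pipeline with stand-in data (the playlist vectors, titles, and cluster count below are hypothetical): cluster playlist vectors, then fit a Naive Bayes model on title/description text to predict cluster membership.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Stand-ins: word2vec-style track vectors averaged into playlist vectors,
# plus toy titles/descriptions for each playlist.
playlist_vecs = rng.normal(size=(1000, 100))
texts = ["chill evening acoustic"] * 500 + ["workout running power"] * 500

clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(playlist_vecs)
bow = CountVectorizer().fit_transform(texts)
nb = MultinomialNB().fit(bow, clusters)
print(nb.predict(bow[:3]))
```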

Paper - Poster - Video (Youtube) - Video (bilibili)

2-03 Symbolic Music Generation with Transformer-GANs

Aashiq Muhamed, Liang Li, Xingjian Shi, Rahul Suresh and Alexander Smola

Transformers have emerged as the dominant approach in music literature for generating minute-long compositions with compelling musical structure. These models are trained by minimizing the negative log-likelihood (NLL) of the observed sequence autoregressively. Unfortunately, the quality of samples from these models tends to degrade significantly for long sequences, a phenomenon attributed to exposure bias. Fortunately, we are able to detect these failures with classifiers trained to distinguish between real and sampled sequences. This motivates our Transformer-GAN framework that trains an additional discriminator to complement the NLL objective. We use a pre-trained SpanBERT model for the discriminator, which in our experiments helped with training stability. Using human evaluations and other objective metrics we demonstrate that music generated by our approach outperforms a baseline trained with likelihood maximization and the state-of-the-art Music Transformer.
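
A toy sketch of the combined objective (the tiny discriminator below stands in for SpanBERT, and the sampling step is simplified): the generator's NLL loss is augmented with an adversarial term from a real-vs-sampled discriminator.

```python
import torch
import torch.nn as nn

vocab = 256
generator_logits = torch.randn(8, 64, vocab, requires_grad=True)   # pretend LM output
real_tokens = torch.randint(0, vocab, (8, 64))
discriminator = nn.Sequential(nn.Embedding(vocab, 32), nn.Flatten(),
                              nn.Linear(64 * 32, 1))

nll = nn.CrossEntropyLoss()(generator_logits.reshape(-1, vocab), real_tokens.reshape(-1))
# NB: argmax is not differentiable; a real setup needs Gumbel-softmax or policy gradients here.
sampled = generator_logits.argmax(dim=-1)                           # "generated" sequence
adv = nn.BCEWithLogitsLoss()(discriminator(sampled).squeeze(-1),
                             torch.ones(8))                         # try to fool the discriminator
loss = nll + 0.1 * adv
print(loss.item())
```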

Paper - Poster - Video (Youtube) - Video (bilibili)

2-04 An Information-based Model for Writing Style Analysis of Lyrics

Melesio Crespo-Sánchez, Edwin Aldana-Bobadilla, Iván López-Arévalo and Alejandro Molina-Villegas

One of the most important parts of a song's content is the lyrics, in which authors express feelings or thoughts that may reflect their way of seeing the world. This is perhaps the reason why modern text mining techniques have been applied to lyrics to find semantic aspects that allow us to recognize emotions, topics, and authorship, among others. In this work, we focus on the analysis of syntactic aspects, assuming that they are important elements for recognizing patterns related to the writing style of an individual author or a musical genre. We present an information-theoretic model, based on a corpus of lyrics, that allows finding discriminating elements of a writing style which could be used to estimate, for example, the authorship or musical genre of a given lyric.
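
One way an information-based view of style can be illustrated (a hedged toy example, not the paper's model): compare the smoothed word distribution of a lyric against a background distribution via KL divergence, where a lower divergence indicates a closer stylistic match.

```python
import numpy as np
from collections import Counter

def word_distribution(text, vocab):
    counts = Counter(text.lower().split())
    probs = np.array([counts[w] for w in vocab], dtype=float) + 1.0   # add-one smoothing
    return probs / probs.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Hypothetical illustration: two lyrics compared against a genre's background distribution.
vocab = ["love", "night", "road", "fire", "heart"]
background = word_distribution("love heart love night heart love", vocab)
print(kl(word_distribution("love love heart night", vocab), background))
print(kl(word_distribution("road fire road fire road", vocab), background))
```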

Paper - Poster - Video (Youtube) - Video (bilibili)

2-05 Hyperbolic Embeddings for Music Taxonomy

Maria Astefanoaei and Nicolas Collignon

Musical genres are inherently ambiguous and difficult to define. Even more so is the task of establishing how genres relate to one another. Yet, genre is perhaps the most common and effective way of describing musical experience. The number of possible genre classifications (e.g. Spotify has over 4000 genre tags, LastFM over 500,000 tags) has made the idea of manually creating music taxonomies obsolete. We propose to use hyperbolic embeddings to learn a general music genre taxonomy by inferring continuous hierarchies directly from the co-occurrence of music genres from a large dataset. We evaluate our learned taxonomy against human expert taxonomies and folksonomies. Our results show that hyperbolic embeddings significantly outperform their Euclidean counterparts (Word2Vec), and also capture hierarchical structure better than various centrality measures in graphs.
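
The key geometric ingredient is the Poincaré-ball distance used by hyperbolic embeddings; a minimal numpy version with made-up 2-d points is sketched below, where a broad genre near the origin stays close to narrower genres near the boundary while the narrow genres stay far from each other.

```python
import numpy as np

def poincare_distance(u, v):
    """Distance in the Poincare ball, the geometry used by hyperbolic embeddings."""
    norm_u, norm_v = np.sum(u * u), np.sum(v * v)
    diff = np.sum((u - v) ** 2)
    return np.arccosh(1.0 + 2.0 * diff / ((1.0 - norm_u) * (1.0 - norm_v)))

# Toy 2-d points: a broad genre near the origin and two narrower ones near the boundary.
rock = np.array([0.05, 0.0])
indie_rock, noise_rock = np.array([0.7, 0.1]), np.array([0.1, 0.72])
print(poincare_distance(rock, indie_rock))        # parent-child: relatively small
print(poincare_distance(indie_rock, noise_rock))  # sibling-sibling: larger
```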

Paper - Poster - Video (Youtube) - Video (bilibili)

2-06 Interacting with GPT-2 to Generate Controlled and Believable Musical Sequences in ABC Notation

Cariña Geerlings and Albert Meroño-Peñuela

Generating symbolic music with language models is a promising research area, with potential applications in automated music composition. Recent work shows that Transformer architectures can learn to generate compelling four-instrument scores from large MIDI datasets. In this paper, we re-train the small (117M) GPT-2 model on a large dataset in ABC notation and generate samples of single-instrument folk music. Our quantitative evaluations, based on BLEU and ROUGE, and our survey-based qualitative evaluations suggest that ABC notation is learned with syntactic and semantic correctness, and that samples contain robust and believable n-grams.
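
Sampling from GPT-2 with the Hugging Face transformers library can be sketched as below; note this uses the stock 117M checkpoint and an assumed ABC header prompt, whereas the paper's results require re-training on an ABC corpus first.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "X:1\nT:Generated Tune\nM:6/8\nK:Dmaj\n"   # ABC tune header used as the prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    output = model.generate(input_ids, max_length=200, do_sample=True,
                            top_p=0.95, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0]))
```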

Paper - Poster - Video (Youtube) - Video (bilibili)

2-07 Comparing Lyrics Features for Genre Recognition

Maximilian Mayerl, Michael Vötter, Manfred Moosleitner and Eva Zangerle

In music information retrieval, genre recognition is the task of automatically assigning genre labels to a given piece of music. Approaches for this typically employ machine learning models trained on content features extracted from the audio. Relatively little attention has been given to using textual features based on a song's lyrics to solve this task. We therefore attempt to investigate how well such lyrics features work for the task of genre recognition by training and evaluating models based on various sets of well-known textual features computed on song lyrics. Our results show that textual features produce accuracy scores comparable to audio features. Further, we see that audio and textual features complement each other well, with models trained using both types of features producing the best accuracy. To aid the reproducibility of our results, we make our code publicly available.
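
A minimal stand-in for one of the textual feature sets, a TF-IDF bag-of-words model with a linear classifier (the lyrics and genre labels below are invented; the paper evaluates a broader set of lyric features):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

lyrics = ["we rock all night with guitars loud", "slow dance under neon lights tonight",
          "ride the highway with my guitar loud", "hold me close and slowly sway tonight"]
genres = ["rock", "pop", "rock", "pop"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(lyrics, genres)
print(model.predict(["loud guitar on the night highway"]))
```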

Paper - Poster - Video (Youtube) - Video (bilibili)

2-08 MusicBERT - learning multi-modal representations for music and text

Federico Rossetto and Jeff Dalton

Recent advances in deep learning and neural models have led to significant progress in both NLP (text) and music representations. However, the representations and tasks remain largely separate. Most MIR models focus on either music or text representations, but not both. In this work we propose unifying these two modalities in a shared latent space. We propose building on a common framework of Transformer-based encoders for both text and music modalities, using supervised and unsupervised methods for pre-training and fine-tuning. We present initial results and discuss key challenges that need to be overcome to make this possible. The result will be a new class of models able to perform advanced tasks that span both NLP and music, including advanced question answering and the next generation of conversational virtual musical assistants.
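
A hypothetical sketch of what a shared text/music latent space could look like (none of the modules or sizes below come from the paper): two small Transformer encoders project each modality into a common space, and a contrastive loss pulls matching pairs together.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

text_enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), 2)
music_enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), 2)

text_tokens = torch.randn(8, 20, 64)     # embedded text descriptions
music_tokens = torch.randn(8, 50, 64)    # embedded note/audio events

t = F.normalize(text_enc(text_tokens).mean(dim=1), dim=-1)    # (8, 64)
m = F.normalize(music_enc(music_tokens).mean(dim=1), dim=-1)  # (8, 64)
logits = t @ m.T / 0.07                                       # cross-modal similarity matrix
loss = F.cross_entropy(logits, torch.arange(8))               # match i-th text to i-th music clip
print(loss.item())
```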

Paper - Poster - Video (Youtube) - Video (bilibili)
