1 Abstract
This project focuses on modifications made to an offline concatenative synthesis tool known as PySoundConcat. The original system is outlined and modifications are proposed to improve the aesthetic characteristics of its output. The implemented modifications are then evaluated through a survey measuring the perceived improvement in performance.
2 Overview
The PySoundConcat project is a Python-based concatenative synthesis system originally developed for the author’s undergraduate research project. The system aims to provide an open-source tool for exploring the creative potential of combining short-time audio analyses with granular synthesis to synthesize perceptually related representations of target audio files. Through the use of analysed databases of varying sizes, an output can be generated that represents a mix of the spectral and temporal features of the original target sound and the corpus of source sounds (Perry, 2017).
The original work suffered from a number of shortcomings in its implementation. These limited the quality of the output and were detailed in the resulting paper. This project aimed to build on that work to improve the aesthetic qualities of the output produced by the system, addressing the issues outlined and adding further improvements. To achieve this, a number of improvements were made to the analysis and synthesis methods, as detailed in section 4. In addition, a brief evaluation of the system is presented to quantify the effect of the improvements implemented.
3 Original project overview
The original project worked by analysing overlapping segments of audio (known as grains) from both the target sound and the source database, then searching the source database for the grain that most closely matched each target grain. Finally, the output was generated by overlap-adding the best matches.
To create the final output, there are three main operations to perform:
First, descriptor analyses are generated for each audio file in both the source and target databases. A number of features are extracted, including fundamental frequency, temporal energy (RMS), and several spectral descriptors that describe the frequency content of each grain (spectral flux, flatness, crest factor, etc.).
Each audio file’s analyses are split into equally sized overlapping grains and averaged in the appropriate way for comparison with grains from the other database. The matching algorithm then finds the grains with the smallest overall difference, based on user-defined weightings for each of the analysis types. This weighting allows certain analyses to take precedence over others according to user preference. The best-match indexes are then saved to the output database, ready for synthesis.
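The weighted matching described above can be sketched as a weighted nearest-neighbour search over per-grain descriptor vectors. This is a minimal illustration under stated assumptions, not the project’s actual code; the function and parameter names are hypothetical.

```python
import numpy as np

def best_matches(target_feats, source_feats, weights):
    """For each target grain, find the source grain with the smallest
    weighted Euclidean distance across descriptor values.

    target_feats: (n_target, n_descriptors) array of per-grain analyses
    source_feats: (n_source, n_descriptors) array of per-grain analyses
    weights:      (n_descriptors,) user-defined descriptor weightings
    """
    # Scaling both sets by sqrt(weight) makes a plain Euclidean distance
    # equivalent to the weighted distance.
    w = np.sqrt(np.asarray(weights, dtype=float))
    t = target_feats * w
    s = source_feats * w
    # Pairwise squared distances, shape (n_target, n_source)
    d2 = ((t[:, None, :] - s[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)  # index of the best source grain per target grain
```

Raising the weight of a descriptor biases the match towards grains that agree on that descriptor, which is how user preference steers the result.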
The synthesis process involves loading the best-match grains from the source database, performing any post-processing (such as pitch shifting and amplitude scaling) to improve the similarity of the match, then windowed overlap-adding the grains to create the final output. The post-processing phase uses the ratio difference between the source and target grain to artificially alter the source grain so that it better resembles the target. This is particularly useful when using small source databases, as it improves the similarity of any match (important when best matches are not very close to the target). The final output is saved to the output database’s audio directory (Perry, 2016).
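The synthesis stage can be illustrated with a minimal sketch of windowed overlap-add plus RMS-based amplitude scaling. This is a simplified illustration rather than the system’s actual implementation; the names and the choice of a Hann window are assumptions.

```python
import numpy as np

def overlap_add(grains, hop):
    """Windowed overlap-add of equal-length grains at a fixed hop size."""
    grain_len = len(grains[0])
    window = np.hanning(grain_len)
    out = np.zeros(hop * (len(grains) - 1) + grain_len)
    for i, grain in enumerate(grains):
        out[i * hop:i * hop + grain_len] += grain * window
    return out

def match_rms(source_grain, target_rms):
    """Scale a source grain so its RMS matches the target grain's RMS,
    one example of the ratio-based post-processing described above."""
    src_rms = np.sqrt(np.mean(source_grain ** 2))
    if src_rms == 0:
        return source_grain
    return source_grain * (target_rms / src_rms)
```

Scaling each matched grain towards the target’s RMS before overlap-adding is what lets even a distant match track the target’s amplitude envelope.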
4 Implementation Modifications
The following modifications were made to the system to address a variety of issues that were seen to impede the aesthetic quality of results.
4.1 Probabilistic YIN
It was noticed that the simplistic autocorrelation/parabolic-interpolation method employed to estimate a grain’s pitch was prone to “octave errors”. To account for this, a more advanced pitch detection algorithm known as pYIN (Mauch & Dixon, 2014) was implemented, using a pre-made Python library (Gong, 2015). This resulted in greatly improved pitch estimation for both source and target analysis, with far fewer octave errors than the original naive approach.
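For context, the core of YIN (which pYIN builds on) is the cumulative mean normalised difference function (CMNDF). The sketch below shows only that core; pYIN’s improvement is to replace the single hard threshold with a probabilistic distribution over thresholds and HMM-based pitch tracking, which is what suppresses octave errors. This is an illustrative sketch with assumed names, not the pypYIN library’s code.

```python
import numpy as np

def yin_f0(frame, sr, fmin=150.0, fmax=400.0):
    """Estimate f0 for one frame using YIN's cumulative mean normalised
    difference function (a sketch of the core of YIN, not pYIN itself)."""
    tau_min = int(sr / fmax)
    tau_max = int(sr / fmin)
    # Difference function: d(tau) = sum_j (x[j] - x[j + tau])^2
    d = np.array([np.sum((frame[:-tau] - frame[tau:]) ** 2)
                  for tau in range(1, tau_max + 1)])
    # Cumulative mean normalised difference function
    cmndf = d * np.arange(1, tau_max + 1) / np.maximum(np.cumsum(d), 1e-12)
    # Pick the lag with the deepest dip inside the allowed pitch range
    tau = tau_min + np.argmin(cmndf[tau_min - 1:])
    return sr / tau
```

A production estimator would also apply parabolic interpolation around the chosen lag and, as in pYIN, track pitch candidates over time rather than trusting a single per-frame minimum.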
4.2 k-Dimensional Tree Based Continuity Matching
Many pre-existing concatenative synthesis applications make effective use of a path-search algorithm in a post-matching stage (Schwarz, 2006, p. 3). This addresses the issue that consecutive match grains may be closely related to their targets, but not to each other. The method involves selecting multiple best matches for each target grain, followed by a path search through these matches’ descriptors to find the optimum set of grains for resynthesis, improving overall continuity across grains in the synthesized output. Originally a Viterbi algorithm was considered for this task. Unfortunately, due to time constraints, a simpler approach was taken that searched on a grain-by-grain basis for the shortest distance from the previously chosen match. This approach is likely not as effective as a Viterbi algorithm would be; however, it was expected to provide better results than ignoring continuity altogether.
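The greedy grain-by-grain variant described above can be sketched as follows, using SciPy’s k-d tree to gather candidate matches. This is a hypothetical sketch of the technique, not the project’s code; a Viterbi search over the same candidate lattice would instead find the globally optimal path.

```python
import numpy as np
from scipy.spatial import cKDTree

def continuity_match(target_feats, source_feats, k=5):
    """For each target grain, take the k nearest source grains by
    descriptor distance, then greedily pick the candidate closest (in
    descriptor space) to the grain chosen for the previous frame."""
    tree = cKDTree(source_feats)
    _, candidates = tree.query(target_feats, k=k)  # shape (n_target, k)
    path = [candidates[0][0]]  # first frame: plain best match
    for cands in candidates[1:]:
        prev = source_feats[path[-1]]
        # Distance from each candidate to the previously chosen grain
        dists = np.linalg.norm(source_feats[cands] - prev, axis=1)
        path.append(cands[np.argmin(dists)])
    return path
```

The greedy choice trades some match accuracy for smoother grain-to-grain transitions; a second weighting could balance target distance against continuity distance.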
4.3 Descriptor normalisation
Schwarz (2007) notes that it is important to normalise descriptor values to a common range before comparison. This avoids a situation where certain descriptors are given precedence in matching purely because of their range of values. It was noticed that certain spectral values were not being normalised correctly, thus distorting match values. This was corrected by applying proper normalisation to these descriptors in the analysis stage, using the maximum and minimum values described by Lerch (2012, pp. 41–49).
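The normalisation itself is a straightforward min–max mapping; a minimal sketch, assuming each descriptor has a known theoretical range (such as those given by Lerch for the spectral descriptors):

```python
import numpy as np

def normalise(values, dmin, dmax):
    """Map descriptor values into [0, 1] using the descriptor's
    theoretical minimum and maximum, so that no descriptor dominates
    the distance calculation purely through its scale."""
    values = np.asarray(values, dtype=float)
    return np.clip((values - dmin) / (dmax - dmin), 0.0, 1.0)
```

Clipping guards against measured values that stray slightly outside the theoretical range.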
4.4 Inharmonic grain silencing
A minor modification that improved results in a number of circumstances was the option to remove all inharmonic grains. This allows predominantly harmonic target sounds, such as operatic vocals, to use only harmonic grains, removing minor inharmonic and potentially noisy samples for a cleaner overall output.
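In essence this option is a threshold filter over a per-grain harmonicity measure (for example, a voiced probability from the pitch analysis). A minimal sketch, with assumed names:

```python
import numpy as np

def harmonic_only(grain_indices, harmonicity, threshold=0.5):
    """Keep only the grains whose harmonicity measure (e.g. a per-grain
    voiced probability) meets a threshold, discarding inharmonic and
    potentially noisy candidates before matching."""
    harmonicity = np.asarray(harmonicity)
    return [g for g, h in zip(grain_indices, harmonicity) if h >= threshold]
```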
4.5 Pitch-shifting
A significant problem in this system has been the implementation of the pitch shifter. A number of minor alterations have been made to try to account for problems with this code. Unfortunately, the root cause is still unknown, and unwanted noise and glitches are introduced into the output as a result, significantly degrading results.
5 Test Dataset
As with previous iterations of the project, it was important that the system was thoroughly tested using a wide variety of source and target databases to determine its overall performance. The following databases were used for testing and synthesis:
5.1 Iowa Orchestral Samples
The University of Iowa orchestral samples database (2014) has been used as a primary source database. Its wide variety of well-recorded orchestral samples allows for testing across a range of textures relating to well-known musical instruments.
5.2 Alex Harker’s Alternative Vocal Techniques Samples
This database was provided by Dr. Alex Harker of the University of Huddersfield and contains a large collection of alternative vocal techniques originally recorded for electroacoustic composition. The unusual vocal techniques allow for interesting synthesis, particularly in onset and transient sections of a target, where plosive and fricative sounds are commonly matched. This database is not currently publicly available.
5.3 Rayman Spoken Word database
This database was used to test the potential of the system for speech synthesis. The target database contains a small number of enthusiastically spoken phrases, chosen for the variety of pitches and textures across the different syllables. This database is also not publicly available.
5.4 Looperman Opera Acapella Samples
As it is well documented that concatenative synthesis is an effective tool for singing and speech synthesis (Feugere, D’Alessandro, Delalez, Ardaillon, & Roebel, 2016), samples from an acapella database were used to test this system’s performance in this area (Looperman.com, 2017).
6 Evaluation of Performance
The effect of the improvements was measured using a survey to determine the perceived quality of both the original and the revised system’s outputs. The survey used an Absolute Category Rating (ACR) and a Paired Comparison of generated example audio files to gauge opinions on the system’s overall audio quality and the perceived naturalness of the sounds. Seven participants trained in music technology or a related field took part in the survey. Results can be seen in the figures below:
Figure: Quality ACR (mean and standard deviation of results)
Figure: Artificialness ACR (mean and standard deviation of results)
Figure: Paired comparison (mean and standard deviation of results)
From the results it can be seen that there is a marginal increase in perceived quality for vocal-related output; however, overall the improvements have not made the significant impact expected. In the paired comparison, opinion was either split or significantly favoured the older synthesis method, suggesting that modifications to the system may actually have degraded results. It is, however, clear that both the old and new systems score poorly overall in terms of audio quality and perceived naturalness. The glitches and noise introduced by the poor performance of the pitch-shifting algorithm are thought to be a significant factor in this problem. It is believed that modifications to other areas of the system have not degraded the output, but neither have they improved it enough to offset the serious degradation caused by these errors. It is clear that much further work must be carried out on this system to improve performance to an acceptable level.
It is also noted that significant limitations of the evaluation play a factor in the results. The use of only a small number of participants, combined with a limited number of directly comparable results from the original system, severely limited the paired comparison test. More time should have been invested in regenerating old results with new datasets. However, the results clearly show that the expected progress in perceptual quality was not achieved.
7 Conclusion
Overall, the modifications made have not significantly improved the quality of the system’s output. However, implementing these modifications has highlighted the main area causing a significant level of distortion in the output. Given more time to address this issue, a re-evaluation of both the old and new systems may better reveal the difference in quality produced by the modifications designed for this project. Despite not improving audio quality in general, this project is considered a success in that it has uncovered key issues with the system that, once fixed, should benefit overall performance significantly.
References
Feugere, L., D’Alessandro, C., Delalez, S., Ardaillon, L., & Roebel, A. (2016). Evaluation of singing synthesis: Methodology and case study with concatenative and performative systems. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (pp. 1245–1249). doi:10.21437/Interspeech.2016-1248
Gong, R. (2015). pypYIN. Retrieved April 4, 2017, from https://github.com/ronggong/pypYIN
Lerch, A. (2012). An Introduction to Audio Content Analysis. doi:10.1002/9781118393550
Looperman.com. (2017). Looperman.com Acapella Samples. Retrieved April 4, 2017, from https://www.looperman.com/acapellas/genres/classical-acapellas-vocals-sounds-samples-download
Mauch, M., & Dixon, S. (2014). pYIN: A fundamental frequency estimator using probabilistic threshold distributions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 659–663).
Perry, S. (2016). Concatenator 1.0 documentation. Retrieved from http://pezz89.github.io/PySoundConcat/index.html
Perry, S. (2017). Descriptor driven concatenative synthesis tool for Python. Fields: journal of Huddersfield student research, 3(1). doi:10.5920/fields.2017.12
Schwarz, D. (2006). Concatenative Sound Synthesis: The Early Years. Journal of New Music Research, 35(1), 3–22. doi:10.1080/09298210600696857
Schwarz, D. (2007). Corpus-based concatenative synthesis. IEEE Signal Processing Magazine, 24(2), 92–104. doi:10.1109/MSP.2007.323274
University of Iowa Electronic Music Studios. (2014). University of Iowa Electronic Music Studios. Retrieved September 7, 2016, from http://theremin.music.uiowa.edu/MIS.html