Progress Report as of 04/05/2023

Project Overview

The inspiration behind our project lies in Sensory Substitution Devices (SSDs), which aim to use one form of sensory stimulation to replicate the neurological responses typical of another. Through sensory substitution systems such as Braille, individuals with sensory disabilities are able to gain more information about the world they live in. To cite one example in particular, a visual-to-auditory SSD is one that attempts to substitute an individual's vision with auditory signals.

Our project in particular aims to sonify the visual appearance of text, providing another method for individuals who lose their sight during their lives to interpret written material. To do this, we synthesize musical, auditory representations of the visual shapes of individual letters and words.

We do not claim to have replicated the neurological responses created by the sense being substituted; given that true SSDs aim to reproduce the neural signals generated by the lost sense, our project cannot be classified as a true SSD. However, we hope that our text sonification device could be a feasible stepping stone towards a true visually-based SSD for written text or shapes.

What We Have Done & Our Challenges

With what we have developed so far, we can sonify letters, words, and written syntax accurately; however, we cannot yet sonify complete sentences, and we are currently working towards that. In MATLAB, we have also determined how to modify the tempo at which a MIDI file is interpreted, and we would like to be able to do the same in Python (one possible approach is sketched at the end of this section).

We are also running into some trouble integrating our overall system. Since part of our design is written in Python and part in MATLAB, moving into one common codebase has been challenging. To tackle this, we have looked into online tools that assist with MATLAB-to-Python code conversion and vice versa; while we cannot rely on these tools alone, they are a good starting point.

Another challenge we have noticed involves our MIDI files: in some instances where there should be silence, we instead get a very low note. We are not completely sure why this is happening, but we are considering a filter to remove these spurious low notes.
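As a starting point for the Python side, the sketch below shows one possible way to rescale the tempo of an existing MIDI file using the mido library. The library choice, the helper name rescale_tempo, and the 2x speed-up are illustrative assumptions rather than part of our current pipeline.

```python
# Possible Python counterpart to our MATLAB tempo adjustment (illustrative sketch).
# Rescales every tempo event in a MIDI file; if the file has no explicit tempo
# event, one is inserted relative to the 120 BPM MIDI default.
import mido

def rescale_tempo(in_path, out_path, speed=2.0):
    mid = mido.MidiFile(in_path)
    found = False
    for track in mid.tracks:
        for msg in track:
            if msg.type == "set_tempo":
                msg.tempo = int(msg.tempo / speed)  # fewer microseconds per beat = faster
                found = True
    if not found:
        new_tempo = mido.bpm2tempo(120 * speed)     # e.g. 240 BPM for speed=2.0
        mid.tracks[0].insert(0, mido.MetaMessage("set_tempo", tempo=new_tempo))
    mid.save(out_path)

rescale_tempo("hello_world.mid", "hello_world_2x.mid", speed=2.0)  # hypothetical file names
```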

Our Plan

Over the next few weeks, we have several tasks that we would like to tackle. The first is to add selective amplification and to change the timbres used by our algorithm. The next is to implement a processing algorithm on our website so that visitors can hear a sonification of the text they input. While we do not know the exact feasibility of that task, we still need to speed up the turnaround time of our sonification process so that we can demo it live during our presentation. We are somewhat concerned about how long the sentiment classification will take, so we are still considering it. Our last task is to train ourselves to understand the audio output faster and with greater accuracy. We believe these tasks are achievable within the next three weeks.

What We Have Learned

One thing we have learned while working with our data is how to speed up MIDI file interpretation in MATLAB. This is relevant to what we are doing because it allows us to produce audio files at different "playback speeds" for training listeners. By speeding up MIDI file interpretation, we could eventually allow a trained listener to translate the MIDI audio back into text in real time.

We also learned more about which characteristics of a sound are significant in the context of audio analysis. For example, a spectrogram does not tell the whole story for music analysis, since several different frequencies can correspond to the same musical note. This is where the chromagram comes in: it shows that those different frequencies are in fact the same note, and it lets chord patterns emerge. This is relevant to our project because we hope to be a stepping stone towards an SSD, and a clear understanding of how to identify, analyze, and characterize known musical features is critical to improving how effectively our tool communicates a message.

Data Plots

Our work thus far has been centered around establishing a proof of concept for our text sonification system, which sonifies the phrase “Hello World!” 

ASCII Representation

The first step we took was to quantitatively represent the shape of each individual letter. To do this, we generated an ASCII representation of each individual letter in the form of a pandas dataframe using the Python pyfiglet library. Since we planned on parsing each representation from left to right, two "silent" columns were appended on the right of the ASCII representation to provide temporal separation between adjacent characters during interpretation. Raw data from this step is shown below, with the letter "H" represented by the dataframe as an example.
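For illustration, the sketch below shows one way such a dataframe could be built with pyfiglet and pandas. The 0/1 pixel encoding, the helper name letter_to_dataframe, and the exact padding scheme are assumptions made for this example.

```python
# Illustrative sketch: render a character as ASCII art with pyfiglet, convert it to
# a binary "pixel" grid in a pandas dataframe, and append two silent (all-zero)
# columns on the right to separate adjacent characters in time.
import pyfiglet
import pandas as pd

def letter_to_dataframe(letter: str) -> pd.DataFrame:
    art = pyfiglet.figlet_format(letter)                  # multi-line ASCII rendering
    rows = art.rstrip("\n").split("\n")
    width = max(len(r) for r in rows)
    grid = [[0 if ch == " " else 1 for ch in r.ljust(width)] for r in rows]
    df = pd.DataFrame(grid)
    df[width] = 0                                         # first "silent" column
    df[width + 1] = 0                                     # second "silent" column
    return df

print(letter_to_dataframe("H"))
```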

General Array

Next, we sonify each letter. After selecting an input chord, we assign a note to each row index of an array, ascending from the chord's root at the bottom row. For example, if the input chord is Cmaj := (C4, E4, G4), then the dataframe used in the example above corresponds to the following array.
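A minimal sketch of this row-to-note assignment is shown below, under the assumption that the chord tones repeat upward by octaves above the root; the helper name rows_to_notes is hypothetical.

```python
# Illustrative sketch: assign a pitch to each row of the array, ascending from the
# chord's root at the bottom row and wrapping to the next octave after each pass
# through the chord tones.
from music21 import pitch

def rows_to_notes(chord_tones, n_rows):
    notes = []
    for i in range(n_rows):
        p = pitch.Pitch(chord_tones[i % len(chord_tones)])
        p.octave += i // len(chord_tones)     # move up an octave each time the chord repeats
        notes.append(p.nameWithOctave)
    return list(reversed(notes))              # index 0 = top row of the dataframe

print(rows_to_notes(["C4", "E4", "G4"], 6))   # ['G5', 'E5', 'C5', 'G4', 'E4', 'C4']
```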

The Letter 'H'

After overlaying the dataframe onto the array, we obtain a musical representation of each letter. Continuing the example from above, the figure below shows the representation of the letter "H".

From this point onwards, we proceed to create chord objects using the music21 Python library to represent each column of the array. The chord objects are then concatenated sequentially from left to right to create a musical representation of each individual character. After concatenating the representations of the individual characters, the final dataset was exported in the MIDI file format. It is worth noting that the default MIDI tempo of 500,000 microseconds per quarter note (120 BPM) was used; we chose not to specify an alternate tempo. From the MIDI file, the following plot was generated using concatenated outputs from the music21 library's HorizontalBarPitchSpaceOffset method.
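A minimal sketch of this column-to-chord step is shown below, assuming a fixed quarter-note subdivision per column; the duration value and the use of a Rest for silent columns are illustrative assumptions rather than the exact original pipeline.

```python
# Illustrative sketch: build a music21 Chord for every column of the letter array
# and export the resulting sequence as a MIDI file at the default tempo (120 BPM).
from music21 import stream, chord, note

def columns_to_midi(df, row_notes, out_path="letter.mid"):
    s = stream.Stream()
    for col in df.columns:
        active = [row_notes[r] for r in df.index if df.at[r, col] == 1]
        if active:
            s.append(chord.Chord(active, quarterLength=0.25))
        else:
            s.append(note.Rest(quarterLength=0.25))  # silent column
    s.write("midi", fp=out_path)                     # no explicit tempo is specified
    return out_path
```

In this sketch the appended silent columns become rests, which is one way to avoid the spurious low notes discussed below.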

MIDI Representation

Notably, silence is undesirably represented by extraordinarily low-frequency notes. To fix this issue, a high-pass filter will need to be implemented on the MIDI output, either on the MIDI creation side or the MIDI interpretation side.
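A minimal sketch of what the MIDI-side clean-up could look like is shown below, assuming music21 is used to parse and rewrite the file; the cutoff pitch (MIDI note 36, roughly C2) is an arbitrary placeholder.

```python
# Illustrative sketch: drop every note below a cutoff pitch from a MIDI file
# before synthesis, acting as a crude "high-pass filter" on the note data.
from music21 import converter, note, chord

def strip_low_notes(midi_path, cutoff_midi=36, out_path="filtered.mid"):
    s = converter.parse(midi_path)
    for el in list(s.recurse().notes):               # notes and chords
        if isinstance(el, note.Note) and el.pitch.midi < cutoff_midi:
            el.activeSite.remove(el)
        elif isinstance(el, chord.Chord):
            kept = [p for p in el.pitches if p.midi >= cutoff_midi]
            if kept:
                el.pitches = tuple(kept)             # keep only pitches above the cutoff
            else:
                el.activeSite.remove(el)
    s.write("midi", fp=out_path)
    return out_path
```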


Then, we synthesize an audio file from the MIDI file. Initially, a crude wavetable synthesizer was implemented in Python using the Pyo library to turn the MIDI signal into audio, but we have recently also implemented an FM synthesizer in MATLAB.
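To illustrate the FM principle (this is a conceptual sketch, not our MATLAB implementation), the snippet below synthesizes a single note by phase-modulating a carrier at the note frequency; the modulation ratio and index values are arbitrary.

```python
# Conceptual FM synthesis sketch: y(t) = sin(2*pi*fc*t + I*sin(2*pi*fm*t)),
# where fc is the carrier (note) frequency and fm = ratio * fc is the modulator.
import numpy as np

def fm_note(freq, dur=0.25, sr=44100, ratio=2.0, index=3.0):
    t = np.arange(int(dur * sr)) / sr
    modulator = np.sin(2 * np.pi * ratio * freq * t)
    return np.sin(2 * np.pi * freq * t + index * modulator)

# Example: C4, E4, G4 played in sequence (261.63 Hz, 329.63 Hz, 392.00 Hz).
audio = np.concatenate([fm_note(f) for f in (261.63, 329.63, 392.00)])
```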


To analyze our output, the following Mel-frequency spectrogram was generated from the initial wavetable-synthesized audio file.

Spectrogram

Since musical notes are logarithmically spaced in frequency (for example, the frequency of C5 is twice that of C4, which is twice that of C3), it is not surprising that the spectrogram represents the sonified text in a rough, logarithmically-spaced repeating pattern.
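For reference, a Mel-frequency spectrogram like the one above can be produced along the following lines; the use of librosa for this particular plot, the input filename, and the default parameters are assumptions for illustration.

```python
# Illustrative sketch: compute and display a Mel-frequency spectrogram of the
# synthesized audio with librosa.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("hello_world.wav")              # hypothetical synthesized output
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_db = librosa.power_to_db(S, ref=np.max)

fig, ax = plt.subplots()
img = librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
fig.colorbar(img, ax=ax, format="%+2.0f dB")
ax.set_title("Mel-frequency spectrogram of the sonified text")
plt.show()
```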


While a regular spectrogram gives us a good idea of the melodic contours of the output audio, such features are, in a Schenkerian sense, ‘surface level’ aspects of musical (and linguistic) structure. It is not clear how the different notes above relate to each other in terms of musical (chordal) syntax. As such, our analysis includes a chromagram generated using the Python library librosa, which acts as a “specialized spectrogram” in which all frequency content belonging to the same pitch class is folded together, regardless of octave.

The result is a plot of pitch class content versus time, which is analogous to the spectrogram’s plot of frequency content versus time.
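Since librosa is named for this step, a minimal sketch of the chromagram computation might look as follows; the specific chroma variant (chroma_stft) and the input filename are assumptions.

```python
# Illustrative sketch: fold the spectrum into 12 pitch classes per frame and
# display the resulting chromagram.
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("hello_world.wav")              # hypothetical synthesized output
chroma = librosa.feature.chroma_stft(y=y, sr=sr)     # shape: (12 pitch classes, frames)

fig, ax = plt.subplots()
img = librosa.display.specshow(chroma, sr=sr, x_axis="time", y_axis="chroma", ax=ax)
fig.colorbar(img, ax=ax)
ax.set_title("Chromagram of the sonified text")
plt.show()
```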

Chromagram

From this plot, it is clear that we chose to sonify the first word (“Hello”) with a Gmaj7 chord and the second word (“World!”) with a Cmaj chord, as pitch classes G, B, D, and F# light up in the first half of the figure, while pitch classes C, E, and G light up in the second half. By examining our audio output with this tool rather than a plain frequency analysis, the musical implications of our outputs can be identified exceptionally clearly.


We have also included the Tonnetz figure in case any readers are familiar with how it works. While we are familiar with neo-Riemannian music theory and Euler’s Tonnetz, it is not clear to us from librosa’s documentation how this particular figure is generated.