2 - I AM DATA

Introduction

We can reduce anything in the world to data. The number of cars in the street: data. The music you like to listen to: data. How often and with whom you interact: data. The likelihood of you having a certain thought, you guessed it: data. But we data scientists rarely have the perfect data readily available to us. Quite often, we have to rely on limited datasets, or datasets that contain lots of holes. The world can also behave in weird and unpredictable ways. Every day, events occur that no one has ever seen before. We call this invention, or creativity. What would the world look like if we tried to recreate it solely from the limited data it provides us with? Could it still be as magnificent and mysterious as we know it? And what would I be like if I recreated myself solely from data, available for others to manipulate? The project I AM DATA explores the limitations of data science. With only a small set of information about how 364 words in the English language are pronounced, plus video material of my own body, a 3D-adaptation of myself was created. You can make this digital twin, called DATA, say anything you like - but beware! DATA does not always behave the way you expect!

Reasoning behind the project

For a few months now, I have felt stuck as a Media Technology student. I have asked myself whether what I learnt in this program was in line with what I actually wanted to learn. For the most part, the answer was yes. I love being challenged to become a creative mind, I love doing research and I love programming. What I do not like is taking the easy way. I do not want to graduate from Media Tech having only paddled in the shallow end of harder and deeper themes and skills; I want to dive into them. But every time I choose the hard way over the easy way and create a (maybe overly) difficult project, teachers tend to put on the brakes. Quite logical: it is not always necessary to take the difficult route, and it can be quite risky, for example time-wise. Yet when I graduate, I want to be able to implement difficult algorithms into projects if needed. This is why I spoke to our study advisor, Barbara, who advised me to stick with this approach, but to explain it when handing in projects. That is why I included this paragraph.

Regarding the work itself: I love Artificial Intelligence and especially statistics. Statistics allow us to make sense of the chaos of the world; they can reveal patterns in behaviour and in nature. But in the last few years, I have gotten the feeling that many less tech-savvy people blindly assume that a product will work better if it contains data science. I wanted to show that this is not necessarily the case, by presenting a way in which lots of difficult data science does NOT always yield the best or, for us humans, most logical results. That is why, in the video presenting the project, I included the text "DATA is caricature. DATA is inevitable.": it refers to the 3D-model as a caricature, but also to data science in general being a caricature of reality, and to the inevitability of misrepresented information when data science is involved.

-----

High techy-techy part (here for the people who are highly interested)

The linguistic model implements the Viterbi algorithm on a Hidden Markov Model. This algorithm calculates the most probable sequence of states that correlates with a series of observed events, based on an observation-state database. In this case, the observations took the shape of observed letters, and the states were International Phonetic Alphabet (IPA) notations representing the way in which those letters should be pronounced. For the word "data", for example, the observed letters d-a-t-a should ideally map to the state sequence /d/, /eɪ/, /t/, /ə/.

Viterbi Algorithm, quickly borrowed from Wikipedia
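
For readers without the image at hand, the two cases are, roughly in Wikipedia's notation (k is a state, y_t the observation at step t, π_k the probability of starting in state k, a_{x,k} the transition probability from state x to state k, and P(y_t | k) the emission probability):

    V_{1,k} = P(y_1 | k) * π_k
    V_{t,k} = P(y_t | k) * max_x ( a_{x,k} * V_{t-1,x} )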

In the formula above, two cases are shown. The first case is the base case: the chance that a certain state is the first state of a word, multiplied by the chance that this state causes the first observed letter. The second case takes a recursive approach. Firstly, we have the previously calculated probability (the first time V_{1,k}, after that the result of each iteration, V_{t,k}). Secondly, we have the transition probability: the chance that the previously selected state is followed by the new state. Finally, we have the emission probability: the chance that the observation is caused by a state. All of these probabilities are multiplied; the state yielding the maximum probability is then chosen and used in the next iteration. In the last iteration, this probability is also multiplied by the chance that a state is the final state of a set (word). It is, for example, quite unlikely for a word to end in /b/, which is the IPA sound most likely caused when pronouncing the letter b.

The Viterbi algorithm makes sure that, at each step, only the states with the highest probabilities are carried forward; all other probabilities are discarded. If we did not use this method, a string with 10 observations (letter combinations) and our 53 IPA symbols would yield 53^10 ≈ 1.75 × 10^17 possible pronunciations that would have to be compared. If you are interested in the Viterbi algorithm, you can find a quite mathematical explanation here and what can be described as a "Viterbi for Dummies" here.
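
To make this concrete, here is a minimal Python sketch of that forward pass. The probability tables (states, start_p, trans_p, emit_p) are toy values I made up for illustration, not the project's real tables, which are estimated from the 364-word training set; the end-of-word probability mentioned above is also left out for brevity. The path itself is recovered from the backpointer matrix in the sketch further below.

    # Minimal Viterbi forward pass on a toy HMM: observations are letters,
    # states are (simplified) IPA symbols. All numbers below are made up.

    states = ["b", "i", "ɛ"]
    start_p = {"b": 0.8, "i": 0.1, "ɛ": 0.1}        # chance a word starts in this state
    trans_p = {                                     # chance state_A is followed by state_B
        "b": {"b": 0.05, "i": 0.55, "ɛ": 0.4},
        "i": {"b": 0.3, "i": 0.1, "ɛ": 0.6},
        "ɛ": {"b": 0.4, "i": 0.3, "ɛ": 0.3},
    }
    emit_p = {                                      # chance a state causes an observed letter
        "b": {"b": 0.9, "e": 0.05, "i": 0.05},
        "i": {"b": 0.02, "e": 0.58, "i": 0.4},
        "ɛ": {"b": 0.02, "e": 0.68, "i": 0.3},
    }

    def viterbi_forward(observations):
        # trellis[t][k]     = probability of the best path ending in state k at step t
        # backpointer[t][k] = the state at step t-1 on that best path
        trellis = [{k: start_p[k] * emit_p[k][observations[0]] for k in states}]
        backpointer = [{k: None for k in states}]
        for t in range(1, len(observations)):
            trellis.append({})
            backpointer.append({})
            for k in states:
                # previous probability * transition probability, maximised over all
                # previous states x, then multiplied by the emission probability
                best_prev = max(states, key=lambda x: trellis[t - 1][x] * trans_p[x][k])
                trellis[t][k] = (trellis[t - 1][best_prev] * trans_p[best_prev][k]
                                 * emit_p[k][observations[t]])
                backpointer[t][k] = best_prev
        return trellis, backpointer

    trellis, backpointer = viterbi_forward(["b", "e"])
    best_last = max(states, key=lambda k: trellis[-1][k])
    print(best_last, round(trellis[-1][best_last], 4))   # -> i 0.2297 for the letters "b", "e"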

All data is smoothed. This means that every possible start, finish, emission and transition is counted 0.1 times in the set, even if it never occurs in the training data. This is done to take into account data that is not in the data set. It is somewhat of a risk though, since impossible combinations now also get a probability > 0. I tweaked the smoothing a little more to give the output somewhat of a fighting chance with our small data set. Since the number of IPA characters that can be caused by a given observation is small, it is likely that all of these emissions are already in our training set; the emission smoothing is therefore reduced. The number of possible transitions state_A -> state_B, however, is 53^2 = 2809. This includes all impossible transitions, such as /ŋ ŋ/ (eng eng), which would read as "ngng". Yet it is very unlikely that all possible transitions are included in the training set; the transition smoothing is therefore increased.
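
As an illustration, here is a sketch of this kind of add-k smoothing with a smaller k for emissions and a larger k for transitions. The counts and the exact k values (apart from the 0.1 baseline) are made up for illustration and are not the project's real numbers.

    # Add-k smoothing sketch: every possible outcome starts with k pseudo-counts,
    # so combinations missing from the small training set still get P > 0.

    def smoothed_probs(counts, all_outcomes, k=0.1):
        """Turn raw counts into probabilities with k pseudo-counts per outcome."""
        total = sum(counts.get(o, 0) for o in all_outcomes) + k * len(all_outcomes)
        return {o: (counts.get(o, 0) + k) / total for o in all_outcomes}

    ipa_states = ["b", "i", "ɛ", "ŋ"]               # tiny stand-in for the 53 IPA symbols
    letters = ["b", "e", "i", "n", "g"]

    # Raw counts as they might come out of a small training set (made up).
    emission_counts = {
        "b": {"b": 40},
        "i": {"e": 25, "i": 10},
        "ɛ": {"e": 30},
        "ŋ": {"n": 5, "g": 5},
    }
    transition_counts = {
        "b": {"i": 20, "ɛ": 15},
        "i": {"ŋ": 5},
        "ɛ": {"ŋ": 10},
        "ŋ": {"b": 8},
    }

    # Few possible emissions per state -> most are already observed -> small k.
    # 53^2 possible transitions -> most are unobserved -> larger k.
    emit_p = {s: smoothed_probs(emission_counts[s], letters, k=0.01) for s in ipa_states}
    trans_p = {s: smoothed_probs(transition_counts[s], ipa_states, k=0.5) for s in ipa_states}

    print(round(trans_p["ŋ"]["ŋ"], 3))   # the "impossible" /ŋ ŋ/ transition: small, but non-zero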

Once the probabilities have been calculated, I use the backtracking matrix that was filled along the way, together with the trellis of probabilities. This matrix stores, for every cell in the trellis, which edge (that is, which previous state) produced the probability we are calculating with. By backtracking through this matrix, the optimal state sequence can be found.
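
Here is a minimal sketch of that backtracking step, using a hand-written backpointer matrix in the same list-of-dictionaries shape as the sketch above; the project's actual data structures may differ.

    # Backtracking sketch: walking the backpointer matrix backwards from the
    # best final state recovers the optimal state sequence.

    def backtrack(backpointer, best_last_state):
        path = [best_last_state]
        for t in range(len(backpointer) - 1, 0, -1):
            path.insert(0, backpointer[t][path[0]])   # prepend the predecessor state
        return path

    # Hand-written example for a 3-observation word whose best final state is /ɛ/:
    backpointer = [
        {"b": None, "i": None, "ɛ": None},            # step 0 has no predecessor
        {"b": "b", "i": "b", "ɛ": "b"},               # every step-1 cell came from /b/
        {"b": "i", "i": "ɛ", "ɛ": "i"},               # the step-2 cell /ɛ/ came from /i/
    ]
    print(backtrack(backpointer, "ɛ"))                # -> ['b', 'i', 'ɛ']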

The calculated sequence is described in IPA. IPA can be pronounced by (older) versions of macOS's say command. A short Python script was therefore included to translate the program's IPA notation into macOS's IPA notation. The resulting string is then written to an external file, which can be read in C++.
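
A hypothetical sketch of that translate-and-export step follows; the symbol mapping and the output filename are made up for illustration and are not the project's actual table.

    # Hypothetical sketch of the IPA conversion/export step. The mapping below
    # is NOT the real macOS notation; the real table lives in the project folder.

    MACOS_IPA = {
        "ɛ": "E",      # illustrative placeholder translations
        "ŋ": "N",
    }

    def to_macos_ipa(symbols):
        return "".join(MACOS_IPA.get(s, s) for s in symbols)

    def export_for_cpp(symbols, path="pronunciation.txt"):
        # The C++ program later reads this file before calling the say command.
        with open(path, "w", encoding="utf-8") as f:
            f.write(to_macos_ipa(symbols))

    export_for_cpp(["b", "i", "ŋ"])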

The C++ program has three components: the initial text input, the frame-by-frame animation and the say command. The initial text input asks the user for an input string, which is sent to the Python program. That program runs the Viterbi algorithm, translates the output into macOS's IPA notation and saves it to a text file. When the user presses the space bar, speech synthesis is started: the program sets everything up for the animation and calls the say command. For the animation, all vowels are loaded into a vector (finding a data structure that is variable in size was somewhat of a challenge). The frame-by-frame animation then loops over this vector, loads the frame images related to each vowel into open spaces in an array, and displays these images. The display takes the shape of a mesh; to create depth, the brightness of the pixels is mapped onto the z-axis of the 3D framework.
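
To illustrate the depth trick, here is a small sketch of the brightness-to-z mapping. The real implementation is in C++; this Python version only shows the idea on a made-up grayscale frame, and the depth range and direction are assumptions.

    # Brightness-to-depth sketch: each pixel of a grayscale frame becomes a
    # mesh vertex whose z coordinate depends on its brightness.

    frame = [                                   # toy 4x4 grayscale frame, values 0-255
        [0,   64, 128, 255],
        [32,  96, 160, 224],
        [16,  80, 144, 208],
        [8,   72, 136, 200],
    ]

    MAX_DEPTH = 100.0                           # assumed depth range of the 3D framework

    vertices = []
    for y, row in enumerate(frame):
        for x, brightness in enumerate(row):
            z = brightness / 255.0 * MAX_DEPTH  # brighter pixel -> further along z (direction is a guess)
            vertices.append((x, y, z))

    print(vertices[:4])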

All code and material can be found in this folder.