Karaoke Kalculator is your one-stop shop for analyzing your singing abilities!
It looks at pitch, timing, and accuracy to determine if you’re the next Mariah Carey, or just another Joe Shmoe.
It’s perfectly suited for all of your karaoke needs, so you can find out if you’re microphone-ready before the party.
We know that singing can be a very vulnerable and intimate act, so instead of getting laughed off the stage at your local karaoke bar, quantify your singing strength against our user baseline!
For our project, we set out to quantitatively analyze how similar our renditions of Mariah Carey's "All I Want For Christmas Is You" were to the original. We first converted the mp3 files (the original song with the vocals isolated, plus each of our song attempts) into the waveform signals displayed to the left, plotted in the amplitude-time domain. From top to bottom, the plots show the original song, Nina's attempt, and Camden's attempt.
To measure the similarity between the original song and the two song attempts, we use three metrics of analysis: Timing, Vocal Quality, and Pitch. We then combine these metrics into an index for each song attempt, and compare it to the index of the original song to get a percentage similarity between the attempt and the original.
We first look at the timing of each song attempt, and see how accurately each person can sing on tempo compared to the original song.
We then look at the vocal quality/noise of the sample, calculating the signal clarity of the original song and of each song attempt, and then comparing the two.
Finally, we break up each signal into multiple frames, find the pitch of each frame, and then compare the pitch content of the original song to each song attempt.
We first look at the timing difference between the original and a song attempt, checking whether the attempt runs ahead of or behind the original song.
To do this, we use the xcorr function to find the time shift that maximizes the similarity between the original song and a song attempt.
The xcorr function computes the cross-correlation of two signals via the FFT: it takes the complex FFT of each signal, multiplies the first FFT by the complex conjugate of the second, and takes the inverse FFT of the product. The function returns two arrays: the lags (the amounts of time shift applied to the second signal) and the coefficients of similarity at each lag.
We find the lag at the index of the highest coefficient magnitude, then convert it from a number of samples to a number of seconds by dividing by the sampling rate of the audio signal.
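The lag search described above can be sketched in Python with NumPy/SciPy (the project itself used MATLAB's xcorr; the signals, sampling rate, and half-second delay below are stand-ins for illustration):

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

fs = 44100  # assumed sampling rate (Hz)

# Stand-in signals: an "original" and an "attempt" delayed by 0.5 s
rng = np.random.default_rng(0)
original = rng.standard_normal(fs * 2)
delay = fs // 2
attempt = np.concatenate([np.zeros(delay), original])[: len(original)]

# Cross-correlation, computed via the FFT like MATLAB's xcorr
coeffs = correlate(attempt, original, mode="full", method="fft")
lags = correlation_lags(len(attempt), len(original), mode="full")

# Lag at the largest coefficient magnitude, converted samples -> seconds
best_lag = lags[np.argmax(np.abs(coeffs))]
lag_seconds = best_lag / fs
print(lag_seconds)  # → 0.5: the attempt trails the original by half a second
```

A positive lag means the attempt trails the original; a negative lag means it rushes ahead.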
Once the lag time has been recorded for each song attempt, the signal is shifted by that lag so that we can observe the coefficients xcorr produces when the original song is compared to the new, shifted attempt. In the plots on the top right, each song attempt is shown in blue and its lag-shifted version in red. In the plot on the bottom right, the coefficient magnitude versus lag for Nina's song attempt is shown before the shift; as a result, the highest correlation is not centered at 0 but at some negative lag on the x-axis.
Once each signal had been shifted by its optimal lag, we analyzed the original song against itself using xcorr, so that the xcorr coefficients of each song attempt could be compared to this baseline. To quantify the amount of noise in a signal, we count the number of coefficient magnitudes above 0.4. For example, in the above coefficient-lag plot of the original song compared to itself, there were 18 coefficients with magnitudes above 0.4.
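The threshold count described above can be sketched as follows (a Python sketch rather than the project's MATLAB; the 0.4 threshold comes from the text, while the normalization and stand-in signal are our assumptions). Normalizing the coefficients so that a signal correlated with itself peaks at 1 keeps the 0.4 threshold meaningful across signals:

```python
import numpy as np
from scipy.signal import correlate

def noise_count(attempt, original, threshold=0.4):
    """Count normalized cross-correlation magnitudes above the threshold."""
    coeffs = correlate(attempt, original, mode="full", method="fft")
    # Normalize so that a signal correlated with itself peaks at exactly 1
    norm = np.sqrt(np.sum(attempt ** 2) * np.sum(original ** 2))
    coeffs = coeffs / norm
    return int(np.sum(np.abs(coeffs) > threshold))

rng = np.random.default_rng(1)
song = rng.standard_normal(44100)
# A clean signal compared with itself has one sharp peak at lag 0...
print(noise_count(song, song))  # → 1
# ...while a noisy or dissimilar attempt spreads energy across many lags,
# pushing many more coefficients above the threshold.
```

A low count therefore indicates a clean, well-matched signal, and a high count indicates noise or dissimilarity.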
Above is the xcorr graph of two songs that sound nothing alike: "Clair de Lune" and Tyler, the Creator's "NEW MAGIC WAND". From our calculations, there were 19000 coefficients with magnitude above 0.4, which establishes a lower bound for our comparison when a large amount of noise is present.
Above is the xcorr graph of Camden's song attempt compared to the original song; we calculated around 5000 coefficients with magnitude above 0.4, suggesting that Camden's attempt contains some noise, but not nearly as much as two entirely dissimilar songs.
To find pitch, we used an FFT-based algorithm that borrows from the Harmonic Product Spectrum. Our algorithm separates a signal into 255 frames and applies a Hann window to each frame to smooth the signal data and reduce FFT artifacts; the FFT is then taken of each frame. We then downsample the FFT five times, and the frequency whose amplitude peaks line up across each successive downsample is taken as the fundamental frequency of the frame. This process is demonstrated in the top graphic on the right. Once we have the fundamental frequency of each frame, we use a table converting frequency to musical pitch to find the pitch of each frame across the entire signal. A plot of the fundamental frequency values across Nina's song attempt is shown in the bottom graph on the right.
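The per-frame Harmonic Product Spectrum step might look like the following sketch (Python rather than the project's MATLAB; the Hann window and five downsamples follow the text, while the frame length, sampling rate, and test tone are illustrative assumptions):

```python
import numpy as np

def frame_pitch(frame, fs, n_downsamples=5):
    """Estimate a frame's fundamental frequency with a Harmonic Product Spectrum."""
    windowed = frame * np.hanning(len(frame))   # Hann window to smooth the frame
    spectrum = np.abs(np.fft.rfft(windowed))
    # Multiply the spectrum by downsampled copies of itself: the harmonics
    # of the true fundamental line up and reinforce the fundamental's bin.
    hps = spectrum.copy()
    for k in range(2, n_downsamples + 1):
        down = spectrum[::k]                    # keep every k-th bin
        hps[: len(down)] *= down
    peak_bin = np.argmax(hps)
    return peak_bin * fs / len(frame)           # bin index -> Hz

# Stand-in frame: a 220 Hz tone plus four harmonics
fs = 8192
t = np.arange(2048) / fs
frame = sum(a * np.sin(2 * np.pi * f * t)
            for a, f in [(1.0, 220), (0.5, 440), (0.4, 660), (0.3, 880), (0.2, 1100)])
print(frame_pitch(frame, fs))  # → 220.0
```

The winning bin is then mapped to a note name via the frequency-to-pitch conversion table mentioned above.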
On the left is example code showing how we found the fundamental frequency for each frame of Nina's song attempt. All of the fundamental frequency values are stored in the fund_freq_nina array, with each index corresponding to the frame that frequency describes; the array is later converted to pitch and compared to the pitch array of the original song.
Once we had the timing, vocal quality/noise, and pitch values for each song attempt and for the original song, we combined them in a weighted sum to produce a "correlation index" for each signal, weighting each metric by how important we felt it was to the overall similarity of an attempt to the original. We then compared the correlation index of each song attempt to the correlation index of the original song compared to itself, and expressed the comparison as a percentage. For example, Nina's weighted correlation index was 0.5145 and the original song's was 0.9998, so Nina was 100*(0.5145/0.9998) = ~51% similar to the original song!
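The final scoring step can be sketched like this (the weights and the hypothetical metric values are illustrative assumptions; only Nina's index of 0.5145 and the original's 0.9998 come from the text):

```python
def correlation_index(timing, vocal_quality, pitch, weights=(0.3, 0.3, 0.4)):
    """Weighted sum of the three similarity metrics (weights are assumed)."""
    w_t, w_v, w_p = weights
    return w_t * timing + w_v * vocal_quality + w_p * pitch

def similarity_percent(attempt_index, original_index):
    """Express an attempt's index as a percentage of the original's self-index."""
    return 100 * attempt_index / original_index

# Indices reported in the text: Nina's attempt vs. the original song
nina_index, original_index = 0.5145, 0.9998
print(round(similarity_percent(nina_index, original_index)))  # → 51
```

Since the weights sum to 1, a perfect attempt (all metrics at 1) would score an index of 1.0, and the original compared to itself sits just below that at 0.9998.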
Camden is a second-year mechanical engineering student at Olin College. He is passionate about technology policy and seeks to combine his passions for engineering and policy in his career.
Nina is a user experience engineer in her second year of college at Olin College of Engineering. She is focused on making products that help people live life to the fullest.
As a sophomore at Olin College of Engineering, Mira is a robotics enthusiast who enjoys the challenge of integrating different systems to create powerfully engineered products.