The Distributed Little Red Hen Lab is an international consortium for research on multimodal communication. We develop open-source tools for joint parsing of text, audio/speech, and video, using datasets of various sorts, most centrally a very large dataset of international television news called the UCLA Library Broadcast NewsScape.
Faculty, staff, and students at several universities around the world contribute to Red Hen Lab, including UCLA, FAU Erlangen, Case Western Reserve University, Centro Federal de Educação Tecnológica in Rio de Janeiro, Oxford University, Universidad de Navarra, University of Southern Denmark, and more. Red Hen uses 100% open source software. In fact, not just the software but everything else—including recording nodes—is shared in the consortium.
Eight students worked in or with Red Hen Summer of Code 2015, seven of them on audio detection. Five were supported by Google (three from Red Hen and two from another GSoC-managed organization) and the other three through other Red Hen funding, all on effectively identical terms. Red Hen has a small army of mentors for these students and maintains a wiki for coordination. The eight students and their projects are described below.
Among other tools, we use CCExtractor, ffmpeg, and OpenCV (they have all been part of GSoC in the past).
WHO USES RED HEN'S INFRASTRUCTURE?
Section 108 of the U.S. Copyright Act permits Red Hen, as a library archive, to record news broadcasts from all over the world and to loan recorded materials to researchers engaged in projects monitored by the Red Hen directors and senior personnel. Section 108 restrictions apply only to the corpus of recordings, not to the software. Because everything we do is open source, anyone can replicate our infrastructure.
WHAT ARE WE WORKING WITH?
The Red Hen archive is a huge repository of recordings of TV programming, processed in a range of ways to produce derived products useful for research, expanded daily, and supplemented by various sets of other recordings. Our challenge is to create tools that let us access the audio, visual, and textual (closed-captioning) information in the corpus in various ways, by building the ability to search, parse, and analyze the video files. Because the archive is so large, however, building processes that can scan the entire dataset is time-consuming and prone to error.
The stats as of 2015-11-08 are:
Our ideas page for GSoC 2015 challenged students to assist in a number of projects, all of which have successfully improved our ability to search, parse, and extract information from the archive.
WHAT HAVE WE ACCOMPLISHED?
Google Summer of Code 2015 Projects:
Red Hen Summer of Code 2015 Projects:
EKATERINA AGEEVA - MULTIWORD EXPRESSION TOOLKIT
References: MWEtoolkit / Interface Demo / Initial Proposal
This project aims at facilitating a specific corpus annotation task, namely, tagging of multiword expressions. Such expressions include, for example, English phrasal verbs (look forward to, stand up for) or other expressions with a fixed structure (e.g. unit of time + manner-of-motion verb: time flies, the years roll slowly).
The goal of the project is to develop an integrated language-agnostic pipeline from user input of multiword expressions to a fully annotated corpus. The annotation is performed by an existing tool, the mwetoolkit. In the course of the project the following components were developed: utility scripts that perform input and output conversion, a backend that communicates with the mwetoolkit, and a frontend that allows users to customize their tagging task while minimizing the amount of interaction with the tagger. As a result, the multiword expression tagging task is automated to the extent possible, and more corpora can benefit from an additional level of annotation.
Ekaterina Ageeva worked with the multiword expression toolkit (mwetoolkit), a tool for detecting multi-word units (e.g. phrasal verbs or idiomatic expressions) in large corpora. The toolkit operates via a command-line interface. To ease access and expand the toolkit's audience, Ekaterina developed a web-based interface, which builds on and extends the toolkit functionality.
The interface allows users to do the following:
- Create an account (used to store search history and manage corpus permissions)
- Upload, manage, and share corpora
The interface is built with Python/Django. It currently supports operations on corpora tagged with the Stanford CoreNLP parser, with the possibility of extending to other formats supported by the MWEtoolkit. The system uses part-of-speech and syntactic dependency information to find the expressions. Users may rely on various frequency metrics to obtain the most relevant search results.
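As a rough illustration of the kind of pattern matching involved (a toy sketch, not the mwetoolkit API; the POS tags follow the Penn Treebank convention used by CoreNLP), a fixed POS-sequence pattern can be matched over a tagged sentence:

```python
# Toy sketch of POS-pattern matching (not the mwetoolkit API): find spans
# whose part-of-speech sequence matches a fixed pattern exactly.
def find_candidates(tagged_tokens, pattern):
    n = len(pattern)
    hits = []
    for i in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[i:i + n]
        if [pos for _, pos in window] == pattern:
            hits.append(" ".join(word for word, _ in window))
    return hits

# "stood up for", a phrasal verb, as verb + particle + preposition
sentence = [("she", "PRP"), ("stood", "VBD"), ("up", "RP"),
            ("for", "IN"), ("her", "PRP$"), ("friend", "NN")]
print(find_candidates(sentence, ["VBD", "RP", "IN"]))  # ['stood up for']
```

The real toolkit adds lemma, dependency, and frequency constraints on top of this basic idea.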
KARAN SINGLA - SPEAKER DIARIZATION
References: GitHub / Blog / Initial Proposal
According to UC Berkeley's ICSI Lab, speaker diarization consists of segmenting and clustering a speech recording into speaker-homogeneous regions, using an unsupervised algorithm. Given an audio track of a group in conversation, a speaker-diarization system will (1) identify speaker turns, the points at which one speaker stops speaking and another begins; (2) identify the number of speakers in the audio track; and (3) label those speakers as A, B, C, etc., and recognize them throughout the conversation. This requires both speech and non-speech detection, as well as overlap detection and resolution.
(Image Reference: http://multimedia.icsi.berkeley.edu/speaker-diarization)
Karan Singla built a model using pyCASP that identifies speaker turns in the audio content of the broadcast speech corpus. Karan's work has shown 65% accuracy on the entire NewsScape. The project originally embarked on this task using LIUM, but by the end of the project found better accuracy with pyCASP.
His method and results are as follows:
1. Data Pre-Processing
2. Differentiating Speech & Non-Speech
3. Audio Segmentation
4. Hierarchical Speaker Clustering
Single-show: Once we have segments that mark speaker changes, these segments are hierarchically clustered so that segments spoken by the same speaker are merged under the same label.
Cross-show: Once we have speaker segments for one show (news session), speakers are again hierarchically clustered to merge speakers who recur across various shows on the same network.
Example: if various news sessions are covered by the famous anchor Natalie Allen, she will be recognized as Natalie Allen across all shows.
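The merging step can be sketched as a simple agglomerative procedure (a toy stand-in for pyCASP's clustering; the two-dimensional "embeddings" are hypothetical placeholders for real speaker features such as GMM statistics):

```python
import numpy as np

# Toy sketch of agglomerative speaker clustering: repeatedly merge the two
# closest clusters until no pair is closer than `threshold`.
def agglomerate(embeddings, threshold):
    clusters = [[i] for i in range(len(embeddings))]

    def centroid(c):
        return np.mean([embeddings[i] for i in c], axis=0)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(centroid(clusters[a]) - centroid(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Three segments: two close together (same speaker), one far away.
segs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
print(agglomerate(segs, threshold=1.0))  # [[0, 1], [2]]
```

The same procedure applies at the cross-show level, with one embedding per per-show speaker instead of per segment.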
Why must this be hierarchical?
Weeks 1 - 2, Work Results:
1. Extract audio data from ".mp4" to ".wav" using ffmpeg
*Sample LIUM output, using the pre-processed audio as input.
Weeks 3 - 4, Work Results:
*ALERT: TPT files have inconsistencies, so it may be better to stick to the NIST TV News Diarization dataset.
Observations Karan made on the LIUM output:
1. LIUM consistently recognizes more speakers
THE BIG QUESTION: Is LIUM producing false positives?
So what can be done?
1. We would need data for every speaker (e.g. training a GMM model for every speaker). However, this is not possible.
Use "spk_det.sh" in the "ALIZE_spk_det" folder to generate output for an .mp4 file.
Making Ground-Truth Files
Currently all types of noise (music/commercials) are part of the cluster IDs. I removed the cluster IDs covering segments that fall within the duration of "♪" markers in the ".tpt" files, which somewhat reduced the number of cluster IDs for each show (see the results doc on the LIUM output, which shows the number of cluster IDs obtained for each file). But the input files contain many other noises that are not really speakers and are not discarded using the TPT files.
We examined a ".tpt" file corresponding to an audio file and observed the following:
1. The "NER01" tag can give you the speaker boundaries
Therefore, Karan wrote a script called "tpt2rttm.sh" (found in the scripts folder) to convert .tpt files to .rttm format so that they can be treated as ground truth for evaluation purposes.
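For reference, an RTTM speaker record is a single line of ten whitespace-separated fields. A minimal sketch of the conversion idea (not tpt2rttm.sh itself; the speaker turns below are invented):

```python
# Illustrative sketch: turning (speaker, start, end) caption turns into
# NIST RTTM "SPEAKER" records for diarization scoring.
def turns_to_rttm(file_id, turns):
    lines = []
    for speaker, start, end in turns:
        duration = end - start
        lines.append(
            f"SPEAKER {file_id} 1 {start:.2f} {duration:.2f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return "\n".join(lines)

turns = [("Natalie-Allen", 0.0, 12.5), ("Guest-1", 12.5, 30.0)]
print(turns_to_rttm("2014-01-01_0000_US_CNN", turns))
```

Each line gives the channel (1), the onset and duration in seconds, and the speaker label; the `<NA>` fields are unused placeholders in the SPEAKER record type.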
TOTAL SPK = 53 (original 43, as shown by the TPT file)
EVAL TIME = 3572.00 secs
MISSED SPEECH = 13.14 secs (1.5 percent of scored time)
SCORED SPEAKER TIME = 810.60 secs (100.0 percent of scored speech)
OVERALL SPEAKER DIARIZATION ERROR = 67.81 percent (ALL)
There can be many possible reasons for the bad results:
1. In the TPT files there is a delay for each speaker turn, which can lead to missed speech and, in particular, cases of false alarm.
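For context, the diarization error reported above combines missed speech, false alarms, and speaker confusion, normalized by the scored speech time. A frame-level sketch (note: real scorers such as NIST md-eval also find the optimal speaker mapping and apply a forgiveness collar, both omitted here):

```python
# Sketch of frame-based diarization error rate (DER).
# Labels are per-frame speaker IDs; None marks silence.
def der(reference, hypothesis):
    assert len(reference) == len(hypothesis)
    scored = sum(1 for r in reference if r is not None)
    missed = sum(1 for r, h in zip(reference, hypothesis)
                 if r is not None and h is None)
    false_alarm = sum(1 for r, h in zip(reference, hypothesis)
                      if r is None and h is not None)
    confusion = sum(1 for r, h in zip(reference, hypothesis)
                    if r is not None and h is not None and r != h)
    return (missed + false_alarm + confusion) / scored

ref = ["A", "A", "A", "B", "B", None, None, "B"]
hyp = ["A", "A", "B", "B", None, None, "A", "B"]
print(round(der(ref, hyp), 3))  # 0.5
```

Here one confused frame, one missed frame, and one false-alarm frame against six scored frames give a DER of 50%.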
Diarization using pyCASP
PyCASP is designed so that it can run with CUDA in the backend, which allows it to exploit GPUs and parallel programming to make the system fast.
Diarization using PyCASP vs LIUM
Evaluation with LIUM on file 2014-01-01_0000_US_CNN_Erin_Burnett_Out_Front.mp4 (time taken: 1 hour 52 minutes):
EVAL TIME = 3572.00 secs
MISSED SPEECH = 13.14 secs (1.5 percent of scored time)
SCORED SPEAKER TIME = 810.60 secs (100.0 percent of scored speech)
OVERALL SPEAKER DIARIZATION ERROR = 67.81 percent (ALL)
Evaluation on the same file using pyCASP (time taken: 7 minutes 32 seconds):
EVAL TIME = 3571.72 secs
MISSED SPEECH = 0.39 secs (0.0 percent of scored time)
SCORED SPEAKER TIME = 5205.92 secs (150.1 percent of scored speech)
OVERALL SPEAKER DIARIZATION ERROR = 54.34 percent (ALL)
But here pyCASP recognized just two speakers; if you increase the "initial_clusters" parameter in "03_diarizer_cfg.py", the results are quite good.
200 initial clusters: 14 speakers: 38.24% diarization error: 3 hours 14 min
Karan experimented with increasing the initial number of clusters to check its effect on the diarization output. It was seen that high-quality speaker diarization can be done using pyCASP, but the time complexity increases many-fold.
The easy-scripts can also be downloaded from the GitHub repo (pycasp folder). The repository also has scripts for using the ALIZE and LIUM diarization toolkits with the NewsScape corpus.
Topics for future work:
1. Ways to use multiple GPUs, and other ways to balance quality against time complexity
SAI KRISHNA - FORCED ALIGNMENT
Sai Krishna has successfully built a forced-alignment model that minimized the word error rate in speaker diarization with closed captioning, by developing a method of data pruning based on a phone confidence measure.
This allowed the calculation of a confidence measure based on the estimated posterior probability, used to prune the bad data. Posterior probability, by definition, indicates the correctness or confidence of a classification. In speech recognition, the posterior probability of a phone or word hypothesis w, given a sequence of acoustic feature vectors O_1^T = O_1 O_2 ... O_T, is computed (as given in the equation below) as the likelihood of all paths passing through that particular phone/word (in around the same time region) in the lattice, normalized by the total likelihood of all paths in the lattice. It is computed using the forward-backward algorithm over the lattice. His model improved the word error rate by ~14%.
Let Ws and We respectively denote the word sequences preceding and succeeding the word w whose posterior probability is to be computed. Also let W' denote the word sequence (Ws w We). Then,

p(w | O_1^T) = [ Σ_{Ws, We} p(O_1^T | W') p(W') ] / p(O_1^T)
In the above equation, p(O_1^T) in the denominator is approximated as the sum of the likelihoods of all paths in the lattice. As can be seen, the posterior score also has a contribution from the LM (the term p(W') in the numerator, which signifies the LM likelihood). Hence, we consider the score after the contribution of the LM is nullified, when the posterior score purely reflects the acoustic match. The posterior score also depends on the acoustic model, on the (mis)match between training and test conditions, and on the error rate. To alleviate the dependence of the posterior on a single acoustic model, we re-compute the posteriors after rescoring the same lattices using a different acoustic model, such as one trained on articulatory features. Such system combination makes the posteriors more reliable.
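The role of the path likelihoods can be shown with a toy lattice in which the competing paths and their likelihoods are given directly (a real decoder would sum path scores with the forward-backward algorithm rather than enumerate them):

```python
# Toy lattice: competing word paths and their combined likelihoods are listed
# directly. The posterior of a word is the likelihood mass of all paths
# passing through it, normalized by the mass of all paths in the lattice.
paths = [
    (["performed", "by", "catherine", "bayers"], 0.6),
    (["performed", "by", "catherine", "buyers"], 0.3),
    (["performed", "my", "catherine", "bayers"], 0.1),
]

def posterior(word, paths):
    total = sum(lik for _, lik in paths)                        # ~ p(O)
    through = sum(lik for words, lik in paths if word in words)
    return through / total

print(round(posterior("bayers", paths), 3))  # 0.7
```

A low posterior for a word (here, "buyers" at 0.3) flags a region whose acoustics do not clearly support the hypothesis, which is exactly what the pruning step exploits.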
The Sequence of Algorithmic Steps which gave the best result:
> Monophone Decoding
> Triphone Decoding - Pass 1 ( Deltas and Delta-Deltas and Decoding)
> Triphone Decoding - Pass 2 ( LDA + MLLT Training and Decoding )
> Triphone Decoding - Pass 3 ( LDA + MLLT + SAT + MPE Training and Decoding)
> Subspace Gaussian Mixture Model Training
> MMI + SGMM Training and Decoding
> DNN Hybrid Training and Decoding
> System Combination ( DNN + SGMM)
Steps to Train and Test:
Using free audiobooks from Librivox that contain read speech of recorded books from Project Gutenberg, Sai Krishna first used the audiobooks without corresponding text and tried to obtain accurate text (transcripts and timestamps) using the open-source acoustic and language models from Librispeech, available at kaldi-asr.org. According to Sai Krishna, "LM plays an important role in the search-space minimization and also in the quality of the competing alternative hypotheses in the lattice. Here, the Librispeech LM is used, as the task is to decode audiobooks available at Librivox. The Librispeech LM won't be that useful if we're to decode audio of a lecture or a speech that encapsulates a specific topic and vocabulary. In such a case, it would be better to prepare a new LM based on text related to the topic in the audio and interpolate it with the LM prepared on the Librispeech corpus."
Why did we use Librispeech?
Librispeech is the largest open-source continuous speech corpus in English, with approximately 1000 hours available for use. It mainly consists of two parts: 460 hours of clean data and 500 hours of data containing artificially added noise. Because our goal is to decode audiobooks and consequently develop a USS system, using models trained on Librispeech data seemed to be the best choice.
Preparing Librispeech data for decoding:
Each broadcast news item consists of an approximate transcription file and a video file from which the audio needs to be extracted. The audio files were downloaded and converted to 16kHz WAV format. These wavefiles were then chopped at silence intervals of 0.3 seconds or more to create phrasal chunks averaging 15 seconds in length, and the chunks were power-normalized. An average chunk length of 15 seconds is sufficient to capture intonation variation and does not create memory problems during Viterbi decoding. Decoding is also observed to be faster and more accurate than when it is performed on much longer chunks.
Obtaining Accurate Hypotheses and Timestamps:
To prune out noisy/disfluent/unintelligible regions from the audio, we also need a confidence (or posterior probability) score that reflects the acoustics reliably. The confidence score should not reflect the language model score; it should reflect purely the acoustic likelihood.
1. Feature extraction: The first step is to extract features from the audiobooks. 39-dimensional acoustic feature vectors (12-dimensional MFCCs and normalized power, with their deltas and double-deltas) are computed, and cepstral mean and variance normalization is applied. The feature vectors are then spliced to form a context window of seven frames (three frames on either side), to which a discriminative linear transformation, Linear Discriminant Analysis, is applied to achieve dimensionality reduction.
2. Decoding the audiobook using speaker adapted features: We use the p-norm DNN-HMM speaker independent acoustic model trained on 460 hours of clean data from Librispeech corpus for decoding. The decoding is carried out in two passes. In the first pass, an inexpensive LM such as pruned trigram LM is used to constrain the search space and generate a lattice for each utterance. The alignments obtained from the lattices are used to estimate speaker dependent fMLLR or feature-space MLLR transforms. In the second pass of decoding, an expensive model such as unpruned 4-gram LM is used to rescore the lattice, and obtain better LM scores.
Combination of phone and word decoding: The lattices generated in the previous step don’t simply contain word hypotheses, but instead contain a combination of phone and word hypotheses. Phone decoding, in tandem with word decoding helps reduce errors by a significant proportion in the occurrence of out-of-vocabulary words or different pronunciations of in-vocabulary words. A combination of phone and word decoding can be performed by simply including the phones in the text from which the LM is prepared. An example to highlight the use of this technique is as below. We can observe the sequence of phones hypothesized because of the difference in pronunciations of the uttered word and its pronunciation in the lexicon.
Bayers b ey er z (pronunciation in the lexicon)
Reference: Performed by Catherine Bayers
Hypothesis: Performed by Catherine b ay er z
3. Improving decoded transcripts using LM interpolation: We find the 1-best transcripts from the lattices generated in the previous step. These 1-best transcriptions encapsulate the specific vocabulary, topic and style of the book. As a result, a LM computed purely from the decoded text is expected to be different and more relevant for recognition compared to a LM prepared from all Librispeech text. We exploit this fact to further improve the decoding by creating a new LM which is a linear interpolation of LM prepared on decoded text and LM prepared on entire Librispeech text. The LM interpolation weight for the decoded text is set to 0.9 to create a strong bias towards the book text. A new lattice is then generated using the new LM.
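The unigram case of this interpolation can be sketched as follows (the probabilities are hypothetical; a real implementation interpolates full n-gram models, e.g. with a toolkit such as SRILM):

```python
# Sketch of LM interpolation in the unigram case:
#   P(w) = 0.9 * P_decoded(w) + 0.1 * P_librispeech(w)
def interpolate(p_decoded, p_background, weight=0.9):
    vocab = set(p_decoded) | set(p_background)
    return {w: weight * p_decoded.get(w, 0.0)
               + (1 - weight) * p_background.get(w, 0.0)
            for w in vocab}

p_book = {"peppers": 0.02, "margaret": 0.01}                  # LM from decoded text
p_libri = {"peppers": 0.0001, "margaret": 0.0002, "the": 0.05}  # background LM
lm = interpolate(p_book, p_libri)
print(round(lm["peppers"], 5))  # 0.01801
```

The 0.9 weight strongly biases the interpolated model toward the book's own vocabulary and style while the background model keeps coverage of everything else.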
4. Nullifying LM likelihood before computing posteriors: Our goal is to obtain a 1-best hypothesis and associated posterior scores that match well with the acoustics, and have little or no influence of language on them, for USS task. Every word/phone hypothesis in a particular lattice of an utterance carries an acoustic and language model likelihood score. In a pilot experiment, we tried generating a lattice based upon pure acoustic score in the following way. We prepared a unigram LM from text containing a unique list of in-vocabulary words and phones. Just one single occurrence of each word and phone in the text made sure that frequency, and consequently the unigram probability of each word and phone is the same. It was found that the 1-best hypothesis produced by this method was nowhere close to the reference sequence of words. This outcome was understandable as we had not put any language constraints, and the decoder was tied down to choose between several words (~200,000) in the lexicon just on the basis of acoustics. Understandably, the phone hypothesis obtained by lexicon look up was also worse. The example below demonstrates the large error in the hypothesis when a unigram LM was used.
Reference: RECORDING BY RACHEL FIVE LITTLE PEPPERS AT SCHOOL BY MARGARET SIDNEY.
Unigram LM: EDINGER RACHEL FADLALLAH PEPPERS SAT SQUALL PRIMER GRITZ SIDNEY.
We therefore resorted to the following approach. Rather than using the above-mentioned unigram LM from the start, i.e. for the generation of lattices, it proved more useful to rescore the lattices obtained in the previous step (after LM rescoring and LM interpolation), whose alternative hypotheses are much closer to the sequence of reference words. The posteriors thus obtained also reflected pure acoustics. The sentences below show the 1-best output of the lattice from the previous step (after LM rescoring and interpolation), and the 1-best output after rescoring the same lattice with a unigram LM having equal unigram probabilities for each in-vocabulary word.
4-gram LM: READING BY RACHEL FIVE LITTLE PEPPERS AT SCHOOL BY MARGARET SIDNEY.
Nullified LM: READING MY RACHEL SIL FILE IT ILL PEPPERS AT SCHOOL BY MARGARET SIDNEY.
The second hypothesis is closer to the acoustics. Differences between the two hypotheses are italicized. It is clear that the 1-best transcription is better, and closer to the acoustics, when the unigram LM is used to rescore the lattice generated in the previous step rather than to generate the lattice from scratch. Consequently, the phone-level transcripts are also better, and the posteriors purely reflect the acoustic match.
5. Articulatory rescoring and system combination: The lattices generated in the previous step are rescored using a p-norm DNN-HMM acoustic model trained on articulatory features, with speaker-adapted articulatory features, to yield a new lattice. This rescored lattice is then combined with the original lattice to form a combined lattice. Pure articulatory-feature-based recognition is not as robust, so lattices are not generated using the acoustic model trained on articulatory features; instead, that model is used to rescore the lattice generated using the acoustic model trained on MFCCs. Lattice combination provides the advantage that two lattices scored with two different models and feature sets contain complementary information, which yields a lattice with more robust acoustic scores. The 1-best hypothesis obtained from this lattice is also more accurate. Word lattices are then converted to phone lattices. The 1-best phone sequence from the phone lattice, along with the posteriors, is what we use for building the USS system.
Owen He also assisted in this project by:
1. Shifting the code from Python 3 to Python 2
2. Increasing the speed of training by replacing the GMM from Sklearn with the GMM from pyCASP
3. Adding functions to recognize features directly, so that it is ready for the shared features from the pipeline
4. Returning the log likelihood of each prediction, so that one can reject untrained classes and filter out unreliable prediction results; this can also be used to search for speakers, by looking for predicted speakers with high likelihood
5. Incorporating Karan's speaker diarization results
6. Making the output file format consistent with other Red Hen output files; an example output file produced with the same video that Karan used can be found here
Additional information about Owen He's work on Speaker Identification can be found here, under the files "Pipeline" and "Speaker."
OWEN HE - SPEAKER AND EMOTION RECOGNITION
Owen He used a reservoir computing method called conceptors, together with traditional Gaussian Mixture Models (GMMs), to distinguish the voices of different speakers. He also used a method proposed by Microsoft Research last year at the Interspeech Conference, which uses a Deep Neural Network (DNN) and an Extreme Learning Machine (ELM) to recognize speech emotions. The DNN was trained to extract segment-level (256 ms) features, and the ELM was trained to make decisions based on the statistics of these features at the utterance level.
Owen's project focused on applying this to detect male and female speakers, specific speakers, and emotions by collecting training samples from different speakers and audio signals with different emotional features. He then preprocessed the audio signals, and created the statistical models from the training dataset. Finally, he computed the combined evidence in real time and tuned the apertures for the conceptors so that the optimal classification performance could be reached. You can check out the summary of results on GitHub.
His method is as follows:
1. Create a single, small (N=10 units) random reservoir network.
2. Drive the reservoir, in two independent sessions, with 100 preprocessed training samples of each gender, and create conceptors C_male, C_female respectively from the network response.
3. In exploitation, a preprocessed sample s from the test set is fed to the reservoir, and the induced reservoir states x(n) are recorded and transformed into a single vector z. For each conceptor, the positive evidence quantities z' C_male z and z' C_female z are then computed. We can then identify the gender of the speaker by taking the higher positive evidence, i.e. the speaker is male if z' C_male z > z' C_female z, and vice versa. The idea behind this procedure is that if the reservoir is driven by a signal from a male speaker, the resulting response signal z will be located in a linear subspace of the reservoir state space whose overlap with the ellipsoid given by C_male is larger than its overlap with the ellipsoid given by C_female.
4. In order to further improve the classification quality, we also compute NOT(C_male) and NOT(C_female). This leads to negative evidence quantities z’NOT(C_male)z and z’NOT(C_female)z.
5. By adding the positive evidence z’ C_male z and negative evidence z’NOT(C_female)z, a combined evidence is obtained, which can be paraphrased as “this test sample seems to be from male speaker and seems not to be from female speaker”.
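Steps 1-5 rest on Jaeger's conceptor formula C = R(R + a^-2 I)^-1, where R is the reservoir state correlation matrix and a the aperture. A minimal sketch of the positive-evidence comparison (the "reservoir states" below are random stand-ins with class-specific structure, not responses of an actual driven reservoir):

```python
import numpy as np

# Conceptor from collected reservoir states: C = R (R + a^-2 I)^-1,
# with R the state correlation matrix and `aperture` = a.
def conceptor(states, aperture):
    n = states.shape[1]
    R = states.T @ states / states.shape[0]
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(n))

rng = np.random.default_rng(0)
# Stand-in "responses": male-driven states have energy in early dimensions,
# female-driven states in late dimensions.
male_states = rng.normal(size=(200, 10)) * np.linspace(1.0, 0.1, 10)
female_states = rng.normal(size=(200, 10)) * np.linspace(0.1, 1.0, 10)
C_male = conceptor(male_states, aperture=10.0)
C_female = conceptor(female_states, aperture=10.0)

z = np.zeros(10)
z[0] = 2.0  # a test response lying in the "male" subspace
print(z @ C_male @ z > z @ C_female @ z)  # True: classified as male
```

Because z lies in the subspace where the male conceptor's ellipsoid has large extent, the quadratic form z' C_male z dominates, which is exactly the decision rule in step 3.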
The training was done using the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database from USC's Viterbi School of Engineering, found here.
Owen incorporated the python speech features library to extract MFCC features from audio files; this can be replaced by the common feature extraction component once the Red Hen Lab audio analysis pipeline is established.
VASANTH KALINGERI - COMMERCIAL DETECTION SYSTEM
Vasanth Kalingeri built a system for detecting commercials in television programs from any country and in any language. The system detects the location and the content of ads in any stream of video, regardless of the content being broadcast and other transmission noise in the video. In tests, the system achieved 100% detection of commercials. An online interface was built along with the system to allow regular inspection and maintenance.
Audio fingerprinting of commercial segments was used for the detection of commercials. Fingerprint matching has very high accuracy when dealing with audio; even severely distorted audio can be recognized with very high accuracy. The major problem faced was that the system had to be generic for all TV broadcasts, regardless of audio/video quality, aspect ratio, and other transmission-related errors. Audio fingerprinting provides a solution to all these problems. It was seen after implementation that several theoretical ways of detecting commercials suggested in the proposal were not as accurate as audio fingerprinting, and hence audio fingerprinting remained in the final version of the system.
Initially the user supplies a set of hand-tagged commercials. The system detects this set of commercials in the TV segment and, on detecting them, divides the entire broadcast into blocks. Each block can be viewed and tagged as a commercial by the user; this constitutes the maintenance of the system. A set of 60 hand-labelled commercials is provided to start with. This process takes about 10-30 minutes for a 1-hour TV segment, depending on the number of commercials that have to be tagged.
When the database contains an appreciable number of commercials (usually around 30 per channel), we can use it to recognize commercials in any unknown TV segment.
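The matching idea behind such fingerprint systems can be sketched as a vote over time offsets (Panako hashes pairs of spectral peaks; here the hashes are given directly, and all names and values are invented):

```python
from collections import Counter

# Toy sketch of fingerprint lookup: each database hash maps to
# (commercial_id, time_in_ad); a match is a run of hash hits that agree
# on a single time offset between the query and the stored ad.
def best_match(query_hashes, database):
    votes = Counter()
    for t, h in query_hashes:                  # (time_in_query, hash)
        for ad_id, t_db in database.get(h, []):
            votes[(ad_id, t_db - t)] += 1      # consistent offset = real match
    if not votes:
        return None
    (ad_id, offset), count = votes.most_common(1)[0]
    return ad_id, offset, count

database = {
    0xA1: [("jeopardy_ad", 0)], 0xB2: [("jeopardy_ad", 1)],
    0xC3: [("jeopardy_ad", 2)], 0xD4: [("jeromes_ad", 0)],
}
query = [(0, 0xA1), (1, 0xB2), (2, 0xC3)]
print(best_match(query, database))  # ('jeopardy_ad', 0, 3)
```

Requiring many offset-consistent hits is what makes this robust to the distortion and transmission noise mentioned above: random hash collisions do not agree on one offset.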
On running the system on any video, one can expect the following format of output file:
00:00:00 - 00:00:38 = Unclassified
00:00:38 - 00:01:20 = ad by Jeopardy
00:01:21 - 00:02:20 = ad by Jerome’s
… and so on
The above gives the locations of the commercials and the content they advertise in the video.
In case the system misses a few commercials, one can edit this file through a web interface, which looks as follows:
On making changes in the web interface, the system updates its database with the new/edited commercials. The web interface can also be used to view the detected commercials.
This system is extremely useful for people with the following interests:
Broadcastsegmentor uses the Panako library to create and query an audio fingerprint database that is made up of commercials. After querying the database, recognition times are used to find commercial breaks in the input audio file, which is assumed to be taken from a TV News broadcast.
Clip-find also uses Panako to create and query a fingerprint database, but its usage is focused simply on finding where (and if) the input audio file was broadcast, and when.
Audio novelty is a measure introduced by Foote in a 1999 paper. Its original purpose was to find note onsets in a solo instrument performance, but it has a very broad definition that can be useful to the Red Hen Lab (that is, for saying something meaningful about a TV news corpus).
Roughly speaking, audio novelty is a measure of how surprising the audio we are currently observing is compared with the audio before and after it. It is a spectral feature; that is, we calculate and compare the spectrum of the audio to make such a judgement. This definition is very broad, since many changes in audio can be deemed surprising (a change of speakers, a commercial starting or ending): monitoring the peaks in audio novelty throughout an audio file can find moments relevant to different audio analysis tasks.
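Foote's measure can be sketched by sliding a checkerboard kernel along the diagonal of a frame self-similarity matrix (a minimal sketch with toy two-dimensional "spectral" frames standing in for real spectra):

```python
import numpy as np

# Sketch of Foote-style audio novelty: correlate a checkerboard kernel along
# the diagonal of the frame self-similarity matrix; peaks mark abrupt changes.
def novelty_curve(frames, kernel_size=4):
    norms = np.linalg.norm(frames, axis=1, keepdims=True)
    S = (frames / norms) @ (frames / norms).T        # cosine self-similarity
    half = kernel_size // 2
    # +1 on same-side blocks, -1 on cross blocks
    kernel = np.kron(np.array([[1, -1], [-1, 1]]), np.ones((half, half)))
    curve = np.zeros(len(frames))
    for i in range(half, len(frames) - half):
        curve[i] = np.sum(kernel * S[i - half:i + half, i - half:i + half])
    return curve

# Two homogeneous segments: novelty should peak at the boundary (frame 8).
frames = np.vstack([np.tile([1.0, 0.0], (8, 1)), np.tile([0.0, 1.0], (8, 1))])
print(int(np.argmax(novelty_curve(frames))))  # 8
```

At a boundary the within-segment blocks are similar (rewarded by the +1 quadrants) while the cross blocks are dissimilar (the -1 quadrants contribute nothing), so the kernel response peaks exactly at the change point.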
Results and Future Developments
When broadcastsegmentor is trained with every commercial in the TV broadcast, it can detect commercial breaks with very good accuracy and timing; however, building a complete "commercial database" remains a challenge to be undertaken.
[image 1: a typical broadcastsegmentor file output: the program has been fully trained in this example]
Clip-find can be used to build a complete fingerprint database out of the Red Hen Corpus: given ~10 seconds of audio taken from a TV news program, it can find it even if the database is very large.
Future developers could be interested in developing a new fingerprint generation and matching algorithm that is faster, so that the application can “parse back” the corpus faster, or with stronger noise resistance.
[image 2: clip-find outputs if and when the input audio (first file) was found in the fingerprinted Red Hen corpus]
The best usage for the audio novelty feature extractor would be in a machine learning approach, as a boolean feature with high recall (interesting moments are often novelty peaks) but low precision.
[image 3: In this audio file, taken from an Al Jazeera America broadcast, the novelty peaks out at the ~90th analysis frame.]
Funding for this Red Hen Summer of Code was provided by the Alexander von Humboldt Foundation, via an Anneliese Maier Research Prize awarded to Mark Turner to support his research into blending and multimodal communication. Red Hen is grateful to the Alexander von Humboldt Foundation for the support.
SRI HARSHA - NON-VERBAL EVENT DETECTION
Sri Harsha has worked to develop a module for detecting non-verbal events in an audio file. Examples of "non-verbal events" are laughs, sighs, yawns, or any expression that does not use language. Verbal sounds are the sounds that make up the different words of our language. Each sound has its own rules, which are followed to produce that particular sound, and the meaning conveyed by our speech depends on the sequence of sounds we produce. In our daily conversations, apart from speaking, we laugh, cry, shout, and produce sounds like deep breathing to show exhaustion, yawning, different sighs, coughs, etc. All these sounds/speech segments are produced by humans and do not convey any meaning explicitly, but they do provide information about the physical, mental, and emotional state of a person. These speech segments are referred to as non-verbal speech sounds.
One interesting difference between verbal and non-verbal speech sounds is that a person cannot understand the verbal content if they are not familiar with the language, whereas non-verbal speech sounds can be perceived irrespective of the language, giving some knowledge of the emotional or physical state of the speaker.
An algorithm for detecting laughter segments was developed based on the features and common patterns exhibited by most laughter segments.
Main steps involved in the algorithm are:
1. Pre-processing
2. Feature extraction
3. Decision logic
Most laughter segments are voiced. Hence, the first step of the algorithm is voiced/non-voiced segmentation of the speech signal; only the voiced segments are considered for further analysis.
1. Pre-processing
The pre-processing step involves voiced/non-voiced (VNV) segmentation of the input audio signal and extraction of epoch locations.
VNV segmentation is performed using the energy of the signal obtained as the output of a zero-frequency resonator (ZFR), following a previously published method.
An epoch is the location of significant excitation of the vocal tract system. All excitation source features in this algorithm are extracted around epoch locations, because epochs are regions of high signal-to-noise ratio and are more robust to environmental degradation than other regions of the speech signal. Epoch locations are obtained by using the method explained here.
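The zero-frequency-filtering idea behind this pre-processing stage can be sketched as follows. This is a simplified illustration (cascaded 0-Hz resonators, local-mean trend removal, zero-crossing epoch picking), not Red Hen's production code; the window size and the synthetic 100 Hz test signal are our own choices:

```python
import numpy as np

def zff(x, fs, win_ms=10):
    """Zero-frequency filtering: cascade of two ideal 0-Hz resonators,
    then two passes of local-mean subtraction to remove the trend."""
    y = np.diff(x, prepend=x[0])
    for _ in range(2):                      # two resonators at 0 Hz
        out = np.zeros_like(y)
        for n in range(len(y)):
            out[n] = y[n]
            if n >= 1:
                out[n] += 2.0 * out[n - 1]
            if n >= 2:
                out[n] -= out[n - 2]
        y = out
    w = int(fs * win_ms / 1000)
    kernel = np.ones(2 * w + 1) / (2 * w + 1)
    for _ in range(2):                      # remove the polynomial trend
        y = y - np.convolve(y, kernel, mode="same")
    return y

def find_epochs(z):
    """Epochs: negative-to-positive zero crossings of the ZFF signal."""
    return np.where((z[:-1] < 0) & (z[1:] >= 0))[0] + 1

fs = 8000
t = np.arange(4000) / fs
z = zff(np.sin(2 * np.pi * 100 * t), fs)    # 100 Hz voicing -> T0 = 80 samples
ep = find_epochs(z)
ep = ep[(ep > 400) & (ep < 3600)]           # discard edge-affected regions
spacing = np.median(np.diff(ep))            # should be close to 80 samples
```

The energy of the ZFF signal around each epoch can then drive the VNV decision, and epoch-anchored windows provide the high-SNR regions from which the laughter features are extracted.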
2. Feature extraction
In this step, only the voiced segments obtained in the first step are considered. Acoustic features based on the excitation source and the vocal tract system characteristics of laughter segments are extracted for detection.
The source and system characteristics of laugh signals are analyzed using features such as the pitch period (T0), the strength of excitation (α), the amount of breathiness, and parameters derived from them, explained in detail below.
Pitch period (T0):
Strength of excitation (α):
Duration of the opening phase (β):
β = α/T0
Slope of T0 (δT0):
The pitch period contour of laughter has a unique pattern of rising rapidly at the end of a call. So, we use the slope of the pitch period contour to capture this pattern.
Extraction of slope of T0:
Slope of α (δα):
Loudness parameter (η):
Because of the high airflow involved, laughter is typically accompanied by some amount of breathiness. Breathiness is produced when the vocal folds vibrate loosely, so that more air escapes through the vocal tract than in modally voiced sound. This type of phonation is also called glottal frication and is reflected as high-frequency noise (a non-deterministic component) in the signal. A breathy signal will typically have lower loudness and a larger non-deterministic (noise) component.
Measures based on the Hilbert envelope (HE) are used to calculate loudness and the proportion of the non-deterministic component in the signal. Loudness is defined as the rate of closure of the vocal folds at the glottal closure instant (GCI). It can be computed from the Hilbert envelope of the excitation signal (the residual) obtained by inverse filtering the signal (LP analysis).
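A sketch of how the Hilbert envelope of the LP residual might be computed with plain NumPy. The LPC order, the autocorrelation-based solver, and the FFT-based analytic signal are illustrative choices, not necessarily those used in the module:

```python
import numpy as np

def lp_residual(x, order=10):
    """LP residual via the autocorrelation method (normal equations)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.lstsq(R, r[1:order + 1], rcond=None)[0]  # predictor coeffs
    pred = np.zeros_like(x)
    for k in range(1, order + 1):
        pred[k:] += a[k - 1] * x[:-k]
    return x - pred                       # what the predictor cannot explain

def hilbert_envelope(x):
    """Analytic-signal magnitude via the FFT (no scipy needed)."""
    X = np.fft.fft(x)
    h = np.zeros(len(x))
    h[0] = 1
    h[1:(len(x) + 1) // 2] = 2
    if len(x) % 2 == 0:
        h[len(x) // 2] = 1
    return np.abs(np.fft.ifft(X * h))

fs = 1000
t = np.arange(1000) / fs
env = hilbert_envelope(np.cos(2 * np.pi * 50 * t))  # constant-amplitude tone
x = np.sin(2 * np.pi * 100 * t) \
    + 0.01 * np.random.default_rng(0).standard_normal(1000)
res = lp_residual(x)
ratio = np.sum(res ** 2) / np.sum(x ** 2)           # residual is small
```

For laughter features one would evaluate the envelope of `lp_residual(...)` in short windows around each epoch: its peak sharpness serves as a loudness proxy, and the leftover envelope energy as a breathiness proxy.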
Dominant resonance frequency (DRF)
The dominant resonance frequency (DRF) represents the dominant resonance of the vocal tract system. DRF values are obtained by computing the DFT over a short window and taking the frequency with the maximum amplitude in each frame; this frequency is called the dominant resonance frequency (DRF).
Dominant resonance strength (γ)
Dominant resonance strength (γ) is the amplitude of the DFT at the DRF, i.e., the maximum DFT amplitude obtained in a frame. DRS values are higher for laughter than for neutral speech.
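Computing DRF and DRS per frame is straightforward. In the sketch below the frame size, hop, and window are illustrative assumptions; a 440 Hz tone yields a DRF of 440 Hz in every frame:

```python
import numpy as np

def drf_drs(x, fs, frame=400, hop=200):
    """Per-frame dominant resonance frequency (DRF) and strength (DRS):
    the frequency bin with the maximum DFT magnitude, and that magnitude."""
    win = np.hanning(frame)
    freqs = np.fft.rfftfreq(frame, d=1.0 / fs)
    drf, drs = [], []
    for i in range(0, len(x) - frame, hop):
        mag = np.abs(np.fft.rfft(x[i:i + frame] * win))
        mag[0] = 0.0                      # ignore the DC bin
        k = mag.argmax()
        drf.append(freqs[k])
        drs.append(mag[k])
    return np.array(drf), np.array(drs)

fs = 8000
t = np.arange(8000) / fs
x = np.sin(2 * np.pi * 440 * t)          # dominant resonance at 440 Hz
drf, drs = drf_drs(x, fs)
```

For real speech one would typically restrict the search to the speech band and smooth the per-frame values before thresholding.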
3. Decision logic
After the features above are extracted at every epoch location, a decision has to be made on each voiced segment. Note that this algorithm considers only voiced laughter; unvoiced laughter is not considered.
For every feature, a decision is first made at each epoch in the segment by applying a threshold to the feature value (different for each feature), called the 'value threshold' (vt) for that feature. If the feature value at an epoch satisfies the value threshold, that epoch is taken to belong to laughter according to that feature. A decision is then made on the segment by applying a 'fraction threshold' (ft), which specifies the percentage of epochs that must satisfy the value threshold for the segment to count as laughter. After applying the two thresholds, a separate binary decision on the segment is obtained for each feature. Finally, the segment is classified as laughter if at least 50% of the features give a positive decision.
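The two-threshold voting scheme can be sketched as follows. The feature names, threshold values, and toy per-epoch values are invented for illustration, not trained parameters:

```python
import numpy as np

def segment_decision(features, value_thresh, frac_thresh):
    """Two-stage laughter decision for one voiced segment.

    features:      dict name -> array of per-epoch feature values
    value_thresh:  dict name -> (vt, direction); direction '+' means an
                   epoch votes laughter when value > vt, '-' when < vt
    frac_thresh:   dict name -> ft, fraction of epochs that must vote yes
    """
    votes = []
    for name, vals in features.items():
        vt, direction = value_thresh[name]
        hits = vals > vt if direction == "+" else vals < vt
        votes.append(hits.mean() >= frac_thresh[name])
    # segment is laughter if at least half of the features agree
    return sum(votes) >= len(votes) / 2

# Toy segment: three hypothetical features over 10 epochs.
features = {
    "slope_T0": np.array([0.9] * 7 + [0.1] * 3),    # rising pitch period
    "breathiness": np.array([0.7] * 8 + [0.2] * 2),
    "drs": np.array([0.3] * 4 + [0.1] * 6),
}
value_thresh = {"slope_T0": (0.5, "+"), "breathiness": (0.5, "+"),
                "drs": (0.5, "+")}
frac_thresh = {"slope_T0": 0.6, "breathiness": 0.6, "drs": 0.6}
is_laughter = segment_decision(features, value_thresh, frac_thresh)
```

Here two of the three features pass their fraction thresholds, so the segment is classified as laughter.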
False alarm rate (FAR) and false rejection rate (FRR) are computed on the training set for various threshold values, and the threshold at which FAR and FRR are jointly minimized is selected for each feature. Thresholds are also computed for different combinations of features, to verify which combination of features and thresholds gives the best performance.
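The per-feature threshold search might look like the following, here using FAR + FRR as the cost to minimize. The Gaussian training distributions are a toy stand-in for real labeled feature values:

```python
import numpy as np

def pick_threshold(laugh_vals, speech_vals, n_steps=200):
    """Sweep a 'value threshold' and keep the one minimizing FAR + FRR.

    FAR = fraction of non-laughter (speech) samples wrongly accepted;
    FRR = fraction of laughter samples wrongly rejected. Laughter is
    assumed to lie ABOVE the threshold for this feature.
    """
    lo = min(laugh_vals.min(), speech_vals.min())
    hi = max(laugh_vals.max(), speech_vals.max())
    best_t, best_cost = lo, np.inf
    for thr in np.linspace(lo, hi, n_steps):
        far = np.mean(speech_vals > thr)   # false acceptances
        frr = np.mean(laugh_vals <= thr)   # false rejections
        if far + frr < best_cost:
            best_cost, best_t = far + frr, thr
    return best_t, best_cost

rng = np.random.default_rng(1)
laugh = rng.normal(1.0, 0.2, 500)          # toy training distributions
speech = rng.normal(0.0, 0.2, 500)
t, cost = pick_threshold(laugh, speech)    # threshold lands between the modes
```

The same sweep, run per feature and per feature combination, yields the vt values used in the decision logic above.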
This final step obtains the boundaries of the laughter segments based on the decisions from step 3: the regions of the speech signal considered laughter in the decision logic step are further processed to obtain the final boundaries of the laughter regions.
This step has two parts: