Hirschberg's algorithm (1975) [PDF] is a dynamic programming procedure to find the optimal alignment between sequences of lengths m and n in m × n steps.
At Vicomtech's Speech and Language Technologies group, we've used this algorithm for aligning long audios with their transcripts. This approach to long audio alignment was previously shown to work successfully by Bordel et al (2012). The work reported about in this site has been presented in three of our publications.
One of the components of a sequence aligner that applies Hirschberg's algorithm is a scoring system.
The scoring system is used to evaluate each alignment operation and choose the best alignment at each step, in order to find the optimal alignment overall.
The simplest scoring functions are binary. They assign the same score to all matches, and the same score to all mismatches. For instance, substituting p for q gets a score of 1 when p and q are equal, and a score of 0 when p and q don't match.
A scoring function that takes into account the similarity between the elements to be aligned can provide better alignment results than a binary function. For this reason, we created and tested phoneme similarity matrices, finding that they improve long audio alignment results in Spanish, English and Basque.
We created matrices based on three criteria: phonological similarity, perceptual errors, and phone-decoder errros.
Our phonological similarity metric is based on multivalued phonological features, and was devised by Kondrak (2002).
In order to apply the similarity metric, it is necessary to assign feature specifications and feature values to each phone in the aligner's phoneset, as well as defining feature salience weights. We created feature specifications and tested them, obtaining alignment improvements in Spanish, English and Basque. Salience weights are also provided. See the similarity matrices created based on phonological similarity.
Matrices were created for English and Spanish, based on data for perceptual confusion from Cutler et al. (2004) and García Lecumberri et al. (2013) repsectively.
Matrices based on decoding errors from our phone decoder were also created, for English, Spanish and Basque.
See here for all the types of phoneme similarity matrices created.
See here for information about implementing Kondrak's scoring function, which is needed for creating our phonological similarity matrix.
Details about obtaining the scores in our perceptual and phone-decoder error-based simlarity matrices are provided directly in the pages discussing these matrices, since the calculations are simpler than for the phonological ones.