
BGOSLR Backend

Bayesian, Gaussian, Open-Set Language Recognition Backend

Version 0.1: 16 December 2008

This is a proposal for generalizing the traditional Gaussian backend, which fuses and calibrates language recognition scores, so that it handles the out-of-set hypothesis in a principled way.

The essential difference between in-set and out-of-set languages is that the former have training data while the latter do not. By explicitly modeling the training data in the backend as well (and not only the test data), we get a simple and intuitively appealing solution. This solution handles all cases, from zero, to sparse, to plentiful training data. In the case of plentiful training data, it degenerates (i.e. simplifies) to the form of the traditional Gaussian backend.

We stay as close as possible to the traditional backend, by assuming a language-independent within-class covariance, so that our language models in score-space are Gaussian, with a common covariance but with language-conditional means. The generalization is that we do not make point estimates of these means, based on the training data. Instead we model the distributions of these means with a Gaussian prior, parametrized with a between-class covariance matrix. This allows us to evaluate a Bayesian integral over all possible language models.
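As a worked sketch of the integral described above, the standard conjugate-Gaussian result can be written out as follows. The notation is an assumption made here for illustration and may differ from the accompanying document: $x$ is a test score vector, $W$ and $B$ are the within- and between-class covariances, $m$ is the mean of the prior (not specified in the text above; it could be zero), and $n_\ell$ and $\bar{x}_\ell$ are the number and the average of the training samples for language $\ell$.

```latex
\begin{align*}
  x \mid \mu_\ell &\sim \mathcal{N}(\mu_\ell, W), &
  \mu_\ell &\sim \mathcal{N}(m, B), \\
  \Sigma_\ell &= \bigl(B^{-1} + n_\ell W^{-1}\bigr)^{-1}, &
  \hat{\mu}_\ell &= \Sigma_\ell \bigl(B^{-1} m + n_\ell W^{-1} \bar{x}_\ell\bigr), \\
  p(x \mid \ell, \text{training data})
    &= \int \mathcal{N}(x; \mu, W)\,
            \mathcal{N}(\mu; \hat{\mu}_\ell, \Sigma_\ell)\, d\mu
     = \mathcal{N}\bigl(x;\ \hat{\mu}_\ell,\ W + \Sigma_\ell\bigr).
\end{align*}
```

Under these assumptions, a language with no training data ($n_\ell = 0$) gets $\hat{\mu}_\ell = m$ and $\Sigma_\ell = B$, so the out-of-set likelihood is $\mathcal{N}(x; m, W + B)$. With plentiful data ($n_\ell \to \infty$), $\Sigma_\ell \to 0$ and $\hat{\mu}_\ell \to \bar{x}_\ell$, which recovers the traditional form $\mathcal{N}(x; \bar{x}_\ell, W)$.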

The salient difference between the traditional and proposed backends is that we now also need to estimate the extra between-class covariance parameter. The training of the backend is therefore more complex. However, once trained, running the proposed backend on new test data is of comparable complexity to the traditional one. All we need at runtime are the following:

- within- and between-class covariances
- zero-order stats: the number of training samples for every language
- first-order stats: the language-conditional average over the training samples, for every language
- test data
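The runtime computation from the quantities listed above might look roughly like the following NumPy sketch. It is not the MATLAB implementation from the accompanying document. The function name `predictive_log_likelihoods`, its argument names, and the zero default for the prior mean are all assumptions made for illustration. It uses the standard conjugate-Gaussian result: the posterior over each language mean is combined with the within-class covariance to give a Gaussian predictive density for the test vector.

```python
import numpy as np


def predictive_log_likelihoods(x, W, B, counts, means, prior_mean=None):
    """Log-likelihood of one test vector under each language.

    x      : (d,) test score vector
    W, B   : (d, d) within- and between-class covariances
    counts : zero-order stats, number of training samples per language
    means  : first-order stats, per-language training average, shape (L, d)
    prior_mean : assumed mean of the prior on language means (default zero)
    """
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    m = np.zeros(d) if prior_mean is None else np.asarray(prior_mean, dtype=float)
    W_inv = np.linalg.inv(W)
    B_inv = np.linalg.inv(B)
    out = []
    for n, xbar in zip(counts, means):
        # Posterior over the language mean given its training stats.
        post_cov = np.linalg.inv(B_inv + n * W_inv)
        post_mean = post_cov @ (B_inv @ m + n * (W_inv @ xbar))
        # Predictive density for the test vector: N(x; post_mean, W + post_cov).
        cov = W + post_cov
        diff = x - post_mean
        _, logdet = np.linalg.slogdet(cov)
        quad = diff @ np.linalg.solve(cov, diff)
        out.append(-0.5 * (d * np.log(2.0 * np.pi) + logdet + quad))
    return np.array(out)


# Hypothetical example: plentiful data, sparse data, and no data (out-of-set).
W = np.eye(2)
B = 4.0 * np.eye(2)
counts = [1000, 2, 0]
means = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 0.0]])
ll = predictive_log_likelihoods(np.array([2.0, 0.0]), W, B, counts, means)
```

In this sketch the out-of-set language is simply an entry with a count of zero: its predictive density falls back to the prior, with covariance `W + B`. The entry with 1000 samples behaves almost like a traditional Gaussian backend model centred on its training average.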

The proposed backend is described in more detail in the following document: openset_language_recognition.pdf. This is a work in progress. The document shows a MATLAB runtime implementation of this backend on synthetic data, where its behaviour is intuitively pleasing. The training (estimation of the within- and between-class covariance matrices) has not been implemented, but the document proposes an outline of how to do this.

I will post further developments of this idea on this web-page.