I have created a stand-alone Matlab demo of the research on using Non-negative Matrix Factorisation (NMF) to learn acoustic units (such as words, phrases, or acoustic events) from only weakly annotated material: the audio samples are annotated only with tags indicating the presence of a word or event, without segmentation or temporal ordering. After training, the method has learned acoustic patterns for the annotated events, even if those events were never seen in isolation in the training data.
The demo illustrates the techniques employed in the IWT project ALADIN [1,2]. The ALADIN project aims to build a self-learning speech interface that learns both the vocabulary and the grammar from the user while they use the device. Some videos of the ALADIN project can be found here.
The procedure is visualised here:
Update: The demo and FramEngine code are no longer publicly available. Please get in touch with maintainers of the ALADIN project to discuss the availability of the demo, FramEngine software, data and collaboration options.
The first part of the code builds some simple wav files, which are then processed by the remainder of the code as if they were loaded from disk. To run the demo on your own data, you therefore only need to replace the wav generation and label creation with code that loads your own files and labels from disk.
In short, the code internally generates 100 audio files built from sine waves at 10 distinct frequencies. Each "file" is the concatenation of up to 4 sine waves. The number of distinct frequencies (10) can be thought of as the number of possible audio events in the data, and the number of sine waves concatenated in a file is a proxy for the number of audio events in a single audio clip.
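The demo itself is written in Matlab and is no longer public, so the following is only a minimal Python sketch of the data generation described above, not the original code. The frequency values, sample rate, segment length, the minimum of one event per clip, and the choice to draw distinct events within a clip are all assumptions made for illustration.

```python
import math
import random

SAMPLE_RATE = 8000   # assumed sample rate in Hz
SEG_LEN = 400        # assumed samples per sine segment (0.05 s)
N_EVENTS = 10        # number of distinct frequencies, i.e. possible events
MAX_PER_CLIP = 4     # each clip concatenates up to 4 sine waves
N_CLIPS = 100

# Assumed frequency values; the text only states that there are 10 distinct ones.
freqs = [300 + 200 * k for k in range(N_EVENTS)]

def make_clip(rng):
    """Return (samples, labels) for one synthetic clip.

    samples: concatenated sine segments
    labels:  binary vector marking which events occur, with no ordering
    """
    n = rng.randint(1, MAX_PER_CLIP)
    events = rng.sample(range(N_EVENTS), n)  # assumption: no repeats in a clip
    samples = []
    for k in events:
        samples.extend(
            math.sin(2 * math.pi * freqs[k] * t / SAMPLE_RATE)
            for t in range(SEG_LEN)
        )
    labels = [1 if k in events else 0 for k in range(N_EVENTS)]
    return samples, labels

rng = random.Random(0)
clips = [make_clip(rng) for _ in range(N_CLIPS)]
```

Note that the label vector deliberately discards the order in which the segments were concatenated; that is what makes the annotation weak.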
During learning, the algorithm only sees a 10-dimensional binary vector indicating which labels occur in each file (so no temporal ordering). NMF is used to build representations of each of the labels; during prediction, each clip is again associated with a 10-dimensional label vector. I also included a sliding-window decoding step, which attempts to give predicted labels at a more fine-grained temporal level.
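To make the learning and prediction steps more concrete, here is a toy Python sketch of one common way to set up weakly supervised NMF: the label vectors are stacked on top of the acoustic features, the stacked matrix is factorised, and at prediction time only the acoustic part of the dictionary is used to estimate activations, from which label scores are read off. This is an assumption about the general scheme, not the demo's actual code. The feature templates, matrix sizes, rank, and the Euclidean cost with multiplicative updates are illustrative choices, and the toy uses 4 labels instead of 10 to stay small.

```python
import random

EPS = 1e-9  # keeps the multiplicative updates away from division by zero

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def update_H(V, W, H):
    # Multiplicative update for the activations (Euclidean cost).
    Wt = transpose(W)
    num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
    return [[h * n / (d + EPS) for h, n, d in zip(rh, rn, rd)]
            for rh, rn, rd in zip(H, num, den)]

def update_W(V, W, H):
    # Multiplicative update for the dictionary (Euclidean cost).
    Ht = transpose(H)
    num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
    return [[w * n / (d + EPS) for w, n, d in zip(rw, rn, rd)]
            for rw, rn, rd in zip(W, num, den)]

def rand_matrix(rng, rows, cols):
    return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
N_LABELS, N_FEATS, N_CLIPS, RANK = 4, 5, 30, 4

# Hypothetical acoustic template per label: energy in two neighbouring bins.
templates = [[1.0 if f == k else 0.5 if f == k + 1 else 0.0
              for f in range(N_FEATS)] for k in range(N_LABELS)]

def make_clip():
    events = rng.sample(range(N_LABELS), rng.randint(1, 2))
    labels = [1.0 if k in events else 0.0 for k in range(N_LABELS)]
    feats = [sum(templates[k][f] for k in events) for f in range(N_FEATS)]
    return labels, feats

train = [make_clip() for _ in range(N_CLIPS)]
# Columns are clips; label rows are stacked on top of feature rows.
V = transpose([labels + feats for labels, feats in train])

W = rand_matrix(rng, N_LABELS + N_FEATS, RANK)
H = rand_matrix(rng, RANK, N_CLIPS)
err_before = frob_error(V, W, H)
for _ in range(200):
    H = update_H(V, W, H)
    W = update_W(V, W, H)
err_after = frob_error(V, W, H)

# Split the learned dictionary into its label part and its acoustic part.
W_lab, W_ac = W[:N_LABELS], W[N_LABELS:]

def predict(feats, n_iter=200):
    """Estimate activations from acoustics only, then map them to label scores."""
    v = [[x] for x in feats]
    h = rand_matrix(rng, RANK, 1)
    for _ in range(n_iter):
        h = update_H(v, W_ac, h)  # dictionary stays fixed
    return [row[0] for row in matmul(W_lab, h)]

test_labels, test_feats = make_clip()
scores = predict(test_feats)  # one non-negative score per label
```

A sliding-window variant could, under the same assumptions, call `predict` on features computed from short overlapping windows of a clip rather than from the whole clip, giving one score vector per window.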
One setting that makes things a lot faster is computetype (all configuration is in getconfigs.m). You may need to compile the mex files in the various include directories for your architecture, after which you can set computetype=2.
[1] B. Ons, J. F. Gemmeke, and H. Van hamme, “Fast vocabulary acquisition in an NMF-based self-learning vocal user interface”, accepted for publication in Computer Speech & Language, 2014 [pdf]
[2] J. F. Gemmeke, B. Ons, N. Tessema, H. Van hamme, J. van de Loo, G. De Pauw, W. Daelemans, J. Huyghe, J. Derboven, L. Vuegen, B. Van Den Broeck, P. Karsmakers, and B. Vanrumste, “Self-taught assistive vocal interfaces: An overview of the ALADIN project”, In Proc. INTERSPEECH, pages 2038-2043, 2013 [pdf]