This work presents a mathematically grounded segment-level aggregation framework for audio-based emotion intensity classification. A link to the original paper, which focused on the empirical design and validation of the system, is here.
The approach discretizes the temporal signal into short, independent segments, applies a lightweight classifier to each segment, and aggregates the per-segment predictions with a plurality-based rule whose guarantees are proven in the appendix (please see the document below).
In short, it is proved that the final misclassification probability decreases exponentially with the number of independent segment predictions, provided each segment-level classifier performs better than chance; the rate of this decay is governed by the segment-level accuracy.
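As a minimal illustrative sketch (not the paper's implementation), the aggregation step can be written as a plurality vote over per-segment labels, together with the standard Hoeffding bound on the error of a binary majority vote of n independent classifiers, each correct with probability p > 0.5. The function names and the binary-case bound are assumptions for illustration; the paper's appendix contains the actual proof.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def plurality_vote(segment_predictions: Sequence[Hashable]) -> Hashable:
    """Aggregate per-segment labels by plurality: the most frequent label wins.

    Ties are broken by first-seen order, a simplifying assumption here.
    """
    counts = Counter(segment_predictions)
    return counts.most_common(1)[0][0]


def majority_error_bound(p: float, n: int) -> float:
    """Hoeffding upper bound on the error of a binary majority vote over n
    independent segment classifiers, each correct with probability p > 0.5.

    The bound exp(-2 n (p - 1/2)^2) decays exponentially in n, matching the
    qualitative claim above (exponential improvement with more segments).
    """
    if not 0.5 < p <= 1.0:
        raise ValueError("segment-level accuracy p must satisfy 0.5 < p <= 1")
    return math.exp(-2.0 * n * (p - 0.5) ** 2)
```

For example, `plurality_vote(["high", "low", "high"])` returns `"high"`, and with a segment-level accuracy of p = 0.6 the error bound shrinks from roughly 0.82 at n = 10 segments to roughly 0.14 at n = 100, showing why adding independent segments drives the aggregate error toward zero.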