After improving the taxonomy of the deepfake detection model at UMBC, my next goal on this project is to segment linguistic features automatically. This will reduce the need for hand-coding by annotators and improve the model's inference. My plan for this innovation is to add a data preprocessing step that parses a waveform into phoneme-, word-, and phrase-level boundaries, then performs inference at each level of linguistic representation.
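As a rough illustration of the preprocessing step described above, the sketch below slices one waveform into phoneme-, word-, and phrase-level segments using time boundaries from a forced-alignment transcript. The alignment format, the sample rate, and all function names here are illustrative assumptions, not the project's actual code.

```python
# Hedged sketch: hierarchical segmentation of a waveform from a
# (hypothetical) forced-aligner transcript. Toy data, assumed format.
import numpy as np

SR = 16_000  # assumed sample rate (Hz)

# Hypothetical aligner output: one row per phoneme, with word/phrase ids.
alignment = [
    {"ph": "AY", "word": 0, "phrase": 0, "start": 0.00, "end": 0.10},
    {"ph": "D",  "word": 1, "phrase": 0, "start": 0.10, "end": 0.15},
    {"ph": "OW", "word": 1, "phrase": 0, "start": 0.15, "end": 0.25},
    {"ph": "N",  "word": 2, "phrase": 0, "start": 0.25, "end": 0.32},
    {"ph": "OW", "word": 2, "phrase": 0, "start": 0.32, "end": 0.45},
]

def merge_spans(rows, level):
    """Collapse phoneme rows sharing a word/phrase id into (start, end) spans."""
    spans = {}
    for r in rows:
        key = r[level]
        s, e = spans.get(key, (r["start"], r["end"]))
        spans[key] = (min(s, r["start"]), max(e, r["end"]))
    return [spans[k] for k in sorted(spans)]

def segment_waveform(wave, rows, sr=SR):
    """Slice one waveform into phoneme-, word-, and phrase-level arrays."""
    levels = {
        "phoneme": [(r["start"], r["end"]) for r in rows],
        "word": merge_spans(rows, "word"),
        "phrase": merge_spans(rows, "phrase"),
    }
    return {
        name: [wave[int(s * sr):int(e * sr)] for s, e in spans]
        for name, spans in levels.items()
    }

wave = np.zeros(int(0.45 * SR))  # placeholder audio for "I dunno"
segments = segment_waveform(wave, alignment)
print({k: len(v) for k, v in segments.items()})
# → {'phoneme': 5, 'word': 3, 'phrase': 1}
```

Each level then feeds its own inference pass, so the model sees the same audio at three granularities rather than a single undifferentiated stream.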
The key benefit of automatically segmenting speech into different levels of the linguistic hierarchy is that interactions between features across levels can be estimated. This opens up a large space of feature combinations and supports a more sophisticated detection model.
Research in linguistics has extensively shown that speakers simultaneously adjust their speech across multiple levels of linguistic representation. For example, semantic predictability and perceptual clarity are related in human speech: when a word is unpredictable, rare, or a nonword, speakers pronounce it carefully. By contrast, highly predictable phrases like "I don't know" are often reduced to "I dunno," or even to a bare intonational contour that listeners still understand. Synthetic voices, meanwhile, currently do not typically make such context-aware production adjustments.
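The predictability–reduction relationship suggests a concrete cross-level feature. The sketch below, built entirely on toy numbers and an assumed log-frequency proxy for predictability, correlates a word's predictability with its per-phoneme duration across an utterance: natural speech should show a negative correlation (predictable words get reduced), while uniform synthetic timing would not.

```python
# Hedged sketch: a predictability-vs-reduction mismatch feature.
# LOG_FREQ values and utterance timings are invented toy data.
import math

# Assumption: higher log-frequency ≈ more predictable ≈ more reduced.
LOG_FREQ = {"i": 6.9, "dunno": 4.2, "sesquipedalian": 0.7}

def mismatch_score(words):
    """
    Pearson correlation between word predictability and per-phoneme duration.
    words: list of (word, duration_s, n_phones).
    Natural speech is expected to score negative; flat synthetic timing
    is expected to score near zero or positive (a hedged cue, not proof).
    """
    xs = [LOG_FREQ[w] for w, _, _ in words]
    ys = [d / n for _, d, n in words]  # per-phoneme duration
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Natural-style timing: frequent words reduced, the rare word careful.
natural = [("i", 0.08, 1), ("dunno", 0.22, 4), ("sesquipedalian", 1.9, 13)]
# Synthetic-style timing: nearly uniform per-phoneme duration throughout.
synthetic = [("i", 0.13, 1), ("dunno", 0.48, 4), ("sesquipedalian", 1.625, 13)]

print(round(mismatch_score(natural), 2), round(mismatch_score(synthetic), 2))
# natural comes out clearly negative; synthetic does not
```

A feature like this only becomes computable once phoneme- and word-level boundaries are available, which is exactly what the proposed segmentation step provides.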
Typical machine learning models that operate on a continuous moving window do not leverage this potential distinction between deepfaked and real speech. This is just one example of how linguistic knowledge can improve detection models, and others can be developed through human-in-the-loop strategic model building as deepfakes advance.
Research by other teams has already shown that this kind of approach yields substantial gains in detection accuracy. Zhang et al. (2025) showed that segmentation into phonemes can markedly improve model prediction accuracy, as illustrated in the figure below:
Models without phoneme-level features struggle to separate real and fake speech, but when audio samples are annotated with phoneme-level boundaries, the separation improves markedly. The two distinct clouds of red (fake) and blue (real) samples in the lower panel illustrate the performance payoff of including phoneme annotation.
Figure reproduced from Zhang et al. (2025), "Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes." https://doi.org/10.1609/aaai.v39i1.32093