(1) For language models, invariant directions may correspond to semantics-preserving transformations such as paraphrasing, syntactic variation, or stylistic rewriting, while nuisance (semantic-transition) directions correspond to changes in the actual meaning or intent. [Ref: https://arxiv.org/abs/2605.06458, where we characterize the local geometry of language model representations and observe structured low-rank invariant subspaces, suggesting that language models capture semantic cores that remain stable under surface-level changes.]
(2) For vision models, invariant directions may correspond to semantics-preserving distortions such as rotation, blur, illumination variation, compression, or viewpoint change, while nuisance (semantic-transition) directions correspond to changes in object identity, scene composition, or semantic content. (Unlike text representations, image representations contain substantially richer local variability due to fine-grained spatial, photometric, and structural details. As a result, visual invariant feature spaces are likely higher-dimensional and geometrically more complex than those in language, where semantic abstraction naturally compresses many variations into lower-rank structures.)
(3) Importantly, to further the "understanding", invariant features from multimedia inputs should remain aligned both locally and globally. [https://arxiv.org/abs/2602.18863; https://arxiv.org/abs/2503.13805]
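
As a rough illustration of what a "low-rank invariant subspace" in (1) and (2) could mean operationally, the sketch below estimates one from the displacements between an anchor representation and its semantics-preserving variants, using a plain SVD. This is not the procedure of the cited papers. The function names `invariant_subspace` and `invariant_fraction`, the 95% energy threshold, and the synthetic data (which stands in for real model embeddings) are all assumptions made for illustration.

```python
import numpy as np


def invariant_subspace(anchor, variants, energy=0.95):
    """Estimate a local invariant subspace (hypothetical helper).

    anchor:   (d,) representation of the original input.
    variants: (n, d) representations of semantics-preserving rewrites.
    Returns an orthonormal basis of shape (k, d) and the rank k.
    """
    # Rows are displacements from the anchor to each variant.
    deltas = variants - anchor
    # Right singular vectors span the directions in which variants move.
    _, s, vt = np.linalg.svd(deltas, full_matrices=False)
    # Smallest k whose singular values hold `energy` of the total variance.
    var = s ** 2
    cum = np.cumsum(var) / var.sum()
    k = min(int(np.searchsorted(cum, energy)) + 1, len(s))
    return vt[:k], k


def invariant_fraction(direction, basis):
    """Fraction of a displacement's squared norm inside the subspace."""
    proj = basis @ direction
    return float(proj @ proj / (direction @ direction))


# Synthetic stand-in for model embeddings (assumption: d=64, true rank 3).
rng = np.random.default_rng(0)
d, true_rank, n = 64, 3, 50
B = np.linalg.qr(rng.normal(size=(d, true_rank)))[0].T  # (3, d), orthonormal rows
anchor = rng.normal(size=d)
variants = anchor + rng.normal(size=(n, true_rank)) @ B + 1e-3 * rng.normal(size=(n, d))

basis, k = invariant_subspace(anchor, variants)

# A displacement inside the planted subspace (a "paraphrase-like" move) ...
paraphrase_like = B[0]
# ... and one orthogonal to it (a "semantic-transition-like" move).
v = rng.normal(size=d)
meaning_shift = v - B.T @ (B @ v)

print("estimated rank:", k)
print("invariant fraction (paraphrase-like):", invariant_fraction(paraphrase_like, basis))
print("invariant fraction (meaning shift):", invariant_fraction(meaning_shift, basis))
```

With real models, `anchor` and `variants` would come from the model's hidden states for an input and its paraphrases (or an image and its augmentations). Under the hypothesis in (2), the estimated rank would be expected to come out larger for vision than for language, though this sketch does not test that.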