Xin Zhong - Research Paradigm

My research investigates how to learn representations that simultaneously capture semantic content and remain invariant to distortions, enabling robust visual and multimodal intelligence and supporting applications such as watermarking and model attribution.

Worldview

My view is that AI models encode meaning and understand the learned world through Invariant Features.

These invariant features can be interpreted as semantically stable directions in the local geometry of the model's representation manifold (intuitively, a Riemannian manifold). Around each point, the model forms a tangent-like semantic space in which infinitesimal movement along some directions preserves meaning, while movement along other directions changes the underlying concept.
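
As a minimal numerical sketch of this local picture, one can collect the representation displacements caused by meaning-preserving perturbations around a point and take their dominant singular directions as an estimate of the locally invariant directions. Everything named here is an assumption made for illustration: the `encode` and `perturb` functions, the toy linear encoder, and the 0.95 energy threshold are hypothetical and do not describe any published method.

```python
import numpy as np

def local_invariant_basis(encode, x, perturb, n_samples=200, energy=0.95, rng=None):
    """Estimate an orthonormal basis of locally invariant directions around encode(x)."""
    rng = np.random.default_rng(0) if rng is None else rng
    z0 = encode(x)
    # Displacements in representation space caused by meaning-preserving changes.
    deltas = np.stack([encode(perturb(x, rng)) - z0 for _ in range(n_samples)])
    # Right singular vectors span the directions visited by those displacements.
    _, s, vt = np.linalg.svd(deltas, full_matrices=False)
    cum_energy = np.cumsum(s**2) / np.sum(s**2)
    # Keep the smallest number of directions reaching the energy threshold.
    rank = int(np.searchsorted(cum_energy, energy)) + 1
    return vt[:rank]

def split_displacement(delta, basis):
    """Split a displacement into its invariant part and its remaining (transition) part."""
    invariant_part = basis.T @ (basis @ delta)
    return invariant_part, delta - invariant_part

# Toy setup (hypothetical): a linear encoder with orthonormal columns, where
# only the first 3 input coordinates vary under meaning-preserving changes.
rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.normal(size=(16, 8)))  # W has shape (16, 8)

def encode(x):
    return W @ x

def perturb(x, rng):
    out = x.copy()
    out[:3] += 1e-2 * rng.normal(size=3)
    return out

x = rng.normal(size=8)
basis = local_invariant_basis(encode, x, perturb, rng=rng)
rank = basis.shape[0]

# A displacement along input coordinate 5 changes "meaning" in this toy model,
# so almost none of it should fall inside the estimated invariant subspace.
inv_part, trans_part = split_displacement(W[:, 5], basis)
print(rank, np.linalg.norm(inv_part), np.linalg.norm(trans_part))
```

In this toy case the invariant subspace is low-rank by construction. For a real model, the number of retained directions would depend on the chosen perturbations and on the energy threshold.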

(1) For language models, invariant directions may correspond to semantic-preserving transformations such as paraphrasing, syntactic variation, or stylistic rewriting, while nuisance (semantic-transition) directions correspond to changes in the actual meaning or intent. [Ref: https://arxiv.org/abs/2605.06458, where we characterize the local geometry of language model representations and observe structured low-rank invariant subspaces, suggesting that language models capture semantic cores that remain stable under surface-level changes.]

(2) For vision models, invariant directions may correspond to semantic-preserving distortions such as rotation, blur, illumination variation, compression, or viewpoint change, while nuisance (semantic-transition) directions correspond to changes in object identity, scene composition, or semantic content. (Unlike text, image representations contain substantially richer local variability due to fine-grained spatial, photometric, and structural details. As a result, visual invariant feature spaces are likely higher-dimensional and geometrically more complex than those in language, where semantic abstraction naturally compresses many variations into lower-rank structures.)

(3) Importantly, to deepen this "understanding", Invariant Features from different media should remain aligned both locally and globally. [Ref: https://arxiv.org/abs/2602.18863; https://arxiv.org/abs/2503.13805]

Toward a Geometry of Intelligence (My Ongoing View)

If invariant features represent local principal constraints underlying learned notions, then intelligence can be interpreted as navigation over an information geometry defined by these invariant structures. In this view, the learned representation space forms a sparse semantic manifold field (intuitively, a Riemannian manifold), where local geometry simultaneously encodes semantic-preserving directions and semantic-transition directions. Intelligence emerges not from static representations alone, but from the ability to move coherently across this manifold while respecting locally learned semantic constraints (intuitively, following geodesics whose length accounts for barriers). Generalization may therefore arise as spontaneous transport toward structurally aligned regions of the manifold, even without explicit training on the target concept.
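
The intuition of geodesic distance with barriers can be illustrated with a deliberately simple discrete stand-in. In the sketch below, a small grid plays the role of the manifold, blocked cells play the role of regions that learned constraints make inaccessible, and the geodesic length is the shortest admissible path found with Dijkstra's algorithm. The grid, the unit step cost, and the function name `geodesic_length` are assumptions chosen for illustration only.

```python
import heapq
import math

def geodesic_length(grid, start, goal):
    """Shortest-path length on a 4-connected grid; cells equal to 1 are barriers."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            return d
        if d > dist.get((r, c), math.inf):
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1.0
                if nd < dist.get((nr, nc), math.inf):
                    dist[(nr, nc)] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return math.inf  # goal is unreachable under the barriers

# Two points that are close in the ambient grid...
open_space = [[0] * 5 for _ in range(4)]
# ...become far apart once a barrier leaves only one admissible passage.
with_wall = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

free_length = geodesic_length(open_space, (0, 0), (0, 4))
barrier_length = geodesic_length(with_wall, (0, 0), (0, 4))
print(free_length, barrier_length)
```

The point of the comparison is that closeness in the ambient space does not imply closeness along admissible paths: the barrier forces a detour, so the two endpoints are much farther apart in the constrained geometry.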

(1) In language models (as illustrated in the figure above), a prompt defines an initial point and an initial semantic direction on the manifold. Reasoning can then be viewed as constrained semantic transport through the local geometry of the representation space. Some directions preserve meaning under paraphrasing, syntactic variation, or stylistic rewriting, while other directions induce controlled semantic transitions corresponding to new concepts, logical progression, abstraction, or inference. Each generated token incrementally follows locally coherent semantic pathways, allowing the model to evolve meaning without collapsing semantic consistency. In this view, reasoning emerges not merely from next-token prediction, but from sequential transport over a learned semantic geometry.

(2) In vision models, the semantic manifold field defined by invariant features is likely substantially richer and more entangled than in language. A local image input corresponds not only to a semantic concept, but also to a dense configuration of spatial structure, texture, geometry, illumination, viewpoint, motion, and object relationships. Consequently, transport on the visual manifold does not simply correspond to sequential progression, but to continuous collective movement across locally admissible semantic directions in a high-dimensional representation field.

Some local directions preserve semantic identity under transformations such as rotation, blur, illumination variation, deformation, compression, or viewpoint change, while other directions correspond to meaningful semantic transitions involving object interaction, scene evolution, motion, causality, or changes in physical state. Different visual tokens, patches, or local structures may simultaneously follow different semantic directions on the manifold, with some regions remaining semantically stable while others undergo meaningful change. Visual intelligence may therefore emerge from jointly modeling which local movements preserve semantic structure and which induce coherent semantic transition, while ensuring that these distributed local transports collectively evolve as a globally consistent scene.

Under this view, learned invariant geometry may also implicitly encode constraints imposed by the physical world itself. Certain transitions become geometrically unlikely or inaccessible because they violate learned structural regularities of reality. For example, objects cannot arbitrarily pass through walls, deform without constraint, or instantaneously appear in physically inconsistent locations. As a result, intelligence may emerge from navigating semantic manifolds whose local geometry already encodes both semantic consistency and the underlying structural rules of the world.