Hypergraphs and embeddings analysis

Analysis of embeddings

There is a broad need to enhance the theory of geometric and topological deep learning; embeddings are among the most widely used components of AI models, yet they often lack explainability. As research has shown, many AI systems lack not only explainability but also deep foundations for understanding how they function.
One possible pathway towards explainability is to understand how embeddings are generated and hence to assess the so-called latent (or embedding) space, distinguishing between the latent space of the model itself and the space of the data being embedded.

In generative modeling, methods such as variational autoencoders (VAEs) and generative adversarial networks (GANs) use latent spaces for modeling and synthesizing data. Understanding the properties of VAE and GAN latent spaces is crucial for uncovering properties of deep learning models in general.
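As a minimal sketch (not a specific implementation from the cited works), the snippet below shows a VAE encoder in PyTorch: data x is mapped to a Gaussian posterior q(z|x), and latent codes z are drawn via the reparameterization trick. All dimensions and layer sizes are illustrative assumptions; the point is only to make the latent space discussed above concrete.

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Maps inputs x to a Gaussian posterior q(z|x) over a low-dimensional latent space."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=2):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)       # mean of q(z|x)
        self.log_var = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)

    def forward(self, x):
        h = self.backbone(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return z, mu, log_var

encoder = VAEEncoder()
x = torch.randn(16, 784)       # a toy batch of flattened inputs
z, mu, log_var = encoder(x)    # z lives in the 2-dimensional latent space
print(z.shape)                 # torch.Size([16, 2])
```

Inspecting the distribution of such latent codes z (their clusters, distances, curvature) is one concrete way to study the geometry of a model's latent space.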

The mathematical foundations of latent spaces need to be strengthened and studied further, and new geometric and mathematical tools are needed for this.

The main research questions we ask here are as follows.
1. What is the geometry of the data (data structures)? How can hypergraph structures (higher-order structures) facilitate the learning of embeddings and multidimensional data structures?
What are the geometric properties of embedding methods combined with dimensionality reduction methods (a minimal example of such a pipeline is sketched after this question block)? How can we assess the geometry of syntactic and semantic structures stored inside textual data?

Understanding the topology and geometry of data can enhance embedding methods so as to mitigate the curse of dimensionality, as well as the ambiguity of answers that various LLMs may produce. Which geometric and algebraic properties encode the semantic and syntactic structures of texts? How can we decode textual (NLP) structures from the mathematical structures of high-dimensional data?
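The sketch below illustrates one possible pipeline of the kind asked about in question 1: embed texts, reduce dimensionality, and inspect simple geometric quantities (here, pairwise cosine distances). TF-IDF vectors are used only as a lightweight stand-in for a learned embedding model, and the toy corpus and parameters are illustrative assumptions, not part of the proposed method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_distances

corpus = [
    "hypergraphs encode higher-order relations between entities",
    "graph embeddings map nodes into a low-dimensional vector space",
    "latent spaces of generative models can be studied geometrically",
]

# Step 1: embed the documents (stand-in for a learned embedding model).
X = TfidfVectorizer().fit_transform(corpus).toarray()

# Step 2: reduce dimensionality, e.g. to a 2-dimensional space for visualization.
X_2d = PCA(n_components=2).fit_transform(X)

# Step 3: inspect the induced geometry via pairwise cosine distances.
print(X_2d)
print(cosine_distances(X))
```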

2. What are the key mathematical techniques (mathematical frameworks) for designing the latent space for embeddings, such as an affine space or a metric space for tokenized information? How can we help open up black-box AI methods and design explainable embedding methods?
We believe that hypergraph (higher-order) structures can enhance our understanding of embedding methods overall, since hypergraph theory provides rich tools for handling and analyzing higher-order structures [6]. However, many of these tools and mathematical methods still need to be translated into the embedding setting; a minimal sketch of a hypergraph representation is given below.
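The sketch below (an assumption-laden illustration, not code from [6]) represents a hypergraph by its vertex-hyperedge incidence matrix H, where H[v, e] = 1 if vertex v belongs to hyperedge e. From H one can derive quantities commonly fed into embedding methods, such as vertex degrees and the clique-expansion adjacency matrix.

```python
import numpy as np

vertices = ["token_A", "token_B", "token_C", "token_D"]
hyperedges = [
    {"token_A", "token_B", "token_C"},  # e.g. tokens co-occurring in one sentence
    {"token_B", "token_D"},
]

# Build the |V| x |E| incidence matrix.
H = np.zeros((len(vertices), len(hyperedges)), dtype=int)
for j, edge in enumerate(hyperedges):
    for i, v in enumerate(vertices):
        if v in edge:
            H[i, j] = 1

degrees = H.sum(axis=1)   # number of hyperedges each vertex belongs to
A = H @ H.T               # clique expansion: pairwise co-membership counts
np.fill_diagonal(A, 0)    # drop self-loops

print(H)
print(degrees)
print(A)
```

The clique expansion A is one standard way to hand higher-order structure to graph-based embedding methods, though it discards some of the information that genuinely hypergraph-aware tools aim to preserve.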




This work is motivated by recent publications on this topic [1,2] and by discussions during SEMF meetings.

References:

[1] C. Singh, L. Tupikina, M. Starnini, M. Santolini, "Charting mobility patterns in the scientific knowledge landscape", EPJ Data Science (2023), arxiv.org/abs/2302.13054

[2] C. Singh, E. Barme, R. Ward, L. Tupikina, M. Santolini, "Quantifying the rise and fall of scientific fields", PLOS ONE 17(6) (2022)

[3] I. Beltagy, K. Lo, A. Cohan, "SciBERT: A pretrained language model for scientific text", in Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2019)

[4] L. Tupikina, V. Balu, "Innovation Landscape: Visualizing Emerging Topics from Patents Data", submitted to UAI (2024)
[5] https://github.com/Liyubov/BunkaTopics_text_analysis
[6] https://github.com/Liyubov/hypergraphs_structures