Embeddings analysis
This work is motivated by recent publications on this topic.
1. What is the geometry of data structures? What are the geometric properties of embedding methods combined with dimensionality-reduction methods? How can we assess the geometry of the syntactic and semantic structures stored inside textual data?
Understanding the topology and geometry of data can enhance embedding methods, helping to resolve the curse of dimensionality as well as the ambiguity of answers that various LLMs may produce. Which geometric and algebraic properties encode the semantic and syntactic structures of texts? How can we decode textual (NLP) structures from the mathematical structures of high-dimensional data?
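One concrete way to probe such geometric properties is to estimate the intrinsic dimension of an embedding point cloud. The sketch below is illustrative only: it uses synthetic data (a planted 2-D structure embedded in 50 dimensions) rather than real text embeddings, and PCA explained variance as a deliberately crude proxy for intrinsic dimension.

```python
# Illustrative sketch: probing the geometry of an embedding cloud with PCA.
# The data are synthetic (a 2-D latent structure mapped into 50-D with noise);
# in practice the points would be sentence or token embeddings.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, ambient_dim, intrinsic_dim = 500, 50, 2

# Embed a 2-D latent structure into 50-D via a random linear map plus noise.
latent = rng.normal(size=(n, intrinsic_dim))
mixing = rng.normal(size=(intrinsic_dim, ambient_dim))
points = latent @ mixing + 0.01 * rng.normal(size=(n, ambient_dim))

# One crude geometric probe: how many principal components are needed
# to explain 95% of the variance (a proxy for intrinsic dimension).
pca = PCA().fit(points)
cumvar = np.cumsum(pca.explained_variance_ratio_)
est_dim = int(np.searchsorted(cumvar, 0.95)) + 1
print(est_dim)  # recovers the planted intrinsic dimension (2)
```

Nonlinear estimators (e.g. nearest-neighbor-based ones) would be needed for curved data manifolds; PCA only captures linear structure.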
2. What are the key techniques (mathematical frameworks for designing the latent space for embeddings, such as affine spaces or metric spaces for tokenized information) that can help open up black-box AI methods and design explainable embedding methods?
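As a minimal illustration of the metric-space framing, one can treat token embeddings as points and compare them under a chosen dissimilarity. The vectors below are made up for illustration; real ones would come from a trained embedding model, and cosine "distance" is only metric-like (it does not satisfy the triangle inequality).

```python
# Illustrative sketch: token embeddings viewed as points in a (metric-like)
# space. The three vectors are invented for the example, not model outputs.
import numpy as np

def cosine_distance(u, v):
    """Metric-like dissimilarity: 1 - cosine of the angle between u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.75, 0.3])
apple = np.array([0.1, 0.2, 0.9])

# In a well-behaved latent space, semantically close tokens should also be
# close under the chosen distance.
print(cosine_distance(king, queen) < cosine_distance(king, apple))  # True
```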
Main research questions
Enhancing the theory of deep learning, and especially of explainable AI systems.
As some research has shown, AI systems often lack not only explainability but also deep theoretical foundations and explanations:
https://arxiv.org/abs/1312.6114
One way to approach this is to study so-called latent (or embedding) spaces.
In generative modeling, methods like variational autoencoders (VAEs) and generative adversarial networks (GANs) create latent spaces for modeling and synthesis of data.
Understanding the properties of the latent spaces of VAEs and GANs will enable us to uncover properties of DL models in general.
Some of the mathematical foundations of latent spaces need to be enhanced and studied, and new geometric and mathematical tools are needed for that. The research questions one may ask include, but are not limited to:
a. If we were to embed all possible datasets in the world using any existing framework (VAE, GAN, or any other technique), which properties would the latent space have?
Would it be a continuous or smooth latent space [Liu et al., 'Latent space cartography', 2019], or would it have abrupt transitions, which are considered a sign of memorization?
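One simple diagnostic for abrupt transitions is to walk a straight line between two latent codes, decode each step, and inspect the decoded step sizes. The sketch below uses a PCA inverse transform as a toy stand-in for a VAE/GAN decoder (an assumption for illustration; a linear decoder is smooth by construction, so it shows the baseline against which spikes would indicate abrupt transitions).

```python
# Illustrative sketch: testing for abrupt transitions along a latent
# interpolation. PCA's inverse_transform stands in for a trained decoder.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
data = rng.normal(size=(300, 20))  # synthetic stand-in for a dataset

pca = PCA(n_components=5).fit(data)
z = pca.transform(data)

# Interpolate linearly between the latent codes of two data points.
z0, z1 = z[0], z[1]
ts = np.linspace(0.0, 1.0, 11)
path = np.array([(1 - t) * z0 + t * z1 for t in ts])
decoded = pca.inverse_transform(path)

# Step sizes of the decoded path: even steps suggest a smooth mapping,
# spikes would suggest abrupt transitions (a memorization signature).
steps = np.linalg.norm(np.diff(decoded, axis=0), axis=1)
print(steps.std() / steps.mean())  # ~0: a linear decoder gives even steps
```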
b. Scientists are often interested in studying clustering and patterns in the latent space, while the question of the properties of the latent space itself is often missed. What other mathematical structures could describe it, such as algebraic structures like Clifford algebras, or affine-space concepts?
This has been partially explored during the SEMF meetings.
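The usual workflow the question contrasts itself with can be sketched in a few lines: cluster points in a latent space and score the pattern, while the structure of the space itself (its metric, curvature, or algebra) is left unexamined. The latent codes below are synthetic blobs, an assumption standing in for real model embeddings.

```python
# Illustrative sketch of the standard "cluster-and-score" workflow on a
# latent space. The codes are synthetic: two separated Gaussian blobs in 10-D.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
blob_a = rng.normal(loc=0.0, size=(100, 10))
blob_b = rng.normal(loc=5.0, size=(100, 10))
codes = np.vstack([blob_a, blob_b])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(codes)
score = silhouette_score(codes, labels)
print(round(score, 2))  # well-separated blobs give a high silhouette
```

Note what this pipeline never asks: whether Euclidean distance is even the right metric for the space, which is exactly the gap question (b) points at.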
References:
[1] C. Singh, L. Tupikina, M. Starnini, M. Santolini, "Charting mobility patterns in the scientific knowledge landscape", EPJ Data Science (2023), arxiv.org/abs/2302.13054
[2] C. Singh, E. Barme, R. Ward, L. Tupikina, M. Santolini, "Quantifying the rise and fall of scientific fields", PLOS ONE 17(6) (2022)
[3] I. Beltagy, K. Lo, A. Cohan, "SciBERT: A pretrained language model for scientific text", Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2019)
[4] L. Tupikina, V. Balu, "Innovation Landscape: Visualizing Emerging Topics from Patents Data", submitted to UAI (2024)
[5] https://github.com/Liyubov/BunkaTopics_text_analysis