Embeddings analysis


This work is motivated by recent publications on the analysis of embeddings (see the references at the end).
1. What is the geometry of the data structures? What are the geometric properties of embedding methods when they are combined with dimensionality reduction methods? How can we assess the geometry of the syntactic and semantic structures stored in textual data?

Understanding the topology and geometry of data can improve embedding methods, helping to mitigate the curse of dimensionality as well as the ambiguity of the answers that various LLMs may produce. Which geometric and algebraic properties encode the semantic and syntactic structures of texts? How can we decode textual (NLP) structures from the mathematical structures of high-dimensional data?
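
One simple way to make the question about embeddings combined with dimensionality reduction concrete is to check how well pairwise distances survive the reduction. The sketch below is only an illustration under stated assumptions: the toy "embeddings" are synthetic Gaussian clusters, the reduction is a Gaussian random projection (a stand-in, not a method used in the cited work), and the function names are hypothetical.

```python
import math
import random
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distances for all unordered pairs, in a fixed order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def random_projection(points, target_dim, rng):
    """Project points with a random Gaussian matrix (assumed stand-in reducer)."""
    source_dim = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
              for _ in range(target_dim)]
    return [[sum(w * x for w, x in zip(row, p)) for row in matrix]
            for p in points]


def distance_preservation(original, reduced):
    """Correlation between pairwise distances before and after reduction."""
    return pearson(pairwise_distances(original), pairwise_distances(reduced))


rng = random.Random(0)
# Synthetic toy data: two clusters of 20 points each in 50 dimensions.
embeddings = [[rng.gauss(center, 1.0) for _ in range(50)]
              for center in (0.0, 3.0) for _ in range(20)]
reduced = random_projection(embeddings, 5, rng)
score = distance_preservation(embeddings, reduced)
print(f"distance correlation after projection: {score:.3f}")
```

A score close to 1 would indicate that the reduced space keeps the relative distances of the original one; the same measure could be applied to any other reduction method by swapping out `random_projection`.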

2. What are the key techniques (mathematical frameworks for designing the latent space of the embeddings, such as an affine space or a metric space for the tokenized information) that can help us open up black-box AI methods and design explainable embedding methods?
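
As a small illustration of why the choice of mathematical framework matters, the sketch below checks the triangle inequality for two common ways of comparing embedding vectors. Cosine distance (one minus cosine similarity) is not a metric, while the angle between vectors is. The three vectors and the helper names are chosen here for illustration only.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def angular_distance(u, v):
    """Angle between two non-zero vectors, in radians."""
    cos_sim = 1.0 - cosine_distance(u, v)
    cos_sim = max(-1.0, min(1.0, cos_sim))  # guard against rounding error
    return math.acos(cos_sim)


def satisfies_triangle_inequality(dist, a, b, c, tol=1e-9):
    """Check d(a, c) <= d(a, b) + d(b, c) for one triple of points."""
    return dist(a, c) <= dist(a, b) + dist(b, c) + tol


# Three vectors at 0, 45 and 90 degrees in the plane.
a, b, c = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]

print(satisfies_triangle_inequality(cosine_distance, a, b, c))   # False
print(satisfies_triangle_inequality(angular_distance, a, b, c))  # True
```

For this triple the cosine distance from `a` to `c` is 1, while the detour through `b` costs only about 0.59, so a latent space equipped with cosine distance is not a metric space in the strict sense.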

Main research questions

In generative modeling, methods such as variational autoencoders (VAEs) and generative adversarial networks (GANs) construct latent spaces for modeling and synthesizing data.
Understanding the properties of the latent spaces of VAEs and GANs may help us uncover properties of deep learning (DL) models in general.
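
To make the notion of a VAE latent space more tangible, the following minimal sketch shows three standard ingredients: sampling a latent vector with the reparameterization trick, the closed-form KL divergence from a diagonal Gaussian to the standard normal prior (the term that shapes the latent space during training), and linear interpolation between two latent points. No encoder or decoder network is included, and the function names and example values are illustrative assumptions.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def interpolate(z_a, z_b, t):
    """Point on the straight line between two latent vectors, t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


rng = random.Random(0)
mu, log_var = [1.0, 0.0], [0.0, 0.0]   # hypothetical encoder output
z = sample_latent(mu, log_var, rng)

print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0, posterior equals prior
print(kl_to_standard_normal(mu, log_var))             # 0.5
print(interpolate([0.0, 0.0], [2.0, 4.0], 0.5))       # [1.0, 2.0]
```

Decoding points along such an interpolation path is one common way to probe whether the latent geometry reflects meaningful structure in the data.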

Part of this has been explored during the SEMF meetings.


[1] C. Singh, L. Tupikina, M. Starnini, M. Santolini, "Charting mobility patterns in the scientific knowledge landscape", Nat. Comms, arxiv.org/abs/2302.13054, EPJ Data Science (2023)

[2] C. Singh, E. Barme, R. Ward, L. Tupikina, M. Santolini, "Quantifying the rise and fall of scientific fields", PLoS ONE 17(6) (2022)

[3] Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. In Proc. of the 2019 Conference on Empirical Meth.

[4] L. Tupikina, V. Balu, "Innovation Landscape: Visualizing Emerging Topics from Patents Data", submitted to UAI (2024)
[5] https://github.com/Liyubov/BunkaTopics_text_analysis