AI-infused societies
Reviewing recent papers on the topic of retraining models, one often encounters misconceptions about algorithmic bias. When we speak about algorithms in general, or about inherited biases, we are often referring to the statistics operating behind their algorithmic skin: from these statistical traces a set of parameters is extracted, which is then used to produce the algorithms' predictions. Hence, when people refer to potential bias, it may be reduced by examining the statistical results while keeping in mind the accuracy, recall, and other characteristics of these algorithms.
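For instance, such characteristics can be disaggregated by subgroup. Below is a minimal sketch (the data and the `group` attribute are synthetic placeholders, not taken from any real system) comparing accuracy and recall across two groups:

```python
# A minimal sketch: per-group accuracy and recall as a simple bias probe.
# All data here is synthetic and purely illustrative.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)   # ground-truth labels (placeholder)
y_pred = rng.integers(0, 2, size=1000)   # model predictions (placeholder)
group = rng.integers(0, 2, size=1000)    # sensitive attribute (placeholder)

# Large gaps between groups hint at bias that a single aggregate
# score would hide.
for g in (0, 1):
    mask = group == g
    acc = accuracy_score(y_true[mask], y_pred[mask])
    rec = recall_score(y_true[mask], y_pred[mask])
    print(f"group {g}: accuracy={acc:.3f}, recall={rec:.3f}")
```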
Another approach is to use a geometrical characterisation of the datasets. To build a proper geometrical characterisation of the datasets we train a model on, we need to rely on a strong theoretical foundation (a geometric axiomatisation system and functional analysis are such theories for machine learning, for instance), e.g. constructing proper topological spaces, for example based on simplicial-complex realisations of the datasets. Merely relying on the proximity of points and assuming some metric may not yield sufficient results, as our 'human' bias towards using the Euclidean metric for most spaces may again induce geometric biases.
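As one illustration, the same point cloud yields different simplicial complexes, and hence different topological summaries, under different metrics. A minimal sketch, assuming the `gudhi` library and a random point cloud standing in for a real dataset:

```python
# A minimal sketch of how the choice of metric changes the
# simplicial-complex realisation of the same point cloud.
# Data is random and purely illustrative.
import numpy as np
import gudhi
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))  # 60 points in a 10-dimensional space

for metric in ("euclidean", "cosine"):
    D = squareform(pdist(X, metric=metric))
    # Vietoris-Rips complex from the chosen distance matrix; the
    # edge-length threshold is a (somewhat arbitrary) percentile.
    rips = gudhi.RipsComplex(distance_matrix=D,
                             max_edge_length=np.percentile(D, 20))
    st = rips.create_simplex_tree(max_dimension=2)
    st.compute_persistence()
    print(metric, "Betti numbers:", st.betti_numbers())
```

Different Betti numbers under the two metrics make the 'geometric bias' of the metric choice explicit.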
As de Dampierre et al. (2024) note in their work on Bunkatopics, referring to "Algorithmically infused societies" (Wagner et al., 2021):
When using common tools for data analysis such as Principal Component Analysis (PCA) (Abdi & Williams, 2010), UMAP (McInnes et al., 2020), and unseen-species models (Chao, 1988), researchers inherit decisions that have partially been made by the creators of these packages.
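One way to make such inherited decisions visible is to compare how different reduction methods distort the same data. Below is a minimal sketch (assuming scikit-learn and umap-learn; random data stands in for a real corpus):

```python
# A minimal sketch showing that PCA and UMAP impose different
# neighbourhood structures on the same dataset.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness
import umap

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))  # placeholder for document embeddings

emb_pca = PCA(n_components=2).fit_transform(X)
emb_umap = umap.UMAP(n_components=2, random_state=42).fit_transform(X)

# Trustworthiness measures how well local neighbourhoods survive the
# reduction; the two methods "decide" differently on our behalf.
print("PCA  trustworthiness:", trustworthiness(X, emb_pca, n_neighbors=10))
print("UMAP trustworthiness:", trustworthiness(X, emb_umap, n_neighbors=10))
```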
Scaling in AI
AI tools are growing and are being incorporated into many aspects of human lives.
When Transformers were introduced, performance improved somewhat predictably as one scaled up either the amount of compute or the size of the network, a phenomenon now referred to as scaling laws [Gunasekar et al., "Textbooks Are All You Need", https://arxiv.org/abs/2306.11644].
Yet one can mitigate these scaling costs and increase efficiency by curating and coordinating the data on which the models are trained: relying on quality rather than quantity seems to achieve this, but the question remains whether this holds in general and is statistically significant.
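Such scaling laws are typically summarised as a power law, loss L(N) ≈ a · N^(−α) in model size N. A minimal sketch of recovering the exponent via a log-log fit (the numbers below are synthetic, for illustration only; in practice they would come from training runs at several scales):

```python
# A minimal sketch: fit a power-law scaling curve L(N) = a * N**(-alpha)
# to loss-versus-model-size measurements. Numbers are synthetic.
import numpy as np

N = np.array([1e6, 1e7, 1e8, 1e9])   # hypothetical model sizes
L = np.array([4.1, 3.2, 2.6, 2.1])   # hypothetical final losses

# A power law is a straight line in log-log coordinates.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha, a = -slope, np.exp(intercept)
print(f"fitted scaling law: L(N) ~ {a:.2f} * N^(-{alpha:.3f})")
# A higher-quality dataset would show up as a steeper exponent or a
# lower prefactor at the same model size.
```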
The recent work of Eldan and Li on TinyStories (a high-quality dataset synthetically generated to teach English to neural networks) demonstrated that using high-quality data can dramatically change the shape of the scaling laws, potentially allowing one to match the performance of large-scale models with much leaner training and models. Yet these are only empirical observations [Bubeck et al., "Sparks of artificial general intelligence: Early experiments with GPT-4"]. Can there be a theoretical proof that data quality would indeed improve training?
For that, one would first need to agree on quantitative metrics for characterising datasets. For example, datasets with human-generated labels can potentially be a good testbed for this.
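Two candidate metrics of this kind are sketched below: label entropy and the TwoNN intrinsic-dimension estimate (Facco et al., 2017). Random data stands in for a real labelled corpus; the choice of metrics is an illustrative assumption, not a fixed proposal.

```python
# A minimal sketch of two candidate dataset-characterisation metrics.
# Data is random and purely illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # placeholder feature vectors
labels = rng.integers(0, 5, size=1000)   # placeholder human labels

# Label entropy: how balanced/informative the human labelling is.
_, counts = np.unique(labels, return_counts=True)
p = counts / counts.sum()
entropy = -np.sum(p * np.log2(p))

# TwoNN intrinsic dimension: maximum-likelihood estimate from the
# ratio of second- to first-neighbour distances.
dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
mu = dists[:, 2] / dists[:, 1]           # column 0 is the point itself
intrinsic_dim = len(mu) / np.sum(np.log(mu))

print(f"label entropy: {entropy:.2f} bits, intrinsic dim: {intrinsic_dim:.1f}")
```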
This work is motivated by our recent publications on this topic: a paper on the embedding of the knowledge space in EPJ Data Science [1], and previous work [2], where we continue to explore geometric spaces of embeddings.
The main questions we ask ourselves in this work relate to the following directions:
1. What is the geometry of data structures? What are the geometric properties of embedding methods combined with dimensionality-reduction methods? How can we assess the geometry of the syntactic and semantic structures stored inside textual data?
Understanding the topology and geometry of data can enhance embedding methods, helping to resolve the curse of dimensionality as well as the ambiguity of answers that various LLMs may produce. Which geometric and algebraic properties encode the semantic and syntactic structures of texts? How can we decode textual (NLP) structures from the mathematical structures of high-dimensional data?
2. What are the key techniques (mathematical frameworks for designing the latent space of embeddings, such as affine spaces and metric spaces for tokenized information) that can help us open the black box of AI methods and design explainable embedding methods?
This relates to enhancing the theory of deep learning, and especially of explainable AI systems. As some research has shown, certain AI systems not only lack explainability but also deep theoretical foundations.
One way to approach this is to study so-called latent (or embedding) spaces.
In generative modeling, methods like variational autoencoders (VAEs) [Kingma & Welling, https://arxiv.org/abs/1312.6114] and generative adversarial networks (GANs) create latent spaces for the modeling and synthesis of data.
Understanding the properties of the latent spaces of VAEs and GANs will enable us to uncover the properties of DL models in general.
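To fix ideas, a minimal VAE sketch (the architecture and sizes are illustrative assumptions, not a reference implementation) makes the latent space in question concrete:

```python
# A minimal VAE sketch in PyTorch exposing the latent space whose
# geometry is discussed above. Architecture and sizes are illustrative.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(256, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

# The latent space to study is the set of codes z produced by the
# (trained) encoder; its geometry is what questions (a) and (b) below ask about.
```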
The (mathematical) foundations of latent spaces need to be built, enhanced, and studied, and new geometrical and mathematical tools are needed for that. The research questions one may ask include, but are not limited to:
a. If we were to embed all possible datasets in the world using any existing framework (VAEs, GANs, or any other technique), which properties would the latent space have? Would it be a continuous or smooth latent space [Liu et al., 'Latent space cartography', 2019], or would it have abrupt transitions, which are considered a sign of memorization? (See the interpolation sketch after this list.)
b. Scientists are often interested in studying the clustering and patterns of a latent space, while missing the question of the properties of the latent space itself. Which other mathematical structures could be applied here, such as algebraic structures like Clifford algebras, or concepts from affine spaces?
This is something that has partially been explored during the SEMF meetings.
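As a concrete probe of question (a), one can interpolate between two latent codes and decode along the path: a smooth latent space yields gradual changes, while abrupt jumps suggest memorization. A minimal sketch, reusing the illustrative VAE class from above (which would need to be trained first):

```python
# A minimal sketch probing question (a): decode points along a straight
# line between two latent codes and watch for abrupt jumps.
import torch

model = VAE()            # in practice: a trained model, not a fresh one
model.eval()

z0, z1 = torch.randn(2), torch.randn(2)   # two latent codes (dim 2)
with torch.no_grad():
    prev = None
    for t in torch.linspace(0.0, 1.0, steps=11):
        x = model.decoder((1 - t) * z0 + t * z1)
        if prev is not None:
            # Large jumps between consecutive decodings hint at a
            # non-smooth region (a possible sign of memorization).
            print(f"t={t:.1f}: step size {torch.norm(x - prev):.4f}")
        prev = x
```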
References:
[1] C. Singh, L. Tupikina, M. Starnini, M. Santolini, "Charting mobility patterns in the scientific knowledge landscape", EPJ Data Science (2023), arxiv.org/abs/2302.13054
[2] C. Singh, E. Barme, R. Ward, L. Tupikina, M. Santolini, "Quantifying the rise and fall of scientific fields", PLoS ONE 17(6) (2022)
[3] Iz Beltagy, Kyle Lo, and Arman Cohan, "SciBERT: A pretrained language model for scientific text", in Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)
[4] L. Tupikina, V. Balu, "Innovation Landscape: Visualizing Emerging Topics from Patents Data", submitted to UAI (2024)
[5] https://github.com/Liyubov/BunkaTopics_text_analysis