Project with Documentary campus and DM labs (Vlad Afanasiev)
"Embeddings are techniques in the theory of AI systems for encoding and decoding information; they are also ways through which we can see how machines see, and how other humans who are not like us see."
Advancements in AI offer numerous opportunities for the humanities to address diverse challenges, including image recognition, satellite-based identification of danger zones, and text translation. However, a common issue in contemporary AI systems is their lack of transparency and explainability, which impedes the development of fully integrated AI solutions for real-world problems.
To address this, we leverage mathematical methods from differential geometry (manifold theory, affine spaces and geometric algebras, stochastic processes, and dynamical systems analysis) and integrate them with state-of-the-art computer science methods (topological and geometric analysis in deep learning, embeddings, and dimensionality reduction algorithms). Since high-dimensional data is typically non-linear, which makes its patterns hard to interpret, developing robust, interpretable embedding methods for such analysis is of high importance (on this, please see our blog on infra-data analysis and the research project description on higher-order structures in embedding analysis).
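As a minimal illustration of why non-linearity matters here (a toy construction of ours, not project data): a circle is a one-dimensional manifold, yet once it is rotated into a high-dimensional space, any linear projection needs two components to capture it, and no single linear direction explains it. This is the basic reason linear methods fall short and non-linear embedding methods are needed.

```python
import numpy as np

rng = np.random.default_rng(2)
# A 1-dimensional nonlinear manifold (a circle) placed in R^10.
# Its intrinsic dimension is 1, but linear analysis (PCA/SVD)
# requires two components to account for the variance.
t = rng.uniform(0.0, 2.0 * np.pi, 500)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)   # points on the unit circle
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))      # random orthogonal rotation
X = circle @ Q[:2]                                  # rotate the circle into R^10

Xc = X - X.mean(axis=0)
sing = np.linalg.svd(Xc, compute_uv=False)
explained = sing**2 / (sing**2).sum()
print(explained[:3])   # two dominant components; the remaining ones are ~0
```

The first two variance ratios sum to (numerically) one, while the third is at machine precision: the data is linearly two-dimensional even though it is intrinsically one-dimensional.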
Through analysis of the latent spaces of AI models, we trace the patterns by which AI systems process information, which in turn contributes to designing explainable AI systems. Overall, new mathematical frameworks are needed to understand how AI systems create a new kind of 'mirror' of how human thinking is organised.
Context: human-machine interaction
We are interested in investigating why modern neural networks, trained on large amounts of data (see recent models here), often fail at simple tasks while being able to solve complex ones. The general problem of grounding embeddings and explainability in current AI systems is highly important, given that human-AI interaction systems are becoming ever more ubiquitous. Hence we aim to develop interdisciplinary approaches to this problem across the domains of AI and mathematics (neural network-based models in particular), in collaboration with neuroscientists.
Relation to computational linguistics and blind models
We hypothesize that in order to understand why AI models generate 'wrong text', we should consider the cognition of humans with specific abilities (e.g. visually impaired people, or people with certain limitations in reading or writing text). Through such collaborations we can then understand other dimensions of possible text-generation issues in AI systems (see current projects below).
Analysis of human vs. machine latent space
In particular, we have been working on mapping the mindspace of AI (a formulation intended for people outside of AI). Human cognitive space (mindspace) differs, at least in how we typically reason about it, from that of machines. People have mapped it using various association techniques over textual datasets to understand how we organise information. We live and learn in an embodied world, and a major difference lies in our perception and our ability to map information from this embodied perspective. Yet some people navigate multidimensional spaces differently from the rest of us, and may be closer to a more abstract, machine-like way of mapping information.
Numerical results
Machine learning methods for dimensionality reduction and projection often proceed in two main stages.
The first stage represents the data by a skeleton representation (e.g. a graph or a hypergraph).
The second stage embeds this skeleton representation into a lower-dimensional space using a corresponding embedding method, which induces a lower-dimensional metric (on the notion of distance, one can read more about metric graph properties in graph theory textbooks).
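The two stages above can be sketched in a minimal form. The sketch below (our own illustration, not the project's pipeline) uses a k-nearest-neighbour graph as the skeleton and Laplacian eigenmaps as the embedding step; the data, the choice of k, and the spectral method are all assumptions made for illustration.

```python
import numpy as np

def knn_graph(X, k):
    # Stage 1: skeleton representation -- a symmetric
    # k-nearest-neighbour graph over the data points
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    A = np.zeros((len(X), len(X)))
    for i in range(len(X)):
        nbrs = np.argsort(d[i])[1:k + 1]   # nearest neighbours, skipping self
        A[i, nbrs] = 1.0
    return np.maximum(A, A.T)              # symmetrize the adjacency matrix

def spectral_embed(A, dim=2):
    # Stage 2: embed the graph skeleton into a lower-dimensional space
    # via eigenvectors of the graph Laplacian (Laplacian eigenmaps)
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1:dim + 1]              # drop the trivial constant eigenvector

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))              # 60 points in 10 dimensions
Y = spectral_embed(knn_graph(X, k=5), dim=2)
print(Y.shape)                             # (60, 2)
```

Methods such as UMAP and t-SNE follow the same two-stage logic with more refined graph weights and optimisation objectives.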
The Whitney embedding theorem tells us that any smooth compact manifold of finite dimension n can be embedded into a Euclidean space of large enough dimension; in particular, the theorem guarantees an embedding into the Euclidean space of dimension 2n+1.
Mathematical lectures on algebraic structures that may be applicable to the description of higher-order structures (e.g. Lie algebras and their generalisations) are available at https://www.lektorium.tv/course/29864 .
Dataset description
arXiv is a unique dataset which contains both human-annotated data (tags, i.e. scientific categories) and texts, which we embed using several embedding models (such as T5 and others). Having this ground truth for the data allows us to verify the embedding model itself (while the model is used to embed textual information from a high-dimensional into a lower-dimensional space). Here and below we assume, for the moment, that the space is the Euclidean space R^n; we will discuss this assumption later. The data is described in our article here.
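One simple way to exploit this ground truth can be sketched as follows. The snippet uses synthetic vectors as a stand-in for arXiv paper embeddings (in the project these would come from a model such as T5), and the nearest-centroid check is our own illustrative choice, not the project's evaluation protocol.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for arXiv paper embeddings:
# two categories of 50 "papers" each, as vectors in R^8
labels = np.repeat([0, 1], 50)
emb = rng.normal(size=(100, 8)) + 3.0 * labels[:, None]

# Verification idea: since arXiv categories provide ground truth,
# check whether each paper lies closer to its own category centroid
# than to the other -- a basic sanity check on the embedding
centroids = np.stack([emb[labels == c].mean(axis=0) for c in (0, 1)])
dists = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=-1)
pred = dists.argmin(axis=1)
accuracy = float((pred == labels).mean())
print(accuracy)
```

A high nearest-centroid accuracy indicates that the embedding preserves the human-annotated category structure; a low one flags a mismatch between the model's geometry and the ground truth.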
Project formalisation and funding
This is a collaborative effort by an interdisciplinary team of researchers from computer science, mathematics, physics, digital art, music, and art history. Importantly, the project has several dimensions, including scientific and visual components developed together with researchers at Bell Labs (France), and collaborations with researchers from SEMF (Spain), LPI (France), and Dark Matter Labs (UK).
Current projects: Documentary Campus, Silbersalz Institute (Germany); internship student project (Bell Labs, France)
In preparation: Embed-days (together with Bunkalabs)
References
[1] C. Singh, L. Tupikina, M. Starnini, M. Santolini, "Charting mobility patterns in the scientific knowledge landscape", EPJ Data Science (2024). arXiv:2302.13054
[2] C. Singh, E. Barme, R. Ward, L. Tupikina, M. Santolini, "Quantifying the rise and fall of scientific fields", PLoS ONE 17(6) (2022)
[3] M. Grootendorst, "BERTopic: Neural topic modeling with a class-based TF-IDF procedure". arXiv:2203.05794
[4] D. Kobak, P. Berens, "The art of using t-SNE for single-cell transcriptomics", Nature Communications 10(1):5416 (2019). doi: 10.1038/s41467-019-13056-x
[5] A. Patania, F. Vaccarino, G. Petri, "Topological analysis of data", EPJ Data Science 6, 7 (2017). doi: 10.1140/epjds/s13688-017-0104-x