Abstract
An active area of research in AI is the theory of manifold learning: finding lower-dimensional manifold representations of data in order to learn its geometry and provide better-quality curated datasets [1]. There are, however, various issues with these methods related to finding low-dimensional representations of the data, the so-called curse of dimensionality [2]. Geometric deep learning methods often include a set of assumptions on the geometry of the feature space. Some of these assumptions include pre-selected metrics on the feature space and the use of an underlying graph structure, which encodes the proximity of data points. However, the latter assumption of using a graph as the underlying discrete structure encodes only binary pairwise relations between data points, preventing us from capturing more complex higher-order relationships, which are often present in various systems [5]. These assumptions on the data, together with the data being discrete and finite, may cause over-generalisation, which may lead to wrong interpretations of the data and of the models that produce the embeddings, such as BERT and others. The objective of our research is twofold: to develop an alternative framework and to characterize embedding methods using higher-order structures. For this we explore the underlying graph assumption, substituting it with a hypergraph structure. We also demonstrate the embedding characterization on the use case of arXiv data.
This is joint work (accepted to the proceedings of the 13th Conference on Networks and their Applications, 2024) and part of an internship at Bell Labs, France; a collaboration between Liubov Tupikina and Hritika Kathuria.
Methodological overview
Our method is based on the construction of a neighborhood hypergraph, which we use instead of the standard neighborhood graph structure.
Neighborhood hypergraph construction method. The neighborhood hypergraph H_n is constructed from the data points {D_i}, i = 1, ..., n, based on the idea of estimating relations between the data points. If data points (D_i1, ..., D_ik) are close to each other, they form a hyperedge of arity k in H_n. Depending on the nature of the data, there are different ways to characterize the closeness of data points, based on their geometrical or semantic proximity.

Geometrical proximity of points {D_i} is estimated using a typical l2 metric or cosine similarity applied to the coordinates of each processed data point D_i. Semantic proximity of points is encoded by the labelled similarity of data points: for instance, in the case of arXiv data, if paper D1 has tags a, b, c, paper D2 has tags a, b, and paper D3 has tags a, d, then these papers are encoded as nodes of the hypergraph forming a hyperedge associated with the field a.

One can also construct another encoding of the higher-order structure from the data using the metadata {m_i} information: the hypergraph H_m, which encodes relations between the tags themselves and can characterize the closeness of fields, instead of the standard binary graph representation of data as co-tag networks [3]. Here the tags a, b, c are nodes of the hypergraph, and these nodes share a hyperedge iff there is at least one paper with all these tags, like paper D1 in this case.

Our intuition behind using the neighborhood hypergraph representation of data is based on the observation that in much textual data the similarities between documents are not based on a single measure of closeness but rather on higher-order relations, as between words: the word France is close to Italy and Slovenia in a different way than to the word Paris.
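The semantic construction above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation; the toy corpus mirrors the D1/D2/D3 example from the text, and all names (neighborhood_hypergraph, metadata_hypergraph) are our own for this sketch.

```python
from collections import defaultdict

# Toy corpus from the text: each paper D_i carries a set of tags (fields).
papers = {
    "D1": {"a", "b", "c"},
    "D2": {"a", "b"},
    "D3": {"a", "d"},
}

def neighborhood_hypergraph(papers):
    """H_n via semantic proximity: all papers sharing a tag form one
    hyperedge per tag; singleton groups encode no relation and are dropped."""
    by_tag = defaultdict(set)
    for paper, tags in papers.items():
        for tag in tags:
            by_tag[tag].add(paper)
    return {tag: frozenset(ps) for tag, ps in by_tag.items() if len(ps) >= 2}

def metadata_hypergraph(papers):
    """Dual encoding H_m: the tags of a single paper share a hyperedge,
    so tags a, b, c are linked because paper D1 carries all of them."""
    return {paper: frozenset(tags) for paper, tags in papers.items()}

H_n = neighborhood_hypergraph(papers)
H_m = metadata_hypergraph(papers)
print(H_n)  # field a yields the 3-ary hyperedge {D1, D2, D3}
```

A geometric variant would replace the tag lookup with, e.g., l2-ball or cosine-similarity neighborhoods over embedding coordinates; the hyperedge-per-neighborhood principle stays the same.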
Conclusions and perspectives
Based on the recent results (see the github folder of the project here) we are interested in further investigating embeddings using higher-order theory (hypergraph theory and higher-order motif rewriting).
Recently we analysed a large arXiv dataset and found that for some categories an LLM (the BERT model) produced misclassified clusters of points; to zoom in on these clustering issues, we studied the subhypergraphs participating in each misclassified cluster. This also gave us the possibility to observe and question the nature of the "ground-truth cluster". The main idea we explore here is how the encoding of higher-order structures (such as hypergraphs and simplicial complexes) into lower-order structures can help to encode complex projections or embeddings of higher-dimensional data (such as textual data).
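The "zooming in" step described above amounts to restricting the hypergraph to the nodes of a misclassified cluster. A minimal sketch of that restriction, under the assumption that hyperedges are stored as tag-to-paper-set mappings as in the construction section (the function name and the toy data are ours):

```python
def induced_subhypergraph(hyperedges, nodes):
    """Restrict each hyperedge to a node subset; hyperedges that shrink
    to fewer than two nodes carry no relation and are dropped."""
    nodes = set(nodes)
    sub = {}
    for name, edge in hyperedges.items():
        restricted = frozenset(edge) & nodes
        if len(restricted) >= 2:
            sub[name] = restricted
    return sub

# Hypothetical hyperedges (tag -> papers) and a misclassified cluster
H = {"a": {"D1", "D2", "D3"}, "b": {"D1", "D2"}, "d": {"D3", "D4"}}
cluster = {"D1", "D3", "D4"}
print(induced_subhypergraph(H, cluster))
# hyperedge b collapses to a single node inside the cluster and is dropped
```

Comparing such induced subhypergraphs against the clusters returned by the embedding model is one way to make the "ground-truth cluster" question concrete.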