Infra-data analysis, general perspectives: blog #1
Main concepts and general introduction to infra-data analysis
Overview of existing approaches to learn data structures and data textures
We often face the assumption that data must be treated as an essentially discrete object; this is, for example, a natural representation for sets of points in a high-dimensional feature space. However, we often need to apply frameworks from continuous mathematics, such as manifold learning [Bronstein lectures], to study these structures. Our main thesis here is that we may need to develop and adapt a specific theory to study such systems.
Moreover, treating these points as embedded in Euclidean space can lead to further drawbacks, so new theories and approaches are needed to identify alternatives to Euclidean spaces (Clifford algebras and affine spaces being possible candidates).
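To make the tension between the discrete and the continuous view concrete, here is a minimal sketch in Python (standard library only). It samples points on a half circle, a toy one-dimensional manifold chosen purely for illustration, and compares the straight-line Euclidean distance between the two endpoints with the shortest-path distance along a nearest-neighbour graph, which is the first step of Isomap-style manifold learning. The function names, the sample size and the neighbourhood size k are assumptions of this example, not part of any method discussed in this blog.

```python
import heapq
import math


def half_circle(n):
    """Sample n points evenly on the upper unit half circle."""
    return [(math.cos(math.pi * i / (n - 1)), math.sin(math.pi * i / (n - 1)))
            for i in range(n)]


def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph: node -> {neighbour: edge length}."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j)
                         for j, q in enumerate(points) if j != i)[:k]
        for d, j in nearest:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def shortest_path_length(graph, source, target):
    """Dijkstra's algorithm; returns inf if the target is unreachable."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if u == target:
            return d
        if d > best.get(u, math.inf):
            continue  # stale queue entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(queue, (nd, v))
    return math.inf


points = half_circle(50)
graph = knn_graph(points, k=2)

# Chord through the ambient plane versus a path that stays on the data.
euclidean = math.dist(points[0], points[-1])
geodesic = shortest_path_length(graph, 0, len(points) - 1)
print(f"Euclidean: {euclidean:.3f}, graph geodesic: {geodesic:.3f}")
```

The chord between the endpoints has length 2, while the graph path stays close to the arc length π. The ambient Euclidean distance therefore underestimates how far apart the two points are along the data itself.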
Let us consider a simple pipeline that is often used for data treatment (see Figure). Let us say that we have essentially discrete data D (points; for now we do not consider time series as possible data points), and we first learn from it a hypergraph structure, which we call H(D) and will define a bit later. We aim for this structure to carry an imprint of the properties of the data. We will also discuss its connection to the so-called induced metrics of the manifold M(D) / latent space V(D). For the construction of M(D) and V(D) we will rely heavily on concepts from infrageometric calculus [7]. We will also apply these techniques to the data analysis showcased in [8,9].
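Since H(D) is only defined later in this series, the following Python sketch uses a stand-in construction that is purely an assumption for illustration: each point together with its k nearest neighbours forms one hyperedge. The function names `knn_hypergraph` and `incidence_matrix`, the toy data and the choice k = 2 are hypothetical and are not the definition of H(D) used in this blog.

```python
import math


def knn_hypergraph(points, k):
    """Assumed stand-in for H(D): one hyperedge per point, containing the
    point itself and its k nearest neighbours (Euclidean distance)."""
    hyperedges = set()
    for i, p in enumerate(points):
        neighbours = sorted((math.dist(p, q), j)
                            for j, q in enumerate(points) if j != i)[:k]
        hyperedges.add(frozenset([i] + [j for _, j in neighbours]))
    return hyperedges


def incidence_matrix(n_points, hyperedges):
    """Rows are points, columns are hyperedges; entry 1 if the point belongs."""
    ordered = sorted(hyperedges, key=sorted)
    return [[1 if i in edge else 0 for edge in ordered]
            for i in range(n_points)]


# Toy data D: two well-separated clusters of three points each.
D = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

H = knn_hypergraph(D, k=2)
B = incidence_matrix(len(D), H)

for edge in sorted(H, key=sorted):
    print(sorted(edge))
```

On this toy data the duplicate hyperedges collapse and the two remaining ones coincide with the two clusters, which is one simple sense in which such a structure can carry an imprint of the data. The incidence matrix B is a convenient starting point for any later algebraic treatment.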
Very often, AI engineers use cosine similarity as their distance function because that is what OpenAI recommends for its embeddings. Other embeddings may be optimized for different strategies. pgvector supports L2 distance, inner product, and cosine distance [12].
However, the exact choice of measure is often left aside and not discussed in thorough mathematical detail. Some works that considered this topic are listed in a 2016 paper [13].
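As a small illustration of why the choice of measure matters, the Python sketch below (standard library only) implements the three measures named above. It shows that they can disagree about the nearest neighbour of a query when vectors are not normalised, while for unit-length vectors the squared L2 distance equals twice the cosine distance, so the two give the same ranking. The vectors and helper names are made up for this example; the operators mentioned in the comments are the ones pgvector documents for these measures.

```python
import math


def l2_distance(a, b):
    """Euclidean distance (pgvector operator: <->)."""
    return math.dist(a, b)


def inner_product(a, b):
    """Dot product; larger means more similar
    (pgvector's <#> operator returns its negative)."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity (pgvector operator: <=>)."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / norm


def normalise(a):
    """Rescale a vector to unit length."""
    length = math.sqrt(inner_product(a, a))
    return [x / length for x in a]


# Hypothetical query and candidates, chosen so the measures disagree.
query = [1.0, 0.0]
candidates = {
    "short_aligned": [0.5, 0.0],
    "long_tilted": [3.0, 1.0],
    "near_off_axis": [1.0, 0.4],
}

nearest_l2 = min(candidates,
                 key=lambda name: l2_distance(query, candidates[name]))
nearest_cos = min(candidates,
                  key=lambda name: cosine_distance(query, candidates[name]))
nearest_ip = max(candidates,
                 key=lambda name: inner_product(query, candidates[name]))
print(nearest_l2, nearest_cos, nearest_ip)

# For unit vectors: |u - v|^2 = 2 - 2 u.v = 2 * cosine_distance(u, v).
u, v = normalise([3.0, 1.0]), normalise([1.0, 0.4])
gap = abs(l2_distance(u, v) ** 2 - 2.0 * cosine_distance(u, v))
print(gap < 1e-12)
```

Each measure picks a different candidate here: L2 distance prefers the closest point in space, cosine distance the best-aligned direction, and inner product the vector with the largest projection onto the query. Normalising the embeddings removes the disagreement between L2 and cosine distance, which is one reason the choice deserves an explicit discussion.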
Concepts and other main references
Infrageometric calculus is brilliantly introduced in the Wolfram Institute videos.
The definition of the Bernstein algebra A for a graph or hypergraph G can be found in the work of Grishkov and Costa. This concept will be especially useful for identifying the tangent spaces of the manifold under consideration.
More explanations of the Bernstein algebra concepts and their applications will follow in the next blog.
Figure 1. The usual framework for the treatment of data, and the possible infra-data analysis framework proposed here.
Figure 2. The proposed new pipeline, which employs topological data analysis and geometric algebra to create explainable AI algorithms and transparent algorithms for data processing and data analysis.
Figure 3. The new representation and the transitions between discrete and continuous representations of systems.
References:
[1] Kusters, R., Misevic, D., Berry, H., Cully, A., Le Cunff, Y., Dandoy, L., ... & Tupikina, L. (2020). Interdisciplinary Research in Artificial Intelligence: Challenges and Opportunities. Frontiers in Big Data, 3, 45. Retrieved from https://www.frontiersin.org/articles/10.3389/fdata.2020.577974/full
[2] Patania, A., Vaccarino, F., & Petri, G. (2017). Topological analysis of data. EPJ Data Science, 6, 7. https://doi.org/10.1140/epjds/s13688-017-0104-x
[3] Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv, arXiv:2203.05794. Retrieved from http://arxiv.org/abs/2203.05794
[4] Kobak, D., & Berens, P. (2019). The art of using t-SNE for single-cell transcriptomics. Nature Communications, 10(1), 5416. Retrieved from https://www.nature.com/articles/s41467-019-13056-x
[5] SDGs Human development report 2020: The next frontier - human development and the anthropocene - world. (2020). ReliefWeb. Retrieved from https://hdr.undp.org/content/human-development-report-2020
[6] Ramaciotti, P., Cointet, J. P., Muñoz Zolotoochin, G., Fernández Peralta, A., Iñiguez, G., & Pournaki, A. (2022). Inferring attitudinal spaces in social networks. Social Network Analysis and Mining, 13(1), 14.
[7] Zapata-Carratala, C., & Bajaj, U. (in preparation). Introduction to Infrageometry.
[8] C. Singh, L. Tupikina, M. Starnini, M. Santolini “Charting mobility patterns in the scientific knowledge landscape” Nat.Comms arxiv.org/abs/2302.13054 arxiv, EPJ data science (2023)
[9] C. Singh, E. Barme, R. Ward, L. Tupikina, M. Santolini “Quantifying the rise and fall of scientific fields”, Plos One 17(6): (2022)
[10] Dani Hadidaneshmand https://hadidaneshmand.github.io/dhadi.html
[11] References for the code and repositories https://github.com/Liyubov/BunkaTopics_text_analysis
[12] https://bawolf.substack.com/p/embeddings-are-a-good-starting-point