Infra-data analysis, general perspectives: blog #1

Main concepts and a general introduction to infra-data analysis

Overview of existing approaches to learning data structures and data textures

We often encounter the assumption that data must be treated as an essentially discrete object; this is a natural representation for sets of points in a high-dimensional feature space. However, we often need to apply continuous mathematical frameworks, such as manifold learning [Bronstein lectures], to study these structures. Our main thesis here is that we may need to develop and adapt a specific theory to study such systems.

Moreover, embedding these points into Euclidean space can introduce further drawbacks, so new theories and approaches are needed to identify alternatives to Euclidean spaces (Clifford algebras and affine spaces being possible candidates).

Let us consider a simple pipeline that often exists for data treatment (see Figure 1). Suppose we have essentially discrete data D (points; we do not yet consider time series as possible data points). We first learn a hypergraph structure from it, which we call H(D) and will define a bit later. We aim for this structure to carry the imprinted properties of the data. We will also discuss its connection to the so-called induced metrics of the manifold M(D)/latent space V(D). For the construction of M(D) and V(D) we will, perhaps surprisingly, use concepts from infrageometric calculus [7]. We will also apply these techniques to the analysis of data, showcased in [8,9].
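H(D) is only defined later in this series, so the following is a placeholder rather than the construction used here. As one common way to get a hypergraph from a point cloud, the sketch below builds a k-nearest-neighbour hypergraph: each point contributes one hyperedge made of itself and its k nearest neighbours. The function name `knn_hypergraph`, the parameter `k`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def knn_hypergraph(points, k):
    """Return a set of hyperedges (frozensets of point indices).

    Assumption: a hyperedge is a point together with its k nearest
    neighbours under Euclidean distance. This is a stand-in, not the
    definition of H(D) used in this series.
    """
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Exclude each point from its own neighbour list.
    np.fill_diagonal(dists, np.inf)

    hyperedges = set()
    for i in range(len(points)):
        neighbours = np.argsort(dists[i], kind="stable")[:k]
        hyperedges.add(frozenset({i, *(int(j) for j in neighbours)}))
    return hyperedges

# Toy data D: two well-separated clusters of three points each.
D = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
H = knn_hypergraph(D, k=2)
print(sorted(sorted(edge) for edge in H))
```

On this toy data the six per-point hyperedges collapse into two, one per cluster, which is the kind of "imprinted" property of the data one would hope the hypergraph retains. Note that this construction already relies on a Euclidean distance, which is exactly the choice questioned above.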

AI engineers very often use cosine similarity as the distance function because that is what OpenAI recommends for its embeddings. Other embeddings may be optimized for different measures. The pgvector extension supports L2 distance, inner product, and cosine distance [12].
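To see why the choice among these three measures matters, here is a minimal sketch in plain NumPy (it does not call pgvector). The two candidate vectors, their names, and the helper functions are illustrative assumptions. It shows that the nearest neighbour of a query can change with the measure when vectors are not normalized, and that the three measures agree once all vectors have unit length.

```python
import numpy as np

def l2_distance(a, b):
    """Euclidean (L2) distance; smaller means closer."""
    return float(np.linalg.norm(a - b))

def inner_product(a, b):
    """Inner product; larger means closer."""
    return float(np.dot(a, b))

def cosine_distance(a, b):
    """One minus cosine similarity; smaller means closer."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query, candidates, score, larger_is_closer=False):
    """Name of the candidate that is closest to the query under `score`."""
    scores = {name: score(query, vec) for name, vec in candidates.items()}
    pick = max if larger_is_closer else min
    return pick(scores, key=scores.get)

def rank_all(query, candidates):
    """Nearest candidate under each of the three measures."""
    return {
        "l2": nearest(query, candidates, l2_distance),
        "inner_product": nearest(query, candidates, inner_product,
                                 larger_is_closer=True),
        "cosine": nearest(query, candidates, cosine_distance),
    }

# Hypothetical toy vectors, chosen so that the measures disagree.
query = np.array([1.0, 0.0])
candidates = {
    "short_aligned": np.array([0.5, 0.0]),  # same direction, small norm
    "long_tilted": np.array([3.0, 1.0]),    # slightly off direction, large norm
}

raw = rank_all(query, candidates)

# After scaling every candidate to unit length, the measures agree.
unit_candidates = {n: v / np.linalg.norm(v) for n, v in candidates.items()}
unit = rank_all(query, unit_candidates)

print("raw: ", raw)
print("unit:", unit)
```

With the raw vectors, the inner product prefers the long vector because it rewards large norms, while L2 and cosine distance prefer the aligned one. For unit-length vectors the three measures are monotonically related, so they return the same nearest neighbour.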

However, the exact choice of measure is often left aside and not discussed in thorough mathematical detail. Some works that have considered this topic are listed in a 2016 paper [13].



Concepts and other main references


Infrageometric calculus is defined and brilliantly introduced in the Wolfram Institute videos.


The definition of the Bernstein algebra A of a graph or hypergraph G can be found in the work of Grishkov and Costa. This concept will be particularly useful for identifying the manifold and its tangent spaces.
More explanation of the Bernstein algebra concept and its applications will follow in the next blog.


Figure 1. The usual framework for the treatment of data, and the possible infra-data-analysis framework proposed here.

Figure 2. The proposed new pipeline, which employs topological data analysis and geometric algebra to create explainable AI algorithms and transparent algorithms for data processing and analysis.

Figure 3. The new representation of discrete and continuous systems and the transitions between them.


References: