My research in AI

Toward a Theory of Learnability (summary)


Analytic learning theory via regret analysis

This research track is devoted to the theoretical analysis of the regret attached to a learning strategy. It is inspired by both the game-theoretic and the information-theoretic definitions of regret, and it provides insight into the fundamental limits of learning. Consider a problem defined by a source S of features, each represented by a vector x, with ground truth labels y (which can also be vectors). The source is described by a conditional probability p_S(y|x) linking a feature vector x to its ground truth label y. Given a sequence of T training features xT=(x1,x2,…,xT) (T an integer) and a ground truth sequence yT=(y1,y2,…,yT), we denote by PS(yT|xT) the conditional probability distribution of the ground truth sequence given the feature sequence of the considered problem. If the pairs are independent (a common hypothesis) then PS(yT|xT)=pS(y1|x1)...pS(yT|xT).
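As a minimal illustration of this setup, here is a sketch in Python of the i.i.d. factorization PS(yT|xT)=pS(y1|x1)...pS(yT|xT); the logistic form of p_S and the parameter w_true are hypothetical choices for illustration only.

```python
import math
import random

def p_S(y, x):
    """Conditional probability p_S(y|x) of a toy binary source:
    labels y in {-1, +1} follow a logistic law in a fixed direction."""
    w_true = [0.8, -0.5]                     # hypothetical source parameter
    dot = sum(wi * xi for wi, xi in zip(w_true, x))
    return 1.0 / (1.0 + math.exp(y * dot))   # sign convention of this page

def P_S(ys, xs):
    """Joint conditional probability P_S(y^T|x^T) of an i.i.d. source:
    the product p_S(y_1|x_1) ... p_S(y_T|x_T)."""
    prob = 1.0
    for y, x in zip(ys, xs):
        prob *= p_S(y, x)
    return prob

random.seed(0)
xs = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(5)]
ys = [1 if random.random() < p_S(1, x) else -1 for x in xs]
print(P_S(ys, xs))
```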

We denote by PL(yT|xT) the conditional probability attained by the learning process after T learning steps. The ideal learning would be PL=PS. A learning strategy is optimal if limT PL=PS in a practical sense, to be made precise with the regret metric RT. The first metric is based on the Kullback-Leibler divergence:

RKT(S,L)=E[log PS(yT|xT)-log PL(yT|xT)]

More interesting is the max divergence metric:

RMT(S,L)=maxyT[log PS(yT|xT)-log PL(yT|xT)].
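For small T both metrics can be computed exactly by enumerating the label sequences. The sketch below (Python; the feature vectors and the two weight vectors standing for the source and the learner are arbitrary illustrative choices) computes RKT as an expectation under the source and RMT as a worst case over yT:

```python
import math
from itertools import product

def P(ys, xs, w):
    # conditional probability of a label sequence under the page's
    # logistic form P(Y=y) = 1/(1+exp(y<w|x>))
    prob = 1.0
    for y, x in zip(ys, xs):
        prob *= 1.0 / (1.0 + math.exp(y * sum(wi * xi for wi, xi in zip(w, x))))
    return prob

def regrets(xs, w_source, w_learner):
    """KL regret RKT (expectation under the source) and max regret RMT
    (maximum over all label sequences), computed by enumeration."""
    kl, mx = 0.0, -math.inf
    for ys in product((-1, 1), repeat=len(xs)):
        ratio = math.log(P(ys, xs, w_source)) - math.log(P(ys, xs, w_learner))
        kl += P(ys, xs, w_source) * ratio
        mx = max(mx, ratio)
    return kl, mx

xs = [[1.0, 0.0], [0.5, 0.5]]
kl, mx = regrets(xs, [1.0, -1.0], [0.5, 0.0])  # hypothetical source / learner
print(kl, mx)
```

As expected, RKT is non-negative and RMT dominates it, since a maximum always exceeds an expectation of the same quantity.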

If we consider a class of problems S and a class L of learning strategies operating on a given set of training sequences xT, we can define the minmax regret as

R*T=minL in L maxS in S RMT(S,L).

This measures the performance of the best learning strategy in L against the worst problem in S. If S is the set of all problems we arrive at the bound R*T>=r*T,

with r*T=log(sum yT maxL in L PL(yT|xT)); the sum under the logarithm is often called the Shtarkov sum.
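When the class of strategies is finite, the Shtarkov sum can be evaluated numerically by enumerating all label sequences. A minimal sketch (Python; the feature vectors and the finite set of weight vectors are arbitrary illustrative choices):

```python
import math
from itertools import product

def logistic(y, x, w):
    # P(y|x) under the page's convention P(Y=y) = 1/(1+exp(y<w|x>))
    return 1.0 / (1.0 + math.exp(y * sum(wi * xi for wi, xi in zip(w, x))))

def shtarkov_log_sum(xs, strategies):
    """r*_T = log sum_{y^T} max_L P_L(y^T|x^T), with the max taken
    over a finite set of strategies (here: fixed weight vectors)."""
    total = 0.0
    for ys in product((-1, 1), repeat=len(xs)):
        best = max(
            math.prod(logistic(y, x, w) for y, x in zip(ys, xs))
            for w in strategies
        )
        total += best
    return math.log(total)

xs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
ws = [[0.0, 0.0], [2.0, -1.0], [-1.5, 0.5]]   # hypothetical finite class
print(shtarkov_log_sum(xs, ws))
```

The result is always non-negative: the sum under the logarithm dominates the total mass of any single strategy, which is 1.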


Regret in logistic regression

Logistic regression is a special class of learning algorithms. It consists in finding the best match between a sequence of feature vectors xT and ground truth labels yT within a family of distributions which depends on a parameter vector w and maps each feature vector xt to a random variable Yt. In the binary case we consider yT in {-1,1}T, and the probability of having Yt=yt is

P(Yt=yt)=1/(1+exp(yt<w|xt>)),

where <w|x> is the scalar product of the vectors w and x. The regression consists in finding the value of w which maximizes P(YT=yT). It turns out that for logistic regression the upper regret r*T is equal to

r*T=d/2 log T+log Cd,

where d is the dimension of the feature vectors and Cd is obtained via integration over Rd of a "discrepancy" function which depends on the problem.





When the features are uniform on the unit sphere in dimension d we have

Cd=e^(2/3)(2pi/d^2)^(d/4).

When the features are uniform in the unit ball we have a smaller value:

Cd=e^(1/6)(2pi/d^2)^(d/4).

This counter-intuitive result comes from the fact that in the ball the features are closer to each other than on the sphere. The picture below displays the discrepancy function for d=1 with the features on the unit circle.
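Assuming the closed forms for Cd stated above, a short numeric check (Python) compares the sphere and ball constants and plugs them into the regret bound r*T=d/2 log T+log Cd:

```python
import math

def C_sphere(d):
    # C_d = e^(2/3) * (2*pi/d^2)^(d/4) -- features uniform on the unit sphere
    return math.exp(2.0 / 3.0) * (2.0 * math.pi / d ** 2) ** (d / 4.0)

def C_ball(d):
    # C_d = e^(1/6) * (2*pi/d^2)^(d/4) -- features uniform in the unit ball
    return math.exp(1.0 / 6.0) * (2.0 * math.pi / d ** 2) ** (d / 4.0)

def minimax_regret(d, T, Cd):
    # r*_T = (d/2) log T + log C_d
    return 0.5 * d * math.log(T) + math.log(Cd)

for d in (2, 4, 8):
    print(d, C_sphere(d), C_ball(d), minimax_regret(d, 1000, C_sphere(d)))
```

Whatever the dimension, the two constants differ by the fixed factor e^(1/2), so the ball constant is always the smaller one.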

Generalized logistic regression

Logistic regression can be generalized to probability distribution classes other than P(Y=y)=1/(1+exp(y<w|x>)); the popularity of the latter comes from mathematical advantages which facilitate its use. But we can take P(Y=1)=p(<w|x>) and P(Y=-1)=1-p(<w|x>). A celebrated example is the Gaussian regression with p(v)=Phi(v), where Phi is the normal cumulative distribution function. Indeed, Gaussian regression came before logistic regression, under the name "probit", and was later abandoned because it was harder to compute (this was in the 1930s). With the increasing computational power of individual computers, this has not been an issue for decades. Provided that the functions -log p(v) and -log(1-p(v)) are convex, the formula for the discrepancy function is modified through a new expression for the operator B(w).
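For instance, the probit choice p=Phi can be sketched with the standard library alone, and the convexity hypothesis on -log p is easy to check numerically on a grid (the weight and feature values below are arbitrary illustrative choices):

```python
import math

def Phi(v):
    # standard normal cumulative function, via the error function
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def prob(y, x, w):
    """Generalized regression: P(Y=1) = p(<w|x>), P(Y=-1) = 1 - p(<w|x>),
    here with p = Phi (the probit model)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return Phi(dot) if y == 1 else 1.0 - Phi(dot)

# -log Phi(v) is convex: check the midpoint inequality on a small grid
f = lambda v: -math.log(Phi(v))
for a in (-2.0, -0.5, 1.0):
    for b in (-1.0, 0.5, 2.0):
        assert f((a + b) / 2) <= (f(a) + f(b)) / 2 + 1e-12

print(prob(1, [1.0, -1.0], [0.5, 0.25]))
```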

When w is restricted to a finite manifold M in Rd, the expression is modified accordingly, with PI(w) denoting the projection on the hyperplane tangent to M at the point w.


Logistic regression in Quantum AI

Quantum learning on a quantum system is a delicate concept. Let's take an example from optical communication with polarized photons. The feature vectors xt are polarization directions and the labels yt are the binary information to be sent (to simplify, 1 or 0). The aim is to use AI to determine the direction w of polarization measurement which gives the best matching, even in the presence of noise. The main difference with classical AI is that the feature vector is unknown and is only partially revealed by its measurement along an arbitrary but fixed polarization w. There is no way to repeat the measurement with a different direction w' because the initial state xt is destroyed by the first measurement, and the no-cloning theorem prevents making several copies of the same photon. The directions xt and w are unit complex vectors of dimension 2 (to simplify we assume pure states; to reflect a noisy environment we should replace xt by a density operator). The probability that the measurement is correct is |<w|xt>|^2 when yt=1 and 1-|<w|xt>|^2 when yt=0. Thus the system is equivalent to a generalized regression with probability function p(x)=|x|^2 (notice that the variable is now complex) operating on a manifold which is nothing else than the complex unit circle. The discrepancy function is displayed below (before being composed with the projection operator on the plane orthogonal to w). In the more general case, the operator |xt><xt| will be replaced by a density operator.
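The measurement model itself is easy to sketch (Python; the polarization angle below is an arbitrary example). The probability of a correct answer is |<w|xt>|^2 when yt=1 and its complement when yt=0:

```python
import math

def correct_prob(w, x, y):
    """Probability that measuring the photon state x along direction w
    gives the right label: |<w|x>|^2 when y = 1, 1 - |<w|x>|^2 when y = 0."""
    inner = sum(wi.conjugate() * xi for wi, xi in zip(w, x))
    overlap = abs(inner) ** 2
    return overlap if y == 1 else 1.0 - overlap

# two unit vectors in C^2 (polarization directions)
theta = math.pi / 8
x = (math.cos(theta), math.sin(theta))   # linear polarization at angle theta
w = (1.0 + 0.0j, 0.0 + 0.0j)             # measurement along the first axis
print(correct_prob(w, x, 1))
```

For this example the overlap reduces to cos(theta)^2, Malus's law for polarized light.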


Topological learning theory

This part of my research is mostly experimental, with some theoretical validations on toy models. Everything started at Bell Labs, where a challenge was given to the research staff: train a neural network capable of sorting numbers. To our surprise the classic neural network was unable to converge properly when trained on signed vectors of numbers, even for just finding the maximum. This was more than surprising because there exist trivial neural networks which exactly perform the extraction of the maximum. For example the formula

2 max{x1,x2}=[x1+x2]+ + [x1-x2]+ + [x2-x1]+ - [-x1-x2]+,

which is nothing else than a neural network with one ReLU layer (the [.]+ function), can easily be extended to vectors of arbitrary dimension.

In my experiments I randomly initialise many networks of the same dimension and submit them to stochastic gradient descent on the same training sequences x1,x2,…,xt,…. If w is the weight vector of the neural network, we denote by f(w,x) the prediction of the label for the feature x. Assuming k parallel networks, we denote by wi0 the initial weight vector of the ith network and by wit its weight vector after the feature xt. With features of dimension 2 the convergence is satisfactory, and for all x and i: limt f(wit,x)=max{x}. But when the dimension increases the convergence worsens. Each gradient descent converges, but not to the same local minimum, and the quality of the local minima is poor. To illustrate, I take k=9 and d=16, and I fix two test vectors x and y, so there are 9 trajectories made by the points (f(wit,x),f(wit,y)). The trajectories are intricate, and in the figure below I only display the final positions of these trajectories. They are very dispersed, as claimed; the red arrow gives the ground truth (max{x},max{y}).
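The one-ReLU-layer identity for the maximum of two numbers can be transcribed and verified directly in Python:

```python
def relu(v):
    # the [.]+ function: max(v, 0)
    return max(v, 0.0)

def max2(x1, x2):
    """One-ReLU-layer network computing the maximum of two numbers:
    2*max{x1,x2} = [x1+x2]+ + [x1-x2]+ + [x2-x1]+ - [-x1-x2]+"""
    return (relu(x1 + x2) + relu(x1 - x2) + relu(x2 - x1) - relu(-x1 - x2)) / 2

# the identity holds for every sign pattern of the inputs
for a, b in [(3.0, 1.0), (-3.0, -1.0), (-1.0, 3.0), (0.0, 0.0)]:
    assert max2(a, b) == max(a, b)
print(max2(2.5, -4.0))  # → 2.5
```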

After using toy models, which do not necessarily reflect the non-smooth and non-convex aspects of the actual loss functions, I arrived at the conjecture that the optimal network weight values are unattainable by gradient descent when they have the zero mean weight property. In short, the zero mean weight property holds when the weight vector looks like a random sequence of -1, +1, 0. In the next figure I fix a ground truth given by a network built from a zero mean weight vector b: g(x)=f(b,x). The feature dimension is still 16; the final positions of the 9 trajectories are dispersed, and the red arrow indicates the ground truth (g(x),g(y)).