My research in AI

Toward a Theory of Learnability (summary)


Analytic learning theory via regret analysis

This research track is devoted to the theoretical analysis of the regret attached to a learning strategy. It is inspired by both the game-theoretic and the information-theoretic definitions of regret, and it provides insight into the fundamental limits of learning. Consider a problem defined by a source S of features, each represented by a vector x, with ground truth labels y (which can also be vectors). The source is described by a conditional probability p_S(y|x) linking a feature vector x to its ground truth label y. Given a sequence of T training features xT=(x1,x2,…,xT) (T an integer) and a ground truth sequence yT=(y1,y2,…,yT), we denote by PS(yT|xT) the conditional probability distribution of the ground truth sequence given the feature sequence of the considered problem. If the pairs are independent (a common hypothesis) then PS(yT|xT)=pS(y1|x1)...pS(yT|xT).
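As a minimal illustration of this setup, here is a sketch in Python of the i.i.d. factorization PS(yT|xT)=pS(y1|x1)...pS(yT|xT); the logistic form of p_S and the parameter w_true are hypothetical choices for illustration only.

```python
import math
import random

def p_S(y, x):
    """Conditional probability p_S(y|x) of a toy binary source:
    labels y in {-1, +1} follow a logistic law in a fixed direction."""
    w_true = [0.8, -0.5]                     # hypothetical source parameter
    dot = sum(wi * xi for wi, xi in zip(w_true, x))
    return 1.0 / (1.0 + math.exp(y * dot))   # sign convention of this page

def P_S(ys, xs):
    """Joint conditional probability P_S(y^T|x^T) of an i.i.d. source:
    the product p_S(y_1|x_1) ... p_S(y_T|x_T)."""
    prob = 1.0
    for y, x in zip(ys, xs):
        prob *= p_S(y, x)
    return prob

random.seed(0)
xs = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(5)]
ys = [1 if random.random() < p_S(1, x) else -1 for x in xs]
print(P_S(ys, xs))
```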

We denote by PL(yT|xT) the conditional probability attained by the learning process after T learning steps. The ideal learning would be PL=PS. A learning strategy is optimal if limT PL=PS in a practical sense, to be made precise with the regret metric RT. The first metric is based on the Kullback-Leibler divergence:

RKT(S,L)=E[log PS(yT|xT)-log PL(yT|xT)]

More interesting is the max divergence metric:

RMT(S,L)=maxyT[log PS(yT|xT)-log PL(yT|xT)].
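For small T both metrics can be computed exactly by enumerating the label sequences. The sketch below (Python; the feature vectors and the two weight vectors standing for the source and the learner are arbitrary illustrative choices) computes RKT as an expectation under the source and RMT as a worst case over yT:

```python
import math
from itertools import product

def P(ys, xs, w):
    # conditional probability of a label sequence under the page's
    # logistic form P(Y=y) = 1/(1+exp(y<w|x>))
    prob = 1.0
    for y, x in zip(ys, xs):
        prob *= 1.0 / (1.0 + math.exp(y * sum(wi * xi for wi, xi in zip(w, x))))
    return prob

def regrets(xs, w_source, w_learner):
    """KL regret RKT (expectation under the source) and max regret RMT
    (maximum over all label sequences), computed by enumeration."""
    kl, mx = 0.0, -math.inf
    for ys in product((-1, 1), repeat=len(xs)):
        ratio = math.log(P(ys, xs, w_source)) - math.log(P(ys, xs, w_learner))
        kl += P(ys, xs, w_source) * ratio
        mx = max(mx, ratio)
    return kl, mx

xs = [[1.0, 0.0], [0.5, 0.5]]
kl, mx = regrets(xs, [1.0, -1.0], [0.5, 0.0])  # hypothetical source / learner
print(kl, mx)
```

As expected, RKT is non-negative and RMT dominates it, since a maximum always exceeds an expectation of the same quantity.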

If we consider a class of problems S and a class L of learning strategies operating on a given set of training sequences xT, we can define the minmax regret as

R*T=minL in L maxS in S RMT(S,L).

This measures the performance of the best learning strategy in L against the worst problem in S. If S is the set of all problems we arrive at the bound R*T>=r*T,

with r*T=log(sum yT maxL in L PL(yT|xT)); the sum under the logarithm is often called the Shtarkov sum.
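When the class of strategies is finite, the Shtarkov sum can be evaluated numerically by enumerating all label sequences. A minimal sketch (Python; the feature vectors and the finite set of weight vectors are arbitrary illustrative choices):

```python
import math
from itertools import product

def logistic(y, x, w):
    # P(y|x) under the page's convention P(Y=y) = 1/(1+exp(y<w|x>))
    return 1.0 / (1.0 + math.exp(y * sum(wi * xi for wi, xi in zip(w, x))))

def shtarkov_log_sum(xs, strategies):
    """r*_T = log sum_{y^T} max_L P_L(y^T|x^T), with the max taken
    over a finite set of strategies (here: fixed weight vectors)."""
    total = 0.0
    for ys in product((-1, 1), repeat=len(xs)):
        best = max(
            math.prod(logistic(y, x, w) for y, x in zip(ys, xs))
            for w in strategies
        )
        total += best
    return math.log(total)

xs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
ws = [[0.0, 0.0], [2.0, -1.0], [-1.5, 0.5]]   # hypothetical finite class
print(shtarkov_log_sum(xs, ws))
```

The result is always non-negative: the sum under the logarithm dominates the total mass of any single strategy, which is 1.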


Regret in logistic regression

Logistic regression is a special class of learning algorithms. It consists in finding the best match between a sequence of feature vectors xT and ground truth labels yT within a family of distributions which depends on a parameter vector w and maps each feature vector xt to a random variable Yt. In the binary case we consider yT in {-1,1}T, and the probability of having Yt=yt is

P(Yt=yt)=1/(1+exp(yt<w|xt>)),

where <w|x> is the scalar product of the vectors w and x. The regression consists in finding the value of w which maximizes P(YT=yT). It turns out that for logistic regression the upper regret r*T is equal to

r*T=d/2 log T+log Cd,

where d is the dimension of the feature vectors and Cd is obtained via integration over Rd of a "discrepancy" function which depends on the problem.





When the features are uniform on the unit sphere in dimension d we have

Cd=e^(2/3)(2pi/d^2)^(d/4).

When the features are uniform in the unit ball we have a smaller value:

Cd=e^(1/6)(2pi/d^2)^(d/4).

This counter-intuitive result comes from the fact that in the ball the features are closer to each other than on the sphere. The picture below displays the discrepancy function for d=1 with the features on the unit circle.
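Assuming the closed forms for Cd stated above, a short numeric check (Python) compares the sphere and ball constants and plugs them into the regret bound r*T=d/2 log T+log Cd:

```python
import math

def C_sphere(d):
    # C_d = e^(2/3) * (2*pi/d^2)^(d/4) -- features uniform on the unit sphere
    return math.exp(2.0 / 3.0) * (2.0 * math.pi / d ** 2) ** (d / 4.0)

def C_ball(d):
    # C_d = e^(1/6) * (2*pi/d^2)^(d/4) -- features uniform in the unit ball
    return math.exp(1.0 / 6.0) * (2.0 * math.pi / d ** 2) ** (d / 4.0)

def minimax_regret(d, T, Cd):
    # r*_T = (d/2) log T + log C_d
    return 0.5 * d * math.log(T) + math.log(Cd)

for d in (2, 4, 8):
    print(d, C_sphere(d), C_ball(d), minimax_regret(d, 1000, C_sphere(d)))
```

Whatever the dimension, the two constants differ by the fixed factor e^(1/2), so the ball constant is always the smaller one.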

Generalized logistic regression

Logistic regression can be generalized to probability distribution classes other than P(Y=y)=1/(1+exp(y<w|x>)); the popularity of the latter comes from mathematical advantages which facilitate its use. But we can take P(Y=1)=p(<w|x>) and P(Y=-1)=1-p(<w|x>). A celebrated example is the Gaussian regression with p(v)=Phi(v), where Phi is the normal cumulative distribution function. Indeed, Gaussian regression came before logistic regression, under the name "probit", and was later abandoned because it was harder to compute (this was in the 1930s). With the increasing computational power of individual computers, this has not been an issue for decades. Provided that the functions -log p(v) and -log(1-p(v)) are convex, the formula for the discrepancy function is modified through a new expression for the operator B(w).
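For instance, the probit choice p=Phi can be sketched with the standard library alone, and the convexity hypothesis on -log p is easy to check numerically on a grid (the weight and feature values below are arbitrary illustrative choices):

```python
import math

def Phi(v):
    # standard normal cumulative function, via the error function
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def prob(y, x, w):
    """Generalized regression: P(Y=1) = p(<w|x>), P(Y=-1) = 1 - p(<w|x>),
    here with p = Phi (the probit model)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return Phi(dot) if y == 1 else 1.0 - Phi(dot)

# -log Phi(v) is convex: check the midpoint inequality on a small grid
f = lambda v: -math.log(Phi(v))
for a in (-2.0, -0.5, 1.0):
    for b in (-1.0, 0.5, 2.0):
        assert f((a + b) / 2) <= (f(a) + f(b)) / 2 + 1e-12

print(prob(1, [1.0, -1.0], [0.5, 0.25]))
```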

When w is restricted to a finite manifold M in Rd, the expression is modified accordingly, with PI(w) denoting the projection on the hyperplane tangent to M at the point w.


Logistic regression in Quantum AI

Quantum learning on a quantum system is a delicate concept. Let's take an example from optical communication with polarized photons. The feature vectors xt are polarization directions and the labels yt are the binary information to be sent (to simplify, 1 or 0). The aim is to use AI to determine the direction w of polarization measurement which gives the best matching, even in the presence of noise. The main difference with classical AI is that the feature vector is unknown and is only partially revealed by its measurement along an arbitrary but fixed polarization w. There is no way to repeat the measurement with a different direction w' because the initial state xt is destroyed by the first measurement, and the no-cloning theorem prevents making several copies of the same photon. The directions xt and w are unit complex vectors of dimension 2 (to simplify we assume pure states; to reflect a noisy environment we should replace xt by a density operator). The probability that the measurement is correct is |<w|xt>|^2 when yt=1 and 1-|<w|xt>|^2 when yt=0. Thus the system is equivalent to a generalized regression with probability function p(x)=|x|^2 (notice that the variable is now complex) operating on a manifold which is nothing else than the complex unit circle. The discrepancy function is displayed below (before being composed with the projection operator on the plane orthogonal to w). In the more general case, the operator |xt><xt| will be replaced by a density operator.
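The measurement model itself is easy to sketch (Python; the polarization angle below is an arbitrary example). The probability of a correct answer is |<w|xt>|^2 when yt=1 and its complement when yt=0:

```python
import math

def correct_prob(w, x, y):
    """Probability that measuring the photon state x along direction w
    gives the right label: |<w|x>|^2 when y = 1, 1 - |<w|x>|^2 when y = 0."""
    inner = sum(wi.conjugate() * xi for wi, xi in zip(w, x))
    overlap = abs(inner) ** 2
    return overlap if y == 1 else 1.0 - overlap

# two unit vectors in C^2 (polarization directions)
theta = math.pi / 8
x = (math.cos(theta), math.sin(theta))   # linear polarization at angle theta
w = (1.0 + 0.0j, 0.0 + 0.0j)             # measurement along the first axis
print(correct_prob(w, x, 1))
```

For this example the overlap reduces to cos(theta)^2, Malus's law for polarized light.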


Topological learning theory

This part of my research is mostly experimental, with some theoretical validations on toy models. Everything started at Bell Labs, where a challenge was given to the research staff: train a neural network capable of sorting numbers. To our surprise the classic neural network was unable to converge properly when trained on signed vectors of numbers, even for just finding the maximum. This was more than surprising because there exist trivial neural networks which exactly perform the extraction of the maximum. For example the formula

2 max{x1,x2}=[x1+x2]+ + [x1-x2]+ + [x2-x1]+ - [-x1-x2]+,

which is nothing else than a neural network with one ReLU layer (the [.]+ function), can easily be extended to vectors of arbitrary dimension.

In my experiments I randomly initialise many networks of the same dimension and submit them to stochastic gradient descent on the same training sequences x1,x2,…,xt,…. If w is the weight vector of the neural network, we denote by f(w,x) the prediction of the label for the feature x. Assuming k parallel networks, we denote by wi0 the initial weight vector of the ith network and by wit its weight vector after the feature xt. With features of dimension 2 the convergence is satisfactory, and for all x and i: limt f(wit,x)=max{x}. But when the dimension increases the convergence worsens. Each gradient descent converges, but not to the same local minimum, and the quality of the local minima is poor. To illustrate, I take k=9 and d=16, and I fix two test vectors x and y, so there are 9 trajectories made by the points (f(wit,x),f(wit,y)). The trajectories are intricate, and in the figure below I only display the final positions of these trajectories. They are very dispersed, as claimed; the red arrow gives the ground truth (max{x},max{y}).
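The one-ReLU-layer identity for the maximum of two numbers can be transcribed and verified directly in Python:

```python
def relu(v):
    # the [.]+ function: max(v, 0)
    return max(v, 0.0)

def max2(x1, x2):
    """One-ReLU-layer network computing the maximum of two numbers:
    2*max{x1,x2} = [x1+x2]+ + [x1-x2]+ + [x2-x1]+ - [-x1-x2]+"""
    return (relu(x1 + x2) + relu(x1 - x2) + relu(x2 - x1) - relu(-x1 - x2)) / 2

# the identity holds for every sign pattern of the inputs
for a, b in [(3.0, 1.0), (-3.0, -1.0), (-1.0, 3.0), (0.0, 0.0)]:
    assert max2(a, b) == max(a, b)
print(max2(2.5, -4.0))  # → 2.5
```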

After using toy models, which do not necessarily reflect the non-smooth and non-convex aspects of the actual loss functions, I arrived at the conjecture that the optimal network weight values are unattainable by gradient descent when they have the zero mean weight property. In short, the zero mean weight property holds when the weight vector looks like a random sequence of -1, +1, 0. In the next figure I fix a ground truth given by a network built from a zero mean weight vector b: g(x)=f(b,x). The feature dimension is still 16; the final positions of the 9 trajectories are dispersed, and the red arrow indicates the ground truth (g(x),g(y)).