This page has summaries of papers on machine learning theory and algorithms (as opposed to applied machine learning). Some of the papers I mention here are painfully long journal papers with many pages devoted to proving bounds, etc. In this case I often only briefly summarize the idea and content of the papers, leaving the math to devoted readers.
Authors: Moguerza and Munoz (w/ comments by others)
Published: Statistical Science 2006
This gives a detailed and technical introduction to support vector machines (SVMs) and kernel methods. SVMs are a statistical technique for classifying samples (X,Y), where Y=1 or Y=-1. SVMs work, roughly, by finding the hyperplane in feature space which separates the Y=1 class from the Y=-1 class with the largest margin. The paper argues that SVMs take advantage of sparsity in data and are effective even on problems where the dimensionality of the data is as large as the number of samples. Early in the paper, the authors note that purely non-parametric methods (e.g. neural nets) have powerful representational ability, but behave like black boxes, don't manage the bias-variance tradeoff well, and can be hard to optimize (again, neural nets). SVMs, by contrast, are consistent (have good asymptotic properties) and their empirical error converges to the expected error, "so SVM hyperplanes are neither arbitrary nor unstable"; under some conditions they converge to the Bayes optimal rule. The paper presents many examples where SVMs are used successfully in practice and outperform other classification methods.
The paper begins by detailing kernels: a kernel K(X1, X2) -> R is a positive definite function (e.g. (X1 · X2)^2) that can be interpreted as a dot product in a higher-dimensional vector space. Roughly, kernels map input features into higher dimensions. Using a kernel and our set of input features X1 to XN, we can define a function
f(X) = sum_j( aj * K(X, Xj) ), and a "hyper-plane" in the higher-dimensional space as f(X)=b.
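To make the kernel machinery concrete, here's a tiny Python sketch (my own toy example, not code from the paper) of evaluating such a decision function with the degree-2 polynomial kernel mentioned above. The support points, coefficients, and offset are assumed given; in a real SVM they come out of the convex optimization:

```python
import numpy as np

def poly_kernel(x1, x2):
    """The degree-2 polynomial kernel (x1 . x2)^2 from the text."""
    return np.dot(x1, x2) ** 2

def decision_function(x, support_points, alphas, b, kernel=poly_kernel):
    """Evaluate f(x) = sum_j( a_j * K(x, X_j) ) - b; the sign gives the class."""
    f = sum(a * kernel(x, xj) for a, xj in zip(alphas, support_points))
    return f - b
```

The point is just that classification in the high-dimensional feature space only ever touches the data through kernel evaluations, never through explicit high-dimensional coordinates.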
Writing math here is too painful, but the paper expresses the search for a max-margin classifier as a convex optimization problem that maximizes the margin between the data points, with a hinge-loss penalty for mis-classified or almost mis-classified points. Some of the comments following the paper focus on this loss function, and also argue that SVM outputs cannot be interpreted as (conditional) probabilities.
Authors: Leslie Pack Kaelbling, Michael L. Littman, Andrew W. Moore
Published: Journal of Artificial Intelligence Research 4 (1996)
Keywords: Machine Learning, Reinforcement Learning
Great paper! Summarizes reinforcement learning techniques, both model-based and model-free. It covers TD learning, Q learning, the exploration vs. exploitation tradeoff, and other details. Not difficult to read for a technical audience: the explanations are clear while avoiding unnecessary detail, and the paper has copious references. Granted, I'm biased, since I took one of the authors' courses.
Authors: Eyal Even-Dar, Michael Kearns, Yishay Mansour, and Jennifer Wortman
Published: COLT 2007 (I think)
Keywords: Regret, average, BestAverage, PhasedAggression.
Without a doubt this is one of the better papers I've read on this topic: it's well written and their proofs (except the last technical one) are relatively straightforward. This also cites other useful papers on expert learning (especially the first two references).
The authors work in the experts allocation framework: each period, each of N experts has a gain in [0,1], and the algorithm must choose a weighting w on the probability simplex, receiving gain w · g, where g is the vector of the experts' gains. The algorithm's total gain is the sum of its gains over T time steps. The authors define regret to the best (average, worst) expert as the amount by which the algorithm's total gain lags that of the best (average, worst) expert. They point out that, while traditional expert-weighting algorithms have regret O(T^0.5) to the best expert, they also have regret Omega(T^0.5) to the average and worst experts. They show that any algorithm with O(T^0.5) regret to the best has Omega(T^0.5) regret to the worst (more detailed bounds are provided). They develop an algorithm with O( (N*T)^0.5 * log(T) ) regret to the best and constant regret to the average, thus giving up a little regret to the best expert in exchange for much better regret to the average. The first of these algorithms, BestAverage, basically runs an exponential weights algorithm, resetting whenever the best expert diverges sufficiently from the average.
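For reference, here's a minimal Python sketch of the plain exponential weights subroutine that BestAverage restarts (the gain-based variant; the learning rate eta and the toy setup are my choices, not the paper's):

```python
import numpy as np

def exp_weights(gains, eta=0.5):
    """Run exponential weights over a T x N matrix of expert gains in [0,1].

    Each round the algorithm plays the normalized weight vector p and
    earns p . g_t; each expert's weight is then multiplied by exp(eta * gain),
    so experts that do well accumulate weight. Returns the total gain.
    """
    T, N = gains.shape
    w = np.ones(N)
    total = 0.0
    for g in gains:
        p = w / w.sum()
        total += p @ g
        w *= np.exp(eta * g)
    return total
```

With one expert always gaining 1 and another always gaining 0, the weights concentrate on the good expert quickly, which is exactly the behavior that costs Omega(T^0.5) against the average when the "good" expert later collapses.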
Authors: Agarwal, Hazan, Kale, and Schapire
Published: ICML 2006 (http://www.cs.princeton.edu/~ehazan/papers/icml.pdf)
Keywords: Online portfolio management, Online Newton Step
The authors present an algorithm for investment (picking which stocks to buy). This online algorithm chooses an investment in N stocks (a point on the N-dimensional simplex) every period. It works by optimizing, every period, an approximation to the loss so far: roughly, a portfolio-loss term minus the two-norm of the investment vector. This algorithm is easy to implement, and the authors claim it has optimal regret with respect to the best CRP (constantly rebalanced portfolio) chosen in hindsight (I leave technical details of the algorithm to those who want to read the paper). The authors' experiments indicate that the algorithm does better than buy-and-hold on randomly selected S&P stocks. My personal experiments with this algorithm (using code from one of the authors) indicated that the regret bounds, though optimal, are far too weak to be economically meaningful (the terminal wealth of ONS may lag the best CRP by a factor exponential in 22*N*sqrt(T), where N is the number of stocks and T the number of periods), and the algorithm performed roughly as well as buy-and-hold on monthly data.
Authors: Agarwal, Hazan, Kale, and Schapire
Keywords: online convex optimization, online gradient descent, online newton step
This paper provides the theoretical underpinnings for the above paper. It mostly contains technical proofs.
Authors: Thomas M Cover
Published: Mathematical Finance (1996 I think)
Keywords: portfolio bounds, constantly rebalanced portfolios
This paper is overly mathematical for what it accomplishes (117 equations) but is somewhat interesting. Cover considers CRPs (constantly rebalanced portfolios), where a fixed fraction of wealth is held in each stock and the portfolio is rebalanced every period: CRPs can earn positive return even if none of the stocks do. Finding the best CRP is an optimization over a simplex (since we optimize over the fraction to invest in each stock), while the best single stock is just a corner of the simplex, so the best CRP does at least as well as the best stock. He considers the strategy of initially investing equal money in each CRP (i.e. over the continuous space of CRPs) and then letting the wealth in each CRP grow: doing this, he derives a bound that the terminal wealth of this strategy will lag the terminal wealth of the best CRP chosen in hindsight by a factor of at most (T+1)^(N-1), when there are T periods and N assets. Thus the strategy earns the same asymptotic growth rate as the best CRP. In practice I found that this technique isn't useful and the bound is very, very weak (even over 50 years of data). However, the idea of buy-and-hold having good asymptotic performance and comparing well to the best portfolio chosen in hindsight is a good one.
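Here's a quick Python sketch (my own toy numbers, not Cover's) of the CRP effect: one stock is flat, the other alternately doubles and halves, so neither stock gains over a full cycle, yet the 50/50 rebalanced portfolio grows steadily:

```python
import numpy as np

def crp_wealth(price_relatives, b):
    """Terminal wealth of a constantly rebalanced portfolio.

    price_relatives: T x N array; entry (t, i) is price_i(t+1) / price_i(t).
    b: the fixed fraction of wealth held in each asset; rebalancing back to b
       every period means per-period wealth is multiplied by b . x_t.
    """
    wealth = 1.0
    for x in price_relatives:
        wealth *= np.dot(b, x)
    return wealth

# Stock A is flat; stock B alternately doubles and halves over 10 cycles.
x = np.array([[1.0, 2.0], [1.0, 0.5]] * 10)
```

Each two-period cycle multiplies the 50/50 CRP's wealth by 1.5 * 0.75 = 1.125, while holding either stock alone ends each cycle exactly where it started; the rebalancing is what harvests the volatility.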
Authors: Avrim Blum and Adam Kalai
Published: COLT [Computational Learning Theory] proceedings (1997)
Keywords: Universal Portfolios, Constantly rebalanced portfolios
The authors give a MUCH simpler description of constantly rebalanced portfolios (CRPs) and of their performance bound than Cover (above). They also begin with a nice example of when a CRP has positive return even though neither stock does. They motivate the problem of performing as well as the best CRP by describing how one can perform nearly as well as the best individual stock (within 1/N of the terminal wealth) by simply investing equal-weight initially and holding each stock. They use the same strategy as Cover for performing nearly as well as the best CRP: spread wealth evenly across CRPs and let it grow within each CRP, and get the same bound: the wealth of the strategy is less than the wealth of the best CRP by no more than (T+1)^(N-1) over T periods, with N assets. They also consider a case of transaction costs to derive a (slightly looser) bound.
Author: Avrim Blum
Published: Machine Learning (1992)
Keywords: weighted majority, learning boolean functions, CNF, mistake bounds
This paper develops algorithms and mistake bounds for learning non-noisy boolean functions. It considers the case when there is a very large number of possible attributes, but each example (and the concept) depends on only a few attributes, so we want to avoid ever explicitly listing or considering all possible attributes. The concept classes it deals with learning are boolean formulas. The paper is a good cite but (for my work) probably not directly applicable.
Authors: William Cohen and Yoram Singer
Published: SIGIR (1996) ( http://citeseer.ist.psu.edu/cohen96contextsensitive.html )
Keywords: sleeping experts, learning from experts, Ripper
This paper focuses on empirical results of various algorithms for text classification: my interest in it was its description of the sleeping experts algorithm. Roughly, in the experts model, there are many "experts" making a prediction on every sample, and we must learn which experts to trust: i.e. we will make a prediction based on the weighted predictions of the experts, and the weights might depend on the experts' past performance. In "sleeping experts", some of the experts make no predictions on some samples. This paper has a good description, with pseudocode, of the sleeping experts algorithm, but does not derive any bounds. The paper also describes Ripper, an algorithm for learning rule sets.
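Since the paper gives no bounds anyway, here's a rough Python sketch of the sleeping-experts idea as I understand it (a simplification of the paper's pseudocode: the particular rescaling scheme and the parameter beta are my assumptions):

```python
import numpy as np

def sleeping_experts_predict(weights, predictions, awake):
    """Weighted-average prediction over only the awake experts.

    weights: current weight per expert; predictions: each expert's
    prediction in [0, 1] (ignored for sleepers); awake: boolean mask.
    """
    w = weights * awake
    return np.dot(w, predictions) / w.sum()

def sleeping_experts_update(weights, predictions, awake, y, beta=0.5):
    """Penalize each awake expert by beta ** loss, then rescale so the
    awake experts' total weight is unchanged; sleeping experts keep
    their weights untouched."""
    new = weights.copy()
    loss = np.abs(predictions - y)
    new[awake] *= beta ** loss[awake]
    new[awake] *= weights[awake].sum() / new[awake].sum()
    return new
```

The key point is that an expert is neither rewarded nor punished for rounds where it abstains: only the relative weights among the experts that actually predicted are adjusted.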
Authors: Yoav Freund and Robert E. Schapire
Published: European Conference on Computational Learning Theory (1995): http://citeseer.ist.psu.edu/freund95decisiontheoretic.html
Keywords: On-line learning, experts, boosting, Hedge, adaboost, loss
This is a fantastic paper: it presents the problem well, the algorithms are intuitive and clearly presented, and the proofs are not overly long. It describes the online allocation problem: roughly, how to use the hypotheses of N "experts" to create a hypothesis that isn't much worse than the best expert's (alternatively, how to allocate wealth every period among N traders, conditioned only on their past performance, such that at the end you performed almost as well as the best trader chosen in hindsight). The paper presents the Hedge algorithm, which works with general bounded summable loss functions and has only one parameter (a learning rate) to tune: it works by simply decreasing the weight on each expert every period by a factor like (learning_rate)^(loss), where 0 < learning_rate <= 1. The final bound, where L(hedge) is the loss of the Hedge algorithm, L* is the loss of the best expert (the one with minimum loss), and 0 < B <= 1 is the learning rate, is:
L(hedge) <= [ ln(1/B) * L* + ln(N) ] / (1 - B)
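The algorithm is simple enough to sketch in a few lines of Python (the random loss matrix and beta = 0.8 below are my toy inputs, not from the paper):

```python
import numpy as np

def hedge(losses, beta=0.8):
    """Freund & Schapire's Hedge with multiplicative update w *= beta ** loss.

    losses: T x N array of per-round expert losses in [0, 1].
    beta: the learning rate, 0 < beta <= 1.
    Returns the algorithm's total (expected) loss under the mixture weights,
    which the paper bounds by [ln(1/beta) * L* + ln(N)] / (1 - beta).
    """
    T, N = losses.shape
    w = np.ones(N)
    total = 0.0
    for l in losses:
        p = w / w.sum()
        total += p @ l          # loss of the weighted mixture this round
        w *= beta ** l          # shrink weights of experts that lost
    return total
```

Note how the bound trades off the two terms: beta near 1 shrinks the ln(1/B)*L* coefficient toward 1 but blows up the ln(N)/(1-B) overhead.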
This paper also describes boosting and relates it to Hedge, though in my opinion the description of the boosting problem and AdaBoost isn't nearly as good as the explanation of the online allocation problem and Hedge. In boosting, we have a weak learner, which can learn a function X -> Y given some examples, but possibly with a high error rate (the precise definition of a weak learner is, of course, in the paper). A "master algorithm" has a set of labelled examples Xi -> Yi (X may be multi-dimensional). It calls the weak learner many times, giving it a different distribution over these examples each time and collecting the hypothesis the weak learner produces. It then combines these hypotheses into a "master" hypothesis that is guaranteed to be "good". The paper continues with several extensions of boosting to other domains, like regression.
Authors: Claudio Gentile
Published: Machine Learning 2003
Keywords: Online regression, classification, gradient descent
This paper describes a class of algorithms for classification or regression in the online setting. That is, the data is a bunch of pairs (X_t, Y_t) (where X may be a vector), and these items arrive in some order: the algorithm must predict each Yhat_t using only X_t and previously seen pairs. In the regression setting, each mis-prediction has a loss like (Y_t - Yhat_t)^2, and in the classification setting Y_t is always 0 or 1 and the loss is |Y_t - Yhat_t|.
Roughly, the algorithm makes linear predictions using an internal weight vector (Yhat = w · X) and does a gradient-descent-like weight update. However, it tries to keep the q-norm (q can be any number) of the weight vector small, preventing the weights themselves from becoming too large. The algorithm itself is simple, and the weight update takes advantage of link functions, which the author defines. The majority of the paper is devoted to deriving loss bounds, showing that the loss incurred by this algorithm isn't much worse than that incurred by the best weight vector chosen in hindsight. Typical readers will be interested in the first few pages, as the latter part of the paper is mainly technical proofs.
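In the two-norm case the link function is the identity and the update reduces to plain online gradient descent, which is easy to sketch in Python (the learning rate and toy setup below are mine; the paper's general link-function machinery is more involved):

```python
import numpy as np

def ogd_regression(data, eta=0.1):
    """Online gradient descent for squared loss: the identity-link
    (two-norm) special case of the paper's family of algorithms.

    data: iterable of (x_t, y_t) pairs (x_t a numpy vector).
    Returns (total squared loss incurred online, final weight vector).
    """
    w = None
    total = 0.0
    for x, y in data:
        if w is None:
            w = np.zeros_like(x, dtype=float)
        y_hat = w @ x                       # linear prediction
        total += (y - y_hat) ** 2           # online squared loss
        w = w - eta * 2 * (y_hat - y) * x   # gradient of (y - w.x)^2
    return total, w
```

For other q, the update is applied in a dual space and mapped back through the link function, which is what keeps the q-norm of the weights controlled.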
Authors: Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire, and Warmuth
Published: Journal of the ACM, 1997
Keywords: Online learning, experts, binary classification
The authors consider online learning of binary values: each period, each of N experts makes a prediction (in [0,1]), and the learner must make predictions such that, in hindsight, it didn't do much worse than the best expert. The loss function the authors use is |Yhat_t - Y_t|, and the total loss is the sum over all t.
The authors' first solution to this problem is an algorithm called MM. It uses a complex recursively-defined function v which takes exponential time to compute, but which (roughly) computes the anticipated future total loss of each expert for the rest of the game and uses that. The bound they give for MM is in terms of this v function, so it isn't easily interpreted. This is analyzed in tremendous detail and under various assumptions.
The authors then give a more familiar (and implementable) multiplicative weight update algorithm, where a non-negative weight is placed on each expert and our prediction at any time is the weighted average of the experts' predictions. After every period, each expert's weight is multiplied by some function of the loss it incurred that period and a learning rate. They show how, when the weight updates are done using the right function, the algorithm has a nice Bayesian interpretation. This paper is dense (at 59 pages) and so filled with proofs it feels like reading an appendix.