Machine Learning and Markets




**Can We Learn to Beat the Best Stock

Authors: Allan Borodin, Ran El-Yaniv, and Vincent Gogan

Published: Journal of Artificial Intelligence Research, 2004 (also NIPS 2003, I think)

Online: PDF (the short 2003 NIPS version, I think, on El-Yaniv's page), or a complete version from the Journal of Artificial Intelligence Research

Keywords: Algorithmic trading, correlation based trading

References: exponential gradient descent

The paper begins by citing exponential gradient descent and Cover's universal portfolio techniques, which give online guarantees with respect to the best constantly rebalanced portfolio chosen in hindsight.  However, the algorithm they present is purely heuristic (I do NOT mean this in a demeaning way) in that it relies on the continuation of statistical autocorrelation and cross-autocorrelation. The authors note that any algorithm can be made universal by initially putting a fraction of wealth into a universal algorithm.

Their algorithm is called anticor. It uses one parameter, w, the length of a window of past trading days. I give a brief description.  At any time t, the algorithm considers the two most recent windows, w2 and w1: w2 is [t-2w+1 to t-w] and w1 is [t-w+1 to t]. Define M(i,j) as the correlation of stock i's log returns over w2 with stock j's log returns over w1: that is, how much stock i's return in one window correlates with stock j's return in the next window. Also, define R(i) and R(j) as the mean log returns of stocks i and j over the most recent window w1. The algorithm says, for all i and j:

if R(i) > R(j) and M(i,j) > 0

  CLAIM( i -> j ) = M(i,j) + MAX( 0, -M(i,i) ) + MAX(0,-M(j,j))

Then, before the next day, it transfers wealth in the portfolio from stock i to stock j based on CLAIM( i->j ).  Basically, what it's doing is as follows. Whenever stock i has beaten stock j over the most recent window and stock i's performance in one window is correlated with stock j's performance in the next window ( M(i,j)>0 ), stock j will probably do well over the next window, so the algorithm transfers wealth from i to j.  The magnitude of that transfer increases with the strength of the cross-autocorrelation ( M(i,j) ), and increases further if stock i is negatively autocorrelated with itself ( -M(i,i) ) or stock j is negatively autocorrelated with itself ( -M(j,j) ).  If you think about it, all of these make intuitive sense. This is the ANTICOR(w) algorithm, with free parameter w. They also define the algorithm ANTICOR^2, which takes a range of window lengths (say w = 2 to 20), runs ANTICOR(w) for each w, treats each of these ANTICOR algorithms as a stock, and invests in the ANTICOR algorithms.
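To make the description concrete, here is a minimal sketch of the claim computation in Python with NumPy. The function names (`anticor_claims`, `rebalance`) are my own, and the proportional transfer rule in `rebalance` is an assumption on my part: the summary above only says that wealth moves "based on" the claims.

```python
import numpy as np

def anticor_claims(log_returns, w):
    """Compute CLAIM(i -> j) from the last 2*w rows of log_returns.

    log_returns has shape (T, m): one column per stock, T >= 2*w, w >= 2.
    """
    x = np.asarray(log_returns, dtype=float)
    win2 = x[-2 * w:-w]   # older window  [t-2w+1, t-w]
    win1 = x[-w:]         # recent window [t-w+1, t]
    m = x.shape[1]

    def standardize(a):
        # Zero-mean, unit-variance columns; constant columns become all zeros.
        centered = a - a.mean(axis=0)
        std = a.std(axis=0, ddof=1)
        safe = np.where(std > 0, std, 1.0)
        return np.where(std > 0, centered / safe, 0.0)

    # M[i, j] = correlation of stock i over win2 with stock j over win1.
    M = standardize(win2).T @ standardize(win1) / (w - 1)
    R = win1.mean(axis=0)                      # mean log return over win1
    neg_auto = np.maximum(0.0, -np.diag(M))    # MAX(0, -M(i,i))

    claims = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j and R[i] > R[j] and M[i, j] > 0:
                claims[i, j] = M[i, j] + neg_auto[i] + neg_auto[j]
    return claims

def rebalance(weights, claims):
    """ASSUMPTION: stock i splits its wealth among its claims in proportion."""
    b = np.asarray(weights, dtype=float)
    out_total = claims.sum(axis=1, keepdims=True)
    share = np.divide(claims, out_total,
                      out=np.zeros_like(claims), where=out_total > 0)
    transfer = b[:, None] * share
    return b - transfer.sum(axis=1) + transfer.sum(axis=0)
```

Note that claims only flow from recent winners to recent losers, so `claims[i, j]` and `claims[j, i]` are never both positive, and `rebalance` leaves total wealth unchanged.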

They use four datasets: daily NYSE returns for 36 stocks, 1962 to 1984; TSX: daily returns for 88 Toronto stocks, 1994-1998; the 25 largest S&P 500 stocks, 1998-2003; and 30 Dow Jones stocks, 2001-2003. They claim outrageous returns on all datasets, though it's hard to interpret because they record terminal wealth rather than returns.  I believe they claim that ANTICOR earns roughly 100% per year on all but the DOW dataset, and that ANTICOR(ANTICOR) does uniformly better.  On the NYSE dataset, they claim ANTICOR roughly doubles wealth every year, with ANTICOR(ANTICOR) doing slightly better, and both beat the best stock.  They claim ANTICOR even does well on a time-reversed dataset, where the market is going down. They find exponential gradient descent and the universal portfolio techniques to be only marginally better than the market. They also show that the ANTICOR algorithm has a little more than twice the market's standard deviation, and claim that its profits remain significant even with commissions of around 0.4%.

The results are certainly interesting, and though there aren't any bounds (I have found that economically meaningful bounds are hard to construct in market prediction problems) their algorithm is based on solid intuition and stationarity assumptions.  I would very much like to see the results verified by a third-party on a more "industry standard" dataset.



**Automated Trading with Boosting and Expert Weighting

Authors: German Creamer and Yoav Freund

Published: SSRN Working Paper 2006


Keywords: Boosting, algorithmic trading, expert trading, alternating decision tree

Summary: The authors develop a three-layer trading algorithm that uses a large number of technical indicators as features and daily returns as the value to predict.

The first layer of the algorithm (TRAINING layer) uses an alternating decision tree (ADT) trained on a training set that consists of daily returns as values and a large number of technical indicators as features.  This layer learns a single alternating decision tree for all stocks (i.e. not a different tree for each stock).  An alternating decision tree has split nodes (like a normal decision tree) and prediction nodes, and a single prediction involves following multiple paths from root to leaf and summing the values of the prediction nodes along all paths.
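A toy sketch of how an alternating decision tree produces a score may help here. In the usual formulation, the prediction values reached along every path are added up and the sign of the total gives the class. The class names, the feature names (`rsi`, `momentum`, `vol`), and all the numbers below are made up for illustration and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class PredictionNode:
    value: float
    splitters: List["SplitterNode"] = field(default_factory=list)

@dataclass
class SplitterNode:
    test: Callable[[Dict[str, float]], bool]
    if_true: PredictionNode
    if_false: PredictionNode

def adt_score(node: PredictionNode, x: Dict[str, float]) -> float:
    """Sum prediction values over every path the instance x follows."""
    total = node.value
    # Unlike a normal tree, ALL splitters under a prediction node are followed.
    for splitter in node.splitters:
        branch = splitter.if_true if splitter.test(x) else splitter.if_false
        total += adt_score(branch, x)
    return total

# Hypothetical tree over hypothetical technical indicators.
inner = SplitterNode(lambda x: x["momentum"] > 0,
                     PredictionNode(0.2), PredictionNode(-0.3))
root = PredictionNode(0.1, [
    SplitterNode(lambda x: x["rsi"] < 30,
                 PredictionNode(0.4, [inner]), PredictionNode(-0.2)),
    SplitterNode(lambda x: x["vol"] > 1.5,
                 PredictionNode(0.3), PredictionNode(-0.1)),
])

score = adt_score(root, {"rsi": 25.0, "momentum": -1.0, "vol": 2.0})
signal = 1 if score > 0 else -1   # sign of the summed score
```

For this instance the visited prediction nodes are 0.1, 0.4, -0.3 and 0.3, so the score is about 0.5 and the signal is long.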

The next layer (ONLINE LEARNING layer) uses boosting and expert weighting to choose among many ADTs learned by the first layer: each ADT is treated as an expert. The last layer (RISK MANAGEMENT) eliminates "weak" signals to prevent excessive trading...that is, weak signals to take a small long or short position are ignored.

Their experiments use daily data for 100 randomly selected S&P stocks from Jan. 2001 to Dec. 2004. The experiments used moving windows with two years of data for training followed by 50 days for out-of-sample testing.  They claim annual abnormal returns out of sample of 6% to 14%, depending on the transaction costs assumed, where buy-and-hold lost 4% (it is not clear whether "abnormal" means over the risk-free rate). All transaction cost estimates were low, and it's not clear whether these were long-short or long-only portfolios, or whether leverage was allowed. The reported Sharpe ratios are quite low (only 0.2 out of sample even with no transaction costs), indicating that returns must have been quite volatile.

I liked the paper because it is an application of several things (boosting, expert learning) that I have been reading about lately.  However, they use a huge number of technical indicators (the appendix lists at least 30), none of which have any theoretical justification, and I would be interested to see how their results hold up in a true out-of-sample test.



**Financial Forecasting using Genetic Algorithms

File: Financial Forecasting using Genetic Algorithms (1996).pdf

Authors: Sam Mahfoud and Ganesh Mani (LBS Capital Management)

Journal of Applied Artificial Intelligence 1996

Keywords: Genetic Algorithms, stock forecasting, neural networks, machine learning


Folder: Machine Learning and Markets

The authors use Genetic Algorithms (GA) to make LIVE predictions (actual predictions, not backtests) of returns relative to the S&P 12 weeks in advance, using 15 features (unnamed, but described as fundamental and technical factors). The paper contains a great discussion of what the authors call the 'Pittsburgh' versus the 'Michigan' method of GA. In the Pittsburgh method, individual population members are complex and composed of multiple embedded rules, and individuals are used for prediction. In the Michigan method (used in this paper), population members are simple and use only one rule (like 'buy if P/E > 30 and growth > 10%'), but the population as a whole is used for prediction, with each individual member generally "voting" on only a small subset of instances. The paper discusses conflict resolution and other issues that arise with this method, and discusses other details of GA implementation (for example, this paper uses strings to represent individuals, allowing quasi-generic functions for selection and mutation). The results were very good, with stocks the GA placed in the first quintile outperforming those in the last quintile by several percentage points. The GA beat a similar neural net, gave human-readable rules, and made predictions less frequently (i.e. on many instances it made no prediction) but was very accurate when it did make a prediction. Lastly, the authors show that quintiles selected by both the neural net and the GA performed exceptionally well, so some synergies exist. This paper is noteworthy in that it performs true prediction, so lookahead and survivorship biases could not have been an issue.
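The Michigan-style idea of many one-rule members voting, and abstaining when no rule fires, can be sketched as follows. This is only my illustration: the `Rule` class, the feature names, and the simple majority vote used for conflict resolution are assumptions, not the scheme from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Rule:
    condition: Callable[[Dict[str, float]], bool]
    vote: int   # +1 = expect outperformance, -1 = expect underperformance

def population_predict(rules: List[Rule], x: Dict[str, float]) -> Optional[int]:
    """Let every rule whose condition matches x cast its vote.

    Returns None (no prediction) when no rule fires or the votes tie.
    ASSUMPTION: conflicts are resolved by simple majority.
    """
    votes = [rule.vote for rule in rules if rule.condition(x)]
    total = sum(votes)
    if not votes or total == 0:
        return None
    return 1 if total > 0 else -1

# Hypothetical population of three single-rule members.
population = [
    Rule(lambda x: x["pe"] > 30 and x["growth"] > 0.10, +1),
    Rule(lambda x: x["pe"] < 8, -1),
    Rule(lambda x: x["growth"] < 0.0, -1),
]

covered = population_predict(population, {"pe": 35.0, "growth": 0.20})
uncovered = population_predict(population, {"pe": 15.0, "growth": 0.05})
```

The second instance matches no rule, so the population abstains, which mirrors the "fewer but more accurate predictions" behavior described above.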

Score: 10/10


**Forecasting and Trading Currency Volatility: An application of Recurrent Neural Regression and Model Combination

File: Forecasting and Trading FX Vol - An Application of recurrent NN Regression (2002).pdf

Authors: Christian L Dunis and Xuehuan Huang

Folder: Machine Learning and Markets

Keywords: Neural Network forecasting, volatility forecasting, FX

The authors use neural networks, with GARCH(1,1) as a benchmark, to forecast future 21-day FX volatility and use this to derive a trading strategy.  This paper has a great explanation of GARCH processes, how GARCH(1,1) is generally preferred to more complex models,  and GARCH's shortcoming for multi-day future volatility forecasts. It also has many good references on neural net forecasting, and a good explanation of why at-the-money straddles are best for buying or selling volatility.
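For reference, the standard GARCH(1,1) recursion and its multi-day forecast can be written in a few lines. This is the textbook formulation, not code from the paper, and the parameter values in the usage note are made up. It shows one commonly cited property of multi-day GARCH forecasts: they decay geometrically toward the long-run variance.

```python
import math

def garch11_next_variance(omega, alpha, beta, last_return, last_variance):
    """One-step-ahead variance: omega + alpha * r_t^2 + beta * sigma_t^2."""
    return omega + alpha * last_return ** 2 + beta * last_variance

def garch11_horizon_vol(omega, alpha, beta, next_variance, horizon=21):
    """Volatility over `horizon` days, summing the expected daily variances.

    Requires alpha + beta < 1 so that a long-run variance exists.
    """
    persistence = alpha + beta
    long_run = omega / (1.0 - persistence)
    total = sum(long_run + persistence ** k * (next_variance - long_run)
                for k in range(horizon))
    return math.sqrt(total)
```

For example, with the illustrative values omega=1e-6, alpha=0.05, beta=0.9, the long-run daily variance is 2e-5, and any shock to tomorrow's variance fades by a factor of 0.95 per day in the forecast.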

The authors forecast 21-day future volatility using lagged 21-day realized volatility, lagged 21-day implied volatility, absolute logarithmic exchange rate returns, and lagged commodity returns on gold or oil (variables were selected by forward selection).  They forecast USD/JPY and USD/GBP 21-day realized volatility using data from Dec 1993 to April 1999, with most of the data used for training.  They generally find that recurrent neural networks (RNN) perform best at minimizing RMSE, but model combinations of RNN, NN, and GARCH are best at % correct directional prediction. However, for volatility trading, model combinations don't perform well, and the best two performers are RNN followed by GARCH. RNNs perform very well, occasionally scoring over 100% out of sample and generally being very consistent.  However, as the authors say, "profitability is defined as a volatility net profit"...I don't know what this means exactly, so I'm not sure if a 100% return is what it seems like.  Overall, the paper is novel, clearly written, generally explains the methodology very well, and obtains strong results.  The only weakness I see is the small amount of data used to get results (only two time series, each with ~1000+ samples).

Score: 9/10


**Nonlinear Predictability of Stock Returns Using Financial and Economic Variables

File: Nonlinear Predictability of Stock Returns using Financial and Economic Variables (1999).pdf

Authors: Min Qi

Journal of Business and Economic Statistics Oct 1999

Folder: Machine Learning and Markets

Keywords: Neural network forecasting

References: Good refs on NN prediction and statistical tests for non-linearity

The author forecasts next-month excess S&P performance during the period ~1954 to 1992 using 9 lagged economic variables (like dividend yield, inflation, one-month T-bill rates, growth of money supply) and compares these forecasts to linear forecasts on the same variables and to buy and hold.  Roughly, the author makes monthly forecasts using the past 5 years of data.  The author finds that non-linear models outperform linear models out of sample by most statistical measures, except during the 1980s.  Trading results indicate that both linear and non-linear models outperform buy and hold, by 1% to 2% for linear forecasts and 2% to 4.5% for NN forecasts. Non-linear forecasts beat linear forecasts in all periods, and the trading strategies generally have around 75% as much standard deviation as the market (or less).  The appendices discuss Bayesian regularization.  The paper is straightforward, clear, and very good...perhaps that's why it won a best paper award in Chicago.  It would be nice to see an update using data from 1992 to 2005.

Score: 9/10



**Learning to Trade via Direct Reinforcement

File: Learning to Trade via Direct Reinforcement (Moody 2001).pdf

Authors: John Moody and Matthew Saffell


Folder: Machine Learning and Markets

Keywords: Trading, Reinforcement Learning, Policy Learning, Trading Performance measures, Online learning

Note: This paper will be a very difficult read for people not familiar with machine learning at a graduate level.

Moody presents a policy learning method for trading (recurrent reinforcement learning, or RRL) based on reinforcement learning. More details of this method are given in the first two references cited in this paper. It seems to learn what position to take (long, short, or neutral) in an asset based on past asset values and the current position. The policy is a set of parameters on these past prices that is updated online to improve the Sharpe ratio or the Downside Deviation Ratio (basically the Sharpe ratio, E(Return) / Std(Return), where the deviation is calculated only over negative returns). I believe the policy simply takes a linear combination of the features (the past prices and current position), weighted by the parameters, and passes it through a single tanh unit. Anyway, Moody shows that this learning method is superior to Q-learning (a value function method which also required more tanh units) and to buy and hold for allocation between the S&P and treasuries on daily data, and for 30-minute-frequency trading of USD/British Pound data. Moody points out that supervised learning solves the structural credit assignment problem but not the temporal credit assignment problem, and has transaction cost issues, since it's hard to teach it not to trade very frequently given noisy financial data. Of course he argues that RRL, an unsupervised method, is better. The paper discusses value function versus policy search methods, and argues that the state space required in financial applications for value function based methods can become large. The RRL method is more stable in a noisy environment and it trades less. The paper would be stronger if it made clearer exactly what information the trading system uses to make a decision once the parameter update is complete at time t (e.g. the last 5 prices, the last 50, ...) and how the parameters and this data together influence the trading decision (e.g. neural net??).
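Here is my reading of the single-tanh-unit trader as a minimal sketch. Since the paper leaves the exact inputs unclear, everything specific here is an assumption: I feed in the last few returns (rather than prices), charge a proportional cost on position changes, and omit the online gradient update of the parameters entirely.

```python
import numpy as np

def rrl_positions(returns, weights, recurrent_weight, bias):
    """Position in [-1, 1] at each step from one tanh unit.

    ASSUMPTION: features are the len(weights) most recent past returns
    plus the previous position (the recurrent input).
    """
    r = np.asarray(returns, dtype=float)
    w = np.asarray(weights, dtype=float)
    n_lags = len(w)
    positions = np.zeros(len(r))
    prev = 0.0
    for t in range(n_lags, len(r)):
        features = r[t - n_lags:t]          # returns strictly before time t
        prev = float(np.tanh(w @ features + recurrent_weight * prev + bias))
        positions[t] = prev
    return positions

def trading_returns(returns, positions, cost=0.0):
    """Return earned each step, net of a cost proportional to position changes."""
    r = np.asarray(returns, dtype=float)
    turnover = np.abs(np.diff(positions, prepend=0.0))
    return positions * r - cost * turnover

def sharpe_ratio(x):
    """Mean over standard deviation of the per-step trading returns."""
    return x.mean() / x.std()
```

In RRL the parameters (`weights`, `recurrent_weight`, `bias`) would be adjusted online to increase a performance measure such as `sharpe_ratio`; the recurrent input is what lets the cost of changing position feed back into the decision.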

Score: 8/10

**Performance Functions and Reinforcement Learning for Trading Systems and Portfolios

File: Performance Functions and Reinforcement Learning for Trading Systems and Portfolios.pdf

Authors: John Moody, Lizhong Wu, Yuansong Liao, Matthew Saffell

Journal of Forecasting 1998

Online: (PDF)

Folder: Machine Learning and Markets

Keywords: Trading, Reinforcement / Policy / Online Learning, Trading Performance measures, Running exponential moving averages.

Note: This is a companion paper to the one above (Learning to Trade via Direct Reinforcement)

This paper will be a very difficult read for people not familiar with machine learning at a graduate level.

This paper is slightly more detailed than the one above in some aspects. It motivates the idea of directly optimizing a trading system, as opposed to training a separate forecasting system or using labelled trades. It has a partial explanation of stochastic optimization and their recurrent reinforcement learning, which seems to be optimization using the gradients of the utility function with respect to the parameters. It details various efficient methods of calculating moving averages of the Sharpe ratio and its derivative, the differential Sharpe ratio (the utility function they use). They also give (similar to the above paper) simulated results with a trending AR(1) series and with an S&P vs. T-Bill asset allocation problem, where optimizing the differential Sharpe ratio gives more profit and less variance than minimizing mean squared forecast error, and both beat buy-and-hold. These papers would be a perfect 10 if they went through, in detail, how their trading system works with one concrete example, since the parameters (eq 9) are never specified, nor is the method by which the trading system makes decisions or the data (lagged prices?) on which it makes them.

Score: 8/10
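As I understand it, the differential Sharpe ratio comes from keeping exponential moving averages A and B of the returns and squared returns, and taking the first-order change in the resulting Sharpe ratio with respect to the adaptation rate eta. The sketch below follows that definition. The variable names are mine, and this is my reconstruction of the idea rather than the authors' code.

```python
def differential_sharpe(A_prev, B_prev, R, eta):
    """One online update of the moving-average Sharpe statistics.

    A_prev, B_prev: exponential moving averages of returns and squared returns.
    R: the newest trading return. eta: adaptation rate of the moving averages.
    Returns (D, A, B), where D is the differential Sharpe ratio.
    Requires B_prev > A_prev**2, i.e. a positive running variance.
    """
    dA = R - A_prev
    dB = R * R - B_prev
    variance = B_prev - A_prev ** 2
    D = (B_prev * dA - 0.5 * A_prev * dB) / variance ** 1.5
    A = A_prev + eta * dA
    B = B_prev + eta * dB
    return D, A, B
```

The point of the construction is that D depends only on the newest return and the previous averages, so it can be evaluated and differentiated at every time step without revisiting the whole history.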

**Taking Time Seriously: Hidden Markov Experts Applied to Financial Engineering

File: Taking Time Seriously - Hidden Markov Experts in Financial Engineering.pdf

Authors: Shanming Shi and Andreas Weigend

Folder: Machine Learning and Markets

Uses first-order Hidden Markov models to model financial time series (both intra-day FX and daily S&P). That is, a small number of hidden states are learned from the time series, and a local prediction method (like neural nets or regression) is used for each state. For example, with two states S1, S2, the predicted distribution of the future (price? return?) is P(S1) x Distribution(if S1) + P(S2) x Distribution(if S2). The paper will be readable for anyone familiar with Rabiner 1989, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". They find that, by using an HMM to segment data into different regimes, prediction experts (neural nets) can do a much better job at prediction and produce positive profits. Unfortunately, as is so common, while they do say that each neural net has 10 tanh hidden units and one output unit, they don't say what is being predicted or what the input to the neural net is, though presumably it's lagged price.
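The mixture formula in the example can be sketched numerically. I assume here that each state's expert outputs a Gaussian with its own mean and standard deviation, and that the current state probabilities are pushed one step through the transition matrix. Both are my assumptions for illustration, and the numbers in the usage note are invented.

```python
import numpy as np

def mixture_forecast(state_probs, transition, expert_means, expert_stds):
    """Combine per-state expert forecasts into one predictive mean and std.

    state_probs: P(state at time t), shape (k,).
    transition:  row-stochastic matrix, transition[a, b] = P(next=b | now=a).
    ASSUMPTION: each expert's forecast is Gaussian(mean, std).
    """
    p_next = np.asarray(state_probs, dtype=float) @ np.asarray(transition, dtype=float)
    mu = np.asarray(expert_means, dtype=float)
    sd = np.asarray(expert_stds, dtype=float)
    mean = p_next @ mu
    # Variance of a mixture: E[X^2] - (E[X])^2.
    variance = p_next @ (sd ** 2 + mu ** 2) - mean ** 2
    return p_next, mean, np.sqrt(variance)
```

For instance, starting surely in a calm state with `transition = [[0.9, 0.1], [0.2, 0.8]]`, expert means `[0.01, -0.02]` and stds `[0.01, 0.03]`, the next-step state probabilities are `[0.9, 0.1]` and the predictive mean is 0.9 x 0.01 + 0.1 x (-0.02) = 0.007.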

Score: 8/10

**Black Scholes Versus Artificial Neural Networks in Pricing FTSE 100 Options

File: Black Scholes versus Artificial Neural Network for pricing FTSE 100 options.pdf

Authors: Julia Bennell and Charles Sutcliffe

Keywords: Options Pricing, Neural Networks

Folder: Machine Learning and Markets

Shows that for out-of-the-money options, Artificial Neural Networks (ANNs) clearly beat Black Scholes (BS), and for in-the-money options the performance was comparable with some adjustments. The data was European FTSE 100 call options traded Jan 1st 1998 to March 31st 1999. Inputs to the ANN were the six Black Scholes inputs (volatility, strike price, underlying price, time to expiration, risk-free rate, dividend rate or P.V.) in addition to a moneyness hint (S/K), which was found to be VERY helpful. The intro lists great references on ANN usage for pricing derivatives. I give the authors high marks for being very clear about what the input and output data to/from their neural nets are.
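For comparison, here is the standard Black-Scholes call formula with a continuous dividend yield, plus a helper that builds the seven-element input vector described above. The formula is the textbook one. The helper name `ann_inputs` and the ordering of its outputs are my own choices, not the paper's.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S, K, T, r, q, sigma):
    """European call price: spot S, strike K, years T, rate r, dividend yield q."""
    d1 = (math.log(S / K) + (r - q + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * math.exp(-q * T) * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

def ann_inputs(S, K, T, r, q, sigma):
    """Six Black-Scholes inputs plus the moneyness hint S/K (hypothetical layout)."""
    return [sigma, K, S, T, r, q, S / K]
```

One reason the moneyness hint plausibly helps: the Black-Scholes price is homogeneous in S and K, so `bs_call(2*S, 2*K, ...)` equals `2 * bs_call(S, K, ...)` and the ratio C/K depends on S and K only through S/K.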

Score: 8/10


**Pricing and Hedging Derivative Securities with Neural Networks: Bayesian Regularization, Early Stopping, and Bagging

Authors: Ramazan Gencay and Min Qi

Published: IEEE Transactions on Neural Networks 2001

Keywords: Options, pricing, cross-validation, Bayesian regularization

Summary: The authors examine the effectiveness of bagging, early stopping, and Bayesian regularization at improving out-of-sample pricing and hedging accuracy on S&P 500 index options.  They find that bagging, the most computationally intensive, gives the smallest pricing and hedging errors.

The authors discuss the three methods to avoid over-fitting: early stopping halts weight updates when performance on the validation set starts to decrease, Bayesian regularization penalizes large-magnitude network weights, and bagging generates multiple bootstrap training sets by sampling (with replacement) to learn multiple predictors, and then predicts some average of these. The data is S&P index calls from Jan 1988 to Dec 1993, divided into training, test, and validation data each year.  They compare Black-Scholes (using sample std. dev.), a linear model, a baseline NN, and the 3 "regularized" NN models (using bagging (B), early stopping (ES), and Bayesian regularization (BR)).  They also discuss the homogeneity hint (learning C/K as a function of S/K tends to generalize better, since Black Scholes is homogeneous in this manner) and mention that baseline NNs with this hint can do about as well as the various cross-validated networks without it.
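Bagging itself is easy to sketch. To keep the example short and dependency-light I swap the neural network for a polynomial fit with NumPy, so this shows only the bootstrap-and-average mechanics, not the authors' models. The function name and defaults are mine.

```python
import numpy as np

def bagged_predict(x_train, y_train, x_new, n_bags=25, degree=3, seed=0):
    """Average the predictions of models fit on bootstrap resamples.

    ASSUMPTION: the base learner is a polynomial least-squares fit,
    standing in for the neural networks used in the paper.
    """
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x_train)
    predictions = []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)           # sample WITH replacement
        coefs = np.polyfit(x_train[idx], y_train[idx], degree)
        predictions.append(np.polyval(coefs, x_new))
    return np.mean(predictions, axis=0)            # the bagged forecast
```

Each of the `n_bags` fits sees a slightly different training set, so averaging them smooths out the variance of any single fit. That is also why bagging is the most expensive of the three methods: the model has to be trained `n_bags` times.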

They find that bagging gives much lower (about 50%) and more stable MSPE (mean squared prediction error) than a baseline NN or Black Scholes.  Early stopping doesn't help much over the baseline NN.  The BR NN has lower MSPE than the baseline NN about a third of the time, and was the same about half the time, with more stable MSPE across years.  Linear regression did horribly, and all regularized NN models had lower hedging error than Black-Scholes about 55% of the time. Bagging seemed to do best across the board, in hedging error and MSPE.

Comments: The authors love the word "parsimonious".  I think they didn't present results comparing a baseline NN with the homogeneity hint to the "regularized" networks, even though their intro said they were presenting the regularized networks as a way to learn when a hint isn't available.  And they didn't compare to Black Scholes using a reasonable volatility model, like GARCH or implied vols.

Score: 7/10


**Is Technical Analysis in the Foreign Exchange Market Profitable - A genetic programming approach

File: Is Technical Analysis in the Foreign Exchange Market Profitable - A genetic programming approach.pdf

Authors: Christopher Neely, Paul Weller, and Robert Dittmar

Published: Journal of Financial and Quantitative Analysis, Dec 1997

Federal Reserve Bank of St. Louis

Online: (PDF)

Keywords: Genetic programming, learning in markets, technical trading, currency trading

Folder: Machine Learning and Markets

Shows that the technical indicators found via genetic programming give excess profits when trading currency cross rates. 1975-1980 data is used for training and selection, and 1981-1995 for testing. Uses expression trees as population members in the genetic algorithm, with arithmetic operations (+,-,x,/,norm,avg,min,max,lag), boolean operations (and,or,not,>,<), conditionals (if-then), and numerical constants. Uses daily data normalized by dividing it by the 250-day moving average. The approach was very similar to "Using Genetic Algorithms to find Technical Trading Rules". They generally find average returns of around 5% a year for most rules that did well in the selection period.
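The normalization step is simple enough to show directly. I assume a trailing window that includes the current day, since the summary doesn't say. The function name is mine.

```python
import numpy as np

def normalize_by_moving_average(prices, window=250):
    """Divide each price by the trailing `window`-day moving average.

    ASSUMPTION: the average includes the current day. The first window-1
    days have no full window and are dropped from the output.
    """
    p = np.asarray(prices, dtype=float)
    csum = np.cumsum(np.insert(p, 0, 0.0))
    moving_avg = (csum[window:] - csum[:-window]) / window
    return p[window - 1:] / moving_avg
```

After this transformation a value of 1.0 means "at the one-year average", so rules evolved on one exchange rate can use the same numerical constants on another.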

Score: 8/10

**Using Genetic Algorithms to find Technical Trading Rules

File: Using Genetic Algorithms to find Technical Trading Rules (1999).pdf

Authors: Franklin Allen, Risto Karjalainen

Year: 1999

Journal of Financial Economics

Folder: Machine Learning and Markets

Online: (PDF)

Keywords: Technical Trading, Genetic Algorithms

The authors use genetic algorithms to search for trading rules telling when to invest in the S&P and when to invest in bonds, using data from 1928-1995. Their results indicate negative excess returns to even the best rules after (small) transaction costs. The genetic algorithm approach they use represents rules as expression trees, using operations (>,<,x,+,-,/,lag,min,max,avg) and constants, where a single rule could have a hundred nodes and populations contained 500 rules. This has powerful representational ability (perhaps too powerful, so that the genetic search couldn't find good rules?). Populations were trained on 5 years of data, 2 years were used for selection, and the rest of the data was the out-of-sample test period (which has varying length depending on when training started). The best rules had some forecasting ability (average market returns slightly higher and volatility slightly lower when the rules were in the market) but only by a few basis points, and closer examination indicated this was probably just exploiting one-day serial correlation in returns. I give the authors credit for publishing negative results, and genetic algorithms are well explained.
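To give a feel for what such a rule looks like, here is a tiny evaluator for expression trees written as nested tuples. The encoding (a `"price"` leaf, window functions taking a day count) is my own simplification for illustration and not the representation used in the paper.

```python
import operator

BINARY = {">": operator.gt, "<": operator.lt, "+": operator.add,
          "-": operator.sub, "*": operator.mul, "/": operator.truediv}
WINDOW = {"avg": lambda xs: sum(xs) / len(xs), "min": min, "max": max}

def evaluate(node, prices, t):
    """Evaluate an expression tree on the price series at day t.

    Leaves are numeric constants or the string "price"; internal nodes are
    tuples like (">", left, right), ("lag", n) or ("avg", n), where n is a
    number of days. The caller must ensure t is large enough for every window.
    """
    if isinstance(node, (int, float)):
        return node
    if node == "price":
        return prices[t]
    op, *args = node
    if op in BINARY:
        return BINARY[op](evaluate(args[0], prices, t),
                          evaluate(args[1], prices, t))
    if op == "lag":
        return prices[t - args[0]]
    if op in WINDOW:
        n = args[0]
        return WINDOW[op](prices[t - n + 1:t + 1])
    raise ValueError(f"unknown operator: {op!r}")

# Hypothetical rule: be in the market when price is above its 3-day average.
rule = (">", "price", ("avg", 3))
prices = [10.0, 11.0, 12.0, 13.0, 9.0]
in_market = [evaluate(rule, prices, t) for t in range(2, len(prices))]
```

Genetic search then mutates and recombines subtrees of rules like this one. With a hundred nodes per rule, the space being searched is enormous, which fits the "perhaps too powerful" worry above.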

Score: 7/10


**Forecasting Daily Exchange Rates using Genetically Optimized Neural Networks

File: Forecasting Daily FX rates using genetically optimized neural networks(2002).pdf

Authors: Ashok K Nag and Amit Mitra

Journal of Forecasting 2002

Folder: Machine Learning and Markets

Keywords: Genetic algorithms, neural networks, currency forecasting, FX forecasting

The authors use neural networks whose weights, activation functions, learning and momentum rates, and layout are optimized with genetic algorithms (GANN). This is interesting, though not new, and their explanations are clear. They then feed lagged prices and technical indicators into these networks to do forecasting, with variants of feedforward and feedback GANN and various loss functions, in addition to fixed-layout networks. The results on daily exchange rates, using an out-of-sample test set and R^2 as the criterion, have the fixed networks performing worst, with ARCH(1,1), GARCH(1,1), EGARCH(1,1), ... and GANN performing slightly better, and with GANN perhaps having a TINY edge over ARCH/GARCH/... The R^2 values relate to predicting prices, not returns, so they are not comparable across currencies or to predictions on other assets (e.g. R^2 on prices is usually high since even just guessing the last price does a decent job; R^2 on returns will be very low, and comparable across assets). The paper is good, but the forecasts of GANN are not convincingly better than ARCH, GARCH, ...
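The point about R^2 on prices versus returns is easy to demonstrate on simulated data. The sketch below uses a driftless random walk that I made up (it is not data from the paper) and scores the naive "tomorrow's price equals today's price" forecast both ways.

```python
import numpy as np

def r_squared(actual, predicted):
    """1 - SS_res / SS_tot, the usual coefficient of determination."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Simulated random-walk prices: by construction, nothing is predictable.
rng = np.random.default_rng(0)
log_returns = rng.normal(0.0, 0.01, size=2000)
prices = 100.0 * np.exp(np.cumsum(log_returns))

# Naive forecast: tomorrow's price equals today's price.
r2_prices = r_squared(prices[1:], prices[:-1])

# The same forecast expressed in returns: predict a zero return every day.
actual_returns = np.diff(np.log(prices))
r2_returns = r_squared(actual_returns, np.zeros_like(actual_returns))
```

The naive forecast scores a high R^2 on price levels even though it has no predictive content, while its R^2 on returns can never be above zero. This is why I would rather see forecast accuracy reported on returns.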

Score: 6/10

**Evolutionary Arbitrage for FTSE-100 Index Options and Futures

Keywords: Arbitrage, Genetic Programming

Folder: Machine Learning and Markets

Uses genetic programming, trained on FTSE-100 index options, to successfully find arbitrage opportunities in intraday data. Results improve by using explanatory variables (like moneyness) and beat a naive strategy by a fair margin. The paper's idea is good, but it is poorly organized and explained. Uses and explains bootstrapping to test whether the excess returns found were valid.

Score: 5/10


------------General Audience------------


**Personal Indexes

File: Lo - Personal Indexes.pdf

Authors: Andrew W Lo

Journal of Indexes (2001)

Folder: Machine Learning and Markets

Keywords: Indexes, AI and finance (neural networks), Data mining

Folder: Finance and Machine Learning: general audience

References: Good refs to some of Lo's own papers on machine learning and markets

This is a general audience paper but has a few great references, mostly from Lo himself, related to technical trading, neural networks, and finance. It points out that current indexes (using averaging) are a simple form of AI, and posits that in the future AI-managed personal indexes may emerge.  Being for a general audience, this paper is in a different category from most professional journals.

Score: 7/10