### Optimization Challenge

 *) This optimization challenge is over, the winner is Sixin Zhang, Clément Farabet, Marco Scoffier, Ross Goroshin, Koray Kavukcuoglu and Yann LeCun, NYU. The code is now at: http://cs.nyu.edu/~zsx/nips2011/***) If you have good optimization methods you want us to evaluate, please send your code to us.1) Introduction Over the past decades, there has been much work on optimization methods to train hierarchical models. Widely used optimization methods are: Stochastic Gradient Descent [1], approximate Diagonal-Hessian methods [1], Conjugate Gradient [2], Limited memory BFGS [2], Hessian free [3], etc. Nonetheless, there is an ongoing debate regarding the most effective optimization method to train these models. Recent developments raise further questions regarding what optimization method is most effective in leveraging the new hardware (i.e., parallelism). The main purpose of our challenge is to further investigate these issues. And to further simplify matters, we pick out a simple neural network architecture and dataset in the literature. 2) Dataset: 100K tiny images Number of training examples: 100,000 extracted from tiny image dataset [4] Input dimension: 3072 Link: large_dataset.mat 3) ArchitecturesThere are two categories:- Network1: Autoencoder with 2 layers (4 layers after "unrolling"). - Network2: Autoencoder with 4 layers (8 layers after "unrolling") [optional]- Network3: Autoencoder with 8 layers (16 layers after "unrolling") [optional]We will describe Network 1 in more detail. Network1 is a two-layered auto-encoder and it has the following objective function: $\large \inline cost = \frac{1}{m}\sum_{i=1}^m\big\| W_1^T\sigma(W_2^T\sigma(W_2\sigma(W_1x_i+b_1)+b_2)+b_3)+b_4 - x_i\big\|^2$ where                 $\large W_1 &\in \mathbb{R}^{6144\times3072}$                 $\large \inline b_3 \in \mathbb{R}^{6144}$                  $\large \inline b_4 \in \mathbb{R}^{3072}$                  $\large \inline \sigma$ : hyperbolic tangent function                 m : number of examples in the training set Pictorially, the network architecture is as follows  Note that the network is "unrolled" such that it has 4 hidden layers, but the decoding weights are tied with the encoding weights. Network2 is a deeper network that has 4 encoding layers and 4 decoding layers. In particular, if we use the notation N(6144, 100) to describe Network1, then N(3072, 3072, 3072, 100) is the architecture of Network2. These deep autoencoders are also described in [5] 4) Evaluation Protocols: You provide us with binary or source code to train the network. Additionally, the code must have options that we can change: - Network hidden size - Termination condition based on final objective Your program should output parameters W1, b1, W2, b2, b3, b4 ... as text files, so that we can faithfully evaluate your objective function. If at termination, the parameters {W1,b1,W2,b2,b3,b4 ...} do not achieve a lower objective than the specified objective, your submission is not qualified. The network architectures above are for you to tune your optimization parameters. We will evaluate your optimization method on the same architectures with slightly different network parameters. The difference in network parameters is to make sure that your optimization parameters generalize to a slightly different problem. Our metrics are: - How long it takes to train the test architecture given a certain objective? - Given a time constraint, how well your optimization method optimizes the objective function?    Finally, the machine we use to evaluate your algorithm has 30Gb of memory, 8 cores, each has 2.40GHz. It also has a NVIDIA  GTX 470 GPU with 1Gb of memory. It runs Unbuntu Linux 10. If your algorithm leverages the multicore architecture or GPUs of the machine, that's fine for us. 5) Your submission You have to submit a program with a fixed interface such that we can easily evaluate it. You also should provide us a README file if you want us to compile your program in our platform. Here's some examples of legitimate interface: In MATLAB:     >> tic; Myprogram([6144, 100], '../', 70); toc;     (The first argument is the list of the hidden sizes.      The second argument is the dataset location).      The last argument is the final objective. See section 5 for a demo program.) In Linux:     \$ time Myprogram 6144 100 ../ 70  Other configurations can be considered if they do not require too many modifications. Send your submission along with a description of your method in pdf or ps file format to: transflearn.optim.wnips2011@gmail.com no later than 23:59 PDT, Sunday, October 23, 2011,(If the files are too large, you can just send us the link to download the code). 6) Starting MATLAB code If you would like to use MATLAB as a programming language, we have some starting code for you. Optimization of Autoencoders with minibatch LBFGS: optimizeAutoencoderLBFGS.m (this requires minFunc [6]).  Loading data: loadData.m Initializing weights: initializeWeights.mAutoencoder code: deepAutoencoder.m Writing parameters to text files: writeToTextFiles.m File list: README.txt Dataset file: large_dataset.mat Smaller dataset file for your debugging (we won't evaluate using this dataset): smaller_dataset.mat 7) FAQs:      Q: Do I need to participate to this challenge to submit my paper to the workshop?      A: No, it is not necessary. Submission of a paper and participation to the challenges are independent.    Q: Do I need to participate to both optimization and transfer learning challenges?      A: No, you do not. You can enter only one of them, if you wish.    Q: Can we change the random initialization?     A: Yes, as long as it's random and you do not take the solution as your  initialization. Consider reading [7].     Q: Can we change the activation function during training?     A: Yes, it's up to you.     Q: Can we change the objective?     A: Yes, as long as your termination condition uses the objective function that we describe above.     Q: Can we use simple methods like SGDs, LBFGS, CG etc?     A: Yes, if you think that they are effective.     Q: Do we have to use MATLAB?     A: No, but we recommend you give us the source code (preferably in MATLAB), so that we can evaluate more easily.     Q: Can we not use the finalObjective as our termination condition?     A: Yes, as long as your output {W1,b1,W2,b2,b3,b4} can achieve lower objective than the specified objective.     Q: Do we have to submit the dataset files?     A: No, unless they are not the ones that are provided here.     Q: What happens to pre-training?    A: If you want, you can pre-train your network in a layer-wise fashion and then fine-tune your deep network.        We will time your program as a whole. The pre-training phase must be fully automated inside your optimization algorithm.    Q: Can we modify your MATLAB code?    A: Yes.    Q: What happens to sparsity, weight decay etc?    A: Use them as you wish. We however will evaluate your output parameters on the specified objective function.    Q: How do you determine the final objective?    A: We cannot tell you now. But roughly at convergence of a pre-determined optimization method.    Q (October 8) : The current evaluation protocol includes the time for evaluating the objective on the whole training set?         This can take a long time, should I do this often?    A: The final objective is chosen at convergence of a predetermined optimization method. You should not evaluate          your model on the training set regularly. Try to implement a heuristic for checking convergence (such as the change          in parameters). We expect you to evaluate training set error only once or twice during optimization.     In summary, you can change whatever you wish to change. You are only constrained to achieve a pre-determined     objective as fast as possible.8) Bug report and more information: transflearn.optim.wnips2011@gmail.com9) Acknowledgements:We thank Yoshua Bengio, Yann LeCun, James Martens and Ilya Sutskever for comments and suggestions. 10) References: [1] Y. LeCun, L. Bottou, G. Orr and K. Muller. Efficient BackProp. In Orr, G. and Muller K. (Eds), Neural Networks: Tricks of the trade, Springer, 1998. [2] Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow and A.Y. Ng. On optimization methods for deep learning. ICML, 2011.[3] J. Martens. Deep learning via Hessian free optimization. ICML, 2010.[4] A. Torralba, R. Fergus and W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI, 2008.[5] G. E. Hinton and R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks.  Science, 2006.[6] M. Schmidt. minFunc,  http://www.di.ens.fr/~mschmidt/Software/minFunc.html, 2009.[7] X. Glorot and Y. Bengio. Understanding the difﬁculty of training deep feedforward neural networks. AISTATS 2010.