**UPDATE:** The proceedings will be published in a special issue of the Journal of Machine Learning Research (vol. 44, Dec 2015) prior to the workshop date.

The event will consist of three sessions, each dedicated to a specific open problem in the area of feature extraction. There will be a panel discussion at the end of the workshop, where the audience will have an opportunity to engage in a debate with workshop organizers and invited speakers.

**Workshop chairs:**

Dmitry Storcheus (**Google Research**)

Afshin Rostamizadeh (**Google Research**)

Sanjiv Kumar (**Google Research**)

**Invited speakers:**

Klaus-Robert Müller (**TU Berlin**)

Fei Sha (**University of Southern California**)

Le Song (**Georgia Institute of Technology**)

Kilian Weinberger (**Cornell University**)

**Description:**

The problem of extracting features from given data is of critical importance for the successful application of machine learning. Feature extraction, as usually understood, seeks an optimal transformation from raw data into features that can be used as an input for a learning algorithm. In recent times this problem has been attacked using a growing number of diverse techniques that originated in separate research communities: from PCA and LDA to manifold and metric learning. It is the goal of this workshop to provide a platform to exchange ideas and compare results across these techniques.

The workshop will consist of three sessions, each dedicated to a specific open problem in the area of feature extraction. The sessions will start with invited talks and conclude with panel discussions, where the audience will engage in debates with speakers and organizers. We welcome submissions from sub-areas such as:

General embedding techniques: extract features by fitting an embedding that best describes data

Unsupervised manifold learning: classical examples include LLE, Isomap, and Laplacian Eigenmaps.

Supervised manifold learning: classical examples include LDA and CCA; more modern examples include "coupled" optimizations (e.g. multiple kernel learning), which learn a representation and a discriminative model simultaneously.

Dimensionality reduction: PCA, Kernel PCA, ICA, Multidimensional Scaling, and other projection-based methods.

Metric learning: the connection runs in both directions. Features can be extracted by learning an optimal similarity function, and a metric on the input data can be learned through a distance on the extracted features. Examples include MCMC, NCA, MCML, LMNN, and RCA, as well as online approaches and multi-task learning.

Scalable nonlinear features: random (or learned) Fourier feature approximation of kernel feature maps, kernel matrix approximation.

Deep neural networks: these can be used to learn a feature representation that is meant to generalize across similar domains.

Supervised vs. Unsupervised: Classic manifold learning methods such as LLE, Isomap, and Laplacian Eigenmaps have proven able to efficiently extract patterns from unlabeled data. Later, various supervised extensions of these methods emerged that use labels to adjust the distance between distinct classes. Moreover, multiple kernel learning has also been suggested for supervised feature extraction. Supervision often improves classification accuracy at the cost of flexibility and scalability. Can we shed more light on the tradeoff between supervised and unsupervised methods? Can we understand which methods are most useful for particular settings, and why?

Scalability. Recent advances in approximating kernel functions via random Fourier features have enabled kernel machines to match DNNs. That inspired a question: how well can we approximate kernel functions? Many efficient methods have been suggested: for instance, Monte Carlo methods have improved the results of Fourier features, and approximating polynomial kernels via explicit feature maps has shown remarkable performance. What does it all mean for the prospects of convex scalable methods? Will they become the new state-of-the-art feature extraction technique? Do the recent results shed more light on the comparison between kernel methods and deep nets? These questions, along with many others, are encouraged at the workshop.
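As a rough illustration of the random Fourier feature idea mentioned above, the sketch below approximates a Gaussian kernel by an inner product of randomized cosine features. It is a minimal standard-library example, not a method taken from any workshop submission; the function names (`make_rff_map`, `gaussian_kernel`), the bandwidth `sigma`, the number of features, and the sample points are all assumptions chosen for illustration.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Exact Gaussian (RBF) kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def make_rff_map(input_dim, num_features, sigma=1.0, seed=0):
    """Return a function mapping an input vector to random Fourier features."""
    rng = random.Random(seed)
    # Frequencies are drawn from the kernel's spectral density, N(0, I / sigma^2).
    weights = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(input_dim)]
               for _ in range(num_features)]
    # Random phases, uniform between 0 and 2*pi.
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
                for w, b in zip(weights, phases)]

    return feature_map


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical sample points; any two vectors of the same length would do.
x = [0.5, -0.2, 0.1]
y = [0.3, 0.4, -0.1]
feature_map = make_rff_map(input_dim=3, num_features=5000)
approx = dot(feature_map(x), feature_map(y))
exact = gaussian_kernel(x, y)
print(f"exact kernel: {exact:.4f}, RFF approximation: {approx:.4f}")
```

The approximation error shrinks roughly as one over the square root of the number of random features, which is why the quality-versus-cost question raised in this topic matters in practice.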

Convex and non-convex feature extraction. While deep nets suffer from non-convexity and a lack of theoretical guarantees, kernel machines are convex and well studied mathematically. Thus, it is extremely tempting to resort to kernels in order to understand neural nets. Significant progress has been made to that end. In particular, sequences of deep kernels have been used to study the layer-wise transformation of a neural net's input. A recent work analyzed deep networks with the idea of "relevance decomposition", that is, determining which inputs/pixels are important for an image to be classified as a particular type of object. These ideas are very promising; however, the connection between deep nets and kernels remains largely unexplored. Can we shed more light on it from both empirical and theoretical points of view?

Metric learning for feature extraction. It has been shown on image data that distance learning is an effective way to extract features. In fact, this connection goes both ways, since the metric on input data can be learned through the metric on extracted features. However, the question of how to learn a distance that most benefits the subsequent classification is still open. A number of efficient algorithms have emerged, such as support vector metric learning and large margin nearest neighbours. In addition, semi-supervised metric learning has attracted some attention recently. Can we better understand how different metric learning methods compare in feature extraction applications? Can we shed more light on metric learning from a theoretical point of view, for example through generalization guarantees?
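The two-way connection between metrics and features can be made concrete in the linear case: a Mahalanobis-type distance with matrix M = LᵀL on the inputs equals the Euclidean distance between the linearly extracted features Lx and Ly. The sketch below checks this numerically; the matrix `L`, the points, and all helper names are hypothetical values and names chosen only for illustration, and learned metrics in the literature need not be linear.

```python
import math


def linear_map(L, x):
    """Apply the matrix L (given as a list of rows) to the vector x."""
    return [sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in L]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def gram(L):
    """Return M = L^T L, a positive semidefinite matrix."""
    n_cols = len(L[0])
    return [[sum(row[i] * row[j] for row in L) for j in range(n_cols)]
            for i in range(n_cols)]


def mahalanobis(M, x, y):
    """Distance sqrt((x - y)^T M (x - y)) for a positive semidefinite M."""
    diff = [a - b for a, b in zip(x, y)]
    M_diff = linear_map(M, diff)
    return math.sqrt(sum(d * md for d, md in zip(diff, M_diff)))


# Hypothetical 2x3 transformation: three raw inputs mapped to two features.
L = [[1.0, 0.5, 0.0],
     [0.0, 2.0, -1.0]]
M = gram(L)
x = [1.0, 2.0, 3.0]
y = [0.0, 1.0, 5.0]

dist_features = euclidean(linear_map(L, x), linear_map(L, y))
dist_metric = mahalanobis(M, x, y)
print(f"feature-space distance: {dist_features:.6f}")
print(f"input-space metric:     {dist_metric:.6f}")
```

Under this linear assumption, learning the metric M and learning the feature map L are two views of the same problem, which is one reading of the statement that the connection goes both ways.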

Balance between extraction and classification stages. We often see in real-world applications (e.g. spam detection, audio filtering) that feature extraction is CPU-heavy compared to classification. That raises a question: how should computational resources be balanced between the extraction and classification stages? The classic solution was to sparsify the choice of features with L1 regularization. A promising alternative is to use trees of classifiers. However, this problem is NP-hard, so a number of relaxations have been suggested. Can we compare and contrast them? Will the tree-based approaches to the extraction/classification tradeoff become the state of the art?

Theory vs. Practice: Certain methods are supported by significant theoretical guarantees, but how do these guarantees translate into performance in practice? On the other hand, certain methods that perform excellently in practice are not understood theoretically; can novel theoretical understanding help improve these methods further?