Text classification in scikits.learn: proposal (GSoC 2008)

 
 

Introduction

There are many papers about the necessity of a standard machine learning framework [3]. I believe that Python with NumPy and SciPy is very well suited to this kind of task. There is a machine learning package in scikits [4], but it is currently in a raw state. I am going to make it suitable for text classification and categorization tasks. These tasks require classifiers, and there are at least two state-of-the-art choices: SVM and Bayesian regression. Support for sparse data is also needed, because text classification problems often have very high dimensionality, while the length of a single document is much smaller than the dimension of the feature space. There should also be a number of feature selection algorithms and other pre- and post-processing operations such as normalization, different variants of tf-idf weighting, etc.

The main idea is to make writing new classifiers (and integrating them into scipy) and comparing them with existing ones as simple as possible.

Finally, I am going to write a tutorial on text classification with scipy.

Goals

The first task is to make the current dataset class [5] suitable for working with sparse data. At the interface level it should behave something like a masked array (the user can mask unused elements of the array). At the implementation level it could be, for example, a simple list of (index, value) pairs. If we only need to multiply vectors, there are obvious algorithms for computing the scalar product of such sparse arrays; a sketch follows this paragraph. I am going to look at the libsvm sparse format to understand how it exploits the sparseness of the data. Another idea is to use scipy.sparse. This question requires a short analysis.
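
A minimal sketch of such a scalar product, assuming the simple-list representation mentioned above, with each vector stored as index-sorted (index, value) pairs:

    def sparse_dot(a, b):
        """Scalar product of two sparse vectors stored as sorted (index, value) lists."""
        result = 0.0
        i = j = 0
        while i < len(a) and j < len(b):
            ia, va = a[i]
            ib, vb = b[j]
            if ia == ib:
                result += va * vb
                i += 1
                j += 1
            elif ia < ib:
                i += 1
            else:
                j += 1
        return result

    # Only the shared dimension 3 contributes to the product.
    print(sparse_dot([(0, 1.0), (3, 2.0)], [(3, 4.0), (7, 1.0)]))  # 8.0

Only O(len(a) + len(b)) work is done, regardless of the dimension of the feature space.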

In the near future the developers of scikits.learn hope to implement interfaces to other machine learning packages such as PyMVPA [6] and Weka [7]. Such an architecture requires simple interfaces between the different packages. That is why the scikits.learn dataset (and its sparse counterpart) should be built from simple data structures such as numpy.array and should allow simple access to them. The sparse part must satisfy these requirements as well.

I think the package should support at least one standard input data format, so that the user does not have to invent one. This is not a very hard task: we can adopt one of the existing formats together with a parser for it.
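
For instance, the libsvm sparse format stores one document per line as a label followed by index:value pairs. A minimal parser (the function name is mine, for illustration):

    def parse_libsvm_line(line):
        """Parse one line of the libsvm sparse format: '<label> <index>:<value> ...'."""
        parts = line.split()
        label = float(parts[0])
        features = [(int(i), float(v))
                    for i, v in (p.split(":") for p in parts[1:])]
        return label, features

    print(parse_libsvm_line("1 4:0.5 12:1.0"))  # (1.0, [(4, 0.5), (12, 1.0)])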

All supplementary algorithms (normalization, tf-idf weighting) will be implemented as simple operations on vectors and matrices.
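
As an example, one common tf-idf variant really is just a few matrix operations; this sketch assumes a dense numpy term-count matrix with one row per document and one column per term:

    import numpy as np

    def tf_idf(counts):
        """Raw term frequency times log inverse document frequency."""
        n_docs = counts.shape[0]
        df = (counts > 0).sum(axis=0)             # document frequency of each term
        idf = np.log(n_docs / np.maximum(df, 1))  # guard against division by zero
        return counts * idf                       # idf broadcasts over the rows

    counts = np.array([[2, 0, 1],
                       [0, 1, 1]])
    print(tf_idf(counts))  # the third term occurs in every document, so its weight is 0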

Feature selection algorithms are useful in almost every machine learning application, so it might be a good idea to implement all feature selection algorithms behind a common interface. This would be better than relying on the feature selection scattered across the separate frameworks that will be integrated into scikits.learn.
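
The interface could look roughly like the following sketch; the class name and the fit/transform method names are illustrative assumptions, not an existing scikits.learn API:

    import numpy as np

    class DocumentFrequencySelector:
        """Keep the k terms that occur in the largest number of documents."""

        def __init__(self, k):
            self.k = k
            self.selected = None

        def fit(self, counts):
            df = (counts > 0).sum(axis=0)          # document frequency per term
            self.selected = np.argsort(df)[-self.k:]
            return self

        def transform(self, counts):
            return counts[:, self.selected]        # drop the unselected columns

Any other criterion (chi-squared, information gain, ...) would plug into the same two methods.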

The main part of my work is the implementation of classifiers. SVM is already integrated into scikits.learn as a Python wrapper over libsvm. Some regression algorithms will come through the PyMVPA integration [6], but I want to implement Bayesian Binary Regression (BBR) [8], which shows good results on standard text classification benchmarks [9]. I cannot precisely estimate the complexity of rewriting the whole algorithm in Python, so I am not yet sure whether to implement only a wrapper or the whole thing in Python. In both cases I am going to make it as generic as possible. For instance, the current train procedure consists of many steps, such as choosing the best prior variance, feature selection, etc., and I will try to make all of this functionality available to the user of scikits.learn.
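
At its core BBR is Bayesian logistic regression: the MAP estimate maximizes the log-likelihood plus a log-prior, which with a Gaussian prior amounts to an L2 penalty (a Laplace prior gives L1). A minimal sketch of that objective, with names of my own choosing:

    import numpy as np

    def neg_log_posterior(w, X, y, prior_var):
        """Objective minimized by MAP training; labels y must be in {-1, +1}."""
        margins = y * X.dot(w)
        log_lik = -np.log1p(np.exp(-margins)).sum()      # sum of log sigmoid(margin)
        log_prior = -(w ** 2).sum() / (2.0 * prior_var)  # Gaussian prior, up to a constant
        return -(log_lik + log_prior)

    X = np.array([[1.0, 0.0], [0.0, 1.0]])
    y = np.array([1.0, -1.0])
    print(neg_log_posterior(np.zeros(2), X, y, prior_var=1.0))  # 2*log(2)

The actual BBR package minimizes this objective with a cyclic coordinate descent algorithm described in [9].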

Good documentation is an integral part of good software. I have always dreamed of documentation that includes references for the algorithms it uses, so that it can also serve as educational material. I am going to write a simple tutorial and documents that describe (or link to descriptions of) the approaches behind the simple interface.

Other toolboxes

There are a lot of other machine learning toolboxes, each with its own pluses and minuses. I will try to describe some of them.

Orange [1] is a very good data mining project, but it has poor support for sparse formats. PyML [2] has all the features needed for text classification tasks, but there are problems installing it on different platforms, and its code design is not perfect.

I know little about Weka [7] beyond the fact that it is written in Java, but there are plans to integrate it into scikits.learn as well.

There is "Multivariate Pattern Analysis in Python" [6]. It looks pretty good. There are also plans to integrate it to scikits.learn. It has some almost all things which I mentioned. And I try not to repeat its functionality, but use it and supplement it with new features. It is needed understanding all plans of integrating this tool into scikits.learn.  I hope that in near future I can receive  detailed description of how it will look in scipy, it allows me to write clear plan of communication my part of code with this tool as part of scikits.learn.


Timeline

I see the following main milestones.

Milestone 1 (1 July)

- Sparse datasets

- Decide which features will be provided through PyMVPA

- Understand precisely how to implement BBR

- Write tests for the whole classification process

At this point it should be possible to load data and run cross-validation with SVM in a few lines of code, roughly as in the sketch below.
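
A runnable sketch of the k-fold loop that such a convenience function would wrap; the train/predict callables here are trivial placeholders standing in for the svm bindings:

    import numpy as np

    def cross_validate(X, y, folds, train, predict):
        """Return the accuracy on each of `folds` contiguous folds."""
        idx = np.arange(len(y))
        scores = []
        for test in np.array_split(idx, folds):
            train_idx = np.setdiff1d(idx, test)
            model = train(X[train_idx], y[train_idx])
            scores.append(np.mean(predict(model, X[test]) == y[test]))
        return np.array(scores)

    # Placeholder "classifier": always predicts the majority training label.
    train = lambda X, y: 1 if (y == 1).mean() >= 0.5 else -1
    predict = lambda model, X: np.full(len(X), model)

    X = np.random.rand(10, 3)
    y = np.array([1, 1, 1, -1, 1, 1, -1, 1, -1, 1])
    print(cross_validate(X, y, 5, train, predict))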

Milestone 2 (20 August)

- Implement BBR (the most difficult part of the whole project)

- Write the tutorial and other documentation


Links

1. Orange: http://magix.fri.uni-lj.si/orange/
2. PyML: http://pyml.sourceforge.net/
3. The Need for Open Source Software in Machine Learning: http://www.jmlr.org/papers/volume8/sonnenburg07a/sonnenburg07a.pdf
4. scikits.learn original proposal: https://projects.scipy.org/scipy/scikits/wiki/MachineLearningOriginalProposal
5. Dataset proposal: http://projects.scipy.org/scipy/scikits/browser/trunk/learn/scikits/learn/datasets/DATASET_PROPOSAL.txt
6. PyMVPA: http://pkg-exppsy.alioth.debian.org/pymvpa
7. Weka: http://www.cs.waikato.ac.nz/ml/weka/
8. BBR: http://www.stat.rutgers.edu/~madigan/BBR/
9. BBR article: http://stat.rutgers.edu/~madigan/PAPERS/techno-06-09-18.pdf