Pattern recognition toolbox
TOOLDIAG is a collection of methods for statistical pattern recognition. The main area of application is classification. The application area is limited to multidimensional continuous features, without any missing values. No symbolic features (attributes) are allowed. The program is implemented in the 'C' programming language and was tested in several computing environments. The user interface is simple and command-line oriented, but the methods behind it are efficient and fast. You can add your own methods at the application programming level with relatively little effort. If you would like a presentation of the theory behind the program at your university, feel free to contact me.
The following gives a more detailed description of the capabilities of TOOLDIAG.
- CLASSIFIER PARADIGM
Different classifier types are provided:
- K-Nearest Neighbor
- Linear Machines, using the following learning rules:
  - Delta-Rule (a.k.a. Widrow-Hoff rule, LMS rule)
  - Deterministic Least-Mean-Square rule (a.k.a. Pseudoinverse)
  - Perceptron learning rule
- Quadratic Gaussian Classifier
- Radial Basis Function Network with training algorithms:
  - Error-Correction Learning
- Parzen window with kernel types: Hypercubic, Hypertriangle, Hyperspheric, Gaussian, Exponential, Lorenz
- Q* algorithm
- Multilayer Perceptron (1 hidden layer)
  - Learning rules: Stochastic & Batch (with/without momentum)
  - Activation function: Sigmoid & Hyperbolic tangent
- Support Vector Machine, using LIBSVM
- Probabilistic Neural Network
- "Your own classifier" (Framework to implement your own classification method)
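As an illustration of the simplest paradigm in the list above, here is a minimal K-Nearest Neighbor sketch in C. It is not taken from the TOOLDIAG sources: the function name knn_classify, the row-major data layout, the fixed limits KNN_MAX_K and KNN_MAX_CLASSES, and the tie-breaking rule are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define KNN_MAX_K       32   /* illustrative upper bound on k */
#define KNN_MAX_CLASSES 16   /* illustrative upper bound on the class count */

/* Squared Euclidean distance between two feature vectors of length dim. */
static double sq_dist(const double *a, const double *b, size_t dim)
{
    double s = 0.0;
    for (size_t i = 0; i < dim; i++) {
        double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

/*
 * Classify x by majority vote among its k nearest training samples.
 * train is row-major (n samples, dim features each); labels[i] must lie
 * in [0, n_classes). Ties go to the lowest class index (an assumption).
 */
int knn_classify(const double *train, const int *labels, size_t n, size_t dim,
                 int n_classes, const double *x, size_t k)
{
    double best_d[KNN_MAX_K];             /* k smallest distances, ascending */
    int    best_l[KNN_MAX_K];             /* labels belonging to best_d */
    int    votes[KNN_MAX_CLASSES] = {0};
    size_t found = 0;

    assert(k > 0 && k <= KNN_MAX_K && k <= n);
    assert(n_classes > 0 && n_classes <= KNN_MAX_CLASSES);

    for (size_t i = 0; i < n; i++) {
        double d = sq_dist(train + i * dim, x, dim);
        if (found < k || d < best_d[found - 1]) {
            /* insert into the sorted list, dropping the worst entry if full */
            size_t j = (found < k) ? found++ : k - 1;
            while (j > 0 && best_d[j - 1] > d) {
                best_d[j] = best_d[j - 1];
                best_l[j] = best_l[j - 1];
                j--;
            }
            best_d[j] = d;
            best_l[j] = labels[i];
        }
    }

    for (size_t j = 0; j < found; j++)
        votes[best_l[j]]++;

    int winner = 0;
    for (int c = 1; c < n_classes; c++)
        if (votes[c] > votes[winner])
            winner = c;
    return winner;
}
```

The sketch keeps only the k best candidates in a small sorted buffer, so one pass over the training set is enough and no full sort is needed.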
- FEATURE SELECTION
A strong part of the program. Several search strategies are provided:
- Best Features
- Sequential Forward Selection
- Sequential Backward Selection
- Plus L - Take away R
- Sequential Floating Forward Selection
- Sequential Floating Backward Selection
- Branch and Bound
- Exhaustive Search
The search strategies can be combined with several selection criteria. The main groups of the selection criteria are:
- Estimated minimal error probability
  An arbitrary classifier model can be combined with an arbitrary cross-validation technique to estimate the error (wrapper method).
- Inter-class distance: Minkowski, City block, Euclidean, Chebychev, Nonlinear (Parzen & hyperspheric kernel)
- Probabilistic distance, assuming multivariate Gaussian distribution: Chernoff, Bhattacharyya distance, Matusita distance, Divergence, Mahalanobis, Patrick-Fisher
- Confusion-matrix based
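To show how a search strategy and a selection criterion fit together, the following is a rough sketch of Sequential Forward Selection in C. The criterion is passed in as a function pointer, so any of the criterion groups could in principle be plugged in. The names criterion_fn, sfs_select and toy_additive_criterion, as well as the SFS_MAX_FEATURES limit, are hypothetical and are not TOOLDIAG's actual interface.

```c
#include <assert.h>
#include <stddef.h>

#define SFS_MAX_FEATURES 256   /* illustrative upper bound */

/* Hypothetical criterion signature: a larger value means a better subset. */
typedef double (*criterion_fn)(const size_t *subset, size_t size, void *ctx);

/*
 * Sequential Forward Selection: start from the empty set and, in each step,
 * add the single feature that maximises the criterion J, until `target`
 * features are chosen. `selected` must have room for `target` indices.
 * Returns the criterion value of the final subset.
 */
double sfs_select(size_t n_features, size_t target, criterion_fn J, void *ctx,
                  size_t *selected)
{
    unsigned char used[SFS_MAX_FEATURES] = {0};
    double final_j = 0.0;

    assert(J != NULL && selected != NULL);
    assert(target > 0 && target <= n_features);
    assert(n_features <= SFS_MAX_FEATURES);

    for (size_t size = 0; size < target; size++) {
        size_t best_f = n_features;       /* sentinel: nothing chosen yet */
        double best_j = 0.0;

        for (size_t f = 0; f < n_features; f++) {
            if (used[f])
                continue;
            selected[size] = f;           /* tentatively add feature f */
            double j = J(selected, size + 1, ctx);
            if (best_f == n_features || j > best_j) {
                best_f = f;
                best_j = j;
            }
        }
        selected[size] = best_f;          /* commit the best candidate */
        used[best_f] = 1;
        final_j = best_j;
    }
    return final_j;
}

/*
 * Toy criterion for demonstration only: the sum of per-feature scores
 * passed through ctx. With an additive criterion like this one, SFS
 * simply picks the individually best features.
 */
double toy_additive_criterion(const size_t *subset, size_t size, void *ctx)
{
    const double *score = ctx;
    double sum = 0.0;
    for (size_t i = 0; i < size; i++)
        sum += score[subset[i]];
    return sum;
}
```

For a wrapper-style criterion, J would train a classifier on the candidate subset and return, for example, one minus the cross-validated error estimate.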
- FEATURE EXTRACTION
All available features are combined into new features of lower dimension. The methods:
- Linear discriminant analysis
- Principal Component Analysis alias Karhunen-Loève Expansion
- Sammon mapping (a nonlinear method)
- Higher-order combinations of existing features (polynomials)
- Time Series: Fourier transform
- Time Series: Regression by Orthogonal Polynomials
- Time Series: Regression by 'Usual' Polynomials
- PERFORMANCE ESTIMATION
Several performance estimation methods can be combined with all available classifier paradigms, thus allowing easy comparison of results.
The Cross-Validation methods:
- Rotation alias K-fold cross validation
The performance estimation methods:
- Sensitivity aka Recall
- Geometric mean of Sensitivity and Precision
- F-measure of Sensitivity and Precision
- ROC analysis: Area under ROC curve
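The measures listed above can be computed from a two-class confusion matrix and from classifier scores. The sketch below uses the standard textbook definitions. The names binary_scores, score_binary and auc_pairwise are made up for this example, and returning 0 for an undefined ratio is a convention chosen here, not necessarily what TOOLDIAG does. The geometric mean is left out to avoid a math-library dependency; it would be the square root of sensitivity times precision.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double sensitivity;  /* TP / (TP + FN), a.k.a. recall */
    double precision;    /* TP / (TP + FP) */
    double f_measure;    /* harmonic mean of sensitivity and precision */
} binary_scores;

/*
 * Compute sensitivity, precision and F-measure from the counts of true
 * positives, false positives and false negatives. Undefined ratios
 * (zero denominator) are reported as 0.0 in this sketch.
 */
binary_scores score_binary(unsigned tp, unsigned fp, unsigned fn)
{
    binary_scores s;
    s.sensitivity = (tp + fn) ? (double)tp / (tp + fn) : 0.0;
    s.precision   = (tp + fp) ? (double)tp / (tp + fp) : 0.0;
    /* 2PR / (P + R) simplifies to 2TP / (2TP + FP + FN) */
    s.f_measure   = (2 * tp + fp + fn) ? 2.0 * tp / (2.0 * tp + fp + fn) : 0.0;
    return s;
}

/*
 * Area under the ROC curve, computed as the fraction of
 * (positive, negative) pairs in which the positive sample has the
 * higher score; ties count one half. O(n^2), fine for small sets.
 */
double auc_pairwise(const double *score, const int *is_positive, size_t n)
{
    double wins = 0.0;
    size_t pairs = 0;

    assert(score != NULL && is_positive != NULL);

    for (size_t i = 0; i < n; i++) {
        if (!is_positive[i])
            continue;
        for (size_t j = 0; j < n; j++) {
            if (is_positive[j])
                continue;
            pairs++;
            if (score[i] > score[j])
                wins += 1.0;
            else if (score[i] == score[j])
                wins += 0.5;
        }
    }
    return pairs ? wins / (double)pairs : 0.0;
}
```

For example, 3 true positives, 1 false positive and 1 false negative give a sensitivity, precision and F-measure of 0.75 each.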
- SAMMON PLOT
A graphical interface to the GNUPLOT program is provided, which allows the data points to be plotted in 2-D or 3-D. Higher-dimensional data can be mapped by a structure-conserving algorithm, the Sammon mapping.
The analyzed data can be passed to other programs or split into several training and test data sets. Two different feature families that describe the same samples can be merged. Interfaces exist to:
The data samples can be normalized:
- Linearly to [0,1]
- Zero mean, unit variance
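The linear normalization to [0,1] could look roughly like the following C sketch, which rescales each feature independently. The function name normalize_unit_interval, the row-major layout and the handling of constant features are assumptions of this example. The zero-mean, unit-variance variant follows the same per-feature pattern, with the mean and standard deviation taking the place of the minimum and the range.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Rescale every feature (column) of a row-major n x dim data matrix
 * linearly to [0,1], in place: v' = (v - min) / (max - min).
 * Constant features (max == min) are set to 0; that convention is an
 * assumption of this sketch.
 */
void normalize_unit_interval(double *data, size_t n, size_t dim)
{
    assert(data != NULL && n > 0 && dim > 0);

    for (size_t f = 0; f < dim; f++) {
        /* first pass: find the extrema of feature f */
        double lo = data[f], hi = data[f];
        for (size_t i = 1; i < n; i++) {
            double v = data[i * dim + f];
            if (v < lo) lo = v;
            if (v > hi) hi = v;
        }

        /* second pass: map [lo, hi] onto [0, 1] */
        double range = hi - lo;
        for (size_t i = 0; i < n; i++) {
            double *v = &data[i * dim + f];
            *v = (range > 0.0) ? (*v - lo) / range : 0.0;
        }
    }
}
```

If test data are normalized separately, the extrema found on the training set would normally be reused rather than recomputed.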
Statistical parameters of the data can be generated, globally and for each particular class:
- Extrema, mean and standard deviation
- Covariance matrix
- Correlation matrix
- Dispersion and overlapping
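As an example of such statistical parameters, the sketch below computes the mean vector and the covariance matrix of a data set in C. It uses the unbiased estimate (division by n - 1), which is an assumption, since the text does not say which estimator TOOLDIAG uses. The name mean_and_covariance and the row-major layout are likewise illustrative.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Compute the mean vector (length dim) and the dim x dim covariance
 * matrix (row-major) of a row-major n x dim data matrix.
 * Uses the unbiased estimator, i.e. division by (n - 1).
 */
void mean_and_covariance(const double *data, size_t n, size_t dim,
                         double *mean, double *cov)
{
    assert(data != NULL && mean != NULL && cov != NULL);
    assert(n > 1 && dim > 0);

    /* per-feature mean */
    for (size_t f = 0; f < dim; f++) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += data[i * dim + f];
        mean[f] = sum / (double)n;
    }

    /* covariance: compute the lower triangle and mirror it */
    for (size_t a = 0; a < dim; a++) {
        for (size_t b = 0; b <= a; b++) {
            double sum = 0.0;
            for (size_t i = 0; i < n; i++)
                sum += (data[i * dim + a] - mean[a]) *
                       (data[i * dim + b] - mean[b]);
            cov[a * dim + b] = cov[b * dim + a] = sum / (double)(n - 1);
        }
    }
}
```

Per-class statistics can be obtained by calling the same routine on only the rows that belong to one class.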
Besides these, a set of functions of minor importance is available, such as loading and saving, adding noise, or a demonstration run.
Envisaged future extensions of the program:
- Estimation of unknown feature values
- Symbolic features allowed
The following WWW resources contain databases which are processable by TOOLDIAG. Only databases with continuous features (attributes) and no missing values are allowed as input to TOOLDIAG.
Windows: A pre-compiled executable is included, together with a project file for the Dev-C++ development environment ( http://www.bloodshed.net/dev/devcpp.html ) with which you can recompile the source code.
Sorry, no fancy interface or installers available, just pure methods.