Data Mining
http://www.reddit.com/r/programming/comments/bznwz/proggit_can_you_recommend_a_gentle_introduction/
http://www.google.com/Top/Computers/Software/Databases/Data_Mining/Public_Domain_Software/
http://www.google.com/Top/Computers/Artificial_Intelligence/Machine_Learning/Software/
http://www.google.com/Top/Computers/Artificial_Intelligence/
Orange (Python) http://magix.fri.uni-lj.si/orange/
Modular toolkit for Data Processing (Python) http://mdp-toolkit.sourceforge.net/
Weka (Java) http://www.cs.waikato.ac.nz/~ml/weka/
AIDA (C++/Java/Python) http://en.wikipedia.org/wiki/AIDA_%28computing%29 http://aida.freehep.org/tools.thtml
YALE (Java) http://rapid-i.com/
IBM http://www.alphaworks.ibm.com/keywords/data%20mining
Open Bayes (Python) http://www.openbayes.org/
SVM (Python) http://www.csie.ntu.edu.tw/~cjlin/libsvm/ http://svmlight.joachims.org/
Clustering:
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
http://glaros.dtc.umn.edu/gkhome/views/cluto/gcluto/index.html
Rete rule engine (Python) http://www.mindswap.org/~katz/pychinko/
Text Mining http://krondix.blogspot.com/2007/05/text-mining-libraries.html - ClustBoost, GATE, Lucene, ANNIE, MinorThird
http://en.wikipedia.org/wiki/Classic_data_sets
Video: http://obousquet.googlepages.com/mlvideos
http://en.wikipedia.org/wiki/Machine_learning
http://en.wikipedia.org/wiki/Support_vector_machine SVM
http://www.learningtheory.org/
http://homepages.inf.ed.ac.uk/s0450736/maxent.html Maximum entropy
http://www-faculty.cs.uiuc.edu/~hanj/
http://jmlr.csail.mit.edu/ - Journal of Machine Learning Research
http://www.astro.cornell.edu/staff/loredo/bayes/
http://maven.smith.edu/~jfrankli/studentResearch/tle/cluster.html
http://www.olapreport.com/Analyses.htm
http://www.intel.com/technology/techresearch/people/bios/bradski_g.htm
http://www.pnas.org/cgi/content/abstract/104/15/6424?etoc
http://www.siam.org/meetings/proceedings/2007/datamining/
http://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29 Bootstrapping
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
http://jmvidal.cse.sc.edu/talks/conceptlearning/
http://see.stanford.edu/SEE/courseinfo.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
http://see.stanford.edu/SEE/courseinfo.aspx?coll=17005383-19c6-49ed-9497-2ba8bfcfe5f6
http://challenge.spock.com/pages/learn_more
http://www.data-mining-cup.com/
Table 1: Inside the CRISP-DM Process.
One popular data mining approach relies on the Cross-Industry Standard Process for Data Mining (CRISP-DM). This methodology lets you plan, execute, and control the lifecycle of a data mining project. CRISP-DM consists of several phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment and maintenance.
Learning algorithms may rely only on the input vectors, with no target values; such learning is called unsupervised. Unsupervised learning is used, for example, in clustering, self-organization, auto-association and some visualization algorithms.
If a learning algorithm uses pairs <x_i, y_i>, where y_i is the desired output value for x_i, then the learning is called supervised (learning with a teacher).
The problem of finding the best model from some space H can be analyzed from the Bayesian point of view. Bayes' theorem
P(f|D) = P(D|f) P(f) / P(D)
defines the relationship between the posterior probability P(f|D) and the prior probability P(f) of a given hypothesis f. This formula is fundamental in Bayesian learning, which aims at finding the hypothesis that maximizes the a posteriori probability (MAP):
f_MAP = argmax_{f in H} P(f|D) = argmax_{f in H} P(D|f) P(f)
Assuming equal a priori probabilities for all the hypotheses f in H leads to the definition of the maximum likelihood (ML) hypothesis:
f_ML = argmax_{f in H} P(D|f)
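For illustration, here is a minimal Python sketch (not from the source) that selects the MAP and ML hypotheses from a small discrete hypothesis space of coin biases; the hypotheses, priors and data are assumptions made only for the example.

# Minimal sketch of MAP and ML hypothesis selection over a discrete
# hypothesis space H; the coin-bias hypotheses, priors and data are
# assumptions made purely for illustration.
from math import prod

# Data D: observed coin flips (1 = heads, 0 = tails)
D = [1, 1, 0, 1, 1, 0, 1, 1]

# Hypothesis space H: each hypothesis f is "P(heads) = theta"
H = [0.25, 0.5, 0.75]
prior = {0.25: 0.2, 0.5: 0.6, 0.75: 0.2}   # P(f), assumed

def likelihood(theta, data):
    """P(D|f) for independent Bernoulli flips."""
    return prod(theta if x == 1 else 1.0 - theta for x in data)

# f_MAP = argmax_f P(D|f) * P(f)
f_map = max(H, key=lambda f: likelihood(f, D) * prior[f])
# f_ML = argmax_f P(D|f)  (equivalent to MAP with a flat prior)
f_ml = max(H, key=lambda f: likelihood(f, D))

print("MAP hypothesis:", f_map)
print("ML hypothesis:", f_ml)

With this prior the MAP and ML hypotheses differ, which is exactly the effect of the prior term P(f).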
Instance-based or similarity-based methods form a large branch of machine learning algorithms developed by the pattern recognition community. A primary example of such models is the k nearest neighbors (kNN) algorithm, which classifies data vectors on the basis of their similarity to some memorized instances. More precisely, for a given data vector it assigns the class label that appears most frequently among its k nearest neighbors. This may be seen as an approximation of the Bayes Optimal Classifier, where the probabilities are estimated on the basis of the analysis of the vicinity of the classified example. Methods restricting their analysis to a neighborhood of a given object are called local. In the case of the single-neighbor classifier (1NN), the decision borders form the Voronoi tessellation of the feature space.
There are many different possibilities for implementing and applying the kNN algorithm. For example, different distance measures can be used, and the choice is not simple. The choice of neighbors is not unambiguous either: one can select the k nearest neighbors or all neighbors within a given radius. The neighbors' influence on the decision may be weighted according to the distance (e.g. 1/(1 + dist) or max[1 − dist, 0]), a flexible k depending on the region may be used, the best k may be estimated by means of cross-validation, etc.
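As a rough illustration of the variants described above, the following sketch implements a distance-weighted kNN classifier with the 1/(1 + dist) weighting; the Euclidean metric and the toy data are assumptions made for the example.

# Minimal sketch of a distance-weighted kNN classifier using the
# 1/(1 + dist) weighting mentioned above; Euclidean distance and the
# toy data are assumptions made for illustration.
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, train, k=3):
    """train: list of (vector, label) pairs."""
    # the k nearest memorized instances
    neighbors = sorted(train, key=lambda vl: euclidean(query, vl[0]))[:k]
    # each neighbor votes with weight 1 / (1 + dist)
    votes = defaultdict(float)
    for vec, label in neighbors:
        votes[label] += 1.0 / (1.0 + euclidean(query, vec))
    return max(votes, key=votes.get)

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
print(knn_predict((0.2, 0.1), train, k=3))   # -> "A"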
A group of algorithms is dedicated to prototype selection (choosing the instances among which the search for neighbors is performed). Algorithms that work well for prototype selection include DROP, Explore, LVQ, DEL, ICF and ENN.
Decision trees are hierarchical models, very popular in classification tasks.
The tree nodes are described by logical conditions using single features. This is the most common technique in decision tree implementations and results in decision borders perpendicular to the axes of the feature space. Other shapes of borders are possible when, for instance, linear combinations of features or distances from some reference points are used in the logical expressions assigned to the nodes; however, perpendicular borders are often preferred because of their comprehensibility. In some fields experts' knowledge can be used directly to specify decision rules, but more often all that is given is a set of classified data vectors, so the best way to find decision rules is to use computational intelligence tools.
Tree growing algorithms start with a given training data set as the root node and recursively split the nodes into several disjoint parts to separate vectors assigned to different classes. Each split is described by a general logical rule, which in fact divides the whole feature space (not just the training set).
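A minimal sketch of such a greedy growing procedure is given below; the binary threshold splits chosen by information gain (entropy) and the simple stopping rule are assumptions made for illustration, not a reproduction of any particular algorithm mentioned here.

# Minimal sketch of recursive tree growing: binary threshold splits are
# chosen greedily by weighted child entropy, and growth stops at pure or
# very small nodes; these choices are simplifying assumptions.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(X, y):
    """Return (feature index, threshold) minimizing the weighted child entropy."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return None if best is None else best[1:]

def grow(X, y, min_size=2):
    split = best_split(X, y) if len(set(y)) > 1 and len(y) >= min_size else None
    if split is None:
        return Counter(y).most_common(1)[0][0]      # leaf: majority class
    f, t = split
    left = [(row, lab) for row, lab in zip(X, y) if row[f] <= t]
    right = [(row, lab) for row, lab in zip(X, y) if row[f] > t]
    lX, ly = zip(*left)
    rX, ry = zip(*right)
    return {"feature": f, "threshold": t,
            "left": grow(list(lX), list(ly), min_size),
            "right": grow(list(rX), list(ry), min_size)}

X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y = ["A", "A", "B", "B"]
print(grow(X, y))   # {'feature': 0, 'threshold': 1.5, 'left': 'A', 'right': 'B'}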
Such comprehensible models are advantageous in many fields (for instance in medical diagnosis) and provide information about the relevance of particular features. The recursiveness of the process makes feature selection a local task: useful features can be picked up even if they are valuable only in a subregion of the input space, while from the point of view of the whole space they seem to contain less information than other features.
Building optimal trees is a very complex problem, especially because optimizing a particular split is not the same as maximizing the classification accuracy of the whole model. A number of different decision tree construction methods have already been published, among them Classification and Regression Trees (CART) [Breiman et al.], ID3 [Quinlan, 1986], C4.5 [Quinlan, 1993], Separability of Split Value (SSV) trees, and the Fast Algorithm for Classification Trees (FACT). The methods are based on different ideas and use different model structures: some of them use dichotomic splits (CART, SSV), others allow for more complex branching.
Decision trees can usually be presented in the form of logical rules. Some algorithms assume a particular data distribution and use parametric statistical tests (FACT, QUEST, Cal5), while others make no such assumptions.
Nevertheless, all trees can be described with the same terminology, including nodes, subnodes, subtrees, leaves, branches, branch length, tree depth, class labels assigned to nodes, data vectors falling into a node, and others. There are three main aspects in which decision tree algorithms differ from each other: the way nodes are split into subnodes, the way the tree is constructed (which in most cases is just a search process), and the way generalization is taken care of (stopping criteria or pruning techniques).
Splitting nodes according to continuous and discrete inputs must be performed differently. This is why some methods (like ID3) can deal only with discrete features; continuous attributes need to be converted to discrete ones before such algorithms can be applied, and good classification results can be obtained only in combination with good (external) discretization methods. On the other hand, methods like FACT and QUEST are designed to deal only with continuous input: discrete features are converted to continuous ones by translating them into binary vectors of dimension equal to the number of possible discrete values, which are then projected to a single dimension. The original Cal5 algorithm was also designed for continuous data, but it can be quite easily adapted to deal with discrete features. The candidate splits of continuous features are usually binary (of the form {(−∞, a], (a, +∞)}), but some methods split such input into several intervals (e.g. FACT and Cal5).
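As a small illustration of the first step of this discrete-to-continuous conversion, the sketch below encodes a discrete attribute as binary indicator columns; the subsequent projection to a single dimension used by FACT/QUEST is not shown, and the toy values are an assumption.

# Minimal sketch of turning a discrete attribute into binary indicator
# columns, as the first step of the FACT/QUEST-style treatment described
# above; the projection to a single dimension is omitted.
def one_hot(values):
    """Map each discrete value to a binary vector of length = number of distinct values."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = [[1 if index[v] == i else 0 for i in range(len(categories))]
               for v in values]
    return encoded, categories

encoded, categories = one_hot(["red", "blue", "red", "green"])
print(categories)   # ['blue', 'green', 'red']
print(encoded)      # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]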
Most decision tree algorithms split the nodes with respect to the values of a single feature. Selecting the feature most eligible for the split is often closely bound up with the selection of the splitting points (CART, C4.5, SSV). An alternative strategy is applied in FACT, where the split feature is defined as the one that maximizes the value of the F statistic known from the analysis of variance (ANOVA). The QUEST algorithm also calculates the F statistic for continuous features, but for discrete ones it uses the χ² statistic (to prevent favoring discrete attributes over continuous ones). Cal5 also selects the split feature with a separate algorithm: either the amount of information about the classes is measured with an entropy-based formula (in this case the discretization must be performed first), or a coefficient is calculated for each feature to estimate its class-separation ability on the basis of mean squared distances within classes and between class centroids.
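The following sketch shows FACT-style split-feature selection: a one-way ANOVA F statistic is computed for each continuous feature and the feature with the largest value is chosen. The toy data are an assumption made for illustration.

# Minimal sketch of FACT-style split-feature selection: compute a one-way
# ANOVA F statistic for every continuous feature and pick the feature with
# the largest value; the toy data are an assumption for illustration.
from collections import defaultdict

def f_statistic(values, labels):
    """One-way ANOVA F statistic for one feature across the classes."""
    groups = defaultdict(list)
    for v, lab in zip(values, labels):
        groups[lab].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

X = [[1.0, 5.2], [1.2, 4.9], [3.1, 5.1], [3.3, 5.0]]   # rows = samples
y = ["A", "A", "B", "B"]
scores = [f_statistic([row[f] for row in X], y) for f in range(len(X[0]))]
split_feature = max(range(len(scores)), key=scores.__getitem__)
print(scores, "-> split on feature", split_feature)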
Simple search techniques like hill climbing are most frequently used. Experience shows that increasing the computational effort of the search method (e.g. using beam search) does not improve the generalization abilities of the constructed trees (Quinlan and Cameron-Jones, 1995). No reliable explanation of this fact is known, but it looks like a more thorough search leads to solutions more specific to the training set.
Building decision trees that maximally fit the training data usually ends up in overfitted, large trees with leaves classifying only a few cases, and thus does not provide general knowledge about the problem. To keep the trees simpler and more general, the methods may use stopping criteria. A simple idea is to keep the number of vectors in the nodes above a specified threshold. Another way is to extend the separability criterion with a penalty term conforming to the Minimum Description Length principle. In both cases it is difficult to define criteria which yield results close to optimal. A more reasonable way is to use statistical tests to verify the significance of the improvement introduced by a split (such a technique is part of C4.5). Stopping tree growth can save much time, but makes optimal solutions less probable. Usually, better generalization is obtained by building overfitted trees and pruning them. There are many different pruning methods; they may concern tree depth, the number of vectors in a node, etc. Their parameters can be chosen on the basis of a cross-validation test performed within the training data (as in CART, QUEST or SSV Tree).
The training data set can be incomplete (some feature values for some data vectors may be inaccessible). Putting arbitrary values in place of the missing ones should be considered only if no other methods can be used, and much caution is advised. CART introduced the idea of surrogate splits: some spare splits are collected that are maximally similar to the main one but use different features. The surrogate splits are used when the classified data vector does not provide the value necessary to check the main split condition. When C4.5 is trained on incomplete data, the gains are scaled by a factor equal to the frequency of observing the feature values among the vectors falling into the node (each training sample has a weight associated with it). The SSV criterion simply does not include the vectors with missing values in calculations; such vectors are regarded as falling into both subnodes. When an incomplete vector is to be classified by an SSV tree, all the branches of nonzero probability are checked and their leaves are treated as a single leaf to determine the dominating class.
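A simplified sketch of this SSV-style handling of missing values at classification time is shown below: a vector with a missing feature value follows both branches and the reached leaves vote together. The dict-based tree format follows the growing sketch earlier; letting each reached leaf contribute a single vote (rather than its training statistics) is a simplifying assumption.

# Minimal sketch of SSV-style classification with missing values: a vector
# with a missing feature value is passed down both branches and the reached
# leaves are combined; the dict-based tree format and one-vote-per-leaf
# combination are assumptions made for illustration.
from collections import Counter

def classify(tree, x):
    """tree: either a class label (leaf) or a dict with feature/threshold/left/right."""
    if not isinstance(tree, dict):
        return Counter([tree])
    value = x[tree["feature"]]
    if value is None:                       # missing value: follow both branches
        return classify(tree["left"], x) + classify(tree["right"], x)
    branch = "left" if value <= tree["threshold"] else "right"
    return classify(tree[branch], x)

def predict(tree, x):
    return classify(tree, x).most_common(1)[0][0]

toy_tree = {"feature": 0, "threshold": 2.0,
            "left": "A",
            "right": {"feature": 1, "threshold": 0.5, "left": "A", "right": "B"}}
print(predict(toy_tree, [None, 0.9]))   # feature 0 missing: both subtrees vote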
There are many other decision tree algorithms, not mentioned here. Some of them are just slight modifications of the basic ones (e.g. NewID is an improved ID3), and some are quite original.
TDDT (Top-Down Decision Trees) is the algorithm available in the MLC++ library (Kohavi et al., 1996). It is similar to C4.5 and introduces some changes to protect against generating the small nodes that information-gain-based strategies create when discrete features have multiple values.
There are also some decision tree approaches where node membership is decided by more than one feature. Dipol criteria (similar to SSV) have been used (Bobrowski and Krętowski, 2000) to construct decision trees whose splitting conditions use linear combinations of features. Linear Machine Decision Trees (LMDT) (Utgoff and Brodley, 1991) use linear machines at the tree nodes. They try some variable elimination, but in general such methods are not eligible for feature selection; instead they construct new features as combinations of the original ones.
Oblique Classifier (OC1) (Murthy et al., 1994) combines heuristic and nondeterministic methods to determine interesting linear combinations of features. It searches for trees by hill climbing.
Option decision trees (Buntine, 1993) allow several alternative splits of the same node. The final classification is determined with relevant calculations and an analysis of probabilities.
On the basis of the SSV criterion, heterogeneous decision trees have been proposed (Duch and Grąbczewski, 2002). Their split conditions may concern distances from some reference vectors.
Simulated annealing (SA) is a search method based on how physical matter cools down and freezes, ending up in a crystal structure that minimizes the energy of the body.

In SA, the search starts with an initial, potentially random, variable subset at a high "temperature". At each step, a small random change is introduced to the subset. If the change results in a better subset, it is accepted. If the result is worse than the previous solution, the change is accepted with a probability that depends on the temperature: at a high temperature, an adverse change is more likely to be accepted than at a low temperature. As steps are completed, the temperature is lowered every now and then, more often when no improvements can be found. Thus, the search will not get stuck in a local optimum at the beginning of the search process, when the temperature is high. Still, it is able to find the exact local optimum at the end of the search, because the low temperature no longer allows deteriorating changes to be made.
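A minimal sketch of simulated-annealing search over feature subsets, along the lines described above; the evaluation function, the geometric cooling schedule and the exp(score difference / temperature) acceptance rule are assumptions chosen for illustration.

# Minimal sketch of simulated-annealing feature-subset search; the toy
# evaluation function, cooling schedule and acceptance rule are assumptions.
import math
import random

def anneal(n_features, evaluate, steps=200, t0=1.0, cooling=0.95):
    """evaluate(subset) returns a score to maximize; subset is a set of feature indices."""
    current = {i for i in range(n_features) if random.random() < 0.5}
    current_score = evaluate(current)
    best, best_score = set(current), current_score
    temperature = t0
    for _ in range(steps):
        # small random change: flip one feature in or out of the subset
        candidate = set(current)
        candidate.symmetric_difference_update({random.randrange(n_features)})
        score = evaluate(candidate)
        # accept improvements always, adverse changes with temperature-dependent probability
        if score > current_score or random.random() < math.exp((score - current_score) / temperature):
            current, current_score = candidate, score
        if current_score > best_score:
            best, best_score = set(current), current_score
        temperature *= cooling              # cool down
    return best, best_score

# Toy evaluation: pretend features 1 and 3 are the useful ones.
useful = {1, 3}
evaluate = lambda s: len(s & useful) - 0.1 * len(s - useful)
print(anneal(6, evaluate))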
Genetic algorithms (GAs) constitute another family of stochastic optimization algorithms. While SA is inspired by physics, the motivation for GAs comes from biological evolution, where the best individuals have a higher probability of survival. An algorithmic difference between SA and GAs is that while SA keeps only one variable subset in memory, GAs maintain a set of them. In GA terminology, the solution vectors are usually called chromosomes and a set of chromosomes is called a population. The biological vocabulary is further exploited by defining genetic operations like mutation and crossover. In mutation, a random bit or several bits of the chromosome are flipped to yield a new chromosome. In crossover, an offspring of two chromosomes is obtained by cutting both at some random position and swapping the tails. A new population is typically formed by retaining some of the chromosomes of the old population and composing new chromosomes by applying genetic operations to the old ones. The better a chromosome is, the higher its probability of being selected for the new population, or as a parent in a genetic operation.
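A minimal sketch of such a GA over binary chromosomes (feature subsets) with bit-flip mutation and one-point crossover; the fitness function, the parameters and the simple elitist selection from the fitter half of the population (rather than fitness-proportional selection) are assumptions made for illustration.

# Minimal sketch of a genetic algorithm over binary chromosomes with
# mutation and one-point crossover; fitness function and parameters are
# assumptions made for illustration.
import random

def mutate(chromosome, rate=0.1):
    # flip each bit with a small probability
    return [bit ^ 1 if random.random() < rate else bit for bit in chromosome]

def crossover(parent_a, parent_b):
    cut = random.randrange(1, len(parent_a))    # random cut point
    return parent_a[:cut] + parent_b[cut:]      # swap tails

def evolve(fitness, n_bits, pop_size=20, generations=50, elite=4):
    population = [[random.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        next_pop = population[:elite]           # keep the best chromosomes
        while len(next_pop) < pop_size:
            # parents are drawn from the fitter half of the population
            a, b = random.choices(population[:pop_size // 2], k=2)
            next_pop.append(mutate(crossover(a, b)))
        population = next_pop
    return max(population, key=fitness)

# Toy fitness: reward selecting features 0 and 2, penalize subset size.
fitness = lambda chrom: 2 * (chrom[0] + chrom[2]) - 0.3 * sum(chrom)
print(evolve(fitness, n_bits=6))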