Data Mining
http://www.reddit.com/r/programming/comments/bznwz/proggit_can_you_recommend_a_gentle_introduction/
http://www.google.com/Top/Computers/Software/Databases/Data_Mining/Public_Domain_Software/
http://www.google.com/Top/Computers/Artificial_Intelligence/Machine_Learning/Software/
http://www.google.com/Top/Computers/Artificial_Intelligence/
Orange (Python) http://magix.fri.uni-lj.si/orange/
Modular toolkit for Data Processing (Python) http://mdp-toolkit.sourceforge.net/
Weka (Java) http://www.cs.waikato.ac.nz/~ml/weka/
AIDA (C++/Java/Python) http://en.wikipedia.org/wiki/AIDA_%28computing%29 http://aida.freehep.org/tools.thtml
YALE (Java) http://rapid-i.com/
IBM http://www.alphaworks.ibm.com/keywords/data%20mining
Open Bayes (Python) http://www.openbayes.org/
SVM (Python) http://www.csie.ntu.edu.tw/~cjlin/libsvm/ http://svmlight.joachims.org/
Clustering:
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
http://glaros.dtc.umn.edu/gkhome/views/cluto/gcluto/index.html
Rete rule engine (Python) http://www.mindswap.org/~katz/pychinko/
Text Mining http://krondix.blogspot.com/2007/05/text-mining-libraries.html - ClustBoost, GATE, Lucene, ANNIE, MinorThird
http://en.wikipedia.org/wiki/Classic_data_sets
Video: http://obousquet.googlepages.com/mlvideos
http://en.wikipedia.org/wiki/Machine_learning
http://en.wikipedia.org/wiki/Support_vector_machine SVM
http://www.learningtheory.org/
http://homepages.inf.ed.ac.uk/s0450736/maxent.html Maximum entropy
http://www-faculty.cs.uiuc.edu/~hanj/
http://jmlr.csail.mit.edu/ - Journal of Machine Learning Research
http://www.astro.cornell.edu/staff/loredo/bayes/
http://maven.smith.edu/~jfrankli/studentResearch/tle/cluster.html
http://www.olapreport.com/Analyses.htm
http://www.intel.com/technology/techresearch/people/bios/bradski_g.htm
http://www.pnas.org/cgi/content/abstract/104/15/6424?etoc
http://www.siam.org/meetings/proceedings/2007/datamining/
http://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29 Bootstrapping
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
http://jmvidal.cse.sc.edu/talks/conceptlearning/
http://see.stanford.edu/SEE/courseinfo.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
http://see.stanford.edu/SEE/courseinfo.aspx?coll=17005383-19c6-49ed-9497-2ba8bfcfe5f6
http://challenge.spock.com/pages/learn_more
http://www.data-mining-cup.com/
Table 1: Inside the CRISP-DM Process.
One popular data mining approach relies on the Cross-Industry Standard Process for Data Mining (CRISP-DM). This methodology lets you plan, execute, and control the lifecycle of a data mining project. CRISP-DM consists of several phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment and maintenance.
Learning algorithms may rely only on the input vectors, with no target values; such learning is called unsupervised. Unsupervised learning is used, for example, in clustering, self-organization, auto-association and some visualization algorithms.
If a learning algorithm uses pairs <x_i, y_i>, where y_i is the desired output value for x_i, then the learning is called supervised (learning with a teacher).
The problem of finding the best model from some space H can be analyzed from the Bayesian point of view. Bayes' theorem
P(f|D) = P(D|f) P(f) / P(D)
defines the relationship between the posterior probability P(f|D) and the prior probability P(f) of a given hypothesis f. This formula is fundamental in Bayesian learning, which aims at finding the hypothesis that maximizes the a posteriori probability (MAP):
f_MAP = argmax_{f in H} P(f|D) = argmax_{f in H} P(D|f) P(f)
Assuming equal a priori probabilities for all the hypotheses f in H leads to the definition of the maximum likelihood (ML) hypothesis:
f_ML = argmax_{f in H} P(D|f)
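For illustration, here is a minimal Python sketch (not from the source) that selects the MAP and ML hypotheses from a small discrete hypothesis space of coin biases; the hypotheses, priors and data are assumptions made only for the example.

# Minimal sketch of MAP and ML hypothesis selection over a discrete
# hypothesis space H; the coin-bias hypotheses, priors and data are
# assumptions made purely for illustration.
from math import prod

# Data D: observed coin flips (1 = heads, 0 = tails)
D = [1, 1, 0, 1, 1, 0, 1, 1]

# Hypothesis space H: each hypothesis f is "P(heads) = theta"
H = [0.25, 0.5, 0.75]
prior = {0.25: 0.2, 0.5: 0.6, 0.75: 0.2}   # P(f), assumed

def likelihood(theta, data):
    """P(D|f) for independent Bernoulli flips."""
    return prod(theta if x == 1 else 1.0 - theta for x in data)

# f_MAP = argmax_f P(D|f) * P(f)
f_map = max(H, key=lambda f: likelihood(f, D) * prior[f])
# f_ML = argmax_f P(D|f)  (equivalent to MAP with a flat prior)
f_ml = max(H, key=lambda f: likelihood(f, D))

print("MAP hypothesis:", f_map)
print("ML hypothesis:", f_ml)

With this prior the MAP and ML hypotheses differ, which is exactly the effect of the prior term P(f).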
Instance-based or similarity-based methods form a large branch of machine learning algorithms developed by the pattern recognition community. A primary example of such models is the k nearest neighbors (kNN) algorithm, which classifies data vectors on the basis of their similarity to some memorized instances. More precisely, for a given data vector it assigns the class label that appears most frequently among its k nearest neighbors. This may be seen as an approximation of the Bayes Optimal Classifier, where the probabilities are estimated on the basis of the analysis of the vicinity of the classified example. Methods restricting their analysis to a neighborhood of a given object are called local. In the case of the single-neighbor classifier (1NN), the decision borders form the Voronoi tessellation of the feature space.
There are many different possibilities for implementing and applying the kNN algorithm. For example, different distance measures can be used, and the choice is not simple. The choice of neighbors is not unambiguous either: one can select the k nearest neighbors or all neighbors within a given radius. The neighbors' influence on the decision may be weighted according to the distance (e.g. 1/(1 + dist) or max[1 − dist, 0]), a flexible k depending on the region may be used, the best k may be estimated by means of cross-validation, etc.
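As a rough illustration of the variants described above, the following sketch implements a distance-weighted kNN classifier with the 1/(1 + dist) weighting; the Euclidean metric and the toy data are assumptions made for the example.

# Minimal sketch of a distance-weighted kNN classifier using the
# 1/(1 + dist) weighting mentioned above; Euclidean distance and the
# toy data are assumptions made for illustration.
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, train, k=3):
    """train: list of (vector, label) pairs."""
    # the k nearest memorized instances
    neighbors = sorted(train, key=lambda vl: euclidean(query, vl[0]))[:k]
    # each neighbor votes with weight 1 / (1 + dist)
    votes = defaultdict(float)
    for vec, label in neighbors:
        votes[label] += 1.0 / (1.0 + euclidean(query, vec))
    return max(votes, key=votes.get)

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
print(knn_predict((0.2, 0.1), train, k=3))   # -> "A"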
A group of algorithms is dedicated to prototype selection (choosing the instances among which the search for neighbors is performed). Algorithms that work well for prototype selection include DROP, Explore, LVQ, DEL, ICF and ENN.
Decision trees are hierarchical models, very popular in classification tasks.
The tree nodes are described by logical conditions using single features. This is the most common technique in decision tree implementations and results in decision borders perpendicular to the axes of the feature space. Other shapes of borders are possible when, for instance, linear combinations of features or distances from some reference points are used in the logical expressions assigned to the nodes; however, perpendicular borders are often preferred because of their comprehensibility. In some fields experts' knowledge can be used directly to specify decision rules, but more often all that is given is a set of classified data vectors, so the best way to find decision rules is to use computational intelligence tools.
Tree growing algorithms start with a given training data set as the root node and recursively split the nodes into several disjoint parts to separate vectors assigned to different classes. Each split is described by a general logical rule, which in fact divides the whole feature space (not just the training set).
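A minimal sketch of such a greedy growing procedure is given below; the binary threshold splits chosen by information gain (entropy) and the simple stopping rule are assumptions made for illustration, not a reproduction of any particular algorithm mentioned here.

# Minimal sketch of recursive tree growing: binary threshold splits are
# chosen greedily by weighted child entropy, and growth stops at pure or
# very small nodes; these choices are simplifying assumptions.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(X, y):
    """Return (feature index, threshold) minimizing the weighted child entropy."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return None if best is None else best[1:]

def grow(X, y, min_size=2):
    split = best_split(X, y) if len(set(y)) > 1 and len(y) >= min_size else None
    if split is None:
        return Counter(y).most_common(1)[0][0]      # leaf: majority class
    f, t = split
    left = [(row, lab) for row, lab in zip(X, y) if row[f] <= t]
    right = [(row, lab) for row, lab in zip(X, y) if row[f] > t]
    lX, ly = zip(*left)
    rX, ry = zip(*right)
    return {"feature": f, "threshold": t,
            "left": grow(list(lX), list(ly), min_size),
            "right": grow(list(rX), list(ry), min_size)}

X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y = ["A", "A", "B", "B"]
print(grow(X, y))   # {'feature': 0, 'threshold': 1.5, 'left': 'A', 'right': 'B'}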
Such comprehensible models are advantageous in many fields (for instance in medical diagnosis) and provide information about the relevance of particular features. The recursiveness of the process makes feature selection a local task: useful features can be picked up even if they are valuable only in a subregion of the input space, while from the point of view of the whole space they seem to contain less information than other features.
Building optimal trees is a very complex problem, especially because optimizing a particular split is not the same as maximizing the classification accuracy of the whole model. A number of different decision tree construction methods have already been published, among them Classification and Regression Trees (CART) [Breiman et al.], ID3 [Quinlan, 1986], C4.5 [Quinlan, 1993], Separability of Split Value (SSV) trees, and the Fast Algorithm for Classification Trees (FACT). The methods are based on different ideas and use different model structures: some of them use dichotomic splits (CART, SSV), others allow for more complex branching.
Decision trees can usually be presented in the form of logical rules. Some algorithms assume a particular data distribution and use parametric statistical tests (FACT, QUEST, Cal5), while others make no such assumptions.
Nevertheless, all trees can be described with the same terminology, including nodes, subnodes, subtrees, leaves, branches, branch length, tree depth, class labels assigned to nodes, data vectors falling into a node, and others. There are three main aspects in which decision tree algorithms differ from each other: the way nodes are split into subnodes, the way the tree is constructed (which in most cases is just a search process), and the way generalization is taken care of (stopping criteria or pruning techniques).
Splitting nodes according to continuous and discrete inputs must be performed differently. This is why some methods (like ID3) can deal only with discrete features; continuous attributes need to be converted to discrete ones before such algorithms can be applied, and good classification results can be obtained only in combination with good (external) discretization methods. On the other hand, methods like FACT and QUEST are designed to deal only with continuous input: discrete features are converted to continuous ones by translating them into binary vectors of dimension equal to the number of possible discrete values, which are then projected to a single dimension. The original Cal5 algorithm was also designed for continuous data, but it can be quite easily adapted to deal with discrete features. The candidate splits of continuous features are usually binary (of the form {(−∞, a], (a, +∞)}), but some methods split such input into several intervals (e.g. FACT and Cal5).
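As a small illustration of the first step of this discrete-to-continuous conversion, the sketch below encodes a discrete attribute as binary indicator columns; the subsequent projection to a single dimension used by FACT/QUEST is not shown, and the toy values are an assumption.

# Minimal sketch of turning a discrete attribute into binary indicator
# columns, as the first step of the FACT/QUEST-style treatment described
# above; the projection to a single dimension is omitted.
def one_hot(values):
    """Map each discrete value to a binary vector of length = number of distinct values."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = [[1 if index[v] == i else 0 for i in range(len(categories))]
               for v in values]
    return encoded, categories

encoded, categories = one_hot(["red", "blue", "red", "green"])
print(categories)   # ['blue', 'green', 'red']
print(encoded)      # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]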
Most decision tree algorithms split the nodes with respect to the values of a single feature. Selecting the feature most eligible for the split is often closely bound up with the selection of the splitting points (CART, C4.5, SSV). An alternative strategy is applied in FACT, where the split feature is defined as the one that maximizes the value of the F statistic known from the analysis of variance (ANOVA). The QUEST algorithm also calculates the F statistic for continuous features, but for discrete ones it uses the χ² statistic (to prevent favoring discrete attributes over continuous ones). Cal5 also selects the split feature with a separate algorithm: either the amount of information about the classes is measured with an entropy-based formula (in this case the discretization must be performed first), or a coefficient is calculated for each feature to estimate its class-separation ability on the basis of mean squared distances within classes and between class centroids.
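The following sketch shows FACT-style split-feature selection: a one-way ANOVA F statistic is computed for each continuous feature and the feature with the largest value is chosen. The toy data are an assumption made for illustration.

# Minimal sketch of FACT-style split-feature selection: compute a one-way
# ANOVA F statistic for every continuous feature and pick the feature with
# the largest value; the toy data are an assumption for illustration.
from collections import defaultdict

def f_statistic(values, labels):
    """One-way ANOVA F statistic for one feature across the classes."""
    groups = defaultdict(list)
    for v, lab in zip(values, labels):
        groups[lab].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

X = [[1.0, 5.2], [1.2, 4.9], [3.1, 5.1], [3.3, 5.0]]   # rows = samples
y = ["A", "A", "B", "B"]
scores = [f_statistic([row[f] for row in X], y) for f in range(len(X[0]))]
split_feature = max(range(len(scores)), key=scores.__getitem__)
print(scores, "-> split on feature", split_feature)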
Simple search techniques like hill climbing are most frequently used. Experience shows that increasing the computational effort of the search method (e.g. using beam search) does not improve the generalization abilities of the constructed trees (Quinlan and Cameron-Jones, 1995). No reliable explanation of this fact is known, but it looks like a more thorough search leads to solutions more specific to the training set.
Building decision trees that maximally fit the training data usually ends up in overfitted, large trees with leaves classifying only a few cases, and thus does not provide general knowledge about the problem. To keep the trees simpler and more general, the methods may use stopping criteria. A simple idea is to keep the number of vectors in the nodes above a specified threshold. Another way is to extend the separability criterion with a penalty term conforming to the Minimum Description Length principle. In both cases it is difficult to define criteria which yield results close to optimal. A more reasonable way is to use statistical tests to verify the significance of the improvement introduced by a split (such a technique is part of C4.5). Stopping tree growth can save much time, but makes optimal solutions less probable. Usually, better generalization is obtained by building overfitted trees and pruning them. There are many different pruning methods; they may concern tree depth, the number of vectors in a node, etc. Their parameters can be chosen on the basis of a cross-validation test performed within the training data (as in CART, QUEST or SSV Tree).
The training data set can be incomplete (some feature values for some data vectors may be inaccessible). Putting arbitrary values in place of the missing ones should be considered only if no other methods can be used, and much caution is advised. CART introduced the idea of surrogate splits: some spare splits are collected that are maximally similar to the main one but use different features. The surrogate splits are used when the classified data vector does not provide the value necessary to check the main split condition. When C4.5 is trained on incomplete data, the gains are scaled by a factor equal to the frequency of observing the feature values among the vectors falling into the node (each training sample has a weight associated with it). The SSV criterion simply does not include the vectors with missing values in calculations; such vectors are regarded as falling into both subnodes. When an incomplete vector is to be classified by an SSV tree, all the branches of nonzero probability are checked and their leaves are treated as a single leaf to determine the dominating class.
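A simplified sketch of this SSV-style handling of missing values at classification time is shown below: a vector with a missing feature value follows both branches and the reached leaves vote together. The dict-based tree format follows the growing sketch earlier; letting each reached leaf contribute a single vote (rather than its training statistics) is a simplifying assumption.

# Minimal sketch of SSV-style classification with missing values: a vector
# with a missing feature value is passed down both branches and the reached
# leaves are combined; the dict-based tree format and one-vote-per-leaf
# combination are assumptions made for illustration.
from collections import Counter

def classify(tree, x):
    """tree: either a class label (leaf) or a dict with feature/threshold/left/right."""
    if not isinstance(tree, dict):
        return Counter([tree])
    value = x[tree["feature"]]
    if value is None:                       # missing value: follow both branches
        return classify(tree["left"], x) + classify(tree["right"], x)
    branch = "left" if value <= tree["threshold"] else "right"
    return classify(tree[branch], x)

def predict(tree, x):
    return classify(tree, x).most_common(1)[0][0]

toy_tree = {"feature": 0, "threshold": 2.0,
            "left": "A",
            "right": {"feature": 1, "threshold": 0.5, "left": "A", "right": "B"}}
print(predict(toy_tree, [None, 0.9]))   # feature 0 missing: both subtrees vote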
There are many other decision tree algorithms, not mentioned here. Some of them are just slight modifications of the basic ones (e.g. NewID is an improved ID3), and some are quite original.
TDDT (Top-Down Decision Trees) is the algorithm available in the MLC++ library (Kohavi et al., 1996). It is similar to C4.5 and introduces some changes to protect against generating the small nodes that information-gain-based strategies create when discrete features have multiple values.
There are also some decision tree approaches where node membership is decided by more than one feature. Dipol criteria (similar to SSV) have been used (Bobrowski and Krętowski, 2000) to construct decision trees whose splitting conditions use linear combinations of features. Linear Machine Decision Trees (LMDT) (Utgoff and Brodley, 1991) use linear machines at the tree nodes. They try some variable elimination, but in general such methods are not eligible for feature selection; instead they construct new features as combinations of the original ones.
Oblique Classifier (OC1) (Murthy et al., 1994) combines heuristic and nondeterministic methods to determine interesting linear combinations of features. It searches for trees by hill climbing.
Option decision trees (Buntine, 1993) allow several alternative splits of the same node. The final classification is determined with relevant calculations and an analysis of probabilities.
On the basis of the SSV criterion, heterogeneous decision trees have been proposed (Duch and Grąbczewski, 2002). Their split conditions may concern distances from some reference vectors.
Simulated annealing (SA) is a search method based on how physical matter cools down and freezes, ending up in a crystal structure that minimizes the energy of the body.

In SA, the search starts with an initial, potentially random, variable subset at a high "temperature". At each step, a small random change is introduced to the subset. If the change results in a better subset, it is accepted. If the result is worse than the previous solution, the change is accepted with a probability that depends on the temperature: at a high temperature, an adverse change is more likely to be accepted than at a low temperature. As steps are completed, the temperature is lowered every now and then, more often when no improvements can be found. Thus, the search will not get stuck in a local optimum at the beginning of the search process, when the temperature is high. Still, it is able to find the exact local optimum at the end of the search, because the low temperature no longer allows deteriorating changes to be made.
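A minimal sketch of simulated-annealing search over feature subsets, along the lines described above; the evaluation function, the geometric cooling schedule and the exp(score difference / temperature) acceptance rule are assumptions chosen for illustration.

# Minimal sketch of simulated-annealing feature-subset search; the toy
# evaluation function, cooling schedule and acceptance rule are assumptions.
import math
import random

def anneal(n_features, evaluate, steps=200, t0=1.0, cooling=0.95):
    """evaluate(subset) returns a score to maximize; subset is a set of feature indices."""
    current = {i for i in range(n_features) if random.random() < 0.5}
    current_score = evaluate(current)
    best, best_score = set(current), current_score
    temperature = t0
    for _ in range(steps):
        # small random change: flip one feature in or out of the subset
        candidate = set(current)
        candidate.symmetric_difference_update({random.randrange(n_features)})
        score = evaluate(candidate)
        # accept improvements always, adverse changes with temperature-dependent probability
        if score > current_score or random.random() < math.exp((score - current_score) / temperature):
            current, current_score = candidate, score
        if current_score > best_score:
            best, best_score = set(current), current_score
        temperature *= cooling              # cool down
    return best, best_score

# Toy evaluation: pretend features 1 and 3 are the useful ones.
useful = {1, 3}
evaluate = lambda s: len(s & useful) - 0.1 * len(s - useful)
print(anneal(6, evaluate))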
Genetic algorithms (GAs) constitute another family of stochastic optimization algorithms. While SA is inspired by physics, the motivation for GAs comes from biological evolution, where the best individuals have a higher probability of survival. An algorithmic difference between SA and GAs is that while SA keeps only one variable subset in memory, GAs maintain a set of them. In GA terminology, the solution vectors are usually called chromosomes and a set of chromosomes is called a population. The biological vocabulary is further exploited by defining genetic operations like mutation and crossover. In mutation, a random bit or several bits of the chromosome are flipped to yield a new chromosome. In crossover, an offspring of two chromosomes is obtained by cutting both at some random position and swapping the tails. A new population is typically formed by retaining some of the chromosomes of the old population and composing new chromosomes by applying genetic operations to the old ones. The better a chromosome is, the higher its probability of being selected for the new population, or as a parent in a genetic operation.
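A minimal sketch of such a GA over binary chromosomes (feature subsets) with bit-flip mutation and one-point crossover; the fitness function, the parameters and the simple elitist selection from the fitter half of the population (rather than fitness-proportional selection) are assumptions made for illustration.

# Minimal sketch of a genetic algorithm over binary chromosomes with
# mutation and one-point crossover; fitness function and parameters are
# assumptions made for illustration.
import random

def mutate(chromosome, rate=0.1):
    # flip each bit with a small probability
    return [bit ^ 1 if random.random() < rate else bit for bit in chromosome]

def crossover(parent_a, parent_b):
    cut = random.randrange(1, len(parent_a))    # random cut point
    return parent_a[:cut] + parent_b[cut:]      # swap tails

def evolve(fitness, n_bits, pop_size=20, generations=50, elite=4):
    population = [[random.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        next_pop = population[:elite]           # keep the best chromosomes
        while len(next_pop) < pop_size:
            # parents are drawn from the fitter half of the population
            a, b = random.choices(population[:pop_size // 2], k=2)
            next_pop.append(mutate(crossover(a, b)))
        population = next_pop
    return max(population, key=fitness)

# Toy fitness: reward selecting features 0 and 2, penalize subset size.
fitness = lambda chrom: 2 * (chrom[0] + chrom[2]) - 0.3 * sum(chrom)
print(evolve(fitness, n_bits=6))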