data mining/R-Stats

some links:

TYPE 1 AND TYPE 2 ERROR

posted Feb 5, 2015, 12:40 PM by shuo zhang   [ updated Feb 5, 2015, 12:45 PM ]


WHICH ONE IS IT?

Suppose we do know the true mean of the sampling distribution, and it turns out that our estimate from a sample of 30 is correct (H0).
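A rough simulation can make the two error types concrete. This is only a sketch under assumptions that are not in the note above: normally distributed data with standard deviation 1, a one-sample t-test, a significance level of 0.05, and a hypothetical shift of 0.5 for the case where H0 is false. Only the sample size of 30 comes from the note.

```r
set.seed(1)          # fixed seed so the sketch is reproducible
n     <- 30          # sample size, as in the note above
mu0   <- 0           # hypothetical mean under H0 (assumption)
alpha <- 0.05        # assumed significance level
reps  <- 2000        # number of simulated samples

# p-value of a one-sample t-test on n draws from N(true_mean, 1)
p_value <- function(true_mean) {
  t.test(rnorm(n, mean = true_mean), mu = mu0)$p.value
}

# H0 is true: every rejection is a Type 1 error
p_h0 <- replicate(reps, p_value(mu0))
type1_rate <- mean(p_h0 < alpha)

# H0 is false (true mean shifted by 0.5): every non-rejection is a Type 2 error
p_h1 <- replicate(reps, p_value(mu0 + 0.5))
type2_rate <- mean(p_h1 >= alpha)

print(c(type1 = type1_rate, type2 = type2_rate))
```

The Type 1 rate should land close to the chosen alpha, while the Type 2 rate depends on the assumed shift and the sample size.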


set theory

posted Mar 11, 2014, 6:46 PM by shuo zhang

second order derivative

posted Jan 15, 2014, 12:13 PM by shuo zhang   [ updated Feb 13, 2014, 6:56 PM ]


must update R to 3.0.2 to support Mac OS X Mavericks

posted Jan 3, 2014, 9:42 AM by shuo zhang

http://stat.ethz.ch/R-manual/R-devel/library/graphics/html/curve.html

Download HERE; otherwise R will crash whenever it handles an error.

classification evaluation measures

posted Dec 15, 2013, 9:16 AM by shuo zhang

            predicted +   predicted -
actual +    TP            FN
actual -    FP            TN

TPR=recall=TP/(TP+FN)
precision=TP/(TP+FP)
TNR=TN/(TN+FP)=specificity
FPR=FP/(FP+TN)
FNR=FN/(FN+TP)
F=2*recall*precision/(precision+recall)=2*TP/(2*TP+FP+FN)

The F-measure (F1 score) is the harmonic mean of precision p and recall r, i.e.,
F=2/(1/r+1/p)
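The formulas above can be checked with a few lines of R. The four counts below are made up purely for illustration; the variable names are not from any package.

```r
# Hypothetical confusion-matrix counts, for illustration only
TP <- 40; FN <- 10
FP <- 5;  TN <- 45

recall    <- TP / (TP + FN)   # TPR
precision <- TP / (TP + FP)
TNR       <- TN / (TN + FP)   # specificity
FPR       <- FP / (FP + TN)
FNR       <- FN / (FN + TP)

# Two equivalent ways of writing the F-measure
F_harmonic <- 2 / (1 / recall + 1 / precision)
F_counts   <- 2 * TP / (2 * TP + FP + FN)

print(c(recall = recall, precision = precision, TNR = TNR,
        FPR = FPR, FNR = FNR, F = F_harmonic))
```

Note that TNR and FPR sum to 1, as do TPR and FNR, so only one of each pair needs to be reported.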

k-medoids algorithm and demo

posted Dec 13, 2013, 12:24 PM by shuo zhang

The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm and is as follows:[2]

  1. Initialize: randomly select k of the n data points as the medoids
  2. Associate each data point to the closest medoid. ("closest" here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance)
  3. For each medoid m
    1. For each non-medoid data point o
      1. Swap m and o and compute the total cost of the configuration
  4. Select the configuration with the lowest cost.
  5. Repeat steps 2 to 4 until there is no change in the medoids.
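As a minimal sketch of the steps above, the cost of a configuration can be computed by hand and compared with what pam() from the cluster package finds. The helper total_cost and the choice of the ruspini data with k = 4 are assumptions made for this illustration. Also note that pam() picks its starting medoids with its own build step rather than at random, so it is not a line-by-line copy of the listing.

```r
library(cluster)   # provides pam() and the ruspini data set

D <- as.matrix(dist(ruspini))   # Euclidean distances between all points

# Total cost of a configuration: each point's distance to its
# closest medoid, summed over all points (hypothetical helper)
total_cost <- function(D, medoid_idx) {
  sum(apply(D[, medoid_idx, drop = FALSE], 1, min))
}

set.seed(1)
random_idx <- sample(nrow(ruspini), 4)   # step 1: k random points as medoids
fit <- pam(ruspini, k = 4)               # swap phase runs until no swap lowers the cost

random_cost <- total_cost(D, random_idx)
pam_cost    <- total_cost(D, fit$id.med)   # id.med holds the medoid row indices

print(c(random = random_cost, pam = pam_cost))
print(fit$medoids)
```

The result can be drawn with plot(ruspini, pch=21, bg=rainbow(4)[fit$clustering]).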

demo:

Cute illustration of k-means

posted Dec 13, 2013, 12:13 PM by shuo zhang   [ updated Feb 13, 2014, 6:56 PM ]


using plot3d to visualize 3d data in R

posted Dec 12, 2013, 12:57 PM by shuo zhang

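library(foreign)   # read.arff()
library(rgl)       # plot3d()
#THC DATA
thc <- read.arff("~/Desktop/thc1.csv.arff")
# columns 2 to 4 are the coordinates; colour each point by its cluster label
plot3d(thc[2:4], col = rainbow(3)[thc$Cluster])

#PLANET DATA
HW <- read.csv("~/Downloads/Cluster3D.csv")
plot3d(HW)

# each column of DHW holds one point's distances to all points, sorted ascending
DHW <- apply(as.matrix(dist(scale(HW))), 1, sort)
# row 1 is the zero distance to itself, so row 3 is the distance to the 2nd nearest neighbour
summary(DHW[3, ])

# pick another ARFF file interactively
h2 <- read.arff(file.choose())
head(h2)

plot3d(h2[2:4], col = rainbow(4)[h2$Cluster])
# keep only the points below a y threshold
h3 <- h2[h2$y < 0.75, ]
h4 <- h2[h2$y < 0.4, ]

plot3d(h3[2:4], col = rainbow(4)[h3$Cluster])
plot3d(h4[2:4], col = rainbow(4)[h4$Cluster])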

COMPLETE LINK VS. SINGLE LINK CLUSTERING FOR RUSPINI DATA

posted Dec 11, 2013, 10:36 AM by shuo zhang

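library(cluster)   # provides the ruspini data set
# complete link: merge by the largest pairwise distance between clusters
Rusp_HC <- hclust(dist(ruspini), "complete")
plot(ruspini, pch = 21, bg = rainbow(4)[cutree(Rusp_HC, 4)], main = "CompleteLink")

dev.new()   # open a second window so both plots stay visible
# single link: merge by the smallest pairwise distance between clusters
Rusp_HS <- hclust(dist(ruspini), "single")
plot(ruspini, pch = 21, bg = rainbow(4)[cutree(Rusp_HS, 4)], main = "SingleLink")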
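Besides comparing the two plots by eye, the two cuts can be cross-tabulated. This is a small sketch; the names complete_cl, single_cl and agreement are chosen here for illustration.

```r
library(cluster)   # for the ruspini data set

d <- dist(ruspini)
complete_cl <- cutree(hclust(d, "complete"), 4)
single_cl   <- cutree(hclust(d, "single"), 4)

# Rows: complete-link cluster; columns: single-link cluster
agreement <- table(complete = complete_cl, single = single_cl)
print(agreement)
```

If each row and each column has a single non-zero cell, the two linkages produce the same partition up to relabelling of the clusters.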
