Hierarchical clustering

After obtaining model-based expression values, we can perform high-level analysis such as hierarchical clustering (Eisen et al. 1998). Unsupervised sample clustering using genes obtained by “Analysis/Filter genes” can be used to identify novel sample clusters and their associated “signature genes”, to check data quality by seeing whether replicate samples or samples under similar conditions cluster together (and, if not, to consider possible reasons), and to identify unexpected clustering (e.g. samples generated on the same date or in the same lab clustering together). Select the menu “Analysis/Hierarchical clustering”:

A “gene list file” is a tab-delimited text file with the probe set name in the first column of each line. It can be generated by “Analysis/Filter genes”, “Analysis/Compare samples” or “Tools/Gene list file”. It may also be a “Tree file” saved by the “Clustering/Save tree” function, so that a previously saved tree structure can be reused. dChip will use the genes in the file for clustering.
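To illustrate the file layout described above, the following sketch reads probe set names from the first column of a tab-delimited file. The function name `read_gene_list` and the sample probe set names are hypothetical. This is not dChip's own reader, and it assumes the file has no header line.

```python
import csv
import io


def read_gene_list(handle):
    """Return the probe set names found in the first column of a tab-delimited file."""
    names = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and rows whose first column is empty.
        if row and row[0].strip():
            names.append(row[0].strip())
    return names


# Hypothetical file content: probe set name first, further columns ignored.
example = io.StringIO("1000_at\t5.2\n1001_at\t7.9\n\n1002_f_at\t3.1\n")
probe_sets = read_gene_list(example)
print(probe_sets)
```

Only the first column matters for clustering in this sketch; any further columns are ignored.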


One may check “Tools/Options/Analysis/Mask redundant probe sets” to exclude redundant probe sets (those having the same LocusLink ID) from a gene list and keep only the first occurring probe set, since multiple probe sets for the same gene tend to bias the result of sample clustering and of the search for functionally significant gene clusters. However, if replicate probe sets are all selected by some filtering or comparison criteria and cluster closely in the clustering, this is a good indication that the selected gene list is meaningful. On the other hand, if a selected gene list seems to contain genes not related to each other (e.g. few replicate probe sets), we may doubt its validity; often an FDR estimated by permutation shows that a similar number of genes would be selected by chance, which supports this suspicion. The same reasoning extends to probe sets for genes in the same gene family or the same pathway.
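The masking rule (keep only the first occurring probe set per LocusLink ID) can be sketched as below. The function name `mask_redundant`, the probe set names and the IDs are hypothetical. Keeping every probe set that lacks an ID is an assumption of this sketch, not documented dChip behavior.

```python
def mask_redundant(probe_sets):
    """Keep only the first probe set seen for each LocusLink ID.

    probe_sets: list of (probe_set_name, locuslink_id) pairs in file order.
    Probe sets without an ID (None) are all kept (assumption of this sketch).
    """
    seen = set()
    kept = []
    for name, locus_id in probe_sets:
        if locus_id is not None:
            if locus_id in seen:
                continue  # a probe set for this gene was already kept
            seen.add(locus_id)
        kept.append(name)
    return kept


# Hypothetical input: two probe sets share LocusLink ID 101.
genes = [("A_at", 101), ("B_at", 202), ("A_s_at", 101), ("C_at", None), ("D_at", None)]
masked = mask_redundant(genes)
print(masked)
```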


The samples used for clustering are either all the arrays or, if an “Array list file” is specified, the samples listed in it. When a “Filter genes” gene list is used for clustering, it is usually desirable to use the same “Array list file” that was used in filtering genes for both gene clustering and sample clustering. This is unsupervised sample clustering, since the genes are selected by large variation across samples and the sample group information is not used. When one specifies a “Compare samples” gene list generated from only a subset of samples, it is usually desirable to specify and order only the relevant samples in the “Array list file” and to view them without sample clustering. In this case the main interest lies in viewing the genes obtained by comparison; sample clustering will often look good simply because the genes were selected using the sample group information. It is also interesting to cluster the samples used for selecting genes together with samples not used for selecting genes (e.g. samples from an independent study), so that the group membership of the latter samples can be predicted.


Clustering algorithm


The default clustering algorithm for genes is as follows. The distance between two genes is defined as 1 - r, where r is the Pearson correlation coefficient between the standardized expression values (scaled to mean 0 and standard deviation 1) of the two genes across the samples used. The two genes with the smallest distance are first merged into a super-gene and connected by branches whose length represents their distance; they are then excluded from subsequent merging events. The expression values of the newly formed super-gene are the averages of the standardized expression values of the two genes across samples (centroid linkage). Then the next pair of genes (or super-genes) with the smallest distance is chosen to merge, and the process is repeated n – 1 times to merge all n genes. A similar procedure is used to cluster samples. These standardization and clustering methods follow Golub et al. 1999 and Eisen et al. 1998. Centroid linkage can produce branch inversion when the distance between two clusters is smaller than the height of either cluster; in that case dChip truncates the distance to the larger of the two heights. This prevents branch inversion in the visualization, but subsequent distance computation is still based on the averaged profile.
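The procedure above can be sketched in a few lines of Python. This is an illustration written from the description, not dChip's source code. The names `standardize`, `pearson` and `centroid_cluster` and the example data are hypothetical. The sketch uses the sample standard deviation (n - 1 denominator) and does not re-standardize the averaged super-gene profile; both choices are assumptions that do not affect the correlation.

```python
import math


def standardize(row):
    """Scale one gene's values to mean 0 and standard deviation 1."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (n - 1))
    return [(v - mean) / sd for v in row]  # fails for a constant row (sd == 0)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def centroid_cluster(rows):
    """Return the n - 1 merge events as (node_a, node_b, branch_height)."""
    # Active nodes: id -> (profile, height). Leaves have height 0.
    nodes = {i: (standardize(r), 0.0) for i, r in enumerate(rows)}
    merges = []
    next_id = len(rows)
    while len(nodes) > 1:
        ids = sorted(nodes)
        best = None
        # Find the closest pair of active nodes under distance 1 - r.
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                d = 1.0 - pearson(nodes[a][0], nodes[b][0])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        (pa, ha), (pb, hb) = nodes.pop(a), nodes.pop(b)
        # Truncate the drawn height so that no branch inversion appears.
        height = max(d, ha, hb)
        # The super-gene profile is the average of the two merged profiles.
        nodes[next_id] = ([(u + v) / 2 for u, v in zip(pa, pb)], height)
        merges.append((a, b, height))
        next_id += 1
    return merges


# Hypothetical expression values: 4 genes across 5 samples.
rows = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 11],
    [5, 4, 3, 2, 1],
    [5, 3, 4, 1, 2],
]
merges = centroid_cluster(rows)
for a, b, h in merges:
    print(f"merge {a} + {b} at height {h:.4f}")
```

In this example genes 0 and 1 merge first, then genes 2 and 3, and finally the two super-genes, giving n – 1 = 3 merge events. The quadratic search for the closest pair is kept for readability and is not meant to be efficient.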

One may choose the alternative “Distance metric” 1 - |r| (where r is the correlation coefficient) as the distance measure. This is useful if we want negatively correlated genes to be clustered together. The “Average linkage” method can also be specified, in which the distance between two gene clusters (super-genes) is the average of all pair-wise distances between a gene in one cluster and a gene in the other. Tao Shi has observed that dChip produces the same clustering result as the R function hclust (using 1 – correlation matrix of row-wise standardized expression values) when average linkage is used, but not when centroid linkage is used.
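The two options can be illustrated with a small sketch. The names `distance` and `average_linkage` and the example profiles are hypothetical, and the code follows only the definitions given above rather than dChip's implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def distance(x, y, absolute=False):
    """1 - r by default, or 1 - |r| when absolute is True."""
    r = pearson(x, y)
    return 1.0 - abs(r) if absolute else 1.0 - r


def average_linkage(cluster_a, cluster_b, absolute=False):
    """Mean of all pair-wise distances between a gene in cluster_a and one in cluster_b."""
    total = sum(distance(x, y, absolute) for x in cluster_a for y in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


# Two perfectly anti-correlated hypothetical profiles.
up = [1.0, 2.0, 3.0, 4.0]
down = [4.0, 3.0, 2.0, 1.0]

d_plain = distance(up, down)               # r = -1, so 1 - r = 2
d_abs = distance(up, down, absolute=True)  # 1 - |r| = 0
d_avg = average_linkage([up], [down, up])  # mean of 2 and 0
print(d_plain, d_abs, d_avg)
```

Under 1 - r the two profiles are as far apart as possible, while under 1 - |r| they are at distance 0 and would merge first.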

Other clustering options


Click “Options” (or “Tools/Options/Clustering”) to specify additional clustering options:



We can choose to cluster samples as well as genes. Uncheck the “Cluster genes” button to cluster samples without clustering genes; this is useful when genes need to stay in a particular order while samples are clustered. The option “Only draw lines for standard separator” (moved to the “Tools/Array list file” dialog for V1.2+) is discussed in the section “Array list file”.


Before clustering, the expression values for a gene across all samples are standardized (linearly scaled) to have mean 0 and standard deviation 1, and these standardized values are used to calculate correlations between genes and between samples and serve as the basis for merging nodes. If the scale of the data is already adjusted, one may choose not to standardize a gene’s expression values across samples by unchecking the “Standardize rows” option. By default the samples are clustered using the row-wise standardized (or un-standardized) values. One can check “Standardize columns” to standardize the raw expression data column-wise for sample clustering. Since the raw expression values are comparable row-wise but not column-wise, column-wise standardization may not be meaningful when different genes have expression values of different magnitudes. Users are advised to cluster samples both with and without “Standardize columns” checked and to judge which option yields the more reasonable sample clustering.
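The difference between the two standardization options can be seen in a small sketch. The function names and the matrix (rows are genes, columns are samples) are hypothetical, and the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which denominator dChip uses.

```python
import math


def standardize(values):
    """Linearly scale values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def standardize_rows(matrix):
    """Standardize each gene (row) across samples."""
    return [standardize(row) for row in matrix]


def standardize_columns(matrix):
    """Standardize each sample (column) across genes."""
    columns = [standardize(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]  # transpose back


# Two hypothetical genes with the same pattern but a tenfold difference in magnitude.
matrix = [
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
]
by_row = standardize_rows(matrix)
by_col = standardize_columns(matrix)
print(by_row)
print(by_col)
```

After row-wise standardization the two genes have identical profiles. Column-wise standardization instead reflects mostly the magnitude difference between the genes, which is why it may not be meaningful for such data.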

If “Tools/Analysis/Treat outlier expression as missing values” is checked, expression values called “array-outlier” are ignored when computing correlations, and their data points are displayed as black (Blue/Red coloring) or white (Green/Red coloring) boxes in the clustering picture.
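One way to ignore missing values when computing a correlation is to use only the samples where both profiles have a value. The sketch below does this with `None` marking an outlier. The function name and the data are hypothetical, and pair-wise deletion is an assumption, since the text does not describe dChip's exact handling.

```python
import math


def pearson_ignore_missing(x, y):
    """Pearson correlation over the samples where neither value is missing (None)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


# The third sample of gene_x was called an outlier and is treated as missing.
gene_x = [1.0, 2.0, None, 4.0, 5.0]
gene_y = [2.0, 4.0, 100.0, 8.0, 10.0]
r = pearson_ignore_missing(gene_x, gene_y)
print(r)
```

With the third sample dropped, the remaining four pairs are perfectly correlated, so the extreme value 100.0 in `gene_y` has no influence on the result.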

If the number of genes is large (e.g. 10,000), dChip may report “out of memory” or run slowly, since storing all the pair-wise distances requires too much memory and may cause virtual-memory swapping. The solution is to uncheck the “Tools/Options/Clustering/Pre-calculate distances” button so that the pair-wise distances between genes are calculated on the fly.
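A rough calculation shows why pre-calculated distances become expensive: n genes have n(n - 1)/2 distinct pairs. The sketch below assumes 4 bytes per stored distance (a single-precision float); the storage format dChip actually uses is not stated here, so the byte count is only indicative.

```python
def pairwise_distance_bytes(n_genes, bytes_per_value=4):
    """Return (number of distinct gene pairs, bytes needed to store one distance per pair)."""
    pairs = n_genes * (n_genes - 1) // 2
    return pairs, pairs * bytes_per_value


pairs, size = pairwise_distance_bytes(10_000)
print(pairs)              # 49995000 pairs
print(size / 1e6, "MB")   # 199.98 MB under the 4-byte assumption
```

The requirement grows quadratically, so doubling the number of genes roughly quadruples the memory needed.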


Clustering output


Click “OK” to start clustering; select “Analysis/Stop Analysis” or press “ESC” to stop an ongoing analysis. The clustering picture is displayed as soon as the analysis finishes. Click the “Analysis” icon on the left to view the analysis output, which looks as follows:


{Hierarchical clustering

  Treat 24 arrays as 24 experiments

  Read in genes listed in file D:\array\out\iglehart filtered gene.xls...

    Found 191 genes


  Begin clustering...

    Calcuate distance 190

    Merge event 189

    Calcuate distance 20

    Merge event 16


  Finding significant functional clusters...

    Found 6 chaperone genes in a 49-cluster (all: 61/5009, PValue: 9.84e-013)

    Found 10 structural protein genes in a 47-cluster (all: 361/5009, PValue: 9.58e-005)

    Found 8 extracellular genes in a 29-cluster (all: 400/5009, PValue: 4.93e-005)

Finished in 00 hours 01 minutes}

Here 191 genes are selected for clustering. dChip also automatically searches for functionally significant clusters in the resulting clustering tree.