HappieClust - Fast approximate hierarchical clustering using similarity heuristics

Publication
Description
Built-in similarity measures
License
Download
Examples

Publication

M. Kull, J. Vilo. Fast approximate hierarchical clustering using similarity heuristics. BioData Mining 2008, 1:9 (22 September 2008)

Description

HappieClust is an approximate version of agglomerative hierarchical clustering. In standard full agglomerative hierarchical clustering, every pair of objects must be inspected to evaluate their similarity, which is very time-consuming for large numbers of objects and/or complicated similarity measures. HappieClust performs agglomerative hierarchical clustering with partial information, without requiring all pairwise similarities to be known. It can also use similarity heuristics to carefully choose the subset of pairs for which similarities are evaluated.
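The idea of clustering with partial information can be illustrated with a minimal Python sketch (an illustration only, not HappieClust's actual algorithm or code): average-linkage agglomerative clustering in which the distance between two clusters is averaged only over the object pairs whose distances happen to have been evaluated.

```python
from itertools import combinations

def cluster_partial(n, known):
    """Average-linkage clustering of n objects from a partial distance set.

    known: dict mapping (i, j) with i < j to a distance; pairs that were
    never evaluated are simply absent.  Returns the merges as a list of
    (members_a, members_b, distance) tuples.
    """
    active = [frozenset([i]) for i in range(n)]
    merges = []
    while len(active) > 1:
        best = None
        for a, b in combinations(active, 2):
            vals = [known.get((min(i, j), max(i, j))) for i in a for j in b]
            vals = [v for v in vals if v is not None]
            if not vals:
                continue  # no evaluated distance between these clusters
            d = sum(vals) / len(vals)  # average over the *known* pairs only
            if best is None or d < best[2]:
                best = (a, b, d)
        if best is None:
            break  # remaining clusters share no evaluated distances
        a, b, d = best
        active.remove(a)
        active.remove(b)
        active.append(a | b)
        merges.append((set(a), set(b), d))
    return merges
```

With only four of the six pairwise distances of four objects known, the sketch still recovers a full merge tree; HappieClust applies the same principle at scale with a much more efficient implementation.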

HappieClust can be applied to any kind of data. It has a large set of built-in similarity measures for numeric data vectors and two measures for character strings; for these, the similarity heuristics can be applied. For unsupported data types and similarity measures, it is possible to precalculate any subset of similarities with another tool and feed the results into HappieClust for clustering.

HappieClust was originally developed for clustering gene expression microarray data (see the publication listed above), but can be applied to any other kind of biological or non-biological data.

Built-in similarity measures

HappieClust interprets the value 0 as perfect similarity, with increasing values indicating weaker similarity. In this sense, HappieClust works with distance measures rather than similarity measures. Negative values are not accepted. The built-in distance measures are as follows:

NB! A custom similarity measure can be used by precalculating a subset of similarities with any other tool and then feeding the results into HappieClust for clustering.

For numeric vectors

  • Minkowski distance (with parameter p): the p-th root of the sum of the p-th powers of the element-wise differences of the vectors
  • Manhattan distance: Minkowski distance with p=1
  • Euclidean distance: Minkowski distance with p=2
  • Pearson correlation distance: one minus the Pearson correlation of the two vectors (centered correlation)
  • Absolute Pearson correlation distance: one minus the absolute value of the Pearson correlation of the two vectors
  • Uncentered correlation distance: one minus the uncentered correlation of the two vectors
  • Absolute uncentered correlation distance: one minus the absolute value of the uncentered correlation of the two vectors
  • Spearman's rank correlation distance: one minus the Spearman's rank correlation of the two vectors
  • Absolute Spearman's rank correlation distance: one minus the absolute value of the Spearman's rank correlation of the two vectors
  • Chord distance: Euclidean distance measured after normalizing the two vectors
  • Centered chord distance: Euclidean distance measured after centering and normalizing the two vectors
  • Average difference distance: the average of the element-wise differences of the vectors
  • Maximum difference distance: the maximum of the element-wise differences of the vectors
  • Sum of signs distance: the number of positions where the vectors have opposite-signed values
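A few of these measures can be sketched in plain Python to make the definitions concrete (illustrative definitions only, not HappieClust's implementation):

```python
import math

def minkowski(x, y, p):
    # p-th root of the sum of the p-th powers of element-wise differences;
    # p=1 gives the Manhattan distance, p=2 the Euclidean distance
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def pearson_distance(x, y):
    # one minus the (centered) Pearson correlation of the two vectors
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def chord_distance(x, y):
    # Euclidean distance measured after normalizing both vectors to unit length
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return minkowski([a / nx for a in x], [b / ny for b in y], 2)
```

Note that perfectly correlated vectors get Pearson correlation distance 0, matching the convention that 0 means perfect similarity.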

For character strings

  • Hamming distance: the number of positions where the strings have different characters
  • Levenshtein (edit) distance: the number of single-character insertions, deletions, and replacements it takes to turn one string into the other
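Both string measures are standard and can be sketched in a few lines of Python (again an illustration, not HappieClust's code):

```python
def hamming(s, t):
    # number of positions with different characters (equal-length strings only)
    if len(s) != len(t):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(s, t))

def levenshtein(s, t):
    # classic dynamic programming with a rolling row:
    # d[j] holds the edit distance between the processed prefix of s and t[:j]
    d = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        prev, d[0] = d[0], i
        for j, b in enumerate(t, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (a != b))  # replacement
    return d[len(t)]
```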

License

HappieClust is free for non-commercial use; please cite the publication listed above. For commercial use, contact the authors.

Download happieclust version 1.6.1 binary executables

Please contact the authors to get binaries for other platforms.

Version history

The package contains five programs:

Run each executable without parameters to see help on its command-line options.

The datasets used in the article are available in NetCDF format:

Examples

Example 1 (Simple)
  • Input text file capitals.txt (monthly average temperatures of 17 capital cities in tab-separated format):
  • Cluster the rows and columns of the temperature matrix, i.e. cluster the cities and months with respect to temperature:
    happieclust_matrix.sh capitals.txt capitals -d euclidean (Linux and MacOSX)
    happieclust_matrix.bat capitals.txt capitals -d euclidean (Windows)
  • Resulting files in Stanford clustered data format (can be viewed with e.g. Java Treeview):
  • Visualization of results with Java Treeview:
    Hierarchical clustering of capital cities and months by average temperature
Example 2 (Advanced)
  • Input text file:
    capitals.txt
  • Convert the text file into NetCDF format:
    tab2nc '{string empty,col_names[c]} r*{string row_names[]; double matrix[][c]}' capitals.txt capitals2.nc
  • Resulting NetCDF file:
    capitals2.nc
  • View the header of the NetCDF file (optional, requires installation of NetCDF):
    ncdump -h capitals2.nc
  • Perform full hierarchical clustering using Euclidean distance (no need to use approximation as the dataset is so small):
    happieclust -i capitals2.nc -o capitals2_rows.nc --output-data -D matrix --cluster rows -d euclidean --all-pairwise
    happieclust -i capitals2_rows.nc -o capitals2_rowscols.nc --output-data -D matrix --cluster cols -d euclidean --all-pairwise
  • Convert results into Stanford clustered data format:
    happieclust2cdt capitals2_rowscols.nc capitals2
  • Resulting files in Stanford clustered data format (can be viewed with e.g. Java Treeview); the results are exactly the same as in Example 1.
  • Instead of using the Stanford clustered data format, it can be useful to use the resulting NetCDF file directly. The file contains 7 one-dimensional arrays of length 2*N-1 where N is the number of clustered objects. For example, the NetCDF file resulting from the command
    happieclust --input capitals2.nc -D matrix --output capitals2_cols.nc --cluster cols --distance euclidean --all-pairwise --prefix ''
    contains the following data:
    Tabular representation of the HappieClust result file
    The numbers in grey are independent of the clustered data. The meaning of the fields is exemplified in the following figure:
    Dendrogram
  • A very useful feature: when you cluster a NetCDF variable along some dimension and use the flag --output-data, HappieClust reorders all variables which include this dimension. For example, you might have some metadata for each object you cluster, such as the population of a city or the number of days in a month. This feature was used above to reorder the temperature matrix and the names of cities and months, as required by the Stanford clustered data format.
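The 2*N-1 layout of the result file can be illustrated with a small Python sketch. The actual names and contents of the seven arrays are those shown in the tabular representation above; the arrays below (left, right, height) are hypothetical stand-ins used only to show how a dendrogram over N leaves fits into parallel arrays of length 2*N-1.

```python
# Illustration (not HappieClust's actual file layout): a dendrogram over
# N leaves has exactly 2*N - 1 nodes, so each per-node field fits in one
# array of that length.  Nodes 0..N-1 are leaves; each node k >= N merges
# two earlier nodes, and node 2*N - 2 is the root.

N = 4
left   = [-1, -1, -1, -1, 0, 2, 4]            # left child (-1 for leaves)
right  = [-1, -1, -1, -1, 1, 3, 5]            # right child (-1 for leaves)
height = [0.0, 0.0, 0.0, 0.0, 1.0, 1.5, 5.0]  # merge distance of each node

def leaf_order(node):
    """Left-to-right order of the leaves under a node (root = 2*N - 2)."""
    if left[node] == -1:
        return [node]
    return leaf_order(left[node]) + leaf_order(right[node])
```

Traversing from the root recovers the leaf ordering used when the clustered data is written out.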
Example 3 (Approximate clustering given time and memory constraints)
  • Input NetCDF file:
    shyamsundar05.nc
  • Perform average linkage approximate hierarchical clustering of the genes using Pearson correlation distance:
    happieclust -i shyamsundar05.nc --data expr -o ex3tmp.nc -d pear -l ave -C rows -O --memory 1 --time 20
  • As a result, HappieClust clusters the data as well as it can while trying to fit within 1 GB of memory and 20 seconds of running time. This is achieved by estimating the largest number of object pairs for which the distances can be calculated and the clustering performed within the time and memory constraints. By default, half of these pairs are chosen using similarity heuristics and the other half randomly.
  • Perform the same on the conditions, using Euclidean distance. As there are only 123 conditions, HappieClust is able to perform full hierarchical clustering within the time and memory constraints:
    happieclust -i ex3tmp.nc -D expr -o ex3.nc -d eucl -l ave -C cols -O --memory 1 --time 20
  • Convert into Stanford clustered data format:
    happieclust2cdt ex3.nc ex3 -M expr -R genes -C samples
  • A small part of the visualization with Java Treeview:
    Dendrogram
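The kind of budgeting described above can be sketched with back-of-the-envelope arithmetic. The per-pair byte cost below is an assumption made for illustration, not HappieClust's actual memory model:

```python
def full_pairwise(n):
    # number of distances needed for full hierarchical clustering of n objects
    return n * (n - 1) // 2

def max_pairs(memory_gb, bytes_per_pair=16):
    # ASSUMED cost per stored pair: two 4-byte indices + one 8-byte distance;
    # the real constant in HappieClust may differ
    return int(memory_gb * 2 ** 30) // bytes_per_pair
```

Under these assumed numbers, a 1 GB budget holds roughly 67 million pairs, while the 123 conditions of the second step need only full_pairwise(123) = 7503 distances, which is why full clustering fits easily within the constraints there.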
Example 4 (Approximate clustering given the number of distances)
  • Input NetCDF file:
    shyamsundar05.nc
  • Perform approximate hierarchical clustering of the genes using Pearson correlation distance by calculating 3 million distances:
    happieclust -i shyamsundar05.nc --data expr -o ex4.nc -d pear -C rows --num-distances 3000000
  • As a result, HappieClust clusters the data as well as it can with 3 million as the limit on the number of distance calculations.
  • Output NetCDF file:
    ex4.nc

Page last updated: Feb 19, 2010