# HappieClust - Fast approximate hierarchical clustering using similarity heuristics

**Authors: Meelis Kull, Jaak Vilo**

Quretec Ltd., Tartu, Estonia

BIIT-group, Institute of Computer Science, University of Tartu

Contact: meelis d kull a ut d ee, jaak d vilo a ut d ee (a=@ d=.)

## Publication

M. Kull, J. Vilo. Fast approximate hierarchical clustering using similarity heuristics. *BioData Mining* 2008, **1**:9 (22 September 2008)

## Description

HappieClust is an approximate version of agglomerative hierarchical clustering. When performing the standard full agglomerative hierarchical clustering, each pair of objects must be inspected to evaluate similarity. This is very time-consuming for large numbers of objects and/or complicated similarity measures. HappieClust performs agglomerative hierarchical clustering with partial information, not requiring all pairwise similarities to be known. HappieClust is further able to use similarity heuristics to carefully choose a subset of pairs for which the similarities are evaluated.
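The idea of clustering from partial pairwise information can be illustrated with a toy Python sketch. This is only an illustration, not HappieClust's actual algorithm; the function `approx_average_linkage` and its arguments are made up for this example. It evaluates the distance function only on a random subset of pairs and then performs average-linkage merges using whatever cross-pair distances happen to be known:

```python
import itertools
import random

def approx_average_linkage(objects, dist, n_pairs, seed=0):
    """Toy sketch: average-linkage clustering that evaluates `dist`
    only on a random subset of `n_pairs` object pairs."""
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(range(len(objects)), 2))
    known = {p: dist(objects[p[0]], objects[p[1]])
             for p in rng.sample(all_pairs, min(n_pairs, len(all_pairs)))}
    # each cluster is a frozenset of object indices
    clusters = [frozenset([i]) for i in range(len(objects))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in itertools.combinations(clusters, 2):
            # average linkage over the *known* cross-pair distances only
            ds = [known[tuple(sorted((i, j)))]
                  for i in a for j in b if tuple(sorted((i, j))) in known]
            if ds:
                d = sum(ds) / len(ds)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best is None:  # no known distance links the remaining clusters
            best = (float('inf'), clusters[0], clusters[1])
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a), sorted(b), d))
    return merges
```

With the full pair budget this reduces to ordinary average-linkage clustering; with a smaller budget it trades accuracy for far fewer distance evaluations.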

HappieClust can be applied to any kind of data. It has a large set of built-in similarity measures for numeric data vectors and two measures for character strings; for these, the similarity heuristics can be applied. For unsupported data types and similarity measures, it is possible to precalculate any subset of similarities with any other tool and then feed the results into HappieClust for clustering.

HappieClust was originally developed for clustering gene expression microarray data (see the publication listed above), but it can be applied to any other kind of biological or non-biological data.

## Built-in similarity measures

HappieClust interprets value 0 as perfect similarity and the increase in value as weakening similarity. In this sense, HappieClust works with distance measures rather than similarity measures. Negative values are not accepted. The built-in distance measures are as follows:

**NB!** A custom similarity measure can be used by precalculating a subset of similarities with any other tool and then feeding the results into HappieClust for clustering.

### For numeric vectors

| Distance measure | Description |
|---|---|
| Minkowski distance (with parameter P) | P-th root of the sum of P-th powers of vector element-wise differences |
| Manhattan distance | Minkowski distance with P=1 |
| Euclidean distance | Minkowski distance with P=2 |
| Pearson correlation distance | One minus Pearson correlation of the two vectors (centered correlation) |
| Absolute Pearson correlation distance | One minus the absolute value of Pearson correlation of the two vectors |
| Uncentered correlation distance | One minus the uncentered correlation of the two vectors |
| Absolute uncentered correlation distance | One minus the absolute value of the uncentered correlation of the vectors |
| Spearman's rank correlation distance | One minus Spearman's rank correlation of the two vectors |
| Absolute Spearman's rank correlation distance | One minus the absolute value of the Spearman's rank correlation of the two vectors |
| Chord distance | Euclidean distance measured after the normalization of the two vectors |
| Centered chord distance | Euclidean distance measured after centering and normalizing the vectors |
| Average difference distance | Average of vector element-wise differences |
| Maximum difference distance | Maximum of vector element-wise differences |
| Sum of signs distance | The number of positions where the vectors have opposite-signed values |
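A few of the measures above can be sketched in plain Python. These are straightforward textbook definitions, not HappieClust's implementation:

```python
import math

def minkowski(x, y, p):
    # P-th root of the sum of P-th powers of absolute element-wise differences
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def pearson_distance(x, y):
    # one minus the (centered) Pearson correlation of the two vectors
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)

def chord_distance(x, y):
    # Euclidean distance after normalizing both vectors to unit length
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return minkowski([a / nx for a in x], [b / ny for b in y], 2)
```

Note that all three return 0 for perfectly similar inputs and grow as similarity weakens, matching the convention described above.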

### For character strings

| Distance measure | Description |
|---|---|
| Hamming distance | The number of positions where the strings have different characters |
| Levenshtein (edit) distance | The number of single-character insertions, deletions, and substitutions it takes to get one string from the other |
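Both string measures are standard; a plain-Python sketch (again, not HappieClust's implementation):

```python
def hamming(s, t):
    # number of positions where two equal-length strings differ
    assert len(s) == len(t), "Hamming distance needs equal-length strings"
    return sum(a != b for a, b in zip(s, t))

def levenshtein(s, t):
    # classic dynamic-programming edit distance: the minimum number of
    # single-character insertions, deletions and substitutions turning s into t
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, a in enumerate(s, 1):
        cur = [i]                   # distance from s[:i] to ""
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a from s
                           cur[j - 1] + 1,           # insert b into s
                           prev[j - 1] + (a != b)))  # substitute a -> b
        prev = cur
    return prev[-1]
```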

## Licence

HappieClust is free for non-commercial use; please cite the publication listed above. For commercial use, contact the authors.

## Download `happieclust` version 1.6.1 binary executables

Please contact the authors in order to get binaries for other platforms.

### Version history

- Version 1.6.0 32-bit Linux (x86)
- Version 1.6.0 64-bit Linux (x86)
- Version 1.6.0 Intel Mac OS X (x86)
- Version 1.6.0 Windows

The package contains five programs:

`happieclust`

- the main binary, which reads data in NetCDF format, performs hierarchical clustering, and writes results in NetCDF format

`tab2nc`

- a tool for converting tab-separated text files into NetCDF format; part of the TabCDF toolkit authored by Tambet Arak and Meelis Kull

`nc2tab`

- a tool for converting NetCDF files into tab-separated text files; also part of the TabCDF toolkit

`happieclust2cdt`

- a tool for converting NetCDF files resulting from HappieClust clustering into Stanford clustered data format, which can be visualized with e.g. Java Treeview

`happieclust_matrix`

- a convenience script which runs the other tools to cluster a tabular text file, with output in Stanford clustered data format

The datasets used in the article are available from here in NetCDF format:

- shyamsundar05.nc (20 MB)
- lukk08.nc (1 GB)

## Examples

- Input text file capitals.txt (monthly average temperatures of 17 capital cities in tab-separated format):

- Cluster the rows and columns of the temperature matrix, i.e. cluster the cities and months with respect to temperature:

`happieclust_matrix.sh capitals.txt capitals -d euclidean` (Linux and Mac OS X)

`happieclust_matrix.bat capitals.txt capitals -d euclidean` (Windows)

- Resulting files in Stanford clustered data format (can be viewed with e.g. Java Treeview):
- Visualization of results with Java Treeview:

- Input text file: capitals.txt
- Convert the text file into NetCDF format:

`tab2nc '{string empty,col_names[c]} r*{string row_names[]; double matrix[][c]}' capitals.txt capitals2.nc`

- Resulting NetCDF file: capitals2.nc
- View the header of the NetCDF file (optional, requires an installation of NetCDF):

`ncdump -h capitals2.nc`

- Perform full hierarchical clustering using Euclidean distance (no need to use approximation as the dataset is so small):

`happieclust -i capitals2.nc -o capitals2_rows.nc --output-data -D matrix --cluster rows -d euclidean --all-pairwise`

`happieclust -i capitals2_rows.nc -o capitals2_rowscols.nc --output-data -D matrix --cluster cols -d euclidean --all-pairwise`

- Convert results into Stanford clustered data format:

`happieclust2cdt capitals2_rowscols.nc capitals2`

- Resulting files in Stanford clustered data format (can be viewed with e.g. Java Treeview). The results are exactly the same as in Example 1.
- Instead of using the Stanford clustered data format, it can be useful to work with the resulting NetCDF file directly.
The file contains 7 one-dimensional arrays of length 2*N-1, where N is the number of clustered objects.
For example, the NetCDF file resulting from the command

`happieclust --input capitals2.nc -D matrix --output capitals2_cols.nc --cluster cols --distance euclidean --all-pairwise --prefix ''`

contains the following data:

The numbers in grey are independent of the clustered data. The meaning of the fields is exemplified in the following figure:
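The 2*N-1 layout itself is easy to illustrate: with N leaves, a binary merge tree has N-1 internal nodes, so every per-node field fits in one array of length 2*N-1. A minimal sketch follows; the field names `parent` and `height` are illustrative, not HappieClust's actual NetCDF variable names:

```python
N = 4  # number of clustered objects (leaves)
# merges as (left child, right child, merge height); nodes are indexed
# 0..2N-2, with leaves 0..N-1 and internal nodes N..2N-2
merges = [(0, 1, 1.0), (2, 3, 1.5), (4, 5, 6.0)]

size = 2 * N - 1
parent = [-1] * size   # parent node index, -1 for the root
height = [0.0] * size  # merge height; 0 for leaves
for k, (left, right, h) in enumerate(merges):
    node = N + k
    parent[left] = parent[right] = node
    height[node] = h
```

After the loop, `parent` and `height` each describe all 2*N-1 nodes of the tree in one flat array, which is the general shape such NetCDF output uses.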

**A very useful feature:** when you cluster a NetCDF variable along some dimension and use the flag `--output-data`, HappieClust reorders all variables which include this dimension. For example, you might have metadata for each object you cluster, such as the population of a city or the number of days in a month. This feature was used above to reorder the temperature matrix and the names of cities and months, as the Stanford clustered data format requires.
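The reordering can be mimicked in a few lines of plain Python, assuming the leaf order produced by the clustering is known (all values below are made up for illustration):

```python
# leaf order produced by clustering the rows (illustrative values)
row_order = [2, 0, 1]

matrix = [[10, 11],  # one row of temperatures per city
          [20, 21],
          [30, 31]]
row_names = ["Tallinn", "Riga", "Vilnius"]
row_meta = ["meta-A", "meta-B", "meta-C"]  # any per-row metadata array

# every array spanning the clustered dimension gets the same reordering
matrix = [matrix[i] for i in row_order]
row_names = [row_names[i] for i in row_order]
row_meta = [row_meta[i] for i in row_order]
```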

- Input NetCDF file: shyamsundar05.nc
- Perform average-linkage approximate hierarchical clustering of the genes using Pearson correlation distance:

`happieclust -i shyamsundar05.nc --data expr -o ex3tmp.nc -d pear -l ave -C rows -O --memory 1 --time 20`

- As a result, HappieClust clusters the data as well as it can while trying to fit into 1 GB of memory and 20 seconds of running time. It does this by estimating the largest number of object pairs for which the distances can be calculated, and the clustering performed, within the time and memory constraints. By default, half of these pairs are chosen using similarity heuristics and the other half randomly.
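The half-heuristic/half-random split might be sketched as follows. The single-pivot trick below is just one plausible heuristic for guessing close pairs, not HappieClust's actual strategy (see the paper for that), and `choose_pairs` is a hypothetical helper:

```python
import itertools
import random

def choose_pairs(objects, dist, budget, seed=0):
    """Pick at most `budget` object pairs: half at random, half by a
    crude closeness heuristic. Illustrative only."""
    rng = random.Random(seed)
    n = len(objects)
    all_pairs = list(itertools.combinations(range(n), 2))
    # random half of the budget
    pairs = set(rng.sample(all_pairs, min(budget // 2, len(all_pairs))))
    # heuristic half: sort objects by distance to one random pivot and
    # pair up neighbours in that ordering (neighbours in pivot-distance
    # order tend to be close to each other)
    pivot = rng.randrange(n)
    order = sorted(range(n), key=lambda i: dist(objects[i], objects[pivot]))
    for i, j in zip(order, order[1:]):
        if len(pairs) >= budget:
            break
        pairs.add((min(i, j), max(i, j)))
    return pairs
```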
- Perform the same on the conditions with Euclidean distance. As there are only 123 conditions, HappieClust is able to perform full hierarchical clustering within the time and memory constraints.

`happieclust -i ex3tmp.nc -D expr -o ex3.nc -d eucl -l ave -C cols -O --memory 1 --time 20`

- Convert into Stanford clustered data format:

`happieclust2cdt ex3.nc ex3 -M expr -R genes -C samples`

- A small part of the visualization with Java Treeview:

- Input NetCDF file: shyamsundar05.nc
- Perform approximate hierarchical clustering of the genes using Pearson correlation distance, calculating 3 million distances:

`happieclust -i shyamsundar05.nc --data expr -o ex4.nc -d pear -C rows --num-distances 3000000`

- As a result, HappieClust clusters the data as well as it can with 3 million as the limit on the number of distance calculations.
- Output NetCDF file: ex4.nc

Page last updated: Feb 19, 2010