Connections 17(2):78-80
Copyright 1994 INSNA
Stephen P. Borgatti
University of South Carolina
Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of Johnson's (1967) hierarchical clustering is this:
Step 3 can be done in different ways, which is what distinguishes single-link from complete-link and average-link clustering. In single-link clustering (also called the connectedness or minimum method), we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster. In complete-link clustering (also called the diameter or maximum method), we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster. In average-link clustering, we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster. A variation on average-link clustering is the UCLUS method of D'Andrade (1978) which uses the median distance.
Example. The following pages trace a hierarchical clustering of distances in miles between U.S. cities. The method of clustering is single-link.
Input distance matrix:
BOS | NY | DC | MIA | CHI | SEA | SF | LA | DEN | |
BOS | 0 | 206 | 429 | 1504 | 963 | 2976 | 3095 | 2979 | 1949 |
NY | 206 | 0 | 233 | 1308 | 802 | 2815 | 2934 | 2786 | 1771 |
DC | 429 | 233 | 0 | 1075 | 671 | 2684 | 2799 | 2631 | 1616 |
MIA | 1504 | 1308 | 1075 | 0 | 1329 | 3273 | 3053 | 2687 | 2037 |
CHI | 963 | 802 | 671 | 1329 | 0 | 2013 | 2142 | 2054 | 996 |
SEA | 2976 | 2815 | 2684 | 3273 | 2013 | 0 | 808 | 1131 | 1307 |
SF | 3095 | 2934 | 2799 | 3053 | 2142 | 808 | 0 | 379 | 1235 |
LA | 2979 | 2786 | 2631 | 2687 | 2054 | 1131 | 379 | 0 | 1059 |
DEN | 1949 | 1771 | 1616 | 2037 | 996 | 1307 | 1235 | 1059 | 0 |
The nearest pair of cities is BOS and NY, at distance 206. These are merged into a single cluster called "BOS/NY".
Then we compute the distance from this new compound object to all other objects. In single link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object. So the distance from "BOS/NY" to DC is chosen to be 233, which is the distance from NY to DC. Similarly, the distance from "BOS/NY" to DEN is chosen to be 1771.
After merging BOS with NY:
BOS/NY | DC | MIA | CHI | SEA | SF | LA | DEN | |
BOS/NY | 0 | 223 | 1308 | 802 | 2815 | 2934 | 2786 | 1771 |
DC | 223 | 0 | 1075 | 671 | 2684 | 2799 | 2631 | 1616 |
MIA | 1308 | 1075 | 0 | 1329 | 3273 | 3053 | 2687 | 2037 |
CHI | 802 | 671 | 1329 | 0 | 2013 | 2142 | 2054 | 996 |
SEA | 2815 | 2684 | 3273 | 2013 | 0 | 808 | 1131 | 1307 |
SF | 2934 | 2799 | 3053 | 2142 | 808 | 0 | 379 | 1235 |
LA | 2786 | 2631 | 2687 | 2054 | 1131 | 379 | 0 | 1059 |
DEN | 1771 | 1616 | 2037 | 996 | 1307 | 1235 | 1059 | 0 |
The nearest pair of objects is BOS/NY and DC, at distance 223. These are merged into a single cluster called "BOS/NY/DC". Then we compute the distance from this new cluster to all other clusters, to get a new distance matrix:
After merging DC with BOS-NY:
BOS/NY/DC | MIA | CHI | SEA | SF | LA | DEN | |
BOS/NY/DC | 0 | 1075 | 671 | 2684 | 2799 | 2631 | 1616 |
MIA | 1075 | 0 | 1329 | 3273 | 3053 | 2687 | 2037 |
CHI | 671 | 1329 | 0 | 2013 | 2142 | 2054 | 996 |
SEA | 2684 | 3273 | 2013 | 0 | 808 | 1131 | 1307 |
SF | 2799 | 3053 | 2142 | 808 | 0 | 379 | 1235 |
LA | 2631 | 2687 | 2054 | 1131 | 379 | 0 | 1059 |
DEN | 1616 | 2037 | 996 | 1307 | 1235 | 1059 | 0 |
Now, the nearest pair of objects is SF and LA, at distance 379. These are merged into a single cluster called "SF/LA". Then we compute the distance from this new cluster to all other objects, to get a new distance matrix:
After merging SF with LA:
BOS/ NY/DC |
MIA | CHI | SEA | SF/LA | DEN | |
BOS/NY/DC | 0 | 1075 | 671 | 2684 | 2631 | 1616 |
MIA | 1075 | 0 | 1329 | 3273 | 2687 | 2037 |
CHI | 671 | 1329 | 0 | 2013 | 2054 | 996 |
SEA | 2684 | 3273 | 2013 | 0 | 808 | 1307 |
SF/LA | 2631 | 2687 | 2054 | 808 | 0 | 1059 |
DEN | 1616 | 2037 | 996 | 1307 | 1059 | 0 |
Now, the nearest pair of objects is CHI and BOS/NY/DC, at distance 671. These are merged into a single cluster called "BOS/NY/DC/CHI". Then we compute the distance from this new cluster to all other clusters, to get a new distance matrix:
After merging CHI with BOS/NY/DC:
BOS/NY/DC/ CHI |
MIA | SEA | SF/LA | DEN | |
BOS/NY/DC/CHI | 0 | 1075 | 2013 | 2054 | 996 |
MIA | 1075 | 0 | 3273 | 2687 | 2037 |
SEA | 2013 | 3273 | 0 | 808 | 1307 |
SF/LA | 2054 | 2687 | 808 | 0 | 1059 |
DEN | 996 | 2037 | 1307 | 1059 | 0 |
Now, the nearest pair of objects is SEA and SF/LA, at distance 808. These are merged into a single cluster called "SF/LA/SEA". Then we compute the distance from this new cluster to all other clusters, to get a new distance matrix:
After merging SEA with SF/LA:
BOS/NY/DC/CHI | MIA | SF/LA/SEA | DEN | |
BOS/NY/DC/CHI | 0 | 1075 | 2013 | 996 |
MIA | 1075 | 0 | 2687 | 2037 |
SF/LA/SEA | 2054 | 2687 | 0 | 1059 |
DEN | 996 | 2037 | 1059 | 0 |
Now, the nearest pair of objects is DEN and BOS/NY/DC/CHI, at distance 996. These are merged into a single cluster called "BOS/NY/DC/CHI/DEN". Then we compute the distance from this new cluster to all other clusters, to get a new distance matrix:
After merging DEN with BOS/NY/DC/CHI:
BOS/NY/DC/CHI/DEN | MIA | SF/LA/SEA | |
BOS/NY/DC/CHI/DEN | 0 | 1075 | 1059 |
MIA | 1075 | 0 | 2687 |
SF/LA/SEA | 1059 | 2687 | 0 |
Now, the nearest pair of objects is BOS/NY/DC/CHI/DEN and SF/LA/SEA, at distance 1059. These are merged into a single cluster called "BOS/NY/DC/CHI/DEN/SF/LA/SEA". Then we compute the distance from this new compound object to all other objects, to get a new distance matrix:
After merging SF/LA/SEA with BOS/NY/DC/CHI/DEN:
BOS/NY/DC/CHI/DEN/SF/LA/SEA | MIA | |
BOS/NY/DC/CHI/DEN/SF/LA/SEA | 0 | 1075 |
MIA | 1075 | 0 |
Finally, we merge the last two clusters at level 1075. This process is summarized by the clustering diagram printed by many software packages:
In the diagram, the columns are associated with the items and the rows are associated with levels (stages) of clustering. An 'X' is placed between two columns in a given row if the corresponding items are merged at that stage in the clustering.
D'andrade,R. 1978, "U-Statistic Hierarchical Clustering" Psychometrika, 4:58-67.
Johnson,S.C. 1967, "Hierarchical Clustering Schemes" Psychometrika, 2:241-254.
[geneva97/eop.htm]