Measures of Similarity

Steve Borgatti, Boston College

Assume that we are measuring the similarity between vector X and vector Y. We use X* and Y* to refer to the canonical normalizations (or "uniformed" versions) of X and Y.

Generic Measure of Similarity

If X* indicates the uniformed version of X, then the Zegers & ten Berge family of association measures can all be described by the same equation:

(s +1)/2

Absolute Scale Data

Identity coefficient. Scale differences are not normalized away.

Not mentioned by Zegers & ten Berge is the Euclidean distance coefficient. This measure is not normed: it ranges from 0 (identical vectors) to infinity.
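To make the contrast between the unnormed distance and a normed coefficient concrete, here is a minimal sketch. It assumes the identity coefficient has the form 2*sum(XY) / (sum(X^2) + sum(Y^2)), which these notes do not spell out; the function names and the data are illustrative only.

```python
import math

def euclidean_distance(x, y):
    """Unnormed distance: 0 for identical vectors, no upper bound."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def identity_coefficient(x, y):
    """Assumed form of the identity coefficient: 2*sum(xy) / (sum(x^2) + sum(y^2))."""
    num = 2 * sum(a * b for a, b in zip(x, y))
    den = sum(a * a for a in x) + sum(b * b for b in y)
    return num / den

# Illustrative data, not taken from the notes.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

d = euclidean_distance(x, y)
e = identity_coefficient(x, y)

# Because sum((x - y)^2) = sum(x^2) + sum(y^2) - 2*sum(xy), dividing the
# squared distance by sum(x^2) + sum(y^2) recovers the assumed coefficient.
bounded = 1 - d ** 2 / (sum(a * a for a in x) + sum(b * b for b in y))

# Rescaling both vectors by 10 multiplies the distance by 10,
# which is why the raw distance has no fixed maximum.
d10 = euclidean_distance([10 * a for a in x], [10 * b for b in y])

print(d, e, bounded, d10)
```

Under that assumption, the normed coefficient is unchanged when both vectors are rescaled by the same factor, while the raw distance grows with the units of measurement.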

Ratio Scale Data

Tucker's congruence = coefficient of proportionality. Differences in amplitude are normalized away.

Additive Scale Data

Coefficient of additivity = Winer's I

Interval Scale Data

Pearson correlation = coefficient of linearity
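As a rough illustration of how the four metric-scale coefficients above can be read as one pattern, the sketch below assumes the general form g = 2*sum(X*Y*) / (sum(X*^2) + sum(Y*^2)). It also assumes the uniforming step is: nothing for absolute data, division by the root mean square for ratio data, mean-centering for additive data, and z-scoring for interval data. The names `uniform` and `association` are hypothetical, and the data are made up, so treat this as one reading of the family rather than the authors' own statement.

```python
import math

def uniform(x, scale):
    """Return an assumed 'uniformed' version of x for the given scale type."""
    n = len(x)
    mean = sum(x) / n
    if scale == "absolute":      # nothing is normalized away
        return list(x)
    if scale == "ratio":         # amplitude is divided out (root mean square)
        rms = math.sqrt(sum(v * v for v in x) / n)
        return [v / rms for v in x]
    if scale == "additive":      # the mean is subtracted
        return [v - mean for v in x]
    if scale == "interval":      # z-scores: mean and spread both removed
        sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
        return [(v - mean) / sd for v in x]
    raise ValueError("unknown scale type: " + scale)

def association(x, y, scale):
    """Assumed general form: 2*sum(X*Y*) / (sum(X*^2) + sum(Y*^2))."""
    xs, ys = uniform(x, scale), uniform(y, scale)
    num = 2 * sum(a * b for a, b in zip(xs, ys))
    den = sum(a * a for a in xs) + sum(b * b for b in ys)
    return num / den

# Illustrative data: y is exactly proportional to x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

results = {s: association(x, y, s)
           for s in ("absolute", "ratio", "additive", "interval")}
print(results)
```

With y = 2x, the proportionality and linearity coefficients reach 1 because the doubling is normalized away, while the identity and additivity coefficients stay below 1 because the scale difference still counts against similarity.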

Ordinal Data

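Spearman's rho = r(X*,Y*), where X* and Y* are the rank-transformed versions of X and Y
Goodman and Kruskal's gamma = (P - Q)/(P + Q), where P is the number of concordant pairs and Q is the number of discordant pairs
Example. In the pairwise table that follows the data, p marks a concordant pair, q a discordant pair, and n a pair tied on X or Y (counted in neither P nor Q):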

Case	X	Y
1	1	1
2	1	2
3	2	1
4	2	1
5	3	1
6	3	1
7	3	2

	1	2	3	4	5	6	7
1		n	n	n	n	n	p
2			q	q	q	q	n
3				n	n	n	p
4					n	n	p
5						n	n
6							n
7

P = 3, Q = 4, gamma = -1/7
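The pair counting above can be checked with a short sketch. It uses the seven cases from the example; the helper names are hypothetical, ties are given average ranks for Spearman's rho (a common convention, assumed here), and the Pearson helper assumes neither vector is constant.

```python
import math
from itertools import combinations

def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes both vectors have nonzero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def gamma_from_pairs(x, y):
    """Count concordant (P) and discordant (Q) pairs; tied pairs are skipped."""
    p = q = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            p += 1
        elif s < 0:
            q += 1
    return p, q, (p - q) / (p + q)

# The seven cases from the example above.
X = [1, 1, 2, 2, 3, 3, 3]
Y = [1, 2, 1, 1, 1, 1, 2]

P, Q, gamma = gamma_from_pairs(X, Y)
print(P, Q, gamma)          # P = 3, Q = 4, gamma = -1/7
print(spearman_rho(X, Y))
```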

Or do it via a contingency table (rows are the values of X, columns are the values of Y):

	1	2
1	1	1
2	2	0
3	2	1

P = 1*(0+1) + 2*(1) = 3

Q = 1*(2+2) + 0*(2) = 4

Gamma = -1/7

Another example:

City Size/Arenas	Small	Medium	Large
Weak Mayor	a = 10	b = 5	c = 2
Strong Mayor	d = 10	e = 15	f = 20

P = a(e+f) + bf = 10(15+20) + 5*20 = 450
Q = c(d+e) + bd = 2(10+15) + 5*10 = 100
gamma = (P - Q)/(P + Q) = (450-100)/(450 + 100) = .636
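Both contingency-table calculations above follow the same rule: each cell is multiplied by the sum of the cells below and to its right for P, and by the sum of the cells below and to its left for Q. A minimal sketch of that rule follows; the function name is hypothetical, and it assumes rows and columns are already in ascending order and that P + Q is not zero.

```python
def gamma_from_table(table):
    """Goodman and Kruskal's gamma from an ordered contingency table.

    Assumes rows and columns are both listed in ascending order
    and that at least one pair is concordant or discordant.
    """
    rows, cols = len(table), len(table[0])
    p = q = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):
                # Cells below and to the right are concordant with cell (i, j).
                p += table[i][j] * sum(table[k][j + 1:])
                # Cells below and to the left are discordant with cell (i, j).
                q += table[i][j] * sum(table[k][:j])
    return p, q, (p - q) / (p + q)

# First example: rows are X = 1, 2, 3 and columns are Y = 1, 2.
first = [[1, 1],
         [2, 0],
         [2, 1]]

# Second example: weak and strong mayor rows, small/medium/large columns.
second = [[10, 5, 2],
          [10, 15, 20]]

print(gamma_from_table(first))    # P = 3, Q = 4, gamma = -1/7
print(gamma_from_table(second))   # P = 450, Q = 100, gamma = 350/550
```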

Presence/Absence Data

Simple matches
Jaccard
Gamma / Yule's Q = (ad - bc)/(ad + bc) = (OR - 1)/(OR + 1), where a, b, c, d are the cells of the 2-by-2 table and OR = ad/(bc) is the odds ratio
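The presence/absence measures listed above can be sketched from the four cell counts of a 2-by-2 table. The definitions used here are the usual ones (simple matching = (a + d)/n, Jaccard = a/(a + b + c)) and are stated as assumptions because the notes do not give them; the data are made up, and the sketch assumes b and c are nonzero so that the odds ratio exists.

```python
def two_by_two(x, y):
    """Cell counts for two 0/1 vectors.

    a = both present, b = present in x only,
    c = present in y only, d = both absent.
    """
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def simple_matching(a, b, c, d):
    """Share of cases on which the two vectors agree, joint absences included."""
    return (a + d) / (a + b + c + d)

def jaccard(a, b, c, d):
    """Like simple matching, but joint absences (d) are ignored."""
    return a / (a + b + c)

def yules_q(a, b, c, d):
    """Gamma for a 2-by-2 table."""
    return (a * d - b * c) / (a * d + b * c)

# Illustrative presence/absence data, not taken from the notes.
x = [1, 1, 1, 0, 0, 1, 0, 0]
y = [1, 1, 0, 0, 1, 1, 0, 0]

a, b, c, d = two_by_two(x, y)
odds_ratio = (a * d) / (b * c)          # assumes b and c are nonzero
q_from_cells = yules_q(a, b, c, d)
q_from_or = (odds_ratio - 1) / (odds_ratio + 1)

print(a, b, c, d)
print(simple_matching(a, b, c, d), jaccard(a, b, c, d))
print(q_from_cells, q_from_or)
```

The last line prints the same value twice, which is the equivalence between the cell form and the odds-ratio form of Yule's Q.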

Nominal Data

Chi-square
Cramér's V (equals phi in absolute value when the table is 2 by 2)
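A small sketch of the nominal measures, using the standard Pearson chi-square statistic and V = sqrt(chi-square / (n * (min(rows, cols) - 1))). These formulas are not given in the notes, so they are stated here as assumptions; the function names and the table are illustrative, and the sketch assumes no row or column total is zero.

```python
import math

def chi_square(table):
    """Pearson chi-square: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def cramers_v(table):
    """Chi-square rescaled to the 0..1 range by sample size and table shape."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square(table) / (n * k))

def phi(table):
    """Phi coefficient for a 2-by-2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Illustrative 2-by-2 table, not taken from the notes.
t = [[3, 1],
     [1, 3]]

print(chi_square(t), cramers_v(t), phi(t))
```

For this table Cramér's V and phi coincide; swapping the two columns would flip the sign of phi but leave V unchanged.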