HANDOUT
Handling Missing Data
It is often said that network
analysis is less forgiving of missing data than other forms of research.
This is probably not nearly as true as people think (see Borgatti,
Carley and Krackhardt, 2006), but there is something to it.
We should distinguish between
two forms of missing data: node level and tie level. Node level missing
data is where a respondent does not answer the network portion of a
survey at all, as if they were not part of the study. Tie level missing
data is where they choose not to give an evaluation of a particular actor
(such as their boss), but do answer for other actors.
Tie level missing data -- if
not excessive -- can be handled by standard imputation approaches, such
as the Ward, Hoff and Lofdahl (2003) approach, and we don't discuss it
further.
Node level missing data is
more problematic. Two main strategies are used to deal with it. First,
you can ignore the node entirely, as if it never existed. From a matrix
perspective, if the original data matrix had 50 rows and columns, the
new matrix will now have 49 rows and columns.
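The deletion strategy can be sketched outside UCINET as well. The following is a minimal illustration in Python with NumPy, not a UCINET procedure; the 4-actor matrix is made up, and it assumes missing answers are coded as NaN and that a non-respondent is a node whose whole row is missing.

```python
import numpy as np

# Hypothetical 4-actor network; actor 2 did not respond, so row 2 is all NaN.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [np.nan, np.nan, np.nan, np.nan],
    [0, 1, 1, 0],
], dtype=float)

# Assumption: a non-respondent is a node whose entire row is missing.
missing = np.isnan(A).all(axis=1)

# Drop both the row and the column of every non-respondent.
keep = ~missing
A_reduced = A[np.ix_(keep, keep)]

print(A_reduced.shape)
```

Note that dropping the column also throws away what the respondents said about the missing actor, which is exactly the information the imputation approach tries to reuse.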
Another approach is to impute
the missing data, which means to guess what the person would have
answered had they had a chance. There are several ways to do this,
including modeling the dataset by fitting an ERG model, then filling in
the missing data with maximum likelihood estimates based on the
parameters of the ERGM. But the simple way is as follows.
Undirected (logically symmetric) data
First, let's consider the
case of an undirected relation (i.e., a social relation that is
logically symmetric). In that case, the simple strategy is to assume
that if the respondent had answered, they would have responded the same
way that others in fact did about them. In short, the person's column in
the data matrix (what people say about them) is used to fill in the
values of the person's row, which is missing.
The easy way to do this in
UCINET is via an undocumented matrix algebra command called
replacena. Given input matrices A and B, the replacena
routine changes any missing values found in A to the corresponding value
found in B, and saves the result in a new matrix C. For example, typing
--> C = replacena(A B)
asks the program to create a new dataset C such that cij = bij if aij is
missing, and cij = aij otherwise.
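For readers who want to check the rule outside UCINET, here is a minimal Python sketch of the same cell-by-cell logic. It is an illustration of the definition above, not UCINET's implementation; the function below simply borrows the name replacena, and it assumes missing values are coded as NaN.

```python
import numpy as np

def replacena(a, b):
    """Return a copy of a in which missing (NaN) cells are taken from b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("a and b must have the same shape")
    # c_ij = b_ij where a_ij is missing, otherwise c_ij = a_ij.
    return np.where(np.isnan(a), b, a)

# Made-up example matrices.
A = np.array([[1.0, np.nan], [np.nan, 4.0]])
B = np.array([[9.0, 8.0], [7.0, 6.0]])
C = replacena(A, B)
print(C)
```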
So how do you use this to replace missing values with what the other
person said? The information about what others said about the missing
person is given in the missing person's column. So you want cij = aij
when person i answered the survey, but cij = aji when they didn't. In
other words, you want to fill in from the transpose of the matrix. So if
A is the raw data matrix, you want to create a new version of it called,
say, A-cleaned as follows:
--> A-cleaned = replacena(A transpose(A))
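The same fill-from-the-transpose step can be sketched in Python with NumPy. This is only an illustration on a made-up 3-actor matrix, assuming missing answers are coded as NaN; it is not how UCINET stores or processes data.

```python
import numpy as np

# Hypothetical undirected network; actor 1 did not answer, so row 1 is missing.
A = np.array([
    [0, 1, 0],
    [np.nan, np.nan, np.nan],
    [0, 1, 0],
], dtype=float)

# Missing a_ij is replaced by a_ji, i.e. by what j said about i.
A_cleaned = np.where(np.isnan(A), A.T, A)
print(A_cleaned)
```

A cell stays missing when both a_ij and a_ji are missing. That covers the non-respondent's diagonal cell and any tie between two non-respondents, so this approach cannot recover ties among people who all failed to answer.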
Directed (logically non-symmetric) data
If the network is directed, such as who gives advice to whom, we can't
use the trick above, because there is no reason that if person i gives
advice to person j, person j gives advice to person i. But we can do
something very similar *if* we have had the foresight to ask the directed
relation in two separate directions. By separate directions I mean that
whenever you ask "who do you go to for advice" (GET) you also ask "who
comes to you for advice" (GIVE). Each of these creates its own matrix,
and you can use them to fill in for each other, because if i seeks advice
from j, then we expect that j would report i coming to them for advice.
Now if i doesn't fill out the survey, then one approach to handling the
missing data is to assume that i's row in the "goes to for advice" matrix
would resemble i's column in the "comes to you for advice" matrix, which
means we could use replacena to fill in the missing values, as follows:
--> CLEANEDGET = replacena(GET transpose(GIVE))
--> CLEANEDGIVE = replacena(GIVE transpose(GET))
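As a rough check of the logic, the two commands can be mimicked in Python with NumPy. The matrices below are invented for illustration, and the sketch assumes GET[i, j] = 1 means i reports going to j for advice, GIVE[i, j] = 1 means i reports that j comes to them for advice, and missing answers are coded as NaN.

```python
import numpy as np

nan = np.nan

# GET[i, j] = 1 if i reports going to j for advice. Actor 1 did not respond.
GET = np.array([
    [0, 1, 0],
    [nan, nan, nan],
    [1, 1, 0],
], dtype=float)

# GIVE[i, j] = 1 if i reports that j comes to them for advice.
GIVE = np.array([
    [0, 1, 1],
    [nan, nan, nan],
    [0, 0, 0],
], dtype=float)

# Fill each matrix's missing cells from the transpose of the other matrix.
cleaned_get = np.where(np.isnan(GET), GIVE.T, GET)
cleaned_give = np.where(np.isnan(GIVE), GET.T, GIVE)

print(cleaned_get)
print(cleaned_give)
```

Here actor 1's imputed GET row comes from what actors 0 and 2 reported in GIVE about actor 1, and the imputed GIVE row comes from what they reported in GET. The diagonal cell for actor 1 remains missing in both matrices.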