
How To Handle Necessarily Missing Data

Posted: Mon Oct 13, 2014 9:00 am
by Qbert123
Many discussions of missing data deal with various methods of imputation, like mean values or EM. But in some cases the data will be missing as a necessary consequence of the data generation process.

For instance, let's say I'm trying to predict students' grades, and one of the inputs I want to analyze is the average grades of the student's siblings. If a particular student is an only child, then that value will be missing, not because we failed to collect the data, but because logically there is no data to collect. This is distinct from cases where the student has siblings, but we can't find their grades.

Other examples abound: say we're in college admissions and we want to include students' AP exam results, but not all students took AP exams. Or we're looking at social network data, but not all subjects have Facebook and/or Twitter accounts.

These data are missing, but they're certainly not missing at random. How have people dealt with this in the past? This must be a solved problem in econometrics, but I can't find a good reference.

Re: How To Handle Necessarily Missing Data

Posted: Mon Oct 13, 2014 9:26 am
by startz
One method, which may not be perfect, is to include a dummy D for the missing X and then in place of X in the equation use the interaction D*X. In other words, instead of

Code:

ls y c x
use

Code:

ls y c D D*X
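
For readers who want to try the idea outside EViews, here is a minimal Python sketch of one reading of the suggestion above. It assumes D = 1 when X exists and D = 0 when it does not, and that the interaction D*X is set to 0 wherever X is absent, so those rows are not dropped from the regression. The function name, the synthetic data, and the coefficient values are all made up for illustration; none of them come from the thread.

```python
import numpy as np

def fit_with_missing_indicator(y, x):
    """OLS of y on [1, D, D*X]; D = 1 where x is observed, 0 where it is NaN."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    d = (~np.isnan(x)).astype(float)       # D: indicator that X exists
    dx = np.where(np.isnan(x), 0.0, x)     # D*X: X where observed, 0 otherwise
    design = np.column_stack([np.ones_like(y), d, dx])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # [intercept, shift when X exists, slope on X]

# Synthetic illustration (assumed numbers): only children have no sibling average.
rng = np.random.default_rng(0)
n = 500
has_siblings = rng.random(n) < 0.7
sibling_avg = np.where(has_siblings, rng.normal(3.0, 0.5, n), np.nan)
noise = rng.normal(0.0, 0.1, n)
# Only children: mean grade 2.5. Students with siblings: 1.0 + 0.6 * sibling_avg.
grade = np.where(has_siblings, 1.0 + 0.6 * sibling_avg, 2.5) + noise

coef = fit_with_missing_indicator(grade, sibling_avg)
print(coef)
```

Under these assumptions the intercept is the mean outcome for cases where X does not exist, the coefficient on D is the shift in level for cases where it does, and the coefficient on D*X is the slope among those cases. Because X never enters on its own, the value used to fill the absent entries does not affect the fit.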

Re: How To Handle Necessarily Missing Data

Posted: Tue Oct 14, 2014 6:13 am
by Carlo Lazzaro
Qbert123 raises an interesting issue. However, as far as her/his examples are concerned, the risk of non-ignorable missing values can be avoided (or at least reduced) by fine-tuning the study's inclusion criteria or improving the questionnaire items administered to participants.

Kind regards,
Carlo

Re: How To Handle Necessarily Missing Data

Posted: Tue Oct 14, 2014 6:39 am
by Qbert123
Thanks to Startz; I think the interactive term added *without* including the original variable itself is the trick I was looking for. EM algorithm approaches would be the other way to do it.

Carlo, I don't understand your comment. In the cases I'm talking about, no survey is going to pick up the AP scores of students who don't take AP tests. These data are *necessarily* missing, so we have to find a way to deal with them.

Re: How To Handle Necessarily Missing Data

Posted: Fri Apr 24, 2015 12:41 am
by Carlo Lazzaro
Qbert123: my previous reply referred to a situation in which the survey had not yet started.
Now I see that your query had a different flavour.
An interesting textbook covering this (and related issues in dealing with missing values) is: Van Buuren, S. (2012), Flexible Imputation of Missing Data. Chapman & Hall/CRC, Boca Raton, FL. ISBN 9781439868249.

Re: How To Handle Necessarily Missing Data

Posted: Sun Apr 26, 2015 9:23 pm
by Qbert123
Carlo -- thanks very much for the tip. I'll look up the Van Buuren book.