How to Handle Missing Data: Implications of Deletion Instead of Deductive Imputation

Coophathaway
4 min read · Jan 20, 2021

In this post, I would like to explore the possible consequences of deleting or removing data because of missing values, and how this can slant the meaning and validity of the findings and relationships drawn from the “cleaned” data set. I will also delve into alternatives to deletion, including deductive imputation, mean/median/mode imputation, stochastic imputation, regression imputation, and multiple stochastic regression imputation.

The fundamental point I would like to make here is that it is all too easy to find columns with missing values and simply remove the rows that contain them. I think it is important to consider why there might be missing values in the first place. No data set is perfect; that is a reality we all must accept. However, when a data set contains a significant amount of missing values, it is important to think about why, and who is accountable for them.

It is quite possible that the person collecting the data simply forgot, or was lazy and didn’t want to ask the last couple of questions on the survey.

It is also possible that the surveyor was tired at the end of the day, so the group surveyed last was never asked the final questions and could not provide answers to them. All data collected from this end-of-day group could then be removed and never considered, all because of a couple of missing values at the end of the survey.

Even further, certain survey questions might be more attractive for some groups of people to answer than for others. Missing values resulting from discretionary survey questions could create numerous gaps in a data set.

You may have one series of data that is completely intact and has all the values you need to reach findings and assumptions, while another series has missing values.

Simply removing all rows with missing values would also discard data from the other series where the data is fully intact, potentially distorting the data set and the learnings and assumptions that come out of it.

Doing so would remove a lot of valid data and provide no insight into why that data is missing. Disqualifying a person’s entire entry because of one missing value in the last column of the data set only makes sense in certain instances. A data scientist must first ask: is that column even relevant to the relationship and analysis one is trying to extract?
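To make this concrete, here is a minimal sketch, assuming pandas and a made-up survey data-frame (the column names are hypothetical), of how listwise deletion throws away intact values in other columns:

```python
import pandas as pd
import numpy as np

# Hypothetical survey data: "age" and "score" are fully intact,
# but the last question ("comments") was often skipped.
df = pd.DataFrame({
    "age": [25, 31, 47, 52],
    "score": [88, 92, 75, 81],
    "comments": ["good", np.nan, np.nan, "fine"],
})

# Listwise deletion drops every row with ANY missing value --
# here we lose half the rows, including their valid age/score data.
dropped = df.dropna()
print(len(df), len(dropped))  # 4 2
```

Two perfectly valid age/score observations disappear, purely because an incidental free-text column was left blank.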

It is important to extract from a data-frame the series with meaningful information, and to leave behind the series that are inconsequential.

Consider a hypothetical feature named “Is_Interested”, where the entries can only be yes or no. Say this column gets one-hot encoded, where yes = 1 and no = 0. It is entirely possible that the relationships and findings a data scientist is after would not consider the series “Is_Interested” relevant or important, so why should its missing values be considered? Missing values in this column could trigger the removal of entire entries or rows, compromising otherwise complete data. One should leave this theoretical series of data behind.
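A quick sketch of that idea, assuming pandas and the hypothetical “Is_Interested” column from above: dropping the irrelevant column first means its missing values can no longer disqualify whole rows.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, 31, 47],
    "score": [88, 92, 75],
    "Is_Interested": [1, np.nan, 0],  # one-hot encoded: yes = 1, no = 0
})

# "Is_Interested" is irrelevant to this analysis, so drop the column
# before handling missing values -- its NaN can no longer cost us a row.
relevant = df.drop(columns=["Is_Interested"])
print(len(df.dropna()), len(relevant.dropna()))  # 2 3
```

Deleting rows first would have cost one complete entry; selecting the meaningful series first keeps all three.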

Once the meaningful and relevant series in a data-frame have been assigned to a new variable, we can move on to addressing the missing values in each of them.

Keeping the integrity of the data set we are working with is quite important. This is why I would like to make the argument that, whenever possible and when the circumstances allow for it, one should use imputation to resolve missing values instead of simply deleting or removing them.

As mentioned above, removing rows with missing values also removes valuable data from other series where the data is fully intact. This can distort the data set.

Imputation substitutes these missing values with neutral placeholders. The other intact data in the row remains, while the missing value that would cause computational problems is replaced with one that allows series and data-set computations to proceed. Properly conducted, I believe this is a much better way to resolve missing values.
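As a minimal illustration, assuming pandas and a made-up score series, mean imputation fills each gap with the column mean so the rest of the row can stay in the data set:

```python
import pandas as pd
import numpy as np

scores = pd.Series([88.0, np.nan, 75.0, 81.0, np.nan])

# Mean imputation: replace each missing value with the mean of the
# observed values -- a neutral placeholder that keeps computations working.
filled = scores.fillna(scores.mean())
print(int(filled.isna().sum()))  # 0
print(round(filled.mean(), 3))   # 81.333 (the observed mean is preserved)
```

Note that the mean is only one choice; the median or mode works the same way via `scores.fillna(scores.median())` and is often preferable for skewed data.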

Types of Imputation:

1. Deductive imputation
2. Mean/median/mode imputation
3. Stochastic imputation
4. Regression imputation
5. Multiple stochastic regression imputation
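As a small preview, here is one possible sketch of regression imputation and a stochastic variant of it, using only NumPy on made-up toy data. The exact procedures vary; this is just an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends roughly linearly on x; some y values are missing.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, np.nan, 8.2, np.nan, 12.1])
missing = np.isnan(y)

# Regression imputation: fit y ~ x on the complete cases,
# then predict the missing values from the fitted line.
slope, intercept = np.polyfit(x[~missing], y[~missing], 1)
y_reg = y.copy()
y_reg[missing] = slope * x[missing] + intercept

# Stochastic regression imputation: add random residual noise to the
# predictions so the imputed values do not understate the variance.
resid_sd = np.std(y[~missing] - (slope * x[~missing] + intercept))
y_stoch = y.copy()
y_stoch[missing] = y_reg[missing] + rng.normal(0, resid_sd, missing.sum())

print(int(np.isnan(y_reg).sum()), int(np.isnan(y_stoch).sum()))  # 0 0
```

Repeating the stochastic step several times with different random draws, and pooling the results, is the core idea behind multiple imputation.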

I will explore each of these types of imputation further in my next blog post, so stay tuned!
