How to Handle Missing Values in a Dataset?

Missing values are very common in real-world datasets. Taking action on such missing values is unavoidable, because missing data causes problems for analysis and modelling.

Let's take a sample dataset.

ID	F1	F2	F3	F4	class label
1	12	6	1	89	0
2	9	8	0	76	1
3	14	11	None	78	1
4	10	None	2	90	0
5	11	9	3	66	1
6	8	13	5	90	0
7	12	10	1	72	1

The sample dataset contains 7 data points, 4 features, and 1 class label.
A cell is represented as (row_id, column_id). Features F1, F2, F3, F4 correspond to column ids 1, 2, 3, 4 respectively.
It is assumed that the data is pre-processed and all the missing values are represented by None.
As we can see, the values for the cells (3,3) and (4,2) are not given. (None means an empty value.)

Here I discuss some techniques to handle missing values.

1. Remove rows (data points) with missing values

This is the simplest strategy to handle missing values. In the sample data, rows 3 and 4 would be removed, leaving 5 data points.
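The row-removal strategy can be sketched in a few lines of plain Python. This is only an illustration on the sample data; the names `rows`, `complete_rows`, and `kept_ids` are made up for this example.

```python
# Sample data from the table above; None marks a missing value.
# Columns: ID, F1, F2, F3, F4, class label
rows = [
    (1, 12, 6, 1, 89, 0),
    (2, 9, 8, 0, 76, 1),
    (3, 14, 11, None, 78, 1),
    (4, 10, None, 2, 90, 0),
    (5, 11, 9, 3, 66, 1),
    (6, 8, 13, 5, 90, 0),
    (7, 12, 10, 1, 72, 1),
]

# Keep only the rows in which no cell is None.
complete_rows = [row for row in rows if None not in row]

kept_ids = [row[0] for row in complete_rows]
print(kept_ids)  # [1, 2, 5, 6, 7]
```

Note that 2 of the 7 data points are lost this way, which is the price of this simple approach.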

2. Imputation (replace the missing values)

Replace the missing values with the mean (average), the median, or the mode (most frequently occurring value) of the feature.

Example:

If we choose to replace the missing value with the mean, then the value of the cell (3,3) becomes 2. (The average of 1, 0, 2, 3, 5, 1 is 2.)
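As a minimal sketch, the three fill values for F3 can be computed with Python's standard `statistics` module. The variable names are assumptions made for this example, and the mean is used for the actual replacement.

```python
from statistics import mean, median, mode

# F3 column of the sample data, in ID order; None marks the missing cell (3,3).
f3 = [1, 0, None, 2, 3, 5, 1]

# Only the observed values take part in the calculation.
observed = [v for v in f3 if v is not None]

fill_mean = mean(observed)      # 12 / 6 = 2
fill_median = median(observed)  # middle of 0, 1, 1, 2, 3, 5 is 1.5
fill_mode = mode(observed)      # 1 is the only value that occurs twice

# Replace the missing cell with the mean.
f3_imputed = [fill_mean if v is None else v for v in f3]
```

Here the mean, median, and mode give three different fill values (2, 1.5, and 1), so the choice between them matters even on a tiny dataset.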
Impute based on class labels.

Instead of using all the data points to calculate the mean, median, or mode, consider only the data points whose class label matches the class label of the data point with the missing value.

Example:

(3,3) is a missing value. The class label of the 3rd row is 1. The other rows with class label 1 are 2, 5, and 7, and their F3 values are 0, 3, and 1.

If we choose to replace the missing value with the mean, then the value of the cell (3,3) becomes 1. (The average of 0, 3, 1 is about 1.33, which we round to 1.)
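The class-based variant can be sketched as follows. In the sample data, rows 2, 5, and 7 share class label 1 with row 3 and have F3 values 0, 3, and 1. The function name `class_mean_impute` and the record layout are assumptions made for this example.

```python
from statistics import mean

# (ID, F3, class label) taken from the sample data; None marks the missing cell.
records = [
    (1, 1, 0), (2, 0, 1), (3, None, 1), (4, 2, 0),
    (5, 3, 1), (6, 5, 0), (7, 1, 1),
]

def class_mean_impute(records):
    """Fill each missing F3 with the mean F3 of the rows sharing its class label."""
    by_class = {}
    for _, f3, label in records:
        if f3 is not None:
            by_class.setdefault(label, []).append(f3)
    class_means = {label: mean(vals) for label, vals in by_class.items()}
    return [
        (rid, class_means[label] if f3 is None else f3, label)
        for rid, f3, label in records
    ]

imputed = class_mean_impute(records)
filled = imputed[2][1]   # mean of 0, 3, 1, about 1.33
print(round(filled))     # 1
```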

3. Source of information

Missing values can also be a source of information.

Example: Let's say we have collected the age, sex, weight, height, and hair color of all the people in a village. Here hair color is an interesting attribute. Some people might have grey hair, black hair, white hair, etc. But what about the people who have no hair at all? These people might leave the hair color field empty.

So if the hair color field is empty, it tells us that the person has no hair, and that could be important information to note.

So we store this information in separate fields, as shown below.

Let's take a subset of the sample data.

ID	F1	F2	F3	F4	class label
2	9	8	0	76	1
3	14	11	None	78	1
4	10	None	2	90	0

Imputed data

ID	F1	F2	F3	F4	class label
2	9	8	0	76	1
3	14	11	1	78	1
4	10	10	2	90	0

We use binary missing-value features to indicate which data values were missing: 1 if the value is missing, else 0.

ID	F2	F3
2	0	0
3	0	1
4	1	0
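A minimal sketch of building these indicator features from the subset before imputation; the names `subset` and `indicators` are made up for this example.

```python
# Subset of the sample data as (ID, F2, F3); None marks a missing value.
subset = [
    (2, 8, 0),
    (3, 11, None),
    (4, None, 2),
]

# One binary flag per feature: 1 if the value is missing, else 0.
indicators = [
    (rid, int(f2 is None), int(f3 is None))
    for rid, f2, f3 in subset
]
print(indicators)  # [(2, 0, 0), (3, 0, 1), (4, 1, 0)]
```

The flags have to be computed before the missing cells are filled in, otherwise the information about which cells were empty is lost.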

4. Model based Imputation

In this technique, we treat the feature that has missing values for some data points as the class label, and we predict the missing values using an algorithm such as K-NN.

Example: Let's assume we got the value 1 for the cell (3,3) (by using the impute based on class labels technique). Now the sample data looks like this.

ID	F1	F2	F3	F4	class label
1	12	6	1	89	0
2	9	8	0	76	1
3	14	11	1 (Imputed)	78	1
4	10	None	2	90	0
5	11	9	3	66	1
6	8	13	5	90	0
7	12	10	1	72	1

As we can see, F2 is the feature whose value is missing at the cell (4,2). So F2 becomes the class label, and the original class label becomes an ordinary feature. Now we need to predict the value for the cell (4,2). The modified dataset looks as shown below.

ID	F1	F (earlier *class label*)	F3	F4	class label (earlier F2)
1	12	0	1	89	6
2	9	1	0	76	8
3	14	1	1	78	11
5	11	1	3	66	9
6	8	0	5	90	13
7	12	1	1	72	10
4	10	0	2	90	None

Now, using an algorithm like K-NN, we can predict the missing value.
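Below is a small sketch of this prediction using a hand-written K-NN regressor on the modified table. The choices of k = 3, plain Euclidean distance, and unscaled features are assumptions made for this example; in practice the features would usually be scaled first, since F4 has the largest values and dominates the distance here.

```python
import math

# Training rows: (F1, former class label, F3, F4) -> F2, from the modified table.
train = [
    ((12, 0, 1, 89), 6),
    ((9, 1, 0, 76), 8),
    ((14, 1, 1, 78), 11),
    ((11, 1, 3, 66), 9),
    ((8, 0, 5, 90), 13),
    ((12, 1, 1, 72), 10),
]
query = (10, 0, 2, 90)  # row 4, whose F2 is missing

def knn_predict(train, query, k=3):
    """Average the targets of the k training rows closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return sum(target for _, target in nearest) / k

print(knn_predict(train, query))  # 10.0
```

With these settings the three nearest rows are 1, 6, and 3, with F2 values 6, 13, and 11, so the predicted value for the cell (4,2) is 10.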

5. Removal of feature

If a feature contains too many missing values, it is better to remove the feature, because the few values that remain carry too little information to add any value.
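One way to make "too many" concrete is a threshold on the fraction of missing values. The sketch below is only an illustration: the 50% threshold, the function name `drop_sparse_features`, and the column "F5" are all invented for this example and are not part of the sample data.

```python
# F2 and F3 come from the sample data; "F5" is a hypothetical, mostly-empty feature.
columns = {
    "F2": [6, 8, 11, None, 9, 13, 10],
    "F3": [1, 0, None, 2, 3, 5, 1],
    "F5": [None, None, 4, None, None, None, 7],
}

def drop_sparse_features(columns, max_missing=0.5):
    """Keep only features whose fraction of missing values is at most max_missing."""
    kept = {}
    for name, values in columns.items():
        missing_fraction = sum(v is None for v in values) / len(values)
        if missing_fraction <= max_missing:
            kept[name] = values
    return kept

print(list(drop_sparse_features(columns)))  # ['F2', 'F3']
```

The right threshold depends on the dataset and on how important the feature is, so it should be treated as a tunable choice.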

Please contact me regarding any queries or suggestions