In real-life data sets, we often find a considerable number of missing values, and these can lead our data analysis and data modelling in the wrong direction. In general, a missing value is a variable for which no record or data point was stored during observation or data gathering. The picture below shows what missing values look like:

In the image above, the cells marked NaN are missing values. In this quick guide, we cover the following topics:

Types of missing values
Imputation techniques
Packages for imputations

Let’s start by understanding the types of missing data.

Types of missing values

In general, there are three basic types of missing data.

Missing completely at random (MCAR)

As the name suggests, the values go missing purely at random: the missingness is unrelated to any values in the data, whether in observed or unobserved features.

Examples:

Survey participants who accidentally skip a question.
Laboratory readings that are accidentally left unrecorded during an experiment.

Missing at random (MAR)

The missingness in a feature is not related to the values of that feature itself but to other observed features in the data.

Examples:

In a survey, women may be less likely to disclose their age. The missing values then depend not on the age feature itself but on the gender feature.
In a production line's records, values that are missing for one component because a different component failed fall under this category.

However, MAR data can be treated as MCAR once we condition on (control for) the observed feature that drives the missingness.

Missing not at random (MNAR)

If the missingness in a feature is related to the values of the feature itself, the data is MNAR. This type of missing value is the most challenging to handle.

Examples:

Participants who are unemployed are more likely to leave a survey question about unemployment unanswered.
In a survey, participants refuse to disclose their pay scale because of how high or low it is.
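
The three mechanisms can be illustrated with a small simulation. The sketch below uses made-up survey data (the gender and income columns, the missingness rates and the 65,000 threshold are all assumptions chosen for illustration) and removes income values under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey data: gender (0 = man, 1 = woman) and income.
gender = rng.integers(0, 2, size=n)
income = rng.normal(50_000, 15_000, size=n)

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance that income is missing depends only on the observed gender column.
mar_mask = rng.random(n) < np.where(gender == 1, 0.4, 0.1)

# MNAR: the chance that income is missing depends on the income value itself
# (here, high earners withhold it more often).
mnar_mask = rng.random(n) < np.where(income > 65_000, 0.6, 0.1)

income_mcar = np.where(mcar_mask, np.nan, income)
income_mar = np.where(mar_mask, np.nan, income)
income_mnar = np.where(mnar_mask, np.nan, income)

print("true mean:         ", income.mean())
print("observed mean MCAR:", np.nanmean(income_mcar))
print("observed mean MAR: ", np.nanmean(income_mar))
print("observed mean MNAR:", np.nanmean(income_mnar))
```

In this setup the mean of the observed MNAR incomes is pulled below the true mean, because the values that go missing are disproportionately the large ones. The MCAR mean stays close to the true mean. The MAR mean does too here, but only because income was generated independently of gender in this toy data.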

Imputation techniques

There are four main methods of handling missing data in datasets:

Delete the instances (rows) that contain missing values.
Impute the missing data.
Apply models such as XGBoost that can handle missing data natively.
Use model-based approaches, such as maximum likelihood estimation, to handle the missing data.

Some standard statistical and model-based methods for imputing missing data are:

Mean/Median imputation: replace the missing values with the mean or median of the feature. This applies when the feature consists of continuous values.
K-nearest neighbour: find the k observations most similar to the one with a missing value and use their values to fill it in. Note that this is a neighbour search, not a clustering process.
Multivariate Imputation by Chained Equations (MICE): in this method, we use machine learning models and the other features in the data to predict replacement values for the missing entries in a particular feature, cycling through the features one at a time.
Iterative regression imputation: train a regression model for each feature with missing values, using the remaining features as predictors.
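
The first option, deletion, can be sketched with pandas. The DataFrame and its column names below are hypothetical and exist only to show the calls.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry in each column.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "income": [52_000, 48_000, np.nan, 61_000],
    "city": ["Pune", "Dublin", "Austin", None],
})

# Count the missing values per column before deciding what to delete.
missing_per_column = df.isna().sum()

# Listwise deletion: keep only the rows with no missing value at all.
complete_rows = df.dropna()

# Targeted deletion: drop a row only if a specific feature is missing.
rows_with_age = df.dropna(subset=["age"])

print(missing_per_column)
print(len(df), len(complete_rows), len(rows_with_age))
```

Listwise deletion is simple but can discard most of a data set when the missing entries are spread across many rows, which is one reason to consider imputation instead.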

The picture below summarizes these methods.

Packages for imputations

There are various Python packages for dealing with missing values. Let’s take a look at them.

Mean/Median/Mode imputation

sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(incomplete_feature)

Note: by changing only the strategy parameter ('mean', 'median', or 'most_frequent' for the mode), we can choose the imputation method.
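
A self-contained version of the call above might look like the following sketch. The small array X is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with one missing value in each column.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 60.0]])

# Each imputer learns one fill value per column from the observed entries.
mean_imputed = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(X)
median_imputed = SimpleImputer(strategy="median").fit_transform(X)
mode_imputed = SimpleImputer(strategy="most_frequent").fit_transform(X)

print(mean_imputed)
print(median_imputed)
print(mode_imputed)
```

The second column shows why the choice matters: its observed values are 10, 20 and 60, so the mean fill is 30 while the median fill is 20.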

k-nearest neighbours (kNN) imputation

fancyimpute.KNN(k=n).fit_transform(incomplete_feature)

Note: the k parameter is the number of nearest neighbours used to fill each missing value, and it must be given as an integer.
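
As a runnable sketch of the same idea, the example below uses scikit-learn's KNNImputer rather than fancyimpute.KNN. This is a swapped-in class, not the call shown above, and its n_neighbors parameter plays the role of k. The array X is made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix with two missing entries.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing entry is replaced by the average of that feature over the
# 2 nearest rows, with distances computed on the features both rows share.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```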

Matrix factorization (MF) imputation

fancyimpute.MatrixFactorization().fit_transform(XY_incomplete)
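
To show what matrix factorization imputation does in principle, here is a from-scratch NumPy sketch. It is not fancyimpute's implementation: the function name mf_impute and all of its hyperparameters (rank, n_iters, lr, reg) are assumptions made for this illustration.

```python
import numpy as np

def mf_impute(X, rank=2, n_iters=2000, lr=0.01, reg=0.1, seed=0):
    """Fill NaNs in X with a low-rank approximation U @ V.T that is
    fitted by gradient descent on the observed entries only."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(X)

    # Standardize with statistics of the observed entries so that a
    # single step size works reasonably; missing cells are set to 0.
    mean, std = np.nanmean(X), np.nanstd(X)
    Z = np.where(observed, (X - mean) / std, 0.0)

    U = 0.1 * rng.standard_normal((X.shape[0], rank))
    V = 0.1 * rng.standard_normal((X.shape[1], rank))

    for _ in range(n_iters):
        # The error is masked so that missing cells do not affect the fit.
        residual = observed * (U @ V.T - Z)
        grad_U = residual @ V + reg * U
        grad_V = residual.T @ U + reg * V
        U -= lr * grad_U
        V -= lr * grad_V

    approx = (U @ V.T) * std + mean
    # Keep observed values as they are; use the approximation elsewhere.
    return np.where(observed, X, approx)

# Toy rank-1 matrix with two entries removed.
X_true = np.outer(np.arange(1.0, 6.0), np.array([1.0, 2.0, 3.0]))
X = X_true.copy()
X[0, 2] = np.nan
X[3, 1] = np.nan

X_filled = mf_impute(X)
print(X_filled)
```

The design choice to highlight is the mask: the factors U and V are fitted only to the cells that were actually observed, and the product U @ V.T then supplies estimates for the cells that were not.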

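Multivariate imputation (Regularized Linear Regression)

sklearn.impute.IterativeImputer().fit_transform(incomplete_feature)

Note: IterativeImputer is still experimental in scikit-learn, so we must run from sklearn.experimental import enable_iterative_imputer before importing it. Its default estimator is BayesianRidge, a regularized linear regression.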
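
A minimal runnable sketch of this call follows. The toy array X, the max_iter value and the random_state are assumptions for illustration. The experimental import is required before IterativeImputer can be imported.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data in which the second column is roughly twice the first.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, np.nan],
              [np.nan, 10.0]])

# With no estimator given, each feature is regressed on the others
# using the default BayesianRidge model.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

print(X_filled)
```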

Multivariate imputation (Random Forest Regression)

sklearn.impute.IterativeImputer(estimator=RandomForestRegressor()).fit_transform(incomplete_feature)
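
The random forest variant can be sketched in the same way. The synthetic data, the 15% missingness rate and the small n_estimators and max_iter values below are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data: the third feature depends non-linearly on the first.
X_full = rng.normal(size=(60, 3))
X_full[:, 2] = X_full[:, 0] ** 2 + 0.1 * rng.normal(size=60)

# Remove roughly 15% of the entries at random.
X = X_full.copy()
X[rng.random(X.shape) < 0.15] = np.nan

# Pass a random forest as the per-feature regression model.
forest = RandomForestRegressor(n_estimators=20, random_state=0)
imputer = IterativeImputer(estimator=forest, max_iter=5, random_state=0)
X_filled = imputer.fit_transform(X)

print(np.isnan(X).sum(), "missing before,", np.isnan(X_filled).sum(), "after")
```

A tree-based estimator is a reasonable choice when features are related non-linearly, as in this toy data, at the cost of longer fitting time than the default linear model.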

Final words

In this quick guide, we briefly defined the basic types of missing values that can harm data modelling in real-life projects. The second section covered the different treatments these missing values require, and the third section listed some Python packages that help apply those treatments.
