A Quick Guide to Deal with Missing Data

  • Types of missing values
  • Imputation techniques
  • Packages for imputations

Types of missing values

In general, there are three basic types of missing data.

Missing completely at random (MCAR)

As the name suggests, values go missing purely at random: the probability that a value is missing does not depend on any value in the data, observed or unobserved.

  • Readings accidentally left unrecorded in a laboratory during an experiment.

Missing at random (MAR)

Whether a value is missing in a feature is not related to that feature itself but to other, observed features in the data.

  • In a production line's records, values that are missing for one component because a different component failed fall into this category.

Missing not at random (MNAR)

If whether a value is missing in a feature depends on the value of that feature itself, the data are MNAR. This type of missing value is the most challenging to handle.

  • In a survey, participants refuse to disclose their pay scale.
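The three mechanisms above can be made concrete with a small simulation. The following is a minimal sketch, assuming made-up `age` and `income` columns and arbitrary missingness probabilities; none of these names or numbers come from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
age = rng.normal(40, 10, size=n)              # hypothetical fully observed feature
income = rng.normal(50_000, 15_000, size=n)   # hypothetical feature that loses values

# MCAR: every income has the same 10% chance of being missing.
mcar_mask = rng.random(n) < 0.10

# MAR: the chance of a missing income depends only on the observed age.
mar_mask = rng.random(n) < np.where(age > 45, 0.30, 0.05)

# MNAR: the chance of a missing income depends on the income itself.
mnar_mask = rng.random(n) < np.where(income > 60_000, 0.40, 0.05)

income_mcar = np.where(mcar_mask, np.nan, income)
income_mar = np.where(mar_mask, np.nan, income)
income_mnar = np.where(mnar_mask, np.nan, income)

# Compare the mean of the observed values with the mean of the complete column.
print("complete:", income.mean())
print("MCAR    :", np.nanmean(income_mcar))
print("MNAR    :", np.nanmean(income_mnar))
```

In this setup the MNAR column drops high incomes more often, so the mean of its observed values underestimates the true mean, which is one reason MNAR is the hardest case to handle.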

Imputation techniques

The main approaches to handling missing data in datasets include:

  • Imputing the missing data.
  • Applying models, such as XGBoost, that accept missing values in their input.
  • Using model-based methods, such as maximum likelihood estimation, to impute the missing data.
  • K-nearest neighbours: a distance-based method that finds the rows most similar to the incomplete one and uses their values to fill in the missing entries.
  • Multivariate Imputation by Chained Equations (MICE): in this method, we use machine learning models and the other features in the data to predict replacement values for the missing entries of a particular feature.
  • Iterative regression imputation: train a regression model for each feature with missing values, using the other features as predictors.

Packages for imputations

There are various Python packages for dealing with missing values. Let's take a look at some of them.

Mean/Median/Mode imputation

sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(incomplete_feature)  # strategy can also be 'median' or 'most_frequent' (mode)
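As a minimal sketch of the call above, the following applies the three strategies to a small made-up array. The array and the variable names are assumptions for illustration. Note that scikit-learn names the mode strategy `most_frequent`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with one missing value per column.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 3.0]])

# Each strategy replaces a NaN with a statistic of its own column.
X_mean = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(X)
X_median = SimpleImputer(missing_values=np.nan, strategy="median").fit_transform(X)
X_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent").fit_transform(X)

print(X_mean)
print(X_median)
print(X_mode)
```

Here the first column's missing entry becomes 4.0 under the mean strategy, the average of 1, 7 and 4.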

k-nearest neighbours (kNN) imputation

fancyimpute.KNN(k=n).fit_transform(incomplete_feature)
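For readers without fancyimpute installed, the same idea can be sketched with scikit-learn's `KNNImputer`. This is a different implementation from the fancyimpute call above, not a drop-in equivalent, and the toy array is an assumption for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: two rows each have one missing value.
X_knn_incomplete = np.array([[1.0, 2.0, np.nan],
                             [3.0, 4.0, 3.0],
                             [np.nan, 6.0, 5.0],
                             [8.0, 8.0, 7.0]])

# Each missing value is replaced by the mean of that feature over the
# 2 nearest rows, with distances computed on the features both rows have.
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X_knn_incomplete)
print(X_knn)
```

The choice of `n_neighbors` plays the same role as `k` in the fancyimpute call: small values follow local structure closely, larger ones average over more rows.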

Matrix factorization (MF) imputation

fancyimpute.MatrixFactorization().fit_transform(XY_incomplete)
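The idea behind matrix-factorization imputation is that the data matrix is approximately low rank, so missing entries can be read off a low-rank reconstruction. The sketch below shows this with an iterative truncated SVD in plain NumPy. It is a simplified stand-in and not the algorithm fancyimpute implements. The function `low_rank_impute` and the toy matrix are hypothetical.

```python
import numpy as np

def low_rank_impute(X, rank=1, n_iter=100):
    """Fill NaNs by repeatedly projecting onto a rank-`rank` approximation."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from the column means of the observed entries.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Observed entries stay fixed; only the missing ones are updated.
        filled[missing] = approx[missing]
    return filled

# Rank-1 toy matrix (an outer product) with two entries removed.
full = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
XY_incomplete = full.copy()
XY_incomplete[1, 2] = np.nan
XY_incomplete[3, 0] = np.nan

XY_filled = low_rank_impute(XY_incomplete, rank=1)
print(XY_filled)
```

The `rank` argument is the main design choice: too low and the reconstruction underfits the observed entries, too high and the filled-in values simply echo the initial guess.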

Multivariate imputation (regularized linear regression)

sklearn.impute.IterativeImputer().fit_transform(incomplete_feature)  # first run: from sklearn.experimental import enable_iterative_imputer
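A minimal sketch of the call above on synthetic data follows. `IterativeImputer` is still marked experimental in scikit-learn, so it has to be enabled with an explicit import before it can be used. Its default estimator is `BayesianRidge`, a regularized linear regression. The two-column data here is an assumption for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Two linearly related columns; every 10th value of the second is removed.
x1 = rng.normal(size=100)
X_complete = np.column_stack([x1, 2.0 * x1 + rng.normal(scale=0.1, size=100)])
X_incomplete = X_complete.copy()
X_incomplete[::10, 1] = np.nan

# Each feature with missing values is regressed on the others, in rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_incomplete)

# Largest absolute error on the entries that were removed.
print(np.abs(X_imputed[::10, 1] - X_complete[::10, 1]).max())
```

Because the second column is almost a linear function of the first, a linear estimator recovers the removed values closely here; on data with non-linear relationships that will not hold.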

Multivariate imputation (random forest regression)

sklearn.impute.IterativeImputer(estimator=RandomForestRegressor()).fit_transform(incomplete_feature)
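Swapping in a random forest lets the imputer capture non-linear relationships between features. The sketch below assumes a synthetic sine-shaped relationship and small, arbitrary settings (`n_estimators=20`, `max_iter=5`) chosen only to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Second column is a non-linear function of the first.
x1 = rng.uniform(-3.0, 3.0, size=200)
X_full = np.column_stack([x1, np.sin(x1)])
X_rf_incomplete = X_full.copy()
X_rf_incomplete[::8, 1] = np.nan

rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)
X_rf_imputed = rf_imputer.fit_transform(X_rf_incomplete)

# Mean absolute error on the entries that were removed.
print(np.abs(X_rf_imputed[::8, 1] - X_full[::8, 1]).mean())
```

The trade-off is runtime: a forest is refitted for each feature with missing values in every round, so this is noticeably slower than the default linear estimator on large data.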

Final words

In this quick guide, we briefly defined the three basic types of missing values that can harm data modelling in practice. These missing values call for different treatments, which we covered in the second section, and the third section listed some Python packages that help apply those treatments.