How To Handle Missing Data In Data Preprocessing | Machine Learning Tutorials (2021)

Introduction:

In this tutorial we will learn how to handle missing data during data preprocessing. Let's get started.

Data Preprocessing Tools:

  •      Importing the libraries.
  •      Importing the dataset.
  •      Taking care of missing data.
  •      Encoding categorical data.
  •      Splitting the dataset into the training set and test set.
  •      Feature scaling.


Libraries play an important role in any Machine Learning work.

We will use the following libraries in almost every Machine Learning model.

  •       NumPy
  •       Pandas
  •       Matplotlib

Download the dataset Data.csv

To handle missing data, we import the SimpleImputer class from the impute module of scikit-learn (sklearn):

from sklearn.impute import SimpleImputer

Python code:

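  1. Importing the libraries

         import numpy as np
         import matplotlib.pyplot as plt
         import pandas as pd

  2. Importing the dataset

         # Load the CSV file into a DataFrame
         dataset = pd.read_csv('Data.csv')
         # X: every column except the last one (the features)
         X = dataset.iloc[:, :-1].values
         # y: only the last column (the target)
         y = dataset.iloc[:, -1].values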




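  3. Taking care of missing data

         from sklearn.impute import SimpleImputer

         # Replace each NaN with the mean of its column
         imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
         # Learn the means of columns 1 and 2 (the slice 1:3 excludes index 3)
         imputer.fit(X[:, 1:3])
         # Write the learned means into the missing entries of those columns
         X[:, 1:3] = imputer.transform(X[:, 1:3])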



STEP 1: Importing the libraries. As in the code above, we import NumPy, Matplotlib and pandas.

STEP 2: Importing the dataset. We load our dataset from Data.csv and split it into X and y.

X represents the matrix of independent variables (the features) and y represents the vector of the dependent variable (the target).
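To make the split into X and y concrete, here is a minimal sketch on a tiny in-memory table. The column names and values are made up for illustration and are not taken from Data.csv; only the two iloc lines match the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Data.csv: three feature columns and one target column.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, np.nan],
    "Salary": [72000.0, np.nan, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X = dataset.iloc[:, :-1].values  # every column except the last one
y = dataset.iloc[:, -1].values   # only the last column

print(X.shape)  # (3, 3): three rows, three feature columns
print(y.shape)  # (3,): one target value per row
```

Note that this split assumes the target is the last column of the file; if it sits elsewhere, the slices have to be adjusted.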

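STEP 3: Taking care of missing data, using the impute module from the scikit-learn (sklearn) package. With strategy='mean', SimpleImputer computes the mean of each selected column and writes it into that column's missing entries.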
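The effect of mean imputation is easier to see on a small array. The sketch below uses made-up numbers (not the contents of Data.csv) with one gap per column, and the variable names values and filled are chosen only for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric columns (think of age and salary) with one missing entry each.
values = np.array([
    [40.0, 70000.0],
    [30.0, np.nan],
    [np.nan, 50000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(values)                 # learns one mean per column, ignoring NaN
filled = imputer.transform(values)  # replaces each NaN with its column mean

print(imputer.statistics_)  # the learned column means: 35.0 and 60000.0
print(filled)               # same array, with the two gaps filled in
```

The means are computed only from the values that are present, so the first column's mean is (40 + 30) / 2 = 35 and the second column's mean is (70000 + 50000) / 2 = 60000.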

         Impute module:

class sklearn.impute.SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)

Imputation transformer for completing missing values.


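Parameters:

missing_values  : the placeholder that marks a missing entry, for example an integer or np.nan (the default).

strategy        : what to impute with: 'mean', 'median' or 'most_frequent', computed separately for each column (default 'mean'). A fourth option, 'constant', fills every gap with the value given in fill_value.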
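As a rough illustration of how the strategy parameter changes the result, the sketch below fills the same gap with three different strategies. The single column of numbers is invented for this example; the value 10.0 is included to show how an outlier pulls the mean but not the median.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One made-up column with a single missing entry at row index 3.
column = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # fit_transform learns the statistic and fills the gap in one call
    results[strategy] = float(imputer.fit_transform(column)[3, 0])

print(results)  # mean gives 3.75; median and most_frequent both give 2.0
```

Which strategy fits best depends on the data: the median is often a safer choice for skewed numeric columns, while most_frequent also works for categorical columns.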

Feel free to ask questions in the comment section and share your suggestions.
