# How To Handle Missing Data In Data Preprocessing | Machine Learning Tutorials (2021)

Introduction:

In this tutorial, we will learn how to handle missing data during data preprocessing. So let's get started.

Data Preprocessing Tools:

•      Importing the libraries.
•      Importing the dataset.
•      Taking care of missing data.
•      Encoding categorical data.
•      Splitting the dataset into the training set and test set.
•      Feature scaling.

When working in Machine Learning, libraries play an important role.

We will use the following libraries in almost every Machine Learning model.

•       NumPy
•       Pandas
•       Matplotlib

To handle missing data, we import the SimpleImputer class from the impute module of scikit-learn (sklearn).

`from sklearn.impute import SimpleImputer`

Python code:

1. Importing the libraries

`import numpy as np`

`import matplotlib.pyplot as plt`

`import pandas as pd`

2. Importing the dataset

`dataset = pd.read_csv('Data.csv')`

`X = dataset.iloc[:, :-1].values`

`y = dataset.iloc[:, -1].values`

3. Taking care of missing data

`from sklearn.impute import SimpleImputer`

`imputer = SimpleImputer(missing_values=np.nan, strategy='mean')`

`imputer.fit(X[:, 1:3])`

`X[:, 1:3] = imputer.transform(X[:, 1:3])`

CODE EXPLANATION:

STEP 1: Importing the libraries. In the code above, we import NumPy, Matplotlib and Pandas.

STEP 2: Importing the dataset. We load our dataset from Data.csv and split it into X and y.

X represents the matrix of independent variables (every column except the last) and y represents the vector of the dependent variable (the last column).
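The X and y split can be tried without the real file. The following is a minimal sketch that reads a small in-memory CSV in place of Data.csv; the column names and values are made up for illustration and are not the contents of the actual dataset. It also counts the missing entries per column, which is a useful check before imputing.

```python
import io

import pandas as pd

# Hypothetical stand-in for Data.csv; column names and values are invented.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,,Yes
Germany,,54000,No
Spain,38,61000,Yes
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# Count the missing entries in each column before imputing anything.
missing_per_column = dataset.isna().sum()
print(missing_per_column)

# Every column except the last is a feature; the last column is the target.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

print(X.shape)  # (4, 3)
print(y.shape)  # (4,)
```

Here the empty cells in the Age and Salary columns are read as NaN, which is the marker that SimpleImputer looks for by default.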

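STEP 3: Taking care of missing data, using the impute module from the scikit-learn package. The imputer is fitted on X[:, 1:3] (columns 1 and 2, which must be numeric), and each missing entry in those columns is replaced with the mean of its column.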
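To see what the mean strategy does in practice, here is a small sketch on a made-up numeric array (the values are assumptions, not taken from Data.csv). After fitting, the learned column means are available in the imputer's statistics_ attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented numeric features: column 0 is an age, column 1 is a salary.
X_num = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit() learns one mean per column, ignoring the NaN entries.
imputer.fit(X_num)
col_means = imputer.statistics_

# transform() returns a copy in which each NaN is replaced by its column mean.
X_filled = imputer.transform(X_num)

print(col_means)
print(X_filled)
```

The two calls can also be combined as imputer.fit_transform(X_num), which gives the same result when fitting and transforming the same data.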

Impute module:

`class sklearn.impute.SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)`

Imputation transformer for completing missing values.

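Parameters:

•      missing_values : the placeholder that marks a missing entry, such as an integer or np.nan (the default).
•      strategy : what to impute with; 'mean', 'median', 'most_frequent' or 'constant', computed separately for each column. With 'constant', the value given in fill_value is used.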
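The strategy options can be compared side by side. This is a rough sketch on a single invented column, chosen so that the mean, median and most frequent value all differ; it also shows the 'constant' strategy, which fills with the fill_value parameter from the class signature.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One invented feature column; the last entry is missing.
X_col = np.array([[1.0], [1.0], [2.0], [3.0], [13.0], [np.nan]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled[strategy] = imputer.fit_transform(X_col)

# 'constant' ignores the data and fills with fill_value instead.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
filled["constant"] = constant_imputer.fit_transform(X_col)

for name, values in filled.items():
    # Print the value that replaced the missing entry in the last row.
    print(name, values[-1, 0])
```

For this column the missing entry becomes 4.0 with 'mean', 2.0 with 'median', 1.0 with 'most_frequent' and 0.0 with 'constant'. The mean is pulled up by the outlier 13, which is one reason to consider the median for skewed columns.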

Feel free to ask questions in the comments section and share your suggestions.