Data imputation is a critical step in data preprocessing and analysis that involves filling in missing or incomplete data values using various methods. Missing data can occur due to a variety of reasons such as data entry errors, equipment malfunction, or participant non-response, which can have a significant impact on data analysis and modeling if not addressed appropriately.
There are various methods available for data imputation, ranging from simple approaches such as central tendency imputation to more sophisticated methods such as machine learning-based imputation. The choice of method depends on the nature of the data, the amount of missingness, and the research question. In this article, we will explore different data imputation techniques, their advantages and limitations, and when to use each method. By the end of this article, readers will have a comprehensive understanding of practical data imputation and be able to apply it to their own datasets.
Materials and tools needed:
For the theoretical aspect of this topic you can check out here. Tools required for imputation techniques include:
Any Python IDE, such as Jupyter Notebook, Google Colab, PyCharm, or Spyder.
The only Python package that needs to be installed is pandas.
Use this dataset and follow the below steps for successful data imputation.
Our dataset is finance-related, and we want to clean it using appropriate techniques. This credit default dataset contains 7 columns and 529 unique values:
Loan_ID: Unique ID to track loan applications.
Gender: Whether the applicant is male or female.
Married: Whether the applicant is married or not.
ApplicantIncome: Salary of the applicant per year.
CoapplicantIncome: Salary of the co-applicant per year.
LoanAmount: The amount of loan an applicant wants to take.
Loan_Status: The status of the loan application, that is, whether or not the bank accepted it.
Step by step process of data imputation using central tendency measures:
Central tendency refers to the central value in a set of data. The three measures of central tendency are mean, median, and mode. The mean is the sum of all values divided by the total number of values, the median is the middle value when the data is arranged in order, and the mode is the most frequent value in the data.
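As a quick illustration of the three measures, here is a minimal sketch using pandas. The income values are made up for this example and are not taken from the loan dataset.

```python
import pandas as pd

# Hypothetical incomes, chosen only to illustrate the three measures.
incomes = pd.Series([2500, 3000, 3000, 4000, 12500])

mean_income = incomes.mean()      # sum of all values / number of values
median_income = incomes.median()  # middle value of the sorted data
mode_income = incomes.mode()[0]   # most frequent value (first one if there are ties)

print(mean_income, median_income, mode_income)
```

Note how the single large value pulls the mean well above the median, which is one reason the median is sometimes preferred for skewed numerical columns.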
Now, let's do a step-wise implementation of the imputation method using one of these measures on the given dataset using Python.
Step 1: Install the required libraries for effective data manipulation.
pip install pandas
Step 2: After installing the required libraries, import them to our IDE.
import pandas as pd
Step 3: Let's now import the dataset directly from the GitHub repository.
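A minimal sketch of the loading step is shown below. Since the repository URL is not reproduced here, the example reads a small in-memory CSV with made-up rows that mimic the seven columns; with the real dataset you would pass its raw CSV URL or a local file path to pd.read_csv instead.

```python
import io
import pandas as pd

# Stand-in for the real file: a few made-up rows with the same seven columns.
# With the actual dataset, pass its raw CSV URL or a local path to pd.read_csv
# instead of this in-memory buffer.
sample_csv = io.StringIO(
    "Loan_ID,Gender,Married,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Status\n"
    "LP001,Male,No,5849,0,,Y\n"
    "LP002,Male,Yes,4583,1508,128,N\n"
    "LP003,,Yes,3000,0,66,Y\n"
    "LP004,Female,,2583,2358,120,Y\n"
)

# Empty fields in the CSV are parsed as NaN (missing values) by default.
df = pd.read_csv(sample_csv)
print(df.shape)
```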
Step 4: To make sure the data has been imported correctly, we can use the head() function.
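For instance, a sketch of this check on a small made-up DataFrame (standing in for the loan data) might look like this:

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the loan dataset.
df = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003", "LP004", "LP005", "LP006"],
    "Gender": ["Male", "Male", np.nan, "Female", "Male", "Female"],
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0, 141.0, 267.0],
})

# head() returns the first five rows by default; pass n to change that.
preview = df.head()
print(preview)
```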
Step 5: Let's check whether our dataset contains any null values.
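One common way to do this is to chain isnull() and sum(), which gives the number of missing values per column. The sketch below uses made-up rows rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the loan dataset.
df = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "Married": ["No", "Yes", np.nan, "Yes"],
    "LoanAmount": [np.nan, 128.0, 66.0, np.nan],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_counts = df.isnull().sum()
print(missing_counts)
```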
Step 6: Let's replace the null values with appropriate techniques.
For the categorical columns, we replace the null values with the most frequent value, that is, the mode.
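A sketch of mode imputation, consistent with the fillna() and mode() calls described in this step, could look like the following. The rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up categorical columns with a few missing entries.
df = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male", np.nan],
    "Married": ["Yes", "Yes", np.nan, "No", "Yes"],
})

# mode() returns a Series (there can be ties), so [0] picks the first mode.
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])
df["Married"] = df["Married"].fillna(df["Married"].mode()[0])
print(df)
```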
Let’s decode the above code now:
df['Gender'] extracts the Gender column.
The fillna() function fills the NaN values with the value passed to it.
The mode() function calculates the mode of the Gender column; because it returns a Series that can hold more than one value, the first entry is selected with [0].
For the numerical columns, we replace the null values with the mean of that column.
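A minimal sketch of mean imputation on made-up numerical columns is shown below. Which columns to treat this way in the real dataset is an assumption; here the two column names are used only as examples.

```python
import numpy as np
import pandas as pd

# Made-up numerical columns with one missing value each.
df = pd.DataFrame({
    "ApplicantIncome": [5000.0, np.nan, 3000.0, 4000.0],
    "LoanAmount": [100.0, 120.0, np.nan, 140.0],
})

# mean() skips NaN by default, so the average uses only the observed values.
for column in ["ApplicantIncome", "LoanAmount"]:
    df[column] = df[column].fillna(df[column].mean())
print(df)
```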
Step 7: As done in Step 5, let’s re-check to see if any missing values are still left.
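The re-check can be sketched end to end on a small made-up DataFrame: fill the gaps, then count the missing values again.

```python
import numpy as np
import pandas as pd

# Made-up rows with one categorical and one numerical column.
df = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "LoanAmount": [100.0, np.nan, 140.0, 120.0],
})

# Impute: mode for the categorical column, mean for the numerical one.
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())

# Count missing values per column again; every count should now be 0.
remaining = df.isnull().sum()
all_filled = not df.isnull().values.any()
print(remaining)
```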
Since every column now reports 0 missing values, the null values have been filled successfully and the dataset is ready for the later EDA and modeling steps.
You can go to our GitHub Repository to download the full code.
In this blog, we explained how to implement data imputation using central tendency measures on a dataset, with the appropriate code and techniques. In conclusion, data imputation is a valuable technique in data analysis that makes it possible to analyze incomplete datasets meaningfully. Implemented properly, it can lead to more accurate conclusions and better-informed decisions.
Missing data is a common problem in real-world datasets, and ignoring it can lead to biased results or reduced statistical power. However, central tendency imputation techniques are not always suitable.
Some prominent limitations may include:
Reducing Variance: Central tendency imputation can reduce the variance of the data, making it difficult to detect patterns and relationships in the data.
Biased Results: Central tendency imputation assumes that the missing data is similar to the available data, which may not always be true. If the missing data is systematically different from the available data, central tendency imputation can lead to biased results.
Loss of Information: Finally, central tendency imputation can result in the loss of information, especially if a large proportion of the data is missing. This can reduce the power of statistical analyses and limit the ability to draw valid conclusions from the data.
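The variance reduction mentioned above can be seen in a small sketch. The loan amounts below are made up; the point is only that filling gaps with the mean adds values that sit exactly at the center, which shrinks the sample variance.

```python
import numpy as np
import pandas as pd

# Made-up loan amounts with two missing entries.
amounts = pd.Series([100.0, 120.0, np.nan, 140.0, np.nan, 200.0])

# Variance of the observed values only (NaN is skipped by default).
variance_before = amounts.var()

# Mean imputation: every gap receives the same central value.
imputed = amounts.fillna(amounts.mean())
variance_after = imputed.var()

print(variance_before, variance_after)
```

The mean of the column is unchanged by the imputation, but the spread is smaller, so relationships that depend on that spread can become harder to detect.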
For these reasons, we often need more flexible imputation techniques based on machine learning (such as the KNN imputer), which will be covered in a forthcoming blog.