Updated: Apr 6
Find and replace missing values through imputation in Data Science
In many real-world scenarios, datasets can contain missing data, which can lead to biased or inaccurate analyses if not addressed appropriately.
Data imputation helps to fill in missing values with estimated values based on current data.
Imputation plays an important role in exploratory data analysis (EDA) because missing data can significantly impact the validity and reliability of the results. EDA is the process of analyzing and visualizing data to gain insights into the underlying patterns, relationships, and distributions of the data. However, if the dataset contains missing values, it can limit the scope and accuracy of the analysis.
Imputation helps to address missing data by estimating the missing values based on the available information in the dataset. This can allow for a more comprehensive and accurate analysis of the data, providing more meaningful insights and better-informed decisions. Furthermore, imputation can help to reduce bias in the analysis by ensuring that the missing data does not disproportionately affect certain observations or variables. In addition, standard EDA code can fail or raise errors when the data contains missing values. By imputing missing values, the analysis can be conducted on a more complete and representative dataset, improving the overall quality and reliability of the results.
What is data imputation?
Data imputation is a technique used to estimate missing or incomplete data in a dataset, filling in the gaps with estimated values based on the available information.
The process of data imputation involves selecting an appropriate imputation method, which can range from simple techniques such as mean or mode imputation to more complex techniques such as regression imputation, multiple imputation, or deep learning-based imputation. The choice of imputation method depends on the nature and complexity of the data, the amount of missing data, and the purpose of the analysis.
Why data imputation matters:
Improve the quality of data: Missing data can lead to biased analyses and inaccurate results. By imputing missing values, we can improve the quality of the data and reduce the potential for errors.
Maximize the use of available data: Incomplete data can limit the scope of analysis, as missing values may result in the exclusion of certain observations or variables. By imputing missing values, we can maximize the use of available data and increase the statistical power of our analyses.
Maintain representativeness: If missing values are not handled properly, they can introduce bias and affect the representativeness of the sample. Imputing missing values can help to maintain the representativeness of the sample and ensure that the conclusions drawn from the analysis are accurate.
Facilitate the use of advanced analytical techniques: Many advanced analytical techniques, such as machine learning algorithms, require complete data. Imputing missing values can facilitate the use of these techniques and help to unlock valuable insights.
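Before choosing a method, it helps to know how much data is actually missing and how many observations would be lost if incomplete rows were simply excluded. The following is a minimal sketch using pandas; the DataFrame and its column names are made up for illustration and are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 47.0, np.nan],
    "income": [40000.0, 52000.0, np.nan, 61000.0, 58000.0],
    "city": ["Pune", None, "Delhi", "Pune", "Mumbai"],
})

# Number and share of missing values in each column.
missing_count = df.isna().sum()
missing_share = df.isna().mean()

# Rows that would survive if every incomplete row were dropped.
complete_rows = len(df.dropna())

print(missing_count)
print(missing_share.round(2))
print(f"{complete_rows} of {len(df)} rows are complete")
```

In this toy example only two of the five rows are complete, which shows how quickly excluding incomplete observations can shrink a dataset.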
Some popular techniques for imputation:
Central tendency imputation: This involves replacing missing values with a measure of central tendency, such as the mean, median, or mode, of the available data for that variable. This method is simple to implement, but it may not be appropriate for all scenarios.
Which measure to use depends on the type of data:
Mean imputation: This method suits continuous numerical data with no outliers. The missing or null values are filled with the mean of the respective column.
Median imputation: If a continuous numerical column contains outliers, the median of the column is used to fill the missing or null values, since the median is robust to outliers.
Mode imputation: This method is best suited to categorical columns. The missing values are filled with the most frequent value in the column, the mode.
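The three rules above can be sketched with pandas as follows. The data and column names are hypothetical and chosen so that each column matches one of the three situations; this is an illustration, not a recommendation for any particular dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one column per situation described above.
df = pd.DataFrame({
    "height_cm": [160.0, 172.0, np.nan, 168.0],      # continuous, no outliers
    "income": [30000.0, 32000.0, np.nan, 500000.0],  # continuous, one outlier
    "city": ["Pune", "Delhi", "Pune", None],         # categorical
})

imputed = df.copy()

# Mean imputation for the outlier-free numerical column.
imputed["height_cm"] = df["height_cm"].fillna(df["height_cm"].mean())

# Median imputation for the numerical column with an outlier.
imputed["income"] = df["income"].fillna(df["income"].median())

# Mode imputation for the categorical column; mode() can return several
# values when there is a tie, so the first one is taken here.
imputed["city"] = df["city"].fillna(df["city"].mode().iloc[0])

print(imputed)
```

Note how the outlier in `income` would pull the mean far above the typical values, while the median stays at 32000.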
Some advanced techniques include:
K-nearest neighbor (KNN) imputation: This is a method for filling in missing values in a dataset by estimating them from the values of their K-nearest neighbors in the dataset. The idea behind this method is that similar objects tend to have similar values for their attributes.
The KNN imputation algorithm works as follows:
Identify the K nearest neighbors: For each missing value, the KNN algorithm identifies the K nearest neighbors of the row that contains it, based on the features that are available. The distance between two data points can be measured using various metrics, such as Euclidean distance, Manhattan distance, or a distance derived from cosine similarity.
Compute the imputed value: Once the K nearest neighbors have been identified, the missing value is estimated from their values, typically as a weighted average. The weights are determined by the distance between the row with the missing value and each neighbor: the closer a neighbor is, the higher the weight assigned to its value.
Repeat the process: The KNN imputation process is repeated for each missing value in the dataset.
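The three steps above can be sketched from scratch with NumPy. This is a simplified illustration, not a reference implementation: the function name `knn_impute` is hypothetical, it assumes purely numerical features on comparable scales, uses Euclidean distance over the features that both rows have observed, and weights neighbors by inverse distance.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs in a 2-D float array using inverse-distance-weighted KNN."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    # Step 3: repeat the process for every missing cell (row i, column j).
    for i, j in zip(*np.where(np.isnan(X))):
        candidates = []
        for r in range(X.shape[0]):
            if r == i or np.isnan(X[r, j]):
                continue  # a neighbor must have an observed value in column j
            # Compare the two rows only on features observed in both.
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if not shared.any():
                continue
            dist = np.sqrt(np.sum((X[i, shared] - X[r, shared]) ** 2))
            candidates.append((dist, X[r, j]))
        if not candidates:
            continue  # no usable neighbor; leave the value missing
        # Step 1: keep the k nearest neighbors.
        candidates.sort(key=lambda pair: pair[0])
        dists, values = map(np.array, zip(*candidates[:k]))
        # Step 2: weighted average, so closer neighbors count more.
        weights = 1.0 / (dists + 1e-9)
        filled[i, j] = np.sum(weights * values) / np.sum(weights)
    return filled

data = [
    [1.0, 2.0, 10.0],
    [2.0, 3.0, 20.0],
    [9.0, 9.0, 90.0],
    [1.5, 2.5, np.nan],  # equally close to the first two rows
]
result = knn_impute(data, k=2)
print(result[3, 2])
```

The last row is equally distant from the first two rows, so its missing value is imputed as the midpoint of their values (about 15), while the distant third row is ignored. In practice, a library implementation such as scikit-learn's `KNNImputer` would normally be preferred over hand-written code.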
One of the advantages of the KNN imputation method is that it can handle non-linear relationships between variables and can be used for both continuous and categorical variables. It also has some limitations:
The choice of the value of K: The value of K should be carefully selected, as it can significantly affect the quality of the imputed data. A small value of K makes the imputed values sensitive to noise in individual neighbors (overfitting), while a large value of K averages over dissimilar points and smooths away local structure (underfitting).
The presence of outliers: KNN imputation is sensitive to outliers, as they can significantly affect the distance between data points.
Computationally intensive: KNN imputation can be computationally intensive for large datasets, because distances have to be computed between each incomplete row and its candidate neighbors. A large value of K adds to the cost.
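One possible way to choose K is to hide some values that are actually known, impute them with several candidate values of K, and compare the imputations with the hidden truth. The sketch below does this with scikit-learn's `KNNImputer` on synthetic data; the data, the candidate values of K, and the 10% masking rate are all assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic complete data: the third column depends on the first two.
X_true = rng.normal(size=(200, 3))
X_true[:, 2] = X_true[:, 0] + 0.5 * X_true[:, 1] + rng.normal(scale=0.1, size=200)

# Hide 10% of the third column so that the imputations can be scored.
mask = np.zeros(X_true.shape, dtype=bool)
mask[rng.choice(200, size=20, replace=False), 2] = True
X_missing = X_true.copy()
X_missing[mask] = np.nan

# Root mean squared error on the hidden values for each candidate K.
errors = {}
for k in (1, 3, 5, 10, 25):
    imputer = KNNImputer(n_neighbors=k, weights="distance")
    imputed = imputer.fit_transform(X_missing)
    errors[k] = float(np.sqrt(np.mean((imputed[mask] - X_true[mask]) ** 2)))

best_k = min(errors, key=errors.get)
print(errors)
print("K with the lowest error:", best_k)
```

The best K found this way depends on the dataset and on which values were hidden, so the result is a guide and not a universal setting.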
SVM (Support Vector Machines) imputation: This is a technique used to fill in missing data in a dataset using the SVM algorithm. The SVM algorithm is typically used for classification problems, but it can also be adapted for regression problems and imputation tasks.
In SVM imputation, the SVM algorithm is trained on the non-missing data in the dataset, and then used to predict the missing values. The idea behind SVM imputation is that the SVM algorithm can learn patterns in the data and use them to accurately predict the missing values.
The algorithm of SVM imputation typically involves the following steps:
Identify the missing values in the dataset.
Split the dataset into two subsets: one with complete data (i.e., no missing values) and one with the missing values.
Train an SVM model on the complete data subset, using the available features as inputs and the target variable (i.e., the variable with the missing values) as the output.
Use the trained SVM model to predict the missing values in the missing data subset.
Combine the complete data subset and the imputed missing data subset to create a complete dataset.
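The steps above can be sketched with scikit-learn. This example uses support vector regression (`SVR`) because the variable being imputed is numerical, and it assumes that the feature columns themselves are complete. The data is synthetic, and the column names and hyperparameters (`C=10.0`, RBF kernel) are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: "target" depends on two fully observed features.
rng = np.random.default_rng(42)
n = 120
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["target"] = 2.0 * df["x1"] - df["x2"] + rng.normal(scale=0.1, size=n)
df.loc[df.index[:15], "target"] = np.nan  # make 15 values missing

features = ["x1", "x2"]

# Steps 1-2: identify the missing values and split the rows.
is_missing = df["target"].isna()
complete = df[~is_missing]
incomplete = df[is_missing]

# Step 3: train on the complete rows (features are scaled first,
# since SVMs are sensitive to feature scale).
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(complete[features], complete["target"])

# Steps 4-5: predict the missing values and merge them back in.
df_imputed = df.copy()
df_imputed.loc[is_missing, "target"] = model.predict(incomplete[features])

print(df_imputed.head())
```

For a categorical variable, a classifier such as `SVC` could be used in place of `SVR` in the same workflow.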
The above figure shows how the SVM algorithm separates different classes (in the figure, class 1 and class 2) by finding the hyperplane that leaves the widest margin between them. Once the model has been fitted on the complete data, it predicts the missing values in the incomplete data accordingly.
SVM is relatively robust to outliers and noise in the data, whereas KNN imputation is sensitive to them. This can be useful in situations where the dataset contains a lot of noise or where missing values are clustered in certain areas. It also has some drawbacks:
Can overfit the data: The SVM algorithm can be prone to overfitting, particularly when the dataset is small or noisy. This can lead to a model that performs well on the training data but poorly on new, unseen data.
Limited interpretability: SVM is often criticized for its lack of interpretability, as it can be difficult to understand how the algorithm is making its predictions. This can be a limitation in situations where interpretability is important, such as in certain scientific or medical applications.
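A simple way to check for the overfitting described above is to hold back some of the complete rows and compare the model's error on the rows it was trained on with its error on the held-back rows. The sketch below uses synthetic data, and the large `C` and `gamma` values are deliberately chosen, as an assumption for illustration, to make the model flexible enough to follow noise.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic noisy data standing in for the complete rows of a dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=150)

# Hold back 30% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Illustrative settings: large C and gamma make the model very flexible.
model = SVR(kernel="rbf", C=1000.0, gamma=10.0).fit(X_train, y_train)

train_mae = mean_absolute_error(y_train, model.predict(X_train))
val_mae = mean_absolute_error(y_val, model.predict(X_val))
print(f"train MAE: {train_mae:.3f}, validation MAE: {val_mae:.3f}")
```

A validation error that is much larger than the training error suggests that the model is overfitting, in which case its imputed values should be treated with caution.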
In this blog, we looked at what data imputation is and at different imputation techniques, largely from a theoretical point of view. In conclusion, when choosing an imputation method, it is important to carefully evaluate the available options and select the one that is most appropriate for the specific problem and dataset at hand.