
Measuring the Unmeasurable: The Power of Distance Metrics in Data Science


Introduction:





Distance metrics are a fundamental concept in data science and machine learning. They are used to measure the similarity or dissimilarity between data points in a dataset. A distance metric is a mathematical function that takes two data points as input and returns the distance between them.


Distance metrics play a critical role in many machine learning tasks, such as clustering, classification, and anomaly detection. They help us understand the relationships between data points and can be used to identify patterns and trends in large datasets. There are many different types of distance metrics, each with its own strengths and weaknesses; Euclidean distance, Manhattan distance, and cosine similarity are among the most commonly used in data science.


Overall, distance metrics are an essential tool for data scientists, allowing them to compare and analyze data in a meaningful way. By using distance metrics, data scientists can gain insights into complex datasets, identify patterns and relationships, and ultimately make better decisions based on data-driven insights.



Types of commonly used distance metrics:


There are many types of distance metrics used in data science, each with its own strengths and weaknesses. Some of the most commonly used ones are listed below; a short SciPy-based sketch after the list shows how to compute each of them:


  • Euclidean distance: This is the straight-line distance between two points in a multi-dimensional space. It is the most commonly used distance metric and is suitable for continuous variables.


  • Manhattan distance: Also known as taxicab distance, it measures the distance between two points by adding the absolute differences of their coordinates. It is often a good choice for discrete variables and for high-dimensional data.


  • Minkowski distance: This is a generalized distance metric that includes Euclidean and Manhattan distance as special cases. It is defined as the pth root of the sum of the pth power of the absolute differences between coordinates.


  • Cosine similarity: This metric measures the cosine of the angle between two vectors. It is used for text data and other sparse data where the magnitude of the vector is less important than the direction.


  • Chebyshev distance: Chebyshev distance, also known as chessboard distance or L∞ distance, is a metric used to measure the distance between two points in a grid system where movement is restricted to horizontal, vertical, and diagonal directions.


  • Hamming distance: This is a metric used for comparing two strings of equal length. It measures the number of positions at which the corresponding symbols are different.
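
All six of these metrics are also available ready-made in SciPy's scipy.spatial.distance module. As a quick sketch (assuming SciPy is installed), the snippet below computes each metric for one pair of example vectors. Note two quirks: SciPy's cosine returns the cosine distance (1 minus the similarity), and its hamming returns the fraction of mismatched positions rather than the count.

from scipy.spatial import distance

u, v = [1, 2, 3], [4, 5, 6]

print(distance.euclidean(u, v))         # straight-line distance
print(distance.cityblock(u, v))         # Manhattan / taxicab distance
print(distance.minkowski(u, v, p=3))    # generalized Lp distance
print(1 - distance.cosine(u, v))        # cosine similarity
print(distance.chebyshev(u, v))         # maximum coordinate difference
print(distance.hamming(u, v) * len(u))  # count of differing positions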


Some commonly used distance metrics with their implementations:


Euclidean distance:


The Euclidean distance is a measure of the distance between two points in a multi-dimensional space. It is named after the Greek mathematician Euclid, who was one of the first to describe this concept.

The Euclidean distance between two points (x1, y1) and (x2, y2) in two-dimensional space can be calculated using the following formula:

d = √((x2 − x1)² + (y2 − y1)²)

The formula can be extended to higher dimensions by adding more terms for each additional dimension.


The Euclidean distance is commonly used in various fields, such as statistics, machine learning, and computer science, to measure the similarity or dissimilarity between objects. It is a fundamental concept in geometry and has many practical applications in real-world problems, such as clustering, classification, and data visualization. Specific ML algorithms that rely on it include KNN, k-means, DBSCAN, and PCA.

Let’s implement it in Python:



Code:


import math

def euclidean_distance(point1, point2):
    # calculate the distance between two points
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(point1, point2)))
    return distance

# example usage
point1 = [1, 2, 3]
point2 = [4, 5, 6]
distance = euclidean_distance(point1, point2)
print(distance)
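
Running this prints 5.196152422706632, i.e. √27, since each of the three coordinates differs by 3. If NumPy is available, the same result can be obtained in one line by treating the distance as the L2 norm of the difference vector (an equivalent shortcut, not something the function above requires):

import numpy as np

# Euclidean distance as the L2 norm of the difference vector
print(np.linalg.norm(np.array(point1) - np.array(point2)))  # ≈ 5.196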





Manhattan distance:


The Manhattan distance is a measure of the distance between two points in a multi-dimensional space. It is named after the Manhattan grid layout, in which streets run at right angles to each other.


The Manhattan distance between two points (x1, y1) and (x2, y2) in two-dimensional space is the sum of the absolute differences of their coordinates along the x-axis and y-axis:

d = |x2 − x1| + |y2 − y1|

The formula can be extended to higher dimensions by adding more terms for each additional dimension.


The Manhattan distance is often used in situations where movement can only occur along orthogonal (right-angle) paths, such as in a city grid, a game board, or a maze. It is also used in computer science, particularly in algorithms that involve finding the shortest path between two points. The Manhattan distance is sometimes referred to as the taxicab distance or the L1 norm. It is widely used in models such as KNN and clustering algorithms, and the L1 norm also underlies Lasso-regularized linear regression.


Let’s implement it in Python:



Code:


def manhattan_distance(point1, point2):
    # calculate the distance between two points
    distance = sum(abs(a - b) for a, b in zip(point1, point2))
    return distance

# example usage
point1 = [1, 2, 3]
point2 = [4, 5, 6]
distance = manhattan_distance(point1, point2)
print(distance)
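
This prints 9, since each coordinate differs by 3 (3 + 3 + 3). With NumPy, the same value is the L1 norm of the difference vector (again just an equivalent shortcut):

import numpy as np

# Manhattan distance as the L1 norm of the difference vector
print(np.sum(np.abs(np.array(point1) - np.array(point2))))  # 9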





Minkowski distance:


The Minkowski distance is a measure of the distance between two points in a multi-dimensional space. It is a generalized form of the Euclidean distance and the Manhattan distance, which are special cases of the Minkowski distance with a parameter of 2 and 1, respectively.


The Minkowski distance between two points (x1, y1) and (x2, y2) in two-dimensional space is defined as:

d = (|x2 − x1|^p + |y2 − y1|^p)^(1/p)

where p is a parameter that determines the degree of the distance.


When p=1, the Minkowski distance reduces to the Manhattan distance, and when p=2, it reduces to the Euclidean distance. It can be extended to n-dimensional space. The Minkowski distance is commonly used in machine learning, data mining, and image processing to measure the similarity or dissimilarity between two objects. (The similarly named Minkowski metric of space-time in physics is a related but distinct concept.) The Minkowski distance has many practical applications in real-world problems, such as clustering, classification, and feature selection.


Let’s implement it in Python:



Code:


import math

def minkowski_distance(point1, point2, p):
    # pth root of the sum of the pth powers of the absolute differences
    distance = math.pow(sum(math.pow(abs(a - b), p) for a, b in zip(point1, point2)), 1 / p)
    return distance

# example usage
point1 = [1, 2, 3]
point2 = [4, 5, 6]
p = 3
distance = minkowski_distance(point1, point2, p)
print(distance)
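
For p = 3 this prints about 4.3267 (the cube root of 81). The claim that Minkowski generalizes the other two metrics is easy to check by varying p on the same points:

print(minkowski_distance(point1, point2, 1))  # 9.0, the Manhattan distance
print(minkowski_distance(point1, point2, 2))  # ≈ 5.196, the Euclidean distance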





Cosine similarity:



Cosine similarity is a measure of similarity between two vectors in a multi-dimensional space. It measures the cosine of the angle between the two vectors and ranges from -1 to 1. A value of 1 means that the two vectors are identical, a value of 0 means that the two vectors are orthogonal (i.e., not similar), and a value of -1 means that the two vectors are diametrically opposed.


Cosine similarity is widely used in natural language processing, information retrieval, and machine learning, especially in text classification and clustering. In text processing, documents are often represented as vectors of term frequencies or TF-IDF (term frequency-inverse document frequency) weights, and the cosine similarity is used to compare the similarity between them.


The formula for computing cosine similarity between two vectors a and b is:

cos(θ) = (a · b) / (||a|| ||b||)

where a · b is the dot product of the two vectors and ||a|| and ||b|| are their magnitudes.


It can be extended to n-dimensional space. Cosine similarity has an advantage over magnitude-sensitive measures such as the Euclidean distance: because it compares only the direction of the vectors, it is insensitive to the scale of the data, which often captures the similarity between documents of different lengths more effectively.


Let’s implement it in Python:



Code:



import math

def cosine_similarity(vector1, vector2):
    # dot product of the vectors divided by the product of their magnitudes
    dot_product = sum(a * b for a, b in zip(vector1, vector2))
    magnitude1 = math.sqrt(sum(a ** 2 for a in vector1))
    magnitude2 = math.sqrt(sum(b ** 2 for b in vector2))
    return dot_product / (magnitude1 * magnitude2)

# example usage
vector1 = [1, 2, 3]
vector2 = [4, 5, 6]
similarity = cosine_similarity(vector1, vector2)
print(similarity)
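
This prints about 0.9746, indicating that the two vectors point in nearly the same direction. The scale-insensitivity mentioned above is easy to demonstrate: scaling a vector does not change its direction, so its similarity with a scaled copy of itself stays at 1 (up to floating-point rounding):

# [2, 4, 6] is 2 * [1, 2, 3], so the angle between them is zero
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # ≈ 1.0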





Chebyshev distance:


The Chebyshev distance, also known as the maximum metric or L∞ metric, is a measure of the distance between two points in a multi-dimensional space. It is defined as the maximum absolute difference between the coordinates of the two points along any dimension.


In other words, given two points (x1, y1) and (x2, y2) in two-dimensional space, the Chebyshev distance between them is calculated as:

d = max(|x2 − x1|, |y2 − y1|)

It can be extended to n-dimensional space. The Chebyshev distance is useful when you want to measure the difference between two points in a way that only considers the dimension with the largest difference. It is commonly used in areas such as computer vision and pattern recognition, as well as in game theory and economics.


Here are some characteristics of the Chebyshev distance:


  • It satisfies all the axioms of a metric space.

  • It considers only the dimension with the largest absolute difference, so features measured on different scales are usually normalized before it is applied.

  • It is symmetric, i.e., the distance from A to B is the same as the distance from B to A.

  • It can be used to define a bounded neighborhood around a point in a space, which is useful for clustering and classification algorithms.


Let’s implement it in Python:



Code:


import numpy as np

def chebyshev_distance(point1, point2):
    # largest absolute coordinate-wise difference
    return np.max(np.abs(point1 - point2))

# Example usage
point1 = np.array([1, 2, 3])
point2 = np.array([4, 5, 6])
distance = chebyshev_distance(point1, point2)
print(distance)
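
This prints 3, since all three coordinates differ by exactly 3. The Chebyshev distance is also the limit of the Minkowski distance as p grows, which can be sanity-checked numerically (assuming the minkowski_distance function defined earlier is still in scope):

# a large p approximates the L∞ (Chebyshev) distance
print(minkowski_distance([1, 2, 3], [4, 5, 6], 100))  # ≈ 3.03, close to 3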





Hamming distance:


Hamming distance is a measure of the difference between two strings of equal length. It is defined as the number of positions at which the corresponding symbols are different in the two strings. In other words, given two strings s and t of the same length, the Hamming distance between them is the number of positions i such that s[i] ≠ t[i].


For example, the strings "karolin" and "kathrin" differ at three positions (r/t, o/h, l/r), so the Hamming distance between them is 3.

The Hamming distance is often used in coding theory and digital communications, where it is used to detect and correct errors in transmitted data. It is also used in bioinformatics to measure the similarity between DNA sequences.


Here are some characteristics of the Hamming distance:


  • It is always a non-negative integer.

  • It is zero if and only if the two strings are identical.

  • It satisfies the triangle inequality, which means that the distance from A to C is no greater than the distance from A to B plus the distance from B to C.


Let’s implement it in Python:



Code:


def hamming_distance(s, t):
    # count positions where the corresponding symbols differ
    if len(s) != len(t):
        raise ValueError("Strings must be of equal length")
    return sum(1 for a, b in zip(s, t) if a != b)

# Example usage
s = "hello"
t = "hallo"
distance = hamming_distance(s, t)
print(distance)
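
"hello" and "hallo" differ only at index 1 ('e' vs 'a'), so this prints 1. A normalized variant, the fraction of positions that differ, is sometimes used instead:

# fraction of positions that differ: 1/5 for "hello" vs "hallo"
print(hamming_distance(s, t) / len(s))  # 0.2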




Conclusion:


In this blog we walked through the most commonly used distance metrics in data science. When selecting a distance metric for a specific application, it is important to consider the properties of the data and the requirements of the application. Additionally, it is common to standardize or normalize the data before calculating distances to ensure that each feature contributes equally to the distance measure.


In conclusion, distance metrics are an important tool in data science for measuring the similarity or dissimilarity between data points. By choosing the appropriate metric for a specific application, we can obtain more accurate and meaningful results from our analyses.

