[Machine Learning] Common Vector Space Metrics Compilation

Last Updated: 2023/4/5 (fixed the display of mathematical symbols in Markdown)
This article was first published on 若绾. Please indicate the source if reproduced.

Introduction#

When it comes to machine learning and data science, vector space distance is a very important concept. In many machine learning applications, data points are often represented in vector form. Therefore, it is crucial to understand how to calculate and compare distances between vectors. Vector space distance can be used to solve many problems, such as clustering, classification, dimensionality reduction, and more. In federated learning projects, vector space distance is particularly important as it helps us compare vectors from different devices or data sources to determine if they are similar enough for joint training.

This article will introduce the basic concepts of vector space distance, including Euclidean distance, Manhattan distance, Chebyshev distance, and more. We will discuss how these distance metrics are calculated, their advantages and disadvantages, and when to use each distance metric. We will also introduce some more advanced distance metrics, such as Mahalanobis distance and cosine similarity, and explore their applicability in different scenarios. Hopefully, this article can help you better understand vector space distance and how to apply it to your federated learning projects.

Manhattan Distance#

Manhattan distance (L1 norm) is a method for measuring the distance between two vectors X and Y. Its formula is as follows:

$D_{M}(X,Y) = \sum_{i=1}^{n}|x_{i} - y_{i}|$

where:

  • $D_{M}(X,Y)$ represents the Manhattan distance between vectors X and Y.

  • $x_{i}$ and $y_{i}$ represent the i-th component of vectors X and Y, respectively.

  • n is the number of components in the vectors.

Manhattan distance measures the distance between two vectors by summing the absolute differences between their components. It is named after the grid-like street layout of Manhattan, where the travel distance between two points is the sum of the horizontal and vertical distances. Compared to Euclidean distance, Manhattan distance is often preferred in high-dimensional spaces because it does not square the component differences, so a few large components are less likely to dominate the result. Manhattan distance is also widely used in machine learning and data science, for example in clustering and classification problems, and in feature extraction for image and speech recognition.
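As a minimal sketch (the vectors x and y below are hypothetical examples, not data from this article), Manhattan distance can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical example vectors
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# Manhattan (L1) distance: sum of absolute component-wise differences
manhattan = np.sum(np.abs(x - y))
print(manhattan)  # 3 + 2 + 0 = 5.0
```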

Canberra Distance#

Canberra distance is a distance metric used to measure the similarity between two vectors. It is commonly used in data analysis and information retrieval. Its formula is as follows:

$D_{c}(X,Y) = \sum_{i=1}^{n}\frac{|x_{i} - y_{i}|}{|x_{i}| + |y_{i}|}$

where:

  • $D_{c}(X,Y)$ represents the Canberra distance between vectors X and Y.

  • $x_{i}$ and $y_{i}$ represent the i-th component of vectors X and Y, respectively.

  • n is the number of components in the vectors.

Canberra distance takes into account the magnitude of the vector components and is suitable for situations where that magnitude matters, such as analyzing gene expression data. Unlike other distance metrics, Canberra distance divides the absolute difference between each pair of components by the sum of their absolute values. When corresponding components are both zero, the term becomes 0/0 and is conventionally treated as 0; when components are close to zero, even small absolute differences produce large terms, so the metric is very sensitive near zero. Canberra distance is also widely used in machine learning and data science, for example in classification, clustering, recommendation systems, and information retrieval.
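A minimal NumPy sketch of the formula above, assuming hypothetical vectors x and y and treating 0/0 terms as 0:

```python
import numpy as np

# Hypothetical example vectors (one pair of components is zero in both)
x = np.array([1.0, 0.0, 3.0])
y = np.array([2.0, 0.0, 1.0])

num = np.abs(x - y)
den = np.abs(x) + np.abs(y)
# Where both components are zero, the 0/0 term is conventionally set to 0
terms = np.divide(num, den, out=np.zeros_like(num), where=den != 0)
canberra = np.sum(terms)
print(canberra)  # 1/3 + 0 + 2/4 = 0.8333...
```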

Euclidean Distance#

Euclidean distance is a common method for measuring the distance between two vectors. It is widely used in various tasks in the field of machine learning and data science, such as clustering, classification, and regression.

Suppose we have two vectors X and Y of equal length, each containing n components. The Euclidean distance between them can be calculated using the following formula:

$D_{E}(X,Y) = \sqrt{\sum_{i=1}^{n}(x_{i} - y_{i})^2}$

where:

  • $D_{E}(X,Y)$ represents the Euclidean distance between vectors X and Y.

  • $x_{i}$ and $y_{i}$ represent the i-th component of vectors X and Y, respectively.

  • n is the number of components in the vectors.

Euclidean distance is calculated by summing the squared differences between the components of the two vectors and taking the square root of the sum. It is named after the ancient Greek mathematician Euclid and is widely used in plane geometry as well. In machine learning and data science, Euclidean distance is often used to calculate the similarity or distance between two samples, as it helps us identify samples that are close to each other in feature space.
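A minimal sketch with hypothetical vectors; np.linalg.norm gives the same result as the explicit formula:

```python
import numpy as np

# Hypothetical example vectors
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Euclidean (L2) distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((x - y) ** 2))
print(euclidean)              # 5.0
print(np.linalg.norm(x - y))  # same value via the built-in norm
```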

Standardized Euclidean Distance#

Standardized Euclidean distance is a metric for measuring the distance between two vectors. It takes into account the variability of the vector components and is typically used in cases where the components have different measurement units and scales, and the variability of the components is important.

The formula for calculating standardized Euclidean distance is as follows:

$D_{SE}(X,Y) = \sqrt{\sum_{i=1}^{n}\frac{(x_{i} - y_{i})^2}{s_{i}^2}}$

where:

  • $D_{SE}(X,Y)$ represents the standardized Euclidean distance between vectors X and Y.

  • $x_{i}$ and $y_{i}$ represent the i-th component of vectors X and Y, respectively.

  • n is the number of components in the vectors.

  • $s_{i}$ is the standard deviation of the i-th component in the vectors.

Standardized Euclidean distance is computed in the same way as Euclidean distance, except that each component difference is divided by that component's standard deviation before squaring. This normalization removes the effect of differing measurement units and scales, so no single component dominates the result simply because its values are numerically large.
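A minimal sketch, assuming the per-component standard deviations are estimated from a small hypothetical data matrix (in practice they would come from your dataset):

```python
import numpy as np

# Hypothetical data matrix: rows are observations, columns are components
data = np.array([[1.0, 10.0, 100.0],
                 [2.0, 30.0, 300.0],
                 [3.0, 20.0, 200.0]])
s = data.std(axis=0, ddof=1)  # per-component sample standard deviation

x, y = data[0], data[1]
# Each component difference is divided by that component's standard deviation
d_se = np.sqrt(np.sum(((x - y) / s) ** 2))
print(d_se)  # 3.0
```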

Squared Euclidean Distance#

Squared Euclidean distance is a metric for calculating the distance between two vectors. Its formula is as follows:

$D_{E}^{2}(X,Y) = \sum_{i=1}^{n}(x_{i} - y_{i})^2$

where:

  • $D_{E}^{2}(X,Y)$ represents the squared Euclidean distance between vectors X and Y.

  • $x_{i}$ and $y_{i}$ represent the i-th component of vectors X and Y, respectively.

  • n is the number of components in the vectors.

Squared Euclidean distance measures the distance between two vectors by summing the squared differences between their components. Compared to Euclidean distance, squared Euclidean distance avoids the square root operation on the sum, making the calculation more efficient. Squared Euclidean distance is also widely used in machine learning and data science, such as in clustering, classification, regression, and other tasks.

It is important to note that squared Euclidean distance, like Euclidean distance, does not account for differences in the scale and units of the components, so components measured on large numerical scales can dominate the result. In such cases, other distance metrics such as standardized Euclidean distance, Manhattan distance, or Chebyshev distance can be used.
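A minimal sketch with hypothetical vectors; the only change from Euclidean distance is dropping the square root:

```python
import numpy as np

# Hypothetical example vectors
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Squared Euclidean distance: sum of squared differences, no square root
sq_euclidean = np.sum((x - y) ** 2)
print(sq_euclidean)  # 25.0
```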

Cosine Similarity#

Cosine similarity is a metric for measuring the similarity between two vectors. Its formula is as follows:

$\cos(\theta) = \frac{X \cdot Y}{\left\| X\right\| \left\| Y\right\|} = \frac{\sum\limits_{i=1}^{n} x_i y_i}{\sqrt{\sum\limits_{i=1}^{n} x_i^2} \sqrt{\sum\limits_{i=1}^{n} y_i^2}}$

where:

  • $\cos(\theta)$ represents the cosine similarity between vectors X and Y.

  • $X \cdot Y$ represents the dot product of vectors X and Y.

  • $\left\| X\right\|$ and $\left\| Y\right\|$ represent the norms (magnitudes) of vectors X and Y, respectively.

  • $x_i$ and $y_i$ represent the i-th component of vectors X and Y, respectively.

  • n is the number of components in the vectors.

Cosine similarity is a similarity metric based on the angle between vectors. In fields such as natural language processing and information retrieval, cosine similarity is often used to measure the similarity between texts, since texts can be represented as vectors and the direction of those vectors matters more than their magnitudes. It is cheap to compute and easy to interpret: when two vectors point in the same direction, the cosine similarity is 1; when the angle between them is 90 degrees, it is 0; when they point in opposite directions, it is -1.

It is important to note that cosine similarity does not consider the differences in magnitude between vector components, so it may not be suitable for some datasets.
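A minimal sketch with hypothetical vectors; since y is a positive multiple of x, the two vectors share the same direction and the similarity is 1:

```python
import numpy as np

# Hypothetical example vectors; y = 2 * x, so they point in the same direction
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# Cosine similarity: dot product divided by the product of the norms
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_sim)  # 1.0 (up to floating-point rounding)
```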

Chebyshev Distance#

Chebyshev distance is a metric used to measure the distance between two vectors. It calculates the maximum absolute difference between the components of the two vectors. Let's assume we have two vectors X and Y with equal lengths, meaning each vector contains n components. The Chebyshev distance between them can be calculated using the following formula:

$D_{C}(X,Y) = \max_{1 \le i \le n}|x_{i} - y_{i}|$

where:

  • $D_{C}(X,Y)$ represents the Chebyshev distance between vectors X and Y.

  • $x_{i}$ and $y_{i}$ represent the i-th component of vectors X and Y, respectively.

  • n is the number of components in the vectors.

Chebyshev distance is often used to measure the distance between vectors. Its calculation is similar to Manhattan distance, but instead of summing the absolute differences between components, it takes only the maximum absolute difference. It therefore reflects the single largest component-wise difference rather than the accumulated difference across all components. Chebyshev distance is also widely used in machine learning and data science, for example in image processing, signal processing, and time series analysis.

It is important to note that Chebyshev distance may be affected by outliers, as it is based on the maximum absolute difference between components. Therefore, in the presence of outliers, Chebyshev distance may give inaccurate distance measurements.
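A minimal sketch with hypothetical vectors, taking the maximum rather than the sum of the absolute differences:

```python
import numpy as np

# Hypothetical example vectors
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# Chebyshev (L-infinity) distance: largest absolute component-wise difference
chebyshev = np.max(np.abs(x - y))
print(chebyshev)  # max(3, 2, 0) = 3.0
```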

Mahalanobis Distance#

Mahalanobis distance is a metric used to measure the distance between two vectors. It takes into account the correlations between the components. Let's assume we have two vectors X and Y with equal lengths, meaning each vector contains n components. The Mahalanobis distance between them can be calculated using the following formula:

$D_{M}(X,Y) = \sqrt{(X-Y)^T S^{-1} (X-Y)}$

where:

  • $D_{M}(X,Y)$ represents the Mahalanobis distance between vectors X and Y.

  • X and Y are two vectors with a length of n.

  • S is the covariance matrix of size n x n.

The formula for Mahalanobis distance resembles that of Euclidean distance, but it takes into account the correlations between components. If the covariance matrix is the identity matrix, Mahalanobis distance reduces to Euclidean distance. Because it captures these correlations, Mahalanobis distance is widely used in fields where they matter, such as financial risk management, speech recognition, and image recognition.

It is important to note that Mahalanobis distance is most meaningful when the components approximately follow a multivariate normal distribution, and it requires the covariance matrix to be positive definite (and therefore invertible). If these conditions are not met, Mahalanobis distance may give misleading results. In addition, Mahalanobis distance is affected by estimation error in the covariance matrix: in practice the covariance matrix is estimated from sample data, so the size and quality of the sample affect the accuracy of the distance.
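A minimal sketch, assuming the covariance matrix is estimated from a small hypothetical sample matrix (rows are observations); the inverse exists here because the estimated covariance is positive definite:

```python
import numpy as np

# Hypothetical sample matrix: rows are observations, columns are components
samples = np.array([[1.0, 2.0],
                    [2.0, 1.0],
                    [3.0, 4.0],
                    [4.0, 3.0]])
S = np.cov(samples, rowvar=False)  # 2 x 2 sample covariance matrix
S_inv = np.linalg.inv(S)           # assumes S is positive definite

x, y = samples[0], samples[2]
diff = x - y
# Mahalanobis distance: sqrt of (x - y)^T S^{-1} (x - y)
d_m = np.sqrt(diff @ S_inv @ diff)
print(d_m)  # sqrt(3) ~= 1.732
```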
