Let’s say you want to compute the pairwise distance between two sets of points, a
and b
, in Python. The technique works for an arbitrary number of points, but for simplicity make them 2D. Set a
has m
points giving it a shape of (m, 2)
and b
has n
points giving it a shape of (n, 2)
. How do you generate a (m, n)
distance matrix with pairwise distances?
The simplest thing you can do is call the distance_matrix
function in the SciPy spatial package:
import numpy as np
from scipy.spatial import distance_matrix
a = np.zeros((3, 2))
b = np.ones((4, 2))
distance_matrix(a, b)
This produces the following distance matrix:
array([[1.41421356, 1.41421356, 1.41421356, 1.41421356],
[1.41421356, 1.41421356, 1.41421356, 1.41421356],
[1.41421356, 1.41421356, 1.41421356, 1.41421356]])
Easy enough! But actually you can do the same thing without SciPy by leveraging NumPy’s broadcasting rules:
>>> np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
array([[1.41421356, 1.41421356, 1.41421356, 1.41421356],
[1.41421356, 1.41421356, 1.41421356, 1.41421356],
[1.41421356, 1.41421356, 1.41421356, 1.41421356]])
Why does this work? Because NumPy applies element-wise calculations when axes have the same dimension or when one of the axes can be expanded to match. It checks for matching dimensions by moving right to left through the axes.
None adds a new axis to a NumPy array. So a[:, None, :]
gives a (3, 1, 2)
view of a
and b[None, :, :]
gives a (1, 4, 2)
view of b
. The subtraction operation moves right to left. Any 2D point can be subtracted from another 2D point. The 4 dimensions from b
get expanded over the new axis in a
and then the 3 dimensions in a
get expanded over the first axis in b
. The result is a (3, 4, 2)
array with element-wise subtractions.
Then, we apply the L2 norm along the -1th axis (which is shorthand for the last axis). This gives us the Euclidean distance between each pair of points.
There are a few benefits to using the NumPy approach over the SciPy approach.
- You don’t need to install SciPy (which is kinda heavy).
- It works for other tensor packages that use NumPy broadcasting rules like PyTorch and TensorFlow.
- It works with any operation that can do reductions.
As an example of point 3, you can do pairwise Manhattan distance with the following:
>>> np.sum(np.abs(a[:, None, :] - b[None, :, :]), axis=-1)
array([[2., 2., 2., 2.],
[2., 2., 2., 2.],
[2., 2., 2., 2.]])
Becoming comfortable with this type of vectorized operation is an important way to get better at scientific computing!