The Similarity Measures

When dealing with a matrix $X_{(n\times p)}$, it is often important to determine whether its columns or rows exhibit any relationships. This can be assessed using various association coefficients, such as the matrix product $X^{T} X$. Other measures include Pearson correlation, cosine similarity, and covariance. Let's explore these measures in detail for a pair of columns $X_j = \left( x_{1j}, \dots, x_{ij}, \dots, x_{nj} \right)^T$ and $X_{j'}$.

Pearson Correlation

The Pearson correlation between columns $j$ and $j'$ is defined as:

$$Corr(j, j') = r_{jj'} = \frac{\sum\limits_{i=1}^{n} (x_{ij} - \bar{X_j})(x_{ij'} - \bar{X_{j'}})} {||X_{j}-\bar{X_j}|| \, ||X_{j'}-\bar{X_{j'}}|| }$$

Here, $j, j'$ index a pair of columns (e.g., a pair of industries), $\bar{X_j}$ denotes the mean of column $j$, and the Euclidean norm is defined as $||X_j - \bar{X_j}||=\sqrt{\sum_{i=1}^{n}(x_{ij}-\bar{X_j})^2 }$.
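As a minimal sketch of the definition above, the following NumPy snippet computes the Pearson correlation of two columns directly from the centered dot product and the two norms. The function name `pearson_corr` and the toy vectors `x_j` and `x_k` are illustrative assumptions, not part of the original text.

```python
import numpy as np

def pearson_corr(a, b):
    """Pearson correlation of two equal-length 1-D arrays (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da = a - a.mean()  # center the first column
    db = b - b.mean()  # center the second column
    # Centered dot product divided by the product of Euclidean norms.
    return float(da @ db / (np.linalg.norm(da) * np.linalg.norm(db)))

# Made-up toy data: n = 5 observations for two columns.
x_j = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
x_k = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

r = pearson_corr(x_j, x_k)
print(r)  # approximately 0.8 for this toy data
```

For these vectors the centered dot product is 16 and the product of the norms is 20, so the result should agree with `np.corrcoef(x_j, x_k)[0, 1]` up to floating-point rounding.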

Cosine Similarity

Cosine similarity is given by:

$$CosSim(j, j') = \frac{\sum\limits_{i=1}^{n} x_{ij} \ x_{ij'}} {||X_j|| \, ||X_{j'}||}$$

Note that $Corr(X_{j}, X_{j'}) = CosSim(X_{j}-\bar{X_j}, X_{j'}-\bar{X_{j'}})$: the Pearson correlation is the cosine similarity of the centered columns.
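The relationship between the two measures can be checked numerically. The sketch below assumes NumPy; `cos_sim` and the toy vectors are hypothetical names introduced only for illustration. It compares the cosine similarity of the raw columns, the cosine similarity of the centered columns, and the Pearson correlation reported by `np.corrcoef`.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity of two equal-length 1-D arrays (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up toy data: n = 5 observations for two columns.
x_j = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
x_k = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

raw = cos_sim(x_j, x_k)                                  # no centering
centered = cos_sim(x_j - x_j.mean(), x_k - x_k.mean())   # center first
pearson = np.corrcoef(x_j, x_k)[0, 1]                    # NumPy's Pearson r

print(raw, centered, pearson)
```

On this toy data the raw cosine similarity is 106/110 (about 0.96), while the centered version coincides with the Pearson correlation (about 0.8), so the two measures differ only through the centering step.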

Sample Covariance

The sample covariance is calculated as:

$$Cov(j, j') = \frac{1}{n-1} \sum\limits_{i=1}^{n} (x_{ij} - \bar{X_j})(x_{ij'} - \bar{X_{j'}})$$

where $n$ is the number of observations. The divisor $n - 1$ gives the unbiased sample estimate; dividing by $n$ instead gives the population form.
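A short sketch of the covariance computation follows, assuming NumPy. Because the choice of divisor is a convention, the hypothetical helper `sample_cov` exposes it through a `ddof` argument (an assumption borrowed from NumPy's naming): `ddof=1` divides by `n - 1` and `ddof=0` divides by `n`.

```python
import numpy as np

def sample_cov(a, b, ddof=1):
    """Covariance of two equal-length 1-D arrays (hypothetical helper).

    ddof=1 divides by n - 1 (sample form); ddof=0 divides by n.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = a.shape[0]
    # Sum of products of deviations from the column means.
    return float(((a - a.mean()) * (b - b.mean())).sum() / (n - ddof))

# Made-up toy data: n = 5 observations for two columns.
x_j = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
x_k = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

cov_sample = sample_cov(x_j, x_k, ddof=1)      # sum of products 16, divided by 4
cov_population = sample_cov(x_j, x_k, ddof=0)  # sum of products 16, divided by 5
print(cov_sample, cov_population)
```

With the default settings, `np.cov(x_j, x_k)[0, 1]` uses the `n - 1` divisor and should match `cov_sample`.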

These measures are interrelated, as evident from their formulas. In special cases, the matrix product $X^{T} X$, covariance matrix, cosine similarity, or Pearson correlation may become identical to one another.

If the column variables are centered (mean is zero), the covariance matrix is $Cov(Y) = Y^{T} Y / (n - 1)$, with $Y = X - \bar X$. When columns are z-standardized (demeaned and divided by the sample standard deviation), the Pearson correlation matrix equals the covariance matrix of the standardized data: $Corr(Z) = Z^{T} Z / (n - 1)$ with $Z = (X - \bar X)/ std(X)$. If columns are unit-scaled (sum of squares is 1), the cosine similarity matrix is $Cossim(V) = V^{T} V$ with $V = X / ||X||$, where each column is divided by its own norm. Centering before unit scaling, i.e., $W = (X - \bar X) / ||X - \bar X||$, makes the Pearson correlation matrix equal to the cosine similarity matrix: $Corr(X) = W^{T} W$.
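These special cases can be verified on a small random matrix. The sketch below assumes NumPy and uses synthetic data; the seed, the shape, and the variable names are arbitrary choices for illustration. All scaling is done column by column, and the standard deviation uses the `n - 1` divisor (`ddof=1`) so that it matches the `n - 1` in the matrix formulas.

```python
import numpy as np

rng = np.random.default_rng(0)                    # arbitrary seed
X = rng.normal(loc=5.0, scale=2.0, size=(50, 3))  # synthetic, uncentered data
n = X.shape[0]

# Centered columns: Y^T Y / (n - 1) is the covariance matrix.
Y = X - X.mean(axis=0)
cov_from_Y = Y.T @ Y / (n - 1)

# z-standardized columns: Z^T Z / (n - 1) is the correlation matrix.
Z = Y / X.std(axis=0, ddof=1)
corr_from_Z = Z.T @ Z / (n - 1)

# Unit-scaled columns: V^T V is the cosine similarity matrix.
V = X / np.linalg.norm(X, axis=0)
cossim_from_V = V.T @ V

# Centered, then unit-scaled: W^T W equals the correlation matrix.
W = Y / np.linalg.norm(Y, axis=0)
corr_from_W = W.T @ W

print(np.allclose(cov_from_Y, np.cov(X, rowvar=False)))
print(np.allclose(corr_from_Z, np.corrcoef(X, rowvar=False)))
print(np.allclose(corr_from_W, corr_from_Z))
```

Passing `rowvar=False` tells `np.cov` and `np.corrcoef` that variables are stored in columns, which matches the layout of $X$ used here.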

In practice, empirical data ($X$) often do not meet these special conditions, leading to differences among these measures. For instance, when counting populations or total nominal values of output or trade, the matrix $X$ is typically not centered. While normalization or standardization could be applied, it is crucial to evaluate whether such transformations are justified in the specific empirical context.
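To illustrate how far the measures can diverge on uncentered data, here is a small sketch with made-up count-like columns (hypothetical values, not taken from any dataset). Both columns are large and positive, so their cosine similarity is close to 1, even though their deviations from the mean move in exactly opposite directions.

```python
import numpy as np

# Made-up, strictly positive "count" data; the deviations of b from its mean
# are the exact negatives of the deviations of a from its mean.
a = np.array([100.0, 102.0, 98.0, 101.0, 99.0])
b = np.array([200.0, 198.0, 202.0, 199.0, 201.0])

# Cosine similarity on the raw (uncentered) columns.
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pearson correlation, which centers the columns first.
pearson = float(np.corrcoef(a, b)[0, 1])

print(cosine, pearson)
```

Here the cosine similarity stays above 0.99 while the Pearson correlation is -1 up to rounding, so which measure is appropriate depends on whether the level of the data or only its variation around the mean is of interest.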

By understanding these similarity measures, we can better analyze the relationships within data matrices, providing valuable insights into the structure of the data.