Similarity Measures in US Counties

Exploring Key Similarity Measures

Among the various similarity measures explored, two will be used further in this paper as references for the entire family of US county-based similarity measures. These are:

  • Pearson correlation of $\log(X)$
  • Cosine similarity of $X$

Here, $X$ is the matrix of observed employment levels or, alternatively, of the number of establishments, by US county and 4-digit NAICS activity.

Pearson Correlation of $\log(X)$

The first measure, Pearson correlation, is justified because the distribution of values in the rows and columns of $X$ tends to become nearly Gaussian, or at least well-behaved, under the transformation $\log(X)$. Pearson correlation is a sensible choice once the matrix values follow a distribution closer to normal.
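As a minimal sketch of this first measure, the snippet below correlates the log-transformed columns of a small matrix. The function name, the toy numbers, and the choice to reject non-positive cells are illustrative assumptions; the text does not say how zero cells are treated before taking logs.

```python
import numpy as np

def log_pearson_similarity(X):
    """Pearson correlation between the columns (industries) of log(X).

    X has shape (n_counties, n_industries). Assumption of this sketch:
    all entries are strictly positive, so that log(X) is defined.
    """
    X = np.asarray(X, dtype=float)
    if np.any(X <= 0):
        raise ValueError("this sketch assumes strictly positive entries")
    # rowvar=False: each column (industry) is one variable
    return np.corrcoef(np.log(X), rowvar=False)

# Hypothetical toy data: 5 counties x 3 industries
X = np.array([
    [10.0,  21.0, 500.0],
    [40.0,  79.0,  30.0],
    [ 5.0,  11.0, 120.0],
    [80.0, 165.0,  60.0],
    [20.0,  38.0, 900.0],
])
S_log = log_pearson_similarity(X)  # (3, 3) industry-by-industry matrix
```

In this toy matrix the second column is roughly twice the first, so that pair receives a correlation close to 1.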

Consider two industries, $X$ and $Y$, such that their employment levels satisfy $\mathrm{Corr}(\log E_x, \log E_y) \approx 1$. This implies $\log E_y \sim a \log(E_x) + b$, where $a$ and $b$ are the real coefficients of a line. Consequently, $E_y \sim e^{b} E_x^{a}$. In cases of high correlation (all pairs with correlation higher than 0.85), we can fit these linear regressions and find that $a \approx 1$ in all cases, and $b \approx 0$ with a standard deviation of 0.35. Overall, this indicates that a high correlation of the log variables suggests that the employment levels of the pair of industries are roughly proportional to each other.
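The log-log regression described above can be sketched as follows. The data are synthetic and generated so that one series is roughly proportional to the other; the helper name `fit_loglog`, the proportionality factor, and the noise level are assumptions chosen for illustration, not values from the analysis.

```python
import numpy as np

def fit_loglog(E_x, E_y):
    """Least-squares fit of log(E_y) = a * log(E_x) + b.

    Returns the slope a, the intercept b, and the Pearson correlation r
    of the two log series.
    """
    lx, ly = np.log(E_x), np.log(E_y)
    a, b = np.polyfit(lx, ly, deg=1)  # coefficients: slope first, then intercept
    r = np.corrcoef(lx, ly)[0, 1]
    return a, b, r

# Synthetic example: E_y is about 1.2 * E_x with multiplicative noise
rng = np.random.default_rng(0)
E_x = np.exp(rng.normal(5.0, 1.5, size=500))
E_y = 1.2 * E_x * np.exp(rng.normal(0.0, 0.1, size=500))

a, b, r = fit_loglog(E_x, E_y)
# With a close to 1, E_y ~ exp(b) * E_x, i.e. the two series are proportional.
```

For data built this way the fitted slope is close to 1 and `np.exp(b)` recovers the proportionality factor, which is the pattern the regression results above describe for highly correlated pairs.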

Cosine Similarity of $X$

The focus on the second measure, cosine similarity, arises from a first-principles approach to the problem of coexistence of industry facilities. We will demonstrate in Section: Continuous Measures how cosine similarity can be used as a measure of actual coexistence (within a typical distance) of the locations belonging to a pair of industries.
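For reference, cosine similarity between industries can be computed directly from the untransformed matrix, as in the sketch below. The function name and toy values are hypothetical; the only assumption on the data is that no industry column is entirely zero.

```python
import numpy as np

def cosine_similarity_columns(X):
    """Cosine similarity between the columns (industries) of X.

    X has shape (n_counties, n_industries). Zero cells are allowed,
    but a column that is zero everywhere has no defined direction.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=0)
    if np.any(norms == 0):
        raise ValueError("an all-zero column has no defined cosine similarity")
    Xn = X / norms          # scale each column to unit length
    return Xn.T @ Xn        # entry (i, j) is cos(angle between columns i and j)

# Hypothetical toy data: 4 counties x 3 industries
X = np.array([
    [10.0, 20.0, 0.0],
    [ 0.0,  0.0, 5.0],
    [30.0, 60.0, 0.0],
    [ 5.0, 10.0, 1.0],
])
S_cos = cosine_similarity_columns(X)
```

Here the first two industries are present in the same counties in fixed proportion, so their similarity is 1, while the third industry sits mostly in a county the others are absent from and scores near 0.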

Linking Similarity Measures to Models

We now have two indicators of similarity that can be linked to models involving employment levels or spatial micro-foundations. Although we do not directly explore a link between the Pearson coefficient of the $\log$ variables and cosine similarity, these measures generally agree on identifying which pairs of industries show high similarity. They align with a larger family of measures capturing the same underlying characteristic: similarity by spatial distribution.
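One possible way to quantify how far the two measures agree is to compare how they rank the industry pairs. The sketch below does this with a rank correlation over the upper triangle of the two similarity matrices. The agreement statistic, the helper names, and the synthetic data are all assumptions made for illustration; the analysis does not specify this procedure.

```python
import numpy as np

def log_pearson(X):
    # Pearson correlation between columns of log(X); assumes X > 0
    return np.corrcoef(np.log(X), rowvar=False)

def cosine(X):
    # Cosine similarity between columns of X
    Xn = X / np.linalg.norm(X, axis=0)
    return Xn.T @ Xn

def rank_agreement(A, B):
    """Rank correlation between the off-diagonal pair values of A and B."""
    iu = np.triu_indices_from(A, k=1)        # each industry pair once
    ra = np.argsort(np.argsort(A[iu]))       # ranks of the pair values in A
    rb = np.argsort(np.argsort(B[iu]))       # ranks of the pair values in B
    return np.corrcoef(ra, rb)[0, 1]

# Synthetic positive data: 300 counties x 8 industries driven by 2 latent factors
rng = np.random.default_rng(1)
n_counties, n_industries = 300, 8
factors = rng.normal(size=(n_counties, 2))
loadings = rng.normal(size=(2, n_industries))
noise = 0.3 * rng.normal(size=(n_counties, n_industries))
X = np.exp(factors @ loadings + noise)

S_log, S_cos = log_pearson(X), cosine(X)
rho = rank_agreement(S_log, S_cos)  # value in [-1, 1]; higher means closer agreement
```

The ranking via a double `argsort` does not average ties, which is adequate for continuous similarity values such as these.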

In the rest of the analysis, we will use both these measures, computed for the variables 'employment level' and 'number of facilities'. The four outcomes are not exactly equivalent, but they provide a coherent account of spatial patterns by which economic activities are distributed across the US. Results change more significantly when altering the similarity measure than when changing the observed variable.