Estimation of pLQ

This section describes how probabilistic location quotients (pLQ) are estimated with a k-nearest-neighbor (knn) algorithm. The goal is to estimate $Prob(LQ_{cp, t+1} > 1)$ given $\log(s)$ and $\log(S_c S_p / S_W)$, dropping the $cp$ index for clarity. The task is akin to estimating a density: what fraction of the points in a given region of the ($\log(s)$, $\log(S_c S_p / S_W)$) plane satisfy $LQ_{cp, t+1} > 1$?
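The "fraction of points in a region" idea can be written compactly. The notation below is assumed for illustration and is not taken from the text: $x$ stands for the point $(\log(s), \log(S_c S_p / S_W))$, $R(x)$ for the set of observations in a region around $x$, and $\mathbb{1}[\cdot]$ for the indicator function.

```latex
\[
  \widehat{pLQ}(x)
  = \frac{1}{|R(x)|} \sum_{i \in R(x)} \mathbb{1}\!\left[ LQ_{i,\,t+1} > 1 \right]
  \approx Prob\!\left( LQ_{t+1} > 1 \mid x \right)
\]
```

The estimators discussed in this section differ only in how $R(x)$ is chosen: a fixed bin containing $x$, or the $k$ observations nearest to $x$.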

Methodology

Several approaches exist for this estimation, including partitioning the space into bins and calculating the fraction of observations with $LQ_{t+1} > 1$ within each bin. Here the knn method is chosen for its efficiency. The $k$ nearest points in the feature space ($\log(s)$, $\log(S_c S_p / S_W)$) are used to compute the fraction with $LQ_{t+1} > 1$, and that fraction is assigned to the probed point. A significant advantage of knn is the fixed size of the neighbor set ($k = 200$ in both datasets), which keeps the estimate statistically robust in sparse regions while allowing finer resolution in dense ones.
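The knn step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the Appendix or from the binet package: the function name `knn_plq`, the brute-force search, the unscaled Euclidean distance, and the toy data (where a point counts as a "hit" whenever its first coordinate exceeds its second) are all assumptions made for the example.

```python
import heapq
import math
import random


def knn_plq(train_x, train_hit, query, k=200):
    """Fraction of the k nearest training points with LQ_{t+1} > 1.

    train_x   : list of (x1, x2) pairs, assumed to hold
                (log(s), log(S_c S_p / S_W)) for each observation
    train_hit : list of booleans, True where LQ_{t+1} > 1
    query     : (x1, x2) point at which pLQ is estimated
    """
    k = min(k, len(train_x))
    # Brute-force Euclidean distance from the query to every training point.
    dists = (
        (math.hypot(x1 - query[0], x2 - query[1]), hit)
        for (x1, x2), hit in zip(train_x, train_hit)
    )
    nearest = heapq.nsmallest(k, dists, key=lambda pair: pair[0])
    # Booleans sum as 0/1, so this is the share of hits among the neighbors.
    return sum(hit for _, hit in nearest) / k


# Toy data (hypothetical): uniform points, hit whenever x1 > x2.
random.seed(0)
train_x = [(random.uniform(-3, 3), random.uniform(-3, 3)) for _ in range(5000)]
train_hit = [x1 > x2 for x1, x2 in train_x]

p_far_above = knn_plq(train_x, train_hit, (2.0, -2.0))   # deep in the hit region
p_far_below = knn_plq(train_x, train_hit, (-2.0, 2.0))   # deep in the miss region
p_boundary = knn_plq(train_x, train_hit, (0.0, 0.0))     # on the boundary
```

Because the neighbor set always contains $k$ points, the neighborhood shrinks where observations are dense and grows where they are sparse. For datasets of realistic size, a tree-based or library neighbor search would replace the brute-force loop.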

Binning and Uncertainty

To extract pLQ from observations, the data are partitioned into quantiles of $\log(LQ)$ (typically 75), and each of these is further divided into quantiles of $T$ (also typically 75). The result is a 2D binning with an equal number of observations per bin: $75^2 = 5625$ bins with almost 240 points each. Varying this binning provides a measure of uncertainty for the computed pLQ values. Below is a sample of computed pLQ values and their uncertainty, obtained from 9 different 2D binnings:

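| ccode | pcode | year | $\mu(pLQ)$ | $\sigma(pLQ)$ |
|-------|-------|------|------------|---------------|
| mus   | 2505  | 2012 | 0.249      | 0.026         |
| cyp   | 2807  | 2009 | 0.026      | 0.008         |
| mys   | 8705  | 2010 | 0.003      | 0.002         |
| ita   | 5907  | 2013 | 0.433      | 0.031         |
| mar   | 3808  | 2007 | 0.006      | 0.005         |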
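The nested quantile binning and the spread across binnings can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce the table: the text does not say how the 9 binnings differ, so the example simply varies the number of quantiles along each axis over a 3 x 3 grid. The function name `binned_plq`, the toy data, and the small quantile counts (8, 10, 12 rather than values near 75) are likewise hypothetical.

```python
import math
import random
import statistics


def binned_plq(log_lq, t, hit, n1, n2):
    """pLQ of every observation under one n1 x n2 nested quantile binning.

    Observations are cut into n1 quantiles of log_lq; each slice is then
    cut into n2 quantiles of t, so all bins hold (almost) the same count.
    hit[i] is True where LQ_{t+1} > 1 for observation i.
    """
    def chunks(seq, n):
        # n contiguous chunks whose sizes differ by at most one
        return [seq[i * len(seq) // n:(i + 1) * len(seq) // n] for i in range(n)]

    plq = [0.0] * len(hit)
    by_log_lq = sorted(range(len(hit)), key=lambda i: log_lq[i])
    for lq_slice in chunks(by_log_lq, n1):
        for cell in chunks(sorted(lq_slice, key=lambda i: t[i]), n2):
            if not cell:
                continue
            frac = sum(hit[i] for i in cell) / len(cell)
            for i in cell:
                plq[i] = frac  # every observation inherits its bin's fraction
    return plq


# Toy data (hypothetical): the chance of a hit rises with log_lq.
random.seed(1)
n = 5000
log_lq = [random.gauss(0, 1) for _ in range(n)]
t = [random.uniform(0, 1) for _ in range(n)]
hit = [random.random() < 1 / (1 + math.exp(-2 * a)) for a in log_lq]

# Assumed variation scheme: 3 x 3 = 9 binnings with different quantile counts.
grids = [(a, b) for a in (8, 10, 12) for b in (8, 10, 12)]
runs = [binned_plq(log_lq, t, hit, a, b) for a, b in grids]

# Per-observation mean and standard deviation across the 9 binnings.
mu = [statistics.mean(col) for col in zip(*runs)]
sigma = [statistics.stdev(col) for col in zip(*runs)]
```

Each observation ends up with a pair (`mu[i]`, `sigma[i]`), which plays the role of the $\mu(pLQ)$ and $\sigma(pLQ)$ columns in the table above.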

Visualization

The results are shown in Figure 1. The left panels plot the estimated pLQ as a function of $\log(LQ)$ ($y_1$) and the size factor ($y_2$); the right panels plot it against the observed value ($x_1 = \log(s)$) and the expectation ($x_2$). The top panels show trade data and the bottom panels show patent data. The term probabilistic location quotient (pLQ) is used both for the ideal continuous probability function and for its estimates from empirical data.

Additional Resources

To replicate this estimation, a minimal Python code snippet is available in the Appendix. It estimates pLQ from a dataset of observations and shows the essential details of how the knn regressor is used. The binet Python package includes the computepLQ() function, whose source code is available for verification. A minimal working example is accessible in this notebook and requires only basic familiarity with Python and IPython notebooks.

Footnotes:

  1. For a comprehensive review of the knn method in the context of other regressor algorithms, see Wu (2008). An early discussion of the knn concept can be found in Loftsgaarden (1965).
  2. A minimal snippet of Python code for estimating pLQ is included in the Appendix.