Skip to main content

Empirical Distributions

Empirical Distributions

The distribution of values x1=log(scp)x_1 = \log(s_{cp}) significantly influences the distribution of row and column sums, and consequently affects x2=log(ScSp/Sw)x_2 = \log(S_c S_p / S_w). More specifically, a bell-shaped distribution of the log values x1=log(scp)x_1 = \log(s_{cp}) results in a bell-shaped distribution of log(Sc)\log(S_c) and log(Sp)\log(S_p) values, and thus x2x_2 values. This leads to a bell-shaped distribution of log(LQ)=x1x2\log(LQ) = x_1 - x_2 values.

We confirm this intuition with empirical data. The distributions of x1x_1, x2x_2, and log(LQ)=x1x2\log(LQ) = x_1 - x_2 are illustrated in Figure 1.

Distribution of x_1 = \log(s_{cp}), x_2 = \log(S_c S_p / S_w), and \log(LQ) = x_1 - x_2.

Figure 1: Distribution of x1=log(scp)x_1 = \log(s_{cp}), x2=log(ScSp/Sw)x_2 = \log(S_c S_p / S_w), and log(LQ)=x1x2\log(LQ) = x_1 - x_2. The cut at log(s)=3\log(s) = 3 reveals the lower bound in scp=1000s_{cp} = 1000 US$ in our exports data. These distributions of log variables are much 'better behaved' than those of linear LQ, scps_{cp}, or ScSp/SwS_c S_p / S_w. The distribution of the number of patents and business counts is largely constrained to natural numbers, frequently showing small values. This feature explains the fragmented appearance of its histogram and scatter plots in Figure 2.

Distributions of observed points in log scale.

Figure 2: Distributions of observed points in log scale. The dots represent actual data points in our dataset. Plots on the top are for trade data, middle for patent counts data, bottom for business counts data. Left: points in ((x1+x2)/2,log(LQ)=x1x2(x_1 + x_2) / 2, \log(LQ) = x_1 - x_2) coordinates. Right: points in (x2,x1x_2, x_1) coordinates, with x1=log(scp)x_1 = \log(s_{cp}) and x2=log(ScSp/Sw)x_2 = \log(S_c S_p/S_w). Left plots can be transformed to right plots by applying a linear transformation. Dark lines indicate axes log(LQ)\log(LQ), x1x_1, x2x_2. A log transformation lets all observations form a bounded cloud.

Discussion on Empirical Settings

In the context of patent data, even with fractional counting, small natural numbers are quite frequent. A similar pattern is observed with business counts data, as shown in the middle and right plots of Figure 1. Despite the fragmented nature of the distribution of raw values, the rows, column sums, and the x2x_2 factor are well-distributed in logs, mitigating potential issues.

For exports data, there are additional features to consider. A lower bound of 10310^3 US$ is evident in our data (sharp edge on the distribution of values, left plot in Figure 1). This primarily affects the lower end of log-distributed values, and is not expected to critically impact row and column sums.

Handling Null Values

When applying a logarithmic transformation, null values can pose challenges depending on the task. In export data, null values might represent small values like US$ 500, or indicate absent activities in a region. For decisions regarding the LQ>1LQ > 1 condition, null values should be considered below the LQ=1LQ = 1 threshold.

In regression analyses using LQ values, addressing null values requires careful consideration due to potential distortions. Options include replacing zero values with a "very small" value, omitting null observations, or using an interaction with a dummy variable for null values. The choice depends on the specific application context.

Conceptual Interpretation of Null Entries

Null entries in cross sections often indicate either small volumes (as in truncated export data) or absences. Absences can be mild, such as intermittent exports, or deep, resulting from sparse records in detailed cross sections. Researchers must decide how to handle these null entries based on their specific tasks. However, the presence of null entries should not necessarily deter the application of log transformations on raw data.

This paper focuses on the LQ=1LQ = 1 threshold and observations fluctuating around it, ensuring that the undefined nature of the log of null values does not hinder our analyses.


Footnote: Fractional counting allows values to exist between natural numbers and below one.