In regression analyses, we often ask how much covariates matter for explaining the relationship between a target variable \(D\) and an outcome variable \(Y\).
For example, we might start analysing the gender wage gap with a simple regression of log(wage) on gender. But arguably, men and women differ in many socio-economic characteristics: they might have different (average) levels of education or career experience, and they might select into different, higher- or lower-paying, industries. So what fraction of the gender wage gap can be explained by these observable characteristics?
In this notebook, we will compute and decompose the gender wage gap based on a subset of the PSID data set, using a method commonly known as the “Gelbach Decomposition” (Gelbach, JoLE 2016).
We start with loading a subset of the PSID data provided by the AER R package.
import re
import pandas as pd
import pyfixest as pf

psid = pd.read_csv(
    "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/refs/heads/master/csv/AER/PSID7682.csv"
)
psid["experience"] = pd.Categorical(psid["experience"])
psid["year"] = pd.Categorical(psid["year"])
psid.head()
|   | rownames | experience | weeks | occupation | industry | south | smsa | married | gender | union | education | ethnicity | wage | year | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 32 | white | no | yes | no | yes | male | no | 9 | other | 260 | 1976 | 1 |
| 1 | 2 | 4 | 43 | white | no | yes | no | yes | male | no | 9 | other | 305 | 1977 | 1 |
| 2 | 3 | 5 | 40 | white | no | yes | no | yes | male | no | 9 | other | 402 | 1978 | 1 |
| 3 | 4 | 6 | 39 | white | no | yes | no | yes | male | no | 9 | other | 402 | 1979 | 1 |
| 4 | 5 | 7 | 42 | white | yes | yes | no | yes | male | no | 9 | other | 429 | 1980 | 1 |
Running a first simple regression of log(wage) on gender, we find that men earn on average 0.474 log points more than women.
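As a minimal sketch, this raw gap can be obtained from a bivariate regression, reusing the psid data frame loaded above (the helper column lwage is introduced here for convenience and is not part of the original data):

```python
import numpy as np

# log wage as the outcome; 'lwage' is a helper column added for this sketch
psid["lwage"] = np.log(psid["wage"])

# short model: log(wage) on gender only
fit_short = pf.feols("lwage ~ gender", data=psid)
fit_short.summary()  # coefficient on gender[T.male]: roughly 0.474
```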
To examine the impact of observable characteristics on the relationship between wage and gender, a common strategy in applied research is to incrementally add covariates to the baseline regression of log(wage) on gender. Here, we will incrementally add the following covariates:
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)
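Such incremental specifications can be fit in a single call with pyfixest's cumulative stepwise syntax csw0(), which first estimates the baseline model and then adds the listed covariates one group at a time. The sketch below reuses the lwage helper column from above; the covariate ordering is an assumption made for illustration and may differ from the order used in the table:

```python
# csw0(): first the model without any of the listed covariates, then add
# them cumulatively, one at a time (the ordering here is an assumption)
fit_stepwise = pf.feols(
    "lwage ~ gender + csw0(education, ethnicity, experience, occupation, industry, year)",
    data=psid,
)
pf.etable(fit_stepwise)
```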
We see that the coefficient on gender is roughly the same in all models. Tentatively, we might already conclude that the observable characteristics in the data do not explain a large part of the gender wage gap.
But how much do differences in education matter? We have computed 6 additional models that contain education as a covariate. The obtained point estimates vary between \(0.059\) and \(0.075\). Which of these numbers should we report?
Additionally, note that while we have only computed 6 additional models with covariates, the number of possible models is much larger. If I did the math correctly, simply by additively and incrementally adding covariates, we could have computed \(57\) different models (not all of which would have included education as a control).
As it turns out, different models lead to different point estimates, and the order in which we incrementally add covariates might affect our conclusions. To illustrate this, we keep the same ordering as before, but start with ethnicity as our first variable:
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)
We obtain 5 new coefficients on education that vary between 0.074 and 0.059.
So, which share of the “raw” gender wage gap can be attributed to differences in education between men and women? Should we report a statistic based on the 0.075 estimate? Or on the 0.059 estimate? Which value should we pick?
To help us with this problem, Gelbach (2016, JoLE) develops a decomposition procedure building on the omitted variable bias formula that produces a single value for the contribution of a given covariate, say education, to the gender wage gap.
Notation and Gelbach’s Algorithm
Before we dive into a code example, let us first introduce the notation and Gelbach’s algorithm. We are interested in “decomposing” the effect of a variable \(X_{1} \in \mathbb{R}\) on an outcome \(Y \in \mathbb{R}\) into a part explained by covariates \(X_{2} \in \mathbb{R}^{k_{2}}\) and an unexplained part.
Thus we can specify two regression models:
The short model

\[
Y = X_{1} \beta_{1}^{\text{direct}} + u
\]

and the long (or full) model

\[
Y = X_{1} \beta_{1}^{\text{full}} + X_{2} \beta_{2} + e.
\]

By fitting the short regression, we obtain an estimate \(\hat{\beta}_{1}^{\text{direct}}\), which we will denote as the direct effect. By fitting the long regression, we obtain the full effect \(\hat{\beta}_{1}^{\text{full}}\), i.e. the estimate on \(X_{1}\) once all covariates are included, as well as an estimate of the covariate coefficients \(\hat{\beta}_{2} \in \mathbb{R}^{k_2}\).
We can then compute the contribution of an individual covariate \(\hat{\delta}_{k}\) via the following algorithm:
Step 1: we compute coefficients from \(k_{2}\) auxiliary regression models \(\hat{\Gamma}\) as \[
\hat{\Gamma} = (X_{1}'X_{1})^{-1} X_{1}'X_{2}
\]
In words, we regress each covariate in \(X_{2}\) on the target variable \(X_{1}\). In practice, we can do this in one line of code via scipy.linalg.lstsq().
Step 2: We can compute the total effect explained by the covariates, which we denote by \(\hat{\delta}\), as

\[
\hat{\delta} = \sum_{k=1}^{k_{2}} \hat{\Gamma}_{k} \hat{\beta}_{2,k},
\]
where \(\hat{\Gamma}_{k}\) is the coefficient from the auxiliary regression of covariate \(X_{2,k}\) on \(X_{1}\) and \(\hat{\beta}_{2,k}\) is the associated estimate on \(X_{2,k}\) from the full model.
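Note that \(\hat{\delta}\) is nothing else than the omitted variable bias of the short model, i.e. the gap between the direct and the full effect:

\[
\hat{\beta}_{1}^{\text{direct}} - \hat{\beta}_{1}^{\text{full}} = (X_{1}'X_{1})^{-1} X_{1}'X_{2} \, \hat{\beta}_{2} = \hat{\Gamma} \hat{\beta}_{2} = \hat{\delta}.
\]

In the decomposition output further below, this identity shows up as explained_effect = direct_effect - full_effect (0.47447 - 0.41034 = 0.06413).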
The individual contribution of covariate \(k\) is then defined as

\[
\hat{\delta}_{k} = \hat{\Gamma}_{k} \hat{\beta}_{2,k}.
\]
After having obtained \(\hat{\delta}_{k}\) for each covariate \(k\), we can easily aggregate multiple variables into a single group of interest. For example, if \(X_{2}\) contains a set of dummies for industry fixed effects, we can compute the explained part of “industry” by summing over all of its dummies:

\[
\hat{\delta}_{\text{industry}} = \sum_{k \in \text{industry}} \hat{\delta}_{k}.
\]
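Before turning to the built-in implementation in pyfixest, here is a small numerical sketch of Steps 1 and 2 on simulated data (the data-generating process and all variable names are made up purely for illustration):

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.default_rng(0)
n = 10_000

# Simulated data: a scalar target variable X1 and two covariates X2 = (Z1, Z2)
X1 = rng.normal(size=(n, 1))
X2 = X1 * np.array([0.5, -0.3]) + rng.normal(size=(n, 2))
y = 1.0 * X1[:, 0] + X2 @ np.array([0.8, 0.4]) + rng.normal(size=n)

# Short model: y on X1 only -> direct effect
beta1_direct, *_ = lstsq(X1, y)

# Long model: y on (X1, X2) -> full effect and covariate coefficients beta2
coefs_full, *_ = lstsq(np.column_stack([X1, X2]), y)
beta1_full, beta2 = coefs_full[0], coefs_full[1:]

# Step 1: auxiliary regressions of each covariate on X1, in one lstsq call
Gamma, *_ = lstsq(X1, X2)        # shape (1, k2)

# Step 2: individual contributions and the total explained effect
delta_k = Gamma.ravel() * beta2  # delta_k = Gamma_k * beta_{2,k}
delta = delta_k.sum()            # grouping = summing the relevant delta_k

# Check the omitted variable bias identity: direct = full + explained
print(beta1_direct[0], beta1_full + delta)
```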
After fitting the full model, we can run the decomposition by calling the decompose() method. The only required argument is the target parameter, which in this case is “gender”. Inference is conducted via a non-parametric bootstrap and can optionally be turned off.
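In code, this might look roughly as follows. Treat it as a sketch: the full-model formula is reconstructed from the covariates appearing in the decomposition output below, it reuses the lwage helper column from the earlier sketch, and the exact decompose() argument names (e.g. param) and bootstrap options should be checked against the pyfixest documentation.

```python
# Long (full) model: gender plus all control variables
# (experience and year were converted to categoricals above)
fit = pf.feols(
    "lwage ~ gender + ethnicity + education + experience + occupation + industry + year",
    data=psid,
)

# Gelbach decomposition of the gender coefficient; inference is based on a
# non-parametric bootstrap by default (see the docs for how to switch it off)
gb = fit.decompose(param="gender")
```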
As before, this produces a pretty big output table that reports

- the direct effect from the regression of log(wage) on gender,
- the full effect of gender on log(wage) from the full regression with all control variables,
- the explained effect, i.e. the difference between the direct and the full effect,
- a single scalar value for the individual contribution of each covariate to the overall explained effect.
For our example at hand, the additional covariates explain only a small fraction of the difference in log wages between men and women: 0.064 log points. Of these, around one third can be attributed to ethnicity, 0.00064 to years of education, and so on.
pf.make_table(gb)
|   | direct_effect | full_effect | explained_effect |
|---|---|---|---|
| gender[T.male] | 0.47447 [0.45957, 0.50398] | 0.41034 [0.39909, 0.43440] | 0.06413 [0.04599, 0.10157] |
| ethnicity[T.other] |  |  | 0.02275 [0.01928, 0.02913] |
| education |  |  | 0.00064 [-0.00669, 0.00551] |
| experience[T.2] |  |  | -0.00030 [-0.00038, 0.00020] |
| experience[T.3] |  |  | -0.00234 [-0.00545, 0.00068] |
| experience[T.4] |  |  | -0.00351 [-0.00664, -0.00137] |
| experience[T.5] |  |  | -0.00307 [-0.00418, -0.00210] |
| experience[T.6] |  |  | -0.00103 [-0.00388, 0.00065] |
| experience[T.7] |  |  | 0.00000 [0.00023, 0.00663] |
| experience[T.8] |  |  | 0.00184 [0.00048, 0.00772] |
| experience[T.9] |  |  | -0.00338 [-0.00394, 0.00290] |
| experience[T.10] |  |  | -0.00438 [-0.00822, -0.00434] |
| experience[T.11] |  |  | -0.00185 [-0.01007, 0.00476] |
| experience[T.12] |  |  | -0.00146 [-0.00367, 0.00006] |
| experience[T.13] |  |  | -0.00588 [-0.00753, 0.00256] |
| experience[T.14] |  |  | -0.00961 [-0.01220, -0.00514] |
| experience[T.15] |  |  | -0.00878 [-0.01349, 0.00014] |
| experience[T.16] |  |  | -0.00551 [-0.01104, 0.00113] |
| experience[T.17] |  |  | -0.00020 [-0.00239, 0.00118] |
| experience[T.18] |  |  | -0.00220 [-0.00425, -0.00007] |
| experience[T.19] |  |  | -0.00551 [-0.00811, -0.00134] |
| experience[T.20] |  |  | -0.00789 [-0.01951, -0.00011] |
| experience[T.21] |  |  | -0.00358 [-0.01253, -0.00033] |
| experience[T.22] |  |  | -0.00611 [-0.01149, -0.00335] |
| experience[T.23] |  |  | -0.00225 [-0.00732, 0.00394] |
| experience[T.24] |  |  | -0.00347 [-0.00518, 0.00302] |
| experience[T.25] |  |  | -0.00132 [-0.00329, 0.00276] |
| experience[T.26] |  |  | 0.00190 [0.00097, 0.01047] |
| experience[T.27] |  |  | 0.00985 [0.00718, 0.01245] |
| experience[T.28] |  |  | 0.00936 [0.00539, 0.00755] |
| experience[T.29] |  |  | 0.01263 [0.00961, 0.01783] |
| experience[T.30] |  |  | 0.01217 [0.00692, 0.01191] |
| experience[T.31] |  |  | 0.01377 [0.00599, 0.01641] |
| experience[T.32] |  |  | 0.01221 [0.00656, 0.01189] |
| experience[T.33] |  |  | 0.01177 [0.01226, 0.01842] |
| experience[T.34] |  |  | 0.01102 [0.00619, 0.01278] |
| experience[T.35] |  |  | 0.00898 [0.00555, 0.01469] |
| experience[T.36] |  |  | 0.00672 [0.00553, 0.00988] |
| experience[T.37] |  |  | 0.00596 [0.00142, 0.01117] |
| experience[T.38] |  |  | 0.00627 [0.00415, 0.00908] |
| experience[T.39] |  |  | 0.00349 [-0.00051, 0.00313] |
| experience[T.40] |  |  | 0.00163 [-0.00053, 0.00187] |
| experience[T.41] |  |  | -0.00006 [-0.00139, 0.00112] |
| experience[T.42] |  |  | -0.00020 [-0.00091, 0.00032] |
| experience[T.43] |  |  | 0.00015 [-0.00114, 0.00126] |
| experience[T.44] |  |  | -0.00158 [-0.00234, -0.00051] |
| experience[T.45] |  |  | -0.00172 [-0.00460, -0.00008] |
| experience[T.46] |  |  | -0.00088 [-0.00109, -0.00000] |
| experience[T.47] |  |  | -0.00031 [-0.00005, 0.00129] |
| experience[T.48] |  |  | -0.00057 [0.00004, 0.00093] |
| experience[T.49] |  |  | -0.00098 [-0.00261, 0.00011] |
| experience[T.50] |  |  | -0.00071 [-0.00152, 0.00024] |
| experience[T.51] |  |  | -0.00003 [-0.00026, 0.00000] |
| occupation[T.white] |  |  | -0.01649 [-0.01603, -0.01136] |
| industry[T.yes] |  |  | 0.01818 [0.01259, 0.02040] |
| year[T.1977] |  |  | -0.00000 [-0.00100, 0.00019] |
| year[T.1978] |  |  | 0.00000 [-0.00120, 0.00534] |
| year[T.1979] |  |  | 0.00000 [-0.00835, -0.00024] |
| year[T.1980] |  |  | 0.00000 [-0.00264, 0.00760] |
| year[T.1981] |  |  | 0.00000 [-0.00339, 0.00949] |
| year[T.1982] |  |  | 0.00000 [-0.00999, 0.01238] |
Because experience is a categorical variable, the table gets pretty unwieldy: we obtain one estimate for each of its levels. Luckily, Gelbach’s decomposition allows us to group individual contributions into a single number. In the decompose() method, we can combine variables via the combine_covariates argument:
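For example, a sketch of such a call, assuming that combine_covariates accepts a dictionary mapping group names to compiled regular expressions or lists of covariate names (which would also explain the import re at the top of the notebook); please check the decompose() documentation for the exact accepted types:

```python
# Collapse all dummy levels of a categorical into a single contribution per group.
# The use of compiled regular expressions here is an assumption -- lists of
# covariate names may work as well; see the pyfixest docs.
gb_grouped = fit.decompose(
    param="gender",
    combine_covariates={
        "ethnicity": re.compile("ethnicity"),
        "education": re.compile("education"),
        "experience": re.compile("experience"),
        "occupation": re.compile("occupation"),
        "industry": re.compile("industry"),
        "year": re.compile("year"),
    },
)
pf.make_table(gb_grouped)
```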