A fixed effect model is a statistical model that includes fixed effects, which are parameters that are estimated to be constant across different groups.
Example [Panel Data]: In the context of panel data, fixed effects are parameters that are constant across different individuals or time. The typical model example is given by the following equation:
where \(Y_{it}\) is the dependent variable for individual \(i\) at time \(t\), \(X_{it}\) is the independent variable, \(\beta\) is the coefficient of the independent variable, \(\alpha_i\) is the individual fixed effect, \(\psi_t\) is the time fixed effect, and \(\varepsilon_{it}\) is the error term. The individual fixed effect \(\alpha_i\) is a parameter that is constant across time for each individual, while the time fixed effect \(\psi_t\) is a parameter that is constant across individuals for each time period.
Note however that, despite the fact that fixed effects are commonly used in panel setting, one does not need a panel data set to work with fixed effects. For example, cluster randomized trials with cluster fixed effects, or wage regressions with worker and firm fixed effects.
In this “quick start” guide, we will show you how to estimate a fixed effect model using the PyFixest package. We do not go into the details of the theory behind fixed effect models, but we focus on how to estimate them using PyFixest.
Read Sample Data
In a first step, we load the module and some synthetic example data:
import matplotlib.pyplot as pltimport pandas as pdfrom lets_plot import LetsPlotimport pyfixest as pffrom pyfixest.did.estimation import did2sfrom pyfixest.did.event_study import event_studyLetsPlot.setup_html()plt.style.use("seaborn-v0_8")%load_ext autoreload%autoreload 2%load_ext watermark%config InlineBackend.figure_format ="retina"%watermark --iversions
The autoreload extension is already loaded. To reload it, use:
%reload_ext autoreload
The watermark extension is already loaded. To reload it, use:
%reload_ext watermark
pyfixest : 0.20.2
pandas : 2.2.2
matplotlib: 3.8.4
We see that some of our columns have missing data.
OLS Estimation
We are interested in the relation between the dependent variable Y and the independent variables X1 using a fixed effect model for group_id. Let’s see how the data looks like:
We can estimate a fixed effects regression via the feols() function. feols() has three arguments: a two-sided model formula, the data, and optionally, the type of inference.
fit = pf.feols(fml="Y ~ X1 | group_id", data=data, vcov="HC1")type(fit)
pyfixest.estimation.feols_.Feols
The first part of the formula contains the dependent variable and “regular” covariates, while the second part contains fixed effects.
feols() returns an instance of the Fixest class.
To inspect the results, we can use a summary function or method:
Alternatively, the .summarize module contains a summary function, which can be applied on instances of regression model objects or lists of regression model objects.
You can access individual elements of the summary via dedicated methods: .tidy() returns a “tidy” pd.DataFrame, .coef() returns estimated parameters, and se() estimated standard errors. Other methods include pvalue(), confint() and tstat().
Let’s have a quick d-tour on the intuition behind fixed effects models using the example above. To do so, let us begin by comparing it with a simple OLS model.
We see that the X1 coefficient is -1.0, which is less than the value from the OLS model above (which was 0.949). Where is the difference coming from? Well, in the fixed effect model we are interested in controlling for the feature group_id. One possibility to do this is by adding a simple dummy variable for each level of group_id.
This is does not scale well! Imagine you have 1000 different levels of group_id. You would need to add 1000 dummy variables to your model. This is where fixed effect models come in handy. They allow you to control for these fixed effects without adding all these dummy variables. The way to do it is by a demeaning procedure. The idea is to subtract the average value of each level of group_id from the respective observations. This way, we control for the fixed effects without adding all these dummy variables. Let’s try to do this manually:
We get the same results as the fixed effect model Y1 ~ X | group_id above. The PyFixest package uses a more efficient algorithm to estimate the fixed effect model, but the intuition is the same.
Standard Errors and Inference
Supported covariance types are “iid”, “HC1-3”, CRV1 and CRV3 (up to two-way clustering).
Why do we have so many different types of standard errors?
The standard errors of the coefficients are crucial for inference. They tell us how certain we can be about the estimated coefficients. In the presence of heteroskedasticity (a situation which typically arises with cross-sectional data), the standard OLS standard errors are biased. The pyfixest package provides several types of standard errors that are robust to heteroskedasticity.
iid: assumes that the error variance is spherical, i.e. errors are homoskedastic and not correlated (independent and identically distributed errors have a spherical error variance).
CRV1 and CRV3: cluster robust standard errors according to Cameron, Gelbach, and Miller (2011). See A Practitioner’s Guide to Cluster-Robust Inference. For CRV1 and CRV3 one should pass a dictionaty of the form {"CRV1": "clustervar"}.
Inference can be adjusted “on-the-fly” via the .vcov() method:
Confidence Intervals for a single model can be adjusted using a multiplier boostrap:
fit_ci = pf.feols("Y ~ X1+ C(f1)", data = data)fit_ci.confint(joint =True).head()
0.025%
0.975%
Intercept
-0.422921
1.400832
X1
-1.160034
-0.738848
C(f1)[T.1.0]
1.388184
3.777114
C(f1)[T.2.0]
-2.834723
-0.329145
C(f1)[T.3.0]
-1.604060
0.979393
Panel Data Example: Causal Inference for the Brave and True
In this example we replicate the results of the great (freely available reference!) Causal Inference for the Brave and True - Chapter 14. Please refer to the original text for a detailed explanation of the data.
The objective is to estimate the effect of the variable married on the variable lwage using a fixed effect model on the entity variable nr and the time variable year.
panel_fit = pf.feols( fml="lwage ~ expersq + union + married + hours | nr + year", data=data_df, vcov={"CRV1": "nr + year"},)panel_fit.summary()
IV estimation with multiple endogenous variables and multiple estimation syntax is currently not supported. The syntax is depvar ~ exog.vars | fixef effects | endog.vars ~ instruments.
Poisson Regression
It is possible to estimate Poisson Regressions (for example, to model count data). We can showcase this feature with another synthetic data set.
PyFixest supports a range of multiple estimation functionality: sw, sw0, csw, csw0, and multiple dependent variables. The meaning of these options is explained in the Multiple Estimations vignette of the fixest package:
sw: this function is replaced sequentially by each of its arguments. For example, y ~ x1 + sw(x2, x3) leads to two estimations: y ~ x1 + x2 and y ~ x1 + x3.
sw0: identical to sw but first adds the empty element. E.g. y ~ x1 + sw0(x2, x3) leads to three estimations: y ~ x1, y ~ x1 + x2 and y ~ x1 + x3.
csw: it stands for cumulative stepwise. It adds to the formula each of its arguments sequentially. E.g. y ~ x1 + csw(x2, x3) will become y ~ x1 + x2 and y ~ x1 + x2 + x3.
csw0: identical to csw but first adds the empty element. E.g. y ~ x1 + csw0(x2, x3) leads to three estimations: y ~ x1, y ~ x1 + x2 and y ~ x1 + x2 + x3.
If multiple regression syntax is used, feols() and fepois returns an instance of a FixestMulti object, which essentially consists of a dicionary of Fepois or Feols instances.
Alternatively, you can look at the estimation results via the etable() method:
multi_fit.etable()
est1 est2 est3
------------ ----------------- ----------------- -----------------
depvar Y Y Y
---------------------------------------------------------------------
Intercept 0.919*** (0.112)
X1 -1.000*** (0.082) -0.949*** (0.066) -0.919*** (0.058)
---------------------------------------------------------------------
f1 - x x
f2 - - x
---------------------------------------------------------------------
R2 0.123 0.437 0.609
S.E. type hetero hetero hetero
Observations 998 997 997
---------------------------------------------------------------------
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Format of coefficient cell:
Coefficient (Std. Error)
You can access an individual model by its name - i.e. a formula - via the all_fitted_models attribute.
multi_fit.all_fitted_models["Y~X1"].tidy()
Estimate
Std. Error
t value
Pr(>|t|)
2.5%
97.5%
Coefficient
Intercept
0.918518
0.111707
8.222580
6.661338e-16
0.699310
1.137725
X1
-1.000086
0.082420
-12.134086
0.000000e+00
-1.161822
-0.838350
or equivalently via the fetch_model method:
multi_fit.fetch_model(0).tidy()
Model: Y~X1
Estimate
Std. Error
t value
Pr(>|t|)
2.5%
97.5%
Coefficient
Intercept
0.918518
0.111707
8.222580
6.661338e-16
0.699310
1.137725
X1
-1.000086
0.082420
-12.134086
0.000000e+00
-1.161822
-0.838350
Here, 0 simply fetches the first model stored in the all_fitted_models dictionary, 1 the second etc.
Objects of type Fixest come with a range of additional methods: tidy(), coef(), vcov() etc, which essentially loop over the equivalent methods of all fitted models. E.g. Fixest.vcov() updates inference for all models stored in Fixest.
est1 est2
------------ ---------------- ----------------
depvar dep_var dep_var_hat
------------------------------------------------
ATT 2.135*** (0.044) 2.152*** (0.048)
------------------------------------------------
state x -
year x -
------------------------------------------------
R2 0.758 0.338
S.E. type by: state CRV1
Observations 46500 46500
------------------------------------------------
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Format of coefficient cell:
Coefficient (Std. Error)