Getting Started with PyFixest

OLS with Fixed Effects

What is a fixed effect model?

A fixed effect model is a statistical model that includes fixed effects, which are parameters that are estimated to be constant across different groups.

Example [Panel Data]: In the context of panel data, fixed effects are parameters that are constant across different individuals or time. The typical model example is given by the following equation:

\[ Y_{it} = \beta X_{it} + \alpha_i + \psi_t + \varepsilon_{it} \]

where \(Y_{it}\) is the dependent variable for individual \(i\) at time \(t\), \(X_{it}\) is the independent variable, \(\beta\) is the coefficient of the independent variable, \(\alpha_i\) is the individual fixed effect, \(\psi_t\) is the time fixed effect, and \(\varepsilon_{it}\) is the error term. The individual fixed effect \(\alpha_i\) is a parameter that is constant across time for each individual, while the time fixed effect \(\psi_t\) is a parameter that is constant across individuals for each time period.

Note however that, despite the fact that fixed effects are commonly used in panel setting, one does not need a panel data set to work with fixed effects. For example, cluster randomized trials with cluster fixed effects, or wage regressions with worker and firm fixed effects.

In this “quick start” guide, we will show you how to estimate a fixed effect model using the PyFixest package. We do not go into the details of the theory behind fixed effect models, but we focus on how to estimate them using PyFixest.

Read Sample Data

In a first step, we load the module and some synthetic example data:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
try:
    from lets_plot import LetsPlot
    _HAS_LETS_PLOT = True
except ImportError:
    _HAS_LETS_PLOT = False

from marginaleffects import slopes, avg_slopes

import pyfixest as pf

if _HAS_LETS_PLOT:
    LetsPlot.setup_html()

plt.style.use("seaborn-v0_8")

%load_ext watermark
%config InlineBackend.figure_format = "retina"
%watermark --iversions

data = pf.get_data()

data.head()

pyfixest  : 0.30.2
pandas    : 2.2.3
matplotlib: 3.10.3
numpy     : 2.2.6

	Y	Y2	X1	X2	f1	f2	f3	group_id	Z1	Z2	weights
0	NaN	2.357103	0.0	0.457858	15.0	0.0	7.0	9.0	-0.330607	1.054826	0.661478
1	-1.458643	5.163147	NaN	-4.998406	6.0	21.0	4.0	8.0	NaN	-4.113690	0.772732
2	0.169132	0.751140	2.0	1.558480	NaN	1.0	7.0	16.0	1.207778	0.465282	0.990929
3	3.319513	-2.656368	1.0	1.560402	1.0	10.0	11.0	3.0	2.869997	0.467570	0.021123
4	0.134420	-1.866416	2.0	-3.472232	19.0	20.0	6.0	14.0	0.835819	-3.115669	0.790815

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Y         999 non-null    float64
 1   Y2        1000 non-null   float64
 2   X1        999 non-null    float64
 3   X2        1000 non-null   float64
 4   f1        999 non-null    float64
 5   f2        1000 non-null   float64
 6   f3        1000 non-null   float64
 7   group_id  1000 non-null   float64
 8   Z1        999 non-null    float64
 9   Z2        1000 non-null   float64
 10  weights   1000 non-null   float64
dtypes: float64(11)
memory usage: 86.1 KB

We see that some of our columns have missing data.

OLS Estimation

We are interested in the relation between the dependent variable Y and the independent variables X1 using a fixed effect model for group_id. Let’s see how the data looks like:

ax = data.plot(kind="scatter", x="X1", y="Y", c="group_id", colormap="viridis")

We can estimate a fixed effects regression via the feols() function. feols() has three arguments: a two-sided model formula, the data, and optionally, the type of inference.

fit = pf.feols(fml="Y ~ X1 | group_id", data=data, vcov="HC1")
type(fit)

pyfixest.estimation.feols_.Feols

The first part of the formula contains the dependent variable and “regular” covariates, while the second part contains fixed effects.

feols() returns an instance of the Fixest class.

Inspecting Model Results

To inspect the results, we can use a summary function or method:

fit.summary()

###

Estimation:  OLS
Dep. var.: Y, Fixed effects: group_id
Inference:  HC1
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -1.019 |        0.083 |   -12.234 |      0.000 | -1.182 |  -0.856 |
---
RMSE: 2.141 R2: 0.137 R2 Within: 0.126

Or display a formatted regression table:

pf.etable(fit)

	Y
	(1)
coef
X1	-1.019*** (0.083)
fe
group_id	x
stats
Observations	998
S.E. type	hetero
R²	0.137
R² Within	0.126
Significance levels: * p < 0.05, p < 0.01, * p < 0.001. Format of coefficient cell: Coefficient (Std. Error)

Alternatively, the .summarize module contains a summary function, which can be applied on instances of regression model objects or lists of regression model objects. For details on how to customize etable(), please take a look at the dedicated vignette.

pf.summary(fit)

###

Estimation:  OLS
Dep. var.: Y, Fixed effects: group_id
Inference:  HC1
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -1.019 |        0.083 |   -12.234 |      0.000 | -1.182 |  -0.856 |
---
RMSE: 2.141 R2: 0.137 R2 Within: 0.126

You can access individual elements of the summary via dedicated methods: .tidy() returns a “tidy” pd.DataFrame, .coef() returns estimated parameters, and se() estimated standard errors. Other methods include pvalue(), confint() and tstat().

fit.tidy()

	Estimate	Std. Error	t value	Pr(>\|t\|)	2.5%	97.5%
Coefficient
X1	-1.019009	0.083296	-12.233634	0.0	-1.182467	-0.85555

fit.coef()

Coefficient
X1   -1.019009
Name: Estimate, dtype: float64

fit.se()

Coefficient
X1    0.083296
Name: Std. Error, dtype: float64

fit.tstat()

Coefficient
X1   -12.233634
Name: t value, dtype: float64

fit.confint()

	2.5%	97.5%
X1	-1.182467	-0.85555

Last, model results can be visualized via dedicated methods for plotting:

fit.coefplot()
# or pf.coefplot([fit])

How to interpret the results?

Let’s have a quick d-tour on the intuition behind fixed effects models using the example above. To do so, let us begin by comparing it with a simple OLS model.

fit_simple = pf.feols("Y ~ X1", data=data, vcov="HC1")

fit_simple.summary()

###

Estimation:  OLS
Dep. var.: Y, Fixed effects: 0
Inference:  HC1
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      0.919 |        0.112 |     8.223 |      0.000 |  0.699 |   1.138 |
| X1            |     -1.000 |        0.082 |   -12.134 |      0.000 | -1.162 |  -0.838 |
---
RMSE: 2.158 R2: 0.123

We can compare both models side by side in a regression table:

pf.etable([fit, fit_simple])

	Y
	(1)	(2)
coef
X1	-1.019*** (0.083)	-1.000*** (0.082)
Intercept		0.919*** (0.112)
fe
group_id	x	-
stats
Observations	998	998
S.E. type	hetero	hetero
R²	0.137	0.123
R² Within	0.126	-
Significance levels: * p < 0.05, p < 0.01, * p < 0.001. Format of coefficient cell: Coefficient (Std. Error)

We see that the X1 coefficient is -1.019, which is less than the value from the OLS model in column (2). Where is the difference coming from? Well, in the fixed effect model we are interested in controlling for the feature group_id. One possibility to do this is by adding a simple dummy variable for each level of group_id.

fit_dummy = pf.feols("Y ~ X1 + C(group_id) ", data=data, vcov="HC1")

fit_dummy.summary()

###

Estimation:  OLS
Dep. var.: Y, Fixed effects: 0
Inference:  HC1
Observations:  998

| Coefficient         |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept           |      0.760 |        0.288 |     2.640 |      0.008 |  0.195 |   1.326 |
| X1                  |     -1.019 |        0.083 |   -12.234 |      0.000 | -1.182 |  -0.856 |
| C(group_id)[T.1.0]  |      0.380 |        0.451 |     0.844 |      0.399 | -0.504 |   1.264 |
| C(group_id)[T.2.0]  |      0.084 |        0.389 |     0.216 |      0.829 | -0.680 |   0.848 |
| C(group_id)[T.3.0]  |      0.790 |        0.415 |     1.904 |      0.057 | -0.024 |   1.604 |
| C(group_id)[T.4.0]  |     -0.189 |        0.388 |    -0.487 |      0.626 | -0.950 |   0.572 |
| C(group_id)[T.5.0]  |      0.537 |        0.388 |     1.385 |      0.166 | -0.224 |   1.297 |
| C(group_id)[T.6.0]  |      0.307 |        0.398 |     0.771 |      0.441 | -0.474 |   1.087 |
| C(group_id)[T.7.0]  |      0.015 |        0.422 |     0.035 |      0.972 | -0.814 |   0.844 |
| C(group_id)[T.8.0]  |      0.382 |        0.406 |     0.941 |      0.347 | -0.415 |   1.179 |
| C(group_id)[T.9.0]  |      0.219 |        0.417 |     0.526 |      0.599 | -0.599 |   1.037 |
| C(group_id)[T.10.0] |     -0.363 |        0.422 |    -0.861 |      0.390 | -1.191 |   0.465 |
| C(group_id)[T.11.0] |      0.201 |        0.387 |     0.520 |      0.603 | -0.559 |   0.961 |
| C(group_id)[T.12.0] |     -0.110 |        0.410 |    -0.268 |      0.788 | -0.915 |   0.694 |
| C(group_id)[T.13.0] |      0.126 |        0.440 |     0.287 |      0.774 | -0.736 |   0.989 |
| C(group_id)[T.14.0] |      0.353 |        0.416 |     0.848 |      0.397 | -0.464 |   1.170 |
| C(group_id)[T.15.0] |      0.469 |        0.398 |     1.179 |      0.239 | -0.312 |   1.249 |
| C(group_id)[T.16.0] |     -0.135 |        0.396 |    -0.340 |      0.734 | -0.913 |   0.643 |
| C(group_id)[T.17.0] |     -0.005 |        0.401 |    -0.013 |      0.989 | -0.792 |   0.781 |
| C(group_id)[T.18.0] |      0.283 |        0.403 |     0.702 |      0.483 | -0.508 |   1.074 |
---
RMSE: 2.141 R2: 0.137

This is does not scale well! Imagine you have 1000 different levels of group_id. You would need to add 1000 dummy variables to your model. This is where fixed effect models come in handy. They allow you to control for these fixed effects without adding all these dummy variables. The way to do it is by a demeaning procedure. The idea is to subtract the average value of each level of group_id from the respective observations. This way, we control for the fixed effects without adding all these dummy variables. Let’s try to do this manually:

def _demean_column(df: pd.DataFrame, column: str, by: str) -> pd.Series:
    return df[column] - df.groupby(by)[column].transform("mean")


fit_demeaned = pf.feols(
    fml="Y_demeaned ~ X1_demeaned",
    data=data.assign(
        Y_demeaned=lambda df: _demean_column(df, "Y", "group_id"),
        X1_demeaned=lambda df: _demean_column(df, "X1", "group_id"),
    ),
    vcov="HC1",
)

fit_demeaned.summary()

###

Estimation:  OLS
Dep. var.: Y_demeaned, Fixed effects: 0
Inference:  HC1
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      0.003 |        0.068 |     0.041 |      0.968 | -0.130 |   0.136 |
| X1_demeaned   |     -1.019 |        0.083 |   -12.345 |      0.000 | -1.181 |  -0.857 |
---
RMSE: 2.141 R2: 0.126

We get the same results as the fixed effect model Y1 ~ X | group_id above. The PyFixest package uses a more efficient algorithm to estimate the fixed effect model, but the intuition is the same.

Updating Regression Coefficients

You can update the coefficients of a model object via the update() method, which may be useful in an online learning setting where data arrives sequentially.

To see this in action, let us first fit a model on a subset of the data:

data_subsample = data.sample(frac=0.5)
m = pf.feols("Y ~ X1 + X2", data=data_subsample)
# current coefficient vector
m._beta_hat

array([ 1.02063052, -1.06251249, -0.182769  ])

Then sample 5 new observations and update the model with the new data. The update rule is

\[ \hat{\beta}_{n+1} = \hat{\beta}_n + (X_{n+1}' X_{n+1})^{-1} x_{n+1} + (y_{n+1} - x_{n+1} \hat{\beta}_n) \]

for a new observation \((x_{n+1}, y_{n+1})\).

new_points_id = np.random.choice(list(set(data.index) - set(data_subsample.index)), 5)
X_new, y_new = (
    np.c_[np.ones(len(new_points_id)), data.loc[new_points_id][["X1", "X2"]].values],
    data.loc[new_points_id]["Y"].values,
)
m.update(X_new, y_new)

array([ 1.00549238, -1.04932674, -0.17897649])

We verify that we get the same results if we had estimated the model on the appended data.

pf.feols(
    "Y ~ X1 + X2", data=data.loc[data_subsample.index.append(pd.Index(new_points_id))]
).coef().values

array([ 1.00549238, -1.04932674, -0.17897649])

Standard Errors and Inference

Supported covariance types are “iid”, “HC1-3”, CRV1 and CRV3 (up to two-way clustering).

Why do we have so many different types of standard errors?

The standard errors of the coefficients are crucial for inference. They tell us how certain we can be about the estimated coefficients. In the presence of heteroskedasticity (a situation which typically arises with cross-sectional data), the standard OLS standard errors are biased. The pyfixest package provides several types of standard errors that are robust to heteroskedasticity.

iid: assumes that the error variance is spherical, i.e. errors are homoskedastic and not correlated (independent and identically distributed errors have a spherical error variance).
HC1-3: heteroskedasticity-robust standard errors according to White (1980) and MacKinnon and White (1985). See Econometric Computing with HC and HAC Covariance Matrix Estimators from the sandwich package for more details.
CRV1 and CRV3: cluster robust standard errors according to Cameron, Gelbach, and Miller (2011). See A Practitioner’s Guide to Cluster-Robust Inference. For CRV1 and CRV3 one should pass a dictionaty of the form {"CRV1": "clustervar"}.

Inference can be adjusted “on-the-fly” via the .vcov() method:

fit.vcov({"CRV1": "group_id + f2"}).summary()

fit.vcov({"CRV3": "group_id"}).summary()

###

Estimation:  OLS
Dep. var.: Y, Fixed effects: group_id
Inference:  CRV1
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -1.019 |        0.121 |    -8.445 |      0.000 | -1.272 |  -0.766 |
---
RMSE: 2.141 R2: 0.137 R2 Within: 0.126 
###

Estimation:  OLS
Dep. var.: Y, Fixed effects: group_id
Inference:  CRV3
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -1.019 |        0.125 |    -8.174 |      0.000 | -1.281 |  -0.757 |
---
RMSE: 2.141 R2: 0.137 R2 Within: 0.126

The estimated covariance matrix is available as an attribute of the Feols object called ._vcov.

Inference via the Wild Bootstrap

It is also possible to run a wild (cluster) bootstrap after estimation (via the wildboottest module, see MacKinnon, J. G., Nielsen, M. Ø., & Webb, M. D. (2023). Fast and reliable jackknife and bootstrap methods for cluster-robust inference. Journal of Applied Econometrics, 38(5), 671–694.):

fit2 = pf.feols(fml="Y ~ X1", data=data, vcov={"CRV1": "group_id"})
fit2.wildboottest(param="X1", reps=999)

param                             X1
t value           -8.567586579080423
Pr(>|t|)                         0.0
bootstrap_type                    11
inference              CRV(group_id)
impose_null                     True
ssc                         1.056615
dtype: object

The Causal Cluster Variance Estimator

Additionally, PyFixest supports the causal cluster variance estimator following Abadie et al. (2023). Let’s look into it with another data set:

df = pd.read_stata("http://www.damianclarke.net/stata/census2000_5pc.dta")

df.head()

	ln_earnings	educ	hours	state	college
0	11.91839	18.0	50.0	44.0	1.0
1	11.48247	11.0	42.0	44.0	0.0
2	10.46310	12.0	42.0	44.0	0.0
3	10.22194	13.0	40.0	44.0	1.0
4	9.21034	13.0	8.0	44.0	1.0

axes = df.plot.hist(column=["ln_earnings"], by=["college"])

Now we can estimate the model ln_earnings ~ college where we cluster the standard errors at the state level:

fit3 = pf.feols("ln_earnings ~ college", vcov={"CRV1": "state"}, data=df)
fit3.ccv(treatment="college", pk=0.05, n_splits=2, seed=929)

	Estimate	Std. Error	t value	Pr(>\|t\|)	2.5%	97.5%
CCV	0.4656425903701481	0.00348	133.820079	0.0	0.458657	0.472628
CRV1	0.465643	0.027142	17.155606	0.0	0.411152	0.520133

Randomization Inference

You can also conduct inference via randomization inference (see Heß, Stata Journal 2017). PyFixest supports random and cluster random sampling.

fit2.ritest(resampvar="X1=0", reps=1000, cluster="group_id")

H0                                     X1=0
ri-type                     randomization-c
Estimate                 -1.000085840074156
Pr(>|t|)                                0.0
Std. Error (Pr(>|t|))                   0.0
2.5% (Pr(>|t|))                         0.0
97.5% (Pr(>|t|))                        0.0
Cluster                            group_id
dtype: object

Multiple Testing Corrections: Bonferroni and Romano-Wolf

To correct for multiple testing, p-values can be adjusted via either the Bonferroni, the method by Romano and Wolf (2005), see for example The Romano-Wolf Multiple Hypothesis Correction in Stata, and the method by Westfall & Young (see here).

pf.bonferroni([fit, fit2], param="X1").round(3)

	est0	est1
Estimate	-1.019	-1.000
Std. Error	0.125	0.117
t value	-8.174	-8.568
Pr(>\|t\|)	0.000	0.000
2.5%	-1.281	-1.245
97.5%	-0.757	-0.755
Bonferroni Pr(>\|t\|)	0.000	0.000

pf.rwolf([fit, fit2], param="X1", reps=9999, seed=1234).round(3)

	est0	est1
Estimate	-1.019	-1.000
Std. Error	0.125	0.117
t value	-8.174	-8.568
Pr(>\|t\|)	0.000	0.000
2.5%	-1.281	-1.245
97.5%	-0.757	-0.755
RW Pr(>\|t\|)	0.000	0.000

pf.wyoung([fit, fit2], param="X1", reps=9999, seed=1234).round(3)

	est0	est1
Estimate	-1.019	-1.000
Std. Error	0.125	0.117
t value	-8.174	-8.568
Pr(>\|t\|)	0.000	0.000
2.5%	-1.281	-1.245
97.5%	-0.757	-0.755
WY Pr(>\|t\|)	0.000	0.000

Joint Confidence Intervals

Simultaneous confidence bands for a vector of parameters can be computed via the joint_confint() method. See Simultaneous confidence bands: Theory, implementation, and an application to SVARs for background.

fit_ci = pf.feols("Y ~ X1+ C(f1)", data=data)
fit_ci.confint(joint=True).head()

	2.5%	97.5%
Intercept	-0.422426	1.400337
X1	-1.159919	-0.738963
C(f1)[T.1.0]	1.388832	3.776466
C(f1)[T.2.0]	-2.834043	-0.329825
C(f1)[T.3.0]	-1.603359	0.978692

Panel Data Example: Causal Inference for the Brave and True

In this example we replicate the results of the great (freely available reference!) Causal Inference for the Brave and True - Chapter 14. Please refer to the original text for a detailed explanation of the data.

data_path = "https://raw.githubusercontent.com/bashtage/linearmodels/main/linearmodels/datasets/wage_panel/wage_panel.csv.bz2"
data_df = pd.read_csv(data_path)

data_df.head()

	nr	year	exper	hours	educ	union	lwage	expersq	occupation
0	13	1980	1	2672	14	0	1.197540	1	9
1	13	1981	2	2320	14	1	1.853060	4	9
2	13	1982	3	2940	14	0	1.344462	9	9
3	13	1983	4	2960	14	0	1.433213	16	9
4	13	1984	5	3071	14	0	1.568125	25	5

The objective is to estimate the effect of the variable married on the variable lwage using a fixed effect model on the entity variable nr and the time variable year.

panel_fit = pf.feols(
    fml="lwage ~ expersq + union + married + hours | nr + year",
    data=data_df,
    vcov={"CRV1": "nr + year"},
)

pf.etable(panel_fit)

	lwage
	(1)
coef
expersq	-0.006*** (0.001)
union	0.073* (0.023)
married	0.048* (0.018)
hours	-0.000** (0.000)
fe
nr	x
year	x
stats
Observations	4360
S.E. type	by: nr+year
R²	0.631
R² Within	0.047
Significance levels: * p < 0.05, p < 0.01, * p < 0.001. Format of coefficient cell: Coefficient (Std. Error)

We obtain the same results as in the book!

Instrumental Variables (IV) Estimation

It is also possible to estimate instrumental variable models with one endogenous variable and (potentially multiple) instruments.

In general, the syntax for IV is depvar ~ exog.vars | fixef effects | endog.vars ~ instruments.

iv_fit = pf.feols(fml="Y2 ~ 1 | f1 + f2 | X1 ~ Z1 + Z2", data=data)
iv_fit.summary()

###

Estimation:  IV
Dep. var.: Y2, Fixed effects: f1+f2
Inference:  CRV1
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -1.600 |        0.338 |    -4.728 |      0.000 | -2.293 |  -0.908 |
---

If the model does not contain any fixed effects, just drop the second part of the formula above:

pf.feols(fml="Y ~ 1 | X1 ~ Z1 + Z2", data=data).summary()

###

Estimation:  IV
Dep. var.: Y, Fixed effects: 0
Inference:  iid
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      0.911 |        0.156 |     5.843 |      0.000 |  0.605 |   1.217 |
| X1            |     -0.993 |        0.134 |    -7.398 |      0.000 | -1.256 |  -0.730 |
---

You can access the first stage regression object via the ._model_1st_stage attribute:

pf.etable([iv_fit._model_1st_stage, iv_fit])

	X1	Y2
	(1)	(2)
coef
Z1	0.395*** (0.012)
Z2	-0.005 (0.006)
X1		-1.600*** (0.338)
fe
f2	x	x
f1	x	x
stats
Observations	998	998
S.E. type	by: f1	by: f1
R²	0.431	-
R² Within	0.399	-
Significance levels: * p < 0.05, p < 0.01, * p < 0.001. Format of coefficient cell: Coefficient (Std. Error)

You can access the F-Statistic of the first stage via the _f_stat_1st_stage attribute:

iv_fit._f_stat_1st_stage

np.float64(557.4557976365319)

Via the IV_Diag method, you can compute additional IV Diagnostics, as the effective F-statistic following Olea & Pflueger (2013):

iv_fit.IV_Diag()
iv_fit._eff_F

np.float64(400.13353398688037)

IV estimation with multiple endogenous variables and multiple estimation syntax is currently not supported.

Poisson Regression

It is possible to estimate Poisson Regressions (for example, to model count data). We can showcase this feature with another synthetic data set.

pois_data = pf.get_data(model="Fepois")

ax = pois_data.plot(
    kind="scatter",
    x="X1",
    y="Y",
    c="group_id",
    colormap="viridis",
    s="f2",
)

pois_fit = pf.fepois(fml="Y ~ X1 | group_id", data=pois_data, vcov={"CRV1": "group_id"})
pois_fit.summary()

###

Estimation:  Poisson
Dep. var.: Y, Fixed effects: group_id
Inference:  CRV1
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |      0.004 |        0.032 |     0.119 |      0.905 | -0.060 |   0.067 |
---
Deviance: 1126.202

Quantile Regression

You can fit a quantile regression via the quantreg function:

fit_qr = pf.quantreg("Y ~ X1 + X2", data = data, quantile = [0.1, 0.5, 0.9])
pf.qplot(fit_qr)

/home/runner/work/pyfixest/pyfixest/pyfixest/estimation/quantreg/quantreg_.py:77: FutureWarning: 
           The Quantile Regression implementation is experimental and may change in future releases.
           But mostly, we expect the API to remain unchanged.

(<Figure size 960x576 with 3 Axes>,
 array([<Axes: title={'center': 'Intercept'}, xlabel='Quantile', ylabel='Coefficient (95 % CI)'>,
        <Axes: title={'center': 'X1'}, xlabel='Quantile', ylabel='Coefficient (95 % CI)'>,
        <Axes: title={'center': 'X2'}, xlabel='Quantile', ylabel='Coefficient (95 % CI)'>],
       dtype=object))

For details, take a look at the dedicated quantreg vignette.

Tests of Multiple Hypothesis / Wald Tests

You can test multiple hypotheses simultaneously via the wald_test method.

fit = pf.feols("Y ~ X1 + X2 | f1", data=data)

For example, to test the joint null hypothesis of \(X_{1} = 0\) and \(X_{2} = 0\) vs the alternative that \(X_{1} \neq 0\) or \(X_{2} \neq 0\), we would run

fit.wald_test(R=np.eye(2))

statistic    1.463308e+02
pvalue       6.661338e-16
dtype: float64

Alternatively, suppose we wanted to test a more complicated joint null hypothesis: \(X_{1} + 2X_{2} = 2.0\) and \(X_{2} = 1.0\). To do so, we would define \(R\) and \(q\) as

R1 = np.array([[1, 2], [0, 1]])
q1 = np.array([2.0, 1.0])
fit.wald_test(R=R1, q=q1)

/home/runner/work/pyfixest/pyfixest/pyfixest/estimation/feols_.py:1180: UserWarning: Distribution changed to chi2, as R is not an identity matrix and q is not a zero vector.

statistic    4269.167164
pvalue          0.000000
dtype: float64

Other GLMs (without fixed effects)

PyFixest experimentally supports a range of other GLMs without fixed effects (adding fixed effect support is WIP) via the pf.feglm() function. Full support with all bells and whistles (in particular, fixed effects demeaning) is planned for PyFixest 0.29.

data_glm = pf.get_data(N=100, seed = 170)
data_glm["Y"] = np.where(data_glm["Y"] > 0, 1, 0)

fit_gaussian = pf.feglm(fml = "Y~X1", data = data_glm, family = "gaussian")
fit_logit = pf.feglm(fml = "Y~X1", data = data_glm, family = "logit")
fit_probit = pf.feglm(fml = "Y~X1", data = data_glm, family = "probit")

pf.etable([
    fit_gaussian,
    fit_logit,
    fit_probit,
])

	Y
	(1)	(2)	(3)
coef
X1	-0.059 (0.060)	-0.275 (0.283)	-0.171 (0.171)
Intercept	0.377*** (0.080)	-0.497 (0.362)	-0.307 (0.223)
stats
Observations	99	99	99
S.E. type	iid	iid	iid
R²	-	-	-
Adj. R²	-	-	-
Significance levels: * p < 0.05, p < 0.01, * p < 0.001. Format of coefficient cell: Coefficient (Std. Error)

You can make predictions on the response and link scale via the predict() method:

fit_logit.predict(type = "response")[0:5]
fit_logit.predict(type = "link")[0:5]

array([-0.49677152, -1.046612  , -1.046612  , -0.77169176, -1.046612  ])

You can compute the average marginal effect via the marginaleffects package:

avg_slopes(fit_logit, variables = "X1")

shape: (1, 9)

term	contrast	estimate	std_error	statistic	p_value	s_value	conf_low	conf_high
str	str	f64	f64	f64	f64	f64	f64	f64
"X1"	"dY/dX"	-0.27492	0.282645	-0.97267	0.330717	1.59633	-0.828894	0.279054

Please take a look at the marginaleffects book to learn about other transformations that the marginaleffects package supports.

Multiple Estimation

PyFixest supports a range of multiple estimation functionality: sw, sw0, csw, csw0, and multiple dependent variables. The meaning of these options is explained in the Multiple Estimations vignette of the fixest package:

sw: this function is replaced sequentially by each of its arguments. For example, y ~ x1 + sw(x2, x3) leads to two estimations: y ~ x1 + x2 and y ~ x1 + x3.

sw0: identical to sw but first adds the empty element. E.g. y ~ x1 + sw0(x2, x3) leads to three estimations: y ~ x1, y ~ x1 + x2 and y ~ x1 + x3.

csw: it stands for cumulative stepwise. It adds to the formula each of its arguments sequentially. E.g. y ~ x1 + csw(x2, x3) will become y ~ x1 + x2 and y ~ x1 + x2 + x3.

csw0: identical to csw but first adds the empty element. E.g. y ~ x1 + csw0(x2, x3) leads to three estimations: y ~ x1, y ~ x1 + x2 and y ~ x1 + x2 + x3.

Additionally, we support split and fsplit function arguments. > - split allows to split a sample by a given variable. If specified, pf.feols() and pf.fepois() will loop through all resulting sample splits. > - fsplit works just as split, but fits the model on the full sample as well.

If multiple regression syntax is used, feols() and fepois returns an instance of a FixestMulti object, which essentially consists of a dicionary of Fepois or Feols instances.

multi_fit = pf.feols(fml="Y ~ X1 | csw0(f1, f2)", data=data, vcov="HC1")
multi_fit

<pyfixest.estimation.FixestMulti_.FixestMulti at 0x7f2859b39070>

multi_fit.etable()

	Y
	(1)	(2)	(3)
coef
X1	-1.000*** (0.082)	-0.949*** (0.067)	-0.919*** (0.060)
Intercept	0.919*** (0.112)
fe
f2	-	-	x
f1	-	x	x
stats
Observations	998	997	997
S.E. type	hetero	hetero	hetero
R²	0.123	0.437	0.609
R² Within	-	0.161	0.200
Significance levels: * p < 0.05, p < 0.01, * p < 0.001. Format of coefficient cell: Coefficient (Std. Error)

You can access an individual model by its name - i.e. a formula - via the all_fitted_models attribute.

multi_fit.all_fitted_models["Y~X1"].tidy()

	Estimate	Std. Error	t value	Pr(>\|t\|)	2.5%	97.5%
Coefficient
Intercept	0.918518	0.111707	8.222580	6.661338e-16	0.699310	1.137725
X1	-1.000086	0.082420	-12.134086	0.000000e+00	-1.161822	-0.838350

or equivalently via the fetch_model method:

multi_fit.fetch_model(0).tidy()

Model:  Y~X1

	Estimate	Std. Error	t value	Pr(>\|t\|)	2.5%	97.5%
Coefficient
Intercept	0.918518	0.111707	8.222580	6.661338e-16	0.699310	1.137725
X1	-1.000086	0.082420	-12.134086	0.000000e+00	-1.161822	-0.838350

Here, 0 simply fetches the first model stored in the all_fitted_models dictionary, 1 the second etc.

Objects of type Fixest come with a range of additional methods: tidy(), coef(), vcov() etc, which essentially loop over the equivalent methods of all fitted models. E.g. Fixest.vcov() updates inference for all models stored in Fixest.

multi_fit.vcov("iid").summary()

###

Estimation:  OLS
Dep. var.: Y, Fixed effects: 0
Inference:  iid
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      0.919 |        0.112 |     8.214 |      0.000 |  0.699 |   1.138 |
| X1            |     -1.000 |        0.085 |   -11.802 |      0.000 | -1.166 |  -0.834 |
---
RMSE: 2.158 R2: 0.123 
###

Estimation:  OLS
Dep. var.: Y, Fixed effects: f1
Inference:  iid
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.949 |        0.070 |   -13.636 |      0.000 | -1.086 |  -0.813 |
---
RMSE: 1.73 R2: 0.437 R2 Within: 0.161 
###

Estimation:  OLS
Dep. var.: Y, Fixed effects: f1+f2
Inference:  iid
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.919 |        0.060 |   -15.322 |      0.000 | -1.037 |  -0.802 |
---
RMSE: 1.441 R2: 0.609 R2 Within: 0.2

You can summarize multiple models at once via etable(). etable() has many options to customize the output to obtain publication-ready tables.

pf.etable(
    [fit, fit2],
    labels={"Y": "Wage", "X1": "Age", "X2": "Years of Schooling"},
    felabels={"f1": "Industry Fixed Effects"},
    caption="Regression Results",
)

	Wage
Regression Results
	(1)	(2)
coef
Age	-0.950*** (0.067)	-1.000*** (0.117)
Years of Schooling	-0.174*** (0.018)
Intercept		0.919*** (0.121)
fe
Industry Fixed Effects	x	-
stats
Observations	997	998
S.E. type	by: f1	by: group_id
R²	0.489	0.123
R² Within	0.239	-
Significance levels: * p < 0.05, p < 0.01, * p < 0.001. Format of coefficient cell: Coefficient (Std. Error)

You can also visualize multiple estimation results via iplot() and coefplot():

multi_fit.coefplot().show()

Difference-in-Differences / Event Study Designs

PyFixest supports eventy study designs via two-way fixed effects, Gardner’s 2-stage estimator, and the linear projections estimator.

url = "https://raw.githubusercontent.com/py-econometrics/pyfixest/master/pyfixest/did/data/df_het.csv"
df_het = pd.read_csv(url)

df_het.head()

	unit	state	group	unit_fe	g	year	year_fe	treat	rel_year	rel_year_binned	error	dep_var
0	1	33	Group 2	7.043016	2010	1990	0.066159	False	-20.0	-6	-0.086466	7.022709
1	1	33	Group 2	7.043016	2010	1991	-0.030980	False	-19.0	-6	0.766593	7.778628
2	1	33	Group 2	7.043016	2010	1992	-0.119607	False	-18.0	-6	1.512968	8.436377
3	1	33	Group 2	7.043016	2010	1993	0.126321	False	-17.0	-6	0.021870	7.191207
4	1	33	Group 2	7.043016	2010	1994	-0.106921	False	-16.0	-6	-0.017603	6.918492

fit_did2s = pf.did2s(
    df_het,
    yname="dep_var",
    first_stage="~ 0 | state + year",
    second_stage="~i(rel_year,ref= -1.0)",
    treatment="treat",
    cluster="state",
)


fit_twfe = pf.feols(
    "dep_var ~ i(rel_year,ref = -1.0) | state + year",
    df_het,
    vcov={"CRV1": "state"},
)

from pyfixest.report.utils import rename_categoricals
pf.iplot(
    [fit_did2s, fit_twfe], coord_flip=False, figsize=(900, 400), title="TWFE vs DID2S", rotate_xticks=90,
    labels= rename_categoricals(fit_did2s._coefnames, template="{value_int}")
)

The event_study() function provides a common API for several event study estimators.

fit_twfe = pf.event_study(
    data=df_het,
    yname="dep_var",
    idname="state",
    tname="year",
    gname="g",
    estimator="twfe",
)

fit_did2s = pf.event_study(
    data=df_het,
    yname="dep_var",
    idname="state",
    tname="year",
    gname="g",
    estimator="did2s",
)

pf.etable([fit_twfe, fit_did2s])

	dep_var	dep_var_hat
	(1)	(2)
coef
is_treated	2.135*** (0.044)	2.152*** (0.048)
fe
year	x	-
state	x	-
stats
Observations	46500	46500
S.E. type	by: state	CRV1
R²	0.758	0.338
R² Within	0.251	-
Significance levels: * p < 0.05, p < 0.01, * p < 0.001. Format of coefficient cell: Coefficient (Std. Error)

For more details see the vignette on Difference-in-Differences Estimation.