Formula Syntax

Core Estimation
In this tutorial, we showcase PyFixest's formula syntax, including syntax for fitting models with fixed effects, interactions, and multiple-estimation operators.

Setup

import numpy as np
import pyfixest as pf
data = pf.get_data()
data.head()
|   | Y | Y2 | X1 | X2 | f1 | f2 | f3 | group_id | Z1 | Z2 | weights |
|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| 0 | NaN | 2.357103 | 0.0 | 0.457858 | 15.0 | 0.0 | 7.0 | 9.0 | -0.330607 | 1.054826 | 0.661478 |
| 1 | -1.458643 | 5.163147 | NaN | -4.998406 | 6.0 | 21.0 | 4.0 | 8.0 | NaN | -4.113690 | 0.772732 |
| 2 | 0.169132 | 0.751140 | 2.0 | 1.558480 | NaN | 1.0 | 7.0 | 16.0 | 1.207778 | 0.465282 | 0.990929 |
| 3 | 3.319513 | -2.656368 | 1.0 | 1.560402 | 1.0 | 10.0 | 11.0 | 3.0 | 2.869997 | 0.467570 | 0.021123 |
| 4 | 0.134420 | -1.866416 | 2.0 | -3.472232 | 19.0 | 20.0 | 6.0 | 14.0 | 0.835819 | -3.115669 | 0.790815 |

PyFixest specifies regression models via Wilkinson formulas, implemented through the formulaic package. Wilkinson formulas should be familiar to you if you have used R's lm() or the statsmodels formula API. Many additional ideas implemented in PyFixest were developed in the fixest R package, most notably the multiple-estimation syntax, the i() operator, and sample splitting. By default, all formula options presented here are supported by all models available via the pf.feols(), pf.feglm(), and pf.fepois() APIs.

Basic Syntax

In the simplest case, we regress Y on the covariates X1 and X2.

fit1 = pf.feols("Y ~ X1 + X2", data=data)
fit1.summary()
###

Estimation:  OLS
Dep. var.: Y
sample: None = all
Inference:  iid
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      0.889 |        0.108 |     8.197 |      0.000 |  0.676 |   1.102 |
| X1            |     -0.993 |        0.082 |   -12.092 |      0.000 | -1.154 |  -0.832 |
| X2            |     -0.176 |        0.022 |    -8.102 |      0.000 | -0.219 |  -0.134 |
---
RMSE: 2.09 R2: 0.177 

All transformations supported by formulaic are also supported by PyFixest. For example, you can create categorical variables via the C() operator:

fit2 = pf.feols("Y ~ X1 + X2 + C(f1)", data=data)

You can interact variables via the * and : operators:

fit3 = pf.feols("Y ~ X1:X2", data=data)
fit4 = pf.feols("Y ~ X1*X2", data=data)
pf.etable([fit3, fit4])
| Coefficient | (1) Y | (2) Y |
|:--|--:|--:|
| X1 × X2 | -0.099 (0.018) | 0.02 (0.027) |
| X1 | | -0.992 (0.082) |
| X2 | | -0.197 (0.036) |
| Intercept | -0.136 (0.072) | 0.888 (0.108) |
| Observations | 998 | 998 |
| R2 | 0.031 | 0.177 |

Format of coefficient cell: Coefficient (Std. Error)

To take the logarithm of a variable, just use

fit5 = pf.feols("Y ~ log(X1)", data=data)

or apply any numpy transform, e.g.

fit5_np = pf.feols("Y ~ X1 + np.power(X1, 2)", data=data)

Note: for logarithms, we suggest not relying on np.log but using the built-in log operator instead.

Fixed Effects Syntax

We can add fixed effects after the | operator; here, we add the two fixed effects f1 and f2.

fit6 = pf.feols("Y ~ X1 + X2 | f1 + f2", data=data)

We can interact two fixed effects via the ^ operator.

fit7 = pf.feols("Y ~ X1 + X2 | f1^f2", data=data)

For details on fixed effects regression, take a look at the OLS with Fixed Effects vignette.

Instrumental Variables (IV) Syntax

For IV estimation, PyFixest uses a three-part formula syntax:

"Y ~ exogenous_controls | fixed_effects | endogenous ~ instruments"

Here is a minimal example with fixed effects:

fit_iv = pf.feols("Y ~ X2 | f1 + f2 | X1 ~ Z1", data=data)
fit_iv.summary()
###

Estimation:  IV
Dep. var.: Y, Fixed effects: f1 + f2
sample: None = all
Inference:  iid
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X2            |     -0.174 |        0.015 |   -11.701 |      0.000 | -0.204 |  -0.145 |
| X1            |     -1.050 |        0.089 |   -11.793 |      0.000 | -1.225 |  -0.875 |
---

For details on IV estimation, take a look at the Instrumental Variables vignette.
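The mechanics behind the third formula part can be sketched in plain numpy for the just-identified case: two-stage least squares replaces the OLS normal equations with instrument moment conditions. This is a conceptual illustration on simulated data, not PyFixest internals:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Simulated endogeneity: the error u shifts both x and y,
# while the instrument z moves x but affects y only through x
u = rng.normal(size=n)
z = rng.normal(size=n)
x = 0.8 * z + u + rng.normal(size=n)
y = 1.5 * x + 2.0 * u + rng.normal(size=n)   # true effect of x is 1.5

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

# OLS is inconsistent because x is correlated with the error
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0][1]

# 2SLS in the just-identified case: beta = (Z'X)^{-1} Z'y
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)[1]

print(beta_ols, beta_iv)  # OLS drifts away from 1.5; 2SLS recovers it
```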

The i() operator for interacting fixed effects

For interacting fixed effects, PyFixest includes a specialised operator, i().

If you simply wrap a variable into i(), it is treated just like the C() operator (see above).

fit_i = pf.feols("Y ~ i(f1)", data=data)
fit_c = pf.feols("Y ~ C(f1)", data=data)

But overall, i() is more powerful than C(). Most importantly, you can easily set the reference level of the categorical variable:

# set 1 as reference level
fit_i1 = pf.feols("Y ~ i(f1, ref = 1)", data=data)
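Conceptually, setting a reference level amounts to dummy-encoding the variable and dropping the reference category, whose effect is then absorbed by the intercept. A small pandas sketch with toy data (illustrative values, not the tutorial dataset):

```python
import pandas as pd

# Toy factor with three levels
f1 = pd.Series([1, 2, 3, 2, 1, 3], name="f1")

# ref=1 conceptually means: build one dummy per level of f1 and drop
# the level-1 dummy, so level 1 becomes the baseline
dummies = pd.get_dummies(f1, prefix="f1").drop(columns="f1_1")
print(list(dummies.columns))  # ['f1_2', 'f1_3']
```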

You can also easily interact variables:

# interact f1 and f2
fit_i2 = pf.feols("Y ~ i(f1, f2)", data=data)

and set reference levels for both via the ref and ref2 arguments.

# set 1 and 2 as reference levels
fit_i3 = pf.feols("Y ~ i(f1, f2, ref = 1, ref2 = 2)", data=data)

This is particularly useful for difference-in-differences models.

Last, you can bin levels of a variable via the bin argument. This groups multiple levels into a single category.

fit_bin = pf.feols(
    "Y ~ i(f1, bin={'low': list(range(0, 10)), 'mid': list(range(10, 20)), 'high': list(range(20, 30))}, ref='low')",
    data=data,
)
fit_bin.summary()
###

Estimation:  OLS
Dep. var.: Y
sample: None = all
Inference:  iid
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |     -0.473 |        0.122 |    -3.887 |      0.000 | -0.712 |  -0.234 |
| f1::high      |      0.110 |        0.174 |     0.630 |      0.529 | -0.232 |   0.451 |
| f1::mid       |      0.968 |        0.176 |     5.503 |      0.000 |  0.623 |   1.313 |
---
RMSE: 2.264 R2: 0.035 
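Conceptually, binning first maps each raw level to its bin label and then dummy-encodes the binned variable. A plain-Python sketch of the mapping step (the f1 values below are illustrative):

```python
# The bin dict maps a label to the raw levels it absorbs; conceptually,
# each observation's level is translated to its bin label before encoding
bins = {
    "low": list(range(0, 10)),
    "mid": list(range(10, 20)),
    "high": list(range(20, 30)),
}
level_to_bin = {level: label for label, levels in bins.items() for level in levels}

f1_values = [15, 6, 1, 19, 25]  # illustrative f1 levels
binned = [level_to_bin[v] for v in f1_values]
print(binned)  # ['mid', 'low', 'low', 'mid', 'high']
```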

Multiple Estimation Syntax

Last, PyFixest provides syntactic sugar to fit multiple estimations in one go. This not only economizes on lines of code, but also allows for performance optimizations via caching: if you fit many regression models that share a set of fixed effects and overlapping covariates or dependent variables, and performance is poor, we highly recommend trying out multiple estimation.

For multiple estimations, we provide 5 custom operators: sw, csw, sw0, csw0 and mvsw. In addition, it is possible to specify multiple dependent variables.
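The expansion rules for the stepwise operators can be sketched in a few lines of plain Python. These helpers are purely illustrative and are not part of the PyFixest API:

```python
def sw(*terms):
    """Stepwise: one model per term."""
    return [[t] for t in terms]

def sw0(*terms):
    """Stepwise with an additional empty 'zero step'."""
    return [[]] + sw(*terms)

def csw(*terms):
    """Cumulative stepwise: terms enter one after another."""
    return [list(terms[: i + 1]) for i in range(len(terms))]

def csw0(*terms):
    """Cumulative stepwise with the zero step."""
    return [[]] + csw(*terms)

def expand(base, steps):
    """Attach each stepwise block to the fixed part of the formula."""
    return [" + ".join([base] + step) if step else base for step in steps]

print(expand("X1", sw("X2", "Z1")))    # ['X1 + X2', 'X1 + Z1']
print(expand("X1", csw0("X2", "Z1")))  # ['X1', 'X1 + X2', 'X1 + X2 + Z1']
```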

Multiple dependent variables

Multiple depvars are expanded to multiple estimations: "Y1 + Y2 ~ X1" behaves like "sw(Y1, Y2) ~ X1".

fit_multi_dep = pf.feols("Y + Y2 ~ X1 + X2", data=data)
pf.etable(fit_multi_dep)
| Coefficient | (1) Y | (2) Y2 |
|:--|--:|--:|
| X1 | -0.993 (0.082) | -1.316 (0.214) |
| X2 | -0.176 (0.022) | -0.133 (0.057) |
| Intercept | 0.889 (0.108) | 1.042 (0.283) |
| Observations | 998 | 999 |
| R2 | 0.177 | 0.042 |

Format of coefficient cell: Coefficient (Std. Error)

sw(): stepwise alternatives

y ~ x1 + sw(x2, x3) expands to y ~ x1 + x2 and y ~ x1 + x3.

fit_sw = pf.feols("Y ~ X1 + sw(X2, Z1)", data=data)
pf.etable(fit_sw)
| Coefficient | (1) Y | (2) Y |
|:--|--:|--:|
| X1 | -0.993 (0.082) | -0.991 (0.109) |
| X2 | -0.176 (0.022) | |
| Z1 | | -0.009 (0.068) |
| Intercept | 0.889 (0.108) | 0.918 (0.112) |
| Observations | 998 | 998 |
| R2 | 0.177 | 0.123 |

Format of coefficient cell: Coefficient (Std. Error)

sw0(): stepwise with zero step

y ~ x1 + sw0(x2, x3) expands to y ~ x1, y ~ x1 + x2, and y ~ x1 + x3.

fit_sw0 = pf.feols("Y ~ X1 + sw0(X2, Z1)", data=data)
pf.etable(fit_sw0)
| Coefficient | (1) Y | (2) Y | (3) Y |
|:--|--:|--:|--:|
| X1 | -1.000 (0.085) | -0.993 (0.082) | -0.991 (0.109) |
| X2 | | -0.176 (0.022) | |
| Z1 | | | -0.009 (0.068) |
| Intercept | 0.919 (0.112) | 0.889 (0.108) | 0.918 (0.112) |
| Observations | 998 | 998 | 998 |
| R2 | 0.123 | 0.177 | 0.123 |

Format of coefficient cell: Coefficient (Std. Error)

csw(): cumulative stepwise

y ~ x1 + csw(x2, x3) expands to y ~ x1 + x2 and y ~ x1 + x2 + x3.

fit_csw = pf.feols("Y ~ X1 + csw(X2, Z1)", data=data)
pf.etable(fit_csw)
| Coefficient | (1) Y | (2) Y |
|:--|--:|--:|
| X1 | -0.993 (0.082) | -1.010 (0.106) |
| X2 | -0.176 (0.022) | -0.177 (0.022) |
| Z1 | | 0.017 (0.066) |
| Intercept | 0.889 (0.108) | 0.889 (0.108) |
| Observations | 998 | 998 |
| R2 | 0.177 | 0.177 |

Format of coefficient cell: Coefficient (Std. Error)

csw0(): cumulative stepwise with zero step

y ~ x1 + csw0(x2, x3) expands to y ~ x1, y ~ x1 + x2, and y ~ x1 + x2 + x3.

fit_csw0 = pf.feols("Y ~ X1 + csw0(X2, Z1)", data=data)
pf.etable(fit_csw0)
| Coefficient | (1) Y | (2) Y | (3) Y |
|:--|--:|--:|--:|
| X1 | -1.000 (0.085) | -0.993 (0.082) | -1.010 (0.106) |
| X2 | | -0.176 (0.022) | -0.177 (0.022) |
| Z1 | | | 0.017 (0.066) |
| Intercept | 0.919 (0.112) | 0.889 (0.108) | 0.889 (0.108) |
| Observations | 998 | 998 | 998 |
| R2 | 0.123 | 0.177 | 0.177 |

Format of coefficient cell: Coefficient (Std. Error)

mvsw(): multiverse stepwise

y ~ mvsw(x1, x2, x3) expands to the zero step plus every non-empty combination of the terms: y ~ 1, y ~ x1, y ~ x2, y ~ x3, y ~ x1 + x2, y ~ x1 + x3, y ~ x2 + x3, y ~ x1 + x2 + x3.
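This expansion can be reproduced with itertools.combinations (an illustrative sketch, not PyFixest's implementation):

```python
from itertools import combinations

def mvsw(*terms):
    """Zero step plus every non-empty subset of the terms."""
    subsets = [[]]
    for k in range(1, len(terms) + 1):
        subsets += [list(c) for c in combinations(terms, k)]
    return subsets

models = [" + ".join(s) if s else "1" for s in mvsw("X1", "X2", "Z1")]
print(models)
# ['1', 'X1', 'X2', 'Z1', 'X1 + X2', 'X1 + Z1', 'X2 + Z1', 'X1 + X2 + Z1']
```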

fit_mvsw = pf.feols("Y ~ mvsw(X1, X2, Z1)", data=data)
pf.etable(fit_mvsw)
Dependent variable: Y.

| Coefficient | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|
| X1 | | -1.000 (0.085) | | | -0.993 (0.082) | -0.991 (0.109) | | -1.010 (0.106) |
| X2 | | | -0.178 (0.023) | | -0.176 (0.022) | | -0.172 (0.023) | -0.177 (0.022) |
| Z1 | | | | -0.396 (0.054) | | -0.009 (0.068) | -0.378 (0.053) | 0.017 (0.066) |
| Intercept | -0.127 (0.073) | 0.919 (0.112) | -0.15 (0.071) | 0.286 (0.091) | 0.889 (0.108) | 0.918 (0.112) | 0.246 (0.089) | 0.889 (0.108) |
| Observations | 999 | 998 | 999 | 998 | 998 | 998 | 998 | 998 |
| R2 | 0 | 0.123 | 0.055 | 0.05 | 0.177 | 0.123 | 0.102 | 0.177 |

Format of coefficient cell: Coefficient (Std. Error)

Combining operators

Multiple estimation operators can be combined. For example, y ~ csw(x1, x2) + sw(z1, z2) expands to y ~ x1 + z1, y ~ x1 + z2, y ~ x1 + x2 + z1, y ~ x1 + x2 + z2.
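The combined expansion amounts to a cartesian product of the individual expansions, which can be sketched with itertools.product (an illustrative sketch, not PyFixest's implementation):

```python
from itertools import product

csw_steps = [["X1"], ["X1", "X2"]]  # expansion of csw(X1, X2)
sw_steps = [["Z1"], ["X1:Z1"]]      # expansion of sw(Z1, X1:Z1)

# Combined operators expand to the cartesian product of the two blocks
models = [" + ".join(a + b) for a, b in product(csw_steps, sw_steps)]
print(models)
# ['X1 + Z1', 'X1 + X1:Z1', 'X1 + X2 + Z1', 'X1 + X2 + X1:Z1']
```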

fit_combo = pf.feols("Y ~ csw(X1, X2) + sw(Z1, X1:Z1)", data=data)
pf.etable(fit_combo)
Dependent variable: Y.

| Coefficient | (1) | (2) | (3) | (4) |
|:--|--:|--:|--:|--:|
| X1 | -0.991 (0.109) | -1.014 (0.13) | -1.010 (0.106) | -1.041 (0.126) |
| Z1 | -0.009 (0.068) | | 0.017 (0.066) | |
| X1 × Z1 | | 0.007 (0.049) | | 0.024 (0.047) |
| X2 | | | -0.177 (0.022) | -0.177 (0.022) |
| Intercept | 0.918 (0.112) | 0.921 (0.113) | 0.889 (0.108) | 0.897 (0.11) |
| Observations | 998 | 998 | 998 | 998 |
| R2 | 0.123 | 0.123 | 0.177 | 0.177 |

Format of coefficient cell: Coefficient (Std. Error)

Regressions on Multiple Samples

Via the split and fsplit arguments, you can easily estimate the same model on different subsamples.

  • split estimates separate models by subgroup.
  • fsplit does the same but also keeps the full-sample fit.
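Conceptually, split partitions the data by the levels of the splitting variable and refits the model on each part, much like a pandas groupby. A simplified sketch with a toy simple regression (simulated data; not PyFixest internals):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "y": rng.normal(size=100),
    "x": rng.normal(size=100),
    "g": rng.integers(0, 4, size=100),  # splitting variable with 4 levels
})

def ols_slope(sub):
    """Simple-regression slope of y on x within one subsample."""
    x = sub["x"] - sub["x"].mean()
    y = sub["y"] - sub["y"].mean()
    return float(x @ y) / float(x @ x)

# split="g" conceptually fits the same model once per level of g;
# fsplit would additionally fit it on the full df
slopes = {level: ols_slope(sub) for level, sub in df.groupby("g")}
print(slopes)  # one slope estimate per subsample
```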
fit_split = pf.feols("Y ~ X1 + X2 | f1", data=data, split="f2")
pf.etable(fit_split)
Dependent variable: Y. One model per level of f2; all models include f1 fixed effects.

| Sample | X1 | X2 | Observations | R2 |
|:--|--:|--:|--:|--:|
| (1) | -2.177 (0.638) | -0.145 (0.065) | 24 | 0.924 |
| (2) | -0.801 (0.466) | -0.106 (0.134) | 32 | 0.788 |
| (3) | 0.495 (0.344) | -0.022 (0.093) | 10 | 0.975 |
| (4) | -2.044 (0.482) | -0.177 (0.144) | 27 | 0.858 |
| (5) | -0.519 (0.785) | -0.096 (0.212) | 19 | 0.749 |
| (6) | -0.974 (0.312) | -0.213 (0.086) | 28 | 0.781 |
| (7) | 0.056 (0.505) | -0.23 (0.128) | 29 | 0.715 |
| (8) | -0.222 (1.151) | -0.206 (0.248) | 14 | 0.6 |
| (9) | -0.69 (0.46) | -0.135 (0.123) | 18 | 0.754 |
| (10) | 0.351 (0.42) | -0.061 (0.099) | 24 | 0.782 |
| (11) | -0.986 (0.458) | -0.242 (0.102) | 36 | 0.696 |
| (12) | -0.466 (0.472) | -0.224 (0.149) | 14 | 0.673 |
| (13) | -0.879 (0.509) | -0.139 (0.115) | 35 | 0.658 |
| (14) | -1.851 (1.072) | -0.218 (0.194) | 9 | 0.954 |
| (15) | -2.697 (1.051) | -0.24 (0.147) | 20 | 0.798 |
| (16) | -1.532 (0.387) | -0.182 (0.132) | 30 | 0.834 |
| (17) | -1.274 (0.329) | -0.093 (0.076) | 24 | 0.8 |
| (18) | -1.120 (0.838) | -0.254 (0.221) | 19 | 0.51 |
| (19) | -0.937 (0.529) | -0.171 (0.104) | 23 | 0.841 |
| (20) | -1.012 (0.531) | -0.189 (0.14) | 27 | 0.735 |
| (21) | -1.315 (0.37) | -0.102 (0.097) | 23 | 0.598 |
| (22) | -1.137 (0.527) | -0.323 (0.188) | 23 | 0.771 |
| (23) | -1.033 (0.447) | -0.033 (0.132) | 25 | 0.597 |
| (24) | -1.700 (0.5) | -0.235 (0.146) | 18 | 0.877 |
| (25) | -0.43 (0.303) | -0.059 (0.09) | 24 | 0.829 |
| (26) | -1.065 (0.539) | -0.326 (0.093) | 34 | 0.668 |
| (27) | -0.065 (0.674) | 0.099 (0.185) | 16 | 0.514 |
| (28) | -0.575 (0.412) | 0.03 (0.148) | 26 | 0.761 |
| (29) | -0.659 (0.48) | -0.256 (0.173) | 22 | 0.788 |
| (30) | -0.845 (0.411) | -0.026 (0.112) | 35 | 0.654 |

Format of coefficient cell: Coefficient (Std. Error)
fit_fsplit = pf.feols("Y ~ X1 + X2 | f1", data=data, fsplit="f2")
pf.etable(fit_fsplit)
Dependent variable: Y. The first column is the full sample; the remaining columns are one model per level of f2. All models include f1 fixed effects.

| Sample | X1 | X2 | Observations | R2 |
|:--|--:|--:|--:|--:|
| (1) full sample | -0.95 (0.066) | -0.174 (0.018) | 997 | 0.489 |
| (2) | -2.177 (0.638) | -0.145 (0.065) | 24 | 0.924 |
| (3) | -0.801 (0.466) | -0.106 (0.134) | 32 | 0.788 |
| (4) | 0.495 (0.344) | -0.022 (0.093) | 10 | 0.975 |
| (5) | -2.044 (0.482) | -0.177 (0.144) | 27 | 0.858 |
| (6) | -0.519 (0.785) | -0.096 (0.212) | 19 | 0.749 |
| (7) | -0.974 (0.312) | -0.213 (0.086) | 28 | 0.781 |
| (8) | 0.056 (0.505) | -0.23 (0.128) | 29 | 0.715 |
| (9) | -0.222 (1.151) | -0.206 (0.248) | 14 | 0.6 |
| (10) | -0.69 (0.46) | -0.135 (0.123) | 18 | 0.754 |
| (11) | 0.351 (0.42) | -0.061 (0.099) | 24 | 0.782 |
| (12) | -0.986 (0.458) | -0.242 (0.102) | 36 | 0.696 |
| (13) | -0.466 (0.472) | -0.224 (0.149) | 14 | 0.673 |
| (14) | -0.879 (0.509) | -0.139 (0.115) | 35 | 0.658 |
| (15) | -1.851 (1.072) | -0.218 (0.194) | 9 | 0.954 |
| (16) | -2.697 (1.051) | -0.24 (0.147) | 20 | 0.798 |
| (17) | -1.532 (0.387) | -0.182 (0.132) | 30 | 0.834 |
| (18) | -1.274 (0.329) | -0.093 (0.076) | 24 | 0.8 |
| (19) | -1.120 (0.838) | -0.254 (0.221) | 19 | 0.51 |
| (20) | -0.937 (0.529) | -0.171 (0.104) | 23 | 0.841 |
| (21) | -1.012 (0.531) | -0.189 (0.14) | 27 | 0.735 |
| (22) | -1.315 (0.37) | -0.102 (0.097) | 23 | 0.598 |
| (23) | -1.137 (0.527) | -0.323 (0.188) | 23 | 0.771 |
| (24) | -1.033 (0.447) | -0.033 (0.132) | 25 | 0.597 |
| (25) | -1.700 (0.5) | -0.235 (0.146) | 18 | 0.877 |
| (26) | -0.43 (0.303) | -0.059 (0.09) | 24 | 0.829 |
| (27) | -1.065 (0.539) | -0.326 (0.093) | 34 | 0.668 |
| (28) | -0.065 (0.674) | 0.099 (0.185) | 16 | 0.514 |
| (29) | -0.575 (0.412) | 0.03 (0.148) | 26 | 0.761 |
| (30) | -0.659 (0.48) | -0.256 (0.173) | 22 | 0.788 |
| (31) | -0.845 (0.411) | -0.026 (0.112) | 35 | 0.654 |

Format of coefficient cell: Coefficient (Std. Error)

Where to Go Next