Formula Syntax

Core Estimation
In this tutorial, we showcase PyFixest's formula syntax, including syntax for fitting models with fixed effects, interactions, and multiple-estimation operators.

Setup

import numpy as np
import pyfixest as pf
data = pf.get_data()
data.head()
|   | Y | Y2 | X1 | X2 | f1 | f2 | f3 | group_id | Z1 | Z2 | weights |
|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| 0 | NaN | 2.357103 | 0.0 | 0.457858 | 15.0 | 0.0 | 7.0 | 9.0 | -0.330607 | 1.054826 | 0.661478 |
| 1 | -1.458643 | 5.163147 | NaN | -4.998406 | 6.0 | 21.0 | 4.0 | 8.0 | NaN | -4.113690 | 0.772732 |
| 2 | 0.169132 | 0.751140 | 2.0 | 1.558480 | NaN | 1.0 | 7.0 | 16.0 | 1.207778 | 0.465282 | 0.990929 |
| 3 | 3.319513 | -2.656368 | 1.0 | 1.560402 | 1.0 | 10.0 | 11.0 | 3.0 | 2.869997 | 0.467570 | 0.021123 |
| 4 | 0.134420 | -1.866416 | 2.0 | -3.472232 | 19.0 | 20.0 | 6.0 | 14.0 | 0.835819 | -3.115669 | 0.790815 |

PyFixest specifies regression models via Wilkinson formulas, implemented through the formulaic package. Wilkinson formulas should be familiar to you if you have used R's lm() or the statsmodels formula API. Many additional ideas implemented in PyFixest were developed in the fixest R package, most notably the multiple-estimation syntax, the i() operator, and sample splitting. By default, all formula options presented here are supported by all models available via the pf.feols(), pf.feglm(), and pf.fepois() APIs.

Basic Syntax

In the simplest case, we regress Y on the covariates X1 and X2.

fit1 = pf.feols("Y ~ X1 + X2", data=data)
fit1.summary()
###

Estimation:  OLS
Dep. var.: Y
sample: None = all
Inference:  iid
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      0.889 |        0.108 |     8.197 |      0.000 |  0.676 |   1.102 |
| X1            |     -0.993 |        0.082 |   -12.092 |      0.000 | -1.154 |  -0.832 |
| X2            |     -0.176 |        0.022 |    -8.102 |      0.000 | -0.219 |  -0.134 |
---
RMSE: 2.09 R2: 0.177 

All transformations supported by formulaic are also supported by PyFixest. For example, you can create categorical variables via the C() operator:

fit2 = pf.feols("Y ~ X1 + X2 + C(f1)", data=data)

You can interact variables via the * and : operators:

fit3 = pf.feols("Y ~ X1:X2", data=data)
fit4 = pf.feols("Y ~ X1*X2", data=data)
pf.etable([fit3, fit4])
| Coefficient | (1) Y | (2) Y |
|:--|--:|--:|
| X1 × X2 | -0.099 (0.018) | 0.02 (0.027) |
| X1 | | -0.992 (0.082) |
| X2 | | -0.197 (0.036) |
| Intercept | -0.136 (0.072) | 0.888 (0.108) |
| Observations | 998 | 998 |
| R2 | 0.031 | 0.177 |

Format of coefficient cell: Coefficient (Std. Error)

To take the logarithm of a variable, just use

fit5 = pf.feols("Y ~ log(X1)", data=data)

or apply any numpy transform, e.g.

fit5_np = pf.feols("Y ~ X1 + np.power(X1, 2)", data=data)

Note: for logarithms, we suggest not relying on np.log but using the built-in log operator instead.

Fixed Effects Syntax

We can add fixed effects after the | operator; here, we add the two fixed effects f1 and f2.

fit6 = pf.feols("Y ~ X1 + X2 | f1 + f2", data=data)

We can interact two fixed effects via the ^ operator.

fit7 = pf.feols("Y ~ X1 + X2 | f1^f2", data=data)

For details on fixed effects regression, take a look at the OLS with Fixed Effects vignette.

Instrumental Variables (IV) Syntax

For IV estimation, PyFixest uses a three-part formula syntax:

"Y ~ exogenous_controls | fixed_effects | endogenous ~ instruments"

Here is a minimal example with fixed effects:

fit_iv = pf.feols("Y ~ X2 | f1 + f2 | X1 ~ Z1", data=data)
fit_iv.summary()
###

Estimation:  IV
Dep. var.: Y, Fixed effects: f1 + f2
sample: None = all
Inference:  iid
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X2            |     -0.174 |        0.015 |   -11.701 |      0.000 | -0.204 |  -0.145 |
| X1            |     -1.050 |        0.089 |   -11.793 |      0.000 | -1.225 |  -0.875 |
---

For details on IV estimation, take a look at the Instrumental Variables vignette.
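The mechanics behind the third formula part can be sketched in plain numpy for the just-identified case: two-stage least squares replaces the OLS normal equations with instrument moment conditions. This is a conceptual illustration on simulated data, not PyFixest internals:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Simulated endogeneity: the error u shifts both x and y,
# while the instrument z moves x but affects y only through x
u = rng.normal(size=n)
z = rng.normal(size=n)
x = 0.8 * z + u + rng.normal(size=n)
y = 1.5 * x + 2.0 * u + rng.normal(size=n)   # true effect of x is 1.5

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

# OLS is inconsistent because x is correlated with the error
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0][1]

# 2SLS in the just-identified case: beta = (Z'X)^{-1} Z'y
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)[1]

print(beta_ols, beta_iv)  # OLS drifts away from 1.5; 2SLS recovers it
```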

The i() operator for interacting fixed effects

For interacting fixed effects, PyFixest includes a specialised operator, i().

If you simply wrap a variable into i(), it is treated just like the C() operator (see above).

fit_i = pf.feols("Y ~ i(f1)", data=data)
fit_c = pf.feols("Y ~ C(f1)", data=data)

But overall, i() is more powerful than C(). Most importantly, you can easily set the reference level of the categorical variable:

# set 1 as reference level
fit_i1 = pf.feols("Y ~ i(f1, ref = 1)", data=data)
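Conceptually, setting a reference level amounts to dummy-encoding the variable and dropping the reference category, whose effect is then absorbed by the intercept. A small pandas sketch with toy data (illustrative values, not the tutorial dataset):

```python
import pandas as pd

# Toy factor with three levels
f1 = pd.Series([1, 2, 3, 2, 1, 3], name="f1")

# ref=1 conceptually means: build one dummy per level of f1 and drop
# the level-1 dummy, so level 1 becomes the baseline
dummies = pd.get_dummies(f1, prefix="f1").drop(columns="f1_1")
print(list(dummies.columns))  # ['f1_2', 'f1_3']
```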

You can also easily interact variables:

# interact f1 and f2
fit_i2 = pf.feols("Y ~ i(f1, f2)", data=data)

and set reference levels for both via the ref and ref2 arguments.

# set 1 and 2 as reference levels
fit_i3 = pf.feols("Y ~ i(f1, f2, ref = 1, ref2 = 2)", data=data)

This is particularly useful for difference-in-differences models.

Last, you can bin levels of a variable via the bin argument. This groups multiple levels into a single category.

fit_bin = pf.feols(
    "Y ~ i(f1, bin={'low': list(range(0, 10)), 'mid': list(range(10, 20)), 'high': list(range(20, 30))}, ref='low')",
    data=data,
)
fit_bin.summary()
###

Estimation:  OLS
Dep. var.: Y
sample: None = all
Inference:  iid
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |     -0.473 |        0.122 |    -3.887 |      0.000 | -0.712 |  -0.234 |
| f1::high      |      0.110 |        0.174 |     0.630 |      0.529 | -0.232 |   0.451 |
| f1::mid       |      0.968 |        0.176 |     5.503 |      0.000 |  0.623 |   1.313 |
---
RMSE: 2.264 R2: 0.035 
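Conceptually, binning first maps each raw level to its bin label and then dummy-encodes the binned variable. A plain-Python sketch of the mapping step (the f1 values below are illustrative):

```python
# The bin dict maps a label to the raw levels it absorbs; conceptually,
# each observation's level is translated to its bin label before encoding
bins = {
    "low": list(range(0, 10)),
    "mid": list(range(10, 20)),
    "high": list(range(20, 30)),
}
level_to_bin = {level: label for label, levels in bins.items() for level in levels}

f1_values = [15, 6, 1, 19, 25]  # illustrative f1 levels
binned = [level_to_bin[v] for v in f1_values]
print(binned)  # ['mid', 'low', 'low', 'mid', 'high']
```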

Multiple Estimation Syntax

Last, PyFixest provides syntactic sugar to fit multiple estimations in one go. This not only economizes on lines of code, but also allows for performance optimizations via caching: if you fit many regression models that share a set of fixed effects and overlapping covariates or dependent variables, and performance is poor, we highly recommend trying out multiple estimation.

For multiple estimations, we provide 5 custom operators: sw, csw, sw0, csw0 and mvsw. In addition, it is possible to specify multiple dependent variables.
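The expansion rules for the stepwise operators can be sketched in a few lines of plain Python. These helpers are purely illustrative and are not part of the PyFixest API:

```python
def sw(*terms):
    """Stepwise: one model per term."""
    return [[t] for t in terms]

def sw0(*terms):
    """Stepwise with an additional empty 'zero step'."""
    return [[]] + sw(*terms)

def csw(*terms):
    """Cumulative stepwise: terms enter one after another."""
    return [list(terms[: i + 1]) for i in range(len(terms))]

def csw0(*terms):
    """Cumulative stepwise with the zero step."""
    return [[]] + csw(*terms)

def expand(base, steps):
    """Attach each stepwise block to the fixed part of the formula."""
    return [" + ".join([base] + step) if step else base for step in steps]

print(expand("X1", sw("X2", "Z1")))    # ['X1 + X2', 'X1 + Z1']
print(expand("X1", csw0("X2", "Z1")))  # ['X1', 'X1 + X2', 'X1 + X2 + Z1']
```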

Multiple dependent variables

Multiple depvars are expanded to multiple estimations: "Y1 + Y2 ~ X1" behaves like "sw(Y1, Y2) ~ X1".

fit_multi_dep = pf.feols("Y + Y2 ~ X1 + X2", data=data)
pf.etable(fit_multi_dep)
| Coefficient | (1) Y | (2) Y2 |
|:--|--:|--:|
| X1 | -0.993 (0.082) | -1.316 (0.214) |
| X2 | -0.176 (0.022) | -0.133 (0.057) |
| Intercept | 0.889 (0.108) | 1.042 (0.283) |
| Observations | 998 | 999 |
| R2 | 0.177 | 0.042 |

Format of coefficient cell: Coefficient (Std. Error)

sw(): stepwise alternatives

y ~ x1 + sw(x2, x3) expands to y ~ x1 + x2 and y ~ x1 + x3.

fit_sw = pf.feols("Y ~ X1 + sw(X2, Z1)", data=data)
pf.etable(fit_sw)
| Coefficient | (1) Y | (2) Y |
|:--|--:|--:|
| X1 | -0.993 (0.082) | -0.991 (0.109) |
| X2 | -0.176 (0.022) | |
| Z1 | | -0.009 (0.068) |
| Intercept | 0.889 (0.108) | 0.918 (0.112) |
| Observations | 998 | 998 |
| R2 | 0.177 | 0.123 |

Format of coefficient cell: Coefficient (Std. Error)

sw0(): stepwise with zero step

y ~ x1 + sw0(x2, x3) expands to y ~ x1, y ~ x1 + x2, and y ~ x1 + x3.

fit_sw0 = pf.feols("Y ~ X1 + sw0(X2, Z1)", data=data)
pf.etable(fit_sw0)
| Coefficient | (1) Y | (2) Y | (3) Y |
|:--|--:|--:|--:|
| X1 | -1.000 (0.085) | -0.993 (0.082) | -0.991 (0.109) |
| X2 | | -0.176 (0.022) | |
| Z1 | | | -0.009 (0.068) |
| Intercept | 0.919 (0.112) | 0.889 (0.108) | 0.918 (0.112) |
| Observations | 998 | 998 | 998 |
| R2 | 0.123 | 0.177 | 0.123 |

Format of coefficient cell: Coefficient (Std. Error)

csw(): cumulative stepwise

y ~ x1 + csw(x2, x3) expands to y ~ x1 + x2 and y ~ x1 + x2 + x3.

fit_csw = pf.feols("Y ~ X1 + csw(X2, Z1)", data=data)
pf.etable(fit_csw)
| Coefficient | (1) Y | (2) Y |
|:--|--:|--:|
| X1 | -0.993 (0.082) | -1.010 (0.106) |
| X2 | -0.176 (0.022) | -0.177 (0.022) |
| Z1 | | 0.017 (0.066) |
| Intercept | 0.889 (0.108) | 0.889 (0.108) |
| Observations | 998 | 998 |
| R2 | 0.177 | 0.177 |

Format of coefficient cell: Coefficient (Std. Error)

csw0(): cumulative stepwise with zero step

y ~ x1 + csw0(x2, x3) expands to y ~ x1, y ~ x1 + x2, and y ~ x1 + x2 + x3.

fit_csw0 = pf.feols("Y ~ X1 + csw0(X2, Z1)", data=data)
pf.etable(fit_csw0)
| Coefficient | (1) Y | (2) Y | (3) Y |
|:--|--:|--:|--:|
| X1 | -1.000 (0.085) | -0.993 (0.082) | -1.010 (0.106) |
| X2 | | -0.176 (0.022) | -0.177 (0.022) |
| Z1 | | | 0.017 (0.066) |
| Intercept | 0.919 (0.112) | 0.889 (0.108) | 0.889 (0.108) |
| Observations | 998 | 998 | 998 |
| R2 | 0.123 | 0.177 | 0.177 |

Format of coefficient cell: Coefficient (Std. Error)

mvsw(): multiverse stepwise

y ~ mvsw(x1, x2, x3) expands to the zero step plus every non-empty combination of the terms: y ~ 1, y ~ x1, y ~ x2, y ~ x3, y ~ x1 + x2, y ~ x1 + x3, y ~ x2 + x3, y ~ x1 + x2 + x3.
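This expansion can be reproduced with itertools.combinations (an illustrative sketch, not PyFixest's implementation):

```python
from itertools import combinations

def mvsw(*terms):
    """Zero step plus every non-empty subset of the terms."""
    subsets = [[]]
    for k in range(1, len(terms) + 1):
        subsets += [list(c) for c in combinations(terms, k)]
    return subsets

models = [" + ".join(s) if s else "1" for s in mvsw("X1", "X2", "Z1")]
print(models)
# ['1', 'X1', 'X2', 'Z1', 'X1 + X2', 'X1 + Z1', 'X2 + Z1', 'X1 + X2 + Z1']
```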

fit_mvsw = pf.feols("Y ~ mvsw(X1, X2, Z1)", data=data)
pf.etable(fit_mvsw)
Dependent variable: Y.

| Coefficient | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|
| X1 | | -1.000 (0.085) | | | -0.993 (0.082) | -0.991 (0.109) | | -1.010 (0.106) |
| X2 | | | -0.178 (0.023) | | -0.176 (0.022) | | -0.172 (0.023) | -0.177 (0.022) |
| Z1 | | | | -0.396 (0.054) | | -0.009 (0.068) | -0.378 (0.053) | 0.017 (0.066) |
| Intercept | -0.127 (0.073) | 0.919 (0.112) | -0.15 (0.071) | 0.286 (0.091) | 0.889 (0.108) | 0.918 (0.112) | 0.246 (0.089) | 0.889 (0.108) |
| Observations | 999 | 998 | 999 | 998 | 998 | 998 | 998 | 998 |
| R2 | 0 | 0.123 | 0.055 | 0.05 | 0.177 | 0.123 | 0.102 | 0.177 |

Format of coefficient cell: Coefficient (Std. Error)

Combining operators

Multiple estimation operators can be combined. For example, y ~ csw(x1, x2) + sw(z1, z2) expands to y ~ x1 + z1, y ~ x1 + z2, y ~ x1 + x2 + z1, y ~ x1 + x2 + z2.
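The combined expansion amounts to a cartesian product of the individual expansions, which can be sketched with itertools.product (an illustrative sketch, not PyFixest's implementation):

```python
from itertools import product

csw_steps = [["X1"], ["X1", "X2"]]  # expansion of csw(X1, X2)
sw_steps = [["Z1"], ["X1:Z1"]]      # expansion of sw(Z1, X1:Z1)

# Combined operators expand to the cartesian product of the two blocks
models = [" + ".join(a + b) for a, b in product(csw_steps, sw_steps)]
print(models)
# ['X1 + Z1', 'X1 + X1:Z1', 'X1 + X2 + Z1', 'X1 + X2 + X1:Z1']
```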

fit_combo = pf.feols("Y ~ csw(X1, X2) + sw(Z1, X1:Z1)", data=data)
pf.etable(fit_combo)
Dependent variable: Y.

| Coefficient | (1) | (2) | (3) | (4) |
|:--|--:|--:|--:|--:|
| X1 | -0.991 (0.109) | -1.014 (0.13) | -1.010 (0.106) | -1.041 (0.126) |
| Z1 | -0.009 (0.068) | | 0.017 (0.066) | |
| X1 × Z1 | | 0.007 (0.049) | | 0.024 (0.047) |
| X2 | | | -0.177 (0.022) | -0.177 (0.022) |
| Intercept | 0.918 (0.112) | 0.921 (0.113) | 0.889 (0.108) | 0.897 (0.11) |
| Observations | 998 | 998 | 998 | 998 |
| R2 | 0.123 | 0.123 | 0.177 | 0.177 |

Format of coefficient cell: Coefficient (Std. Error)

Regressions on Multiple Samples

Via the split and fsplit arguments, you can easily estimate the same model on different subsamples.

  • split estimates separate models by subgroup.
  • fsplit does the same but also keeps the full-sample fit.
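Conceptually, split partitions the data by the levels of the splitting variable and refits the model on each part, much like a pandas groupby. A simplified sketch with a toy simple regression (simulated data; not PyFixest internals):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "y": rng.normal(size=100),
    "x": rng.normal(size=100),
    "g": rng.integers(0, 4, size=100),  # splitting variable with 4 levels
})

def ols_slope(sub):
    """Simple-regression slope of y on x within one subsample."""
    x = sub["x"] - sub["x"].mean()
    y = sub["y"] - sub["y"].mean()
    return float(x @ y) / float(x @ x)

# split="g" conceptually fits the same model once per level of g;
# fsplit would additionally fit it on the full df
slopes = {level: ols_slope(sub) for level, sub in df.groupby("g")}
print(slopes)  # one slope estimate per subsample
```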
fit_split = pf.feols("Y ~ X1 + X2 | f1", data=data, split="f2")
pf.etable(fit_split)
Dependent variable: Y. One model per level of f2; all models include f1 fixed effects.

| Sample | X1 | X2 | Observations | R2 |
|:--|--:|--:|--:|--:|
| (1) | -2.177 (0.638) | -0.145 (0.065) | 24 | 0.924 |
| (2) | -0.801 (0.466) | -0.106 (0.134) | 32 | 0.788 |
| (3) | 0.495 (0.344) | -0.022 (0.093) | 10 | 0.975 |
| (4) | -2.044 (0.482) | -0.177 (0.144) | 27 | 0.858 |
| (5) | -0.519 (0.785) | -0.096 (0.212) | 19 | 0.749 |
| (6) | -0.974 (0.312) | -0.213 (0.086) | 28 | 0.781 |
| (7) | 0.056 (0.505) | -0.23 (0.128) | 29 | 0.715 |
| (8) | -0.222 (1.151) | -0.206 (0.248) | 14 | 0.6 |
| (9) | -0.69 (0.46) | -0.135 (0.123) | 18 | 0.754 |
| (10) | 0.351 (0.42) | -0.061 (0.099) | 24 | 0.782 |
| (11) | -0.986 (0.458) | -0.242 (0.102) | 36 | 0.696 |
| (12) | -0.466 (0.472) | -0.224 (0.149) | 14 | 0.673 |
| (13) | -0.879 (0.509) | -0.139 (0.115) | 35 | 0.658 |
| (14) | -1.851 (1.072) | -0.218 (0.194) | 9 | 0.954 |
| (15) | -2.697 (1.051) | -0.24 (0.147) | 20 | 0.798 |
| (16) | -1.532 (0.387) | -0.182 (0.132) | 30 | 0.834 |
| (17) | -1.274 (0.329) | -0.093 (0.076) | 24 | 0.8 |
| (18) | -1.120 (0.838) | -0.254 (0.221) | 19 | 0.51 |
| (19) | -0.937 (0.529) | -0.171 (0.104) | 23 | 0.841 |
| (20) | -1.012 (0.531) | -0.189 (0.14) | 27 | 0.735 |
| (21) | -1.315 (0.37) | -0.102 (0.097) | 23 | 0.598 |
| (22) | -1.137 (0.527) | -0.323 (0.188) | 23 | 0.771 |
| (23) | -1.033 (0.447) | -0.033 (0.132) | 25 | 0.597 |
| (24) | -1.700 (0.5) | -0.235 (0.146) | 18 | 0.877 |
| (25) | -0.43 (0.303) | -0.059 (0.09) | 24 | 0.829 |
| (26) | -1.065 (0.539) | -0.326 (0.093) | 34 | 0.668 |
| (27) | -0.065 (0.674) | 0.099 (0.185) | 16 | 0.514 |
| (28) | -0.575 (0.412) | 0.03 (0.148) | 26 | 0.761 |
| (29) | -0.659 (0.48) | -0.256 (0.173) | 22 | 0.788 |
| (30) | -0.845 (0.411) | -0.026 (0.112) | 35 | 0.654 |

Format of coefficient cell: Coefficient (Std. Error)
fit_fsplit = pf.feols("Y ~ X1 + X2 | f1", data=data, fsplit="f2")
pf.etable(fit_fsplit)
Dependent variable: Y. The first column is the full sample; the remaining columns are one model per level of f2. All models include f1 fixed effects.

| Sample | X1 | X2 | Observations | R2 |
|:--|--:|--:|--:|--:|
| (1) full sample | -0.95 (0.066) | -0.174 (0.018) | 997 | 0.489 |
| (2) | -2.177 (0.638) | -0.145 (0.065) | 24 | 0.924 |
| (3) | -0.801 (0.466) | -0.106 (0.134) | 32 | 0.788 |
| (4) | 0.495 (0.344) | -0.022 (0.093) | 10 | 0.975 |
| (5) | -2.044 (0.482) | -0.177 (0.144) | 27 | 0.858 |
| (6) | -0.519 (0.785) | -0.096 (0.212) | 19 | 0.749 |
| (7) | -0.974 (0.312) | -0.213 (0.086) | 28 | 0.781 |
| (8) | 0.056 (0.505) | -0.23 (0.128) | 29 | 0.715 |
| (9) | -0.222 (1.151) | -0.206 (0.248) | 14 | 0.6 |
| (10) | -0.69 (0.46) | -0.135 (0.123) | 18 | 0.754 |
| (11) | 0.351 (0.42) | -0.061 (0.099) | 24 | 0.782 |
| (12) | -0.986 (0.458) | -0.242 (0.102) | 36 | 0.696 |
| (13) | -0.466 (0.472) | -0.224 (0.149) | 14 | 0.673 |
| (14) | -0.879 (0.509) | -0.139 (0.115) | 35 | 0.658 |
| (15) | -1.851 (1.072) | -0.218 (0.194) | 9 | 0.954 |
| (16) | -2.697 (1.051) | -0.24 (0.147) | 20 | 0.798 |
| (17) | -1.532 (0.387) | -0.182 (0.132) | 30 | 0.834 |
| (18) | -1.274 (0.329) | -0.093 (0.076) | 24 | 0.8 |
| (19) | -1.120 (0.838) | -0.254 (0.221) | 19 | 0.51 |
| (20) | -0.937 (0.529) | -0.171 (0.104) | 23 | 0.841 |
| (21) | -1.012 (0.531) | -0.189 (0.14) | 27 | 0.735 |
| (22) | -1.315 (0.37) | -0.102 (0.097) | 23 | 0.598 |
| (23) | -1.137 (0.527) | -0.323 (0.188) | 23 | 0.771 |
| (24) | -1.033 (0.447) | -0.033 (0.132) | 25 | 0.597 |
| (25) | -1.700 (0.5) | -0.235 (0.146) | 18 | 0.877 |
| (26) | -0.43 (0.303) | -0.059 (0.09) | 24 | 0.829 |
| (27) | -1.065 (0.539) | -0.326 (0.093) | 34 | 0.668 |
| (28) | -0.065 (0.674) | 0.099 (0.185) | 16 | 0.514 |
| (29) | -0.575 (0.412) | 0.03 (0.148) | 26 | 0.761 |
| (30) | -0.659 (0.48) | -0.256 (0.173) | 22 | 0.788 |
| (31) | -0.845 (0.411) | -0.026 (0.112) | 35 | 0.654 |

Format of coefficient cell: Coefficient (Std. Error)

Where to Go Next