Translating Stata to PyFixest
How to Get Started
This guide focuses on how to replicate the regression results you would get in Stata with the Python package pyfixest, and assumes you know how to do things like install Python packages and load data into Pandas. For a broader introduction to doing econometrics in Python, you might check out Arthur Turrell's Coding for Economists, which includes a section on Coming from Stata, or Tidy Finance with Python by Christopher Scheuch, Stefan Voigt, Patrick Weiss, and Christoph Frey. You can also check out the fixest section of stata2r, as there is a lot of overlap between fixest and PyFixest syntax.
Data
pyfixest includes a function to generate a dataset for testing.

import pyfixest as pf
df = pf.get_data()
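If you want a reproducible draw or a different sample size, get_data() takes a few arguments; a minimal sketch, assuming the N and seed parameters of pf.get_data (check the docstring of your installed version):

# Generate a smaller sample with a fixed seed (assumed parameter names)
df_small = pf.get_data(N=500, seed=42)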
If you want to use the same dataset in Stata, you can save the data to your home directory as a .dta file with
import os
df.to_stata(os.path.join(os.path.expanduser("~"), "pyfixest-data.dta"))
and then load the data in Stata with
cd ~
use pyfixest-data.dta
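Note: if your Stata release is older and cannot read the file, pandas' to_stata() accepts a version argument controlling the .dta format version (this is a pandas feature, not part of pyfixest); a minimal sketch:

# Write the .dta in the Stata 14+ (format 118) dialect;
# see the pandas documentation for the supported version values
df.to_stata(
    os.path.join(os.path.expanduser("~"), "pyfixest-data.dta"),
    version=118,
)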
Basic OLS
To do a basic linear regression in pyfixest you would simply use

fit1 = pf.feols("Y ~ X1", data = df)
fit2 = pf.feols("Y ~ X1 + X2", data = df)
which is equivalent to
reg Y X1

*       Source |       SS           df       MS      Number of obs   =       998
* -------------+----------------------------------   F(1, 996)       =    139.30
*        Model |  650.171727         1  650.171727   Prob > F        =    0.0000
*     Residual |  4648.80985       996  4.66747977   R-squared       =    0.1227
* -------------+----------------------------------   Adj R-squared   =    0.1218
*        Total |  5298.98157       997  5.31492635   Root MSE        =    2.1604
*
* ------------------------------------------------------------------------------
*            Y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
* -------------+----------------------------------------------------------------
*           X1 |  -1.000086   .0847353   -11.80   0.000    -1.166366   -.8338056
*        _cons |   .9185178   .1118212     8.21   0.000     .6990856     1.13795
* ------------------------------------------------------------------------------
reg Y X1 X2

*       Source |       SS           df       MS      Number of obs   =       998
* -------------+----------------------------------   F(2, 995)       =    106.99
*        Model |  937.866146         2  468.933073   Prob > F        =    0.0000
*     Residual |  4361.11543       995  4.38303058   R-squared       =    0.1770
* -------------+----------------------------------   Adj R-squared   =    0.1753
*        Total |  5298.98157       997  5.31492635   Root MSE        =    2.0936
*
* ------------------------------------------------------------------------------
*            Y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
* -------------+----------------------------------------------------------------
*           X1 |  -.9929358   .0821175   -12.09   0.000    -1.154079   -.8317925
*           X2 |  -.1763424    .021766    -8.10   0.000    -.2190549   -.1336299
*        _cons |   .8887791   .1084224     8.20   0.000     .6760163    1.101542
* ------------------------------------------------------------------------------
in Stata. However, you should note that this will only run the regressions and store the results in fit1 and fit2. To show the results you can use some of the following methods and functions.
# Basic summary statistics
fit1.summary()
###
Estimation: OLS
Dep. var.: Y, Fixed effects: 0
Inference: iid
Observations: 998
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 0.919 | 0.112 | 8.214 | 0.000 | 0.699 | 1.138 |
| X1 | -1.000 | 0.085 | -11.802 | 0.000 | -1.166 | -0.834 |
---
RMSE: 2.158 R2: 0.123
# Estimates, std. errors, t-values, etc. in a "tidy" table
fit1.tidy()

| Coefficient | Estimate  | Std. Error | t value    | Pr(>|t|)     | 2.5%      | 97.5%     |
|:------------|----------:|-----------:|-----------:|-------------:|----------:|----------:|
| Intercept   | 0.918518  | 0.111821   | 8.214166   | 6.661338e-16 | 0.699086  | 1.137950  |
| X1          | -1.000086 | 0.084735   | -11.802468 | 0.000000e+00 | -1.166366 | -0.833806 |
# Customizable table that can include results for multiple models
pf.report.etable([fit1, fit2])

|              | Y                 |                   |
|:-------------|:------------------|:------------------|
|              | (1)               | (2)               |
| coef         |                   |                   |
| X1           | -1.000*** (0.085) | -0.993*** (0.082) |
| X2           |                   | -0.176*** (0.022) |
| Intercept    | 0.919*** (0.112)  | 0.889*** (0.108)  |
| stats        |                   |                   |
| Observations | 998               | 998               |
| S.E. type    | iid               | iid               |
| R2           | 0.123             | 0.177             |
| Adj. R2      | 0.122             | 0.175             |

Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)
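etable can also emit the table in other formats for use outside a notebook; a minimal sketch, assuming the type argument described in the pyfixest docs (check the signature of your installed version):

# Return the comparison table as LaTeX instead of markdown
# (`type` is an assumed argument name)
pf.report.etable([fit1, fit2], type="tex")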
You can also access individual parts of the results with a variety of methods like
# Get the coefficients
fit1.coef()
Coefficient
Intercept 0.918518
X1 -1.000086
Name: Estimate, dtype: float64
# Get the standard errors
fit1.se()
Coefficient
Intercept 0.111821
X1 0.084735
Name: Std. Error, dtype: float64
# Get the p-values
fit1.pvalue()
Coefficient
Intercept 6.661338e-16
X1 0.000000e+00
Name: Pr(>|t|), dtype: float64
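The tidy table above also reports t values and confidence intervals, and matching accessors exist for those as well; a short sketch, assuming tstat() and confint() follow the same pattern as coef() and se():

# Get the t statistics
fit1.tstat()

# Get the 95% confidence intervals
fit1.confint()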
Robust Standard Errors
To get heteroskedasticity robust standard errors you can use
fit3 = pf.feols("Y ~ X1 + X2", data=df, vcov="HC1")
fit3.summary()
###
Estimation: OLS
Dep. var.: Y, Fixed effects: 0
Inference: HC1
Observations: 998
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 0.889 | 0.108 | 8.249 | 0.000 | 0.677 | 1.100 |
| X1 | -0.993 | 0.080 | -12.439 | 0.000 | -1.150 | -0.836 |
| X2 | -0.176 | 0.022 | -8.129 | 0.000 | -0.219 | -0.134 |
---
RMSE: 2.09 R2: 0.177
which is equivalent to
reg Y X1 X2, robust

* Linear regression                               Number of obs     =        998
*                                                 F(2, 995)         =     107.91
*                                                 Prob > F          =     0.0000
*                                                 R-squared         =     0.1770
*                                                 Root MSE          =     2.0936
*
* ------------------------------------------------------------------------------
*              |               Robust
*            Y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
* -------------+----------------------------------------------------------------
*           X1 |  -.9929358   .0798259   -12.44   0.000    -1.149582   -.8362893
*           X2 |  -.1763424   .0216936    -8.13   0.000    -.2189129   -.1337719
*        _cons |   .8887791   .1077457     8.25   0.000     .6773442    1.100214
* ------------------------------------------------------------------------------
or
reg Y X1 X2, vce(robust)
* Identical output to above
or you can choose a different type of robust standard error, like "HC3", using

fit4 = pf.feols("Y ~ X1 + X2", data=df, vcov="HC3")
fit4.summary()
###
Estimation: OLS
Dep. var.: Y, Fixed effects: 0
Inference: HC3
Observations: 998
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 0.889 | 0.108 | 8.222 | 0.000 | 0.677 | 1.101 |
| X1 | -0.993 | 0.080 | -12.396 | 0.000 | -1.150 | -0.836 |
| X2 | -0.176 | 0.022 | -8.087 | 0.000 | -0.219 | -0.134 |
---
RMSE: 2.09 R2: 0.177
Note: This will not exactly match the output of the equivalent Stata command, which is
reg Y X1 X2, vce(hc3)
* Linear regression                               Number of obs     =        998
*                                                 F(2, 995)         =     107.38
*                                                 Prob > F          =     0.0000
*                                                 R-squared         =     0.1770
*                                                 Root MSE          =     2.0936
*
* ------------------------------------------------------------------------------
*              |             Robust HC3
*            Y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
* -------------+----------------------------------------------------------------
*           X1 |  -.9929358   .0799832   -12.41   0.000    -1.149891   -.8359807
*           X2 |  -.1763424   .0217734    -8.10   0.000    -.2190693   -.1336154
*        _cons |   .8887791   .1079372     8.23   0.000     .6769684     1.10059
* ------------------------------------------------------------------------------
This is because, by default, pyfixest uses two small sample corrections for HC3 robust standard errors, while Stata only uses one of them. You can turn off the correction that Stata doesn't use with the ssc argument.
fit5 = pf.feols("Y ~ X1 + X2", data=df, vcov="HC3", ssc=pf.ssc(adj = False))
fit5.summary()
###
Estimation: OLS
Dep. var.: Y, Fixed effects: 0
Inference: HC3
Observations: 998
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 0.889 | 0.108 | 8.230 | 0.000 | 0.677 | 1.101 |
| X1 | -0.993 | 0.080 | -12.408 | 0.000 | -1.150 | -0.836 |
| X2 | -0.176 | 0.022 | -8.095 | 0.000 | -0.219 | -0.134 |
---
RMSE: 2.09 R2: 0.177
which matches Stata exactly. You can read all about the small sample corrections implemented by pyfixest at On Small Sample Corrections.
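A quick way to see what the extra adjustment does is to compare the two sets of HC3 standard errors directly; because the adj correction is a single scalar degrees-of-freedom factor, the ratio should be the same for every coefficient:

# Ratio of default HC3 SEs (fit4) to Stata-style HC3 SEs (fit5);
# a constant ratio across terms reflects the single scalar correction
print(fit4.se() / fit5.se())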
Clustered Standard Errors
To cluster the standard errors by group you can use
fit6 = pf.feols("Y ~ X1 + X2", data=df.dropna(subset=['f1']), vcov={"CRV1": "f1"})
fit6.summary()
###
Estimation: OLS
Dep. var.: Y, Fixed effects: 0
Inference: CRV1
Observations: 997
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 0.890 | 0.262 | 3.393 | 0.002 | 0.353 | 1.426 |
| X1 | -0.995 | 0.076 | -13.142 | 0.000 | -1.150 | -0.840 |
| X2 | -0.177 | 0.020 | -8.920 | 0.000 | -0.217 | -0.136 |
---
RMSE: 2.091 R2: 0.177
which is equivalent to
reg Y X1 X2, vce(cluster f1)

* Linear regression                               Number of obs     =        997
*                                                 F(2, 29)          =     102.51
*                                                 Prob > F          =     0.0000
*                                                 R-squared         =     0.1774
*                                                 Root MSE          =      2.094
*
*                                    (Std. err. adjusted for 30 clusters in f1)
* ------------------------------------------------------------------------------
*              |               Robust
*            Y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
* -------------+----------------------------------------------------------------
*           X1 |  -.9951969   .0757246   -13.14   0.000    -1.150071   -.8403227
*           X2 |  -.1766173    .019799    -8.92   0.000    -.2171109   -.1361237
*        _cons |   .8895569   .2622066     3.39   0.002     .3532841     1.42583
* ------------------------------------------------------------------------------
Note: clustered standard errors are not supported with missing values in the cluster variable, which is why we drop the rows with missing values for f1.
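Before estimating, it can be worth checking how many observations the dropna() call will remove:

# Count missing values in the cluster variable (plain pandas)
print(df["f1"].isna().sum())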
For two way clustering you would need to use
fit7 = pf.feols(
    "Y ~ X1 + X2",
    data=df.dropna(subset=['f1', 'f2']),
    vcov={"CRV1": "f1 + f2"}
)
fit7.summary()
###
Estimation: OLS
Dep. var.: Y, Fixed effects: 0
Inference: CRV1
Observations: 997
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 0.890 | 0.301 | 2.958 | 0.006 | 0.274 | 1.505 |
| X1 | -0.995 | 0.073 | -13.570 | 0.000 | -1.145 | -0.845 |
| X2 | -0.177 | 0.023 | -7.535 | 0.000 | -0.225 | -0.129 |
---
RMSE: 2.091 R2: 0.177
Note: This will not exactly match the output of the equivalent Stata command, which is
reg Y X1 X2, vce(cluster f1 f2)

* Linear regression                               Number of obs     =        997
* Clusters per comb.:                             Cluster comb.     =          3
*                  min =         30               F(2, 29)          =      88.58
*                  avg =        211               Prob > F          =     0.0000
*                  max =        572               R-squared         =     0.1774
*                                                 Adj R-squared     =     0.1758
*                                                 Root MSE          =     2.0940
*
*                                   (Std. err. adjusted for multiway clustering)
* ------------------------------------------------------------------------------
*              |               Robust
*            Y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
* -------------+----------------------------------------------------------------
*           X1 |  -.9951969    .074771   -13.31   0.000    -1.148121   -.8422731
*           X2 |  -.1766173     .02376    -7.43   0.000     -.225212   -.1280226
*        _cons |   .8895569   .3016464     2.95   0.006     .2726208    1.506493
* ------------------------------------------------------------------------------
* Cluster combinations formed by f1 and f2.
This is because, by default, pyfixest uses a small sample correction that adjusts each clustering dimension by whichever dimension has the smallest number of clusters, while in Stata the default is to adjust each dimension based on the number of clusters in that dimension. You can use the same correction as Stata through the ssc argument.
fit8 = pf.feols(
    "Y ~ X1 + X2",
    data=df.dropna(subset=['f1', 'f2']),
    vcov={"CRV1": "f1 + f2"},
    ssc=pf.ssc(cluster_df="conventional")
)
fit8.summary()
###
Estimation: OLS
Dep. var.: Y, Fixed effects: 0
Inference: CRV1
Observations: 997
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 0.890 | 0.302 | 2.949 | 0.006 | 0.273 | 1.506 |
| X1 | -0.995 | 0.075 | -13.310 | 0.000 | -1.148 | -0.842 |
| X2 | -0.177 | 0.024 | -7.433 | 0.000 | -0.225 | -0.128 |
---
RMSE: 2.091 R2: 0.177
As a reminder, for an excellent breakdown of small sample corrections in the pyfixest package, you can check out On Small Sample Corrections.
Fixed Effect Regressions
To do a fixed effect regression with one fixed effect you could use
fit9 = pf.feols("Y ~ X1 + X2 | f1", data=df, vcov="iid")
fit9.summary()
###
Estimation: OLS
Dep. var.: Y, Fixed effects: f1
Inference: iid
Observations: 997
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1 | -0.950 | 0.066 | -14.306 | 0.000 | -1.080 | -0.819 |
| X2 | -0.174 | 0.018 | -9.902 | 0.000 | -0.209 | -0.140 |
---
RMSE: 1.648 R2: 0.489 R2 Within: 0.239
which is equivalent to
xtset f1
xtreg Y X1 X2, fe

* Fixed-effects (within) regression               Number of obs     =        997
* Group variable: f1                              Number of groups  =         30
*
* R-squared:                                      Obs per group:
*      Within  = 0.2388                                         min =         23
*      Between = 0.0770                                         avg =       33.2
*      Overall = 0.1774                                         max =         48
*
*                                                 F(2, 965)         =     151.33
* corr(u_i, Xb) = 0.0268                          Prob > F          =     0.0000
*
* ------------------------------------------------------------------------------
*            Y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
* -------------+----------------------------------------------------------------
*           X1 |  -.9495256   .0663728   -14.31   0.000    -1.079777   -.8192739
*           X2 |  -.1742253   .0175957    -9.90   0.000    -.2087555   -.1396951
*        _cons |    .842222   .0872525     9.65   0.000     .6709955    1.013448
* -------------+----------------------------------------------------------------
*      sigma_u |  1.2570454
*      sigma_e |  1.6751049
*          rho |  .36026283   (fraction of variance due to u_i)
* ------------------------------------------------------------------------------
* F test that all u_i=0: F(29, 965) = 20.29                  Prob > F = 0.0000
or
reghdfe Y X1 X2, absorb(f1)

* HDFE Linear regression                          Number of obs   =        997
* Absorbing 1 HDFE group                          F(  2,    965)  =     151.33
*                                                 Prob > F        =     0.0000
*                                                 R-squared       =     0.4890
*                                                 Adj R-squared   =     0.4726
*                                                 Within R-sq.    =     0.2388
*                                                 Root MSE        =     1.6751
*
* ------------------------------------------------------------------------------
*            Y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
* -------------+----------------------------------------------------------------
*           X1 |  -.9495256   .0663728   -14.31   0.000    -1.079777   -.8192739
*           X2 |  -.1742253   .0175957    -9.90   0.000    -.2087555   -.1396951
*        _cons |    .842222   .0872525     9.65   0.000     .6709955    1.013448
* ------------------------------------------------------------------------------
*
* Absorbed degrees of freedom:
* -----------------------------------------------------+
*  Absorbed FE | Categories  - Redundant  = Num. Coefs |
* -------------+---------------------------------------|
*           f1 |        30           0          30     |
* -----------------------------------------------------+
Note: You need to specify vcov="iid". This is because the default for pyfixest (as in fixest) is to cluster standard errors based on the first fixed effect. So
fit10 = pf.feols("Y ~ X1 + X2 | f1", data=df)
fit10.summary()
###
Estimation: OLS
Dep. var.: Y, Fixed effects: f1
Inference: CRV1
Observations: 997
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1 | -0.950 | 0.067 | -14.266 | 0.000 | -1.086 | -0.813 |
| X2 | -0.174 | 0.018 | -9.464 | 0.000 | -0.212 | -0.137 |
---
RMSE: 1.648 R2: 0.489 R2 Within: 0.239
is equivalent to
xtset f1
xtreg Y X1 X2, fe vce(cluster f1)

* Fixed-effects (within) regression               Number of obs     =        997
* Group variable: f1                              Number of groups  =         30
*
* R-squared:                                      Obs per group:
*      Within  = 0.2388                                         min =         23
*      Between = 0.0770                                         avg =       33.2
*      Overall = 0.1774                                         max =         48
*
*                                                 F(2, 29)          =     146.33
* corr(u_i, Xb) = 0.0268                          Prob > F          =     0.0000
*
*                                    (Std. err. adjusted for 30 clusters in f1)
* ------------------------------------------------------------------------------
*              |               Robust
*            Y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
* -------------+----------------------------------------------------------------
*           X1 |  -.9495256   .0665572   -14.27   0.000     -1.08565   -.8134009
*           X2 |  -.1742253    .018409    -9.46   0.000    -.2118759   -.1365746
*        _cons |    .842222   .0694639    12.12   0.000     .7001523    .9842916
* -------------+----------------------------------------------------------------
*      sigma_u |  1.2570454
*      sigma_e |  1.6751049
*          rho |  .36026283   (fraction of variance due to u_i)
* ------------------------------------------------------------------------------
or
reghdfe Y X1 X2, absorb(f1) cluster(f1)

* HDFE Linear regression                          Number of obs   =        997
* Absorbing 1 HDFE group                          F(  2,     29)  =     146.33
* Statistics robust to heteroskedasticity         Prob > F        =     0.0000
*                                                 R-squared       =     0.4890
*                                                 Adj R-squared   =     0.4726
*                                                 Within R-sq.    =     0.2388
* Number of clusters (f1) =         30            Root MSE        =     1.6751
*
*                                    (Std. err. adjusted for 30 clusters in f1)
* ------------------------------------------------------------------------------
*              |               Robust
*            Y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
* -------------+----------------------------------------------------------------
*           X1 |  -.9495256   .0665572   -14.27   0.000     -1.08565   -.8134009
*           X2 |  -.1742253    .018409    -9.46   0.000    -.2118759   -.1365746
*        _cons |    .842222   .0694639    12.12   0.000     .7001523    .9842916
* ------------------------------------------------------------------------------
*
* Absorbed degrees of freedom:
* -----------------------------------------------------+
*  Absorbed FE | Categories  - Redundant  = Num. Coefs |
* -------------+---------------------------------------|
*           f1 |        30          30           0    *|
* -----------------------------------------------------+
* * = FE nested within cluster; treated as redundant for DoF computation
For multiple fixed effects you could do
fit11 = pf.feols("Y ~ X1 + X2 | f1 + f2", data=df)
fit11.summary()
###
Estimation: OLS
Dep. var.: Y, Fixed effects: f1+f2
Inference: CRV1
Observations: 997
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1 | -0.924 | 0.062 | -14.934 | 0.000 | -1.051 | -0.797 |
| X2 | -0.174 | 0.015 | -11.737 | 0.000 | -0.204 | -0.144 |
---
RMSE: 1.346 R2: 0.659 R2 Within: 0.303
which is equivalent to
reghdfe Y X1 X2, absorb(f1 f2) cluster(f1)

* HDFE Linear regression                          Number of obs   =        997
* Absorbing 2 HDFE groups                         F(  2,     29)  =     182.76
* Statistics robust to heteroskedasticity         Prob > F        =     0.0000
*                                                 R-squared       =     0.6590
*                                                 Adj R-squared   =     0.6372
*                                                 Within R-sq.    =     0.3026
* Number of clusters (f1) =         30            Root MSE        =     1.3893
*
*                                    (Std. err. adjusted for 30 clusters in f1)
* ------------------------------------------------------------------------------
*              |               Robust
*            Y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
* -------------+----------------------------------------------------------------
*           X1 |  -.9240462   .0618743   -14.93   0.000    -1.050593   -.7974991
*           X2 |  -.1741073   .0148338   -11.74   0.000    -.2044458   -.1437689
*        _cons |   .8156588    .064596    12.63   0.000     .6835452    .9477723
* ------------------------------------------------------------------------------
*
* Absorbed degrees of freedom:
* -----------------------------------------------------+
*  Absorbed FE | Categories  - Redundant  = Num. Coefs |
* -------------+---------------------------------------|
*           f1 |        30          30           0    *|
*           f2 |        30           1          29     |
* -----------------------------------------------------+
* * = FE nested within cluster; treated as redundant for DoF computation
Note: By default pyfixest will still cluster standard errors just by the first group specified. You can change this using the vcov argument. See the fit8 example above for how to specify two way clustering.
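For example, combining the fit11 formula with the vcov dictionary from the fit7 example (fit12 is a hypothetical name used here for illustration):

# Two fixed effects with two-way clustered standard errors
fit12 = pf.feols(
    "Y ~ X1 + X2 | f1 + f2",
    data=df.dropna(subset=['f1', 'f2']),
    vcov={"CRV1": "f1 + f2"}
)
fit12.summary()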