PyFixest is a Python package for fast high-dimensional fixed effects regression. In this tutorial, we’ll show you how to fit your first regression with PyFixest.
A First Regression: The Causal Returns to Education via Twin Studies
We want to estimate the causal returns of education on earnings via a twin study. In this notebook, we focus on PyFixest estimation functionality and syntax. For details on the question at hand, please take a look at the OLS with Fixed Effects vignette.
In a first step, we load a synthetic twin-study style data set:
```python
import pyfixest as pf

twins = pf.get_twin_data(N_pairs=500, seed=42)
twins.head()
```
|   | twin_pair_id | twin_id | ability   | educ      | age  | experience | log_wage |
|---|--------------|---------|-----------|-----------|------|------------|----------|
| 0 | 1            | 1       | 0.304717  | 14.880083 | 38.0 | 17.119917  | 3.241823 |
| 1 | 1            | 2       | 0.304717  | 13.942729 | 49.0 | 29.057271  | 3.379130 |
| 2 | 2            | 1       | -1.039984 | 10.041047 | 33.0 | 16.958953  | 2.303006 |
| 3 | 2            | 2       | -1.039984 | 8.475001  | 32.0 | 17.524999  | 2.057258 |
| 4 | 3            | 1       | 0.750451  | 8.000000  | 35.0 | 21.000000  | 3.449381 |
pf.get_twin_data() returns a simulated twin-pair dataset where each row is one individual twin and twin_pair_id identifies a pair of twins. We have these other relevant variables:
educ: years of education completed
earnings (or log_wage in transformed specs): labor-market outcome used as the dependent variable
experience: a proxy for labor-market experience
twin_pair_id: twin-pair identifier
ability: an unobserved confounder that leads to both higher earnings and more years of schooling
PyFixest’s core estimation function is called feols(). As a bare minimum, you need to pass a pandas or polars data frame and a Wilkinson formula.
We first estimate a naive OLS regression in which we regress earnings on years of education and experience. The resulting coefficient on educ will not reflect a causal effect of education on earnings, but will be biased because higher-ability students likely select into more schooling and would also have higher earnings later in life even in the absence of additional education.
We will now estimate a twin fixed-effects model, which aims to control for ability by comparing outcomes within twin pairs. As twins share the same genetic endowment, twin studies hypothesize that controlling for twin fixed effects can remove much of the ability-related confounding that biases cross-sectional OLS estimates. In both models, we cluster standard errors at the twin-pair level.
We compare both specifications side by side via etable().
```python
pf.etable(
    [fit_naive, fit_fe],
    labels={
        "log_wage": "Log Hourly Wage",
        "educ": "Years of Education",
        "experience": "Experience",
    },
    felabels={"twin_pair_id": "Twin Pair FE"},
    caption="Returns to Education: Naive OLS vs Twin Fixed Effects",
)
```
**Returns to Education: Naive OLS vs Twin Fixed Effects**

| Log Hourly Wage    | (1)           | (2)           |
|--------------------|---------------|---------------|
| Years of Education | 0.114 (0.007) | 0.088 (0.007) |
| Experience         | 0.019 (0.002) | 0.020 (0.002) |
| Intercept          | 1.113 (0.098) |               |
| Twin Pair FE       | -             | x             |
| Observations       | 1,000         | 1,000         |
| R2                 | 0.283         | 0.801         |

Format of coefficient cell: Coefficient (Std. Error)
We see that the estimated return to education is smaller once we include twin fixed effects, consistent with the naive OLS estimate being biased upward by unobserved ability differences across individuals.
Now, what if we had the unobserved confounder at hand? In this case, we could simply control for it in our regression model. In real life, we likely wouldn’t be so lucky to have it, but alas, here we have access to it as we are working with synthetic data:
```python
fit_latent = pf.feols(
    "log_wage ~ educ + experience + ability",
    data=twins,
    vcov={"CRV1": "twin_pair_id"},
)

pf.coefplot(
    [fit_naive, fit_fe, fit_latent],
    keep=["educ"],
    coord_flip=False,
    title="Three Estimates for the Returns of Education on Wages",
)
```
We see that the fixed-effects design yields an estimate much closer to the model in which we directly control for the unobserved confounder. By including twin fixed effects, we have effectively controlled for the unobserved confounder, ability.
Where to Go Next
Now that we’ve fit our first regression, we can jump right into one of the next tutorials that showcases core PyFixest workflows for estimation, inference, and reporting of regression models with (and without) fixed effects.
We provide more examples of fixed effects designs, including twin studies, worker-firm panels, and difference-in-differences models. We also provide some intuition on how the demeaning behind PyFixest works via the Frisch-Waugh-Lovell Theorem.
Interior-point quantile regression: model the full conditional distribution, not just the mean, with an example from software observability (p99 latency).
You can browse all tutorials in the Tutorial Gallery, or see the How-To Guides for task-oriented recipes. The Function Reference has all details around functions and their arguments, classes, methods, and attributes.