A First Regression with PyFixest

Installation

PyFixest is a Python package for fast high-dimensional fixed effects regression. In this tutorial, we’ll show you how to fit your first regression with PyFixest.

You can install PyFixest from PyPI via

pip install -U pyfixest
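To confirm that the installation worked, one option is to query the installed version from Python. The following is a minimal sketch that uses only the standard library; the helper name installed_version is hypothetical and not part of PyFixest.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, otherwise None
print(installed_version("pyfixest"))
```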

A First Regression: The Causal Returns to Education via Twin Studies

We want to estimate the causal effect of education on earnings via a twin study. In this notebook, we focus on PyFixest's estimation functionality and syntax. For details on the question at hand, please take a look at the OLS with Fixed Effects vignette.

In a first step, we load a synthetic twin-study style data set:

import pyfixest as pf

twins = pf.get_twin_data(N_pairs=500, seed=42)
twins.head()
   twin_pair_id  twin_id    ability       educ   age  experience  log_wage
0             1        1   0.304717  14.880083  38.0   17.119917  3.241823
1             1        2   0.304717  13.942729  49.0   29.057271  3.379130
2             2        1  -1.039984  10.041047  33.0   16.958953  2.303006
3             2        2  -1.039984   8.475001  32.0   17.524999  2.057258
4             3        1   0.750451   8.000000  35.0   21.000000  3.449381

pf.get_twin_data() returns a simulated twin-pair dataset in which each row is one individual twin and twin_pair_id identifies a pair of twins. The variables relevant for this tutorial are:

  • educ: years of education completed
  • log_wage: log hourly wage, the labor-market outcome used as the dependent variable
  • experience: a proxy for labor-market experience
  • twin_pair_id: twin-pair identifier
  • ability: a confounder that leads to both higher earnings and more years of schooling; it would be unobserved in real data, but is available here because the data is simulated

PyFixest’s core estimation function is called feols(). As a bare minimum, you need to pass a pandas or polars data frame and a Wilkinson formula.

We first estimate a naive OLS regression in which we regress earnings on years of education and experience. The resulting coefficient on educ will not reflect a causal effect of education on earnings, but will be biased because higher-ability students likely select into more schooling and would also have higher earnings later in life even in the absence of additional education.
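The direction of this bias can be made explicit with the textbook omitted-variable formula. The sketch below is a simplification that drops experience and treats educ as the only regressor. The symbols α, β, γ and ε are notation introduced here for illustration and do not describe how the synthetic data was actually generated.

```latex
\begin{align}
\text{log\_wage}_i &= \alpha + \beta\,\text{educ}_i + \gamma\,\text{ability}_i + \varepsilon_i \\
\operatorname{plim}\,\hat{\beta}_{\text{naive}} &= \beta + \gamma\,\frac{\operatorname{Cov}(\text{educ}_i,\ \text{ability}_i)}{\operatorname{Var}(\text{educ}_i)}
\end{align}
```

If ability raises wages (γ > 0) and is positively correlated with schooling, the second term is positive and the naive coefficient overstates β.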

We will now estimate a twin fixed-effects model, which aims to control for ability by comparing outcomes within twin pairs. As twins share the same genetic endowment, twin studies hypothesize that controlling for twin fixed effects can remove much of the ability-related confounding that biases cross-sectional OLS estimates. In both models, we cluster standard errors at the twin-pair level.

fit_naive = pf.feols(
    "log_wage ~ educ + experience",
    data=twins,
    vcov={"CRV1": "twin_pair_id"},
)
fit_fe = pf.feols(
    "log_wage ~ educ + experience | twin_pair_id",
    data=twins,
    vcov={"CRV1": "twin_pair_id"},
)

We compare both specifications side by side via etable().

pf.etable(
    [fit_naive, fit_fe],
    labels={
        "log_wage": "Log Hourly Wage",
        "educ": "Years of Education",
        "experience": "Experience",
    },
    felabels={"twin_pair_id": "Twin Pair FE"},
    caption="Returns to Education: Naive OLS vs Twin Fixed Effects",
)
Returns to Education: Naive OLS vs Twin Fixed Effects

                      Log Hourly Wage
                      (1)        (2)
coef
Years of Education    0.114      0.088
                      (0.007)    (0.007)
Experience            0.019      0.02
                      (0.002)    (0.002)
Intercept             1.113
                      (0.098)
fe
Twin Pair FE          -          x
stats
Observations          1,000      1,000
R2                    0.283      0.801

Format of coefficient cell: Coefficient (Std. Error)

We see that the estimated return to education is smaller once we include twin fixed effects (0.088 versus 0.114). This is consistent with unobserved ability differences across individuals biasing the naive OLS estimate upward.
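To build intuition for why the fixed effects help, here is a small standard-library sketch. It is not PyFixest code, and its data-generating process is hypothetical (the values of beta and gamma are made up and unrelated to pf.get_twin_data()). With exactly two twins per pair and a single regressor, the fixed-effects estimate equals a through-the-origin regression of within-pair wage differences on within-pair education differences, so anything shared within the pair, such as ability, cancels out.

```python
import random


def simulate_twins(n_pairs, beta=0.08, gamma=0.3, seed=0):
    """Hypothetical DGP: ability is shared within a pair and raises educ and log wage."""
    rng = random.Random(seed)
    rows = []
    for pair in range(n_pairs):
        ability = rng.gauss(0, 1)  # identical for both twins of a pair
        for _ in range(2):
            educ = 12 + ability + rng.gauss(0, 2)
            log_wage = 1 + beta * educ + gamma * ability + rng.gauss(0, 0.3)
            rows.append((pair, educ, log_wage))
    return rows


def ols_slope(x, y):
    """Slope of a simple regression of y on x with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den


def within_pair_slope(rows):
    """Regress within-pair wage differences on education differences (no intercept)."""
    num = den = 0.0
    for i in range(0, len(rows), 2):  # twins of a pair are stored consecutively
        (_, e1, w1), (_, e2, w2) = rows[i], rows[i + 1]
        dx, dy = e1 - e2, w1 - w2  # shared ability drops out of dy
        num += dx * dy
        den += dx * dx
    return num / den


rows = simulate_twins(2000)
naive = ols_slope([r[1] for r in rows], [r[2] for r in rows])
within = within_pair_slope(rows)
print(f"naive OLS slope:   {naive:.3f}")   # biased upward by omitted ability
print(f"within-pair slope: {within:.3f}")  # close to the true beta of 0.08
```

In this simulation the naive slope absorbs part of the ability effect, while the within-pair slope recovers a value near the true coefficient.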

Now, what if we had the unobserved confounder at hand? In this case, we could simply control for it in our regression model. In real life, we likely wouldn't be so lucky, but here we do have access to it because we are working with synthetic data:

fit_latent = pf.feols(
    "log_wage ~ educ + experience + ability",
    data=twins,
    vcov={"CRV1": "twin_pair_id"},
)

pf.coefplot(
    [fit_naive, fit_fe, fit_latent],
    keep=["educ"],
    coord_flip=False,
    title="Three Estimates for the Returns of Education on Wages",
)

We see that the fixed effects design gives us an estimate that is much closer to the one from the model in which we control for the otherwise unobserved confounder directly. By including twin fixed effects, we have removed much of the confounding from ability without ever observing it.

Where to Go Next

Now that we’ve fit our first regression, we can jump right into one of the next tutorials that showcases core PyFixest workflows for estimation, inference, and reporting of regression models with (and without) fixed effects.

  • OLS with Fixed Effects: We provide more examples of fixed effects designs, including twin studies, worker-firm panels, and difference-in-differences models. We also provide some intuition on how the demeaning behind PyFixest works via the Frisch-Waugh-Lovell Theorem.
  • Formula Syntax: We explain PyFixest's formula interface in detail, including special operators such as i() for interactions and the multiple estimation syntax.
  • Standard Errors & Inference: We showcase different options for conducting inference with PyFixest, including iid, heteroskedasticity-robust, and cluster-robust standard errors, and more.
  • Regression Tables: We show how to produce publication-ready tables via the pf.etable() function and maketables.
  • Difference-in-Differences: TWFE, Gardner's two-stage DID2S, local projections, and event study designs with heterogeneous treatment effects.
  • Quantile Regression: Interior-point quantile regression: model the full conditional distribution, not just the mean, with an example from software observability (p99 latency).

You can browse all tutorials in the Tutorial Gallery, or see the How-To Guides for task-oriented recipes. The Function Reference has all details around functions and their arguments, classes, methods, and attributes.