When conducting online A/B tests or large-scale experiments, we often analyze multiple dependent variables (KPIs) simultaneously. Testing many KPIs at once introduces a significant statistical challenge: the multiple testing problem.
In classical hypothesis testing, the significance level \(\alpha\) controls the probability of a false positive (Type I error) and is typically set to \(\alpha = 0.05\). While this limits the chance of a false positive to 5% for a single test, the likelihood of at least one false positive grows rapidly when conducting multiple tests. For example, testing 20 KPIs independently at \(\alpha = 0.05\) results in a 64% chance of at least one false positive, calculated as \(1 - (1 - 0.05)^{20} \approx 0.64\). This issue, known as the multiple testing problem, can lead to false claims of significant effects when none exist.
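The 64% figure is easy to verify directly; the snippet below is only a back-of-the-envelope check and not part of the simulation code used later.

# probability of at least one false positive across 20 independent tests at alpha = 0.05
alpha = 0.05
n_tests = 20
print(1 - (1 - alpha) ** n_tests)  # approximately 0.64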
To address this, the concept of controlling the Familywise Error Rate (FWER) has been widely adopted. FWER controls the probability of at least one Type I error across a family of hypotheses. Several correction methods exist to mitigate the multiple testing problem, including:
Bonferroni Correction: A simple and conservative method that adjusts the significance level for each test by dividing \(\alpha\) by the number of tests.
Romano-Wolf & Westfall-Young Stepwise Procedures: Two more powerful methods that use resampling techniques to control the FWER.
This vignette demonstrates how these methods effectively control the FWER in a variety of scenarios. We will compare their performance and highlight the trade-offs between simplicity and statistical power. Specifically, we show that while Bonferroni provides strong error control, it is conservative in many practical applications. In contrast, Romano-Wolf and Westfall-Young methods are more powerful, offering greater sensitivity to detect true effects while maintaining robust control of the FWER.
What is a Family-Wise Error Rate (FWER)?
Suppose that we are running an experiment and want to test if our treatment impacts 20 different dependent variables (KPIs). For any given independent test, the chance of a false positive is given by the (significance) level of the individual test, which is most commonly set to \(\alpha = 0.05\). Formally, we can define the false positive rate for a single hypothesis \(H\) as:
\[
P(\text{reject } H \mid H \text{ is true}) = \alpha
\]
For a family of hypotheses \(\{H_{s}\}_{s \in S}\) indexed by \(S = \{1, \dots, P\}\), we can analogously define the family-wise error rate (FWER) as:
\[
\text{FWER} = P(\text{reject at least one } H_{s} \text{ with } s \in I),
\]
where \(I \subseteq S\) is the set of true null hypotheses. In other words, the FWER is the probability of making at least one false positive across all tests in the family. A multiple testing procedure controls the FWER at level \(\alpha\) if it guarantees \(\text{FWER} \leq \alpha\).
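This definition also makes clear why the Bonferroni correction mentioned above controls the FWER: if each individual test is carried out at level \(\alpha / P\), the union bound gives
\[
P\Big(\bigcup_{s \in I} \{\text{reject } H_{s}\}\Big) \;\leq\; \sum_{s \in I} P(\text{reject } H_{s}) \;\leq\; |I| \cdot \frac{\alpha}{P} \;\leq\; \alpha .
\]
The guarantee holds under any dependence structure between the tests, which is also why Bonferroni tends to be conservative; the resampling-based methods gain power by estimating that dependence structure instead of ignoring it.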
Setup
As a first step, we create a data set with multiple (potentially correlated) dependent variables that share a common set of covariates. All simulations in this notebook are closely inspired by Clarke, Romano & Wolf (Stata Journal, 2020).
%load_ext autoreload
%autoreload 2

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from great_tables import loc, style
from joblib import Parallel, delayed
from tqdm import tqdm

import pyfixest as pf
N = 100
n_covariates = 5
n_depvars = 20
def get_data(N, n_covariates, n_depvars, rho, seed):
    "Simulate data with true nulls."
    rng = np.random.default_rng(seed)
    Omega = np.eye(n_depvars)
    Omega[Omega != 1] = rho  # equi-correlated errors across the dependent variables
    X = rng.standard_normal((N, n_covariates))
    u_joint = rng.multivariate_normal(np.zeros(n_depvars), Omega, N)
    beta = np.zeros((n_covariates, n_depvars))  # all true effects are zero
    y = X @ beta + u_joint
    data = pd.DataFrame(X, columns=[f"X{i}" for i in range(n_covariates)])
    data = data.assign(**{f"y{i}": y[:, i] for i in range(n_depvars)})
    return data
data = get_data(
    N=N, n_covariates=n_covariates, n_depvars=n_depvars, rho=0.5, seed=12345
)
data.head()
|   | X0 | X1 | X2 | X3 | X4 | y0 | y1 | y2 | y3 | y4 | ... | y10 | y11 | y12 | y13 | y14 | y15 | y16 | y17 | y18 | y19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.423825 | 1.263728 | -0.870662 | -0.259173 | -0.075343 | 0.285444 | -0.613527 | 1.064762 | 0.941035 | -0.188382 | ... | 0.947734 | 0.505579 | -2.838629 | -0.563997 | 1.757725 | -0.187876 | 2.554182 | -2.486495 | 0.216818 | 0.298878 |
| 1 | -0.740885 | -1.367793 | 0.648893 | 0.361058 | -1.952863 | 1.365574 | 0.404626 | -1.869440 | -0.008121 | 2.129185 | ... | 0.439416 | -1.992477 | -0.002794 | 1.603059 | -0.244048 | 0.880147 | 0.700935 | -0.375832 | 1.113939 | 0.279370 |
| 2 | 2.347410 | 0.968497 | -0.759387 | 0.902198 | -0.466953 | 0.215444 | -0.282206 | 0.720051 | 0.366021 | 1.051319 | ... | 0.272646 | -0.507247 | -0.784782 | -1.007066 | 0.840372 | -1.150062 | 2.082881 | 0.583346 | 1.077486 | 0.027789 |
| 3 | -0.060690 | 0.788844 | -1.256668 | 0.575858 | 1.398979 | -1.065016 | 1.185823 | 0.970361 | -0.765870 | 0.479013 | ... | -1.007019 | 0.924188 | 1.949084 | -1.560250 | 1.167244 | -0.837087 | 0.360323 | 0.640585 | 0.527408 | -0.033194 |
| 4 | 1.322298 | -0.299699 | 0.902919 | -1.621583 | -0.158189 | 0.664518 | 1.456532 | 1.858938 | -0.285083 | -0.379121 | ... | -0.960765 | -0.238118 | -1.055520 | -0.486306 | 0.583903 | 0.718977 | -0.122756 | 0.545447 | 0.174667 | 0.926524 |

5 rows × 25 columns
Now that we have our data set at hand, we can fit 20 regression models - one for each dependent variable. To do so, we will use pyfixest’s multiple estimation syntax.
dependent_vars = " + ".join([f"y{i}" for i in range(20)])
independent_vars = " + ".join([f"X{i}" for i in range(5)])
fml = f"{dependent_vars} ~ {independent_vars}"
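With the formula in place, we can fit all 20 models in one call. The snippet below is a minimal sketch of this step (the vcov="hetero" choice mirrors the calls used in the simulations further down); pf.etable() then displays the unadjusted results.

# fit one regression per dependent variable via multiple estimation
fit = pf.feols(fml=fml, data=data, vcov="hetero")
# display all 20 models side by side with unadjusted inference
pf.etable(fit)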
We see that the estimation produces several false positives for the effect of X1. Recall that we simulated X1 to have no effect on any of the dependent variables; still, the unadjusted p-values flag a significant effect of X1 in several models.
PyFixest provides three functions to adjust inference for multiple testing: pf.bonferroni(), pf.rwolf(), and pf.wyoung(). All three share a common API.
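Each function takes the fitted model(s) and the name of the parameter whose p-values should be adjusted; the resampling-based methods additionally take the number of bootstrap replications and a seed. A sketch of the calls (the reps and seed values here are illustrative):

# Bonferroni-adjusted p-values for X1 across all models
pf.bonferroni(fit, param="X1")
# Romano-Wolf and Westfall-Young adjusted p-values (resampling-based)
pf.rwolf(fit, param="X1", reps=200, seed=123)
pf.wyoung(fit, param="X1", reps=200, seed=123)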
We quickly see that the corrected p-values do not flag any false positives.
Controlling the Family-Wise Error Rate (FWER)
We now show by means of simulation that the three methods control the family-wise error rate (FWER). To do so, we simulate 1000 data sets, imposing true nulls for the effect of X1 on all 20 dependent variables. For each simulation, we then record whether a method flags at least one false positive and report the share of simulations in which this happens, as sketched below.
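The helpers below sketch this simulation; the names compute_fwer and run_fwer_simulation are illustrative, and the exact settings (reps, seeds) may differ from those used to produce the reported numbers.

def compute_fwer(seed, rho=0.5):
    "Simulate one data set under the null and check for false positives on X1."
    data = get_data(N, n_covariates, n_depvars, rho=rho, seed=seed)
    fit = pf.feols(fml=fml, data=data, vcov="hetero")
    # unadjusted p-values for X1 across all 20 models
    pvals = fit.tidy().reset_index().set_index("Coefficient").xs("X1")["Pr(>|t|)"]
    rejections = {
        "Pr(>|t|) any rejection": (pvals < 0.05).any(),
        "Bonferroni any rejection": (
            pf.bonferroni(fit, param="X1").xs("Bonferroni Pr(>|t|)").values < 0.05
        ).any(),
        "rwolf any rejection": (
            pf.rwolf(fit, param="X1", reps=200, seed=seed * 11).xs("RW Pr(>|t|)").values
            < 0.05
        ).any(),
        "wyoung any rejection": (
            pf.wyoung(fit, param="X1", reps=200, seed=seed * 11).xs("WY Pr(>|t|)").values
            < 0.05
        ).any(),
    }
    return pd.DataFrame(rejections, index=[seed])


def run_fwer_simulation(n_iter, rho=0.5, nthreads=-1):
    "Share of simulations with at least one false positive, per method (the FWER)."
    results = Parallel(n_jobs=nthreads)(
        delayed(compute_fwer)(seed, rho=rho) for seed in tqdm(range(n_iter))
    )
    return pd.concat(results).mean()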
We see that all three correction methods stay close to the desired 5% level. In contrast, the uncorrected p-values produce a much higher family-wise error rate, as expected.
Power
Now that we’ve seen that all three methods effectively handle false positives, let’s see how well they avoid false negatives. In other words, we will study how powerful all three methods are in detecting true effects.
To do so, we have to adjust the data generating process slightly. Instead of simulating the impact of X1 on all dependent variables to be zero (a true null effect), we now simulate the impact of X1 to be \(0.5\) for all dependent variables. Hence, we simulate true positives and count how often we correctly detect the true effect, or, equivalently, how often we correctly reject the null of no treatment effect.
def get_data_true_effect(N, n_covariates, n_depvars, rho=0.5, seed=12345, effect=0.1):
    "Generate data with true positives."
    rng = np.random.default_rng(seed)
    Omega = np.eye(n_depvars)
    Omega[Omega != 1] = rho  # equi-correlated errors across the dependent variables
    X = rng.standard_normal((N, n_covariates))
    u_joint = rng.multivariate_normal(np.zeros(n_depvars), Omega, N)
    beta = np.zeros((n_covariates, n_depvars))
    beta[1, :] = effect  # X1 has a true effect on every dependent variable
    y = X @ beta + u_joint
    data = pd.DataFrame(X, columns=[f"X{i}" for i in range(n_covariates)])
    data = data.assign(**{f"y{i}": y[:, i] for i in range(n_depvars)})
    return data
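For a single draw with an effect of \(0.5\), we can again fit all models and inspect the results as before. This is only a sketch of that step; the variable names data_effect and fit_effect are illustrative.

data_effect = get_data_true_effect(
    N, n_covariates, n_depvars, rho=0.5, seed=12345, effect=0.5
)
fit_effect = pf.feols(fml=fml, data=data_effect, vcov="hetero")
pf.etable(fit_effect)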
We will now study power more systematically via a simulation. More concretely, we will compute how often we detect the true effect of X1 on each of the dependent variables y0, y1, …, y19, given a fixed sample size \(N\), using "uncorrected" p-values as well as the Bonferroni, Romano-Wolf, and Westfall-Young methods.
def compute_power(seed, rho, effect):
    "Simulate data, estimate models, and compute power."
    data = get_data_true_effect(
        N, n_covariates, n_depvars, rho, seed=seed, effect=effect
    )
    # fit all 20 models; we are interested in the X1 coefficient of each model
    fit = pf.feols(fml=fml, data=data, vcov="hetero")

    df = fit.tidy().reset_index().set_index("Coefficient").xs("X1")
    df["Pr(>|t|) detect"] = df["Pr(>|t|)"] < 0.05
    df["Bonferroni detect"] = (
        pf.bonferroni(fit, param="X1").xs("Bonferroni Pr(>|t|)").values < 0.05
    )
    df["rwolf detect"] = (
        pf.rwolf(fit, param="X1", reps=200, seed=seed * 11).xs("RW Pr(>|t|)").values
        < 0.05
    )
    df["wyoung detect"] = (
        pf.wyoung(fit, param="X1", reps=200, seed=seed * 11).xs("WY Pr(>|t|)").values
        < 0.05
    )

    # share of the 20 models in which each method rejects the null for X1
    detect_effect = {
        "Pr(>|t|) detect effect": df["Pr(>|t|) detect"].mean(),
        "Bonferroni detect effect": df["Bonferroni detect"].mean(),
        "rwolf detect effect": df["rwolf detect"].mean(),
        "wyoung detect effect": df["wyoung detect"].mean(),
    }
    detect_effect_df = pd.DataFrame(detect_effect, index=[effect])

    return detect_effect_df
def run_power_simulation(n_iter, rho, effect, nthreads=-1):
    "Run simulation for power."
    seeds = list(range(n_iter))
    results = Parallel(n_jobs=nthreads)(
        delayed(compute_power)(seed, rho=rho, effect=effect) for seed in tqdm(seeds)
    )
    return pd.concat(results).mean()
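The run shown below uses 1000 iterations with \(\rho = 0.5\) and an effect of \(0.5\); these settings are our reading of the text and output, and the variable name power_simulation is illustrative.

power_simulation = run_power_simulation(n_iter=1000, rho=0.5, effect=0.5)
power_simulation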
100%|██████████| 1000/1000 [12:00<00:00, 1.39it/s]
We see that the unadjusted method detects the true effects at the highest rate, with on average 97% of effects correctly detected. Does this mean that we should use uncorrected tests, then? Maybe, but keep in mind that they do not control the family-wise error rate: while we have a better chance of detecting a true effect, we also run a higher risk of flagging false positives.
Additionally, it looks as if the rwolf and wyoung methods detect the true positives at a slightly higher rate than the Bonferroni method.
Do these findings generalize to other effect sizes? We can check this by simply imposing different effects on the data generating process and repeating the previous exercise multiple times.
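A sketch of such a sweep over effect sizes follows; the grid of effects and the container name power_by_effect are illustrative choices, not necessarily the exact grid used for the reported figures.

effects = [0.05, 0.1, 0.2, 0.3, 0.5]  # illustrative grid of effect sizes
# one row per effect size, one column per method
power_by_effect = pd.DataFrame(
    {effect: run_power_simulation(n_iter=1000, rho=0.5, effect=effect) for effect in effects}
).T
power_by_effect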
We see that for any simulated effect size, the Romano-Wolf and Westfall-Young methods detect a higher share of true positives than the Bonferroni method: they have higher power.
Literature
Clarke, Damian, Joseph P. Romano, and Michael Wolf. “The Romano–Wolf multiple-hypothesis correction in Stata.” The Stata Journal 20.4 (2020): 812-843.
Romano, Joseph P., and Michael Wolf. “Stepwise multiple testing as formalized data snooping.” Econometrica 73.4 (2005): 1237-1282.
Westfall, Peter H., and S. Stanley Young. Resampling-based multiple testing: Examples and methods for p-value adjustment. Vol. 279. John Wiley & Sons, 1993.