Multiple Hypothesis Testing Corrections

When conducting online A/B tests or large-scale experiments, we often analyze multiple dependent variables simultaneously. Analyzing multiple KPIs introduces a significant statistical challenge: the multiple testing problem.

In classical hypothesis testing, the significance level \(\alpha\) controls the probability of a false positive (Type I error), typically \(\alpha=0.05\). While this ensures a 5% chance of a false positive for a single test, the likelihood of detecting at least one false positive grows rapidly when conducting multiple tests. For example, testing 20 KPIs independently at \(\alpha = 0.05\) results in a 64% chance of at least one false positive—calculated as \(1 - (1 - 0.05)^{20} \approx 0.64\). This issue, known as the multiple testing problem, can lead to false claims of significant effects when none exist.

To address this, the concept of controlling the Familywise Error Rate (FWER) has been widely adopted. FWER controls the probability of at least one Type I error across a family of hypotheses. Several correction methods exist to mitigate the multiple testing problem, including:

This vignette demonstrates how these methods effectively control the FWER in a variety of scenarios. We will compare their performance and highlight the trade-offs between simplicity and statistical power. Specifically, we show that while Bonferroni provides strong error control, it is conservative in many practical applications. In contrast, Romano-Wolf and Westfall-Young methods are more powerful, offering greater sensitivity to detect true effects while maintaining robust control of the FWER.

What is a Family-Wise Error Rate (FWER)?

Suppose that we are running an experiment and want to test if our treatment impacts 20 different dependent variables (KPIs). For any given independent test, the chance of a false positive is given by the (significance) level of the individual test, which is most commonly set to \(\alpha = 0.05\). Formally, we can define the false positive rate for a single hypothesis \(H\) as:

\[ P(\text{reject } H \mid H \text{ is true}) = \alpha \]

For a family of tests \(S = {s \in \{1, \dots, P\} }\) hypotheses \(\{H_{s}\}_{s \in S}\), we can analogously define the family-wise error rate (FWER) as:

\[ P(\text{reject at least one } H_{s} \text{ with } s \in I) = \alpha \]

where \(I\) is the set of true hypotheses. In other words, the FWER is the probability of making at least one false positive across all tests in the family.

Setup

In a first step, we create a data set with multiple (potentially correlated) dependent variables that share a common set of covariates. All simulations in this notebook are greatly inspired by Clarke, Romano & Wolf (Stata Journal, 2020).

%load_ext autoreload
%autoreload 2

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from great_tables import loc, style
from joblib import Parallel, delayed
from tqdm import tqdm

import pyfixest as pf
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
N = 100
n_covariates = 5
n_depvars = 20
def get_data(N, n_covariates, n_depvars, rho, seed):
    "Simulate data with true nulls."
    rng = np.random.default_rng(seed)
    Omega = np.eye(n_depvars)
    X = rng.standard_normal((N, n_covariates))
    u_joint = np.random.multivariate_normal(np.zeros(n_depvars), Omega, N)
    beta = np.zeros((n_covariates, n_depvars))
    y = X @ beta + u_joint

    data = pd.DataFrame(X, columns=[f"X{i}" for i in range(n_covariates)])
    data = data.assign(**{f"y{i}": y[:, i] for i in range(n_depvars)})

    return data
data = get_data(
    N=N, n_covariates=n_covariates, n_depvars=n_depvars, rho=0.5, seed=12345
)
data.head()
X0 X1 X2 X3 X4 y0 y1 y2 y3 y4 ... y10 y11 y12 y13 y14 y15 y16 y17 y18 y19
0 -1.423825 1.263728 -0.870662 -0.259173 -0.075343 0.285444 -0.613527 1.064762 0.941035 -0.188382 ... 0.947734 0.505579 -2.838629 -0.563997 1.757725 -0.187876 2.554182 -2.486495 0.216818 0.298878
1 -0.740885 -1.367793 0.648893 0.361058 -1.952863 1.365574 0.404626 -1.869440 -0.008121 2.129185 ... 0.439416 -1.992477 -0.002794 1.603059 -0.244048 0.880147 0.700935 -0.375832 1.113939 0.279370
2 2.347410 0.968497 -0.759387 0.902198 -0.466953 0.215444 -0.282206 0.720051 0.366021 1.051319 ... 0.272646 -0.507247 -0.784782 -1.007066 0.840372 -1.150062 2.082881 0.583346 1.077486 0.027789
3 -0.060690 0.788844 -1.256668 0.575858 1.398979 -1.065016 1.185823 0.970361 -0.765870 0.479013 ... -1.007019 0.924188 1.949084 -1.560250 1.167244 -0.837087 0.360323 0.640585 0.527408 -0.033194
4 1.322298 -0.299699 0.902919 -1.621583 -0.158189 0.664518 1.456532 1.858938 -0.285083 -0.379121 ... -0.960765 -0.238118 -1.055520 -0.486306 0.583903 0.718977 -0.122756 0.545447 0.174667 0.926524

5 rows × 25 columns

Now that we have our data set at hand, we can fit 20 regression models - one for each dependent variable. To do so, we will use pyfixest’s multiple estimation syntax.

dependent_vars = " + ".join([f"y{i}" for i in range(20)])
independent_vars = " + ".join([f"X{i}" for i in range(5)])
fml = f"{dependent_vars} ~ {independent_vars}"
fit = pf.feols(fml, data=data, vcov="hetero")
(pf.etable(fit).tab_style(style=style.fill(color="yellow"), locations=loc.body(rows=1)))
y0 y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15 y16 y17 y18 y19
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) (20)
coef
X0 0.079
(0.119)
0.111
(0.108)
0.088
(0.105)
0.158
(0.094)
0.033
(0.112)
-0.011
(0.100)
-0.223*
(0.100)
0.020
(0.117)
-0.172*
(0.083)
0.068
(0.082)
0.009
(0.114)
0.023
(0.106)
0.002
(0.099)
0.100
(0.092)
-0.068
(0.098)
-0.022
(0.100)
0.153
(0.123)
0.231*
(0.095)
0.113
(0.120)
-0.051
(0.079)
X1 -0.064
(0.093)
-0.065
(0.108)
-0.055
(0.107)
-0.055
(0.097)
-0.013
(0.101)
0.036
(0.092)
-0.038
(0.100)
-0.052
(0.112)
0.080
(0.084)
0.072
(0.088)
0.002
(0.092)
0.049
(0.124)
-0.017
(0.102)
0.061
(0.099)
-0.023
(0.086)
-0.017
(0.090)
0.159
(0.113)
0.032
(0.122)
-0.144
(0.100)
0.064
(0.088)
X2 -0.165
(0.116)
-0.181
(0.125)
0.107
(0.090)
-0.027
(0.094)
-0.080
(0.111)
0.009
(0.079)
0.185
(0.098)
-0.023
(0.108)
0.025
(0.075)
-0.155
(0.086)
-0.008
(0.095)
0.175
(0.116)
0.153
(0.091)
0.127
(0.090)
0.042
(0.105)
0.142
(0.094)
-0.081
(0.116)
-0.195
(0.105)
-0.200
(0.107)
0.015
(0.095)
X3 -0.110
(0.089)
0.133
(0.108)
0.018
(0.110)
0.006
(0.090)
0.036
(0.100)
-0.083
(0.105)
0.202*
(0.094)
0.105
(0.118)
0.005
(0.081)
-0.003
(0.113)
0.015
(0.115)
0.095
(0.093)
0.030
(0.103)
0.076
(0.112)
-0.080
(0.085)
0.067
(0.106)
-0.057
(0.127)
-0.016
(0.106)
0.051
(0.102)
-0.003
(0.088)
X4 0.017
(0.114)
0.124
(0.128)
0.126
(0.107)
-0.014
(0.085)
-0.120
(0.100)
0.067
(0.094)
0.118
(0.105)
-0.035
(0.107)
-0.127
(0.075)
0.126
(0.094)
0.004
(0.101)
0.035
(0.120)
0.200*
(0.098)
0.011
(0.119)
-0.222**
(0.083)
-0.131
(0.095)
-0.029
(0.113)
0.146
(0.106)
-0.104
(0.109)
0.018
(0.084)
Intercept 0.121
(0.101)
0.020
(0.115)
0.191
(0.108)
0.103
(0.086)
-0.069
(0.103)
-0.090
(0.096)
-0.010
(0.099)
0.021
(0.103)
-0.024
(0.082)
-0.026
(0.102)
0.103
(0.111)
-0.130
(0.112)
0.132
(0.098)
0.004
(0.104)
-0.043
(0.088)
-0.046
(0.100)
0.115
(0.111)
-0.084
(0.114)
0.066
(0.109)
0.014
(0.090)
stats
Observations 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
S.E. type hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero
R2 0.039 0.068 0.037 0.033 0.022 0.015 0.144 0.015 0.063 0.051 0.000 0.035 0.068 0.037 0.079 0.040 0.057 0.078 0.069 0.009
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)

We see that our estimation produces multiple false positives for the effect of X1 on multiple dependent variables. Recall that we had simulated X1 to have no effect on any of the dependent variables. Still, the estimation procedure produces a significant effect for X1 on multiple dependent variables.

PyFixest provides three functions to adjust inference for multiple testing: pf.bonferroni(), pf.rwolf(), and pf.wyoung(). All three share a common API.

pval_bonferroni = (
    pf.bonferroni(fit.to_list(), param="X1").xs("Bonferroni Pr(>|t|)").values
)

pval_rwolf = (
    pf.rwolf(fit.to_list(), param="X1", reps=1000, seed=22).xs("RW Pr(>|t|)").values
)

pval_wyoung = (
    pf.wyoung(fit.to_list(), param="X1", reps=1000, seed=22).xs("WY Pr(>|t|)").values
)
(
    pf.etable(
        fit,
        custom_model_stats={
            "Bonferroni: pval(X1)": pval_bonferroni.round(4).tolist(),
            "RW: pval(X1)": pval_rwolf.round(4).tolist(),
            "WY: pval(X1)": pval_wyoung.round(4).tolist(),
        },
    ).tab_style(style=style.fill(color="yellow"), locations=loc.body(rows=[6, 7, 8]))
)
y0 y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15 y16 y17 y18 y19
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) (20)
coef
X0 0.192
(0.105)
-0.125
(0.122)
-0.184
(0.095)
0.067
(0.110)
-0.279*
(0.118)
-0.064
(0.123)
-0.041
(0.093)
0.079
(0.089)
0.050
(0.107)
0.075
(0.113)
0.059
(0.096)
-0.163
(0.094)
0.172
(0.110)
0.090
(0.122)
0.032
(0.104)
0.025
(0.106)
0.075
(0.092)
0.055
(0.126)
-0.142
(0.101)
-0.027
(0.092)
X1 0.053
(0.104)
0.030
(0.124)
0.162
(0.084)
0.024
(0.111)
-0.037
(0.104)
0.087
(0.088)
-0.011
(0.089)
-0.075
(0.091)
0.039
(0.102)
0.083
(0.112)
-0.060
(0.101)
0.139
(0.089)
-0.183
(0.109)
-0.038
(0.117)
-0.038
(0.112)
-0.191
(0.108)
-0.016
(0.095)
0.128
(0.099)
0.074
(0.099)
0.120
(0.088)
X2 -0.164
(0.095)
0.140
(0.112)
0.058
(0.103)
-0.065
(0.095)
0.003
(0.113)
0.282**
(0.098)
0.077
(0.110)
0.081
(0.110)
0.125
(0.105)
-0.011
(0.105)
0.116
(0.109)
0.003
(0.107)
-0.138
(0.109)
-0.138
(0.091)
0.053
(0.093)
0.011
(0.109)
0.046
(0.103)
0.073
(0.121)
0.079
(0.134)
0.187
(0.100)
X3 0.022
(0.108)
-0.041
(0.117)
-0.015
(0.098)
0.042
(0.088)
-0.121
(0.096)
-0.037
(0.101)
0.008
(0.090)
-0.108
(0.090)
0.039
(0.096)
0.115
(0.118)
-0.064
(0.109)
-0.065
(0.098)
0.053
(0.100)
0.062
(0.121)
-0.099
(0.107)
0.173
(0.090)
-0.036
(0.095)
0.147
(0.097)
-0.039
(0.114)
-0.027
(0.093)
X4 -0.109
(0.099)
-0.071
(0.120)
0.026
(0.093)
0.064
(0.074)
0.062
(0.127)
0.021
(0.088)
-0.035
(0.104)
-0.082
(0.085)
0.059
(0.093)
0.027
(0.092)
0.084
(0.105)
-0.123
(0.114)
-0.142
(0.107)
0.105
(0.103)
0.004
(0.113)
0.002
(0.097)
0.040
(0.091)
-0.063
(0.099)
0.006
(0.105)
-0.010
(0.092)
Intercept 0.169
(0.105)
0.094
(0.117)
-0.021
(0.097)
0.188*
(0.090)
-0.034
(0.112)
-0.047
(0.101)
0.048
(0.098)
-0.161
(0.092)
-0.109
(0.101)
-0.047
(0.113)
-0.227*
(0.095)
0.006
(0.101)
0.136
(0.102)
0.212*
(0.097)
-0.042
(0.103)
0.128
(0.104)
0.029
(0.103)
-0.139
(0.110)
-0.052
(0.106)
0.049
(0.097)
stats
Bonferroni: pval(X1) 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
RW: pval(X1) 1.0 1.0 0.7273 1.0 1.0 0.996 1.0 0.998 1.0 0.999 1.0 0.8921 0.8521 1.0 1.0 0.8162 1.0 0.956 0.999 0.954
WY: pval(X1) 1.0 1.0 0.727 1.0 1.0 0.996 1.0 0.998 1.0 0.999 1.0 0.892 0.852 1.0 1.0 0.816 1.0 0.956 0.999 0.954
Observations 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
S.E. type hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero hetero
R2 0.078 0.030 0.063 0.017 0.079 0.085 0.009 0.048 0.026 0.023 0.033 0.061 0.097 0.040 0.014 0.076 0.011 0.035 0.024 0.051
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)

We quickly see that the corrected p-values do not flag any false positives.

Controlling for the Familiy-Wise Error Rate (FWER)

We now show by means of simulation that the three methods control the family-wise error rate (FWER). To do so, we simulate 1000 data sets imposing true nulls for the effect of X1 on all of 20 created dependent variables. For each simulation, we then count if the methods flag more than one false positive and report our results.

def compute_family_rejection(seed, rho):
    "Simulate data, estimate models, and compute family rejection rates."
    data = get_data(N, n_covariates, n_depvars, rho, seed=seed)
    fit = pf.feols(fml=fml, data=data, vcov="hetero")
    df = fit.tidy().reset_index().set_index("Coefficient").xs("X1")

    df["Pr(>|t|) reject"] = df["Pr(>|t|)"] < 0.05
    df["Bonferroni reject"] = (
        pf.bonferroni(fit, param="X1").xs("Bonferroni Pr(>|t|)").values < 0.05
    )
    df["rwolf reject"] = (
        pf.rwolf(fit, param="X1", reps=1000, seed=seed * 11).xs("RW Pr(>|t|)").values
        < 0.05
    )
    df["wyoung reject"] = (
        pf.wyoung(fit, param="X1", reps=1000, seed=seed * 11).xs("WY Pr(>|t|)").values
        < 0.05
    )

    # Compute family rejection means
    family_rejection = {
        "Pr(>|t|) reject family": df["Pr(>|t|) reject"].sum() > 0,
        "Bonferroni reject family": df["Bonferroni reject"].sum() > 0,
        "rwolf reject family": df["rwolf reject"].sum() > 0,
        "wyoung reject family": df["wyoung reject"].sum() > 0,
    }

    return pd.Series(family_rejection)
def run_fwer_simulation(n_iter, rho):
    "Run simulation for family-wise error rate."
    results = Parallel(n_jobs=-1)(
        delayed(compute_family_rejection)(seed, rho=rho) for seed in tqdm(range(n_iter))
    )
    return pd.concat(results).reset_index().groupby("index").mean()
run_fwer_simulation(n_iter=1000, rho=0.5)
100%|██████████| 1000/1000 [19:03<00:00,  1.14s/it]
0
index
Bonferroni reject family 0.069
Pr(>|t|) reject family 0.694
rwolf reject family 0.043
wyoung reject family 0.043

We see that all three correction methods get close to the desired 5% level. In contrast, the uncorrected method produces the expected much higher family-wise error rate.

Power

Now that we’ve seen that all three methods effectively handle false positives, let’s see how well they avoid false negatives. In other words, we will study how powerful all three methods are in detecting true effects.

To do so, we slightly have to adjust the data generating process. Instead of simulating the impact of X1 on all dependent variables to be zero (a true null effect), we will now simulate the impact of X1 to be \(0.5\) for all dependent variables. Hence we simulate true positives and count how often we correctly detect the true effect, or, equivalently stated, how often we correctly reject the null of no treatment effect.

def get_data_true_effect(N, n_covariates, n_depvars, rho=0.5, seed=12345, effect=0.1):
    "Generate data with true positives."
    rng = np.random.default_rng(seed)
    Omega = np.eye(n_depvars)
    Omega[Omega != 1] = rho
    X = rng.standard_normal((N, n_covariates))
    u_joint = np.random.multivariate_normal(np.zeros(n_depvars), Omega, N)
    beta = np.zeros((n_covariates, n_depvars))
    beta[1, :] = effect
    y = X @ beta + u_joint

    data = pd.DataFrame(X, columns=[f"X{i}" for i in range(n_covariates)])
    data = data.assign(**{f"y{i}": y[:, i] for i in range(n_depvars)})

    return data
data_true = get_data_true_effect(
    N=N, n_covariates=n_covariates, n_depvars=n_depvars, rho=0.5, seed=12345, effect=0.5
)
fit = pf.feols(fml, data=data_true)
(
    pf.etable(fit).tab_style(
        style=style.fill(color="yellow"), locations=loc.body(rows=[1])
    )
)
y0 y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15 y16 y17 y18 y19
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) (20)
coef
X0 -0.108
(0.103)
-0.152
(0.116)
-0.243*
(0.113)
-0.026
(0.112)
-0.158
(0.109)
-0.132
(0.108)
-0.155
(0.104)
-0.180
(0.114)
-0.254*
(0.111)
-0.157
(0.106)
-0.066
(0.109)
-0.005
(0.115)
-0.214*
(0.101)
-0.183
(0.115)
-0.007
(0.099)
-0.138
(0.105)
-0.104
(0.107)
-0.138
(0.111)
-0.102
(0.110)
-0.154
(0.098)
X1 0.538***
(0.099)
0.493***
(0.112)
0.589***
(0.109)
0.503***
(0.108)
0.340**
(0.106)
0.493***
(0.104)
0.568***
(0.100)
0.590***
(0.110)
0.502***
(0.107)
0.505***
(0.102)
0.375***
(0.105)
0.411***
(0.111)
0.452***
(0.098)
0.586***
(0.111)
0.438***
(0.096)
0.558***
(0.101)
0.408***
(0.103)
0.442***
(0.107)
0.498***
(0.106)
0.572***
(0.095)
X2 0.081
(0.101)
0.026
(0.114)
0.094
(0.111)
0.094
(0.110)
0.008
(0.108)
-0.036
(0.106)
-0.054
(0.102)
-0.014
(0.112)
-0.043
(0.109)
-0.042
(0.104)
0.128
(0.107)
-0.043
(0.113)
0.039
(0.099)
0.053
(0.113)
-0.048
(0.097)
0.027
(0.103)
-0.021
(0.105)
0.019
(0.109)
0.015
(0.108)
-0.106
(0.097)
X3 -0.068
(0.104)
-0.110
(0.117)
-0.033
(0.114)
-0.098
(0.113)
-0.127
(0.110)
-0.086
(0.109)
-0.056
(0.104)
-0.094
(0.115)
-0.092
(0.112)
-0.064
(0.107)
-0.099
(0.110)
-0.003
(0.116)
0.032
(0.102)
-0.157
(0.116)
-0.085
(0.100)
-0.017
(0.106)
-0.147
(0.108)
-0.027
(0.111)
-0.067
(0.110)
-0.088
(0.099)
X4 -0.063
(0.101)
0.013
(0.114)
-0.142
(0.111)
0.002
(0.110)
0.086
(0.107)
-0.092
(0.106)
-0.059
(0.102)
0.030
(0.112)
-0.013
(0.109)
-0.067
(0.104)
0.068
(0.107)
-0.134
(0.113)
-0.012
(0.099)
-0.021
(0.112)
0.046
(0.097)
0.076
(0.103)
-0.033
(0.105)
-0.019
(0.108)
-0.073
(0.107)
0.061
(0.096)
Intercept -0.007
(0.100)
-0.061
(0.113)
-0.010
(0.110)
0.060
(0.109)
-0.035
(0.107)
0.003
(0.105)
-0.013
(0.101)
-0.014
(0.111)
-0.029
(0.108)
0.063
(0.103)
-0.003
(0.106)
0.071
(0.112)
0.117
(0.098)
0.022
(0.112)
-0.013
(0.097)
-0.034
(0.102)
0.054
(0.104)
-0.013
(0.108)
0.027
(0.107)
0.053
(0.096)
stats
Observations 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
S.E. type iid iid iid iid iid iid iid iid iid iid iid iid iid iid iid iid iid iid iid iid
R2 0.259 0.202 0.264 0.214 0.146 0.226 0.288 0.269 0.239 0.238 0.151 0.149 0.207 0.273 0.218 0.267 0.186 0.170 0.213 0.333
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)

We will now study power more systematically via a simulation. More concretely, we will compute how often we detect the true effect of X1 on Y1, Y2, …, etc given a fixed sample size \(N\) using “uncorrected” p-values, the Bonferroni, Romano-Wolf and Westfall-Young methods.

def compute_power(seed, rho, effect):
    "Simulate data, estimate models, and compute power."
    data = get_data_true_effect(
        N, n_covariates, n_depvars, rho, seed=seed, effect=effect
    )
    fit = pf.feols(
        fml=fml, data=data, vcov="hetero"
    )  # model '1' regresses on Y1 - we're only interested in the power of this specific test
    df = fit.tidy().reset_index().set_index("Coefficient").xs("X1")

    df["Pr(>|t|) detect"] = df["Pr(>|t|)"] < 0.05
    df["Bonferroni detect"] = (
        pf.bonferroni(fit, param="X1").xs("Bonferroni Pr(>|t|)").values < 0.05
    )
    df["rwolf detect"] = (
        pf.rwolf(fit, param="X1", reps=200, seed=seed * 11).xs("RW Pr(>|t|)").values
        < 0.05
    )
    df["wyoung detect"] = (
        pf.wyoung(fit, param="X1", reps=200, seed=seed * 11).xs("WY Pr(>|t|)").values
        < 0.05
    )

    # Compute family rejection means
    detect_effect = {
        "Pr(>|t|) detect effect": df["Pr(>|t|) detect"].mean(),
        "Bonferroni detect effect": df["Bonferroni detect"].mean(),
        "rwolf detect effect": df["rwolf detect"].mean(),
        "wyoung detect effect": df["wyoung detect"].mean(),
    }

    detect_effect_df = pd.DataFrame(detect_effect, index=[effect])
    return detect_effect_df
def run_power_simulation(n_iter, rho, effect, nthreads=-1):
    "Run simulation for power."
    seeds = list(range(n_iter))
    results = Parallel(n_jobs=nthreads)(
        delayed(compute_power)(seed, rho=rho, effect=effect) for seed in tqdm(seeds)
    )

    return pd.concat(results).mean()
run_power_simulation(n_iter=1000, rho=0.5, effect=0.4)
  0%|          | 0/1000 [00:00<?, ?it/s]c:\Users\alexa\Documents\pyfixest\.pixi\envs\dev\Lib\site-packages\joblib\externals\loky\process_executor.py:752: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  warnings.warn(
100%|██████████| 1000/1000 [12:00<00:00,  1.39it/s]
Pr(>|t|) detect effect      0.97045
Bonferroni detect effect    0.78485
rwolf detect effect         0.86985
wyoung detect effect        0.86985
dtype: float64

We see that the “unadjusted” method detects the “true effects” at the highest frequency with on average 97% correctly detected effects. Does this mean that we should use uncorrected tests then? Well, maybe, but these do not control the family-wise error rate. While we have a better chance to detect a true effect, we also have a higher risk of flagging false positives.

Additionally, it looks as if the rwolf and wyoung methods detect the true positives at a slightly higher rate than the Bonferroni method.

Do these findings generalize to other effect sizes? We can check this by simply imposing different effects on the data generating process and repeating the previous exercise multiple times.

def run_power_simulation_vary_effect():
    "Run power simulations with varying effect sizes."
    n_points = 10
    max_val = 0.7

    effects = (
        np.sign(np.linspace(-1, 1, n_points))
        * max_val
        * (np.linspace(-1, 1, n_points) ** 2)
    )

    res_list = []
    for effect_size in tqdm(effects):
        res = run_power_simulation(n_iter=1000, rho=0.5, effect=effect_size)
        res["effect"] = effect_size
        res_list.append(res)
    return pd.concat(res_list, axis=1).T.set_index("effect")


res = run_power_simulation_vary_effect()
100%|██████████| 1000/1000 [10:54<00:00,  1.53it/s]
100%|██████████| 1000/1000 [10:17<00:00,  1.62it/s]
100%|██████████| 1000/1000 [10:05<00:00,  1.65it/s]
100%|██████████| 1000/1000 [10:09<00:00,  1.64it/s]
100%|██████████| 1000/1000 [10:25<00:00,  1.60it/s]
100%|██████████| 1000/1000 [09:54<00:00,  1.68it/s]
100%|██████████| 1000/1000 [09:44<00:00,  1.71it/s]
100%|██████████| 1000/1000 [10:53<00:00,  1.53it/s]
100%|██████████| 1000/1000 [14:32<00:00,  1.15it/s]
100%|██████████| 1000/1000 [14:38<00:00,  1.14it/s]
100%|██████████| 10/10 [1:53:04<00:00, 678.48s/it]
column_to_label_dict = {
    "Pr(>|t|) detect effect": "Unadjusted",
    "Bonferroni detect effect": "Bonferroni",
    "rwolf detect effect": "RW",
    "wyoung detect effect": "WY",
}

plt.figure(figsize=(10, 6))
line_styles = ["-", "--", "-.", ":"]
for column, line_style in zip(res.columns, line_styles):
    plt.plot(
        res.index, res[column], linestyle=line_style, label=column_to_label_dict[column]
    )

plt.title("Power of Multiple Testing Correction Procedures")
plt.xlabel("Effect")
plt.ylabel("Proportion of correctly detected effects")
plt.legend()
plt.grid()
plt.show()

We see that for any simulated effect size, the Romano-Wolf and Westfall-Young methods detect a higher share of true positives than the Bonferroni method: they have higher power.

Literature

  • Clarke, Damian, Joseph P. Romano, and Michael Wolf. “The Romano–Wolf multiple-hypothesis correction in Stata.” The Stata Journal 20.4 (2020): 812-843.
  • Romano, Joseph P., and Michael Wolf. “Stepwise multiple testing as formalized data snooping.” Econometrica 73.4 (2005): 1237-1282.
  • Westfall, Peter H., and S. Stanley Young. Resampling-based multiple testing: Examples and methods for p-value adjustment. Vol. 279. John Wiley & Sons, 1993.