Compressed Double Machine Learning

DBDML implements a compressed leave-one-out residualization estimator for a partially linear model with discrete controls:

\[ Y_i = W_i'\beta + g(X_i) + \varepsilon_i. \]

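The discrete controls \(X_i\) partition the observations into groups \(g \in \mathcal{G}\), one per distinct value of \(X_i\); the group index \(g\) is unrelated to the nuisance function \(g(\cdot)\) in the model above. The estimator residualizes both \(Y_i\) and \(W_i\) against leave-one-out group means, then runs OLS on the residuals.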

Constructor

DBDML(
    db_name: str | None,
    table_name: str,
    outcome_var: str,
    treatment_var: str | list[str],
    discrete_covars: list[str],
    seed: int,
    n_bootstraps: int = 200,
    connection=None,
)

Leave-One-Out Residualization

For a variable \(V_i\) (the outcome \(Y_i\) or a component of \(W_i\)) belonging to group \(g\) of size \(N_g\), define

\[ \hat{m}_{V,-i}(X_i) = \frac{1}{N_g - 1} \sum_{\substack{j \in g\\j \ne i}} V_j. \]

The residual is

\[ \tilde{V}_i = V_i - \hat{m}_{V,-i}(X_i) = \frac{N_g V_i - S_V^{(g)}}{N_g - 1}, \]

where \(S_V^{(g)}=\sum_{j\in g}V_j\).
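
As a quick check of the closed form, the sketch below computes the residuals two ways: from the group count and sum, and by explicitly averaging the other members of each group. It is plain Python for illustration only, not DBDML's implementation; the function names and the toy data are hypothetical.

```python
from collections import defaultdict


def loo_residuals(values, groups):
    """Leave-one-out residuals from group counts and sums (closed form)."""
    count = defaultdict(int)
    total = defaultdict(float)
    for v, g in zip(values, groups):
        count[g] += 1
        total[g] += v
    out = []
    for v, g in zip(values, groups):
        n = count[g]
        if n < 2:
            out.append(None)  # singleton group: residual is undefined
        else:
            out.append((n * v - total[g]) / (n - 1))
    return out


def loo_residuals_naive(values, groups):
    """Same residuals, computed by averaging the other group members."""
    out = []
    for i, (v, g) in enumerate(zip(values, groups)):
        others = [values[j] for j, h in enumerate(groups) if h == g and j != i]
        out.append(v - sum(others) / len(others) if others else None)
    return out


# Toy data (made up): groups of size 3, 2, and 1.
values = [1.0, 2.0, 6.0, 4.0, 8.0, 5.0]
groups = ["a", "a", "a", "b", "b", "c"]

fast = loo_residuals(values, groups)
slow = loo_residuals_naive(values, groups)
print(fast == slow)
```

The closed form needs only \(N_g\) and \(S_V^{(g)}\) per group, which is what makes the compression in the next section possible.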

The target coefficient solves

\[ \hat{\beta} = \left(\sum_i \tilde{W}_i\tilde{W}_i'\right)^{-1} \left(\sum_i \tilde{W}_i\tilde{Y}_i\right). \]

Compressed Cross-Products

For group \(g\), define

\[ S_W^{(g)} = \sum_{i \in g} W_i, \qquad S_Y^{(g)} = \sum_{i \in g} Y_i, \]

\[ S_{WW}^{(g)} = \sum_{i \in g} W_iW_i', \qquad S_{WY}^{(g)} = \sum_{i \in g} W_iY_i. \]

Then

\[ \sum_{i \in g}\tilde{W}_i\tilde{W}_i' = \frac{N_g}{(N_g - 1)^2} \left[ N_g S_{WW}^{(g)} - S_W^{(g)}S_W^{(g)'} \right], \]

and

\[ \sum_{i \in g}\tilde{W}_i\tilde{Y}_i = \frac{N_g}{(N_g - 1)^2} \left[ N_g S_{WY}^{(g)} - S_W^{(g)}S_Y^{(g)} \right]. \]
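
Both identities follow from substituting the closed form \(\tilde{V}_i = (N_g V_i - S_V^{(g)})/(N_g - 1)\) and summing within the group. For the first identity, the intermediate step is

\[ \sum_{i \in g}\tilde{W}_i\tilde{W}_i' = \frac{1}{(N_g - 1)^2} \sum_{i \in g} \left(N_g W_i - S_W^{(g)}\right)\left(N_g W_i - S_W^{(g)}\right)' = \frac{1}{(N_g - 1)^2} \left[ N_g^2 S_{WW}^{(g)} - 2 N_g S_W^{(g)}S_W^{(g)'} + N_g S_W^{(g)}S_W^{(g)'} \right], \]

which simplifies to the stated expression after factoring out \(N_g\). The second identity follows the same way with \(N_g Y_i - S_Y^{(g)}\) as the second factor.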

So DBDML only needs per-group counts, sums, and cross-products, never the row-level residuals. Singleton groups are dropped because the leave-one-out mean divides by \(N_g - 1\), which is zero when \(N_g=1\).
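
The compression can be illustrated with a small sketch. It assumes a single scalar treatment, so the cross-product system reduces to a ratio of two numbers, and it compares the compressed estimate with row-level OLS on the leave-one-out residuals. This is plain Python written for illustration, not the library's code; `compressed_beta`, `rowlevel_beta`, and the toy data are hypothetical.

```python
from collections import defaultdict


def compressed_beta(w, y, groups):
    """Scalar-treatment beta from per-group sufficient statistics only."""
    # Per group: [N, S_W, S_Y, S_WW, S_WY]
    stats = defaultdict(lambda: [0, 0.0, 0.0, 0.0, 0.0])
    for wi, yi, g in zip(w, y, groups):
        s = stats[g]
        s[0] += 1
        s[1] += wi
        s[2] += yi
        s[3] += wi * wi
        s[4] += wi * yi
    num = den = 0.0
    for n, sw, sy, sww, swy in stats.values():
        if n < 2:
            continue  # drop singleton groups
        scale = n / (n - 1) ** 2
        den += scale * (n * sww - sw * sw)  # sum of W~ * W~ in the group
        num += scale * (n * swy - sw * sy)  # sum of W~ * Y~ in the group
    return num / den


def rowlevel_beta(w, y, groups):
    """Same beta via explicit row-level leave-one-out residuals."""
    num = den = 0.0
    for i, g in enumerate(groups):
        idx = [j for j, h in enumerate(groups) if h == g and j != i]
        if not idx:
            continue  # singleton group
        w_res = w[i] - sum(w[j] for j in idx) / len(idx)
        y_res = y[i] - sum(y[j] for j in idx) / len(idx)
        den += w_res * w_res
        num += w_res * y_res
    return num / den


# Toy data (made up): groups of size 3, 2, and 1.
w = [0.0, 1.0, 2.0, 1.0, 3.0, 5.0]
y = [1.0, 3.0, 6.0, 2.0, 7.0, 9.0]
groups = ["a", "a", "a", "b", "b", "c"]

beta_c = compressed_beta(w, y, groups)
beta_r = rowlevel_beta(w, y, groups)
print(abs(beta_c - beta_r) < 1e-12)
```

With several treatments the scalars become \(K \times K\) and \(K \times 1\) arrays and the final division becomes a linear solve, but the per-group statistics are the same.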

Usage

from duckreg import DBDML

model = DBDML(
    db_name="dml.db",
    table_name="data",
    outcome_var="Y",
    treatment_var=["D1", "D2"],
    discrete_covars=["market", "period"],
    seed=42,
    n_bootstraps=200,
)
model.fit()
model.summary()

For a remote backend, pass an existing connection object as connection=con and set db_name=None.

Relation To The Notebook

The original notebooks/duckdml.ipynb demonstrates the same residualization idea on synthetic grouped controls. The website examples use smaller data, but the algebra is identical: compress to per-group sufficient statistics, then solve a single linear system built from the treatment cross-products.