Compressed Double Machine Learning
DBDML implements a compressed leave-one-out residualization estimator for a partially linear model with discrete controls:
\[ Y_i = W_i'\beta + g(X_i) + \varepsilon_i. \]
The controls \(X_i\) define groups \(g \in \mathcal{G}\). The estimator residualizes both \(Y_i\) and \(W_i\) against leave-one-out group means, then runs OLS on the residuals.
Constructor
DBDML(
    db_name: str | None,
    table_name: str,
    outcome_var: str,
    treatment_var: str | list[str],
    discrete_covars: list[str],
    seed: int,
    n_bootstraps: int = 200,
    connection=None,
)
Leave-One-Out Residualization
For a variable \(V_i\), define
\[ \hat{m}_{V,-i}(X_i) = \frac{1}{N_g - 1} \sum_{\substack{j \in g\\j \ne i}} V_j. \]
The residual is
\[ \tilde{V}_i = V_i - \hat{m}_{V,-i}(X_i) = \frac{N_g V_i - S_V^{(g)}}{N_g - 1}, \]
where \(S_V^{(g)}=\sum_{j\in g}V_j\).
The target coefficient solves
\[ \hat{\beta} = \left(\sum_i \tilde{W}_i\tilde{W}_i'\right)^{-1} \left(\sum_i \tilde{W}_i\tilde{Y}_i\right). \]
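The residual formula and the resulting slope can be checked by hand on a tiny data set. The following is a minimal sketch in plain Python, assuming a single scalar treatment; the helper names `loo_residuals` and `loo_slope` and the toy data are illustrative and are not part of the DBDML API.

```python
from collections import defaultdict


def loo_residuals(values, groups):
    """Return V_i minus the mean of the other members of i's group."""
    n = defaultdict(int)
    s = defaultdict(float)
    for v, g in zip(values, groups):
        n[g] += 1
        s[g] += v
    # Closed form (N_g * V_i - S_g) / (N_g - 1); None marks singleton groups,
    # where the leave-one-out mean is undefined.
    return [
        (n[g] * v - s[g]) / (n[g] - 1) if n[g] > 1 else None
        for v, g in zip(values, groups)
    ]


def loo_slope(y, w, groups):
    """No-intercept OLS slope of residualized y on residualized scalar w."""
    y_t = loo_residuals(y, groups)
    w_t = loo_residuals(w, groups)
    pairs = [(a, b) for a, b in zip(w_t, y_t) if a is not None]
    s_ww = sum(a * a for a, _ in pairs)
    s_wy = sum(a * b for a, b in pairs)
    return s_wy / s_ww


# Hypothetical toy data: groups "a" (3 rows), "b" (2 rows), "c" (singleton).
groups = ["a", "a", "a", "b", "b", "c"]
w = [1.0, 2.0, 4.0, 0.0, 3.0, 5.0]
y = [2.0, 4.5, 8.0, 1.0, 7.0, 9.0]

w_tilde = loo_residuals(w, groups)
beta_hat = loo_slope(y, w, groups)
print(w_tilde)
print(beta_hat)
```

For the first row, the leave-one-out mean of \(W\) is \((2 + 4)/2 = 3\), so \(\tilde{W}_1 = 1 - 3 = -2\), which matches \((3 \cdot 1 - 7)/2\) from the closed form.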
Compressed Cross-Products
For group \(g\), define
\[ S_W^{(g)} = \sum_{i \in g} W_i, \qquad S_Y^{(g)} = \sum_{i \in g} Y_i, \]
\[ S_{WW}^{(g)} = \sum_{i \in g} W_iW_i', \qquad S_{WY}^{(g)} = \sum_{i \in g} W_iY_i. \]
Then
\[ \sum_{i \in g}\tilde{W}_i\tilde{W}_i' = \frac{N_g}{(N_g - 1)^2} \left[ N_g S_{WW}^{(g)} - S_W^{(g)}S_W^{(g)'} \right], \]
and
\[ \sum_{i \in g}\tilde{W}_i\tilde{Y}_i = \frac{N_g}{(N_g - 1)^2} \left[ N_g S_{WY}^{(g)} - S_W^{(g)}S_Y^{(g)} \right]. \]
So DBDML only needs grouped counts, sums, and cross-products. Singleton groups are dropped because leave-one-out residualization is undefined when \(N_g=1\).
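The compression identities above can be verified numerically. The sketch below, again in plain Python and restricted to a scalar treatment (so \(S_{WW}^{(g)}\) is a sum of squares rather than a matrix), builds the per-group statistics and compares the compressed estimate with a brute-force row-level one. The function names and data are hypothetical and do not reflect how DBDML is implemented internally.

```python
from collections import defaultdict


def compress(y, w, groups):
    """Reduce rows to per-group [N, S_W, S_Y, S_WW, S_WY] for scalar w."""
    stats = defaultdict(lambda: [0, 0.0, 0.0, 0.0, 0.0])
    for yi, wi, g in zip(y, w, groups):
        s = stats[g]
        s[0] += 1
        s[1] += wi
        s[2] += yi
        s[3] += wi * wi
        s[4] += wi * yi
    return dict(stats)


def beta_from_compressed(stats):
    """Solve the residualized normal equation from group statistics only."""
    num = den = 0.0
    for n, s_w, s_y, s_ww, s_wy in stats.values():
        if n < 2:
            continue  # singleton groups are dropped
        scale = n / (n - 1) ** 2
        den += scale * (n * s_ww - s_w * s_w)
        num += scale * (n * s_wy - s_w * s_y)
    return num / den


def beta_row_level(y, w, groups):
    """Brute-force check: explicit leave-one-out means for every row."""
    num = den = 0.0
    for i, g in enumerate(groups):
        others = [j for j, h in enumerate(groups) if h == g and j != i]
        if not others:
            continue
        w_t = w[i] - sum(w[j] for j in others) / len(others)
        y_t = y[i] - sum(y[j] for j in others) / len(others)
        num += w_t * y_t
        den += w_t * w_t
    return num / den


# Hypothetical toy data with one singleton group ("c").
groups = ["a", "a", "a", "b", "b", "c"]
w = [1.0, 2.0, 4.0, 0.0, 3.0, 5.0]
y = [2.0, 4.5, 8.0, 1.0, 7.0, 9.0]

stats = compress(y, w, groups)
beta_compressed = beta_from_compressed(stats)
beta_direct = beta_row_level(y, w, groups)
print(beta_compressed, beta_direct)
```

With a vector treatment, the same pattern would apply with `wi * wi` replaced by an outer product and the final division replaced by a linear solve.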
Usage
from duckreg import DBDML

model = DBDML(
    db_name="dml.db",
    table_name="data",
    outcome_var="Y",
    treatment_var=["D1", "D2"],
    discrete_covars=["market", "period"],
    seed=42,
    n_bootstraps=200,
)
model.fit()
model.summary()
For a remote backend, pass connection=con and db_name=None.
Relation To The Notebook
The original notebooks/duckdml.ipynb demonstrates the same residualization idea on synthetic grouped controls. The website examples use smaller data, but the algebra is identical: compress to per-group sufficient statistics and solve one treatment cross-product system.