duckreg
  • Home
  • Ibis
  • Compression
  • Linear
  • Panel
  • DML
  • GLMs
  • Fisher Scoring
  • Ridge
  • Inference
  • Examples
  • Performance
  1. Linear Regression API
  • duckreg
  • Ibis Backends
  • Compression and Estimator Lifecycle
  • Linear Regression API
  • Panel Estimators
  • Compressed Double Machine Learning
  • Generalized Linear Models
  • Fisher Scoring and Multinomial GLMs
  • Compressed Ridge Regression
  • Inference and Variance Estimation
  • Executed Examples
  • Performance Comparisons

On this page

  • Constructor
  • Basic Fit
  • Analytic HC1 Covariance
  • Bootstrap
  • Backwards Compatibility

Linear Regression API

DBRegression is the preferred linear-regression interface. It compresses ordinary least squares by grouping on the right-hand-side variables, then solves weighted least squares on the compressed cells.

Constructor

DBRegression(
    db_name: str | None,
    table_name: str,
    formula: str,
    cluster_col: str | None,
    seed: int,
    n_bootstraps: int = 100,
    rowid_col: str = "rowid",
    fitter: str = "numpy",
    connection=None,
)

The formula is a standard additive linear formula:

formula = "Y ~ D + f1 + f2"

Multiple outcomes are supported:

formula = "Y + Y2 ~ D + f1 + f2"

Fixed-effect separators such as Y ~ D | unit + time are intentionally rejected. Use DBMundlak, DBDoubleDemeaning, or DBMundlakEventStudy for those designs.

Basic Fit

from duckreg import DBRegression

model = DBRegression(
    db_name="large_dataset.db",
    table_name="data",
    formula="Y ~ D + f1 + f2",
    cluster_col=None,
    seed=42,
    n_bootstraps=0,
)
model.fit()
model.fit_vcov()
model.summary()

The returned point estimate is ordered as intercept followed by the RHS variables:

["Intercept", "D", "f1", "f2"]

Analytic HC1 Covariance

For a single outcome, fit_vcov() computes an HC1-style sandwich covariance from compressed sufficient statistics:

\[ \hat{V} = \frac{N}{N-k} (X'WX)^{-1} \left(\sum_g RSS_g x_gx_g'\right) (X'WX)^{-1}. \]

The grouped residual sum of squares is

\[ RSS_g = n_g \hat{y}_g^2 -2\hat{y}_g \sum_{i \in g} y_i + \sum_{i \in g} y_i^2. \]

This is why compression stores both sum_Y and sum_Y_sq.

Bootstrap

When n_bootstraps > 0, fit() calls bootstrap().

Setting Path
cluster_col=None Resample compressed rows and recompute weighted least squares.
cluster_col="cluster" Group by covariate cell and cluster, resample clusters, then collapse back to covariate cells.

In the DBRegression implementation, cluster bootstrap multiplicities are handled in pandas after collecting a compressed cluster-by-cell table. This avoids DuckDB-only unnest(?) idioms in the backend-neutral path.

Backwards Compatibility

DuckRegression has the same constructor shape and remains exported:

from duckreg import DuckRegression

For new code, prefer DBRegression unless you specifically need to preserve an older DuckRegression workflow.