The original notebooks benchmarked duckreg against full in-memory regression workflows. This page keeps the same spirit but uses a compact reproducible benchmark that is fast enough to render with the documentation.
Absolute timings depend on hardware, backend, and data layout. The stable quantity is the compression ratio: when the design has many repeated covariate rows, the database can reduce a large raw table to a much smaller sufficient-statistic table before Python solves anything.
Setup
import importlib.util
import os
import tempfile
import time

import ibis
import numpy as np
import pandas as pd

from duckreg import DBRegression

rng = np.random.default_rng(2027)
tmpdir = tempfile.mkdtemp(prefix="duckreg_perf_")


def elapsed(label, fn):
    # Time a zero-argument callable; return (label, seconds, result).
    start = time.perf_counter()
    out = fn()
    return label, time.perf_counter() - start, out
Data
The benchmark uses 250,000 rows, but only 2,000 unique right-hand-side cells.
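The data-generating code is not reproduced on this page. As a rough sketch only, a table of this shape could be built as follows. The 2 x 20 x 50 cell layout, the coefficients, and the noise scale are all assumptions made for illustration; only the row count, the cell count, and the column names `D`, `f1`, `f2`, and `Y` come from this page.

```python
import numpy as np
import pandas as pd

N_ROWS = 250_000
N_CELLS = 2_000  # assumed layout: 2 (D) x 20 (f1) x 50 (f2) distinct cells

rng = np.random.default_rng(2027)

# Draw a cell id per row, then decode it into the three RHS variables.
cell = rng.integers(0, N_CELLS, size=N_ROWS)
df = pd.DataFrame(
    {
        "D": cell % 2,           # binary treatment indicator
        "f1": (cell // 2) % 20,  # 20 levels
        "f2": cell // 40,        # 50 levels
    }
)

# Hypothetical linear outcome with standard normal noise.
df["Y"] = 1.0 + 0.5 * df["D"] + 0.1 * df["f1"] - 0.05 * df["f2"] + rng.normal(size=N_ROWS)

# Count the distinct right-hand-side cells actually present.
n_unique = df.groupby(["D", "f1", "f2"]).ngroups
print(len(df), n_unique)
```

With about 125 rows per cell on average, every cell is populated, so the table compresses by roughly two orders of magnitude before any solve.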
This baseline materializes the whole design matrix in Python and runs ordinary least squares directly. It is not a full replacement for a statistical package, but it shows the cost of solving on all rows instead of compressed cells.
def run_numpy_ols():
    # Build the full design matrix (intercept plus regressors) in memory.
    X = np.c_[np.ones(len(df)), df[["D", "f1", "f2"]].to_numpy()]
    y = df["Y"].to_numpy()
    return np.linalg.lstsq(X, y, rcond=None)[0]


numpy_label, numpy_time, numpy_estimate = elapsed("NumPy full OLS", run_numpy_ols)
Optional pyfixest Baseline
If pyfixest is installed, the render also runs a package-level OLS baseline.
if importlib.util.find_spec("pyfixest") is not None:
    import pyfixest as pf

    def run_pyfixest():
        return pf.feols("Y ~ D + f1 + f2", data=df, vcov="hetero")

    pyfixest_label, pyfixest_time, pyfixest_fit = elapsed("pyfixest", run_pyfixest)
    pyfixest_estimate = pyfixest_fit.coef().reindex(["Intercept", "D", "f1", "f2"]).to_numpy()
else:
    # pyfixest is not installed: keep placeholders so later cells still run.
    pyfixest_label, pyfixest_time, pyfixest_estimate = "pyfixest", np.nan, None
The compressed and full-data OLS point estimates match because the grouped sufficient statistics preserve the linear-model normal equations exactly. The performance gain comes from moving the grouping operation to the database and collecting only the compressed table.
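The equivalence can be checked by hand on a small table. The sketch below does not use duckreg; it groups a toy data set with pandas, keeps only the cell count and the sum of `Y` per cell, and solves the weighted normal equations on the compressed table. The toy data, the coefficients, and the names `toy`, `cells`, `beta_full`, and `beta_cells` are illustrative assumptions, and the sketch covers point estimates only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Toy data with discrete regressors, so many rows share the same RHS cell.
toy = pd.DataFrame(
    {
        "D": rng.integers(0, 2, n),
        "f1": rng.integers(0, 5, n),
        "f2": rng.integers(0, 4, n),
    }
)
toy["Y"] = 1.0 + 0.5 * toy["D"] + 0.1 * toy["f1"] - 0.2 * toy["f2"] + rng.normal(size=n)

# Full-data OLS on all n rows.
X_full = np.c_[np.ones(n), toy[["D", "f1", "f2"]].to_numpy()]
beta_full = np.linalg.lstsq(X_full, toy["Y"].to_numpy(), rcond=None)[0]

# Compress: one row per distinct (D, f1, f2) cell with its count and sum of Y.
cells = toy.groupby(["D", "f1", "f2"], as_index=False).agg(
    n=("Y", "size"), sum_y=("Y", "sum")
)
X_cell = np.c_[np.ones(len(cells)), cells[["D", "f1", "f2"]].to_numpy()]
w = cells["n"].to_numpy()

# Normal equations rebuilt from the compressed table:
#   X'X = sum over cells of n_g * x_g x_g'
#   X'y = sum over cells of x_g * sum_y_g
XtX = X_cell.T @ (X_cell * w[:, None])
Xty = X_cell.T @ cells["sum_y"].to_numpy()
beta_cells = np.linalg.solve(XtX, Xty)

print(len(toy), "rows ->", len(cells), "cells")
print(np.allclose(beta_full, beta_cells))
```

Both solves use the same `X'X` and `X'y`, so the coefficients agree up to floating-point error. Standard errors need more than the count and the sum (for example, the per-cell sum of squared `Y`), which this sketch leaves out.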
What To Expect On Larger Data
The notebook benchmarks used larger local tables and found large speedups when the design compressed aggressively. The same pattern should hold on remote backends when:
The backend can group the data efficiently.
The RHS variables are discrete or saturated enough to create repeated cells.
The collected sufficient-statistic table is much smaller than the raw table.
If a design is nearly continuous and every row is unique, compression cannot do much. In that case the database still handles storage and query execution, but the in-memory solve approaches the raw-data problem size.
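A quick way to tell which case a design falls into is to count distinct RHS rows before fitting. The sketch below does this with pandas on a local table or sample; `compression_ratio` is a hypothetical helper, not part of duckreg, and on a remote backend the same count would come from a distinct-count query instead.

```python
import numpy as np
import pandas as pd


def compression_ratio(frame, rhs):
    """Raw rows per distinct RHS cell; 1.0 means no compression is possible."""
    n_cells = frame.groupby(rhs).ngroups
    return len(frame) / n_cells


rng = np.random.default_rng(1)
n = 50_000

# Discrete design: 2 x 10 = 20 possible cells.
discrete = pd.DataFrame({"D": rng.integers(0, 2, n), "f1": rng.integers(0, 10, n)})

# Nearly continuous design: almost every row is its own cell.
continuous = pd.DataFrame({"D": rng.integers(0, 2, n), "x": rng.normal(size=n)})

ratio_discrete = compression_ratio(discrete, ["D", "f1"])
ratio_continuous = compression_ratio(continuous, ["D", "x"])
print(ratio_discrete, ratio_continuous)
```

A ratio far above 1 means the collected table is much smaller than the raw one. A ratio near 1 means the solve sees close to the full row count.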