Performance Comparisons

The original notebooks benchmarked duckreg against full in-memory regression workflows. This page keeps the same spirit but uses a compact reproducible benchmark that is fast enough to render with the documentation.

Absolute timings depend on hardware, backend, and data layout. The stable quantity is the compression ratio: when the design has many repeated covariate rows, the database can reduce a large raw table to a much smaller sufficient-statistic table before Python solves anything.
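To make the compression step concrete, the following is a minimal sketch of the kind of grouped query a database can run before any data reaches Python. It uses the standard-library sqlite3 module purely as a stand-in backend. The table layout and the column names n, sum_y, and sum_y_sq are illustrative assumptions, not duckreg's actual query.

```python
import sqlite3

import numpy as np

# Small synthetic table with two discrete regressors (illustrative only).
rng = np.random.default_rng(0)
n = 10_000
d = rng.integers(0, 2, n)
f1 = rng.integers(0, 5, n)
y = 1.0 + 1.5 * d + 0.04 * f1 + rng.normal(size=n)

# sqlite3 stands in for the database backend in this sketch.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (D INTEGER, f1 INTEGER, Y REAL)")
con.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    zip(d.tolist(), f1.tolist(), y.tolist()),
)

# One output row per distinct (D, f1) cell, carrying the count and the
# first two moments of Y. Only this small table is collected into Python.
compressed = con.execute(
    """
    SELECT D, f1, COUNT(*) AS n, SUM(Y) AS sum_y, SUM(Y * Y) AS sum_y_sq
    FROM data
    GROUP BY D, f1
    ORDER BY D, f1
    """
).fetchall()

# At most 2 * 5 = 10 cells can exist, however many raw rows there are.
print(len(compressed), "compressed rows from", n, "raw rows")
```

The number of collected rows is bounded by the number of distinct cells, so it stays fixed as the raw table grows.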

Setup

import importlib.util
import os
import tempfile
import time

import ibis
import numpy as np
import pandas as pd

from duckreg import DBRegression

rng = np.random.default_rng(2027)
tmpdir = tempfile.mkdtemp(prefix="duckreg_perf_")


def elapsed(label, fn):
    start = time.perf_counter()
    out = fn()
    return label, time.perf_counter() - start, out
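The helper above times a single call, which is enough for a rendered page but can be noisy. One way to reduce that noise is to repeat the call and keep the fastest run. The sketch below shows this with a hypothetical helper named elapsed_best, which is not part of this page's benchmark. Repeated runs warm caches, so it measures steady-state cost and not cold-start cost.

```python
import time


def elapsed_best(label, fn, repeats=5):
    """Run fn several times; return the label, the fastest time, and the last result."""
    best = float("inf")
    out = None
    for _ in range(repeats):
        start = time.perf_counter()
        out = fn()
        best = min(best, time.perf_counter() - start)
    return label, best, out


# Usage with a cheap stand-in workload.
label, seconds, result = elapsed_best("sum", lambda: sum(range(100_000)))
```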

Data

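The benchmark uses 250,000 rows but only 2,000 unique right-hand-side cells: D takes 2 values, f1 takes 50, and f2 takes 20, so there are at most 2 × 50 × 20 = 2,000 distinct combinations.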

n = 250_000
df = pd.DataFrame(
    {
        "rowid": np.arange(n),
        "D": rng.integers(0, 2, n).astype(float),
        "f1": rng.integers(0, 50, n).astype(float),
        "f2": rng.integers(0, 20, n).astype(float),
    }
)
df["Y"] = 1.0 + 1.5 * df["D"] + 0.04 * df["f1"] - 0.03 * df["f2"] + rng.normal(size=n)

db_path = os.path.join(tmpdir, "perf.db")
con = ibis.duckdb.connect(db_path)
con.create_table("data", df, overwrite=True)

raw_cells = len(df)
unique_cells = df[["D", "f1", "f2"]].drop_duplicates().shape[0]
pd.DataFrame(
    {
        "raw_rows": [raw_cells],
        "unique_design_cells": [unique_cells],
        "compression_ratio": [raw_cells / unique_cells],
    }
)
   raw_rows  unique_design_cells  compression_ratio
0    250000                 2000              125.0

Compressed Regression

def run_duckreg():
    model = DBRegression(
        db_name=None,
        connection=con,
        table_name="data",
        formula="Y ~ D + f1 + f2",
        cluster_col=None,
        seed=42,
        n_bootstraps=0,
    )
    model.fit()
    model.fit_vcov()
    return model

duckreg_label, duckreg_time, duck_model = elapsed("DBRegression", run_duckreg)
# Build the summary once and reuse it for both columns.
duck_summary = duck_model.summary()
duck_estimate = duck_summary["point_estimate"]
duck_se = duck_summary["standard_error"]

pd.DataFrame(
    {
        "term": ["Intercept", "D", "f1", "f2"],
        "estimate": duck_estimate,
        "std_error": duck_se,
    }
)
        term  estimate  std_error
0  Intercept  0.997041   0.005510
1          D  1.500149   0.003996
2         f1  0.040063   0.000139
3         f2 -0.029919   0.000347

Full NumPy Baseline

This baseline materializes the whole design matrix in Python and runs ordinary least squares directly. It is not a full replacement for a statistical package, since it returns only coefficients and no standard errors, but it shows the cost of solving on all rows instead of compressed cells.

def run_numpy_ols():
    X = np.c_[np.ones(len(df)), df[["D", "f1", "f2"]].to_numpy()]
    y = df["Y"].to_numpy()
    return np.linalg.lstsq(X, y, rcond=None)[0]

numpy_label, numpy_time, numpy_estimate = elapsed("NumPy full OLS", run_numpy_ols)
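The NumPy baseline above returns only point estimates, whereas the duckreg run also calls fit_vcov(). For a closer like-for-like comparison, a full-data baseline could compute standard errors as well. The following is a minimal sketch assuming classical homoskedastic errors, which may not be the variance estimator duckreg uses. The function name ols_with_classical_se is hypothetical.

```python
import numpy as np


def ols_with_classical_se(X, y):
    """OLS coefficients and classical (homoskedastic) standard errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    resid = y - X @ beta
    # Unbiased residual variance with n - k degrees of freedom.
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    return beta, se


# Small synthetic check (illustrative data, not the benchmark table).
rng = np.random.default_rng(3)
n = 5_000
X = np.column_stack([np.ones(n), rng.integers(0, 2, n), rng.integers(0, 50, n)])
y = X @ np.array([1.0, 1.5, 0.04]) + rng.normal(size=n)
beta, se = ols_with_classical_se(X, y)
```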

Optional pyfixest Baseline

If pyfixest is installed, the render also runs a package-level OLS baseline.

if importlib.util.find_spec("pyfixest") is not None:
    import pyfixest as pf

    def run_pyfixest():
        return pf.feols("Y ~ D + f1 + f2", data=df, vcov="hetero")

    pyfixest_label, pyfixest_time, pyfixest_fit = elapsed("pyfixest", run_pyfixest)
    pyfixest_estimate = pyfixest_fit.coef().reindex(["Intercept", "D", "f1", "f2"]).to_numpy()
else:
    pyfixest_label, pyfixest_time, pyfixest_estimate = "pyfixest", np.nan, None

Timing Summary

summary = pd.DataFrame(
    {
        "method": [duckreg_label, numpy_label, pyfixest_label],
        "seconds": [duckreg_time, numpy_time, pyfixest_time],
        "rows_seen_by_python": [len(duck_model.df_compressed), len(df), len(df)],
    }
)
summary["relative_to_duckreg"] = summary["seconds"] / duckreg_time
summary
           method   seconds  rows_seen_by_python  relative_to_duckreg
0    DBRegression  0.039078                 2000             1.000000
1  NumPy full OLS  0.014897               250000             0.381206
2        pyfixest  1.359962               250000            34.801247

Coefficient Check

coef_table = pd.DataFrame(
    {
        "term": ["Intercept", "D", "f1", "f2"],
        "duckreg": duck_estimate,
        "numpy_full_ols": numpy_estimate,
    }
)
if pyfixest_estimate is not None:
    coef_table["pyfixest"] = pyfixest_estimate
coef_table
        term   duckreg  numpy_full_ols  pyfixest
0  Intercept  0.997041        0.997041  0.997041
1          D  1.500149        1.500149  1.500149
2         f1  0.040063        0.040063  0.040063
3         f2 -0.029919       -0.029919 -0.029919

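The compressed and full-data OLS point estimates match because the grouped sufficient statistics preserve the linear-model normal equations exactly. In the timings shown above the picture is mixed: DBRegression is roughly 35 times faster than pyfixest, but the bare NumPy solve is faster still, because it starts from data already in memory and computes no standard errors. The gain from compression comes from moving the grouping operation to the database and collecting only the compressed table, so it grows as the raw table gets larger relative to the number of distinct cells.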
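The equivalence of the normal equations can be checked directly. With one row x_g per distinct cell, a count n_g, and a within-cell sum of outcomes s_g, the full-data cross products are X'X = Σ n_g x_g x_g' and X'y = Σ x_g s_g. The sketch below verifies this in NumPy on synthetic data. It illustrates the algebra only and is not duckreg's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
X_raw = np.column_stack(
    [np.ones(n), rng.integers(0, 2, n), rng.integers(0, 10, n)]
).astype(float)
y = X_raw @ np.array([1.0, 1.5, 0.04]) + rng.normal(size=n)

# Full-data OLS on all n rows.
beta_full = np.linalg.lstsq(X_raw, y, rcond=None)[0]

# Compress: one row per distinct design cell, with counts and sums of y.
cells, inverse, counts = np.unique(
    X_raw, axis=0, return_inverse=True, return_counts=True
)
inverse = inverse.ravel()  # keep the index 1-D across NumPy versions
sum_y = np.bincount(inverse, weights=y, minlength=len(cells))

# Rebuild the normal equations from the compressed table alone.
XtX = cells.T @ (cells * counts[:, None])
Xty = cells.T @ sum_y
beta_compressed = np.linalg.solve(XtX, Xty)

print(len(cells), "cells reproduce the fit on", n, "rows")
```

Only the counts and sums enter the point estimates. Variance estimation needs additional statistics such as the within-cell sum of squared outcomes.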

What To Expect On Larger Data

The original notebook benchmarks used larger local tables and found substantial speedups when the design compressed aggressively. The same pattern should hold on remote backends when:

  1. The backend can group the data efficiently.
  2. The RHS variables are discrete or saturated enough to create repeated cells.
  3. The collected sufficient-statistic table is much smaller than the raw table.

If a design is nearly continuous and every row is unique, compression cannot do much. In that case the database still handles storage and query execution, but the in-memory solve approaches the raw-data problem size.
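A quick way to anticipate which regime a design falls into is to compare raw rows with distinct design rows before fitting. The sketch below does this in NumPy for one discrete and one continuous design. The helper name compression_ratio is hypothetical, and on a real backend the same count would more sensibly be computed in the database.

```python
import numpy as np


def compression_ratio(X):
    """Raw rows divided by distinct rows of the design matrix X."""
    X = np.asarray(X)
    n_unique = np.unique(X, axis=0).shape[0]
    return X.shape[0] / n_unique


rng = np.random.default_rng(2)
n = 50_000

# Two discrete regressors: at most 2 * 50 = 100 distinct cells.
discrete = np.column_stack([rng.integers(0, 2, n), rng.integers(0, 50, n)])

# One continuous regressor: almost every row is its own cell.
continuous = np.column_stack([rng.integers(0, 2, n), rng.normal(size=n)])

ratio_discrete = compression_ratio(discrete)
ratio_continuous = compression_ratio(continuous)
print(ratio_discrete, ratio_continuous)
```

A ratio close to 1 indicates that compression will not reduce the size of the in-memory solve.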