duckreg
  • Home
  • Ibis
  • Compression
  • Linear
  • Panel
  • DML
  • GLMs
  • Fisher Scoring
  • Ridge
  • Inference
  • Examples
  • Performance
  1. Ibis Backends
  • duckreg
  • Ibis Backends
  • Compression and Estimator Lifecycle
  • Linear Regression API
  • Panel Estimators
  • Compressed Double Machine Learning
  • Generalized Linear Models
  • Fisher Scoring and Multinomial GLMs
  • Compressed Ridge Regression
  • Inference and Variance Estimation
  • Executed Examples
  • Performance Comparisons

On this page

  • Preferred API
  • Databricks
  • Connection Forms
  • Backend Requirements
  • Migration From Duck* To DB*
  • Current Limits

Ibis Backends

The 0.4 API separates estimator logic from a specific SQL engine. duckreg uses Ibis as the database expression layer, then collects only compressed sufficient statistics into Python.

Preferred API

Use the DB* estimators for backend-neutral work:

import ibis
from duckreg import DBRegression

con = ibis.duckdb.connect("large_dataset.db")

model = DBRegression(
    db_name=None,
    connection=con,
    table_name="data",
    formula="Y ~ D + X",
    cluster_col="cluster_id",
    seed=42,
    n_bootstraps=0,
)
model.fit()
model.fit_vcov()
model.summary()

For local DuckDB use, db_name is still enough:

from duckreg import DBRegression

model = DBRegression(
    db_name="large_dataset.db",
    table_name="data",
    formula="Y ~ D + X",
    cluster_col=None,
    seed=42,
    n_bootstraps=0,
)

Databricks

Install the Databricks extra and pass a live Ibis backend:

uv pip install "duckreg[databricks]"
import ibis
from duckreg import DBRegression

con = ibis.databricks.connect(
    server_hostname="dbc-...",
    http_path="/sql/1.0/warehouses/...",
    access_token="...",
    catalog="main",
    schema="analytics",
)

model = DBRegression(
    db_name=None,
    connection=con,
    table_name="experiment_events",
    formula="Y ~ treatment + segment + period",
    cluster_col="user_id",
    seed=42,
    n_bootstraps=0,
)
model.fit()
model.fit_vcov()

The same pattern applies to any backend created by ibis.<backend>.connect(...).

Connection Forms

duckreg accepts three connection styles:

Argument pattern Meaning
db_name="file.db" Open a DuckDB database through Ibis.
db_name="duckdb:///file.db" Use ibis.connect(...) on an Ibis URL.
db_name=None, connection=con Use an existing Ibis backend object.

The third form is the safest for remote engines because credentials, catalog selection, and session settings stay under caller control.

Backend Requirements

The estimators rely on ordinary relational operations:

Estimator family Ibis operations used
DBRegression group_by, aggregate, sums, counts, joins for bootstrap multiplicities.
DBDML grouped sums and cross-products, plus HAVING n_g > 1 expressed as an Ibis filter.
DBMundlak group means by unit/time and joins back to the base table.
DBDoubleDemeaning unit means, time means, an overall mean, joins, and a cross join.
DBMundlakEventStudy cohort construction, generated indicator columns, compression, cluster bootstrap.
DBLogisticRegression and DBPoissonRegression grouped counts and outcome sums.
DBMultinomialLogisticRegression grouped class-count indicators.
DBPoissonMultinomialRegression grouped label-by-covariate count sums.

The implementation avoids backend-specific constructs such as DuckDB unnest(?) in the DB* path. The compatibility Duck* estimators still contain older DuckDB-oriented SQL in some methods.

Migration From Duck* To DB*

The constructor shapes are intentionally close:

from duckreg import DuckRegression, DBRegression

old = DuckRegression(
    db_name="large_dataset.db",
    table_name="data",
    formula="Y ~ D + X",
    cluster_col="cluster_id",
    seed=42,
    n_bootstraps=0,
)

new = DBRegression(
    db_name="large_dataset.db",
    table_name="data",
    formula="Y ~ D + X",
    cluster_col="cluster_id",
    seed=42,
    n_bootstraps=0,
)

For remote data, keep the estimator call the same but pass connection=con and db_name=None.

Current Limits

DBRegression does not parse fixed effects in formulas like Y ~ D | unit + time. Use panel estimators instead:

from duckreg import DBMundlak, DBDoubleDemeaning

GLM bootstraps are not implemented. Use fit_vcov() with n_bootstraps=0 for DBLogisticRegression, DBPoissonRegression, and DBMultinomialLogisticRegression.