Descriptive Statistics & Balance Tables

DTable() displays descriptive statistics for a set of variables in a common layout. DTable() inherits from the MTable base class, which provides all the core output functionality. This means that DTable can generate tables in multiple formats (HTML/GT, docx, LaTeX). BTable() inherits from DTable() and displays simple balance tables, adding statistical tests for treatment comparisons.

Basic Usage of DTable()

Specify the variables you want to display the descriptive statistics for. Here we also directly define variable labels and set these as default labels (see Setting defaults in the ETable documentation).

# Import necessary libraries
import numpy as np
import pandas as pd
import maketables as mt

# Load sample dataset
df = pd.read_csv("../data/salaries.csv")

# Define variable labels
labels = {
    "logwage": "ln(Wage)",
    "wage": "Wage",
    "age": "Age",
    "female": "Female",
    "tenure": "Years of Tenure",
    "occupation": "Occupation",
    "worker_type": "Worker Type",
    "education": "Education Level"
}

# Set default labels 
mt.MTable.DEFAULT_LABELS = labels
mt.DTable(
    df,
    vars=["wage", "logwage", "age", "tenure"],
    caption="Descriptive statistics",
)
Descriptive statistics
                     N    Mean  Std. Dev.
Wage             1,800  62,742     28,312
ln(Wage)         1,800   10.94       0.48
Age              1,800   40.77      11.10
Years of Tenure  1,800   17.62      11.18

Choose the set of statistics to be displayed with stats. You can use any pandas aggregation functions.

mt.DTable(
    df,
    vars=["wage", "logwage", "age", "tenure"],
    stats=["count", "mean", "std", "min", "max"],
    caption="Descriptive statistics",
)
Descriptive statistics
                     N    Mean  Std. Dev.     Min      Max
Wage             1,800  62,742     28,312  25,000  166,589
ln(Wage)         1,800   10.94       0.48   10.13    12.02
Age              1,800   40.77      11.10   22.00    65.00
Years of Tenure  1,800   17.62      11.18    0.00    43.00

Summarize by characteristics in columns and rows

You can summarize by characteristics using the bycol argument when groups are to be displayed in columns. When the number of observations is the same for all variables in a group, you can also display the number of observations only once per group, in a separate line at the bottom of the table, with counts_row_below=True.

# Generate a categorical variable for gender from the dummy variable
df["gender"] = df["female"].map({0: "Male", 1: "Female"})

mt.DTable(
    df,
    vars=["wage", "logwage", "age", "tenure"],
    bycol=["worker_type","gender"],
    stats=["count", "mean", "std"],
    caption="Descriptive statistics by worker type and gender",
    stats_labels={"count": "Number of observations"},
    counts_row_below=True,
    digits=2)
Descriptive statistics by worker type and gender
                        Blue Collar                           White Collar
                        Female             Male               Female             Male
                          Mean  Std. Dev.    Mean  Std. Dev.    Mean  Std. Dev.    Mean  Std. Dev.
stats
Wage                    53,900     24,679  54,360     26,129  65,615     27,898  71,399     29,204
ln(Wage)                 10.79       0.47   10.79       0.49   11.00       0.45   11.08       0.46
Age                      41.10      10.96   39.83      11.14   41.79      11.02   40.20      11.17
Years of Tenure          17.86      11.19   16.73      11.15   18.59      11.08   17.10      11.23
nobs
Number of observations  357.00             368.00             530.00             545.00
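
To make the column grouping more concrete, the following is a minimal standard-library sketch of the kind of computation a bycol table summarizes: rows are bucketed by the tuple of grouping values, and each bucket is summarized on its own. It is not how DTable is implemented. The toy records and the helper summarize_by are made up for illustration and are not part of maketables.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical toy records (not the salaries data used above)
records = [
    {"worker_type": "Blue Collar", "gender": "Female", "wage": 50_000},
    {"worker_type": "Blue Collar", "gender": "Female", "wage": 54_000},
    {"worker_type": "Blue Collar", "gender": "Male", "wage": 52_000},
    {"worker_type": "Blue Collar", "gender": "Male", "wage": 58_000},
    {"worker_type": "White Collar", "gender": "Female", "wage": 64_000},
    {"worker_type": "White Collar", "gender": "Female", "wage": 68_000},
]


def summarize_by(rows, by, var):
    """Return {group key: (count, mean, sample std. dev.)} for one variable."""
    buckets = defaultdict(list)
    for row in rows:
        # One bucket per combination of grouping values, e.g. ("Blue Collar", "Female")
        key = tuple(row[col] for col in by)
        buckets[key].append(row[var])
    return {key: (len(vals), mean(vals), stdev(vals)) for key, vals in buckets.items()}


summary = summarize_by(records, by=["worker_type", "gender"], var="wage")
for key, (n, avg, sd) in sorted(summary.items()):
    print(key, n, f"{avg:,.0f}", f"{sd:,.0f}")
```

Each key in summary corresponds to one column group of the table, which is why several grouping variables in bycol produce nested column headers.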

You can also use custom aggregation functions to compute further statistics or to change how statistics are presented. Two such functions are provided, mean_std and mean_newline_std, which compute the mean and standard deviation and display both in the same cell (without or with a line break between them, respectively). This makes tables more compact when you want to show statistics for many characteristics in the columns.

You can also hide the statistics labels in the header with hide_stats=True. In that case, a table note naming the displayed statistics by their labels is added (unless you have provided a custom note).

mt.DTable(
    df,
    vars=["wage", "logwage", "age", "tenure"],
    bycol=["worker_type", "gender"],
    stats=["mean_newline_std", "count"],
    caption="Descriptive statistics by worker type and gender",
    stats_labels={"count": "Number of observations"},
    counts_row_below=True,
    hide_stats=True,
)
Descriptive statistics by worker type and gender
                        Blue Collar         White Collar
                          Female      Male    Female      Male
stats
Wage                      53,900    54,360    65,615    71,399
                        (24,679)  (26,129)  (27,898)  (29,204)
ln(Wage)                   10.79     10.79     11.00     11.08
                          (0.47)    (0.49)    (0.45)    (0.46)
Age                        41.10     39.83     41.79     40.20
                         (10.96)   (11.14)   (11.02)   (11.17)
Years of Tenure            17.86     16.73     18.59     17.10
                         (11.19)   (11.15)   (11.08)   (11.23)
nobs
Number of observations       357       368       530       545
Note: Displayed statistics are Mean (Std. Dev.).
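
If you want to see what such a combined cell amounts to, here is a minimal sketch using only the standard library. The function mean_std_cell is a hypothetical stand-in written for this illustration, not the library's own mean_std or mean_newline_std, and the two-decimal default format is an assumption. It uses the sample standard deviation, which is also the pandas default.

```python
from statistics import mean, stdev


def mean_std_cell(values, fmt=",.2f", newline=True):
    """Format the mean and sample std. dev. of `values` as one table cell."""
    sep = "\n" if newline else " "
    return f"{mean(values):{fmt}}{sep}({stdev(values):{fmt}})"


ages = [30, 35, 40, 45, 50]
print(mean_std_cell(ages))                 # mean on one line, (std) on the next
print(mean_std_cell(ages, newline=False))  # both on a single line
```

Returning a ready-formatted string rather than a number is what lets one column carry two statistics, at the cost that the cell can no longer be reformatted numerically afterwards.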

You can also split by characteristics in both columns and rows. Note that you can only use one grouping variable in rows, but several in columns (as shown above).

mt.DTable(
    df,
    vars=["wage", "logwage", "age", "tenure"],
    bycol=["worker_type"],
    byrow="gender",
    stats=["count", "mean", "std"],
    caption="Descriptive statistics by worker type and gender",
)
Descriptive statistics by worker type and gender
                 Blue Collar                White Collar
                      N    Mean  Std. Dev.       N    Mean  Std. Dev.
Female
Wage             357.00  53,900     24,679  530.00  65,615     27,898
ln(Wage)         357.00   10.79       0.47  530.00   11.00       0.45
Age              357.00   41.10      10.96  530.00   41.79      11.02
Years of Tenure  357.00   17.86      11.19  530.00   18.59      11.08
Male
Wage             368.00  54,360     26,129  545.00  71,399     29,204
ln(Wage)         368.00   10.79       0.49  545.00   11.08       0.46
Age              368.00   39.83      11.14  545.00   40.20      11.17
Years of Tenure  368.00   16.73      11.15  545.00   17.10      11.23

Number formatting

DTable supports flexible number formatting via the format_spec argument. You can control formatting at three levels by passing a dictionary:

  • Key types accepted:
    • ('var', 'stat'): per-variable and per-statistic (most specific)
    • 'var': all statistics for a specific variable
    • 'stat': that statistic for all variables
  • Lookup priority (applied in this order): (var, stat) → var → stat.

This logic lets you set global statistic styles, per-variable styles, or very specific per-variable/statistic styles; the most specific match wins.
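
The lookup order can be sketched in a few lines of plain Python. The helper resolve_format and its ".2f" fallback are hypothetical and only illustrate the priority rule described above; DTable's own defaults and internals may differ.

```python
def resolve_format(var, stat, format_spec, default=".2f"):
    """Pick the most specific format string for a (variable, statistic) cell."""
    # Try keys from most to least specific: (var, stat), then var, then stat
    for key in ((var, stat), var, stat):
        if key in format_spec:
            return format_spec[key]
    return default  # assumed fallback, not the library's actual default


spec = {
    "wage": ",.1f",          # all statistics of wage
    ("age", "mean"): ".3f",  # only the mean of age
    "count": ",.0f",         # counts of every variable
}

print(format(62741.8, resolve_format("wage", "mean", spec)))  # variable-level rule
print(format(40.769, resolve_format("age", "mean", spec)))    # (var, stat) rule
print(format(1800, resolve_format("age", "count", spec)))     # statistic-level rule
print(format(1800, resolve_format("wage", "count", spec)))    # variable beats statistic
```

Note the last call: because a variable-level key outranks a statistic-level key, a format set for a variable also applies to that variable's count unless a more specific (var, stat) entry overrides it.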

# Custom format specifications for variables/statistics
format_specs = {
    # Per-variable formats (applies to all stats for that variable unless overridden)
    'wage': ',.1f',     # Wage always with 1 decimal
    # Per-variable/statistic formats (most specific, takes precedence)
    ('age', 'mean'): '.3f',   # Age mean with 3 decimals
    ('tenure', 'std'): '.4f', # Tenure std with 4 decimals
}

mt.DTable(
    df,
    vars=["wage", "age", "tenure"],
    stats=["mean", "std", "min", "max", "count"],
    format_spec=format_specs,
    caption="Custom formatting example with per-variable/statistic logic"
)
Custom formatting example with per-variable/statistic logic
                     Mean  Std. Dev.       Min        Max        N
Wage             62,741.8   28,312.4  25,000.0  166,589.0  1,800.0
Age                40.769      11.10     22.00      65.00    1,800
Years of Tenure     17.62    11.1762      0.00      43.00    1,800

Balance Tables with BTable

Balance tables can be displayed with BTable, which is based on DTable and so inherits most of the latter's functionality. It constructs simple balance tables that show variables by groups (such as treatments in an experiment) and performs statistical tests comparing these variables between the groups, displaying the respective p-values.

For two groups, it displays the p-value of the single group indicator (t test); for more than two groups, it displays the p-value of a joint Wald test that all group indicators are zero. BTable uses pyfixest to perform the tests. You can add fixed effects via fixed_effects= ... and specify the vcov option, for instance to implement clustering (see pyfixest documentation).

mt.BTable(
    df,
    vars=["wage", "logwage", "age", "tenure"],
    group="worker_type",
    caption="Balance Table",
)
Balance Table
                 Blue Collar        White Collar       p-value
                   Mean  Std. Dev.    Mean  Std. Dev.
Wage             54,134     25,409  68,547     28,701    0.000
ln(Wage)          10.79       0.48   11.04       0.46    0.000
Age               40.46      11.06   40.98      11.12    0.324
Years of Tenure   17.29      11.18   17.84      11.18    0.308
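
For intuition about the two-group case, here is a minimal standard-library sketch of a test on a single group indicator. It is not how BTable computes its p-values (BTable delegates to pyfixest). The function two_group_test and the toy data are hypothetical, the standard error assumes i.i.d. errors, and the p-value uses a normal approximation instead of the t distribution, so the result will differ slightly from an exact t test, especially in small samples.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_group_test(y0, y1):
    """Difference in means, t-statistic, and approximate two-sided p-value.

    The pooled-variance standard error used here equals the OLS standard
    error of the coefficient on a group dummy under i.i.d. errors.
    """
    n0, n1 = len(y0), len(y1)
    diff = mean(y1) - mean(y0)  # equals the OLS coefficient on the group dummy
    pooled_var = ((n0 - 1) * variance(y0) + (n1 - 1) * variance(y1)) / (n0 + n1 - 2)
    se = sqrt(pooled_var * (1 / n0 + 1 / n1))
    t_stat = diff / se
    # Assumption: normal approximation to the t distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return diff, t_stat, p_value


# Hypothetical outcome values for a control and a treatment group
control = [1, 2, 3, 4, 5]
treated = [2, 4, 6, 8, 10]
diff, t_stat, p_value = two_group_test(control, treated)
print(f"difference = {diff}, t = {t_stat:.3f}, approx. p = {p_value:.3f}")
```

A small p-value indicates that the variable differs between the groups by more than sampling noise would suggest, which is the information the p-value column of a balance table conveys.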