Skip to content
📖 Welcome to my knowledge base! Notes on AI/ML, Maths, CS, MBA, Trading, Economics, Health & Self-Help — all in one place.! 🎉 Discover what’s new

Statsmodels

Statsmodels is a foundational Python library for statistical modeling, hypothesis testing, and data exploration. It complements machine learning libraries like scikit-learn by focusing on statistical inference, providing p-values, confidence intervals, and detailed summaries for model results. This tutorial provides a comprehensive guide to using statsmodels for various statistical analyses.


Installation and Setup

Installing Statsmodels

pip install statsmodels

Or with conda:

conda install statsmodels

Importing Statsmodels

Statsmodels is typically imported with two main entry points:

import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd

The statsmodels.api module provides functions that use arrays and matrices, while statsmodels.formula.api provides functions that use R-style formula strings and pandas DataFrames.


Core Concepts

Endogenous and Exogenous Variables

Statsmodels distinguishes between:

  • Endogenous variables (endog): The dependent/response variable(s) you want to explain
  • Exogenous variables (exog): The independent/predictor variable(s) you use for explanation

Two Approaches to Model Specification

Approach 1: Using formulas (R-style)

import statsmodels.formula.api as smf

# Load example data
dat = sm.datasets.get_rdataset("Guerry", "HistData").data

# Fit model using formula
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()
print(results.summary())

Approach 2: Using arrays (NumPy-style)

import statsmodels.api as sm

# Generate data
nobs = 100
X = np.random.random((nobs, 2))
X = sm.add_constant(X)  # Add intercept term
beta = [1, 0.1, 0.5]
e = np.random.random(nobs)
y = np.dot(X, beta) + e

# Fit model
results = sm.OLS(y, X).fit()
print(results.summary())

Benefits of Formula Approach

The formula approach offers several advantages:

  • Intuitive specification using R-like syntax
  • Automatic handling of categorical variables
  • Support for transformations and interactions
  • Integration with pandas DataFrames
# Formula syntax examples
'Lottery ~ Literacy + Wealth + Region'  # Main effects
'Lottery ~ Literacy * Wealth'            # Interaction term
'Lottery ~ Literacy + Wealth - 1'        # Remove intercept
'Lottery ~ np.log(Literacy)'             # Apply function

Regression and Linear Models

Ordinary Least Squares (OLS)

OLS is the foundational regression method for modeling the relationship between a continuous dependent variable and one or more independent variables.

Basic OLS Example:

import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np

# Load example dataset
dat = sm.datasets.get_rdataset("Guerry", "HistData").data
dat = dat[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()

# Fit OLS model
model = smf.ols('Lottery ~ Literacy + Wealth + Region', data=dat)
results = model.fit()

# Summary output
print(results.summary())

The summary output provides:

  • Coefficients (coef): Estimated regression coefficients
  • Standard errors (std err): Measure of coefficient estimate precision
  • t-statistics and p-values: Significance tests for individual coefficients
  • R-squared: Proportion of variance explained
  • F-statistic: Overall model significance test
  • AIC/BIC: Model selection criteria

Key attributes and methods from results object:

# View available results
dir(results)

# Access specific results
results.params          # Coefficients
results.pvalues         # P-values
results.rsquared        # R-squared
results.fvalue          # F-statistic
results.resid           # Residuals
results.fittedvalues    # Predicted values

Generalized Linear Models (GLM)

GLMs extend linear regression to response variables with non-normal distributions through link functions.

Supported Families:

  • Gaussian (normal) - identity link
  • Binomial - logit link (logistic regression)
  • Poisson - log link (count data)
  • Gamma - inverse power link
  • Inverse Gaussian

GLM Example (Gamma family):

import statsmodels.api as sm

# Load data
data = sm.datasets.scotland.load()
data.exog = sm.add_constant(data.exog)

# Fit GLM with Gamma family
gamma_model = sm.GLM(data.endog, data.exog, family=sm.families.Gamma())
gamma_results = gamma_model.fit()

print(gamma_results.summary())

GLM Example (Poisson family for count data):

import statsmodels.api as sm

# Load affairs dataset
data = sm.datasets.fair.load_pandas().data
data["affairs"] = np.ceil(data["affairs"])  # Convert to count

# Fit Poisson GLM
poisson_model = sm.GLM(data["affairs"], 
                       data[["rate_marriage", "age", "yrs_married"]],
                       family=sm.families.Poisson())
poisson_results = poisson_model.fit()

Robust Linear Models

For data with outliers or heteroskedasticity, robust regression methods provide more reliable estimates.

import statsmodels.api as sm
from statsmodels.robust.robust_linear_model import RLM

# Fit robust linear model
rlm_model = sm.RLM(y, X)
rlm_results = rlm_model.fit()

Linear Mixed Effects Models

For hierarchical or grouped data, mixed effects models account for both fixed and random effects.

import statsmodels.api as sm
from statsmodels.formula.api import mixedlm

# Fit mixed effects model
model = mixedlm("y ~ x", data, groups=data["group"])
results = model.fit()

Regression with Discrete Dependent Variables

Logistic Regression

For binary or multinomial outcome variables.

import statsmodels.api as sm

# Binary logistic regression
logit_model = sm.Logit(y, X)
logit_results = logit_model.fit()

# Multinomial logistic regression
mnlogit_model = sm.MNLogit(y, X)
mnlogit_results = mnlogit_model.fit()

Poisson and Negative Binomial Regression

For count data.

import statsmodels.api as sm

# Poisson regression
poisson_model = sm.Poisson(y, X)
poisson_results = poisson_model.fit()

# Negative binomial regression
nb_model = sm.NegativeBinomial(y, X)
nb_results = nb_model.fit()

Time Series Analysis

Statsmodels provides comprehensive tools for time series modeling.

ARIMA Models

import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA

# Fit ARIMA model
model = ARIMA(data, order=(p, d, q))  # p: AR order, d: differencing, q: MA order
results = model.fit()
print(results.summary())

SARIMAX Models

For seasonal data with exogenous variables.

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Fit SARIMAX model
model = SARIMAX(data, 
                order=(p, d, q), 
                seasonal_order=(P, D, Q, S))
results = model.fit()

VAR Models (Vector Autoregression)

For multivariate time series.

from statsmodels.tsa.vector_ar.var_model import VAR

model = VAR(data)
results = model.fit(maxlags=15)

State Space Models

For flexible time series modeling with unobserved components.

from statsmodels.tsa.statespace.structural import UnobservedComponents

model = UnobservedComponents(data, 
                             level='local linear trend',
                             cycle=True)
results = model.fit()

Statistical Tests and Tools

Hypothesis Testing

# t-test for individual coefficients
results.t_test('variable = 0')

# F-test for multiple coefficients
results.f_test('var1 = var2 = 0')

# Wald test
results.wald_test('var1 = 0')

ANOVA

Analysis of variance for comparing groups.

from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

model = ols('y ~ C(group)', data=df).fit()
anova_results = anova_lm(model)

Contingency Tables

For categorical data analysis.

from statsmodels.stats.contingency_tables import Table

table = Table(data)
print(table.test_nominal_association())

Nonparametric Methods

For distribution-free statistical tests.

from statsmodels.sandbox.stats.multicomp import multipletests
from statsmodels.stats.nonparametric import ranksums

Advanced Modeling

Generalized Additive Models (GAM)

For non-linear relationships.

from statsmodels.gam.api import GLMGam, BSplines

bs = BSplines(X, df=[6, 6])  # Specify splines
gam_model = GLMGam(y, X, smoother=bs)
gam_results = gam_model.fit()

Generalized Method of Moments (GMM)

For models with endogenous variables.

from statsmodels.sandbox.regression.gmm import GMM

Survival and Duration Analysis

For time-to-event data.

from statsmodels.duration import survfit

Working with Formulas

Categorical Variables

Statsmodels automatically handles categorical variables when they are strings or when using the C() operator.

# Automatic categorical handling
model = smf.ols('y ~ C(category)', data=df).fit()

# Specify contrast coding
model = smf.ols('y ~ C(category, Treatment)', data=df).fit()

Interactions and Transformations

# Interaction
model = smf.ols('y ~ x1 * x2', data=df).fit()

# Polynomial terms
model = smf.ols('y ~ np.power(x, 2)', data=df).fit()

# Custom functions
def log_plus_1(x):
    return np.log(x) + 1.0
    
model = smf.ols('y ~ log_plus_1(x)', data=df).fit()

Removing Variables

# Remove intercept
model = smf.ols('y ~ x - 1', data=df).fit()

# Remove specific variable
model = smf.ols('y ~ x1 + x2 - x1', data=df).fit()  # Only x2 remains

Plotting and Visualization

Statsmodels includes plotting functions for regression diagnostics and time series analysis.

import statsmodels.api as sm
import matplotlib.pyplot as plt

# Regression plots
sm.graphics.plot_partregress_grid(results)

# Q-Q plot for normality
sm.qqplot(results.resid, line='s')

# Influence plot
sm.graphics.influence_plot(results)

# Time series plots
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(data)
plot_pacf(data)

plt.show()

Model Diagnostics

Regression Diagnostics

# Influence measures
infl = results.get_influence()
cooks_d = infl.cooks_distance

# Residual diagnostics
results.test_heteroskedasticity()
results.test_serial_correlation()
results.test_normality()

Multicollinearity

# Variance Inflation Factor
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = [variance_inflation_factor(X.values, i) 
       for i in range(X.shape[1])]

Working with Datasets

Statsmodels includes several example datasets for learning and testing.

# List available datasets
import statsmodels.api as sm
print(sm.datasets.__all__)

# Load dataset
data = sm.datasets.get_rdataset("Guerry", "HistData").data
data = sm.datasets.scotland.load()
data = sm.datasets.fair.load_pandas().data
data = sm.datasets.longley.load()

Using Patsy Directly

For models that don’t support formulas, you can use patsy to create design matrices.

import patsy

# Create design matrices
formula = 'y ~ x1 * x2'
y, X = patsy.dmatrices(formula, data, return_type='matrix')

# For pandas DataFrame output
y, X = patsy.dmatrices(formula, data, return_type='dataframe')

Best Practices

Model Selection

Use information criteria for model comparison:

  • AIC (Akaike Information Criterion): Lower is better
  • BIC (Bayesian Information Criterion): Penalizes complexity more heavily
results.aic
results.bic

Handling Missing Data

Always handle missing values before modeling:

data.dropna(inplace=True)
data.fillna(data.mean(), inplace=True)

Checking Assumptions

For OLS regression, check:

  • Linearity: Plot residuals vs fitted values
  • Normality: Q-Q plot or Shapiro-Wilk test
  • Homoscedasticity: Breusch-Pagan test
  • Independence: Durbin-Watson statistic
# Breusch-Pagan test
sm.stats.diagnostic.het_breuschpagan(results.resid, results.model.exog)

# Durbin-Watson
sm.stats.durbin_watson(results.resid)

Documentation

The results object provides comprehensive documentation:

print(results.__doc__)  # General information
dir(results)            # Available attributes and methods

Integration with Other Libraries

With Pandas

Statsmodels works seamlessly with pandas DataFrames, especially with the formula API.

With NumPy

The array API expects numpy arrays or pandas objects that can be converted.

With Matplotlib

For customizing diagnostic plots.


Citation

When using statsmodels in scientific publications, please cite:

Seabold, Skipper, and Josef Perktold. “statsmodels: Econometric and statistical modeling with python.” Proceedings of the 9th Python in Science Conference. 2010.

Bibtex entry:

@inproceedings{seabold2010statsmodels,
  title={statsmodels: Econometric and statistical modeling with python},
  author={Seabold, Skipper and Perktold, Josef},
  booktitle={9th Python in Science Conference},
  year={2010},
}
Last updated on