Statsmodels
Statsmodels is a foundational Python library for statistical modeling, hypothesis testing, and data exploration. It complements machine learning libraries like scikit-learn by focusing on statistical inference, providing p-values, confidence intervals, and detailed summaries for model results. This tutorial provides a comprehensive guide to using statsmodels for various statistical analyses.
Installation and Setup
Installing Statsmodels
pip install statsmodelsOr with conda:
conda install statsmodelsImporting Statsmodels
Statsmodels is typically imported with two main entry points:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pdThe statsmodels.api module provides functions that use arrays and matrices, while statsmodels.formula.api provides functions that use R-style formula strings and pandas DataFrames.
Core Concepts
Endogenous and Exogenous Variables
Statsmodels distinguishes between:
- Endogenous variables (endog): The dependent/response variable(s) you want to explain
- Exogenous variables (exog): The independent/predictor variable(s) you use for explanation
Two Approaches to Model Specification
Approach 1: Using formulas (R-style)
import statsmodels.formula.api as smf
# Load example data
dat = sm.datasets.get_rdataset("Guerry", "HistData").data
# Fit model using formula
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()
print(results.summary())Approach 2: Using arrays (NumPy-style)
import statsmodels.api as sm
# Generate data
nobs = 100
X = np.random.random((nobs, 2))
X = sm.add_constant(X) # Add intercept term
beta = [1, 0.1, 0.5]
e = np.random.random(nobs)
y = np.dot(X, beta) + e
# Fit model
results = sm.OLS(y, X).fit()
print(results.summary())Benefits of Formula Approach
The formula approach offers several advantages:
- Intuitive specification using R-like syntax
- Automatic handling of categorical variables
- Support for transformations and interactions
- Integration with pandas DataFrames
# Formula syntax examples
'Lottery ~ Literacy + Wealth + Region' # Main effects
'Lottery ~ Literacy * Wealth' # Interaction term
'Lottery ~ Literacy + Wealth - 1' # Remove intercept
'Lottery ~ np.log(Literacy)' # Apply functionRegression and Linear Models
Ordinary Least Squares (OLS)
OLS is the foundational regression method for modeling the relationship between a continuous dependent variable and one or more independent variables.
Basic OLS Example:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
# Load example dataset
dat = sm.datasets.get_rdataset("Guerry", "HistData").data
dat = dat[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
# Fit OLS model
model = smf.ols('Lottery ~ Literacy + Wealth + Region', data=dat)
results = model.fit()
# Summary output
print(results.summary())The summary output provides:
- Coefficients (coef): Estimated regression coefficients
- Standard errors (std err): Measure of coefficient estimate precision
- t-statistics and p-values: Significance tests for individual coefficients
- R-squared: Proportion of variance explained
- F-statistic: Overall model significance test
- AIC/BIC: Model selection criteria
Key attributes and methods from results object:
# View available results
dir(results)
# Access specific results
results.params # Coefficients
results.pvalues # P-values
results.rsquared # R-squared
results.fvalue # F-statistic
results.resid # Residuals
results.fittedvalues # Predicted valuesGeneralized Linear Models (GLM)
GLMs extend linear regression to response variables with non-normal distributions through link functions.
Supported Families:
- Gaussian (normal) - identity link
- Binomial - logit link (logistic regression)
- Poisson - log link (count data)
- Gamma - inverse power link
- Inverse Gaussian
GLM Example (Gamma family):
import statsmodels.api as sm
# Load data
data = sm.datasets.scotland.load()
data.exog = sm.add_constant(data.exog)
# Fit GLM with Gamma family
gamma_model = sm.GLM(data.endog, data.exog, family=sm.families.Gamma())
gamma_results = gamma_model.fit()
print(gamma_results.summary())GLM Example (Poisson family for count data):
import statsmodels.api as sm
# Load affairs dataset
data = sm.datasets.fair.load_pandas().data
data["affairs"] = np.ceil(data["affairs"]) # Convert to count
# Fit Poisson GLM
poisson_model = sm.GLM(data["affairs"],
data[["rate_marriage", "age", "yrs_married"]],
family=sm.families.Poisson())
poisson_results = poisson_model.fit()Robust Linear Models
For data with outliers or heteroskedasticity, robust regression methods provide more reliable estimates.
import statsmodels.api as sm
from statsmodels.robust.robust_linear_model import RLM
# Fit robust linear model
rlm_model = sm.RLM(y, X)
rlm_results = rlm_model.fit()Linear Mixed Effects Models
For hierarchical or grouped data, mixed effects models account for both fixed and random effects.
import statsmodels.api as sm
from statsmodels.formula.api import mixedlm
# Fit mixed effects model
model = mixedlm("y ~ x", data, groups=data["group"])
results = model.fit()Regression with Discrete Dependent Variables
Logistic Regression
For binary or multinomial outcome variables.
import statsmodels.api as sm
# Binary logistic regression
logit_model = sm.Logit(y, X)
logit_results = logit_model.fit()
# Multinomial logistic regression
mnlogit_model = sm.MNLogit(y, X)
mnlogit_results = mnlogit_model.fit()Poisson and Negative Binomial Regression
For count data.
import statsmodels.api as sm
# Poisson regression
poisson_model = sm.Poisson(y, X)
poisson_results = poisson_model.fit()
# Negative binomial regression
nb_model = sm.NegativeBinomial(y, X)
nb_results = nb_model.fit()Time Series Analysis
Statsmodels provides comprehensive tools for time series modeling.
ARIMA Models
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
# Fit ARIMA model
model = ARIMA(data, order=(p, d, q)) # p: AR order, d: differencing, q: MA order
results = model.fit()
print(results.summary())SARIMAX Models
For seasonal data with exogenous variables.
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Fit SARIMAX model
model = SARIMAX(data,
order=(p, d, q),
seasonal_order=(P, D, Q, S))
results = model.fit()VAR Models (Vector Autoregression)
For multivariate time series.
from statsmodels.tsa.vector_ar.var_model import VAR
model = VAR(data)
results = model.fit(maxlags=15)State Space Models
For flexible time series modeling with unobserved components.
from statsmodels.tsa.statespace.structural import UnobservedComponents
model = UnobservedComponents(data,
level='local linear trend',
cycle=True)
results = model.fit()Statistical Tests and Tools
Hypothesis Testing
# t-test for individual coefficients
results.t_test('variable = 0')
# F-test for multiple coefficients
results.f_test('var1 = var2 = 0')
# Wald test
results.wald_test('var1 = 0')ANOVA
Analysis of variance for comparing groups.
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
model = ols('y ~ C(group)', data=df).fit()
anova_results = anova_lm(model)Contingency Tables
For categorical data analysis.
from statsmodels.stats.contingency_tables import Table
table = Table(data)
print(table.test_nominal_association())Nonparametric Methods
For distribution-free statistical tests.
from statsmodels.sandbox.stats.multicomp import multipletests
from statsmodels.stats.nonparametric import ranksumsAdvanced Modeling
Generalized Additive Models (GAM)
For non-linear relationships.
from statsmodels.gam.api import GLMGam, BSplines
bs = BSplines(X, df=[6, 6]) # Specify splines
gam_model = GLMGam(y, X, smoother=bs)
gam_results = gam_model.fit()Generalized Method of Moments (GMM)
For models with endogenous variables.
from statsmodels.sandbox.regression.gmm import GMMSurvival and Duration Analysis
For time-to-event data.
from statsmodels.duration import survfitWorking with Formulas
Categorical Variables
Statsmodels automatically handles categorical variables when they are strings or when using the C() operator.
# Automatic categorical handling
model = smf.ols('y ~ C(category)', data=df).fit()
# Specify contrast coding
model = smf.ols('y ~ C(category, Treatment)', data=df).fit()Interactions and Transformations
# Interaction
model = smf.ols('y ~ x1 * x2', data=df).fit()
# Polynomial terms
model = smf.ols('y ~ np.power(x, 2)', data=df).fit()
# Custom functions
def log_plus_1(x):
return np.log(x) + 1.0
model = smf.ols('y ~ log_plus_1(x)', data=df).fit()Removing Variables
# Remove intercept
model = smf.ols('y ~ x - 1', data=df).fit()
# Remove specific variable
model = smf.ols('y ~ x1 + x2 - x1', data=df).fit() # Only x2 remainsPlotting and Visualization
Statsmodels includes plotting functions for regression diagnostics and time series analysis.
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Regression plots
sm.graphics.plot_partregress_grid(results)
# Q-Q plot for normality
sm.qqplot(results.resid, line='s')
# Influence plot
sm.graphics.influence_plot(results)
# Time series plots
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(data)
plot_pacf(data)
plt.show()Model Diagnostics
Regression Diagnostics
# Influence measures
infl = results.get_influence()
cooks_d = infl.cooks_distance
# Residual diagnostics
results.test_heteroskedasticity()
results.test_serial_correlation()
results.test_normality()Multicollinearity
# Variance Inflation Factor
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X.values, i)
for i in range(X.shape[1])]Working with Datasets
Statsmodels includes several example datasets for learning and testing.
# List available datasets
import statsmodels.api as sm
print(sm.datasets.__all__)
# Load dataset
data = sm.datasets.get_rdataset("Guerry", "HistData").data
data = sm.datasets.scotland.load()
data = sm.datasets.fair.load_pandas().data
data = sm.datasets.longley.load()Using Patsy Directly
For models that don’t support formulas, you can use patsy to create design matrices.
import patsy
# Create design matrices
formula = 'y ~ x1 * x2'
y, X = patsy.dmatrices(formula, data, return_type='matrix')
# For pandas DataFrame output
y, X = patsy.dmatrices(formula, data, return_type='dataframe')Best Practices
Model Selection
Use information criteria for model comparison:
- AIC (Akaike Information Criterion): Lower is better
- BIC (Bayesian Information Criterion): Penalizes complexity more heavily
results.aic
results.bicHandling Missing Data
Always handle missing values before modeling:
data.dropna(inplace=True)
data.fillna(data.mean(), inplace=True)Checking Assumptions
For OLS regression, check:
- Linearity: Plot residuals vs fitted values
- Normality: Q-Q plot or Shapiro-Wilk test
- Homoscedasticity: Breusch-Pagan test
- Independence: Durbin-Watson statistic
# Breusch-Pagan test
sm.stats.diagnostic.het_breuschpagan(results.resid, results.model.exog)
# Durbin-Watson
sm.stats.durbin_watson(results.resid)Documentation
The results object provides comprehensive documentation:
print(results.__doc__) # General information
dir(results) # Available attributes and methodsIntegration with Other Libraries
With Pandas
Statsmodels works seamlessly with pandas DataFrames, especially with the formula API.
With NumPy
The array API expects numpy arrays or pandas objects that can be converted.
With Matplotlib
For customizing diagnostic plots.
Citation
When using statsmodels in scientific publications, please cite:
Seabold, Skipper, and Josef Perktold. “statsmodels: Econometric and statistical modeling with python.” Proceedings of the 9th Python in Science Conference. 2010.
Bibtex entry:
@inproceedings{seabold2010statsmodels,
title={statsmodels: Econometric and statistical modeling with python},
author={Seabold, Skipper and Perktold, Josef},
booktitle={9th Python in Science Conference},
year={2010},
}