Introduction to Regression Coefficients
Regression coefficients are the fundamental building blocks of regression analysis. They quantify the relationship between independent variables (predictors) and a dependent variable (outcome). Understanding how to interpret these coefficients is essential for extracting meaningful insights from regression models.
Key Concepts
- Coefficient (β): Measures the change in outcome for a one-unit change in the predictor
- Standard Error: Measures the precision of the coefficient estimate
- t-statistic: Test statistic for whether the coefficient is significantly different from zero
- p-value: Probability of observing a coefficient at least as extreme as the estimate if the true effect is zero
- Confidence Interval: Range of plausible values for the true coefficient
- R-squared: Proportion of variance explained by the model
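Every quantity in this list can be computed directly from the data. The following minimal sketch (synthetic data, a single predictor plus intercept, plain NumPy/SciPy; all variable names here are illustrative) shows how the pieces of a regression table relate to each other:

```python
import numpy as np
from scipy import stats

# Synthetic data: y = 1 + 2*x + noise
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])           # design matrix with intercept
beta = np.linalg.lstsq(X, y, rcond=None)[0]    # coefficients (β)
resid = y - X @ beta
dof = n - X.shape[1]                           # residual degrees of freedom
sigma2 = resid @ resid / dof                   # residual variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))  # standard errors
t_stats = beta / se                            # t-statistics
p_vals = 2 * stats.t.sf(np.abs(t_stats), dof)  # two-sided p-values
t_crit = stats.t.ppf(0.975, dof)
ci = np.column_stack([beta - t_crit * se, beta + t_crit * se])  # 95% CIs

print("beta:", beta)
print("SE:  ", se)
print("t:   ", t_stats)
print("p:   ", p_vals)
print("CI:  ", ci)
```

These are exactly the numbers statsmodels reports in its summary table, which the next section walks through.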
1. Structure of a Regression Table
Understanding the Regression Output
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings('ignore')
# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
# Generate sample data
np.random.seed(42)
n = 200
X1 = np.random.randn(n)
X2 = np.random.randn(n)
X3 = np.random.randn(n)
# True relationship: y = 2 + 3*X1 - 1.5*X2 + 0.5*X3 + noise
y = 2 + 3*X1 - 1.5*X2 + 0.5*X3 + np.random.randn(n) * 0.5
# Create DataFrame
df = pd.DataFrame({
'y': y,
'X1': X1,
'X2': X2,
'X3': X3
})
# Fit regression model
X = df[['X1', 'X2', 'X3']]
X = sm.add_constant(X) # Add intercept
model = sm.OLS(df['y'], X).fit()
# Display regression table
print("="*80)
print("REGRESSION RESULTS")
print("="*80)
print(model.summary())
Components of a Regression Table
# Extract individual components
print("KEY COMPONENTS OF REGRESSION TABLE")
print("="*50)
print("\n1. COEFFICIENTS (β):")
print(" - Intercept (const):", round(model.params['const'], 4))
print(" - X1 coefficient:", round(model.params['X1'], 4))
print(" - X2 coefficient:", round(model.params['X2'], 4))
print(" - X3 coefficient:", round(model.params['X3'], 4))
print("\n2. STANDARD ERRORS:")
print(" - X1 std err:", round(model.bse['X1'], 4))
print(" - X2 std err:", round(model.bse['X2'], 4))
print(" - X3 std err:", round(model.bse['X3'], 4))
print("\n3. t-STATISTICS:")
print(" - X1 t-stat:", round(model.tvalues['X1'], 4))
print(" - X2 t-stat:", round(model.tvalues['X2'], 4))
print(" - X3 t-stat:", round(model.tvalues['X3'], 4))
print("\n4. p-VALUES:")
print(" - X1 p-value:", round(model.pvalues['X1'], 4))
print(" - X2 p-value:", round(model.pvalues['X2'], 4))
print(" - X3 p-value:", round(model.pvalues['X3'], 4))
print("\n5. CONFIDENCE INTERVALS (95%):")
conf_int = model.conf_int()
print(f" X1: [{conf_int.loc['X1', 0]:.4f}, {conf_int.loc['X1', 1]:.4f}]")
print(f" X2: [{conf_int.loc['X2', 0]:.4f}, {conf_int.loc['X2', 1]:.4f}]")
print(f" X3: [{conf_int.loc['X3', 0]:.4f}, {conf_int.loc['X3', 1]:.4f}]")
print("\n6. MODEL FIT:")
print(f" R-squared: {model.rsquared:.4f}")
print(f" Adjusted R-squared: {model.rsquared_adj:.4f}")
print(f" F-statistic: {model.fvalue:.2f}")
print(f" F-test p-value: {model.f_pvalue:.4f}")
2. Interpreting Coefficients
Simple Linear Regression
# Simple linear regression example
np.random.seed(42)
X_simple = np.random.randn(200)
y_simple = 5 + 2.5 * X_simple + np.random.randn(200) * 1
# Fit model
X_with_const = sm.add_constant(pd.DataFrame({'x1': X_simple}))  # named column so params['x1'] works below
model_simple = sm.OLS(y_simple, X_with_const).fit()
print("SIMPLE LINEAR REGRESSION")
print("="*40)
print(model_simple.summary())
# Interpretation
print("\nINTERPRETATION:")
print(f"y = {model_simple.params['const']:.2f} + {model_simple.params['x1']:.2f} * X")
print(f"\n• Intercept ({model_simple.params['const']:.2f}):")
print(f" When X = 0, the predicted y is {model_simple.params['const']:.2f}")
print(f"\n• Slope ({model_simple.params['x1']:.2f}):")
print(f" For every 1-unit increase in X, y increases by {model_simple.params['x1']:.2f} units")
# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_simple, y_simple, alpha=0.5, label='Data points')
plt.plot(X_simple, model_simple.fittedvalues, 'r-', linewidth=2,
label=f'Fit: y = {model_simple.params["const"]:.2f} + {model_simple.params["x1"]:.2f}X')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Multiple Linear Regression
# Multiple linear regression with interpretation
X_multi = np.random.randn(200, 3)
true_coefs = [2.5, -1.8, 3.2] # True coefficients for X1, X2, X3
y_multi = 1 + 2.5*X_multi[:,0] - 1.8*X_multi[:,1] + 3.2*X_multi[:,2] + np.random.randn(200) * 0.5
# Fit model
X_multi_const = sm.add_constant(pd.DataFrame(X_multi, columns=['x1', 'x2', 'x3']))  # named columns so params['x1'] etc. work below
model_multi = sm.OLS(y_multi, X_multi_const).fit()
print("MULTIPLE LINEAR REGRESSION")
print("="*50)
print(model_multi.summary())
print("\nINTERPRETATION OF COEFFICIENTS:")
print(f"y = {model_multi.params['const']:.2f} + {model_multi.params['x1']:.2f}*X1 + "
f"{model_multi.params['x2']:.2f}*X2 + {model_multi.params['x3']:.2f}*X3")
print("\n• Intercept: When all X variables are 0, y is predicted to be "
f"{model_multi.params['const']:.2f}")
print("\n• X1 coefficient: Holding X2 and X3 constant, a 1-unit increase in X1 "
f"is associated with a {model_multi.params['x1']:.2f} unit increase in y")
print("\n• X2 coefficient: Holding X1 and X3 constant, a 1-unit increase in X2 "
      f"is associated with a {abs(model_multi.params['x2']):.2f} unit decrease in y")
print("\n• X3 coefficient: Holding X1 and X2 constant, a 1-unit increase in X3 "
f"is associated with a {model_multi.params['x3']:.2f} unit increase in y")
3. Statistical Significance
Understanding p-values and t-statistics
# Demonstrating statistical significance
np.random.seed(42)
n = 100
# Generate variables with different significance levels
X_significant = np.random.randn(n)
X_insignificant = np.random.randn(n)
y = 2 + 1.5 * X_significant + np.random.randn(n) * 1.5
# Fit model
X_both = sm.add_constant(pd.DataFrame({'x1': X_significant, 'x2': X_insignificant}))
model_significance = sm.OLS(y, X_both).fit()
print("STATISTICAL SIGNIFICANCE ANALYSIS")
print("="*50)
# Extract results
significant_coef = model_significance.params['x1']
significant_pval = model_significance.pvalues['x1']
insignificant_coef = model_significance.params['x2']
insignificant_pval = model_significance.pvalues['x2']
print(f"\nSignificant Variable (X1):")
print(f" Coefficient: {significant_coef:.4f}")
print(f" p-value: {significant_pval:.4f}")
if significant_pval < 0.05:
    print("   ✓ Statistically significant (p < 0.05)")
print(f"\nInsignificant Variable (X2):")
print(f" Coefficient: {insignificant_coef:.4f}")
print(f" p-value: {insignificant_pval:.4f}")
if insignificant_pval > 0.05:
    print("   ✗ Not statistically significant (p > 0.05)")
# Visualize confidence intervals
coef_names = ['X1 (Significant)', 'X2 (Insignificant)']
coef_values = [significant_coef, insignificant_coef]
coef_ci_lower = [model_significance.conf_int().loc['x1', 0],
model_significance.conf_int().loc['x2', 0]]
coef_ci_upper = [model_significance.conf_int().loc['x1', 1],
model_significance.conf_int().loc['x2', 1]]
plt.figure(figsize=(10, 6))
for i, (name, val, lower, upper) in enumerate(zip(coef_names, coef_values, coef_ci_lower, coef_ci_upper)):
    plt.errorbar(val, i, xerr=[[val - lower], [upper - val]], fmt='o',
                 capsize=5, capthick=2, markersize=10)
plt.axvline(x=0, color='red', linestyle='--', alpha=0.5, label='Null effect (0)')
plt.yticks(range(len(coef_names)), coef_names)
plt.xlabel('Coefficient Value')
plt.title('95% Confidence Intervals for Coefficients')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Interpretation of Significance Levels
# Different significance levels
print("INTERPRETING p-VALUES")
print("="*40)
print("\nCommon significance levels:")
print("• p < 0.001 (***): Very strong evidence against null hypothesis")
print("• p < 0.01 (**): Strong evidence against null hypothesis")
print("• p < 0.05 (*): Moderate evidence against null hypothesis")
print("• p < 0.10 (.): Weak evidence against null hypothesis")
print("• p > 0.10: No significant evidence against null hypothesis")
print("\nSignificance stars in regression output:")
print("*** p<0.001, ** p<0.01, * p<0.05, . p<0.10")
# Generate example with different significance levels
np.random.seed(42)
n = 100
X_very_sig = np.random.randn(n)
X_sig = np.random.randn(n)
X_margin = np.random.randn(n)
X_not_sig = np.random.randn(n)
# Different effect sizes
y = (2 + 2.5*X_very_sig + 1.2*X_sig + 0.6*X_margin +
0.1*X_not_sig + np.random.randn(n))
X_all = sm.add_constant(pd.DataFrame(np.column_stack([X_very_sig, X_sig, X_margin, X_not_sig]),
                                     columns=['x1', 'x2', 'x3', 'x4']))
model_stars = sm.OLS(y, X_all).fit()
print("\nExample with different significance levels:")
for i, var in enumerate(['X_very_sig', 'X_sig', 'X_margin', 'X_not_sig']):
    pval = model_stars.pvalues[f'x{i+1}']
    coef = model_stars.params[f'x{i+1}']
    if pval < 0.001:
        stars = "***"
    elif pval < 0.01:
        stars = "**"
    elif pval < 0.05:
        stars = "*"
    elif pval < 0.10:
        stars = "."
    else:
        stars = " "
    print(f"{var:15} coef = {coef:6.3f}, p = {pval:.4f} {stars}")
4. Standardized Coefficients
Calculating and Interpreting Beta Weights
from sklearn.preprocessing import StandardScaler
# Standardized coefficients for comparing variable importance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_multi)
y_scaled = (y_multi - np.mean(y_multi)) / np.std(y_multi)
# Fit model with standardized variables
X_scaled_const = sm.add_constant(pd.DataFrame(X_scaled, columns=['x1', 'x2', 'x3']))
model_standardized = sm.OLS(y_scaled, X_scaled_const).fit()
print("STANDARDIZED COEFFICIENTS (Beta Weights)")
print("="*50)
print(model_standardized.summary())
print("\nStandardized coefficients allow comparison of variable importance:")
for i, var in enumerate(['X1', 'X2', 'X3']):
    std_coef = model_standardized.params[f'x{i+1}']
    print(f"{var}: β = {std_coef:.4f}")
print("\nInterpretation:")
print("• Standardized coefficients represent change in y (in standard deviations)")
print(" for a 1-standard-deviation change in X")
print("• |β| > 0.1: Small effect")
print("• |β| > 0.3: Medium effect")
print("• |β| > 0.5: Large effect")
# Visualize standardized coefficients
plt.figure(figsize=(8, 6))
coefs = model_standardized.params[1:].values # Exclude intercept
variables = ['X1', 'X2', 'X3']
colors = ['green' if c > 0 else 'red' for c in coefs]
plt.barh(variables, coefs, color=colors)
plt.axvline(x=0, color='black', linestyle='-', alpha=0.5)
plt.xlabel('Standardized Coefficient (Beta)')
plt.title('Variable Importance (Standardized Coefficients)')
plt.grid(True, alpha=0.3)
plt.show()
5. Confidence Intervals
Understanding and Using Confidence Intervals
# Generate data with known true coefficients
np.random.seed(42)
n = 100
X = np.random.randn(n, 2)
true_coefs = [1.5, 2.0]
y = 1 + 1.5*X[:,0] + 2.0*X[:,1] + np.random.randn(n) * 1
# Fit model
X_const = sm.add_constant(pd.DataFrame(X, columns=['x1', 'x2']))
model_ci = sm.OLS(y, X_const).fit()
print("CONFIDENCE INTERVALS")
print("="*40)
print(model_ci.summary())
# Extract confidence intervals
ci_95 = model_ci.conf_int(alpha=0.05)
ci_90 = model_ci.conf_int(alpha=0.10)
ci_99 = model_ci.conf_int(alpha=0.01)
print("\n95% Confidence Intervals:")
for var in ['const', 'x1', 'x2']:
    lower, upper = ci_95.loc[var]
    print(f"{var}: [{lower:.4f}, {upper:.4f}]")
print("\nInterpretation of 95% Confidence Intervals:")
print("If we repeated this study many times, 95% of the confidence intervals")
print("would contain the true population parameter.")
# Visualize confidence intervals
plt.figure(figsize=(10, 6))
coef_names = ['Intercept', 'X1', 'X2']
coef_values = model_ci.params
errors = model_ci.bse
for i, (name, val, err) in enumerate(zip(coef_names, coef_values, errors)):
    plt.errorbar(val, i, xerr=1.96*err, fmt='o', capsize=5,
                 capthick=2, markersize=10, label='95% CI' if i == 0 else "")
    plt.errorbar(val, i, xerr=1.645*err, fmt='o', capsize=5,
                 capthick=1, markersize=8, color='gray', alpha=0.5, label='90% CI' if i == 0 else "")
plt.axvline(x=0, color='red', linestyle='--', alpha=0.5)
plt.yticks(range(len(coef_names)), coef_names)
plt.xlabel('Coefficient Value')
plt.title('Confidence Intervals for Coefficients')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
6. Real-World Examples
Example 1: Housing Price Prediction
# Simulating housing data
np.random.seed(42)
n_houses = 500
sqft = np.random.normal(2000, 500, n_houses)
bedrooms = np.random.randint(1, 5, n_houses)
age = np.random.randint(0, 50, n_houses)
location_score = np.random.uniform(1, 10, n_houses)
# True relationship
price = (50000 + 150 * sqft + 20000 * bedrooms - 1000 * age +
5000 * location_score + np.random.randn(n_houses) * 20000)
# Create DataFrame
housing_df = pd.DataFrame({
'sqft': sqft,
'bedrooms': bedrooms,
'age': age,
'location_score': location_score,
'price': price
})
# Fit model
X_housing = sm.add_constant(housing_df[['sqft', 'bedrooms', 'age', 'location_score']])
model_housing = sm.OLS(housing_df['price'], X_housing).fit()
print("HOUSING PRICE PREDICTION MODEL")
print("="*50)
print(model_housing.summary())
print("\nINTERPRETATION:")
print(f"• Intercept (${model_housing.params['const']:,.0f}):")
print("  Base price when all features are zero (not meaningful in this context)")
print(f"• Sqft: each additional square foot adds ${model_housing.params['sqft']:.0f} to the price")
print(f"• Bedrooms: each additional bedroom adds ${model_housing.params['bedrooms']:,.0f}")
print(f"• Age: each additional year of age reduces the price by ${abs(model_housing.params['age']):,.0f}")
print(f"• Location Score: each additional point adds ${model_housing.params['location_score']:,.0f}")
# Predict for a sample house
sample_house = pd.DataFrame({
'const': 1,
'sqft': [2500],
'bedrooms': [4],
'age': [10],
'location_score': [8]
})
predicted_price = model_housing.predict(sample_house)[0]
print(f"\nSample house (2500 sqft, 4 beds, 10 years old, location 8):")
print(f"Predicted price: ${predicted_price:,.0f}")
Example 2: Marketing Campaign Effectiveness
# Simulating marketing data
np.random.seed(42)
n_campaigns = 200
ad_spend = np.random.uniform(1000, 10000, n_campaigns)
social_media = np.random.uniform(500, 5000, n_campaigns)
email_sends = np.random.uniform(1000, 50000, n_campaigns)
competitor_activity = np.random.uniform(0, 100, n_campaigns)
# Sales (in thousands)
sales = (50 + 0.3 * ad_spend + 0.5 * social_media +
0.05 * email_sends - 0.2 * competitor_activity +
np.random.randn(n_campaigns) * 10)
# Create DataFrame
marketing_df = pd.DataFrame({
'ad_spend': ad_spend,
'social_media': social_media,
'email_sends': email_sends,
'competitor_activity': competitor_activity,
'sales': sales
})
# Fit model
X_marketing = sm.add_constant(marketing_df[['ad_spend', 'social_media', 'email_sends', 'competitor_activity']])
model_marketing = sm.OLS(marketing_df['sales'], X_marketing).fit()
print("MARKETING CAMPAIGN EFFECTIVENESS")
print("="*50)
print(model_marketing.summary())
print("\nMARGINAL RETURN ANALYSIS (sales are in thousands):")
print(f"• Ad Spend: {model_marketing.params['ad_spend']:.3f} per $1 spent")
print(f"• Social Media: {model_marketing.params['social_media']:.3f} per $1 spent")
print(f"• Email: {model_marketing.params['email_sends'] * 1000:.2f} per 1,000 emails sent")
print("\nCOMPETITIVE EFFECT:")
print(f"• Each unit increase in competitor activity is associated with a "
      f"{abs(model_marketing.params['competitor_activity']):.2f} drop in sales")
# Compare the per-dollar channels to guide budget allocation
print("\nCHANNEL COMPARISON (return per $1):")
ad_effect = model_marketing.params['ad_spend']
social_effect = model_marketing.params['social_media']
print(f"• Ad spend: {ad_effect:.3f}")
print(f"• Social media: {social_effect:.3f}")
better = 'Social media' if social_effect > ad_effect else 'Ad spend'
print(f"→ {better} yields the higher marginal return per dollar")
Example 3: Medical Study Analysis
# Simulating medical study data
np.random.seed(42)
n_patients = 300
# Treatment (1 = treatment, 0 = control)
treatment = np.random.binomial(1, 0.5, n_patients)
age = np.random.randint(25, 80, n_patients)
bmi = np.random.uniform(18, 35, n_patients)
smoking = np.random.binomial(1, 0.3, n_patients)
# Recovery score (0-100)
recovery = (50 + 15 * treatment - 0.5 * (age - 50) -
0.8 * (bmi - 25) - 10 * smoking +
np.random.randn(n_patients) * 8)
medical_df = pd.DataFrame({
'treatment': treatment,
'age': age,
'bmi': bmi,
'smoking': smoking,
'recovery': recovery
})
# Fit model
X_medical = sm.add_constant(medical_df[['treatment', 'age', 'bmi', 'smoking']])
model_medical = sm.OLS(medical_df['recovery'], X_medical).fit()
print("MEDICAL STUDY ANALYSIS")
print("="*50)
print(model_medical.summary())
print("\nCLINICAL INTERPRETATION:")
print(f"• Treatment Effect: patients receiving treatment score "
      f"{model_medical.params['treatment']:.2f} points higher on recovery")
print(f"• Age: each additional year of age changes recovery by {model_medical.params['age']:.2f} points")
print(f"• BMI: each additional BMI unit changes recovery by {model_medical.params['bmi']:.2f} points")
print(f"• Smoking: smokers score {abs(model_medical.params['smoking']):.2f} points lower on average")
# Calculate number needed to treat (NNT)
print("\nNUMBER NEEDED TO TREAT (NNT):")
treatment_effect = model_medical.params['treatment']
treatment_success_rate = np.mean(medical_df[medical_df['treatment']==1]['recovery'] > 70)
control_success_rate = np.mean(medical_df[medical_df['treatment']==0]['recovery'] > 70)
print(f"Treatment success rate: {treatment_success_rate:.1%}")
print(f"Control success rate: {control_success_rate:.1%}")
print(f"Absolute risk reduction: {treatment_success_rate - control_success_rate:.1%}")
print(f"Number needed to treat: {1/(treatment_success_rate - control_success_rate):.0f}")
7. Advanced Coefficient Interpretations
Interaction Terms
# Demonstrating interaction effects
np.random.seed(42)
n = 200
X1 = np.random.randn(n)
X2 = np.random.randn(n)
# True relationship with interaction
y = 2 + 1.5*X1 + 1.2*X2 + 0.8*X1*X2 + np.random.randn(n) * 0.5
# Fit models with and without interaction
X_main = sm.add_constant(pd.DataFrame({'x1': X1, 'x2': X2}))
model_main = sm.OLS(y, X_main).fit()
X_interaction = sm.add_constant(pd.DataFrame({'x1': X1, 'x2': X2, 'x3': X1 * X2}))  # x3 is the X1*X2 interaction
model_interaction = sm.OLS(y, X_interaction).fit()
print("INTERACTION EFFECTS")
print("="*50)
print("Model without interaction:")
print(model_main.summary())
print("\nModel with interaction term:")
print(model_interaction.summary())
print("\nINTERPRETATION OF INTERACTION:")
print(f"y = {model_interaction.params['const']:.2f} + "
f"{model_interaction.params['x1']:.2f}*X1 + "
f"{model_interaction.params['x2']:.2f}*X2 + "
f"{model_interaction.params['x3']:.2f}*X1*X2")
print("\nThe interaction term means the effect of X1 on y depends on X2:")
print(f"When X2 = 0, effect of X1 = {model_interaction.params['x1']:.2f}")
print(f"When X2 = 1, effect of X1 = {model_interaction.params['x1'] + model_interaction.params['x3']:.2f}")
# Visualize interaction
plt.figure(figsize=(10, 6))
X2_low = X2 < -0.5
X2_high = X2 > 0.5
plt.scatter(X1[X2_low], y[X2_low], alpha=0.6, label='X2 Low')
plt.scatter(X1[X2_high], y[X2_high], alpha=0.6, label='X2 High')
# Fit lines for different X2 values
for X2_val, label, color in [(-1, 'X2 = -1', 'blue'), (1, 'X2 = 1', 'red')]:
    y_pred = (model_interaction.params['const'] +
              model_interaction.params['x1'] * X1 +
              model_interaction.params['x2'] * X2_val +
              model_interaction.params['x3'] * X1 * X2_val)
    plt.plot(X1, y_pred, color=color, linestyle='--', linewidth=2, label=label)
plt.xlabel('X1')
plt.ylabel('y')
plt.title('Interaction Effect: Effect of X1 depends on X2')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Polynomial Terms
# Demonstrating polynomial terms
np.random.seed(42)
n = 200
X = np.random.uniform(-3, 3, n)
# Quadratic relationship
y = 2 + 1.5*X - 0.5*X**2 + np.random.randn(n) * 0.5
# Fit linear and polynomial models
X_linear = sm.add_constant(pd.DataFrame({'x1': X}))
model_linear = sm.OLS(y, X_linear).fit()
X_poly = sm.add_constant(pd.DataFrame({'x1': X, 'x2': X**2}))  # 'x2' holds the squared term
model_poly = sm.OLS(y, X_poly).fit()
print("POLYNOMIAL TERMS")
print("="*40)
print(f"Linear model R²: {model_linear.rsquared:.4f}")
print(f"Polynomial model R²: {model_poly.rsquared:.4f}")
print("\nCoefficients:")
print(f"Intercept: {model_poly.params['const']:.4f}")
print(f"X: {model_poly.params['x1']:.4f}")
print(f"X²: {model_poly.params['x2']:.4f}")
print("\nInterpretation:")
print("The positive coefficient for X and negative for X² indicates")
print("an inverted U-shaped relationship (diminishing returns).")
# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.5, label='Data')
# Sort X for smooth line
X_sorted = np.sort(X)
y_linear_pred = model_linear.predict(sm.add_constant(X_sorted))
y_poly_pred = model_poly.predict(sm.add_constant(np.column_stack([X_sorted, X_sorted**2])))
plt.plot(X_sorted, y_linear_pred, 'g--', linewidth=2, label='Linear fit')
plt.plot(X_sorted, y_poly_pred, 'r-', linewidth=2, label='Quadratic fit')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression: Capturing Non-linear Relationships')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
8. Common Pitfalls in Coefficient Interpretation
Pitfall 1: Overinterpreting Insignificant Coefficients
# Example of overinterpretation
np.random.seed(42)
n = 100
X = np.random.randn(n)
y = 2 + 0.1*X + np.random.randn(n) * 2 # Very weak effect
X_const = sm.add_constant(pd.DataFrame({'x1': X}))
model_weak = sm.OLS(y, X_const).fit()
print("PITFALL 1: OVERINTERPRETING INSIGNIFICANT COEFFICIENTS")
print("="*60)
print(model_weak.summary())
ci_weak = model_weak.conf_int().loc['x1']
print(f"\nThe coefficient is {model_weak.params['x1']:.2f}, but the p-value is {model_weak.pvalues['x1']:.3f}.")
print("We cannot conclude there's a meaningful relationship.")
print(f"The wide confidence interval [{ci_weak[0]:.2f}, {ci_weak[1]:.2f}] includes zero.")
Pitfall 2: Ignoring Multicollinearity
# Multicollinearity example
np.random.seed(42)
n = 100
X1 = np.random.randn(n)
X2 = X1 + np.random.randn(n) * 0.1 # Highly correlated with X1
X3 = np.random.randn(n)
y = 2 + 1.5*X1 + 1.4*X2 + 1.2*X3 + np.random.randn(n) * 0.5
X_multicoll = sm.add_constant(pd.DataFrame({'x1': X1, 'x2': X2, 'x3': X3}))
model_multicoll = sm.OLS(y, X_multicoll).fit()
print("PITFALL 2: MULTICOLLINEARITY")
print("="*50)
print(model_multicoll.summary())
print("\nNotice how coefficients for X1 and X2 are unstable:")
print(f"X1 coefficient: {model_multicoll.params['x1']:.4f}")
print(f"X2 coefficient: {model_multicoll.params['x2']:.4f}")
print("\nLarge standard errors and unstable estimates indicate multicollinearity.")
# Calculate VIF (Variance Inflation Factor)
def calculate_vif(X):
    """VIF for each column: regress it on the other columns (with an intercept)."""
    vif = {}
    for i in range(X.shape[1]):
        others = sm.add_constant(np.delete(X, i, axis=1))  # intercept needed for a valid R²
        model_vif = sm.OLS(X[:, i], others).fit()
        vif[f'X{i+1}'] = 1 / (1 - model_vif.rsquared)
    return vif
X_vals = np.column_stack([X1, X2, X3])
vif_values = calculate_vif(X_vals)
print("\nVariance Inflation Factors (VIF):")
for var, vif in vif_values.items():
    print(f"{var}: {vif:.2f}")
    if vif > 10:
        print("   ⚠ High multicollinearity (VIF > 10)")
Pitfall 3: Extrapolation Beyond Data Range
# Extrapolation warning
np.random.seed(42)
X = np.linspace(0, 10, 100)
y = 2 + 1.5*X + np.random.randn(100) * 2
X_const = sm.add_constant(pd.DataFrame({'x1': X}))
model_extrap = sm.OLS(y, X_const).fit()
print("PITFALL 3: EXTRAPOLATION BEYOND DATA RANGE")
print("="*55)
print(f"Data range: X in [{X.min():.1f}, {X.max():.1f}]")
print(f"Predicted at X=20: {model_extrap.params['const'] + model_extrap.params['x1']*20:.2f}")
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.5, label='Data')
plt.plot(X, model_extrap.fittedvalues, 'b-', linewidth=2, label='Fit')
# Extrapolation
X_extrap = np.linspace(10, 20, 20)
y_extrap = model_extrap.params['const'] + model_extrap.params['x1'] * X_extrap
plt.plot(X_extrap, y_extrap, 'r--', linewidth=2, label='Extrapolation')
plt.axvline(x=10, color='green', linestyle=':', label='Data boundary')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Warning: Extrapolation Beyond Data Range')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print("\nPredictions outside the range of the data are unreliable!")
9. Best Practices for Coefficient Interpretation
Comprehensive Checklist
def coefficient_interpretation_checklist(model, X_names, y_name):
    """
    Comprehensive checklist for interpreting regression coefficients.
    X_names must match the parameter names in the fitted model.
    """
    print("="*70)
    print("COEFFICIENT INTERPRETATION CHECKLIST")
    print("="*70)
    # 1. Check overall model fit
    print("\n1. MODEL FIT:")
    print(f"   R-squared: {model.rsquared:.4f}")
    print(f"   Adjusted R-squared: {model.rsquared_adj:.4f}")
    print(f"   F-statistic: {model.fvalue:.2f} (p = {model.f_pvalue:.4f})")
    # 2. Interpret each coefficient
    print("\n2. COEFFICIENT INTERPRETATIONS:")
    for var in X_names:
        coef = model.params[var]
        std_err = model.bse[var]
        pval = model.pvalues[var]
        ci_low, ci_high = model.conf_int().loc[var]
        print(f"\n   {var}:")
        print(f"   • Coefficient: {coef:.4f}")
        print(f"   • Standard Error: {std_err:.4f}")
        print(f"   • 95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
        print(f"   • p-value: {pval:.4f}")
        if pval < 0.05:
            print("   ✓ Statistically significant (p < 0.05)")
            print(f"   • Interpretation: A 1-unit increase in {var} is associated with")
            print(f"     a {coef:.4f} unit change in {y_name}, holding other variables constant")
        else:
            print("   ✗ Not statistically significant (p > 0.05)")
            print(f"   • No strong evidence for a linear relationship with {y_name}")
    # 3. Check for practical significance
    print("\n3. PRACTICAL SIGNIFICANCE:")
    print("   Consider the magnitude of coefficients in the context of the problem:")
    print("   • Small coefficients may still be meaningful in some domains")
    print("   • Large coefficients may be trivial in others")
    # 4. Check for multicollinearity
    print("\n4. MULTICOLLINEARITY CHECK:")
    print("   Consider correlation between predictors (e.g. VIFs, correlation matrix)")
    # 5. Check for influential points
    print("\n5. INFLUENTIAL POINTS:")
    influence = model.get_influence()
    cooks_d = influence.cooks_distance[0]
    high_influence = np.sum(cooks_d > 4 / len(cooks_d))
    print(f"   Cook's distance: {high_influence} influential points detected")
    # 6. Check residuals
    print("\n6. RESIDUAL CHECKS:")
    residuals = model.resid
    print(f"   Mean residual: {np.mean(residuals):.6f}")
    print(f"   Residual standard deviation: {np.std(residuals):.4f}")
    # 7. Summary
    print("\n7. SUMMARY:")
    print("   • Correlation does not imply causation")
    print("   • Interpret coefficients in context")
    print("   • Consider both statistical and practical significance")
    print("   • Be cautious with extrapolation")
    print("   • Check model assumptions")

# Example usage: the parameter names in model_multi are 'x1', 'x2', 'x3'
coefficient_interpretation_checklist(model_multi, ['x1', 'x2', 'x3'], 'y')
10. Visualizing Coefficient Relationships
Coefficient Plot
def plot_coefficients(model, X_names):
    """
    Create a coefficient plot with confidence intervals.
    """
    params = model.params[1:]       # Exclude intercept
    ci = model.conf_int()[1:]       # Exclude intercept
    fig, ax = plt.subplots(figsize=(10, 6))
    y_pos = np.arange(len(X_names))
    ax.errorbar(params, y_pos,
                xerr=[params - ci[0], ci[1] - params],
                fmt='o', capsize=5, capthick=2, markersize=8)
    ax.axvline(x=0, color='red', linestyle='--', alpha=0.5)
    ax.set_yticks(y_pos)
    ax.set_yticklabels(X_names)
    ax.set_xlabel('Coefficient Value')
    ax.set_title('Coefficient Plot with 95% Confidence Intervals')
    ax.grid(True, alpha=0.3)
    # Add value labels
    for i, param in enumerate(params):
        ax.text(param, i, f'  {param:.3f}', va='center')
    return fig

# Create coefficient plot
plot_coefficients(model_multi, ['X1', 'X2', 'X3'])
plt.show()
Predicted vs Actual Plot
def plot_predictions(model, y_true, y_pred, y_name):
    """
    Plot predicted vs actual values and the residuals.
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    # Predicted vs Actual
    ax1.scatter(y_true, y_pred, alpha=0.5)
    ax1.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()],
             'r--', linewidth=2)
    ax1.set_xlabel('Actual Values')
    ax1.set_ylabel('Predicted Values')
    ax1.set_title(f'Predicted vs Actual {y_name}')
    ax1.grid(True, alpha=0.3)
    # Residuals
    residuals = y_true - y_pred
    ax2.scatter(y_pred, residuals, alpha=0.5)
    ax2.axhline(y=0, color='r', linestyle='--', linewidth=2)
    ax2.set_xlabel('Predicted Values')
    ax2.set_ylabel('Residuals')
    ax2.set_title('Residual Plot')
    ax2.grid(True, alpha=0.3)
    plt.tight_layout()
    return fig

# Create predictions
y_pred = model_multi.predict(X_multi_const)
plot_predictions(model_multi, y_multi, y_pred, 'y')
plt.show()
Conclusion
Understanding regression coefficients is essential for extracting meaningful insights from data:
Key Takeaways
- Coefficient Direction: Positive = increase in outcome, Negative = decrease in outcome
- Coefficient Magnitude: Size indicates strength of relationship (in original units)
- Statistical Significance: p < 0.05 suggests the observed relationship would be unlikely if the true effect were zero
- Confidence Intervals: Range of plausible values for the true coefficient
- Standardized Coefficients: Allow comparison of variable importance
- Interaction Terms: Show how relationships change with other variables
Interpretation Checklist
- [ ] Check overall model fit (R², F-test)
- [ ] Examine coefficient signs (expected direction?)
- [ ] Assess statistical significance (p-values)
- [ ] Consider practical significance (magnitude)
- [ ] Review confidence intervals
- [ ] Check for multicollinearity
- [ ] Examine residuals for assumptions
- [ ] Consider context and domain knowledge
Common Mistakes to Avoid
❌ Overinterpreting insignificant coefficients
❌ Ignoring multicollinearity
❌ Extrapolating beyond data range
❌ Confusing correlation with causation
❌ Misinterpreting interaction terms
❌ Ignoring practical significance
Remember: Regression coefficients provide powerful insights, but they must be interpreted carefully, with attention to statistical assumptions and domain context. Always visualize your data and check model diagnostics before drawing conclusions!