Complete Guide to Regression Tables and P-Values

Introduction to Regression Tables and P-Values

Regression tables are the cornerstone of statistical modeling in data science, providing a comprehensive summary of model results. At the heart of these tables lies the p-value—a critical statistic that helps determine whether observed relationships are statistically significant or merely due to random chance. Understanding how to interpret regression tables and p-values is essential for building reliable models and drawing valid conclusions from data.

Key Concepts

  • Coefficients: Estimated effect sizes for each predictor
  • Standard Errors: Measure of coefficient estimate precision
  • t-statistics: Ratio of coefficient to its standard error
  • p-values: Probability of observing results if null hypothesis is true
  • Confidence Intervals: Range of plausible values for coefficients
  • R-squared: Proportion of variance explained by the model
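As a minimal sketch of how the coefficient, standard error, t-statistic, and p-value columns relate to each other (the numbers below are made up for illustration, not taken from a real model):

```python
from scipy import stats

def coef_p_value(coef, std_err, df_resid):
    """Two-sided p-value for H0: beta = 0, given a coefficient estimate,
    its standard error, and the residual degrees of freedom."""
    t_stat = coef / std_err
    # Survival function of the t-distribution gives the upper-tail area;
    # doubling it yields the two-sided p-value
    p_value = 2 * stats.t.sf(abs(t_stat), df_resid)
    return t_stat, p_value

# Hypothetical: coefficient 2.5, standard error 0.8, 196 residual df
t_stat, p = coef_p_value(2.5, 0.8, 196)
print(f"t = {t_stat:.3f}, p = {p:.4f}")
```

Every t-statistic and p-value in a regression table is produced by exactly this ratio-plus-tail-area computation, one row per coefficient.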

1. Understanding Regression Table Components

Anatomy of a Regression Table

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from scipy import stats
# Generate sample data
np.random.seed(42)
n = 200
# Create features with known relationships
X1 = np.random.randn(n) * 10 + 50
X2 = np.random.randn(n) * 5 + 30
X3 = np.random.randn(n) * 8 + 20
# Create target variable with true coefficients
true_intercept = 10
true_beta1 = 2.5
true_beta2 = -1.8
true_beta3 = 0.7
y = (true_intercept +
     true_beta1 * X1 +
     true_beta2 * X2 +
     true_beta3 * X3 +
     np.random.randn(n) * 15)
# Create DataFrame
df = pd.DataFrame({
    'y': y,
    'X1': X1,
    'X2': X2,
    'X3': X3
})
print("Sample Data:")
print(df.head())
print(f"\nDataset Shape: {df.shape}")
# Fit OLS model using statsmodels
X = df[['X1', 'X2', 'X3']]
X_with_const = sm.add_constant(X)
model = sm.OLS(y, X_with_const).fit()
# Display regression table
print("\n" + "="*80)
print("COMPLETE REGRESSION TABLE")
print("="*80)
print(model.summary())

Component Breakdown

def explain_regression_table(model):
    """
    Explain each component of the regression table.
    """
    from statsmodels.stats.stattools import durbin_watson, jarque_bera
    print("\n" + "="*80)
    print("REGRESSION TABLE COMPONENTS EXPLAINED")
    print("="*80)
    # 1. Dependent Variable
    print("\n1. DEPENDENT VARIABLE:")
    print(f"   Model is predicting: {model.model.endog_names}")
    # 2. Model Information
    print("\n2. MODEL INFORMATION:")
    print("   Method: Least Squares (OLS)")
    print(f"   Number of Observations: {int(model.nobs)}")
    # 3. Coefficients Table
    print("\n3. COEFFICIENTS TABLE:")
    coef_table = pd.DataFrame({
        'Coefficient': model.params,
        'Std Error': model.bse,
        't-statistic': model.tvalues,
        'p-value': model.pvalues,
        '[0.025': model.conf_int()[0],
        '0.975]': model.conf_int()[1]
    })
    print(coef_table.round(4))
    # 4. Model Fit Statistics
    print("\n4. MODEL FIT STATISTICS:")
    print(f"   R-squared: {model.rsquared:.4f}")
    print(f"   Adjusted R-squared: {model.rsquared_adj:.4f}")
    print(f"   F-statistic: {model.fvalue:.2f}")
    print(f"   Prob (F-statistic): {model.f_pvalue:.4e}")
    print(f"   Log-Likelihood: {model.llf:.2f}")
    print(f"   AIC: {model.aic:.2f}")
    print(f"   BIC: {model.bic:.2f}")
    # 5. Residual Statistics (computed from the residuals;
    # these are not attributes of the results object)
    print("\n5. RESIDUAL STATISTICS:")
    print(f"   Durbin-Watson: {durbin_watson(model.resid):.4f}")
    jb_stat, jb_pval, _, _ = jarque_bera(model.resid)
    print(f"   Jarque-Bera: {jb_stat:.2f}")
    print(f"   Prob(JB): {jb_pval:.4f}")
    print(f"   Cond. No.: {model.condition_number:.2f}")

explain_regression_table(model)

2. Understanding P-Values

The Concept of P-Value

def explain_p_values(model):
    """
    Explain what p-values mean in regression context.
    """
    print("\n" + "="*80)
    print("UNDERSTANDING P-VALUES IN REGRESSION")
    print("="*80)
    print("\nWhat is a P-Value?")
    print("-" * 40)
    print("""The p-value is the probability of observing a test statistic as extreme
as the one computed, assuming the null hypothesis is true.
In regression:
- Null Hypothesis (H₀): β = 0 (no relationship)
- Alternative Hypothesis (H₁): β ≠ 0 (there is a relationship)
A small p-value (< 0.05) suggests strong evidence against H₀.
""")
    # Display p-values
    print("\nP-Values for Each Coefficient:")
    print("-" * 40)
    for var, p_val in model.pvalues.items():
        if var == 'const':
            print(f"Intercept p-value: {p_val:.6f}")
        else:
            significance = ("***" if p_val < 0.001 else
                            "**" if p_val < 0.01 else
                            "*" if p_val < 0.05 else "ns")
            print(f"{var} p-value: {p_val:.6f} {significance}")
    print("\nSignificance Codes:")
    print("  *** p < 0.001 (highly significant)")
    print("  **  p < 0.01 (very significant)")
    print("  *   p < 0.05 (significant)")
    print("  ns  p ≥ 0.05 (not significant)")

explain_p_values(model)

Visualizing P-Values

def visualize_p_values(model):
    """
    Visualize p-values and coefficient significance.
    """
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    # 1. Coefficients with confidence intervals
    coefs = model.params.drop('const')
    ci_low, ci_high = model.conf_int().drop('const').values.T
    axes[0, 0].errorbar(range(len(coefs)), coefs,
                        yerr=[coefs - ci_low, ci_high - coefs],
                        fmt='o', capsize=5, capthick=2)
    axes[0, 0].axhline(y=0, color='r', linestyle='--')
    axes[0, 0].set_xticks(range(len(coefs)))
    axes[0, 0].set_xticklabels(coefs.index)
    axes[0, 0].set_ylabel('Coefficient Value')
    axes[0, 0].set_title('Coefficients with 95% Confidence Intervals')
    axes[0, 0].grid(True, alpha=0.3)
    # 2. P-values bar chart
    p_values = model.pvalues.drop('const')
    colors = ['red' if p < 0.05 else 'gray' for p in p_values]
    axes[0, 1].bar(range(len(p_values)), p_values, color=colors)
    axes[0, 1].axhline(y=0.05, color='red', linestyle='--', label='Significance Threshold (0.05)')
    axes[0, 1].set_xticks(range(len(p_values)))
    axes[0, 1].set_xticklabels(p_values.index)
    axes[0, 1].set_ylabel('P-Value')
    axes[0, 1].set_yscale('log')
    axes[0, 1].set_title('P-Values by Variable (Log Scale)')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    # 3. Distribution of p-values under null (simulation)
    np.random.seed(42)
    null_pvalues = []
    for _ in range(1000):
        # Generate data with no relationship
        X_null = np.random.randn(100, 3)
        y_null = np.random.randn(100)
        model_null = sm.OLS(y_null, sm.add_constant(X_null)).fit()
        null_pvalues.extend(model_null.pvalues[1:])  # exclude intercept
    axes[1, 0].hist(null_pvalues, bins=50, edgecolor='black', alpha=0.7)
    axes[1, 0].axvline(x=0.05, color='red', linestyle='--', label='Significance Threshold')
    axes[1, 0].set_xlabel('P-Value')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].set_title('Distribution of P-Values Under Null Hypothesis')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    # 4. T-statistic vs P-value relationship
    t_stats = model.tvalues.drop('const')
    p_vals = model.pvalues.drop('const')
    axes[1, 1].scatter(t_stats, p_vals, s=100, alpha=0.7)
    for i, var in enumerate(t_stats.index):
        # Use .iloc for positional access; integer keys on a
        # label-indexed Series are deprecated in recent pandas
        axes[1, 1].annotate(var, (t_stats.iloc[i], p_vals.iloc[i]),
                            xytext=(5, 5), textcoords='offset points')
    axes[1, 1].set_xlabel('t-statistic')
    axes[1, 1].set_ylabel('P-Value')
    axes[1, 1].set_title('Relationship: t-statistic vs P-Value')
    axes[1, 1].set_yscale('log')
    axes[1, 1].grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

visualize_p_values(model)

3. Calculating P-Values Manually

From Scratch Implementation

def calculate_p_values_manual(X, y):
    """
    Calculate p-values for linear regression from scratch.
    """
    # Add intercept term
    X_with_const = np.column_stack([np.ones(len(X)), X])
    # Calculate coefficients: β = (XᵀX)⁻¹Xᵀy
    beta = np.linalg.inv(X_with_const.T @ X_with_const) @ X_with_const.T @ y
    # Calculate residuals
    y_pred = X_with_const @ beta
    residuals = y - y_pred
    # Degrees of freedom
    n = len(y)
    k = X_with_const.shape[1]
    df = n - k
    # Residual standard error
    rss = np.sum(residuals ** 2)
    mse = rss / df
    se = np.sqrt(np.diag(mse * np.linalg.inv(X_with_const.T @ X_with_const)))
    # Calculate t-statistics and p-values
    t_stats = beta / se
    p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), df))
    # Create results table
    results = pd.DataFrame({
        'Variable': ['Intercept'] + [f'X{i+1}' for i in range(X.shape[1])],
        'Coefficient': beta,
        'Std_Error': se,
        't_statistic': t_stats,
        'p_value': p_values
    })
    return results, t_stats, p_values, df

# Calculate p-values manually
X_manual = df[['X1', 'X2', 'X3']].values
y_manual = df['y'].values
manual_results, t_stats_manual, p_vals_manual, df_manual = calculate_p_values_manual(X_manual, y_manual)
print("\n" + "="*80)
print("MANUAL P-VALUE CALCULATION")
print("="*80)
print(manual_results.round(6))
# Compare with statsmodels
print("\n" + "="*80)
print("COMPARISON WITH STATSMODELS")
print("="*80)
comparison = pd.DataFrame({
    'Variable': manual_results['Variable'],
    'Manual P-value': manual_results['p_value'],
    'Statsmodels P-value': model.pvalues.values
})
print(comparison.round(6))

T-Distribution and P-Values

def visualize_t_distribution():
    """
    Visualize t-distribution and p-value calculation.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    # Residual degrees of freedom (n minus number of parameters)
    df = int(model.df_resid)
    # t-distribution
    x = np.linspace(-4, 4, 1000)
    t_dist = stats.t.pdf(x, df)
    normal_dist = stats.norm.pdf(x)
    axes[0].plot(x, t_dist, label=f't-distribution (df={df})', linewidth=2)
    axes[0].plot(x, normal_dist, '--', label='Normal distribution', alpha=0.7)
    axes[0].set_xlabel('t-value')
    axes[0].set_ylabel('Probability Density')
    axes[0].set_title('t-Distribution vs Normal Distribution')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    # Show p-value area for X1
    t_val = model.tvalues['X1']
    # Fill areas for two-tailed test
    x_fill_left = np.linspace(-4, -abs(t_val), 100)
    x_fill_right = np.linspace(abs(t_val), 4, 100)
    axes[1].plot(x, t_dist, linewidth=2)
    axes[1].fill_between(x_fill_left, 0, stats.t.pdf(x_fill_left, df), alpha=0.5, color='red')
    axes[1].fill_between(x_fill_right, 0, stats.t.pdf(x_fill_right, df), alpha=0.5, color='red')
    axes[1].axvline(x=abs(t_val), color='red', linestyle='--', label=f'|t| = {abs(t_val):.2f}')
    axes[1].axvline(x=-abs(t_val), color='red', linestyle='--')
    axes[1].set_xlabel('t-value')
    axes[1].set_ylabel('Probability Density')
    axes[1].set_title(f'P-Value for X1 = {model.pvalues["X1"]:.6f}')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

visualize_t_distribution()

4. Interpreting P-Values in Context

Significance Levels and Decision Making

def interpret_significance(model):
    """
    Interpret significance of each variable.
    """
    print("\n" + "="*80)
    print("SIGNIFICANCE INTERPRETATION")
    print("="*80)
    significance_levels = {
        0.001: "*** (Highly Significant)",
        0.01: "** (Very Significant)",
        0.05: "* (Significant)",
        0.10: ". (Marginally Significant)",
        float('inf'): "ns (Not Significant)"
    }
    print("\nVariable Significance:")
    print("-" * 50)
    for var, p_val in model.pvalues.items():
        var_name = "Intercept" if var == 'const' else var
        # Determine significance level
        if p_val < 0.001:
            level = significance_levels[0.001]
        elif p_val < 0.01:
            level = significance_levels[0.01]
        elif p_val < 0.05:
            level = significance_levels[0.05]
        elif p_val < 0.10:
            level = significance_levels[0.10]
        else:
            level = significance_levels[float('inf')]
        coef = model.params[var]
        direction = "positive" if coef > 0 else "negative"
        print(f"\n{var_name}: p = {p_val:.6f} {level}")
        print(f"  Coefficient: {coef:.4f} ({direction} effect)")
        if var == 'const':
            print("  Interpretation: Expected value of the target when all predictors are zero.")
        else:
            print(f"  Interpretation: For each 1-unit increase in {var_name},")
            print(f"    the target variable changes by {abs(coef):.4f} units,")
            print("    holding all else constant.")
        if p_val < 0.05:
            print("  ✓ Statistically significant at α = 0.05")
        else:
            print("  ✗ Not statistically significant at α = 0.05")

interpret_significance(model)

Effect Size vs Statistical Significance

def demonstrate_effect_size_vs_significance():
    """
    Show difference between statistical significance and practical significance.
    """
    np.random.seed(42)
    # Case 1: Large sample, small effect
    n_large = 10000
    X_large = np.random.randn(n_large)
    beta_small = 0.05
    y_large = 10 + beta_small * X_large + np.random.randn(n_large) * 5
    # Case 2: Small sample, large effect
    n_small = 30
    X_small = np.random.randn(n_small)
    beta_large = 2.0
    y_small = 10 + beta_large * X_small + np.random.randn(n_small) * 5
    # Fit models
    model_large = sm.OLS(y_large, sm.add_constant(X_large)).fit()
    model_small = sm.OLS(y_small, sm.add_constant(X_small)).fit()
    # Create comparison
    comparison = pd.DataFrame({
        'Scenario': ['Large Sample, Small Effect', 'Small Sample, Large Effect'],
        'Sample Size': [n_large, n_small],
        'True Beta': [beta_small, beta_large],
        'Estimated Beta': [model_large.params[1], model_small.params[1]],
        'P-Value': [model_large.pvalues[1], model_small.pvalues[1]],
        '95% CI Lower': [model_large.conf_int()[1][0], model_small.conf_int()[1][0]],
        '95% CI Upper': [model_large.conf_int()[1][1], model_small.conf_int()[1][1]]
    })
    print("\n" + "="*80)
    print("EFFECT SIZE VS STATISTICAL SIGNIFICANCE")
    print("="*80)
    print(comparison.round(4))
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    # Large sample
    axes[0].scatter(X_large, y_large, alpha=0.3, s=1)
    X_range = np.linspace(X_large.min(), X_large.max(), 100)
    y_pred = model_large.predict(sm.add_constant(X_range))
    axes[0].plot(X_range, y_pred, 'r-', linewidth=2)
    axes[0].set_xlabel('X')
    axes[0].set_ylabel('y')
    axes[0].set_title(f'Large Sample (n={n_large})\nβ={beta_small:.2f}, p={model_large.pvalues[1]:.4f}')
    axes[0].grid(True, alpha=0.3)
    # Small sample
    axes[1].scatter(X_small, y_small, alpha=0.7, s=50)
    X_range = np.linspace(X_small.min(), X_small.max(), 100)
    y_pred = model_small.predict(sm.add_constant(X_range))
    axes[1].plot(X_range, y_pred, 'r-', linewidth=2)
    axes[1].set_xlabel('X')
    axes[1].set_ylabel('y')
    axes[1].set_title(f'Small Sample (n={n_small})\nβ={beta_large:.2f}, p={model_small.pvalues[1]:.4f}')
    axes[1].grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    print("\nKey Insights:")
    print("  • Statistical significance doesn't imply practical importance")
    print("  • With large samples, even tiny effects can be statistically significant")
    print("  • With small samples, even large effects may not be statistically significant")
    print("  • Always consider effect size (coefficient) alongside p-values")

demonstrate_effect_size_vs_significance()

5. Multiple Testing and P-Value Adjustments

The Multiple Testing Problem

def demonstrate_multiple_testing():
    """
    Demonstrate the multiple testing problem and p-value adjustments.
    """
    np.random.seed(42)
    # Generate data with no real relationships
    n = 100
    n_features = 20
    X = np.random.randn(n, n_features)
    y = np.random.randn(n)
    # Fit separate models for each feature
    p_values = []
    for i in range(n_features):
        model_i = sm.OLS(y, sm.add_constant(X[:, i])).fit()
        p_values.append(model_i.pvalues[1])
    p_values = np.array(p_values)
    # Count significant results at α = 0.05
    significant = np.sum(p_values < 0.05)
    print("\n" + "="*80)
    print("MULTIPLE TESTING PROBLEM")
    print("="*80)
    print(f"Number of features: {n_features}")
    print(f"Features with p < 0.05: {significant} (expected: ~{int(n_features * 0.05)})")
    print(f"Expected false positives: {n_features * 0.05:.1f}")
    print(f"Observed false positives: {significant}")
    # Apply p-value adjustments
    from statsmodels.stats.multitest import multipletests
    # Bonferroni correction
    bonferroni_reject, bonferroni_pvals, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
    # Benjamini-Hochberg (FDR)
    fdr_reject, fdr_pvals, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
    # Compare adjustments
    comparison = pd.DataFrame({
        'Feature': [f'X{i+1}' for i in range(n_features)],
        'Original P-value': p_values,
        'Bonferroni P-value': bonferroni_pvals,
        'FDR P-value': fdr_pvals,
        'Original Sig': p_values < 0.05,
        'Bonferroni Sig': bonferroni_reject,
        'FDR Sig': fdr_reject
    })
    print("\nP-Value Adjustment Comparison:")
    print(comparison.round(4).to_string(index=False))
    print(f"\nSignificant after Bonferroni: {sum(bonferroni_reject)}")
    print(f"Significant after FDR: {sum(fdr_reject)}")
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    # Original p-values
    axes[0].bar(range(n_features), p_values, color='skyblue')
    axes[0].axhline(y=0.05, color='red', linestyle='--', label='α = 0.05')
    axes[0].set_xlabel('Feature Index')
    axes[0].set_ylabel('P-Value')
    axes[0].set_title('Original P-Values')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    # Adjusted p-values
    width = 0.35
    x = np.arange(n_features)
    axes[1].bar(x - width/2, bonferroni_pvals, width, label='Bonferroni', color='lightcoral')
    axes[1].bar(x + width/2, fdr_pvals, width, label='FDR (Benjamini-Hochberg)', color='lightgreen')
    axes[1].axhline(y=0.05, color='red', linestyle='--', label='α = 0.05')
    axes[1].set_xlabel('Feature Index')
    axes[1].set_ylabel('Adjusted P-Value')
    axes[1].set_title('Adjusted P-Values')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    return p_values, bonferroni_pvals, fdr_pvals

p_vals_original, p_vals_bonf, p_vals_fdr = demonstrate_multiple_testing()
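For reference, the two adjustments used above reduce to short formulas. A standalone sketch (with made-up p-values) that mirrors what `multipletests` computes:

```python
import numpy as np

def bonferroni_adjust(p):
    """Bonferroni: multiply each p-value by the number of tests, cap at 1."""
    p = np.asarray(p, dtype=float)
    return np.minimum(p * p.size, 1.0)

def bh_adjust(p):
    """Benjamini-Hochberg step-up adjusted p-values (monotone, capped at 1)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adjusted, 1.0)
    return out

p = [0.001, 0.01, 0.03, 0.2]
print(bonferroni_adjust(p))  # [0.004 0.04  0.12  0.8 ]
print(bh_adjust(p))          # [0.004 0.02  0.04  0.2 ]
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; Benjamini-Hochberg controls the false discovery rate and retains more power when many tests are run.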

6. P-Values for Different Model Types

Logistic Regression

from sklearn.datasets import make_classification

def logistic_regression_pvalues():
    """
    Demonstrate p-values in logistic regression.
    """
    # Generate binary classification data
    X_class, y_class = make_classification(n_samples=500, n_features=5,
                                           n_informative=3, random_state=42)
    # Convert to DataFrame
    df_class = pd.DataFrame(X_class, columns=[f'X{i}' for i in range(5)])
    df_class['y'] = y_class
    # Fit logistic regression using statsmodels
    X_with_const = sm.add_constant(df_class[['X0', 'X1', 'X2', 'X3', 'X4']])
    logit_model = sm.Logit(df_class['y'], X_with_const).fit()
    print("\n" + "="*80)
    print("LOGISTIC REGRESSION RESULTS WITH P-VALUES")
    print("="*80)
    print(logit_model.summary())
    # Interpret odds ratios
    print("\n" + "="*80)
    print("ODDS RATIOS AND INTERPRETATION")
    print("="*80)
    coefs = logit_model.params
    p_vals = logit_model.pvalues
    odds_ratios = np.exp(coefs)
    results = pd.DataFrame({
        'Coefficient': coefs,
        'Odds Ratio': odds_ratios,
        'P-Value': p_vals
    })
    print(results.round(4))
    print("\nInterpretation (significant predictors only):")
    for var in coefs.index:
        if var == 'const':
            continue
        if p_vals[var] < 0.05:
            effect = "increases" if odds_ratios[var] > 1 else "decreases"
            print(f"  {var}: Each unit increase {effect} the odds "
                  f"(odds ratio = {odds_ratios[var]:.2f}, p = {p_vals[var]:.4f})")
    return logit_model

logit_model = logistic_regression_pvalues()

Time Series Regression

def time_series_regression_pvalues():
    """
    Demonstrate p-values in time series regression.
    """
    # Generate time series data
    np.random.seed(42)
    t = np.arange(200)
    # Create trend and seasonal components
    trend = 0.5 * t
    seasonal = 10 * np.sin(2 * np.pi * t / 12)
    noise = np.random.randn(200) * 5
    y_ts = 100 + trend + seasonal + noise
    # Create features
    df_ts = pd.DataFrame({
        'y': y_ts,
        'time': t,
        'month_sin': np.sin(2 * np.pi * t / 12),
        'month_cos': np.cos(2 * np.pi * t / 12)
    })
    # Fit regression
    X_ts = df_ts[['time', 'month_sin', 'month_cos']]
    X_ts_const = sm.add_constant(X_ts)
    model_ts = sm.OLS(df_ts['y'], X_ts_const).fit()
    print("\n" + "="*80)
    print("TIME SERIES REGRESSION WITH P-VALUES")
    print("="*80)
    print(model_ts.summary())
    # Visualize
    fig, axes = plt.subplots(2, 1, figsize=(12, 8))
    # Original vs predicted
    axes[0].plot(df_ts['y'], label='Actual', alpha=0.7)
    axes[0].plot(model_ts.fittedvalues, label='Fitted', alpha=0.7)
    axes[0].set_xlabel('Time')
    axes[0].set_ylabel('Value')
    axes[0].set_title('Time Series with Regression Fit')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    # Residuals
    residuals = model_ts.resid
    axes[1].plot(residuals, color='red', alpha=0.7)
    axes[1].axhline(y=0, color='black', linestyle='-', linewidth=0.5)
    axes[1].set_xlabel('Time')
    axes[1].set_ylabel('Residuals')
    axes[1].set_title('Residuals')
    axes[1].grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    return model_ts

ts_model = time_series_regression_pvalues()

7. Common Pitfalls and Misinterpretations

Pitfalls in P-Value Interpretation

def demonstrate_p_value_pitfalls():
    """
    Demonstrate common pitfalls in interpreting p-values.
    """
    print("\n" + "="*80)
    print("COMMON P-VALUE MISINTERPRETATIONS")
    print("="*80)
    # Pitfall 1: p-value is NOT probability that null is true
    print("\n1. P-Value is NOT the probability that the null hypothesis is true")
    print("-" * 50)
    print("   Misinterpretation: 'p=0.03 means there's a 3% chance the null is true'")
    print("   Correct: 'If the null hypothesis were true, the probability of observing")
    print("            this extreme result is 3%'")
    # Pitfall 2: p-value depends on sample size
    print("\n2. P-Value Depends on Sample Size")
    print("-" * 50)
    # Generate data with same effect size but different sample sizes
    effect = 0.1
    n_small = 30
    n_large = 1000
    np.random.seed(42)
    X = np.random.randn(n_large)
    y = effect * X + np.random.randn(n_large)
    # Fit with small sample
    model_small = sm.OLS(y[:n_small], sm.add_constant(X[:n_small])).fit()
    # Fit with large sample
    model_large = sm.OLS(y, sm.add_constant(X)).fit()
    print(f"   Same effect size (β = {effect})")
    print(f"   Small sample (n={n_small}): p = {model_small.pvalues[1]:.6f}")
    print(f"   Large sample (n={n_large}): p = {model_large.pvalues[1]:.6f}")
    # Pitfall 3: p-value ≠ effect size
    print("\n3. P-Value Does NOT Indicate Effect Size")
    print("-" * 50)
    # Create two effects: one small but significant, one large but not
    np.random.seed(42)
    # Small effect, large sample
    n1 = 5000
    X1 = np.random.randn(n1)
    y1 = 0.02 * X1 + np.random.randn(n1)
    model1 = sm.OLS(y1, sm.add_constant(X1)).fit()
    # Large effect, small sample
    n2 = 20
    X2 = np.random.randn(n2)
    y2 = 2.0 * X2 + np.random.randn(n2) * 2
    model2 = sm.OLS(y2, sm.add_constant(X2)).fit()
    print(f"   Small effect (β=0.02), n={n1}: p={model1.pvalues[1]:.6f}, effect={model1.params[1]:.4f}")
    print(f"   Large effect (β=2.0), n={n2}: p={model2.pvalues[1]:.6f}, effect={model2.params[1]:.4f}")
    # Pitfall 4: p-value ≠ importance
    print("\n4. P-Value Does NOT Indicate Importance")
    print("-" * 50)
    print("   A variable can be statistically significant but practically unimportant")
    print("   A variable can be practically important but not statistically significant")
    # Pitfall 5: p-value ≠ replication probability
    print("\n5. P-Value is NOT the Probability of Replication")
    print("-" * 50)
    print("   Misinterpretation: 'p=0.03 means there's a 97% chance the result will replicate'")
    print("   Correct: Replication probability depends on many factors beyond p-value")
    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    # Sample size effect
    sizes = range(10, 500, 20)
    p_values = []
    for n in sizes:
        X_n = np.random.randn(n)
        y_n = 0.05 * X_n + np.random.randn(n)
        model_n = sm.OLS(y_n, sm.add_constant(X_n)).fit()
        p_values.append(model_n.pvalues[1])
    axes[0, 0].plot(sizes, p_values, 'o-')
    axes[0, 0].axhline(y=0.05, color='red', linestyle='--')
    axes[0, 0].set_xlabel('Sample Size')
    axes[0, 0].set_ylabel('P-Value')
    axes[0, 0].set_title('P-Value Decreases with Sample Size')
    axes[0, 0].grid(True, alpha=0.3)
    # P-value vs effect size relationship
    n_fixed = 100
    effects = np.linspace(0, 0.5, 30)
    p_vals_effect = []
    for beta in effects:
        X_eff = np.random.randn(n_fixed)
        y_eff = beta * X_eff + np.random.randn(n_fixed)
        model_eff = sm.OLS(y_eff, sm.add_constant(X_eff)).fit()
        p_vals_effect.append(model_eff.pvalues[1])
    axes[0, 1].plot(effects, p_vals_effect, 'o-')
    axes[0, 1].axhline(y=0.05, color='red', linestyle='--')
    axes[0, 1].set_xlabel('Effect Size (Beta)')
    axes[0, 1].set_ylabel('P-Value')
    axes[0, 1].set_title('P-Value vs Effect Size')
    axes[0, 1].grid(True, alpha=0.3)
    # Distribution of p-values under null
    null_pvals = []
    for _ in range(1000):
        X_null = np.random.randn(100, 1)
        y_null = np.random.randn(100)
        model_null = sm.OLS(y_null, sm.add_constant(X_null)).fit()
        null_pvals.append(model_null.pvalues[1])
    axes[1, 0].hist(null_pvals, bins=30, edgecolor='black', alpha=0.7)
    axes[1, 0].axvline(x=0.05, color='red', linestyle='--')
    axes[1, 0].set_xlabel('P-Value')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].set_title('P-Value Distribution Under Null (Uniform)')
    axes[1, 0].grid(True, alpha=0.3)
    # P-value vs t-statistic
    t_vals = np.linspace(0, 4, 100)
    p_vals_t = 2 * (1 - stats.t.cdf(t_vals, df=50))
    axes[1, 1].plot(t_vals, p_vals_t)
    axes[1, 1].axhline(y=0.05, color='red', linestyle='--')
    axes[1, 1].axvline(x=stats.t.ppf(0.975, 50), color='green', linestyle='--',
                       label='Critical t (α=0.05)')
    axes[1, 1].set_xlabel('t-statistic')
    axes[1, 1].set_ylabel('P-Value')
    axes[1, 1].set_title('Relationship: t-statistic → P-Value')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

demonstrate_p_value_pitfalls()

8. Practical Guide to Reporting Results

Creating Publication-Ready Tables

def create_regression_table(model, coef_names=None, decimals=3):
    """
    Create a publication-ready regression table.
    """
    # Extract results
    coefs = model.params
    std_err = model.bse
    t_stats = model.tvalues
    p_vals = model.pvalues
    # conf_int() returns a DataFrame with columns 0 (lower) and 1 (upper)
    conf_int = model.conf_int()
    ci_low, ci_high = conf_int[0], conf_int[1]
    # Significance stars
    stars = []
    for p in p_vals:
        if p < 0.001:
            stars.append('***')
        elif p < 0.01:
            stars.append('**')
        elif p < 0.05:
            stars.append('*')
        else:
            stars.append('')
    # Create DataFrame
    if coef_names is None:
        coef_names = coefs.index
    table = pd.DataFrame({
        'Coefficient': coefs.values,
        'Std. Error': std_err.values,
        't-statistic': t_stats.values,
        'p-value': p_vals.values,
        '95% CI Lower': ci_low.values,
        '95% CI Upper': ci_high.values,
        'Sig': stars
    }, index=coef_names)
    # Round values
    table = table.round(decimals)
    # Add significance stars to coefficient column
    table['Coefficient'] = table['Coefficient'].astype(str) + ' ' + table['Sig']
    table = table.drop('Sig', axis=1)
    # Model fit statistics
    fit_stats = {
        'Observations': int(model.nobs),
        'R-squared': model.rsquared,
        'Adj. R-squared': model.rsquared_adj,
        'F-statistic': model.fvalue,
        'Prob (F-statistic)': model.f_pvalue,
        'AIC': model.aic,
        'BIC': model.bic
    }
    return table, fit_stats

# Create publication-ready table
reg_table, fit_stats = create_regression_table(model)
print("\n" + "="*80)
print("PUBLICATION-READY REGRESSION TABLE")
print("="*80)
print(reg_table.to_string())
print("\nModel Fit Statistics:")
for stat, value in fit_stats.items():
    if isinstance(value, float):
        print(f"  {stat}: {value:.4f}")
    else:
        print(f"  {stat}: {value}")

Writing Interpretations

def write_regression_interpretation(model, feature_names, target_name):
    """
    Generate natural language interpretation of regression results.
    """
    print("\n" + "="*80)
    print("REGRESSION RESULTS INTERPRETATION")
    print("="*80)
    # Overall model significance
    if model.f_pvalue < 0.05:
        print("\nOverall Model:")
        print(f"  The model is statistically significant "
              f"(F({int(model.df_model)}, {int(model.df_resid)}) = {model.fvalue:.2f}, "
              f"p = {model.f_pvalue:.2e}).")
        print(f"  The predictors explain {model.rsquared:.1%} of the variance in {target_name}.")
    else:
        print("\nOverall Model:")
        print(f"  The model is not statistically significant (p = {model.f_pvalue:.4f}).")
    # Individual predictors
    print("\nPredictor Effects:")
    for name, coef, p_val in zip(feature_names, model.params[1:], model.pvalues[1:]):
        ci_low, ci_high = model.conf_int().loc[name]
        if p_val < 0.05:
            direction = "increases" if coef > 0 else "decreases"
            print(f"\n  {name}:")
            print(f"    • Coefficient: {coef:.4f} (95% CI: {ci_low:.4f}, {ci_high:.4f})")
            print(f"    • p-value: {p_val:.4f}")
            print(f"    • For each 1-unit increase in {name}, {target_name} {direction} by {abs(coef):.4f} units,")
            print("      holding all other variables constant.")
        else:
            print(f"\n  {name}:")
            print(f"    • Coefficient: {coef:.4f} (95% CI: {ci_low:.4f}, {ci_high:.4f})")
            print(f"    • p-value: {p_val:.4f}")
            print("    • Not statistically significant at α = 0.05.")
    # Interpretation of intercept
    print("\nIntercept:")
    intercept = model.params['const']
    ci_low, ci_high = model.conf_int().loc['const']
    print(f"  When all predictors are zero, the predicted {target_name} is {intercept:.4f}")
    print(f"  (95% CI: {ci_low:.4f}, {ci_high:.4f}).")

# Generate interpretation
write_regression_interpretation(model, ['X1', 'X2', 'X3'], 'y')

9. Advanced Topics

Bootstrap P-Values

def bootstrap_p_values(X, y, n_bootstrap=1000):
    """
    Calculate bootstrap p-values for regression coefficients.
    """
    X_const = sm.add_constant(X)
    # Original model
    original_model = sm.OLS(y, X_const).fit()
    original_coefs = original_model.params
    # Bootstrap sampling distribution of the coefficients
    bootstrap_coefs = []
    for _ in range(n_bootstrap):
        # Resample observation pairs with replacement
        indices = np.random.choice(len(y), size=len(y), replace=True)
        X_boot = X_const[indices]
        y_boot = y[indices]
        # Fit model
        model_boot = sm.OLS(y_boot, X_boot).fit()
        bootstrap_coefs.append(model_boot.params)
    bootstrap_coefs = np.array(bootstrap_coefs)
    # Calculate bootstrap p-values (two-tailed). The bootstrap
    # distribution is centered near the original estimate, so shift it
    # to zero to approximate the null distribution, then count how often
    # it is at least as extreme as the observed coefficient. The +1
    # terms keep the estimated p-value away from exactly zero.
    bootstrap_p_values = []
    for i, orig_coef in enumerate(original_coefs):
        centered = bootstrap_coefs[:, i] - orig_coef
        extreme_count = np.sum(np.abs(centered) >= np.abs(orig_coef))
        p_value = (extreme_count + 1) / (n_bootstrap + 1)
        bootstrap_p_values.append(p_value)
    return bootstrap_p_values

# Calculate bootstrap p-values
np.random.seed(42)
bootstrap_p = bootstrap_p_values(X_manual, y_manual)
print("\n" + "="*80)
print("BOOTSTRAP P-VALUES VS STANDARD P-VALUES")
print("="*80)
comparison = pd.DataFrame({
    'Variable': ['Intercept', 'X1', 'X2', 'X3'],
    'Standard P-Value': model.pvalues,
    'Bootstrap P-Value': bootstrap_p
})
print(comparison.round(6))

Permutation Tests

def permutation_p_values(X, y, n_permutations=1000):
    """
    Calculate permutation p-values for regression coefficients.
    """
    X_const = sm.add_constant(X)
    # Original model
    original_model = sm.OLS(y, X_const).fit()
    original_coefs = original_model.params
    # Permutation distribution
    perm_coefs = []
    for _ in range(n_permutations):
        # Shuffle the target variable
        y_perm = np.random.permutation(y)
        # Fit model on permuted data
        model_perm = sm.OLS(y_perm, X_const).fit()
        perm_coefs.append(model_perm.params)
    perm_coefs = np.array(perm_coefs)
    # Calculate permutation p-values
    perm_p_values = []
    for i, orig_coef in enumerate(original_coefs):
        # Count how many permuted coefficients are more extreme
        extreme_count = np.sum(np.abs(perm_coefs[:, i]) >= np.abs(orig_coef))
        p_value = extreme_count / n_permutations
        perm_p_values.append(p_value)
    return perm_p_values

# Calculate permutation p-values
perm_p = permutation_p_values(X_manual, y_manual)
print("\n" + "="*80)
print("PERMUTATION P-VALUES")
print("="*80)
comparison = pd.DataFrame({
    'Variable': ['Intercept', 'X1', 'X2', 'X3'],
    'Standard P-Value': model.pvalues,
    'Permutation P-Value': perm_p
})
print(comparison.round(6))

10. Best Practices and Summary

Best Practices for Using P-Values

def summarize_best_practices():
    """
    Summarize best practices for using p-values in regression.
    """
    print("\n" + "="*80)
    print("BEST PRACTICES FOR USING P-VALUES IN REGRESSION")
    print("="*80)
    practices = {
        "Before Modeling": [
            "Check assumptions (linearity, independence, homoscedasticity, normality)",
            "Handle missing data appropriately",
            "Consider multicollinearity (VIF)"
        ],
        "During Modeling": [
            "Use appropriate sample size (power analysis)",
            "Consider multiple testing corrections when applicable",
            "Check model diagnostics (residual plots, influence measures)"
        ],
        "Interpreting Results": [
            "Report effect sizes (coefficients) alongside p-values",
            "Provide confidence intervals",
            "Distinguish between statistical and practical significance",
            "Consider the context and domain knowledge"
        ],
        "Reporting": [
            "Report exact p-values (not just p < 0.05)",
            "Include confidence intervals",
            "Disclose multiple testing corrections if applied",
            "Be transparent about data and analysis choices"
        ],
        "Common Misconceptions to Avoid": [
            "p-value ≠ probability that null is true",
            "p-value ≠ effect size",
            "p-value ≠ importance",
            "p-value ≠ replication probability",
            "p-value ≠ evidence strength"
        ]
    }
    for category, items in practices.items():
        print(f"\n{category}:")
        print("-" * 40)
        for item in items:
            print(f"  • {item}")

summarize_best_practices()

Quick Reference Guide

def create_reference_guide():
    """
    Create a quick reference guide for regression p-values.
    """
    print("\n" + "="*80)
    print("QUICK REFERENCE: REGRESSION P-VALUES")
    print("="*80)
    guide = {
        "What is a p-value?": [
            "Probability of observing data at least as extreme as observed, assuming the null hypothesis is true",
            "Ranges from 0 to 1",
            "Smaller values suggest stronger evidence against the null"
        ],
        "Interpretation Guide": [
            "p < 0.001: *** Very strong evidence against null",
            "p < 0.01: ** Strong evidence against null",
            "p < 0.05: * Moderate evidence against null",
            "p < 0.10: . Weak evidence against null",
            "p ≥ 0.10: ns Insufficient evidence to reject null"
        ],
        "Common Thresholds": [
            "α = 0.05: Traditional significance level",
            "α = 0.01: More conservative (e.g., medical research)",
            "α = 0.10: More liberal (e.g., exploratory research)"
        ],
        "Factors Affecting P-Value": [
            "Sample size (larger n → smaller p-values)",
            "Effect size (larger effect → smaller p-values)",
            "Variability (more noise → larger p-values)",
            "Number of tests (multiple testing inflates type I error)"
        ],
        "P-Value Alternatives": [
            "Confidence intervals (more informative)",
            "Effect sizes (Cohen's d, partial η²)",
            "Bayesian methods (posterior probabilities)",
            "Information criteria (AIC, BIC)"
        ]
    }
    for topic, items in guide.items():
        print(f"\n{topic}:")
        print("-" * 50)
        for item in items:
            print(f"  • {item}")

create_reference_guide()

11. Interactive P-Value Explorer

def interactive_p_value_explorer():
    """
    Create an interactive demonstration of p-values.
    """
    import ipywidgets as widgets
    from IPython.display import display

    # Create widgets
    n_slider = widgets.IntSlider(value=100, min=10, max=500, step=10, description='Sample Size:')
    beta_slider = widgets.FloatSlider(value=0.5, min=0, max=2, step=0.05, description='Effect Size:')
    noise_slider = widgets.FloatSlider(value=1, min=0.5, max=3, step=0.1, description='Noise SD:')
    run_button = widgets.Button(description='Run Simulation')
    output = widgets.Output()

    def update_plot(n, beta, noise_sd):
        with output:
            output.clear_output(wait=True)
            # Generate data
            np.random.seed(42)
            X = np.random.randn(n)
            y = beta * X + np.random.randn(n) * noise_sd
            # Fit model
            model_sim = sm.OLS(y, sm.add_constant(X)).fit()
            # Create plot
            fig, axes = plt.subplots(1, 3, figsize=(15, 4))
            # Scatter plot with regression line
            axes[0].scatter(X, y, alpha=0.5)
            X_range = np.linspace(X.min(), X.max(), 100)
            y_pred = model_sim.predict(sm.add_constant(X_range))
            axes[0].plot(X_range, y_pred, 'r-', linewidth=2)
            axes[0].set_xlabel('X')
            axes[0].set_ylabel('y')
            axes[0].set_title(f'Data (n={n}, β={beta:.2f})')
            axes[0].grid(True, alpha=0.3)
            # Coefficient plot
            axes[1].bar(['β'], [model_sim.params[1]], color='skyblue')
            axes[1].errorbar(['β'], [model_sim.params[1]],
                             yerr=[model_sim.bse[1]], capsize=5, color='black')
            axes[1].axhline(y=0, color='red', linestyle='--')
            axes[1].set_ylabel('Coefficient Value')
            axes[1].set_title(f'β = {model_sim.params[1]:.3f}, p = {model_sim.pvalues[1]:.4f}')
            axes[1].grid(True, alpha=0.3)
            # Distribution of coefficients (bootstrap)
            bootstrap_coefs = []
            for _ in range(500):
                idx = np.random.choice(len(y), len(y), replace=True)
                X_boot = X[idx]
                y_boot = y[idx]
                model_boot = sm.OLS(y_boot, sm.add_constant(X_boot)).fit()
                bootstrap_coefs.append(model_boot.params[1])
            axes[2].hist(bootstrap_coefs, bins=30, edgecolor='black', alpha=0.7)
            axes[2].axvline(x=0, color='red', linestyle='--', label='Null (0)')
            axes[2].axvline(x=model_sim.params[1], color='blue', linestyle='--',
                            label=f'Observed β = {model_sim.params[1]:.3f}')
            axes[2].set_xlabel('Coefficient')
            axes[2].set_ylabel('Frequency')
            axes[2].set_title('Bootstrap Distribution')
            axes[2].legend()
            axes[2].grid(True, alpha=0.3)
            plt.tight_layout()
            plt.show()
            # Print interpretation
            print("\nResults:")
            print(f"  Sample Size: {n}")
            print(f"  True Effect: β = {beta}")
            print(f"  Estimated Effect: β = {model_sim.params[1]:.4f}")
            print(f"  Standard Error: {model_sim.bse[1]:.4f}")
            print(f"  t-statistic: {model_sim.tvalues[1]:.4f}")
            print(f"  p-value: {model_sim.pvalues[1]:.6f}")
            if model_sim.pvalues[1] < 0.05:
                print("  Conclusion: Statistically significant (p < 0.05)")
            else:
                print("  Conclusion: Not statistically significant (p ≥ 0.05)")

    def on_run_clicked(b):
        update_plot(n_slider.value, beta_slider.value, noise_slider.value)

    run_button.on_click(on_run_clicked)
    # Display widgets
    display(n_slider, beta_slider, noise_slider, run_button, output)
    # Initial plot
    update_plot(100, 0.5, 1)

# Uncomment to run interactive explorer (requires Jupyter/IPython)
# interactive_p_value_explorer()

Conclusion

Understanding p-values and regression tables is essential for sound statistical modeling:

Key Takeaways

  1. P-Value Definition: The probability of observing data at least as extreme as that observed, assuming the null hypothesis is true
  2. Interpretation: Small p-values (< 0.05) suggest evidence against the null hypothesis
  3. Not a Probability of Null: P-values do NOT tell you the probability that the null is true
  4. Sample Size Sensitivity: P-values decrease with larger sample sizes
  5. Multiple Testing: Adjust p-values when testing multiple hypotheses
  6. Effect Size Matters: Statistical significance ≠ practical importance
  7. Context Required: Always interpret p-values alongside coefficients, confidence intervals, and domain knowledge

Regression Table Components

| Component | Description | Interpretation |
|-----------|-------------|----------------|
| Coefficient | Estimated effect size | Direction and magnitude of relationship |
| Std Error | Precision of estimate | Smaller = more precise |
| t-statistic | Coefficient / Std Error | Test statistic for significance |
| p-value | Probability under null | Evidence against null hypothesis |
| CI (95%) | Range of plausible values | Precision and significance |
| R-squared | Variance explained | Model fit |
| F-statistic | Overall model test | Joint significance of predictors |

Best Practices Checklist

  • [ ] Check regression assumptions before interpreting p-values
  • [ ] Report effect sizes alongside p-values
  • [ ] Provide confidence intervals for coefficients
  • [ ] Adjust for multiple testing when applicable
  • [ ] Consider sample size when interpreting significance
  • [ ] Distinguish between statistical and practical significance
  • [ ] Use domain knowledge to interpret results
  • [ ] Be transparent about analysis choices
  • [ ] Avoid common p-value misinterpretations
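The "adjust for multiple testing" item on the checklist can be handled with statsmodels' `multipletests`. A minimal sketch applying a Holm step-down correction to a set of raw p-values (the values themselves are illustrative):

```python
from statsmodels.stats.multitest import multipletests

# Illustrative raw p-values from testing several coefficients at once
raw_p = [0.001, 0.020, 0.040, 0.300]

# Holm correction controls the family-wise error rate at alpha = 0.05
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='holm')
for p, p_adj, r in zip(raw_p, adj_p, reject):
    print(f"raw p = {p:.3f} -> adjusted p = {p_adj:.3f}, reject null: {r}")
```

After adjustment, tests that looked significant at the raw 0.05 threshold (here 0.020 and 0.040) may no longer be, which is the point: the correction prevents the inflated type I error rate noted earlier in the guide.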

Mastering p-values and regression tables is fundamental to becoming a proficient data scientist. These tools, when used correctly, provide powerful insights into relationships in your data and help build reliable predictive models.
