Introduction to Correlation and Causality
One of the most fundamental and often misunderstood concepts in data science is the distinction between correlation and causation. While correlation measures the strength of a relationship between variables, causation implies that one variable directly influences another. Understanding this distinction is crucial for making valid inferences and avoiding misleading conclusions.
Key Concepts
- Correlation: A statistical measure of how two variables move together
- Causality: The relationship where one event directly causes another
- Confounding: Hidden variables that influence both correlated variables
- Spurious Correlation: A relationship that appears statistically significant but arises from chance, shared trends, or confounding rather than a causal link
- Directionality: Understanding which variable influences which
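The "Confounding" bullet can be made concrete in a few lines of simulation: a hidden variable `z` (names here are illustrative) drives both `x` and `y`, producing a strong marginal correlation that vanishes once `z` is controlled for. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

z = rng.standard_normal(n)            # hidden common cause
x = z + 0.5 * rng.standard_normal(n)  # z drives x
y = z + 0.5 * rng.standard_normal(n)  # z drives y (x never touches y)

marginal = np.corrcoef(x, y)[0, 1]

# Control for z by residualizing both variables on it
slope_x = np.cov(x, z)[0, 1] / np.var(z)
slope_y = np.cov(y, z)[0, 1] / np.var(z)
partial = np.corrcoef(x - slope_x * z, y - slope_y * z)[0, 1]

print(f"marginal r = {marginal:.3f}, partial r = {partial:.3f}")
```

The marginal correlation is strong even though `x` has no effect on `y`; the partial correlation is essentially zero.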
1. Understanding Correlation
What is Correlation?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
# Generate correlation examples
np.random.seed(42)
n = 100
# 1. Strong positive correlation
x_pos = np.random.randn(n)
y_pos = 2 * x_pos + np.random.randn(n) * 0.3
# 2. Strong negative correlation
x_neg = np.random.randn(n)
y_neg = -2 * x_neg + np.random.randn(n) * 0.3
# 3. No correlation
x_none = np.random.randn(n)
y_none = np.random.randn(n)
# 4. Non-linear relationship
x_nonlin = np.linspace(-3, 3, n)
y_nonlin = x_nonlin**2 + np.random.randn(n) * 0.5
# Calculate correlations
corr_pos, _ = stats.pearsonr(x_pos, y_pos)
corr_neg, _ = stats.pearsonr(x_neg, y_neg)
corr_none, _ = stats.pearsonr(x_none, y_none)
corr_nonlin, _ = stats.pearsonr(x_nonlin, y_nonlin)
print("Correlation Examples:")
print(f"Strong Positive Correlation: r = {corr_pos:.3f}")
print(f"Strong Negative Correlation: r = {corr_neg:.3f}")
print(f"No Correlation: r = {corr_none:.3f}")
print(f"Non-linear relationship: r = {corr_nonlin:.3f} (Pearson misses the clear quadratic dependence)")
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].scatter(x_pos, y_pos, alpha=0.6)
axes[0, 0].set_title(f'Strong Positive Correlation (r = {corr_pos:.3f})')
axes[0, 0].set_xlabel('X')
axes[0, 0].set_ylabel('Y')
axes[0, 1].scatter(x_neg, y_neg, alpha=0.6)
axes[0, 1].set_title(f'Strong Negative Correlation (r = {corr_neg:.3f})')
axes[0, 1].set_xlabel('X')
axes[0, 1].set_ylabel('Y')
axes[1, 0].scatter(x_none, y_none, alpha=0.6)
axes[1, 0].set_title(f'No Correlation (r = {corr_none:.3f})')
axes[1, 0].set_xlabel('X')
axes[1, 0].set_ylabel('Y')
axes[1, 1].scatter(x_nonlin, y_nonlin, alpha=0.6)
axes[1, 1].set_title(f'Non-linear Relationship (r = {corr_nonlin:.3f})')
axes[1, 1].set_xlabel('X')
axes[1, 1].set_ylabel('Y')
plt.tight_layout()
plt.show()
Types of Correlation
# Pearson, Spearman, and Kendall correlations
np.random.seed(42)
n = 200
# Linear relationship
x_linear = np.random.randn(n)
y_linear = 2 * x_linear + np.random.randn(n) * 0.5
# Monotonic but non-linear
x_monotonic = np.random.randn(n)
y_monotonic = np.exp(x_monotonic) + np.random.randn(n) * 0.5
# Non-monotonic
x_nonmonotonic = np.random.randn(n)
y_nonmonotonic = x_nonmonotonic**3 + np.random.randn(n) * 0.5
# Calculate different correlation measures
correlations = {
'Linear Relationship': {
'Pearson': stats.pearsonr(x_linear, y_linear)[0],
'Spearman': stats.spearmanr(x_linear, y_linear)[0],
'Kendall': stats.kendalltau(x_linear, y_linear)[0]
},
'Monotonic Relationship': {
'Pearson': stats.pearsonr(x_monotonic, y_monotonic)[0],
'Spearman': stats.spearmanr(x_monotonic, y_monotonic)[0],
'Kendall': stats.kendalltau(x_monotonic, y_monotonic)[0]
},
'Non-monotonic Relationship': {
'Pearson': stats.pearsonr(x_nonmonotonic, y_nonmonotonic)[0],
'Spearman': stats.spearmanr(x_nonmonotonic, y_nonmonotonic)[0],
'Kendall': stats.kendalltau(x_nonmonotonic, y_nonmonotonic)[0]
}
}
corr_df = pd.DataFrame(correlations).round(3)
print("Correlation Measures Comparison:")
print(corr_df)
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].scatter(x_linear, y_linear, alpha=0.6)
axes[0].set_title('Linear Relationship')
axes[0].set_xlabel('X')
axes[0].set_ylabel('Y')
axes[1].scatter(x_monotonic, y_monotonic, alpha=0.6)
axes[1].set_title('Monotonic Relationship')
axes[1].set_xlabel('X')
axes[1].set_ylabel('Y')
axes[2].scatter(x_nonmonotonic, y_nonmonotonic, alpha=0.6)
axes[2].set_title('Non-monotonic Relationship')
axes[2].set_xlabel('X')
axes[2].set_ylabel('Y')
plt.tight_layout()
plt.show()
2. The Difference: Correlation vs. Causation
Famous Spurious Correlations
# Demonstrating spurious correlations
np.random.seed(42)
years = np.arange(2000, 2021)
n = len(years)
# Generate unrelated but trending data
internet_usage = np.linspace(10, 85, n) + np.random.randn(n) * 2
cheese_consumption = np.linspace(20, 40, n) + np.random.randn(n) * 1.5
# Calculate correlation
correlation = np.corrcoef(internet_usage, cheese_consumption)[0, 1]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(years, internet_usage, 'b-o', label='Internet Usage (%)')
axes[0].plot(years, cheese_consumption, 'r-s', label='Cheese Consumption (lbs/person)')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Values')
axes[0].set_title('Spurious Correlation: Internet Usage vs. Cheese Consumption')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[1].scatter(internet_usage, cheese_consumption, alpha=0.7, s=100)
axes[1].set_xlabel('Internet Usage (%)')
axes[1].set_ylabel('Cheese Consumption (lbs/person)')
axes[1].set_title(f'Correlation: r = {correlation:.3f}')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Spurious Correlation Example:")
print(f"Correlation between Internet Usage and Cheese Consumption: {correlation:.3f}")
print("Does increased internet usage cause people to eat more cheese? NO!")
print("This is a classic example of spurious correlation driven by time trends.")
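A standard check for trend-driven spurious correlation is to difference each series and re-correlate: if the relationship lives only in the shared trend, the year-over-year changes should be uncorrelated. A quick sketch on synthetic trending series (the series are illustrative, not real data):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Two unrelated series that both happen to trend upward over time
a = np.linspace(10, 85, n) + rng.normal(0, 2, n)
b = np.linspace(20, 40, n) + rng.normal(0, 1.5, n)

raw_corr = np.corrcoef(a, b)[0, 1]

# First differences remove the shared linear trend
diff_corr = np.corrcoef(np.diff(a), np.diff(b))[0, 1]

print(f"levels: r = {raw_corr:.3f}, differences: r = {diff_corr:.3f}")
```

The level correlation is near 1, while the correlation of the differences collapses toward zero, revealing that the trend was doing all the work.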
Causation Requires More Than Correlation
# Demonstrating that correlation does not imply causation
from sklearn.linear_model import LinearRegression
# Generate data where ice cream sales and drowning incidents are correlated
np.random.seed(42)
temperature = np.random.uniform(60, 100, 100)
ice_cream_sales = 50 + 2 * (temperature - 70) + np.random.randn(100) * 10
drowning_incidents = 10 + 0.5 * (temperature - 70) + np.random.randn(100) * 5
# Calculate correlation
correlation = np.corrcoef(ice_cream_sales, drowning_incidents)[0, 1]
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].scatter(temperature, ice_cream_sales, alpha=0.6)
axes[0].set_xlabel('Temperature (°F)')
axes[0].set_ylabel('Ice Cream Sales')
axes[0].set_title('Temperature → Ice Cream Sales')
axes[1].scatter(temperature, drowning_incidents, alpha=0.6)
axes[1].set_xlabel('Temperature (°F)')
axes[1].set_ylabel('Drowning Incidents')
axes[1].set_title('Temperature → Drowning Incidents')
axes[2].scatter(ice_cream_sales, drowning_incidents, alpha=0.6)
axes[2].set_xlabel('Ice Cream Sales')
axes[2].set_ylabel('Drowning Incidents')
axes[2].set_title(f'Ice Cream Sales vs. Drowning (r = {correlation:.3f})')
plt.tight_layout()
plt.show()
print("\nThe Confounding Variable Problem:")
print(f"Ice cream sales and drowning incidents are correlated (r = {correlation:.3f})")
print("Does ice cream cause drowning? NO!")
print("The confounder is TEMPERATURE: hot weather causes both more ice cream sales AND more drowning incidents.")
print("\nThis is why correlation ≠ causation!")
3. Confounding Variables
Understanding Confounders
# Simulating a confounding variable
np.random.seed(42)
n = 500
# Generate data with confounder
age = np.random.uniform(20, 70, n)
# Exercise depends on age: younger people exercise more, which makes age a confounder
exercise_hours = np.clip(8 - 0.1 * (age - 20) + np.random.randn(n) * 1.5, 0, None)
# Health score (outcome) influenced by both exercise and age
health_score = (2 * exercise_hours - 0.5 * (age - 50) +
                100 + np.random.randn(n) * 10)
# Create DataFrame
df = pd.DataFrame({
'exercise_hours': exercise_hours,
'age': age,
'health_score': health_score
})
# Correlation without controlling for confounder
crude_corr = df['exercise_hours'].corr(df['health_score'])
# Partial correlation controlling for age
from scipy.stats import pearsonr
# Residuals after removing age effect
def partial_correlation(df, x, y, control):
    # Regress x on control
    model_x = LinearRegression()
    model_x.fit(df[[control]], df[x])
    x_resid = df[x] - model_x.predict(df[[control]])
    # Regress y on control
    model_y = LinearRegression()
    model_y.fit(df[[control]], df[y])
    y_resid = df[y] - model_y.predict(df[[control]])
    return pearsonr(x_resid, y_resid)[0]
partial_corr = partial_correlation(df, 'exercise_hours', 'health_score', 'age')
print("Confounding Analysis:")
print(f"Crude correlation (exercise vs. health): {crude_corr:.3f}")
print(f"Partial correlation (controlling for age): {partial_corr:.3f}")
print("\nInterpretation:")
print(f"The crude correlation overestimates the effect of exercise because")
print(f"younger people tend to exercise more AND have better health.")
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Age groups
df['age_group'] = pd.cut(df['age'], bins=[20, 40, 60, 70], labels=['Young', 'Middle', 'Senior'])
# Plot by age group
for age_group, color in zip(['Young', 'Middle', 'Senior'], ['blue', 'green', 'red']):
    subset = df[df['age_group'] == age_group]
    axes[0].scatter(subset['exercise_hours'], subset['health_score'],
                    alpha=0.5, label=age_group, color=color)
axes[0].set_xlabel('Exercise Hours per Week')
axes[0].set_ylabel('Health Score')
axes[0].set_title('Health Score vs. Exercise by Age Group')
axes[0].legend()
axes[1].scatter(df['age'], df['health_score'], alpha=0.5)
axes[1].set_xlabel('Age')
axes[1].set_ylabel('Health Score')
axes[1].set_title('Age is a Confounding Variable')
plt.tight_layout()
plt.show()
Visualizing Confounding
# DAG (Directed Acyclic Graph) visualization for confounding
import networkx as nx
# Create a simple DAG
G = nx.DiGraph()
G.add_edges_from([
('Temperature', 'Ice Cream Sales'),
('Temperature', 'Drowning Incidents'),
('Ice Cream Sales', 'Drowning Incidents') # Spurious
])
# Remove the spurious edge for the true causal structure
G_true = nx.DiGraph()
G_true.add_edges_from([
('Temperature', 'Ice Cream Sales'),
('Temperature', 'Drowning Incidents')
])
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Observed correlation
pos = nx.spring_layout(G)
nx.draw(G, pos, ax=axes[0], with_labels=True, node_color='lightblue',
node_size=2000, font_size=10, font_weight='bold', arrows=True)
axes[0].set_title('Observed Correlation\n(Spurious Relationship)')
# True causal structure
pos = nx.spring_layout(G_true)
nx.draw(G_true, pos, ax=axes[1], with_labels=True, node_color='lightgreen',
node_size=2000, font_size=10, font_weight='bold', arrows=True)
axes[1].set_title('True Causal Structure\n(Temperature is Confounder)')
plt.tight_layout()
plt.show()
4. Establishing Causality
Bradford Hill Criteria
# Example: Smoking and Lung Cancer
cancer_data = {
'Year': [1950, 1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990],
'Smoking_Prevalence': [45, 48, 52, 55, 57, 56, 54, 50, 45],
'Lung_Cancer_Rate': [15, 22, 35, 48, 62, 70, 75, 73, 68]
}
df_cancer = pd.DataFrame(cancer_data)
# Calculate correlation
correlation = df_cancer['Smoking_Prevalence'].corr(df_cancer['Lung_Cancer_Rate'])
print("Bradford Hill Criteria Applied to Smoking and Lung Cancer:")
print("="*60)
print(f"1. Strength of Association: r = {correlation:.3f} (strong correlation)")
print("2. Consistency: Multiple studies across populations show same effect")
print("3. Specificity: Smoking specifically linked to lung cancer")
print("4. Temporality: Smoking precedes cancer development")
print("5. Biological Gradient: Dose-response relationship (more smoking = higher risk)")
print("6. Plausibility: Biological mechanisms identified")
print("7. Coherence: Consistent with existing knowledge")
print("8. Experiment: Animal studies and cessation studies confirm")
print("9. Analogy: Similar relationships with other carcinogens")
# Visualize
plt.figure(figsize=(10, 6))
plt.plot(df_cancer['Year'], df_cancer['Smoking_Prevalence'], 'b-o',
label='Smoking Prevalence (%)', linewidth=2)
plt.plot(df_cancer['Year'], df_cancer['Lung_Cancer_Rate'], 'r-s',
label='Lung Cancer Rate (per 100,000)', linewidth=2)
plt.xlabel('Year')
plt.ylabel('Rate')
plt.title('Temporal Relationship: Smoking Precedes Lung Cancer')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Granger Causality Test
# Granger causality test for time series
from statsmodels.tsa.stattools import grangercausalitytests
import pandas as pd
import numpy as np
# Generate time series with causal relationship
np.random.seed(42)
n = 200
# X causes Y with lag
x = np.random.randn(n)
y = np.zeros(n)
# Y depends on previous values of X
for t in range(1, n):
    y[t] = 0.5 * x[t-1] + 0.3 * y[t-1] + np.random.randn() * 0.5
# Create DataFrame
data = pd.DataFrame({'x': x, 'y': y})
# Test Granger causality (X causes Y)
print("Granger Causality Test (X → Y):")
print("="*50)
# Test with 4 lags
gc_result = grangercausalitytests(data[['y', 'x']], maxlag=4, verbose=False)
for lag in range(1, 5):
    p_value = gc_result[lag][0]['ssr_ftest'][1]
    print(f"Lag {lag}: F-test p-value = {p_value:.4f}")
    if p_value < 0.05:
        print(f"  → X Granger-causes Y at lag {lag} (p < 0.05)")
# Test reverse causality (Y causes X)
print("\nReverse Causality Test (Y → X):")
print("="*50)
gc_reverse = grangercausalitytests(data[['x', 'y']], maxlag=4, verbose=False)
for lag in range(1, 5):
    p_value = gc_reverse[lag][0]['ssr_ftest'][1]
    print(f"Lag {lag}: F-test p-value = {p_value:.4f}")
    if p_value < 0.05:
        print(f"  → Y Granger-causes X at lag {lag} (p < 0.05)")
5. Common Pitfalls in Correlation Analysis
Simpson's Paradox
# Demonstrating Simpson's Paradox
np.random.seed(42)
# Create two groups (e.g., treatment and control)
group_A_x = np.random.normal(50, 10, 100)
group_A_y = -1 * group_A_x + np.random.normal(0, 5, 100) + 200
group_B_x = np.random.normal(30, 10, 100)
group_B_y = -1 * group_B_x + np.random.normal(0, 5, 100) + 100
# Combined data
x_combined = np.concatenate([group_A_x, group_B_x])
y_combined = np.concatenate([group_A_y, group_B_y])
group_labels = ['A'] * 100 + ['B'] * 100
# Overall correlation
overall_corr = np.corrcoef(x_combined, y_combined)[0, 1]
# Within-group correlations
corr_A = np.corrcoef(group_A_x, group_A_y)[0, 1]
corr_B = np.corrcoef(group_B_x, group_B_y)[0, 1]
print("Simpson's Paradox Example:")
print(f"Overall correlation: {overall_corr:.3f}")
print(f"Within Group A correlation: {corr_A:.3f}")
print(f"Within Group B correlation: {corr_B:.3f}")
print("\nThe overall correlation is positive, but within each group there's a negative trend!")
print("This happens when there's a hidden grouping variable.")
# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(group_A_x, group_A_y, alpha=0.6, label='Group A', color='blue')
plt.scatter(group_B_x, group_B_y, alpha=0.6, label='Group B', color='red')
# Regression lines
z = np.polyfit(x_combined, y_combined, 1)
p = np.poly1d(z)
plt.plot(np.sort(x_combined), p(np.sort(x_combined)), 'k--',
linewidth=2, label=f'Overall (r = {overall_corr:.3f})')
zA = np.polyfit(group_A_x, group_A_y, 1)
pA = np.poly1d(zA)
plt.plot(np.sort(group_A_x), pA(np.sort(group_A_x)), 'b-',
linewidth=2, label=f'Group A (r = {corr_A:.3f})')
zB = np.polyfit(group_B_x, group_B_y, 1)
pB = np.poly1d(zB)
plt.plot(np.sort(group_B_x), pB(np.sort(group_B_x)), 'r-',
linewidth=2, label=f'Group B (r = {corr_B:.3f})')
plt.xlabel('X')
plt.ylabel('Y')
plt.title("Simpson's Paradox: Overall vs. Within-Group Correlations")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Ecological Fallacy
# Demonstrating ecological fallacy (aggregate vs. individual)
np.random.seed(42)
n_countries = 50
n_individuals_per_country = 100
# Generate country-level data
country_wealth = np.random.uniform(10000, 50000, n_countries)
country_health = 70 + 0.0005 * country_wealth + np.random.randn(n_countries) * 5
# Generate individual-level data (inverse relationship within countries)
all_individuals = []
for i in range(n_countries):
    # Within each country, richer individuals have WORSE health
    individual_wealth = np.random.uniform(country_wealth[i] * 0.5,
                                          country_wealth[i] * 1.5,
                                          n_individuals_per_country)
    # Country-level health sets the baseline; within each country the
    # wealth-health relationship is negative
    individual_health = (country_health[i] -
                         0.0003 * (individual_wealth - country_wealth[i]) +
                         np.random.randn(n_individuals_per_country) * 5)
    for w, h in zip(individual_wealth, individual_health):
        all_individuals.append({'country': i, 'wealth': w, 'health': h})
df_individuals = pd.DataFrame(all_individuals)
# Aggregate to country level
df_country = df_individuals.groupby('country').agg({
'wealth': 'mean',
'health': 'mean'
}).reset_index()
# Correlations
country_corr = df_country['wealth'].corr(df_country['health'])
# Pooling all individuals mixes the between-country trend with the
# within-country trend, so compare the average within-country correlation
within_corr = df_individuals.groupby('country').apply(
    lambda g: g['wealth'].corr(g['health'])).mean()
print("Ecological Fallacy Example:")
print(f"Country-level correlation: {country_corr:.3f}")
print(f"Average within-country correlation: {within_corr:.3f}")
print("\nAt the country level, wealthier countries have better health.")
print("But at the individual level, within each country, wealthier individuals have worse health!")
print("This is the ecological fallacy: conclusions about individuals drawn from group data.")
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Country-level plot
axes[0].scatter(df_country['wealth'], df_country['health'], alpha=0.6, s=100)
z = np.polyfit(df_country['wealth'], df_country['health'], 1)
p = np.poly1d(z)
axes[0].plot(sorted(df_country['wealth']), p(sorted(df_country['wealth'])),
'r-', linewidth=2)
axes[0].set_xlabel('Average Wealth per Country')
axes[0].set_ylabel('Average Health per Country')
axes[0].set_title(f'Country-Level (r = {country_corr:.3f})')
# Individual-level plot for one country
sample_country = df_individuals[df_individuals['country'] == 0]
axes[1].scatter(sample_country['wealth'], sample_country['health'], alpha=0.5)
z_ind = np.polyfit(sample_country['wealth'], sample_country['health'], 1)
p_ind = np.poly1d(z_ind)
axes[1].plot(sorted(sample_country['wealth']), p_ind(sorted(sample_country['wealth'])),
'r-', linewidth=2)
axes[1].set_xlabel('Individual Wealth')
axes[1].set_ylabel('Individual Health')
axes[1].set_title(f'Individual-Level (r = {np.corrcoef(sample_country["wealth"], sample_country["health"])[0,1]:.3f})')
plt.tight_layout()
plt.show()
6. Methods for Inferring Causality
Randomized Controlled Trials (RCT)
# Simulating an RCT
np.random.seed(42)
n_patients = 500
# Random assignment to treatment or control
treatment_group = np.random.choice([0, 1], n_patients, p=[0.5, 0.5])
# True treatment effect: +10 units
treatment_effect = 10
# Baseline outcome (noise)
baseline = np.random.normal(50, 15, n_patients)
# Outcome: baseline + treatment effect if in treatment group + noise
outcome = baseline + treatment_effect * treatment_group + np.random.randn(n_patients) * 10
# Analyze results
treatment_outcome = outcome[treatment_group == 1]
control_outcome = outcome[treatment_group == 0]
treatment_mean = np.mean(treatment_outcome)
control_mean = np.mean(control_outcome)
treatment_effect_observed = treatment_mean - control_mean
# T-test
t_stat, p_value = stats.ttest_ind(treatment_outcome, control_outcome)
print("Randomized Controlled Trial Results:")
print("="*50)
print(f"Treatment group mean: {treatment_mean:.2f}")
print(f"Control group mean: {control_mean:.2f}")
print(f"Estimated treatment effect: {treatment_effect_observed:.2f}")
print(f"True treatment effect: {treatment_effect}")
print(f"T-test p-value: {p_value:.4f}")
# Visualize
plt.figure(figsize=(10, 6))
plt.hist(treatment_outcome, alpha=0.5, label='Treatment', bins=30)
plt.hist(control_outcome, alpha=0.5, label='Control', bins=30)
plt.axvline(treatment_mean, color='blue', linestyle='--',
label=f'Treatment Mean = {treatment_mean:.1f}')
plt.axvline(control_mean, color='orange', linestyle='--',
label=f'Control Mean = {control_mean:.1f}')
plt.xlabel('Outcome')
plt.ylabel('Frequency')
plt.title('RCT Results: Treatment vs. Control')
plt.legend()
plt.show()
Instrumental Variables
# Instrumental Variables example
np.random.seed(42)
n = 1000
# Instrument (Z) affects treatment (X)
instrument = np.random.binomial(1, 0.5, n)
# The confounder must influence BOTH treatment and outcome to bias OLS
confounder = np.random.randn(n)
treatment = 0.7 * instrument + 0.5 * confounder + np.random.randn(n) * 0.3
treatment = (treatment > 0.5).astype(int)
# Treatment affects outcome (Y); the confounder also affects the outcome
treatment_effect = 5
outcome = 10 + treatment_effect * treatment + 2 * confounder + np.random.randn(n)
# OLS (biased)
ols_coef = np.corrcoef(treatment, outcome)[0, 1] * (np.std(outcome) / np.std(treatment))
# IV estimate (unbiased)
# First stage: treatment ~ instrument
first_stage_coef = np.corrcoef(instrument, treatment)[0, 1] * (np.std(treatment) / np.std(instrument))
# Reduced form: outcome ~ instrument
reduced_form_coef = np.corrcoef(instrument, outcome)[0, 1] * (np.std(outcome) / np.std(instrument))
# IV estimate = reduced form / first stage
iv_estimate = reduced_form_coef / first_stage_coef
print("Instrumental Variables Analysis:")
print("="*40)
print(f"OLS estimate (biased by confounder): {ols_coef:.4f}")
print(f"True treatment effect: {treatment_effect}")
print(f"IV estimate (unbiased): {iv_estimate:.4f}")
print("\nIV uses an instrument that affects treatment but not outcome directly,")
print("allowing estimation of causal effects even with unobserved confounders.")
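The ratio above is the Wald estimator for a single binary instrument. The same logic generalizes to two-stage least squares (2SLS): regress treatment on the instrument, then regress the outcome on the fitted values. A minimal numpy sketch with a continuous treatment (the coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

z = rng.binomial(1, 0.5, n).astype(float)             # instrument
u = rng.standard_normal(n)                            # unobserved confounder
x = 1.0 * z + 0.8 * u + rng.standard_normal(n)        # treatment
y = 2.0 + 5.0 * x + 3.0 * u + rng.standard_normal(n)  # outcome, true effect = 5

# Naive OLS slope is biased upward by the confounder
ols_slope = np.polyfit(x, y, 1)[0]

# Stage 1: project treatment onto the instrument
stage1 = np.polyfit(z, x, 1)
x_hat = np.polyval(stage1, z)

# Stage 2: regress outcome on the fitted treatment
iv_slope = np.polyfit(x_hat, y, 1)[0]

print(f"OLS: {ols_slope:.2f}, 2SLS: {iv_slope:.2f} (true effect: 5)")
```

OLS overstates the effect because `x` and `u` are correlated; 2SLS uses only the instrument-driven variation in `x` and recovers the true effect.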
Difference-in-Differences
# Difference-in-Differences example
np.random.seed(42)
n_groups = 100
n_periods = 2 # Before and after
# Treatment group (50) and control group (50)
treatment = np.array([1]*50 + [0]*50)
# Time periods
time = np.array([0, 1]) # 0 = before, 1 = after
# Generate data
data = []
for i in range(n_groups):
    for t in time:
        # Base outcome
        base = 50
        # Group-specific effect
        group_effect = 5 if treatment[i] else 0
        # Time effect
        time_effect = 2 * t
        # Treatment effect (only for treatment group after treatment)
        treatment_effect = 10 if (treatment[i] == 1 and t == 1) else 0
        outcome = base + group_effect + time_effect + treatment_effect + np.random.randn() * 5
        data.append({
            'group': i,
            'treatment': treatment[i],
            'time': t,
            'outcome': outcome
        })
df_did = pd.DataFrame(data)
# Calculate DiD
# Pre-treatment
pre_treat = df_did[(df_did['treatment'] == 1) & (df_did['time'] == 0)]['outcome'].mean()
pre_control = df_did[(df_did['treatment'] == 0) & (df_did['time'] == 0)]['outcome'].mean()
# Post-treatment
post_treat = df_did[(df_did['treatment'] == 1) & (df_did['time'] == 1)]['outcome'].mean()
post_control = df_did[(df_did['treatment'] == 0) & (df_did['time'] == 1)]['outcome'].mean()
# Differences
diff_treat = post_treat - pre_treat
diff_control = post_control - pre_control
did_estimate = diff_treat - diff_control
print("Difference-in-Differences Analysis:")
print("="*40)
print(f"Treatment group: {pre_treat:.2f} → {post_treat:.2f} (Δ = {diff_treat:.2f})")
print(f"Control group: {pre_control:.2f} → {post_control:.2f} (Δ = {diff_control:.2f})")
print(f"DiD estimate: {did_estimate:.2f}")
print(f"True treatment effect: 10")
# Visualize
plt.figure(figsize=(10, 6))
plt.plot([0, 1], [pre_treat, post_treat], 'b-o', label='Treatment Group', linewidth=2, markersize=8)
plt.plot([0, 1], [pre_control, post_control], 'r-s', label='Control Group', linewidth=2, markersize=8)
# Show differences
plt.plot([1, 1], [post_control, post_treat], 'g--', linewidth=2,
label=f'DiD Effect = {did_estimate:.1f}')
plt.xlabel('Time Period')
plt.ylabel('Outcome')
plt.title('Difference-in-Differences: Treatment Effect')
plt.xticks([0, 1], ['Before', 'After'])
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
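The same DiD estimate can be obtained from a single regression with a treatment × post-period interaction, which makes it easy to add controls and get standard errors. A sketch using the statsmodels formula API on freshly simulated data (column names mirror the example above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for unit in range(200):
    treated = int(unit < 100)
    for post in (0, 1):
        # true effect: +10 for treated units in the post period
        y = 50 + 5 * treated + 2 * post + 10 * treated * post + rng.normal(0, 5)
        rows.append({'treated': treated, 'post': post, 'y': y})
df = pd.DataFrame(rows)

model = smf.ols('y ~ treated * post', data=df).fit()
did = model.params['treated:post']   # the interaction term IS the DiD estimate
print(f"DiD estimate: {did:.2f} (true effect: 10)")
```

The `treated:post` coefficient equals the four-mean difference-in-differences computed by hand, and `model.bse['treated:post']` gives its standard error for free.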
7. Real-World Examples
Education and Income
# Simulating the education-income relationship with confounders
np.random.seed(42)
n = 1000
# Confounders: family background, cognitive ability
family_background = np.random.randn(n) # Higher = better background
cognitive_ability = np.random.randn(n) # Higher = smarter
# Education (years) influenced by background and ability
education = (12 +
2 * family_background +
2 * cognitive_ability +
np.random.randn(n) * 1.5)
# Income influenced by education and confounders
income = (30000 +
5000 * education +
10000 * family_background +
8000 * cognitive_ability +
np.random.randn(n) * 5000)
df = pd.DataFrame({
'education': education,
'income': income,
'family_background': family_background,
'cognitive_ability': cognitive_ability
})
# Correlations
edu_income_corr = df['education'].corr(df['income'])
# Regression analysis
import statsmodels.api as sm
# Model without confounders
X_simple = sm.add_constant(df['education'])
model_simple = sm.OLS(df['income'], X_simple).fit()
# Model with confounders
X_full = sm.add_constant(df[['education', 'family_background', 'cognitive_ability']])
model_full = sm.OLS(df['income'], X_full).fit()
print("Education and Income Analysis:")
print("="*50)
print(f"Simple correlation: r = {edu_income_corr:.3f}")
print("\nRegression Results (without confounders):")
print(f"Education coefficient: ${model_simple.params['education']:.0f}")
print("\nRegression Results (with confounders):")
print(f"Education coefficient: ${model_full.params['education']:.0f}")
print(f"Family background coefficient: ${model_full.params['family_background']:.0f}")
print(f"Cognitive ability coefficient: ${model_full.params['cognitive_ability']:.0f}")
print("\nInterpretation:")
print("After controlling for family background and cognitive ability,")
print("the effect of education on income is reduced. These factors")
print("are confounders that influence both education and income.")
Social Media and Mental Health
# Simulating the social media-mental health relationship
np.random.seed(42)
n = 500
# Generate data with reverse causality and confounding
# Mental health influences social media use (reverse causality)
baseline_mental_health = np.random.normal(50, 15, n)
# Social media use (hours/day) influenced by baseline mental health
social_media = (2 - 0.03 * (baseline_mental_health - 50) +
np.random.randn(n) * 0.5)
# Mental health outcome (influenced by social media and baseline)
mental_health_outcome = (baseline_mental_health -
2 * social_media +
np.random.randn(n) * 5)
df_sm = pd.DataFrame({
'baseline_mental_health': baseline_mental_health,
'social_media': social_media,
'mental_health_outcome': mental_health_outcome
})
# Correlation
corr_sm = df_sm['social_media'].corr(df_sm['mental_health_outcome'])
# Regression with and without baseline
import statsmodels.api as sm
X_simple = sm.add_constant(df_sm['social_media'])
model_simple = sm.OLS(df_sm['mental_health_outcome'], X_simple).fit()
X_full = sm.add_constant(df_sm[['social_media', 'baseline_mental_health']])
model_full = sm.OLS(df_sm['mental_health_outcome'], X_full).fit()
print("Social Media and Mental Health Analysis:")
print("="*55)
print(f"Simple correlation: r = {corr_sm:.3f}")
print("\nRegression Results (without baseline):")
print(f"Social media coefficient: {model_simple.params['social_media']:.3f}")
print("\nRegression Results (with baseline):")
print(f"Social media coefficient: {model_full.params['social_media']:.3f}")
print(f"Baseline mental health coefficient: {model_full.params['baseline_mental_health']:.3f}")
print("\nInterpretation:")
print("After controlling for baseline mental health, the negative effect")
print("of social media is reduced. People with poor mental health may")
print("use social media more (reverse causation) and also have worse")
print("outcomes regardless of social media use.")
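One way to probe directionality with observational data is a cross-lagged design: measure both variables at two time points and regress each wave-2 variable on both wave-1 variables. A hedged sketch on simulated two-wave data where, by construction, baseline mental health drives later social media use more strongly than the reverse (all path coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

mh1 = rng.standard_normal(n)               # mental health, wave 1
sm1 = -0.5 * mh1 + rng.standard_normal(n)  # social media use, wave 1
# Wave 2: strong MH -> SM path, weak SM -> MH path (by construction)
mh2 = 0.7 * mh1 - 0.1 * sm1 + 0.5 * rng.standard_normal(n)
sm2 = -0.4 * mh1 + 0.5 * sm1 + 0.5 * rng.standard_normal(n)

def ols(yv, cols):
    # OLS via least squares; returns [intercept, coef_1, coef_2, ...]
    X = np.column_stack([np.ones(n)] + cols)
    return np.linalg.lstsq(X, yv, rcond=None)[0]

sm_to_mh = ols(mh2, [mh1, sm1])[2]   # SM1 -> MH2, controlling for MH1
mh_to_sm = ols(sm2, [sm1, mh1])[2]   # MH1 -> SM2, controlling for SM1

print(f"SM -> MH path: {sm_to_mh:.2f}, MH -> SM path: {mh_to_sm:.2f}")
```

Comparing the two cross-lagged coefficients (each controlling for the outcome's own baseline) recovers the built-in asymmetry: the mental-health-to-social-media path is the larger one, matching the reverse-causation story above.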
8. Best Practices for Correlation and Causality Analysis
def correlation_causality_checklist(data, x, y, potential_confounders=None):
    """
    Comprehensive checklist for correlation-causality analysis
    """
    print("Correlation vs. Causality Analysis Checklist")
    print("="*60)
    # 1. Calculate and visualize correlation
    print("\n1. CORRELATION ANALYSIS:")
    corr = data[x].corr(data[y])
    print(f"   Pearson correlation (r) = {corr:.4f}")
    # 2. Visualize the relationship
    plt.figure(figsize=(8, 6))
    plt.scatter(data[x], data[y], alpha=0.5)
    plt.xlabel(x)
    plt.ylabel(y)
    plt.title(f'Scatter Plot: {x} vs. {y} (r = {corr:.3f})')
    # Add regression line
    z = np.polyfit(data[x], data[y], 1)
    p = np.poly1d(z)
    plt.plot(sorted(data[x]), p(sorted(data[x])), 'r-', linewidth=2)
    plt.grid(True, alpha=0.3)
    plt.show()
    # 3. Check for non-linearity
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    X = data[x].values.reshape(-1, 1)
    y_vals = data[y].values
    # Linear model
    model_linear = LinearRegression()
    model_linear.fit(X, y_vals)
    linear_r2 = model_linear.score(X, y_vals)
    # Polynomial model
    poly = PolynomialFeatures(degree=2)
    X_poly = poly.fit_transform(X)
    model_poly = LinearRegression()
    model_poly.fit(X_poly, y_vals)
    poly_r2 = model_poly.score(X_poly, y_vals)
    print(f"\n2. LINEARITY CHECK:")
    print(f"   Linear R² = {linear_r2:.4f}")
    print(f"   Quadratic R² = {poly_r2:.4f}")
    if poly_r2 > linear_r2 + 0.05:
        print("   ⚠ Potential non-linear relationship detected")
    # 4. Check for confounders
    if potential_confounders:
        print("\n3. CONFOUNDER ANALYSIS:")
        for confounder in potential_confounders:
            if confounder in data.columns:
                # Correlation with X
                corr_x = data[x].corr(data[confounder])
                # Correlation with Y
                corr_y = data[y].corr(data[confounder])
                print(f"   {confounder}:")
                print(f"     Correlation with {x}: {corr_x:.4f}")
                print(f"     Correlation with {y}: {corr_y:.4f}")
                # Partial correlation
                from scipy.stats import pearsonr
                # Residuals after removing confounder
                model_x = LinearRegression()
                model_x.fit(data[[confounder]], data[x])
                x_resid = data[x] - model_x.predict(data[[confounder]])
                model_y = LinearRegression()
                model_y.fit(data[[confounder]], data[y])
                y_resid = data[y] - model_y.predict(data[[confounder]])
                partial_corr = pearsonr(x_resid, y_resid)[0]
                print(f"     Partial correlation: {partial_corr:.4f}")
                if abs(partial_corr) < abs(corr):
                    print(f"     → {confounder} may be a confounder")
    # 5. Temporal considerations
    print("\n4. TEMPORALITY CHECK:")
    print("   Does X precede Y in time?")
    print("   → This is essential for establishing causality")
    # 6. Alternative explanations
    print("\n5. ALTERNATIVE EXPLANATIONS:")
    print("   • Reverse causation: Could Y cause X?")
    print("   • Selection bias: Is the sample representative?")
    print("   • Measurement error: Are variables measured accurately?")
    print("   • Random chance: Is the result statistically significant?")
    # 7. Summary
    print("\n6. SUMMARY:")
    print("   Correlation does NOT imply causation.")
    print("   Evidence for causality requires:")
    print("   • Strong, consistent association")
    print("   • Temporal precedence (X before Y)")
    print("   • No plausible confounders")
    print("   • Biological/mechanistic plausibility")
    print("   • Experimental evidence (when possible)")
# Example usage
df_sample = pd.DataFrame({
'x': np.random.randn(100),
'y': np.random.randn(100),
'z': np.random.randn(100)
})
correlation_causality_checklist(df_sample, 'x', 'y', ['z'])
9. Common Mistakes and How to Avoid Them
Mistake 1: Confusing Correlation with Causation
print("Common Mistake 1: Confusing Correlation with Causation")
print("="*50)
print("Example: Ice cream sales and drowning incidents are correlated.")
print("Solution: Always consider alternative explanations and confounders.")
print("Ask: Could there be a third variable causing both?")
Mistake 2: Ignoring Directionality
print("\nCommon Mistake 2: Ignoring Directionality")
print("="*50)
print("Example: Low self-esteem and depression are correlated.")
print("Question: Does low self-esteem cause depression, or does depression cause low self-esteem?")
print("Solution: Use temporal data, cross-lagged models, or experimental designs.")
Mistake 3: Overlooking Non-linear Relationships
print("\nCommon Mistake 3: Overlooking Non-linear Relationships")
print("="*50)
print("Example: A U-shaped relationship can have zero correlation.")
print("Solution: Always visualize your data before calculating correlations.")
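The zero-correlation U-shape takes only a few lines to verify, and fitting a quadratic shows the dependence is in fact perfect:

```python
import numpy as np

x = np.linspace(-3, 3, 201)
y = x ** 2                      # y is perfectly determined by x, yet...

r = np.corrcoef(x, y)[0, 1]     # ...Pearson r is (numerically) zero
coeffs = np.polyfit(x, y, 2)    # a quadratic fit recovers y exactly
resid = y - np.polyval(coeffs, x)

print(f"Pearson r = {r:.6f}, quadratic max residual = {np.abs(resid).max():.2e}")
```

Pearson correlation only measures linear association; a scatter plot (or a flexible fit) catches what the single number misses.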
Mistake 4: Simpson's Paradox
print("\nCommon Mistake 4: Simpson's Paradox")
print("="*50)
print("Example: An overall trend reverses when groups are examined separately.")
print("Solution: Always examine data at multiple levels of aggregation.")
Mistake 5: Ecological Fallacy
print("\nCommon Mistake 5: Ecological Fallacy")
print("="*50)
print("Example: Inferences about individuals based on group data.")
print("Solution: Use individual-level data when possible, or be cautious in interpretation.")
Conclusion
Understanding the distinction between correlation and causation is fundamental to data science:
Key Takeaways
- Correlation ≠ Causation: This is the most important principle in data analysis
- Confounders: Hidden variables can create spurious correlations
- Directionality: Determining which variable influences which is crucial
- Multiple Methods: Use RCTs, IV, DiD, and other causal inference methods
- Always Visualize: Scatter plots reveal patterns correlation coefficients miss
When Correlation is Useful
- Exploratory data analysis
- Feature selection for predictive models
- Hypothesis generation
- Measuring strength of relationships (not direction)
When Causality is Required
- Policy decisions (e.g., healthcare interventions)
- Business strategy (e.g., marketing effectiveness)
- Scientific understanding
- Resource allocation
Best Practices
- Always visualize relationships before interpreting correlations
- Consider confounders that could explain observed relationships
- Use multiple methods for causal inference when possible
- Be skeptical of claims based solely on correlation
- Communicate clearly about the limitations of your analysis
Remember: While correlation can suggest interesting relationships, establishing causality requires careful study design, appropriate methods, and rigorous analysis. In data science, it's always better to say "X is correlated with Y" than to claim "X causes Y" without proper evidence!