Why Statistics Matters in Data Science
Statistics provides the tools to:
- Understand data through description and visualization
- Make inferences about populations from samples
- Quantify uncertainty in predictions and decisions
- Validate models through hypothesis testing
- Draw conclusions that are scientifically sound
"Statistics is the grammar of data science."
The Two Pillars of Statistics
```
┌─────────────────────────────────────────────────────────────┐
│                         STATISTICS                          │
├──────────────────────────────┬──────────────────────────────┤
│   DESCRIPTIVE STATISTICS     │   INFERENTIAL STATISTICS     │
│   Summarize & Describe       │   Draw Conclusions &         │
│   What's in the data?        │   Make Predictions           │
├──────────────────────────────┼──────────────────────────────┤
│ • Mean, Median, Mode         │ • Hypothesis Testing         │
│ • Standard Deviation         │ • Confidence Intervals       │
│ • Percentiles                │ • Regression Analysis        │
│ • Correlation                │ • ANOVA                      │
│ • Visualizations             │ • Bayesian Inference         │
└──────────────────────────────┴──────────────────────────────┘
```
Part 1: Descriptive Statistics
1.1 Measures of Central Tendency
Mean (Average) - Sum divided by count
```python
import numpy as np
import pandas as pd
from scipy import stats

data = [23, 45, 67, 12, 89, 34, 56, 78, 91, 45]

# Mean - sensitive to outliers
mean = np.mean(data)              # 54.0

# Median - middle value, robust to outliers
median = np.median(data)          # 50.5

# Mode - most frequent value, useful for categorical data
mode = stats.mode(data).mode      # 45

# When to use each:
# - Mean:   normally distributed data, no outliers
# - Median: skewed data or outliers present
# - Mode:   categorical data, most common value
```
1.2 Measures of Spread (Dispersion)
```python
# Range
data_range = max(data) - min(data)       # 79

# Variance - average squared deviation from the mean
variance = np.var(data, ddof=0)          # population variance (divides by n)
sample_variance = np.var(data, ddof=1)   # sample variance (divides by n-1)

# Standard deviation - square root of the variance
std_dev = np.std(data, ddof=1)           # ~27.1

# Interquartile range (IQR) - spread of the middle 50%
Q1 = np.percentile(data, 25)             # 36.75
Q3 = np.percentile(data, 75)             # 75.25
IQR = Q3 - Q1                            # 38.5

# Coefficient of variation - variability relative to the mean
cv = (std_dev / mean) * 100              # ~50.3%
```
1.3 Shape of Distribution
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Generate different distributions
normal_data = np.random.normal(0, 1, 1000)
skewed_data = np.random.exponential(2, 1000)
bimodal_data = np.concatenate([np.random.normal(-3, 1, 500),
                               np.random.normal(3, 1, 500)])

# Skewness - asymmetry
skewness = stats.skew(data)
# Positive skew: tail on the right
# Negative skew: tail on the left
# Zero: symmetric

# Kurtosis - tail heaviness (scipy reports excess kurtosis)
kurtosis = stats.kurtosis(data)
# High kurtosis: heavy tails, more outliers
# Low kurtosis: light tails
# Normal distribution: excess kurtosis = 0

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(normal_data, bins=30, edgecolor='black')
axes[0].set_title(f'Normal\nSkew: {stats.skew(normal_data):.2f}')
axes[1].hist(skewed_data, bins=30, edgecolor='black')
axes[1].set_title(f'Skewed Right\nSkew: {stats.skew(skewed_data):.2f}')
axes[2].hist(bimodal_data, bins=30, edgecolor='black')
axes[2].set_title('Bimodal')
plt.tight_layout()
```
1.4 Correlation and Covariance
```python
# (x and y are two numeric arrays of equal length; df is a DataFrame)

# Covariance - direction of a linear relationship
covariance = np.cov(x, y)[0, 1]
# Positive: variables increase together
# Negative: one increases as the other decreases

# Pearson correlation - strength of a linear relationship
correlation, p_value = stats.pearsonr(x, y)
# Range: -1 to 1
# 0: no linear correlation
# ±1: perfect linear correlation

# Spearman correlation - monotonic relationship (rank-based)
spearman_corr, p_value = stats.spearmanr(x, y)
# More robust to outliers

# Correlation matrix
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
```
Part 2: Probability Fundamentals
2.1 Basic Probability Concepts
```python
# Probability rules
# P(A) = favorable outcomes / total outcomes

# Complement rule:                   P(not A) = 1 - P(A)
# Addition rule:                     P(A or B) = P(A) + P(B) - P(A and B)
# Multiplication rule (independent): P(A and B) = P(A) * P(B)
# Conditional probability:           P(A|B) = P(A and B) / P(B)
# Bayes' theorem:                    P(A|B) = P(B|A) * P(A) / P(B)
```
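Bayes' theorem is easiest to internalize with numbers. A minimal sketch, using made-up disease-screening rates chosen purely for illustration:

```python
# Hypothetical screening test:
#   P(disease) = 0.01, P(positive | disease) = 0.95,
#   P(positive | no disease) = 0.05  (all rates invented for this example)
p_d = 0.01
p_pos_given_d = 0.95
p_pos_given_not_d = 0.05

# Law of total probability: overall chance of a positive result
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' theorem: probability of disease given a positive result
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(f"{p_d_given_pos:.3f}")  # ~0.161: most positives are false positives
```

Despite a 95% sensitive test, a positive result here implies only a ~16% chance of disease, because the base rate is so low — exactly the kind of intuition Bayes' theorem guards against.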
2.2 Probability Distributions
Discrete Distributions
```python
# 1. Binomial distribution - number of successes in n independent trials
from scipy.stats import binom

n, p = 10, 0.5                       # 10 coin flips, P(heads) = 0.5
x = np.arange(0, n + 1)
pmf = binom.pmf(x, n, p)

# Probability of exactly 6 heads
prob_6 = binom.pmf(6, n, p)          # 0.205

# Probability of 6 or more heads
prob_6plus = 1 - binom.cdf(5, n, p)  # 0.377

# 2. Poisson distribution - number of events in a fixed interval
from scipy.stats import poisson

lambda_param = 3                     # average of 3 events per interval
prob_5 = poisson.pmf(5, lambda_param)  # P(exactly 5 events)
```
Continuous Distributions
```python
# 1. Normal (Gaussian) distribution
from scipy.stats import norm

mu, sigma = 0, 1                     # standard normal
x = np.linspace(-4, 4, 100)
pdf = norm.pdf(x, mu, sigma)
cdf = norm.cdf(x, mu, sigma)

# Empirical rule (68-95-99.7):
#   ~68%   of values within 1 standard deviation
#   ~95%   within 2 standard deviations
#   ~99.7% within 3 standard deviations

# Z-score: how many standard deviations an observation lies from the mean
z_score = (x_value - mu) / sigma     # x_value: the observation of interest

# 2. Uniform distribution
from scipy.stats import uniform
# 3. Exponential distribution (waiting times)
from scipy.stats import expon
# 4. t-distribution (small samples)
from scipy.stats import t
# 5. Chi-square distribution
from scipy.stats import chi2
```
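The empirical rule quoted above can be checked directly against the normal CDF; a quick sketch:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sd: {coverage:.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```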
2.3 Central Limit Theorem (CLT)
```python
# CLT: the sampling distribution of the mean approaches a normal
# distribution as sample size increases, regardless of the
# population's own distribution

# Demonstration
population = np.random.exponential(scale=2, size=100000)

sample_means = []
for _ in range(1000):
    sample = np.random.choice(population, size=30)
    sample_means.append(np.mean(sample))

# The distribution of sample means is approximately normal
plt.hist(sample_means, bins=30, density=True, alpha=0.7)

# Mean of the sampling distribution ≈ population mean
print(f"Population mean: {np.mean(population):.3f}")
print(f"Mean of sample means: {np.mean(sample_means):.3f}")

# Standard error of the mean = σ / √n
standard_error = np.std(population) / np.sqrt(30)
```
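The σ/√n formula can be verified against the simulation itself: the standard deviation of the simulated sample means should land close to the predicted standard error. A self-contained sketch (seeded so the result is reproducible):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2, size=100_000)

# Draw many samples of size n and record each sample's mean
n = 30
sample_means = [rng.choice(population, size=n).mean() for _ in range(2000)]

# Spread of the sample means vs. the CLT prediction sigma / sqrt(n)
predicted_se = population.std() / np.sqrt(n)
observed_se = np.std(sample_means)
print(predicted_se, observed_se)  # the two values should be close
```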
Part 3: Inferential Statistics
3.1 Sampling and Estimation
```python
# Point estimates (sample: the observed data; sigma: population std, if known)
sample_mean = np.mean(sample)
sample_variance = np.var(sample, ddof=1)
sample_std = np.std(sample, ddof=1)
n = len(sample)

# Confidence intervals
from scipy.stats import norm, t

confidence_level = 0.95

# Known population standard deviation: use the normal distribution
z_critical = norm.ppf((1 + confidence_level) / 2)
margin_error = z_critical * (sigma / np.sqrt(n))
ci = [sample_mean - margin_error, sample_mean + margin_error]

# Unknown population standard deviation: use the t-distribution
t_critical = t.ppf((1 + confidence_level) / 2, df=n - 1)
margin_error = t_critical * (sample_std / np.sqrt(n))
ci = [sample_mean - margin_error, sample_mean + margin_error]
```
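`scipy.stats.t.interval` wraps the t-based interval in one call. A concrete sketch, reusing the ten-point dataset from Part 1:

```python
import numpy as np
from scipy import stats

sample = np.array([23, 45, 67, 12, 89, 34, 56, 78, 91, 45])
n = len(sample)
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# 95% CI via the t-distribution (population sigma unknown)
low, high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(f"95% CI: ({low:.1f}, {high:.1f})")
```

Note how wide the interval is with only n=10 observations: small samples buy little certainty.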
3.2 Hypothesis Testing
```python
# Hypothesis testing framework
# H0 (null):        no effect / no difference
# H1 (alternative): there is an effect / difference
# α (significance level): Type I error rate (commonly 0.05)
# p-value: probability of data at least this extreme, assuming H0 is true

# One-sample t-test: does the sample mean differ from a known value?
from scipy.stats import ttest_1samp

t_stat, p_value = ttest_1samp(sample, population_mean)
if p_value < 0.05:
    print("Reject H0: significant difference")
else:
    print("Fail to reject H0: no significant difference")

# Two-sample t-test (independent groups)
from scipy.stats import ttest_ind
t_stat, p_value = ttest_ind(sample1, sample2)

# Paired t-test (before/after on the same units)
from scipy.stats import ttest_rel
t_stat, p_value = ttest_rel(before, after)

# ANOVA (comparing three or more group means)
from scipy.stats import f_oneway
f_stat, p_value = f_oneway(group1, group2, group3)

# Chi-square test of independence (categorical data)
from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(df['category1'], df['category2'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
```
3.3 Common Statistical Tests
| Test | Use Case | Assumptions |
|---|---|---|
| t-test | Compare means (2 groups) | Normal distribution, equal variance |
| ANOVA | Compare means (3+ groups) | Normal distribution, equal variance |
| Chi-Square | Categorical associations | Expected frequencies ≥ 5 |
| Mann-Whitney U | Non-parametric alternative to t-test | Independent samples |
| Wilcoxon | Non-parametric paired test | Paired samples |
| Kruskal-Wallis | Non-parametric ANOVA | Independent samples |
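When the normality assumption is doubtful, the non-parametric tests in the table are near drop-in replacements. A minimal sketch with invented, deliberately skewed data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
# Two log-normal (right-skewed) groups - t-test assumptions are shaky here
group_a = rng.lognormal(mean=0.0, sigma=0.5, size=40)
group_b = rng.lognormal(mean=0.4, sigma=0.5, size=40)

# Mann-Whitney U compares the groups via ranks; no normality assumed
u_stat, p_value = mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f"U = {u_stat:.0f}, p = {p_value:.4f}")
```

The rank-based test trades a little power (when data really are normal) for robustness to skew and outliers.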
3.4 Type I and Type II Errors
```python
# Type I error (false positive):  reject H0 when it is actually true
# Type II error (false negative): fail to reject H0 when it is actually false
# Power = 1 - β: probability of detecting a true effect

from statsmodels.stats.power import TTestIndPower

# Required sample size per group for a two-sample t-test
analysis = TTestIndPower()
sample_size = analysis.solve_power(
    effect_size=0.5,   # medium effect (Cohen's d)
    alpha=0.05,        # significance level
    power=0.80,        # desired power
    ratio=1.0,         # equal group sizes
)
```
Part 4: Advanced Statistical Concepts
4.1 Regression Analysis
```python
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Simple linear regression
X = df['feature'].values.reshape(-1, 1)
y = df['target'].values

model = LinearRegression()
model.fit(X, y)

# Full inferential output with statsmodels
X_with_const = sm.add_constant(X)
ols_model = sm.OLS(y, X_with_const).fit()
print(ols_model.summary())

# Key outputs:
# - R-squared: proportion of variance explained
# - Coefficients: intercept and slope
# - p-values: significance of each predictor
# - Confidence intervals: plausible range for each coefficient
```
4.2 Bayesian Statistics
```python
# Bayesian framework: Posterior ∝ Likelihood × Prior

# Simple Bayesian inference for a success probability
from scipy.stats import beta

# Prior: Beta distribution (conjugate prior for the binomial likelihood)
prior_alpha, prior_beta = 2, 2       # weak prior centered on 0.5

# Likelihood: observed data
successes, trials = 7, 10

# Posterior: Beta(prior_alpha + successes, prior_beta + failures)
posterior_alpha = prior_alpha + successes
posterior_beta = prior_beta + (trials - successes)

# 95% credible interval (the Bayesian analogue of a confidence interval)
credible_interval = beta.interval(0.95, posterior_alpha, posterior_beta)
```
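Continuing the conjugate example above: the Beta posterior has a closed-form mean, α/(α+β), which makes the effect of the prior visible — it shrinks the raw success rate toward 0.5. A quick sketch:

```python
from scipy.stats import beta

prior_alpha, prior_beta = 2, 2
successes, trials = 7, 10
post_a = prior_alpha + successes             # 9
post_b = prior_beta + (trials - successes)   # 5

posterior_mean = post_a / (post_a + post_b)  # 9/14 ≈ 0.643
mle = successes / trials                     # 0.700 (estimate with no prior)
print(posterior_mean, mle)

low, high = beta.interval(0.95, post_a, post_b)
print(f"95% credible interval: ({low:.3f}, {high:.3f})")
```

With more data the likelihood dominates and the posterior mean converges to the raw rate; with little data the prior keeps the estimate conservative.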
4.3 A/B Testing
```python
def ab_test_analysis(control, treatment):
    """Analyze A/B test results."""
    from scipy.stats import ttest_ind

    # Descriptive statistics
    print(f"Control:   n={len(control)}, mean={np.mean(control):.3f}")
    print(f"Treatment: n={len(treatment)}, mean={np.mean(treatment):.3f}")

    # Statistical test
    t_stat, p_value = ttest_ind(treatment, control)
    print(f"t-statistic: {t_stat:.3f}")
    print(f"p-value: {p_value:.4f}")

    # Effect size (Cohen's d with pooled standard deviation)
    pooled_std = np.sqrt(((len(control) - 1) * np.var(control, ddof=1) +
                          (len(treatment) - 1) * np.var(treatment, ddof=1)) /
                         (len(control) + len(treatment) - 2))
    cohens_d = (np.mean(treatment) - np.mean(control)) / pooled_std

    # Lift: relative change of the treatment mean over the control mean
    lift = ((np.mean(treatment) - np.mean(control)) / np.mean(control)) * 100

    # Recommendation
    if p_value < 0.05:
        if np.mean(treatment) > np.mean(control):
            result = f"Treatment wins! Lift: {lift:.1f}% (p={p_value:.4f})"
        else:
            result = f"Control wins! (p={p_value:.4f})"
    else:
        result = "No significant difference found"

    return {
        'p_value': p_value,
        'effect_size': cohens_d,
        'lift': lift,
        'result': result,
    }
```
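The same quantities can be exercised on synthetic data (all values below are invented; the true +0.5 shift is planted so there is a real effect to detect):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
control = rng.normal(loc=10.0, scale=2.0, size=500)    # baseline metric
treatment = rng.normal(loc=10.5, scale=2.0, size=500)  # +0.5 true shift

t_stat, p_value = ttest_ind(treatment, control)

# Cohen's d with the same pooled standard deviation as ab_test_analysis
pooled_std = np.sqrt(((len(control) - 1) * control.var(ddof=1) +
                      (len(treatment) - 1) * treatment.var(ddof=1)) /
                     (len(control) + len(treatment) - 2))
cohens_d = (treatment.mean() - control.mean()) / pooled_std
lift = (treatment.mean() - control.mean()) / control.mean() * 100

print(f"p={p_value:.4f}, d={cohens_d:.2f}, lift={lift:.1f}%")
```

With σ = 2 and a 0.5 shift, the true Cohen's d is 0.25 — a "small" effect that still shows up clearly at n = 500 per arm, illustrating why effect size must be read alongside the p-value.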
Part 5: Practical Applications in Data Science
5.1 Exploratory Data Analysis (EDA) Statistics
```python
def comprehensive_eda_stats(df):
    """Generate a comprehensive statistical summary of a DataFrame."""
    # Basic statistics
    print("=== BASIC STATISTICS ===\n")
    print(df.describe(include='all'))

    # Missing values
    print("\n=== MISSING VALUES ===\n")
    missing = df.isnull().sum()
    print(missing[missing > 0])

    # Distribution statistics
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    print("\n=== DISTRIBUTION METRICS ===\n")
    for col in numeric_cols:
        data = df[col].dropna()
        print(f"{col}:")
        print(f"  Skewness: {stats.skew(data):.3f}")
        print(f"  Kurtosis: {stats.kurtosis(data):.3f}")
        # Normality test (Shapiro-Wilk is unreliable above ~5000 samples)
        if len(data) < 5000:
            _, p_value = stats.shapiro(data)
            normal = "Yes" if p_value > 0.05 else "No"
            print(f"  Normally distributed: {normal} (p={p_value:.3f})")

    # Correlation analysis
    print("\n=== CORRELATION ANALYSIS ===\n")
    corr_matrix = df[numeric_cols].corr()

    # Find highly correlated pairs
    high_corr = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i + 1, len(corr_matrix.columns)):
            if abs(corr_matrix.iloc[i, j]) > 0.7:
                high_corr.append({
                    'pair': f"{corr_matrix.columns[i]} - {corr_matrix.columns[j]}",
                    'correlation': corr_matrix.iloc[i, j],
                })
    if high_corr:
        print("Highly correlated pairs (|r| > 0.7):")
        for item in high_corr:
            print(f"  {item['pair']}: {item['correlation']:.3f}")

    # Outlier detection (IQR method)
    print("\n=== OUTLIER DETECTION ===\n")
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = df[(df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)]
        print(f"{col}: {len(outliers)} outliers "
              f"({len(outliers) / len(df) * 100:.1f}%)")
```
5.2 Statistical Feature Selection
```python
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

def statistical_feature_selection(X, y, method='f_stat', k=10):
    """Select the k best features using univariate statistical tests."""
    if method == 'f_stat':
        selector = SelectKBest(score_func=f_classif, k=k)
    elif method == 'mutual_info':
        selector = SelectKBest(score_func=mutual_info_classif, k=k)
    else:
        raise ValueError(f"Unknown method: {method}")

    X_selected = selector.fit_transform(X, y)

    # Rank features by score
    scores = pd.DataFrame({
        'feature': X.columns,
        'score': selector.scores_,
    }).sort_values('score', ascending=False)

    print(f"Top {k} features:")
    print(scores.head(k))
    return X_selected, scores
```
5.3 Statistical Assumptions for Models
```python
def check_regression_assumptions(X, y, model):
    """Check key assumptions for linear regression."""
    from scipy.stats import shapiro

    # Fit model
    model.fit(X, y)
    predictions = model.predict(X)
    residuals = y - predictions

    # 1. Linearity (residuals vs. fitted values should show no pattern)
    plt.figure(figsize=(12, 3))
    plt.subplot(1, 3, 1)
    plt.scatter(predictions, residuals, alpha=0.5)
    plt.axhline(y=0, color='r', linestyle='--')
    plt.xlabel('Fitted Values')
    plt.ylabel('Residuals')
    plt.title('Linearity Check')

    # 2. Normality of residuals
    plt.subplot(1, 3, 2)
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.title('Q-Q Plot')

    # Shapiro-Wilk test (capped at 5000 points for large datasets)
    _, p_value = shapiro(residuals[:5000])
    print(f"Normality test (p-value): {p_value:.4f}")

    # 3. Homoscedasticity (constant residual variance)
    plt.subplot(1, 3, 3)
    plt.scatter(predictions, np.abs(residuals), alpha=0.5)
    plt.xlabel('Fitted Values')
    plt.ylabel('|Residuals|')
    plt.title('Homoscedasticity Check')
    plt.tight_layout()

    # For a formal heteroscedasticity test, see
    # statsmodels.stats.diagnostic.het_breuschpagan
    return residuals
```
Part 6: Statistical Thinking in Data Science
Key Principles
- Correlation ≠ Causation
```python
# Spurious correlations can mislead
# Always consider confounding variables
# Use randomized experiments when possible
```
- Statistical Significance vs. Practical Significance
```python
# Large sample sizes can make trivial effects statistically significant
# Always check effect size alongside p-values
effect_size = (mean_treatment - mean_control) / pooled_std
# Cohen's d: 0.2 (small), 0.5 (medium), 0.8 (large)
```
- The Danger of p-hacking
```python
# Multiple testing inflates the Type I error rate
# Bonferroni correction: α_adjusted = α / n_tests
from statsmodels.stats.multitest import multipletests
reject, adjusted_p_values, _, _ = multipletests(p_values, method='bonferroni')
```
- Sample Size Considerations
```python
# Larger isn't always better
# Consider: effect size, variability, desired power
# Balance statistical power with practical constraints
```
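To make the trade-off concrete, the power of a two-sample comparison can be approximated with the normal distribution. A sketch (this approximation ignores the t-correction, so it is slightly optimistic for small n):

```python
import numpy as np
from scipy.stats import norm

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Normal approximation to two-sample, two-sided test power."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_effect = effect_size * np.sqrt(n_per_group / 2)
    return norm.cdf(z_effect - z_alpha)

# How power grows with sample size for a medium effect (d = 0.5)
for n in (20, 50, 100, 200):
    print(f"n={n:>3} per group -> power ≈ {approx_power(0.5, n):.2f}")
```

The returns diminish quickly: going from n = 100 to n = 200 buys far less power than going from 20 to 50, which is why sample-size planning starts from the smallest effect worth detecting.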
Part 7: Essential Statistical Formulas
Quick Reference
| Concept | Formula |
|---|---|
| Mean | μ = Σx / N |
| Variance | σ² = Σ(x - μ)² / N |
| Standard Deviation | σ = √σ² |
| Z-score | z = (x - μ) / σ |
| Correlation | r = Σ((x-μₓ)(y-μᵧ)) / (nσₓσᵧ) |
| Standard Error | SE = σ / √n |
| Confidence Interval | CI = x̄ ± z(α/2) × (σ/√n) |
| t-statistic | t = (x̄ - μ) / (s/√n) |
| Chi-Square | χ² = Σ(O - E)² / E |
| Bayes' Theorem | P(A|B) = P(B|A)P(A) / P(B) |
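As a sanity check, several of these formulas can be verified numerically against NumPy and SciPy; a small sketch using a textbook-style dataset chosen so the results come out to round numbers:

```python
import numpy as np
from scipy.stats import norm

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.sum() / len(x)                  # mean: 5.0
var = ((x - mu) ** 2).sum() / len(x)   # population variance: 4.0
sigma = np.sqrt(var)                   # standard deviation: 2.0

z = (9.0 - mu) / sigma                 # z-score of the value 9.0: 2.0

se = sigma / np.sqrt(len(x))           # standard error
z_crit = norm.ppf(0.975)               # ≈ 1.96 for a 95% interval
ci = (mu - z_crit * se, mu + z_crit * se)
print(mu, sigma, z, ci)
```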
Conclusion: Statistics Mindset for Data Scientists
```python
# The statistical mindset:
# 1. Always question data quality and collection methods
# 2. Understand the assumptions behind your tests
# 3. Quantify uncertainty in all estimates
# 4. Consider practical significance, not just statistical significance
# 5. Visualize everything before testing
# 6. Be skeptical of your own conclusions
# 7. Reproducibility is key
```
Key Takeaway: Statistics is not just a set of formulas—it's a framework for thinking about data, uncertainty, and decision-making. Master these concepts not to memorize equations, but to develop the intuition needed to extract reliable insights from messy, real-world data.