Why Statistics Matters in Data Science

Statistics provides the tools to:

  • Understand data through description and visualization
  • Make inferences about populations from samples
  • Quantify uncertainty in predictions and decisions
  • Validate models through hypothesis testing
  • Draw conclusions that are scientifically sound

"Statistics is the grammar of data science."

The Two Pillars of Statistics

┌────────────────────────────────────────────────────────────┐
│                         STATISTICS                         │
├────────────────────────────┬───────────────────────────────┤
│   DESCRIPTIVE STATISTICS   │    INFERENTIAL STATISTICS     │
│   Summarize & Describe     │    Draw Conclusions &         │
│   What's in the data?      │    Make Predictions           │
├────────────────────────────┼───────────────────────────────┤
│ • Mean, Median, Mode       │ • Hypothesis Testing          │
│ • Standard Deviation       │ • Confidence Intervals        │
│ • Percentiles              │ • Regression Analysis         │
│ • Correlation              │ • ANOVA                       │
│ • Visualizations           │ • Bayesian Inference          │
└────────────────────────────┴───────────────────────────────┘

Part 1: Descriptive Statistics

1.1 Measures of Central Tendency

Mean (Average) - Sum divided by count

import numpy as np
import pandas as pd
from scipy import stats
data = [23, 45, 67, 12, 89, 34, 56, 78, 91, 45]
# Mean
mean = np.mean(data)  # 54.0
# Sensitive to outliers
# Median - Middle value
median = np.median(data)  # 50.5
# Robust to outliers
# Mode - Most frequent value
mode = stats.mode(data, keepdims=False).mode  # 45
# Useful for categorical data
# When to use each:
# - Mean: Normally distributed, no outliers
# - Median: Skewed data, outliers present
# - Mode: Categorical data, most common value
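A small hypothetical example (salary figures are made up) shows why the median is preferred when outliers are present: a single extreme value drags the mean far away while the median barely moves.

```python
import numpy as np

# Hypothetical salaries (in $1,000s); one extreme value added
salaries = [40, 42, 45, 47, 50]
with_outlier = salaries + [500]

print(np.mean(salaries), np.median(salaries))          # 44.8 45.0
print(np.mean(with_outlier), np.median(with_outlier))  # mean jumps, median barely moves
```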

1.2 Measures of Spread (Dispersion)

# Range
data_range = max(data) - min(data)  # 79
# Variance - Average squared deviation from mean
variance = np.var(data, ddof=0)  # Population variance
sample_variance = np.var(data, ddof=1)  # Sample variance (n-1)
# Standard Deviation - Square root of variance
std_dev = np.std(data, ddof=1)  # ≈27.1
# Interquartile Range (IQR) - Range of middle 50%
Q1 = np.percentile(data, 25)  # 36.75
Q3 = np.percentile(data, 75)  # 75.25
IQR = Q3 - Q1  # 38.5
# Coefficient of Variation - Relative variability
cv = (std_dev / mean) * 100  # ≈50.3%

1.3 Shape of Distribution

import matplotlib.pyplot as plt
import seaborn as sns
# Generate different distributions
normal_data = np.random.normal(0, 1, 1000)
skewed_data = np.random.exponential(2, 1000)
bimodal_data = np.concatenate([np.random.normal(-3, 1, 500),
                               np.random.normal(3, 1, 500)])
# Skewness - Asymmetry
skewness = stats.skew(data)
# Positive skew: tail on right
# Negative skew: tail on left
# Zero: symmetric
# Kurtosis - Tail heaviness
kurtosis = stats.kurtosis(data)
# High kurtosis: heavy tails, outliers
# Low kurtosis: light tails
# Normal distribution: kurtosis = 0 (excess kurtosis)
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(normal_data, bins=30, edgecolor='black')
axes[0].set_title(f'Normal\nSkew: {stats.skew(normal_data):.2f}')
axes[1].hist(skewed_data, bins=30, edgecolor='black')
axes[1].set_title(f'Skewed Right\nSkew: {stats.skew(skewed_data):.2f}')
axes[2].hist(bimodal_data, bins=30, edgecolor='black')
axes[2].set_title('Bimodal')
plt.tight_layout()

1.4 Correlation and Covariance

# Covariance - Direction of linear relationship
x = np.random.normal(0, 1, 100)            # example paired data
y = 2 * x + np.random.normal(0, 0.5, 100)
covariance = np.cov(x, y)[0, 1]
# Positive: variables increase together
# Negative: one increases, other decreases
# Pearson Correlation - Strength of linear relationship
correlation, p_value = stats.pearsonr(x, y)
# Range: -1 to 1
# 0: no linear correlation
# ±1: perfect linear correlation
# Spearman Correlation - Monotonic relationship (rank-based)
spearman_corr, p_value = stats.spearmanr(x, y)
# More robust to outliers
# Correlation matrix (df: any DataFrame with numeric columns)
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')

Part 2: Probability Fundamentals

2.1 Basic Probability Concepts

# Probability rules
# P(A) = Number of favorable outcomes / Total outcomes
# Complement Rule: P(not A) = 1 - P(A)
# Addition Rule: P(A or B) = P(A) + P(B) - P(A and B)
# Multiplication Rule (independent): P(A and B) = P(A) * P(B)
# Conditional Probability: P(A|B) = P(A and B) / P(B)
# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
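A worked example of Bayes' theorem, using hypothetical diagnostic-test numbers (the prevalence, sensitivity, and false-positive rate below are illustrative, not from the text): even an accurate test yields a modest posterior when the condition is rare.

```python
# Hypothetical numbers: 1% prevalence, 99% sensitivity, 5% false-positive rate
p_disease = 0.01             # P(A): prior
p_pos_given_disease = 0.99   # P(B|A): sensitivity
p_pos_given_healthy = 0.05   # false-positive rate

# Total probability of a positive test: P(B)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # ≈ 0.167
```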

2.2 Probability Distributions

Discrete Distributions

# 1. Binomial Distribution
# Number of successes in n independent trials
from scipy.stats import binom
n, p = 10, 0.5  # 10 coin flips, probability of heads = 0.5
x = np.arange(0, n+1)
pmf = binom.pmf(x, n, p)
# Probability of exactly 6 heads
prob_6 = binom.pmf(6, n, p)  # 0.205
# Probability of 6 or more heads
prob_6plus = 1 - binom.cdf(5, n, p)  # 0.377
# 2. Poisson Distribution
# Number of events in fixed interval
from scipy.stats import poisson
lambda_param = 3  # Average 3 events per interval
prob_5 = poisson.pmf(5, lambda_param)  # P(exactly 5 events)
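For the same Poisson rate, the cumulative and survival functions answer "at most" and "more than" questions; a short sketch:

```python
from scipy.stats import poisson

lam = 3  # same average rate as above
p_at_most_5 = poisson.cdf(5, lam)   # P(X <= 5)
p_more_than_5 = poisson.sf(5, lam)  # P(X > 5), complement of the CDF
print(f"P(X<=5)={p_at_most_5:.3f}, P(X>5)={p_more_than_5:.3f}")
```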

Continuous Distributions

# 1. Normal (Gaussian) Distribution
from scipy.stats import norm
mu, sigma = 0, 1  # Standard normal
x = np.linspace(-4, 4, 100)
pdf = norm.pdf(x, mu, sigma)
cdf = norm.cdf(x, mu, sigma)
# Empirical Rule (68-95-99.7)
# 68% within 1 standard deviation
# 95% within 2 standard deviations
# 99.7% within 3 standard deviations
# Z-score: how many standard deviations a value lies from the mean
x_value = 1.5  # any observed value
z_score = (x_value - mu) / sigma
# 2. Uniform Distribution
from scipy.stats import uniform
# 3. Exponential Distribution (waiting times)
from scipy.stats import expon
# 4. t-Distribution (small samples)
from scipy.stats import t
# 5. Chi-Square Distribution
from scipy.stats import chi2
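The empirical rule quoted above can be verified directly from the standard normal CDF:

```python
from scipy.stats import norm

# P(mu - k*sigma < X < mu + k*sigma) for the standard normal
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sd: {p:.4f}")  # 0.6827, 0.9545, 0.9973
```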

2.3 Central Limit Theorem (CLT)

# CLT: Sampling distribution of the mean approaches normal
# regardless of population distribution, as sample size increases
# Demonstration
population = np.random.exponential(scale=2, size=100000)
sample_means = []
for _ in range(1000):
    sample = np.random.choice(population, size=30)
    sample_means.append(np.mean(sample))
# Distribution of sample means is approximately normal
plt.hist(sample_means, bins=30, density=True, alpha=0.7)
# Mean of sampling distribution ≈ population mean
print(f"Population mean: {np.mean(population):.3f}")
print(f"Mean of sample means: {np.mean(sample_means):.3f}")
# Standard error = σ / √n
standard_error = np.std(population) / np.sqrt(30)
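The standard-error formula can be checked against the simulation above: the spread of the resampled means should match σ/√n. A self-contained sketch (seed and resample counts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2, size=100_000)
n = 30

# Resample many times and record each sample mean
sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]

theoretical_se = population.std() / np.sqrt(n)  # sigma / sqrt(n)
empirical_se = np.std(sample_means)             # spread of the sample means
print(f"theoretical SE: {theoretical_se:.3f}, empirical SE: {empirical_se:.3f}")
```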

Part 3: Inferential Statistics

3.1 Sampling and Estimation

# Point Estimates
sample_mean = np.mean(sample)
sample_variance = np.var(sample, ddof=1)
# Confidence Intervals
from scipy.stats import norm, t
# For known population standard deviation
confidence_level = 0.95
z_critical = norm.ppf((1 + confidence_level) / 2)
margin_error = z_critical * (sigma / np.sqrt(n))
ci = [sample_mean - margin_error, sample_mean + margin_error]
# For unknown population standard deviation (use t-distribution)
sample_std = np.std(sample, ddof=1)
t_critical = t.ppf((1 + confidence_level) / 2, df=n-1)
margin_error = t_critical * (sample_std / np.sqrt(n))
ci = [sample_mean - margin_error, sample_mean + margin_error]

3.2 Hypothesis Testing

# Hypothesis Testing Framework
# H0 (Null): No effect/difference
# H1 (Alternative): There is an effect/difference
# α (Significance level): Type I error rate (usually 0.05)
# p-value: Probability of observing data if H0 is true
# One-Sample t-test
from scipy.stats import ttest_1samp
# Test if sample mean differs from population mean
t_stat, p_value = ttest_1samp(sample, population_mean)
if p_value < 0.05:
    print("Reject H0: Significant difference")
else:
    print("Fail to reject H0: No significant difference")
# Two-Sample t-test (independent)
from scipy.stats import ttest_ind
t_stat, p_value = ttest_ind(sample1, sample2)
# Paired t-test (before/after)
from scipy.stats import ttest_rel
t_stat, p_value = ttest_rel(before, after)
# ANOVA (comparing multiple groups)
from scipy.stats import f_oneway
f_stat, p_value = f_oneway(group1, group2, group3)
# Chi-Square Test (categorical data)
from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(df['category1'], df['category2'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

3.3 Common Statistical Tests

Test            | Use Case                             | Assumptions
----------------|--------------------------------------|-------------------------------------
t-test          | Compare means (2 groups)             | Normal distribution, equal variance
ANOVA           | Compare means (3+ groups)            | Normal distribution, equal variance
Chi-Square      | Categorical associations             | Expected frequencies ≥ 5
Mann-Whitney U  | Non-parametric alternative to t-test | Independent samples
Wilcoxon        | Non-parametric paired test           | Paired samples
Kruskal-Wallis  | Non-parametric ANOVA                 | Independent samples
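When the data are skewed and the t-test's normality assumption is doubtful, the Mann-Whitney U test from the table applies; a minimal sketch with synthetic exponential samples (seed and sample sizes are arbitrary):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
# Skewed samples: normality is a poor assumption here
group_a = rng.exponential(scale=1.0, size=40)
group_b = rng.exponential(scale=2.0, size=40)

# Rank-based test: no normality assumption needed
u_stat, p_u = mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f"U={u_stat:.1f}, p={p_u:.4f}")
```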

3.4 Type I and Type II Errors

# Type I Error (False Positive): Reject H0 when it's true
# Type II Error (False Negative): Fail to reject H0 when it's false
# Power = 1 - β (Probability of detecting true effect)
from statsmodels.stats.power import TTestIndPower
# Calculate required sample size
analysis = TTestIndPower()
sample_size = analysis.solve_power(
    effect_size=0.5,  # Medium effect
    alpha=0.05,       # Significance level
    power=0.80,       # Desired power
    ratio=1.0         # Equal sample sizes
)

Part 4: Advanced Statistical Concepts

4.1 Regression Analysis

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
# Simple Linear Regression
X = df['feature'].values.reshape(-1, 1)
y = df['target'].values
model = LinearRegression()
model.fit(X, y)
# Get statistics with statsmodels
X_with_const = sm.add_constant(X)
ols_model = sm.OLS(y, X_with_const).fit()
print(ols_model.summary())
# Key outputs:
# - R-squared: Proportion of variance explained
# - Coefficients: Slope and intercept
# - p-values: Significance of predictors
# - Confidence intervals: Range of coefficient estimates

4.2 Bayesian Statistics

# Bayesian Framework: Posterior ∝ Likelihood × Prior
# Simple Bayesian inference example
from scipy.stats import beta, binom
# Prior: Beta distribution (conjugate prior for binomial)
prior_alpha, prior_beta = 2, 2  # Weak prior favoring 0.5
# Likelihood: Observed data
successes, trials = 7, 10
# Posterior: Beta(prior_alpha + successes, prior_beta + trials - successes)
posterior_alpha = prior_alpha + successes
posterior_beta = prior_beta + trials - successes
# Credible interval (Bayesian confidence interval)
credible_interval = beta.interval(0.95, posterior_alpha, posterior_beta)
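Continuing the example above, the Beta(2+7, 2+3) = Beta(9, 5) posterior can be summarized directly:

```python
from scipy.stats import beta

# Beta(2 + 7, 2 + 3) = Beta(9, 5), as derived above
post_a, post_b = 9, 5
posterior_mean = post_a / (post_a + post_b)  # 9/14 ≈ 0.643
ci_low, ci_high = beta.interval(0.95, post_a, post_b)
print(f"mean={posterior_mean:.3f}, 95% credible interval=({ci_low:.3f}, {ci_high:.3f})")
```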

4.3 A/B Testing

def ab_test_analysis(control, treatment):
    """Analyze A/B test results."""
    from scipy.stats import ttest_ind
    # Descriptive statistics
    print(f"Control: n={len(control)}, mean={np.mean(control):.3f}")
    print(f"Treatment: n={len(treatment)}, mean={np.mean(treatment):.3f}")
    # Statistical test
    t_stat, p_value = ttest_ind(treatment, control)
    print(f"t-statistic: {t_stat:.3f}")
    print(f"p-value: {p_value:.4f}")
    # Effect size (Cohen's d)
    pooled_std = np.sqrt(((len(control) - 1) * np.var(control, ddof=1) +
                          (len(treatment) - 1) * np.var(treatment, ddof=1)) /
                         (len(control) + len(treatment) - 2))
    cohens_d = (np.mean(treatment) - np.mean(control)) / pooled_std
    # Lift calculation
    lift = ((np.mean(treatment) - np.mean(control)) / np.mean(control)) * 100
    # Recommendation
    if p_value < 0.05:
        if np.mean(treatment) > np.mean(control):
            result = f"Treatment wins! Lift: {lift:.1f}% (p={p_value:.4f})"
        else:
            result = f"Control wins! (p={p_value:.4f})"
    else:
        result = "No significant difference found"
    return {
        'p_value': p_value,
        'effect_size': cohens_d,
        'lift': lift,
        'result': result
    }

Part 5: Practical Applications in Data Science

5.1 Exploratory Data Analysis (EDA) Statistics

def comprehensive_eda_stats(df):
    """Generate comprehensive statistical summary"""
    # Basic statistics
    print("=== BASIC STATISTICS ===\n")
    print(df.describe(include='all'))
    # Missing values
    print("\n=== MISSING VALUES ===\n")
    missing = df.isnull().sum()
    print(missing[missing > 0])
    # Distribution statistics
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    print("\n=== DISTRIBUTION METRICS ===\n")
    for col in numeric_cols:
        data = df[col].dropna()
        print(f"{col}:")
        print(f"  Skewness: {stats.skew(data):.3f}")
        print(f"  Kurtosis: {stats.kurtosis(data):.3f}")
        # Normality test (Shapiro-Wilk is reliable only up to ~5000 samples)
        if len(data) < 5000:
            _, p_value = stats.shapiro(data)
            normal = "Yes" if p_value > 0.05 else "No"
            print(f"  Normally distributed: {normal} (p={p_value:.3f})")
    # Correlation analysis
    print("\n=== CORRELATION ANALYSIS ===\n")
    corr_matrix = df[numeric_cols].corr()
    # Find highly correlated pairs
    high_corr = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i + 1, len(corr_matrix.columns)):
            if abs(corr_matrix.iloc[i, j]) > 0.7:
                high_corr.append({
                    'pair': f"{corr_matrix.columns[i]} - {corr_matrix.columns[j]}",
                    'correlation': corr_matrix.iloc[i, j]
                })
    if high_corr:
        print("Highly correlated pairs (>0.7):")
        for item in high_corr:
            print(f"  {item['pair']}: {item['correlation']:.3f}")
    # Outlier detection (IQR method)
    print("\n=== OUTLIER DETECTION ===\n")
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = df[(df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)]
        print(f"{col}: {len(outliers)} outliers ({len(outliers) / len(df) * 100:.1f}%)")

5.2 Statistical Feature Selection

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
def statistical_feature_selection(X, y, method='f_stat', k=10):
    """Select features using statistical tests."""
    if method == 'f_stat':
        selector = SelectKBest(score_func=f_classif, k=k)
    elif method == 'mutual_info':
        selector = SelectKBest(score_func=mutual_info_classif, k=k)
    else:
        raise ValueError(f"Unknown method: {method}")
    X_selected = selector.fit_transform(X, y)
    # Get feature scores
    scores = pd.DataFrame({
        'feature': X.columns,
        'score': selector.scores_
    }).sort_values('score', ascending=False)
    print(f"Top {k} features:")
    print(scores.head(k))
    return X_selected, scores
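As a quick check of the same idea, here is a self-contained run of `SelectKBest` on synthetic data (the dataset and the feature names f0..f7 are made up for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 8 features, only 3 of them informative
X_arr, y = make_classification(n_samples=200, n_features=8,
                               n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selector = SelectKBest(score_func=f_classif, k=3)
X_sel = selector.fit_transform(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores.head(3))
```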

5.3 Statistical Assumptions for Models

def check_regression_assumptions(X, y, model):
    """Check key assumptions for linear regression."""
    from scipy.stats import shapiro
    # Fit model
    model.fit(X, y)
    predictions = model.predict(X)
    residuals = y - predictions
    # 1. Linearity (residuals vs fitted plot)
    plt.figure(figsize=(12, 3))
    plt.subplot(1, 3, 1)
    plt.scatter(predictions, residuals, alpha=0.5)
    plt.axhline(y=0, color='r', linestyle='--')
    plt.xlabel('Fitted Values')
    plt.ylabel('Residuals')
    plt.title('Linearity Check')
    # 2. Normality of residuals
    plt.subplot(1, 3, 2)
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.title('Q-Q Plot')
    # Shapiro-Wilk test
    _, p_value = shapiro(residuals[:5000])  # Limit for large datasets
    print(f"Normality test (p-value): {p_value:.4f}")
    # 3. Homoscedasticity (constant variance)
    plt.subplot(1, 3, 3)
    plt.scatter(predictions, np.abs(residuals), alpha=0.5)
    plt.xlabel('Fitted Values')
    plt.ylabel('|Residuals|')
    plt.title('Homoscedasticity Check')
    plt.tight_layout()
    # Breusch-Pagan test for heteroscedasticity
    # (implement or use statsmodels)
    return residuals

Part 6: Statistical Thinking in Data Science

Key Principles

  1. Correlation ≠ Causation
# Spurious correlations can mislead
# Always consider confounding variables
# Use randomized experiments when possible
  2. Statistical Significance vs. Practical Significance
# Large sample sizes can make trivial effects significant
# Always check effect size alongside p-values
effect_size = (mean_treatment - mean_control) / pooled_std
# Cohen's d: 0.2 (small), 0.5 (medium), 0.8 (large)
  3. The Danger of p-hacking
# Multiple testing increases Type I error
# Adjust significance level: α_adjusted = α / n_tests
from statsmodels.stats.multitest import multipletests
reject, adjusted_p_values, _, _ = multipletests(p_values, method='bonferroni')
  4. Sample Size Considerations
# Larger isn't always better
# Consider: effect size, variability, desired power
# Balance statistical power with practical constraints

Part 7: Essential Statistical Formulas

Quick Reference

Concept             | Formula
--------------------|-----------------------------------
Mean                | μ = Σx / N
Variance            | σ² = Σ(x - μ)² / N
Standard Deviation  | σ = √σ²
Z-score             | z = (x - μ) / σ
Correlation         | r = Σ((x - μₓ)(y - μᵧ)) / (nσₓσᵧ)
Standard Error      | SE = σ / √n
Confidence Interval | CI = x̄ ± z(α/2) × (σ/√n)
t-statistic         | t = (x̄ - μ) / (s/√n)
Chi-Square          | χ² = Σ(O - E)² / E
Bayes' Theorem      | P(A|B) = P(B|A)P(A) / P(B)
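The confidence-interval formula in the table can be sanity-checked numerically, using the small dataset from Part 1 and the t-distribution (since σ is unknown, t replaces z):

```python
import numpy as np
from scipy import stats

sample = np.array([23, 45, 67, 12, 89, 34, 56, 78, 91, 45])
n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)

# By the formula: CI = x̄ ± t(α/2) · s/√n
t_crit = stats.t.ppf(0.975, df=n - 1)
manual = (x_bar - t_crit * s / np.sqrt(n), x_bar + t_crit * s / np.sqrt(n))

# Same interval straight from scipy
direct = stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=stats.sem(sample))
print(manual, direct)
```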

Conclusion: Statistics Mindset for Data Scientists

# The statistical mindset:
# 1. Always question data quality and collection methods
# 2. Understand the assumptions behind your tests
# 3. Quantify uncertainty in all estimates
# 4. Consider practical significance, not just statistical
# 5. Visualize everything before testing
# 6. Be skeptical of your own conclusions
# 7. Reproducibility is key

Key Takeaway: Statistics is not just a set of formulas—it's a framework for thinking about data, uncertainty, and decision-making. Master these concepts not to memorize equations, but to develop the intuition needed to extract reliable insights from messy, real-world data.
