Why Statistics Matters in Data Science
Statistics provides the tools to:
- Understand data through description and visualization
- Make inferences about populations from samples
- Quantify uncertainty in predictions and decisions
- Validate models through hypothesis testing
- Draw conclusions that are scientifically sound
"Statistics is the grammar of data science."
The Two Pillars of Statistics
```
┌─────────────────────────────────────────────────────────────┐
│                         STATISTICS                          │
├────────────────────────────┬────────────────────────────────┤
│   DESCRIPTIVE STATISTICS   │    INFERENTIAL STATISTICS      │
│   Summarize & Describe     │    Draw Conclusions &          │
│   What's in the data?      │    Make Predictions            │
├────────────────────────────┼────────────────────────────────┤
│ • Mean, Median, Mode       │ • Hypothesis Testing           │
│ • Standard Deviation       │ • Confidence Intervals         │
│ • Percentiles              │ • Regression Analysis          │
│ • Correlation              │ • ANOVA                        │
│ • Visualizations           │ • Bayesian Inference           │
└────────────────────────────┴────────────────────────────────┘
```
Part 1: Descriptive Statistics
1.1 Measures of Central Tendency
Mean (Average) - Sum divided by count
```python
import numpy as np
import pandas as pd
from scipy import stats

data = [23, 45, 67, 12, 89, 34, 56, 78, 91, 45]

# Mean - sensitive to outliers
mean = np.mean(data)  # 54.0

# Median - middle value, robust to outliers
median = np.median(data)  # 50.5

# Mode - most frequent value, useful for categorical data
mode = stats.mode(data).mode  # 45

# When to use each:
# - Mean: normally distributed data, no outliers
# - Median: skewed data, outliers present
# - Mode: categorical data, most common value
```
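The outlier sensitivity noted in the comments is easy to demonstrate: append one extreme value to the same data and compare how the mean and median react.

```python
import numpy as np

values = np.array([23, 45, 67, 12, 89, 34, 56, 78, 91, 45])
with_outlier = np.append(values, 1000)  # one extreme value

# The mean shifts dramatically; the median barely moves
print(np.mean(values), np.mean(with_outlier))      # 54.0 → 140.0
print(np.median(values), np.median(with_outlier))  # 50.5 → 56.0
```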
1.2 Measures of Spread (Dispersion)
```python
# Range
data_range = max(data) - min(data)  # 79

# Variance - average squared deviation from the mean
variance = np.var(data, ddof=0)         # Population variance
sample_variance = np.var(data, ddof=1)  # Sample variance (n-1)

# Standard Deviation - square root of variance
std_dev = np.std(data, ddof=1)  # ≈27.1

# Interquartile Range (IQR) - range of the middle 50%
Q1 = np.percentile(data, 25)  # 36.75
Q3 = np.percentile(data, 75)  # 75.25
IQR = Q3 - Q1                 # 38.5

# Coefficient of Variation - relative variability
cv = (std_dev / mean) * 100  # ≈50.3%
```
1.3 Shape of Distribution
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Generate different distributions
normal_data = np.random.normal(0, 1, 1000)
skewed_data = np.random.exponential(2, 1000)
bimodal_data = np.concatenate([np.random.normal(-3, 1, 500),
                               np.random.normal(3, 1, 500)])

# Skewness - asymmetry
skewness = stats.skew(data)
# Positive skew: tail on the right
# Negative skew: tail on the left
# Zero: symmetric

# Kurtosis - tail heaviness
kurtosis = stats.kurtosis(data)
# High kurtosis: heavy tails, outliers
# Low kurtosis: light tails
# Normal distribution: kurtosis = 0 (excess kurtosis)

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(normal_data, bins=30, edgecolor='black')
axes[0].set_title(f'Normal\nSkew: {stats.skew(normal_data):.2f}')
axes[1].hist(skewed_data, bins=30, edgecolor='black')
axes[1].set_title(f'Skewed Right\nSkew: {stats.skew(skewed_data):.2f}')
axes[2].hist(bimodal_data, bins=30, edgecolor='black')
axes[2].set_title('Bimodal')
plt.tight_layout()
```
1.4 Correlation and Covariance
```python
# Example data (illustrative)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2, 4, 5, 4, 6, 8, 7, 9])

# Covariance - direction of the linear relationship
covariance = np.cov(x, y)[0, 1]
# Positive: variables increase together
# Negative: one increases while the other decreases

# Pearson Correlation - strength of the linear relationship
correlation, p_value = stats.pearsonr(x, y)
# Range: -1 to 1
# 0: no linear correlation
# ±1: perfect linear correlation

# Spearman Correlation - monotonic relationship (rank-based)
spearman_corr, p_value = stats.spearmanr(x, y)
# More robust to outliers

# Correlation matrix (df: a pandas DataFrame of numeric columns)
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
```
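To make the Pearson/Spearman distinction concrete, here is a small sketch (with made-up data) where the relationship is perfectly monotonic but far from linear:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 21, dtype=float)
y = x ** 3  # strictly increasing, but strongly nonlinear

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)

print(f"Pearson r:    {r:.3f}")    # well below 1: the relationship is not linear
print(f"Spearman rho: {rho:.3f}")  # 1.000: the ranks agree perfectly
```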
Part 2: Probability Fundamentals
2.1 Basic Probability Concepts
```python
# Probability rules
# P(A) = Number of favorable outcomes / Total outcomes

# Complement Rule:         P(not A) = 1 - P(A)
# Addition Rule:           P(A or B) = P(A) + P(B) - P(A and B)
# Multiplication Rule
#   (independent events):  P(A and B) = P(A) * P(B)
# Conditional Probability: P(A|B) = P(A and B) / P(B)
# Bayes' Theorem:          P(A|B) = P(B|A) * P(A) / P(B)
```
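As a worked example of Bayes' theorem, consider a hypothetical screening test. All numbers below are illustrative, not real-world rates:

```python
p_d = 0.01            # P(disease): 1% prevalence (illustrative)
p_pos_given_d = 0.95  # Sensitivity: P(positive | disease)
p_pos_given_h = 0.05  # False-positive rate: P(positive | healthy)

# Law of total probability: P(positive)
p_pos = p_pos_given_d * p_d + p_pos_given_h * (1 - p_d)

# Bayes' theorem: P(disease | positive)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(f"P(disease | positive) = {p_d_given_pos:.3f}")  # ≈ 0.161
```

Despite the 95% sensitivity, most positives are false because the disease is rare: exactly the kind of conclusion the conditional-probability rules above make precise.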
2.2 Probability Distributions
Discrete Distributions
```python
# 1. Binomial Distribution
# Number of successes in n independent trials
from scipy.stats import binom

n, p = 10, 0.5  # 10 coin flips, probability of heads = 0.5
x = np.arange(0, n + 1)
pmf = binom.pmf(x, n, p)

# Probability of exactly 6 heads
prob_6 = binom.pmf(6, n, p)  # 0.205

# Probability of 6 or more heads
prob_6plus = 1 - binom.cdf(5, n, p)  # 0.377

# 2. Poisson Distribution
# Number of events in a fixed interval
from scipy.stats import poisson

lambda_param = 3  # Average of 3 events per interval
prob_5 = poisson.pmf(5, lambda_param)  # P(exactly 5 events)
```
Continuous Distributions
```python
# 1. Normal (Gaussian) Distribution
from scipy.stats import norm

mu, sigma = 0, 1  # Standard normal
x = np.linspace(-4, 4, 100)
pdf = norm.pdf(x, mu, sigma)
cdf = norm.cdf(x, mu, sigma)

# Empirical Rule (68-95-99.7)
# 68% within 1 standard deviation
# 95% within 2 standard deviations
# 99.7% within 3 standard deviations

# Z-score for an observed value x_value
z_score = (x_value - mu) / sigma

# 2. Uniform Distribution
from scipy.stats import uniform

# 3. Exponential Distribution (waiting times)
from scipy.stats import expon

# 4. t-Distribution (small samples)
from scipy.stats import t

# 5. Chi-Square Distribution
from scipy.stats import chi2
```
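The 68-95-99.7 rule quoted above can be verified directly from the normal CDF:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within ±{k}σ: {p:.4f}")
# within ±1σ: 0.6827
# within ±2σ: 0.9545
# within ±3σ: 0.9973
```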
2.3 Central Limit Theorem (CLT)
```python
# CLT: the sampling distribution of the mean approaches normal,
# regardless of the population distribution, as sample size increases

# Demonstration
population = np.random.exponential(scale=2, size=100000)
sample_means = []
for _ in range(1000):
    sample = np.random.choice(population, size=30)
    sample_means.append(np.mean(sample))

# Distribution of sample means is approximately normal
plt.hist(sample_means, bins=30, density=True, alpha=0.7)

# Mean of the sampling distribution ≈ population mean
print(f"Population mean: {np.mean(population):.3f}")
print(f"Mean of sample means: {np.mean(sample_means):.3f}")

# Standard error = σ / √n
standard_error = np.std(population) / np.sqrt(30)
```
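A seeded variant of the same demonstration, checking that the spread of the sample means matches the predicted standard error σ/√n (the seed and counts are arbitrary choices for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2, size=100_000)
n = 30

sample_means = np.array([rng.choice(population, size=n).mean()
                         for _ in range(2000)])

predicted_se = population.std() / np.sqrt(n)  # σ / √n
observed_se = sample_means.std()              # spread of the sample means
print(f"predicted SE ≈ {predicted_se:.3f}, observed ≈ {observed_se:.3f}")
```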
Part 3: Inferential Statistics
3.1 Sampling and Estimation
```python
# Point estimates
sample_mean = np.mean(sample)
sample_variance = np.var(sample, ddof=1)

# Confidence intervals
from scipy.stats import norm, t

# Known population standard deviation (sigma): use the normal distribution
confidence_level = 0.95
z_critical = norm.ppf((1 + confidence_level) / 2)
margin_error = z_critical * (sigma / np.sqrt(n))
ci = [sample_mean - margin_error, sample_mean + margin_error]

# Unknown population standard deviation: use the t-distribution
t_critical = t.ppf((1 + confidence_level) / 2, df=n - 1)
margin_error = t_critical * (sample_std / np.sqrt(n))
ci = [sample_mean - margin_error, sample_mean + margin_error]
```
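A concrete run of the t-based interval on a simulated sample (the distribution parameters and seed are made up for illustration):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(42)
sample = rng.normal(loc=100, scale=15, size=25)  # simulated measurements
n = len(sample)

sample_mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)  # standard error of the mean
t_critical = t.ppf(0.975, df=n - 1)    # two-sided 95%

ci = (sample_mean - t_critical * sem, sample_mean + t_critical * sem)
print(f"mean = {sample_mean:.1f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")

# scipy computes the same interval in one call:
ci_scipy = t.interval(0.95, df=n - 1, loc=sample_mean, scale=sem)
```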
3.2 Hypothesis Testing
```python
# Hypothesis testing framework
# H0 (Null): no effect/difference
# H1 (Alternative): there is an effect/difference
# α (Significance level): Type I error rate (usually 0.05)
# p-value: probability of observing data at least this extreme if H0 is true

# One-Sample t-test
from scipy.stats import ttest_1samp

# Test whether the sample mean differs from a hypothesized population mean
t_stat, p_value = ttest_1samp(sample, population_mean)
if p_value < 0.05:
    print("Reject H0: Significant difference")
else:
    print("Fail to reject H0: No significant difference")

# Two-Sample t-test (independent)
from scipy.stats import ttest_ind
t_stat, p_value = ttest_ind(sample1, sample2)

# Paired t-test (before/after)
from scipy.stats import ttest_rel
t_stat, p_value = ttest_rel(before, after)

# ANOVA (comparing multiple groups)
from scipy.stats import f_oneway
f_stat, p_value = f_oneway(group1, group2, group3)

# Chi-Square test (categorical data)
from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(df['category1'], df['category2'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
```
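To see the framework end to end, here is a two-sample t-test on simulated groups (the group sizes, means, and seed are illustrative):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
group_a = rng.normal(loc=50, scale=10, size=200)
group_b = rng.normal(loc=56, scale=10, size=200)  # true difference of 6

t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the group means differ")
else:
    print("Fail to reject H0")
```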
3.3 Common Statistical Tests
| Test | Use Case | Assumptions |
|---|---|---|
| t-test | Compare means (2 groups) | Normal distribution, equal variance |
| ANOVA | Compare means (3+ groups) | Normal distribution, equal variance |
| Chi-Square | Categorical associations | Expected frequencies ≥ 5 |
| Mann-Whitney U | Non-parametric alternative to t-test | Independent samples |
| Wilcoxon | Non-parametric paired test | Paired samples |
| Kruskal-Wallis | Non-parametric ANOVA | Independent samples |
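As an example of the non-parametric route from the table, here is a Mann-Whitney U test on skewed (exponential) samples, where the t-test's normality assumption is questionable. The data are simulated for illustration:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)
sample_a = rng.exponential(scale=1.0, size=200)
sample_b = rng.exponential(scale=2.0, size=200)  # shifted distribution

u_stat, p_value = mannwhitneyu(sample_a, sample_b, alternative='two-sided')
print(f"U = {u_stat:.0f}, p = {p_value:.2e}")
```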
3.4 Type I and Type II Errors
```python
# Type I Error (False Positive): reject H0 when it is true
# Type II Error (False Negative): fail to reject H0 when it is false
# Power = 1 - β (probability of detecting a true effect)

from statsmodels.stats.power import TTestIndPower

# Calculate the required sample size per group
analysis = TTestIndPower()
sample_size = analysis.solve_power(
    effect_size=0.5,  # Medium effect
    alpha=0.05,       # Significance level
    power=0.80,       # Desired power
    ratio=1.0         # Equal sample sizes
)
```
Part 4: Advanced Statistical Concepts
4.1 Regression Analysis
```python
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Simple linear regression (df: a pandas DataFrame)
X = df['feature'].values.reshape(-1, 1)
y = df['target'].values

model = LinearRegression()
model.fit(X, y)

# Get full statistics with statsmodels
X_with_const = sm.add_constant(X)
ols_model = sm.OLS(y, X_with_const).fit()
print(ols_model.summary())

# Key outputs:
# - R-squared: proportion of variance explained
# - Coefficients: slope and intercept
# - p-values: significance of predictors
# - Confidence intervals: range of coefficient estimates
```
4.2 Bayesian Statistics
```python
# Bayesian framework: Posterior ∝ Likelihood × Prior

# Simple Bayesian inference example
from scipy.stats import beta

# Prior: Beta distribution (conjugate prior for the binomial)
prior_alpha, prior_beta = 2, 2  # Weak prior favoring 0.5

# Likelihood: observed data
successes, trials = 7, 10

# Posterior: Beta(prior_alpha + successes, prior_beta + trials - successes)
posterior_alpha = prior_alpha + successes
posterior_beta = prior_beta + trials - successes

# Credible interval (the Bayesian analogue of a confidence interval)
credible_interval = beta.interval(0.95, posterior_alpha, posterior_beta)
```
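Continuing that example, the posterior Beta(9, 5) can be summarized directly:

```python
from scipy.stats import beta

posterior = beta(2 + 7, 2 + 3)  # Beta(9, 5): Beta(2, 2) prior + 7 successes in 10 trials

print(f"Posterior mean: {posterior.mean():.3f}")  # 9/14 ≈ 0.643
lo, hi = posterior.interval(0.95)
print(f"95% credible interval: ({lo:.3f}, {hi:.3f})")
```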
4.3 A/B Testing
```python
def ab_test_analysis(control, treatment):
    """Analyze A/B test results."""
    from scipy.stats import ttest_ind

    # Descriptive statistics
    print(f"Control:   n={len(control)}, mean={np.mean(control):.3f}")
    print(f"Treatment: n={len(treatment)}, mean={np.mean(treatment):.3f}")

    # Statistical test
    t_stat, p_value = ttest_ind(treatment, control)
    print(f"t-statistic: {t_stat:.3f}")
    print(f"p-value: {p_value:.4f}")

    # Effect size (Cohen's d)
    pooled_std = np.sqrt(((len(control) - 1) * np.var(control, ddof=1) +
                          (len(treatment) - 1) * np.var(treatment, ddof=1)) /
                         (len(control) + len(treatment) - 2))
    cohens_d = (np.mean(treatment) - np.mean(control)) / pooled_std

    # Lift calculation
    lift = ((np.mean(treatment) - np.mean(control)) / np.mean(control)) * 100

    # Recommendation
    if p_value < 0.05:
        if np.mean(treatment) > np.mean(control):
            result = f"Treatment wins! Lift: {lift:.1f}% (p={p_value:.4f})"
        else:
            result = f"Control wins! (p={p_value:.4f})"
    else:
        result = "No significant difference found"

    return {
        'p_value': p_value,
        'effect_size': cohens_d,
        'lift': lift,
        'result': result
    }
```
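A standalone run of the same calculations on simulated data (the effect size, sample sizes, and seed are all made up for illustration):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
control = rng.normal(loc=0.10, scale=0.05, size=2000)
treatment = rng.normal(loc=0.11, scale=0.05, size=2000)  # ~10% relative lift

t_stat, p_value = ttest_ind(treatment, control)
lift = (treatment.mean() - control.mean()) / control.mean() * 100

# Cohen's d simplifies with equal group sizes
pooled_std = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_std

print(f"p = {p_value:.4f}, lift = {lift:.1f}%, d = {cohens_d:.2f}")
```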
Part 5: Practical Applications in Data Science
5.1 Exploratory Data Analysis (EDA) Statistics
```python
def comprehensive_eda_stats(df):
    """Generate a comprehensive statistical summary."""
    # Basic statistics
    print("=== BASIC STATISTICS ===\n")
    print(df.describe(include='all'))

    # Missing values
    print("\n=== MISSING VALUES ===\n")
    missing = df.isnull().sum()
    print(missing[missing > 0])

    # Distribution statistics
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    print("\n=== DISTRIBUTION METRICS ===\n")
    for col in numeric_cols:
        data = df[col].dropna()
        print(f"{col}:")
        print(f"  Skewness: {stats.skew(data):.3f}")
        print(f"  Kurtosis: {stats.kurtosis(data):.3f}")

        # Normality test (Shapiro-Wilk, reliable up to ~5000 samples)
        if len(data) < 5000:
            _, p_value = stats.shapiro(data)
            normal = "Yes" if p_value > 0.05 else "No"
            print(f"  Normally distributed: {normal} (p={p_value:.3f})")

    # Correlation analysis
    print("\n=== CORRELATION ANALYSIS ===\n")
    corr_matrix = df[numeric_cols].corr()

    # Find highly correlated pairs
    high_corr = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i + 1, len(corr_matrix.columns)):
            if abs(corr_matrix.iloc[i, j]) > 0.7:
                high_corr.append({
                    'pair': f"{corr_matrix.columns[i]} - {corr_matrix.columns[j]}",
                    'correlation': corr_matrix.iloc[i, j]
                })
    if high_corr:
        print("Highly correlated pairs (>0.7):")
        for item in high_corr:
            print(f"  {item['pair']}: {item['correlation']:.3f}")

    # Outlier detection (IQR method)
    print("\n=== OUTLIER DETECTION ===\n")
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = df[(df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)]
        print(f"{col}: {len(outliers)} outliers ({len(outliers) / len(df) * 100:.1f}%)")
```
5.2 Statistical Feature Selection
```python
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

def statistical_feature_selection(X, y, method='f_stat', k=10):
    """Select features using statistical methods."""
    if method == 'f_stat':
        selector = SelectKBest(score_func=f_classif, k=k)
    elif method == 'mutual_info':
        selector = SelectKBest(score_func=mutual_info_classif, k=k)
    else:
        raise ValueError(f"Unknown method: {method}")

    X_selected = selector.fit_transform(X, y)

    # Get feature scores
    scores = pd.DataFrame({
        'feature': X.columns,
        'score': selector.scores_
    }).sort_values('score', ascending=False)

    print(f"Top {k} features:")
    print(scores.head(k))
    return X_selected, scores
```
5.3 Statistical Assumptions for Models
```python
def check_regression_assumptions(X, y, model):
    """Check key assumptions for linear regression."""
    from scipy.stats import shapiro

    # Fit model
    model.fit(X, y)
    predictions = model.predict(X)
    residuals = y - predictions

    # 1. Linearity (residuals vs fitted plot)
    plt.figure(figsize=(12, 3))
    plt.subplot(1, 3, 1)
    plt.scatter(predictions, residuals, alpha=0.5)
    plt.axhline(y=0, color='r', linestyle='--')
    plt.xlabel('Fitted Values')
    plt.ylabel('Residuals')
    plt.title('Linearity Check')

    # 2. Normality of residuals
    plt.subplot(1, 3, 2)
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.title('Q-Q Plot')

    # Shapiro-Wilk test (limit sample size for large datasets)
    _, p_value = shapiro(residuals[:5000])
    print(f"Normality test (p-value): {p_value:.4f}")

    # 3. Homoscedasticity (constant variance)
    plt.subplot(1, 3, 3)
    plt.scatter(predictions, np.abs(residuals), alpha=0.5)
    plt.xlabel('Fitted Values')
    plt.ylabel('|Residuals|')
    plt.title('Homoscedasticity Check')
    plt.tight_layout()

    # For a formal heteroscedasticity check, see
    # statsmodels.stats.diagnostic.het_breuschpagan
    return residuals
```
Part 6: Statistical Thinking in Data Science
Key Principles
- Correlation ≠ Causation

```python
# Spurious correlations can mislead
# Always consider confounding variables
# Use randomized experiments when possible
```

- Statistical Significance vs. Practical Significance

```python
# Large sample sizes can make trivial effects significant
# Always check effect size alongside p-values
effect_size = (mean_treatment - mean_control) / pooled_std
# Cohen's d: 0.2 (small), 0.5 (medium), 0.8 (large)
```

- The Danger of p-hacking

```python
# Multiple testing inflates the Type I error rate
# Adjust the significance level, e.g. Bonferroni: α_adjusted = α / n_tests
from statsmodels.stats.multitest import multipletests
reject, adjusted_p_values, _, _ = multipletests(p_values, method='bonferroni')
```

- Sample Size Considerations

```python
# Larger isn't always better
# Consider: effect size, variability, desired power
# Balance statistical power with practical constraints
```
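The multiple-testing point above can be made concrete with a manual Bonferroni adjustment (the p-values below are invented for illustration):

```python
import numpy as np

p_values = np.array([0.005, 0.04, 0.03, 0.20, 0.50])  # five hypothetical tests
alpha = 0.05

# Bonferroni: scale each p-value by the number of tests (capped at 1)
adjusted = np.minimum(p_values * len(p_values), 1.0)
significant = adjusted < alpha

print(adjusted)
print(significant)  # only the first test survives the correction
```

Note that 0.04 and 0.03 would each look "significant" on their own, but neither survives correction across five tests.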
Part 7: Essential Statistical Formulas
Quick Reference
| Concept | Formula |
|---|---|
| Mean | μ = Σx / N |
| Variance | σ² = Σ(x - μ)² / N |
| Standard Deviation | σ = √σ² |
| Z-score | z = (x - μ) / σ |
| Correlation | r = Σ((x-μₓ)(y-μᵧ)) / (nσₓσᵧ) |
| Standard Error | SE = σ / √n |
| Confidence Interval | CI = x̄ ± z(α/2) × (σ/√n) |
| t-statistic | t = (x̄ - μ) / (s/√n) |
| Chi-Square | χ² = Σ(O - E)² / E |
| Bayes' Theorem | P(A|B) = P(B|A)P(A) / P(B) |
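The formulas in the table use population denominators (N); the sketch below checks two of them numerically against NumPy's built-ins, using a tiny made-up dataset:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])
n = len(x)

# Z-score: z = (x - μ) / σ  → standardized values have mean 0, std 1
z = (x - x.mean()) / x.std()

# Correlation: r = Σ((x - μx)(y - μy)) / (n σx σy)
r_formula = np.sum((x - x.mean()) * (y - y.mean())) / (n * x.std() * y.std())
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_formula, r_numpy)  # agree up to floating-point error
```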
Conclusion: Statistics Mindset for Data Scientists
```python
# The statistical mindset:
# 1. Always question data quality and collection methods
# 2. Understand the assumptions behind your tests
# 3. Quantify uncertainty in all estimates
# 4. Consider practical significance, not just statistical significance
# 5. Visualize everything before testing
# 6. Be skeptical of your own conclusions
# 7. Reproducibility is key
```
Key Takeaway: Statistics is not just a set of formulas—it's a framework for thinking about data, uncertainty, and decision-making. Master these concepts not to memorize equations, but to develop the intuition needed to extract reliable insights from messy, real-world data.