Why Statistics Matters in Data Science
Statistics provides the tools to:
- Understand data through description and visualization
- Make inferences about populations from samples
- Quantify uncertainty in predictions and decisions
- Validate models through hypothesis testing
- Draw conclusions that are scientifically sound
"Statistics is the grammar of data science."
The Two Pillars of Statistics
```
┌─────────────────────────────────────────────────────────────┐
│                         STATISTICS                          │
├──────────────────────────────┬──────────────────────────────┤
│   DESCRIPTIVE STATISTICS     │   INFERENTIAL STATISTICS     │
│   Summarize & Describe       │   Draw Conclusions &         │
│   What's in the data?        │   Make Predictions           │
├──────────────────────────────┼──────────────────────────────┤
│ • Mean, Median, Mode         │ • Hypothesis Testing         │
│ • Standard Deviation         │ • Confidence Intervals       │
│ • Percentiles                │ • Regression Analysis        │
│ • Correlation                │ • ANOVA                      │
│ • Visualizations             │ • Bayesian Inference         │
└──────────────────────────────┴──────────────────────────────┘
```
Part 1: Descriptive Statistics
1.1 Measures of Central Tendency
Mean (Average) - Sum divided by count
```python
import numpy as np
import pandas as pd
from scipy import stats

data = [23, 45, 67, 12, 89, 34, 56, 78, 91, 45]

# Mean - sensitive to outliers
mean = np.mean(data)              # 54.0

# Median - middle value, robust to outliers
median = np.median(data)          # 50.5

# Mode - most frequent value, useful for categorical data
mode = stats.mode(data).mode      # 45

# When to use each:
# - Mean:   normally distributed data, no outliers
# - Median: skewed data or outliers present
# - Mode:   categorical data, most common value
```
1.2 Measures of Spread (Dispersion)
```python
# Range
data_range = max(data) - min(data)       # 79

# Variance - average squared deviation from the mean
variance = np.var(data, ddof=0)          # population variance (divides by n)
sample_variance = np.var(data, ddof=1)   # sample variance (divides by n-1)

# Standard deviation - square root of the variance
std_dev = np.std(data, ddof=1)           # ~27.1

# Interquartile range (IQR) - spread of the middle 50%
Q1 = np.percentile(data, 25)             # 36.75
Q3 = np.percentile(data, 75)             # 75.25
IQR = Q3 - Q1                            # 38.5

# Coefficient of variation - variability relative to the mean
cv = (std_dev / mean) * 100              # ~50.3%
```
1.3 Shape of Distribution
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Generate different distributions
normal_data = np.random.normal(0, 1, 1000)
skewed_data = np.random.exponential(2, 1000)
bimodal_data = np.concatenate([np.random.normal(-3, 1, 500),
                               np.random.normal(3, 1, 500)])

# Skewness - asymmetry
skewness = stats.skew(data)
# Positive skew: tail on the right
# Negative skew: tail on the left
# Zero: symmetric

# Kurtosis - tail heaviness (scipy reports excess kurtosis)
kurtosis = stats.kurtosis(data)
# High kurtosis: heavy tails, more outliers
# Low kurtosis: light tails
# Normal distribution: excess kurtosis = 0

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(normal_data, bins=30, edgecolor='black')
axes[0].set_title(f'Normal\nSkew: {stats.skew(normal_data):.2f}')
axes[1].hist(skewed_data, bins=30, edgecolor='black')
axes[1].set_title(f'Skewed Right\nSkew: {stats.skew(skewed_data):.2f}')
axes[2].hist(bimodal_data, bins=30, edgecolor='black')
axes[2].set_title('Bimodal')
plt.tight_layout()
```
1.4 Correlation and Covariance
```python
# (x and y are two numeric arrays of equal length; df is a DataFrame)

# Covariance - direction of a linear relationship
covariance = np.cov(x, y)[0, 1]
# Positive: variables increase together
# Negative: one increases as the other decreases

# Pearson correlation - strength of a linear relationship
correlation, p_value = stats.pearsonr(x, y)
# Range: -1 to 1
# 0: no linear correlation
# ±1: perfect linear correlation

# Spearman correlation - monotonic relationship (rank-based)
spearman_corr, p_value = stats.spearmanr(x, y)
# More robust to outliers

# Correlation matrix
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
```
Part 2: Probability Fundamentals
2.1 Basic Probability Concepts
```python
# Probability rules
# P(A) = favorable outcomes / total outcomes

# Complement rule:                   P(not A) = 1 - P(A)
# Addition rule:                     P(A or B) = P(A) + P(B) - P(A and B)
# Multiplication rule (independent): P(A and B) = P(A) * P(B)
# Conditional probability:           P(A|B) = P(A and B) / P(B)
# Bayes' theorem:                    P(A|B) = P(B|A) * P(A) / P(B)
```
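Bayes' theorem is easiest to internalize with numbers. A minimal sketch, using made-up disease-screening rates chosen purely for illustration:

```python
# Hypothetical screening test:
#   P(disease) = 0.01, P(positive | disease) = 0.95,
#   P(positive | no disease) = 0.05  (all rates invented for this example)
p_d = 0.01
p_pos_given_d = 0.95
p_pos_given_not_d = 0.05

# Law of total probability: overall chance of a positive result
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' theorem: probability of disease given a positive result
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(f"{p_d_given_pos:.3f}")  # ~0.161: most positives are false positives
```

Despite a 95% sensitive test, a positive result here implies only a ~16% chance of disease, because the base rate is so low — exactly the kind of intuition Bayes' theorem guards against.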
2.2 Probability Distributions
Discrete Distributions
```python
# 1. Binomial distribution - number of successes in n independent trials
from scipy.stats import binom

n, p = 10, 0.5                       # 10 coin flips, P(heads) = 0.5
x = np.arange(0, n + 1)
pmf = binom.pmf(x, n, p)

# Probability of exactly 6 heads
prob_6 = binom.pmf(6, n, p)          # 0.205

# Probability of 6 or more heads
prob_6plus = 1 - binom.cdf(5, n, p)  # 0.377

# 2. Poisson distribution - number of events in a fixed interval
from scipy.stats import poisson

lambda_param = 3                     # average of 3 events per interval
prob_5 = poisson.pmf(5, lambda_param)  # P(exactly 5 events)
```
Continuous Distributions
```python
# 1. Normal (Gaussian) distribution
from scipy.stats import norm

mu, sigma = 0, 1                     # standard normal
x = np.linspace(-4, 4, 100)
pdf = norm.pdf(x, mu, sigma)
cdf = norm.cdf(x, mu, sigma)

# Empirical rule (68-95-99.7):
#   ~68%   of values within 1 standard deviation
#   ~95%   within 2 standard deviations
#   ~99.7% within 3 standard deviations

# Z-score: how many standard deviations an observation lies from the mean
z_score = (x_value - mu) / sigma     # x_value: the observation of interest

# 2. Uniform distribution
from scipy.stats import uniform
# 3. Exponential distribution (waiting times)
from scipy.stats import expon
# 4. t-distribution (small samples)
from scipy.stats import t
# 5. Chi-square distribution
from scipy.stats import chi2
```
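The empirical rule quoted above can be checked directly against the normal CDF; a quick sketch:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sd: {coverage:.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```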
2.3 Central Limit Theorem (CLT)
```python
# CLT: the sampling distribution of the mean approaches a normal
# distribution as sample size increases, regardless of the
# population's own distribution

# Demonstration
population = np.random.exponential(scale=2, size=100000)

sample_means = []
for _ in range(1000):
    sample = np.random.choice(population, size=30)
    sample_means.append(np.mean(sample))

# The distribution of sample means is approximately normal
plt.hist(sample_means, bins=30, density=True, alpha=0.7)

# Mean of the sampling distribution ≈ population mean
print(f"Population mean: {np.mean(population):.3f}")
print(f"Mean of sample means: {np.mean(sample_means):.3f}")

# Standard error of the mean = σ / √n
standard_error = np.std(population) / np.sqrt(30)
```
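The σ/√n formula can be verified against the simulation itself: the standard deviation of the simulated sample means should land close to the predicted standard error. A self-contained sketch (seeded so the result is reproducible):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2, size=100_000)

# Draw many samples of size n and record each sample's mean
n = 30
sample_means = [rng.choice(population, size=n).mean() for _ in range(2000)]

# Spread of the sample means vs. the CLT prediction sigma / sqrt(n)
predicted_se = population.std() / np.sqrt(n)
observed_se = np.std(sample_means)
print(predicted_se, observed_se)  # the two values should be close
```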
Part 3: Inferential Statistics
3.1 Sampling and Estimation
```python
# Point estimates (sample: the observed data; sigma: population std, if known)
sample_mean = np.mean(sample)
sample_variance = np.var(sample, ddof=1)
sample_std = np.std(sample, ddof=1)
n = len(sample)

# Confidence intervals
from scipy.stats import norm, t

confidence_level = 0.95

# Known population standard deviation: use the normal distribution
z_critical = norm.ppf((1 + confidence_level) / 2)
margin_error = z_critical * (sigma / np.sqrt(n))
ci = [sample_mean - margin_error, sample_mean + margin_error]

# Unknown population standard deviation: use the t-distribution
t_critical = t.ppf((1 + confidence_level) / 2, df=n - 1)
margin_error = t_critical * (sample_std / np.sqrt(n))
ci = [sample_mean - margin_error, sample_mean + margin_error]
```
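`scipy.stats.t.interval` wraps the t-based interval in one call. A concrete sketch, reusing the ten-point dataset from Part 1:

```python
import numpy as np
from scipy import stats

sample = np.array([23, 45, 67, 12, 89, 34, 56, 78, 91, 45])
n = len(sample)
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# 95% CI via the t-distribution (population sigma unknown)
low, high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(f"95% CI: ({low:.1f}, {high:.1f})")
```

Note how wide the interval is with only n=10 observations: small samples buy little certainty.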
3.2 Hypothesis Testing
```python
# Hypothesis testing framework
# H0 (null):        no effect / no difference
# H1 (alternative): there is an effect / difference
# α (significance level): Type I error rate (commonly 0.05)
# p-value: probability of data at least this extreme, assuming H0 is true

# One-sample t-test: does the sample mean differ from a known value?
from scipy.stats import ttest_1samp

t_stat, p_value = ttest_1samp(sample, population_mean)
if p_value < 0.05:
    print("Reject H0: significant difference")
else:
    print("Fail to reject H0: no significant difference")

# Two-sample t-test (independent groups)
from scipy.stats import ttest_ind
t_stat, p_value = ttest_ind(sample1, sample2)

# Paired t-test (before/after on the same units)
from scipy.stats import ttest_rel
t_stat, p_value = ttest_rel(before, after)

# ANOVA (comparing three or more group means)
from scipy.stats import f_oneway
f_stat, p_value = f_oneway(group1, group2, group3)

# Chi-square test of independence (categorical data)
from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(df['category1'], df['category2'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
```
3.3 Common Statistical Tests
| Test | Use Case | Assumptions |
|---|---|---|
| t-test | Compare means (2 groups) | Normal distribution, equal variance |
| ANOVA | Compare means (3+ groups) | Normal distribution, equal variance |
| Chi-Square | Categorical associations | Expected frequencies ≥ 5 |
| Mann-Whitney U | Non-parametric alternative to t-test | Independent samples |
| Wilcoxon | Non-parametric paired test | Paired samples |
| Kruskal-Wallis | Non-parametric ANOVA | Independent samples |
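When the normality assumption is doubtful, the non-parametric tests in the table are near drop-in replacements. A minimal sketch with invented, deliberately skewed data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
# Two log-normal (right-skewed) groups - t-test assumptions are shaky here
group_a = rng.lognormal(mean=0.0, sigma=0.5, size=40)
group_b = rng.lognormal(mean=0.4, sigma=0.5, size=40)

# Mann-Whitney U compares the groups via ranks; no normality assumed
u_stat, p_value = mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f"U = {u_stat:.0f}, p = {p_value:.4f}")
```

The rank-based test trades a little power (when data really are normal) for robustness to skew and outliers.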
3.4 Type I and Type II Errors
```python
# Type I error (false positive):  reject H0 when it is actually true
# Type II error (false negative): fail to reject H0 when it is actually false
# Power = 1 - β: probability of detecting a true effect

from statsmodels.stats.power import TTestIndPower

# Required sample size per group for a two-sample t-test
analysis = TTestIndPower()
sample_size = analysis.solve_power(
    effect_size=0.5,   # medium effect (Cohen's d)
    alpha=0.05,        # significance level
    power=0.80,        # desired power
    ratio=1.0,         # equal group sizes
)
```
Part 4: Advanced Statistical Concepts
4.1 Regression Analysis
```python
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Simple linear regression
X = df['feature'].values.reshape(-1, 1)
y = df['target'].values

model = LinearRegression()
model.fit(X, y)

# Full inferential output with statsmodels
X_with_const = sm.add_constant(X)
ols_model = sm.OLS(y, X_with_const).fit()
print(ols_model.summary())

# Key outputs:
# - R-squared: proportion of variance explained
# - Coefficients: intercept and slope
# - p-values: significance of each predictor
# - Confidence intervals: plausible range for each coefficient
```
4.2 Bayesian Statistics
```python
# Bayesian framework: Posterior ∝ Likelihood × Prior

# Simple Bayesian inference for a success probability
from scipy.stats import beta

# Prior: Beta distribution (conjugate prior for the binomial likelihood)
prior_alpha, prior_beta = 2, 2       # weak prior centered on 0.5

# Likelihood: observed data
successes, trials = 7, 10

# Posterior: Beta(prior_alpha + successes, prior_beta + failures)
posterior_alpha = prior_alpha + successes
posterior_beta = prior_beta + (trials - successes)

# 95% credible interval (the Bayesian analogue of a confidence interval)
credible_interval = beta.interval(0.95, posterior_alpha, posterior_beta)
```
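Continuing the conjugate example above: the Beta posterior has a closed-form mean, α/(α+β), which makes the effect of the prior visible — it shrinks the raw success rate toward 0.5. A quick sketch:

```python
from scipy.stats import beta

prior_alpha, prior_beta = 2, 2
successes, trials = 7, 10
post_a = prior_alpha + successes             # 9
post_b = prior_beta + (trials - successes)   # 5

posterior_mean = post_a / (post_a + post_b)  # 9/14 ≈ 0.643
mle = successes / trials                     # 0.700 (estimate with no prior)
print(posterior_mean, mle)

low, high = beta.interval(0.95, post_a, post_b)
print(f"95% credible interval: ({low:.3f}, {high:.3f})")
```

With more data the likelihood dominates and the posterior mean converges to the raw rate; with little data the prior keeps the estimate conservative.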
4.3 A/B Testing
```python
def ab_test_analysis(control, treatment):
    """Analyze A/B test results."""
    from scipy.stats import ttest_ind

    # Descriptive statistics
    print(f"Control:   n={len(control)}, mean={np.mean(control):.3f}")
    print(f"Treatment: n={len(treatment)}, mean={np.mean(treatment):.3f}")

    # Statistical test
    t_stat, p_value = ttest_ind(treatment, control)
    print(f"t-statistic: {t_stat:.3f}")
    print(f"p-value: {p_value:.4f}")

    # Effect size (Cohen's d with pooled standard deviation)
    pooled_std = np.sqrt(((len(control) - 1) * np.var(control, ddof=1) +
                          (len(treatment) - 1) * np.var(treatment, ddof=1)) /
                         (len(control) + len(treatment) - 2))
    cohens_d = (np.mean(treatment) - np.mean(control)) / pooled_std

    # Lift: relative change of the treatment mean over the control mean
    lift = ((np.mean(treatment) - np.mean(control)) / np.mean(control)) * 100

    # Recommendation
    if p_value < 0.05:
        if np.mean(treatment) > np.mean(control):
            result = f"Treatment wins! Lift: {lift:.1f}% (p={p_value:.4f})"
        else:
            result = f"Control wins! (p={p_value:.4f})"
    else:
        result = "No significant difference found"

    return {
        'p_value': p_value,
        'effect_size': cohens_d,
        'lift': lift,
        'result': result,
    }
```
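The same quantities can be exercised on synthetic data (all values below are invented; the true +0.5 shift is planted so there is a real effect to detect):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
control = rng.normal(loc=10.0, scale=2.0, size=500)    # baseline metric
treatment = rng.normal(loc=10.5, scale=2.0, size=500)  # +0.5 true shift

t_stat, p_value = ttest_ind(treatment, control)

# Cohen's d with the same pooled standard deviation as ab_test_analysis
pooled_std = np.sqrt(((len(control) - 1) * control.var(ddof=1) +
                      (len(treatment) - 1) * treatment.var(ddof=1)) /
                     (len(control) + len(treatment) - 2))
cohens_d = (treatment.mean() - control.mean()) / pooled_std
lift = (treatment.mean() - control.mean()) / control.mean() * 100

print(f"p={p_value:.4f}, d={cohens_d:.2f}, lift={lift:.1f}%")
```

With σ = 2 and a 0.5 shift, the true Cohen's d is 0.25 — a "small" effect that still shows up clearly at n = 500 per arm, illustrating why effect size must be read alongside the p-value.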
Part 5: Practical Applications in Data Science
5.1 Exploratory Data Analysis (EDA) Statistics
```python
def comprehensive_eda_stats(df):
    """Generate a comprehensive statistical summary of a DataFrame."""
    # Basic statistics
    print("=== BASIC STATISTICS ===\n")
    print(df.describe(include='all'))

    # Missing values
    print("\n=== MISSING VALUES ===\n")
    missing = df.isnull().sum()
    print(missing[missing > 0])

    # Distribution statistics
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    print("\n=== DISTRIBUTION METRICS ===\n")
    for col in numeric_cols:
        data = df[col].dropna()
        print(f"{col}:")
        print(f"  Skewness: {stats.skew(data):.3f}")
        print(f"  Kurtosis: {stats.kurtosis(data):.3f}")
        # Normality test (Shapiro-Wilk is unreliable above ~5000 samples)
        if len(data) < 5000:
            _, p_value = stats.shapiro(data)
            normal = "Yes" if p_value > 0.05 else "No"
            print(f"  Normally distributed: {normal} (p={p_value:.3f})")

    # Correlation analysis
    print("\n=== CORRELATION ANALYSIS ===\n")
    corr_matrix = df[numeric_cols].corr()

    # Find highly correlated pairs
    high_corr = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i + 1, len(corr_matrix.columns)):
            if abs(corr_matrix.iloc[i, j]) > 0.7:
                high_corr.append({
                    'pair': f"{corr_matrix.columns[i]} - {corr_matrix.columns[j]}",
                    'correlation': corr_matrix.iloc[i, j],
                })
    if high_corr:
        print("Highly correlated pairs (|r| > 0.7):")
        for item in high_corr:
            print(f"  {item['pair']}: {item['correlation']:.3f}")

    # Outlier detection (IQR method)
    print("\n=== OUTLIER DETECTION ===\n")
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = df[(df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)]
        print(f"{col}: {len(outliers)} outliers "
              f"({len(outliers) / len(df) * 100:.1f}%)")
```
5.2 Statistical Feature Selection
```python
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

def statistical_feature_selection(X, y, method='f_stat', k=10):
    """Select the k best features using univariate statistical tests."""
    if method == 'f_stat':
        selector = SelectKBest(score_func=f_classif, k=k)
    elif method == 'mutual_info':
        selector = SelectKBest(score_func=mutual_info_classif, k=k)
    else:
        raise ValueError(f"Unknown method: {method}")

    X_selected = selector.fit_transform(X, y)

    # Rank features by score
    scores = pd.DataFrame({
        'feature': X.columns,
        'score': selector.scores_,
    }).sort_values('score', ascending=False)

    print(f"Top {k} features:")
    print(scores.head(k))
    return X_selected, scores
```
5.3 Statistical Assumptions for Models
```python
def check_regression_assumptions(X, y, model):
    """Check key assumptions for linear regression."""
    from scipy.stats import shapiro

    # Fit model
    model.fit(X, y)
    predictions = model.predict(X)
    residuals = y - predictions

    # 1. Linearity (residuals vs. fitted values should show no pattern)
    plt.figure(figsize=(12, 3))
    plt.subplot(1, 3, 1)
    plt.scatter(predictions, residuals, alpha=0.5)
    plt.axhline(y=0, color='r', linestyle='--')
    plt.xlabel('Fitted Values')
    plt.ylabel('Residuals')
    plt.title('Linearity Check')

    # 2. Normality of residuals
    plt.subplot(1, 3, 2)
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.title('Q-Q Plot')

    # Shapiro-Wilk test (capped at 5000 points for large datasets)
    _, p_value = shapiro(residuals[:5000])
    print(f"Normality test (p-value): {p_value:.4f}")

    # 3. Homoscedasticity (constant residual variance)
    plt.subplot(1, 3, 3)
    plt.scatter(predictions, np.abs(residuals), alpha=0.5)
    plt.xlabel('Fitted Values')
    plt.ylabel('|Residuals|')
    plt.title('Homoscedasticity Check')
    plt.tight_layout()

    # For a formal heteroscedasticity test, see
    # statsmodels.stats.diagnostic.het_breuschpagan
    return residuals
```
Part 6: Statistical Thinking in Data Science
Key Principles
- Correlation ≠ Causation
```python
# Spurious correlations can mislead
# Always consider confounding variables
# Use randomized experiments when possible
```
- Statistical Significance vs. Practical Significance
```python
# Large sample sizes can make trivial effects statistically significant
# Always check effect size alongside p-values
effect_size = (mean_treatment - mean_control) / pooled_std
# Cohen's d: 0.2 (small), 0.5 (medium), 0.8 (large)
```
- The Danger of p-hacking
```python
# Multiple testing inflates the Type I error rate
# Bonferroni correction: α_adjusted = α / n_tests
from statsmodels.stats.multitest import multipletests
reject, adjusted_p_values, _, _ = multipletests(p_values, method='bonferroni')
```
- Sample Size Considerations
```python
# Larger isn't always better
# Consider: effect size, variability, desired power
# Balance statistical power with practical constraints
```
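To make the trade-off concrete, the power of a two-sample comparison can be approximated with the normal distribution. A sketch (this approximation ignores the t-correction, so it is slightly optimistic for small n):

```python
import numpy as np
from scipy.stats import norm

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Normal approximation to two-sample, two-sided test power."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_effect = effect_size * np.sqrt(n_per_group / 2)
    return norm.cdf(z_effect - z_alpha)

# How power grows with sample size for a medium effect (d = 0.5)
for n in (20, 50, 100, 200):
    print(f"n={n:>3} per group -> power ≈ {approx_power(0.5, n):.2f}")
```

The returns diminish quickly: going from n = 100 to n = 200 buys far less power than going from 20 to 50, which is why sample-size planning starts from the smallest effect worth detecting.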
Part 7: Essential Statistical Formulas
Quick Reference
| Concept | Formula |
|---|---|
| Mean | μ = Σx / N |
| Variance | σ² = Σ(x - μ)² / N |
| Standard Deviation | σ = √σ² |
| Z-score | z = (x - μ) / σ |
| Correlation | r = Σ((x-μₓ)(y-μᵧ)) / (nσₓσᵧ) |
| Standard Error | SE = σ / √n |
| Confidence Interval | CI = x̄ ± z(α/2) × (σ/√n) |
| t-statistic | t = (x̄ - μ) / (s/√n) |
| Chi-Square | χ² = Σ(O - E)² / E |
| Bayes' Theorem | P(A|B) = P(B|A)P(A) / P(B) |
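As a sanity check, several of these formulas can be verified numerically against NumPy and SciPy; a small sketch using a textbook-style dataset chosen so the results come out to round numbers:

```python
import numpy as np
from scipy.stats import norm

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.sum() / len(x)                  # mean: 5.0
var = ((x - mu) ** 2).sum() / len(x)   # population variance: 4.0
sigma = np.sqrt(var)                   # standard deviation: 2.0

z = (9.0 - mu) / sigma                 # z-score of the value 9.0: 2.0

se = sigma / np.sqrt(len(x))           # standard error
z_crit = norm.ppf(0.975)               # ≈ 1.96 for a 95% interval
ci = (mu - z_crit * se, mu + z_crit * se)
print(mu, sigma, z, ci)
```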
Conclusion: Statistics Mindset for Data Scientists
```python
# The statistical mindset:
# 1. Always question data quality and collection methods
# 2. Understand the assumptions behind your tests
# 3. Quantify uncertainty in all estimates
# 4. Consider practical significance, not just statistical significance
# 5. Visualize everything before testing
# 6. Be skeptical of your own conclusions
# 7. Reproducibility is key
```
Key Takeaway: Statistics is not just a set of formulas—it's a framework for thinking about data, uncertainty, and decision-making. Master these concepts not to memorize equations, but to develop the intuition needed to extract reliable insights from messy, real-world data.