Complete Guide to Data Science with Python

Introduction to Data Science with Python

Data Science is an interdisciplinary field that combines statistics, mathematics, programming, and domain expertise to extract meaningful insights from data. Python has emerged as the dominant language for data science due to its simplicity, extensive libraries, and vibrant community. This comprehensive guide covers everything you need to start your journey in data science with Python.

Key Concepts

  • Data Analysis: Exploring, cleaning, and transforming data
  • Statistical Modeling: Understanding patterns and relationships
  • Machine Learning: Building predictive models
  • Data Visualization: Communicating insights effectively
  • Big Data: Handling large-scale datasets
  • Reproducibility: Ensuring results can be replicated

Why Python for Data Science?

  • Rich Ecosystem: NumPy, Pandas, Scikit-learn, TensorFlow, PyTorch
  • Easy to Learn: Readable syntax, gentle learning curve
  • Active Community: Extensive documentation, tutorials, and support
  • Integration: Works well with SQL, big data tools, and visualization libraries
  • Production Ready: Deploy models with Flask, Django, FastAPI

1. Setting Up Your Data Science Environment

Installing Python and Essential Libraries

# Install Python (3.8 or higher recommended)
# Check Python version
python --version
# Create and activate a virtual environment first (recommended)
python -m venv datascience_env
source datascience_env/bin/activate  # On Windows: datascience_env\Scripts\activate
# Install essential libraries with pip
pip install numpy pandas matplotlib seaborn scikit-learn jupyter
# For scientific computing
pip install scipy statsmodels
# For deep learning (note: the PyTorch package on PyPI is "torch", not "pytorch")
pip install tensorflow torch
# For Jupyter notebooks
pip install jupyterlab
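
For reproducibility, it is common to pin exact package versions in a requirements.txt file and install from it. A minimal sketch; the version numbers shown are placeholders, not recommendations:

# requirements.txt (example pins -- pick versions that match your project)
#   numpy==1.26.4
#   pandas==2.2.2
#   scikit-learn==1.5.0
# Install everything listed in the file
pip install -r requirements.txt
# Snapshot the exact versions of the current environment
pip freeze > requirements.txt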

Jupyter Notebook Setup

# Launch Jupyter Notebook
# jupyter notebook
# Or Jupyter Lab (modern interface)
# jupyter lab
# Import libraries in your notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Scikit-learn version: {sklearn.__version__}")
# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

2. NumPy - Numerical Computing Foundation

Creating Arrays

import numpy as np
# Creating arrays from lists
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
# Special arrays
zeros = np.zeros((3, 4))           # 3x4 matrix of zeros
ones = np.ones((2, 3))              # 2x3 matrix of ones
eye = np.eye(4)                     # 4x4 identity matrix
random = np.random.rand(3, 3)       # 3x3 random numbers (0-1)
random_int = np.random.randint(0, 100, size=(5, 5))  # Random integers
# Ranges
range_arr = np.arange(0, 10, 2)     # [0, 2, 4, 6, 8]
linspace_arr = np.linspace(0, 1, 5) # [0, 0.25, 0.5, 0.75, 1]
print(f"Array: {arr1}")
print(f"Shape: {arr1.shape}")
print(f"Dimensions: {arr1.ndim}")
print(f"Data type: {arr1.dtype}")
print(f"Size: {arr1.size}")
print(f"Bytes: {arr1.nbytes}")

Array Operations

# Basic arithmetic (element-wise)
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
print(f"Addition: {a + b}")
print(f"Subtraction: {a - b}")
print(f"Multiplication: {a * b}")
print(f"Division: {a / b}")
print(f"Power: {a ** 2}")
# Statistical operations
data = np.random.randn(1000)  # 1000 random numbers (normal distribution)
print(f"Mean: {data.mean():.4f}")
print(f"Median: {np.median(data):.4f}")
print(f"Std: {data.std():.4f}")
print(f"Min: {data.min():.4f}")
print(f"Max: {data.max():.4f}")
print(f"Sum: {data.sum():.4f}")
# Broadcasting
matrix = np.array([[1, 2, 3], [4, 5, 6]])
vector = np.array([10, 20, 30])
# Add vector to each row
result = matrix + vector
print(result)
# [[11 22 33]
#  [14 25 36]]
# Matrix operations
A = np.random.rand(3, 4)
B = np.random.rand(4, 2)
dot_product = np.dot(A, B)  # Matrix multiplication
# Advanced operations
exp_data = np.exp(data)      # Exponential
log_data = np.log(np.abs(data) + 0.01)  # Log
sqrt_data = np.sqrt(np.abs(data))       # Square root
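
The reductions above collapse the whole array into one number; on 2-D data you usually reduce along a single axis instead. A short sketch (axis=0 runs down the columns, axis=1 across the rows):

import numpy as np
scores = np.array([[80, 90, 70],
                   [60, 85, 95]])
print(scores.mean(axis=0))    # Column means: [70.  87.5 82.5]
print(scores.mean(axis=1))    # Row means: [80. 80.]
print(scores.sum(axis=0))     # Column sums: [140 175 165]
print(scores.argmax(axis=1))  # Index of the max in each row: [1 2]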

Indexing and Slicing

# 1D array indexing
arr = np.array([10, 20, 30, 40, 50, 60, 70, 80])
print(arr[0])      # First element: 10
print(arr[-1])     # Last element: 80
print(arr[2:5])    # Elements 2-4: [30, 40, 50]
print(arr[::2])    # Every other: [10, 30, 50, 70]
# 2D array indexing
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])
print(matrix[0, 1])     # Row 0, Col 1: 2
print(matrix[1])        # Row 1: [5, 6, 7, 8]
print(matrix[:, 2])     # Column 2: [3, 7, 11]
print(matrix[1:3, 1:3]) # Submatrix: [[6, 7], [10, 11]]
# Boolean indexing
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
mask = data > 5
print(data[mask])  # [6, 7, 8, 9, 10]
# Fancy indexing
indices = [0, 2, 4, 6]
print(data[indices])  # [1, 3, 5, 7]
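
A natural companion to boolean indexing is np.where, which either selects between two values element-wise or returns the indices where a condition holds. A small sketch:

import numpy as np
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Keep elements > 5, set the rest to 0
print(np.where(data > 5, data, 0))  # [0 0 0 0 0 6 7 8 9 10]
# Indices where a condition is true
print(np.where(data % 2 == 0)[0])   # [1 3 5 7 9]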

Reshaping and Joining

# Reshaping arrays
arr = np.arange(12)
print(f"Original: {arr}")
print(f"Reshape 3x4: \n{arr.reshape(3, 4)}")
print(f"Reshape 2x6: \n{arr.reshape(2, 6)}")
# Flattening
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Flatten: {matrix.flatten()}")
print(f"Ravel: {matrix.ravel()}")
# Stacking
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(f"Vertical stack: \n{np.vstack([a, b])}")
print(f"Horizontal stack: {np.hstack([a, b])}")
# Concatenation
c = np.array([[1, 2], [3, 4]])
d = np.array([[5, 6], [7, 8]])
print(f"Concatenate rows: \n{np.concatenate([c, d], axis=0)}")
print(f"Concatenate columns: \n{np.concatenate([c, d], axis=1)}")

3. Pandas - Data Manipulation and Analysis

Creating DataFrames

import pandas as pd
import numpy as np
# From dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'Salary': [50000, 60000, 75000, 55000, 65000],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR']
}
df = pd.DataFrame(data)
print(df)
# From CSV
# df = pd.read_csv('data.csv')
# df = pd.read_excel('data.xlsx')
# df = pd.read_json('data.json')
# From NumPy array
np_array = np.random.rand(5, 3)
df_np = pd.DataFrame(np_array, columns=['A', 'B', 'C'])
# From list of dictionaries
records = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
    {'Name': 'Charlie', 'Age': 35}
]
df_records = pd.DataFrame(records)

DataFrame Information

# Basic information
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Index: {df.index}")
print(f"Data types:\n{df.dtypes}")
# Summary statistics
print(f"Summary statistics:\n{df.describe()}")
# First/last rows
print(f"First 3 rows:\n{df.head(3)}")
print(f"Last 2 rows:\n{df.tail(2)}")
# Info summary
df.info()
# Missing values
print(f"Missing values:\n{df.isnull().sum()}")
print(f"Missing values percentage: {df.isnull().sum() / len(df) * 100:.1f}%")

Selecting and Filtering Data

# Column selection
print(df['Name'])                    # Single column (Series)
print(df[['Name', 'Age']])           # Multiple columns
# Row selection
print(df.iloc[0])                    # First row by position
print(df.iloc[1:3])                  # Rows 1-2 by position
print(df.loc[0])                     # First row by label
# Conditional filtering
print(df[df['Age'] > 30])            # Age > 30
print(df[(df['Age'] > 25) & (df['Salary'] > 60000)])  # Multiple conditions
print(df[df['Department'].isin(['IT', 'HR'])])        # In list
# Query method
print(df.query('Age > 25 and Salary > 60000'))
# String filtering
print(df[df['Name'].str.contains('A')])  # Names containing 'A'
print(df[df['Name'].str.startswith('C')]) # Names starting with 'C'

Data Manipulation

# Adding columns
df['Bonus'] = df['Salary'] * 0.1
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 30, 40, 100], labels=['Young', 'Middle', 'Senior'])
# Removing columns
df_cleaned = df.drop(['Bonus'], axis=1)
df_cleaned = df.drop(columns=['Bonus'])
# Renaming columns
df_renamed = df.rename(columns={'Name': 'Full Name', 'Salary': 'Annual Salary'})
# Sorting
df_sorted = df.sort_values('Age')
df_sorted_desc = df.sort_values('Salary', ascending=False)
# Group by operations
grouped = df.groupby('Department')
print(f"Group sizes:\n{grouped.size()}")
print(f"Mean by department:\n{grouped[['Age', 'Salary']].mean()}")
print(f"Multiple aggregations:\n{grouped['Salary'].agg(['mean', 'std', 'min', 'max'])}")
# Apply functions
df['Age_Squared'] = df['Age'].apply(lambda x: x ** 2)
df['Name_Length'] = df['Name'].apply(len)
# Custom aggregation
def range_func(x):
    return x.max() - x.min()
print(grouped['Salary'].agg(range_func))
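
A related tool is pivot_table, which computes a groupby aggregation and reshapes it into a table in one call. A small sketch on the same df:

# Average salary per department in a single call
print(df.pivot_table(values='Salary', index='Department', aggfunc='mean'))
# Several aggregations at once
print(df.pivot_table(values='Salary', index='Department', aggfunc=['mean', 'max', 'count']))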

Handling Missing Data

# Create DataFrame with missing values
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, np.nan],
    'C': [1, np.nan, np.nan, 4, 5]
})
# Detect missing values
print(df_missing.isnull())
print(df_missing.isnull().sum())
# Drop missing values
df_dropped_rows = df_missing.dropna()           # Drop rows with any NaN
df_dropped_cols = df_missing.dropna(axis=1)     # Drop columns with any NaN
df_dropped_all = df_missing.dropna(how='all')   # Drop rows where all values are NaN
# Fill missing values
df_filled_mean = df_missing.fillna(df_missing.mean())
df_filled_ffill = df_missing.ffill()                  # Forward fill (fillna(method='ffill') is deprecated)
df_filled_value = df_missing.fillna(0)                # Fill with 0
df_filled_interpolate = df_missing.interpolate()      # Linear interpolation
# Specific column handling
df_missing['A'] = df_missing['A'].fillna(df_missing['A'].median())
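
scikit-learn offers the same fill strategies as a reusable transformer, which is handy when the statistics learned on training data must be reapplied to new data later. A minimal sketch using SimpleImputer:

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')  # also: 'mean', 'most_frequent', 'constant'
filled = imputer.fit_transform(df_missing)  # learns the medians, returns a NumPy array
df_imputed = pd.DataFrame(filled, columns=df_missing.columns)
print(df_imputed)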

Merging and Joining

# Create two DataFrames for merging
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Department': ['IT', 'HR', 'IT', 'Finance']
})
df2 = pd.DataFrame({
    'ID': [1, 2, 3, 5],
    'Salary': [50000, 60000, 75000, 80000],
    'Bonus': [5000, 6000, 7500, 8000]
})
# Inner join (only matching keys)
inner_join = pd.merge(df1, df2, on='ID', how='inner')
print(f"Inner join:\n{inner_join}")
# Left join (keep all from left)
left_join = pd.merge(df1, df2, on='ID', how='left')
print(f"Left join:\n{left_join}")
# Right join (keep all from right)
right_join = pd.merge(df1, df2, on='ID', how='right')
print(f"Right join:\n{right_join}")
# Outer join (keep all from both)
outer_join = pd.merge(df1, df2, on='ID', how='outer')
print(f"Outer join:\n{outer_join}")
# Join on multiple columns
df3 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Year': [2020, 2020, 2021, 2021],
    'Sales': [100, 150, 200, 250]
})
df4 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Year': [2020, 2020, 2021, 2021],
    'Target': [90, 140, 180, 240]
})
merged = pd.merge(df3, df4, on=['ID', 'Year'])
print(f"Merged on multiple columns:\n{merged}")
# Concatenation
df_concat = pd.concat([df1, df2], axis=0)        # Stack vertically (non-shared columns become NaN)
df_concat_horiz = pd.concat([df1, df2], axis=1)  # Stack horizontally (aligned on the index)

4. Data Visualization

Matplotlib Basics

import matplotlib.pyplot as plt
import numpy as np
# Basic line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.figure(figsize=(10, 6))
plt.plot(x, y, label='sin(x)', color='blue', linewidth=2)
plt.plot(x, np.cos(x), label='cos(x)', color='red', linestyle='--')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.title('Sine and Cosine Functions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Scatter plot
x = np.random.randn(100)
y = np.random.randn(100)
axes[0, 0].scatter(x, y, alpha=0.5)
axes[0, 0].set_title('Scatter Plot')
# Histogram
data = np.random.randn(1000)
axes[0, 1].hist(data, bins=30, alpha=0.7, color='green')
axes[0, 1].set_title('Histogram')
# Bar plot
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 56, 78]
axes[1, 0].bar(categories, values, color='orange')
axes[1, 0].set_title('Bar Chart')
# Pie chart
sizes = [30, 25, 20, 15, 10]
labels = ['Python', 'JavaScript', 'Java', 'C++', 'Ruby']
axes[1, 1].pie(sizes, labels=labels, autopct='%1.1f%%')
axes[1, 1].set_title('Pie Chart')
plt.tight_layout()
plt.show()

Seaborn - Statistical Visualization

import seaborn as sns
import pandas as pd
import numpy as np
# Load example dataset
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')
# Distribution plots
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
sns.histplot(tips['total_bill'], bins=30, kde=True)
plt.title('Total Bill Distribution')
plt.subplot(1, 3, 2)
sns.boxplot(x='day', y='total_bill', data=tips)
plt.title('Bill by Day')
plt.subplot(1, 3, 3)
sns.violinplot(x='time', y='total_bill', data=tips)
plt.title('Bill by Time')
plt.tight_layout()
plt.show()
# Relationship plots
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.scatterplot(x='total_bill', y='tip', hue='time', data=tips)
plt.title('Total Bill vs Tip')
plt.subplot(1, 2, 2)
sns.regplot(x='total_bill', y='tip', data=tips)
plt.title('With Regression Line')
plt.tight_layout()
plt.show()
# lmplot is figure-level and draws its own figure, so call it outside subplots
sns.lmplot(x='total_bill', y='tip', hue='time', data=tips)
plt.title('Faceted by Time')
plt.show()
# Correlation heatmap
correlation = tips.select_dtypes(include=[np.number]).corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.show()
# Pairplot (iris dataset)
sns.pairplot(iris, hue='species')
plt.show()
# Categorical plots
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
sns.countplot(x='day', data=tips)
plt.title('Count by Day')
plt.subplot(1, 3, 2)
sns.barplot(x='day', y='total_bill', data=tips)
plt.title('Average Bill by Day')
plt.subplot(1, 3, 3)
sns.pointplot(x='day', y='total_bill', hue='sex', data=tips)
plt.title('Bill by Day and Gender')
plt.tight_layout()
plt.show()

Advanced Visualization

# 3D plots
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
x = np.random.rand(100)
y = np.random.rand(100)
z = np.random.rand(100)
ax.scatter(x, y, z, c=z, cmap='viridis', alpha=0.6)
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
plt.title('3D Scatter Plot')
plt.show()
# Contour plots
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = np.exp(-X**2 - Y**2)
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.contour(X, Y, Z, levels=20)
plt.title('Contour Plot')
plt.subplot(1, 3, 2)
plt.contourf(X, Y, Z, levels=20, cmap='viridis')
plt.colorbar()
plt.title('Filled Contour')
plt.subplot(1, 3, 3)
plt.imshow(Z, extent=[-3, 3, -3, 3], origin='lower', cmap='viridis')
plt.colorbar()
plt.title('Heatmap')
plt.tight_layout()
plt.show()
# Animations
from matplotlib.animation import FuncAnimation
fig, ax = plt.subplots(figsize=(8, 6))
x = np.linspace(0, 2*np.pi, 100)
line, = ax.plot(x, np.sin(x))
def animate(i):
    line.set_ydata(np.sin(x + i / 10.0))
    return line,
anim = FuncAnimation(fig, animate, frames=100, interval=50, blit=True)
# anim.save('animation.gif', writer='pillow')
plt.show()

5. Data Cleaning and Preprocessing

Handling Outliers

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Create data with outliers
np.random.seed(42)
data = np.random.randn(1000)
data = np.append(data, [10, -8, 12, -9])  # Add outliers
df = pd.DataFrame({'value': data})
# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['value']))
outliers_z = df[z_scores > 3]
print(f"Outliers (Z-score > 3): {len(outliers_z)}")
# IQR method
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = df[(df['value'] < lower_bound) | (df['value'] > upper_bound)]
print(f"Outliers (IQR): {len(outliers_iqr)}")
# Visualize outliers
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].boxplot(df['value'])
axes[0].set_title('Boxplot')
axes[1].hist(df['value'], bins=50)
axes[1].set_title('Histogram')
axes[2].scatter(range(len(df)), df['value'])
axes[2].set_title('Scatter Plot')
plt.tight_layout()
plt.show()
# Handle outliers
# Method 1: Remove outliers
df_clean = df[(df['value'] >= lower_bound) & (df['value'] <= upper_bound)]
# Method 2: Cap outliers
df_capped = df.copy()
df_capped['value'] = np.clip(df_capped['value'], lower_bound, upper_bound)
# Method 3: Winsorization
from scipy.stats.mstats import winsorize
df_winsorized = df.copy()
df_winsorized['value'] = winsorize(df_winsorized['value'], limits=[0.05, 0.05])

Feature Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Create sample data
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Salary': [30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000],
    'Experience': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
}
df = pd.DataFrame(data)
# Standardization (Z-score normalization)
scaler_std = StandardScaler()
df_std = pd.DataFrame(
    scaler_std.fit_transform(df),
    columns=df.columns,
    index=df.index
)
# Min-Max scaling
scaler_minmax = MinMaxScaler()
df_minmax = pd.DataFrame(
    scaler_minmax.fit_transform(df),
    columns=df.columns,
    index=df.index
)
# Robust scaling (handles outliers)
scaler_robust = RobustScaler()
df_robust = pd.DataFrame(
    scaler_robust.fit_transform(df),
    columns=df.columns,
    index=df.index
)
# Compare scaling methods
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].plot(df)
axes[0, 0].set_title('Original Data')
axes[0, 0].legend(df.columns)
axes[0, 1].plot(df_std)
axes[0, 1].set_title('Standardized (Z-score)')
axes[0, 1].legend(df_std.columns)
axes[1, 0].plot(df_minmax)
axes[1, 0].set_title('Min-Max Scaled')
axes[1, 0].legend(df_minmax.columns)
axes[1, 1].plot(df_robust)
axes[1, 1].set_title('Robust Scaled')
axes[1, 1].legend(df_robust.columns)
plt.tight_layout()
plt.show()
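
One caveat the example above glosses over: in a modeling workflow the scaler must be fitted on the training split only and then applied to the test split, otherwise test-set statistics leak into training. A minimal sketch:

from sklearn.model_selection import train_test_split
X_tr, X_te = train_test_split(df, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # fit on training data only
X_te_scaled = scaler.transform(X_te)      # reuse the training mean/std on the test split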

Handling Categorical Data

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Create sample data
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green'],
    'Size': ['S', 'M', 'L', 'M', 'L', 'S'],
    'Price': [100, 150, 200, 120, 180, 90]
})
# Label Encoding
le_color = LabelEncoder()
le_size = LabelEncoder()
df['Color_Encoded'] = le_color.fit_transform(df['Color'])
df['Size_Encoded'] = le_size.fit_transform(df['Size'])
print("Label Encoded:\n", df[['Color', 'Color_Encoded', 'Size', 'Size_Encoded']])
# One-Hot Encoding
df_onehot = pd.get_dummies(df, columns=['Color', 'Size'])
print("\nOne-Hot Encoded:\n", df_onehot)
# Custom mapping
size_mapping = {'S': 0, 'M': 1, 'L': 2}
df['Size_Mapped'] = df['Size'].map(size_mapping)
# Frequency encoding
color_freq = df['Color'].value_counts() / len(df)
df['Color_Freq'] = df['Color'].map(color_freq)
# Target encoding (mean encoding)
target_mean = df.groupby('Color')['Price'].mean()
df['Color_Target'] = df['Color'].map(target_mean)
print("\nAdvanced Encodings:")
print(df[['Color', 'Price', 'Color_Target', 'Color_Freq']])
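
The OneHotEncoder imported above is the scikit-learn counterpart of pd.get_dummies; unlike get_dummies it remembers the categories it was fitted on, so unseen values at prediction time can be handled explicitly. A minimal sketch (sparse_output requires scikit-learn >= 1.2; older versions use sparse=False):

ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded = ohe.fit_transform(df[['Color', 'Size']])
print(ohe.get_feature_names_out())  # e.g. ['Color_Blue' 'Color_Green' 'Color_Red' ...]
print(encoded[:3])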

Feature Engineering

# Create sample dataset
df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=100),
    'value': np.random.randn(100),
    'category': np.random.choice(['A', 'B', 'C'], 100),
    'temperature': np.random.randint(-10, 40, 100),
    'humidity': np.random.randint(20, 90, 100)
})
# Date features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
# Lag features
df['value_lag1'] = df['value'].shift(1)
df['value_lag2'] = df['value'].shift(2)
df['value_rolling_mean'] = df['value'].rolling(window=3).mean()
# Interaction features
df['temp_humidity'] = df['temperature'] * df['humidity']
df['temp_squared'] = df['temperature'] ** 2
# Polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['temperature', 'humidity']])
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(['temp', 'hum']))
# Binning
df['temp_bin'] = pd.cut(df['temperature'], bins=5, labels=['Very Cold', 'Cold', 'Mild', 'Warm', 'Hot'])
df['temp_bin_ordinal'] = pd.cut(df['temperature'], bins=5, labels=False)
# Encoding cyclical features
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
df['dayofweek_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
df['dayofweek_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)
print("Engineered Features:")
print(df.head())

6. Exploratory Data Analysis (EDA)

Comprehensive EDA Template

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
def comprehensive_eda(df, target=None):
    """
    Comprehensive EDA function for any dataset
    """
    print("="*50)
    print("DATASET OVERVIEW")
    print("="*50)
    print(f"Shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")
    print(f"Data types:\n{df.dtypes}")
    print("\n" + "="*50)
    print("FIRST 5 ROWS")
    print("="*50)
    print(df.head())
    print("\n" + "="*50)
    print("BASIC STATISTICS")
    print("="*50)
    print(df.describe())
    print("\n" + "="*50)
    print("MISSING VALUES")
    print("="*50)
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    missing_df = pd.DataFrame({
        'Missing Count': missing,
        'Missing %': missing_pct
    })
    print(missing_df[missing_df['Missing Count'] > 0])
    # Correlation analysis
    print("\n" + "="*50)
    print("CORRELATION ANALYSIS")
    print("="*50)
    numeric_df = df.select_dtypes(include=[np.number])
    if not numeric_df.empty:
        corr_matrix = numeric_df.corr()
        print(corr_matrix)
        # Heatmap
        plt.figure(figsize=(12, 8))
        sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
        plt.title('Correlation Heatmap')
        plt.show()
        # Correlation with target
        if target and target in numeric_df.columns:
            target_corr = corr_matrix[target].sort_values(ascending=False)
            print(f"\nCorrelation with target '{target}':\n{target_corr}")
    # Distribution plots for numeric features
    print("\n" + "="*50)
    print("NUMERIC FEATURE DISTRIBUTIONS")
    print("="*50)
    numeric_cols = numeric_df.columns
    n_cols = len(numeric_cols)
    if n_cols > 0:
        fig, axes = plt.subplots(n_cols, 2, figsize=(15, 4*n_cols))
        if n_cols == 1:
            axes = axes.reshape(1, -1)
        for idx, col in enumerate(numeric_cols):
            # Histogram
            axes[idx, 0].hist(df[col].dropna(), bins=30, alpha=0.7, edgecolor='black')
            axes[idx, 0].set_title(f'{col} - Histogram')
            axes[idx, 0].set_xlabel(col)
            axes[idx, 0].set_ylabel('Frequency')
            # Boxplot
            axes[idx, 1].boxplot(df[col].dropna())
            axes[idx, 1].set_title(f'{col} - Boxplot')
            axes[idx, 1].set_ylabel(col)
        plt.tight_layout()
        plt.show()
    # Categorical features analysis
    cat_cols = df.select_dtypes(include=['object', 'category']).columns
    if len(cat_cols) > 0:
        print("\n" + "="*50)
        print("CATEGORICAL FEATURE ANALYSIS")
        print("="*50)
        fig, axes = plt.subplots(len(cat_cols), 2, figsize=(15, 5*len(cat_cols)))
        if len(cat_cols) == 1:
            axes = axes.reshape(1, -1)
        for idx, col in enumerate(cat_cols):
            # Value counts
            value_counts = df[col].value_counts()
            axes[idx, 0].bar(value_counts.index, value_counts.values)
            axes[idx, 0].set_title(f'{col} - Value Counts')
            axes[idx, 0].set_xlabel(col)
            axes[idx, 0].set_ylabel('Count')
            axes[idx, 0].tick_params(axis='x', rotation=45)
            # Pie chart (if not too many categories)
            if len(value_counts) <= 10:
                axes[idx, 1].pie(value_counts.values, labels=value_counts.index, autopct='%1.1f%%')
                axes[idx, 1].set_title(f'{col} - Distribution')
            else:
                axes[idx, 1].barh(value_counts.index[:10], value_counts.values[:10])
                axes[idx, 1].set_title(f'{col} - Top 10 Categories')
        plt.tight_layout()
        plt.show()
    # Outlier detection
    print("\n" + "="*50)
    print("OUTLIER DETECTION")
    print("="*50)
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        outliers = df[(df[col] < lower) | (df[col] > upper)]
        print(f"{col}: {len(outliers)} outliers ({len(outliers)/len(df)*100:.2f}%)")
    # Skewness and kurtosis
    print("\n" + "="*50)
    print("SKEWNESS AND KURTOSIS")
    print("="*50)
    for col in numeric_cols:
        skew = df[col].skew()
        kurt = df[col].kurtosis()
        print(f"{col}: Skewness = {skew:.3f}, Kurtosis = {kurt:.3f}")
    # Target analysis
    if target and target in df.columns:
        print("\n" + "="*50)
        print(f"TARGET ANALYSIS: {target}")
        print("="*50)
        if df[target].dtype in ['int64', 'float64']:
            # Regression target
            plt.figure(figsize=(12, 4))
            plt.subplot(1, 2, 1)
            plt.hist(df[target].dropna(), bins=30, edgecolor='black')
            plt.title(f'{target} Distribution')
            plt.subplot(1, 2, 2)
            plt.boxplot(df[target].dropna())
            plt.title(f'{target} Boxplot')
            plt.show()
        else:
            # Classification target
            target_counts = df[target].value_counts()
            plt.figure(figsize=(8, 6))
            plt.bar(target_counts.index, target_counts.values)
            plt.title(f'{target} Distribution')
            plt.xlabel(target)
            plt.ylabel('Count')
            plt.xticks(rotation=45)
            plt.show()
    return df
# Example usage
# df = pd.read_csv('your_data.csv')
# comprehensive_eda(df, target='target_column')

7. Machine Learning with Scikit-learn

Train-Test Split

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris  # load_boston was removed from scikit-learn 1.2+
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Training class distribution: {np.bincount(y_train)}")
print(f"Test class distribution: {np.bincount(y_test)}")

Classification Models

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train_scaled, y_train)
y_pred = log_reg.predict(X_test_scaled)
print("Logistic Regression")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("Decision Tree")
print(f"Accuracy: {accuracy_score(y_test, y_pred_dt):.4f}")
plt.figure(figsize=(20, 10))
plot_tree(dt, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.title('Decision Tree Visualization')
plt.show()
# Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
# Feature importance
feature_importance = pd.DataFrame({
    'feature': iris.feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("Feature Importance:")
print(feature_importance)
# SVM
from sklearn.svm import SVC
svm = SVC(kernel='rbf', gamma='auto', random_state=42)
svm.fit(X_train_scaled, y_train)
y_pred_svm = svm.predict(X_test_scaled)
print("SVM")
print(f"Accuracy: {accuracy_score(y_test, y_pred_svm):.4f}")
# KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred_knn = knn.predict(X_test_scaled)
print("KNN")
print(f"Accuracy: {accuracy_score(y_test, y_pred_knn):.4f}")

Cross-Validation and Hyperparameter Tuning

from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV
# Cross-validation scores
cv_scores = cross_val_score(log_reg, X_train_scaled, y_train, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
# Grid Search for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)
# Randomized Search (faster for large grids)
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)
print("Best params (Random):", random_search.best_params_)

Regression Models

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Linear Regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)
# Ridge Regression (L2 regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
y_pred_ridge = ridge.predict(X_test_scaled)
# Lasso Regression (L1 regularization)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)
y_pred_lasso = lasso.predict(X_test_scaled)
# Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
# Gradient Boosting
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
# Compare models
models = {
    'Linear Regression': y_pred_lr,
    'Ridge': y_pred_ridge,
    'Lasso': y_pred_lasso,
    'Random Forest': y_pred_rf,
    'Gradient Boosting': y_pred_gb
}
print("Regression Results:")
print("="*60)
for name, y_pred in models.items():
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"{name}:")
    print(f"  RMSE: {rmse:.4f}")
    print(f"  MAE: {mae:.4f}")
    print(f"  R²: {r2:.4f}")
    print()
# Feature importance for tree-based models
rf_importance = pd.DataFrame({
    'feature': housing.feature_names,
    'importance': rf_reg.feature_importances_
}).sort_values('importance', ascending=False)
print("Random Forest Feature Importance:")
print(rf_importance)
# Plot predictions vs actual
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_rf, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Random Forest: Actual vs Predicted')
plt.show()
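
A single train/test split can be noisy, so cross_val_score gives a more stable error estimate. scikit-learn scorers follow a "higher is better" convention, which is why RMSE is exposed as a negated score. A minimal sketch:

from sklearn.model_selection import cross_val_score
cv_rmse = -cross_val_score(RandomForestRegressor(n_estimators=100, random_state=42),
                           X_train, y_train, cv=5,
                           scoring='neg_root_mean_squared_error')
print(f"CV RMSE: {cv_rmse.mean():.4f} (+/- {cv_rmse.std():.4f})")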

Clustering

from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
# Load data
from sklearn.datasets import make_blobs
# Generate sample data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# K-means clustering
kmeans = KMeans(n_clusters=4, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)
# Find the optimal number of clusters (use a separate loop variable so the
# fitted 4-cluster model above is not overwritten)
inertias = []
silhouette_scores = []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, km.labels_))
# Elbow method
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(K_range, inertias, 'bo-')
axes[0].set_xlabel('Number of clusters (k)')
axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method')
axes[1].plot(K_range, silhouette_scores, 'ro-')
axes[1].set_xlabel('Number of clusters (k)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Score Method')
plt.tight_layout()
plt.show()
# Visualize clusters
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis')
plt.title('True Clusters')
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis')
# The centers live in scaled space; map them back to the original units
centers = scaler.inverse_transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1],
            s=200, c='red', marker='x', linewidths=3)
plt.title('K-Means Clusters')
plt.show()
# DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
y_dbscan = dbscan.fit_predict(X_scaled)
print(f"K-Means clusters: {len(np.unique(y_kmeans))}")
print(f"DBSCAN clusters: {len(np.unique(y_dbscan))} (including noise)")
print(f"DBSCAN noise points: {np.sum(y_dbscan == -1)}")

Model Evaluation and Validation

from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
from sklearn.model_selection import learning_curve, validation_curve
# Confusion matrix visualization
def plot_confusion_matrix(y_true, y_pred, classes):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=classes, yticklabels=classes)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix')
    plt.show()
# Plot ROC curves (for binary classification)
def plot_roc_curves(models, X_test, y_test):
    plt.figure(figsize=(10, 8))
    for name, model in models.items():
        if hasattr(model, "predict_proba"):
            y_pred_proba = model.predict_proba(X_test)[:, 1]
            fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
            roc_auc = auc(fpr, tpr)
            plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curves')
    plt.legend(loc='lower right')
    plt.show()
# Learning curves
def plot_learning_curve(estimator, X, y, cv=5):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy'
    )
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)
    plt.figure(figsize=(10, 6))
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
    plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1)
    plt.plot(train_sizes, train_mean, 'o-', label='Training score')
    plt.plot(train_sizes, test_mean, 'o-', label='Cross-validation score')
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.title('Learning Curves')
    plt.legend(loc='best')
    plt.grid(True)
    plt.show()
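
A usage sketch for these helpers, assuming the iris models and split from the Classification Models subsection are still in memory (later sections reuse these variable names). The ROC helper assumes a binary target, so the sketch sticks to the confusion matrix and the learning curve:

# Confusion matrix for the random forest trained on iris
plot_confusion_matrix(y_test, y_pred_rf, classes=iris.target_names)
# Learning curve for logistic regression on the scaled training data
plot_learning_curve(LogisticRegression(max_iter=1000, random_state=42),
                    X_train_scaled, y_train, cv=5)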

8. Deep Learning with TensorFlow/Keras

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Set random seed for reproducibility
tf.random.set_seed(42)
np.random.seed(42)
# Load and prepare data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import classification_report  # used for the report below
iris = load_iris()
X = iris.data
y = iris.target.reshape(-1, 1)
# One-hot encode labels
encoder = OneHotEncoder()
y_encoded = encoder.fit_transform(y).toarray()
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Build neural network
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(4,)),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(3, activation='softmax')
])
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
# Model summary
model.summary()
# Train model
history = model.fit(
    X_train_scaled, y_train,
    epochs=100,
    batch_size=16,
    validation_split=0.2,
    verbose=1
)
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(history.history['loss'], label='Training Loss')
axes[0].plot(history.history['val_loss'], label='Validation Loss')
axes[0].set_title('Model Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].legend()
axes[1].plot(history.history['accuracy'], label='Training Accuracy')
axes[1].plot(history.history['val_accuracy'], label='Validation Accuracy')
axes[1].set_title('Model Accuracy')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].legend()
plt.tight_layout()
plt.show()
# Evaluate model
test_loss, test_accuracy = model.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Test accuracy: {test_accuracy:.4f}")
# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_classes = np.argmax(y_pred, axis=1)
y_true_classes = np.argmax(y_test, axis=1)
# Classification report
print("\nClassification Report:")
print(classification_report(y_true_classes, y_pred_classes, target_names=iris.target_names))
# Convolutional Neural Network for image classification
from tensorflow.keras.datasets import mnist
# Load MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Preprocess
X_train = X_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
X_test = X_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0
# One-hot encode labels
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
# Build CNN
cnn_model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])
cnn_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Train CNN
cnn_history = cnn_model.fit(
    X_train, y_train,
    epochs=5,
    batch_size=64,
    validation_split=0.2,
    verbose=1
)
# Evaluate CNN
test_loss, test_acc = cnn_model.evaluate(X_test, y_test, verbose=0)
print(f"CNN Test accuracy: {test_acc:.4f}")
# Make predictions on test images
predictions = cnn_model.predict(X_test[:5])
# Visualize predictions
fig, axes = plt.subplots(1, 5, figsize=(12, 3))
for i, ax in enumerate(axes):
    ax.imshow(X_test[i].reshape(28, 28), cmap='gray')
    ax.set_title(f'Pred: {np.argmax(predictions[i])}')
    ax.axis('off')
plt.show()
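
Once trained, a Keras model can be written to disk and reloaded without retraining; the .keras format bundles the architecture and the weights. A minimal sketch (the file name is arbitrary):

# Save the trained CNN (.keras is the current Keras default format)
cnn_model.save('mnist_cnn.keras')
# Later, reload and use it without retraining
reloaded = keras.models.load_model('mnist_cnn.keras')
print(f"Reloaded test accuracy: {reloaded.evaluate(X_test, y_test, verbose=0)[1]:.4f}")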

9. Real-World Data Science Project

Customer Churn Prediction Project

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import warnings
warnings.filterwarnings('ignore')
# Generate synthetic customer churn data
np.random.seed(42)
n_customers = 10000
# Generate features
data = {
    'customer_id': range(1, n_customers + 1),
    'age': np.random.randint(18, 70, n_customers),
    'gender': np.random.choice(['Male', 'Female'], n_customers),
    'tenure': np.random.randint(1, 72, n_customers),  # months
    'monthly_charges': np.random.uniform(20, 100, n_customers),
    'total_charges': np.random.uniform(200, 5000, n_customers),
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_customers,
                                      p=[0.5, 0.3, 0.2]),
    'payment_method': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'], n_customers),
    'paperless_billing': np.random.choice([0, 1], n_customers, p=[0.4, 0.6]),
    'num_services': np.random.randint(1, 8, n_customers),
    'has_online_security': np.random.choice([0, 1], n_customers, p=[0.6, 0.4]),
    'has_tech_support': np.random.choice([0, 1], n_customers, p=[0.6, 0.4]),
    'avg_monthly_gb_download': np.random.uniform(0, 200, n_customers),
    'num_complaints': np.random.poisson(0.5, n_customers),
    'satisfaction_score': np.random.randint(1, 6, n_customers)
}
df = pd.DataFrame(data)
# Create churn label (synthetic, based on some features)
df['churn'] = (
    (df['tenure'] < 12) * 0.3 +
    (df['monthly_charges'] > 60) * 0.2 +
    (df['contract_type'] == 'Month-to-month') * 0.2 +
    (df['num_complaints'] > 0) * 0.2 +
    (df['satisfaction_score'] < 3) * 0.1
)
df['churn'] = (df['churn'] > np.random.uniform(0, 0.8, n_customers)).astype(int)
# Display basic info
print("Dataset Info:")
print(df.info())
print("\nChurn Rate:")
print(df['churn'].value_counts(normalize=True))
# Exploratory Data Analysis
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
# Age distribution by churn
sns.histplot(data=df, x='age', hue='churn', kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Age Distribution by Churn')
# Monthly charges by churn
sns.boxplot(data=df, x='churn', y='monthly_charges', ax=axes[0, 1])
axes[0, 1].set_title('Monthly Charges by Churn')
# Tenure by churn
sns.boxplot(data=df, x='churn', y='tenure', ax=axes[0, 2])
axes[0, 2].set_title('Tenure by Churn')
# Contract type by churn
contract_churn = pd.crosstab(df['contract_type'], df['churn'], normalize='index')
contract_churn.plot(kind='bar', ax=axes[1, 0])
axes[1, 0].set_title('Churn Rate by Contract Type')
# Payment method by churn
payment_churn = pd.crosstab(df['payment_method'], df['churn'], normalize='index')
payment_churn.plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_title('Churn Rate by Payment Method')
axes[1, 1].tick_params(axis='x', rotation=45)
# Satisfaction score by churn
sns.boxplot(data=df, x='churn', y='satisfaction_score', ax=axes[1, 2])
axes[1, 2].set_title('Satisfaction Score by Churn')
plt.tight_layout()
plt.show()
# Feature engineering
# Encode categorical variables
categorical_cols = ['gender', 'contract_type', 'payment_method']
for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col])
# Create additional features
df['avg_monthly_charges_per_service'] = df['monthly_charges'] / df['num_services']
df['tenure_years'] = df['tenure'] / 12
df['is_high_value'] = ((df['monthly_charges'] > 70) & (df['tenure'] > 12)).astype(int)
# Select features for modeling
feature_cols = ['age', 'gender', 'tenure', 'monthly_charges', 'total_charges',
                'contract_type', 'payment_method', 'paperless_billing',
                'num_services', 'has_online_security', 'has_tech_support',
                'avg_monthly_gb_download', 'num_complaints', 'satisfaction_score',
                'avg_monthly_charges_per_service', 'tenure_years', 'is_high_value']
X = df[feature_cols]
y = df['churn']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100)
}
results = {}
for name, model in models.items():
    # Train
    model.fit(X_train_scaled, y_train)
    # Predict
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    # Evaluate
    accuracy = model.score(X_test_scaled, y_test)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    results[name] = {
        'model': model,
        'accuracy': accuracy,
        'roc_auc': roc_auc,
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba
    }
    print(f"{name}:")
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  ROC AUC: {roc_auc:.4f}")
    print()
# Compare models
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# ROC Curves
for name, res in results.items():
    fpr, tpr, _ = roc_curve(y_test, res['y_pred_proba'])
    axes[0].plot(fpr, tpr, label=f"{name} (AUC = {res['roc_auc']:.3f})")
axes[0].plot([0, 1], [0, 1], 'k--')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curves')
axes[0].legend()
# Feature Importance (Random Forest)
rf_model = results['Random Forest']['model']
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False).head(10)
axes[1].barh(feature_importance['feature'], feature_importance['importance'])
axes[1].set_xlabel('Importance')
axes[1].set_title('Top 10 Features (Random Forest)')
# Confusion Matrix
cm = confusion_matrix(y_test, results['Random Forest']['y_pred'])
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[2])
axes[2].set_xlabel('Predicted')
axes[2].set_ylabel('Actual')
axes[2].set_title('Confusion Matrix - Random Forest')
plt.tight_layout()
plt.show()
# Detailed classification report
print("\nClassification Report - Random Forest:")
print(classification_report(y_test, results['Random Forest']['y_pred']))
# Business insights
print("\nKey Insights for Customer Churn Prevention:")
print("="*50)
# Calculate churn risk by segment
high_risk_segments = []
# By contract type
contract_risk = df.groupby('contract_type')['churn'].mean()
high_risk_contract = contract_risk.idxmax()
high_risk_segments.append(f"Customers with {high_risk_contract} contracts")
# By payment method
payment_risk = df.groupby('payment_method')['churn'].mean()
high_risk_payment = payment_risk.idxmax()
high_risk_segments.append(f"Customers using {high_risk_payment}")
# By low satisfaction
satisfaction_risk = df[df['satisfaction_score'] <= 2]['churn'].mean()
high_risk_segments.append(f"Low satisfaction customers (score <= 2): {satisfaction_risk:.1%} churn")
# By complaints
complaint_risk = df[df['num_complaints'] > 0]['churn'].mean()
high_risk_segments.append(f"Customers with complaints: {complaint_risk:.1%} churn")
print("\nHigh-risk customer segments to focus retention efforts:")
for segment in high_risk_segments:
    print(f"  • {segment}")
print("\nRecommended actions:")
print("  1. Offer incentives for longer contract commitments")
print("  2. Improve payment experience for high-risk payment methods")
print("  3. Proactive customer support for low satisfaction customers")
print("  4. Address complaints quickly and follow up")
print("  5. Consider loyalty programs for short-tenure customers")

Conclusion

Data Science with Python is a vast and exciting field. This guide has covered the essential libraries, techniques, and workflows:

Key Takeaways

  1. NumPy: Foundation for numerical computing
  2. Pandas: Data manipulation and analysis
  3. Matplotlib/Seaborn: Data visualization
  4. Scikit-learn: Machine learning algorithms
  5. TensorFlow/Keras: Deep learning
  6. EDA: Understanding data before modeling
  7. Feature Engineering: Creating meaningful features
  8. Model Evaluation: Assessing performance

Best Practices

  1. Reproducibility: Set random seeds, version control
  2. Documentation: Comment code, document assumptions
  3. Modular Code: Write reusable functions
  4. Testing: Validate data and model assumptions
  5. Version Control: Track changes to code and data
  6. Monitoring: Track model performance over time

Next Steps

  • Advanced Topics: Time series, NLP, Computer Vision
  • Big Data: PySpark, Dask
  • MLOps: Model deployment, monitoring
  • Specialization: Choose a domain (finance, healthcare, etc.)
  • Contributions: Open source data science projects

Remember: Data science is an iterative process. Start with simple models, validate your assumptions, and gradually increase complexity. Always focus on solving real problems with data!
