Complete Guide to Linear Regression: A Practical Case Study

Introduction

Linear regression is one of the most fundamental and widely used algorithms in data science and machine learning. This comprehensive guide walks through a complete linear regression case study, from problem definition to model deployment, using a realistic dataset. We'll cover every step of the data science workflow with detailed explanations and code examples.

Case Study Scenario

Problem: A real estate company wants to predict house prices based on various features to help buyers and sellers make informed decisions.

Goal: Build a linear regression model that accurately predicts house prices and identifies the most important factors influencing price.


Part 1: Problem Definition and Data Collection

1.1 Understanding the Problem

First, we need to understand what we're trying to predict and what data we need.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import warnings
warnings.filterwarnings('ignore')
# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
print("Linear Regression Case Study: House Price Prediction")
print("=" * 60)

1.2 Generate or Load Dataset

For this case study, we'll create a realistic synthetic dataset based on real-world housing market characteristics.

def generate_housing_data(n_samples=1000):
    """Generate realistic synthetic housing data."""
    np.random.seed(42)

    # Generate features
    sqft = np.random.normal(2000, 500, n_samples)  # Square footage
    sqft = np.clip(sqft, 500, 5000)
    bedrooms = np.random.randint(1, 6, n_samples)  # Number of bedrooms
    bathrooms = np.random.uniform(1, 4, n_samples)  # Number of bathrooms
    bathrooms = np.round(bathrooms * 2) / 2  # Round to the nearest half bath
    age = np.random.randint(0, 100, n_samples)  # Age of house
    lot_size = np.random.exponential(5000, n_samples)  # Lot size in sq ft
    lot_size = np.clip(lot_size, 1000, 20000)
    # Location score (0-10)
    location_score = np.random.uniform(0, 10, n_samples)
    # Condition score (1-5)
    condition = np.random.choice([1, 2, 3, 4, 5], n_samples,
                                 p=[0.05, 0.1, 0.4, 0.3, 0.15])
    # Garage (0-3 cars)
    garage = np.random.choice([0, 1, 2, 3], n_samples,
                              p=[0.1, 0.3, 0.4, 0.2])

    # Generate price with realistic relationships
    base_price = 50000
    sqft_coef = 80
    bedroom_coef = 15000
    bathroom_coef = 12000
    age_coef = -500
    lot_coef = 2
    location_coef = 15000
    condition_coef = 8000
    garage_coef = 8000

    price = (base_price +
             sqft_coef * sqft +
             bedroom_coef * bedrooms +
             bathroom_coef * bathrooms +
             age_coef * age +
             lot_coef * lot_size +
             location_coef * location_score +
             condition_coef * condition +
             garage_coef * garage +
             # Interactions
             20 * sqft * location_score +  # Location matters more for larger houses
             -200 * age * condition)       # Older houses in poor condition are worth less

    # Add random noise and enforce a minimum price
    noise = np.random.normal(0, 20000, n_samples)
    price = np.maximum(price + noise, 50000)

    # Create DataFrame
    df = pd.DataFrame({
        'sqft': sqft,
        'bedrooms': bedrooms,
        'bathrooms': bathrooms,
        'age': age,
        'lot_size': lot_size,
        'location_score': location_score,
        'condition': condition,
        'garage': garage,
        'price': price
    })
    return df

# Generate dataset
df = generate_housing_data(1000)
print(f"Dataset shape: {df.shape}")
print("\nFirst 5 rows:")
print(df.head())

1.3 Initial Data Exploration

# Dataset overview
print("\nDataset Info:")
print(df.info())
print("\nDescriptive Statistics:")
print(df.describe().round(2))
# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

Part 2: Exploratory Data Analysis (EDA)

2.1 Distribution Analysis

# Price distribution
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
# Price distribution
axes[0, 0].hist(df['price'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Price ($)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of House Prices')
axes[0, 0].axvline(df['price'].mean(), color='red', linestyle='--', 
label=f"Mean: ${df['price'].mean():,.0f}")
axes[0, 0].axvline(df['price'].median(), color='green', linestyle='--', 
label=f"Median: ${df['price'].median():,.0f}")
axes[0, 0].legend()
# Square footage distribution
axes[0, 1].hist(df['sqft'], bins=40, edgecolor='black', alpha=0.7)
axes[0, 1].set_xlabel('Square Footage')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Distribution of Square Footage')
# Age distribution
axes[0, 2].hist(df['age'], bins=30, edgecolor='black', alpha=0.7)
axes[0, 2].set_xlabel('Age (years)')
axes[0, 2].set_ylabel('Frequency')
axes[0, 2].set_title('Distribution of House Age')
# Bedrooms distribution
df['bedrooms'].value_counts().sort_index().plot(kind='bar', ax=axes[1, 0])
axes[1, 0].set_xlabel('Number of Bedrooms')
axes[1, 0].set_ylabel('Count')
axes[1, 0].set_title('Distribution of Bedrooms')
# Bathrooms distribution
df['bathrooms'].value_counts().sort_index().plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_xlabel('Number of Bathrooms')
axes[1, 1].set_ylabel('Count')
axes[1, 1].set_title('Distribution of Bathrooms')
# Condition distribution
df['condition'].value_counts().sort_index().plot(kind='bar', ax=axes[1, 2])
axes[1, 2].set_xlabel('Condition (1=Poor, 5=Excellent)')
axes[1, 2].set_ylabel('Count')
axes[1, 2].set_title('Distribution of Condition')
plt.tight_layout()
plt.show()
# Statistical summary of price
print("\nPrice Statistics:")
print(f"Mean:     ${df['price'].mean():,.0f}")
print(f"Median:   ${df['price'].median():,.0f}")
print(f"Std Dev:  ${df['price'].std():,.0f}")
print(f"Min:      ${df['price'].min():,.0f}")
print(f"Max:      ${df['price'].max():,.0f}")

2.2 Relationship Analysis

# Scatter plots for continuous variables
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Price vs Square Footage
axes[0, 0].scatter(df['sqft'], df['price'], alpha=0.5)
axes[0, 0].set_xlabel('Square Footage')
axes[0, 0].set_ylabel('Price ($)')
axes[0, 0].set_title('Price vs Square Footage')
# Add trend line
z = np.polyfit(df['sqft'], df['price'], 1)
p = np.poly1d(z)
axes[0, 0].plot(df['sqft'].sort_values(), p(df['sqft'].sort_values()), 
'r-', linewidth=2, label='Trend')
axes[0, 0].legend()
# Price vs Age
axes[0, 1].scatter(df['age'], df['price'], alpha=0.5)
axes[0, 1].set_xlabel('Age (years)')
axes[0, 1].set_ylabel('Price ($)')
axes[0, 1].set_title('Price vs Age')
z = np.polyfit(df['age'], df['price'], 1)
p = np.poly1d(z)
axes[0, 1].plot(df['age'].sort_values(), p(df['age'].sort_values()), 
'r-', linewidth=2, label='Trend')
axes[0, 1].legend()
# Price vs Lot Size
axes[1, 0].scatter(df['lot_size'], df['price'], alpha=0.5)
axes[1, 0].set_xlabel('Lot Size (sq ft)')
axes[1, 0].set_ylabel('Price ($)')
axes[1, 0].set_title('Price vs Lot Size')
z = np.polyfit(df['lot_size'], df['price'], 1)
p = np.poly1d(z)
axes[1, 0].plot(df['lot_size'].sort_values(), p(df['lot_size'].sort_values()), 
'r-', linewidth=2, label='Trend')
axes[1, 0].legend()
# Price vs Location Score
axes[1, 1].scatter(df['location_score'], df['price'], alpha=0.5)
axes[1, 1].set_xlabel('Location Score (0-10)')
axes[1, 1].set_ylabel('Price ($)')
axes[1, 1].set_title('Price vs Location Score')
z = np.polyfit(df['location_score'], df['price'], 1)
p = np.poly1d(z)
axes[1, 1].plot(df['location_score'].sort_values(), p(df['location_score'].sort_values()), 
'r-', linewidth=2, label='Trend')
axes[1, 1].legend()
plt.tight_layout()
plt.show()

2.3 Categorical Variable Analysis

# Box plots for categorical variables
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Price by Bedrooms
df.boxplot(column='price', by='bedrooms', ax=axes[0])
axes[0].set_title('Price by Number of Bedrooms')
axes[0].set_xlabel('Bedrooms')
axes[0].set_ylabel('Price ($)')
# Price by Condition
df.boxplot(column='price', by='condition', ax=axes[1])
axes[1].set_title('Price by Condition')
axes[1].set_xlabel('Condition (1=Poor, 5=Excellent)')
axes[1].set_ylabel('Price ($)')
# Price by Garage
df.boxplot(column='price', by='garage', ax=axes[2])
axes[2].set_title('Price by Garage Size')
axes[2].set_xlabel('Garage Spaces')
axes[2].set_ylabel('Price ($)')
plt.suptitle('')
plt.tight_layout()
plt.show()

2.4 Correlation Analysis

# Correlation matrix
corr_matrix = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, 
fmt='.3f', square=True, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Features')
plt.tight_layout()
plt.show()
# Correlation with target
print("\nCorrelation with Price:")
corr_with_price = corr_matrix['price'].drop('price').sort_values(ascending=False)
for feature, corr in corr_with_price.items():
    print(f"  {feature:15} : {corr:.3f}")
# Identify highly correlated feature pairs (multicollinearity)
print("\nHighly Correlated Features (|r| > 0.7):")
for i in range(len(corr_matrix.columns)):
    for j in range(i + 1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.7:
            print(f"  {corr_matrix.columns[i]} ↔ {corr_matrix.columns[j]}: {corr_matrix.iloc[i, j]:.3f}")
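Pairwise correlations only catch collinearity between two features at a time. A complementary check is the variance inflation factor (VIF), which regresses each feature on all the others; values above roughly 5-10 are commonly treated as a warning sign. A minimal sketch using scikit-learn (the `vif` helper and the synthetic columns are illustrative, not part of the case study's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """VIF_i = 1 / (1 - R^2_i), where R^2_i comes from regressing
    column i on all the other columns."""
    out = []
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)
        r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(42)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)  # nearly collinear with a
X = np.column_stack([a, b, c])
print(vif(X).round(1))  # large VIFs for the collinear pair, ~1 for b
```

The same function can be run on the housing feature matrix; engineered features like `sqft_squared` will typically show inflated VIFs against `sqft`.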

Part 3: Data Preprocessing

3.1 Handle Outliers

def detect_outliers(df, column, threshold=3):
    """Detect outliers using Z-scores."""
    z_scores = np.abs((df[column] - df[column].mean()) / df[column].std())
    outliers = df[z_scores > threshold]
    return outliers
# Detect outliers in price
price_outliers = detect_outliers(df, 'price')
print(f"Outliers in price: {len(price_outliers)} houses")
print(f"Outlier prices range: ${price_outliers['price'].min():,.0f} - ${price_outliers['price'].max():,.0f}")
# Visualize outliers
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Box plot with outliers
axes[0].boxplot(df['price'])
axes[0].set_ylabel('Price ($)')
axes[0].set_title('Box Plot of Prices (with outliers)')
# Histogram with outlier highlight
axes[1].hist(df['price'], bins=50, alpha=0.7, edgecolor='black')
axes[1].hist(price_outliers['price'], bins=20, alpha=0.5, color='red', edgecolor='black')
axes[1].set_xlabel('Price ($)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Price Distribution with Outliers Highlighted')
axes[1].legend(['Normal', 'Outliers'])
plt.tight_layout()
plt.show()
# For linear regression, we'll cap outliers at 3 standard deviations
def cap_outliers(df, column, n_std=3):
    """Cap outliers at n standard deviations from the mean."""
    mean = df[column].mean()
    std = df[column].std()
    upper_limit = mean + n_std * std
    lower_limit = mean - n_std * std
    df[column] = df[column].clip(lower_limit, upper_limit)
    return df
df = cap_outliers(df, 'price')
print(f"After capping: Price range: ${df['price'].min():,.0f} - ${df['price'].max():,.0f}")
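The Z-score rule assumes roughly symmetric, bell-shaped data. A common alternative that is robust to skewed distributions (like prices or lot sizes) is the IQR fence, often called Tukey's rule. A small self-contained sketch with illustrative numbers:

```python
import pandas as pd

def iqr_outlier_mask(s, k=1.5):
    """True where a value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

prices = pd.Series([100, 110, 105, 95, 102, 98, 500])  # 500 is the outlier
mask = iqr_outlier_mask(prices)
print(prices[mask])  # flags only the 500
```

This is exactly what the whiskers in the box plots above are based on, so the two views agree by construction.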

3.2 Feature Engineering

# Create new features
# NOTE: price_per_sqft is derived from the target itself, so it is useful
# for EDA and reporting but must NOT be fed to the model (target leakage).
df['price_per_sqft'] = df['price'] / df['sqft']
df['rooms_per_sqft'] = df['bedrooms'] / df['sqft'] * 1000
df['bath_per_bedroom'] = df['bathrooms'] / df['bedrooms']
df['age_squared'] = df['age'] ** 2
df['sqft_squared'] = df['sqft'] ** 2
df['sqft_bedroom_interaction'] = df['sqft'] * df['bedrooms']
df['location_condition_interaction'] = df['location_score'] * df['condition']
df['age_condition_interaction'] = df['age'] * df['condition']
print("New features created:")
new_features = ['price_per_sqft', 'rooms_per_sqft', 'bath_per_bedroom',
                'age_squared', 'sqft_squared', 'sqft_bedroom_interaction',
                'location_condition_interaction', 'age_condition_interaction']
for feat in new_features:
    print(f"  {feat}")

3.3 Prepare Features for Modeling

# Select features. price_per_sqft is deliberately excluded: it is computed
# from the target, so including it would leak the answer into the model.
feature_cols = ['sqft', 'bedrooms', 'bathrooms', 'age', 'lot_size',
                'location_score', 'condition', 'garage',
                'rooms_per_sqft', 'age_squared', 'sqft_squared',
                'sqft_bedroom_interaction', 'location_condition_interaction']
X = df[feature_cols]
y = df['price']
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"\nTraining set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("\nFeatures scaled successfully")

Part 4: Model Building

4.1 Simple Linear Regression (Baseline)

# Start with simple linear regression using the most correlated feature
best_feature = corr_with_price.index[0]
print(f"Starting with best single feature: {best_feature}")
# Simple model
simple_model = LinearRegression()
X_train_simple = X_train[[best_feature]]
X_test_simple = X_test[[best_feature]]
simple_model.fit(X_train_simple, y_train)
# Predictions
y_pred_simple = simple_model.predict(X_test_simple)
# Evaluate
r2_simple = r2_score(y_test, y_pred_simple)
rmse_simple = np.sqrt(mean_squared_error(y_test, y_pred_simple))
mae_simple = mean_absolute_error(y_test, y_pred_simple)
print("\nSimple Linear Regression Results:")
print(f"  R² Score: {r2_simple:.4f}")
print(f"  RMSE: ${rmse_simple:,.0f}")
print(f"  MAE: ${mae_simple:,.0f}")
print(f"  Coefficient: {simple_model.coef_[0]:.2f}")
print(f"  Intercept: {simple_model.intercept_:,.2f}")

4.2 Multiple Linear Regression

# Multiple linear regression with all features
model = LinearRegression()
model.fit(X_train_scaled, y_train)
# Predictions
y_pred_train = model.predict(X_train_scaled)
y_pred_test = model.predict(X_test_scaled)
# Evaluate
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
mae_test = mean_absolute_error(y_test, y_pred_test)
print("Multiple Linear Regression Results")
print("=" * 50)
print(f"Training R²: {r2_train:.4f}")
print(f"Test R²: {r2_test:.4f}")
print(f"Test RMSE: ${rmse_test:,.0f}")
print(f"Test MAE: ${mae_test:,.0f}")
# Feature importance
feature_importance = pd.DataFrame({
'feature': feature_cols,
'coefficient': model.coef_,
'abs_coefficient': np.abs(model.coef_)
}).sort_values('abs_coefficient', ascending=False)
print("\nTop 10 Most Important Features:")
print(feature_importance.head(10).to_string(index=False))

4.3 Regularized Regression (Ridge and Lasso)

# Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
y_pred_ridge = ridge.predict(X_test_scaled)
# Lasso Regression
lasso = Lasso(alpha=0.01)
lasso.fit(X_train_scaled, y_train)
y_pred_lasso = lasso.predict(X_test_scaled)
# Evaluate
ridge_r2 = r2_score(y_test, y_pred_ridge)
lasso_r2 = r2_score(y_test, y_pred_lasso)
print("Regularized Regression Results")
print("=" * 50)
print(f"Ridge R²: {ridge_r2:.4f}")
print(f"Lasso R²: {lasso_r2:.4f}")
# Feature selection with Lasso
lasso_features = pd.DataFrame({
'feature': feature_cols,
'coefficient': lasso.coef_
}).sort_values('coefficient', ascending=False)
print("\nLasso Feature Selection (non-zero coefficients):")
non_zero = lasso_features[lasso_features['coefficient'] != 0]
print(non_zero.to_string(index=False))

Part 5: Model Evaluation

5.1 Residual Analysis

# Calculate residuals
residuals = y_test - y_pred_test
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Residuals vs Predicted
axes[0, 0].scatter(y_pred_test, residuals, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0, 0].set_xlabel('Predicted Price ($)')
axes[0, 0].set_ylabel('Residuals ($)')
axes[0, 0].set_title('Residuals vs Predicted Values')
axes[0, 0].grid(True, alpha=0.3)
# Histogram of residuals
axes[0, 1].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
axes[0, 1].set_xlabel('Residuals ($)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Distribution of Residuals')
axes[0, 1].axvline(x=0, color='r', linestyle='--', linewidth=2)
# Q-Q plot
from scipy import stats
stats.probplot(residuals, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Q-Q Plot (Normality Check)')
axes[1, 0].grid(True, alpha=0.3)
# Actual vs Predicted
axes[1, 1].scatter(y_test, y_pred_test, alpha=0.5)
axes[1, 1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
'r--', linewidth=2, label='Perfect Prediction')
axes[1, 1].set_xlabel('Actual Price ($)')
axes[1, 1].set_ylabel('Predicted Price ($)')
axes[1, 1].set_title('Actual vs Predicted Prices')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Residual statistics
print("Residual Analysis")
print("=" * 50)
print(f"Mean of residuals: ${residuals.mean():.2f}")
print(f"Std of residuals: ${residuals.std():.2f}")
print(f"Min residual: ${residuals.min():.2f}")
print(f"Max residual: ${residuals.max():.2f}")
print(f"95% of residuals within: ±${1.96 * residuals.std():.0f}")
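Beyond the plots, the independence assumption can be screened numerically. A widely used statistic is Durbin-Watson, which sits near 2 when residuals show no first-order autocorrelation and approaches 0 when consecutive residuals are strongly correlated. A self-contained sketch on synthetic residuals (stand-ins, not the model's actual residuals):

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    over the sum of squared residuals; ~2 means no autocorrelation."""
    r = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(r) ** 2) / np.sum(r ** 2)

rng = np.random.default_rng(0)
independent = rng.normal(size=1000)       # well-behaved residuals
autocorrelated = np.cumsum(independent)   # strongly trending residuals

print(f"independent:    {durbin_watson(independent):.2f}")    # near 2
print(f"autocorrelated: {durbin_watson(autocorrelated):.2f}")  # near 0
```

Running `durbin_watson(residuals)` on the test-set residuals above gives a quick sanity check to accompany the residual plots.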

5.2 Cross-Validation

from sklearn.model_selection import cross_val_score, KFold
# Perform cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=kfold, scoring='r2')
print("Cross-Validation Results")
print("=" * 50)
print(f"CV R² scores: {cv_scores}")
print(f"Mean CV R²: {cv_scores.mean():.4f}")
print(f"Std CV R²: {cv_scores.std():.4f}")
# Compare with baseline
baseline_cv = cross_val_score(simple_model, X_train_simple, y_train, cv=kfold, scoring='r2')
print(f"\nBaseline CV R²: {baseline_cv.mean():.4f}")
print(f"Improvement: {(cv_scores.mean() - baseline_cv.mean()) * 100:.1f}%")
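R² is not the only option for cross-validation; `cross_val_score` accepts other metrics through the `scoring` parameter. Note that scikit-learn always maximizes scores, so error metrics come back negated. A small sketch on synthetic data (not the housing set):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# Error metrics are negated so that "greater is better" holds
neg_rmse = cross_val_score(LinearRegression(), X, y, cv=kfold,
                           scoring='neg_root_mean_squared_error')
print(f"CV RMSE: {-neg_rmse.mean():.3f} ± {neg_rmse.std():.3f}")
```

Reporting CV RMSE in dollars alongside CV R² usually makes results easier for stakeholders to interpret.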

Part 6: Model Interpretation

6.1 Feature Importance Visualization

# Feature importance plot
fig, ax = plt.subplots(figsize=(10, 8))
importance_data = feature_importance.head(15).sort_values('coefficient')
colors = ['green' if x > 0 else 'red' for x in importance_data['coefficient']]
ax.barh(importance_data['feature'], importance_data['coefficient'], color=colors)
ax.axvline(x=0, color='black', linewidth=1)
ax.set_xlabel('Coefficient Value')
ax.set_title('Feature Importance (Coefficient Magnitude)')
# Add value labels
for i, (value, name) in enumerate(zip(importance_data['coefficient'], importance_data['feature'])):
    ax.text(value, i, f'  {value:.0f}', va='center')
plt.tight_layout()
plt.show()
# Economic interpretation
print("\nEconomic Interpretation of Key Features")
print("=" * 60)
print("Based on the model coefficients (scaled features):")
for idx, row in feature_importance.head(8).iterrows():
    feature = row['feature']
    coef = row['coefficient']
    impact = "increases" if coef > 0 else "decreases"
    print(f"  • {feature}: {impact} price by ${abs(coef):,.0f} per standard deviation")

6.2 Prediction Example

# Create a sample house for prediction
sample_house = pd.DataFrame({
    'sqft': [2000],
    'bedrooms': [3],
    'bathrooms': [2],
    'age': [10],
    'lot_size': [5000],
    'location_score': [8],
    'condition': [4],
    'garage': [2],
    # price_per_sqft depends on the unknown price, so it cannot be computed
    # at prediction time; it is set to 0 and should not be a model input
    'price_per_sqft': [0]
})
# Compute derived features
sample_house['rooms_per_sqft'] = sample_house['bedrooms'] / sample_house['sqft'] * 1000
sample_house['age_squared'] = sample_house['age'] ** 2
sample_house['sqft_squared'] = sample_house['sqft'] ** 2
sample_house['sqft_bedroom_interaction'] = sample_house['sqft'] * sample_house['bedrooms']
sample_house['location_condition_interaction'] = sample_house['location_score'] * sample_house['condition']
# Scale features
sample_scaled = scaler.transform(sample_house[feature_cols])
# Predict price
predicted_price = model.predict(sample_scaled)[0]
print("Sample House Prediction")
print("=" * 50)
print("House Features:")
for col in feature_cols[:8]:  # Show the original (non-derived) features
    print(f"  {col}: {sample_house[col].iloc[0]}")
print(f"\nPredicted Price: ${predicted_price:,.0f}")

Part 7: Model Optimization

7.1 Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV
# Tune Ridge regression
ridge_params = {'alpha': [0.01, 0.1, 1, 10, 100]}
ridge_grid = GridSearchCV(Ridge(), ridge_params, cv=5, scoring='r2')
ridge_grid.fit(X_train_scaled, y_train)
print("Ridge Hyperparameter Tuning")
print("=" * 50)
print(f"Best alpha: {ridge_grid.best_params_['alpha']}")
print(f"Best CV R²: {ridge_grid.best_score_:.4f}")
# Tune Lasso regression
lasso_params = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1]}
lasso_grid = GridSearchCV(Lasso(max_iter=10000), lasso_params, cv=5, scoring='r2')
lasso_grid.fit(X_train_scaled, y_train)
print("\nLasso Hyperparameter Tuning")
print("=" * 50)
print(f"Best alpha: {lasso_grid.best_params_['alpha']}")
print(f"Best CV R²: {lasso_grid.best_score_:.4f}")
# Best model
best_model = ridge_grid.best_estimator_
y_pred_best = best_model.predict(X_test_scaled)
best_r2 = r2_score(y_test, y_pred_best)
print("\nOptimized Model Results")
print("=" * 50)
print(f"Test R²: {best_r2:.4f}")
print(f"Improvement: {(best_r2 - r2_test) * 100:.2f}%")
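One caveat in the setup above: the scaler was fitted on the full training set before cross-validation, so each fold's validation data leaks slightly into the scaling statistics. The idiomatic scikit-learn fix is to put the scaler inside a `Pipeline` so it is refit within each fold. A sketch on synthetic data (the step name `ridge` and the `ridge__alpha` grid key follow the pipeline's naming convention):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(0, 0.5, 200)

# Scaling happens inside each CV fold, so no statistics leak across folds
pipe = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())])
grid = GridSearchCV(pipe, {'ridge__alpha': [0.01, 0.1, 1, 10]},
                    cv=5, scoring='r2')
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The same pattern applies directly to the housing model: pass the unscaled `X_train` and let the pipeline own the scaler.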

7.2 Feature Selection Based on Lasso

# Use optimized Lasso for feature selection
best_lasso = lasso_grid.best_estimator_
best_lasso.fit(X_train_scaled, y_train)
# Get selected features
selected_features = []
for feature, coef in zip(feature_cols, best_lasso.coef_):
    if abs(coef) > 0.01:
        selected_features.append(feature)
print(f"Features selected by Lasso: {len(selected_features)} out of {len(feature_cols)}")
print("Selected features:", selected_features)
# Train model with selected features
X_train_selected = X_train_scaled[:, [feature_cols.index(f) for f in selected_features]]
X_test_selected = X_test_scaled[:, [feature_cols.index(f) for f in selected_features]]
model_selected = LinearRegression()
model_selected.fit(X_train_selected, y_train)
y_pred_selected = model_selected.predict(X_test_selected)
r2_selected = r2_score(y_test, y_pred_selected)
print(f"\nModel with selected features R²: {r2_selected:.4f}")
print(f"Original model R²: {r2_test:.4f}")

Part 8: Model Deployment Preparation

8.1 Save the Model

import joblib
# Save model and preprocessing objects
joblib.dump(best_model, 'house_price_model.pkl')
joblib.dump(scaler, 'scaler.pkl')
joblib.dump(feature_cols, 'feature_cols.pkl')
print("Model and preprocessing objects saved successfully")
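The saved artifacts can be restored in a fresh process with `joblib.load`. A round-trip sketch on a tiny throwaway model (the file name is illustrative):

```python
import numpy as np
import joblib
from sklearn.linear_model import LinearRegression

# Train a tiny model, save it, and load it back
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
m = LinearRegression().fit(X, y)
joblib.dump(m, 'demo_model.pkl')

loaded = joblib.load('demo_model.pkl')
print(loaded.predict([[20.0]]))  # matches the original model's prediction
```

In production code, load the model, scaler, and `feature_cols` together, since a prediction is only valid when all three match the training run.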

8.2 Create Prediction Function

def predict_house_price(sqft, bedrooms, bathrooms, age, lot_size,
                        location_score, condition, garage):
    """
    Predict house price based on features.

    Parameters
    ----------
    sqft : float - Square footage
    bedrooms : int - Number of bedrooms
    bathrooms : float - Number of bathrooms
    age : int - Age of house in years
    lot_size : float - Lot size in sq ft
    location_score : float - Location score (0-10)
    condition : int - Condition (1-5)
    garage : int - Number of garage spaces

    Returns
    -------
    predicted_price : float - Predicted price
    """
    # Create DataFrame with input and derived features
    input_data = pd.DataFrame({
        'sqft': [sqft],
        'bedrooms': [bedrooms],
        'bathrooms': [bathrooms],
        'age': [age],
        'lot_size': [lot_size],
        'location_score': [location_score],
        'condition': [condition],
        'garage': [garage],
        # price_per_sqft cannot be computed at prediction time (it requires
        # the unknown price); kept at 0 for column alignment only
        'price_per_sqft': [0],
        'rooms_per_sqft': [bedrooms / sqft * 1000],
        'age_squared': [age ** 2],
        'sqft_squared': [sqft ** 2],
        'sqft_bedroom_interaction': [sqft * bedrooms],
        'location_condition_interaction': [location_score * condition]
    })
    # Scale features with the scaler fitted on the training data
    input_scaled = scaler.transform(input_data[feature_cols])
    # Predict
    return best_model.predict(input_scaled)[0]

# Test prediction
test_price = predict_house_price(2000, 3, 2, 10, 5000, 8, 4, 2)
print(f"Test prediction: ${test_price:,.0f}")

8.3 Create Interactive Prediction Tool

def interactive_prediction():
    """Interactively prompt for features and predict a house price."""
    print("\n" + "=" * 50)
    print("House Price Predictor")
    print("=" * 50)
    try:
        sqft = float(input("Square footage: "))
        bedrooms = int(input("Number of bedrooms: "))
        bathrooms = float(input("Number of bathrooms: "))
        age = int(input("Age of house (years): "))
        lot_size = float(input("Lot size (sq ft): "))
        location_score = float(input("Location score (0-10): "))
        condition = int(input("Condition (1-5): "))
        garage = int(input("Number of garage spaces: "))
        price = predict_house_price(sqft, bedrooms, bathrooms, age,
                                    lot_size, location_score, condition, garage)
        print(f"\n{'=' * 50}")
        print(f"Predicted Price: ${price:,.0f}")
        print(f"{'=' * 50}")
        # Rough 95% prediction interval based on the residual standard deviation
        interval = 1.96 * residuals.std()
        print(f"Approximate 95% Prediction Interval: ${price - interval:,.0f} - ${price + interval:,.0f}")
    except ValueError:
        print("Invalid input. Please enter numeric values.")

# Uncomment to run interactive prediction
# interactive_prediction()

Part 9: Results Summary and Insights

9.1 Model Performance Summary

print("\n" + "="*60)
print("LINEAR REGRESSION CASE STUDY - FINAL RESULTS")
print("="*60)
print("\n1. MODEL PERFORMANCE")
print("-"*40)
# Report RMSE and MAE for the optimized model, matching the R² shown
rmse_best = np.sqrt(mean_squared_error(y_test, y_pred_best))
mae_best = mean_absolute_error(y_test, y_pred_best)
print(f"   R² Score: {best_r2:.4f} ({best_r2*100:.1f}% of variance explained)")
print(f"   RMSE: ${rmse_best:,.0f}")
print(f"   MAE: ${mae_best:,.0f}")
print(f"   Mean Price: ${y_test.mean():,.0f}")
print(f"   Relative Error: {(mae_best / y_test.mean()) * 100:.1f}%")
print("\n2. KEY INSIGHTS")
print("-"*40)
# Most important features
top_features = feature_importance.head(5)
for idx, row in top_features.iterrows():
    if row['coefficient'] > 0:
        print(f"   ✓ {row['feature']}: Increases price (coefficient: {row['coefficient']:.0f})")
    else:
        print(f"   ✗ {row['feature']}: Decreases price (coefficient: {row['coefficient']:.0f})")
print("\n3. BUSINESS RECOMMENDATIONS")
print("-"*40)
print("   • Focus marketing on properties with high location scores")
print("   • Larger square footage provides the highest ROI")
print("   • Renovations that increase condition rating can significantly boost value")
print("   • Properties with garages command premium prices")
print("   • Age negatively impacts price, but at a decreasing rate")
print("\n4. MODEL LIMITATIONS")
print("-"*40)
print("   • Linear model may not capture all complex non-linear relationships")
print("   • Assumes independence of features (no multicollinearity)")
print("   • Sensitive to outliers (though we capped them)")
print("   • Does not account for market trends or seasonality")
print("\n5. NEXT STEPS")
print("-"*40)
print("   • Collect more data to improve model accuracy")
print("   • Add more features (school ratings, crime rates, amenities)")
print("   • Try non-linear models (Random Forest, Gradient Boosting)")
print("   • Deploy model as web API for real-time predictions")
print("   • Monitor model performance and retrain periodically")
print("\n" + "="*60)

9.2 Visual Summary

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Actual vs Predicted
axes[0, 0].scatter(y_test, y_pred_best, alpha=0.5)
axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
'r--', linewidth=2)
axes[0, 0].set_xlabel('Actual Price ($)')
axes[0, 0].set_ylabel('Predicted Price ($)')
axes[0, 0].set_title(f'Actual vs Predicted (R² = {best_r2:.3f})')
axes[0, 0].grid(True, alpha=0.3)
# Residuals distribution
residuals_best = y_test - y_pred_best
axes[0, 1].hist(residuals_best, bins=30, edgecolor='black', alpha=0.7)
axes[0, 1].set_xlabel('Residuals ($)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Distribution of Residuals')
axes[0, 1].axvline(x=0, color='r', linestyle='--', linewidth=2)
# Feature importance
importance_data = feature_importance.head(8).sort_values('coefficient')
colors = ['green' if x > 0 else 'red' for x in importance_data['coefficient']]
axes[1, 0].barh(importance_data['feature'], importance_data['coefficient'], color=colors)
axes[1, 0].set_xlabel('Coefficient')
axes[1, 0].set_title('Top 8 Feature Importance')
axes[1, 0].axvline(x=0, color='black', linewidth=1)
# Model comparison
models = ['Simple', 'Multiple', 'Ridge', 'Lasso', 'Optimized']
r2_scores = [r2_simple, r2_test, ridge_r2, lasso_r2, best_r2]
axes[1, 1].bar(models, r2_scores, color='skyblue', edgecolor='black')
axes[1, 1].set_ylabel('R² Score')
axes[1, 1].set_title('Model Comparison')
axes[1, 1].set_ylim(0, 1)
for i, v in enumerate(r2_scores):
    axes[1, 1].text(i, v + 0.01, f'{v:.3f}', ha='center')
plt.tight_layout()
plt.show()

Part 10: Conclusion and Key Learnings

Summary of Key Findings

print("\n" + "="*70)
print("LINEAR REGRESSION CASE STUDY: KEY LEARNINGS")
print("="*70)
learnings = [
"1. DATA PREPROCESSING IS CRUCIAL",
"   • Outliers can significantly impact model performance",
"   • Feature engineering (interactions, polynomials) can capture complex relationships",
"   • Scaling is essential for interpretation of coefficients",
"",
"2. MODEL SELECTION MATTERS",
"   • Multiple linear regression provides better fit than single feature models",
"   • Regularization (Ridge/Lasso) helps prevent overfitting",
"   • Cross-validation is essential for model evaluation",
"",
"3. INTERPRETABILITY IS A KEY ADVANTAGE",
"   • Linear models offer clear interpretation of feature impact",
"   • Coefficients show direction and magnitude of relationships",
"   • Important for business stakeholders to understand model decisions",
"",
"4. ASSUMPTIONS MUST BE CHECKED",
"   • Linearity: Relationships should be linear",
"   • Independence: Residuals should be independent",
"   • Homoscedasticity: Constant variance of residuals",
"   • Normality: Residuals should be normally distributed",
"",
"5. REAL-WORLD APPLICATIONS",
"   • House price prediction is a classic regression problem",
"   • Feature importance reveals key business drivers",
"   • Model can be deployed for real-time predictions",
]
for learning in learnings:
    print(learning)
print("\n" + "="*70)
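Learning 4 (checking assumptions) can also be screened numerically rather than only visually. The sketch below, on synthetic stand-ins for the model's residuals and fitted values, checks normality via skewness/kurtosis and homoscedasticity via a simple variance-ratio split (the thresholds are rules of thumb, not formal tests):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(0, 20000, 500)       # stand-in for model residuals
fitted = np.linspace(100000, 500000, 500)   # stand-in for predicted values

# Normality: skewness and excess kurtosis should both be near 0
print(f"skew: {stats.skew(residuals):.2f}, "
      f"kurtosis: {stats.kurtosis(residuals):.2f}")

# Homoscedasticity: residual spread should be similar across fitted values
low = residuals[fitted < fitted.mean()]
high = residuals[fitted >= fitted.mean()]
ratio = high.std() / low.std()
print(f"variance ratio (should be near 1): {ratio:.2f}")
```

For formal versions of these checks, the Shapiro-Wilk and Breusch-Pagan tests (available in scipy and statsmodels) are the usual next step.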

Final Model Metrics

# Save final metrics for reporting, computed from the optimized model
rmse_best = np.sqrt(mean_squared_error(y_test, y_pred_best))
mae_best = mean_absolute_error(y_test, y_pred_best)
final_metrics = {
    'model': 'Linear Regression (Optimized)',
    'r2_score': best_r2,
    'rmse': rmse_best,
    'mae': mae_best,
    'mean_price': y_test.mean(),
    'relative_error': (mae_best / y_test.mean()) * 100,
    'cv_mean': cv_scores.mean(),
    'cv_std': cv_scores.std()
}
print("\nFINAL MODEL METRICS")
print("="*50)
for metric, value in final_metrics.items():
    if metric in ['rmse', 'mae', 'mean_price']:
        print(f"{metric:15}: ${value:,.0f}")
    elif metric == 'relative_error':
        print(f"{metric:15}: {value:.1f}%")
    else:
        print(f"{metric:15}: {value:.4f}")

Conclusion

This comprehensive case study demonstrates the complete workflow of a linear regression project:

Key Takeaways

  1. Problem Definition: Understanding the business problem is the first and most important step
  2. Data Exploration: EDA reveals patterns, relationships, and potential issues
  3. Preprocessing: Feature engineering and outlier handling significantly impact model quality
  4. Model Building: Multiple linear regression provides a strong baseline
  5. Regularization: Ridge and Lasso help prevent overfitting and enable feature selection
  6. Evaluation: Multiple metrics and cross-validation ensure robust assessment
  7. Interpretation: Linear models offer excellent interpretability for stakeholders
  8. Deployment: Models can be saved and deployed for real-world use

Business Impact

The model successfully predicts house prices with good accuracy and reveals that:

  • Square footage is the most important factor
  • Location score has a strong positive impact
  • Age negatively affects price
  • Condition and garage add significant value

These insights can help real estate agents, buyers, and sellers make more informed decisions.

Next Steps

  1. Collect more diverse data (school districts, crime rates, amenities)
  2. Explore non-linear models for comparison
  3. Build an interactive dashboard for predictions
  4. Deploy model as a web API
  5. Monitor model performance and retrain periodically
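Next step 4 (deploying the model as a web API) can be prototyped with nothing but the standard library; the `predict_price` stub below is a placeholder for loading the saved model and calling `predict_house_price`:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict_price(features):
    """Stub: replace with the joblib-loaded model plus feature engineering."""
    return 50000 + 80 * features['sqft']

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers['Content-Length'])
        features = json.loads(self.rfile.read(length))
        body = json.dumps({'predicted_price': predict_price(features)}).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet

# Serve on an ephemeral port in a background thread
server = HTTPServer(('127.0.0.1', 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Client call
req = urllib.request.Request(
    f'http://127.0.0.1:{port}/predict',
    data=json.dumps({'sqft': 2000}).encode(),
    headers={'Content-Type': 'application/json'},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
print(result)
```

For production use, a framework such as Flask or FastAPI would add input validation, error handling, and concurrency on top of this skeleton.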

This case study provides a solid foundation for tackling real-world regression problems using linear models!
