What is Data Preparation?
Data Preparation (also called data preprocessing, data wrangling, or data cleaning) is the process of cleaning, transforming, and organizing raw data into a structured, high-quality format suitable for analysis and modeling. It typically consumes 60-80% of the time in a data science project.
"Data preparation is like doing your homework before a test. You can't expect good results without it."
Why Data Preparation Matters
Poor Data Preparation → Garbage In → Garbage Out (GIGO)
Quality Data Preparation → Reliable Models → Actionable Insights
Impact of Poor Data Preparation
- Biased models that don't generalize
- Inaccurate predictions and insights
- Wasted computational resources
- Failed deployments in production
- Misleading business decisions
The Data Preparation Pipeline
Raw Data (sources) → Collection (import) → Cleaning (fix issues) → Transformation (format/scale) → Integration (combine) → Reduction (sample/select) → Quality Data
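A minimal sketch of this flow in code is shown below; the stage functions are illustrative placeholders (not part of any library), and each stage is covered in detail in the sections that follow.

import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Fix issues: dropping exact duplicates stands in for the cleaning step (Section 2-3)
    return df.drop_duplicates()

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Format/scale: placeholder for scaling, encoding, feature engineering (Section 4)
    return df

def reduce(df: pd.DataFrame) -> pd.DataFrame:
    # Sample/select: placeholder for feature selection and sampling (Section 6)
    return df

def prepare(raw: pd.DataFrame) -> pd.DataFrame:
    """Raw data in, quality data out: clean -> transform -> reduce."""
    return reduce(transform(clean(raw)))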
1. Data Collection & Understanding
Know Your Data
import pandas as pd
import numpy as np
# Load data
df = pd.read_csv('customer_data.csv')
# First look
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
print(df.info()) # Data types and memory usage
print(df.shape) # Dimensions (rows, columns)
print(df.columns) # Column names
# Statistical summary
print(df.describe()) # Numerical columns
print(df.describe(include=['object'])) # Categorical columns
Data Types Assessment
# Check data types
df.dtypes
# Convert types
df['date'] = pd.to_datetime(df['date']) # String to datetime
df['category'] = df['category'].astype('category') # Object to category
df['age'] = df['age'].astype('int32') # Downcast for memory
2. Handling Missing Values
Types of Missing Data
| Type | Description | Example |
|---|---|---|
| MCAR | Missing Completely at Random | Survey question accidentally skipped |
| MAR | Missing at Random | Women less likely to report salary |
| MNAR | Missing Not at Random | High-income individuals hide income |
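The mechanism matters because it guides the remedy: MCAR gaps can usually be dropped or imputed safely, while MAR and MNAR need more care. A quick, informal check (using the hypothetical salary/gender columns from the MAR example above) is to compare missingness rates across groups of an observed variable:

# Hypothetical columns: 'salary' (may be missing) and 'gender' (fully observed).
# A large difference in missing rates across groups suggests the data are
# not MCAR and may be MAR with respect to 'gender'.
missing_rate_by_group = (
    df.assign(salary_missing=df['salary'].isna())
      .groupby('gender')['salary_missing']
      .mean()
)
print(missing_rate_by_group)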
Detecting Missing Values
# Count missing values
print(df.isnull().sum())
# Percentage missing
print((df.isnull().sum() / len(df)) * 100)
# Visualize missing data
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=False, yticklabels=False)
plt.title('Missing Values Heatmap')
plt.show()
# Missing values per row
df['missing_count'] = df.isnull().sum(axis=1)
Strategies for Handling Missing Values
A. Deletion Methods
# Remove rows with any missing values
df_clean = df.dropna()

# Remove rows where specific column is missing
df_clean = df.dropna(subset=['critical_column'])

# Remove columns with >50% missing
threshold = 0.5
df_clean = df.dropna(thresh=int(len(df) * (1 - threshold)), axis=1)
B. Imputation Methods
from sklearn.impute import SimpleImputer, KNNImputer

# 1. Mean/Median/Mode imputation
imputer_mean = SimpleImputer(strategy='mean')
df['age'] = imputer_mean.fit_transform(df[['age']])

imputer_median = SimpleImputer(strategy='median')
df['salary'] = imputer_median.fit_transform(df[['salary']])

imputer_mode = SimpleImputer(strategy='most_frequent')
df['category'] = imputer_mode.fit_transform(df[['category']])

# 2. Constant value imputation
imputer_const = SimpleImputer(strategy='constant', fill_value='Unknown')
df['city'] = imputer_const.fit_transform(df[['city']])

# 3. Forward/Backward fill (time series)
df = df.ffill()  # Forward fill
df = df.bfill()  # Backward fill

# 4. KNN imputation (numeric columns only)
imputer_knn = KNNImputer(n_neighbors=5)
df_imputed = imputer_knn.fit_transform(df.select_dtypes(include=[np.number]))

# 5. Regression imputation (custom)
from sklearn.linear_model import LinearRegression

def regression_impute(df, target_col, feature_cols):
    known = df[df[target_col].notna()]
    unknown = df[df[target_col].isna()]
    model = LinearRegression()
    model.fit(known[feature_cols], known[target_col])
    predicted = model.predict(unknown[feature_cols])
    df.loc[df[target_col].isna(), target_col] = predicted
    return df
3. Handling Outliers
Detecting Outliers
# Method 1: Z-Score
from scipy import stats
z_scores = np.abs(stats.zscore(df.select_dtypes(include=[np.number])))
outliers_z = (z_scores > 3).sum()
# Method 2: IQR (Interquartile Range)
num_df = df.select_dtypes(include=[np.number])
Q1 = num_df.quantile(0.25)
Q3 = num_df.quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = ((num_df < (Q1 - 1.5 * IQR)) | (num_df > (Q3 + 1.5 * IQR))).sum()
# Method 3: Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Box plot
df.boxplot(ax=axes[0])
axes[0].set_title('Box Plot')
# Histogram
df['value'].hist(ax=axes[1], bins=30)
axes[1].set_title('Histogram')
# Scatter plot
axes[2].scatter(range(len(df)), df['value'])
axes[2].set_title('Scatter Plot')
plt.tight_layout()
Handling Outliers
# Option 1: Remove outliers
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_clean = df[(df['column'] >= lower_bound) & (df['column'] <= upper_bound)]

# Option 2: Cap/Winsorize
from scipy.stats.mstats import winsorize
df['column_winsorized'] = winsorize(df['column'], limits=[0.05, 0.05])

# Option 3: Transform
df['log_column'] = np.log1p(df['column'])  # Log transform

# Option 4: Treat separately
df['is_outlier'] = ((df['column'] < lower_bound) | (df['column'] > upper_bound))
4. Data Transformation
Scaling and Normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Standardization (z-score): (x - μ) / σ
scaler_std = StandardScaler()
df_standardized = scaler_std.fit_transform(df)

# Min-Max Normalization: (x - min) / (max - min)
scaler_minmax = MinMaxScaler()
df_normalized = scaler_minmax.fit_transform(df)

# Robust Scaling (uses median and IQR, robust to outliers)
scaler_robust = RobustScaler()
df_robust = scaler_robust.fit_transform(df)

# Choose based on use case:
# - StandardScaler: when data is approximately normally distributed
# - MinMaxScaler: when you need bounded values in [0, 1]
# - RobustScaler: when data has outliers
Encoding Categorical Variables
# 1. Label Encoding (ordinal categories)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['encoded'] = le.fit_transform(df['category'])

# 2. One-Hot Encoding (nominal categories)
df_onehot = pd.get_dummies(df, columns=['category'], prefix='cat')

# 3. Target Encoding
from category_encoders import TargetEncoder
te = TargetEncoder()
df['encoded'] = te.fit_transform(df['category'], df['target'])

# 4. Frequency Encoding
freq_map = df['category'].value_counts().to_dict()
df['freq_encoded'] = df['category'].map(freq_map)

# 5. Binary Encoding
from category_encoders import BinaryEncoder
be = BinaryEncoder()
df_binary = be.fit_transform(df[['category']])
Feature Engineering
# Create new features from existing ones
# 1. Date/Time features
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
# 2. Mathematical combinations
df['area'] = df['length'] * df['width']
df['ratio'] = df['value1'] / df['value2']
df['difference'] = df['value1'] - df['value2']
df['avg'] = (df['value1'] + df['value2']) / 2
# 3. Aggregated features
df['customer_total_spend'] = df.groupby('customer_id')['amount'].transform('sum')
df['customer_avg_spend'] = df.groupby('customer_id')['amount'].transform('mean')
df['customer_purchase_count'] = df.groupby('customer_id')['transaction_id'].transform('count')
# 4. Binning/Discretization
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 60, 100],
                         labels=['Child', 'Young', 'Adult', 'Senior'])
# 5. Text features
df['text_length'] = df['description'].str.len()
df['word_count'] = df['description'].str.split().str.len()
df['has_urgent'] = df['description'].str.contains('urgent', case=False).astype(int)
5. Data Integration
Combining Multiple Data Sources
# Concatenation (stacking)
df_combined = pd.concat([df1, df2], axis=0) # Vertically (more rows)
df_features = pd.concat([df1, df2], axis=1) # Horizontally (more columns)
# Merging/Joining
# Inner join (intersection)
df_merged = pd.merge(df1, df2, on='key_column', how='inner')
# Left join (all from left)
df_merged = pd.merge(df1, df2, on='key_column', how='left')
# Right join (all from right)
df_merged = pd.merge(df1, df2, on='key_column', how='right')
# Outer join (union)
df_merged = pd.merge(df1, df2, on='key_column', how='outer')
# Join on multiple columns
df_merged = pd.merge(df1, df2, on=['col1', 'col2'], how='inner')
# Handle duplicate column names
df_merged = pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))
6. Data Reduction
Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif, RFE
# 1. Feature Selection - Statistical methods
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
# 2. Recursive Feature Elimination
from sklearn.linear_model import LogisticRegression
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)
# 3. Feature Importance from models
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X, y)
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
# 4. PCA (Principal Component Analysis)
pca = PCA(n_components=0.95) # Keep 95% variance
X_pca = pca.fit_transform(X_scaled)
# 5. Correlation-based removal
correlation_matrix = df.corr(numeric_only=True).abs()
upper_tri = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.95)]
df_reduced = df.drop(columns=to_drop)
Sampling Techniques
# 1. Random sampling
df_sample = df.sample(n=1000, random_state=42)    # n samples
df_sample = df.sample(frac=0.1, random_state=42)  # 10% of data

# 2. Stratified sampling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 3. Handling imbalanced datasets
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# SMOTE (Synthetic Minority Over-sampling Technique)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Under-sampling
undersampler = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X, y)
7. Data Validation
Quality Checks
def validate_data(df):
    """Comprehensive data validation function"""
    # Check for duplicates
    duplicates = df.duplicated().sum()
    print(f"Duplicate rows: {duplicates}")

    # Check for missing values
    missing = df.isnull().sum()
    print(f"Missing values:\n{missing[missing > 0]}")

    # Check data types
    print(f"Data types:\n{df.dtypes}")

    # Check value ranges
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        print(f"{col}: min={df[col].min()}, max={df[col].max()}")

    # Check for invalid values
    if 'age' in df.columns:
        invalid_age = df[(df['age'] < 0) | (df['age'] > 120)].shape[0]
        print(f"Invalid ages: {invalid_age}")

    # Check categorical values
    categorical_cols = df.select_dtypes(include=['object']).columns
    for col in categorical_cols:
        print(f"{col} unique values: {df[col].nunique()}")
        if df[col].nunique() < 20:
            print(f"  Values: {df[col].unique()}")

    return {
        'duplicates': duplicates,
        'missing': missing[missing > 0],
        'valid': True
    }
8. Automation with Pipelines
Scikit-learn Pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Define column types
numeric_features = ['age', 'salary', 'years_experience']
categorical_features = ['department', 'education', 'city']
# Create transformers
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
# Create full pipeline with model
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
# Use the pipeline
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
9. Best Practices & Checklist
Data Preparation Checklist
□ Data Collection
  □ Understand data sources
  □ Document data dictionary
  □ Check data licensing/compliance
□ Data Cleaning
  □ Handle missing values
  □ Remove duplicates
  □ Fix inconsistent formatting
  □ Handle outliers appropriately
□ Data Transformation
  □ Scale/normalize numerical features
  □ Encode categorical variables
  □ Create date/time features
  □ Engineer meaningful features
□ Data Integration
  □ Merge multiple sources
  □ Handle schema mismatches
  □ Resolve conflicts
□ Data Reduction
  □ Select relevant features
  □ Remove highly correlated features
  □ Sample if necessary
□ Validation
  □ Verify data types
  □ Check value ranges
  □ Validate business rules
  □ Test with sample queries
□ Documentation
  □ Document transformations applied
  □ Note assumptions made
  □ Version control prepared data
Common Pitfalls to Avoid
# ❌ Don't do this

# 1. Removing outliers before understanding them
df = df[df['value'] < 100]  # Why? Maybe 100+ is valid!

# 2. Applying transformations without reason
df['log_age'] = np.log(df['age'])  # Age isn't skewed!

# 3. Using the wrong imputation
df['salary'] = df['salary'].fillna(df['salary'].mean())  # Mean is affected by outliers!

# 4. Data leakage in preprocessing
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fit_transform on the entire dataset!
# Should fit only on the training data

# 5. Ignoring temporal dependencies
df_shuffled = df.sample(frac=1)  # Don't shuffle time series!
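For the data-leakage pitfall in particular, the leakage-free pattern is to fit the scaler on the training split only and reuse those learned parameters on the test split. A minimal sketch, reusing the variable names from the examples above:

# ✅ Fit preprocessing on the training data only
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training set
X_test_scaled = scaler.transform(X_test)        # apply the same parameters to the test set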
10. Tools for Data Preparation
| Tool | Purpose | Best For |
|---|---|---|
| Python (pandas) | General wrangling | Flexibility, custom logic |
| R (tidyverse) | Data manipulation | Statistical analysis |
| SQL | Database cleaning | Large datasets in databases |
| Spark | Big data processing | Distributed computing |
| OpenRefine | Interactive cleaning | Quick exploration, messy data |
| Trifacta | Visual wrangling | Business users, automation |
| Alteryx | Workflow automation | Repeatable processes |
Complete Example: End-to-End Data Preparation
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
def prepare_customer_churn_data(df):
    """Complete data preparation pipeline for customer churn"""
    print("Initial shape:", df.shape)

    # 1. Initial exploration
    print("\n--- Data Overview ---")
    df.info()
    print("\nMissing values:\n", df.isnull().sum())

    # 2. Handle missing values
    # Numeric: median imputation
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        df[col] = df[col].fillna(df[col].median())

    # Categorical: mode imputation
    cat_cols = df.select_dtypes(include=['object']).columns
    for col in cat_cols:
        df[col] = df[col].fillna(df[col].mode()[0])

    # 3. Handle outliers (cap at 99th percentile)
    for col in numeric_cols:
        upper = df[col].quantile(0.99)
        df[col] = np.where(df[col] > upper, upper, df[col])

    # 4. Feature engineering
    # Create tenure groups
    df['tenure_group'] = pd.cut(df['tenure'],
                                bins=[0, 12, 24, 48, 72, 100],
                                labels=['<1yr', '1-2yr', '2-4yr', '4-6yr', '>6yr'])
    # Encode the binned groups as integer codes so the feature is model-ready
    df['tenure_group'] = df['tenure_group'].cat.codes

    # Create total services (assumes the service columns are 0/1 indicators)
    service_cols = ['phone_service', 'internet_service', 'streaming_tv', 'streaming_movies']
    df['total_services'] = df[service_cols].sum(axis=1)

    # 5. Encode categorical variables
    le = LabelEncoder()
    for col in cat_cols:
        df[col] = le.fit_transform(df[col].astype(str))

    # 6. Scale numeric features
    scaler = StandardScaler()
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

    # 7. Remove duplicates
    df.drop_duplicates(inplace=True)

    # 8. Split data
    X = df.drop('churn', axis=1)
    y = df['churn']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    print("\n--- Preparation Complete ---")
    print(f"Final shape: {df.shape}")
    print(f"Training set size: {X_train.shape[0]}")
    print(f"Test set size: {X_test.shape[0]}")

    return X_train, X_test, y_train, y_test
# Usage
# X_train, X_test, y_train, y_test = prepare_customer_churn_data(df)
Key Takeaway: Data preparation is not just a preliminary step—it's the foundation upon which all data science work is built. Investing time in thorough, systematic data preparation yields more reliable models, more accurate insights, and ultimately better business outcomes. Remember: quality data preparation = quality data science.