Introduction to DataFrames
DataFrames are the cornerstone of data analysis in Python. They provide a powerful, flexible, and efficient structure for working with tabular data—similar to spreadsheets or SQL tables. The pandas library revolutionized data analysis in Python by introducing the DataFrame, making complex data operations intuitive and fast.
Key Concepts
- Tabular Structure: Data organized in rows and columns
- Indexing: Both row and column labels for intuitive access
- Vectorized Operations: Fast, element-wise computations
- Missing Data Handling: Built-in support for NaN values
- Data Alignment: Automatic alignment based on labels (see the sketch below)
- Integration: Seamless work with NumPy, Matplotlib, and scikit-learn
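To make the alignment idea concrete before diving in, here is a minimal sketch (values are illustrative): adding two Series matches entries by label, not by position, and labels present on only one side produce NaN.
import pandas as pd
s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])
# Addition aligns on labels: 'a' and 'd' appear on one side only, so they become NaN
print(s1 + s2)  # a: NaN, b: 21.0, c: 32.0, d: NaN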
1. Getting Started with DataFrames
Installation and Setup
# Install pandas (if not already installed)
# pip install pandas numpy matplotlib seaborn
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set display options for better readability
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 1000)
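These calls change options globally for the session. For a temporary override, pd.option_context restores the previous settings when the block exits:
# Temporarily show all rows and columns while inspecting a DataFrame
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(pd.DataFrame({'a': range(5)}))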
Creating DataFrames
# Method 1: From dictionary of lists
df_dict = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 35, 28, 32],
'City': ['New York', 'London', 'Paris', 'Tokyo', 'Berlin'],
'Salary': [70000, 85000, 95000, 68000, 78000]
})
print("DataFrame from dictionary:")
print(df_dict)
print()
# Method 2: From list of dictionaries
df_list = pd.DataFrame([
{'Name': 'Alice', 'Age': 25, 'City': 'New York', 'Salary': 70000},
{'Name': 'Bob', 'Age': 30, 'City': 'London', 'Salary': 85000},
{'Name': 'Charlie', 'Age': 35, 'City': 'Paris', 'Salary': 95000},
])
print("DataFrame from list of dictionaries:")
print(df_list)
print()
# Method 3: From NumPy array
data = np.array([
['Alice', 25, 'New York', 70000],
['Bob', 30, 'London', 85000],
['Charlie', 35, 'Paris', 95000]
])
df_numpy = pd.DataFrame(data, columns=['Name', 'Age', 'City', 'Salary'])
# Note: a mixed NumPy array stores everything as strings; use astype (e.g. {'Age': int, 'Salary': int}) to restore numeric dtypes
print("DataFrame from NumPy array:")
print(df_numpy)
print()
# Method 4: From CSV file
# df = pd.read_csv('data.csv')
# Method 5: From Excel file
# df = pd.read_excel('data.xlsx')
# Method 6: From SQL database
# import sqlite3
# conn = sqlite3.connect('database.db')
# df = pd.read_sql_query('SELECT * FROM table', conn)
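Methods 4-6 are commented out because they depend on external files. Here is a self-contained sketch of Method 6 using an in-memory SQLite database so it runs without any setup (the table name and columns are invented for illustration):
import sqlite3
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE employees (name TEXT, salary INTEGER)')
conn.executemany('INSERT INTO employees VALUES (?, ?)',
                 [('Alice', 70000), ('Bob', 85000)])
df_sql = pd.read_sql_query('SELECT * FROM employees', conn)
print("DataFrame from SQL query:")
print(df_sql)
conn.close()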
DataFrame Structure and Metadata
# Create sample DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace', 'Henry'],
'Age': [25, 30, 35, 28, 32, 29, 31, 33],
'Department': ['Sales', 'IT', 'Marketing', 'IT', 'Sales', 'HR', 'Marketing', 'IT'],
'Salary': [70000, 85000, 95000, 68000, 78000, 62000, 72000, 88000],
'Experience': [2, 5, 8, 3, 4, 1, 3, 6]
})
# Basic information
print("DataFrame Info:")
print("=" * 50)
df.info()
print()
print("DataFrame Shape:", df.shape)
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
print()
print("Column Names:", df.columns.tolist())
print()
print("Data Types:")
print(df.dtypes)
print()
print("Index:", df.index)
print()
print("First 5 rows:")
print(df.head())
print()
print("Last 3 rows:")
print(df.tail(3))
print()
print("Statistical Summary:")
print(df.describe())
print()
print("Non-null counts:")
print(df.count())
2. Data Inspection and Exploration
Basic Exploration
# Summary statistics for all columns
print("Statistical Summary:")
print(df.describe(include='all')) # include='all' includes object columns
print()
# Quick summary of object columns
print("Object columns summary:")
print(df.describe(include=['object']))
print()
# Unique values in a column
print("Unique departments:", df['Department'].unique())
print("Number of unique departments:", df['Department'].nunique())
print()
# Value counts
print("Department distribution:")
print(df['Department'].value_counts())
print()
print("Department distribution (normalized):")
print(df['Department'].value_counts(normalize=True))
print()
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print()
print("Total missing values:", df.isnull().sum().sum())
print()
# Check data types
print("Data types:")
print(df.dtypes)
print()
# Memory usage
print("Memory usage:")
print(df.memory_usage(deep=True))
Advanced Exploration
# Correlation matrix (numeric columns only; pandas >= 2.0 requires
# numeric_only=True when the DataFrame contains non-numeric columns)
print("Correlation matrix:")
print(df.corr(numeric_only=True))
print()
# Correlation with target
print("Correlation with Salary:")
print(df.corr(numeric_only=True)['Salary'].sort_values(ascending=False))
print()
# Covariance matrix
print("Covariance matrix:")
print(df.cov(numeric_only=True))
print()
# Skewness and kurtosis
print("Skewness:")
print(df.skew(numeric_only=True))
print()
print("Kurtosis:")
print(df.kurtosis(numeric_only=True))
print()
# Quantiles
print("Salary quantiles:")
print(df['Salary'].quantile([0.25, 0.5, 0.75, 0.9, 0.95]))
print()
# Unique combinations
print("Department-Experience combinations:")
print(df.groupby('Department')['Experience'].unique())
3. Data Selection and Filtering
Column Selection
# Select single column (returns Series)
print("Single column (Series):")
print(df['Name'])
print()
print("Single column type:", type(df['Name']))
print()
# Select multiple columns
print("Multiple columns:")
print(df[['Name', 'Age', 'Salary']])
print()
# Select a column with dot notation (works only when the name is a valid identifier and doesn't collide with a DataFrame attribute)
print("Dot notation:")
print(df.Name.head())
print()
# Select columns with specific data types
print("Numeric columns:")
print(df.select_dtypes(include=['int64', 'float64']).head())
print()
print("Object columns:")
print(df.select_dtypes(include=['object']).head())
Row Selection
# Using iloc (integer position-based)
print("First row (iloc[0]):")
print(df.iloc[0])
print()
print("First 3 rows (iloc[:3]):")
print(df.iloc[:3])
print()
print("Rows 2-4 (iloc[2:5]):")
print(df.iloc[2:5])
print()
print("Specific rows and columns (iloc[[0,2,4], [0,2,4]]):")
print(df.iloc[[0, 2, 4], [0, 2, 4]])
print()
# Using loc (label-based)
print("Row with index 2 (loc[2]):")
print(df.loc[2])
print()
print("Rows 2-4 (loc[2:5]):")
print(df.loc[2:5])
print()
print("Specific rows and columns (loc[[0,2,4], ['Name', 'Salary']]):")
print(df.loc[[0, 2, 4], ['Name', 'Salary']])
print()
# Boolean indexing
print("Employees older than 30:")
print(df[df['Age'] > 30])
print()
print("IT department employees:")
print(df[df['Department'] == 'IT'])
print()
print("IT employees with salary > 80000:")
print(df[(df['Department'] == 'IT') & (df['Salary'] > 80000)])
print()
print("Employees with salary between 70000 and 80000:")
print(df[df['Salary'].between(70000, 80000)])
print()
# Using query method
print("Query: Age > 30 and Department == 'IT':")
print(df.query('Age > 30 and Department == "IT"'))
print()
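query can also reference Python variables by prefixing them with @, which keeps thresholds out of the query string:
# Reference a local variable inside query with '@'
min_age = 30
print("Query with a local variable:")
print(df.query('Age > @min_age and Department == "IT"'))
print()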
Advanced Filtering
# isin() for multiple values
print("Employees in Sales or Marketing:")
print(df[df['Department'].isin(['Sales', 'Marketing'])])
print()
# String methods
print("Names starting with 'A':")
print(df[df['Name'].str.startswith('A')])
print()
print("Names containing 'e':")
print(df[df['Name'].str.contains('e', case=False)])
print()
# Using where()
print("Where condition (fills NaN for False):")
print(df['Salary'].where(df['Age'] > 30))
print()
# Using mask()
print("Mask condition (fills NaN for True):")
print(df['Salary'].mask(df['Age'] > 30))
print()
# Combining conditions
conditions = (df['Age'] > 30) & (df['Department'] == 'IT')
print("Complex condition:")
print(df.loc[conditions, ['Name', 'Age', 'Department', 'Salary']])
4. Data Manipulation
Adding and Modifying Columns
# Add a new column
df['Bonus'] = df['Salary'] * 0.1
print("Added Bonus column:")
print(df[['Name', 'Salary', 'Bonus']])
print()
# Add column based on condition
df['Senior'] = df['Experience'] >= 5
print("Added Senior column:")
print(df[['Name', 'Experience', 'Senior']])
print()
# Using assign (returns new DataFrame)
df_new = df.assign(Tax=df['Salary'] * 0.2, NetSalary=df['Salary'] * 0.8)
print("Using assign method:")
print(df_new[['Name', 'Salary', 'Tax', 'NetSalary']].head())
print()
# Insert column at specific position
df.insert(2, 'Gender', ['F', 'M', 'M', 'M', 'F', 'M', 'F', 'M'])
print("Inserted Gender column:")
print(df.head())
print()
# Rename columns
df_renamed = df.rename(columns={
'Name': 'Employee_Name',
'Salary': 'Annual_Salary',
'Experience': 'Years_Experience'
})
print("Renamed columns:")
print(df_renamed.columns.tolist())
print()
Column Operations
# Arithmetic operations
df['Salary_Increase'] = df['Salary'] * 1.05
df['Age_Next_Year'] = df['Age'] + 1
print("Calculated columns:")
print(df[['Name', 'Salary', 'Salary_Increase', 'Age', 'Age_Next_Year']].head())
print()
# Apply function to column
df['Salary_Thousand'] = df['Salary'].apply(lambda x: f"${x/1000:.0f}K")
print("Formatted salary:")
print(df[['Name', 'Salary', 'Salary_Thousand']].head())
print()
# Vectorized string operations
df['Name_Upper'] = df['Name'].str.upper()
df['Name_Length'] = df['Name'].str.len()
print("String operations:")
print(df[['Name', 'Name_Upper', 'Name_Length']].head())
print()
# Categorical conversion
df['Department'] = df['Department'].astype('category')
print("Department as categorical:")
print(df['Department'].dtype)
print()
# Add a datetime column and extract date components
# ('ME' = month-end in pandas >= 2.2; older versions use the alias 'M')
df['Hire_Date'] = pd.date_range('2020-01-01', periods=len(df), freq='ME')
df['Year'] = df['Hire_Date'].dt.year
df['Month'] = df['Hire_Date'].dt.month
print("Date operations:")
print(df[['Name', 'Hire_Date', 'Year', 'Month']].head())
Handling Missing Data
# Create DataFrame with missing values
# (cast the integer columns to float first; assigning NaN into an int64
# column relies on silent upcasting, which pandas has deprecated)
df_missing = df.copy()
df_missing[['Salary', 'Experience']] = df_missing[['Salary', 'Experience']].astype('float64')
df_missing.loc[2, 'Salary'] = np.nan
df_missing.loc[4, 'Experience'] = np.nan
df_missing.loc[6, 'Department'] = np.nan
print("DataFrame with missing values:")
print(df_missing)
print()
# Check missing values
print("Missing values count:")
print(df_missing.isnull().sum())
print()
print("Missing values percentage:")
print(df_missing.isnull().sum() / len(df_missing) * 100)
print()
# Drop rows with any missing values
df_dropped = df_missing.dropna()
print("After dropping rows with any missing values:")
print(df_dropped)
print(f"Shape: {df_dropped.shape}")
print()
# Drop rows with all missing values
df_dropped_all = df_missing.dropna(how='all')
print("After dropping rows with all missing values:")
print(df_dropped_all)
print()
# Fill missing values (numeric_only=True restricts mean/median to numeric columns)
df_filled_mean = df_missing.fillna(df_missing.mean(numeric_only=True))
print("Filled with mean (numeric columns):")
print(df_filled_mean)
print()
df_filled_median = df_missing.fillna(df_missing.median(numeric_only=True))
df_filled_mode = df_missing.fillna(df_missing.mode().iloc[0])
df_filled_forward = df_missing.ffill()  # replaces the deprecated fillna(method='ffill')
df_filled_backward = df_missing.bfill()  # replaces the deprecated fillna(method='bfill')
# Fill specific columns (plain assignment; chained inplace fillna is
# deprecated and unreliable under copy-on-write)
df_filled_specific = df_missing.copy()
df_filled_specific['Salary'] = df_filled_specific['Salary'].fillna(df_filled_specific['Salary'].mean())
df_filled_specific['Experience'] = df_filled_specific['Experience'].fillna(0)
# Department is categorical here, so add 'Unknown' as a category before filling
df_filled_specific['Department'] = (
    df_filled_specific['Department'].cat.add_categories('Unknown').fillna('Unknown')
)
print("Filled specific columns:")
print(df_filled_specific)
print()
# Interpolate missing values (numeric columns only; interpolating object
# columns is deprecated in recent pandas)
num_cols = df_missing.select_dtypes(include='number').columns
df_interpolated = df_missing.copy()
df_interpolated[num_cols] = df_interpolated[num_cols].interpolate()
print("Interpolated missing values:")
print(df_interpolated)
5. Grouping and Aggregation
Basic GroupBy
# Group by single column
print("Group by Department:")
dept_groups = df.groupby('Department')
print(dept_groups)
print()
# Get groups
print("Group keys:", dept_groups.groups.keys())
print()
# Iterate through groups
for name, group in dept_groups:
    print(f"\nDepartment: {name}")
    print(group[['Name', 'Salary', 'Experience']])
print()
# Basic aggregations
print("Department statistics:")
print(df.groupby('Department')['Salary'].mean())
print()
print("Multiple aggregations:")
print(df.groupby('Department').agg({
'Salary': ['mean', 'median', 'min', 'max', 'std'],
'Age': ['mean', 'min', 'max'],
'Experience': ['mean', 'sum']
}))
print()
# Named aggregations (pandas 0.25+)
print("Named aggregations:")
print(df.groupby('Department').agg(
avg_salary=('Salary', 'mean'),
max_salary=('Salary', 'max'),
avg_age=('Age', 'mean'),
total_experience=('Experience', 'sum')
))
print()
Advanced GroupBy
# Group by multiple columns
print("Group by Department and Senior:")
grouped = df.groupby(['Department', 'Senior'])
print(grouped.size())
print()
print("Multi-level grouping aggregations:")
print(df.groupby(['Department', 'Senior']).agg({
'Salary': ['mean', 'count'],
'Age': 'mean'
}))
print()
# Transform
df['Salary_Normalized'] = df.groupby('Department')['Salary'].transform(lambda x: (x - x.mean()) / x.std())
print("Normalized salary within department:")
print(df[['Name', 'Department', 'Salary', 'Salary_Normalized']])
print()
# Filter groups
print("Departments with average salary > 75000:")
high_salary_depts = df.groupby('Department').filter(lambda x: x['Salary'].mean() > 75000)
print(high_salary_depts['Department'].unique())
print()
# Apply custom function
def salary_range(group):
    return pd.Series({
        'min_salary': group['Salary'].min(),
        'max_salary': group['Salary'].max(),
        'range': group['Salary'].max() - group['Salary'].min()
    })
print("Salary range by department:")
print(df.groupby('Department').apply(salary_range))
print()
# Aggregation with multiple functions
def custom_agg(x):
    return pd.Series({
        'count': len(x),
        'mean': x.mean(),
        'median': x.median(),
        'std': x.std()
    })
print("Custom aggregation on Salary:")
print(df.groupby('Department')['Salary'].apply(custom_agg))
Pivot Tables and Crosstab
# Pivot table
print("Pivot table - Department vs Senior:")
pivot = pd.pivot_table(df,
values='Salary',
index='Department',
columns='Senior',
aggfunc='mean',
fill_value=0)
print(pivot)
print()
print("Pivot table with multiple values:")
pivot_multi = pd.pivot_table(df,
values=['Salary', 'Age'],
index='Department',
columns='Senior',
aggfunc=['mean', 'count'],
fill_value=0)
print(pivot_multi)
print()
# Crosstab
print("Crosstab - Department vs Senior:")
crosstab = pd.crosstab(df['Department'], df['Senior'], margins=True)
print(crosstab)
print()
print("Crosstab with percentages:")
crosstab_norm = pd.crosstab(df['Department'], df['Senior'], normalize='index') * 100
print(crosstab_norm)
6. Merging and Joining
Concatenation
# Create two DataFrames
df1 = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Department': ['Sales', 'IT', 'Marketing', 'IT']
})
df2 = pd.DataFrame({
'ID': [5, 6, 7, 8],
'Name': ['Emma', 'Frank', 'Grace', 'Henry'],
'Department': ['HR', 'Sales', 'IT', 'Marketing']
})
df3 = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Salary': [70000, 85000, 95000, 68000],
'Experience': [2, 5, 8, 3]
})
# Vertical concatenation (stack rows)
print("Vertical concatenation:")
df_concat = pd.concat([df1, df2], ignore_index=True)
print(df_concat)
print()
# Horizontal concatenation (join columns)
print("Horizontal concatenation:")
df_concat_h = pd.concat([df1, df3], axis=1)
print(df_concat_h)
print()
# Concatenation with keys
print("Concatenation with keys:")
df_concat_keys = pd.concat([df1, df2], keys=['Group1', 'Group2'])
print(df_concat_keys)
print()
Merging
# Create two DataFrames for merging
employees = pd.DataFrame({
'Employee_ID': [101, 102, 103, 104, 105],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Department_ID': [1, 2, 1, 3, 2]
})
departments = pd.DataFrame({
'Department_ID': [1, 2, 3, 4],
'Department_Name': ['Sales', 'IT', 'Marketing', 'HR'],
'Location': ['New York', 'London', 'Paris', 'Berlin']
})
# Inner join (default)
print("Inner join:")
df_inner = pd.merge(employees, departments, on='Department_ID', how='inner')
print(df_inner)
print()
# Left join
print("Left join:")
df_left = pd.merge(employees, departments, on='Department_ID', how='left')
print(df_left)
print()
# Right join
print("Right join:")
df_right = pd.merge(employees, departments, on='Department_ID', how='right')
print(df_right)
print()
# Outer join
print("Outer join:")
df_outer = pd.merge(employees, departments, on='Department_ID', how='outer')
print(df_outer)
print()
# Merge with different column names
employees2 = employees.rename(columns={'Department_ID': 'Dept_ID'})
df_merge_diff = pd.merge(employees2, departments, left_on='Dept_ID', right_on='Department_ID')
print("Merge with different column names:")
print(df_merge_diff)
print()
# Merge on index (make Department_ID the index of departments so the
# labels actually line up with the left frame's key values)
df_merge_index = pd.merge(employees, departments.set_index('Department_ID'),
                          left_on='Department_ID', right_index=True)
print("Merge on index:")
print(df_merge_index)
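When a join produces unexpected rows, passing indicator=True to merge adds a _merge column that records whether each row came from the left frame, the right frame, or both:
# Track the origin of each row in an outer join
df_debug = pd.merge(employees, departments, on='Department_ID',
                    how='outer', indicator=True)
print("Outer join with merge indicator:")
print(df_debug[['Name', 'Department_Name', '_merge']])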
Joining
# Create DataFrames with indexes
df_left = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Salary': [70000, 85000, 95000]
}, index=['A', 'B', 'C'])
df_right = pd.DataFrame({
'Department': ['Sales', 'IT', 'Marketing'],
'Experience': [2, 5, 8]
}, index=['A', 'B', 'D'])
# Inner join on index
print("Inner join on index:")
df_join_inner = df_left.join(df_right, how='inner')
print(df_join_inner)
print()
# Left join on index
print("Left join on index:")
df_join_left = df_left.join(df_right, how='left')
print(df_join_left)
print()
# Outer join on index
print("Outer join on index:")
df_join_outer = df_left.join(df_right, how='outer')
print(df_join_outer)
print()
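Note that join raises a ValueError when the two frames share column names; pass lsuffix and rsuffix to disambiguate. A small sketch (the year suffixes are invented for illustration):
# Both frames have a 'Salary' column, so suffixes are required
df_a = pd.DataFrame({'Salary': [70000, 85000]}, index=['A', 'B'])
df_b = pd.DataFrame({'Salary': [72000, 90000]}, index=['A', 'B'])
print(df_a.join(df_b, lsuffix='_2022', rsuffix='_2023'))
print()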
7. Time Series Data
Working with Dates and Times
# Create time series data
dates = pd.date_range('2023-01-01', periods=100, freq='D')
ts_data = pd.DataFrame({
'date': dates,
'value': np.random.randn(100).cumsum(),
'sales': np.random.randint(100, 1000, 100)
})
print("Time series data:")
print(ts_data.head(10))
print()
# Set date as index
ts_data.set_index('date', inplace=True)
print("Data with date index:")
print(ts_data.head())
print()
# Resampling ('ME' = month-end in pandas >= 2.2; older versions use 'M')
print("Monthly resampling (mean):")
print(ts_data.resample('ME').mean())
print()
print("Weekly resampling (sum):")
print(ts_data.resample('W').sum().head())
print()
# Rolling window calculations
print("7-day rolling mean:")
print(ts_data['value'].rolling(window=7).mean().head(10))
print()
print("7-day rolling sum:")
print(ts_data['sales'].rolling(window=7).sum().head(10))
print()
# Expanding window
print("Expanding mean:")
print(ts_data['value'].expanding().mean().head(10))
print()
# Time-based shifting
print("Shifted data (1 day):")
print(ts_data['value'].shift(1).head())
print()
print("Shifted data (-1 day):")
print(ts_data['value'].shift(-1).head())
print()
# Difference
print("Daily differences:")
print(ts_data['value'].diff().head())
print()
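A close relative of diff is pct_change, which expresses each change relative to the previous value:
# Day-over-day percentage change
print("Daily percent change:")
print(ts_data['value'].pct_change().head())
print()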
# Time zone handling
ts_data_tz = ts_data.copy()
ts_data_tz.index = ts_data_tz.index.tz_localize('UTC')
print("Time zone localized:")
print(ts_data_tz.head())
print()
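Once the index is localized, tz_convert re-expresses the same instants in another time zone:
# Convert the UTC-localized index to US Eastern time
print("Converted to US/Eastern:")
print(ts_data_tz.tz_convert('US/Eastern').head())
print()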
8. Data Visualization with DataFrame
Basic Plots
# Line plot
plt.figure(figsize=(12, 6))
ts_data['value'].plot(title='Time Series Plot', color='blue', linewidth=2)
plt.xlabel('Date')
plt.ylabel('Value')
plt.grid(True, alpha=0.3)
plt.show()
# Multiple line plots (DataFrame.plot with subplots=True creates its own figure)
ts_data[['value', 'sales']].plot(subplots=True, layout=(2, 1), figsize=(12, 8))
plt.tight_layout()
plt.show()
# Bar plot
plt.figure(figsize=(10, 6))
df.groupby('Department')['Salary'].mean().plot(kind='bar', color=['skyblue', 'lightcoral', 'lightgreen', 'gold'])  # one color per department
plt.title('Average Salary by Department')
plt.xlabel('Department')
plt.ylabel('Average Salary')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Horizontal bar plot
plt.figure(figsize=(10, 6))
df.groupby('Department')['Salary'].mean().sort_values().plot(kind='barh')
plt.title('Average Salary by Department')
plt.xlabel('Average Salary')
plt.tight_layout()
plt.show()
# Histogram
plt.figure(figsize=(10, 6))
df['Age'].hist(bins=10, edgecolor='black', alpha=0.7)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
# Box plot (df.boxplot with `by=` manages its own figure)
df.boxplot(column='Salary', by='Department', figsize=(10, 6))
plt.title('Salary Distribution by Department')
plt.suptitle('') # Remove automatic title
plt.xlabel('Department')
plt.ylabel('Salary')
plt.xticks(rotation=45)
plt.show()
# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df['Experience'], df['Salary'], alpha=0.6)
plt.title('Experience vs Salary')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.grid(True, alpha=0.3)
plt.show()
Advanced Visualization with Seaborn
# Set style
sns.set_style('whitegrid')
# Pairplot
sns.pairplot(df[['Age', 'Salary', 'Experience']], diag_kind='kde')
plt.show()
# Heatmap of correlations
plt.figure(figsize=(8, 6))
sns.heatmap(df[['Age', 'Salary', 'Experience']].corr(),
annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.show()
# Boxen plot (for large datasets)
plt.figure(figsize=(10, 6))
sns.boxenplot(x='Department', y='Salary', data=df)
plt.title('Salary Distribution by Department')
plt.show()
# Violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(x='Department', y='Salary', data=df)
plt.title('Salary Distribution by Department')
plt.show()
# Bar plot with error bars
plt.figure(figsize=(10, 6))
sns.barplot(x='Department', y='Salary', data=df, errorbar=('ci', 95))  # seaborn >= 0.12; older versions use ci=95
plt.title('Average Salary by Department with 95% CI')
plt.show()
# Regression plot
plt.figure(figsize=(10, 6))
sns.regplot(x='Experience', y='Salary', data=df, scatter_kws={'alpha':0.5})
plt.title('Experience vs Salary with Regression Line')
plt.show()
9. Performance Optimization
Efficient DataFrame Operations
# Create large DataFrame for demonstration
large_df = pd.DataFrame(np.random.randn(1000000, 10),
columns=[f'col_{i}' for i in range(10)])
# Vectorized operations (fast)
import time
start = time.time()
large_df['new_col'] = large_df['col_0'] + large_df['col_1']
vectorized_time = time.time() - start
# Loop operations (slow; expect this to take a minute or more on 1M rows)
start = time.time()
new_col = []
for i in range(len(large_df)):
    new_col.append(large_df.loc[i, 'col_0'] + large_df.loc[i, 'col_1'])
large_df['new_col2'] = new_col
loop_time = time.time() - start
print(f"Vectorized operation time: {vectorized_time:.4f} seconds")
print(f"Loop operation time: {loop_time:.4f} seconds")
print(f"Speedup: {loop_time/vectorized_time:.1f}x")
print()
# Use appropriate data types
print("Memory usage before optimization:")
print(large_df.memory_usage(deep=True).sum() / 1024**2, "MB")
# Convert to smaller data types
optimized_df = large_df.copy()
for col in optimized_df.columns:
    if optimized_df[col].dtype == 'float64':
        optimized_df[col] = optimized_df[col].astype('float32')
print("Memory usage after optimization:")
print(optimized_df.memory_usage(deep=True).sum() / 1024**2, "MB")
print()
# Categorical data optimization
categorical_df = pd.DataFrame({
'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], 1000000),
'value': np.random.randn(1000000)
})
print("Memory usage without categorical:")
print(categorical_df.memory_usage(deep=True).sum() / 1024**2, "MB")
categorical_df['category'] = categorical_df['category'].astype('category')
print("Memory usage with categorical:")
print(categorical_df.memory_usage(deep=True).sum() / 1024**2, "MB")
Query Optimization
# Using query vs boolean indexing
df_large = pd.DataFrame({
'A': np.random.randn(1000000),
'B': np.random.randn(1000000),
'C': np.random.choice(['X', 'Y', 'Z'], 1000000)
})
# Boolean indexing
start = time.time()
result1 = df_large[(df_large['A'] > 0) & (df_large['B'] < 0) & (df_large['C'] == 'X')]
time1 = time.time() - start
# Query method
start = time.time()
result2 = df_large.query('A > 0 and B < 0 and C == "X"')
time2 = time.time() - start
print(f"Boolean indexing time: {time1:.4f} seconds")
print(f"Query method time: {time2:.4f} seconds")
print()
# Using eval for faster operations
start = time.time()
df_large['D'] = df_large.eval('A + B')
time3 = time.time() - start
start = time.time()
df_large['D2'] = df_large['A'] + df_large['B']
time4 = time.time() - start
print(f"eval time: {time3:.4f} seconds")
print(f"Direct addition time: {time4:.4f} seconds")
10. Real-World Data Science Examples
Example 1: Customer Segmentation
# Create customer data
np.random.seed(42)
customers = pd.DataFrame({
'Customer_ID': range(1, 1001),
'Age': np.random.randint(18, 70, 1000),
'Annual_Income': np.random.randint(30000, 150000, 1000),
'Spending_Score': np.random.randint(1, 100, 1000),
'Purchase_Frequency': np.random.choice(['Low', 'Medium', 'High'], 1000, p=[0.3, 0.5, 0.2]),
'Gender': np.random.choice(['Male', 'Female'], 1000)
})
print("Customer Data Sample:")
print(customers.head())
print()
# Customer segmentation by age group
customers['Age_Group'] = pd.cut(customers['Age'],
bins=[0, 25, 35, 50, 100],
labels=['Young', 'Adult', 'Middle-Aged', 'Senior'])
print("Age Group Distribution:")
print(customers['Age_Group'].value_counts())
print()
# RFM-like analysis
def rfm_score(row):
    if row['Spending_Score'] > 70:
        return 'High'
    elif row['Spending_Score'] > 40:
        return 'Medium'
    else:
        return 'Low'
customers['Segment'] = customers.apply(rfm_score, axis=1)
print("Customer Segments:")
print(customers['Segment'].value_counts())
print()
# Segment analysis
print("Segment Analysis:")
segment_stats = customers.groupby('Segment').agg({
'Age': ['mean', 'std'],
'Annual_Income': ['mean', 'median'],
'Spending_Score': ['mean', 'min', 'max']
})
print(segment_stats)
print()
# Visualization
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.boxplot(x='Segment', y='Annual_Income', data=customers)
plt.title('Income Distribution by Segment')
plt.subplot(1, 2, 2)
sns.countplot(x='Purchase_Frequency', hue='Segment', data=customers)
plt.title('Purchase Frequency by Segment')
plt.tight_layout()
plt.show()
Example 2: Sales Analysis
# Create sales data
dates = pd.date_range('2023-01-01', periods=365, freq='D')
products = ['Product A', 'Product B', 'Product C', 'Product D']
regions = ['North', 'South', 'East', 'West']
np.random.seed(42)
sales_data = pd.DataFrame({
'Date': np.random.choice(dates, 5000),
'Product': np.random.choice(products, 5000, p=[0.4, 0.3, 0.2, 0.1]),
'Region': np.random.choice(regions, 5000),
'Quantity': np.random.randint(1, 20, 5000),
'Unit_Price': np.random.uniform(10, 100, 5000)
})
sales_data['Revenue'] = sales_data['Quantity'] * sales_data['Unit_Price']
sales_data['Date'] = pd.to_datetime(sales_data['Date'])
print("Sales Data Sample:")
print(sales_data.head())
print()
# Monthly sales analysis
sales_data['YearMonth'] = sales_data['Date'].dt.to_period('M')
monthly_sales = sales_data.groupby(['YearMonth', 'Product']).agg({
'Revenue': 'sum',
'Quantity': 'sum'
}).reset_index()
print("Monthly Sales by Product:")
print(monthly_sales.head(12))
print()
# Region performance
region_performance = sales_data.groupby('Region').agg({
'Revenue': ['sum', 'mean', 'std'],
'Quantity': 'sum'
}).round(2)
print("Region Performance:")
print(region_performance)
print()
# Product trends
product_trends = sales_data.groupby(['Product', 'Date']).agg({
'Revenue': 'sum'
}).reset_index()
# Visualization
plt.figure(figsize=(14, 6))
# Revenue by region
plt.subplot(1, 3, 1)
sales_data.groupby('Region')['Revenue'].sum().plot(kind='bar', color='skyblue')
plt.title('Revenue by Region')
plt.ylabel('Total Revenue')
# Revenue by product
plt.subplot(1, 3, 2)
sales_data.groupby('Product')['Revenue'].sum().plot(kind='bar', color='lightcoral')
plt.title('Revenue by Product')
# Monthly trend
plt.subplot(1, 3, 3)
monthly_sales.groupby('YearMonth')['Revenue'].sum().plot(marker='o')
plt.title('Monthly Revenue Trend')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Correlation analysis
numeric_cols = ['Quantity', 'Unit_Price', 'Revenue']
correlation = sales_data[numeric_cols].corr()
print("Correlation Matrix:")
print(correlation)
Example 3: Data Cleaning Pipeline
# Create messy dataset
messy_data = pd.DataFrame({
'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Name': ['Alice', 'BOB', 'Charlie', 'david', 'Emma', 'Frank', 'Grace', 'HENRY', 'Ivy', 'JACK'],
'Age': [25, -5, 35, 28, 32, 29, 31, 33, 150, 22],
'Salary': [70000, 85000, -1000, 68000, 78000, 62000, 72000, 88000, 95000, 0],
'Join_Date': ['2020-01-15', '2021/03/20', '15-05-2020', '2022-07-10',
'2023-01-05', 'invalid', '2021-11-30', '2022-09-25',
'2023-03-15', '2020-12-01'],
'Department': ['Sales', 'IT', 'Marketing', 'IT', 'Sales',
'HR', 'Marketing', 'IT', 'Sales', 'Unknown']
})
print("Messy Data:")
print(messy_data)
print()
def clean_data(df):
    """Complete data cleaning pipeline"""
    df_clean = df.copy()
    # 1. Standardize text columns
    df_clean['Name'] = df_clean['Name'].str.title()
    df_clean['Department'] = df_clean['Department'].str.capitalize()
    # 2. Handle invalid ages
    df_clean['Age'] = df_clean['Age'].clip(0, 120)
    df_clean.loc[df_clean['Age'] > 100, 'Age'] = df_clean['Age'].median()
    # 3. Handle invalid salaries
    df_clean['Salary'] = df_clean['Salary'].clip(0, 200000)
    df_clean.loc[df_clean['Salary'] <= 0, 'Salary'] = df_clean['Salary'].median()
    # 4. Parse dates (format='mixed', pandas >= 2.0, handles inconsistent
    #    formats; errors='coerce' turns unparseable strings into NaT)
    df_clean['Join_Date'] = pd.to_datetime(df_clean['Join_Date'], format='mixed', errors='coerce')
    df_clean['Join_Date'] = df_clean['Join_Date'].fillna(pd.Timestamp('2020-01-01'))
    # 5. Replace unknown values
    df_clean['Department'] = df_clean['Department'].replace('Unknown', 'Other')
    # 6. Remove duplicates
    df_clean.drop_duplicates(subset=['Name'], keep='first', inplace=True)
    return df_clean
cleaned_data = clean_data(messy_data)
print("Cleaned Data:")
print(cleaned_data)
print()
print("Data Quality Report:")
print(f"Original shape: {messy_data.shape}")
print(f"Cleaned shape: {cleaned_data.shape}")
print(f"Rows removed: {len(messy_data) - len(cleaned_data)}")
print()
print("Missing values after cleaning:")
print(cleaned_data.isnull().sum())
Example 4: Feature Engineering
# Create base dataset
employee_data = pd.DataFrame({
'Employee_ID': range(1, 101),
'Hire_Date': pd.date_range('2020-01-01', periods=100, freq='D'),
'Department': np.random.choice(['Sales', 'IT', 'Marketing', 'HR'], 100),
'Salary': np.random.normal(70000, 15000, 100).astype(int),
'Projects_Completed': np.random.poisson(5, 100),
'Performance_Score': np.random.uniform(1, 5, 100),
'Training_Hours': np.random.exponential(20, 100).astype(int)
})
# Feature engineering
engineered = employee_data.copy()
# 1. Time-based features
engineered['Hire_Year'] = engineered['Hire_Date'].dt.year
engineered['Hire_Month'] = engineered['Hire_Date'].dt.month
engineered['Hire_Quarter'] = engineered['Hire_Date'].dt.quarter
engineered['Years_Employed'] = (pd.Timestamp.now() - engineered['Hire_Date']).dt.days / 365.25
# 2. Binning features
engineered['Salary_Band'] = pd.cut(engineered['Salary'],
bins=[0, 50000, 70000, 90000, 150000],
labels=['Low', 'Medium', 'High', 'Very High'])
engineered['Performance_Level'] = pd.cut(engineered['Performance_Score'],
bins=[0, 2, 3, 4, 5],
labels=['Poor', 'Average', 'Good', 'Excellent'])
# 3. Interaction features
engineered['Productivity_Ratio'] = engineered['Projects_Completed'] / (engineered['Years_Employed'] + 1)
engineered['Efficiency_Score'] = engineered['Performance_Score'] * engineered['Projects_Completed']
# 4. Aggregated features (transform takes a single function, so compute
#    the department mean and std separately)
dept_mean = engineered.groupby('Department')['Salary'].transform('mean')
dept_std = engineered.groupby('Department')['Salary'].transform('std')
engineered['Salary_vs_Dept_Avg'] = engineered['Salary'] - dept_mean
engineered['Salary_vs_Dept_Std'] = (engineered['Salary'] - dept_mean) / dept_std
# 5. Encoding categorical variables
engineered = pd.get_dummies(engineered, columns=['Department', 'Salary_Band'], prefix=['Dept', 'SalaryBand'])
# 6. Ranking features
engineered['Salary_Rank'] = engineered['Salary'].rank(method='dense', ascending=False)
print("Engineered Features:")
print(engineered.head())
print()
print("Feature Information:")
print(engineered.info())
11. Best Practices
Memory Optimization
def optimize_dataframe(df):
    """Optimize DataFrame memory usage"""
    df_opt = df.copy()
    for col in df_opt.columns:
        col_type = df_opt[col].dtype
        # Optimize integer columns (checking the dtype name avoids
        # accidentally casting datetime or other non-numeric columns)
        if str(col_type).startswith('int'):
            c_min = df_opt[col].min()
            c_max = df_opt[col].max()
            if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                df_opt[col] = df_opt[col].astype(np.int8)
            elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                df_opt[col] = df_opt[col].astype(np.int16)
            elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                df_opt[col] = df_opt[col].astype(np.int32)
        # Optimize float columns
        elif str(col_type).startswith('float'):
            c_min = df_opt[col].min()
            c_max = df_opt[col].max()
            if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                df_opt[col] = df_opt[col].astype(np.float32)
        # Convert low-cardinality object columns to categorical
        elif col_type == 'object' and df_opt[col].nunique() / len(df_opt) < 0.5:
            df_opt[col] = df_opt[col].astype('category')
    return df_opt
# Example usage
large_df = pd.DataFrame({
'int_col': np.random.randint(0, 100, 1000000),
'float_col': np.random.randn(1000000),
'category_col': np.random.choice(['A', 'B', 'C', 'D', 'E'], 1000000)
})
print("Memory before optimization:")
print(large_df.memory_usage(deep=True).sum() / 1024**2, "MB")
optimized_df = optimize_dataframe(large_df)
print("Memory after optimization:")
print(optimized_df.memory_usage(deep=True).sum() / 1024**2, "MB")
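For plain numeric columns, pd.to_numeric with downcast is a simpler alternative to the manual iinfo/finfo checks above; it picks the smallest dtype that can hold the values:
# Downcast without hand-rolled range checks
large_df['int_col'] = pd.to_numeric(large_df['int_col'], downcast='integer')
large_df['float_col'] = pd.to_numeric(large_df['float_col'], downcast='float')
print(large_df.dtypes)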
Writing Clean Code
# Use meaningful variable names
# Bad
df1 = pd.read_csv('data.csv')
x = df1.groupby('col1')['col2'].mean()
# Good
employee_df = pd.read_csv('employees.csv')
avg_salary_by_dept = employee_df.groupby('department')['salary'].mean()
# Use method chaining
# Bad
df = pd.read_csv('data.csv')
df = df[df['age'] > 18]
df = df.groupby('city')['income'].mean()
df = df.reset_index()
# Good
result = (pd.read_csv('data.csv')
.query('age > 18')
.groupby('city')['income']
.mean()
.reset_index())
# Use descriptive column names
df.columns = [col.lower().replace(' ', '_') for col in df.columns]
# Document data transformations
# Good
def clean_customer_data(df):
    """
    Clean customer dataset by handling missing values and outliers.

    Steps:
    1. Remove duplicate customer records
    2. Fill missing age with median
    3. Cap income at 99th percentile
    4. Standardize country names
    """
    df_clean = df.copy()
    # Implementation...
    return df_clean
12. Common Pitfalls and Solutions
Pitfall 1: Chained Assignment
# Bad (chained assignment)
df[df['age'] > 30]['income'] = 50000  # May not work

# Good
df.loc[df['age'] > 30, 'income'] = 50000

# Bad
df['new_col'] = df['col1'] / df['col2']
df[df['col2'] == 0]['new_col'] = 0

# Good
df['new_col'] = np.where(df['col2'] == 0, 0, df['col1'] / df['col2'])
Pitfall 2: Modifying During Iteration
# Bad
for index, row in df.iterrows():
    if row['value'] > 100:
        df.at[index, 'value'] = 100

# Good (vectorized)
df['value'] = df['value'].clip(upper=100)

# Good (using apply for complex operations)
def process_row(row):
    return row['value'] if row['value'] <= 100 else 100
df['value'] = df.apply(process_row, axis=1)
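When a row-wise loop genuinely cannot be avoided, itertuples() is usually much faster than iterrows() because it yields lightweight namedtuples instead of building a Series per row:
# Faster iteration when a loop is unavoidable
for row in df.itertuples():
    if row.value > 100:
        df.at[row.Index, 'value'] = 100  # row.Index is the row's index label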
Pitfall 3: SettingWithCopyWarning
# Bad (may cause SettingWithCopyWarning)
subset = df[df['age'] > 30]
subset['new_col'] = 100

# Good (explicit copy)
subset = df[df['age'] > 30].copy()
subset['new_col'] = 100

# Good (using loc)
df.loc[df['age'] > 30, 'new_col'] = 100
Pitfall 4: Inefficient Operations
# Bad
df['new_col'] = 0
for i in range(len(df)):
    df.loc[i, 'new_col'] = df.loc[i, 'col1'] + df.loc[i, 'col2']

# Good
df['new_col'] = df['col1'] + df['col2']

# Bad
result = []
for i in range(len(df)):
    result.append(df.loc[i, 'col1'] * 2)

# Good
result = df['col1'] * 2
Conclusion
DataFrames are the foundation of data science in Python. Mastering pandas DataFrames is essential for efficient data analysis:
Key Takeaways
- DataFrame Creation: Multiple ways to create DataFrames (dictionaries, lists, NumPy arrays, files)
- Data Inspection: info(), describe(), head(), tail() for quick overview
- Data Selection: loc[], iloc[], boolean indexing, query()
- Data Manipulation: Adding columns, modifying data, handling missing values
- Grouping and Aggregation: Powerful groupby operations for summarization
- Merging and Joining: Combine datasets efficiently
- Time Series: Special handling for temporal data
- Visualization: Integrated plotting for quick insights
- Performance: Vectorized operations over loops
Best Practices
- Use vectorized operations instead of loops
- Handle missing data appropriately
- Use meaningful variable names
- Chain methods for cleaner code
- Optimize data types for memory efficiency
- Test with small samples before full dataset
- Document data transformations
- Use version control for analysis scripts
DataFrames are not just a tool—they're a mindset. Thinking in terms of vectorized, tabular operations will make you a more efficient and effective data scientist!