Project Overview
Build a machine learning model that classifies SMS/Emails as Spam or Ham (Not Spam).
Why This Project?
- Perfect for beginners learning NLP
- Real-world application (everyone hates spam!)
- Small dataset size (runs on any laptop)
- Teaches fundamental ML concepts
Tech Stack Options
| Level | Tools |
|---|---|
| Beginner | Python, Pandas, Scikit-learn, Jupyter Notebook |
| Intermediate | Add NLTK/Spacy, Flask/FastAPI for deployment |
| Advanced | Add Deep Learning (LSTM/Transformers), Docker |
Dataset Options
- SMS Spam Collection - UCI ML Repository (5,574 messages)
- Enron Email Dataset - For email spam detection
- Twitter Spam Dataset - For social media spam
Quick Download:
# Direct download link for SMS Spam Collection
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Project Structure
spam-classifier/
├── data/
│   ├── raw/                          # Original dataset
│   └── processed/                    # Cleaned data
├── notebooks/
│   ├── 01_eda.ipynb                  # Exploratory Data Analysis
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_training.ipynb
├── src/
│   ├── data_preprocessing.py
│   ├── feature_extraction.py
│   ├── model.py
│   └── utils.py
├── models/
│   └── spam_classifier.pkl           # Saved model
├── app/
│   ├── static/                       # CSS, JS files
│   ├── templates/                    # HTML files
│   └── app.py                        # Flask web app
├── requirements.txt
└── README.md
Step-by-Step Implementation
Step 1: Setup and Data Loading
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import nltk
import re
import pickle
# Load data
df = pd.read_csv('data/raw/SMSSpamCollection', sep='\t', names=['label', 'message'])
print(df.shape)
print(df.head())
Step 2: Exploratory Data Analysis
# Check class distribution
print(df['label'].value_counts())
print(df['label'].value_counts(normalize=True) * 100)
# Visualize
sns.countplot(x='label', data=df)
plt.title('Spam vs Ham Distribution')
plt.show()
# Add text length feature
df['message_length'] = df['message'].apply(len)
df['word_count'] = df['message'].apply(lambda x: len(x.split()))
# Compare lengths
df.groupby('label').agg({
'message_length': ['mean', 'median', 'max'],
'word_count': ['mean', 'median', 'max']
})
Step 3: Text Preprocessing
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    # Tokenize, remove stopwords, then lemmatize
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)
# Apply preprocessing
df['clean_message'] = df['message'].apply(preprocess_text)
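To sanity-check the cleaning logic before running it over the whole dataset, here is a minimal regex-only variant (no NLTK downloads required, so the stopword and lemmatization steps are deliberately skipped) applied to a sample message:

```python
import re

def basic_clean(text):
    # Lowercase, strip non-letter characters, collapse whitespace
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "WINNER!! Claim your FREE prize now, call 0800-123-4567"
print(basic_clean(sample))  # -> "winner claim your free prize now call"
```

Note how the phone number disappears entirely; if digits carry signal for your use case (spam often contains numbers), consider keeping a "contains-number" flag as a separate feature before stripping them.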
Step 4: Feature Engineering (Choose One)
Option A: Bag of Words
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['clean_message']).toarray()
Option B: TF-IDF (Better for this task)
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(df['clean_message']).toarray()
Option C: Word Embeddings (Advanced)
# Using pre-trained Word2Vec or GloVe
# More complex, better for deep learning approaches
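Option C is usually implemented by mapping each token to a pre-trained vector and averaging them into one fixed-length message vector. A toy sketch of that averaging idea, using a hypothetical 3-dimensional embedding table (a real pipeline would load GloVe or Word2Vec vectors instead):

```python
import numpy as np

# Hypothetical tiny embedding table; real code would load GloVe/Word2Vec here
embeddings = {
    "free":  np.array([0.9, 0.1, 0.0]),
    "win":   np.array([0.8, 0.2, 0.1]),
    "hello": np.array([0.1, 0.9, 0.3]),
}

def message_vector(tokens, table, dim=3):
    # Average the vectors of known tokens; zero vector if none are known
    vecs = [table[t] for t in tokens if t in table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

v = message_vector(["win", "free", "unknownword"], embeddings)
print(v)  # average of the "win" and "free" vectors
```

Unknown tokens are simply dropped, which is the usual convention; the resulting dense vectors feed naturally into the deep learning approaches mentioned above.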
Step 5: Train-Test Split
# Encode labels
y = df['label'].map({'ham': 0, 'spam': 1})
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
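`stratify=y` matters here because the dataset is imbalanced (roughly 13% spam): stratifying keeps the spam ratio identical in both splits, so test metrics are not skewed by an unlucky draw. A small illustration on a synthetic label vector with about the same class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 870 ham (0), 130 spam (1), roughly the SMS ratio
labels = np.array([0] * 870 + [1] * 130)
features = np.arange(1000).reshape(-1, 1)  # dummy feature column

_, _, y_tr, y_te = train_test_split(features, labels, test_size=0.2,
                                    random_state=42, stratify=labels)
print(y_tr.mean(), y_te.mean())  # both ~0.13, matching the overall spam rate
```

Without `stratify`, a 20% test split of a small dataset can end up with noticeably more or less spam than the training set.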
Step 6: Model Training
# Train multiple models and compare
# 1. Naive Bayes (a strong baseline for text classification)
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_test)
# 2. Logistic Regression
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)
# 3. Random Forest
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
# Compare results
models = ['Naive Bayes', 'Logistic Regression', 'Random Forest']
for name, y_pred in zip(models, [y_pred_nb, y_pred_lr, y_pred_rf]):
    print(f"\n{name} Results:")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))
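A single train/test split can over- or under-state performance on a dataset this size; if you want a steadier estimate, k-fold cross-validation is a drop-in addition. Sketch on synthetic data (with the real pipeline you would pass your own `X` and `y`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for the real X, y; made non-negative because
# MultinomialNB expects count-like (non-negative) features
Xs, ys = make_classification(n_samples=500, n_features=20, random_state=42)
Xs = np.abs(Xs)

scores = cross_val_score(MultinomialNB(), Xs, ys, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On the real data, `cv=5` with `scoring='f1'` is often more informative than accuracy, given the class imbalance.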
Step 7: Model Evaluation & Visualization
# Confusion Matrix Heatmap
from sklearn.metrics import ConfusionMatrixDisplay
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for model, y_pred, ax in zip(models, [y_pred_nb, y_pred_lr, y_pred_rf], axes):
    ConfusionMatrixDisplay.from_predictions(
        y_test, y_pred, ax=ax, cmap='Blues',
        display_labels=['Ham', 'Spam']
    )
    ax.set_title(model)
plt.tight_layout()
plt.show()
# ROC Curve
from sklearn.metrics import roc_curve, auc
plt.figure(figsize=(8, 6))
for model, name in zip([nb_model, lr_model, rf_model], models):
    if hasattr(model, "predict_proba"):
        y_pred_prob = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison')
plt.legend()
plt.show()
Step 8: Save Best Model
# Save the best performing model (here, Naive Bayes) plus the fitted vectorizer
best_model = nb_model
with open('models/spam_classifier.pkl', 'wb') as f:
    pickle.dump(best_model, f)
with open('models/vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
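Pickling is only useful if the reloaded model behaves identically, so it is worth a quick round-trip check. A self-contained version on a toy model (the file path here is illustrative):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny toy problem: word-count rows, labels 0 = ham, 1 = spam
X_toy = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]])
y_toy = np.array([0, 1, 0, 1])
model = MultinomialNB().fit(X_toy, y_toy)

# Save to a temp file, reload, and verify predictions match exactly
path = os.path.join(tempfile.mkdtemp(), 'toy_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)
with open(path, 'rb') as f:
    reloaded = pickle.load(f)

assert (model.predict(X_toy) == reloaded.predict(X_toy)).all()
print("round-trip OK")
```

Remember to pickle the vectorizer alongside the model, as above: transforming new text with a freshly fitted vectorizer would produce a different vocabulary and silently break predictions.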
Step 9: Build Web Application (Flask)
app.py
from flask import Flask, render_template, request, jsonify
import pickle
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
app = Flask(__name__)
# Load model and vectorizer
with open('models/spam_classifier.pkl', 'rb') as f:
    model = pickle.load(f)
with open('models/vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)
# Initialize NLP tools
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)
@app.route('/')
def home():
    return render_template('index.html')
@app.route('/predict', methods=['POST'])
def predict():
    message = request.form['message']
    # Preprocess and predict
    clean_msg = preprocess_text(message)
    vectorized_msg = vectorizer.transform([clean_msg])
    prediction = model.predict(vectorized_msg)[0]
    probability = model.predict_proba(vectorized_msg)[0]
    result = {
        'prediction': 'SPAM' if prediction == 1 else 'NOT SPAM',
        'confidence': float(max(probability)),
        'spam_probability': float(probability[1]),
        'ham_probability': float(probability[0])
    }
    return jsonify(result)

if __name__ == '__main__':
    app.run(debug=True)
templates/index.html
<!DOCTYPE html>
<html>
<head>
<title>SMS Spam Classifier</title>
<style>
body { font-family: Arial; max-width: 600px; margin: 50px auto; padding: 20px; }
textarea { width: 100%; height: 100px; margin: 10px 0; padding: 10px; }
button { background: #4CAF50; color: white; padding: 10px 20px; border: none; cursor: pointer; }
.result { margin-top: 20px; padding: 20px; border-radius: 5px; display: none; }
.spam { background: #ffdddd; border: 1px solid #ff0000; }
.ham { background: #ddffdd; border: 1px solid #00ff00; }
</style>
</head>
<body>
<h1>SMS Spam Classifier</h1>
<p>Enter a message to check if it's spam or not:</p>
<textarea id="message" placeholder="Type your message here..."></textarea>
<button onclick="predict()">Check Message</button>
<div id="result" class="result"></div>
<script>
function predict() {
const message = document.getElementById('message').value;
fetch('/predict', {
method: 'POST',
headers: {'Content-Type': 'application/x-www-form-urlencoded'},
body: 'message=' + encodeURIComponent(message)
})
.then(response => response.json())
.then(data => {
const resultDiv = document.getElementById('result');
resultDiv.style.display = 'block';
if (data.prediction === 'SPAM') {
resultDiv.className = 'result spam';
resultDiv.innerHTML = `<h2>⚠️ SPAM DETECTED!</h2>
<p>Confidence: ${(data.confidence * 100).toFixed(2)}%</p>
<p>This message is likely spam.</p>`;
} else {
resultDiv.className = 'result ham';
resultDiv.innerHTML = `<h2>✅ SAFE MESSAGE</h2>
<p>Confidence: ${(data.confidence * 100).toFixed(2)}%</p>
<p>This message appears to be legitimate.</p>`;
}
});
}
</script>
</body>
</html>
Expected Results
| Metric | Naive Bayes | Logistic Regression | Random Forest |
|---|---|---|---|
| Accuracy | 97-98% | 97-98% | 96-97% |
| Precision (Spam) | 99% | 98% | 97% |
| Recall (Spam) | 92% | 93% | 90% |
| F1-Score (Spam) | 95% | 95% | 93% |
Enhancements & Next Steps
- Deep Learning Version: Use LSTM or Transformers (BERT)
- Real-time Integration: Connect to Gmail API for live filtering
- Multi-language Support: Add support for non-English messages
- Explainability: Add SHAP/LIME to explain why a message is spam
- Mobile App: Convert to Android/iOS app using TensorFlow Lite
- Browser Extension: Create Chrome extension for email spam filtering
requirements.txt
pandas==2.0.3
numpy==1.24.3
scikit-learn==1.3.0
matplotlib==3.7.2
seaborn==0.12.2
nltk==3.8.1
flask==2.3.2
joblib==1.3.1
Project Deliverables
- ✅ Working ML model (97%+ accuracy)
- ✅ Web interface for testing
- ✅ API endpoint for integration
- ✅ Documentation
- ✅ Test cases
- ✅ Deployment ready (can deploy on Heroku/Railway)
This project is beginner-friendly, teaches fundamental ML concepts, and produces a portfolio-worthy application that solves a real-world problem!