Project Overview
Build a machine learning model that classifies SMS/Emails as Spam or Ham (Not Spam).
Why This Project?
- Perfect for beginners learning NLP
- Real-world application (everyone hates spam!)
- Small dataset size (runs on any laptop)
- Teaches fundamental ML concepts
Tech Stack Options
| Level | Tools |
|---|---|
| Beginner | Python, Pandas, Scikit-learn, Jupyter Notebook |
| Intermediate | Add NLTK/Spacy, Flask/FastAPI for deployment |
| Advanced | Add Deep Learning (LSTM/Transformers), Docker |
Dataset Options
- SMS Spam Collection - UCI ML Repository (5,574 messages)
- Enron Email Dataset - For email spam detection
- Twitter Spam Dataset - For social media spam
Quick Download:
# Direct download link for SMS Spam Collection
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Project Structure
spam-classifier/
├── data/
│   ├── raw/                          # Original dataset
│   └── processed/                    # Cleaned data
├── notebooks/
│   ├── 01_eda.ipynb                  # Exploratory Data Analysis
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_training.ipynb
├── src/
│   ├── data_preprocessing.py
│   ├── feature_extraction.py
│   ├── model.py
│   └── utils.py
├── models/
│   └── spam_classifier.pkl           # Saved model
├── app/
│   ├── static/                       # CSS, JS files
│   ├── templates/                    # HTML files
│   └── app.py                        # Flask web app
├── requirements.txt
└── README.md
Step-by-Step Implementation
Step 1: Setup and Data Loading
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import nltk
import re
import pickle
# Load data
df = pd.read_csv('data/raw/SMSSpamCollection', sep='\t', names=['label', 'message'])
print(df.shape)
print(df.head())
Step 2: Exploratory Data Analysis
# Check class distribution
print(df['label'].value_counts())
print(df['label'].value_counts(normalize=True) * 100)
# Visualize
sns.countplot(x='label', data=df)
plt.title('Spam vs Ham Distribution')
plt.show()
# Add text length feature
df['message_length'] = df['message'].apply(len)
df['word_count'] = df['message'].apply(lambda x: len(x.split()))
# Compare lengths
df.groupby('label').agg({
'message_length': ['mean', 'median', 'max'],
'word_count': ['mean', 'median', 'max']
})
Step 3: Text Preprocessing
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    # Tokenize, remove stopwords, then lemmatize
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)
# Apply preprocessing
df['clean_message'] = df['message'].apply(preprocess_text)
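To sanity-check the cleaning logic before running it over the whole dataset, here is a minimal regex-only variant (no NLTK downloads required, so the stopword and lemmatization steps are deliberately skipped) applied to a sample message:

```python
import re

def basic_clean(text):
    # Lowercase, strip non-letter characters, collapse whitespace
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "WINNER!! Claim your FREE prize now, call 0800-123-4567"
print(basic_clean(sample))  # -> "winner claim your free prize now call"
```

Note how the phone number disappears entirely; if digits carry signal for your use case (spam often contains numbers), consider keeping a "contains-number" flag as a separate feature before stripping them.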
Step 4: Feature Engineering (Choose One)
Option A: Bag of Words
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['clean_message']).toarray()
Option B: TF-IDF (Better for this task)
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(df['clean_message']).toarray()
Option C: Word Embeddings (Advanced)
# Using pre-trained Word2Vec or GloVe
# More complex, better for deep learning approaches
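Option C is usually implemented by mapping each token to a pre-trained vector and averaging them into one fixed-length message vector. A toy sketch of that averaging idea, using a hypothetical 3-dimensional embedding table (a real pipeline would load GloVe or Word2Vec vectors instead):

```python
import numpy as np

# Hypothetical tiny embedding table; real code would load GloVe/Word2Vec here
embeddings = {
    "free":  np.array([0.9, 0.1, 0.0]),
    "win":   np.array([0.8, 0.2, 0.1]),
    "hello": np.array([0.1, 0.9, 0.3]),
}

def message_vector(tokens, table, dim=3):
    # Average the vectors of known tokens; zero vector if none are known
    vecs = [table[t] for t in tokens if t in table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

v = message_vector(["win", "free", "unknownword"], embeddings)
print(v)  # average of the "win" and "free" vectors
```

Unknown tokens are simply dropped, which is the usual convention; the resulting dense vectors feed naturally into the deep learning approaches mentioned above.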
Step 5: Train-Test Split
# Encode labels
y = df['label'].map({'ham': 0, 'spam': 1})
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
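`stratify=y` matters here because the dataset is imbalanced (roughly 13% spam): stratifying keeps the spam ratio identical in both splits, so test metrics are not skewed by an unlucky draw. A small illustration on a synthetic label vector with about the same class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 870 ham (0), 130 spam (1), roughly the SMS ratio
labels = np.array([0] * 870 + [1] * 130)
features = np.arange(1000).reshape(-1, 1)  # dummy feature column

_, _, y_tr, y_te = train_test_split(features, labels, test_size=0.2,
                                    random_state=42, stratify=labels)
print(y_tr.mean(), y_te.mean())  # both ~0.13, matching the overall spam rate
```

Without `stratify`, a 20% test split of a small dataset can end up with noticeably more or less spam than the training set.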
Step 6: Model Training
# Train multiple models and compare
# 1. Naive Bayes (a strong baseline for text classification)
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_test)
# 2. Logistic Regression
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)
# 3. Random Forest
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
# Compare results
models = ['Naive Bayes', 'Logistic Regression', 'Random Forest']
for name, y_pred in zip(models, [y_pred_nb, y_pred_lr, y_pred_rf]):
    print(f"\n{name} Results:")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))
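A single train/test split can over- or under-state performance on a dataset this size; if you want a steadier estimate, k-fold cross-validation is a drop-in addition. Sketch on synthetic data (with the real pipeline you would pass your own `X` and `y`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for the real X, y; made non-negative because
# MultinomialNB expects count-like (non-negative) features
Xs, ys = make_classification(n_samples=500, n_features=20, random_state=42)
Xs = np.abs(Xs)

scores = cross_val_score(MultinomialNB(), Xs, ys, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On the real data, `cv=5` with `scoring='f1'` is often more informative than accuracy, given the class imbalance.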
Step 7: Model Evaluation & Visualization
# Confusion Matrix Heatmap
from sklearn.metrics import ConfusionMatrixDisplay
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for model, y_pred, ax in zip(models, [y_pred_nb, y_pred_lr, y_pred_rf], axes):
    ConfusionMatrixDisplay.from_predictions(
        y_test, y_pred, ax=ax, cmap='Blues',
        display_labels=['Ham', 'Spam']
    )
    ax.set_title(model)
plt.tight_layout()
plt.show()
# ROC Curve
from sklearn.metrics import roc_curve, auc
plt.figure(figsize=(8, 6))
for model, name in zip([nb_model, lr_model, rf_model], models):
    if hasattr(model, "predict_proba"):
        y_pred_prob = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison')
plt.legend()
plt.show()
Step 8: Save Best Model
# Save the best performing model (here, Naive Bayes) plus the fitted vectorizer
best_model = nb_model
with open('models/spam_classifier.pkl', 'wb') as f:
    pickle.dump(best_model, f)
with open('models/vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
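Pickling is only useful if the reloaded model behaves identically, so it is worth a quick round-trip check. A self-contained version on a toy model (the file path here is illustrative):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny toy problem: word-count rows, labels 0 = ham, 1 = spam
X_toy = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]])
y_toy = np.array([0, 1, 0, 1])
model = MultinomialNB().fit(X_toy, y_toy)

# Save to a temp file, reload, and verify predictions match exactly
path = os.path.join(tempfile.mkdtemp(), 'toy_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)
with open(path, 'rb') as f:
    reloaded = pickle.load(f)

assert (model.predict(X_toy) == reloaded.predict(X_toy)).all()
print("round-trip OK")
```

Remember to pickle the vectorizer alongside the model, as above: transforming new text with a freshly fitted vectorizer would produce a different vocabulary and silently break predictions.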
Step 9: Build Web Application (Flask)
app.py
from flask import Flask, render_template, request, jsonify
import pickle
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
app = Flask(__name__)
# Load model and vectorizer
with open('models/spam_classifier.pkl', 'rb') as f:
    model = pickle.load(f)
with open('models/vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)
# Initialize NLP tools
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)
@app.route('/')
def home():
    return render_template('index.html')
@app.route('/predict', methods=['POST'])
def predict():
    message = request.form['message']
    # Preprocess and predict
    clean_msg = preprocess_text(message)
    vectorized_msg = vectorizer.transform([clean_msg])
    prediction = model.predict(vectorized_msg)[0]
    probability = model.predict_proba(vectorized_msg)[0]
    result = {
        'prediction': 'SPAM' if prediction == 1 else 'NOT SPAM',
        'confidence': float(max(probability)),
        'spam_probability': float(probability[1]),
        'ham_probability': float(probability[0])
    }
    return jsonify(result)

if __name__ == '__main__':
    app.run(debug=True)
templates/index.html
<!DOCTYPE html>
<html>
<head>
<title>SMS Spam Classifier</title>
<style>
body { font-family: Arial; max-width: 600px; margin: 50px auto; padding: 20px; }
textarea { width: 100%; height: 100px; margin: 10px 0; padding: 10px; }
button { background: #4CAF50; color: white; padding: 10px 20px; border: none; cursor: pointer; }
.result { margin-top: 20px; padding: 20px; border-radius: 5px; display: none; }
.spam { background: #ffdddd; border: 1px solid #ff0000; }
.ham { background: #ddffdd; border: 1px solid #00ff00; }
</style>
</head>
<body>
<h1>SMS Spam Classifier</h1>
<p>Enter a message to check if it's spam or not:</p>
<textarea id="message" placeholder="Type your message here..."></textarea>
<button onclick="predict()">Check Message</button>
<div id="result" class="result"></div>
<script>
function predict() {
const message = document.getElementById('message').value;
fetch('/predict', {
method: 'POST',
headers: {'Content-Type': 'application/x-www-form-urlencoded'},
body: 'message=' + encodeURIComponent(message)
})
.then(response => response.json())
.then(data => {
const resultDiv = document.getElementById('result');
resultDiv.style.display = 'block';
if (data.prediction === 'SPAM') {
resultDiv.className = 'result spam';
resultDiv.innerHTML = `<h2>⚠️ SPAM DETECTED!</h2>
<p>Confidence: ${(data.confidence * 100).toFixed(2)}%</p>
<p>This message is likely spam.</p>`;
} else {
resultDiv.className = 'result ham';
resultDiv.innerHTML = `<h2>✅ SAFE MESSAGE</h2>
<p>Confidence: ${(data.confidence * 100).toFixed(2)}%</p>
<p>This message appears to be legitimate.</p>`;
}
});
}
</script>
</body>
</html>
Expected Results
| Metric | Naive Bayes | Logistic Regression | Random Forest |
|---|---|---|---|
| Accuracy | 97-98% | 97-98% | 96-97% |
| Precision (Spam) | 99% | 98% | 97% |
| Recall (Spam) | 92% | 93% | 90% |
| F1-Score (Spam) | 95% | 95% | 93% |
Enhancements & Next Steps
- Deep Learning Version: Use LSTM or Transformers (BERT)
- Real-time Integration: Connect to Gmail API for live filtering
- Multi-language Support: Add support for non-English messages
- Explainability: Add SHAP/LIME to explain why a message is spam
- Mobile App: Convert to Android/iOS app using TensorFlow Lite
- Browser Extension: Create Chrome extension for email spam filtering
requirements.txt
pandas==2.0.3
numpy==1.24.3
scikit-learn==1.3.0
matplotlib==3.7.2
seaborn==0.12.2
nltk==3.8.1
flask==2.3.2
joblib==1.3.1
Project Deliverables
- ✅ Working ML model (97%+ accuracy)
- ✅ Web interface for testing
- ✅ API endpoint for integration
- ✅ Documentation
- ✅ Test cases
- ✅ Deployment ready (can deploy on Heroku/Railway)
This project is beginner-friendly, teaches fundamental ML concepts, and produces a portfolio-worthy application that solves a real-world problem!