Spam Classifier: SMS/Email Filter using Naive Bayes

📌 Introduction

The Spam Classifier is a machine learning application that automatically classifies SMS and email messages as either Spam (unwanted messages) or Ham (legitimate messages) using the Naive Bayes algorithm. This project demonstrates a practical implementation of text classification with a user-friendly interface, persistent database storage, and real-time prediction.

Why Naive Bayes?

Fast and efficient for text classification
Works well with small datasets
Handles high-dimensional data effectively
Probability-based approach ideal for spam detection

✨ Features

Core Features

Text Classification: Classify messages as spam or ham with high accuracy
Real-time Prediction: Instant results for new messages
Training Capability: Retrain the model with new data
MongoDB Integration: Persistent storage for messages and predictions
Confidence Score: Shows probability percentage for each prediction

Technical Features

TF-IDF Vectorization: Convert text to numerical features
Multinomial Naive Bayes: Optimized for text classification
RESTful API: Easy integration with other applications
Data Visualization: Performance metrics and statistics
Export Functionality: Download classification results

📁 Project Structure

spam-classifier/
│
├── 📂 backend/
│   ├── 📂 models/
│   │   ├── spam_classifier.py      # Naive Bayes classifier implementation
│   │   └── model_utils.py          # Model training and saving utilities
│   │
│   ├── 📂 database/
│   │   ├── mongo_connection.py     # MongoDB connection handler
│   │   └── message_model.py        # Database schema and operations
│   │
│   ├── 📂 api/
│   │   ├── routes.py               # API endpoints
│   │   └── validation.py           # Input validation
│   │
│   ├── app.py                      # Main Flask application
│   └── config.py                    # Configuration settings
│
├── 📂 frontend/
│   ├── 📂 static/
│   │   ├── css/
│   │   │   └── style.css           # Custom styles
│   │   └── js/
│   │       └── main.js             # Frontend JavaScript
│   │
│   ├── 📂 templates/
│   │   ├── index.html              # Main interface
│   │   ├── dashboard.html           # Analytics dashboard
│   │   └── history.html             # Message history
│   │
│   └── 📂 components/
│       └── charts.js                # Data visualization
│
├── 📂 data/
│   ├── raw/                         # Original dataset
│   └── processed/                   # Cleaned dataset
│
├── 📂 notebooks/
│   └── model_development.ipynb      # Jupyter notebook for experimentation
│
├── 📂 tests/
│   ├── test_classifier.py           # Unit tests
│   └── test_api.py                  # API tests
│
├── 📂 utils/
│   ├── text_preprocessing.py        # Text cleaning functions
│   └── evaluation.py                 # Model evaluation metrics
│
├── requirements.txt                  # Python dependencies
├── .env                              # Environment variables
├── .gitignore                        # Git ignore file
├── README.md                         # Project documentation
└── docker-compose.yml                # Docker configuration

🛠️ Installation & Setup

Prerequisites

Python 3.8+
MongoDB 4.4+
pip (Python package manager)

Step 1: Clone the Repository

git clone https://github.com/yourusername/spam-classifier.git
cd spam-classifier

Step 2: Create Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Step 3: Install Dependencies

pip install -r requirements.txt

Step 4: Configure MongoDB

Create a .env file in the root directory:

MONGODB_URI=mongodb://localhost:27017/
DATABASE_NAME=spam_classifier
COLLECTION_NAME=messages
SECRET_KEY=your-secret-key-here

Step 5: Start MongoDB

# Start MongoDB service
sudo systemctl start mongod  # Linux
# OR
mongod  # Direct execution

Step 6: Run the Application

python backend/app.py

Visit http://localhost:5000 in your browser.

📊 Dataset

The model can be trained on the SMS Spam Collection Dataset or any custom dataset with the following format:

CSV format: label,message
Label options: 'spam' or 'ham'
Sample data:

ham,Hello how are you?
spam,CONGRATULATIONS! You've won a prize!
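
As a rough illustration of the format above, the following sketch parses `label,message` rows with the standard library only. The function name `load_dataset` is hypothetical and not part of the project. It splits on the first comma only, on the assumption that messages may themselves contain unquoted commas.

```python
import io

VALID_LABELS = {"spam", "ham"}

def load_dataset(handle):
    """Parse `label,message` lines into a list of (label, message) tuples."""
    rows = []
    for line_number, raw in enumerate(handle, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first comma only so commas inside the message survive.
        label, sep, message = line.partition(",")
        label = label.strip().lower()
        if line_number == 1 and label == "label":
            continue  # skip an optional `label,message` header row
        if not sep or label not in VALID_LABELS:
            raise ValueError(
                f"line {line_number}: expected 'spam' or 'ham', got {label!r}"
            )
        rows.append((label, message.strip()))
    return rows

# The two sample rows from this section.
sample = io.StringIO(
    "ham,Hello how are you?\n"
    "spam,CONGRATULATIONS! You've won a prize!\n"
)
dataset = load_dataset(sample)
print(dataset)
```

Rows with an unknown label raise a `ValueError` rather than being dropped silently, which makes labeling mistakes in a custom dataset easier to spot.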

🧠 How It Works

1. Text Preprocessing

Lowercase conversion
Remove punctuation and special characters
Remove stop words
Tokenization
Stemming/Lemmatization
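
The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's `text_preprocessing.py`: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real pipeline would
# normally use a fuller list from a library).
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "to", "and", "of", "in", "how"}

def naive_stem(token):
    """Crude suffix stripper standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, drop stop words, then stem."""
    text = text.lower()
    text = text.replace("'", "")              # keep contractions as one token
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # other punctuation becomes a space
    tokens = text.split()                     # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]

print(preprocess("Hey, are we still meeting today?"))
# ['hey', 'we', 'still', 'meet', 'today']
```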

2. Feature Extraction

TF-IDF Vectorization: Converts text to numerical features
N-gram features: Capture word sequences
Vocabulary size: Configurable (default: 5000 features)
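
A minimal sketch of this step using scikit-learn's `TfidfVectorizer`, assuming scikit-learn is installed. The sample messages are illustrative, and the `ngram_range=(1, 2)` and `stop_words="english"` settings are assumptions; only the 5000-feature default comes from this document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

messages = [
    "Hello how are you?",
    "CONGRATULATIONS! You've won a prize!",
    "Hey, are we still meeting today?",
    "Win a free prize now, claim your prize today",
]

vectorizer = TfidfVectorizer(
    lowercase=True,        # lowercase before tokenizing
    stop_words="english",  # built-in English stop-word list
    ngram_range=(1, 2),    # unigrams and bigrams (assumed setting)
    max_features=5000,     # vocabulary cap, matching the documented default
)

# Learn the vocabulary and return a sparse (n_messages, n_features) matrix.
features = vectorizer.fit_transform(messages)

print(features.shape)
print(sorted(vectorizer.vocabulary_)[:10])
```

At prediction time the fitted vectorizer should be reused with `vectorizer.transform(...)` so that new messages map onto the same feature columns as the training data.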

3. Classification Algorithm

The Naive Bayes classifier uses Bayes' theorem:

P(Spam|Message) = P(Message|Spam) * P(Spam) / P(Message)
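
To make the formula concrete, here is a small from-scratch sketch of a multinomial Naive Bayes posterior. It is illustrative only: it works on raw word counts rather than TF-IDF weights, uses Laplace smoothing with `alpha=1.0`, and the four training messages are made up. The "naive" part is the assumption that words are independent given the class, so P(Message|Spam) becomes a product of per-word probabilities.

```python
import math
from collections import Counter

# Hypothetical toy training set.
training = [
    ("spam", "win a free prize now"),
    ("spam", "free prize claim now"),
    ("ham", "are we meeting now"),
    ("ham", "see you at the meeting"),
]

def train(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for label, text in examples:
        words = text.split()
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def posterior(text, class_counts, word_counts, vocabulary, alpha=1.0):
    """Return P(class | text) for every class via Bayes' theorem."""
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, doc_count in class_counts.items():
        log_score = math.log(doc_count / total_docs)  # log P(class)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            # Smoothed log P(word | class)
            log_score += math.log(
                (count + alpha) / (total_words + alpha * len(vocabulary))
            )
        log_scores[label] = log_score
    # Dividing by the sum over classes plays the role of P(Message).
    peak = max(log_scores.values())
    unnormalized = {k: math.exp(v - peak) for k, v in log_scores.items()}
    evidence = sum(unnormalized.values())
    return {k: v / evidence for k, v in unnormalized.items()}

model = train(training)
print(posterior("free prize", *model))  # spam is about 0.9, ham about 0.1
```

Working in log space avoids numeric underflow when a message contains many words, since the product of many small probabilities becomes a sum of logs.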

4. Prediction Output

Classification: Spam or Ham
Confidence Score: Probability percentage
Timestamp: When the message was analyzed
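
One way such an output record could be assembled is sketched below. The function name `build_prediction`, the 0.5 threshold, and the `confidence_percent` and `timestamp` field names are assumptions; the `label` and `confidence` keys mirror the example response shown under Usage Examples.

```python
from datetime import datetime, timezone

def build_prediction(spam_probability, threshold=0.5):
    """Turn a spam probability into a label, confidence score, and timestamp."""
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    is_spam = spam_probability >= threshold
    # Confidence is the probability of whichever class was chosen.
    confidence = spam_probability if is_spam else 1.0 - spam_probability
    return {
        "label": "spam" if is_spam else "ham",
        "confidence": round(confidence, 4),
        "confidence_percent": f"{confidence * 100:.1f}%",
        # UTC timestamp in ISO 8601 format.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

print(build_prediction(0.98))
```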

📡 API Endpoints

Endpoint	Method	Description
`/api/predict`	POST	Classify a single message
`/api/train`	POST	Train model with new data
`/api/history`	GET	Get classification history
`/api/stats`	GET	Get model statistics
`/api/delete/<id>`	DELETE	Delete a message record
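
Before a request reaches the classifier, the body sent to `/api/predict` would typically be validated. The sketch below shows one possible check for a JSON body with a `message` field. The function name, the error messages, and the 5000-character limit are hypothetical and not taken from the project's `validation.py`.

```python
MAX_MESSAGE_LENGTH = 5000  # hypothetical upper bound on message size

def validate_predict_payload(payload):
    """Return (message, error); exactly one of the two is None."""
    if not isinstance(payload, dict):
        return None, "Request body must be a JSON object"
    message = payload.get("message")
    if not isinstance(message, str):
        return None, "'message' must be a string"
    message = message.strip()
    if not message:
        return None, "'message' must not be empty"
    if len(message) > MAX_MESSAGE_LENGTH:
        return None, f"'message' must be at most {MAX_MESSAGE_LENGTH} characters"
    return message, None

print(validate_predict_payload({"message": "  Hey, are we still meeting today? "}))
print(validate_predict_payload({"text": "wrong field name"}))
```

A route handler could return the error string with an HTTP 400 status when the second element is not `None`, and pass the cleaned message to the model otherwise.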

🚀 Usage Examples

Python Client

import requests

# Predict a single message
url = "http://localhost:5000/api/predict"
data = {"message": "Congratulations! You've won $1000!"}
response = requests.post(url, json=data, timeout=10)
response.raise_for_status()  # raise an exception on HTTP error responses
print(response.json())
# Example output: {'label': 'spam', 'confidence': 0.98}

cURL Command

curl -X POST http://localhost:5000/api/predict \
-H "Content-Type: application/json" \
-d '{"message": "Hey, are we still meeting today?"}'

📈 Performance Metrics

The model achieves:

Accuracy: 98.5%
Precision: 99.2% (for spam)
Recall: 97.8% (for spam)
F1-Score: 98.5% (for spam)
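
For reference, these metrics are derived from confusion-matrix counts, with spam treated as the positive class. The sketch below shows the standard formulas. The counts passed to `classification_metrics` are made-up illustration values, not this project's results. The last line checks that the precision and recall quoted above give the quoted F1 score.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 90 spam caught, 10 ham flagged by mistake,
# 30 spam missed, 870 ham passed through correctly.
print(classification_metrics(tp=90, fp=10, fn=30, tn=870))

# Consistency check on the figures quoted in this section.
print(round(f1_score(0.992, 0.978), 3))  # 0.985
```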

🔧 Configuration Options

Edit config.py to customize:

MAX_FEATURES: Number of features for TF-IDF (default: 5000)
TEST_SIZE: Train-test split ratio (default: 0.2)
RANDOM_STATE: Reproducibility seed (default: 42)
MONGO_URI: MongoDB connection string (the .env file above uses the name MONGODB_URI)
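
A minimal sketch of how these settings might be organized, assuming they can be overridden through environment variables. That override behavior, the `Config` dataclass, and the `load_config` helper are assumptions for illustration; the defaults are the ones listed above, and the connection string is read from the `MONGODB_URI` name used in the `.env` example.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    max_features: int = 5000    # TF-IDF vocabulary size
    test_size: float = 0.2      # fraction of data held out for testing
    random_state: int = 42      # seed for reproducible splits
    mongo_uri: str = "mongodb://localhost:27017/"

def load_config(env=None):
    """Build a Config from a mapping of environment variables."""
    env = os.environ if env is None else env
    config = Config(
        max_features=int(env.get("MAX_FEATURES", 5000)),
        test_size=float(env.get("TEST_SIZE", 0.2)),
        random_state=int(env.get("RANDOM_STATE", 42)),
        mongo_uri=env.get("MONGODB_URI", "mongodb://localhost:27017/"),
    )
    if not 0.0 < config.test_size < 1.0:
        raise ValueError("TEST_SIZE must be strictly between 0 and 1")
    if config.max_features <= 0:
        raise ValueError("MAX_FEATURES must be positive")
    return config

# An explicit mapping is passed here so the example does not depend on the shell.
print(load_config({"MAX_FEATURES": "1000"}))
```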

🐳 Docker Deployment

Build and Run with Docker Compose

docker-compose up --build

Dockerfile

FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "backend/app.py"]

🧪 Running Tests

# Run all tests
pytest tests/
# Run specific test
pytest tests/test_classifier.py -v

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

🙏 Acknowledgments

SMS Spam Collection Dataset
Scikit-learn documentation
MongoDB University

Made with ❤️ using Python, MongoDB, and Machine Learning