Introduction
The Spam Classifier is a machine learning application that automatically classifies SMS messages and emails as either Spam (unwanted messages) or Ham (legitimate messages) using the Naive Bayes algorithm. The project demonstrates a practical implementation of text classification with a user-friendly interface, persistent database storage, and real-time prediction.
Why Naive Bayes?
- Fast and efficient for text classification
- Works well with small datasets
- Handles high-dimensional data effectively
- Probability-based approach ideal for spam detection
Features
Core Features
- Text Classification: Classify messages as spam or ham with high accuracy
- Real-time Prediction: Instant results for new messages
- Training Capability: Retrain the model with new data
- MongoDB Integration: Persistent storage for messages and predictions
- Confidence Score: Shows probability percentage for each prediction
Technical Features
- TF-IDF Vectorization: Convert text to numerical features
- Multinomial Naive Bayes: Optimized for text classification
- RESTful API: Easy integration with other applications
- Data Visualization: Performance metrics and statistics
- Export Functionality: Download classification results
Project Structure
spam-classifier/
│
├── backend/
│   ├── models/
│   │   ├── spam_classifier.py    # Naive Bayes classifier implementation
│   │   └── model_utils.py        # Model training and saving utilities
│   │
│   ├── database/
│   │   ├── mongo_connection.py   # MongoDB connection handler
│   │   └── message_model.py      # Database schema and operations
│   │
│   ├── api/
│   │   ├── routes.py             # API endpoints
│   │   └── validation.py         # Input validation
│   │
│   ├── app.py                    # Main Flask application
│   └── config.py                 # Configuration settings
│
├── frontend/
│   ├── static/
│   │   ├── css/
│   │   │   └── style.css         # Custom styles
│   │   └── js/
│   │       └── main.js           # Frontend JavaScript
│   │
│   ├── templates/
│   │   ├── index.html            # Main interface
│   │   ├── dashboard.html        # Analytics dashboard
│   │   └── history.html          # Message history
│   │
│   └── components/
│       └── charts.js             # Data visualization
│
├── data/
│   ├── raw/                      # Original dataset
│   └── processed/                # Cleaned dataset
│
├── notebooks/
│   └── model_development.ipynb   # Jupyter notebook for experimentation
│
├── tests/
│   ├── test_classifier.py        # Unit tests
│   └── test_api.py               # API tests
│
├── utils/
│   ├── text_preprocessing.py     # Text cleaning functions
│   └── evaluation.py             # Model evaluation metrics
│
├── requirements.txt              # Python dependencies
├── .env                          # Environment variables
├── .gitignore                    # Git ignore file
├── README.md                     # Project documentation
└── docker-compose.yml            # Docker configuration
Installation & Setup
Prerequisites
- Python 3.8+
- MongoDB 4.4+
- pip (Python package manager)
Step 1: Clone the Repository
git clone https://github.com/yourusername/spam-classifier.git
cd spam-classifier
Step 2: Create Virtual Environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
Step 3: Install Dependencies
pip install -r requirements.txt
Step 4: Configure MongoDB
Create a .env file in the root directory:
MONGODB_URI=mongodb://localhost:27017/
DATABASE_NAME=spam_classifier
COLLECTION_NAME=messages
SECRET_KEY=your-secret-key-here
Step 5: Start MongoDB
# Start MongoDB service
sudo systemctl start mongod  # Linux
# OR
mongod  # Direct execution
Step 6: Run the Application
python backend/app.py
Visit http://localhost:5000 in your browser.
Dataset
The model can be trained on the SMS Spam Collection Dataset or any custom dataset with the following format:
- CSV format: label,message
- Label options: 'spam' or 'ham'
- Sample data:
ham,Hello how are you?
spam,CONGRATULATIONS! You've won a prize!
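As a rough illustration of reading the `label,message` format, the sketch below parses rows using only the standard library. The function name `load_messages` and the choice to split on the first comma only are assumptions made for this example, not part of the project's code. It does not handle quoted CSV fields.

```python
import io

VALID_LABELS = {"spam", "ham"}

def load_messages(lines):
    """Parse `label,message` rows into (label, message) tuples.

    Splits on the first comma only, so commas inside an unquoted
    message are preserved.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        label, sep, message = line.partition(",")
        label = label.strip().lower()
        if not sep or label not in VALID_LABELS:
            continue  # skips the header row and malformed lines
        rows.append((label, message.strip()))
    return rows

# In-memory stand-in for a CSV file, using the sample rows from this README.
sample = io.StringIO(
    "label,message\n"
    "ham,Hello how are you?\n"
    "spam,CONGRATULATIONS! You've won a prize!\n"
)
dataset = load_messages(sample)
print(dataset)
```

Passing an open file object instead of `sample` works the same way, since the function only iterates over lines.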
How It Works
1. Text Preprocessing
- Lowercase conversion
- Remove punctuation and special characters
- Remove stop words
- Tokenization
- Stemming/Lemmatization
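The preprocessing steps above can be sketched with the standard library alone. This is a minimal illustration, not the contents of `utils/text_preprocessing.py`: the stop-word list is deliberately tiny, `naive_stem` is a crude hypothetical stand-in for a real stemmer or lemmatizer, and tokenization happens before stop-word removal because the filter works on individual tokens.

```python
import re

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "to", "and", "of", "in", "we"}

def naive_stem(token):
    """Crude suffix stripper used as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = text.lower()                        # lowercase conversion
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation and special characters
    tokens = text.split()                      # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [naive_stem(t) for t in tokens]     # stemming

tokens = preprocess("Hey, are we still meeting today?")
print(tokens)  # ['hey', 'still', 'meet', 'today']
```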
2. Feature Extraction
- TF-IDF Vectorization: Converts text to numerical features
- N-gram features: Captures word sequences
- Vocabulary size: Configurable (default: 5000 features)
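To make the feature-extraction step concrete, here is a from-scratch sketch of TF-IDF over unigrams and bigrams with a capped vocabulary. It is an illustration only; the project itself may rely on a library such as scikit-learn's `TfidfVectorizer`. The function names and the bigram limit (`n_max=2`) are assumptions. The IDF uses the smoothed form ln((1 + N) / (1 + df)) + 1, and each vector is L2-normalized.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """Return all 1..n_max-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def fit_tfidf(docs, max_features=5000, n_max=2):
    """Build a vocabulary capped at max_features and smoothed IDF weights."""
    totals = Counter()    # corpus-wide term counts, used to rank the vocabulary
    doc_freq = Counter()  # number of documents containing each term
    for doc in docs:
        counts = Counter(ngrams(doc, n_max))
        totals.update(counts)
        doc_freq.update(counts.keys())
    vocab = [term for term, _ in totals.most_common(max_features)]
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf

def transform(doc, vocab, idf, n_max=2):
    """Map one tokenized document to an L2-normalized TF-IDF vector."""
    counts = Counter(ngrams(doc, n_max))
    vec = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Toy tokenized corpus, for illustration only.
docs = [["win", "prize", "now"], ["see", "you", "now"]]
vocab, idf = fit_tfidf(docs)
vec = transform(docs[0], vocab, idf)
```

Terms that appear in every document (here `now`) receive the lowest IDF weight, which is why TF-IDF downplays words that carry little signal.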
3. Classification Algorithm
The Naive Bayes classifier uses Bayes' theorem:
P(Spam|Message) = P(Message|Spam) * P(Spam) / P(Message)
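The formula can be worked through on toy data. The sketch below is a minimal multinomial Naive Bayes over raw word counts with Laplace smoothing (`alpha=1.0`), written from scratch for illustration. It is not the project's `spam_classifier.py`, and the function names and training messages are invented for the example. Dividing by the sum of the per-class scores plays the role of P(Message).

```python
import math
from collections import Counter

def train_nb(samples):
    """samples: list of (label, tokens). Returns priors, per-class word counts, vocab."""
    class_docs = Counter(label for label, _ in samples)
    word_counts = {label: Counter() for label in class_docs}
    for label, tokens in samples:
        word_counts[label].update(tokens)
    vocab = {t for counts in word_counts.values() for t in counts}
    priors = {c: n / len(samples) for c, n in class_docs.items()}
    return priors, word_counts, vocab

def predict_proba(tokens, priors, word_counts, vocab, alpha=1.0):
    """Return P(class | tokens) for each class via Bayes' theorem."""
    log_scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        score = math.log(prior)  # log P(class)
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                score += math.log((word_counts[c][t] + alpha) / total)
        log_scores[c] = score
    # Normalize in a numerically stable way; the denominator stands in for P(Message).
    peak = max(log_scores.values())
    exp_scores = {c: math.exp(s - peak) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}

# Hypothetical toy training set.
train = [
    ("spam", ["win", "prize", "now"]),
    ("spam", ["free", "prize"]),
    ("ham", ["see", "you", "now"]),
    ("ham", ["meeting", "today"]),
]
priors, word_counts, vocab = train_nb(train)
probs = predict_proba(["free", "prize"], priors, word_counts, vocab)
print(probs)
```

With this toy data, P(Spam | "free prize") works out to 6/7 (about 0.857): both classes have a prior of 0.5, and the smoothed word likelihoods are (2/13)(3/13) for spam against (1/13)(1/13) for ham.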
4. Prediction Output
- Classification: Spam or Ham
- Confidence Score: Probability percentage
- Timestamp: When the message was analyzed
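One way to assemble that output from a spam probability is sketched below. The `label` and `confidence` keys follow the example response shown in this README; the `timestamp` key name, the UTC ISO 8601 format, and the 0.5 decision threshold are assumptions for illustration.

```python
from datetime import datetime, timezone

def build_prediction(spam_probability, threshold=0.5):
    """Turn a spam probability into a label / confidence / timestamp record."""
    is_spam = spam_probability >= threshold
    # Confidence is the probability of whichever class was chosen.
    confidence = spam_probability if is_spam else 1.0 - spam_probability
    return {
        "label": "spam" if is_spam else "ham",
        "confidence": round(confidence, 4),
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed field name
    }

result = build_prediction(0.98)
print(result["label"], f"{result['confidence'] * 100:.1f}%")
```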
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/predict | POST | Classify a single message |
| /api/train | POST | Train model with new data |
| /api/history | GET | Get classification history |
| /api/stats | GET | Get model statistics |
| /api/delete/<id> | DELETE | Delete a message record |
Usage Examples
Python Client
import requests
# Predict single message
url = "http://localhost:5000/api/predict"
data = {"message": "Congratulations! You've won $1000!"}
response = requests.post(url, json=data)
print(response.json())
# Output: {'label': 'spam', 'confidence': 0.98}
cURL Command
curl -X POST http://localhost:5000/api/predict \
-H "Content-Type: application/json" \
-d '{"message": "Hey, are we still meeting today?"}'
Performance Metrics
The model achieves:
- Accuracy: 98.5%
- Precision: 99.2% (for spam)
- Recall: 97.8% (for spam)
- F1-Score: 98.5%
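For reference, these four metrics can be derived from confusion-matrix counts with spam as the positive class. The sketch below shows the arithmetic; the function name is an assumption, and the counts are hypothetical and unrelated to the figures reported above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 with spam as the positive class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical confusion-matrix counts, for illustration only.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=890)
print(metrics)
```

F1 is the harmonic mean of precision and recall, so the reported 98.5% is consistent with the reported precision and recall: 2 × 0.992 × 0.978 / (0.992 + 0.978) ≈ 0.985.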
Configuration Options
Edit config.py to customize:
- MAX_FEATURES: Number of features for TF-IDF (default: 5000)
- TEST_SIZE: Train-test split ratio (default: 0.2)
- RANDOM_STATE: Reproducibility seed (default: 42)
- MONGO_URI: MongoDB connection string
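A `config.py` along these lines might look like the sketch below. The option names and defaults come from the list above; reading overrides from environment variables, and filling `MONGO_URI` from the `MONGODB_URI` variable in `.env`, are assumptions for illustration rather than the project's actual file.

```python
import os

# Defaults mirror the documented values; each can be overridden via the environment.
MAX_FEATURES = int(os.environ.get("MAX_FEATURES", "5000"))   # TF-IDF vocabulary size
TEST_SIZE = float(os.environ.get("TEST_SIZE", "0.2"))        # fraction held out for testing
RANDOM_STATE = int(os.environ.get("RANDOM_STATE", "42"))     # reproducibility seed

# Assumption: MONGO_URI is populated from the MONGODB_URI variable defined in .env.
MONGO_URI = os.environ.get("MONGODB_URI", "mongodb://localhost:27017/")
```

Note that `os.environ` does not read a `.env` file on its own; the values only appear here if the shell, Docker Compose, or a loader has exported them into the process environment first.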
Docker Deployment
Build and Run with Docker Compose
docker-compose up --build
Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "backend/app.py"]
Running Tests
# Run all tests
pytest tests/
# Run specific test
pytest tests/test_classifier.py -v
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Acknowledgments
- SMS Spam Collection Dataset
- Scikit-learn documentation
- MongoDB University
Made with ❤️ using Python, MongoDB, and Machine Learning