📌 Introduction
The Spam Classifier is a machine learning application that automatically classifies SMS messages and emails as either Spam (unwanted messages) or Ham (legitimate messages) using the Naive Bayes algorithm. This project demonstrates a practical implementation of text classification with a user-friendly interface, persistent MongoDB storage, and real-time prediction.
Why Naive Bayes?
- Fast and efficient for text classification
- Works well with small datasets
- Handles high-dimensional data effectively
- Probability-based approach ideal for spam detection
✨ Features
Core Features
- Text Classification: Classify messages as spam or ham with high accuracy
- Real-time Prediction: Instant results for new messages
- Training Capability: Retrain the model with new data
- MongoDB Integration: Persistent storage for messages and predictions
- Confidence Score: Shows probability percentage for each prediction
Technical Features
- TF-IDF Vectorization: Convert text to numerical features
- Multinomial Naive Bayes: Optimized for text classification
- RESTful API: Easy integration with other applications
- Data Visualization: Performance metrics and statistics
- Export Functionality: Download classification results
📁 Project Structure
spam-classifier/
│
├── 📂 backend/
│   ├── 📂 models/
│   │   ├── spam_classifier.py       # Naive Bayes classifier implementation
│   │   └── model_utils.py           # Model training and saving utilities
│   │
│   ├── 📂 database/
│   │   ├── mongo_connection.py      # MongoDB connection handler
│   │   └── message_model.py         # Database schema and operations
│   │
│   ├── 📂 api/
│   │   ├── routes.py                # API endpoints
│   │   └── validation.py            # Input validation
│   │
│   ├── app.py                       # Main Flask application
│   └── config.py                    # Configuration settings
│
├── 📂 frontend/
│   ├── 📂 static/
│   │   ├── css/
│   │   │   └── style.css            # Custom styles
│   │   └── js/
│   │       └── main.js              # Frontend JavaScript
│   │
│   ├── 📂 templates/
│   │   ├── index.html               # Main interface
│   │   ├── dashboard.html           # Analytics dashboard
│   │   └── history.html             # Message history
│   │
│   └── 📂 components/
│       └── charts.js                # Data visualization
│
├── 📂 data/
│   ├── raw/                         # Original dataset
│   └── processed/                   # Cleaned dataset
│
├── 📂 notebooks/
│   └── model_development.ipynb      # Jupyter notebook for experimentation
│
├── 📂 tests/
│   ├── test_classifier.py           # Unit tests
│   └── test_api.py                  # API tests
│
├── 📂 utils/
│   ├── text_preprocessing.py        # Text cleaning functions
│   └── evaluation.py                # Model evaluation metrics
│
├── requirements.txt                 # Python dependencies
├── .env                             # Environment variables
├── .gitignore                       # Git ignore file
├── README.md                        # Project documentation
└── docker-compose.yml               # Docker configuration
🛠️ Installation & Setup
Prerequisites
- Python 3.8+
- MongoDB 4.4+
- pip (Python package manager)
Step 1: Clone the Repository
git clone https://github.com/yourusername/spam-classifier.git
cd spam-classifier
Step 2: Create Virtual Environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Step 3: Install Dependencies
pip install -r requirements.txt
Step 4: Configure MongoDB
Create a .env file in the root directory:
MONGODB_URI=mongodb://localhost:27017/
DATABASE_NAME=spam_classifier
COLLECTION_NAME=messages
SECRET_KEY=your-secret-key-here
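As a rough illustration of how the values above could be read at startup, here is a minimal standard-library sketch. The helper names `load_env_file` and `get_setting` are hypothetical and not taken from this project; a real application would more likely rely on a dedicated dotenv package.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict.

    Blank lines, comment lines and lines without '=' are skipped.
    """
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_setting(name, file_values, default=None):
    """Prefer a real environment variable, then the .env file, then the default."""
    return os.environ.get(name, file_values.get(name, default))
```

Typical use would be `settings = load_env_file()` followed by `get_setting("MONGODB_URI", settings)`.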
Step 5: Start MongoDB
# Start MongoDB service
sudo systemctl start mongod # Linux
# OR
mongod # Direct execution
Step 6: Run the Application
python backend/app.py
Visit http://localhost:5000 in your browser.
📊 Dataset
The model can be trained on the SMS Spam Collection Dataset or any custom dataset with the following format:
- CSV format: label,message
- Label options: 'spam' or 'ham'
- Sample data:
ham,Hello how are you?
spam,CONGRATULATIONS! You've won a prize!
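To make the expected format concrete, the following sketch reads `label,message` rows with Python's `csv` module and rejects labels other than 'spam' or 'ham'. The function name `load_dataset` and the handling of an optional header row are assumptions made for illustration, not the project's actual loader.

```python
import csv
import io

VALID_LABELS = {"spam", "ham"}


def load_dataset(csv_file):
    """Read (label, message) pairs, skipping an optional header row."""
    rows = []
    for row_no, row in enumerate(csv.reader(csv_file), start=1):
        if not row:
            continue  # blank line
        if len(row) < 2:
            raise ValueError(f"row {row_no}: expected 'label,message'")
        label = row[0].strip().lower()
        # Re-join in case an unquoted message itself contained commas.
        message = ",".join(row[1:]).strip()
        if row_no == 1 and label == "label":
            continue  # header row
        if label not in VALID_LABELS:
            raise ValueError(f"row {row_no}: unknown label {row[0]!r}")
        rows.append((label, message))
    return rows


sample = io.StringIO(
    "label,message\n"
    "ham,Hello how are you?\n"
    "spam,CONGRATULATIONS! You've won a prize!\n"
)
dataset = load_dataset(sample)
print(dataset)
```

Passing an open file handle instead of the `io.StringIO` sample works the same way.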
🧠 How It Works
1. Text Preprocessing
- Lowercase conversion
- Remove punctuation and special characters
- Remove stop words
- Tokenization
- Stemming/Lemmatization
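The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions: the stop-word list is a tiny made-up sample, `crude_stem` is a simplistic stand-in for a real stemmer or lemmatizer, and tokenization runs before stop-word removal because that is the simplest order to implement.

```python
import re

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "to", "and", "of", "in", "we"}


def crude_stem(token):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, drop punctuation, tokenize, remove stop words, then stem."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation and special characters
    tokens = text.split()  # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


print(preprocess("Hey, are we still meeting today?"))
# ['hey', 'still', 'meet', 'today']
```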
2. Feature Extraction
- TF-IDF Vectorization: Converts text to numerical features
- N-gram features: Capture word sequences
- Vocabulary size: Configurable (default: 5000 features)
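For intuition, here is a from-scratch TF-IDF sketch with unigram and bigram features and a capped vocabulary. It is not the project's vectorizer: the function names are hypothetical, and the weighting shown (smoothed IDF `ln((1 + N) / (1 + df)) + 1` followed by L2 normalization) is one common variant chosen for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all 1..n_max-grams as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def fit_tfidf(docs, max_features=5000, n_max=2):
    """Build a vocabulary (most frequent terms first) and smoothed IDF weights."""
    doc_terms = [ngrams(doc.lower().split(), n_max) for doc in docs]
    term_counts = Counter(t for terms in doc_terms for t in terms)
    vocab = [t for t, _ in term_counts.most_common(max_features)]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(t for terms in doc_terms for t in set(terms))
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf


def transform(doc, vocab, idf, n_max=2):
    """Map one document to an L2-normalized TF-IDF vector over the vocabulary."""
    counts = Counter(ngrams(doc.lower().split(), n_max))
    vec = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


docs = ["win a prize now", "see you now"]
vocab, idf = fit_tfidf(docs)
vector = transform("win now", vocab, idf)
```

Terms that appear in every document (such as "now" in the toy corpus) receive the lowest IDF, so rarer terms like "win" carry more weight.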
3. Classification Algorithm
The Naive Bayes classifier uses Bayes' theorem:
P(Spam|Message) = P(Message|Spam) * P(Spam) / P(Message)
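To show how the formula turns into a prediction, the sketch below implements a small multinomial Naive Bayes over raw word counts with Laplace smoothing. It is an illustration only: the function names, the toy training data, and the use of counts instead of TF-IDF weights are assumptions for the example. The "naive" part is treating words as independent given the class, so P(Message|Spam) becomes a product of per-word probabilities.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """samples: list of (label, tokens). Returns class counts, word counts, vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in samples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_proba(tokens, class_counts, word_counts, vocab, alpha=1.0):
    """Posterior P(class | tokens) with Laplace smoothing, computed in log space."""
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label in class_counts:
        log_p = math.log(class_counts[label] / total_docs)  # log P(class)
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for tok in tokens:
            if tok in vocab:  # ignore words never seen in training
                log_p += math.log((word_counts[label][tok] + alpha) / denom)
        log_scores[label] = log_p
    # Normalizing over classes plays the role of dividing by P(Message).
    max_log = max(log_scores.values())
    exp_scores = {k: math.exp(v - max_log) for k, v in log_scores.items()}
    total = sum(exp_scores.values())
    return {k: v / total for k, v in exp_scores.items()}


# Toy training data, made up for this example.
training = [
    ("spam", ["win", "prize", "now"]),
    ("spam", ["free", "prize"]),
    ("ham", ["see", "you", "now"]),
    ("ham", ["meeting", "today"]),
]
model = train_nb(training)
probs = predict_proba(["free", "prize"], *model)
print(max(probs, key=probs.get), probs)
```

With this toy data the spam posterior for "free prize" works out to 6/7, roughly 0.857: both classes have a prior of 0.5, and the smoothed word likelihoods are (2/13)(3/13) for spam against (1/13)(1/13) for ham.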
4. Prediction Output
- Classification: Spam or Ham
- Confidence Score: Probability percentage
- Timestamp: When the message was analyzed
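A prediction record carrying these three pieces of information might be assembled as follows. The `label` and `confidence` keys mirror the example API response in this README; `message`, `confidence_percent`, `timestamp`, and the function name are hypothetical choices for the sketch.

```python
from datetime import datetime, timezone


def build_prediction_record(message, probabilities):
    """Turn class probabilities into a record with label, confidence and timestamp."""
    label = max(probabilities, key=probabilities.get)
    confidence = probabilities[label]
    return {
        "message": message,
        "label": label,
        "confidence": round(confidence, 4),
        "confidence_percent": f"{confidence * 100:.1f}%",
        # UTC timestamp in ISO 8601 form, recorded at analysis time.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


record = build_prediction_record(
    "Congratulations! You've won $1000!",
    {"spam": 0.98, "ham": 0.02},
)
```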
📡 API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/predict | POST | Classify a single message |
| /api/train | POST | Train model with new data |
| /api/history | GET | Get classification history |
| /api/stats | GET | Get model statistics |
| /api/delete/<id> | DELETE | Delete a message record |
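Before a request body reaches the classifier, an endpoint such as /api/predict would typically validate it. The following is a framework-independent sketch of such a check; the function name, the error messages, and the `MAX_MESSAGE_LENGTH` limit are assumptions and do not describe the project's actual validation module.

```python
MAX_MESSAGE_LENGTH = 5000  # hypothetical limit, not taken from the project


def validate_predict_payload(payload):
    """Return (cleaned_message, None) on success or (None, error_text) on failure."""
    if not isinstance(payload, dict):
        return None, "request body must be a JSON object"
    message = payload.get("message")
    if not isinstance(message, str):
        return None, "'message' must be a string"
    message = message.strip()
    if not message:
        return None, "'message' must not be empty"
    if len(message) > MAX_MESSAGE_LENGTH:
        return None, f"'message' must be at most {MAX_MESSAGE_LENGTH} characters"
    return message, None
```

A route handler could call this on the parsed JSON body and respond with HTTP 400 and the error text whenever the second element is not `None`.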
🚀 Usage Examples
Python Client
import requests
# Predict a single message
url = "http://localhost:5000/api/predict"
data = {"message": "Congratulations! You've won $1000!"}
response = requests.post(url, json=data, timeout=10)
response.raise_for_status()  # Raise an exception on HTTP error responses
print(response.json())
# Example output: {'label': 'spam', 'confidence': 0.98}
cURL Command
curl -X POST http://localhost:5000/api/predict \
-H "Content-Type: application/json" \
-d '{"message": "Hey, are we still meeting today?"}'
📈 Performance Metrics
The model achieves:
- Accuracy: 98.5%
- Precision: 99.2% (for spam)
- Recall: 97.8% (for spam)
- F1-Score: 98.5%
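For reference, these four metrics are all derived from confusion-matrix counts, with spam as the positive class. The sketch below shows the arithmetic; the counts in the demo call are invented for illustration and are not this model's results. The last line checks that the F1-Score listed above is consistent with the listed precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 90 spam caught, 10 ham flagged, 30 spam missed, 870 ham passed.
demo = classification_metrics(tp=90, fp=10, fn=30, tn=870)
print(demo)

# Consistency check against the precision and recall listed above.
print(round(f1_from_precision_recall(0.992, 0.978) * 100, 1))  # 98.5
```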
🔧 Configuration Options
Edit config.py to customize:
- MAX_FEATURES: Number of features for TF-IDF (default: 5000)
- TEST_SIZE: Fraction of the data held out for testing (default: 0.2)
- RANDOM_STATE: Reproducibility seed (default: 42)
- MONGO_URI: MongoDB connection string
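As an illustration of how these options could fit together, here is a hypothetical settings object plus a seeded train/test split that uses TEST_SIZE and RANDOM_STATE. The `Config` class, the `train_test_split_rows` helper, and the choice to read MONGO_URI from a `MONGODB_URI` environment variable are assumptions for the sketch, not the contents of the project's config.py.

```python
import os
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Config:
    MAX_FEATURES: int = 5000   # TF-IDF vocabulary cap
    TEST_SIZE: float = 0.2     # fraction of rows held out for testing
    RANDOM_STATE: int = 42     # seed for reproducible shuffling
    # Assumption: fall back to a local MongoDB when the variable is unset.
    MONGO_URI: str = field(
        default_factory=lambda: os.environ.get(
            "MONGODB_URI", "mongodb://localhost:27017/"
        )
    )


def train_test_split_rows(rows, config):
    """Shuffle with a fixed seed, then hold out the TEST_SIZE fraction."""
    shuffled = list(rows)
    random.Random(config.RANDOM_STATE).shuffle(shuffled)
    n_test = int(round(len(shuffled) * config.TEST_SIZE))
    return shuffled[n_test:], shuffled[:n_test]


config = Config()
train_rows, test_rows = train_test_split_rows(range(10), config)
```

Because the shuffle is seeded, repeated runs with the same RANDOM_STATE produce the same split, which keeps evaluation results comparable between training runs.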
🐳 Docker Deployment
Build and Run with Docker Compose
docker-compose up --build
Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "backend/app.py"]
🧪 Running Tests
# Run all tests
pytest tests/
# Run specific test
pytest tests/test_classifier.py -v
📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
🤝 Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
🙏 Acknowledgments
- SMS Spam Collection Dataset
- Scikit-learn documentation
- MongoDB University
Made with ❤️ using Python, MongoDB, and Machine Learning