INTRODUCTION
This project implements a machine learning-based sentiment analyzer that can classify movie or product reviews into Positive, Negative, or Neutral categories. It uses multiple algorithms including Naive Bayes, Logistic Regression, and LSTM (deep learning) for comparison. The system includes comprehensive text preprocessing, feature extraction, and model evaluation. MongoDB is used to store reviews, predictions, and model performance metrics.
FEATURES
- Multi-class Sentiment Analysis: Classify reviews as Positive, Negative, or Neutral
- Multiple Algorithms: Compare performance of different ML/DL models
- Real-time Analysis: Analyze individual reviews instantly
- Batch Processing: Analyze multiple reviews at once
- Model Training & Evaluation: Train and compare multiple models
- Visualization: Generate sentiment distribution charts and confusion matrices
- Export Functionality: Export results to CSV/JSON/Excel
- API Endpoints: RESTful API for integration with other applications
- Interactive Dashboard: Web-based interface for easy interaction
- Performance Metrics: Track accuracy, precision, recall, and F1-score
PROJECT STRUCTURE
sentiment-analyzer/
│
├── config/
│   ├── mongodb_config.py
│   └── model_config.py
│
├── models/
│   ├── sentiment_classifier.py
│   ├── deep_learning_model.py
│   └── model_comparator.py
│
├── database/
│   ├── db_operations.py
│   └── review_schema.py
│
├── utils/
│   ├── text_preprocessing.py
│   ├── feature_extraction.py
│   └── visualization.py
│
├── api/
│   ├── app.py
│   └── routes.py
│
├── web/
│   ├── templates/
│   │   ├── index.html
│   │   ├── analyze.html
│   │   └── dashboard.html
│   └── static/
│       ├── css/
│       │   └── style.css
│       └── js/
│           └── main.js
│
├── data/
│   ├── movie_reviews.csv
│   └── product_reviews.csv
│
├── requirements.txt
├── .env
└── README.md
COMPLETE CODE
1. requirements.txt
pymongo==4.5.0
scikit-learn==1.3.0
pandas==2.0.3
numpy==1.24.3
nltk==3.8.1
tensorflow==2.13.0
keras==2.13.1
transformers==4.35.0
joblib==1.3.2
python-dotenv==1.0.0
flask==2.3.2
flask-cors==4.0.0
flask-restful==0.3.10
matplotlib==3.7.2
seaborn==0.12.2
plotly==5.17.0
wordcloud==1.9.2
textblob==0.17.1
vaderSentiment==3.3.2
gunicorn==21.2.0
celery==5.3.4
redis==5.0.1
# Imported by utils/text_preprocessing.py
emoji
2. config/mongodb_config.py
import os

from dotenv import load_dotenv

load_dotenv()


class MongoDBConfig:
    # MongoDB connection settings
    MONGODB_URI = os.getenv('MONGODB_URI', 'mongodb://localhost:27017/')
    DATABASE_NAME = os.getenv('DATABASE_NAME', 'sentiment_analyzer_db')

    # Collections
    REVIEWS_COLLECTION = 'reviews'
    PREDICTIONS_COLLECTION = 'predictions'
    MODEL_METRICS_COLLECTION = 'model_metrics'
    TRAINING_DATA_COLLECTION = 'training_data'

    # Connection settings
    MAX_POOL_SIZE = 100
    MIN_POOL_SIZE = 10
    MAX_IDLE_TIME_MS = 10000
    RETRY_WRITES = True
3. config/model_config.py
class ModelConfig:
    # Model types
    AVAILABLE_MODELS = ['naive_bayes', 'logistic_regression', 'svm', 'lstm', 'bert']

    # Default model
    DEFAULT_MODEL = 'logistic_regression'

    # Text preprocessing settings
    MAX_FEATURES = 5000
    MAX_SEQUENCE_LENGTH = 200
    EMBEDDING_DIM = 100

    # Deep learning settings
    BATCH_SIZE = 32
    EPOCHS = 10
    VALIDATION_SPLIT = 0.2

    # Sentiment labels (class index -> display name)
    SENTIMENT_LABELS = {
        0: 'Negative',
        1: 'Neutral',
        2: 'Positive'
    }

    # Label mapping (lowercase label -> class index)
    LABEL_MAPPING = {
        'negative': 0,
        'neutral': 1,
        'positive': 2
    }
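The two dictionaries in ModelConfig are inverses of each other: LABEL_MAPPING encodes string labels as class indices for training, and SENTIMENT_LABELS decodes predicted indices back to display names. The following is a minimal sketch of that round trip. The dictionaries are copied from ModelConfig, while the helper names `encode_labels` and `decode_labels` are hypothetical and do not appear in the project.

```python
# Copies of the dictionaries defined in ModelConfig.
SENTIMENT_LABELS = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}
LABEL_MAPPING = {'negative': 0, 'neutral': 1, 'positive': 2}


def encode_labels(labels):
    """Map string labels (any casing) to class indices. Hypothetical helper."""
    return [LABEL_MAPPING[label.lower()] for label in labels]


def decode_labels(indices):
    """Map class indices back to display names. Hypothetical helper."""
    return [SENTIMENT_LABELS[int(idx)] for idx in indices]


encoded = encode_labels(['Positive', 'NEGATIVE', 'neutral'])
decoded = decode_labels(encoded)
print(encoded)
print(decoded)
```

Lowercasing before the lookup means labels stored as 'Positive' or 'POSITIVE' encode to the same index, which is how the classifiers below convert their training labels.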
4. database/review_schema.py
from datetime import datetime

from bson import ObjectId


class ReviewSchema:
    """Schema for review documents in MongoDB"""

    @staticmethod
    def get_review_schema(review_text, source='user', metadata=None):
        """Create a review document following the schema"""
        return {
            'review_text': review_text,
            'source': source,
            'sentiment_label': None,
            'sentiment_score': None,
            'confidence': None,
            'created_at': datetime.utcnow(),
            'processed_at': None,
            'review_length': len(review_text),
            'word_count': len(review_text.split()),
            'metadata': metadata or {
                'has_emojis': False,
                'has_uppercase': any(c.isupper() for c in review_text),
                'has_punctuation': any(c in '!?.' for c in review_text)
            }
        }

    @staticmethod
    def get_prediction_schema(review_id, review_text, prediction, confidence,
                              actual_label=None, model_used='default'):
        """Create a prediction document"""
        return {
            'review_id': ObjectId(review_id) if isinstance(review_id, str) else review_id,
            'review_text': review_text,
            'prediction': prediction,
            'confidence': confidence,
            'actual_label': actual_label,
            # Compare against None explicitly so a label of 0 (Negative) still counts
            'is_correct': prediction == actual_label if actual_label is not None else None,
            'model_used': model_used,
            'created_at': datetime.utcnow()
        }

    @staticmethod
    def get_metrics_schema(metrics):
        """Create model metrics document"""
        return {
            'model_name': metrics['model_name'],
            'accuracy': metrics['accuracy'],
            'precision': metrics['precision'],
            'recall': metrics['recall'],
            'f1_score': metrics['f1_score'],
            'confusion_matrix': metrics['confusion_matrix'],
            'classification_report': metrics.get('classification_report', {}),
            'training_time': metrics.get('training_time', 0),
            'created_at': datetime.utcnow()
        }
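One detail worth checking in `get_prediction_schema` is how `is_correct` treats missing labels. If labels are ever passed as class indices rather than strings, the index 0 (Negative) is falsy in Python, so a plain truthiness test would treat it as "no label supplied". The sketch below contrasts the two checks; both function names are hypothetical and exist only for illustration.

```python
def is_correct_truthy(prediction, actual_label=None):
    # Truthiness check: 0 (the Negative index) is treated like a missing label.
    return prediction == actual_label if actual_label else None


def is_correct_explicit(prediction, actual_label=None):
    # Explicit None check: only a genuinely missing label yields None.
    return prediction == actual_label if actual_label is not None else None


print(is_correct_truthy(0, 0))
print(is_correct_explicit(0, 0))
print(is_correct_explicit('Positive', 'Negative'))
print(is_correct_explicit('Positive'))
```

With string labels such as 'Negative' both versions behave the same; the difference only appears for falsy label values such as 0 or an empty string.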
5. database/db_operations.py
from bson import ObjectId  # needed by save_prediction and update_review_label
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure
from config.mongodb_config import MongoDBConfig
from database.review_schema import ReviewSchema
import logging
from datetime import datetime, timedelta
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class MongoDBOperations:
    def __init__(self):
        self.config = MongoDBConfig()
        self.client = None
        self.db = None
        self.connect()

    def connect(self):
        """Establish connection to MongoDB"""
        try:
            self.client = MongoClient(
                self.config.MONGODB_URI,
                maxPoolSize=self.config.MAX_POOL_SIZE,
                minPoolSize=self.config.MIN_POOL_SIZE,
                maxIdleTimeMS=self.config.MAX_IDLE_TIME_MS,
                retryWrites=self.config.RETRY_WRITES
            )
            self.db = self.client[self.config.DATABASE_NAME]
            # Test connection
            self.client.admin.command('ping')
            logger.info("Successfully connected to MongoDB")
            # Create indexes
            self.create_indexes()
        except ConnectionFailure as e:
            logger.error(f"Failed to connect to MongoDB: {e}")
            raise

    def create_indexes(self):
        """Create necessary indexes for better query performance"""
        try:
            # Reviews collection indexes
            self.db[self.config.REVIEWS_COLLECTION].create_index('created_at')
            self.db[self.config.REVIEWS_COLLECTION].create_index('sentiment_label')
            self.db[self.config.REVIEWS_COLLECTION].create_index('source')
            # Predictions collection indexes
            self.db[self.config.PREDICTIONS_COLLECTION].create_index('created_at')
            self.db[self.config.PREDICTIONS_COLLECTION].create_index('prediction')
            self.db[self.config.PREDICTIONS_COLLECTION].create_index('model_used')
            # Model metrics collection indexes
            self.db[self.config.MODEL_METRICS_COLLECTION].create_index('model_name')
            self.db[self.config.MODEL_METRICS_COLLECTION].create_index('created_at')
            logger.info("Database indexes created successfully")
        except Exception as e:
            logger.error(f"Error creating indexes: {e}")
    def insert_review(self, review_text, source='user', metadata=None):
        """Insert a single review into database"""
        try:
            review_doc = ReviewSchema.get_review_schema(review_text, source, metadata)
            result = self.db[self.config.REVIEWS_COLLECTION].insert_one(review_doc)
            # Update processed_at
            self.db[self.config.REVIEWS_COLLECTION].update_one(
                {'_id': result.inserted_id},
                {'$set': {'processed_at': datetime.utcnow()}}
            )
            logger.info(f"Review inserted with ID: {result.inserted_id}")
            return result.inserted_id
        except Exception as e:
            logger.error(f"Error inserting review: {e}")
            return None

    def insert_many_reviews(self, reviews_list, source='bulk_upload'):
        """Insert multiple reviews"""
        try:
            review_docs = [
                ReviewSchema.get_review_schema(review, source)
                for review in reviews_list
            ]
            result = self.db[self.config.REVIEWS_COLLECTION].insert_many(review_docs)
            logger.info(f"Inserted {len(result.inserted_ids)} reviews")
            return result.inserted_ids
        except Exception as e:
            logger.error(f"Error inserting multiple reviews: {e}")
            return None

    def save_prediction(self, review_id, review_text, prediction, confidence,
                        actual_label=None, model_used='default'):
        """Save a prediction result"""
        try:
            prediction_doc = ReviewSchema.get_prediction_schema(
                review_id, review_text, prediction, confidence,
                actual_label, model_used
            )
            result = self.db[self.config.PREDICTIONS_COLLECTION].insert_one(prediction_doc)
            # Update the review with prediction
            self.db[self.config.REVIEWS_COLLECTION].update_one(
                {'_id': ObjectId(review_id) if isinstance(review_id, str) else review_id},
                {
                    '$set': {
                        'sentiment_label': prediction,
                        'sentiment_score': confidence,
                        'confidence': confidence
                    }
                }
            )
            return result.inserted_id
        except Exception as e:
            logger.error(f"Error saving prediction: {e}")
            return None

    def update_review_label(self, review_id, label, confidence=None):
        """Update review with actual sentiment label"""
        try:
            update_data = {'sentiment_label': label}
            # Explicit None check so a confidence of 0.0 is still stored
            if confidence is not None:
                update_data['confidence'] = confidence
            result = self.db[self.config.REVIEWS_COLLECTION].update_one(
                {'_id': ObjectId(review_id) if isinstance(review_id, str) else review_id},
                {'$set': update_data}
            )
            return result.modified_count > 0
        except Exception as e:
            logger.error(f"Error updating review label: {e}")
            return False
    def get_training_data(self, limit=None, labeled_only=True):
        """Retrieve training data from database"""
        try:
            query = {'sentiment_label': {'$ne': None}} if labeled_only else {}
            cursor = self.db[self.config.REVIEWS_COLLECTION].find(query)
            if limit:
                cursor = cursor.limit(limit)
            reviews = []
            labels = []
            for doc in cursor:
                reviews.append(doc['review_text'])
                labels.append(doc['sentiment_label'])
            return reviews, labels
        except Exception as e:
            logger.error(f"Error retrieving training data: {e}")
            return [], []

    def get_reviews_by_sentiment(self, sentiment):
        """Get all reviews with specific sentiment"""
        try:
            cursor = self.db[self.config.REVIEWS_COLLECTION].find(
                {'sentiment_label': sentiment}
            ).sort('created_at', -1)
            return list(cursor)
        except Exception as e:
            logger.error(f"Error retrieving reviews by sentiment: {e}")
            return []

    def get_prediction_statistics(self, days=30):
        """Get statistics about predictions for the last N days"""
        try:
            cutoff_date = datetime.utcnow() - timedelta(days=days)
            pipeline = [
                {
                    '$match': {
                        'created_at': {'$gte': cutoff_date}
                    }
                },
                {
                    '$group': {
                        '_id': {
                            'prediction': '$prediction',
                            'model_used': '$model_used'
                        },
                        'count': {'$sum': 1},
                        'avg_confidence': {'$avg': '$confidence'}
                    }
                },
                {
                    '$sort': {'count': -1}
                }
            ]
            stats = list(self.db[self.config.PREDICTIONS_COLLECTION].aggregate(pipeline))
            return stats
        except Exception as e:
            logger.error(f"Error getting prediction statistics: {e}")
            return []

    def get_model_performance(self, model_name=None):
        """Get model performance metrics"""
        try:
            query = {}
            if model_name:
                query['model_name'] = model_name
            cursor = self.db[self.config.MODEL_METRICS_COLLECTION].find(
                query
            ).sort('created_at', -1).limit(10)
            return list(cursor)
        except Exception as e:
            logger.error(f"Error getting model performance: {e}")
            return []

    def export_reviews_to_dataframe(self):
        """Export all reviews to pandas DataFrame"""
        try:
            cursor = self.db[self.config.REVIEWS_COLLECTION].find()
            df = pd.DataFrame(list(cursor))
            # Convert ObjectId to string
            if '_id' in df.columns:
                df['_id'] = df['_id'].astype(str)
            return df
        except Exception as e:
            logger.error(f"Error exporting to DataFrame: {e}")
            return pd.DataFrame()

    def close_connection(self):
        """Close MongoDB connection"""
        if self.client:
            self.client.close()
            logger.info("MongoDB connection closed")
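To make the shape of the `get_prediction_statistics` result easier to picture, here is a rough pure-Python equivalent of its `$group` and `$sort` stages, run on in-memory dictionaries instead of a MongoDB collection. The function name `summarize_predictions` and the sample documents are hypothetical, and the sketch assumes every document has a numeric `confidence` (it does not reproduce how MongoDB's `$avg` handles missing or non-numeric values).

```python
from collections import defaultdict


def summarize_predictions(docs):
    """Group by (prediction, model_used); report count and mean confidence."""
    groups = defaultdict(list)
    for doc in docs:
        groups[(doc['prediction'], doc['model_used'])].append(doc['confidence'])
    stats = [
        {
            '_id': {'prediction': prediction, 'model_used': model_used},
            'count': len(confidences),
            'avg_confidence': sum(confidences) / len(confidences),
        }
        for (prediction, model_used), confidences in groups.items()
    ]
    # Mirror the $sort stage: largest groups first.
    return sorted(stats, key=lambda s: s['count'], reverse=True)


# Made-up prediction documents, for illustration only.
sample_docs = [
    {'prediction': 'Positive', 'model_used': 'svm', 'confidence': 0.5},
    {'prediction': 'Positive', 'model_used': 'svm', 'confidence': 1.0},
    {'prediction': 'Negative', 'model_used': 'svm', 'confidence': 0.25},
]
stats = summarize_predictions(sample_docs)
for row in stats:
    print(row)
```

Each returned entry carries a compound `_id` with the prediction and the model name, so a dashboard can pivot the list by either field.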
6. utils/text_preprocessing.py
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import emoji
import string

# Download required NLTK data; each resource is checked on its own so that a
# missing tagger is fetched even when the other resources are already present
NLTK_RESOURCES = {
    'punkt': 'tokenizers/punkt',
    'stopwords': 'corpora/stopwords',
    'wordnet': 'corpora/wordnet',
    'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger',
}
for package, resource_path in NLTK_RESOURCES.items():
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package)
class TextPreprocessor:
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))

    def clean_text(self, text):
        """Clean and preprocess text"""
        if not isinstance(text, str):
            text = str(text)
        # Convert to lowercase
        text = text.lower()
        # Remove HTML tags
        text = re.sub(r'<.*?>', '', text)
        # Remove URLs
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
        # Remove email addresses
        text = re.sub(r'\S+@\S+', '', text)
        # Remove numbers
        text = re.sub(r'\d+', '', text)
        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    def convert_emojis(self, text):
        """Convert emojis to text representation"""
        return emoji.demojize(text)

    def tokenize(self, text):
        """Tokenize text into words"""
        return word_tokenize(text)

    def remove_stopwords(self, tokens):
        """Remove stopwords from tokens"""
        return [token for token in tokens if token not in self.stop_words]

    def lemmatize(self, tokens):
        """Apply lemmatization to tokens"""
        return [self.lemmatizer.lemmatize(token) for token in tokens]

    def get_pos_tags(self, tokens):
        """Get part-of-speech tags"""
        return nltk.pos_tag(tokens)

    def extract_sentiment_features(self, text):
        """Extract features relevant for sentiment analysis"""
        features = {}
        # Count of exclamation marks
        features['exclamation_count'] = text.count('!')
        # Count of question marks
        features['question_count'] = text.count('?')
        # Count of positive emojis
        positive_emojis = ['๐', '๐', '๐', '๐', '❤️', '๐', '๐', '๐']
        features['positive_emoji_count'] = sum(text.count(symbol) for symbol in positive_emojis)
        # Count of negative emojis
        negative_emojis = ['๐', '๐ ', '๐', '๐ข', '๐ญ', '๐ค', '๐ก']
        features['negative_emoji_count'] = sum(text.count(symbol) for symbol in negative_emojis)
        # Check for all caps words (shouting)
        words = text.split()
        features['all_caps_count'] = sum(1 for word in words if word.isupper() and len(word) > 1)
        # Word count
        features['word_count'] = len(words)
        # Character count
        features['char_count'] = len(text)
        # Average word length
        features['avg_word_length'] = (
            features['char_count'] / features['word_count']
            if features['word_count'] > 0 else 0
        )
        return features
    def preprocess(self, text, advanced=True):
        """Complete preprocessing pipeline.

        Returns the cleaned string, or a (cleaned string, features) tuple
        when advanced=True.
        """
        # Convert emojis (the raw text is kept for feature extraction below)
        converted = self.convert_emojis(text)
        # Clean text
        cleaned = self.clean_text(converted)
        # Tokenize
        tokens = self.tokenize(cleaned)
        # Remove stopwords
        tokens = self.remove_stopwords(tokens)
        # Lemmatize
        tokens = self.lemmatize(tokens)
        # Extract additional features if advanced preprocessing is enabled
        if advanced:
            # Use the original text: after demojize() the emoji counts would always be 0
            sentiment_features = self.extract_sentiment_features(text)
            return ' '.join(tokens), sentiment_features
        return ' '.join(tokens)

    def preprocess_batch(self, texts, advanced=True):
        """Preprocess a batch of texts"""
        results = []
        for text in texts:
            results.append(self.preprocess(text, advanced))
        return results
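The order of the cleaning steps in `clean_text` matters: HTML tags, URLs and e-mail addresses have to be removed before punctuation is stripped, otherwise their fragments survive as ordinary words. The following standalone sketch reproduces those steps with the standard library only, so it can be run without NLTK or the `emoji` package. The function name `clean_text_sketch` and the sample review are hypothetical, and the URL pattern is slightly simplified compared with the method above.

```python
import re
import string


def clean_text_sketch(text):
    """Standalone approximation of TextPreprocessor.clean_text."""
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)             # HTML tags
    text = re.sub(r'http\S+|www\S+', '', text)    # URLs
    text = re.sub(r'\S+@\S+', '', text)           # e-mail addresses
    text = re.sub(r'\d+', '', text)               # digits
    text = text.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', text).strip()      # collapse whitespace


sample = "<p>LOVED it!!! 10/10, see https://example.com or mail me@example.com</p>"
cleaned = clean_text_sketch(sample)
print(cleaned)  # loved it see or mail
```

Note that the exclamation marks and the capitalisation are gone after cleaning, which is why `extract_sentiment_features` needs to look at the raw text rather than the cleaned one.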
7. utils/feature_extraction.py
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


class FeatureExtractor:
    def __init__(self):
        self.count_vectorizer = CountVectorizer(max_features=5000, ngram_range=(1, 3))
        self.tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
        self.vader_analyzer = SentimentIntensityAnalyzer()

    def extract_bow_features(self, texts, fit=False):
        """Extract Bag of Words features"""
        if fit:
            features = self.count_vectorizer.fit_transform(texts)
            self.bow_feature_names = self.count_vectorizer.get_feature_names_out()
        else:
            features = self.count_vectorizer.transform(texts)
        return features

    def extract_tfidf_features(self, texts, fit=False):
        """Extract TF-IDF features"""
        if fit:
            features = self.tfidf_vectorizer.fit_transform(texts)
            self.tfidf_feature_names = self.tfidf_vectorizer.get_feature_names_out()
        else:
            features = self.tfidf_vectorizer.transform(texts)
        return features

    def extract_lexicon_features(self, texts):
        """Extract lexicon-based sentiment features using TextBlob and VADER"""
        features = []
        for text in texts:
            text_features = {}
            # TextBlob features
            blob = TextBlob(text)
            text_features['textblob_polarity'] = blob.sentiment.polarity
            text_features['textblob_subjectivity'] = blob.sentiment.subjectivity
            # VADER features
            vader_scores = self.vader_analyzer.polarity_scores(text)
            text_features['vader_neg'] = vader_scores['neg']
            text_features['vader_neu'] = vader_scores['neu']
            text_features['vader_pos'] = vader_scores['pos']
            text_features['vader_compound'] = vader_scores['compound']
            features.append(text_features)
        return np.array([list(f.values()) for f in features])

    def extract_ngram_features(self, texts, n=3):
        """Extract n-gram features"""
        vectorizer = CountVectorizer(ngram_range=(1, n), max_features=1000)
        return vectorizer.fit_transform(texts)

    def combine_features(self, texts, use_tfidf=True, use_lexicon=True, fit=True):
        """Combine multiple feature extraction methods.

        Pass fit=False at prediction time so the TF-IDF vocabulary learned
        during training is reused instead of being refitted.
        """
        features_list = []
        # Extract TF-IDF features
        if use_tfidf:
            tfidf_features = self.extract_tfidf_features(texts, fit=fit)
            features_list.append(tfidf_features.toarray())
        # Extract lexicon features
        if use_lexicon:
            lexicon_features = self.extract_lexicon_features(texts)
            features_list.append(lexicon_features)
        # Combine all features
        if features_list:
            combined_features = np.hstack(features_list)
            return combined_features
        else:
            return None
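Both vectorizers in FeatureExtractor use `ngram_range=(1, 3)`, meaning that single words, word pairs and word triples all become candidate features. This matters for sentiment because a bigram such as "not good" carries the opposite polarity of the unigram "good". The sketch below shows which n-grams a short token list yields; `word_ngrams` is a hypothetical helper written for illustration and is not how scikit-learn implements the feature internally.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams for min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sequence.
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams


grams = word_ngrams(['not', 'good', 'movie'])
print(grams)
# ['not', 'good', 'movie', 'not good', 'good movie', 'not good movie']
```

The number of distinct n-grams grows quickly with corpus size, which is one reason the vectorizers cap the vocabulary with `max_features`.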
8. models/sentiment_classifier.py
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np
import joblib
import time
from utils.text_preprocessing import TextPreprocessor
from utils.feature_extraction import FeatureExtractor
from config.model_config import ModelConfig
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class SentimentClassifier:
    def __init__(self, model_type='logistic_regression'):
        self.model_type = model_type
        self.model = None
        self.preprocessor = TextPreprocessor()
        self.feature_extractor = FeatureExtractor()
        self.config = ModelConfig()
        self.is_trained = False
        # Initialize model based on type
        self._initialize_model()

    def _initialize_model(self):
        """Initialize the specified model"""
        if self.model_type == 'naive_bayes':
            self.model = MultinomialNB(alpha=1.0)
        elif self.model_type == 'logistic_regression':
            self.model = LogisticRegression(
                max_iter=1000,
                C=1.0,
                random_state=42,
                multi_class='multinomial'
            )
        elif self.model_type == 'svm':
            self.model = SVC(kernel='linear', probability=True, random_state=42)
        elif self.model_type == 'random_forest':
            self.model = RandomForestClassifier(
                n_estimators=100,
                max_depth=10,
                random_state=42
            )
        elif self.model_type == 'gradient_boosting':
            self.model = GradientBoostingClassifier(
                n_estimators=100,
                learning_rate=0.1,
                random_state=42
            )
        else:
            raise ValueError(f"Unsupported model type: {self.model_type}")
        logger.info(f"Initialized {self.model_type} model")
    def train(self, texts, labels, test_size=0.2, random_state=42, use_grid_search=False):
        """Train the sentiment classifier"""
        try:
            start_time = time.time()
            # Preprocess texts. advanced=False returns only the cleaned string;
            # the default would return (text, features) tuples, which the
            # vectorizer cannot consume.
            logger.info("Preprocessing texts...")
            processed_texts = [
                self.preprocessor.preprocess(text, advanced=False) for text in texts
            ]
            # Extract features
            logger.info("Extracting features...")
            X = self.feature_extractor.extract_tfidf_features(processed_texts, fit=True)
            # Convert labels to numerical values
            y = np.array([self.config.LABEL_MAPPING[label.lower()] for label in labels])
            # Split data
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=test_size, random_state=random_state, stratify=y
            )
            # Hyperparameter tuning if requested
            if use_grid_search:
                self._hyperparameter_tuning(X_train, y_train)
            else:
                # Train model
                logger.info(f"Training {self.model_type} model...")
                self.model.fit(X_train, y_train)
            # Make predictions
            y_pred = self.model.predict(X_test)
            y_pred_proba = self.model.predict_proba(X_test) if hasattr(self.model, 'predict_proba') else None
            # Calculate metrics
            metrics = self._calculate_metrics(y_test, y_pred, y_pred_proba)
            # Cross-validation
            cv_scores = cross_val_score(self.model, X_train, y_train, cv=5)
            metrics['cv_mean'] = cv_scores.mean()
            metrics['cv_std'] = cv_scores.std()
            # Add training metadata
            metrics['model_type'] = self.model_type
            metrics['training_time'] = time.time() - start_time
            metrics['num_features'] = X.shape[1]
            metrics['num_samples'] = len(texts)
            self.is_trained = True
            logger.info(f"Model trained successfully with accuracy: {metrics['accuracy']:.4f}")
            return metrics, X_test, y_test, y_pred
        except Exception as e:
            logger.error(f"Error during training: {e}")
            raise
    def _hyperparameter_tuning(self, X_train, y_train):
        """Perform hyperparameter tuning using GridSearchCV"""
        logger.info(f"Performing hyperparameter tuning for {self.model_type}...")
        param_grids = {
            'naive_bayes': {'alpha': [0.1, 0.5, 1.0, 2.0]},
            'logistic_regression': {
                'C': [0.1, 1.0, 10.0],
                'penalty': ['l2']
            },
            'svm': {
                'C': [0.1, 1.0, 10.0],
                'kernel': ['linear', 'rbf']
            },
            'random_forest': {
                'n_estimators': [50, 100, 200],
                'max_depth': [5, 10, None]
            },
            'gradient_boosting': {
                'n_estimators': [50, 100],
                'learning_rate': [0.01, 0.1, 0.3]
            }
        }
        if self.model_type in param_grids:
            grid_search = GridSearchCV(
                self.model,
                param_grids[self.model_type],
                cv=5,
                scoring='accuracy',
                n_jobs=-1
            )
            grid_search.fit(X_train, y_train)
            self.model = grid_search.best_estimator_
            logger.info(f"Best parameters: {grid_search.best_params_}")

    def _calculate_metrics(self, y_true, y_pred, y_pred_proba=None):
        """Calculate performance metrics"""
        # Pass the full label set so the report and the confusion matrix keep
        # all three classes even if one is absent from the test split
        label_ids = list(self.config.SENTIMENT_LABELS.keys())
        metrics = {
            'accuracy': accuracy_score(y_true, y_pred),
            'precision': precision_score(y_true, y_pred, average='weighted'),
            'recall': recall_score(y_true, y_pred, average='weighted'),
            'f1_score': f1_score(y_true, y_pred, average='weighted'),
            'confusion_matrix': confusion_matrix(y_true, y_pred, labels=label_ids).tolist(),
            'classification_report': classification_report(
                y_true, y_pred,
                labels=label_ids,
                target_names=list(self.config.SENTIMENT_LABELS.values()),
                output_dict=True
            )
        }
        # Add per-class metrics
        for i, label in self.config.SENTIMENT_LABELS.items():
            metrics[f'{label.lower()}_precision'] = precision_score(
                y_true, y_pred, labels=[i], average='macro'
            )
            metrics[f'{label.lower()}_recall'] = recall_score(
                y_true, y_pred, labels=[i], average='macro'
            )
        return metrics
    def predict(self, text):
        """Predict sentiment for a single text"""
        if not self.is_trained:
            raise ValueError("Model not trained yet. Please train the model first.")
        # Preprocess text (advanced=False returns a plain string, not a tuple)
        processed_text = self.preprocessor.preprocess(text, advanced=False)
        # Extract features
        X = self.feature_extractor.extract_tfidf_features([processed_text])
        # Make prediction
        prediction_idx = self.model.predict(X)[0]
        prediction = self.config.SENTIMENT_LABELS[prediction_idx]
        # Get probability scores if available
        confidence = None
        probabilities = None
        if hasattr(self.model, 'predict_proba'):
            probabilities = self.model.predict_proba(X)[0]
            confidence = float(max(probabilities))
        return {
            'sentiment': prediction,
            'sentiment_idx': int(prediction_idx),
            'confidence': confidence,
            'probabilities': probabilities.tolist() if probabilities is not None else None
        }

    def predict_batch(self, texts):
        """Predict sentiment for multiple texts"""
        if not self.is_trained:
            raise ValueError("Model not trained yet. Please train the model first.")
        # Preprocess texts (advanced=False returns plain strings, not tuples)
        processed_texts = [
            self.preprocessor.preprocess(text, advanced=False) for text in texts
        ]
        # Extract features
        X = self.feature_extractor.extract_tfidf_features(processed_texts)
        # Make predictions
        predictions_idx = self.model.predict(X)
        predictions = [self.config.SENTIMENT_LABELS[idx] for idx in predictions_idx]
        # Get probability scores if available
        confidences = None
        if hasattr(self.model, 'predict_proba'):
            probabilities = self.model.predict_proba(X)
            confidences = [float(max(probs)) for probs in probabilities]
        results = []
        for i, (text, pred_idx, pred) in enumerate(zip(texts, predictions_idx, predictions)):
            result = {
                'text': text,
                'sentiment': pred,
                'sentiment_idx': int(pred_idx),
                'confidence': confidences[i] if confidences else None
            }
            results.append(result)
        return results
    def save_model(self, filepath='models/saved_models/sentiment_model.pkl'):
        """Save the trained model to disk"""
        if not self.is_trained:
            raise ValueError("No trained model to save.")
        model_data = {
            'model': self.model,
            'model_type': self.model_type,
            'feature_extractor': self.feature_extractor,
            'preprocessor': self.preprocessor,
            'config': self.config
        }
        joblib.dump(model_data, filepath)
        logger.info(f"Model saved to {filepath}")

    def load_model(self, filepath='models/saved_models/sentiment_model.pkl'):
        """Load a trained model from disk"""
        try:
            model_data = joblib.load(filepath)
            self.model = model_data['model']
            self.model_type = model_data['model_type']
            self.feature_extractor = model_data['feature_extractor']
            self.preprocessor = model_data['preprocessor']
            self.config = model_data['config']
            self.is_trained = True
            logger.info(f"Model loaded from {filepath}")
        except Exception as e:
            logger.error(f"Error loading model: {e}")
            raise

    def get_model_info(self):
        """Get information about the current model"""
        return {
            'is_trained': self.is_trained,
            'model_type': self.model_type,
            'num_features': (
                len(self.feature_extractor.tfidf_feature_names)
                if hasattr(self.feature_extractor, 'tfidf_feature_names') else 0
            ),
            'sentiment_labels': self.config.SENTIMENT_LABELS
        }
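`_calculate_metrics` reports precision, recall and F1 with `average='weighted'`, which means each class's score is weighted by its share of the true samples. As a rough illustration of what that averaging does, the sketch below recomputes the weighted scores from a confusion matrix in plain Python. The function name `weighted_metrics` and the 3x3 matrix are made up for this example and are not results from the project.

```python
def weighted_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    precision = recall = f1 = 0.0
    for k in range(n_classes):
        tp = cm[k][k]
        support = sum(cm[k])                              # true samples of class k
        predicted = sum(cm[i][k] for i in range(n_classes))  # samples predicted as k
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(cm[k][k] for k in range(n_classes)) / total
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1_score': f1}


# Illustrative confusion matrix (rows: true Negative, Neutral, Positive).
example_cm = [
    [4, 0, 0],
    [0, 2, 2],
    [0, 0, 2],
]
scores = weighted_metrics(example_cm)
print(scores)
```

One consequence visible here is that support-weighted recall always equals accuracy, so the `recall` and `accuracy` entries stored for a model carry the same information; the per-class values are the ones that reveal which sentiment is being missed.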
9. models/deep_learning_model.py
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.model_selection import train_test_split
import numpy as np
import joblib
from config.model_config import ModelConfig
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class DeepLearningSentimentClassifier:
    def __init__(self, model_type='lstm', max_words=5000, max_len=200):
        self.model_type = model_type
        self.max_words = max_words
        self.max_len = max_len
        self.model = None
        self.tokenizer = Tokenizer(num_words=max_words)
        self.config = ModelConfig()
        self.is_trained = False
    def build_model(self):
        """Build deep learning model architecture"""
        if self.model_type == 'lstm':
            self.model = Sequential([
                Embedding(self.max_words, 128, input_length=self.max_len),
                LSTM(128, dropout=0.2, recurrent_dropout=0.2, return_sequences=True),
                LSTM(64, dropout=0.2, recurrent_dropout=0.2),
                Dense(64, activation='relu'),
                Dropout(0.5),
                Dense(3, activation='softmax')
            ])
        elif self.model_type == 'bilstm':
            self.model = Sequential([
                Embedding(self.max_words, 128, input_length=self.max_len),
                Bidirectional(LSTM(128, dropout=0.2, recurrent_dropout=0.2)),
                Dense(64, activation='relu'),
                Dropout(0.5),
                Dense(3, activation='softmax')
            ])
        elif self.model_type == 'cnn_lstm':
            self.model = Sequential([
                Embedding(self.max_words, 128, input_length=self.max_len),
                Conv1D(128, 5, activation='relu'),
                MaxPooling1D(5),
                LSTM(64, dropout=0.2, recurrent_dropout=0.2),
                Dense(64, activation='relu'),
                Dropout(0.5),
                Dense(3, activation='softmax')
            ])
        else:
            # Without this branch self.model stays None and compile() fails obscurely
            raise ValueError(f"Unsupported model type: {self.model_type}")
        # Compile model
        self.model.compile(
            optimizer='adam',
            loss='sparse_categorical_crossentropy',
            metrics=['accuracy']
        )
        logger.info(f"Built {self.model_type} model architecture")
        # summary() returns None, so route its lines through the logger instead
        self.model.summary(print_fn=logger.info)
    def prepare_data(self, texts, labels):
        """Prepare data for deep learning model"""
        # Tokenize texts
        self.tokenizer.fit_on_texts(texts)
        sequences = self.tokenizer.texts_to_sequences(texts)
        # Pad sequences
        X = pad_sequences(sequences, maxlen=self.max_len, padding='post', truncating='post')
        # Convert labels
        y = np.array([self.config.LABEL_MAPPING[label.lower()] for label in labels])
        return X, y

    def train(self, texts, labels, validation_split=0.2, epochs=10, batch_size=32):
        """Train the deep learning model"""
        try:
            # Prepare data
            X, y = self.prepare_data(texts, labels)
            # Split data
            X_train, X_val, y_train, y_val = train_test_split(
                X, y, test_size=validation_split, random_state=42, stratify=y
            )
            # Build model if not already built
            if self.model is None:
                self.build_model()
            # Callbacks
            callbacks = [
                EarlyStopping(patience=3, restore_best_weights=True),
                ModelCheckpoint(
                    'models/saved_models/best_dl_model.h5',
                    save_best_only=True,
                    monitor='val_accuracy'
                )
            ]
            # Train model
            logger.info(f"Training {self.model_type} model...")
            history = self.model.fit(
                X_train, y_train,
                validation_data=(X_val, y_val),
                epochs=epochs,
                batch_size=batch_size,
                callbacks=callbacks,
                verbose=1
            )
            self.is_trained = True
            # Evaluate on validation set
            val_loss, val_accuracy = self.model.evaluate(X_val, y_val, verbose=0)
            metrics = {
                'model_type': self.model_type,
                'val_accuracy': val_accuracy,
                'val_loss': val_loss,
                'history': {
                    'accuracy': history.history['accuracy'],
                    'val_accuracy': history.history['val_accuracy'],
                    'loss': history.history['loss'],
                    'val_loss': history.history['val_loss']
                }
            }
            logger.info(f"Model trained successfully with validation accuracy: {val_accuracy:.4f}")
            return metrics, X_val, y_val
        except Exception as e:
            logger.error(f"Error during training: {e}")
            raise
    def predict(self, text):
        """Predict sentiment for a single text"""
        if not self.is_trained:
            raise ValueError("Model not trained yet. Please train the model first.")
        # Prepare text
        sequences = self.tokenizer.texts_to_sequences([text])
        X = pad_sequences(sequences, maxlen=self.max_len, padding='post', truncating='post')
        # Make prediction
        probabilities = self.model.predict(X)[0]
        prediction_idx = np.argmax(probabilities)
        prediction = self.config.SENTIMENT_LABELS[prediction_idx]
        confidence = float(probabilities[prediction_idx])
        return {
            'sentiment': prediction,
            'sentiment_idx': int(prediction_idx),
            'confidence': confidence,
            'probabilities': probabilities.tolist()
        }

    def predict_batch(self, texts):
        """Predict sentiment for multiple texts"""
        if not self.is_trained:
            raise ValueError("Model not trained yet. Please train the model first.")
        # Prepare texts
        sequences = self.tokenizer.texts_to_sequences(texts)
        X = pad_sequences(sequences, maxlen=self.max_len, padding='post', truncating='post')
        # Make predictions
        probabilities = self.model.predict(X)
        predictions_idx = np.argmax(probabilities, axis=1)
        predictions = [self.config.SENTIMENT_LABELS[idx] for idx in predictions_idx]
        confidences = [float(probs[idx]) for probs, idx in zip(probabilities, predictions_idx)]
        results = []
        for i, (text, pred_idx, pred, conf) in enumerate(zip(texts, predictions_idx, predictions, confidences)):
            results.append({
                'text': text,
                'sentiment': pred,
                'sentiment_idx': int(pred_idx),
                'confidence': conf,
                'probabilities': probabilities[i].tolist()
            })
        return results

    def save_model(self, filepath='models/saved_models/dl_sentiment_model'):
        """Save the trained model and tokenizer"""
        if not self.is_trained:
            raise ValueError("No trained model to save.")
        # Save Keras model
        self.model.save(f"{filepath}.h5")
        # Save tokenizer and config
        model_data = {
            'tokenizer': self.tokenizer,
            'model_type': self.model_type,
            'max_words': self.max_words,
            'max_len': self.max_len,
            'config': self.config
        }
        joblib.dump(model_data, f"{filepath}_config.pkl")
        logger.info(f"Model saved to {filepath}")

    def load_model(self, filepath='models/saved_models/dl_sentiment_model'):
        """Load a trained model from disk"""
        try:
            # Load Keras model
            self.model = tf.keras.models.load_model(f"{filepath}.h5")
            # Load tokenizer and config
            model_data = joblib.load(f"{filepath}_config.pkl")
            self.tokenizer = model_data['tokenizer']
            self.model_type = model_data['model_type']
            self.max_words = model_data['max_words']
            self.max_len = model_data['max_len']
            self.config = model_data['config']
            self.is_trained = True
            logger.info(f"Model loaded from {filepath}")
        except Exception as e:
            logger.error(f"Error loading model: {e}")
            raise
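The dictionaries returned by `predict` and `predict_batch` pair the arg-max label with its probability. A minimal, dependency-free sketch of that decoding step follows. The `LABELS` order (Negative, Neutral, Positive) is an assumption taken from the index order the front end uses, because `SENTIMENT_LABELS` itself lives in `config/model_config.py`; `decode_predictions` is a hypothetical helper, not part of the project.

```python
# Assumed label order; the project reads this from config.SENTIMENT_LABELS.
LABELS = ["Negative", "Neutral", "Positive"]


def decode_predictions(probabilities, labels=LABELS):
    """Turn rows of class probabilities into result dicts (hypothetical helper)."""
    results = []
    for row in probabilities:
        # Index of the highest probability, i.e. the arg-max over the row.
        idx = max(range(len(row)), key=lambda i: row[i])
        results.append({
            "sentiment": labels[idx],
            "sentiment_idx": idx,
            "confidence": float(row[idx]),
            "probabilities": list(row),
        })
    return results


for item in decode_predictions([[0.1, 0.2, 0.7], [0.8, 0.15, 0.05]]):
    print(item["sentiment"], item["confidence"])
```

The confidence is simply the winning class probability, so a value near 1/3 on a three-class model signals an almost uniform, low-information prediction.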
10. models/model_comparator.py
from models.sentiment_classifier import SentimentClassifier
from models.deep_learning_model import DeepLearningSentimentClassifier
from database.db_operations import MongoDBOperations
from utils.visualization import Visualizer
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class ModelComparator:
    def __init__(self):
        self.models = {}
        self.results = {}
        self.db_ops = MongoDBOperations()
        self.visualizer = Visualizer()

    def initialize_models(self):
        """Initialize all available models"""
        # Traditional ML models
        self.models['naive_bayes'] = SentimentClassifier('naive_bayes')
        self.models['logistic_regression'] = SentimentClassifier('logistic_regression')
        self.models['svm'] = SentimentClassifier('svm')
        self.models['random_forest'] = SentimentClassifier('random_forest')
        # Deep Learning models
        self.models['lstm'] = DeepLearningSentimentClassifier('lstm')
        self.models['bilstm'] = DeepLearningSentimentClassifier('bilstm')
        logger.info(f"Initialized {len(self.models)} models")

    def train_and_compare(self, texts, labels, test_size=0.2):
        """Train all models and compare their performance"""
        results = {}
        # Split data for consistent evaluation
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, test_size=test_size, random_state=42, stratify=labels
        )
        for model_name, model in self.models.items():
            logger.info(f"\n{'='*50}")
            logger.info(f"Training {model_name}...")
            logger.info(f"{'='*50}")
            start_time = time.time()
            # Reset per model so a stale value from an earlier iteration is never stored
            y_pred = None
            try:
                # Train model
                if 'lstm' in model_name:  # Deep learning models
                    metrics, _, _ = model.train(X_train, y_train, epochs=5)
                    # Predict on the shared held-out test set
                    predictions = model.predict_batch(X_test)
                    y_pred = [p['sentiment_idx'] for p in predictions]
                else:  # Traditional ML models
                    # train() takes test_size and returns its own y_test, so it appears to
                    # hold out an inner split of X_train; these metrics therefore do not
                    # come from the shared X_test/y_test above. The inner y_test is
                    # discarded here so it cannot shadow the shared one.
                    metrics, _, _, y_pred = model.train(
                        X_train, y_train, test_size=test_size
                    )
                training_time = time.time() - start_time
                # Store results
                results[model_name] = {
                    'model': model,
                    'metrics': metrics,
                    'training_time': training_time,
                    'predictions': y_pred
                }
                logger.info(f"{model_name} completed in {training_time:.2f} seconds")
            except Exception as e:
                logger.error(f"Error training {model_name}: {e}")
                results[model_name] = {'error': str(e)}
        self.results = results
        return results

    def get_comparison_report(self):
        """Generate comparison report for all models"""
        comparison = []
        for model_name, result in self.results.items():
            if 'error' in result:
                continue
            metrics = result['metrics']
            if isinstance(metrics, dict):
                comparison.append({
                    'Model': model_name,
                    'Accuracy': metrics.get('accuracy', metrics.get('val_accuracy', 0)),
                    'Precision': metrics.get('precision', 0),
                    'Recall': metrics.get('recall', 0),
                    'F1-Score': metrics.get('f1_score', 0),
                    'Training Time (s)': round(result['training_time'], 2)
                })
        df = pd.DataFrame(comparison)
        # An empty frame has no 'Accuracy' column, so sorting it would raise KeyError
        if df.empty:
            return df
        df = df.sort_values('Accuracy', ascending=False)
        return df

    def plot_comparison(self):
        """Plot comparison of model performances"""
        df = self.get_comparison_report()
        if not df.empty:
            # Plot accuracy comparison
            self.visualizer.plot_model_comparison(
                df['Model'].tolist(),
                df['Accuracy'].tolist(),
                'Model Accuracy Comparison',
                'Accuracy'
            )
            # Plot training time comparison
            self.visualizer.plot_model_comparison(
                df['Model'].tolist(),
                df['Training Time (s)'].tolist(),
                'Model Training Time Comparison',
                'Training Time (seconds)'
            )

    def get_best_model(self, metric='Accuracy'):
        """Get the best performing model based on specified metric"""
        df = self.get_comparison_report()
        if not df.empty:
            best_model = df.loc[df[metric].idxmax()]
            return {
                'model_name': best_model['Model'],
                # Cast to a plain float so the value is always JSON-serializable
                'metric_value': float(best_model[metric]),
                'metric': metric
            }
        return None

    def save_all_models(self):
        """Save all trained models"""
        for model_name, result in self.results.items():
            if 'error' not in result and 'model' in result:
                try:
                    result['model'].save_model(f'models/saved_models/{model_name}_model')
                    logger.info(f"Saved {model_name} model")
                except Exception as e:
                    logger.error(f"Error saving {model_name}: {e}")
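A comparison is only fair when every candidate is fitted on the same training examples and scored on the same held-out examples. The sketch below illustrates that idea in plain Python with two toy, two-class classifiers; `MajorityClassifier`, `KeywordClassifier`, and `compare_on_shared_split` are hypothetical stand-ins for illustration, not the project's classes.

```python
import time
from collections import Counter


class MajorityClassifier:
    """Toy baseline: always predicts the most frequent training label."""

    def fit(self, texts, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        return [self.label for _ in texts]


class KeywordClassifier:
    """Toy model: 'positive' if a word seen only in positive training texts appears."""

    def fit(self, texts, labels):
        pos, neg = set(), set()
        for text, label in zip(texts, labels):
            (pos if label == "positive" else neg).update(text.lower().split())
        self.positive_words = pos - neg
        return self

    def predict(self, texts):
        return [
            "positive" if self.positive_words & set(t.lower().split()) else "negative"
            for t in texts
        ]


def compare_on_shared_split(models, X_train, y_train, X_test, y_test):
    """Fit each model on the same training data and score it on the same test data."""
    report = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        correct = sum(p == t for p, t in zip(y_pred, y_test))
        report[name] = {
            "accuracy": correct / len(y_test),
            "elapsed_s": time.perf_counter() - start,  # fit plus predict time
        }
    return report


X_train = ["great film", "awful film", "great acting", "boring plot", "terrible pacing"]
y_train = ["positive", "negative", "positive", "negative", "negative"]
X_test = ["great plot", "boring film", "great acting"]
y_test = ["positive", "negative", "positive"]

report = compare_on_shared_split(
    {"majority": MajorityClassifier(), "keyword": KeywordClassifier()},
    X_train, y_train, X_test, y_test,
)
for name, row in report.items():
    print(f"{name}: accuracy={row['accuracy']:.2f}")
```

Because both models see identical test examples, the accuracy gap reflects the models rather than a lucky or unlucky split.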
11. utils/visualization.py
import matplotlib
# Non-interactive backend: the API renders figures to memory on a server without a display
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from wordcloud import WordCloud
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
import io
import base64


class Visualizer:
    def __init__(self):
        plt.style.use('seaborn-v0_8-darkgrid')
        sns.set_palette("husl")

    def plot_sentiment_distribution(self, sentiments, title="Sentiment Distribution"):
        """Plot distribution of sentiments"""
        plt.figure(figsize=(10, 6))
        # Count sentiments
        sentiment_counts = pd.Series(sentiments).value_counts()
        # Create bar plot
        ax = sns.barplot(x=sentiment_counts.index, y=sentiment_counts.values)
        plt.title(title, fontsize=16, fontweight='bold')
        plt.xlabel('Sentiment', fontsize=12)
        plt.ylabel('Count', fontsize=12)
        # Add value labels on bars
        for i, v in enumerate(sentiment_counts.values):
            ax.text(i, v + 0.1, str(v), ha='center', fontsize=11)
        plt.tight_layout()
        # Convert to base64 for web display
        img = io.BytesIO()
        plt.savefig(img, format='png', dpi=100, bbox_inches='tight')
        img.seek(0)
        plot_url = base64.b64encode(img.getvalue()).decode()
        plt.close()
        return plot_url

    def plot_confusion_matrix(self, y_true, y_pred, labels=['Negative', 'Neutral', 'Positive']):
        """Plot confusion matrix"""
        cm = confusion_matrix(y_true, y_pred)
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                    xticklabels=labels, yticklabels=labels)
        plt.title('Confusion Matrix', fontsize=16, fontweight='bold')
        plt.ylabel('Actual', fontsize=12)
        plt.xlabel('Predicted', fontsize=12)
        plt.tight_layout()
        # Convert to base64
        img = io.BytesIO()
        plt.savefig(img, format='png', dpi=100, bbox_inches='tight')
        img.seek(0)
        plot_url = base64.b64encode(img.getvalue()).decode()
        plt.close()
        return plot_url

    def create_wordcloud(self, texts, title="Word Cloud"):
        """Create word cloud from texts"""
        # Combine all texts
        all_text = ' '.join(texts)
        # Create word cloud
        wordcloud = WordCloud(
            width=800, height=400,
            background_color='white',
            max_words=100,
            colormap='viridis'
        ).generate(all_text)
        plt.figure(figsize=(12, 6))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.title(title, fontsize=16, fontweight='bold')
        plt.axis('off')
        plt.tight_layout()
        # Convert to base64
        img = io.BytesIO()
        plt.savefig(img, format='png', dpi=100, bbox_inches='tight')
        img.seek(0)
        plot_url = base64.b64encode(img.getvalue()).decode()
        plt.close()
        return plot_url

    def plot_interactive_sentiment_timeline(self, dates, sentiments, title="Sentiment Timeline"):
        """Create interactive timeline plot using plotly"""
        df = pd.DataFrame({
            'Date': dates,
            'Sentiment': sentiments
        })
        fig = px.line(df, x='Date', y='Sentiment', title=title)
        fig.update_layout(
            xaxis_title="Date",
            yaxis_title="Sentiment Score",
            hovermode='x'
        )
        return fig.to_html()

    def plot_model_comparison(self, model_names, scores, title, ylabel):
        """Plot comparison of different models"""
        plt.figure(figsize=(12, 6))
        colors = plt.cm.viridis(np.linspace(0, 1, len(model_names)))
        bars = plt.bar(model_names, scores, color=colors)
        plt.title(title, fontsize=16, fontweight='bold')
        plt.xlabel('Model', fontsize=12)
        plt.ylabel(ylabel, fontsize=12)
        plt.xticks(rotation=45, ha='right')
        # Add value labels
        for bar, score in zip(bars, scores):
            plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                     f'{score:.3f}', ha='center', va='bottom', fontsize=10)
        plt.tight_layout()
        # Convert to base64
        img = io.BytesIO()
        plt.savefig(img, format='png', dpi=100, bbox_inches='tight')
        img.seek(0)
        plot_url = base64.b64encode(img.getvalue()).decode()
        plt.close()
        return plot_url

    def plot_feature_importance(self, feature_names, importance_scores, top_n=20):
        """Plot feature importance"""
        # Accept plain lists as well as arrays (fancy indexing below needs an array)
        importance_scores = np.asarray(importance_scores)
        # Never request more bars than there are features
        top_n = min(top_n, len(importance_scores))
        # Sort by importance
        indices = np.argsort(importance_scores)[-top_n:]
        plt.figure(figsize=(10, 8))
        plt.barh(range(top_n), importance_scores[indices])
        plt.yticks(range(top_n), [feature_names[i] for i in indices])
        plt.xlabel('Importance Score', fontsize=12)
        plt.title(f'Top {top_n} Most Important Features', fontsize=16, fontweight='bold')
        plt.tight_layout()
        # Convert to base64
        img = io.BytesIO()
        plt.savefig(img, format='png', dpi=100, bbox_inches='tight')
        img.seek(0)
        plot_url = base64.b64encode(img.getvalue()).decode()
        plt.close()
        return plot_url

    def plot_sentiment_confidence(self, predictions, confidences):
        """Plot sentiment predictions with confidence scores"""
        plt.figure(figsize=(12, 6))
        # Create scatter plot
        colors = {'Positive': 'green', 'Neutral': 'blue', 'Negative': 'red'}
        point_colors = [colors[p] for p in predictions]
        plt.scatter(range(len(predictions)), confidences,
                    c=point_colors, alpha=0.6, s=100)
        plt.xlabel('Sample Index', fontsize=12)
        plt.ylabel('Confidence Score', fontsize=12)
        plt.title('Prediction Confidence by Sentiment', fontsize=16, fontweight='bold')
        plt.ylim(0, 1.1)
        # Add legend
        from matplotlib.patches import Patch
        legend_elements = [Patch(facecolor=color, label=sentiment)
                           for sentiment, color in colors.items()]
        plt.legend(handles=legend_elements)
        plt.tight_layout()
        # Convert to base64
        img = io.BytesIO()
        plt.savefig(img, format='png', dpi=100, bbox_inches='tight')
        img.seek(0)
        plot_url = base64.b64encode(img.getvalue()).decode()
        plt.close()
        return plot_url
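Every plotting method above returns the PNG as a base64 string, which the dashboard later embeds as a `data:` URI. Outside the browser the same string can be decoded back into image bytes. A small sketch using only the standard library; `decode_plot` and `to_data_uri` are hypothetical helpers, not part of `Visualizer`.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first eight bytes of every PNG file


def decode_plot(plot_b64):
    """Decode a base64 plot string into PNG bytes, rejecting anything that is not a PNG."""
    data = base64.b64decode(plot_b64)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data is not a PNG image")
    return data


def to_data_uri(plot_b64):
    """Build the data URI form that an <img src=...> attribute accepts."""
    return "data:image/png;base64," + plot_b64
```

To keep a chart on disk, the decoded bytes can be written directly, for example with `pathlib.Path("distribution.png").write_bytes(decode_plot(plot_url))`, where the file name is only an illustration.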
12. api/app.py (Flask API)
from flask import Flask, request, jsonify, render_template
from flask_cors import CORS
from flask_restful import Api, Resource
from models.sentiment_classifier import SentimentClassifier
from models.deep_learning_model import DeepLearningSentimentClassifier
from models.model_comparator import ModelComparator
from database.db_operations import MongoDBOperations
from utils.text_preprocessing import TextPreprocessor
from utils.visualization import Visualizer
import pandas as pd
import json
from bson import ObjectId
import logging

# Templates and static files live under web/, not next to api/app.py,
# so point Flask at them explicitly (paths are relative to this module).
app = Flask(__name__, template_folder='../web/templates', static_folder='../web/static')
CORS(app)
api = Api(app)

# Initialize components
db_ops = MongoDBOperations()
preprocessor = TextPreprocessor()
visualizer = Visualizer()
classifier = SentimentClassifier('logistic_regression')
dl_classifier = DeepLearningSentimentClassifier('lstm')
comparator = ModelComparator()

# Try to load pre-trained models
try:
    classifier.load_model()
    dl_classifier.load_model()
    logging.info("Loaded pre-trained models")
except Exception as e:
    # Catch Exception rather than using a bare except, which would also swallow
    # KeyboardInterrupt and SystemExit
    logging.info(f"No pre-trained models loaded ({e}). Please train models first.")
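
from flask.json.provider import DefaultJSONProvider


class MongoJSONProvider(DefaultJSONProvider):
    """JSON provider that serializes MongoDB ObjectId values as strings."""

    def default(self, o):
        if isinstance(o, ObjectId):
            return str(o)
        return super().default(o)


# Flask 2.3 (the pinned version) removed app.json_encoder; custom serialization
# is configured through a JSON provider instead.
app.json = MongoJSONProvider(app)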


class HealthCheck(Resource):
    def get(self):
        return {'status': 'healthy', 'message': 'Sentiment Analyzer API is running'}


class AnalyzeSentiment(Resource):
    def post(self):
        # silent=True returns None instead of raising when the body is missing or not JSON
        data = request.get_json(silent=True) or {}
        text = data.get('text', '')
        model_type = data.get('model', 'logistic_regression')
        if not text:
            return {'error': 'No text provided'}, 400
        try:
            # Choose model. Only two models are loaded at startup: LSTM-family requests
            # use dl_classifier, and every other model name falls back to the
            # logistic regression classifier.
            if model_type == 'lstm' or model_type == 'bilstm':
                result = dl_classifier.predict(text)
            else:
                result = classifier.predict(text)
            # Save to database
            review_id = db_ops.insert_review(text)
            db_ops.save_prediction(
                review_id, text, result['sentiment'],
                result['confidence'], model_used=model_type
            )
            return {
                'success': True,
                'result': result
            }
        except Exception as e:
            return {'error': str(e)}, 500


class BatchAnalyze(Resource):
    def post(self):
        # silent=True returns None instead of raising when the body is missing or not JSON
        data = request.get_json(silent=True) or {}
        texts = data.get('texts', [])
        model_type = data.get('model', 'logistic_regression')
        if not texts:
            return {'error': 'No texts provided'}, 400
        try:
            # Choose model
            if model_type == 'lstm' or model_type == 'bilstm':
                results = dl_classifier.predict_batch(texts)
            else:
                results = classifier.predict_batch(texts)
            # Save to database
            for text, result in zip(texts, results):
                review_id = db_ops.insert_review(text)
                db_ops.save_prediction(
                    review_id, text, result['sentiment'],
                    result['confidence'], model_used=model_type
                )
            return {
                'success': True,
                'results': results,
                'count': len(results)
            }
        except Exception as e:
            return {'error': str(e)}, 500


class TrainModel(Resource):
    def post(self):
        # Allow an empty POST body: fall back to the default model type
        data = request.get_json(silent=True) or {}
        model_type = data.get('model_type', 'logistic_regression')
        try:
            # Get training data
            texts, labels = db_ops.get_training_data()
            if len(texts) == 0:
                return {'error': 'No training data found'}, 400
            # Train appropriate model
            if model_type == 'lstm' or model_type == 'bilstm':
                model = DeepLearningSentimentClassifier(model_type)
                metrics, _, _ = model.train(texts, labels)
                model.save_model()
            else:
                model = SentimentClassifier(model_type)
                metrics, _, _, _ = model.train(texts, labels)
                model.save_model()
            # Save metrics to database
            db_ops.update_model_metrics(metrics)
            return {
                'success': True,
                'message': f'{model_type} model trained successfully',
                'metrics': metrics
            }
        except Exception as e:
            return {'error': str(e)}, 500


class CompareModels(Resource):
    def get(self):
        try:
            # Get training data
            texts, labels = db_ops.get_training_data()
            if len(texts) == 0:
                return {'error': 'No training data found'}, 400
            # Initialize comparator
            comparator.initialize_models()
            # Train and compare
            results = comparator.train_and_compare(texts, labels)
            # Get comparison report
            comparison_df = comparator.get_comparison_report()
            return {
                'success': True,
                'comparison': comparison_df.to_dict('records'),
                'best_model': comparator.get_best_model()
            }
        except Exception as e:
            return {'error': str(e)}, 500


class GetStatistics(Resource):
    def get(self):
        try:
            # Get database statistics
            prediction_stats = db_ops.get_prediction_statistics()
            model_performance = db_ops.get_model_performance()
            # Get reviews data
            df = db_ops.export_reviews_to_dataframe()
            statistics = {
                'total_reviews': len(df) if not df.empty else 0,
                'sentiment_distribution': df['sentiment_label'].value_counts().to_dict() if not df.empty and 'sentiment_label' in df.columns else {},
                'prediction_statistics': prediction_stats,
                'model_performance': model_performance
            }
            return {
                'success': True,
                'statistics': statistics
            }
        except Exception as e:
            return {'error': str(e)}, 500


class GetVisualizations(Resource):
    def get(self):
        try:
            # Get data for visualizations
            df = db_ops.export_reviews_to_dataframe()
            if df.empty or 'sentiment_label' not in df.columns:
                return {'error': 'No data available for visualization'}, 400
            # Generate visualizations
            plots = {}
            # Sentiment distribution
            plots['sentiment_distribution'] = visualizer.plot_sentiment_distribution(
                df['sentiment_label'].tolist()
            )
            # Word cloud
            plots['wordcloud'] = visualizer.create_wordcloud(
                df['review_text'].tolist()
            )
            return {
                'success': True,
                'plots': plots
            }
        except Exception as e:
            return {'error': str(e)}, 500


# Web routes
@app.route('/')
def index():
    return render_template('index.html')


@app.route('/analyze')
def analyze_page():
    return render_template('analyze.html')


@app.route('/dashboard')
def dashboard():
    return render_template('dashboard.html')


# API routes
api.add_resource(HealthCheck, '/api/health')
api.add_resource(AnalyzeSentiment, '/api/analyze')
api.add_resource(BatchAnalyze, '/api/analyze/batch')
api.add_resource(TrainModel, '/api/train')
api.add_resource(CompareModels, '/api/compare')
api.add_resource(GetStatistics, '/api/statistics')
api.add_resource(GetVisualizations, '/api/visualizations')

if __name__ == '__main__':
    # Development server only. debug=True enables the interactive debugger, which must
    # not be reachable from untrusted networks; serve with gunicorn in production.
    app.run(debug=True, host='0.0.0.0', port=5000)
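The `/api/analyze` endpoint accepts a JSON body with `text` and `model` keys. A minimal client sketch using only the standard library is shown below; the base URL assumes the development server started by `app.run` on port 5000, and `build_analyze_request` and `analyze` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumed: the local development server


def build_analyze_request(text, model="logistic_regression", base_url=BASE_URL):
    """Build the POST request for /api/analyze without sending it."""
    body = json.dumps({"text": text, "model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, model="logistic_regression", base_url=BASE_URL, timeout=10):
    """Send the request and return the decoded JSON response."""
    request = build_analyze_request(text, model, base_url)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `analyze("Great movie!", model="lstm")` would return the same `{'success': ..., 'result': ...}` structure that the endpoint produces.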
13. web/templates/index.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Sentiment Analyzer - Home</title>
<link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}">
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5/dist/css/bootstrap.min.css" rel="stylesheet">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css">
</head>
<body>
<nav class="navbar navbar-expand-lg navbar-dark bg-primary">
<div class="container">
<a class="navbar-brand" href="/">
<i class="fas fa-smile"></i> Sentiment Analyzer
</a>
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbarNav">
<span class="navbar-toggler-icon"></span>
</button>
<div class="collapse navbar-collapse" id="navbarNav">
<ul class="navbar-nav ms-auto">
<li class="nav-item">
<a class="nav-link active" href="/">Home</a>
</li>
<li class="nav-item">
<a class="nav-link" href="/analyze">Analyze</a>
</li>
<li class="nav-item">
<a class="nav-link" href="/dashboard">Dashboard</a>
</li>
</ul>
</div>
</div>
</nav>
<div class="hero-section">
<div class="container text-center">
<h1 class="display-4">Sentiment Analysis Made Easy</h1>
<p class="lead">Analyze movie and product reviews with state-of-the-art machine learning</p>
<a href="/analyze" class="btn btn-light btn-lg mt-3">
<i class="fas fa-play"></i> Start Analyzing
</a>
</div>
</div>
<div class="container features-section py-5">
<h2 class="text-center mb-5">Key Features</h2>
<div class="row">
<div class="col-md-4 mb-4">
<div class="card h-100">
<div class="card-body text-center">
<i class="fas fa-robot fa-3x text-primary mb-3"></i>
<h5 class="card-title">Multiple ML Models</h5>
<p class="card-text">Choose from Naive Bayes, Logistic Regression, SVM, LSTM, and more</p>
</div>
</div>
</div>
<div class="col-md-4 mb-4">
<div class="card h-100">
<div class="card-body text-center">
<i class="fas fa-chart-line fa-3x text-primary mb-3"></i>
<h5 class="card-title">Real-time Analysis</h5>
<p class="card-text">Get instant sentiment predictions with confidence scores</p>
</div>
</div>
</div>
<div class="col-md-4 mb-4">
<div class="card h-100">
<div class="card-body text-center">
<i class="fas fa-database fa-3x text-primary mb-3"></i>
<h5 class="card-title">MongoDB Integration</h5>
<p class="card-text">All reviews and predictions are stored for future analysis</p>
</div>
</div>
</div>
</div>
</div>
<div class="bg-light py-5">
<div class="container">
<h2 class="text-center mb-5">How It Works</h2>
<div class="row">
<div class="col-md-3 text-center">
<div class="step-circle">1</div>
<h5>Input Text</h5>
<p>Enter your movie or product review</p>
</div>
<div class="col-md-3 text-center">
<div class="step-circle">2</div>
<h5>Preprocessing</h5>
<p>Text is cleaned and prepared for analysis</p>
</div>
<div class="col-md-3 text-center">
<div class="step-circle">3</div>
<h5>ML Analysis</h5>
<p>Our models analyze the sentiment</p>
</div>
<div class="col-md-3 text-center">
<div class="step-circle">4</div>
<h5>Get Results</h5>
<p>Receive sentiment classification with confidence score</p>
</div>
</div>
</div>
</div>
<footer class="bg-dark text-white py-4">
<div class="container text-center">
<p>© 2024 Sentiment Analyzer. All rights reserved.</p>
</div>
</footer>
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5/dist/js/bootstrap.bundle.min.js"></script>
<script src="{{ url_for('static', filename='js/main.js') }}"></script>
</body>
</html>
14. web/templates/analyze.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Sentiment Analyzer - Analyze</title>
<link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}">
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5/dist/css/bootstrap.min.css" rel="stylesheet">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css">
</head>
<body>
<nav class="navbar navbar-expand-lg navbar-dark bg-primary">
<div class="container">
<a class="navbar-brand" href="/">
<i class="fas fa-smile"></i> Sentiment Analyzer
</a>
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbarNav">
<span class="navbar-toggler-icon"></span>
</button>
<div class="collapse navbar-collapse" id="navbarNav">
<ul class="navbar-nav ms-auto">
<li class="nav-item">
<a class="nav-link" href="/">Home</a>
</li>
<li class="nav-item">
<a class="nav-link active" href="/analyze">Analyze</a>
</li>
<li class="nav-item">
<a class="nav-link" href="/dashboard">Dashboard</a>
</li>
</ul>
</div>
</div>
</nav>
<div class="container py-5">
<h2 class="text-center mb-4">Analyze Sentiment</h2>
<div class="row">
<div class="col-md-8 mx-auto">
<div class="card">
<div class="card-body">
<form id="analyzeForm">
<div class="mb-3">
<label for="reviewText" class="form-label">Enter your review:</label>
<textarea class="form-control" id="reviewText" rows="4"
placeholder="Type or paste your movie or product review here..."></textarea>
</div>
<div class="mb-3">
<label for="modelSelect" class="form-label">Select Model:</label>
<select class="form-select" id="modelSelect">
<option value="logistic_regression">Logistic Regression</option>
<option value="naive_bayes">Naive Bayes</option>
<option value="svm">SVM</option>
<option value="random_forest">Random Forest</option>
<option value="lstm">LSTM (Deep Learning)</option>
<option value="bilstm">Bi-LSTM (Deep Learning)</option>
</select>
</div>
<button type="submit" class="btn btn-primary w-100">
<i class="fas fa-search"></i> Analyze Sentiment
</button>
</form>
</div>
</div>
<!-- Results Section -->
<div id="resultsSection" class="mt-4" style="display: none;">
<div class="card">
<div class="card-header bg-success text-white">
<h5 class="mb-0">Analysis Results</h5>
</div>
<div class="card-body">
<div class="row">
<div class="col-md-6">
<h6>Sentiment:</h6>
<div id="sentimentResult" class="display-4"></div>
</div>
<div class="col-md-6">
<h6>Confidence:</h6>
<div id="confidenceResult" class="display-4"></div>
</div>
</div>
<div class="mt-4">
<h6>Probability Distribution:</h6>
<div class="progress mb-2" id="negativeBar">
<div class="progress-bar bg-danger" role="progressbar" style="width: 0%">Negative</div>
</div>
<div class="progress mb-2" id="neutralBar">
<div class="progress-bar bg-warning" role="progressbar" style="width: 0%">Neutral</div>
</div>
<div class="progress mb-2" id="positiveBar">
<div class="progress-bar bg-success" role="progressbar" style="width: 0%">Positive</div>
</div>
</div>
<div class="mt-4">
<h6>Processed Text:</h6>
<p id="processedText" class="text-muted"></p>
</div>
</div>
</div>
</div>
<!-- Error Section -->
<div id="errorSection" class="mt-4" style="display: none;">
<div class="alert alert-danger" role="alert">
<i class="fas fa-exclamation-triangle"></i>
<span id="errorMessage"></span>
</div>
</div>
<!-- Loading Spinner -->
<div id="loadingSpinner" class="text-center mt-4" style="display: none;">
<div class="spinner-border text-primary" role="status">
<span class="visually-hidden">Loading...</span>
</div>
</div>
</div>
</div>
</div>
<footer class="bg-dark text-white py-4 mt-5">
<div class="container text-center">
<p>© 2024 Sentiment Analyzer. All rights reserved.</p>
</div>
</footer>
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5/dist/js/bootstrap.bundle.min.js"></script>
<script src="{{ url_for('static', filename='js/main.js') }}"></script>
<script>
document.getElementById('analyzeForm').addEventListener('submit', async (e) => {
e.preventDefault();
const text = document.getElementById('reviewText').value;
const model = document.getElementById('modelSelect').value;
if (!text) {
alert('Please enter a review');
return;
}
// Show loading spinner
document.getElementById('loadingSpinner').style.display = 'block';
document.getElementById('resultsSection').style.display = 'none';
document.getElementById('errorSection').style.display = 'none';
try {
const response = await fetch('/api/analyze', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({ text, model })
});
const data = await response.json();
// Hide loading spinner
document.getElementById('loadingSpinner').style.display = 'none';
if (data.success) {
// Display results
const result = data.result;
document.getElementById('sentimentResult').textContent = result.sentiment;
document.getElementById('confidenceResult').textContent =
(result.confidence * 100).toFixed(2) + '%';
// Update progress bars if probabilities available
if (result.probabilities) {
const probs = result.probabilities;
document.querySelector('#negativeBar .progress-bar').style.width =
(probs[0] * 100) + '%';
document.querySelector('#neutralBar .progress-bar').style.width =
(probs[1] * 100) + '%';
document.querySelector('#positiveBar .progress-bar').style.width =
(probs[2] * 100) + '%';
}
document.getElementById('resultsSection').style.display = 'block';
} else {
document.getElementById('errorMessage').textContent = data.error;
document.getElementById('errorSection').style.display = 'block';
}
} catch (error) {
document.getElementById('loadingSpinner').style.display = 'none';
document.getElementById('errorMessage').textContent = error.message;
document.getElementById('errorSection').style.display = 'block';
}
});
</script>
</body>
</html>
15. web/static/css/style.css
:root {
--primary-color: #4a90e2;
--secondary-color: #f5f5f5;
--success-color: #28a745;
--danger-color: #dc3545;
--warning-color: #ffc107;
}
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
background-color: #f8f9fa;
}
.hero-section {
background: linear-gradient(135deg, var(--primary-color), #6c5ce7);
color: white;
padding: 100px 0;
}
.features-section {
background-color: white;
}
.features-section .card {
transition: transform 0.3s ease;
border: none;
box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
}
.features-section .card:hover {
transform: translateY(-10px);
}
.step-circle {
width: 50px;
height: 50px;
background-color: var(--primary-color);
color: white;
border-radius: 50%;
display: flex;
align-items: center;
justify-content: center;
margin: 0 auto 20px;
font-size: 24px;
font-weight: bold;
}
/* Dashboard Styles */
.stat-card {
background: linear-gradient(135deg, #667eea, #764ba2);
color: white;
border-radius: 10px;
padding: 20px;
margin-bottom: 20px;
box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
}
.stat-card i {
font-size: 48px;
opacity: 0.8;
}
.stat-card .stat-value {
font-size: 32px;
font-weight: bold;
margin-top: 10px;
}
.stat-card .stat-label {
font-size: 16px;
opacity: 0.9;
}
/* Chart Containers */
.chart-container {
background-color: white;
border-radius: 10px;
padding: 20px;
margin-bottom: 20px;
box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
}
/* Sentiment Badges */
.sentiment-badge {
padding: 5px 10px;
border-radius: 20px;
font-size: 14px;
font-weight: 500;
}
.sentiment-positive {
background-color: #d4edda;
color: #155724;
}
.sentiment-neutral {
background-color: #fff3cd;
color: #856404;
}
.sentiment-negative {
background-color: #f8d7da;
color: #721c24;
}
/* Progress Bars */
.progress {
height: 25px;
margin-bottom: 10px;
border-radius: 5px;
}
.progress-bar {
line-height: 25px;
font-size: 14px;
}
/* Loading Spinner */
.spinner-border {
width: 3rem;
height: 3rem;
}
/* Responsive Design */
@media (max-width: 768px) {
.hero-section {
padding: 50px 0;
}
.hero-section h1 {
font-size: 28px;
}
.stat-card i {
font-size: 36px;
}
.stat-card .stat-value {
font-size: 24px;
}
}
/* Animations */
@keyframes fadeIn {
from {
opacity: 0;
transform: translateY(20px);
}
to {
opacity: 1;
transform: translateY(0);
}
}
.fade-in {
animation: fadeIn 0.5s ease;
}
/* Custom Scrollbar */
::-webkit-scrollbar {
width: 10px;
}
::-webkit-scrollbar-track {
background: #f1f1f1;
}
::-webkit-scrollbar-thumb {
background: var(--primary-color);
border-radius: 5px;
}
::-webkit-scrollbar-thumb:hover {
background: #357abd;
}
16. web/static/js/main.js
// Utility functions
class SentimentAnalyzer {
static async analyzeText(text, model = 'logistic_regression') {
try {
const response = await fetch('/api/analyze', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({ text, model })
});
return await response.json();
} catch (error) {
console.error('Error analyzing text:', error);
throw error;
}
}
static async getStatistics() {
try {
const response = await fetch('/api/statistics');
return await response.json();
} catch (error) {
console.error('Error getting statistics:', error);
throw error;
}
}
static async getVisualizations() {
try {
const response = await fetch('/api/visualizations');
return await response.json();
} catch (error) {
console.error('Error getting visualizations:', error);
throw error;
}
}
static getSentimentColor(sentiment) {
const colors = {
'Positive': '#28a745',
'Neutral': '#ffc107',
'Negative': '#dc3545'
};
return colors[sentiment] || '#6c757d';
}
static getSentimentIcon(sentiment) {
const icons = {
'Positive': 'fa-smile',
'Neutral': 'fa-meh',
'Negative': 'fa-frown'
};
return icons[sentiment] || 'fa-question-circle';
}
}
// Dashboard initialization
class Dashboard {
constructor() {
this.charts = {};
}
async init() {
try {
const stats = await SentimentAnalyzer.getStatistics();
const viz = await SentimentAnalyzer.getVisualizations();
this.updateStats(stats.statistics);
this.renderCharts(viz.plots);
} catch (error) {
console.error('Error initializing dashboard:', error);
}
}
updateStats(statistics) {
// Update statistics cards
document.querySelectorAll('[data-stat]').forEach(element => {
const statName = element.dataset.stat;
if (statistics[statName] !== undefined) {
if (statName === 'sentiment_distribution') {
// Handle sentiment distribution
const total = Object.values(statistics[statName]).reduce((a, b) => a + b, 0);
element.textContent = total;
} else {
element.textContent = statistics[statName];
}
}
});
}
renderCharts(plots) {
// Render sentiment distribution chart
if (plots.sentiment_distribution) {
const img = document.createElement('img');
img.src = 'data:image/png;base64,' + plots.sentiment_distribution;
img.className = 'img-fluid';
document.getElementById('sentimentChart').appendChild(img);
}
// Render word cloud
if (plots.wordcloud) {
const img = document.createElement('img');
img.src = 'data:image/png;base64,' + plots.wordcloud;
img.className = 'img-fluid';
document.getElementById('wordCloud').appendChild(img);
}
}
}
// Export functionality
class ExportManager {
static async exportToCSV(data, filename = 'sentiment_data.csv') {
const csv = this.convertToCSV(data);
this.downloadFile(csv, filename, 'text/csv');
}
static async exportToJSON(data, filename = 'sentiment_data.json') {
const json = JSON.stringify(data, null, 2);
this.downloadFile(json, filename, 'application/json');
}
static convertToCSV(data) {
if (!Array.isArray(data) || data.length === 0) return '';
const headers = Object.keys(data[0]);
const csvRows = [];
csvRows.push(headers.join(','));
for (const row of data) {
const values = headers.map(header => {
// Use ?? rather than || so legitimate falsy values such as 0 and false are kept
const value = row[header] ?? '';
return `"${value.toString().replace(/"/g, '""')}"`;
});
csvRows.push(values.join(','));
}
return csvRows.join('\n');
}
static downloadFile(content, filename, contentType) {
const blob = new Blob([content], { type: contentType });
const url = URL.createObjectURL(blob);
const link = document.createElement('a');
link.href = url;
link.download = filename;
document.body.appendChild(link);
link.click();
document.body.removeChild(link);
URL.revokeObjectURL(url);
}
}
// Initialize dashboard when on dashboard page
document.addEventListener('DOMContentLoaded', () => {
    if (window.location.pathname === '/dashboard') {
        const dashboard = new Dashboard();
        dashboard.init();
    }
});

// Export button handlers
document.querySelectorAll('[data-export]').forEach(button => {
    button.addEventListener('click', async () => {
        const format = button.dataset.export;
        const dataType = button.dataset.data || 'statistics';
        try {
            let data;
            if (dataType === 'statistics') {
                const response = await SentimentAnalyzer.getStatistics();
                data = response.statistics;
            }
            // Only 'statistics' is handled here; fail loudly for anything else
            if (data === undefined) {
                throw new Error(`Unsupported export data type: ${dataType}`);
            }
            if (format === 'csv') {
                await ExportManager.exportToCSV([data], `sentiment_${dataType}.csv`);
            } else if (format === 'json') {
                await ExportManager.exportToJSON(data, `sentiment_${dataType}.json`);
            }
        } catch (error) {
            console.error('Error exporting data:', error);
            alert('Error exporting data. Please try again.');
        }
    });
});
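The CSV escaping rule used by `convertToCSV` (wrap each field in double quotes and double any embedded quote) can be cross-checked on the server side. The following is a minimal sketch using Python's standard `csv` module; `rows_to_csv` is a hypothetical helper and not part of the project code. Unlike the JavaScript version, it also quotes the header row.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text, quoting every field.

    Hypothetical helper for illustration; column order follows the first row.
    """
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0].keys()),
        quoting=csv.QUOTE_ALL,   # wrap every field in double quotes
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# A zero count and an embedded quote are the two cases that are easy to get wrong
sample = [{"review": 'She said "wow"', "neutral_count": 0}]
print(rows_to_csv(sample))
```

The zero in `neutral_count` survives as `"0"`, which is the behaviour a CSV exporter should have for numeric statistics.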
17. .env file
# MongoDB Configuration
MONGODB_URI=mongodb://localhost:27017/
DATABASE_NAME=sentiment_analyzer_db

# Flask Configuration
FLASK_APP=api/app.py
FLASK_ENV=development
SECRET_KEY=your-secret-key-here

# Model Configuration
DEFAULT_MODEL=logistic_regression
MAX_FEATURES=5000
MAX_SEQUENCE_LENGTH=200

# Redis Configuration (for Celery)
REDIS_URL=redis://localhost:6379/0
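These values are plain strings once they reach the process environment, so numeric settings need converting before use. The sketch below shows one way to read them with defaults that match the file above; the `Settings` class and `load_settings` function are assumptions for illustration, not the project's `config` modules. With `python-dotenv` installed, calling `load_dotenv()` first would copy the `.env` entries into `os.environ`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Typed view of the .env values (hypothetical, for illustration)."""
    mongodb_uri: str
    database_name: str
    default_model: str
    max_features: int
    max_sequence_length: int
    redis_url: str


def load_settings(env=None):
    """Build Settings from a mapping; defaults mirror the sample .env file."""
    env = os.environ if env is None else env
    return Settings(
        mongodb_uri=env.get("MONGODB_URI", "mongodb://localhost:27017/"),
        database_name=env.get("DATABASE_NAME", "sentiment_analyzer_db"),
        default_model=env.get("DEFAULT_MODEL", "logistic_regression"),
        # Environment values are strings, so convert the numeric ones
        max_features=int(env.get("MAX_FEATURES", "5000")),
        max_sequence_length=int(env.get("MAX_SEQUENCE_LENGTH", "200")),
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
    )


settings = load_settings({"MAX_FEATURES": "100"})
print(settings.max_features, settings.database_name)
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test.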
18. Sample Data File (data/movie_reviews.csv)
review,sentiment
"This movie was absolutely fantastic! Great acting and storyline.",positive
"Terrible movie. Waste of time and money.",negative
"Decent film, but nothing special. Average acting.",neutral
"One of the best movies I've ever seen! Highly recommended.",positive
"Boring and predictable. Fell asleep halfway through.",negative
"Good performances but the plot was confusing.",neutral
"Masterpiece! Director's best work yet.",positive
"Awful script and poor direction. Disappointing.",negative
"Entertaining enough for a Sunday afternoon watch.",neutral
"Brilliant cinematography and emotional depth.",positive
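Because several reviews contain commas inside the quoted text, the file should be read with a real CSV parser rather than split on commas. A minimal sketch of loading and validating this format follows; `read_labeled_reviews` and `ALLOWED_LABELS` are hypothetical names, and the reported line number is only approximate if a review spans several lines.

```python
import csv
import io

ALLOWED_LABELS = {"positive", "negative", "neutral"}


def read_labeled_reviews(csv_text):
    """Parse 'review,sentiment' CSV text into (text, label) tuples.

    Raises ValueError on an empty review or an unknown label.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        text = (row.get("review") or "").strip()
        label = (row.get("sentiment") or "").strip().lower()
        if not text or label not in ALLOWED_LABELS:
            raise ValueError(f"Invalid row near line {line_no}: {row!r}")
        rows.append((text, label))
    return rows


sample_csv = (
    "review,sentiment\n"
    '"Decent film, but nothing special. Average acting.",neutral\n'
    '"Masterpiece! Director\'s best work yet.",positive\n'
)
reviews = read_labeled_reviews(sample_csv)
print(reviews)
```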
19. run.py (Main entry point)
#!/usr/bin/env python3
"""
Sentiment Analyzer - Main entry point
Run this script to start the application
"""
import argparse
import sys
import os
from database.db_operations import MongoDBOperations
from models.sentiment_classifier import SentimentClassifier
from models.deep_learning_model import DeepLearningSentimentClassifier
from models.model_comparator import ModelComparator
from api.app import app
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def setup_database():
    """Initialize database and load sample data if needed"""
    logger.info("Setting up database...")
    db_ops = MongoDBOperations()
    # Check if database is empty
    reviews, _ = db_ops.get_training_data()
    if len(reviews) == 0:
        logger.info("No training data found. Loading sample data...")
        # Load sample movie reviews
        sample_data = [
            ("This movie was absolutely fantastic! Great acting and storyline.", "positive"),
            ("Terrible movie. Waste of time and money.", "negative"),
            ("Decent film, but nothing special. Average acting.", "neutral"),
            ("One of the best movies I've ever seen! Highly recommended.", "positive"),
            ("Boring and predictable. Fell asleep halfway through.", "negative"),
            ("Good performances but the plot was confusing.", "neutral"),
            ("Masterpiece! Director's best work yet.", "positive"),
            ("Awful script and poor direction. Disappointing.", "negative"),
            ("Entertaining enough for a Sunday afternoon watch.", "neutral"),
            ("Brilliant cinematography and emotional depth.", "positive"),
        ]
        for review, sentiment in sample_data:
            review_id = db_ops.insert_review(review, source='sample')
            db_ops.update_review_label(review_id, sentiment)
        logger.info(f"Loaded {len(sample_data)} sample reviews")
    db_ops.close_connection()
    return True
def train_default_model():
    """Train default model with available data"""
    logger.info("Training default model...")
    db_ops = MongoDBOperations()
    texts, labels = db_ops.get_training_data()
    if len(texts) > 0:
        # Train traditional ML model
        classifier = SentimentClassifier('logistic_regression')
        metrics, _, _, _ = classifier.train(texts, labels)
        classifier.save_model()
        logger.info(f"Model trained with accuracy: {metrics['accuracy']:.4f}")
        # Train deep learning model if enough data
        if len(texts) >= 100:
            dl_classifier = DeepLearningSentimentClassifier('lstm')
            dl_metrics, _, _ = dl_classifier.train(texts, labels, epochs=5)
            dl_classifier.save_model()
            logger.info(f"Deep learning model trained with accuracy: {dl_metrics.get('val_accuracy', 0):.4f}")
    db_ops.close_connection()
    return True


def run_api(host='0.0.0.0', port=5000, debug=False):
    """Run the Flask API server"""
    logger.info(f"Starting API server on {host}:{port}")
    app.run(host=host, port=port, debug=debug)
def run_interactive():
    """Run interactive command-line interface"""
    print("\n" + "="*60)
    print("SENTIMENT ANALYZER - Interactive Mode")
    print("="*60)
    # Load model
    classifier = SentimentClassifier('logistic_regression')
    try:
        classifier.load_model()
        print("✓ Model loaded successfully")
    except Exception:
        # Catch Exception rather than using a bare except, so Ctrl+C still works
        print("! No trained model found. Please train the model first.")
        return
    while True:
        print("\n" + "-"*60)
        print("Options:")
        print("1. Analyze a review")
        print("2. Exit")
        choice = input("\nEnter your choice (1-2): ").strip()
        if choice == '1':
            review = input("\nEnter your review: ").strip()
            if review:
                print("\nAnalyzing...")
                result = classifier.predict(review)
                print(f"\nSentiment: {result['sentiment']}")
                print(f"Confidence: {result['confidence']:.4f}")
                if result.get('probabilities'):
                    print("\nProbabilities:")
                    # Assumes the model orders classes as negative, neutral, positive
                    sentiments = ['Negative', 'Neutral', 'Positive']
                    for sent, prob in zip(sentiments, result['probabilities']):
                        print(f"  {sent}: {prob:.4f}")
            else:
                print("Please enter a review")
        elif choice == '2':
            print("\nGoodbye!")
            break
        else:
            print("Invalid choice. Please try again.")
def main():
    parser = argparse.ArgumentParser(description='Sentiment Analyzer Application')
    parser.add_argument('--mode', choices=['api', 'train', 'interactive', 'setup'],
                        default='api', help='Application mode')
    parser.add_argument('--host', default='0.0.0.0', help='API host')
    parser.add_argument('--port', type=int, default=5000, help='API port')
    parser.add_argument('--debug', action='store_true', help='Debug mode')
    args = parser.parse_args()
    if args.mode == 'setup':
        # Setup database and load sample data
        setup_database()
        train_default_model()
        logger.info("Setup completed successfully")
    elif args.mode == 'train':
        # Train models with existing data
        train_default_model()
    elif args.mode == 'interactive':
        # Run interactive CLI
        setup_database()
        train_default_model()
        run_interactive()
    else:  # api mode
        # Setup and run API
        setup_database()
        train_default_model()
        run_api(host=args.host, port=args.port, debug=args.debug)


if __name__ == "__main__":
    main()
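The interactive mode prints probabilities against a hardcoded label order. In scikit-learn, the columns of `predict_proba` follow the fitted estimator's `classes_` attribute, so pairing by that list is safer than assuming negative, neutral, positive. The sketch below shows the pairing step only; `label_probabilities` is a hypothetical helper, and whether `SentimentClassifier` exposes its class order is an assumption.

```python
def label_probabilities(classes, probabilities):
    """Pair each class label with its probability, in the model's own order.

    `classes` stands in for something like a fitted estimator's `classes_`.
    """
    if len(classes) != len(probabilities):
        raise ValueError("classes and probabilities must have the same length")
    return {str(label): float(prob) for label, prob in zip(classes, probabilities)}


# Illustrative values only, not real model output
classes = ["negative", "neutral", "positive"]
probs = [0.10, 0.25, 0.65]

scores = label_probabilities(classes, probs)
best = max(scores, key=scores.get)
for label, prob in scores.items():
    print(f"  {label.capitalize()}: {prob:.4f}")
print("Predicted:", best)
```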
20. tests/test_sentiment_analyzer.py
import unittest
import sys
import os

# Make the project root importable when running the tests directly
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from models.sentiment_classifier import SentimentClassifier
from models.deep_learning_model import DeepLearningSentimentClassifier
from utils.text_preprocessing import TextPreprocessor
from database.db_operations import MongoDBOperations
import numpy as np


class TestTextPreprocessor(unittest.TestCase):
    def setUp(self):
        self.preprocessor = TextPreprocessor()

    def test_clean_text(self):
        text = "This is a TEST!!! with 123 numbers and <html> tags"
        cleaned = self.preprocessor.clean_text(text)
        self.assertNotIn('123', cleaned)
        self.assertNotIn('<html>', cleaned)
        self.assertEqual(cleaned, "this is a test with numbers and tags")
    def test_convert_emojis(self):
        text = "I love this 😊"
        converted = self.preprocessor.convert_emojis(text)
        self.assertIn('smiling_face', converted)

    def test_extract_sentiment_features(self):
        text = "This is GREAT!!! 😊👍"
        features = self.preprocessor.extract_sentiment_features(text)
        self.assertEqual(features['exclamation_count'], 1)
        self.assertEqual(features['positive_emoji_count'], 2)
        self.assertEqual(features['all_caps_count'], 1)
class TestSentimentClassifier(unittest.TestCase):
    def setUp(self):
        self.classifier = SentimentClassifier('logistic_regression')
        self.test_texts = [
            "This movie is fantastic!",
            "Terrible movie, waste of time",
            "It's an okay film"
        ]
        self.test_labels = ['positive', 'negative', 'neutral']

    def test_train_and_predict(self):
        # Train model
        metrics, _, _, _ = self.classifier.train(self.test_texts, self.test_labels)
        self.assertGreater(metrics['accuracy'], 0)
        self.assertTrue(self.classifier.is_trained)
        # Test prediction
        result = self.classifier.predict("Great movie!")
        self.assertIn('sentiment', result)
        self.assertIn('confidence', result)

    def test_predict_batch(self):
        self.classifier.train(self.test_texts, self.test_labels)
        results = self.classifier.predict_batch(["Good movie", "Bad movie"])
        self.assertEqual(len(results), 2)

    def test_model_info(self):
        self.classifier.train(self.test_texts, self.test_labels)
        info = self.classifier.get_model_info()
        self.assertTrue(info['is_trained'])
        self.assertEqual(info['model_type'], 'logistic_regression')
class TestDeepLearningClassifier(unittest.TestCase):
    def setUp(self):
        self.classifier = DeepLearningSentimentClassifier('lstm', max_words=1000, max_len=50)
        self.test_texts = [
            "This movie is fantastic!",
            "Terrible movie, waste of time",
            "It's an okay film"
        ]
        self.test_labels = ['positive', 'negative', 'neutral']

    def test_build_model(self):
        self.classifier.build_model()
        self.assertIsNotNone(self.classifier.model)

    def test_prepare_data(self):
        X, y = self.classifier.prepare_data(self.test_texts, self.test_labels)
        self.assertEqual(len(X), len(self.test_texts))
        self.assertEqual(len(y), len(self.test_labels))

    def test_train_and_predict(self):
        # Quick test with minimal epochs
        metrics, _, _ = self.classifier.train(
            self.test_texts, self.test_labels,
            epochs=2, validation_split=0.3
        )
        self.assertIn('val_accuracy', metrics)
        # Test prediction
        result = self.classifier.predict("Great movie!")
        self.assertIn('sentiment', result)
class TestDatabaseOperations(unittest.TestCase):
    def setUp(self):
        self.db_ops = MongoDBOperations()
        # Use test database
        self.db_ops.config.DATABASE_NAME = "test_sentiment_analyzer"
        self.db_ops.connect()

    def tearDown(self):
        # Clean up test database
        self.db_ops.client.drop_database("test_sentiment_analyzer")
        self.db_ops.close_connection()

    def test_insert_review(self):
        review_id = self.db_ops.insert_review("Test review", source='test')
        self.assertIsNotNone(review_id)

    def test_update_review_label(self):
        review_id = self.db_ops.insert_review("Test review")
        result = self.db_ops.update_review_label(review_id, 'positive')
        self.assertTrue(result)

    def test_get_training_data(self):
        # Insert test data
        reviews = ["Good movie", "Bad movie", "Okay movie"]
        labels = ["positive", "negative", "neutral"]
        for review, label in zip(reviews, labels):
            review_id = self.db_ops.insert_review(review)
            self.db_ops.update_review_label(review_id, label)
        # Retrieve training data
        texts, retrieved_labels = self.db_ops.get_training_data()
        self.assertEqual(len(texts), 3)
        self.assertEqual(len(retrieved_labels), 3)

    def test_save_prediction(self):
        review_id = self.db_ops.insert_review("Test review")
        pred_id = self.db_ops.save_prediction(
            review_id, "Test review", "positive", 0.95,
            actual_label="positive"
        )
        self.assertIsNotNone(pred_id)
class TestModelComparator(unittest.TestCase):
    def setUp(self):
        from models.model_comparator import ModelComparator
        self.comparator = ModelComparator()
        self.test_texts = [
            "This movie is fantastic!",
            "Terrible movie, waste of time",
            "It's an okay film",
            "Amazing performance!",
            "Boring and predictable"
        ] * 10  # Multiply to get more data
        self.test_labels = ['positive', 'negative', 'neutral', 'positive', 'negative'] * 10

    def test_initialize_models(self):
        self.comparator.initialize_models()
        self.assertGreater(len(self.comparator.models), 0)

    def test_train_and_compare(self):
        self.comparator.initialize_models()
        results = self.comparator.train_and_compare(self.test_texts, self.test_labels)
        self.assertGreater(len(results), 0)

    def test_get_comparison_report(self):
        self.comparator.initialize_models()
        self.comparator.train_and_compare(self.test_texts, self.test_labels)
        df = self.comparator.get_comparison_report()
        self.assertFalse(df.empty)

    def test_get_best_model(self):
        self.comparator.initialize_models()
        self.comparator.train_and_compare(self.test_texts, self.test_labels)
        best = self.comparator.get_best_model()
        self.assertIsNotNone(best)
        self.assertIn('model_name', best)


if __name__ == '__main__':
    unittest.main()
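The database tests above need a running MongoDB instance. Logic that merely calls the database layer can be tested without one by passing in a mock. The sketch below assumes a hypothetical `load_samples` helper that mirrors the insert-then-label loop in `setup_database()`; only the method names `insert_review` and `update_review_label` come from the project code.

```python
import unittest
from unittest.mock import MagicMock


def load_samples(db_ops, samples):
    """Hypothetical helper mirroring the sample-loading loop in setup_database()."""
    for review, sentiment in samples:
        review_id = db_ops.insert_review(review, source="sample")
        db_ops.update_review_label(review_id, sentiment)
    return len(samples)


class TestLoadSamples(unittest.TestCase):
    def test_each_review_is_inserted_and_labelled(self):
        # MagicMock stands in for MongoDBOperations; no database is contacted
        db_ops = MagicMock()
        db_ops.insert_review.side_effect = ["id-1", "id-2"]

        count = load_samples(db_ops, [("Great!", "positive"), ("Bad!", "negative")])

        self.assertEqual(count, 2)
        db_ops.insert_review.assert_any_call("Great!", source="sample")
        db_ops.update_review_label.assert_called_with("id-2", "negative")
        self.assertEqual(db_ops.update_review_label.call_count, 2)
```

Saved as a module under `tests/`, this would run with `python -m unittest` alongside the existing tests.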
21. docker-compose.yml
version: '3.8'

services:
  mongodb:
    image: mongo:6.0
    container_name: sentiment-analyzer-mongodb
    restart: always
    ports:
      - "27017:27017"
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: password123
      MONGO_INITDB_DATABASE: sentiment_analyzer_db
    volumes:
      - mongodb_data:/data/db
    networks:
      - sentiment-network

  redis:
    image: redis:7-alpine
    container_name: sentiment-analyzer-redis
    restart: always
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    networks:
      - sentiment-network

  api:
    build:
      context: .
      dockerfile: Dockerfile.api
    container_name: sentiment-analyzer-api
    restart: always
    ports:
      - "5000:5000"
    environment:
      MONGODB_URI: mongodb://admin:password123@mongodb:27017/
      REDIS_URL: redis://redis:6379/0
      FLASK_ENV: production
    depends_on:
      - mongodb
      - redis
    volumes:
      - ./models/saved_models:/app/models/saved_models
    networks:
      - sentiment-network

  worker:
    build:
      context: .
      dockerfile: Dockerfile.worker
    container_name: sentiment-analyzer-worker
    restart: always
    environment:
      MONGODB_URI: mongodb://admin:password123@mongodb:27017/
      REDIS_URL: redis://redis:6379/0
    depends_on:
      - mongodb
      - redis
    volumes:
      - ./models/saved_models:/app/models/saved_models
    networks:
      - sentiment-network

  nginx:
    image: nginx:alpine
    container_name: sentiment-analyzer-nginx
    restart: always
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
      - ./nginx/ssl:/etc/nginx/ssl
    depends_on:
      - api
    networks:
      - sentiment-network

volumes:
  mongodb_data:
  redis_data:

networks:
  sentiment-network:
    driver: bridge
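The short `depends_on` form used above controls start order only; it does not wait until MongoDB or Redis actually accept connections. One way to cope is to poll the port from the application before connecting. The following is a minimal sketch under that assumption; `wait_for_port` is a hypothetical helper, and an accepted TCP connection is only a rough readiness signal.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


if __name__ == "__main__":
    # Inside the Compose network the MongoDB service is reachable as "mongodb"
    ready = wait_for_port("mongodb", 27017, timeout=5.0)
    print("MongoDB reachable:", ready)
```

A call like this could run at the top of `setup_database()` so the API container does not fail on its first connection attempt.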
22. Dockerfile.api
FROM python:3.9-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
gcc \
g++ \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Download NLTK data
RUN python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')"
# Copy application code
COPY . .
# Create directory for saved models
RUN mkdir -p models/saved_models
# Expose port
EXPOSE 5000
# Run the application
CMD ["python", "run.py", "--mode", "api", "--host", "0.0.0.0", "--port", "5000"]
23. Dockerfile.worker
FROM python:3.9-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
gcc \
g++ \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Download NLTK data
RUN python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')"
# Copy application code
COPY . .
# Create directory for saved models
RUN mkdir -p models/saved_models
# Run worker
CMD ["celery", "-A", "tasks.worker", "worker", "--loglevel=info"]
24. nginx/nginx.conf
events {
    worker_connections 1024;
}

http {
    upstream api_servers {
        server api:5000;
    }

    # Redirect all plain HTTP traffic to HTTPS
    server {
        listen 80;
        server_name localhost;
        return 301 https://$server_name$request_uri;
    }

    server {
        listen 443 ssl;
        server_name localhost;
        ssl_certificate /etc/nginx/ssl/cert.pem;
        ssl_certificate_key /etc/nginx/ssl/key.pem;

        location / {
            proxy_pass http://api_servers;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }

        location /static {
            alias /app/static;
            expires 30d;
        }

        location /api {
            proxy_pass http://api_servers;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}
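With this configuration the Flask app sees Nginx as the remote peer, and the real client address arrives in the `X-Forwarded-For` header. `$proxy_add_x_forwarded_for` appends the address Nginx saw to whatever the client sent, so earlier entries can be spoofed and only the entries added by trusted proxies are reliable. A minimal sketch of extracting the client address, assuming exactly one trusted proxy; `client_ip_from_forwarded_for` is a hypothetical helper:

```python
def client_ip_from_forwarded_for(header_value, trusted_proxies=1):
    """Return the address recorded by the outermost trusted proxy, or None.

    Each trusted proxy appends one entry, so with N trusted proxies the
    client address is the N-th entry counted from the right.
    """
    if not header_value or trusted_proxies < 1:
        return None
    hops = [part.strip() for part in header_value.split(",") if part.strip()]
    if len(hops) < trusted_proxies:
        return None
    return hops[-trusted_proxies]


# The first entry here was supplied by the client and must not be trusted
print(client_ip_from_forwarded_for("10.0.0.1, 203.0.113.7"))
```

The addresses shown are documentation examples, not values from the project.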
25. tasks/worker.py (Celery tasks for background processing)
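import os
import logging

from celery import Celery

from models.sentiment_classifier import SentimentClassifier
from models.deep_learning_model import DeepLearningSentimentClassifier
from database.db_operations import MongoDBOperations

logger = logging.getLogger(__name__)

# Initialize Celery. Read the Redis URL from the environment so the same code
# works locally and under Docker Compose, which sets REDIS_URL to the redis service.
# A result backend is configured so callers can fetch the dicts the tasks return.
REDIS_URL = os.environ.get('REDIS_URL', 'redis://localhost:6379/0')
celery = Celery('tasks', broker=REDIS_URL, backend=REDIS_URL)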
@celery.task
def train_model_task(model_type, texts, labels):
    """Background task to train a model"""
    try:
        if model_type in ['lstm', 'bilstm']:
            model = DeepLearningSentimentClassifier(model_type)
            metrics, _, _ = model.train(texts, labels)
            model.save_model()
        else:
            model = SentimentClassifier(model_type)
            metrics, _, _, _ = model.train(texts, labels)
            model.save_model()
        # Save metrics to database
        db_ops = MongoDBOperations()
        db_ops.update_model_metrics(metrics)
        db_ops.close_connection()
        return {'success': True, 'metrics': metrics}
    except Exception as e:
        logger.error(f"Error in training task: {e}")
        return {'success': False, 'error': str(e)}


@celery.task
def batch_analyze_task(texts, model_type='logistic_regression'):
    """Background task to analyze multiple texts"""
    try:
        if model_type in ['lstm', 'bilstm']:
            model = DeepLearningSentimentClassifier(model_type)
            model.load_model()
            results = model.predict_batch(texts)
        else:
            model = SentimentClassifier(model_type)
            model.load_model()
            results = model.predict_batch(texts)
        # Save to database
        db_ops = MongoDBOperations()
        for text, result in zip(texts, results):
            review_id = db_ops.insert_review(text)
            db_ops.save_prediction(
                review_id, text, result['sentiment'],
                result['confidence'], model_used=model_type
            )
        db_ops.close_connection()
        return {'success': True, 'results': results}
    except Exception as e:
        logger.error(f"Error in batch analysis: {e}")
        return {'success': False, 'error': str(e)}


@celery.task
def export_data_task(format='csv'):
    """Background task to export data"""
    try:
        db_ops = MongoDBOperations()
        df = db_ops.export_reviews_to_dataframe()
        db_ops.close_connection()
        if format == 'csv':
            data = df.to_csv(index=False)
        else:
            data = df.to_json(orient='records')
        return {'success': True, 'data': data, 'format': format}
    except Exception as e:
        logger.error(f"Error in export task: {e}")
        return {'success': False, 'error': str(e)}
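`batch_analyze_task` receives the whole list of texts in a single message. For large uploads it may be preferable to split the list and queue one task per chunk, so that messages stay small and several workers can share the load. The sketch below shows only the splitting step; `chunked` is a hypothetical helper and the chunk size is an arbitrary illustration.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


texts = ["Great!", "Bad!", "It was okay", "Loved it", "Not for me"]
batches = list(chunked(texts, 2))
print(batches)

# With Celery running, each batch could then be queued separately, e.g.:
#     for batch in batches:
#         batch_analyze_task.delay(batch, model_type='logistic_regression')
```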
26. scripts/generate_test_data.py
#!/usr/bin/env python3
"""
Script to generate test data for sentiment analysis
"""
import random
import csv
import argparse


def generate_movie_reviews(num_reviews=1000):
    """Generate synthetic movie reviews"""
    positive_templates = [
        "This movie was absolutely {}!",
        "{} performance by the lead actor.",
        "One of the best {} movies I've seen.",
        "Absolutely {} from start to finish.",
        "A {} masterpiece of cinema.",
        "The {} direction made this film special.",
        "{} acting and {} storyline.",
        "I was {} throughout the entire movie.",
        "This film {} exceeded my expectations.",
        "A {} gem that deserves more attention."
    ]
    negative_templates = [
        "This movie was completely {}.",
        "{} acting ruined the experience.",
        "One of the worst {} movies ever.",
        "Absolutely {} and {}.",
        "A {} disaster from start to finish.",
        "The {} direction made no sense.",
        "{} plot and {} characters.",
        "I was {} bored throughout.",
        "This film {} disappointed me.",
        "A {} waste of time and money."
    ]
    neutral_templates = [
        "This movie was {}.",
        "{} performance from the cast.",
        "An {} film with some issues.",
        "{} entertaining but {}.",
        "The movie had its {} moments.",
        "Nothing {} about this film.",
        "{} average movie overall.",
        "I had {} expectations for this.",
        "The film was {} but {}.",
        "A {} viewing experience."
    ]
    positive_adjectives = ['fantastic', 'amazing', 'brilliant', 'excellent', 'wonderful',
                           'superb', 'outstanding', 'remarkable', 'incredible', 'spectacular']
    negative_adjectives = ['terrible', 'awful', 'horrible', 'dreadful', 'pathetic',
                           'disappointing', 'mediocre', 'poor', 'boring', 'ridiculous']
    neutral_adjectives = ['okay', 'decent', 'average', 'mediocre', 'fair',
                          'moderate', 'reasonable', 'acceptable', 'passable', 'tolerable']
    reviews = []
    for _ in range(num_reviews):
        sentiment = random.choice(['positive', 'negative', 'neutral'])
        # Two adjectives are always passed; templates with one slot ignore the second
        if sentiment == 'positive':
            template = random.choice(positive_templates)
            adj1 = random.choice(positive_adjectives)
            adj2 = random.choice(positive_adjectives)
            review = template.format(adj1, adj2)
        elif sentiment == 'negative':
            template = random.choice(negative_templates)
            adj1 = random.choice(negative_adjectives)
            adj2 = random.choice(negative_adjectives)
            review = template.format(adj1, adj2)
        else:  # neutral
            template = random.choice(neutral_templates)
            adj1 = random.choice(neutral_adjectives)
            adj2 = random.choice(neutral_adjectives)
            review = template.format(adj1, adj2)
        reviews.append((review, sentiment))
    return reviews
def generate_product_reviews(num_reviews=1000):
    """Generate synthetic product reviews"""
    positive_templates = [
        "This product is {}!",
        "{} quality and {} design.",
        "Best {} I've ever purchased.",
        "Absolutely {} value for money.",
        "{} product, highly recommended.",
        "The {} features are impressive.",
        "{} build quality and {} performance.",
        "I'm {} satisfied with this purchase.",
        "This {} exceeded my expectations.",
        "A {} product that delivers."
    ]
    negative_templates = [
        "This product is completely {}.",
        "{} quality and {} design.",
        "Worst {} I've ever purchased.",
        "Absolutely {} waste of money.",
        "{} product, not recommended.",
        "The {} features are useless.",
        "{} build quality and {} performance.",
        "I'm {} disappointed with this.",
        "This {} failed to meet expectations.",
        "A {} product that doesn't work."
    ]
    neutral_templates = [
        "This product is {}.",
        "{} quality but {} design.",
        "An {} product with some issues.",
        "{} value for money.",
        "The product is {} overall.",
        "It has some {} features.",
        "{} build quality and {} performance.",
        "I'm {} about this purchase.",
        "This product is {} but {}.",
        "A {} product that's {}."
    ]
    # Adjective pools for product reviews (mostly shared with the movie lists)
    positive_adjectives = ['fantastic', 'amazing', 'brilliant', 'excellent', 'wonderful',
                           'superb', 'outstanding', 'remarkable', 'incredible', 'great']
    negative_adjectives = ['terrible', 'awful', 'horrible', 'dreadful', 'poor',
                           'disappointing', 'mediocre', 'cheap', 'flimsy', 'broken']
    neutral_adjectives = ['okay', 'decent', 'average', 'mediocre', 'fair',
                          'reasonable', 'acceptable', 'passable', 'functional', 'standard']
    reviews = []
    for _ in range(num_reviews):
        sentiment = random.choice(['positive', 'negative', 'neutral'])
        if sentiment == 'positive':
            template = random.choice(positive_templates)
            adj1 = random.choice(positive_adjectives)
            adj2 = random.choice(positive_adjectives)
            review = template.format(adj1, adj2)
        elif sentiment == 'negative':
            template = random.choice(negative_templates)
            adj1 = random.choice(negative_adjectives)
            adj2 = random.choice(negative_adjectives)
            review = template.format(adj1, adj2)
        else:  # neutral
            template = random.choice(neutral_templates)
            adj1 = random.choice(neutral_adjectives)
            adj2 = random.choice(neutral_adjectives)
            review = template.format(adj1, adj2)
        reviews.append((review, sentiment))
    return reviews
def save_to_csv(reviews, filename):
    """Save reviews to CSV file"""
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['review', 'sentiment'])
        writer.writerows(reviews)
    print(f"Saved {len(reviews)} reviews to {filename}")


def main():
    parser = argparse.ArgumentParser(description='Generate test data for sentiment analysis')
    parser.add_argument('--type', choices=['movie', 'product', 'both'], default='both',
                        help='Type of reviews to generate')
    parser.add_argument('--num', type=int, default=1000,
                        help='Number of reviews to generate per type')
    parser.add_argument('--output', default='data/',
                        help='Output directory')
    args = parser.parse_args()
    if args.type in ['movie', 'both']:
        print(f"Generating {args.num} movie reviews...")
        movie_reviews = generate_movie_reviews(args.num)
        save_to_csv(movie_reviews, f"{args.output}/movie_reviews.csv")
    if args.type in ['product', 'both']:
        print(f"Generating {args.num} product reviews...")
        product_reviews = generate_product_reviews(args.num)
        save_to_csv(product_reviews, f"{args.output}/product_reviews.csv")
    print("Done!")


if __name__ == "__main__":
    main()
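Two details of the generator are worth seeing in isolation. First, `str.format` ignores surplus positional arguments, which is why passing two adjectives to a one-slot template works. Second, the script draws from the global `random` state, so each run produces a different dataset. A seeded `random.Random` instance would make runs repeatable. The sketch below illustrates both; `generate_labels` and its `seed` parameter are hypothetical and not part of the script.

```python
import random

LABELS = ["positive", "negative", "neutral"]


def generate_labels(num_reviews, seed=None):
    """Draw sentiment labels from a private, optionally seeded generator."""
    rng = random.Random(seed)  # does not touch the global random state
    return [rng.choice(LABELS) for _ in range(num_reviews)]


# The second argument is silently ignored because the template has one slot
one_slot = "This movie was {}.".format("okay", "fair")
two_slots = "The film was {} but {}.".format("okay", "fair")
print(one_slot)
print(two_slots)

# The same seed always yields the same sequence of labels
first_run = generate_labels(20, seed=42)
second_run = generate_labels(20, seed=42)
print(first_run == second_run)
```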
27. README.md
# Sentiment Analyzer - Movie & Product Review Analysis

## Overview

A comprehensive sentiment analysis system that classifies movie and product reviews into Positive, Negative, or Neutral categories using multiple machine learning and deep learning models.

## Features

- **Multi-model Support**: Naive Bayes, Logistic Regression, SVM, Random Forest, LSTM, Bi-LSTM
- **Real-time Analysis**: Instant sentiment prediction with confidence scores
- **Batch Processing**: Analyze multiple reviews simultaneously
- **Interactive Dashboard**: Visualize sentiment distributions and model performance
- **RESTful API**: Easy integration with other applications
- **MongoDB Integration**: Persistent storage of reviews and predictions
- **Model Comparison**: Compare performance across different algorithms
- **Export Functionality**: Export results to CSV, JSON, or Excel
- **Docker Support**: Easy deployment with Docker Compose

## Quick Start

### Prerequisites

- Python 3.8+
- MongoDB 4.0+
- Redis (optional, for background tasks)
- Docker (optional)

### Installation

1. **Clone the repository**

       git clone <repository-url>
       cd sentiment-analyzer

2. **Install dependencies**

       pip install -r requirements.txt

3. **Set up environment variables**

       cp .env.example .env
       # Edit .env with your configuration

4. **Start MongoDB**

       # Using Docker
       docker run -d -p 27017:27017 --name mongodb mongo:6.0
       # Or local installation (the service is named mongod in the official packages)
       sudo systemctl start mongod

5. **Run the application**

       # Setup database and train models
       python run.py --mode setup
       # Start API server
       python run.py --mode api
       # Or run interactive CLI
       python run.py --mode interactive

### Using Docker Compose

    docker-compose up -d
## Usage Examples

### Python API

    from models.sentiment_classifier import SentimentClassifier

    # Initialize classifier
    classifier = SentimentClassifier('logistic_regression')
    classifier.load_model()

    # Analyze single review
    result = classifier.predict("This movie was absolutely fantastic!")
    print(f"Sentiment: {result['sentiment']}")
    print(f"Confidence: {result['confidence']}")

    # Batch analysis
    reviews = ["Great movie!", "Terrible film", "It was okay"]
    results = classifier.predict_batch(reviews)

### REST API

    # Analyze sentiment
    curl -X POST http://localhost:5000/api/analyze \
      -H "Content-Type: application/json" \
      -d '{"text": "This movie is amazing!", "model": "logistic_regression"}'

    # Batch analysis
    curl -X POST http://localhost:5000/api/analyze/batch \
      -H "Content-Type: application/json" \
      -d '{"texts": ["Great!", "Bad!"], "model": "lstm"}'

    # Get statistics
    curl http://localhost:5000/api/statistics

### Web Interface

- Open http://localhost:5000 in your browser
- Navigate to Analyze page to test individual reviews
- View Dashboard for statistics and visualizations

## Project Structure
    sentiment-analyzer/
    ├── api/                 # Flask API routes
    ├── config/              # Configuration files
    ├── data/                # Dataset files
    ├── database/            # MongoDB operations
    ├── models/              # ML/DL models
    ├── nginx/               # Nginx configuration
    ├── scripts/             # Utility scripts
    ├── tasks/               # Celery tasks
    ├── tests/               # Unit tests
    ├── utils/               # Helper functions
    ├── web/                 # Web interface
    ├── docker-compose.yml
    ├── requirements.txt
    └── run.py
## Models

### Traditional Machine Learning

- **Naive Bayes**: Fast and efficient for text classification
- **Logistic Regression**: Good baseline with interpretable results
- **SVM**: Effective for high-dimensional spaces
- **Random Forest**: Ensemble method for robust predictions

### Deep Learning

- **LSTM**: Captures long-term dependencies in text
- **Bi-LSTM**: Bidirectional context understanding
- **CNN-LSTM**: Hybrid architecture for local patterns + sequences

## Performance Metrics

- **Accuracy**: Fraction of all reviews classified correctly
- **Precision**: Of the reviews predicted as a class, the fraction that truly belong to it
- **Recall**: Of the reviews that truly belong to a class, the fraction that were found
- **F1-Score**: Harmonic mean of precision and recall
- **Confusion Matrix**: Detailed classification breakdown

## Configuration

### Model Settings (config/model_config.py)

    MAX_FEATURES = 5000          # Maximum vocabulary size
    MAX_SEQUENCE_LENGTH = 200    # Maximum text length for DL models
    BATCH_SIZE = 32              # Training batch size
    EPOCHS = 10                  # Number of training epochs

### Database Settings (config/mongodb_config.py)

    MONGODB_URI = "mongodb://localhost:27017/"
    DATABASE_NAME = "sentiment_analyzer_db"

## Data Format

### Training Data CSV

    review,sentiment
    "This movie is fantastic!",positive
    "Terrible product, waste of money",negative
    "It's an okay film",neutral

## Testing

    # Run all tests
    python -m unittest discover tests
    # Run specific test
    python -m unittest tests.test_sentiment_analyzer.TestSentimentClassifier

## Deployment

### Production Deployment with Docker

    # Build and start all services
    docker-compose up -d --build
    # Scale API servers (requires removing container_name and the fixed
    # host port from the api service, since both must be unique per container)
    docker-compose up -d --scale api=3
    # View logs
    docker-compose logs -f

### Manual Deployment

    # Using Gunicorn
    gunicorn -w 4 -b 0.0.0.0:5000 api.app:app
    # With Nginx as reverse proxy
    # (See nginx/nginx.conf for configuration)
## Security Considerations

- Input validation and sanitization
- Rate limiting on API endpoints
- MongoDB authentication enabled
- HTTPS with SSL/TLS
- Environment variables for sensitive data

## Monitoring

- Prometheus metrics endpoint
- Grafana dashboards
- ELK stack integration
- Sentry for error tracking

## Contributing

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request

## License

MIT License - see LICENSE file for details

## Authors

- Your Name - Initial work

## Acknowledgments

- NLTK for text processing utilities
- Scikit-learn for ML algorithms
- TensorFlow/Keras for deep learning
- MongoDB for database storage

## Support

For issues and questions:

- Create an issue in the repository
- Contact: [email protected]

## Future Enhancements

- [ ] BERT/RoBERTa transformer models
- [ ] Multi-language support
- [ ] Aspect-based sentiment analysis
- [ ] Real-time streaming analysis
- [ ] Active learning for model improvement
- [ ] A/B testing framework
- [ ] Model versioning and rollback
- [ ] Automated retraining pipeline
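The README lists precision, recall, and F1-score without saying how they apply to three classes. A common approach, assumed here, is to compute each metric per class and then average across classes (macro averaging). The sketch below does this in plain Python with illustrative labels; the function names are hypothetical, and the project itself may compute metrics through scikit-learn instead.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_metrics(y_true, y_pred, label)[2] for label in labels) / len(labels)


# Illustrative labels only: the neutral review is misclassified as positive
y_true = ["positive", "negative", "neutral", "positive"]
y_pred = ["positive", "negative", "positive", "positive"]

for label in ("positive", "negative", "neutral"):
    p, r, f = per_class_metrics(y_true, y_pred, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.2f}")
```

Accuracy here is 0.75, while macro F1 is lower because the neutral class is never predicted. This is why per-class metrics are worth tracking alongside accuracy.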
This completes the comprehensive Sentiment Analyzer project! The system includes:
- Multiple ML/DL Models: Traditional ML (Naive Bayes, Logistic Regression, SVM, Random Forest) and Deep Learning (LSTM, Bi-LSTM)
- Full-stack Implementation: Backend API, Web Interface, Database Integration
- Production-ready Features: Docker support, Nginx configuration, Celery tasks
- Testing: Unit tests for text preprocessing, the classifiers, database operations, and model comparison
- Documentation: Detailed README with setup instructions
- Scalability: Designed to handle large volumes of reviews
- Extensibility: Easy to add new models and features
The system can be deployed in various environments and scaled horizontally using Docker containers. It provides both REST API for integration and a user-friendly web interface for interactive use.