A comprehensive sentiment analysis application movie reviews using NLP and machine learning techniques. This project includes both the training notebook and an interactive Streamlit web application.
This project analyzes the sentiment of IMDb movie reviews, classifying them as either positive or negative. It implements two machine learning models:
- Logistic Regression (88.47% accuracy)
- Naive Bayes (85.20% accuracy)
- Data Preprocessing: Text cleaning, tokenization, lemmatization
- Feature Extraction: TF-IDF vectorization with 5000 features
- Model Training: Logistic Regression and Naive Bayes classifiers
- Performance Evaluation: Accuracy, precision, recall, F1-score, confusion matrices
- Data Visualization: Word frequency analysis, performance comparisons
- Project Overview: Complete project documentation and insights
- Training Results: Detailed model performance analysis
- Model Inference: Real-time sentiment prediction interface
- Data Analysis: Dataset exploration and visualization
Sentiment Analysis for IMDb Reviews/
├── Sentiment_Analysis_for_IMDb_Reviews.ipynb # Training notebook
├── app.py # Streamlit application
├── requirements.txt # Python dependencies
├── README.md # Project documentation
├── .gitignore # Git ignore file
├── saved_models/ # Trained models
│ ├── lr_model.pkl # Logistic Regression model
│ └── nb_model.pkl # Naive Bayes model
└── IMDB Dataset.csv # Dataset (not in git)
- Python 3.8 or higher
- pip package manager
-
Clone the repository
git clone <repository-url> cd "Sentiment Analysis for IMDb Reviews"
-
Install dependencies
pip install -r requirements.txt
-
Download the dataset
- Place
IMDB Dataset.csv
in the project root directory - The dataset contains 50,000 movie reviews with sentiment labels
- Place
-
Run the Streamlit application
streamlit run app.py
-
Open your browser
- Navigate to
http://localhost:8501
- Explore the different pages using the sidebar navigation
- Navigate to
- Complete project documentation
- Technical approach explanation
- Dataset information
- Technology stack details
- Model performance comparison
- Detailed metrics and visualizations
- Confusion matrices
- Key insights and analysis
- Enter a movie review in the text area
- Click "Analyze Sentiment" to get predictions
- View results from both models with confidence scores
- Try example reviews for quick testing
- Dataset statistics and distribution
- Word frequency analysis
- Text preprocessing insights
- Model comparison details
- Text Cleaning: Convert to lowercase, remove special characters
- Tokenization: Split text into individual words
- Stop Word Removal: Remove common words (the, and, is, etc.)
- Lemmatization: Convert words to base form
- TF-IDF Vectorization: Convert to numerical features
- Logistic Regression: Linear classifier with regularization
- Naive Bayes: Probabilistic classifier based on Bayes theorem
- Feature Space: 5000 TF-IDF features
- Evaluation: 80-20 train-test split
Model | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
Logistic Regression | 88.47% | 0.89 | 0.88 | 0.88 |
Naive Bayes | 85.20% | 0.85 | 0.85 | 0.85 |
- Python: Core programming language
- NLTK: Natural language processing
- Scikit-learn: Machine learning algorithms
- Pandas: Data manipulation and analysis
- Matplotlib/Seaborn: Data visualization
- Plotly: Interactive visualizations
- Streamlit: Web application framework
- Joblib: Model serialization
- Source: IMDb movie reviews dataset
- Size: 50,000 reviews
- Classes: Positive (1) and Negative (0)
- Balance: 50% positive, 50% negative
- Features: Text reviews with sentiment labels
- Logistic Regression performs slightly better overall
- Both models handle balanced datasets well
- TF-IDF features effectively capture sentiment patterns
- Text preprocessing significantly improves performance
- Positive words: good, story, character, great, see
- Negative words: even, bad, terrible, disappointing, boring
- Some words appear in both categories with different contexts
Created by Abdelmonem Hatem
- IMDb for providing the dataset
- NLTK and Scikit-learn communities
- Streamlit for the amazing web framework
- Open source community for various libraries and tools
Note: Make sure to add the IMDB Dataset.csv
file to your .gitignore
before pushing to GitHub, as it's not included in this repository due to size constraints.