
Amazon Sentiment Analysis Project

### Project Overview:

**Purpose:** To analyze and classify the sentiment of Amazon reviews using machine learning models, specifically Logistic Regression and Support Vector Machines (SVM).

**Dataset:** The dataset consists of Amazon reviews labeled with scores from 1 to 5. The training data has 3 million samples, while the test data has 650,000 samples. Each sample includes a class index (1-5), a review title, and the review text.

### Step-by-Step Code Explanation and Documentation

1. **Importing Libraries**

- `re`: For text processing with regular expressions.

- `nltk`: For natural language processing.

- `pickle`: For saving and loading machine learning models.

- `pandas` and `numpy`: For data manipulation.

- `requests`: For HTTP requests, potentially for data downloads.

- `joblib`: For saving large data objects efficiently.
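
A minimal sketch of the corresponding import block is shown below; the exact modules and download calls in the original notebook may differ.

```python
import re                      # regular expressions for text cleaning
import pickle                  # serializing trained models
import requests                # HTTP requests, e.g. for downloading data
import numpy as np
import pandas as pd
import joblib                  # efficient persistence of large objects
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the NLTK resources used later
# (newer NLTK versions may also require "punkt_tab").
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
```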

2. **Data Preprocessing Utilities**

- From `nltk` we use specific tools for text processing:

- `word_tokenize`: Splits text into tokens (words).

- `stopwords`: Provides common words like "the" to filter out.

- `PorterStemmer` and `WordNetLemmatizer`: Standardize word variations.
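
As an illustration of these tools on a single review string, assuming the NLTK resources downloaded in the previous snippet (the project's actual cleaning function may combine them differently):

```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

text = "The chargers were working perfectly, highly recommended!"
tokens = word_tokenize(text.lower())                                  # split into word tokens
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]   # drop punctuation and stop words
print([stemmer.stem(t) for t in tokens])          # e.g. "chargers" -> "charger"
print([lemmatizer.lemmatize(t) for t in tokens])  # dictionary-form normalization
```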


3. **Feature Extraction for Text Data**

- `TfidfVectorizer` and `CountVectorizer`: Convert text into numerical term-frequency or TF-IDF features.

- `FreqDistVisualizer` (from the `yellowbrick` library): Visualizes word frequency distributions.
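
A small sketch of the vectorization step; the parameters shown (vocabulary size, n-gram range) are illustrative assumptions, not values from the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["great product fast shipping", "terrible quality broke quickly"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)              # raw term-frequency matrix

tfidf_vec = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
X = tfidf_vec.fit_transform(docs)                   # TF-IDF weighted sparse matrix
print(X.shape, count_vec.get_feature_names_out())
```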

4. **Setting Up Machine Learning Models**

- `SVC` (Support Vector Classifier) and `LogisticRegression`: Models for sentiment classification.

- `train_test_split`: Splits data into training and testing sets.

- `StratifiedShuffleSplit` and `GridSearchCV`: For balanced sampling and parameter tuning.

- `Pipeline`: Combines transformation and modeling steps sequentially.
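
A hedged sketch of this setup: a `Pipeline` chaining TF-IDF with Logistic Regression, tuned with `GridSearchCV` on a tiny synthetic sample. The hyperparameter grid and data are illustrative only; swapping in `SVC` works the same way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC  # alternative classifier: ("clf", SVC(kernel="linear"))

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# Tiny stand-in for the combined review text and its sentiment labels.
X_text = [
    "loved it", "great value", "works perfectly", "excellent buy",
    "hated it", "waste of money", "broke in a week", "terrible quality",
    "it was okay", "average product", "nothing special", "fine i guess",
]
y = ["Positive"] * 4 + ["Negative"] * 4 + ["Neutral"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, test_size=0.25, stratify=y, random_state=42)

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```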

5. **Evaluation Metrics**

- `confusion_matrix`: Tabulates true and false positives/negatives for each class.

- `accuracy_score`: Percentage of correct predictions.

- `classification_report`: Detailed metrics on precision, recall, and F1-score.
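
For example, given a fitted classifier's predictions on held-out data (dummy labels shown here):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = ["Positive", "Negative", "Neutral", "Positive"]
y_pred = ["Positive", "Negative", "Positive", "Positive"]

labels = ["Negative", "Neutral", "Positive"]
print(confusion_matrix(y_true, y_pred, labels=labels))          # counts per true/predicted class
print(accuracy_score(y_true, y_pred))                           # fraction of correct predictions
print(classification_report(y_true, y_pred, zero_division=0))   # precision, recall, F1 per class
```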

6. **Data Visualization**

- `matplotlib` and `seaborn`: Create various plots, like word frequency distributions.

- `WordCloud`: Generates word clouds to highlight frequently occurring words.

7. **Loading and Viewing the Dataset**

- **Data Load:** The dataset is loaded and the first few rows are displayed for verification.

- **Dataset Dimensions:** The DataFrame's shape shows the number of rows and columns, indicating the scale of the data.
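
A sketch of the loading step; the file names and the headerless three-column CSV layout are assumptions based on the dataset description above.

```python
import pandas as pd

train_df = pd.read_csv("train.csv", header=None, names=["rating", "title", "text"])
test_df = pd.read_csv("test.csv", header=None, names=["rating", "title", "text"])

print(train_df.head())    # first few rows for a quick sanity check
print(train_df.shape)     # (rows, columns) shows the scale of the data
print(test_df.shape)
```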

8. **Renaming and Combining Columns**

- Renames columns for clarity, combining title and review text for unified analysis.
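
Illustratively (the actual column names in the notebook may differ):

```python
import pandas as pd

train_df = pd.DataFrame({
    "rating": [5, 1],
    "title": ["Great charger", "Stopped working"],
    "text": ["Charges fast and feels solid.", "Died after two days of use."],
})

train_df = train_df.rename(columns={"title": "review_title", "text": "review_text"})
train_df["review"] = train_df["review_title"] + " " + train_df["review_text"]  # unified text field
print(train_df[["rating", "review"]])
```
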
9. **Assigning Sentiment Labels**

- A function assigns labels for Positive (rating > 3), Negative (rating < 3), and Neutral (rating = 3).
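
A direct sketch of that rule; the function and column names are assumptions.

```python
import pandas as pd

def to_sentiment(rating: int) -> str:
    if rating > 3:
        return "Positive"
    if rating < 3:
        return "Negative"
    return "Neutral"

ratings = pd.Series([5, 4, 3, 2, 1])
print(ratings.apply(to_sentiment).tolist())
# ['Positive', 'Positive', 'Neutral', 'Negative', 'Negative']
```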

10. **Exploratory Data Analysis**

- **Sentiment Distribution**: Visualizes the count of positive, negative, and neutral sentiments.

- **Proportional Distribution of Ratings and Sentiments**: Relative distribution calculations reveal the class balance.
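
A sketch of both views, assuming a DataFrame with `rating` and `sentiment` columns as in the earlier snippets:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "rating": [5, 4, 3, 2, 1, 5, 4, 1],
    "sentiment": ["Positive", "Positive", "Neutral", "Negative",
                  "Negative", "Positive", "Positive", "Negative"],
})

sns.countplot(data=df, x="sentiment")                 # absolute counts per sentiment
plt.show()

print(df["rating"].value_counts(normalize=True))      # proportional rating distribution
print(df["sentiment"].value_counts(normalize=True))   # class balance of the labels
```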

11. **Handling Missing Values**

- Missing values in reviews are filled with empty strings, preventing blank entries.
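
For instance:

```python
import pandas as pd

df = pd.DataFrame({"review": ["Great product", None, "Broke quickly"]})
df["review"] = df["review"].fillna("")    # missing reviews become empty strings rather than NaN
print(df["review"].tolist())
```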

12. **Analyzing Review Lengths**

- Review lengths are grouped and counted, helping understand typical review length distribution.
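
One possible way to do this (the word-count measure is an assumption; the original may count characters instead):

```python
import pandas as pd

df = pd.DataFrame({"review": [
    "Great",
    "Broke after a week of daily use",
    "Okay for the price but the cable is a bit short",
]})

df["length"] = df["review"].str.split().str.len()   # words per review
print(df["length"].describe())                      # typical review length
print(df["length"].value_counts().sort_index())     # counts grouped by length
```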

13. **Data Cleaning Explanation**

- Cleaning steps are detailed for subsequent processing: tokenization, stop-word removal, normalization, and lemmatization.
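
A hedged sketch of such a cleaning function, combining the NLTK tools introduced earlier; the function name and exact order of steps are assumptions.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_review(text: str) -> str:
    text = re.sub(r"[^a-z\s]", " ", text.lower())           # normalization: lower-case, letters only
    tokens = word_tokenize(text)                            # tokenization
    tokens = [t for t in tokens if t not in stop_words]     # stop-word removal
    tokens = [lemmatizer.lemmatize(t) for t in tokens]      # lemmatization
    return " ".join(tokens)

print(clean_review("The chargers WERE working perfectly!!"))
```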

14. **Data Sampling with Stratified Split**

- `StratifiedShuffleSplit` creates a balanced subset that preserves the class distribution, reducing the computational load.
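
A sketch of drawing such a subset; the 20% sample fraction is an assumption.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

df = pd.DataFrame({
    "review": [f"review {i}" for i in range(100)],
    "sentiment": ["Positive"] * 60 + ["Negative"] * 30 + ["Neutral"] * 10,
})

sss = StratifiedShuffleSplit(n_splits=1, train_size=0.2, random_state=42)
sample_idx, _ = next(sss.split(df["review"], df["sentiment"]))
sample = df.iloc[sample_idx]

print(sample["sentiment"].value_counts(normalize=True))   # mirrors the full data's class balance
```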

15. **Word Cloud Visualization**

- A word cloud visually represents word frequencies, with common words appearing larger; the cloud is shaped by an Amazon logo mask.
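
A minimal sketch of this step; the Amazon-logo mask file used in the original is assumed here and left as a comment.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "great product fast shipping great value great charger poor quality"

# The original shapes the cloud with an Amazon-logo image, roughly:
#   mask = numpy.array(PIL.Image.open("amazon_logo.png"))  # path assumed
#   WordCloud(mask=mask, ...)
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```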


This completes the data preparation, making the data ready for vectorization and model training.
