Random Forest Classification on Social Network Ads Dataset
Project Title: Random Forest Classification on Social Network Ads
Name: Naveen Kumar
Date: August 1, 2025
Abstract
This project uses the Random Forest Classification algorithm to predict whether a user on a social network
will purchase a product based on their age and estimated salary. The dataset is preprocessed using standard
scaling, and results are evaluated using accuracy and a confusion matrix. Visualization of the decision
boundary shows clear class separation. The model reaches 93% accuracy on the test set, which
illustrates the effectiveness of ensemble learning on this task.
Table of Contents
1. Introduction
2. Literature Review
3. Problem Statement
4. Data Collection and Preprocessing
5. Methodology
6. Implementation
7. Results
8. Discussion
9. Conclusion
10. References
11. Appendices
12. Acknowledgments
Introduction
This project explores a supervised machine learning approach to predict user behavior in online
advertisements. By using Random Forest, a powerful ensemble method, we aim to improve classification
accuracy over traditional single-tree models. The goal is to predict whether a person will buy a product
based on simple demographic inputs such as age and estimated salary.
Literature Review
Ensemble methods such as Random Forest are widely used to reduce overfitting and improve prediction
accuracy. A single decision tree is prone to high variance: small changes in the training data can produce a
very different tree. Random Forest (Breiman, 2001) reduces this variance by training many trees on bootstrap
samples with random feature subsets and combining their votes. Applications in marketing and user behavior
prediction have reported gains from such methods.
Problem Statement
Predict whether a user will purchase a product based on age and estimated salary.
Assumptions:
- Only two features are used.
- Binary classification (0 = No, 1 = Yes).
Limitations:
- Does not account for other possible influences (e.g., device used, browsing time, gender).
Data Collection and Preprocessing
Dataset: Social_Network_Ads.csv
Features: Age, Estimated Salary
Target: Purchased (0 or 1)
Data was split into training and testing sets (75% train, 25% test). Feature scaling was applied using
StandardScaler.
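The split and scaling steps described above can be sketched as follows. This is a minimal illustration, not the project's actual script: the data is a synthetic stand-in for Social_Network_Ads.csv, and the row count, value ranges, labeling rule, and the variable names (scaler, x_train_scaled, x_test_scaled) are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for Social_Network_Ads.csv (assumption: 400 rows with
# Age and Estimated Salary features and a binary Purchased label).
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15000, 150001, size=400)
X = np.column_stack([age, salary]).astype(float)
y = ((age > 40) | (salary > 90000)).astype(int)  # invented labeling rule

# 75% train / 25% test, as described in the report.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training set only, then reuse it on the test set,
# so that no test-set statistics leak into preprocessing.
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

print(x_train_scaled.shape, x_test_scaled.shape)
```

Fitting the scaler on the training portion alone is a common design choice; the report does not state which portion the scaler was fitted on.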
Methodology
We used Random Forest Classification with:
- 10 decision trees (n_estimators=10)
- Entropy criterion for split decisions
Rationale:
- Combining the votes of many trees reduces the overfitting typical of a single decision tree
- Handles classes that are not linearly separable better than linear models
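To make the entropy criterion concrete, the sketch below computes Shannon entropy and the information gain of a candidate split on a toy label list. The function names (entropy, information_gain) and the toy labels are illustrative assumptions; scikit-learn performs this computation internally and does not expose these helpers.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    children = weight_left * entropy(left) + weight_right * entropy(right)
    return entropy(parent) - children


# Toy node with four non-buyers (0) and four buyers (1).
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A split that separates the classes perfectly removes all uncertainty.
perfect_gain = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])

# A split that leaves both children as mixed as the parent gains nothing.
useless_gain = information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(entropy(parent), perfect_gain, useless_gain)
```

At each node, a tree grown with the entropy criterion prefers the candidate split with the largest information gain.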
Implementation
Language: Python
Libraries: scikit-learn, pandas, matplotlib, numpy
Code Example:
from sklearn.ensemble import RandomForestClassifier
# x_train and y_train are the scaled training features and labels from the preprocessing step
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)
Results
Model Predictions:
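classifier.predict([[30, 87000]]) -> [1]
classifier.predict([[40, 0]]) -> [0]
Note: these calls pass raw age and salary values. If the classifier was trained on standardized features,
new inputs should first be transformed with the same fitted StandardScaler, so the two outputs above
should be read with that caveat in mind.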
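A minimal sketch of predicting for a single new user when the model has been trained on standardized features is shown below. The data is a synthetic stand-in with an invented labeling rule, and the names scaler, new_user, and prediction are assumptions for illustration; the printed class therefore says nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): the labeling rule is invented.
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15000, 150001, size=400)
X = np.column_stack([age, salary]).astype(float)
y = ((age > 40) | (salary > 90000)).astype(int)

# Standardize the features and train on the standardized values.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

classifier = RandomForestClassifier(
    n_estimators=10, criterion="entropy", random_state=0
)
classifier.fit(X_scaled, y)

# A new observation must go through the same fitted scaler before predict,
# otherwise the trees compare raw values against standardized thresholds.
new_user = np.array([[30, 87000]], dtype=float)
prediction = classifier.predict(scaler.transform(new_user))
print(prediction)
```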
Confusion Matrix (rows: actual class, columns: predicted class):
[[64  4]
 [ 3 29]]
Accuracy: 93% (93 of 100 test observations classified correctly)
Decision boundary shows clear class separation.
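One common way to draw such a decision boundary is to evaluate the classifier on a dense grid over the two features and color each grid point by its predicted class. The sketch below assumes synthetic, already standardized features and an invented labeling rule; the grid step, colors, and axis labels are illustrative choices and not taken from the project script.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic, already standardized stand-in features (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = ((X[:, 0] > 0.3) | (X[:, 1] > 0.5)).astype(int)  # invented labeling rule

classifier = RandomForestClassifier(
    n_estimators=10, criterion="entropy", random_state=0
)
classifier.fit(X, y)

# Build a grid that covers the data with a one-unit margin on each side.
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(
    np.arange(x_min, x_max, 0.05), np.arange(y_min, y_max, 0.05)
)

# Predict a class for every grid point and reshape back to the grid.
grid = np.column_stack([xx.ravel(), yy.ravel()])
zz = classifier.predict(grid).reshape(xx.shape)

# Filled contours show the predicted regions; points show the data.
fig, ax = plt.subplots()
ax.contourf(xx, yy, zz, alpha=0.3, cmap="coolwarm")
ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k", s=15)
ax.set_xlabel("Age (standardized)")
ax.set_ylabel("Estimated Salary (standardized)")
ax.set_title("Random Forest decision regions (synthetic data)")
```

Calling fig.savefig with a file name, or plt.show() under an interactive backend, renders the figure. Random Forest regions typically appear as axis-aligned rectangles because each tree splits on one feature at a time.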
Discussion
The model classified 93 of the 100 test observations correctly (93% accuracy), with 4 false positives and
3 false negatives. These misclassifications may be due to the limited feature space or to overlapping
classes in the dataset.
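The error counts can be turned into per-class metrics with a few lines of arithmetic. The sketch below assumes the reported confusion matrix follows scikit-learn's layout (rows are actual classes, columns are predicted classes); precision, recall, and F1 are not reported in the original results and are derived here only from the four matrix entries.

```python
# Confusion matrix entries from the Results section, read with
# scikit-learn's layout (assumption): [[TN, FP], [FN, TP]].
tn, fp = 64, 4
fn, tp = 3, 29

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted buyers who actually bought
recall = tp / (tp + fn)     # share of actual buyers the model identified
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Because the test set contains 68 non-buyers and only 32 buyers, precision and recall for the buyer class give a more detailed picture than accuracy alone.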
Conclusion
The Random Forest model reached 93% test accuracy on this binary classification task. The approach
could be extended with additional features, which may improve predictive power further. This work
suggests that ensemble models are applicable to marketing and recommendation problems.
References
- Scikit-learn documentation
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
- Dataset: Social_Network_Ads.csv
Appendices
How to Reproduce:
1. Install dependencies: pip install numpy pandas matplotlib scikit-learn
2. Run the script: python social_ads_rf.py
Acknowledgments
Thanks to the open-source community and contributors of scikit-learn and matplotlib.