Decision Tree
Data Inspection
In [3]: df.head()
Out[3]: RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProduc
In [4]: df.shape
Out[4]: (10000, 14)
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 RowNumber 10000 non-null int64
1 CustomerId 10000 non-null int64
2 Surname 10000 non-null object
3 CreditScore 10000 non-null int64
4 Geography 10000 non-null object
5 Gender 10000 non-null object
6 Age 10000 non-null int64
7 Tenure 10000 non-null int64
8 Balance 10000 non-null float64
9 NumOfProducts 10000 non-null int64
10 HasCrCard 10000 non-null int64
11 IsActiveMember 10000 non-null int64
12 EstimatedSalary 10000 non-null float64
13 Exited 10000 non-null int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB
In [6]: df.isnull().sum()
Out[6]: RowNumber 0
CustomerId 0
Surname 0
CreditScore 0
Geography 0
Gender 0
Age 0
Tenure 0
Balance 0
NumOfProducts 0
HasCrCard 0
IsActiveMember 0
EstimatedSalary 0
Exited 0
dtype: int64
In [7]: df[df.duplicated()]
Out[7]: RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts
In [8]: df.describe(include='all')
Data Wrangling
In [9]: df.drop(columns=['RowNumber','CustomerId','Surname'],inplace=True)
In [10]: df.rename(columns={"Exited":"Churned"},inplace=True)
df["Churned"] = df["Churned"].replace({0:"No",1:"Yes"})
In [11]: df.head()
Out[11]: CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember Esti
Out[12]: Text(0.5, 0, 'Churned')
In [13]: sns.set_style('whitegrid')
sns.countplot(x='Geography',hue='Churned',data=df)
Out[13]: <AxesSubplot:xlabel='Geography', ylabel='count'>
In [14]: sns.set_style('whitegrid')
sns.countplot(x='Gender',hue='Churned',data=df)
Out[14]: <AxesSubplot:xlabel='Gender', ylabel='count'>
In [15]: sns.set_style('whitegrid')
sns.countplot(x='NumOfProducts',hue='Churned',data=df)
Out[15]: <AxesSubplot:xlabel='NumOfProducts', ylabel='count'>
In [16]: sns.set_style('whitegrid')
sns.countplot(x='HasCrCard',hue='Churned',data=df)
Out[16]: <AxesSubplot:xlabel='HasCrCard', ylabel='count'>
In [17]: sns.set_style('whitegrid')
sns.countplot(x='IsActiveMember',hue='Churned',data=df)
Out[17]: <AxesSubplot:xlabel='IsActiveMember', ylabel='count'>
In [18]: plt.figure(figsize=(8,8))
sns.set_style('whitegrid')
sns.countplot(x='Tenure',hue='Churned',data=df)
Out[18]: <AxesSubplot:xlabel='Tenure', ylabel='count'>
In [19]: sns.boxplot(x='Churned',y='CreditScore',data=df)
Out[19]: <AxesSubplot:xlabel='Churned', ylabel='CreditScore'>
In [20]: sns.histplot(df['Age'], kde=True, stat='density')
Out[20]: <AxesSubplot:xlabel='Age', ylabel='Density'>
In [21]: sns.boxplot(x='Churned',y='EstimatedSalary',data=df)
Out[21]: <AxesSubplot:xlabel='Churned', ylabel='EstimatedSalary'>
Feature Engineering
In [22]: def products(col):
    for i in col:
        if i==1:
            return 'One product'
        if i==2:
            return 'Two product'
        if i>2:
            return 'More Than 2 Products'
In [23]: df['Product']=df[["NumOfProducts"]].apply(products,axis=1)
In [24]: df.drop(columns='NumOfProducts',inplace=True)
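The row-wise `apply` used for the `Product` column can also be written as a vectorized binning, which avoids a Python-level loop over 10,000 rows. The following is a minimal sketch, assuming `NumOfProducts` holds positive integers. The helper name `bucket_products`, the `"Unknown"` fallback label, and the toy frame are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def bucket_products(num_products: pd.Series) -> pd.Series:
    """Map product counts to the same three labels returned by products()."""
    n = num_products.to_numpy()
    labels = np.select(
        [n == 1, n == 2, n > 2],
        ["One product", "Two product", "More Than 2 Products"],
        default="Unknown",  # hypothetical fallback for values below 1
    )
    return pd.Series(labels, index=num_products.index, name="Product")

# Toy data standing in for df["NumOfProducts"].
toy = pd.DataFrame({"NumOfProducts": [1, 2, 3, 4, 1]})
toy["Product"] = bucket_products(toy["NumOfProducts"])
print(toy)
```

The labels are kept identical to those in `products` so that the dummy columns created later would have the same names.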
In [25]: sns.countplot(x='Product',hue='Churned',data=df)
Out[25]: <AxesSubplot:xlabel='Product', ylabel='count'>
In [27]: df['Account_Balance']=df[['Balance']].apply(balance,axis=1)
In [28]: df.drop(columns='Balance',inplace=True)
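The cell that defines `balance` (In [26]) is missing from this export, so its actual logic is unknown. Purely to illustrate the pattern, here is a hypothetical stand-in that mirrors the shape of `products` and assumes the intent was to separate zero-balance accounts from the rest. Both the rule and the label strings are guesses, not the author's code.

```python
import pandas as pd

def balance(col):
    """Hypothetical stand-in: label a row by whether its balance is zero."""
    for value in col:
        if value == 0:
            return "Zero Balance"            # assumed label
        return "More Than Zero Balance"      # assumed label

# Toy data standing in for df[["Balance"]].
toy = pd.DataFrame({"Balance": [0.0, 83807.86, 0.0, 125510.82]})
toy["Account_Balance"] = toy[["Balance"]].apply(balance, axis=1)
print(toy)
```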
Data Preparation
In [29]: df = pd.get_dummies(columns=["Geography","Gender","Product","Account_Balance"],data=df)
In [30]: df["Churned"] = df["Churned"].map({"No":0,"Yes":1})
In [32]: X = df.drop(columns=["Churned"])
y = df["Churned"]
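The cells that split the data and construct `decision_tree` (In [33] to In [36]) are not included in this export. The sketch below shows one plausible setup. The synthetic data, the 80/20 split, `cv=5`, and every value in `param_grid` are assumptions; only the grid's keys are taken from the `best_params_` output reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared churn features (X) and target (y).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Assumed 80/20 split; the original split cell is not shown.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid: keys match the reported best_params_, values are guesses.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7],
    "max_features": [None],
    "min_samples_leaf": [1, 3],
    "min_samples_split": [2, 8],
    "random_state": [42],
    "splitter": ["best"],
}

# Exhaustive search over the grid, scored by cross-validated accuracy.
decision_tree = GridSearchCV(
    DecisionTreeClassifier(), param_grid, cv=5, scoring="accuracy"
)
decision_tree.fit(x_train, y_train)
print(decision_tree.best_params_)
```

Placing `random_state` inside the grid, as the reported `best_params_` suggests was done, fixes the seed for every candidate without changing the search itself.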
In [37]: decision_tree.fit(x_train,y_train)
Out[37]: ▸ GridSearchCV
▸ estimator: DecisionTreeClassifier
▸ DecisionTreeClassifier
In [38]: decision_tree.best_params_
Out[38]: {'criterion': 'gini',
 'max_depth': 7,
 'max_features': None,
 'min_samples_leaf': 3,
 'min_samples_split': 8,
 'random_state': 42,
 'splitter': 'best'}
In [39]: decision_tree.best_score_
Out[39]: 0.8561250000000001
Accuracy
In [41]: round(accuracy_score(y_train,y_train_pred)*100,2)
Out[41]: 86.9
In [43]: round(accuracy_score(y_test,y_test_pred)*100,2)
Out[43]: 85.75
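The cells that produce `y_train_pred` and `y_test_pred` (In [40] and In [42]) are not shown. A minimal sketch of how such predictions are typically obtained and scored follows. It runs on synthetic data with an assumed 80/20 split and uses the hyperparameters from the `best_params_` output reported earlier; the variable names `clf`, `train_acc`, and `test_acc` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data; the real split cell is not shown.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters copied from the reported best_params_.
clf = DecisionTreeClassifier(
    criterion="gini",
    max_depth=7,
    min_samples_leaf=3,
    min_samples_split=8,
    random_state=42,
)
clf.fit(x_train, y_train)

# Predict on both partitions so train and test accuracy can be compared.
y_train_pred = clf.predict(x_train)
y_test_pred = clf.predict(x_test)

train_acc = round(accuracy_score(y_train, y_train_pred) * 100, 2)
test_acc = round(accuracy_score(y_test, y_test_pred) * 100, 2)
print(train_acc, test_acc)
```

A fitted `GridSearchCV` with the default `refit=True` exposes `predict` directly, so `decision_tree.predict(x_train)` would use the best estimator without refitting by hand. A small gap between the two accuracies, as in the reported 86.9 versus 85.75, suggests limited overfitting.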
In [47]: plt.figure(figsize=(10,10))
feat_imp.plot(kind='barh')
plt.xlabel("Gini Importance")
plt.ylabel("Feature")
plt.title("Feature Importance");
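The cell that builds `feat_imp` is not shown in this export. One common way to construct it is to wrap the fitted tree's `feature_importances_` in a pandas Series indexed by column name, as sketched below on synthetic data. The column names and the variable `tree` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared feature frame X and target y.
X_arr, y = make_classification(n_samples=500, n_features=6, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

tree = DecisionTreeClassifier(max_depth=7, random_state=42).fit(X, y)

# Gini importances, sorted ascending so the largest bar ends up at the
# top of a horizontal bar chart drawn with feat_imp.plot(kind='barh').
feat_imp = pd.Series(tree.feature_importances_, index=X.columns).sort_values()
print(feat_imp)
```

With a grid search object, the same attribute is available as `decision_tree.best_estimator_.feature_importances_`.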
Implementing Random Forest Classifier
In [48]: model = RandomForestClassifier()
In [50]: forest.fit(x_train,y_train)
Out[50]: ▸ GridSearchCV
▸ estimator: RandomForestClassifier
▸ RandomForestClassifier
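The cell that wraps `model` in a grid search to create `forest` (In [49]) is missing from this export. The sketch below shows one way it could look. The synthetic data, the split, `cv=5`, the `random_state` passed to the classifier, and all grid values are assumptions; only the grid's keys come from the `best_params_` output reported for the forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the churn data; the real split cell is not shown.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# random_state is added here for reproducibility; the original omits it.
model = RandomForestClassifier(random_state=42)

# Hypothetical grid: keys match the reported best_params_, values are guesses.
param_grid = {
    "criterion": ["gini"],
    "max_depth": [4, 8],
    "min_samples_leaf": [4],
    "min_samples_split": [4],
    "n_estimators": [30, 70],
}

forest = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
forest.fit(x_train, y_train)
print(forest.best_params_, forest.best_score_)
```

Without a fixed seed, a random forest's cross-validated score can vary slightly from run to run, which makes small differences between models harder to interpret.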
In [51]: forest.best_params_
Out[51]: {'criterion': 'gini',
 'max_depth': 8,
 'min_samples_leaf': 4,
 'min_samples_split': 4,
 'n_estimators': 70}
In [52]: forest.best_score_
Out[52]: 0.8618750000000001
Accuracy
In [55]: round(accuracy_score(y_train,y_train_pred)*100,2)
Out[55]: 86.9
In [56]: round(accuracy_score(y_test,y_test_pred)*100,2)
Out[56]: 85.75
In [60]: plt.figure(figsize=(10,10))
feat_imp.plot(kind='barh')
plt.xlabel("Gini Importance")
plt.ylabel("Feature")
plt.title("Feature Importance");