info about ordinal encoding

rahul5757 · Nov 12, 2019 · a6494e4
1 parent 98daed9
commit a6494e4
Showing 2 changed files with 315 additions and 87 deletions.
diff --git a/ch04/ch04.ipynb b/ch04/ch04.ipynb
diff --git a/ch04/ch04.py b/ch04/ch04.py
@@ -37,7 +37,15 @@
 
 
 
-# *The use of `watermark` is optional. You can install this IPython extension via "`pip install watermark`". For more information, please see: https://github.com/rasbt/watermark.*
+# *The use of `watermark` is optional. You can install this Jupyter extension via*  
+# 
+#     conda install watermark -c conda-forge  
+# 
+# or  
+# 
+#     pip install watermark   
+# 
+# *For more information, please see: https://github.com/rasbt/watermark.*
 
 
 # ### Overview
@@ -134,7 +142,7 @@
 
 
 
-# drop rows that have less than 3 real values 
+# drop rows that have fewer than 4 real values 
 
 df.dropna(thresh=4)
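
Note on `thresh`: in pandas, `dropna(thresh=n)` keeps rows that have at least `n` non-NaN values, so `thresh=4` drops rows with fewer than four real values. The following is a minimal sketch on a toy frame; `df_demo` and its values are hypothetical and not part of this commit.

```python
import numpy as np
import pandas as pd

# Toy frame with four columns (hypothetical data, for illustration only)
df_demo = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],         # 4 real values
     [5.0, 6.0, np.nan, 8.0],      # 3 real values
     [10.0, 11.0, 12.0, np.nan]],  # 3 real values
    columns=['A', 'B', 'C', 'D'])

# thresh=4: keep only rows with at least 4 non-NaN values
kept_4 = df_demo.dropna(thresh=4)

# thresh=3: keep rows with at least 3 non-NaN values
kept_3 = df_demo.dropna(thresh=3)

print(kept_4)
print(kept_3)
```

With this toy data, only the first row survives `thresh=4`, while all three rows survive `thresh=3`.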
 
@@ -167,6 +175,11 @@
 
 
 
+
+
+df.fillna(df.mean())
+
+
 # ## Understanding the scikit-learn estimator API
 
 
@@ -306,6 +319,31 @@
 c_transf.fit_transform(X).astype(float)
 
 
+# ## Optional: Ordinal Encoding
+
+# If we are unsure about the numerical differences between the categories of ordinal features, we can also encode them using a thresholded one-hot encoded format. For example, we can split the feature "size" with values M, L, and XL into two new features, "x > M" and "x > L". Let's consider the original DataFrame:
+
+
+
+df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
+                   ['red', 'L', 13.5, 'class1'],
+                   ['blue', 'XL', 15.3, 'class2']])
+
+df.columns = ['color', 'size', 'price', 'classlabel']
+df
+
+
+# We can use the `apply` method of pandas' DataFrames to write custom lambda expressions in order to encode these variables using the value-threshold approach:
+
+
+
+df['x > M'] = df['size'].apply(lambda x: 1 if x in {'L', 'XL'} else 0)
+df['x > L'] = df['size'].apply(lambda x: 1 if x == 'XL' else 0)
+
+del df['size']
+df
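
The two hand-written lambdas above could be generalized to any ordered list of categories. The following is a minimal sketch, not part of this commit: `threshold_encode` and its `ordered_levels` parameter are hypothetical names introduced here for illustration.

```python
import pandas as pd

def threshold_encode(series, ordered_levels):
    """Hypothetical helper: build one 0/1 column 'x > level' for every
    level except the highest, given the levels in ascending order."""
    rank = {level: i for i, level in enumerate(ordered_levels)}
    ranks = series.map(rank)
    if ranks.isna().any():
        raise ValueError('series contains values not in ordered_levels')
    cols = {}
    for i, level in enumerate(ordered_levels[:-1]):
        # 1 if the value ranks strictly above this level, else 0
        cols[f'x > {level}'] = (ranks > i).astype(int)
    return pd.DataFrame(cols, index=series.index)

sizes = pd.Series(['M', 'L', 'XL'], name='size')
encoded = threshold_encode(sizes, ['M', 'L', 'XL'])
print(encoded)
```

For the three sizes used above, this yields the same "x > M" and "x > L" columns as the lambda expressions: M maps to (0, 0), L to (1, 0), and XL to (1, 1).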
+
+
 
 # # Partitioning a dataset into a separate training and test set