info about ordinal encoding
rasbt committed Nov 12, 2019
1 parent 98daed9 commit a6494e4
Showing 2 changed files with 315 additions and 87 deletions.
360 changes: 275 additions & 85 deletions ch04/ch04.ipynb

Large diffs are not rendered by default.

42 changes: 40 additions & 2 deletions ch04/ch04.py
@@ -37,7 +37,15 @@



# *The use of `watermark` is optional. You can install this IPython extension via "`pip install watermark`". For more information, please see: https://github.com/rasbt/watermark.*
# *The use of `watermark` is optional. You can install this Jupyter extension via*
#
# conda install watermark -c conda-forge
#
# or
#
# pip install watermark
#
# *For more information, please see: https://github.com/rasbt/watermark.*


# ### Overview
@@ -134,7 +142,7 @@



# drop rows that have less than 3 real values
# drop rows that have fewer than 4 real values

df.dropna(thresh=4)
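
# A minimal sketch of how `thresh` behaves, assuming a small toy DataFrame (`toy` is a hypothetical stand-in, not the notebook's `df`). With `thresh=4`, only rows that have at least four non-NaN values are kept:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 2 each contain one missing value
toy = pd.DataFrame({'A': [1.0, 5.0, 10.0],
                    'B': [2.0, 6.0, 11.0],
                    'C': [3.0, np.nan, 12.0],
                    'D': [4.0, 8.0, np.nan]})

# thresh=4 keeps rows with at least 4 non-NaN values (only row 0 here)
kept_strict = toy.dropna(thresh=4)

# thresh=3 keeps rows with at least 3 non-NaN values (all three rows here)
kept_loose = toy.dropna(thresh=3)
```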

@@ -167,6 +175,11 @@





df.fillna(df.mean())
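
# As a rough illustration of mean imputation with pandas, assuming an all-numeric toy DataFrame (`toy` is hypothetical, not the notebook's `df`): `toy.mean()` returns one mean per column, computed while skipping NaN, and `fillna` uses each column's mean to fill that column's gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in column C and one in column D
toy = pd.DataFrame({'A': [1.0, 5.0, 10.0],
                    'B': [2.0, 6.0, 11.0],
                    'C': [3.0, np.nan, 12.0],
                    'D': [4.0, 8.0, np.nan]})

# Column means ignore NaN by default: C -> (3 + 12) / 2, D -> (4 + 8) / 2
col_means = toy.mean()

# Each NaN is replaced by the mean of its own column
filled = toy.fillna(col_means)
```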


# ## Understanding the scikit-learn estimator API


@@ -306,6 +319,31 @@
c_transf.fit_transform(X).astype(float)


# ## Optional: Ordinal Encoding

# If we are unsure about the numerical differences between the categories of an ordinal feature, we can also encode them using a thresholded, one-hot-style format. For example, we can split the feature "size" with values M, L, and XL into two new features, "x > M" and "x > L". Let's consider the original DataFrame:



df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2']])

df.columns = ['color', 'size', 'price', 'classlabel']
df


# We can call the `apply` method on a pandas column (here, the `size` Series) with custom lambda expressions to encode these variables using the value-threshold approach:



df['x > M'] = df['size'].apply(lambda x: 1 if x in {'L', 'XL'} else 0)
df['x > L'] = df['size'].apply(lambda x: 1 if x == 'XL' else 0)

del df['size']
df
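
# The same value-threshold idea can be sketched for any number of ordered levels. This is only an illustrative variant, not the notebook's code: `size_order` and `rank` are hypothetical names, and the ordering M < L < XL is assumed from the example above.

```python
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2']],
                  columns=['color', 'size', 'price', 'classlabel'])

# Assumed ordering of the ordinal levels, smallest first
size_order = ['M', 'L', 'XL']

# Map each level to its position in the ordering (M -> 0, L -> 1, XL -> 2)
rank = df['size'].map({level: i for i, level in enumerate(size_order)})

# One 0/1 column per threshold; the largest level needs no column of its own
for i, level in enumerate(size_order[:-1]):
    df[f'x > {level}'] = (rank > i).astype(int)

df = df.drop(columns='size')
```

# With three levels this produces the same two columns, "x > M" and "x > L", as the lambda-based version.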



# # Partitioning a dataset into a separate training and test set
