Commit 5919187

committed
svm for categorical
1 parent dcbe4ff commit 5919187

3 files changed

Lines changed: 21 additions & 0 deletions

faq/README.md

Lines changed: 1 addition & 0 deletions
@@ -133,6 +133,7 @@ Sebastian
133133
- [What is the difference between filter, wrapper, and embedded methods for feature selection?](./feature_sele_categories.md)
134134
- [Should data preparation/pre-processing step be considered one part of feature engineering? Why or why not?](./dataprep-vs-dataengin.md)
135135
- [Is a bag of words feature representation for text classification considered as a sparse matrix?](./bag-of-words-sparsity.md)
136+
- [How can I apply an SVM to categorical data?](./svm_for_categorical_data.md)
136137

137138
##### Naive Bayes
138139

faq/svm_for_categorical_data.md

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
# How can I apply an SVM to categorical data?
2+
3+
I assume you are asking about categorical features, not the target variable, which is already assumed to be categorical (binary) in SVM classifiers.
4+
5+
First, there are two sub-types of categorical features: ordinal and nominal features.
6+
7+
Ordinal means that an "order" is implied. For example, a customer satisfaction metric {'satisfied', 'neutral', 'dissatisfied'} is an ordinal variable since we can order it: 'satisfied' > 'neutral' > 'dissatisfied'. Here, we can simply map the string notation to an integer notation, for example 'satisfied' = 1, 'neutral' = 0, and 'dissatisfied' = -1.
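
The ordinal mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original FAQ entry: the dictionary name `satisfaction_to_int` and the sample values are made up for the example, and only the integer assignments come from the text.

```python
# Hypothetical mapping name; the integer values follow the text above.
satisfaction_to_int = {"dissatisfied": -1, "neutral": 0, "satisfied": 1}

# Made-up example feature column with one value per training sample.
samples = ["satisfied", "dissatisfied", "neutral", "satisfied"]

# Replace each string label with its integer code, preserving the order
# 'satisfied' > 'neutral' > 'dissatisfied'.
encoded = [satisfaction_to_int[s] for s in samples]

print(encoded)
```

The resulting integer column can then be passed to the SVM like any other numeric feature.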
8+
9+
If our variable is *nominal*, an 'order' does not make sense. For example, think of 'color'; there are some cases in image processing where ordering color values makes sense, but in general we can't say 'red > blue > yellow' or so. To deal with such variables in SVM classification, we typically use a "one-hot" encoding. Here, we create so-called dummy variables that take binary values: one dummy variable for each possible value of that nominal feature variable. Say that our color variable can have one of three values: 'red', 'blue', or 'yellow'. And let's say we have the following dataset consisting of 4 training samples:
10+
11+
- sample 1: 'blue'
12+
- sample 2: 'yellow'
13+
- sample 3: 'red'
14+
- sample 4: 'yellow'
15+
16+
Then our one-hot encoding would look like this:
17+
18+
![](svm_for_categorical_data/onehot-color.png)
19+
20+
Note that there's only one "true" value (the integer 1) in each row, which marks the column corresponding to that sample's color. Sample 1 is blue, sample 2 is yellow, and so forth.
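
As a rough sketch, the same one-hot encoding can be produced in plain Python for the four samples listed above. The helper `one_hot` and the column order `red, blue, yellow` are assumptions made for this illustration; the column order in the figure may differ.

```python
# The four training samples from the text.
colors = ["blue", "yellow", "red", "yellow"]

# Assumed column order for the dummy variables (hypothetical choice).
categories = ["red", "blue", "yellow"]


def one_hot(value, categories):
    """Return a list with a 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


# One row per sample, one column per possible color.
encoded = [one_hot(color, categories) for color in colors]

for i, row in enumerate(encoded, start=1):
    print(f"sample {i}: {row}")
```

In practice, library helpers such as scikit-learn's `OneHotEncoder` or pandas' `get_dummies` do the same job and handle details like unseen categories.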