MachineLearningCheatSheet
1. Clean data
- remove unnecessary/redundant columns
- check, among other things, the shape and the unique values of important columns (df.shape,
df.info(), df['column'].value_counts())
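The cleaning steps above could look roughly like this. The DataFrame and its column names (id, notes, length, species) are made up for illustration; only drop, shape, info and value_counts are the pandas calls the notes refer to.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "notes": ["a", "b", "c", "d"],
    "length": [5.1, 4.9, 6.3, 5.8],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

# Remove columns that carry no information for the model.
df = df.drop(columns=["id", "notes"])

# Inspect size, dtypes and class balance of an important column.
n_rows, n_cols = df.shape
df.info()
species_counts = df["species"].value_counts()
print(n_rows, n_cols)
print(species_counts)
```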
2. Visualize
- melt and plot:
  p = pd.DataFrame(X, columns=labels)
  p['class'] = y
  p = p.melt(id_vars='class')  # long format: columns 'class', 'variable', 'value'
  plt.figure(figsize=(10, 5))
  sns.boxplot(data=p, x='variable', y='value', hue='class')
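To see what melting does before plotting, here is a small sketch without the plot itself. The values in X, the feature names in labels and the classes in y are invented for illustration; melt() names the new columns 'variable' and 'value' by default, which are the names the boxplot call uses.

```python
import pandas as pd

# Hypothetical feature matrix, feature names and class labels.
X = [[5.1, 3.5], [4.9, 3.0], [6.3, 3.3]]
labels = ["sepal_length", "sepal_width"]
y = [0, 0, 1]

p = pd.DataFrame(X, columns=labels)
p["class"] = y

# Wide -> long: one row per (sample, feature) pair.
# 'class' is kept as an identifier column so it can be used as hue.
long_p = p.melt(id_vars="class")
print(long_p)
```

The resulting frame has one row per sample and feature, so it can be passed directly to a plot that groups by 'variable' and colours by 'class'.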
3. ML1
- choose the type of model (KNeighborsClassifier, RandomForestClassifier, DummyClassifier,
LogisticRegression, DecisionTreeClassifier)
- fitting example (clf and clf2 stand for the two classifiers being compared; clf2 is created the same way as clf):
  clf = neighbors.KNeighborsClassifier()
  clf.fit(X_train, y_train)
  kfold = model_selection.KFold(n_splits=10)
  clf_cross_results = model_selection.cross_val_score(clf, X, y_transform, cv=kfold)
  clf2_cross_results = model_selection.cross_val_score(clf2, X, y_transform, cv=kfold)
  sns.boxplot(data=[clf_cross_results, clf2_cross_results])
  plt.xticks([0, 1], ['clf', 'clf2'])
  plt.xlabel('Model')
  plt.ylabel('Accuracy')
  plt.title('Cross validated accuracies')
  plt.show()
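A self-contained sketch of the same comparison, without the plot. The synthetic data from make_classification stands in for the real X and y, and the choice of KNeighborsClassifier versus a DummyClassifier baseline is just one possible pairing; shuffle and random_state on KFold are additions for reproducibility.

```python
from sklearn import model_selection, neighbors
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier

# Synthetic data stands in for the real X and y (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

knn = neighbors.KNeighborsClassifier()
dummy = DummyClassifier(strategy="most_frequent")  # baseline to beat

# The same folds are used for both models so the scores are comparable.
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
knn_scores = model_selection.cross_val_score(knn, X, y, cv=kfold)
dummy_scores = model_selection.cross_val_score(dummy, X, y, cv=kfold)

print("knn mean accuracy:  ", knn_scores.mean())
print("dummy mean accuracy:", dummy_scores.mean())
```

cross_val_score clones and refits the estimator on every fold, so no separate fit call is needed before it.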
4. GridSearch
- determine the parameters for the model (e.g. C (regularization parameter) or gamma
(kernel bandwidth))
_____________________________________________________________________________
from sklearn.svm import SVC

best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # Initialize SVC model for given combination of parameters
        svm = SVC(gamma=gamma, C=C)
        # Train on train set
        svm.fit(X_train, y_train)
        # Evaluate on validation set
        score = svm.score(X_validation, y_validation)
        # Store the best score
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}
_____________________________________________________________________________
svm = SVC(**best_parameters)
# Train on train/validation set
svm.fit(X_trainval, y_trainval)
# Score on test set
test_score = svm.score(X_test, y_test)
_____________________________________________________________________________
cv = 2 recommended!
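The manual loop above can also be expressed with scikit-learn's GridSearchCV, which replaces the single validation set with cross-validation on the train/validation data. This is a sketch on synthetic data (an assumption); cv=2 follows the note above, and the parameter grid reuses the values from the loop.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data stands in for the real X and y (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)

param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1, 10, 100],
}

# Every C/gamma combination is scored with 2-fold cross-validation.
grid = GridSearchCV(SVC(), param_grid, cv=2)
grid.fit(X_trainval, y_trainval)

# By default the best model is refit on all of X_trainval after the search.
best_parameters = grid.best_params_
test_score = grid.score(X_test, y_test)
print(best_parameters, test_score)
```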
5. Pipeline
- sklearn object that runs through the steps in a supplied [LIST] of (TUPLES)
Each tuple contains a chosen name and an instance of an
estimator.
e.g. Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
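A minimal sketch of fitting and scoring the pipeline from the example. The synthetic data and the value C=10 are assumptions for illustration; the step names "scaler" and "svm" are the ones from the example above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic data stands in for the real X and y (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# fit() fits the scaler on the training data only, then trains the SVC
# on the scaled result; score() applies the same scaling to the test data.
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)
print(test_score)

# Parameters of a step are addressed as <step name>__<parameter>.
pipe.set_params(svm__C=10)
```

The same double-underscore names (for example svm__C, svm__gamma) can be used as keys in a parameter grid when the pipeline is tuned with a grid search.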
6. XGBoost