ex6.m
You will use support vector machines (SVMs) with various example 2D datasets.
- Plot Data (in ex6data1.mat)
SVM with Linear Kernel
Try different values of the C parameter with SVMs. Informally, the C parameter is a positive value that controls the penalty for misclassified training examples.
- Plot decision boundary (in ex6data1.mat)
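This step can be sketched as follows. It assumes the assignment's helper functions (svmTrain, linearKernel, visualizeBoundaryLinear) are on the Octave path and that ex6data1.mat provides X and y, as in the exercise:

```octave
% Train a linear SVM on ex6data1.mat with different C values and plot
% the resulting decision boundary (sketch; uses the exercise's helpers).
load('ex6data1.mat');              % provides X, y

for C = [1 100]
  % tol = 1e-3 and max_passes = 20 are the assignment's defaults
  model = svmTrain(X, y, C, @linearKernel, 1e-3, 20);
  visualizeBoundaryLinear(X, y, model);
  title(sprintf('Linear SVM decision boundary, C = %g', C));
end
```

A large C pushes the SVM to classify every training example correctly, which can produce a boundary that bends around outliers.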
Train SVM with RBF Kernel
- Plot Data (in ex6data2.mat)
C: 1, sigma: 0.1
- Plot decision boundary (in ex6data2.mat)
Try different SVM Parameters to train SVM with RBF Kernel
Automatically choose optimal C and sigma based on a cross-validation set.
C list: [0.01 0.03 0.1 0.3 1 3 10 30]
sigma list: [0.01 0.03 0.1 0.3 1 3 10 30]
=> optimal C = 1 and sigma = 0.1
% Octave console output
% C list: [0.01 0.03 0.1 0.3 1 3 10 30]
% sigma list: [0.01 0.03 0.1 0.3 1 3 10 30]
Training ......... Done!
C: 0.010000
sigma: 0.010000
error = 0.56500
====================
Training ........................................ Done!
C: 0.010000
sigma: 0.030000
error = 0.060000
====================
Training ........................................................................ Done!
C: 0.010000
sigma: 0.100000
error = 0.045000
====================
(...)
Training ............................................................................................................ Done!
C: 30.000000
sigma: 3.000000
error = 0.065000
====================
Training ....................................................... Done!
C: 30.000000
sigma: 10.000000
error = 0.10000
====================
Training .................................................. Done!
C: 30.000000
sigma: 30.000000
error = 0.18000
====================
optimal C = 1.000000 and sigma = 0.100000
Program paused. Press enter to continue.
- Plot Data (in ex6data3.mat)
- Plot decision boundary with optimal svm parameters (in ex6data3.mat)
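The cross-validation search above (implemented in dataset3Params.m in the exercise) can be sketched like this, assuming the exercise's svmTrain, gaussianKernel, and svmPredict helpers and that X, y, Xval, yval are loaded from ex6data3.mat:

```octave
% Grid search over C and sigma; pick the pair with the lowest
% misclassification rate on the cross-validation set (sketch).
values = [0.01 0.03 0.1 0.3 1 3 10 30];
best_error = Inf;

for C = values
  for sigma = values
    model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));
    predictions = svmPredict(model, Xval);
    err = mean(double(predictions ~= yval));   % CV misclassification rate
    if err < best_error
      best_error = err;
      best_C = C;
      best_sigma = sigma;
    end
  end
end

fprintf('optimal C = %f and sigma = %f\n', best_C, best_sigma);
```

Note that the error is measured on the cross-validation set, not the training set, so the search does not simply reward overfitting.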
ex6_spam.m
You will use support vector machines to build a spam classifier.
For the purpose of this exercise, you will only be using the body of the email (excluding the email headers).
- Preprocess sample email (in emailSample1.txt, vocab.txt)
convert each email into a vector of features
Given the vocabulary list, we can now map each word in the preprocessed emails into a list of word indices that contains the index of the word in the vocabulary list.
Lower-casing, Stripping HTML, Normalizing URLs, Normalizing Email Addresses, Normalizing Numbers, Normalizing Dollars, Word Stemming, Removal of non-words
vocabulary list: a list of 1899 words
% Octave console output
Preprocessing sample email (emailSample1.txt)
==== Processed Email ====
anyon know how much it cost to host a web portal well it depend on how mani
visitor you re expect thi can be anywher from less than number buck a month
to a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb
if your run someth big to unsubscrib yourself from thi mail list send an
email to emailaddr
=========================
Word Indices:
86 916 794 1077 883 370 1699 790 1822 1831 883 431 1171 794 1002 1893 1364 592 1676 238 162 89 688 945 1663 1120 1062 1699 375 1162 479 1893 1510 799 1182 1237 810 1895 1440 1547 181 1699 1758 1896 688 1676 992 961 1477 71 530 1699 531
Program paused. Press enter to continue.
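The vocabulary lookup that produces this index list (the part the exercise asks you to complete inside processEmail.m) can be sketched as follows, assuming vocabList = getVocabList() holds the 1899 words as a cell array and str is the current stemmed token:

```octave
% Map one stemmed token to its vocabulary index (sketch of the
% lookup loop inside processEmail.m).
for i = 1:length(vocabList)
  if strcmp(str, vocabList{i})
    word_indices = [word_indices; i];   % record the matching index
    break;                              % stop at the first match
  end
end
```

Tokens that do not appear in the vocabulary list are simply skipped, which is why the index list can be shorter than the token stream.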
- Extract Features from Emails (in emailSample1.txt)
the feature xi ∈ {0, 1} for an email corresponds to whether the i-th word in the dictionary occurs in the email. That is, xi = 1 if the i-th word is in the email and xi = 0 if the i-th word is not present in the email.
% Octave console output
Extracting features from sample email (emailSample1.txt)
==== Processed Email ====
anyon know how much it cost to host a web portal well it depend on how mani
visitor you re expect thi can be anywher from less than number buck a month
to a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb
if your run someth big to unsubscrib yourself from thi mail list send an
email to emailaddr
=========================
Length of feature vector: 1899
Number of non-zero entries: 45
Program paused. Press enter to continue.
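The feature mapping itself (emailFeatures.m in the exercise) reduces to setting the indicated entries of a zero vector to 1. A minimal self-contained sketch, using a few of the sample indices above:

```octave
% Binary feature vector over the 1899-word vocabulary:
% x(i) = 1 iff word i occurs in the email, else 0.
n = 1899;                        % size of the vocabulary list
word_indices = [86; 916; 794];   % first few indices from the sample email
x = zeros(n, 1);
x(word_indices) = 1;

fprintf('Length of feature vector: %d\n', length(x));
fprintf('Number of non-zero entries: %d\n', sum(x > 0));
```

Duplicate indices are harmless here: assigning 1 twice still leaves the entry at 1, which matches the "occurs or not" definition of the feature.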
- Train Linear SVM for Spam Classification (in spamTrain.mat, spamTest.mat)
train an SVM to classify between spam (y = 1) and non-spam (y = 0) emails.
spamTrain.mat: 4000 training examples of spam and non-spam email
spamTest.mat: 1000 test examples
% Octave console output
Training Linear SVM (Spam Classification)
(this may take 1 to 2 minutes) ...
Training ........................................................................................................................................................................................................................ Done!
Training Accuracy: 99.850000
Evaluating the trained Linear SVM on a test set ...
Test Accuracy: 99.000000
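The training and evaluation step can be sketched as follows, assuming the exercise's svmTrain, linearKernel, and svmPredict helpers, with spamTrain.mat providing X, y and spamTest.mat providing Xtest, ytest:

```octave
% Train a linear SVM on the spam features and report accuracy (sketch).
load('spamTrain.mat');   % X (4000 x 1899), y

C = 0.1;                 % the C value used in ex6_spam.m
model = svmTrain(X, y, C, @linearKernel);

p = svmPredict(model, X);
fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);

load('spamTest.mat');    % Xtest (1000 x 1899), ytest
p = svmPredict(model, Xtest);
fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);
```

A linear kernel is a reasonable choice here because the feature space is already high-dimensional (1899 binary features), so an explicit nonlinear kernel adds little.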
Troubleshooting:
- error on plotting the decision boundary of SVM with RBF Kernel
% Octave console output
error: set: unknown hggroup property Color
error: called from
__contour__ at line 201 column 5
contour at line 74 column 16
visualizeBoundary at line 21 column 2
ex6 at line 109 column 1
Solution:
rewrite visualizeBoundary.m line 21:
=> contour(X1, X2, vals, [1 1], 'LineColor', 'b');