Support Vector Machines:
- Suppose a 2D feature vector
  - As in, 2 features
  - That's not very many
- When we graph them, we get groups that can be separated
  - Good! So our hypothesis is a line
- We'll need to calculate a line that maximizes the margin between the points
- What if we had three features?
  - We'd have to graph them in 3D
  - Harder to draw, but I'm sure we can picture it
  - The hypothesis is now a plane instead of a line
- How about 4 features?
  - Now it's hard to imagine, not just draw!
  - It's a hyperplane
- Hyperplane = subspace with one less dimension than its space
  - So a line is a hyperplane in 2D
  - n-dimensional space has hyperplanes with n-1 dimensions

Representing a hyperplane:
- Normal vector and offset along that vector
  - w is the normal vector, b is the offset (formulas collected after these notes)
- That's not the only way to describe a hyperplane
  - Back to 3D: any two linearly independent vectors span a plane
- But in an SVM, we'll use w and b

Hard-margin SVM:
- Define two parallel hyperplanes, with the margin between them
- Follow the hyperplane's normal vector to the data to compute distance
  - Is there a closed form for that? And is it important that there be? Yes!
  - Distance from a point to a plane (written out below)
  - We'll avoid getting into the weeds on this point
- Data between the planes is uncertain
  - If pressed, we can use the middle
  - There wasn't any training data there
- To train it, we calculate that hyperplane
  - But how?
- Consider it a constraint that no training examples are in the margin
- The data closest to the hyperplanes are the support vectors
- This forms an optimization problem: find the largest margin
  - Given the support vectors
- Not quite done yet: what if the training data can't be separated?
  - As in, it's not linearly separable
  - Soft-margin SVM and the parameter C
- Solved with quadratic programming (not computer programming)

Kernel trick:
- Why does distance have to be calculated only in the standard dimensionality?
  - Couldn't we have a more flexible definition of distance?
- Kernel function: offload the distance calculation to a "kernel"
- The RBF kernel features in a lot of example pictures
  - Loosely, it's a function of distance from a given point
  - There's a bit more going on in there than just that, though
- There are a fair number of other kernels out there
  - You can substitute a different kernel
- Practical note: OK as long as your feature vectors match your kernel
  - So the feature vectors could be graphs!
- A use I made of SVM: graph kernel, no feature vectors
  - Precomputed as a table on a cluster (sketch at the end of these notes)
  - Looks like I was using Aeolus SVM in Weka
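For reference, the standard formulas behind the hyperplane representation and the point-to-plane distance mentioned above (textbook material, collected here rather than derived):

```latex
\[
\text{hyperplane: } \{\, x \;:\; w \cdot x + b = 0 \,\}, \qquad
d(x_0) \;=\; \frac{\lvert w \cdot x_0 + b \rvert}{\lVert w \rVert}
\]
```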
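And the optimization problems from the hard-margin and soft-margin discussion, in their usual form. With labels y_i in {-1, +1}, the two margin planes are w·x + b = ±1, the margin width is 2/||w||, and the soft-margin version adds slack variables ξ_i with C trading margin width against violations. Both are quadratic programs:

```latex
\[
\text{hard margin:} \quad \min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 \ \ \forall i
\]
\[
\text{soft margin:} \quad \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0
\]
```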
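"Loosely, distance from a given point" in formula form: the RBF kernel turns squared distance into a similarity, with γ controlling how quickly it falls off:

```latex
\[
K_{\mathrm{RBF}}(x, x') \;=\; \exp\!\left(-\gamma\,\lVert x - x' \rVert^2\right)
\]
```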
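The notes themselves used Weka; purely as an illustration, here is a minimal sketch of the 2-feature case in Python with scikit-learn (my substitution, with invented toy data). A large C approximates the hard margin, and w, b, and the support vectors can be read back off the fitted model:

```python
# Minimal sketch, assuming scikit-learn (the notes used Weka, not this library).
import numpy as np
from sklearn.svm import SVC

# Two features, two separable groups (toy data, invented for illustration).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # class -1
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.5]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# A large C heavily penalizes margin violations, approximating a hard margin.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset along that vector
print("w =", w, "b =", b)
print("margin width =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```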
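Finally, the graph-kernel use case: when there are no feature vectors, the kernel values can be precomputed into a table (a Gram matrix) and handed to the SVM directly. A sketch of that pattern, again in scikit-learn rather than Weka; the graph_kernel function here is a hypothetical stand-in, not the kernel actually computed on the cluster:

```python
# Sketch of the precomputed-kernel pattern (scikit-learn stand-in for Weka).
import numpy as np
from sklearn.svm import SVC

def graph_kernel(g1, g2):
    """Hypothetical stand-in for a real graph kernel; in the notes this
    table was precomputed on a cluster. Shared-edge count is an
    intersection kernel, so it is symmetric positive semidefinite."""
    return float(len(g1 & g2))

# Inputs are graphs (here, just edge sets), not feature vectors.
graphs = [{("a", "b"), ("b", "c")},
          {("a", "b"), ("c", "d")},
          {("x", "y"), ("y", "z")},
          {("x", "y"), ("w", "z")}]
y = np.array([1, 1, -1, -1])

# Precompute the n x n kernel table (Gram matrix) over the training graphs.
n = len(graphs)
K = np.array([[graph_kernel(graphs[i], graphs[j]) for j in range(n)]
              for i in range(n)])

clf = SVC(kernel="precomputed")
clf.fit(K, y)

# Prediction needs kernel values between each new graph and every training graph.
K_test = np.array([[graph_kernel(g, gt) for gt in graphs]
                   for g in [{("a", "b"), ("b", "d")}]])
print(clf.predict(K_test))
```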