Now that we have an understanding of how Support Vector Machines work from the previous section, let's see how to put them into practice using Scikit-learn. The primary implementation for classification is the SVC class (Support Vector Classification), located within the sklearn.svm module.
Just like other Scikit-learn estimators, SVC follows the familiar fit/predict pattern. We first instantiate the model, potentially configuring its hyperparameters, then train it on our data, and finally use it to make predictions on new, unseen data points.
Let's start with a simple example using default parameters. We'll use sample data generated with make_blobs, which creates well-separated clusters suitable for classification tasks. Remember, SVMs are sensitive to feature scales, so applying scaling (such as StandardScaler) is generally a required preprocessing step. We will cover preprocessing in detail in Chapter 4, but we'll apply basic scaling here for demonstration.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# 1. Generate sample data
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=42, cluster_std=1.5)
# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 3. Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use transform, not fit_transform on test data
# 4. Instantiate the SVC model
# Defaults: kernel='rbf', C=1.0, gamma='scale'
svm_classifier = SVC(random_state=42)
# 5. Train the model
svm_classifier.fit(X_train_scaled, y_train)
# 6. Make predictions
y_pred = svm_classifier.predict(X_test_scaled)
# 7. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")
# Expected Output: Model Accuracy: 1.0000 (or similar, dataset is easily separable)
In this example, we created an SVC instance using its default settings. The most significant default is kernel='rbf', which uses the Radial Basis Function kernel, a common choice for handling non-linearly separable data. We then trained (fit) the model using the scaled training data and made predictions (predict) on the scaled test data.
The kernel hyperparameter determines how SVC transforms the data to find the optimal separating hyperplane. You select the kernel based on your understanding of the data's structure.
'linear': Use this if you believe your data is mostly linearly separable. It fits a simple hyperplane.
'rbf' (Radial Basis Function): This is the default and a good starting point for many problems. It can map data into an infinite-dimensional space and create complex, non-linear decision boundaries. It's controlled by the gamma hyperparameter.
'poly': Uses a polynomial function to map data. The degree of the polynomial is set using the degree hyperparameter (default is 3).
'sigmoid': Uses a function similar to the activation functions found in neural networks. Often used in specific applications.
You specify the kernel when creating the SVC instance:
# Linear Kernel
svm_linear = SVC(kernel='linear', random_state=42)
# svm_linear.fit(X_train_scaled, y_train) ...
# Polynomial Kernel (degree 3)
svm_poly = SVC(kernel='poly', degree=3, random_state=42)
# svm_poly.fit(X_train_scaled, y_train) ...
# RBF Kernel (default, shown explicitly)
svm_rbf = SVC(kernel='rbf', random_state=42)
# svm_rbf.fit(X_train_scaled, y_train) ...
Let's visualize the decision boundaries produced by linear and RBF kernels on our sample data.
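One way to produce such a comparison is to train both models, predict the class of every point on a grid covering the training data, and color the resulting regions. The following is a minimal sketch (not part of the original listing) that assumes matplotlib is available and reuses X_train_scaled and y_train from the example above.
import matplotlib.pyplot as plt
# Fit one SVC per kernel on the scaled training data
models = {"linear": SVC(kernel="linear", random_state=42),
          "rbf": SVC(kernel="rbf", random_state=42)}
# Build a grid of points covering the training data
x_min, x_max = X_train_scaled[:, 0].min() - 0.5, X_train_scaled[:, 0].max() + 0.5
y_min, y_max = X_train_scaled[:, 1].min() - 0.5, X_train_scaled[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05), np.arange(y_min, y_max, 0.05))
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, (name, model) in zip(axes, models.items()):
    model.fit(X_train_scaled, y_train)
    # The predicted class at each grid point defines the colored regions
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.2, cmap="coolwarm")
    ax.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train,
               cmap="coolwarm", edgecolors="k")
    ax.set_title(f"kernel='{name}'")
plt.tight_layout()
plt.show()
The linear kernel produces a straight boundary, while the RBF kernel can bend around the clusters when the data requires it.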
C and gamma
Beyond the choice of kernel, two other hyperparameters significantly influence an SVM's behavior:
C (Regularization Parameter): This parameter controls the trade-off between a smooth decision boundary and correctly classifying training points. A small C value creates a wider margin (simpler model) that tolerates more misclassifications. A large C aims for zero misclassifications on training data, potentially leading to a narrower margin and a more complex model prone to overfitting. Think of it as a penalty for misclassifications.
gamma (Kernel Coefficient for 'rbf', 'poly', and 'sigmoid'): This parameter defines the reach of a single training example's influence. A small gamma means a large radius of influence, leading to smoother, more general decision boundaries. A large gamma implies a small radius of influence, resulting in highly wiggly decision boundaries that might fit the training data very closely (potentially overfitting). When gamma is set to 'scale' (the default), it is computed as 1 / (n_features * X.var()), which is often a good starting point.
These hyperparameters are important for tuning your SVM model to achieve optimal performance on your specific dataset.
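In practice, C and gamma are usually tuned together with cross-validation rather than set by hand. Below is a minimal sketch using GridSearchCV; the grid values are purely illustrative, and it assumes the scaled training data from the earlier example.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Illustrative search grid; sensible ranges depend on your data
param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': ['scale', 0.01, 0.1, 1]}
grid_search = GridSearchCV(SVC(kernel='rbf', random_state=42), param_grid, cv=5)
# grid_search.fit(X_train_scaled, y_train)
# print(grid_search.best_params_, grid_search.best_score_)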
Let's see how different C values affect the decision boundary. We'll use the same synthetic data and the RBF kernel.
# Re-generate and scale data for consistent plots
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import numpy as np
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=42, cluster_std=1.5)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train SVMs with different C values
svm_low_C = SVC(kernel='rbf', C=0.1, random_state=42)
svm_low_C.fit(X_scaled, y)
svm_high_C = SVC(kernel='rbf', C=1000, random_state=42)
svm_high_C.fit(X_scaled, y)
# Helper function to generate Z for contour plot
def generate_z_for_plot(classifier, X_data):
    # Build a grid spanning the data range, then evaluate the classifier's
    # decision function at every grid point
    x_min, x_max = X_data[:, 0].min() - 0.5, X_data[:, 0].max() + 0.5
    y_min, y_max = X_data[:, 1].min() - 0.5, X_data[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05), np.arange(y_min, y_max, 0.05))
    Z = classifier.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    return xx, yy, Z
xx_low_C, yy_low_C, Z_low_C = generate_z_for_plot(svm_low_C, X_scaled)
xx_high_C, yy_high_C, Z_high_C = generate_z_for_plot(svm_high_C, X_scaled)
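The helper above only computes the grids of decision-function values. To actually draw the two boundaries described below, a minimal matplotlib sketch (not part of the original listing) could look like this; it also circles the support vectors so the margin structure is visible.
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
plots = [(xx_low_C, yy_low_C, Z_low_C, svm_low_C, "RBF kernel, C=0.1"),
         (xx_high_C, yy_high_C, Z_high_C, svm_high_C, "RBF kernel, C=1000")]
for ax, (xx, yy, Z, model, title) in zip(axes, plots):
    # Shaded regions show the predicted class; the solid line is the decision
    # boundary (level 0) and the dashed lines are the margins (levels -1, +1)
    ax.contourf(xx, yy, Z > 0, alpha=0.2, cmap="coolwarm")
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], linestyles=["--", "-", "--"], colors="k")
    ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y, cmap="coolwarm", edgecolors="k")
    ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
               s=120, facecolors="none", edgecolors="k")
    ax.set_title(title)
plt.tight_layout()
plt.show()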
Decision boundary for SVM with a low C value (C=0.1) using the RBF kernel. Notice the broader margin, which allows some misclassifications to achieve better generalization.
Decision boundary for SVM with a high C value (C=1000) using the RBF kernel. Observe the narrower margin, which prioritizes correctly classifying all training points.
A distinctive aspect of SVMs is their reliance on support vectors. These are the training points that lie on the margin or violate it, which makes them the examples closest to the decision boundary (the hyperplane). They "support" the hyperplane and determine its position and orientation. Training points that sit comfortably outside the margin have no effect on the hyperplane's definition. This also keeps prediction efficient: the model needs only the support vectors, not the entire training set, to evaluate its decision function.
You can access the support vectors after training an SVC model:
# Assuming svm_classifier has been trained
print(f"Number of support vectors for each class: {svm_classifier.n_support_}")
print(f"Indices of support vectors: {svm_classifier.support_}")
print(f"Support vectors (scaled): {svm_classifier.support_vectors_}")
While the core idea of SVMs is binary classification, SVC can also handle multiclass problems. Internally, Scikit-learn's SVC always uses a one-vs-one (OvO) strategy: a separate binary SVM is trained for every pair of classes, so with N classes this means training N(N-1)/2 classifiers. The final prediction is determined by a vote among these pairwise classifiers.
The decision_function_shape hyperparameter does not change this training strategy; it only controls the shape of the scores returned by decision_function. The default, 'ovr', aggregates the pairwise results into one score per class (a one-vs-rest style output, kept for consistency with other Scikit-learn classifiers), while 'ovo' returns the raw pairwise scores. If you want a genuine one-vs-rest scheme, in which one binary SVM is trained for each class against all the others, you can wrap SVC in sklearn.multiclass.OneVsRestClassifier.
Here's an example with multiclass data using make_classification:
from sklearn.datasets import make_classification
# Generate multiclass data
X_multi, y_multi = make_classification(n_samples=200, n_features=2, n_redundant=0,
                                       n_informative=2, n_clusters_per_class=1,
                                       n_classes=3, random_state=42)
# Scale features
scaler_multi = StandardScaler()
X_multi_scaled = scaler_multi.fit_transform(X_multi)
# Train an SVC on the multiclass data
# (multiclass is handled internally via one-vs-one;
#  decision_function_shape defaults to 'ovr' for the output shape)
svm_multi_ovr = SVC(kernel='rbf', random_state=42)
svm_multi_ovr.fit(X_multi_scaled, y_multi)
print("\nMulticlass SVM trained (decision_function_shape='ovr' by default).")
Visualizing multiclass decision boundaries is more complex than binary ones but still provides valuable insight into how the model separates different classes in the feature space.
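One simple approach is the same grid trick used earlier: predict the class of every point on a grid and color the regions accordingly. The sketch below reuses svm_multi_ovr and the scaled multiclass data from above and assumes matplotlib is available.
import matplotlib.pyplot as plt
import numpy as np
x_min, x_max = X_multi_scaled[:, 0].min() - 0.5, X_multi_scaled[:, 0].max() + 0.5
y_min, y_max = X_multi_scaled[:, 1].min() - 0.5, X_multi_scaled[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05), np.arange(y_min, y_max, 0.05))
# Color each grid point by the class the trained multiclass model predicts
Z = svm_multi_ovr.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.2, cmap="viridis")
plt.scatter(X_multi_scaled[:, 0], X_multi_scaled[:, 1], c=y_multi,
            cmap="viridis", edgecolors="k")
plt.title("Multiclass SVC decision regions")
plt.show()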
Scikit-learn's SVC offers a powerful and flexible way to implement Support Vector Machines for classification tasks. By understanding and appropriately tuning its kernel, C, and gamma hyperparameters, you can effectively build models that generalize well to unseen data, even for non-linearly separable problems. Remember that proper feature scaling is often a critical prerequisite for optimal SVM performance.