K-Nearest Neighbors (KNN) operates by finding the 'k' closest training examples to a new data point and making a prediction based on the majority class among those neighbors. In this section, we'll build a KNN classifier with Scikit-learn, a widely used Python library, on a well-known dataset.

The Goal: Classifying Iris Flowers

We'll use the famous Iris dataset. This dataset contains measurements for 150 iris flowers belonging to three different species: Setosa, Versicolor, and Virginica. For each flower, we have four features:

- Sepal Length (cm)
- Sepal Width (cm)
- Petal Length (cm)
- Petal Width (cm)

Our objective is to build a KNN model that can predict the species of an iris flower based on these four measurements. This is a classic example of a multi-class classification problem.

Setting Up Your Environment

We'll use Python and the Scikit-learn library. If you haven't used Scikit-learn before, it's a powerful and widely used library for machine learning tasks. You'll also need NumPy for numerical operations, Pandas for handling tabular data, and Matplotlib/Seaborn for plotting (optional, but helpful for understanding).

Make sure you have these installed. You can typically install them using pip:

```
pip install scikit-learn numpy matplotlib seaborn pandas
```

Step 1: Load the Data

Scikit-learn conveniently includes the Iris dataset. Let's load it.

```python
import pandas as pd
from sklearn.datasets import load_iris
import numpy as np

# Load the Iris dataset
iris = load_iris()

# The dataset is loaded as a Bunch object (similar to a dictionary)
# iris.data contains the features (numpy array)
# iris.target contains the labels (0, 1, 2 corresponding to species)
# iris.feature_names contains the names of the features
# iris.target_names contains the names of the species

# For easier handling, let's put it into a Pandas DataFrame
# This is optional but often convenient
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])

# Map target numbers to species names for clarity
df['species'] = df['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

print("First 5 rows of the Iris dataset:")
print(df.head())

print("\nTarget classes (Species):")
print(df['species'].unique())

# Separate features (X) and target (y)
X = iris.data    # Features (numpy array)
y = iris.target  # Target labels (numpy array)
```

You should see the first few rows of data, showing the measurements and the corresponding target label (0, 1, or 2) and species name.
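Before moving on, it can help to make the neighbor-voting idea described at the start of this section concrete. The sketch below is purely illustrative (Scikit-learn will handle all of this for us in Step 4): the hypothetical helper predict_one classifies a single flower by brute force, using the X and y arrays we just created.

```python
import numpy as np
from collections import Counter

def predict_one(query, X_train, y_train, k=5):
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Classify the first flower, treating the remaining 149 as "training" points
print(predict_one(X[0], X[1:], y[1:], k=5))  # 0, i.e. setosa
```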
Step 2: Split Data into Training and Testing Sets

As discussed in Chapter 2 and revisited in Chapter 6, we need to split our data. We'll train the model on one portion (the training set) and evaluate its performance on a separate, unseen portion (the testing set). This helps us understand how well our model generalizes to new data. Scikit-learn provides a handy function, train_test_split, for this.

```python
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
# test_size=0.3 means 30% of the data will be used for testing
# random_state ensures reproducibility (we get the same split every time)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42, stratify=y)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
```

We use stratify=y to ensure that the proportion of each flower species is roughly the same in both the training and testing sets, which is good practice for classification tasks.

Step 3: Feature Scaling

Remember from Chapter 6 that KNN relies on distance calculations (like Euclidean distance) between data points. If features have significantly different scales (e.g., one feature ranges from 0-1 and another from 100-1000), the feature with the larger range can dominate the distance calculation. Therefore, scaling features to a similar range is often important for KNN. We'll use StandardScaler from Scikit-learn, which standardizes features by removing the mean and scaling to unit variance.

```python
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler ONLY on the training data
scaler.fit(X_train)

# Transform both the training and testing data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Note: It's important to fit the scaler only on the training data
# and then use that fitted scaler to transform both sets.
# This prevents information from the test set "leaking" into the training process.

# Let's look at the first few rows of the scaled data (optional)
# print("\nFirst 5 rows of scaled training data:")
# print(X_train_scaled[:5])
```

Step 4: Create and Train the KNN Model

Now we can create our KNN classifier. The main parameter we need to choose is n_neighbors, which is the 'k' value we discussed. Let's start with a common value, like k=5.

```python
from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN classifier with k=5
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model using the scaled training data
knn.fit(X_train_scaled, y_train)

print("\nKNN model trained successfully with k=5.")
```

The fit method is where the "learning" happens for many Scikit-learn models. For KNN, however, fit is very simple: it primarily just stores the training data (X_train_scaled and y_train) so it can be referenced later when making predictions.

Step 5: Make Predictions

With our trained model, we can now predict the species for the flowers in our test set (X_test_scaled).

```python
# Make predictions on the scaled test data
y_pred = knn.predict(X_test_scaled)

# Display the first 10 predictions alongside the actual labels
print("\nFirst 10 Predictions vs Actual Labels:")
print(f"Predictions: {y_pred[:10]}")
print(f"Actual:      {y_test[:10]}")
# Remember: 0=setosa, 1=versicolor, 2=virginica
```

The predict method takes the new data points (our scaled test features) and, for each point, finds the 5 nearest neighbors in the stored training data. It then predicts the class based on the majority vote among those neighbors.
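If you'd like a more human-readable view of what the classifier is doing, the snippet below is optional: it maps the integer predictions back to species names via iris.target_names, and uses the classifier's predict_proba and kneighbors methods to peek at the voting behind the first few test points.

```python
# Optional: show the first few predictions as species names
print(iris.target_names[y_pred[:10]])

# Optional: predicted class probabilities are the fraction of the 5 neighbors
# belonging to each class (so they come in multiples of 0.2)
print(knn.predict_proba(X_test_scaled[:3]))

# Optional: distances to, and indices of, the 5 nearest training points
# for the first test flower
distances, indices = knn.kneighbors(X_test_scaled[:1])
print(distances, indices)
```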
Step 6: Evaluate the Model

How well did our model do? We need to compare the predictions (y_pred) with the actual labels (y_test). We learned about evaluation metrics in the previous section. Let's calculate accuracy and look at the confusion matrix.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")

# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

# Visualize the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
disp.plot(cmap=plt.cm.Blues)  # Use a blue color map
plt.title("Confusion Matrix for KNN (k=5)")
plt.show()
```

A confusion matrix showing the performance of the KNN classifier on the Iris test set. Rows represent the true classes, and columns represent the predicted classes. Diagonal elements show correct predictions.

The accuracy tells us the overall proportion of correct predictions. The confusion matrix gives a more detailed breakdown:

- The rows represent the true species (Actual).
- The columns represent the predicted species (Predicted).
- The diagonal elements (top-left to bottom-right) show the number of correct predictions for each class.
- Off-diagonal elements show the misclassifications. For example, the number in row 1, column 2 would be the count of actual 'setosa' flowers that were incorrectly predicted as 'versicolor'.

In this case (results may vary slightly depending on the random_state), the KNN model with k=5 usually performs very well on the Iris dataset, often achieving high accuracy with few misclassifications shown in the confusion matrix.

Experimenting with 'k'

The choice of k (the number of neighbors) can influence the model's performance. A small k might make the model sensitive to noise, while a very large k might oversmooth the decision boundary.

Try changing the n_neighbors parameter when creating the KNeighborsClassifier (e.g., try k=1, k=3, k=10) and rerun steps 4, 5, and 6. Observe how the accuracy and confusion matrix change. Finding the optimal k often involves trying several values and seeing which one performs best on a validation set (or using techniques like cross-validation, which are slightly more advanced topics).

For instance, let's quickly check k=3:

```python
# Initialize, train, predict, and evaluate for k=3
knn_k3 = KNeighborsClassifier(n_neighbors=3)
knn_k3.fit(X_train_scaled, y_train)
y_pred_k3 = knn_k3.predict(X_test_scaled)

accuracy_k3 = accuracy_score(y_test, y_pred_k3)
print(f"\nModel Accuracy with k=3: {accuracy_k3:.4f}")

cm_k3 = confusion_matrix(y_test, y_pred_k3)
disp_k3 = ConfusionMatrixDisplay(confusion_matrix=cm_k3, display_labels=iris.target_names)
disp_k3.plot(cmap=plt.cm.Greens)  # Use a green color map this time
plt.title("Confusion Matrix for KNN (k=3)")
plt.show()
```

A confusion matrix showing the performance of the KNN classifier with k=3 on the Iris test set.

Compare the results. Does k=3 perform better or worse than k=5 on this specific test set?
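Rather than checking one value at a time, you can screen several candidates in a small loop. The sketch below reuses the variables defined in earlier steps and simply reports test-set accuracy for each k; treat it as a quick exploration rather than a rigorous model-selection procedure.

```python
# Quick, informal sweep over several values of k
for k in [1, 3, 5, 7, 9, 11]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train_scaled, y_train)
    acc = accuracy_score(y_test, model.predict(X_test_scaled))
    print(f"k={k:2d}  test accuracy: {acc:.4f}")
```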
There isn't always one "best" k for all datasets; it often depends on the data's structure.

Summary

In this practice section, you successfully implemented a K-Nearest Neighbors classifier:

- Loaded the Iris dataset.
- Split the data into training and testing sets.
- Applied feature scaling, an important step for distance-based algorithms like KNN.
- Created a KNeighborsClassifier instance from Scikit-learn.
- "Trained" the model by providing it with the scaled training data and labels.
- Made predictions on the unseen, scaled test data.
- Evaluated the model's performance using accuracy and a confusion matrix.
- Briefly explored how changing the value of k can affect results.

This hands-on exercise demonstrates the typical workflow for applying a supervised learning algorithm to a classification problem using standard tools. You now have practical experience implementing one of the fundamental classification algorithms.
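If you want to follow up on the cross-validation idea mentioned above, one common pattern is to wrap the scaler and the classifier in a Pipeline and let cross_val_score evaluate each candidate k on the training data only. The sketch below assumes the X_train and y_train arrays from Step 2 are still in scope; it is a pointer for further experimentation rather than part of the workflow covered in this section.

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Evaluate several k values with 5-fold cross-validation on the training set.
# Keeping the scaler inside the pipeline means it is re-fit within each fold,
# so no information leaks between folds.
for k in [1, 3, 5, 7, 9]:
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=k)),
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.4f}")
```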