This exercise focuses on building a complete text classification pipeline: taking a dataset, preprocessing it, extracting features, training a classifier, and evaluating its performance. Our goal is to build a simple spam detector. We'll use a small dataset containing text messages labeled as either "spam" or "ham" (not spam).

## Setup and Data Loading

First, ensure you have the necessary libraries installed, particularly scikit-learn and pandas. If not, you can typically install them using pip:

```bash
pip install scikit-learn pandas
```

Let's assume our dataset is in a simple CSV file named `spam_data.csv` with two columns: `label` ('ham' or 'spam') and `text`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression  # Example of another classifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import plotly.graph_objects as go
import numpy as np

# Load the dataset (replace 'spam_data.csv' with your actual file path if different)
# For demonstration, let's create a small sample DataFrame
data = {'label': ['ham', 'spam', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'spam', 'ham'],
        'text': ['Go until jurong point, crazy.. Available only in bugis n great la e buffet... Cine there got amore wat...',
                 'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C apply 08452810075over18s',
                 'U dun say so early hor... U c already then say...',
                 'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.',
                 'Nah I dont think he goes to usf, he lives around here though',
                 'Even my brother is not like to speak with me. They treat me like aids patent.',
                 'URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18',
                 'I am gonna be home soon and i dont want to talk about this stuff anymore tonight, k? Ive cried enough today.',
                 'SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info',
                 'I have been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise.']}
df = pd.DataFrame(data)

# Display the first few rows and check class distribution
print("Dataset Head:")
print(df.head())
print("\nClass Distribution:")
print(df['label'].value_counts())

# Separate features (text) and target (label)
X = df['text']
y = df['label']

# Split data into training and testing sets
# Using a small test_size for this example dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print(f"\nTraining set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
```

Here, we load the data using pandas, inspect it, and then split it into training and testing sets using `train_test_split`. Using `stratify=y` is good practice, especially for potentially imbalanced datasets, as it ensures the proportion of labels is roughly the same in both the train and test sets.
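As a small optional check (an addition to the exercise, not part of it), you can print the label proportions in each split; with `stratify=y` they should closely mirror the full dataset:

```python
# With stratify=y, the ham/spam ratio in each split should roughly match
# the ratio in the full dataset (here, 6 ham to 4 spam).
print("Train label proportions:")
print(y_train.value_counts(normalize=True))
print("\nTest label proportions:")
print(y_test.value_counts(normalize=True))
```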
## Building the Pipeline

As discussed, directly applying preprocessing and feature extraction steps before training can lead to data leakage if not done carefully (e.g., fitting `TfidfVectorizer` on the whole dataset before splitting). scikit-learn's `Pipeline` object is excellent for chaining these steps together, ensuring that transformations are learned only from the training data.

We'll create a pipeline that first applies TF-IDF vectorization and then trains a Multinomial Naive Bayes (MNB) classifier. MNB is often a good baseline for text classification tasks.

```python
# Create a pipeline with TF-IDF Vectorizer and Multinomial Naive Bayes
text_clf_nb = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),  # Add stop word removal
    ('clf', MultinomialNB()),
])

# Train the entire pipeline on the training data
print("\nTraining Naive Bayes pipeline...")
text_clf_nb.fit(X_train, y_train)
print("Training complete.")
```

In this pipeline:

- `TfidfVectorizer(stop_words='english')`: Converts text documents into a matrix of TF-IDF features. We also include basic English stop word removal directly within the vectorizer. It will learn the vocabulary and IDF weights only from `X_train` when `fit` is called.
- `MultinomialNB()`: The classifier that will be trained on the TF-IDF features.

When `text_clf_nb.fit(X_train, y_train)` is executed, the training data `X_train` flows through the pipeline: first, the `tfidf` step transforms it, and then the resulting features are used to train the `clf` step (Naive Bayes).
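To see that the vectorizer really did learn only from the training messages, here is a quick optional peek inside the fitted pipeline (an illustrative addition, using the pipeline's standard `named_steps` accessor):

```python
# The fitted vectorizer lives inside the pipeline; its vocabulary_ maps each
# term seen in X_train (and only X_train) to a feature column index.
tfidf_step = text_clf_nb.named_steps['tfidf']
print(f"Vocabulary size learned from training data: {len(tfidf_step.vocabulary_)}")
print("A few example terms:", sorted(tfidf_step.vocabulary_)[:5])
```

Terms that appear only in the test set will simply be ignored at prediction time, which is exactly the behavior a deployed model would exhibit on unseen words.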
## Making Predictions and Evaluating

Now, let's use the trained pipeline to make predictions on our held-out test set (`X_test`) and evaluate the performance using the metrics discussed earlier.

```python
# Make predictions on the test set
print("\nMaking predictions on the test set...")
y_pred_nb = text_clf_nb.predict(X_test)

# Evaluate the Naive Bayes model
print("\nNaive Bayes Model Evaluation:")
accuracy_nb = accuracy_score(y_test, y_pred_nb)
print(f"Accuracy: {accuracy_nb:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_nb))
print("\nConfusion Matrix:")
cm_nb = confusion_matrix(y_test, y_pred_nb, labels=text_clf_nb.classes_)
print(cm_nb)

# Function to create a Plotly confusion matrix figure
def plot_confusion_matrix(cm, labels):
    # Use a blue color scale
    colorscale = [
        [0.0, '#e9ecef'],  # light gray for 0
        [0.5, '#74c0fc'],  # light blue
        [1.0, '#1c7ed6']   # dark blue for max
    ]
    fig = go.Figure(data=go.Heatmap(
        z=cm,
        x=labels,
        y=labels,
        hoverongaps=False,
        colorscale=colorscale,
        showscale=False  # Hide color bar for simplicity
    ))
    # Add annotations for cell values
    annotations = []
    for i, row in enumerate(cm):
        for j, value in enumerate(row):
            annotations.append(
                go.layout.Annotation(
                    text=str(value),
                    x=labels[j],
                    y=labels[i],
                    xref="x1",
                    yref="y1",
                    showarrow=False,
                    # Adjust text color for contrast against the cell color
                    font=dict(color="black" if value < (cm.max() / 2) else "white")
                )
            )
    fig.update_layout(
        title='Confusion Matrix (Naive Bayes)',
        xaxis_title="Predicted Label",
        yaxis_title="True Label",
        xaxis=dict(side='bottom'),         # Place x-axis labels at the bottom
        yaxis=dict(autorange='reversed'),  # Display true labels from top to bottom
        width=450, height=400,             # Adjust size as needed
        margin=dict(l=50, r=50, t=50, b=50),
        annotations=annotations
    )
    return fig

# Generate and display the confusion matrix plot
fig_nb = plot_confusion_matrix(cm_nb, text_clf_nb.classes_)
# In a real web environment or notebook, you would display fig_nb here.
# For this format, we output the JSON representation.
print("\nPlotly Confusion Matrix JSON:")
print(fig_nb.to_json(pretty=False))
```

The printed JSON (the figure's serialized form) looks like this:

```json
{"layout": {"title": {"text": "Confusion Matrix (Naive Bayes)"}, "xaxis": {"title": {"text": "Predicted Label"}, "side": "bottom"}, "yaxis": {"title": {"text": "True Label"}, "autorange": "reversed"}, "width": 450, "height": 400, "margin": {"l": 50, "r": 50, "t": 50, "b": 50}, "annotations": [{"text": "2", "x": "ham", "y": "ham", "xref": "x1", "yref": "y1", "showarrow": false, "font": {"color": "white"}}, {"text": "0", "x": "spam", "y": "ham", "xref": "x1", "yref": "y1", "showarrow": false, "font": {"color": "black"}}, {"text": "1", "x": "ham", "y": "spam", "xref": "x1", "yref": "y1", "showarrow": false, "font": {"color": "black"}}, {"text": "0", "x": "spam", "y": "spam", "xref": "x1", "yref": "y1", "showarrow": false, "font": {"color": "black"}}]}, "data": [{"type": "heatmap", "z": [[2, 0], [1, 0]], "x": ["ham", "spam"], "y": ["ham", "spam"], "hoverongaps": false, "colorscale": [[0.0, "#e9ecef"], [0.5, "#74c0fc"], [1.0, "#1c7ed6"]], "showscale": false}]}
```

*Confusion matrix showing predicted vs. true labels for the spam classification task using the Naive Bayes model. Based on this very small sample, the model correctly identified 'ham' but misclassified 'spam'.*

**Interpreting the results** (based on the example output):

- **Accuracy:** The overall percentage of correct predictions. While simple, it can be misleading on imbalanced datasets.
- **Classification Report** (the precision and recall formulas are worked through by hand just after this list):
  - **Precision (for 'spam'):** Of all messages predicted as spam, what fraction actually were spam? (True Positives / (True Positives + False Positives))
  - **Recall (for 'spam'):** Of all actual spam messages, what fraction did the model correctly identify? (True Positives / (True Positives + False Negatives))
  - **F1-score:** The harmonic mean of precision and recall, providing a single metric that balances both concerns.
  - **Support:** The number of actual occurrences of each class in the test set.
- **Confusion Matrix:** A detailed breakdown of correct and incorrect predictions for each class. The diagonal elements represent correct classifications, while off-diagonal elements represent errors (e.g., predicting 'ham' when the message was actually 'spam'). Note: the specific numbers in the example plot above come from the tiny dataset and may show poor performance.
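To make those formulas concrete, here is an illustrative hand computation using the example confusion matrix above (`z = [[2, 0], [1, 0]]`, rows are true labels, columns are predicted labels, in the order `['ham', 'spam']`):

```python
import numpy as np

# Hand-check of the precision and recall formulas for the 'spam' class,
# using the example confusion matrix printed above.
cm = np.array([[2, 0],
               [1, 0]])
tp = cm[1, 1]  # spam correctly predicted as spam
fp = cm[0, 1]  # ham wrongly predicted as spam
fn = cm[1, 0]  # spam wrongly predicted as ham

# Guard against division by zero; scikit-learn reports 0.0 (with a warning)
# when a metric is undefined for a class, as precision is here.
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
print(f"spam precision: {precision:.2f}")  # 0.00 on this toy matrix
print(f"spam recall:    {recall:.2f}")     # 0.00: the one spam message was missed
```

Both metrics come out to zero here, which matches the figure caption: on this tiny sample the model never predicted 'spam' at all.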
## Trying a Different Classifier

The pipeline makes it easy to swap out components. Let's try Logistic Regression instead of Naive Bayes.

```python
# Create and train a pipeline with Logistic Regression
text_clf_lr = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(solver='liblinear', random_state=42)),  # liblinear solver suits smaller datasets
])

print("\nTraining Logistic Regression pipeline...")
text_clf_lr.fit(X_train, y_train)
print("Training complete.")

# Make predictions and evaluate
print("\nMaking predictions with Logistic Regression...")
y_pred_lr = text_clf_lr.predict(X_test)

print("\nLogistic Regression Model Evaluation:")
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print(f"Accuracy: {accuracy_lr:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))
print("\nConfusion Matrix:")
cm_lr = confusion_matrix(y_test, y_pred_lr, labels=text_clf_lr.classes_)
print(cm_lr)

# You could generate another Plotly confusion matrix for Logistic Regression here:
# fig_lr = plot_confusion_matrix(cm_lr, text_clf_lr.classes_)
# print(fig_lr.to_json(pretty=False))
```

By comparing the evaluation metrics (accuracy, precision, recall, F1-score, confusion matrix) from both models, you can determine which classifier performed better on this specific task and dataset.

## Next Steps

This exercise demonstrated the fundamental workflow of building and evaluating a text classifier. To improve upon this baseline, you could:

1. **Use a Larger Dataset:** Reliable performance requires far more data than the ten messages used here.
2. **Experiment with Feature Engineering:** Try different `TfidfVectorizer` parameters (e.g., `ngram_range=(1, 2)` to include bigrams, `max_df`, `min_df`).
3. **Try Other Classifiers:** Experiment with Support Vector Machines (`LinearSVC` is often effective for text).
4. **Hyperparameter Tuning:** Use techniques like `GridSearchCV` or `RandomizedSearchCV` to find the optimal settings for the vectorizer and classifier (covered earlier in the chapter); a minimal sketch appears at the end of this section.
5. **Cross-Validation:** Use `cross_val_score` for a better estimate of model performance than a single train-test split.
6. **Error Analysis:** Examine the specific messages that were misclassified to understand the model's weaknesses.

This practical application solidifies the process of turning text into classifications, a common and valuable task in Natural Language Processing.
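As a concrete starting point for steps 4 and 5 above, here is a minimal sketch of tuning the Naive Bayes pipeline with `GridSearchCV`, which cross-validates each parameter combination internally. The parameter values are illustrative choices, not recommendations, and `cv=3` is used only because the toy training set is so small; a realistic dataset would typically use `cv=5` or more.

```python
from sklearn.model_selection import GridSearchCV

# Parameters of pipeline steps are addressed as '<step name>__<parameter>'.
# The candidate values below are illustrative, not recommendations.
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],  # unigrams only vs. unigrams + bigrams
    'clf__alpha': [0.1, 0.5, 1.0],           # Naive Bayes smoothing strength
}

# cv is kept small here only because the toy training set has seven messages.
grid = GridSearchCV(text_clf_nb, param_grid, cv=3, scoring='f1_macro')
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print(f"Best cross-validated macro F1: {grid.best_score_:.4f}")
```

Because the search is run over the whole pipeline, the vectorizer is refit inside each cross-validation fold, so the leakage-avoidance property discussed earlier is preserved during tuning as well.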