After creating potentially numerous features from numerical, categorical, and text data, and perhaps even applying transformations like PCA, you might find yourself with a dataset containing a large number of input variables. While having more information seems beneficial, not all features are created equal. Some might be irrelevant to the prediction task, others might be redundant (highly correlated with existing features), and some might even introduce noise that degrades model performance. Furthermore, using too many features increases computational cost and model complexity, and as the number of dimensions grows the data becomes sparser, which often leads to overfitting (the "curse of dimensionality").
This is where feature selection comes into play. Instead of transforming features into a lower-dimensional space like PCA does, feature selection techniques aim to identify and retain a subset of the original, most relevant features for your modeling task. Statistical methods provide a common and often computationally efficient approach to performing this selection, primarily by evaluating the relationship between each feature and the target variable independently of any specific machine learning model. These are often referred to as filter methods.
Filter methods assess the relevance of features based on their intrinsic statistical properties concerning the target variable. They are generally faster than other selection strategies (like wrapper or embedded methods) because they don't involve training machine learning models during the selection process. They serve as an effective preprocessing step to filter out features that are unlikely to be useful.
The choice of statistical test largely depends on the data types of the feature and the target variable.
When your target variable is continuous, you typically want to select features that show a strong statistical relationship with it.
Pearson Correlation Coefficient: This measures the linear correlation between a numerical feature (X) and the continuous target variable (Y). The correlation coefficient, r, ranges from -1 to +1. A quick way to rank numerical features is df.corr()['target_variable'].abs().sort_values(ascending=False). However, remember that Pearson correlation only captures linear dependencies. A feature might have a strong non-linear relationship with the target but a low Pearson correlation score.
ANOVA F-test (Regression): This statistical test (sklearn.feature_selection.f_regression) computes the F-statistic to determine whether there is a significant linear relationship between a numerical feature and the continuous target variable. It essentially analyzes the variance explained by the feature. A higher F-statistic (and correspondingly lower p-value) suggests a stronger relationship. For single features, it is fundamentally linked to linear regression and Pearson correlation.
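As a quick sketch of both scores side by side (the dataset and column names below are synthetic, invented purely for illustration), you might rank features by absolute correlation and then compare with f_regression:
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression

# Synthetic regression data; column names are illustrative only
X, y = make_regression(n_samples=200, n_features=5, n_informative=3,
                       noise=10.0, random_state=0)
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(5)])
df['target_variable'] = y

# Rank features by absolute Pearson correlation with the target
pearson_ranking = (df.corr()['target_variable']
                   .drop('target_variable')
                   .abs()
                   .sort_values(ascending=False))
print(pearson_ranking)

# F-statistics and p-values from the ANOVA F-test for regression
features = df.drop(columns='target_variable')
f_scores, p_values = f_regression(features, y)
print(pd.Series(f_scores, index=features.columns).sort_values(ascending=False))
Both rankings tend to agree here because the synthetic relationships are linear; they diverge precisely when relationships are non-linear, which motivates the next method.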
Mutual Information (Regression): Mutual information (sklearn.feature_selection.mutual_info_regression) is a non-parametric method derived from information theory. It measures the amount of information obtained about one variable (the target) by observing another variable (the feature). It can capture arbitrary relationships (not just linear) and is measured in "nats" (or sometimes bits). A score of 0 indicates independence, while higher values indicate greater dependence. This makes it powerful for capturing complex patterns that correlation might miss. It requires numerical features, but categorical features can sometimes be used if appropriately encoded or if the implementation supports them (some implementations discretize continuous variables).
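To see why this matters, here is a small illustrative sketch (made-up synthetic data) in which a quadratically related feature gets a near-zero Pearson correlation but a clearly positive mutual information score:
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x_linear = rng.normal(size=500)           # linearly related to the target
x_quadratic = rng.uniform(-3, 3, 500)     # quadratically (non-linearly) related
x_noise = rng.normal(size=500)            # unrelated noise
y = 2 * x_linear + x_quadratic ** 2 + rng.normal(scale=0.5, size=500)
X = np.column_stack([x_linear, x_quadratic, x_noise])

# Pearson correlation barely registers the quadratic feature...
print("Pearson r:", [round(float(np.corrcoef(X[:, i], y)[0, 1]), 3) for i in range(3)])

# ...while mutual information assigns it a clearly non-zero score
print("MI scores:", mutual_info_regression(X, y, random_state=0).round(3))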
When dealing with a categorical target variable, different statistical tests are appropriate.
Chi-Squared (χ2) Test: This test (sklearn.feature_selection.chi2) is used to determine if there is a significant association between two categorical variables. You would apply this between each categorical feature and the categorical target. It compares the observed frequencies in a contingency table (feature categories vs. target classes) to the frequencies that would be expected if the variables were independent. A high χ2 statistic suggests that the feature is not independent of the target class, making it potentially relevant. An important prerequisite is that the features must contain non-negative values (counts, frequencies, or appropriately encoded variables like one-hot encoded features).
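A minimal sketch of this test on a hypothetical one-hot encoded categorical feature (the data and column names are invented for illustration):
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

# Hypothetical categorical feature and a binary target (synthetic data)
rng = np.random.default_rng(0)
color = rng.choice(['red', 'green', 'blue'], size=300)
# Make 'red' co-occur with class 1 more often so there is a real association
y = (rng.random(300) < np.where(color == 'red', 0.8, 0.3)).astype(int)

# One-hot encode so the feature values are non-negative, as chi2 requires
X_encoded = pd.get_dummies(pd.Series(color), prefix='color').astype(int)

chi2_scores, p_values = chi2(X_encoded, y)
print(pd.Series(chi2_scores, index=X_encoded.columns).sort_values(ascending=False))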
ANOVA F-test (Classification): Similar to the regression case, but adapted for classification (sklearn.feature_selection.f_classif). It tests whether the mean values of a numerical feature differ significantly across the different target classes. If a feature's mean value varies substantially between classes, it's likely a good predictor for distinguishing those classes.
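As a brief illustration (synthetic data), a feature whose per-class mean shifts with the class receives a much larger F-statistic than pure noise:
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(42)
y = rng.integers(0, 3, size=300)                        # three target classes

# One feature whose mean shifts with the class, one that is pure noise
informative = y * 1.5 + rng.normal(scale=1.0, size=300)
uninformative = rng.normal(size=300)
X = np.column_stack([informative, uninformative])

F, p = f_classif(X, y)
print("F-statistics:", F.round(2))   # the first value should be much larger
print("p-values:", p.round(4))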
Mutual Information (Classification): Analogous to the regression version, sklearn.feature_selection.mutual_info_classif measures the mutual dependence between each feature (numerical or discrete) and the categorical target variable. It quantifies how much uncertainty about the target class is reduced by knowing the value of the feature. Higher values indicate greater relevance, and it effectively captures non-linear relationships.
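A short sketch with synthetic data; the discrete_features argument below marks which column should be treated as discrete rather than continuous:
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(7)
y = rng.integers(0, 2, size=400)                          # binary target

continuous_feat = y + rng.normal(scale=0.8, size=400)     # related, continuous
discrete_feat = rng.integers(0, 5, size=400)              # unrelated, discrete
X = np.column_stack([continuous_feat, discrete_feat])

# Column index 1 is flagged as discrete so it is handled appropriately
mi = mutual_info_classif(X, y, discrete_features=[1], random_state=0)
print("Mutual information:", mi.round(3))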
Scikit-learn's sklearn.feature_selection module provides convenient tools for applying these statistical tests. Common selectors include SelectKBest (selects the top k features based on scores) and SelectPercentile (selects the top features based on a percentile threshold).
Here's how you might use SelectKBest with the ANOVA F-test for a classification problem:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
# Generate synthetic classification data
# 10 informative, 5 redundant, and 5 purely noisy features out of 20
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)
X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(20)])
print("Original number of features:", X_df.shape[1])
# Select the top 12 features based on ANOVA F-test scores
k_best = 12
selector = SelectKBest(score_func=f_classif, k=k_best)
# Fit the selector to the data and transform X
X_new = selector.fit_transform(X_df, y)
# Get the scores and p-values
scores = selector.scores_
p_values = selector.pvalues_
# Get the indices of the selected features
selected_indices = selector.get_support(indices=True)
# Get the names of the selected features
selected_features = X_df.columns[selected_indices]
print(f"Selected top {k_best} features:", selected_features.tolist())
print("Shape of data after selection:", X_new.shape)
# Optional: Display scores for selected features
# selected_scores = pd.Series(scores[selected_indices], index=selected_features)
# print("\nScores (F-statistic) for selected features:")
# print(selected_scores.sort_values(ascending=False))
This code snippet demonstrates:
Initializing SelectKBest with f_classif as the scoring function and specifying k=12.
Fitting the selector to the data (X_df, y).
Transforming the data so that only the selected features are kept (X_new).
You could similarly use chi2 (for non-negative categorical features), mutual_info_classif, f_regression, or mutual_info_regression as the score_func, depending on your feature types and task (classification/regression).
Example F-statistic scores calculated by SelectKBest using f_classif. Features with higher scores exhibit stronger statistical relationships with the target classes in this example classification scenario. The selector retains the features corresponding to the top k scores.
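As noted above, the same selector works for regression tasks by swapping the scoring function. Here is a sketch on a synthetic regression dataset using mutual_info_regression as the score_func:
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic regression data: 15 features, 5 of them informative
X_reg, y_reg = make_regression(n_samples=300, n_features=15, n_informative=5,
                               noise=5.0, random_state=42)

# Keep the 5 features with the highest mutual information scores
reg_selector = SelectKBest(score_func=mutual_info_regression, k=5)
X_reg_new = reg_selector.fit_transform(X_reg, y_reg)

print("Shape after selection:", X_reg_new.shape)
print("Selected feature indices:", reg_selector.get_support(indices=True))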
While powerful, keep these points in mind when using statistical filter methods:
Univariate scoring: each feature is evaluated independently, so highly correlated (redundant) features can all be retained. It is worth inspecting the correlation matrix of the retained columns (X_df[selected_features].corr()) after selection.
Choosing k: the number of features to keep is a tuning decision; SelectPercentile offers an alternative way to specify the number of features.
Scaling: scaling numerical features (e.g., with StandardScaler) before feature selection is often a good practice for these methods.
A short sketch after the closing summary below illustrates these points together.
Statistical feature selection provides a valuable set of techniques for reducing the dimensionality of your dataset by focusing on features with the strongest individual relationships to the target variable. It's an important step in the practical data science workflow, helping to simplify models, potentially improve performance, reduce computation time, and prepare a more focused dataset for the subsequent modeling stages.
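Here is the sketch referred to above. It reuses X_df and y from the earlier SelectKBest example (so those variables must already be defined), scales the features inside a Pipeline, selects by percentile instead of a fixed k, and then inspects the correlation matrix of the retained columns for redundancy:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_classif

# Reuses X_df and y from the SelectKBest example above
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectPercentile(score_func=f_classif, percentile=50)),
])
X_selected = pipeline.fit_transform(X_df, y)
print("Shape after percentile-based selection:", X_selected.shape)

# Inspect redundancy among the retained columns
selected_cols = X_df.columns[pipeline.named_steps['select'].get_support()]
print(X_df[selected_cols].corr().round(2))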