Okay, let's put the theory from this chapter into practice. We've discussed vector spaces, linear independence, basis, and rank. These concepts are fundamental for understanding the structure within our datasets, particularly when our data points are represented as feature vectors. Analyzing sets of these vectors helps us identify redundant information and understand the effective dimensionality of our feature space. We'll use Python's NumPy library, a standard tool for numerical computation, to perform these analyses.
Imagine we have a small dataset with several data points, each described by a few features. We can organize these feature vectors into a matrix. Often, each row represents a data point, and each column represents a feature. However, when analyzing the linear independence of the features themselves or the dimensionality spanned by them, it's often convenient to arrange the feature vectors as columns of a matrix. Let's work with that convention for analyzing feature relationships.
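As a quick illustration of this indexing convention (the values below are arbitrary placeholders), slicing a column of a data matrix gives one feature observed across all points:

```python
import numpy as np

# Toy data matrix: rows are data points, columns are features
# (values are arbitrary, just to illustrate the convention)
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0]
])

first_feature = X[:, 0]  # feature 0 observed across all data points
first_point = X[0, :]    # all features of data point 0
print(first_feature)     # [1. 3. 5.]
print(first_point)       # [1. 2.]
```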
Suppose we have 4 data points, each with 4 features. We might represent the features as column vectors:
$$\mathbf{f}_1 = \begin{bmatrix} 1 \\ 2 \\ 0 \\ 1 \end{bmatrix}, \quad \mathbf{f}_2 = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}, \quad \mathbf{f}_3 = \begin{bmatrix} 1 \\ 3 \\ 1 \\ 1 \end{bmatrix}, \quad \mathbf{f}_4 = \begin{bmatrix} 2 \\ 4 \\ 0 \\ 2 \end{bmatrix}$$
We can group these column vectors into a matrix A:
$$A = \begin{bmatrix} 1 & 0 & 1 & 2 \\ 2 & 1 & 3 & 4 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 2 \end{bmatrix}$$
Let's create this matrix using NumPy:
```python
import numpy as np

# Feature vectors as columns
A = np.array([
    [1, 0, 1, 2],
    [2, 1, 3, 4],
    [0, 1, 1, 0],
    [1, 0, 1, 2]
])
print("Feature matrix A:\n", A)
```
Linear independence among feature vectors is significant. If a set of feature vectors is linearly dependent, it means at least one feature can be expressed as a linear combination of the others. This indicates redundancy in our features. For example, having features for "temperature in Celsius" and "temperature in Fahrenheit" adds no new information, as one can be perfectly predicted from the other; they are linearly dependent (after centering). Redundant features can sometimes cause problems for machine learning algorithms, such as multicollinearity in linear regression, leading to unstable coefficient estimates.
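To make the temperature example concrete, here is a small sketch (with made-up readings) of why the centering matters: Fahrenheit is an affine, not linear, function of Celsius, so the raw columns have rank 2, but centering removes the constant offset:

```python
import numpy as np

# Made-up temperature readings; Fahrenheit is derived from Celsius
celsius = np.array([0.0, 10.0, 20.0, 30.0])
fahrenheit = celsius * 9 / 5 + 32

T = np.column_stack([celsius, fahrenheit])
T_centered = T - T.mean(axis=0)  # subtract each column's mean

print(np.linalg.matrix_rank(T))           # 2: the +32 offset is affine, not linear
print(np.linalg.matrix_rank(T_centered))  # 1: centered columns are linearly dependent
```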
A practical way to check for linear independence of the columns of a matrix is by calculating its rank. The rank of a matrix is the maximum number of linearly independent columns (or rows) in the matrix.
NumPy's `linalg` module provides the function `matrix_rank` to compute the rank of a matrix. It typically relies on the Singular Value Decomposition (SVD, which we'll cover in detail later) to determine the rank robustly, even in the presence of small numerical errors. Let's calculate the rank of our feature matrix `A`:
```python
# Calculate the rank of matrix A
rank_A = np.linalg.matrix_rank(A)
num_features = A.shape[1]  # Number of columns (features)

print(f"Matrix A:\n{A}")
print(f"Number of features (columns): {num_features}")
print(f"Rank of matrix A: {rank_A}")

if rank_A < num_features:
    print("The feature vectors (columns) are linearly dependent.")
else:
    print("The feature vectors (columns) are linearly independent.")
```
Executing this code will output:
```
Matrix A:
[[1 0 1 2]
 [2 1 3 4]
 [0 1 1 0]
 [1 0 1 2]]
Number of features (columns): 4
Rank of matrix A: 2
The feature vectors (columns) are linearly dependent.
```
The rank is 2, which is less than the number of features (4). This confirms that the feature vectors are linearly dependent. Looking closely at matrix $A$, we can see that $\mathbf{f}_3 = \mathbf{f}_1 + \mathbf{f}_2$ and $\mathbf{f}_4 = 2\mathbf{f}_1$. This redundancy means that features $\mathbf{f}_3$ and $\mathbf{f}_4$ add no directional information beyond what is already present in $\mathbf{f}_1$ and $\mathbf{f}_2$. The "true" dimensionality spanned by these features is only 2, as the rank indicates.
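We can confirm these relationships numerically, and also peek at the singular values that `matrix_rank` examines internally; values near machine zero correspond to dependent directions:

```python
# Verify the suspected dependencies directly
f1, f2, f3, f4 = A[:, 0], A[:, 1], A[:, 2], A[:, 3]
print(np.allclose(f3, f1 + f2))  # True
print(np.allclose(f4, 2 * f1))   # True

# matrix_rank counts singular values above a small tolerance
singular_values = np.linalg.svd(A, compute_uv=False)
print(singular_values)  # only two values are significantly above zero
```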
Now, let's consider a different set of feature vectors where we expect linear independence.
```python
# Another set of feature vectors (columns)
B = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 1]  # a fourth data point (row); B still has 3 features
])

rank_B = np.linalg.matrix_rank(B)
num_features_B = B.shape[1]

print(f"\nMatrix B:\n{B}")
print(f"Number of features (columns): {num_features_B}")
print(f"Rank of matrix B: {rank_B}")

if rank_B < num_features_B:
    print("The feature vectors (columns) of B are linearly dependent.")
else:
    print("The feature vectors (columns) of B are linearly independent.")
```
This will likely output:
```
Matrix B:
[[1 0 0]
 [0 1 0]
 [0 0 1]
 [1 1 1]]
Number of features (columns): 3
Rank of matrix B: 3
The feature vectors (columns) of B are linearly independent.
```
Here, the rank (3) equals the number of features (3), indicating that these feature vectors are linearly independent. None of these features can be represented as a linear combination of the others.
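In real datasets, measured features are rarely exactly dependent: tiny amounts of noise typically make a matrix technically full rank. For such cases, `np.linalg.matrix_rank` accepts a `tol` argument that controls how small a singular value must be to count as zero. A brief illustration (the noise scale and tolerance below are arbitrary choices for demonstration):

```python
# Add tiny noise to the redundant fourth column of A
rng = np.random.default_rng(0)
A_noisy = A.astype(float)
A_noisy[:, 3] += 1e-12 * rng.standard_normal(4)

print(np.linalg.matrix_rank(A_noisy))            # likely 3: the noise exceeds the default tolerance
print(np.linalg.matrix_rank(A_noisy, tol=1e-8))  # 2: a looser tolerance treats the column as dependent
```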
Why perform this analysis?

Rank analysis tells you how much of your feature set is genuinely informative. Linearly dependent columns signal redundancy: they contribute no new directional information, and they can cause problems such as multicollinearity in linear regression. Knowing the effective dimensionality spanned by your features helps guide preprocessing, for example dropping redundant columns before model fitting.
In this practical exercise, we used:

- `np.array()`: To create matrices from lists of lists.
- `A.shape[1]`: To get the number of columns (features, in our setup).
- `np.linalg.matrix_rank()`: To compute the rank of a matrix, our primary tool for checking linear independence among the columns.

By applying these tools, you can move from the abstract concepts of vector spaces and linear independence to concrete analysis of your feature datasets, gaining insights that inform preprocessing steps and model building in machine learning.
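If you find yourself running this check repeatedly, you might wrap it in a small helper. The function below is a hypothetical convenience, not part of NumPy:

```python
import numpy as np

def features_are_independent(X):
    """Return True if the columns (features) of X are linearly independent.

    A hypothetical helper wrapping the rank check used throughout this section.
    """
    return np.linalg.matrix_rank(X) == X.shape[1]

# With the matrices from this section:
#   features_are_independent(A) -> False
#   features_are_independent(B) -> True
print(features_are_independent(np.eye(3)))  # True: identity columns are independent
```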