Now that we understand why splitting data into training and testing sets is fundamental for reliable model evaluation, let's walk through how to actually perform the split. We'll demonstrate with Python and the popular scikit-learn library, but the concept applies regardless of the specific tools you use.
Imagine we have a small dataset where we want to predict whether a fruit is an apple (represented by 0) or an orange (represented by 1), based on two features: its weight in grams and its texture (0 for smooth, 1 for bumpy).
Our data might look something like this:
Let's say we have data for 10 fruits:
Features (X):
[[150, 0], [170, 0], [140, 1], [130, 1], [160, 0], [180, 0], [125, 1], [135, 1], [190, 0], [145, 1]]
Labels (y):
[0, 0, 1, 1, 0, 0, 1, 1, 0, 1]
We have 10 data points (rows). Each row in X corresponds to the label in the same position in y. For example, the first fruit weighs 150g, has a smooth texture ([150, 0]), and is an apple (0).
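To make this pairing concrete, here is a small supplementary sketch (plain Python, not part of the walkthrough itself) that prints each fruit's features next to its label:

# Pair each feature row with its label and print a readable summary
X = [[150, 0], [170, 0], [140, 1], [130, 1], [160, 0],
     [180, 0], [125, 1], [135, 1], [190, 0], [145, 1]]
y = [0, 0, 1, 1, 0, 0, 1, 1, 0, 1]

for features, label in zip(X, y):
    name = "apple" if label == 0 else "orange"
    print(features, "->", name)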
Our goal is to split this data into a training set (to teach our model) and a test set (to evaluate how well it learned). We'll use a common 70/30 split, meaning 70% of the data (7 samples) will be used for training, and 30% (3 samples) will be held back for testing.
Using scikit-learn for the Split

In Python, the scikit-learn library provides a convenient function called train_test_split within its model_selection module. Let's see how to use it.
First, you'd typically import the function and prepare your data (we use NumPy arrays here, though plain Python lists would also work for this simple example):
# Import the function
from sklearn.model_selection import train_test_split
import numpy as np # Often used for data representation
# Our feature data (Weight, Texture)
X = np.array([[150, 0], [170, 0], [140, 1], [130, 1], [160, 0],
[180, 0], [125, 1], [135, 1], [190, 0], [145, 1]])
# Our label data (0=Apple, 1=Orange)
y = np.array([0, 0, 1, 1, 0, 0, 1, 1, 0, 1])
# Perform the split (70% train, 30% test)
# We set random_state for reproducible results
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Let's look at the results
print("Original data points:", len(X))
print("Training data points:", len(X_train))
print("Test data points:", len(X_test))
print("\nTraining Features (X_train):\n", X_train)
print("\nTraining Labels (y_train):\n", y_train)
print("\nTest Features (X_test):\n", X_test)
print("\nTest Labels (y_test):\n", y_test)
Understanding the train_test_split Function

Let's break down the key parts of train_test_split(X, y, test_size=0.3, random_state=42):

- X: Our input feature data.
- y: Our corresponding label data. The function ensures that the link between a feature row and its label is maintained during the split.
- test_size=0.3: The proportion of the dataset to include in the test split. Here, 0.3 means 30% of the data will be allocated to the test set (X_test, y_test), and the remaining 70% will go to the training set (X_train, y_train). You could also use train_size=0.7 to achieve the same result. If you provide an integer instead of a float (e.g., test_size=3), it specifies the absolute number of test samples. Both variants appear in the short example after this list.
- random_state=42: As mentioned earlier, splitting usually involves shuffling the data randomly before dividing it. Setting random_state to a specific integer (like 42, 0, or any other number) ensures that the same random shuffle and split occur every time you run the code, which is essential for reproducible results. If you omit random_state, you'll get a different split each time, which can make debugging or comparing results difficult.
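As a quick illustration of these variants (a supplementary sketch, not part of the main walkthrough), the following code shows that test_size=0.3, train_size=0.7, and test_size=3 all produce the same split on our 10 samples when the same random_state is used:

# Three equivalent ways to request a 7/3 split on 10 samples.
# With the same random_state, the shuffle is identical, so all
# three calls return exactly the same partition.
split_a = train_test_split(X, y, test_size=0.3, random_state=42)
split_b = train_test_split(X, y, train_size=0.7, random_state=42)
split_c = train_test_split(X, y, test_size=3, random_state=42)  # absolute count

# Each call returns [X_train, X_test, y_train, y_test];
# compare the test features from each one
print(np.array_equal(split_a[1], split_b[1]))  # True
print(np.array_equal(split_a[1], split_c[1]))  # True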
Running the code above would produce output similar to this (the exact rows depend on the random_state used):
Original data points: 10
Training data points: 7
Test data points: 3
Training Features (X_train):
[[145 1]
[170 0]
[140 1]
[160 0]
[135 1]
[180 0]
[125 1]]
Training Labels (y_train):
[1 0 1 0 1 0 1]
Test Features (X_test):
[[190 0]
[130 1]
[150 0]]
Test Labels (y_test):
[0 1 0]
Notice that:

- The training set (X_train, y_train) contains 7 data points (70% of 10).
- The test set (X_test, y_test) contains 3 data points (30% of 10).
- The features in X_train correspond to the labels in y_train, and similarly for the test sets. The original association between features and labels is preserved within each set.
- The data points were shuffled before being split by train_test_split (controlled by random_state).

You can confirm these properties in code, as in the sketch below.
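As a quick sanity check (supplementary, not part of the original example), this sketch verifies the set sizes and the preserved feature-label pairing:

# Verify the sizes of the two sets
assert len(X_train) + len(X_test) == len(X)
assert len(X_train) == len(y_train)
assert len(X_test) == len(y_test)

# Verify that every (features, label) pair in the split sets
# also appears as a pair in the original data
original_pairs = {(int(r[0]), int(r[1]), int(l)) for r, l in zip(X, y)}
for r, l in zip(X_train, y_train):
    assert (int(r[0]), int(r[1]), int(l)) in original_pairs
for r, l in zip(X_test, y_test):
    assert (int(r[0]), int(r[1]), int(l)) in original_pairs

print("All checks passed.")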
Here's a simple visualization of the process:

[Diagram: the original dataset is passed through the splitting function to produce separate training and testing sets with the specified proportions.]
This practical step of splitting your data is crucial. You now have a training set (X_train, y_train) to teach your model and a completely separate test set (X_test, y_test) that the model hasn't seen yet. This test set will be used later to get a fair assessment of how well your model generalizes to new, unseen data, using the metrics we've learned.
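To preview how these pieces fit together, here is a minimal sketch of the next steps, assuming a simple classifier such as scikit-learn's KNeighborsClassifier (an illustrative choice, not one prescribed by this walkthrough):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Fit a simple model using only the training set
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Evaluate on the held-out test set the model has never seen
predictions = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))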