Throughout this chapter, we've examined the capabilities of Matplotlib and Seaborn for creating various types of visualizations. Now, it's time to put these tools into practice. This hands-on section guides you through exploring a dataset using the plotting techniques you've learned. We'll load a dataset, ask questions about it, and use visualizations to find answers, reinforcing how graphical representations aid in understanding data structure, distributions, and relationships.
We will use the well-known Iris dataset, which is conveniently available through Seaborn. It contains 150 samples, 50 from each of three iris species (setosa, versicolor, and virginica), with four measurements recorded for every flower.
First, let's import the necessary libraries and load the dataset into a Pandas DataFrame.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Iris dataset from Seaborn
iris = sns.load_dataset('iris')
# Display the first few rows and info to understand the structure
print(iris.head())
print("\nDataset Info:")
iris.info()
You should see columns for sepal length, sepal width, petal length, petal width (all numerical), and the species (categorical).
A fundamental step in data exploration is understanding the distribution of individual variables. Let's look at the distribution of petal lengths. We can use Matplotlib for this.
# Set a style for plots (optional, but often improves appearance)
sns.set_style("whitegrid")
plt.figure(figsize=(8, 5)) # Create a figure and set its size
plt.hist(iris['petal_length'], bins=15, color='teal', edgecolor='black')
plt.title('Distribution of Petal Length')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Frequency')
plt.show()
This histogram shows how frequently different ranges of petal lengths occur in the dataset. You might observe multiple peaks, potentially indicating differences between the species.
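To make the binning step concrete, here is a minimal pure-Python sketch of what an equal-width histogram counts. The helper name bin_counts and the sample values are illustrative assumptions (the numbers are made up, not taken from the Iris data), and plt.hist relies on NumPy for this step rather than on code like this.

```python
def bin_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals.

    Assumes `values` contains at least two distinct numbers.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("need at least two distinct values")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Each bin is half-open [left, right); the last bin also
        # includes the maximum value.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges


# Made-up, petal-length-like values in cm (illustration only)
sample = [1.0, 1.5, 1.25, 1.75, 4.0, 4.5, 5.0, 5.5, 6.0]
counts, edges = bin_counts(sample, bins=5)
print(counts)
print(edges)
```

The empty bins in the middle of this toy example are the kind of gap that separates two peaks in a histogram.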
We can achieve a similar, often more refined, result using Seaborn's histplot or kdeplot (Kernel Density Estimate) functions.
plt.figure(figsize=(8, 5))
sns.histplot(data=iris, x='petal_length', bins=15, kde=True, color='indigo') # Add a density curve
plt.title('Distribution of Petal Length (Seaborn)')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Frequency')
plt.show()
Seaborn labels the axes from the DataFrame column names by default (the calls above replace those defaults with more readable text), and setting kde=True overlays a smoothed density curve, offering another perspective on the distribution.
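A kernel density estimate can be thought of as placing a small bump on every observation and averaging the bumps. The sketch below illustrates that idea with a Gaussian kernel and a hand-picked bandwidth. The function name gaussian_kde_at, the bandwidth of 0.3, and the sample values are assumptions for illustration only; Seaborn chooses its bandwidth automatically, so its curve will not match this one exactly.

```python
import math


def gaussian_kde_at(x, data, bandwidth):
    """Estimate the density at x as the average of Gaussian bumps
    centered on each data point."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)
    return norm * total / len(data)


# Made-up values forming two groups (illustration only)
sample = [1.3, 1.4, 1.5, 1.6, 4.5, 4.7, 5.0, 5.1]

for x in (1.5, 3.0, 4.8):
    density = gaussian_kde_at(x, sample, bandwidth=0.3)
    print(f"density at {x}: {density:.3f}")
```

The estimate is high near each group and close to zero in the gap between them. A larger bandwidth would smooth the two bumps toward each other, while a smaller one would make the curve spikier.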
Scatter plots are excellent for visualizing the relationship between two numerical variables. Let's see if there's a relationship between petal length and petal width.
plt.figure(figsize=(8, 6))
plt.scatter(iris['petal_length'], iris['petal_width'], alpha=0.7, color='orange')
plt.title('Petal Length vs. Petal Width')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.grid(True) # Add grid lines
plt.show()
This plot shows a clear positive correlation: flowers with longer petals tend to also have wider petals. The alpha parameter makes the markers semi-transparent, so regions where points overlap appear darker.
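The strength of a linear relationship like this one is commonly summarized with the Pearson correlation coefficient, which ranges from -1 to 1. The following is a small sketch of that calculation in plain Python. The helper pearson_r and the paired values are hypothetical (made-up numbers, not actual Iris rows); in practice you would normally let Pandas or NumPy compute this.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up petal measurements in cm (illustration only)
lengths = [1.4, 1.5, 4.5, 4.7, 5.9, 6.1]
widths = [0.2, 0.3, 1.5, 1.4, 2.1, 2.3]

r = pearson_r(lengths, widths)
print(f"r = {r:.3f}")
```

A value close to 1 corresponds to points lying near an upward-sloping line, which is the pattern visible in the scatter plot.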
Seaborn's scatterplot function can enhance this by automatically coloring points based on a third variable, like species.
plt.figure(figsize=(8, 6))
sns.scatterplot(data=iris, x='petal_length', y='petal_width', hue='species', palette='viridis')
plt.title('Petal Length vs. Petal Width by Species')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.show()
Adding the hue argument clearly separates the species, revealing distinct clusters and strengthening our understanding of the relationship within and between species groups.
We suspect the different species might have different measurement distributions. Box plots or violin plots are ideal for comparing distributions across categorical groups. Let's compare sepal width across the three species using Seaborn.
plt.figure(figsize=(9, 6))
sns.boxplot(data=iris, x='species', y='sepal_width', palette='pastel')
plt.title('Sepal Width Distribution by Species')
plt.xlabel('Species')
plt.ylabel('Sepal Width (cm)')
plt.show()
# For a different view, try a violin plot
plt.figure(figsize=(9, 6))
sns.violinplot(data=iris, x='species', y='sepal_width', palette='Set2')
plt.title('Sepal Width Distribution by Species (Violin Plot)')
plt.xlabel('Species')
plt.ylabel('Sepal Width (cm)')
plt.show()
Both plots effectively show how the distribution of sepal width (median, quartiles, range, and density shape in the violin plot) differs among the iris species.
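To see where the parts of a box plot come from, the sketch below computes the quartiles, the whisker ends, and the outliers for a small list of values. It assumes the common 1.5 × IQR whisker rule, which is the default in both Matplotlib and Seaborn. The helper box_stats and the sample values are hypothetical, and the quartiles may differ slightly from a plotting library's on other data if a different interpolation method is used.

```python
import statistics


def box_stats(values, whis=1.5):
    """Return the numbers a box plot draws for one group."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    # Whiskers end at the most extreme observations inside the fences
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Made-up sepal-width-like values in cm (illustration only)
sample = [2.0, 2.8, 2.9, 3.0, 3.0, 3.1, 3.2, 3.4, 4.4]
stats = box_stats(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

The box spans q1 to q3 with a line at the median, and any values listed under outliers would be drawn as individual points beyond the whiskers.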
To get a quick overview of pairwise relationships between all numerical variables, Seaborn's pairplot is very useful. It creates a grid of scatter plots, one for each pair of numerical variables, with histograms (or KDE plots) along the diagonal.
# Generate a pair plot, coloring by species
# diag_kind='kde' draws density curves on the diagonal instead of histograms
sns.pairplot(iris, hue='species', palette='bright', diag_kind='kde')
plt.suptitle('Pairwise Relationships in the Iris Dataset', y=1.02) # Add a main title above the plots
plt.show()
The pairplot provides a dense summary of the data. You can quickly scan it to identify potential correlations, clusters, and differences in distributions between species across all feature combinations.
A heatmap is a graphical representation of data where values are depicted by color. It's particularly useful for visualizing correlation matrices. Let's calculate the correlation between the numerical features and display it as a heatmap.
# Select only numerical columns for correlation calculation
numerical_iris = iris.select_dtypes(include=np.number)
# Calculate the correlation matrix
correlation_matrix = numerical_iris.corr()
# Print the matrix (optional)
print("\nCorrelation Matrix:")
print(correlation_matrix)
# Create the heatmap
plt.figure(figsize=(7, 5))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
# annot=True displays the correlation values on the map
# cmap sets the color map
# fmt=".2f" formats the annotation numbers to two decimal places
plt.title('Correlation Matrix of Iris Features')
plt.show()
The heatmap visually confirms the strong positive correlation between petal length and petal width observed earlier and reveals other relationships, like the negative correlation between sepal width and petal length.
While Matplotlib and Seaborn create static plots, libraries like Plotly produce interactive visualizations, which can be very helpful on web pages or for detailed exploration. An interactive counterpart to the species-colored scatter plot above would look like this:
[Figure: Interactive scatter plot showing petal length versus petal width, colored by species. Hover over points to see details.]
(Note: Displaying interactive plots requires a compatible environment, such as a Jupyter notebook or a web browser.)
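Once you have created an informative visualization, you often need to save it. Matplotlib makes this straightforward with plt.savefig(). Call it before plt.show(): in many environments the figure is cleared once it has been displayed, so saving afterward can produce a blank image.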
# Example: Create and save the boxplot from earlier
plt.figure(figsize=(9, 6))
sns.boxplot(data=iris, x='species', y='sepal_width', palette='pastel')
plt.title('Sepal Width Distribution by Species')
plt.xlabel('Species')
plt.ylabel('Sepal Width (cm)')
# Save the figure before showing it
plt.savefig('iris_sepal_width_boxplot.png', dpi=300) # Save as PNG with higher resolution
# You can also save as PDF, JPG, SVG, etc.
# plt.savefig('iris_sepal_width_boxplot.pdf')
plt.show() # Show the plot after saving
This practice section demonstrated how to apply various Matplotlib and Seaborn functions to explore a dataset visually. You learned to plot distributions, relationships, comparisons across groups, and correlations. Remember that choosing the right plot depends on the type of data and the question you are trying to answer. Effective visualization is a significant skill in data analysis and machine learning, allowing you to gain insights and communicate findings clearly.
© 2025 ApX Machine Learning