This section provides hands-on exercises to solidify your Julia skills for data manipulation and the implementation of simple algorithmic components, building upon the foundational concepts discussed earlier in this chapter. You'll work with common data science packages and apply numerical computation techniques that are essential for deep learning. We assume you have your Julia environment set up with packages like DataFrames.jl, CSV.jl, and Plots.jl. If not, please refer to the "Setting Up Your Julia Deep Learning Environment" section.
Deep learning, and machine learning in general, starts with data. Julia offers excellent tools for handling various data formats. We'll begin by creating a small, illustrative dataset, saving it as a CSV (Comma Separated Values) file, and then loading it using CSV.jl and DataFrames.jl.
Let's imagine a dataset from an agricultural experiment, tracking plant growth based on sunlight and water.
First, ensure you have the necessary packages. If you're in the Julia REPL, you can add them:
using Pkg
Pkg.add("DataFrames")
Pkg.add("CSV")
Pkg.add("Plots") # We'll use this later
Now, let's create our sample data as a string and write it to a file.
using CSV
using DataFrames
# Define the data as a multi-line string
csv_data = """
plant_id,sunlight_hours,water_ml,growth_cm
1,4.5,100,5.2
2,5.0,110,5.8
3,3.5,90,4.1
4,6.1,120,7.3
5,4.2,95,4.9
6,5.5,115,6.5
"""
# Write this data to a temporary CSV file
file_path = "plant_data.csv"
open(file_path, "w") do io
write(io, csv_data)
end
# Load the data using CSV.jl into a DataFrame
df = CSV.read(file_path, DataFrame)
# Display the first few rows
println("First 3 rows of the DataFrame:")
println(first(df, 3))
The CSV.read function is straightforward: it takes the file path and the desired sink type, which in our case is DataFrame. DataFrames.jl provides a powerful and flexible way to work with tabular data, similar to Pandas in Python or R's data frames.
Let's explore our DataFrame a bit more:
# Get a statistical summary of the DataFrame
println("\nStatistical summary:")
println(describe(df))
# Get column names
println("\nColumn names:")
println(names(df))
# Select a specific column (e.g., growth_cm)
growth_data = df.growth_cm
println("\nGrowth data (first 5 values):")
println(first(growth_data, 5))
# Filter rows: plants that received more than 5 hours of sunlight
high_sunlight_plants = filter(row -> row.sunlight_hours > 5, df)
println("\nPlants with >5 hours of sunlight:")
println(high_sunlight_plants)
The describe function gives you a quick overview of each column, including mean, min, max, and other statistics for numerical columns. You can access columns using dot notation (e.g., df.sunlight_hours) or by string indexing (e.g., df[!, "sunlight_hours"]). The filter function is versatile for selecting subsets of your data based on conditions.
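As a small standalone illustration, the predicate passed to filter can combine several conditions; the values and thresholds below are our own, not from the dataset above:

```julia
using DataFrames

# A tiny DataFrame with illustrative values
df = DataFrame(plant_id=[1, 2, 3],
               sunlight_hours=[4.5, 5.0, 6.1],
               water_ml=[100, 110, 120])

# Combine conditions inside the predicate with &&
subset_df = filter(row -> row.sunlight_hours > 4.8 && row.water_ml >= 110, df)
```

Here only the rows satisfying both conditions survive, and the result is itself a DataFrame, so further operations can be chained on it.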
Often, raw data isn't in the perfect shape for analysis or model training. You might need to create new features or normalize existing ones.
Let's create a new feature, water_per_sunlight_hour, which might give us an idea of watering efficiency relative to sunlight.
# Create a new column: water_ml / sunlight_hours
# Using broadcasting with . (dot) for element-wise operations
df.water_per_sunlight_hour = df.water_ml ./ df.sunlight_hours
println("\nDataFrame with new feature:")
println(first(df, 3))
Julia's broadcasting syntax (a dot . before an operator, as in ./, or after a function name, as in foo.(...)) applies the operation element-wise to arrays or columns. This is efficient and idiomatic in Julia.
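A minimal sketch of broadcasting on plain vectors (the values here are illustrative):

```julia
# The dot makes any function or operator element-wise
xs = [1.0, 4.0, 9.0]
roots  = sqrt.(xs)      # element-wise square root
halved = xs ./ 2.0      # the scalar divisor is reused for each element
summed = xs .+ roots    # element-wise addition of two equal-length arrays
```

Scalars mixed into a broadcast, like the 2.0 above, are automatically repeated against the array, which is exactly what happens when you divide one DataFrame column by another or by a constant.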
Normalization is a common preprocessing step. Let's apply Min-Max scaling to the growth_cm column. The formula is:

x_normalized = (x - min(x)) / (max(x) - min(x))

This scales the data to a range, typically [0, 1].
# Min-Max normalization for 'growth_cm'
growth_col = df.growth_cm
min_growth = minimum(growth_col)
max_growth = maximum(growth_col)
# Apply the formula using broadcasting
df.growth_cm_normalized = (growth_col .- min_growth) ./ (max_growth - min_growth)
println("\nDataFrame with normalized growth:")
println(select(df, :plant_id, :growth_cm, :growth_cm_normalized)) # Show relevant columns
Here, minimum and maximum are standard Julia functions that work on arrays. The arithmetic operations .- and ./ are broadcasted across the growth_col array.
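The same pattern can be packaged into a small reusable helper; this is our own sketch, and the name minmax_scale is not from the text:

```julia
# Min-Max scaling as a one-line function (a sketch, not from the text)
minmax_scale(v) = (v .- minimum(v)) ./ (maximum(v) - minimum(v))

scaled = minmax_scale([4.1, 5.2, 7.3])
# The smallest value maps to 0.0 and the largest to 1.0
```

Defining the transformation once avoids repeating the minimum/maximum bookkeeping for every column you want to normalize.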
While this chapter is foundational, let's implement a very simple predictive calculation. Imagine we have a hypothesis that plant growth can be predicted by a linear combination of sunlight hours and water amount:
predicted_growth = w1 × sunlight_hours + w2 × water_ml + b

where w1 and w2 are weights and b is a bias term. In a full machine learning workflow, we would learn these parameters. For now, let's assume we have some predefined (perhaps arbitrary) values and calculate the predictions.
# Define a function for our simple linear model
function predict_growth(sunlight, water, w1, w2, b)
return w1 * sunlight + w2 * water + b
end
# Let's assume some weights and bias
w1_hypothetical = 0.8
w2_hypothetical = 0.05
b_hypothetical = -1.0
# Apply this prediction to each row of our DataFrame
# We can create a new column with these predictions
# Using an anonymous function and row-wise iteration (less common for large data, but clear here)
df.predicted_growth = [predict_growth(row.sunlight_hours, row.water_ml, w1_hypothetical, w2_hypothetical, b_hypothetical) for row in eachrow(df)]
# Alternatively, using broadcasting for a more "Julian" vectorized approach:
# df.predicted_growth = predict_growth.(df.sunlight_hours, df.water_ml, w1_hypothetical, w2_hypothetical, b_hypothetical)
# Note: the dot call broadcasts over the array columns while the scalar weights
# and bias are reused for every element, so predict_growth works unchanged.
# The `eachrow` approach is often clearer for beginners when applying row-wise logic.
println("\nDataFrame with predicted growth:")
println(select(df, :plant_id, :growth_cm, :predicted_growth))
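To quantify how far these arbitrary parameters are from the actual measurements, one common measure (a sketch we add here; the book's training procedure comes later, and the values below are illustrative) is the mean squared error:

```julia
using Statistics

# Mean squared error between actual and predicted values
actual    = [5.2, 5.8, 4.1]
predicted = [5.0, 6.0, 4.5]
mse = mean((actual .- predicted) .^ 2)
```

A learning algorithm would adjust w1, w2, and b to drive this number down.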
This exercise demonstrates how to define functions in Julia and apply them to your data. The calculation itself is a fundamental part of linear models and the linear layers in neural networks. Later, you'll learn how automatic differentiation helps in finding optimal values for w1, w2, and b.
A picture is often worth a thousand numbers. Plots.jl is a popular plotting metapackage in Julia, allowing you to use various plotting backends. Let's create a simple scatter plot to visualize the relationship between actual growth_cm and our predicted_growth.
using Plots
gr() # Using the GR backend for Plots.jl, you can choose others like pyplot() or plotlyjs()
# Scatter plot of actual vs. predicted growth
scatter_plot = scatter(df.growth_cm, df.predicted_growth,
xlabel="Actual Growth (cm)",
ylabel="Predicted Growth (cm)",
title="Actual vs. Predicted Growth",
legend=false,
aspect_ratio=:equal, # Makes scales on x and y axes visually comparable
color=:blue,
markersize=5)
# Add a line y=x for reference (perfect prediction)
plot!(df.growth_cm, df.growth_cm, line=:dash, color=:red)
# To display the plot in environments like VS Code or Jupyter:
# display(scatter_plot)
# Or save it to a file:
# savefig(scatter_plot, "actual_vs_predicted_growth.png")
Relationship between actual plant growth and growth predicted by a simple linear model with arbitrary weights. The dashed line represents a perfect prediction.
Visual inspection is a critical step in any data analysis or modeling task. It can help you understand relationships, identify outliers, and evaluate model performance.
After manipulating your DataFrame, you might want to save the results. CSV.jl can also write DataFrames to CSV files.
# Define the output file path
output_file_path = "plant_data_processed.csv"
# Write the DataFrame to a new CSV file
CSV.write(output_file_path, df)
println("\nProcessed DataFrame saved to: $output_file_path")
# Clean up the created files (optional, for tidiness in this example)
# rm(file_path)
# rm(output_file_path)
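A quick round-trip check is a good habit after saving: write a DataFrame and re-read it to confirm nothing was lost. This is a self-contained sketch using a temporary file, not part of the workflow above:

```julia
using CSV, DataFrames

# Write a small DataFrame and read it straight back
df_out = DataFrame(plant_id=[1, 2], growth_cm=[5.2, 5.8])
path = tempname() * ".csv"
CSV.write(path, df_out)
df_in = CSV.read(path, DataFrame)
rm(path)  # tidy up the temporary file
```

If the round trip preserves both column names and values, you can trust the saved file for later sessions.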
This practical session walked you through essential data handling tasks in Julia: loading, inspecting, transforming, performing basic calculations, visualizing, and saving data. These skills are the bedrock upon which you will build and train more complex deep learning models in the subsequent chapters. As you move forward, you'll see how Julia's performance and expressive syntax make these operations both efficient and enjoyable. The next chapter introduces Flux.jl, where you'll start constructing neural networks using many of these foundational data manipulation abilities.
© 2025 ApX Machine Learning