Training machine learning models, particularly deep neural networks, relies heavily on optimization algorithms like gradient descent. These algorithms iteratively adjust model parameters (like weights and biases) to minimize a loss function. To know how to adjust the parameters, we need to calculate the gradient of the loss function with respect to each parameter. This gradient indicates the direction of the steepest ascent of the loss function, so moving in the opposite direction helps minimize the loss.
Manually deriving these gradients for complex models with potentially millions of parameters is impractical and error-prone. Symbolic differentiation (like that used in Mathematica or SymPy) can generate exact derivatives, but the resulting expressions can become extremely complex ("expression swell") and computationally expensive to evaluate. Numerical differentiation (approximating gradients using finite differences) is simpler to implement but can be computationally slow and suffer from numerical precision issues.
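As a quick illustration of the finite-difference approach just described, here is a minimal sketch (the step size `h = 1e-5` is an arbitrary choice that trades truncation error against floating-point round-off):

```python
# Central-difference approximation of the derivative of f at x.
# Simple to write, but only approximate and sensitive to the choice of h.
def central_difference(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

# Approximate d(x^2)/dx at x = 3; the true value is 6.0
approx = central_difference(lambda v: v * v, 3.0)
print(approx)
```

Note that each gradient estimate needs two extra function evaluations per parameter, which is what makes this approach impractical for models with millions of parameters.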
TensorFlow employs a powerful technique called automatic differentiation (AutoDiff) to compute gradients efficiently and accurately. AutoDiff calculates the exact numerical value of the gradient without explicitly deriving the symbolic gradient expression. It achieves this by decomposing the computation into a sequence of elementary operations (addition, multiplication, activation functions, etc.) and applying the chain rule systematically during a traversal of the computation graph.
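To make the idea concrete, here is a hand-rolled sketch of this decomposition for a single composite function, y = sin(x^2). This is purely illustrative; TensorFlow's implementation is far more general:

```python
import math

# Forward pass: decompose y = sin(x^2) into elementary operations.
x = 3.0
a = x * x          # a = x^2
y = math.sin(a)    # y = sin(a)

# Backward pass: apply the chain rule from the output back to the input.
dy_da = math.cos(a)    # d(sin a)/da
da_dx = 2.0 * x        # d(x^2)/dx
dy_dx = dy_da * da_dx  # chain rule: dy/dx = cos(x^2) * 2x

print(dy_dx)
```

Every gradient computed this way is exact (up to floating-point precision), because each elementary derivative is exact; no symbolic expression for the overall gradient is ever built.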
TensorFlow provides the tf.GradientTape API for automatic differentiation. It works like a tape recorder: operations executed within the scope of a tf.GradientTape context manager are "recorded" onto the tape. TensorFlow then uses this tape and the chain rule to compute gradients by traversing the recorded operations backward from the output (target) to the inputs (sources).
By default, tf.GradientTape automatically "watches" any trainable tf.Variable accessed within its context. When you request a gradient, the tape calculates it with respect to these watched variables.
Let's see a simple example. Suppose we want to find the gradient of y = x^2 with respect to x when x = 3.
import tensorflow as tf

# Create a scalar tf.Variable (needs to be a float for differentiation)
x = tf.Variable(3.0)

# Start recording operations on the tape
with tf.GradientTape() as tape:
    # Define the function y = x^2. This operation is recorded.
    y = x * x  # Or tf.square(x)

# Calculate the gradient of y with respect to x
# dy_dx will be 2*x evaluated at x=3, which is 6.0
dy_dx = tape.gradient(y, x)

print(f"x: {x.numpy()}")
print(f"y = x^2: {y.numpy()}")
print(f"dy/dx: {dy_dx.numpy()}")

# Expected output:
# x: 3.0
# y = x^2: 9.0
# dy/dx: 6.0
In this code:

1. We define x as a tf.Variable. Since it is trainable by default, the GradientTape will automatically watch it.
2. The computation happens inside the with tf.GradientTape() as tape: block.
3. The operation y = x * x is performed. The tape records this operation and knows that y depends on x.
4. We call tape.gradient(y, x). The tape replays the recorded operations in reverse to compute the gradient of the target (y) with respect to the source (x), applying the chain rule. The result is dy/dx = 2x, which evaluates to 2 × 3.0 = 6.0.

What if you want to compute gradients with respect to a plain tf.Tensor that isn't a tf.Variable? By default, the tape doesn't watch tensors. You need to explicitly tell the tape to watch it using tape.watch().
import tensorflow as tf

# Create a constant Tensor
x0 = tf.constant(3.0)

with tf.GradientTape() as tape:
    # Manually tell the tape to watch this tensor
    tape.watch(x0)
    # Define the computation
    y = x0 * x0

# Calculate the gradient
dy_dx0 = tape.gradient(y, x0)

print(f"x0: {x0.numpy()}")
print(f"y = x0^2: {y.numpy()}")
print(f"dy/dx0: {dy_dx0.numpy()}")

# Expected output:
# x0: 3.0
# y = x0^2: 9.0
# dy/dx0: 6.0
While you can watch constants, gradients are typically computed with respect to model parameters, which are naturally represented as tf.Variables.
You can compute the gradient of a target with respect to multiple sources (variables or watched tensors) by passing a list or tuple of sources to tape.gradient(). It will return a list of gradients in the same order as the sources.
import tensorflow as tf

x = tf.Variable(2.0)
y = tf.Variable(3.0)

with tf.GradientTape() as tape:
    # z = x^2 + y^3
    z = tf.square(x) + tf.pow(y, 3)

# Calculate gradients of z with respect to both x and y
# dz_dx = 2*x = 4.0
# dz_dy = 3*y^2 = 27.0
dz_dx, dz_dy = tape.gradient(z, [x, y])

print(f"x: {x.numpy()}, y: {y.numpy()}")
print(f"z = x^2 + y^3: {z.numpy()}")
print(f"dz/dx: {dz_dx.numpy()}")
print(f"dz/dy: {dz_dy.numpy()}")

# Expected output:
# x: 2.0, y: 3.0
# z = x^2 + y^3: 41.0
# dz/dx: 4.0
# dz/dy: 27.0
Sometimes, you might want fine-grained control over which variables are watched. Trainable variables (trainable=True) are watched by default. You can prevent the tape from watching them by setting watch_accessed_variables=False when creating the tape.
import tensorflow as tf

x0 = tf.Variable(2.0)
x1 = tf.Variable(2.0, trainable=False)  # Not trainable
x2 = tf.Variable(2.0)
y = tf.Variable(3.0)

# persistent=True is needed here because we call tape.gradient() twice below
with tf.GradientTape(watch_accessed_variables=False, persistent=True) as tape:
    # We must explicitly watch the variables we need gradients for
    tape.watch(x0)
    tape.watch(x2)
    tape.watch(y)
    # z depends on x0, x1, x2, y
    z = tf.square(x0) + tf.square(x1) * tf.square(x2) + tf.pow(y, 3)

# Calculate gradients only for watched variables
dz_dx0, dz_dx2, dz_dy = tape.gradient(z, [x0, x2, y])

print(f"dz/dx0: {dz_dx0.numpy()}")  # 2*x0 = 4.0
print(f"dz/dx2: {dz_dx2.numpy()}")  # 2*x1^2*x2 = 2*(2^2)*2 = 16.0
print(f"dz/dy: {dz_dy.numpy()}")    # 3*y^2 = 3*(3^2) = 27.0

# Attempting to get the gradient for the unwatched x1 returns None
dz_dx1 = tape.gradient(z, x1)
print(f"dz/dx1: {dz_dx1}")  # Output: None

del tape  # release the persistent tape's resources
By default, a GradientTape's resources are released as soon as the tape.gradient() method is called, so you can perform only one gradient computation per tape. If you need to compute multiple gradients (e.g., second derivatives, or gradients of different targets with respect to the same sources), you must create a persistent tape by setting persistent=True.
import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape(persistent=True) as tape:
    y = x * x  # y = x^2
    z = y * y  # z = y^2 = (x^2)^2 = x^4

# Calculate the gradient of z with respect to x (dz/dx = 4x^3 = 4*27 = 108)
dz_dx = tape.gradient(z, x)
print(f"dz/dx: {dz_dx.numpy()}")

# Calculate the gradient of y with respect to x (dy/dx = 2x = 6)
# This is possible because the tape is persistent
dy_dx = tape.gradient(y, x)
print(f"dy/dx: {dy_dx.numpy()}")

# Expected output:
# dz/dx: 108.0
# dy/dx: 6.0

# Don't forget to delete the tape when you're done with it!
del tape
Important: When using a persistent tape, you must manually delete it using del tape when you are finished with it; otherwise the resources it holds (the recorded operations) will not be released, potentially leading to memory leaks.
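The second-derivative case mentioned above uses a slightly different pattern: nesting one tape inside another, so the outer tape records the gradient computation performed by the inner one. A minimal sketch (assuming TensorFlow is available):

```python
import tensorflow as tf

x = tf.Variable(3.0)

# The outer tape records the inner tape's gradient computation,
# so differentiating that result gives the second derivative.
with tf.GradientTape() as outer_tape:
    with tf.GradientTape() as inner_tape:
        y = x * x * x                    # y = x^3
    dy_dx = inner_tape.gradient(y, x)    # 3*x^2 = 27.0

d2y_dx2 = outer_tape.gradient(dy_dx, x)  # 6*x = 18.0

print(f"dy/dx: {dy_dx.numpy()}")
print(f"d2y/dx2: {d2y_dx2.numpy()}")
```

Note that inner_tape.gradient() must be called inside the outer tape's context; otherwise the outer tape has nothing to differentiate.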
tf.GradientTape correctly handles Python control flow like if statements and for loops. Operations are recorded as they are executed.
import tensorflow as tf

x = tf.Variable(2.0)

with tf.GradientTape() as tape:
    if x > 0.0:
        y = tf.square(x)    # Executed, gradient is 2*x
    else:
        y = tf.negative(x)  # Not executed

# dy/dx = 2*x = 4.0
dy_dx = tape.gradient(y, x)
print(f"dy/dx: {dy_dx.numpy()}")  # Output: 4.0
What happens if your computation involves operations that are not differentiable with respect to a certain variable (e.g., using tf.cast to change type, or integer operations)? GradientTape will return None for such gradients.
import tensorflow as tf

x = tf.Variable(2.0)

with tf.GradientTape() as tape:
    # Casting to integer makes the operation non-differentiable w.r.t. x
    y = tf.cast(x, tf.int32)
    # Multiply by a float to keep the graph connected
    z = tf.cast(y, tf.float32) * 2.0

# The gradient pathway is broken by the cast to int32
dz_dx = tape.gradient(z, x)
print(f"dz/dx: {dz_dx}")  # Output: None
TensorFlow cannot compute gradients through operations that inherently break the continuous connection needed for differentiation, like casting to discrete types (tf.int32, tf.bool) or using functions like tf.round.
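For example, rounding breaks the gradient pathway in the same way as the integer cast above (a small sketch):

```python
import tensorflow as tf

x = tf.Variable(2.7)

with tf.GradientTape() as tape:
    # tf.round has no registered gradient, so the path back to x is broken
    y = tf.round(x) * 2.0

dy_dx = tape.gradient(y, x)
print(f"dy/dx: {dy_dx}")  # Output: None
```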
tf.GradientTape is the foundation for training models in TensorFlow. It allows the framework to automatically calculate the gradients needed to update model parameters via optimization algorithms. Understanding how it records operations and computes gradients is essential for building, debugging, and customizing training loops, which we will explore in later chapters.
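As a preview of that role, here is a minimal gradient-descent loop built on tf.GradientTape. The one-parameter model, the data, and the learning rate are all made up for illustration:

```python
import tensorflow as tf

# Hypothetical one-parameter model: fit w so that w * x approximates y_true.
w = tf.Variable(1.0)
x_data = tf.constant([1.0, 2.0, 3.0])
y_true = tf.constant([2.0, 4.0, 6.0])  # the true relationship is y = 2x

learning_rate = 0.1
for _ in range(50):
    with tf.GradientTape() as tape:
        y_pred = w * x_data
        loss = tf.reduce_mean(tf.square(y_pred - y_true))  # mean squared error
    grad = tape.gradient(loss, w)    # dL/dw
    w.assign_sub(learning_rate * grad)  # gradient descent step

print(w.numpy())  # converges toward 2.0
```

In practice you would use a tf.keras optimizer rather than updating variables by hand, but the tape-then-update structure is the same.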
© 2025 ApX Machine Learning