Practical examples demonstrate PyTorch's Autograd system. These exercises guide you through setting gradient requirements, performing backpropagation, inspecting gradients, observing accumulation, and disabling gradient tracking. Ensure you have PyTorch installed and can import the torch library.

## Setting Up

First, let's import PyTorch:

```python
import torch
```

## Example 1: Basic Gradient Computation

Let's start with a very simple computation and track gradients. We'll define two tensors, `x` and `w`, where `w` represents a weight we want to optimize. We'll compute a simple output `y` and then a scalar loss `L`.

**Create Tensors:** Define `x` as a tensor with some data and `w` as a tensor we want to compute gradients for (using `requires_grad=True`).

```python
# Input data
x = torch.tensor([2.0, 4.0, 6.0])

# Weight tensor - requires gradient computation
w = torch.tensor([0.5], requires_grad=True)

print(f"x: {x}")
print(f"w: {w}")
print(f"x.requires_grad: {x.requires_grad}")
print(f"w.requires_grad: {w.requires_grad}")
```

Notice that `x` does not require gradients by default, while we explicitly set it for `w`.

**Define Computation:** Perform a simple operation. Any tensor resulting from an operation involving a tensor with `requires_grad=True` will also have `requires_grad=True`.

```python
# Forward pass: y = w * x
y = w * x

# Define a simple scalar loss L (e.g., mean of y)
L = y.mean()

print(f"y: {y}")
print(f"L: {L}")
print(f"y.requires_grad: {y.requires_grad}")
print(f"L.requires_grad: {L.requires_grad}")
```

You'll see that both `y` and `L` now require gradients because they depend on `w`.

**Compute Gradients:** Use the `.backward()` method on the final scalar output (`L`) to compute gradients throughout the graph.

```python
# Perform backpropagation
L.backward()
```

**Inspect Gradients:** Check the `.grad` attribute of the tensor `w`.

```python
# Gradient is stored in w.grad
print(f"Gradient dL/dw: {w.grad}")

# x did not require gradients, so its gradient is None
print(f"Gradient dL/dx: {x.grad}")
```

Let's analyze the result for `w.grad`. The computation was:

$$
y_i = w x_i, \qquad L = \frac{1}{3} \sum_i y_i = \frac{1}{3} (w x_1 + w x_2 + w x_3)
$$

The gradient $\frac{\partial L}{\partial w}$ is:

$$
\frac{\partial L}{\partial w} = \frac{1}{3} (x_1 + x_2 + x_3)
$$

With $x = [2.0, 4.0, 6.0]$, the gradient is $\frac{1}{3}(2.0 + 4.0 + 6.0) = \frac{12.0}{3} = 4.0$. This matches the output `tensor([4.])`. Because `x` was created without `requires_grad=True`, its gradient is not computed and remains `None`.
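As a quick sanity check (not part of the original exercise), you can compare the autograd result against a central finite-difference estimate of the same derivative. This is a minimal sketch of that idea; PyTorch also provides `torch.autograd.gradcheck` for a more rigorous, automated version of the comparison.

```python
import torch

x = torch.tensor([2.0, 4.0, 6.0])
w = torch.tensor([0.5], requires_grad=True)

# Analytical gradient via autograd
L = (w * x).mean()
L.backward()
analytical = w.grad.item()

# Central finite-difference estimate: (L(w+eps) - L(w-eps)) / (2*eps)
eps = 1e-4
with torch.no_grad():
    L_plus = ((w + eps) * x).mean()
    L_minus = ((w - eps) * x).mean()
numerical = ((L_plus - L_minus) / (2 * eps)).item()

print(f"analytical: {analytical:.6f}")  # 4.000000
print(f"numerical:  {numerical:.6f}")   # ~4.000000
```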
## Example 2: Gradients and the Computation Graph

Autograd builds a graph dynamically. Let's trace a slightly more complex example.

**Create Tensors:**

```python
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = torch.tensor(4.0, requires_grad=False)  # Does not require grad

print(f"a: {a}, requires_grad={a.requires_grad}")
print(f"b: {b}, requires_grad={b.requires_grad}")
print(f"c: {c}, requires_grad={c.requires_grad}")
```

**Define Computation:**

```python
d = a * b
e = d + c
f = e * 2

print(f"d: {d}, requires_grad={d.requires_grad}")  # True (depends on a, b)
print(f"e: {e}, requires_grad={e.requires_grad}")  # True (depends on d)
print(f"f: {f}, requires_grad={f.requires_grad}")  # True (depends on e)
```

**Compute and Inspect Gradients:**

```python
# Backpropagate from the final scalar output f
f.backward()

# Check gradients
print(f"Gradient df/da: {a.grad}")
print(f"Gradient df/db: {b.grad}")
print(f"Gradient df/dc: {c.grad}")  # Expected: None
```

Let's calculate manually:

$$
d = a \times b, \qquad e = d + c = a \times b + c, \qquad f = 2e = 2(a \times b + c)
$$

$$
\frac{\partial f}{\partial a} = 2b = 2 \times 3.0 = 6.0, \qquad
\frac{\partial f}{\partial b} = 2a = 2 \times 2.0 = 4.0, \qquad
\frac{\partial f}{\partial c} = 2
$$

The computed gradients for `a` and `b` match. Since `c` was defined with `requires_grad=False`, Autograd did not track operations involving it for gradient computation relative to `c` itself, so `c.grad` is `None`.

## Example 3: Gradient Accumulation

By default, gradients are accumulated in the `.grad` attribute every time `.backward()` is called. This is useful for scenarios like calculating gradients for multiple losses or simulating larger batch sizes (see the sketch after this example), but it requires explicit zeroing of gradients during standard training loops.

**Setup:** Let's use a simple setup again.

```python
x = torch.tensor(5.0, requires_grad=True)
y = x * x

print(f"Initial x.grad: {x.grad}")  # Should be None initially
```

**First Backward Pass:** Here `y` is already a scalar, so we can call `.backward()` on it directly. (If `y` were a non-scalar tensor, we would either reduce it to a scalar loss first, e.g. `L = y.mean()`, or pass an explicit `gradient` argument such as `y.backward(gradient=torch.ones_like(y))`.)

```python
# retain_graph=True keeps the graph alive for multiple backward passes
y.backward(retain_graph=True)
print(f"x.grad after 1st backward: {x.grad}")  # Expected: 2*x = 10.0
```

**Second Backward Pass (Accumulation):** Call backward again without zeroing the gradient.

```python
y.backward(retain_graph=True)  # Call backward again
print(f"x.grad after 2nd backward: {x.grad}")  # Expected: 10.0 + 10.0 = 20.0
```

The gradient is accumulated (added) to the previous value.

**Zeroing Gradients:** Manually zero the gradient. In a typical training loop, this is done using `optimizer.zero_grad()`.

```python
if x.grad is not None:
    x.grad.zero_()  # In-place zeroing
print(f"x.grad after zeroing: {x.grad}")  # Expected: 0.0
```

**Third Backward Pass (After Zeroing):**

```python
y.backward()  # No need for retain_graph on the final backward pass
print(f"x.grad after 3rd backward: {x.grad}")  # Expected: 10.0
```

The gradient is computed fresh after being zeroed. Forgetting to zero gradients is a common source of errors in training loops.
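The sketch below shows how accumulation is used deliberately: splitting a batch into micro-batches and accumulating gradients before a single optimizer step. The model and data here are illustrative (not from the exercises above); note that `optimizer.zero_grad()` runs once per effective batch, not once per micro-batch.

```python
import torch

# Illustrative setup: a single weight fit by SGD (names are hypothetical)
w = torch.tensor([0.5], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.01)

micro_batches = [torch.tensor([2.0, 4.0]), torch.tensor([6.0, 8.0])]
target = 3.0
accum_steps = len(micro_batches)

optimizer.zero_grad()  # Clear stale gradients once per effective batch
for xb in micro_batches:
    loss = ((w * xb - target) ** 2).mean()
    # Scale so the accumulated gradient matches the full-batch average
    (loss / accum_steps).backward()  # Adds into w.grad

optimizer.step()  # One update using the accumulated gradient
print(f"w after one accumulated step: {w}")
```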
## Example 4: Disabling Gradient Tracking

Sometimes, you need to perform operations without tracking them for gradient computation, most commonly during model evaluation (inference) or when adjusting parameters outside the optimization step.

**Using `torch.no_grad()`:** This context manager is the standard way to disable gradient tracking for a block of code.

```python
a = torch.tensor(2.0, requires_grad=True)
print(f"Outside context: a.requires_grad = {a.requires_grad}")

with torch.no_grad():
    print(f"Inside context: a.requires_grad = {a.requires_grad}")  # Still True
    b = a * 2
    print(f"Inside context: b = {b}, b.requires_grad = {b.requires_grad}")  # False!

# Outside the context, computations resume tracking if inputs require grad
c = a * 3
print(f"Outside context: c = {c}, c.requires_grad = {c.requires_grad}")  # True
```

Inside the `torch.no_grad()` block, even though `a` requires gradients, the resulting tensor `b` does not. This makes operations within the block more memory-efficient and faster, as the history for backpropagation isn't saved.

**Using `.detach()`:** This method creates a new tensor that shares the same data but is detached from the computation history. It doesn't require gradients.

```python
a = torch.tensor(5.0, requires_grad=True)
b = a * a  # b requires grad and is part of the graph connected to a

# Detach a to create a new tensor c that doesn't require gradients
c = a.detach()
print(f"a.requires_grad: {a.requires_grad}")  # True
print(f"c.requires_grad: {c.requires_grad}")  # False

# Operations with c won't be tracked back to a
d = c * 3  # d does not require grad
print(f"d.requires_grad: {d.requires_grad}")  # False

# Backward through 'b' flows back to 'a'
L1 = b.mean()  # Depends on 'a'
L1.backward()
print(f"Gradient dL1/da: {a.grad}")  # Expected: 2*a = 10.0

# Zero gradients before the next backward call
if a.grad is not None:
    a.grad.zero_()

# Backward through a computation mixing 'a' and the detached 'd':
# only the direct 'a' path contributes to a.grad.
L2 = (a + d).mean()  # = (a + a.detach() * 3).mean()
L2.backward()
print(f"Gradient dL2/da: {a.grad}")  # 1.0 - the path through 'd' contributes nothing

# Modify c (the detached tensor) in-place - it affects a, because they share data!
with torch.no_grad():
    c.fill_(100.0)  # a is 0-dim, so use fill_ rather than indexing
print(f"After modifying c, a = {a}")  # 'a' also changes!
print(f"After modifying c, c = {c}")
```

Note that if the final scalar depended only on detached tensors (e.g., `L2 = d.mean()`), calling `.backward()` would raise a `RuntimeError`, because nothing in that graph requires gradients. Here `L2` also depends directly on `a`, so the backward pass succeeds and yields `a.grad = 1.0`.

`detach()` is useful when you want to use a tensor's value in a calculation but prevent gradients from flowing back through that specific path, or when you need a tensor without gradient history (e.g., for plotting or logging). Be mindful that it shares data storage, so in-place modifications affect the original tensor unless you `.clone()` it first (`c = a.detach().clone()`), as shown below.
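Here is a minimal sketch of that difference: modifying a plain `.detach()` view writes through to the original tensor, while a `.detach().clone()` copy is safe to mutate.

```python
import torch

a = torch.tensor([5.0], requires_grad=True)

view = a.detach()          # Shares storage with a
copy = a.detach().clone()  # Independent storage, no gradient history

copy.fill_(100.0)          # Safe: only the clone changes
print(f"a after modifying the clone: {a}")  # tensor([5.], requires_grad=True)

view.fill_(200.0)          # Shared storage: a's data changes too
print(f"a after modifying the view:  {a}")  # tensor([200.], requires_grad=True)
```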
These exercises demonstrate the core mechanics of Autograd. You've practiced enabling gradient tracking, performing backpropagation, inspecting the computed gradients, understanding accumulation, and disabling tracking when necessary. Mastering these operations is fundamental for building and training neural networks in PyTorch.