Lecture 02 - Vanilla implementation of gradient descent
MachineLearningCourse.Lecture02 — Module
Lecture02
Vanilla implementation of gradient descent.
Available Functions
demo(): Gradient descent demo for 5x5 pixel symbol recognition
Usage
using MachineLearningCourse
Lecture02.demo()
MachineLearningCourse.Lecture02.compute_average_gradients — Method
compute_average_gradients(W, b, X, Y)
Compute average gradients across all training samples for batch gradient descent.
Performs gradient computation for each sample and averages the results:
- For each sample (xᵢ, yᵢ): compute ∇Wᵢ, ∇bᵢ
- Return average: (1/N) * Σ(∇Wᵢ), (1/N) * Σ(∇bᵢ)
Arguments
W::Matrix{Float32}: Weight matrix (n_outputs × n_inputs)
b::Vector{Float32}: Bias vector (n_outputs,)
X::Vector{Vector{Float32}}: Training input data (N samples)
Y::Vector{Vector{Float32}}: Training target data (N samples)
Returns
Tuple{Matrix{Float32}, Vector{Float32}}: (avg_∇W, avg_∇b)
avg_∇W: Average weight gradients
avg_∇b: Average bias gradients
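A minimal sketch of this averaging step, assuming the linear model â = W * x + b and loss ℒ = ‖â - y‖² described elsewhere in this module; the per-sample gradient is derived inline here so the example is self-contained, and the actual implementation may differ:

```julia
# Sketch: average per-sample gradients for the linear model â = W*x + b
# with ℒ = ‖â - y‖² (assumed from this module's docstrings).
function compute_average_gradients(W, b, X, Y)
    N = length(X)
    avg_∇W = zeros(Float32, size(W))
    avg_∇b = zeros(Float32, size(b))
    for (x, y) in zip(X, Y)
        δ = 2f0 .* (W * x .+ b .- y)  # δ = ∂ℒ/∂â for this sample
        avg_∇W .+= δ * x'             # accumulate ∇Wᵢ = δ xᵀ
        avg_∇b .+= δ                  # accumulate ∇bᵢ = δ
    end
    return avg_∇W ./ N, avg_∇b ./ N   # (1/N) Σ ∇Wᵢ, (1/N) Σ ∇bᵢ
end
```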
MachineLearningCourse.Lecture02.compute_gradients — Method
compute_gradients(W, b, x, y)
Compute gradients ∂ℒ/∂W and ∂ℒ/∂b for a single sample using backpropagation.
Calculates gradients using the chain rule:
- ∂ℒ/∂W = δ * x^T where δ = ∂ℒ/∂â
- ∂ℒ/∂b = δ
Arguments
W::Matrix{Float32}: Weight matrix (n_outputs × n_inputs)
b::Vector{Float32}: Bias vector (n_outputs,)
x::Vector{Float32}: Input vector for sample
y::Vector{Float32}: Target output vector for sample
Returns
Tuple{Matrix{Float32}, Vector{Float32}}: (∇W, ∇b)
∇W: Weight gradients (same size as W)
∇b: Bias gradients (same size as b)
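The chain-rule steps above can be sketched as follows, assuming the linear forward pass â = W * x + b used throughout this module (a sketch, not necessarily the module's exact code):

```julia
# Sketch: single-sample gradients for â = W*x + b with ℒ = ‖â - y‖².
function compute_gradients(W, b, x, y)
    â = W * x .+ b
    δ = 2f0 .* (â .- y)  # δ = ∂ℒ/∂â = 2(â - y)
    return δ * x', δ      # ∂ℒ/∂W = δ xᵀ, ∂ℒ/∂b = δ
end
```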
MachineLearningCourse.Lecture02.demo — Function
demo()
Demonstration of gradient descent on 5x5 digit recognition.
Loads training data, randomly initializes weights and biases, and uses gradient descent to minimize total loss. Prints initial and final loss values.
Example
demo() # Uses sample data file
MachineLearningCourse.Lecture02.gradient_descent! — Method
gradient_descent!(W, b, X, Y)
Optimize neural network parameters using batch gradient descent.
Implements gradient descent with:
- A maximum of 10,000 iterations
- Average gradients (∇W, ∇b) computed across all samples
- A stopping criterion ‖(∇W, ∇b)‖ < tolerance 1.0e-3
- Parameter updates W ← W - η * ∇W, b ← b - η * ∇b with learning rate η = 0.1
For each iteration:
- Compute average gradients across all training samples
- Calculate gradient norm for convergence checking
- Print progress information
- Check convergence criterion
- Update parameters using gradient descent rule
Arguments
W::Matrix{Float32}: Weight matrix (n_outputs × n_inputs), modified in-place
b::Vector{Float32}: Bias vector (n_outputs,), modified in-place
X::Vector{Vector{Float32}}: Training input data
Y::Vector{Vector{Float32}}: Training target data (one-hot encoded)
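The per-iteration steps listed above can be sketched as one loop. This assumes the linear model â = W * x + b with ℒ = ‖â - y‖² from this module's docstrings; the average-gradient computation is inlined so the example is self-contained, and the hyperparameters are exposed as keyword arguments for illustration:

```julia
using LinearAlgebra  # norm

# Sketch of the batch gradient-descent loop described above.
function gradient_descent!(W, b, X, Y; η = 0.1f0, tol = 1f-3, max_iter = 10_000)
    N = length(X)
    for iter in 1:max_iter
        # 1. average gradients across all training samples
        ∇W = zeros(Float32, size(W))
        ∇b = zeros(Float32, size(b))
        for (x, y) in zip(X, Y)
            δ = 2f0 .* (W * x .+ b .- y)  # δ = ∂ℒ/∂â for this sample
            ∇W .+= δ * x'
            ∇b .+= δ
        end
        ∇W ./= N
        ∇b ./= N
        # 2. convergence check on the combined gradient norm
        norm(vcat(vec(∇W), ∇b)) < tol && return iter
        # 3. gradient-descent update
        W .-= η .* ∇W
        b .-= η .* ∇b
    end
    return max_iter
end
```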
MachineLearningCourse.Lecture02.gradient_norm — Method
gradient_norm(∇W, ∇b)
Compute the Euclidean norm of the combined gradient vector.
Flattens and concatenates weight and bias gradients into a single vector, then computes ‖∇‖ = √(‖∇W‖² + ‖∇b‖²) for convergence monitoring.
Arguments
∇W::Matrix{Float32}: Weight gradients
∇b::Vector{Float32}: Bias gradients
Returns
Float32: Euclidean norm of the combined gradient vector
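The flatten-and-concatenate computation described above is a one-liner; a sketch (the module's own code may differ):

```julia
using LinearAlgebra  # norm

# Sketch: flatten ∇W, concatenate with ∇b, and take the Euclidean norm,
# i.e. ‖∇‖ = √(‖∇W‖² + ‖∇b‖²).
gradient_norm(∇W, ∇b) = Float32(norm(vcat(vec(∇W), ∇b)))
```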
MachineLearningCourse.Lecture02.one_hot_encode — Method
one_hot_encode(label, num_classes)
Convert class labels to one-hot vectors for classification.
Arguments
label::Int: Class label (1-indexed)
num_classes::Int: Total number of classes
Returns
Vector{Float32}: One-hot encoded vector
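A minimal sketch of 1-indexed one-hot encoding as described above:

```julia
# Sketch: set position `label` of a zero vector to one.
function one_hot_encode(label::Int, num_classes::Int)
    v = zeros(Float32, num_classes)
    v[label] = 1f0
    return v
end
```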
Example
one_hot_encode(3, 10) # Returns [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
MachineLearningCourse.Lecture02.read_data — Method
read_data(file_path)
Read 5x5 digit training data from a text file.
File format: Each digit consists of 6 lines:
- 5 lines of 5 space-separated Float32 values (5x5 pixel grid)
- 1 line with the digit label (0-9)
Arguments
file_path::String: Path to the data file
Returns
Tuple{Vector{Vector{Float32}}, Vector{Vector{Float32}}}: (X, Y)
X: Input vectors (each vector has 25 elements from 5x5 grid)
Y: One-hot encoded target vectors (10 classes, 1-indexed)
Example
X, Y = read_data("5x5digits.txt")
# X[1] contains 25 pixel values for first digit
# Y[1] contains one-hot vector for first digit's class
MachineLearningCourse.Lecture02.total_loss — Method
total_loss(W, b, X, Y)
Compute total Mean Squared Error loss across all training samples.
For each sample, performs forward pass and computes loss:
- Forward pass: â = W * x + b
- Sample loss: ℒ(y, â) = ‖â - y‖²
- Total loss: Σ ℒ(yᵢ, âᵢ) over all samples
Arguments
W::Matrix{Float32}: Weight matrix (n_outputs × n_inputs)
b::Vector{Float32}: Bias vector (n_outputs,)
X::Vector{Vector{Float32}}: Training input data
Y::Vector{Vector{Float32}}: Training target data (one-hot encoded)
Returns
Float32: Total loss across all training samples
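The forward-pass-and-sum computation above can be sketched in one line, assuming the linear model â = W * x + b (a sketch, not the module's definitive code):

```julia
# Sketch: total squared-error loss over all samples for â = W*x + b.
total_loss(W, b, X, Y) = sum(sum(abs2, W * x .+ b .- y) for (x, y) in zip(X, Y))
```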
MachineLearningCourse.Lecture02.ℒ — Method
ℒ(y, â)
Mean Squared Error loss function: ℒ = ‖â - y‖².
Arguments
y::Vector{Float32}: True target values
â::Vector{Float32}: Computed values
Returns
Float32: MSE loss value
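A one-line sketch of the loss exactly as written, ℒ = ‖â - y‖²:

```julia
# Sketch: squared Euclidean error between computed and target vectors.
ℒ(y, â) = sum(abs2, â .- y)
```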
MachineLearningCourse.Lecture02.∂ℒ_∂â — Method
∂ℒ_∂â(y, â)
Gradient of MSE loss with respect to computed activations: ∂ℒ/∂â = 2(â - y).
Arguments
y::Vector{Float32}: True target values
â::Vector{Float32}: Computed values
Returns
Vector{Float32}: Gradient vector ∂ℒ/∂â
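A one-line sketch of this gradient, elementwise 2(â - y):

```julia
# Sketch: gradient of ℒ = ‖â - y‖² with respect to â.
∂ℒ_∂â(y, â) = 2f0 .* (â .- y)
```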