Lecture 02 - Vanilla implementation of gradient descent
MachineLearningCourse.Lecture02 — Module
Lecture02
Vanilla implementation of gradient descent.
Available Functions
demo(): Gradient descent demo for 5x5 pixel symbol recognition
Usage
using MachineLearningCourse
Lecture02.demo()
MachineLearningCourse.Lecture02.compute_average_gradients — Method
compute_average_gradients(W, b, X, Y)
Compute average gradients across all training samples for batch gradient descent.
Performs gradient computation for each sample and averages the results:
- For each sample (xᵢ, yᵢ): compute ∇Wᵢ, ∇bᵢ
- Return average: (1/N) * Σ(∇Wᵢ), (1/N) * Σ(∇bᵢ)
Arguments
W::Matrix{Float32}: Weight matrix (n_outputs × n_inputs)
b::Vector{Float32}: Bias vector (n_outputs,)
X::Vector{Vector{Float32}}: Training input data (N samples)
Y::Vector{Vector{Float32}}: Training target data (N samples)
Returns
Tuple{Matrix{Float32}, Vector{Float32}}: (avg_∇W, avg_∇b)
avg_∇W: Average weight gradients
avg_∇b: Average bias gradients
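A minimal sketch of this averaging step, assuming the linear model â = W * x + b and loss ℒ = ‖â - y‖² described elsewhere in this module; the per-sample gradient is derived inline here so the example is self-contained, and the actual implementation may differ:

```julia
# Sketch: average per-sample gradients for the linear model â = W*x + b
# with ℒ = ‖â - y‖² (assumed from this module's docstrings).
function compute_average_gradients(W, b, X, Y)
    N = length(X)
    avg_∇W = zeros(Float32, size(W))
    avg_∇b = zeros(Float32, size(b))
    for (x, y) in zip(X, Y)
        δ = 2f0 .* (W * x .+ b .- y)  # δ = ∂ℒ/∂â for this sample
        avg_∇W .+= δ * x'             # accumulate ∇Wᵢ = δ xᵀ
        avg_∇b .+= δ                  # accumulate ∇bᵢ = δ
    end
    return avg_∇W ./ N, avg_∇b ./ N   # (1/N) Σ ∇Wᵢ, (1/N) Σ ∇bᵢ
end
```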
MachineLearningCourse.Lecture02.compute_gradients — Method
compute_gradients(W, b, x, y)
Compute gradients ∂ℒ/∂W and ∂ℒ/∂b for a single sample using backpropagation.
Calculates gradients using the chain rule:
- ∂ℒ/∂W = δ * x^T where δ = ∂ℒ/∂â
- ∂ℒ/∂b = δ
Arguments
W::Matrix{Float32}: Weight matrix (n_outputs × n_inputs)
b::Vector{Float32}: Bias vector (n_outputs,)
x::Vector{Float32}: Input vector for sample
y::Vector{Float32}: Target output vector for sample
Returns
Tuple{Matrix{Float32}, Vector{Float32}}: (∇W, ∇b)
∇W: Weight gradients (same size as W)
∇b: Bias gradients (same size as b)
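The chain-rule steps above can be sketched as follows, assuming the linear forward pass â = W * x + b used throughout this module (a sketch, not necessarily the module's exact code):

```julia
# Sketch: single-sample gradients for â = W*x + b with ℒ = ‖â - y‖².
function compute_gradients(W, b, x, y)
    â = W * x .+ b
    δ = 2f0 .* (â .- y)  # δ = ∂ℒ/∂â = 2(â - y)
    return δ * x', δ      # ∂ℒ/∂W = δ xᵀ, ∂ℒ/∂b = δ
end
```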
MachineLearningCourse.Lecture02.demo — Function
demo()
Demonstration of gradient descent on 5x5 digit recognition.
Loads training data, randomly initializes weights and biases, and uses gradient descent to minimize total loss. Prints initial and final loss values.
Example
demo() # Uses sample data file
MachineLearningCourse.Lecture02.gradient_descent! — Method
gradient_descent!(W, b, X, Y)
Optimize neural network parameters using batch gradient descent.
Implements gradient descent with:
- A maximum of 10,000 iterations
- Average gradients (∇W, ∇b) computed across all samples
- A stopping criterion ‖(∇W, ∇b)‖ < tolerance 1.0e-3
- Parameter updates W ← W - η * ∇W, b ← b - η * ∇b with learning rate η = 0.1
For each iteration:
- Compute average gradients across all training samples
- Calculate gradient norm for convergence checking
- Print progress information
- Check convergence criterion
- Update parameters using gradient descent rule
Arguments
W::Matrix{Float32}: Weight matrix (n_outputs × n_inputs), modified in-place
b::Vector{Float32}: Bias vector (n_outputs,), modified in-place
X::Vector{Vector{Float32}}: Training input data
Y::Vector{Vector{Float32}}: Training target data (one-hot encoded)
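The per-iteration steps listed above can be sketched as one loop. This assumes the linear model â = W * x + b with ℒ = ‖â - y‖² from this module's docstrings; the average-gradient computation is inlined so the example is self-contained, and the hyperparameters are exposed as keyword arguments for illustration:

```julia
using LinearAlgebra  # norm

# Sketch of the batch gradient-descent loop described above.
function gradient_descent!(W, b, X, Y; η = 0.1f0, tol = 1f-3, max_iter = 10_000)
    N = length(X)
    for iter in 1:max_iter
        # 1. average gradients across all training samples
        ∇W = zeros(Float32, size(W))
        ∇b = zeros(Float32, size(b))
        for (x, y) in zip(X, Y)
            δ = 2f0 .* (W * x .+ b .- y)  # δ = ∂ℒ/∂â for this sample
            ∇W .+= δ * x'
            ∇b .+= δ
        end
        ∇W ./= N
        ∇b ./= N
        # 2. convergence check on the combined gradient norm
        norm(vcat(vec(∇W), ∇b)) < tol && return iter
        # 3. gradient-descent update
        W .-= η .* ∇W
        b .-= η .* ∇b
    end
    return max_iter
end
```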
MachineLearningCourse.Lecture02.gradient_norm — Method
gradient_norm(∇W, ∇b)
Compute the Euclidean norm of the combined gradient vector.
Flattens and concatenates weight and bias gradients into a single vector, then computes ‖∇‖ = √(‖∇W‖² + ‖∇b‖²) for convergence monitoring.
Arguments
∇W::Matrix{Float32}: Weight gradients
∇b::Vector{Float32}: Bias gradients
Returns
Float32: Euclidean norm of the combined gradient vector
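The flatten-and-concatenate computation described above is a one-liner; a sketch (the module's own code may differ):

```julia
using LinearAlgebra  # norm

# Sketch: flatten ∇W, concatenate with ∇b, and take the Euclidean norm,
# i.e. ‖∇‖ = √(‖∇W‖² + ‖∇b‖²).
gradient_norm(∇W, ∇b) = Float32(norm(vcat(vec(∇W), ∇b)))
```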
MachineLearningCourse.Lecture02.one_hot_encode — Method
one_hot_encode(label, num_classes)
Convert class labels to one-hot vectors for classification.
Arguments
label::Int: Class label (1-indexed)
num_classes::Int: Total number of classes
Returns
Vector{Float32}: One-hot encoded vector
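A minimal sketch of 1-indexed one-hot encoding as described above:

```julia
# Sketch: set position `label` of a zero vector to one.
function one_hot_encode(label::Int, num_classes::Int)
    v = zeros(Float32, num_classes)
    v[label] = 1f0
    return v
end
```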
Example
one_hot_encode(3, 10) # Returns [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
MachineLearningCourse.Lecture02.read_data — Method
read_data(file_path)
Read 5x5 digit training data from a text file.
File format: Each digit consists of 6 lines:
- 5 lines of 5 space-separated Float32 values (5x5 pixel grid)
- 1 line with the digit label (0-9)
Arguments
file_path::String: Path to the data file
Returns
Tuple{Vector{Vector{Float32}}, Vector{Vector{Float32}}}: (X, Y)
X: Input vectors (each vector has 25 elements from 5x5 grid)
Y: One-hot encoded target vectors (10 classes, 1-indexed)
Example
X, Y = read_data("5x5digits.txt")
# X[1] contains 25 pixel values for first digit
# Y[1] contains one-hot vector for first digit's class
MachineLearningCourse.Lecture02.total_loss — Method
total_loss(W, b, X, Y)
Compute total Mean Squared Error loss across all training samples.
For each sample, performs forward pass and computes loss:
- Forward pass: â = W * x + b
- Sample loss: ℒ(y, â) = ‖â - y‖²
- Total loss: Σ ℒ(yᵢ, âᵢ) over all samples
Arguments
W::Matrix{Float32}: Weight matrix (n_outputs × n_inputs)
b::Vector{Float32}: Bias vector (n_outputs,)
X::Vector{Vector{Float32}}: Training input data
Y::Vector{Vector{Float32}}: Training target data (one-hot encoded)
Returns
Float32: Total loss across all training samples
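The forward-pass-and-sum computation above can be sketched in one line, assuming the linear model â = W * x + b (a sketch, not the module's definitive code):

```julia
# Sketch: total squared-error loss over all samples for â = W*x + b.
total_loss(W, b, X, Y) = sum(sum(abs2, W * x .+ b .- y) for (x, y) in zip(X, Y))
```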
MachineLearningCourse.Lecture02.ℒ — Method
ℒ(y, â)
Mean Squared Error loss function: ℒ = ‖â - y‖².
Arguments
y::Vector{Float32}: True target values
â::Vector{Float32}: Computed values
Returns
Float32: MSE loss value
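A one-line sketch of the loss exactly as written, ℒ = ‖â - y‖²:

```julia
# Sketch: squared Euclidean error between computed and target vectors.
ℒ(y, â) = sum(abs2, â .- y)
```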
MachineLearningCourse.Lecture02.∂ℒ_∂â — Method
∂ℒ_∂â(y, â)
Gradient of MSE loss with respect to computed activations: ∂ℒ/∂â = 2(â - y).
Arguments
y::Vector{Float32}: True target values
â::Vector{Float32}: Computed values
Returns
Vector{Float32}: Gradient vector ∂ℒ/∂â
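A one-line sketch of this gradient, elementwise 2(â - y):

```julia
# Sketch: gradient of ℒ = ‖â - y‖² with respect to â.
∂ℒ_∂â(y, â) = 2f0 .* (â .- y)
```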