Optimization & Numerical Methods
16 questions. Use Show Answer, then slide right (or use Next) to continue.
The process of finding parameter values that minimize (or maximize) an objective function.
A function that measures how well a model’s predictions match the observed data.
Gradient descent computes the average gradient of the loss function over the entire training dataset and updates the model parameters using that gradient.
- Each iteration requires a full pass over the data.
- Convergence is smooth and predictable.
- Computationally more expensive than SGD.
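A minimal sketch of batch gradient descent for one-parameter linear regression (the data, learning rate, and iteration count are illustrative):

```python
# Batch gradient descent for simple linear regression y ≈ w * x.
# Each iteration averages the gradient over the ENTIRE dataset.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # underlying relationship: y = 2x

w = 0.0    # initial parameter
lr = 0.05  # learning rate (step size)

for _ in range(200):
    # Average gradient of the squared-error loss over all examples.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad  # one full-batch update per pass over the data

print(round(w, 4))  # converges smoothly toward 2.0
```

Note that every update costs one full pass over the data, which is exactly why each step is expensive on large datasets.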
Stochastic gradient descent (SGD) is an optimization algorithm that computes the gradient of the loss function using a single training example and updates the model parameters based on that gradient.
- Training examples are randomly selected.
- Updates are fast but noisy.
- Well suited for large datasets due to low per-iteration cost.
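The same toy regression, sketched with SGD (data, learning rate, and seed are illustrative):

```python
import random

# Stochastic gradient descent for the same regression y ≈ w * x.
# Each update uses a SINGLE randomly chosen training example.

random.seed(0)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0
lr = 0.02

for _ in range(2000):
    i = random.randrange(len(xs))           # pick a random example
    grad = 2 * (w * xs[i] - ys[i]) * xs[i]  # gradient on that example only
    w -= lr * grad                          # cheap but noisy update

print(round(w, 3))  # hovers near 2.0
```

Each update touches one example, so per-iteration cost stays constant no matter how large the dataset grows.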
- Speed:
  - SGD is faster per iteration because it uses a single training example.
  - Gradient descent is slower because it processes the full dataset each iteration.
- Convergence:
  - Gradient descent converges smoothly and predictably.
  - SGD converges more noisily and erratically due to random sampling.
- Memory:
  - Gradient descent needs access to the full dataset on every update.
  - SGD only needs the current training example, making it more memory-efficient.
A hyperparameter that controls the step size of parameter updates during optimization.
- Too large: divergence or oscillation.
- Too small: very slow convergence.
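Both failure modes can be seen on the simple quadratic f(w) = w², whose gradient is 2w (the step counts and learning-rate values below are illustrative):

```python
# Effect of the learning rate when minimizing f(w) = w**2 (gradient = 2w).

def run(lr, steps=50, w=10.0):
    for _ in range(steps):
        w -= lr * 2 * w  # gradient descent update
    return w

print(run(0.4))    # well-chosen: converges quickly toward 0
print(run(0.001))  # too small: still far from 0 after 50 steps
print(run(1.1))    # too large: |w| grows every step — divergence
```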
Optimization where the objective function is convex, so every local minimum is also a global minimum — there are no suboptimal local minima to get trapped in.
- Global minimum: the lowest possible value of the objective function.
- Local minimum: lower than nearby points but not necessarily the lowest overall.
For convex objectives, gradient-based methods with a suitable step size converge to the global minimum; without convexity they may get stuck in a local minimum.
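A toy comparison (the functions, step size, and starting points are illustrative): on a convex function, gradient descent reaches the same minimum from any start, while on a non-convex function the result depends on initialization.

```python
# Convex:     f(w) = (w - 3)**2,     single global minimum at w = 3.
# Non-convex: g(w) = (w**2 - 1)**2,  two local minima at w = -1 and w = +1.

def descend(grad, w, lr=0.01, steps=5000):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

f_grad = lambda w: 2 * (w - 3)          # gradient of the convex f
g_grad = lambda w: 4 * w * (w**2 - 1)   # gradient of the non-convex g

# Convex: any starting point reaches the global minimum.
print(descend(f_grad, -10.0), descend(f_grad, 10.0))  # both ≈ 3.0

# Non-convex: the answer depends on where you start.
print(descend(g_grad, -2.0), descend(g_grad, 2.0))    # ≈ -1.0 and ≈ +1.0
```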
Common causes of slow convergence in gradient-based optimization:
- Poor feature scaling.
- Ill-conditioned problems.
- Vanishing or exploding gradients.
The \(L_2\) (ridge) penalty adds the squared magnitudes of the coefficients to the loss function:
$$\lambda \sum_j \beta_j^2$$
The \(L_1\) (lasso) penalty adds the absolute values of the coefficients to the loss function:
$$\lambda \sum_j |\beta_j|$$
- \(L_1\) encourages sparsity (coefficients go to zero).
- \(L_2\) shrinks coefficients smoothly without setting them exactly to zero.
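A one-dimensional sketch of the difference, using the closed-form minimizers of a squared-error term plus each penalty (the values of `a` and `lam` are made up for illustration):

```python
# Minimize (w - a)**2 plus a penalty, in one dimension.
# L2 penalty lam * w**2 -> shrinks w toward 0 but never exactly to 0.
# L1 penalty lam * |w|  -> sets w exactly to 0 once lam is large enough.

def ridge_min(a, lam):
    # argmin of (w - a)**2 + lam * w**2 (closed form)
    return a / (1 + lam)

def lasso_min(a, lam):
    # argmin of (w - a)**2 + lam * abs(w) (soft-thresholding)
    mag = max(abs(a) - lam / 2, 0.0)
    return mag if a >= 0 else -mag

a, lam = 0.4, 1.0
print(ridge_min(a, lam))  # 0.2 — shrunk, but still nonzero
print(lasso_min(a, lam))  # 0.0 — exactly zero (sparsity)
```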
Regularization increases bias and reduces variance, helping prevent overfitting.
Loss functions are typically minimized with numerical optimization methods such as gradient descent, Newton's method, or quasi-Newton methods (e.g., BFGS).
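As a sketch of a Newton-based method, here is Newton's method on an illustrative quadratic f(w) = w² − 4w + 7, where it reaches the minimum in a single step because the curvature is constant:

```python
# Newton's method minimizing f(w) = w**2 - 4*w + 7.
# f'(w) = 2w - 4, f''(w) = 2; the minimum is at w = 2.

def newton(grad, hess, w, steps=10):
    for _ in range(steps):
        w -= grad(w) / hess(w)  # Newton step: gradient scaled by curvature
    return w

w_star = newton(lambda w: 2 * w - 4, lambda w: 2.0, w=10.0)
print(w_star)  # 2.0 — exact after one step for a quadratic
```

Quasi-Newton methods follow the same idea but approximate the curvature term instead of computing second derivatives exactly.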