Is linear regression still useful in 2026?

Yes. Linear regression is fast, easy to interpret, and works well as a baseline for many real-world problems. Even when you later move to more complex models, it helps you understand relationships between variables and spot issues in your data.

Do I always need gradient descent to train a linear regression model?

No. For small to medium-sized datasets, most libraries use efficient closed-form or matrix-based solvers under the hood. Gradient descent is more useful when you have very large datasets, many features, or when you integrate linear layers inside deep learning models.

What mathematical background do I need to learn linear regression?

You should be comfortable with basic algebra, functions, and sums. Some familiarity with vectors and matrices is helpful but not strictly required at the beginning. As you go deeper, concepts from linear algebra, probability, and statistics become increasingly important.

Linear Regression for Data Science in 2026

Updated on December 10, 2025. 7-minute read.

Linear regression is one of the simplest and most widely used models in statistics and machine learning. It describes how a continuous target variable changes as one or more input variables change.

Even in 2026, when deep learning and large language models are common, linear regression remains a core tool. It is fast, interpretable, and often the first model you try in real-world data science projects.

We will start from the mathematical definition of linear regression and then see how to solve it in three ways: closed form for one variable, closed form for many variables, and gradient descent.

What is Linear Regression?

Suppose you have a dataset

$D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$

where both the inputs $x_i$ and outputs $y_i$ are real-valued continuous variables. The goal of linear regression is to find a linear function that best predicts $y$ from $x$.

In the most general case, a single observation has $p$ features $x_1, \dots, x_p$, and the model is

$\hat{y} = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_p x_p$

where $a_0$ is the intercept and $a_1, \dots, a_p$ are the coefficients, also called weights. Our task is to estimate these parameters from data.

To measure how good a particular choice of parameters is, we use the least squares loss:

$L(a_0, \dots, a_p) = \sum_{i=1}^N (y_i - \hat{y}_i)^2$

The optimal parameters are those that minimise this loss.
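To make the loss concrete, here is a minimal NumPy sketch that evaluates it for two candidate parameter choices. The function name `least_squares_loss` and the toy data are assumptions made up for illustration, not part of any library.

```python
import numpy as np

def least_squares_loss(X, y, a0, a):
    """Sum of squared residuals for intercept a0 and coefficient vector a."""
    y_hat = a0 + X @ a          # one prediction per row of X
    return float(np.sum((y - y_hat) ** 2))

# Toy data (made up for illustration): 4 observations, 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
y = np.array([5.0, 4.0, 7.0, 11.0])

# Two candidate parameter choices; the first fits the data far better.
loss_good = least_squares_loss(X, y, a0=0.0, a=np.array([2.0, 1.0]))
loss_bad = least_squares_loss(X, y, a0=0.0, a=np.array([0.0, 0.0]))
print(loss_good, loss_bad)
```

A smaller loss means the candidate line or hyperplane sits closer to the observed targets, which is exactly what the minimisation searches for.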

Simple Linear Regression (One Variable, Ordinary Least Squares)

In the simplest case, each input $x_i$ is just a single number. The model becomes

$\hat{y}_i = a_0 + a_1 x_i$

Here, $a_0$ and $a_1$ define a straight line. Linear regression in this setting means: find the line that best fits the data points in the least squares sense.

Formally, we want

$(\hat{a}_0, \hat{a}_1) = \operatorname{argmin}_{a_0, a_1} \sum_{i=1}^{N} (y_i - (a_0 + a_1 x_i))^2$

Deriving the optimal parameters

Define the loss

$L(a_0, a_1) = \sum_{i=1}^{N} (y_i - (a_0 + a_1 x_i))^2$

To find its minimum, we set its partial derivatives to zero. This gives the normal equations.

$\sum_{i=1}^N (y_i - (a_0 + a_1 x_i)) = 0$

$\sum_{i=1}^N x_i (y_i - (a_0 + a_1 x_i)) = 0$

Let the sample means be

$\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i$

$\bar{y} = \frac{1}{N} \sum_{i=1}^N y_i$
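
As a sketch of the intermediate algebra (a standard manipulation, spelled out here for convenience): dividing the first normal equation by $N$ gives

$a_0 = \bar{y} - a_1 \bar{x}$

Substituting this into the second normal equation yields

$\sum_{i=1}^N x_i \big( (y_i - \bar{y}) - a_1 (x_i - \bar{x}) \big) = 0$

Because $\sum_{i=1}^N (y_i - \bar{y}) = 0$ and $\sum_{i=1}^N (x_i - \bar{x}) = 0$, the factor $x_i$ can be replaced by $(x_i - \bar{x})$ without changing either sum, which gives

$a_1 \sum_{i=1}^N (x_i - \bar{x})^2 = \sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})$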

Solving the system leads to a well-known closed-form solution for the slope

$\hat{a}_1 = \frac{\sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^N (x_i - \bar{x})^2} = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}$

and the intercept

$\hat{a}_0 = \bar{y} - \hat{a}_1 \bar{x}$

So the best-fit line is simply

$\hat{y} = \hat{a}_0 + \hat{a}_1 x$
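
The closed-form formulas translate almost directly into NumPy. The sketch below assumes a helper named `fit_simple_ols` (a hypothetical name) and toy data that lie exactly on the line $y = 1 + 2x$, so the recovered parameters are easy to check.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (a0_hat, a1_hat) using the closed-form slope and intercept."""
    x_bar, y_bar = x.mean(), y.mean()
    a1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    a0 = y_bar - a1 * x_bar
    return a0, a1

# Toy data lying exactly on y = 1 + 2x (made up for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

a0_hat, a1_hat = fit_simple_ols(x, y)
print(a0_hat, a1_hat)  # close to 1.0 and 2.0
```

With noisy data the estimates would no longer match the generating line exactly, but they would still minimise the summed squared residuals.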

Multiple Linear Regression (Many Variables, Ordinary Least Squares)

When each observation has multiple features, $x_i$ is no longer a single number but a vector of size $p$:

$x_i = (x_{i1}, x_{i2}, \dots, x_{ip})$

The model becomes

$\hat{y}_i = a_0 + a_1 x_{i1} + a_2 x_{i2} + \dots + a_p x_{ip}$

For convenience, we often work in matrix form. We stack all targets into a vector $Y$ and all features into a matrix $X$:

$Y$ is an $(N, 1)$ vector of targets.
$X$ is an $(N, p + 1)$ design matrix where each row is an observation; its first column is all ones (for the intercept) and each remaining column is a feature.
$W$ is a $(p + 1, 1)$ parameter vector $(w_0, w_1, \dots, w_p)^T$, where $w_0$ plays the role of the intercept $a_0$ and $w_j$ the role of $a_j$.

Because the intercept is included as a column of ones in $X$, we can write the predictions compactly as

$\hat{Y} = X W$

Loss function in matrix form

The least squares loss becomes

$L(W) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = (Y - X W)^T (Y - X W)$

Expanding this expression gives

$L(W) = Y^T Y - 2 W^T X^T Y + W^T X^T X W$

We want to minimise $L(W)$ with respect to $W$. The term $Y^T Y$ does not depend on $W$, so its derivative is zero, and we can ignore it when taking the gradient.
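
As a quick numerical sanity check, the sketch below builds a design matrix with a leading column of ones and evaluates the loss three ways: as a sum of squared residuals, as $(Y - XW)^T (Y - XW)$, and in expanded form. The data and the trial weights are made up for illustration.

```python
import numpy as np

# Toy features (made up): N = 4 observations, p = 2 features.
features = np.array([[1.0, 2.0],
                     [2.0, 0.0],
                     [3.0, 1.0],
                     [4.0, 3.0]])
Y = np.array([[5.0], [4.0], [7.0], [11.0]])               # shape (N, 1)

# Prepend a column of ones so the first weight acts as the intercept.
X = np.column_stack([np.ones(len(features)), features])   # shape (N, p + 1)
W = np.array([[0.5], [2.0], [1.0]])                       # arbitrary trial weights

residual = Y - X @ W
loss_sum = float(np.sum(residual ** 2))
loss_matrix = (residual.T @ residual).item()
loss_expanded = (Y.T @ Y - 2 * W.T @ X.T @ Y + W.T @ X.T @ X @ W).item()
print(loss_sum, loss_matrix, loss_expanded)
```

All three values should agree up to floating-point rounding, which confirms that the matrix expressions are just a compact rewriting of the original sum.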

Normal equation for multiple linear regression

Taking the gradient of $L(W)$ with respect to $W$ and setting it to zero yields

$\frac{\partial L}{\partial W} = -2 X^T Y + 2 X^T X W = 0$

Rearranging, we obtain the normal equation

$X^T X \hat{W} = X^T Y$

If $X^T X$ is invertible, the unique least squares solution is

$\hat{W} = (X^T X)^{-1} X^T Y$

In practice, for large problems or when $X^T X$ is ill-conditioned, you may use numerical methods such as QR decomposition or singular value decomposition, or regularisation techniques like ridge regression, instead of computing $(X^T X)^{-1}$ directly.
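
The following sketch compares three of the options mentioned above on synthetic data: solving the normal equation as a linear system, calling NumPy's SVD-based `np.linalg.lstsq`, and a basic ridge variant. The data, the true weights, and the ridge strength `lam` are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 200 rows, intercept + 3 features.
N, p = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
true_W = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ true_W + 0.1 * rng.normal(size=N)

# 1) Normal equation, solved as a linear system instead of forming an inverse.
W_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# 2) SVD-based least squares; more robust when X^T X is ill-conditioned.
W_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# 3) Ridge regression: add lam * I to X^T X (lam is a hypothetical value).
#    For simplicity this sketch also penalises the intercept.
lam = 1.0
W_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p + 1), X.T @ Y)

print(W_normal, W_lstsq, W_ridge, sep="\n")
```

On well-conditioned data like this, the first two results coincide up to rounding, while the ridge solution is shrunk slightly towards zero.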

Solving Linear Regression with Gradient Descent

The closed-form solutions above are elegant, but they can become expensive when $N$ and $p$ are very large: forming $X^T X$ takes on the order of $N p^2$ operations, and solving the resulting system takes on the order of $p^3$. For datasets that are too large to process in one pass or to hold in memory, we often use gradient descent instead.

Gradient descent is an iterative optimisation algorithm. Starting from an initial guess $W^{(0)}$, we repeatedly update the parameters in the opposite direction of the gradient of the loss:

$W^{(n+1)} = W^{(n)} - lr \cdot \nabla_W L(W^{(n)})$

where $lr$ is the learning rate, a positive scalar that controls the step size.
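
In matrix form, the update rule can be sketched in a few lines of NumPy, using the gradient $-2 X^T Y + 2 X^T X W$ derived earlier. The function name `gradient_descent`, the synthetic data, and the hyperparameter values are assumptions for illustration; with a summed loss, the learning rate usually has to shrink as the dataset grows.

```python
import numpy as np

def gradient_descent(X, Y, lr, num_steps):
    """Minimise the summed squared error by batch gradient descent."""
    W = np.zeros(X.shape[1])                 # initial guess W^(0)
    for _ in range(num_steps):
        grad = 2 * X.T @ (X @ W - Y)         # equals -2 X^T Y + 2 X^T X W
        W = W - lr * grad                    # step against the gradient
    return W

rng = np.random.default_rng(0)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
Y = X @ np.array([1.0, 2.0, -3.0])           # noiseless targets, made up

W_gd = gradient_descent(X, Y, lr=0.001, num_steps=500)
print(W_gd)  # close to [1, 2, -3]
```

If the learning rate is set too high, the iterates diverge instead of converging, so it is worth monitoring the loss during the first few steps.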

Gradient descent for simple linear regression

For the one-variable model

$\hat{y}_i = a_0 + a_1 x_i$

The loss is

$L(a_0, a_1) = \sum_{i=1}^{N} (y_i - (a_0 + a_1 x_i))^2$

The partial derivatives are

$\frac{\partial L}{\partial a_0} = \sum_{i=1}^{N} -2 (y_i - (a_0 + a_1 x_i))$

$\frac{\partial L}{\partial a_1} = \sum_{i=1}^{N} -2 x_i (y_i - (a_0 + a_1 x_i))$

Applying gradient descent, we update both parameters at each step $n$ as

$a_0^{(n+1)} = a_0^{(n)} + 2 \cdot lr \cdot \sum_{i=1}^{N} (y_i - (a_0^{(n)} + a_1^{(n)} x_i))$

$a_1^{(n+1)} = a_1^{(n)} + 2 \cdot lr \cdot \sum_{i=1}^{N} x_i (y_i - (a_0^{(n)} + a_1^{(n)} x_i))$

We repeat these updates until the parameters change very little or the loss stops decreasing.

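Python example

Here is a simple batch gradient descent loop for linear regression with one feature, written with NumPy. The code uses vectorised operations for clarity; the data and the number of epochs are placeholder values.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # placeholder inputs
y = 1.0 + 2.0 * x                         # placeholder targets on y = 1 + 2x

a0, a1 = 0.0, 0.0        # initial parameters
lr = 0.001               # learning rate; the summed loss needs smaller values as N grows
num_epochs = 10000       # number of full passes over the data

for epoch in range(num_epochs):
    y_hat = a0 + a1 * x          # predictions
    error = y - y_hat            # residuals

    grad_a0 = -2 * error.sum()           # dL/da0
    grad_a1 = -2 * (x * error).sum()     # dL/da1

    a0 = a0 - lr * grad_a0
    a1 = a1 - lr * grad_a1

print(a0, a1)   # close to 1.0 and 2.0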

In real projects, you might use stochastic or mini-batch gradient descent, learning rate schedules, or optimisers like Adam, especially in larger machine learning pipelines.
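
As a rough illustration of the mini-batch variant, the sketch below updates the parameters on small random subsets of the data. It uses the mean gradient per batch rather than the summed gradient used above, so the learning rate does not depend on the batch size. All data and hyperparameters here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic one-feature data around y = 1 + 2x (made up for illustration).
x = rng.uniform(-1.0, 1.0, size=1000)
y = 1.0 + 2.0 * x + 0.05 * rng.normal(size=1000)

a0, a1 = 0.0, 0.0
lr, batch_size, num_epochs = 0.1, 32, 50     # hypothetical hyperparameters

for epoch in range(num_epochs):
    order = rng.permutation(len(x))          # reshuffle the data every epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        error = y[idx] - (a0 + a1 * x[idx])
        # Mean (not summed) gradient over the current batch.
        a0 += lr * 2 * error.mean()
        a1 += lr * 2 * (x[idx] * error).mean()

print(a0, a1)  # close to 1.0 and 2.0, with some noise
```

Because each step sees only part of the data, the parameters keep fluctuating slightly around the optimum; a decaying learning rate is one common way to reduce that noise.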

Closed form vs gradient descent: when to use which?

Both approaches solve the same optimisation problem, but are useful in different situations.

Closed form (normal equation) is ideal when the number of features is relatively small, and you can safely compute $(X^T X)^{-1}$ or use an equivalent numerical solver.

Gradient descent scales better to very large datasets and feature spaces, and is easy to integrate into end-to-end machine learning pipelines.

Many modern libraries choose efficient numerical methods under the hood, so understanding both views helps you interpret and debug your models.

To practise these concepts in real projects, consider joining our live online Data Science and AI Bootcamp, where you will implement linear regression and many other models from scratch.

Quick quiz

Test your understanding with a short quiz. The correct answer is given after each question.

What is the formula for the optimal parameter vector in the multidimensional case?
- a). $\dfrac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(Y)}$
- b). $\dfrac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}$
- c). $(X^T X)^{-1} X^T Y$

Answer: (c)

Why do we set the derivative of the loss to zero in ordinary least squares?
- a). To find the extremum (minimum) of the loss function.
- b). To minimise the derivative itself.
- c). To keep only the real part of the derivative.

Answer: (a)

What is the main objective of linear regression?
- a). To find the line that passes exactly through all the points.
- b). To find the line or hyperplane that best describes the data in the least squares sense.
- c). To find the line that best separates the data into classes.

Answer: (b)

Next steps

If you understand this article, you already have a strong foundation for more advanced regression methods like regularised linear models, logistic regression, and Gaussian processes.

No degree? No problem. You can still become a Data Scientist with Code Labs Academy and build job-ready skills for the AI era.