Why Linear Algebra is Useful

Categories: Miscellaneous, Math
Author: Ryan Heslin
Published: September 11, 2023

Linear algebra has a bad reputation. The subject has a tough combination of fiddly computation and abstract reasoning, one that isn’t always taught well. STEM majors often have bitter memories of spending hours on row reduction and QR decomposition. But sometimes linear algebra can make your life much, much easier. One such situation is expressing the formula for computing linear regression coefficients.

What does Linear Regression Actually Do?

Formally, ordinary least squares regression takes a data matrix \(X\) and a response vector \(Y\), and finds a vector of coefficients \(\beta\) that minimizes \(\sum_{i=1}^n (Y_i - \hat Y_i)^2\), where \(\hat Y = X \beta\). This quantity is called the sum of squared errors (\(SSE\)). It can also be viewed as the squared Euclidean distance between the response vector \(Y\) and the vector of fitted values \(\hat Y\). (Why squared distance instead of absolute value? See below). The individual errors \(Y_i - \hat Y_i\) are called residuals.

Simple linear regression is the special case where there is only one variable, \(X_1\), and hence two coefficients: the constant term \(\beta_0\) and \(\beta_1\), the coefficient for \(X_1\). (\(\beta_0\) isn’t strictly necessary, but it’s rarely a good idea to omit it). In simple regression, \(\hat Y_i = \beta_0 + \beta_1 X_{i1}\).

This is equivalent to plotting the data and finding the line that minimizes the sum of squared errors. \(\beta_0\) gives the intercept of the line and \(\beta_1\) its slope.

Here is a model with the residuals shown:

library(ggplot2)

# Simulate a predictor and a response with a known linear relationship plus noise
X <- rnorm(100, 10, 5)
Y <- X * 7 + 4 + rnorm(100, 20, 4)

# Fit the model, then plot the points, the fitted line, and a grey segment
# from each fitted value to the observed value (the residual)
model <- lm(Y ~ X)
ggplot(mapping = aes(x = X, y = Y)) +
  geom_smooth(method = "lm", color = "red", linetype = "dashed", linewidth = 0.7, se = FALSE) +
  geom_point(alpha = .3) +
  geom_segment(aes(x = X, y = model$fitted.values, xend = X, yend = Y), color = "grey", linewidth = 0.5) +
  theme_minimal()
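As a quick sanity check of that definition, the \(SSE\) computed by hand from the fitted values matches the sum of squared residuals the model stores. This continues from the chunk above, so `model`, `X`, and `Y` are already defined:

# The SSE from the definition equals the sum of squared residuals,
# which lm() also reports as the deviance
sum((Y - fitted(model))^2)
sum(residuals(model)^2)
deviance(model)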

Now we need to know how to find these coefficients \(\beta\). That’s where the trouble starts.

Elementary Algebra

Statistics textbooks usually introduce the equations for the estimators in simple linear regression using these formulae (or algebraically equivalent expressions): \[ \displaylines{\beta_0 = \bar Y -\beta_1 \bar X \\ \beta_1 = \frac{\sum_{i=1}^n(X_i - \bar X)(Y_i - \bar Y)}{\sum_{i = 1}^n (X_i - \bar X)^2}} \]
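The formulae are easier to trust once you check them numerically. Reusing the `X` and `Y` simulated above (the variable names here are just for illustration), they reproduce exactly what `lm()` reports:

# Textbook formulae for the simple-regression coefficients
beta_1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
beta_0 <- mean(Y) - beta_1 * mean(X)

# Compare with lm(): the numbers agree
c(beta_0, beta_1)
coef(lm(Y ~ X))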

Professors often throw these complicated formulae at unsuspecting students in the last weeks of intro stats courses. I first encountered them that way, and I remember the shock of fear I felt. Nobody with mathematical maturity is scared of a summation symbol, but back then I had no clue what I was doing, and many other students are in the same position.

There’s also the classic assignment of deriving these formulae by minimizing \(SSE = \sum_{i=1}^n(y_i - \beta_0 - \beta_1 x_i)^2\) directly. You have to take first derivatives with respect to each coefficient, solve the resulting equations simultaneously, do some algebra tricks, and finish off with a second-derivative test to make sure the estimators really do minimize the loss function. It’s one of those tough but brutally satisfying assignments, like computing a median in standard SQL or implementing iteratively reweighted least squares to estimate a generalized linear model. (I appreciate the professor who suggested doing it as an exercise; it made me a lot less mathematically naive!).
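For the record, the first step of that derivation is short enough to sketch: differentiate the \(SSE\) with respect to each coefficient and set the results to zero, which gives the two normal equations whose solution is the pair of formulae above. \[ \displaylines{\frac{\partial SSE}{\partial \beta_0} = -2\sum_{i=1}^n(y_i - \beta_0 - \beta_1 x_i) = 0 \\ \frac{\partial SSE}{\partial \beta_1} = -2\sum_{i=1}^n x_i(y_i - \beta_0 - \beta_1 x_i) = 0} \]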

Linear Algebra

In linear algebra terms, linear regression orthogonally projects the response vector \(Y\) onto the subspace spanned by the columns of the data matrix \(X\). (Linear algebra texts usually call a generic matrix \(A\), but statistics texts use \(X\) for a data matrix, so I will do so here). In other words, it finds the closest vector to \(Y\) by Euclidean distance that can be formed as a linear combination of the columns of \(X\). In matrix notation, the loss function is \(SSE = (Y - X\beta)^T(Y - X\beta)\). (Multiplying a vector’s transpose by the vector itself just sums the squares of its elements). Some basic calculus and rearrangement turns this into the normal equations \(X^TX \beta = X^TY\). By inverting \(X^TX\) (a matrix operation roughly analogous to turning a number \(n\) into \(1/n\)), we get a simple formula for \(\beta\):

\[ \beta = (X^TX)^{-1}X^TY \]

This is a variation on the formula for a projection matrix, \(P = X(X^TX)^{-1}X^T\). For any vector \(Y\), \(PY\) is the closest vector to \(Y\) by Euclidean distance that is a linear combination of the columns of \(X\). This is much more revealing than the elementary-algebra version of the formula. For one thing, it’s a lot simpler. It’s fully general, too: it gives all the coefficients for a data matrix of any size, not just simple linear regression. (What about \(\beta_0\), the constant term? Just add a column of ones to your data matrix, and you have it!). And it implies the important fact that no unique least-squares solution exists if \(X^TX\) fails to be invertible, which happens when the rank of \(X\) is less than \(p\), the number of columns. In that case, there are infinitely many least-squares solutions.
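Here is that formula at work in R, again reusing the simulated `X` and `Y` (the matrix and variable names below are just for illustration). The design matrix gets an explicit column of ones for the intercept, and duplicating a column shows what goes wrong when \(X^TX\) is not invertible:

# Design matrix with a column of ones for the intercept term
X_mat <- cbind(intercept = 1, X)

# beta = (X^T X)^{-1} X^T Y, written with solve() and crossprod()
beta_hat <- solve(crossprod(X_mat), crossprod(X_mat, Y))
beta_hat
coef(lm(Y ~ X)) # the same coefficients

# A duplicated column makes X rank-deficient, so X^T X is singular,
# solve() fails, and there is no longer a unique least-squares solution
X_bad <- cbind(X_mat, X_copy = X)
try(solve(crossprod(X_bad), crossprod(X_bad, Y)))

(In practice `lm()` solves the problem with a QR decomposition rather than by inverting \(X^TX\), which is more numerically stable, but the answer is the same.)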

Conclusion

As this example shows, matrix notation is more abstract and concise than scalar notation. It makes the underlying concepts of an expression like this far easier to see and understand. It’s the difference between implementing a function in Python and implementing it in C, except without the loss in efficiency.

This is just one example of how linear algebra makes it easy to express ideas that would be hard to convey with elementary algebra. Those weeks of row-reducing matrices really were worth it.