Ali K Esfahani
Singular Value Decomposition
Throughout this series, I have focused on solving Ax = b, as most engineering problems eventually reduce to solving a system of equations. Often, the system is overdetermined, underdetermined, ill-conditioned, or noisy, meaning that an exact solution does not exist or is numerically unstable.
We started with least squares, reviewed the normal equations, and discussed why multiplying by AᵀA can be numerically unfavorable. We then introduced QR decomposition as a stable alternative. Finally, we discussed diagonalization and the spectral decomposition, which apply only to square matrices (and, for the spectral theorem, symmetric ones).
It is now time to introduce Singular Value Decomposition (SVD), which provides the most general and powerful framework for analyzing and solving Ax = b. SVD provides all the benefits of spectral decomposition but applies to any matrix. Let’s start with an example:
Suppose we have matrix A as below:

Let’s multiply it by its transpose to get a nice symmetric matrix:

AᵀA = [ 2 1 ]
      [ 1 2 ]

Based on the Spectral Theorem, we are guaranteed to find a complete set of orthogonal eigenvectors for AᵀA. To find them, we should first find λ by solving det( AᵀA − λI ) = 0.

Solving gives λ₁ = 3 and λ₂ = 1, with corresponding eigenvectors v₁ = [ 1 1 ]ᵀ and v₂ = [ 1 -1 ]ᵀ.
Sanity check: are v₁ and v₂ perpendicular? Yes: because AᵀA is symmetric, its eigenvectors are orthogonal.
To diagonalize matrix AᵀA, we build the matrix V whose columns are the orthonormal eigenvectors:

V = (1/√2) [ 1  1 ]
           [ 1 -1 ]

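This eigendecomposition is easy to check numerically. A minimal sketch with NumPy, assuming AᵀA = [[2, 1], [1, 2]] (the matrix consistent with the eigenvalues and eigenvectors found above; A itself appears only in a figure):

```python
import numpy as np

# A^T A as implied by the example's eigenpairs (the matrix A itself
# appears only in a figure in the original article).
AtA = np.array([[2.0, 1.0],
                [1.0, 2.0]])

# eigh is tailored to symmetric matrices: it returns real eigenvalues
# in ascending order and orthonormal eigenvectors as columns.
eigvals, V = np.linalg.eigh(AtA)

print(eigvals)    # [1. 3.]
print(V.T @ V)    # identity: the eigenvector columns are orthonormal
```

The columns of V span the [ 1 -1 ]ᵀ and [ 1 1 ]ᵀ directions (up to sign), matching the hand computation.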
However, AᵀA is not the only symmetric matrix we can build from A: multiplying in the other order gives AAᵀ, which is symmetric too! Let’s do this, and repeat the procedure:

(AAᵀ is a 3 × 3 symmetric matrix; repeating the same procedure gives the eigenvalues λ = 3, 1, and 0.)

The Key Discovery: the non-zero eigenvalues (3 and 1) are exactly the same as the ones we found for AᵀA. This isn't a coincidence.
For every matrix, nonzero eigenvalues of AᵀA = nonzero eigenvalues of AAᵀ
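A quick numerical check of this claim, using an arbitrary random 3 × 2 matrix (not the A from our example):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))   # any rectangular matrix will do

ev_small = np.linalg.eigvalsh(A.T @ A)   # 2 eigenvalues, ascending
ev_big = np.linalg.eigvalsh(A @ A.T)     # 3 eigenvalues, ascending

# AA^T has one extra eigenvalue, which is (numerically) zero;
# the remaining eigenvalues match those of A^T A.
print(ev_small)
print(ev_big)
```

The smallest eigenvalue of AAᵀ comes out at machine-precision zero, and the other two agree with those of AᵀA.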
We define the singular values of matrix A to be

σᵢ = √λᵢ

where λᵢ are the common non-zero eigenvalues of the matrices AᵀA and AAᵀ.
By placing these singular values on the diagonal of a matrix with the same shape as A, we define the singular value matrix Σ:

Σ = [ √3 0 ]
    [ 0  1 ]
    [ 0  0 ]
Singular Value Decomposition
Based on the matrix Σ, we can factorize matrix A as:

A = U Σ Vᵀ
where
V: orthonormal eigenvectors of AᵀA (right singular vectors)
Σ: nonnegative singular values (stretching magnitudes)
U: orthonormal eigenvectors of AAᵀ (left singular vectors)
This factorization exists for every real matrix, square or rectangular. Geometrically, SVD decomposes the linear transformation into three steps:
1. Initial Rotation (V ᵀ): A rotation (or reflection) in the input space
2. Stretch (Σ): Independent scaling along orthogonal axes
3. Final Rotation (U): A rotation (or reflection) in the output space
(Final Shape) = (Final Rotation) x (Stretch) x (Initial Rotation)
Thus, any linear map can be viewed as two rotations and one pure stretch.
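We can see the factorization in action with NumPy's np.linalg.svd. Since the original A appears only in a figure, the sketch below uses an assumed stand-in 3 × 2 matrix chosen so that AᵀA = [[2, 1], [1, 2]], matching the singular values √3 and 1 computed above:

```python
import numpy as np

# Assumed stand-in for the 3x2 matrix A of the example; it satisfies
# A^T A = [[2, 1], [1, 2]], so its singular values are sqrt(3) and 1.
A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])

# Full SVD: U is 3x3 orthogonal, Vt is 2x2 orthogonal, and s holds the
# singular values in descending order.
U, s, Vt = np.linalg.svd(A)
print(s)   # [1.732..., 1.0], i.e. [sqrt(3), 1]

# Rebuild Sigma with the same 3x2 shape as A and verify A = U Sigma V^T.
Sigma = np.zeros((3, 2))
Sigma[:2, :2] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, A))   # True
```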
Geometrical Interpretation
Consider the matrix A which we introduced in the beginning. It is a transformation from ℝ² to ℝ³. Imagine a unit circle in the input space of matrix A, ℝ².
Matrix V tells us which two perpendicular directions of the original circle are going to be chosen for stretching. Based on our calculations, they were [ 1 1 ]ᵀ and [ 1 -1 ]ᵀ (normalized).
Matrix Σ tells us how much each axis is going to be stretched. Based on our calculation, the first and second directions are stretched by σ₁ = √3 ≈ 1.73 and σ₂ = 1. The result is an ellipse in this case.
Matrix U will embed the resulting ellipse in ℝ³. The first and second columns of U give the directions of the long and short axes of the ellipse, while the third column of U represents the direction that receives nothing from the input; it is normal to the plane on which the ellipse lies.
As is evident, the core of the transformation is encoded in the matrix Σ. It clearly shows how strongly the system acts along each direction. There are situations in which we may want to ignore directions with small singular values (to reduce the noise in the system, for example). In such cases we may use the truncated SVD, or TSVD.
Also notice that shrunk axis: this direction is precisely the left null space of A.
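The circle-to-ellipse picture can be verified numerically: pushing sample points of the unit circle through a stand-in matrix with singular values √3 and 1 (an assumption, since the original A appears only in a figure) produces an ellipse whose longest and shortest radii equal σ₁ and σ₂:

```python
import numpy as np

# Assumed stand-in for A (singular values sqrt(3) and 1, as in the text).
A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])

# Sample the unit circle in R^2 ...
t = np.linspace(0.0, 2.0 * np.pi, 2000)
circle = np.vstack([np.cos(t), np.sin(t)])   # shape (2, 2000)

# ... and push it through A: the image is an ellipse living in R^3.
image = A @ circle                           # shape (3, 2000)
radii = np.linalg.norm(image, axis=0)

print(radii.max())   # ~ sigma_1 = sqrt(3)
print(radii.min())   # ~ sigma_2 = 1
```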
Solving Ax = b with SVD
SVD decouples the system Ax = b into a diagonal one. Substituting the factorization of matrix A, we have:

U Σ Vᵀ x = b
Since U is orthogonal (Uᵀ = U⁻¹), multiplying both sides by Uᵀ gives:

Σ Vᵀ x = Uᵀ b
Look at the right-hand side. Geometrically, it rotates b into the principal output coordinate system; let’s call the result vector c = Uᵀb. In the same way, Vᵀx rotates x into the principal input coordinates; let’s call it vector y = Vᵀx. Therefore, we have:

Σ y = c
This is the key simplification. We know that Σ is a diagonal matrix, so the system reduces to a diagonal one that can be easily solved row by row. Let cᵢ and yᵢ be the elements of vectors c and y respectively. As long as σᵢ ≠ 0, we have:

yᵢ = cᵢ / σᵢ
Note that σᵢ = 0 means there is no information in that direction.
To get the solution x, we transform back to the original coordinates. Having both matrix V and vector y, it is easily obtained:

x = V y
Done. The problem is solved.
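The three steps can be sketched in a few lines of NumPy, using an assumed stand-in for A (the original appears only in a figure) and an arbitrary right-hand side b. For an overdetermined full-rank system, this procedure yields exactly the least-squares solution:

```python
import numpy as np

# Assumed stand-in for A (consistent with the worked example) and an
# arbitrary right-hand side b.
A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
b = np.array([1.0, 2.0, 3.0])

# Thin SVD: U is 3x2, so Sigma is square and every sigma_i is nonzero here.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

c = U.T @ b    # rotate b into the principal output coordinates
y = c / s      # solve the diagonal system row by row
x = Vt.T @ y   # rotate back to the original coordinates: x = V y

print(x)
# Same answer as NumPy's least-squares solver:
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True
```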
The Pseudo-Inverse (A⁺)
To make things more beautiful and compact, we define the Moore–Penrose pseudo-inverse. By combining all three previous steps, we define a new matrix:

A⁺ = V Σ⁺ Uᵀ
The matrix A⁺ is called the pseudo-inverse of matrix A. This is just an algebraic compression; there is no new concept. But since it can be defined purely in terms of matrix A, it is widely used in algorithms. In the above equation, Σ⁺ inverts the nonzero singular values and keeps the zeros at zero:

Σ⁺ = [ 1/√3 0 0 ]
     [ 0    1 0 ]
This reproduces exactly the solution we derived step by step. Conceptually, the pseudo-inverse is the closest we can get to the inverse of a matrix when the null spaces are not trivial. Let’s explain it clearly, because it gives us the full picture.
Algebraic Interpretation of A⁺
As we previously discussed, a matrix is a mapping from an input space (split into the row space and the null space) to an output space (split into the column space and the left null space):

Algebraically,
Matrix A maps vectors in the row space of A to the column space of A and annihilates components in the null space of A (maps them to zero).
Matrix A⁻¹ maps vectors in the column space of A to the row space of A only when both the null space and left null space of A are trivial (when matrix A is invertible).
Matrix A⁺ maps vectors in the column space of A to the row space of A and annihilates components in the left null space of A (maps them to zero).
So, the whole idea is to ignore those vectors that sit in the left null space or null space (directions that contain no information), and to restrict our analysis to the meaningful interaction between the row space and column space. SVD provides a precise and systematic way to separate these four fundamental subspaces and to isolate the invertible part of the transformation.
For vectors that lie in the column space of A and the row space of A, the pseudo-inverse A⁺ behaves exactly like a true inverse. More formally, it satisfies the projection properties

A⁺A = projection onto the row space of A
AA⁺ = projection onto the column space of A

Thus, the pseudo-inverse inverts the action of A on its informative subspaces while ignoring directions that cannot be recovered.
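A minimal numerical check of these projection properties, using np.linalg.pinv and a deliberately rank-deficient matrix (chosen here purely for illustration) so that both null spaces are nontrivial:

```python
import numpy as np

# An illustrative rank-deficient 3x2 matrix (rank 1): the second column
# is twice the first, so both null spaces are nontrivial.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [0.0, 0.0]])

A_pinv = np.linalg.pinv(A)   # Moore-Penrose pseudo-inverse (computed via SVD)

P_row = A_pinv @ A   # should be the projection onto the row space of A
P_col = A @ A_pinv   # should be the projection onto the column space of A

# Projections are symmetric and idempotent.
print(np.allclose(P_row, P_row.T), np.allclose(P_row @ P_row, P_row))
print(np.allclose(P_col, P_col.T), np.allclose(P_col @ P_col, P_col))

# On the row space, A_pinv undoes A exactly: [1, 2] spans the row space.
v = np.array([1.0, 2.0])
print(np.allclose(A_pinv @ (A @ v), v))   # True
```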
As we discussed, Singular Value Decomposition represents the most general and powerful lens through which a linear system can be understood. Almost every modern engineering discipline relies on SVD to analyze and solve linear systems. Consequently, understanding SVD is not merely a mathematical luxury but a practical necessity. So, for any system of equations Ax = b, the best possible solution, the one that honors every bit of information contained in A while gracefully ignoring the contradictions of noise, is given by

x = A⁺ b
This is the ultimate ceiling of what linear algebra can provide.