Background to Solving Ill-Conditioned Linear Systems

 

We are interested in solving linear, or matrix, systems of arbitrary shape and size,

(1)    A * x = b

where the solution obtained from a naïve solution method is badly behaved.

This bad behavior can happen for various reasons, such as duplicated data, but the interesting case is when the model matrix, A, has a poor condition number (i.e., some of the singular values of A drop to small values) and the right-hand side has substantial observation error. This situation comes up commonly in “inverse problems”. Inverse problems are those in which we are trying to “back up” a natural process to an earlier time, or across some distance in space, or both. For example, in solving the “inverse heat problem” we are trying to determine what the heat map of some device was at an earlier time.

For the following example, the matrix, A, will contain a simple heat diffusion “model” (details are not important) of size 13 rows by 17 columns, and xt will contain the “true” original heat profile, which will be a simple step function. Then

(2)    bt = A*xt  will contain the ideal, or “true”, observations, and

(3)    b = bt + e will be the more realistic observations with random error added.

Then, when (1) is solved for x, we hope it will be usefully close to the original, xt.
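For readers who want to follow along, here is a minimal NumPy sketch of this setup. The exact heat-diffusion matrix used above is not reproduced here, so a generic 13-by-17 Gaussian smoothing matrix stands in for A, and the noise level (1e-4) is simply chosen to mimic the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 13 x 17 heat-diffusion model matrix (the exact model used
# above is not reproduced here): a row-normalized Gaussian blurring kernel,
# a typical smoothing, ill-conditioned operator.
m, n = 13, 17
rows = np.arange(m)[:, None] * (n - 1) / (m - 1)   # map the 13 row positions onto the 17-column scale
cols = np.arange(n)[None, :]
A = np.exp(-0.5 * ((rows - cols) / 1.5) ** 2)
A /= A.sum(axis=1, keepdims=True)

xt = np.where(np.arange(n) < n // 2, 0.0, 1.0)     # the "true" profile: a simple step function

bt = A @ xt                                        # (2) the ideal, or "true", observations
b = bt + 1e-4 * rng.standard_normal(m)             # (3) realistic observations with random error added
```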

Here is a graph of the ideal, or true xt:

[Figure: the ideal, or true, solution xt]

And here is a graph of b, which contains realistic error.

[Figure: the observed right-hand side b, with realistic error]

Here is an example solution to (1), using the pseudoinverse of A:  x = A+ * b.

Since that is a horrible, useless result, we will apply our methods to the solution process. We now explain our method.
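Continuing the sketch above, the naive pseudoinverse solution is one line of NumPy:

```python
# The naive pseudoinverse solution x = A+ * b.  For an ill-conditioned A and a
# noisy b this is typically wildly oscillatory and bears little resemblance to xt.
x_pinv = np.linalg.pinv(A) @ b
```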

The natural way of solving any hard linear system is to start with the widely available Singular Value Decomposition:

(4)    A = U*S*VT

where U*UT = I, UT*U = I, V*VT = I, VT*V = I, and S is a diagonal matrix of decreasing “singular values”. We will ignore the minor aggravation of whether U and V are the full, square matrices of theory or a reduced shape that saves space by trimming U and V when there are singular values which are zero. Those matters are only of concern to the code developer.

We solve the system (1) as follows:

(5)    A * x = b

(6)    U * S * VT * x = b

(7)    S * VT * x = UT * b

(8)    VT * x = S+ * UT * b          = Picard Condition Vector

(9)    x = V * S+ * UT * b

Here S+ is the pseudo-inverse of S, meaning that zero or negligible values of S[i] are treated instead as being infinity (∞), so that their reciprocals in S+ are zero.
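Continuing the sketch above, steps (4) through (9) look like this in NumPy; the tolerance used to decide which S[i] are “negligible” is an illustrative choice, not autosol's:

```python
# (4) A = U * S * V^T; full_matrices=False gives the reduced shapes mentioned above.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

beta = U.T @ b                   # U^T * b: the right-hand side rotated into the SVD basis

# S+: reciprocals of the singular values, with zero or negligible values treated
# as infinite so that their reciprocals become zero (the 1e-12 relative tolerance
# is a guess, not autosol's rule).
s_plus = np.divide(1.0, s, out=np.zeros_like(s), where=s > 1e-12 * s[0])

pcv = s_plus * beta              # (8) the Picard Condition Vector, V^T * x
x = Vt.T @ pcv                   # (9) x = V * S+ * U^T * b, the pseudoinverse solution
```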

This “pseudoinverse” approach only helps when the difficulty is the presence of one or more “simple” singularities due to such problems as duplicated rows. Sometimes that is enough. In statistical contexts the matrix V * S+ * UT is referred to as the “pseudo-inverse of A”.

The next step in trying to solve (1) is to set MORE elements of S to zero (or ∞, whichever way you want to think of it). For example, if the average accuracy of the elements of A is, say, 4 significant digits, then one might zero all the singular values at the end of S that add together to less than 0.0001 times the average element of A, or 0.0001 times S[1]. But this doesn’t really help in general, because the problem is usually with the error in b, not the error in A.
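As a sketch of that truncation idea, continuing the code above (the 1e-4 cutoff simply mirrors the “4 significant digits” example; choosing it well is the hard part, as discussed below):

```python
# Zeroing MORE of S: a truncated SVD, i.e. a generalized inverse with a cutoff.
tol = 1e-4 * s[0]
s_cut = np.divide(1.0, s, out=np.zeros_like(s), where=s > tol)
x_tsvd = Vt.T @ (s_cut * beta)
```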

For our purposes zero singular values are not a difficulty…  we always treat them as the pseudoinverse does. The more important case is SMALL singular values.

Papers have been written for over thirty years on how to automatically resolve the problem of excessively small singular values, or “ill-conditioning”, in linear equation systems. These problems are characterized by wildly unacceptable solutions, especially when the solution is expected to be non-negative. And even irrelevantly small changes to the observed data, b, can cause the solution to change wildly yet remain unacceptable.

Our interest in dealing with this ill-conditioning is in the right-hand side of (8), which we have labelled the “PCV”, for Picard Condition Vector. Picard’s condition states that the elements of the PCV must converge toward zero for the system to be solvable stably. When the system (1) is ill-conditioned, that fact will show here by the PCV growing inappropriately in its latter elements. This happens because the elements of UT * b drop to zero more slowly than the singular values, S[i], do. In fact, what happens is that the S[i] march steadily toward zero, as that is the nature of the singular values in these cases, whereas the latter elements of UT * b remain at a significantly high level due to noise in A and/or b. Here is an example PCV for the small inverse heat problem we mentioned.

Obviously there is a real problem here! Next we zoom in to see the first few values:

 

All appears to behave well with the PCV for the first half dozen elements. Then something goes wrong. Below is a spreadsheet of the singular values and the elements of UT * b. The PCV is computed by dividing the elements of UT * b in the second column by the singular values in the first column; that quotient is in the third column. You can see that the singular values are marching nicely to zero. But the UT * b values seem to plateau at random plus and minus values a little over 0.0001 in magnitude. Since the latter values in UT * b should, by Picard, have been two or three orders of magnitude smaller than they are, we can attribute maybe 99% of their magnitude to noise in A and b. In fact, since U (and UT) have all ones for their singular values, the latter elements of UT * b are essentially just orthogonally transformed noise samples of b. So we can take the “latter values” of UT * b to be of the same magnitude as the original error in b.
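Continuing the NumPy sketch above, the three columns of that spreadsheet can be produced like this:

```python
# Print the three columns of the table: S[i], (U^T b)[i], and their quotient,
# the PCV.  (The singular values of this example are all positive, so the
# direct division is safe.)
for i, (si, bi) in enumerate(zip(s, beta), start=1):
    print(f"{i:3d}   S = {si:12.6g}   UTb = {bi:12.4g}   PCV = {bi / si:12.4g}")
```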

[Table: the singular values S[i], the elements of UT * b, and their quotient, the PCV]

NOW… the crux of our situation: if we knew an algorithm to automatically tell WHERE noise takes over in the second column, we could automatically estimate the root-mean-square of the error in b; it is the same as the RMS of these latter values in UT * b. That would be very helpful because it would let us determine how much to regularize the solution process (more on this later). You will notice that the choice of where to start counting the second-column values as error samples is not particularly critical on the high side: we can consider rows 6 to 13 to be error samples; or, with little change, rows 7 to 13; and in this problem we could also use rows 8 to 13, and even further. But often we will not have the luxury of so many rows of this table to work with. Many small systems will have only a couple of error samples, so they are precious and must not be wasted. On the other hand, estimating the usable rows as starting at 4 or 5 instead of 6 or 7 could be catastrophic to the solution, as that choice would unjustifiably increase the RMS of the “error samples”. This would cause us to over-regularize the problem.
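As a sketch, once a usable rank k is known (here read by hand from the table above; the automatic choice is the subject of the next sections), the error estimate is just the RMS of the remaining elements of UT * b:

```python
def estimate_sigma(beta, k):
    """RMS of the presumed pure-noise tail of U^T b, given a usable rank k
    (0-based, so beta[k:] are the rows treated as error samples)."""
    tail = beta[k:]
    return np.sqrt(np.mean(tail ** 2))

# For this example the table suggests treating rows 6..13 (1-based) as error
# samples, i.e. a usable rank of about 5:
sigma = estimate_sigma(beta, k=5)
```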

So you may be asking: if the “usable rank” of this problem is clearly around 5 to 7, why not just try setting all singular values past 5, 6, and 7 to zero and select one of those answers? For two reasons. First, we need an AUTOMATIC process, and choosing between those three automatically is a difficult matter. Second, NONE of those three may be very good. We can do better than that, and automatically. (However, for the record, in this particular example zeroing singular values past rank 5 produces a useful answer almost as good as autosol’s result. This is equivalent to using a generalized inverse with a singular value cutoff of 0.004. But using a cutoff of 0.005 would produce a bad solution.)

Before talking about our algorithm for determining “usable rank”, I want to emphasize that the above example is unusually clear. The PCV can vary drastically. I have seen it stay roughly level for a few values and then steadily rise. It can oscillate between near zero and very large positive and negative values. It is very unpredictable. So our methods must be aware of the complexity and pitfalls.

We have developed for autosol two heuristic algorithms for choosing the point in the PCV at which noise starts dominating the elements of UT*b. We have tuned their parameters with many problems over many years. The first algorithm, called splita in the code, uses a pair of weighted moving averages. One lazily follows successive values of the PCV which are larger than its current value. The second lazily follows successive values of the PCV which are smaller than its current value, but also VERY lazily follows larger values of the PCV. When the first average becomes substantially larger than the second, autosol declares that point to be the end of any possibly useful data.
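The weights and the trigger used inside autosol are tuned and are not reproduced here; the following is only an illustrative sketch of the idea, with guessed parameters:

```python
def splita_sketch(pcv, w_up=0.3, w_dn=0.3, w_lazy=0.05, ratio=5.0):
    """Illustrative sketch of splita's pair of weighted moving averages.
    The weights and the trigger ratio are guesses, not autosol's tuned values."""
    v = np.abs(np.asarray(pcv, dtype=float))
    upper = lower = v[0]
    for i, vi in enumerate(v[1:], start=1):
        if vi > upper:                 # first average: lazily follows larger PCV values
            upper += w_up * (vi - upper)
        if vi < lower:                 # second average: lazily follows smaller values...
            lower += w_dn * (vi - lower)
        else:                          # ...and VERY lazily follows larger ones
            lower += w_lazy * (vi - lower)
        if upper > ratio * max(lower, np.finfo(float).tiny):
            return i                   # end of any possibly useful data
    return len(v)

limit = splita_sketch(pcv)             # upper limit on the possibly useful rank
```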

The second algorithm, called splitb in the code, re-analyses the PCV data, up to the upper rank limit determined by splita, using a fixed width, unweighted average of the SQUARES of the elements of the PCV. This allows a closer look at when the values get too large. The fixed width that is chosen depends on problem size. We look carefully at these sums and stop when one successive moving sum is substantially larger than the previous sum. This result from splitb becomes what we call the “usable rank”, and values beyond that in UT*b are taken to be samples of the error in the original RHS, b.
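Again only as an illustrative sketch, with a guessed window width and trigger factor rather than autosol's tuned values:

```python
def splitb_sketch(pcv, limit, width=3, factor=10.0):
    """Illustrative sketch of splitb: fixed-width moving sums of the SQUARED
    PCV elements, up to the limit from splita, stopping when one sum is much
    larger than the previous one.  The width and factor are guesses."""
    v = np.asarray(pcv[:limit], dtype=float) ** 2
    prev = None
    for start in range(len(v) - width + 1):
        cur = v[start:start + width].sum()
        if prev is not None and cur > factor * prev:
            return start + width - 1   # usable rank (the index bookkeeping is illustrative)
        prev = cur
    return limit

k = splitb_sketch(pcv, limit)          # the "usable rank"; beta[k:] are then the error samples in b
```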

So how do we use these noise values to find a solution? Simply: the RMS (root-mean-square) of them becomes the parameter SIGMA that we use to proceed with the solution. This process of “Tikhonov regularization” can be presented several ways, but this straightforward method seems adequate:

To the original A*x=b, we add equations asking the x values to be zero. That is, we add

(10)    xi = 0.0,   for all indices i

But we weight these extra equations by a small value which is usually called lambda, or λ.

(11)    λ * xi = 0.0,   for all indices i

We write this enhanced system in matrix form like this:

(12)    A * x = b

(13)    λ * I * x = 0               (I is an appropriately sized identity matrix)

If λ is zero, there is no change to x. But the larger λ is, the more we will force the computed vector, x, away from its nominal pseudoinverse result and toward a solution of all zeros. We choose which x we want with the “discrepancy method”: that is, we search for a value of λ for which the resulting solution, x(λ), satisfies the following (where x(0) is the pseudoinverse solution, in which λ plays no part):

(14)    RMS( x(λ) - x(0) ) = SIGMA

That is, we, or rather autosol, ask that the new, auto-regularized solution deviate from the unregularized solution by exactly SIGMA on average (in the RMS sense).
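Continuing the NumPy sketch, the regularized solution and the discrepancy search might look like this; the simple bisection is a stand-in for autosol's actual search, and SIGMA is the estimate from the earlier sketch:

```python
def x_of_lambda(lam):
    # Least-squares solution of the stacked system (12)-(13), written via the
    # SVD: the filter factors s / (s^2 + lam^2) replace the reciprocals 1/s.
    return Vt.T @ ((s / (s ** 2 + lam ** 2)) * beta)

def rms(v):
    return np.sqrt(np.mean(v ** 2))

x0 = Vt.T @ (s_plus * beta)            # x(0): the pseudoinverse solution

# Discrepancy search (14): RMS(x(lambda) - x0) grows monotonically with lambda,
# so a simple bisection finds the lambda at which it equals SIGMA.
lo, hi = 0.0, s[0]
while rms(x_of_lambda(hi) - x0) < sigma and hi < 1e12 * s[0]:
    hi *= 2.0                          # expand the bracket if needed
for _ in range(80):
    mid = 0.5 * (lo + hi)
    if rms(x_of_lambda(mid) - x0) < sigma:
        lo = mid
    else:
        hi = mid
x_reg = x_of_lambda(0.5 * (lo + hi))   # the auto-regularized solution
```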

And that does it. The “Gordian knot” has been untied. This is the resulting auto-regularized solution for the inverse heat problem above:

 

[Figure: the auto-regularized solution for the inverse heat problem]

Further, if the user so asks, we append a non-negativity algorithm to the above solution process. This is similar to the approach in the classic book, “Solving Least Squares Problems” by Lawson & Hanson. Basically, we pick a negative result and zero its column. Then we re-solve. That value will now be zero, and the solution will have adapted to the change. Then we pick another to zero out. The question is, in what ORDER should these be picked? The original algorithm has some difficulties and has been modified in many ways. Our choice is to force the most negative xi to become zero. We iterate on this until there are no non-roundoff negative solution values. This is the result with the non-negative option:

[Figure: the auto-regularized solution with the non-negativity option]

Note that the negative values didn’t just disappear: the solution adjusted to better represent the flat region in the first 10 samples. This may not look like a great solution, but in most applications involving these ill-conditioned systems, this is a fabulous automatic solution.
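Here is an illustrative sketch of that most-negative-first loop; the `solve` argument is a placeholder for the auto-regularized solver described above, not autosol's actual interface:

```python
def nonneg_sketch(A, b, solve):
    """Sketch of the most-negative-first loop: repeatedly zero the column of
    the most negative solution element and re-solve, until no negative values
    remain beyond round-off.  `solve(A, b)` is a placeholder, not autosol's API."""
    A = A.copy()
    x = solve(A, b)
    tol = 1e-8 * max(1.0, float(np.max(np.abs(x))))   # round-off threshold
    while np.min(x) < -tol:
        j = int(np.argmin(x))          # the most negative element
        A[:, j] = 0.0                  # remove its column from the model
        x = solve(A, b)                # re-solve; the rest of x adapts
        x[j] = 0.0                     # pin the removed unknown to zero
    return x

# Example call, with the plain pseudoinverse standing in for the regularized solver:
x_nn = nonneg_sketch(A, b, lambda A_, b_: np.linalg.pinv(A_) @ b_)
```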

Finally, some of our packages include a further routine which gives the user more control over the shape of the solution. In the Python package we include Arlshape(), which allows the user to state three constraints, in any combination:

1.       Non-negativity, or not

2.       Non-decreasing (that is, increasing or flat between every successive pair of solution values), or Non-increasing (that is, decreasing or flat between every successive pair of solution values), or neither

3.       Concave up (or straight), or Concave down (or straight), or neither

When we ask for Non-negative and Non-increasing, we get the following wonderful solution, in red, overlaying the ideal solution in blue.
